[bitc-dev] bug in character decode routines

Jonathan S. Shapiro shap at eros-os.org
Thu May 18 16:22:12 EDT 2006


On Thu, 2006-05-18 at 19:51 +0100, David Hopwood wrote:
> Jonathan S. Shapiro wrote:

> > Actually, the specification *does* say so. See the second paragraph of
> > section 2 (Input Processing).
> 
> No, I checked that. It says Unicode, not UTF-8.

Thank you. I have corrected this. Should push to the web site by 17:30,
GMT-5.

> This is correct as far as it goes, but I think what also needs to be said
> is that *when the input compilation unit is provided as a file*, it is
> encoded as UTF-8 (also as defined by Unicode 4.1.0).

At the moment, no other form of input unit is anticipated. You are
clearly imagining something else, and I would be very interested to know
what it is, in order to understand better what change (if any) is
appropriate.

In particular, we do NOT anticipate incorporating EVAL into the
language.

I can see that other compilation environments might store a program as
an AST and might compile from that. I certainly have no problem with
this, but strictly speaking that isn't an input unit of compilation.

> Incidentally, AFAICS all BitC keywords, significant punctuation, and escapes
> fall within US-ASCII, not just ISO-Latin-1. Also it is "Normalization Form C",
> not "Normalization C".

Thanks, also corrected.

> I assume it is intentional that characters that are unassigned as of
> Unicode 4.1.0 cannot be used in this version of BitC?

Yes. There is a downward compatibility problem if this is permitted.

> > I believe that we do currently open the
> > file in binary mode, as I remember noting this potential error at some
> > point. Hmm. Actually, it doesn't matter, because LF and CR are not valid
> > within strings or character literals, and in all other cases they are
> > merely white space.
> 
> Strictly speaking, the text mode translation might not only affect LF and CR.

This is true for wide character input routines (which we do not use),
but I'm not aware of any other normalization that is done when files are
accessed using the traditional stdio procedures.

We should certainly be using binary I/O in any case (and I just
confirmed that we do), but what am I forgetting here?
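
For reference, a minimal sketch of reading a compilation unit through
stdio in binary mode. The helper name read_unit_bytes and its signature
are hypothetical and are not taken from the BitC sources. Besides CR/LF
translation, some C runtimes (notably DOS-derived ones) have
historically treated a Ctrl-Z byte (0x1A) as end of file in text mode,
which binary mode also avoids.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read up to `cap` bytes of a source file exactly
 * as stored on disk. The "rb" mode asks stdio not to apply any text-mode
 * translation (for example CR LF -> LF). Returns the number of bytes
 * read, or 0 if the file cannot be opened. */
size_t read_unit_bytes(const char *path, unsigned char *buf, size_t cap)
{
    FILE *f;
    size_t n;

    assert(path != NULL && buf != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    n = fread(buf, 1, cap, f);
    fclose(f);
    return n;
}
```

With this, a CR LF pair or a 0x1A byte in the file reaches the lexer
unchanged, so the lexer alone decides what counts as white space.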

> >> - the generated C code should be *portable* C89. It might be reasonable to
> >>   make some assumptions about the C platform that will execute this code
> >>   (platform S) that go beyond what is guaranteed by the C89 standard (for
> >>   example, we may assume CHAR_BIT == 8), but any such assumptions should
> >>   be documented. This means that we should only use characters from the
> >>   C89 "portable character set":
> >>
> >>     A-Z a-z 0-9
> >>     ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
> >>
> >>   i.e. U+0020..007E excluding U+0024 ($), U+0040 (@) and U+0060 (`).
> > 
> > I believe that you mean C99, but otherwise I agree.
> 
> I meant C89, because there are very few complete C99 compilers, and even
> fewer that I would use when targeting an embedded platform. In the BitC
> compiler itself, it's fine to use the subset of C99 features that are
> widely implemented in compilers for desktop systems, but I think the
> generated code should be C89.
> 
> The extra features in C99 are not all that useful for generated code,
> anyway, unless I've missed something.

Hmm. One feature of C99 that we may be using is initializers that follow
statements. I'd have to look, and I don't consider it urgent to fix
this -- the entire C-based bringup strategy was probably, in hindsight,
a mistake.
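
The C99 feature most likely meant here is the mixing of declarations
and statements within a block. As a rough illustration (the function
sum_to is made up for the example and is not BitC output), generated
code can stay within C89 by hoisting declarations or by opening a
nested block:

```c
#include <assert.h>

/* C99 accepts a declaration after a statement:
 *
 *     total = 0;
 *     int i = 1;        (rejected by a strict C89 compiler)
 *
 * The C89-compatible form puts every declaration at the start of a
 * block, and opens a nested block when a new variable is needed later. */
int sum_to(int n)
{
    int total;            /* all declarations first ... */
    int i;

    assert(n >= 0);
    total = 0;            /* ... then statements */
    for (i = 1; i <= n; i++)
        total += i;

    {                     /* nested block: a fresh declaration is legal */
        int result = total;
        return result;
    }
}
```

A code generator can apply the nested-block form mechanically, which
is why this particular C99 feature is easy to do without.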

> > I'm not aware of *any* character encoding
> > that does not retain 7-bit ASCII in the low positions.
> 
> EBCDIC variants, some obsolete national variants of ASCII, and sometimes
> Shift-JIS.

I am comfortable with not supporting platforms that cannot handle 7-bit
ASCII. I agree that we should specify this as the output of the
bootstrap compiler, but that is not a matter for the language
specification. It is a matter for the compiler specification. It seems
to me, however, that this is already implied once we state that the
output is C-89 or C-99. Do I misrecall the input unit of compilation
requirements for C?

> >> - a BitC program uses UTF-8 as its execution character encoding. If we
> >>   are including "portable" characters directly in character and string
> >>   literals, then the codes for these characters in the encoding of S
> >>   (the C "execution character set") must be the same as in UTF-8 (i.e.
> >>   the same as in US-ASCII, since all characters in the portable set
> >>   are in US-ASCII). This assumption should also be documented.
> > 
> > Actually, this is not the case. BitC does not make any statement about
> > execution character encoding (intentionally). The current implementation
> > is that characters are UCS4, and strings are vectors of UCS4.
> 
> I'm a bit confused: how can the current compiler use C strings to represent
> BitC strings in that case?

It only does this for literal initialization. IIRC, the emitted code
includes a call to a helper routine that copies these strings into the
BitC heap.

As a practical matter, this will probably need to change, because this
emission decision was made before I decided to use UCS4 strings in the
bootstrap runtime. This may be another case where a runtime
representation decision was not fully carried through into the code
generator.

Swaroop: can you double check whether my recollection of the current
initialization process is right?
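
To make the idea concrete, here is a sketch of what such a helper could
look like if the literal bytes are UTF-8 and the runtime string is a
vector of UCS4 values. Everything in it is an assumption for
illustration: the name ucs4_from_utf8, the ucs4_t typedef, and the use
of malloc as a stand-in for the BitC heap. It is not the helper that
the current compiler emits calls to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long ucs4_t;   /* at least 32 bits, even in C89 */

/* Hypothetical helper: decode `len` bytes of UTF-8 into a newly
 * allocated UCS4 vector (malloc stands in for the BitC heap).
 * Returns NULL on allocation failure or malformed input; on success
 * *out_len receives the number of code points. */
ucs4_t *ucs4_from_utf8(const unsigned char *s, size_t len, size_t *out_len)
{
    /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes. */
    static const ucs4_t min_cp[4] = { 0x0UL, 0x80UL, 0x800UL, 0x10000UL };
    ucs4_t *v;
    size_t i = 0, n = 0;

    assert(s != NULL && out_len != NULL);
    /* Upper bound: at most one code point per input byte. */
    v = (ucs4_t *) malloc((len + 1) * sizeof *v);
    if (v == NULL)
        return NULL;

    while (i < len) {
        unsigned char b = s[i++];
        ucs4_t cp;
        int extra, k;

        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else { free(v); return NULL; }   /* stray or invalid lead byte */

        for (k = 0; k < extra; k++) {
            if (i >= len || (s[i] & 0xC0) != 0x80) { free(v); return NULL; }
            cp = (cp << 6) | (ucs4_t) (s[i++] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFFUL ||
            (cp >= 0xD800UL && cp <= 0xDFFFUL)) { free(v); return NULL; }
        v[n++] = cp;
    }
    *out_len = n;
    return v;
}
```

The emitted C would then carry the literal purely as bytes and leave
all interpretation to the helper, so nothing depends on the C
compiler's own notion of wide characters.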

> (It could use wide C strings and assume defined(__STDC_ISO_10646__) and
> sizeof(wchar_t)*CHAR_BIT == 32, but that assumption doesn't hold for
> many important target platforms.)

Precisely. At the moment I believe that we are using those string
encodings purely as byte vectors.
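
For completeness, the assumption David describes can be tested by the
generated code itself. This is only a sketch, and the function name
wchar_is_ucs4 is invented for the example:

```c
#include <limits.h>
#include <stddef.h>

/* Returns 1 only if the implementation promises ISO 10646 values in
 * wchar_t AND wchar_t is exactly 32 bits wide; otherwise returns 0.
 * On platforms where this yields 0, wide C string literals cannot be
 * used directly as UCS4 strings. */
int wchar_is_ucs4(void)
{
#if defined(__STDC_ISO_10646__)
    return sizeof(wchar_t) * CHAR_BIT == 32;
#else
    return 0;
#endif
}
```

Because the answer differs between targets, treating the literal as a
plain byte vector avoids depending on it at all.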

> I would also argue that any language supporting Unicode must (also) be able
> to use UTF-8 directly as an internal encoding, with indexing based on UTF-8
> code units. The fourfold expansion of UTF-32 for US-ASCII is not acceptable
> when storing large amounts of mostly-ASCII text.

For a production implementation, I agree in principle, but it does not
follow that UTF-8 is the answer. In practice, I think that the ICU ropes
implementation (or something comparable) is probably the right thing to
do. The problem lies in the existence of STRING-SET! and the need for
near-constant-time operation there.

For a bringup compiler, I decided that this wasn't a compelling problem,
and I chose to do something that would work to get us moving even though
it was the wrong thing for production use.

A hidden issue here is storage allocation. There are places where we
need to know that a program does not allocate storage dynamically. It
would be unfortunate if this meant that STRING-SET! was prohibited in
such programs. It's a puzzlement what to do about this, and it would be
unfortunate to tie mutability rules to the runtime implementation in
this particular way.
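
The tension between UTF-8 storage and STRING-SET! can be shown with a
little arithmetic. The two functions below are illustrative only and
their names are made up. They compute how many bytes UTF-8 needs for a
code point, and how the byte length of a UTF-8 string would change if
one character were overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes UTF-8 needs for one code point (1 to 4). */
size_t utf8_width(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL)    return 1;
    if (cp < 0x800UL)   return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* Change in byte length if a UTF-8 string replaced old_cp with new_cp
 * at one position. A non-zero result means the tail of the string must
 * move, and a positive result may force the buffer to grow. For a
 * vector of UCS4 values the corresponding change is always 0. */
long utf8_set_delta(unsigned long old_cp, unsigned long new_cp)
{
    return (long) utf8_width(new_cp) - (long) utf8_width(old_cp);
}
```

Overwriting 'A' (U+0041) with the euro sign (U+20AC) gives a delta of
2, so an in-place update of a flat UTF-8 buffer is neither constant
time nor guaranteed to be allocation free. Finding the character at a
given index also needs a scan or an auxiliary index.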

> $, @ and ` are not in the portable character set, which is why I excluded
> them here. Yes, I know that the national variants of ASCII are obsolete,
> but IBM machines that use EBCDIC variants are still in widespread use, and
> that case would "just work" provided that we stick to the portable character
> set, under the assumptions stated above.

Do I misrecollect C-89? I had thought it mandated ASCII input. If not,
then I agree that we need to deal with this, but I would like to confirm
before making this change.
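
Whatever the answer on input encoding turns out to be, the set David
lists is easy to test for by code point. A small sketch follows, with
hypothetical function names. It works on Unicode code point values
rather than on host characters, so it gives the same answer on an
EBCDIC host.

```c
#include <assert.h>
#include <stddef.h>

/* Non-zero if cp is one of the characters from the quoted list:
 * U+0020..U+007E, excluding U+0024 ($), U+0040 (@) and U+0060 (`). */
int is_c89_portable(unsigned long cp)
{
    if (cp < 0x20UL || cp > 0x7EUL)
        return 0;
    return cp != 0x24UL && cp != 0x40UL && cp != 0x60UL;
}

/* Non-zero if every code point in cps[0..n-1] can be emitted as-is. */
int all_c89_portable(const unsigned long *cps, size_t n)
{
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (!is_c89_portable(cps[i]))
            return 0;
    return 1;
}
```

The range check admits 92 code points: the 26 + 26 + 10 + 29 graphic
characters in the list plus the space character.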

> 
> > The characters '\\' and '\'' are also okay, but need to be escaped.
> >
> > But yes, the code I wrote for Swaroop was wrong. Swaroop: what we need
> > to do here is use emitReadableChar(c) rather than printf.
> > emitReadableChar should do the proper encoding. I think there may be a
> > function hiding somewhere that already knows how to do this encoding. If
> > not, you may need to adapt the one used for strings.
> 
> Right. AFAICS there should be no need to handle character literals and
> string literals differently, in terms of which characters can be output
> as-is. (Although ' is valid unescaped in C string literals and " is valid
> unescaped in C character literals, it is simpler to escape both.)

Hadn't considered that, but sounds right.
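
A sketch of the escaping rule being discussed, shared by character and
string literals. This is not the actual emitReadableChar; the name
format_readable_char, the buffer-based signature, and the choice of
octal escapes are all assumptions made for the example. It also
assumes CHAR_BIT == 8 and a compiler host that itself uses ASCII.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical formatter: write the C source spelling of byte `c` into
 * `out` (at least 5 chars) and return its length. The same spelling is
 * valid inside both '...' and "..." literals. */
size_t format_readable_char(unsigned char c, char out[5])
{
    size_t n = 0;

    assert(out != NULL);
    if (c == '\\' || c == '\'' || c == '"' || c == '?') {
        /* Escape both quote kinds so one routine serves both literal
         * forms. '?' is escaped so that adjacent characters can never
         * form a trigraph such as ??/ in the output. */
        out[n++] = '\\';
        out[n++] = (char) c;
    } else if (c >= 0x20 && c <= 0x7E &&
               c != 0x24 && c != 0x40 && c != 0x60) {
        out[n++] = (char) c;          /* portable character: emit as-is */
    } else {
        /* Everything else becomes a three-digit octal escape. Octal
         * escapes stop after three digits, so a following digit in the
         * string cannot be absorbed (a hex escape has no such limit). */
        out[n++] = '\\';
        out[n++] = (char) ('0' + ((c >> 6) & 7));
        out[n++] = (char) ('0' + ((c >> 3) & 7));
        out[n++] = (char) ('0' + (c & 7));
    }
    out[n] = '\0';
    return n;
}
```

For example, the byte 0xE9 comes out as \351 and '$' comes out as
\044, while 'a' is emitted unchanged.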

David: Thank you very much for the care you are taking here. It is
really helpful.

shap


