[bitc-dev] bug in character decode routines
Jonathan S. Shapiro
shap at eros-os.org
Thu May 18 00:37:40 EDT 2006
David:
Thank you for a careful list. Just FYI, here is what we currently do on
these points.
Swaroop: see the bit about emitReadableChar at the end. My printf code
was completely boogered.
On Thu, 2006-05-18 at 01:29 +0100, David Hopwood wrote:
> - I assume it is intended that a BitC source file is always encoded as
> UTF-8 (although the spec doesn't currently say so). These files should
> be opened in binary mode, avoiding any unwanted conversions.
Actually, the specification *does* say so. See the second paragraph of
section 2 (Input Processing). I believe that we do currently open the
file in binary mode, as I remember noting this potential error at some
point. Hmm. Actually, the mode doesn't matter, because LF and CR are not valid
within strings or character literals, and in all other cases they are
merely white space.
> - the generated C code should be *portable* C89. It might be reasonable to
> make some assumptions about the C platform that will execute this code
> (platform S) that go beyond what is guaranteed by the C89 standard (for
> example, we may assume CHAR_BIT == 8), but any such assumptions should
> be documented. This means that we should only use characters from the
> C89 "portable character set":
>
> A-Z a-z 0-9
> ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
>
> i.e. U+0020..007E excluding U+0024 ($), U+0040 (@) and U+0060 (`).
I believe that you mean C99, but otherwise I agree. We already mangle
all of the BitC identifier characters that are not legal C identifier
characters, including all UCS characters whose code points are > 127.
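
As an illustration only, a mangler along these lines might look like the
sketch below. The function names and the "_HEX_" escape format are
hypothetical and are not the scheme the BitC compiler actually uses. The
sketch assumes the identifier is already decoded to UCS4 code points and
that the compiler's own host uses an ASCII-compatible character set. It
ignores leading digits and reserved C names, which a real scheme would
also have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ASCII letters and digits pass through unchanged. Underscore is
 * deliberately excluded so that it can serve as the escape marker. */
static int is_c_alnum(uint32_t c)
{
    return (c >= 0x41 && c <= 0x5A) ||   /* A-Z */
           (c >= 0x61 && c <= 0x7A) ||   /* a-z */
           (c >= 0x30 && c <= 0x39);     /* 0-9 */
}

/* Hypothetical mangler: writes a NUL-terminated, 7-bit ASCII name to
 * out. Every code point that is not an ASCII letter or digit becomes
 * "_HEX_". Returns the length written, or -1 if out is too small. */
static int mangle_ident(const uint32_t *in, size_t len, char *out, size_t n)
{
    size_t used = 0;
    size_t i;

    assert(in != NULL && out != NULL && n > 0);
    out[0] = '\0';
    for (i = 0; i < len; i++) {
        int w;
        if (is_c_alnum(in[i]))
            w = snprintf(out + used, n - used, "%c", (int)in[i]);
        else
            w = snprintf(out + used, n - used, "_%lX_",
                         (unsigned long)in[i]);
        if (w < 0 || (size_t)w >= n - used)
            return -1;                   /* output would be truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```

Under this scheme the four code points of a->b would come out as
a_2D__3E_b, which uses only letters, digits and underscore.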
> - the BitC compiler seems to be writing the C source file on stdout,
> which is opened in text mode using the encoding of Q. This works if
> and only if Q and R use encodings that are identical for characters
> in the portable character set. I think this is a reasonable assumption,
> but it should be documented.
We only write stdout in the absence of other specification, and it is
intended that this behavior will go away. It is a testing convenience.
In any case, this works universally provided the compiler output is
restricted to 7-bit ASCII. I'm not aware of *any* character encoding
that does not retain 7-bit ASCII in the low positions.
There *is* a problem here if the output character set is a multi-byte
character set such as Shift-JIS. This is why we do *not* output using
the wide character routines.
Actually, I encountered significant frustration here. There is no
portable way from within a C program to determine whether the current
wide character locale is a UTF-8 locale, which is exceptionally
irritating. This prompted me to avoid the stdio support for wide
characters altogether.
> - we can use either '\n' or '\x0A' to output newlines. Using '\n'
> assumes that platform R accepts platform Q's newline encoding; using
> '\x0A' assumes that platform R accepts the Unix (LF) newline encoding.
> It probably doesn't matter, but since we are already assuming that
> Q and R have compatible encodings for portable characters, '\n' may
> be slightly better.
Because newline is not legal within literals, I think that it actually
doesn't matter. The C compiler accepts \r and \n as whitespace. The only
damage that might ensue from getting this wrong is mishandled line
numbers, which is not critical for the bootstrap compiler. In any case
it is easily corrected by dos2unix or equivalent if desired.
> - a BitC program uses UTF-8 as its execution character encoding. If we
> are including "portable" characters directly in character and string
> literals, then the codes for these characters in the encoding of S
> (the C "execution character set") must be the same as in UTF-8 (i.e.
> the same as in US-ASCII, since all characters in the portable set
> are in US-ASCII). This assumption should also be documented.
Actually, this is not the case. BitC does not make any statement about
execution character encoding (intentionally). The current implementation
is that characters are UCS4, and strings are vectors of UCS4. Future
implementations will probably use a rope-like implementation of strings,
but that isn't required and I certainly didn't want to commit the
implementer to this.
UTF-8 is used by the (forthcoming) standard library as the *external*
representation, but there is nothing to stop a user from implementing an
alternative library.
However, the minute you start talking about going to C you aren't
talking about the normal externalization problem. You are now talking
specifically about the output encoding requirements of the compiler. I
agree with your statement of those requirements, and (I think) we are
already doing what you say.
> So, if 'c' is intended to hold a Unicode scalar value, I would write the
> code in question as:
>
> if (isPortableLiteralChar(c)) printf("'%c'", c);
> else printf("%d", c);
>
> bool isPortableLiteralChar(wint_t c) {
> return c >= 0x20 && c <= 0x7E /* printable ASCII */
> && c != 0x22 /* not " */
> && c != 0x24 /* not $ */
> && c != 0x27 /* not ' */
> && c != 0x40 /* not @ */
> && c != 0x5C /* not \ */
> && c != 0x60; /* not ` */
> }
Um, wait. The characters '"', '$', '@', and '`' are no problem to emit
to C. The characters '\\' and '\'' are also okay, but need to be
escaped.
But yes, the code I wrote for Swaroop was wrong. Swaroop: what we need
to do here is use emitReadableChar(c) rather than printf.
emitReadableChar should do the proper encoding. I think there may be a
function hiding somewhere that already knows how to do this encoding. If
not, you may need to adapt the one used for strings.
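
For concreteness, here is a rough sketch of the encoding I have in mind.
It is not the existing function: the name format_readable_char, the
buffer-based signature, and the decimal fallback for everything outside
printable ASCII are assumptions made for the example. It also assumes the
compiler's own host character set is ASCII-compatible, so that "%c" with
a code point in the range 0x20..0x7E yields the expected character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format the UCS4 code point c into buf as a C
 * expression that denotes the same value. Returns the number of
 * characters written, not counting the terminating NUL. */
static int format_readable_char(uint32_t c, char *buf, size_t n)
{
    /* 16 bytes is room for the longest output (10 decimal digits). */
    assert(buf != NULL && n >= 16);

    if (c == 0x27 || c == 0x5C)      /* ' and \ need a backslash escape */
        return snprintf(buf, n, "'\\%c'", (int)c);
    if (c >= 0x20 && c <= 0x7E)      /* other printable ASCII, as-is */
        return snprintf(buf, n, "'%c'", (int)c);
    /* Control characters and everything above 0x7E: numeric value. */
    return snprintf(buf, n, "%lu", (unsigned long)c);
}
```

With this, '"', '$', '@' and '`' are emitted as ordinary character
literals, a backslash and a single quote are emitted in escaped form, and
anything else is emitted as its decimal code point.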
>
> This also happens to be correct if 'c' is a UTF-8 code unit rather than
> a Unicode scalar value, although in that case I would say
> 'bool isPortableLiteralCodeUnit(uint8_t c)'.
'c' is never a UTF-8 code unit. It *may* be a UCS4 code point, but that is
another matter entirely.