[bitc-dev] bug in character decode routines

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Thu May 18 14:51:32 EDT 2006


Jonathan S. Shapiro wrote:
> David:
> 
> Thank you for a careful list. Just FYI, here is what we currently do on
> these points.
> 
> Swaroop: see the bit about emitReadableChar at the end. My printf code
> was completely boogered.
> 
> On Thu, 2006-05-18 at 01:29 +0100, David Hopwood wrote:
> 
>> - I assume it is intended that a BitC source file is always encoded as
>>   UTF-8 (although the spec doesn't currently say so). These files should
>>   be opened in binary mode, avoiding any unwanted conversions.
> 
> Actually, the specification *does* say so. See the second paragraph of
> section 2 (Input Processing).

No, I checked that. It says Unicode, not UTF-8.

# Input units of compilation are defined to be encoded using the Unicode
# character set as defined in version 4.1.0 of the Unicode standard [12],
# using Normalization C. All keywords and syntactically significant
# punctuation fall within the ISO-LATIN-1 subset, and the language
# provides for ISO-LATIN-1 encodable ``escapes'' that can be used to
# express the full Unicode character code space in character and string
# literals.

This is correct as far as it goes, but I think what also needs to be said
is that *when the input compilation unit is provided as a file*, it is
encoded as UTF-8 (also as defined by Unicode 4.1.0).

Incidentally, AFAICS all BitC keywords, significant punctuation, and escapes
fall within US-ASCII, not just ISO-Latin-1. Also it is "Normalization Form C",
not "Normalization C".

I assume it is intentional that characters that are unassigned as of
Unicode 4.1.0 cannot be used in this version of BitC?

> I believe that we do currently open the
> file in binary mode, as I remember noting this potential error at some
> point. Hmm. Actually, it doesn't matter, because LF and CR are not valid
> within strings or character literals, and in all other cases they are
> merely white space.

Strictly speaking, text-mode translation might affect more than just LF and
CR (for example, some C libraries treat 0x1A as end-of-file in text mode).

>> - the generated C code should be *portable* C89. It might be reasonable to
>>   make some assumptions about the C platform that will execute this code
>>   (platform S) that go beyond what is guaranteed by the C89 standard (for
>>   example, we may assume CHAR_BIT == 8), but any such assumptions should
>>   be documented. This means that we should only use characters from the
>>   C89 "portable character set":
>>
>>     A-Z a-z 0-9
>>     ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
>>
>>   i.e. U+0020..007E excluding U+0024 ($), U+0040 (@) and U+0060 (`).
> 
> I believe that you mean C99, but otherwise I agree.

I meant C89, because there are very few complete C99 compilers, and even
fewer that I would use when targeting an embedded platform. In the BitC
compiler itself, it's fine to use the subset of C99 features that are
widely implemented in compilers for desktop systems, but I think the
generated code should be C89.

The extra features in C99 are not all that useful for generated code,
anyway, unless I've missed something.

> We already mangle
> all of the BitC identifier characters that are not legal C identifier
> characters, including all UCS characters whose code points are > 127.
> 
>> - the BitC compiler seems to be writing the C source file on stdout,
>>   which is opened in text mode using the encoding of Q. This works if
>>   and only if Q and R use encodings that are identical for characters
>>   in the portable character set. I think this is a reasonable assumption,
>>   but it should be documented.
> 
> We only write stdout in the absence of other specification, and it is
> intended that this behavior will go away. It is a testing convenience.
> 
> In any case, this works universally provided the compiler output is
> restricted to 7-bit ASCII. I'm not aware of *any* character encoding
> that does not retain 7-bit ASCII in the low positions.

EBCDIC variants, some obsolete national variants of ASCII, and sometimes
Shift-JIS.

> There *is* a problem here if the output character set is a 16-bit
> character set such as shift-JIS.

Shift-JIS is a multibyte charset. It is *almost* compatible with US-ASCII
except that 0x5C is sometimes (inconsistently) used to encode a Yen symbol
rather than '\'. Actually, I think Shift-JIS-as-implemented-in-C-compilers
always treats 0x5C as '\'.

> This is why we do *not* output using the wide character routines.

Indeed you shouldn't.

> Actually, I encountered significant frustration here. There is no
> portable way from within a C program to determine whether the current
> wide character locale is a UTF-8 locale, which is exceptionally
> irritating.

Tell me about it. The C committee don't seem to understand the importance
of being able to do I/O in known character encodings.

> This prompted me to avoid the stdio support for wide
> characters altogether.
> 
>> - we can use either '\n' or '\x0A' to output newlines. Using '\n'
>>   assumes that platform R accepts platform Q's newline encoding; using
>>   '\x0A' assumes that platform R accepts the Unix (LF) newline encoding.
>>   It probably doesn't matter, but since we are already assuming that
>>   Q and R have compatible encodings for portable characters, '\n' may
>>   be slightly better.
> 
> Because newline is not legal within literals, I think that it actually
> doesn't matter. The C compiler accepts \r and \n as whitespace. The only
> damage that might ensue from getting this wrong is mishandled line
> numbers, which is not critical for the bootstrap compiler. In any case
> it is easily corrected by DOS2UNIX or equivalent if desired.

OK.

>> - a BitC program uses UTF-8 as its execution character encoding. If we
>>   are including "portable" characters directly in character and string
>>   literals, then the codes for these characters in the encoding of S
>>   (the C "execution character set") must be the same as in UTF-8 (i.e.
>>   the same as in US-ASCII, since all characters in the portable set
>>   are in US-ASCII). This assumption should also be documented.
> 
> Actually, this is not the case. BitC does not make any statement about
> execution character encoding (intentionally). The current implementation
> is that characters are UCS4, and strings are vectors of UCS4.

I'm a bit confused: how can the current compiler use C strings to represent
BitC strings in that case?

(It could use wide C strings and assume defined(__STDC_ISO_10646__) and
sizeof(wchar_t)*CHAR_BIT == 32, but that assumption doesn't hold for
many important target platforms.)
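
A minimal sketch of how that assumption could be checked, in case it helps.
The helper name wchar_is_ucs4 is made up for illustration and is not part of
the BitC sources; the macro and the size test are the ones named above.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */

/* Hypothetical helper: returns 1 if wchar_t can be treated as a 32-bit
 * ISO 10646 code unit on this platform, and 0 otherwise. */
static int wchar_is_ucs4(void)
{
#if defined(__STDC_ISO_10646__)
    return sizeof(wchar_t) * CHAR_BIT == 32;
#else
    return 0;  /* the encoding of wchar_t is implementation-defined */
#endif
}
```

On a platform where this returns 0, wide C strings would not be a safe
representation for vectors of UCS4.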

I would also argue that any language supporting Unicode must (also) be able
to use UTF-8 directly as an internal encoding, with indexing based on UTF-8
code units. The fourfold expansion of UTF-32 for US-ASCII is not acceptable
when storing large amounts of mostly-ASCII text.
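
To make the size difference concrete, here is a rough sketch that computes
how many bytes a vector of Unicode scalar values would occupy as UTF-8. The
names utf8_len and utf8_size are illustrative only, and the use of
unsigned long for a scalar value is an assumption of this sketch.

```c
#include <stddef.h>  /* size_t */

/* Bytes needed to encode one Unicode scalar value in UTF-8.
 * Returns 0 for values that are not scalar values
 * (surrogates, or anything above U+10FFFF). */
static size_t utf8_len(unsigned long c)
{
    if (c < 0x80UL)      return 1;
    if (c < 0x800UL)     return 2;
    if (c >= 0xD800UL && c <= 0xDFFFUL) return 0;
    if (c < 0x10000UL)   return 3;
    if (c <= 0x10FFFFUL) return 4;
    return 0;
}

/* Total UTF-8 size of n scalar values; invalid entries contribute 0. */
static size_t utf8_size(const unsigned long *s, size_t n)
{
    size_t i, total = 0;
    for (i = 0; i < n; i++)
        total += utf8_len(s[i]);
    return total;
}
```

For the four scalar values 'h', 'i', U+00E9 and U+20AC this gives 7 bytes,
against 16 bytes when each value is stored as a 32-bit unit.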

>>So, if 'c' is intended to hold a Unicode scalar value, I would write the
>>code in question as:
>>
>>  if (isPortableLiteralChar(c)) printf("'%c'", c);
>>  else printf("%d", c);
>>
>>  bool isPortableLiteralChar(wint_t c) {
>>      return c >= 0x20 && c <= 0x7E  /* printable ASCII */
>>          && c != 0x22               /* not " */
>>          && c != 0x24               /* not $ */
>>          && c != 0x27               /* not ' */
>>          && c != 0x40               /* not @ */
>>          && c != 0x5C               /* not \ */
>>          && c != 0x60;              /* not ` */
>>  }
> 
> 
> Um, wait. the characters '"', '$', '@', and '`' are no problem to emit
> to C.

$, @ and ` are not in the portable character set, which is why I excluded
them here. Yes, I know that the national variants of ASCII are obsolete,
but IBM machines that use EBCDIC variants are still in widespread use, and
that case would "just work" provided that we stick to the portable character
set, under the assumptions stated above.

> The characters '\\' and '\'' are also okay, but need to be escaped.
>
> But yes, the code I wrote for Swaroop was wrong. Swaroop: what we need
> to do here is use emitReadableChar(c) rather than printf.
> emitReadableChar should do the proper encoding. I think there may be a
> function hiding somewhere that already knows how to do this encoding. If
> not, you may need to adapt the one used for strings.

Right. AFAICS there should be no need to handle character literals and
string literals differently, in terms of which characters can be output
as-is. (Although ' is valid unescaped in C string literals and " is valid
unescaped in C character literals, it is simpler to escape both.)
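
Something along these lines is what I have in mind. This is only a sketch:
format_char_expr and is_portable_literal_char are names invented here, not
the existing emitReadableChar, and it assumes that the compiler's host
encodes the portable characters as in US-ASCII, per the assumptions stated
above. It writes into a buffer rather than to a stream so that it is easy to
check.

```c
#include <stdio.h>  /* sprintf */

/* Printable ASCII that is also in the C89 portable character set,
 * i.e. U+0020..007E without $, @ and `. */
static int is_portable_literal_char(unsigned long c)
{
    return c >= 0x20UL && c <= 0x7EUL
        && c != 0x24UL && c != 0x40UL && c != 0x60UL;
}

/* Hypothetical helper: writes a C expression for the scalar value c
 * into buf, which must have room for at least 24 bytes. */
static void format_char_expr(unsigned long c, char *buf)
{
    if (!is_portable_literal_char(c))
        sprintf(buf, "%lu", c);          /* plain decimal code point */
    else if (c == 0x22UL || c == 0x27UL || c == 0x5CUL)
        sprintf(buf, "'\\%c'", (int)c);  /* escape ", ' and \ alike */
    else
        sprintf(buf, "'%c'", (int)c);    /* safe to emit as-is */
}
```

The same classification could be reused for string literals, since " and '
are both escaped unconditionally.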

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



