[bitc-dev] (partially fixed) character emission

Jonathan S. Shapiro shap at eros-os.org
Thu May 18 22:54:59 EDT 2006


David:

I've just put (I hope) proper character encoding into the C generator,
but I didn't deal with the EBCDIC issues. I also haven't yet looked at
the string emission issue.

Concerning EBCDIC, I went and looked at some code point comparison
charts. The problem goes far beyond '@' and '$': the code points for
letters in general do not match those of ASCII.
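
To make the mismatch concrete: in EBCDIC code page 037, 'A' is 0xC1,
'@' is 0x7C and '$' is 0x5B, against 0x41, 0x40 and 0x24 in ASCII. A
minimal sketch of a check for this follows. The function names are
hypothetical and nothing like this exists in the BitC tree as far as
this message says; it only shows how generated C could detect a
compiler whose execution character set disagrees with ASCII.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
 * Returns 1 if the compiler's execution character set agrees with
 * ASCII on a few sample characters, 0 otherwise. An EBCDIC-based
 * compiler would fail every one of these comparisons. */
int exec_charset_matches_ascii(void)
{
    return 'A' == 0x41 && 'a' == 0x61 && '0' == 0x30
        && '@' == 0x40 && '$' == 0x24;
}

/* Hypothetical guard: abort early rather than mis-encode characters. */
void require_ascii_exec_charset(void)
{
    assert(exec_charset_matches_ascii());
}
```

The sample set is deliberately small; a real check would presumably
cover whatever characters the generator emits as glyphs.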

When we emit a literal initializer for a character in the C generator,
we aren't really interested in what the glyph is. The only reason to
emit a glyph at all is for the convenience of a human reading the code.
The important issue is that the character literal ends up with the
right Unicode code point.

So my concern is that a C compiler that is compiling based on an EBCDIC
tokenizer is going to mis-encode a great many characters if we emit them
as C character literals. It appears (to me) that the only safe encoding
if we care about EBCDIC-based C compilers is to emit *everything* using
octal escapes, and perhaps emit comments for the sake of the human
reader.
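
A rough sketch of what the octal-escape emission could look like. The
helper name, its signature, and the restriction to values 0..0377 are
assumptions made for illustration; this is not the actual generator
code. It also assumes the generator itself runs on an ASCII host, so
that printing the glyph into the comment is meaningful.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only.
 * Formats the value `cp` (restricted here to 0..0377) as a C character
 * literal written as a three-digit octal escape, followed by a comment
 * for the human reader. Returns what snprintf returns: the number of
 * characters the full text needs, excluding the terminating NUL. */
int format_char_literal(unsigned cp, char *buf, size_t len)
{
    assert(cp <= 0377u);
    if (cp >= 0x20u && cp < 0x7fu) {
        /* Printable ASCII: show the glyph itself in the comment. */
        return snprintf(buf, len, "'\\%03o' /* %c */", cp, (int)cp);
    }
    /* Anything else: show the numeric value instead of a glyph. */
    return snprintf(buf, len, "'\\%03o' /* U+%04X */", cp, cp);
}
```

For the value 0x41 this produces the text '\101' /* A */, so the
compiler only ever sees the number and the glyph survives purely as a
comment.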

For character literals this will work fine, but for string literals it
is a complete nuisance.
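
To show why strings are the nuisance case, here is a minimal sketch
that escapes every byte of a string. The helper name and signature are
hypothetical. Each escape is padded to three octal digits so that a
digit following it in the string cannot be absorbed into the escape.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Writes the NUL-terminated byte string `s` into `buf` as a C string
 * literal in which every byte is a three-digit octal escape.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if `buf` is too small. */
int format_string_literal(const char *s, char *buf, size_t len)
{
    size_t n = strlen(s);
    size_t need = 2 + 4 * n;   /* two quotes plus "\ooo" per byte */
    size_t i;
    char *p = buf;

    assert(s != NULL && buf != NULL);
    if (len < need + 1)
        return -1;
    *p++ = '"';
    for (i = 0; i < n; i++) {
        /* Go through unsigned char so bytes >= 0200 stay positive. */
        sprintf(p, "\\%03o", (unsigned)(unsigned char)s[i]);
        p += 4;
    }
    *p++ = '"';
    *p = '\0';
    return (int)need;
}
```

The two-byte string "hi" becomes "\150\151": correct for any compiler,
but unreadable without an accompanying comment, and four times the
length of the original text.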

To properly support EBCDIC, we would also need to completely rebuild the
lexer and the parser.

My inclination is to simply duck the EBCDIC issue for the moment.


shap


