[bitc-dev] BitC now supports UNICODE

Jonathan S. Shapiro shap at eros-os.org
Mon Feb 20 00:06:44 EST 2006


I have just added the following changes to BitC in the repository:

  1. The new specification for character and string literals is
     now in effect, with transitional support for some of the
     old, bad ideas.

  2. In a character literal of the form #\C, C can now be any
     valid UNICODE printable character.

  3. In identifiers, the valid characters have been updated to
     include any UNICODE identifier characters.

  4. In strings, non-escaped characters now can be any non-blank,
     non-control UNICODE character. ASCII space (U+0020) is also
     permitted.

  5. The compiler now works in non-UNICODE locales.

By "any UNICODE character", I mean any code point having the required
UNICODE attributes according to the UNICODE 4.1.0 specification, encoded
as UTF-8.
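
As a rough illustration of rule 4 only (a sketch, not the compiler's actual
code), the check for non-escaped string characters could look something like
the following. The function name is hypothetical, and the mapping of "blank"
and "control" onto UNICODE general categories is my assumption rather than the
wording of the specification. Python's unicodedata module also tracks a newer
UNICODE version than 4.1.0, so edge cases may differ.

```python
import unicodedata


def allowed_unescaped_in_string(ch: str) -> bool:
    """Approximate rule 4: any non-blank, non-control character, plus ASCII space."""
    # ASCII space (U+0020) is explicitly permitted.
    if ch == "\u0020":
        return True
    category = unicodedata.category(ch)
    # Assumption: "blank" means the separator categories (Zs, Zl, Zp) and
    # "control" means Cc. Unassigned code points (Cn) are rejected as well.
    return category not in {"Zs", "Zl", "Zp", "Cc", "Cn"}


if __name__ == "__main__":
    for sample in [" ", "\t", "\u00a0", "\u00e9", "\u6f22"]:
        print(f"U+{ord(sample):04X}", allowed_unescaped_in_string(sample))
```

Under these assumptions, tab and no-break space would need an escape, while
accented Latin letters and Han ideographs are accepted as-is.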

For the exact rules on character and string literals, see the 0.9+
specification.

NOTE: You may need to install the libicu-devel package from the Fedora
Extras repository (or install libicu v3.4 yourself) to get the compiler
to build.

Known Bug: I am not doing anything to canonicalize characters that can
be encoded as more than one code point sequence, mainly because I don't
really know what the right thing to do is. If you know enough about, say,
Han to key the same character using two different composited sequences,
don't expect your identifiers to resolve properly.
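
To make the failure concrete, here is a small sketch of the problem using a
Latin example in place of Han. The two spellings render identically but are
different code point sequences, so a plain comparison treats them as different
identifiers. Normalizing both to NFC before comparing is one candidate fix. I
am not claiming it is the right rule for BitC, and the helper name is
hypothetical.

```python
import unicodedata

# The same visible identifier, keyed two ways.
precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after normalizing both to NFC (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # What effectively happens today: a comparison with no canonicalization.
    print(precomposed == decomposed)
    # What a normalizing comparison would report.
    print(same_identifier(precomposed, decomposed))
```

The raw comparison reports a mismatch and the normalized one reports a match,
which is the behavior gap described above.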

For string literals, I don't think that the compiler has any business
altering your input.

For composited character literals, we are simply getting it wrong at the
moment, and I don't understand the issues well enough to know what to do
about it.

If somebody out there knows enough about input methods to test this
better than I can, I'd be grateful.


shap


