[bitc-dev] BitC 0.20: Unicode

Jonathan S. Shapiro shap at eros-os.org
Tue Mar 9 13:13:40 PST 2010


[Re-send - original sent to wrong alias]

One of the mundane issues I want to take up is character and string
encoding. The issue that is driving this is JVM/CLR, neither of which
properly implements Unicode. That is: the "character" type in both runtimes
is 16 bits wide, so a single value of that type can only encode a code point
in the Basic Multilingual Plane (BMP).
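
To make the 16-bit limit concrete, here is a minimal sketch in Python (not
BitC, and not taken from any BitC specification) of the standard UTF-16
arithmetic. The function name to_utf16_units is made up for illustration. It
shows that a BMP code point fits in one 16-bit code unit, while anything
above U+FFFF has to be split into a surrogate pair:

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        return [code_point]              # BMP: one 16-bit unit is enough
    offset = code_point - 0x10000        # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)       # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate carries the bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x00E9)])    # U+00E9, inside the BMP
print([hex(u) for u in to_utf16_units(0x1D11E)])   # U+1D11E, outside the BMP
```

A 16-bit "char" can hold either half of the second result, but never the
code point itself.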

As of BitC 0.10, the position was:

  - External (on-file) encoding for source units of compilation is UTF-8.
  - Strings are immutable.
  - Characters are 32 bits, UCS-4 encoded.

Actually, there is a bug in the 0.10 specification: string literals are
discussed, but strings are not specified as a core type.

From a principled standpoint, I still think that the decisions above were
the right ones, but both JVM and CLR are restricted to a 16-bit native
character type.

For strings, there isn't really any problem worse than inconvenience. In CLR
(and I believe in JVM), characters outside the BMP can be encoded in strings
using surrogate pairs, and with (considerable) care these can be processed.
So long as we reject strings that contain malformed surrogate pairs, we
should be fine. In practice this means:

  - Strings returned from CLR/JVM routines need to be validated.
  - Substring operations must fail if they would result in a broken
    surrogate pair. The easiest way to handle this is to define them in
    terms of characters rather than code units.

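
The two rules above can be sketched as follows. This is a minimal Python
illustration under my own assumptions, not a proposed BitC or CLR/JVM API:
strings are modelled as lists of 16-bit integers, and the names decode_utf16
and substring are hypothetical. Validation rejects any unpaired surrogate,
and substring indexes by character, so it can never split a pair:

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def decode_utf16(units):
    """Decode 16-bit code units into code points.

    Raises ValueError on a malformed surrogate pair, which is the
    validation step for strings coming back from the runtime.
    """
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u):
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not is_low(units[i + 1]):
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif is_low(u):
            # A low surrogate with no preceding high surrogate is malformed.
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            code_points.append(u)
            i += 1
    return code_points


def substring(units, start, end):
    """Slice by character index rather than by code-unit index."""
    return decode_utf16(units)[start:end]


# 'a', U+1D11E (encoded as a surrogate pair), 'b'
units = [0x0061, 0xD834, 0xDD1E, 0x0062]
print([hex(cp) for cp in decode_utf16(units)])
print([hex(cp) for cp in substring(units, 1, 2)])
```

By contrast, the naive code-unit slice units[1:2] yields a lone high
surrogate, which decode_utf16 rejects.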
So the crux of the matter is the size of the "char" type in BitC. It is
clear that for JVM/CLR interaction we will need a 16-bit character-like
type. It seems equally clear that for any sane Unicode-based processing we
need a 32-bit character-like type.

If we choose BitC "char" to be 16 bits, then I propose to add a new
character type UniChar that covers the full Unicode range. Alternatively, we
could define "char" as the 32-bit unit, and introduce BMPChar or CodeUnit
for interaction with JVM/CLR.

Is there an obviously preferable choice? If not, then given that we want to
be able to target these platforms, what do you think we should do about all
this?


Possibly relevant background:
http://perldoc.perl.org/Encode/Unicode.html#Surrogate-Pairs
The four possible positions you can take once you start down the UCS-2 path:
http://cad.kiev.ua/~demch/multiling/unicode/utf16.html

Jonathan