[bitc-dev] BitC 0.20: Unicode
Kevin Reid
kpreid at mac.com
Tue Mar 9 13:27:28 PST 2010
On Mar 9, 2010, at 16:13, Jonathan S. Shapiro wrote:
> [Re-send - original sent to wrong alias]
>
> One of the mundane issues I want to take up is character and string
> encoding. The issue that is driving this is JVM/CLR, neither of
> which properly implements Unicode. That is: the "character" type in
> both runtimes is 16-bit, and this can only encode the Basic
> Multilingual Plane.
Actually, Java is nominally UTF-16:
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413
(This is the language, not the VM, spec, yes; but since interop with
Java standard libraries is presumably the primary thing of interest in
choosing how a JVM::char is interpreted...)
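
A minimal sketch of what "nominally UTF-16" means in practice, assuming a
standard JDK. The class and method names are made up for illustration; only
the java.lang calls are real. A character outside the BMP occupies two
chars (a surrogate pair) but counts as one code point:

```java
public class Utf16Units {
    // U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of 16-bit UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(CLEF));   // two chars: a surrogate pair
        System.out.println(codePoints(CLEF));  // one code point
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));
    }
}
```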
> [...]
> In CLR (and I believe in JVM), characters outside the BMP can be
(and should be, for interop)
> encoded in strings using surrogate pairs, and with (considerable)
> care these can be processed. So long as we reject strings that
> contain malformed surrogate pairs, we should be fine. In practice
> this means:
> Strings returned from CLR/JVM routines need to be validated.
> Substring operations must fail if they would result in a broken
> surrogate pair. The easiest way to handle this is to define them in
> terms of characters rather than code units.
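
The two rules quoted above could be sketched on the JVM side roughly as
follows. This is an illustration under assumptions, not a proposed BitC
API: SurrogateCheck, isWellFormed and substringByCodePoints are
hypothetical names, and only the java.lang methods they call are real.

```java
public class SurrogateCheck {
    // True if every surrogate code unit in s belongs to a high+low pair.
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate with no low half after it
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no high half before it
            }
        }
        return true;
    }

    // Substring over code points [begin, end). Because the indexes count
    // characters rather than code units, a validated string cannot be
    // cut in the middle of a surrogate pair.
    static String substringByCodePoints(String s, int begin, int end) {
        if (!isWellFormed(s)) {
            throw new IllegalArgumentException("malformed surrogate pair");
        }
        if (begin < 0 || end < begin) {
            throw new IndexOutOfBoundsException("bad range");
        }
        int from = s.offsetByCodePoints(0, begin);      // code points -> code units
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }
}
```

Validation here is a single linear pass, so it could be done once at the
boundary where a string arrives from a CLR/JVM routine.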
It is possible-but-weird to handle such things as uneven widths and
invalid substring indexes by defining the high-level interfaces such
that *numeric* indexes are never seen by most programmers; see Taylor
Campbell's Scheme work on this idea. It seems reasonable to me, but I
haven't actually done any work within the system.
The starting premise as I recall it is essentially that even if we
always work in 32-bit units, that isn't what user-programmers actually
want -- consider combining characters. Rather, the primitives should
be iterating over strings in selectable units (grapheme cluster,
scalar value, UTF-N code unit, whatever) and parsing.
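
To make "selectable units" concrete, here is a rough Java sketch that
splits the same string three ways. StringUnits and its methods are
hypothetical names for illustration. java.text.BreakIterator is used as a
stand-in for grapheme cluster segmentation; how closely it follows the
Unicode rules for extended grapheme clusters depends on the JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class StringUnits {
    // One element per 16-bit UTF-16 code unit.
    static List<String> codeUnits(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    // One element per code point; these are scalar values only if the
    // input contains no unpaired surrogates.
    static List<String> scalarValues(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // One element per user-perceived character, as judged by BreakIterator.
    static List<String> graphemeClusters(String s) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT, followed by U+1D11E (outside the BMP).
        String s = "e" + (char) 0x0301 + new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(s).size());
        System.out.println(scalarValues(s).size());
        System.out.println(graphemeClusters(s).size());
    }
}
```

None of the three functions exposes a numeric index to the caller, which
is the point of the approach: the unit is chosen up front and the string
is then consumed piece by piece.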
> If we choose BitC "char" to be 16 bits, then I propose to add a new
> character type UniChar that covers the full Unicode set.
> Alternatively, we could define "char" as the 32-bit unit, and
> introduce BMPChar or CodeUnit for interaction with JVM/CLR.
I suggest UTF16Char for maximum obviousness.
--
Kevin Reid <http://switchb.org/kpreid/>