[bitc-dev] Unicode and bitc
Jonathan S. Shapiro
shap at eros-os.org
Tue Oct 12 09:00:18 PDT 2010
One minor point, and then I'll respond to the main one:
> .NET uses UTF16 ( USC-2) since there was no UTF8 when it was designed .
Check your facts on this. UTF-8 dates back to 1992. That predates the first
major release of Java, never mind .NET.
But the problem in the large here is that ontogeny recapitulates phylogeny.
The fact that Java is so intimately tied to processing of XML means that
large bodies of existing code are written to a UCS-2 indexing assumption.
This is further amplified by the fact that nearly all useful character sets
are encodable in UCS-2, and UCS-4 font sizes aren't very tractable.
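
To make the UCS-2 indexing assumption concrete, here is a minimal Python sketch (Python is only an illustration language here, and the helper name utf16_units is hypothetical, not anything from Java, .NET, or BitC). It shows that "one 16-bit unit per character" holds for text inside the Basic Multilingual Plane and breaks as soon as a character outside it appears:

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# Every character here is in the BMP, so unit index == code point index.
bmp = "caf\u00e9"
bmp_units = utf16_units(bmp)

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP and needs two units.
astral = "a\U0001D11Eb"
astral_units = utf16_units(astral)

print(len(bmp), len(bmp_units))        # 4 4
print(len(astral), len(astral_units))  # 3 4

# Code that indexes under a UCS-2 assumption expects 'b' at unit index 2,
# but finds the low half of a surrogate pair there instead.
print(hex(astral_units[1]), hex(astral_units[2]))  # 0xd834 0xdd1e
```

Code written against the first case keeps working by accident for most text, which is why the assumption is so hard to dislodge.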
The long and short of all of this is that we're stuck with legacy indexing
schemes. Perhaps more importantly, we're stuck with a three-layer stack in
which characters, code points, and code units can defy interpretation unless
consistency rules are imposed at all three layers.
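
As a rough illustration of the three layers, the following Python sketch measures one short string at each of them. The variable names are mine, and the "character" count is only an approximation that skips combining marks; it is not full grapheme-cluster segmentation as a real library would do it:

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT + GRINNING FACE: two characters to a reader.
s = "e\u0301\U0001F600"

# Code units: depends entirely on the encoding chosen.
utf8_units = len(s.encode("utf-8"))            # 1 + 2 + 4 bytes
utf16_units = len(s.encode("utf-16-le")) // 2  # 1 + 1 + 2 (surrogate pair)

# Code points: Python 3 str happens to index at this layer.
code_points = len(s)

# Characters (approximation): count code points that are not combining marks.
approx_chars = sum(1 for c in s if unicodedata.combining(c) == 0)

print(utf8_units, utf16_units, code_points, approx_chars)  # 7 4 3 2
```

Four different answers to "how long is this string" is exactly the situation that needs consistency rules at each layer.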
So two addenda to my earlier four rules:
1. I had not intended to imply that strings would *not* include ucs1
indexing. If only for the sake of certain low-level memory operations, they
need to be byte indexable.
2. The model I propose is very careful not to take any position that commits
the implementation to a particular representation. I'd note that the IBM
ICU components have a very strong string implementation that satisfies all
of the concerns you raise while retaining perfectly fine in-memory space