[bitc-dev] Unicode and bitc
Ben Kloosterman
bklooste at gmail.com
Tue Oct 12 19:00:13 PDT 2010
Re the UTF-8 design, you are correct. I got two things mixed up: the fact
that when .NET was designed no one used UCS-1 and UCS-2 was common, and
the fact that it (UTF-8) wasn't out when Windows was designed.
Looking further, UCS-2 is now regarded as obsolete as a document
representation, and UTF-16 is not the same thing because it has
variable-sized extensions (surrogate pairs). (Note that all UCS-2 is
readable as UTF-16, but not the reverse.) Yet basic indexing of a
variable-sized format such as UTF-8 or UTF-16 is misleading to developers,
because you nearly always need to do an O(n) scan from the start; this
means you need different methods to handle it optimally. If you want to
allow indexable strings I would suggest two string types, but you can do
all the indexing you need with char[] (or utf32[] etc.).
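To illustrate the O(n) point, here is a minimal sketch in Python (chosen
only for brevity; the function name utf8_index is made up for this example
and is not part of any library). It finds the n-th code point of a UTF-8
byte sequence and has to look at every byte before it to do so:

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 encoded `data`.

    There is no way to jump straight to it: every byte before it has to
    be inspected, so the cost grows linearly with the position.
    Assumes `data` is well-formed UTF-8.
    """
    if n < 0:
        raise IndexError("negative index")
    seen = -1      # index of the code point we are currently inside
    start = None   # byte offset where the wanted code point starts
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte begins a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == n:
                start = offset
            elif seen == n + 1:
                return data[start:offset].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "naïve €5".encode("utf-8")   # 8 code points stored in 11 bytes
print(len(text), utf8_index(text, 6), text[6])
```

Indexing the raw bytes with text[6] gives the byte for the space, not the
euro sign that is the seventh character, which is exactly the kind of
surprise a plain indexer on a variable-width string invites.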
While Java does have excellent XML parsers, there are plenty of good C ones
that work in UTF-8. The libxml2 SAX parser beats the Java ones by 30-50%,
and working in UCS-2 means you may not be able to meet your C performance
goal. In Java land StAX APIs are common: even though they give inferior
performance, they have an easier-to-use API. Anyway, for BitC I don't see
this strong Java base as an issue, as there are plenty of good C (UTF-8)
parsers (and a few C++ StAX ones) and you can easily write a wrapper with
minimal impact. If UCS-2 were common and strong I would weigh this argument
more heavily, but it's a legacy standard, and Java, Windows and hence .NET
are burdened with conversion costs to and from UCS-2. There are no UCS-2
documents anymore, and a UCS-2 system which can't do UTF-8, UTF-16 or
UTF-32 representations is even illegal in China.
2. The model I propose is very careful not to take any position that commits
the implementation to a particular representation. I'd note that the IBM
ICU components have a very strong string implementation that satisfies all
of the concerns you raise while retaining perfectly fine in-memory space
performance.
Java still suffers from excessive memory usage on embedded devices, and its
SAX XML parsers are still inferior to the C ones. Regarding not taking a
position: that is true, but note, as I said, that an indexer on a string
implies to a developer a fixed-width implementation, which can only be
ASCII, UCS-2, UCS-4 or UTF-32. With UTF-8 or UTF-16 underneath, developers
will write unexpectedly poorly performing code. If you exclude a public
indexer from string (and just provide conversion to and from char[]) then
the std lib can handle indexing as needed, and a dev who implements
indexing will have to be more careful about the format.
E.g. if char is UTF-8, string can have an underlying representation of
char[], with all the lib methods living on string. This provides a number
of additional benefits, e.g.:
- String can be copied to char[] at very low cost (a cast is possible,
  but you lose the immutability).
- Programmers will use indexing on strings only when needed, relying
  more on the library. This subtly improves code quality, which is very
  obvious in the MS world (and .NET uses UCS-2, so could have used
  indexers).
- Strings are immutable, providing GC benefits as well as
  multithreading benefits, especially avoiding the diabolical
  string-changed-by-another-thread issue.
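
A rough sketch of that shape, again in Python purely for illustration. The
class Str and its methods to_chars, from_chars and contains are
hypothetical names invented for this example; they are not BitC, .NET or
ICU APIs. The string is immutable, has no indexer, and hands out a mutable
copy of its UTF-8 code units when the caller really wants to index:

```python
class Str:
    """Immutable string held as UTF-8, deliberately without an indexer."""

    __slots__ = ("_utf8",)

    def __init__(self, text: str = ""):
        # Bypass our own __setattr__, which forbids later mutation.
        object.__setattr__(self, "_utf8", text.encode("utf-8"))

    def __setattr__(self, name, value):
        raise AttributeError("Str is immutable")

    @classmethod
    def from_chars(cls, chars) -> "Str":
        """Build a Str from an array of UTF-8 code units (validated)."""
        data = bytes(chars)
        data.decode("utf-8")   # raises UnicodeDecodeError if malformed
        s = cls()
        object.__setattr__(s, "_utf8", data)
        return s

    def to_chars(self) -> bytearray:
        """Return a mutable copy of the code units: a single flat copy."""
        return bytearray(self._utf8)

    def contains(self, other: "Str") -> bool:
        """Library method: substring search needs no indexing by callers."""
        return other._utf8 in self._utf8

    def __str__(self) -> str:
        return self._utf8.decode("utf-8")


s = Str("héllo")
chars = s.to_chars()    # caller now knows it is indexing UTF-8 code units
chars[0] = ord("j")     # changes only the copy; s is untouched
print(s, Str.from_chars(chars), s.contains(Str("llo")))
```

Because no __getitem__ is defined, s[0] fails with a TypeError, so any
indexing has to go through to_chars and the developer has to think about
the format.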
Ben