[bitc-dev] Unicode and bitc
bklooste at gmail.com
Tue Oct 12 07:17:07 PDT 2010
Just reading about strings in BitC.
There is a compromise position, which is where we are currently leaning:
* A well-formed string consists of a sequence of code points. The
specification does not take a position on the encoding of strings in the
implementation.
* Strings support indexing on both UCS-2 and UCS-4 code units.
* Any operation that accepts code units and produces a string is
obliged to confirm that the code unit sequence constitutes a well-formed
code point sequence to ensure that multiple indexing schemes are possible.
* Implementations are encouraged where possible to use a run-encoded
internal representation of strings incorporating a hidden cached cursor,
such that arbitrary indexing and sequential indexing are both implemented in
O(1) time. A reference implementation for such an encoding will eventually
be provided by the BitC implementation.
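The cached-cursor idea above can be sketched in a few lines. This is a hypothetical illustration, not the BitC reference implementation: a UTF-8-backed string that remembers the last (code-point index, byte offset) pair it resolved, so sequential indexing costs O(1) amortized even though the encoding is variable-length.

```python
class CursorString:
    """UTF-8-backed string with a cached (code-point index, byte offset)
    cursor. Sequential indexing is O(1) amortized; a random backward
    access falls back to rescanning from the start."""

    def __init__(self, text):
        self._bytes = text.encode("utf-8")
        self._cursor = (0, 0)  # (code-point index, byte offset)

    @staticmethod
    def _char_len(first_byte):
        # Length in bytes of a UTF-8 sequence, from its first byte.
        if first_byte < 0x80:
            return 1
        if first_byte < 0xE0:
            return 2
        if first_byte < 0xF0:
            return 3
        return 4

    def __getitem__(self, i):
        idx, off = self._cursor
        if i < idx:            # cursor is past the target: restart
            idx, off = 0, 0
        while idx < i:         # walk forward one code point at a time
            off += self._char_len(self._bytes[off])
            idx += 1
        self._cursor = (idx, off)
        n = self._char_len(self._bytes[off])
        return self._bytes[off:off + n].decode("utf-8")
```

A run-encoded representation, as the specification suggests, could do better than this linear fallback for arbitrary indexing; the sketch only shows why a hidden cursor already makes the common sequential case cheap.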
Is this wise? UTF-8 content is ubiquitous. I was challenged on this
recently, for foreign-language sites, with the claim that UTF-16 (UCS-2)
would be smaller. It turned out that very few sites used UTF-16, and even
when they did, UTF-8, despite its variable-length encoding, was
significantly smaller. That is mainly due to the huge amount of ASCII
content in XML and HTML files: even where common non-Latin characters
(Asian scripts, Devanagari) cost three bytes in UTF-8 versus two in
UTF-16, the ASCII markup around them more than makes up the difference.
This is pretty major, as it means you would have to convert nearly all
HTML and XML from UTF-8 to UTF-16 or UTF-32.
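The size claim is easy to check directly. A minimal sketch, using two illustrative snippets rather than a real corpus:

```python
# Compare encoded sizes of markup-heavy text in UTF-8 vs UTF-16.
# The sample strings are illustrative, not real corpus data.
samples = {
    "ascii html": "<p class='note'>Hello, world</p>",
    "cjk html":   "<p class='note'>世界您好</p>",
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # payload bytes only, no BOM
    print(f"{name}: utf-8={u8} bytes, utf-16={u16} bytes")
```

For the ASCII snippet UTF-8 is half the size; for the CJK snippet the three-byte CJK characters still lose to the doubled ASCII markup, so UTF-8 stays smaller overall.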
When I started with C# after C, I wondered how strings could work well
without indexers; it is a bit of a shock at first, but after many years it
works quite well. Strings are immutable (which means they can be put in a
special region of the GC that does not need to be re-marked), which is also
nice for multi-threaded work, and if the highest performance is needed you
can work with a mutable char array (as in C) and convert to and from
strings. .NET uses UTF-16 (UCS-2), a choice made before UTF-8 became
dominant, and this gives .NET quite a hefty penalty in string work
(especially HTML and XML) compared to UTF-8 parsers, since it has to
process almost twice the data and convert from UTF-8 to UTF-16. I think
that if .NET had used UTF-8 it would be significantly faster at string
handling, though it is already quite fast when you consider that strings
are heap-based objects.
Now, treating strings as non-indexable internal UTF-8, with an easily
indexable char array of ASCII, UTF-16, or UTF-32 alongside, does require
conversion, but in my experience those conversions are rare: in most cases
you deal with either the string or the char array. At the very least your
string data would use half the memory (or a quarter, compared with
UTF-32!), which is nothing to sneeze at, since string data makes up a
large share of program data, especially on embedded systems. I do note
that such a string class works best with a stack-like nursery allocator,
due to fast creation, but there is no reason it couldn't work with other
allocators.
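The two-representation scheme described above can be sketched as a pair of conversions (names here are hypothetical, for illustration only): the immutable compact form is UTF-8 bytes, and the indexable "char array" escape hatch is produced only on demand.

```python
def to_chars(utf8_data: bytes) -> list:
    """Decode an immutable UTF-8 buffer into a mutable, O(1)-indexable
    list of code points (the 'char array' escape hatch)."""
    return list(utf8_data.decode("utf-8"))

def from_chars(chars: list) -> bytes:
    """Re-encode a mutated char array back into compact UTF-8."""
    return "".join(chars).encode("utf-8")

data = "naïve".encode("utf-8")   # 6 bytes for 5 code points
chars = to_chars(data)
chars[0] = "N"                    # O(1) mutation by code-point index
assert from_chars(chars) == "Naïve".encode("utf-8")
```

The point of the design is that the conversion cost is paid only on the rare paths that need random code-point access; everything else stays in the compact form.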
The library itself could, and probably would, index the string's private
UTF-8 data, with the indexes being byte offsets. While direct arithmetic
on such an index is meaningless, most operations these days tend to be
matching and searching, which can be done on the UTF-8 bytes directly.
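Searching on the raw bytes works because UTF-8 is self-synchronizing: a validly encoded needle can never match starting in the middle of a multi-byte sequence. A minimal sketch (the function name is mine, for illustration):

```python
def find_utf8(haystack: bytes, needle: str) -> int:
    """Byte-offset search directly on UTF-8 data. No decode pass is
    needed: UTF-8's design guarantees an encoded needle only matches
    at code-point boundaries. Returns -1 when absent."""
    return haystack.find(needle.encode("utf-8"))

doc = "<p>héllo 世界</p>".encode("utf-8")
off = find_utf8(doc, "世界")
assert off == 10  # a byte offset; the code-point index would be 9
```

Note the result is exactly the kind of byte-offset index described above: useless for arithmetic, but entirely adequate for slicing and further searching.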