[bitc-dev] Unicode and bitc
Jonathan S. Shapiro
shap at eros-os.org
Wed Oct 13 15:16:15 PDT 2010
On Tue, Oct 12, 2010 at 9:47 PM, William Leslie <
william.leslie.ttg at gmail.com> wrote:
> How can we attribute the performance difference between these xml
> parsers to encoding? Where are the benchmarks?
> Memory usage of strings probably isn't as important as you think...
I think this is incorrect. In UNIX programs circa 1990, 20% of live
in-memory data on workstations was character string data. By 2000, that
number was closer to 60%. The proportion on servers is much higher. So size
of character representation matters both for memory usage reasons and for
cache bandwidth reasons - the latter probably more compelling than the former.
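
To make the size arithmetic concrete, here is a minimal C11 sketch (my
illustration, not anything from BitC) comparing what the same twelve ASCII
characters cost under each encoding:

    #include <stdio.h>

    /* C11 string literals store the same text in three encodings,
     * so sizeof shows the footprint directly. */
    int main(void)
    {
        printf("UTF-8:  %zu bytes\n", sizeof(u8"hello, world")); /* 13 */
        printf("UTF-16: %zu bytes\n", sizeof(u"hello, world"));  /* 26 */
        printf("UTF-32: %zu bytes\n", sizeof(U"hello, world"));  /* 52 */
        return 0;
    }

For ASCII-dominated data the wider encodings double or quadruple the
footprint, and a cache line holds half or a quarter as many characters. For
CJK text the balance shifts: most of those code points take three bytes in
UTF-8 but two in UTF-16.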
> - for
> large strings, you are probably more interested in using a stream
> decoder than a great big in-memory string, and if that doesn't suit
> your use case, you probably want to implement your own string type,
> whether that be ropes or an array in utf-8 or whatever.
Possibly, and perhaps, but if so then you aren't concerned about the native
string representation in the first place.
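
For what it's worth, the streaming approach described above might look
roughly like the following; a minimal sketch of a hand-rolled incremental
UTF-8 decoder in C (the function name, the chunk size, and the omitted
validation of overlong forms and surrogates are my assumptions):

    #include <stdio.h>
    #include <stdint.h>

    /* Feed one byte at a time; returns 1 when a complete code point
     * is available in *cp. No rejection of malformed sequences. */
    static int utf8_feed(uint32_t *state, uint32_t *cp, uint8_t b)
    {
        if (*state == 0) {                      /* lead byte */
            if (b < 0x80)      { *cp = b; return 1; }
            else if (b < 0xE0) { *cp = b & 0x1F; *state = 1; }
            else if (b < 0xF0) { *cp = b & 0x0F; *state = 2; }
            else               { *cp = b & 0x07; *state = 3; }
        } else {                                /* continuation byte */
            *cp = (*cp << 6) | (b & 0x3F);
            if (--*state == 0) return 1;
        }
        return 0;
    }

    int main(void)
    {
        uint8_t buf[4096];
        uint32_t state = 0, cp = 0;
        size_t n, i;
        long count = 0;

        /* Decode stdin chunk by chunk; the whole string is never
         * materialized in memory. */
        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
            for (i = 0; i < n; i++)
                if (utf8_feed(&state, &cp, buf[i]))
                    count++;
        printf("%ld code points\n", count);
        return 0;
    }

The point is that the decode state, not the string, is what persists across
chunks, so the memory cost is independent of document size.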
> For typical in-memory string manipulation, UCS-2 has served us well,
> and people usually work under the assumption that indexing or slicing
> a string by index-of-codepoint is O(1) (even if the strings resulting
> from the slice may not be valid). I think it is a useful assumption,
> and that programmers will continue to want cheap slices based on a
> vague if sometimes incorrect count of characters for the time being.
Perhaps more to the point, if you *don't* have this, you have to give up
either the XML content model or the XPath indexing model, neither of which
is optional in practice...
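
The cost behind that assumption is easy to see: under a fixed-width encoding,
index-of-codepoint is a direct array access, while under UTF-8 it is a linear
scan just to find where code point i begins. A minimal C sketch (both
function names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    /* Fixed-width (UCS-2/UTF-32 style): O(1). */
    static uint32_t index_utf32(const uint32_t *s, size_t i)
    {
        return s[i];
    }

    /* UTF-8: O(i). Walk forward, skipping continuation bytes
     * (those matching 10xxxxxx), to find the byte offset at
     * which code point i starts. */
    static size_t utf8_offset_of(const uint8_t *s, size_t len, size_t i)
    {
        size_t byte = 0;
        while (byte < len && i > 0) {
            byte++;
            while (byte < len && (s[byte] & 0xC0) == 0x80)
                byte++;
            i--;
        }
        return byte;
    }

XPath's substring() and string-length() are specified in terms of character
positions, so an engine over a variable-width representation either pays
that scan on every index or maintains auxiliary position tables.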