[bitc-dev] Unicode and bitc
Ben Kloosterman
bklooste at gmail.com
Thu Oct 14 20:31:13 PDT 2010
>
>> So first, I think this is the wrong way to prioritize as a matter of
>> defaults, but second, I think I've already made it clear that no
>> either/or choice is actually required. The "stranded string" approach
>> does all of what you want and more. The O(log n) factor issue is more
>> than compensated for by the improvement in D-cache and D-TLB
>> utilization.
>
>I wasn't sure you'd be willing to accept the overhead of the built-in
>string type being a type-class / capsule / interface / whatever, or
>that you would be comfortable with the default string data type being
>a more complicated structure (ropes, indexed strings, strings with
>extents, etc). If they are, the emphasis on the representation working
>well for all situations is less important to me. If you don't have any
>reservations about the extra overhead from that abstraction (compared
>to their C equivalents), then I don't imagine anyone will.

Another option: instead of, say, String and StringBuilder (a .NET class that
builds strings efficiently using an internal array, with a mutable array for
the last part, which is useful for printf-style formatting), you could have
String and LargeString, with String being a lean and mean byte-indexed UTF-8
string and LargeString as discussed.

LargeString:
- Uses a tree.
- Easily supports custom indexes such as lines.
- Mutable support for appending to the last array.
- Efficient backward compatibility with char/code point index APIs.
- The tree could support all UTF-16, all UCS-4, or mixed, depending on a
  mode; e.g. for memory conservation the default is mixed, but for interop
  you can set the representation to UCS-2 or UCS-4.
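
To make the list above concrete, here is a rough Python sketch. It is not a
proposal for the actual BitC representation: it uses a flat list of UTF-8
chunks with cumulative counts and a binary search as a stand-in for the tree,
and the names (LargeString, CHUNK_BYTES, char_at, line_start) are all made up
for illustration. It shows appending to a mutable last chunk, a code point
index API, and a custom line index.

```python
import bisect


class LargeString:
    """Hypothetical chunked string; a flat chunk list stands in for the tree."""

    CHUNK_BYTES = 64  # assumed chunk size, kept small for demonstration

    def __init__(self, text=""):
        self.chunks = [bytearray()]  # only the last chunk is appended to
        self.cp_ends = [0]           # cumulative code point count per chunk
        self.line_starts = [0]       # custom index: code point offset of each line
        self.append(text)

    def append(self, text):
        """Append text, filling the last chunk and opening a new one when full."""
        for ch in text:
            enc = ch.encode("utf-8")
            if len(self.chunks[-1]) + len(enc) > self.CHUNK_BYTES:
                self.chunks.append(bytearray())
                self.cp_ends.append(self.cp_ends[-1])
            self.chunks[-1] += enc   # a code point is never split across chunks
            self.cp_ends[-1] += 1
            if ch == "\n":
                self.line_starts.append(self.cp_ends[-1])

    def __len__(self):
        return self.cp_ends[-1]

    def char_at(self, i):
        """Code point index API: binary search for the chunk, then decode it."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect.bisect_right(self.cp_ends, i)
        start = self.cp_ends[k - 1] if k else 0
        return self.chunks[k].decode("utf-8")[i - start]

    def line_start(self, n):
        """Code point offset at which line n (0-based) begins."""
        return self.line_starts[n]
```

Lookup cost here is a binary search over chunks plus decoding one small chunk;
a real tree would give the same shape of cost while also allowing cheap
inserts in the middle, which this sketch does not attempt.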

The internal lib could overload both where appropriate (or maybe even use a
nasty hack on the types for pseudo no-cost inheritance). IMHO mutability is
not really a big issue for large strings, as these often justify a lock if
one is required, as long as the cheap String is used for messages etc.

If such a scheme is adopted, it may be worth considering making the UTF-8
string a value type, as most short strings, e.g. under 8-16 chars (UTF-8
here, so 8-16 bytes for ASCII text), are probably quicker to pass by value
with no heap overhead; break-even is probably around 24-32 chars/bytes. I'm
not sure whether the stack space and lib concerns here (overloading or the
inheritance hack, casting) are worth it, but it would be very good for all
those common short strings.

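
One caveat on the chars = bytes estimate, sketched in Python: it only holds
for ASCII, so any inline capacity has to be counted in UTF-8 bytes rather
than characters. The 16-byte capacity and the helper names below are
assumptions for illustration only, not measured break-even figures.

```python
INLINE_CAPACITY = 16  # hypothetical inline buffer size, in bytes


def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def fits_inline(s: str, capacity: int = INLINE_CAPACITY) -> bool:
    """True if the UTF-8 encoding of s would fit in the inline buffer."""
    return utf8_len(s) <= capacity


print(utf8_len("bitc-dev"), fits_inline("bitc-dev"))          # 8 True
print(utf8_len("日本語日本語"), fits_inline("日本語日本語"))  # 18 False
```

So a six-character CJK string already overflows a 16-byte value type, while
typical short ASCII messages fit comfortably.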
Ben