[bitc-dev] Unicode and bitc

Ben Kloosterman bklooste at gmail.com
Thu Oct 14 20:31:13 PDT 2010


>
 >> So first, I think this is the wrong way to prioritize as a matter of
 >> defaults, but second, I think I've already made it clear that no
 >> either/or choice is actually required. The "stranded string" approach
 >> does all of what you want and more. The O(log n) factor issue is more
 >> than compensated for by the improvement in D-cache and D-TLB
 >> utilization.
 >
 >I wasn't sure you'd be willing to accept the overhead of the built-in
 >string type being a type-class / capsule / interface / whatever, or
 >that you would be comfortable with the default string data type being
 >a more complicated structure (ropes, indexed strings, strings with
 >extents, etc). If they are, the emphasis on the representation working
 >well for all situations is less important to me. If you don't have any
 >reservations about the extra overhead from that abstraction (compared
 >to their C equivalents), then I don't imagine anyone will.

Another option: instead of, say, String and StringBuilder (a .NET class that
builds strings efficiently using an internal array, with a mutable array for
the last part, which is useful for printf-style formatting), you could have
String and LargeString, with String being a lean and mean byte-indexed UTF-8
string and LargeString being the structure discussed above.
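
To make the "byte index" trade-off concrete, here is a minimal Python sketch
of what such a lean String might look like. The class name Utf8String and all
of its methods are hypothetical, not anything BitC defines. It only shows that
byte-offset access is a plain array read, while finding the k-th code point
needs a scan.

```python
class Utf8String:
    """Hypothetical lean string: immutable UTF-8 bytes, indexed by byte offset."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")

    def byte_len(self) -> int:
        return len(self._data)

    def byte_at(self, i: int) -> int:
        # O(1): plain array access, no decoding involved.
        return self._data[i]

    def code_point_at_byte(self, i: int) -> str:
        """Decode the code point that starts at byte offset i."""
        first = self._data[i]
        if first < 0x80:
            n = 1                      # 0xxxxxxx: ASCII
        elif first >> 5 == 0b110:
            n = 2                      # 110xxxxx: 2-byte sequence
        elif first >> 4 == 0b1110:
            n = 3                      # 1110xxxx: 3-byte sequence
        elif first >> 3 == 0b11110:
            n = 4                      # 11110xxx: 4-byte sequence
        else:
            raise ValueError("offset is not at a code point boundary")
        return self._data[i:i + n].decode("utf-8")

    def byte_offset_of_code_point(self, k: int) -> int:
        """O(n) scan: continuation bytes (10xxxxxx) do not start a code point."""
        count = 0
        for off, b in enumerate(self._data):
            if b & 0xC0 != 0x80:
                if count == k:
                    return off
                count += 1
        raise IndexError(k)
```

The O(n) scan in byte_offset_of_code_point is the cost that a char/point
index API would pay on this type, which is the case LargeString is meant to
cover.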

LargeString

- Uses a tree.
- Easily supports custom indexes, such as lines.
- Mutable support for appending to the last array.
- Efficient backward compatibility with char/code-point index APIs.
- The tree could hold all UTF-16, all UCS-4, or mixed chunks, depending on a
  mode; e.g. for memory conservation the default is mixed, but for interop
  you can set the representation to UCS-2 or UCS-4.
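
A rough Python sketch of the first four points, under stated assumptions: the
names LargeString, CHUNK_LIMIT, char_at and line_start are made up for
illustration, a Python str stands in for whatever encoding each chunk would
really use, and a flat array of chunks with prefix counts searched by bisect
stands in for a real balanced tree. Lookups are still O(log n) in the number
of chunks, and only the last chunk is mutable.

```python
import bisect

CHUNK_LIMIT = 8  # hypothetical; tiny so examples split into several chunks


class LargeString:
    def __init__(self, chunk_limit=CHUNK_LIMIT):
        self._limit = chunk_limit
        self._sealed = []       # full, immutable chunks
        self._char_start = []   # code points before each sealed chunk
        self._line_start = []   # newlines before each sealed chunk
        self._sealed_chars = 0
        self._sealed_lines = 0
        self._tail = []         # mutable last chunk (list of 1-char strings)

    def append(self, text):
        # Only the last chunk is ever mutated; sealed chunks never change.
        for ch in text:
            self._tail.append(ch)
            if len(self._tail) == self._limit:
                self._seal()

    def _seal(self):
        chunk = "".join(self._tail)
        self._char_start.append(self._sealed_chars)
        self._line_start.append(self._sealed_lines)
        self._sealed.append(chunk)
        self._sealed_chars += len(chunk)
        self._sealed_lines += chunk.count("\n")
        self._tail = []

    def __len__(self):
        return self._sealed_chars + len(self._tail)

    def char_at(self, i):
        """Code point at index i: binary search for the chunk, then offset."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        if i >= self._sealed_chars:
            return self._tail[i - self._sealed_chars]
        c = bisect.bisect_right(self._char_start, i) - 1
        return self._sealed[c][i - self._char_start[c]]

    def line_start(self, k):
        """Code point index at which line k (0-based) begins."""
        if k == 0:
            return 0
        if k > self._sealed_lines:
            # The k-th newline, if it exists, is in the mutable tail.
            chunk, base = self._tail, self._sealed_chars
            skip = k - self._sealed_lines
        else:
            # Last sealed chunk with fewer than k newlines before it.
            c = bisect.bisect_left(self._line_start, k) - 1
            chunk, base = self._sealed[c], self._char_start[c]
            skip = k - self._line_start[c]
        for off, ch in enumerate(chunk):
            if ch == "\n":
                skip -= 1
                if skip == 0:
                    return base + off + 1
        raise IndexError(k)
```

The line index is just a second prefix count kept next to the code point
count, so other custom indexes could be added the same way. The per-chunk
encoding mode (mixed, UCS-2, UCS-4) is left out of this sketch.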

The internal library could overload on both types where appropriate (or maybe
even use a nasty hack on the types for pseudo no-cost inheritance). IMHO
mutability is not really a big issue for large strings, since these often
justify a lock where one is required, as long as the cheap String is used for
messages etc.

If such a scheme is adopted, it may be worth considering making the UTF-8
String a value type, since most short strings, e.g. under 8-16 chars (UTF-8
here, so 8-16 bytes for ASCII text), are probably quicker to pass by value
with no heap overhead; break-even is probably around 24-32 chars/bytes. I'm
not sure whether the stack space and library concerns here (overloading or
the inheritance hack, casting) are worth it, but it would be very good for
all those common short strings.

Ben


