[bitc-dev] Unicode and bitc
bklooste at gmail.com
Fri Oct 15 06:00:54 PDT 2010
>> I think the memory overhead is too bad in business application and DB
>> land... but it is a good candidate for the worked-on fixedchararrays.
>> I'm not viewing it from the point of view of Unix apps reading their
>> input and then finishing, but more from memory-hungry business apps,
>> app servers, SOA servers, DBs etc. where the strings stay in memory for
>> a long time while not used.
>I think this is in most cases caused by poor design which in turn is
>caused by the relative expenses of buying more hardware and improving
Good design is expensive... a runtime that optimizes correctly buys back developer time. Witness the use of string indexers everywhere, which requires lots of string compares.
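To illustrate the point (a hypothetical sketch, not BitC code): looking a column up by name on every access pays a string compare per row, whereas resolving the name to an integer index once up front avoids the repeated compares entirely.

```python
# Hypothetical illustration: per-row lookup by column name forces
# repeated string comparisons; resolving the name to an index once
# does the comparison a single time.
rows = [("alice", 30), ("bob", 25)]
columns = ["name", "age"]

# Naive: columns.index("age") inside the loop compares strings per row.
ages_slow = [row[columns.index("age")] for row in rows]

# Better: resolve the string key to an integer index once, up front.
age_idx = columns.index("age")
ages_fast = [row[age_idx] for row in rows]

assert ages_slow == ages_fast == [30, 25]
```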
>The native encodings for CJK include characters not in Unicode which
>may or may not be added to Unicode eventually.
>There are also encodings which include character variants to
>accurately represent ancient Chinese texts, for example, which are
>likely not going to be folded into Unicode.
>So you should be prepared for situations when text in an external
>encoding cannot be completely converted into the internal encoding.
CJK is a mess, and attempting to fold together the common traditional characters used in Japan, Hong Kong, Taiwan and Korea was a huge mistake that has led to the slow adoption of Unicode there. The #1 issue, as has been mentioned, is that the representation is not 1:1.
The ancient Chinese forms, including oracle bone script, are now in Unicode. What you do need to be prepared for is that the character set changes...
All 70,000 Simplified characters are in Unicode, though the set does change every year.
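The conversion failures mentioned above are easy to demonstrate (a Python sketch, relying on CPython's codec tables): a character that is perfectly valid in Unicode may have no code point in a legacy CJK encoding, so conversion to an external encoding can fail and has to be handled.

```python
# U+20000 is a CJK Extension B ideograph: present in Unicode, but
# absent from the legacy Shift-JIS encoding, so converting out to
# that external encoding fails.
ch = chr(0x20000)

try:
    ch.encode("shift_jis")
    round_trippable = True
except UnicodeEncodeError:
    round_trippable = False

assert not round_trippable
# Characters both sets share round-trip fine:
assert "漢".encode("shift_jis").decode("shift_jis") == "漢"
```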
>> So you're okay with reducing the D-cache and D-TLB performance on
>> large-scale programs, and therefore their overall performance, by a
>> factor of >4? That seems a bit over-purist to me.
>> So first, I think this is the wrong way to prioritize as a matter of
>> defaults, but second, I think I've already made it clear that no
>> choice is actually required. The "stranded string" approach does all
>> you want and more. The O(log n) factor issue is more than compensated
>> by the improvement in D-cache and D-TLB utilization.
>By calling for a more complex implementation you are reducing the
>I-cache and I-TLB (which may or may not be separate from data).
>I can't say which is more important in which situation without tedious
>analysis of running actual programs on actual hardware, though.
I don't think the strand representation will use a lot of code (though it does need a lot of thought and tuning).
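A minimal sketch of the strand (rope) idea, in Python for brevity; the class and function names here are hypothetical, not BitC's actual design. Small strings stay flat and cache-friendly, large concatenations become tree nodes, and indexing walks the tree in O(depth) time.

```python
# Minimal rope/strand sketch (hypothetical): flat leaves for small
# strings, Concat tree nodes for large ones.
class Leaf:
    def __init__(self, s):
        self.s, self.length = s, len(s)
    def char_at(self, i):
        return self.s[i]

class Concat:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length
    def char_at(self, i):
        # descend into whichever side holds index i
        if i < self.left.length:
            return self.left.char_at(i)
        return self.right.char_at(i - self.left.length)

FLAT_LIMIT = 32  # below this, copying into a flat leaf is cheaper

def concat(a, b):
    # keep frequently used small strings flat
    if isinstance(a, Leaf) and isinstance(b, Leaf) and a.length + b.length <= FLAT_LIMIT:
        return Leaf(a.s + b.s)
    return Concat(a, b)

small = concat(Leaf("hello, "), Leaf("world"))
assert isinstance(small, Leaf)      # small result stays flat
big = concat(Leaf("a" * 100), Leaf("b" * 100))
assert big.char_at(150) == "b"      # indexed through the tree
```

Most of the tuning effort would go into choosing the flattening threshold and rebalancing policy, not into code volume.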
>> 2. The problems we are trying to solve (GC, O(N)) apply only to large
>> strings, so why pay the price for frequently used small strings? A
>> coarser approach may fit better, and the big string can solve a number
>> of other problems.
>Since *any* reference takes 8 bytes (the size of a pointer) I don't
>see an empty string taking 8 bytes as an issue.
An empty string won't be 8 bytes; you are looking at the reference to the string + the object overhead + 2 internal null pointers, for an increase of 16 bytes.
Even worse, when not using nullable references you would have a string array initialized to empty strings, which could be nasty for a multi-dimensional array (e.g. data tables, SQL readers, etc.). Anyway, empty strings by themselves are not a huge issue, but small strings in general are, especially as they are very frequent.
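The overhead is easy to measure. CPython's numbers differ from what BitC would see, but the shape of the problem is the same: per-object bookkeeping dwarfs the payload of a small string.

```python
import sys

# An empty string costs far more than the 8 bytes of a bare pointer:
assert sys.getsizeof("") > 8

# For a one-character string, the header/bookkeeping bytes beyond the
# single byte of payload still exceed a pointer's worth of memory.
overhead = sys.getsizeof("a") - 1
assert overhead > 8
```

Multiply that per-string overhead by the millions of small strings a data table or SQL reader holds and the cost becomes significant.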