[bitc-dev] Newline conventions
David Hopwood
david.nospam.hopwood at blueyonder.co.uk
Sat Feb 18 22:43:33 EST 2006
Jonathan S. Shapiro wrote:
> David:
>
> I'm not completely ignorant about UNICODE, but I definitely haven't
> followed it closely, and I would really appreciate a confirm on my
> understanding of the implications of your mail. Can you answer yes or no
> (or expand, as appropriate) on each of the following statements:
>
> 1. There exist UNICODE code points above 65536, and this has presented
> some problems for both Java and C#, because their internal 'char'
> datatype was selected to be 16 bits.
Yes. OTOH, it is quite feasible to use UTF-16 as an internal encoding,
as demonstrated by the ICU libraries. It's also feasible to use UTF-8 as
an internal encoding, which avoids the need to convert for I/O (since
most important external protocols are defined to use UTF-8).

The important thing is to choose one of these and stick with it.
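To make the trade-off concrete, here is a minimal sketch (Python, standard library only) of how one code point above U+FFFF comes out in each encoding form. The choice of U+1D11E and the helper name `surrogate_pair` are illustrative assumptions, not anything taken from BitC, Java, C# or ICU.

```python
# Illustrative sketch: one supplementary code point in three encoding forms.
cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, chosen only as an example above U+FFFF
s = chr(cp)

utf8 = s.encode("utf-8")       # variable length, 1 to 4 bytes per code point
utf16 = s.encode("utf-16-be")  # 2 bytes, or 4 bytes as a surrogate pair
utf32 = s.encode("utf-32-be")  # always 4 bytes per code point


def surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 surrogates.

    Hypothetical helper for illustration; it mirrors the arithmetic that a
    16-bit 'char' implementation has to perform for such code points.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = code_point - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(cp)
print(len(utf8), len(utf16), len(utf32))
print(hex(high), hex(low))
```

For this code point all three forms take 4 bytes. A single 16-bit unit cannot hold it, which is the difficulty a 16-bit 'char' type runs into.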
> It is still desirable for BitC to use 32-bit characters.
If you mean characters up to U+10FFFF, yes.

The UTF-32 encoding form is highly inefficient, even as an internal
encoding, because of the extra memory (and memory bandwidth) it requires.
The advantage of fixed-length encoding of code points is less than might
be expected, because encodings of Unicode "abstract characters" are always
potentially variable-length due to combining sequences.
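The combining-sequence point can be sketched in a few lines of Python, using the standard `unicodedata` module. The example strings are my own choice for illustration.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one abstract character to a reader,
# but two code points, so 8 bytes even in fixed-width UTF-32.
decomposed = "e\u0301"
utf32_size = len(decomposed.encode("utf-32-be"))

# NFC normalization happens to have a precomposed form for this one (U+00E9).
composed = unicodedata.normalize("NFC", decomposed)

# Not every sequence has one: 'q' plus the same accent stays two code points
# after normalization.
no_precomposed = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), utf32_size, len(composed), len(no_precomposed))
```

So indexing by code point, which is all UTF-32 buys, still does not give indexing by abstract character.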
> 2. Resolution M38.6 removes the private use codepoints [above U+10FFFF]
> from ISO/IEC 10646-1. This means that the highest legal code point is
> representable as a 4-byte UTF-8 sequence.
Yes. It also effectively removes the possibility that those codepoints
will be assigned in future (barring some complete redesign of the encoding
forms).
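As a quick check of the 4-byte claim, a sketch in Python (standard library only; the constant name `MAX_CODE_POINT` is just a label I picked):

```python
MAX_CODE_POINT = 0x10FFFF

# The highest code point encodes to exactly four UTF-8 bytes.
encoded = chr(MAX_CODE_POINT).encode("utf-8")
print(len(encoded), encoded.hex())

# Python's own str type uses the same upper bound: one past it is rejected.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_limit_rejected = False
except ValueError:
    beyond_limit_rejected = True
print(beyond_limit_rejected)
```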
--
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>