[bitc-dev] Newline conventions

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Sat Feb 18 22:43:33 EST 2006


Jonathan S. Shapiro wrote:
> David:
> 
> I'm not completely ignorant about UNICODE, but I definitely haven't
> followed it closely, and I would really appreciate a confirm on my
> understanding of the implications of your mail. Can you answer yes or no
> (or expand, as appropriate) on each of the following statements:
> 
> 1. There exist UNICODE code points above 65536, and this has presented
> some problems for both Java and C#, because their internal 'char'
> datatype was selected to be 16 bits.

Yes. OTOH, it is quite feasible to use UTF-16 as an internal encoding,
as demonstrated by the ICU libraries. It's also feasible to use UTF-8 as
an internal encoding, which avoids the need to convert for I/O (since
most important external protocols are defined to use UTF-8).

The important thing is to choose one of these and stick with it.
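For concreteness, here is a minimal sketch in C (illustrative only, not
taken from ICU or any particular implementation) of how a UTF-16
internal encoding represents code points above U+FFFF as surrogate
pairs:

  #include <stddef.h>
  #include <stdint.h>

  /* Decode one code point from UTF-16 (sketch; assumes well-formed
     input, so unpaired surrogates and buffer bounds are not checked). */
  uint32_t utf16_decode(const uint16_t *s, size_t *consumed)
  {
      uint16_t hi = s[0];
      if (hi >= 0xD800 && hi <= 0xDBFF) {
          /* High surrogate: combine with the following low surrogate. */
          uint16_t lo = s[1];
          *consumed = 2;
          return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                            | (uint32_t)(lo - 0xDC00));
      }
      *consumed = 1;
      return hi;
  }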

> It is still desirable for BitC to use 32-bit characters.

If you mean characters up to U+10FFFF, yes.

The UTF-32 encoding form is highly inefficient, even as an internal
encoding, because of the extra memory (and memory bandwidth) it requires.
The advantage of fixed-length encoding of code points is less than might
be expected, because encodings of Unicode "abstract characters" are always
potentially variable-length due to combining sequences.
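To make the combining-sequence point concrete: even in UTF-32, where
every code point is a single 32-bit unit, one user-perceived character
can span several units. For example, both of the following encode "é":

  #include <stdint.h>

  /* One abstract character, two possible UTF-32 encodings. */
  static const uint32_t precomposed[] =
      { 0x00E9 };           /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
  static const uint32_t decomposed[] =
      { 0x0065, 0x0301 };   /* U+0065 'e' + U+0301 COMBINING ACUTE ACCENT */

So iterating "one character at a time" still requires looking at
combining sequences, whichever encoding form is used internally.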

> 2. Resolution M38.6 removes the private use codepoints [above U+10FFFF]
> from ISO/IEC 10646-1. This means that the highest legal code point is
> representable as a 4-byte UTF-8 sequence.

Yes. It also effectively removes the possibility that those code points
will be assigned in future (barring some complete redesign of the
encoding forms).
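As a sketch of why four bytes suffice: a minimal UTF-8 encoder for code
points up to U+10FFFF might look like this (illustrative only; it omits
the check that rejects surrogate code points U+D800..U+DFFF):

  #include <stddef.h>
  #include <stdint.h>

  /* Encode one code point (<= U+10FFFF) as UTF-8; returns the byte
     count written to out[], which must hold at least 4 bytes. */
  size_t utf8_encode(uint32_t cp, uint8_t *out)
  {
      if (cp < 0x80) {            /* 0xxxxxxx */
          out[0] = (uint8_t)cp;
          return 1;
      } else if (cp < 0x800) {    /* 110xxxxx 10xxxxxx */
          out[0] = (uint8_t)(0xC0 | (cp >> 6));
          out[1] = (uint8_t)(0x80 | (cp & 0x3F));
          return 2;
      } else if (cp < 0x10000) {  /* 1110xxxx 10xxxxxx 10xxxxxx */
          out[0] = (uint8_t)(0xE0 | (cp >> 12));
          out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
          out[2] = (uint8_t)(0x80 | (cp & 0x3F));
          return 3;
      } else {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
          out[0] = (uint8_t)(0xF0 | (cp >> 18));
          out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
          out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
          out[3] = (uint8_t)(0x80 | (cp & 0x3F));
          return 4;
      }
  }

U+10FFFF encodes as F4 8F BF BF, exactly four bytes; once the code
points above U+10FFFF are removed, a fifth byte is never needed.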

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



More information about the bitc-dev mailing list