[bitc-dev] Newline conventions
Jonathan S. Shapiro
shap at eros-os.org
Sat Feb 18 23:22:11 EST 2006
On Sun, 2006-02-19 at 03:43 +0000, David Hopwood wrote:
> Yes. OTOH, it is quite feasible to use UTF-16 as an internal encoding,
> as demonstrated by the ICU libraries. It's also feasible to use UTF-8 as
> an internal encoding, which avoids the need to convert for I/O (since
> most important external protocols are defined to use UTF-8).
>
> The important thing is to choose one of these and stick with it.
It is definitely feasible to use UTF-16 for *string* encoding, but you
still need a 32-bit encoding for individual characters, because code
points above U+FFFF do not fit in a single 16-bit unit.
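To make the point about character width concrete, here is a minimal sketch in Python (not BitC) of how UTF-16 splits a code point above U+FFFF into a surrogate pair. The function name `utf16_code_units` is hypothetical and exists only for illustration.

```python
def utf16_code_units(code_point):
    """Encode one Unicode scalar value as a list of UTF-16 code units.

    Hypothetical helper for illustration only.
    """
    if not (0 <= code_point <= 0x10FFFF) or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    # Supplementary planes: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    v = code_point - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


print([hex(u) for u in utf16_code_units(0x41)])
print([hex(u) for u in utf16_code_units(0x1D11E)])  # MUSICAL SYMBOL G CLEF
```

A single character value therefore needs at least 21 bits, which in practice means a 32-bit type even when strings are stored as UTF-16.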
Once you choose a string encoding that doesn't provide a linear offset
from start of string to character position, you're committed to a
complex internal representation for strings if you want any sort of
indexing efficiency. At that point, as you say, it matters less and less
what your internal representation is.
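As a rough illustration of the indexing cost, the sketch below (Python, assuming well-formed UTF-8 input) finds the byte offset of the n-th code point by scanning from the start of the string. The name `byte_offset_of` is made up for this example, and a production library would likely add some auxiliary index structure rather than scan every time.

```python
def byte_offset_of(data, index):
    """Return the byte offset of code point number `index` in UTF-8 `data`.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    The scan is linear in the length of the string.
    """
    if index < 0:
        raise IndexError(index)
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 10xxxxxx; anything else
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


# Code points of 1, 2, 3, 4 and 1 bytes respectively.
sample = "a\u00e9\u20ac\U0001d11ez".encode("utf-8")
print(byte_offset_of(sample, 4))
```

With a fixed-width encoding the same lookup is a single multiplication, which is the trade-off being discussed.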
BitC is specified (or at least, it *should* be) to use UTF-8 encoding
for source input, and I plan to use this for the default I/O library as
well. This certainly doesn't stop someone from writing a different I/O
library for another encoding at some point.
> The UTF-32 encoding form is highly inefficient, even as an internal
> encoding, because of the extra memory (and memory bandwidth) it requires.
I absolutely agree, and this is why I don't plan to use it. The main
(IMHO the *only*) advantage of UTF-32 encoding is that it is really easy
to implement it correctly. This is a really good reason to use it in the
bootstrap compiler, for example, but only as a temporary expedient.
You may have noticed that the BitC spec doesn't state a position on the
*internal* representation of strings. This is because I definitely want
to go to a better encoding in the production libraries.
> The advantage of fixed-length encoding of code points is less than might
> be expected, because encodings of Unicode "abstract characters" are always
> potentially variable-length due to combining sequences.
Yes. I was aware of this.
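For readers who have not run into combining sequences, here is a small Python sketch using the standard `unicodedata` module. The helper `count_base_characters` is hypothetical and only a crude approximation: it skips code points with a non-zero combining class and is not a full grapheme-cluster segmentation.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# one abstract character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)


def count_base_characters(text):
    """Count code points that are not combining marks.

    Hypothetical helper; a rough stand-in for counting abstract
    characters, not a complete segmentation algorithm.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(len(decomposed), len(composed))
print(count_base_characters("e\u0301x"))
```

So even with 32-bit code points, "character number n" is not a fixed offset once combining marks are involved.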
> > 2. Resolution M38.6 removes the private use codepoints [above U+10FFFF]
> > from ISO/IEC 10646-1. This means that the highest legal code point is
> > representable as a 4-byte UTF-8 sequence.
>
> Yes. It also effectively removes the possibility that those codepoints
> will be assigned in future (barring some complete redesign of the encoding
> forms).
Or at least not without re-extending UTF-8, which is my prediction for
what will happen at some point.
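The 4-byte bound can be checked with a short Python sketch. The function `utf8_length` is a hypothetical helper that follows the standard UTF-8 range boundaries; it ignores the surrogate range for brevity.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for `code_point`.

    Hypothetical helper; follows the 4-byte form of UTF-8 and does
    not reject surrogate code points.
    """
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond U+10FFFF: not encodable in 4-byte UTF-8")


print(utf8_length(0x10FFFF))
print(len(chr(0x10FFFF).encode("utf-8")))
```

The earlier definition of UTF-8 allowed sequences of up to 6 bytes, so "re-extending" would mean going back to longer sequences for anything above U+10FFFF.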
shap