[bitc-dev] Code for bitc_vector_string
Jonathan S. Shapiro
shap at eros-os.org
Mon Jul 3 21:08:50 EDT 2006
On Tue, 2006-07-04 at 01:27 +0100, Sam Mason wrote:
> On Mon, Jul 03, 2006 at 08:30:20PM +0100, David Hopwood wrote:
> > Only code points up to 0x10FFFF (and therefore only up to 4-byte UTF-8
> > character encodings) are valid; also the code points reserved for UTF-16
> > surrogates are not valid in UTF-8.
> I'd like to claim innocence and say that I was just copying what was
> already there (mainly in libbitc/stdio.c), but that's probably not a
> useful way of fixing the code, so I'll try to get both of them doing
> the right thing.
Sam: We left the 6-byte stuff in place deliberately, pending a more
careful decision. We need to consolidate this code into a single place
and then fix it once.
> > Please see the conformance requirements in chapter 3 of
> > <http://www.unicode.org/versions/Unicode4.1.0/>, particularly C12a.
> I've never looked at the Unicode standard before, so I may well be
> reading it wrong, but what I see associated with C12a appears to
> relate to the handling of ill-formed "code unit sequences". D36 of
> the same chapter seems to document the valid byte sequences, and I
> would interpret table 3-5 and the accompanying text as saying that
> five- and six-byte sequences are invalid.
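For reference, a decoder that enforces all of these constraints at once
(no 5- or 6-byte forms, no overlong encodings, no UTF-16 surrogates, no
code points above U+10FFFF) might look like the sketch below. This is
only an illustration, not the actual libbitc code; the function name and
interface are made up for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one Unicode scalar value from s (of length len) into *cp.
   Returns the number of bytes consumed, or 0 if the input is
   ill-formed per Unicode chapter 3 (C12a, D36). */
static size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
  /* Minimum scalar value for each sequence length, used to reject
     overlong encodings (e.g. 0xC0 0x80 for U+0000). */
  static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  uint32_t c;
  size_t n, i;

  if (len == 0) return 0;

  if (s[0] < 0x80)      { c = s[0];        n = 1; }
  else if (s[0] < 0xC0) return 0;              /* stray continuation byte */
  else if (s[0] < 0xE0) { c = s[0] & 0x1F; n = 2; }
  else if (s[0] < 0xF0) { c = s[0] & 0x0F; n = 3; }
  else if (s[0] < 0xF8) { c = s[0] & 0x07; n = 4; }
  else return 0;                               /* 5- and 6-byte lead bytes */

  if (len < n) return 0;                       /* truncated sequence */
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;       /* bad continuation byte */
    c = (c << 6) | (s[i] & 0x3F);
  }

  if (c < min_cp[n]) return 0;                 /* overlong encoding */
  if (c >= 0xD800 && c <= 0xDFFF) return 0;    /* UTF-16 surrogates */
  if (c > 0x10FFFF) return 0;                  /* beyond Unicode range */

  *cp = c;
  return n;
}
```

With this shape, the surrogate check catches sequences such as
`ED A0 80` (U+D800), and the range check catches `F4 90 80 80`
(U+110000), both of which are well-formed-looking but invalid.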
> I've rehashed my code and the existing code to accept only one- to
> four-byte sequences, and added an exception to make sure we only
> write out UTF-8 containing these characters. I should probably add
> stricter checking to the two decoding routines so they validate
> their input properly, but it's late!
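The write-out side described above amounts to checking that a value is a
Unicode scalar value before encoding it, so the output can never contain
surrogates or code points beyond U+10FFFF. A minimal sketch of that
check and the corresponding 1- to 4-byte encoder follows; the names are
illustrative, not the actual libbitc routines.

```c
#include <stdint.h>
#include <stddef.h>

/* Is cp a Unicode scalar value (encodable as well-formed UTF-8)? */
static int
utf8_encodable(uint32_t cp)
{
  if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogates */
  return 1;
}

/* Encode cp into buf (at least 4 bytes); returns the number of bytes
   written, or 0 if cp is not a valid scalar value. */
static size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
  if (!utf8_encodable(cp)) return 0;
  if (cp < 0x80) {                             /* 1 byte: ASCII */
    buf[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                            /* 2 bytes */
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                          /* 3 bytes */
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  /* 4 bytes: U+10000..U+10FFFF */
  buf[0] = (unsigned char)(0xF0 | (cp >> 18));
  buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}
```

Raising an exception (as described above) would correspond to the
`utf8_encodable` check failing before any bytes are written.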