[bitc-dev] Code for bitc_vector_string
David Hopwood
david.nospam.hopwood at blueyonder.co.uk
Mon Jul 3 15:30:20 EDT 2006
Sam Mason wrote:
> - size_t len = vec->strlen(s);
> + bitc_word_t len = vec->len;
> + bitc_char_t * ucs4 = vec->elem;
> +
> + bitc_word_t utf8len = 0;
> + for (bitc_word_t i = vec->len-1; i >= 0; i--) {
> + if (ucs4[i] <= 0x7f) {
> + utf8len += 1;
> + } else if (ucs4[i] <= 0x7ff) {
> + utf8len += 2;
> + } else if (ucs4[i] <= 0xffff) {
> + utf8len += 3;
> + } else if (ucs4[i] <= 0x1fffff) {
> + utf8len += 4;
> + } else if (ucs4[i] <= 0x3ffffff) {
> + utf8len += 5;
> + } else if (ucs4[i] <= 0x7fffffff) {
> + utf8len += 6;
> + }
> + }
[...]
Only code points up to 0x10FFFF are valid, so a valid UTF-8 encoding is
at most 4 bytes long and the 5- and 6-byte cases can never occur for
valid input. The code points reserved for UTF-16 surrogates (0xD800 to
0xDFFF) are also not valid in UTF-8.
Please see the conformance requirements in chapter 3 of
<http://www.unicode.org/versions/Unicode4.1.0/>, particularly C12a.
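
A minimal sketch of the length computation with those two restrictions
applied, assuming the elements are 32-bit UCS-4 values. The names
utf8_encoded_len and utf8_total_len are hypothetical and not part of the
BitC runtime; uint32_t and size_t stand in for bitc_char_t and
bitc_word_t, whose exact definitions are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed for one code point,
 * or 0 if the code point may not be encoded in UTF-8 at all. */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* reserved for UTF-16 surrogates */
    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode code space */
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    return 4;                   /* 0x10000 .. 0x10FFFF */
}

/* Hypothetical helper: total UTF-8 length of a UCS-4 array.
 * Returns 0 and stores the length in *out on success, or -1 if any
 * element is not a valid code point (*out is then left untouched). */
int utf8_total_len(const uint32_t *ucs4, size_t len, size_t *out)
{
    size_t total = 0;

    assert(out != NULL);
    /* Counting upward avoids depending on the signedness of the index. */
    for (size_t i = 0; i < len; i++) {
        size_t n = utf8_encoded_len(ucs4[i]);
        if (n == 0)
            return -1;          /* reject rather than guess a length */
        total += n;
    }
    *out = total;
    return 0;
}
```

Rejecting invalid input is only one possible policy; a caller could
instead substitute U+FFFD (3 bytes in UTF-8) for each invalid element,
which is a design choice this sketch does not make.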
--
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>