[bitc-dev] Code for bitc_vector_string

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Mon Jul 3 15:30:20 EDT 2006


Sam Mason wrote:
> -  size_t len = vec->strlen(s);
> +  bitc_word_t   len  = vec->len;
> +  bitc_char_t * ucs4 = vec->elem;
> +  
> +  bitc_word_t utf8len = 0;
> +  for (bitc_word_t i = vec->len-1; i >= 0; i--) {
> +    if (ucs4[i] <= 0x7f) {
> +      utf8len += 1;
> +    } else if (ucs4[i] <= 0x7ff) {
> +      utf8len += 2;
> +    } else if (ucs4[i] <= 0xffff) {
> +      utf8len += 3;
> +    } else if (ucs4[i] <= 0x1fffff) {
> +      utf8len += 4;
> +    } else if (ucs4[i] <= 0x3ffffff) {
> +      utf8len += 5;
> +    } else if (ucs4[i] <= 0x7fffffff) {
> +      utf8len += 6;
> +    }
> +  }
[...]

Only code points up to 0x10FFFF are valid, so a well-formed UTF-8 encoding
is never longer than 4 bytes; the 5- and 6-byte cases above cannot apply to
valid input. The code points reserved for UTF-16 surrogates (0xD800 to
0xDFFF) are also not valid in UTF-8.

Please see the conformance requirements in chapter 3 of
<http://www.unicode.org/versions/Unicode4.1.0/>, particularly C12a.
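
As a rough illustration of the two restrictions above, here is a minimal
sketch of a length calculation that rejects surrogates and anything beyond
0x10FFFF instead of counting 5- and 6-byte forms. The names utf8_encoded_len
and utf8_total_len are hypothetical and not part of the patch, and uint32_t
and size_t stand in for bitc_char_t and bitc_word_t, whose definitions are
assumed here. The loop counts upward, which also avoids relying on `i >= 0`
becoming false should the index type be unsigned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed for one code point,
   or 0 if the code point must not be encoded at all. */
static size_t utf8_encoded_len(uint32_t cp)
{
  if (cp <= 0x7f) return 1;
  if (cp <= 0x7ff) return 2;
  if (cp >= 0xd800 && cp <= 0xdfff) return 0;  /* UTF-16 surrogates */
  if (cp <= 0xffff) return 3;
  if (cp <= 0x10ffff) return 4;
  return 0;                                    /* beyond the code space */
}

/* Hypothetical helper: sums the encoded lengths of a UCS-4 buffer.
   Returns 0 and stores the total in *out on success, or -1 (leaving
   *out untouched) if any element is not an encodable code point. */
static int utf8_total_len(const uint32_t *ucs4, size_t len, size_t *out)
{
  size_t total = 0;
  for (size_t i = 0; i < len; i++) {
    size_t n = utf8_encoded_len(ucs4[i]);
    if (n == 0) return -1;
    total += n;
  }
  *out = total;
  return 0;
}
```

Whether an invalid element should abort the conversion, as in this sketch,
or be replaced with U+FFFD is a policy decision for the caller; the sketch
only shows where the check would go.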

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



