[bitc-dev] Newline conventions
David Hopwood
david.nospam.hopwood at blueyonder.co.uk
Sat Feb 18 20:38:03 EST 2006
Jonathan S. Shapiro wrote:
> On Sat, 2006-02-18 at 11:39 -0500, Kevin Reid wrote:
>
>>Perl's one-byte escapes \xHH and \OOO parse as many digits as they
>>find up to the maximum for their type (two and three, respectively).
>>The Unicode equivalent would be six hexadecimal digits (U+10FFFF) and
>>my test string would be written as "\U+00202202".
>
> The problem with any non-delimited sequence is that UTF-8 keeps creeping
> upwards. Originally it was a max of 4 bytes, but with the latest 10646
> encodings it might hypothetically go to 6 bytes.
No, UTF-8 has "crept" *downwards*. It was originally up to 6 bytes, but has
been restricted to 4 bytes (in both Unicode and ISO/IEC 10646) so that
well-formed UTF-8 encodes an identical range of code points to UTF-16.
This change is reflected in the IETF RFCs for UTF-8: RFC 2279 allowed
sequences of up to 6 bytes, and RFC 3629, which obsoletes it, allows at most 4.
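
To make the difference concrete, here is a rough Python sketch of the sequence
lengths under the two RFCs. The function names are made up for illustration;
the thresholds are the ones from the encoding tables in RFC 2279 and RFC 3629.

```python
# Hypothetical helper names; only the numeric ranges come from the RFCs.

def utf8_len_rfc2279(cp: int) -> int:
    """Sequence length under the original scheme (up to 6 bytes, 31 bits)."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    # Exclusive upper bound of the range covered by 1, 2, ... 6 bytes.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for nbytes, limit in enumerate(limits, start=1):
        if cp < limit:
            return nbytes
    raise AssertionError("unreachable")


def utf8_len_rfc3629(cp: int) -> int:
    """Sequence length under the restricted scheme (at most 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("beyond U+10FFFF: not encodable in well-formed UTF-8")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    return len(chr(cp).encode("utf-8"))
```

Under the old scheme 0x7FFFFFFF needs 6 bytes; under the current one the
largest code point, U+10FFFF, needs 4 and anything above it is rejected.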
UTF-32 and UCS-4 have also changed similarly. From
<http://www.unicode.org/reports/tr19/tr19-9.html>:
# 3 Relation to ISO/IEC 10646 and UCS-4
#
# ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since UTF-32 is
# simply a subset of UCS-4 characters, it is conformant to ISO/IEC 10646 as well
# as to the Unicode Standard.
#
# As of the recent publication of the second edition of ISO/IEC 10646-1, UCS-4
# still assigns private use codepoints (E00000_16..FFFFFF_16 and
# 60000000_16..7FFFFFFF_16) that are not in the range of valid Unicode codepoints.
# To promote interoperability among the Unicode encoding forms JTC1/SC2/WG2 has
# approved a motion removing those private use assignments:
#
# Resolution M38.6 (Restriction of encoding space) [adopted unanimously]
#
# "WG2 accepts the proposal in document N2175 towards removing the provision
# for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to ensure
# internal consistency in the standard between UCS-4, UTF-8 and UTF-16 encoding
# formats, and instructs its project editor [to] prepare suitable text for
# processing as a future Technical Corrigendum or an Amendment to 10646-1:2000."
(also in <http://anubis.dkuug.dk/JTC1/SC2/open/02n3422.pdf>).
Note that this is really a restriction of the encoding space, not just an
unassignment of some characters.
>>SGML-style: "\U+2022;02"
This seems the most readable option to me.
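
As a sketch of why the delimiter helps, here is how a lexer might expand such
escapes. This is Python for illustration only; the `\U+...;` syntax is just the
proposal quoted above, not settled BitC syntax, and `expand_escapes` is a
made-up name. I am also assuming 1 to 6 hex digits are allowed.

```python
import re

# Assumed form: backslash, "U+", 1 to 6 hex digits, then a terminating ";".
_ESCAPE = re.compile(r"\\U\+([0-9A-Fa-f]{1,6});")


def expand_escapes(text: str) -> str:
    """Replace each delimited escape with the code point it names."""
    def replace(match: re.Match) -> str:
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {match.group(0)}")
        return chr(cp)

    return _ESCAPE.sub(replace, text)
```

With this, `expand_escapes(r"\U+2022;02")` yields the bullet character U+2022
followed by the two literal characters "02", with no need to pad the escape to
a fixed width.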
--
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>