[bitc-dev] Newline conventions
Jonathan S. Shapiro
shap at eros-os.org
Sat Feb 18 12:06:43 EST 2006
On Sat, 2006-02-18 at 11:39 -0500, Kevin Reid wrote:
> Perl's one-byte escapes \xHH and \OOO parse as many digits as they
> find up to the maximum for their type (two and three, respectively).
> The Unicode equivalent would be six hexadecimal digits (U+10FFFF) and
> my test string would be written as "\U+00202202".
The problem with any non-delimited sequence is that UTF-8 keeps creeping
upwards. Originally it was a max of 4 bytes, but with the latest 10646
encodings it might hypothetically go to 6 bytes. There is room in the
encoding to go to 7 bytes without breaking existing encodings.
Given this, I think that any escape sequence needs to be delimited. The
problem I foresee is that
"\U+aabb23"
might either be the three character sequence
U+aabb 2 3
or the 6-octet character encoding
U+aabb23
depending on the value of 'aa'. The UTF-8 encoding standard ensures that
it is unambiguous from the lexer's point of view, and also ensures that
existing uses will remain unviolated, but from the *human* point of view
it's awful.
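
To make the human-readability problem concrete, here is a minimal Python
sketch of a greedy, non-delimited parse. The function name
parse_undelimited and the max_digits parameter are hypothetical; the
point is only that the same text splits differently as the permitted
digit count grows.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_undelimited(body: str, max_digits: int) -> tuple[int, str]:
    # Greedily consume up to max_digits hex digits from the start of body.
    # Returns the numeric value read and whatever text is left over.
    count = 0
    while (count < len(body) and count < max_digits
           and body[count] in HEX_DIGITS):
        count += 1
    if count == 0:
        raise ValueError("expected at least one hex digit")
    return int(body[:count], 16), body[count:]


# The same escape body, read under two different digit limits.
short_read = parse_undelimited("aabb23", 4)   # value 0xAABB, leftover "23"
long_read = parse_undelimited("aabb23", 6)    # value 0xAABB23, no leftover
```

With a limit of four digits the trailing "23" is ordinary text; with a
limit of six it is swallowed into the escape.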
So for escapes I'm currently leaning toward:
\{U+xxx...xxx}
There are two advantages over your alternative:
\U+(xxx...xxx)
The "U+xxx" form looks a little more like the standard way of writing
these things, and it fits more nicely with the other existing escapes.
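
As a rough illustration of how a lexer might handle the braced form,
here is a small Python sketch. The names ESCAPE, MAX_CODE_POINT, and
expand_escapes are made up for this example, and the U+10FFFF cap is an
assumption rather than anything decided for BitC.

```python
import re

# Hypothetical pattern for the delimited escape \{U+xxx...xxx}:
# one or more hex digits, ended explicitly by the closing brace.
ESCAPE = re.compile(r"\\\{U\+([0-9A-Fa-f]+)\}")

# Assumed upper bound; raising it later does not change how text is split.
MAX_CODE_POINT = 0x10FFFF


def expand_escapes(text: str) -> str:
    # Replace every delimited escape in text with the character it names.
    def replace(match):
        value = int(match.group(1), 16)
        if value > MAX_CODE_POINT:
            raise ValueError(f"code point out of range: U+{value:X}")
        return chr(value)

    return ESCAPE.sub(replace, text)
```

Because the closing brace ends the escape, expand_escapes(r"\{U+2022}02")
gives U+2022 followed by the literal text "02", whatever the maximum
digit count is.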
> The N-Triples language defined for the W3C's RDF Test Cases uses a
> specific syntax which is a subset of Python's and is specified to
> have exactly one way to encode any particular character.
> <http://www.w3.org/TR/rdf-testcases/#ntrip_strings>
Unicode already requires that the shortest legal UTF-8 encoding must be
used, so the encoding is already unambiguous.
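
The shortest-form rule can be seen with Python's built-in UTF-8 decoder,
which rejects over-long sequences. This is only a sketch; is_valid_utf8
and the two sample constants are names invented for the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    # True if data decodes as UTF-8 under Python's strict decoder.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


SHORTEST = b"\x2f"      # '/' in its one-byte form
OVERLONG = b"\xc0\xaf"  # the same value padded into two bytes; not legal
```

Here is_valid_utf8(SHORTEST) is true and is_valid_utf8(OVERLONG) is
false, so each character has one accepted byte sequence.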
> SGML-style: "\U+2022;02"
Yes, that would be another possible delimiting. I don't think I have a
strong preference.
shap