[bitc-dev] Newline conventions

Jonathan S. Shapiro shap at eros-os.org
Sat Feb 18 12:06:43 EST 2006


On Sat, 2006-02-18 at 11:39 -0500, Kevin Reid wrote:
> Perl's one-byte escapes \xHH and \OOO parse as many digits as they  
> find up to the maximum for their type (two and three, respectively).  
> The Unicode equivalent would be six hexadecimal digits (U+10FFFF) and  
> my test string would be written as "\U+00202202".

The problem with any non-delimited sequence is that UTF-8 keeps creeping
upwards. Originally it was a max of 4 bytes, but with the latest 10646
encodings it might hypothetically go to 6 bytes. There is room in the
encoding to go to 7 bytes without breaking existing encodings.

Given this, I think that any escape sequence needs to be delimited. The
problem I foresee is that 

	"\U+aabb23"

might either be the three character sequence

	U+aabb 2 3

or the 6-octet character encoding

	U+aabb23

depending on the value of 'aa'. The UTF-8 encoding standard ensures that
it is unambiguous from the lexer's point of view, and also ensures that
existing uses will remain valid, but from the *human* point of view
it's awful.
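
To make the human-readability problem concrete, here is a small Python
sketch. It is not BitC code: the helper name and the two width limits are
assumptions chosen purely for illustration. It reads the same undelimited
escape body under two different maximum widths and gets two different
answers:

```python
import re

def split_undelimited(body, max_digits):
    """Greedily take up to max_digits hex digits as the code point;
    whatever is left over is ordinary text. Hypothetical helper."""
    m = re.match(r"[0-9A-Fa-f]{1,%d}" % max_digits, body)
    if m is None:
        raise ValueError("no hex digits after the escape prefix")
    return int(m.group(0), 16), body[m.end():]

# The same source text, read under two different width rules.
four = split_undelimited("aabb23", 4)   # code point 0xaabb, then "23"
six = split_undelimited("aabb23", 6)    # code point 0xaabb23, nothing left
print(hex(four[0]), repr(four[1]))
print(hex(six[0]), repr(six[1]))
```

A reader looking at the source text has no way to tell which of the two
readings applies without knowing the width rule in force.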

So for escapes I'm currently leaning toward:

	\{U+xxx...xxx}

There are two advantages over your alternative:

	\U+(xxx...xxx)

The "U+xxx" form looks a little more like the standard way of writing
these things, and it fits more nicely with the other existing escapes.
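
As a rough sketch of how a brace-delimited escape might be recognized, the
Python below expands the \{U+xxx...xxx} form. The regular expression and
the function name are assumptions made for illustration, not a proposed
BitC lexer rule:

```python
import re

# Hypothetical pattern for the \{U+xxx...xxx} form: any number of hex
# digits, terminated explicitly by the closing brace.
ESCAPE = re.compile(r"\\\{U\+([0-9A-Fa-f]+)\}")

def expand_escapes(text):
    """Replace each delimited escape with the character it names."""
    def repl(m):
        # chr rejects values above U+10FFFF.
        return chr(int(m.group(1), 16))
    return ESCAPE.sub(repl, text)

# The closing brace ends the escape, so the trailing "02" stays literal.
result = expand_escapes(r"\{U+2022}02")
print(result == "\u2022" + "02")
```

Because the closing brace terminates the escape, the number of hex digits
can grow later without changing how existing strings are read.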

> The N-Triples language defined for the W3C's RDF Test Cases uses a
> specific syntax which is a subset of Python's and is specified to
> have exactly one way to encode any particular character.
> <http://www.w3.org/TR/rdf-testcases/#ntrip_strings>

Unicode already requires that the shortest legal UTF-8 encoding be
used, so the encoding is already unambiguous.

> SGML-style: "\U+2022;02"

Yes, that would be another possible delimiting. I don't think I have a
strong preference.


shap


