[bitc-dev] Newline conventions

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Sat Feb 18 15:44:09 EST 2006


Jonathan S. Shapiro wrote:
> Before somebody feels compelled to point out how horrible it is, let me
> say that I am already coming to *hate*
> 
> 	#\{linefeed}
> 
> More importantly, I am coming to hate strings like
> 
> 	"Hello, world\{linefeed}"
> 
> The entire current convention for character syntax is a nightmare. After
> reviewing the conventions used by Scheme, here is the revised plan that
> will be coming up in the next revision of the language specification:
> 
> Characters:
> 
>   #\X is the character X provided X is printable 
>   #\U+XXXX is a unicode code point
>   #\tab
>   #\newline
>   #\space
> 
> Strings, between the outer double quotes:
> 
>   X is a character if X is printable
>   \n -- newline
>   \r -- carriage return
>   \t -- horizontal tab
>   \\ -- backslash
>   \f -- formfeed
>   \b -- backspace (?)
>   \" -- double quote embedded in the string
>   \U+XXXX -- unicode code point.

Unicode code points go up to U+10FFFF, so the syntax used in the standard
also allows U+XXXXX and U+XXXXXX (but never fewer than 4 hex digits).
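
As a rough illustration of that rule, here is a minimal Python sketch of a
checker for the U+ notation. The name parse_code_point is hypothetical and
not part of any BitC specification; it only assumes 4 to 6 hex digits and
an upper bound of U+10FFFF.

```python
import re

# Hypothetical helper for illustration; not taken from the BitC spec.
# Accepts "U+" followed by 4 to 6 hex digits.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def parse_code_point(text):
    """Return the code point written as 'U+XXXX', or None if it is invalid."""
    match = _CODE_POINT.fullmatch(text)
    if match is None:
        return None  # wrong prefix, or fewer than 4 / more than 6 digits
    value = int(match.group(1), 16)
    # Six hex digits can express values beyond the last code point.
    return value if value <= 0x10FFFF else None
```

Under these assumptions "U+0041" and "U+10FFFF" are accepted, while "U+41"
and "U+110000" are rejected.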

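This does not create any ambiguity for characters, where the literal ends
at the next delimiter, but it does for embedded Unicode escapes in strings,
where the character following the escape may itself be a hex digit. There
are several options: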

1. Only support code points up to U+FFFF in strings.

2. Use a longest-match rule, so that "\U+10ABCD" is a string with a
   single character, and it would be necessary to write
   "\U+10AB\U+0043\U+0044" for the 3-character string "ႫCD".

3. Use \uXXXX and \U00XXXXXX. This is ugly, but consistent with
   at least C, C++ and Python (Java and JavaScript have \uXXXX but
   no \U form).

4. Use a syntax with an explicit end-delimiter for Unicode escapes
   (I would suggest either space or ';').

1 and 2 are pretty awful, so I think it has to be 3 or 4.
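
To make the difference concrete, here is a small Python sketch of how
options 2 and 4 might decode a string body. The function names and the
choice of ';' as the terminator are assumptions for illustration only;
nothing here is proposed BitC behaviour.

```python
import re

# Option 2 (assumed syntax): greedily take 4 to 6 hex digits after \U+.
_LONGEST = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})")
# Option 4 (assumed syntax): the hex digits are closed by an explicit ';'.
_DELIMITED = re.compile(r"\\U\+([0-9A-Fa-f]{4,6});")

def _to_char(match):
    # chr() raises ValueError for values above 0x10FFFF.
    return chr(int(match.group(1), 16))

def decode_longest_match(body):
    """Replace each \\U+ escape using the longest run of hex digits."""
    return _LONGEST.sub(_to_char, body)

def decode_delimited(body):
    """Replace each \\U+...; escape; text after the ';' is left alone."""
    return _DELIMITED.sub(_to_char, body)
```

With longest match, the body \U+10ABCD decodes to one character, and the
3-character string needs \U+10AB\U+0043\U+0044. With the terminator, the
same 3-character string can be written \U+10AB;CD, and \U+10ABCD; is still
the single character U+10ABCD.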

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>




