[bitc-dev] bug in character decode routines
David Hopwood
david.nospam.hopwood at blueyonder.co.uk
Wed May 17 20:29:58 EDT 2006
Jonathan S. Shapiro wrote:
> Technically, I believe that space is not considered printable.
It is considered printable.
C99 7.4.1.8 #2:
# The isprint function tests for any printing character including space (' ').
C99 7.4 #3:
# The term printing character refers to a member of a locale-specific set of
# characters, each of which occupies one printing position on a display device;
# the term control character refers to a member of a locale-specific set of
# characters that are not printing characters.167) All letters and digits are
# printing characters.
#
# 167) In an implementation that uses the seven-bit US ASCII character set,
# the printing characters are those whose values lie from 0x20 (space) through
# 0x7E (tilde); the control characters are those whose values lie from 0 (NUL)
# through 0x1F (US), and the character 0x7F (DEL).
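The footnote can be checked directly. The following is only a sketch: the helper names are made up for illustration, and it assumes an ASCII-based platform running in the default "C" locale, where isprint should agree with the 0x20..0x7E range for every seven-bit code.

```c
#include <ctype.h>

/* Hypothetical helper: the seven-bit US-ASCII printing range quoted in
   footnote 167, i.e. 0x20 (space) through 0x7E (tilde). */
int isAsciiPrinting(int c) {
    return c >= 0x20 && c <= 0x7E;
}

/* Returns 1 if isprint agrees with that range for every seven-bit code,
   0 otherwise. Assumes an ASCII-based platform and the default "C" locale;
   in other locales or encodings the two may legitimately differ. */
int isprintMatchesAsciiRange(void) {
    int c;
    for (c = 0; c <= 0x7F; c++) {
        if ((isprint(c) != 0) != isAsciiPrinting(c)) return 0;
    }
    return 1;
}
```

In particular, isprint(' ') is nonzero under these assumptions, which is the point at issue.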
Let us, however, be a bit more careful about the character encodings being
used. Suppose that we are compiling the BitC compiler on platform P,
running it on platform Q, compiling the output C source on platform R,
and running that on platform S. A "platform" here implies an encoding.
[No, I'm not overcomplicating things. Although it will be rare that *all*
of these platforms are different, some of them may well be different.
For example, I might actually be using BitC in a situation where P and Q
could be either Linux (UTF-8) or Windows (Cp1252), R would be MS-DOS (Cp437)
in order to use a particular C cross-compiler, and S would be an embedded
platform (can't remember what encoding it normally uses).]
Anyway,
- the C source files for the BitC compiler are distributed as US-ASCII,
but they could be converted to any encoding needed by platform P;
that isn't a problem.
- I assume it is intended that a BitC source file is always encoded as
UTF-8 (although the spec doesn't currently say so). These files should
be opened in binary mode, avoiding any unwanted conversions.
- the generated C code should be *portable* C89. It might be reasonable to
make some assumptions about the C platform that will execute this code
(platform S) that go beyond what is guaranteed by the C89 standard (for
example, we may assume CHAR_BIT == 8), but any such assumptions should
be documented. This means that we should only use characters from the
C89 "portable character set":
A-Z a-z 0-9
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
i.e. U+0020..007E excluding U+0024 ($), U+0040 (@) and U+0060 (`).
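Of these, \ must always be escaped in literals, ' must be escaped in
character literals, and " must be escaped in string literals; it is
simplest to treat all three as unusable as-is. We can also use whatever
platform R accepts as a newline (see below), although not in literals.
There is no good reason to use tabs.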
- the BitC compiler seems to be writing the C source file on stdout,
which is opened in text mode using the encoding of Q. This works if
and only if Q and R use encodings that are identical for characters
in the portable character set. I think this is a reasonable assumption,
but it should be documented.
- we can use either '\n' or '\x0A' to output newlines. Using '\n'
assumes that platform R accepts platform Q's newline encoding; using
'\x0A' assumes that platform R accepts the Unix (LF) newline encoding.
It probably doesn't matter, but since we are already assuming that
Q and R have compatible encodings for portable characters, '\n' may
be slightly better.
- a BitC program uses UTF-8 as its execution character encoding. If we
are including "portable" characters directly in character and string
literals, then the codes for these characters in the encoding of S
(the C "execution character set") must be the same as in UTF-8 (i.e.
the same as in US-ASCII, since all characters in the portable set
are in US-ASCII). This assumption should also be documented.
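One way to document the assumptions about platform S is to have the generated C check them. The sketch below is an illustration rather than anything the BitC compiler currently emits: the ASSUME macro and the function name are hypothetical, and the run-time function is only a spot check of a few portable characters, not an exhaustive test.

```c
#include <limits.h>

/* The preprocessor can check CHAR_BIT directly, even in C89. */
#if CHAR_BIT != 8
#error "generated code assumes CHAR_BIT == 8"
#endif

/* Hypothetical C89 compile-time check: the typedef declares an array of
   size -1, which is a compile error, whenever the condition is false. */
#define ASSUME(name, cond) typedef char assume_##name[(cond) ? 1 : -1]

/* Portable characters must have their US-ASCII codes in the execution
   character set of platform S. */
ASSUME(space_is_0x20, ' ' == 0x20);
ASSUME(digit_0_is_0x30, '0' == 0x30);
ASSUME(upper_A_is_0x41, 'A' == 0x41);
ASSUME(lower_a_is_0x61, 'a' == 0x61);
ASSUME(tilde_is_0x7E, '~' == 0x7E);

/* Run-time counterpart of the same spot check; returns 1 if the sampled
   characters have their US-ASCII codes, 0 otherwise. */
int executionSetLooksAscii(void) {
    return ' ' == 0x20 && '0' == 0x30 && '9' == 0x39
        && 'A' == 0x41 && 'Z' == 0x5A
        && 'a' == 0x61 && 'z' == 0x7A && '~' == 0x7E;
}
```

The character constants are written in typedefs rather than in #if directives because the value of a character constant in a preprocessor expression is not required to match its value in the execution character set.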
So, if 'c' is intended to hold a Unicode scalar value, I would write the
code in question as:
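    if (isPortableLiteralChar(c)) printf("'%c'", (int) c);
    else printf("%ld", (long) c);  /* not portable: emit the numeric value */

    /* True iff c can appear as-is in a character or string literal using
       only the C89 portable character set. Needs <wchar.h> for wint_t and
       a definition of bool (e.g. <stdbool.h> in C99). */
    bool isPortableLiteralChar(wint_t c) {
        return c >= 0x20 && c <= 0x7E  /* printable ASCII */
            && c != 0x22  /* not " */
            && c != 0x24  /* not $ */
            && c != 0x27  /* not ' */
            && c != 0x40  /* not @ */
            && c != 0x5C  /* not \ */
            && c != 0x60; /* not ` */
    }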
This also happens to be correct if 'c' is a UTF-8 code unit rather than
a Unicode scalar value, although in that case I would declare it as
'bool isPortableLiteralCodeUnit(uint8_t c)'.
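For concreteness, the code-unit variant might look like the sketch below. It assumes a C99 host compiler (for <stdbool.h> and <stdint.h>) and an ASCII-compatible encoding on the platform running the compiler; formatCodeUnit is a hypothetical helper that writes into a buffer instead of stdout so that the result can be inspected.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same test as isPortableLiteralChar, applied to one UTF-8 code unit. */
bool isPortableLiteralCodeUnit(uint8_t c) {
    return c >= 0x20 && c <= 0x7E                 /* printable ASCII */
        && c != 0x22 && c != 0x27 && c != 0x5C    /* not " ' \ */
        && c != 0x24 && c != 0x40 && c != 0x60;   /* not $ @ ` */
}

/* Hypothetical helper: writes the C spelling of one code unit into buf,
   which must hold at least 4 bytes ("'A'" or "255" plus the terminator),
   and returns the number of characters written. */
int formatCodeUnit(uint8_t c, char *buf) {
    if (isPortableLiteralCodeUnit(c)) return sprintf(buf, "'%c'", c);
    return sprintf(buf, "%d", c);  /* fall back to the numeric value */
}
```

With this, 0x41 is written as 'A', while 0x5C (backslash) and 0xC3 (a UTF-8 lead byte) are written as the decimal numbers 92 and 195.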
--
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>