[bitc-dev] bug in character decode routines

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Wed May 17 20:29:58 EDT 2006


Jonathan S. Shapiro wrote:
> Technically, I believe that space is not considered printable.

It is considered printable.

C99 7.4.1.8 #2:
# The isprint function tests for any printing character including space (' ').

C99 7.4 #3:
# The term printing character refers to a member of a locale-specific set of
# characters, each of which occupies one printing position on a display device;
# the term control character refers to a member of a locale-specific set of
# characters that are not printing characters.[167] All letters and digits are
# printing characters.
#
# [167] In an implementation that uses the seven-bit US ASCII character set,
# the printing characters are those whose values lie from 0x20 (space) through
# 0x7E (tilde); the control characters are those whose values lie from 0 (NUL)
# through 0x1F (US), and the character 0x7F (DEL).
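
The quoted text can be checked directly. The following is only a sketch: it assumes the default "C" locale and an ASCII-based platform, and the helper name isprint_matches_ascii_range is made up for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if isprint() agrees with the range given
   in footnote 167 (0x20 through 0x7E) for every 7-bit value, else 0.
   Assumes the "C" locale and an ASCII-based execution character set. */
int isprint_matches_ascii_range(void) {
    int c;
    for (c = 0; c <= 0x7F; c++) {
        int in_range = (c >= 0x20 && c <= 0x7E);
        int printable = (isprint(c) != 0);
        if (in_range != printable) return 0;
    }
    return 1;
}
```

In particular, isprint(' ') is nonzero, whereas isgraph(' ') is zero; isgraph is the function that excludes space.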


Let us, however, be a bit more careful about the character encodings being
used. Suppose that we are compiling the BitC compiler on platform P,
running it on platform Q, compiling the output C source on platform R,
and running that on platform S. A "platform" here implies an encoding.

[No, I'm not overcomplicating things. Although it will be rare that *all*
of these platforms are different, some of them may well be different.
For example, I might actually be using BitC in a situation where P and Q
could be either Linux (UTF-8) or Windows (Cp1252), R would be MS-DOS (Cp437)
in order to use a particular C cross-compiler, and S would be an embedded
platform (can't remember what encoding it normally uses).]

Anyway,

 - the C source files for the BitC compiler are distributed as US-ASCII,
   but they could be converted to any encoding needed by platform P;
   that isn't a problem.

 - I assume it is intended that a BitC source file is always encoded as
   UTF-8 (although the spec doesn't currently say so). These files should
   be opened in binary mode, avoiding any unwanted conversions.

 - the generated C code should be *portable* C89. It might be reasonable to
   make some assumptions about the C platform that will execute this code
   (platform S) that go beyond what is guaranteed by the C89 standard (for
   example, we may assume CHAR_BIT == 8), but any such assumptions should
   be documented. This means that we should only use characters from the
   C89 "portable character set":

     A-Z a-z 0-9
     ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

   i.e. U+0020 (space) through U+007E (~), excluding U+0024 ($), U+0040 (@)
   and U+0060 (`).

   Of these, " ' and \ may not be used as-is in character or string
   literals. We can also use whatever platform R accepts as a newline
   (see below), although not in literals. There is no good reason to
   use tabs.

 - the BitC compiler seems to be writing the C source file on stdout,
   which is opened in text mode using the encoding of Q. This works if
   and only if Q and R use encodings that are identical for characters
   in the portable character set. I think this is a reasonable assumption,
   but it should be documented.

 - we can use either '\n' or '\x0A' to output newlines. Using '\n'
   assumes that platform R accepts platform Q's newline encoding; using
   '\x0A' assumes that platform R accepts the Unix (LF) newline encoding.
   It probably doesn't matter, but since we are already assuming that
   Q and R have compatible encodings for portable characters, '\n' may
   be slightly better.

 - a BitC program uses UTF-8 as its execution character encoding. If we
   are including "portable" characters directly in character and string
   literals, then the codes for these characters in the encoding of S
   (the C "execution character set") must be the same as in UTF-8 (i.e.
   the same as in US-ASCII, since all characters in the portable set
   are in US-ASCII). This assumption should also be documented.
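
One way to document the assumptions about platform S is to have the generated C code check them at compile time. This is a sketch only; the macro name BITC_STATIC_ASSERT and the typedef names are hypothetical, and it spot-checks a few characters rather than the whole portable set.

```c
#include <limits.h>

/* Assumption about platform S: bytes are 8 bits wide. */
#if CHAR_BIT != 8
#error "generated code assumes CHAR_BIT == 8"
#endif

/* C89 has no static assertion, so declare an array type whose size is
   negative (a compile error) when the condition is false. */
#define BITC_STATIC_ASSERT(name, cond) typedef char name[(cond) ? 1 : -1]

/* Assumption: the execution character set agrees with US-ASCII for
   portable characters. Only a few representatives are checked here. */
BITC_STATIC_ASSERT(bitc_assume_space_is_0x20, ' ' == 0x20);
BITC_STATIC_ASSERT(bitc_assume_0_is_0x30, '0' == 0x30);
BITC_STATIC_ASSERT(bitc_assume_A_is_0x41, 'A' == 0x41);
BITC_STATIC_ASSERT(bitc_assume_a_is_0x61, 'a' == 0x61);
BITC_STATIC_ASSERT(bitc_assume_tilde_is_0x7E, '~' == 0x7E);
```

The character checks are written as ordinary constant expressions rather than #if tests, because the value of a character constant inside #if is not required to match its value in the execution character set.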


So, if 'c' is intended to hold a Unicode scalar value, I would write the
code in question as:

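  #include <stdbool.h>  /* bool */
  #include <stdio.h>    /* printf */
  #include <wchar.h>    /* wint_t */

  bool isPortableLiteralChar(wint_t c) {
      return c >= 0x20 && c <= 0x7E  /* printable ASCII */
          && c != 0x22               /* not " */
          && c != 0x24               /* not $ */
          && c != 0x27               /* not ' */
          && c != 0x40               /* not @ */
          && c != 0x5C               /* not \ */
          && c != 0x60;              /* not ` */
  }

  /* at the point where the literal is emitted; the casts make the
     arguments match the printf conversions whatever type wint_t is */
  if (isPortableLiteralChar(c)) printf("'%c'", (int)c);
  else printf("%lu", (unsigned long)c);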

This also happens to be correct if 'c' is a UTF-8 code unit rather than
a Unicode scalar value, although in that case I would declare it as
'bool isPortableLiteralCodeUnit(uint8_t c)'.
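
The same test extends to string literals. The following is a sketch, not the compiler's actual code: the name emitPortableStringLiteral and its buffer-based interface are made up for illustration, and it uses unsigned char where uint8_t was suggested above. It also escapes '?' so that the output cannot contain a trigraph, which is an extra precaution beyond the test above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same test as isPortableLiteralChar above, restated for a code unit. */
static int isPortableLiteralCodeUnit(unsigned char c) {
    return c >= 0x20 && c <= 0x7E
        && c != 0x22 && c != 0x24 && c != 0x27
        && c != 0x40 && c != 0x5C && c != 0x60;
}

/* Hypothetical helper: writes the n code units of s into out as a C
   string literal. Code units that fail the test, and '?' (0x3F), are
   written as three-digit octal escapes. Returns the length written,
   or 0 if cap is too small for the worst case. Assumes the host
   encoding agrees with US-ASCII for portable characters. */
size_t emitPortableStringLiteral(const unsigned char *s, size_t n,
                                 char *out, size_t cap) {
    size_t i, len = 0;
    assert(s != NULL && out != NULL);
    /* Worst case: 4 characters per code unit, two quotes and a NUL. */
    if (cap < 4 * n + 3) return 0;
    out[len++] = '"';
    for (i = 0; i < n; i++) {
        if (isPortableLiteralCodeUnit(s[i]) && s[i] != 0x3F) {
            out[len++] = (char)s[i];
        } else {
            len += (size_t)sprintf(out + len, "\\%03o", (unsigned)s[i]);
        }
    }
    out[len++] = '"';
    out[len] = '\0';
    return len;
}
```

Octal escapes are used rather than hex because a hex escape has no length limit and would absorb any hex digits that follow it, whereas an octal escape stops after three digits.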

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



