[bitc-dev] bug in character decode routines

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Thu May 18 21:57:06 EDT 2006


Jonathan S. Shapiro wrote:
> On Thu, 2006-05-18 at 19:51 +0100, David Hopwood wrote:
> 
>>This is correct as far as it goes, but I think what also needs to be said
>>is that *when the input compilation unit is provided as a file*, it is
>>encoded as UTF-8 (also as defined by Unicode 4.1.0).
> 
> At the moment, no other form of input unit is anticipated. You are
> clearly imagining something else, and I would be very interested to know
> what it is in order to understand better what change (if any) is
> appropriate.

I was imagining a read-eval-print-loop. In that case the user types in
code as characters, and the encoding is not visible.

> In particular, we do NOT anticipate incorporating EVAL into the
> language.

Not as a required feature, I agree. However, I don't see any reason to
disallow implementations that can accept source code other than from a
file, just because that input might not be encoded as UTF-8.

>>I assume it is intentional that characters that are unassigned as of
>>Unicode 4.1.0 cannot be used in this version of BitC?
> 
> Yes. There is a downward compatibility problem if this is permitted.

Right, that's what I thought.

>>>I believe that we do currently open the
>>>file in binary mode, as I remember noting this potential error at some
>>>point. Hmm. Actually, it doesn't matter, because LF and CR are not valid
>>>within strings or character literals, and in all other cases they are
>>>merely white space.
>>
>>Strictly speaking, the text mode translation might not only affect LF and CR.
> 
> This is true for wide character input routines (which we do not use),
> but I'm not aware of any other normalization that is done when files are
> accessed using the traditional stdio procedures.
> 
> We should certainly be using binary I/O in any case (and I just
> confirmed that we do), but what am I forgetting here?

I'm probably being unnecessarily picky, but the C standard is extremely vague
about text files:

C99 7.19.2 #2:
# [...] Characters may have to be added, altered, or deleted on input and
# output to conform to differing conventions for representing text in the host
# environment. Thus, there need not be a one-to-one correspondence between the
# characters in a stream and those in the external representation. [...]

IOW, while typical Unix-ish platforms mangle at most newlines in text mode,
the standard permits arbitrary mangling.

I tend to avoid opening files in text mode entirely in my own C programs.
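
As a minimal sketch of the binary-mode approach (the function name
read_all_binary is hypothetical and is not taken from the BitC sources),
the whole file can be read as raw bytes so that no text-mode translation
is applied before the UTF-8 decoder sees the input:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file as raw bytes.  The "rb" mode asks the C library
 * not to apply any text-mode translation, so the bytes returned are the
 * bytes stored in the file.  Returns a malloc'd buffer (the caller frees
 * it) and stores the length in *len_out, or returns NULL on error. */
unsigned char *read_all_binary(const char *path, size_t *len_out)
{
    FILE *fp;
    unsigned char *buf = NULL;
    size_t len = 0, cap = 0;

    assert(path != NULL && len_out != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    for (;;) {
        size_t n;

        if (len == cap) {
            /* Grow the buffer geometrically. */
            size_t new_cap = cap ? cap * 2 : 4096;
            unsigned char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        n = fread(buf + len, 1, cap - len, fp);
        len += n;
        if (n == 0)
            break;          /* end of file or read error */
    }

    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);
    *len_out = len;
    return buf;
}
```

With this, CR, LF and any other byte values reach the caller unchanged,
and deciding what counts as white space is left to the lexer.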

>>>>- the generated C code should be *portable* C89. [...]
>>
>>The extra features in C99 are not all that useful for generated code,
>>anyway, unless I've missed something.
> 
> Hmm. One feature of C99 that we may be using is initializers that follow
> statements. I'd have to look, and I don't consider it urgent to fix
> this -- the entire C-based bringup strategy was probably, in hindsight,
> a mistake.

If it were my decision, I would probably have implemented the bootstrap
compiler in ML, Scheme or Haskell, generating MLRISC or a Scheme-based IL.
Generating C may look like a tempting way to get something working quickly,
but it has many deficiencies as an intermediate language.

>>>[...] BitC does not make any statement about
>>>execution character encoding (intentionally). The current implementation
>>>is that characters are UCS4, and strings are vectors of UCS4.
>>
>>I'm a bit confused: how can the current compiler use C strings to represent
>>BitC strings in that case?
> 
> It only does this for literal initialization. IIRC, the emitted code
> includes a call to a helper routine that copies these strings into the
> BitC heap.

OK.

>>I would also argue that any language supporting Unicode must (also) be able
>>to use UTF-8 directly as an internal encoding, with indexing based on UTF-8
>>code units. The fourfold expansion of UTF-32 for US-ASCII is not acceptable
>>when storing large amounts of mostly-ASCII text.
> 
> For a production implementation, I agree in principle, but UTF-8 may not
> follow. In practice, I think that the ICU ropes implementation (or
> something comparable) is probably the right thing to do. The problem
> lies in the existence of STRING-SET! and the need for near-constant-time
> operation there.
> 
> For a bringup compiler, I decided that this wasn't a compelling problem,
> and I chose to do something that would work to get us moving even though
> it was the wrong thing for production use.

Fair enough. However, I think that despite the resulting library complexity,
there is a good case for allowing a program to explicitly specify the
encoding of a Unicode string (UTF-8, UTF-16 or UTF-32). It needn't
complicate the core language significantly.
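
To put rough numbers on the storage trade-off between these encodings,
here is a minimal sketch that counts the bytes needed for the same
sequence of code points in UTF-8 and in UTF-32. The function names
(utf8_units, utf8_size, utf32_size) are hypothetical and only for
illustration; nothing here is taken from the BitC runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Number of UTF-8 code units (bytes) needed for one Unicode scalar
 * value, or 0 if cp is not a scalar value (a surrogate code point, or
 * anything above U+10FFFF). */
size_t utf8_units(unsigned long cp)
{
    if (cp < 0x80UL)
        return 1;
    if (cp < 0x800UL)
        return 2;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;               /* surrogates are not encodable */
    if (cp < 0x10000UL)
        return 3;
    if (cp <= 0x10FFFFUL)
        return 4;
    return 0;                   /* beyond the Unicode code space */
}

/* Total bytes needed to store n code points as UTF-8.  Code points
 * that are not scalar values contribute nothing to the total. */
size_t utf8_size(const unsigned long *cps, size_t n)
{
    size_t i, total = 0;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += utf8_units(cps[i]);
    return total;
}

/* UTF-32 always uses one 4-byte code unit per code point. */
size_t utf32_size(size_t n)
{
    return n * 4;
}
```

For pure US-ASCII text utf8_size returns n while utf32_size returns 4*n,
which is the fourfold expansion mentioned above. The price of UTF-8 is
that code points no longer sit at fixed offsets, so indexing has to be
in code units rather than characters.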

> A hidden issue here is storage allocation. There are places where we
> need to know that a program does not allocate storage dynamically. It
> would be unfortunate if this meant that STRING-SET! was prohibited in
> such programs. It's a puzzlement what to do about this, and it would be
> unfortunate to tie mutability rules to the runtime implementation in
> this particular way.
> 
>>$, @ and ` are not in the portable character set, which is why I excluded
>>them here. Yes, I know that the national variants of ASCII are obsolete,
>>but IBM machines that use EBCDIC variants are still in widespread use, and
>>that case would "just work" provided that we stick to the portable character
>>set, under the assumptions stated above.
> 
> Do I misrecollect C-89? I had thought it mandated ASCII input.

No, there are essentially no requirements on the source file encoding.

C99 1. #2:
# This International Standard does not specify [...] the mechanism by which
# C programs are transformed for use by a data-processing system;

C99 5.1.1.2 #1:
# Physical source file multibyte characters are mapped, in an implementation-
# defined manner, to the source character set (introducing new-line characters
# for end-of-line indicators) if necessary.

where the source character set is itself implementation-defined.

(C89 and C99 are the same here; I just don't have the C89 standard in front of me.)

Even POSIX does not mandate ASCII.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



