[bitc-dev] Encoding of string literals

Jonathan S. Shapiro shap at eros-os.org
Fri May 19 09:35:48 EDT 2006


On Fri, 2006-05-19 at 11:53 +0200, Dominique Quatravaux wrote:
> Jonathan S. Shapiro wrote:
> 
> >The difficulty with this encoding is that there is no portable way in C
> >to write a literal initializer for it.
> >
> Surely you mean a *human-readable* literal initializer? afaict one can
> definitely initialize an arbitrary string of bytes or 32-bit longs in C
> in a portable fashion.

Yes, but that is not the representation. The representation is a length
word **followed by** an arbitrary sequence of bytes.

This requires the C99 flexible array member (an array of unspecified
length as the last member of a structure). This feature is not universally
implemented, and the rules for constant initialization of such a member do
not appear to have been covered precisely in the standard.
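
To make the representation concrete, here is a minimal sketch of a length
word followed directly by the bytes. The names bitc_string and
bitc_string_new are hypothetical and are not taken from the BitC runtime;
the sketch assumes a C99 compiler with flexible array members. Because
there is no portable static initializer for the trailing array, the
object is built at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a length word followed directly by the payload. */
typedef struct {
    uint32_t length;       /* number of bytes stored in data[] */
    unsigned char data[];  /* C99 flexible array member */
} bitc_string;

/* Allocate header and payload as one block and copy the bytes in.
   Returns NULL if allocation fails; the caller frees the result. */
static bitc_string *bitc_string_new(const unsigned char *bytes, uint32_t n)
{
    bitc_string *s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->length = n;
    if (n > 0)
        memcpy(s->data, bytes, n);
    return s;
}
```

A static object of this type can have its length initialized portably,
but not its data; some compilers accept a data initializer as an
extension.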

> >This drove us to copy the strings
> >at run time.
> >
> Let this slow thinker get this straight. You mean that the generated C
> code has, say, UTF-8 strings in it that are converted into UTF-32 at
> compiled-program startup time?

Yes. This was done because of representation, but also because of the
logic of construction (each construction returns a distinct instance).
If we change that logic, then of course this can be removed.
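
As an illustration of the startup conversion, here is a minimal sketch of
a UTF-8 to UTF-32 decoder. The function name utf8_to_utf32 is
hypothetical, and the sketch assumes the compiler emitted well-formed
UTF-8: it checks lead and continuation bytes but does not reject overlong
forms or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode n bytes of UTF-8 into at most cap UTF-32 code points.
   Returns the number of code points written, or (size_t)-1 if the
   input is malformed or the output buffer is too small. */
static size_t utf8_to_utf32(const unsigned char *in, size_t n,
                            uint32_t *out, size_t cap)
{
    size_t i = 0, count = 0;
    while (i < n) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t extra;  /* continuation bytes that follow the lead byte */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or bad lead */

        if (extra > n - i - 1 || count == cap)
            return (size_t)-1;       /* truncated input or full buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[count++] = cp;
        i += extra + 1;
    }
    return count;
}
```

Since each construction returns a distinct instance, a decoder along
these lines would run once per construction, not once per literal.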

In my opinion, in hindsight, the representation choice was not a good
one, and I think we should revise it. This doesn't alter the logic
issue.

> >It appears (to me) that the only safe encoding
> >if we care about EBCDIC-based C compilers is to emit *everything* using
> >octal escapes, and perhaps emit comments for the sake of the human
> >reader.
> >
> >For character literals this will work fine, but for string literals it
> >is a complete nuisance.
> >
> I cannot (yet?) see what is wrong with this approach. As an aid towards
> legibility of the intermediate code (which need not even be a design
> goal imho, but oh well) you could stash all the UTF-32 string literals
> as a kind of symbol table at the bottom of the emitted C file, e.g.
> (sorry for my pidgin C):
> 
>     static const STRING_T literal_number_2_from_bitc_source_file_at_line_312; // "Beyonc\x{E9}"

Yes, we could. If I thought that supporting EBCDIC was important, I
would do something similar to this.
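
For reference, a minimal sketch of the "everything as octal escapes"
emission discussed above. The function name emit_octal_literal is
hypothetical, and the sketch assumes 8-bit bytes. Each byte is written as
exactly three octal digits; since an octal escape takes at most three
digits, a following digit can never be absorbed into the escape, which is
not true of hex escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write bytes[0..n) into out as a C string literal made only of
   three-digit octal escapes. Returns the number of characters written
   (excluding the terminating NUL), or -1 if out is too small. */
static int emit_octal_literal(const unsigned char *bytes, size_t n,
                              char *out, size_t cap)
{
    size_t need = 2 + 4 * n + 1;  /* two quotes, \ooo per byte, NUL */
    if (cap < need)
        return -1;

    char *p = out;
    *p++ = '"';
    for (size_t i = 0; i < n; i++)
        p += sprintf(p, "\\%03o", (unsigned)bytes[i]);
    *p++ = '"';
    *p = '\0';
    return (int)(p - out);
}
```

The emitter could follow each literal with a comment carrying the
readable text for the sake of the human reader.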


shap


