[bitc-dev] Encoding of string literals
Jonathan S. Shapiro
shap at eros-os.org
Fri May 19 09:35:48 EDT 2006
On Fri, 2006-05-19 at 11:53 +0200, Dominique Quatravaux wrote:
> Jonathan S. Shapiro wrote:
>
> >The difficulty with this encoding is that there is no portable way in C
> >to write a literal initializer for it.
> >
> Surely you mean a *human-readable* literal initializer? afaict one can
> definitely initialize an arbitrary string of bytes or 32-bit longs in C
> in a portable fashion.
Yes, but that is not the representation. The representation is a length
word **followed by** an arbitrary sequence of bytes.
This requires the C99 flexible array member, an array of unspecified
length at the end of a structure (a true 0-length array is a compiler
extension). This feature is not universally implemented, and the constant
initialization rules for it do not appear to have been covered precisely
in the standard.
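For concreteness, here is a minimal sketch of the layout being described, a length word followed directly by the bytes. The names `bitc_string` and `bitc_string_new` are hypothetical and not taken from the BitC runtime, and the sketch assumes a C99 compiler that supports flexible array members. It builds the object at run time, which sidesteps the question of static initialization entirely.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a length word followed directly by the bytes. */
typedef struct {
    size_t len;
    unsigned char data[];   /* C99 flexible array member */
} bitc_string;

/* Build an instance at run time by copying the bytes after the header.
 * Returns NULL on allocation failure; the caller releases it with free(). */
static bitc_string *bitc_string_new(const void *bytes, size_t len)
{
    bitc_string *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->data, bytes, len);
    return s;
}
```

Each call returns a fresh allocation, which also matches the "each construction returns a distinct instance" logic.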
> > This drove us to copy the strings
> >at run time.
> >
> Let this slow thinker get this straight. You mean that the generated C
> code has, say, UTF-8 strings in it that are converted into UTF-32 at
> compiled-program startup time?
Yes. This was done because of representation, but also because of the
logic of construction (each construction returns a distinct instance).
If we change that logic, then of course this can be removed.
In my opinion, in hindsight, the representation choice was not a good
one, and I think we should revise it. This doesn't alter the logic
issue.
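As an illustration of the startup-time conversion, the following is a sketch only and not the actual BitC runtime code. The function name `utf8_to_utf32` and its signature are assumptions. It expands UTF-8 bytes into UTF-32 code points and rejects truncated or malformed sequences, but it does not check for overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 into UTF-32 code points.
 * Returns the number of code points written, or (size_t)-1 if the input
 * is truncated, has a bad lead or continuation byte, or does not fit in
 * out. Overlong forms and surrogates are NOT rejected in this sketch. */
static size_t utf8_to_utf32(const unsigned char *in, size_t in_len,
                            uint32_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    while (i < in_len) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t extra;   /* number of continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;
        if (extra > in_len - i - 1 || n >= out_cap)
            return (size_t)-1;
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Running something like this once per literal at program start is the cost that a statically initialized UTF-32 representation would avoid.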
> >It appears (to me) that the only safe encoding
> >if we care about EBCDIC-based C compilers is to emit *everything* using
> >octal escapes, and perhaps emit comments for the sake of the human
> >reader.
> >
> >For character literals this will work fine, but for string literals it
> >is a complete nuisance.
> >
> I cannot (yet?) see what is wrong with this approach. As an aid towards
> legibility of the intermediate code (which need not even be a design
> goal imho, but oh well) you could stash all the UTF-32 string literals
> as a kind of symbol table at the bottom of the emitted C file, e.g.
> (sorry for my pidgin C):
>
> static const STRING_T
> literal_number_2_from_bitc_source_file_at_line_312; // "Beyonc\x{E9}"
Yes, we could. If I thought that supporting EBCDIC was important, I
would do something similar to this.
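To make the octal-escape idea concrete, here is a rough sketch of what the emitter side might look like. `emit_octal_literal` is a hypothetical helper, not part of the BitC compiler, and it assumes the literal is available as a sequence of 8-bit bytes. Every byte is written as a three-digit octal escape, so the emitted text does not depend on how the target C compiler maps source characters.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the bytes as a C string literal made only of three-digit octal
 * escapes, e.g. {0x48, 0x69} becomes "\110\151" (quotes included).
 * Returns the number of characters written, excluding the terminating
 * NUL, or 0 if out is too small. */
static size_t emit_octal_literal(const unsigned char *bytes, size_t len,
                                 char *out, size_t out_cap)
{
    size_t need = 2 + 4 * len + 1;   /* two quotes, 4 chars per byte, NUL */
    if (out_cap < need)
        return 0;
    char *p = out;
    *p++ = '"';
    for (size_t i = 0; i < len; i++)
        p += sprintf(p, "\\%03o", (unsigned)bytes[i]);
    *p++ = '"';
    *p = '\0';
    return (size_t)(p - out);
}
```

The human-readable form of the string can then follow in a comment on the same line, as in the example above.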
shap