[bitc-dev] Encoding of string literals

Jonathan S. Shapiro shap at eros-os.org
Thu May 18 23:01:06 EDT 2006


I think I remember why we chose this funny approach to string literal
encoding in the transitional compiler.

First, my original plan was to use UTF-8 encoding within the BitC
runtime. This simplifies many interactions with the outside world, and
it speeds up string handling in the common case -- at least for US
residents. So this drove the initial decision.

Later, when the time came to choose an internal string representation, I
wanted to avoid an extra level of indirection. Because of this, a string
in the BitC heap is laid out as a length word followed immediately by
the character data. This decision, of
course, is not exposed outside the runtime, and it could easily be
changed.
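
To make the layout concrete, here is a minimal sketch in C, assuming a
C99 flexible array member and a size_t length word. The names
bitc_string_sketch and sketch_make_string are hypothetical and are not
taken from the BitC runtime; the real length word may have a different
width.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a length word followed directly by the UTF-8
 * bytes, so reaching the character data costs no extra pointer hop. */
typedef struct {
    size_t length;  /* number of bytes of character data */
    char data[];    /* C99 flexible array member: bytes follow the length */
} bitc_string_sketch;

/* Allocate one block holding both the length word and the bytes.
 * Returns NULL if the allocation fails. */
static bitc_string_sketch *sketch_make_string(const char *bytes, size_t len)
{
    bitc_string_sketch *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->length = len;
    memcpy(s->data, bytes, len);
    return s;
}
```

The alternative that this layout avoids would be a header holding a
length and a pointer to a separately allocated buffer, which costs one
more dereference on every access to the character data.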

Given my later experiences with TinyScheme, I now recognize that this
decision is closely tied to the collector design, and should probably be
re-evaluated in any case, but I'm trying at the moment to explain how we
got where we are right now.

The difficulty with this encoding is that there is no portable way in C
to write a literal initializer for it. This drove us to copy the strings
at run time.
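
As an illustration of the problem, here is a sketch under the
assumption that the heap layout is modeled with a C99 flexible array
member. ISO C does not permit an initializer for such a member, so a
static initializer for the whole object is accepted only as a compiler
extension (GCC offers one). The names heap_string_sketch, literal_0,
sketch_copy_literal, and sketch_load_literal_0 are hypothetical and do
not come from the actual emitter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap layout: length word, then the bytes, in one block. */
typedef struct {
    size_t length;
    char data[];    /* flexible array member */
} heap_string_sketch;

/* ISO C has no initializer syntax for a flexible array member, so
 *     static heap_string_sketch s = { 5, "hello" };
 * is not portable. The emitter can instead emit the bytes as an
 * ordinary array ... */
static const char literal_0[] = "hello";

/* ... and build the heap object at run time by copying those bytes. */
static heap_string_sketch *sketch_copy_literal(const char *bytes, size_t len)
{
    heap_string_sketch *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->length = len;
    memcpy(s->data, bytes, len);
    return s;
}

/* sizeof counts the terminating NUL, which is not part of the string. */
static heap_string_sketch *sketch_load_literal_0(void)
{
    return sketch_copy_literal(literal_0, sizeof literal_0 - 1);
}
```

The cost of this approach is one allocation and one copy per literal
when the literal is first materialized.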

I propose that we should have the discussion about user-specified
encodings, and then maybe change this encoding. The change will impact
the C emitter trivially, and the implementation of some low-level string
support routines, but that should be the entirety of the damage.

shap



