[bitc-dev] Interactive REPL

Jonathan S. Shapiro shap at eros-os.org
Thu May 18 23:11:07 EDT 2006


On Fri, 2006-05-19 at 02:57 +0100, David Hopwood wrote:
> Jonathan S. Shapiro wrote:
> > On Thu, 2006-05-18 at 19:51 +0100, David Hopwood wrote:
> > 
> >>This is correct as far as it goes, but I think what also needs to be said
> >>is that *when the input compilation unit is provided as a file*, it is
> >>encoded as UTF-8 (also as defined by Unicode 4.1.0).
> > 
> > At the moment, no other form of input unit is anticipated. You are
> > clearly imagining something else, and I would be very interested to know
> > what it is in order to understand better what change (if any) is
> > appropriate.
> 
> I was imagining a read-eval-print-loop. In that case the user types in
> code as characters, and the encoding is not visible.

I disagree. The encoding is visible at the terminal.

Our model for the interactive REPL is that it differs from batch
compilation in two ways:

  1. The interaction environment is viewed as a unit of compilation
     whose read never terminates.

  2. To support interaction, expressions can be entered at top level.

The second part is semantically problematic, and we are going to need to
look at it carefully.

> >>I would also argue that any language supporting Unicode must (also) be able
> >>to use UTF-8 directly as an internal encoding, with indexing based on UTF-8
> >>code units. The fourfold expansion of UTF-32 for US-ASCII is not acceptable
> >>when storing large amounts of mostly-ASCII text.
> ...
> I think that despite the resulting library complexity,
> there is a good case for allowing a program to explicitly specify the
> encoding of a Unicode string (UTF-8, UTF-16 or UTF-32). It needn't
> complicate the core language significantly.

I'm not convinced of this. When the compiler sees "xyz", it must choose
*some* encoding for that. More generally, the runtime must take a
position on the encoding of any data type that may appear as a literal
in compilation units.
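
To make the choice concrete, here is a minimal sketch of what the literal
"xyz" looks like under each of the three encoding forms. It is written in
Rust rather than BitC, and the names EncodedLiteral, encode_literal and
storage_bytes are hypothetical helpers invented for illustration, not part
of any BitC design.

```rust
/// The three Unicode encoding forms of one literal, as code units.
/// (Hypothetical type, for illustration only.)
pub struct EncodedLiteral {
    pub utf8: Vec<u8>,
    pub utf16: Vec<u16>,
    pub utf32: Vec<u32>,
}

/// Encode a source literal in each encoding form.
pub fn encode_literal(lit: &str) -> EncodedLiteral {
    EncodedLiteral {
        // Rust's &str is already UTF-8, so this is a plain copy.
        utf8: lit.as_bytes().to_vec(),
        // One or two 16-bit units per character.
        utf16: lit.encode_utf16().collect(),
        // Exactly one 32-bit unit per character.
        utf32: lit.chars().map(|c| c as u32).collect(),
    }
}

/// Storage in bytes for the (UTF-8, UTF-16, UTF-32) forms.
pub fn storage_bytes(e: &EncodedLiteral) -> (usize, usize, usize) {
    (e.utf8.len(), e.utf16.len() * 2, e.utf32.len() * 4)
}
```

For "xyz" the code unit values are identical in all three forms (0x78,
0x79, 0x7A); only the unit width differs, so the literal occupies 3, 6 and
12 bytes respectively. That is the fourfold expansion for ASCII text
raised earlier in the thread.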

However, here is something we can do that might meet your concern: use
type classes. We can introduce a type class (String 'a) that defines the
core operations on strings. This leaves the user or the library free to
provide alternative implementations, which should address your concern
about large data sets.
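
As a rough illustration of the idea, here is a sketch that uses a Rust
trait to stand in for the (String 'a) type class. The operation set
(from_chars, length, char_at, storage_bytes) and the two implementing
types are assumptions made for the example; the actual core operations
have not been decided.

```rust
/// Hypothetical core string operations, standing in for (String 'a).
pub trait StringLike: Sized {
    /// Build a string from a sequence of Unicode scalar values.
    fn from_chars(chars: &[char]) -> Self;
    /// Number of characters (not code units).
    fn length(&self) -> usize;
    /// Character at index `i`, or None if out of range.
    fn char_at(&self, i: usize) -> Option<char>;
    /// Bytes of storage used by the character data.
    fn storage_bytes(&self) -> usize;
}

/// Fixed-width representation: one 32-bit slot per character.
pub struct Ucs4String(Vec<char>);

/// Compact representation: UTF-8 code units.
pub struct Utf8String(String);

impl StringLike for Ucs4String {
    fn from_chars(chars: &[char]) -> Self { Ucs4String(chars.to_vec()) }
    fn length(&self) -> usize { self.0.len() }
    // Constant-time indexing: every slot has the same width.
    fn char_at(&self, i: usize) -> Option<char> { self.0.get(i).copied() }
    fn storage_bytes(&self) -> usize { self.0.len() * std::mem::size_of::<char>() }
}

impl StringLike for Utf8String {
    fn from_chars(chars: &[char]) -> Self { Utf8String(chars.iter().collect()) }
    // Linear-time: characters must be counted by decoding.
    fn length(&self) -> usize { self.0.chars().count() }
    fn char_at(&self, i: usize) -> Option<char> { self.0.chars().nth(i) }
    fn storage_bytes(&self) -> usize { self.0.len() }
}

/// Generic code written once against the class, usable with either type.
pub fn last_char<S: StringLike>(s: &S) -> Option<char> {
    match s.length() {
        0 => None,
        n => s.char_at(n - 1),
    }
}
```

With this shape, last_char works unchanged on both representations, while
a five-character ASCII string costs 20 bytes as Ucs4String and 5 bytes as
Utf8String. The trade is that indexing becomes linear-time in the UTF-8
version.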

Regrettably, this does not remove the need to choose an encoding for
program literals. In that regard, we have a separate but interacting
problem.

For some applications, we need to be able to check that a
BitC program runs in constant storage. Not just logically or
conceptually constant, but constant in an absolute sense. This is very
hard to do if a consequence of STRING-SET! is to rebuild the internal
string representation.
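
The difference can be sketched as follows, again in Rust as a stand-in
for BitC. The functions ucs4_set and utf8_set are hypothetical analogues
of STRING-SET! over a fixed-width vector and over a UTF-8 buffer; they
are illustrations under that assumption, not proposed library code.

```rust
/// Overwrite character `i` in a fixed-width (UCS-4 style) vector.
/// Every slot has the same width, so this is a single in-place store
/// and never allocates.
pub fn ucs4_set(s: &mut [char], i: usize, c: char) {
    s[i] = c;
}

/// Overwrite character `i` in a UTF-8 string.
/// Returns true when the new character's encoded length differs from
/// the old one. In that case the tail of the buffer has to be shifted,
/// and the buffer may need to grow, so storage use is not constant.
pub fn utf8_set(s: &mut String, i: usize, c: char) -> bool {
    // Finding the i-th character means decoding from the start.
    let (start, old) = s.char_indices().nth(i).expect("index out of range");
    let end = start + old.len_utf8();
    let mut buf = [0u8; 4];
    s.replace_range(start..end, c.encode_utf8(&mut buf));
    old.len_utf8() != c.len_utf8()
}
```

Replacing 'b' with 'x' in "abc" leaves the UTF-8 buffer at 3 bytes, but
replacing it with U+00E9 grows the buffer to 4 bytes. The fixed-width
vector is unaffected by which character is stored.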

Because of this, my current inclination is to keep the UCS4 vector
representation for string literals, but add the type class so that
other representations can be added independently.

Reactions?


shap


