"Philippe Verdy" <[EMAIL PROTECTED]> writes:

> There's nothing that requires the string storage to use the same
> "exposed" array,

The point is that indexing should better be O(1).

Not having a constant size per code point requires one of three things:
1. Using opaque iterators instead of integer indices.
2. Exposing a different unit in the API.
3. Living with the fact that indexing is not O(1) in general; perhaps
   with clever caching it's good enough in common cases.

Although all three choices can work, I would prefer to avoid them.
If I had to, I would probably choose option 1. But for now I've chosen
a representation based on code points.
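For concreteness, here is a minimal sketch (in Python, which is not the
language under discussion) of why indexing a UTF-8 buffer by code point
index cannot be O(1) without extra machinery: every lookup must scan lead
bytes from the start of the buffer.

```python
def codepoint_offset(buf: bytes, index: int) -> int:
    """Return the byte offset of the index-th code point in UTF-8 bytes.

    Code points occupy 1 to 4 bytes, so we must walk the buffer from
    the beginning: the lookup is O(n), not O(1).
    """
    offset = 0
    for _ in range(index):
        b = buf[offset]          # lead byte of the current code point
        if b < 0x80:             # U+0000..U+007F: 1 byte
            offset += 1
        elif b < 0xE0:           # U+0080..U+07FF: 2 bytes
            offset += 2
        elif b < 0xF0:           # U+0800..U+FFFF: 3 bytes
            offset += 3
        else:                    # U+10000..U+10FFFF: 4 bytes
            offset += 4
    return offset
```

A fixed-width representation avoids this scan entirely, which is the
point of the three alternatives listed above.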

> Anyway, each time you use an index to access to some components of a
> String, the returned value is not an immutable String, but a mutable
> character or code unit or code point, from which you can build
> *other* immutable Strings

No, individual characters are immutable in almost every language.
Assignment to a character variable can be thought of as changing the
reference to point to a different character object, even if it is
physically implemented by overwriting the raw character code.

> When you do that, the returned character or code unit or code point
> does not guarantee that you'll build valid Unicode strings. In fact,
> such character-level interface is not enough to work with and
> transform Strings (for example it does not work to perform correct
> transformation of lettercase, or to manage grapheme clusters).

This is a different issue. Indeed transformations like case mapping
work in terms of strings, but in order to implement them you must
split a string into some units of bounded size (code points, bytes,
etc.).

All non-trivial string algorithms boil down to working on individual
units, because conditionals and dispatch tables must be driven by
finite sets. Any unit of bounded size is technically workable, but
not all are equally convenient. Most algorithms are specified in
terms of code points, so I chose code points as the basic unit of
the API.
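As an illustration of an algorithm driven by per-code-point tables, here
is a toy uppercase mapping in Python. The tiny table is purely
illustrative (real case mapping uses the full Unicode data); the point is
that the conditional is keyed by individual code points.

```python
# Toy per-code-point uppercase table: ASCII a-z plus one accented letter.
# A real implementation would be generated from the Unicode data files.
SIMPLE_UPPER = {0x61 + i: 0x41 + i for i in range(26)}  # a-z -> A-Z
SIMPLE_UPPER[0x00E9] = 0x00C9                           # é -> É

def to_upper(s: str) -> str:
    """Map each code point through the table, leaving unknown ones as-is."""
    return "".join(chr(SIMPLE_UPPER.get(ord(c), ord(c))) for c in s)
```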

In fact in my language there is no separate character type: a code
point extracted from a string is represented by a string of length 1.
It doesn't change the fact that indexing a string by code point index
should run in constant time, and thus using UTF-8 internally would be
a bad idea unless we implement one of the three points above.

> Once you realize that, which UTF you use to handle immutable String
> objects is not important, because it becomes part of the "blackbox"
> implementation of String instances.

The black box must provide enough tools to implement any algorithm
specified in terms of characters that is not already provided as a
primitive by the language.

Algorithms generally scan strings sequentially, but in order to store
positions and come back to them later you must use indices or
iterators of some kind. Indices are simpler (and in my case more
efficient).

> Using SCSU for such String blackbox can be a good option if this
> effectively helps in store many strings in a compact (for global
> performance) but still very fast (for transformations) representation.

I disagree. SCSU can be a separate type to be used explicitly, but
it's a bad idea for the default string representation. Most strings
are short, so constant factors and simplicity matter more than the
amount of storage. And you wouldn't save much storage anyway:
as I said, in my representation strings which contain only characters
in the range U+0000..U+00FF are stored one byte per character, and the
majority of strings in average programs are ASCII.
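The width selection described above can be sketched as follows (a
hypothetical Python helper, not the actual implementation): the string
is stored with the narrowest fixed-width unit that can hold its largest
code point, so indexing stays O(1).

```python
def unit_width(s: str) -> int:
    """Pick the narrowest fixed unit that holds every code point:
    1 byte for U+0000..U+00FF, 2 bytes up to U+FFFF, else 4 bytes."""
    m = max(map(ord, s), default=0)  # empty string fits in 1-byte units
    if m <= 0xFF:
        return 1
    if m <= 0xFFFF:
        return 2
    return 4
```

With any of these widths, the byte offset of code point i is simply
i * unit_width(s), unlike UTF-8 or SCSU.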

In general, what I don't like about SCSU is that there is no obvious
compression algorithm which makes good use of its various features.
Each compression algorithm is either not as powerful as it could be,
or extremely slow (trying various choices), or extremely complicated
(trying only the sensible paths).

> Unfortunately, the immutable String implementations in Java or C#
> or Python does not allow the application designer to decide which
> representation will be the best (they are implemented as concrete
> classes instead of virtual interfaces with possible multiple
> implementations, as they should; the alternative to interfaces would
> have been class-level methods allowing the application to trade with
> the blackbox class implementation the tuning parameters).

Some functions accept any sequence of characters. Other functions
accept only standard strings. The question is how often to use each
style.

Choosing the first option increases flexibility but adds overhead in
the common case. For example, case mapping of a string would have to
either dispatch through generic functions at each step, or be
implemented twice. Currently it's implemented for strings only, in C,
and thus
avoids calling a generic indexing function and other overheads. At
some time I will probably implement it again, to work for arbitrary
sequences of characters, but it's more work for effects that I don't
currently need, so it's not a priority.
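The trade-off between the two styles can be sketched like this (a
hypothetical Python example; the names and the use of `str.upper` are
illustrative, not the actual implementation): the generic path dispatches
per element over any character sequence, while the specialized path makes
one primitive call on the concrete string type.

```python
from typing import Iterable

def upper_generic(chars: Iterable[str]) -> str:
    # Works for any sequence of characters, but pays a per-element
    # dispatch cost on every step.
    return "".join(c.upper() for c in chars)

def upper_string(s: str) -> str:
    # Fast path specialized to the concrete string type: a single
    # primitive call, no per-element indirection.
    return s.upper()
```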

Sometimes the flexibility is only on the surface. If I allowed any
sequence of characters to specify a filename, it would have to be
converted to an array of characters in order to be passed to the OS
anyway. And if a string is used as a filename many times, it would
be converted many times.

You are free to use other string representations in your code,
especially in dynamically typed languages which constrain the type
only in low-level functions like using a string as a filename. But
there is always some primitive character or string type, to make a
generic string API possible at all (what would your virtual interfaces
use?), and generally some primitive string type to make common cases
of string handling fast.

Note that if the other representation does not allow O(1) indexing by
code point *and* it's not enough to always scan characters sequentially
from beginning to end, *then* you must invent some iteration protocol
not currently present in my language.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
