"Philippe Verdy" <[EMAIL PROTECTED]> writes: > There's nothing that requires the string storage to use the same > "exposed" array,
The point is that indexing had better be O(1). Not having a constant
size per code point requires one of three things:

1. Using opaque iterators instead of integer indices.
2. Exposing a different unit in the API.
3. Living with the fact that indexing is not O(1) in general; perhaps
   with clever caching it's good enough in common cases.

Although all three choices can work, I would prefer to avoid them. If I
had to, I would probably choose 1. But for now I've chosen a
representation based on code points.

> Anyway, each time you use an index to access to some components of a
> String, the returned value is not an immutable String, but a mutable
> character or code unit or code point, from which you can build
> *other* immutable Strings

No, individual characters are immutable in almost every language.
Assignment to a character variable can be thought of as changing the
reference to point to a different character object, even if it's
physically implemented by overwriting the raw character code.

> When you do that, the returned character or code unit or code point
> does not guarantee that you'll build valid Unicode strings. In fact,
> such character-level interface is not enough to work with and
> transform Strings (for example it does not work to perform correct
> transformation of lettercase, or to manage grapheme clusters).

This is a different issue. Indeed, transformations like case mapping
work in terms of strings, but in order to implement them you must split
a string into some units of bounded size (code points, bytes, etc.).
All non-trivial string algorithms boil down to working on individual
units, because conditionals and dispatch tables must be driven by
finite sets.

Any unit of a bounded size is technically workable, but they are not
equally convenient. Most algorithms are specified in terms of code
points, so I chose code points as the basic unit in the API.
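As a concrete sketch of option 1, an opaque iterator over a
variable-width encoding might look like the following (Python used
purely for illustration; the class and method names are hypothetical,
not part of any actual implementation). Positions are remembered as
byte offsets rather than code point indices, so saving and restoring a
position is O(1) even though the encoding is variable-width:

```python
class Utf8Iter:
    """Hypothetical opaque iterator over a UTF-8 buffer (option 1 above)."""

    def __init__(self, data: bytes, pos: int = 0):
        self.data = data
        self.pos = pos          # a byte offset, never a code point index

    def clone(self) -> "Utf8Iter":
        # Remember a position to come back to later -- O(1).
        return Utf8Iter(self.data, self.pos)

    def next(self) -> str:
        # Decode one code point and advance by its byte width.
        first = self.data[self.pos]
        width = 1 if first < 0x80 else 2 if first < 0xE0 else \
                3 if first < 0xF0 else 4
        cp = self.data[self.pos:self.pos + width].decode("utf-8")
        self.pos += width
        return cp

it = Utf8Iter("aß☕".encode("utf-8"))
first_cp = it.next()        # 'a' (1 byte)
mark = it.clone()           # save the current position
second_cp = it.next()       # 'ß' (2 bytes)
replayed = mark.next()      # also 'ß': the saved position is reusable
```

The cost of this style is that every stored position carries the
iterator protocol with it, instead of being a plain integer.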
In fact, in my language there is no separate character type: a code
point extracted from a string is represented by a string of length 1.
This doesn't change the fact that indexing a string by code point index
should run in constant time, and thus using UTF-8 internally would be a
bad idea unless we implemented one of the three points above.

> Once you realize that, which UTF you use to handle immutable String
> objects is not important, because it becomes part of the "blackbox"
> implementation of String instances.

The black box must provide enough tools to implement any algorithm
specified in terms of characters which is not already provided as a
primitive by the language. Algorithms generally scan strings
sequentially, but in order to store positions and come back to them
later you must use indices or some iterators. Indices are simpler (and
in my case more efficient).

> Using SCSU for such String blackbox can be a good option if this
> effectively helps in store many strings in a compact (for global
> performance) but still very fast (for transformations) representation.

I disagree. SCSU can be a separate type to be used explicitly, but it's
a bad idea for the default string representation. Most strings are
short, and thus constant factors and simplicity matter more than the
amount of storage. And you wouldn't save much storage anyway: as I
said, in my representation strings which contain only characters
U+0000..U+00FF are stored one byte per character. The majority of
strings in average programs are ASCII.

In general, what I don't like in SCSU is that there is no obvious
compression algorithm which makes good use of its various features.
Each compression algorithm is either not as powerful as it could be,
extremely slow (trying various choices), or extremely complicated
(trying only sensible paths).
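The representation described above, where the per-code-point width is
chosen from the widest character actually present, can be sketched as
follows (a minimal illustration in Python; the function name is
hypothetical). Every code point in a given string occupies the same
number of bytes, so indexing by code point stays O(1):

```python
def storage_width(s: str) -> int:
    """Pick the narrowest fixed per-code-point storage for a string.

    Hypothetical sketch of the adaptive scheme described above: the
    width depends on the highest code point present.
    """
    highest = max(map(ord, s), default=0)
    if highest <= 0xFF:
        return 1   # only U+0000..U+00FF: one byte per code point
    if highest <= 0xFFFF:
        return 2   # fits in the BMP: two bytes per code point
    return 4       # beyond the BMP: four bytes per code point

w_ascii  = storage_width("plain ASCII")   # all <= U+007F
w_bmp    = storage_width("złoty")         # 'ł' is U+0142
w_astral = storage_width("\U0001D11E")    # U+1D11E, beyond the BMP
```

This is the same idea later adopted by CPython's flexible string
representation: compactness for the common Latin-1/ASCII case without
giving up constant-time indexing.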
> Unfortunately, the immutable String implementations in Java or C#
> or Python does not allow the application designer to decide which
> representation will be the best (they are implemented as concrete
> classes instead of virtual interfaces with possible multiple
> implementations, as they should; the alternative to interfaces would
> have been class-level methods allowing the application to trade with
> the blackbox class implementation the tuning parameters).

Some functions accept any sequence of characters. Other functions
accept only standard strings. The question is how often to use each
style. Choosing the first option increases flexibility but adds an
overhead in the common case. For example, case mapping of a string
would have to either perform dispatch at each step, or be implemented
twice. Currently it's implemented for strings only, in C, and thus
avoids calling a generic indexing function and other overheads. At some
point I will probably implement it again, to work for arbitrary
sequences of characters, but it's more work for benefits that I don't
currently need, so it's not a priority.

Sometimes the flexibility is only on the surface. If I allowed any
sequence of characters to specify a filename, it would have to be
converted to an array of characters in order to be passed to the OS
anyway. And if a string is used as a filename many times, it would be
converted many times.

You are free to use other string representations in your code,
especially in dynamically typed languages, which constrain the type
only in low-level functions like using a string as a filename. But
there is always some primitive character or string type, to make a
generic string API possible at all (what would your virtual interfaces
use?), and generally some primitive string type to make common cases of
string handling fast.
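The "implemented twice or dispatch at each step" tradeoff can be shown
with a toy example (again in Python for illustration only; this is not
the actual implementation being discussed). The concrete string type
gets a specialized fast path, while any other sequence of characters
pays a generic dispatch cost on every element:

```python
def to_upper(s):
    """Toy illustration of the two implementation styles above."""
    if isinstance(s, str):
        # Fast path: one call into specialized code
        # (in CPython, str.upper runs in C).
        return s.upper()
    # Slow generic path: per-element iteration and a method
    # dispatch for every single character.
    return "".join(ch.upper() for ch in s)

fast = to_upper("straße")          # specialized path
slow = to_upper(list("straße"))    # generic path, same result
```

Both paths give the same answer; the point is only that the generic one
does per-character work that the specialized one avoids.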
Note that if the other representation does not allow O(1) indexing by
code point *and* it's not enough to always scan characters sequentially
from beginning to end, *then* you must invent some iteration protocol
not currently present in my language.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/