From: "Markus Scherer" <[EMAIL PROTECTED]>

> Rick Cameron wrote:
> > Does anyone know of a String class, in C++, Java or .NET, that hides the
> > actual encoding used, and provides a public API based on code points?
>
> Possible, but that would be inefficient. If you have a mismatch between your API semantics and your
> implementation on such a fundamental level, then you incur a performance penalty. Either you
> increase your operation time because you have to translate indexes on the fly, or you increase your
> memory use because you have to keep and maintain a map. You can optimize it, of course, especially
> for UTF-16 where surrogate pairs will be extremely rare, but there is very little point in not
> indexing 16-bit code units.
Actually the increase in operation time with UTF-16 is extremely small, compared to the cost of adding and maintaining a new set of APIs based on UTF-32, of adapting existing algorithms that already work well with UTF-16 to larger code units, or to the additional memory cost in almost all modern languages where the API would be used. Handling UTF-16 surrogates, which can very easily be recognized and parsed in both directions, is not a time-critical operation. The only case where it could cause problems is when working with substrings, but such behavior can be specified precisely in the updated UTF-16 API.

Additionally, whoever imagines that it is always safe to create substrings based only on UTF-32 code units or code points forgets that such an operation is no safer, because the more user-friendly level is the default grapheme cluster. More generally, an API based on handling individual "characters" (in fact code points) will have its own caveats, notably because of the effects of normalization (reordering or recombining of combining characters). A good i18n-enabled application or API should work with strings rather than with individual "characters" (or code points, or UTF-32 code units). So the real difficulty is not in the choice of internal representation between UTF-16 and UTF-32, but in performing safe computing on grapheme clusters.

Even if a String is internally stored and managed as UTF-16 code units, it is straightforward to work with it without transcoding into temporary buffers, using iterators. An iterator that parses a UTF-16 encoded string and returns UTF-32 code units or code points can safely be composed with more complex iterators such as grapheme cluster iterators. Libraries that perform normalization of UTF-16 strings without creating any instance of a UTF-32 string already exist.
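Such a code point iterator over a UTF-16 string can be sketched in Java, whose String is already backed by UTF-16 code units; `codePointAt` and `Character.charCount` are standard JDK methods that pair surrogates transparently (the class name here is only illustrative):

```java
// Sketch: iterating code points over a UTF-16 backed string without
// transcoding it into a temporary UTF-32 buffer.
public class CodePointIterator {
    private final String s;  // UTF-16 code units
    private int index;       // current code unit index

    public CodePointIterator(String s) { this.s = s; }

    public boolean hasNext() { return index < s.length(); }

    public int next() {
        int cp = s.codePointAt(index);    // combines a surrogate pair if present
        index += Character.charCount(cp); // advance by 1 or 2 code units
        return cp;
    }

    public static void main(String[] args) {
        // "A" + MATHEMATICAL BOLD CAPITAL A (a surrogate pair) + "B"
        CodePointIterator it = new CodePointIterator("A\uD835\uDC00B");
        while (it.hasNext()) {
            System.out.printf("U+%04X%n", it.next());
        }
    }
}
```

A grapheme cluster iterator could then be layered on top of this one, consuming code points instead of code units.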
The performance difference relative to a normalizer working on UTF-32 string instances is extremely small, because the effective complexity is not at the internal encoding level.

Now imagine a database defined to store a length-limited VARCHAR(32) field, where the SQL CHAR type is bound at database creation time to a UTF-32 code unit. Nothing forbids the database from using UTF-16 internally in its tables for storage, as counting the number of equivalent UTF-32 code units for the same string is extremely easy (so a field constraint check has very little to do to count characters, not significantly more than counting characters in a null-terminated C/C++ string...).

JavaScript/ECMAScript and Java both define their "char" to be a UTF-16 code unit, not necessarily a full code point. This is not a limitation, as soon as you realize that the strlen() function or length() method is not specified to return the number of code points but the number of code units needed to allocate backing-store arrays... A good String library would also add a function to compute the number of code points in the same string, and its performance would be equivalent to strlen() or length(). If the string should not be rescanned, a String class could as well maintain a cache of the number of code points it stores. But today, most software already performs strlen() calls extremely often, without impact. In Java, where the String.length() method does not scan the backing store but uses the cached length field of a sized array instance, one can just as well add a member field to cache the code point count...

In fact the actual internal storage of a String may still be UTF-32 or UTF-8 or SCSU or CESU-8 or any other Unicode-compatible encoding, even if the API is exposed in terms of UTF-16 code units. The actual size of code units in memory should be hidden at the String API level, allowing various internal representations that best fit the String's source, usage, and target.
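Java already exposes both counts on the same UTF-16 backing store: `length()` returns the code unit count in O(1) from a cached field, while the standard `codePointCount` method scans for surrogate pairs. A minimal demonstration:

```java
// Code unit length versus code point length on the same UTF-16 string.
public class CodePointCount {
    public static void main(String[] args) {
        // "a" + EMOJI GRINNING FACE (a surrogate pair) + "b"
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
    }
}
```

A VARCHAR(32) constraint check expressed in UTF-32 code units reduces to exactly this `codePointCount` scan over the UTF-16 storage.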
In fact it is also possible to allow mixing various internal encodings across String instances, by caching some member fields that store or flag the convention used for encoding/decoding (this could also include the normalization state of a string, or perhaps other states such as previously checked letter cases). Consider a String as an object with a required immutable content (the string value) and various properties which can be either computed on demand or cached as additional members; you can then minimize the number of transcoding, folding or transform operations needed to make a string conform to an interface, by delaying all these operations until they are really needed. In most cases, the Character level is not very useful except temporarily, for computing these operations...
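A minimal sketch of such lazy caching, assuming a hypothetical `CachedString` wrapper around an immutable Java String; only the code point count is cached here, but a normalization-state flag could be cached the same way:

```java
// Hypothetical sketch: an immutable string wrapper that computes derived
// properties on first demand and caches them, so repeated queries cost
// no more than a field read.
public final class CachedString {
    private final String value;       // immutable UTF-16 content
    private int codePointCount = -1;  // -1 means "not yet computed"

    public CachedString(String value) { this.value = value; }

    public int codeUnitLength() {
        return value.length();        // O(1): String caches its own length
    }

    public int codePointLength() {
        if (codePointCount < 0) {     // scan once, then reuse the cached value
            codePointCount = value.codePointCount(0, value.length());
        }
        return codePointCount;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT + a surrogate pair: 4 units, 3 points
        CachedString s = new CachedString("e\u0301\uD835\uDC00");
        System.out.println(s.codeUnitLength());
        System.out.println(s.codePointLength());
    }
}
```

Because the wrapped value is immutable, the cached property can never go stale, which is what makes this delayed computation safe.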

