Philippe Verdy wrote:
... In fact, to further optimize and reduce the
memory footprint of Java strings, I chose to store
the String in an array of bytes with UTF-8, instead of an
array of chars with UTF-16.

This does or does not save space and time depending on the average string contents and on what kind of processing you do.


The internal representation is chosen dynamically, depending
on usage of that string: if the string is not accessed often
with char indices (which in Java do not correspond to actual
Unicode code point indices, as there may be surrogates), the
UTF-8 representation uses less memory in most cases.
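
To make the tradeoff concrete, here is a minimal sketch of that general
idea (not the actual class described above; the name CompactText, the
access counter, and the switch-over threshold are all made up): keep the
UTF-8 bytes by default, and convert to a char array only once the string
is indexed often.

    import java.nio.charset.StandardCharsets;

    // Minimal sketch only: CompactText, the threshold of 8, and the lazy
    // conversion are illustrative choices, not the implementation quoted above.
    final class CompactText {
        private byte[] utf8;   // non-null while stored as UTF-8
        private char[] utf16;  // non-null once converted to UTF-16
        private int charAccesses;

        CompactText(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8); // start with the compact form
        }

        char charAt(int index) {
            if (utf16 == null && ++charAccesses > 8) {
                // Frequent char-index access: pay for the UTF-16 copy once,
                // then index in O(1).
                utf16 = new String(utf8, StandardCharsets.UTF_8).toCharArray();
                utf8 = null;
            }
            if (utf16 != null) {
                return utf16[index];
            }
            // Rare access: decode on demand, keeping the smaller UTF-8 footprint.
            return new String(utf8, StandardCharsets.UTF_8).charAt(index);
        }

        @Override
        public String toString() {
            return utf16 != null
                    ? new String(utf16)
                    : new String(utf8, StandardCharsets.UTF_8);
        }
    }

The switch-over threshold is only a tuning knob; there is no single value
that is right for every access pattern.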

It is possible, with a custom class loader, to override the default
String class used in the Java core libraries (note that compiled
Java .class files use UTF-8 for internally stored String constants,

No. It's close to UTF-8, but .class files use a proprietary encoding instead of UTF-8. See the .class file documentation from Sun.


as this allows independence from the architecture, and it is the
class loader that transforms the byte storage of String constants
into actual char storage, i.e. currently UTF-16 at runtime.)
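
The difference is easy to see from code: DataOutputStream.writeUTF writes
the "modified UTF-8" described in the java.io.DataInput documentation,
which as far as I know is the same format the .class constant pool uses,
while String.getBytes gives real UTF-8. A small sketch (class name and
sample string are mine):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // U+0000 plus a supplementary character (U+1D11E, musical G clef).
            String s = "A\u0000" + new String(Character.toChars(0x1D11E));

            // Standard UTF-8: U+0000 -> 00, U+1D11E -> F0 9D 84 9E (one 4-byte sequence).
            byte[] standard = s.getBytes(StandardCharsets.UTF_8);

            // Modified UTF-8: U+0000 -> C0 80, and the supplementary character
            // becomes two 3-byte sequences, one per UTF-16 surrogate.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeUTF(s);
            byte[] modified = buffer.toByteArray(); // starts with a 2-byte length prefix

            System.out.println("standard UTF-8 : " + toHex(standard));
            System.out.println("modified UTF-8 : " + toHex(modified));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b));
            }
            return sb.toString().trim();
        }
    }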

Looking at the Java VM specification, there does not seem to be
anything implying that a Java "char" is necessarily a 16-bit
entity. So I think that there will some day be a conforming
Java VM that returns UTF-32 code points in a single char, or
some derived representation using 24-bit storage units.

I don't know about the VM spec, but the language and its APIs have 16-bit chars wired deeply into them. It would be possible to _add_ a new char32 type, but that is not planned, as far as I know. _Changing_ char would break all sorts of code. However, as far as I have heard, a future Java release may provide access to Unicode code points and use ints for them.


(And please do not confuse using a single integer for a code point with UTF-32 - UTF-32 is an encoding form for _strings_ requiring a certain bit pattern. Integers containing code points are just that, integers containing code points, not any UTF.)
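
For what it's worth, handing out code points as plain ints requires no
change to char at all. Here is a small sketch that pairs surrogates by
hand, so it does not depend on whatever API a future release might add
(class name and sample string are mine):

    // Iterate a String and report each code point as an int,
    // combining UTF-16 surrogate pairs manually.
    public class CodePointIteration {
        public static void main(String[] args) {
            String s = "A\uD834\uDD1E!"; // 'A', U+1D11E as a surrogate pair, '!'
            for (int i = 0; i < s.length(); ) {
                char c = s.charAt(i);
                int cp;
                if (c >= 0xD800 && c <= 0xDBFF            // high surrogate
                        && i + 1 < s.length()
                        && s.charAt(i + 1) >= 0xDC00      // followed by a low surrogate
                        && s.charAt(i + 1) <= 0xDFFF) {
                    cp = ((c - 0xD800) << 10) + (s.charAt(i + 1) - 0xDC00) + 0x10000;
                    i += 2;
                } else {
                    cp = c;
                    i += 1;
                }
                System.out.printf("U+%04X%n", cp); // prints U+0041, U+1D11E, U+0021
            }
        }
    }

The ints here are just code point values; nothing in this loop needs or
produces UTF-32 strings.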

So there already are some changes of representation for Strings in
Java, and similar techniques could be used as well in C#, ECMAScript,
and so on...

I am quite confident that existing languages like these will keep using 16-bit Unicode strings, for the same reasons as for Java: Changing the string units would break all kinds of code.


Besides, most software with good Unicode support and non-trivial string handling uses 16-bit Unicode strings, which avoids transformations where software components meet.

... Depending on runtime
tuning parameters, the internal representation of String objects may
(should) become transparent to applications.

The internal representation is already transparent in languages like Java. The API behavior has to match the documentation, though, and cannot be changed on a whim.


One future goal would be that a full Unicode String API will return
real characters as grapheme clusters of varying length, in a way
that makes them comparable, orderable, etc., to better match what
users consider a string "length" (i.e. a number of grapheme clusters,
if not simply a combining sequence, if we exclude the more complex
cases of Hangul Jamos and Brahmic clusters).

This is overkill for low-level string handling, and is available via library functions. Such library functions might be part of a language's standard libraries, but won't replace low-level access functions.
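
In Java, for example, java.text.BreakIterator.getCharacterInstance()
already segments text into user-perceived characters. A small sketch
(the sample string is mine, and exact boundaries can differ between
implementations and versions):

    import java.text.BreakIterator;

    public class GraphemeCount {
        public static void main(String[] args) {
            // 'a' + combining acute accent, 'e' + combining circumflex: 4 chars.
            String s = "a\u0301e\u0302";

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);

            int clusters = 0;
            for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
                clusters++;
            }

            System.out.println("char length: " + s.length()); // 4
            System.out.println("clusters:    " + clusters);   // typically 2
        }
    }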


Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



