From: Jill Ramonsky
> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool
> meaning "already normalised". This would definitely speed
> things up. Of course, for any strings coming in from
> "outside", I'd still have to assume they were not
> normalised, just in case.
I had the same experience, and I solved it by using an additional byte in my string objects that holds the current normalization form(s) as a bitfield, with one bit set for each of the NFC, NFD, NFKC and NFKD forms the string is known to be in. The bitfield is all zeroes while the form is unknown (still unparsed), and a bit is set once the string has been checked for that form. I don't always require a string to be in any NF* form, so I perform normalization only when actually needed, after testing the relevant bit to see whether parsing is required. This saves a lot of unnecessary string allocations and copies and greatly reduces the process VM footprint.

The same optimization can be done in Java by subclassing the String class to add a "form" field with the related conversion getters and test methods. To further reduce the memory footprint of Java strings, I chose to store the string in an array of bytes in UTF-8 instead of an array of chars in UTF-16. The internal representation is chosen dynamically, depending on how the string is used: if it is not accessed often by char index (which in Java does not correspond to actual Unicode code point indices anyway, as there may be surrogates), the UTF-8 representation uses less memory in most cases. With a custom class loader it is possible to override the default String class used by the Java core libraries. (Note that compiled Java .class files already store String constants internally in UTF-8, which keeps them independent of the architecture; it is the class loader that converts that byte storage into the actual char storage, i.e. UTF-16, at runtime.)

Looking at the Java VM specification, there does not seem to be anything implying that a Java "char" is necessarily a 16-bit entity. So I think that some day there will be a conforming Java VM that returns UTF-32 code points in a single char, or some derived representation using 24-bit storage units. Changes of representation for Strings already happen in Java, and similar techniques could be used in C#, ECMAScript, and so on. Handling UTF-16 surrogates would then be a thing of the past, except when using the legacy String APIs, which would continue to emulate UTF-16 code unit indices. Depending on runtime tuning parameters, the internal representation of String objects may (and should) become transparent to applications.

One future goal would be a full Unicode String API that returns real characters as grapheme clusters of varying length, in a way that is comparable, orderable, etc., to better match what users consider a string "length" (i.e. a number of grapheme clusters, or simply of combining sequences if we exclude the more complex cases of Hangul jamos and Brahmic clusters). This leads to many discussions about what a "character" is... It may be context specific (depending on the application's needs, the system locale, or user preferences). For XML, which recommends (but does not mandate) the NFC form, the definition of a character seems to be mostly the combining sequence. It is very strange, however, that this is a SHOULD and not a MUST, as it may produce unpredictable results in XML applications depending on whether the SHOULD is implemented in the XML parser or transformation engine.
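
Here is a minimal Java sketch of the bitfield idea. The class and member names are only illustrative, and since java.lang.String is final the sketch wraps a String rather than subclassing it; only the NFC path is shown (the other forms work the same way), using java.text.Normalizer for the actual check and conversion:

    import java.text.Normalizer;

    // Illustrative wrapper: one bit per normalization form, all zeroes
    // while the form is still unknown (unparsed).
    final class NormTrackedString {
        private static final int NFC  = 1;
        private static final int NFD  = 1 << 1;
        private static final int NFKC = 1 << 2;
        private static final int NFKD = 1 << 3;

        private String value;
        private byte forms;              // known-forms bitfield, 0 = unparsed

        NormTrackedString(String value) {
            this.value = value;
        }

        // Lazily normalize to NFC, and only if the bit says it is needed.
        String toNFC() {
            if ((forms & NFC) == 0) {
                if (Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
                    forms |= NFC;        // already NFC, just remember it
                } else {
                    value = Normalizer.normalize(value, Normalizer.Form.NFC);
                    forms = NFC;         // older bits may no longer apply
                }
            }
            return value;
        }

        @Override public String toString() { return value; }
    }

The allocation and copy happen only on the first call that really needs NFC; every later call just tests the bit.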

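And a rough sketch of the dynamic UTF-8/UTF-16 representation choice described above (again with invented names; it only switches to char storage when char-indexed access is actually requested):

    import java.nio.charset.StandardCharsets;

    // Illustrative holder: keeps the compact UTF-8 bytes until a caller
    // really asks for char (UTF-16 code unit) indexing.
    final class CompactText {
        private byte[] utf8;     // compact representation, null once expanded
        private char[] utf16;    // expanded representation, null until needed

        CompactText(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        }

        // Frequent char-indexed access pays the UTF-16 footprint;
        // everything else keeps the (usually smaller) UTF-8 storage.
        char charAt(int index) {
            if (utf16 == null) {
                utf16 = new String(utf8, StandardCharsets.UTF_8).toCharArray();
                utf8 = null;     // drop the compact copy once expanded
            }
            return utf16[index];
        }

        @Override public String toString() {
            return (utf16 != null) ? new String(utf16)
                                   : new String(utf8, StandardCharsets.UTF_8);
        }
    }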

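As for counting "real characters", the standard java.text.BreakIterator already gives an approximation of grapheme cluster boundaries; a small example of the difference with String.length():

    import java.text.BreakIterator;

    // "élève" written with combining accents: 7 UTF-16 code units,
    // but only 5 user-perceived characters.
    public final class GraphemeLength {
        static int graphemes(String s) {
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int count = 0;
            while (it.next() != BreakIterator.DONE) {
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            String s = "e\u0301le\u0300ve";
            System.out.println(s.length());    // 7
            System.out.println(graphemes(s));  // 5
        }
    }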