From: Jill Ramonsky
> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool
> meaning "already normalised". This would definitely speed
> things up. Of course, for any strings coming in from
> "outside", I'd still have to assume they were not
> normalised, just in case.
I had the same experience, and I solved it by using an additional byte in my string objects that holds the current normalization form(s) as a bitfield, with one bit set for each of the NFC, NFD, NFKC and NFKD forms the string is known to be in. The bitfield is all zeroes while the form is unknown (still unparsed), and a bit is set once the string has been checked for that form. I don't always require a string to be in any NF* form, so I perform normalization only when actually needed, after testing the relevant bit to see whether parsing is required. This saves a lot of unnecessary string allocations and copies and greatly reduces the process VM footprint.

The same optimization can be done in Java by subclassing the String class to add a "form" field with the related conversion getters and test methods. To further reduce the memory footprint of Java strings, I chose to store the string in an array of bytes in UTF-8 instead of an array of chars in UTF-16. The internal representation is chosen dynamically, depending on how the string is used: if it is not accessed often by char index (which in Java does not correspond to actual Unicode code point indices anyway, as there may be surrogates), the UTF-8 representation uses less memory in most cases. With a custom class loader it is possible to override the default String class used by the Java core libraries. (Note that compiled Java .class files already store String constants internally in UTF-8, which keeps them independent of the architecture; it is the class loader that converts that byte storage into the actual char storage, i.e. UTF-16, at runtime.)

Looking at the Java VM specification, there does not seem to be anything implying that a Java "char" is necessarily a 16-bit entity. So I think that some day there will be a conforming Java VM that returns UTF-32 code points in a single char, or some derived representation using 24-bit storage units. Changes of representation for Strings already happen in Java, and similar techniques could be used in C#, ECMAScript, and so on. Handling UTF-16 surrogates would then be a thing of the past, except when using the legacy String APIs, which would continue to emulate UTF-16 code unit indices. Depending on runtime tuning parameters, the internal representation of String objects may (and should) become transparent to applications.

One future goal would be a full Unicode String API that returns real characters as grapheme clusters of varying length, in a way that is comparable, orderable, etc., to better match what users consider a string "length" (i.e. a number of grapheme clusters, or simply of combining sequences if we exclude the more complex cases of Hangul jamos and Brahmic clusters). This leads to many discussions about what a "character" is... It may be context specific (depending on the application's needs, the system locale, or user preferences). For XML, which recommends (but does not mandate) the NFC form, the definition of a character seems to be mostly the combining sequence. It is very strange, however, that this is a SHOULD and not a MUST, as it may produce unpredictable results in XML applications depending on whether the SHOULD is implemented in the XML parser or transformation engine.
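
Here is a minimal Java sketch of the bitfield idea. The class and member names are only illustrative, and since java.lang.String is final the sketch wraps a String rather than subclassing it; only the NFC path is shown (the other forms work the same way), using java.text.Normalizer for the actual check and conversion:

    import java.text.Normalizer;

    // Illustrative wrapper: one bit per normalization form, all zeroes
    // while the form is still unknown (unparsed).
    final class NormTrackedString {
        private static final int NFC  = 1;
        private static final int NFD  = 1 << 1;
        private static final int NFKC = 1 << 2;
        private static final int NFKD = 1 << 3;

        private String value;
        private byte forms;              // known-forms bitfield, 0 = unparsed

        NormTrackedString(String value) {
            this.value = value;
        }

        // Lazily normalize to NFC, and only if the bit says it is needed.
        String toNFC() {
            if ((forms & NFC) == 0) {
                if (Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
                    forms |= NFC;        // already NFC, just remember it
                } else {
                    value = Normalizer.normalize(value, Normalizer.Form.NFC);
                    forms = NFC;         // older bits may no longer apply
                }
            }
            return value;
        }

        @Override public String toString() { return value; }
    }

The allocation and copy happen only on the first call that really needs NFC; every later call just tests the bit.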

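And a rough sketch of the dynamic UTF-8/UTF-16 representation choice described above (again with invented names; it only switches to char storage when char-indexed access is actually requested):

    import java.nio.charset.StandardCharsets;

    // Illustrative holder: keeps the compact UTF-8 bytes until a caller
    // really asks for char (UTF-16 code unit) indexing.
    final class CompactText {
        private byte[] utf8;     // compact representation, null once expanded
        private char[] utf16;    // expanded representation, null until needed

        CompactText(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        }

        // Frequent char-indexed access pays the UTF-16 footprint;
        // everything else keeps the (usually smaller) UTF-8 storage.
        char charAt(int index) {
            if (utf16 == null) {
                utf16 = new String(utf8, StandardCharsets.UTF_8).toCharArray();
                utf8 = null;     // drop the compact copy once expanded
            }
            return utf16[index];
        }

        @Override public String toString() {
            return (utf16 != null) ? new String(utf16)
                                   : new String(utf8, StandardCharsets.UTF_8);
        }
    }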

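As for counting "real characters", the standard java.text.BreakIterator already gives an approximation of grapheme cluster boundaries; a small example of the difference with String.length():

    import java.text.BreakIterator;

    // "élève" written with combining accents: 7 UTF-16 code units,
    // but only 5 user-perceived characters.
    public final class GraphemeLength {
        static int graphemes(String s) {
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int count = 0;
            while (it.next() != BreakIterator.DONE) {
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            String s = "e\u0301le\u0300ve";
            System.out.println(s.length());    // 7
            System.out.println(graphemes(s));  // 5
        }
    }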