... In fact, to further optimize and reduce the memory footprint of Java strings, I chose to store the String in an array of bytes with UTF-8, instead of an array of chars with UTF-16.
This does or does not save space and time depending on the average string contents and on what kind of processing you do.
The internal representation is chosen dynamically, depending on how that string is used: if the string is not often accessed by char indices (which in Java do not correspond to actual Unicode code point indices, as there may be surrogates), the UTF-8 representation uses less memory in most cases.
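For what it's worth, here is a small self-contained sketch (plain Java, nothing VM-specific; the class and method names are just made up for illustration) showing why the space question depends on the contents:

    import java.nio.charset.StandardCharsets;

    public class StringStorageDemo {
        public static void main(String[] args) {
            // Mostly-ASCII text: UTF-8 needs one byte per character,
            // while the char-based (UTF-16) representation needs two.
            String latin = "Hello, world!";
            // CJK text: UTF-8 needs three bytes per character here,
            // so the UTF-16 representation is actually smaller.
            String cjk = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";

            report("Latin", latin);
            report("CJK  ", cjk);
        }

        static void report(String label, String s) {
            int utf16Bytes = s.length() * 2;  // two bytes per UTF-16 code unit
            int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(label + ": UTF-16 = " + utf16Bytes
                    + " bytes, UTF-8 = " + utf8Bytes + " bytes");
        }
    }

The time side of the trade-off is similar: decoding UTF-8 on every indexed access costs CPU, so it also depends on how the string is processed.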
It is possible, with a custom class loader, to override the default String class used in the Java core libraries (note that compiled Java .class files use UTF-8 for internally stored String constants,
No. It's close to UTF-8, but .class files use Java's own modified UTF-8 rather than standard UTF-8. See the .class file documentation from Sun.
as this provides independence from the architecture, and it is the class loader that transforms the byte storage of String constants into the actual char storage, i.e. currently UTF-16, at runtime.)
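The difference between the class-file encoding and standard UTF-8 is easy to see, because DataOutputStream.writeUTF() uses essentially the same modified UTF-8 as class file constants: U+0000 and supplementary characters are encoded differently from standard UTF-8. A quick sketch, purely for illustration (the sample characters are arbitrary):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // U+0000 plus a supplementary character (U+10400, DESERET CAPITAL LETTER LONG I).
            String s = "A\u0000" + new String(Character.toChars(0x10400));

            byte[] standard = s.getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);  // two-byte length prefix, then modified UTF-8
            byte[] modified = buffer.toByteArray();

            System.out.println("standard UTF-8: " + hex(standard));
            System.out.println("modified UTF-8: " + hex(modified));
            // Standard UTF-8 encodes U+0000 as 00 and U+10400 as four bytes;
            // modified UTF-8 encodes U+0000 as C0 80 and U+10400 as two
            // three-byte sequences (one per UTF-16 surrogate), six bytes total.
        }

        static String hex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            return sb.toString().trim();
        }
    }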
Looking at the Java VM specification, there does not seem to be anything implying that a Java "char" is necessarily a 16-bit entity. So I think that some day there will be a conforming Java VM that returns UTF-32 code points in a single char, or some derived representation using 24-bit storage units.
I don't know about the VM spec, but the language and its APIs have 16-bit chars wired deeply into them. It would be possible to _add_ a new char32 type, but that is not planned, as far as I know. _Changing_ char would break all sorts of code. However, as far as I have heard, a future Java release may provide access to Unicode code points and use ints for them.
(And please do not confuse using a single integer for a code point with UTF-32 - UTF-32 is an encoding form for _strings_ requiring a certain bit pattern. Integers containing code points are just that, integers containing code points, not any UTF.)
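To make the last point concrete, here is a sketch of how int-based code point access can look on top of a 16-bit string (the accessor names codePointAt and charCount are the ones I would expect; take this as an illustration of the idea, not a committed API):

    public class CodePointDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF) needs a surrogate pair in UTF-16.
            String s = "a\uD834\uDD1Eb";

            System.out.println("UTF-16 code units: " + s.length());

            // Iterate by code point: each code point comes back as a plain int;
            // the string itself is still a sequence of 16-bit chars.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.println(String.format("U+%04X", cp));
                i += Character.charCount(cp);  // advance by one or two code units
            }
        }
    }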
So there are already some changes of representation for Strings in Java, and similar techniques could be used as well in C#, ECMAScript, and so on...
I am quite confident that existing languages like these will keep using 16-bit Unicode strings, for the same reasons as for Java: Changing the string units would break all kinds of code.
Besides, most software with good Unicode support and non-trivial string handling uses 16-bit Unicode strings, which avoids transformations where software components meet.
... Depending on runtime tuning parameters, the internal representation of String objects may (should) become transparent to applications. One future goal
The internal representation is already transparent in languages like Java. The API behavior has to match the documentation, though, and cannot be changed on a whim.
would be a full Unicode String API that returns real characters as grapheme clusters of varying length, in a way that is comparable, orderable, etc., to better match what users consider the "length" of a string (i.e. a number of grapheme clusters, or simply of combining sequences if we exclude the more complex cases of Hangul jamos and Brahmic clusters).
This is overkill for low-level string handling, and is available via library functions. Such library functions might be part of a language's standard libraries, but won't replace low-level access functions.
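For example, java.text.BreakIterator already provides a library-level approximation of user-perceived characters on top of the low-level 16-bit string. A minimal sketch (the sample text is arbitrary):

    import java.text.BreakIterator;

    public class GraphemeDemo {
        public static void main(String[] args) {
            // "e" + U+0301 COMBINING ACUTE ACCENT: two code points,
            // but one user-perceived character.
            String s = "e\u0301 abc";

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);

            int clusters = 0;
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                    start = end, end = it.next()) {
                System.out.println("cluster: [" + s.substring(start, end) + "]");
                clusters++;
            }
            System.out.println("char length:   " + s.length());
            System.out.println("cluster count: " + clusters);
        }
    }

Code like this can live in libraries and be used where the distinction matters, while charAt() and friends keep their low-level meaning.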
Best regards, markus
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.

