John Cowan <jcowan at reutershealth dot com> wrote:
A 32-bit length count, followed by an array of N arbitrary Unicode characters, would probably be the best implementation today.
Which is essentially what the Java String class has, if you unwrap it.
Then why do the DataInput and DataOutput interfaces perform this special conversion? The page whose URL Theodore originally provided makes no mention of compatibility with C strings. If a Java String consists of a count followed by the data, why would "embedded nulls" in the data make any difference?
Needed by the class loader, to read the string constants in the constant pool of compiled classes.
Needed in the JNI interface to C, which has a legacy 8-bit string interface inherited from old versions of Java (that interface has no separate string-length indicator and relies on null-terminated strings).
But not needed with the newer JNI interfaces for C, where strings are arrays of 16-bit "char" code units with a separate, explicit 32-bit string-length indicator (so there is no need to escape nulls).
Not needed and not used for file or stream I/O, where *true* UTF-8 is supported by the Charset instance named "UTF-8" (which fully complies with the Unicode definition of UTF-8); see the sketch below.
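To make the difference concrete, here is a minimal self-contained sketch (the class name Utf8Demo and the sample string are just illustrative) contrasting the modified UTF-8 written by DataOutputStream.writeUTF with the true UTF-8 produced by the standard Charset:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class Utf8Demo {
        public static void main(String[] args) throws IOException {
            // A string containing an embedded U+0000 and a supplementary
            // character (U+1F600, written as a surrogate pair).
            String s = "A\u0000\uD83D\uDE00";

            // DataOutputStream.writeUTF: 2-byte length prefix, then modified
            // UTF-8 (U+0000 becomes C0 80; the supplementary character is
            // encoded as two 3-byte sequences, one per surrogate).
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            System.out.println("writeUTF : " + toHex(bos.toByteArray()));

            // The "UTF-8" Charset: true UTF-8 (U+0000 is a single 00 byte;
            // the supplementary character is one 4-byte sequence).
            System.out.println("Charset  : " + toHex(s.getBytes(StandardCharsets.UTF_8)));

            // The String itself is just a counted sequence of 16-bit code
            // units, so the embedded null needs no special treatment there.
            System.out.println("length() : " + s.length());
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
            return sb.toString().trim();
        }
    }

Running it shows that the writeUTF output contains no 00 byte at all (the embedded null comes out as C0 80, and the supplementary character as six bytes), while the Charset output contains a literal 00 byte and a single 4-byte sequence, which is exactly why the two encodings cannot be interchanged blindly.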

