Title: RE: Unicode forms for internal storage

> Last night it occurred to me it might be possible to design an
> internal storage format for this class which had better memory usage
> characteristics. In particular I'd like ASCII data to occupy only a
> single byte, and all other BMP characters from 128 to 65535 to occupy
> only two bytes. Non-BMP characters could be stored in surrogate pairs.

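        BZZZT!  Sorry, thanks for playing.  You can't get the advantages of both with no drawbacks: without some marker, a decoder has no way to tell one-octet characters from two-octet ones.  Given the octets 0x5B 0x5B, how would you know whether you had "[[" or the single Chinese character U+5B5B?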
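        To make the ambiguity concrete, here is a minimal Java sketch.  The class and method names (AmbiguityDemo, asSingleByte, asDoubleByte) are made up for illustration, and I'm assuming big-endian order for the two-octet reading.  It decodes the same two octets both ways:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same two octets read under two interpretations.
class AmbiguityDemo {
    // The two octets from the example: 0x5B 0x5B.
    static final byte[] OCTETS = { 0x5B, 0x5B };

    // Read each octet as one ASCII character.
    static String asSingleByte() {
        return new String(OCTETS, StandardCharsets.US_ASCII);
    }

    // Read both octets as one big-endian UTF-16 code unit (assumed byte order).
    static String asDoubleByte() {
        return new String(OCTETS, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(asSingleByte().length()); // prints 2 (the string "[[")
        System.out.println(asDoubleByte().length()); // prints 1 (the character U+5B5B)
    }
}
```

        Both readings are valid, so the stored form has to carry some extra information saying which one was meant.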

> 3. This is all completely private to one class. No data in this form
> will be passed on the wire. None will be exposed via the public API
> which is completely based on Java strings (that is, UTF-16).

        Good idea.  We have too many external encodings anyway.

> However, I would like the translation into and out of this format to
> be at least as fast as the translation between UTF-8 and UTF-16 the
> class is currently performing on every call to setValue and getValue,
> ideally faster.

        Hmmm - again, this may be asking for too much.  The UTF-8/UTF-16 transform is pretty simple.  Is it bogging you down?

> Has anyone done any work on Unicode formats for this use-case? Does
> anyone have any references or ideas to share?

        If your application will mostly handle European languages, just use UTF-8; if it will mostly handle non-European languages, just use UTF-16.  Either way you won't really lose much space.  If language usage is random, indeterminate, or evenly distributed, then, assuming that any given string is primarily in a single language, a TLV-style header discriminating between UTF-8 and UTF-16 should do nicely.  Precede each string with a two-octet header: the MSB is the encoding flag (0 for UTF-8, 1 for UTF-16), ORed with the length of the string in octets in the remaining 15 bits (hence a maximum of 32,767 octets per string, which shouldn't ordinarily be a problem).  Then encode the string in whichever of the two formats is more compact for it.  Since you have a length, you can skip the terminator.  The resulting structure is at most one octet longer than the string would have been as straight, terminated UTF-8 or UTF-16, and is double-octet aligned, so native UTF-16 functions can be used if they exist.
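        A rough Java sketch of that header scheme follows.  Everything named here (TaggedString, encode, decode, the constants) is hypothetical, and the details are my assumptions rather than a fixed design: a big-endian header, UTF-16BE for the two-octet form, and "pick whichever encoding is shorter" as the selection rule.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: two-octet header (1 flag bit + 15 length bits), then the payload.
final class TaggedString {
    static final int UTF16_FLAG = 0x8000; // MSB set means the payload is UTF-16
    static final int MAX_LENGTH = 0x7FFF; // 32,767 octets fit in the low 15 bits

    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        // Assumed selection rule: use UTF-16 only when it is strictly shorter.
        boolean useUtf16 = utf16.length < utf8.length;
        byte[] payload = useUtf16 ? utf16 : utf8;
        if (payload.length > MAX_LENGTH) {
            throw new IllegalArgumentException("string too long for a 15-bit length");
        }
        int header = (useUtf16 ? UTF16_FLAG : 0) | payload.length;
        byte[] out = new byte[2 + payload.length];
        out[0] = (byte) (header >>> 8); // high octet: flag bit plus top 7 length bits
        out[1] = (byte) header;         // low octet: bottom 8 length bits
        System.arraycopy(payload, 0, out, 2, payload.length);
        return out;
    }

    static String decode(byte[] data) {
        int header = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        Charset charset = (header & UTF16_FLAG) != 0
                ? StandardCharsets.UTF_16BE
                : StandardCharsets.UTF_8;
        int length = header & MAX_LENGTH; // payload length in octets, no terminator
        return new String(data, 2, length, charset);
    }
}
```

        With this sketch, TaggedString.encode("hello") comes out as 7 octets (2 for the header, 5 of UTF-8), while a three-character CJK string takes 8 octets (2 for the header, 6 of UTF-16) instead of the 9 that plain UTF-8 would need.  Encoding both ways just to compare lengths is wasteful; a real implementation would more likely scan the string once to decide.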


        HTH,

/|/|ike
