Resending (from Gmail), because the Unicode list rejected the SMTP server of my mail provider (Spamcop is defective).
Nothing forbids you to create new serializations of Unicode; you may even design yours so that it is a conforming process (meaning that it preserves *all* valid Unicode code points, without affecting their relative order, except possibly their normalization form as long as you preserve the canonical equivalences). If creating a Unicode conforming process is a goal, your serialization will also have to preserve all PUAs and all valid code points of the BMP, with the only exception of isolated/unordered surrogates. Once this condition is met, nothing forbids you to reuse some of the binary encoding space in your serialization scheme, including the space used by PUAs and surrogates (after all, the UTF-16 encoding scheme already does that).

According to your description, your scheme will use 8-bit code units, so you don't have to worry about byte order when serializing code units into bytes. But one goal will probably not be met by your encoding: there will be no separation of byte values marking where each encoded sequence starts and ends. The consequence is that your encoding will not be processable from an arbitrary position, nor backward, because you cannot determine where the sequence encoding a single code point starts. Fast search algorithms (Boyer-Moore and similar) therefore won't be usable for fast plain-text searches, and the whole encoded stream will be processable only from its start and only in the forward direction (there is a similar problem with compressed streams that store compressed text as a single stream instead of a sequence of blocks with resynchronization points).

If you intend to use such an encoding, it may be advisable for it to contain an easily recognizable sequence, unique in all directions and recognizable from any starting point in the stream, that acts as a synchronization point and that itself represents *no* code point at all. With such a resynchronization point (which may be freely inserted between encoded blocks of valid code points), a fast search algorithm is still usable; the resynchronization sequence may also be used as padding to fill the last bytes of data blocks of some fixed size (e.g. blocks of 4 KB, meaning that searches can be started at least from any position that is a multiple of 4 KB).

Currently, UTF-8, UTF-16 and UTF-32 do not contain any resynchronization sequence, but the separation of code unit values still allows a stream to be scanned from any starting position, plus or minus a very small number of code units around that position (not exceeding a handful of bytes), to find the start of an encoded code point: this means that fast-search algorithms work easily with them, with a very small impact on performance.

If performance is really a target, you also have to consider the order of magnitude of the volume of text to process: if lots of text needs to be processed, it is certainly not the encoding itself that will limit performance but the overall size of the text, and a good compression scheme will minimize the number of I/Os (or, equivalently, memory page swaps, or misses in fast memory caches or disk caches).
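As an illustration of how the standard UTFs allow scanning from any starting position, here is a minimal sketch (Python; the function name is mine, purely for illustration) of how a scanner can resynchronize on a UTF-8 stream from an arbitrary byte offset, which is what keeps fast-search algorithms usable:

    def utf8_sync_forward(buf: bytes, pos: int) -> int:
        # Skip trailing bytes (10xxxxxx): UTF-8 reserves 0x80-0xBF for them,
        # so at most 3 bytes are skipped from any starting offset.
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    data = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
    print(utf8_sync_forward(data, 2))   # 3 -- offset 2 is the trailing byte of 'é'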
When the texts to process are in fact very small, and are just a small part of larger data sets, there's absolutely no reason to use anything other than the very simple UTFs like UTF-8, UTF-16 or UTF-32, even if these schemes are not very space efficient: these texts will not be compressed alone, but as part of the whole dataset containing them, in order to gain more performance. For example, XML documents still use UTF-8 by default, and it is still very efficient, even for large datasets, because these documents are compressed as a whole for storage or transmission, and possibly split into blocks of fixed size (it's also possible in XML to include padding to make sure that all data blocks keep a constant size, but generic compression schemes often contain their own resynchronization bytes for this purpose).

In other words, there's generally no need to use anything other than the basic standardized UTFs: trying to work at the code-point level, instead of on larger texts or datasets, is not the best approach to improve performance, when you can gain a lot more by working on the larger datasets containing these texts. Any new encoding scheme will instead defeat the many other optimizations that have gone into interoperable libraries. Unicode does not standardize more than the three encoding schemes (and their byte-order variants for those whose code units are larger than bytes), and I think that is plenty. But nothing forbids you to create others, and Unicode has published some informative annexes about such schemes for applications needing compression.

What is true is that UTF-8 was initially designed to allow the encoding of code points up to 31 bits; some inefficiency remains because not all of its encoding space is actually used. We could certainly have a UTF using byte-sized code units, fully compatible with ASCII (1 byte for ASCII), that would encode all other valid code points into sequences of 2 or 3 bytes, while still making it possible to tell start bytes from trailing bytes (for fast-search compatibility). Four-byte sequences could be avoided, but they are still needed with UTF-8 because of its inefficient use of the valid encoding space, given that valid code points are now restricted to the first 17 planes, i.e. a bit more than 20 bits.

Note that you said you'll need to encode 20 bits: this is wrong if you need full Unicode process conformance (17 planes, minus the surrogate space, is still more than 20 bits' worth of code points). If such conformance is not met, your scheme will not be usable with the many libraries and applications that are designed to work only with Unicode conforming processes (which also assume that all PUAs, in the BMP or in the last 2 planes, are effectively preserved).
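To make the arithmetic behind that last remark explicit, here is a quick check (the figures are those of the Unicode standard: 17 planes of 65,536 code points, 2,048 surrogates):

    planes, plane_size = 17, 0x10000     # 17 planes of 65,536 code points each
    surrogates = 0xE000 - 0xD800         # U+D800..U+DFFF: 2,048 code points
    valid = planes * plane_size - surrogates
    print(valid)                         # 1,112,064 scalar values to preserve
    print(2 ** 20)                       # 1,048,576 -- 20 bits do not cover them all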
Philippe.

> Message of 02/06/10 07:47
> From: "Kannan Goundan"
> To: [email protected]
> Cc:
> Subject: Least used parts of BMP.
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes. The goals are fast read/write and small
> size.
>
> The plan:
>   1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
>   2. Non-BMP code points are encoded as three bytes.
>      - The first two bytes are code points from the BMP's UTF-16 surrogate
>        range (11 bits of data).
>      - The next byte provides an additional 8 bits of data.
>
> Unfortunately, this doesn't quite work because it only gives me 19
> bits to encode non-BMP code points, but I need 20 bits. To solve this
> problem, I'm planning on stealing a bit of code space from the BMP
> private-use area. If I did, then:
>   - I'd get the bits needed to encode the non-BMP in 3 bytes.
>   - The stolen code points of the private-use area would now have to be
>     encoded using 3 bytes.
>
> I chose the private-use area because I assumed it would be the least
> commonly used, so having these code points require 3 bytes instead of
> 2 bytes wasn't that big a deal. Does this sound reasonable? Do
> people suggest a different section of the BMP to steal from, or a
> different encoding altogether?
>
> Thanks for reading.
> -- Kannan
>
> P.S. I actually have two encodings. One is similar to UTF-8 in that
> it's ASCII-biased. The encoding described above is intended for
> non-ASCII-biased data. The programmer selects which encoding to use
> based on what he thinks the data looks like.
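The 19-versus-20-bit gap described in the quoted plan can be checked the same way (an illustrative calculation only):

    surrogate_prefixes = 0xE000 - 0xD800   # 2,048 surrogate code points -> 11 bits
    trailing_values    = 256               # one extra byte -> 8 bits
    print(surrogate_prefixes * trailing_values)   # 524,288  (2**19) reachable values
    print(0x110000 - 0x10000)                     # 1,048,576 (2**20) non-BMP code points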

