Thanks to everyone for the detailed responses. I definitely appreciate the feedback on the broader issue (even though my question was very narrow).
I should clarify my use case a little. I'm creating a generic data
serialization format similar to Google Protocol Buffers and Apache Thrift.
Besides Unicode strings, the format supports many other data types, all of
which are serialized in a custom format. Some data types will contain a lot
of string data while others will contain very little. As with other tools
in this area, standard compression techniques can be applied to the entire
payload as a separate pass (e.g. gzip).

I can see how there are benefits to using one of the standard encodings.
However, at this point, my goals are basically fast
serialization/deserialization and small size. I might eventually see the
error in my ways (and feel like an idiot for ignoring your advice), but in
the interest of not wasting your time any more than I already have, I
should mention that suggestions to stick to a standard encoding will fall
on mostly deaf ears.

For my current use case, I don't need to perform random accesses in
serialized data, so I don't see a need to make the space-usage compromises
that UTF-8 and UTF-16 make. A more compact UTF-8-like encoding will get you
ASCII in one byte, the first 1/4 of the BMP in two bytes, and everything
else in three bytes. A more compact UTF-16-like format gets the BMP in 2
bytes (minus some PUA) and everything else in 3. Maybe not huge savings,
but if you're of the opinion that sticking to a standard doesn't buy you
anything... :-) (There's a rough sketch of the UTF-8-like scheme at the end
of this message.)

I'll definitely take a closer look at SCSU. Hopefully the encoding speed is
good enough. Most of the other serialization tools just blast out UTF-8,
making them very fast on strings that contain mostly ASCII. I hope SCSU
doesn't get me killed in ASCII-only encoding benchmarks
(http://wiki.github.com/eishay/jvm-serializers/). I really do like the idea
of making my format less ASCII-biased, though. And, like I said before, I
don't care much about sticking to a standard encoding; if stock SCSU ends
up being too slow or complex, I might still be able to use techniques from
SCSU in a custom encoding.

(Philippe: when I said I needed 20 bits, I meant that I needed 20 bits for
the stuff after the BMP. I fully intend for my encoding to handle every
Unicode codepoint, minus surrogates.)

Thanks again, everyone.
-- Kannan

On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <[email protected]> wrote:
> On 6/2/2010 12:25 AM, Kannan Goundan wrote:
>>
>> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <[email protected]> wrote:
>>
>>>
>>> Why not use SCSU?
>>>
>>> You get the small size and the encoder/decoder aren't that
>>> complicated.
>>>
>>
>> Hmm... I had skimmed the SCSU document a few days ago. At the time it
>> seemed a bit more complicated than I wanted. What's nice about UTF-8
>> and UTF-16-like encodings is that the space usage is predictable.
>>
>> But maybe I'll take a closer look. If a simple SCSU encoder can do
>> better than more "standard" encodings 99% of the time, then maybe it's
>> worth it...
>>
>>
>
> It will, because it's designed to compress commonly used characters.
>
> Start with the existing sample code and optimize it. Many features of SCSU
> are optional, using them gives slightly better compression, but you don't
> always have to use them and the result is still legal SCSU. Sometimes
> leaving out a feature can make your encoder a tad simpler, although I found
> that you can be pretty fast with decent performance.
>
> A./
>
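P.S. For concreteness, here's a rough sketch of the compact UTF-8-like
scheme I described above. The class and method names are just placeholders,
and this is untested and unbenchmarked. Lead bytes carry a 0/10/11 prefix
in the top two bits, so unlike real UTF-8 it isn't self-synchronizing,
which is fine since I don't need random access. Input strings are assumed
to contain no unpaired surrogates.

    import java.io.ByteArrayOutputStream;

    class CompactUtf8Like {
        // 0xxxxxxx                   -> U+0000..U+007F   (1 byte, ASCII)
        // 10xxxxxx yyyyyyyy          -> U+0080..U+3FFF   (2 bytes, first 1/4 of the BMP)
        // 11xxxxxx yyyyyyyy zzzzzzzz -> U+4000..U+10FFFF (3 bytes, everything else)
        static byte[] encode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                i += Character.charCount(cp);
                if (cp < 0x80) {
                    out.write(cp);
                } else if (cp < 0x4000) {
                    out.write(0x80 | (cp >> 8));
                    out.write(cp & 0xFF);
                } else {
                    out.write(0xC0 | (cp >> 16));
                    out.write((cp >> 8) & 0xFF);
                    out.write(cp & 0xFF);
                }
            }
            return out.toByteArray();
        }

        static String decode(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < b.length; ) {
                int lead = b[i++] & 0xFF;
                int cp;
                if (lead < 0x80) {
                    cp = lead;
                } else if (lead < 0xC0) {
                    cp = ((lead & 0x3F) << 8) | (b[i++] & 0xFF);
                } else {
                    cp = ((lead & 0x3F) << 16)
                       | ((b[i++] & 0xFF) << 8)
                       | (b[i++] & 0xFF);
                }
                sb.appendCodePoint(cp);
            }
            return sb.toString();
        }
    }

Pure ASCII still comes out at one byte per character, so the ASCII-only
benchmark case should be no worse than plain UTF-8.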

