Thanks to everyone for the detailed responses. I definitely appreciate the feedback on the broader issue (even though my question was very narrow).
I should clarify my use case a little. I'm creating a generic data
serialization format similar to Google Protocol Buffers and Apache Thrift.
Besides Unicode strings, the format supports many other data types, all of
which are serialized in a custom format. Some data types will contain a lot
of string data while others will contain very little. As with other tools
in this area, standard compression techniques can be applied to the entire
payload as a separate pass (e.g. gzip).

I can see how there are benefits to using one of the standard encodings.
However, at this point, my goals are basically fast
serialization/deserialization and small size. I might eventually see the
error in my ways (and feel like an idiot for ignoring your advice), but in
the interest of not wasting your time any more than I already have, I
should mention that suggestions to stick to a standard encoding will fall
on mostly deaf ears.

For my current use case, I don't need to perform random accesses in
serialized data, so I don't see a need to make the space-usage compromises
that UTF-8 and UTF-16 make. A more compact UTF-8-like encoding will get you
ASCII in one byte, the first 1/4 of the BMP in two bytes, and everything
else in three bytes. A more compact UTF-16-like format gets the BMP in 2
bytes (minus some PUA) and everything else in 3. Maybe not huge savings,
but if you're of the opinion that sticking to a standard doesn't buy you
anything... :-) (There's a rough sketch of the UTF-8-like scheme at the end
of this message.)

I'll definitely take a closer look at SCSU. Hopefully the encoding speed is
good enough. Most of the other serialization tools just blast out UTF-8,
making them very fast on strings that contain mostly ASCII. I hope SCSU
doesn't get me killed in ASCII-only encoding benchmarks
(http://wiki.github.com/eishay/jvm-serializers/). I really do like the idea
of making my format less ASCII-biased, though. And, like I said before, I
don't care much about sticking to a standard encoding; if stock SCSU ends
up being too slow or complex, I might still be able to use techniques from
SCSU in a custom encoding.

(Philippe: when I said I needed 20 bits, I meant that I needed 20 bits for
the stuff after the BMP. I fully intend for my encoding to handle every
Unicode codepoint, minus surrogates.)

Thanks again, everyone.
-- Kannan

On Wed, Jun 2, 2010 at 13:12, Asmus Freytag <[email protected]> wrote:
> On 6/2/2010 12:25 AM, Kannan Goundan wrote:
>>
>> On Tue, Jun 1, 2010 at 23:30, Asmus Freytag <[email protected]> wrote:
>>
>>>
>>> Why not use SCSU?
>>>
>>> You get the small size and the encoder/decoder aren't that
>>> complicated.
>>>
>>
>> Hmm... I had skimmed the SCSU document a few days ago. At the time it
>> seemed a bit more complicated than I wanted. What's nice about UTF-8
>> and UTF-16-like encodings is that the space usage is predictable.
>>
>> But maybe I'll take a closer look. If a simple SCSU encoder can do
>> better than more "standard" encodings 99% of the time, then maybe it's
>> worth it...
>>
>>
>
> It will, because it's designed to compress commonly used characters.
>
> Start with the existing sample code and optimize it. Many features of SCSU
> are optional, using them gives slightly better compression, but you don't
> always have to use them and the result is still legal SCSU. Sometimes
> leaving out a feature can make your encoder a tad simpler, although I found
> that you can be pretty fast with decent performance.
>
> A./
>
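P.S. For concreteness, here's a rough sketch of the compact UTF-8-like
scheme I described above. The class and method names are just placeholders,
and this is untested and unbenchmarked. Lead bytes carry a 0/10/11 prefix
in the top two bits, so unlike real UTF-8 it isn't self-synchronizing,
which is fine since I don't need random access. Input strings are assumed
to contain no unpaired surrogates.

    import java.io.ByteArrayOutputStream;

    class CompactUtf8Like {
        // 0xxxxxxx                   -> U+0000..U+007F   (1 byte, ASCII)
        // 10xxxxxx yyyyyyyy          -> U+0080..U+3FFF   (2 bytes, first 1/4 of the BMP)
        // 11xxxxxx yyyyyyyy zzzzzzzz -> U+4000..U+10FFFF (3 bytes, everything else)
        static byte[] encode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                i += Character.charCount(cp);
                if (cp < 0x80) {
                    out.write(cp);
                } else if (cp < 0x4000) {
                    out.write(0x80 | (cp >> 8));
                    out.write(cp & 0xFF);
                } else {
                    out.write(0xC0 | (cp >> 16));
                    out.write((cp >> 8) & 0xFF);
                    out.write(cp & 0xFF);
                }
            }
            return out.toByteArray();
        }

        static String decode(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < b.length; ) {
                int lead = b[i++] & 0xFF;
                int cp;
                if (lead < 0x80) {
                    cp = lead;
                } else if (lead < 0xC0) {
                    cp = ((lead & 0x3F) << 8) | (b[i++] & 0xFF);
                } else {
                    cp = ((lead & 0x3F) << 16)
                       | ((b[i++] & 0xFF) << 8)
                       | (b[i++] & 0xFF);
                }
                sb.appendCodePoint(cp);
            }
            return sb.toString();
        }
    }

Pure ASCII still comes out at one byte per character, so the ASCII-only
benchmark case should be no worse than plain UTF-8.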

