On Tue, Jun 1, 2010 at 11:04 PM, Kannan Goundan <[email protected]> wrote:
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes. The goals are fast read/write and small
> size.
>
> The plan:
> 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
> 2. Non-BMP code points are encoded as three bytes
>    - The first two bytes are code points from the BMP's UTF-16 surrogate
>      range (11 bits of data)
>    - The next byte provides an additional 8 bits of data.
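For concreteness, the quoted plan could be sketched roughly as below. This is only an illustration under assumptions the original message does not state: big-endian byte order, the high 11 bits of the offset (code point minus 0x10000) stored in the surrogate-range unit, and the low 8 bits in the trailing byte. The names `encode` and `decode` are hypothetical. Note that 11 + 8 = 19 bits only reach U+8FFFF, so this sketch rejects anything above that.

```python
SUR_LO, SUR_HI = 0xD800, 0xDFFF  # UTF-16 surrogate range: 2048 values = 11 bits


def encode(text: str) -> bytes:
    """Encode text with the 2-byte / 3-byte scheme (byte order is an assumption)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            if SUR_LO <= cp <= SUR_HI:
                raise ValueError("lone surrogate cannot be encoded")
            out += cp.to_bytes(2, "big")          # BMP: two bytes
        else:
            offset = cp - 0x10000
            if offset >= 1 << 19:                 # only 19 bits are available
                raise ValueError(f"U+{cp:X} does not fit in 11 + 8 bits")
            # High 11 bits go into a surrogate-range unit, low 8 bits follow.
            out += (SUR_LO | (offset >> 8)).to_bytes(2, "big")
            out.append(offset & 0xFF)
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode(); raises ValueError on truncated input."""
    chars = []
    i = 0
    while i < len(data):
        if i + 2 > len(data):
            raise ValueError("truncated two-byte unit")
        unit = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if SUR_LO <= unit <= SUR_HI:
            if i >= len(data):
                raise ValueError("missing trailing byte")
            offset = ((unit & 0x7FF) << 8) | data[i]
            i += 1
            chars.append(chr(0x10000 + offset))
        else:
            chars.append(chr(unit))
    return "".join(chars)
```

With this sketch, `encode("\U00013000")` (an Egyptian hieroglyph) produces three bytes, where `"\U00013000".encode("utf-16-be")` produces four; BMP text comes out the same size in both.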
Why? I can't imagine any use case where you're dealing with enough data
outside the BMP to make using this instead of UTF-16 a real win. Do you
have a case where you're dealing with a large amount of Egyptian
hieroglyphs or obscure Chinese characters, and where it's worth adding
the complexity to go from four bytes to three in some cases, but not
worth using SCSU or a standard compression like zlib's?

--
Kie ekzistas vivo, ekzistas espero.

