On Tue, Jun 1, 2010 at 11:04 PM, Kannan Goundan <[email protected]> wrote:
>
> I'm trying to come up with a compact encoding for Unicode strings for
> data serialization purposes.  The goals are fast read/write and small
> size.
>
> The plan:
> 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
> 2. Non-BMP code points are encoded as three bytes
> - The first two bytes are code points from the BMP's UTF-16 surrogate
> range (11 bits of data)
> - The next byte provides an additional 8 bits of data.

Why? I can't imagine a use case where you're handling enough data
outside the BMP for this to be a real win over UTF-16. Do you really
have a workload dominated by Egyptian hieroglyphs or obscure Chinese
characters, where it's worth the added complexity to go from four bytes
to three for each non-BMP code point, but not worth using SCSU or a
general-purpose compressor such as zlib?
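
To put numbers on that trade-off, here is a minimal Python sketch of the
scheme as I read it from the quoted plan. The function names (encode,
decode, sizes), the big-endian byte order, and the exact bit layout are
my assumptions, not anything the plan specifies. Note that 11 + 8 bits
is only 19 bits, which covers code points up to U+8FFFF; the sketch
rejects anything above that rather than guessing how the plan would
handle it.

```python
import zlib

SURROGATE_BASE = 0xD800  # start of the UTF-16 surrogate range (0xD800-0xDFFF)
SURROGATE_END = 0xDFFF


def encode(text: str) -> bytes:
    """Encode text with the proposed 2-byte / 3-byte scheme (assumed big-endian)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            if SURROGATE_BASE <= cp <= SURROGATE_END:
                raise ValueError("lone surrogates cannot be encoded")
            out += cp.to_bytes(2, "big")
        else:
            offset = cp - 0x10000
            if offset >= 1 << 19:
                # 11 bits + 8 bits cannot hold this offset (assumed limit).
                raise ValueError("code point does not fit in 19 bits")
            # High 11 bits go into a value in the surrogate range,
            # low 8 bits go into the trailing byte.
            out += (SURROGATE_BASE | (offset >> 8)).to_bytes(2, "big")
            out.append(offset & 0xFF)
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode()."""
    chars = []
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if SURROGATE_BASE <= unit <= SURROGATE_END:
            offset = ((unit - SURROGATE_BASE) << 8) | data[i]
            i += 1
            chars.append(chr(0x10000 + offset))
        else:
            chars.append(chr(unit))
    return "".join(chars)


def sizes(text: str) -> dict:
    """Byte counts for plain UTF-16, the proposed scheme, and zlib over UTF-16."""
    utf16 = text.encode("utf-16-be")
    return {
        "utf16": len(utf16),
        "proposed": len(encode(text)),
        "utf16+zlib": len(zlib.compress(utf16)),
    }
```

Running sizes() on a representative sample of the real data would show
whether the three-byte form saves enough over UTF-16 to matter, and how
both compare with simply compressing the UTF-16 output.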

--
Kie ekzistas vivo, ekzistas espero.

