On Wed, Jun 2, 2010 at 21:43, Doug Ewell <[email protected]> wrote:
>> If you want a really fast alternate encoding, you could encode all of
>> Unicode in at most 3 bytes.  Use the high bit as a "continuation" bit and
>> the lower 7 bits as the data.
>>
>> ASCII gets passed through unchanged.
>
> This is essentially what I was going to suggest to Kannan, since avoidance
> of ASCII bytes, nulls, etc. is not relevant to his use case. The conversion
> is lightning-fast; it can be optimized to be even faster than UTF-8.

This is currently what I do (I was referring to this as the "compact
UTF-8-like encoding").  The one difference is that I put all the
marker bits in the first byte (instead of in the high bit of every
byte):

  0xxxxxxx
  10xxxxxx xyyyyyyy
  110xxxxx xxyyyyyy yzzzzzzz

This is essentially how I encode integers as well.
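
For illustration, here is a minimal Python sketch of the layout above.
The names encode_cp and decode_cp are hypothetical, and details such as
rejecting overlong forms are left out, so treat it as one reading of
the bit patterns rather than the exact code in use:

```python
from typing import Tuple

def encode_cp(cp: int) -> bytes:
    """Encode one code point; all marker bits go in the first byte."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:        # 0xxxxxxx: 7 data bits
        return bytes([cp])
    if cp < 0x4000:      # 10xxxxxx + 1 byte: 14 data bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:    # 110xxxxx + 2 bytes: 21 data bits
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("value does not fit in 21 bits")

def decode_cp(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one code point at pos; return (code point, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:        # single byte
        return b0, pos + 1
    if b0 < 0xC0:        # leading bits 10: two bytes total
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:        # leading bits 110: three bytes total
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("invalid lead byte")
```

With this reading, the first byte alone gives the sequence length, and
everything up to U+3FFF fits in two bytes.  For example, U+20AC takes
two bytes here versus three in UTF-8.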

-- Kannan

