On Wed, Jun 2, 2010 at 21:43, Doug Ewell <[email protected]> wrote:
>> If you want a really fast alternate encoding, you could encode all of
>> Unicode in at most 3 bytes. Use the high bit as a "continuation" bit and
>> the lower 7 bits as the data.
>>
>> ASCII gets passed through unchanged.
>
> This is essentially what I was going to suggest to Kannan, since avoidance
> of ASCII bytes, nulls, etc. is not relevant to his use case. The conversion
> is lightning-fast; it can be optimized to be even faster than UTF-8.

This is currently what I do (I was referring to this as the "compact
UTF-8-like encoding"). The one difference is that I put all the marker bits
in the first byte (instead of in the high bit of every byte):

    0xxxxxxx
    10xxxxxx xyyyyyyy
    110xxxxx xxyyyyyy yzzzzzzz

This is essentially how I encode integers as well.

-- Kannan
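
A minimal Python sketch of the three byte layouts above, for illustration only. The function names `encode` and `decode_one` are hypothetical, and the sketch assumes the payload bits are stored most-significant first (as the x/y/z lettering suggests); the message does not show the actual implementation.

```python
def encode(cp: int) -> bytes:
    """Encode one code point with all marker bits in the first byte."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # 0xxxxxxx: 7 payload bits
        return bytes([cp])
    if cp < 0x4000:      # 10xxxxxx xyyyyyyy: 6 + 8 = 14 payload bits
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 0x200000:    # 110xxxxx xxyyyyyy yzzzzzzz: 5 + 16 = 21 payload bits
        return bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("value does not fit in 21 bits")


def decode_one(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one code point at pos; return (code point, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:        # leading 0: single byte
        return b0, pos + 1
    if b0 < 0xC0:        # leading 10: two bytes
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:        # leading 110: three bytes
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise ValueError("unknown marker bits in first byte")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
        data = encode(cp)
        print(hex(cp), data.hex(), decode_one(data))
```

Because the length is known from the first byte alone, the decoder branches once per character instead of testing a continuation bit on every byte. The trade-off is that trailing bytes can take any value, including ASCII and NUL, which the quoted message notes is not a concern for this use case.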

