On Wed, Apr 27, 2011 at 11:00 AM, Joe Pallas <[email protected]> wrote:
>
> On Apr 26, 2011, at 11:54 PM, Himanshu Vashishtha wrote:
>
> > HBase uses utf-8 encoding to store the row keys, so it can store non-ascii
> > characters too (yes they will be larger than 1 byte).
>
> That statement may be misleading. HBase doesn't use any encoding at all,
> because row keys are simply arrays of bytes. HBase cares only about the
> sorting order of those byte arrays, and neither knows nor cares what
> interpretation the client may attach to them.

What I meant was that for a String like "façade" or "fad", it uses the UTF-8
encoding scheme to create those byte arrays (and therefore you can store
non-ASCII values too; each character will take from 1 to 4 bytes, but as an
end user you don't care about that).

> The UTF-8 standard mentions that the byte-value lexicographic sorting order
> of UTF-8 strings matches the sorting order of the Unicode character numbers,
> so a client can turn 16- or 32-bit Unicode strings into UTF-8 in order to
> use them as keys and they will sort the same way. (Although the standard
> warns that "a sort order based on character numbers is almost never
> culturally valid.")
>
> On the plus side, that means you never have to worry about "What's the next
> character after ç?" Just add 1. But don't be surprised when "fad" comes
> before "façade" in your sort.

Yes, no need to do any hard coding. Just add 1 to the last byte of the byte
array that is formed from the prefix of the key that you want to search for.
Hope this is less confusing now. :)

> joe
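
To make the "just add 1" trick concrete, here is a minimal sketch in plain Python. It does not use the HBase client API. The names utf8_key, prefix_stop_key and prefix_scan are hypothetical, and it assumes a scan whose start key is inclusive and whose stop key is exclusive. The carry over trailing 0xFF bytes is an extra precaution for arbitrary byte keys; valid UTF-8 never contains 0xFF.

```python
def utf8_key(text: str) -> bytes:
    """Client-side choice (an assumption here): encode string keys as UTF-8."""
    return text.encode("utf-8")


def prefix_stop_key(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with prefix.

    "Just add 1" to the last byte. Trailing 0xFF bytes cannot be incremented,
    so they are dropped first. An empty result means "no upper bound".
    """
    key = bytearray(prefix)
    while key and key[-1] == 0xFF:
        key.pop()
    if not key:
        return b""
    key[-1] += 1
    return bytes(key)


def prefix_scan(sorted_keys, prefix: bytes):
    """Emulate a range scan: start key inclusive, stop key exclusive."""
    stop = prefix_stop_key(prefix)
    return [k for k in sorted_keys if k >= prefix and (not stop or k < stop)]


# Byte-wise order, which is all the store looks at: "fad" sorts before "façade"
# because 'd' is 0x64 while 'ç' encodes as 0xC3 0xA7.
keys = sorted(utf8_key(s) for s in ["façade", "fad", "fa", "fb", "faç"])
matches = prefix_scan(keys, utf8_key("faç"))
```

With these keys, the stop key for the prefix "faç" is the UTF-8 encoding of "faè" (U+00E8 is the code point right after ç, U+00E7), so the scan returns only "faç" and "façade" without anyone having to work out the next character by hand.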
