On Apr 26, 2011, at 11:54 PM, Himanshu Vashishtha wrote:

> HBase uses utf-8 encoding to store the row keys, so it can store non-ascii
> characters too (yes they will be larger than 1 byte).

That statement may be misleading. HBase doesn't use any encoding at all, because row keys are simply arrays of bytes. HBase cares only about the sorting order of those byte arrays, and neither knows nor cares what interpretation the client may attach to them.

The UTF-8 standard (RFC 3629) notes that the byte-value lexicographic sorting order of UTF-8 strings is the same as the sorting order of the Unicode character numbers, so a client can turn 16- or 32-bit Unicode strings into UTF-8 in order to use them as keys, and the keys will sort in character-number order. (Although the standard warns that "a sort order based on character numbers is almost never culturally valid.")

On the plus side, that means you never have to worry about "What's the next character after ç?" Just add 1 to its character number. But don't be surprised when "fad" comes before "façade" in your sort: ç encodes as the bytes 0xC3 0xA7, which sort after every ASCII letter.

joe
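
P.S. In case it helps, here is a minimal sketch of the ordering described above, in plain Java with no HBase dependency. The class RowKeyOrder and the helpers compareKeys, utf8 and nextChar are names I made up for illustration; compareKeys is only my assumption of what an unsigned byte-by-byte comparison looks like, not HBase's actual comparator.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RowKeyOrder {
    // Hypothetical helper: compare two keys byte by byte, treating each byte
    // as unsigned (0..255); a key sorts before any longer key it prefixes.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // The client, not the store, chooses UTF-8 as the interpretation.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // "Just add 1": the character whose number follows the first character of s.
    // Ignores the surrogate gap (U+D800..U+DFFF) and the end of the code space.
    static String nextChar(String s) {
        return new String(Character.toChars(s.codePointAt(0) + 1));
    }

    public static void main(String[] args) {
        // The escape below is c-cedilla, written this way so the source file
        // encoding does not matter.
        String[] words = {"fa\u00e7ade", "fad", "facade"};
        byte[][] keys = new byte[words.length][];
        for (int i = 0; i < words.length; i++) {
            keys[i] = utf8(words[i]);
        }

        // Sort the raw byte arrays; the words come out as facade, fad, and
        // then the c-cedilla spelling last.
        Arrays.sort(keys, RowKeyOrder::compareKeys);
        for (byte[] key : keys) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }

        // The next character after c-cedilla (U+00E7) is U+00E8, and its
        // UTF-8 bytes sort immediately after.
        String next = nextChar("\u00e7");
        System.out.println(compareKeys(utf8("\u00e7"), utf8(next)) < 0); // true
    }
}
```

The comparison masks each byte with 0xff because Java bytes are signed; without that, the lead byte of ç (0xC3) would compare as negative and sort before the ASCII letters.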
