On Apr 26, 2011, at 11:54 PM, Himanshu Vashishtha wrote:

> HBase uses utf-8 encoding to store the row keys, so it can store non-ascii
> characters too (yes they will be larger than 1 byte).

That statement may be misleading.  HBase doesn't use any encoding at all, 
because row keys are simply arrays of bytes.  HBase cares only about the 
sorting order of those byte arrays, and neither knows nor cares what 
interpretation the client may attach to them.

The UTF-8 specification (RFC 3629) notes that the byte-value lexicographic 
sorting order of UTF-8 strings matches the sorting order of the Unicode 
character numbers (code points), so a client can encode 16- or 32-bit Unicode 
strings as UTF-8, use the result as keys, and get keys that sort in code point 
order.  (Although the specification warns that "a sort order based on 
character numbers is almost never culturally valid.")
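To make that concrete, here is a minimal sketch in plain Java.  It is not 
HBase's own comparator; the class and method names (Utf8KeyOrder, 
compareBytes, compareCodePoints, toKey) are made up for illustration, and the 
sample strings are arbitrary.  It compares a handful of strings both ways and 
checks that the two orderings agree.

```java
import java.nio.charset.StandardCharsets;

public class Utf8KeyOrder {
    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        // One array is a prefix of the other: the shorter one sorts first.
        return Integer.signum(Integer.compare(a.length, b.length));
    }

    /** Comparison by Unicode code point (character number). */
    public static int compareCodePoints(String s, String t) {
        int[] a = s.codePoints().toArray();
        int[] b = t.codePoints().toArray();
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return Integer.signum(Integer.compare(a.length, b.length));
    }

    /** Hypothetical helper: encode a string as a UTF-8 row key. */
    public static byte[] toKey(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Arbitrary samples: ASCII, Latin-1 range, and a CJK character.
        String[] samples = {"abc", "abd", "fa\u00e7ade", "z", "\u00e7", "\u4e2d"};
        for (String s : samples) {
            for (String t : samples) {
                int byBytes = compareBytes(toKey(s), toKey(t));
                int byCodePoints = compareCodePoints(s, t);
                if (byBytes != byCodePoints) {
                    throw new AssertionError(s + " vs " + t);
                }
            }
        }
        System.out.println("byte order matched code point order for all sample pairs");
    }
}
```

The bytes are masked with 0xff because Java's byte type is signed; without 
that, every byte of a non-ASCII character would compare as negative and sort 
before the ASCII range.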

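On the plus side, that means you never have to worry about "What's the next 
character after ç?"  Just add 1 to its code point.  But don't be surprised 
when "fad" comes before "façade" in your sort: ç encodes as the bytes C3 A7, 
which sort after every ASCII letter, including "d".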
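A small sketch of both points, again in plain Java and with invented names 
(ByteOrderSurprise, sortAsKeys, successor); it assumes the keys are UTF-8 
encoded strings compared as unsigned bytes, which is the client's choice and 
not something HBase imposes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteOrderSurprise {
    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Returns a copy of the words, sorted the way their UTF-8 keys would sort. */
    public static String[] sortAsKeys(String... words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted, (s, t) -> compareBytes(
                s.getBytes(StandardCharsets.UTF_8),
                t.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    /**
     * "Just add 1": the character whose code point follows the first
     * character of ch. Illustration only; it does not guard against
     * landing in the surrogate range or past U+10FFFF.
     */
    public static String successor(String ch) {
        int codePoint = ch.codePointAt(0);
        return new String(Character.toChars(codePoint + 1));
    }

    public static void main(String[] args) {
        // \u00e7 is c-cedilla; escaped so the source file encoding does not matter.
        String[] sorted = sortAsKeys("fa\u00e7ade", "fad", "fact");
        System.out.println(Arrays.toString(sorted)); // fact, fad, then facade-with-cedilla

        String next = successor("\u00e7");
        System.out.println(Integer.toHexString(next.codePointAt(0))); // e8
    }
}
```

A dictionary would list "façade" before "fad"; getting that order would take 
a locale-aware collation key rather than raw UTF-8 bytes.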

joe
