I could use some additional feedback on what I should expect from HBase
performance and how it affects schema design ("additional" because I had a
brief exchange on IRC about this yesterday).
In short: we tried to load some data into our test system and found we were
spending a lot of time in HTable.exists. The client code made some redundant
checks that things did or did not exist. (In fairness, those checks probably
make sense outside of a bulk-loading context.)
Our test system, running on a not-too-elderly 2.3GHz 8-core Opteron, used
4-byte row keys for the narrow test and 4-byte qualifiers for the wide test.
The narrow test shows HTable.get and HTable.exists are basically constant time,
and the wide test shows HTable.get and HTable.exists are basically linear in
the ordinal position of the qualifier.
The problem? Extrapolating from that linear-time behavior for wide rows: with
250,000 qualifiers in a row, we can expect retrieving the last one to take
about half a second. I had assumed access time within a row would be
logarithmic rather than linear, since the qualifiers are stored sorted and
could be binary-searched. That's apparently not the case.
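To make the expectation concrete, here's a toy comparison-count model I put together (this is hypothetical illustration code, not HBase internals): a linear scan to ordinal position i costs about i comparisons, while a binary search over the same sorted qualifiers would cost about log2(n).

```java
// Models access cost inside one wide row of n sorted qualifiers:
// a linear scan to ordinal position i (what the measurements suggest)
// vs. a binary search (what I had expected). Illustrative only.
public class WideRowScanModel {

    // Comparisons for a linear scan that stops at ordinal position i.
    static long linearCost(int i) {
        return (long) i + 1;
    }

    // Probes for a binary search over the sorted qualifiers.
    static long binarySearchCost(int[] qualifiers, int target) {
        long probes = 0;
        int lo = 0, hi = qualifiers.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            if (qualifiers[mid] < target) lo = mid + 1;
            else if (qualifiers[mid] > target) hi = mid - 1;
            else break;
        }
        return probes;
    }

    public static void main(String[] args) {
        int n = 250_000;                 // qualifiers per wide row
        int[] qualifiers = new int[n];   // stand-in for sorted 4-byte qualifiers
        for (int i = 0; i < n; i++) qualifiers[i] = i;

        // Last qualifier: the scan touches all 250,000 entries,
        // while binary search needs roughly log2(250,000) ~ 18 probes.
        System.out.println("linear scan:   " + linearCost(n - 1));
        System.out.println("binary search: " + binarySearchCost(qualifiers, n - 1));
    }
}
```

At 250,000 qualifiers that's a gap of four orders of magnitude in comparisons, which is why the observed half-second surprised me.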
Most of the schema design advice I've seen doesn't mention this when comparing
rows vs columns. But it looks like we can't get all three of random access,
(row-level) atomicity, and scale-independent performance using wide rows.
Is my analysis off-base? I'd appreciate any advice on how to trade off those
three desires.
Thanks.
joe
PS I tried turning on ROWCOL Bloom filters, and I did not see a change in the
performance of exists for non-existent qualifiers. Shouldn't I have? Could I
have screwed up my test somehow?
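For completeness, the shell incantation I used was along these lines (table and family names below are placeholders, not our real ones):

```
# Enable a row+column bloom filter on the family (hbase shell).
disable 'mytable'
alter 'mytable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}
enable 'mytable'
# One thing I wasn't sure about: whether store files written before the
# change ever get blooms without being rewritten, e.g. by:
major_compact 'mytable'
```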