I could use some additional feedback on what I should expect from HBase
performance and how it affects schema design ("additional" because I had a
brief exchange on IRC about this yesterday).
In short: we tried to load some data into our test system and found we were
spending a lot of time in HTable.exists. The client code made some redundant
checks that things did or did not exist. (In fairness, those checks probably
make sense outside of a bulk-loading context.)
Our test system, running on a not-too-elderly 2.3GHz 8-core Opteron, used
4-byte row keys for the narrow test and 4-byte qualifiers for the wide test.
The narrow test shows HTable.get and HTable.exists are basically constant time,
and the wide test shows HTable.get and HTable.exists are basically linear in
the ordinal position of the qualifier.
The problem? Extrapolating from that linear-time behavior for wide rows: with
250,000 qualifiers in a row, we can expect retrieving the last one to take
about half a second. I had assumed access time within a row would be
logarithmic rather than linear, since the qualifiers are stored sorted and
could be binary-searched. That's apparently not the case.
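To make the expectation concrete, here's a toy comparison-count model I put together (this is hypothetical illustration code, not HBase internals): a linear scan to ordinal position i costs about i comparisons, while a binary search over the same sorted qualifiers would cost about log2(n).

```java
// Models access cost inside one wide row of n sorted qualifiers:
// a linear scan to ordinal position i (what the measurements suggest)
// vs. a binary search (what I had expected). Illustrative only.
public class WideRowScanModel {

    // Comparisons for a linear scan that stops at ordinal position i.
    static long linearCost(int i) {
        return (long) i + 1;
    }

    // Probes for a binary search over the sorted qualifiers.
    static long binarySearchCost(int[] qualifiers, int target) {
        long probes = 0;
        int lo = 0, hi = qualifiers.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            probes++;
            if (qualifiers[mid] < target) lo = mid + 1;
            else if (qualifiers[mid] > target) hi = mid - 1;
            else break;
        }
        return probes;
    }

    public static void main(String[] args) {
        int n = 250_000;                 // qualifiers per wide row
        int[] qualifiers = new int[n];   // stand-in for sorted 4-byte qualifiers
        for (int i = 0; i < n; i++) qualifiers[i] = i;

        // Last qualifier: the scan touches all 250,000 entries,
        // while binary search needs roughly log2(250,000) ~ 18 probes.
        System.out.println("linear scan:   " + linearCost(n - 1));
        System.out.println("binary search: " + binarySearchCost(qualifiers, n - 1));
    }
}
```

At 250,000 qualifiers that's a gap of four orders of magnitude in comparisons, which is why the observed half-second surprised me.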
Most of the schema design advice I've seen doesn't mention this when comparing
rows vs columns. But it looks like we can't get all three of random access,
(row-level) atomicity, and scale-independent performance using wide rows.
Is my analysis off-base? I'd appreciate any advice on how to trade off those
three desires.
Thanks.
joe
PS I tried turning on ROWCOL Bloom filters, and I did not see a change in the
performance of exists for non-existent qualifiers. Shouldn't I have? Could I
have screwed up my test somehow?
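For completeness, the shell incantation I used was along these lines (table and family names below are placeholders, not our real ones):

```
# Enable a row+column bloom filter on the family (hbase shell).
disable 'mytable'
alter 'mytable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}
enable 'mytable'
# One thing I wasn't sure about: whether store files written before the
# change ever get blooms without being rewritten, e.g. by:
major_compact 'mytable'
```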