Hi,

My organization has been doing something zany to simulate atomic row operations 
in HBase.

We have a converter-object model for the writables that are populated in an 
HBase table, and one of the governing assumptions is that when you deal with an 
object record, you read all of the columns that compose it from HBase or from a 
different data source.

When we read lots of data in from a source system that we are trying to mirror 
with HBase, a null column means that whatever is in HBase for that column is no 
longer valid. We have simulated what I believe is now called an 
AtomicRowMutation by using a single Put and populating it with blanks. The 
downside is the wasted space accrued by the metadata for the blank columns.
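
Roughly, it looks like this today (a sketch only -- the table, family, column 
names and row key are made up for illustration, using the 0.90-era client API 
that ships with CDH3):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BlankPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mirror_table");

        // One Put per record; columns that are null in the source get an
        // empty value so the stale data in HBase is overwritten.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("colA"), Bytes.toBytes("new value"));
        // colB is null upstream, so we write a blank and pay for the
        // per-cell metadata of an empty column.
        put.add(Bytes.toBytes("d"), Bytes.toBytes("colB"), HConstants.EMPTY_BYTE_ARRAY);

        table.put(put);
        table.close();
    }
}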

Atomicity is not of utmost importance to us, but performance is. My approach 
has been to create a Put and a Delete object for each record and populate the 
Delete with the null columns, then call HTable.batch(List<Row>) on a bunch of 
these. My impression is that this shouldn't appreciably increase network 
traffic, since the RPC calls will be bundled.
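
Sketched out (again with made-up names, and assuming the same CDH3-era API), 
the idea is:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDeleteBatchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mirror_table");

        List<Row> actions = new ArrayList<Row>();

        // For each record: non-null columns go into a Put, null columns
        // into a Delete on the same row key.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("colA"), Bytes.toBytes("new value"));
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        delete.deleteColumns(Bytes.toBytes("d"), Bytes.toBytes("colB"));
        actions.add(put);
        actions.add(delete);

        // ... repeat for the rest of the records, then send one batch so
        // the RPCs are bundled per region server rather than one call per
        // action.
        Object[] results = new Object[actions.size()];
        table.batch(actions, results);

        table.close();
    }
}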

Has anyone else addressed this problem? Does this seem like a reasonable 
approach?
What sort of performance overhead should I expect?

Also, I've seen some Jira tickets about making this an atomic operation in its 
own right. Is that something that
I can expect with CDH3U4?

Thanks,

Keith Wyss
