Hi,

My organization has been doing something zany to simulate atomic row operations in HBase.
We have a converter-object model for the Writables that populate an HBase table, and one of the governing assumptions is that if you are dealing with an Object record, you read all of the columns that compose it out of HBase or another data source. When we pull large amounts of data from a source system that we are mirroring in HBase, a null column means that whatever HBase currently holds for that column is no longer valid.

We have simulated what I believe is now called an AtomicRowMutation by using a single Put and writing blanks for the null columns. The downside is the wasted space accrued by the metadata for the blank columns. Atomicity is not of utmost importance to us, but performance is.

My new approach is to create a Put and a Delete for each record, populating the Delete with the null columns, and then call HTable.batch(List<Row>) on a bunch of these. My impression is that this shouldn't appreciably increase network traffic, since the RPC calls will be bundled.

Has anyone else addressed this problem? Does this seem like a reasonable approach? What sort of performance overhead should I expect? Also, I've seen some Jira tickets about making this an atomic operation in its own right. Is that something I can expect in CDH3U4?

Thanks,
Keith Wyss
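P.S. In case a concrete sketch helps, here is roughly what I mean, written against the old-style 0.90.x client API. The table name, column family, and helper method are made up for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDeleteBatchExample {

    private static final byte[] FAMILY = Bytes.toBytes("d"); // made-up family name

    /**
     * Turns one record's column map into a Put (non-null values) and a
     * Delete (null values). A null column from the source system means
     * "no longer valid", so we delete it instead of writing a blank.
     */
    static List<Row> toMutations(byte[] rowKey, Map<String, byte[]> columns) {
        Put put = new Put(rowKey);
        Delete delete = new Delete(rowKey);
        boolean hasPuts = false;
        boolean hasDeletes = false;
        for (Map.Entry<String, byte[]> e : columns.entrySet()) {
            byte[] qualifier = Bytes.toBytes(e.getKey());
            if (e.getValue() == null) {
                delete.deleteColumns(FAMILY, qualifier); // all versions of the column
                hasDeletes = true;
            } else {
                put.add(FAMILY, qualifier, e.getValue());
                hasPuts = true;
            }
        }
        List<Row> mutations = new ArrayList<Row>(2);
        if (hasPuts) {
            mutations.add(put);
        }
        // Careful: a Delete with no columns specified deletes the entire row,
        // so only add it when at least one column was marked for deletion.
        if (hasDeletes) {
            mutations.add(delete);
        }
        return mutations;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mirror_table"); // made-up table name
        try {
            List<Row> batch = new ArrayList<Row>();
            // ... accumulate toMutations(...) output for many records here ...
            // A single batch() call groups the operations into bundled
            // per-regionserver RPCs, which is why I don't expect the extra
            // Delete objects to cost additional round trips.
            table.batch(batch);
        } finally {
            table.close();
        }
    }
}

The flag checks before adding the Delete matter because HBase treats a Delete with no columns specified as "delete the whole row", which is not what we want here.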
