There are no duplicates. Cells have versions, which are timestamped, so each write is stored as its own version. You could set the number of versions to one... but I'd recommend sticking with the default of 3.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Mar 2, 2013, at 9:42 PM, Matt Corgan <[email protected]> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write. I want to retain MAX_VERSIONS=Integer.MAX_VALUE,
> however, I don't want to keep all the duplicate copies around forever. At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates. By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
>
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
>
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
>
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false. The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
>
> Thanks for any tips,
> Matt
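
For reference, the "keep only the oldest version of duplicates" rule from the quoted message can be sketched in plain Java, independent of any HBase API. This is only an illustration of the comparison logic, not a coprocessor: the class DuplicateVersionFilter, the Version type, and the method keepOldestOfDuplicates are hypothetical names made up for this sketch. It assumes the versions of a single row/family/qualifier arrive newest first, which is the order HBase sorts versions of a cell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: models the de-duplication rule only.
final class DuplicateVersionFilter {

    // One version of a cell: a timestamp plus its value bytes (illustrative type).
    static final class Version {
        final long timestamp;
        final byte[] value;

        Version(long timestamp, byte[] value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Input: all versions of one row/family/qualifier, newest first.
    // Output: same order, with every run of consecutive equal values
    // collapsed to the oldest version in that run.
    static List<Version> keepOldestOfDuplicates(List<Version> newestFirst) {
        List<Version> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            Version current = newestFirst.get(i);
            boolean hasOlder = i + 1 < newestFirst.size();
            // Drop this version if the next older one carries the same value;
            // the older one already records when that value first appeared.
            if (hasOlder && Arrays.equals(current.value, newestFirst.get(i + 1).value)) {
                continue;
            }
            kept.add(current);
        }
        return kept;
    }
}
```

Only adjacent versions are compared, so a value that changes and later changes back is kept at each point where it reappears, which is what makes a snapshot at any historical time possible. The cost is one byte-array comparison per version, which matches the extra compaction-time CPU mentioned in the quoted message.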
