Thank you, Paul. I was just thinking that I could add a reducer to the step that prepares the data, to build custom logic around multiple entries that produce the same rowkey. What do you think?
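Something along these lines, roughly. This is just a sketch of the reducer logic in Python (all names here are made up for illustration; the real job would be in Java and emit HBase KeyValues/Puts into HFileOutputFormat): for each rowkey, keep only the newest N versions per column, so the bulk load never writes more versions than the column family is configured to retain.

```python
from collections import defaultdict

MAX_VERSIONS = 10  # would mirror the column family's VERSIONS setting

def reduce_rowkey(rowkey, records, max_versions=MAX_VERSIONS):
    """Reducer-style dedup for one rowkey.

    `records` is an iterable of (column, timestamp, value) tuples,
    the way the shuffle would group them by rowkey. Returns at most
    `max_versions` cells per column, newest first.
    """
    by_column = defaultdict(list)
    for column, ts, value in records:
        by_column[column].append((ts, value))

    out = []
    for column, cells in by_column.items():
        # Sort newest-first and truncate, emulating what a major
        # compaction would eventually do server-side anyway.
        cells.sort(key=lambda c: c[0], reverse=True)
        for ts, value in cells[:max_versions]:
            out.append((rowkey, column, ts, value))
    return out
```

For example, with max_versions=2, three entries for the same rowkey/column would come out as the two with the highest timestamps, and any ties or custom merge rules could be handled explicitly in that loop instead of being left to compaction.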
Sent from my iPhone

On 03/10/2012, at 17:12, Paul Mackles <[email protected]> wrote:

> Keys in hbase are a combination of rowkey/column/timestamp.
>
> Two records with the same rowkey but different column will result in two different cells with the same rowkey, which is probably what you expect.
>
> For two records with the same rowkey and same column, the timestamp will normally differentiate them, but in the case of a bulk load the timestamp could be the same, so it may actually be a tie and both will be stored. There are no updates in bulk loads.
>
> All 20 versions will get loaded, but the 10 oldest will be deleted during the next major compaction.
>
> I would definitely recommend setting up small-scale tests for all of the above scenarios to confirm.
>
> On 10/3/12 3:35 PM, "Juan P." <[email protected]> wrote:
>
>> Hi guys,
>> I've been reading up on bulk load using MapReduce jobs and I wanted to validate something.
>>
>> If the input I wanted to load into HBase produced the same key for several lines, how would HBase handle that?
>>
>> I understand the MapReduce job will create StoreFiles which the region servers just pick up and make available to the users. But is there a validation to treat the first as an insert and the rest as updates?
>>
>> What about the limit on the number of versions of a key HBase can have? If I want to have 10 versions but the bulk load has 20 values for the same key, will it only keep the last 10?
>>
>> Thanks,
>> Juan
