Why don't you aggregate these data in a preprocessing step... like a map-reduce job? You can then load the output of that work directly into HBase.
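A minimal sketch of what such a preprocessing job could look like, assuming
plain comma-separated input like your sample; the class name and the
input/output paths are placeholders, not anything HBase ships with:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregateAmounts {

  // Emits (product-id, amount) for each input line such as "P1, 1000".
  public static class AmountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 2) {
        context.write(new Text(fields[0].trim()),
            new LongWritable(Long.parseLong(fields[1].trim())));
      }
    }
  }

  // The shuffle delivers every amount for a given product-id to the same
  // reducer call, so the sum is computed without any shared HTable state.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "aggregate-amounts");
    job.setJarByClass(AggregateAmounts.class);
    job.setMapperClass(AmountMapper.class);
    job.setCombinerClass(SumReducer.class); // safe: summing is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the grouping happens in the shuffle, no mapper ever needs to GET from
the live table, which sidesteps the race condition entirely. The output is one
pre-aggregated line per product-id, which you can then bulk load the usual way
(e.g. importtsv, or a second job writing HFiles via HFileOutputFormat plus
completebulkload).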
-n

On Wed, Nov 28, 2012 at 5:37 AM, Narayanan K <[email protected]> wrote:
> Hi all,
>
> I have a scenario where I need to do aggregation while bulk loading into
> HBase.
>
> Say, for example, I have the following rows in my flat file, each with 2
> fields - product-id, amount. Values as below:
>
> P1, 1000
> P2, 200
> P3, 2500
> P1, 1500
> P2, 300
>
> My rowkey is product-id and I have a column: details:amount=<val>
>
> What I want is, after the bulk load of the above file, the table must have
> the following rows and column values:
>
> P1 -- details:amount=2500
> P2 -- details:amount=500
> P3 -- details:amount=2500
>
> My understanding of Bulk Load is that, when the map function gets a row
> from the file, it can do some transformation, prepare the rowkey and
> columns, and then write to the HBase table.
>
> But in our case, we would need an instance of the HTable in the Mapper, do
> a GET operation to check whether the rowkey already exists, add up the
> column amounts, and then write back. But in that case, all parallel
> mappers will open a connection to the same table and the GETs will not be
> synchronized - leading to race conditions, right?
>
> Is this the right way to do it? If not, what are the other ways by which
> this can be achieved?
>
> Thanks in advance,
> Narayanan K
