We've been using Pig to read bulk data from HDFS, transform it, and load it into HBase using the HBaseStorage class, which has worked well for us. If you try it out, you'll want to build from the 0.9.0 branch (being cut as we speak, I believe) or from trunk. There's an open Pig JIRA with a patch to disable the WAL that you might want to consider too, but I don't recall the JIRA # OTTOMH.
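
For what it's worth, here is a rough sketch of what such a script can look like. The input path, field names, table name, and column family are all made up for illustration, so treat it as a starting point and not as our actual job. It also assumes tab-delimited logs.

```pig
-- Hypothetical input path and schema; adjust to the real log layout.
raw = LOAD '/logs/search/2011-04-19' USING PigStorage('\t')
      AS (query:chararray, ts:long, clicks:long);

-- Aggregate per query; a real job would group on whatever the reports need.
grouped = GROUP raw BY query;
summary = FOREACH grouped GENERATE
              group           AS rowkey,
              COUNT(raw)      AS searches,
              SUM(raw.clicks) AS clicks;

-- HBaseStorage uses the first field of each tuple as the row key and maps
-- the remaining fields, in order, onto the listed columns.
-- 'search_summary' and the 'stats' column family are assumed to exist already.
STORE summary INTO 'hbase://search_summary'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'stats:searches stats:clicks');
```

Since a day's summary is recomputed from the files in HDFS, the running totals don't have to live in the memory of a long-running process.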
Bill

On Tuesday, April 19, 2011, Peter Haidinyak <[email protected]> wrote:
> Hidey Ho,
> I went to a talk last week on HBase Do's and Don'ts and discovered
> the Java client I used to populate my HBase tables is a "don't". I spent the
> weekend trying to come up with a better way to populate the table but
> couldn't, so I throw the question to the group.
>
> Conditions:
> Receive a new log file every ten minutes.
> The log files contain anywhere from 500-2,000k rows.
> The rows contain anywhere from 28 to 100 columns of data to be parsed.
>
> Receive a new Click Log every morning.
> The Click Log contains around 300-400k rows with each row having 15
> columns of data.
>
> I have a six node cluster (32bit 4G RAM) with four of the servers being
> Region Servers.
>
> Constraints:
> The data in HBase from the Search Logs can't lag by more than ten minutes.
> Queries to HBase must have an average return time of less than one second,
> worst case four seconds.
> Reports are based on a summary of a day's data.
> Need to add new reports rapidly. (Under a day).
>
>
> Currently my 'solution' consists of a long running Java application that
> reads in a new Search Log when it appears, aggregates the required columns
> and then updates the HBase Tables. I keep a running total of the day's
> aggregated columns in Maps so I don't have to reread the day's data to update
> my totals. Currently a day's worth of data fits in 10G of memory but that
> won't scale for ever. The Click Logs are only read once from a database and
> then placed into an HBase table. I can add a new report by updating the import
> to collect the new data and then store that new data in a new HBase Table. I
> then create a new query just for that table.
>
> My question is...
> What would be a better approach (map/reduce, etc) that with the current
> conditions satisfies my constraints?
>
> Thanks
>
> -Pete
>
