Hidey Ho,
        I went to a talk last week on HBase Do's and Don'ts and discovered the 
Java client I used to populate my HBase tables is a "don't". I spent the 
weekend trying to come up with a better way to populate the table but couldn't, 
so I throw the question to the group.

Conditions:
Receive a new Search Log file every ten minutes. 
        The log files contain anywhere from 500-2,000k rows. 
        The rows contain anywhere from 28 to 100 columns of data to be parsed.

Receive a new Click Log every morning. 
        The Click Log contains around 300-400k rows with each row having 15 
columns of data.

I have a six-node cluster (32-bit, 4 GB RAM) with four of the servers being Region 
Servers.

Constraints:
The data in HBase from the Search Logs can't lag by more than ten minutes.
Queries to HBase must have an average return time of less than one second, 
worst case four seconds.
Reports are based on a summary of a day's data.
Need to add new reports rapidly (in under a day).


Currently my 'solution' consists of a long running Java application that reads 
in a new Search Log when it appears, aggregates the required columns and then 
updates the HBase Tables. I keep a running total of the day's aggregated 
columns in Maps so I don't have to reread the day's data to update my totals. 
Currently a day's worth of data fits in 10G of memory but that won't scale 
forever. The Click Logs are only read once from a database and then placed into an 
HBase table. I can add a new report by updating the import to collect the new 
data and then store that new data in a new HBase Table. I then create a new 
query just for that table.
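
To make the running-totals part concrete, here is a minimal sketch of the idea, 
not the actual application code. The class name, the method names, the 
"day|dimension" key format and the three-field row layout are all assumptions 
made up for illustration, and the HBase write is left out entirely.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of in-memory running totals for one or more days.
// All names and the row layout are illustrative assumptions.
final class DailyTotals {
    // Key is "day|dimension", value is the running count for that key.
    private final Map<String, Long> totals = new HashMap<>();

    // Fold one ten-minute batch of parsed rows into the running totals.
    // Each row is assumed to be {day, dimension, count}.
    void addBatch(List<String[]> rows) {
        for (String[] row : rows) {
            String key = row[0] + "|" + row[1];
            long count = Long.parseLong(row[2]);
            totals.merge(key, count, Long::sum);
        }
    }

    // Current total for a day and dimension, or 0 if nothing was seen yet.
    long get(String day, String dimension) {
        return totals.getOrDefault(day + "|" + dimension, 0L);
    }

    // Drop a finished day's entries so the map does not grow without bound.
    void evictDay(String day) {
        totals.keySet().removeIf(k -> k.startsWith(day + "|"));
    }
}
```

Each new Search Log becomes one addBatch call, and the updated totals are what 
get written to the HBase tables. Every distinct key stays in the heap until its 
day is evicted, which is why the memory footprint grows with the data.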

My question is...
  What would be a better approach (map/reduce, etc.) that satisfies my 
constraints under the current conditions?

Thanks

-Pete
