Hidey Ho,
I went to a talk last week on HBase Do's and Don'ts and discovered the
Java client I used to populate my HBase tables is a "don't". I spent the
weekend trying to come up with a better way to populate the table but couldn't,
so I throw the question to the group.
Conditions:
Receive a new Search Log file every ten minutes.
The log files contain anywhere from 500-2,000k rows.
The rows contain anywhere from 28 to 100 columns of data to be parsed.
Receive a new Click Log every morning.
The Click Log contains around 300-400k rows with each row having 15
columns of data.
I have a six-node cluster (32-bit, 4 GB RAM), with four of the servers acting
as Region Servers.
Constraints:
The data in HBase from the Search Logs can't lag by more than ten minutes.
Queries to HBase must have an average return time of less than one second,
worst case four seconds.
Reports are based on a summary of a day's data.
Need to add new reports rapidly (in under a day).
Currently my 'solution' consists of a long running Java application that reads
in a new Search Log when it appears, aggregates the required columns and then
updates the HBase Tables. I keep a running total of the day's aggregated
columns in Maps so I don't have to reread the day's data to update my totals.
Currently a day's worth of data fits in 10 GB of memory, but that won't scale
forever. The Click Logs are only read once from a database and then placed into an
HBase table. I can add a new report by updating the import to collect the new
data and then store that new data in a new HBase Table. I then create a new
query just for that table.
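
To make the current approach concrete, here is a minimal sketch of the running-totals idea in plain Java. It is not the real application: the class name, the method names, and the assumption that each parsed row has already been reduced to a (key, count) pair are all hypothetical, and the HBase write itself is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the real application.
class DailyAggregator {
    // Running totals for the current day, keyed by an assumed report key.
    private final Map<String, Long> totals = new HashMap<>();

    // Folds one parsed log file into the day's totals. Each row is assumed
    // to be {key, count}. Returns only the keys touched by this file, with
    // their new totals, i.e. the rows that would need rewriting in HBase.
    Map<String, Long> applyFile(Iterable<String[]> rows) {
        Map<String, Long> changed = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0];
            long count = Long.parseLong(row[1]);
            long updated = totals.merge(key, count, Long::sum);
            changed.put(key, updated);
        }
        return changed;
    }

    // Current total for a key, or 0 if the key has not been seen today.
    long total(String key) {
        return totals.getOrDefault(key, 0L);
    }

    // Drops all totals at the day boundary.
    void startNewDay() {
        totals.clear();
    }

    public static void main(String[] args) {
        DailyAggregator agg = new DailyAggregator();
        List<String[]> firstFile = Arrays.asList(
                new String[] {"query:hbase", "2"},
                new String[] {"query:java", "1"});
        List<String[]> secondFile = Arrays.asList(
                new String[] {"query:hbase", "3"},
                new String[] {"query:hbase", "1"});
        agg.applyFile(firstFile);
        // Only the keys present in the second file come back as changed.
        System.out.println(agg.applyFile(secondFile));
    }
}
```

The memory problem is visible here: the totals map grows with the number of distinct keys seen in a day, regardless of how few keys each ten-minute file touches.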
My question is...
What would be a better approach (MapReduce, etc.) that satisfies my
constraints under the current conditions?
Thanks
-Pete