Hi All,

I would like to get your opinion on how best to optimize an HBase cluster for MapReduce jobs. The main reason we want to experiment with HBase is to do near-real-time aggregation of the data we receive. A service writes a constant stream of data to HBase, and I would like to schedule a job that runs every hour or two, aggregates the data, and writes the results back to HBase.
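
For concreteness, here is a minimal sketch of the hourly job I have in mind, assuming a raw table 'events' (family 'd', qualifier 'metric') and a results table 'aggregates' (family 'agg'); all the names and the count-per-metric logic are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class HourlyAggregation {

      // Scans the raw table and emits (metric, 1) for every row seen.
      static class AggMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text metric = new Text();

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          byte[] m = value.getValue(Bytes.toBytes("d"), Bytes.toBytes("metric"));
          if (m != null) {
            metric.set(m);
            context.write(metric, ONE);
          }
        }
      }

      // Sums the counts per metric and writes the result back to HBase.
      static class AggReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          Put put = new Put(Bytes.toBytes(key.toString()));
          put.add(Bytes.toBytes("agg"), Bytes.toBytes("count"), Bytes.toBytes(sum));
          context.write(null, put);  // TableOutputFormat ignores the key
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hourly-aggregation");
        job.setJarByClass(HourlyAggregation.class);

        Scan scan = new Scan();
        scan.setCaching(10000);      // fetch rows in large batches for the sequential scan
        scan.setCacheBlocks(false);  // a full scan shouldn't churn the block cache

        TableMapReduceUtil.initTableMapperJob("events", scan,
            AggMapper.class, Text.class, LongWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("aggregates", AggReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }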
Random reads by key will only be done against the aggregated results, so I have set the region size to a higher value (100 GB) to optimize for the scan-heavy case. I have also set Scan.setCaching() to about 10,000 rows for the MapReduce job, as in the sketch above (I might tune this further). Are there any other parameters the community would suggest tuning for this use case?

The machines have 8 GB of memory. For now I have allocated about 4 GB to the RegionServer and only 1 GB to the DataNode/JobTracker. What should I keep in mind when allocating memory to these daemons?

Your input is appreciated.

Thanks,
Sam
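
P.S. For reference, this is roughly how I am setting the 100 GB region size on the raw table itself rather than cluster-wide; again, the table and family names are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateEventsTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor htd = new HTableDescriptor("events");
        htd.addFamily(new HColumnDescriptor("d"));
        // Let regions grow to ~100 GB before splitting, so the hourly scan
        // covers a few large regions instead of many small ones.
        htd.setMaxFileSize(100L * 1024L * 1024L * 1024L);

        admin.createTable(htd);
      }
    }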
