Hi All,

I would like to get your opinion on how best to optimize an HBase cluster
for MapReduce jobs. The main reason we want to experiment with HBase is to
do near-real-time aggregation of the data we receive. A service writes a
constant stream of data to HBase, and I would like to schedule a job that
runs every hour or two, aggregates that data, and writes the results of the
aggregation back to HBase.

Random reads by key will only be done against the aggregated results.

I have set the region size to a higher value (100 GB) to optimize for this
case. I have also set Scan.setCaching() to about 10,000 rows for the
MapReduce job (I might tune this further). Are there any other parameters
the community would suggest for this use case?
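
For completeness, here is roughly how the scan for the job is set up (the
100 GB region size is hbase.hregion.max.filesize in hbase-site.xml; the
family name "d" is the made-up one from the sketch above). I am also
considering setCacheBlocks(false), since a full sequential scan should not
need to go through the block cache:

Scan scan = new Scan();
scan.setCaching(10000);             // rows fetched per RPC to the RegionServer
scan.setCacheBlocks(false);         // keep the full scan from churning the block cache
scan.addFamily(Bytes.toBytes("d")); // restrict the scan to the one family the job reads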

The machines have 8 GB of memory, of which I have allocated about 4 GB to
the RegionServer for now and only 1 GB to the DataNode/JobTracker. What
should I keep in mind when allocating memory to these daemons?
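
For reference, the allocations above correspond to settings along these
lines in hbase-env.sh and hadoop-env.sh (values in MB; the exact numbers
are just what I have now):

# hbase-env.sh -- heap for the HBase daemons (RegionServer)
export HBASE_HEAPSIZE=4000

# hadoop-env.sh -- heap for the Hadoop daemons (DataNode, JobTracker)
export HADOOP_HEAPSIZE=1000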

Your input is appreciated.

Thank you,

Sam
