Hello,

this has already been discussed a bit in the past, but I'd like to refresh this thread because it is an important design issue in our HBase evaluation. The result of our evaluation was that we would be happy with what Hadoop/HBase offers for managing our measurement/sensor data. One crucial requirement for back-end analysis tasks, however, is fast access to aggregated data. The idea is to run a MapReduce job and store the daily aggregates in an RDBMS, which lets us access the aggregated data more easily via different tools (BI front ends etc.). Monthly and yearly aggregates would then be handled with RDBMS concepts such as materialized views and partitioning.

Processing the entire HBase table every night may be an option when we go live, but it probably won't be once the data volume grows over the years. So, what options are there for some kind of incremental aggregation over only the new data?

- Perhaps using versioning (the internal cell timestamp) might be an option?
- Perhaps some kind of daily HBase staging table, truncated after the data has been aggregated?
- How could coprocessors help here (by the time we go live, they might be available in e.g. Cloudera's distribution)?

Any ideas/comments are appreciated.

Thanks,
Thomas
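To make the timestamp option a bit more concrete: the idea would be to remember the highest cell timestamp processed by the previous aggregation run and, on the next run, scan only cells written after that watermark (in HBase, e.g. via Scan.setTimeRange). Independent of the HBase API, the core watermark logic might look roughly like this minimal sketch (all names are hypothetical; `rows` stands in for the (write-timestamp, value) cells such a scan would return):

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_new_rows(rows, last_run_ts):
    """Aggregate only cells written after last_run_ts (millis since epoch).

    rows        -- iterable of (write_ts_millis, numeric_value) pairs
    last_run_ts -- watermark from the previous run; cells at or below
                   this timestamp are assumed to be aggregated already
    Returns (daily_totals, new_watermark).
    """
    daily_totals = defaultdict(float)
    max_ts = last_run_ts
    for write_ts, value in rows:
        if write_ts <= last_run_ts:
            continue  # already covered by an earlier run
        # bucket by UTC calendar day of the write timestamp
        day = datetime.fromtimestamp(write_ts / 1000, tz=timezone.utc).date().isoformat()
        daily_totals[day] += value
        max_ts = max(max_ts, write_ts)
    return dict(daily_totals), max_ts

# usage: only the two cells newer than the watermark are aggregated
totals, watermark = aggregate_new_rows(
    [(1000, 1.0), (2000, 2.0), (3000, 3.0)], last_run_ts=1000
)
```

The new watermark would then be persisted (e.g. in the RDBMS alongside the aggregates) so the next run can pick up where this one left off. One caveat with this approach: late-arriving writes with timestamps below the stored watermark would be missed, which is one reason a staging table can be attractive instead.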
