Hello,

this has already been discussed a bit in the past, but I'd like to refresh this thread because it is an important design issue in our HBase evaluation. The result of our evaluation was that we would be happy with what Hadoop/HBase offers for managing our measurement/sensor data. One crucial requirement for back-end analysis tasks, however, is fast access to aggregated data. The idea is to run a MapReduce job and store the daily aggregates in an RDBMS, which lets us access the aggregated data more easily via different tools (BI front ends etc.). Monthly and yearly aggregates would then be handled with RDBMS concepts such as materialized views and partitioning.

Processing the entire HBase table every night may be an option when we go live, but it probably won't be once the data volume grows over the years. So, what options are there for some kind of incremental aggregation over only the new data?

- Perhaps using versioning (the internal cell timestamp) might be an option?
- Perhaps some kind of daily HBase staging table, truncated after the data has been aggregated?
- How could coprocessors help here (by the time we go live, they might be available in e.g. Cloudera's distribution)?

Any ideas/comments are appreciated.

Thanks,
Thomas
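To make the timestamp option a bit more concrete: the idea would be to remember the highest cell timestamp processed by the previous aggregation run and, on the next run, scan only cells written after that watermark (in HBase, e.g. via Scan.setTimeRange). Independent of the HBase API, the core watermark logic might look roughly like this minimal sketch (all names are hypothetical; `rows` stands in for the (write-timestamp, value) cells such a scan would return):

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_new_rows(rows, last_run_ts):
    """Aggregate only cells written after last_run_ts (millis since epoch).

    rows        -- iterable of (write_ts_millis, numeric_value) pairs
    last_run_ts -- watermark from the previous run; cells at or below
                   this timestamp are assumed to be aggregated already
    Returns (daily_totals, new_watermark).
    """
    daily_totals = defaultdict(float)
    max_ts = last_run_ts
    for write_ts, value in rows:
        if write_ts <= last_run_ts:
            continue  # already covered by an earlier run
        # bucket by UTC calendar day of the write timestamp
        day = datetime.fromtimestamp(write_ts / 1000, tz=timezone.utc).date().isoformat()
        daily_totals[day] += value
        max_ts = max(max_ts, write_ts)
    return dict(daily_totals), max_ts

# usage: only the two cells newer than the watermark are aggregated
totals, watermark = aggregate_new_rows(
    [(1000, 1.0), (2000, 2.0), (3000, 3.0)], last_run_ts=1000
)
```

The new watermark would then be persisted (e.g. in the RDBMS alongside the aggregates) so the next run can pick up where this one left off. One caveat with this approach: late-arriving writes with timestamps below the stored watermark would be missed, which is one reason a staging table can be attractive instead.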
