Hello,
We are storing detailed measurement values in a Hadoop/HBase cluster. For end-user and analysis tasks, we need to provide aggregated values along a date dimension (aggregated by day, month, quarter, and year). The aggregates will be stored in an Oracle database so that different client types (OLAP clients, ...) can work with the data more easily.

A brute-force approach to generating the aggregates is to run a nightly MapReduce job that processes the entire HBase table and performs the aggregation.

Are there any best practices for doing this pre-aggregation incrementally with a MapReduce job? For example, how can the job detect which data in HBase has changed since the last MR job run?

Thanks!

Regards,
Thomas
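
P.S. To make the kind of aggregation I have in mind more concrete, here is a minimal sketch in plain Python (not actual MapReduce or HBase code). It assumes that every measurement carries a write timestamp, that measurements are only ever appended (never overwritten), and that summing is the aggregate we want. The names `bucket_keys`, `apply_increment`, and `last_run_ts` are made up for illustration only.

```python
from collections import defaultdict
from datetime import date


def bucket_keys(d: date) -> list[str]:
    """Return the day, month, quarter, and year buckets a date falls into."""
    quarter = (d.month - 1) // 3 + 1
    return [
        f"day:{d.isoformat()}",
        f"month:{d.year}-{d.month:02d}",
        f"quarter:{d.year}-Q{quarter}",
        f"year:{d.year}",
    ]


def apply_increment(aggregates, records, last_run_ts):
    """Fold only records written after last_run_ts into the running sums.

    records: iterable of (write_ts, measured_on, value) tuples.
    Returns the new watermark to remember for the next run.
    Assumption: records are append-only, so adding deltas is safe.
    """
    new_watermark = last_run_ts
    for write_ts, measured_on, value in records:
        if write_ts <= last_run_ts:
            continue  # already aggregated in an earlier run
        for key in bucket_keys(measured_on):
            aggregates[key] += value
        new_watermark = max(new_watermark, write_ts)
    return new_watermark


# First run: everything is new.
aggregates = defaultdict(float)
records = [
    (100, date(2011, 3, 31), 2.0),
    (105, date(2011, 4, 1), 3.0),
]
watermark = apply_increment(aggregates, records, last_run_ts=0)

# Second run: only the record written after the watermark is processed.
records.append((110, date(2011, 4, 1), 1.5))
watermark = apply_increment(aggregates, records, last_run_ts=watermark)
```

In this sketch the second run touches only the one new record, and the stored watermark is what would have to be persisted between job runs. If existing values can be overwritten, simply adding deltas would double count, so the affected buckets would have to be recomputed instead.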
