Hi! Yes. The MR job runs once a day for now, but the requirement might become every X hours, with each run aggregating only what has changed or been added in the measurement value table since the last aggregation MR run.
Versioning? I think you are referring to the internal timestamps per row? We are currently investigating whether they can be used to define the time range for the next incremental run. We don't set the timestamp when inserting rows, so I guess this is the insertion/change timestamp as a Java date/time datatype?

Thanks,
Thomas

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Friday, 02 September 2011 17:16
To: [email protected]
Subject: Re: Incremental pre-aggregation strategy with MapReduce

Can you rely on versioning? If the MR job runs once a day, only aggregate what's changed in the last day?

Turn off speculative execution. You'll need a means of dealing with MR jobs failing; i.e. throw away the aggregations done by the failed job rather than have the aggregations done by the failed job(s) plus the successful job compounded.

St.Ack

On Thu, Sep 1, 2011 at 11:06 PM, Steinmaurer Thomas <[email protected]> wrote:
> Hello,
>
> we are storing detailed measurement values in a Hadoop/HBase cluster.
> For end-user / analysis tasks, we need to provide aggregated values
> along a date dimension (aggregate by day, month, quarter, year). The
> aggregates shall be stored in an Oracle database for easier data
> mangling via different client types (OLAP clients ...)
>
> A brute-force approach for generating the aggregates is to run a
> MapReduce job in the night which processes the entire HBase table and
> does the aggregation.
>
> I wonder, are there any best practices on how to possibly do the
> pre-aggregation thing via a MapReduce job in an incremental way? For
> example, how to detect changes in HBase since the last MR job run etc
> ...
>
> Thanks!
>
> Regards,
> Thomas
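
For illustration, here is a minimal Java sketch of the time-range idea discussed in this thread, under stated assumptions. When the client sets no timestamp, HBase stamps each cell with the region server's write time as a long in milliseconds since the epoch, and a scan can be restricted to a window with Scan.setTimeRange(minStamp, maxStamp), where minStamp is inclusive and maxStamp is exclusive. The class and method names below (IncrementalWindow, Watermark, commit) are hypothetical, and the HBase call itself is left out so the snippet runs with the JDK alone. The watermark only advances after a successful job, so a failed run is retried over the same window rather than being compounded with a later one.

```java
// Sketch only: all names here are hypothetical, not part of HBase or Hadoop.

/** Half-open time window [minStampInclusive, maxStampExclusive) in epoch milliseconds. */
final class IncrementalWindow {
    final long minStampInclusive;
    final long maxStampExclusive;

    IncrementalWindow(long minStampInclusive, long maxStampExclusive) {
        if (minStampInclusive > maxStampExclusive) {
            throw new IllegalArgumentException("window start is after window end");
        }
        this.minStampInclusive = minStampInclusive;
        this.maxStampExclusive = maxStampExclusive;
    }

    /** Window for the next run: everything written since the last successful run. */
    static IncrementalWindow next(long lastSuccessfulEnd, long now) {
        return new IncrementalWindow(lastSuccessfulEnd, now);
    }
}

/** Remembers the end of the last window that was aggregated successfully. */
final class Watermark {
    private long lastSuccessfulEnd;

    Watermark(long initial) {
        this.lastSuccessfulEnd = initial;
    }

    long get() {
        return lastSuccessfulEnd;
    }

    /**
     * Advance only if the job over this window succeeded. After a failure the
     * watermark stays put, so the next run covers the same start again.
     */
    void commit(IncrementalWindow window, boolean jobSucceeded) {
        if (jobSucceeded) {
            lastSuccessfulEnd = window.maxStampExclusive;
        }
    }
}
```

In a real job, the two window bounds would be passed to the scan that feeds the mapper, and the watermark would have to be stored somewhere durable (an HBase row or the Oracle database, for example). Note that cell timestamps only reveal puts; this sketch does not detect deleted rows.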
