Hi!

Yes. The MR job currently runs once a day, but the requirement might
become every X hours, aggregating only what has been changed or added in
the measurement value table since the last aggregation MR run.

Versioning? I think you are referring to the internal timestamps per row?
We are currently investigating whether they can be used to define the time
range for the next incremental run. We don't set the timestamp when
inserting rows, so I guess this is the insertion/change timestamp as a
Java date/time datatype?
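
To make the time-range idea concrete, here is a minimal sketch in plain
Python (no HBase client involved). It assumes each cell carries a write
timestamp in milliseconds since the epoch, which is what HBase assigns by
default when the client sets none, and that the window is half-open,
[start, end), like the range accepted by Scan.setTimeRange. The tuple
layout and the function names are hypothetical, chosen only for
illustration.

```python
from collections import defaultdict


def next_window(last_end_ms, now_ms):
    """Return the half-open window [last_end_ms, now_ms) for the next run."""
    if now_ms < last_end_ms:
        raise ValueError("current time is before the previous window end")
    return last_end_ms, now_ms


def aggregate_delta(cells, start_ms, end_ms):
    """Sum values per day for cells written inside [start_ms, end_ms).

    Each cell is a hypothetical tuple (day, write_ts_ms, value):
    'day' is the measurement day used as the date dimension,
    'write_ts_ms' stands in for the cell's write timestamp.
    """
    delta = defaultdict(float)
    for day, write_ts_ms, value in cells:
        if start_ms <= write_ts_ms < end_ms:
            delta[day] += value
    return dict(delta)


# Example: the previous run ended at ts=1000, the new run starts at ts=3000.
cells = [
    ("2011-09-01", 1000, 2.0),  # written exactly at the old window end: included
    ("2011-09-01", 2000, 3.0),  # written inside the window: included
    ("2011-09-02", 3000, 4.0),  # written at the new window end: left for the next run
]
start, end = next_window(1000, 3000)
delta = aggregate_delta(cells, start, end)
```

Because the window is half-open, consecutive runs neither skip nor
double-count a cell. Note that this only yields correct deltas for newly
added values; if existing values are overwritten, the affected day would
need to be recomputed rather than incremented.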

Thanks,
Thomas


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
Stack
Sent: Friday, 02 September 2011 17:16
To: [email protected]
Subject: Re: Incremental pre-aggregation strategy with MapReduce

Can you rely on versioning?  If the MR job runs once a day, only aggregate
what's changed in the last day?

Turn off speculative execution.

You'll need a means of dealing with MR jobs failing; i.e. throw away the
aggregations done by the failed job rather than have the aggregations
done by the failed job(s) plus the successful job compounded.
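
One way to picture this is to stage each job's aggregations under a run
id and fold them into the stored totals only once the job has succeeded.
The sketch below is plain Python with an in-memory store; the class and
function names are hypothetical, and in a real setup the staging area
would more likely be a staging table or a database transaction.

```python
class AggregateStore:
    """In-memory stand-in for the aggregate target (hypothetical)."""

    def __init__(self):
        self.totals = {}             # committed aggregates per day
        self.staged = {}             # run_id -> {day: delta}, not yet visible
        self.committed_runs = set()  # guards against applying a run twice

    def stage(self, run_id, day, delta):
        run = self.staged.setdefault(run_id, {})
        run[day] = run.get(day, 0.0) + delta

    def discard(self, run_id):
        # Throw away everything a failed run produced.
        self.staged.pop(run_id, None)

    def commit(self, run_id):
        if run_id in self.committed_runs:
            return  # already applied; committing again must not compound
        for day, delta in self.staged.pop(run_id, {}).items():
            self.totals[day] = self.totals.get(day, 0.0) + delta
        self.committed_runs.add(run_id)


def run_job(store, run_id, deltas, fail=False):
    """Stage all deltas, then commit; on failure discard the staged work."""
    try:
        for day, delta in deltas:
            store.stage(run_id, day, delta)
            if fail:
                raise RuntimeError("simulated task failure")
        store.commit(run_id)
        return True
    except RuntimeError:
        store.discard(run_id)
        return False


store = AggregateStore()
first = run_job(store, "run-1", [("2011-09-01", 5.0)], fail=True)  # failed run
second = run_job(store, "run-2", [("2011-09-01", 5.0)])            # retry succeeds
```

After the failed run nothing is visible in the totals, so the retry
contributes its aggregations exactly once.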

St.Ack

On Thu, Sep 1, 2011 at 11:06 PM, Steinmaurer Thomas
<[email protected]> wrote:
> Hello,
>
>
>
> we are storing detailed measurement values in a Hadoop/HBase cluster.
> For end-user / analysis tasks, we need to provide aggregated values 
> along a date dimension (aggregate by day, month, quarter, year). The 
> aggregates shall be stored in an Oracle database for easier data 
> mangling via different client types (OLAP clients ...)
>
>
>
> A brute-force approach for generating the aggregates is to run a
> MapReduce job at night which processes the entire HBase table and
> does the aggregation.
>
>
>
> I wonder, are there any best practices on how to possibly do the 
> pre-aggregation thing via a MapReduce job in an incremental way? For 
> example, how to detect changes in HBase since the last MR-Job run etc 
> ...
>
>
>
> Thanks!
>
>
>
> Regards,
>
> Thomas
>
>
>
>
