Hi, I want to aggregate time-stamped event data at daily, weekly, and monthly levels. The data is stored in a directory tree laid out as data/yyyy/mm/dd/dat.gz, for example data/2010/01/01/dat.gz. Each dat.gz file contains tuples in (datetime, id, value) format. I can perform the aggregation as follows:
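(This is a simplified sketch of my current code, not verbatim: I'm assuming Java 8's java.time, the plain RDD API, and sum as the aggregate; parseLine and the exact input layout are placeholders.)

    import java.time.LocalDateTime
    import java.time.temporal.IsoFields
    import org.apache.spark.{SparkConf, SparkContext}

    case class Event(dt: LocalDateTime, id: String, value: Double)

    // Placeholder parser: assumes tab-separated "datetime id value" lines
    // with an ISO-8601 datetime such as 2010-01-01T12:34:56.
    def parseLine(line: String): Event = {
      val Array(dt, id, v) = line.split('\t')
      Event(LocalDateTime.parse(dt), id, v.toDouble)
    }

    val sc = new SparkContext(new SparkConf().setAppName("event-agg"))

    // Glob every day; gzip is not splittable, so each file is one partition.
    val events = sc.textFile("data/*/*/*/dat.gz").map(parseLine).cache()

    // Three separate passes over the data, each with its own shuffle.
    val daily = events
      .map(e => ((e.dt.toLocalDate, e.id), e.value))
      .reduceByKey(_ + _)

    val weekly = events
      .map { e =>
        val d = e.dt.toLocalDate
        ((d.get(IsoFields.WEEK_BASED_YEAR), d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR), e.id), e.value)
      }
      .reduceByKey(_ + _)

    val monthly = events
      .map(e => ((e.dt.getYear, e.dt.getMonthValue, e.id), e.value))
      .reduceByKey(_ + _)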
But this code doesn't seem efficient, because it doesn't exploit the independence of the data across the reduce steps. For example, there is no dependency between the data of 2010-01 and that of any other month (2010-02, 2010-03, ...), so ideally each node could load one month of data once and perform all three (daily, weekly, and monthly) aggregations on it.

I think I could use mapPartitions and one big reducer that performs all three aggregations, but I'm not sure that's the right way to go. Is there a more efficient way to perform these aggregations (loading the data only once) while keeping the code modular?

Thanks,
K
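P.S. Here is roughly what I mean by the mapPartitions idea (again only a sketch, reusing the imports, Event, and events from above; the "D"/"W"/"M" tags and string buckets are just for illustration):

    import scala.collection.mutable

    val partials = events.mapPartitions { iter =>
      // (granularity tag, time bucket, id) -> partial sum for this partition
      val sums = mutable.Map.empty[(String, String, String), Double]
      iter.foreach { e =>
        val d = e.dt.toLocalDate
        val buckets = Seq(
          ("D", d.toString),                                     // e.g. 2010-01-01
          ("W", s"${d.get(IsoFields.WEEK_BASED_YEAR)}-W" +
                s"${d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)}"), // e.g. 2010-W1
          ("M", f"${d.getYear}%04d-${d.getMonthValue}%02d"))     // e.g. 2010-01
        buckets.foreach { case (g, b) =>
          val k = (g, b, e.id)
          sums(k) = sums.getOrElse(k, 0.0) + e.value
        }
      }
      sums.iterator
    }

    // A single shuffle combines the per-partition partials for all levels.
    val all = partials.reduceByKey(_ + _)

    // Pull one level back out when needed, e.g. the daily totals:
    val dailyAgg = all.filter { case ((g, _, _), _) => g == "D" }

This reads the input once and shuffles once, but all three aggregations end up tangled inside one function, which is exactly my modularity worry.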