Which version of Spark? In 1.4, window functions will be available for exactly this kind of scenario.
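For reference, a minimal sketch of how that could look with the 1.4 DataFrame window API; the DataFrame and its column names (id, day, counter) are hypothetical, and it assumes one row per key per day:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Sliding 3-day total per key: the current row plus the 2 preceding
    // ones, assuming exactly one row per (id, day).
    // "daily", "id", "day", "counter" are hypothetical names.
    def addThreeDayBlock(daily: DataFrame): DataFrame = {
      val w = Window.partitionBy("id").orderBy("day").rowsBetween(-2, 0)
      daily.withColumn("counter_3d", sum("counter").over(w))
    }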
One thing I can suggest is to keep the daily aggregates materialised, partitioned by key, and sorted by the key-day combination using the repartitionAndSortWithinPartitions method. It allows you to plug in a custom partitioner and a custom sort ordering (a sketch follows below the quoted message).

Best,
Ayan

On 29 May 2015 03:31, "igor.berman" <igor.ber...@gmail.com> wrote:
> Hi,
> I have a batch daily job that computes a daily aggregate of several
> counters represented by some object.
> After the daily aggregation is done, I want to compute blocks of
> multi-day aggregations (3, 7, 30 days, etc.).
> To do so, I need to add the new daily aggregation to the current block
> and then subtract from the current block the daily aggregation of the
> last day within the current block (a sliding window...).
> I've implemented it with something like:
>
> baseBlockRdd.leftOuterJoin(lastDayRdd).map(subtraction)
>             .fullOuterJoin(newDayRdd).map(addition)
>
> All RDDs are keyed by a unique id (Long). Each RDD is saved to Avro files
> after the job finishes and loaded when the job starts (on the next day).
> baseBlockRdd is much larger than the lastDay and newDay RDDs (how much
> larger depends on the size of the block).
>
> Unfortunately, the performance is not satisfactory due to the many
> shuffles (I have tuned parallelism etc.). I was looking for a way to
> improve performance, to make sure that one task "joins" the same local
> keys without reshuffling baseBlockRdd (which is big) each time the job
> starts (see https://spark-project.atlassian.net/browse/SPARK-1061 as a
> related issue).
> So, bottom line: how do I join a big RDD with a smaller RDD without
> reshuffling the big RDD over and over again?
> As soon as I've saved this big RDD and reloaded it from disk, I want
> every other RDD to be partitioned and collocated by the same partitioner
> (which is absent for a HadoopRDD), so that only the small RDDs are sent
> over the network.
>
> Another idea I had: somehow split baseBlock into two parts by filtering
> on the keys of the small RDDs and then join; however, I'm not sure it's
> possible to implement that filter without a join.
>
> Any ideas would be appreciated.
> Thanks in advance,
> Igor
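A minimal sketch of the materialise-and-sort suggestion above, assuming the daily aggregates are keyed by a composite (id, day) tuple; the partitioner class, the partition count of 200, and the generic value type are placeholders, not anything from the original job:

    import scala.reflect.ClassTag
    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Partition on the id half of the composite key so all days of one
    // key land in the same partition; the implicit tuple ordering then
    // sorts each partition by (id, day) during the shuffle.
    class IdPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case (id: Long, _) => (id.hashCode & Integer.MAX_VALUE) % numPartitions
        case other         => (other.hashCode & Integer.MAX_VALUE) % numPartitions
      }
    }

    // 200 partitions is an arbitrary placeholder; tune for your data.
    def materialise[V: ClassTag](dailyAgg: RDD[((Long, Int), V)]): RDD[((Long, Int), V)] =
      dailyAgg.repartitionAndSortWithinPartitions(new IdPartitioner(200))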
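On the bottom-line question in the quoted post: within a single run, the usual pattern is to partition the big RDD once with an explicit partitioner and cache it, then bring each small RDD to the same partitioner, so only the small side is shuffled on each join. The partitioner is lost when the RDD is reloaded from Avro the next day (that is what SPARK-1061 is about), so the big repartition still happens once per run. A sketch with hypothetical names and partition count:

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Shuffle the big RDD once and keep it cached; the join then reuses
    // its partitioner, so only the small RDD moves over the network.
    def copartitionedJoin[V: ClassTag](base: RDD[(Long, V)],
                                       small: RDD[(Long, V)],
                                       partitions: Int = 200): RDD[(Long, (V, Option[V]))] = {
      val p = new HashPartitioner(partitions)
      val cachedBase = base.partitionBy(p).persist()
      cachedBase.leftOuterJoin(small.partitionBy(p))
    }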