In my opinion you should use fold pattern. Obviously after an sort by trasformation.
Paolo Inviata dal mio Windows Phone ________________________________ Da: Asim Jalis<mailto:asimja...@gmail.com> Inviato: 06/01/2015 23:11 A: Sean Owen<mailto:so...@cloudera.com> Cc: user@spark.apache.org<mailto:user@spark.apache.org> Oggetto: Re: RDD Moving Average One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally? On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>> wrote: Yes, if you break it down to... tickerRDD.map(ticker => (ticker.timestamp, ticker) ).map { case(ts, ticker) => ((ts / 60000) * 60000, ticker) }.groupByKey ... as Michael alluded to, then it more naturally extends to the sliding window, since you can flatMap one Ticker to many (bucket, ticker) pairs, then group. I think this would implementing 1 minute buckets, sliding by 10 seconds: tickerRDD.flatMap(ticker => (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker)) ).map { case(ts, ticker) => ((ts / 60000) * 60000, ticker) }.groupByKey On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimja...@gmail.com<mailto:asimja...@gmail.com>> wrote: I guess I can use a similar groupBy approach. Map each event to all the windows that it can belong to. Then do a groupBy, etc. I was wondering if there was a more elegant approach. On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimja...@gmail.com<mailto:asimja...@gmail.com>> wrote: Except I want it to be a sliding window. So the same record could be in multiple buckets.