Interesting. I am not sure the order in which fold() encounters elements is guaranteed, although from reading the code I imagine that in practice it is first-to-last within each partition, with the per-partition results then folded first-to-last on the driver. I don't know that this would lead to a solution, though, since the result here needs to be an RDD, not a single value.
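For what it's worth, here is a minimal local sketch of that reading of fold(), assuming nothing beyond core Spark. It only illustrates that the op runs within each partition and that the per-partition results are then combined on the driver; it is not itself a moving-average solution:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("fold-order-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// 4 elements split across 2 partitions
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// fold applies the op first-to-last within each partition, then combines
// the per-partition results on the driver; with a non-commutative op like
// string concatenation the output depends on how elements are partitioned,
// so ordering should not be relied on.
val folded = rdd.fold("")(_ + _)
println(folded) // typically "abcd" here, but not a guarantee in general

sc.stop()
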
On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter <paolo.plat...@agilelab.it> wrote:
> In my opinion you should use a fold pattern, obviously after a sortBy
> transformation.
>
> Paolo
>
> Sent from my Windows Phone
> ------------------------------
> From: Asim Jalis <asimja...@gmail.com>
> Sent: 06/01/2015 23:11
> To: Sean Owen <so...@cloudera.com>
> Cc: user@spark.apache.org
> Subject: Re: RDD Moving Average
>
> One problem with this is that we are creating a lot of iterables
> containing a lot of repeated data. Is there a way to do this so that we
> can calculate a moving average incrementally?
>
> On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, if you break it down to ...
>>
>> tickerRDD.map(ticker =>
>>   (ticker.timestamp, ticker)
>> ).map { case (ts, ticker) =>
>>   ((ts / 60000) * 60000, ticker)
>> }.groupByKey
>>
>> ... as Michael alluded to, then it more naturally extends to the
>> sliding window, since you can flatMap one Ticker to many (bucket, ticker)
>> pairs, then group. I think this would implement 1-minute buckets,
>> sliding by 15 seconds:
>>
>> tickerRDD.flatMap(ticker =>
>>   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
>> ).map { case (ts, ticker) =>
>>   ((ts / 60000) * 60000, ticker)
>> }.groupByKey
>>
>> On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>
>>> I guess I can use a similar groupBy approach. Map each event to all
>>> the windows that it can belong to. Then do a groupBy, etc. I was
>>> wondering if there was a more elegant approach.
>>>
>>> On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>>
>>>> Except I want it to be a sliding window. So the same record could be
>>>> in multiple buckets.
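
To make the flatMap / groupByKey idea quoted above concrete, here is a minimal self-contained sketch. The Ticker(timestamp, price) case class and the sample data are hypothetical, and keying windows on multiples of the 15-second slide (rather than on minute boundaries) is my assumption about what "1-minute windows sliding every 15 seconds" is meant to produce:

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical record shape, only for illustration
case class Ticker(timestamp: Long, price: Double)

val conf = new SparkConf().setAppName("sliding-window-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

val tickerRDD = sc.parallelize(Seq(
  Ticker(1420588800000L, 10.0),
  Ticker(1420588815000L, 11.0),
  Ticker(1420588870000L, 12.0)
))

val windowMs = 60000L // 1-minute window
val slideMs  = 15000L // slide every 15 seconds

// Emit each ticker into every window that contains it (window starts are
// multiples of slideMs), then group by window start and average the prices.
val averages = tickerRDD.flatMap { ticker =>
  val latestStart = (ticker.timestamp / slideMs) * slideMs
  ((latestStart - windowMs + slideMs) to latestStart by slideMs)
    .map(windowStart => (windowStart, ticker))
}.groupByKey()
 .mapValues(tickers => tickers.map(_.price).sum / tickers.size)

averages.sortByKey().collect().foreach(println)

sc.stop()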