Do these requirements boil down to a need for a foldLeftByKey with sorting of the values?
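Something like this, maybe? A rough sketch (untested) of top-N-by-key using
aggregateByKey with a bounded min-heap, so no key's full value list is ever
materialized; the String/Double types and the helper name are just for
illustration:

import scala.collection.mutable
import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Keep at most n values per key on each side of the shuffle, so memory
// per key stays O(n) even when one key covers half the dataset.
def topNByKey(rdd: RDD[(String, Double)], n: Int): RDD[(String, Seq[Double])] =
  rdd.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
    (heap, v) => { heap += v; if (heap.size > n) heap.dequeue(); heap },  // seqOp: add, evict smallest
    (h1, h2) => { h1 ++= h2; while (h1.size > n) h1.dequeue(); h1 }       // combOp: merge, trim back to n
  ).mapValues(_.toSeq.sorted.reverse)  // largest first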
https://issues.apache.org/jira/browse/SPARK-3655

On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu <ben...@gmail.com> wrote:
> I have a similar requirement: take top N by key. Right now I use
> groupByKey, but one key can group more than half the data in some
> datasets.
>
> Yours, Xuefeng Wu 吴雪峰 敬上
>
> On Dec 4, 2014, at 7:26 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com>
> wrote:
>
> I think it would depend on the type and amount of information you're
> collecting.
>
> If you're just trying to collect small numbers for each window, and don't
> have an overwhelming number of windows, you might consider using
> accumulators. Just make one per value per time window, and for each data
> point, add it to the accumulators for the time windows in which it
> belongs. We've found this approach a lot faster than anything involving a
> shuffle. This should work fine for stuff like max(), min(), and mean().
>
> If you're collecting enough data that accumulators are impractical, I
> think I would try multiple passes: cache your data, and for each pass,
> filter to that window and perform all your operations on the filtered
> RDD. Because of the caching, it won't be significantly slower than
> processing it all at once - in fact, it will probably be a lot faster,
> because the shuffles are shuffling less information. This is similar to
> what you're suggesting about partitioning your RDD, but probably simpler
> and easier.
>
> That being said, your restriction 3 seems to contradict the rest of your
> request: if your aggregation needs to be able to look at all the data at
> once, that seems at odds with viewing the data through an RDD. Could you
> explain a bit more what you mean by that?
>
> -Nathan
>
>
> On Wed, Dec 3, 2014 at 4:26 PM, ameyc <ambr...@gmail.com> wrote:
>
>> Hi,
>>
>> My Spark app needs to run a sliding window through a time-series dataset
>> (I'm not using Spark Streaming), and then run different types of
>> aggregations on a per-window basis. Right now I'm using groupByKey(),
>> which gives me an Iterable for each window. There are a few concerns I
>> have with this approach:
>>
>> 1. groupByKey() could potentially fail for a key whose values don't fit
>> in memory.
>> 2. I'd like to run aggregations like max() and mean() on each of the
>> groups; it'd be nice to have the RDD functionality at this point instead
>> of the Iterables.
>> 3. I can't use reduceByKey() or aggregateByKey(), as some of my
>> aggregations need a view of the entire window.
>>
>> The only other way I could think of is partitioning my RDD into multiple
>> RDDs, with each RDD representing a window. Is this a sensible approach?
>> Or is there any other way of going about this?
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone: +1-416-203-3003 x 238
> Email: nkronenf...@oculusinfo.com
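For what it's worth, a rough sketch of the cache-and-filter approach Nathan
describes above (his suggestion, not his code; the Event type and the window
bounds are made up for illustration):

import org.apache.spark.SparkContext._  // double-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

case class Event(time: Long, value: Double)

// One filtered pass per window over a cached RDD. stats() computes
// count/mean/min/max/stdev in a single action, so each window costs one
// scan of the cached data and only that window's records are aggregated.
def perWindowStats(events: RDD[Event],
                   windows: Seq[(Long, Long)]): Seq[((Long, Long), StatCounter)] = {
  events.cache()  // scanned once per window, so caching pays off immediately
  windows.map { case w @ (start, end) =>
    w -> events.filter(e => e.time >= start && e.time < end).map(_.value).stats()
  }
}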