I have a similar requirement: take the top N values by key. Right now I use groupByKey, but in some datasets one key accounts for more than half the data.
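For concreteness, a rough sketch of the current approach; the pair type RDD[(String, Double)] and the helper name are hypothetical, not from the real job:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x

// Rough sketch of "top N by key" via groupByKey, assuming (hypothetically)
// key/value pairs of type RDD[(String, Double)] and a small n.
def topNByKey(pairs: RDD[(String, Double)], n: Int): RDD[(String, Seq[Double])] =
  pairs
    .groupByKey()                                         // shuffles ALL values for a key to one task
    .mapValues(vs => vs.toSeq.sortBy(v => -v).take(n))    // prune to the n largest only afterwards
```

The pain point is that groupByKey shuffles every value for the hot key to a single task before take(n) can discard anything, so the skewed key dominates both shuffle size and memory.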
Yours respectfully, Xuefeng Wu (吴雪峰)

> On December 4, 2014, at 7:26 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
>
> I think it would depend on the type and amount of information you're
> collecting.
>
> If you're just trying to collect small numbers for each window, and don't
> have an overwhelming number of windows, you might consider using
> accumulators. Just make one per value per time window, and for each data
> point, add it to the accumulators for the time windows in which it belongs.
> We've found this approach a lot faster than anything involving a shuffle.
> This should work fine for things like max(), min(), and mean().
>
> If you're collecting enough data that accumulators are impractical, I would
> try multiple passes. Cache your data, and for each pass, filter to that
> window and perform all your operations on the filtered RDD. Because of the
> caching, it won't be significantly slower than processing it all at once -
> in fact, it will probably be a lot faster, because the shuffles are
> shuffling less information. This is similar to what you're suggesting about
> partitioning your RDD, but probably simpler and easier.
>
> That being said, your restriction 3 seems to contradict the rest of your
> request - if your aggregation needs to be able to look at all the data at
> once, that seems contradictory to viewing the data through an RDD. Could
> you explain a bit more what you mean by that?
>
> -Nathan
>
>> On Wed, Dec 3, 2014 at 4:26 PM, ameyc <ambr...@gmail.com> wrote:
>> Hi,
>>
>> My Spark app needs to run a sliding window over a time-series dataset
>> (I'm not using Spark Streaming) and then run different types of
>> aggregations on a per-window basis. Right now I'm using groupByKey(),
>> which gives me an Iterable for each window. There are a few concerns I
>> have with this approach:
>>
>> 1. groupByKey() could potentially fail for a key whose values don't fit
>> in memory.
>> 2. I'd like to run aggregations like max() and mean() on each of the
>> groups; it would be nice to have RDD functionality at that point instead
>> of Iterables.
>> 3. I can't use reduceByKey() or aggregateByKey(), as some of my
>> aggregations need a view of the entire window.
>>
>> The only other way I could think of is partitioning my RDD into multiple
>> RDDs, with each RDD representing a window. Is this a sensible approach,
>> or is there another way of going about this?
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone: +1-416-203-3003 x 238
> Email: nkronenf...@oculusinfo.com
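To make the multi-pass suggestion above concrete, a minimal sketch; the (timestamp, value) layout, the half-open window ranges, and the function name are assumptions, not part of the original thread:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair/double RDD implicits on Spark 1.x

// Minimal sketch of the cached, one-pass-per-window approach described above.
// `events` as RDD[(Long, Double)] of (timestamp, value) and the window ranges
// are hypothetical stand-ins for the real dataset.
def perWindowStats(events: RDD[(Long, Double)],
                   windows: Seq[(Long, Long)]): Seq[(Long, (Double, Double))] = {
  events.cache()  // each filtering pass reuses the same in-memory data
  windows.map { case (start, end) =>
    // Every window is an ordinary RDD, so max()/mean() and any aggregation
    // that needs the whole window are available without groupByKey.
    val values = events
      .filter { case (ts, _) => ts >= start && ts < end }
      .values
    (start, (values.max(), values.mean()))  // assumes each window is non-empty
  }
}
```

Each window here is just a filtered view of the cached RDD, so the shuffles (if any) only move one window's worth of data at a time, which is the point of the suggestion.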