I have a similar requirement: take the top N values by key. Right now I use 
groupByKey, but in some datasets a single key would group more than half the data. 
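
One alternative I am looking at is aggregateByKey with a bounded heap, so that no 
key ever has to be fully materialized; a rough sketch (types and names are only for 
illustration):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.PriorityQueue

    // pairs: RDD[(String, Double)] -- hypothetical (key, value) dataset
    def topNPerKey(pairs: RDD[(String, Double)], n: Int): RDD[(String, List[Double])] =
      pairs.aggregateByKey(PriorityQueue.empty[Double](Ordering[Double].reverse))(
        (heap, v) => { heap += v;  while (heap.size > n) heap.dequeue(); heap },  // keep at most n largest
        (h1, h2)  => { h1 ++= h2;  while (h1.size > n) h1.dequeue(); h1 }         // merge per-partition heaps
      ).mapValues(_.dequeueAll.reverse.toList)                                    // largest value first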

Yours respectfully, Xuefeng Wu (吴雪峰)

> On Dec 4, 2014, at 7:26 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> 
> wrote:
> 
> I think it would depend on the type and amount of information you're 
> collecting.
> 
> If you're just trying to collect small numbers for each window, and don't 
> have an overwhelming number of windows, you might consider using 
> accumulators.  Just make one accumulator per value per time window, and for each data 
> point, add it to the accumulators for the time windows in which it belongs.  
> We've found this approach a lot faster than anything involving a shuffle.  
> This should work fine for stuff like max(), min(), and mean().
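
If I follow, a minimal sketch of this accumulator idea might look like the following 
(assuming the old SparkContext accumulator API from Spark 1.x and a small, known set 
of window ids; all names are illustrative):

    import org.apache.spark.SparkContext._

    // data: RDD[(Int, Double)] of (windowId, value); windowIds is small and known on the driver
    val windowIds = Seq(0, 1, 2)
    val sums   = windowIds.map(w => w -> sc.accumulator(0.0)).toMap   // one sum accumulator per window
    val counts = windowIds.map(w => w -> sc.accumulator(0L)).toMap    // one count accumulator per window

    data.foreach { case (windowId, value) =>
      sums(windowId)   += value
      counts(windowId) += 1L
    }

    // read on the driver once the action has run
    val means = windowIds.map(w => w -> sums(w).value / counts(w).value).toMap

There is no shuffle here; the only cost is a single pass over the data.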
> 
> If you're collecting enough data that accumulators are impractical, I think I 
> would try multiple passes.  Cache your data, and for each pass, filter to 
> that window, and perform all your operations on the filtered RDD.  Because of 
> the caching, it won't be significantly slower than processing it all at once 
> - in fact, it will probably be a lot faster, because the shuffles are 
> shuffling less information.  This is similar to what you're suggesting about 
> partitioning your RDD, but probably simpler and easier.
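
To make sure I understand the multi-pass idea, a rough sketch, assuming the data is 
an RDD of (timestamp, value) pairs and the windows are known up front (names are 
illustrative):

    import org.apache.spark.SparkContext._

    // events: RDD[(Long, Double)] of (timestamp, value); windows: Seq[(Long, Long)] of (start, end)
    val cached = events.cache()

    val perWindowStats = windows.map { case (start, end) =>
      val values = cached.filter { case (ts, _) => ts >= start && ts < end }.values
      (start, end) -> values.stats()   // StatCounter: count, mean, stdev, max, min in one pass per window
    }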
> 
> That being said, your restriction 3 seems to be at odds with the rest of your 
> request: if your aggregation needs to be able to look at all the data at once, 
> then that seems contradictory to viewing the data through an RDD.  Could you 
> explain a bit more what you mean by that?
> 
>                 -Nathan
> 
> 
>> On Wed, Dec 3, 2014 at 4:26 PM, ameyc <ambr...@gmail.com> wrote:
>> Hi,
>> 
>> So my Spark app needs to run a sliding window over a time-series dataset
>> (I'm not using Spark Streaming), and then run different types of
>> aggregations on a per-window basis. Right now I'm using groupByKey(), which
>> gives me an Iterable for each window. There are a few concerns I have with
>> this approach:
>> 
>> 1. groupByKey() could potentially fail if the data for a key doesn't fit in memory.
>> 2. I'd like to run aggregations like max() and mean() on each of the groups;
>> it'd be nice to have RDD functionality at this point instead of the Iterables.
>> 3. I can't use reduceByKey() or aggregateByKey(), as some of my aggregations
>> need a view of the entire window.
>> 
>> The only other way I could think of is partitioning my RDD into multiple RDDs,
>> with each RDD representing a window. Is this a sensible approach? Or is
>> there any other way of going about this?
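
For the sliding-window assignment itself (before any grouping), one option is to 
key each record by every window it falls into with a flatMap; a rough sketch, 
assuming aligned windows of fixed length and slide interval (names are illustrative):

    // events: RDD[(Long, Double)] of (timestamp, value); length and slide in the same time unit
    val windowLength  = 60L * 1000
    val slideInterval = 10L * 1000

    val byWindow = events.flatMap { case (ts, value) =>
      val lastStart = (ts / slideInterval) * slideInterval          // latest window start that contains ts
      Iterator.iterate(lastStart)(_ - slideInterval)
        .takeWhile(start => start > ts - windowLength && start >= 0)
        .map(start => (start, (ts, value)))                         // one output pair per window containing ts
    }
    // byWindow can then be filtered per window start, or fed to the per-window passes above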
>> 
>> 
>> 
> 
> 
> 
> -- 
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
