Do these requirements boil down to a need for a foldLeftByKey with sorting of the values?
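Something like this, maybe? A rough sketch (untested) of top-N-by-key using
aggregateByKey with a bounded min-heap, so no key's full value list is ever
materialized; the String/Double types and the helper name are just for
illustration:

import scala.collection.mutable
import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Keep at most n values per key on each side of the shuffle, so memory
// per key stays O(n) even when one key covers half the dataset.
def topNByKey(rdd: RDD[(String, Double)], n: Int): RDD[(String, Seq[Double])] =
  rdd.aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
    (heap, v) => { heap += v; if (heap.size > n) heap.dequeue(); heap },  // seqOp: add, evict smallest
    (h1, h2) => { h1 ++= h2; while (h1.size > n) h1.dequeue(); h1 }       // combOp: merge, trim back to n
  ).mapValues(_.toSeq.sorted.reverse)  // largest first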
https://issues.apache.org/jira/browse/SPARK-3655

On Wed, Dec 3, 2014 at 6:34 PM, Xuefeng Wu <ben...@gmail.com> wrote:
> I have a similar requirement: take top N by key. Right now I use
> groupByKey, but one key can group more than half the data in some
> datasets.
>
> Yours, Xuefeng Wu 吴雪峰 敬上
>
> On Dec 4, 2014, at 7:26 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com>
> wrote:
>
> I think it would depend on the type and amount of information you're
> collecting.
>
> If you're just trying to collect small numbers for each window, and don't
> have an overwhelming number of windows, you might consider using
> accumulators. Just make one per value per time window, and for each data
> point, add it to the accumulators for the time windows in which it
> belongs. We've found this approach a lot faster than anything involving a
> shuffle. This should work fine for stuff like max(), min(), and mean().
>
> If you're collecting enough data that accumulators are impractical, I
> think I would try multiple passes: cache your data, and for each pass,
> filter to that window and perform all your operations on the filtered
> RDD. Because of the caching, it won't be significantly slower than
> processing it all at once - in fact, it will probably be a lot faster,
> because the shuffles are shuffling less information. This is similar to
> what you're suggesting about partitioning your RDD, but probably simpler
> and easier.
>
> That being said, your restriction 3 seems to contradict the rest of your
> request: if your aggregation needs to be able to look at all the data at
> once, that seems at odds with viewing the data through an RDD. Could you
> explain a bit more what you mean by that?
>
> -Nathan
>
>
> On Wed, Dec 3, 2014 at 4:26 PM, ameyc <ambr...@gmail.com> wrote:
>
>> Hi,
>>
>> My Spark app needs to run a sliding window through a time-series dataset
>> (I'm not using Spark Streaming), and then run different types of
>> aggregations on a per-window basis. Right now I'm using groupByKey(),
>> which gives me an Iterable for each window. There are a few concerns I
>> have with this approach:
>>
>> 1. groupByKey() could potentially fail for a key whose values don't fit
>> in memory.
>> 2. I'd like to run aggregations like max() and mean() on each of the
>> groups; it'd be nice to have the RDD functionality at this point instead
>> of the Iterables.
>> 3. I can't use reduceByKey() or aggregateByKey(), as some of my
>> aggregations need a view of the entire window.
>>
>> The only other way I could think of is partitioning my RDD into multiple
>> RDDs, with each RDD representing a window. Is this a sensible approach?
>> Or is there any other way of going about this?
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone: +1-416-203-3003 x 238
> Email: nkronenf...@oculusinfo.com
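For what it's worth, a rough sketch of the cache-and-filter approach Nathan
describes above (his suggestion, not his code; the Event type and the window
bounds are made up for illustration):

import org.apache.spark.SparkContext._  // double-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

case class Event(time: Long, value: Double)

// One filtered pass per window over a cached RDD. stats() computes
// count/mean/min/max/stdev in a single action, so each window costs one
// scan of the cached data and only that window's records are aggregated.
def perWindowStats(events: RDD[Event],
                   windows: Seq[(Long, Long)]): Seq[((Long, Long), StatCounter)] = {
  events.cache()  // scanned once per window, so caching pays off immediately
  windows.map { case w @ (start, end) =>
    w -> events.filter(e => e.time >= start && e.time < end).map(_.value).stats()
  }
}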