Thanks Liquan, that was really helpful.

On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei <liquan...@gmail.com> wrote:

> Hi Dave,
>
> You can replace groupByKey with reduceByKey to improve performance in some
> cases. reduceByKey performs a map-side combine, which can reduce network IO
> and shuffle size, whereas groupByKey does not perform a map-side combine.
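>
> For example, a minimal sketch (pairs is a hypothetical RDD[(String, Int)];
> both lines compute per-key sums, but reduceByKey pre-aggregates on each
> mapper before the shuffle):
>
>     // assuming an existing SparkContext named sc
>     // (in older Spark, also: import org.apache.spark.SparkContext._)
>     val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
>     // shuffles every value across the network, then sums per key:
>     val sums1 = pairs.groupByKey().mapValues(_.sum)
>     // pre-sums within each partition, shuffling only partial sums:
>     val sums2 = pairs.reduceByKey(_ + _)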
>
> combineByKey is more general than aggregateByKey. In fact, aggregateByKey,
> reduceByKey and groupByKey are all implemented on top of combineByKey.
> aggregateByKey is similar to reduceByKey, but you can provide an initial
> (zero) value for the aggregation.
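>
> As a rough sketch of that relationship (not the exact Spark source), for
> the hypothetical pairs RDD above:
>
>     // pairs.reduceByKey(_ + _) is roughly equivalent to:
>     val sums3 = pairs.combineByKey(
>       (v: Int) => v,                  // createCombiner: first value for a key
>       (acc: Int, v: Int) => acc + v,  // mergeValue: within a partition
>       (a: Int, b: Int) => a + b)      // mergeCombiners: across partitions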
>
> As the name suggests, aggregateByKey is suitable for computing aggregations
> per key, such as sum, avg, etc. The rule of thumb is that the extra
> computation spent on the map-side combine should reduce the amount of data
> sent to other nodes and the driver. If your function satisfies this rule,
> you should probably use aggregateByKey.
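>
> A minimal aggregateByKey sketch, again using the hypothetical pairs RDD.
> The initial value (0, 0) is a (sum, count) accumulator, which also shows
> that the result type can differ from the value type:
>
>     val sumCounts = pairs.aggregateByKey((0, 0))(
>       (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: within a partition
>       (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: across partitions
>     val avgs = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }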
>
> combineByKey is more general still, and gives you the flexibility to
> specify whether you'd like to perform a map-side combine. However, it is
> more complex to use: at a minimum, you need to implement three functions:
> createCombiner, mergeValue, and mergeCombiners.
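>
> The same per-key average, written as a sketch with combineByKey to show
> the three functions (a longer overload also takes a partitioner and a
> mapSideCombine flag, which is where that flexibility comes from):
>
>     val avgs2 = pairs.combineByKey(
>       (v: Int) => (v, 1),                                      // createCombiner: first value seen for a key
>       (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),   // mergeValue: fold another value in
>       (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge partitions
>     ).mapValues { case (sum, count) => sum.toDouble / count }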
>
> Hope this helps!
> Liquan
>
> On Sun, Sep 28, 2014 at 11:59 PM, David Rowe <davidr...@gmail.com> wrote:
>
>> Hi All,
>>
>> After some hair pulling, I've reached the realisation that an operation I
>> am currently doing via:
>>
>> myRDD.groupByKey.mapValues(func)
>>
>> should be done more efficiently using aggregateByKey or combineByKey.
>> Both of these methods would do the job, and they seem very similar to me
>> in terms of their function.
>>
>> My question is, what are the differences between these two methods (other
>> than the slight differences in their type signatures)? Under what
>> circumstances should I use one or the other?
>>
>> Thanks
>>
>> Dave
>>
>>
>>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
>
