Hi Marco,
In your case, since you don't need to perform an aggregation (such as
a sum or average) over each key, using groupByKey may perform better.
groupByKey internally uses CompactBuffer, which is more memory-efficient
than ArrayBuffer for small buffers.
Thanks,
LIN Chen
Date: Tue, 5 Jan 2016 21:13:40 +
S
Looking at PairRDDFunctions.scala:
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    ...
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, part
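To make the relationship in the excerpt above concrete, here is a minimal sketch in plain Scala collections (no Spark involved). It models combineByKey over a list of in-memory "partitions" and then expresses aggregateByKey through it, in the same shape as the excerpt: the combiner for the first value of a key is built by applying seqOp to the zero value. The object name, the Seq-of-Seq partition model, and the Map return type are assumptions made for illustration; they are not Spark's API.

```scala
// Plain-Scala model, not Spark. All names here are hypothetical.
object AggregateViaCombineSketch {

  // Model of combineByKey: each inner Seq stands for one partition.
  def combineByKey[K, V, C](parts: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Per-partition pass: the first value seen for a key creates a combiner,
    // later values for that key are merged into it.
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
    }
    // Cross-partition pass: combiners that share a key are merged.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // aggregateByKey written on top of combineByKey. The zero value is assumed
  // to be immutable here, so it is reused directly instead of being copied
  // for each key.
  def aggregateByKey[K, V, U](parts: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] =
    combineByKey[K, V, U](parts)(v => seqOp(zeroValue, v), seqOp, combOp)

  // Example input: two partitions of (key, value) pairs.
  val parts: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5)
  )

  // Per-key (sum, count), the usual first step towards an average.
  val sumAndCount: Map[String, (Int, Int)] =
    aggregateByKey(parts)((0, 0))(
      (u, v) => (u._1 + v, u._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )
}
```

In this model aggregateByKey adds nothing beyond fixing createCombiner to "seqOp applied to the zero value", which is one way to read the claim that combineByKey is the more general of the two.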
Thanks Liquan, that was really helpful.
On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei wrote:
> Hi Dave,
>
> You can replace groupByKey with reduceByKey to improve performance in some
> cases. reduceByKey performs a map-side combine, which can reduce network I/O
> and shuffle size, whereas groupByKey w
Hi Dave,
You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which can reduce network I/O
and shuffle size, whereas groupByKey does not perform a map-side combine.
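As a rough illustration of the map-side combine point above, the following sketch uses plain Scala collections as a toy model of a shuffle (it is not Spark code). Each inner Seq stands for one map-side partition, and the "shuffle size" is simply the number of records that would leave the partitions. The object and method names are hypothetical.

```scala
// Toy model of a shuffle in plain Scala, not Spark. Names are hypothetical.
object MapSideCombineSketch {
  type Partition = Seq[(String, Int)]

  // groupByKey-style: every record crosses the shuffle unchanged.
  def shuffledWithoutCombine(parts: Seq[Partition]): Seq[(String, Int)] =
    parts.flatten

  // reduceByKey-style: each partition first merges its own values per key,
  // so at most one record per key per partition crosses the shuffle.
  def shuffledWithCombine(parts: Seq[Partition])(f: (Int, Int) => Int): Seq[(String, Int)] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq
    }

  // Reduce side: merge whatever arrived, per key.
  def reduceSide(records: Seq[(String, Int)])(f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Example input: two partitions with three records each.
  val parts: Seq[Partition] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5, "b" -> 6)
  )
}
```

With this input, six records are shuffled without the combine and four with it, while the per-key sums computed on the reduce side are the same either way. The saving grows with the number of repeated keys inside each partition, and disappears when every key appears only once per partition.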
combineByKey is more general than aggregateByKey. Actually, the
impl