RE: aggregateByKey vs combineByKey

2016-01-05 Thread LINChen
Hi Marco, In your case, since you don't need to perform an aggregation (such as a sum or average) over each key, using groupByKey may perform better. groupByKey internally uses CompactBuffer, which is more memory-efficient than ArrayBuffer for small buffers. Thanks. LIN Chen Date: Tue, 5 Jan 2016 21:13:40 + S

Re: aggregateByKey vs combineByKey

2016-01-05 Thread Ted Yu
Looking at PairRDDFunctions.scala:

  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    ...
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp, part
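
The delegation in the snippet above can be modelled without Spark. What follows is a minimal plain-Scala sketch, not Spark's implementation: `Partitions`, `combineByKeyModel` and `aggregateByKeyModel` are hypothetical names, a "partition" is just a `Seq`, and a single immutable `zeroValue` is reused where the quoted source calls `createZero()` for each new key.

```scala
// Plain-Scala model of aggregateByKey expressed through combineByKey.
// All names here are hypothetical; nothing in this block is Spark API.
object AggregateViaCombine {
  // A dataset is modelled as a sequence of partitions of (key, value) pairs.
  type Partitions[K, V] = Seq[Seq[(K, V)]]

  def combineByKeyModel[K, V, C](data: Partitions[K, V])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within each partition: the first value of a key creates a combiner,
    // later values of that key are merged into it.
    val perPartition: Seq[Map[K, C]] = data.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v)
          case None           => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }
    // Across partitions: combiners for the same key are merged pairwise.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Same shape as the quoted call: createCombiner is seqOp applied to the zero,
  // mergeValue is seqOp, and mergeCombiners is combOp.
  // Assumption: zeroValue is immutable, so sharing it across keys is safe.
  def aggregateByKeyModel[K, V, U](data: Partitions[K, V])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] =
    combineByKeyModel[K, V, U](data)(v => seqOp(zeroValue, v), seqOp, combOp)
}
```

For example, calling `aggregateByKeyModel` with a zero of `(0, 0)`, a `seqOp` that adds the value and increments a count, and a `combOp` that adds both fields computes a per-key (sum, count) pair, from which an average follows.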

Re: aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Thanks Liquan, that was really helpful. On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei wrote: > Hi Dave, > > You can replace groupByKey with reduceByKey to improve performance in some > cases. reduceByKey performs a map-side combine, which can reduce network I/O > and shuffle size, whereas groupByKey w

Re: aggregateByKey vs combineByKey

2014-09-29 Thread Liquan Pei
Hi Dave, You can replace groupByKey with reduceByKey to improve performance in some cases. reduceByKey performs a map-side combine, which can reduce network I/O and shuffle size, whereas groupByKey will not perform a map-side combine. combineByKey is more general than aggregateByKey. Actually, the impl
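
To make the map-side combine point concrete, here is a small plain-Scala sketch that counts how many records would cross the shuffle with and without a local reduce. It is an illustrative model under simplifying assumptions, not Spark code: `MapSideCombineModel` and its methods are hypothetical names, and a "partition" is just a `Seq`.

```scala
// Plain-Scala model of shuffle volume with and without a map-side combine.
// All names here are hypothetical; nothing in this block is Spark API.
object MapSideCombineModel {
  type Partitions[K, V] = Seq[Seq[(K, V)]]

  // groupByKey-style: every (key, value) record is shuffled as-is.
  def shuffledWithoutCombine[K, V](data: Partitions[K, V]): Int =
    data.map(_.size).sum

  // reduceByKey-style: each partition first reduces its own values per key,
  // so it sends at most one record per distinct key.
  def shuffledWithCombine[K, V](data: Partitions[K, V]): Int =
    data.map(part => part.map(_._1).distinct.size).sum

  // Local reduce per partition, then a final reduce across partitions.
  def reduceByKeyModel[K, V](data: Partitions[K, V])(f: (V, V) => V): Map[K, V] = {
    val local: Seq[Map[K, V]] = data.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
    }
    local.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  }
}
```

The saving grows with the number of repeated keys per partition. When nearly every key is unique within its partition, both counts are about the same and the local reduce buys little.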