combineByKey is designed for cases where the return type of the aggregation is 
different from the type of the values being aggregated (e.g. when you group 
objects together into a collection). It is meant to let you modify the leftmost 
argument of each function (mergeValue, mergeCombiners, etc.) and return that 
instead of allocating a new object, so it should work with mutable objects; 
please post what problems you had with that. reduceByKey actually also allows 
this if your input and output types are the same.
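
For example, grouping Int values into a mutable ArrayBuffer might look roughly 
like this (just a sketch, untested; the key and value types are placeholders):

import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Each function mutates its leftmost argument and returns it, so no fresh
// container is allocated for every record.
def groupValues(pairs: RDD[(String, Int)]): RDD[(String, ArrayBuffer[Int])] =
  pairs.combineByKey(
    (v: Int) => ArrayBuffer(v),                   // createCombiner
    (buf: ArrayBuffer[Int], v: Int) => buf += v,  // mergeValue: mutate buf in place
    (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2  // mergeCombiners
  )

With reduceByKey the same trick works when both sides already have the same 
type, e.g. buffers.reduceByKey((a, b) => a ++= b) mutates and returns the left 
buffer.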

Matei

On Jun 11, 2014, at 3:21 PM, Diana Hu <siyin...@gmail.com> wrote:

> Hello all,
> 
> I've seen some performance improvements using combineByKey as opposed to 
> reduceByKey or a groupByKey+map function. I have a couple of questions; it'd 
> be great if anyone could shed some light on this.
> 
> 1) When should I use combineByKey vs reduceByKey?
> 
> 2) Do the containers need to be immutable for combineByKey? I've created 
> custom data structures for the containers, one mutable and one immutable. In 
> the tests with the mutable containers, Spark crashed with an error about 
> missing references. The downside of the immutable containers (which work in 
> my tests) is that for large datasets the garbage collector gets invoked many 
> more times, and the job tends to run out of heap space because the GC can't 
> keep up. I tried some of the tips at 
> http://spark.apache.org/docs/latest/tuning.html#memory-tuning and tuned the 
> JVM params, but this seems like too much tuning?
> 
> Thanks in advance,
> - Diana
