Hello all,

I've seen some performance improvements using combineByKey as opposed to
reduceByKey or a groupByKey+map function. I have a couple of questions; it'd
be great if anyone could shed some light on this.
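
For reference, here is a rough sketch of the three variants I'm comparing.
It's simplified: the real job aggregates values into a custom container,
and List[Int] here is just a stand-in to show the shapes of the APIs.
It assumes a SparkContext named sc, as in the spark-shell.

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey + map: shuffles every value, aggregation happens after the shuffle
val grouped = pairs.groupByKey().mapValues(_.toList)

// reduceByKey: map-side combine, but the merge function must take and
// return the same type as the values
val reduced = pairs.reduceByKey(_ + _)

// combineByKey: map-side combine with a separate accumulator (container) type
val combined = pairs.combineByKey(
  (v: Int) => List(v),                      // createCombiner
  (acc: List[Int], v: Int) => v :: acc,     // mergeValue
  (a: List[Int], b: List[Int]) => a ::: b   // mergeCombiners
)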

1) When should I use combineByKey vs reduceByKey?

2) Do the containers need to be immutable for combineByKey? I've created
custom data structures for the containers, one mutable and one immutable.
In the tests with the mutable container, Spark crashed with an error about
missing references. The downside of the immutable container (which does
work in my tests) is that for large datasets the garbage collector runs
much more often, and the job tends to run out of heap space because the GC
can't keep up. I tried some of the tips at
http://spark.apache.org/docs/latest/tuning.html#memory-tuning and tuned
the JVM params, but this feels like an excessive amount of tuning for the
problem.
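
In case it helps, here's a stripped-down version of the two container
approaches, using the same pairs RDD as in the snippet above. The real
containers are custom classes; ArrayBuffer and Vector are just stand-ins
for the mutable and immutable cases.

import scala.collection.mutable.ArrayBuffer

// Mutable container (roughly the shape of the test that crashed):
// the accumulator is mutated in place and returned from each merge.
val mutableAgg = pairs.combineByKey(
  (v: Int) => ArrayBuffer(v),
  (buf: ArrayBuffer[Int], v: Int) => { buf += v; buf },
  (a: ArrayBuffer[Int], b: ArrayBuffer[Int]) => { a ++= b; a }
)

// Immutable container (the version that works for me): every merge
// allocates a new collection, which is where I suspect the extra GC
// pressure comes from.
val immutableAgg = pairs.combineByKey(
  (v: Int) => Vector(v),
  (vec: Vector[Int], v: Int) => vec :+ v,
  (a: Vector[Int], b: Vector[Int]) => a ++ b
)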

Thanks in advance,
- Diana
