Hello all, I've seen some performance improvements using combineByKey instead of reduceByKey or a groupByKey followed by a map. I have a couple of questions; it would be great if anyone could shed some light on this.
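For context, here is a stripped-down sketch of the kind of thing I'm comparing (a toy example rather than my actual code; ArrayBuffer just stands in for my custom container classes):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CombineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "combine-sketch")
    val pairs: RDD[(String, Double)] =
      sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    // Variant 1: reduceByKey -- values and result share the same type.
    val sums = pairs.reduceByKey(_ + _)

    // Variant 2: combineByKey with a per-key buffer
    // (ArrayBuffer here is a placeholder for my mutable container).
    val grouped = pairs.combineByKey(
      (v: Double) => ArrayBuffer(v),                                     // createCombiner
      (buf: ArrayBuffer[Double], v: Double) => { buf += v; buf },        // mergeValue
      (b1: ArrayBuffer[Double], b2: ArrayBuffer[Double]) => { b1 ++= b2; b1 } // mergeCombiners
    )

    sums.collect().foreach(println)
    grouped.collect().foreach(println)
    sc.stop()
  }
}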
1) When should I use combineByKey rather than reduceByKey?

2) Do the combiner containers need to be immutable for combineByKey? I've created custom data structures for the containers, one mutable and one immutable. In the tests with the mutable containers, Spark crashed with an error about missing references. The immutable containers work in my tests, but the downside is that for large datasets the garbage collector runs far more often, and the job tends to run out of heap space because the GC can't keep up. I tried some of the tips at http://spark.apache.org/docs/latest/tuning.html#memory-tuning and tuned the JVM parameters, but this feels like too much tuning.

Thanks in advance,
- Diana