Matei,

Thanks for the answer; this clarifies things very much. Based on my usage I would use combineByKey, since the output is another custom data structure.
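To make this concrete, below is a simplified sketch of the pattern that ended up working for me. The MutableStats class, its fields, the toy input data, and the numPartitions value are just illustrative placeholders, not my actual data structures or settings:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD implicits (needed on Spark 1.x)

// Illustrative mutable combiner. It, and every class it contains, must be
// serializable -- the part I was originally missing.
class MutableStats(var count: Long, var sum: Double) extends Serializable {
  // Mutate in place and return `this` instead of allocating a new object,
  // as Matei described for mergeValue/mergeCombiners.
  def add(v: Double): MutableStats = { count += 1; sum += v; this }
  def merge(other: MutableStats): MutableStats = {
    count += other.count
    sum += other.sum
    this
  }
}

object CombineByKeySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("combineByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Toy stand-in for my real (key, value) data.
    val points = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    // The level-of-parallelism knob I had to tune per dataset size.
    val numPartitions = 64

    val stats = points.combineByKey(
      (v: Double) => new MutableStats(1L, v),               // createCombiner
      (c: MutableStats, v: Double) => c.add(v),             // mergeValue
      (c1: MutableStats, c2: MutableStats) => c1.merge(c2), // mergeCombiners
      numPartitions)

    stats.collect().foreach { case (k, s) => println(s"$k -> count=${s.count} sum=${s.sum}") }
    sc.stop()
  }
}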
I found that my issues with combineByKey were resolved after doing more tuning of the level of parallelism; the right setting really depends on the size of my dataset (I ran tests with 1K, 10K, 100K, and 1M data points). For now the GC issue is under control since I changed my data structures to be mutable; the key part I was missing was that every class inside them needs to be serializable.

Thanks!
- Diana

On Wed, Jun 11, 2014 at 6:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> combineByKey is designed for when your return type from the aggregation is
> different from the values being aggregated (e.g. you group together
> objects), and it should allow you to modify the leftmost argument of each
> function (mergeCombiners, mergeValue, etc.) and return that instead of
> allocating a new object. So it should work with mutable objects — please
> post what problems you had with that. reduceByKey actually also allows this
> if your types are the same.
>
> Matei
>
> On Jun 11, 2014, at 3:21 PM, Diana Hu <siyin...@gmail.com> wrote:
>
> Hello all,
>
> I've seen some performance improvements using combineByKey as opposed to
> reduceByKey or a groupByKey+map function. I have a couple of questions; it'd
> be great if anyone could shed some light on this.
>
> 1) When should I use combineByKey vs. reduceByKey?
>
> 2) Do the containers need to be immutable for combineByKey? I've created
> custom data structures for the containers, one mutable and one immutable.
> In the tests with the mutable containers, Spark crashed with an error about
> missing references. However, the downside of the immutable containers (which
> work in my tests) is that for large datasets the garbage collector gets
> invoked many more times, and it tends to run out of heap space because the
> GC can't keep up. I tried some of the tips at
> http://spark.apache.org/docs/latest/tuning.html#memory-tuning and tuned the
> JVM params, but this seems like too much tuning?
>
> Thanks in advance,
> - Diana