Matei,

Thanks for the answer; this clarifies things very much. Based on my usage I would use combineByKey, since the output is another custom data structure.
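To make this concrete, below is a simplified sketch of the pattern that ended up working for me. The MutableStats class, its fields, the toy input data, and the numPartitions value are just illustrative placeholders, not my actual data structures or settings:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD implicits (needed on Spark 1.x)

// Illustrative mutable combiner. It, and every class it contains, must be
// serializable -- the part I was originally missing.
class MutableStats(var count: Long, var sum: Double) extends Serializable {
  // Mutate in place and return `this` instead of allocating a new object,
  // as Matei described for mergeValue/mergeCombiners.
  def add(v: Double): MutableStats = { count += 1; sum += v; this }
  def merge(other: MutableStats): MutableStats = {
    count += other.count
    sum += other.sum
    this
  }
}

object CombineByKeySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("combineByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Toy stand-in for my real (key, value) data.
    val points = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    // The level-of-parallelism knob I had to tune per dataset size.
    val numPartitions = 64

    val stats = points.combineByKey(
      (v: Double) => new MutableStats(1L, v),               // createCombiner
      (c: MutableStats, v: Double) => c.add(v),             // mergeValue
      (c1: MutableStats, c2: MutableStats) => c1.merge(c2), // mergeCombiners
      numPartitions)

    stats.collect().foreach { case (k, s) => println(s"$k -> count=${s.count} sum=${s.sum}") }
    sc.stop()
  }
}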
I found that my issues with combineByKey were resolved after doing more tuning of the level of parallelism; the right setting really depends on the size of my dataset (I ran tests with 1K, 10K, 100K, and 1M data points). For now the GC issue is under control since I changed my data structures to be mutable; the key part I was missing was that every class inside them needs to be serializable.

Thanks!
- Diana

On Wed, Jun 11, 2014 at 6:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> combineByKey is designed for when your return type from the aggregation is
> different from the values being aggregated (e.g. you group together
> objects), and it should allow you to modify the leftmost argument of each
> function (mergeCombiners, mergeValue, etc.) and return that instead of
> allocating a new object. So it should work with mutable objects — please
> post what problems you had with that. reduceByKey actually also allows this
> if your types are the same.
>
> Matei
>
> On Jun 11, 2014, at 3:21 PM, Diana Hu <siyin...@gmail.com> wrote:
>
> Hello all,
>
> I've seen some performance improvements using combineByKey as opposed to
> reduceByKey or a groupByKey+map function. I have a couple of questions; it'd
> be great if anyone could shed some light on this.
>
> 1) When should I use combineByKey vs. reduceByKey?
>
> 2) Do the containers need to be immutable for combineByKey? I've created
> custom data structures for the containers, one mutable and one immutable.
> In the tests with the mutable containers, Spark crashed with an error about
> missing references. However, the downside of the immutable containers (which
> work in my tests) is that for large datasets the garbage collector gets
> invoked many more times, and it tends to run out of heap space because the
> GC can't keep up. I tried some of the tips at
> http://spark.apache.org/docs/latest/tuning.html#memory-tuning and tuned the
> JVM params, but this seems like too much tuning?
>
> Thanks in advance,
> - Diana