This is more concise:

x.groupBy(_.fieldtobekey).values.map(_.head)

... but I doubt it's faster.

If all objects with the same fieldtobekey are within the same
partition, then yes I imagine your biggest speedup comes from
exploiting that. How about ...

x.mapPartitions(_.map(obj => (obj.fieldtobekey, obj)).toMap.values.iterator)

This does require that all distinct keys in a partition, plus one
representative object each, fit in memory.
I bet you can make it faster than this example too.
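For instance, here is a minimal sketch of that per-partition idea which only keeps the *keys* in memory, streaming the objects through instead of materializing a whole Map of representatives. The `Record` case class and the `distinctByKeyInPartition` helper name are my own invention standing in for MyAwesomeObject; plain Scala iterators stand in for a Spark partition here:

```scala
import scala.collection.mutable

// Hypothetical record type standing in for MyAwesomeObject.
case class Record(fieldtobekey: String, payload: Int)

// Keep only the first object seen for each key, streaming through the
// iterator; only the set of keys is held in memory, not the objects.
def distinctByKeyInPartition(it: Iterator[Record]): Iterator[Record] = {
  val seen = mutable.HashSet.empty[String]
  it.filter(r => seen.add(r.fieldtobekey)) // add returns false on duplicates
}

val partition = Iterator(Record("a", 1), Record("b", 2), Record("a", 3))
val deduped = distinctByKeyInPartition(partition).toList
// keeps Record("a", 1) and Record("b", 2); the duplicate "a" is dropped
```

On the RDD this would just be x.mapPartitions(distinctByKeyInPartition), with no shuffle at all, assuming equal keys really are co-partitioned.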


On Sat, Sep 13, 2014 at 1:15 PM, Gary Malouf <malouf.g...@gmail.com> wrote:
> You need something like:
>
> val x: RDD[MyAwesomeObject]
>
> x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }
>
> Does that make sense?
>
>
> On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme <julien.ca...@gmail.com>
> wrote:
>>
>> I need to remove objects with duplicate keys, but I need the whole object.
>> Objects that have the same key are not necessarily equal, though (but I can
>> drop any of the ones that share a key).
>>
>> 2014-09-13 12:50 GMT+02:00 Sean Owen <so...@cloudera.com>:
>>>
>>> If you are just looking for distinct keys, .keys.distinct() should be
>>> much better.
>>>
>>> On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme <julien.ca...@gmail.com>
>>> wrote:
>>> > Hello,
>>> >
>>> > I am facing performance issues with reduceByKey. I know that this
>>> > topic has
>>> > already been covered, but I did not really find answers to my question.
>>> >
>>> > I am using reduceByKey to remove entries with identical keys, using
>>> > (a, b) => a as the reduce function. It seems to be a relatively
>>> > straightforward use
>>> > of reduceByKey, but performance on moderately big RDDs (some tens of
>>> > millions of lines) is very low, far from what you can reach with
>>> > single-machine
>>> > computing packages like R, for example.
>>> >
>>> > I have read in other threads on the topic that reduceByKey always
>>> > shuffles
>>> > the whole data. Is that true? That would mean that custom
>>> > partitioning could not help, right? In my case, I could relatively
>>> > easily
>>> > guarantee that two identical keys would always be on the same partition,
>>> > so an option could be to use mapPartitions and reimplement the
>>> > reduce
>>> > locally, but I would like to know if there are simpler / more elegant
>>> > alternatives.
>>> >
>>> > Thanks for your help,
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
