Hi,

I'm trying to use Spark to aggregate data.

I am currently doing something like this:

    val groupByRdd = rdd.groupBy(x => x._1)
    val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))

This works fine for smaller datasets but goes OOM on larger ones
(the groupBy operation runs out of memory).

I know I could use RDD.aggregate() if I wanted to aggregate all the
data across all keys. Is there any way to do something similar to
RDD.aggregate(), but per key? I'm looking for an operation like
RDD.reduceByKey() or RDD.aggregateByKey() in Spark. Is there one
already implemented, or should I write my own?
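For reference, here is the per-key sum I'm after, sketched with plain
Scala collections (the sample data is made up for illustration; the
point is the shape of the result, not this implementation):

    // Plain-Scala sketch of the desired per-key aggregation.
    // A reduceByKey-style operator would produce the same result
    // without first materializing every group in memory.
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    val perKeySum = pairs
      .groupBy(_._1)                                   // group pairs by key
      .map { case (k, vs) => k -> vs.map(_._2).sum }   // sum the values
    // perKeySum == Map("a" -> 4, "b" -> 2)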

Thanks,
Meisam
