Hi,

I'm trying to use Spark to aggregate data.

I am currently doing something like this:

    val groupByRdd = rdd.groupBy(x => x._1)
    val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))

This works fine for smaller datasets but goes OOM on larger ones
(the groupBy operation runs out of memory).

I know I could use RDD.aggregate() if I wanted to aggregate all the
data across all keys. Is there any way to do something similar to
RDD.aggregate(), but per key? I'm looking for an operation like
RDD.reduceByKey() or RDD.aggregateByKey() in Spark. Is there one
already implemented, or should I write my own?
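For reference, here is the per-key sum I'm after, sketched with plain
Scala collections (the sample data is made up for illustration; the
point is the shape of the result, not this implementation):

    // Plain-Scala sketch of the desired per-key aggregation.
    // A reduceByKey-style operator would produce the same result
    // without first materializing every group in memory.
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    val perKeySum = pairs
      .groupBy(_._1)                                   // group pairs by key
      .map { case (k, vs) => k -> vs.map(_._2).sum }   // sum the values
    // perKeySum == Map("a" -> 4, "b" -> 2)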

Thanks,
Meisam
