For RDDs of pairs, Spark has operations like reduceByKey() and
combineByKey(). The Quick Start guide features a word count example that
uses reduceByKey():
https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations

These operations are defined in a class called PairRDDFunctions (
https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
SparkContext provides implicit conversions that let you write
rdd.reduceByKey(...) without having to manually wrap your RDD in a
PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your
imports.
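
As a minimal sketch of what this looks like end to end, assuming a local
SparkContext (the app name and sample data below are illustrative, not from
the original thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversion to PairRDDFunctions

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local mode is enough to try this out.
    val sc = new SparkContext("local", "reduceByKeyExample")

    // Hypothetical (key, value) pairs standing in for your RDD.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Sums the values for each key without materializing a per-key
    // collection, avoiding the memory pressure of groupBy.
    val sums = pairs.reduceByKey(_ + _)

    sums.collect().foreach(println)
    sc.stop()
  }
}
```

Unlike groupBy, reduceByKey combines values on the map side before
shuffling, which is why it scales to keys whose grouped values would not
fit in memory.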



On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <[email protected]> wrote:

> Hi,
>
> I'm trying to use Spark to aggregate data.
>
> I am doing something similar to this right now.
>
>     val groupByRdd = rdd.groupBy(x => x._1)
>     val aggregateRdd = groupByRdd.map(x => x._2.sum)
>
> This works fine for smaller datasets but goes OOM on larger datasets
> (the groupBy operation runs out of memory).
>
> I know I could use RDD.aggregate() if I wanted to aggregate all the
> data for all keys. Is there anyway to use something similar to
> RDD.aggregate()? I'm looking for an operation like RDD.reduceByKey()
> or RDD.aggregateByKey() in Spark. Is there one already implemented or
> should I write my own?
>
> Thanks,
> Meisam
>