For RDDs of pairs, Spark has operations like reduceByKey() and combineByKey(). The Quick Start guide features a word count example that uses reduceByKey(): https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
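For example, summing values per key with reduceByKey() looks roughly like this (a minimal sketch assuming a local SparkContext and made-up sample data):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

  val sc = new SparkContext("local", "SumByKey")

  // Small pair RDD of (key, value); stands in for your real data.
  val rdd = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

  // reduceByKey merges values per key on the map side before shuffling,
  // so the full list of values for a key is never held in memory at once.
  val sums = rdd.reduceByKey(_ + _)

  sums.collect().foreach(println)  // prints (a,4) and (b,2)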
These operations are defined in a class called PairRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions). SparkContext provides implicit conversions that let you write rdd.reduceByKey(...) without having to manually wrap your RDD in a PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your imports.

On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <[email protected]> wrote:

> Hi,
>
> I'm trying to use Spark to aggregate data.
>
> I am doing something similar to this right now:
>
> val groupByRdd = rdd.groupBy(x => x._1)
> val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))
>
> This works fine for smaller datasets but runs OOM for larger datasets
> (the groupBy operation runs out of memory).
>
> I know I could use RDD.aggregate() if I wanted to aggregate all the
> data for all keys. Is there any way to use something similar to
> RDD.aggregate() per key? I'm looking for an operation like
> RDD.reduceByKey() or RDD.aggregateByKey() in Spark. Is there one
> already implemented or should I write my own?
>
> Thanks,
> Meisam
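If the aggregated result has a different type than the values, which is what an aggregateByKey-style operation would give you, combineByKey() is the general form. Here is a minimal sketch computing per-key averages, reusing the assumed setup from the snippet above:

  // The combiner carries (sum, count), so the result type differs from
  // the value type, much like the zero value in RDD.aggregate().
  val sumCounts = rdd.combineByKey(
    (v: Int) => (v, 1),                                            // createCombiner
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners

  val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }
  averages.collect().foreach(println)  // prints (a,2.0) and (b,2.0)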
