Thanks Josh.
On Mon, Nov 11, 2013 at 4:50 PM, Josh Rosen <[email protected]> wrote:
> For RDDs of pairs, Spark has operations like reduceByKey() and
> combineByKey(). The Quick Start guide features a word count example that
> uses reduceByKey():
> https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
>
> These operations are defined in a class called PairRDDFunctions
> (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
> SparkContext provides implicit conversions that let you write
> rdd.reduceByKey(...) without having to manually wrap your RDD in
> PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your
> imports.
>
>
> On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <[email protected]>
> wrote:
>>
>> Hi,
>>
>> I'm trying to use Spark to aggregate data. Right now I'm doing something
>> like this:
>>
>> val groupByRdd = rdd.groupBy(x => x._1)
>> val aggregateRdd = groupByRdd.map(x => x._2.sum)
>>
>> This works fine for smaller datasets but runs out of memory for larger
>> ones (the groupBy operation runs out of memory).
>>
>> I know I could use RDD.aggregate() if I wanted to aggregate all the
>> data across all keys, but is there any way to do something similar
>> per key? I'm looking for an operation like RDD.reduceByKey() or
>> RDD.aggregateByKey() in Spark. Is there one already implemented or
>> should I write my own?
>>
>> Thanks,
>> Meisam
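For the archives, here is a minimal sketch of what Josh's suggestion looks like in place of the groupBy() version above. It assumes rdd holds (key, value) pairs with Int values; the variable names are made up for illustration:

import org.apache.spark.SparkContext._  // brings the implicit RDD -> PairRDDFunctions conversion into scope

// Sum the values for each key. Unlike groupBy(), this never materializes
// a whole group in memory; partial sums are combined map-side first.
val sumsByKey = rdd.reduceByKey(_ + _)

// combineByKey() is the general form; this computes the same per-key sums.
val sumsByKey2 = rdd.combineByKey(
  (v: Int) => v,                  // createCombiner: start an accumulator from the first value
  (acc: Int, v: Int) => acc + v,  // mergeValue: fold a value into a partition-local accumulator
  (a: Int, b: Int) => a + b       // mergeCombiners: merge accumulators across partitions
)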
