Thanks Josh.
On Mon, Nov 11, 2013 at 4:50 PM, Josh Rosen <[email protected]> wrote:
> For RDDs of pairs, Spark has operations like reduceByKey() and
> combineByKey(). The Quick Start guide features a word count example that
> uses reduceByKey():
> https://spark.incubator.apache.org/docs/latest/quick-start.html#more-on-rdd-operations
>
> These operations are defined in a class called PairRDDFunctions
> (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
> SparkContext provides implicit conversions that let you write
> rdd.reduceByKey(...) without having to manually wrap your RDD in
> PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your
> imports.
>
>
> On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi <[email protected]>
> wrote:
>>
>> Hi,
>>
>> I'm trying to use Spark to aggregate data. Right now I'm doing something
>> like this:
>>
>> val groupByRdd = rdd.groupBy(x => x._1)
>> val aggregateRdd = groupByRdd.map(x => x._2.sum)
>>
>> This works fine for smaller datasets but runs out of memory for larger
>> ones (the groupBy operation runs out of memory).
>>
>> I know I could use RDD.aggregate() if I wanted to aggregate all the
>> data across all keys, but is there any way to do something similar
>> per key? I'm looking for an operation like RDD.reduceByKey() or
>> RDD.aggregateByKey() in Spark. Is there one already implemented or
>> should I write my own?
>>
>> Thanks,
>> Meisam
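For the archives, here is a minimal sketch of what Josh's suggestion looks like in place of the groupBy() version above. It assumes rdd holds (key, value) pairs with Int values; the variable names are made up for illustration:

import org.apache.spark.SparkContext._  // brings the implicit RDD -> PairRDDFunctions conversion into scope

// Sum the values for each key. Unlike groupBy(), this never materializes
// a whole group in memory; partial sums are combined map-side first.
val sumsByKey = rdd.reduceByKey(_ + _)

// combineByKey() is the general form; this computes the same per-key sums.
val sumsByKey2 = rdd.combineByKey(
  (v: Int) => v,                  // createCombiner: start an accumulator from the first value
  (acc: Int, v: Int) => acc + v,  // mergeValue: fold a value into a partition-local accumulator
  (a: Int, b: Int) => a + b       // mergeCombiners: merge accumulators across partitions
)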
