Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Sorry, I answered too fast. Please disregard my last message: I did mean aggregate. You say: "RDD.aggregate() does not support aggregation by key." What would you need aggregation by key for, if you do not, at the beginning, have an RDD of key-value pairs, and do not want to build one?
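For reference, a minimal sketch of what `aggregate` on a non-pair RDD looks like; the Record type, its fields, and the sample data are hypothetical (not from the thread), and `sc` is an existing SparkContext:

    import org.apache.spark.rdd.RDD

    // Sketch only: Record and its sample data are made up for illustration.
    case class Record(category: String, amount: Double)
    val records: RDD[Record] = sc.parallelize(Seq(
      Record("a", 1.0), Record("b", 2.0), Record("a", 3.0)))

    // aggregate folds every element into a single result (here a global count
    // and sum) without creating a per-element (K, V) tuple, but it does not
    // group anything by key.
    val (count, sum) = records.aggregate((0L, 0.0))(
      (acc, r) => (acc._1 + 1, acc._2 + r.amount),  // seqOp: fold one record into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))         // combOp: merge two partition accumulators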

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Oh, I’m sorry, I meant `aggregateByKey`. https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions — FG On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi wrote: > Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for.
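For the archives, a minimal sketch of `aggregateByKey`; note it still requires an RDD of key-value pairs, and the sample data here is made up:

    import org.apache.spark.SparkContext._   // PairRDDFunctions implicits in Spark 1.2
    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Double)] =
      sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    // Per-key (count, sum): the accumulator type (Long, Double) differs from the
    // value type Double, so no per-key collection of values is built up.
    val stats: RDD[(String, (Long, Double))] = pairs.aggregateByKey((0L, 0.0))(
      (acc, v) => (acc._1 + 1, acc._2 + v),   // merge one value into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))   // merge two accumulators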

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for, one that does not allocate intermediate space for storing (K,V) pairs. When working with large datasets, this type of intermediate memory allocation wreaks havoc with garbage collection.
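One possible workaround, not suggested in the thread, is to fold each partition into a mutable map before any pair RDD exists, so only one (key, accumulator) tuple per distinct key per partition is materialized rather than one Tuple2 per input element. A sketch, reusing the hypothetical Record type from above:

    import scala.collection.mutable
    import org.apache.spark.SparkContext._   // PairRDDFunctions implicits in Spark 1.2
    import org.apache.spark.rdd.RDD

    case class Record(category: String, amount: Double)

    def sumByKey(records: RDD[Record]): RDD[(String, Double)] =
      records.mapPartitions { it =>
        // Accumulate per-partition sums in a mutable map: no (k, v) tuple is
        // allocated per input element, only one per distinct key per partition.
        val acc = mutable.HashMap.empty[String, Double]
        it.foreach(r => acc(r.category) = acc.getOrElse(r.category, 0.0) + r.amount)
        acc.iterator
      }.reduceByKey(_ + _)

(This is roughly the map-side combine that combineByKey performs internally, just done before any per-element tuple is created.)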

Re: RDD.combineBy

2015-01-27 Thread francois . garillot
Have you looked at the `aggregate` function in the RDD API? If your way of extracting the “key” (identifier) and “value” (payload) parts of the RDD elements is uniform (a function), it’s unclear to me how this would be more efficient than extracting the key and value and then using combineByKey, however.

RDD.combineBy

2015-01-27 Thread Mohit Jaggi
Hi All, I have a use case with an RDD (not of (k,v) pairs) on which I want to do a combineByKey()-style operation. I can do that by creating an intermediate RDD of (k,v) pairs and using PairRDDFunctions.combineByKey(). However, I believe it will be more efficient if I can avoid this intermediate RDD. Is there a way to do this?
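For context, a sketch of the intermediate-pair approach described above; the Record type, its fields, and the sample data are hypothetical, and `sc` is an existing SparkContext:

    import org.apache.spark.SparkContext._   // PairRDDFunctions implicits in Spark 1.2
    import org.apache.spark.rdd.RDD

    case class Record(category: String, amount: Double)
    val records: RDD[Record] = sc.parallelize(Seq(
      Record("a", 1.0), Record("b", 2.0), Record("a", 3.0)))

    // map() materializes one Tuple2 per element (the intermediate (k, v)
    // allocation in question) before combineByKey can run.
    val grouped: RDD[(String, List[Double])] =
      records
        .map(r => (r.category, r.amount))
        .combineByKey(
          (v: Double) => List(v),                             // createCombiner
          (c: List[Double], v: Double) => v :: c,             // mergeValue
          (c1: List[Double], c2: List[Double]) => c1 ::: c2)  // mergeCombiners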