Sorry, I answered too fast. Please disregard my last message: I did mean 
`aggregate`.

You say: "RDD.aggregate() does not support aggregation by key."

What would you need aggregation by key for if you do not, to begin with, have 
an RDD of key-value pairs and do not want to build one? Could you share more 
about the kind of processing you have in mind?
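
For reference, a minimal sketch of what I have in mind with `aggregate` (the 
data and names are illustrative only, assuming an existing SparkContext `sc`):

  // A plain RDD of values -- no (K, V) pairs involved.
  val words = sc.parallelize(Seq("a", "bb", "ccc"))

  // aggregate folds each partition into one accumulator (seqOp), then merges
  // the per-partition accumulators (combOp); the result is a single value for
  // the whole RDD, not one value per key.
  val (totalLength, count) = words.aggregate((0, 0))(
    (acc, w) => (acc._1 + w.length, acc._2 + 1), // seqOp
    (a, b)   => (a._1 + b._1, a._2 + b._2)       // combOp
  )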


—
FG

On Thu, Jan 29, 2015 at 8:01 PM, <francois.garil...@typesafe.com> wrote:

> Oh, I’m sorry, I meant `aggregateByKey`.
> https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
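>
> A minimal sketch of its use, for the archives (the data is illustrative and 
> `sc` is an existing SparkContext):
>
>   import org.apache.spark.SparkContext._ // pair-RDD implicits in 1.2
>
>   val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
>   // aggregateByKey keeps one accumulator per key -- here a (sum, count) pair:
>   val sumCount = pairs.aggregateByKey((0, 0))(
>     (acc, v) => (acc._1 + v, acc._2 + 1),
>     (x, y)   => (x._1 + y._1, x._2 + y._2)
>   )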
> —
> FG
> On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> Francois,
>> RDD.aggregate() does not support aggregation by key. But, indeed, that is 
>> the kind of implementation I am looking for, one that does not allocate 
>> intermediate space for storing (K,V) pairs. When working with large datasets 
>> this type of intermediate memory allocation wreaks havoc with garbage 
>> collection, not to mention unnecessarily increasing the working memory 
>> requirement of the program.
>> I wonder if someone has already noticed this and whether there is an effort 
>> underway to optimize it. If not, I will take a shot at adding this functionality.
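>>
>> To make the idea concrete, a hypothetical signature (not in the current API; 
>> the names are mine) could look something like this on an RDD[T]:
>>
>>   def combineBy[K, C](
>>       keyFn: T => K,               // extracts the key from each element
>>       createCombiner: T => C,      // builds a per-key accumulator
>>       mergeValue: (C, T) => C,     // folds an element into an accumulator
>>       mergeCombiners: (C, C) => C  // merges accumulators across partitions
>>   ): RDD[(K, C)]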
>> Mohit.
>>> On Jan 27, 2015, at 1:52 PM, francois.garil...@typesafe.com wrote:
>>> 
>>> Have you looked at the `aggregate` function in the RDD API? 
>>> 
>>> If your way of extracting the “key” (identifier) and “value” (payload) 
>>> parts of the RDD elements is uniform (a function), however, it’s unclear to 
>>> me how this would be more efficient than first extracting key and value and 
>>> then using `combineByKey`.
>>> 
>>> —
>>> FG
>>> 
>>> 
>>> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>>> 
>>> Hi All, 
>>> I have a use case with an RDD (not of (k, v) pairs) on which I want to do a 
>>> combineByKey() operation. I can do that by creating an intermediate RDD of 
>>> k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it 
>>> will be more efficient if I can avoid this intermediate RDD. Is there a way 
>>> I can do this by passing in a function that extracts the key, like in 
>>> RDD.groupBy()? [oops, RDD.groupBy seems to create the intermediate RDD 
>>> anyway, maybe a better implementation is possible for that too?] 
>>> If not, is it worth adding to the Spark API? 
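>>> 
>>> For concreteness, what I do today looks roughly like this (data and names 
>>> are illustrative only):
>>> 
>>>   import org.apache.spark.SparkContext._ // pair-RDD implicits in 1.2
>>> 
>>>   val lines = sc.parallelize(Seq("a 1", "a 2", "b 3"))
>>>   // The intermediate RDD of (K, V) pairs I would like to avoid:
>>>   val pairs = lines.map(l => (l.split(" ")(0), l))
>>>   val grouped = pairs.combineByKey(
>>>     (l: String) => List(l),
>>>     (acc: List[String], l: String) => l :: acc,
>>>     (a: List[String], b: List[String]) => a ::: b
>>>   )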
>>> 
>>> Mohit. 
