Sorry, I answered too fast. Please disregard my last message: I did mean aggregate.
You say: "RDD.aggregate() does not support aggregation by key."
What would you need aggregation by key for if you do not start with an RDD of key-value pairs and do not want to build one? Could you share more about the kind of processing you have in mind?

--
FG

On Thu, Jan 29, 2015 at 8:01 PM, null <francois.garil...@typesafe.com> wrote:

> Oh, I’m sorry, I meant `aggregateByKey`.
> https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
> --
> FG
>
> On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> Francois,
>> RDD.aggregate() does not support aggregation by key. But, indeed, that is
>> the kind of implementation I am looking for, one that does not allocate
>> intermediate space for storing (K,V) pairs. When working with large datasets
>> this type of intermediate memory allocation wreaks havoc with garbage
>> collection, not to mention unnecessarily increases the working memory
>> requirement of the program.
>> I wonder if someone has already noticed this and there is an effort underway
>> to optimize this. If not, I will take a shot at adding this functionality.
>> Mohit.
>>> On Jan 27, 2015, at 1:52 PM, francois.garil...@typesafe.com wrote:
>>>
>>> Have you looked at the `aggregate` function in the RDD API?
>>>
>>> If your way of extracting the “key” (identifier) and “value” (payload)
>>> parts of the RDD elements is uniform (a function), it’s unclear to me how
>>> this would be more efficient than extracting key and value and then using
>>> combine, however.
>>>
>>> --
>>> FG
>>>
>>>
>>> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>>>
>>> Hi All,
>>> I have a use case where I have an RDD (not a k,v pair) where I want to do a
>>> combineByKey() operation. I can do that by creating an intermediate RDD of
>>> k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it
>>> will be more efficient if I can avoid this intermediate RDD. Is there a way
>>> I can do this by passing in a function that extracts the key, like in
>>> RDD.groupBy()? [oops, RDD.groupBy seems to create the intermediate RDD
>>> anyway, maybe a better implementation is possible for that too?]
>>> If not, is it worth adding to the Spark API?
>>>
>>> Mohit.
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
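
To make the `aggregate` suggestion in this thread concrete, here is a minimal sketch of per-key combining that never materializes (k, v) pairs. It is plain Python, not Spark: the local `aggregate` helper is only a stand-in that mimics the shape of `RDD.aggregate(zeroValue)(seqOp, combOp)`, and the record format, `key_of`, and `value_of` are hypothetical names chosen for illustration.

```python
from functools import reduce


def aggregate(partitions, zero_factory, seq_op, comb_op):
    """Local stand-in mimicking the shape of RDD.aggregate (not Spark itself).

    Each partition is folded with seq_op from a fresh zero value, then the
    per-partition results are merged with comb_op.
    """
    per_partition = [reduce(seq_op, part, zero_factory()) for part in partitions]
    return reduce(comb_op, per_partition, zero_factory())


# Hypothetical input: two "partitions" of raw records, not (k, v) pairs.
partitions = [["apple:3", "pear:1", "apple:4"], ["pear:5", "fig:2"]]


def key_of(record):
    # Assumed key extractor: the text before the colon.
    return record.split(":")[0]


def value_of(record):
    # Assumed value extractor: the integer after the colon.
    return int(record.split(":")[1])


def seq_op(acc, record):
    # Fold one record into the partition-local map; no pair object is built.
    key = key_of(record)
    acc[key] = acc.get(key, 0) + value_of(record)
    return acc


def comb_op(left, right):
    # Merge two partition-local maps by summing values for shared keys.
    for key, value in right.items():
        left[key] = left.get(key, 0) + value
    return left


totals = aggregate(partitions, dict, seq_op, comb_op)
print(totals)
```

One caveat about this design: `aggregate` produces a single value rather than a distributed dataset, so in Spark the merged map would end up on the driver. The approach therefore only seems reasonable when the number of distinct keys is small. For a large key space, a shuffle-based operation such as `combineByKey` or `aggregateByKey` on a pair RDD is probably still needed.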