Oh, I’m sorry, I meant `aggregateByKey`.
https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions

— FG

On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> Francois,
> RDD.aggregate() does not support aggregation by key. But, indeed, that is the
> kind of implementation I am looking for: one that does not allocate
> intermediate space for storing (K,V) pairs. When working with large datasets,
> this kind of intermediate memory allocation wreaks havoc with garbage
> collection, not to mention unnecessarily increasing the working memory
> requirement of the program.
> I wonder if someone has already noticed this and there is an effort underway
> to optimize it. If not, I will take a shot at adding this functionality.
> Mohit.
>> On Jan 27, 2015, at 1:52 PM, francois.garil...@typesafe.com wrote:
>>
>> Have you looked at the `aggregate` function in the RDD API?
>>
>> If your way of extracting the "key" (identifier) and "value" (payload) parts
>> of the RDD elements is uniform (a function), it's unclear to me how this
>> would be more efficient than extracting key and value and then using
>> combine, however.
>>
>> —
>> FG
>>
>>
>> On Tue, Jan 27, 2015 at 10:17 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>>
>> Hi All,
>> I have a use case where I have an RDD (not a k,v pair) on which I want to do a
>> combineByKey() operation. I can do that by creating an intermediate RDD of
>> k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it
>> would be more efficient if I could avoid this intermediate RDD. Is there a way
>> I can do this by passing in a function that extracts the key, as in
>> RDD.groupBy()? [oops, RDD.groupBy seems to create the intermediate RDD
>> anyway; maybe a better implementation is possible for that too?]
>> If not, is it worth adding to the Spark API?
>>
>> Mohit.
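
For readers following the thread, here is a minimal sketch of what the `aggregateByKey` suggestion looks like in practice. The `Event` record, the field names, and the per-user byte sum are made up for illustration; the point is that the RDD is keyed first (here with `keyBy`, which does allocate the intermediate (K,V) pairs Mohit wants to avoid) and then folded per key without building per-key collections.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  // Hypothetical element type standing in for a non-(K,V) RDD.
  case class Event(user: String, bytes: Long)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("aggregateByKey-sketch").setMaster("local[*]"))

    val events = sc.parallelize(Seq(
      Event("alice", 10L), Event("bob", 5L), Event("alice", 7L)))

    // keyBy materialises the (K, V) pairs discussed in the thread; aggregateByKey
    // then folds values into a per-key accumulator instead of collecting them.
    val bytesPerUser = events
      .keyBy(_.user)                       // RDD[(String, Event)]
      .aggregateByKey(0L)(                 // zero value for each key's accumulator
        (acc, e) => acc + e.bytes,         // merge a value into the accumulator (within a partition)
        (a, b) => a + b)                   // merge accumulators across partitions

    bytesPerUser.collect().foreach(println)  // e.g. (alice,17), (bob,5)
    sc.stop()
  }
}
```

As the thread notes, there is no combineByKey/aggregateByKey variant that takes a key-extractor function directly on a plain RDD, so keying via `keyBy` (or `map`) first is the usual pattern; whether the intermediate pair allocation can be avoided entirely is exactly the question Mohit raises.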