Hi Paweł,

Thanks a lot for your answer. I finally got the program to work by using aggregateByKey, but I was still wondering why creating thousands of RDDs doesn't work. Being able to do that would be interesting for reusing methods that operate on whole RDDs, like for example JavaDoubleRDD.stats() ( http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaDoubleRDD.html#stats%28%29 ). If the groups are small I can chain groupBy(), collect(), parallelize() and stats(), but that is quite inefficient because it moves the data to the driver and back, and it also doesn't scale to big groups; on the other hand, if I use aggregateByKey or a similar function then I cannot call stats(), so I have to reimplement it (roughly as in the sketch below). In general I was looking for a way to reuse functions I already have that work on RDDs, applying them to groups of data inside an RDD, because I don't see how to apply them directly to each of the groups in a pair RDD.
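For reference, the aggregateByKey version I ended up with looks roughly like the following. Instead of calling stats() I fold the values into Spark's StatCounter by hand, which at least avoids rewriting the statistics themselves. This is only a minimal sketch: the class and variable names are made up, and it assumes the input is already keyed as (group key, double value) pairs.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.util.StatCounter;

public class GroupStats {
  // assuming the input is already keyed as (group key, value) pairs
  public static JavaPairRDD<String, StatCounter> statsByKey(JavaPairRDD<String, Double> pairs) {
    return pairs.aggregateByKey(
        new StatCounter(),                 // zero value: an empty accumulator per key
        (acc, value) -> acc.merge(value),  // fold one value into the accumulator, locally per partition
        (acc1, acc2) -> acc1.merge(acc2)); // combine accumulators coming from different partitions
  }
}

From there something like statsByKey(pairs).mapValues(StatCounter::mean), or collectAsMap() when the number of groups is small, gives the per-group summaries without moving the raw data to the driver.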
Again, thanks a lot for your answer.

Greetings,

Juan Rodriguez

2015-02-18 14:56 GMT+01:00 Paweł Szulc <[email protected]>:

> Maybe you can omit grouping with groupByKey altogether? What is your next
> step after grouping the elements by key? Are you trying to reduce the
> values? If so, I would recommend using a reducing function such as
> reduceByKey or aggregateByKey. Those first reduce the values for each key
> locally on each node before doing the actual IO over the network. There is
> also no grouping phase, so you will not run into memory issues.
>
> Please let me know if that helped.
>
> Pawel Szulc
> @rabbitonweb
> http://www.rabbitonweb.com
>
>
> On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá <
> [email protected]> wrote:
>
>> Hi,
>>
>> I'm writing a Spark program where I want to divide an RDD into different
>> groups, but the groups are too big to use groupByKey. To cope with that,
>> since I know in advance the list of keys for each group, I build a map
>> from the keys to the RDDs that result from filtering the input RDD to get
>> the records for the corresponding key. This works when I have a small
>> number of keys, but for a big number of keys (tens of thousands) the
>> execution gets stuck without issuing any new Spark stage. I suspect the
>> reason is that the Spark scheduler is not able to handle so many RDDs.
>> Does that make sense? I'm rewriting the program to use a single RDD of
>> pairs with cached partitions, but I wanted to be sure I understand the
>> problem here.
>>
>> Thanks a lot in advance,
>>
>> Greetings,
>>
>> Juan Rodriguez
>>
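PS: for anyone reading this later, the "one filtered RDD per key" approach from my first message quoted above looked roughly like this (only a sketch, with simplified names), which is why there ends up being one Spark job per key:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;

public class RddPerKey {
  // builds one filtered RDD per key; fine for a handful of keys, but with
  // tens of thousands of keys the driver/scheduler overhead dominates
  public static Map<String, JavaDoubleRDD> rddPerKey(JavaPairRDD<String, Double> pairs,
                                                     List<String> keys) {
    Map<String, JavaDoubleRDD> result = new HashMap<>();
    for (String key : keys) {
      result.put(key, pairs
          .filter(t -> t._1().equals(key))  // keep only this group's records
          .mapToDouble(t -> t._2()));       // JavaDoubleRDD, so stats() is available
    }
    // calling result.get(key).stats() later runs a separate Spark job per key
    return result;
  }
}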
