At some level, enough RDDs create hundreds of thousands of tiny partitions of data, each of which creates a task for each stage. The raw overhead of all the message passing can slow things down a lot. I would not design something to use an RDD per key. You would generally key by some value you want to divide and filter on, and then use a *ByKey operation to do your work.
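For example, to get per-key summary statistics like the ones stats() returns, you can fold values into Spark's StatCounter with aggregateByKey. This is just a rough sketch in Scala, not something I've run against your data; the app name, RDD and key/value names are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD implicits on older Spark versions
import org.apache.spark.util.StatCounter

object StatsByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stats-by-key"))

    // (key, value) pairs instead of one RDD per key; toy data for illustration
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 5.0)))

    // Fold each value into a per-key StatCounter locally on each partition,
    // then merge the partial StatCounters across partitions. No grouping of
    // raw values in memory, and the result is the same count/mean/stdev/min/max
    // that stats() gives you, just one per key.
    val statsByKey = pairs.aggregateByKey(new StatCounter())(
      (acc, v) => acc.merge(v),
      (s1, s2) => s1.merge(s2)
    )

    statsByKey.collect().foreach { case (k, s) =>
      println(s"$k: count=${s.count} mean=${s.mean} stdev=${s.stdev}")
    }

    sc.stop()
  }
}

The same idea works from the Java API with JavaPairRDD.aggregateByKey and StatCounter; the point is that the merging happens per key inside a single RDD rather than across thousands of RDDs.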
You don't work with individual RDDs this way, but usually that's good news. You usually have a lot more flexibility operating just in pure Java / Scala to do whatever you need inside your function.

On Wed, Feb 18, 2015 at 2:12 PM, Juan Rodríguez Hortalá <[email protected]> wrote:

> Hi Paweł,
>
> Thanks a lot for your answer. I finally got the program to work by using
> aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
> work. I think that could be interesting for using methods that work on RDDs,
> like for example JavaDoubleRDD.stats()
> (http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaDoubleRDD.html#stats%28%29).
> If the groups are small then I can chain groupBy(), collect(), parallelize()
> and stats(), but that is quite inefficient because it implies moving data to
> and from the driver, and it also doesn't scale to big groups; on the other
> hand, if I use aggregateByKey or a similar function then I cannot use
> stats(), so I have to reimplement it. In general I was looking for a way to
> reuse other functions that I have that work on RDDs, for using them on groups
> of data in an RDD, because I don't see how to directly apply them to each of
> the groups in a pair RDD.
>
> Again, thanks a lot for your answer.
>
> Greetings,
>
> Juan Rodriguez
>
>
> 2015-02-18 14:56 GMT+01:00 Paweł Szulc <[email protected]>:
>>
>> Maybe you can omit grouping with groupByKey altogether? What is your next
>> step after grouping elements by key? Are you trying to reduce values? If so,
>> then I would recommend using a reducing function like, for example,
>> reduceByKey or aggregateByKey. Those will first reduce the values for each
>> key locally on each node before doing actual IO over the network. There will
>> also be no grouping phase, so you will not run into memory issues.
>>
>> Please let me know if that helped.
>>
>> Pawel Szulc
>> @rabbitonweb
>> http://www.rabbitonweb.com
>>
>>
>> On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá
>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I'm writing a Spark program where I want to divide an RDD into different
>>> groups, but the groups are too big to use groupByKey. To cope with that,
>>> since I know in advance the list of keys for each group, I build a map from
>>> the keys to the RDDs that result from filtering the input RDD to get the
>>> records for the corresponding key. This works when I have a small number of
>>> keys, but for a big number of keys (tens of thousands) the execution gets
>>> stuck, without issuing any new Spark stage. I suspect the reason is that the
>>> Spark scheduler is not able to handle so many RDDs. Does that make sense?
>>> I'm rewriting the program to use a single RDD of pairs, with cached
>>> partitions, but I wanted to be sure I understand the problem here.
>>>
>>> Thanks a lot in advance,
>>>
>>> Greetings,
>>>
>>> Juan Rodriguez
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
