Hi, I'm writing a Spark program where I want to split an RDD into groups, but the groups are too large to use groupByKey. To cope with that, since I know the list of keys for each group in advance, I build a map from each key to the RDD that results from filtering the input RDD down to the records for that key. This works when I have a small number of keys, but for a large number of keys (tens of thousands) the execution gets stuck without issuing any new Spark stage. I suspect the reason is that the Spark scheduler cannot handle that many RDDs, since the driver has to track the lineage of every filtered RDD and schedule a separate job per key. Does that make sense? I'm rewriting the program to use a single RDD of pairs with cached partitions, but I wanted to be sure I understand the problem here.
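Here is a minimal sketch of what I'm doing now (names like `input` and `keys` are placeholders for my actual data):

    import org.apache.spark.rdd.RDD

    // Current approach: one filtered RDD per key. With tens of thousands of
    // keys this creates tens of thousands of RDD lineages for the driver to
    // track, each one a separate job when an action is eventually run.
    def splitByKey[K, V](input: RDD[(K, V)], keys: Seq[K]): Map[K, RDD[(K, V)]] =
      keys.map(k => k -> input.filter { case (key, _) => key == k }).toMap

And this is roughly the rewrite I have in mind: a single pair RDD, hash-partitioned by key and cached once, so downstream per-key work runs over co-located partitions instead of thousands of separate RDDs (`numPartitions` is a placeholder to tune):

    import org.apache.spark.HashPartitioner

    // Rewrite: one RDD, partitioned by key and persisted; records for the
    // same key end up in the same partition.
    val byKey = input
      .partitionBy(new HashPartitioner(numPartitions))
      .persist()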
Thanks a lot in advance.

Regards,
Juan Rodriguez
