Hi,

I'm writing a Spark program where I want to divide an RDD into different
groups, but the groups are too big to use groupByKey. To cope with that,
since I know in advance the list of keys for each group, I build a map
from the keys to the RDDs that result from filtering the input RDD to get
the records for the corresponding key. This works for a small number of
keys, but for a large number of keys (tens of thousands) the execution
gets stuck without issuing any new Spark stage. I suspect the reason is
that the Spark scheduler cannot handle that many RDDs. Does that make
sense? I'm rewriting the program to use a single RDD of pairs with cached
partitions, but I wanted to be sure I understand the problem here.
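To make the two versions concrete, here is a stripped-down sketch of what
I mean (the key/value types, names, and partition count are just
illustrative, not my actual program):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // What I'm doing now: one filtered RDD per key. With tens of
    // thousands of keys this creates tens of thousands of RDD
    // lineages for the driver/scheduler to track.
    def groupsByFilter(input: RDD[(String, Int)],
                       keys: Seq[String]): Map[String, RDD[(String, Int)]] =
      keys.map(k => k -> input.filter { case (key, _) => key == k }).toMap

    // The rewrite I'm attempting: keep a single pair RDD, partition
    // it by key and cache the partitioned result, so all groups live
    // in one RDD instead of one RDD per key.
    def groupsByPartition(input: RDD[(String, Int)],
                          numPartitions: Int): RDD[(String, Int)] =
      input.partitionBy(new HashPartitioner(numPartitions))
           .persist(StorageLevel.MEMORY_AND_DISK)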

Thanks a lot in advance,

Regards,

Juan Rodriguez
