Maybe you can avoid grouping with groupByKey altogether? What is your
next step after grouping elements by key? Are you trying to reduce the
values? If so, I would recommend using a reducing function such as
reduceByKey or aggregateByKey. Those first reduce the values for each key
locally on each node before doing the actual IO over the network. There is
also no grouping phase, so you will not run into memory issues.
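For example, a minimal sketch -- it assumes an existing SparkContext sc
and simple Int values that can be summed, so substitute your own types
and combine functions:

  // sample (key, value) pairs; in your case this is the keyed input RDD
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // reduceByKey combines values map-side first, so only one partial
  // result per key per partition crosses the network:
  val sums = pairs.reduceByKey(_ + _)

  // aggregateByKey does the same when the result type differs from the
  // value type, e.g. (sum, count) pairs for computing averages:
  val sumCounts = pairs.aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value in locally
    (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge partials across partitions
  )

With groupByKey every single value is shuffled and each group has to fit
in memory on one executor; with the reducing variants only the
per-partition partial results do.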

Please let me know if that helped.

Pawel Szulc
@rabbitonweb
http://www.rabbitonweb.com


On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá <
[email protected]> wrote:

> Hi,
>
> I'm writing a Spark program where I want to divide an RDD into different
> groups, but the groups are too big to use groupByKey. To cope with that,
> since I know in advance the list of keys for each group, I build a map from
> the keys to the RDDs that result from filtering the input RDD to get the
> records for the corresponding key. This works when I have a small number of
> keys, but for a large number of keys (tens of thousands) the execution gets
> stuck without issuing any new Spark stage. I suspect the reason is that the
> Spark scheduler is not able to handle so many RDDs. Does that make sense?
> I'm rewriting the program to use a single RDD of pairs, with cached
> partitions, but I wanted to be sure I understand the problem here.
>
> Thanks a lot in advance,
>
> Greetings,
>
> Juan Rodriguez
>
