Hi Paweł,

Thanks a lot for your answer. I finally got the program to work by using
aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
work. I think that could be useful for reusing methods that operate on RDDs,
like for example JavaDoubleRDD.stats() (
http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaDoubleRDD.html#stats%28%29).
If the groups are small then I can chain groupBy(), collect(),
parallelize() and stats(), but that is quite inefficient because it implies
moving data to and from the driver, and it also doesn't scale to big groups;
on the other hand, if I use aggregateByKey or a similar function then I
cannot use stats(), so I have to reimplement it. In general, I was looking
for a way to reuse functions I already have that work on RDDs, applying
them to groups of data within an RDD, because I don't see how to directly
apply them to each of the groups in a pair RDD.
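
For the record, one thing that seems to avoid reimplementing stats() is
using org.apache.spark.util.StatCounter (the same class that stats()
returns) as the accumulator for aggregateByKey. A minimal sketch, assuming
a JavaPairRDD<String, Double> called pairs (the variable names are just for
illustration):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.util.StatCounter;

    // One StatCounter per key: values are merged locally on each node
    // before the shuffle, so there is no grouping phase.
    JavaPairRDD<String, StatCounter> statsByKey = pairs.aggregateByKey(
        new StatCounter(),                  // zero value, copied per partition
        (acc, value) -> acc.merge(value),   // fold one value into the accumulator
        (acc1, acc2) -> acc1.merge(acc2));  // merge partial accumulators

Each resulting StatCounter carries count, mean, stdev, max and min for its
key, which are the same statistics JavaDoubleRDD.stats() computes for a
whole RDD, so nothing has to be reimplemented.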

Again, thanks a lot for your answer,

Greetings,

Juan Rodriguez




2015-02-18 14:56 GMT+01:00 Paweł Szulc <[email protected]>:

> Maybe you can avoid grouping altogether with groupByKey? What is
> your next step after grouping elements by key? Are you trying to reduce
> the values? If so, then I would recommend using a reducing function such as
> reduceByKey or aggregateByKey. Those will first reduce the values for each
> key locally on each node before doing the actual IO over the network.
> There will also be no grouping phase, so you will not run into memory issues.
>
> Please let me know if that helped
>
> Pawel Szulc
> @rabbitonweb
> http://www.rabbitonweb.com
>
>
> On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá <
> [email protected]> wrote:
>
>> Hi,
>>
>> I'm writing a Spark program where I want to divide an RDD into different
>> groups, but the groups are too big to use groupByKey. To cope with that,
>> since I know in advance the list of keys for each group, I build a map from
>> the keys to the RDDs that result from filtering the input RDD to get the
>> records for the corresponding key. This works when I have a small number of
>> keys, but for a big number of keys (tens of thousands) the execution gets
>> stuck without issuing any new Spark stage. I suspect the reason is that
>> the Spark scheduler is not able to handle so many RDDs. Does that make sense?
>> I'm rewriting the program to use a single RDD of pairs, with cached
>> partitions, but I wanted to be sure I understand the problem here.
>>
>> Thanks a lot in advance,
>>
>> Greetings,
>>
>> Juan Rodriguez
>>
>
>
