At some point, enough RDDs means hundreds of thousands of tiny
partitions of data, each of which creates a task for every stage. The
raw overhead of all the message passing can slow things down a lot. I
would not design something to use an RDD per key. You would generally
key by some value you want to divide and filter on, and then use a
*ByKey operation to do your work.

You don't work with individual RDDs this way, but that's usually good
news: you have a lot more flexibility operating in plain Java / Scala
to do whatever you need inside your function.
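For example, something like this rough Scala sketch (untested; the input
path, field layout and key/value extraction are just placeholders for
whatever your records actually look like) replaces the RDD-per-key design
with a pair RDD plus aggregateByKey, and reuses Spark's StatCounter so you
get the same numbers stats() reports, but per key:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // *ByKey implicits on Spark 1.2 and earlier
import org.apache.spark.util.StatCounter

object PerKeyStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-key-stats"))

    // Placeholder input: derive a grouping key and a numeric value from
    // each record instead of filtering the input into one RDD per key.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // One StatCounter per key, merged locally on each executor before the
    // shuffle -- no grouping phase, and no per-key RDDs for the scheduler.
    val statsByKey = pairs.aggregateByKey(new StatCounter())(
      (acc, v) => acc.merge(v),        // fold a value into the partition-local accumulator
      (acc1, acc2) => acc1.merge(acc2) // combine accumulators across partitions
    )

    // The per-key summaries are tiny, so collecting them is cheap.
    statsByKey.collect().foreach { case (key, s) =>
      println(s"$key: count=${s.count} mean=${s.mean} stdev=${s.stdev}")
    }

    sc.stop()
  }
}

The only data that ever reaches the driver is one small StatCounter per
key; everything else stays distributed.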

On Wed, Feb 18, 2015 at 2:12 PM, Juan Rodríguez Hortalá
<[email protected]> wrote:
> Hi Paweł,
>
> Thanks a lot for your answer. I finally got the program to work by using
> aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
> work. That would be interesting for reusing methods that work on RDDs, like
> for example JavaDoubleRDD.stats() (
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaDoubleRDD.html#stats%28%29).
> If the groups are small I can chain groupBy(), collect(), parallelize()
> and stats(), but that is quite inefficient because it implies moving data to
> and from the driver, and it also doesn't scale to big groups; on the other
> hand, if I use aggregateByKey or a similar function then I cannot use
> stats(), so I have to reimplement it. In general I was looking for a way to
> reuse other functions that I have that work on RDDs on groups of data inside
> an RDD, because I don't see how to directly apply them to each of the groups
> in a pair RDD.
>
> Again, thanks a lot for your answer,
>
> Greetings,
>
> Juan Rodriguez
>
>
>
>
> 2015-02-18 14:56 GMT+01:00 Paweł Szulc <[email protected]>:
>>
>> Maybe you can omit grouping with groupByKey altogether? What is your next
>> step after grouping elements by key? Are you trying to reduce the values?
>> If so, then I would recommend using a reducing function such as reduceByKey
>> or aggregateByKey. Those will first reduce the values for each key locally
>> on each node before doing actual IO over the network. There will also be no
>> grouping phase, so you will not run into memory issues.
>>
>> Please let me know if that helped
>>
>> Pawel Szulc
>> @rabbitonweb
>> http://www.rabbitonweb.com
>>
>>
>> On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá
>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I'm writing a Spark program where I want to divide an RDD into different
>>> groups, but the groups are too big to use groupByKey. To cope with that,
>>> since I know in advance the list of keys for each group, I build a map from
>>> the keys to the RDDs that result from filtering the input RDD to get the
>>> records for the corresponding key. This works when I have a small number of
>>> keys, but for a big number of keys (tens of thousands) the execution gets
>>> stuck without issuing any new Spark stage. I suspect the reason is that the
>>> Spark scheduler is not able to handle so many RDDs. Does that make sense?
>>> I'm rewriting the program to use a single RDD of pairs, with cached
>>> partitions, but I wanted to be sure I understand the problem here.
>>>
>>> Thanks a lot in advance,
>>>
>>> Greetings,
>>>
>>> Juan Rodriguez
>>
>>
>

