Thanks for coming back to the list with the resolution!

On Fri, Feb 27, 2015 at 3:16 PM Himanish Kushary <himan...@gmail.com>
wrote:

> Hi,
>
> I was able to solve the issue. Putting down the settings that worked for
> me.
>
> 1) It was happening due to the large number of partitions. I *coalesce*'d
> the RDD as early as possible in my code into far fewer partitions (used
> .coalesce(10000) to bring the count down from 500K to 10K)
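
For the archives, a minimal spark-shell sketch of this step (the 10K target
is from your mail; the input path and pair structure below are just
stand-ins):

    // stand-in pair RDD -- the real one had 500K+ partitions (~18 GB)
    val pairs: org.apache.spark.rdd.RDD[(String, Int)] =
      sc.textFile("hdfs:///path/to/input")   // hypothetical path
        .map(line => (line, 1))

    // shrink the partition count as early as possible; coalesce(10000)
    // avoids the full shuffle that repartition(10000) would trigger
    val compacted = pairs.coalesce(10000)

    // distinct() now runs over 10K partitions instead of 500K+
    val uniques = compacted.distinct()
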
>
> 2) Increased the values of *spark.akka.frameSize* (= 500),
> *spark.akka.timeout*, *spark.akka.askTimeout* and
> *spark.core.connection.ack.wait.timeout* to get rid of the insufficient
> frame size and timeout errors
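
And a sketch of the corresponding configuration, assuming it is applied
before the SparkContext starts (for spark-shell that means launch flags or
spark-defaults.conf, since the shell builds its own context). Only
frameSize = 500 is from your mail; the timeout values here are placeholders:

    // standalone-driver equivalent of the settings above
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("distinct-over-many-partitions")           // assumed name
      .set("spark.akka.frameSize", "500")                    // MB, from the mail
      .set("spark.akka.timeout", "300")                      // seconds, placeholder
      .set("spark.akka.askTimeout", "300")                   // seconds, placeholder
      .set("spark.core.connection.ack.wait.timeout", "300")  // seconds, placeholder
    val sc = new SparkContext(conf)
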
>
> Thanks
> Himanish
>
> On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary <himan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is
>> loaded into memory; its size is around 18 GB.
>>
>> Whenever I run distinct() on the RDD, CPU usage on the driver host
>> (spark-shell in yarn-client mode) rockets up (400+%) and the distinct()
>> seems to stall. The Spark driver UI also hangs.
>>
>> In Ganglia the only node with high load is the driver host. I have tried
>> repartitioning the data into a smaller number of partitions (using
>> coalesce or repartition) with no luck.
>>
>> I have attached the jstack output, which shows a few threads in BLOCKED
>> state. Not sure what exactly is going on here.
>>
>> The driver program was started with 15 GB of memory on AWS EMR. I'd
>> appreciate any thoughts on the issue.
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>
