Hi Srinath,

Thanks for such a detailed reply. How do I reduce the overall number of
tasks?

I found that simply repartitioning the CSV file into 8 parts and converting
it to Parquet with Snappy compression not only distributed the tasks evenly
across all the nodes, but also brought the end-to-end job time down to
roughly 0.8x of the prior run.
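
For reference, this is roughly what that conversion looks like (a minimal
sketch; the paths and the partition count of 8 are just the values I used
here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

    # read the original CSV (path is illustrative)
    df = spark.read.csv("/appdata/input/data.csv", header=True, inferSchema=True)

    # repartition into 8 parts so the tasks spread evenly over the workers,
    # then write out as Parquet with Snappy compression
    df.repartition(8) \
      .write.mode("overwrite") \
      .option("compression", "snappy") \
      .parquet("/appdata/input/data_parquet")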

Query - regarding your suggestion to check whether there are too many
partitions in the RDD and tune them using spark.sql.shuffle.partitions: how
do I do this? I have a huge pipeline of memory- and CPU-intensive
operations, which involves a very large number of Spark transformations. At
which level should I apply the setting (see the sketch below for what I
have in mind)? The total number of tasks for an average dataset comes to
around 2 million (approx.) - is that a bad sign? How can I control it?
Would I need to refactor my entire pipeline (series of scripts)?
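
Is the idea simply to set it once on the SparkSession (or via --conf at
submit time)? A minimal sketch of what I mean, where the value 24 is purely
illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("pipeline") \
        .config("spark.sql.shuffle.partitions", "24") \
        .getOrCreate()

    # or changed at runtime, before the shuffle-heavy stages:
    spark.conf.set("spark.sql.shuffle.partitions", "24")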

Below is the new executor view while the updated run is in progress -




Thanks,
Aakash.

On Tue, Jun 12, 2018 at 2:14 PM, Srinath C <srinat...@gmail.com> wrote:

> Hi Aakash,
>
> Can you check the logs for Executor ID 0? It was restarted on worker
> 192.168.49.39 perhaps due to OOM or something.
>
> Also observed that the number of tasks are high and unevenly distributed
> across the workers.
> Check if there are too many partitions in the RDD and tune it using
> spark.sql.shuffle.partitions.
> If the uneven distribution is still occurring then try repartitioning the
> data set using appropriate fields.
>
> Hope that helps.
> Regards,
> Srinath.
>
>
> On Tue, Jun 12, 2018 at 1:39 PM Aakash Basu <aakash.spark....@gmail.com>
> wrote:
>
>> Yes, but when I increased the executor memory, the Spark job halts after
>> running a few steps, even though the executor isn't dying.
>>
>> Data - 60,000 data-points, 230 columns (60 MB data).
>>
>> Any input on why it behaves like that?
>>
>> On Tue, Jun 12, 2018 at 8:15 AM, Vamshi Talla <vamsh...@hotmail.com>
>> wrote:
>>
>>> Aakash,
>>>
>>> Like Jorn suggested, did you increase your test data set? If so, did you
>>> also update your executor-memory setting? It seems like you might be
>>> exceeding the executor memory threshold.
>>>
>>> Thanks
>>> Vamshi Talla
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 11, 2018, at 8:54 AM, Aakash Basu <aakash.spark....@gmail.com>
>>> wrote:
>>>
>>> Hi Jorn/Others,
>>>
>>> Thanks for your help. Now the data is being distributed properly, but
>>> the challenge is that after a certain point I'm getting this error, after
>>> which nothing moves ahead -
>>>
>>> 2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on
>>> 192.168.49.39: Remote RPC client disassociated. Likely due to containers exceeding
>>> thresholds, or network issues. Check driver logs for WARN messages.
>>>
>>> <image.png>
>>>
>>> How to avoid this scenario?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On Mon, Jun 11, 2018 at 4:16 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> If it is in kB then spark will always schedule it to one node. As soon
>>>> as it gets bigger you will see usage of more nodes.
>>>>
>>>> Hence, increase your test dataset.
>>>>
>>>> On 11. Jun 2018, at 12:22, Aakash Basu <aakash.spark....@gmail.com>
>>>> wrote:
>>>>
>>>> Jorn - The code is a series of feature engineering and model tuning
>>>> operations. Too big to show. Yes, data volume is too low, it is in KBs,
>>>> just tried to experiment with a small dataset before going for a large one.
>>>>
>>>> Akshay - I ran with your suggested Spark configurations, and I get this
>>>> (the node changed, but the problem persists) -
>>>>
>>>> <image.png>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 11, 2018 at 3:16 PM, akshay naidu <akshaynaid...@gmail.com>
>>>> wrote:
>>>>
>>>>> try
>>>>>  --num-executors 3 --executor-cores 4 --executor-memory 2G --conf
>>>>> spark.scheduler.mode=FAIR
>>>>>
>>>>> On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu <
>>>>> aakash.spark....@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have submitted a job on a *4 node cluster*, where I see most of the
>>>>>> operations happening on one of the worker nodes while the other two are
>>>>>> simply sitting idle.
>>>>>>
>>>>>> The picture below sheds light on that -
>>>>>>
>>>>>> How to properly distribute the load?
>>>>>>
>>>>>> My cluster conf (4 node cluster [1 driver; 3 slaves]) -
>>>>>>
>>>>>> *Cores - 6*
>>>>>> *RAM - 12 GB*
>>>>>> *HDD - 60 GB*
>>>>>>
>>>>>> My Spark Submit command is as follows -
>>>>>>
>>>>>> *spark-submit --master spark://192.168.49.37:7077
>>>>>> --num-executors 3 --executor-cores 5 --executor-memory 4G
>>>>>> /appdata/bblite-codebase/prima_diabetes_indians.py*
>>>>>>
>>>>>> What to do?
>>>>>>
>>>>>> Thanks,
>>>>>> Aakash.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
