Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
Awesome! It looks promising. Thanks, Rishabh and Marcelo.

On Wed, Feb 3, 2016 at 12:09 PM, Rishabh Wadhawan wrote:

> Check out this link
> http://spark.apache.org/docs/latest/configuration.html and check
> spark.shuffle.service. Thanks
>
> On Feb 3, 2016, at 1:02 PM, Marcelo Vanzin  wrote:
>
> Yes, but you don't necessarily need to use dynamic allocation (just enable
> the external shuffle service).
>
>> On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel wrote:
>
>> Do you mean this setup?
>>
>> https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
>>
>>
>>
>>> On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin wrote:
>>
>>> Without the exact error from the driver that caused the job to restart,
>>> it's hard to tell. But a simple way to improve things is to install the
>>> Spark shuffle service on the YARN nodes, so that even if an executor
>>> crashes, its shuffle output is still available to other executors.
>>>
>>> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel wrote:
>>>
 Hi,

 I have a Spark job running in yarn-client mode. At some point during the
 join stage an executor (container) runs out of memory and YARN kills it.
 Because of this the entire job restarts, and it keeps doing so on every
 failure.

 What is the best way to checkpoint? I see there's a checkpoint API, and
 another option might be to persist before the join stage. Would that
 prevent a retry of the entire job? How about retrying only the tasks that
 were assigned to the faulty executor?

 Thanks



>>>
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>>
>>
>>
>
>
>
> --
> Marcelo
>
>
>



Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
Hi,

I have a Spark job running in yarn-client mode. At some point during the
join stage an executor (container) runs out of memory and YARN kills it.
Because of this the entire job restarts, and it keeps doing so on every
failure.

What is the best way to checkpoint? I see there's a checkpoint API, and
another option might be to persist before the join stage. Would that
prevent a retry of the entire job? How about retrying only the tasks that
were assigned to the faulty executor?
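
For reference, a rough spark-shell-style sketch of the two options I have in
mind (the datasets, app name, and checkpoint path below are made-up stand-ins,
not my actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("join-resiliency-sketch"))

// Made-up stand-ins for the real inputs of the job.
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (2, "y")))

// Option 1: persist the expensive side before the join, so lost partitions
// can be recomputed or refetched without redoing all the upstream stages.
val leftCached = left.persist(StorageLevel.MEMORY_AND_DISK)

// Option 2: checkpoint to a reliable store, which truncates the lineage.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // illustrative path
leftCached.checkpoint()

val joined = leftCached.join(right)
joined.count()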

Thanks



Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Marcelo Vanzin
Without the exact error from the driver that caused the job to restart,
it's hard to tell. But a simple way to improve things is to install the
Spark shuffle service on the YARN nodes, so that even if an executor
crashes, its shuffle output is still available to other executors.
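
Roughly, the Spark side is a single setting; the NodeManager side also needs
the spark_shuffle aux-service (org.apache.spark.network.yarn.YarnShuffleService)
and the spark-<version>-yarn-shuffle jar on its classpath, as described in the
running-on-yarn docs. A minimal, illustrative sketch of the Spark side (the app
name is made up, and normally this would live in spark-defaults.conf rather
than application code):

import org.apache.spark.{SparkConf, SparkContext}

// Executors register with the external shuffle service running in each
// NodeManager instead of serving their own shuffle files.
val conf = new SparkConf()
  .setAppName("shuffle-service-sketch")
  .set("spark.shuffle.service.enabled", "true")

val sc = new SparkContext(conf)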

On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel  wrote:

> Hi,
>
> I have a Spark job running in yarn-client mode. At some point during the
> join stage an executor (container) runs out of memory and YARN kills it.
> Because of this the entire job restarts, and it keeps doing so on every
> failure.
>
> What is the best way to checkpoint? I see there's a checkpoint API, and
> another option might be to persist before the join stage. Would that
> prevent a retry of the entire job? How about retrying only the tasks that
> were assigned to the faulty executor?
>
> Thanks
>
>
>




-- 
Marcelo


Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
Do you mean this setup?
https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation



On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin  wrote:

> Without the exact error from the driver that caused the job to restart,
> it's hard to tell. But a simple way to improve things is to install the
> Spark shuffle service on the YARN nodes, so that even if an executor
> crashes, its shuffle output is still available to other executors.
>
> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel wrote:
>
>> Hi,
>>
>> I have a Spark job running in yarn-client mode. At some point during the
>> join stage an executor (container) runs out of memory and YARN kills it.
>> Because of this the entire job restarts, and it keeps doing so on every
>> failure.
>>
>> What is the best way to checkpoint? I see there's a checkpoint API, and
>> another option might be to persist before the join stage. Would that
>> prevent a retry of the entire job? How about retrying only the tasks that
>> were assigned to the faulty executor?
>>
>> Thanks
>>
>>
>>
>
>
>
>
> --
> Marcelo
>



Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Marcelo Vanzin
Yes, but you don't necessarily need to use dynamic allocation (just enable
the external shuffle service).

On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel  wrote:

> Do you mean this setup?
>
> https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
>
>
>
> On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin wrote:
>
>> Without the exact error from the driver that caused the job to restart,
>> it's hard to tell. But a simple way to improve things is to install the
>> Spark shuffle service on the YARN nodes, so that even if an executor
>> crashes, its shuffle output is still available to other executors.
>>
>> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel wrote:
>>
>>> Hi,
>>>
>>> I have a Spark job running in yarn-client mode. At some point during the
>>> join stage an executor (container) runs out of memory and YARN kills it.
>>> Because of this the entire job restarts, and it keeps doing so on every
>>> failure.
>>>
>>> What is the best way to checkpoint? I see there's a checkpoint API, and
>>> another option might be to persist before the join stage. Would that
>>> prevent a retry of the entire job? How about retrying only the tasks that
>>> were assigned to the faulty executor?
>>>
>>> Thanks
>>>
>>>
>>>
>>
>>
>>
>>
>> --
>> Marcelo
>>
>
>
>
>
>



-- 
Marcelo


Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Rishabh Wadhawan
Hi Nirav,
There is a difference between dynamic resource allocation and the shuffle
service. With dynamic allocation enabled, Spark determines the number of
executors a job needs as it runs, releasing executors when the work is light
and requesting more when it is heavy. The external shuffle service, on the
other hand, keeps the intermediate shuffle output of a task or transformation
available outside the executor that produced it, so even if that executor dies
mid-job, the other active executors can still fetch that output and carry on.
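
A minimal sketch of how the two are typically enabled together (the app name
and executor bounds are illustrative; dynamic allocation requires the external
shuffle service to be running on the NodeManagers):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-sketch")
  // External shuffle service: shuffle output stays readable even if the
  // executor that produced it goes away.
  .set("spark.shuffle.service.enabled", "true")
  // Dynamic allocation: Spark grows and shrinks the executor count itself.
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")

val sc = new SparkContext(conf)
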
> On Feb 3, 2016, at 1:02 PM, Marcelo Vanzin  wrote:
> 
> Yes, but you don't necessarily need to use dynamic allocation (just enable 
> the external shuffle service).
> 
> On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel wrote:
> Do you mean this setup?
> https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
>  
> 
> 
> 
> 
> On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin wrote:
> Without the exact error from the driver that caused the job to restart, it's 
> hard to tell. But a simple way to improve things is to install the Spark 
> shuffle service on the YARN nodes, so that even if an executor crashes, its 
> shuffle output is still available to other executors.
> 
> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel wrote:
> Hi,
> 
> I have a Spark job running in yarn-client mode. At some point during the
> join stage an executor (container) runs out of memory and YARN kills it.
> Because of this the entire job restarts, and it keeps doing so on every
> failure.
>
> What is the best way to checkpoint? I see there's a checkpoint API, and
> another option might be to persist before the join stage. Would that
> prevent a retry of the entire job? How about retrying only the tasks that
> were assigned to the faulty executor?
> 
> Thanks
> 
> 
> -- 
> Marcelo
> 
> 
> 
> 
> -- 
> Marcelo



Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Rishabh Wadhawan
Check out this link: http://spark.apache.org/docs/latest/configuration.html
and check spark.shuffle.service. Thanks
> On Feb 3, 2016, at 1:02 PM, Marcelo Vanzin  wrote:
> 
> Yes, but you don't necessarily need to use dynamic allocation (just enable 
> the external shuffle service).
> 
> On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel wrote:
> Do you mean this setup?
> https://spark.apache.org/docs/1.5.2/job-scheduling.html#dynamic-resource-allocation
>  
> 
> 
> 
> 
> On Wed, Feb 3, 2016 at 11:50 AM, Marcelo Vanzin wrote:
> Without the exact error from the driver that caused the job to restart, it's 
> hard to tell. But a simple way to improve things is to install the Spark 
> shuffle service on the YARN nodes, so that even if an executor crashes, its 
> shuffle output is still available to other executors.
> 
> On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel wrote:
> Hi,
> 
> I have a Spark job running in yarn-client mode. At some point during the
> join stage an executor (container) runs out of memory and YARN kills it.
> Because of this the entire job restarts, and it keeps doing so on every
> failure.
>
> What is the best way to checkpoint? I see there's a checkpoint API, and
> another option might be to persist before the join stage. Would that
> prevent a retry of the entire job? How about retrying only the tasks that
> were assigned to the faulty executor?
> 
> Thanks
> 
> 
> 
> -- 
> Marcelo
> 
> 
> 
> 
> -- 
> Marcelo