Without the exact error from the driver that caused the job to restart,
it's hard to tell. But a simple way to improve things is to install
Spark's external shuffle service on the YARN NodeManagers, so that even
if an executor crashes, its shuffle output is still available to other
executors.
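
For reference, wiring that up roughly means registering the shuffle
aux-service on each NodeManager and turning on the Spark property; treat
the snippet below as a sketch, since exact jar names and paths depend on
your Spark build.

In yarn-site.xml on every NodeManager (the spark-<version>-yarn-shuffle.jar
also has to be on the NodeManager classpath, and the NMs restarted afterwards):

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

Then on the Spark side, in spark-defaults.conf or via --conf on spark-submit:

  spark.shuffle.service.enabled  true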

On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel <[email protected]> wrote:

> Hi,
>
> I have a Spark job running in yarn-client mode. At some point during the Join
> stage, an executor (container) runs out of memory and YARN kills it. Because
> of this the entire job restarts, and it keeps doing so on every failure.
>
> What is the best way to checkpoint? I see there's a checkpoint API, and another
> option might be to persist before the Join stage. Would that prevent the retry
> of the entire job? How about retrying only the tasks that were assigned to the
> faulty executor?
>
> Thanks




-- 
Marcelo
