@Steve: Are there any existing applications whose AM has the code to handle rebuilding its state on restart? I am curious because we are currently trying to improve Apache Samza's behavior across NM restarts. We occasionally run into orphaned containers, and I am wondering whether we are not handling shutdown/failure properly.
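To make the question concrete, the kind of "rebuild" I have in mind on the AM side looks roughly like the sketch below. This is just my reading of the AMRMClient API after YARN-1489; the host, port, and tracking-URL values are placeholders, and I would be happy to be corrected if established AMs wire this up differently.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RebuildStateOnAmRestart {

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(conf);
    amrmClient.start();

    // Host, port, and tracking URL below are placeholders for illustration.
    RegisterApplicationMasterResponse response =
        amrmClient.registerApplicationMaster("am-host.example.com", 0, "");

    // Containers that survived the previous AM attempt are reported here;
    // the new attempt has to fold them back into its own bookkeeping
    // instead of (or before) requesting fresh containers from the RM.
    List<Container> survivors = response.getContainersFromPreviousAttempts();
    for (Container container : survivors) {
      System.out.println("Re-adopting container " + container.getId()
          + " on node " + container.getNodeId());
      // e.g. mark the corresponding task as already running in the AM's
      // internal state instead of scheduling it again.
    }
  }
}
```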
Navina

On Thu, Mar 3, 2016 at 5:34 AM, Junping Du <[email protected]> wrote:

> With proper configuration, containers (including the AM's) can keep running
> when the NM fails. Please check YARN-1336 for the work-preserving NM restart
> work. For AM failure (restart), after YARN-1489 (work-preserving
> ApplicationMaster restart), the containers will not get killed when the AM
> fails (within the maximum number of attempts). However, as Steve mentioned,
> each application has to figure out how to wire the new AM attempt up to its
> existing containers and sync state (and most of them haven't done that yet).
> The ongoing work in MAPREDUCE-6608 is an example.
>
>
> Thanks,
>
> Junping
> ________________________________________
> From: Steve Loughran <[email protected]>
> Sent: Thursday, March 03, 2016 1:16 PM
> To: [email protected]
> Subject: Re: Question about YARN NodeManager and ApplicationMaster failures
>
>
> On 3 Mar 2016, at 12:58, Dustin Cote <[email protected]> wrote:
> >
> > -dev since this is more of a user question
> >
> > The NodeManager is the parent of the application master, so any containers
> > (including application master containers) that are running where the failed
> > NodeManager is located will die. If an application master fails, a new one
> > is created, up to your limit (set by yarn.resourcemanager.am.max-attempts).
> > The other containers associated with the application master are supposed to
> > continue on and pick up the newly started application master.
>
> Only if you tell YARN to keep containers over restart and the AM has the
> code to rebuild its state. Most AMs don't do this (MR, Tez, Spark, etc.),
> as the state is hard to preserve and rebuild.
>
> See YARN-896 for all the details of things related to long-lived services.
>
> You can also put a reset window on AM failures; see YARN-611.
>
> Oh, and there's work-preserving NM restart, but that's another topic ....
>
> > The resource manager takes care of the bookkeeping needed to make this
> > happen. I'd suggest you have a look at the series of blogs here
> > <http://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/>
> > for a more in-depth look at the mechanics.
> >
> > -Dustin
> >
> > On Wed, Mar 2, 2016 at 8:26 PM, Sadystio Ilmatunt <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I have some questions regarding failure of the NodeManager and the
> >> Application Master.
> >> What happens if the NodeManager running on the same node as the
> >> Application Master fails? Does the Application Master fail as well?
> >>
> >> Also, how is Application Master failure handled with respect to its
> >> (child) containers? Do these containers fail too?
> >> If yes, is there a way these containers can be assigned to a new
> >> instance of the Application Master that might come up on some other node?
> >
> >
> >
> > --
> > Dustin Cote
> > Customer Operations Engineer
> > <http://www.cloudera.com>

--
Navina R.
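P.S. For anyone else following along: as far as I can tell, the knobs Junping and Steve refer to end up as settings on the ApplicationSubmissionContext at submission time (YARN-1489 for keeping containers across AM attempts, YARN-611 for the failure-validity window), plus yarn.nodemanager.recovery.enabled on the NM side for YARN-1336. A rough client-side sketch follows; please double-check the calls against your Hadoop version.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitWithWorkPreservingAmRestart {

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();

    // YARN-1489: ask the RM not to kill the app's running containers when
    // an AM attempt fails, so the next attempt can re-adopt them.
    ctx.setKeepContainersAcrossApplicationAttempts(true);

    // Per-app attempt limit (capped by yarn.resourcemanager.am.max-attempts).
    ctx.setMaxAppAttempts(4);

    // YARN-611: only count AM failures inside a sliding window, so a
    // long-running app is not permanently capped by old failures.
    ctx.setAttemptFailuresValidityInterval(10 * 60 * 1000L); // 10 minutes

    // ... set the AM container launch context, resource, queue, etc.,
    // then: yarnClient.submitApplication(ctx);
  }
}
```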
