@Steve: Are there any existing applications whose AM has the code to handle rebuilding its state on restart? I am curious because we are currently trying to improve Apache Samza's behavior across NM restarts. We occasionally run into orphaned containers, and I am wondering whether we are not handling shutdown/failure properly.
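To make the question concrete, the kind of "rebuild" I have in mind on the AM side looks roughly like the sketch below. This is just my reading of the AMRMClient API after YARN-1489; the host, port, and tracking-URL values are placeholders, and I would be happy to be corrected if established AMs wire this up differently.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RebuildStateOnAmRestart {

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
    amrmClient.init(conf);
    amrmClient.start();

    // Host, port, and tracking URL below are placeholders for illustration.
    RegisterApplicationMasterResponse response =
        amrmClient.registerApplicationMaster("am-host.example.com", 0, "");

    // Containers that survived the previous AM attempt are reported here;
    // the new attempt has to fold them back into its own bookkeeping
    // instead of (or before) requesting fresh containers from the RM.
    List<Container> survivors = response.getContainersFromPreviousAttempts();
    for (Container container : survivors) {
      System.out.println("Re-adopting container " + container.getId()
          + " on node " + container.getNodeId());
      // e.g. mark the corresponding task as already running in the AM's
      // internal state instead of scheduling it again.
    }
  }
}
```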
Navina

On Thu, Mar 3, 2016 at 5:34 AM, Junping Du <[email protected]> wrote:

> With proper configuration, containers (including the AM's) can keep running
> when the NM fails. Please check YARN-1336 for the work-preserving NM restart
> work. For AM failure (restart), after YARN-1489 (work-preserving
> ApplicationMaster restart), the containers will not get killed when the AM
> fails (within the maximum number of attempts). However, as Steve mentioned,
> each application has to figure out how to wire the new AM attempt up to its
> existing containers and sync state (and most of them haven't done that yet).
> The ongoing work in MAPREDUCE-6608 is an example.
>
>
> Thanks,
>
> Junping
> ________________________________________
> From: Steve Loughran <[email protected]>
> Sent: Thursday, March 03, 2016 1:16 PM
> To: [email protected]
> Subject: Re: Question about YARN NodeManager and ApplicationMaster failures
>
>
> On 3 Mar 2016, at 12:58, Dustin Cote <[email protected]> wrote:
> >
> > -dev since this is more of a user question
> >
> > The NodeManager is the parent of the application master, so any containers
> > (including application master containers) that are running where the failed
> > NodeManager is located will die. If an application master fails, a new one
> > is created, up to your limit (set by yarn.resourcemanager.am.max-attempts).
> > The other containers associated with the application master are supposed to
> > continue on and pick up the newly started application master.
>
> Only if you tell YARN to keep containers over restart and the AM has the
> code to rebuild its state. Most AMs don't do this (MR, Tez, Spark, etc.),
> as the state is hard to preserve and rebuild.
>
> See YARN-896 for all the details of things related to long-lived services.
>
> You can also put a reset window on AM failures; see YARN-611.
>
> Oh, and there's work-preserving NM restart, but that's another topic ....
>
> > The resource manager takes care of the bookkeeping needed to make this
> > happen. I'd suggest you have a look at the series of blogs here
> > <http://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/>
> > for a more in-depth look at the mechanics.
> >
> > -Dustin
> >
> > On Wed, Mar 2, 2016 at 8:26 PM, Sadystio Ilmatunt <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I have some questions regarding failure of the NodeManager and the
> >> Application Master.
> >> What happens if the NodeManager running on the same node as the
> >> Application Master fails? Does the Application Master fail as well?
> >>
> >> Also, how is Application Master failure handled with respect to its
> >> (child) containers? Do these containers fail too?
> >> If yes, is there a way these containers can be assigned to a new
> >> instance of the Application Master that might come up on some other node?
> >
> >
> >
> > --
> > Dustin Cote
> > Customer Operations Engineer
> > <http://www.cloudera.com>

--
Navina R.
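P.S. For anyone else following along: as far as I can tell, the knobs Junping and Steve refer to end up as settings on the ApplicationSubmissionContext at submission time (YARN-1489 for keeping containers across AM attempts, YARN-611 for the failure-validity window), plus yarn.nodemanager.recovery.enabled on the NM side for YARN-1336. A rough client-side sketch follows; please double-check the calls against your Hadoop version.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitWithWorkPreservingAmRestart {

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();

    // YARN-1489: ask the RM not to kill the app's running containers when
    // an AM attempt fails, so the next attempt can re-adopt them.
    ctx.setKeepContainersAcrossApplicationAttempts(true);

    // Per-app attempt limit (capped by yarn.resourcemanager.am.max-attempts).
    ctx.setMaxAppAttempts(4);

    // YARN-611: only count AM failures inside a sliding window, so a
    // long-running app is not permanently capped by old failures.
    ctx.setAttemptFailuresValidityInterval(10 * 60 * 1000L); // 10 minutes

    // ... set the AM container launch context, resource, queue, etc.,
    // then: yarnClient.submitApplication(ctx);
  }
}
```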
