Yes, it looks like we’re running up against YARN-2578. That’s very unfortunate.
Thanks for everyone’s investigation and input.

mn

> On Apr 26, 2015, at 10:38 PM, Rohith Sharma K S <[email protected]> wrote:
>
> Hi
>
> I have seen this issue in my cluster without HA configured when the
> process is halted. I assume your scenario has a similar issue when the
> active RM machine is shut down abruptly. Maybe you can verify by taking
> a thread dump of the NM and comparing it with the JIRAs below.
>
> Open JIRAs in the community regarding this problem are
> https://issues.apache.org/jira/i#browse/YARN-1061 (Without HA)
> https://issues.apache.org/jira/i#browse/YARN-2578 (With HA)
>
> Thanks & Regards
> Rohith Sharma K S
>
> From: Matt Narrell [mailto:[email protected]]
> Sent: 24 April 2015 23:28
> To: [email protected]
> Subject: Re: YARN HA Active ResourceManager failover when machine is stopped
>
> Also, another observation is that when the VMs are halted, it seems like the
> NodeManagers do not consider this a scenario to round-robin among the
> configured ResourceManagers? Is there some timeout that I’ve missed to
> instruct the NodeManagers to do this round-robining in the case of the
> machine not responding (to distinguish it from a network blip)?
>
> mn
>
> On Apr 24, 2015, at 1:50 AM, Drake민영근 <[email protected]> wrote:
>
> Hi, Matt
>
> The second log file looks like the node manager's log, not the standby
> resource manager's.
>
> Thanks.
>
> Drake 민영근 Ph.D
> kt NexR
>
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <[email protected]> wrote:
> Active ResourceManager: http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>
> Oppressively chatty and not much valuable info contained therein.
>
> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <[email protected]> wrote:
>
> I have run into this offline with someone else too but couldn't root-cause it.
>
> Will you be able to share your active/standby ResourceManager logs via
> pastebin or something?
>
> +Vinod
>
> On Apr 23, 2015, at 9:41 AM, Matt Narrell <[email protected]> wrote:
>
> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>
> I’m testing the YARN HA ResourceManager failover. If I STOP the active
> ResourceManager (shut the machine off), the standby ResourceManager is
> elected to active, but the NodeManagers do not register themselves with the
> newly elected active ResourceManager. If I restart the machine (but DO NOT
> resume the YARN services) the NodeManagers register with the newly elected
> ResourceManager and my jobs resume. I assume I have some bad configuration,
> as this produces a SPOF, and is not HA in the sense I’m expecting.
>
> Thanks,
> mn
