Ah, yes. OK, please see below:

Scenario one: Stop the active ResourceManager process (leaving the VM running)

Active ResourceManager: https://gist.github.com/mnarrell/157c8e1b82d40541cd88
Standby ResourceManager: https://gist.github.com/mnarrell/b6ad01d2f4b900b42e6d

Scenario two: Shut down the VM ($ shutdown -h now)

Active ResourceManager: https://gist.github.com/mnarrell/95b35cc8be0ed817cf1b
Standby ResourceManager: https://gist.github.com/mnarrell/68a778e0d0d213e1b2cf

Here is the yarn-site.xml: https://gist.github.com/mnarrell/115a3eff03bbef947a57

We have some suspicion that this could be related to fencing. We speculate that when the machine is shut down, the NodeManagers do not treat the NoRouteToHost exception as a failover situation.

We have a pretty vanilla configuration of YARN, mostly Ambari defaults, and have compared our configuration to the YARN ResourceManager HA documentation from Apache and Hortonworks.

mn

> On Apr 24, 2015, at 1:50 AM, Drake민영근 <[email protected]> wrote:
>
> Hi, Matt
>
> The second log file looks like node manager's log, not the standby resource
> manager.
>
> Thanks.
>
> Drake 민영근 Ph.D
> kt NexR
>
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <[email protected]> wrote:
> Active ResourceManager: http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>
> Oppressively chatty and not much valuable info contained therein.
>
>
>> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli
>> <[email protected]> wrote:
>>
>> I have run into this offline with someone else too but couldn't root-cause
>> it.
>>
>> Will you be able to share your active/standby ResourceManager logs via
>> pastebin or something?
>>
>> +Vinod
>>
>> On Apr 23, 2015, at 9:41 AM, Matt Narrell <[email protected]> wrote:
>>
>>> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>>>
>>> I’m testing the YARN HA ResourceManager failover. If I STOP the active
>>> ResourceManager (shut the machine off), the standby ResourceManager is
>>> elected to active, but the NodeManagers do not register themselves with the
>>> newly elected active ResourceManager. If I restart the machine (but DO NOT
>>> resume the YARN services) the NodeManagers register with the newly elected
>>> ResourceManager and my jobs resume. I assume I have some bad configuration,
>>> as this produces a SPOF, and is not HA in the sense I’m expecting.
>>>
>>> Thanks,
>>> mn
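
For anyone following the configuration side of this thread, here is a minimal sketch of the ResourceManager HA properties that the Apache documentation describes for yarn-site.xml. This is not the content of the gist linked above: the cluster id, the rm-ids, and every hostname are placeholder assumptions, and an Ambari-managed cluster may set further properties.

```xml
<!-- Sketch only: all values below are placeholders, not taken from the thread. -->
<configuration>
  <!-- Turn on ResourceManager HA and name the two RM instances. -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>example-cluster</value> <!-- hypothetical id -->
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <!-- One hostname entry per id listed in rm-ids. -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>rm1.example.com</value> <!-- hypothetical host -->
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>rm2.example.com</value> <!-- hypothetical host -->
  </property>
  <!-- ZooKeeper quorum used for leader election. -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <!-- Client-side provider that NodeManagers and clients use to pick the active RM. -->
  <property>
    <name>yarn.client.failover-proxy-provider</name>
    <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
  </property>
</configuration>
```

The NodeManagers read the same rm-ids and hostname entries, so these values need to be identical on every node for them to find the other ResourceManager after a failover.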
