Re: YARN HA Active ResourceManager failover when machine is stopped

Matt Narrell Fri, 24 Apr 2015 10:40:12 -0700

Ah, yes. OK, please see below:

Scenario one: Stop the Active ResourceManager process (leaving the VM running)
Active ResourceManager:  https://gist.github.com/mnarrell/157c8e1b82d40541cd88
Standby ResourceManager: https://gist.github.com/mnarrell/b6ad01d2f4b900b42e6d


Scenario two: Shut down the VM ($ shutdown -h now)
Active ResourceManager:  https://gist.github.com/mnarrell/95b35cc8be0ed817cf1b
Standby ResourceManager: https://gist.github.com/mnarrell/68a778e0d0d213e1b2cf

Here is the yarn-site.xml:
https://gist.github.com/mnarrell/115a3eff03bbef947a57

We suspect this could be related to fencing. Our speculation is that when
the machine is shut down, the NodeManagers do not treat the NoRouteToHost
exception as a failover condition. We have a pretty vanilla configuration of
YARN, mostly Ambari defaults, and have compared our configuration against the
YARN ResourceManager HA documentation from Apache and Hortonworks.
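
To make the speculation concrete, here is a toy model in Python. It is not
Hadoop's actual retry code; the function names, the RM ids "rm1"/"rm2", and
the idea of a fixed set of "failover-worthy" error codes are all assumptions
made for illustration. It only shows how a client that fails over on
"connection refused" (process stopped, host up) but not on "no route to
host" (machine down) would keep retrying the dead ResourceManager.

```python
import errno


def pick_rm(rm_ids, connect, failover_errnos, max_attempts=6):
    """Return the id of the first RM that accepts a connection, else None.

    Hypothetical client loop: on an error listed in failover_errnos it moves
    to the next RM; on any other OSError it retries the same RM.
    """
    idx = 0
    for _ in range(max_attempts):
        rm = rm_ids[idx]
        try:
            connect(rm)
            return rm
        except OSError as e:
            if e.errno in failover_errnos:
                # Treated as a failover condition: try the next RM.
                idx = (idx + 1) % len(rm_ids)
            # Otherwise: not a failover condition, retry the same RM.
    return None


def process_stopped(rm):
    # Scenario one (simulated): host is up, RM process is down.
    if rm == "rm1":
        raise ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")


def machine_down(rm):
    # Scenario two (simulated): the whole machine is unreachable.
    if rm == "rm1":
        raise OSError(errno.EHOSTUNREACH, "No route to host")


# Assumed policy: only "connection refused" triggers failover.
narrow_policy = {errno.ECONNREFUSED}
# Assumed alternative: "no route to host" triggers failover as well.
wide_policy = {errno.ECONNREFUSED, errno.EHOSTUNREACH}

scenario_one = pick_rm(["rm1", "rm2"], process_stopped, narrow_policy)
scenario_two = pick_rm(["rm1", "rm2"], machine_down, narrow_policy)
scenario_two_wide = pick_rm(["rm1", "rm2"], machine_down, wide_policy)
```

Under these assumptions the narrow policy reaches rm2 in scenario one but
never leaves rm1 in scenario two, which matches the behavior we observe.
Whether the real NodeManager retry logic behaves this way is exactly what we
are unsure about.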

mn

> On Apr 24, 2015, at 1:50 AM, Drake민영근 <[email protected]> wrote:
> 
> Hi, Matt
> 
> The second log file looks like node manager's log, not the standby resource 
> manager.
> 
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
> 
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <[email protected]> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
> 
> Oppressively chatty and not much valuable info contained therein.
> 
> 
>> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <[email protected]> wrote:
>> 
>> I have run into this offline with someone else too but couldn't root-cause 
>> it.
>> 
>> Will you be able to share your active/standby ResourceManager logs via 
>> pastebin or something?
>> 
>> +Vinod
>> 
>> On Apr 23, 2015, at 9:41 AM, Matt Narrell <[email protected]> wrote:
>> 
>>> I’m using Hadoop 2.6.0 from HDP 2.2.4, installed via Ambari 2.0.
>>> 
>>> I’m testing the YARN HA ResourceManager failover. If I STOP the active 
>>> ResourceManager (shut the machine off), the standby ResourceManager is 
>>> elected to active, but the NodeManagers do not register themselves with the 
>>> newly elected active ResourceManager. If I restart the machine (but DO NOT 
>>> resume the YARN services) the NodeManagers register with the newly elected 
>>> ResourceManager and my jobs resume. I assume I have some bad configuration, 
>>> as this produces a SPOF, and is not HA in the sense I’m expecting.
>>> 
>>> Thanks,
>>> mn
