Yes, it looks like we’re running up against YARN-2578.  That’s very unfortunate.

Thanks for everyone’s investigation and input.

mn

> On Apr 26, 2015, at 10:38 PM, Rohith Sharma K S <[email protected]> 
> wrote:
> 
> Hi
>  
>      I have seen this issue in my cluster, without HA configured, when the 
> process is halted.  I assume your scenario has a similar issue when the 
> active RM machine is shut down abruptly.  Maybe you can verify by taking a 
> thread dump of the NM and comparing it with the JIRAs below.
>  
> Open JIRAs in the community regarding this problem are:
> https://issues.apache.org/jira/i#browse/YARN-1061 (Without HA)
> https://issues.apache.org/jira/i#browse/YARN-2578 (With HA)
>  
>  
> Thanks & Regards
> Rohith Sharma K S
>  
> From: Matt Narrell [mailto:[email protected]] 
> Sent: 24 April 2015 23:28
> To: [email protected]
> Subject: Re: YARN HA Active ResourceManager failover when machine is stopped
>  
> Also, another observation is that when the VMs are halted, it seems like the 
> NodeManagers do not consider this a scenario to round-robin among the 
> configured ResourceManagers?  Is there some timeout that I’ve missed to 
> instruct the NodeManagers to do this round-robining in the case of the 
> machine not responding (to distinguish it from a network blip)?
>  
> mn
>  
> On Apr 24, 2015, at 1:50 AM, Drake민영근 <[email protected]> wrote:
>  
> Hi, Matt
>  
> The second log file looks like a NodeManager log, not the standby 
> ResourceManager's.
>  
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
>  
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <[email protected]> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>  
> Oppressively chatty and not much valuable info contained therein.
>  
>  
> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <[email protected]> wrote:
>  
> I have run into this offline with someone else too but couldn't root-cause it.
>  
> Will you be able to share your active/standby ResourceManager logs via 
> pastebin or something?
>  
> +Vinod
>  
> On Apr 23, 2015, at 9:41 AM, Matt Narrell <[email protected]> wrote:
> 
> 
> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>  
> I’m testing the YARN HA ResourceManager failover. If I STOP the active 
> ResourceManager (shut the machine off), the standby ResourceManager is 
> elected to active, but the NodeManagers do not register themselves with the 
> newly elected active ResourceManager. If I restart the machine (but DO NOT 
> resume the YARN services) the NodeManagers register with the newly elected 
> ResourceManager and my jobs resume. I assume I have some bad configuration, 
> as this produces a SPOF, and is not HA in the sense I’m expecting.
>  
> Thanks,
> mn
