Hi
I have seen this issue in my cluster without HA configured, when the process
is halted. I assume your scenario hits a similar issue when the active RM
machine is shut down abruptly. Maybe you can verify this by taking a thread
dump of the NM and comparing it with the JIRAs below.
The open JIRAs in the community regarding this problem are:
https://issues.apache.org/jira/i#browse/YARN-1061 (Without HA)
https://issues.apache.org/jira/i#browse/YARN-2578 (With HA)
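To illustrate the difference between a stopped process and a halted machine,
here is a rough sketch in Python. It is not YARN code: the function name
first_reachable, the connect parameter and the 2 second timeout are made up
for illustration. When the machine is up but the process is down, the
connection is refused right away and the client can move on to the next
endpoint. When the machine is halted, nothing answers, so the client only
moves on if some timeout is set.

```python
import socket


def first_reachable(endpoints, timeout_s=2.0, connect=socket.create_connection):
    """Try each (host, port) in order and return the first one that accepts
    a TCP connection, or None if none of them does.

    Hypothetical helper for illustration only, not part of Hadoop/YARN.
    """
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout=timeout_s)
        except OSError:
            # Covers both cases:
            # - connection refused (machine up, process down): fails at once
            # - socket.timeout (machine halted): fails only after timeout_s
            continue
        conn.close()
        return (host, port)
    return None
```

Without the timeout argument, a connect to a halted machine can block for a
long time, which would look similar to a client that never tries the other
endpoint.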
Thanks & Regards
Rohith Sharma K S
From: Matt Narrell [mailto:[email protected]]
Sent: 24 April 2015 23:28
To: [email protected]
Subject: Re: YARN HA Active ResourceManager failover when machine is stopped
Also, another observation: when the VMs are halted, it seems like the
NodeManagers do not consider this a scenario to round-robin among the
configured ResourceManagers. Is there some timeout that I’ve missed to
instruct the NodeManagers to do this round-robining when the machine is
not responding (to distinguish it from a network blip)?
mn
On Apr 24, 2015, at 1:50 AM, Drake민영근 <[email protected]> wrote:
Hi, Matt
The second log file looks like a NodeManager's log, not the standby
ResourceManager's.
Thanks.
Drake 민영근 Ph.D
kt NexR
On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <[email protected]> wrote:
Active ResourceManager: http://pastebin.com/hE0ppmnb
Standby ResourceManager: http://pastebin.com/DB8VjHqA
Oppressively chatty and not much valuable info contained therein.
On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <[email protected]> wrote:
I have run into this offline with someone else too but couldn't root-cause it.
Will you be able to share your active/standby ResourceManager logs via pastebin
or something?
+Vinod
On Apr 23, 2015, at 9:41 AM, Matt Narrell <[email protected]> wrote:
I’m using Hadoop 2.6.0 from HDP 2.2.4, installed via Ambari 2.0.
I’m testing the YARN HA ResourceManager failover. If I STOP the active
ResourceManager (shut the machine off), the standby ResourceManager is elected
to active, but the NodeManagers do not register themselves with the newly
elected active ResourceManager. If I restart the machine (but DO NOT resume the
YARN services) the NodeManagers register with the newly elected ResourceManager
and my jobs resume. I assume I have some bad configuration, as this produces a
SPOF, and is not HA in the sense I’m expecting.
Thanks,
mn