[
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724903#comment-16724903
]
Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM:
------------------------------------------------------------
BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never
become active; don't we want to just die?*_
What do you mean by force give-up ? exit RM ?
The underlying curator implementation *will retry the connection in
background*, even though the exception is thrown. See *Guaranteeable* interface
in Curator. I think exit RM is too harsh here. Even though RM remains at
standby, all services should be already shutdown, so there's no harm to the end
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think
curator will *NOT* retry the connection, because I saw the following in the log
and checked curator's code:
*Background exception was not retry-able or retry gave up*, for
UnknownHostException:
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
    at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
    at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
    at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
    at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
    at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
    at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
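This matches Curator's retry check: the background retry only kicks in for
retry-able KeeperExceptions. A paraphrased sketch of that check (simplified
from Curator's RetryLoop, not the exact source):
{code:java}
import org.apache.zookeeper.KeeperException;

// Paraphrased and simplified from org.apache.curator.RetryLoop -- not the
// exact source, just the gist of the retry-ability check.
public final class RetryCheckSketch {
  // Only these ZooKeeper result codes count as retry-able.
  public static boolean shouldRetry(int rc) {
    return rc == KeeperException.Code.CONNECTIONLOSS.intValue()
        || rc == KeeperException.Code.OPERATIONTIMEOUT.intValue()
        || rc == KeeperException.Code.SESSIONMOVED.intValue()
        || rc == KeeperException.Code.SESSIONEXPIRED.intValue();
  }

  public static boolean isRetryException(Throwable exception) {
    // java.net.UnknownHostException is not a KeeperException, so the
    // background operation gives up instead of retrying.
    if (exception instanceof KeeperException) {
      return shouldRetry(((KeeperException) exception).code().intValue());
    }
    return false;
  }
}
{code}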
Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable*
interface.
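For reference, *Guaranteeable* in Curator only covers operations that expose
it, such as guaranteed deletes; a minimal sketch (the connect string and path
below are placeholders, not this cluster's real config):
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class GuaranteedDeleteSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string and retry policy.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1.example.com:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    // A guaranteed delete is re-attempted in the background until it
    // succeeds, as long as the client stays open.
    client.delete().guaranteed().forPath("/some/placeholder/path");
  }
}
{code}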
So, in the patch, if rejoining the election throws an exception, it will send
EMBEDDED_ELECTOR_FAILED, and then the RM will crash and reload the latest zk
connect string config on restart; a rough sketch of that flow follows.
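A rough sketch of the intended behavior (the helper names here are
illustrative, not the literal patch):
{code:java}
// Illustrative only -- not the literal patch. On rejoin failure, surface a
// fatal event instead of swallowing the exception, so the RM process exits
// and a restart reloads the latest zk connect string config.
private void rejoinElection() {
  try {
    closeLeaderLatch();          // illustrative helper: leave current candidacy
    initAndStartLeaderLatch();   // try to rejoin the election
  } catch (Exception e) {
    rmContext.getDispatcher().getEventHandler().handle(
        new RMFatalEvent(RMFatalEventType.EMBEDDED_ELECTOR_FAILED, e));
  }
}
{code}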
> Standby RM hangs forever (does not retry or crash) because it is permanently lost from the leader election
> --------------------------------------------------------------------------------------
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.9.2
> Reporter: Yuqi Wang
> Assignee: Yuqi Wang
> Priority: Major
> Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
> Standby RM hangs forever (does not retry or crash) because it is permanently
> lost from the leader election
>
> {color:#205081}*Issue Repro Steps:*{color}
> # Start multiple RMs in HA mode.
> # Modify all hostnames in the zk connect string to different values in DNS.
> (In reality, we need to replace old/bad zk machines with new/good zk machines,
> so their DNS hostnames will change.) A config illustration follows this list.
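> For illustration, the setting involved is the RM zk connect string, e.g.
> {{yarn.resourcemanager.zk-address}} in yarn-site.xml (hostnames below are
> placeholders):
> {code:xml}
> <!-- Placeholder hostnames: the bug is triggered when every hostname listed
>      here stops resolving in DNS (e.g. the zk machines are replaced), so the
>      RM hits UnknownHostException. -->
> <property>
>   <name>yarn.resourcemanager.zk-address</name>
>   <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
> </property>
> {code}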
>
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in the attachment yarn_rm.zip (the RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (here the exception is eaten and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (here the exception is eaten and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return
>
> (The standby RM failed to rejoin the election, but it never retries or
> crashes afterwards, so there are no more zk-related logs and the standby RM
> hangs forever, even after the zk connect string hostnames are changed back to
> the original ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join the
> election* (giving up joining the election should only happen when RM decides
> to crash); otherwise, an RM that is not in the election can never become
> active again and do real work.
>
> {color:#205081}*Caused By:*{color}
> It was introduced by YARN-3742.
> What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED
> RMFatalEvent happens, RM should transition to standby instead of crashing.
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition
> to standby, instead of crashing.* (In contrast, before this change, RM crashed
> on all of them instead of going to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the
> standby RM is left not working, such as staying in standby forever.
> And as the author
> [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent
> handler to transition to standby as the default reaction, *with shutdown as a
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>
> {color:#205081}*The Patch's Solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the
> failures in the {color:#14892c}whitelist{color}* (marked in the comments
> below):
> {code:java}
> public enum RMFatalEventType {
>   // Source <- Store (whitelisted: transition to standby)
>   STATE_STORE_FENCED,
>   STATE_STORE_OP_FAILED,
>
>   // Source <- Embedded Elector (not whitelisted: crash)
>   EMBEDDED_ELECTOR_FAILED,
>
>   // Source <- Admin Service (whitelisted: transition to standby)
>   TRANSITION_TO_ACTIVE_FAILED,
>
>   // Source <- Critical Thread Crash (not whitelisted: crash)
>   CRITICAL_THREAD_CRASH
> }
> {code}
> And the others, such as EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and
> failure types added in the future, should crash the RM, because we *cannot
> ensure* that they will *never* leave the RM unable to work in standby state,
> and the *conservative* way is to crash the RM. Besides, after the crash, the
> RM's external watchdog service can notice it and try to repair the RM machine,
> send alerts, etc. A minimal handler sketch follows.
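> A minimal sketch of such a whitelist-based handler (names are illustrative;
> see the patch for the real change):
> {code:java}
> // Illustrative sketch of a whitelist-based RMFatalEvent handler: whitelisted
> // event types transition RM to standby; everything else crashes the RM so an
> // external watchdog can notice, repair the machine, send alerts, etc.
> switch (event.getType()) {
>   case STATE_STORE_FENCED:
>   case STATE_STORE_OP_FAILED:
>   case TRANSITION_TO_ACTIVE_FAILED:
>     handleTransitionToStandByInNewThread();  // whitelist: standby is safe
>     break;
>   default:
>     ExitUtil.terminate(1, "Fatal event: " + event.getType());  // conservative crash
> }
> {code}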
> For more details, please check the patch.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]