Liang Xie created HDFS-7763:
-------------------------------

             Summary: fix zkfc hung issue due to not catching exception in a 
corner case
                 Key: HDFS-7763
                 URL: https://issues.apache.org/jira/browse/HDFS-7763
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.6.0
            Reporter: Liang Xie
            Assignee: Liang Xie


In our production cluster, both zkfc processes hung after a ZooKeeper 
network outage.

The zkfc log said:
{code}
2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, 
closing socket connection and attempting reconnect
2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x4a61bacdd9dfb2 closed
2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal 
error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not 
retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 
11300
2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
HealthMonitor thread
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
listener on 11300
{code}

The thread dump has also been uploaded as an attachment.
From the dump, we can see that the process did not exit because of some 
unknown non-daemon threads (pool-*-thread-*). The critical threads, like the 
health monitor and RPC threads, had already been stopped, but our watchdog 
(supervisord) did not observe that the zkfc process was down or abnormal, so 
the subsequent namenode failover could not be done as expected.
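
As a rough illustration of how such threads can be spotted from inside a JVM, 
the sketch below lists live non-daemon threads by name. The class and method 
names (NonDaemonThreadsSketch, liveNonDaemonThreadNames) are hypothetical and 
not part of Hadoop; it relies only on the standard Thread.getAllStackTraces() 
call and on the fact that the default executor thread factory names its 
threads pool-N-thread-M and creates them as non-daemon.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hadoop code: reports threads that can keep a JVM alive.
public class NonDaemonThreadsSketch {

    /** Returns the names of live, non-daemon threads other than the calling thread. */
    public static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<String>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Any thread that passes this filter prevents the JVM from exiting
            // on its own once the main thread has returned.
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }
}
```

A pool created with Executors.newSingleThreadExecutor() and never shut down 
would show up in this list under a pool-N-thread-M name, which matches the 
pattern seen in the dump.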

There are two possible fixes here: 1) figure out where the threads with 
default names, like pool-7-thread-1, come from, and either shut them down or 
set the daemon property on them. I tried to search but have found nothing so 
far. 2) catch the exception from ZKFailoverController.run() so that we can 
continue on to execute System.exit. The attached patch implements 2).
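
The two options can be sketched roughly as follows. This is not the attached 
patch and not the real ZKFailoverController code; ZkfcExitSketch, 
daemonFactory and runGuarded are hypothetical names used only to show the 
shape of each approach with plain JDK classes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of both options; none of these names exist in Hadoop.
public class ZkfcExitSketch {

    /** Option 1: give pool threads a recognizable name and mark them daemon. */
    public static ThreadFactory daemonFactory(final String prefix) {
        final AtomicInteger seq = new AtomicInteger();
        return new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, prefix + "-" + seq.incrementAndGet());
                t.setDaemon(true); // daemon threads do not block JVM exit
                return t;
            }
        };
    }

    /**
     * Option 2: run the controller and map any Throwable to a non-zero
     * return code, so the caller always reaches its System.exit call.
     */
    public static int runGuarded(Callable<Integer> controller) {
        try {
            return controller.call();
        } catch (Throwable t) {
            System.err.println("Fatal error in controller: " + t);
            return 1;
        }
    }
}
```

Under these assumptions a main method would end with 
System.exit(ZkfcExitSketch.runGuarded(controller)), so the process terminates 
even if a stray non-daemon thread is still alive, and the watchdog can see 
that it is gone.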



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
