Zbigniew Kostrzewa created HDFS-12834:
-----------------------------------------

             Summary: DFSZKFailoverController on error exits with 0 error code
                 Key: HDFS-12834
                 URL: https://issues.apache.org/jira/browse/HDFS-12834
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 3.0.0-alpha4, 2.7.3
            Reporter: Zbigniew Kostrzewa


On error, {{DFSZKFailoverController}} exits with return code 0. This causes 
problems when integrating it with scripts and monitoring tools such as systemd: 
when systemd is configured to restart the service only on failure, it does not 
restart the ZKFC service, because the process exited with 0.

For example, in my case systemd reported that zkfc exited successfully, but in 
the logs I found this:
{noformat}
2017-11-14 05:33:55,075 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 3334ms for sessionid 
0x15fb794bd240001, closing socket connection and attempting reconnect
2017-11-14 05:33:55,178 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
2017-11-14 05:33:55,564 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate 
using SASL (unknown error)
2017-11-14 05:33:55,566 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.9.4.73/10.9.4.73:2182, initiating session
2017-11-14 05:33:55,569 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server 10.9.4.73/10.9.4.73:2182, sessionid = 
0x15fb794bd240001, negotiated timeout = 5000
2017-11-14 05:33:55,570 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2017-11-14 05:33:58,230 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x15fb794bd240001, likely server has 
closed socket, closing socket connection and attempting reconnect
2017-11-14 05:33:58,335 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
2017-11-14 05:33:58,402 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.9.4.138/10.9.4.138:2181. Will not attempt to 
authenticate using SASL (unknown error)
2017-11-14 05:33:58,403 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.9.4.138/10.9.4.138:2181, initiating session
2017-11-14 05:33:58,406 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x15fb794bd240001, likely server has 
closed socket, closing socket connection and attempting reconnect
2017-11-14 05:33:59,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.9.4.228/10.9.4.228:2183. Will not attempt to 
authenticate using SASL (unknown error)
2017-11-14 05:33:59,219 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.9.4.228/10.9.4.228:2183, initiating session
2017-11-14 05:33:59,221 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x15fb794bd240001, likely server has 
closed socket, closing socket connection and attempting reconnect
2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate 
using SASL (unknown error)
2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 1773ms for sessionid 
0x15fb794bd240001, closing socket connection and attempting reconnect
2017-11-14 05:34:01,196 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
znode monitoring connection errors.
2017-11-14 05:34:02,153 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x15fb794bd240001 closed
2017-11-14 05:34:02,154 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal 
error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not 
retrying further znode monitoring connection errors.
2017-11-14 05:34:02,154 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2017-11-14 05:34:05,208 INFO org.apache.hadoop.ipc.Server: Stopping server on 
8019
2017-11-14 05:34:05,487 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
listener on 8019
2017-11-14 05:34:05,488 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server 
Responder
2017-11-14 05:34:05,487 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2017-11-14 05:34:05,488 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
HealthMonitor thread
2017-11-14 05:34:05,490 FATAL 
org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, 
exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received stat error 
from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring 
connection errors.
        at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
        at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
        at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
        at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
{noformat}


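The code that seems responsible is in {{DFSZKFailoverController.java}}. The 
catch block logs the {{Throwable}} but leaves {{retCode}} at its initial value 
of 0, so {{System.exit(retCode)}} reports success: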
{code}
  public static void main(String args[])
      throws Exception {
...
    int retCode = 0;
    try {
      retCode = zkfc.run(parser.getRemainingArgs());
    } catch (Throwable t) {
      LOG.fatal("Got a fatal error, exiting now", t); 
    }   
    System.exit(retCode);
  }
{code}
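
One possible direction for a fix, shown as a minimal standalone sketch rather than an actual Hadoop patch: return a non-zero code whenever a {{Throwable}} is caught. The class {{ExitCodeSketch}}, the helper {{exitCodeFor}}, and the choice of 1 as the failure code are assumptions made for illustration; none of them come from the Hadoop code base, and the {{Callable}} merely stands in for {{zkfc.run(...)}}.

```java
import java.util.concurrent.Callable;

public class ExitCodeSketch {

    // Hypothetical helper (not a Hadoop API): runs the given body and
    // maps any Throwable to a non-zero exit code instead of leaving it at 0.
    public static int exitCodeFor(Callable<Integer> body) {
        try {
            // On the normal path, pass through whatever code the body returns.
            return body.call();
        } catch (Throwable t) {
            // Log the failure, then signal it to the caller.
            System.err.println("Got a fatal error, exiting now: " + t);
            return 1; // assumed failure code; any non-zero value would do
        }
    }

    public static void main(String[] args) {
        // Stand-in for a run that completes normally.
        int ok = exitCodeFor(() -> 0);

        // Stand-in for a run that fails the way the stack trace above does.
        int failed = exitCodeFor(() -> {
            throw new RuntimeException("ZK Failover Controller failed");
        });

        // A real main() would pass the code to System.exit(); the sketch
        // only prints it so that both paths can be observed in one run.
        System.out.println("normal run  -> " + ok);
        System.out.println("failing run -> " + failed);
    }
}
```

With a non-zero code on the failure path, a supervisor that restarts only on failure would see the fatal error as a failure instead of a clean exit.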




