[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage
[ https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613447#comment-15613447 ]

Lei Xia commented on HELIX-608:
---

There is a bug in the zkclient library we are using. In ZkClient.java, _connection and _connection.getZookeeper() never return null until the client is explicitly closed. Once it is closed, a flag (_closed) is set. This flag is checked in retryUntilConnected() before calling the callback. For this reason, neither Helix's extended ZkClient nor the original ZkClient checks for a null pointer in its various retryable operations.

    protected boolean exists(final String path, final boolean watch) {
      ..
      try {
        return retryUntilConnected(new Callable<Boolean>() {
          @Override
          public Boolean call() throws Exception {
            return _connection.exists(path, watch);
          }
        });
      .
    }

    public <T> T retryUntilConnected(Callable<T> callable)
        throws ZkInterruptedException, IllegalArgumentException, ZkException, RuntimeException {
      .
      while (true) {
        if (_closed) {
          throw new IllegalStateException("ZkClient already closed!");
        }
        try {
          return callable.call();
        } catch (ConnectionLossException e) {
          ...
          waitForRetry();
        } catch (SessionExpiredException e) {
          waitForRetry();
        } catch (KeeperException e) {
          throw ZkException.create(e);
        } catch (InterruptedException e) {
          throw new ZkInterruptedException(e);
        } catch (Exception e) {
          throw ExceptionUtil.convertToRuntimeException(e);
        }
        .
      }
    }

However, there is a bug in reconnect(), which closes _connection and then reconnects it. It does not set the _closed flag after closing the connection, so if the reconnect fails, reconnect() returns with _connection null and _closed not set. We then see an NPE if there are still pending reads/writes to retry.

    private void reconnect() {
      getEventLock().lock();
      try {
        _connection.close();
        _connection.connect(this);
      } catch (InterruptedException e) {
        throw new ZkInterruptedException(e);
      } finally {
        getEventLock().unlock();
      }
    }

https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java

The right way is to fix reconnect(). However, since it is a private method, Helix cannot override it. This NPE happens when the client fails to reconnect to the zk server, which should be rare given that zookeeper is supposed to be highly available. Once it happens, though, even if Helix checks against null, we can do nothing more than throw a different exception. Instead, I will open a ticket with the zkClient open source community to convince them to fix the problem.

> NPE and unable to reconnect to zookeeper after a network outage
> ---
>
> Key: HELIX-608
> URL: https://issues.apache.org/jira/browse/HELIX-608
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.7.1
> Reporter: Changgeng Li
> Assignee: Lei Xia
>
> I noticed one of the participant is not a live instance in zookeeper after a
> network outage, while the java process is live. I have to restart the java
> process to make it live again.
> Found following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread] org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368) ~[zaaa.jar:?]
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) [zaaa.jar:?]
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zzookeeperhost:2181,zookeeperhost2.com:2181/a
>     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) ~[zaaa.jar:?]
>     ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure in name resolution
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.7.0_72]
>     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) ~[?:1
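
The failure mode described in the comment above can be reproduced in isolation. The following is a minimal sketch, not zkclient code: `FakeConnection`, `Client`, `markClosedOnFailure`, and `outcome` are hypothetical names invented for illustration, and the `session` field merely stands in for the ZooKeeper handle. It assumes that closing the connection nulls the handle and that a failed connect leaves it null. Setting the closed flag on failure is only one possible reading of "fix reconnect()"; the actual upstream fix may look different.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;

class ReconnectSketch {

    /** Hypothetical stand-in for a connection; 'session' plays the ZooKeeper handle. */
    static final class FakeConnection {
        private Set<String> session = Collections.singleton("/a");
        boolean connectFails;

        void close() {
            session = null; // handle is gone until connect() succeeds
        }

        void connect() {
            if (connectFails) {
                throw new IllegalStateException("Unable to connect");
            }
            session = Collections.singleton("/a");
        }

        boolean exists(String path) {
            return session.contains(path); // NPE if close() ran and connect() failed
        }
    }

    /** Hypothetical client with a closed flag guarding retried operations. */
    static final class Client {
        final FakeConnection connection = new FakeConnection();
        private final boolean markClosedOnFailure;
        private volatile boolean closed;

        Client(boolean markClosedOnFailure) {
            this.markClosedOnFailure = markClosedOnFailure;
        }

        void reconnect() {
            connection.close();
            try {
                connection.connect();
            } catch (RuntimeException e) {
                if (markClosedOnFailure) {
                    closed = true; // assumption: treat a failed reconnect as closed
                }
                throw e;
            }
        }

        <T> T retry(Callable<T> op) throws Exception {
            if (closed) {
                throw new IllegalStateException("ZkClient already closed!");
            }
            return op.call();
        }
    }

    /** Runs a failed reconnect followed by a pending read; reports what the read throws. */
    static String outcome(boolean markClosedOnFailure) {
        Client client = new Client(markClosedOnFailure);
        client.connection.connectFails = true;
        try {
            client.reconnect();
        } catch (IllegalStateException expected) {
            // the reconnect failure itself; the pending read below is the interesting part
        }
        try {
            client.retry(() -> client.connection.exists("/a"));
            return "ok";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(false)); // prints "NullPointerException"
        System.out.println(outcome(true));  // prints "IllegalStateException"
    }
}
```

With the flag left unset, the pending read reaches the null handle and fails with an NPE, as in the logs. With the flag set, the same read fails fast with the "already closed" error instead, which is still an exception but one the caller can recognize.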
[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage
[ https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610469#comment-15610469 ]

Lei Xia commented on HELIX-608:
---

We saw similar exceptions here too. I am investigating it.

> NPE and unable to reconnect to zookeeper after a network outage
> ---
>
> Key: HELIX-608
> URL: https://issues.apache.org/jira/browse/HELIX-608
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.7.1
> Reporter: Changgeng Li
>
> I noticed one of the participant is not a live instance in zookeeper after a
> network outage, while the java process is live. I have to restart the java
> process to make it live again.
> Found following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread] org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368) ~[zaaa.jar:?]
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) [zaaa.jar:?]
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zzookeeperhost:2181,zookeeperhost2.com:2181/a
>     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) ~[zaaa.jar:?]
>     ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure in name resolution
>     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.7.0_72]
>     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) ~[?:1.7.0_72]
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) ~[?:1.7.0_72]
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1246) ~[?:1.7.0_72]
>     at java.net.InetAddress.getAllByName(InetAddress.java:1162) ~[?:1.7.0_72]
>     at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[?:1.7.0_72]
>     at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:387) ~[zaaa.jar:?]
>     at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:332) ~[zaaa.jar:?]
>     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:383) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) ~[zaaa.jar:?]
>     ... 3 more
> INFO 2015-07-28 17:12:15,010 [main-EventThread] org.apache.zookeeper.ClientCnxn: EventThread shut down
> ERROR 2015-07-28 17:12:15,014 [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /zaaa/INSTANCES/10.211.12.21_9000/MESSAGES changed sent to org.apache.helix.manager.zk.ZkCallbackHandler@71bd5cfa]
> java.lang.NullPointerException
>     at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) ~[zaaa.jar:?]
>     at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) ~[zaaa.jar:?]
>     at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) ~[zaaa.jar:?]
>     at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) [zaaa.jar:?]
> ERROR 2015-07-28 17:12:15,015 [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /zaaa/EXTERNALVIEW changed sent to org.apache.helix.manager.zk.ZkCallbackHandler@35d1655]
> java.lang.NullPointerException
>     at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) ~[zaaa.jar:?]
>     at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) ~[zaaa.jar:?]
>     at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) ~[zaaa.jar:?]
>     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) ~[zaaa.jar
[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage
[ https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586546#comment-15586546 ]

kishore gopalakrishna commented on HELIX-608:
-

I will look into the code and get back to you. [~Lei Xu] Do you have any idea on why this might be happening?
[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage
[ https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586528#comment-15586528 ]

Haohui Mai commented on HELIX-608:
--

I have seen this as well. I found that the callback handlers contain ZooKeeper clients that have stale connections. They will throw NPE if the handlers are executed. I found that Helix does not clear up the callback handlers whenever Helix tries to reconnect to ZK. What is the best approach to fix it?