[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage

2016-10-27 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613447#comment-15613447
 ] 

Lei Xia commented on HELIX-608:
---

There is a bug in the zkclient library we are using. In ZkClient.java, _connection 
and _connection.getZookeeper() never return null until the client is 
explicitly closed, and once it is closed, a flag (_closed) is set. This flag 
is checked in retryUntilConnected() before the callback is invoked. For this 
reason, neither Helix's extended ZkClient nor the original ZkClient checks for 
a null pointer in its various retryable operations.

protected boolean exists(final String path, final boolean watch) {
  ..
  try {
    return retryUntilConnected(new Callable<Boolean>() {
      @Override
      public Boolean call() throws Exception {
        return _connection.exists(path, watch);
      }
    });
  .
}

public <T> T retryUntilConnected(Callable<T> callable) throws 
ZkInterruptedException, IllegalArgumentException, ZkException, RuntimeException 
{
  .
  while (true) {
    if (_closed) {
      throw new IllegalStateException("ZkClient already closed!");
    }
    try {
      return callable.call();
    } catch (ConnectionLossException e) {
      ...
      waitForRetry();
    } catch (SessionExpiredException e) {
      waitForRetry();
    } catch (KeeperException e) {
      throw ZkException.create(e);
    } catch (InterruptedException e) {
      throw new ZkInterruptedException(e);
    } catch (Exception e) {
      throw ExceptionUtil.convertToRuntimeException(e);
    }
    .
  }
}

However, there is a bug in reconnect(), which closes _connection and then 
reconnects it. It does not set the _closed flag after closing the connection, 
so if the reconnect fails, reconnect() exits with the ZooKeeper handle inside 
_connection null and _closed not set. We then see an NPE if there are still 
pending reads or writes to retry.

private void reconnect() {
  getEventLock().lock();
  try {
    _connection.close();
    _connection.connect(this);
  } catch (InterruptedException e) {
    throw new ZkInterruptedException(e);
  } finally {
    getEventLock().unlock();
  }
}
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java
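
The failure mode described above can be illustrated with a small 
self-contained model. This is only a sketch, not zkclient code: 
FakeConnection, FakeClient and ReconnectNpeDemo are hypothetical names, the 
event lock and the retry loop are omitted, and the ZooKeeper handle is modeled 
as a plain Set. It assumes that closing the connection drops the handle and 
that a failed connect leaves it null.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZkConnection; not the real zkclient class.
class FakeConnection {
    // Models the ZooKeeper handle held inside the connection.
    private Set<String> handle = Collections.singleton("/a");
    private final boolean connectFails;

    FakeConnection(boolean connectFails) {
        this.connectFails = connectFails;
    }

    void close() {
        handle = null; // the old handle is dropped on close
    }

    void connect() {
        if (connectFails) {
            throw new RuntimeException("Unable to connect");
        }
        handle = Collections.singleton("/a");
    }

    boolean exists(String path) {
        return handle.contains(path); // no null check, as in the retried operations
    }
}

// Hypothetical stand-in for ZkClient, reduced to the parts discussed above.
class FakeClient {
    private final FakeConnection connection;
    private boolean closed = false; // only an explicit close would set this

    FakeClient(FakeConnection connection) {
        this.connection = connection;
    }

    void reconnect() {
        connection.close();
        connection.connect(); // if this throws, closed is still false
    }

    <T> T retryUntilConnected(Callable<T> callable) throws Exception {
        if (closed) {
            throw new IllegalStateException("ZkClient already closed!");
        }
        return callable.call(); // retry loop omitted for brevity
    }

    boolean exists(final String path) throws Exception {
        return retryUntilConnected(new Callable<Boolean>() {
            @Override
            public Boolean call() {
                return connection.exists(path);
            }
        });
    }
}

class ReconnectNpeDemo {
    // Returns the simple name of the exception seen by an operation
    // issued after a failed reconnect.
    static String outcomeAfterFailedReconnect() {
        FakeClient client = new FakeClient(new FakeConnection(true));
        try {
            client.reconnect();
        } catch (RuntimeException e) {
            // In the quoted log this failure surfaces on the ZooKeeper event thread.
        }
        try {
            client.exists("/a");
            return "no exception";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcomeAfterFailedReconnect());
    }
}
```

In this model the closed check passes because nothing set the flag, so the 
pending operation dereferences the missing handle and fails with a 
NullPointerException.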

The right fix belongs in reconnect(); however, since it is a private method, 
Helix cannot override it. This NPE happens when the client fails to reconnect 
to the ZooKeeper server, which should be rare given that ZooKeeper is supposed 
to be highly available. Once it happens, though, even if Helix checks for 
null, we can do nothing more than throw a different exception. Instead, I will 
open a ticket with the zkclient open source community to convince them to fix 
the problem.
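
For illustration only, one possible shape of a fix inside reconnect() is 
sketched below. The comment above does not say what the upstream fix should 
look like, so this is an assumption: a failed reconnect is treated as 
terminal and the closed flag is set, so that pending operations fail with 
the existing "already closed" error instead of an NPE. SketchConnection, 
GuardedClient and ReconnectFixSketch are hypothetical names, and the event 
lock and retry loop are again omitted.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZkConnection; not the real zkclient class.
class SketchConnection {
    private Set<String> handle = Collections.singleton("/a");
    private final boolean connectFails;

    SketchConnection(boolean connectFails) {
        this.connectFails = connectFails;
    }

    void close() {
        handle = null;
    }

    void connect() {
        if (connectFails) {
            throw new RuntimeException("Unable to connect");
        }
        handle = Collections.singleton("/a");
    }

    boolean exists(String path) {
        return handle.contains(path);
    }
}

// Hypothetical client whose reconnect() records a failed reconnect.
class GuardedClient {
    private final SketchConnection connection;
    private volatile boolean closed = false;

    GuardedClient(SketchConnection connection) {
        this.connection = connection;
    }

    void reconnect() {
        connection.close();
        try {
            connection.connect();
        } catch (RuntimeException e) {
            closed = true; // assumption: a failed reconnect is treated as terminal
            throw e;
        }
    }

    <T> T retryUntilConnected(Callable<T> callable) throws Exception {
        if (closed) {
            throw new IllegalStateException("ZkClient already closed!");
        }
        return callable.call(); // retry loop omitted for brevity
    }

    boolean exists(final String path) throws Exception {
        return retryUntilConnected(new Callable<Boolean>() {
            @Override
            public Boolean call() {
                return connection.exists(path);
            }
        });
    }
}

class ReconnectFixSketch {
    // Describes the exception seen by an operation issued after a failed reconnect.
    static String outcomeAfterFailedReconnect() {
        GuardedClient client = new GuardedClient(new SketchConnection(true));
        try {
            client.reconnect();
        } catch (RuntimeException e) {
            // The connect failure itself is still reported to the caller.
        }
        try {
            client.exists("/a");
            return "no exception";
        } catch (Exception e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcomeAfterFailedReconnect());
    }
}
```

Whether a failed reconnect should be terminal, as sketched here, or retried 
until the server is reachable again is a design decision for the zkclient 
maintainers. The sketch only shows that recording the failure lets the 
existing closed check replace the NPE with a descriptive exception.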




> NPE and unable to reconnect to zookeeper after a network outage
> ---
>
> Key: HELIX-608
> URL: https://issues.apache.org/jira/browse/HELIX-608
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
>Affects Versions: 0.7.1
>Reporter: Changgeng Li
>Assignee: Lei Xia
>
> I noticed that one of the participants is not a live instance in ZooKeeper 
> after a network outage, while the Java process is still alive. I have to 
> restart the Java process to make it live again. 
> Found the following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread] 
> org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
> at 
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368) 
> ~[zaaa.jar:?]
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) 
> [zaaa.jar:?]
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) 
> [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to 
> zzookeeperhost:2181,zookeeperhost2.com:2181/a
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) 
> ~[zaaa.jar:?]
> at 
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) 
> ~[zaaa.jar:?]
> ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure 
> in name resolution
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) 
> ~[?:1.7.0_72]
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) 
> ~[?:1

[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage

2016-10-26 Thread Lei Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610469#comment-15610469
 ] 

Lei Xia commented on HELIX-608:
---

We saw similar exceptions here too. I am investigating. 

> NPE and unable to reconnect to zookeeper after a network outage
> ---
>
> Key: HELIX-608
> URL: https://issues.apache.org/jira/browse/HELIX-608
> Project: Apache Helix
>  Issue Type: Bug
>  Components: helix-core
>Affects Versions: 0.7.1
>Reporter: Changgeng Li
>
> I noticed that one of the participants is not a live instance in ZooKeeper 
> after a network outage, while the Java process is still alive. I have to 
> restart the Java process to make it live again. 
> Found the following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread] 
> org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
> at 
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368) 
> ~[zaaa.jar:?]
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531) 
> [zaaa.jar:?]
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) 
> [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to 
> zzookeeperhost:2181,zookeeperhost2.com:2181/a
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) 
> ~[zaaa.jar:?]
> at 
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) 
> ~[zaaa.jar:?]
> ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure 
> in name resolution
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) 
> ~[?:1.7.0_72]
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) 
> ~[?:1.7.0_72]
> at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) 
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName0(InetAddress.java:1246) 
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName(InetAddress.java:1162) 
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName(InetAddress.java:1098) 
> ~[?:1.7.0_72]
> at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:387) 
> ~[zaaa.jar:?]
> at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:332) 
> ~[zaaa.jar:?]
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:383) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935) 
> ~[zaaa.jar:?]
> at 
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458) 
> ~[zaaa.jar:?]
> ... 3 more
> INFO  2015-07-28 17:12:15,010 [main-EventThread] 
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> ERROR 2015-07-28 17:12:15,014 
> [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] 
> org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of 
> /zaaa/INSTANCES/10.211.12.21_9000/MESSAGES changed sent to 
> org.apache.helix.manager.zk.ZkCallbackHandler@71bd5cfa]
> java.lang.NullPointerException
> at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) 
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) 
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) 
> ~[zaaa.jar:?]
> at 
> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) 
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445) 
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) 
> [zaaa.jar:?]
> ERROR 2015-07-28 17:12:15,015 
> [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a] 
> org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of 
> /zaaa/EXTERNALVIEW changed sent to 
> org.apache.helix.manager.zk.ZkCallbackHandler@35d1655]
> java.lang.NullPointerException
> at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95) 
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195) 
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192) 
> ~[zaaa.jar:?]
> at 
> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) 
> ~[zaaa.jar

[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage

2016-10-18 Thread kishore gopalakrishna (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586546#comment-15586546
 ] 

kishore gopalakrishna commented on HELIX-608:
-

I will look into the code and get back to you. [~Lei Xu] Do you have any idea 
why this might be happening?


[jira] [Commented] (HELIX-608) NPE and unable to reconnect to zookeeper after a network outage

2016-10-18 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586528#comment-15586528
 ] 

Haohui Mai commented on HELIX-608:
--

I have seen this as well. I found that the callback handlers hold ZooKeeper 
clients that have stale connections. They throw an NPE when the handlers are 
executed.

I also found that Helix does not clean up the callback handlers when it tries 
to reconnect to ZK. What is the best approach to fix it?
