Yep, this is a safety feature: Helix detects GC-induced flapping and automatically disconnects the instance from the cluster. Unfortunately, in some cases it surfaces as an NPE.
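
For anyone reading this later, here is a rough sketch of what a flapping check like this could look like. It is not the actual ZKHelixManager code. The class and method names are made up for illustration, and treating "more than N disconnects inside the window" as flapping is my assumption. The numbers 5 and 300000 ms come from the maxDisconnectThreshold message in the log quoted further down the thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window flap detector; NOT the Helix implementation.
class FlapDetector {
    private final int maxDisconnects;   // e.g. 5, as reported in the log
    private final long windowMs;        // e.g. 300000 ms, as reported in the log
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlapDetector(int maxDisconnects, long windowMs) {
        this.maxDisconnects = maxDisconnects;
        this.windowMs = windowMs;
    }

    /**
     * Records a disconnect observed at nowMs and reports whether the number of
     * disconnects inside the window now exceeds the threshold (assumed semantics).
     */
    synchronized boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Forget disconnects that have fallen out of the window.
        while (!disconnectTimes.isEmpty()
                && nowMs - disconnectTimes.peekFirst() > windowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnects;
    }
}
```

A caller would feed it a timestamp on every ZK disconnect event and, once it returns true, disconnect the manager and ideally record why, which is what the DISABLE_CAUSE idea below is about.
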
We should probably record the reason for disabling in the instance config. Currently we just disable the node; we should probably add an attribute such as DISABLE_CAUSE:"TOO MANY DISCONNECTS FROM ZK. CHECK JAVA GC LOG".

Thanks,
Kishore G

On Thu, Apr 30, 2015 at 1:18 PM, Vinoth Chandar <[email protected]> wrote:

> Yep, seeing this:
>
> $ grep -i flap /var/log/streamio/streamio.log
> 2015-04-30 16:08:50,823 ERROR - ZKHelixManager - instanceName: ??--checkpointer is flapping. disconnect it. maxDisconnectThreshold: 5 disconnects in 300000ms.
> 2015-04-30 16:09:30,140 ERROR - ZKHelixManager - instanceName: ??-controller- is flapping. disconnect it. maxDisconnectThreshold: 5 disconnects in 300000ms.
> 2015-04-30 16:11:05,679 ERROR - ZKHelixManager - instanceName: ??-controller- is flapping. disconnect it. maxDisconnectThreshold: 5 disconnects in 300000ms.
>
> and confirmed from the logs that it's GCing. (Sorry, I originally had a bad
> dashboard that did not catch this.)
>
> Thanks
> Vinoth
>
> On Thu, Apr 30, 2015 at 12:12 PM, Zhen Zhang <[email protected]> wrote:
>
>> Hi Vinoth,
>>
>> The NPE indicates that the ZooKeeper connection in ZkClient is NULL. The
>> connection becomes NULL only when HelixManager#disconnect() is called. This
>> may happen if you call HelixManager#disconnect() directly, or if there are
>> frequent GCs and HelixManager disconnects itself. You can grep for
>> "KeeperState" to trace the connection state changes.
>>
>> Thanks,
>> Jason
>>
>> On Thu, Apr 30, 2015 at 11:53 AM, Vinoth Chandar <[email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> I am hitting the following with 0.6.5 upon a ZK connection timeout. We
>>> make this call to the PropertyStore to figure out an offset to resume from.
>>> This error eventually puts every partition into an error state, and everything
>>> comes to a grinding halt. Any pointers to troubleshoot this? Regardless, there
>>> shouldn't be an NPE, right?
>>>
>>> NullPointerException
>>>
>>>    - org.apache.helix.manager.zk.ZkClient$4 in call at line 241
>>>    - org.apache.helix.manager.zk.ZkClient$4 in call at line 237
>>>    - org.I0Itec.zkclient.ZkClient in retryUntilConnected at line 675
>>>    - org.apache.helix.manager.zk.ZkClient in readData at line 237
>>>    - org.I0Itec.zkclient.ZkClient in readData at line 761
>>>    - org.apache.helix.manager.zk.ZkBaseDataAccessor in get at line 308
>>>    - org.apache.helix.manager.zk.ZkCacheBaseDataAccessor in get at line 377
>>>    - org.apache.helix.store.zk.AutoFallbackPropertyStore in get at line 100
>>>
>>> Thanks
>>> Vinoth
