It's awesome that there is flap detection. The NPE is misleading, though. Let me file a ticket for these as well (will do by EOW).
Thanks
Vinoth

On Thu, Apr 30, 2015 at 1:39 PM, kishore g <[email protected]> wrote:

> Yep, this is a safety feature where Helix automatically detects GC and
> disconnects from the cluster automatically. Unfortunately, in some cases it
> surfaces as an NPE.
>
> We should probably describe the reason for disabling in the instance
> config. Currently we just disable the node; we should probably add an
> attribute like DISABLE_CAUSE:"TOO MANY DISCONNECTS FROM ZK. CHECK JAVA GC LOG"
> or something like that.
>
> thanks,
> Kishore G
>
> On Thu, Apr 30, 2015 at 1:18 PM, Vinoth Chandar <[email protected]> wrote:
>
>> Yep, seeing this:
>>
>> $ grep -i flap /var/log/streamio/streamio.log
>> 2015-04-30 16:08:50,823 ERROR - ZKHelixManager -
>>   instanceName: ??--checkpointer is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>> 2015-04-30 16:09:30,140 ERROR - ZKHelixManager -
>>   instanceName: ??-controller- is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>> 2015-04-30 16:11:05,679 ERROR - ZKHelixManager -
>>   instanceName: ??-controller- is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>>
>> and confirmed it's GC'ing from the logs. (Sorry, I had a bad dashboard
>> originally that did not catch this.)
>>
>> Thanks
>> Vinoth
>>
>> On Thu, Apr 30, 2015 at 12:12 PM, Zhen Zhang <[email protected]> wrote:
>>
>>> Hi Vinoth,
>>>
>>> The NPE indicates that the ZooKeeper connection in ZkClient is null. The
>>> connection becomes null only when HelixManager#disconnect() is called. This
>>> may happen if you directly call HelixManager#disconnect(), or if there are
>>> frequent GCs and HelixManager disconnects itself. You can grep for
>>> "KeeperState" to trace the connection state changes.
>>>
>>> Thanks,
>>> Jason
>>>
>>>
>>> On Thu, Apr 30, 2015 at 11:53 AM, Vinoth Chandar <[email protected]> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I am hitting the following with 0.6.5, upon a ZK connection timeout.
>>>> We make this call to the PropertyStore to figure out an offset to resume
>>>> from. This error eventually puts every partition into an error state, and
>>>> everything comes to a grinding halt. Any pointers to troubleshoot this?
>>>> Nonetheless, there shouldn't be an NPE, right?
>>>>
>>>> NullPointerException
>>>>
>>>>   - org.apache.helix.manager.zk.ZkClient$4 in call at line 241
>>>>   - org.apache.helix.manager.zk.ZkClient$4 in call at line 237
>>>>   - org.I0Itec.zkclient.ZkClient in retryUntilConnected at line 675
>>>>   - org.apache.helix.manager.zk.ZkClient in readData at line 237
>>>>   - org.I0Itec.zkclient.ZkClient in readData at line 761
>>>>   - org.apache.helix.manager.zk.ZkBaseDataAccessor in get at line 308
>>>>   - org.apache.helix.manager.zk.ZkCacheBaseDataAccessor in get at line 377
>>>>   - org.apache.helix.store.zk.AutoFallbackPropertyStore in get at line 100
>>>>
>>>> Thanks
>>>> Vinoth
>>>
>>>
>>
>
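[For readers following along: the flap detection visible in the logs above ("maxDisconnectThreshold: 5 disconnects in 300000ms") is essentially a sliding-window disconnect counter. Below is a minimal, self-contained sketch of that idea; the `FlapDetector` class and its method names are hypothetical illustrations, not Helix's actual implementation.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FlapDetector {
    private final int maxDisconnectThreshold;    // e.g. 5, as in the logs above
    private final long windowMs;                 // e.g. 300000 ms
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    public FlapDetector(int maxDisconnectThreshold, long windowMs) {
        this.maxDisconnectThreshold = maxDisconnectThreshold;
        this.windowMs = windowMs;
    }

    /** Record a disconnect; returns true if the instance is now flapping. */
    public boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Expire disconnects that have fallen out of the time window.
        while (!disconnectTimes.isEmpty()
                && nowMs - disconnectTimes.peekFirst() > windowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnectThreshold;
    }

    public static void main(String[] args) {
        FlapDetector detector = new FlapDetector(5, 300_000L);
        boolean flapping = false;
        // Six rapid disconnects inside the window trip the detector.
        for (int i = 0; i < 6; i++) {
            flapping = detector.recordDisconnect(i * 1000L);
        }
        System.out.println(flapping);  // prints "true"
    }
}
```

Once the threshold is exceeded, Helix disconnects and disables the instance as a safety measure, which is why subsequent ZkClient reads see a null connection and surface as the NPE above.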
