It's awesome that there is flap detection. The NPE is misleading, though. Let me file a ticket for these as well (will do by EOW).
Thanks
Vinoth

On Thu, Apr 30, 2015 at 1:39 PM, kishore g <[email protected]> wrote:

> Yep, this is a safety feature where Helix automatically detects GC and
> disconnects from the cluster automatically. Unfortunately, in some cases it
> surfaces as an NPE.
>
> We should probably describe the reason for disabling in the instance
> config. Currently we just disable the node; we should probably add an
> attribute like DISABLE_CAUSE:"TOO MANY DISCONNECTS FROM ZK. CHECK JAVA GC LOG"
> or something like that.
>
> thanks,
> Kishore G
>
> On Thu, Apr 30, 2015 at 1:18 PM, Vinoth Chandar <[email protected]> wrote:
>
>> Yep, seeing this:
>>
>> $ grep -i flap /var/log/streamio/streamio.log
>> 2015-04-30 16:08:50,823 ERROR - ZKHelixManager -
>>   instanceName: ??--checkpointer is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>> 2015-04-30 16:09:30,140 ERROR - ZKHelixManager -
>>   instanceName: ??-controller- is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>> 2015-04-30 16:11:05,679 ERROR - ZKHelixManager -
>>   instanceName: ??-controller- is flapping. disconnect it.
>>   maxDisconnectThreshold: 5 disconnects in 300000ms.
>>
>> and confirmed it's GC'ing from the logs. (Sorry, I had a bad dashboard
>> originally that did not catch this.)
>>
>> Thanks
>> Vinoth
>>
>> On Thu, Apr 30, 2015 at 12:12 PM, Zhen Zhang <[email protected]> wrote:
>>
>>> Hi Vinoth,
>>>
>>> The NPE indicates that the ZooKeeper connection in ZkClient is null. The
>>> connection becomes null only when HelixManager#disconnect() is called. This
>>> may happen if you directly call HelixManager#disconnect(), or if there are
>>> frequent GCs and HelixManager disconnects itself. You can grep for
>>> "KeeperState" to trace the connection state changes.
>>>
>>> Thanks,
>>> Jason
>>>
>>>
>>> On Thu, Apr 30, 2015 at 11:53 AM, Vinoth Chandar <[email protected]> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I am hitting the following with 0.6.5, upon a ZK connection timeout.
>>>> We make this call to the PropertyStore to figure out an offset to resume
>>>> from. This error eventually puts every partition into an error state, and
>>>> everything comes to a grinding halt. Any pointers to troubleshoot this?
>>>> Nonetheless, there shouldn't be an NPE, right?
>>>>
>>>> NullPointerException
>>>>
>>>>   - org.apache.helix.manager.zk.ZkClient$4 in call at line 241
>>>>   - org.apache.helix.manager.zk.ZkClient$4 in call at line 237
>>>>   - org.I0Itec.zkclient.ZkClient in retryUntilConnected at line 675
>>>>   - org.apache.helix.manager.zk.ZkClient in readData at line 237
>>>>   - org.I0Itec.zkclient.ZkClient in readData at line 761
>>>>   - org.apache.helix.manager.zk.ZkBaseDataAccessor in get at line 308
>>>>   - org.apache.helix.manager.zk.ZkCacheBaseDataAccessor in get at line 377
>>>>   - org.apache.helix.store.zk.AutoFallbackPropertyStore in get at line 100
>>>>
>>>> Thanks
>>>> Vinoth
>>>
>>>
>>
>
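[For readers following along: the flap detection visible in the logs above ("maxDisconnectThreshold: 5 disconnects in 300000ms") is essentially a sliding-window disconnect counter. Below is a minimal, self-contained sketch of that idea; the `FlapDetector` class and its method names are hypothetical illustrations, not Helix's actual implementation.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FlapDetector {
    private final int maxDisconnectThreshold;    // e.g. 5, as in the logs above
    private final long windowMs;                 // e.g. 300000 ms
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    public FlapDetector(int maxDisconnectThreshold, long windowMs) {
        this.maxDisconnectThreshold = maxDisconnectThreshold;
        this.windowMs = windowMs;
    }

    /** Record a disconnect; returns true if the instance is now flapping. */
    public boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Expire disconnects that have fallen out of the time window.
        while (!disconnectTimes.isEmpty()
                && nowMs - disconnectTimes.peekFirst() > windowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnectThreshold;
    }

    public static void main(String[] args) {
        FlapDetector detector = new FlapDetector(5, 300_000L);
        boolean flapping = false;
        // Six rapid disconnects inside the window trip the detector.
        for (int i = 0; i < 6; i++) {
            flapping = detector.recordDisconnect(i * 1000L);
        }
        System.out.println(flapping);  // prints "true"
    }
}
```

Once the threshold is exceeded, Helix disconnects and disables the instance as a safety measure, which is why subsequent ZkClient reads see a null connection and surface as the NPE above.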
