Question about spectator behavior whenever it is under zookeeper flapping

Hang Qi Fri, 27 Jun 2014 11:17:40 -0700

Hi folks,

We are using Helix 0.6.3 to build our storage system, and our clients rely
on the spectator to route traffic to the corresponding node.


It works very well. However, we recently hit an issue where almost all the
clients fail to route traffic, and the log shows:

ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName: hostname
is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects in
300000ms.

Looking at the code, there is a flapping detection mechanism in
ZKHelixManager. When ZooKeeper is flapping, the manager disconnects itself,
and the disconnect() method in turn calls resetHandlers, which resets the
routingTableProvider, so the routing table becomes empty.
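
To make the log message above concrete, here is a rough sketch of how such
a sliding-window check could work. It is not the actual ZKHelixManager
code: the class name FlappingDetector and its method are hypothetical, and
I am assuming that "flapping" means more than maxDisconnectThreshold
disconnects inside the time window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of flapping detection; not Helix source code.
final class FlappingDetector {
    private final int maxDisconnectThreshold; // e.g. 5
    private final long windowMs;              // e.g. 300000
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlappingDetector(int maxDisconnectThreshold, long windowMs) {
        this.maxDisconnectThreshold = maxDisconnectThreshold;
        this.windowMs = windowMs;
    }

    // Records one disconnect at nowMs and reports whether the number of
    // disconnects still inside the window exceeds the threshold
    // (assumed semantics: strictly greater than).
    synchronized boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Drop disconnects that are older than the window.
        while (!disconnectTimes.isEmpty()
                && nowMs - disconnectTimes.peekFirst() > windowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnectThreshold;
    }
}
```

With a threshold of 5 and a window of 300000 ms, the sixth disconnect
within five minutes is the one that would trip this check.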

Browsing JIRA, I found that this feature was introduced by HELIX-31 and
HELIX-32. I like the idea of detecting ZooKeeper flapping and disconnecting
when it happens for participants and controllers, since that makes the
whole cluster more stable.

From the spectator's perspective, however, I think the more reasonable
behavior is to keep using the most recent state read from ZooKeeper even if
ZooKeeper is down. In addition, it should keep retrying the connection to
ZooKeeper, or provide some callback to let the client know. What do you
think?
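
As a rough client-side illustration of "keep using the most recent state",
something like the following wrapper could hold the last non-empty routing
view. This is only a sketch under my own assumptions: the class
LastKnownRoutingTable is hypothetical, it models the routing table as a
plain map from partition name to instance names, and it treats any empty
update as a reset to ignore, which would be wrong if the table can
legitimately become empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side cache; not part of Helix.
final class LastKnownRoutingTable {
    private volatile Map<String, List<String>> snapshot = Collections.emptyMap();

    // Replaces the snapshot unless the new view is empty, in which case
    // the previous snapshot is kept (assumption: empty means "reset").
    void update(Map<String, List<String>> latest) {
        if (latest == null || latest.isEmpty()) {
            return;
        }
        Map<String, List<String>> copy = new HashMap<>();
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            copy.put(e.getKey(),
                     Collections.unmodifiableList(new ArrayList<>(e.getValue())));
        }
        snapshot = Collections.unmodifiableMap(copy);
    }

    // Returns the last known instances for a partition, or an empty list.
    List<String> instancesFor(String partition) {
        List<String> instances = snapshot.get(partition);
        return instances == null ? Collections.<String>emptyList() : instances;
    }
}
```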

So my question is: what is the most practical way to handle this in the
client? Currently we work around it by increasing the value of
helixmanager.maxDisconnectThreshold. Is there any callback I could register
to get notified about the disconnect event? Or does polling
HelixManager#isConnected work?
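
To show what I mean by polling, here is a minimal sketch. The
ConnectionWatcher class and its callbacks are hypothetical, and the
connection check is passed in as a BooleanSupplier so the sketch does not
depend on Helix; in our client it would presumably be
manager::isConnected.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; not part of Helix.
final class ConnectionWatcher {
    private final BooleanSupplier isConnected;
    private final Runnable onDisconnect;
    private final Runnable onReconnect;
    private boolean lastConnected = true; // assume we start connected

    ConnectionWatcher(BooleanSupplier isConnected,
                      Runnable onDisconnect,
                      Runnable onReconnect) {
        this.isConnected = isConnected;
        this.onDisconnect = onDisconnect;
        this.onReconnect = onReconnect;
    }

    // Checks the connection once and fires a callback only when the
    // state changed since the previous poll.
    synchronized void poll() {
        boolean connectedNow = isConnected.getAsBoolean();
        if (lastConnected && !connectedNow) {
            onDisconnect.run();
        } else if (!lastConnected && connectedNow) {
            onReconnect.run();
        }
        lastConnected = connectedNow;
    }
}
```

The poll() method could be scheduled with
ScheduledExecutorService#scheduleWithFixedDelay, e.g. every few seconds,
and the disconnect callback is where the client could raise an alert or
try to connect again.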

Thanks
Hang Qi
