Hi folks,

We are using Helix 0.6.3 to build our storage system, and our clients rely on the spectator to route traffic to the corresponding node.

This has worked very well. However, we recently hit an issue where almost all of the clients failed to route traffic, and the log showed:

ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName: hostname is flapping. diconnect it. maxDisconnectThreshold: 5 disconnects in 300000ms.

Looking at the code, there is a flapping detection mechanism in ZKHelixManager. When the ZooKeeper connection flaps, the manager disconnects itself. The disconnect() method in turn calls resetHandlers, which resets the routingTableProvider, so the routing table becomes empty.

Browsing JIRA, I found that this feature was introduced by HELIX-31 and HELIX-32. I like the idea of detecting ZooKeeper flapping and disconnecting when it happens for participants and controllers, since that makes the whole cluster more stable. From the spectator's perspective, however, I think the more reasonable behavior is to keep using the most recent state it received from ZooKeeper, even if ZooKeeper is down. Besides that, it should either keep retrying the connection to ZooKeeper or provide some callback to let the client know. What do you think?

So my question is: what is the most practical way to handle this in the client? Currently our workaround is to increase the value of helixmanager.maxDisconnectThreshold. Is there any callback I could register to get notified about the disconnect event? Would polling HelixManager#isConnected work?

Thanks,
Hang Qi
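
P.S. To make the polling idea concrete, here is a rough sketch of what I have in mind. It is only an illustration: ConnectionWatchdog and the onDisconnected callback are hypothetical names of my own, not part of Helix. The only Helix call it assumes is HelixManager#isConnected(), passed in as a BooleanSupplier (for example helixManager::isConnected), and I am not sure whether that flag reliably reflects the flapping disconnect.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not a Helix class: polls a connectivity check and
// fires a callback once per connected -> disconnected transition.
class ConnectionWatchdog {
    private final BooleanSupplier isConnected; // assumed: helixManager::isConnected
    private final Runnable onDisconnected;     // hypothetical client-side callback
    private boolean lastSeenConnected = true;

    ConnectionWatchdog(BooleanSupplier isConnected, Runnable onDisconnected) {
        this.isConnected = isConnected;
        this.onDisconnected = onDisconnected;
    }

    // Runs a single check and returns the current state. The callback is
    // invoked only when the state changes from connected to disconnected,
    // so a long outage does not trigger it on every poll.
    synchronized boolean pollOnce() {
        boolean now = isConnected.getAsBoolean();
        if (lastSeenConnected && !now) {
            onDisconnected.run();
        }
        lastSeenConnected = now;
        return now;
    }

    // Starts periodic polling on a single background thread. The caller
    // owns the returned executor and should shut it down on exit.
    ScheduledExecutorService start(long periodMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                pollOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed check does not cancel future polls.
            }
        }, periodMs, periodMs, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

In the callback the client could, for instance, keep serving from a cached copy of the last routing table and try to reconnect, but that part depends on the answer to the questions above.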
