Hi Hang, may I know why the connections between router and zookeeper are flapping? Is it caused by GC on routers?
Thanks,
Jason

On Fri, Jun 27, 2014 at 11:40 AM, kishore g wrote:

> Hi Hang,
>
> Good point. I agree that the handling of flapping should be different
> based on the role. For now, we have focused on the participant, but as you
> have explained, it's not the right thing to do for a spectator.
>
> Keeping the latest information is the right thing to do in a spectator. We
> should probably create a JIRA and go over the possible solutions.
>
> So there are a couple of things we need to decide:
> -- Keep the latest information
> -- Retry connecting to ZooKeeper
> -- How do we provide a callback to the client if they need custom logic?
>
> Polling HelixManager.isConnected should work, but it's possible to miss that
> event. For example, if your polling interval is 10 seconds and the disconnect
> and reconnect both happen within that interval, the client may not notice.
>
> Ideally we want to avoid clients understanding the ZooKeeper
> state/internals. In the long term this will allow us to plug in a different
> backend for storing state information.
>
> Thanks,
> Kishore G
>
> On Fri, Jun 27, 2014 at 11:14 AM, Hang Qi wrote:
>
>> Hi folks,
>>
>> We are using Helix 0.6.3 to build our storage system, and our clients
>> rely on the spectator to route traffic to the corresponding node.
>>
>> It works very well. However, we have encountered an issue where almost
>> all the clients fail to route traffic, and the log shows:
>>
>> ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName:
>> hostname is flapping. diconnect it. maxDisconnectThreshold: 5 disconnects
>> in 300000ms.
>>
>> Looking at the code, there is a flapping detection mechanism in
>> ZKHelixManager. When the ZooKeeper connection flaps, the manager
>> disconnects itself, and the disconnect() method in turn calls
>> resetHandlers, which resets the routingTableProvider, so the routingTable
>> becomes empty.
>>
>> When browsing the JIRA, I found that this feature was introduced by
>> helix-31 and helix-32. I like the idea of detecting ZooKeeper flapping and
>> disconnecting when it happens for the participant and controller; that
>> makes the whole cluster more stable.
>>
>> However, from the spectator's perspective, I think the more reasonable
>> behavior is to keep using the most up-to-date state from ZooKeeper even if
>> ZooKeeper is down. Besides, it should keep retrying to connect to
>> ZooKeeper, or provide some callback to let the client know. What do you
>> think?
>>
>> So my question is, what is the most practical way to handle this in the
>> client? Currently our workaround is to increase the value of
>> helixmanager.maxDisconnectThreshold. Is there any callback I could register
>> to get notified about the disconnect event? Does polling
>> HelixManager#isConnected work?
>>
>> Thanks
>> Hang Qi
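
For readers following the thread, the flapping check described in the log message ("5 disconnects in 300000ms") can be pictured as a sliding-window counter. The sketch below is only an illustration of that idea, not Helix's actual code: the class name `FlapDetector`, the method `recordDisconnect`, and the choice to treat strictly more than the threshold as flapping are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window flap detector (illustration only,
 * not the ZKHelixManager implementation).
 */
final class FlapDetector {
    private final int maxDisconnects;  // e.g. 5, as in the log message
    private final long windowMillis;   // e.g. 300000, as in the log message
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlapDetector(int maxDisconnects, long windowMillis) {
        this.maxDisconnects = maxDisconnects;
        this.windowMillis = windowMillis;
    }

    /**
     * Records a disconnect at nowMillis and returns true when the number of
     * disconnects inside the window exceeds the threshold (assumed semantics).
     */
    synchronized boolean recordDisconnect(long nowMillis) {
        disconnectTimes.addLast(nowMillis);
        // Drop disconnects that are older than the window.
        while (!disconnectTimes.isEmpty()
                && nowMillis - disconnectTimes.peekFirst() > windowMillis) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnects;
    }
}
```

With a detector like this, a spectator could choose to keep serving its last known routing table and keep retrying when `recordDisconnect` returns true, rather than resetting its handlers; that policy is the proposal under discussion in the thread, not existing behavior.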
