Hi,

We've been seeing a problem with our ZooKeeper servers lately, where a session suddenly loses some of the watchers registered on some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka cluster in one DC that establishes sessions (with a 6-second timeout) with a ZK cluster (of 4 machines) in another DC and registers watchers on some ZooKeeper paths. Every couple of weeks we observe a problem with the Kafka servers, and on investigating further we find that the session has lost some of its key watches, but not all.
The last time this happened, we ran the wchc command on the ZK servers and saw the problem. Unfortunately, we lost the relevant information from the ZK logs by the time we were ready to debug it further. Since this causes the Kafka servers to stop making progress, we want to set up some kind of alert for when this happens. This will help us collect more information to give you.

In particular, we were thinking about running wchp periodically (maybe once a minute), grepping for the ZK paths, and counting the number of watches that should be registered for correct operation. But I observed that the watcher info is not replicated across all ZK servers, so we would have to query every ZK server in order to get the full list. I'm not sure running wchp periodically on all ZK servers is the best option for this alert.

Can you think of what the problem could be here, and how we can set up this alert for now?

Thanks,
Neha
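P.S. To make the proposed check concrete, here is a minimal Python sketch of the wchp polling idea described above. It is only an illustration under some assumptions: that wchp prints each watched path on its own line followed by indented session ids, that the four-letter commands are reachable on the client port, and that the host names, port, paths, and expected counts passed in are placeholders (none of them come from our real setup).

```python
import socket
from collections import defaultdict


def four_letter_word(host, port, cmd="wchp", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_wchp(text):
    """Parse wchp output into {path: set(session_ids)}.

    Assumed format: a path line, then one indented line per session id.
    """
    watches = defaultdict(set)
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":
            if current is not None:
                watches[current].add(line.strip())
        else:
            current = line.strip()
    return watches


def merge(per_server):
    """Union the per-server maps, since each server only knows its own sessions."""
    merged = defaultdict(set)
    for watches in per_server:
        for path, sessions in watches.items():
            merged[path] |= sessions
    return merged


def missing_watches(watches, expected):
    """Return {path: actual_count} for paths with fewer watches than expected."""
    return {
        path: len(watches.get(path, ()))
        for path, minimum in expected.items()
        if len(watches.get(path, ())) < minimum
    }


def check_cluster(servers, expected):
    """Query every server in `servers` (list of (host, port)) and report gaps."""
    per_server = [parse_wchp(four_letter_word(h, p)) for h, p in servers]
    return missing_watches(merge(per_server), expected)
```

Calling something like check_cluster([("zk1.example.com", 2181), ("zk2.example.com", 2181)], {"/some/path": 3}) once a minute and alerting on a non-empty result would be the whole check. It still has to hit every ZK server, which is exactly the part I'm unsure about.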
