Hi Jun,

It depends. You might just re-register the watch on another node (specifically, the original node minus the chroot). This case is really easy to test, even on a single, locally running instance: create a watch, then print out the watches using the wchc or wchp four-letter words. Restart ZooKeeper. After the client automatically reconnects, rerun the four-letter word to observe what happened to the watch.
-Jamie

On Nov 7, 2011, at 7:27 PM, Jun Rao <[email protected]> wrote:

> Jamie,
>
> We do use chroot. However, the chroot problem would lose all watchers, not
> just some watchers, right?
>
> Thanks,
>
> Jun
>
> On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder <[email protected]> wrote:
>
>> Hi Neha,
>>
>> I encountered a similar problem with ZooKeeper losing watches and found
>> that it was related to this bug:
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>>
>> Are you using a chroot?
>>
>> Thanks,
>> Jamie
>>
>> On Wed, Nov 2, 2011 at 1:16 PM, Neha Narkhede <neha. @gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We've been seeing a problem with our ZooKeeper servers lately, where
>>> all of a sudden a session loses some of the watchers registered on
>>> some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
>>> cluster in one DC establishing sessions (with a 6-second timeout) with a
>>> ZK cluster (of 4 machines) in another DC, and it registers watchers on
>>> some ZooKeeper paths. Every couple of weeks, we observe some problem
>>> with the Kafka servers, and on investigating further we find that the
>>> session lost some of the key watches, but not all.
>>>
>>> The last time this happened, we ran the wchc command on the ZK servers
>>> and saw the problem. Unfortunately, we had lost the relevant information
>>> from the ZK logs by the time we were ready to debug it further. Since
>>> this causes the Kafka servers to stop making progress, we want to set up
>>> some kind of alert for when this happens. That will help us collect more
>>> information to give you. In particular, we were thinking about running
>>> wchp periodically (maybe once a minute), grepping for the ZK paths, and
>>> counting the number of watches that should be registered for correct
>>> operation. But I observed that the watcher info is not replicated
>>> across all ZK servers, so we would have to query every ZK server in
>>> order to get the full list.
>>>
>>> I'm not sure running wchp periodically on all ZK servers is the best
>>> option for this alert. Can you think of what the problem could be here,
>>> and how we can set up this alert for now?
>>>
>>> Thanks,
>>> Neha
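The monitoring idea discussed in the thread (polling wchp on each server and counting watches per path) can be sketched with a small client. ZooKeeper's four-letter words are plain text sent over the regular client port, and the server closes the connection after replying. This is a minimal sketch, not Kafka's or ZooKeeper's own tooling; the host/port values and the assumption that wchp prints each watched path followed by indented session IDs are illustrative and should be checked against your ZooKeeper version's actual output.

```python
import socket

def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter word (e.g. b"wchp") over the client
    port and return the full text reply. The server answers and then
    closes the connection, so we read until EOF."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def watch_counts(reply):
    """Parse a wchp-style reply into {path: number_of_watching_sessions},
    assuming each watched path appears on its own line followed by
    indented session-id lines."""
    counts = {}
    current = None
    for line in reply.splitlines():
        if not line.strip():
            continue
        if not line.startswith(("\t", " ")):
            current = line.strip()
            counts.setdefault(current, 0)
        elif current is not None:
            counts[current] += 1
    return counts
```

Since watch information is per-server (as Neha observed), an alert would run `four_letter_word(host, 2181, b"wchp")` against every member of the ensemble, merge the resulting counts, and compare them with the number of watches expected for correct operation.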
