On Thu, May 24, 2012 at 3:42 PM, Matthew Ward <[email protected]> wrote:
> I have a couple theories and questions I was hoping to clear up (all java
> based 3.3.4):
> 1) I have been trying to troubleshoot the reason for high system wait time on
> one of our zookeeper instances. The theory I have is that setting watches
> increases the system wait load. Does this theory sound accurate?

The two most common causes of high latency are GC/swapping and high disk utilization on the transaction log (WAL). Check for those first. Have you seen this page?
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting

Given that you mention AWS in question 2, that might also be related. Remember that you're not accessing the disk(s) directly, so disk issues are even more likely. The main issue is that we need to fsync the txnlog before responding to the proposal. (I often use strace on the fsync/fdatasync calls to track/graph this.)

> 2) Question 2 is a follow up to the first... whenever I do a watch and wait
> for the event, I have an 'insurance policy' (since AWS is fun...) of setting
> a mutex with a timeout, before retrying the operation and potentially setting
> another watch. How does zookeeper handle duplicate watches? Am I exacerbating
> the system wait load issue by setting duplicate watches? Is there a way I
> should cancel the watch?

A particular session can establish only a single watch on a particular path. Setting the same watch again has no negative effect (other than a round-trip read to the server, of course).

Patrick
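
To illustrate the point about duplicate watches, here is a toy model in plain Java. It is not ZooKeeper's actual code, and the names `WatchModel`, `addWatch`, and `trigger` are made up for illustration. It assumes watches are tracked as a set of session ids per path, so a repeated registration from the same session changes nothing, and that a watch is one-shot and is cleared when it fires.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only (hypothetical, not ZooKeeper code): one watch per session per path. */
class WatchModel {
    // path -> ids of the sessions currently watching that path
    private final Map<String, Set<Long>> watches = new HashMap<>();

    /** Registers a watch; returns false if this session already watches the path. */
    boolean addWatch(String path, long sessionId) {
        return watches.computeIfAbsent(path, p -> new HashSet<>()).add(sessionId);
    }

    /** Fires the one-shot watches on a path; returns how many sessions were notified. */
    int trigger(String path) {
        Set<Long> sessions = watches.remove(path);
        return sessions == null ? 0 : sessions.size();
    }

    public static void main(String[] args) {
        WatchModel model = new WatchModel();
        System.out.println(model.addWatch("/lock", 1L)); // true: first registration
        System.out.println(model.addWatch("/lock", 1L)); // false: duplicate is a no-op
        System.out.println(model.trigger("/lock"));      // 1: a single notification
        System.out.println(model.trigger("/lock"));      // 0: the watch was cleared
    }
}
```

In this model, retrying after a timeout and setting the watch again costs only the extra request; it does not accumulate additional watches for the session.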
