What's your filesystem & vm.dirty_ratio? May be your OS flushes a lot sometimes. See e.g. http://www.sysxperts.com/home/announce/vmdirtyratioandvmdirtybackgroundratio
2012/11/7 mattgordon <[email protected]> > We have been trying to understand why our ZooKeeper cluster will > occasionally > have a wave of connection closed exceptions. We have switched to > -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode for garbage collection with > no noticeable improvements. > > The symptoms are: > > (1) All nodes show messages like "fsync-ing the write ahead log in > SyncThread:0 took 6309ms which will adversely effect operation latency. See > the ZooKeeper troubleshooting guide" with times typically around 5 seconds. > At least once, this fsync appeared in the leaders log immediately before a > wave of: > > ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception: > java.nio.channels.CancelledKeyException > at > sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) > at > sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) > at > > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418) > at > > org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509) > at > > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171) > at > > org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) > > Our clients received ZookeeperConnectionClosed exceptions at this time and > all traffic on the ZooKeeper cluster essentially went to zero for a moment > before resuming normal operation with new connections. > > (2) Probably unrelated since I haven't correlated it temporally with the > client errors, but running "sudo strace -r -T -f -p 9574 -e > trace=fsync,fdatasync -o trace.txt" turns up some messages like "10581 > 0.000246 — SIGSEGV (Segmentation fault) @ 0 (0) —" > > > ZK Version: 3.3.4 > Cluster has 5 nodes running in EC2 > > Here is a screenshot showing ZooKeeper network traffic going to zero at the > time of the connection closed exceptions: http://i.imgur.com/dfNh0.png > > > > Anyone have ideas on what the cause of these "waves" of > CancelledKeyExceptions could be from? > > -- Best regards, Vitalii Tymchyshyn
