Hello, I am running ZK on a shared disk (I know, I shouldn't be, but I am constrained right now) alongside Kafka 0.8 beta. We are seeing very long fsync times (according to the logs), followed by our Kafka clients losing their connection. Kafka attempts to reconnect a few times and eventually dies because it hits the maximum number of retry attempts.

The fsync warnings are shown below:

2014-01-23 13:18:38,746 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 12762ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:23:41,332 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 7552ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:28:49,656 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 6350ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:33:45,063 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 1039ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:34:00,024 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 9490ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:44:09,003 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 8747ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

These are also followed by some of these for good measure:

2014-01-23 13:49:19,427 [myid:] - ERROR [SyncThread:0:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
    at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)

The way I see it, I currently have two problems:

1) The ZK setup is an issue because of the shared disk, and
2) Kafka clients do not automatically recover once they hit the maximum number of retries.

I am looking for a way to at least mitigate the ZooKeeper issue. Perhaps I could modify the timeouts so that the Kafka clients don't fail the way they do now.

What are the best ways to mitigate the issue for now, given that I am limited to a single disk? Increasing tickTime?

My current ZK config is the default that comes with version 3.4.5, so the tickTime is 2000. My Kafka clients have defined the zktimeout variable to be 30000.

I realize that this is a ZooKeeper mailing list and that part of this concerns Kafka, but right now I cannot pinpoint the exact cause of my problems, and it appears to me that ZK is the culprit.

Thanks
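
P.S. To make the timeout question concrete, here is a rough sketch of how I understand the session timeout to be negotiated. It assumes the documented server defaults (minSessionTimeout = 2 x tickTime, maxSessionTimeout = 20 x tickTime) and that neither is overridden in zoo.cfg. The function name is my own and purely illustrative; it is not a ZooKeeper API.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_session_ms=None, max_session_ms=None):
    """Clamp the client's requested session timeout into the server's bounds.

    Assumption: when minSessionTimeout / maxSessionTimeout are not set,
    the server falls back to 2 * tickTime and 20 * tickTime respectively.
    """
    lower = 2 * tick_time_ms if min_session_ms is None else min_session_ms
    upper = 20 * tick_time_ms if max_session_ms is None else max_session_ms
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    tick_time_ms = 2000  # default tickTime from my zoo.cfg

    # My current client setting (zktimeout = 30000) fits inside 4000..40000.
    print(negotiated_session_timeout_ms(30000, tick_time_ms))

    # Asking for 60000 from the client alone would be capped at 20 * tickTime.
    print(negotiated_session_timeout_ms(60000, tick_time_ms))
```

If this is right, raising the client timeout much beyond 40000 would have no effect unless tickTime or maxSessionTimeout is raised on the server as well.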
