Problems with running ZK on a shared disk

Ahmed H. Thu, 23 Jan 2014 10:54:44 -0800

Hello,

I am running ZK on a shared disk (I know, I shouldn't be, but I am
constrained right now) alongside Kafka 0.8 beta. We are seeing very long
fsync times (according to the logs), after which our Kafka clients lose
their connection. Kafka attempts to reconnect a few times and eventually
dies because it hits the maximum number of retry attempts.


The fsync warnings are shown below:

2014-01-23 13:18:38,746 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 12762ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:23:41,332 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 7552ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:28:49,656 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 6350ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:33:45,063 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 1039ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:34:00,024 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 9490ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:44:09,003 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 8747ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
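
To compare these stalls against the client timeout, the durations can be pulled out of the log text. The following is a minimal sketch, not part of any ZooKeeper tooling: the function name and the regular expression are illustrative assumptions based only on the warning text quoted above, and the sample is two of those warnings.

```python
import re

# Matches the duration in the slow-fsync warning quoted above,
# e.g. "... in SyncThread:0 took 12762ms ...". The pattern is an
# assumption derived from these log lines only.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in \S+ took (\d+)ms")


def fsync_durations_ms(log_text):
    """Return every slow-fsync duration (in ms) found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace
    # (including newlines) into single spaces before matching.
    flat = " ".join(log_text.split())
    return [int(ms) for ms in FSYNC_RE.findall(flat)]


sample = """
2014-01-23 13:18:38,746 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 12762ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:33:45,063 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 1039ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
"""

durations = fsync_durations_ms(sample)
print("worst fsync stall (ms):", max(durations))
```

For the two sample warnings the worst stall is 12762 ms, a little under half of a 30000 ms client timeout.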


This is also followed by some of these for good measure:

2014-01-23 13:49:19,427 [myid:] - ERROR [SyncThread:0:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
    at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)


The way I see it, I currently have two problems: 1) the ZK setup is an
issue because of the shared disk, and 2) the Kafka clients do not
automatically recover once they hit the maximum number of retries. I am
looking for a way to at least mitigate the ZooKeeper issue. Perhaps I
could modify the timeouts so that the Kafka clients don't fail like
they do...

What are the best ways to mitigate the issue for now, as I am limited to a
single disk? Increasing tickTime? My current ZK config is the default that
comes with version 3.4.5, so the tickTime is 2000. My Kafka clients have
defined the zktimeout variable to be 30000.
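
For reference, here is a small sketch of how the two settings interact. It assumes the clients' 30000 is passed to ZooKeeper as the requested session timeout in milliseconds, and that minSessionTimeout and maxSessionTimeout are left at their defaults of 2 x tickTime and 20 x tickTime. The function name is hypothetical; it only mirrors the clamping the server applies and is not a ZooKeeper API.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_ms=None, max_ms=None):
    """Clamp a requested session timeout into the server's allowed range.

    Defaults follow the documented ZooKeeper behaviour when
    minSessionTimeout / maxSessionTimeout are not set:
    2 * tickTime and 20 * tickTime respectively.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# Values from this setup: tickTime=2000, client timeout 30000.
print(negotiated_session_timeout_ms(30000, 2000))
# A larger request is capped at 20 * tickTime unless tickTime
# (or maxSessionTimeout) is raised as well.
print(negotiated_session_timeout_ms(60000, 2000))
print(negotiated_session_timeout_ms(60000, 4000))
```

With tickTime at 2000 the allowed range is 4000 to 40000 ms, so the current 30000 is granted as requested, and any client timeout above 40000 would also need a larger tickTime or an explicit maxSessionTimeout to take effect.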

I realize that this is a ZooKeeper mailing list and that Kafka is also
involved. I cannot yet pinpoint the exact cause of my problems, but it
appears to me that ZK is the culprit.

Thanks
