> Benjamin: is it possible there was a partition in your network, where
> node 2 and node 3 weren’t able to communicate with each other?
Well I guess it's possible but I can't confirm that.

I'm trying to get the most out of what the logs are saying.

On the leader, this seems to be the heart of the problem:

2015-01-04 16:18:21,897 [myid:2] - WARN
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
connection to peer due to transaction timeout.
2015-01-04 16:18:21,898 [myid:2] - WARN
[LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE
/204.53.107.249:43402 ********
2015-01-04 16:18:21,905 [myid:2] - WARN
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
connection to peer due to transaction timeout.
2015-01-04 16:18:21,907 [myid:2] - WARN
[LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE
/204.53.107.247:45953 ********

1) What transaction are we talking about?
2) Does it ever retry?
3) Is that what's causing the interrupted exception?

Also on NODE1:
fsync-ing the write ahead log in SyncThread:1 took 11024ms

4) Is the write-ahead log just the log for individual ZK writes, or is it
for the whole snapshot?
5) Is it configured with syncLimit=5? (So 10 seconds with a 2-second
tickTime?)
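
To make question 5 concrete, here is a rough sketch of the arithmetic I
have in mind. It assumes the timeout is simply syncLimit * tickTime, and
that we really run with syncLimit=5 and tickTime=2000 (I have not
confirmed either from zoo.cfg). The function name is made up for
illustration; the fsync figures are copied from the NODE1 log line.

```python
from datetime import datetime, timedelta


def sync_timeout_ms(tick_time_ms: int, sync_limit_ticks: int) -> int:
    """Assumed follower timeout: syncLimit ticks of tickTime milliseconds each."""
    return tick_time_ms * sync_limit_ticks


# Hypothetical settings taken from question 5, not from the real zoo.cfg.
TICK_TIME_MS = 2000
SYNC_LIMIT = 5

# Values copied from the NODE1 fsync warning.
FSYNC_DURATION_MS = 11024
FSYNC_LOGGED_AT = datetime.strptime("2015-01-04 16:18:22,259",
                                    "%Y-%m-%d %H:%M:%S,%f")

timeout_ms = sync_timeout_ms(TICK_TIME_MS, SYNC_LIMIT)
overrun_ms = FSYNC_DURATION_MS - timeout_ms
# The warning is logged once the fsync completes, so subtract its duration.
fsync_started_at = FSYNC_LOGGED_AT - timedelta(milliseconds=FSYNC_DURATION_MS)

print(f"assumed timeout:      {timeout_ms} ms")
print(f"fsync overran it by:  {overrun_ms} ms")
print(f"fsync started around: {fsync_started_at.strftime('%H:%M:%S.%f')[:-3]}")
```

If those assumptions hold, the 11024 ms fsync on NODE1 is by itself longer
than the 10000 ms window, which might explain why the leader dropped NODE1.
It would not explain why the other follower was dropped too.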

The ZK log output is pretty obscure, but maybe it's just me.
Thanks again for any clue on how to diagnose this.

Benjamin



On Thu, Jan 8, 2015 at 9:54 AM, Ibrahim El-sanosi (PGR) <
[email protected]> wrote:

> Yes, correct.
>
> Ibrahim
>
> -----Original Message-----
> From: Sékine Coulibaly [mailto:[email protected]]
> Sent: Thursday, January 08, 2015 12:36 PM
> To: [email protected]
> Subject: Re: Failover when one node fails to write on the disk?
>
> Ibrahim,
> So, the minimum number of zk nodes is 5, not three as is commonly thought.
> Right?
> With 5 nodes, one can support one or two node failures.
> I did not expect a 3-node cluster to stop with one node failing either,
> since there is still a majority...
> Hmmm, will check this!
>
> On Thursday, January 8, 2015, Ibrahim El-sanosi (PGR) <
> [email protected]> wrote:
>
> > Hi Benjamin,
> >
> > The reason why Node2 and Node3 stop running is that ZooKeeper must
> > have a quorum of servers to make progress. ZooKeeper needs at least 3
> > servers in order to run. In your scenario, you started with three
> > servers, which is fine, but since one of the servers failed,
> > ZooKeeper stopped running because it lacked a quorum (majority).
> >
> > Ibrahim
> >
> > -----Original Message-----
> > From: Benjamin Jaton [mailto:[email protected]]
> > Sent: Wednesday, January 07, 2015 10:34 PM
> > To: [email protected]
> > Subject: Failover when one node fails to write on the disk?
> >
> > Using ZooKeeper 3.4.5, I came across a situation where all 3
> > ZooKeeper nodes suddenly stopped.
> >
> > What I see is that NODE1 fails to write to the disk, so it makes sense
> > to me that NODE1 stops.
> >
> > But it is unclear why NODE2 and NODE3 would stop running as well. I
> > have a hard time making sense of the log messages.
> >
> > Any insight would be greatly appreciated!
> >
> > see log extracts below:
> >
> > NODE1:
> >
> > -- no log for several days before this --
> > 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321]
> > - fsync-ing the write ahead log in SyncThread:1 took 11024ms which
> > will adversely effect operation latency. See the ZooKeeper
> > troubleshooting guide
> > 2015-01-04 16:18:22,380 [myid:1] - WARN
> > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader java.io.EOFException
> >         at java.io.DataInputStream.readInt(DataInputStream.java:392)
> >         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> >         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> > 2015-01-04 16:18:23,384 [myid:1] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:23,492 [myid:1] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:24,060 [myid:1] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> >
> >
> > NODE2:
> >
> > -- no log for several days before this --
> > 2015-01-04 16:18:21,899 [myid:3] - WARN
> > [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader java.io.EOFException
> >         at java.io.DataInputStream.readInt(DataInputStream.java:392)
> >         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> >         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> >         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> > 2015-01-04 16:18:22,760 [myid:3] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:22,801 [myid:3] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:22,886 [myid:3] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> >
> >
> > NODE3 (leader):
> >
> > -- no log for several days before this --
> > 2015-01-04 16:18:21,897 [myid:2] - WARN
> > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
> > connection to peer due to transaction timeout.
> > 2015-01-04 16:18:21,898 [myid:2] - WARN
> > [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - *******
> > GOODBYE
> > /204.53.107.249:43402 ********
> > 2015-01-04 16:18:21,905 [myid:2] - WARN
> > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing
> > connection to peer due to transaction timeout.
> > 2015-01-04 16:18:21,907 [myid:2] - WARN
> > [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - *******
> > GOODBYE
> > /204.53.107.247:45953 ********
> > 2015-01-04 16:18:21,918 [myid:2] - WARN
> > [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring
> > unexpected exception java.lang.InterruptedException
> >         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> >         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> >         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> >         at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
> >         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> > 2015-01-04 16:18:23,003 [myid:2] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:23,007 [myid:2] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> > 2015-01-04 16:18:23,115 [myid:2] - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
> > session 0x0 due to java.io.IOException: ZooKeeperServer not running
> >
> >
> > Thanks!
> > Benjamin
> >
>
