[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-515:
-----------------------------------

    Fix Version/s:     (was: 3.3.0)
                   3.4.0

> Zookeeper quorum didn't provide service when restart after an "Out of memory" 
> crash
> -----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-515
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-515
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.2.0
>         Environment: Linux 2.6.9-52bs-4core #2 SMP Wed Jan 16 14:44:08 EST 
> 2008 x86_64 x86_64 x86_64 GNU/Linux
> Jdk: 1.6.0_14 
>            Reporter: Qian Ye
>             Fix For: 3.4.0
>
>
> The Zookeeper quorum, containing 5 servers, didn't provide service when 
> restart after an "Out of memory" crash. 
> It happened as following:
> 1. we built  a Zookeeper quorum which contained  5 servers, say 1, 3, 4, 5, 6 
> (have no 2), and 6 was the leader.
> 2. we created 18 threads on 6 different servers to set and get data from a 
> znode in the Zookeeper at the same time.  The size of the data is 1MB. The 
> test threads did their job as fast as possible, no pause between two 
> operation, and they repeated the setting and getting 4000 times. 
> 3. the Zookeeper leader crashed about 10 mins  after the test threads 
> started. The leader printed out the log:
> 2009-08-25 12:00:12,301 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b5 due to java.io.IOException: Read error
> 2009-08-25 12:00:12,318 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b6 due to java.io.IOException: Read error
> 2009-08-25 12:03:44,086 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b8 due to java.io.IOException: Read error
> 2009-08-25 12:04:53,757 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x523
> 4223c2dc00b7 due to java.io.IOException: Read error
> 2009-08-25 12:15:45,151 - FATAL [SyncThread:0:syncrequestproces...@131] - 
> Severe unrecoverable error, exiting
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2786)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
>     at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
>     at 
> org.apache.jute.BinaryOutputArchive.writeInt(BinaryOutputArchive.java:55)
>     at org.apache.zookeeper.txn.SetDataTxn.serialize(SetDataTxn.java:42)
>     at 
> org.apache.zookeeper.server.persistence.Util.marshallTxnEntry(Util.java:262)
>     at 
> org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:154)
>     at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:268)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:100)
> It is clear that the leader ran out of memory. then the server 4 was down 
> almost at the same time, and printed out the log:
> 2009-08-25 12:15:45,995 - ERROR 
> [FollowerRequestProcessor:3:followerrequestproces...@91] - Unexpected 
> exception causing
> exit
> java.net.SocketException: Connection reset
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>     at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>     at 
> org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:119)
>     at 
> org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:51)
>     at 
> org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:97)
>     at org.apache.zookeeper.server.quorum.Follower.request(Follower.java:399)
>     at 
> org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:86)
> 2009-08-25 12:15:45,996 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x423
> 4ab894330075 due to java.net.SocketException: Broken pipe
> 2009-08-25 12:15:45,996 - FATAL [SyncThread:3:syncrequestproces...@131] - 
> Severe unrecoverable error, exiting
> java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
>     at 
> org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:147)
>     at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:92)
> 2009-08-25 12:15:45,995 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@309] - 
> Exception when following the leader
> java.net.SocketException: Broken pipe
>     at java.net.SocketOutputStream.socketWrite0(Native Method)
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>     at 
> org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:256)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 12:15:46,022 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 12:15:46,022 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 12:15:46,023 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> It was really strange that when the 2 server down, the other three servers 
> cannot provide service any more, the 'stat' command all leaded to 
> "ZooKeeperServer not running".
> 4. I restarted the server 6(the former leader) and the server 4. But the 
> service didn't come back. All the five servers printed "ZookeeperServer not 
> running". The server 6 printed the logs:
> 2009-08-25 14:02:15,395 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:27,703 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: Responded to info probe
> 2009-08-25 14:02:28,733 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:42,070 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:02:55,407 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:08,744 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:22,080 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:29,396 - ERROR [main:u...@238] - Last transaction was 
> partial.
> 2009-08-25 14:03:35,417 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> 2009-08-25 14:03:48,761 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running 
> The server 4 printed logs like:
> 2009-08-25 14:03:48,747 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:02,091 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:15,427 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:17,816 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=0
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:18,820 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=1
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:19,823 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=2
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:04:28,764 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:04:42,101 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> the server 1, 3, 5 printed out the logs like:
> 2009-08-25 14:01:35,396 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: ZooKeeperServer not running
> 2009-08-25 14:01:36,554 - WARN  [QuorumPeer:/0.0.0.0:2181:leaderelect...@194] 
> - Ignoring exception while looking for lea
> der
> java.net.SocketTimeoutException: Receive timed out 
>     at java.net.PlainDatagramSocketImpl.receive0(Native Method)
>     at 
> java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
>     at java.net.DatagramSocket.receive(DatagramSocket.java:712)
>     at 
> org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
> 2009-08-25 14:01:37,758 - WARN  [QuorumPeer:/0.0.0.0:2181:leaderelect...@194] 
> - Ignoring exception while looking for lea
> der
> java.net.SocketTimeoutException: Receive timed out 
>     at java.net.PlainDatagramSocketImpl.receive0(Native Method)
>     at 
> java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
>     at java.net.DatagramSocket.receive(DatagramSocket.java:712)
>     at 
> org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:170)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
> 2009-08-25 14:01:37,865 - WARN  [QuorumPeer:/0.0.0.0:2181:follo...@164] - 
> Unexpected exception, tries=0
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:525)
>     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
> 2009-08-25 14:01:38,289 - WARN  [NIOServerCxn.Factory:2181:nioserverc...@497] 
> - Exception causing close of session 0x0 d
> ue to java.io.IOException: Responded to info probe
> my zoo.cfg is like:
> tickTime=2000
> dataDir=./status/
> clientPort=2181
> initLimit=10
> syncLimit=2
> server.1=10.81.11.107:2888:3888
> server.2=10.81.11.106:2888:3888
> server.3=10.81.11.89:2888:3888
> server.4=10.81.11.99:2888:3888
> server.5=10.81.11.79:2888:3888
> Several questions:
> 1. Why the leader selection failed after the restart?
> 2. Is the size of data too big to be processed properly?
> 3. How can I recover from this situation? Can I just remove the version-2 
> directory on server 6(the former leader) and restart the server?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to