Can you tell why the server wasn't responding to the notifications from the observer? The log file is from the observer, and it looks like it is able to send messages out, but it isn't clear why the server isn't responding.
-Flavio

> On 14 Oct 2015, at 01:51, elastic search <[email protected]> wrote:
>
> Hello Experts
>
> We have 2 Observers running in AWS connecting over to local ZK Ensemble in our own DataCenter.
>
> There have been instances where we see network drop for a minute between the networks.
> However the Observers take around 15 minutes to recover even if the network outage is for a minute.
>
> From the logs
> java.net.SocketTimeoutException: Read timed out
> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> 2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>
> And then finally exits the QuorumCnxManager run loop with the following message
> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
>
> How can we ensure the observer does not go out for service such a long duration ?
>
> Attached the full logs
>
> Please help
> Thanks
>
> <zookeeper.log.zip>
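
For reference, the "Notification time out" values in the quoted log double on each round (400, 800, 1600, ...) until they level off at 60000 ms, which is why the later entries are a minute apart. Below is a minimal Java sketch that reproduces that sequence. It is modeled only on the numbers visible in the log, not on the ZooKeeper source; the class name, method name, and constants are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NotificationBackoff {
    // Both values are read off the quoted log, not taken from ZooKeeper's code.
    static final int FIRST_LOGGED_TIMEOUT_MS = 400;
    static final int MAX_TIMEOUT_MS = 60_000;

    // Returns the first `rounds` timeout values: doubling each round, capped at the maximum.
    static List<Integer> schedule(int rounds) {
        List<Integer> timeouts = new ArrayList<>();
        int timeout = FIRST_LOGGED_TIMEOUT_MS;
        for (int i = 0; i < rounds; i++) {
            timeouts.add(timeout);
            timeout = Math.min(timeout * 2, MAX_TIMEOUT_MS);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // The quoted log contains 23 "Notification time out" entries.
        for (int t : schedule(23)) {
            System.out.println("Notification time out: " + t);
        }
    }
}
```

Once the cap is reached, each further round in this model adds a full minute, which matches the one-minute spacing of the later log entries. It does not explain why the server is not answering the notifications in the first place.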
