Hi German,

Thanks for the follow-up! I have log files for all the servers, but they are quite big (greater than 25MB), so I could not send them by mail. Is it OK if I file a bug on this and upload the logs there?

Thanks & Regards,
Deepak

On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <[email protected]> wrote:

> Hello Deepak,
>
> sorry for the slow response.
> I can't figure out what might be going on here without the log files.
> The traces you see in S2 do not indicate any problem, as far as I see. It
> seems that you have a client running in S2 that tries to connect to that
> server. Since S2 hasn't been able to join a quorum, the server attending
> clients hasn't been started and the connection is rejected.
> Maybe, to start with, you could start by uploading the traces around the
> connection loss between S2 and S3 (say a couple of minutes before and
> after).
>
> Regards,
>
> German.
>
>
> On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[email protected]> wrote:
>
> > Hi,
> >
> > zoo.cfg is :
> >
> > maxClientCnxns=50
> > # The number of milliseconds of each tick
> > tickTime=2000
> > # The number of ticks that the initial
> > # synchronization phase can take
> > initLimit=10
> > # The number of ticks that can pass between
> > # sending a request and getting an acknowledgement
> > syncLimit=5
> > # the directory where the snapshot is stored.
> > dataDir=/var/lib/zookeeper
> > # the port at which the clients will connect
> > clientPort=2181
> >
> > autopurge.snapRetainCount=3
> > autopurge.purgeInterval=1
> > dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
> >
> >
> > zoo.cfg.dynamic is:
> >
> > server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
> > server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
> > server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
> > version=1
> >
> >
> > Thanks & Regards,
> > Deepak
> >
> >
> > On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <[email protected]> wrote:
> >
> > > Sorry but the attachment didn't make it through.
> > > It might be safer to put the files somewhere in the web and send a link.
> > >
> > >
> > > On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
> > >
> > > > Hi German,
> > > >
> > > > Please find zookeeper config files attached.
> > > >
> > > > Thanks & Regards,
> > > > Deepak
> > > >
> > > >
> > > > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <[email protected]> wrote:
> > > >
> > > >> Hello!
> > > >>
> > > >> Could you please post your configuration files?
> > > >>
> > > >> Regards,
> > > >>
> > > >> German.
> > > >>
> > > >>
> > > >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
> > > >>
> > > >> > Hi All,
> > > >> >
> > > >> > We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the
> > > >> > quorum.
> > > >> > The problem we are facing is that one zookeeper server in the quorum falls
> > > >> > apart, and never becomes part of the cluster until we restart zookeeper
> > > >> > server on that node.
> > > >> >
> > > >> > Our interpretation from zookeeper logs on all nodes is as follows:
> > > >> > (For simplicity assume S1=> zk server1, S2 => zk server2, S3 => zk server
> > > >> > 3)
> > > >> > Initially S3 is the leader while S1 and S2 are followers.
> > > >> >
> > > >> > S2 hits 46 sec latency while fsyncing write ahead log and results in loss
> > > >> > of connection with S3.
> > > >> > S3 in turn prints following error message:
> > > >> >
> > > >> > Unexpected exception causing shutdown while sock still open
> > > >> > java.net.SocketTimeoutException: Read timed out
> > > >> > Stack trace
> > > >> > ******* GOODBYE /169.254.1.2:47647(S2) ********
> > > >> >
> > > >> > S2 in this case closes connection with S3(leader) and shuts down follower
> > > >> > with following log messages:
> > > >> > Closing connection to leader, exception during packet send
> > > >> > java.net.SocketException: Socket close
> > > >> > Follower@194] - shutdown called
> > > >> > java.lang.Exception: shutdown Follower
> > > >> >
> > > >> > After this point S3 could never reestablish connection with S2 and leader
> > > >> > election mechanism keeps failing. S3 now keeps printing following message
> > > >> > repeatedly:
> > > >> > Cannot open channel to 2 at election address /169.254.1.2:3888
> > > >> > java.net.ConnectException: Connection refused.
> > > >> >
> > > >> > While S3 is in this state, S2 repeatedly keeps printing following message:
> > > >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
> > > >> > Exception causing close of session 0x0: ZooKeeperServer not running
> > > >> > Closed socket connection for client /127.0.0.1:60667 (no session
> > > >> > established for client)
> > > >> >
> > > >> > Leader election never completes successfully and causing S2 to fall apart
> > > >> > from the quorum.
> > > >> > S2 was out of quorum for almost 1 week.
> > > >> >
> > > >> > While debugging this issue, we found out that both election and peer
> > > >> > connection ports on S2 can't be telneted from any of the node (S1, S2,
> > > >> > S3). Network connectivity is not the issue. Later, we restarted the ZK
> > > >> > server S2 (service zookeeper-server restart) -- now we could telnet to both
> > > >> > the ports and S2 joined the ensemble after a leader election attempt.
> > > >> > Any idea what might be forcing S2 to get into a situation where it won't
> > > >> > accept any connections on the leader election and peer connection ports?
> > > >> >
> > > >> > Should I file a jira on this and upload all log files while submitting the
> > > >> > jira as log files are close to 250MB each?
> > > >> >
> > > >> > Thanks & Regards,
> > > >> > Deepak
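
For anyone who wants to repeat the telnet check from the original report without doing it by hand, here is a minimal Python sketch. It only tests whether a TCP connection to each server's peer and election port succeeds. The helper names (port_accepts_connections, probe_ensemble) are hypothetical and not part of ZooKeeper; the addresses and ports are copied from the zoo.cfg.dynamic quoted in the thread.

```python
import socket

# Hypothetical helper for illustration; not part of ZooKeeper.
def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False

# Addresses and ports taken from the zoo.cfg.dynamic quoted in the thread.
SERVERS = {"S1": "169.254.1.1", "S2": "169.254.1.2", "S3": "169.254.1.3"}
PORTS = {"peer": 2888, "election": 3888}

def probe_ensemble(servers=SERVERS, ports=PORTS, timeout=3.0):
    """Map (server name, port role) to True/False for every combination."""
    return {
        (name, role): port_accepts_connections(host, port, timeout)
        for name, host in servers.items()
        for role, port in ports.items()
    }
```

Calling probe_ensemble() from each node returns a dict such as {("S2", "election"): False, ...}. As far as I understand, only the current leader listens on the peer port (2888), so a refused peer port on a follower may be expected; the election port (3888) is probably the more telling one here, since it matches the "Cannot open channel to 2 at election address" message.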
