Dropbox link for log files: https://dl.dropboxusercontent.com/u/36429721/zklog.tgz
On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[email protected]> wrote:

> Jira has an attachment limit of 10 MB, hence I uploaded the log files to
> Dropbox.
>
> Please refer to events close to the "2014-01-07 10:34:01" timestamp on
> all nodes.
>
> Thanks & Regards,
> Deepak
>
>
> On Mon, Jan 27, 2014 at 12:34 PM, German Blanco <[email protected]> wrote:
>
>> I don't see why it would be a problem for anybody.
>> If this happens not to be a problem in ZooKeeper, we can always close
>> the bug case.
>>
>>
>> On Mon, Jan 27, 2014 at 8:33 PM, Deepak Jagtap <[email protected]> wrote:
>>
>> > Hi German,
>> >
>> > Thanks for the follow-up!
>> > I have log files for all the servers, and they are quite big (greater
>> > than 25 MB), hence I could not send them through mail.
>> > Is it OK if I file a bug on this and upload the logs there?
>> >
>> > Thanks & Regards,
>> > Deepak
>> >
>> >
>> > On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <[email protected]> wrote:
>> >
>> > > Hello Deepak,
>> > >
>> > > Sorry for the slow response.
>> > > I can't figure out what might be going on here without the log files.
>> > > The traces you see in S2 do not indicate any problem, as far as I
>> > > see. It seems that you have a client running in S2 that tries to
>> > > connect to that server. Since S2 hasn't been able to join a quorum,
>> > > the server attending clients hasn't been started and the connection
>> > > is rejected.
>> > > Maybe, to start with, you could upload the traces around the
>> > > connection loss between S2 and S3 (say a couple of minutes before
>> > > and after).
>> > >
>> > > Regards,
>> > >
>> > > German.
>> > >
>> > >
>> > > On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[email protected]> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > zoo.cfg is:
>> > > >
>> > > > maxClientCnxns=50
>> > > > # The number of milliseconds of each tick
>> > > > tickTime=2000
>> > > > # The number of ticks that the initial
>> > > > # synchronization phase can take
>> > > > initLimit=10
>> > > > # The number of ticks that can pass between
>> > > > # sending a request and getting an acknowledgement
>> > > > syncLimit=5
>> > > > # the directory where the snapshot is stored.
>> > > > dataDir=/var/lib/zookeeper
>> > > > # the port at which the clients will connect
>> > > > clientPort=2181
>> > > >
>> > > > autopurge.snapRetainCount=3
>> > > > autopurge.purgeInterval=1
>> > > > dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
>> > > >
>> > > >
>> > > > zoo.cfg.dynamic is:
>> > > >
>> > > > server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
>> > > > server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
>> > > > server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
>> > > > version=1
>> > > >
>> > > >
>> > > > Thanks & Regards,
>> > > > Deepak
>> > > >
>> > > >
>> > > > On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <[email protected]> wrote:
>> > > >
>> > > > > Sorry, but the attachment didn't make it through.
>> > > > > It might be safer to put the files somewhere on the web and
>> > > > > send a link.
>> > > > >
>> > > > >
>> > > > > On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
>> > > > >
>> > > > > > Hi German,
>> > > > > >
>> > > > > > Please find the ZooKeeper config files attached.
>> > > > > >
>> > > > > > Thanks & Regards,
>> > > > > > Deepak
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <[email protected]> wrote:
>> > > > > >
>> > > > > >> Hello!
>> > > > > >>
>> > > > > >> Could you please post your configuration files?
>> > > > > >>
>> > > > > >> Regards,
>> > > > > >>
>> > > > > >> German.
>> > > > > >>
>> > > > > >>
>> > > > > >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
>> > > > > >>
>> > > > > >> > Hi All,
>> > > > > >> >
>> > > > > >> > We have deployed ZooKeeper version 3.5.0.1515976, with 3 ZK
>> > > > > >> > servers in the quorum.
>> > > > > >> > The problem we are facing is that one ZooKeeper server in
>> > > > > >> > the quorum falls out and never becomes part of the cluster
>> > > > > >> > again until we restart the ZooKeeper server on that node.
>> > > > > >> >
>> > > > > >> > Our interpretation of the ZooKeeper logs on all nodes is as
>> > > > > >> > follows (for simplicity, assume S1 => ZK server 1,
>> > > > > >> > S2 => ZK server 2, S3 => ZK server 3).
>> > > > > >> > Initially S3 is the leader while S1 and S2 are followers.
>> > > > > >> >
>> > > > > >> > S2 hits 46 seconds of latency while fsyncing the write-ahead
>> > > > > >> > log, which results in loss of its connection with S3.
>> > > > > >> > S3 in turn prints the following error message:
>> > > > > >> >
>> > > > > >> > Unexpected exception causing shutdown while sock still open
>> > > > > >> > java.net.SocketTimeoutException: Read timed out
>> > > > > >> > Stack trace
>> > > > > >> > ******* GOODBYE /169.254.1.2:47647 (S2) ********
>> > > > > >> >
>> > > > > >> > S2 in this case closes its connection with S3 (the leader)
>> > > > > >> > and shuts down the follower with the following log messages:
>> > > > > >> >
>> > > > > >> > Closing connection to leader, exception during packet send
>> > > > > >> > java.net.SocketException: Socket close
>> > > > > >> > Follower@194] - shutdown called
>> > > > > >> > java.lang.Exception: shutdown Follower
>> > > > > >> >
>> > > > > >> > After this point S3 can never reestablish a connection with
>> > > > > >> > S2, and the leader election mechanism keeps failing. S3 now
>> > > > > >> > keeps printing the following message repeatedly:
>> > > > > >> >
>> > > > > >> > Cannot open channel to 2 at election address /169.254.1.2:3888
>> > > > > >> > java.net.ConnectException: Connection refused
>> > > > > >> >
>> > > > > >> > While S3 is in this state, S2 repeatedly keeps printing the
>> > > > > >> > following messages:
>> > > > > >> >
>> > > > > >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
>> > > > > >> > :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket
>> > > > > >> > connection from /127.0.0.1:60667
>> > > > > >> > Exception causing close of session 0x0: ZooKeeperServer not running
>> > > > > >> > Closed socket connection for client /127.0.0.1:60667 (no
>> > > > > >> > session established for client)
>> > > > > >> >
>> > > > > >> > Leader election never completes successfully, causing S2 to
>> > > > > >> > fall out of the quorum.
>> > > > > >> > S2 was out of the quorum for almost 1 week.
>> > > > > >> >
>> > > > > >> > While debugging this issue, we found that neither the
>> > > > > >> > election port nor the peer connection port on S2 could be
>> > > > > >> > reached via telnet from any of the nodes (S1, S2, S3).
>> > > > > >> > Network connectivity is not the issue. Later, we restarted
>> > > > > >> > the ZK server on S2 (service zookeeper-server restart) --
>> > > > > >> > after that we could telnet to both ports, and S2 joined the
>> > > > > >> > ensemble after a leader election attempt.
>> > > > > >> > Any idea what might be forcing S2 into a situation where it
>> > > > > >> > won't accept any connections on the leader election and
>> > > > > >> > peer connection ports?
>> > > > > >> >
>> > > > > >> > Should I file a jira on this and upload all the log files
>> > > > > >> > while submitting it, given that the log files are close to
>> > > > > >> > 250 MB each?
>> > > > > >> >
>> > > > > >> > Thanks & Regards,
>> > > > > >> > Deepak
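
[Editor's note] The telnet check described above is easy to script. Below
is a minimal sketch in Python, assuming the addresses and ports from the
zoo.cfg.dynamic posted in this thread (169.254.1.1-3, peer port 2888,
election port 3888); the SERVERS map is an assumption to adjust for your
own ensemble. One caveat, as far as I know: every server listens on its
election port, but only the current leader binds the peer port, so a
refused connection on 2888 at a follower is expected; the refused
election port (3888) on S2 is the telling symptom.

    # probe_quorum_ports.py -- hypothetical helper, not part of ZooKeeper.
    # Checks whether each ensemble member accepts TCP connections on its
    # peer (2888) and election (3888) ports, like the manual telnet test.
    import socket

    SERVERS = {
        "S1": "169.254.1.1",
        "S2": "169.254.1.2",
        "S3": "169.254.1.3",
    }
    PORTS = {"peer": 2888, "election": 3888}

    def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:  # covers refused, unreachable, and timeouts
            return False

    if __name__ == "__main__":
        for name, host in SERVERS.items():
            for label, port in PORTS.items():
                state = "open" if can_connect(host, port) else "refused/unreachable"
                print(f"{name} {label:8s} {host}:{port} -> {state}")

In the state described above, S2 would report "refused/unreachable" on
3888 even though its process is alive, matching the "Connection refused"
lines in S3's log.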
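A complementary probe for the "ZooKeeperServer not running" symptom is to
send ZooKeeper's four-letter commands to the client port. The sketch below
reuses the same addresses and assumes four-letter words are enabled (to my
knowledge they were on by default in 3.5.0; later releases added a
whitelist). "ruok" should come back as "imok" whenever the process is
alive, and "stat" reports the server's mode (leader/follower) or an error
line while the server is out of quorum, which separates "process dead"
from "process alive but not serving" -- the state S2 was stuck in.

    # four_letter_check.py -- hypothetical probe using ZooKeeper's real
    # "ruok" and "stat" four-letter commands on the client port (2181).
    import socket

    def four_letter(host: str, command: bytes, port: int = 2181) -> str:
        """Send a four-letter command and return the server's raw reply."""
        with socket.create_connection((host, port), timeout=3.0) as sock:
            sock.sendall(command)
            chunks = []
            while True:  # the server closes the connection after replying
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode(errors="replace")

    if __name__ == "__main__":
        for host in ("169.254.1.1", "169.254.1.2", "169.254.1.3"):
            print(host, "ruok ->", four_letter(host, b"ruok") or "<no reply>")
            print(four_letter(host, b"stat"))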
