OK, that might be. I added a comment in the JIRA case that you created (ZOOKEEPER-1869, for others' reference) stating that at some point the logs say "leaving the listener" for the election in server 2, and it is not clear whether the server restarts the listener after that. I think it is better to continue the discussion in the JIRA case and leave this thread here.
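On the question of whether the listener comes back: whether server 2 is still accepting connections on the quorum and election ports can be checked directly from any of the nodes. Below is a minimal sketch of that check, not something from this thread; the addresses and the 2888/3888 ports are taken from the zoo.cfg.dynamic quoted further down, the class name is just illustrative, and it does the same thing as the telnet check Deepak describes below.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class QuorumPortProbe {
    public static void main(String[] args) {
        // Addresses reused from the zoo.cfg.dynamic quoted below.
        String[] hosts = {"169.254.1.1", "169.254.1.2", "169.254.1.3"};
        int[] ports = {2888, 3888}; // peer (quorum) port and election port
        for (String host : hosts) {
            for (int port : ports) {
                try (Socket s = new Socket()) {
                    // Succeeds only if something is listening, i.e. the equivalent of telnet.
                    s.connect(new InetSocketAddress(host, port), 2000);
                    System.out.println(host + ":" + port + " -> listening");
                } catch (IOException e) {
                    System.out.println(host + ":" + port + " -> " + e.getMessage());
                }
            }
        }
    }
}

If the connect attempt to 169.254.1.2:3888 keeps failing while the process on server 2 is still alive, that would match the listener having exited without being restarted.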
On Tue, Jan 28, 2014 at 9:44 PM, Deepak Jagtap <[email protected]> wrote:

> Hi German,
>
> I went through the zookeeper logs again and it looks like a zookeeper bug
> to me. Leader election was initiated and it never completed, as one
> zookeeper server went into a zombie (hung) state.
> Please note that zookeeper was running on all the nodes when this happened.
>
> Thanks & Regards,
> Deepak
>
> On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[email protected]> wrote:
>
>> Dropbox link for log files:
>> https://dl.dropboxusercontent.com/u/36429721/zklog.tgz
>>
>> On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[email protected]> wrote:
>>
>>> Jira has an attachment limit of 10MB, hence I uploaded the log files on
>>> Dropbox.
>>>
>>> Please refer to events close to the "2014-01-07 10:34:01" timestamp on
>>> all nodes.
>>>
>>> Thanks & Regards,
>>> Deepak
>>>
>>> On Mon, Jan 27, 2014 at 12:34 PM, German Blanco <[email protected]> wrote:
>>>
>>>> I don't see why it would be a problem for anybody.
>>>> If this happens not to be a problem in ZooKeeper, we can always close
>>>> the bug case.
>>>>
>>>> On Mon, Jan 27, 2014 at 8:33 PM, Deepak Jagtap <[email protected]> wrote:
>>>>
>>>>> Hi German,
>>>>>
>>>>> Thanks for the followup!
>>>>> I have log files for all the servers and they are quite big (greater
>>>>> than 25MB), hence I could not send the log files through mail.
>>>>> Is it ok if I file a bug on this and upload the logs there?
>>>>>
>>>>> Thanks & Regards,
>>>>> Deepak
>>>>>
>>>>> On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <[email protected]> wrote:
>>>>>
>>>>>> Hello Deepak,
>>>>>>
>>>>>> sorry for the slow response.
>>>>>> I can't figure out what might be going on here without the log files.
>>>>>> The traces you see in S2 do not indicate any problem, as far as I can
>>>>>> see. It seems that you have a client running in S2 that tries to
>>>>>> connect to that server. Since S2 hasn't been able to join a quorum,
>>>>>> the server attending clients hasn't been started and the connection
>>>>>> is rejected.
>>>>>> Maybe, to start with, you could upload the traces around the
>>>>>> connection loss between S2 and S3 (say a couple of minutes before
>>>>>> and after).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> German.
>>>>>>
>>>>>> On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> zoo.cfg is:
>>>>>>>
>>>>>>> maxClientCnxns=50
>>>>>>> # The number of milliseconds of each tick
>>>>>>> tickTime=2000
>>>>>>> # The number of ticks that the initial
>>>>>>> # synchronization phase can take
>>>>>>> initLimit=10
>>>>>>> # The number of ticks that can pass between
>>>>>>> # sending a request and getting an acknowledgement
>>>>>>> syncLimit=5
>>>>>>> # the directory where the snapshot is stored.
>>>>>>> dataDir=/var/lib/zookeeper
>>>>>>> # the port at which the clients will connect
>>>>>>> clientPort=2181
>>>>>>>
>>>>>>> autopurge.snapRetainCount=3
>>>>>>> autopurge.purgeInterval=1
>>>>>>> dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
>>>>>>>
>>>>>>> zoo.cfg.dynamic is:
>>>>>>>
>>>>>>> server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
>>>>>>> server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
>>>>>>> server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
>>>>>>> version=1
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <[email protected]> wrote:
>>>>>>>
>>>>>>>> Sorry but the attachment didn't make it through.
>>>>>>>> It might be safer to put the files somewhere on the web and send a
>>>>>>>> link.
>>>>>>>>
>>>>>>>> On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi German,
>>>>>>>>>
>>>>>>>>> Please find the zookeeper config files attached.
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Deepak
>>>>>>>>>
>>>>>>>>> On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello!
>>>>>>>>>>
>>>>>>>>>> Could you please post your configuration files?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> German.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We have deployed zookeeper version 3.5.0.1515976, with 3 zk
>>>>>>>>>>> servers in the quorum.
>>>>>>>>>>> The problem we are facing is that one zookeeper server falls out
>>>>>>>>>>> of the quorum and never becomes part of the cluster again until
>>>>>>>>>>> we restart the zookeeper server on that node.
>>>>>>>>>>>
>>>>>>>>>>> Our interpretation from the zookeeper logs on all nodes is as
>>>>>>>>>>> follows:
>>>>>>>>>>> (For simplicity assume S1 => zk server 1, S2 => zk server 2,
>>>>>>>>>>> S3 => zk server 3)
>>>>>>>>>>> Initially S3 is the leader while S1 and S2 are followers.
>>>>>>>>>>>
>>>>>>>>>>> S2 hits a 46 sec latency while fsyncing the write-ahead log,
>>>>>>>>>>> which results in loss of connection with S3.
>>>>>>>>>>> S3 in turn prints the following error message:
>>>>>>>>>>>
>>>>>>>>>>> Unexpected exception causing shutdown while sock still open
>>>>>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>>>>> Stack trace
>>>>>>>>>>> ******* GOODBYE /169.254.1.2:47647(S2) ********
>>>>>>>>>>>
>>>>>>>>>>> S2 in this case closes the connection with S3 (the leader) and
>>>>>>>>>>> shuts down the follower with the following log messages:
>>>>>>>>>>>
>>>>>>>>>>> Closing connection to leader, exception during packet send
>>>>>>>>>>> java.net.SocketException: Socket close
>>>>>>>>>>> Follower@194] - shutdown called
>>>>>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>>>>>
>>>>>>>>>>> After this point S3 could never re-establish the connection with
>>>>>>>>>>> S2, and the leader election mechanism keeps failing. S3 now
>>>>>>>>>>> keeps printing the following message repeatedly:
>>>>>>>>>>>
>>>>>>>>>>> Cannot open channel to 2 at election address /169.254.1.2:3888
>>>>>>>>>>> java.net.ConnectException: Connection refused.
>>>>>>>>>>>
>>>>>>>>>>> While S3 is in this state, S2 repeatedly keeps printing the
>>>>>>>>>>> following messages:
>>>>>>>>>>>
>>>>>>>>>>> INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
>>>>>>>>>>> Exception causing close of session 0x0: ZooKeeperServer not running
>>>>>>>>>>> Closed socket connection for client /127.0.0.1:60667 (no session established for client)
>>>>>>>>>>>
>>>>>>>>>>> Leader election never completes successfully, causing S2 to fall
>>>>>>>>>>> out of the quorum.
>>>>>>>>>>> S2 was out of the quorum for almost 1 week.
>>>>>>>>>>>
>>>>>>>>>>> While debugging this issue, we found out that neither the
>>>>>>>>>>> election port nor the peer connection port on S2 could be
>>>>>>>>>>> reached via telnet from any of the nodes (S1, S2, S3). Network
>>>>>>>>>>> connectivity is not the issue. Later, we restarted the ZK server
>>>>>>>>>>> on S2 (service zookeeper-server restart) -- now we could telnet
>>>>>>>>>>> to both ports and S2 joined the ensemble after a leader election
>>>>>>>>>>> attempt.
>>>>>>>>>>> Any idea what might be forcing S2 into a situation where it
>>>>>>>>>>> won't accept any connections on the leader election and peer
>>>>>>>>>>> connection ports?
>>>>>>>>>>>
>>>>>>>>>>> Should I file a jira on this and upload all the log files while
>>>>>>>>>>> submitting it, given that the log files are close to 250MB each?
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Deepak
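
As an aside on the "ZooKeeperServer not running" rejections quoted above: the state each server reports on its client port can be observed without opening a session, using ZooKeeper's four-letter-word commands. Below is a minimal sketch along those lines, assuming the four-letter-word commands such as "srvr" are enabled on port 2181 (they predate the later whitelist option); the class name is just illustrative.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SrvrCheck {
    public static void main(String[] args) {
        // Client addresses reused from the configuration quoted above.
        String[] hosts = {"169.254.1.1", "169.254.1.2", "169.254.1.3"};
        for (String host : hosts) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, 2181), 2000);
                OutputStream out = s.getOutputStream();
                // "srvr" asks the server for its current status and mode.
                out.write("srvr".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                InputStream in = s.getInputStream();
                byte[] buf = new byte[4096];
                StringBuilder reply = new StringBuilder();
                int n;
                // The server closes the connection after replying, ending the loop.
                while ((n = in.read(buf)) > 0) {
                    reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
                }
                System.out.println("=== " + host + " ===");
                System.out.println(reply.toString().trim());
            } catch (Exception e) {
                System.out.println(host + ": " + e.getMessage());
            }
        }
    }
}

A healthy follower or leader reports its mode and zxid, while a server stuck outside the quorum, as S2 was here, answers along the lines of "This ZooKeeper instance is not currently serving requests".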
