Do you have the logs from the zk server? What type of machines are you running this on? Which version of ZK?
Thanks, C On Wed, Jan 21, 2015 at 2:00 PM, Ian Rose <[email protected]> wrote: > Hi all - > > Reviving an old thread here. We have now been running Zookeeper for a few > months now and this problem has continued to dog us. This is entirely a > localhost problem (when zookeeper and the client app are both running on a > developer laptop) - thankfully everything has been solid on our production > servers. Nonetheless this causes major headaches in our development > process. For example, I just now debugged a problem our UX lead was having > getting our dev stack up and running on his local machine; the problem was > that our startup script was failing when using zkCli.sh to create a "/solr" > node in ZK. This would not be a problem if it was a rare/spurious error, > but he is getting this error every single time. > > For reference, here is the complete stack trace (which appears after > several seconds - the failure is not immediate): > > Exception in thread "main" > org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /solr > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) > at > org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:695) > at > org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588) > at > org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360) > at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323) > at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282) > > Frustratingly, this behavior is very inconsistent. In fact, after 5+ > failures in a row on our UX lead's machine, now I just heard from him that > he tried again and it worked. Could it be that there is some kind of state > on Zookeeper that needs to timeout and disappear before things will work? > > More concretely, the next time this happens, what kind of info should I > capture to make this more debuggable? > > Many thanks, > Ian > > > > On Mon, Sep 29, 2014 at 4:51 PM, Ian Rose <[email protected]> wrote: > > > Well unfortunately although I am able to repro this problem with a > > completely new, clean ZK install on my laptop, I *cannot* repro with the > > same on my coworkers laptop. So unfortunately I am forced to conclude > that > > there is something strange going on locally. > > > > Thanks for the help anyhow! > > > > - Ian > > > > > > On Mon, Sep 29, 2014 at 9:40 AM, Camille Fournier <[email protected]> > > wrote: > > > >> No that is not expected. Odd that you get disconnected once and then > >> reconnect fine. Does the same thing happen in your kazoo clients, one > >> disconnect but then the second connect is ok? > >> Which version of ZK are you running? Are you running this with some sort > >> of > >> auth or password to the zk server? > >> > >> Thanks, > >> C > >> > >> On Mon, Sep 29, 2014 at 9:24 AM, Ian Rose <[email protected]> > wrote: > >> > >> > Perhaps I'm simply misunderstanding what the expected behavior would > be. > >> > Why would my client by disconnected? Does zookeeper drop idle > clients? > >> > And note that this isn't a spurious disconnect; my client is *always* > >> > dropped > >> > at that time. > >> > > >> > On Friday zkCli seemed to be working just fine for me, but now I am > >> getting > >> > disconnections similar to my kazoo-based client. > >> > > >> > Here is a session showing `ls /' failing. This behavior is > reproducible > >> > for me currently. > >> > > >> > $ ./bin/zkCli.sh > >> > > Connecting to localhost:2181 > >> > > Welcome to ZooKeeper! > >> > > JLine support is enabled > >> > > [zk: localhost:2181(CONNECTING) 0] > >> > > WATCHER:: > >> > > WatchedEvent state:SyncConnected type:None path:null > >> > > [zk: localhost:2181(CONNECTED) 0] > >> > > [zk: localhost:2181(CONNECTED) 0] ls / > >> > > WATCHER:: > >> > > WatchedEvent state:Disconnected type:None path:null > >> > > Exception in thread "main" > >> > > org.apache.zookeeper.KeeperException$ConnectionLossException: > >> > > KeeperErrorCode = ConnectionLoss for / > >> > > at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > >> > > at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > >> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472) > >> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500) > >> > > at > >> > > org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:720) > >> > > at > >> org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588) > >> > > at > >> org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360) > >> > > at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323) > >> > > at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282) > >> > > >> > > >> > And here is a session where after connecting I just sit and wait for a > >> bit > >> > (without sending any commands), whereupon I get disconnected and then > >> > immediately reconnected. Interestingly, after the reconnecting I am > >> able > >> > to issue multiple commands without any issues. This behavior is > >> > reproducible for me currently. > >> > > >> > $ ./bin/zkCli.sh > >> > > Connecting to localhost:2181 > >> > > Welcome to ZooKeeper! > >> > > JLine support is enabled > >> > > [zk: localhost:2181(CONNECTING) 0] > >> > > [zk: localhost:2181(CONNECTING) 0] > >> > > WATCHER:: > >> > > WatchedEvent state:SyncConnected type:None path:null > >> > > [zk: localhost:2181(CONNECTED) 0] > >> > > WATCHER:: > >> > > WatchedEvent state:Disconnected type:None path:null > >> > > WATCHER:: > >> > > WatchedEvent state:SyncConnected type:None path:null > >> > > [zk: localhost:2181(CONNECTED) 0] > >> > > [zk: localhost:2181(CONNECTED) 0] ls / > >> > > [zookeeper] > >> > > [zk: localhost:2181(CONNECTED) 1] > >> > > [zk: localhost:2181(CONNECTED) 1] > >> > > [zk: localhost:2181(CONNECTED) 1] ls /zookeeper > >> > > [quota] > >> > > [zk: localhost:2181(CONNECTED) 2] ls /zookeeper/quota > >> > > [] > >> > > >> > > >> > > >> > Thanks, > >> > - Ian > >> > > >> > > >> > On Sun, Sep 28, 2014 at 2:22 PM, Camille Fournier <[email protected] > > > >> > wrote: > >> > > >> > > Sorry but with what you've sent us I don't really see what the > problem > >> > is. > >> > > It does look like you connect and then nothing happens for 20s and > >> then > >> > the > >> > > connection is dropped. If you use the zkCli script to connect via > the > >> > > command line do you see the same problem? > >> > > > >> > > C > >> > > > >> > > On Fri, Sep 26, 2014 at 12:48 PM, Ian Rose <[email protected]> > >> > wrote: > >> > > > >> > > > Hi all - > >> > > > > >> > > > I've just gotten started using SolrCloud, which uses Zookeeper for > >> > > > coordination; I am otherwise completely new to Zookeeper. Now I > am > >> > > trying > >> > > > to query Zookeeper directly for some simple information. I am > >> finding, > >> > > > however, that although my clients are able to connect they very > >> > > frequently > >> > > > receive timeouts. It almost seems like the server isn't receiving > >> the > >> > > > heartbeat messages at all (or isn't responding to them). I've > seen > >> > > similar > >> > > > behavior both when using the go-zookeeper > >> > > > <https://godoc.org/github.com/samuel/go-zookeeper/zk> and kazoo > >> > > > <https://github.com/python-zk/kazoo> (python) client libraries (I > >> > wanted > >> > > > to > >> > > > try >1 to ensure that it wasn't a client lib problem). > >> > > > > >> > > > My config is very simple: I am running the client and a single > >> > Zookeeper > >> > > > node on the same machine (my laptop). There are no other clients > of > >> > the > >> > > > Zookeeper node while I am running these tests, so there is no > >> practical > >> > > > possibility that the JVM is overloaded or GCing. > >> > > > > >> > > > Here is a very basic kazoo client that I am using. Obviously it > >> isn't > >> > > > doing any "real" work right now - this is just to demonstrate the > >> > > > disconnects. > >> > > > > >> > > > #!/usr/bin/env python2.7 > >> > > > > >> > > > from kazoo.client import KazooClient, KazooState > >> > > > import logging > >> > > > import time > >> > > > > >> > > > def my_listener(state): > >> > > > if state == KazooState.LOST: > >> > > > # Register somewhere that the session was lost > >> > > > logging.warning('handle lost') > >> > > > elif state == KazooState.SUSPENDED: > >> > > > # Handle being disconnected from Zookeeper > >> > > > logging.debug('handle being disconnected') > >> > > > else: > >> > > > # Handle being connected/reconnected to Zookeeper > >> > > > logging.debug('handle being (re)connected') > >> > > > > >> > > > > >> > > > if __name__ == '__main__': > >> > > > logging.basicConfig(format='%(asctime)-15s %(levelname)s > >> > %(message)s', > >> > > > level=logging.DEBUG) > >> > > > > >> > > > logging.debug('starting...') > >> > > > zk = KazooClient(hosts='127.0.0.1:2181', timeout=30) > >> > > > zk.start() > >> > > > zk.add_listener(my_listener) > >> > > > > >> > > > time.sleep(35) > >> > > > > >> > > > zk.stop() > >> > > > > >> > > > -------------------- > >> > > > > >> > > > Here is my zookeeper config: > >> > > > > >> > > > # The number of milliseconds of each tick > >> > > > tickTime=2000 > >> > > > > >> > > > # The number of ticks that the initial > >> > > > # synchronization phase can take > >> > > > initLimit=10 > >> > > > > >> > > > # The number of ticks that can pass between > >> > > > # sending a request and getting an acknowledgement > >> > > > syncLimit=5 > >> > > > > >> > > > # the directory where the snapshot is stored. > >> > > > dataDir=/Users/ianrose/Code/zookeeper/var/data > >> > > > > >> > > > # the port at which the clients will connect > >> > > > clientPort=2181 > >> > > > > >> > > > # The number of snapshots to retain in dataDir > >> > > > autopurge.snapRetainCount=5 > >> > > > > >> > > > # Purge task interval in hours > >> > > > # Set to "0" to disable auto purge feature > >> > > > autopurge.purgeInterval=1 > >> > > > > >> > > > > >> > > > -------------------- > >> > > > > >> > > > > >> > > > Here is the output I get on the client: > >> > > > > >> > > > 2014-09-26 12:43:20,603 DEBUG starting... > >> > > > 2014-09-26 12:43:20,604 INFO Connecting to 127.0.0.1:2181 > >> > > > 2014-09-26 12:43:20,605 DEBUG Sending request(xid=None): > >> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000, > >> > > session_id=0, > >> > > > > >> > > > >> > > >> > passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', > >> > > > read_only=None) > >> > > > 2014-09-26 12:43:20,609 INFO Zookeeper connection established, > >> state: > >> > > > CONNECTED > >> > > > 2014-09-26 12:43:40,222 WARNING Connection dropped: outstanding > >> > heartbeat > >> > > > ping not received > >> > > > 2014-09-26 12:43:40,222 WARNING Transition to CONNECTING > >> > > > 2014-09-26 12:43:40,222 INFO Zookeeper connection lost > >> > > > 2014-09-26 12:43:40,222 DEBUG handle being disconnected > >> > > > 2014-09-26 12:43:40,747 INFO Connecting to 127.0.0.1:2181 > >> > > > 2014-09-26 12:43:40,748 DEBUG Sending request(xid=None): > >> > > > Connect(protocol_version=0, last_zxid_seen=0, time_out=30000, > >> > > > session_id=92520388231233540, > >> > > > passwd='\xd1M\xb9\xb3\xae\xab\xa1!@x\x06nv\xb7\xe3*', > >> read_only=None) > >> > > > 2014-09-26 12:43:40,750 INFO Zookeeper connection established, > >> state: > >> > > > CONNECTED > >> > > > 2014-09-26 12:43:40,751 DEBUG handle being (re)connected > >> > > > 2014-09-26 12:43:55,611 DEBUG Sending request(xid=1): Close() > >> > > > 2014-09-26 12:43:55,614 INFO Closing connection to 127.0.0.1:2181 > >> > > > 2014-09-26 12:43:55,615 INFO Zookeeper session lost, state: CLOSED > >> > > > 2014-09-26 12:43:55,615 WARNING handle lost > >> > > > > >> > > > > >> > > > And here is trace-level logging from the server side: > >> > > > > >> > > > 2014-09-26 12:43:20,605 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][ > >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] - > Accepted > >> socket > >> > > > connection from /127.0.0.1:58959 > >> > > > 2014-09-26 12:43:20,605 [myid:] - DEBUG [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session > >> establishment > >> > > > request > >> > > > from client /127.0.0.1:58959 client's lastZxid is 0x0 > >> > > > 2014-09-26 12:43:20,605 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868%5D%5B>] - Client > >> attempting to > >> > > > establish new session at /127.0.0.1:58959 > >> > > > 2014-09-26 12:43:20,605 [myid:] - TRACE [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - SessionTrackerImpl > >> --- Adding > >> > > > session > >> > > > 0x148b2cd8b010004 30000 > >> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0 > >> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004 > >> > > > type:createSession cxid:0x0 zxid:0xfffffffffffffffe > txntype:unknown > >> > > > reqpath:n/a > >> > > > 2014-09-26 12:43:20,606 [myid:] - TRACE [ProcessThread(sid:0 > >> > > > cport:-1)::ZooTrace@71][] - SessionTrackerImpl --- Existing > session > >> > > > 0x148b2cd8b010004 30000 > >> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request:: > >> > > > sessionid:0x148b2cd8b010004 type:createSession cxid:0x0 zxid:0x78f > >> > > > txntype:-10 reqpath:n/a > >> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@90 > ][] > >> - > >> > > > :Esessionid:0x148b2cd8b010004 type:createSession cxid:0x0 > zxid:0x78f > >> > > > txntype:-10 reqpath:n/a > >> > > > 2014-09-26 12:43:20,608 [myid:] - TRACE [SyncThread:0:ZooTrace@71 > ][] > >> - > >> > > > SessionTrackerImpl --- Existing session 0x148b2cd8b010004 30000 > >> > > > 2014-09-26 12:43:20,608 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@160][] - > >> > sessionid:0x148b2cd8b010004 > >> > > > type:createSession cxid:0x0 zxid:0x78f txntype:-10 reqpath:n/a > >> > > > 2014-09-26 12:43:20,609 [myid:] - INFO > >> > [SyncThread:0:ZooKeeperServer@617 > >> > > > ][] > >> > > > - Established session 0x148b2cd8b010004 with negotiated timeout > >> 30000 > >> > for > >> > > > client /127.0.0.1:58959 > >> > > > 2014-09-26 12:43:40,748 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197][ > >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197%5D%5B>] - > Accepted > >> socket > >> > > > connection from /127.0.0.1:58960 > >> > > > 2014-09-26 12:43:40,749 [myid:] - DEBUG [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@810%5D%5B>] - Session > >> establishment > >> > > > request > >> > > > from client /127.0.0.1:58960 client's lastZxid is 0x0 > >> > > > 2014-09-26 12:43:40,749 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861%5D%5B>] - Client > >> attempting to > >> > renew > >> > > > session 0x148b2cd8b010004 at /127.0.0.1:58960 > >> > > > 2014-09-26 12:43:40,749 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][ > >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket > >> connection > >> > > for > >> > > > client /127.0.0.1:58959 which had sessionid 0x148b2cd8b010004 > >> > > > 2014-09-26 12:43:40,750 [myid:] - TRACE [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooTrace@71][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooTrace@71%5D%5B>] - Session > >> 0x148b2cd8b010004 is > >> > > valid: > >> > > > true > >> > > > 2014-09-26 12:43:40,750 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617][ > >> <http://0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617%5D%5B>] - Established > >> session > >> > > > 0x148b2cd8b010004 with negotiated timeout 30000 for client / > >> > > > 127.0.0.1:58960 > >> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request:: > >> > > > sessionid:0x148b2cd8b010004 type:ping cxid:0xfffffffffffffffe > >> > > > zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a > >> > > > 2014-09-26 12:43:50,742 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@160][] - > >> > sessionid:0x148b2cd8b010004 > >> > > > type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe > >> > txntype:unknown > >> > > > reqpath:n/a > >> > > > 2014-09-26 12:43:55,612 [myid:] - TRACE [ProcessThread(sid:0 > >> > > > cport:-1)::ZooTrace@90][] - :Psessionid:0x148b2cd8b010004 > >> > > > type:closeSession > >> > > > cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a > >> > > > 2014-09-26 12:43:55,612 [myid:] - INFO [ProcessThread(sid:0 > >> > > > cport:-1)::SessionTrackerImpl@198][] - Session closing: > >> > > 0x148b2cd8b010004 > >> > > > 2014-09-26 12:43:55,612 [myid:] - INFO [ProcessThread(sid:0 > >> > > > cport:-1)::PrepRequestProcessor@494][] - Processed session > >> termination > >> > > for > >> > > > sessionid: 0x148b2cd8b010004 > >> > > > 2014-09-26 12:43:55,613 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@88][] - Processing request:: > >> > > > sessionid:0x148b2cd8b010004 type:closeSession cxid:0x1 zxid:0x790 > >> > > > txntype:-11 reqpath:n/a > >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@90 > ][] > >> - > >> > > > :Esessionid:0x148b2cd8b010004 type:closeSession cxid:0x1 > zxid:0x790 > >> > > > txntype:-11 reqpath:n/a > >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE [SyncThread:0:ZooTrace@71 > ][] > >> - > >> > > > SessionTrackerImpl --- Removing session 0x148b2cd8b010004 > >> > > > 2014-09-26 12:43:55,614 [myid:] - DEBUG > >> > > > [SyncThread:0:FinalRequestProcessor@160][] - > >> > sessionid:0x148b2cd8b010004 > >> > > > type:closeSession cxid:0x1 zxid:0x790 txntype:-11 reqpath:n/a > >> > > > 2014-09-26 12:43:55,614 [myid:] - TRACE > >> [SyncThread:0:NIOServerCnxn@168 > >> > > ][] > >> > > > - Add a buffer to outgoingBuffers, sk > >> > > sun.nio.ch.SelectionKeyImpl@28cfdeb5 > >> > > > is valid: true > >> > > > 2014-09-26 12:43:55,615 [myid:] - INFO [NIOServerCxn.Factory: > >> > > > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007][ > >> <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007%5D%5B>] - Closed socket > >> connection > >> > > for > >> > > > client /127.0.0.1:58960 which had sessionid 0x148b2cd8b010004 > >> > > > > >> > > > > >> > > > I'm assuming that this is due to a misunderstanding on my part. > Can > >> > > anyone > >> > > > point me in the right direction? > >> > > > > >> > > > Much appreciated! > >> > > > - Ian > >> > > > > >> > > > >> > > >> > > > > >
