Cameron, you seem to be right, the simple zookeeper client also fails sometimes. I will bring it up to the ZK team, thank you!
On Tue, Jul 12, 2016 at 3:09 PM, Cameron McKenzie <[email protected]> wrote: > I'm at a bit of a loss then. Looking at the stack trace, it's in the ZK > code (not the Curator code) that it appears to be blocking. Maybe try and > create a test that removes Curator and just uses the ZK client directly and > see if you still have the same problem? > > On Wed, Jul 13, 2016 at 7:54 AM, Benjamin Jaton <[email protected]> > wrote: > >> Yes, I mean the ping goes through properly >> >> On Tue, Jul 12, 2016 at 2:50 PM, Cameron McKenzie <[email protected] >> > wrote: >> >>> Are the 'dead' hosts reachable? I seem to recall having problems with >>> the ZK client blocking for a long time if you had a host configured that >>> was resolvable but not reachable. >>> >>> On Wed, Jul 13, 2016 at 4:30 AM, Benjamin Jaton < >>> [email protected]> wrote: >>> >>>> More to the story: >>>> >>>> While it consistently works on the same machine than the sole running >>>> ZK node when I specify only 2 of the 3 servers in the connection string, >>>> it doesn't work when I specify 3 hosts: >>>> >>>> >>>> 2016-07-12 11:23:46 INFO o.a.z.ZooKeeper:716 - Initiating client >>>> connection, connectString=QA-E8WIN11:2181,QA-E8WIN12:2181,QA-E8WIN13:2181 >>>> sessionTimeout=45000 watcher=org.apache.curator.ConnectionState@c86b9e3 >>>> 2016-07-12 11:23:46 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>> connection to server QA-E8WIN13/10.11.8.230:2181. Will not attempt to >>>> authenticate using SASL (unknown error) >>>> 2016-07-12 11:24:01 INFO o.a.z.ClientCnxn:1251 - Client session timed >>>> out, have not heard from server in 15018ms for sessionid 0x0, closing >>>> socket connection and attempting reconnect >>>> Successfully established the connection with ZooKeeper >>>> 2016-07-12 11:24:01 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>> connection to server QA-E8WIN12/10.11.8.232:2181. Will not attempt to >>>> authenticate using SASL (unknown error) >>>> 2016-07-12 11:24:16 INFO o.a.z.ClientCnxn:1251 - Client session timed >>>> out, have not heard from server in 15000ms for sessionid 0x0, closing >>>> socket connection and attempting reconnect >>>> 2016-07-12 11:24:16 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>> connection to server QA-E8WIN11/10.11.8.236:2181. Will not attempt to >>>> authenticate using SASL (unknown error) >>>> 2016-07-12 11:24:16 INFO o.a.z.ClientCnxn:980 - Socket connection >>>> established, initiating session, client: /10.11.8.236:50423, server: >>>> QA-E8WIN11/10.11.8.236:2181 >>>> 2016-07-12 11:24:16 INFO o.a.z.ClientCnxn:1400 - Session establishment >>>> complete on server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>> 0x10003c8af64001b, negotiated timeout = 45000 (READ-ONLY mode) >>>> 2016-07-12 11:24:16 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN13/10.11.8.230:2181 for being r/w. Timeout 200 >>>> 2016-07-12 11:24:16 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: READ_ONLY >>>> 2016-07-12 11:24:37 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout 400 >>>> 2016-07-12 11:24:58 INFO o.a.z.ClientCnxn:1251 - Client session timed >>>> out, have not heard from server in 42023ms for sessionid 0x10003c8af64001b, >>>> closing socket connection and attempting reconnect >>>> 2016-07-12 11:24:58 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: SUSPENDED >>>> 2016-07-12 11:25:00 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>> connection to server QA-E8WIN11/10.11.8.236:2181. Will not attempt to >>>> authenticate using SASL (unknown error) >>>> 2016-07-12 11:25:00 INFO o.a.z.ClientCnxn:980 - Socket connection >>>> established, initiating session, client: /10.11.8.236:50428, server: >>>> QA-E8WIN11/10.11.8.236:2181 >>>> 2016-07-12 11:25:00 INFO o.a.z.ClientCnxn:1400 - Session establishment >>>> complete on server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>> 0x10003c8af64001c, negotiated timeout = 45000 (READ-ONLY mode) >>>> 2016-07-12 11:25:00 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN13/10.11.8.230:2181 for being r/w. Timeout 800 >>>> 2016-07-12 11:25:00 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: READ_ONLY >>>> 2016-07-12 11:25:21 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout 1600 >>>> 2016-07-12 11:25:42 INFO o.a.z.ClientCnxn:1251 - Client session timed >>>> out, have not heard from server in 42011ms for sessionid 0x10003c8af64001c, >>>> closing socket connection and attempting reconnect >>>> 2016-07-12 11:25:42 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: SUSPENDED >>>> 2016-07-12 11:25:44 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>> connection to server QA-E8WIN11/10.11.8.236:2181. Will not attempt to >>>> authenticate using SASL (unknown error) >>>> 2016-07-12 11:25:44 INFO o.a.z.ClientCnxn:980 - Socket connection >>>> established, initiating session, client: /10.11.8.236:50435, server: >>>> QA-E8WIN11/10.11.8.236:2181 >>>> 2016-07-12 11:25:44 INFO o.a.z.ClientCnxn:1400 - Session establishment >>>> complete on server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>> 0x10003c8af64001d, negotiated timeout = 45000 (READ-ONLY mode) >>>> 2016-07-12 11:25:44 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN13/10.11.8.230:2181 for being r/w. Timeout 3200 >>>> 2016-07-12 11:25:44 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: READ_ONLY >>>> 2016-07-12 11:26:05 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout 6400 >>>> 2016-07-12 11:26:26 INFO o.a.z.ClientCnxn:1251 - Client session timed >>>> out, have not heard from server in 42003ms for sessionid 0x10003c8af64001d, >>>> closing socket connection and attempting reconnect >>>> 2016-07-12 11:26:26 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>> change: SUSPENDED >>>> Exception in thread "main" >>>> org.apache.zookeeper.KeeperException$ConnectionLossException: >>>> KeeperErrorCode = ConnectionLoss for / >>>> at >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) >>>> at >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) >>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1753) >>>> at >>>> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:310) >>>> at >>>> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:299) >>>> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108) >>>> at >>>> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:295) >>>> at >>>> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:287) >>>> at >>>> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:34) >>>> at test.Test.main(Test.java:25) >>>> >>>> >>>> That seems to suggest that adding 1 host to check somehow messes up >>>> something in the timeouts. >>>> I am suspicious of those: >>>> 2016-07-12 11:25:44 INFO o.a.z.ClientCnxn:1297 - Checking server >>>> QA-E8WIN13/10.11.8.230:2181 for being r/w. Timeout 3200 >>>> They seem to be "blocking" the existing connection we have to the >>>> healthy ZK server, which goes stale. Is that possible? >>>> >>>> >>>> On Tue, Jul 12, 2016 at 11:17 AM, Benjamin Jaton < >>>> [email protected]> wrote: >>>> >>>>> I did not use the zkCli tool but the ZK java API directly. And other >>>>> tools like ZooInspector (contrib). They were run from a remote machine. >>>>> I ran it on the same machine and the one read only server is running >>>>> properly, I can browse it using those methods. >>>>> >>>>> Trying to make sense of those client-server interactions is difficult >>>>> but here is what I got: >>>>> >>>>> Client: >>>>> 2016-07-12 10:55:01,508 INFO org.apache.zookeeper.ClientCnxn:1138 - >>>>> *Opening >>>>> socket connection to server QA-E8WIN11*/10.11.8.236:2181. Will not >>>>> attempt to authenticate using SASL (unknown error) >>>>> 2016-07-12 10:55:01,512 INFO org.apache.zookeeper.ClientCnxn:980 - >>>>> *Socket >>>>> connection established*, initiating session, client: /10.11.12.4:60586, >>>>> server: QA-E8WIN11/10.11.8.236:2181 >>>>> 2016-07-12 10:55:01,521 INFO org.apache.zookeeper.ClientCnxn:1400 - >>>>> *Session >>>>> establishment complete on server QA-E8WIN11*/10.11.8.236:2181, *sessionid >>>>> = 0x10003c8af640007*, negotiated timeout = 45000 (READ-ONLY mode) >>>>> 2016-07-12 10:55:01,523 INFO org.apache.zookeeper.ClientCnxn:1297 - >>>>> Checking server QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout 200 >>>>> 2016-07-12 10:55:01,528 INFO >>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - *State >>>>> change: READ_ONLY* >>>>> 2016-07-12 10:57:08,800 INFO org.apache.zookeeper.ClientCnxn:1253 - >>>>> *Unable >>>>> to read additional data from server sessionid 0x10003c8af640007, likely >>>>> server has closed socket, closing socket connection and attempting >>>>> reconnect* >>>>> 2016-07-12 10:57:08,903 INFO >>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - *State >>>>> change: SUSPENDED* >>>>> >>>>> >>>>> Server: >>>>> 2016-07-12T10:55:56,468 [myid:] - INFO [ProcessThread(sid:1 >>>>> cport:-1)::ZooKeeperServer@678] - *Established session >>>>> 0x10003c8af640007 with negotiated timeout 45000* for client / >>>>> 10.11.12.4:60586 >>>>> (...) >>>>> 2016-07-12T10:56:41,756 [myid:] - INFO >>>>> [SessionTracker:ZooKeeperServer@384] - *Expiring session >>>>> 0x10003c8af640007, timeout of 45000ms exceeded* >>>>> >>>>> >>>>> Note: Don't worry about the time being about 1 min off between the >>>>> machines. >>>>> >>>>> What I don't understand is why we are losing the read-only connection >>>>> we successfully established. ZK ro server says it timed out? >>>>> >>>>> Now interestingly, when I run it on the same machine it works and >>>>> completes pretty quickly: >>>>> >>>>> >>>>> 2016-07-12 10:43:49 INFO o.a.z.ZooKeeper:716 - Initiating client >>>>> connection, connectString=QA-E8WIN11:2181,QA-E8WIN12:2181 >>>>> sessionTimeout=45000 watcher=org.apache.curator.ConnectionState@e4487af >>>>> 2016-07-12 10:43:49 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>>> connection to server QA-E8WIN12/10.11.8.232:2181. Will not attempt to >>>>> authenticate using SASL (unknown error) >>>>> Successfully established the connection with ZooKeeper >>>>> 2016-07-12 10:44:10 WARN o.a.z.ClientCnxn:1257 - Session 0x0 for >>>>> server null, unexpected error, closing socket connection and attempting >>>>> reconnect >>>>> java.net.ConnectException: Connection timed out: no further information >>>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >>>>> ~[?:1.8.0_91] >>>>> at >>>>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) >>>>> ~[?:1.8.0_91] >>>>> at >>>>> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355) >>>>> ~[zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518] >>>>> at >>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1236) >>>>> [zookeeper-3.5.0-alpha.jar:3.5.2-alpha-SNAPSHOT-1702518] >>>>> 2016-07-12 10:44:11 INFO o.a.z.ClientCnxn:1138 - Opening socket >>>>> connection to server QA-E8WIN11/10.11.8.236:2181. Will not attempt to >>>>> authenticate using SASL (unknown error) >>>>> 2016-07-12 10:44:11 INFO o.a.z.ClientCnxn:980 - Socket connection >>>>> established, initiating session, client: /10.11.8.236:50147, server: >>>>> QA-E8WIN11/10.11.8.236:2181 >>>>> 2016-07-12 10:44:11 INFO o.a.z.ClientCnxn:1400 - Session >>>>> establishment complete on server QA-E8WIN11/10.11.8.236:2181, >>>>> sessionid = 0x10003b315b30012, negotiated timeout = 45000 (READ-ONLY mode) >>>>> 2016-07-12 10:44:11 INFO o.a.z.ClientCnxn:1297 - Checking server >>>>> QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout 200 >>>>> 2016-07-12 10:44:11 INFO o.a.c.f.s.ConnectionStateManager:228 - State >>>>> change: READ_ONLY >>>>> 2016-07-12 10:44:32 INFO o.a.z.ClientCnxn:1297 - Checking server >>>>> QA-E8WIN11/10.11.8.236:2181 for being r/w. Timeout 400 >>>>> Done. >>>>> >>>>> >>>>> On Mon, Jul 11, 2016 at 4:42 PM, Cameron McKenzie < >>>>> [email protected]> wrote: >>>>> >>>>>> It looks like a connection issue based on the logs. When you >>>>>> confirmed that the connection was ok via the zkCli, was that running on >>>>>> the >>>>>> same host as the read only server? Can you run it from the same host that >>>>>> you're running the Java client from? >>>>>> >>>>>> On Tue, Jul 12, 2016 at 9:40 AM, Benjamin Jaton < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Well I was trying to let it run and it eventually fails, with a >>>>>>> stack: >>>>>>> >>>>>>> Successfully established the connection with ZooKeeper >>>>>>> 2016-07-11 16:28:21,943 INFO org.apache.zookeeper.ClientCnxn:1251 - >>>>>>> Client session timed out, have not heard from server in 22519ms for >>>>>>> sessionid 0x0, closing socket connection and attempting reconnect >>>>>>> 2016-07-11 16:28:22,046 INFO org.apache.zookeeper.ClientCnxn:1138 - >>>>>>> Opening socket connection to server QA-E8WIN11/10.11.8.236:2181. >>>>>>> Will not attempt to authenticate using SASL (unknown error) >>>>>>> 2016-07-11 16:28:22,050 INFO org.apache.zookeeper.ClientCnxn:980 - >>>>>>> Socket connection established, initiating session, client: / >>>>>>> 10.11.12.4:54702, server: QA-E8WIN11/10.11.8.236:2181 >>>>>>> 2016-07-11 16:28:22,058 INFO org.apache.zookeeper.ClientCnxn:1400 - >>>>>>> Session establishment complete on server QA-E8WIN11/10.11.8.236:2181, >>>>>>> sessionid = 0x100024f09360080, negotiated timeout = 45000 (READ-ONLY >>>>>>> mode) >>>>>>> 2016-07-11 16:28:22,061 INFO org.apache.zookeeper.ClientCnxn:1297 - >>>>>>> Checking server QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout >>>>>>> 200 >>>>>>> 2016-07-11 16:28:22,066 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: READ_ONLY >>>>>>> 2016-07-11 16:30:29,312 INFO org.apache.zookeeper.ClientCnxn:1253 - >>>>>>> Unable to read additional data from server sessionid 0x100024f09360080, >>>>>>> likely server has closed socket, closing socket connection and >>>>>>> attempting >>>>>>> reconnect >>>>>>> 2016-07-11 16:30:29,416 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: SUSPENDED >>>>>>> 2016-07-11 16:30:31,031 INFO org.apache.zookeeper.ClientCnxn:1138 - >>>>>>> Opening socket connection to server QA-E8WIN11/10.11.8.236:2181. >>>>>>> Will not attempt to authenticate using SASL (unknown error) >>>>>>> 2016-07-11 16:30:31,032 INFO org.apache.zookeeper.ClientCnxn:980 - >>>>>>> Socket connection established, initiating session, client: / >>>>>>> 10.11.12.4:55156, server: QA-E8WIN11/10.11.8.236:2181 >>>>>>> 2016-07-11 16:30:31,034 INFO org.apache.zookeeper.ClientCnxn:1400 - >>>>>>> Session establishment complete on server QA-E8WIN11/10.11.8.236:2181, >>>>>>> sessionid = 0x100024f09360083, negotiated timeout = 45000 (READ-ONLY >>>>>>> mode) >>>>>>> 2016-07-11 16:30:31,034 INFO org.apache.zookeeper.ClientCnxn:1297 - >>>>>>> Checking server QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout >>>>>>> 400 >>>>>>> 2016-07-11 16:30:31,035 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: READ_ONLY >>>>>>> 2016-07-11 16:32:38,335 INFO org.apache.zookeeper.ClientCnxn:1253 - >>>>>>> Unable to read additional data from server sessionid 0x100024f09360083, >>>>>>> likely server has closed socket, closing socket connection and >>>>>>> attempting >>>>>>> reconnect >>>>>>> 2016-07-11 16:32:38,441 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: SUSPENDED >>>>>>> 2016-07-11 16:32:40,425 INFO org.apache.zookeeper.ClientCnxn:1138 - >>>>>>> Opening socket connection to server QA-E8WIN11/10.11.8.236:2181. >>>>>>> Will not attempt to authenticate using SASL (unknown error) >>>>>>> 2016-07-11 16:32:40,426 INFO org.apache.zookeeper.ClientCnxn:980 - >>>>>>> Socket connection established, initiating session, client: / >>>>>>> 10.11.12.4:55637, server: QA-E8WIN11/10.11.8.236:2181 >>>>>>> 2016-07-11 16:32:40,428 INFO org.apache.zookeeper.ClientCnxn:1400 - >>>>>>> Session establishment complete on server QA-E8WIN11/10.11.8.236:2181, >>>>>>> sessionid = 0x100024f09360087, negotiated timeout = 45000 (READ-ONLY >>>>>>> mode) >>>>>>> 2016-07-11 16:32:40,428 INFO org.apache.zookeeper.ClientCnxn:1297 - >>>>>>> Checking server QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout >>>>>>> 800 >>>>>>> 2016-07-11 16:32:40,428 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: READ_ONLY >>>>>>> 2016-07-11 16:34:47,615 INFO org.apache.zookeeper.ClientCnxn:1253 - >>>>>>> Unable to read additional data from server sessionid 0x100024f09360087, >>>>>>> likely server has closed socket, closing socket connection and >>>>>>> attempting >>>>>>> reconnect >>>>>>> 2016-07-11 16:34:47,716 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: SUSPENDED >>>>>>> 2016-07-11 16:34:49,557 INFO org.apache.zookeeper.ClientCnxn:1138 - >>>>>>> Opening socket connection to server QA-E8WIN11/10.11.8.236:2181. >>>>>>> Will not attempt to authenticate using SASL (unknown error) >>>>>>> 2016-07-11 16:34:49,558 INFO org.apache.zookeeper.ClientCnxn:980 - >>>>>>> Socket connection established, initiating session, client: / >>>>>>> 10.11.12.4:56094, server: QA-E8WIN11/10.11.8.236:2181 >>>>>>> 2016-07-11 16:34:49,560 INFO org.apache.zookeeper.ClientCnxn:1400 - >>>>>>> Session establishment complete on server QA-E8WIN11/10.11.8.236:2181, >>>>>>> sessionid = 0x100024f09360089, negotiated timeout = 45000 (READ-ONLY >>>>>>> mode) >>>>>>> 2016-07-11 16:34:49,560 INFO org.apache.zookeeper.ClientCnxn:1297 - >>>>>>> Checking server QA-E8WIN12/10.11.8.232:2181 for being r/w. Timeout >>>>>>> 1600 >>>>>>> 2016-07-11 16:34:49,560 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: READ_ONLY >>>>>>> 2016-07-11 16:36:56,895 INFO org.apache.zookeeper.ClientCnxn:1253 - >>>>>>> Unable to read additional data from server sessionid 0x100024f09360089, >>>>>>> likely server has closed socket, closing socket connection and >>>>>>> attempting >>>>>>> reconnect >>>>>>> Exception in thread "main" 2016-07-11 16:36:56,996 INFO >>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>> change: SUSPENDED >>>>>>> org.apache.zookeeper.KeeperException$ConnectionLossException: >>>>>>> KeeperErrorCode = ConnectionLoss for / >>>>>>> at >>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:99) >>>>>>> at >>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) >>>>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1753) >>>>>>> at >>>>>>> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:310) >>>>>>> at >>>>>>> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:299) >>>>>>> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108) >>>>>>> at >>>>>>> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:295) >>>>>>> at >>>>>>> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:287) >>>>>>> at >>>>>>> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:34) >>>>>>> at TestCuratorReadOnly.main(TestCuratorReadOnly.java:23) >>>>>>> >>>>>>> On Mon, Jul 11, 2016 at 4:39 PM, Cameron McKenzie < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> So, it appears that the initial connection occurs ok, but then >>>>>>>> something goes wrong >>>>>>>> >>>>>>>> 2016-07-11 15:49:15,712 INFO org.apache.zookeeper.ClientCnxn:1253 >>>>>>>> - Unable to read additional data from server sessionid >>>>>>>> 0x100024f0936005b, >>>>>>>> likely server has closed socket, closing socket connection and >>>>>>>> attempting >>>>>>>> reconnect >>>>>>>> >>>>>>>> Then a reconnection occurs ok, and then the cycle starts again. >>>>>>>> >>>>>>>> Can you see where in the stack the getData() call is hanging? >>>>>>>> >>>>>>>> On Tue, Jul 12, 2016 at 9:35 AM, Benjamin Jaton < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> The blockUntilConnectedOrTimedOut() does return, we see the line: >>>>>>>>> "Successfully established the connection with ZooKeeper" >>>>>>>>> which happens right after. >>>>>>>>> >>>>>>>>> The getData() is where it hangs. >>>>>>>>> >>>>>>>>> We see this first: >>>>>>>>> 2016-07-11 15:47:08,420 INFO >>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>>>> change: READ_ONLY >>>>>>>>> >>>>>>>>> and then we see it again 2 minutes later: >>>>>>>>> 2016-07-11 15:49:17,316 INFO >>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>>>> change: READ_ONLY >>>>>>>>> >>>>>>>>> and then again 2 minutes after that: >>>>>>>>> 2016-07-11 15:51:26,470 INFO >>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - State >>>>>>>>> change: READ_ONLY >>>>>>>>> >>>>>>>>> Then I stopped it. >>>>>>>>> >>>>>>>>> On Mon, Jul 11, 2016 at 4:13 PM, Cameron McKenzie < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> hey Ben, >>>>>>>>>> From the logs it looks like the connection is established at the >>>>>>>>>> end? >>>>>>>>>> >>>>>>>>>> Are you saying that the blockUntilConnectedOrTimedOut() does not >>>>>>>>>> return when this occurs? >>>>>>>>>> cheers >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Jul 12, 2016 at 9:10 AM, Benjamin Jaton < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> ZK is 3.5.1-alpha >>>>>>>>>>> Curator is 2.11.0 >>>>>>>>>>> >>>>>>>>>>> On Mon, Jul 11, 2016 at 4:01 PM, Benjamin Jaton < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> I have a 3 nodes ZK cluster , readonly mode is enabled. >>>>>>>>>>>> 2 nodes are down, so the third one is in read-only (verified by >>>>>>>>>>>> using the ZK API manually). >>>>>>>>>>>> >>>>>>>>>>>> I'm using this piece of code: >>>>>>>>>>>> >>>>>>>>>>>> Builder curatorClientBuilder = CuratorFrameworkFactory.builder() >>>>>>>>>>>> .connectString("QA-E8WIN11:2181,QA-E8WIN12:2181") >>>>>>>>>>>> .sessionTimeoutMs(45000).connectionTimeoutMs(15000) >>>>>>>>>>>> .retryPolicy(new RetryNTimes(3, 5000)).canBeReadOnly(true); >>>>>>>>>>>> >>>>>>>>>>>> CuratorFramework client = curatorClientBuilder.build(); >>>>>>>>>>>> client.start(); >>>>>>>>>>>> client.getZookeeperClient().blockUntilConnectedOrTimedOut(); >>>>>>>>>>>> System.out.println("Successfully established the connection >>>>>>>>>>>> with ZooKeeper"); >>>>>>>>>>>> client.getData().forPath("/"); >>>>>>>>>>>> System.out.println("Done."); >>>>>>>>>>>> >>>>>>>>>>>> When curator pick the host that is UP first, it goes through >>>>>>>>>>>> very quickly. When it picks the host that is down first >>>>>>>>>>>> (QA-E8WIN12), it >>>>>>>>>>>> never goes through. >>>>>>>>>>>> >>>>>>>>>>>> Am I missing something here? >>>>>>>>>>>> >>>>>>>>>>>> Below is the log: >>>>>>>>>>>> >>>>>>>>>>>> 2016-07-11 15:46:45,733 INFO >>>>>>>>>>>> org.apache.curator.framework.imps.CuratorFrameworkImpl:235 - >>>>>>>>>>>> Starting >>>>>>>>>>>> 2016-07-11 15:46:45,743 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:zookeeper.version=3.5.1-alpha-1748895, built on >>>>>>>>>>>> 06/17/2016 >>>>>>>>>>>> 22:24 GMT >>>>>>>>>>>> 2016-07-11 15:46:45,743 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client environment: >>>>>>>>>>>> host.name=w-rli09-ben >>>>>>>>>>>> 2016-07-11 15:46:45,744 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.version=1.8.0_11 >>>>>>>>>>>> 2016-07-11 15:46:45,744 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.vendor=Oracle >>>>>>>>>>>> Corporation >>>>>>>>>>>> 2016-07-11 15:46:45,744 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.home=/usr/local/apps/jdk1.8.0_11/jre >>>>>>>>>>>> 2016-07-11 15:46:45,744 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.class.path=(..) >>>>>>>>>>>> 2016-07-11 15:46:45,746 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.io.tmpdir=/tmp >>>>>>>>>>>> 2016-07-11 15:46:45,746 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:java.compiler=<NA> >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client environment: >>>>>>>>>>>> os.name=Linux >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:os.arch=amd64 >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:os.version=3.8.0-44-generic >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client environment: >>>>>>>>>>>> user.name=benji >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:user.home=(..) >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:user.dir=(..) >>>>>>>>>>>> 2016-07-11 15:46:45,747 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:os.memory.free=385MB >>>>>>>>>>>> 2016-07-11 15:46:45,748 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:os.memory.max=5358MB >>>>>>>>>>>> 2016-07-11 15:46:45,748 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:109 - Client >>>>>>>>>>>> environment:os.memory.total=397MB >>>>>>>>>>>> 2016-07-11 15:46:45,750 INFO >>>>>>>>>>>> org.apache.zookeeper.ZooKeeper:716 - Initiating client connection, >>>>>>>>>>>> connectString=QA-E8WIN11:2181,QA-E8WIN12:2181 sessionTimeout=45000 >>>>>>>>>>>> watcher=org.apache.curator.ConnectionState@3e96bacf >>>>>>>>>>>> 2016-07-11 15:46:45,769 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1138 - Opening socket connection >>>>>>>>>>>> to server >>>>>>>>>>>> QA-E8WIN12/10.11.8.232:2181. Will not attempt to authenticate >>>>>>>>>>>> using SASL (unknown error) >>>>>>>>>>>> Successfully established the connection with ZooKeeper >>>>>>>>>>>> 2016-07-11 15:47:08,300 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1251 - Client session timed out, >>>>>>>>>>>> have not >>>>>>>>>>>> heard from server in 22534ms for sessionid 0x0, closing socket >>>>>>>>>>>> connection >>>>>>>>>>>> and attempting reconnect >>>>>>>>>>>> 2016-07-11 15:47:08,403 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1138 - Opening socket connection >>>>>>>>>>>> to server >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181. Will not attempt to authenticate >>>>>>>>>>>> using SASL (unknown error) >>>>>>>>>>>> 2016-07-11 15:47:08,406 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:980 - Socket connection >>>>>>>>>>>> established, >>>>>>>>>>>> initiating session, client: /10.11.12.4:47004, server: >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181 >>>>>>>>>>>> 2016-07-11 15:47:08,413 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1400 - Session establishment >>>>>>>>>>>> complete on >>>>>>>>>>>> server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>>>>>>>>>> 0x100024f0936005b, negotiated timeout = 45000 (READ-ONLY mode) >>>>>>>>>>>> 2016-07-11 15:47:08,415 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1297 - Checking server QA-E8WIN12/ >>>>>>>>>>>> 10.11.8.232:2181 for being r/w. Timeout 200 >>>>>>>>>>>> 2016-07-11 15:47:08,420 INFO >>>>>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - >>>>>>>>>>>> State >>>>>>>>>>>> change: READ_ONLY >>>>>>>>>>>> 2016-07-11 15:49:15,712 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1253 - Unable to read additional >>>>>>>>>>>> data from >>>>>>>>>>>> server sessionid 0x100024f0936005b, likely server has closed >>>>>>>>>>>> socket, >>>>>>>>>>>> closing socket connection and attempting reconnect >>>>>>>>>>>> 2016-07-11 15:49:15,815 INFO >>>>>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - >>>>>>>>>>>> State >>>>>>>>>>>> change: SUSPENDED >>>>>>>>>>>> 2016-07-11 15:49:17,313 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1138 - Opening socket connection >>>>>>>>>>>> to server >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181. Will not attempt to authenticate >>>>>>>>>>>> using SASL (unknown error) >>>>>>>>>>>> 2016-07-11 15:49:17,314 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:980 - Socket connection >>>>>>>>>>>> established, >>>>>>>>>>>> initiating session, client: /10.11.12.4:47295, server: >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181 >>>>>>>>>>>> 2016-07-11 15:49:17,316 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1400 - Session establishment >>>>>>>>>>>> complete on >>>>>>>>>>>> server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>>>>>>>>>> 0x100024f0936005f, negotiated timeout = 45000 (READ-ONLY mode) >>>>>>>>>>>> 2016-07-11 15:49:17,316 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1297 - Checking server QA-E8WIN12/ >>>>>>>>>>>> 10.11.8.232:2181 for being r/w. Timeout 400 >>>>>>>>>>>> 2016-07-11 15:49:17,316 INFO >>>>>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - >>>>>>>>>>>> State >>>>>>>>>>>> change: READ_ONLY >>>>>>>>>>>> 2016-07-11 15:51:24,607 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1253 - Unable to read additional >>>>>>>>>>>> data from >>>>>>>>>>>> server sessionid 0x100024f0936005f, likely server has closed >>>>>>>>>>>> socket, >>>>>>>>>>>> closing socket connection and attempting reconnect >>>>>>>>>>>> 2016-07-11 15:51:24,713 INFO >>>>>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - >>>>>>>>>>>> State >>>>>>>>>>>> change: SUSPENDED >>>>>>>>>>>> 2016-07-11 15:51:26,467 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1138 - Opening socket connection >>>>>>>>>>>> to server >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181. Will not attempt to authenticate >>>>>>>>>>>> using SASL (unknown error) >>>>>>>>>>>> 2016-07-11 15:51:26,468 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:980 - Socket connection >>>>>>>>>>>> established, >>>>>>>>>>>> initiating session, client: /10.11.12.4:47576, server: >>>>>>>>>>>> QA-E8WIN11/10.11.8.236:2181 >>>>>>>>>>>> 2016-07-11 15:51:26,470 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1400 - Session establishment >>>>>>>>>>>> complete on >>>>>>>>>>>> server QA-E8WIN11/10.11.8.236:2181, sessionid = >>>>>>>>>>>> 0x100024f09360061, negotiated timeout = 45000 (READ-ONLY mode) >>>>>>>>>>>> 2016-07-11 15:51:26,470 INFO >>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:1297 - Checking server QA-E8WIN12/ >>>>>>>>>>>> 10.11.8.232:2181 for being r/w. Timeout 800 >>>>>>>>>>>> 2016-07-11 15:51:26,470 INFO >>>>>>>>>>>> org.apache.curator.framework.state.ConnectionStateManager:228 - >>>>>>>>>>>> State >>>>>>>>>>>> change: READ_ONLY >>>>>>>>>>>> (...) >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Ben >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
