[jira] [Comment Edited] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-03-09 Thread Eron Wright (JIRA)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393254#comment-16393254 ]

Eron Wright  edited comment on ZOOKEEPER-2982 at 3/9/18 5:40 PM:
-

[~andorm] are you driving this issue now?   Would you please assign the bug 
appropriately?

I'm keen to see the patch make it into 3.5.4.


was (Author: eronwright):
[~andorm] are you driving this issue now?   Would you please assign the bug 
appropriate?

I'm keen to see the patch make it into 3.5.4.

> Re-try DNS hostname -> IP resolution
>
> Key: ZOOKEEPER-2982
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0, 3.5.1, 3.5.3
> Reporter: Eron Wright
> Priority: Blocker
> Fix For: 3.5.4, 3.6.0
>
> Attachments: 3.5.3-beta.zip, fixed.log
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address. For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a headless Service 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95] - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.net.Socket.connect(Socket.java:589)
>     at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>     at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
> The server should eventually succeed but doesn't.
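
The retry behaviour the issue title asks for can be illustrated outside ZooKeeper. The following is only a sketch using plain JDK classes, not the actual patch: an `InetSocketAddress` performs its lookup once, in the constructor, so a retry has to build a new object on each attempt. The class name `ResolveRetry`, the method `resolveWithRetry`, and its parameters are hypothetical.

```java
import java.net.InetSocketAddress;

public class ResolveRetry {
    /**
     * Hypothetical helper: builds a fresh InetSocketAddress on every attempt,
     * because an existing instance never repeats its lookup.
     * Returns an unresolved address if all attempts fail.
     */
    public static InetSocketAddress resolveWithRetry(String host, int port,
                                                     int maxAttempts, long delayMs) {
        InetSocketAddress addr = InetSocketAddress.createUnresolved(host, port);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // The constructor attempts the lookup; the outcome is fixed for this object.
            addr = new InetSocketAddress(host, port);
            if (!addr.isUnresolved()) {
                return addr;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return addr;
    }

    public static void main(String[] args) {
        InetSocketAddress addr = resolveWithRetry("127.0.0.1", 2888, 3, 100L);
        System.out.println(addr + " unresolved=" + addr.isUnresolved());
    }
}
```

Note that the JVM keeps its own cache of failed lookups (security property `networkaddress.cache.negative.ttl`), so a retry delay shorter than that TTL may keep returning the cached failure.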



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-23 Thread Abraham Fine (JIRA)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375213#comment-16375213 ]

Abraham Fine edited comment on ZOOKEEPER-2982 at 2/24/18 1:42 AM:
--

[~fpj] I believe your diagnosis is correct, and I agree that [~eronwright]'s 
fix would solve the problem in the case where DNS is eventually fixed. My 
concern with the current solution is that it could cause us to jump back and 
forth between leader election and the quorum when DNS stays in a bad state. 
For example, imagine a 3-node cluster {z1, z2, z3}. z3 is always offline and z2 
has no entry in DNS. z2 will connect to z1 and win the leader election. When it 
comes time to form the quorum, z1 will be unable to follow z2 because it won't 
be able to resolve z2's address.

Just spitballing here, but what if we had z1 connect to the 
{{remoteSocketAddress}} of the socket created from the connection it received 
in {{QuorumCnxManager}}? I understand there are some security concerns here, 
but I'm not sure how much we care about them, since they would be cured by 
Kerberos or TLS (one day). We could also do a reverse DNS lookup and reject the 
connection if the result does not match our expected hostname. 

What do you guys think?
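
To make the reverse-lookup idea concrete, here is a rough sketch of the check, assuming only JDK classes. It is not ZooKeeper code: `PeerCheck` and `reverseLookupMatches` are hypothetical names, and the lookup is passed in as a function so the comparison can be exercised without a real DNS server.

```java
import java.net.InetAddress;
import java.util.function.Function;

public class PeerCheck {
    /**
     * Hypothetical check: accept the peer only if the name obtained by a
     * reverse lookup of its address equals the hostname we expect.
     */
    public static boolean reverseLookupMatches(InetAddress remote, String expectedHost,
                                               Function<InetAddress, String> reverseLookup) {
        String name = reverseLookup.apply(remote);
        // Hostnames are case-insensitive, so compare accordingly.
        return name != null && name.equalsIgnoreCase(expectedHost);
    }

    /** Same check using the JDK's own reverse lookup. */
    public static boolean reverseLookupMatches(InetAddress remote, String expectedHost) {
        return reverseLookupMatches(remote, expectedHost, InetAddress::getCanonicalHostName);
    }
}
```

On an accepted socket, the `remote` argument would come from `((InetSocketAddress) socket.getRemoteSocketAddress()).getAddress()`. Keep in mind that `getCanonicalHostName()` falls back to the textual IP address when the reverse lookup fails, in which case this check rejects the peer.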



was (Author: abrahamfine):
[~fpj] I believe your diagnosis to be correct and I agree that [~eronwright]'s 
fix would solve the problem in the case that DNS eventually is fixed. My 
concern with the current solution is that it could cause us to jump back and 
forth between leader election and the quorum when the DNS stays in a bad state. 
For example, imagine a 3 node cluster {z1, z2, z3}. z3 is always offline and z2 
has no entry in dns. z2 will connect to z1 and win the leader election. When it 
comes time to form the quorum z1 will be unable to follow z2 as it wont be able 
to resolve its address.

Just spitballing here, but what if we had z1 connect to the 
{{remoteSocketAddress}} of the socket created from the connection it received 
in {{QuorumCnxManager}}? I understand there are some security concerns here and 
I'm not sure how much we care about that since they would be stifled by 
Kerberos. We could also do a reverse dns lookup and reject the connection if 
the reverse lookup does not align with our expected hostname. 

What do you guys think?




[jira] [Comment Edited] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-22 Thread Flavio Junqueira (JIRA)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372542#comment-16372542 ]

Flavio Junqueira edited comment on ZOOKEEPER-2982 at 2/22/18 8:33 AM:
--

[~andorm], I have also tried your recipe for reproducing this by changing 
{{/etc/hosts}} and got the same issue. The problem is that the leader fails to 
bind to the port, which actually makes me wonder whether we need to do anything 
about the leader with respect to this issue:

{noformat}
java.net.SocketException: Unresolved address
    at java.net.ServerSocket.bind(ServerSocket.java:368)
    at java.net.ServerSocket.bind(ServerSocket.java:329)
    at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:240)
    at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
{noformat}
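
The `Unresolved address` failure in the trace above can be reproduced in isolation. This is a minimal sketch using only the JDK, not ZooKeeper code: `ServerSocket.bind` rejects an `InetSocketAddress` whose hostname was never resolved. The class name `UnresolvedBind` and the hostname used in `main` are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class UnresolvedBind {
    /**
     * Tries to bind a server socket to the given address.
     * Returns the exception message on failure, or null if the bind succeeded.
     */
    public static String tryBind(InetSocketAddress addr) {
        try (ServerSocket ss = new ServerSocket()) {
            ss.bind(addr);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // createUnresolved skips the lookup, simulating a name that failed to resolve.
        InetSocketAddress addr = InetSocketAddress.createUnresolved("zk-2.example.invalid", 0);
        System.out.println(tryBind(addr));
    }
}
```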

Your suggestion of the alternative change is sensible, but I'd say that for 
consistency, it is better that we simply do the same as in 3.4, which is to 
make the change in {{findLeader}}.

One thing that I believe we haven't been able to do is to write a test case 
that reproduces the problem. It would be good to have one, but I'm not sure 
what a good way to do it would be.


was (Author: fpj):
I have tried your recipe for reproducing as well [~andorm] by changing 
{{/etc/hosts}} and got the same issue. The problem is that the leader fails to 
bind to the port, which actually makes me wonder whether we need to do anything 
about the leader with respect to this issue:

```
java.net.SocketException: Unresolved address
at java.net.ServerSocket.bind(ServerSocket.java:368)
at java.net.ServerSocket.bind(ServerSocket.java:329)
at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
```

Your suggestion of the alternative change is sensible, but I'd say that for 
consistency, it is better that we simply do the same that we have in 3.4, which 
is to make the change in {{findLeader}}.

One thing that I believe we haven't been able to do is to have a test case to 
report it. It would be good to have it, but I'm not sure what would be a good 
way.



[jira] [Comment Edited] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-21 Thread Andor Molnar (JIRA)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371980#comment-16371980 ]

Andor Molnar edited comment on ZOOKEEPER-2982 at 2/21/18 8:42 PM:
--

It looks like sock.connect() in connectToLeader() requires the address to be 
resolved already.

If QuorumCnxManager fails to do that, connectToLeader() should be able to 
detect and fix it by resolving the address explicitly when addr.isUnresolved() 
returns true.

I'm not sure whether this is any better than calling recreateSocketAddress() in 
findLeader(), but it may be another option to consider.
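
As a rough illustration of the option described above, and not the actual ZooKeeper change, a re-resolution step could look like the following. The class `LeaderAddress` and the method `resolveIfNeeded` are hypothetical names. Only `isUnresolved()`, `getHostString()`, and `getPort()` are real JDK calls.

```java
import java.net.InetSocketAddress;

public class LeaderAddress {
    /**
     * Hypothetical helper: if the address is still unresolved, build a new
     * InetSocketAddress from the same host and port, which triggers a fresh
     * lookup. An already resolved address is returned unchanged.
     */
    public static InetSocketAddress resolveIfNeeded(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            // getHostString() returns the stored name without doing a reverse lookup.
            return new InetSocketAddress(addr.getHostString(), addr.getPort());
        }
        return addr;
    }
}
```

A caller would then invoke `sock.connect(resolveIfNeeded(addr), timeout)` instead of passing `addr` directly. If the fresh lookup also fails, the returned address is still unresolved and the connect throws `UnknownHostException` as before.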


was (Author: andorm):
Looks like that sock.connect() in connectToLeader() requires the address to be 
resolved already.

If QuorumCnxnManager() fails to do that, connectToLeader() might be able to 
detect & fix it by doing resolution explicitly when addr.isUnresolved() == true.

Not sure if it's any better than doing recreateSocketAddress() in findLeader(), 
but it may be another option to consider.



[jira] [Comment Edited] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution

2018-02-20 Thread Abraham Fine (JIRA)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370631#comment-16370631 ]

Abraham Fine edited comment on ZOOKEEPER-2982 at 2/20/18 10:28 PM:
---

I'm wondering if [~rthille] can chime in on this.

It looks like the change this JIRA is talking about is referenced by 
https://issues.apache.org/jira/browse/ZOOKEEPER-1506?focusedCommentId=14711955&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14711955

Is there a reason why this change was left out of branch-3.5 (and master)?

My guess is that in master and branch-3.5 we always call 
`recreateSocketAddresses` in `connectOne`, which should be called during leader 
election or when communication to another quorum member stops. Again, it would 
be great to have [~rthille] confirm or tell me how wrong I am.


was (Author: abrahamfine):
I'm wondering if  [~rthille] can chime in on this.

It looks like the change this JIRA is talking about is referenced by 
https://issues.apache.org/jira/browse/ZOOKEEPER-1506?focusedCommentId=14711955=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14711955

Is there a reason why this change was left out of branch-3.5 (and master)?
