Re: [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

2022-03-10 Thread Dan Leary
Thanks, I'll give that a try.
Perhaps I'll file a feature request to auto-update the zk endpoints on a
reconfig event so the mesos masters don't have to be restarted.
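
Until something like that exists, one stopgap might be a small wrapper that rebuilds the --zk value from the ensemble's current dynamic configuration before each master restart. Below is a minimal Python sketch, assuming the text of the /zookeeper/config znode (ZooKeeper 3.5+) has already been fetched by some other means and that the masters use a /mesos znode path. The function name, the sample hostnames and the /mesos path are illustrative, not anything Mesos provides.

```python
import re

# One server line of ZooKeeper's dynamic configuration has the form:
#   server.<id>=<host>:<quorum port>:<election port>[:<role>];[<client addr>:]<client port>
# IPv6 literals are not handled in this sketch.
_SERVER_LINE = re.compile(
    r"^server\.\d+=(?P<host>[^:;]+):\d+:\d+(?::(?P<role>\w+))?;"
    r"(?:(?P<client_addr>[^:;]+):)?(?P<client_port>\d+)$"
)


def zk_url_from_dynamic_config(config_text, path="/mesos"):
    """Build a zk:// URL listing the client endpoints of current participants.

    `path` is an assumption; use whatever znode path the masters already use.
    """
    endpoints = []
    for line in config_text.splitlines():
        match = _SERVER_LINE.match(line.strip())
        if match is None:
            continue  # skips "version=..." and anything unrecognised
        # A server line without an explicit role is a participant.
        if (match.group("role") or "participant") != "participant":
            continue
        host = match.group("client_addr") or match.group("host")
        if host == "0.0.0.0":
            # Wildcard bind address: fall back to the server's own hostname.
            host = match.group("host")
        endpoints.append(f"{host}:{match.group('client_port')}")
    if not endpoints:
        raise ValueError("no participants found in dynamic config")
    return "zk://" + ",".join(endpoints) + path


# Hypothetical config text, as it might look after the reconfig.
sample = """\
server.1=zk1.example.com:2888:3888:observer;2181
server.4=zk4.example.com:2888:3888:participant;2181
server.5=zk5.example.com:2888:3888:participant;0.0.0.0:2182
version=100000003"""

print(zk_url_from_dynamic_config(sample))
```

The resulting string would then be passed as the --zk value when each mesos-master is restarted. Listing only participants is a choice made for this sketch; observers also accept client connections and could be included.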


On Thu, Mar 10, 2022 at 12:49 PM Thomas Langé  wrote:

> Hi,
>
> Your answer is in your last message:
> > Perhaps a mesos-master needs to be terminated and then restarted with an
> > updated zk:// list as each zk participant gets reconfig'ed?
>
> From what I understand, when your issue happens, your ZK cluster is
> healthy but the Mesos masters fail to connect.
> It seems to be because the Mesos masters are still configured to contact
> the 3 "legacy nodes". As long as those nodes are in the ZK cluster, they
> will forward your requests to the ZK leader, so the whole setup works.
> When you remove them, mesos-master has no way to reach a valid ZK member
> to access the cluster.
> So, you need to update the --zk parameter to always contain members of the
> cluster (Mesos won't read ZK configuration to fetch new members and
> auto-update its "--zk endpoints").
>
> To summarize, dynamic reconfiguration is a purely ZK feature and Mesos is
> not aware of those changes.
>
> Bw,
>
> Thomas
> --
> *From:* Dan Leary 
> *Sent:* Thursday, 10 March 2022 16:16
> *To:* user@mesos.apache.org 
> *Subject:* [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic
> Reconfiguration?
>
> Thomas-
>
> Encouraging news.  Appreciate the response.
>
> I've tried both non-incremental and incremental reconfigs with the same
> result.
> With 3 zk participants (quorum 2) we first add 3 observers.
> Non-incrementally we then remove a participant then add an observer as
> participant.
> Repeat twice, last time the current leading participant is the one removed.
> At this point the 3 mesos-masters all seem fine.
> My bespoke framework is fine too, it sees CONNECT, RECONNECT, and RECONFIG
> events and gets the updated list of zk participants just fine.
> But when we terminate the original zk servers that are now running as
> non-voting followers, the mesos-masters all seem to keep trying to
> reconnect to the now-dead former zk participants.
> Eventually heartbeats fail and the whole cluster shuts down.
> The masters log messages like:
>
> 2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=1 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd= context=0x7f255c000bf8 flags=0
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182
> 

Re: [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

2022-03-10 Thread Thomas Langé
Hi,

Your answer is in your last message:
> Perhaps a mesos-master needs to be terminated and then restarted with an
> updated zk:// list as each zk participant gets reconfig'ed?

From what I understand, when your issue happens, your ZK cluster is healthy
but the Mesos masters fail to connect.
It seems to be because the Mesos masters are still configured to contact the 3
"legacy nodes". As long as those nodes are in the ZK cluster, they will forward
your requests to the ZK leader, so the whole setup works. When you remove them,
mesos-master has no way to reach a valid ZK member to access the cluster.
So, you need to update the --zk parameter to always contain members of the 
cluster (Mesos won't read ZK configuration to fetch new members and auto-update 
its "--zk endpoints").
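
To make the failure mode concrete: the masters keep working only while the static --zk list and the live ensemble still overlap. Here is a rough Python sketch of that check. The function names are hypothetical, the /mesos path and ports 2184-2186 are made up for illustration, and the list of live endpoints is supplied by hand; Mesos does nothing like this itself.

```python
def split_zk_url(zk_url):
    """Split 'zk://[user:pass@]h1:p1,h2:p2/path' into (endpoints, path)."""
    prefix = "zk://"
    if not zk_url.startswith(prefix):
        raise ValueError(f"not a zk:// URL: {zk_url!r}")
    hosts, slash, path = zk_url[len(prefix):].partition("/")
    hosts = hosts.rpartition("@")[2]  # drop optional credentials
    endpoints = [h for h in hosts.split(",") if h]
    return endpoints, slash + path


def usable_endpoints(zk_url, live_endpoints):
    """Return the configured endpoints that are still live ensemble members.

    Endpoints are compared as plain 'host:port' strings; no DNS resolution.
    """
    configured, _ = split_zk_url(zk_url)
    live = set(live_endpoints)
    return [endpoint for endpoint in configured if endpoint in live]


# The host list from the log in this thread; the /mesos path is assumed.
configured = "zk://localhost:2181,localhost:2182,localhost:2183/mesos"

# While the original servers are still running, the masters can connect.
during_migration = ["localhost:2181", "localhost:2182", "localhost:2183",
                    "localhost:2184", "localhost:2185", "localhost:2186"]
print(usable_endpoints(configured, during_migration))

# Once the original servers are terminated, nothing in --zk is reachable.
after_shutdown = ["localhost:2184", "localhost:2185", "localhost:2186"]
print(usable_endpoints(configured, after_shutdown))
```

An empty result for the second call corresponds to the state where every connection attempt is refused, so the masters would need a restart with an updated --zk list before the old servers are shut down.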

To summarize, dynamic reconfiguration is a purely ZK feature and Mesos is not 
aware of those changes.

Bw,

Thomas

From: Dan Leary 
Sent: Thursday, 10 March 2022 16:16
To: user@mesos.apache.org 
Subject: [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Thomas-

Encouraging news.  Appreciate the response.

I've tried both non-incremental and incremental reconfigs with the same result.
With 3 zk participants (quorum 2) we first add 3 observers.
Non-incrementally we then remove a participant then add an observer as 
participant.
Repeat twice, last time the current leading participant is the one removed.
At this point the 3 mesos-masters all seem fine.
My bespoke framework is fine too, it sees CONNECT, RECONNECT, and RECONFIG 
events and gets the updated list of zk participants just fine.
But when we terminate the original zk servers that are now running as 
non-voting followers, the mesos-masters all seem to keep trying to reconnect to 
the now-dead former zk participants.
Eventually heartbeats fail and the whole cluster shuts down.
The masters log messages like:

2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=1 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd= context=0x7f255c000bf8 flags=0
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2182] zk retcod