Thanks, I'll give that a try.
Perhaps I'll file a feature request to auto-update the zk endpoints on a
reconfig event so the mesos masters don't have to be restarted.
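
In the meantime, here is a minimal sketch of how the --zk value could be regenerated after each reconfig step, before restarting a master. This is not Mesos or ZooKeeper code. It assumes you already have the text of the ensemble's dynamic config (the contents of the /zookeeper/config znode) and that its entries follow the documented server.<id>=<host>:<port>:<port>[:role];[<client address>:]<client port> form. The function name, the sample quorum/election ports, and the /mesos znode path are made up for illustration, and the parser handles hostnames and IPv4 addresses only.

```python
import re

# Matches entries such as:
#   server.4=localhost:2891:3891:participant;0.0.0.0:2184
# The role and the client address are both optional in this format.
SERVER_LINE = re.compile(
    r"^server\.(?P<id>\d+)=(?P<host>[^:;]+):\d+:\d+(?::(?P<role>\w+))?"
    r";(?:(?P<client_host>[^:;]+):)?(?P<client_port>\d+)$"
)


def zk_url_from_config(config_text, znode="/mesos", participants_only=True):
    """Build a zk://host:port,.../znode string from dynamic config text."""
    endpoints = []
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match is None:
            continue  # skips the "version=..." line and blank lines
        role = match.group("role") or "participant"
        if participants_only and role != "participant":
            continue
        client_host = match.group("client_host")
        # A missing or wildcard client address means "same host as the server".
        if client_host in (None, "0.0.0.0"):
            client_host = match.group("host")
        endpoints.append(f"{client_host}:{match.group('client_port')}")
    if not endpoints:
        raise ValueError("no usable server entries found")
    return "zk://" + ",".join(endpoints) + znode


# Hypothetical config text for a single-host test ensemble.
sample = "\n".join([
    "server.4=localhost:2891:3891:participant;0.0.0.0:2184",
    "server.5=localhost:2892:3892:participant;localhost:2185",
    "server.6=localhost:2893:3893:observer;2186",
    "version=100000004",
])
print(zk_url_from_config(sample))
```

The resulting string is what would be passed as --zk when the master is restarted. Observers are left out by default, which is only a choice made for this sketch; pass participants_only=False to include them.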

On Thu, Mar 10, 2022 at 12:49 PM Thomas Langé <> wrote:

> Hi,
> Your answer is in your last message:
> > Perhaps a mesos-master needs to be terminated and then restarted with an
> > updated zk:// list as each zk participant gets reconfig'ed?
> From what I understand, when your issue happens, your ZK cluster is
> healthy but the Mesos masters fail to connect.
> It seems to be because the Mesos masters are still configured to contact the 3
> "legacy nodes". As long as they are in the ZK cluster, they will forward
> your requests to the ZK leader, so the whole setup works. When you remove them,
> mesos-master has no way to reach a valid ZK member to access the
> cluster.
> So, you need to update the --zk parameter to always contain members of the
> cluster (Mesos won't read ZK configuration to fetch new members and
> auto-update its "--zk endpoints").
> To summarize, dynamic reconfiguration is a purely ZK feature and Mesos is
> not aware of those changes.
> Bw,
> Thomas
> ------------------------------
> *From:* Dan Leary <>
> *Sent:* Thursday, 10 March 2022 16:16
> *To:* <>
> *Subject:* [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic
> Reconfiguration?
> Thomas-
> Encouraging news.  Appreciate the response.
> I've tried both non-incremental and incremental reconfigs with the same
> result.
> With 3 zk participants (quorum 2) we first add 3 observers.
> Non-incrementally we then remove a participant then add an observer as
> participant.
> Repeat twice, last time the current leading participant is the one removed.
> At this point the 3 mesos-masters all seem fine.
> My bespoke framework is fine too, it sees CONNECT, RECONNECT, and RECONFIG
> events and gets the updated list of zk participants just fine.
> But when we terminate the original zk servers that are now running as
> non-voting followers, the mesos-masters all seem to keep trying to
> reconnect to the now-dead former zk participants.
> Eventually heartbeats fail and the whole cluster shuts down.
> The masters log messages like:
> 2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=10000 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd=<null> context=0x7f255c000bf8 flags=0
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [<>] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> Ports 2181, 2182, 2183 are the old participants; the new
> participants are on ports 2184, 2185, 2186 (single-host test environment).
> Perhaps a mesos-master needs to be terminated and then restarted with an
> updated zk:// list as each zk participant gets reconfig'ed?
> -Dan
> On Thu, Mar 10, 2022 at 5:15 AM Thomas Langé <> wrote:
> Hi,
> We don't run mesos 1.11 but we use Zookeeper with dynamic reconfiguration
> capability without any issue for Mesos 1.9. The only thing that should be
> handled carefully is the addition/removal of Zookeeper members when using
> dynamic reconf feature.
> What do you mean by "mesos-master can handle a dynamic reconfiguration of
> the zk ensemble" ? To my understanding, Mesos will only connect to ZK to
> elect a leader through ZK primitives; I don't think there is a correlation
> with how ZK members are set in the cluster.
> How do you remove/add members to the ZK member list? The issue you
> encounter might come from inconsistencies in ZK cluster.
> Regards,
> Thomas
> ------------------------------
> *From:* Charles-François Natali <>
> *Sent:* Wednesday, 9 March 2022 23:44
> *To:* user <>
> *Subject:* [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?
> Hi Dan,
> I don't think anyone has been looking at this, and I doubt we will, since
> we are quite low on resources.
> Cheers,
> On Tue, Mar 8, 2022, 19:01 Dan Leary <> wrote:
> Been doing some testing with mesos 1.11.0 and zookeeper's "dynamic
> reconfiguration" capability.
> Seems prima facie like mesos-master can handle a dynamic reconfig of the
> zk ensemble up to the point
> where a new set of participants has been added to the ensemble and the old
> participants
> have been demoted to non-voting followers.  But when the non-voting
> follower processes are
> terminated the master logs seem to indicate that the masters keep trying
> and failing to reconnect
> to the old zk leader, even though they've apparently received updates with
> the new ensemble participants.
> Anybody have any insight into this?
> Any plans to support zk dynamic reconfiguration in the future?
> Seems like it could make for easier O/S maintenance of one's master/zk
> cluster hosts.
> -Dan
