Thanks, I'll give that a try.
Perhaps I'll file a feature request to auto-update the zk endpoints on a
reconfig event so the mesos masters don't have to be restarted.
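
A minimal sketch of what such an auto-update would have to compute, assuming the ensemble's dynamic configuration text (the `server.N=...` lines ZooKeeper 3.5+ keeps in `/zookeeper/config`) has already been fetched by some other means. The function name and the `/mesos` znode path are illustrative assumptions, not anything Mesos provides today:

```python
def zk_url_from_dynamic_config(config_text, znode_path="/mesos"):
    """Build a zk:// URL from ZooKeeper dynamic-config text.

    Expects lines of the documented form
        server.<id>=<host>:<port1>:<port2>[:role];[<client addr>:]<client port>
    Assumes hostnames or IPv4 addresses (no IPv6 literals).
    """
    endpoints = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue  # skips blank lines and the trailing version=... line
        _, spec = line.split("=", 1)
        server_part, _, client_part = spec.partition(";")
        if not client_part:
            continue  # no client port advertised for this server
        fields = server_part.split(":")
        host = fields[0]
        role = fields[3] if len(fields) > 3 else "participant"
        if role != "participant":
            continue  # design choice in this sketch: list voting members only
        # The client part is either "<port>" or "<address>:<port>".
        client_host, _, client_port = client_part.rpartition(":")
        if client_host in ("", "0.0.0.0"):
            client_host = host  # wildcard bind: fall back to the server host
        endpoints.append(f"{client_host}:{client_port}")
    return "zk://" + ",".join(endpoints) + znode_path


sample = """\
server.4=localhost:2891:3891:participant;2184
server.5=localhost:2892:3892:participant;0.0.0.0:2185
server.6=localhost:2893:3893:observer;2186
version=100000010
"""
print(zk_url_from_dynamic_config(sample))
```

Until something like this exists inside mesos-master, the resulting string would still have to be passed via --zk on a restart.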


On Thu, Mar 10, 2022 at 12:49 PM Thomas Langé <t.la...@criteo.com> wrote:

> Hi,
>
> Your answer is in your last message:
> >Perhaps a mesos-master needs to be terminated and then restarted with an
> updated zk:// list as
> > each zk participant gets reconfig'ed?
>
> From what I understand, when your issue happens, your ZK cluster is
> healthy but the Mesos masters fail to connect.
> It seems to be because the Mesos masters are still configured to contact
> the 3 "legacy nodes". As long as those nodes are in the ZK cluster, they
> will forward your requests to the ZK leader, so the whole setup works. When
> you remove them, mesos-master has no way to know how to reach a valid ZK
> member to access the cluster.
> So, you need to update the --zk parameter to always contain members of the
> cluster (Mesos won't read ZK configuration to fetch new members and
> auto-update its "--zk endpoints").
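
To make that reasoning concrete, here is a rough sketch of a pre-flight check: before shutting down old ZK servers, compare the endpoints in a master's --zk value with the servers that will remain. The helper names are hypothetical; the only thing taken from Mesos is the `zk://[user:password@]host:port,.../path` URL shape.

```python
def zk_endpoints(zk_url):
    """Return the set of host:port endpoints named in a zk:// URL."""
    prefix = "zk://"
    if not zk_url.startswith(prefix):
        raise ValueError("expected a zk:// URL")
    hosts, _, _path = zk_url[len(prefix):].partition("/")
    # Drop optional "user:password@" credentials.
    hosts = hosts.rpartition("@")[2]
    return {h for h in hosts.split(",") if h}


def reachable_endpoints(zk_url, remaining_servers):
    """Endpoints the master is configured with that will still exist."""
    return zk_endpoints(zk_url) & set(remaining_servers)


configured = "zk://localhost:2181,localhost:2182,localhost:2183/mesos"
remaining = ["localhost:2184", "localhost:2185", "localhost:2186"]

if not reachable_endpoints(configured, remaining):
    print("master would be stranded: restart it with an updated --zk first")
```

An empty intersection is exactly the situation in the logs below: every configured endpoint refuses the connection, even though the ensemble itself is healthy.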
>
> To summarize, dynamic reconfiguration is purely a ZK feature and Mesos is
> not aware of those changes.
>
> Bw,
>
> Thomas
> ------------------------------
> *From:* Dan Leary <d...@touchplan.io>
> *Sent:* Thursday, 10 March 2022 16:16
> *To:* user@mesos.apache.org <user@mesos.apache.org>
> *Subject:* [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic
> Reconfiguration?
>
> Thomas-
>
> Encouraging news.  Appreciate the response.
>
> I've tried both non-incremental and incremental reconfigs with the same
> result.
> With 3 zk participants (quorum 2) we first add 3 observers.
> Non-incrementally, we then remove a participant and add an observer as a
> participant.
> Repeat twice; the last time, the participant removed is the current leader.
> At this point the 3 mesos-masters all seem fine.
> My bespoke framework is fine too; it sees CONNECT, RECONNECT, and RECONFIG
> events and gets the updated list of zk participants.
> But when we terminate the original zk servers that are now running as
> non-voting followers, the mesos-masters all seem to keep trying to
> reconnect to the now-dead former zk participants.
> Eventually heartbeats fail and the whole cluster shuts down.
> The masters log messages like:
>
> 2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=10000 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd=<null> context=0x7f255c000bf8 flags=0
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
>
>
> Where ports 2181, 2182, 2183 are the old participants, the new
> participants are on ports 2184, 2185, 2186  (single host test environment).
> Perhaps a mesos-master needs to be terminated and then restarted with an
> updated zk:// list as each zk participant gets reconfig'ed?
>
> -Dan
>
>
> On Thu, Mar 10, 2022 at 5:15 AM Thomas Langé <t.la...@criteo.com> wrote:
>
> Hi,
>
> We don't run mesos 1.11 but we use Zookeeper with dynamic reconfiguration
> capability without any issue for Mesos 1.9. The only thing that should be
> handled carefully is the addition/removal of Zookeeper members when using
> the dynamic reconfiguration feature.
>
> What do you mean by "mesos-master can handle a dynamic reconfiguration of
> the zk ensemble" ? To my understanding, Mesos will only connect to ZK to
> elect a leader through ZK primitives; I don't think there is a correlation
> with how ZK members are set in the cluster.
>
> How do you remove/add members to the ZK member list? The issue you
> encounter might come from inconsistencies in the ZK cluster.
>
> Regards,
>
> Thomas
> ------------------------------
> *From:* Charles-François Natali <cf.nat...@gmail.com>
> *Sent:* Wednesday, 9 March 2022 23:44
> *To:* user <user@mesos.apache.org>
> *Subject:* [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?
>
> Hi Dan,
>
> I don't think anyone has been looking at this, and I doubt we will, since
> we are quite low on resources.
>
>
> Cheers,
>
>
>
>
> On Tue, Mar 8, 2022, 19:01 Dan Leary <d...@touchplan.io> wrote:
>
> Been doing some testing with mesos 1.11.0 and zookeeper's "dynamic
> reconfiguration" capability.
> https://zookeeper.apache.org/doc/r3.6.3/zookeeperReconfig.html
>
> Seems prima facie like mesos-master can handle a dynamic reconfig of the
> zk ensemble up to the point
> where a new set of participants has been added to the ensemble and the old
> participants
> have been demoted to non-voting followers.  But when the non-voting
> follower processes are
> terminated the master logs seem to indicate that the masters keep trying
> and failing to reconnect
> to the old zk leader, even though they've apparently received updates with
> the new ensemble participants.
>
> Anybody have any insight into this?
> Any plans to support zk dynamic reconfiguration in the future?
> Seems like it could make for easier O/S maintenance of one's master/zk
> cluster hosts.
>
> -Dan
>
>
