Thanks, I'll give that a try. Perhaps I'll file a feature request to auto-update the zk endpoints on a reconfig event so the mesos masters don't have to be restarted.
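
For anyone following along, here is a minimal sketch of how the updated zk:// list could be derived before each master restart. It assumes the text returned by ZooKeeper's `config` command (one `server.N=...` line per ensemble member) is already in hand. The function name, the quorum/election ports in the sample, and the `/mesos` znode path are my own placeholders, not anything taken from Mesos or from the test setup described below; only the client ports 2181-2186 come from this thread. IPv6 literals are not handled.

```python
import re

# One line of ZooKeeper's dynamic configuration, in the documented form
#   server.<id>=<host>:<port1>:<port2>[:role];[<client addr>:]<client port>
SERVER_LINE = re.compile(
    r"^server\.(?P<id>\d+)\s*=\s*(?P<host>[^:]+):\d+:\d+(?::(?P<role>\w+))?"
    r";(?:(?P<client_addr>[^:;]+):)?(?P<client_port>\d+)$"
)


def zk_url_from_dynamic_config(config_text, znode="/mesos", include_observers=False):
    """Build a zk://host:port,.../znode string from dynamic config text.

    Hypothetical helper: the znode default is an assumption, adjust to taste.
    """
    endpoints = []
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match is None:
            continue  # skips "version=..." and blank lines
        role = match.group("role") or "participant"
        if role != "participant" and not include_observers:
            continue  # leave out servers that are about to be retired
        endpoints.append(f"{match.group('host')}:{match.group('client_port')}")
    if not endpoints:
        raise ValueError("no usable server entries found")
    return "zk://" + ",".join(endpoints) + znode


# Sample resembling the single-host test: new participants on client ports
# 2184-2186, one old server (client port 2181) not yet removed.
# The 289x/389x ports are made up for illustration.
SAMPLE = """\
server.1=localhost:2891:3891:observer;2181
server.4=localhost:2894:3894:participant;2184
server.5=localhost:2895:3895:participant;2185
server.6=localhost:2896:3896:participant;0.0.0.0:2186
version=100000007
"""

print(zk_url_from_dynamic_config(SAMPLE))
# -> zk://localhost:2184,localhost:2185,localhost:2186/mesos
```

The resulting string would be what each mesos-master gets in its --zk parameter when it is restarted. Whether the masters should be restarted one at a time after every reconfig step, or once at the end, is something I still need to test.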

On Thu, Mar 10, 2022 at 12:49 PM Thomas Langé <t.la...@criteo.com> wrote:

> Hi,
>
> Your answer is in your last message:
> > Perhaps a mesos-master needs to be terminated and then restarted with an
> > updated zk:// list as each zk participant gets reconfig'ed?
>
> From what I understand, when your issue happens, your ZK cluster is
> healthy but Mesos masters fail to connect.
> It seems to be because Mesos masters are still configured to contact the 3
> "legacy nodes". As long as they are in the ZK cluster, they will forward
> your request to ZK leader, so the whole setup works. When you remove them,
> mesos-master cannot know how to reach a valid ZK member to access the
> cluster.
> So, you need to update the --zk parameter to always contain members of the
> cluster (Mesos won't read ZK configuration to fetch new members and
> auto-update its "--zk endpoints").
>
> To summarize, dynamic reconfiguration is a purely ZK feature and Mesos is
> not aware of those changes.
>
> Bw,
>
> Thomas
> ------------------------------
> *From:* Dan Leary <d...@touchplan.io>
> *Sent:* Thursday, 10 March 2022 16:16
> *To:* user@mesos.apache.org <user@mesos.apache.org>
> *Subject:* [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?
>
> Thomas-
>
> Encouraging news. Appreciate the response.
>
> I've tried both non-incremental and incremental reconfigs with the same
> result.
> With 3 zk participants (quorum 2) we first add 3 observers.
> Non-incrementally we then remove a participant then add an observer as
> participant.
> Repeat twice, last time the current leading participant is the one removed.
> At this point the 3 mesos-masters all seem fine.
> My bespoke framework is fine too, it sees CONNECT, RECONNECT, and RECONFIG
> events and gets the updated list of zk participants just fine.
> But when we terminate the original zk servers that are now running as
> non-voting followers, the mesos-masters all seem to keep trying to
> reconnect to the now-dead former zk participants.
> Eventually heartbeats fail and the whole cluster shuts down.
> The masters log messages like:
>
> 2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=10000 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd=<null> context=0x7f255c000bf8 flags=0
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
>
> Where ports 2181, 2182, 2183 are the old participants, the new
> participants are on ports 2184, 2185, 2186 (single host test environment).
> Perhaps a mesos-master needs to be terminated and then restarted with an
> updated zk:// list as each zk participant gets reconfig'ed?
>
> -Dan
>
> On Thu, Mar 10, 2022 at 5:15 AM Thomas Langé <t.la...@criteo.com> wrote:
>
> Hi,
>
> We don't run mesos 1.11 but we use Zookeeper with dynamic reconfiguration
> capability without any issue for Mesos 1.9.
> The only thing that should be handled carefully is the addition/removal of
> Zookeeper members when using the dynamic reconf feature.
>
> What do you mean by "mesos-master can handle a dynamic reconfiguration of
> the zk ensemble"? To my understanding, Mesos will only connect to ZK to
> elect a leader through ZK primitives; I don't think there is a correlation
> with how ZK members are set in the cluster.
>
> How do you remove/add members to the ZK member list? The issue you
> encounter might come from inconsistencies in the ZK cluster.
>
> Regards,
>
> Thomas
> ------------------------------
> *From:* Charles-François Natali <cf.nat...@gmail.com>
> *Sent:* Wednesday, 9 March 2022 23:44
> *To:* user <user@mesos.apache.org>
> *Subject:* [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?
>
> Hi Dan,
>
> I don't think anyone has been looking at this, and I doubt we will, since
> we are quite low on resources.
>
> Cheers,
>
> On Tue, Mar 8, 2022, 19:01 Dan Leary <d...@touchplan.io> wrote:
>
> Been doing some testing with mesos 1.11.0 and zookeeper's "dynamic
> reconfiguration" capability.
> https://zookeeper.apache.org/doc/r3.6.3/zookeeperReconfig.html
>
> Seems prima facie like mesos-master can handle a dynamic reconfig of the
> zk ensemble up to the point where a new set of participants has been added
> to the ensemble and the old participants have been demoted to non-voting
> followers.
> But when the non-voting follower processes are terminated the master logs
> seem to indicate that the masters keep trying and failing to reconnect to
> the old zk leader, even though they've apparently received updates with
> the new ensemble participants.
>
> Anybody have any insight into this?
> Any plans to support zk dynamic reconfiguration in the future?
> Seems like it could make for easier O/S maintenance of one's master/zk
> cluster hosts.
>
> -Dan