>>> Ken Gaillot <kgail...@redhat.com> wrote on 27.01.2021 at 18:46 in message
<02cd90fcc10f1021d9f51649e2991da3209a6935.ca...@redhat.com>:
> On Wed, 2021-01-27 at 08:35 +0100, Ulrich Windl wrote:
>> >>> Tomas Jelinek <tojel...@redhat.com> wrote on 26.01.2021 at 16:15 in
>> >>> message <48f935a5-184f-d2d7-7f1a-db596aa6c...@redhat.com>:
>> > On 25. 01. 21 at 17:01, Ken Gaillot wrote:
>> > > On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
>> > > > Hi Digimer,
>> > > >
>> > > > On Sun, 24 Jan 2021 15:31:22 -0500
>> > > > Digimer <li...@alteeve.ca> wrote:
>> > > > [...]
>> > > > > I had a test server (srv01-test) running on node 1 (el8-a01n01),
>> > > > > and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
>> > > > >
>> > > > > It appears like pacemaker asked the VM to migrate to node 2
>> > > > > instead of stopping it. Once the server was on node 2, I couldn't
>> > > > > use 'pcs resource disable <vm>' as it returned that the resource
>> > > > > was unmanaged, and the cluster shutdown was hung. When I directly
>> > > > > stopped the VM and then did a 'pcs resource cleanup', the cluster
>> > > > > shutdown completed.
>> > > >
>> > > > As actions during a cluster shutdown cannot be handled in the same
>> > > > transition for each node, I usually add a step to disable all
>> > > > resources using the property "stop-all-resources" before shutting
>> > > > down the cluster:
>> > > >
>> > > >   pcs property set stop-all-resources=true
>> > > >   pcs cluster stop --all
>> > > >
>> > > > But it seems there's a very new cluster property to handle that
>> > > > (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
>> > > >
>> > > > [...]
>> > > >   some users prefer to make resources highly available only for
>> > > >   failures, with no recovery for clean shutdowns. If this option is
>> > > >   true, resources active on a node when it is cleanly shut down are
>> > > >   kept "locked" to that node (not allowed to run elsewhere) until
>> > > >   they start again on that node after it rejoins (or for at most
>> > > >   shutdown-lock-limit, if set).
>> > > > [...]
>> > > >
>> > > > [...]
>> > > > > So as best as I can tell, pacemaker really did ask for a
>> > > > > migration. Is this the case?
>> > > >
>> > > > AFAIK, yes, because each cluster shutdown request is handled
>> > > > independently at node level. There's a large door open for all
>> > > > kinds of race conditions if requests are handled with some random
>> > > > lag on each node.
>> > >
>> > > I'm going to guess that's what happened.
>> > >
>> > > The basic issue is that there is no "cluster shutdown" in Pacemaker,
>> > > only "node shutdown". I'm guessing "pcs cluster stop --all" sends
>> > > shutdown requests for each node in sequence (probably via systemd),
>> > > and if the nodes are quick enough, one could start migrating off
>> > > resources before all the others get their shutdown request.
>> >
>> > Pcs is doing its best to stop nodes in parallel. The first
>> > implementation of this was done back in 2015:
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1180506
>> > Since then, we moved to using curl for network communication, which
>> > also handles parallel cluster stop.
>> > Obviously, this doesn't ensure the stop command arrives to and is
>> > processed on all nodes at exactly the same time.
>> >
>> > Basically, pcs sends a 'stop pacemaker' request to all nodes in
>> > parallel and waits for it to finish on all nodes. Then it sends a
>> > 'stop corosync' request to all nodes in parallel. The actual stopping
>> > on each node is done by 'systemctl stop'.
>>
>> Hi!
>>
>> I wonder: is there actually a "stop node" command in the communication
>> protocol, or does it just kill the crmd remotely?
>> In the first case (the command exists), we would only need a grouping
>> for multiple commands, and we'd have a cluster shutdown:
>> One node sends a group of commands to stop every node. The nodes
>> acknowledge and then begin to stop...
>> (A "group of commands" is like a single database transaction containing
>> multiple changes.)
>>
>> Regards,
>> Ulrich
>
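To make Tomas's description above a bit more concrete: the same two-phase
stop could be approximated by hand roughly as below. This is only an
untested sketch assuming passwordless SSH between the two nodes mentioned in
this thread; it is not what pcs actually runs (pcs talks to the pcsd daemons
over its own network layer rather than SSH).

    #!/bin/sh
    # Rough, untested approximation of the two-phase stop described above:
    # stop pacemaker on all nodes in parallel, wait, then stop corosync.
    NODES="el8-a01n01 el8-a01n02"   # node names taken from this thread

    for n in $NODES; do
        ssh "$n" systemctl stop pacemaker &
    done
    wait    # pacemaker must be down everywhere before corosync is touched

    for n in $NODES; do
        ssh "$n" systemctl stop corosync &
    done
    wait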
Hi Ken!

As I periodically forget how this works: thanks once again for explaining!

> This is the current sequence of a clean shutdown for one node:
>
> 1. Someone or something (e.g. systemctl stop) sends SIGTERM to
> pacemakerd on the node to be shut down.
>
> 2. pacemakerd relays that signal to all the subdaemons on the node and
> waits for them to exit before exiting itself.
>
> 3. When the controller gets the SIGTERM, it sends a shutdown request to
> the controller on the DC node.

I wasn't aware that the DC is a process of its own; I thought it's just a
role of the crmd (which seems to be called pacemaker-controld nowadays).

>
> 4. When the DC receives the node's shutdown request, it sets a
> "shutdown" node attribute for the node and invokes the scheduler, which
> schedules all appropriate actions (stopping or moving resources, etc.).

But it does not trigger an "election" at that point (when the DC itself is
being shut down), right? An election is only triggered once the DC has left
the membership, right?

>
> 6. The DC coordinates all the necessary actions that were scheduled,
> then sends a confirmation to the node that requested shutdown.

Does that happen _after_ all the resources were migrated or stopped?

>
> 7. When the controller receives the confirmation, it exits.
>
> So ...
>
> A "shut down the whole cluster" command should be possible, but the
> process would need significant redesign. Currently a node has to
> initiate its own shutdown, because the local pacemakerd and controller
> have to be aware it's happening.
>
> I envision a new controller API request for cluster shutdown that would
> be relayed to all controllers, and each controller would send SIGTERM
> to the local pacemakerd. The DC would additionally set the shutdown
> attribute for all nodes at once and invoke the scheduler. Timing and
> corner cases would require a lot of attention (no DC elected, any node
> crashing at any point in the process, etc.).

A tricky part could be the confirmation: I once implemented a syslogd with a
control protocol that allows remote restart (after upgrading the "binary")
and shutdown. The command interpreter parses the command, executes it, then
reports back the result. That is a bit hairy for shutdown and reload/restart:
the process can't report back that it has already shut itself down. So I
cheated by queueing the shutdown/restart command, confirming (the queueing),
and only then processing the queue...
AFAIK "shutdown confirmation" in the cluster is indirect: you know that the
node has shut down when it stops responding...

>
> It should be feasible, someone would just need time to do it.

Sometimes it's like two months of thinking and one day of coding ;-)

Regards,
Ulrich

>
>> > Yes, the nodes which get the request sooner may start migrating
>> > resources.
>> >
>> > Regards,
>> > Tomas
>> >
>> > >
>> > > There would be a way around it. Normally Pacemaker is shut down via
>> > > SIGTERM to pacemakerd (which is what systemctl stop does), but
>> > > inside Pacemaker it's implemented as a special "shutdown" transient
>> > > node attribute, set to the epoch timestamp of the request. It would
>> > > be possible to set that attribute for all nodes in a copy of the
>> > > CIB, then load that into the live cluster.
>> > >
>> > > stop-all-resources as suggested would be another way around it (and
>> > > would have to be cleared after start-up, which could be a plus or a
>> > > minus depending on how much control vs convenience you want).
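To illustrate the "shutdown" attribute idea quoted above: a rough, untested
sketch might set the transient attribute for every node first (here via
crm_attribute against the live cluster rather than by editing a CIB copy, as
Ken suggested) and only then stop pacemaker everywhere. Whether setting that
attribute by hand is safe or supported is exactly the open question, so
treat it purely as an illustration; the node names are again the ones from
this thread.

    # Untested sketch: mark all nodes as shutting down "at once" by setting
    # the transient "shutdown" node attribute to the current epoch timestamp.
    now=$(date +%s)
    for n in el8-a01n01 el8-a01n02; do
        crm_attribute --node "$n" --name shutdown \
                      --update "$now" --lifetime reboot
    done

    # Then ask pacemakerd on each node to shut down as usual.
    for n in el8-a01n01 el8-a01n02; do
        ssh "$n" systemctl stop pacemaker &
    done
    wait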
> --
> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/