On Tue, 2018-04-10 at 12:56 -0500, Ryan Thomas wrote:
> I’m trying to implement a HA solution which recovers very quickly
> when a node fails. In my configuration, when I reboot a node, I see
> in the logs that pacemaker realizes the node is down, and decides to
> move all resources to the surviving node. To do this, it initiates a
> ‘stop’ operation on each of the resources to perform the move. The
> ‘stop’ fails as expected after 20s (the default action timeout).
> However, in this case, with the node known to be down, I’d like to
> avoid this 20 second delay. The node is known to be down, so any
> operations sent to the node will fail. It would be nice if
> operations sent to a down node would immediately fail, thus reducing
> the time it takes the resource to be started on the surviving node.
> I do not want to reduce the timeout for the operation, because the
> timeout is sensible for when a resource moves due to a
> non-node-failure. Is there a way to accomplish this?
>
> Thanks for your help.

How are you rebooting: cleanly (normal shutdown), or by simulating a
failure (e.g. power button)?

In a normal shutdown, pacemaker will move all resources off the node
before it shuts down. These operations shouldn't fail, because the node
isn't down yet.

When a node fails, corosync should detect this and notify pacemaker.
Pacemaker will not try to execute any operations on a failed node.
Instead, it will fence it.

What log messages do you see from corosync and pacemaker indicating
that the node is down? Do you have fencing configured and tested?
-- 
Ken Gaillot <kgail...@redhat.com>