Update: It seems that fencing does work as I expected; the problem was with how I was testing it. I was seeing the node turned "off" (isolated) and then immediately "on" (unisolated), which seemed wrong. The way I was taking the node down in my testing was to kill some of its processes, including the pacemaker and corosync processes. However, the systemd unit file for pacemaker/corosync is configured to restart the service immediately if it dies. So I was seeing the "on" call right after the "off" because the pacemaker/corosync service had been restarted, making it appear that the node I had just killed came back immediately.

Thanks,
Ryan
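P.S. For anyone who wants to reproduce the test without that effect, here is a minimal sketch of how the auto-restart can be checked and temporarily disabled while testing fencing. The drop-in path and file name below are just examples, not part of the original setup:

  # Check whether the unit auto-restarts when its processes are killed
  # (Restart=on-failure or Restart=always produces the "node came right
  # back" effect described above).
  systemctl show -p Restart pacemaker.service
  systemctl show -p Restart corosync.service

  # Temporarily override the restart policy with a systemd drop-in
  # (example path/file name), then reload systemd:
  mkdir -p /etc/systemd/system/pacemaker.service.d
  cat > /etc/systemd/system/pacemaker.service.d/99-test-no-restart.conf <<'EOF'
  [Service]
  Restart=no
  EOF
  systemctl daemon-reload

  # Killing the daemons now leaves the node down, so the fence "off" is not
  # immediately followed by an "on" triggered by the node rejoining.
  # Remove the drop-in and daemon-reload again when testing is done.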
On Tue, Sep 4, 2018 at 7:49 PM Ken Gaillot <kgail...@redhat.com> wrote:

> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
> > I'm seeing unexpected behavior when using "unfencing"; I don't think I'm understanding it correctly. I configured a resource that "requires unfencing" and have a custom fencing agent which "provides unfencing". I perform a simple test where I set up the cluster and then run "pcs stonith fence node2", and I see that node2 is successfully fenced by sending an "off" action to my fencing agent. But immediately after this, I see an "on" action sent to my fencing agent. My fence agent doesn't implement the "reboot" action, so perhaps it's trying to reboot by running an off action followed by an on action. Prior to adding "provides unfencing" to the fencing agent, I didn't see the on action. It seems unsafe to say "node2, you can't run" and then immediately "you can run".
>
> I'm not as familiar with unfencing as I'd like, but I believe the basic idea is:
>
> - the fence agent's off action cuts the machine off from something essential needed to run resources (generally shared storage or network access)
>
> - the fencing works such that a fenced host is not able to request rejoining the cluster without manual intervention by a sysadmin
>
> - when the sysadmin allows the host back into the cluster, and it contacts the other nodes to rejoin, the cluster will call the fence agent's on action, which is expected to re-enable the host's access
>
> How that works in practice, I have only vague knowledge.
>
> > I don't think I'm understanding this aspect of fencing/stonith. I thought that the fence agent acted as a proxy to a node: when the node was fenced, it was isolated from shared storage by some means (power, fabric, etc.). It seems like it shouldn't become unfenced until connectivity between the nodes is repaired. Yet the node is turned "off" (isolated) and then "on" (unisolated) immediately. This (kind of) makes sense for a fencing agent that uses power to isolate, since when it's turned back on, pacemaker will not start any resources on that node until it sees the other nodes (due to the wait_for_all setting). However, for other types of fencing agents, it doesn't make sense. Does the "off" action not mean isolate from shared storage? And the "on" action not mean unisolate? What is the correct way to understand fencing/stonith?
>
> I think the key idea is that "on" will be called when the fenced node asks to rejoin the cluster. So stopping that from happening until a sysadmin has intervened is an important part (if I'm not missing something).
>
> Note that if the fenced node still has network connectivity to the cluster, and the fenced node is actually operational, it will be notified by the cluster that it was fenced, and it will stop its pacemaker, thus fulfilling the requirement. But you obviously can't rely on that, because fencing may be called precisely because network connectivity is lost or the host is not fully operational.
>
> > The behavior I wanted to see was: when pacemaker lost connectivity to a node, it would run the off action for that node. If this succeeded, it could continue running resources. Later, when pacemaker saw the node again, it would run the "on" action on the fence agent (knowing that it was no longer split-brained). Node2 would try to do the same thing, but once it was fenced, it would no longer attempt to fence node1. It also wouldn't attempt to start any resources. I thought that adding "requires unfencing" to the resource would make this happen. Is there a way to get this behavior?
>
> That is basically what happens; the question is how "pacemaker saw the node again" becomes possible.
>
> > Thanks!
> >
> > btw, here's the cluster configuration:
> >
> > pcs cluster auth node1 node2
> > pcs cluster setup --name ataCluster node1 node2
> > pcs cluster start --all
> > pcs property set stonith-enabled=true
> > pcs resource defaults migration-threshold=1
> > pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence meta requires=unfencing
> > pcs stonith create myStonith fence_custom op monitor interval=0 meta provides=unfencing
> > pcs property set symmetric-cluster=true
>
> --
> Ken Gaillot <kgail...@redhat.com>
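For anyone following along, here is a rough sketch of what a minimal agent along the lines of the fence_custom above might look like. The real agent was never posted to the list, so every name and detail below is illustrative; the sketch only assumes the standard fence agent calling convention (options arrive as name=value pairs on stdin, the agent implements off/on/monitor/metadata, and exit code 0 means success):

  #!/bin/bash
  # Rough, illustrative skeleton only. Pacemaker's fencer passes options to
  # the agent as name=value pairs on stdin, one per line (action, nodename
  # or port, plus any device parameters).

  action=""
  target=""
  while read -r line; do
      case "$line" in
          action=*)   action="${line#action=}" ;;
          nodename=*) target="${line#nodename=}" ;;
          port=*)     target="${line#port=}" ;;
          plug=*)     target="${line#plug=}" ;;
      esac
  done

  case "$action" in
      off)
          # Isolate the target from shared storage / the fabric here.
          logger -t fence_custom "isolating $target"
          exit 0
          ;;
      on)
          # The unfencing step: called when the fenced node is allowed back
          # into the cluster; restore its access here.
          logger -t fence_custom "unisolating $target"
          exit 0
          ;;
      monitor|status)
          exit 0
          ;;
      metadata)
          cat <<'EOF'
  <?xml version="1.0" ?>
  <resource-agent name="fence_custom" shortdesc="example fabric-level fence agent">
    <parameters/>
    <actions>
      <!-- on_target/automatic on the "on" action is one way an agent can
           advertise unfencing support; the pcs config above declares it
           explicitly with meta provides=unfencing instead -->
      <action name="on" on_target="1" automatic="1"/>
      <action name="off"/>
      <action name="monitor"/>
      <action name="metadata"/>
    </actions>
  </resource-agent>
  EOF
          exit 0
          ;;
      *)
          # No reboot action implemented, as with the agent described in this thread.
          exit 1
          ;;
  esac

The "on" action here is the unfencing call discussed above: with provides=unfencing on the stonith resource, it is run against a node when that node is allowed back into the cluster, not as part of the "off".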