On Wed, 2019-10-09 at 20:10 +0200, Kadlecsik József wrote:
> On Wed, 9 Oct 2019, Ken Gaillot wrote:
>
> > > One of the nodes has got a failure ("watchdog: BUG: soft lockup -
> > > CPU#7 stuck for 23s"), which resulted that the node could process
> > > traffic on the backend interface but not on the frontend one. Thus
> > > the services became unavailable but the cluster thought the node is
> > > all right and did not stonith it.
> > >
> > > How could we protect the cluster against such failures?
> >
> > See the ocf:heartbeat:ethmonitor agent (to monitor the interface
> > itself) and/or the ocf:pacemaker:ping agent (to monitor reachability
> > of some IP such as a gateway)
>
> This looks really promising, thank you! Does the cluster regard it as
> a failure when an ocf:heartbeat:ethmonitor agent clone on a node does
> not run? :-)

If you configure it in the usual way, as a clone that runs on all nodes,
then a start failure on any node will be recorded in the cluster status.
To get other resources to move off such a node, you would colocate them
with the ethmonitor resource.

> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics
>          H-1525 Budapest 114, POB. 49, Hungary

--
Ken Gaillot <kgail...@redhat.com>
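
For illustration, the colocation approach described in the reply could look roughly like the CIB fragment below. This is only a sketch under stated assumptions: the IDs (ethmon, ethmon-clone, my-service), the interface name eth0, and the monitor interval and timeout are placeholders that do not come from this thread, and my-service stands in for whatever resource should leave a node whose ethmonitor instance is not running.

```xml
<configuration>
  <resources>
    <!-- Clone so that one ethmonitor instance runs on every node.
         All IDs and the interface name are assumed for this example. -->
    <clone id="ethmon-clone">
      <primitive id="ethmon" class="ocf" provider="heartbeat" type="ethmonitor">
        <instance_attributes id="ethmon-params">
          <!-- Interface to watch; replace eth0 with the frontend interface -->
          <nvpair id="ethmon-interface" name="interface" value="eth0"/>
        </instance_attributes>
        <operations>
          <!-- Interval and timeout are illustrative values -->
          <op id="ethmon-monitor" name="monitor" interval="10s" timeout="60s"/>
        </operations>
      </primitive>
    </clone>
    <!-- my-service (hypothetical) would be defined here as well -->
  </resources>
  <constraints>
    <!-- Mandatory colocation: my-service may only run on a node
         where an instance of ethmon-clone is active -->
    <rsc_colocation id="my-service-with-ethmon" rsc="my-service"
                    with-rsc="ethmon-clone" score="INFINITY"/>
  </constraints>
</configuration>
```

With a score of INFINITY the colocation is mandatory, so the dependent resource cannot stay on a node where the ethmonitor instance has failed or is stopped. A finite score would express a preference instead.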