On Tue, Jun 7, 2022 at 10:27 AM Zoran Bošnjak <zoran.bosn...@via.si> wrote:
>
> Hi, I need some help with correct fencing configuration in 5-node cluster.
>
> The specific issue is that there are 3 rooms, where in addition to node 
> failure scenario, each room can fail too (for example in case of room power 
> failure or room network failure).
>
> room0: [ node0 ]
> roomA: [ node1, node2 ]
> roomB: [ node3, node4 ]
>
> - ipmi board is present on each node
> - watchdog timer is available
> - shared storage is not available
>
> Please advise what a proper fencing configuration would be in this case.
>
> The intention is to configure ipmi fencing (using "fence_idrac" agent) plus 
> watchdog timer as a fallback. In other words, I would like to tell the 
> pacemaker: "If fencing is required, try to fence via ipmi. In case of ipmi 
> fence failure, after some timeout assume watchdog has rebooted the node, so 
> it is safe to proceed, as if the (self)fencing had succeeded."
>
> From the documentation it is not clear to me whether this would be:
> a) multiple fencing where ipmi would be first level and sbd would be a second 
> level fencing (where sbd always succeeds)
> b) or this is considered a single level fencing with a timeout

With b), falling back to watchdog fencing wouldn't work properly, although
I remember some recent change that might make it fall back without issues.
I would go for a): with a reasonably current pacemaker version (iirc 2.1.0
and above) you should be able to make the watchdog fencing device visible
like any other fencing device. Just use fence_watchdog as the fence agent;
fencing is still implemented inside pacemaker, and the fence_watchdog binary
actually just provides the metadata. That way you can limit watchdog fencing
to the nodes that actually provide a proper hardware watchdog, and you can
add it to a topology.
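
A rough sketch of what option a) could look like with pcs, under a few
assumptions: the node names node0..node4 follow the room layout quoted
above, the IPMI devices are named fence_ipmi_<node> as in the quoted pcs
command, and the watchdog device uses the id "watchdog" (as far as I know
that is the only id pacemaker accepts for a fence_watchdog device, but
please verify against your version). Note that the dc-version shown in the
quoted properties is 2.0.3, so this would need a newer pacemaker. The
script only prints the commands unless RUN is set.

```shell
#!/bin/sh
# Sketch only: prints the pcs commands by default (dry run).
# Set RUN=sudo to actually execute them on a cluster node.
RUN="${RUN:-echo}"

# Node names taken from the room layout in the original message.
NODES="node0 node1 node2 node3 node4"

configure_topology() {
    # Assumption: pacemaker >= 2.1.0 with the fence_watchdog agent installed,
    # and the device id "watchdog" (believed to be the only id accepted).
    $RUN pcs stonith create watchdog fence_watchdog

    for node in $NODES; do
        # Level 1: the per-node IPMI device, named as in the original post.
        $RUN pcs stonith level add 1 "$node" "fence_ipmi_$node"
        # Level 2: watchdog self-fencing, tried only if level 1 fails.
        $RUN pcs stonith level add 2 "$node" watchdog
    done
}

configure_topology
```

stonith-watchdog-timeout stays set as before. If only some nodes have a
usable hardware watchdog, pcmk_host_list on the watchdog device should let
you restrict it to those nodes, the same way it is used for the IPMI
devices.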

Depending on your infrastructure, an alternative to watchdog fencing for
your case (where you can't access the IPMIs in a room with a power outage)
might be fabric fencing.

Klaus
>
> I have tried to follow option b): I created a stonith resource for each node 
> and set the stonith-watchdog-timeout, like this:
>
> ---
> # for each node... [0..4]
> export name=...
> export ip=...
> export password=...
> sudo pcs stonith create "fence_ipmi_$name" fence_idrac \
>     lanplus=1 ip="$ip" \
>     username="admin"  password="$password" \
>     pcmk_host_list="$name" op monitor interval=10m timeout=10s
>
> sudo pcs property set stonith-watchdog-timeout=20
>
> # start dummy resource
> sudo pcs resource create dummy ocf:heartbeat:Dummy op monitor interval=30s
> ---
>
> I am not sure if additional location constraints have to be specified for 
> stonith resources. For example: I have noticed that pacemaker will start a 
> stonith resource on the same node as the fencing target. Is this OK?
>
> Should there be any location constraints regarding fencing and rooms?
>
> 'sbd' is running, properties are as follows:
>
> ---
> $ sudo pcs property show
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: debian
>  dc-version: 2.0.3-4b1f869f0f
>  have-watchdog: true
>  last-lrm-refresh: 1654583431
>  stonith-enabled: true
>  stonith-watchdog-timeout: 20
> ---
>
> Ipmi fencing (when the ipmi connection is alive) works correctly for each 
> node. The watchdog timer also seems to be working correctly. The problem is 
> that dummy resource is not restarted as expected.
>
> In the test scenario, the dummy resource is currently running on node1. I 
> have simulated node failure by unplugging the ipmi AND host network 
> interfaces from node1. The result was that node1 gets rebooted (by watchdog), 
> but the rest of the pacemaker cluster was unable to fence node1 (this is 
> expected, since node1's ipmi is not accessible). The problem is that dummy 
> resource remains stopped and node1 unclean. I was expecting that 
> stonith-watchdog-timeout kicks in, so that dummy resource gets restarted on 
> some other node which has quorum.
>
> Obviously there is something wrong with my configuration, since this seems to 
> be a reasonably simple scenario for the pacemaker. Appreciate your help.
>
> regards,
> Zoran
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
