It seems you are not using any fencing (STONITH) mechanism. A cluster is not fully functional without it.
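
For what it's worth, the missing fencing can be read off the `pcs status` output quoted below: none of the listed resources uses a `stonith:` agent. Here is a minimal Python sketch of that check. The function names and the trimmed sample text are illustrative only, and the pattern assumes the `(class:provider:type)` or `(class:type)` notation seen in that output. Treat it as a rough heuristic; running `pcs stonith status` on a node is the more direct way to see configured fence devices.

```python
import re

# Agent specs appear in parentheses after the resource name,
# e.g. "(ocf:heartbeat:Filesystem)" or "(systemd:postgresql)".
AGENT_RE = re.compile(r"\(([a-z]+):[^()\s]+\)")


def resource_classes(status_text):
    """Return the set of resource agent classes found in `pcs status` text."""
    return {match.group(1) for match in AGENT_RE.finditer(status_text)}


def has_stonith_resource(status_text):
    """True if at least one listed resource uses a stonith-class agent."""
    return "stonith" in resource_classes(status_text)


# Trimmed excerpt of the resource list from the quoted message.
SAMPLE = """\
* Resource Group: nsdrbd:
* LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
* LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
* postgresql (systemd:postgresql): Stopped
* ns_mhswdog (lsb:mhswdog): Stopped
"""

if __name__ == "__main__":
    print("agent classes:", sorted(resource_classes(SAMPLE)))
    print("fence device listed:", has_stonith_resource(SAMPLE))
```

Clone sets are listed without an agent in parentheses, so the sketch only sees primitives; fence devices normally show up as primitives of class `stonith`.
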

On Thu, Aug 10, 2023, 4:03 PM Tiaan Wessels <tiaanwess...@gmail.com> wrote:

> Hi,
>
> I need some help!
>
> I have a DRBD cluster and one node was switched off for a couple of days.
> The single node ran fine without a hiccup. When i switch it on I got into a
> situation where all resources got stopped and one DRBD volume was secondary
> and the others primary as it seemingly tried to perform a role swop to the
> node just switched on (ha1 was live and then i switched on ha2 at 08:06 for
> the sake of logs understanding)
>
> bash-5.1# cat /proc/drbd
> version: 8.4.11 (api:1/proto:86-101)
> srcversion: 60F610B702CC05315B04B50
> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
> ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0
> pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0
> lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0
> pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The cluster state ended up as
>
> bash-5.1# pcs status
> Cluster name: HA
> Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10
> 08:38:40Z)
> Cluster Summary:
> * Stack: corosync
> * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition
> with quorum
> * Last updated: Thu Aug 10 08:38:40 2023
> * Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on
> ha1.local
> * 2 nodes configured
> * 14 resource instances configured
>
> Node List:
> * Online: [ ha1.local ha2.local ]
>
> Full List of Resources:
> * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
> * Promoted: [ ha2.local ]
> * Unpromoted: [ ha1.local ]
> * Resource Group: nsdrbd:
> * LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
> * LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
> * LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
> * ClusterIP (ocf:heartbeat:IPaddr2): Stopped
> * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
> * Promoted: [ ha1.local ]
> * Unpromoted: [ ha2.local ]
> * postgresql (systemd:postgresql): Stopped
> * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
> * Promoted: [ ha1.local ]
> * Unpromoted: [ ha2.local ]
> * ns_mhswdog (lsb:mhswdog): Stopped
> * Clone Set: pingd-clone [pingd]:
> * Started: [ ha1.local ha2.local ]
>
> Failed Resource Actions:
> * LV_POSTGRES promote on ha2.local could not be executed (Timed Out:
> Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023
> after 1m30.003s
> * LV_BLOB promote on ha2.local could not be executed (Timed Out:
> Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023
> after 1m30.001s
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> I attach the logs of the two nodes. I also attach the output of pcs config
> show
>
> My questions:
> - can anyone help me figure out what happened here ?
> - as a side question, if a situation resolved itself, is there a way to
> have pcs do a resource cleanup by itself ?
>
> Thanks
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>