Hi, I need some help!
I have a DRBD cluster and one node was switched off for a couple of days. The single node ran fine without a hiccup. When I switched the other node back on, I got into a situation where all resources were stopped and one DRBD volume was Secondary while the others were Primary, as the cluster seemingly tried to perform a role swap to the node that had just been switched on (ha1 was live, and I switched on ha2 at 08:06, to help with reading the logs).

bash-5.1# cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The cluster state ended up as:

bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
  * Last updated: Thu Aug 10 08:38:40 2023
  * Last change:  Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
  * 2 nodes configured
  * 14 resource instances configured

Node List:
  * Online: [ ha1.local ha2.local ]

Full List of Resources:
  * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
    * Promoted: [ ha2.local ]
    * Unpromoted: [ ha1.local ]
  * Resource Group: nsdrbd:
    * LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
    * LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
    * LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
    * ClusterIP (ocf:heartbeat:IPaddr2): Stopped
  * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * postgresql (systemd:postgresql): Stopped
  * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * ns_mhswdog (lsb:mhswdog): Stopped
  * Clone Set: pingd-clone [pingd]:
    * Started: [ ha1.local ha2.local ]

Failed Resource Actions:
  * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
  * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Below I paste the logs of the two nodes, as well as the output of pcs config show.

My questions:
- can anyone help me figure out what happened here?
- as a side question, if a situation like this resolves itself, is there a way to have pcs do a resource cleanup by itself?

Thanks
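For reference, this is how I currently recover by hand once DRBD is back in sync; the second question above is really whether the cleanup step can happen on its own. (A sketch of what I do / what I'm guessing at -- the failure-timeout part is only my assumption of the relevant knob, not something I've tested:)

bash-5.1# cat /proc/drbd            # wait until all volumes show ds:UpToDate/UpToDate again
bash-5.1# pcs resource cleanup      # clear the failed promote records so the cluster re-evaluates
# my guess at making this automatic: let failures expire on their own, e.g.
bash-5.1# pcs resource meta LV_BLOB-clone failure-timeout=600s
bash-5.1# pcs resource meta LV_POSTGRES-clone failure-timeout=600s
# expired failures are only noticed at the next recheck (cluster-recheck-interval, default 15min)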
Output of pcs config show:

Cluster Name: HA
Corosync Nodes:
 ha1.local ha2.local
Pacemaker Nodes:
 ha1.local ha2.local

Resources:
 Resource: postgresql (class=systemd type=postgresql)
  Operations:
   monitor: postgresql-monitor-interval-60s interval=60s
   start: postgresql-start-interval-0s interval=0s timeout=100
   stop: postgresql-stop-interval-0s interval=0s timeout=100
 Resource: ns_mhswdog (class=lsb type=mhswdog)
  Operations:
   force-reload: ns_mhswdog-force-reload-interval-0s interval=0s timeout=15
   monitor: ns_mhswdog-monitor-interval-60s interval=60s timeout=10s on-fail=standby
   restart: ns_mhswdog-restart-interval-0s interval=0s timeout=140s
   start: ns_mhswdog-start-interval-0s interval=0s timeout=80s
   stop: ns_mhswdog-stop-interval-0s interval=0s timeout=80s
 Group: nsdrbd
  Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: LV_BLOBFS-instance_attributes device=/dev/drbd0 directory=/data fstype=ext4
   Operations:
    monitor: LV_BLOBFS-monitor-interval-20s interval=20s timeout=40s
    start: LV_BLOBFS-start-interval-0s interval=0s timeout=60s
    stop: LV_BLOBFS-stop-interval-0s interval=0s timeout=60s
  Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: LV_POSTGRESFS-instance_attributes device=/dev/drbd1 directory=/var/lib/pgsql fstype=ext4
   Operations:
    monitor: LV_POSTGRESFS-monitor-interval-20s interval=20s timeout=40s
    start: LV_POSTGRESFS-start-interval-0s interval=0s timeout=60s
    stop: LV_POSTGRESFS-stop-interval-0s interval=0s timeout=60s
  Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: LV_HOMEFS-instance_attributes device=/dev/drbd2 directory=/home fstype=ext4
   Operations:
    monitor: LV_HOMEFS-monitor-interval-20s interval=20s timeout=40s
    start: LV_HOMEFS-start-interval-0s interval=0s timeout=60s
    stop: LV_HOMEFS-stop-interval-0s interval=0s timeout=60s
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ClusterIP-instance_attributes cidr_netmask=32 ip=192.168.51.75
   Operations:
    monitor: ClusterIP-monitor-interval-60s interval=60s
    start: ClusterIP-start-interval-0s interval=0s timeout=20s
    stop: ClusterIP-stop-interval-0s interval=0s timeout=20s
 Clone: LV_BLOB-clone
  Meta Attributes: LV_BLOB-clone-meta_attributes clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
   Attributes: LV_BLOB-instance_attributes drbd_resource=lv_blob
   Operations:
    demote: LV_BLOB-demote-interval-0s interval=0s timeout=90
    monitor: LV_BLOB-monitor-interval-60s interval=60s role=Promoted
    monitor: LV_BLOB-monitor-interval-63s interval=63s role=Unpromoted
    notify: LV_BLOB-notify-interval-0s interval=0s timeout=90
    promote: LV_BLOB-promote-interval-0s interval=0s timeout=90
    reload: LV_BLOB-reload-interval-0s interval=0s timeout=30
    start: LV_BLOB-start-interval-0s interval=0s timeout=240
    stop: LV_BLOB-stop-interval-0s interval=0s timeout=100
 Clone: LV_POSTGRES-clone
  Meta Attributes: LV_POSTGRES-clone-meta_attributes clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
   Attributes: LV_POSTGRES-instance_attributes drbd_resource=lv_postgres
   Operations:
    demote: LV_POSTGRES-demote-interval-0s interval=0s timeout=90
    monitor: LV_POSTGRES-monitor-interval-60s interval=60s role=Promoted
    monitor: LV_POSTGRES-monitor-interval-63s interval=63s role=Unpromoted
    notify: LV_POSTGRES-notify-interval-0s interval=0s timeout=90
    promote: LV_POSTGRES-promote-interval-0s interval=0s timeout=90
    reload: LV_POSTGRES-reload-interval-0s interval=0s timeout=30
    start: LV_POSTGRES-start-interval-0s interval=0s timeout=240
    stop: LV_POSTGRES-stop-interval-0s interval=0s timeout=100
 Clone: LV_HOME-clone
  Meta Attributes: LV_HOME-clone-meta_attributes clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: LV_HOME (class=ocf provider=linbit type=drbd)
   Attributes: LV_HOME-instance_attributes drbd_resource=lv_home
   Operations:
    demote: LV_HOME-demote-interval-0s interval=0s timeout=90
    monitor: LV_HOME-monitor-interval-60s interval=60s role=Promoted
    monitor: LV_HOME-monitor-interval-63s interval=63s role=Unpromoted
    notify: LV_HOME-notify-interval-0s interval=0s timeout=90
    promote: LV_HOME-promote-interval-0s interval=0s timeout=90
    reload: LV_HOME-reload-interval-0s interval=0s timeout=30
    start: LV_HOME-start-interval-0s interval=0s timeout=240
    stop: LV_HOME-stop-interval-0s interval=0s timeout=100
 Clone: pingd-clone
  Resource: pingd (class=ocf provider=pacemaker type=ping)
   Attributes: pingd-instance_attributes dampen=6s host_list=192.168.51.251 multiplier=1000
   Operations:
    monitor: pingd-monitor-interval-10s interval=10s timeout=60s
    reload-agent: pingd-reload-agent-interval-0s interval=0s timeout=20s
    start: pingd-start-interval-0s interval=0s timeout=60s
    stop: pingd-stop-interval-0s interval=0s timeout=20s

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: ClusterIP
    Constraint: location-ClusterIP
      Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
        Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
        Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
  promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
  promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
  start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
  promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
  start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
  start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
  start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
  start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
  LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
  LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
  postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
  LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
  ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
  ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
  ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
  ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
  Meta Attrs: build-resource-defaults
    resource-stickiness=INFINITY
Operations Defaults:
  Meta Attrs: op_defaults-meta_attributes
    timeout=240s

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: HA
 dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
 have-watchdog: false
 last-lrm-refresh: 1688971748
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false

Tags:
 No tags defined

Quorum:
  Options:
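A side note on the config above: the promote op timeout on the DRBD clones is 90, which matches the 1m30s in the failed actions. If the answer turns out to be simply that promote needs more time right after a node rejoins, I assume the knob would be something like the following (my assumption of the syntax, and 300 is an arbitrary value; I'd still rather understand why promote hung than just bump the timeout):

bash-5.1# pcs resource update LV_BLOB op promote interval=0s timeout=300
bash-5.1# pcs resource update LV_POSTGRES op promote interval=0s timeout=300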
corosync log from ha2:

Aug 10 08:06:55 [1128] ha2.local corosync notice [MAIN ] Corosync Cluster Engine 3.1.5 starting up
Aug 10 08:06:55 [1128] ha2.local corosync info [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
Aug 10 08:06:56 [1128] ha2.local corosync notice [TOTEM ] Initializing transport (Kronosnet).
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] totemknet initialized
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cmap
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cfg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: cpg
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Using quorum provider corosync_votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: votequorum
Aug 10 08:06:57 [1128] ha2.local corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 10 08:06:57 [1128] ha2.local corosync info [QB ] server name: quorum
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 0
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 0: local addr: 192.168.51.216, port=5405
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configuring link 1
Aug 10 08:06:57 [1128] ha2.local corosync info [TOTEM ] Configured link number 1: local addr: 10.0.0.2, port=5406
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:06:57 [1128] ha2.local corosync warning [KNET ] host: host: 1 has no active links
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [TOTEM ] A new membership (2.126) was formed. Members joined: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [QUORUM] Members[1]: 2
Aug 10 08:06:57 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 0 is up
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 469
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Sync joined[1]: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 1
Aug 10 08:07:00 [1128] ha2.local corosync notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] This node is within the primary component and will provide service.
Aug 10 08:07:00 [1128] ha2.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1128] ha2.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:07:05 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 8885
Aug 10 08:07:08 [1128] ha2.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:14:13 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:14:15 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:19:53 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:19:54 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] link: host: 1 link: 1 is down
Aug 10 08:23:18 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] rx: host: 1 link: 1 is up
Aug 10 08:23:19 [1128] ha2.local corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
corosync log from ha1:

Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 0 is up
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:07:00 [1032387] ha1.local corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Sync joined[1]: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [TOTEM ] A new membership (1.12d) was formed. Members joined: 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [QUORUM] Members[2]: 1 2
Aug 10 08:07:00 [1032387] ha1.local corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:07:07 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:11:48 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:11:50 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] link: host: 2 link: 1 is down
Aug 10 08:12:22 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] rx: host: 2 link: 1 is up
Aug 10 08:12:23 [1032387] ha1.local corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
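One more observation: in both logs, knet link 1 (the 10.0.0.x interconnect) flaps down/up several times while link 0 stays stable. It may well be unrelated, but in case it matters, this is what I plan to run to look at the link state and count the flaps (the log path is where my corosync logs happen to live):

bash-5.1# corosync-cfgtool -s        # link status for the local node
bash-5.1# grep -cE 'link: host: .* link: 1 is down' /var/log/cluster/corosync.log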