Hi guys, thanks for helping. Troubleshooting unexpected failures can be quite hard, especially when they are not easily tracked in the Pacemaker / system logs. All servers are bare metal; I have requested the BMC logs hoping they contain some useful information. You said the sbd watchdog timeout is too tight. Can you explain what you mean and suggest a valid configuration?
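In the meantime, to make the discussion concrete, this is roughly what I understand a less tight watchdog-only setup would look like. It is only a sketch based on my reading of the sbd(8) and pcs docs; the 15s / 30s values are my own assumption, please correct them if they are unreasonable for this kind of cluster:

# /etc/sysconfig/sbd (diskless / watchdog-only mode)
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=15   # default is 5, which is what you called tight
SBD_DELAY_START=yes
SBD_PACEMAKER=yes

# Pacemaker side: as far as I understand, stonith-watchdog-timeout should be
# noticeably larger than SBD_WATCHDOG_TIMEOUT (roughly 2x is what I have read)
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=30s

If I got it right, sbd has to be restarted (with the cluster stopped on that node) for the sysconfig change to take effect, so I would roll it out node by node. Does that sound sane?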
PS: yesterday I resynced the old master (as a slave) and rejoined it into the cluster. I found the following sbd errors in /var/log/messages:

grep -r sbd messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health
Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Verbose mode enabled.
Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Watchdog enabled.
Jul 13 20:54:28 ltaoperdbs02 sbd[189176]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Jul 13 20:54:28 ltaoperdbs02 sbd[189178]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)
Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 189178) has terminated
Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 189179) has terminated
Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog0': Device or resource busy (16)
Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)

If I check systemctl status sbd:

systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago
     Docs: man:sbd(8)
  Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
 Main PID: 185357 (sbd)
   CGroup: /system.slice/sbd.service
           ├─185357 sbd: inquisitor
           ├─185362 sbd: watcher: Pacemaker
           └─185363 sbd: watcher: Cluster

Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based fencing daemon...
Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health
Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based fencing daemon.
Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)

This is happening on all 3 nodes. Any thoughts?
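About those "Device or resource busy" errors: I guess the sbd instances started at 20:54 and 20:55 failed because the service instance from 20:42 (PID 185357) already had /dev/watchdog open. Before touching anything I plan to confirm which process is holding the watchdog on each node, roughly like this (just a sketch, run as root):

lsof /dev/watchdog /dev/watchdog0        # which PID has the device open
fuser -v /dev/watchdog /dev/watchdog0    # alternative if lsof is not installed
ps -ef | grep '[s]bd'                    # make sure only one inquisitor is running

Does that sound reasonable, or is there a better way to verify it?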
Thanks for helping, have a good day.

Damiano

On Wed, Jul 14, 2021 at 10:08 AM Klaus Wenninger <kwenn...@redhat.com> wrote:

>
>
> On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <arvidj...@gmail.com>
> wrote:
>
>> On 13.07.2021 23:09, damiano giuliani wrote:
>> > Hi Klaus, thanks for helping, I'm quite lost because I can't find out
>> > the causes.
>> > I attached the corosync logs of all three nodes hoping you guys can
>> > find and hint me at something I can't see. I really appreciate the
>> > effort.
>> > The old master log seems cut at 00:38, so nothing interesting.
>> > The new master and the third slave logged what happened, but I can't
>> > figure out why the old master was lost.
>> >
>>
>> The reason it was lost is most likely outside of pacemaker. You need to
>> check other logs on the node that was lost, maybe BMC if this is bare
>> metal, or the hypervisor if it is a virtualized system.
>>
>> All that these logs say is that ltaoperdbs02 was lost from the point of
>> view of two other nodes. It happened at the same time (around Jul 13
>> 00:40) which suggests ltaoperdbs02 had some problem indeed. Whether it
>> was software crash, hardware failure or network outage cannot be
>> determined from these logs.
>>
> What speaks against a pure network-outage is that we don't see
> the corosync membership messages on the node that died.
> Of course it is possible that the log wasn't flushed out before reboot
> but usually I'd expect that there would be enough time.
> If something kept corosync or sbd from being scheduled that would
> explain why we don't see messages from these instances.
> And that was why I was asking to check if in the setup corosync and
> sbd are able to switch to rt-scheduling.
> But of course that is all speculations and from what we know it can
> be merely anything from an administrative hard shutdown via
> some BMC to whatever.
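On the rt-scheduling question: I have not verified it yet, but my idea is to check the scheduling class of corosync and of the sbd processes on each node, roughly like this (a sketch, assuming chrt and pidof are available):

chrt -p $(pidof corosync)                     # scheduling policy/priority of corosync
for p in $(pidof sbd); do chrt -p $p; done    # same for the sbd inquisitor and servants

If I understand correctly, both should report SCHED_RR with a non-zero priority; if they come back as SCHED_OTHER, that would mean they never switched to realtime scheduling. Is that the right way to check it?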
>
>>
>> > Something interesting could be the stonith logs of the new master and
>> > the third slave:
>> >
>> > NEW MASTER:
>> > grep stonith-ng /var/log/messages
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
>> >
>> > THIRD SLAVE:
>> > grep stonith-ng /var/log/messages
>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Node ltaoperdbs02 state is now lost
>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
>> > Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
>> >
>> > I really appreciate the help and your thoughts about it.
>> >
>> > PS: the stonith timeout should be set to 10s (pcs property set
>> > stonith-watchdog-timeout=10s). Do you suggest a different setting?
>> >
>> > On Tue, Jul 13, 2021 at 2:29 PM Klaus Wenninger <
>> > kwenn...@redhat.com> wrote:
>> >
>> >>
>> >>
>> >> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
>> >> damianogiulian...@gmail.com> wrote:
>> >>
>> >>> Hi guys,
>> >>> I'm back with some PAF postgres cluster problems.
>> >>> Tonight the cluster fenced the master node and promoted the PAF
>> >>> resource to a new node.
>> >>> Everything went fine, but I really don't know why it happened.
>> >>> This morning I noticed the old master was fenced by sbd and a new
>> >>> master was promoted; this happened tonight at 00.40.XX.
>> >>> Filtering the logs I can't find any reason why the old master was
>> >>> fenced and the promotion of the new master started (which seems to
>> >>> have gone perfectly). I'm a bit lost because none of us is able to
>> >>> find the real reason.
>> >>> The cluster worked flawlessly for days with no issues, until now.
>> >>> It is crucial for me to understand why this switch occurred.
>> >>>
>> >>> I attached the current status, configuration and logs.
>> >>> In the old master's log I can't find any reason; on the new master
>> >>> the only thing is the fencing and the promotion.
>> >>>
>> >>>
>> >>> PS:
>> >>> could this be the reason for the fencing?
>> >>>
>> >>> grep -e sbd /var/log/messages
>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>> >>>
>> >> That was yesterday afternoon and not 0:40 today in the morning.
>> >> With the watchdog-timeout set to 5s this may have been tight though.
>> >> Maybe check your other nodes for similar warnings - or check the
>> >> compressed warnings.
>> >> Maybe you can as well check the journal of sbd after start to see if it
>> >> managed to run rt-scheduled.
>> >> Is this a bare-metal setup or running on some hypervisor?
>> >> Unfortunately I'm not enough into postgres to tell if there is anything
>> >> interesting about the last messages shown before the suspected
>> >> watchdog-reboot.
>> >> Was there some administrative stuff done by ltauser before the reboot?
>> >> If yes, what?
>> >>
>> >> Regards,
>> >> Klaus
>> >>
>> >>> Any thought and help is really appreciated.
>> >>>
>> >>> Damiano
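One more thing from my side: to follow up on the suggestion to check the sbd journal after start, this is roughly what I intend to collect on all three nodes (a sketch; it assumes the journal is persistent, otherwise the previous-boot entries of the fenced node are gone):

journalctl -b -1 -u sbd.service | grep -iE 'watchdog|sched|timeout|outdated'    # previous boot, on the fenced node
journalctl -b -u sbd.service | head -n 40                                       # current boot, startup messages
journalctl -u corosync.service --since "2021-07-13 00:30" --until "2021-07-13 00:45"

If that is not the right place to look, please point me in the right direction.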
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/