On Wed, Jul 14, 2021 at 3:28 PM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>> damiano giuliani <damianogiulian...@gmail.com> wrote on 14.07.2021 at 12:49 in message
> <CAG=zynojrmkc5az8nz2r82crabj3z+genuw_8de3ujfu1hd...@mail.gmail.com>:
> > Hi guys, thanks for helping,
> >
> > it could be quite hard troubleshooting unexpected failures, especially if
> > they are not easily tracked in the pacemaker / system logs.
> > all servers are bare metal; i requested the BMC logs hoping there is some
> > information in there.
> > you guys said the sbd timing is too tight, can you explain and suggest a
> > valid configuration?
>
> You must answer these questions for yourself:
> * What is the maximum read/write delay for your sbd device that still means
> the storage is working? Before assuming something like 1s also think of
> firmware updates, bad disk sectors, etc.

stonith-watchdog-timeout set and no 'Servant starting for device' log - I
guess no poison-pill-fencing then

> * Then configure the sbd parameters accordingly
> * Finally configure the stonith timeout to be not less than the time sbd
> needs in the worst case to down the machine. If the cluster starts recovering
> while the other node is not down already, you may have data corruption or
> other failures.

yep - 2 * watchdog-timeout should be a good pick in this case

> > ps: yesterday i resynced the old master (to slave) and rejoined it into the
> > cluster.
> > i found the following error in /var/log/messages about sbd:
> >
> > grep -r sbd messages
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Verbose mode enabled.
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Watchdog enabled.
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189176]: notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189178]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)

> Maybe also debug the watchdog device.
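
To chase the "Device or resource busy" above, the first thing I'd check is
whether some other process already has the watchdog device open (a leftover
sbd instance, a watchdog daemon, a BMC watchdog driver, ...) and what the
watchdog hardware actually is. Roughly something like this - purely
illustrative, the device names and drivers involved are assumptions on my
side:

  # who is holding the watchdog device(s)?
  fuser -v /dev/watchdog /dev/watchdog0
  lsof /dev/watchdog /dev/watchdog0

  # identity, timeout and state of the watchdog hardware (util-linux)
  wdctl /dev/watchdog0

  # kernel messages about watchdog drivers
  dmesg | grep -iE 'watchdog|iTCO|ipmi'
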
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 189178) has terminated
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 189179) has terminated
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog0': Device or resource busy (16)
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)
> >
> > if i check the systemctl status of sbd:
> >
> > systemctl status sbd.service
> > ● sbd.service - Shared-storage based fencing daemon
> >    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago
> >      Docs: man:sbd(8)
> >   Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
> >  Main PID: 185357 (sbd)
> >    CGroup: /system.slice/sbd.service
> >            ├─185357 sbd: inquisitor
> >            ├─185362 sbd: watcher: Pacemaker
> >            └─185363 sbd: watcher: Cluster
> >
> > Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based fencing daemon...
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based fencing daemon.
> > Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> >
> > this is happening on all 3 nodes, any thoughts?

> Bad watchdog?

> > Thanks for helping, have a good day
> >
> > Damiano
> >
> > On Wed, 14 Jul 2021 at 10:08 Klaus Wenninger <kwenn...@redhat.com> wrote:
> >
> >> On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> >>
> >>> On 13.07.2021 23:09, damiano giuliani wrote:
> >>> > Hi Klaus, thanks for helping, im quite lost because i can't find out the causes.
> >>> > i attached the corosync logs of all three nodes hoping you guys can find
> >>> > and hint me at something i can't see. i really appreciate the effort.
> >>> > the old master log seems cut at 00:38, so nothing interesting there.
> >>> > the new master and the third slave logged what happened, but i can't
> >>> > figure out why the old master went lost.
> >>>
> >>> The reason it was lost is most likely outside of pacemaker. You need to
> >>> check other logs on the node that was lost, maybe the BMC if this is bare
> >>> metal or the hypervisor if it is a virtualized system.
> >>>
> >>> All that these logs say is that ltaoperdbs02 was lost from the point of
> >>> view of the two other nodes. It happened at the same time (around Jul 13
> >>> 00:40) which suggests ltaoperdbs02 had some problem indeed. Whether it
> >>> was a software crash, hardware failure or network outage cannot be
> >>> determined from these logs.
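
Since all three boxes are bare metal and the BMC logs were already requested:
if the BMCs speak IPMI you can usually pull the hardware event log yourself
and look for anything around Jul 13 00:40. Just as an illustration - whether
ipmitool is installed and how your BMCs are reachable are assumptions on my
side:

  # locally on the node (needs the ipmi kernel drivers loaded)
  ipmitool sel elist
  ipmitool chassis status

  # or remotely against the BMC
  ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> sel elist

A power/reset or sensor event logged there at that time would point at
hardware or firmware rather than at the cluster stack.
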
> >> What speaks against a pure network-outage is that we don't see
> >> the corosync membership messages on the node that died.
> >> Of course it is possible that the log wasn't flushed out before reboot
> >> but usually I'd expect that there would be enough time.
> >> If something kept corosync or sbd from being scheduled that would
> >> explain why we don't see messages from these instances.
> >> And that was why I was asking to check if in the setup corosync and
> >> sbd are able to switch to rt-scheduling.
> >> But of course that is all speculation and from what we know it could be
> >> anything from an administrative hard shutdown via some BMC to whatever.
> >>
> >>> > something interesting could be the stonith logs of the new master and
> >>> > the third slave:
> >>> >
> >>> > NEW MASTER:
> >>> > grep stonith-ng /var/log/messages
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> >>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> >>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
> >>> >
> >>> > THIRD SLAVE:
> >>> > grep stonith-ng /var/log/messages
> >>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Node ltaoperdbs02 state is now lost
> >>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> >>> > Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
> >>> >
> >>> > i really appreciate the help and what you think about it.
> >>> >
> >>> > PS: the stonith-watchdog-timeout should be set to 10s (pcs property set
> >>> > stonith-watchdog-timeout=10s) - do you suggest a different setting?
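
Regarding the PS above: with watchdog-only sbd (no shared disk, hence no
poison-pill fencing) the rule of thumb from earlier in this mail is
stonith-watchdog-timeout = 2 * SBD_WATCHDOG_TIMEOUT, so 10s is consistent
with the 5s default watchdog timeout. If you decide to give sbd more
headroom, both values have to be changed together and sbd restarted on all
nodes. A minimal sketch - the numbers are only an example and the file
location depends on your distribution:

  # /etc/sysconfig/sbd (RHEL/CentOS; /etc/default/sbd on Debian-based systems)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=10

  # after restarting sbd (i.e. the cluster stack) on all nodes:
  pcs property set stonith-watchdog-timeout=20s
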
> >>> > On Tue, 13 Jul 2021 at 14:29 Klaus Wenninger <kwenn...@redhat.com> wrote:
> >>> >
> >>> >> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <damianogiulian...@gmail.com> wrote:
> >>> >>
> >>> >>> Hi guys,
> >>> >>> im back with some PAF postgres cluster problems.
> >>> >>> tonight the cluster fenced the master node and promoted the PAF
> >>> >>> resource to a new node.
> >>> >>> everything went fine, except that i really don't know why.
> >>> >>> so this morning i noticed the old master was fenced by sbd and a new
> >>> >>> master was promoted; this happened tonight at 00.40.XX.
> >>> >>> filtering the logs i can't find any reason why the old master was
> >>> >>> fenced and the promotion of the new master was started (which seems
> >>> >>> to have gone perfectly) at a certain point; i'm a bit lost because
> >>> >>> none of us is able to get the real reason.
> >>> >>> the cluster worked flawlessly for days with no issues, till now.
> >>> >>> it is crucial for me to understand why this switch occurred.
> >>> >>>
> >>> >>> i attached the current status, configuration and logs.
> >>> >>> in the old master node log i can't find any reason;
> >>> >>> on the new master the only thing is the fencing and the promotion.
> >>> >>>
> >>> >>> PS:
> >>> >>> could this be the reason for the fencing?
> >>> >>>
> >>> >>> grep -e sbd /var/log/messages
> >>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> >>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> >>> >>>
> >>> >> That was yesterday afternoon and not 0:40 today in the morning.
> >>> >> With the watchdog-timeout set to 5s this may have been tight though.
> >>> >> Maybe check your other nodes for similar warnings - or check the
> >>> >> compressed warnings.
> >>> >> Maybe you can as well check the journal of sbd after start to see if it
> >>> >> managed to run rt-scheduled.
> >>> >> Is this a bare-metal-setup or running on some hypervisor?
> >>> >> Unfortunately I'm not enough into postgres to tell if there is anything
> >>> >> interesting about the last messages shown before the suspected
> >>> >> watchdog-reboot.
> >>> >> Was there some administrative stuff done by ltauser before the reboot?
> >>> >> If yes, what?
> >>> >>
> >>> >> Regards,
> >>> >> Klaus
> >>> >>
> >>> >>> Any thought and help is really appreciated.
> >>> >>>
> >>> >>> Damiano
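
One more note on the rt-scheduling question raised above: it is easy to
verify on the running nodes whether corosync and sbd actually got a realtime
scheduling class. Roughly along these lines - illustrative only, the output
format differs between versions:

  # scheduling class (RR/FF = realtime) and rt priority of the relevant daemons
  ps -eo pid,cls,rtprio,comm | grep -E 'sbd|corosync'

  # or per process
  chrt -p $(pidof corosync)

  # and sbd's own messages since boot, e.g. complaints about
  # realtime priority or the watchdog
  journalctl -b -u sbd

If sbd or corosync show up there without a realtime class, that would fit the
theory that they simply weren't scheduled in time on the node that got fenced.
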
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/