[ClusterLabs] Antw: [EXT] Re: unexpected fenced node and promotion of the new master PAF ‑ postgres

2021-07-14 Thread Ulrich Windl
>>> Strahil Nikolov wrote on 14.07.2021 at 18:34 in
message <1290865141.3036488.1626280482...@mail.yahoo.com>:
> If you experience multiple outages, you should consider enabling the kdump 
> feature of sbd. It will increase the takeover time, but might provide 
> valuable info.

I doubt you want that for a big server, as the kdumps could fill your filesystem.
Also, kdump won't capture any events from the past; it's most useful when the
kernel hangs or crashes.
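For reference, if you did want to try it: the crash-dump behaviour is selected
via the timeout action in the sbd configuration. A minimal sketch (assuming your
sbd version supports SBD_TIMEOUT_ACTION; the file is typically /etc/sysconfig/sbd
on RHEL/SUSE or /etc/default/sbd on Debian-based systems):

# /etc/sysconfig/sbd
# default is "flush,reboot"; "crashdump" lets kdump capture a vmcore on timeout
SBD_TIMEOUT_ACTION=flush,crashdump

# kdump itself must be configured with enough dump space, e.g.:
systemctl enable --now kdump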

> Best Regards,
> Strahil Nikolov
>  
>  
>   On Wed, Jul 14, 2021 at 15:12, Klaus Wenninger wrote:  
>  






Re: [ClusterLabs] Antw: [EXT] Re: unexpected fenced node and promotion of the new master PAF ‑ postgres

2021-07-14 Thread Klaus Wenninger
On Wed, Jul 14, 2021 at 3:28 PM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> damiano giuliani wrote on 14.07.2021 at 12:49 in
> message:
> > Hi guys, thanks for helping,
> >
> > It can be quite hard troubleshooting unexpected failures, especially if
> > they are not easily tracked in the pacemaker / system logs.
> > All servers are bare metal; I requested the BMC logs hoping they contain
> > some information.
> > You guys said the sbd timing is too tight; can you explain that to me and
> > suggest a valid configuration?
>
> You must answer these questions for yourself:
> * What is the maximum read/write delay for your sbd device that still means
> the storage is working? Before assuming something like 1 s, also think of
> firmware updates, bad disk sectors, etc.
>
stonith-watchdog-timeout is set and there is no 'Servant starting for device'
log - I guess no poison-pill fencing then

> * Then configure the sbd parameters accordingly.
> * Finally configure the stonith timeout to be not less than the time sbd
> needs in the worst case to bring down the machine. If the cluster starts
> recovering while the other node is not yet down, you may have data
> corruption or other failures.
>
yep - 2 * watchdog-timeout should be a good pick in this case
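A minimal sketch of what that could look like for a watchdog-only (diskless)
setup - file locations, tooling (pcs vs. crm) and the numbers themselves are
just assumptions to adjust:

# /etc/sysconfig/sbd  (no SBD_DEVICE configured)
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=10

# Pacemaker property, roughly twice the watchdog timeout:
pcs property set stonith-watchdog-timeout=20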

>
> >
> > PS: yesterday I resynced the old master (as a slave) and rejoined it to
> > the cluster.
> > I found the following errors about sbd in /var/log/messages:
> >
> >  grep -r sbd messages
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:
> Servant
> > pcmk is outdated (age: 4)
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:
> Servant
> > pcmk is healthy (age: 0)
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:  notice: main: Doing flush +
> > writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:  pcmk:   notice:
> > servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:   cluster:   notice:
> > servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child:
> > Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: watchdog_init: Using
> > watchdog device '/dev/watchdog'
> > Jul 13 20:42:19 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child:
> > Servant pcmk is healthy (age: 0)
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:info: main: Verbose mode
> > enabled.
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:info: main: Watchdog
> enabled.
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189176]:  notice: main: Doing flush +
> > writing 'b' to sysrq on timeout
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189178]:  pcmk:   notice:
> > servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:  notice: inquisitor_child:
> > Servant pcmk is healthy (age: 0)
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:   error: watchdog_init_fd:
> Cannot
> > open watchdog device '/dev/watchdog': Device or resource busy (16)
>
> Maybe also debug the watchdog device.
>
>
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning:
> cleanup_servant_by_pid:
> > Servant for pcmk (pid: 189178) has terminated
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning:
> cleanup_servant_by_pid:
> > Servant for cluster (pid: 189179) has terminated
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:  notice: main: Doing flush +
> > writing 'b' to sysrq on timeout
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd:
> Cannot
> > open watchdog device '/dev/watchdog0': Device or resource busy (16)
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd:
> Cannot
> > open watchdog device '/dev/watchdog': Device or resource busy (16)
> >
> > if i check the systemctl status sbd:
> >
> > systemctl status sbd.service
> > ● sbd.service - Shared-storage based fencing daemon
> >Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> > preset: disabled)
> >Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago
> >  Docs: man:sbd(8)
> >   Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> > watch (code=exited, status=0/SUCCESS)
> >  Main PID: 185357 (sbd)
> >CGroup: /system.slice/sbd.service
> >├─185357 sbd: inquisitor
> >├─185362 sbd: watcher: Pacemaker
> >└─185363 sbd: watcher: Cluster
> >
> > Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based
> > fencing daemon...
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:   notice: main: Doing flush +
> > writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:   pcmk:   notice:
> > servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:cluster:   notice:
> > servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:   notice: inquisitor_child:
> > Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sb

[ClusterLabs] Antw: [EXT] Re: unexpected fenced node and promotion of the new master PAF ‑ postgres

2021-07-14 Thread Ulrich Windl
>>> damiano giuliani wrote on 14.07.2021 at 12:49 in
message:
> Hi guys, thanks for helping,
> 
> It can be quite hard troubleshooting unexpected failures, especially if they
> are not easily tracked in the pacemaker / system logs.
> All servers are bare metal; I requested the BMC logs hoping they contain some
> information.
> You guys said the sbd timing is too tight; can you explain that to me and
> suggest a valid configuration?

You must answer these questions for yourself:
* What is the maximum read/write delay for your sbd device that still means
the storage is working? Before assuming something like 1 s, also think of
firmware updates, bad disk sectors, etc.
* Then configure the sbd parameters accordingly.
* Finally configure the stonith timeout to be not less than the time sbd needs
in the worst case to bring down the machine. If the cluster starts recovering
while the other node is not yet down, you may have data corruption or other
failures.
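If a shared sbd disk is used, a rough sketch of how those choices end up in the
configuration could look like the following (device path and timeout values are
placeholders only; pick numbers that survive your worst-case storage delays):

# write the sbd metadata with explicit timeouts (watchdog 30 s, msgwait 60 s)
sbd -d /dev/disk/by-id/<your-sbd-disk> -1 30 -4 60 create
sbd -d /dev/disk/by-id/<your-sbd-disk> dump     # verify the on-disk timeouts

# stonith-timeout should cover at least msgwait plus some margin
pcs property set stonith-timeout=90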

> 
> PS: yesterday I resynced the old master (as a slave) and rejoined it to the
> cluster.
> I found the following errors about sbd in /var/log/messages:
> 
>  grep -r sbd messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
> Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:  notice: main: Doing flush +
> writing 'b' to sysrq on timeout
> Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:  pcmk:   notice:
> servant_pcmk: Monitoring Pacemaker health
> Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:   cluster:   notice:
> servant_cluster: Monitoring unknown cluster health
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child:
> Servant cluster is healthy (age: 0)
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: watchdog_init: Using
> watchdog device '/dev/watchdog'
> Jul 13 20:42:19 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child:
> Servant pcmk is healthy (age: 0)
> Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:info: main: Verbose mode
> enabled.
> Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:info: main: Watchdog enabled.
> Jul 13 20:54:28 ltaoperdbs02 sbd[189176]:  notice: main: Doing flush +
> writing 'b' to sysrq on timeout
> Jul 13 20:54:28 ltaoperdbs02 sbd[189178]:  pcmk:   notice:
> servant_pcmk: Monitoring Pacemaker health
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:  notice: inquisitor_child:
> Servant pcmk is healthy (age: 0)
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:   error: watchdog_init_fd: Cannot
> open watchdog device '/dev/watchdog': Device or resource busy (16)

Maybe also debug the watchdog device.
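Something along these lines might help, depending on what your sbd and
util-linux versions provide:

sbd query-watchdog                   # list watchdog devices sbd can find
wdctl /dev/watchdog                  # driver, timeout and status
sbd -w /dev/watchdog test-watchdog   # CAUTION: expected to reset the node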


> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid:
> Servant for pcmk (pid: 189178) has terminated
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid:
> Servant for cluster (pid: 189179) has terminated
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:  notice: main: Doing flush +
> writing 'b' to sysrq on timeout
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd: Cannot
> open watchdog device '/dev/watchdog0': Device or resource busy (16)
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd: Cannot
> open watchdog device '/dev/watchdog': Device or resource busy (16)
> 
> if i check the systemctl status sbd:
> 
> systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
>Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago
>  Docs: man:sbd(8)
>   Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=exited, status=0/SUCCESS)
>  Main PID: 185357 (sbd)
>CGroup: /system.slice/sbd.service
>├─185357 sbd: inquisitor
>├─185362 sbd: watcher: Pacemaker
>└─185363 sbd: watcher: Cluster
> 
> Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based
> fencing daemon...
> Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:   notice: main: Doing flush +
> writing 'b' to sysrq on timeout
> Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:   pcmk:   notice:
> servant_pcmk: Monitoring Pacemaker health
> Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:cluster:   notice:
> servant_cluster: Monitoring unknown cluster health
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:   notice: inquisitor_child:
> Servant cluster is healthy (age: 0)
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:   notice: watchdog_init: Using
> watchdog device '/dev/watchdog'
> Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based
> fencing daemon.
> Jul 13 20:42:19 ltaoperdbs02 sbd[185357]:   notice: inquisitor_child:
> Servant pcmk is healthy (age: 0)
> 
> This is happening on all 3 nodes, any thoughts?

Bad watchdog? 
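The "Device or resource busy" errors could also simply mean something else
already holds the watchdog device open; a quick check might look like this
(sketch):

fuser -v /dev/watchdog /dev/watchdog0   # which processes have the devices open
ps -ef | grep '[s]bd'                   # more than one sbd inquisitor running?
lsmod | grep -i -e wdt -e softdog       # which watchdog drivers are loaded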

> 
> Thanks for helping, have a good day
> 
> Damiano
> 
> 
> On Wed, 14 Jul 2021 at 10:08, Klaus Wenninger <
> kwenn...@redh