[ClusterLabs] Antw: Re: Antw: [EXT] The 2 servers of the cluster randomly reboot almost together

2022-02-17 Thread Ulrich Windl
>>> Sebastien BASTARD wrote on 17.02.2022 at 16:28 in message: > Thank you Ulrich for your script! > > I launched it with a 10-second delay: > >- on Server A, to ping Server B >- on Server B, to ping Server A >- on QDevice, to ping Server A and Server B > > I currently can't

Re: [ClusterLabs] The 2 servers of the cluster randomly reboot almost together

2022-02-17 Thread Strahil Nikolov via Users
Token timeout -> network issue? Just run a continuous ping (with timestamps) and log it to a file (from each host to the other host + qdevice IP). Best Regards, Strahil Nikolov On Thu, Feb 17, 2022 at 11:38, Sebastien BASTARD wrote: Hello CoroSync's team! We currently have a proxmox
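The suggestion above (a continuous ping with timestamps, logged to a file) could be sketched roughly as follows. This is only an illustrative sketch, not anything posted in the thread: the function names, the log-line layout, and the 10-second default interval are assumptions, and `ping -c 1 -W` assumes the Linux iputils `ping`.

```python
import subprocess
import time
from datetime import datetime, timezone


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` succeeds.

    Assumption: the system `ping` accepts Linux iputils flags (-c count, -W timeout).
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def format_line(host, up, now=None):
    """Build one log line: ISO timestamp (UTC), host, UP or DOWN."""
    now = now or datetime.now(timezone.utc)
    return f"{now.isoformat(timespec='seconds')} {host} {'UP' if up else 'DOWN'}"


def monitor(hosts, logfile, interval_s=10, rounds=None, probe=ping_once):
    """Probe every host once per round and append the results to `logfile`.

    rounds=None runs until interrupted; `probe` can be swapped out for testing.
    """
    done = 0
    with open(logfile, "a", encoding="utf-8") as log:
        while rounds is None or done < rounds:
            for host in hosts:
                log.write(format_line(host, probe(host)) + "\n")
            log.flush()  # keep the file current in case the node reboots
            done += 1
            if rounds is None or done < rounds:
                time.sleep(interval_s)
```

Running something like `monitor(["<other node>", "<qdevice>"], "ping.log")` on each host (the addresses here are placeholders) would produce per-host logs whose timestamps can be compared against the reboot times.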

Re: [ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together

2022-02-17 Thread Sebastien BASTARD
Thank you Ulrich for your script! I launched it with a 10-second delay: - on Server A, to ping Server B - on Server B, to ping Server A - on QDevice, to ping Server A and Server B I currently can't ping Qdevice from server A and B, because it is behind a firewall which only

Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Ken Gaillot
On Thu, 2022-02-17 at 14:05 +0100, Lentes, Bernd wrote: > - On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com > wrote: > > > > Splitting logs between different messages does not really help in > > interpreting > > them. > > I agree. > Here is the complete excerpt from the respective

Re: [ClusterLabs] Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Strahil Nikolov via Users
To be honest, I always check https://documentation.suse.com/sle-ha/15-SP3/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-watchdog-timings for sbd and timings. Best Regards, Strahil Nikolov On Wed, Feb 16, 2022 at 19:31, Klaus Wenninger wrote:

[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 17.02.2022 at 13:34 in message: ... > But feedback is welcome so that we can do a little tweaking that makes them > fit > for a larger audience. > Remember a case where devices stalled for 50s during a firmware-update > shouldn't trigger fencing - definitely a

Re: [ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Lentes, Bernd
- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com wrote: > > > Splitting logs between different messages does not really help in interpreting > them. I agree. Here is the complete excerpt from the respective time: https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Klaus Wenninger
On Thu, Feb 17, 2022 at 12:38 PM Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Klaus Wenninger wrote on 17.02.2022 at 10:49 > in > message > : > ... > >> For completeness: Yes, sbd did recover: > >> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant >

[ClusterLabs] Antw: Re: Antw: Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Ulrich Windl
>>> "Lentes, Bernd" wrote on 17.02.2022 >>> at 12:02 in message <171889141.179613256.1645095740749.javamail.zim...@helmholtz-muenchen.de>: > > - On Feb 17, 2022, at 10:26 AM, Ulrich Windl > ulrich.wi...@rz.uni-regensburg.de wrote: > > "Ulrich Windl" wrote on 17.02.2022 > >>

[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 17.02.2022 at 10:49 in message: ... >> For completeness: Yes, sbd did recover: >> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant >> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated >> Feb 14 13:01:42 h18 sbd[6615]:

[ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together

2022-02-17 Thread Ulrich Windl
Hi! It seems your problem is the network. Maybe check the connectivity between all nodes (and quorum device). Some time ago I wrote a simple script that can log ups and downs (you'll have to adjust for non-LAN traffic); maybe it helps: # Test Host Status (Up, Down) via ping (ICMP Echo)
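The script itself is cut off in this listing. As a rough illustration of the idea of logging only ups and downs, rather than every probe, a reduction like the one below could be applied to a stream of probe results. It is not the posted script: the `state_changes` name and the `(timestamp, host, is_up)` sample layout are assumptions made for this sketch.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed sample layout: (timestamp text, host name, True if the probe succeeded)
Sample = Tuple[str, str, bool]


def state_changes(samples: Iterable[Sample]) -> Iterator[str]:
    """Yield a log line only when a host's state differs from its previous state.

    The first sample seen for a host is always reported, so the log
    starts with a known baseline for every host.
    """
    last: Dict[str, bool] = {}
    for stamp, host, is_up in samples:
        if last.get(host) != is_up:
            last[host] = is_up
            yield f"{stamp} {host} {'UP' if is_up else 'DOWN'}"
```

Logging only the transitions keeps the file short over days of monitoring, which makes it easier to line up an outage with the moment the nodes rebooted.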

Re: [ClusterLabs] Antw: Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Lentes, Bernd
- On Feb 17, 2022, at 10:26 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: "Ulrich Windl" wrote on 17.02.2022 > > To correct myself: crm has a "-w" (wait) option that will wait until the DC is > idle. In most cases it just waits until the operation requested has

Re: [ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Klaus Wenninger
On Thu, Feb 17, 2022 at 10:14 AM Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Klaus Wenninger wrote on 16.02.2022 at 16:59 > in > message > : > > On Wed, Feb 16, 2022 at 4:26 PM Klaus Wenninger > wrote: > > > >> > >> > >> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl < > >>

[ClusterLabs] The 2 servers of the cluster randomly reboot almost together

2022-02-17 Thread Sebastien BASTARD
Hello CoroSync's team! We currently have a proxmox cluster with 2 servers (at different providers and in different cities) and another server, in our company, with qdevice. Schematic: (A) Proxmox Server A (Provider One) -- (B) Proxmox Server B (Provider Two)

Re: [ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Klaus Wenninger
On Thu, Feb 17, 2022 at 9:27 AM Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Klaus Wenninger wrote on 16.02.2022 at 16:26 > in > message > : > > On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl < > > ulrich.wi...@rz.uni-regensburg.de> wrote: > > > >> Hi! > >> > >> When changing

[ClusterLabs] Antw: Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Ulrich Windl
>>> "Ulrich Windl" wrote on 17.02.2022 at 10:18 in message <620e12e102a100047...@gwsmtp.uni-regensburg.de>: ... > Yes, as "stop" just sets the role of the resource; processing happens in the > background. > The crm syntax does not have a "‑‑wait", but you can list all names in one >

[ClusterLabs] Antw: Re: Antw: [EXT] Re: crm resource stop VirtualDomain ‑ but VirtualDomain shutdown start some minutes later

2022-02-17 Thread Ulrich Windl
>>> "Lentes, Bernd" wrote on 16.02.2022 >>> at 17:44 in message <1319350245.178868537.1645029860404.javamail.zim...@helmholtz-muenchen.de>: > - On Feb 16, 2022, at 1:01 PM, Ulrich Windl > ulrich.wi...@rz.uni-regensburg.de wrote: >> Bernd, >> >> I guess the syslog(/journal of the DC has

[ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 16.02.2022 at 16:59 in message: > On Wed, Feb 16, 2022 at 4:26 PM Klaus Wenninger wrote: > >> >> >> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl < >> ulrich.wi...@rz.uni-regensburg.de> wrote: >> >>> Hi! >>> >>> When changing some FC cables I noticed that sbd

[ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 16.02.2022 at 16:26 in message: > On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl < > ulrich.wi...@rz.uni-regensburg.de> wrote: > >> Hi! >> >> When changing some FC cables I noticed that sbd complained 2 seconds after >> the connection went down (even though the