Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
> On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger wrote:
>> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 14:56 in Nachricht:
>>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl wrote:
>>>>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
>>>>>> Hi there,
>>>>>>
>>>>>> In this 3-node cluster, node03 had been offline for a while and was
>>>>>> brought back into service. Then a migration of a VirtualDomain was
>>>>>> attempted, and node02 was fenced.
>>>>>>
>>>>>> Provided are logs from both nodes, the 'pcs config', and a bzcatted
>>>>>> pe-warn. Anyone with an idea of why the node was fenced? Is it
>>>>>> because of the failed ipmi monitor warning?
>>>>> After a short glance it looks as if the network traffic used for VM
>>>>> migration killed the corosync (or other) communication.
>>>> May I ask what part is making you think so?
>>> The part that I saw no reason for an intended fencing.
>> And it looks like node02 is being cut off from all
>> networking-communication - both corosync & ipmi.
>> It may really be the networking load, although I would
>> rather bet on something more systematic like a
>> MAC/IP conflict with the VM or something.
>> I see you are having libvirtd under cluster-control.
>> Maybe bringing up the network-topology destroys the
>> connection between the nodes.
>> Has the cluster been working with the 3 nodes before?
>>
>> Klaus
>
> Hi Klaus,
>
> Yes, it has been working before with all 3 nodes and migrations back
> and forth, but a few more VirtualDomains have been deployed since the
> last migration test.
>
> It happens very fast, almost immediately after the migration starts.
> Could it be that some timeout values should be adjusted?
> I just don't have any idea where to start looking, as to me there is
> nothing obviously suspicious in the logs.

I would look at performance stats; maybe node02 was overloaded and could not answer in time. However, standard sar stats are collected every 15 minutes, which is usually too coarse for this. Migration can stress the network. Talk with your network support - were there any errors around this time?

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
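To catch a short overload window like a VM migration, the sysstat collection interval can be tightened. A minimal sketch, assuming a RHEL/CentOS-style sysstat install where collection is driven from cron (paths may differ per distribution):

```
# /etc/cron.d/sysstat - collect one sample every minute instead of the
# usual every-10-minutes default, so a brief migration spike is visible
# later via `sar -f /var/log/sa/saDD`
* * * * * root /usr/lib64/sa/sa1 1 1
```

On systems where sysstat runs from a systemd timer instead of cron, the equivalent change is made in the sysstat-collect.timer unit.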
Re: [ClusterLabs] Node fenced for unknown reason
On 15.04.2021 14:10, Steffen Vinther Sørensen wrote:
> Hi there,
>
> In this 3-node cluster, node03 had been offline for a while and was
> brought back into service. Then a migration of a VirtualDomain was
> attempted, and node02 was fenced.
>
> Provided are logs from both nodes, the 'pcs config', and a bzcatted
> pe-warn. Anyone with an idea of why the node was fenced?

It was fenced because communication between node02 and the two other nodes was lost. Why that happened cannot be answered based on the available logs.

> Is it because of the failed ipmi monitor warning?

No.
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
On 15.04.2021 16:39, Klaus Wenninger wrote:
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 14:56 in Nachricht:
>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl wrote:
>>>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
>>>>> Hi there,
>>>>>
>>>>> In this 3-node cluster, node03 had been offline for a while and was
>>>>> brought back into service. Then a migration of a VirtualDomain was
>>>>> attempted, and node02 was fenced.
>>>>>
>>>>> Provided are logs from both nodes, the 'pcs config', and a bzcatted
>>>>> pe-warn. Anyone with an idea of why the node was fenced? Is it
>>>>> because of the failed ipmi monitor warning?
>>>> After a short glance it looks as if the network traffic used for VM
>>>> migration killed the corosync (or other) communication.
>>> May I ask what part is making you think so?
>> The part that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.

Well, IPMI fencing was (claimed to be) successful, so the monitoring errors could be false positives. Still, it is something that needs investigation.

... judging by

Apr 15 06:59:26 kvm03-node02 systemd-logind[4179]: Power key pressed.

IPMI fencing *was* successful.

> It may really be the networking load, although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
>
> Klaus

>>>>> Here is the outline:
>>>>>
>>>>> At 06:58:27 node03 is activated with 'pcs start node03'; nothing
>>>>> suspicious in the logs.
>>>>>
>>>>> At 06:59:17 a resource migration is attempted from node02 to node03
>>>>> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'.
>>>>>
>>>>> On node01 this happens:
>>>>>
>>>>> Apr 15 06:59:17 kvm03-node01 pengine[29024]: warning: Processing
>>>>> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>>>>> unknown error
>>>>>
>>>>> And node02 is fenced?
>>>>>
>>>>> /Steffen
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi All,

Sorry... due to my operation mistake, the same email was sent multiple times.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661...@ybb.ne.jp"
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Date: 2021/4/15, Thu 11:45
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>
> Hi Klaus,
> Hi Ken,
>
> We have confirmed that the operation is improved by the test.
> Thank you for your prompt response.
>
> We look forward to this fix being included in the release version of RHEL 8.4.
>
> Best Regards,
> Hideo Yamauchi.
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger wrote:
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>> The part that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.
> It may really be the networking load, although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
>
> Klaus

About libvirtd being under cluster control: this is because I'm using virtlockd, which in turn depends on the gfs2 filesystems for lock files. Some VM images are also located on those gfs2 filesystems.

Maybe it is not necessary to do it like that: as long as the VirtualDomains depend on those filesystems, libvirtd could just always be running, so its network topology being brought up and down would not disturb anything. Best clue I have for now.

/Steffen
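For reference, the virtlockd setup described above is enabled in libvirt's QEMU driver configuration; the lockspace path on gfs2 below is an illustrative assumption for this cluster, not taken from the poster's config:

```
# /etc/libvirt/qemu.conf - have the QEMU driver use the virtlockd-based
# "lockd" lock manager for disk image leases
lock_manager = "lockd"

# /etc/libvirt/qemu-lockd.conf - place the lockspace on the shared gfs2
# filesystem so all nodes contend on the same locks (path is illustrative)
file_lockspace_dir = "/gfs2/virtlockd-lockspace"
```

With this in place a second node cannot start a domain whose image lease is still held, which is why the gfs2 filesystems (and hence the cluster) must be up before libvirtd takes guests.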
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger wrote:
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>> The part that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.
> It may really be the networking load, although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
>
> Klaus

Hi Klaus,

Yes, it has been working before with all 3 nodes and migrations back and forth, but a few more VirtualDomains have been deployed since the last migration test.

It happens very fast, almost immediately after the migration starts. Could it be that some timeout values should be adjusted? I just don't have any idea where to start looking, as to me there is nothing obviously suspicious in the logs.

/Steffen
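If corosync membership really is being lost under migration load, the knob usually examined first is the totem token timeout in corosync.conf. A sketch only - the cluster name and value are illustrative, not a recommendation:

```
# /etc/corosync/corosync.conf (excerpt)
totem {
    version: 2
    cluster_name: kvm03
    # Milliseconds to wait for the token before declaring a node lost.
    # The effective default is 1000 ms plus 650 ms per node above two;
    # raising it tolerates short network stalls at the cost of slower
    # failure detection (and therefore later fencing).
    token: 5000
}
```

Any such change has to be made consistently on all nodes and corosync reloaded, and it only masks the symptom if the real cause is a saturated or misconfigured link.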
Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Hi Klaus,
Hi Ken,

We have confirmed that the operation is improved by the test. Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661...@ybb.ne.jp"
> To: "kwenn...@redhat.com"; Cluster Labs - All topics related to open-source clustering welcomed
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>
> Hi Klaus,
> Hi Ken,
>
>> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>> I guess the simplest possible solution to the immediate issue so
>> that we can discuss it.
>
> Thank you for the fix.
>
> I have confirmed that the fixes have been merged.
> I'll test this fix today just in case.
>
> Many thanks,
> Hideo Yamauchi.
>
> ----- Original Message -----
>> From: Klaus Wenninger
>> Date: 2021/4/12, Mon 22:22
>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>>
>> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>> On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>> On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>> Hi Klaus,
>>>>>>>
>>>>>>> Thanks for your comment.
>>>>>>>
>>>>>>>> Hmm ... is that with selinux enabled?
>>>>>>>> Respectively do you see any related avc messages?
>>>>>>>
>>>>>>> Selinux is not enabled.
>>>>>>>
>>>>>>> Isn't it caused by crm_mon not returning a response when
>>>>>>> pacemakerd prepares to stop?
>>>>>> yep ... that doesn't look good.
>>>>>> While in pcmk_shutdown_worker ipc isn't handled.
>>>>> Stop ... that should actually work, as pcmk_shutdown_worker
>>>>> should exit quite quickly and proceed after mainloop
>>>>> dispatching when called again.
>>>>> Don't see anything atm that might be blocking for longer ...
>>>>> but let me dig into it further ...
>>>> What happens is clear (thanks Ken for the hint ;-) ).
>>>> When pacemakerd is shutting down - already when it
>>>> shuts down the resources and not just when it starts to
>>>> reap the subdaemons - crm_mon reads that state and
>>>> doesn't try to connect to the cib anymore.
>>> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>> I guess the simplest possible solution to the immediate issue so
>>> that we can discuss it.
>>> Question is why that didn't create an issue earlier.
>>> Probably I didn't test with resources that had crm_mon in
>>> their stop/monitor-actions, but sbd should have run into issues.
>>>
>>> Klaus
>>>> But when shutting down a node the resources should be
>>>> shut down before pacemakerd goes down.
>>>> But let me have a look if it can happen that pacemakerd
>>>> doesn't react to the ipc-pings before. That btw. might be
>>>> lethal for sbd scenarios (if the phase is too long, and it
>>>> might actually not be defined).
>>>>
>>>> My idea with selinux would have been that it might block
>>>> the ipc if crm_mon is issued by execd. But well, forget
>>>> about it as it is not enabled ;-)
>>>>
>>>> Klaus
>>>>>>> pgsql needs the result of crm_mon in demote processing and
>>>>>>> stop processing.
>>>>>>> crm_mon should return a response even after pacemakerd goes
>>>>>>> into a stop operation.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Hideo Yamauchi.
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: Klaus Wenninger
>>>>>>>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed
>>>>>>>> Date: 2021/4/9, Fri 21:12
>>>>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>>>>>>>>
>>>>>>>> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>>>>>>>> Hi Ken,
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> In the pgsql resource, crm_mon is executed in the process of
>>>>>>>>> demote and stop, and the result is processed.
>>>>>>>>> However, pacemaker included in RHEL8.4beta fails to execute
>>>>>>>>> this crm_mon.
>>>>>>>>> - The problem also occurs on github master
>>>>>>>>>   (c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>>>
>>>>>>>>> The problem can be easily reproduced in the following way.
>>>>>>>>>
>>>>>>>>> Step 1. Modify the stop process of the Dummy resource to
>>>>>>>>> execute crm_mon:
>>>>>>>>>
>>>>>>>>> dummy_stop() {
>>>>>>>>>     mon=$(crm_mon -1)
>>>>>>>>>     ret=$?
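The quoted reproducer is cut off above. A self-contained sketch of the pattern it describes might look like the following - here crm_mon is stubbed with a shell function so the snippet runs outside a cluster, and the OCF return codes are shown only as comments; the real agent calls the actual crm_mon binary:

```shell
#!/bin/sh
# Stub standing in for the real crm_mon. In the reported bug the real
# binary stopped answering once pacemakerd began shutting down, so the
# agent's stop action could not get a status back.
crm_mon() { echo "Cluster Summary: stub"; return 0; }

# Pattern from the reproducer: a stop action that shells out to crm_mon
# and bases its result on whether a response came back.
dummy_stop() {
    mon=$(crm_mon -1)
    ret=$?
    if [ "$ret" -ne 0 ]; then
        echo "dummy_stop: crm_mon failed (rc=$ret)" >&2
        return 1    # a real agent would return $OCF_ERR_GENERIC
    fi
    return 0        # a real agent would return $OCF_SUCCESS
}

dummy_stop && echo "stop succeeded"   # prints "stop succeeded"
```

Because a stop failure normally escalates to fencing, an agent whose stop path depends on crm_mon is exactly where this regression hurt first - the same dependency exists in pgsql's demote/stop handling.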
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
On 4/15/21 3:26 PM, Ulrich Windl wrote:
> Steffen Vinther Sørensen schrieb am 15.04.2021 um 14:56 in Nachricht:
>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl wrote:
>>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
>>>> Hi there,
>>>>
>>>> In this 3-node cluster, node03 had been offline for a while and was
>>>> brought back into service. Then a migration of a VirtualDomain was
>>>> attempted, and node02 was fenced.
>>>>
>>>> Provided are logs from both nodes, the 'pcs config', and a bzcatted
>>>> pe-warn. Anyone with an idea of why the node was fenced? Is it
>>>> because of the failed ipmi monitor warning?
>>> After a short glance it looks as if the network traffic used for VM
>>> migration killed the corosync (or other) communication.
>> May I ask what part is making you think so?
> The part that I saw no reason for an intended fencing.

And it looks like node02 is being cut off from all networking-communication - both corosync & ipmi. It may really be the networking load, although I would rather bet on something more systematic like a MAC/IP conflict with the VM or something. I see you are having libvirtd under cluster-control. Maybe bringing up the network-topology destroys the connection between the nodes. Has the cluster been working with the 3 nodes before?

Klaus

>>>> Here is the outline:
>>>>
>>>> At 06:58:27 node03 is activated with 'pcs start node03'; nothing
>>>> suspicious in the logs.
>>>>
>>>> At 06:59:17 a resource migration is attempted from node02 to node03
>>>> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'.
>>>>
>>>> On node01 this happens:
>>>>
>>>> Apr 15 06:59:17 kvm03-node01 pengine[29024]: warning: Processing
>>>> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>>>> unknown error
>>>>
>>>> And node02 is fenced?
>>>>
>>>> /Steffen
[ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason
Steffen Vinther Sørensen schrieb am 15.04.2021 um 14:56 in Nachricht:
> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl wrote:
>> Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
>>> Hi there,
>>>
>>> In this 3-node cluster, node03 had been offline for a while and was
>>> brought back into service. Then a migration of a VirtualDomain was
>>> attempted, and node02 was fenced.
>>>
>>> Provided are logs from both nodes, the 'pcs config', and a bzcatted
>>> pe-warn. Anyone with an idea of why the node was fenced? Is it
>>> because of the failed ipmi monitor warning?
>> After a short glance it looks as if the network traffic used for VM
>> migration killed the corosync (or other) communication.
>
> May I ask what part is making you think so?

The part that I saw no reason for an intended fencing.

>>> Here is the outline:
>>>
>>> At 06:58:27 node03 is activated with 'pcs start node03'; nothing
>>> suspicious in the logs.
>>>
>>> At 06:59:17 a resource migration is attempted from node02 to node03
>>> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'.
>>>
>>> On node01 this happens:
>>>
>>> Apr 15 06:59:17 kvm03-node01 pengine[29024]: warning: Processing
>>> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>>> unknown error
>>>
>>> And node02 is fenced?
>>>
>>> /Steffen
Re: [ClusterLabs] Antw: [EXT] Node fenced for unknown reason
On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl wrote:
> Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
>> Hi there,
>>
>> In this 3-node cluster, node03 had been offline for a while and was
>> brought back into service. Then a migration of a VirtualDomain was
>> attempted, and node02 was fenced.
>>
>> Provided are logs from both nodes, the 'pcs config', and a bzcatted
>> pe-warn. Anyone with an idea of why the node was fenced? Is it
>> because of the failed ipmi monitor warning?
> After a short glance it looks as if the network traffic used for VM
> migration killed the corosync (or other) communication.

May I ask what part is making you think so?

>> Here is the outline:
>>
>> At 06:58:27 node03 is activated with 'pcs start node03'; nothing
>> suspicious in the logs.
>>
>> At 06:59:17 a resource migration is attempted from node02 to node03
>> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'.
>>
>> On node01 this happens:
>>
>> Apr 15 06:59:17 kvm03-node01 pengine[29024]: warning: Processing
>> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>> unknown error
>>
>> And node02 is fenced?
>>
>> /Steffen
[ClusterLabs] Antw: [EXT] Node fenced for unknown reason
Steffen Vinther Sørensen schrieb am 15.04.2021 um 13:10 in Nachricht:
> Hi there,
>
> In this 3-node cluster, node03 had been offline for a while and was
> brought back into service. Then a migration of a VirtualDomain was
> attempted, and node02 was fenced.
>
> Provided are logs from both nodes, the 'pcs config', and a bzcatted
> pe-warn. Anyone with an idea of why the node was fenced? Is it
> because of the failed ipmi monitor warning?

After a short glance it looks as if the network traffic used for VM migration killed the corosync (or other) communication.

> Here is the outline:
>
> At 06:58:27 node03 is activated with 'pcs start node03'; nothing
> suspicious in the logs.
>
> At 06:59:17 a resource migration is attempted from node02 to node03
> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'.
>
> On node01 this happens:
>
> Apr 15 06:59:17 kvm03-node01 pengine[29024]: warning: Processing
> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
> unknown error
>
> And node02 is fenced?
>
> /Steffen
[ClusterLabs] Antw: [EXT] Coming in Pacemaker 2.1.0: build-time default for resource-stickiness
Ken Gaillot schrieb am 14.04.2021 um 19:46 in Nachricht <6b9a088d07369cc39a82c1ff3af41c43c10b34a2.ca...@redhat.com>:
> Hello all,
>
> I hope to have the first Pacemaker 2.1.0 release candidate ready next
> week!
>
> A recently added feature is a new build-time option to change the
> resource-stickiness default for new CIBs.
>
> Currently, resource-stickiness defaults to 0, meaning the cluster is
> free to move resources around to balance them across nodes and so
> forth. Many new users are surprised by this behavior and expect sticky
> behavior by default.

Well, I think zero stickiness is good to teach new users that the cluster will move any resource unless told otherwise.

> Now, building Pacemaker using ./configure --with-resource-stickiness-default=<value>
> tells Pacemaker to add a rsc_defaults section to empty CIBs with a
> resource-stickiness of <value>. Distributions and users who build
> from source can set this if they're tired of explaining stickiness to
> surprised users and expect fewer users to be surprised by stickiness.
> :)

Hopefully there will be great variability between distributions, and between releases too, so that users will learn that it's best to set the stickiness as needed 8-)

> Adding a resource default to all new CIBs is an unusual way of changing
> a default.
>
> We can't simply leave it to higher-level tools, because when creating a
> cluster, the cluster may not be started immediately and thus there is
> no way to set the property. Also, the user has a variety of ways to
> create or start a cluster, so no tool can assume it has full control.
>
> We leave the implicit default stickiness at 0, and instead set the
> configured default via a rsc_defaults entry in new CIBs, so that it
> won't affect existing clusters or rolling upgrades (current users won't
> see behavior change), and unlike implicit defaults, users can query and
> remove resource defaults.

IMHO implicit values are great if they never vary and there is common agreement on what the ("reasonable") default value is. Obviously that is not the case here, so maybe it's better not to have any implicit (default) values at all; instead, require all of them to be specified (as a global default).

Regards,
Ulrich

> --
> Ken Gaillot
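For illustration, the rsc_defaults section such a build would seed into a new CIB looks roughly like the following - the ids and the value are illustrative, not taken from the announcement, and the same default can equally be set at runtime on an existing cluster (e.g. `pcs resource defaults resource-stickiness=1` with pcs):

```
<rsc_defaults>
  <meta_attributes id="build-resource-defaults">
    <nvpair id="build-resource-stickiness"
            name="resource-stickiness" value="1"/>
  </meta_attributes>
</rsc_defaults>
```

Because it is an ordinary rsc_defaults entry rather than a compiled-in default, `cibadmin --query` shows it and administrators can change or delete it like any other resource default.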
[ClusterLabs] Antw: [EXT] Re: Single-node automated startup question
Ken Gaillot schrieb am 14.04.2021 um 18:35 in Nachricht <00635dba0dfc70430d4fd7820677b47d242d65d2.ca...@redhat.com>:
[...]
>> Startup fencing is the pacemaker default (the startup-fencing cluster
>> option).
>
> Start-up fencing will have the desired effect in a >2 node cluster, but
> in a 2-node cluster the corosync wait_for_all option is key.

This is another good example where pacemaker is (maybe for historic reasons) more complicated than necessary (IMHO): Why not have a single "cluster-formation-timeout" that waits for nodes to join when initially forming a cluster (i.e. the starting node has no quorum yet)? Then, if that timeout expires and there is still no quorum (subject to other configuration parameters), the node would commit suicide (self-fencing, preferably "off" instead of "reboot"). Of course any two-node cluster would need some tie-breaker (like grabbing an exclusive lock on shared storage).

> If wait_for_all is true (which is the default when two_node is set),
> then a node that comes up alone will wait until it sees the other node
> at least once before becoming quorate. This prevents an isolated node
> from coming up and fencing a node that's happily running.
>
> Setting wait_for_all to false will make an isolated node immediately
> become quorate. It will do what you want, which is fence the other node
> and take over resources, but the danger is that this node is the one
> that's having trouble (e.g. can't see the other node due to a network
> card issue). The healthy node could fence the unhealthy node, which
> might then reboot, come up, and shoot the healthy node.
>
> There's no direct equivalent of a delay before becoming quorate, but I
> don't think that helps - the boot time acts as a sort of random delay,
> and a delay doesn't help the issue of an unhealthy node shooting a
> healthy one.
>
> My recommendation would be to set wait_for_all to true as long as both
> nodes are known to be healthy. Once an unhealthy node is down and
> expected to stay down, set wait_for_all to false on the healthy node so
> it can reboot and bring the cluster up. (The unhealthy node will still
> have wait_for_all=true, so it won't cause any trouble even if it comes
> up.)
[...]

Regards,
Ulrich
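The two_node / wait_for_all combination discussed above lives in the quorum section of corosync.conf; a minimal two-node sketch (wait_for_all is shown explicitly, although two_node: 1 already implies it):

```
# /etc/corosync/corosync.conf (excerpt)
quorum {
    provider: corosync_votequorum
    # Lets a 2-node cluster stay quorate after losing one node ...
    two_node: 1
    # ... but require both nodes to be seen at least once after a cold
    # start, so an isolated node cannot boot up and fence its healthy
    # peer. Implied by two_node: 1; set to 0 only deliberately.
    wait_for_all: 1
}
```

The current values can be inspected at runtime with `corosync-quorumtool -s`, which is a quick way to confirm which of the two behaviours a given node will follow on its next cold start.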