Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Andrei Borzenkov
On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
> On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger  wrote:
>>
>> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
>>> 14:56 in
>>> Nachricht
>>> :
 On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
  wrote:
 Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> 13:10 in
> Nachricht
> :
>> Hi there,
>>
>> In this 3-node cluster, node03 has been offline for a while and is being
>> brought back into service. Then a migration of a VirtualDomain is
>> attempted, and node02 is then fenced.
>>
>> Provided are logs from both nodes, plus the 'pcs config' and a
>> bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
>> it because of the failed ipmi monitor warning?
> After a short glance it looks as if the network traffic used for VM
> migration killed the corosync (or other) communication.
>
 May I ask what part is making you think so ?
>>> The part being that I saw no reason for an intended fencing.
>> And it looks like node02 is being cut off from all
>> networking-communication - both corosync & ipmi.
>> May really be the networking-load although I would
>> rather bet on something more systematic like a
>> MAC/IP conflict with the VM or something.
>> I see you are having libvirtd under cluster-control.
>> Maybe bringing up the network-topology destroys the
>> connection between the nodes.
>> Has the cluster been working with the 3 nodes before?
>>
>>
>> Klaus
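A quick way to test the MAC/IP-conflict hypothesis is duplicate-address
detection with arping while the VM in question is running (a minimal sketch;
the bridge name br0 and the address 192.0.2.10 are placeholders, not taken
from this cluster, and 52:54:00 is only the usual KVM MAC prefix):

    # Duplicate Address Detection: exit status 0 means nobody else answered
    # for this IP; non-zero means a conflicting reply was seen.
    arping -D -c 3 -I br0 192.0.2.10

    # Compare the MAC the VM is supposed to have with what the bridge learned:
    bridge fdb show br br0 | grep -i 52:54:00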
> 
> Hi Klaus
> 
> Yes it has been working before with all 3 nodes and migrations back
> and forth, but a few more VirtualDomains have been deployed since the
> last migration test.
> 
> It happens very fast, almost immediately after the migration starts.
> Could it be that some timeout values should be adjusted?
> I just don't have any idea where to start looking, as to me there is
> nothing obviously suspicious in the logs.
> 


I would look at performance stats; maybe node02 was overloaded and
could not answer in time. Standard sar stats are collected only every
15 minutes, though, which is usually too coarse for this.

Migration can stress the network. Talk with your network support: were
there any errors around this time?
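For something finer-grained than the periodic sa1 samples, sar can also be
run interactively while a test migration is in progress (a sketch; adjust the
duration to cover the migration window):

    # 1-second samples for 2 minutes: CPU, per-NIC throughput and errors, run queue
    sar -u 1 120
    sar -n DEV,EDEV 1 120
    sar -q 1 120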
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Node fenced for unknown reason

2021-04-15 Thread Andrei Borzenkov
On 15.04.2021 14:10, Steffen Vinther Sørensen wrote:
> Hi there,
> 
> In this 3 node cluster, node03 been offline for a while, and being
> brought up to service. Then a migration of a VirtualDomain is being
> attempted, and node02 is then fenced.
> 
> Provided is logs from all 2 nodes, and the 'pcs config' as well as a
> bzcatted pe-warn. Anyone with an idea of why the node was fenced ?

It was fenced because communication between node02 and the two other
nodes was lost. Why that happened cannot be answered from the available
logs.

> Is it because of the failed ipmi monitor warning?
> 

No.
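One way to see the communication loss from the surviving nodes is to look at
the corosync/pacemaker journal around the fence time (a sketch; adjust the
time window, and note that some setups log to files under /var/log/cluster/
instead of the journal):

    # On node01 and node03: look for the membership change and the fencing request
    journalctl -u corosync -u pacemaker \
        --since "2021-04-15 06:58" --until "2021-04-15 07:01" \
        | grep -Ei 'membership|quorum|lost|fenc'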

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Andrei Borzenkov
On 15.04.2021 16:39, Klaus Wenninger wrote:
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
>> 14:56 in
>> Nachricht
>> :
>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
>>>  wrote:
>>> Steffen Vinther Sørensen  schrieb am
>>> 15.04.2021 um
 13:10 in
 Nachricht
 :
> Hi there,
>
> In this 3-node cluster, node03 has been offline for a while and is being
> brought back into service. Then a migration of a VirtualDomain is
> attempted, and node02 is then fenced.
>
> Provided are logs from both nodes, plus the 'pcs config' and a
> bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
> it because of the failed ipmi monitor warning?
 After a short glance it looks as if the network traffic used for VM
 migration killed the corosync (or other) communication.

>>> May I ask what part is making you think so ?
>> The part being that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.

Well, IPMI fencing was (claimed to be) successful, so the monitoring errors
could be a false positive. Still, it is something that needs investigation.

... judging by

Apr 15 06:59:26 kvm03-node02 systemd-logind[4179]: Power key pressed.

IPMI fencing *was* successful.

> May really be the networking-load although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
> 
> 
> Klaus
>>
>
> Here is the outline:
>
> At 06:58:27 node03 is being activated with 'pcs start node03', nothing
> suspicious in the logs
>
> At  06:59:17 a resource migration is attempted from node02 to node03
> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
>
>
> on node01 this happens:
>
> Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
> unknown error
>
> And node02 is fenced ?
>
> /Steffen

> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-15 Thread renayama19661014
Hi All,

Sorry...
Due to a mistake on my part, the same email was sent multiple times.


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
> Cc: 
> Date: 2021/4/15, Thu 11:45
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
> We have confirmed through testing that the operation is improved.
> Thank you for your prompt response.
> 
> We look forward to this fix being included in the release version of RHEL 8.4.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>  To: "kwenn...@redhat.com" ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/13, Tue 07:08
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  Hi Klaus,
>>  Hi Ken,
>> 
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>> 
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>> 
>> 
>>  Thank you for the fix.
>> 
>> 
>>  I have confirmed that the fixes have been merged.
>> 
>>  I'll test this fix today just in case.
>> 
>>  Many thanks,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>   From: Klaus Wenninger 
>>>   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>>  open-source clustering welcomed 
>>>   Cc: 
>>>   Date: 2021/4/12, Mon 22:22
>>>   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
> control 
>>  fails.
>>> 
>>>   On 4/9/21 5:13 PM, Klaus Wenninger wrote:
    On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>    On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>    On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>    On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
    Hi Klaus,
 
    Thanks for your comment.
 
>    Hmm ... is that with selinux enabled?
>    Or rather, do you see any related avc messages?
 
    Selinux is not enabled.
    Isn't this caused by crm_mon not returning a response when
    pacemakerd prepares to stop?
>>    yep ... that doesn't look good.
>>    While in pcmk_shutdown_worker ipc isn't handled.
>    Stop ... that should actually work as pcmk_shutdown_worker
>    should exit quite quickly and proceed after mainloop
>    dispatching when called again.
>    Don't see anything atm that might be blocking for longer 
> ...
>    but let me dig into it further ...
    What happens is clear (thanks Ken for the hint ;-) ).
    When pacemakerd is shutting down - already when it
    shuts down the resources and not just when it starts to
    reap the subdaemons - crm_mon reads that state and
    doesn't try to connect to the cib anymore.
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>>    Question is why that didn't create issue earlier.
>>    Probably I didn't test with resources that had 
> crm_mon in
>>    their stop/monitor-actions but sbd should have run into
>>    issues.
>> 
>>    Klaus
>>>    But when shutting down a node the resources should be
>>>    shutdown before pacemakerd goes down.
>>>    But let me have a look if it can happen that pacemakerd
>>>    doesn't react to the ipc-pings before. That btw. might be
>>>    lethal for sbd-scenarios (if the phase is too long and it
>>>    might actually not be defined).
>>> 
>>>    My idea with selinux would have been that it might block
>>>    the ipc if crm_mon is issued by execd. But well forget
>>>    about it as it is not enabled ;-)
>>> 
>>> 
>>>    Klaus
 
    pgsql needs the result of crm_mon in demote processing and stop
    processing.
    crm_mon should return a response even after pacemakerd goes into a
    stop operation.
 
    Best Regards,
    Hideo Yamauchi.
 
 
    - Original Message -
>    From: Klaus Wenninger 
> 
>    To: renayama19661...@ybb.ne.jp; Cluster Labs 
> - All 
>> 
>>>   topics related 
>    to open-source clustering welcomed 
>>>   
>    Cc:
>    Date: 2021/4/9, Fri 21:12
>    Subject: Re: [ClusterLabs] [Problem] In 
>>  RHEL8.4beta, 
>>>   pgsql 
>    resource control fails.
> 
>    On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Steffen Vinther Sørensen
On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger  wrote:
>
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>  Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> > 14:56 in
> > Nachricht
> > :
> >> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
> >>  wrote:
> >> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> >>> 13:10 in
> >>> Nachricht
> >>> :
>  Hi there,
> 
>  In this 3-node cluster, node03 has been offline for a while and is being
>  brought back into service. Then a migration of a VirtualDomain is
>  attempted, and node02 is then fenced.
> 
>  Provided are logs from both nodes, plus the 'pcs config' and a
>  bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
>  it because of the failed ipmi monitor warning?
> >>> After a short glance it looks as if the network traffic used for VM
> >>> migration killed the corosync (or other) communication.
> >>>
> >> May I ask what part is making you think so ?
> > The part being that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.
> May really be the networking-load although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
>
>
> Klaus

About libvirtd being under cluster control: this is because I'm using
virtlockd, which in turn depends on the gfs2 filesystems for lock files.
Some VM images are also located on those gfs2 filesystems.

Maybe it does not need to be like that: as long as the VirtualDomains
depend on those filesystems, libvirtd could just always be running, so
its network topology being brought up/down won't disturb anything.

That's the best clue I have for now.
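If libvirtd were taken out of cluster control and simply enabled via systemd,
the dependency that matters could still be expressed against the filesystem
resources (a sketch; gfs2-images-clone is a placeholder for the actual gfs2
clone resource name in this cluster):

    # libvirtd managed by systemd on every node
    systemctl enable --now libvirtd

    # the VM starts only where, and after, the gfs2 filesystem clone is up
    pcs constraint order start gfs2-images-clone then sikkermail30
    pcs constraint colocation add sikkermail30 with gfs2-images-clone INFINITY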

/Steffen

> >
> 
>  Here is the outline:
> 
>  At 06:58:27 node03 is being activated with 'pcs start node03', nothing
>  suspicious in the logs
> 
>  At  06:59:17 a resource migration is attempted from node02 to node03
>  with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
> 
> 
>  on node01 this happens:
> 
>  Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
>  failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>  unknown error
> 
>  And node02 is fenced ?
> 
>  /Steffen
> >>>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Steffen Vinther Sørensen
On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger  wrote:
>
> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>  Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> > 14:56 in
> > Nachricht
> > :
> >> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
> >>  wrote:
> >> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> >>> 13:10 in
> >>> Nachricht
> >>> :
>  Hi there,
> 
>  In this 3-node cluster, node03 has been offline for a while and is being
>  brought back into service. Then a migration of a VirtualDomain is
>  attempted, and node02 is then fenced.
> 
>  Provided are logs from both nodes, plus the 'pcs config' and a
>  bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
>  it because of the failed ipmi monitor warning?
> >>> After a short glance it looks as if the network traffic used for VM
> >>> migration killed the corosync (or other) communication.
> >>>
> >> May I ask what part is making you think so ?
> > The part being that I saw no reason for an intended fencing.
> And it looks like node02 is being cut off from all
> networking-communication - both corosync & ipmi.
> May really be the networking-load although I would
> rather bet on something more systematic like a
> MAC/IP conflict with the VM or something.
> I see you are having libvirtd under cluster-control.
> Maybe bringing up the network-topology destroys the
> connection between the nodes.
> Has the cluster been working with the 3 nodes before?
>
>
> Klaus

Hi Klaus

Yes, it has been working before with all 3 nodes and migrations back
and forth, but a few more VirtualDomains have been deployed since the
last migration test.

It happens very fast, almost immediately after the migration starts.
Could it be that some timeout values should be adjusted?
I just don't have any idea where to start looking, as to me there is
nothing obviously suspicious in the logs.
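Regarding timeouts, the first thing worth checking is the effective corosync
token timeout versus how long the network stalls when the migration starts (a
sketch; the 5000 ms value is only an example, and any change should be made
on all nodes):

    # Effective token timeout (ms) of the running corosync
    corosync-cmapctl -g runtime.config.totem.token

    # To raise it: set e.g. 'token: 5000' in the totem{} section of
    # /etc/corosync/corosync.conf, then push and reload it
    pcs cluster sync
    pcs cluster reload corosync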

/Steffen

> >
> 
>  Here is the outline:
> 
>  At 06:58:27 node03 is being activated with 'pcs start node03', nothing
>  suspicious in the logs
> 
>  At  06:59:17 a resource migration is attempted from node02 to node03
>  with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
> 
> 
>  on node01 this happens:
> 
>  Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
>  failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>  unknown error
> 
>  And node02 is fenced ?
> 
>  /Steffen
> >>>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-15 Thread renayama19661014
Hi Klaus,
Hi Ken,

We have confirmed through testing that the operation is improved.
Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; Cluster Labs - All topics 
> related to open-source clustering welcomed ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> 
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
> 
> 
> Thank you for the fix.
> 
> 
> I have confirmed that the fixes have been merged.
> 
> I'll test this fix today just in case.
> 
> Many thanks,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/12, Mon 22:22
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi Klaus,
>>> 
>>>   Thanks for your comment.
>>> 
   Hmm ... is that with selinux enabled?
   Or rather, do you see any related avc messages?
>>> 
>>>   Selinux is not enabled.
>>>   Isn't this caused by crm_mon not returning a response when
>>>   pacemakerd prepares to stop?
>   yep ... that doesn't look good.
>   While in pcmk_shutdown_worker ipc isn't handled.
   Stop ... that should actually work as pcmk_shutdown_worker
   should exit quite quickly and proceed after mainloop
   dispatching when called again.
   Don't see anything atm that might be blocking for longer ...
   but let me dig into it further ...
>>>   What happens is clear (thanks Ken for the hint ;-) ).
>>>   When pacemakerd is shutting down - already when it
>>>   shuts down the resources and not just when it starts to
>>>   reap the subdaemons - crm_mon reads that state and
>>>   doesn't try to connect to the cib anymore.
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
>   Question is why that didn't create issue earlier.
>   Probably I didn't test with resources that had crm_mon in
>   their stop/monitor-actions but sbd should have run into
>   issues.
> 
>   Klaus
>>   But when shutting down a node the resources should be
>>   shutdown before pacemakerd goes down.
>>   But let me have a look if it can happen that pacemakerd
>>   doesn't react to the ipc-pings before. That btw. might be
>>   lethal for sbd-scenarios (if the phase is too long and it
>>   might actually not be defined).
>> 
>>   My idea with selinux would have been that it might block
>>   the ipc if crm_mon is issued by execd. But well forget
>>   about it as it is not enabled ;-)
>> 
>> 
>>   Klaus
>>> 
>>>   pgsql needs the result of crm_mon in demote processing and
>>>   stop processing.
>>>   crm_mon should return a response even after pacemakerd goes
>>>   into a stop operation.
>>> 
>>>   Best Regards,
>>>   Hideo Yamauchi.
>>> 
>>> 
>>>   - Original Message -
   From: Klaus Wenninger 
   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related
   to open-source clustering welcomed 
   Cc:
   Date: 2021/4/9, Fri 21:12
   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

   On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>     Hi Ken,
>     Hi All,
> 
>     In the pgsql resource, crm_mon is executed in the process of demote and
>     stop, and the result is processed.
>     However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>       - The problem also occurs on github
>         master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>     The problem can be easily reproduced in the following ways.
> 
>     Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
> 
>     dummy_stop() {
>          mon=$(crm_mon -1)
>          ret=$?
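The quoted reproduction steps are truncated here; a complete stop function of
that shape might look like the following sketch (illustrative only, not the
original snippet, and it assumes the stock Dummy agent with its ocf-shellfuncs
helpers already sourced):

    dummy_stop() {
        # query cluster status the same way pgsql does in demote/stop
        mon=$(crm_mon -1)
        ret=$?
        if [ $ret -ne 0 ]; then
            ocf_log err "crm_mon -1 failed during stop (rc=$ret)"
            return $OCF_ERR_GENERIC   # a failed stop normally escalates to fencing
        fi
        rm -f "${OCF_RESKEY_state}"   # normal Dummy stop behaviour
        return $OCF_SUCCESS
    }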

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Klaus Wenninger

On 4/15/21 3:26 PM, Ulrich Windl wrote:

Steffen Vinther Sørensen  schrieb am 15.04.2021 um

14:56 in
Nachricht
:

On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
 wrote:

Steffen Vinther Sørensen  schrieb am 15.04.2021 um

13:10 in
Nachricht
:

Hi there,

In this 3-node cluster, node03 has been offline for a while and is being
brought back into service. Then a migration of a VirtualDomain is
attempted, and node02 is then fenced.

Provided are logs from both nodes, plus the 'pcs config' and a
bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
it because of the failed ipmi monitor warning?

After a short glance it looks as if the network traffic used for VM
migration killed the corosync (or other) communication.


May I ask what part is making you think so ?

The part being that I saw no reason for an intended fencing.

And it looks like node02 is being cut off from all
network communication - both corosync & ipmi.
It may really be the network load, although I would
rather bet on something more systematic, like a
MAC/IP conflict with the VM or something.
I see you have libvirtd under cluster control.
Maybe bringing up the network topology destroys the
connection between the nodes.
Has the cluster been working with all 3 nodes before?


Klaus




Here is the outline:

At 06:58:27 node03 is being activated with 'pcs start node03', nothing
suspicious in the logs

At  06:59:17 a resource migration is attempted from node02 to node03
with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'


on node01 this happens:

Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
unknown error

And node02 is fenced ?

/Steffen




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Ulrich Windl
>>> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
14:56 in
Nachricht
:
> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
>  wrote:
>>
>> >>> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
>> 13:10 in
>> Nachricht
>> :
>> > Hi there,
>> >
>> > In this 3-node cluster, node03 has been offline for a while and is being
>> > brought back into service. Then a migration of a VirtualDomain is
>> > attempted, and node02 is then fenced.
>> >
>> > Provided are logs from both nodes, plus the 'pcs config' and a
>> > bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
>> > it because of the failed ipmi monitor warning?
>>
>> After a short glance it looks as if the network traffic used for VM
>> migration killed the corosync (or other) communication.
>>
> 
> May I ask what part is making you think so ?

The part being that I saw no reason for an intended fencing.

> 
>> >
>> >
>> > Here is the outline:
>> >
>> > At 06:58:27 node03 is being activated with 'pcs start node03', nothing
>> > suspicious in the logs
>> >
>> > At  06:59:17 a resource migration is attempted from node02 to node03
>> > with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
>> >
>> >
>> > on node01 this happens:
>> >
>> > Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
>> > failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
>> > unknown error
>> >
>> > And node02 is fenced ?
>> >
>> > /Steffen
>>
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Steffen Vinther Sørensen
On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
 wrote:
>
> >>> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
> 13:10 in
> Nachricht
> :
> > Hi there,
> >
> > In this 3-node cluster, node03 has been offline for a while and is being
> > brought back into service. Then a migration of a VirtualDomain is
> > attempted, and node02 is then fenced.
> >
> > Provided are logs from both nodes, plus the 'pcs config' and a
> > bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
> > it because of the failed ipmi monitor warning?
>
> After a short glance it looks as if the network traffic used for VM migration
> killed the corosync (or other) communication.
>

May I ask what part is making you think so ?

> >
> >
> > Here is the outline:
> >
> > At 06:58:27 node03 is being activated with 'pcs start node03', nothing
> > suspicious in the logs
> >
> > At  06:59:17 a resource migration is attempted from node02 to node03
> > with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
> >
> >
> > on node01 this happens:
> >
> > Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
> > failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
> > unknown error
> >
> > And node02 is fenced ?
> >
> > /Steffen
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Node fenced for unknown reason

2021-04-15 Thread Ulrich Windl
>>> Steffen Vinther Sørensen  schrieb am 15.04.2021 um
13:10 in
Nachricht
:
> Hi there,
> 
> In this 3-node cluster, node03 has been offline for a while and is being
> brought back into service. Then a migration of a VirtualDomain is
> attempted, and node02 is then fenced.
> 
> Provided are logs from both nodes, plus the 'pcs config' and a
> bzcatted pe-warn. Does anyone have an idea of why the node was fenced? Is
> it because of the failed ipmi monitor warning?

After a short glance it looks as if the network traffic used for VM migration
killed the corosync (or other) communication.

> 
> 
> Here is the outline:
> 
> At 06:58:27 node03 is being activated with 'pcs start node03', nothing
> suspicious in the logs
> 
> At  06:59:17 a resource migration is attempted from node02 to node03
> with 'pcs resource move sikkermail30 kvm03-node02.logiva-gcs.dk'
> 
> 
> on node01 this happens:
> 
> Apr 15 06:59:17 kvm03-node01 pengine[29024]:  warning: Processing
> failed monitor of ipmi-fencing-node01 on kvm03-node02.logiva-gcs.dk:
> unknown error
> 
> And node02 is fenced ?
> 
> /Steffen



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Coming in Pacemaker 2.1.0: build-time default for resource-stickiness

2021-04-15 Thread Ulrich Windl
>>> Ken Gaillot  schrieb am 14.04.2021 um 19:46 in
Nachricht
<6b9a088d07369cc39a82c1ff3af41c43c10b34a2.ca...@redhat.com>:
> Hello all,
> 
> I hope to have the first Pacemaker 2.1.0 release candidate ready next
> week!
> 
> A recently added feature is a new build-time option to change the
> resource-stickiness default for new CIBs.
> 
> Currently, resource-stickiness defaults to 0, meaning the cluster is
> free to move resources around to balance them across nodes and so
> forth. Many new users are surprised by this behavior and expect sticky
> behavior by default.

Well I think zero stickiness is good to teach new users that the cluster will
move any resource unless told otherwise.

> 
> Now, building Pacemaker using ./configure --with-resource-stickiness-
> default=<value> tells Pacemaker to add a rsc_defaults section to empty CIBs
> with a resource-stickiness of <value>. Distributions and users who build
> from source can set this if they're tired of explaining stickiness to
> surprised users and expect fewer users to be surprised by stickiness.
> :)

Hopefully there will be great variability between distributions, and between
releases too, so that users will learn that it's best to set the stickiness as
needed 8-)

> 
> Adding a resource default to all new CIBs is an unusual way of changing
> a default.
> 
> We can't simply leave it to higher-level tools, because when creating a
> cluster, the cluster may not be started immediately and thus there is
> no way to set the property. Also, the user has a variety of ways to
> create or start a cluster, so no tool can assume it has full control.
> 
> We leave the implicit default stickiness at 0, and instead set the
> configured default via a rsc_defaults entry in new CIBs, so that it
> won't affect existing clusters or rolling upgrades (current users won't
> see behavior change), and unlike implicit defaults, users can query and
> remove resource defaults.

IMHO implicit values are great if they never vary and there is common
agreement on what the ("reasonable") default value is.
Obviously that is not the case here, so maybe it's better not to have any
implicit (default) values; instead, require all of them to be specified (as a
global default).
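For comparison, setting the stickiness explicitly as a global resource default
is a one-liner with the existing tools (a sketch; 100 is an arbitrary value,
and newer pcs versions spell it 'pcs resource defaults update ...'):

    pcs resource defaults resource-stickiness=100
    # or write it into the CIB's rsc_defaults section directly:
    crm_attribute --type rsc_defaults --name resource-stickiness --update 100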

Regards,
Ulrich


> ‑‑ 
> Ken Gaillot 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Single-node automated startup question

2021-04-15 Thread Ulrich Windl
>>> Ken Gaillot  schrieb am 14.04.2021 um 18:35 in
Nachricht
<00635dba0dfc70430d4fd7820677b47d242d65d2.ca...@redhat.com>:

[...]
>> 
>> Startup fencing is the Pacemaker default (the startup-fencing cluster
>> option).
> 
> Start-up fencing will have the desired effect in a >2-node cluster, but
> in a 2-node cluster the corosync wait_for_all option is key.

This is another good example where pacemaker is (maybe for historic reasons)
more complicated than necessary (IMHO):
Why not have a single "cluster-formation-timeout" that waits for nodes to join
when initially forming a cluster (i.e. the node starting has no quorum (yet))?
So if that timeout expires and there is no quorum (subject to other
configuration parameters), the node will commit suicide (self-fencing,
preferably to "off" instead of "reboot").
Of course any two-node cluster would need some tie-breaker (like grabbing some
exclusive lock on shared storage).

> 
> If wait_for_all is true (which is the default when two_node is set),
> then a node that comes up alone will wait until it sees the other node
> at least once before becoming quorate. This prevents an isolated node
> from coming up and fencing a node that's happily running.
> 
> Setting wait_for_all to false will make an isolated node immediately
> become quorate. It will do what you want, which is fence the other node
> and take over resources, but the danger is that this node is the one
> that's having trouble (e.g. can't see the other node due to a network
> card issue). The healthy node could fence the unhealthy node, which
> might then reboot and come up and shoot the healthy node.
> 
> There's no direct equivalent of a delay before becoming quorate, but I
> don't think that helps -- the boot time acts as a sort of random delay,
> and a delay doesn't help the issue of an unhealthy node shooting a
> healthy one.
> 
> My recommendation would be to set wait_for_all to true as long as both
> nodes are known to be healthy. Once an unhealthy node is down and
> expected to stay down, set wait_for_all to false on the healthy node so
> it can reboot and bring the cluster up. (The unhealthy node will still
> have wait_for_all=true, so it won't cause any trouble even if it comes
> up.) 
> 
[...]
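The two_node / wait_for_all behaviour described above lives in corosync's
quorum section and can be inspected with the standard tools (a sketch; note
that pcs may require the cluster to be stopped before changing quorum
options):

    # Show current quorum flags (look for '2Node' and 'WaitForAll')
    corosync-quorumtool -s

    # corosync.conf equivalent of the two-node defaults:
    #   quorum {
    #       provider: corosync_votequorum
    #       two_node: 1
    #       wait_for_all: 1
    #   }

    # Workaround described above: let a lone (known-healthy) node become quorate
    pcs quorum update wait_for_all=0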

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/