Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Strahil Nikolov
Hi Bernd,

As SLES 12 is at such a late stage of support, I guess SUSE will provide fixes only 
for SLES 15.
It would be best to open a case with them and ask about that.

Best Regards,
Strahil Nikolov

On 19 August 2020 at 17:29:32 GMT+03:00, "Lentes, Bernd" 
 wrote:
>
>- On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com wrote:
>>> This appears to be a scheduler bug.
>> 
>> Fix is in master branch and will land in 2.0.5 expected at end of the
>> year
>> 
>> https://github.com/ClusterLabs/pacemaker/pull/2146
>
>A fundamental question:
>I have SLES 12 and I'm using the pacemaker version provided with the
>distribution.
>Whether this fix is backported depends on SUSE.
>
>If I install and update pacemaker manually (not the version provided by
>SUSE),
>I lose my support from them, but always have the most recent code and
>fixes.
>
>If I stay with the version from SUSE I have support from them, but
>maybe not all fixes and not the most recent code.
>
>What is your approach?
>Recommendations?
>
>Thanks.
>
>Bernd
>Helmholtz Zentrum München
>


Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Ken Gaillot
On Wed, 2020-08-19 at 16:29 +0200, Lentes, Bernd wrote:
> - On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com
> wrote:
> > > This appears to be a scheduler bug.
> > 
> > Fix is in master branch and will land in 2.0.5 expected at end of
> > the
> > year
> > 
> > https://github.com/ClusterLabs/pacemaker/pull/2146
> 
> A fundamental question:
> I have SLES 12 and I'm using the pacemaker version provided with the
> distribution.
> Whether this fix is backported depends on SUSE.
> 
> If I install and update pacemaker manually (not the version provided
> by SUSE),
> I lose my support from them, but always have the most recent code
> and fixes.
> 
> If I stay with the version from SUSE I have support from them, but
> maybe not all fixes and not the most recent code.
> 
> What is your approach?
> Recommendations?

I'd recommend sticking with the supported version, and filing a bug
report with the distro asking for a specific fix to be backported when
you have the need.

Regarding the upstream project, running a release is fine, but I
wouldn't recommend running the master branch in production. Important
details of new features can change before release, and with frequent
development it's more likely to have regressions.

> 
> Thanks.
> 
> Bernd
> Helmholtz Zentrum München
-- 
Ken Gaillot 



Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Lentes, Bernd

- On Aug 19, 2020, at 4:04 PM, kgaillot kgail...@redhat.com wrote:
>> This appears to be a scheduler bug.
> 
> Fix is in master branch and will land in 2.0.5 expected at end of the
> year
> 
> https://github.com/ClusterLabs/pacemaker/pull/2146

A fundamental question:
I have SLES 12 and I'm using the pacemaker version provided with the
distribution.
Whether this fix is backported depends on SUSE.

If I install and update pacemaker manually (not the version provided by SUSE),
I lose my support from them, but always have the most recent code and fixes.

If I stay with the version from SUSE I have support from them, but maybe not
all fixes and not the most recent code.

What is your approach?
Recommendations?

Thanks.

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Lentes, Bernd

- On Aug 18, 2020, at 7:30 PM, kgaillot kgail...@redhat.com wrote:


>> > I'm not sure, I'd have to see the pe input.
>> 
>> You find it here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29
> 
> This appears to be a scheduler bug.
> 
> The scheduler considers a migration to be "dangling" if it has a record
> of a failed migrate_to on the source node, but no migrate_from on the
> target node (and no migrate_from or start on the source node, which
> would indicate a later full restart or reverse migration).
> 
> In this case, any migrate_from on the target has since been superseded
> by a failed start and a successful stop, so there is no longer a record
> of it. Therefore the migration is considered dangling, which requires a
> full stop on the source node.
> 
> However in this case we already have a successful stop on the source
> node after the failed migrate_to, and I believe that should be
> sufficient to consider it no longer dangling.
> 

Thanks for your explanation, Ken.
For me, a fence I don't understand is the worst thing that can happen to an HA cluster.

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-08-19 Thread Ken Gaillot
On Tue, 2020-08-18 at 12:30 -0500, Ken Gaillot wrote:
> On Tue, 2020-08-18 at 16:47 +0200, Lentes, Bernd wrote:
> > 
> > - On Aug 17, 2020, at 5:09 PM, kgaillot kgail...@redhat.com
> > wrote:
> > 
> > 
> > > > I checked all relevant pe-files in this time period.
> > > > This is what i found out (i just write the important entries):
> > 
> >  
> > > > Executing cluster transition:
> > >  * Resource action: vm_nextcloud    stop on ha-idg-2
> > > > Revised cluster status:
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > 
> > > > ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-
> > > > input-
> > > > 3118 -G transition-4516.xml -D transition-4516.dot
> > > > Current cluster status:
> > > > Node ha-idg-1 (1084777482): standby
> > > > Online: [ ha-idg-2 ]
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > <== vm_nextcloud is stopped
> > > > Transition Summary:
> > > >  * Shutdown ha-idg-1
> > > > Executing cluster transition:
> > >  * Resource action: vm_nextcloud    stop on ha-idg-1 < why
> > > > stop ?
> > > > It is already stopped
> > > 
> > > I'm not sure, I'd have to see the pe input.
> > 
> > You find it here: 
> > https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29
> 
> This appears to be a scheduler bug.

Fix is in master branch and will land in 2.0.5 expected at end of the
year

https://github.com/ClusterLabs/pacemaker/pull/2146

> The scheduler considers a migration to be "dangling" if it has a
> record
> of a failed migrate_to on the source node, but no migrate_from on the
> target node (and no migrate_from or start on the source node, which
> would indicate a later full restart or reverse migration).
> 
> In this case, any migrate_from on the target has since been
> superseded
> by a failed start and a successful stop, so there is no longer a
> record
> of it. Therefore the migration is considered dangling, which requires
> a
> full stop on the source node.
> 
> However in this case we already have a successful stop on the source
> node after the failed migrate_to, and I believe that should be
> sufficient to consider it no longer dangling.
> 
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > <===
> > > > vm_nextcloud is stopped
> > > > Transition Summary:
> > > >  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> > > > Executing cluster transition:
> > > >  * Fencing ha-idg-1 (Off)
> > > >  * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It
> > > > is
> > > > already stopped ?
> > > > Revised cluster status:
> > > > Node ha-idg-1 (1084777482): OFFLINE (standby)
> > > > Online: [ ha-idg-2 ]
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > 
> > > > I don't understand why the cluster tries to stop a resource
> > > > which
> > > > is
> > > > already stopped.
> > 
> > Bernd
> > Helmholtz Zentrum München
-- 
Ken Gaillot 



Re: [ClusterLabs] why is node fenced ?

2020-08-18 Thread Ken Gaillot
On Tue, 2020-08-18 at 16:47 +0200, Lentes, Bernd wrote:
> 
> - On Aug 17, 2020, at 5:09 PM, kgaillot kgail...@redhat.com
> wrote:
> 
> 
> > > I checked all relevant pe-files in this time period.
> > > This is what i found out (i just write the important entries):
> 
>  
> > > Executing cluster transition:
> > >  * Resource action: vm_nextcloud    stop on ha-idg-2
> > > Revised cluster status:
> > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > 
> > > ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-
> > > input-
> > > 3118 -G transition-4516.xml -D transition-4516.dot
> > > Current cluster status:
> > > Node ha-idg-1 (1084777482): standby
> > > Online: [ ha-idg-2 ]
> > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > <== vm_nextcloud is stopped
> > > Transition Summary:
> > >  * Shutdown ha-idg-1
> > > Executing cluster transition:
> > >  * Resource action: vm_nextcloud    stop on ha-idg-1 < why
> > > stop ?
> > > It is already stopped
> > 
> > I'm not sure, I'd have to see the pe input.
> 
> You find it here: 
> https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29

This appears to be a scheduler bug.

The scheduler considers a migration to be "dangling" if it has a record
of a failed migrate_to on the source node, but no migrate_from on the
target node (and no migrate_from or start on the source node, which
would indicate a later full restart or reverse migration).

In this case, any migrate_from on the target has since been superseded
by a failed start and a successful stop, so there is no longer a record
of it. Therefore the migration is considered dangling, which requires a
full stop on the source node.

However in this case we already have a successful stop on the source
node after the failed migrate_to, and I believe that should be
sufficient to consider it no longer dangling.
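To make that rule concrete, here is a minimal, hypothetical Python sketch of the
check as described in this thread. It is not Pacemaker source code; the
'recorded' and 'failed' structures are invented for illustration only.

# Hypothetical sketch of the "dangling migration" rule described above.
# 'recorded' is the set of (action, node) operations still present in the
# resource's history; 'failed' is the subset that finished with an error.
def is_dangling_migration(recorded, failed, source, target):
    return (
        ("migrate_to", source) in failed              # failed migrate_to on the source
        and ("migrate_from", target) not in recorded  # no migrate_from on the target
        and ("migrate_from", source) not in recorded  # no reverse migration on the source
        and ("start", source) not in recorded         # no later full restart on the source
    )

# In this thread the migrate_from on ha-idg-2 had been superseded by a failed
# start and a successful stop, so it was no longer recorded. The check therefore
# treated the migration as dangling and required a full stop on ha-idg-1, even
# though a successful stop was already on record there -- the condition the fix
# in pull request 2146 takes into account.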

> > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <===
> > > vm_nextcloud is stopped
> > > Transition Summary:
> > >  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> > > Executing cluster transition:
> > >  * Fencing ha-idg-1 (Off)
> > >  * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is
> > > already stopped ?
> > > Revised cluster status:
> > > Node ha-idg-1 (1084777482): OFFLINE (standby)
> > > Online: [ ha-idg-2 ]
> > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > 
> > > I don't understand why the cluster tries to stop a resource which
> > > is
> > > already stopped.
> 
> Bernd
> Helmholtz Zentrum München
-- 
Ken Gaillot 



Re: [ClusterLabs] why is node fenced ?

2020-08-18 Thread Lentes, Bernd


- On Aug 17, 2020, at 5:09 PM, kgaillot kgail...@redhat.com wrote:


>> I checked all relevant pe-files in this time period.
>> This is what i found out (i just write the important entries):
 
>> Executing cluster transition:
>>  * Resource action: vm_nextcloud    stop on ha-idg-2
>> Revised cluster status:
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> 
>> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-
>> 3118 -G transition-4516.xml -D transition-4516.dot
>> Current cluster status:
>> Node ha-idg-1 (1084777482): standby
>> Online: [ ha-idg-2 ]
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> <== vm_nextcloud is stopped
>> Transition Summary:
>>  * Shutdown ha-idg-1
>> Executing cluster transition:
>>  * Resource action: vm_nextcloud    stop on ha-idg-1 < why stop ?
>> It is already stopped
> 
> I'm not sure, I'd have to see the pe input.

You find it here: 
https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29

>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <===
>> vm_nextcloud is stopped
>> Transition Summary:
>>  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
>> Executing cluster transition:
>>  * Fencing ha-idg-1 (Off)
>>  * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is
>> already stopped ?
>> Revised cluster status:
>> Node ha-idg-1 (1084777482): OFFLINE (standby)
>> Online: [ ha-idg-2 ]
>>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
>> 
>> I don't understand why the cluster tries to stop a resource which is
>> already stopped.

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-08-17 Thread Ken Gaillot
On Fri, 2020-08-14 at 20:37 +0200, Lentes, Bernd wrote:
> - On Aug 9, 2020, at 10:17 PM, Bernd Lentes 
> bernd.len...@helmholtz-muenchen.de wrote:
> 
> 
> > > So this appears to be the problem. From these logs I would guess
> > > the
> > > successful stop on ha-idg-1 did not get written to the CIB for
> > > some
> > > reason. I'd look at the pe input from this transition on ha-idg-2 
> > > to
> > > confirm that.
> > > 
> > > Without the DC knowing about the stop, it tries to schedule a new
> > > one,
> > > but the node is shutting down so it can't do it, which means it
> > > has to
> > > be fenced.
> 
> I checked all relevant pe-files in this time period.
> This is what i found out (i just write the important entries):
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-
> 3116 -G transition-3116.xml -D transition-3116.dot
> Current cluster status:
>  ...
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-1
> Transition Summary:
>  ...
> * Migrate    vm_nextcloud   ( ha-idg-1 -> ha-idg-2 )
> Executing cluster transition:
>  * Resource action: vm_nextcloud    migrate_from on ha-idg-2 <===
> migrate vm_nextcloud
>  * Resource action: vm_nextcloud    stop on ha-idg-1 
>  * Pseudo action:   vm_nextcloud_start_0
> Revised cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
> vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2
> 
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-
> 48 -G transition-4514.xml -D transition-4514.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
> ...
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED[ ha-idg-2 ha-
> idg-1 ] <== migration failed
> Transition Summary:
> ..
>  * Recover    vm_nextcloud    ( ha-idg-2 )
> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-2
>  * Resource action: vm_nextcloud    stop on ha-idg-1 
>  * Resource action: vm_nextcloud    start on ha-idg-2
>  * Resource action: vm_nextcloud    monitor=3 on ha-idg-2
> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-
> 3117 -G transition-3117.xml -D transition-3117.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED ha-idg-2
> <== start on ha-idg-2 failed
> Transition Summary:
>  * Stop   vm_nextcloud ( ha-idg-2 )   due to node
> availability < stop vm_nextcloud (what does "due to node
> availability" mean?)

"Due to node availability" means no node is allowed to run the
resource, so it has to be stopped.

> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-2
> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-
> 3118 -G transition-4516.xml -D transition-4516.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> <== vm_nextcloud is stopped
> Transition Summary:
>  * Shutdown ha-idg-1
> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-1 < why stop ?
> It is already stopped

I'm not sure, I'd have to see the pe input.

> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-
> 3545 -G transition-0.xml -D transition-0.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): pending
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <==
> vm_nextcloud is stopped
> Transition Summary:
> 
> Executing cluster transition:
> Using the original execution date of: 2020-07-20 15:05:33Z
> Revised cluster status:
> vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-
> 749 -G transition-1.xml -D transition-1.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): OFFLINE (standby)
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <===
> vm_nextcloud is stopped
> Transition Summary:
>  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> Executing cluster transition:
>  * Fencing ha-idg-1 (Off)
>  * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is
> already stopped ?
> Revised cluster status:
> Node ha-idg-1 (1084777482): OFFLINE (standby)
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> I don't understand why the cluster tries to stop a resource which is
> already stopped.
> 
> Bernd
> Helmholtz Zentrum München

Re: [ClusterLabs] why is node fenced ?

2020-08-17 Thread Ken Gaillot
On Fri, 2020-08-14 at 12:17 +0200, Lentes, Bernd wrote:
> 
> - On Aug 10, 2020, at 11:59 PM, kgaillot kgail...@redhat.com
> wrote:
> > The most recent transition is aborted, but since all its actions
> > are
> > complete, the only effect is to trigger a new transition.
> > 
> > We should probably rephrase the log message. In fact, the whole
> > "transition" terminology is kind of obscure. It's hard to come up
> > with
> > something better though.
> > 
> 
> Hi Ken,
> 
> I don't get it. How can something be aborted which is already completed?

I agree the wording is confusing :)

From the code's point of view, the actions in the transition are
complete, but the transition itself (as an abstract entity) remains
current until the next one starts. However that's academic and
meaningless from a user's point of view, so the log messages should be
reworded.

> Bernd
> Helmholtz Zentrum München
-- 
Ken Gaillot 



Re: [ClusterLabs] why is node fenced ?

2020-08-14 Thread Lentes, Bernd
- On Aug 9, 2020, at 10:17 PM, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:


>> So this appears to be the problem. From these logs I would guess the
>> successful stop on ha-idg-1 did not get written to the CIB for some
>> reason. I'd look at the pe input from this transition on ha-idg-2 to
>> confirm that.
>> 
>> Without the DC knowing about the stop, it tries to schedule a new one,
>> but the node is shutting down so it can't do it, which means it has to
>> be fenced.

I checked all relevant pe-files in this time period.
This is what I found out (I only list the important entries):

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3116 -G 
transition-3116.xml -D transition-3116.dot
Current cluster status:
 ...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-1
Transition Summary:
 ...
* Migrate    vm_nextcloud   ( ha-idg-1 -> ha-idg-2 )
Executing cluster transition:
 * Resource action: vm_nextcloud    migrate_from on ha-idg-2 <=== migrate 
vm_nextcloud
 * Resource action: vm_nextcloud    stop on ha-idg-1 
 * Pseudo action:   vm_nextcloud_start_0
Revised cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2


ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-48 -G 
transition-4514.xml -D transition-4514.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED[ ha-idg-2 ha-idg-1 ] 
<== migration failed
Transition Summary:
..
 * Recover    vm_nextcloud    ( ha-idg-2 )
Executing cluster transition:
 * Resource action: vm_nextcloud    stop on ha-idg-2
 * Resource action: vm_nextcloud    stop on ha-idg-1 
 * Resource action: vm_nextcloud    start on ha-idg-2
 * Resource action: vm_nextcloud    monitor=3 on ha-idg-2
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3117 -G 
transition-3117.xml -D transition-3117.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED ha-idg-2 <== start 
on ha-idg-2 failed
Transition Summary:
 * Stop   vm_nextcloud ( ha-idg-2 )   due to node availability < 
stop vm_nextcloud (what does "due to node availability" mean?)
Executing cluster transition:
 * Resource action: vm_nextcloud    stop on ha-idg-2
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G 
transition-4516.xml -D transition-4516.dot
Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <== 
vm_nextcloud is stopped
Transition Summary:
 * Shutdown ha-idg-1
Executing cluster transition:
 * Resource action: vm_nextcloud    stop on ha-idg-1 < why stop ? It is 
already stopped
Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-3545 -G 
transition-0.xml -D transition-0.dot
Current cluster status:
Node ha-idg-1 (1084777482): pending
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <== vm_nextcloud is 
stopped
Transition Summary:

Executing cluster transition:
Using the original execution date of: 2020-07-20 15:05:33Z
Revised cluster status:
vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-749 -G 
transition-1.xml -D transition-1.dot
Current cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <=== vm_nextcloud 
is stopped
Transition Summary:
 * Fence (Off) ha-idg-1 'resource actions are unrunnable'
Executing cluster transition:
 * Fencing ha-idg-1 (Off)
 * Pseudo action:   vm_nextcloud_stop_0 <=== why stop ? It is already 
stopped ?
Revised cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped

I don't understand why the cluster tries to stop a resource which is already 
stopped.

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-08-14 Thread Lentes, Bernd


- On Aug 10, 2020, at 11:59 PM, kgaillot kgail...@redhat.com wrote:
> The most recent transition is aborted, but since all its actions are
> complete, the only effect is to trigger a new transition.
> 
> We should probably rephrase the log message. In fact, the whole
> "transition" terminology is kind of obscure. It's hard to come up with
> something better though.
> 
Hi Ken,

I don't get it. How can something be aborted which is already completed?

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-08-10 Thread Ken Gaillot
On Sun, 2020-08-09 at 22:17 +0200, Lentes, Bernd wrote:
> 
> - On 29 Jul 2020 at 18:53, kgaillot kgail...@redhat.com wrote:
> 
> > On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
> > > Hi,
> > > 
> > > a few days ago one of my nodes was fenced and i don't know why,
> > > which
> > > is something i really don't like.
> > > What i did:
> > > I put one node (ha-idg-1) in standby. The resources on it (most
> > > of
> > > all virtual domains) were migrated to ha-idg-2,
> > > except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was
> > > missing the xml of the domain points to.
> > > Then the cluster tries to start vm_nextcloud on ha-idg-2 which of
> > > course also failed.
> > > Then ha-idg-1 was fenced.
> > > I did a "crm history" over the respective time period, you find
> > > it
> > > here:
> > > https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
> > > 
> > > Here, from my point of view, the most interesting from the logs:
> > > ha-idg-1:
> > > Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> > > cib_perform_op:  Diff: --- 2.16196.19 2
> > > Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> > > cib_perform_op:  Diff: +++ 2.16197.0
> > > bc9a558dfbe6d7196653ce56ad1ee758
> > > Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> > > cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
> > > Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> > > cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482
> > > ']/i
> > > nstance_attributes[@id='nodes-108
> > > 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
> > > ha-idg-1 set to standby
> > > 
> > > Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice:
> > > process_lrm_event:   ha-idg-1-vm_nextcloud_migrate_to_0:3169
> > > [
> > > error: Cannot access storage file
> > > '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubu
> > > ntu/
> > > ubuntu-18.04.4-live-server-amd64.iso': No such file or
> > > directory\nocf-exit-reason:vm_nextcloud: live migration to ha-
> > > idg-2
> > > failed: 1\n ]
> > > migration failed
> > > 
> > > Jul 20 17:04:01 [23767] ha-idg-1pengine:error:
> > > native_create_actions:   Resource vm_nextcloud is active on 2
> > > nodes
> > > (attempting recovery)
> > > ???
> > 
> > This is standard for a failed live migration -- the cluster doesn't
> > know how far the migration actually got before failing, so it has
> > to
> > assume the VM could be active on either node. (The log message
> > would
> > make more sense saying "might be active" rather than "is active".)
> > 
> > > Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice:
> > > LogAction:*
> > > Recover    vm_nextcloud   ( ha-idg-2 )
> > 
> > The recovery from that situation is a full stop on both nodes, and
> > start on one of them.
> > 
> > > Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
> > > te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on
> > > ha-
> > > idg-2 | action 106
> > > Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
> > > te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
> > > locally on ha-idg-1 | action 2
> > > 
> > > Jul 20 17:04:01 [23768] ha-idg-1   crmd: info:
> > > match_graph_event:   Action vm_nextcloud_stop_0 (106)
> > > confirmed
> > > on ha-idg-2 (rc=0)
> > > 
> > > Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
> > > process_lrm_event:   Result of stop operation for
> > > vm_nextcloud on
> > > ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0
> > > confirmed=true
> > > cib-update=5960
> > 
> > It looks like both stops succeeded.
> > 
> > > Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
> > > crm_signal_dispatch: Caught 'Terminated' signal | 15
> > > (invoking
> > > handler)
> > > systemctl stop pacemaker.service
> > > 
> > > 
> > > ha-idg-2:
> > > Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice:
> > > process_lrm_event:   Result of stop operation for
> > > vm_nextcloud on
> > > ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0
> > > confirmed=true
> > > cib-update=57
> > > the log from ha-idg-2 is two seconds ahead of ha-idg-1
> > > 
> > > Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice:
> > > log_execute: executing - rsc:vm_nextcloud action:start
> > > call_id:192
> > > Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> > > operation_finished:  vm_nextcloud_start_0:29107:stderr [
> > > error:
> > > Failed to create domain from /mnt/share/vm_nextcloud.xml ]
> > > Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> > > operation_finished:  vm_nextcloud_start_0:29107:stderr [
> > > error:
> > > Cannot access storage file
> > > '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubu
> > > ntu/
> > > ubuntu-18.04.4-live-server-amd64.iso': No such file or directory
> > > ]
> > > Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> > > operation_finished:  

Re: [ClusterLabs] why is node fenced ?

2020-08-10 Thread Lentes, Bernd


- On 29 Jul 2020 at 18:53, kgaillot kgail...@redhat.com wrote:

> On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
>> Hi,
>> 
>> a few days ago one of my nodes was fenced and i don't know why, which
>> is something i really don't like.
>> What i did:
>> I put one node (ha-idg-1) in standby. The resources on it (most of
>> all virtual domains) were migrated to ha-idg-2,
>> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was
>> missing the xml of the domain points to.
>> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of
>> course also failed.
>> Then ha-idg-1 was fenced.
>> I did a "crm history" over the respective time period, you find it
>> here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
>> 
>> Here, from my point of view, the most interesting from the logs:
>> ha-idg-1:
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  Diff: --- 2.16196.19 2
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
>> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
>> cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/i
>> nstance_attributes[@id='nodes-108
>> 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
>> ha-idg-1 set to standby
>> 
>> Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   ha-idg-1-vm_nextcloud_migrate_to_0:3169 [
>> error: Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or
>> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2
>> failed: 1\n ]
>> migration failed
>> 
>> Jul 20 17:04:01 [23767] ha-idg-1pengine:error:
>> native_create_actions:   Resource vm_nextcloud is active on 2 nodes
>> (attempting recovery)
>> ???
> 
> This is standard for a failed live migration -- the cluster doesn't
> know how far the migration actually got before failing, so it has to
> assume the VM could be active on either node. (The log message would
> make more sense saying "might be active" rather than "is active".)
> 
>> Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice:
>> LogAction:*
>> Recover    vm_nextcloud   ( ha-idg-2 )
> 
> The recovery from that situation is a full stop on both nodes, and
> start on one of them.
> 
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-
>> idg-2 | action 106
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
>> locally on ha-idg-1 | action 2
>> 
>> Jul 20 17:04:01 [23768] ha-idg-1   crmd: info:
>> match_graph_event:   Action vm_nextcloud_stop_0 (106) confirmed
>> on ha-idg-2 (rc=0)
>> 
>> Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960
> 
> It looks like both stops succeeded.
> 
>> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
>> crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking
>> handler)
>> systemctl stop pacemaker.service
>> 
>> 
>> ha-idg-2:
>> Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=57
>> the log from ha-idg-2 is two seconds ahead of ha-idg-1
>> 
>> Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice:
>> log_execute: executing - rsc:vm_nextcloud action:start
>> call_id:192
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
>> Failed to create domain from /mnt/share/vm_nextcloud.xml ]
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
>> Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> operation_finished:  vm_nextcloud_start_0:29107:stderr [ ocf-
>> exit-reason:Failed to start virtual domain vm_nextcloud. ]
>> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
>> log_finished:finished - rsc:vm_nextcloud action:start call_id:192
>> pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
>> start on ha-idg-2 failed
> 
> The start failed ...
> 
>> Jul 20 17:05:32 [10691] ha-idg-2   crmd: info:
>> do_dc_takeover:  Taking over DC status for this partition
>> ha-idg-1 stopped pacemaker
> 
> Since the ha-idg-2 is now shutting down, ha-idg-1 becomes DC.

Re: [ClusterLabs] why is node fenced ?

2020-08-10 Thread Lentes, Bernd

- On 29 Jul 2020 at 18:53, kgaillot kgail...@redhat.com wrote:

 
> Since the ha-idg-2 is now shutting down, ha-idg-1 becomes DC.

The other way round.

>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud
>> on ha-idg-1: unknown error | rc=1
>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed start of vm_nextcloud on
>> ha-idg-2: unknown error | rc
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2pengine: info:
>> native_color:Resource vm_nextcloud cannot run anywhere
>> logical
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (pending)
>> ???
> 
> So this appears to be the problem. From these logs I would guess the
> successful stop on ha-idg-1 did not get written to the CIB for some
> reason. I'd look at the pe input from this transition on ha-idg-2 to
> confirm that.
> 
> Without the DC knowing about the stop, it tries to schedule a new one,
> but the node is shutting down so it can't do it, which means it has to
> be fenced.
> 
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (offline)
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> pe_fence_node:   Cluster node ha-idg-1 will be fenced: resource
>> actions are unrunnable
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning:
>> stage6:  Scheduling Node ha-idg-1 for STONITH
>> Jul 20 17:05:35 [10690] ha-idg-2pengine: info:
>> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-
>> idg-1 is fenced
>> Jul 20 17:05:35 [10690] ha-idg-2pengine:   notice:
>> LogNodeActions:   * Fence (Off) ha-idg-1 'resource actions are
>> unrunnable'
>> 
>> 
>> Why does it say "Jul 20 17:05:35 [10690] ha-idg-
>> 2pengine:  warning: custom_action:   Action vm_nextcloud_stop_0
>> on ha-idg-1 is unrunnable (offline)" although
>> "Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
>> process_lrm_event:   Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960"
>> says that stop was ok ?

Bernd
Helmholtz Zentrum München





Re: [ClusterLabs] why is node fenced ?

2020-07-29 Thread Ken Gaillot
On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
> Hi,
> 
> a few days ago one of my nodes was fenced and i don't know why, which
> is something i really don't like.
> What i did:
> I put one node (ha-idg-1) in standby. The resources on it (most of
> all virtual domains) were migrated to ha-idg-2,
> except one domain (vm_nextcloud). On ha-idg-2 a mountpoint was
> missing the xml of the domain points to.
> Then the cluster tries to start vm_nextcloud on ha-idg-2 which of
> course also failed.
> Then ha-idg-1 was fenced.
> I did a "crm history" over the respective time period, you find it
> here:
> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
> 
> Here, from my point of view, the most interesting from the logs:
> ha-idg-1:
> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> cib_perform_op:  Diff: --- 2.16196.19 2
> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
> Jul 20 16:59:33 [23763] ha-idg-1cib: info:
> cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/i
> nstance_attributes[@id='nodes-108
> 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
> ha-idg-1 set to standby
> 
> Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice:
> process_lrm_event:   ha-idg-1-vm_nextcloud_migrate_to_0:3169 [
> error: Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
> ubuntu-18.04.4-live-server-amd64.iso': No such file or
> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2
> failed: 1\n ]
> migration failed
> 
> Jul 20 17:04:01 [23767] ha-idg-1pengine:error:
> native_create_actions:   Resource vm_nextcloud is active on 2 nodes
> (attempting recovery)
> ???

This is standard for a failed live migration -- the cluster doesn't
know how far the migration actually got before failing, so it has to
assume the VM could be active on either node. (The log message would
make more sense saying "might be active" rather than "is active".)

> Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice:
> LogAction:*
> Recover    vm_nextcloud   ( ha-idg-2 )

The recovery from that situation is a full stop on both nodes, and
start on one of them.
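For illustration only, a minimal Python sketch of that recovery sequence as
described here -- this is not Pacemaker code, and the action tuples and names
are invented:

# Hypothetical sketch: after a failed live migration the cluster cannot tell
# which node still holds the VM, so it stops the resource on both nodes and
# then starts it on whichever node placement chooses.
def recover_failed_migration(resource, source, target, chosen):
    actions = [("stop", resource, node) for node in (source, target)]
    actions.append(("start", resource, chosen))
    return actions

# recover_failed_migration("vm_nextcloud", "ha-idg-1", "ha-idg-2", "ha-idg-2")
# mirrors the sequence in the logs: stop on ha-idg-2, stop on ha-idg-1, then
# the (ultimately failed) start on ha-idg-2.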

> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-
> idg-2 | action 106
> Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice:
> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
> locally on ha-idg-1 | action 2
> 
> Jul 20 17:04:01 [23768] ha-idg-1   crmd: info:
> match_graph_event:   Action vm_nextcloud_stop_0 (106) confirmed
> on ha-idg-2 (rc=0)
> 
> Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice:
> process_lrm_event:   Result of stop operation for vm_nextcloud on
> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
> cib-update=5960

It looks like both stops succeeded.

> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
> crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking
> handler)
> systemctl stop pacemaker.service
> 
> 
> ha-idg-2:
> Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice:
> process_lrm_event:   Result of stop operation for vm_nextcloud on
> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true
> cib-update=57
> the log from ha-idg-2 is two seconds ahead of ha-idg-1
> 
> Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice:
> log_execute: executing - rsc:vm_nextcloud action:start
> call_id:192
> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
> Failed to create domain from /mnt/share/vm_nextcloud.xml ]
> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> operation_finished:  vm_nextcloud_start_0:29107:stderr [ error:
> Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> operation_finished:  vm_nextcloud_start_0:29107:stderr [ ocf-
> exit-reason:Failed to start virtual domain vm_nextcloud. ]
> Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice:
> log_finished:finished - rsc:vm_nextcloud action:start call_id:192
> pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
> start on ha-idg-2 failed

The start failed ...

> Jul 20 17:05:32 [10691] ha-idg-2   crmd: info:
> do_dc_takeover:  Taking over DC status for this partition
> ha-idg-1 stopped pacemaker

Since the ha-idg-2 is now shutting down, ha-idg-1 becomes DC.

> Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning:
> unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud
> on ha-idg-1: unknown 

Re: [ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd


- On 29 Jul 2020 at 17:26, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:

Hi,

Sorry, I missed:
OS: SLES 12 SP4
kernel: 4.12.14-95.32
pacemaker: pacemaker-1.1.19+20181105.ccd6b5b10-3.13.1.x86_64


Bernd
Helmholtz Zentrum München





[ClusterLabs] why is node fenced ?

2020-07-29 Thread Lentes, Bernd
Hi,

a few days ago one of my nodes was fenced and I don't know why, which is 
something I really don't like.
What I did:
I put one node (ha-idg-1) in standby. The resources on it (most of them virtual 
domains) were migrated to ha-idg-2,
except one domain (vm_nextcloud). On ha-idg-2 a mountpoint that the XML of the 
domain points to was missing.
Then the cluster tried to start vm_nextcloud on ha-idg-2, which of course also 
failed.
Then ha-idg-1 was fenced.

I did a "crm history" over the respective time period, you find it here:
https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF

Here, from my point of view, the most interesting from the logs:
ha-idg-1:
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  Diff: 
--- 2.16196.19 2
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  Diff: 
+++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  +  
/cib:  @epoch=16197, @num_updates=0
Jul 20 16:59:33 [23763] ha-idg-1cib: info: cib_perform_op:  +  
/cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-108
4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
ha-idg-1 set to standby

Jul 20 16:59:34 [23768] ha-idg-1   crmd:   notice: process_lrm_event:   
ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file 
'/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso':
 No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to 
ha-idg-2 failed: 1\n ]
migration failed

Jul 20 17:04:01 [23767] ha-idg-1pengine:error: native_create_actions:   
Resource vm_nextcloud is active on 2 nodes (attempting recovery)
???

Jul 20 17:04:01 [23767] ha-idg-1pengine:   notice: LogAction:* 
Recover    vm_nextcloud   ( ha-idg-2 )

Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice: te_rsc_command:  
Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
Jul 20 17:04:01 [23768] ha-idg-1   crmd:   notice: te_rsc_command:  
Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2

Jul 20 17:04:01 [23768] ha-idg-1   crmd: info: match_graph_event:   
Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)

Jul 20 17:04:06 [23768] ha-idg-1   crmd:   notice: process_lrm_event:   
Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 
key=vm_nextcloud_stop_0 confirmed=true cib-update=5960

Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice: crm_signal_dispatch: 
Caught 'Terminated' signal | 15 (invoking handler)
systemctl stop pacemaker.service


ha-idg-2:
Jul 20 17:04:03 [10691] ha-idg-2   crmd:   notice: process_lrm_event:   
Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 
key=vm_nextcloud_stop_0 confirmed=true cib-update=57
the log from ha-idg-2 is two seconds ahead of ha-idg-1

Jul 20 17:04:08 [10688] ha-idg-2   lrmd:   notice: log_execute: 
executing - rsc:vm_nextcloud action:start call_id:192
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from 
/mnt/share/vm_nextcloud.xml ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file 
'/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso':
 No such file or directory ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: operation_finished:  
vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual 
domain vm_nextcloud. ]
Jul 20 17:04:09 [10688] ha-idg-2   lrmd:   notice: log_finished:
finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 
exec-time:581ms queue-time:0ms
start on ha-idg-2 failed

Jul 20 17:05:32 [10691] ha-idg-2   crmd: info: do_dc_takeover:  Taking 
over DC status for this partition
ha-idg-1 stopped pacemaker

Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: unpack_rsc_op_failure:   
Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: unpack_rsc_op_failure:   
Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc

Jul 20 17:05:33 [10690] ha-idg-2pengine: info: native_color:
Resource vm_nextcloud cannot run anywhere
logical

Jul 20 17:05:33 [10690] ha-idg-2pengine:  warning: custom_action:   Action 
vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
???

Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning: custom_action:   Action 
vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
Jul 20 17:05:35 [10690] ha-idg-2pengine:  warning: pe_fence_node:   Cluster 
node ha-idg-1 will be fenced: resource actions are 

Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Ken Gaillot
On Thu, 2019-10-10 at 17:22 +0200, Lentes, Bernd wrote:
> Hi,
> 
> I have a two-node cluster running on SLES 12 SP4.
> I did some testing on it.
> I put one into standby (ha-idg-2); the other (ha-idg-1) got fenced a
> few minutes later because I made a mistake.
> ha-idg-2 was DC. ha-idg-1 made a fresh boot and I started
> corosync/pacemaker on it.
> It seems ha-idg-1 didn't find the DC after starting the cluster and a
> few seconds later elected itself DC,
> afterwards fencing ha-idg-2.

For some reason, corosync on the two nodes was not able to communicate
with each other.

This type of situation is why corosync's two_node option normally
includes wait_for_all.
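For reference, a sketch of how that typically looks in corosync.conf for a
two-node cluster -- illustrative values, not Bernd's actual configuration:

quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implicitly enables wait_for_all, so after a reboot a node
    # waits until it has seen its peer at least once before it may claim
    # quorum and fence the other side.
    # wait_for_all: 1
}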

> 
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network
> interface [192.168.100.10] is now up.
> 
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped
> (2ms)
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_state_transition:  State transition S_PENDING -> S_ELECTION |
> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> election_check:   election-DC won by local node
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input
> I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice:
> do_state_transition:  State transition S_ELECTION ->
> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL
> origin=election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_te_control:Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-
> 71bd17047f82
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> set_graph_functions:  Setting custom graph functions
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info:
> do_dc_takeover:   Taking over DC status for this partition
> 
> Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning:
> stage6:   Scheduling Node ha-idg-2 for STONITH
> Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice:
> LogNodeActions:* Fence (Off) ha-idg-2 'node is unclean'
> 
> Is my understanding correct ?

Yes

> In the log of ha-idg-2 i don't find anything for this period:
> 
> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info:
> cib_device_update:   Device fence_ilo_ha-idg-2 has been disabled
> on ha-idg-2: score=-1
> Oct 09 17:58:51 [12503] ha-idg-2cib: info:
> cib_process_ping:Reporting our current digest to ha-idg-2:
> 59c4cfb14defeafbeb3417e42cd9 for 2.9506.36 (0x242b110 0)
> 
> Oct 09 18:00:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0001 (was )
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info:
> throttle_check_thresholds:   Moderate CPU load detected:
> 32.220001
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0010 (was 0001)
> Oct 09 18:01:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode: 0001 (was 0010)
> Oct 09 18:02:42 [12508] ha-idg-2   crmd: info:
> throttle_send_command:   New throttle mode:  (was 0001)
> 
> ha-idg-2 is fenced and after a reboot I started corosync/pacemaker on
> it again:
> 
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
> 
> What is the meaning of the lines with the throttle ?

Those messages could definitely be improved. The particular mode values
indicate no significant CPU load (0000), low load (0001), medium
(0010), high (0100), or extreme (1000).
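As a purely hypothetical illustration (not taken from the Pacemaker source),
the mapping behind those log lines could be written as:

# Throttle modes as listed above, keyed by the bit pattern that shows up in
# the "New throttle mode: ... (was ...)" log messages.
THROTTLE_MODES = {
    0b0000: "no significant CPU load",
    0b0001: "low load",
    0b0010: "medium load",
    0b0100: "high load",
    0b1000: "extreme load",
}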

I wouldn't expect a CPU spike to lock up corosync for very long, but it
could be related somehow.

> 
> Thanks.
> 
> 
> Bernd
-- 
Ken Gaillot 


Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Andrei Borzenkov
On 10.10.2019 18:22, Lentes, Bernd wrote:
> Hi,
> 
> I have a two-node cluster running on SLES 12 SP4.
> I did some testing on it.
> I put one into standby (ha-idg-2); the other (ha-idg-1) got fenced a few 
> minutes later because I made a mistake.
> ha-idg-2 was DC. ha-idg-1 made a fresh boot and I started corosync/pacemaker 
> on it.
> It seems ha-idg-1 didn't find the DC after starting the cluster

Which likely was the reason for fencing in the first place.

> and a few seconds later elected itself DC, 
> afterwards fencing ha-idg-2.
> 
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster 
> Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync built-in 
> features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
> transport (UDP/IP Multicast).
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network 
> interface [192.168.100.10] is now up.
> 
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: crm_timer_popped: 
> Election Trigger (I_DC_TIMEOUT) just popped (2ms)
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input 
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_state_transition:
>   State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT 
> cause=C_TIMER_POPPED origin=crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: election_check:   
> election-DC won by local node
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input 
> I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice: do_state_transition:
>   State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC 
> cause=C_FSA_INTERNAL origin=election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_te_control:
> Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: set_graph_functions:
>   Setting custom graph functions
> Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_dc_takeover:   
> Taking over DC status for this partition
> 
> Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning: stage6:   Scheduling 
> Node ha-idg-2 for STONITH
> Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice: LogNodeActions:* 
> Fence (Off) ha-idg-2 'node is unclean'
> 
> Is my understanding correct ?
> 
> 
> In the log of ha-idg-2 i don't find anything for this period:
> 
> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update: 
>   Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1
> Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping:  
>   Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 
> for 2.9506.36 (0x242b110 0)
> 
> Oct 09 18:00:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0001 (was )
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: 
> throttle_check_thresholds:   Moderate CPU load detected: 32.220001
> Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0010 (was 0001)
> Oct 09 18:01:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode: 0001 (was 0010)
> Oct 09 18:02:42 [12508] ha-idg-2   crmd: info: throttle_send_command: 
>   New throttle mode:  (was 0001)
> 
> ha-idg-2 is fenced and after a reboot i started corosync/pacmeaker on it 
> again:
> 
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync Cluster 
> Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync built-in 
> features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
> transport (UDP/IP Multicast).
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: aes256 hash: sha1
> 
> What is the meaning of the lines with the throttle ?
> 
> Thanks.
> 
> 
> Bernd
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Why is node fenced ?

2019-10-10 Thread Lentes, Bernd
Hi,

I have a two-node cluster running on SLES 12 SP4.
I did some testing on it.
I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a few 
minutes later because I made a mistake.
ha-idg-2 was DC. ha-idg-1 made a fresh boot and I started corosync/pacemaker on 
it.
It seems ha-idg-1 didn't find the DC after starting the cluster, a few seconds 
later elected itself DC, 
and afterwards fenced ha-idg-2.

Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster 
Engine ('2.3.6'): started and ready to provide service.
Oct 09 18:04:43 [9550] ha-idg-1 corosync info[MAIN  ] Corosync built-in 
features: debug testagents augeas systemd pie relro bindnow
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 09 18:04:43 [9550] ha-idg-1 corosync notice  [TOTEM ] The network interface 
[192.168.100.10] is now up.

Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: crm_timer_popped: 
Election Trigger (I_DC_TIMEOUT) just popped (2ms)
Oct 09 18:05:06 [9565] ha-idg-1   crmd:  warning: do_log:   Input 
I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_state_transition:  
State transition S_PENDING -> S_ELECTION | input=I_DC_TIMEOUT 
cause=C_TIMER_POPPED origin=crm_timer_popped
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: election_check:   
election-DC won by local node
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_log:   Input 
I_ELECTION_DC received in state S_ELECTION from election_win_cb
Oct 09 18:05:06 [9565] ha-idg-1   crmd:   notice: do_state_transition:  
State transition S_ELECTION -> S_INTEGRATION | input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=election_win_cb
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_te_control:
Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-71bd17047f82
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: set_graph_functions:  
Setting custom graph functions
Oct 09 18:05:06 [9565] ha-idg-1   crmd: info: do_dc_takeover:   Taking 
over DC status for this partition

Oct 09 18:05:07 [9564] ha-idg-1pengine:  warning: stage6:   Scheduling Node 
ha-idg-2 for STONITH
Oct 09 18:05:07 [9564] ha-idg-1pengine:   notice: LogNodeActions:* 
Fence (Off) ha-idg-2 'node is unclean'

Is my understanding correct ?


In the log of ha-idg-2 I don't find anything for this period:

Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info: cib_device_update:   
Device fence_ilo_ha-idg-2 has been disabled on ha-idg-2: score=-1
Oct 09 17:58:51 [12503] ha-idg-2cib: info: cib_process_ping:
Reporting our current digest to ha-idg-2: 59c4cfb14defeafbeb3417e42cd9 for 
2.9506.36 (0x242b110 0)

Oct 09 18:00:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0001 (was 0000)
Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: 
throttle_check_thresholds:   Moderate CPU load detected: 32.220001
Oct 09 18:01:12 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0010 (was 0001)
Oct 09 18:01:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0001 (was 0010)
Oct 09 18:02:42 [12508] ha-idg-2   crmd: info: throttle_send_command:   
New throttle mode: 0000 (was 0001)

ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on it again:

Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [MAIN  ] Corosync Cluster 
Engine ('2.3.6'): started and ready to provide service.
Oct 09 18:29:05 [11795] ha-idg-2 corosync info[MAIN  ] Corosync built-in 
features: debug testagents augeas systemd pie relro bindnow
Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
transport (UDP/IP Multicast).
Oct 09 18:29:05 [11795] ha-idg-2 corosync notice  [TOTEM ] Initializing 
transmit/receive security (NSS) crypto: aes256 hash: sha1

What is the meaning of the lines with the throttle ?

Thanks.


Bernd

-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Perfekt ist wer keine Fehler macht 
Also sind Tote perfekt
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] why is node fenced ?

2019-08-15 Thread Lentes, Bernd



- Am 14. Aug 2019 um 19:07 schrieb kgaillot kgail...@redhat.com:


>> That's my setting:
>> 
>> expected_votes: 2
>>   two_node: 1
>>   wait_for_all: 0
>> 
>> no-quorum-policy=ignore
>> 
>> I did that because i want be able to start the cluster although one
>> node has e.g. a hardware problem.
>> Is that ok ?
> 
> Well that's why you're seeing what you're seeing, which is also why
> wait_for_all was created :)
> 
> You definitely don't need no-quorum-policy=ignore in any case. With
> two_node, corosync will continue to provide quorum to pacemaker when
> one node goes away, so from pacemaker's view no-quorum-policy never
> kicks in.
> 
> With wait_for_all enabled, the newly joining node wouldn't get quorum
> initially, so it wouldn't fence the other node. So that's the trade-
> off, preventing this situation vs being able to start one node alone
> intentionally. Personally, I'd leave wait_for_all on normally, and
> manually change it to 0 whenever I was intentionally taking one node
> down for an extended time.

That sounds like a good idea, I will think about it.

> Of course all of that is just recovery, and doesn't explain why the
> nodes can't see each other to begin with.
> 

Yes, I don't have any idea. The bonds, each with two NICs for the corosync 
rings, are connected directly from host to host,
no switch in between, just a wire. I can't believe that two wires broke at the 
same moment.
And the bonds are monitored via SNMP, so I'm immediately informed when they 
have trouble.
I didn't get any e-mail.
Maybe heavy load at that time? I have atop running, logging every second, so I 
will have a look in the respective logs.
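
Something along these lines should do it, assuming atop keeps its daily
recordings under /var/log/atop/ (the usual default):

    # replay the atop recording of that day, starting shortly before the fence
    atop -r /var/log/atop/atop_20190809 -b 17:40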

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-14 Thread Ken Gaillot
On Wed, 2019-08-14 at 11:57 +0200, Lentes, Bernd wrote:
> 
> - On Aug 13, 2019, at 1:19 AM, kgaillot kgail...@redhat.com
> wrote:
> 
> 
> > 
> > The key messages are:
> > 
> > Aug 09 17:43:27 [6326] ha-idg-1   crmd: info:
> > crm_timer_popped: Election
> > Trigger (I_DC_TIMEOUT) just popped (2ms)
> > Aug 09 17:43:27 [6326] ha-idg-1   crmd:  warning:
> > do_log:   Input
> > I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> > 
> > That indicates the newly rebooted node didn't hear from the other
> > node
> > within 20s, and so assumed it was dead.
> > 
> > The new node had quorum, but never saw the other node's corosync,
> > so
> > I'm guessing you have two_node and/or wait_for_all disabled in
> > corosync.conf, and/or you have no-quorum-policy=ignore in
> > pacemaker.
> > 
> > I'd recommend two_node: 1 in corosync.conf, with no explicit
> > wait_for_all or no-quorum-policy setting. That would ensure a
> > rebooted/restarted node doesn't get initial quorum until it has
> > seen
> > the other node.
> 
> That's my setting:
> 
> expected_votes: 2
>   two_node: 1
>   wait_for_all: 0
> 
> no-quorum-policy=ignore
> 
> I did that because i want be able to start the cluster although one
> node has e.g. a hardware problem.
> Is that ok ?

Well that's why you're seeing what you're seeing, which is also why
wait_for_all was created :)

You definitely don't need no-quorum-policy=ignore in any case. With
two_node, corosync will continue to provide quorum to pacemaker when
one node goes away, so from pacemaker's view no-quorum-policy never
kicks in.

With wait_for_all enabled, the newly joining node wouldn't get quorum
initially, so it wouldn't fence the other node. So that's the trade-
off, preventing this situation vs being able to start one node alone
intentionally. Personally, I'd leave wait_for_all on normally, and
manually change it to 0 whenever I was intentionally taking one node
down for an extended time.
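
For reference, a minimal quorum section along those lines might look like
this (just a sketch, adjust to your existing corosync.conf):

    quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        # wait_for_all defaults to 1 when two_node is set, so leaving it
        # out means a freshly rebooted node waits until it has seen its
        # peer before it gets quorum
    }

and on the Pacemaker side simply drop the explicit no-quorum-policy=ignore.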

Of course all of that is just recovery, and doesn't explain why the
nodes can't see each other to begin with.

> 
> 
> Bernd
>  
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep,
> Heinrich Bassler, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
> 
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-14 Thread Lentes, Bernd



- On Aug 13, 2019, at 1:19 AM, kgaillot kgail...@redhat.com wrote:


> 
> The key messages are:
> 
> Aug 09 17:43:27 [6326] ha-idg-1   crmd: info: crm_timer_popped: 
> Election
> Trigger (I_DC_TIMEOUT) just popped (2ms)
> Aug 09 17:43:27 [6326] ha-idg-1   crmd:  warning: do_log:   Input
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> 
> That indicates the newly rebooted node didn't hear from the other node
> within 20s, and so assumed it was dead.
> 
> The new node had quorum, but never saw the other node's corosync, so
> I'm guessing you have two_node and/or wait_for_all disabled in
> corosync.conf, and/or you have no-quorum-policy=ignore in pacemaker.
> 
> I'd recommend two_node: 1 in corosync.conf, with no explicit
> wait_for_all or no-quorum-policy setting. That would ensure a
> rebooted/restarted node doesn't get initial quorum until it has seen
> the other node.

That's my setting:

expected_votes: 2
  two_node: 1
  wait_for_all: 0

no-quorum-policy=ignore

I did that because I want to be able to start the cluster even if one node has 
e.g. a hardware problem.
Is that ok?


Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Lentes, Bernd



- On Aug 13, 2019, at 3:34 PM, Matthias Ferdinand m...@14v.de wrote:
>> 17:26:35  crm node standby ha-idg1-
> 
> if that is not a copy error (ha-idg1- vs. ha-idg-1), then ha-idg-1
> was not set to standby, and installing updates may have done some
> meddling with corosync/pacemaker (like stopping corosync without
> stopping pacemaker) while having active resources.
> 

It's a typo. There were no updates for corosync or pacemaker.


Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Matthias Ferdinand
On Mon, Aug 12, 2019 at 04:09:48PM -0400, users-requ...@clusterlabs.org wrote:
> Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST)
> From: "Lentes, Bernd" 
> To: Pacemaker ML 
> Subject: [ClusterLabs] why is node fenced ?
> Message-ID:
>   <546330844.1686419.1565626164456.javamail.zim...@helmholtz-muenchen.de>
>   
...
> 17:26:35  crm node standby ha-idg1-

if that is not a copy error (ha-idg1- vs. ha-idg-1), then ha-idg-1
was not set to standby, and installing updates may have done some
meddling with corosync/pacemaker (like stopping corosync without
stopping pacemaker) while having active resources.
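
One thing that can be checked afterwards on SLES (only a suggestion; it
shows whether daemons kept running on replaced files, not what the updater
actually did) is:

    zypper ps -s                           # processes still using deleted files
    systemctl status corosync pacemaker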

Matthias
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Lentes, Bernd
- On Aug 12, 2019, at 7:47 PM, Chris Walker cwal...@cray.com wrote:

> When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for
> example,
> 
> Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: 
> pcmk_quorum_notification:
> Quorum retained | membership=1320 members=1
> 
> after ~20s (dc-deadtime parameter), ha-idg-2 is marked 'unclean' and STONITHed
> as part of startup fencing.
> 
> There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw
> ha-idg-1 either, so it appears that there was no communication at all between
> the two nodes.
> 
> I'm not sure exactly why the nodes did not see one another, but there are
> indications of network issues around this time
> 
> 2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now
> running without any active interface!
> 
> so perhaps that's related.

This is the initialization of bond1 during the boot of ha-idg-1.
Three seconds later bond1 is fine:

2019-08-09T17:42:19.299886+02:00 ha-idg-2 kernel: [ 1232.117470] tg3 
:03:04.0 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.299908+02:00 ha-idg-2 kernel: [ 1232.117482] tg3 
:03:04.0 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.315756+02:00 ha-idg-2 kernel: [ 1232.131565] tg3 
:03:04.1 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.315767+02:00 ha-idg-2 kernel: [ 1232.131568] tg3 
:03:04.1 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.351781+02:00 ha-idg-2 kernel: [ 1232.169386] bond1: link 
status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.351792+02:00 ha-idg-2 kernel: [ 1232.169390] bond1: making 
interface eth2 the new active one
2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: [ 1232.169473] bond1: first 
active interface up!
2019-08-09T17:42:19.352532+02:00 ha-idg-2 kernel: [ 1232.169480] bond1: link 
status definitely up for interface eth3, 1000 Mbps full duplex

also on ha-idg-1:

2019-08-09T17:42:19.168035+02:00 ha-idg-1 kernel: [  110.164250] tg3 
:02:00.3 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.168050+02:00 ha-idg-1 kernel: [  110.164252] tg3 
:02:00.3 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.168052+02:00 ha-idg-1 kernel: [  110.164254] tg3 
:02:00.3 eth3: EEE is disabled
2019-08-09T17:42:19.172020+02:00 ha-idg-1 kernel: [  110.171378] tg3 
:02:00.2 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.172028+02:00 ha-idg-1 kernel: [  110.171380] tg3 
:02:00.2 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.172029+02:00 ha-idg-1 kernel: [  110.171382] tg3 
:02:00.2 eth2: EEE is disabled
 ...
2019-08-09T17:42:19.244066+02:00 ha-idg-1 kernel: [  110.240310] bond1: link 
status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.244083+02:00 ha-idg-1 kernel: [  110.240311] bond1: making 
interface eth2 the new active one
2019-08-09T17:42:19.244085+02:00 ha-idg-1 kernel: [  110.240353] bond1: first 
active interface up!
2019-08-09T17:42:19.244087+02:00 ha-idg-1 kernel: [  110.240356] bond1: link 
status definitely up for interface eth3, 1000 Mbps full duplex

And the cluster was started afterwards on ha-idg-1 at 17:43:04. I don't find 
any further entries about problems with bond1, so I think it's not related.
Time is synchronized by NTP.
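
To double-check I'd also look at the current state directly, e.g.:

    cat /proc/net/bonding/bond1   # slave state and link failure counters
    ntpq -p                       # NTP peers and offsets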


Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-12 Thread Ken Gaillot
On Mon, 2019-08-12 at 18:09 +0200, Lentes, Bernd wrote:
> Hi,
> 
> last Friday (9th of August) i had to install patches on my two-node
> cluster.
> I put one of the nodes (ha-idg-2) into standby (crm node standby ha-
> idg-2), patched it, rebooted, 
> started the cluster (systemctl start pacemaker) again, put the node
> again online, everything fine.
> 
> Then i wanted to do the same procedure with the other node (ha-idg-
> 1).
> I put it in standby, patched it, rebooted, started pacemaker again.
> But then ha-idg-1 fenced ha-idg-2, it said the node is unclean.
> I know that nodes which are unclean need to be shutdown, that's
> logical.
> 
> But i don't know from where the conclusion comes that the node is
> unclean respectively why it is unclean,
> i searched in the logs and didn't find any hint.

The key messages are:

Aug 09 17:43:27 [6326] ha-idg-1   crmd: info: crm_timer_popped: 
Election Trigger (I_DC_TIMEOUT) just popped (2ms)
Aug 09 17:43:27 [6326] ha-idg-1   crmd:  warning: do_log:   Input 
I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped

That indicates the newly rebooted node didn't hear from the other node
within 20s, and so assumed it was dead.
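
(That 20s is the dc-deadtime cluster property; if the peer legitimately needs
longer to come up you can query or raise it, something like:

    crm_attribute --type crm_config --name dc-deadtime --query
    crm_attribute --type crm_config --name dc-deadtime --update 60s

though that only works around the symptom, not the communication problem.)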

The new node had quorum, but never saw the other node's corosync, so
I'm guessing you have two_node and/or wait_for_all disabled in
corosync.conf, and/or you have no-quorum-policy=ignore in pacemaker.

I'd recommend two_node: 1 in corosync.conf, with no explicit
wait_for_all or no-quorum-policy setting. That would ensure a
rebooted/restarted node doesn't get initial quorum until it has seen
the other node.

> I put the syslog and the pacemaker log on a seafile share, i'd be
> very thankful if you'll have a look.
> https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/
> 
> Here the cli history of the commands:
> 
> 17:03:04  crm node standby ha-idg-2
> 17:07:15  zypper up (install Updates on ha-idg-2)
> 17:17:30  systemctl reboot
> 17:25:21  systemctl start pacemaker.service
> 17:25:47  crm node online ha-idg-2
> 17:26:35  crm node standby ha-idg1-
> 17:30:21  zypper up (install Updates on ha-idg-1)
> 17:37:32  systemctl reboot
> 17:43:04  systemctl start pacemaker.service
> 17:44:00  ha-idg-1 is fenced
> 
> Thanks.
> 
> Bernd
> 
> OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1
> 
> 
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] why is node fenced ?

2019-08-12 Thread Chris Walker
When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for 
example,

Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: pcmk_quorum_notification: 
Quorum retained | membership=1320 members=1

after ~20s (dc-deadtime parameter), ha-idg-2 is marked 'unclean' and STONITHed 
as part of startup fencing.

There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw 
ha-idg-1 either, so it appears that there was no communication at all between 
the two nodes.

I'm not sure exactly why the nodes did not see one another, but there are 
indications of network issues around this time

2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now 
running without any active interface!

so perhaps that's related.
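
If it happens again, it might help to first check at the corosync level
whether the nodes can see each other at all, e.g. with the standard tools:

    corosync-cfgtool -s      # ring status as corosync sees it
    corosync-quorumtool -s   # current membership and quorum state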

HTH,
Chris


On 8/12/19, 12:09 PM, "Users on behalf of Lentes, Bernd" 
 
wrote:

Hi,

last Friday (9th of August) i had to install patches on my two-node cluster.
I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), 
patched it, rebooted, 
started the cluster (systemctl start pacemaker) again, put the node again 
online, everything fine.

Then i wanted to do the same procedure with the other node (ha-idg-1).
I put it in standby, patched it, rebooted, started pacemaker again.
But then ha-idg-1 fenced ha-idg-2, it said the node is unclean.
I know that nodes which are unclean need to be shutdown, that's logical.

But i don't know from where the conclusion comes that the node is unclean 
respectively why it is unclean,
i searched in the logs and didn't find any hint.

I put the syslog and the pacemaker log on a seafile share, i'd be very 
thankful if you'll have a look.
https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/

Here the cli history of the commands:

17:03:04  crm node standby ha-idg-2
17:07:15  zypper up (install Updates on ha-idg-2)
17:17:30  systemctl reboot
17:25:21  systemctl start pacemaker.service
17:25:47  crm node online ha-idg-2
17:26:35  crm node standby ha-idg1-
17:30:21  zypper up (install Updates on ha-idg-1)
17:37:32  systemctl reboot
17:43:04  systemctl start pacemaker.service
17:44:00  ha-idg-1 is fenced

Thanks.

Bernd

OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1


-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Perfekt ist wer keine Fehler macht 
Also sind Tote perfekt
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich 
Bassler, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] why is node fenced ?

2019-08-12 Thread Lentes, Bernd
Hi,

last Friday (9th of August) I had to install patches on my two-node cluster.
I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), 
patched it, rebooted, 
started the cluster (systemctl start pacemaker) again, put the node online 
again, everything fine.

Then I wanted to do the same procedure with the other node (ha-idg-1).
I put it in standby, patched it, rebooted, started pacemaker again.
But then ha-idg-1 fenced ha-idg-2; it said the node is unclean.
I know that nodes which are unclean need to be shut down, that's logical.

But I don't know where the conclusion that the node is unclean comes from, 
or why it is unclean;
I searched the logs and didn't find any hint.

I put the syslog and the pacemaker log on a Seafile share; I'd be very thankful 
if you'd have a look.
https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/

Here is the CLI history of the commands:

17:03:04  crm node standby ha-idg-2
17:07:15  zypper up (install Updates on ha-idg-2)
17:17:30  systemctl reboot
17:25:21  systemctl start pacemaker.service
17:25:47  crm node online ha-idg-2
17:26:35  crm node standby ha-idg1-
17:30:21  zypper up (install Updates on ha-idg-1)
17:37:32  systemctl reboot
17:43:04  systemctl start pacemaker.service
17:44:00  ha-idg-1 is fenced

Thanks.

Bernd

OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1


-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/idg 

Perfekt ist wer keine Fehler macht 
Also sind Tote perfekt
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] why is node fenced ?

2019-05-17 Thread Jan Pokorný
On 16/05/19 17:10 +0200, Lentes, Bernd wrote:
> my HA-Cluster with two nodes fenced one on 14th of may.
> ha-idg-1 has been the DC, ha-idg-2 was fenced.
> It happened around 11:30 am.
> The log from the fenced one isn't really informative:
> 
> [...]
> 
> Node restarts at 11:44 am.
> The DC is more informative:
> 
> =
> 2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
> 2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]: 
> (packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when 
> attempting to remove it
> 2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
> 2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]: 
> pam_unix(crond:session): session opened for user root by (uid=0)
> 2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988 
> of user root.
> 2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]: 
> pam_unix(crond:session): session closed for user root
> 2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3 
> :02:00.3 eth3: Link is down
> 2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1: 
> link status definitely down for interface eth3, disabling it
> 2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network 
> Adapter Link Down (Slot 0, Port 4)
> 2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3 
> :02:00.3 eth3: Link is up at 100 Mbps, full duplex
> 2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3 
> :02:00.3 eth3: Flow control is on for TX and on for RX
> 2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3 
> :02:00.3 eth3: EEE is disabled
> 2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1: 
> link status definitely up for interface eth3, 100 Mbps full duplex
> 2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network 
> Adapter Link Down (Slot 0, Port 4) has been repaired
> 2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A new 
> membership (192.168.100.10:1120) was formed. Members left: 1084777492
> 2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] Failed 
> to receive the leave message. failed: 1084777492
> 2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0 
> joined, 1 left, 1 members
> 2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]:   notice: Node 
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]:   notice: Node ha-idg-2 
> state is now lost
> 2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]:   notice: Removing all 
> ha-idg-2 attributes for peer loss
> 2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]:   notice: Purged 1 peer 
> with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]:   notice: Node 
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]:   notice: Purged 1 
> peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]:   notice: Purged 
> 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]:   [QUORUM] 
> Members[1]: 1084777482
> 2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]:   [MAIN  ] 
> Completed service synchronization, ready to provide service.
> 2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm: 
> closing connection to node 1084777492
> 2019-05-14T11:30:16.555995+02:00 ha-idg-1 dlm_controld[12279]: 1015492 fence 
> request 1084777492 pid 24568 nodedown time 1557826216 fence_all dlm_stonith
> 2019-05-14T11:30:16.626285+02:00 ha-idg-1 crmd[6969]:  warning: 
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.626534+02:00 ha-idg-1 dlm_stonith: stonith_api_time: 
> Found 1 entries for 1084777492/(null): 0 in progress, 1 completed
> 2019-05-14T11:30:16.626731+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Node 
> 1084777492/(null) last kicked at: 1556884018
> 2019-05-14T11:30:16.626875+02:00 ha-idg-1 stonith-ng[6965]:   notice: Client 
> stonith-api.24568.6a9fa406 wants to fence (reboot) '1084777492' with device 
> '(any)'
> 2019-05-14T11:30:16.627026+02:00 ha-idg-1 crmd[6969]:   notice: State 
> transition S_IDLE -> S_POLICY_ENGINE
> 2019-05-14T11:30:16.627165+02:00 ha-idg-1 crmd[6969]:   notice: Node ha-idg-2 
> state is now lost
> 2019-05-14T11:30:16.627302+02:00 ha-idg-1 crmd[6969]:  warning: 
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.627439+02:00 ha-idg-1 stonith-ng[6965]:   notice: 
> Requesting peer fencing (reboot) of ha-idg-2
> 2019-05-14T11:30:16.627578+02:00 ha-idg-1 pacemakerd[6963]:   notice: Node 
> 

[ClusterLabs] why is node fenced ?

2019-05-16 Thread Lentes, Bernd
Hi,

my two-node HA cluster fenced one node on the 14th of May.
ha-idg-1 was the DC, ha-idg-2 was fenced.
It happened around 11:30 am.
The log from the fenced node isn't really informative:

==
2019-05-14T11:22:09.948980+02:00 ha-idg-2 liblogging-stdlog: -- MARK --
2019-05-14T11:28:21.548898+02:00 ha-idg-2 sshd[14269]: Accepted 
keyboard-interactive/pam for root from 10.35.34.70 port 59449 ssh2
2019-05-14T11:28:21.550602+02:00 ha-idg-2 sshd[14269]: pam_unix(sshd:session): 
session opened for user root by (uid=0)
2019-05-14T11:28:21.554640+02:00 ha-idg-2 systemd-logind[2798]: New session 
15385 of user root.
2019-05-14T11:28:21.555067+02:00 ha-idg-2 systemd[1]: Started Session 15385 of 
user root.

2019-05-14T11:44:07.664785+02:00 ha-idg-2 systemd[1]: systemd 228 running in 
system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP 
+LIBCRYPTSETUP +GC   Neustart !!!
RYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
2019-05-14T11:44:07.664902+02:00 ha-idg-2 kernel: [0.00] Linux version 
4.12.14-95.13-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 
SMP Fri Mar
22 06:04:58 UTC 2019 (c01bf34)
2019-05-14T11:44:07.665492+02:00 ha-idg-2 systemd[1]: Detected architecture 
x86-64.
2019-05-14T11:44:07.665510+02:00 ha-idg-2 kernel: [0.00] Command line: 
BOOT_IMAGE=/boot/vmlinuz-4.12.14-95.13-default 
root=/dev/mapper/vg_local-lv_root resume=/
dev/disk/by-uuid/2849c504-2e45-4ec8-bbf8-724cf358ee25 splash=verbose showopts
2019-05-14T11:44:07.665510+02:00 ha-idg-2 systemd[1]: Set hostname to 
.
=

The node restarts at 11:44 am.
The DC's log is more informative:

=
2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]: 
(packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when 
attempting to remove it
2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]: pam_unix(crond:session): 
session opened for user root by (uid=0)
2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988 of 
user root.
2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]: pam_unix(crond:session): 
session closed for user root
2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3 
:02:00.3 eth3: Link is down
2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1: link 
status definitely down for interface eth3, disabling it
2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network 
Adapter Link Down (Slot 0, Port 4)
2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3 
:02:00.3 eth3: Link is up at 100 Mbps, full duplex
2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3 
:02:00.3 eth3: Flow control is on for TX and on for RX
2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3 
:02:00.3 eth3: EEE is disabled
2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1: link 
status definitely up for interface eth3, 100 Mbps full duplex
2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network Adapter 
Link Down (Slot 0, Port 4) has been repaired
2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A 
processor failed, forming new configuration.
2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A new 
membership (192.168.100.10:1120) was formed. Members left: 1084777492
2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] Failed to 
receive the leave message. failed: 1084777492
2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0 
joined, 1 left, 1 members
2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]:   notice: Node ha-idg-2 
state is now lost
2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]:   notice: Node ha-idg-2 
state is now lost
2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]:   notice: Removing all 
ha-idg-2 attributes for peer loss
2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]:   notice: Purged 1 peer 
with id=1084777492 and/or uname=ha-idg-2 from the membership cache
2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]:   notice: Node 
ha-idg-2 state is now lost
2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]:   notice: Purged 1 peer 
with id=1084777492 and/or uname=ha-idg-2 from the membership cache
2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]:   notice: Purged 1 
peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]:   [QUORUM] 
Members[1]: 1084777482
2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]:   [MAIN  ] Completed 
service synchronization, ready to provide service.
2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm: closing 
connection to node