Re: [ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Priyanka Balotra
) for controller set by do_state_transition:559
63835:Jul 17 14:16:55.092 FILE-2 pacemaker-controld  [15962]
(pcmk__set_flags_as)   debug: FSA action flags 0x0080
(A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
63836:Jul 17 14:16:55.092 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0200 (an_action)
for controller cleared by do_fsa_action:108
63837:Jul 17 14:16:55.092 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0020 (an_action)
for controller cleared by do_fsa_action:108
63838:Jul 17 14:16:55.092 FILE-2 pacemaker-controld  [15962]
(pcmk__clear_flags_as) debug: FSA action flags 0x0080 (an_action)
for controller cleared by do_fsa_action:108
63863:Jul 17 14:17:25.073 FILE-2 pacemaker-controld  [15962]
(throttle_cib_load)debug: cib load: 0.000667 (2 ticks in 30s)
63864:Jul 17 14:17:25.073 FILE-2 pacemaker-controld  [15962]
(throttle_mode)debug: Current load is 0.65 across 10 core(s)
63865:Jul 17 14:17:55.073 FILE-2 pacemaker-controld  [15962]
(throttle_cib_load)debug: cib load: 0.000333 (1 ticks in 30s)
63866:Jul 17 14:17:55.073 FILE-2 pacemaker-controld  [15962]
(throttle_mode)debug: Current load is 0.85 across 10 core(s)
63868:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958]
(process_remote_stonith_exec)  debug: Finalizing action 'reboot'
targeting FILE-2 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0
id=4e523b34
63869:Jul 17 14:18:20.085 FILE-2 pacemaker-fenced[15958]
(remote_op_done)   notice: Operation 'reboot' targeting FILE-2 by FILE-4
for pacemaker-controld.19415@FILE-6: OK | id=4e523b34
63872:Jul 17 14:18:20.089 FILE-2 pacemaker-controld  [15962]
(exec_alert_list)  info: Sending fencing alert via pf-ha-alert to (null)
63875:Jul 17 14:18:20.089 FILE-2 pacemaker-controld  [15962]
(tengine_stonith_notify)   crit: We were allegedly just fenced by FILE-4
for FILE-6!
63876:Jul 17 14:18:20.089 FILE-2 pacemaker-controld  [15962]
(crm_xml_cleanup)  info: Cleaning up memory from libxml2
63877:Jul 17 14:18:20.089 FILE-2 pacemaker-controld  [15962] (crm_exit)
info: Exiting pacemaker-controld | with status 100
63900:Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_child_exit)  warning: Shutting cluster down because
pacemaker-controld[15962] had fatal failure
63902:Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker) debug: pacemaker-controld confirmed stopped
63956:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958]
(process_remote_stonith_exec)  debug: Finalizing action 'reboot'
targeting FILE-1 on behalf of pacemaker-controld.19415@FILE-6: OK | rc=0
id=446afc42
63957:Jul 17 14:18:20.101 FILE-2 pacemaker-fenced[15958]
(remote_op_done)   notice: Operation 'reboot' targeting FILE-1 by FILE-5
for pacemaker-controld.19415@FILE-6: OK | id=446afc42
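
For context: the crit message above means the local controller learned from the
fencer that FILE-4 had successfully fenced FILE-2 on behalf of a request from
FILE-6; pacemaker-controld treats being fenced as a fatal condition and exits
with status 100, after which pacemakerd shuts the rest of the stack down. One
way to see what the surviving nodes recorded about these fencing operations is
to query the fencing history, for example (a sketch; command form as in
Pacemaker 2.1, run on a node that stayed up):

  stonith_admin --history FILE-2   # fencing actions that targeted FILE-2
  stonith_admin --history '*'      # full fencing history for all targets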

Thanks
Priyanka

On Thu, Jul 20, 2023 at 12:07 AM Ken Gaillot  wrote:

> On Wed, 2023-07-19 at 23:49 +0530, Priyanka Balotra wrote:
> > Hi All,
> > I am using SLES 15 SP4. One of the nodes of the cluster was brought
> > down and booted up after some time. The Pacemaker service came up first
> > but later faced a fatal shutdown. Because of that, the crm service is down.
> >
> > The logs from /var/log/pacemaker/pacemaker.log are as follows:
> >
> > Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> > (pcmk_child_exit)warning: Shutting cluster down because
> > pacemaker-controld[15962] had fatal failure
>
> The interesting messages will be before this. The ones with "pacemaker-
> controld" will be the most relevant, at least initially.
>
> > Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> > (pcmk_shutdown_worker)   notice: Shutting down Pacemaker
> > Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
> > (pcmk_shutdown_worker)   debug: pacemaker-controld confirmed stopped
> > Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
> >   notice: Stopping pacemaker-schedulerd | sent signal 15 to process
> > 15961
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_signal_dispatch)notice: Caught 'Terminated' signal | 15
> > (invoking handler)
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_us_withdraw)info: withdrawing server sockets
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (qb_ipcs_unref)  debug: qb_ipcs_unref() - destroying
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
> > (crm_xml_cleanup)info: Cleaning up memory from libxml2
> > Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
> >   info: Exiting pacemaker-schedulerd | with status 0
> > Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
> 

[ClusterLabs] Pacemaker fatal shutdown

2023-07-19 Thread Priyanka Balotra
Hi All,
I am using SLES 15 SP4. One of the nodes of the cluster was brought down and
booted up after some time. The Pacemaker service came up first but later
faced a fatal shutdown. Because of that, the crm service is down.

The logs from /var/log/pacemaker/pacemaker.log are as follows:

Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 warning: Shutting cluster down because pacemaker-controld[15962] had
fatal failure
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   notice: Shutting down Pacemaker
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-controld confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-schedulerd | sent signal 15 to process 15961
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961]
(qb_ipcs_us_withdraw)info: withdrawing server sockets
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (qb_ipcs_unref)
 debug: qb_ipcs_unref() - destroying
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_xml_cleanup)
 info: Cleaning up memory from libxml2
Jul 17 14:18:20.093 FILE-2 pacemaker-schedulerd[15961] (crm_exit)
info: Exiting pacemaker-schedulerd | with status 0
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(qb_ipcs_event_sendv)debug: new_event_notification
(/dev/shm/qb-15957-15962-12-RDPw6O/qb): Broken pipe (32)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_notify_send_one)warning: Could not notify client crmd: Broken pipe
| id=e29d175e-7e91-4b6a-bffb-fabfdd7a33bf
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_delete operation for section
//node_state[@uname='FILE-2']/*: OK (rc=0, origin=FILE-6/crmd/74,
version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemaker-fenced[15958]
(xml_patch_version_check)debug: Can apply patch 0.24.75 to 0.24.74
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (pcmk_child_exit)
 info: pacemaker-schedulerd[15961] exited with status 0 (OK)
Jul 17 14:18:20.093 FILE-2 pacemaker-based [15957]
(cib_process_request)info: Completed cib_modify operation for section
status: OK (rc=0, origin=FILE-6/crmd/75, version=0.24.75)
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956]
(pcmk_shutdown_worker)   debug: pacemaker-schedulerd confirmed stopped
Jul 17 14:18:20.093 FILE-2 pacemakerd  [15956] (stop_child)
notice: Stopping pacemaker-attrd | sent signal 15 to process 15960
Jul 17 14:18:20.093 FILE-2 pacemaker-attrd [15960]
(crm_signal_dispatch)notice: Caught 'Terminated' signal | 15 (invoking
handler)

Could you please help me understand the issue here?
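
For reference, the fatal failure is reported for pacemaker-controld, so the
controller's own messages just before this shutdown are the most relevant
ones; one simple way to pull them out of the log above (a sketch):

  grep 'pacemaker-controld' /var/log/pacemaker/pacemaker.log | less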

Regards
Priyanka


Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Priyanka Balotra
I am using SLES 15 SP4. Is no-quorum-policy still supported?

Thanks
Priyanka

On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot  wrote:

> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> > In this case stonith has been configured as a resource,
> > primitive stonith-sbd stonith:external/sbd
> >
> > For it to function properly, the resource needs to be up, which
> > is only possible if the system is quorate.
>
> Pacemaker can use a fence device even if its resource is not active.
> The resource being active just allows Pacemaker to monitor the device
> regularly.
>
> >
> > Hence our requirement is to make the system quorate even if only one
> > node of the cluster is up.
> > Stonith will then take care of any split-brain scenarios.
>
> In that case it sounds like no-quorum-policy=ignore is actually what
> you want.
>
> >
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger 
> > wrote:
> > >
> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <
> > > arvidj...@gmail.com> wrote:
> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > > Hi Andrei,
> > > > > After this state the system went through some more fencings and
> > > > we saw the
> > > > > following state:
> > > > >
> > > > > :~ # crm status
> > > > > Cluster Summary:
> > > > >* Stack: corosync
> > > > >* Current DC: FILE-2 (version
> > > > > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36)
> > > > - partition
> > > > > with quorum
> > > >
> > > > It says "partition with quorum" so what exactly is the problem?
> > >
> > > I guess the problem is that resources aren't being recovered on
> > > the nodes in the quorate partition.
> > > Reason for that is probably that - as Ken was already suggesting -
> > > fencing isn't
> > > working properly or fencing-devices used are simply inappropriate
> > > for the
> > > purpose (e.g. onboard IPMI).
> > > The fact that a node is rebooting isn't enough. The node that
> > > initiated fencing
> > > has to know that it did actually work. But we're just guessing
> > > here. Logs should
> > > show what is actually going on.
> > >
> > > Klaus
> > > > >* Last updated: Mon Jun 26 12:44:15 2023
> > > > >* Last change:  Mon Jun 26 12:41:12 2023 by root via
> > > > cibadmin on FILE-2
> > > > >* 4 nodes configured
> > > > >* 11 resource instances configured
> > > > >
> > > > > Node List:
> > > > >* Node FILE-1: UNCLEAN (offline)
> > > > >* Node FILE-4: UNCLEAN (offline)
> > > > >* Online: [ FILE-2 ]
> > > > >* Online: [ FILE-3 ]
> > > > >
> > > > > At this stage FILE-1 and FILE-4 were continuously getting
> > > > fenced (we have
> > > > > device based stonith configured but the resource was not up ) .
> > > > > Two nodes were online and two were offline. So quorum wasn't
> > > > attained
> > > > > again.
> > > > > 1)  For such a scenario we need help to be able to have one
> > > > cluster live .
> > > > > 2)  And in cases where only one node of the cluster is up and
> > > > others are
> > > > > down we need the resources and cluster to be up .
> > > > >
> > > > > Thanks
> > > > > Priyanka
> > > > >
> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
> > > > arvidj...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > > >>> Hi All,
> > > > >>> We are seeing an issue where we replaced no-quorum-
> > > > policy=ignore with
> > > > >> other
> > > > >>> options in corosync.conf order to simulate the same behaviour
> > > > :
> > > > >>>
> > > > >>>
> > > > >>> wait_for_all: 0
> > > > >>> last_man_standing: 1
> > > > >>> last_man_standing_window: 2
> > > > >>>
> > > > >>> There was another property (auto-tie-breaker) tried but

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Priyanka Balotra
In this case stonith has been configured as a resource,
primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is
only possible if the system is quorate.
Hence our requirement is to make the system quorate even if only one node of
the cluster is up.
Stonith will then take care of any split-brain scenarios.
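
For reference, if the goal is simply to let Pacemaker keep managing resources
when quorum is lost and rely on fencing for safety, the relevant property can
also be set directly on the Pacemaker side; a minimal sketch in the crmsh
syntax already used for this cluster:

  # allow resource management without quorum; only reasonable because
  # SBD/stonith is configured to handle split-brain
  crm configure property no-quorum-policy=ignore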

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger  wrote:

>
>
> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov 
> wrote:
>
>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > Hi Andrei,
>> > After this state the system went through some more fencings and we saw
>> the
>> > following state:
>> >
>> > :~ # crm status
>> > Cluster Summary:
>> >* Stack: corosync
>> >* Current DC: FILE-2 (version
>> > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
>> partition
>> > with quorum
>>
>> It says "partition with quorum" so what exactly is the problem?
>>
>
> I guess the problem is that resources aren't being recovered on
> the nodes in the quorate partition.
> Reason for that is probably that - as Ken was already suggesting - fencing
> isn't
> working properly or fencing-devices used are simply inappropriate for the
> purpose (e.g. onboard IPMI).
> The fact that a node is rebooting isn't enough. The node that initiated
> fencing
> has to know that it did actually work. But we're just guessing here. Logs
> should
> show what is actually going on.
>
> Klaus
>
>>
>> >* Last updated: Mon Jun 26 12:44:15 2023
>> >* Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on
>> FILE-2
>> >* 4 nodes configured
>> >* 11 resource instances configured
>> >
>> > Node List:
>> >* Node FILE-1: UNCLEAN (offline)
>> >* Node FILE-4: UNCLEAN (offline)
>> >* Online: [ FILE-2 ]
>> >* Online: [ FILE-3 ]
>> >
>> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we
>> have
>> > device based stonith configured but the resource was not up ) .
>> > Two nodes were online and two were offline. So quorum wasn't attained
>> > again.
>> > 1)  For such a scenario we need help to be able to have one cluster
>> live .
>> > 2)  And in cases where only one node of the cluster is up and others are
>> > down we need the resources and cluster to be up .
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov 
>> > wrote:
>> >
>> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> >>> Hi All,
>> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>> >> other
>> >>> options in corosync.conf order to simulate the same behaviour :
>> >>>
>> >>>
>> >>> wait_for_all: 0
>> >>> last_man_standing: 1
>> >>> last_man_standing_window: 2
>> >>>
>> >>> There was another property (auto-tie-breaker) tried but couldn't
>> >> configure
>> >>> it as crm did not recognise this property.
>> >>>
>> >>> But even after using these options, we are seeing that system is not
>> >>> quorate if at least half of the nodes are not up.
>> >>>
>> >>> Some properties from crm config are as follows:
>> >>>
>> >>>
>> >>>
>> >>> primitive stonith-sbd stonith:external/sbd \
>> >>>     params pcmk_delay_base=5s
>> >>> ...
>> >>> property cib-bootstrap-options: \
>> >>>     have-watchdog=true \
>> >>>     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>> >>>     cluster-infrastructure=corosync \
>> >>>     cluster-name=FILE \
>> >>>     stonith-enabled=true \
>> >>>     stonith-timeout=172 \
>> >>>     stonith-action=reboot \
>> >>>     stop-all-resources=false \
>> >>>     no-quorum-policy=ignore
>> >>> rsc_defaults bui

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-26 Thread Priyanka Balotra
Hi Andrei,
After this state the system went through some more fencings and we saw the
following state:

:~ # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: FILE-2 (version
2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition
with quorum
  * Last updated: Mon Jun 26 12:44:15 2023
  * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
  * 4 nodes configured
  * 11 resource instances configured

Node List:
  * Node FILE-1: UNCLEAN (offline)
  * Node FILE-4: UNCLEAN (offline)
  * Online: [ FILE-2 ]
  * Online: [ FILE-3 ]

At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
device-based stonith configured but the resource was not up).
Two nodes were online and two were offline, so quorum wasn't attained
again.
1)  For such a scenario we need help to keep one cluster partition live.
2)  And in cases where only one node of the cluster is up and the others are
down, we need the resources and the cluster to be up.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov 
wrote:

> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > Hi All,
> > We are seeing an issue where we replaced no-quorum-policy=ignore with
> > other options in corosync.conf in order to simulate the same behaviour:
> >
> >
> > wait_for_all: 0
> > last_man_standing: 1
> > last_man_standing_window: 2
> >
> > There was another property (auto-tie-breaker) tried but couldn't
> configure
> > it as crm did not recognise this property.
> >
> > But even after using these options, we are seeing that the system is not
> > quorate unless at least half of the nodes are up.
> >
> > Some properties from crm config are as follows:
> >
> >
> >
> > primitive stonith-sbd stonith:external/sbd \
> >     params pcmk_delay_base=5s
> > ...
> > property cib-bootstrap-options: \
> >     have-watchdog=true \
> >     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> >     cluster-infrastructure=corosync \
> >     cluster-name=FILE \
> >     stonith-enabled=true \
> >     stonith-timeout=172 \
> >     stonith-action=reboot \
> >     stop-all-resources=false \
> >     no-quorum-policy=ignore
> > rsc_defaults build-resource-defaults: \
> >     resource-stickiness=1
> > rsc_defaults rsc-options: \
> >     resource-stickiness=100 \
> >     migration-threshold=3 \
> >     failure-timeout=1m \
> >     cluster-recheck-interval=10min
> > op_defaults op-options: \
> >     timeout=600 \
> >     record-pending=true
> >
> > On a 4-node setup when the whole cluster is brought up together we see
> > error logs like:
> >
> > 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > warning: Fencing and resource management disabled due to lack of quorum
> >
> > 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > warning: Ignoring malformed node_state entry without uname
> >
> > 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > warning: Node FILE-2 is unclean!
> >
> > 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > warning: Node FILE-3 is unclean!
> >
> > 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > warning: Node FILE-4 is unclean!
> >
>
> According to this output FILE-1 lost connection to three other nodes, in
> which case it cannot be quorate.
>
> >
> > Kindly help correct the configuration to make the system function
> normally
> > with all resources up, even if there is just one node up.
> >
> > Please let me know if any more info is needed.
> >
> > Thanks
> > Priyanka
> >
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


[ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-26 Thread Priyanka Balotra
Hi All,
We are seeing an issue where we replaced no-quorum-policy=ignore with other
options in corosync.conf in order to simulate the same behaviour:


wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 2

We also tried another property (auto-tie-breaker), but couldn't configure
it because crm did not recognise this property.
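
For reference, these are corosync votequorum options rather than Pacemaker
cluster properties, which is why crm does not recognise auto-tie-breaker;
they live in the quorum section of /etc/corosync/corosync.conf. A sketch of
how that section typically looks (the values are illustrative, not a
recommendation, and expected_votes is assumed for the 4-node setup described
below):

  quorum {
      provider: corosync_votequorum
      expected_votes: 4
      wait_for_all: 0
      last_man_standing: 1
      last_man_standing_window: 20000   # milliseconds; illustrative value
      auto_tie_breaker: 1
  }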

But even after using these options, we are seeing that the system is not
quorate unless at least half of the nodes are up.

Some properties from crm config are as follows:



primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_base=5s
...
property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
    cluster-infrastructure=corosync \
    cluster-name=FILE \
    stonith-enabled=true \
    stonith-timeout=172 \
    stonith-action=reboot \
    stop-all-resources=false \
    no-quorum-policy=ignore
rsc_defaults build-resource-defaults: \
    resource-stickiness=1
rsc_defaults rsc-options: \
    resource-stickiness=100 \
    migration-threshold=3 \
    failure-timeout=1m \
    cluster-recheck-interval=10min
op_defaults op-options: \
    timeout=600 \
    record-pending=true

On a 4-node setup when the whole cluster is brought up together we see
error logs like:

2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Fencing and resource management disabled due to lack of quorum

2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Ignoring malformed node_state entry without uname

2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-2 is unclean!

2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-3 is unclean!

2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-4 is unclean!


Kindly help correct the configuration to make the system function normally
with all resources up, even if there is just one node up.

Please let me know if any more info is needed.

Thanks
Priyanka


Re: [ClusterLabs] crm node stays online after issuing node standby command

2023-03-15 Thread Priyanka Balotra
+Ayush

Thanks


On Wed, 15 Mar 2023 at 8:17 PM, Ken Gaillot  wrote:

> Hi,
>
> If you can reproduce the problem, the following info would be helpful:
>
> * "cibadmin -Q | grep standby" : to show whether it was successfully
> recorded in the CIB (will show info for any node with standby, but the
> XML ID likely has the node name or ID in it)
>
> * "attrd_updater -Q -n standby -N FILE-2" : to show whether the
> attribute manager has the right value in memory for the affected node
>
>
> On Wed, 2023-03-15 at 15:51 +0530, Ayush Siddarath wrote:
> > Hi All,
> >
> > We are seeing an issue as part of crm maintenance operations. As part
> > of the upgrade process, the crm nodes are put into standby mode.
> > But it's observed that one of the nodes fails to go into standby mode
> > despite the "crm node standby" returning success.
> >
> > Commands issued to put nodes into maintenance :
> >
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-1] =>
> > > {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-1",
> > > "delta": "0:00:00.442615", "end": "2023-03-15 06:07:08.150375",
> > > "rc": 0, "start": "2023-03-15 06:07:07.707760", "stderr": "",
> > > "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby
> > > node FILE-1", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby
> > > node FILE-1"]}
> > > .
> > > [2023-03-15 06:07:08 +] [468] [INFO] changed: [FILE-2] =>
> > > {"changed": true, "cmd": "/usr/sbin/crm node standby FILE-2",
> > > "delta": "0:00:00.459407", "end": "2023-03-15 06:07:08.223749",
> > > "rc": 0, "start": "2023-03-15 06:07:07.764342", "stderr": "",
> > > "stderr_lines": [], "stdout": "\u001b[32mINFO\u001b[0m: standby
> > > node FILE-2", "stdout_lines": ["\u001b[32mINFO\u001b[0m: standby
> > > node FILE-2"]}
> >
> >   
> >
> > Crm status o/p after above command execution:
> >
> > > FILE-2:/var/log # crm status
> > > Cluster Summary:
> > >   * Stack: corosync
> > >   * Current DC: FILE-1 (version 2.1.2+20211124.ada5c3b36-
> > > 150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
> > >   * Last updated: Wed Mar 15 08:32:27 2023
> > >   * Last change:  Wed Mar 15 06:07:08 2023 by root via cibadmin on
> > > FILE-4
> > >   * 4 nodes configured
> > >   * 11 resource instances configured (5 DISABLED)
> > > Node List:
> > >   * Node FILE-1: standby (with active resources)
> > >   * Node FILE-3: standby (with active resources)
> > >   * Node FILE-4: standby (with active resources)
> > >   * Online: [ FILE-2 ]
> >
> > pacemaker logs indicate that FILE-2 received the commands to put it
> > into standby.
> >
> > > FILE-2:/var/log # grep standby /var/log/pacemaker/pacemaker.log
> > > Mar 15 06:07:08.098 FILE-2 pacemaker-based [8635]
> > > (cib_perform_op)  info: ++
> > > > > value="on"/>
> > > Mar 15 06:07:08.166 FILE-2 pacemaker-based [8635]
> > > (cib_perform_op)  info: ++
> > > > > value="on"/>
> > > Mar 15 06:07:08.170 FILE-2 pacemaker-based [8635]
> > > (cib_perform_op)  info: ++
> > > > > value="on"/>
> > > Mar 15 06:07:08.230 FILE-2 pacemaker-based [8635]
> > > (cib_perform_op)  info: ++
> > > > > value="on"/>
> >
> >
> > Issue is quite intermittent and observed on other nodes as well.
> > We have seen a similar issue when we try to remove nodes from standby
> > mode (using the "crm node online" command). One or more nodes fail to
> > come out of standby mode.
> >
> > We suspect it could be an issue with parallel execution of node
> > standby/online command for all nodes but this issue wasn't observed
> > with pacemaker packaged with SLES15 SP2 OS.
> >
> > I'm attaching the pacemaker.log from FILE-2 for analysis. Let us know
> > if any additional information is required.
> >
> > OS: SLES15 SP4
> > Pacemaker version -->
> >  crmadmin --version
> > Pacemaker 2.1.2+20211124.ada5c3b36-150400.2.43
> >
> > Thanks,
> > Ayush
> >
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot 
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


Re: [ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply

2022-06-22 Thread Priyanka Balotra
Hi Klaus,
The config is as follows:
There are 2 nodes in the setup and some resources configured (stonith, IP,
and systemd-service related).
Sorry, I can share only high-level details for this.

- pacemaker version
# rpm -qa pacemaker
pacemaker-2.0.3+20200511.2b248d828-1.10.x86_64

# rpm -qa corosync
corosync-2.4.5-10.14.6.1.x86_64

# rpm -qa crmsh
crmsh-4.2.0+git.1585096577.f3257c89-3.4.noarch
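
For completeness, the configuration Klaus asked about can usually be shared
with something like the following (standard pacemaker/crmsh tools; a sketch,
and sensitive values would still need to be redacted by hand):

  crm configure show > cluster-config.txt   # crmsh view of the configuration
  cibadmin -Q > cib.xml                      # raw CIB, including the status section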


On Wed, Jun 22, 2022 at 5:45 PM Klaus Wenninger  wrote:

> On Wed, Jun 22, 2022 at 1:46 PM Priyanka Balotra
>  wrote:
> >
> > Hi All,
> >
> > We are seeing an issue where we performed a cluster shutdown followed by
> > a cluster boot operation. All the nodes joined the cluster except one (the
> > first node). Here are some pacemaker logs around that timestamp:
> >
> > 2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]:
> notice: Operation 'off' targeting FILE-1 on FILE-2 for
> pacemaker-controld.11523@FILE-2.0b09e949: OK
> >
> > 2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]:  error:
> stonith_construct_reply: Triggered assert at fenced_commands.c:2363 :
> request != NULL
> >
> > 2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]:
> warning: Can't create a sane reply
> >
> > 2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]:
> crit: We were allegedly just fenced by FILE-2 for FILE-2!
> >
> > 2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]:  warning:
> Shutting cluster down because pacemaker-controld[11643] had fatal failure
> >
> > 2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]:  notice:
> Shutting down Pacemaker
> >
> > 2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]:  notice:
> Stopping pacemaker-schedulerd
> >
> >
> >
> > Let us know if you need any more logs to find an rca to this.
>
> A little bit more info about your configuration and the pacemaker-version
> (cib?)
> used would definitely be helpful.
>
> Klaus
> >
> > Thanks
> > Priyanka
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


[ClusterLabs] pacemaker-fenced[11637]: warning: Can't create a sane reply

2022-06-22 Thread Priyanka Balotra
Hi All,

We are seeing an issue where we performed a cluster shutdown followed by a
cluster boot operation. All the nodes joined the cluster except one (the
first node). Here are some pacemaker logs around that timestamp:

2022-06-19T07:02:08.690213+00:00 FILE-1 pacemaker-fenced[11637]:  notice:
Operation 'off' targeting FILE-1 on FILE-2 for
pacemaker-controld.11523@FILE-2.0b09e949: OK

2022-06-19T07:02:08.690604+00:00 FILE-1 pacemaker-fenced[11637]:  error:
stonith_construct_reply: Triggered assert at fenced_commands.c:2363 :
request != NULL

2022-06-19T07:02:08.690781+00:00 FILE-1 pacemaker-fenced[11637]:
warning: Can't create a sane reply

2022-06-19T07:02:08.691872+00:00 FILE-1 pacemaker-controld[11643]:  crit:
We were allegedly just fenced by FILE-2 for FILE-2!

2022-06-19T07:02:08.693994+00:00 FILE-1 pacemakerd[11622]:  warning:
Shutting cluster down because pacemaker-controld[11643] had fatal failure

2022-06-19T07:02:08.694209+00:00 FILE-1 pacemakerd[11622]:  notice:
Shutting down Pacemaker

2022-06-19T07:02:08.694381+00:00 FILE-1 pacemakerd[11622]:  notice:
Stopping pacemaker-schedulerd


Let us know if you need any more logs to find an RCA for this.

Thanks
Priyanka


[ClusterLabs] crm status shows CURRENT DC as None

2022-06-13 Thread Priyanka Balotra
Hi Folks,

crm status shows CURRENT DC as None. Please check and let us know why the
current DC is not pointing to any of the nodes.



CRM Status:

Cluster Summary:

  * Stack: corosync

  * Current DC: NONE

  * Last updated: Tue Jun  7 06:14:59 2022

  * Last change:  Tue Jun  7 05:29:40 2022 by root via cibadmin on FILE-2

  * 2 nodes configured

  * 9 resource instances configured


   - How will the current DC get set to one of the nodes once we see it as None?
   - Is there any impact on cluster functionality?
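
For reference, a DC is elected by the controllers only once the nodes can see
each other at the corosync membership layer; until a DC exists, the cluster
does not schedule any resource actions. A quick way to check what each layer
currently sees (standard tools; a sketch):

  corosync-quorumtool -s   # membership and quorum as corosync sees it
  crm_mon -1               # pacemaker's view, including the DC once elected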

Thanks
Priyanka


[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)

2022-03-23 Thread Priyanka Balotra
Hi All,



We have a scenario on a SLES 12 SP3 cluster.

The scenario is explained as follows in the order of events:

-   There is a 2-node cluster (FILE-1, FILE-2)

-   The cluster and the resources were up and running fine initially.

-   Then a fencing request from pacemaker was issued on both nodes
simultaneously



Logs from 1st node:

2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to
receive the leave message. failed: 2

.

.

2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice:
Requesting that FILE-1 perform 'off' action targeting FILE-2



Logs from 2nd node:

2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to
receive the leave message. failed: 1

.

.

Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith)
notice: Requesting that FILE-2 perform 'off' action targeting FILE-1



-   When the nodes came up after unfencing, the DC got set after an
election

-   After that, the resources which were expected to run on only one
node became active on both (all) nodes of the cluster.





27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource IP_Floating is active on 2 nodes (attempting recovery)
27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]:
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active
for more information
27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]:
error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
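
For reference, the node(s) on which Pacemaker currently believes a given
resource is running can be checked per resource, for example (resource name
taken from the log above; a sketch):

  crm_resource --resource IP_Floating --locate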





Can you guys please help us understand if this is indeed a split-brain
scenario? Under what circumstances can such a scenario be observed?

This could have a very serious impact if such a case re-occurs in spite of
stonith already being configured. Hence the ask.

In case this situation gets reproduced, how can it be handled?

Note: We have stonith configured and it has been working fine so far. In
this case too, the initial fencing was performed by stonith.
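
For reference, one commonly used way to reduce the chance of the two nodes
fencing each other at the same instant is a delay on the fence agent, so that
one side fences first and the other is powered off before its own request
completes. A minimal sketch in crmsh syntax (pcmk_delay_max is a standard
stonith option, but whether it is available depends on the Pacemaker version
shipped with SLES 12 SP3, so treat the value and placement as illustrative):

  # wait a random delay of up to 15s before executing fencing actions
  # through this device, so only one side of a split completes its request
  crm configure primitive stonith-sbd stonith:external/sbd \
      params pcmk_delay_max=15s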



Thanks in advance!

Priyanka