Re: [ClusterLabs] controlling cluster behavior on startup

2024-01-30 Thread Klaus Wenninger
On Tue, Jan 30, 2024 at 2:21 PM Walker, Chris 
wrote:

> >>> However, now it seems to wait that amount of time before it elects a
> >>> DC, even when quorum is acquired earlier.  In my log snippet below,
> >>> with dc-deadtime 300s,
> >>
> >> The dc-deadtime is not waiting for quorum, but for another DC to show
> >> up. If all nodes show up, it can proceed, but otherwise it has to wait.
>
> > I believe all the nodes showed up by 14:17:04, but it still waited until
> 14:19:26 to elect a DC:
>
> > Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> > Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
> > Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> > Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
>
> > This is a cluster with 2 nodes, gopher11 and gopher12.
>
> This is our experience with dc-deadtime too: even if both nodes in the
> cluster show up, dc-deadtime must elapse before the cluster starts.  This
> was discussed on this list a while back (
> https://www.mail-archive.com/users@clusterlabs.org/msg03897.html) and an
> RFE came out of it (https://bugs.clusterlabs.org/show_bug.cgi?id=5310).
>
>
>
> I’ve worked around this by having an ExecStartPre directive for Corosync
> that does essentially:
>
>
>
> while ! systemctl -H ${peer} is-active corosync; do sleep 5; done
>
>
>
> With this in place, the nodes wait for each other before starting Corosync
> and Pacemaker.  We can then use the default 20s dc-deadtime so that the DC
> election happens quickly once both nodes are up.
>

Actually, wait-for-all, which comes by default with two_node, should delay
quorum until both nodes have shown up.
And if we make the cluster not ignore quorum, it shouldn't start fencing
before it sees the peer - right?
Running a 2-node cluster that ignores quorum, or runs without wait-for-all,
is a delicate thing anyway, I would say, and shouldn't work in the generic
case. Not saying that is the issue here - there just isn't enough info
about the cluster to tell.
So you shouldn't need this raised dc-deadtime and thus wouldn't experience
large startup delays.
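For reference, the quorum stanza being described would look roughly like the
following in corosync.conf (a sketch based on the votequorum defaults, not the
poster's actual configuration):

```
quorum {
    provider: corosync_votequorum
    # two_node: 1 automatically enables wait_for_all, so neither node
    # becomes quorate until both have been seen at least once
    two_node: 1
}
```

With wait_for_all in effect, a node that boots alone stays inquorate (and so
will not fence its peer) until the other node has joined at least once.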

Regards,
Klaus


> Thanks,
>
> Chris
>
>
>
> *From: *Users  on behalf of Faaland, Olaf
> P. via Users 
> *Date: *Monday, January 29, 2024 at 7:46 PM
> *To: *Ken Gaillot , Cluster Labs - All topics
> related to open-source clustering welcomed 
> *Cc: *Faaland, Olaf P. 
> *Subject: *Re: [ClusterLabs] controlling cluster behavior on startup
>
> >> However, now it seems to wait that amount of time before it elects a
> >> DC, even when quorum is acquired earlier.  In my log snippet below,
> >> with dc-deadtime 300s,
> >
> > The dc-deadtime is not waiting for quorum, but for another DC to show
> > up. If all nodes show up, it can proceed, but otherwise it has to wait.
>
> I believe all the nodes showed up by 14:17:04, but it still waited until
> 14:19:26 to elect a DC:
>
> Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
>
> This is a cluster with 2 nodes, gopher11 and gopher12.
>
> Am I misreading that?
>
> thanks,
> Olaf
>
> 
> From: Ken Gaillot 
> Sent: Monday, January 29, 2024 3:49 PM
> To: Faaland, Olaf P.; Cluster Labs - All topics related to open-source
> clustering welcomed
> Subject: Re: [ClusterLabs] controlling cluster behavior on startup
>
> On Mon, 2024-01-29 at 22:48 +, Faaland, Olaf P. wrote:
> > Thank you, Ken.
> >
> > I changed my configuration management system to put an initial
> > cib.xml into /var/lib/pacemaker/cib/, which sets all the property
> > values I was setting via pcs commands, including dc-deadtime.  I
> > removed those "pcs property set" commands from the ones that are run
> > at startup time.
> >
> > That worked in the sense that after Pacemaker start, the node waits
> > my newly specified dc-deadtime of 300s before giving up on the
> > partner node and fencing it, if the partner never appears as a
> > member.
> >
> > However, now it seems to wait that amount of time before it elects a
> > DC, even when quorum is acquired earlier.  In my log snippet below,
> > with dc-deadtime 300s,
>
> The dc-deadtime is not waiting for quorum, but for another DC to show
> up. If all nodes show up, it can proceed, but otherwise it has to wait.

Re: [ClusterLabs] controlling cluster behavior on startup

2024-01-30 Thread Ken Gaillot
On Tue, 2024-01-30 at 13:20 +, Walker, Chris wrote:
> >>> However, now it seems to wait that amount of time before it
> elects a
> >>> DC, even when quorum is acquired earlier.  In my log snippet
> below,
> >>> with dc-deadtime 300s,
> >>
> >> The dc-deadtime is not waiting for quorum, but for another DC to
> show
> >> up. If all nodes show up, it can proceed, but otherwise it has to
> wait.
> 
> > I believe all the nodes showed up by 14:17:04, but it still waited
> until 14:19:26 to elect a DC:
> 
> > Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> > Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
> > Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> > Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
> 
> > This is a cluster with 2 nodes, gopher11 and gopher12.
> 
> This is our experience with dc-deadtime too: even if both nodes in
> the cluster show up, dc-deadtime must elapse before the cluster
> starts.  This was discussed on this list a while back (
> https://www.mail-archive.com/users@clusterlabs.org/msg03897.html) and
> an RFE came out of it (
> https://bugs.clusterlabs.org/show_bug.cgi?id=5310). 

Ah, I misremembered, I thought we had done that :(

>  
> I’ve worked around this by having an ExecStartPre directive for
> Corosync that does essentially:
>  
> while ! systemctl -H ${peer} is-active corosync; do sleep 5; done
>  
> With this in place, the nodes wait for each other before starting
> Corosync and Pacemaker.  We can then use the default 20s dc-deadtime
> so that the DC election happens quickly once both nodes are up.

That makes sense

> Thanks,
> Chris
>  
> From: Users  on behalf of Faaland,
> Olaf P. via Users 
> Date: Monday, January 29, 2024 at 7:46 PM
> To: Ken Gaillot , Cluster Labs - All topics
> related to open-source clustering welcomed 
> Cc: Faaland, Olaf P. 
> Subject: Re: [ClusterLabs] controlling cluster behavior on startup
> 
> >> However, now it seems to wait that amount of time before it elects
> a
> >> DC, even when quorum is acquired earlier.  In my log snippet
> below,
> >> with dc-deadtime 300s,
> >
> > The dc-deadtime is not waiting for quorum, but for another DC to
> show
> > up. If all nodes show up, it can proceed, but otherwise it has to
> wait.
> 
> I believe all the nodes showed up by 14:17:04, but it still waited
> until 14:19:26 to elect a DC:
> 
> Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb
> 
> This is a cluster with 2 nodes, gopher11 and gopher12.
> 
> Am I misreading that?
> 
> thanks,
> Olaf
> 
> 
> From: Ken Gaillot 
> Sent: Monday, January 29, 2024 3:49 PM
> To: Faaland, Olaf P.; Cluster Labs - All topics related to open-
> source clustering welcomed
> Subject: Re: [ClusterLabs] controlling cluster behavior on startup
> 
> On Mon, 2024-01-29 at 22:48 +, Faaland, Olaf P. wrote:
> > Thank you, Ken.
> >
> > I changed my configuration management system to put an initial
> > cib.xml into /var/lib/pacemaker/cib/, which sets all the property
> > values I was setting via pcs commands, including dc-deadtime.  I
> > removed those "pcs property set" commands from the ones that are
> run
> > at startup time.
> >
> > That worked in the sense that after Pacemaker start, the node waits
> > my newly specified dc-deadtime of 300s before giving up on the
> > partner node and fencing it, if the partner never appears as a
> > member.
> >
> > However, now it seems to wait that amount of time before it elects
> a
> > DC, even when quorum is acquired earlier.  In my log snippet below,
> > with dc-deadtime 300s,
> 
> The dc-deadtime is not waiting for quorum, but for another DC to show
> up. If all nodes show up, it can proceed, but otherwise it has to
> wait.
> 
> >
> > 14:14:24 Pacemaker starts on gopher12
> > 14:17:04 quorum is acquired
> > 14:19:26 Election Trigger just popped (start time + dc-deadtime
> > seconds)
> > 14:19:26 gopher12 wins the election
> >
> > Is there other configuration that needs to be present in the cib at
> > startup time?
> >
> > thanks,
> > Olaf
> >
> > === log extract using new system of installing partial cib.xml before
> > startup

Re: [ClusterLabs] controlling cluster behavior on startup

2024-01-30 Thread Walker, Chris
>>> However, now it seems to wait that amount of time before it elects a
>>> DC, even when quorum is acquired earlier.  In my log snippet below,
>>> with dc-deadtime 300s,
>>
>> The dc-deadtime is not waiting for quorum, but for another DC to show
>> up. If all nodes show up, it can proceed, but otherwise it has to wait.

> I believe all the nodes showed up by 14:17:04, but it still waited until 
> 14:19:26 to elect a DC:

> Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
> Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
> Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb

> This is a cluster with 2 nodes, gopher11 and gopher12.

This is our experience with dc-deadtime too: even if both nodes in the cluster 
show up, dc-deadtime must elapse before the cluster starts.  This was discussed 
on this list a while back 
(https://www.mail-archive.com/users@clusterlabs.org/msg03897.html) and an RFE 
came out of it (https://bugs.clusterlabs.org/show_bug.cgi?id=5310).

I’ve worked around this by having an ExecStartPre directive for Corosync that 
does essentially:

while ! systemctl -H ${peer} is-active corosync; do sleep 5; done

With this in place, the nodes wait for each other before starting Corosync and 
Pacemaker.  We can then use the default 20s dc-deadtime so that the DC election 
happens quickly once both nodes are up.
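That ExecStartPre could be carried in a systemd drop-in along these lines (a
sketch; the drop-in path and the PEER variable are hypothetical placeholders,
not details from the setup described above):

```
# /etc/systemd/system/corosync.service.d/wait-for-peer.conf (hypothetical path)
[Service]
# PEER must be defined somewhere, e.g. via an EnvironmentFile= line
ExecStartPre=/bin/sh -c 'while ! systemctl -H "$PEER" is-active corosync; do sleep 5; done'
```

Note that `systemctl -H` connects over SSH, so root SSH access between the
nodes is assumed; wrapping the loop in a timeout avoids hanging the boot
indefinitely if the peer never comes up.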
Thanks,
Chris

From: Users  on behalf of Faaland, Olaf P. via 
Users 
Date: Monday, January 29, 2024 at 7:46 PM
To: Ken Gaillot , Cluster Labs - All topics related to 
open-source clustering welcomed 
Cc: Faaland, Olaf P. 
Subject: Re: [ClusterLabs] controlling cluster behavior on startup
>> However, now it seems to wait that amount of time before it elects a
>> DC, even when quorum is acquired earlier.  In my log snippet below,
>> with dc-deadtime 300s,
>
> The dc-deadtime is not waiting for quorum, but for another DC to show
> up. If all nodes show up, it can proceed, but otherwise it has to wait.

I believe all the nodes showed up by 14:17:04, but it still waited until 
14:19:26 to elect a DC:

Jan 29 14:14:25 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher12 is now member (was in unknown state)
Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (peer_update_callback)  info: Cluster node gopher11 is now member (was in unknown state)
Jan 29 14:17:04 gopher12 pacemaker-controld  [123697] (quorum_notification_cb)  notice: Quorum acquired | membership=54 members=2
Jan 29 14:19:26 gopher12 pacemaker-controld  [123697] (do_log)  info: Input I_ELECTION_DC received in state S_ELECTION from election_win_cb

This is a cluster with 2 nodes, gopher11 and gopher12.

Am I misreading that?

thanks,
Olaf


From: Ken Gaillot 
Sent: Monday, January 29, 2024 3:49 PM
To: Faaland, Olaf P.; Cluster Labs - All topics related to open-source 
clustering welcomed
Subject: Re: [ClusterLabs] controlling cluster behavior on startup

On Mon, 2024-01-29 at 22:48 +, Faaland, Olaf P. wrote:
> Thank you, Ken.
>
> I changed my configuration management system to put an initial
> cib.xml into /var/lib/pacemaker/cib/, which sets all the property
> values I was setting via pcs commands, including dc-deadtime.  I
> removed those "pcs property set" commands from the ones that are run
> at startup time.
>
> That worked in the sense that after Pacemaker start, the node waits
> my newly specified dc-deadtime of 300s before giving up on the
> partner node and fencing it, if the partner never appears as a
> member.
>
> However, now it seems to wait that amount of time before it elects a
> DC, even when quorum is acquired earlier.  In my log snippet below,
> with dc-deadtime 300s,

The dc-deadtime is not waiting for quorum, but for another DC to show
up. If all nodes show up, it can proceed, but otherwise it has to wait.

>
> 14:14:24 Pacemaker starts on gopher12
> 14:17:04 quorum is acquired
> 14:19:26 Election Trigger just popped (start time + dc-deadtime
> seconds)
> 14:19:26 gopher12 wins the election
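The timeline above can be sanity-checked with quick arithmetic (timestamps
taken from the log; the trigger actually fired at 14:19:26, a couple of
seconds after start + dc-deadtime, presumably scheduling/logging slack):

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 29, 14, 14, 24)   # pacemakerd starts on gopher12
quorum = datetime(2024, 1, 29, 14, 17, 4)   # quorum acquired
dc_deadtime = timedelta(seconds=300)

# The election trigger pops at start + dc-deadtime, regardless of when
# quorum was acquired.
trigger = start + dc_deadtime
print(trigger.time())                       # 14:19:24
print((trigger - quorum).total_seconds())   # 140.0 seconds idle after quorum
```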
>
> Is there other configuration that needs to be present in the cib at
> startup time?
>
> thanks,
> Olaf
>
> === log extract using new system of installing partial cib.xml before
> startup
> Jan 29 14:14:24 gopher12 pacemakerd  [123690]
> (main)notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7
> features:agent-manpages ascii-docs compat-2.0 corosync-ge-2 default-
> concurrent-fencing generated-manpages monotonic nagios ncurses remote
> systemd
> Jan 29 14:14:25 gopher12 pacemaker-attrd [123695]
>