Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread lejeczek via Users



On 31/01/2024 18:11, Ken Gaillot wrote:

On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote:

On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:


On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.


You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.
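
As a rough sketch of that approximation, an alert agent could compare the local node name with the current partition membership. The helper name sole_member and the logger tag are hypothetical; CRM_alert_kind and CRM_alert_desc are the environment variables Pacemaker passes to alert agents, and the output of crm_node -p is assumed to be a whitespace-separated list of node names. In line with the warning above, the sketch only logs and does not touch the configuration.

```shell
#!/bin/sh
# Hypothetical alert agent sketch: detect "I am the only active node".

# sole_member LOCAL MEMBER...: succeed only if LOCAL is the one and only member.
sole_member() {
    me=$1
    shift
    [ "$#" -eq 1 ] && [ "$1" = "$me" ]
}

# Pacemaker sets CRM_alert_kind when it runs an alert agent;
# outside that context this script does nothing.
if [ "${CRM_alert_kind:-}" = "node" ]; then
    local_node=$(crm_node -n)
    # Word splitting of the member list is intentional here.
    # shellcheck disable=SC2046
    if sole_member "$local_node" $(crm_node -p); then
        logger -t cluster-edge \
            "node $local_node is the only active member (${CRM_alert_desc:-})"
    fi
fi
```

The decision is kept in a small function so it can be checked by hand, for example `sole_member ubusrv1 ubusrv1` succeeds while `sole_member ubusrv1 ubusrv1 ubusrv2` does not.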

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move pgsql (PAF) master
onto a given node - seems that co/locating paf's master
results in troubles (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it
break any replication? What are these constraints you are dealing with?

Could you share your configuration?

Constraints beyond/above what is required by PAF agent
itself, say...
you have multiple pgSQL clusters with PAF - thus multiple
(separate, for each pgSQL cluster) masters and you want to
spread/balance those across HA cluster
(or in other words - avoid having more than 1 pgsql master
per HA node)
These below, I've tried, those move the master onto chosen
node but.. then the issues I mentioned.

-> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000


Anti-colocation sets tend to be tricky currently -- if the first
resource can't be assigned to a node, none of them can. We have an idea
for a better implementation:

  https://projects.clusterlabs.org/T383

In the meantime, a possible workaround is to use
placement-strategy=balanced and define utilization for the clones only. The
promoted roles will each get a slight additional utilization, and the
cluster should spread them out across nodes whenever possible. I don't
know if that will avoid the replication issues but it may be worth a
try.

using _balanced_ causes some mayhem for PAF/pgsql:

-> $ pcs property
Cluster Properties:
 REDIS-6380_REPL_INFO: ubusrv3
 REDIS-6381_REPL_INFO: ubusrv2
 REDIS-6382_REPL_INFO: ubusrv2
 REDIS-6385_REPL_INFO: ubusrv1
 REDIS_REPL_INFO: ubusrv1
 cluster-infrastructure: corosync
 cluster-name: ubusrv
 dc-version: 2.1.2-ada5c3b36e2
 have-watchdog: false
 last-lrm-refresh: 1706711588
 placement-strategy: default
 stonith-enabled: false

-> $ pcs resource utilization PGSQL-PAF-5438 cpu="20"

-> $ pcs property set placement-strategy=balanced # when resource stops:

I change it back:
-> $ pcs property set placement-strategy=default
and pgSQL/paf works again

I've not used _utilization_ nor _placement-strategy_ before,
so there is a solid chance that I'm missing something.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 31 Jan 2024 18:23:40 +0100
lejeczek via Users  wrote:

> On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 31 Jan 2024 16:37:21 +0100
> > lejeczek via Users  wrote:
> >  
> >>
> >> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:  
> >>> On Wed, 31 Jan 2024 16:02:12 +0100
> >>> lejeczek via Users  wrote:
> >>>  
>  On 29/01/2024 17:22, Ken Gaillot wrote:  
> > On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:  
> >> Hi guys.
> >>
> >> Is it possible to trigger some... action - I'm thinking specifically
> >> at shutdown/start.
> >> If not within the cluster then - if you do that - perhaps outside.
> >> I would like to create/remove constraints, when cluster starts &
> >> stops, respectively.
> >>
> >> many thanks, L.
> >>  
> > You could use node status alerts for that, but it's risky for alert
> > agents to change the configuration (since that may result in more
> > alerts and potentially some sort of infinite loop).
> >
> > Pacemaker has no concept of a full cluster start/stop, only node
> > start/stop. You could approximate that by checking whether the node
> > receiving the alert is the only active node.
> >
> > Another possibility would be to write a resource agent that does what
> > you want and order everything else after it. However it's even more
> > risky for a resource agent to modify the configuration.
> >
> > Finally you could write a systemd unit to do what you want and order it
> > after pacemaker.
> >
> > What's wrong with leaving the constraints permanently configured?  
>  yes, that would be for a node start/stop
>  I struggle with using constraints to move pgsql (PAF) master
>  onto a given node - seems that co/locating paf's master
>  results in troubles (replication breaks) at/after node
>  shutdown/reboot (not always, but way too often)
> >>> What? What's wrong with colocating PAF's masters exactly? How does it
> >>> break any replication? What are these constraints you are dealing with?
> >>>
> >>> Could you share your configuration?  
> >> Constraints beyond/above what is required by PAF agent
> >> itself, say...
> >> you have multiple pgSQL clusters with PAF - thus multiple
> >> (separate, for each pgSQL cluster) masters and you want to
> >> spread/balance those across HA cluster
> >> (or in other words - avoid having more than 1 pgsql master
> >> per HA node)
> > ok
> >  
> >> These below, I've tried, those move the master onto chosen
> >> node but.. then the issues I mentioned.  
> > You just mentioned it breaks the replication, but there is so little
> > information about your architecture and configuration, it's impossible to
> > imagine how this could break the replication.
> >
> > Could you add details about the issues ?
> >  
> >> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers  
> >> ubusrv1=1002
> >> or  
> >> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone  
> >> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
> >> require-all=false setoptions score=-1000  
> > I suppose the "colocation" constraint is the way to go, not the "location"
> > one.
> This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in 
> my case

No, this is not easy to replicate. I have no idea how you set up your PostgreSQL
replication, nor do I have your full pacemaker configuration.

Please provide detailed setups and/or ansible and/or terraform and/or
vagrant, then a detailed scenario showing how it breaks. This is how you can
help and motivate devs to reproduce your issue and work on it.

I will not try to poke around for hours until I find an issue that might not
even be the same as yours.


Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread Ken Gaillot
On Thu, 2024-02-01 at 14:31 +0100, lejeczek via Users wrote:
> 
> On 31/01/2024 18:11, Ken Gaillot wrote:
> > On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote:
> > > On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
> > > > On Wed, 31 Jan 2024 16:02:12 +0100
> > > > lejeczek via Users  wrote:
> > > > 
> > > > > On 29/01/2024 17:22, Ken Gaillot wrote:
> > > > > > On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users
> > > > > > wrote:
> > > > > > > Hi guys.
> > > > > > > 
> > > > > > > Is it possible to trigger some... action - I'm thinking
> > > > > > > specifically
> > > > > > > at shutdown/start.
> > > > > > > If not within the cluster then - if you do that - perhaps
> > > > > > > outside.
> > > > > > > I would like to create/remove constraints, when cluster
> > > > > > > starts &
> > > > > > > stops, respectively.
> > > > > > > 
> > > > > > > many thanks, L.
> > > > > > > 
> > > > > > You could use node status alerts for that, but it's risky
> > > > > > for
> > > > > > alert
> > > > > > agents to change the configuration (since that may result
> > > > > > in
> > > > > > more
> > > > > > alerts and potentially some sort of infinite loop).
> > > > > > 
> > > > > > Pacemaker has no concept of a full cluster start/stop, only
> > > > > > node
> > > > > > start/stop. You could approximate that by checking whether
> > > > > > the
> > > > > > node
> > > > > > receiving the alert is the only active node.
> > > > > > 
> > > > > > Another possibility would be to write a resource agent that
> > > > > > does what
> > > > > > you want and order everything else after it. However it's
> > > > > > even
> > > > > > more
> > > > > > risky for a resource agent to modify the configuration.
> > > > > > 
> > > > > > Finally you could write a systemd unit to do what you want
> > > > > > and
> > > > > > order it
> > > > > > after pacemaker.
> > > > > > 
> > > > > > What's wrong with leaving the constraints permanently
> > > > > > configured?
> > > > > yes, that would be for a node start/stop
> > > > > I struggle with using constraints to move pgsql (PAF) master
> > > > > onto a given node - seems that co/locating paf's master
> > > > > results in troubles (replication breaks) at/after node
> > > > > shutdown/reboot (not always, but way too often)
> > > > What? What's wrong with colocating PAF's masters exactly? How
> > > > does it break any
> > > > replication? What are these constraints you are dealing with?
> > > > 
> > > > Could you share your configuration?
> > > Constraints beyond/above what is required by PAF agent
> > > itself, say...
> > > you have multiple pgSQL clusters with PAF - thus multiple
> > > (separate, for each pgSQL cluster) masters and you want to
> > > spread/balance those across HA cluster
> > > (or in other words - avoid having more than 1 pgsql master
> > > per HA node)
> > > These below, I've tried, those move the master onto chosen
> > > node but.. then the issues I mentioned.
> > > 
> > > -> $ pcs constraint location PGSQL-PAF-5438-clone prefers
> > > ubusrv1=1002
> > > or
> > > -> $ pcs constraint colocation set PGSQL-PAF-5435-clone
> > > PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
> > > require-all=false setoptions score=-1000
> > > 
> > Anti-colocation sets tend to be tricky currently -- if the first
> > resource can't be assigned to a node, none of them can. We have an
> > idea
> > for a better implementation:
> > 
> >   https://projects.clusterlabs.org/T383
> > 
> > In the meantime, a possible workaround is to use
> > placement-strategy=balanced and define utilization for the clones only. The
> > promoted roles will each get a slight additional utilization, and
> > the
> > cluster should spread them out across nodes whenever possible. I
> > don't
> > know if that will avoid the replication issues but it may be worth
> > a
> > try.
> using _balanced_ causes some mayhem for PAF/pgsql:
> 
> -> $ pcs property
> Cluster Properties:
>   REDIS-6380_REPL_INFO: ubusrv3
>   REDIS-6381_REPL_INFO: ubusrv2
>   REDIS-6382_REPL_INFO: ubusrv2
>   REDIS-6385_REPL_INFO: ubusrv1
>   REDIS_REPL_INFO: ubusrv1
>   cluster-infrastructure: corosync
>   cluster-name: ubusrv
>   dc-version: 2.1.2-ada5c3b36e2
>   have-watchdog: false
>   last-lrm-refresh: 1706711588
>   placement-strategy: default
>   stonith-enabled: false
> 
> -> $ pcs resource utilization PGSQL-PAF-5438 cpu="20"
> 
> -> $ pcs property set placement-strategy=balanced # when resource stops:
> I change it back:
> -> $ pcs property set placement-strategy=default
> and pgSQL/paf works again
> 
> I've not used _utilization_ nor _placement-strategy_ before,
> so there is a solid chance that I'm missing something.
> 

See:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/utilization.html

You have to define the node capacities as well as the resource
utilization. I'd define the capacities high enough to run all the
resources (assuming you want that capability if all other nodes are in
standby or whatever).
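
As an illustration of the point above, a minimal sketch of defining both sides might look like the following. The node names and the PGSQL-PAF-5438 resource come from this thread; the capacity of 100 and the helper names run and set_capacities are assumptions made for the example. By default the script only prints the commands; set DRY_RUN=0 to apply them.

```shell
#!/bin/sh
# Sketch only: the capacity value (100) is an illustrative assumption,
# chosen so that one node could host every resource on its own.

# run CMD...: print the command, or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

set_capacities() {
    # Node side: declare how much "cpu" each node can provide.
    for node in ubusrv1 ubusrv2 ubusrv3; do
        run pcs node utilization "$node" cpu=100
    done
    # Resource side: declare what the resource consumes, using the same
    # attribute name as the node capacity.
    run pcs resource utilization PGSQL-PAF-5438 cpu=20
    # Switch the placement strategy only once both sides are defined.
    run pcs property set placement-strategy=balanced
}

set_capacities
```

The ordering is a deliberate choice in this sketch: with placement-strategy=balanced already active and no capacity defined on the nodes, a resource that declares a utilization cannot be placed anywhere, which would match the stop described above.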

[ClusterLabs] Gracefully Failing Live Migrations

2024-02-01 Thread Billy Croan
Sometimes I've tried to move a resource from one node to another, and it
migrates live without a problem.  Other times I get

> Failed Resource Actions:
> * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
> status=complete, exitreason='myvm: live migration to node2 failed: 1',
> last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms, exec=35874ms
>

And I find out the live part of the migration failed, when the vm reboots
and an (albeit minor) outage occurs.

Is there a way to configure pacemaker, so that if it is unable to migrate
live it simply does not migrate at all?


Re: [ClusterLabs] Gracefully Failing Live Migrations

2024-02-01 Thread Ken Gaillot
On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
> Sometimes I've tried to move a resource from one node to another, and
> it migrates live without a problem.  Other times I get 
> > Failed Resource Actions:
> > * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
> > status=complete, exitreason='myvm: live migration to node2 failed:
> > 1',
> > last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms,
> > exec=35874ms
> > 
> 
> And I find out the live part of the migration failed, when the vm
> reboots and an (albeit minor) outage occurs.
> 
> Is there a way to configure pacemaker, so that if it is unable to
> migrate live it simply does not migrate at all?
> 

No. Pacemaker automatically replaces a required stop/start sequence
with live migration when possible. If there is a live migration
attempted, by definition the resource must move one way or another.
Also, live migration involves three steps, and if one of them fails,
the resource is in an unknown state, so it must be restarted anyway.
-- 
Ken Gaillot 



Re: [ClusterLabs] Gracefully Failing Live Migrations

2024-02-01 Thread Billy Croan
How do I figure out which of the three steps failed and why?

On Thu, Feb 1, 2024 at 11:15 AM Ken Gaillot  wrote:

> On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
> > Sometimes I've tried to move a resource from one node to another, and
> > it migrates live without a problem.  Other times I get
> > > Failed Resource Actions:
> > > * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
> > > status=complete, exitreason='myvm: live migration to node2 failed:
> > > 1',
> > > last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms,
> > > exec=35874ms
> > >
> >
> > And I find out the live part of the migration failed, when the vm
> > reboots and an (albeit minor) outage occurs.
> >
> > Is there a way to configure pacemaker, so that if it is unable to
> > migrate live it simply does not migrate at all?
> >
>
> No. Pacemaker automatically replaces a required stop/start sequence
> with live migration when possible. If there is a live migration
> attempted, by definition the resource must move one way or another.
> Also, live migration involves three steps, and if one of them fails,
> the resource is in an unknown state, so it must be restarted anyway.
> --
> Ken Gaillot 
>


Re: [ClusterLabs] Gracefully Failing Live Migrations

2024-02-01 Thread Ken Gaillot
On Thu, 2024-02-01 at 12:57 -0600, Billy Croan wrote:
> How do I figure out which of the three steps failed and why?

They're normal resource actions: migrate_to, migrate_from, and stop.
You can investigate them in the usual way (status, logs).
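
As a rough sketch of that investigation, the operation key from the failure output (vm_myvm_migrate_to_0) can be split into resource, action and interval, and a log file filtered for the three actions. The helper names parse_op_key and show_op_lines are hypothetical, and the log path in the usage comment is an assumption (a common default for Pacemaker 2.x) that may differ on your distribution.

```shell
#!/bin/sh
# parse_op_key KEY: split "<resource>_<action>_<interval>" for the three
# live-migration actions; fails for any other action.
parse_op_key() {
    key=$1
    interval=${key##*_}      # text after the last underscore
    rest=${key%_*}           # everything before it
    for action in migrate_to migrate_from stop; do
        case $rest in
            *_"$action")
                resource=${rest%_$action}
                echo "$resource $action $interval"
                return 0
                ;;
        esac
    done
    return 1
}

# show_op_lines LOGFILE RESOURCE: print log lines that mention the resource
# together with one of the three actions.
show_op_lines() {
    grep -F -- "$2" "$1" | grep -E 'migrate_to|migrate_from|stop'
}

parse_op_key vm_myvm_migrate_to_0    # prints: vm_myvm migrate_to 0

# Assumed log location; check both the source and the target node:
#   show_op_lines /var/log/pacemaker/pacemaker.log vm_myvm
```

In the failure quoted in this thread the key names migrate_to on node1, so the first of the three steps is the one that returned the error; the logs on node1 and node2 around last-rc-change are the place to look for the reason.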

> 
> On Thu, Feb 1, 2024 at 11:15 AM Ken Gaillot 
> wrote:
> > On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
> > > Sometimes I've tried to move a resource from one node to another,
> > > and it migrates live without a problem.  Other times I get
> > > > Failed Resource Actions:
> > > > * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
> > > > status=complete, exitreason='myvm: live migration to node2
> > > > failed: 1',
> > > > last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms,
> > > > exec=35874ms
> > > > 
> > > 
> > > And I find out the live part of the migration failed, when the vm
> > > reboots and an (albeit minor) outage occurs.
> > > 
> > > Is there a way to configure pacemaker, so that if it is unable to
> > > migrate live it simply does not migrate at all?
> > > 
> > 
> > No. Pacemaker automatically replaces a required stop/start sequence
> > with live migration when possible. If there is a live migration
> > attempted, by definition the resource must move one way or another.
> > Also, live migration involves three steps, and if one of them
> > fails,
> > the resource is in an unknown state, so it must be restarted
> > anyway.
-- 
Ken Gaillot 



Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread lejeczek via Users




On 01/02/2024 15:02, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 18:23:40 +0100
lejeczek via Users  wrote:


On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:37:21 +0100
lejeczek via Users  wrote:
  

On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:

On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:
  

On 29/01/2024 17:22, Ken Gaillot wrote:

On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:

Hi guys.

Is it possible to trigger some... action - I'm thinking specifically
at shutdown/start.
If not within the cluster then - if you do that - perhaps outside.
I would like to create/remove constraints, when cluster starts &
stops, respectively.

many thanks, L.
  

You could use node status alerts for that, but it's risky for alert
agents to change the configuration (since that may result in more
alerts and potentially some sort of infinite loop).

Pacemaker has no concept of a full cluster start/stop, only node
start/stop. You could approximate that by checking whether the node
receiving the alert is the only active node.

Another possibility would be to write a resource agent that does what
you want and order everything else after it. However it's even more
risky for a resource agent to modify the configuration.

Finally you could write a systemd unit to do what you want and order it
after pacemaker.

What's wrong with leaving the constraints permanently configured?

yes, that would be for a node start/stop
I struggle with using constraints to move pgsql (PAF) master
onto a given node - seems that co/locating paf's master
results in troubles (replication breaks) at/after node
shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it
break any replication? What are these constraints you are dealing with?

Could you share your configuration?

Constraints beyond/above what is required by PAF agent
itself, say...
you have multiple pgSQL clusters with PAF - thus multiple
(separate, for each pgSQL cluster) masters and you want to
spread/balance those across HA cluster
(or in other words - avoid having more than 1 pgsql master
per HA node)

ok
  

These below, I've tried, those move the master onto chosen
node but.. then the issues I mentioned.

You just mentioned it breaks the replication, but there is so little
information about your architecture and configuration, it's impossible to
imagine how this could break the replication.

Could you add details about the issues ?
  

-> $ pcs constraint location PGSQL-PAF-5438-clone prefers
ubusrv1=1002
or
-> $ pcs constraint colocation set PGSQL-PAF-5435-clone
PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
require-all=false setoptions score=-1000

I suppose the "colocation" constraint is the way to go, not the "location"
one.

This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in
my case

No, this is not easy to replicate. I have no idea how you set up your PostgreSQL
replication, nor do I have your full pacemaker configuration.

Please provide detailed setups and/or ansible and/or terraform and/or
vagrant, then a detailed scenario showing how it breaks. This is how you can
help and motivate devs to reproduce your issue and work on it.

I will not try to poke around for hours until I find an issue that might not
even be the same as yours.

How about you start with the basics? I sense a strange inclination to
complicate things when they are not complicated. The basics are what I
was doing when I "stumbled" upon these "issues".

How about just:
a) set up a vanilla, default pgSQL on Ubuntu (or any other
OS of your choice); I use _pg_createcluster_
b) follow the official PAF guide (a single PAF resource should
suffice)
With a healthy pgSQL cluster, _reboot_ the nodes at the OS level and
play with that; all should be ok, and moving/electing the master
should work fine.
Then add and play with "additional" co/location constraints,
followed by OS reboots, and things should begin breaking.
I have a 3-node HA cluster & a 3-node PAF resource = 1 master +
2 slaves.
The only thing I deliberately set, to help pgsql replication,
was _wal_keep_size_; I increased it, but the right value is
subjective.


It's fine with me if you don't feel like doing this.