Re: [ClusterLabs] trigger something at ?
On 31/01/2024 18:11, Ken Gaillot wrote:
> On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote:
>> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
>>> On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote:
>>>> On 29/01/2024 17:22, Ken Gaillot wrote:
>>>>> On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
>>>>>> Hi guys.
>>>>>>
>>>>>> Is it possible to trigger some... action - I'm thinking specifically
>>>>>> at shutdown/start.
>>>>>> If not within the cluster then - if you do that - perhaps outside.
>>>>>> I would like to create/remove constraints when the cluster starts &
>>>>>> stops, respectively.
>>>>>>
>>>>>> many thanks, L.
>>>>> You could use node status alerts for that, but it's risky for alert
>>>>> agents to change the configuration (since that may result in more
>>>>> alerts and potentially some sort of infinite loop).
>>>>>
>>>>> Pacemaker has no concept of a full cluster start/stop, only node
>>>>> start/stop. You could approximate that by checking whether the node
>>>>> receiving the alert is the only active node.
>>>>>
>>>>> Another possibility would be to write a resource agent that does what
>>>>> you want and order everything else after it. However, it's even more
>>>>> risky for a resource agent to modify the configuration.
>>>>>
>>>>> Finally, you could write a systemd unit to do what you want and order
>>>>> it after pacemaker.
>>>>>
>>>>> What's wrong with leaving the constraints permanently configured?
>>>> yes, that would be for a node start/stop
>>>> I struggle with using constraints to move the pgsql (PAF) master
>>>> onto a given node - it seems that co/locating PAF's master
>>>> results in trouble (replication breaks) at/after node
>>>> shutdown/reboot (not always, but way too often)
>>> What? What's wrong with colocating PAF's masters exactly? How does it
>>> break any replication? What are these constraints you are dealing with?
>>>
>>> Could you share your configuration?
>> Constraints beyond/above what is required by the PAF agent itself, say...
>> you have multiple pgSQL clusters with PAF - thus multiple
>> (separate, for each pgSQL cluster) masters - and you want to
>> spread/balance those across the HA cluster
>> (or, in other words, avoid having more than 1 pgsql master per HA node)
>>
>> These below I've tried; those move the master onto the chosen node, but..
>> then the issues I mentioned.
>>
>> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
>> or
>> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone
>> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
>> require-all=false setoptions score=-1000
> Anti-colocation sets tend to be tricky currently -- if the first
> resource can't be assigned to a node, none of them can. We have an idea
> for a better implementation:
>
> https://projects.clusterlabs.org/T383
>
> In the meantime, a possible workaround is to use
> placement-strategy=balanced and define utilization for the clones only.
> The promoted roles will each get a slight additional utilization, and
> the cluster should spread them out across nodes whenever possible. I
> don't know if that will avoid the replication issues, but it may be
> worth a try.

using _balanced_ causes a small mayhem for PAF/pgsql:

-> $ pcs property
Cluster Properties:
 REDIS-6380_REPL_INFO: ubusrv3
 REDIS-6381_REPL_INFO: ubusrv2
 REDIS-6382_REPL_INFO: ubusrv2
 REDIS-6385_REPL_INFO: ubusrv1
 REDIS_REPL_INFO: ubusrv1
 cluster-infrastructure: corosync
 cluster-name: ubusrv
 dc-version: 2.1.2-ada5c3b36e2
 have-watchdog: false
 last-lrm-refresh: 1706711588
 placement-strategy: default
 stonith-enabled: false

-> $ pcs resource utilization PGSQL-PAF-5438 cpu="20"

-> $ pcs property set placement-strategy=balanced
# when the resource stops, I change it back:
-> $ pcs property set placement-strategy=default
and pgSQL/PAF works again

I've not used _utilization_ nor _placement-strategy_ before, so the
chance that I'm missing something is solid.

___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
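Ken's last suggestion above (a systemd unit ordered after pacemaker) could be sketched roughly as below. The unit name and both scripts are hypothetical placeholders for whatever `pcs constraint` calls are needed; treat this as a starting point, not a tested recipe:

```ini
# /etc/systemd/system/cluster-constraints.service (hypothetical unit name)
[Unit]
Description=Create/remove constraints around Pacemaker start/stop
# Started after pacemaker; on shutdown, systemd stops it before pacemaker
After=pacemaker.service
Requires=pacemaker.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder scripts wrapping the desired "pcs constraint ..." commands
ExecStart=/usr/local/sbin/add-constraints.sh
ExecStop=/usr/local/sbin/remove-constraints.sh

[Install]
WantedBy=multi-user.target
```

Once enabled with `systemctl enable cluster-constraints.service`, the constraints would be added after pacemaker starts on the node and removed before pacemaker stops there - which, per Ken's caveat, still tracks node start/stop, not cluster start/stop.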
Re: [ClusterLabs] trigger something at ?
On Wed, 31 Jan 2024 18:23:40 +0100 lejeczek via Users wrote:
> On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:
>> On Wed, 31 Jan 2024 16:37:21 +0100 lejeczek via Users wrote:
>>> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
>>>> On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote:
>>>>> On 29/01/2024 17:22, Ken Gaillot wrote:
>>>>>> On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
>>>>>>> Hi guys.
>>>>>>>
>>>>>>> Is it possible to trigger some... action - I'm thinking specifically
>>>>>>> at shutdown/start.
>>>>>>> If not within the cluster then - if you do that - perhaps outside.
>>>>>>> I would like to create/remove constraints when the cluster starts &
>>>>>>> stops, respectively.
>>>>>>>
>>>>>>> many thanks, L.
>>>>>> You could use node status alerts for that, but it's risky for alert
>>>>>> agents to change the configuration (since that may result in more
>>>>>> alerts and potentially some sort of infinite loop).
>>>>>>
>>>>>> Pacemaker has no concept of a full cluster start/stop, only node
>>>>>> start/stop. You could approximate that by checking whether the node
>>>>>> receiving the alert is the only active node.
>>>>>>
>>>>>> Another possibility would be to write a resource agent that does what
>>>>>> you want and order everything else after it. However, it's even more
>>>>>> risky for a resource agent to modify the configuration.
>>>>>>
>>>>>> Finally, you could write a systemd unit to do what you want and order
>>>>>> it after pacemaker.
>>>>>>
>>>>>> What's wrong with leaving the constraints permanently configured?
>>>>> yes, that would be for a node start/stop
>>>>> I struggle with using constraints to move the pgsql (PAF) master
>>>>> onto a given node - it seems that co/locating PAF's master
>>>>> results in trouble (replication breaks) at/after node
>>>>> shutdown/reboot (not always, but way too often)
>>>> What? What's wrong with colocating PAF's masters exactly? How does it
>>>> break any replication? What are these constraints you are dealing with?
>>>>
>>>> Could you share your configuration?
>>> Constraints beyond/above what is required by the PAF agent itself, say...
>>> you have multiple pgSQL clusters with PAF - thus multiple
>>> (separate, for each pgSQL cluster) masters - and you want to
>>> spread/balance those across the HA cluster
>>> (or, in other words, avoid having more than 1 pgsql master per HA node)
>> ok
>>> These below I've tried; those move the master onto the chosen node, but..
>>> then the issues I mentioned.
>> You just mentioned it breaks the replication, but there is so little
>> information about your architecture and configuration that it's
>> impossible to imagine how this could break the replication.
>>
>> Could you add details about the issues?
>>> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
>>> or
>>> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone
>>> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
>>> require-all=false setoptions score=-1000
>> I suppose a "colocation" constraint is the way to go, not the
>> "location" one.
> This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in my case

No, this is not easy to replicate. I have no idea how you set up your
PostgreSQL replication, nor do I have your full Pacemaker configuration.

Please provide either detailed setups and/or ansible and/or terraform and/or
vagrant, then a detailed scenario showing how it breaks. This is how you can
help and motivate devs to reproduce your issue and work on it. I will not try
to poke around for hours until I find an issue that might not even be the
same as yours.
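For anyone wanting to provide the kind of detail Jehan-Guillaume asks for, the standard pcs/Pacemaker tools below gather most of it (these must run on a live cluster node; the output path and time window are illustrative):

```shell
# Cluster configuration in pcs's own syntax
pcs config

# Raw CIB XML, suitable for attaching to a report or mail
pcs cluster cib > cib.xml

# Bundle configuration, status and logs for a given time window
crm_report --from "2024-01-31 00:00:00" /tmp/cluster-report
```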
Re: [ClusterLabs] trigger something at ?
On Thu, 2024-02-01 at 14:31 +0100, lejeczek via Users wrote:
> On 31/01/2024 18:11, Ken Gaillot wrote:
>> On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote:
>>> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
>>>> On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote:
>>>>> On 29/01/2024 17:22, Ken Gaillot wrote:
>>>>>> On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
>>>>>>> Hi guys.
>>>>>>>
>>>>>>> Is it possible to trigger some... action - I'm thinking specifically
>>>>>>> at shutdown/start.
>>>>>>> If not within the cluster then - if you do that - perhaps outside.
>>>>>>> I would like to create/remove constraints when the cluster starts &
>>>>>>> stops, respectively.
>>>>>>>
>>>>>>> many thanks, L.
>>>>>> You could use node status alerts for that, but it's risky for alert
>>>>>> agents to change the configuration (since that may result in more
>>>>>> alerts and potentially some sort of infinite loop).
>>>>>>
>>>>>> Pacemaker has no concept of a full cluster start/stop, only node
>>>>>> start/stop. You could approximate that by checking whether the node
>>>>>> receiving the alert is the only active node.
>>>>>>
>>>>>> Another possibility would be to write a resource agent that does what
>>>>>> you want and order everything else after it. However, it's even more
>>>>>> risky for a resource agent to modify the configuration.
>>>>>>
>>>>>> Finally, you could write a systemd unit to do what you want and order
>>>>>> it after pacemaker.
>>>>>>
>>>>>> What's wrong with leaving the constraints permanently configured?
>>>>> yes, that would be for a node start/stop
>>>>> I struggle with using constraints to move the pgsql (PAF) master
>>>>> onto a given node - it seems that co/locating PAF's master
>>>>> results in trouble (replication breaks) at/after node
>>>>> shutdown/reboot (not always, but way too often)
>>>> What? What's wrong with colocating PAF's masters exactly? How does it
>>>> break any replication? What are these constraints you are dealing with?
>>>>
>>>> Could you share your configuration?
>>> Constraints beyond/above what is required by the PAF agent itself, say...
>>> you have multiple pgSQL clusters with PAF - thus multiple
>>> (separate, for each pgSQL cluster) masters - and you want to
>>> spread/balance those across the HA cluster
>>> (or, in other words, avoid having more than 1 pgsql master per HA node)
>>>
>>> These below I've tried; those move the master onto the chosen node, but..
>>> then the issues I mentioned.
>>>
>>> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
>>> or
>>> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone
>>> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
>>> require-all=false setoptions score=-1000
>> Anti-colocation sets tend to be tricky currently -- if the first
>> resource can't be assigned to a node, none of them can. We have an idea
>> for a better implementation:
>>
>> https://projects.clusterlabs.org/T383
>>
>> In the meantime, a possible workaround is to use
>> placement-strategy=balanced and define utilization for the clones only.
>> The promoted roles will each get a slight additional utilization, and
>> the cluster should spread them out across nodes whenever possible. I
>> don't know if that will avoid the replication issues, but it may be
>> worth a try.
> using _balanced_ causes a small mayhem for PAF/pgsql:
>
> -> $ pcs property
> Cluster Properties:
>  REDIS-6380_REPL_INFO: ubusrv3
>  REDIS-6381_REPL_INFO: ubusrv2
>  REDIS-6382_REPL_INFO: ubusrv2
>  REDIS-6385_REPL_INFO: ubusrv1
>  REDIS_REPL_INFO: ubusrv1
>  cluster-infrastructure: corosync
>  cluster-name: ubusrv
>  dc-version: 2.1.2-ada5c3b36e2
>  have-watchdog: false
>  last-lrm-refresh: 1706711588
>  placement-strategy: default
>  stonith-enabled: false
>
> -> $ pcs resource utilization PGSQL-PAF-5438 cpu="20"
>
> -> $ pcs property set placement-strategy=balanced
> # when the resource stops, I change it back:
> -> $ pcs property set placement-strategy=default
> and pgSQL/PAF works again
>
> I've not used _utilization_ nor _placement-strategy_ before, so the
> chance that I'm missing something is solid.

See:
https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/utilization.html

You have to define the node capacities as well as the resource utilization.
I'd define the capacities high enough to run all the resources (assuming you
want that capability if all other nodes are in standby or whatever).
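Following the utilization chapter linked above, a minimal version of Ken's advice would define a capacity on every node as well as utilization on each clone - the likely gap in the setup quoted above, which set utilization on only one resource and no node capacities. The cpu numbers are illustrative only; the node capacity just needs to comfortably exceed the sum of all resource utilizations:

```shell
# Give every node an explicit capacity (values are illustrative)
pcs node utilization ubusrv1 cpu=100
pcs node utilization ubusrv2 cpu=100
pcs node utilization ubusrv3 cpu=100

# Each PAF clone consumes a slice of that capacity
pcs resource utilization PGSQL-PAF-5433 cpu=20
pcs resource utilization PGSQL-PAF-5434 cpu=20
pcs resource utilization PGSQL-PAF-5435 cpu=20
pcs resource utilization PGSQL-PAF-5438 cpu=20

# Only then switch the placement strategy
pcs property set placement-strategy=balanced
```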
[ClusterLabs] Gracefully Failing Live Migrations
Sometimes I've tried to move a resource from one node to another, and it
migrates live without a problem. Other times I get

> Failed Resource Actions:
> * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
> status=complete, exitreason='myvm: live migration to node2 failed: 1',
> last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms, exec=35874ms

and I find out the live part of the migration failed when the VM reboots and
an (albeit minor) outage occurs.

Is there a way to configure Pacemaker so that, if it is unable to migrate
live, it simply does not migrate at all?
Re: [ClusterLabs] Gracefully Failing Live Migrations
On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
> Sometimes I've tried to move a resource from one node to another, and
> it migrates live without a problem. Other times I get
>> Failed Resource Actions:
>> * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
>> status=complete, exitreason='myvm: live migration to node2 failed: 1',
>> last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms, exec=35874ms
>
> And I find out the live part of the migration failed, when the vm
> reboots and an (albeit minor) outage occurs.
>
> Is there a way to configure pacemaker, so that if it is unable to
> migrate live it simply does not migrate at all?

No. Pacemaker automatically replaces a required stop/start sequence with live
migration when possible. If a live migration is attempted, by definition the
resource must move one way or another.

Also, live migration involves three steps, and if one of them fails, the
resource is in an unknown state, so it must be restarted anyway.
-- 
Ken Gaillot
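One aside, a trade-off rather than what was asked: if the occasional failed live migration hurts more than losing live migration altogether, agents such as VirtualDomain honor the allow-migrate meta-attribute, and disabling it makes every move a plain stop/start (verify against your resource agent's documentation before relying on this):

```shell
# Turn off live migration for the VM resource entirely,
# so every move becomes a controlled stop on node1 + start on node2
pcs resource meta vm_myvm allow-migrate=false
```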
Re: [ClusterLabs] Gracefully Failing Live Migrations
How do I figure out which of the three steps failed and why?

On Thu, Feb 1, 2024 at 11:15 AM Ken Gaillot wrote:
> On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
>> Sometimes I've tried to move a resource from one node to another, and
>> it migrates live without a problem. Other times I get
>>> Failed Resource Actions:
>>> * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
>>> status=complete, exitreason='myvm: live migration to node2 failed: 1',
>>> last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms, exec=35874ms
>>
>> And I find out the live part of the migration failed, when the vm
>> reboots and an (albeit minor) outage occurs.
>>
>> Is there a way to configure pacemaker, so that if it is unable to
>> migrate live it simply does not migrate at all?
>
> No. Pacemaker automatically replaces a required stop/start sequence
> with live migration when possible. If there is a live migration
> attempted, by definition the resource must move one way or another.
> Also, live migration involves three steps, and if one of them fails,
> the resource is in an unknown state, so it must be restarted anyway.
> --
> Ken Gaillot
Re: [ClusterLabs] Gracefully Failing Live Migrations
On Thu, 2024-02-01 at 12:57 -0600, Billy Croan wrote:
> How do I figure out which of the three steps failed and why?

They're normal resource actions: migrate_to, migrate_from, and stop. You can
investigate them in the usual way (status, logs).

> On Thu, Feb 1, 2024 at 11:15 AM Ken Gaillot wrote:
>> On Thu, 2024-02-01 at 10:20 -0600, Billy Croan wrote:
>>> Sometimes I've tried to move a resource from one node to another, and
>>> it migrates live without a problem. Other times I get
>>>> Failed Resource Actions:
>>>> * vm_myvm_migrate_to_0 on node1 'unknown error' (1): call=102,
>>>> status=complete, exitreason='myvm: live migration to node2 failed: 1',
>>>> last-rc-change='Sat Jan 13 09:13:31 2024', queued=1ms, exec=35874ms
>>>
>>> And I find out the live part of the migration failed, when the vm
>>> reboots and an (albeit minor) outage occurs.
>>>
>>> Is there a way to configure pacemaker, so that if it is unable to
>>> migrate live it simply does not migrate at all?
>>
>> No. Pacemaker automatically replaces a required stop/start sequence
>> with live migration when possible. If there is a live migration
>> attempted, by definition the resource must move one way or another.
>> Also, live migration involves three steps, and if one of them fails,
>> the resource is in an unknown state, so it must be restarted anyway.
-- 
Ken Gaillot
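Concretely, "the usual way" on a systemd-based node might look like this; the grep pattern, unit names and timestamps are illustrative, and crm_mon's --include option needs a reasonably recent Pacemaker:

```shell
# Current failures as seen by the cluster
crm_mon -1 --include=failures

# Pacemaker's log entries around the failed migration actions
journalctl -u pacemaker | grep -E 'migrate_to|migrate_from|vm_myvm'

# For a VM resource, the hypervisor's side of the story
journalctl -u libvirtd --since "2024-01-13 09:00"
```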
Re: [ClusterLabs] trigger something at ?
On 01/02/2024 15:02, Jehan-Guillaume de Rorthais wrote:
> On Wed, 31 Jan 2024 18:23:40 +0100 lejeczek via Users wrote:
>> On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:
>>> On Wed, 31 Jan 2024 16:37:21 +0100 lejeczek via Users wrote:
>>>> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
>>>>> On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote:
>>>>>> On 29/01/2024 17:22, Ken Gaillot wrote:
>>>>>>> On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
>>>>>>>> Hi guys.
>>>>>>>>
>>>>>>>> Is it possible to trigger some... action - I'm thinking specifically
>>>>>>>> at shutdown/start.
>>>>>>>> If not within the cluster then - if you do that - perhaps outside.
>>>>>>>> I would like to create/remove constraints when the cluster starts &
>>>>>>>> stops, respectively.
>>>>>>>>
>>>>>>>> many thanks, L.
>>>>>>> You could use node status alerts for that, but it's risky for alert
>>>>>>> agents to change the configuration (since that may result in more
>>>>>>> alerts and potentially some sort of infinite loop).
>>>>>>>
>>>>>>> Pacemaker has no concept of a full cluster start/stop, only node
>>>>>>> start/stop. You could approximate that by checking whether the node
>>>>>>> receiving the alert is the only active node.
>>>>>>>
>>>>>>> Another possibility would be to write a resource agent that does what
>>>>>>> you want and order everything else after it. However, it's even more
>>>>>>> risky for a resource agent to modify the configuration.
>>>>>>>
>>>>>>> Finally, you could write a systemd unit to do what you want and order
>>>>>>> it after pacemaker.
>>>>>>>
>>>>>>> What's wrong with leaving the constraints permanently configured?
>>>>>> yes, that would be for a node start/stop
>>>>>> I struggle with using constraints to move the pgsql (PAF) master
>>>>>> onto a given node - it seems that co/locating PAF's master
>>>>>> results in trouble (replication breaks) at/after node
>>>>>> shutdown/reboot (not always, but way too often)
>>>>> What? What's wrong with colocating PAF's masters exactly? How does it
>>>>> break any replication? What are these constraints you are dealing with?
>>>>>
>>>>> Could you share your configuration?
>>>> Constraints beyond/above what is required by the PAF agent itself, say...
>>>> you have multiple pgSQL clusters with PAF - thus multiple
>>>> (separate, for each pgSQL cluster) masters - and you want to
>>>> spread/balance those across the HA cluster
>>>> (or, in other words, avoid having more than 1 pgsql master per HA node)
>>> ok
>>>> These below I've tried; those move the master onto the chosen node, but..
>>>> then the issues I mentioned.
>>> You just mentioned it breaks the replication, but there is so little
>>> information about your architecture and configuration that it's
>>> impossible to imagine how this could break the replication.
>>>
>>> Could you add details about the issues?
>>>> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002
>>>> or
>>>> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone
>>>> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
>>>> require-all=false setoptions score=-1000
>>> I suppose a "colocation" constraint is the way to go, not the
>>> "location" one.
>> This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in my case
> No, this is not easy to replicate. I have no idea how you set up your
> PostgreSQL replication, nor do I have your full Pacemaker configuration.
>
> Please provide either detailed setups and/or ansible and/or terraform
> and/or vagrant, then a detailed scenario showing how it breaks. This is
> how you can help and motivate devs to reproduce your issue and work on
> it. I will not try to poke around for hours until I find an issue that
> might not even be the same as yours.

How about you start with the basics - a strange inclination to complicate
things when they are not is what I hear from you - that's what I did when I
"stumbled" upon these "issues".

How about just:
a) do a vanilla-default pgSQL in Ubuntu (or perhaps any other OS of your
choice); I use _pg_createcluster_
b) follow the PAF official guide (a single PAF resource should suffice)

Have a healthy pgSQL cluster, OS-_reboot_ the nodes - play with that; all
should be OK, and moving around/electing the master should work OK.
Then... add and play with "additional" co/location constraints, then OS
reboots - things should begin breaking.

I have a 3-node HA cluster & a 3-node PAF resource = 1 master + 2 slaves.
The only thing I deliberately set, to alleviate pgsql replication, was
_wal_keep_size_ - I increased that, but this is subjective.

It's fine with me if you don't feel like doing this.