Re: [ClusterLabs] pcsd web interface not working on EL 9.3
On 19/02/2024 09:06, Strahil Nikolov via Users wrote: Hi All, Is there a specific setup I missed in order to set up the web interface? Usually, you just log in with the hacluster user on https://fqdn:2224 but when I do a curl, I get an empty response. Best Regards, Strahil Nikolov ___ Was it giving out some content before? I've never done a curl against it, so I won't guess why you would. Yes, it is _empty_ for me too - though I use it behind a reverse proxy. While a curl of the _main_ URL returns no content, this one does return something: -> $ curl https://pcs-ha.mine.priv/ui/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
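A quick, hedged way to confirm pcsd is actually serving the UI is to compare the HTTP status of the root URL against /ui/ and check the listener; the hostname is just an example, port 2224 is the pcsd default:
-> $ curl -ks -o /dev/null -w '%{http_code}\n' https://fqdn:2224/
-> $ curl -ks -o /dev/null -w '%{http_code}\n' https://fqdn:2224/ui/
-> $ ss -tlnp | grep 2224
An empty body on the root URL is not by itself a failure, as long as /ui/ returns content and pcsd is listening on 2224.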
[ClusterLabs] clone_op_key pcmk__notify_key - Triggered fatal assertion
Hi guys. Everything seems to be working OK, yet pacemaker logs:
...
error: clone_op_key: Triggered fatal assertion at pcmk_graph_producer.c:207 : (n_type != NULL) && (n_task != NULL)
error: pcmk__notify_key: Triggered fatal assertion at actions.c:187 : op_type != NULL
error: clone_op_key: Triggered fatal assertion at pcmk_graph_producer.c:207 : (n_type != NULL) && (n_task != NULL)
error: pcmk__notify_key: Triggered fatal assertion at actions.c:187 : op_type != NULL
...
error: pcmk__create_history_xml: Triggered fatal assertion at pcmk_sched_actions.c:1163 : n_type != NULL
error: pcmk__create_history_xml: Triggered fatal assertion at pcmk_sched_actions.c:1164 : n_task != NULL
error: pcmk__notify_key: Triggered fatal assertion at actions.c:187 : op_type != NULL
...
Looks critical - is it? Would you know? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
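If this needs reporting upstream, a hedged starting point is to note the exact Pacemaker version and pull the journal context around those assertions (unit name assumes a standard systemd install):
-> $ pacemakerd --version
-> $ journalctl -u pacemaker --since "1 hour ago" | grep -B5 -A5 'Triggered fatal assertion'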
Re: [ClusterLabs] trigger something at ?
On 31/01/2024 16:37, lejeczek via Users wrote: On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote: On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) What? What's wrong with colocating PAF's masters exactly? How does it brake any replication? What's these constraints you are dealing with? Could you share your configuration? Constraints beyond/above of what is required by PAF agent itself, say... you have multiple pgSQL cluster with PAF - thus multiple (separate, for each pgSQL cluster) masters and you want to spread/balance those across HA cluster (or in other words - avoid having more that 1 pgsql master per HA node) These below, I've tried, those move the master onto chosen node but.. then the issues I mentioned. -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 or -> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000 Wanted to share an observation - not a measurement of anything, I did not take those - of different, latest pgSQL version which I put in place of version 14 which I've been using all this time. (also with that upgrade - from Postgres own repos - came update of PAF) So, with pgSQL ver. 16 and the same of everything else - now paf/pgSQL resources behave a lot lot better, survives just fine all those cases - with ! extra constraints of course - where previously it had replication failures. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] colocation constraint - do I get it all wrong?
On 01/01/2024 18:28, Ken Gaillot wrote: On Fri, 2023-12-22 at 17:02 +0100, lejeczek via Users wrote: hi guys. I have a colocation constraint: -> $ pcs constraint ref DHCPD Resource: DHCPD colocation-DHCPD-GATEWAY-NM-link-INFINITY and the trouble is... I thought DHCPD is to follow GATEWAY-NM-link, always! If that is true that I see very strange behavior, namely. When there is an issue with DHCPD resource, cannot be started, then GATEWAY-NM-link gets tossed around by the cluster. Is that normal & expected - is my understanding of _colocation_ completely wrong - or my cluster is indeed "broken"? many thanks, L. Pacemaker considers the preferences of colocated resources when assigning a resource to a node, to ensure that as many resources as possible can run. So if a colocated resource becomes unable to run on a node, the primary resource might move to allow the colocated resource to run. So what is the way to "fix" this - is it simply low/er score for such constraint? In my case _dhcpd_ is important but if fails sometimes as it's often tampered with, so... make _dhcpd_ flow gateway_link but just fail _dhcp_ (it it keeps failing) and leave _gateway_link_ alone if/where it's good. Or perhaps there a global config/param for whole cluster behaviour? ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
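One hedged option along those lines is to drop the INFINITY colocation in favour of a finite score, so a DHCPD that cannot start only costs points instead of forcing the primary to move; a sketch using the resource names from this thread (the score 500 is arbitrary):
-> $ pcs constraint delete colocation-DHCPD-GATEWAY-NM-link-INFINITY
-> $ pcs constraint colocation add DHCPD with GATEWAY-NM-link 500
With a finite score the cluster still prefers running the two together, but it will no longer relocate GATEWAY-NM-link just because DHCPD keeps failing where it is.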
Re: [ClusterLabs] to heal gfid which is not there -- IGNORE wrong list
On 04/02/2024 12:57, lejeczek via Users wrote: hi guys. So, I've manged to make my volume go haywire, here: -> $ gluster volume heal VMAIL info Brick 10.1.1.100:/devs/00.GLUSTERs/VMAIL Status: Connected Number of entries: 0 Brick 10.1.1.101:/devs/00.GLUSTERs/VMAIL /dovecot-uidlist Status: Connected Number of entries: 1 Brick 10.1.1.99:/devs/00.GLUSTERs/VMAIL-arbiter Status: Connected Number of entries: 0 That _gfid_ nor _dovecot-uidlist_ do not exist on any of the bricks, but I've noticed that all bricks have: .glusterfs-anonymous-inode-462a1850-a31a-4a17-934d-26f3996dc9b8 which is an empty dir. That volume I fear won't heal on its own and might need some intervention - would you know how/where to start? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home:https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] to heal gfid which is not there
hi guys. So, I've managed to make my volume go haywire, here:
-> $ gluster volume heal VMAIL info
Brick 10.1.1.100:/devs/00.GLUSTERs/VMAIL
Status: Connected
Number of entries: 0
Brick 10.1.1.101:/devs/00.GLUSTERs/VMAIL
/dovecot-uidlist
Status: Connected
Number of entries: 1
Brick 10.1.1.99:/devs/00.GLUSTERs/VMAIL-arbiter
Status: Connected
Number of entries: 0
Neither that _gfid_ nor _dovecot-uidlist_ exists on any of the bricks, but I've noticed that all bricks have: .glusterfs-anonymous-inode-462a1850-a31a-4a17-934d-26f3996dc9b8 which is an empty dir. I fear that volume won't heal on its own and might need some intervention - would you know how/where to start? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
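Hedged first steps that are usually safe to try before any manual intervention on the bricks - check the self-heal daemon is up, look at the summary, then kick off a full heal (volume name as above):
-> $ gluster volume status VMAIL
-> $ gluster volume heal VMAIL info summary
-> $ gluster volume heal VMAIL full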
Re: [ClusterLabs] trigger something at ?
On 01/02/2024 15:02, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 18:23:40 +0100 lejeczek via Users wrote: On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:37:21 +0100 lejeczek via Users wrote: On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote: On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) What? What's wrong with colocating PAF's masters exactly? How does it brake any replication? What's these constraints you are dealing with? Could you share your configuration? Constraints beyond/above of what is required by PAF agent itself, say... you have multiple pgSQL cluster with PAF - thus multiple (separate, for each pgSQL cluster) masters and you want to spread/balance those across HA cluster (or in other words - avoid having more that 1 pgsql master per HA node) ok These below, I've tried, those move the master onto chosen node but.. then the issues I mentioned. You just mentioned it breaks the replication, but there so little information about your architecture and configuration, it's impossible to imagine how this could break the replication. Could you add details about the issues ? -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 or -> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000 I suppose "collocation" constraint is the way to go, not the "location" one. This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in my case No, this is not easy to replicate. I have no idea how you setup your PostgreSQL replication, neither I have your full pacemaker configuration. Please provide either detailed setupS and/or ansible and/or terraform and/or vagrant, then a detailed scenario showing how it breaks. This is how you can help and motivate devs to reproduce your issue and work on it. I will not try to poke around for hours until I find an issue that might not even be the same than yours. 
How about you start with the basics - I hear from you a strange inclination to complicate things when they are not complicated - that's what I did when I "stumbled" upon these "issues". How about just:
a) set up vanilla-default pgSQL on Ubuntu (or any other OS of your choice); I use _pg_createcluster_
b) follow the official PAF guide (a single PAF resource should suffice)
With a healthy pgSQL cluster, OS-_reboot_ the nodes and play with that - all should be OK, moving around/electing the master should work fine. Then add and play with "additional" co/location constraints, then OS reboots - things should begin breaking. I have a 3-node HA cluster & a 3-node PAF resource = 1 master + 2 slaves. The only thing I deliberately set, to ease pgsql replication, was _wal_keep_size_ - I increased it, but that is subjective. It's fine with me if you don't feel like doing this. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
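For reference, the two knobs mentioned above would look roughly like this on Ubuntu (a sketch only - the cluster name, port and the 1GB value are arbitrary examples matching the naming used elsewhere in this thread):
-> $ pg_createcluster 16 paf-5438 --port 5438
-> $ echo "wal_keep_size = '1GB'" >> /etc/postgresql/16/paf-5438/postgresql.conf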
Re: [ClusterLabs] trigger something at ?
On 31/01/2024 18:11, Ken Gaillot wrote: On Wed, 2024-01-31 at 16:37 +0100, lejeczek via Users wrote: On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote: On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) What? What's wrong with colocating PAF's masters exactly? How does it brake any replication? What's these constraints you are dealing with? Could you share your configuration? Constraints beyond/above of what is required by PAF agent itself, say... you have multiple pgSQL cluster with PAF - thus multiple (separate, for each pgSQL cluster) masters and you want to spread/balance those across HA cluster (or in other words - avoid having more that 1 pgsql master per HA node) These below, I've tried, those move the master onto chosen node but.. then the issues I mentioned. -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 or -> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000 Anti-colocation sets tend to be tricky currently -- if the first resource can't be assigned to a node, none of them can. We have an idea for a better implementation: https://projects.clusterlabs.org/T383 In the meantime, a possible workaround is to use placement- strategy=balanced and define utilization for the clones only. The promoted roles will each get a slight additional utilization, and the cluster should spread them out across nodes whenever possible. I don't know if that will avoid the replication issues but it may be worth a try. 
using _balanced_ causes a small mayhem to PAF/pgsql: -> $ pcs property Cluster Properties: REDIS-6380_REPL_INFO: ubusrv3 REDIS-6381_REPL_INFO: ubusrv2 REDIS-6382_REPL_INFO: ubusrv2 REDIS-6385_REPL_INFO: ubusrv1 REDIS_REPL_INFO: ubusrv1 cluster-infrastructure: corosync cluster-name: ubusrv dc-version: 2.1.2-ada5c3b36e2 have-watchdog: false last-lrm-refresh: 1706711588 placement-strategy: default stonith-enabled: false -> $ pcs resource utilization PGSQL-PAF-5438 cpu="20" -> $ pcs property set placement-strategy=balanced # when resource stops: I change it back: -> $ pcs property set placement-strategy=default and pgSQL/paf works again I've not used _utilization_ nor _placement-strategy_ before, thus chance that I'm missing something is solid. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
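One thing worth checking here, hedged since the thread does not show it and I may be wrong about the cause: with placement-strategy=balanced a resource that declares a cpu requirement can only land on nodes that advertise cpu capacity, so defining utilization on the clone but on no node might itself explain the "mayhem". A sketch of defining both sides (capacities are arbitrary examples):
-> $ pcs node utilization ubusrv1 cpu=100
-> $ pcs node utilization ubusrv2 cpu=100
-> $ pcs node utilization ubusrv3 cpu=100
-> $ pcs resource utilization PGSQL-PAF-5438 cpu=20
-> $ pcs property set placement-strategy=balanced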
Re: [ClusterLabs] trigger something at ?
On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:37:21 +0100 lejeczek via Users wrote: On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote: On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) What? What's wrong with colocating PAF's masters exactly? How does it brake any replication? What's these constraints you are dealing with? Could you share your configuration? Constraints beyond/above of what is required by PAF agent itself, say... you have multiple pgSQL cluster with PAF - thus multiple (separate, for each pgSQL cluster) masters and you want to spread/balance those across HA cluster (or in other words - avoid having more that 1 pgsql master per HA node) ok These below, I've tried, those move the master onto chosen node but.. then the issues I mentioned. You just mentioned it breaks the replication, but there so little information about your architecture and configuration, it's impossible to imagine how this could break the replication. Could you add details about the issues ? -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 or -> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000 I suppose "collocation" constraint is the way to go, not the "location" one. 
This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in my case -> $ pcs resource config PGSQL-PAF-5438-clone Clone: PGSQL-PAF-5438-clone Meta Attrs: failure-timeout=60s master-max=1 notify=true promotable=true Resource: PGSQL-PAF-5438 (class=ocf provider=heartbeat type=pgsqlms) Attributes: bindir=/usr/lib/postgresql/16/bin datadir=/var/lib/postgresql/16/paf-5438 maxlag=1000 pgdata=/etc/postgresql/16/paf-5438 pgport=5438 Operations: demote interval=0s timeout=120s (PGSQL-PAF-5438-demote-interval-0s) methods interval=0s timeout=5 (PGSQL-PAF-5438-methods-interval-0s) monitor interval=15s role=Master timeout=10s (PGSQL-PAF-5438-monitor-interval-15s) monitor interval=16s role=Slave timeout=10s (PGSQL-PAF-5438-monitor-interval-16s) notify interval=0s timeout=60s (PGSQL-PAF-5438-notify-interval-0s) promote interval=0s timeout=30s (PGSQL-PAF-5438-promote-interval-0s) reload interval=0s timeout=20 (PGSQL-PAF-5438-reload-interval-0s) start interval=0s timeout=60s (PGSQL-PAF-5438-start-interval-0s) stop interval=0s timeout=60s (PGSQL-PAF-5438-stop-interval-0s) so, regarding PAF - 1 master + 2 slaves, have a healthy pqSQL/PAF cluster to begin with, then make resource prefer a specific node (with simplest variant of constraints I tried): -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 and play with it, rebooting node(s) with OS' _reboot_ I at some point, get HA/resource unable to start pgSQL, unable to elect a master (logs saying with replication broken) and I have to "fix" pgSQL cluster outside of PAF, using _pg_basebackup_ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
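For completeness, the manual rebuild mentioned above usually amounts to something like the following on the broken standby - a rough sketch only: stop/ban the resource on that node first, point -h at the current primary (ubusrv1 is just the example from this thread), and "replicator" is an illustrative replication role, not a setting taken from this setup:
-> $ rm -rf /var/lib/postgresql/16/paf-5438/*
-> $ sudo -u postgres pg_basebackup -h ubusrv1 -p 5438 -U replicator -D /var/lib/postgresql/16/paf-5438 -X stream -P
Afterwards the standby recovery settings PAF expects (standby.signal plus primary_conninfo, per the PAF documentation) need to be put back before re-enabling the resource on that node.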
Re: [ClusterLabs] trigger something at ?
On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote: On Wed, 31 Jan 2024 16:02:12 +0100 lejeczek via Users wrote: On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) What? What's wrong with colocating PAF's masters exactly? How does it brake any replication? What's these constraints you are dealing with? Could you share your configuration? Constraints beyond/above of what is required by PAF agent itself, say... you have multiple pgSQL cluster with PAF - thus multiple (separate, for each pgSQL cluster) masters and you want to spread/balance those across HA cluster (or in other words - avoid having more that 1 pgsql master per HA node) These below, I've tried, those move the master onto chosen node but.. then the issues I mentioned. -> $ pcs constraint location PGSQL-PAF-5438-clone prefers ubusrv1=1002 or -> $ pcs constraint colocation set PGSQL-PAF-5435-clone PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master require-all=false setoptions score=-1000 ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] trigger something at ?
On 29/01/2024 17:22, Ken Gaillot wrote: On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote: Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L. You could use node status alerts for that, but it's risky for alert agents to change the configuration (since that may result in more alerts and potentially some sort of infinite loop). Pacemaker has no concept of a full cluster start/stop, only node start/stop. You could approximate that by checking whether the node receiving the alert is the only active node. Another possibility would be to write a resource agent that does what you want and order everything else after it. However it's even more risky for a resource agent to modify the configuration. Finally you could write a systemd unit to do what you want and order it after pacemaker. What's wrong with leaving the constraints permanently configured? yes, that would be for a node start/stop I struggle with using constraints to move pgsql (PAF) master onto a given node - seems that co/locating paf's master results in troubles (replication brakes) at/after node shutdown/reboot (not always, but way too often) Ideally I'm hoping that: at node stop, stopping node could check if it's PAF's master and if yes so then remove given constraints ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
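A hedged sketch of that idea as a systemd unit ordered after pacemaker, so its ExecStop runs while the local cluster stack is still up; the unit name, script path and constraint id are illustrative only and none of this is tested:
# /etc/systemd/system/paf-constraints.service
[Unit]
Description=Drop extra PAF constraint before pacemaker stops on this node
After=pacemaker.service
Wants=pacemaker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/local/bin/paf-constraints-stop.sh

[Install]
WantedBy=multi-user.target

# /usr/local/bin/paf-constraints-stop.sh
#!/bin/sh
# If this node currently hosts the promoted PAF instance, remove the extra
# location preference before pacemaker shuts down here.
me=$(crm_node -n)
if crm_resource --locate -r PGSQL-PAF-5438 2>/dev/null | grep -iE 'promoted|master' | grep -qw "$me"; then
    pcs constraint delete <constraint-id>   # id of the "prefers ..." constraint, placeholder
fi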
[ClusterLabs] trigger something at ?
Hi guys. Is it possible to trigger some... action - I'm thinking specifically at shutdown/start. If not within the cluster then - if you do that - perhaps outside. I would like to create/remove constraints, when cluster starts & stops, respectively. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] replica vol when a peer is lost/offed & qcow2 ?
Hi guys. I wonder if you might have any tips/tweaks for the volume/cluster to make it more resilient - more accommodating - to qcow2 files when a peer is lost or missing? I have a 3-peer cluster/volume: 2 + 1 arbiter, and my experience is that when all is good then... well, all is good, but when one peer is lost - even if it's the arbiter - then the VMs begin to suffer from, and report, filesystem errors. Perhaps it's not the volume's properties/config but the whole cluster? Or perhaps a 3-peer volume is not good for things such as qcow2s? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
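One commonly suggested starting point - hedged, verify each option it applies against your gluster version before relying on it - is the stock "virt" profile, which bundles the settings usually recommended for VM image workloads (<volume> is a placeholder for the volume name):
-> $ gluster volume set <volume> group virt
-> $ gluster volume get <volume> all | grep -E 'quorum|ping-timeout|shard'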
[ClusterLabs] colocation constraint - do I get it all wrong?
hi guys. I have a colocation constraint: -> $ pcs constraint ref DHCPD Resource: DHCPD colocation-DHCPD-GATEWAY-NM-link-INFINITY and the trouble is... I thought DHCPD is to follow GATEWAY-NM-link, always! If that is true that I see very strange behavior, namely. When there is an issue with DHCPD resource, cannot be started, then GATEWAY-NM-link gets tossed around by the cluster. Is that normal & expected - is my understanding of _colocation_ completely wrong - or my cluster is indeed "broken"? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] colocate Redis - weird
On 19/12/2023 19:13, lejeczek via Users wrote: hi guys, Is this below not the weirdest thing? -> $ pcs constraint ref PGSQL-PAF-5435 Resource: PGSQL-PAF-5435 colocation-HA-10-1-1-84-PGSQL-PAF-5435-clone-INFINITY colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory-1 colocation_set_PePePe Here Redis master should folow pgSQL master. Which such constraint: -> $ pcs resource status PGSQL-PAF-5435 * Clone Set: PGSQL-PAF-5435-clone [PGSQL-PAF-5435] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs resource status REDIS-6385-clone * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable): * Unpromoted: [ ubusrv1 ubusrv2 ubusrv3 ] If I remove that constrain: -> $ pcs constraint delete colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY -> $ pcs resource status REDIS-6385-clone * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] and ! I can manually move Redis master around, master moves to each server just fine. I again, add that constraint: -> $ pcs constraint colocation add master REDIS-6385-clone with master PGSQL-PAF-5435-clone and the same... What there might be about that one node - resource removed, created anew and cluster insists on keeping master there. I can manually move the master anywhere but if I _clear_ the resource, no constraints then cluster move it back to the same node. I wonder about: a) "transient" node attrs & b) if this cluster is somewhat broken. On a) - can we read more about those somewhere?(not the code/internals) thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
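On a): transient attributes live in the CIB's status section (they disappear when the node restarts), and they can be queried per node without reading the internals - a hedged example using the promotion score attribute from this thread:
-> $ crm_attribute --node ubusrv1 --name master-REDIS-6385 --lifetime reboot --query
-> $ attrd_updater --name master-REDIS-6385 --node ubusrv1 --query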
[ClusterLabs] colocate Redis - weird
hi guys, Is this below not the weirdest thing? -> $ pcs constraint ref PGSQL-PAF-5435 Resource: PGSQL-PAF-5435 colocation-HA-10-1-1-84-PGSQL-PAF-5435-clone-INFINITY colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory order-PGSQL-PAF-5435-clone-HA-10-1-1-84-Mandatory-1 colocation_set_PePePe Here Redis master should folow pgSQL master. Which such constraint: -> $ pcs resource status PGSQL-PAF-5435 * Clone Set: PGSQL-PAF-5435-clone [PGSQL-PAF-5435] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs resource status REDIS-6385-clone * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable): * Unpromoted: [ ubusrv1 ubusrv2 ubusrv3 ] If I remove that constrain: -> $ pcs constraint delete colocation-REDIS-6385-clone-PGSQL-PAF-5435-clone-INFINITY -> $ pcs resource status REDIS-6385-clone * Clone Set: REDIS-6385-clone [REDIS-6385] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] and ! I can manually move Redis master around, master moves to each server just fine. I again, add that constraint: -> $ pcs constraint colocation add master REDIS-6385-clone with master PGSQL-PAF-5435-clone and the same... ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] resource-agents and VMs
Hi guys. My resource-agents dependencies look like so:
resource-agents-deps.target
○ ├─00\x2dVMsy.mount
● └─virt-guest-shutdown.target
When I reboot a node, VMs seem to migrate off it live OK, but when the node comes back up after a reboot, VMs fail to migrate back to it live. On such a node I see:
-> $ journalctl -lf -o cat -u virtqemud.service
Starting Virtualization qemu daemon...
Started Virtualization qemu daemon.
libvirt version: 9.5.0, package: 6.el9 (buil...@centos.org, 2023-08-25-08:53:56, ) hostname: dzien.mine.priv
Path '/00-VMsy/enc.podnode3.qcow2' is not accessible: No such file or directory
and I wonder if it's indeed the fact that the _path_ is absent at the moment, just after node start, when the cluster tries to migrate the VM resource... Is it possible to somehow - seemingly my _resource-agents-deps.target_ does not do this - assure that the cluster, perhaps on a per-resource basis, waits for/checks a path first? BTW, that path is available and is made sure to be available to the system; it's a glusterfs mount. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
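Pacemaker's systemd integration docs suggest hard-wiring such filesystem dependencies into resource-agents-deps.target via a drop-in; a hedged sketch using the mount unit from the listing above (the ○ marker suggests that mount unit was not actually active, so Requires= plus explicit ordering, rather than a plain Wants=, may be what is missing):
-> $ systemctl edit resource-agents-deps.target
[Unit]
Requires=00\x2dVMsy.mount
After=00\x2dVMsy.mount
For a glusterfs fstab mount it may also help to mark the entry _netdev so it is ordered after the network is up.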
Re: [ClusterLabs] how to colocate promoted resources ?
On 08/12/2023 13:25, Jehan-Guillaume de Rorthais wrote: Hi, On Wed, 6 Dec 2023 10:36:39 +0100 lejeczek via Users wrote: How do you colocate your promoted resources with balancing the underlying resources as the priority? What do you mean? With a simple scenario, say 3 nodes and 3 pgSQL clusters, what would be the best possible way - I'm thinking the most gentle at the same time, if that makes sense. I'm not sure it answers your question (as I don't understand it), but here is a doc explaining how to create and move two IPs supposed to each start on secondaries, avoiding the primary node if possible, as long as secondary nodes exist: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#adding-ips-on-standbys-nodes ++ Apologies, perhaps I was quite vague. I was thinking - having a 3-node HA cluster and a 3-node single-master->slaves pgSQL setup - say I want the pgSQL masters to spread across the HA cluster, so in theory - each HA node being identical hardware-wise - the masters' resources would nicely balance out across the HA cluster. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
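One hedged way to express that "spread the masters" intent without pinning anything to a particular node is a weak anti-colocation between each pair of promoted roles; a sketch using the resource names that appear elsewhere in this archive, with illustrative scores:
-> $ pcs constraint colocation add master PGSQL-PAF-5434-clone with master PGSQL-PAF-5433-clone -200
-> $ pcs constraint colocation add master PGSQL-PAF-5435-clone with master PGSQL-PAF-5433-clone -200
-> $ pcs constraint colocation add master PGSQL-PAF-5435-clone with master PGSQL-PAF-5434-clone -200
A finite negative score nudges the masters apart when possible but, unlike -INFINITY, still lets two share a node if the cluster has no other choice.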
Re: [ClusterLabs] ethernet link up/down - ?
On 04/12/2023 20:58, Reid Wahl wrote: On Thu, Nov 30, 2023 at 10:30 AM lejeczek via Users wrote: On 07/02/2022 20:09, lejeczek via Users wrote: Hi guys How do you guys go about doing link up/down as a resource? many thanks, L. With simple tests I confirmed that indeed Linux - on my hardware at leat - can easily power down&up an eth link - if a @devel reads this: Is there an agent in the suite which a non-programmer could easily (for most safely) adopt for such purpose? I understand such agent has to be cloneable & promotable. The iface-bridge resource appears to do something similar for bridges. I don't see anything currently for links in general. Where can I find that agent? Any comment on that idea about adding, introducing such "link" agent into agents in the future? Should I go _github_ and suggest it there perhaps? Naturally done by "devel" it would be ideal, as opposed to, by us user/admins. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
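For reference, iface-bridge ships with the resource-agents package; a quick way to look at it and its parameters (paths assume the standard OCF layout):
-> $ pcs resource describe ocf:heartbeat:iface-bridge
-> $ ls /usr/lib/ocf/resource.d/heartbeat/ | grep iface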
[ClusterLabs] how to colocate promoted resources ?
Hi guys. How do you colocate your promoted resources with balancing the underlying resources as the priority? With a simple scenario, say 3 nodes and 3 pgSQL clusters, what would be the best possible way - I'm thinking the most gentle at the same time, if that makes sense. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] make promoted follow promoted resource ?
On 26/11/2023 12:20, Reid Wahl wrote: On Sun, Nov 26, 2023 at 1:32 AM lejeczek via Users wrote: Hi guys. With these: -> $ pcs resource status REDIS-6381-clone * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable): * Promoted: [ ubusrv2 ] * Unpromoted: [ ubusrv1 ubusrv3 ] -> $ pcs resource status PGSQL-PAF-5433-clone * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs constraint ref REDIS-6381-clone Resource: REDIS-6381-clone colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY basically promoted Redis should follow promoted pgSQL but it's not happening, usually it does. I presume pcs/cluster does something internally which results in disobeying/ignoring that _colocation_ constraint for these resources. I presume scoring might play a role: REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) (with-rsc-role:Master) but usually, that scoring works, only "now" it does not. Any comments I appreciate much. thanks, L. I looked at pamaker log - snippet below after REDIS-6381-clone re-enabled - but cannot see explanation for this. ... notice: Calculated transition 110, saving inputs in /var/lib/pacemaker/pengine/pe-input-3729.bz2 notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete notice: State transition S_TRANSITION_ENGINE -> S_IDLE notice: State transition S_IDLE -> S_POLICY_ENGINE notice: Actions: Start REDIS-6381:0 ( ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv3 ) notice: Actions: Start REDIS-6381:2 ( ubusrv1 ) notice: Calculated transition 111, saving inputs in /var/lib/pacemaker/pengine/pe-input-3730.bz2 notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2 notice: Requesting local execution of start operation for REDIS-6381 on ubusrv2 (to redis) root on none pam_unix(su:session): session opened for user redis(uid=127) by (uid=0) pam_sss(su:session): Request to sssd failed. Connection refused pam_unix(su:session): session closed for user redis pam_sss(su:session): Request to sssd failed. 
Connection refused notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000 notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create master-REDIS-6381=1000: Transient attribute change INFO: demote: Setting master to 'no-such-master' notice: Result of start operation for REDIS-6381 on ubusrv2: ok notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped notice: Actions: PromoteREDIS-6381:0 ( Unpromoted -> Promoted ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv1 ) notice: Actions: Start REDIS-6381:2 ( ubusrv3 ) notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-3731.bz2 notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating start operation REDIS-6381_start_0 on ubusrv1 notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3 notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2 notice: Requesting local execution of promote operation for REDIS-6381 on ubusrv2 notice: Result of promote operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_promote_0 on ubusrv3 notice: Result of notify op
[ClusterLabs] IPaddr2 Started (disabled) ?
hi guys. A cluster thinks the resource is up: ... * HA-10-1-1-80 (ocf:heartbeat:IPaddr2): Started ubusrv3 (disabled) .. while it is not the case. What might it mean? Config is simple: -> $ pcs resource config HA-10-1-1-80 Resource: HA-10-1-1-80 (class=ocf provider=heartbeat type=IPaddr2) Attributes: cidr_netmask=24 ip=10.1.1.80 Meta Attrs: failure-timeout=20s target-role=Stopped Operations: monitor interval=10s timeout=20s (HA-10-1-1-80-monitor-interval-10s) start interval=0s timeout=20s (HA-10-1-1-80-start-interval-0s) stop interval=0s timeout=20s (HA-10-1-1-80-stop-interval-0s) I expected, well.. cluster to behave. This results in a trouble - because of that exactly I think - for when resource is re-enabled then that IP/resource is not instantiated. Is its monitor not working? Is there a way to "harden" cluster perhaps via resource config? cluster is Ubuntu VM boxes. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
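When the cluster's view and reality diverge like that, a hedged first step is to force a fresh probe and run the agent's monitor by hand on the node in question, then check whether the address is really absent:
-> $ pcs resource refresh HA-10-1-1-80
-> $ crm_resource --force-check -r HA-10-1-1-80
-> $ ip -br addr show | grep 10.1.1.80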
Re: [ClusterLabs] ethernet link up/down - ?
On 07/02/2022 20:09, lejeczek via Users wrote: Hi guys How do you guys go about doing link up/down as a resource? many thanks, L. With simple tests I confirmed that indeed Linux - on my hardware at leat - can easily power down&up an eth link - if a @devel reads this: Is there an agent in the suite which a non-programmer could easily (for most safely) adopt for such purpose? I understand such agent has to be cloneable & promotable. btw. I think many would consider that a really neat & useful addition to resource-agents - if authors/devel made it into the core package. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
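For reference, the "simple tests" boil down to something like this (the interface name is just an example):
-> $ ip link set dev eno1 down
-> $ ethtool eno1 | grep 'Link detected'
-> $ ip link set dev eno1 up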
Re: [ClusterLabs] ethernet link up/down - ?
On 16/02/2022 10:37, Klaus Wenninger wrote: On Tue, Feb 15, 2022 at 5:25 PM lejeczek via Users wrote: On 07/02/2022 19:21, Antony Stone wrote: > On Monday 07 February 2022 at 20:09:02, lejeczek via Users wrote: > >> Hi guys >> >> How do you guys go about doing link up/down as a resource? > I apply or remove addresses on the interface, using "IPaddr2" and "IPv6addr", > which I know is not the same thing. > > Why do you separately want to control link up/down? I can't think what I > would use this for. Just out of curiosity and as I haven't seen an answer in the thread yet - maybe I overlooked something ... Is this to control some link-triggered redundancy setup with switches? Revisiting my own question/thread. Yes. Very close to what Klaus wondered - it's a device over which I have no control and from that device perspective it's simply - link is up then I'll "serve" it. I've been thinking lowest possible layer shall be the safest way - thus asked about controlling eth link that way: down/up by means of electric power, ideally. As opposed to ha-cluster calling some middle men such as network managers. I read some eth nics/drivers can power down a port. Is there an agent & a way to do that? > > > Antony. > Kind of similar - tcp/ip and those layers configs are delivered by DHCP. I'd think it would have to be a clone resource with one master without any constraints where cluster freely decides where to put master(link up) on - which is when link gets dhcp-served. But I wonder if that would mean writing up a new resource - I don't think there is anything like that included in ready-made pcs/ocf packages. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
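In the absence of a ready-made agent, here is a very rough, untested OCF-style skeleton just to illustrate the shape such a "link" agent could take - full metadata, parameter validation and promotable logic are deliberately left out, and none of this is part of the official resource-agents package:
#!/bin/sh
# ethlink - illustrative sketch only, NOT a supported resource-agents member
: ${OCF_ROOT=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

IFACE="${OCF_RESKEY_interface}"

link_status() {
    # carrier reads 1 when the link is up; reading it on an administratively
    # downed interface errors out, which we treat as "not running"
    [ "$(cat /sys/class/net/${IFACE}/carrier 2>/dev/null)" = "1" ] \
        && return $OCF_SUCCESS || return $OCF_NOT_RUNNING
}

case "$1" in
    start)     ip link set dev "$IFACE" up   && exit $OCF_SUCCESS; exit $OCF_ERR_GENERIC ;;
    stop)      ip link set dev "$IFACE" down && exit $OCF_SUCCESS; exit $OCF_ERR_GENERIC ;;
    monitor)   link_status; exit $? ;;
    meta-data) printf '<resource-agent name="ethlink"/>\n'; exit $OCF_SUCCESS ;;
    *)         exit $OCF_ERR_UNIMPLEMENTED ;;
esac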
Re: [ClusterLabs] make promoted follow promoted resource ?
On 26/11/2023 17:44, Andrei Borzenkov wrote: On 26.11.2023 12:32, lejeczek via Users wrote: Hi guys. With these: -> $ pcs resource status REDIS-6381-clone * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable): * Promoted: [ ubusrv2 ] * Unpromoted: [ ubusrv1 ubusrv3 ] -> $ pcs resource status PGSQL-PAF-5433-clone * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs constraint ref REDIS-6381-clone Resource: REDIS-6381-clone colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY basically promoted Redis should follow promoted pgSQL but it's not happening, usually it does. I presume pcs/cluster does something internally which results in disobeying/ignoring that _colocation_ constraint for these resources. I presume scoring might play a role: REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) (with-rsc-role:Master) but usually, that scoring works, only "now" it does not. Any comments I appreciate much. thanks, L. I looked at pamaker log - snippet below after REDIS-6381-clone re-enabled - but cannot see explanation for this. ... notice: Calculated transition 110, saving inputs in /var/lib/pacemaker/pengine/pe-input-3729.bz2 notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete notice: State transition S_TRANSITION_ENGINE -> S_IDLE notice: State transition S_IDLE -> S_POLICY_ENGINE notice: Actions: Start REDIS-6381:0 ( ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv3 ) notice: Actions: Start REDIS-6381:2 ( ubusrv1 ) notice: Calculated transition 111, saving inputs in /var/lib/pacemaker/pengine/pe-input-3730.bz2 notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2 notice: Requesting local execution of start operation for REDIS-6381 on ubusrv2 (to redis) root on none pam_unix(su:session): session opened for user redis(uid=127) by (uid=0) pam_sss(su:session): Request to sssd failed. Connection refused pam_unix(su:session): session closed for user redis pam_sss(su:session): Request to sssd failed. Connection refused notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000 This is the only line that sets master score, so apparently ubusrv2 is the only node where your clone *can* be promoted. Whether pacemaker is expected to fail this operation because it violates constraint I do not know. 
notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create master-REDIS-6381=1000: Transient attribute change INFO: demote: Setting master to 'no-such-master' notice: Result of start operation for REDIS-6381 on ubusrv2: ok notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped notice: Actions: Promote REDIS-6381:0 ( Unpromoted -> Promoted ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv1 ) notice: Actions: Start REDIS-6381:2 ( ubusrv3 ) notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-3731.bz2 notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating start operation REDIS-6381_start_0 on ubusrv1 notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3 notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2 notice: Requesting local execution of promote operation for REDIS-6381 on ubusrv2 notice: Result of promote operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation fo
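A hedged way to confirm Andrei's observation on the live cluster - i.e. which nodes actually carry a Redis master score - without digging through logs is to dump the scheduler's scores:
-> $ crm_simulate -sL | grep REDIS-6381
The per-node attribute can also be queried directly with crm_attribute using --lifetime reboot, since master scores are transient attributes.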
Re: [ClusterLabs] make promoted follow promoted resource ?
On 26/11/2023 10:32, lejeczek via Users wrote: Hi guys. With these: -> $ pcs resource status REDIS-6381-clone * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable): * Promoted: [ ubusrv2 ] * Unpromoted: [ ubusrv1 ubusrv3 ] -> $ pcs resource status PGSQL-PAF-5433-clone * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs constraint ref REDIS-6381-clone Resource: REDIS-6381-clone colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY basically promoted Redis should follow promoted pgSQL but it's not happening, usually it does. I presume pcs/cluster does something internally which results in disobeying/ignoring that _colocation_ constraint for these resources. I presume scoring might play a role: REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) (with-rsc-role:Master) but usually, that scoring works, only "now" it does not. Any comments I appreciate much. thanks, L. I looked at pamaker log - snippet below after REDIS-6381-clone re-enabled - but cannot see explanation for this. ... notice: Calculated transition 110, saving inputs in /var/lib/pacemaker/pengine/pe-input-3729.bz2 notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete notice: State transition S_TRANSITION_ENGINE -> S_IDLE notice: State transition S_IDLE -> S_POLICY_ENGINE notice: Actions: Start REDIS-6381:0 ( ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv3 ) notice: Actions: Start REDIS-6381:2 ( ubusrv1 ) notice: Calculated transition 111, saving inputs in /var/lib/pacemaker/pengine/pe-input-3730.bz2 notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2 notice: Requesting local execution of start operation for REDIS-6381 on ubusrv2 (to redis) root on none pam_unix(su:session): session opened for user redis(uid=127) by (uid=0) pam_sss(su:session): Request to sssd failed. Connection refused pam_unix(su:session): session closed for user redis pam_sss(su:session): Request to sssd failed. 
Connection refused notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000 notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create master-REDIS-6381=1000: Transient attribute change INFO: demote: Setting master to 'no-such-master' notice: Result of start operation for REDIS-6381 on ubusrv2: ok notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped notice: Actions: Promote REDIS-6381:0 ( Unpromoted -> Promoted ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv1 ) notice: Actions: Start REDIS-6381:2 ( ubusrv3 ) notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-3731.bz2 notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating start operation REDIS-6381_start_0 on ubusrv1 notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3 notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2 notice: Requesting local execution of promote operation for REDIS-6381 on ubusrv2 notice: Result of promote operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Setting master-REDIS-6381[ubusrv3]: (unset) -> 1 notice:
[ClusterLabs] make promoted follow promoted resource ?
Hi guys. With these: -> $ pcs resource status REDIS-6381-clone * Clone Set: REDIS-6381-clone [REDIS-6381] (promotable): * Promoted: [ ubusrv2 ] * Unpromoted: [ ubusrv1 ubusrv3 ] -> $ pcs resource status PGSQL-PAF-5433-clone * Clone Set: PGSQL-PAF-5433-clone [PGSQL-PAF-5433] (promotable): * Promoted: [ ubusrv1 ] * Unpromoted: [ ubusrv2 ubusrv3 ] -> $ pcs constraint ref REDIS-6381-clone Resource: REDIS-6381-clone colocation-REDIS-6381-clone-PGSQL-PAF-5433-clone-INFINITY basically promoted Redis should follow promoted pgSQL but it's not happening, usually it does. I presume pcs/cluster does something internally which results in disobeying/ignoring that _colocation_ constraint for these resources. I presume scoring might play a role: REDIS-6385-clone with PGSQL-PAF-5435-clone (score:1001) (rsc-role:Master) (with-rsc-role:Master) but usually, that scoring works, only "now" it does not. Any comments I appreciate much. thanks, L. I looked at pamaker log - snippet below after REDIS-6381-clone re-enabled - but cannot see explanation for this. ... notice: Calculated transition 110, saving inputs in /var/lib/pacemaker/pengine/pe-input-3729.bz2 notice: Transition 110 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3729.bz2): Complete notice: State transition S_TRANSITION_ENGINE -> S_IDLE notice: State transition S_IDLE -> S_POLICY_ENGINE notice: Actions: Start REDIS-6381:0 ( ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv3 ) notice: Actions: Start REDIS-6381:2 ( ubusrv1 ) notice: Calculated transition 111, saving inputs in /var/lib/pacemaker/pengine/pe-input-3730.bz2 notice: Initiating start operation REDIS-6381_start_0 locally on ubusrv2 notice: Requesting local execution of start operation for REDIS-6381 on ubusrv2 (to redis) root on none pam_unix(su:session): session opened for user redis(uid=127) by (uid=0) pam_sss(su:session): Request to sssd failed. Connection refused pam_unix(su:session): session closed for user redis pam_sss(su:session): Request to sssd failed. 
Connection refused notice: Setting master-REDIS-6381[ubusrv2]: (unset) -> 1000 notice: Transition 111 aborted by status-2-master-REDIS-6381 doing create master-REDIS-6381=1000: Transient attribute change INFO: demote: Setting master to 'no-such-master' notice: Result of start operation for REDIS-6381 on ubusrv2: ok notice: Transition 111 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=14, Source=/var/lib/pacemaker/pengine/pe-input-3730.bz2): Stopped notice: Actions: Promote REDIS-6381:0 ( Unpromoted -> Promoted ubusrv2 ) notice: Actions: Start REDIS-6381:1 ( ubusrv1 ) notice: Actions: Start REDIS-6381:2 ( ubusrv3 ) notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-3731.bz2 notice: Initiating notify operation REDIS-6381_pre_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating start operation REDIS-6381_start_0 on ubusrv1 notice: Initiating start operation REDIS-6381:2_start_0 on ubusrv3 notice: Initiating notify operation REDIS-6381_post_notify_start_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_start_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_start_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_pre_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_pre_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Initiating promote operation REDIS-6381_promote_0 locally on ubusrv2 notice: Requesting local execution of promote operation for REDIS-6381 on ubusrv2 notice: Result of promote operation for REDIS-6381 on ubusrv2: ok notice: Initiating notify operation REDIS-6381_post_notify_promote_0 locally on ubusrv2 notice: Requesting local execution of notify operation for REDIS-6381 on ubusrv2 notice: Initiating notify operation REDIS-6381_post_notify_promote_0 on ubusrv1 notice: Initiating notify operation REDIS-6381:2_post_notify_promote_0 on ubusrv3 notice: Result of notify operation for REDIS-6381 on ubusrv2: ok notice: Setting master-REDIS-6381[ubusrv3]: (unset) -> 1 notice: Transition 112 aborted by status-3-master-REDIS-6381 doing create master-REDIS-6381=1: Transient attrib
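For reference, a promoted-with-promoted colocation of the kind referenced above is typically declared with pcs along these lines (resource names taken from the thread; exact role keywords vary a little between pcs versions, older ones still use Master):

-> $ pcs constraint colocation add Promoted REDIS-6381-clone with Promoted PGSQL-PAF-5433-clone INFINITY

Scores from other constraints, stickiness and the agents' own promotion scores are still combined with this, which is part of why the observed placement can differ from what a single constraint seems to demand.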
Re: [ClusterLabs] [EXT] moving VM live fails?
On 24/11/2023 08:33, Windl, Ulrich wrote: Hi! So you have different CPUs in the cluster? We once had a similar situation with Xen using live migration: Migration failed, and the cluster "wasn't that smart" handling the situation. The solution was (with the help of support) to add some CPU flags masking in the VM configuration so that the newer CPU features were not used. Before migration worked from the older to the newer CPU and back if the VM had been started on the older CPU initially, but when it had been started on the newer CPU it could not migrate to the older one. I vaguely remember the details but the hardware was a ProLiant 380 G7 vs. 380 G8 or so. My recommendation is: If you want to size a cluster, but all the nodes you need, because if you want to add another node maybe in two years or so, the older model may not be available any more. IT lifetime is short. Regards, Ulrich -Original Message- From: Users On Behalf Of lejeczek via Users Sent: Friday, November 17, 2023 12:55 PM To: users@clusterlabs.org Cc: lejeczek Subject: [EXT] [ClusterLabs] moving VM live fails? Hi guys. I have a resource which when asked to 'move' then it fails with: virtqemud[3405456]: operation failed: guest CPU doesn't match specification: missing features: xsave but VM domain does not require (nor disable) the feature: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ This must go down - same environment - all the way, to the kernel. I knew that but forgotten - one box did change default kernel & after reboot booted with Centos's default as opposed to other boxes running with ver. 6.x. So, my bad, not really an issue. Thanks,. L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
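Ulrich's CPU-masking suggestion usually means pinning the guest to a baseline CPU definition in the domain XML and disabling the feature the older host lacks; a hedged sketch, where the model name is only a placeholder for whatever baseline suits the older box:

<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Westmere</model>
  <feature policy='disable' name='xsave'/>
</cpu>

With the feature masked the guest never advertises xsave, so it can migrate in both directions, at the price of hiding newer-CPU features even when it happens to run on the newer host.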
[ClusterLabs] node orderly shutdown
Hi guys. Having a node with a couple of _promoted_ resources - when such a node is OS-shutdown in an orderly manner it seems that the cluster takes a while. By a "while" I mean longer than I'd expect a relatively simple 3-node cluster to need to move/promote a few _promoted_ resources: redis, postgresql, IP onto another node. Is there somewhere one can look, tweak, measure & troubleshoot, in order to "fix" this, if possible at all? From watching such a "promoted" node I see that, as systemd stops all services on the way to the power-down target, it's _pacemaker_ which systemd stops last and which takes a bit longer before shutdown completes. Or perhaps you have another approach / technique - more than one, even - to shutting down a node with promoted resources, a better one? many thanks,___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
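One approach that tends to smooth this out, offered as a hint rather than a verified fix: put the node in standby first, let Pacemaker demote and relocate everything while the node is still fully up, and only then power it off:

-> $ pcs node standby <node>
(wait until 'pcs status' shows the promoted roles elsewhere)
-> $ systemctl poweroff
-> $ pcs node unstandby <node>   # after the node has rebooted and rejoined

That way the demote/promote work is not racing the rest of the systemd shutdown sequence.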
[ClusterLabs] non-existent attribute ?
Hi guys. My 3-node cluster had one node absent for a long time and now that it's back I cannot get _mariadb_ to start on that node. ... * MARIADB (ocf:heartbeat:galera): ORPHANED Stopped ... * MARIADB-last-committed : 147 * MARIADB-safe-to-bootstrap : 0 I wanted to start with a _cleanup_ of the node's attributes - as mariadb on that node starts & runs OK outside of ha-pcs - but I get: -> $ pcs node attribute dzien MARIADB-last-committed='' Error: attribute: 'MARIADB-last-committed' doesn't exist for node: 'dzien' Is it possible to clean that up safely? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
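A hedged sketch of how such leftover galera bookkeeping attributes can usually be removed when pcs refuses to (node and attribute names as in the thread):

-> $ crm_attribute --node dzien --name MARIADB-last-committed --delete
-> $ crm_attribute --node dzien --name MARIADB-safe-to-bootstrap --delete
-> $ pcs resource cleanup MARIADB

crm_attribute operates on the CIB nvpair directly; add --lifetime reboot if the attribute turns out to live in the transient (status) section rather than the permanent one.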
Re: [ClusterLabs] [EXT] Re: PAF / pgSQL fails after OS/system shutdown - FIX
On 13/11/2023 13:08, Jehan-Guillaume de Rorthais via Users wrote: On Mon, 13 Nov 2023 11:39:45 + "Windl, Ulrich" wrote: But shouldn't the RA check for that (and act appropriately)? Interesting. I'm open to discuss this. Below my thoughts so far. Why the RA should check that? There's so many way to setup the system and PostgreSQL, where should the RA stop checking for all possible way to break it? The RA checks various (maybe too many) things related to the instance itself already. I know various other PostgreSQL setups that would trigger errors in the cluster if the dba doesn't check everything is correct. I'm really reluctant to add add a fair amount of code in the RA to correctly parse and check the complex PostgreSQL's setup. This would add complexity and bugs. Or maybe I could add a specific OCF_CHECK_LEVEL sysadmins can trigger by hand before starting the cluster. But I wonder if it worth the pain, how many people will know about this and actually run it? The problem here is that few users actually realize how the postgresql-common wrapper works and what it actually does behind your back. I really appreciate this wrapper, I do. But when you setup a Pacemaker cluster, you either have to bend to it when setting up PAF (as documented), or avoid it completely. PAF is all about drawing a clear line between the sysadmin job and the dba one. Dba must build a cluster of instances ready to start/replicate with standard binaries (not wrappers) before sysadmin can set up the resource in your cluster. Thoughts? I would be the same/similar mind - which is - adding more code to account for more/all _config_ cases may not be the healthiest approach, however!! Just like some code-writers out there in around the globe here too, some do seem to _not_ appreciate or seem to completely ignore, parts which are meant do alleviate the TCO burden of any software - documentation best bits which for Unix/Linux... are MAN PAGES. This is rhetorical but I'll ask - what is a project worth when its own man pages miss critical _gotchas_ and similar gimmicks. If you read here are are a prospective/future coder-writer for Linux please make such note - your 'man pages" ought to be at least!! as good as your code. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] moving VM live fails?
Hi guys. I have a resource which, when asked to 'move', fails with: virtqemud[3405456]: operation failed: guest CPU doesn't match specification: missing features: xsave but the VM domain does not require (nor disable) the feature. What's even more interesting, _virsh_ migrate does the migration live OK. I'm on CentOS 9 with packages up to date. All thoughts much appreciated. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX
On 10/11/2023 18:16, Jehan-Guillaume de Rorthais wrote: On Fri, 10 Nov 2023 17:17:41 +0100 lejeczek via Users wrote: ... Of course you can use "pg_stat_tmp", just make sure the temp folder exists: cat < /etc/tmpfiles.d/postgresql-part.conf # Directory for PostgreSQL temp stat files d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - - EOF To take this file in consideration immediately without rebooting the server, run the following command: systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf Then there must be something else at play here with Ubuntus, for none of the nodes has any extra/additional configs for those paths & I'm sure that those were not created manually. Indeed. This parameter is usually set by pg_createcluster command and the folder created by both pg_createcluster and pg_ctlcluster commands when needed. This is explained in PAF tutorial there: https://clusterlabs.github.io/PAF/Quick_Start-Debian-10-pcs.html#postgresql-and-cluster-stack-installation These commands comes from the postgresql-common wrapper, used in all Debian related distros, allowing to install, create and use multiple PostgreSQL versions on the same server. Perhpaphs pgSQL created these on it's own outside of HA-cluster. No, the Debian packaging did. Just create the config file I pointed you in my previous answer, systemd-tmpfiles will take care of it and you'll be fine. yes, that was a weird trip. I could not take the whole cluster down & as I kept fiddling with it trying to "fix" it - within & out of _pcs_ those paths were created by wrappers from disto packages - I know see that - but at first it did not look that way. Still that directive for the stats - I wonder if that got introduced, injected somewhere in between because the PG cluster - which I had for a while - did not experience this "issue" from the start and not until "recently" thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
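The heredoc quoted above lost its redirection in transit; reconstructed, the intended commands presumably read:

cat <<EOF > /etc/tmpfiles.d/postgresql-part.conf
# Directory for PostgreSQL temp stat files
d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
EOF
systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf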
Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX
On 10/11/2023 13:13, Jehan-Guillaume de Rorthais wrote: On Fri, 10 Nov 2023 12:27:24 +0100 lejeczek via Users wrote: ... to share my "fix" for it - perhaps it was introduced by OS/packages (Ubuntu 22) updates - ? - as oppose to resource agent itself. As the logs point out - pg_stat_tmp - is missing and from what I see it's only the master, within a cluster, doing those stats. That appeared, I use the word for I did not put it into configs, on all nodes. fix = to not use _pg_stat_tmp_ directive/option at all. Of course you can use "pg_stat_tmp", just make sure the temp folder exists: cat < /etc/tmpfiles.d/postgresql-part.conf # Directory for PostgreSQL temp stat files d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - - EOF To take this file in consideration immediately without rebooting the server, run the following command: systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf Then there must be something else at play here with Ubuntus, for none of the nodes has any extra/additional configs for those paths & I'm sure that those were not created manually. Perhpaphs pgSQL created these on it's own outside of HA-cluster. Also, one node has the path missing but the other two have & between those two only - how it seems to me - the master PG actually put any data there and if that is the case - does the existence of the path alone guarantee anything, is the question. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX
On 07/11/2023 17:57, lejeczek via Users wrote: hi guys Having 3-node pgSQL cluster with PAF - when all three systems are shutdown at virtually the same time then PAF fails to start when HA cluster is operational again. from status: ... Migration Summary: * Node: ubusrv2 (2): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv3 (3): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv1 (1): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' Failed Resource Actions: * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=84ms * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=82ms * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=108ms and all three pgSQLs show virtually identical logs: ... 2023-11-07 16:54:45.532 UTC [24936] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv4 address "0.0.0.0", port 5433 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv6 address "::", port 5433 2023-11-07 16:54:45.535 UTC [24936] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433" 2023-11-07 16:54:45.547 UTC [24938] LOG: database system was interrupted while in recovery at log time 2023-11-07 15:30:56 UTC 2023-11-07 16:54:45.547 UTC [24938] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2023-11-07 16:54:45.819 UTC [24938] LOG: entering standby mode 2023-11-07 16:54:45.824 UTC [24938] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory 2023-11-07 16:54:45.825 UTC [24936] LOG: startup process (PID 24938) exited with exit code 1 2023-11-07 16:54:45.825 UTC [24936] LOG: aborting startup due to startup process failure 2023-11-07 16:54:45.826 UTC [24936] LOG: database system is shut down Is this "test" case's result, as I showed above, expected? It reproduces every time. If not - what might it be I'm missing? many thanks, L. ___ to share my "fix" for it - perhaps it was introduced by OS/packages (Ubuntu 22) updates - ? - as oppose to resource agent itself. As the logs point out - pg_stat_tmp - is missing and from what I see it's only the master, within a cluster, doing those stats. That appeared, I use the word for I did not put it into configs, on all nodes. fix = to not use _pg_stat_tmp_ directive/option at all.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown
On 07/11/2023 17:57, lejeczek via Users wrote: hi guys Having 3-node pgSQL cluster with PAF - when all three systems are shutdown at virtually the same time then PAF fails to start when HA cluster is operational again. from status: ... Migration Summary: * Node: ubusrv2 (2): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv3 (3): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv1 (1): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' Failed Resource Actions: * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=84ms * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=82ms * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=108ms and all three pgSQLs show virtually identical logs: ... 2023-11-07 16:54:45.532 UTC [24936] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv4 address "0.0.0.0", port 5433 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv6 address "::", port 5433 2023-11-07 16:54:45.535 UTC [24936] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433" 2023-11-07 16:54:45.547 UTC [24938] LOG: database system was interrupted while in recovery at log time 2023-11-07 15:30:56 UTC 2023-11-07 16:54:45.547 UTC [24938] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2023-11-07 16:54:45.819 UTC [24938] LOG: entering standby mode 2023-11-07 16:54:45.824 UTC [24938] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory 2023-11-07 16:54:45.825 UTC [24936] LOG: startup process (PID 24938) exited with exit code 1 2023-11-07 16:54:45.825 UTC [24936] LOG: aborting startup due to startup process failure 2023-11-07 16:54:45.826 UTC [24936] LOG: database system is shut down Is this "test" case's result, as I showed above, expected? It reproduces every time. If not - what might it be I'm missing? many thanks, L. Actually, the resource fails to start on a node a single node - as opposed to entire cluster shutdown as I noted originally - which was powered down in an orderly fashion and powered back on. That the the time of power-cycle the node was PAF resource master, it fails: ... 
2023-11-09 20:35:04.439 UTC [17727] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit 2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv4 address "0.0.0.0", port 5433 2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv6 address "::", port 5433 2023-11-09 20:35:04.442 UTC [17727] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433" 2023-11-09 20:35:04.452 UTC [17731] LOG: database system was interrupted while in recovery at log time 2023-11-09 20:25:21 UTC 2023-11-09 20:35:04.452 UTC [17731] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2023-11-09 20:35:04.809 UTC [17731] LOG: entering standby mode 2023-11-09 20:35:04.813 UTC [17731] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory 2023-11-09 20:35:04.814 UTC [17727] LOG: startup process (PID 17731) exited with exit code 1 2023-11-09 20:35:04.814 UTC [17727] LOG: aborting startup due to startup process failure 2023-11-09 20:35:04.815 UTC [17727] LOG: database system is shut down The master at the time node was shut down did get moved over to standby/slave node, properly, I'm on Ubuntu with: ii corosync 3.1.6-1ubuntu1 amd64 cluster engine daemon and utilities ii pacemaker 2.1.2-1ubuntu3.1 amd64 cluster resource manager ii pacemaker-cli-utils 2.1.2-1ubuntu3.1 amd64 cluster resource manager command line utilities ii pa
[ClusterLabs] PAF / pgSQL fails after OS/system shutdown
hi guys Having 3-node pgSQL cluster with PAF - when all three systems are shutdown at virtually the same time then PAF fails to start when HA cluster is operational again. from status: ... Migration Summary: * Node: ubusrv2 (2): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv3 (3): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' * Node: ubusrv1 (1): * PGSQL-PAF-5433: migration-threshold=100 fail-count=100 last-failure='Tue Nov 7 17:52:38 2023' Failed Resource Actions: * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=84ms * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=82ms * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=108ms and all three pgSQLs show virtually identical logs: ... 2023-11-07 16:54:45.532 UTC [24936] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv4 address "0.0.0.0", port 5433 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv6 address "::", port 5433 2023-11-07 16:54:45.535 UTC [24936] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433" 2023-11-07 16:54:45.547 UTC [24938] LOG: database system was interrupted while in recovery at log time 2023-11-07 15:30:56 UTC 2023-11-07 16:54:45.547 UTC [24938] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2023-11-07 16:54:45.819 UTC [24938] LOG: entering standby mode 2023-11-07 16:54:45.824 UTC [24938] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory 2023-11-07 16:54:45.825 UTC [24936] LOG: startup process (PID 24938) exited with exit code 1 2023-11-07 16:54:45.825 UTC [24936] LOG: aborting startup due to startup process failure 2023-11-07 16:54:45.826 UTC [24936] LOG: database system is shut down Is this "test" case's result, as I showed above, expected? It reproduces every time. If not - what might it be I'm missing? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?
On 08/09/2023 17:29, Jehan-Guillaume de Rorthais wrote: On Fri, 8 Sep 2023 16:52:53 +0200 lejeczek via Users wrote: Hi guys. Before I start fiddling and brake things I wonder if somebody knows if: pgSQL can work with: |wal_level = archive for PAF ? Or more general question with pertains to ||wal_level - can _barman_ be used with pgSQL "under" PAF? PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so it can have hot standbys where it can connects and query there status. Wal level "replica" includes the archive level, so you can set up archiving. Of course you can use barman or any other tools to manage your PITR Backups, even when Pacemaker/PAF is looking at your instances. This is even the very first step you should focus on during your journey to HA. Regards, and with _barman_ specifically - is one method preferred, recommended over another: streaming VS rsync - for/with PAF? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
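A minimal sketch of what the answer above translates to in the instance configuration, assuming archiving to a barman host (host and server names are placeholders; barman-wal-archive ships with the barman-cli package):

# postgresql.conf - managed by the DBA, outside Pacemaker/PAF
wal_level = replica
archive_mode = on
archive_command = 'barman-wal-archive backup-host pg-main %p'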
[ClusterLabs] PAF / PGSQLMS and wal_level ?
Hi guys. Before I start fiddling and break things I wonder if somebody knows: can pgSQL work with wal_level = archive for PAF? Or, a more general question which pertains to wal_level - can _barman_ be used with pgSQL "under" PAF? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF / PGSQLMS on Ubuntu
On 07/09/2023 16:20, lejeczek via Users wrote: On 07/09/2023 16:09, Andrei Borzenkov wrote: On Thu, Sep 7, 2023 at 5:01 PM lejeczek via Users wrote: Hi guys. I'm trying to set ocf_heartbeat_pgsqlms agent but I get: ... Failed Resource Actions: * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid parameter' because 'Parameter "recovery_target_timeline" MUST be set to 'latest'. It is currently set to ''' at Thu Sep 7 13:58:06 2023 after 54ms I'm new to Ubuntu and I see that Ubuntu has a bit different approach to paths (in comparison to how Centos do it). I see separation between config & data, eg. 14 paf 5433 down postgres /var/lib/postgresql/14/paf /var/log/postgresql/postgresql-14-paf.log I create the resource like here: -> $ pcs resource create PGSQL-PAF-5433 ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin pgdata=/etc/postgresql/14/paf datadir=/var/lib/postgresql/14/paf meta failure-timeout=30s master-max=1 op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Promoted" op monitor interval=16s timeout=10s role="Unpromoted" op notify timeout=60s promotable notify=true failure-timeout=30s master-max=1 --disable Ubuntu 22.04.3 LTS What am I missing can you tell? Exactly what the message tells you. You need to set recovery_target=latest. and having it in 'postgresql.conf' make it all work for you? I've had it and got those errors - perhaps that has to be set some place else. In case anybody was in this situation - I was missing one important bit: _bindir_ Ubuntu's pgSQL binaries have different path - what resource/agent returns as errors is utterly confusing. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
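For anyone hitting the same wall, the path that needed correcting - on Ubuntu/Debian the postgresql-common packaging keeps version-specific binaries under /usr/lib/postgresql/<version>/bin rather than /usr/bin - can be fixed on an existing resource along these lines:

-> $ pcs resource update PGSQL-PAF-5433 bindir=/usr/lib/postgresql/14/bin

(or pass that bindir in the original 'pcs resource create' call instead of bindir=/usr/bin).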
Re: [ClusterLabs] PAF / PGSQLMS on Ubuntu
On 07/09/2023 16:09, Andrei Borzenkov wrote: On Thu, Sep 7, 2023 at 5:01 PM lejeczek via Users wrote: Hi guys. I'm trying to set ocf_heartbeat_pgsqlms agent but I get: ... Failed Resource Actions: * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid parameter' because 'Parameter "recovery_target_timeline" MUST be set to 'latest'. It is currently set to ''' at Thu Sep 7 13:58:06 2023 after 54ms I'm new to Ubuntu and I see that Ubuntu has a bit different approach to paths (in comparison to how Centos do it). I see separation between config & data, eg. 14 paf 5433 down postgres /var/lib/postgresql/14/paf /var/log/postgresql/postgresql-14-paf.log I create the resource like here: -> $ pcs resource create PGSQL-PAF-5433 ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin pgdata=/etc/postgresql/14/paf datadir=/var/lib/postgresql/14/paf meta failure-timeout=30s master-max=1 op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Promoted" op monitor interval=16s timeout=10s role="Unpromoted" op notify timeout=60s promotable notify=true failure-timeout=30s master-max=1 --disable Ubuntu 22.04.3 LTS What am I missing can you tell? Exactly what the message tells you. You need to set recovery_target=latest. and having it in 'postgresql.conf' make it all work for you? I've had it and got those errors - perhaps that has to be set some place else. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
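The parameter the agent actually checks is recovery_target_timeline; a minimal sketch of what PAF expects in postgresql.conf on every node:

# postgresql.conf (PostgreSQL 12 and later keep this in the main config, not recovery.conf)
recovery_target_timeline = 'latest'

With pgdata pointing at /etc/postgresql/14/paf, as in the thread, that directory's postgresql.conf is the one to edit.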
[ClusterLabs] PAF / PGSQLMS on Ubuntu
Hi guys. I'm trying to set ocf_heartbeat_pgsqlms agent but I get: ... Failed Resource Actions: * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid parameter' because 'Parameter "recovery_target_timeline" MUST be set to 'latest'. It is currently set to ''' at Thu Sep 7 13:58:06 2023 after 54ms I'm new to Ubuntu and I see that Ubuntu has a bit different approach to paths (in comparison to how Centos do it). I see separation between config & data, eg. 14 paf 5433 down postgres /var/lib/postgresql/14/paf /var/log/postgresql/postgresql-14-paf.log I create the resource like here: -> $ pcs resource create PGSQL-PAF-5433 ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin pgdata=/etc/postgresql/14/paf datadir=/var/lib/postgresql/14/paf meta failure-timeout=30s master-max=1 op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Promoted" op monitor interval=16s timeout=10s role="Unpromoted" op notify timeout=60s promotable notify=true failure-timeout=30s master-max=1 --disable Ubuntu 22.04.3 LTS What am I missing can you tell? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] last man - issue
Hi guys. That below should work, right? -> $ pcs quorum update last_man_standing=1 --skip-offline Checking corosync is not running on nodes... Warning: Unable to connect to dzien (Failed to connect to dzien port 2224: No route to host) Warning: dzien: Unable to check if corosync is not running Error: swir: corosync is running Error: whale: corosync is running Error: Errors have occurred, therefore pcs is unable to continue If it should work, why does it not - or what else am I missing? many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
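The output itself points at the cause: pcs refuses to change corosync quorum options while corosync is running on any node it can reach. A sketch of the usual sequence, assuming a brief full-cluster stop is acceptable:

-> $ pcs cluster stop --all
-> $ pcs quorum update last_man_standing=1
-> $ pcs cluster start --all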
Re: [ClusterLabs] newly created clone waits for off-lined node
On 13/07/2023 17:33, Ken Gaillot wrote: On Wed, 2023-07-12 at 21:08 +0200, lejeczek via Users wrote: Hi guys. I have a fresh new 'galera' clone and that one would not start & cluster says: ... INFO: Waiting on node to report database status before Master instances can start. ... Is that only for newly created resources - which I guess it must be - and if so then why? Naturally, next question would be - how to make such resource start in that very circumstance? many thank, L. That is part of the agent rather than Pacemaker. Looking at the agent code, it's based on a node attribute the agent sets, so it is only empty for newly created resources that haven't yet run on a node. I'm not sure if there's a way around it. (Anyone else have experience with that?) any expert/devel would agree - this qualifies as a bug? To add to my last message - mariadb starts a okey outside of the cluster but also if I manually change 'safe_to_bootstrap' to '1' on any of the - in my case - two nodes, then cluster will also start the clone ok. But then... disable(seems that Galera cluster gets shut down ok) & enable the clone and clone fails to start with those errors as earlier - waiting for off-lined node - and leaving the clone as 'Unpromoted' ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
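For anyone reproducing this, the manual 'safe_to_bootstrap' change mentioned above normally means editing Galera's grastate file on the node chosen to bootstrap (path assumes the default MariaDB datadir):

-> $ grep safe_to_bootstrap /var/lib/mysql/grastate.dat
safe_to_bootstrap: 0
-> $ sed -i 's/^safe_to_bootstrap: 0$/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

Only do this on the node with the most advanced seqno, since whichever node bootstraps defines the cluster's state.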
Re: [ClusterLabs] newly created clone waits for off-lined node
On 13/07/2023 17:33, Ken Gaillot wrote: On Wed, 2023-07-12 at 21:08 +0200, lejeczek via Users wrote: Hi guys. I have a fresh new 'galera' clone and that one would not start & cluster says: ... INFO: Waiting on node to report database status before Master instances can start. ... Is that only for newly created resources - which I guess it must be - and if so then why? Naturally, next question would be - how to make such resource start in that very circumstance? many thank, L. That is part of the agent rather than Pacemaker. Looking at the agent code, it's based on a node attribute the agent sets, so it is only empty for newly created resources that haven't yet run on a node. I'm not sure if there's a way around it. (Anyone else have experience with that?) But this is bit weird -- even with 'promoted-max' & 'clone-max' to exclude (& on top of it ban the resource on the) off-lined node, clone would not start -- no? Perhaps if not a an obvious bug, I'd think this could/should be improved, the code. If anybody wanted to reproduce - have a "regular" galera clone okey, off-line a node, remove re/create the clone. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] newly created clone waits for off-lined node
Hi guys. I have a fresh new 'galera' clone and that one would not start & the cluster says: ... INFO: Waiting on node to report database status before Master instances can start. ... Is that only for newly created resources - which I guess it must be - and if so then why? Naturally, the next question would be - how to make such a resource start in that very circumstance? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] location constraint does not move promoted resource ?
On 03/07/2023 18:55, Andrei Borzenkov wrote: On 03.07.2023 19:39, Ken Gaillot wrote: On Mon, 2023-07-03 at 19:22 +0300, Andrei Borzenkov wrote: On 03.07.2023 18:07, Ken Gaillot wrote: On Mon, 2023-07-03 at 12:20 +0200, lejeczek via Users wrote: On 03/07/2023 11:16, Andrei Borzenkov wrote: On 03.07.2023 12:05, lejeczek via Users wrote: Hi guys. I have pgsql with I constrain like so: -> $ pcs constraint location PGSQL-clone rule role=Promoted score=-1000 gateway-link ne 1 and I have a few more location constraints with that ethmonitor & those work, but this one does not seem to. When contraint is created cluster is silent, no errors nor warning, but relocation does not take place. I can move promoted resource manually just fine, to that node where 'location' should move it. Instance to promote is selected according to promotion scores which are normally set by resource agent. Documentation implies that standard location constraints are also taken in account, but there is no explanation how promotion scores interoperate with location scores. It is possible that promotion score in this case takes precedence. It seems to have kicked in with score=-1 but.. that was me just guessing. Indeed it would be great to know how those are calculated, in a way which would' be admin friendly or just obvious. thanks, L. It's a longstanding goal to have some sort of tool for explaining how scores interact in a given situation. However it's a challenging problem and there's never enough time ... Basically, all scores are added together for each node, and the node with the highest score runs the resource, subject to any placement strategy configured. These mainly include stickiness, location constraints, colocation constraints, and node health. Nodes may be And you omitted the promotion scores which was the main question. Oh right -- first, the above is used to determine the nodes on which clone instances will be placed. After that, an appropriate number of nodes are selected for the promoted role, based on promotion scores and location and colocation constraints for the promoted role. I am sorry but it does not really explain anything. Let's try concrete examples a) master clone instance has location score -1000 for a node and promotion score 1000. Is this node eligible for promoting clone instance (assuming no other scores are present)? b) promotion score is equal on two nodes A and B, but node A has better location score than node B. Is it guaranteed that clone will be promoted on A? a real-life example: ...Colocation Constraints: Started resource 'HA-10-1-1-253' with Promoted resource 'PGSQL-clone' (id: colocation-HA-10-1-1-253-PGSQL-clone-INFINITY) score=INFINITY .. Order Constraints: promote resource 'PGSQL-clone' then start resource 'HA-10-1-1-253' (id: order- PGSQL-clone-HA-10-1-1-253-Mandatory) symmetrical=0 kind=Mandatory demote resource 'PGSQL-clone' then stop resource 'HA-10-1-1-253' (id: order- PGSQL-clone-HA-10-1-1-253-Mandatory-1) symmetrical=0 kind=Mandatory I had to bump this one up to: ... resource 'PGSQL-clone' (id: location-PGSQL-clone) Rules: Rule: role=Promoted score=-1 (id: location-PGSQL-clone-rule) Expression: gateway-link ne 1 (id: location-PGSQL-clone-rule-expr) '-1000' did not seem to be good enough, '-1' was just a "lucky" guess. as earlier: I was able to 'move' the promoted and I think 'prefers' also worked. I don't know it 'pgsql' would work with any other constraints, if it was safe to try so. many thanks, L. 
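Until such an explanatory tool exists, the closest thing today is asking the scheduler itself to print its arithmetic; a sketch against the live CIB on recent Pacemaker:

-> $ crm_simulate -sL

which dumps the combined allocation scores per node and, for promotable clones, the promotion scores per instance, so one can see which term is out-voting a given location rule.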
___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] location constraint does not move promoted resource ?
On 03/07/2023 11:16, Andrei Borzenkov wrote: On 03.07.2023 12:05, lejeczek via Users wrote: Hi guys. I have pgsql with I constrain like so: -> $ pcs constraint location PGSQL-clone rule role=Promoted score=-1000 gateway-link ne 1 and I have a few more location constraints with that ethmonitor & those work, but this one does not seem to. When contraint is created cluster is silent, no errors nor warning, but relocation does not take place. I can move promoted resource manually just fine, to that node where 'location' should move it. Instance to promote is selected according to promotion scores which are normally set by resource agent. Documentation implies that standard location constraints are also taken in account, but there is no explanation how promotion scores interoperate with location scores. It is possible that promotion score in this case takes precedence. It seems to have kicked in with score=-1 but.. that was me just guessing. Indeed it would be great to know how those are calculated, in a way which would' be admin friendly or just obvious. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] location constraint does not move promoted resource ?
Hi guys. I have pgsql with I constrain like so: -> $ pcs constraint location PGSQL-clone rule role=Promoted score=-1000 gateway-link ne 1 and I have a few more location constraints with that ethmonitor & those work, but this one does not seem to. When contraint is created cluster is silent, no errors nor warning, but relocation does not take place. I can move promoted resource manually just fine, to that node where 'location' should move it. All thoughts share are much appreciated. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] silence resource ? - PGSQL
Hi guys. Having 'pgsql' set up in what I'd say is a vanilla-default confg, pacemaker's journal log is flooded with: ... pam_unix(runuser:session): session closed for user postgres pam_unix(runuser:session): session opened for user postgres(uid=26) by (uid=0) pam_unix(runuser:session): session closed for user postgres pam_unix(runuser:session): session opened for user postgres(uid=26) by (uid=0) pam_unix(runuser:session): session closed for user postgres pam_unix(runuser:session): session opened for user postgres(uid=26) by (uid=0) pam_unix(runuser:session): session closed for user postgres ... Would you have a working fix or even a suggestion on how to silence those? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?
On 09/06/2023 09:04, Reid Wahl wrote: On Thu, Jun 8, 2023 at 10:55 PM lejeczek via Users wrote: On 09/06/2023 01:38, Reid Wahl wrote: On Thu, Jun 8, 2023 at 2:24 PM lejeczek via Users wrote: Ouch. Let's see the full output of the move command, with the whole CIB that failed to validate. For a while there I thought perhaps it was just that one pglsq resource, but it seems that any - though only a few are set up - (only clone promoted?)resource fails to move. Perhaps primarily to do with 'pcs' -> $ pcs resource move REDIS-clone --promoted podnode3 Error: cannot move resource 'REDIS-clone' 1 This is the problem: `validate-with="pacemaker-3.6"`. That old schema doesn't support role="Promoted" in a location constraint. Support begins with version 3.7 of the schema: https://github.com/ClusterLabs/pacemaker/commit/e7f1424df49ac41b2d38b72af5ff9ad5121432d2. You'll need at least Pacemaker 2.1.0. I have: corosynclib-3.1.7-1.el9.x86_64 corosync-3.1.7-1.el9.x86_64 pacemaker-schemas-2.1.6-2.el9.noarch pacemaker-libs-2.1.6-2.el9.x86_64 pacemaker-cluster-libs-2.1.6-2.el9.x86_64 pacemaker-cli-2.1.6-2.el9.x86_64 pacemaker-2.1.6-2.el9.x86_64 pcs-0.11.5-2.el9.x86_64 pacemaker-remote-2.1.6-2.el9.x86_64 and the reset is Centos 9 up-to-what-is-in-repos If all your cluster nodes are at those versions, try `pcs cluster cib-upgrade` 'cib-upgrade' might be a fix for this "issue" - but should it not happen at rpm update time? Differently but still 'move' errored out: -> $ pcs resource move REDIS-clone --promoted podnode2 Location constraint to move resource 'REDIS-clone' has been created Waiting for the cluster to apply configuration changes... Location constraint created to move resource 'REDIS-clone' has been removed Waiting for the cluster to apply configuration changes... Error: resource 'REDIS-clone' is promoted on node 'podnode3'; unpromoted on nodes 'podnode1', 'podnode2' Error: Errors have occurred, therefore pcs is unable to continue Then a node-promted-resource going into 'standby' or being rebooted - for first/one time - seemed to "fix" 'move'. -> $ pcs resource move REDIS-clone --promoted podnode2 Location constraint to move resource 'REDIS-clone' has been created Waiting for the cluster to apply configuration changes... Location constraint created to move resource 'REDIS-clone' has been removed Waiting for the cluster to apply configuration changes... resource 'REDIS-clone' is promoted on node 'podnode2'; unpromoted on nodes 'podnode1', 'podnode3' quite puzzling, sharing in case others experience this/similar. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
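To check for this situation before a move fails, one can look at the schema the CIB is validated with and upgrade it explicitly; a sketch:

-> $ pcs cluster cib | head -n 1 | grep -o 'validate-with="[^"]*"'
validate-with="pacemaker-3.6"
-> $ pcs cluster cib-upgrade

Schemas older than pacemaker-3.7 predate role="Promoted" in location constraints, which is what the move stumbled over here.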
Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?
On 09/06/2023 01:38, Reid Wahl wrote: On Thu, Jun 8, 2023 at 2:24 PM lejeczek via Users wrote: Ouch. Let's see the full output of the move command, with the whole CIB that failed to validate. For a while there I thought perhaps it was just that one pgsql resource, but it seems that any - though only a few are set up - (only clone promoted?) resource fails to move. Perhaps primarily to do with 'pcs' -> $ pcs resource move REDIS-clone --promoted podnode3 Error: cannot move resource 'REDIS-clone' This is the problem: `validate-with="pacemaker-3.6"`. That old schema doesn't support role="Promoted" in a location constraint. Support begins with version 3.7 of the schema: https://github.com/ClusterLabs/pacemaker/commit/e7f1424df49ac41b2d38b72af5ff9ad5121432d2. You'll need at least Pacemaker 2.1.0. I have: corosynclib-3.1.7-1.el9.x86_64 corosync-3.1.7-1.el9.x86_64 pacemaker-schemas-2.1.6-2.el9.noarch pacemaker-libs-2.1.6-2.el9.x86_64 pacemaker-cluster-libs-2.1.6-2.el9.x86_64 pacemaker-cli-2.1.6-2.el9.x86_64 pacemaker-2.1.6-2.el9.x86_64 pcs-0.11.5-2.el9.x86_64 pacemaker-remote-2.1.6-2.el9.x86_64 and the rest is CentOS 9, up to what is in the repos.
Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?
Ouch. Let's see the full output of the move command, with the whole CIB that failed to validate. For a while there I thought perhaps it was just that one pglsq resource, but it seems that any - though only a few are set up - (only clone promoted?)resource fails to move. Perhaps primarily to do with 'pcs' -> $ pcs resource move REDIS-clone --promoted podnode3 Error: cannot move resource 'REDIS-clone' 1 validate-with="pacemaker-3.6" epoch="8212" num_updates="0" admin_epoch="0" cib-last-written="Thu Jun 8 21:59:53 2023" update-origin="podnode1" update-client="crm_attribute" have-quorum="1" update-user="root" dc-uuid="1"> 2 3 4 5 id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/> 6 name="dc-version" value="2.1.6-2.el9-6fdc9deea29"/> 7 id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/> 8 id="cib-bootstrap-options-cluster-name" name="cluster-name" value="podnodes"/> 9 id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/> 10 id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1686047745"/> 11 id="cib-bootstrap-options-maintenance-mode" name="maintenance-mode" value="false"/> 12 13 14 name="REDIS_REPL_INFO" value="podnode1"/> 15 16 17 18 19 20 name="PGSQL-data-status" value="DISCONNECT"/> 21 22 23 24 25 name="PGSQL-data-status" value="DISCONNECT"/> 26 27 28 29 30 name="PGSQL-data-status" value="LATEST"/> 31 32 33 34 35 provider="heartbeat" type="IPaddr2"> 36 id="HA-10-1-1-226-meta_attributes"> 37 id="HA-10-1-1-226-meta_attributes-failure-timeout" name="failure-timeout" value="30s"/> 38 39 id="HA-10-1-1-226-instance_attributes"> 40 id="HA-10-1-1-226-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/> 41 id="HA-10-1-1-226-instance_attributes-ip" name="ip" value="10.1.1.226"/> 42 43 44 interval="10s" name="monitor" timeout="20s"/> 45 interval="0s" name="start" timeout="20s"/> 46 interval="0s" name="stop" timeout="20s"/> 47 48 49 provider="heartbeat" type="IPaddr2"> 50 id="HA-10-3-1-226-meta_attributes"> 51 id="HA-10-3-1-226-meta_attributes-failure-timeout" name="failure-timeout" value="30s"/> 52 53 id="HA-10-3-1-226-instance_attributes"> 54 id="HA-10-3-1-226-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/> 55 id="HA-10-3-1-226-instance_attributes-ip" name="ip" value="10.3.1.226"/> 56 57 58 interval="10s" name="monitor" timeout="20s"/> 59 interval="0s" name="start" timeout="20s"/> 60 interval="0s" name="stop" timeout="20s"/> 61 62 63 64 provider="heartbeat"> 65 id="REDIS-instance_attributes"> 66 name="bin" value="/usr/bin/redis-server"/> 67 id="REDIS-instance_attributes-client_bin" name="client_bin" value="/usr/bin/redis-cli"/> 68 id="REDIS-instance_attributes-config" name="config" value="/etc/redis/redis.conf"/> 69 id="REDIS-instance_attributes-rundir" name="rundir" value="/run/redis"/> 70 id="REDIS-instance_attributes-user" name="user" value="redis"/> 71 id="REDIS-instance_attributes-wait_last_known_master" name="wait_last_known_master" value="true"/> 72 73 74 timeout="120s" id="REDIS-demote-interval-0s"/> 75 interval="45s" id="REDIS-monitor-interval-45s"> 76 id="REDIS-monitor-interval-45s-instance_attributes"> 77 id="REDIS-monitor-interval-45s-instance_attributes-OCF_CHECK_LEVEL" name="OCF_CHECK_LEVEL" value="0"/> 78 79 80 timeout="60s" interval="20s" id="REDIS-monitor-interval-20s"> 81 id="REDIS-monitor-interval-20s-instance_attributes"> 82 id="REDIS-monitor-interval-20s-instance_attributes-OCF_CHECK_LEVEL" name="OCF_CHECK_LEVEL" value="0"/> 
83 84 85 timeout="60s" interval="60s" id="REDIS-monitor-interval-60s"> 86 id="REDIS-monitor-interval-60s-instance_attributes"> 87 id="REDIS-monitor-interval-60s-instance_attributes-OCF_CHECK_LEVEL" name="OCF_CHECK_LEVEL" value="0"/> 88 89 90 timeout="90s" id="REDIS-notify-interval-0s"/> 91 timeout="120s" id="REDIS-promote-interval-0s"/> 92
Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?
On 05/06/2023 16:23, Ken Gaillot wrote: On Sat, 2023-06-03 at 15:09 +0200, lejeczek via Users wrote: Hi guys. I've something which I'm new to entirely - cluster which is seemingly okey errors, fails to move a resource. What pcs version are you using? I believe there was a move regression in a recent push. I'd won't contaminate here just yet with long json cluster spits when fails but a snippet: -> $ pcs resource move PGSQL-clone --promoted podnode1 Error: cannot move resource 'PGSQL-clone' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ... crm_resource: Error performing operation: Invalid configuration This one line: (might be more) puzzles me, as there is no such node/member in the cluster and so I try: That's not a problem. Pacemaker allows "custom" values in both cluster options and resource/action meta-attributes. I don't know whether redis is actually using that or not. -> $ pcs property unset redis_REPL_INFO --force Warning: Cannot remove property 'redis_REPL_INFO', it is not present in property set 'cib-bootstrap-options' That's because the custom options are in their own cluster_property_set. I believe pcs can only manage the options in the cluster_property_set with id="cib-bootstrap-options", so you'd have to use "pcs cluster edit" or crm_attribute to remove the custom ones. Any & all suggestions on how to fix this are much appreciated. many thanks, L. I've downgraded back to: pacemaker-2.1.6-1.el9.x86_64 pcs-0.11.4-7.el9.x86_64 but, it's either not enough - if bugs are in those that is - or issues are somewhere else. for 'move' still fails the same: ... crm_resource: Error performing operation: Invalid configuration ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
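Following Ken's pointer, a sketch of removing the stray property with crm_attribute; the id of the extra cluster_property_set holding it can be read from 'pcs cluster cib', and --set-name is only needed because the nvpair lives outside cib-bootstrap-options:

-> $ crm_attribute --type crm_config --set-name <property-set-id> --name redis_REPL_INFO --delete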
Re: [ClusterLabs] cluster okey but errors when tried to move resource - ?
On 05/06/2023 16:23, Ken Gaillot wrote: On Sat, 2023-06-03 at 15:09 +0200, lejeczek via Users wrote: Hi guys. I've something which I'm new to entirely - cluster which is seemingly okey errors, fails to move a resource. What pcs version are you using? I believe there was a move regression in a recent push. I'd won't contaminate here just yet with long json cluster spits when fails but a snippet: -> $ pcs resource move PGSQL-clone --promoted podnode1 Error: cannot move resource 'PGSQL-clone' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ... crm_resource: Error performing operation: Invalid configuration This one line: (might be more) puzzles me, as there is no such node/member in the cluster and so I try: That's not a problem. Pacemaker allows "custom" values in both cluster options and resource/action meta-attributes. I don't know whether redis is actually using that or not. -> $ pcs property unset redis_REPL_INFO --force Warning: Cannot remove property 'redis_REPL_INFO', it is not present in property set 'cib-bootstrap-options' That's because the custom options are in their own cluster_property_set. I believe pcs can only manage the options in the cluster_property_set with id="cib-bootstrap-options", so you'd have to use "pcs cluster edit" or crm_attribute to remove the custom ones. Any & all suggestions on how to fix this are much appreciated. many thanks, L. Well. that would make sense - those must have gone in & through to the repos' rpms - as I suspected thos broke after recent 'dnf' updates. I'll fiddle with 'downgrades' later on - but if true, then rather critical, no? ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] cluster okey but errors when tried to move resource - ?
Hi guys. I've something which I'm new to entirely - cluster which is seemingly okey errors, fails to move a resource. I'd won't contaminate here just yet with long json cluster spits when fails but a snippet: -> $ pcs resource move PGSQL-clone --promoted podnode1 Error: cannot move resource 'PGSQL-clone' 1 validate-with="pacemaker-3.6" epoch="8109" num_updates="0" admin_epoch="0" cib-last-written="Sat Jun 3 13:49:34 2023" update-origin="podnode2" update-client="cibadmin" have-quorum="1" update-user="root" dc-uuid="2"> 2 3 4 5 id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/> 6 name="dc-version" value="2.1.6-1.el9-802a72226be"/> 7 id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/> 8 id="cib-bootstrap-options-cluster-name" name="cluster-name" value="podnodes"/> 9 id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/> 10 id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1683293193"/> 11 id="cib-bootstrap-options-maintenance-mode" name="maintenance-mode" value="false"/> 12 13 14 name="redis_REPL_INFO" value="c8kubernode1"/> 15 name="REDIS_REPL_INFO" value="podnode3"/> 16 17 ... crm_resource: Error performing operation: Invalid configuration This one line: (might be more) name="redis_REPL_INFO" value="c8kubernode1"/> puzzles me, as there is no such node/member in the cluster and so I try: -> $ pcs property unset redis_REPL_INFO --force Warning: Cannot remove property 'redis_REPL_INFO', it is not present in property set 'cib-bootstrap-options' Any & all suggestions on how to fix this are much appreciated. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] ocf_heartbeat_pgsql - lock file
Hi guys. I have a resource seemingly running OK, but when a node gets rebooted the cluster finds itself unable, or unwilling, to start the resource there. Failed Resource Actions: * PGSQL start on podnode3 returned 'error' (My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.) at Sun May 7 11:48:43 2023 after 121ms and indeed, with manual intervention, after removal of that file, the cluster seems happy to rejoin the node into the pgsql cluster. Is that intentional, by design - and if yes/no, why does it happen? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
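For the record, the manual intervention boils down to the following - and should only be done once it is clear the node's data directory is safe to reuse, since the lock file is how the agent flags an unclean stop:

-> $ rm /var/lib/pgsql/tmp/PGSQL.lock
-> $ pcs resource cleanup PGSQL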
Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?
On 05/05/2023 10:41, Jehan-Guillaume de Rorthais wrote: On Fri, 5 May 2023 10:08:17 +0200 lejeczek via Users wrote: On 25/04/2023 14:16, Jehan-Guillaume de Rorthais wrote: Hi, On Mon, 24 Apr 2023 12:32:45 +0200 lejeczek via Users wrote: I've been looking up and fiddling with this RA but unsuccessfully so far, that I wonder - is it good for current versions of pgSQLs? As far as I know, the pgsql agent is still supported, last commit on it happen in Jan 11th 2023. I don't know about its compatibility with latest PostgreSQL versions. I've been testing it many years ago, I just remember it was quite hard to setup, understand and manage from the maintenance point of view. Also, this agent is fine in a shared storage setup where it only start/stop/monitor the instance, without paying attention to its role (promoted or not). It's not only that it's hard - which is purely due to piss-poor man page in my opinion - but it really sounds "expired". I really don't know. My feeling is that the manpage might be expired, which really doesn't help with this agent, but not the RA itself. Eg. man page speaks of 'recovery.conf' which - as I understand it - newer/current versions of pgSQL do not! even use... which makes one wonder. This has been fixed in late 2019, but with no documentation associated :/ See: https://github.com/ClusterLabs/resource-agents/commit/a43075be72683e1d4ddab700ec16d667164d359c Regards, Right.. like the rest of us admins/users going to the code is the very first thing we do :) RA/resource seems to work but I sincerely urge the programmer(s)/author(s) responsible, to update man pages so they reflect state of affairs as of today - there is nothing more discouraging - to the rest of us sysadmins/endusers - than man pages which look like stingy conservatives programmer's note. What setup of this RA does, is misleading in parts - still creates config part in a file called 'recovery.conf' on that note - does this RA not require way to much open pgSQL setup, meaning not very secure? Would anybody know? I cannot see - again ! regular man pages and not the code - how replication could be secured down by not using user 'postgres' itself and then perhaps adding authentication with passwords, at least. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
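On the securing-replication question, a hedged sketch of the usual pattern - a dedicated replication role with password authentication instead of replicating as 'postgres' (role name, network and auth method are placeholders; the ocf:heartbeat:pgsql agent exposes a repuser parameter for this):

-- on the primary
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

# pg_hba.conf on every node
host  replication  replicator  10.1.1.0/24  scram-sha-256

with the password kept in ~postgres/.pgpass on each node so the agent's replication connections can pick it up without it appearing in the configuration.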
Re: [ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?
On 05/05/2023 10:08, Andrei Borzenkov wrote: On Fri, May 5, 2023 at 11:03 AM lejeczek via Users wrote: On 29/04/2023 21:02, Reid Wahl wrote: On Sat, Apr 29, 2023 at 3:34 AM lejeczek via Users wrote: Hi guys. I presume these are a consequence of having resource of VirtuaDomain type set up(& enabled) - but where, how cab users control presence & content of those? Yep: https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/VirtualDomain#L674-L680 You can't really control the content, since it's set by the resource agent. (You could change it after creation but that defeats the purpose.) However, you can view it at /run/systemd/system/resource-agents-deps.target.d/libvirt.conf. You can see the systemd_drop_in definition here: https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/ocf-shellfuncs.in#L654-L673 I wonder how much of an impact those bits have on the cluster(&more?) Take '99-VirtualDomain-libvirt.conf' - that one poses questions, with c9s 'libvirtd.service' is not really used or should not be, new modular approach is devised there. So, with 'resources-agents' having: After=libvirtd.service and users not being able to manage those bit - is that not asking for trouble? it does no harm (missing units are simply ignored) but it certainly does not do anything useful either. OTOH modular approach is also optional, so you could still use monolithic libvirtd on cluster nodes. So it is more a documentation issue. Not sure what you mean by 'missing unit' - unit is there only is not used, is disabled. What does 'resource-agents-deps' do with that? I don't suppose upstream, redhat & others made that effort, those changes with the suggestions to us consumers - do go back to "old" stuff. I'd suggest, if devel/contributors read here - and I'd imagine other users would reckon as well - to enhance RAs, certainly VirtualDomain, with a parameter/attribute with which users could, at least to certain extent, control those "outside" of cluster, dependencies. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
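For the c9s modular-daemon concern above, the hook Pacemaker gives us is that resource-agents-deps.target is an ordinary systemd target, so local drop-ins can add whatever the RA's generated unit does not know about - a hedged sketch, assuming the modular virtqemud is what actually runs the guests:
-> $ mkdir -p /etc/systemd/system/resource-agents-deps.target.d
-> $ cat > /etc/systemd/system/resource-agents-deps.target.d/virtqemud.conf <<'EOF'
[Unit]
Requires=virtqemud.service
After=virtqemud.service
EOF
-> $ systemctl daemon-reload
That does not make the RA's own 99-VirtualDomain-libvirt.conf editable, but it lets the admin add the dependency that matters without touching it.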
Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?
On 25/04/2023 14:16, Jehan-Guillaume de Rorthais wrote: Hi, On Mon, 24 Apr 2023 12:32:45 +0200 lejeczek via Users wrote: I've been looking up and fiddling with this RA but unsuccessfully so far, that I wonder - is it good for current versions of pgSQLs? As far as I know, the pgsql agent is still supported, last commit on it happen in Jan 11th 2023. I don't know about its compatibility with latest PostgreSQL versions. I've been testing it many years ago, I just remember it was quite hard to setup, understand and manage from the maintenance point of view. Also, this agent is fine in a shared storage setup where it only start/stop/monitor the instance, without paying attention to its role (promoted or not). It's not only that it's hard - which is purely due to piss-poor man page in my opinion - but it really sounds "expired". Eg. man page speaks of 'recovery.conf' which - as I understand it - newer/current versions of pgSQL do not! even use... which makes one wonder. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?
On 29/04/2023 21:02, Reid Wahl wrote: On Sat, Apr 29, 2023 at 3:34 AM lejeczek via Users wrote: Hi guys. I presume these are a consequence of having resource of VirtuaDomain type set up(& enabled) - but where, how cab users control presence & content of those? Yep: https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/VirtualDomain#L674-L680 You can't really control the content, since it's set by the resource agent. (You could change it after creation but that defeats the purpose.) However, you can view it at /run/systemd/system/resource-agents-deps.target.d/libvirt.conf. You can see the systemd_drop_in definition here: https://github.com/ClusterLabs/resource-agents/blob/v4.12.0/heartbeat/ocf-shellfuncs.in#L654-L673 I wonder how much of an impact those bits have on the cluster(&more?) Take '99-VirtualDomain-libvirt.conf' - that one poses questions, with c9s 'libvirtd.service' is not really used or should not be, new modular approach is devised there. So, with 'resources-agents' having: After=libvirtd.service and users not being able to manage those bit - is that not asking for trouble? thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] 99-VirtualDomain-libvirt.conf under control - ?
Hi guys. I presume these are a consequence of having a resource of VirtualDomain type set up (& enabled) - but where and how can users control the presence & content of those? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] PAF / pgsqlms - lags behind in terms of RA specs?
Hi guys. Anybody here use PAF with an up-to-date CentOS? I see this RA fail from the cluster's perspective; I've filed a report over at github, no comment there, so I thought I'd ask around. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?
Hi guys. I've been looking up and fiddling with this RA, but unsuccessfully so far, so that I wonder - is it any good for current versions of pgSQL? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] FEEDBACK WANTED: possible deprecation of nagios-class resources
On 19/04/2023 21:08, Ken Gaillot wrote: Hi all, I am considering deprecating Pacemaker's support for nagios-class resources. This has nothing to do with nagios monitoring of a Pacemaker cluster, which would be unaffected. This is about Pacemaker's ability to use nagios plugin scripts as a type of resource. Nagios-class resources act as a sort of proxy monitor for another resource. You configure the main resource (typically a VM or container) as normal, then configure a nagios: resource to monitor it. If the nagios monitor fails, the main resource is restarted. Most of that use case is now better covered by Pacemaker Remote and bundles. The only advantage of nagios-class resources these days is if you have a VM or container image that you can't modify (to add Pacemaker Remote). If we deprecate nagios-class resources, the preferred alternative would be to write a custom OCF agent with the monitoring you want (which can even call a nagios plugin). Does anyone here use nagios-class resources? If it's actively being used, I'm willing to keep it around. But if there's no demand, we'd rather not have to maintain that (poorly tested) code forever. Myself, regarding ha/pacemekr/vm, I've not dealt with nagios ever. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
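For anyone weighing the "custom OCF agent that calls a nagios plugin" route mentioned above: a minimal sketch of such a wrapper is below. It is not a real shipped agent - the name, parameters and default plugin path are made up - but it shows the shape: a dummy start/stop plus a monitor that simply proxies the plugin's exit status.
#!/bin/sh
# nagios-proxy (sketch): OCF wrapper whose monitor action runs a nagios plugin.
: ${OCF_ROOT=/usr/lib/ocf}
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

PLUGIN="${OCF_RESKEY_plugin:-/usr/lib64/nagios/plugins/check_http}"
PLUGIN_ARGS="${OCF_RESKEY_plugin_args}"
STATEFILE="${HA_RSCTMP}/nagios-proxy-${OCF_RESOURCE_INSTANCE}.state"

meta_data() {
cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="nagios-proxy" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Proxy monitor around a nagios plugin (sketch).</longdesc>
  <shortdesc lang="en">nagios plugin proxy</shortdesc>
  <parameters>
    <parameter name="plugin"><content type="string"/></parameter>
    <parameter name="plugin_args"><content type="string"/></parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="30s"/>
    <action name="validate-all" timeout="5s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

monitor() {
    # "running" means we were started (state file exists) and the plugin is happy
    [ -f "$STATEFILE" ] || return $OCF_NOT_RUNNING
    "$PLUGIN" $PLUGIN_ARGS >/dev/null 2>&1 && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

case "$1" in
    meta-data)    meta_data; exit $OCF_SUCCESS ;;
    start)        touch "$STATEFILE"; monitor; exit $? ;;
    stop)         rm -f "$STATEFILE"; exit $OCF_SUCCESS ;;
    monitor)      monitor; exit $? ;;
    validate-all) exit $OCF_SUCCESS ;;
    *)            exit $OCF_ERR_UNIMPLEMENTED ;;
esac
In a real deployment the start/stop actions would of course act on (or at least wait for) the VM or container being watched, and the agent would be ordered/colocated with it the way the nagios-class resources used to be.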
Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?
On 19/04/2023 16:16, Ken Gaillot wrote: On Wed, 2023-04-19 at 08:00 +0200, lejeczek via Users wrote: On 18/04/2023 21:02, Ken Gaillot wrote: On Tue, 2023-04-18 at 19:36 +0200, lejeczek via Users wrote: On 18/04/2023 18:22, Ken Gaillot wrote: On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote: Hi guys. When it's done by the cluster itself, eg. a node goes 'standby' - how do clusters migrate VirtualDomain resources? 1. Call resource agent migrate_to action on original node 2. Call resource agent migrate_from action on new node 3. Call resource agent stop action on original node Do users have any control over it and if so then how? The allow-migrate resource meta-attribute (true/false) I'd imagine there must be some docs - I failed to find It's sort of scattered throughout Pacemaker Explained -- the main one is: https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources Especially in large deployments one obvious question would be - I'm guessing as my setup is rather SOHO - can VMs migrate in sequence or it is(always?) a kind of 'swarm' migration? The migration-limit cluster property specifies how many live migrations may be initiated at once (the default of -1 means unlimited). But if this is cluster property - unless I got it wrong, hopefully - then this govern any/all resources. If so, can such a limit be rounded down to RA type or perhaps group of resources? many thanks, L. No, it's global To me it feels so intuitive, so natural & obvious that I will ask - nobody yet suggested that such feature be available to smaller divisions of cluster independently of global rule? In the vastness of resource types many are polar opposites and to treat them all the same? Would be great to have some way to tell cluster to run different migration/relocation limits on for eg. compute-heavy resources VS light-weight ones - where to "file" such a enhancement suggestion, Bugzilla? many thanks, L. Looking at the code, I see it's a little different than I originally thought. First, I overlooked that it's correctly documented as a per-node limit rather than a cluster-wide limit. That highlights the complexity of allowing different values for different resources; if rscA has a migration limit of 2, and rscB has a migration limit of 5, do we allow up to 2 rscA migrations and 5 rscB migrations simultaneously, or do we weight them relative to each other so the total capacity is still constrained (for example limiting it to 1 rscA migration and 2 rscB migrations together)? My first thoughts were - I cannot comment on the code, only inasmuch as an admin would care - perhaps to introduce, if would not require business logic total overhaul, "migration groups"(while not being another resource type) whose such groups then a resource could be member. Or perhaps marry 'migration-limit' to 'resource group' which would take priority over global/node-wide rule. One way or another, simple to end-users - then user/admin sets N-limit of resources which in such group can be live-migrated at one time, say... in this-given-group only 2 resources can cluster attempt to live-migrate simultaneously, then wait for success or failure but wait for result and only then proceed to next & ... We would almost need something like the node utilization feature, being able to define a node's total migration capacity and then how much of that capacity is taken up by the migration of a specific resource. 
That seems overcomplicated to me, especially since there aren't that many resource types that support live migration. Those types which do support live migration and are compute-heavy, then I really wonder how large consumers do VirtualDomain migration, as one good example. Say a Virtual/Cloud provides - there a chunky host node might host hundreds VMs - there, but anywhere else timeouts, all/any, must be some real, fixed number. As of right now, how intuitive is what cluster does when it swarms - say equally - those hundreds of VMs to remaining-available nodes... ... even with fast inner-node connectivity many - without migration-limit - live-migrations will timeout. Is cluster capable of some very clever heuristics so humans could leave it to the machine to ensure that such mass-migration will not fail simply due to overall bottleneck of the underlying infrastructure? ... and could the cluster alone do that? Would not VirtualDomain agent have to gather comprehensive metric data on each VM in the first place, to feed it to the cluster internal logic..? I would see some way similar to these which I mentioned above, as relatively effective and surely down-to-earth, practical aid to alleviate cases such as VMs "mass-migration". Second, any actions on a Pacemaker Remote node count toward the throttling limit of its conne
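For reference, the knob that exists today is the migration-limit cluster property discussed above - it caps how many live migrations a single node will initiate at once, so a hedged way to take the edge off a mass evacuation is simply:
-> $ pcs property set migration-limit=2
The utilization feature mentioned (pcs node utilization / pcs resource utilization together with placement-strategy) currently only constrains where resources are placed, not how many migrations run concurrently, so it is not a substitute - hence the enhancement idea.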
Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?
On 18/04/2023 21:02, Ken Gaillot wrote: On Tue, 2023-04-18 at 19:36 +0200, lejeczek via Users wrote: On 18/04/2023 18:22, Ken Gaillot wrote: On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote: Hi guys. When it's done by the cluster itself, eg. a node goes 'standby' - how do clusters migrate VirtualDomain resources? 1. Call resource agent migrate_to action on original node 2. Call resource agent migrate_from action on new node 3. Call resource agent stop action on original node Do users have any control over it and if so then how? The allow-migrate resource meta-attribute (true/false) I'd imagine there must be some docs - I failed to find It's sort of scattered throughout Pacemaker Explained -- the main one is: https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources Especially in large deployments one obvious question would be - I'm guessing as my setup is rather SOHO - can VMs migrate in sequence or it is(always?) a kind of 'swarm' migration? The migration-limit cluster property specifies how many live migrations may be initiated at once (the default of -1 means unlimited). But if this is cluster property - unless I got it wrong, hopefully - then this govern any/all resources. If so, can such a limit be rounded down to RA type or perhaps group of resources? many thanks, L. No, it's global To me it feels so intuitive, so natural & obvious that I will ask - nobody yet suggested that such feature be available to smaller divisions of cluster independently of global rule? In the vastness of resource types many are polar opposites and to treat them all the same? Would be great to have some way to tell cluster to run different migration/relocation limits on for eg. compute-heavy resources VS light-weight ones - where to "file" such a enhancement suggestion, Bugzilla? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] manner in which cluster migrates VirtualDomain - ?
On 18/04/2023 18:22, Ken Gaillot wrote: On Tue, 2023-04-18 at 14:58 +0200, lejeczek via Users wrote: Hi guys. When it's done by the cluster itself, eg. a node goes 'standby' - how do clusters migrate VirtualDomain resources? 1. Call resource agent migrate_to action on original node 2. Call resource agent migrate_from action on new node 3. Call resource agent stop action on original node Do users have any control over it and if so then how? The allow-migrate resource meta-attribute (true/false) I'd imagine there must be some docs - I failed to find It's sort of scattered throughout Pacemaker Explained -- the main one is: https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/advanced-options.html#migrating-resources Especially in large deployments one obvious question would be - I'm guessing as my setup is rather SOHO - can VMs migrate in sequence or it is(always?) a kind of 'swarm' migration? The migration-limit cluster property specifies how many live migrations may be initiated at once (the default of -1 means unlimited). But if this is cluster property - unless I got it wrong, hopefully - then this govern any/all resources. If so, can such a limit be rounded down to RA type or perhaps group of resources? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] manner in which cluster migrates VirtualDomain - ?
Hi guys. When it's done by the cluster itself, eg. a node goes 'standby' - how do clusters migrate VirtualDomain resources? Do users have any control over it and if so then how? I'd imagine there must be some docs - I failed to find Especially in large deployments one obvious question would be - I'm guessing as my setup is rather SOHO - can VMs migrate in sequence or it is(always?) a kind of 'swarm' migration? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] VirtualDomain - node map - ?
On 17/04/2023 06:17, Andrei Borzenkov wrote: On 16.04.2023 16:29, lejeczek via Users wrote: On 16/04/2023 12:54, Andrei Borzenkov wrote: On 16.04.2023 13:40, lejeczek via Users wrote: Hi guys Some agents do employ that concept of node/host map which I do not see in any manual/docs that this agent does - would you suggest some technique or tips on how to achieve similar? I'm thinking specifically of 'migrate' here, as I understand 'migration' just uses OS' own resolver to call migrate_to node. No, pacemaker decides where to migrate the resource and calls agents on the current source and then on the intended target passing this information. Yes pacemaker does that but - as I mentioned - some agents do employ that "internal" nodes map "technique". I see no mention of that/similar in the manual for VirtualDomain so I asked if perhaps somebody had an idea of how to archive such result by some other means. What about showing example of these "some agents" or better describing what you want to achieve? 'ocf_heartbeat_galera' is one such agent. But after more fiddling with 'VirtualDomain' I understand this agent does offer this functionality tough not in that clear way nor to that full extent, but it's there. So, what I needed was possible to achieve. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
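For anyone searching for the same thing later: the VirtualDomain knobs that come closest to a host map for migration seem to be migration_transport plus migration_network_suffix - the suffix is appended to the peer node's name when the agent builds the migration URI, so with names like nodeA-migr / nodeB-migr resolving on a dedicated migration network (the names and resource id here are examples), something like this should do it:
-> $ pcs resource update c8kubermaster1 migration_transport=ssh migration_network_suffix=-migr
Check 'pcs resource describe VirtualDomain' on your resource-agents version for the exact parameter names.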
Re: [ClusterLabs] VirtualDomain - node map - ?
On 16/04/2023 12:54, Andrei Borzenkov wrote: On 16.04.2023 13:40, lejeczek via Users wrote: Hi guys Some agents do employ that concept of node/host map which I do not see in any manual/docs that this agent does - would you suggest some technique or tips on how to achieve similar? I'm thinking specifically of 'migrate' here, as I understand 'migration' just uses OS' own resolver to call migrate_to node. No, pacemaker decides where to migrate the resource and calls agents on the current source and then on the intended target passing this information. Yes pacemaker does that but - as I mentioned - some agents do employ that "internal" nodes map "technique". I see no mention of that/similar in the manual for VirtualDomain so I asked if perhaps somebody had an idea of how to archive such result by some other means. thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] VirtualDomain - node map - ?
Hi guys Some agents do employ that concept of node/host map which I do not see in any manual/docs that this agent does - would you suggest some technique or tips on how to achieve similar? I'm thinking specifically of 'migrate' here, as I understand 'migration' just uses OS' own resolver to call migrate_to node. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] ocf:heartbeat:galera - SELinux
Hi guys. I'd prefer to avoid putting 'mysqld_t' into permissive side of SELinux - which apparently would be the easiest way out - so I wonder: does anybody here have a SELinux module which would make Galera agent run successfully? Devel/authors, I think ignored that part - that critical part. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
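Not a polished policy, but in case it unblocks someone: a local module can be generated from the actual denials and reviewed before loading, which is at least narrower than flipping mysqld_t to permissive:
-> $ ausearch -m avc -ts recent | audit2allow -M galera_local
# read galera_local.te first and strip anything you don't actually want to allow
-> $ semodule -i galera_local.pp
(audit2allow comes from policycoreutils-python-utils on EL systems.)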
[ClusterLabs] cluster with redundant links - PCSD offline
Hi guys. I have a simple 2-node cluster with redundant links and I wonder why status reports like this: ... Node List: * Node swir (1): online, feature set 3.16.2 * Node whale (2): online, feature set 3.16.2 ... PCSD Status: swir: Online whale: Offline ... Cluster's config: ... Nodes: swir: Link 0 address: 10.1.1.100 Link 1 address: 10.3.1.100 nodeid: 1 whale: Link 0 address: 10.1.1.101 Link 1 address: 10.3.1.101 nodeid: 2 ... Is that all normal for a cluster when 'Link 0' is down? I thought redundant link(s) would be for that exact reason - so cluster would remain fully operational. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
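As far as I understand it, the two status lines answer different questions: the node list comes from corosync/pacemaker membership, which keeps running over Link 1, while "PCSD Status" is pcs probing the pcsd daemon on TCP 2224 at the one address the node was registered with - so with Link 0 down, pcsd shows Offline even though the cluster itself stays fully operational. A hedged way to confirm the membership side really is healthy on the surviving link:
-> $ corosync-cfgtool -s
# prints per-link status for every node; the Link 1 entries should show "connected"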
Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)
On 03/01/2023 21:44, Ken Gaillot wrote: On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote: On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote: Hi, On Tue, 3 Jan 2023 16:44:01 +0100 lejeczek via Users wrote: To get/have Postgresql cluster with 'pgsqlms' resource, such cluster needs a 'master' IP - what do you guys do when/if you have multiple resources off this agent? I wonder if it is possible to keep just one IP and have all those resources go to it - probably 'scoring' would be very tricky then, or perhaps not? That would mean all promoted pgsql MUST be on the same node at any time. If one of your instance got some troubles and need to failover, *ALL* of them would failover. This imply not just a small failure time window for one instance, but for all of them, all the users. Or you do separate IP for each 'pgsqlms' resource - the easiest way out? That looks like a better option to me, yes. Regards, Not related - Is this an old bug?: -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Master" op monitor interval=16s timeout=10s role="Slave" op notify timeout=60s meta promotable=true notify=true master-max=1 --disable Error: Validation result from agent (use --force to override): ocf-exit-reason:You must set meta parameter notify=true for your master resource Error: Errors have occurred, therefore pcs is unable to continue pcs now runs an agent's validate-all action before creating a resource. In this case it's detecting a real issue in your command. The options you have after "meta" are clone options, not meta options of the resource being cloned. If you just change "meta" to "clone" it should work. Nope. Exact same error message. If I remember correctly there was a bug specifically pertained to 'notify=true' I'm on C8S with resource-agents-paf-4.9.0-35.el8.x86_64. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)
On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote: Hi, On Tue, 3 Jan 2023 16:44:01 +0100 lejeczek via Users wrote: To get/have Postgresql cluster with 'pgsqlms' resource, such cluster needs a 'master' IP - what do you guys do when/if you have multiple resources off this agent? I wonder if it is possible to keep just one IP and have all those resources go to it - probably 'scoring' would be very tricky then, or perhaps not? That would mean all promoted pgsql MUST be on the same node at any time. If one of your instance got some troubles and need to failover, *ALL* of them would failover. This imply not just a small failure time window for one instance, but for all of them, all the users. Or you do separate IP for each 'pgsqlms' resource - the easiest way out? That looks like a better option to me, yes. Regards, Not related - Is this an old bug?: -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Master" op monitor interval=16s timeout=10s role="Slave" op notify timeout=60s meta promotable=true notify=true master-max=1 --disable Error: Validation result from agent (use --force to override): ocf-exit-reason:You must set meta parameter notify=true for your master resource Error: Errors have occurred, therefore pcs is unable to continue ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] multiple resources - pgsqlms - and IP(s)
Hi guys. To get/have Postgresql cluster with 'pgsqlms' resource, such cluster needs a 'master' IP - what do you guys do when/if you have multiple resources off this agent? I wonder if it is possible to keep just one IP and have all those resources go to it - probably 'scoring' would be very tricky then, or perhaps not? Or you do separate IP for each 'pgsqlms' resource - the easiest way out? all thoughts share are much appreciated. many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
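If it helps to make the "one IP per instance" option concrete - a hedged sketch for one of the instances, following the same colocation/ordering pattern used elsewhere in this archive (the address and resource names are examples):
-> $ pcs resource create vip-apps ocf:heartbeat:IPaddr2 ip=10.0.1.85 cidr_netmask=24 op monitor interval=10s
-> $ pcs constraint colocation add vip-apps with master pgsqld-apps-clone INFINITY
-> $ pcs constraint order promote pgsqld-apps-clone then start vip-apps symmetrical=false kind=Mandatory
-> $ pcs constraint order demote pgsqld-apps-clone then stop vip-apps symmetrical=false kind=Mandatory
Each pgsqlms instance then carries its own VIP, and the instances can be promoted on different nodes independently.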
Re: [ClusterLabs] podman containers as resources - ? - with a twist
On 28/12/2022 21:53, Reid Wahl wrote: On Wed, Dec 28, 2022 at 6:08 AM lejeczek via Users wrote: Hi guys. I have a situation which begins to look like quite the pickle and I'm in it, with no possible or no elegant at least, way out. I'm hoping you guys can share your thoughts. My cluster mounts a path, in two steps 1) runs systemd luks service 2) mount that unlocked luks device under a certain path now... that certain path is where user(s) home dir resides and... as the result of all that 'systemd' does not pick up user's systemd units. (must be way too late for 'systemd') How would you fix that? Right now I manually poke systemd with, as that given user: -> $ systemctl --user daemon-reload only then 'systemd' picks up user units - until then 'systemd' says "Unit ... could not be found" Naturally, I do not want 'manually'. I'm thinking... somehow have cluster make OS's 'systemd' redo that user systemd bits, after the resource successful start, or... have cluster somehow manage that user's systemd units directly, on its own. In case it might make it bit more clear - those units are 'podman' containers, non-root containers. You might be able to manage these containers via the ocf:heartbeat:podman resource agent, with `--user=` in the `run_opts` resource option along with any other relevant options. If something like that doesn't work, then you could write a simple lsb-class or systemd-class cluster resource to do what you're currently doing manually. There may be other options; those are the two that come to mind. From man pages for that resource I assumed that agent creates a new container (from "scratch") each time and - if that is the case indeed - I need to control/manage an existing ones. I was thinking about putting these bits I do manually into a 'systemd' service but I cannot see how to go about it - I understand there is a clear separation of OS/root systemd and users' systemd and I don't think - I can be wrong I do hope - it is possible to have OS/root (from that namespace) sytemd do users systemd/namespaces. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
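One way to script the manual step, as a sketch rather than a recommendation: a small root-level oneshot unit that pokes the user manager once the cluster-managed path is mounted. It assumes lingering is enabled for the account (loginctl enable-linger appuser) so a user manager exists at boot, and a systemd new enough to accept --user together with --machine; 'appuser' and 'apps.mount' are placeholders.
[Unit]
Description=Reload appuser's user units after the cluster mounts /apps (sketch)
After=apps.mount

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --user --machine=appuser@.host daemon-reload

[Install]
WantedBy=multi-user.target
Such a unit can either be enabled on every node or wrapped as a systemd: cluster resource ordered after the luks/Filesystem resources, which keeps the whole sequence inside pacemaker.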
[ClusterLabs] podman containers as resources - ? - with a twist
Hi guys. I have a situation which begins to look like quite the pickle and I'm in it, with no possible or no elegant at least, way out. I'm hoping you guys can share your thoughts. My cluster mounts a path, in two steps 1) runs systemd luks service 2) mount that unlocked luks device under a certain path now... that certain path is where user(s) home dir resides and... as the result of all that 'systemd' does not pick up user's systemd units. (must be way too late for 'systemd') How would you fix that? Right now I manually poke systemd with, as that given user: -> $ systemctl --user daemon-reload only then 'systemd' picks up user units - until then 'systemd' says "Unit ... could not be found" Naturally, I do not want 'manually'. I'm thinking... somehow have cluster make OS's 'systemd' redo that user systemd bits, after the resource successful start, or... have cluster somehow manage that user's systemd units directly, on its own. In case it might make it bit more clear - those units are 'podman' containers, non-root containers. All thoughts shared are much appreciated. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Ordering - clones & relocation
Hi guys. This might be tricky but also can be trivial, I'm hoping for the latter of course. Ordering Constraints: ... start non-clone then start a-clone-resource (kind:Mandatory) ... Now, when 'non-clone' gets relocated what then happens to 'a-clone-resource' ? or.. for that matter, to any resource when 'ordering' is in play? Do such "dependent" resources get restarted, when source resource 'moves' and if yes, is it possible to have PCS do not that? many thanks, L.___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
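From what I can tell, the answer hinges on 'kind': with kind=Mandatory, restarting or relocating the first resource forces the dependent one to be stopped and started again; with kind=Optional the ordering is only honoured when both actions already happen in the same transition, so a plain move of 'non-clone' should leave 'a-clone-resource' alone. A hedged example of relaxing the constraint:
-> $ pcs constraint order start non-clone then start a-clone-resource kind=Optional
(Whether Optional is safe obviously depends on whether the clone can really keep running while the first resource is moving.)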
Re: [ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?
On 28/07/2022 00:33, Reid Wahl wrote: On Wed, Jul 27, 2022 at 2:08 AM lejeczek via Users wrote: On 26/07/2022 20:56, Reid Wahl wrote: On Tue, Jul 26, 2022 at 4:21 AM lejeczek via Users wrote: Hi guys I set up a clone of a new instance of mariadb galera - which otherwise, outside of pcs works - but I see something weird. Firstly cluster claims it's all good: -> $ pcs status --full ... * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable): * mariadb-apps(ocf::heartbeat:galera): Master sucker.internal.ccn * mariadb-apps(ocf::heartbeat:galera): Master drunk.internal.ccn Clearly the problem is that your server is drunk. but that mariadb is _not_ started actually. In clone's attr I set: config=/apps/etc/mariadb-server.cnf I also for peace of mind set: datadir=/apps/mysql/data even tough '/apps/etc/mariadb-server.cnf' declares that & other bits - again, works outside of pcs. Then I see in pacemaker logs: notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13 mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ] .. and I think what the F? resource-agents-4.9.0-22.el8.x86_64 All thoughts share are much appreciated. many thanks, L. How do you start mariadb outside of pacemaker's control? simply by: -> $ sudo -u mysql /usr/libexec/mysqld --defaults-file=/apps/etc/mariadb-server.cnf --wsrep-new-clusterrep Got it. FWIW, `--wsrep-new-clusterrep` doesn't appear to be a valid option (no search results and nothing in the help output). Maybe this is a typo in the email; `--wsrep-new-cluster` is valid. It seems that *something* is running, based on the "Starting" message and the fact that the resources are still in Started state... Yes, a "regular" instance of galera (which outside of pacemaker works with "regular" systemd) is running as ofc one resource already. The logic to start and promote the galera resource is contained within /usr/lib/ocf/resource.d/heartbeat/galera and /usr/lib/ocf/lib/heartbeat/mysql-common.sh. I encourage you to inspect those for any relevant differences between your own startup method and that of the resource agent. As one example, note that the resource agent uses mysqld_safe by default. This is configurable via the `binary` option for the resource. Be sure that you've looked at all the available options (`pcs resource describe ocf:heartbeat:galera`) and configured any of them that you need. You've definitely at least started that process with config and datadir. log snippet I pasted was not to emphasize 'mysqld_safe' but the fact that resource does: ... Starting mysqld daemon with databases from /var/lib/mysql ... Yes, I did check man pages for options and if if you suggest checking source code then I say that's case for a bug report. For something like pacemaker, I wouldn't point a user to the code. galera and mysql-common.sh are shell scripts, so they're easier to interpret. The idea was "maybe you can find a discrepancy between how the scripts are starting mysql, and how you're manually starting mysql." Filing a bug might be reasonable. I don't have much hands-on experience with mysql or with this resource agent. I tried my best to make it clear, doubt can do that better but I'll re-try: a) bits needed to run a "new/second" instance are all in 'mariadb-server.cnf' and galera works outside of pacemaker (cmd above) b) if I use attrs: 'datadir' & 'config' yet resource tells me "..daemon with databases from /var/lib/mysql..." then.. 
people looking into the code should perhaps be you (as Redhat employee) and/or other authors/developers - if I filed it as BZ - no? It should be easily reproducible, have: 1) a galera (as a 2nd/3rd/etc. instance, outside & with different settings from "standard" OS's rpm installation) work & tested 2) set up a resource: -> $ pcs resource create mariadb-apps ocf:heartbeat:galera cluster_host_map="drunk.internal.ccn:10.0.1.6;sucker.internal.ccn:10.0.1.7" config="/apps/etc/mariadb-server.cnf" wsrep_cluster_address="gcomm://10.0.1.6:5567,10.0.1.7:5567" user=mysql group=mysql check_user="pacemaker" check_passwd="pacemaker#9897" op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="20s" op monitor role="Master" OCF_CHECK_LEVEL="0" timeout="30s" interval="10s" op monitor role="Slave" OCF_CHECK_LEVEL="0" timeout="30s" interval="30s" promotable promoted-max=2 meta failure-timeout=30s The datadir option isn't specified in the resource create command above. If datadir not explicitly
Re: [ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?
On 26/07/2022 20:56, Reid Wahl wrote: On Tue, Jul 26, 2022 at 4:21 AM lejeczek via Users wrote: Hi guys I set up a clone of a new instance of mariadb galera - which otherwise, outside of pcs works - but I see something weird. Firstly cluster claims it's all good: -> $ pcs status --full ... * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable): * mariadb-apps(ocf::heartbeat:galera): Master sucker.internal.ccn * mariadb-apps(ocf::heartbeat:galera): Master drunk.internal.ccn Clearly the problem is that your server is drunk. but that mariadb is _not_ started actually. In clone's attr I set: config=/apps/etc/mariadb-server.cnf I also for peace of mind set: datadir=/apps/mysql/data even tough '/apps/etc/mariadb-server.cnf' declares that & other bits - again, works outside of pcs. Then I see in pacemaker logs: notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13 mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ] .. and I think what the F? resource-agents-4.9.0-22.el8.x86_64 All thoughts share are much appreciated. many thanks, L. How do you start mariadb outside of pacemaker's control? simply by: -> $ sudo -u mysql /usr/libexec/mysqld --defaults-file=/apps/etc/mariadb-server.cnf --wsrep-new-clusterrep It seems that *something* is running, based on the "Starting" message and the fact that the resources are still in Started state... Yes, a "regular" instance of galera (which outside of pacemaker works with "regular" systemd) is running as ofc one resource already. The logic to start and promote the galera resource is contained within /usr/lib/ocf/resource.d/heartbeat/galera and /usr/lib/ocf/lib/heartbeat/mysql-common.sh. I encourage you to inspect those for any relevant differences between your own startup method and that of the resource agent. As one example, note that the resource agent uses mysqld_safe by default. This is configurable via the `binary` option for the resource. Be sure that you've looked at all the available options (`pcs resource describe ocf:heartbeat:galera`) and configured any of them that you need. You've definitely at least started that process with config and datadir. log snippet I pasted was not to emphasize 'mysqld_safe' but the fact that resource does: ... Starting mysqld daemon with databases from /var/lib/mysql ... Yes, I did check man pages for options and if if you suggest checking source code then I say that's case for a bug report. I tried my best to make it clear, doubt can do that better but I'll re-try: a) bits needed to run a "new/second" instance are all in 'mariadb-server.cnf' and galera works outside of pacemaker (cmd above) b) if I use attrs: 'datadir' & 'config' yet resource tells me "..daemon with databases from /var/lib/mysql..." then.. people looking into the code should perhaps be you (as Redhat employee) and/or other authors/developers - if I filed it as BZ - no? It should be easily reproducible, have: 1) a galera (as a 2nd/3rd/etc. 
instance, outside & with different settings from "standard" OS's rpm installation) work & tested 2) set up a resource: -> $ pcs resource create mariadb-apps ocf:heartbeat:galera cluster_host_map="drunk.internal.ccn:10.0.1.6;sucker.internal.ccn:10.0.1.7" config="/apps/etc/mariadb-server.cnf" wsrep_cluster_address="gcomm://10.0.1.6:5567,10.0.1.7:5567" user=mysql group=mysql check_user="pacemaker" check_passwd="pacemaker#9897" op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="20s" op monitor role="Master" OCF_CHECK_LEVEL="0" timeout="30s" interval="10s" op monitor role="Slave" OCF_CHECK_LEVEL="0" timeout="30s" interval="30s" promotable promoted-max=2 meta failure-timeout=30s and you should get (give version & evn is the same) see what I see - weird "misbehavior" many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?
Hi guys. I set up a clone of a new instance of mariadb galera - which otherwise, outside of pcs, works - but I see something weird. Firstly, the cluster claims it's all good: -> $ pcs status --full ... * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable): * mariadb-apps (ocf::heartbeat:galera): Master sucker.internal.ccn * mariadb-apps (ocf::heartbeat:galera): Master drunk.internal.ccn but that mariadb is _not_ started actually. In the clone's attrs I set: config=/apps/etc/mariadb-server.cnf I also, for peace of mind, set: datadir=/apps/mysql/data even though '/apps/etc/mariadb-server.cnf' declares that & other bits - again, it works outside of pcs. Then I see in pacemaker logs: notice: mariadb-apps_star...@drunk.internal.ccn output [ 220726 11:56:13 mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ] .. and I think what the F? resource-agents-4.9.0-22.el8.x86_64 All thoughts shared are much appreciated. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
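Following up on the datadir point raised in the replies above: the reproduction command quoted there never passes datadir, and (if I read mysql-common.sh right) the agent hands --datadir to mysqld_safe from its own parameter, defaulting to /var/lib/mysql regardless of what the cnf file says - which would explain the log line. A hedged fix is simply to make the paths explicit on the resource:
-> $ pcs resource update mariadb-apps datadir="/apps/mysql/data" config="/apps/etc/mariadb-server.cnf"
(socket, pid and log parameters exist as well if those also live under /apps; 'pcs resource describe ocf:heartbeat:galera' lists them.)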
[ClusterLabs] Redis OCF agent - cannot decide which is master - ?
Hi guys. I have a peculiar case - to me at least - here. Unless I tell cluster to move resource to a node (two-node cluster) with '--master' cluster keeps both nodes as "slaves" As soon as I remove such a constraint both nodes start to log: ... 3442363:M 07 Jul 2022 20:05:31.561 # Setting secondary replication ID to b14b339d4b0162c3bcd73c9591b60fa491f475dd, valid up to offset: 435. New replication ID is d704c3a0815fdca98c22392341c7a05e2bb89429 3442363:M 07 Jul 2022 20:05:31.561 * MASTER MODE enabled (user request from 'id=66 addr=/run/redis/redis.sock:0 laddr=/run/redis/redis.sock:0 fd=8 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=40920 argv-mem=12 obl=0 oll=0 omem=0 tot-mem=61476 events=r cmd=slaveof user=default redir=-1') 3442363:M 07 Jul 2022 20:05:32.269 * Replica 10.0.1.7:6379 asks for synchronization 3442363:M 07 Jul 2022 20:05:32.269 * Partial resynchronization request from 10.0.1.7:6379 accepted. Sending 0 bytes of backlog starting from offset 435. 3442363:S 07 Jul 2022 20:11:19.175 # Connection with replica 10.0.1.7:6379 lost. 3442363:S 07 Jul 2022 20:11:19.175 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer. 3442363:S 07 Jul 2022 20:11:19.175 * Connecting to MASTER no-such-master:6379 3442363:S 07 Jul 2022 20:11:19.177 # Unable to connect to MASTER: (null) 3442363:S 07 Jul 2022 20:11:19.177 * REPLICAOF no-such-master:6379 enabled (user request from 'id=90 addr=/run/redis/redis.sock:0 laddr=/run/redis/redis.sock:0 fd=10 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 qbuf=48 qbuf-free=40906 argv-mem=25 obl=0 oll=0 omem=0 tot-mem=61489 events=r cmd=slaveof user=default redir=-1') and then infamous: ... 3442363:S 07 Jul 2022 20:11:24.184 # Unable to connect to MASTER: (null) 3442363:S 07 Jul 2022 20:11:25.187 * Connecting to MASTER no-such-master:6379 .. What is PCS/agent getting wrong there? btw. OCF Redis does not need 'redis-sentinel.conf' , right? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
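Two thoughts that may help narrow this down. First, no, the OCF redis agent does its own promotion handling, so redis-sentinel.conf should not be needed; 'no-such-master' looks like the placeholder the agent deliberately points a replica at while it has no promoted instance, so the real question is why neither node ever earns a promotion. Second, the promotion decision is driven by node attributes / promotion scores, which can be inspected directly - a hedged starting point:
-> $ crm_mon -1 -A
# look for master-<resource-name> attributes on each node; if neither node has one,
# the agent never asked to be promoted and the location constraint was only masking that
-> $ pcs constraint --full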
Re: [ClusterLabs] PAF with postgresql 13?
On 08/03/2022 16:20, Jehan-Guillaume de Rorthais wrote: Removing the node attributes with the resource might be legit from the Pacemaker point of view, but I'm not sure how they can track the dependency (ping Ken?). PAF has no way to know the ressource is being deleted and can not remove its node attribute before hand. Maybe PCS can look for promotable score and remove them during the "resource delete" command (ping Tomas)? bit of catch-22, no? To those of us to whom it's first foray into all "this" it might be - it was to me - I see "attributes" hanging on for no reason, resource does not exist, then first thing I want to do is a "cleanup" - natural - but this to result in inability to re-create the resource(all remain slaves) at later time with 'pcs' one-liner (no cib-push) is a: no no should be not... ... because, at later time, why should not resource re/creation with 'pcs resource create' bring back those 'node attrs' ? Unless... 'pgsqlms' can only be done by 'cib-push' under every circumstance - so, a) after manual node attr removal b) clean cluster, very first no prior 'pgsqlms' resource deployment. yes, it would be great to have this 'improvement/enhancement' in future releases incorporated - those node attr removed @resource removal time. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] PAF with postgresql 13?
On 08/03/2022 10:21, Jehan-Guillaume de Rorthais wrote: op start timeout=60s \ op stop timeout=60s \ op promote timeout=30s >> \ op demote timeout=120s \ op monitor interval=15s timeout=10s >> role="Master" meta master-max=1 \ op monitor interval=16s >> timeout=10s role="Slave" \ op notify timeout=60s meta notify=true > Because "op" appears, we are back in resource ("pgsqld") context, > anything after is interpreted as ressource and operation attributes, > even the "meta notify=true". That's why your pgsqld-clone doesn't > have the meta attribute "notify=true" set. Here is one-liner that should do - add, as per 'debug-' suggestion, 'master-max=1' -> $ pcs resource create pgsqld ocf:heartbeat:pgsqlms bindir=/usr/bin pgdata=/var/lib/pgsql/data op start timeout=60s op stop timeout=60s op promote timeout=30s op demote timeout=120s op monitor interval=15s timeout=10s role="Master" op monitor interval=16s timeout=10s role="Slave" op notify timeout=60s promotable notify=true master-max=1 && pcs constraint colocation add HA-10-1-1-226 with master pgsqld-clone INFINITY && pcs constraint order promote pgsqld-clone then start HA-10-1-1-226 symmetrical=false kind=Mandatory && pcs constraint order demote pgsqld-clone then stop HA-10-1-1-226 symmetrical=false kind=Mandatory but ... ! this "issue" is reproducible! So now you have working 'pgsqlms', then do: -> $ pcs resource delete pgsqld '-clone' should get removed too, so now no 'pgsqld' resource(s) but cluster - weirdly in my mind - leaves node attributes on. I see 'master-pgsqld' with each node and do not see why 'node attributes' should be kept(certainly shown) for non-existent resources(to which only resources those attrs are instinct) So, you want to "clean" that for, perhaps for now you are not going to have/use 'pgsqlms', you can do that with: -> $ pcs node attribute node1 master-pgsqld="" # same for remaining nodes now .. ! repeat your one-liner which worked just a moment ago and you should get exact same or similar errors(while all nodes are stuck on 'slave' -> $ pcs resource debug-promote pgsqld crm_resource: Error performing operation: Error occurred Operation force-promote for pgsqld (ocf:heartbeat:pgsqlms) returned 1 (error: Can not get current node LSN location) /tmp:5432 - accepting connections ocf-exit-reason:Can not get current node LSN location You have to 'cib-push' to "fix" this very problem. In my(admin's) opinion this is a 100% candidate for a bug - whether in PCS or PAF - perhaps authors may wish to comment? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
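One small addition to the attribute-cleanup step above: rather than assigning an empty value, the promotion score can be deleted outright at the attribute level, which is worth trying before resorting to a cib-push - a hedged example for one node:
-> $ crm_attribute --node node1 --name master-pgsqld --lifetime forever --delete
Repeat per node; 'pcs node attribute' afterwards should no longer list master-pgsqld at all. Whether that avoids the re-create failure described above I can't promise, but it at least takes the leftover attribute out of the equation.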
Re: [ClusterLabs] PAF with postgresql 13?
On 21/02/2022 16:01, Jehan-Guillaume de Rorthais wrote: On Mon, 21 Feb 2022 09:04:27 + CHAMPAGNE Julie wrote: ... The last release is 2 years old, is it still in development? There's no activity because there's not much to do on it. PAF is mainly in maintenance (bug fix) mode. I have few ideas here and there. It might land soon or later, but nothing really fancy. It just works. The current effort is to reborn the old workshop that was written few years ago to translate it to english. Perhaps as the author(s) you can chip in and/or help via comments to rectify this: ... Problem: package resource-agents-paf-4.9.0-7.el8.x86_64 requires resource-agents = 4.9.0-7.el8, but none of the providers can be installed - cannot install both resource-agents-4.9.0-7.el8.x86_64 and resource-agents-4.9.0-13.el8.x86_64 - cannot install the best update candidate for package resource-agents-paf-2.3.0-1.noarch - cannot install the best update candidate for package resource-agents-4.9.0-13.el8.x86_64 (try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages) ... How I understand that is that CentOS guys/community have PAF in 'resilientstorage' repo. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] VirtualDomain + GlusterFS - troubles coming with CentOS 9
Hi guys. With CentOS 9's packages & binaries, libgfapi is removed from libvirt/qemu, which means that if you want to use GlusterFS for VM image storage you have to expose its volumes via an FS mount point - that is how I understand these changes - and this seems to cause quite a problem for the HA. Maybe it's only my setup, yet I'd think it'll be the only way forward for everybody here who is on Gluster: having a simple mount point with VM resources running off it causes the HA cluster to fail to live-migrate - but only at system reboot - and makes such a reboot/shutdown take "ages". Does anybody see the same or a similar problem? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
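A hedged guess at the mechanism, plus what has helped in similar setups: if the Gluster mount is only an fstab/systemd mount, nothing tells pacemaker that the VMs must be migrated or stopped before that mount (and glusterd) goes away at shutdown, so the VirtualDomain migrate/stop actions hang until they time out. Two ways to express the dependency - resource names are placeholders:
# if the mount is cluster-managed (e.g. a cloned ocf:heartbeat:Filesystem called gluster-mount-clone):
-> $ pcs constraint order start gluster-mount-clone then c8kubermaster1 kind=Mandatory
-> $ pcs constraint colocation add c8kubermaster1 with gluster-mount-clone INFINITY
# if the mount stays outside the cluster: a drop-in on resource-agents-deps.target
# (same mechanism as in the 99-VirtualDomain-libvirt.conf thread above) ordering it after the mount unit
Either way the intent is the same: make sure pacemaker gets the VMs off the node while the storage underneath them is still there.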
Re: [ClusterLabs] ethernet link up/down - ?
On 07/02/2022 19:21, Antony Stone wrote: On Monday 07 February 2022 at 20:09:02, lejeczek via Users wrote: Hi guys How do you guys go about doing link up/down as a resource? I apply or remove addresses on the interface, using "IPaddr2" and "IPv6addr", which I know is not the same thing. Why do you separately want to control link up/down? I can't think what I would use this for. Antony. Kind of similar - tcp/ip and those layers configs are delivered by DHCP. I'd think it would have to be a clone resource with one master without any constraints where cluster freely decides where to put master(link up) on - which is when link gets dhcp-served. But I wonder if that would mean writing up a new resource - I don't think there is anything like that included in ready-made pcs/ocf packages. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
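If it does come down to writing a new agent, the three actions are small; a minimal sketch of their cores (eth1 is a placeholder, and a real agent also needs the meta-data/validate plumbing like the OCF wrapper sketch earlier in the archive):
# start:   ip link set dev eth1 up        # the DHCP client / NetworkManager then configures it
# stop:    ip link set dev eth1 down
# monitor: [ "$(cat /sys/class/net/eth1/operstate)" = "up" ]
Run as a promotable clone with promoted-max=1, the promote/demote actions would map onto the same up/down calls, which matches the "one master holds the link" idea above.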
[ClusterLabs] ethernet link up/down - ?
Hi guys How do you guys go about doing link up/down as a resource? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] VirtualDomain - unable to migrate
Hi guys I'm having problems with cluster on new CentOS Stream 9 and I'd be glad if you can share your thoughts. -> $ pcs resource move c8kubermaster2 swir Location constraint to move resource 'c8kubermaster2' has been created Waiting for the cluster to apply configuration changes... Location constraint created to move resource 'c8kubermaster2' has been removed Waiting for the cluster to apply configuration changes... Error: resource 'c8kubermaster2' is running on node 'whale' Error: Errors have occurred, therefore pcs is unable to continue Not much in pacemaker logs, perhaps nothing at all. VM does migrate with 'virsh' just fine. -> $ pcs resource config c8kubermaster1 Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain) Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml hypervisor=qemu:///system migration_transport=ssh Meta Attrs: allow-migrate=true failure-timeout=30s Operations: migrate_from interval=0s timeout=90s (c8kubermaster1-migrate_from-interval-0s) migrate_to interval=0s timeout=90s (c8kubermaster1-migrate_to-interval-0s) monitor interval=30s (c8kubermaster1-monitor-interval-30s) start interval=0s timeout=60s (c8kubermaster1-start-interval-0s) stop interval=0s timeout=60s (c8kubermaster1-stop-interval-0s) Any and all suggestions & thoughts are much appreciated. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
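A hedged way to get more than the terse pcs error: drive the move through crm_resource and then read the scheduler/controller messages, which usually name the constraint or configuration element being choked on:
-> $ crm_resource --move --resource c8kubermaster2 --node swir
-> $ journalctl -u pacemaker --since "10 minutes ago" | grep -i -e c8kubermaster2 -e error
The newer pcs 'resource move' creates a temporary location constraint, waits, and then removes it again; when the resource has not moved by that point it reports the "is running on node" error seen above, so the real cause tends to be in the pacemaker logs rather than in pcs itself.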
[ClusterLabs] no systemd - ?
hi guys I've always been RHE/Fedora user and memories of times before 'systemd' almost completely vacated my brain - nowadays, is it possible to have HA without systemd would you know and if so, then how would that work? many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Resource(s) for mysql+galera cluster
On 15/12/2021 08:16, Michele Baldessari wrote: Hi, On Tue, Dec 14, 2021 at 03:09:56PM +, lejeczek via Users wrote: I failed to find any good or any for that matter info, on present-day mariadb/mysql galera cluster setups - would you know of any docs which discuss such scenario comprehensively? I see there among resources are 'mariadb' and 'galera' but which one to use I'm still confused. in the deployment of OpenStack via TripleO[1] we set up galera in multi-master mode using the galera resource agent[2] (it runs in bundles aka containers in our specific case, but that is tangential). Here is an example of the pcs commands we use to start it (these are a bit old, things might have changed since): pcs property set --node controller-1 galera-role=true pcs resource bundle create galera-bundle container docker image=192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest replicas=3 masters=3 options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" run-command="/bin/bash /usr/local/bin/kolla_start" network=host storage-map id=mysql-cfg-files source-dir=/var/lib/kolla/config_files/mysql.json target-dir=/var/lib/kolla/config_files/config.json options=ro storage-map id=mysql-cfg-data source-dir=/var/lib/config-data/puppet-generated/mysql/ target-dir=/var/lib/kolla/config_files/src options=ro storage-map id=mysql-hosts source-dir=/etc/hosts target-dir=/etc/hosts options=ro storage-map id=mysql-localtime source-dir=/etc/localtime target-dir=/etc/localtime options=ro storage-map id=mysql-lib source-dir=/var/lib/mysql target-dir=/var/lib/mysql options=rw storage-map id=mysql-log-mariadb source-dir=/var/log/mariadb target-dir=/var/log/mariadb options=rw storage-map id=mysql-log source-dir=/var/log/containers/mysql ta rget-dir=/var/log/mysql options=rw storage-map id=mysql-dev-log source-dir=/dev/log target-dir=/dev/log options=rw network control-port=3123 --disabled pcs constraint location galera-bundle rule resource-discovery=exclusive score=0 galera-role eq true pcs resource create galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' additional_parameters='--open-files-limit=16384' enable_creation=true wsrep_cluster_address='gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain' cluster_host_map='controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain' meta master-max=3 ordered=true container-attribute-target=host op promote timeout=300s on-fail=block bundle galera-bundle I don't think we have any docs discussing tradeoffs around this setup, but maybe this is a start. If you want multi-master galera then you can do it with the galera RA. 
If you need to look at more recent configs check out [3] (Click on a job and the logs and you can check something like [4] for the full pcs config) hth, Michele [1] https://docs.openstack.org/tripleo-docs/latest/ [2] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/galera.in [3] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master [4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/31594c2/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz I had used 'galera' OCF resource which worked, following man page for it but if I can share an opinion - it's seems that most(if not all developers) "suffer" from the same problem - namely, inability to write a solid, very good man/docs pages, which is the case with HA/Pacemaker. (there are exceptions such as 'lvm2', 'systemd', 'firewalld' & more (I'm a CentOS/Fedora user)) Especially man pages suffer - while developers know that it's first place user/admin goes to. So I struggle with multi-master setup - many thanks for the examples - "regular" master/slave works, but that gives out only one node having 'mariadb' up&running, whereas remaining nodes keep resource in 'standby' mode - as I understand it. When I try to fiddle with similar to you configs then 'galera' tells me: .. Galera must be configured as a multistate Master/Slave resource. .. and my setup should be simpler - no containers, no bundles - only 3 nodes. many thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
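In case a container-free variant helps the next reader: stripping the bundle parts out of the example above, a plain three-node multi-master galera would look roughly like this (node names are placeholders, and older pcs/pacemaker spell promoted-max as master-max):
-> $ pcs resource create galera ocf:heartbeat:galera enable_creation=true wsrep_cluster_address="gcomm://node1,node2,node3" op promote timeout=300s on-fail=block promotable promoted-max=3 notify=true
The "Galera must be configured as a multistate Master/Slave resource" message appears to be the agent's way of saying the promotable part is missing, and a promoted-max of 1 is also why a "regular" master/slave setup leaves all but one node sitting without mysqld running.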
[ClusterLabs] Resource(s) for mysql+galera cluster
Hi guys. I failed to find any good - or any, for that matter - info on present-day mariadb/mysql galera cluster setups; would you know of any docs which discuss such a scenario comprehensively? I see that among the resources there are 'mariadb' and 'galera', but I'm still confused about which one to use. many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Antw: [EXT] VirtualDomain - started but... not really
Hi! My guess is that you checked the corresponding logs already; why not show them here? I can imagine that the VMs die rather early after start. Regards, Ulrich lejeczek via Users schrieb am 10.12.2021 um 17:33 in Nachricht : Hi guys. I quite often.. well, to frequently in my mind, see a VM which cluster says: -> $ pcs resource status | grep -v disabled ... * c8kubermaster2(ocf::heartbeat:VirtualDomain): Started dzien .. but that is false, also cluster itself confirms it: -> $ pcs resource debug-monitor c8kubermaster2 crm_resource: Error performing operation: Not running Operation force-check for c8kubermaster2 (ocf:heartbeat:VirtualDomain) returned: 'not running' (7) What is the issue here, might be & how best to troubleshoot it? -> $ pcs resource config c8kubermaster2 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain) Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migration_transport=ssh Meta Attrs: allow-migrate=true failure-timeout=30s Operations: migrate_from interval=0s timeout=180s (c8kubermaster2-migrate_from-interval-0s) migrate_to interval=0s timeout=180s (c8kubermaster2-migrate_to-interval-0s) monitor interval=30s (c8kubermaster2-monitor-interval-30s) start interval=0s timeout=90s (c8kubermaster2-start-interval-0s) stop interval=0s timeout=90s (c8kubermaster2-stop-interval-0s) many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/ Not much there in the logs I could see (which is probably why cluster decides the resource is okey) What is the resource's monitor for it not for that exactly - to check the state of resource - whether it dies early or late should not matter. What suffices in order to "fix" such resource false-positive, I do quick dis/enable the resource or as in this very instance rpm updates which restarted node. Again, how cluster might think resource is okey while debug-monitor shows it's not. I only do not know how to reproduce this in a controlled, orderly manner. thanks, L ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
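For cross-checking this kind of false "Started", two things worth comparing side by side on the node that claims to run the VM (the second is just libvirt's own view):
-> $ crm_resource --force-check --resource c8kubermaster2
-> $ virsh -c qemu:///system domstate c8kubermaster2
If virsh reports the domain shut off while the status output still says Started, capturing 'pcs status --full' plus the pacemaker log around that moment is probably the most useful thing to bring back to the list; the failure-timeout=30s on the resource also means failed monitors are forgiven quickly, which can make the history harder to read.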
Re: [ClusterLabs] VirtualDomain - started but... not really
On 10/12/2021 21:17, Ken Gaillot wrote: On Fri, 2021-12-10 at 16:33 +, lejeczek via Users wrote: Hi guys. I quite often.. well, too frequently in my mind, see a VM which the cluster says is: -> $ pcs resource status | grep -v disabled ... * c8kubermaster2 (ocf::heartbeat:VirtualDomain): Started dzien .. but that is false, and the cluster itself confirms it: -> $ pcs resource debug-monitor c8kubermaster2 crm_resource: Error performing operation: Not running Operation force-check for c8kubermaster2 (ocf:heartbeat:VirtualDomain) returned: 'not running' (7) What is the issue here, or what might it be, & how best to troubleshoot it? Is there a recurring monitor configured for the resource? I have already shown the config for the resource - is there not one? -> $ pcs resource config c8kubermaster2 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain) Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migration_transport=ssh Meta Attrs: allow-migrate=true failure-timeout=30s Operations: migrate_from interval=0s timeout=180s (c8kubermaster2-migrate_from-interval-0s) migrate_to interval=0s timeout=180s (c8kubermaster2-migrate_to-interval-0s) monitor interval=30s (c8kubermaster2-monitor-interval-30s) start interval=0s timeout=90s (c8kubermaster2-start-interval-0s) stop interval=0s timeout=90s (c8kubermaster2-stop-interval-0s) many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] VirtualDomain - started but... not really
Hi guys. I quite often.. well, too frequently in my mind, see a VM which the cluster says is: -> $ pcs resource status | grep -v disabled ... * c8kubermaster2 (ocf::heartbeat:VirtualDomain): Started dzien .. but that is false, and the cluster itself confirms it: -> $ pcs resource debug-monitor c8kubermaster2 crm_resource: Error performing operation: Not running Operation force-check for c8kubermaster2 (ocf:heartbeat:VirtualDomain) returned: 'not running' (7) What is the issue here, or what might it be, & how best to troubleshoot it? -> $ pcs resource config c8kubermaster2 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain) Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migration_transport=ssh Meta Attrs: allow-migrate=true failure-timeout=30s Operations: migrate_from interval=0s timeout=180s (c8kubermaster2-migrate_from-interval-0s) migrate_to interval=0s timeout=180s (c8kubermaster2-migrate_to-interval-0s) monitor interval=30s (c8kubermaster2-monitor-interval-30s) start interval=0s timeout=90s (c8kubermaster2-start-interval-0s) stop interval=0s timeout=90s (c8kubermaster2-stop-interval-0s) many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] thanks! to mailman owners/admins...
...for fixing DMARCs (my Yahoo). many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Qemu VM resources - cannot acquire state change lock
On 26/08/2021 10:35, Klaus Wenninger wrote: On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users wrote: Hi guys. I sometimes - I think I know when, in terms of a pattern - get resources stuck on one node (two-node cluster) with these in libvirtd's logs: ... Cannot start job (query, none, none) for domain c8kubermaster1; current job is (modify, none, none) owned by (192261 qemuProcessReconnect, 0 , 0 (flags=0x0)) for (1093s, 0s, 0s) Cannot start job (query, none, none) for domain ubuntu-tor; current job is (modify, none, none) owned by (192263 qemuProcessReconnect, 0 , 0 (flags=0x0)) for (1093s, 0s, 0s) Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect) Timed out during operation: cannot acquire state change lock (held by monitor=qemuProcessReconnect) ... When this happens, and if the resource is meant to be on the other node, I have to disable the resource first, then the node on which the resources are stuck will shut down the VM, and then I have to re-enable that resource so it would, only then, start on that other, second node. I think this problem occurs if I restart 'libvirtd' via systemd. Any thoughts on this, guys? What are the logs on the pacemaker side saying? An issue with migration? Klaus I'll have to try to tidy up the "protocol" with my stuff so I could call it all reproducible; at the moment it only feels that way - reproducible. I'm on CentOS Stream and have a 2-node cluster with KVM resources, plus a 2-node glusterfs cluster on the same boxes (physically it is all just two machines). 1) I power down one node in an orderly manner and the other node is the last man standing. 2) after a while (not sure if the time period is also a key here) I bring that first node back up. 3) libvirtd on the last-man-standing node becomes unresponsive (I don't know yet whether that happens only after the first node comes back up) to virsh commands and probably everything else; the pacemaker log says: ... pacemaker-controld[2730]: error: Result of probe operation for c8kubernode2 on dzien: Timed Out ... and the libvirtd log does not say anything really (with default debug levels). 4) in case glusterfs plays any role: healing of the volume(s) has finished by this time, completed successfully. This is the moment when I would manually restart libvirtd ('systemctl restart libvirtd') on that unresponsive node (the former last man standing) and get the original error messages. There is plenty of room for anybody to make guesses, obviously. Is it 'libvirtd' going haywire because the glusterfs volume is in an unhealthy state and needs healing? Is it pacemaker on the last man standing which makes 'libvirtd' go haywire? etc... I can't add much concrete stuff at this moment but will appreciate any thoughts you want to share. thanks, L many thanks, L. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
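A side note on the "restarting libvirtd under a running cluster" angle (not from the thread, just a commonly used precaution): putting the cluster into maintenance mode before the restart stops Pacemaker from running monitors against a daemon that is still busy reconnecting to its qemu processes, which is one way to avoid the state-change-lock timeouts quoted above. A rough sketch:
-> $ pcs property set maintenance-mode=true
-> $ systemctl restart libvirtd
-> $ virsh list --all   (repeat until libvirtd answers again)
-> $ pcs property set maintenance-mode=false
While maintenance-mode is set the cluster neither monitors nor moves resources, so this window should be kept short and the VMs left untouched during it.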