Re: [ClusterLabs] PostgreSQL server timelines offset after promote

2024-03-27 Thread Jehan-Guillaume de Rorthais via Users
Bonjour Thierry,

On Mon, 25 Mar 2024 10:55:06 +
FLORAC Thierry  wrote:

> I'm trying to create a PostgreSQL master/slave cluster using streaming
> replication and pgsqlms agent. Cluster is OK but my problem is this : the
> master node is sometimes restarted for system operations, and the slave is
> then promoted without any problem ; 

When you have to do some planned system operation, you **must** ask Pacemaker
for permission first. Pacemaker is the real owner of your resource: it will
react to any unexpected event, even a planned one. Consider it a hidden
colleague taking care of your resource.

There are various ways to deal with Pacemaker when you need to do some system
maintenance on your primary, depending on your constraints. Here are two
examples:

* ask Pacemaker to move the "promoted" role to another node
* then put the node in standby mode
* then do your admin tasks
* then unstandby your node: a standby should start on the original node
* optional: move back your "promoted" role to original node
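
As an illustration, here is a minimal pcs sketch of the steps above. The
resource name "pgsqld-clone" and node name "srv1" are hypothetical, and the
exact flags vary between pcs versions (recent ones accept "--promoted" in place
of "--master", older ones use "pcs cluster standby" instead of "pcs node
standby"):

  # move the promoted role away from srv1
  pcs resource move pgsqld-clone --master

  # put srv1 in standby before the system maintenance
  pcs node standby srv1

  # ... system maintenance, reboot, etc ...

  # bring srv1 back: a standby instance should start there
  pcs node unstandby srv1

  # optional: drop the constraint created by "move" so the promoted role
  # may come back to srv1 if the scores allow it
  pcs resource clear pgsqld-clone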

Or:

* put the whole cluster in maintenance mode
* then do your admin tasks
* then check everything works as the cluster expects
* then exit the maintenance mode
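
And a minimal sketch of this second option:

  # freeze resource management cluster-wide
  pcs property set maintenance-mode=true

  # ... system maintenance ...

  # check PostgreSQL and the other resources are in the state Pacemaker
  # expects them to be, then resume management
  pcs property set maintenance-mode=false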

The second one might be tricky if Pacemaker finds some unexpected status/event
when exiting maintenance mode.

You can find (old) examples of administrative tasks in:

* with pcs: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html
* with crm: https://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html
* with "low level" commands:
  https://clusterlabs.github.io/PAF/administration.html

These docs updates are long overdue, sorry about that :(

Also, here is a hidden gist (that needs some updates as well):

https://github.com/ClusterLabs/PAF/tree/workshop/docs/workshop/fr


> after reboot, the old master is re-promoted, but I often get an error in
> slave logs :
> 
>   FATAL:  la plus grande timeline 1 du serveur principal est derrière la
>   timeline de restauration 2
> 
> which can be translated to English as:
> 
>   FATAL: highest timeline 1 of the primary server is behind recovery
>   timeline 2


This is unexpected. I wonder how Pacemaker is being stopped. It is supposed to
stop its resources gracefully: the promotion scores should be updated to
reflect that the local resource is not a primary anymore, and PostgreSQL should
be demoted then stopped. After such a graceful shutdown, it is supposed to
start again as a standby.
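
As a starting point for the investigation, a sketch of what I would check on
the nodes (exact attribute names depend on your resource name):

  # show node attributes, including the promotion scores set by the agent
  crm_mon --one-shot --show-node-attributes

  # on the old primary, check how Pacemaker/PostgreSQL were stopped around
  # the reboot
  journalctl -u pacemaker -u corosync | grep -i -e shutdown -e demote -e stop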

Regards,


Re: [ClusterLabs] trigger something at ?

2024-02-01 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 31 Jan 2024 18:23:40 +0100
lejeczek via Users  wrote:

> On 31/01/2024 17:13, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 31 Jan 2024 16:37:21 +0100
> > lejeczek via Users  wrote:
> >  
> >>
> >> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:  
> >>> On Wed, 31 Jan 2024 16:02:12 +0100
> >>> lejeczek via Users  wrote:
> >>>  
>  On 29/01/2024 17:22, Ken Gaillot wrote:  
> > On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:  
> >> Hi guys.
> >>
> >> Is it possible to trigger some... action - I'm thinking specifically
> >> at shutdown/start.
> >> If not within the cluster then - if you do that - perhaps outside.
> >> I would like to create/remove constraints, when cluster starts &
> >> stops, respectively.
> >>
> >> many thanks, L.
> >>  
> > You could use node status alerts for that, but it's risky for alert
> > agents to change the configuration (since that may result in more
> > alerts and potentially some sort of infinite loop).
> >
> > Pacemaker has no concept of a full cluster start/stop, only node
> > start/stop. You could approximate that by checking whether the node
> > receiving the alert is the only active node.
> >
> > Another possibility would be to write a resource agent that does what
> > you want and order everything else after it. However it's even more
> > risky for a resource agent to modify the configuration.
> >
> > Finally you could write a systemd unit to do what you want and order it
> > after pacemaker.
> >
> > What's wrong with leaving the constraints permanently configured?  
>  yes, that would be for a node start/stop
>  I struggle with using constraints to move pgsql (PAF) master
>  onto a given node - seems that co/locating paf's master
>  results in troubles (replication brakes) at/after node
>  shutdown/reboot (not always, but way too often)  
> >>> What? What's wrong with colocating PAF's masters exactly? How does it
> >>> brake any replication? What's these constraints you are dealing with?
> >>>
> >>> Could you share your configuration?  
> >> Constraints beyond/above of what is required by PAF agent
> >> itself, say...
> >> you have multiple pgSQL cluster with PAF - thus multiple
> >> (separate, for each pgSQL cluster) masters and you want to
> >> spread/balance those across HA cluster
> >> (or in other words - avoid having more that 1 pgsql master
> >> per HA node)  
> > ok
> >  
> >> These below, I've tried, those move the master onto chosen
> >> node but.. then the issues I mentioned.  
> > You just mentioned it breaks the replication, but there so little
> > information about your architecture and configuration, it's impossible to
> > imagine how this could break the replication.
> >
> > Could you add details about the issues ?
> >  
> >> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers  
> >> ubusrv1=1002
> >> or  
> >> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone  
> >> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master
> >> require-all=false setoptions score=-1000  
> > I suppose "collocation" constraint is the way to go, not the "location"
> > one.  
> This should be easy to replicate, 3 x VMs, Ubuntu 22.04 in 
> my case

No, this is not easy to replicate. I have no idea how you set up your
PostgreSQL replication, nor do I have your full Pacemaker configuration.

Please provide either a detailed setup and/or ansible and/or terraform and/or
vagrant, then a detailed scenario showing how it breaks. This is how you can
help and motivate devs to reproduce your issue and work on it.

I will not try to poke around for hours until I find an issue that might not
even be the same as yours.


Re: [ClusterLabs] trigger something at ?

2024-01-31 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 31 Jan 2024 16:37:21 +0100
lejeczek via Users  wrote:

> 
> 
> On 31/01/2024 16:06, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 31 Jan 2024 16:02:12 +0100
> > lejeczek via Users  wrote:
> >
> >>
> >> On 29/01/2024 17:22, Ken Gaillot wrote:
> >>> On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
>  Hi guys.
> 
>  Is it possible to trigger some... action - I'm thinking specifically
>  at shutdown/start.
>  If not within the cluster then - if you do that - perhaps outside.
>  I would like to create/remove constraints, when cluster starts &
>  stops, respectively.
> 
>  many thanks, L.
> 
> >>> You could use node status alerts for that, but it's risky for alert
> >>> agents to change the configuration (since that may result in more
> >>> alerts and potentially some sort of infinite loop).
> >>>
> >>> Pacemaker has no concept of a full cluster start/stop, only node
> >>> start/stop. You could approximate that by checking whether the node
> >>> receiving the alert is the only active node.
> >>>
> >>> Another possibility would be to write a resource agent that does what
> >>> you want and order everything else after it. However it's even more
> >>> risky for a resource agent to modify the configuration.
> >>>
> >>> Finally you could write a systemd unit to do what you want and order it
> >>> after pacemaker.
> >>>
> >>> What's wrong with leaving the constraints permanently configured?
> >> yes, that would be for a node start/stop
> >> I struggle with using constraints to move pgsql (PAF) master
> >> onto a given node - seems that co/locating paf's master
> >> results in troubles (replication brakes) at/after node
> >> shutdown/reboot (not always, but way too often)
> > What? What's wrong with colocating PAF's masters exactly? How does it brake
> > any replication? What's these constraints you are dealing with?
> >
> > Could you share your configuration?
> Constraints beyond/above of what is required by PAF agent 
> itself, say...
> you have multiple pgSQL cluster with PAF - thus multiple 
> (separate, for each pgSQL cluster) masters and you want to 
> spread/balance those across HA cluster
> (or in other words - avoid having more that 1 pgsql master 
> per HA node)

ok

> These below, I've tried, those move the master onto chosen 
> node but.. then the issues I mentioned.

You just mentioned it breaks the replication, but there is so little
information about your architecture and configuration that it's impossible to
imagine how this could break the replication.

Could you add details about the issues?

> -> $ pcs constraint location PGSQL-PAF-5438-clone prefers 
> ubusrv1=1002
> or
> -> $ pcs constraint colocation set PGSQL-PAF-5435-clone 
> PGSQL-PAF-5434-clone PGSQL-PAF-5433-clone role=Master 
> require-all=false setoptions score=-1000

I suppose "collocation" constraint is the way to go, not the "location" one.


Re: [ClusterLabs] trigger something at ?

2024-01-31 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 31 Jan 2024 16:02:12 +0100
lejeczek via Users  wrote:

> 
> 
> On 29/01/2024 17:22, Ken Gaillot wrote:
> > On Fri, 2024-01-26 at 13:55 +0100, lejeczek via Users wrote:
> >> Hi guys.
> >>
> >> Is it possible to trigger some... action - I'm thinking specifically
> >> at shutdown/start.
> >> If not within the cluster then - if you do that - perhaps outside.
> >> I would like to create/remove constraints, when cluster starts &
> >> stops, respectively.
> >>
> >> many thanks, L.
> >>
> > You could use node status alerts for that, but it's risky for alert
> > agents to change the configuration (since that may result in more
> > alerts and potentially some sort of infinite loop).
> >
> > Pacemaker has no concept of a full cluster start/stop, only node
> > start/stop. You could approximate that by checking whether the node
> > receiving the alert is the only active node.
> >
> > Another possibility would be to write a resource agent that does what
> > you want and order everything else after it. However it's even more
> > risky for a resource agent to modify the configuration.
> >
> > Finally you could write a systemd unit to do what you want and order it
> > after pacemaker.
> >
> > What's wrong with leaving the constraints permanently configured?
> yes, that would be for a node start/stop
> I struggle with using constraints to move pgsql (PAF) master 
> onto a given node - seems that co/locating paf's master 
> results in troubles (replication brakes) at/after node 
> shutdown/reboot (not always, but way too often)

What? What's wrong with colocating PAF's masters exactly? How does it break any
replication? What are these constraints you are dealing with?

Could you share your configuration?


Re: [ClusterLabs] Beginner lost with promotable "group" design

2024-01-31 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 31 Jan 2024 15:41:28 +0100
Adam Cecile  wrote:
[...]

> Thanks a lot for your suggestion, it seems I have something that work 
> correctly now, final configuration is:

I would recommend configuring in an offline CIB, then pushing it to production
as a whole. E.g.:

  # get current CIB
  pcs cluster cib cluster1.xml

  # edit offline CIB
  pcs -f cluster1.xml resource create Internal-IPv4 ocf:heartbeat:IPaddr2 \
ip=10.0.0.254 nic=eth0 cidr_netmask=24 op monitor interval=30

  [...]
  [...]

  pcs -f cluster1.xml stonith create vmfence fence_vmware_rest \
pcmk_host_map="gw-1:gw-1;gw-2:gw-2;gw-3:gw-3" \
ip=10.1.2.3 ssl=1 username=corosync@vsphere.local \
password=p4ssw0rd ssl_insecure=1

  # push offline CIB to production
  pcs cluster cib-push scope=configuration cluster1.xml

> I have a quick one regarding fencing. I disconnected eth0 from gw-3 and 
> the VM has been restarted automatically, so I guess it's the fencing 
> agent that kicked in. However, I left the VM in such state (so it's seen 
> offline by other nodes) and I thought it would end up being powered off 
> for good. However, it seems fencing agent is keeping it powered on. Is 
> that expected ?

The default action of this fencing agent is "reboot". Check the VM uptime to
confirm it has been rebooted. If you want it to shut the VM down for good, add
"action=off".

Regards,


Re: [ClusterLabs] Planning for Pacemaker 3

2024-01-25 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 24 Jan 2024 16:47:54 -0600
Ken Gaillot  wrote:
...
> > Erm. Well, as this is a major upgrade where we can affect people's
> > conf and
> > break old things & so on, I'll jump in this discussion with a
> > wishlist to
> > discuss :)
> >   
> 
> I made sure we're tracking all these (links below),

Thank you Ken, for creating these tasks. I subscribed to them, but it seems I
cannot comment on them (or maybe I failed to find how to do it).

> but realistically we're going to have our hands full dropping all the
> deprecated stuff in the time we have.

Let me know how I can help on these subjects. Also, I'm still silently sitting
on the IRC channel if needed.

> Most of these can be done in any version.

Four out of seven can be done in any version. For the other three, here are my
humble opinion and needs from the PAF agent's point of view:

1. «Support failure handling of notify actions»
   https://projects.clusterlabs.org/T759
2. «Change allowed range of scores and value of +/-INFINITY»
   https://projects.clusterlabs.org/T756
3. «Default to sending clone notifications when agent supports it»
   https://projects.clusterlabs.org/T758

The first is the most important, as it would allow implementing an actual
election before the promotion, breaking the current transition if the promotion
score doesn't reflect reality since the last monitor action. PAF's current code
goes through a lot of convolutions to get a decent election mechanism
preventing the promotion of a lagging node.

The second one would help remove some useless complexity from some resource
agent code (at least in PAF).

The third one is purely for comfort and consistency between action setups.

Have a good day!

Regards,


Re: [ClusterLabs] Planning for Pacemaker 3

2024-01-23 Thread Jehan-Guillaume de Rorthais via Users
Hi there !

On Wed, 03 Jan 2024 11:06:27 -0600
Ken Gaillot  wrote:

> Hi all,
> 
> I'd like to release Pacemaker 3.0.0 around the middle of this year. 
> I'm gathering proposed changes here:
> 
>  https://projects.clusterlabs.org/w/projects/pacemaker/pacemaker_3.0_changes/
> 
> Please review for anything that might affect you, and reply here if you
> have any concerns.


Erm. Well, as this is a major upgrade where we can affect people's conf and
break old things & so on, I'll jump in this discussion with a wishlist to
discuss :)

1. "recover", "migration-to" and "migration-from" actions support ?

  See discussion:
  https://lists.clusterlabs.org/pipermail/developers/2020-February/002258.html

2.1. INT64 promotion scores?
2.2. discovering promotion score ahead of promotion?
2.3. make OCF_RESKEY_CRM_meta_notify_* or equivalent officially available in all
 actions 

  See discussion:
  https://lists.clusterlabs.org/pipermail/developers/2020-February/002255.html

3.1. deprecate "notify=true" clone option, make it true by default
3.2. react to notify action return code

  See discussion:
  https://lists.clusterlabs.org/pipermail/developers/2020-February/002256.html

Of course, I can volunteer to help on some topics.

Cheers!


Re: [ClusterLabs] how to colocate promoted resources ?

2023-12-08 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 8 Dec 2023 17:11:58 +0100
lejeczek via Users  wrote:
...
> Apologies, perhaps I was quite vague.
> I was thinking - having a 3-node HA cluster and 3-node 
> single-master->slaves pgSQL, now..
> say, I want pgSQL masters to spread across HA cluster so I 
> theory - having each HA node identical hardware-wise - 
> masters' resources would nicely balance out across HA cluster.

Multi-primary doesn't exist in (core) PostgreSQL. Unless you want to create
3 different instances, each with their own set of databases and tables, not
replicating with each other?


Re: [ClusterLabs] how to colocate promoted resources ?

2023-12-08 Thread Jehan-Guillaume de Rorthais via Users
Hi,

On Wed, 6 Dec 2023 10:36:39 +0100
lejeczek via Users  wrote:

> How do your colocate your promoted resources with balancing 
> underlying resources as priority?

What do you mean?

> With a simple scenario, say
> 3 nodes and 3 pgSQL clusters
> what would be best possible way - I'm thinking most gentle 
> at the same time, if that makes sense.

I'm not sure it answers your question (as I don't understand it), but here is a
doc explaining how to create and move two IPs, each supposed to start on a
secondary, avoiding the primary node if possible, as long as secondary nodes
exist:

https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#adding-ips-on-standbys-nodes
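
A rough sketch of that kind of setup, with hypothetical names and scores (not
the exact recipe from the cookbook):

  # an IP meant to live on a standby node
  pcs resource create standby-ip ocf:heartbeat:IPaddr2 \
    ip=10.0.0.100 cidr_netmask=24

  # prefer nodes hosting an unpromoted (standby) instance...
  pcs constraint colocation add standby-ip with slave pgsqld-clone 100

  # ...and avoid the node hosting the promoted one when possible
  pcs constraint colocation add standby-ip with master pgsqld-clone -50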

++


Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-13 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 10 Nov 2023 20:34:40 +0100
lejeczek via Users  wrote:

> On 10/11/2023 18:16, Jehan-Guillaume de Rorthais wrote:
> > On Fri, 10 Nov 2023 17:17:41 +0100
> > lejeczek via Users  wrote:
> >
> > ...  
> >>> Of course you can use "pg_stat_tmp", just make sure the temp folder
> >>> exists:
> >>>
> >>> cat < /etc/tmpfiles.d/postgresql-part.conf
> >>> # Directory for PostgreSQL temp stat files
> >>> d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
> >>> EOF
> >>>
> >>> To take this file in consideration immediately without rebooting the
> >>> server, run the following command:
> >>>
> >>> systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf  
> >> Then there must be something else at play here with Ubuntus,
> >> for none of the nodes has any extra/additional configs for
> >> those paths & I'm sure that those were not created manually.  
> > Indeed.
> >
> > This parameter is usually set by pg_createcluster command and the folder
> > created by both pg_createcluster and pg_ctlcluster commands when needed.
> >
> > This is explained in PAF tutorial there:
> >
> > https://clusterlabs.github.io/PAF/Quick_Start-Debian-10-pcs.html#postgresql-and-cluster-stack-installation
> >
> > These commands comes from the postgresql-common wrapper, used in all Debian
> > related distros, allowing to install, create and use multiple PostgreSQL
> > versions on the same server.
> >  
> >> Perhpaphs pgSQL created these on it's own outside of HA-cluster.  
> > No, the Debian packaging did.
> >
> > Just create the config file I pointed you in my previous answer,
> > systemd-tmpfiles will take care of it and you'll be fine.
> >  
> [...]
> Still that directive for the stats - I wonder if that got 
> introduced, injected somewhere in between because the PG 
> cluster - which I had for a while - did not experience this 
> "issue" from the start and not until "recently"

No, it's a really old behavior of the debian wrapper. 

I documented it for PAF on Debian 8 and 9 in 2017-2018, but it has been there
for 9.5 years:

  
https://salsa.debian.org/postgresql/postgresql-common/-/commit/e83bbefd0d7e87890eee8235476c403fcea50fa8

Regards,


Re: [ClusterLabs] [EXT] Re: PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-13 Thread Jehan-Guillaume de Rorthais via Users
On Mon, 13 Nov 2023 11:39:45 +
"Windl, Ulrich"  wrote:

> But shouldn't the RA check for that (and act appropriately)?

Interesting. I'm open to discussing this. Below are my thoughts so far.

Why should the RA check that? There are so many ways to set up the system and
PostgreSQL; where should the RA stop checking for all the possible ways to
break it?

The RA already checks various (maybe too many) things related to the instance
itself.

I know various other PostgreSQL setups that would trigger errors in the cluster
if the DBA doesn't check everything is correct. I'm really reluctant to add a
fair amount of code in the RA to correctly parse and check PostgreSQL's complex
setup. This would add complexity and bugs. Or maybe I could add a specific
OCF_CHECK_LEVEL sysadmins can trigger by hand before starting the cluster. But
I wonder if it's worth the pain: how many people would know about this and
actually run it?

The problem here is that few users actually realize how the postgresql-common
wrapper works and what it actually does behind their back. I really appreciate
this wrapper, I do. But when you set up a Pacemaker cluster, you either have to
bend to it when setting up PAF (as documented), or avoid it completely.

PAF is all about drawing a clear line between the sysadmin's job and the DBA's.
The DBA must build a cluster of instances ready to start/replicate with
standard binaries (not wrappers) before the sysadmin can set up the resource in
the cluster.

Thoughts?

> -Original Message-
> From: Users  On Behalf Of Jehan-Guillaume de
> Rorthais via Users Sent: Friday, November 10, 2023 1:13 PM
> To: lejeczek via Users 
> Cc: Jehan-Guillaume de Rorthais 
> Subject: [EXT] Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown -
> FIX
> 
> On Fri, 10 Nov 2023 12:27:24 +0100
> lejeczek via Users  wrote:
> ...
> > >
> > to share my "fix" for it - perhaps it was introduced by 
> > OS/packages (Ubuntu 22) updates - ? - as oppose to resource 
> > agent itself.
> > 
> > As the logs point out - pg_stat_tmp - is missing and from 
> > what I see it's only the master, within a cluster, doing 
> > those stats.
> > That appeared, I use the word for I did not put it into 
> > configs, on all nodes.
> > fix = to not use _pg_stat_tmp_ directive/option at all.  
> 
> Of course you can use "pg_stat_tmp", just make sure the temp folder exists:
> 
>   cat < /etc/tmpfiles.d/postgresql-part.conf
>   # Directory for PostgreSQL temp stat files
>   d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
>   EOF
> 
> To take this file in consideration immediately without rebooting the server,
> run the following command:
> 
>   systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf



Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-10 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 10 Nov 2023 17:17:41 +0100
lejeczek via Users  wrote:

...
> > Of course you can use "pg_stat_tmp", just make sure the temp folder exists:
> >
> >cat < /etc/tmpfiles.d/postgresql-part.conf
> ># Directory for PostgreSQL temp stat files
> >d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
> >EOF
> >
> > To take this file in consideration immediately without rebooting the server,
> > run the following command:
> >
> >systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf  
> Then there must be something else at play here with Ubuntus, 
> for none of the nodes has any extra/additional configs for 
> those paths & I'm sure that those were not created manually.

Indeed.

This parameter is usually set by the pg_createcluster command, and the folder is
created by both the pg_createcluster and pg_ctlcluster commands when needed.

This is explained in the PAF tutorial here:

https://clusterlabs.github.io/PAF/Quick_Start-Debian-10-pcs.html#postgresql-and-cluster-stack-installation

These commands come from the postgresql-common wrapper, used in all
Debian-related distros, which allows installing, creating and using multiple
PostgreSQL versions on the same server.

> Perhpaphs pgSQL created these on it's own outside of HA-cluster.

No, the Debian packaging did.

Just create the config file I pointed you to in my previous answer;
systemd-tmpfiles will take care of it and you'll be fine.



Re: [ClusterLabs] PAF / pgSQL fails after OS/system shutdown - FIX

2023-11-10 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 10 Nov 2023 12:27:24 +0100
lejeczek via Users  wrote:
...
> >  
> to share my "fix" for it - perhaps it was introduced by 
> OS/packages (Ubuntu 22) updates - ? - as oppose to resource 
> agent itself.
> 
> As the logs point out - pg_stat_tmp - is missing and from 
> what I see it's only the master, within a cluster, doing 
> those stats.
> That appeared, I use the word for I did not put it into 
> configs, on all nodes.
> fix = to not use _pg_stat_tmp_ directive/option at all.

Of course you can use "pg_stat_tmp", just make sure the temp folder exists:

  cat <<EOF > /etc/tmpfiles.d/postgresql-part.conf
  # Directory for PostgreSQL temp stat files
  d /var/run/postgresql/14-paf.pg_stat_tmp 0700 postgres postgres - -
  EOF

To take this file into consideration immediately, without rebooting the server,
run the following command:

  systemd-tmpfiles --create /etc/tmpfiles.d/postgresql-part.conf


Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-13 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 13 Sep 2023 17:32:01 +0200
lejeczek via Users  wrote:

> On 08/09/2023 17:29, Jehan-Guillaume de Rorthais wrote:
> > On Fri, 8 Sep 2023 16:52:53 +0200
> > lejeczek via Users  wrote:
> >  
> >> Hi guys.
> >>
> >> Before I start fiddling and brake things I wonder if
> >> somebody knows if:
> >> pgSQL can work with: |wal_level = archive for PAF ?
> >> Or more general question with pertains to ||wal_level - can
> >> _barman_ be used with pgSQL "under" PAF?  
> > PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so
> > it can have hot standbys where it can connects and query there status.
> >
> > Wal level "replica" includes the archive level, so you can set up archiving.
> >
> > Of course you can use barman or any other tools to manage your PITR Backups,
> > even when Pacemaker/PAF is looking at your instances. This is even the very
> > first step you should focus on during your journey to HA.
> >
> > Regards,  
> and with _barman_ specifically - is one method preferred, 
> recommended over another: streaming VS rsync - for/with PAF?

PAF doesn't need PITR, nor does it have a preferred method for it. The two are
not related.

The PITR procedure and tooling you are setting up will help with disaster
recovery (DRP/PRA), not with continuity of service (BCP/PCA). So feel free to
choose the ones that help you achieve **YOUR** RTO/RPO needs in case of
disaster.

The only vaguely related subject between PAF and your PITR tooling is how fast
and easily you'll be able to set up a standby from backups if needed.

Just avoid slots if possible, as WALs could quickly fill your filesystem if
something goes wrong and slots must keep WALs around. If you must use them, set
max_slot_wal_keep_size.
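
For instance, a sketch in postgresql.conf (this parameter is available since
PostgreSQL 13, size it for your WAL traffic):

  # cap the WAL volume a lagging or orphaned slot is allowed to retain
  max_slot_wal_keep_size = 10GB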

++


Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-08 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 8 Sep 2023 16:52:53 +0200
lejeczek via Users  wrote:

> Hi guys.
> 
> Before I start fiddling and brake things I wonder if 
> somebody knows if:
> pgSQL can work with: |wal_level = archive for PAF ?
> Or more general question with pertains to ||wal_level - can 
> _barman_ be used with pgSQL "under" PAF?

PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so it
can have hot standbys where it can connects and query there status.

Wal level "replica" includes the archive level, so you can set up archiving.

Of course you can use barman or any other tool to manage your PITR backups,
even when Pacemaker/PAF is looking after your instances. This is even the very
first step you should focus on during your journey to HA.
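
As an illustration, a minimal archiving sketch in postgresql.conf when shipping
WALs to a barman host (host and server names are hypothetical):

  wal_level = replica
  archive_mode = on
  # barman-wal-archive ships with the barman-cli package
  archive_command = 'barman-wal-archive backup-srv main %p'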

Regards,


Re: [ClusterLabs] PAF / PGSQLMS on Ubuntu

2023-09-08 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 8 Sep 2023 10:26:42 +0200
lejeczek via Users  wrote:

> On 07/09/2023 16:20, lejeczek via Users wrote:
> >
> >
> > On 07/09/2023 16:09, Andrei Borzenkov wrote:  
> >> On Thu, Sep 7, 2023 at 5:01 PM lejeczek via Users 
> >>  wrote:  
> >>> Hi guys.
> >>>
> >>> I'm trying to set ocf_heartbeat_pgsqlms agent but I get:
> >>> ...
> >>> Failed Resource Actions:
> >>>    * PGSQL-PAF-5433 stop on ubusrv3 returned 'invalid 
> >>> parameter' because 'Parameter "recovery_target_timeline" 
> >>> MUST be set to 'latest'. It is currently set to ''' at 
> >>> Thu Sep  7 13:58:06 2023 after 54ms
> >>>
> >>> I'm new to Ubuntu and I see that Ubuntu has a bit 
> >>> different approach to paths (in comparison to how Centos 
> >>> do it).
> >>> I see separation between config & data, eg.
> >>>
> >>> 14  paf 5433 down   postgres 
> >>> /var/lib/postgresql/14/paf 
> >>> /var/log/postgresql/postgresql-14-paf.log
> >>>
> >>> I create the resource like here:
> >>>  
> >>> -> $ pcs resource create PGSQL-PAF-5433   
> >>> ocf:heartbeat:pgsqlms pgport=5433 bindir=/usr/bin 
> >>> pgdata=/etc/postgresql/14/paf 
> >>> datadir=/var/lib/postgresql/14/paf meta 
> >>> failure-timeout=30s master-max=1 op start timeout=60s op 
> >>> stop timeout=60s op promote timeout=30s op demote 
> >>> timeout=120s op monitor interval=15s timeout=10s 
> >>> role="Promoted" op monitor interval=16s timeout=10s 
> >>> role="Unpromoted" op notify timeout=60s promotable 
> >>> notify=true failure-timeout=30s master-max=1 --disable
> >>>
> >>> Ubuntu 22.04.3 LTS
> >>> What am I missing can you tell?  
> >> Exactly what the message tells you. You need to set 
> >> recovery_target=latest.  
> > and having it in 'postgresql.conf' make it all work for you?
> > I've had it and got those errors - perhaps that has to be 
> > set some place else.
> >  
> In case anybody was in this situation - I was missing one 
> important bit: _bindir_
> Ubuntu's pgSQL binaries have different path - what 
> resource/agent returns as errors is utterly confusing.

Uh ? Good point, I'll definitely have to check that.

Thanks for the report!


Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?

2023-05-05 Thread Jehan-Guillaume de Rorthais via Users
On Fri, 5 May 2023 10:08:17 +0200
lejeczek via Users  wrote:

> On 25/04/2023 14:16, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Mon, 24 Apr 2023 12:32:45 +0200
> > lejeczek via Users  wrote:
> >  
> >> I've been looking up and fiddling with this RA but
> >> unsuccessfully so far, that I wonder - is it good for
> >> current versions of pgSQLs?  
> > As far as I know, the pgsql agent is still supported, last commit on it
> > happen in Jan 11th 2023. I don't know about its compatibility with latest
> > PostgreSQL versions.
> >
> > I've been testing it many years ago, I just remember it was quite hard to
> > setup, understand and manage from the maintenance point of view.
> >
> > Also, this agent is fine in a shared storage setup where it only
> > start/stop/monitor the instance, without paying attention to its role
> > (promoted or not).
> >  
> It's not only that it's hard - which is purely due to 
> piss-poor man page in my opinion - but it really sounds 
> "expired".

I really don't know. My feeling is that the manpage might be outdated, which
really doesn't help with this agent, but not the RA itself.

> Eg. man page speaks of 'recovery.conf' which - as I 
> understand it - newer/current versions of pgSQL do not! even 
> use... which makes one wonder.

This has been fixed in late 2019, but with no documentation associated :/
See:
https://github.com/ClusterLabs/resource-agents/commit/a43075be72683e1d4ddab700ec16d667164d359c

Regards,


Re: [ClusterLabs] OCF_HEARTBEAT_PGSQL - any good with current Postgres- ?

2023-04-25 Thread Jehan-Guillaume de Rorthais via Users
Hi,

On Mon, 24 Apr 2023 12:32:45 +0200
lejeczek via Users  wrote:

> I've been looking up and fiddling with this RA but 
> unsuccessfully so far, that I wonder - is it good for 
> current versions of pgSQLs?

As far as I know, the pgsql agent is still supported; the last commit on it
happened on Jan 11th, 2023. I don't know about its compatibility with the
latest PostgreSQL versions.

I've been testing it many years ago, I just remember it was quite hard to
setup, understand and manage from the maintenance point of view.

Also, this agent is fine in a shared-storage setup where it only
starts/stops/monitors the instance, without paying attention to its role
(promoted or not).



Re: [ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue

2023-03-21 Thread Jehan-Guillaume de Rorthais via Users
On Tue, 21 Mar 2023 11:47:23 +0100
Jérôme BECOT  wrote:

> On 21/03/2023 at 11:00, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Tue, 21 Mar 2023 09:33:04 +0100
> > Jérôme BECOT  wrote:
> >  
> >> We have several clusters running for different zabbix components. Some
> >> of these clusters consist of 2 zabbix proxies,where nodes run Mysql,
> >> Zabbix-proxy server and a VIP, and a corosync-qdevice.  
> > I'm not sure to understand your topology. The corosync-device is not
> > supposed to be on a cluster node. It is supposed to be on a remote node and
> > provide some quorum features to one or more cluster without setting up the
> > whole pacemaker/corosync stack.  
> I was not clear, the qdevice is deployed on a remote node, as intended.

ok

> >> The MySQL servers are always up to replicate, and are configured in
> >> Master/Master (they both replicate from the other but only one is supposed
> >> to be updated by the proxy running on the master node).  
> > Why do you bother with Master/Master when a simple (I suppose, I'm not a
> > MySQL cluster guy) Primary-Secondary topology or even a shared storage
> > would be enough and would keep your logic (writes on one node only) safe
> > from incidents, failures, errors, etc?
> >
> > HA must be a simple as possible. Remove useless parts when you can.  
> A shared storage moves the complexity somewhere else.

Yes, on storage/SAN side.

> A classic Primary / secondary can be an option if PaceMaker manages to start
> the client on the slave node,

I suppose this can be done using a location constraint.

> but it would become Master/Master during the split brain.

No, and if you do have a real split brain, then you might have something wrong
in your setup. See below.


> >> One cluster is prompt to frequent sync errors, with duplicate entries
> >> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
> >> zabbix-proxy-01 pacemaker-controld  [948] (pcmk_cpg_membership)
> >> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
> >> cluster exit", and within the next second, a rejoin. The same messages
> >> are in the other node logs, suggesting a split brain, which should not
> >> happen, because there is a quorum device.  
> > Would it be possible your SQL sync errors and the left/join issues are
> > correlated and are both symptoms of another failure? Look at your log for
> > some explanation about why the node decided to leave the cluster.  
> 
> My guess is that maybe a high latency in network cause the disjoin, 
> hence starting Zabbix-proxy on both nodes causes the replication error. 
> It is configured to use the vip which is up locally because there is a 
> split brain.

If you have a split brain, that means your quorum setup is failing. 

No node could start/promote a resource without having the quorum. If a node is
isolated from the cluster and quorum-device, it should stop its resources, not
recover/promote them.

If both nodes lose connection with each other, but are still connected to the
quorum device, the latter should be able to grant the quorum on one side only.

Lastly, quorum is a split-brain protection when "things are going fine".
Fencing is a split-brain protection for all other situations. Fencing is hard
and painful, but it saves you from many split-brain situations.
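
To check this while the network is broken, a quick sketch to run on each node:

  # does this node still claim quorum?
  corosync-quorumtool -s

  # can it still reach the quorum device?
  corosync-qdevice-tool -sv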

> This is why I'm requesting guidance to check/monitor these nodes to find 
> out if it is temporary network latency that is causing the disjoin.

A cluster is always very sensitive to network latency/failures. You need to
build on stronger foundations.

Regards,


Re: [ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue

2023-03-21 Thread Jehan-Guillaume de Rorthais via Users
Hi,

On Tue, 21 Mar 2023 09:33:04 +0100
Jérôme BECOT  wrote:

> We have several clusters running for different zabbix components. Some 
> of these clusters consist of 2 zabbix proxies,where nodes run Mysql, 
> Zabbix-proxy server and a VIP, and a corosync-qdevice. 

I'm not sure I understand your topology. The corosync qdevice is not supposed
to be on a cluster node. It is supposed to be on a remote node and provide some
quorum features to one or more clusters without setting up the whole
pacemaker/corosync stack.

> The MySQL servers are always up to replicate, and are configured in
> Master/Master (they both replicate from the other but only one is supposed to
> be updated by the proxy running on the master node).

Why do you bother with Master/Master when a simple Primary-Secondary topology
(I suppose, I'm not a MySQL cluster guy) or even shared storage would be enough
and would keep your logic (writes on one node only) safe from incidents,
failures, errors, etc.?

HA must be as simple as possible. Remove useless parts when you can.

> One cluster is prompt to frequent sync errors, with duplicate entries 
> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41 
> zabbix-proxy-01 pacemaker-controld  [948] (pcmk_cpg_membership)     
> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via 
> cluster exit", and within the next second, a rejoin. The same messages 
> are in the other node logs, suggesting a split brain, which should not 
> happen, because there is a quorum device.

Could it be that your SQL sync errors and the leave/join issues are correlated
and are both symptoms of another failure? Look at your logs for some explanation
of why the node decided to leave the cluster.

> Can you help me to troubleshoot this ? I can provide any 
> log/configuration required in the process, so let me know.
> 
> I'd also like to ask if there is a bit of configuration that can be done 
> to postpone service start on the other node for two or three seconds as 
> a quick workaround ?

How would it be a workaround?

Regards,


Re: [ClusterLabs] [External] : Reload DNSMasq after IPAddr2 change ?

2023-02-10 Thread Jehan-Guillaume de Rorthais via Users
Hi,

What about using the Dummy resource agent (ocf_heartbeat_dummy(7)) and
colocating it with your IP address? This RA creates a local file on start and
removes it on stop. The game then is to watch that path from a systemd path
unit and trigger the reload when the file appears. See systemd.path(5).
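
A minimal sketch of the systemd side, assuming the Dummy resource was created
with state=/run/resource-agents/vip-up.state (hypothetical path and unit names):

  # /etc/systemd/system/dnsmasq-reload.path
  [Path]
  PathExists=/run/resource-agents/vip-up.state

  [Install]
  WantedBy=multi-user.target

  # /etc/systemd/system/dnsmasq-reload.service
  [Service]
  Type=oneshot
  ExecStart=/usr/bin/systemctl reload dnsmasq.service

Enable the path unit (systemctl enable --now dnsmasq-reload.path): when the
Dummy resource starts on the node and creates the file, systemd triggers the
matching service and dnsmasq gets reloaded.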

There might be multiple other ways to trigger such a reload using various
strategies and hacks relying on systemd or even DBus events...

Regards,


[ClusterLabs] Resource validation [was: multiple resources - pgsqlms - and IP(s)]

2023-01-09 Thread Jehan-Guillaume de Rorthais via Users
Hi,

I definitely have some work/improvements to do on the pgsqlms agent, but there
are still some details I'm interested in discussing below.

On Fri, 6 Jan 2023 16:36:19 -0800
Reid Wahl  wrote:

> On Fri, Jan 6, 2023 at 3:26 PM Jehan-Guillaume de Rorthais via Users
>  wrote:
>>
>> On Wed, 4 Jan 2023 11:15:06 +0100
>> Tomas Jelinek  wrote:
>>  
>>> On 04. 01. 23 at 8:29, Reid Wahl wrote:
>>>> On Tue, Jan 3, 2023 at 10:53 PM lejeczek via Users
>>>>  wrote:  
>>>>>
>>>>> On 03/01/2023 21:44, Ken Gaillot wrote:  
>>>>>> On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote:  
[...]
>>>>>>> Not related - Is this an old bug?:
>>>>>>>  
>>>>>>> -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms  
>>>>>>> bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s
>>>>>>> op stop timeout=60s op promote timeout=30s op demote
>>>>>>> timeout=120s op monitor interval=15s timeout=10s
>>>>>>> role="Master" op monitor interval=16s timeout=10s
>>>>>>> role="Slave" op notify timeout=60s meta promotable=true
>>>>>>> notify=true master-max=1 --disable
>>>>>>> Error: Validation result from agent (use --force to override):
>>>>>>>  ocf-exit-reason:You must set meta parameter notify=true
>>>>>>> for your master resource
>>>>>>> Error: Errors have occurred, therefore pcs is unable to continue  
>>>>>> pcs now runs an agent's validate-all action before creating a
>>>>>> resource. In this case it's detecting a real issue in your command.
>>>>>> The options you have after "meta" are clone options, not meta options
>>>>>> of the resource being cloned. If you just change "meta" to "clone" it
>>>>>> should work.  
>>>>> Nope. Exact same error message.
>>>>> If I remember correctly there was a bug specifically
>>>>> pertained to 'notify=true'  
>>>>
>>>> The only recent one I can remember was a core dump.
>>>> - Bug 2039675 - pacemaker coredump with ocf:heartbeat:mysql resource
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=2039675)
>>>>
>>>>  From a quick inspection of the pcs resource validation code
>>>> (lib/pacemaker/live.py:validate_resource_instance_attributes_via_pcmk()),
>>>> it doesn't look like it passes the meta attributes. It only passes the
>>>> instance attributes. (I could be mistaken.)
>>>>
>>>> The pgsqlms resource agent checks the notify meta attribute's value as
>>>> part of the validate-all action. If pcs doesn't pass the meta
>>>> attributes to crm_resource, then the check will fail.
>>>
>>> Pcs cannot pass meta attributes to crm_resource, because there is
>>> nowhere to pass them to.  
>>
>> But, they are passed as environment variable by Pacemaker, why pcs couldn't
>> set them as well when running the agent?  
> 
> pcs uses crm_resource to run the validate-all action. crm_resource
> doesn't provide a way to pass in meta attributes -- only instance
> attributes. Whether crm_resource should provide that is another
> question...

But crm_resource can set them as environment variables; they are inherited by
the resource agent when it is executed:

  # This fails
  # crm_resource --validate   \
 --class ocf --agent pgsqlms --provider heartbeat \
 --option pgdata=/var/lib/pgsql/15/data   \
 --option bindir=/usr/pgsql-15/bin
  Operation validate (ocf:heartbeat:pgsqlms) returned 5 (not installed: 
You must set meta parameter notify=true for your "master" resource)
  ocf-exit-reason:You must set meta parameter notify=true for your "master" 
resource
  crm_resource: Error performing operation: Not installed

  # This fails on a different mandatory setup
  # OCF_RESKEY_CRM_meta_notify=1  \
crm_resource --validate   \
 --class ocf --agent pgsqlms --provider heartbeat \
 --option pgdata=/var/lib/pgsql/15/data   \
 --option bindir=/usr/pgsql-15/bin
  Operation validate (ocf:heartbeat:pgsqlms) returned 5 (not installed:
You must set meta parameter master-max=1 for your "master" resource)
  ocf-exit-reason:You must set meta parameter master-max=1 for your "master"
resource

Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)

2023-01-06 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 4 Jan 2023 11:15:06 +0100
Tomas Jelinek  wrote:

> On 04. 01. 23 at 8:29, Reid Wahl wrote:
> > On Tue, Jan 3, 2023 at 10:53 PM lejeczek via Users
> >  wrote:  
> >>
> >>
> >>
> >> On 03/01/2023 21:44, Ken Gaillot wrote:  
> >>> On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote:  
>  On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote:  
> > Hi,
> >
> > On Tue, 3 Jan 2023 16:44:01 +0100
> > lejeczek via Users  wrote:
> >  
> >> To get/have Postgresql cluster with 'pgsqlms' resource, such
> >> cluster needs a 'master' IP - what do you guys do when/if
> >> you have multiple resources off this agent?
> >> I wonder if it is possible to keep just one IP and have all
> >> those resources go to it - probably 'scoring' would be very
> >> tricky then, or perhaps not?  
> > That would mean all promoted pgsql MUST be on the same node at any
> > time.
> > If one of your instance got some troubles and need to failover,
> > *ALL* of them
> > would failover.
> >
> > This imply not just a small failure time window for one instance,
> > but for all
> > of them, all the users.
> >  
> >> Or you do separate IP for each 'pgsqlms' resource - the
> >> easiest way out?  
> > That looks like a better option to me, yes.
> >
> > Regards,  
>  Not related - Is this an old bug?:
>   
>  -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms  
>  bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s
>  op stop timeout=60s op promote timeout=30s op demote
>  timeout=120s op monitor interval=15s timeout=10s
>  role="Master" op monitor interval=16s timeout=10s
>  role="Slave" op notify timeout=60s meta promotable=true
>  notify=true master-max=1 --disable
>  Error: Validation result from agent (use --force to override):
>   ocf-exit-reason:You must set meta parameter notify=true
>  for your master resource
>  Error: Errors have occurred, therefore pcs is unable to continue  
> >>> pcs now runs an agent's validate-all action before creating a resource.
> >>> In this case it's detecting a real issue in your command. The options
> >>> you have after "meta" are clone options, not meta options of the
> >>> resource being cloned. If you just change "meta" to "clone" it should
> >>> work.  
> >> Nope. Exact same error message.
> >> If I remember correctly there was a bug specifically
> >> pertained to 'notify=true'  
> > 
> > The only recent one I can remember was a core dump.
> > - Bug 2039675 - pacemaker coredump with ocf:heartbeat:mysql resource
> > (https://bugzilla.redhat.com/show_bug.cgi?id=2039675)
> > 
> >  From a quick inspection of the pcs resource validation code
> > (lib/pacemaker/live.py:validate_resource_instance_attributes_via_pcmk()),
> > it doesn't look like it passes the meta attributes. It only passes the
> > instance attributes. (I could be mistaken.)
> > 
> > The pgsqlms resource agent checks the notify meta attribute's value as
> > part of the validate-all action. If pcs doesn't pass the meta
> > attributes to crm_resource, then the check will fail.
> >   
> 
> Pcs cannot pass meta attributes to crm_resource, because there is 
> nowhere to pass them to.

But they are passed as environment variables by Pacemaker; why couldn't pcs set
them as well when running the agent?

> As defined in OCF 1.1, only instance attributes 
> matter for validation, see 
> https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#check-levels

It doesn't clearly state that meta attributes must be ignored by the agent
during these actions.

And one could argue that checking a meta attribute is a purely internal setup
check, at level 0.

> The agents are bugged - they depend on meta data being passed to 
> validation. This is already tracked and being worked on:
> 
> https://github.com/ClusterLabs/resource-agents/pull/1826

The pgsqlms resource agent checks the OCF_RESKEY_CRM_meta_notify environment
variable before raising this error.

The pgsqlms resource agent relies on the notify action to make some important
checks and take some actions. Without notifies, the resource will just behave
wrongly. This is an essential check.

However, I've been considering moving some of these checks to the probe action
only. Would that make sense? The notify check could move there, as there's no
need to check it on a regular basis.

Thanks,


Re: [ClusterLabs] multiple resources - pgsqlms - and IP(s)

2023-01-03 Thread Jehan-Guillaume de Rorthais via Users
Hi,

On Tue, 3 Jan 2023 16:44:01 +0100
lejeczek via Users  wrote:

> To get/have Postgresql cluster with 'pgsqlms' resource, such 
> cluster needs a 'master' IP - what do you guys do when/if 
> you have multiple resources off this agent?
> I wonder if it is possible to keep just one IP and have all 
> those resources go to it - probably 'scoring' would be very 
> tricky then, or perhaps not?

That would mean all promoted pgsql instances MUST be on the same node at any
time. If one of your instances gets into trouble and needs to fail over, *ALL*
of them would fail over.

This implies not just a small failure time window for one instance, but for all
of them, and all the users.

> Or you do separate IP for each 'pgsqlms' resource - the 
> easiest way out?

That looks like a better option to me, yes.

Regards,


Re: [ClusterLabs] [External] : Re: Fence Agent tests

2022-11-07 Thread Jehan-Guillaume de Rorthais via Users
On Mon, 7 Nov 2022 14:06:51 +
Robert Hayden  wrote:

> > -Original Message-
> > From: Users  On Behalf Of Valentin Vidic
> > via Users
> > Sent: Sunday, November 6, 2022 5:20 PM
> > To: users@clusterlabs.org
> > Cc: Valentin Vidić 
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > 
> > On Sun, Nov 06, 2022 at 09:08:19PM +, Robert Hayden wrote:  
> > > When SBD_PACEMAKER was set to "yes", the lack of network connectivity  
> > to the node  
> > > would be seen and acted upon by the remote nodes (evicts and takes
> > > over ownership of the resources).  But the impacted node would just
> > > sit logging IO errors.  Pacemaker would keep updating the /dev/watchdog
> > > device so SBD would not self evict.   Once I re-enabled the network, then
> > >  
> > the
> > 
> > Interesting, not sure if this is the expected behaviour based on:
> > 
> > https://lists.clusterlabs.org/pipermail/users/2017-August/022699.html
> > 
> > Does SBD log "Majority of devices lost - surviving on pacemaker" or
> > some other messages related to Pacemaker?  
> 
> Yes.
> 
> > 
> > Also what is the status of Pacemaker when the network is down? Does it
> > report no quorum or something else?
> >   
> 
> Pacemaker on the failing node shows quorum even though it has lost 
> communication to the Quorum Device and to the other node in the cluster.

This is the main issue. Maybe inspecting the corosync-cmapctl output could shed
some light on a setting we are missing?
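
For instance, a sketch:

  # dump the runtime corosync keys and focus on quorum-related settings
  corosync-cmapctl | grep -i -e quorum -e two_node -e qdevice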

> The non-failing node of the cluster can see the Quorum Device system and 
> thus correctly determines to fence the failing node and take over its 
> resources.

Normal.

> Only after I run firewall-cmd --panic-off, will the failing node start to log
> messages about loss of TOTEM and getting a new consensus with the 
> now visible members.
> 
> I think all of that explains the lack of self-fencing when the sbd setting of
> SBD_PACEMAKER=yes is used.

I'm not sure. If I understand correctly, SBD_PACEMAKER=yes only instructs sbd
to keep an eye on the pacemaker+corosync processes (as described upthread). It
doesn't explain why Pacemaker keeps holding the quorum, but I might be missing
something...


Re: [ClusterLabs] [External] : Re: Fence Agent tests

2022-11-05 Thread Jehan-Guillaume de Rorthais via Users
On Sat, 5 Nov 2022 20:54:55 +
Robert Hayden  wrote:

> > -Original Message-
> > From: Jehan-Guillaume de Rorthais 
> > Sent: Saturday, November 5, 2022 3:45 PM
> > To: users@clusterlabs.org
> > Cc: Robert Hayden 
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > 
> > On Sat, 5 Nov 2022 20:53:09 +0100
> > Valentin Vidić via Users  wrote:
> >   
> > > On Sat, Nov 05, 2022 at 06:47:59PM +, Robert Hayden wrote:  
> > > > That was my impression as well...so I may have something wrong.  My
> > > > expectation was that SBD daemon should be writing to the  
> > /dev/watchdog  
> > > > within 20 seconds and the kernel watchdog would self fence.  
> > >
> > > I don't see anything unusual in the config except that pacemaker mode is
> > > also enabled. This means that the cluster is providing signal for sbd even
> > > when the storage device is down, for example:
> > >
> > > 883 ?SL 0:00 sbd: inquisitor
> > > 892 ?SL 0:00  \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...
> > > 893 ?SL 0:00  \_ sbd: watcher: Pacemaker
> > > 894 ?SL 0:00  \_ sbd: watcher: Cluster
> > >
> > > You can strace different sbd processes to see what they are doing at any
> > > point.  
> > 
> > I suspect both watchers should detect the loss of network/communication
> > with
> > the other node.
> > 
> > BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the
> > local **Pacemaker** is still quorate (via corosync). See the full chapter:
> > «If Pacemaker integration is activated, SBD will not self-fence if
> > **device** majority is lost [...]»
> > https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html
> > 
> > Would it be possible that no node is shutting down because the cluster is in
> > two-node mode? Because of this mode, both would keep the quorum
> > expecting the
> > fencing to kill the other one... Except there's no active fencing here, only
> > "self-fencing".
> >   
> 
> I failed to mention I also have a Quorum Device also setup to add its vote to
> the quorum. So two_node is not enabled. 

oh, ok.

> I suspect Valentin was onto to something with pacemaker keeping the watchdog
> device updated as it thinks the cluster is ok.  Need to research and test
> that theory out.  I will try to carve some time out next week for that.

AFAIK, Pacemaker strictly relies on SBD to deal with the watchdog. It doesn't
feed it by itself.

In Pacemaker mode, SBD is watching the two most important part of the cluster:
Pacemaker and Corosync:

* the "Pacemaker watcher" of SBD connects to the CIB and check it's still
  updated on a regular basis and the self-node is marked online.
* the "Cluster watchers" all connect with each others using a dedicated
  communication group in corosync ring(s).

Both watchers can report a failure to SBD that would self-stop the node.

If the network is down, I suppose the cluster watcher should complain. But I
suspect Pacemaker somehow keeps reporting itself as quorate, thus preventing SBD
from killing the whole node...
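
To check that theory during the outage, something like the following could help
(the sbd systemd unit name is an assumption, it may differ on your distribution):

  # does Pacemaker still believe it is part of a quorate partition?
  crm_mon -1 | grep -i quorum

  # what does corosync itself report?
  corosync-quorumtool -s

  # are the sbd watchers logging any complaint?
  journalctl -u sbd -f

If the isolated node keeps showing "partition with quorum", that would confirm
the guess above.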

> Appreciate all of the feedback.  I have been dealing with Cluster Suite for a
> decade+ but focused on the company's setup.  I still have lots to learn,
> which keeps me interested.

+1

Keep us informed!

Regards,


Re: [ClusterLabs] [External] : Re: Fence Agent tests

2022-11-05 Thread Jehan-Guillaume de Rorthais via Users
On Sat, 5 Nov 2022 20:53:09 +0100
Valentin Vidić via Users  wrote:

> On Sat, Nov 05, 2022 at 06:47:59PM +, Robert Hayden wrote:
> > That was my impression as well...so I may have something wrong.  My
> > expectation was that SBD daemon should be writing to the /dev/watchdog
> > within 20 seconds and the kernel watchdog would self fence.  
> 
> I don't see anything unusual in the config except that pacemaker mode is
> also enabled. This means that the cluster is providing signal for sbd even
> when the storage device is down, for example:
> 
> 883 ?SL 0:00 sbd: inquisitor
> 892 ?SL 0:00  \_ sbd: watcher: /dev/vdb1 - slot: 0 - uuid: ...
> 893 ?SL 0:00  \_ sbd: watcher: Pacemaker
> 894 ?SL 0:00  \_ sbd: watcher: Cluster
> 
> You can strace different sbd processes to see what they are doing at any
> point.

I suspect both watchers should detect the loss of network/communication with
the other node.

BUT, when sbd is in Pacemaker mode, it doesn't reset the node if the
local **Pacemaker** is still quorate (via corosync). See the full chapter:
«If Pacemaker integration is activated, SBD will not self-fence if **device**
majority is lost [...]»
https://documentation.suse.com/sle-ha/15-SP4/html/SLE-HA-all/cha-ha-storage-protect.html

Would it be possible that no node is shutting down because the cluster is in
two-node mode? Because of this mode, both would keep the quorum expecting the
fencing to kill the other one... Except there's no active fencing here, only
"self-fencing".

To verify this guess, check the corosync configuration for the "two_node"
parameter, and check whether both nodes still report as quorate during the
network outage, using:

  corosync-quorumtool -s

If this turns out to be a good guess, then without **active** fencing I suppose
a cluster cannot rely on two-node mode. I'm not sure what the best setup would
be, though.
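
For reference, a two-node cluster typically shows something like this in the
quorum section of corosync.conf (a sketch, values are illustrative):

  quorum {
      provider: corosync_votequorum
      two_node: 1
      wait_for_all: 1
  }

Note that setting two_node implicitly enables wait_for_all unless it is
explicitly overridden.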

Regards,


Re: [ClusterLabs] Corosync over dedicated interface?

2022-10-03 Thread Jehan-Guillaume de Rorthais via Users
On Mon, 3 Oct 2022 14:45:49 +0200
Tomas Jelinek  wrote:

> On 28. 09. 22 at 18:22, Jehan-Guillaume de Rorthais via Users wrote:
> > Hi,
> > 
> > A small addendum below.
> > 
> > On Wed, 28 Sep 2022 11:42:53 -0400
> > "Kevin P. Fleming"  wrote:
> >   
> >> On Wed, Sep 28, 2022 at 11:37 AM Dave Withheld 
> >> wrote:  
> >>>
> >>> Is it possible to get corosync to use the private network and stop trying
> >>> to use the LAN for cluster communications? Or am I totally off-base and am
> >>> missing something in my drbd/pacemaker configuration?  
> >>
> >> Absolutely! When I setup my two-node cluster recently I did exactly
> >> that. If you are using 'pcs' to manage your cluster, ensure that you
> >> add the 'addr=' parameter during 'pcs host auth' so that Corosync and
> >> the layers above it will use that address for the host. Something
> >> like:
> >>
> >> $ pcs host auth cluster-node-1 addr=192.168.10.1 cluster-node-2
> >> addr=192.168.10.2  
> > 
> > You can even set multiple rings so corosync can rely on both:
> > 
> >$ pcs host auth\
> >  cluster-node-1 addr=192.168.10.1 addr=10.20.30.1 \
> >  cluster-node-2 addr=192.168.10.2 addr=10.20.30.2  
> 
> Hi,
> 
> Just a little correction.
> 
> The 'pcs host auth' command accepts only one addr= for each node. The 
> address will be then used for pcs communication. If you don't put any 
> addr= in the 'pcs cluster setup' command, it will be used for corosync 
> communication as well.
> 
> However, if you want to set corosync to use multiple rings, you do that 
> by specifying addr= in the 'pcs cluster setup' command like this:
> 
> pcs cluster setup cluster_name \
> cluster-node-1 addr=192.168.10.1 addr=10.20.30.1 \
> cluster-node-2 addr=192.168.10.2 addr=10.20.30.2
> 
> If you used addr= in the 'pcs host auth' command and you want the same 
> address to be used by corosync, you need to specify that address in the 
> 'pcs cluster setup' command. If you only specify the second address, 
> you'll end up with a one-ring cluster.

Oops! I mixed commands up, sorry!

Thank you Tomas.
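
To sum up the corrected sequence (a sketch reusing the same example names and
addresses as above):

  $ pcs host auth \
      cluster-node-1 addr=192.168.10.1 \
      cluster-node-2 addr=192.168.10.2

  $ pcs cluster setup cluster_name \
      cluster-node-1 addr=192.168.10.1 addr=10.20.30.1 \
      cluster-node-2 addr=192.168.10.2 addr=10.20.30.2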


Re: [ClusterLabs] Corosync over dedicated interface?

2022-09-28 Thread Jehan-Guillaume de Rorthais via Users
Hi,

A small addendum below.

On Wed, 28 Sep 2022 11:42:53 -0400
"Kevin P. Fleming"  wrote:

> On Wed, Sep 28, 2022 at 11:37 AM Dave Withheld 
> wrote:
> >
> > Is it possible to get corosync to use the private network and stop trying
> > to use the LAN for cluster communications? Or am I totally off-base and am
> > missing something in my drbd/pacemaker configuration?  
> 
> Absolutely! When I setup my two-node cluster recently I did exactly
> that. If you are using 'pcs' to manage your cluster, ensure that you
> add the 'addr=' parameter during 'pcs host auth' so that Corosync and
> the layers above it will use that address for the host. Something
> like:
> 
> $ pcs host auth cluster-node-1 addr=192.168.10.1 cluster-node-2
> addr=192.168.10.2

You can even set multiple rings so corosync can rely on both:

  $ pcs host auth\
cluster-node-1 addr=192.168.10.1 addr=10.20.30.1 \
cluster-node-2 addr=192.168.10.2 addr=10.20.30.2

Then, compare (but do not edit!) your "/etc/corosync/corosync.conf" on all
nodes.
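
Assuming both rings get configured, the nodelist section should then look
something like this sketch (names and addresses are only examples):

  nodelist {
      node {
          ring0_addr: 192.168.10.1
          ring1_addr: 10.20.30.1
          name: cluster-node-1
          nodeid: 1
      }
      node {
          ring0_addr: 192.168.10.2
          ring1_addr: 10.20.30.2
          name: cluster-node-2
          nodeid: 2
      }
  }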

Regards,


Re: [ClusterLabs] DRBD and SQL Server

2022-09-28 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 28 Sep 2022 02:33:59 -0400
Madison Kelly  wrote:

> ...
> I'm happy to go into more detail, but I'll stop here until/unless you have
> more questions. Otherwise I'd write a book. :)

I would buy it ;)


Re: [ClusterLabs] (no subject)

2022-09-07 Thread Jehan-Guillaume de Rorthais via Users
Hey,

On Wed, 7 Sep 2022 19:12:53 +0900
권오성  wrote:

> Hello.
> I am a student who wants to implement a redundancy system with raspberry pi.
> Last time, I posted about how to proceed with installation on raspberry pi
> and received a lot of comments.
> Among them, I searched a lot after looking at the comments saying that
> fencing stonith should not be false.
> (ex -> sudo pcs property set stonith-enabled=false)
> However, I saw a lot of posts saying that there is no choice but to do
> false because there is no ipmi in raspberry pi, and I wonder how we can
> solve it in this situation.

Fencing is not just about IPMI:
* you can use external smart devices to shut down your nodes (e.g. a PDU, a UPS,
  or even a self-made fencing device, see
  https://www.alteeve.com/w/Building_a_Node_Assassin_v1.1.4)
* you can fence a node by cutting its access to the network from a managed
  switch, without shutting down the node
* you can use one, two or three shared storage devices plus the internal
  hardware watchdog of the RPi, driven by the SBD service
* you can use the internal hardware watchdog of the RPi with the SBD service
  alone, without any shared disk (see the sketch below)
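
As a rough sketch of the last (diskless) option, assuming a Debian-based RPi
where sbd reads /etc/default/sbd and pcs is used for management (paths, property
names and timeout values should be checked against your distribution):

  # /etc/default/sbd: no SBD_DEVICE, only the hardware watchdog
  SBD_PACEMAKER=yes
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=10

  # then tell Pacemaker to rely on watchdog self-fencing
  pcs property set stonith-enabled=true
  pcs property set stonith-watchdog-timeout=20s

Keep in mind that diskless SBD relies on real quorum to be safe, so it is not a
good fit for a plain two-node cluster without a quorum device.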

Regards,