Re: [ClusterLabs] PostgreSQL server timelines offset after promote

2024-04-02 Thread FLORAC Thierry
Hi Jehan-Guillaume,

Thanks for your links, but I had already looked at them when I first started using
Pacemaker (except for the last one).
Actually, I forgot to mention that PostgreSQL and Pacemaker run on a Debian
GNU/Linux system (latest "Bookworm" release); on reboot, Pacemaker is stopped
using "systemctl stop", and resources *seem* to be migrated correctly to the
promoted slave.
When the previous master is restarted, Pacemaker is restarted automatically and
the node is instantly promoted as the new master; it *seems* that this is the
moment when a new timeline is sometimes created on the backup server...

Don't you think I should disable the automatic restart of Pacemaker on the
master server, and handle promotion manually when a switchover occurs?
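
For reference, disabling the automatic restart would look something like this (a
sketch only; unit names assume the standard Debian packages, and whether this is
a good idea is exactly the question above):

```shell
# Keep Pacemaker (and Corosync) from starting automatically at boot, so a
# rebooted node only rejoins the cluster when an operator decides it is safe.
systemctl disable pacemaker corosync

# Later, once the node has been checked, let it rejoin by hand:
systemctl start corosync pacemaker
```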

Best regards,
Thierry

--
Thierry Florac

Resp. Pôle Architecture Applicative et Mobile
DSI - Dépt. Études et Solutions Tranverses
2 bis avenue du Général Leclerc - CS 30042
94704 MAISONS-ALFORT Cedex
Tél : 01 40 19 59 64 - 06 26 53 42 09
https://www.onf.fr



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PostgreSQL server timelines offset after promote

2024-03-27 Thread Jehan-Guillaume de Rorthais via Users
Bonjour Thierry,

On Mon, 25 Mar 2024 10:55:06 +
FLORAC Thierry  wrote:

> I'm trying to create a PostgreSQL master/slave cluster using streaming
> replication and pgsqlms agent. Cluster is OK but my problem is this : the
> master node is sometimes restarted for system operations, and the slave is
> then promoted without any problem ; 

When you have to do some planned system operation, you **must** diligently ask
Pacemaker for permission. Pacemaker is the real owner of your resource. It will
react to any unexpected event, even a planned one. You must think of it as a
hidden colleague taking care of your resource.

There are various ways to deal with Pacemaker when you need to do some system
maintenance on your primary, depending on your constraints. Here are two
examples:

* ask Pacemaker to move the "promoted" role to another node
* then put the node in standby mode
* then do your admin tasks
* then un-standby the node: a standby instance should start on the original node
* optional: move the "promoted" role back to the original node
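
With crmsh, the steps above could be sketched like this (resource and node names
come from the configuration quoted later in this thread; exact subcommands vary
between crmsh versions, so check yours):

```shell
# Move the promoted role of the pgsql-ha clone away, then take the
# original node out of service for maintenance.
crm resource move pgsql-ha pgsql-slave.heb.onf.fr
crm node standby pgsql-master.heb.onf.fr

# ... perform the system maintenance and reboot ...

# Bring the node back: it should rejoin as a standby.
crm node online pgsql-master.heb.onf.fr

# Remove the temporary location constraint created by "move"
# ("unmigrate" on older crmsh releases).
crm resource clear pgsql-ha
```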

Or:

* put the whole cluster in maintenance mode
* then do your admin tasks
* then check that everything works as the cluster expects
* then exit maintenance mode
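
Again with crmsh, the maintenance-mode variant might look like this (a sketch
using the cluster-wide `maintenance-mode` property):

```shell
# Tell Pacemaker to stop managing resources: it keeps monitoring but
# takes no action while maintenance-mode is true.
crm configure property maintenance-mode=true

# ... perform the system maintenance; services stay up, unmanaged ...

# Hand control back. Pacemaker re-probes resource state on exit, so make
# sure everything is exactly as the cluster expects before this step.
crm configure property maintenance-mode=false
```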

The second one might be tricky if Pacemaker finds some unexpected status or
event when exiting maintenance mode.

You can find (old) examples of administrative tasks in:

* with pcs: https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html
* with crm: https://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html
* with "low level" commands:
  https://clusterlabs.github.io/PAF/administration.html

These docs updates are long overdue, sorry about that :(

Also, here is a hidden gist (that needs some updates as well):

https://github.com/ClusterLabs/PAF/tree/workshop/docs/workshop/fr


> after reboot, the old master is re-promoted, but I often get an error in
> slave logs :
> 
>   FATAL:  la plus grande timeline 1 du serveur principal est derrière la
>   timeline de restauration 2
> 
> which in the English message catalog reads:
> 
>   FATAL: highest timeline 1 of the primary is behind recovery timeline 2


This is unexpected. I wonder how Pacemaker is being stopped. It is supposed to
stop its resources gracefully: the promotion scores should be updated to reflect
that the local resource is no longer the primary, and PostgreSQL should be
demoted, then stopped. After a graceful shutdown, it is supposed to start as a
standby.
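
One way to investigate is to compare the timeline each node is on before
anything gets re-promoted (a sketch; the binary and data paths below assume the
configuration quoted later in this thread):

```shell
# On a stopped instance, read the current timeline from the control file.
/usr/lib/postgresql/15/bin/pg_controldata /var/local/postgresql/15/pel \
    | grep -i timeline

# On a running instance, the same information via SQL.
psql -p 5432 -Atc "SELECT timeline_id FROM pg_control_checkpoint();"
```

If the rebooted node reports a lower timeline than the standby, it was promoted
without having replayed the timeline switch, which matches the FATAL above.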

Regards,


[ClusterLabs] PostgreSQL server timelines offset after promote

2024-03-25 Thread FLORAC Thierry
Hi,

I'm trying to create a PostgreSQL master/slave cluster using streaming
replication and the pgsqlms agent.
The cluster is OK, but my problem is this: the master node is sometimes
restarted for system operations, and the slave is then promoted without any
problem; after reboot, the old master is re-promoted, but I often get an error
in the slave logs:

  FATAL:  la plus grande timeline 1 du serveur principal est derrière la
timeline de restauration 2

which in the English message catalog reads:

  FATAL: highest timeline 1 of the primary is behind recovery timeline 2

Is there a way to avoid this?

This is my Pacemaker configuration:

node 1: pgsql-master.heb.onf.fr \
        attributes master-pgsql-pel=1001
node 2: pgsql-slave.heb.onf.fr \
        attributes master-pgsql-pel=1000
primitive pgsql-mail MailTo \
        params email="thierry.flo...@onf.fr" subject="[PRODUCTION] PostgreSQL server status changed: "
primitive pgsql-pel pgsqlms \
        params bindir="/usr/lib/postgresql/15/bin" pgdata="/var/local/postgresql/15/pel" pgport=5432 start_opts="-c config_file=/etc/postgresql/15/pel/postgresql.conf" \
        op methods interval=0s timeout=5 \
        op start timeout=60s interval=0s \
        op stop timeout=60s interval=0s \
        op promote timeout=30s interval=0s \
        op monitor interval=15s timeout=10s role=Promoted \
        op demote timeout=120s interval=0s \
        op reload interval=0s timeout=20 \
        op monitor interval=16s timeout=10s role=Unpromoted \
        op notify timeout=60s interval=0s
primitive pgsql-vip IPaddr2 \
        params iflabel=pgsql ip=100.127.36.50 \
        op start interval=0s timeout=20s \
        op stop interval=0s timeout=20s \
        op monitor interval=5s \
        meta target-role=Started
group pgsql-group pgsql-vip \
        meta target-role=Started
clone pgsql-ha pgsql-pel \
        meta notify=true master-max=1 target-role=Started promotable=true
location cli-prefer-pgsql-ha pgsql-ha role=Started inf: pgsql-master.heb.onf.fr
location cli-prefer-pgsql-vip pgsql-vip role=Started inf: pgsql-master.heb.onf.fr
colocation colocation-pgsql-vip-pgsql-ha-INFINITY inf: pgsql-vip:Started pgsql-ha:Promoted
order order-pgsql-ha-pgsql-vip-demote Mandatory: pgsql-ha:demote pgsql-vip:stop symmetrical=false
order order-pgsql-ha-pgsql-vip-promote Mandatory: pgsql-ha:promote pgsql-vip:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.1.5-a3f44794f94 \
        cluster-infrastructure=corosync \
        cluster-name=pgsql-heb-production \
        stonith-enabled=false \
        last-lrm-refresh=1710491873
And here are PostgreSQL replication settings:
listen_addresses = '*'

hot_standby = on
hot_standby_feedback = on

wal_level = replica
wal_log_hints = on
wal_receiver_status_interval = 5s
wal_keep_size = 256
max_wal_senders = 5

synchronous_commit = off
synchronous_standby_names = '*'

recovery_target_timeline = 'latest'

Best regards,
Thierry

--
Thierry Florac

Resp. Pôle Architecture Applicative et Mobile
DSI - Dépt. Études et Solutions Tranverses
2 bis avenue du Général Leclerc - CS 30042
94704 MAISONS-ALFORT Cedex
Tél : 01 40 19 59 64 - 06 26 53 42 09
https://www.onf.fr

