Re: [ClusterLabs] Is it possible to downgrade feature-set in 2.1.6-8

2024-02-26 Thread vitaly
Thank you!
_Vitaly

> On 02/26/2024 10:28 AM EST Ken Gaillot  wrote:
> 
>  
> On Thu, 2024-02-22 at 08:05 -0500, vitaly wrote:
> > Hello. 
> > We have a product with 2 node clusters.
> > Our current version uses Pacemaker 2.1.4; the new version will use
> > Pacemaker 2.1.6.
> > If an upgrade fails, it is possible that one node will come up with
> > the new Pacemaker and work alone for a while.
> > The old node would then come up later and try to join the cluster.
> > This would fail because the cluster nodes have different feature sets:
> > a node with the older feature set cannot join a cluster running the
> > newer one.
> >  
> > Question: 
> > Is it possible to force the new node with Pacemaker 2.1.6 to use the
> > older feature set (3.15.0) for a while, until the second node is
> > upgraded and able to work with Pacemaker 2.1.6?
> 
> No
> 
> >  
> > Thank you very much!
> > _Vitaly
> > 
> -- 
> Ken Gaillot 
> 


[ClusterLabs] Is it possible to downgrade feature-set in 2.1.6-8

2024-02-22 Thread vitaly
Hello. 
We have a product with 2 node clusters.
Our current version uses Pacemaker 2.1.4; the new version will use Pacemaker 
2.1.6.
If an upgrade fails, it is possible that one node will come up with the new 
Pacemaker and work alone for a while.
The old node would then come up later and try to join the cluster.
This would fail because the cluster nodes have different feature sets: a node 
with the older feature set cannot join a cluster running the newer one.
 
Question:
Is it possible to force the new node with Pacemaker 2.1.6 to use the older 
feature set (3.15.0) for a while, until the second node is upgraded and able to 
work with Pacemaker 2.1.6?
 
Thank you very much!
_Vitaly
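
Since the feature set cannot be forced down, the practical check before letting 
an older node rejoin is to compare what each side supports with what the cluster 
already uses. A minimal sketch, assuming a root shell on each node (output 
formats vary slightly between Pacemaker releases):

# Feature set supported by the locally installed Pacemaker
pacemakerd --features

# Feature set the running cluster has already recorded in the CIB
cibadmin --query | grep -m1 -o 'crm_feature_set="[^"]*"'

If the CIB already records a newer feature set than the old node supports, that 
node can only rejoin after being upgraded, as the reply above confirms.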


Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly
OK,
Thank you very much for your help!
_Vitaly

> On 07/05/2022 8:47 PM Reid Wahl  wrote:
> 
>  
> On Tue, Jul 5, 2022 at 3:03 PM vitaly  wrote:
> >
> > Hello,
> > Yes, the snippet has everything there was for the full second of Jul 05 
> > 11:54:34. I did not cut anything between the last line of 11:54:33 and 
> > first line of 11:54:35.
> >
> > Here is grep from pacemaker config:
> >
> > d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
> > PCMK_logfile=/var/log/pacemaker.log
> > SBD_SYNC_RESOURCE_STARTUP="no"
> > PCMK_trace_functions=services_action_sync,svc_read_output
> > d19-25-left.lab.archivas.com ~ #
> >
> > I also grepped the CURRENT pacemaker.log for services_action_sync and got just 
> > 4 records, whose timestamps do not seem to match the failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep services_action_sync 
> > /var/log/pacemaker.log
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> > /usr/sbin/fence_ipmilan = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
> > /usr/sbin/fence_sbd = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
> > (services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
> >
> > This is grep of messages for failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" 
> > /var/log/messages
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> > receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to 
> > get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to 
> &

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly via Users
Hello,
Yes, the snippet has everything there was for the full second of Jul 05 
11:54:34. I did not cut anything between the last line of 11:54:33 and first 
line of 11:54:35.

Here is grep from pacemaker config:

d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
PCMK_logfile=/var/log/pacemaker.log
SBD_SYNC_RESOURCE_STARTUP="no"
PCMK_trace_functions=services_action_sync,svc_read_output
d19-25-left.lab.archivas.com ~ # 

I also grepped the CURRENT pacemaker.log for services_action_sync and got just 4 
records, whose timestamps do not seem to match the failures:

d19-25-left.lab.archivas.com ~ # grep services_action_sync 
/var/log/pacemaker.log
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
/usr/sbin/fence_ipmilan = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:901)  trace:  > (null)_(null)_0: 
/usr/sbin/fence_sbd = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] 
(services_action_sync@services.c:903)  trace:  >  stdout: <?xml version="1.0" ?>

This is grep of messages for failures:

d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" 
/var/log/messages 
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive 
meta-data for ocf:heartbeat:pgsql-rhino
Jul  5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get 
metadata for postgres (ocf:heartbeat:pgsql-rhino)
d19-25-left.lab.archivas.com ~ # 

Sorry, these logs are not from the same time as this morning's, as I reinstalled 
the cluster a couple of times today. 

Thanks,
_Vitaly
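
The trace settings above only cover functions used by the fencer, which is why 
the controller's failing metadata calls never show up there. One hedged 
alternative is to raise debug logging for the controller daemon itself; the 
option names below are as documented for Pacemaker 2.x, and the node's cluster 
services need a restart to pick them up:

# /etc/sysconfig/pacemaker (sketch)
PCMK_logfile=/var/log/pacemaker.log
# Debug logging for the controller only; "yes" would enable it for all daemons
PCMK_debug=pacemaker-controld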


> On 07/05/2022 3:19 PM Reid Wahl  wrote:
> 
>  
> On Tue, Jul 5, 2022 at 5:17 AM vitaly  wrote:
> >
> > Hello,
> > Thanks for looking at this issue!
> > Snippets from /var/log/messages and /var/log/pacemaker.log are below.
> > _Vitaly
> >
> > Here is /var/log/pacemaker.log snippet around the failure:
> >
> > Jul 05 11:

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-05 Thread vitaly
ice: Requesting 
local execution of start operation for N1F1 on d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting 
local execution of start operation for fs_monitor on 
d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right fs_monitor-rhino(fs_monitor)[2298359]: INFO: 
Started fs_monitor.sh, pid=2298369
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
monitor operation for tomcat-instance on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
start operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting 
local execution of monitor operation for fs_monitor on 
d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting 
local execution of start operation for ClusterMonitor on 
d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
monitor operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Adding inet 
address 172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 
(with label bond0:N1F1)
Jul  5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Bringing 
device bond0 up
Jul  5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used 
not_used
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
start operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting 
local execution of monitor operation for N1F1 on d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right cluster_monitor-rhino(ClusterMonitor)[2298481]: 
INFO: Started cluster_monitor.sh, pid=2298549
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
start operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting 
local execution of monitor operation for ClusterMonitor on 
d19-25-right.lab.archivas.com
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
monitor operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
monitor operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of 
monitor operation for ethmonitor on d19-25-right.lab.archivas.com: ok
Jul  5 11:54:34 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Changing 
pgsql-data-status on d19-25-left.lab.archivas.com : DISCONNECT->STREAMING|ASYNC.
Jul  5 11:54:35 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Setup 
d19-25-left.lab.archivas.com into sync mode.

> On 07/04/2022 3:57 PM Reid Wahl  wrote:
> 
>  
> On Mon, Jul 4, 2022 at 7:19 AM vitaly  wrote:
> >
> > I get printout of metadata as follows:
> > d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf 
> > /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data
> > [The agent's meta-data output was quoted here; the XML markup was stripped
> > by the archive. The full listing appears in the original 2022-07-04 message
> > below.]

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-04 Thread vitaly
I get printout of metadata as follows:
d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf 
/usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data



[OCF meta-data output; the XML markup was stripped by the archive, leaving only 
the text content. Reconstructed as "name/short description - description":]

Version: 1.0
Agent: Manages a PostgreSQL database instance - Resource script for PostgreSQL. 
It manages a PostgreSQL instance as an HA resource.

Parameters:
pgctl - Path to pg_ctl command.
start_opt - Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.
ctl_opt - Additional pg_ctl options (-w, -W etc.).
psql - Path to psql command.
pgdata - Path to PostgreSQL data directory.
pgdba - User that owns PostgreSQL.
pghost - Hostname/IP address where PostgreSQL is listening.
pgport - Port where PostgreSQL is listening.
monitor_user - PostgreSQL user that the pgsql RA will use for monitor 
operations. If it is not set, the pgdba user will be used.
monitor_password - Password for the monitor user.
monitor_sql - SQL script that will be used for monitor operations.
Configuration file - Path to the PostgreSQL configuration file for the instance.
pgdb - Database that will be used for monitoring.
logfile - Path to the PostgreSQL server log output file.
socketdir - Unix socket directory for PostgreSQL.
stop escalation - Number of shutdown retries (using -m fast) before resorting 
to -m immediate.
rep_mode - Replication mode (none (default)/async/sync). "async" and "sync" 
require PostgreSQL 9.1 or later. If you use async or sync, it requires the 
node_list, master_ip, and restore_command parameters, and postgresql.conf and 
pg_hba.conf need to be set up for replication. Please delete the 
"include /../../rep_mode.conf" line in postgresql.conf when you switch from 
sync to async.
node list - All node names. Please separate each node name with a space. This 
is required for replication.
restore_command - restore_command for recovery.conf. This is required for 
replication.
master ip - Master's floating IP address to be connected from the hot standby. 
This parameter is used for "primary_conninfo" in recovery.conf. This is 
required for replication.
repuser - User used to connect to the master server. This parameter is used 
for "primary_conninfo" in recovery.conf. This is required for replication.
remote_wals_dir - Location of WALs archived by the other node.
xlogs_dir - Location of WALs on the current node in Rhino before 2.2.0.
wals_dir - Location of WALs on the current node in Rhino 2.2.0 and later.
reppassword - User used to connect to the master server. This parameter is 
used for "primary_conninfo" in recovery.conf. This is required for replication.
primary_conninfo_opt - primary_conninfo options of recovery.conf except host, 
port, user, and application_name. This is optional for replication.
tmpdir - Path to a temporary directory. This is optional for replication.
xlog check count - Number of xlog checks on monitor before promote. This is 
optional for replication.
The timeout of crm_attribute forever update command - Default value is 5 
seconds. This is optional for replication.
stop escalation_in_slave - Number of shutdown retries (using -m fast) before 
resorting to -m immediate in Slave state. This is optional for replication.
Seconds to wait for a process to be running - Number of seconds to wait for a 
PostgreSQL process to be running but not necessarily usable.
Start failures before recovery - Number of failed starts before the system 
forces a recovery from the master database.
Rhino configuration file - Configuration file with overrides for pgsql-rhino.

> On 07/04/2022 5:39 AM Reid Wahl  wrote:
> 
>  
> On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl  wrote:
> >
> > On Sat, Jul 2, 2022 at 1:12 PM vitaly  wrote:
> > >
> > > Sorry, I noticed that I am missing meta "notice=true" and after adding it 
> > > to postgres-ms configuration "notice" events started to come through.
> > > Item 1 still needs explanation. As pacemaker-controld keeps complaining.
> >
> > What happens when you run `OCF_ROOT=/usr/lib/ocf
> > /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?
> 
> This may also be relevant:
> https://lists.clusterlabs.org/pipermail/users/2022-June/030391.html
> 
> >
> > > Thanks!
> > > _Vitaly
> > >
> > > > On 07/02/2022 2:04 PM vitaly  wrote:
> > > >
> > > >
> > > > Hello Everybody.
> > > > I have a 2 node cluster with clone resource “postgres-ms”. We are 
> > > > running following versions of pacemaker/corosync:
> > > > d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> > > > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-

Re: [ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-02 Thread vitaly
Sorry, I noticed that I am missing meta "notice=true" and after adding it to 
postgres-ms configuration "notice" events started to come through.
Item 1 still needs explanation. As pacemaker-controld keeps complaining.
Thanks!
_Vitaly
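
For item 1, one quick sanity check is whether the agent's meta-data action 
returns well-formed XML, exits 0, and does so quickly; a slow or failing call 
is enough for the controller to log "Failed to receive meta-data". A sketch, 
assuming the resource-agents package ships its OCF DTD at 
/usr/share/resource-agents/ra-api-1.dtd (the path may differ by distribution):

# Run the meta-data action the way the cluster does, and time it
time env OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data > /tmp/pgsql-rhino-metadata.xml
echo "exit status: $?"

# Check that the output is well-formed XML and matches the OCF resource agent DTD
xmllint --noout /tmp/pgsql-rhino-metadata.xml
xmllint --noout --dtdvalid /usr/share/resource-agents/ra-api-1.dtd /tmp/pgsql-rhino-metadata.xml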

> On 07/02/2022 2:04 PM vitaly  wrote:
> 
>  
> Hello Everybody.
> I have a 2 node cluster with clone resource “postgres-ms”. We are running 
> following versions of pacemaker/corosync:
> d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> pacemaker-libs-2.0.5-9.el8.x86_64
> pacemaker-cli-2.0.5-9.el8.x86_64
> corosynclib-3.1.0-5.el8.x86_64
> pacemaker-schemas-2.0.5-9.el8.noarch
> corosync-3.1.0-5.el8.x86_64
> pacemaker-2.0.5-9.el8.x86_64
> 
> There are couple of issues that could be related. 
> 1. There are following messages in the logs coming from pacemaker-controld:
> Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to 
> receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to 
> get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> 
> 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations which 
> causes multiple issues with postgres synchronization during availability 
> events. 
> 
> 3. Item 2 raises another question. Who is setting these values:
> ${OCF_RESKEY_CRM_meta_notify_type}
> ${OCF_RESKEY_CRM_meta_notify_operation}
> 
> Here is excerpt from cluster config:
> 
> d19-25-left.lab.archivas.com ~ # pcs config 
> 
> Cluster Name: 
> Corosync Nodes:
>  d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> Pacemaker Nodes:
>  d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
> 
> Resources:
>  Clone: postgres-ms
>   Meta Attrs: promotable=true target-role=started
>   Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
>Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com 
> d19-25-right.lab.archivas.com" pgdata=/pg_data 
> remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX 
> repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh 
> wal_restore.py  %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
> xlogs_dir=/pg_data/pg_xlog
>Meta Attrs: is-managed=true
>Operations: demote interval=0 on-fail=restart timeout=120s 
> (postgres-demote-interval-0)
>methods interval=0s timeout=5 (postgres-methods-interval-0s)
>monitor interval=10s on-fail=restart timeout=300s 
> (postgres-monitor-interval-10s)
>monitor interval=5s on-fail=restart role=Master timeout=300s 
> (postgres-monitor-interval-5s)
>notify interval=0 on-fail=restart timeout=90s 
> (postgres-notify-interval-0)
>promote interval=0 on-fail=restart timeout=120s 
> (postgres-promote-interval-0)
>start interval=0 on-fail=restart timeout=1800s 
> (postgres-start-interval-0)
>stop interval=0 on-fail=fence timeout=120s 
> (postgres-stop-interval-0)
> Thank you very much!
> _Vitaly


[ClusterLabs] Postgres clone resource does not get "notice" events

2022-07-02 Thread vitaly
Hello Everybody.
I have a 2-node cluster with a clone resource “postgres-ms”. We are running the 
following versions of Pacemaker/Corosync:
d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
pacemaker-cluster-libs-2.0.5-9.el8.x86_64
pacemaker-libs-2.0.5-9.el8.x86_64
pacemaker-cli-2.0.5-9.el8.x86_64
corosynclib-3.1.0-5.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
corosync-3.1.0-5.el8.x86_64
pacemaker-2.0.5-9.el8.x86_64

There are a couple of issues that could be related. 
1. The following messages appear in the logs, coming from pacemaker-controld:
Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to 
receive meta-data for ocf:heartbeat:pgsql-rhino
Jul  2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to 
get metadata for postgres (ocf:heartbeat:pgsql-rhino)

2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations which causes 
multiple issues with postgres synchronization during availability events. 

3. Item 2 raises another question. Who is setting these values:
${OCF_RESKEY_CRM_meta_notify_type}
${OCF_RESKEY_CRM_meta_notify_operation}

Here is excerpt from cluster config:

d19-25-left.lab.archivas.com ~ # pcs config 

Cluster Name: 
Corosync Nodes:
 d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
Pacemaker Nodes:
 d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com

Resources:
 Clone: postgres-ms
  Meta Attrs: promotable=true target-role=started
  Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
   Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com 
d19-25-right.lab.archivas.com" pgdata=/pg_data 
remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX 
repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh 
wal_restore.py  %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal 
xlogs_dir=/pg_data/pg_xlog
   Meta Attrs: is-managed=true
   Operations: demote interval=0 on-fail=restart timeout=120s 
(postgres-demote-interval-0)
   methods interval=0s timeout=5 (postgres-methods-interval-0s)
   monitor interval=10s on-fail=restart timeout=300s 
(postgres-monitor-interval-10s)
   monitor interval=5s on-fail=restart role=Master timeout=300s 
(postgres-monitor-interval-5s)
   notify interval=0 on-fail=restart timeout=90s 
(postgres-notify-interval-0)
   promote interval=0 on-fail=restart timeout=120s 
(postgres-promote-interval-0)
   start interval=0 on-fail=restart timeout=1800s 
(postgres-start-interval-0)
   stop interval=0 on-fail=fence timeout=120s 
(postgres-stop-interval-0)
Thank you very much!
_Vitaly
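
For items 2 and 3: Pacemaker only calls an agent's notify action when the clone 
has the notify meta-attribute enabled, and it is Pacemaker itself that exports 
the OCF_RESKEY_CRM_meta_notify_* variables into the environment of each notify 
call. A hedged sketch, with the resource name taken from the config above and 
an illustrative handler that is not the actual pgsql-rhino code:

# Enable notify actions on the promotable clone (pcs syntax may vary by version)
pcs resource meta postgres-ms notify=true

# Inside the agent, every notify call then sees:
pgsql_notify() {
    # Both variables are set by Pacemaker for each notify invocation
    type="${OCF_RESKEY_CRM_meta_notify_type}"        # "pre" or "post"
    op="${OCF_RESKEY_CRM_meta_notify_operation}"     # start / stop / promote / demote
    case "${type}-${op}" in
        pre-promote)  : "a promotion is about to happen somewhere" ;;
        post-promote) : "a new master exists; repoint replication" ;;
        post-start)   : "a peer instance just started" ;;
        post-demote)  : "the old master stepped down" ;;
    esac
    return 0   # OCF_SUCCESS
}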


[ClusterLabs] I_DC_TIMEOUT and node fenced when it joins the cluster

2022-04-15 Thread vitaly
Hello Everybody.
I am occasionally seeing the following behavior on a two-node cluster. 
1. Abruptly rebooting both nodes of the cluster (using "reboot")
2. Both nodes start to come up. Node d18-3-left (2) comes up first 
Apr 13 23:56:09 d18-3-left corosync[11465]:   [MAIN  ] Corosync Cluster Engine 
('2.4.4'): started and ready to provide service.

3. Second node d18-3-right (1) joins the cluster

Apr 13 23:56:58 d18-3-left corosync[11466]:   [TOTEM ] A new membership 
(172.16.1.1:60) was formed. Members joined: 1
Apr 13 23:56:58 d18-3-left corosync[11466]:   [QUORUM] This node is within the 
primary component and will provide service.
Apr 13 23:56:58 d18-3-left corosync[11466]:   [QUORUM] Members[2]: 1 2
Apr 13 23:56:58 d18-3-left corosync[11466]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Apr 13 23:56:58 d18-3-left pacemakerd[11717]:   notice: Quorum acquired
Apr 13 23:56:58 d18-3-left crmd[11763]:   notice: Quorum acquired

4. Two seconds later, node d18-3-left shows I_DC_TIMEOUT and starts fencing the 
newly joined node.

Apr 13 23:57:00 d18-3-left crmd[11763]:  warning: Input I_DC_TIMEOUT received 
in state S_PENDING from crm_timer_popped
After that we get:
Apr 13 23:57:00 d18-3-left crmd[11763]:   notice: State transition S_ELECTION 
-> S_INTEGRATION
Apr 13 23:57:00 d18-3-left crmd[11763]:  warning: Input I_ELECTION_DC received 
in state S_INTEGRATION from do_election_check

and fences the node:
Apr 13 23:57:01 d18-3-left pengine[11762]:  warning: Scheduling Node 
d18-3-right.lab.archivas.com for STONITH
Apr 13 23:57:01 d18-3-left pengine[11762]:   notice:  * Fence (reboot) 
d18-3-right.lab.archivas.com 'node is unclean'

5. After this the node that was fenced comes up again and joins the cluster 
without any issues.

Any idea on what is going on here?
Thanks,
_Vitaly
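
The I_DC_TIMEOUT input is raised when the controller's election timer pops 
before a DC has been elected; that timer is the dc-deadtime cluster property 
(20 seconds by default). If one controller wins the election while its peer is 
still starting, the peer can be declared unclean and scheduled for fencing. A 
hedged sketch of inspecting and raising the property (the 2min value only 
illustrates the syntax and may or may not avoid the fence in this scenario):

# Show the current value (unset means the 20s default)
crm_attribute --type crm_config --name dc-deadtime --query

# Give a slow-booting peer more time before the election gives up on it
crm_attribute --type crm_config --name dc-deadtime --update 2min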


Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-24 Thread vitaly
Ken, 
Thank you very much for your help!
1.1.15-3 seems to satisfy our need and needed very few fixes to build on CentOS 
8. 
We will go ahead with that version.
Thanks again!
_Vitaly

> On November 23, 2021 6:03 PM Ken Gaillot  wrote:
> 
>  
> On Tue, 2021-11-23 at 17:36 -0500, vitaly wrote:
> > Thank you!
> > I understand the purpose. We did not hit the problem with this issue
> > until there was some failure during upgrade at the customer site and
> > the old node died. New one came up and old node was never able to
> > join until we killed new one and started old in the single node mode.
> 
> Once an older node leaves the cluster, the best course of action would
> be to upgrade it before trying to have it rejoin.
> 
> > My question about rpms was related to pacemaker vs corosync rpms. I
> > guess that crm_feature_set is defined in pacemaker rpms. 
> > I understand that all rpms built for pacemaker have to be same
> > version as should all rpms built for corosync.
> > If 1.1.15 uses 3.0.10 I will try 1.1.15 then.
> 
> That would let you run 1.1.13 and 1.1.15 nodes indefinitely without any
> serious issues. However trying to upgrade past 1.1.15 would put you in
> the same situation -- if the 1.1.15 node leaves the cluster, it can't
> rejoin until it's upgraded to the newer version.
> 
> > Thank you very much for your help!
> > _Vitaly
> > 
> > > On November 23, 2021 5:12 PM Ken Gaillot 
> > > wrote:
> > > 
> > >  
> > > On Tue, 2021-11-23 at 14:11 -0500, vitaly wrote:
> > > > Hello,
> > > > I am working on the upgrade from older version of
> > > > pacemaker/corosync
> > > > to the current one. In the interim we need to sync newly
> > > > installed
> > > > node with the node running old software. Our old node uses
> > > > pacemaker
> > > > 1.1.13-3.fc22 and corosync 2.3.5-1.fc22 and has crm_feature_set
> > > > 3.0.10.
> > > > 
> > > > For interim sync I used pacemaker 1.1.18-2.fc28 and corosync
> > > > 2.4.4-
> > > > 1.fc28. This version is using crm_feature_set 3.0.14. 
> > > > This version is working fine, but it has issues in some edge
> > > > cases,
> > > > like when the new node starts alone and then the old one tries to
> > > > join.
> > > 
> > > That's the intended behavior of mixed-version clusters -- once an
> > > older
> > > node leaves the cluster, it can't rejoin without being upgraded.
> > > This
> > > allows new features to become available once all older nodes are
> > > gone.
> > > 
> > > Mixed-version clusters should only be used in a rolling upgrade,
> > > i.e.
> > > upgrading each node in turn and returning it to the cluster.
> > > 
> > > > So I need to rebuild rpms for crm_feature_set 3.0.10. This will
> > > > be
> > > > used just once and then it will be upgraded to the latest
> > > > versions of
> > > > pacemaker and corosync.
> > > > 
> > > > Now, couple of questions:
> > > > 1. Which rpm defines crm_feature_set?
> > > 
> > > The feature set applies to all RPMs of a particular version. You
> > > can't
> > > mix and match RPMs from different versions.
> > > 
> > > > 2. Which version of this rpm has crm_feature_set 3.0.10?
> > > 
> > > The feature set of each released version can be seen at:
> > > 
> > > https://wiki.clusterlabs.org/wiki/ReleaseCalendar
> > > 
> > > 1.1.13 through 1.1.15 had feature set 3.0.10
> > > 
> > > > 3. Where could I get source rpms to rebuild this rpm on CentOs 8?
> > > > Thanks a lot!
> > > > _Vitaly Zolotusky
> > > 
> > > The stock packages in the repos should be fine. All newer versions
> > > support rolling upgrades from 1.1.13.
> > > 
> > > -- 
> > > Ken Gaillot 
> -- 
> Ken Gaillot 


Re: [ClusterLabs] Antw: [EXT] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-24 Thread Vitaly Zolotusky
Ulrich, 
Yes, Fedora is far ahead of 22, but our product has been in the field for quite a 
few years. Later versions run on Fedora 28, but now we have moved on to CentOS. 
Upgrades have always happened online with no service interruption.
The problem with upgrading to the new version of Corosync is that it does not 
talk to the old one.
Now that we need to replace Pacemaker, Corosync, and Postgres, we need one brief 
interruption for that update; but before we update Pacemaker we have to sync the 
new nodes with the old ones to keep the cluster running until all nodes are 
ready. 
So what we do is: we install the new nodes with the old versions of Pacemaker, 
Corosync, and Postgres, run them in a mixed (old/new) configuration until all 
nodes are ready for the new versions, and then we shut down the cluster for a 
couple of minutes and upgrade Pacemaker, Corosync, and Postgres. 
The only reason we shut down the cluster now is that the old Corosync does not 
talk to the new Corosync.
_Vitaly

> On November 24, 2021 2:22 AM Ulrich Windl  
> wrote:
> 
>  
> >>> vitaly  schrieb am 23.11.2021 um 20:11 in Nachricht
> <45677632.67420.1637694706...@webmail6.networksolutionsemail.com>:
> > Hello,
> > I am working on the upgrade from older version of pacemaker/corosync to the
> 
> > current one. In the interim we need to sync newly installed node with the 
> > node running old software. Our old node uses pacemaker 1.1.13‑3.fc22 and 
> > corosync 2.3.5‑1.fc22 and has crm_feature_set 3.0.10.
> > 
> > For interim sync I used pacemaker 1.1.18‑2.fc28 and corosync 2.4.4‑1.fc28. 
> > This version is using crm_feature_set 3.0.14. 
> > This version is working fine, but it has issues in some edge cases, like 
> > when the new node starts alone and then the old one tries to join.
> 
> What I'm wondering (not wearing a red hat): Isn't Fedora at something like 33
> or 34 right now?
> If so, why bother with such old versions?
> 
> > 
> > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just
> 
> > once and then it will be upgraded to the latest versions of pacemaker and 
> > corosync.
> > 
> > Now, couple of questions:
> > 1. Which rpm defines crm_feature_set?
> > 2. Which version of this rpm has crm_feature_set 3.0.10?
> > 3. Where could I get source rpms to rebuild this rpm on CentOs 8?
> > Thanks a lot!
> > _Vitaly Zolotusky


Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-23 Thread vitaly
Thank you!
I understand the purpose. We did not hit this problem until there was some 
failure during an upgrade at a customer site and the old node died. The new one 
came up, and the old node was never able to join until we killed the new one and 
started the old one in single-node mode.
My question about RPMs was about Pacemaker vs. Corosync RPMs. I guess that 
crm_feature_set is defined by the Pacemaker RPMs. 
I understand that all RPMs built for Pacemaker have to be the same version, as 
should all RPMs built for Corosync.
If 1.1.15 uses 3.0.10, I will try 1.1.15 then.
Thank you very much for your help!
_Vitaly

> On November 23, 2021 5:12 PM Ken Gaillot  wrote:
> 
>  
> On Tue, 2021-11-23 at 14:11 -0500, vitaly wrote:
> > Hello,
> > I am working on the upgrade from older version of pacemaker/corosync
> > to the current one. In the interim we need to sync newly installed
> > node with the node running old software. Our old node uses pacemaker
> > 1.1.13-3.fc22 and corosync 2.3.5-1.fc22 and has crm_feature_set
> > 3.0.10.
> > 
> > For interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-
> > 1.fc28. This version is using crm_feature_set 3.0.14. 
> > This version is working fine, but it has issues in some edge cases,
> > like when the new node starts alone and then the old one tries to
> > join.
> 
> That's the intended behavior of mixed-version clusters -- once an older
> node leaves the cluster, it can't rejoin without being upgraded. This
> allows new features to become available once all older nodes are gone.
> 
> Mixed-version clusters should only be used in a rolling upgrade, i.e.
> upgrading each node in turn and returning it to the cluster.
> 
> > 
> > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be
> > used just once and then it will be upgraded to the latest versions of
> > pacemaker and corosync.
> > 
> > Now, couple of questions:
> > 1. Which rpm defines crm_feature_set?
> 
> The feature set applies to all RPMs of a particular version. You can't
> mix and match RPMs from different versions.
> 
> > 2. Which version of this rpm has crm_feature_set 3.0.10?
> 
> The feature set of each released version can be seen at:
> 
> https://wiki.clusterlabs.org/wiki/ReleaseCalendar
> 
> 1.1.13 through 1.1.15 had feature set 3.0.10
> 
> > 3. Where could I get source rpms to rebuild this rpm on CentOs 8?
> > Thanks a lot!
> > _Vitaly Zolotusky
> 
> The stock packages in the repos should be fine. All newer versions
> support rolling upgrades from 1.1.13.
> 
> > 
> -- 
> Ken Gaillot 


Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-23 Thread vitaly
Thank you!
With some minor fixes I was able to build it on CentOS 8. Tomorrow I will test 
whether it works and gives me the same feature set.
Which side is responsible for crm_feature_set? Pacemaker or Corosync?
Thank you very much for your help!
_Vitaly

> On November 23, 2021 2:49 PM Strahil Nikolov  wrote:
> 
> 
> Have you tried with a Fedora package from the archives?
> I found 
> https://dl.fedoraproject.org/pub/archive/fedora/linux/releases/23/Everything/x86_64/os/Packages/p/pacemaker-1.1.13-3.fc23.x86_64.rpm
>  & 
> https://dl.fedoraproject.org/pub/archive/fedora/linux/releases/23/Everything/x86_64/os/Packages/c/corosync-2.3.5-1.fc23.x86_64.rpm
>  which theoretically should be close enough.
> 
> 
> P.S.: I couldn't find those versions for Fedora 22, but they seem available 
> for F23.
>  
> Best Regards,
> Strahil Nikolov
> 
> On Tuesday, 23 November 2021 at 21:11:58 GMT+2, vitaly wrote:
> 
> 
> Hello,
> I am working on the upgrade from older version of pacemaker/corosync to the 
> current one. In the interim we need to sync newly installed node with the 
> node running old software. Our old node uses pacemaker 1.1.13-3.fc22 and 
> corosync 2.3.5-1.fc22 and has crm_feature_set 3.0.10.
> 
> For interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. 
> This version is using crm_feature_set 3.0.14. 
> This version is working fine, but it has issues in some edge cases, like when 
> the new node starts alone and then the old one tries to join.
> 
> So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just 
> once and then it will be upgraded to the latest versions of pacemaker and 
> corosync.
> 
> Now, couple of questions:
> 1. Which rpm defines crm_feature_set?
> 2. Which version of this rpm has crm_feature_set 3.0.10?
> 3. Where could I get source rpms to rebuild this rpm on CentOs 8?
> Thanks a lot!
> _Vitaly Zolotusky


[ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-23 Thread vitaly
Hello,
I am working on the upgrade from older version of pacemaker/corosync to the 
current one. In the interim we need to sync newly installed node with the node 
running old software. Our old node uses pacemaker 1.1.13-3.fc22 and corosync 
2.3.5-1.fc22 and has crm_feature_set 3.0.10.

For interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. This 
version is using crm_feature_set 3.0.14. 
This version is working fine, but it has issues in some edge cases, like when 
the new node starts alone and then the old one tries to join.

So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just 
once and then it will be upgraded to the latest versions of pacemaker and 
corosync.

Now, couple of questions:
1. Which rpm defines crm_feature_set?
2. Which version of this rpm has crm_feature_set 3.0.10?
3. Where could I get source rpms to rebuild this rpm on CentOs 8?
Thanks a lot!
_Vitaly Zolotusky
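
For question 3, the replies above point at the Fedora archive; one hedged way to 
get matching packages onto CentOS 8 is to rebuild the corresponding source RPM 
rather than installing the Fedora binaries directly. In the sketch below, 
<pacemaker.src.rpm> is a placeholder for whichever 1.1.13-1.1.15 source RPM is 
chosen:

# On the CentOS 8 build host
dnf install -y rpm-build dnf-plugins-core
dnf builddep -y <pacemaker.src.rpm>
rpmbuild --rebuild <pacemaker.src.rpm>
# Rebuilt packages land under ~/rpmbuild/RPMS/

After installing the rebuilt packages, pacemakerd --features should report 
feature set 3.0.10 before the node is allowed to join the old cluster.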


Re: [ClusterLabs] Upgrading/downgrading cluster configuration

2020-10-22 Thread Vitaly Zolotusky
Thanks for the reply.
I do see a backup config command in pcs, but not in crmsh. 
What would that be in crmsh? Would something like this work after Corosync has 
started in the state with all resources inactive? I'll try this: 

crm configure save 
crm configure load 

Thank you!
_Vitaly Zolotusky
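
For reference, both tools can snapshot and restore the configuration; a hedged 
sketch of the rough equivalents follows (file names are placeholders, and a 
configuration saved by a newer Pacemaker may still fail validation when loaded 
into an older one):

# pcs: back up and restore the whole cluster configuration (creates a tarball)
pcs config backup /root/cluster-config-old
pcs config restore /root/cluster-config-old.tar.bz2

# crmsh: dump and reload the CIB configuration section
crm configure save /root/cib-old.crm
crm configure load replace /root/cib-old.crm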

> On October 22, 2020 1:54 PM Strahil Nikolov  wrote:
> 
>  
> Have you tried to backup the config via crmsh/pcs and when you downgrade to 
> restore from it ?
> 
> Best Regards,
> Strahil Nikolov
> 
> 
> 
> 
> 
> 
> On Thursday, 22 October 2020 at 15:40:43 GMT+3, Vitaly Zolotusky wrote:
> 
> 
> 
> 
> 
> Hello,
> We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our 
> procedure includes an upgrade step where we stop the cluster, replace the RPMs, 
> and restart the cluster. The upgrade works fine, but we also need to implement 
> a rollback in case something goes wrong. 
> When we roll back and reload the old RPMs, the cluster says that there are no 
> active resources. It looks like there is a problem with the cluster 
> configuration version.
> Here is output of the crm_mon:
> 
> d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
> Stack: corosync
> Current DC: NONE
> Last updated: Thu Oct 22 12:39:37 2020
> Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on 
> d21-22-left.lab.archivas.com
> 
> 2 nodes configured
> 15 resources configured
> 
> Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
> Node d21-22-right.lab.archivas.com: UNCLEAN (offline)
> 
> No active resources
> 
> 
> Node Attributes:
> 
> ***
> What would be the best way to implement a downgrade of the configuration? 
> Should we just change the crm feature set, or do we need to rebuild the whole config?
> Thanks!
> _Vitaly Zolotusky


[ClusterLabs] Upgrading/downgrading cluster configuration

2020-10-22 Thread Vitaly Zolotusky
Hello,
We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our 
procedure includes an upgrade step where we stop the cluster, replace the RPMs, 
and restart the cluster. The upgrade works fine, but we also need to implement a 
rollback in case something goes wrong. 
When we roll back and reload the old RPMs, the cluster says that there are no 
active resources. It looks like there is a problem with the cluster 
configuration version.
Here is output of the crm_mon:

d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
Stack: corosync
Current DC: NONE
Last updated: Thu Oct 22 12:39:37 2020
Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on 
d21-22-left.lab.archivas.com

2 nodes configured
15 resources configured

Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
Node d21-22-right.lab.archivas.com: UNCLEAN (offline)

No active resources


Node Attributes:

***
What would be the best way to implement a downgrade of the configuration? Should 
we just change the crm feature set, or do we need to rebuild the whole config?
Thanks!
_Vitaly Zolotusky


Re: [ClusterLabs] Antw: [EXT] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-12 Thread Vitaly Zolotusky
Hello,
This is exactly what we do in some of our software. We have a messaging version 
number and can negotiate the messaging protocol between nodes. Communication 
then happens on the highest common version.
It would be great to have something like that available in Corosync!
Thanks,
_Vitaly

> On June 12, 2020 3:40 AM Ulrich Windl  
> wrote:
> 
>  
> Hi!
> 
> I can't help here, but in general I think corosync should support "upgrade"
> mode. Maybe like this:
> The newer version can also speak the previous protocol and the current
> protocol will be enabled only after all nodes in the cluster are upgraded.
> 
> Probably this would require a triple version number field like (oldest
> version supported, version being requested/used, newest version supported).
> 
> For the API, a query about the latest commonly agreed version number and a
> request to use a different version number would be needed, too.
> 
> Regards,
> Ulrich
> 
> >>> Vitaly Zolotusky  schrieb am 11.06.2020 um 04:14 in
> Nachricht
> <19881_1591841678_5EE1938D_19881_553_1_1163034878.247559.1591841668387@webmail6.
> etworksolutionsemail.com>:
> > Hello everybody.
> > We are trying to do a rolling upgrade from Corosync 2.3.5‑1 to Corosync 
> > 2.99+. It looks like they are not compatible and we are getting messages 
> > like:
> > Jun 11 02:10:20 d21‑22‑left corosync[6349]:   [TOTEM ] Message received from
> 
> > 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+)..
> Ignoring
> > on the upgraded node and 
> > Jun 11 01:02:37 d21‑22‑right corosync[14912]:   [TOTEM ] Invalid packet
> data
> > Jun 11 01:02:38 d21‑22‑right corosync[14912]:   [TOTEM ] Incoming packet has
> 
> > different crypto type. Rejecting
> > Jun 11 01:02:38 d21‑22‑right corosync[14912]:   [TOTEM ] Received message
> has 
> > invalid digest... ignoring.
> > on the pre‑upgrade node.
> > 
> > Is there a good way to do this upgrade? 
> > I would appreciate it very much if you could point me to any documentation 
> > or articles on this issue.
> > Thank you very much!
> > _Vitaly


Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Hello, Strahil.
Thanks for your suggestion. 
We are doing something similar to what you suggest, but:
1. We do not have external storage. Our product is a single box with 2 internal 
heads and 10-14 PB of data (or it could have 9 boxes hooked up together, but 
still with just 2 heads and 9 times more storage). 
2. Setting up a new cluster is kind of hard. We do that on extra partitions in a 
chroot while the old cluster is running, so the shutdown should be pretty short 
if we can figure out a way for the cluster to keep working while we configure 
the new partitions.
3. At this time we have to stop a node to move the configuration from the old to 
the new partition, initialize new databases, etc. While we are doing that, the 
other node takes over all processing.
We will see if we can incorporate your suggestion into our upgrade path.
Thanks a lot for your help!
_Vitaly

 
> On June 11, 2020 12:00 PM Strahil Nikolov  wrote:
> 
>  
> Hi Vitaly,
> 
> have you considered something like this:
> 1. Set up a new cluster
> 2. Present the same shared storage on the new cluster
> 3. Prepare the resource configuration but do not apply it yet.
> 4. Power down all resources on the old cluster
> 5. Deploy the resources on the new cluster and immediately bring the
> resources up
> 6. Remove access to the shared storage for the old cluster
> 7. Wipe the old cluster.
> 
> Downtime will be way shorter.
> 
> Best Regards,
> Strahil Nikolov
> 
> On 11 June 2020 at 17:48:47 GMT+03:00, Vitaly Zolotusky wrote:
> >Thank you very much for quick reply!
> >I will try to either build new version on Fedora 22, or build the old
> >version on CentOs 8 and do a HA stack upgrade separately from my full
> >product/OS upgrade. A lot of my customers would be extremely unhappy
> >with even short downtime, so I can't really do the full upgrade
> >offline.
> >Thanks again!
> >_Vitaly
> >
> >> On June 11, 2020 10:14 AM Jan Friesse  wrote:
> >> 
> >>  
> >> > Thank you very much for your help!
> >> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that
> >it may work with rolling upgrade (we were fooled by the same major
> >version (2)). Our fresh install works fine on V3.0.3-5.
> >> > Do you know if it is possible to build Pacemaker 3.0.3-5 and
> >Corosync 2.0.3 on Fedora 22 so that I 
> >> 
> >> Good question. Fedora 22 is quite old but close to RHEL 7 for which
> >we 
> >> build packages automatically (https://kronosnet.org/builds/) so it 
> >> should be possible. But you are really on your own, because I don't 
> >> think anybody ever tried it.
> >> 
> >> Regards,
> >>Honza
> >> 
> >> 
> >> 
> >> upgrade the stack before starting "real" upgrade of the product?
> >> > Then I can do the following sequence:
> >> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version
> >> > 2. start HA stack on the old OS and product version with Pacemaker
> >3.0.3 and bring the product online
> >> > 3. start rolling upgrade for product upgrade to the new OS and
> >product version
> >> > Thanks again for your help!
> >> > _Vitaly
> >> > 
> >> >> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> >> >>
> >> >>   
> >> >> Vitaly,
> >> >>
> >> >>> Hello everybody.
> >> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to
> >Corosync 2.99+. It looks like they are not compatible and we are
> >getting messages like:
> >> >>
> >> >> Yes, they are not wire compatible. Also please do not use 2.99
> >versions,
> >> >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a
> >long time
> >> >> released (3.0.4 is latest and I would recommend using it - there
> >were
> >> >> quite a few important bugfixes between 3.0.0 and 3.0.4)
> >> >>
> >> >>
> >> >>> Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message
> >received from 172.18.52.44 has bad magic number (probably sent by
> >Corosync 2.3+).. Ignoring
> >> >>> on the upgraded node and
> >> >>> Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid
> >packet data
> >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming
> >packet has different crypto type. Rejecting
> >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received
&

Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Thank you very much for the quick reply!
I will try either to build the new version on Fedora 22, or to build the old 
version on CentOS 8, and do an HA stack upgrade separately from my full 
product/OS upgrade. A lot of my customers would be extremely unhappy with even a 
short downtime, so I can't really do the full upgrade offline.
Thanks again!
_Vitaly

> On June 11, 2020 10:14 AM Jan Friesse  wrote:
> 
>  
> > Thank you very much for your help!
> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that it may 
> > work with rolling upgrade (we were fooled by the same major version (2)). 
> > Our fresh install works fine on V3.0.3-5.
> > Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 
> > on Fedora 22 so that I 
> 
> Good question. Fedora 22 is quite old but close to RHEL 7 for which we 
> build packages automatically (https://kronosnet.org/builds/) so it 
> should be possible. But you are really on your own, because I don't 
> think anybody ever tried it.
> 
> Regards,
>Honza
> 
> 
> 
> upgrade the stack before starting "real" upgrade of the product?
> > Then I can do the following sequence:
> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version
> > 2. start HA stack on the old OS and product version with Pacemaker 3.0.3 
> > and bring the product online
> > 3. start rolling upgrade for product upgrade to the new OS and product 
> > version
> > Thanks again for your help!
> > _Vitaly
> > 
> >> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> >>
> >>   
> >> Vitaly,
> >>
> >>> Hello everybody.
> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 
> >>> 2.99+. It looks like they are not compatible and we are getting messages 
> >>> like:
> >>
> >> Yes, they are not wire compatible. Also please do not use 2.99 versions,
> >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a long time
> >> released (3.0.4 is latest and I would recommend using it - there were
> >> quite a few important bugfixes between 3.0.0 and 3.0.4)
> >>
> >>
> >>> Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received 
> >>> from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. 
> >>> Ignoring
> >>> on the upgraded node and
> >>> Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet 
> >>> data
> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet 
> >>> has different crypto type. Rejecting
> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message 
> >>> has invalid digest... ignoring.
> >>> on the pre-upgrade node.
> >>>
> >>> Is there a good way to do this upgrade?
> >>
> >> Usually best way is to start from scratch in testing environment to make
> >> sure everything works as expected. Then you can shutdown current
> >> cluster, upgrade and start it again - config file is mostly compatible,
> >> you may just consider changing transport to knet. I don't think there is
> >> any definitive guide to do upgrade without shutting down whole cluster,
> >> but somebody else may have idea.
> >>
> >> Regards,
> >> Honza
> >>
> >>> I would appreciate it very much if you could point me to any 
> >>> documentation or articles on this issue.
> >>> Thank you very much!
> >>> _Vitaly


Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Thank you very much for your help!
We did try to go to V3.0.3-5 and then dropped to 2.99 in the hope that it might 
work with a rolling upgrade (we were fooled by the same major version (2)). Our 
fresh install works fine on V3.0.3-5.
Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 on 
Fedora 22, so that I can upgrade the stack before starting the "real" upgrade of 
the product? 
Then I can do the following sequence:
1. "quick" full shutdown for HA stack upgrade to 3.0 version
2. start HA stack on the old OS and product version with Pacemaker 3.0.3 and 
bring the product online
3. start rolling upgrade for product upgrade to the new OS and product version
Thanks again for your help!
_Vitaly
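
Step 1 of that sequence amounts to a short full-cluster outage; a hedged sketch 
of what it could look like, assuming pcs is available on the nodes and the new 
packages come from the product's own repository:

# On one node: stop Pacemaker/Corosync cluster-wide
pcs cluster stop --all

# On every node: replace the HA stack packages
dnf upgrade -y corosync pacemaker

# Back on one node: start the cluster again and check that it settles
pcs cluster start --all
crm_mon -1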

> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> 
>  
> Vitaly,
> 
> > Hello everybody.
> > We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 
> > 2.99+. It looks like they are not compatible and we are getting messages 
> > like:
> 
> Yes, they are not wire compatible. Also please do not use 2.99 versions, 
> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a long time 
> released (3.0.4 is latest and I would recommend using it - there were 
> quite a few important bugfixes between 3.0.0 and 3.0.4)
> 
> 
> > Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received 
> > from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. 
> > Ignoring
> > on the upgraded node and
> > Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet 
> > has different crypto type. Rejecting
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message 
> > has invalid digest... ignoring.
> > on the pre-upgrade node.
> > 
> > Is there a good way to do this upgrade?
> 
> Usually best way is to start from scratch in testing environment to make 
> sure everything works as expected. Then you can shutdown current 
> cluster, upgrade and start it again - config file is mostly compatible, 
> you may just consider changing transport to knet. I don't think there is 
> any definitive guide to do upgrade without shutting down whole cluster, 
> but somebody else may have idea.
> 
> Regards,
>Honza
> 
> > I would appreciate it very much if you could point me to any documentation 
> > or articles on this issue.
> > Thank you very much!
> > _Vitaly
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> >
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-10 Thread Vitaly Zolotusky
Hello everybody.
We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 2.99+. 
It looks like they are not compatible and we are getting messages like:
Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received from 
172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring
on the upgraded node and 
Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet has 
different crypto type. Rejecting
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message has 
invalid digest... ignoring.
on the pre-upgrade node.
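
(For reference, the versions on each side can be confirmed with something like
the following; nothing exotic, just for completeness:)

corosync -v                # prints the version of the installed corosync binary
rpm -q corosync pacemaker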

Is there a good way to do this upgrade? 
I would appreciate it very much if you could point me to any documentation or 
articles on this issue.
Thank you very much!
_Vitaly
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence_sbd script in Fedora30?

2019-09-24 Thread Vitaly Zolotusky
Thank you everybody for the quick reply!
I was a little confused that the fence_sbd script was removed from the sbd
package. I guess it now lives only in fence-agents.
Also, I was looking for some guidance on the (new to me) parameters of
fence_sbd, but I think I have figured that out. Another problem is that we
modify the scripts to work with our hardware, and I am in the process of going
through those changes.
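
(For anyone else hunting for the parameter list: the packaged agent describes
itself, and a minimal resource definition would look roughly like the line
below -- the device path is only an example, and pcs can be swapped for your
configuration tool of choice:)

fence_sbd -o metadata | less
pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/my-shared-lun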
Thanks again!
_Vitaly

> On September 24, 2019 at 12:29 AM Andrei Borzenkov wrote:
> 
> 
> 23.09.2019 23:23, Vitaly Zolotusky пишет:
> > Hello,
> > I am trying to upgrade to Fedora 30. The platform is two node cluster with 
> > pacemaker.
> > It Fedora 28 we were using old fence_sbd script from 2013:
> > 
> > # This STONITH script drives the shared-storage stonith plugin.
> > # Copyright (C) 2013 Lars Marowsky-Bree 
> > 
> > We were overwriting the distribution script in custom built RPM with the 
> > one from 2013.
> > It looks like there is no fence_sbd script any more in the agents source 
> > and some apis changed so that the old script would not work.
> > Do you have any documentation / suggestions on how to move from old 
> > fence_sbd script to the latest?
> 
> What's wrong with external/sbd stonith resource?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Fence_sbd script in Fedora30?

2019-09-23 Thread Vitaly Zolotusky
Hello,
I am trying to upgrade to Fedora 30. The platform is a two-node cluster with
Pacemaker.
In Fedora 28 we were using the old fence_sbd script from 2013:

# This STONITH script drives the shared-storage stonith plugin.
# Copyright (C) 2013 Lars Marowsky-Bree 

We were overwriting the distribution script in a custom-built RPM with the one
from 2013.
It looks like the fence_sbd script is no longer in the agents source, and some
APIs changed, so the old script would not work.
Do you have any documentation or suggestions on how to move from the old
fence_sbd script to the latest one?
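
(A quick way to see which package ships the script now -- the package name is my
guess and worth double-checking on Fedora 30:)

dnf provides '*/fence_sbd'
rpm -ql fence-agents-sbd | grep fence_sbd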
Thank you very much!
_Vitaly
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-18 Thread Vitaly Zolotusky
Chris,
Thanks a lot for the info. I'll explore both options.
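
(For the first option, dc-deadtime is just a cluster property, so the change
should be a one-liner -- the value below is only a placeholder to tune:)

crm_attribute --type crm_config --name dc-deadtime --update 120s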
_Vitaly

> On December 18, 2018 at 11:13 AM Chris Walker  wrote:
> 
> 
> Looks like rhino66-left was scheduled for fencing because it was not present
> within 20 seconds (the dc-deadtime parameter) of rhino66-right starting
> Pacemaker (startup fencing).  I can think of a couple of ways to allow all
> nodes to survive if they come up far apart in time (i.e., farther apart than
> dc-deadtime):
> 
> 1.  Increase dc-deadtime.  Unfortunately, the cluster always waits for 
> dc-deadtime to expire before starting resources, so this can delay your 
> cluster's startup.
> 
> 2.  As Ken mentioned, synchronize the starting of Corosync and Pacemaker.  I 
> did this with a simple ExecStartPre systemd script:
> 
> [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root@bug0 ~]#
> 
> where ha_wait.sh has something like:
> 
> #!/bin/bash
> 
> timeout=600
> 
> peer=    # set to the partner node's hostname (the value was stripped by the archive)
> 
> echo "Waiting for ${peer}"
> peerup() {
>   systemctl -H ${peer} show -p ActiveState corosync.service 2> /dev/null | \
> egrep -q "=active|=reloading|=failed|=activating|=deactivating" && return > 0
>   return 1
> }
> 
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>   echo -n .
>   sleep 5
> done
> 
> peerup && echo "${peer} is up starting HA" || echo "${peer} not up after 
> ${timeout} starting HA alone"
> 
> 
> This will cause corosync startup to block for 10 minutes waiting for the 
> partner node to come up, after which both nodes will start corosync/pacemaker 
> close in time.  If one node never comes up, then it will wait 10 minutes 
> before starting, after which the other node will be fenced (startup fencing 
> and subsequent resource startup will only occur if 
> no-quorum-policy is set to ignore)
> 
> HTH,
> 
> Chris
> 
> On 12/17/18 6:25 PM, Vitaly Zolotusky wrote:
> 
> Ken, Thank you very much for quick response!
> I do have "two_node: 1" in the corosync.conf. I have attached it to this 
> email (not from the same system as original messages, but they are all the 
> same).
> Syncing startup of corosync and pacemaker on different nodes would be a 
> problem for us.
> I suspect that the problem is that corosync assumes quorum is reached as soon 
> as corosync is started on both nodes, but pacemaker does not abort fencing 
> until pacemaker starts on the other node.
> 
> I will try to work around this issue by moving corosync and pacemaker 
> startups on single node as close to each other as possible.
> Thanks again!
> _Vitaly
> 
> 
> 
> On December 17, 2018 at 6:01 PM Ken Gaillot wrote:
> 
> 
> On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> 
> 
> Hello,
> I have a 2 node cluster and stonith is configured for SBD and
> fence_ipmilan.
> fence_ipmilan for node 1 is configured for 0 delay and for node 2 for
> 30 sec delay so that nodes do not start killing each other during
> startup.
> 
> 
> 
> If you're using corosync 2 or later, you can set "two_node: 1" in
> corosync.conf. That implies the wait_for_all option, so that at start-
> up, both nodes must be present before quorum can be reached the first
> time. (After that point, one node can go away and quorum will be
> retained.)
> 
> Another way to avoid this is to start corosync on all nodes, then start
> pacemaker on all nodes.
> 
> 
> 
> In some cases (usually right after installation and when node 1 comes
> up first and node 2 second) the node that comes up first (node 1)
> states that node 2 is unclean, but can't fence it until quorum
> reached.
> Then as soon as quorum is reached after startup of corosync on node 2
> it sends a fence request for node 2.
> Fence_ipmilan gets into 30 sec delay.
> Pacemaker gets started on node 2.
> While fence_ipmilan is still waiting for the delay node 1 crmd aborts
> transition that requested the fence.
> Even though the transition was aborted, when delay time expires node
> 2 gets fenced.
> 
> 
> 
> Currently, pacemaker has no way of cancelling fencing once it's been
> initiated. Technically, it would be possible to cancel an operation
> that's in the delay stage (assuming that no other fence device has
> already been attempted, if there are more than one), but that hasn't
> been implemented.
> 
> 
> 
> Excerpts from messages are below. I also attached messages from both
> nodes and pe-input files from node 1.

Re: [ClusterLabs] Antw: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-18 Thread Vitaly Zolotusky
Ulrich,
Thank you very much for the suggestion.
My guess is that node 2 is considered unclean because both nodes were rebooted
without Pacemaker's knowledge after installation. For our appliance that should
be OK, as we are supposed to survive multiple hardware and power failures. So
this case is just an indication that something is not working right, and we
would like to get to the bottom of it.
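
(If it helps, the attached pe-input files can be replayed to see exactly why the
fence was scheduled -- the file name below is just a placeholder:)

crm_simulate --simulate --show-scores --xml-file pe-input-42.bz2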
Thanks again!
_Vitaly

> On December 18, 2018 at 1:47 AM Ulrich Windl wrote:
> 
> 
> >>> Vitaly Zolotusky wrote on 17.12.2018 at 21:43 in message
> <1782126841.215210.1545079428...@webmail6.networksolutionsemail.com>:
> > Hello,
> > I have a 2 node cluster and stonith is configured for SBD and fence_ipmilan.
> > fence_ipmilan for node 1 is configured for 0 delay and for node 2 for 30 
> > sec 
> > delay so that nodes do not start killing each other during startup.
> > In some cases (usually right after installation and when node 1 comes up 
> > first and node 2 second) the node that comes up first (node 1) states that 
> > node 2 is unclean, but can't fence it until quorum reached. 
> 
> I'd concentrate on examining why node2 is considered unclean. Of course that 
> doesn't fix the issue, but if fixing it takes some time, you'll have a 
> work-around ;-)
> 
> > Then as soon as quorum is reached after startup of corosync on node 2 it 
> > sends a fence request for node 2. 
> > Fence_ipmilan gets into 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting for the delay node 1 crmd aborts 
> > transition that requested the fence.
> > Even though the transition was aborted, when delay time expires node 2 gets 
> > fenced.
> > Excerpts from messages are below. I also attached messages from both nodes 
> > and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> > 
> > Here are excerpts from the messages:
> > 
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up first  
> > *
> > 
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and 
> > resource 
> > management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node 
> > rhino66-left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence 
> > unclean 
> > nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new membership 
> > (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for all 
> > cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is 
> > within 
> > the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]: 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed service 
> > synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a 
> > node 
> > name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a 
> > node 
> > name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null) state is 
> > now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node (null) 
> > state 
> > is now member
> > Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: State transition 
> > S_IDLE 
> >-> S_POLICY_ENGINE
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice: Watchdog will be 
> > used via SBD if fencing is required
> > Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Scheduling Node 
> > rhino66-left.lab.archivas.com for STONITH
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Fence (reboot) 
> > rhino66-left.lab.archivas.com 'node is unclean'
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start  

Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-17 Thread Vitaly Zolotusky
Ken, thank you very much for the quick response!
I do have "two_node: 1" in the corosync.conf. I have attached it to this email
(not from the same system as the original messages, but they are all the same).
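
(Since attachments do not make it into the archive, the relevant quorum section
looks roughly like this -- a sketch, not the exact file:)

grep -A3 '^quorum' /etc/corosync/corosync.conf
quorum {
    provider: corosync_votequorum
    two_node: 1            # implies wait_for_all unless that is overridden
}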
Syncing startup of corosync and pacemaker on different nodes would be a problem 
for us.
I suspect that the problem is that corosync assumes quorum is reached as soon 
as corosync is started on both nodes, but pacemaker does not abort fencing 
until pacemaker starts on the other node.

I will try to work around this issue by moving the corosync and pacemaker
startups on a single node as close to each other as possible.
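
(In practice the workaround amounts to starting the two services back to back on
each node, something like:)

systemctl start corosync && systemctl start pacemaker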
Thanks again!
_Vitaly

> On December 17, 2018 at 6:01 PM Ken Gaillot  wrote:
> 
> 
> On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > Hello,
> > I have a 2 node cluster and stonith is configured for SBD and
> > fence_ipmilan.
> > fence_ipmilan for node 1 is configured for 0 delay and for node 2 for
> > 30 sec delay so that nodes do not start killing each other during
> > startup.
> 
> If you're using corosync 2 or later, you can set "two_node: 1" in
> corosync.conf. That implies the wait_for_all option, so that at start-
> up, both nodes must be present before quorum can be reached the first
> time. (After that point, one node can go away and quorum will be
> retained.)
> 
> Another way to avoid this is to start corosync on all nodes, then start
> pacemaker on all nodes.
> 
> > In some cases (usually right after installation and when node 1 comes
> > up first and node 2 second) the node that comes up first (node 1)
> > states that node 2 is unclean, but can't fence it until quorum
> > reached. 
> > Then as soon as quorum is reached after startup of corosync on node 2
> > it sends a fence request for node 2. 
> > Fence_ipmilan gets into 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting for the delay node 1 crmd aborts
> > transition that requested the fence.
> > Even though the transition was aborted, when delay time expires node
> > 2 gets fenced.
> 
> Currently, pacemaker has no way of cancelling fencing once it's been
> initiated. Technically, it would be possible to cancel an operation
> that's in the delay stage (assuming that no other fence device has
> already been attempted, if there are more than one), but that hasn't
> been implemented.
> 
> > Excerpts from messages are below. I also attached messages from both
> > nodes and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> > 
> > Here are excerpts from the messages:
> > 
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up
> > first  *
> > 
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and
> > resource management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node rhino66-
> > left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence
> > unclean nodes until quorum is attained (or no-quorum-policy is set to
> > ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new
> > membership (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for
> > all cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is
> > within the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]:
> > 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed
> > service synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum
> > acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not
> > obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null)
> > state is now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not
> > obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node
> > (null)