Re: [ClusterLabs] Is it possible to downgrade feature-set in 2.1.6-8
Thank you! _Vitaly

> On 02/26/2024 10:28 AM EST Ken Gaillot wrote:
>
> On Thu, 2024-02-22 at 08:05 -0500, vitaly wrote:
> > Hello.
> > We have a product with 2-node clusters.
> > Our current version uses Pacemaker 2.1.4; the new version will use Pacemaker 2.1.6.
> > During an upgrade failure it is possible that one node will come up with the new Pacemaker and work alone for a while.
> > The old node would later come up and try to join the cluster.
> > This would fail due to the different feature sets of the cluster nodes: the older feature set would not be able to join the newer one.
> >
> > Question:
> > Is it possible to force the new node with Pacemaker 2.1.6 to use the older feature set (3.15.0) for a while, until the second node is upgraded and able to work with Pacemaker 2.1.6?
>
> No
>
> > Thank you very much!
> > _Vitaly
>
> --
> Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Is it possible to downgrade feature-set in 2.1.6-8
Hello.
We have a product with 2-node clusters.
Our current version uses Pacemaker 2.1.4; the new version will use Pacemaker 2.1.6.
During an upgrade failure it is possible that one node will come up with the new Pacemaker and work alone for a while.
The old node would later come up and try to join the cluster.
This would fail due to the different feature sets of the cluster nodes: the older feature set would not be able to join the newer one.

Question:
Is it possible to force the new node with Pacemaker 2.1.6 to use the older feature set (3.15.0) for a while, until the second node is upgraded and able to work with Pacemaker 2.1.6?

Thank you very much!
_Vitaly
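As the reply above says, there is no supported way to pin the feature set. For context, a node may join only if its feature set is not older than the current DC's. That comparison is on dotted version strings; a rough stand-alone illustration of the rule using `sort -V` follows — this is not Pacemaker's actual code, and the feature-set values are only examples:

```shell
#!/bin/sh
# can_join DC_FEATURE_SET JOINER_FEATURE_SET
# Prints "join" if the joining node's feature set is at least as new as
# the DC's, "reject" otherwise.  Illustrative only: Pacemaker performs
# this comparison internally and there is no supported override.
can_join() {
    dc="$1"; joiner="$2"
    # sort -V orders dotted version strings lowest-first.  If the DC's
    # feature set sorts first (or the two are equal), the joiner is at
    # least as new and may join.
    lowest=$(printf '%s\n%s\n' "$dc" "$joiner" | sort -V | head -n1)
    if [ "$lowest" = "$dc" ]; then
        echo "join"
    else
        echo "reject"
    fi
}

can_join 3.16.1 3.15.0   # an older node trying to join a newer DC
can_join 3.15.0 3.16.1   # a newer node joining an older DC
```

This is why the upgrade order matters: the newer node can always join an older DC, but not the other way around.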
Re: [ClusterLabs] Postgres clone resource does not get "notice" events
OK, thank you very much for your help! _Vitaly

> On 07/05/2022 8:47 PM Reid Wahl wrote:
>
> On Tue, Jul 5, 2022 at 3:03 PM vitaly wrote:
> >
> > Hello,
> > Yes, the snippet has everything there was for the full second of Jul 05 11:54:34. I did not cut anything between the last line of 11:54:33 and the first line of 11:54:35.
> >
> > Here is a grep from the Pacemaker config:
> >
> > d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
> > PCMK_logfile=/var/log/pacemaker.log
> > SBD_SYNC_RESOURCE_STARTUP="no"
> > PCMK_trace_functions=services_action_sync,svc_read_output
> > d19-25-left.lab.archivas.com ~ #
> >
> > I also grepped the current pacemaker.log for services_action_sync and got just 4 records, at a time that does not seem to match the failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep services_action_sync /var/log/pacemaker.log
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:901) trace: (null)_(null)_0: /usr/sbin/fence_ipmilan = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:903) trace: stdout: <?xml version="1.0" ?>
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:901) trace: (null)_(null)_0: /usr/sbin/fence_sbd = 0
> > Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:903) trace: stdout: <?xml version="1.0" ?>
> >
> > This is a grep of messages for the failures:
> >
> > d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" /var/log/messages
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> > Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
> > Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Re: [ClusterLabs] Postgres clone resource does not get "notice" events
Hello,
Yes, the snippet has everything there was for the full second of Jul 05 11:54:34. I did not cut anything between the last line of 11:54:33 and the first line of 11:54:35.

Here is a grep from the Pacemaker config:

d19-25-left.lab.archivas.com ~ # egrep -v '^($|#)' /etc/sysconfig/pacemaker
PCMK_logfile=/var/log/pacemaker.log
SBD_SYNC_RESOURCE_STARTUP="no"
PCMK_trace_functions=services_action_sync,svc_read_output
d19-25-left.lab.archivas.com ~ #

I also grepped the current pacemaker.log for services_action_sync and got just 4 records, at a time that does not seem to match the failures:

d19-25-left.lab.archivas.com ~ # grep services_action_sync /var/log/pacemaker.log
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:901) trace: (null)_(null)_0: /usr/sbin/fence_ipmilan = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:903) trace: stdout: <?xml version="1.0" ?>
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:901) trace: (null)_(null)_0: /usr/sbin/fence_sbd = 0
Jul 05 21:20:21 d19-25-left.lab.archivas.com pacemaker-fenced[47287] (services_action_sync@services.c:903) trace: stdout: <?xml version="1.0" ?>

This is a grep of messages for the failures:

d19-25-left.lab.archivas.com ~ # grep " 5 21:[23].*Failed to .*pgsql-rhino" /var/log/messages
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:43 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:44 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:44 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:47 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:47 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:48 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:20:49 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:20:49 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 5 21:30:26 d19-25-left pacemaker-controld[47291]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
d19-25-left.lab.archivas.com ~ #

Sorry, these logs are not from the same time as this morning, as I reinstalled the cluster a couple of times today.
Thanks,
_Vitaly

> On 07/05/2022 3:19 PM Reid Wahl wrote:
>
> On Tue, Jul 5, 2022 at 5:17 AM vitaly wrote:
> >
> > Hello,
> > Thanks for looking at this issue!
> > Snippets from /var/log/messages and /var/log/pacemaker.log are below.
> > _Vitaly
> >
> > Here is the /var/log/pacemaker.log snippet around the failure:
> >
> > Jul 05 11:
Re: [ClusterLabs] Postgres clone resource does not get "notice" events
notice: Requesting local execution of start operation for N1F1 on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of start operation for fs_monitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right fs_monitor-rhino(fs_monitor)[2298359]: INFO: Started fs_monitor.sh, pid=2298369
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for tomcat-instance on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for fs_monitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of start operation for ClusterMonitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for fs_monitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Adding inet address 172.18.51.93/23 with broadcast address 172.18.51.255 to device bond0 (with label bond0:N1F1)
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: Bringing device bond0 up
Jul 5 11:54:34 d19-25-right IPaddr2-rhino(N1F1)[2298357]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /run/resource-agents/send_arp-172.18.51.93 bond0 172.18.51.93 auto not_used not_used
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for N1F1 on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right cluster_monitor-rhino(ClusterMonitor)[2298481]: INFO: Started cluster_monitor.sh, pid=2298549
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of start operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Requesting local execution of monitor operation for ClusterMonitor on d19-25-right.lab.archivas.com
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for N1F1 on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for ClusterMonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pacemaker-controld[2294546]: notice: Result of monitor operation for ethmonitor on d19-25-right.lab.archivas.com: ok
Jul 5 11:54:34 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Changing pgsql-data-status on d19-25-left.lab.archivas.com : DISCONNECT->STREAMING|ASYNC.
Jul 5 11:54:35 d19-25-right pgsql-rhino(postgres)[2298353]: INFO: Setup d19-25-left.lab.archivas.com into sync mode.

> On 07/04/2022 3:57 PM Reid Wahl wrote:
>
> On Mon, Jul 4, 2022 at 7:19 AM vitaly wrote:
> >
> > I get a printout of the metadata as follows:
> > d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data
> >
> > (version 1.0)
> > Resource script for PostgreSQL. It manages a PostgreSQL as an HA resource. (Manages a PostgreSQL database instance)
> >
> > pgctl: Path to pg_ctl command.
> > start_opt: Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.
> > ctl_opt: Additional pg_ctl options (-w, -W etc.).
> > psql: Path to psql command.
> > pgdata: Path to PostgreSQL data directory.
> > pgdba: User that owns PostgreSQL.
> > pghost: Hostname/IP address where PostgreSQL is listening.
> > pgport: Port where PostgreSQL is listening.
> > monitor_user: PostgreSQL user that the pgsql RA will use for monitor operations. If it's not set, the pgdba user will be used.
> > monitor_password: Password for monitor user.
Re: [ClusterLabs] Postgres clone resource does not get "notice" events
I get a printout of the metadata as follows:

d19-25-left.lab.archivas.com ~ # OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data

(version 1.0)
Resource script for PostgreSQL. It manages a PostgreSQL as an HA resource. (Manages a PostgreSQL database instance)

pgctl: Path to pg_ctl command.
start_opt: Start options (-o start_opt in pg_ctl). "-i -p 5432" for example.
ctl_opt: Additional pg_ctl options (-w, -W etc.).
psql: Path to psql command.
pgdata: Path to PostgreSQL data directory.
pgdba: User that owns PostgreSQL.
pghost: Hostname/IP address where PostgreSQL is listening.
pgport: Port where PostgreSQL is listening.
monitor_user: PostgreSQL user that the pgsql RA will use for monitor operations. If it's not set, the pgdba user will be used.
monitor_password: Password for monitor user.
monitor_sql: SQL script that will be used for monitor operations.
Configuration file: Path to the PostgreSQL configuration file for the instance.
pgdb: Database that will be used for monitoring.
logfile: Path to PostgreSQL server log output file.
socketdir: Unix socket directory for PostgreSQL.
stop escalation: Number of shutdown retries (using -m fast) before resorting to -m immediate.
rep_mode: Replication mode (none (default)/async/sync). "async" and "sync" require PostgreSQL 9.1 or later. If you use async or sync, it requires the node_list, master_ip, and restore_command parameters, and needs postgresql.conf and pg_hba.conf set up for replication. Please delete the "include /../../rep_mode.conf" line in postgresql.conf when you switch from sync to async.
node list: All node names. Please separate each node name with a space. This is required for replication.
restore_command: restore_command for recovery.conf. This is required for replication.
master ip: Master's floating IP address to be connected from hot standby. This parameter is used for "primary_conninfo" in recovery.conf. This is required for replication.
repuser: User used to connect to the master server. This parameter is used for "primary_conninfo" in recovery.conf. This is required for replication.
remote_wals_dir: Location of WALs archived by the other node.
xlogs_dir: Location of WALs on the current node in Rhino before 2.2.0.
wals_dir: Location of WALs on the current node in Rhino 2.2.0 and later.
reppassword: User used to connect to the master server. This parameter is used for "primary_conninfo" in recovery.conf. This is required for replication.
primary_conninfo_opt: primary_conninfo options of recovery.conf except host, port, user and application_name. This is optional for replication.
tmpdir: Path to temporary directory. This is optional for replication.
xlog check count: Number of checking xlog on monitor before promote. This is optional for replication.
The timeout of crm_attribute forever update command: Default value is 5 seconds. This is optional for replication.
stop escalation_in_slave: Number of shutdown retries (using -m fast) before resorting to -m immediate in Slave state. This is optional for replication.
Seconds to wait for a process to be running: Number of seconds to wait for a PostgreSQL process to be running but not necessarily usable.
Start failures before recovery: Number of failed starts before the system forces a recovery from the master database.
Rhino configuration file: Configuration file with overrides for pgsql-rhino.

> On 07/04/2022 5:39 AM Reid Wahl wrote:
>
> On Mon, Jul 4, 2022 at 1:06 AM Reid Wahl wrote:
> >
> > On Sat, Jul 2, 2022 at 1:12 PM vitaly wrote:
> > >
> > > Sorry, I noticed that I am missing meta "notice=true"; after adding it to the postgres-ms configuration, "notice" events started to come through.
> > > Item 1 still needs explanation, as pacemaker-controld keeps complaining.
> >
> > What happens when you run `OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino meta-data`?
> This may also be relevant:
> https://lists.clusterlabs.org/pipermail/users/2022-June/030391.html
>
> > > Thanks!
> > > _Vitaly
> > >
> > > > On 07/02/2022 2:04 PM vitaly wrote:
> > > >
> > > > Hello Everybody.
> > > > I have a 2-node cluster with clone resource "postgres-ms". We are running the following versions of pacemaker/corosync:
> > > > d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> > > > pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-libs-2.0.5-9.el8.x86_64
> > > > pacemaker-
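The "Failed to receive meta-data" errors in this thread come from pacemaker-controld invoking the agent's meta-data action: if that invocation exits non-zero or prints nothing usable on stdout, controld logs exactly these errors. Reid's suggestion above is the manual version of the same check. A self-contained sketch of the check using a mock agent follows — the mock agent path is hypothetical; for the real test, run the check against /usr/lib/ocf/resource.d/heartbeat/pgsql-rhino with OCF_ROOT=/usr/lib/ocf exported, as shown earlier in the thread:

```shell
#!/bin/sh
# Roughly reproduce the controld meta-data fetch: run the agent's
# meta-data action, then require exit status 0 and non-empty stdout.
# ./mock-agent.sh is a hypothetical stand-in written here on the fly;
# substitute the real resource agent path on a cluster node.

AGENT=./mock-agent.sh
cat > "$AGENT" <<'EOF'
#!/bin/sh
case "$1" in
    meta-data)
        printf '<?xml version="1.0"?>\n<resource-agent name="mock">\n</resource-agent>\n'
        exit 0 ;;
esac
exit 3   # OCF_ERR_UNIMPLEMENTED for any other action
EOF
chmod +x "$AGENT"

if out=$("$AGENT" meta-data) && [ -n "$out" ]; then
    echo "meta-data OK"
else
    echo "meta-data FAILED (this is the condition behind the controld errors)"
fi
```

If the real agent fails this check from the command line, the controld errors are expected; if it passes, the problem is in how the cluster invokes it (environment, permissions, SELinux, and so on).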
Re: [ClusterLabs] Postgres clone resource does not get "notice" events
Sorry, I noticed that I am missing meta "notice=true"; after adding it to the postgres-ms configuration, "notice" events started to come through.
Item 1 still needs explanation, as pacemaker-controld keeps complaining.
Thanks!
_Vitaly

> On 07/02/2022 2:04 PM vitaly wrote:
>
> Hello Everybody.
> I have a 2-node cluster with clone resource "postgres-ms". We are running the following versions of pacemaker/corosync:
> d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
> pacemaker-cluster-libs-2.0.5-9.el8.x86_64
> pacemaker-libs-2.0.5-9.el8.x86_64
> pacemaker-cli-2.0.5-9.el8.x86_64
> corosynclib-3.1.0-5.el8.x86_64
> pacemaker-schemas-2.0.5-9.el8.noarch
> corosync-3.1.0-5.el8.x86_64
> pacemaker-2.0.5-9.el8.x86_64
>
> There are a couple of issues that could be related.
>
> 1. The following messages appear in the logs, coming from pacemaker-controld:
> Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
> Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)
>
> 2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations, which causes multiple issues with postgres synchronization during availability events.
>
> 3. Item 2 raises another question. Who is setting these values:
> ${OCF_RESKEY_CRM_meta_notify_type}
> ${OCF_RESKEY_CRM_meta_notify_operation}
>
> Here is an excerpt from the cluster config:
>
> d19-25-left.lab.archivas.com ~ # pcs config
>
> Cluster Name:
> Corosync Nodes:
>  d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
> Pacemaker Nodes:
>  d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com
>
> Resources:
>  Clone: postgres-ms
>   Meta Attrs: promotable=true target-role=started
>   Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
>    Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com" pgdata=/pg_data remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh wal_restore.py %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal xlogs_dir=/pg_data/pg_xlog
>    Meta Attrs: is-managed=true
>    Operations: demote interval=0 on-fail=restart timeout=120s (postgres-demote-interval-0)
>     methods interval=0s timeout=5 (postgres-methods-interval-0s)
>     monitor interval=10s on-fail=restart timeout=300s (postgres-monitor-interval-10s)
>     monitor interval=5s on-fail=restart role=Master timeout=300s (postgres-monitor-interval-5s)
>     notify interval=0 on-fail=restart timeout=90s (postgres-notify-interval-0)
>     promote interval=0 on-fail=restart timeout=120s (postgres-promote-interval-0)
>     start interval=0 on-fail=restart timeout=1800s (postgres-start-interval-0)
>     stop interval=0 on-fail=fence timeout=120s (postgres-stop-interval-0)
>
> Thank you very much!
> _Vitaly
[ClusterLabs] Postgres clone resource does not get "notice" events
Hello Everybody.
I have a 2-node cluster with clone resource "postgres-ms". We are running the following versions of pacemaker/corosync:

d19-25-left.lab.archivas.com ~ # rpm -qa | grep "pacemaker\|corosync"
pacemaker-cluster-libs-2.0.5-9.el8.x86_64
pacemaker-libs-2.0.5-9.el8.x86_64
pacemaker-cli-2.0.5-9.el8.x86_64
corosynclib-3.1.0-5.el8.x86_64
pacemaker-schemas-2.0.5-9.el8.noarch
corosync-3.1.0-5.el8.x86_64
pacemaker-2.0.5-9.el8.x86_64

There are a couple of issues that could be related.

1. The following messages appear in the logs, coming from pacemaker-controld:
Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: error: Failed to receive meta-data for ocf:heartbeat:pgsql-rhino
Jul 2 14:59:27 d19-25-right pacemaker-controld[1489734]: warning: Failed to get metadata for postgres (ocf:heartbeat:pgsql-rhino)

2. ocf:heartbeat:pgsql-rhino does not get any "notice" operations, which causes multiple issues with postgres synchronization during availability events.

3. Item 2 raises another question. Who is setting these values:
${OCF_RESKEY_CRM_meta_notify_type}
${OCF_RESKEY_CRM_meta_notify_operation}

Here is an excerpt from the cluster config:

d19-25-left.lab.archivas.com ~ # pcs config

Cluster Name:
Corosync Nodes:
 d19-25-right.lab.archivas.com d19-25-left.lab.archivas.com
Pacemaker Nodes:
 d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com

Resources:
 Clone: postgres-ms
  Meta Attrs: promotable=true target-role=started
  Resource: postgres (class=ocf provider=heartbeat type=pgsql-rhino)
   Attributes: master_ip=172.16.1.6 node_list="d19-25-left.lab.archivas.com d19-25-right.lab.archivas.com" pgdata=/pg_data remote_wals_dir=/remote/walarchive rep_mode=sync reppassword=XX repuser=XXX restore_command="/opt/rhino/sil/bin/script_wrapper.sh wal_restore.py %f %p" tmpdir=/pg_data/tmp wals_dir=/pg_data/pg_wal xlogs_dir=/pg_data/pg_xlog
   Meta Attrs: is-managed=true
   Operations: demote interval=0 on-fail=restart timeout=120s (postgres-demote-interval-0)
    methods interval=0s timeout=5 (postgres-methods-interval-0s)
    monitor interval=10s on-fail=restart timeout=300s (postgres-monitor-interval-10s)
    monitor interval=5s on-fail=restart role=Master timeout=300s (postgres-monitor-interval-5s)
    notify interval=0 on-fail=restart timeout=90s (postgres-notify-interval-0)
    promote interval=0 on-fail=restart timeout=120s (postgres-promote-interval-0)
    start interval=0 on-fail=restart timeout=1800s (postgres-start-interval-0)
    stop interval=0 on-fail=fence timeout=120s (postgres-stop-interval-0)

Thank you very much!
_Vitaly
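On item 3: Pacemaker itself sets those values. When a clone has the notify meta attribute enabled (meta notify=true; the "notice=true" mentioned elsewhere in the thread appears to be the same setting), the cluster invokes the agent's notify action around start/stop/promote/demote events and exports the context as OCF_RESKEY_CRM_meta_notify_type ("pre" or "post") and OCF_RESKEY_CRM_meta_notify_operation. A minimal sketch of the receiving side follows — the exports stand in for what Pacemaker would set, and the case-branch actions are hypothetical placeholders, not the pgsql agent's real logic:

```shell
#!/bin/sh
# Sketch of a notify-action handler.  In a live cluster, Pacemaker sets
# the OCF_RESKEY_CRM_meta_notify_* variables before invoking the agent
# with the "notify" action; here they are exported by hand to show the
# shape of the interface.
handle_notify() {
    n_type="${OCF_RESKEY_CRM_meta_notify_type}"        # "pre" or "post"
    n_op="${OCF_RESKEY_CRM_meta_notify_operation}"     # start/stop/promote/demote
    echo "notify: ${n_type}-${n_op}"
    case "${n_type}-${n_op}" in
        post-promote) echo "would repoint replication at the new master" ;;
        pre-stop)     echo "would prepare for a peer going away" ;;
    esac
}

# Simulate what Pacemaker would export before calling the agent:
OCF_RESKEY_CRM_meta_notify_type=post
OCF_RESKEY_CRM_meta_notify_operation=promote
export OCF_RESKEY_CRM_meta_notify_type OCF_RESKEY_CRM_meta_notify_operation

handle_notify
```

Without notify enabled on the clone, the notify action is simply never scheduled, which matches item 2's symptom.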
[ClusterLabs] I_DC_TIMEOUT and node fenced when it joins the cluster
Hello Everybody.
I am occasionally seeing the following behavior on a two-node cluster.

1. Both nodes of the cluster are abruptly rebooted (using "reboot").

2. Both nodes start to come up. Node d18-3-left (2) comes up first:
Apr 13 23:56:09 d18-3-left corosync[11465]: [MAIN ] Corosync Cluster Engine ('2.4.4'): started and ready to provide service.

3. The second node, d18-3-right (1), joins the cluster:
Apr 13 23:56:58 d18-3-left corosync[11466]: [TOTEM ] A new membership (172.16.1.1:60) was formed. Members joined: 1
Apr 13 23:56:58 d18-3-left corosync[11466]: [QUORUM] This node is within the primary component and will provide service.
Apr 13 23:56:58 d18-3-left corosync[11466]: [QUORUM] Members[2]: 1 2
Apr 13 23:56:58 d18-3-left corosync[11466]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 13 23:56:58 d18-3-left pacemakerd[11717]: notice: Quorum acquired
Apr 13 23:56:58 d18-3-left crmd[11763]: notice: Quorum acquired

4. Two seconds later, node d18-3-left shows I_DC_TIMEOUT and starts fencing of the newly joined node:
Apr 13 23:57:00 d18-3-left crmd[11763]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
After that we get:
Apr 13 23:57:00 d18-3-left crmd[11763]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 13 23:57:00 d18-3-left crmd[11763]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
and it fences the node:
Apr 13 23:57:01 d18-3-left pengine[11762]: warning: Scheduling Node d18-3-right.lab.archivas.com for STONITH
Apr 13 23:57:01 d18-3-left pengine[11762]: notice: * Fence (reboot) d18-3-right.lab.archivas.com 'node is unclean'

5. After this, the node that was fenced comes up again and joins the cluster without any issues.

Any idea what is going on here?
Thanks,
_Vitaly
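One avenue to investigate (an assumption on my part; the thread does not confirm the root cause): I_DC_TIMEOUT is raised when no DC has been elected within the dc-deadtime cluster property, so after a whole-cluster reboot a slow peer can end up declared unclean and fenced. If that matches, giving the election more headroom may help; dc-deadtime is a standard Pacemaker cluster property, but the value below is purely illustrative, not a tuned recommendation:

```
# Allow more time for all nodes to appear before the DC election
# gives up on them (illustrative value):
crm_attribute --type crm_config --name dc-deadtime --update 60s

# Confirm the setting:
crm_attribute --type crm_config --name dc-deadtime --query
```

The trade-off is slower startup after a clean full-cluster boot, since the surviving node waits the full dc-deadtime before proceeding alone.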
Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?
Ken,
Thank you very much for your help! 1.1.15-3 seems to satisfy our need and required very few fixes to build on CentOS 8. We will go ahead with that version.
Thanks again!
_Vitaly

> On November 23, 2021 6:03 PM Ken Gaillot wrote:
>
> On Tue, 2021-11-23 at 17:36 -0500, vitaly wrote:
> > Thank you!
> > I understand the purpose. We did not hit a problem with this issue until there was some failure during an upgrade at a customer site and the old node died. The new one came up, and the old node was never able to join until we killed the new one and started the old one in single-node mode.
>
> Once an older node leaves the cluster, the best course of action would be to upgrade it before trying to have it rejoin.
>
> > My question about rpms was related to pacemaker vs corosync rpms. I guess that crm_feature_set is defined in the pacemaker rpms.
> > I understand that all rpms built for pacemaker have to be the same version, as should all rpms built for corosync.
> > If 1.1.15 uses 3.0.10 I will try 1.1.15 then.
>
> That would let you run 1.1.13 and 1.1.15 nodes indefinitely without any serious issues. However, trying to upgrade past 1.1.15 would put you in the same situation -- if the 1.1.15 node leaves the cluster, it can't rejoin until it's upgraded to the newer version.
>
> > Thank you very much for your help!
> > _Vitaly
> >
> > > On November 23, 2021 5:12 PM Ken Gaillot wrote:
> > >
> > > On Tue, 2021-11-23 at 14:11 -0500, vitaly wrote:
> > > > Hello,
> > > > I am working on the upgrade from an older version of pacemaker/corosync to the current one. In the interim we need to sync a newly installed node with the node running old software. Our old node uses pacemaker 1.1.13-3.fc22 and corosync 2.3.5-1.fc22 and has crm_feature_set 3.0.10.
> > > >
> > > > For the interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. This version is using crm_feature_set 3.0.14.
> > > > This version is working fine, but it has issues in some edge cases, like when the new node starts alone and then the old one tries to join.
> > >
> > > That's the intended behavior of mixed-version clusters -- once an older node leaves the cluster, it can't rejoin without being upgraded. This allows new features to become available once all older nodes are gone.
> > >
> > > Mixed-version clusters should only be used in a rolling upgrade, i.e. upgrading each node in turn and returning it to the cluster.
> > >
> > > > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just once and then it will be upgraded to the latest versions of pacemaker and corosync.
> > > >
> > > > Now, a couple of questions:
> > > > 1. Which rpm defines crm_feature_set?
> > >
> > > The feature set applies to all RPMs of a particular version. You can't mix and match RPMs from different versions.
> > >
> > > > 2. Which version of this rpm has crm_feature_set 3.0.10?
> > >
> > > The feature set of each released version can be seen at:
> > > https://wiki.clusterlabs.org/wiki/ReleaseCalendar
> > > 1.1.13 through 1.1.15 had feature set 3.0.10
> > >
> > > > 3. Where could I get source rpms to rebuild this rpm on CentOS 8?
> > > > Thanks a lot!
> > > > _Vitaly Zolotusky
> > >
> > > The stock packages in the repos should be fine. All newer versions support rolling upgrades from 1.1.13.
> > >
> > > --
> > > Ken Gaillot
>
> --
> Ken Gaillot
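Besides the release calendar, each installed build can report its own feature set directly, which avoids guessing from package version numbers (the exact output wording varies between releases; verify on your build):

```
# Prints the Pacemaker version and the crm_feature_set it supports,
# e.g. a "Supporting v3.0.10: ..." line on a 1.1.13-1.1.15 build:
pacemakerd --features
```

Running this on each node before and during a rolling upgrade makes it easy to see which nodes can still rejoin which.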
Re: [ClusterLabs] Antw: [EXT] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?
Ulrich, Yes, Fedora is far ahead of 22, but our product was in the field for quite a few years. Later versions are running on 28, but now we are moved on to CentOs. Upgrade was always happening online with no service interruption. The problem with upgrade to the new version of Corosync is that it does not talk to the old one. Now that we need to replace Pacemaker, Corosync and Postgres we need to have one brief interruption for Pacemaker, Corosync and Postgres update, but before we update Pacemaker we have to sync new nodes with old ones to keep cluster running until all nodes are ready. So what we do is - we install new nodes with old version of Pacemaker, Corosync and Postgres, have them running in mixed (old/new) configuration until all are ready for new Pacemaker, Corosync and Postgres and then we shutdown the cluster for couple of min and upgrade Pacemaker, Corosync and Postgres. The only reason we shutdown the cluster now is because old Corosync does not talk to new Corosync. _Vitaly > On November 24, 2021 2:22 AM Ulrich Windl > wrote: > > > >>> vitaly schrieb am 23.11.2021 um 20:11 in Nachricht > <45677632.67420.1637694706...@webmail6.networksolutionsemail.com>: > > Hello, > > I am working on the upgrade from older version of pacemaker/corosync to the > > > current one. In the interim we need to sync newly installed node with the > > node running old software. Our old node uses pacemaker 1.1.13‑3.fc22 and > > corosync 2.3.5‑1.fc22 and has crm_feature_set 3.0.10. > > > > For interim sync I used pacemaker 1.1.18‑2.fc28 and corosync 2.4.4‑1.fc28. > > This version is using crm_feature_set 3.0.14. > > This version is working fine, but it has issues in some edge cases, like > > when the new node starts alone and then the old one tries to join. > > What I'm wondering (not wearing a red hat): Isn't Fedora at something like 33 > or 34 right now? > If so, why bother with such old versions? > > > > > So I need to rebuild rpms for crm_feature_set 3.0.10. 
This will be used just > > > once and then it will be upgraded to the latest versions of pacemaker and > > corosync. > > > > Now, couple of questions: > > 1. Which rpm defines crm_feature_set? > > 2. Which version of this rpm has crm_feature_set 3.0.10? > > 3. Where could I get source rpms to rebuild this rpm on CentOs 8? > > Thanks a lot! > > _Vitaly Zolotusky > > ___ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
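The cut-over window described above (stop the whole cluster, swap Pacemaker/Corosync/Postgres, restart, so the old Corosync never has to talk to the new one) could be sketched as below. This is a dry-run outline, not the poster's actual tooling: node names are hypothetical and `run` only prints each step.

```shell
#!/bin/sh
# Dry-run sketch of the brief full-cluster shutdown described above.
# Replace run() with e.g. `ssh "$1" $2` to execute for real.
NODES="node-left node-right"
run() { echo "+ $1: $2"; }

# 1. stop the HA stack everywhere (start of the outage window)
for n in $NODES; do run "$n" "systemctl stop pacemaker corosync"; done
# 2. swap the packages on every node
for n in $NODES; do run "$n" "rpm -Uvh pacemaker-* corosync-* postgresql-*"; done
# 3. restart the stack on the new versions
for n in $NODES; do run "$n" "systemctl start corosync pacemaker"; done
```

The ordering is the point: every node must be stopped before any node comes up on the new wire protocol.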
Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?
Thank you! I understand the purpose. We did not hit the problem with this issue until there was some failure during upgrade at the customer site and the old node died. New one came up and old node was never able to join until we killed new one and started old in the single node mode. My question about rpms was related to pacemaker vs corosync rpms. I guess that crm_feature_set is defined in pacemaker rpms. I understand that all rpms built for pacemaker have to be same version as should all rpms built for corosync. If 1.1.15 uses 3.0.10 I will try 1.1.15 then. Thank you very much for your help! _Vitaly > On November 23, 2021 5:12 PM Ken Gaillot wrote: > > > On Tue, 2021-11-23 at 14:11 -0500, vitaly wrote: > > Hello, > > I am working on the upgrade from older version of pacemaker/corosync > > to the current one. In the interim we need to sync newly installed > > node with the node running old software. Our old node uses pacemaker > > 1.1.13-3.fc22 and corosync 2.3.5-1.fc22 and has crm_feature_set > > 3.0.10. > > > > For interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4- > > 1.fc28. This version is using crm_feature_set 3.0.14. > > This version is working fine, but it has issues in some edge cases, > > like when the new node starts alone and then the old one tries to > > join. > > That's the intended behavior of mixed-version clusters -- once an older > node leaves the cluster, it can't rejoin without being upgraded. This > allows new features to become available once all older nodes are gone. > > Mixed-version clusters should only be used in a rolling upgrade, i.e. > upgrading each node in turn and returning it to the cluster. > > > > > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be > > used just once and then it will be upgraded to the latest versions of > > pacemaker and corosync. > > > > Now, couple of questions: > > 1. Which rpm defines crm_feature_set? > > The feature set applies to all RPMs of a particular version. 
You can't > mix and match RPMs from different versions. > > > 2. Which version of this rpm has crm_feature_set 3.0.10? > > The feature set of each released version can be seen at: > > https://wiki.clusterlabs.org/wiki/ReleaseCalendar > > 1.1.13 through 1.1.15 had feature set 3.0.10 > > > 3. Where could I get source rpms to rebuild this rpm on CentOs 8? > > Thanks a lot! > > _Vitaly Zolotusky > > The stock packages in the repos should be fine. All newer versions > support rolling upgrades from 1.1.13. > > > > -- > Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?
Thank you! With some minor fixes I was able to build it on CentOS 8. Tomorrow I will test if it works and gives me the same feature set. Which side is responsible for crm_feature_set? Pacemaker or Corosync? Thank you very much for your help! _Vitaly > On November 23, 2021 2:49 PM Strahil Nikolov wrote: > > > Have you tried with a Fedora package from the archives? > I found > https://dl.fedoraproject.org/pub/archive/fedora/linux/releases/23/Everything/x86_64/os/Packages/p/pacemaker-1.1.13-3.fc23.x86_64.rpm > & > https://dl.fedoraproject.org/pub/archive/fedora/linux/releases/23/Everything/x86_64/os/Packages/c/corosync-2.3.5-1.fc23.x86_64.rpm > which theoretically should be close enough. > > > P.S.: I couldn't find those versions for Fedora 22, but they seem available > for F23. > > Best Regards, > Strahil Nikolov > > On Tuesday, November 23, 2021, 21:11:58 GMT+2, vitaly > wrote: > > > Hello, > I am working on the upgrade from an older version of pacemaker/corosync to the > current one. In the interim we need to sync a newly installed node with the > node running old software. Our old node uses pacemaker 1.1.13-3.fc22 and > corosync 2.3.5-1.fc22 and has crm_feature_set 3.0.10. > > For interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. > This version is using crm_feature_set 3.0.14. > This version is working fine, but it has issues in some edge cases, like > when the new node starts alone and then the old one tries to join. > > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just > once and then it will be upgraded to the latest versions of pacemaker and > corosync. > > Now, couple of questions: > 1. Which rpm defines crm_feature_set? > 2. Which version of this rpm has crm_feature_set 3.0.10? > 3. Where could I get source rpms to rebuild this rpm on CentOS 8? > Thanks a lot! 
> _Vitaly Zolotusky > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Which version of pacemaker/corosync provides crm_feature_set 3.0.10?
Hello, I am working on an upgrade from an older version of pacemaker/corosync to the current one. In the interim we need to sync a newly installed node with a node running the old software. Our old node uses pacemaker 1.1.13-3.fc22 and corosync 2.3.5-1.fc22 and has crm_feature_set 3.0.10. For the interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. This version uses crm_feature_set 3.0.14. It is working fine, but it has issues in some edge cases, like when the new node starts alone and then the old one tries to join. So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just once and then upgraded to the latest versions of pacemaker and corosync. Now, a couple of questions:
1. Which rpm defines crm_feature_set?
2. Which version of this rpm has crm_feature_set 3.0.10?
3. Where could I get source rpms to rebuild this rpm on CentOS 8?
Thanks a lot! _Vitaly Zolotusky
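On question 1 above: as the replies note, the feature set is compiled into Pacemaker itself (Corosync has no part in it), so it comes with the pacemaker packages of a given version. A hedged sketch of how to inspect it on a node with Pacemaker installed, plus a small stand-alone parser for the CIB header line:

```shell
# On an installed node, two places to look (commands require Pacemaker):
#   pacemakerd --features        # prints the supported feature set and build flags
#   cibadmin --query | head -n1  # the <cib> element carries crm_feature_set
# Stand-alone helper to pull crm_feature_set out of a CIB header line:
extract_feature_set() {
    sed -n 's/.*crm_feature_set="\([^"]*\)".*/\1/p'
}
echo '<cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4">' | extract_feature_set
# prints 3.0.10
```

The sample `<cib ...>` line is illustrative; a real CIB header carries more attributes, but the `crm_feature_set` attribute is where the running cluster records the feature set in use.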
Re: [ClusterLabs] Upgrading/downgrading cluster configuration
Thanks for the reply. I do see backup config command in pcs, but not in crmsh. What would that be in crmsh? Would something like this work after Corosync started in the state with all resources inactive? I'll try this: crm configure save crm configure load Thank you! _Vitaly Zolotusky > On October 22, 2020 1:54 PM Strahil Nikolov wrote: > > > Have you tried to backup the config via crmsh/pcs and when you downgrade to > restore from it ? > > Best Regards, > Strahil Nikolov > > > > > > > В четвъртък, 22 октомври 2020 г., 15:40:43 Гринуич+3, Vitaly Zolotusky > написа: > > > > > > Hello, > We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our > procedure includes upgrade where we stopthe cluster, replace rpms and restart > the cluster. Upgrade works fine, but we also need to implement rollback in > case something goes wrong. > When we rollback and reload old RPMs cluster says that there are no active > resources. It looks like there is a problem with cluster configuration > version. > Here is output of the crm_mon: > > d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1 > Stack: corosync > Current DC: NONE > Last updated: Thu Oct 22 12:39:37 2020 > Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on > d21-22-left.lab.archivas.com > > 2 nodes configured > 15 resources configured > > Node d21-22-left.lab.archivas.com: UNCLEAN (offline) > Node d21-22-right.lab.archivas.com: UNCLEAN (offline) > > No active resources > > > Node Attributes: > > *** > What would be the best way to implement downgrade of the configuration? > Should we just change crm feature set, or we need to rebuild the whole config? > Thanks! > _Vitaly Zolotusky > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
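The two crmsh commands mentioned above take a file argument; a guarded sketch of the backup/restore round trip (the backup path is made up, and note that `load replace` overwrites the live configuration):

```shell
#!/bin/sh
# Sketch of a crmsh config backup before an upgrade and restore after a
# rollback.  BACKUP is a hypothetical path; adjust to taste.
BACKUP=/root/cib-before-upgrade.cli

if command -v crm >/dev/null 2>&1; then
    crm configure save "$BACKUP"            # dump config before the upgrade
    # ... upgrade or rollback happens here ...
    crm configure load replace "$BACKUP"    # push it back once the cluster is up
else
    echo "crmsh not installed; commands shown for illustration" >&2
fi
```

This is the crmsh counterpart of `pcs config backup`/`pcs config restore` that the question asks about, under the assumption that the restored configuration is schema-compatible with the downgraded Pacemaker.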
[ClusterLabs] Upgrading/downgrading cluster configuration
Hello, We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our procedure includes an upgrade where we stop the cluster, replace the rpms, and restart the cluster. The upgrade works fine, but we also need to implement a rollback in case something goes wrong. When we roll back and reload the old RPMs, the cluster says that there are no active resources. It looks like there is a problem with the cluster configuration version. Here is the output of crm_mon:

d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
Stack: corosync
Current DC: NONE
Last updated: Thu Oct 22 12:39:37 2020
Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on d21-22-left.lab.archivas.com

2 nodes configured
15 resources configured

Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
Node d21-22-right.lab.archivas.com: UNCLEAN (offline)

No active resources

Node Attributes:
***

What would be the best way to implement a downgrade of the configuration? Should we just change the crm feature set, or do we need to rebuild the whole config? Thanks! _Vitaly Zolotusky
Re: [ClusterLabs] Antw: [EXT] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Hello, This is exactly what we do in some of our software. We have a messaging version number and we can negotiate the messaging protocol between nodes. Then communication happens on the highest common version. It would be great to have something like that available in Corosync! Thanks, _Vitaly > On June 12, 2020 3:40 AM Ulrich Windl > wrote: > > > Hi! > > I can't help here, but in general I think corosync should support an "upgrade" > mode. Maybe like this: > The newer version can also speak the previous protocol, and the current > protocol will be enabled only after all nodes in the cluster are upgraded. > > Probably this would require a triple version number field like (oldest > version supported, version being requested/used, newest version supported). > > For the API, a query about the latest commonly agreed version number and a > request to use a different version number would be needed, too. > > Regards, > Ulrich > > >>> Vitaly Zolotusky wrote on 11.06.2020 at 04:14 in > message > <19881_1591841678_5EE1938D_19881_553_1_1163034878.247559.1591841668387@webmail6.networksolutionsemail.com>: > > Hello everybody. > > We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync > > 2.99+. It looks like they are not compatible and we are getting messages > > like: > > Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message received from > > 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring > > on the upgraded node and > > Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid packet data > > Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming packet has > > different crypto type. Rejecting > > Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received message has > > invalid digest... ignoring. > > on the pre-upgrade node. > > > > Is there a good way to do this upgrade? > > I would appreciate it very much if you could point me to any documentation > > or articles on this issue. 
> > Thank you very much! > > _Vitaly > > ___ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
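The negotiation scheme sketched in the exchange above (each node advertises the oldest and newest protocol versions it supports, and the pair settles on the highest common one) can be illustrated with a toy function. Integer version numbers here are purely hypothetical; this is the idea, not Corosync code:

```shell
# Toy sketch of "highest common version" negotiation: each side advertises
# (oldest_supported, newest_supported); the pair uses min(newest_a, newest_b)
# unless that falls below either side's floor.
negotiate() {
    # $1/$2: node A oldest/newest, $3/$4: node B oldest/newest
    common=$(( $2 < $4 ? $2 : $4 ))
    if [ "$common" -lt "$1" ] || [ "$common" -lt "$3" ]; then
        echo incompatible
    else
        echo "$common"
    fi
}
negotiate 2 5 3 7    # both support version 5 -> prints 5
negotiate 1 2 3 4    # A tops out below B's floor -> prints incompatible
```

This is exactly the "triple version number" idea from the reply: the requested/in-use version is whatever `negotiate` returns, bounded by each node's advertised range.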
Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Hello, Strahil. Thanks for your suggestion. We are doing something similar to what you suggest, but: 1. We do not have external storage. Our product is a single box with 2 internal heads and 10-14 PB (peta) of data in a single box. (or it could have 9 boxes hooked up together , but still with just 2 heads and 9 times more storage). 2. Setup a new cluster is kind of hard. We do that on an extra partitions in chroot while the old cluster is running, so shutdown should be pretty short if we can figure out a way for the cluster to work while we configure the new partitions. 3. At this time we have to stop a node to move configuration from old to new partition, initialize new databases, etc. While we are doing that the other node is taking over all processing. We will see if we can incorporate your suggestion into our upgrade path. Thanks a lot for your help! _Vitaly > On June 11, 2020 12:00 PM Strahil Nikolov wrote: > > > Hi Vitaly, > > have you considered something like this: > 1. Setup a new cluster > 2. Present the same shared storage on the new cluster > 3. Prepare the resource configuration but do not apply yet. > 3. Power down all resources on old cluster > 4. Deploy the resources on the new cluster and immediately bring the > resources up > 5. Remove access to the shared storage for the old cluster > 6. Wipe the old cluster. > > Downtime will be way shorter. > > Best Regards, > Strahil Nikolov > > На 11 юни 2020 г. 17:48:47 GMT+03:00, Vitaly Zolotusky > написа: > >Thank you very much for quick reply! > >I will try to either build new version on Fedora 22, or build the old > >version on CentOs 8 and do a HA stack upgrade separately from my full > >product/OS upgrade. A lot of my customers would be extremely unhappy > >with even short downtime, so I can't really do the full upgrade > >offline. > >Thanks again! > >_Vitaly > > > >> On June 11, 2020 10:14 AM Jan Friesse wrote: > >> > >> > >> > Thank you very much for your help! 
> >> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that > >it may work with rolling upgrade (we were fooled by the same major > >version (2)). Our fresh install works fine on V3.0.3-5. > >> > Do you know if it is possible to build Pacemaker 3.0.3-5 and > >Corosync 2.0.3 on Fedora 22 so that I > >> > >> Good question. Fedora 22 is quite old but close to RHEL 7 for which > >we > >> build packages automatically (https://kronosnet.org/builds/) so it > >> should be possible. But you are really on your own, because I don't > >> think anybody ever tried it. > >> > >> Regards, > >>Honza > >> > >> > >> > >> upgrade the stack before starting "real" upgrade of the product? > >> > Then I can do the following sequence: > >> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version > >> > 2. start HA stack on the old OS and product version with Pacemaker > >3.0.3 and bring the product online > >> > 3. start rolling upgrade for product upgrade to the new OS and > >product version > >> > Thanks again for your help! > >> > _Vitaly > >> > > >> >> On June 11, 2020 3:30 AM Jan Friesse wrote: > >> >> > >> >> > >> >> Vitaly, > >> >> > >> >>> Hello everybody. > >> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to > >Corosync 2.99+. It looks like they are not compatible and we are > >getting messages like: > >> >> > >> >> Yes, they are not wire compatible. Also please do not use 2.99 > >versions, > >> >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a > >long time > >> >> released (3.0.4 is latest and I would recommend using it - there > >were > >> >> quite a few important bugfixes between 3.0.0 and 3.0.4) > >> >> > >> >> > >> >>> Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message > >received from 172.18.52.44 has bad magic number (probably sent by > >Corosync 2.3+).. 
Ignoring > >> >>> on the upgraded node and > >> >>> Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid > >packet data > >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming > >packet has different crypto type. Rejecting > >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received &
Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Thank you very much for quick reply! I will try to either build new version on Fedora 22, or build the old version on CentOs 8 and do a HA stack upgrade separately from my full product/OS upgrade. A lot of my customers would be extremely unhappy with even short downtime, so I can't really do the full upgrade offline. Thanks again! _Vitaly > On June 11, 2020 10:14 AM Jan Friesse wrote: > > > > Thank you very much for your help! > > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that it may > > work with rolling upgrade (we were fooled by the same major version (2)). > > Our fresh install works fine on V3.0.3-5. > > Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 > > on Fedora 22 so that I > > Good question. Fedora 22 is quite old but close to RHEL 7 for which we > build packages automatically (https://kronosnet.org/builds/) so it > should be possible. But you are really on your own, because I don't > think anybody ever tried it. > > Regards, >Honza > > > > upgrade the stack before starting "real" upgrade of the product? > > Then I can do the following sequence: > > 1. "quick" full shutdown for HA stack upgrade to 3.0 version > > 2. start HA stack on the old OS and product version with Pacemaker 3.0.3 > > and bring the product online > > 3. start rolling upgrade for product upgrade to the new OS and product > > version > > Thanks again for your help! > > _Vitaly > > > >> On June 11, 2020 3:30 AM Jan Friesse wrote: > >> > >> > >> Vitaly, > >> > >>> Hello everybody. > >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync > >>> 2.99+. It looks like they are not compatible and we are getting messages > >>> like: > >> > >> Yes, they are not wire compatible. 
Also please do not use 2.99 versions, > >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a long time > >> released (3.0.4 is latest and I would recommend using it - there were > >> quite a few important bugfixes between 3.0.0 and 3.0.4) > >> > >> > >>> Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message received > >>> from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. > >>> Ignoring > >>> on the upgraded node and > >>> Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid packet > >>> data > >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming packet > >>> has different crypto type. Rejecting > >>> Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received message > >>> has invalid digest... ignoring. > >>> on the pre-upgrade node. > >>> > >>> Is there a good way to do this upgrade? > >> > >> Usually best way is to start from scratch in testing environment to make > >> sure everything works as expected. Then you can shutdown current > >> cluster, upgrade and start it again - config file is mostly compatible, > >> you may just consider changing transport to knet. I don't think there is > >> any definitive guide to do upgrade without shutting down whole cluster, > >> but somebody else may have idea. > >> > >> Regards, > >> Honza > >> > >>> I would appreciate it very much if you could point me to any > >>> documentation or articles on this issue. > >>> Thank you very much! > >>> _Vitaly > >>> ___ > >>> Manage your subscription: > >>> https://lists.clusterlabs.org/mailman/listinfo/users > >>> > >>> ClusterLabs home: https://www.clusterlabs.org/ > >>> > > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Thank you very much for your help! We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that it may work with rolling upgrade (we were fooled by the same major version (2)). Our fresh install works fine on V3.0.3-5. Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 on Fedora 22 so that I upgrade the stack before starting "real" upgrade of the product? Then I can do the following sequence: 1. "quick" full shutdown for HA stack upgrade to 3.0 version 2. start HA stack on the old OS and product version with Pacemaker 3.0.3 and bring the product online 3. start rolling upgrade for product upgrade to the new OS and product version Thanks again for your help! _Vitaly > On June 11, 2020 3:30 AM Jan Friesse wrote: > > > Vitaly, > > > Hello everybody. > > We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync > > 2.99+. It looks like they are not compatible and we are getting messages > > like: > > Yes, they are not wire compatible. Also please do not use 2.99 versions, > these were alfa/beta/rc before 3.0 and 3.0 is actually quite a long time > released (3.0.4 is latest and I would recommend using it - there were > quite a few important bugfixes between 3.0.0 and 3.0.4) > > > > Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message received > > from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. > > Ignoring > > on the upgraded node and > > Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid packet data > > Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming packet > > has different crypto type. Rejecting > > Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received message > > has invalid digest... ignoring. > > on the pre-upgrade node. > > > > Is there a good way to do this upgrade? > > Usually best way is to start from scratch in testing environment to make > sure everything works as expected. 
Then you can shutdown current > cluster, upgrade and start it again - config file is mostly compatible, > you may just consider changing transport to knet. I don't think there is > any definitive guide to do upgrade without shutting down whole cluster, > but somebody else may have idea. > > Regards, >Honza > > > I would appreciate it very much if you could point me to any documentation > > or articles on this issue. > > Thank you very much! > > _Vitaly > > ___ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?
Hello everybody. We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 2.99+. It looks like they are not compatible and we are getting messages like: Jun 11 02:10:20 d21-22-left corosync[6349]: [TOTEM ] Message received from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring on the upgraded node and Jun 11 01:02:37 d21-22-right corosync[14912]: [TOTEM ] Invalid packet data Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Incoming packet has different crypto type. Rejecting Jun 11 01:02:38 d21-22-right corosync[14912]: [TOTEM ] Received message has invalid digest... ignoring. on the pre-upgrade node. Is there a good way to do this upgrade? I would appreciate it very much if you could point me to any documentation or articles on this issue. Thank you very much! _Vitaly ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Fence_sbd script in Fedora30?
Thank you Everybody for quick reply! I was a little confused that fence_sbd script was removed from sbd package. I guess now it is living only in fence_agents. Also, I was looking for some guidance on new (for me) parameters in fence_sbd, but I think I have figured that out. Another problem I have is that we modify scripts to work with our hardware and I am in the process of going through these changes. Thanks again! _Vitaly > On September 24, 2019 at 12:29 AM Andrei Borzenkov > wrote: > > > 23.09.2019 23:23, Vitaly Zolotusky пишет: > > Hello, > > I am trying to upgrade to Fedora 30. The platform is two node cluster with > > pacemaker. > > It Fedora 28 we were using old fence_sbd script from 2013: > > > > # This STONITH script drives the shared-storage stonith plugin. > > # Copyright (C) 2013 Lars Marowsky-Bree > > > > We were overwriting the distribution script in custom built RPM with the > > one from 2013. > > It looks like there is no fence_sbd script any more in the agents source > > and some apis changed so that the old script would not work. > > Do you have any documentation / suggestions on how to move from old > > fence_sbd script to the latest? > > What's wrong with external/sbd stonith resource? > ___ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Fence_sbd script in Fedora30?
Hello, I am trying to upgrade to Fedora 30. The platform is a two-node cluster with Pacemaker. In Fedora 28 we were using the old fence_sbd script from 2013:

# This STONITH script drives the shared-storage stonith plugin.
# Copyright (C) 2013 Lars Marowsky-Bree

We were overwriting the distribution script in a custom-built RPM with the one from 2013. It looks like there is no fence_sbd script any more in the agents source, and some APIs changed so that the old script would not work. Do you have any documentation / suggestions on how to move from the old fence_sbd script to the latest? Thank you very much! _Vitaly
Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.
Chris, Thanks a lot for the info. I'll explore both options.
_Vitaly

> On December 18, 2018 at 11:13 AM Chris Walker wrote:
>
> Looks like rhino66-left was scheduled for fencing because it was not present 20 seconds (the dc-deadtime parameter) after rhino66-right started Pacemaker (startup fencing). I can think of a couple of ways to allow all nodes to survive if they come up far apart in time (i.e., farther apart than dc-deadtime):
>
> 1. Increase dc-deadtime. Unfortunately, the cluster always waits for dc-deadtime to expire before starting resources, so this can delay your cluster's startup.
>
> 2. As Ken mentioned, synchronize the starting of Corosync and Pacemaker. I did this with a simple ExecStartPre systemd script:
>
> [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root@bug0 ~]#
>
> where ha_wait.sh has something like:
>
> #!/bin/bash
>
> timeout=600
> peer=
>
> echo "Waiting for ${peer}"
> peerup() {
>     systemctl -H ${peer} show -p ActiveState corosync.service 2> /dev/null | \
>         egrep -q "=active|=reloading|=failed|=activating|=deactivating" && return 0
>     return 1
> }
>
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>     echo -n .
>     sleep 5
> done
>
> peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout}s, starting HA alone"
>
> This will cause corosync startup to block for 10 minutes waiting for the partner node to come up, after which both nodes will start corosync/pacemaker close together in time. If one node never comes up, then the surviving node will wait 10 minutes before starting, after which the other node will be fenced (startup fencing and subsequent resource startup will only occur if no-quorum-policy is set to ignore).
>
> HTH,
> Chris
>
> On 12/17/18 6:25 PM, Vitaly Zolotusky wrote:
> > Ken, Thank you very much for the quick response!
> > I do have "two_node: 1" in corosync.conf. I have attached it to this email (not from the same system as the original messages, but they are all the same).
> > Syncing the startup of corosync and pacemaker on different nodes would be a problem for us.
> > I suspect that the problem is that corosync assumes quorum is reached as soon as corosync is started on both nodes, but pacemaker does not abort fencing until pacemaker starts on the other node.
> > I will try to work around this issue by moving the corosync and pacemaker startups on a single node as close to each other as possible.
> > Thanks again!
> > _Vitaly
> >
> > > On December 17, 2018 at 6:01 PM Ken Gaillot <mailto:kgail...@redhat.com> wrote:
> > >
> > > On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > > > Hello,
> > > > I have a 2 node cluster, and stonith is configured for SBD and fence_ipmilan.
> > > > fence_ipmilan for node 1 is configured with a 0 delay and for node 2 with a 30 sec delay so that the nodes do not start killing each other during startup.
> > >
> > > If you're using corosync 2 or later, you can set "two_node: 1" in corosync.conf. That implies the wait_for_all option, so that at start-up, both nodes must be present before quorum can be reached the first time. (After that point, one node can go away and quorum will be retained.)
> > >
> > > Another way to avoid this is to start corosync on all nodes, then start pacemaker on all nodes.
> > >
> > > > In some cases (usually right after installation, and when node 1 comes up first and node 2 second) the node that comes up first (node 1) states that node 2 is unclean, but can't fence it until quorum is reached.
> > > > Then, as soon as quorum is reached after startup of corosync on node 2, it sends a fence request for node 2.
> > > > Fence_ipmilan gets into the 30 sec delay.
> > > > Pacemaker gets started on node 2.
> > > > While fence_ipmilan is still waiting out the delay, node 1's crmd aborts the transition that requested the fence.
> > > > Even though the transition was aborted, when the delay time expires node 2 gets fenced.
> > >
> > > Currently, pacemaker has no way of cancelling fencing once it's been initiated. Technically, it would be possible to cancel an operation that's in the delay stage (assuming that no other fence device has already been attempted, if there is more than one), but that hasn't been implemented.
> > >
> > > > Excerpts from the messages are below. I also attached messages from both nodes and pe-input files from node 1.
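Chris's ha_wait.sh boils down to a "poll until a probe succeeds or a timeout expires" loop. A minimal, self-contained sketch of that pattern (the demo probes `true`/`false` stand in for the real `systemctl -H ${peer}` check, and the function name `wait_for` is mine, not from the original script):

```shell
#!/bin/bash
# wait_for CMD TIMEOUT: poll CMD once per second; return 0 as soon as
# CMD succeeds, or 1 once TIMEOUT seconds have elapsed. Same shape as
# the peerup/while loop in ha_wait.sh above.
wait_for() {
    local probe=$1 timeout=$2 start=${SECONDS}
    while ! "${probe}"; do
        [ $((SECONDS - start)) -ge "${timeout}" ] && return 1
        sleep 1
    done
    return 0
}

# Demo: "true" succeeds immediately, "false" never does.
wait_for true 2 && echo "peer is up"
wait_for false 2 || echo "peer not up after 2s"
```

In the real script the probe would be the `systemctl -H` pipeline and the timeout 600 seconds; the structure is otherwise identical.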
Re: [ClusterLabs] Antw: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.
Ulrich, Thank you very much for the suggestion. My guess is that node 2 is considered unclean because both nodes were rebooted without pacemaker's knowledge after installation. For our appliance it should be OK, as we are supposed to survive multiple hardware and power failures. So this case is just an indication that something is not working right, and we would like to get to the bottom of it.
Thanks again!
_Vitaly

> On December 18, 2018 at 1:47 AM Ulrich Windl wrote:
>
> >>> Vitaly Zolotusky wrote on 17.12.2018 at 21:43 in message
> >>> <1782126841.215210.1545079428...@webmail6.networksolutionsemail.com>:
> > Hello,
> > I have a 2 node cluster, and stonith is configured for SBD and fence_ipmilan.
> > fence_ipmilan for node 1 is configured with a 0 delay and for node 2 with a 30 sec delay so that the nodes do not start killing each other during startup.
> > In some cases (usually right after installation, and when node 1 comes up first and node 2 second) the node that comes up first (node 1) states that node 2 is unclean, but can't fence it until quorum is reached.
>
> I'd concentrate on examining why node2 is considered unclean. Of course that doesn't fix the issue, but if fixing it takes some time, you'll have a work-around ;-)
>
> > Then, as soon as quorum is reached after startup of corosync on node 2, it sends a fence request for node 2.
> > Fence_ipmilan gets into the 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting out the delay, node 1's crmd aborts the transition that requested the fence.
> > Even though the transition was aborted, when the delay time expires node 2 gets fenced.
> > Excerpts from the messages are below. I also attached messages from both nodes and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> >
> > Here are excerpts from the messages:
> >
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up first
> > *
> > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Fencing and resource management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Node rhino66-left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] This node is within the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] Members[2]: 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [MAIN ] Completed service synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Node (null) state is now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Node (null) state is now member
> > Nov 29 16:48:54 rhino66-right crmd[22184]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: Watchdog will be used via SBD if fencing is required
> > Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Scheduling Node rhino66-left.lab.archivas.com for STONITH
> > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Fence (reboot) rhino66-left.lab.archivas.com 'node is unclean'
> > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start &
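The 0-second/30-second arrangement quoted above is normally expressed through the fence agent's `delay` parameter, which makes one device pause before acting so the two nodes cannot shoot each other at the same instant. A hypothetical pcs sketch (device names, addresses, and credentials are placeholders, and the exact option spellings — `ipaddr` vs. `ip`, `login` vs. `username` — vary between fence-agents versions, so treat this as a configuration sketch rather than copy-paste commands):

```shell
# Hypothetical sketch: staggered static delays for a two-node fence race.
# Node 1's fence device acts immediately; node 2's waits 30 seconds.
pcs stonith create fence-node1 fence_ipmilan \
    ipaddr=10.0.0.1 login=admin passwd=secret \
    pcmk_host_list=node1 delay=0
pcs stonith create fence-node2 fence_ipmilan \
    ipaddr=10.0.0.2 login=admin passwd=secret \
    pcmk_host_list=node2 delay=30
```

These commands only make sense against a live cluster; the point is the asymmetric `delay` values, not the placeholder credentials.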
Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.
Ken, Thank you very much for the quick response!
I do have "two_node: 1" in corosync.conf. I have attached it to this email (not from the same system as the original messages, but they are all the same).
Syncing the startup of corosync and pacemaker on different nodes would be a problem for us.
I suspect that the problem is that corosync assumes quorum is reached as soon as corosync is started on both nodes, but pacemaker does not abort fencing until pacemaker starts on the other node.
I will try to work around this issue by moving the corosync and pacemaker startups on a single node as close to each other as possible.
Thanks again!
_Vitaly

> On December 17, 2018 at 6:01 PM Ken Gaillot wrote:
>
> On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > Hello,
> > I have a 2 node cluster, and stonith is configured for SBD and fence_ipmilan.
> > fence_ipmilan for node 1 is configured with a 0 delay and for node 2 with a 30 sec delay so that the nodes do not start killing each other during startup.
>
> If you're using corosync 2 or later, you can set "two_node: 1" in corosync.conf. That implies the wait_for_all option, so that at start-up, both nodes must be present before quorum can be reached the first time. (After that point, one node can go away and quorum will be retained.)
>
> Another way to avoid this is to start corosync on all nodes, then start pacemaker on all nodes.
>
> > In some cases (usually right after installation, and when node 1 comes up first and node 2 second) the node that comes up first (node 1) states that node 2 is unclean, but can't fence it until quorum is reached.
> > Then, as soon as quorum is reached after startup of corosync on node 2, it sends a fence request for node 2.
> > Fence_ipmilan gets into the 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting out the delay, node 1's crmd aborts the transition that requested the fence.
> > Even though the transition was aborted, when the delay time expires node 2 gets fenced.
>
> Currently, pacemaker has no way of cancelling fencing once it's been initiated. Technically, it would be possible to cancel an operation that's in the delay stage (assuming that no other fence device has already been attempted, if there is more than one), but that hasn't been implemented.
>
> > Excerpts from the messages are below. I also attached messages from both nodes and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> >
> > Here are excerpts from the messages:
> >
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up first
> > *
> > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Fencing and resource management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Node rhino66-left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] This node is within the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] Members[2]: 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]: [MAIN ] Completed service synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Node (null) state is now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Node (null)
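Ken's "two_node: 1" suggestion above corresponds to a corosync.conf quorum section like the following (a fragment only; the totem and nodelist sections are elided, and the comment paraphrases the behavior Ken describes):

```conf
quorum {
    provider: corosync_votequorum
    # two_node implies wait_for_all: at first startup, both nodes must
    # be seen before quorum is attained; afterwards one node may leave
    # and quorum is retained.
    two_node: 1
}
```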