[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF - postgres
>>> damiano giuliani wrote on 13.07.2021 at 13:42:
> Hi guys,
> I'm back with some PAF postgres cluster problems. Tonight the cluster fenced the master node and promoted the PAF resource to a new node. Everything went fine, but I really don't know why.
> This morning I noticed the old master was fenced by sbd and a new master was promoted; this happened tonight at 00.40.XX. Filtering the logs, I can't find any reason for the old master being fenced or for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us is able to determine the real reason. The cluster worked flawlessly for days with no issues, till now. It is crucial for me to understand why this switch occurred.
> I attached the current status, configuration and logs. On the old master node's log I can't find any reason; on the new master the only things are the fencing and the promotion.
>
> PS: could this be the reason for the fencing?

First, I think your timeouts are rather aggressive. Hope there are no virtual machines involved.

Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1

This may be a networking problem, or the other node died for some unknown reason. That's the reason for the fencing:

Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean

You said there is no reason for fencing, but here it is!

Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'

The fencing timing is also quite aggressive IMHO. Could it be that a command saturated the network?

Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute : SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN ism_files f ON f.id_file = xmf.file_id JOIN ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;

Regards,
Ulrich

> grep -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>
> Any thought and help is really appreciated.
>
> Damiano

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
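Ulrich's "rather aggressive" remark can be checked arithmetically. For watchdog-based (diskless) sbd, the usual guidance is stonith-watchdog-timeout >= 2 * SBD_WATCHDOG_TIMEOUT; the values in this thread's attached config (5 s and 10 s) sit exactly at that lower bound, with no headroom. A minimal sketch of the check, with the values hard-coded from the thread rather than read from a live cluster:

```shell
#!/bin/sh
# Timeouts from the attached config: SBD_WATCHDOG_TIMEOUT=5 and
# stonith-watchdog-timeout=10s. On a live node you would read these
# from /etc/sysconfig/sbd and the cluster properties instead.
sbd_watchdog_s=5
stonith_watchdog_s=10

# Common guidance for diskless sbd: the cluster-level timeout should be
# at least twice the sbd watchdog timeout.
min_required_s=$((2 * sbd_watchdog_s))

if [ "$stonith_watchdog_s" -ge "$min_required_s" ]; then
  echo "OK: ${stonith_watchdog_s}s >= ${min_required_s}s (but exactly 2x leaves no headroom)"
else
  echo "TOO LOW: stonith-watchdog-timeout ${stonith_watchdog_s}s < ${min_required_s}s"
fi
```

With the thread's values this prints the "OK" branch; raising SBD_WATCHDOG_TIMEOUT without also raising stonith-watchdog-timeout would flip it to "TOO LOW".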
Re: [ClusterLabs] pcs update resource command not working
Hi Tomas,

Thanks for the response. As you said, specifying an empty value for an option is the syntax for removing the option, which means there is no way to set an option to an empty string value using pcs.

But our problem is that we have disabled the root user account by default, so we are not sure which user can be given here; and if that given user account gets disabled or its password expires, what will be the impact on this cluster-monitoring SNMP service, and so on.

Thanks and Regards,
S Sathish S
Re: [ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres
On 13.07.2021 23:09, damiano giuliani wrote:
> Hi Klaus, thanks for helping; I'm quite lost because I can't find out the causes. I attached the corosync logs of all three nodes, hoping you guys can find something and give me a hint about what I can't see. I really appreciate the effort.
> The old master's log seems cut at 00:38, so nothing interesting. The new master and the third slave logged what happened, but I can't figure out why the old master went lost.

The reason it was lost is most likely outside of pacemaker. You need to check other logs on the node that was lost, maybe the BMC if this is bare metal, or the hypervisor if it is a virtualized system.

All that these logs say is that ltaoperdbs02 was lost from the point of view of the two other nodes. It happened at the same time (around Jul 13 00:40), which suggests ltaoperdbs02 indeed had some problem. Whether it was a software crash, hardware failure or network outage cannot be determined from these logs.

> Something interesting could be the stonith logs of the new master and the third slave:
>
> NEW MASTER:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02 state is now lost
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
>
> THIRD SLAVE:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Node ltaoperdbs02 state is now lost
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]: notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
>
> I really appreciate the help and what you think about it.
>
> PS: the stonith should be set to 10s (pcs property set stonith-watchdog-timeout=10s); are you suggesting a different setting?
>
> On Tue, Jul 13, 2021 at 14:29 Klaus Wenninger <kwenn...@redhat.com> wrote:
>
>> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <damianogiulian...@gmail.com> wrote:
>>
>>> Hi guys,
>>> I'm back with some PAF postgres cluster problems. Tonight the cluster fenced the master node and promoted the PAF resource to a new node. Everything went fine, but I really don't know why.
>>> This morning I noticed the old master was fenced by sbd and a new master was promoted; this happened tonight at 00.40.XX. Filtering the logs, I can't find any reason for the old master being fenced or for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us is able to determine the real reason. The cluster worked flawlessly for days with no issues, till now. It is crucial for me to understand why this switch occurred.
>>> I attached the current status, configuration and logs. On the old master node's log I can't find any reason; on the new master the only things are the fencing and the promotion.
>>>
>>> PS: could this be the reason for the fencing?
>>>
>>> grep -e sbd /var/log/messages
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>>
>> That was yesterday afternoon and not 0:40 today in the morning. With the watchdog-timeout set to 5s this may have been tight though. Maybe check your other nodes for similar warnings, or check the compressed warnings. Maybe you can as well check the journal of sbd after start to see if it managed to run rt-scheduled.
>> Is this a bare-metal setup or running on some hypervisor?
>> Unfortunately I'm not enough into postgres to tell if there is anything interesting about the last messages shown before the suspected watchdog reboot.
>> Was there some administrative stuff done by ltauser before the reboot? If yes, what?
>>
>> Regards,
>> Klaus
>>
>>> Any
Re: [ClusterLabs] QDevice vs 3rd host for majority node quorum
In some cases the third location has a single IP and it makes sense to use it as a QDevice. If it has multiple network connections to that location, use a full-blown node.

Best Regards,
Strahil Nikolov

On Tue, Jul 13, 2021 at 20:44, Andrei Borzenkov wrote:
> On 13.07.2021 19:52, Gerry R Sommerville wrote:
>> Hello everyone,
>> I am currently comparing using QDevice vs adding a 3rd host to my even-number-node cluster and I am wondering about the details concerning network communication.
>> For example, say my cluster is utilizing multiple heartbeat rings. Would the QDevice take into account and use the IPs specified in the different rings?
>
> No.
>
>> Or does it only use the one specified under the quorum directive for QDevice?
>
> Yes. Remote device is unrelated to corosync rings. Qdevice receives information of current cluster membership from all nodes (point of view), computes partitions and selects partition that will remain quorate.
Re: [ClusterLabs] QDevice vs 3rd host for majority node quorum
On 13.07.2021 19:52, Gerry R Sommerville wrote:
> Hello everyone,
> I am currently comparing using QDevice vs adding a 3rd host to my even-number-node cluster and I am wondering about the details concerning network communication.
> For example, say my cluster is utilizing multiple heartbeat rings. Would the QDevice take into account and use the IPs specified in the different rings?

No.

> Or does it only use the one specified under the quorum directive for QDevice?

Yes. Remote device is unrelated to corosync rings. Qdevice receives information of current cluster membership from all nodes (point of view), computes partitions and selects partition that will remain quorate.
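The "quorum directive" Andrei refers to is the device section of corosync.conf. A minimal sketch of what that looks like (the qnetd host address and the algorithm choice here are illustrative, not taken from the poster's setup):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            # Address of the corosync-qnetd server. This is the only
            # address qdevice connects to; it is independent of the
            # ring0/ring1 addresses configured in the nodelist.
            host: 192.0.2.50
            algorithm: ffsplit
        }
    }
}
```

With this in place, losing a heartbeat ring does not affect the qdevice link; only reachability of the qnetd host matters for the tie-breaker vote.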
[ClusterLabs] QDevice vs 3rd host for majority node quorum
Hello everyone,

I am currently comparing using QDevice vs adding a 3rd host to my even-number-node cluster and I am wondering about the details concerning network communication. For example, say my cluster is utilizing multiple heartbeat rings. Would the QDevice take into account and use the IPs specified in the different rings? Or does it only use the one specified under the quorum directive for QDevice?

Gerry Sommerville
E-mail: ge...@ca.ibm.com
Re: [ClusterLabs] heartbeat rings questions
On 12/07/2021 23:27, Kiril Pashin wrote:
> Hi, is it valid to use the same network adapter interface on the same host to be part of multiple heartbeat rings?

There should be no problem from the technical side, but I wouldn't call this use case "valid". The idea of multiple rings is to have multiple independent connections between nodes, something one NIC simply doesn't provide.

> The scenario is hostA has eth0 ( ip 192.10.10.1 ) interface and hostB has eth0 ( 192.20.20.1 ) and eth1 ( 192.20.20.2 ).

This is an unsupported configuration.

> Are there any restrictions to form two heartbeat rings { eth0, eth0 } and { eth0, eth1 }

Technically the only restriction is based on IP. But to make it reliable one should use multiple NICs with multiple links and multiple switches.

> as well as create a nozzle device to be able to ping hostA in case eth0 or eth1 go down on hostB

It is definitely possible to create a nozzle device which will allow pinging hostA in case some NIC fails, but not in the way described in the config snippet. The nozzle device should have a different IP subnet (nozzle is basically yet another network card).

> nodelist {
>   node {
>     ring0_addr: 192.10.10.1
>     ring1_addr: 192.10.10.1
>     name: node1
>     nodeid: 1
>   }
>   node {
>     ring0_addr: 192.20.20.1
>     ring1_addr: 192.20.20.2
>     name: node2
>     nodeid: 2
>   }
> }
> nozzle {
>   name: noz01
>   ipaddr: 192.168.10.0
>   ipprefix: 24
> }

This config will definitely not work.

Regards,
Honza

> Thanks,
> Kiril Pashin
> DB2 Purescale Development & Support
> kir...@ca.ibm.com
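Putting Honza's two corrections together (each ring needs its own interface and address on every node, and the problem in the posted snippet is node1 reusing the same IP for both rings), a shape that could work might look like the following. The 192.30.30.x addresses are purely hypothetical and assume a second NIC is added to hostA; the nozzle subnet is kept distinct from both rings, as the posted config already had it:

```
nodelist {
    node {
        ring0_addr: 192.10.10.1
        ring1_addr: 192.30.30.1    # hypothetical second NIC on hostA, not eth0 again
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.20.20.1
        ring1_addr: 192.30.30.2    # ring1 peers must be mutually reachable
        name: node2
        nodeid: 2
    }
}
nozzle {
    name: noz01
    ipaddr: 192.168.10.0           # a subnet used by neither ring
    ipprefix: 24
}
```

This is only a sketch of the addressing scheme, not a tested configuration; the substantive point is one distinct interface per ring per node.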
[ClusterLabs] heartbeat rings questions
Hi, is it valid to use the same network adapter interface on the same host to be part of multiple heartbeat rings?

The scenario is hostA has eth0 ( ip 192.10.10.1 ) interface and hostB has eth0 ( 192.20.20.1 ) and eth1 ( 192.20.20.2 ). Are there any restrictions to form two heartbeat rings { eth0, eth0 } and { eth0, eth1 }, as well as create a nozzle device to be able to ping hostA in case eth0 or eth1 go down on hostB?

nodelist {
  node {
    ring0_addr: 192.10.10.1
    ring1_addr: 192.10.10.1
    name: node1
    nodeid: 1
  }
  node {
    ring0_addr: 192.20.20.1
    ring1_addr: 192.20.20.2
    name: node2
    nodeid: 2
  }
}
nozzle {
  name: noz01
  ipaddr: 192.168.10.0
  ipprefix: 24
}

Thanks,
Kiril Pashin
DB2 Purescale Development & Support
kir...@ca.ibm.com
Re: [ClusterLabs] @ maillist Admins - DMARC (yahoo)
On Tue, 2021-07-13 at 14:46 +0300, Andrei Borzenkov wrote:
> On Mon, Jul 12, 2021 at 5:50 PM wrote:
>> On Sat, 2021-07-10 at 12:34 +0100, lejeczek wrote:
>>> Hi Admins (of this mailing list)
>>>
>>> Could you please fix in DMARC(s) so those of us who are on Yahoo would be able to receive own emails/thread.
>>>
>>> many thanks, L.
>>
>> I suppose we should do something, since this is likely to be more of an issue as time goes on. Unfortunately, it's not as simple as flipping a switch. These are the two reasonable choices:
>
> The problem is, both are incomplete.

Unfortunately that's correct, there is no way to make everybody happy.

>> (1) Change the "From" on list messages so that they appear to be from the list, rather than the poster. For example, your posts would show up as "From: lejeczek via ClusterLabs Users" rather than "From: lejeczek". This is less intrusive but makes it more difficult to reply directly to the sender, add the sender to an address book, etc.
>
> This will pass SPF but fail DKIM.
>
>> (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to the list instead of original author, and adding the list signature. This is more standards-compliant, since the List-* headers can still be used for filtering, unsubscribing, and replying to the list, but not all mail clients make those easy to use.
>
> This will pass DKIM but fail SPF.
> I do not know how many domains implement only SPF, only DKIM or both.

Yes, that is a well-known issue, and for that reason I think it's very rare for DMARC to be used with only one of them, even though the DMARC standard allows that. (A domain that doesn't use DMARC can use just one of SPF or DKIM without problems -- in fact, clusterlabs.org uses SPF but not DKIM or DMARC.)

>> Anyone have preferences for one over the other?
>>
>> (Less reasonable options include wrapping every post in MIME, and disallowing users from DMARC domains to post to the list.)
>
> Well, enabling ARC in addition to either of the options may somehow mitigate them. It depends on *recipient* domain support though. Also I am not sure whether Mailman 2.x supports it.
>
> From a personal perspective, I already filter by list ids and [ClusterLabs] just wastes screen real estate. But I remember somewhat heated responses when openSUSE changed list software and dropped prefixes - apparently quite some users were using single mailbox and relied on prefixes to prioritize message reading.

Yes, I worry about that. But I'm afraid the technology has evolved to a point where that's no longer a reasonable approach to email. I am leaning to (2) for that reason; it seems to be the more standards-compliant and future-oriented approach. I know the new format would take some getting used to, but hopefully it smooths out over time.

--
Ken Gaillot
Re: [ClusterLabs] Antw: [EXT] Re: @ maillist Admins - DMARC (yahoo)
On Tue, 2021-07-13 at 10:23 +0200, Ulrich Windl wrote:
> >>> wrote on 12.07.2021 at 16:50 in message <08471514b28d1e3f6859707f5951f07887336865.ca...@redhat.com>:
>> On Sat, 2021-07-10 at 12:34 +0100, lejeczek wrote:
>>> Hi Admins (of this mailing list)
>>>
>>> Could you please fix in DMARC(s) so those of us who are on Yahoo would be able to receive own emails/thread.
>>>
>>> many thanks, L.
>>
>> I suppose we should do something, since this is likely to be more of an issue as time goes on. Unfortunately, it's not as simple as flipping a switch. These are the two reasonable choices:
>>
>> (1) Change the "From" on list messages so that they appear to be from the list, rather than the poster. For example, your posts would show up as "From: lejeczek via ClusterLabs Users" rather than "From: lejeczek". This is less intrusive but makes it more difficult to reply directly to the sender, add the sender to an address book, etc.
>>
>> (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to the list instead of original author, and adding the list signature. This is more standards-compliant, since the List-* headers can still be used for filtering, unsubscribing, and replying to the list, but not all mail clients make those easy to use.
>>
>> Anyone have preferences for one over the other?
>
> I have no idea about DMARC, so I'm qualified for an opinion ;-)
> My guess is that the changes mentioned to the original message make the DMARC signature invalid.

Right.

> IMHO the best solution would be to (if at all) check DMARC on receipt, and "re-sign" before sending it out to the list.

Only the sender's domain mailers have the signing key. Once our mailing list server receives it, we can't modify the existing body or headers without breaking the DMARC (DKIM) signature. (Changing the "From" works because at that point the message is no longer from the DMARC-protected domain.)

> Regards,
> Ulrich

>> (Less reasonable options include wrapping every post in MIME, and disallowing users from DMARC domains to post to the list.)

--
Ken Gaillot
Re: [ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres
On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <damianogiulian...@gmail.com> wrote:

> Hi guys,
> I'm back with some PAF postgres cluster problems. Tonight the cluster fenced the master node and promoted the PAF resource to a new node. Everything went fine, but I really don't know why.
> This morning I noticed the old master was fenced by sbd and a new master was promoted; this happened tonight at 00.40.XX. Filtering the logs, I can't find any reason for the old master being fenced or for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us is able to determine the real reason. The cluster worked flawlessly for days with no issues, till now. It is crucial for me to understand why this switch occurred.
> I attached the current status, configuration and logs. On the old master node's log I can't find any reason; on the new master the only things are the fencing and the promotion.
>
> PS: could this be the reason for the fencing?
>
> grep -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)

That was yesterday afternoon and not 0:40 today in the morning. With the watchdog-timeout set to 5s this may have been tight though. Maybe check your other nodes for similar warnings, or check the compressed warnings. Maybe you can as well check the journal of sbd after start to see if it managed to run rt-scheduled.

Is this a bare-metal setup or running on some hypervisor?

Unfortunately I'm not enough into postgres to tell if there is anything interesting about the last messages shown before the suspected watchdog reboot. Was there some administrative stuff done by ltauser before the reboot? If yes, what?

Regards,
Klaus

> Any thought and help is really appreciated.
>
> Damiano
Re: [ClusterLabs] @ maillist Admins - DMARC (yahoo)
On Mon, Jul 12, 2021 at 5:50 PM wrote:
> On Sat, 2021-07-10 at 12:34 +0100, lejeczek wrote:
>> Hi Admins (of this mailing list)
>>
>> Could you please fix in DMARC(s) so those of us who are on Yahoo would be able to receive own emails/thread.
>>
>> many thanks, L.
>
> I suppose we should do something, since this is likely to be more of an issue as time goes on. Unfortunately, it's not as simple as flipping a switch. These are the two reasonable choices:

The problem is, both are incomplete.

> (1) Change the "From" on list messages so that they appear to be from the list, rather than the poster. For example, your posts would show up as "From: lejeczek via ClusterLabs Users" rather than "From: lejeczek". This is less intrusive but makes it more difficult to reply directly to the sender, add the sender to an address book, etc.

This will pass SPF but fail DKIM.

> (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to the list instead of original author, and adding the list signature. This is more standards-compliant, since the List-* headers can still be used for filtering, unsubscribing, and replying to the list, but not all mail clients make those easy to use.

This will pass DKIM but fail SPF. I do not know how many domains implement only SPF, only DKIM or both.

> Anyone have preferences for one over the other?
>
> (Less reasonable options include wrapping every post in MIME, and disallowing users from DMARC domains to post to the list.)

Well, enabling ARC in addition to either of the options may somehow mitigate them. It depends on *recipient* domain support though. Also I am not sure whether Mailman 2.x supports it.

From a personal perspective, I already filter by list ids and [ClusterLabs] just wastes screen real estate. But I remember somewhat heated responses when openSUSE changed list software and dropped prefixes - apparently quite some users were using single mailbox and relied on prefixes to prioritize message reading.
[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres
Hi guys,

I'm back with some PAF postgres cluster problems. Tonight the cluster fenced the master node and promoted the PAF resource to a new node. Everything went fine, but I really don't know why.

This morning I noticed the old master was fenced by sbd and a new master was promoted; this happened tonight at 00.40.XX. Filtering the logs, I can't find any reason for the old master being fenced or for the start of the promotion of the new master (which seems to have gone perfectly). I'm a bit lost, because none of us is able to determine the real reason. The cluster worked flawlessly for days with no issues, till now. It is crucial for me to understand why this switch occurred.

I attached the current status, configuration and logs. On the old master node's log I can't find any reason; on the new master the only things are the fencing and the promotion.

PS: could this be the reason for the fencing?

grep -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)

Any thought and help is really appreciated.
Damiano

pcs status
Cluster name: ltaoperdbscluster
Stack: corosync
Current DC: ltaoperdbs03 (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Tue Jul 13 10:06:01 2021
Last change: Tue Jul 13 00:41:05 2021 by root via crm_attribute on ltaoperdbs03

3 nodes configured
4 resource instances configured

Online: [ ltaoperdbs03 ltaoperdbs04 ]
OFFLINE: [ ltaoperdbs02 ]

Full list of resources:
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ ltaoperdbs03 ]
     Slaves: [ ltaoperdbs04 ]
     Stopped: [ ltaoperdbs02 ]
 pgsql-master-ip (ocf::heartbeat:IPaddr2): Started ltaoperdbs03

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

[root@ltaoperdbs03 pengine]# pcs config show
Cluster Name: ltaoperdbscluster
Corosync Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04
Pacemaker Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-13/bin pgdata=/workspace/pdgs-db/13/data pgport=5432
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=25s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=25s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=172.18.2.10
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ltaoperdbscluster
 dc-version: 1.1.23-1.el7-9acf116022
 have-watchdog: true
 last-lrm-refresh: 1625090339
 stonith-enabled: true
 stonith-watchdog-timeout: 10s
Quorum:
 Options:

stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

SBD CONFIG
grep -v \# /etc/sysconfig/sbd | sort | uniq
SBD_DELAY_START=no
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_TIMEOUT_ACTION=flush,reboot
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

ltaoperdbs03 cluster]# stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:40:" /var/log/messages
Jul 13 00:40:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started
Re: [ClusterLabs] pcs update resource command not working
On 09. 07. 21 at 7:29, S Sathish S wrote:
> Hi Team,
> we have found the cause of this problem as per the below changelog: the pcs resource update command doesn't support empty meta_attributes anymore.
> https://github.com/ClusterLabs/pcs/blob/0.9.169/CHANGELOG.md
> "pcs resource update does not create an empty meta_attributes element any more (rhbz#1568353)"

This bz is not related to your issue.

> [root@node01 testadmin]# pcs resource update SNMP_node01 user='' extra_options="-E /opt/occ/CXP/tools/PCSXXX.sh"
>
> Later we modified it to the below command and it worked for us:
>
> [root@node01 testadmin]# pcs resource update SNMP_node01 user='root' extra_options="-E /opt/occ/CXP/tools/PCSXXX.sh"

The commands work as expected: user='root' sets the value of 'user' to 'root', user='' deletes 'user'. Specifying an empty value for an option is a syntax for removing the option. Yes, this means there is no way to set an option to an empty string value using pcs.

> But our problem is we have disabled the root user account by default, so we are not sure which user can be given here, and if that given user account gets disabled or its password expires, what will be the impact on this cluster-monitoring SNMP service, and so on.
>
> While doing pcs resource create SNMP with an empty string for the user attribute, it works for us; one more difference we noticed.

'pcs resource create' allows setting empty string values. It is a known bug which we track and will fix eventually.

Regards,
Tomas

> Query:
> 1) Is it recommended to create the SNMP ClusterMon resource type with an empty user attribute?
> 2) If not, and we update the resource with some user and that user account gets disabled or its password expires, what will be the impact on this cluster-monitoring SNMP service, and so on?
>
> Thanks and Regards,
> S Sathish
[ClusterLabs] Antw: [EXT] Re: Q: Prevent non-live VM migration
>>> schrieb am 12.07.2021 um 16:53 in Nachricht <376475adc8217a97adf9374d0aad0317eabd5f90.ca...@redhat.com>: > On Mon, 2021‑07‑12 at 08:35 +0200, Ulrich Windl wrote: >> Hi! >> >> We had some problem in the cluster that prevented live migration of >> VMs. As a consequence the cluster migrated the VMs using stop/start. For completeness, it tuned out that the libvirt install script eats up the "--listen" parameter that is required when starting libvirt from the cluster: # The '--listen' option is incompatible with socket activation. # We need to forcibly remove it from /etc/sysconfig/libvirtd. # Also add the --timeout option to be consistent with upstream. # See boo#1156161 for details sed -i -e '/^\s*LIBVIRTD_ARGS=/s/--listen//g' /etc/sysconfig/libvirtd if ! grep -q -E '^\s*LIBVIRTD_ARGS=.*--timeout' /etc/sysconfig/libvirtd ; then sed -i 's/^\s*LIBVIRTD_ARGS="\(.*\)"/LIBVIRTD_ARGS="\1 --timeout 120"/' /etc/sysconfig/libvirtd Another set of tricks: So adding "--listen" again and restarting vlibvirtd.service almost fixed the problem: While it's OK to restart libvirtd while VMs are running, there were stale locks (virtlockd) that were tricky to clean up. Most specifically messages like these aren't really helpful (What the hack does that lock refer to?): Jul 13 10:03:11 h16 virtlockd[8935]: resource busy: Lockspace resource '56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked Jul 13 10:03:11 h16 libvirtd[22972]: resource busy: Lockspace resource '56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked Even if you find that "lock" in /var/lib/libvirt/lockd/files/56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15, you are not more clever than before, I'm afraid ;-) BTW: I had filed an enhancement request regarding that some time ago... Regards, Ulrich >> I wonder: Is there a way to prevent stop/start migration if live‑ >> migration failed? 
> The only thing I can think of is setting on‑fail=block for migrate_to
> and migrate_from actions. I'd be cautious though; if the migration
> fails in a way that leaves the domain inaccessible, it will stay that
> way.
>
>> In our case the migration was triggered by resource placement
>> strategy.
>>
>> The messages logged would look like this:
>> warning: Unexpected result (error: v15: live migration to h18
>> failed: 1) was recorded for migrate_to of prm_xen_v15 on h16
>>
>> Regards,
>> Ulrich
>
> ‑‑
> Ken Gaillot
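[Editor's sketch] The install-script edits quoted in this thread can be exercised safely against a scratch copy of the config before touching the real file. The sketch below assumes GNU sed (for -i and \s) and uses an illustrative one-line file content, not a real /etc/sysconfig/libvirtd.

```shell
#!/bin/sh
# Try the quoted sed/grep edits on a throwaway copy instead of the real
# /etc/sysconfig/libvirtd; the initial content here is illustrative.
set -e
cfg=$(mktemp)
echo 'LIBVIRTD_ARGS="--listen"' > "$cfg"

# Step 1: remove --listen (incompatible with socket activation, boo#1156161).
sed -i -e '/^\s*LIBVIRTD_ARGS=/s/--listen//g' "$cfg"

# Step 2: append --timeout 120 unless a --timeout is already configured.
if ! grep -q -E '^\s*LIBVIRTD_ARGS=.*--timeout' "$cfg"; then
    sed -i 's/^\s*LIBVIRTD_ARGS="\(.*\)"/LIBVIRTD_ARGS="\1 --timeout 120"/' "$cfg"
fi

# After both steps the scratch file contains --timeout 120 and no --listen.
cat "$cfg"
```

Running this makes it easy to see why the cluster-managed libvirtd loses its "--listen" flag: step 1 strips it unconditionally on every install-script run.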
[ClusterLabs] Antw: [EXT] Re: @ maillist Admins ‑ DMARC (yahoo)
>>> wrote on 12.07.2021 at 16:50 in message <08471514b28d1e3f6859707f5951f07887336865.ca...@redhat.com>:

> On Sat, 2021‑07‑10 at 12:34 +0100, lejeczek wrote:
>> Hi Admins (of this mailing list),
>>
>> Could you please fix the DMARC setting(s) so those of us who are on
>> Yahoo are able to receive our own emails/threads?
>>
>> many thanks, L.
>
> I suppose we should do something, since this is likely to be more of an
> issue as time goes on. Unfortunately, it's not as simple as flipping a
> switch. These are the two reasonable choices:
>
> (1) Change the "From" on list messages so that they appear to be from
> the list, rather than the poster. For example, your posts would show up
> as "From: lejeczek via ClusterLabs Users " rather than "From: lejeczek ".
> This is less intrusive but makes it more difficult to reply directly to
> the sender, add the sender to an address book, etc.
>
> (2) Stop adding [ClusterLabs] to subject lines, stop setting Reply-To:
> to the list instead of the original author, and stop adding the list
> signature. This is more standards‑compliant, since the List‑* headers
> can still be used for filtering, unsubscribing, and replying to the
> list, but not all mail clients make those easy to use.
>
> Anyone have preferences for one over the other?

I have no idea about DMARC, so I'm qualified for an opinion ;-) My guess is that the changes mentioned make the DMARC signature of the original message invalid. IMHO the best solution would be to (if at all) check DMARC on receipt, and "re-sign" before sending it out to the list.

Regards,
Ulrich

> (Less reasonable options include wrapping every post in MIME, and
> disallowing users from DMARC domains to post to the list.)
> ‑‑
> Ken Gaillot
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?
On 2021/7/12 15:52, Ulrich Windl wrote:

> Hi!
>
> Can you give some details on what is necessary to trigger the problem?

There is an ABBA deadlock between the reflink command and the ocfs2 ocfs2_complete_recovery routine (this routine is triggered by a timer, by mount, and by node recovery); the deadlock is not always encountered. For more details, see the link below:
https://oss.oracle.com/pipermail/ocfs2-devel/2021-July/015671.html

Thanks
Gang

> (I/O load, CPU load, concurrent operations on one node or on multiple
> nodes, using reflink snapshots, using ioctl(FS_IOC_FIEMAP), etc.)
>
> Regards,
> Ulrich

Gang He wrote on 11.07.2021 at 10:55 in message:

Hi Ulrich,

Thanks for your update. Based on some feedback from upstream, there is a patch ("ocfs2: initialize ip_next_orphan") which should fix this problem. I can confirm the patch looks very relevant to your problem. I will verify it next week and then let you know the result.

Thanks
Gang

From: Users on behalf of Ulrich Windl
Sent: Friday, July 9, 2021 15:56
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

Hi!

An update on the issue: SUSE support found out that the reason for the hanging processes is a deadlock caused by a race condition (kernel 5.3.18‑24.64‑default). Support is working on a fix.

Today the cluster "fixed" the problem in an unusual way:

h19 kernel: Out of memory: Killed process 6838 (corosync) total‑vm:261212kB, anon‑rss:31444kB, file‑rss:7700kB, shmem‑rss:121872kB

I doubt that was the best possible choice ;‑)

The dead corosync caused the DC (h18) to fence h19 (which was successful), but the DC was itself fenced while it tried to recover resources, so the complete cluster rebooted.
Regards,
Ulrich
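[Editor's sketch] The ABBA pattern Gang He describes (reflink takes lock A then wants B; recovery takes lock B then wants A) can be demonstrated in a few lines. The lock names below are hypothetical stand-ins, not the actual ocfs2 kernel locks; non-blocking acquires are used so the demo reports the deadlock instead of hanging.

```python
import threading

lock_a = threading.Lock()   # stand-in for the lock reflink takes first
lock_b = threading.Lock()   # stand-in for the lock recovery takes first


def try_abba():
    """Simulate both tasks holding their first lock, then show that
    neither can acquire the other's lock: the ABBA deadlock."""
    lock_a.acquire()                                # task 1: holds A, wants B
    lock_b.acquire()                                # task 2: holds B, wants A
    t1_stuck = not lock_b.acquire(blocking=False)   # task 1 blocks on B
    t2_stuck = not lock_a.acquire(blocking=False)   # task 2 blocks on A
    lock_a.release()
    lock_b.release()
    return t1_stuck and t2_stuck


print("deadlock would occur:", try_abba())  # -> deadlock would occur: True
```

This also illustrates why such bugs are "not always encountered": the deadlock only materializes when both tasks win their first acquisition in the overlapping window, which is why the linked patch (enforcing correct initialization/ordering) is the usual style of fix.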