[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

2021-07-13 Thread Ulrich Windl
>>> damiano giuliani  wrote on 13.07.2021 at 13:42
>>> in message:
> Hi guys,
> I'm back with some PAF PostgreSQL cluster problems.
> Tonight the cluster fenced the master node and promoted the PAF resource to
> a new node.
> Everything recovered fine, but I really don't know why it happened.
> This morning I noticed that the old master had been fenced by sbd and a new master
> had been promoted; this happened tonight at 00:40:XX.
> Filtering the logs, I can't find any reason for the fencing of the old master
> or for the start of the promotion of the new master (which seems to have gone
> perfectly). At this point I'm a bit lost, because none of us is able to
> determine the real reason.
> The cluster had worked flawlessly for days with no issues, until now.
> It is crucial for me to understand why this switchover occurred.
>
> I attached the current status, configuration and logs.
> In the old master's log I can't find any reason;
> on the new master the only things logged are the fencing and the promotion.
>
>
> PS:
> could this be the reason for the fencing?

First, I think your timeouts are rather aggressive. I hope there are no virtual
machines involved.

Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the 
leave message. failed: 1

This may be a networking problem, or the other node died for some unknown
reason.
That is the reason for the fencing:
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Our peer on the DC 
(ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is 
unclean

You said there is no reason for fencing, but here it is!

Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node 
ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice:  * Fence (reboot) 
ltaoperdbs02 'peer is no longer part of the cluster'

The fencing timing is also quite aggressive IMHO.
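Regarding the aggressive timeouts mentioned above, a less aggressive totem
section could look roughly like the sketch below (a hypothetical corosync.conf
fragment; the values are purely illustrative and not taken from your attached
configuration):

totem {
    version: 2
    cluster_name: ltaoperdbscluster
    # token: how long corosync waits for the token before declaring a node lost.
    # Raising it above the corosync 2.x default of 1000 ms lets short network
    # hiccups pass without triggering fencing.
    token: 5000
    # how many missed token retransmits are tolerated before the token is
    # considered lost
    token_retransmits_before_loss_const: 10
}

Whatever values you choose, keep the sbd watchdog timeout and
stonith-watchdog-timeout consistent with them.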

Could it be that a command saturated the network?
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 
UTC [172262] LOG:  duration: 660.329 ms  execute :  SELECT  
xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011 
ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON 
f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE 
xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND o.online_status_id 
= 3GROUP BY xmf.file_id, f.size,  fp.full_path   LIMIT 7265 ;
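If you want to check whether statements like this one are routinely heavy, and
the pg_stat_statements extension happens to be installed, a query along these
lines can help (a sketch using the PostgreSQL 13 column names; older releases
call the column total_time):

-- top statements by accumulated execution time
SELECT substr(query, 1, 60) AS query_start,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;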

Regards,
Ulrich

> 
> grep  -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
> 
> Any thoughts and help are really appreciated.
> 
> Damiano




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] pcs update resource command not working

2021-07-13 Thread S Sathish S
Hi Tomas,

Thanks for the response.

As you said, specifying an empty value for an option is the syntax for removing
the option. Yes, this means there is no way to set an option to an empty string
value using pcs.

But our problem is that the root user account is disabled by default, so we are
not sure which user can be given here, and if that given user account gets
disabled or its password expires, what the impact on this cluster's SNMP
monitoring service (and so on) will be.

Thanks and Regards,
S Sathish S

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

2021-07-13 Thread Andrei Borzenkov
On 13.07.2021 23:09, damiano giuliani wrote:
> Hi Klaus, thanks for helping; I'm quite lost because I can't find the
> cause.
> I attached the corosync logs of all three nodes, hoping you can spot
> something I can't see. I really appreciate the effort.
> The old master's log seems to be cut off at 00:38, so nothing interesting there.
> The new master and the third slave logged what happened, but I can't
> figure out why the old master was lost.
> 

The reason it was lost is most likely outside of pacemaker. You need to
check other logs on the node that was lost, maybe the BMC if this is bare
metal or the hypervisor if it is a virtualized system.

All that these logs say is that ltaoperdbs02 was lost from the point of
view of the two other nodes. It happened at the same time (around Jul 13
00:40), which suggests ltaoperdbs02 indeed had some problem. Whether it
was a software crash, hardware failure or network outage cannot be
determined from these logs.
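A few places worth checking on ltaoperdbs02 itself once it is back up - the
commands below are only illustrative and assume persistent journald logging,
a configured kdump and an IPMI-capable BMC:

# journal of the previous boot, warnings and worse
journalctl -b -1 -p warning
# kernel crash dumps, if kdump was configured
ls -l /var/crash/
# hardware event log on the BMC, if the machine has IPMI
ipmitool sel list | tail -n 20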


> Something interesting could be the stonith logs of the new master and the
> third slave:
> 
> NEW MASTER:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02
> state is now lost
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer
> with id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client
> crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device
> '(any)'
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer
> fencing (reboot) targeting ltaoperdbs02
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find
> anyone to fence (reboot) ltaoperdbs02 with any device
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for
> ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing
> (reboot) by ltaoperdbs02 for
> crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation
> 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for
> crmd.228700@ltaoperdbs03.f5d882d5: OK
> 
> THIRD SLAVE:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02
> state is now lost
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with
> id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot'
> targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5:
> OK
> 
> I really appreciate the help and your thoughts about it.
> 
> PS: the stonith-watchdog-timeout should be set to 10s (pcs property set
> stonith-watchdog-timeout=10s); would you suggest a different setting?
> 
> On Tue, 13 Jul 2021 at 14:29, Klaus Wenninger <
> kwenn...@redhat.com> wrote:
> 
>>
>>
>> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
>> damianogiulian...@gmail.com> wrote:
>>
>>> Hi guys,
>>> I'm back with some PAF PostgreSQL cluster problems.
>>> Tonight the cluster fenced the master node and promoted the PAF resource
>>> to a new node.
>>> Everything recovered fine, but I really don't know why it happened.
>>> This morning I noticed that the old master had been fenced by sbd and a new
>>> master had been promoted; this happened tonight at 00:40:XX.
>>> Filtering the logs, I can't find any reason for the fencing of the old master
>>> or for the start of the promotion of the new master (which seems to have gone
>>> perfectly). At this point I'm a bit lost, because none of us is able to
>>> determine the real reason.
>>> The cluster had worked flawlessly for days with no issues, until now.
>>> It is crucial for me to understand why this switchover occurred.
>>>
>>> I attached the current status, configuration and logs.
>>> In the old master's log I can't find any reason;
>>> on the new master the only things logged are the fencing and the promotion.
>>>
>>>
>>> PS:
>>> could this be the reason for the fencing?
>>>
>>> grep  -e sbd /var/log/messages
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:
>>> Servant pcmk is outdated (age: 4)
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:
>>> Servant pcmk is healthy (age: 0)
>>>
>> That was yesterday afternoon and not 0:40 today in the morning.
>> With the watchdog-timeout set to 5s this may have been tight though.
>> Maybe check your other nodes for similar warnings - or check the
>> compressed warnings.
>> Maybe you can as well check the journal of sbd after start to see if it
>> managed to run rt-scheduled.
>> Is this a bare-metal-setup or running on some hypervisor?
>> Unfortunately I'm not enough into postgres to tell if there is anything
>> interesting about the last
>> messages shown before the suspected watchdog-reboot.
>> Was there some administrative stuff done by ltauser before the reboot? If
>> yes what?
>>
>> Regards,
>> Klaus
>>
>>
>>>
>>> Any 

Re: [ClusterLabs] QDevice vs 3rd host for majority node quorum

2021-07-13 Thread Strahil Nikolov
In some cases the third location has a single IP, and then it makes sense to use
it as a QDevice. If you have multiple network connections to that location, use a
full-blown node.

Best Regards,
Strahil Nikolov
 
 
  On Tue, Jul 13, 2021 at 20:44, Andrei Borzenkov wrote:   
On 13.07.2021 19:52, Gerry R Sommerville wrote:
> Hello everyone,
> I am currently comparing using QDevice vs adding a 3rd host to my 
> even-number-node cluster and I am wondering about the details concerning 
> network 
> communication.
> For example, say my cluster is utilizing multiple heartbeat rings. Would the 
> QDevice take into account and use the IPs specified in the different rings? 
> Or 

No.

> does it only use the one specified under the quorum directive for QDevice?

Yes. The remote device is unrelated to corosync rings. Qdevice receives
information about the current cluster membership from all nodes (each node's
point of view), computes the partitions and selects the partition that will remain quorate.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
  
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] QDevice vs 3rd host for majority node quorum

2021-07-13 Thread Andrei Borzenkov
On 13.07.2021 19:52, Gerry R Sommerville wrote:
> Hello everyone,
> I am currently comparing using QDevice vs adding a 3rd host to my 
> even-number-node cluster and I am wondering about the details concerning 
> network 
> communication.
> For example, say my cluster is utilizing multiple heartbeat rings. Would the 
> QDevice take into account and use the IPs specified in the different rings? 
> Or 

No.

> does it only use the one specified under the quorum directive for QDevice?

Yes. The remote device is unrelated to corosync rings. Qdevice receives
information about the current cluster membership from all nodes (each node's
point of view), computes the partitions and selects the partition that will remain quorate.
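For illustration, the qdevice address is configured exactly once under the
quorum device directive and is independent of any ring addresses; a minimal
corosync.conf sketch could look like this (host and algorithm are example
values, not a recommendation):

quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        net {
            # single address of the qnetd server; corosync rings are not used here
            host: 192.0.2.50
            algorithm: ffsplit
        }
    }
}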
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] QDevice vs 3rd host for majority node quorum

2021-07-13 Thread Gerry R Sommerville
Hello everyone,
 
I am currently comparing using QDevice vs adding a 3rd host to my even-number-node cluster and I am wondering about the details concerning network communication.

For example, say my cluster is utilizing multiple heartbeat rings. Would the QDevice take into account and use the IPs specified in the different rings? Or does it only use the one specified under the quorum directive for QDevice?
Gerry Sommerville
E-mail: ge...@ca.ibm.com

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] heartbeat rings questions

2021-07-13 Thread Jan Friesse

On 12/07/2021 23:27, Kiril Pashin wrote:

Hi,
is it valid for the same network adapter interface on the same host to be
part of multiple heartbeat rings?


There should be no problem from the technical side, but I wouldn't call this
use case "valid". The idea of multiple rings is to have multiple independent
connections between the nodes - something a single NIC simply cannot provide.



The scenario is hostA has eth0 ( ip 192.10.10.1 ) interface and hostB has eth0 (
192.20.20.1 ) and eth1 ( 192.20.20.2 ) .


This is an unsupported configuration.


Are there any restrictions to form two heartbeat rings { eth0, eth0 } and {
eth0, eth1 }


Technically the only restriction is based on the IP addresses. But to make it
reliable one should use multiple NICs with multiple links and multiple switches.



as well as create a nozzle device to be able to ping hostA in case eth0 or eth1
go down on hostB


It is definitely possible to create a nozzle device which will allow you to
ping hostA in case one of the NICs fails, but not in the way described in the
config snippet. The nozzle device should use a different IP subnet (the nozzle
is basically yet another network card).



nodelist {
  node {
  ring0_addr: 192.10.10.1
  ring1_addr: 192.10.10.1
  name: node1
  nodeid: 1
  }
  node {
  ring0_addr: 192.20.20.1
  ring1_addr: 192.20.20.2
  name: node2
  nodeid: 2
  }
}
nozzle {
  name: noz01
  ipaddr: 192.168.10.0
  ipprefix: 24
}


This config will definitely not work.
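For comparison, a layout matching the original intent could look roughly like
the sketch below, assuming hostA also gets a second NIC on a separate subnet
and the corosync build includes libnozzle support (all addresses are
illustrative, not a drop-in configuration):

nodelist {
    node {
        # hostA: eth0 for ring0, eth1 (separate NIC and subnet) for ring1
        ring0_addr: 192.10.10.1
        ring1_addr: 192.10.20.1
        name: node1
        nodeid: 1
    }
    node {
        # hostB: eth0 for ring0, eth1 for ring1
        ring0_addr: 192.10.10.2
        ring1_addr: 192.10.20.2
        name: node2
        nodeid: 2
    }
}
nozzle {
    name: noz01
    # the nozzle network is its own subnet, distinct from both rings
    ipaddr: 192.168.10.0
    ipprefix: 24
}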

Regards,
  Honza


Thanks,
Kiril Pashin
DB2 Purescale Development & Support
kir...@ca.ibm.com



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] heartbeat rings questions

2021-07-13 Thread Kiril Pashin
Hi,
 
is it valid for the same network adapter interface on the same host to be part of multiple
heartbeat rings?
 
The scenario is that hostA has an eth0 interface (ip 192.10.10.1) and hostB has eth0 (192.20.20.1) and eth1 (192.20.20.2).
Are there any restrictions on forming two heartbeat rings { eth0, eth0 } and { eth0, eth1 },
as well as creating a nozzle device to be able to ping hostA in case eth0 or eth1 goes down on hostB?
 
nodelist {
    node {
        ring0_addr: 192.10.10.1
        ring1_addr: 192.10.10.1
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.20.20.1
        ring1_addr: 192.20.20.2
        name: node2
        nodeid: 2
    }
}
nozzle {
    name: noz01
    ipaddr: 192.168.10.0
    ipprefix: 24
}
 
Thanks,
 
Kiril Pashin
DB2 Purescale Development & Support
kir...@ca.ibm.com

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] @ maillist Admins - DMARC (yahoo)

2021-07-13 Thread kgaillot
On Tue, 2021-07-13 at 14:46 +0300, Andrei Borzenkov wrote:
> On Mon, Jul 12, 2021 at 5:50 PM  wrote:
> > 
> > On Sat, 2021-07-10 at 12:34 +0100, lejeczek wrote:
> > > Hi Admins(of this mailing list)
> > > 
> > > Could you please fix in DMARC(s) so those of us who are on
> > > Yahoo would be able to receive own emails/thread.
> > > 
> > > many thanks, L.
> > 
> > I suppose we should do something, since this is likely to be more
> > of an
> > issue as time goes on. Unfortunately, it's not as simple as
> > flipping a
> > switch. These are the two reasonable choices:
> > 
> 
> The problem is, both are incomplete.

Unfortunately that's correct, there is no way to make everybody happy.


> > (1) Change the "From" on list messages so that they appear to be
> > from
> > the list, rather than the poster. For example, your posts would
> > show up
> > as "From: lejeczek via ClusterLabs Users "
> > rather than "From: lejeczek ". This is less
> > intrusive but makes it more difficult to reply directly to the
> > sender,
> > add the sender to an address book, etc.
> > 
> 
> This will pass SPF but fail DKIM
> 
> > 
> > (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to
> > the
> > list instead of original author, and adding the list signature.
> > This is
> > more standards-compliant, since the List-* headers can still be
> > used
> > for filtering, unsubscribing, and replying to the list, but not all
> > mail clients make those easy to use.
> > 
> 
> This will pass DKIM but fail SPF.
> 
> I do not know how many domains implement only SPF, only DKIM or both.

Yes, that is a well-known issue, and for that reason I think it's very
rare for DMARC to be used with only one of them, even though the DMARC
standard allows that.

(A domain that doesn't use DMARC can use just one of SPF or DKIM
without problems -- in fact, clusterlabs.org uses SPF but not DKIM or
DMARC.)

> > 
> > Anyone have preferences for one over the other?
> > 
> > (Less reasonable options include wrapping every post in MIME, and
> > disallowing users from DMARC domains to post to the list.)
> 
> Well, enabling ARC in addition to either of the options may somehow
> mitigate them. It depends on *recipient* domain support though. Also
> I
> am not sure whether Mailman 2.x supports it.
> 
> From a personal perspective, I already filter by list ids and
> [ClusterLabs] just wastes screen real estate. But I remember somewhat
> heated responses when openSUSE changed list software and dropped
> prefixes - apparently quite some users were using single mailbox and
> relied on prefixes to prioritize message reading.

Yes, I worry about that. But I'm afraid the technology has evolved to a
point where that's no longer a reasonable approach to email.

I am leaning to (2) for that reason; it seems to be the more standards-
compliant and future-oriented approach. I know the new format would
take some getting used to, but hopefully it smooths out over time.
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: @ maillist Admins ‑ DMARC (yahoo)

2021-07-13 Thread kgaillot
On Tue, 2021-07-13 at 10:23 +0200, Ulrich Windl wrote:
> > > >  wrote on 12.07.2021 at 16:50 in
> > > > message
> 
> <08471514b28d1e3f6859707f5951f07887336865.ca...@redhat.com>:
> > On Sat, 2021‑07‑10 at 12:34 +0100, lejeczek wrote:
> > > Hi Admins(of this mailing list)
> > > 
> > > Could you please fix in DMARC(s) so those of us who are on 
> > > Yahoo would be able to receive own emails/thread.
> > > 
> > > many thanks, L.
> > 
> > I suppose we should do something, since this is likely to be more
> > of an
> > issue as time goes on. Unfortunately, it's not as simple as
> > flipping a
> > switch. These are the two reasonable choices:
> > 
> > 
> > (1) Change the "From" on list messages so that they appear to be
> > from
> > the list, rather than the poster. For example, your posts would
> > show up
> > as "From: lejeczek via ClusterLabs Users "
> > rather than "From: lejeczek ". This is less
> > intrusive but makes it more difficult to reply directly to the
> > sender,
> > add the sender to an address book, etc.
> > 
> > 
> > (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to
> > the
> > list instead of original author, and adding the list signature.
> > This is
> > more standards‑compliant, since the List‑* headers can still be
> > used
> > for filtering, unsubscribing, and replying to the list, but not all
> > mail clients make those easy to use.
> > 
> > 
> > Anyone have preferences for one over the other?
> 
> I have no idea about DMARC, so I'm qualified for an opinion ;-)
> My guess is that the changes mentioned to the original message make
> the DMARC
> signature invalid.

Right

> > IMHO the best solution would be to (if at all) check DMARC on
> receipt, and
> "re-sign" before sending it out to the list.

Only the sender's domain mailers have the signing key. Once our mailing
list server receives it, we can't modify the existing body or headers
without breaking the DMARC (DKIM) signature. (Changing the "From" works
because at that point the message is no longer from the DMARC-protected 
domain.)

> 
> Regards,
> Ulrich
> 
> > 
> > (Less reasonable options include wrapping every post in MIME, and
> > disallowing users from DMARC domains to post to the list.)
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

2021-07-13 Thread Klaus Wenninger
On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
damianogiulian...@gmail.com> wrote:

> Hi guys,
> I'm back with some PAF PostgreSQL cluster problems.
> Tonight the cluster fenced the master node and promoted the PAF resource to
> a new node.
> Everything recovered fine, but I really don't know why it happened.
> This morning I noticed that the old master had been fenced by sbd and a new
> master had been promoted; this happened tonight at 00:40:XX.
> Filtering the logs, I can't find any reason for the fencing of the old master
> or for the start of the promotion of the new master (which seems to have gone
> perfectly). At this point I'm a bit lost, because none of us is able to
> determine the real reason.
> The cluster had worked flawlessly for days with no issues, until now.
> It is crucial for me to understand why this switchover occurred.
>
> I attached the current status, configuration and logs.
> In the old master's log I can't find any reason;
> on the new master the only things logged are the fencing and the promotion.
>
>
> PS:
> could this be the reason for the fencing?
>
> grep  -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
>
That was yesterday afternoon and not 0:40 today in the morning.
With the watchdog-timeout set to 5s this may have been tight though.
Maybe check your other nodes for similar warnings - or check the compressed
warnings.
Maybe you can as well check the journal of sbd after start to see if it
managed to run rt-scheduled.
Is this a bare-metal-setup or running on some hypervisor?
Unfortunately I'm not enough into postgres to tell if there is anything
interesting about the last
messages shown before the suspected watchdog-reboot.
Was there some administrative stuff done by ltauser before the reboot? If
yes what?
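For the sbd journal check suggested above, something along these lines should
show whether the daemon came up cleanly and what scheduling it got (illustrative
commands; the exact log wording differs between sbd versions):

# messages from sbd since the current boot
journalctl -u sbd.service -b --no-pager | head -50
# scheduling policy of the running sbd processes (SCHED_FIFO/SCHED_RR = realtime)
for p in $(pidof sbd); do chrt -p "$p"; done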

Regards,
Klaus


>
> Any thoughts and help are really appreciated.
>
> Damiano
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] @ maillist Admins - DMARC (yahoo)

2021-07-13 Thread Andrei Borzenkov
On Mon, Jul 12, 2021 at 5:50 PM  wrote:
>
> On Sat, 2021-07-10 at 12:34 +0100, lejeczek wrote:
> > Hi Admins(of this mailing list)
> >
> > Could you please fix in DMARC(s) so those of us who are on
> > Yahoo would be able to receive own emails/thread.
> >
> > many thanks, L.
>
> I suppose we should do something, since this is likely to be more of an
> issue as time goes on. Unfortunately, it's not as simple as flipping a
> switch. These are the two reasonable choices:
>

The problem is, both are incomplete.

>
> (1) Change the "From" on list messages so that they appear to be from
> the list, rather than the poster. For example, your posts would show up
> as "From: lejeczek via ClusterLabs Users "
> rather than "From: lejeczek ". This is less
> intrusive but makes it more difficult to reply directly to the sender,
> add the sender to an address book, etc.
>

This will pass SPF but fail DKIM

>
> (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to the
> list instead of original author, and adding the list signature. This is
> more standards-compliant, since the List-* headers can still be used
> for filtering, unsubscribing, and replying to the list, but not all
> mail clients make those easy to use.
>

This will pass DKIM but fail SPF.

I do not know how many domains implement only SPF, only DKIM or both.

>
> Anyone have preferences for one over the other?
>
> (Less reasonable options include wrapping every post in MIME, and
> disallowing users from DMARC domains to post to the list.)

Well, enabling ARC in addition to either of the options may somehow
mitigate them. It depends on *recipient* domain support though. Also I
am not sure whether Mailman 2.x supports it.

From a personal perspective, I already filter by list ids and
[ClusterLabs] just wastes screen real estate. But I remember somewhat
heated responses when openSUSE changed list software and dropped
prefixes - apparently quite some users were using single mailbox and
relied on prefixes to prioritize message reading.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

2021-07-13 Thread damiano giuliani
Hi guys,
I'm back with some PAF PostgreSQL cluster problems.
Tonight the cluster fenced the master node and promoted the PAF resource to
a new node.
Everything recovered fine, but I really don't know why it happened.
This morning I noticed that the old master had been fenced by sbd and a new master
had been promoted; this happened tonight at 00:40:XX.
Filtering the logs, I can't find any reason for the fencing of the old master
or for the start of the promotion of the new master (which seems to have gone
perfectly). At this point I'm a bit lost, because none of us is able to
determine the real reason.
The cluster had worked flawlessly for days with no issues, until now.
It is crucial for me to understand why this switchover occurred.

I attached the current status, configuration and logs.
In the old master's log I can't find any reason;
on the new master the only things logged are the fencing and the promotion.


PS:
could this be the reason for the fencing?

grep  -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant
pcmk is healthy (age: 0)

Any thoughts and help are really appreciated.

Damiano
pcs status
Cluster name: ltaoperdbscluster
Stack: corosync
Current DC: ltaoperdbs03 (version 1.1.23-1.el7-9acf116022) - partition with 
quorum
Last updated: Tue Jul 13 10:06:01 2021
Last change: Tue Jul 13 00:41:05 2021 by root via crm_attribute on ltaoperdbs03

3 nodes configured
4 resource instances configured

Online: [ ltaoperdbs03 ltaoperdbs04 ]
OFFLINE: [ ltaoperdbs02 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ ltaoperdbs03 ]
 Slaves: [ ltaoperdbs04 ]
 Stopped: [ ltaoperdbs02 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started ltaoperdbs03

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled
[root@ltaoperdbs03 pengine]# pcs config show
Cluster Name: ltaoperdbscluster
Corosync Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04
Pacemaker Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-13/bin pgdata=/workspace/pdgs-db/13/data 
pgport=5432
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
   methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
   monitor interval=15s role=Master timeout=25s 
(pgsqld-monitor-interval-15s)
   monitor interval=16s role=Slave timeout=25s 
(pgsqld-monitor-interval-16s)
   notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
   promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
   reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
   start interval=0s timeout=60s (pgsqld-start-interval-0s)
   stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=172.18.2.10
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
  start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
  stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) 
(non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) 
(id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) 
(with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ltaoperdbscluster
 dc-version: 1.1.23-1.el7-9acf116022
 have-watchdog: true
 last-lrm-refresh: 1625090339
 stonith-enabled: true
 stonith-watchdog-timeout: 10s

Quorum:
  Options:


stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from 
ltaoperdbs03 at Tue Jul 13 00:40:47 2021

SBD CONFIG
grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_TIMEOUT_ACTION=flush,reboot
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
ltaoperdbs03 cluster]# stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from 
ltaoperdbs03 at Tue Jul 13 00:40:47 2021

[root@ltaoperdbs03 cluster]# grep "Jul 13 00:40:" /var/log/messages
Jul 13 00:40:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started 

Re: [ClusterLabs] pcs update resource command not working

2021-07-13 Thread Tomas Jelinek

On 09. 07. 21 at 7:29, S Sathish S wrote:

Hi Team,

we have found the cause of this problem; as per the changelog below, the pcs
resource update command doesn't support empty meta_attributes anymore.


https://github.com/ClusterLabs/pcs/blob/0.9.169/CHANGELOG.md

pcs resource update does not create an empty meta_attributes element any 
more (rhbz#1568353)


This bz is not related to your issue.



[root@node01 testadmin]# pcs resource update SNMP_node01 user='' 
extra_options="-E /opt/occ/CXP/tools/PCSXXX.sh"


Later we modified it to the command below and it worked for us.

[root@node01 testadmin]# pcs resource update SNMP_node01 user='root' 
extra_options="-E /opt/occ/CXP/tools/PCSXXX.sh"




The commands work as expected, user='root' sets the value of 'user' to 
'root', user='' deletes 'user'.


Specifying an empty value for an option is a syntax for removing the 
option. Yes, this means there is no way to set an option to an empty 
string value using pcs.
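To make the semantics explicit, using the resource and attribute names already
quoted in this thread:

# sets the 'user' attribute to root
pcs resource update SNMP_node01 user='root'
# an empty value removes the 'user' attribute from the resource definition
pcs resource update SNMP_node01 user=''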


But our problem is that the root user account is disabled by default, so we
are not sure which user can be given here, and if that given user account
gets disabled or its password expires, what the impact on this cluster's
SNMP monitoring service (and so on) will be.


While doing pcs resource create for SNMP with an empty string for the user
attribute, it worked for us; that is one more difference we noticed.


'pcs resource create' allows setting empty string values. It is a known 
bug which we track and will fix eventually.



Regards,
Tomas



Query:

 1) Is it recommended to create the SNMP ClusterMon resource type
with an empty user attribute?

 2) If not, and we update the resource with some user and that user
account gets disabled or its password expires, what will be the impact on
this cluster's SNMP monitoring service and so on?


Thanks and Regards,

S Sathish


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Q: Prevent non‑live VM migration

2021-07-13 Thread Ulrich Windl
>>>  wrote on 12.07.2021 at 16:53 in message
<376475adc8217a97adf9374d0aad0317eabd5f90.ca...@redhat.com>:
> On Mon, 2021‑07‑12 at 08:35 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> We had some problem in the cluster that prevented live migration of
>> VMs. As a consequence the cluster migrated the VMs using stop/start.

For completeness, it turned out that the libvirt install script eats the
"--listen" parameter that is required when starting libvirt from the cluster:
# The '--listen' option is incompatible with socket activation.
# We need to forcibly remove it from /etc/sysconfig/libvirtd.
# Also add the --timeout option to be consistent with upstream.
# See boo#1156161 for details
sed -i -e '/^\s*LIBVIRTD_ARGS=/s/--listen//g' /etc/sysconfig/libvirtd
if ! grep -q -E '^\s*LIBVIRTD_ARGS=.*--timeout' /etc/sysconfig/libvirtd ; then
    sed -i 's/^\s*LIBVIRTD_ARGS="\(.*\)"/LIBVIRTD_ARGS="\1 --timeout 120"/' /etc/sysconfig/libvirtd
fi

Another set of tricks:
Adding "--listen" again and restarting libvirtd.service almost fixed the
problem:
While it's OK to restart libvirtd while VMs are running, there were stale
locks (virtlockd) that were tricky to clean up.

Most specifically, messages like these aren't really helpful (what the heck
does that lock refer to?):
Jul 13 10:03:11 h16 virtlockd[8935]: resource busy: Lockspace resource
'56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked
Jul 13 10:03:11 h16 libvirtd[22972]: resource busy: Lockspace resource
'56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked

Even if you find that "lock" in
/var/lib/libvirt/lockd/files/56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15,
you are no wiser than before, I'm afraid ;-)
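If, as the 64 hex characters suggest, the lock name is a SHA-256 digest of the
fully qualified disk path, a small loop can map locks back to disks (an untested
sketch; the image directory is only an assumption for illustration):

# compare the hash of every disk path against the lock name from the log
LOCK=56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d
for img in /var/lib/libvirt/images/*; do
    h=$(printf '%s' "$img" | sha256sum | awk '{print $1}')
    [ "$h" = "$LOCK" ] && echo "$LOCK -> $img"
done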

BTW: I had filed an enhancement request regarding that some time ago...

Regards,
Ulrich

>> I wonder: Is there a way to prevent stop/start migration if live‑
>> migration failed?
> 
> The only thing I can think of is setting on‑fail=block for migrate_to
> and migrate_from actions. I'd be cautious though; if the migration
> fails in a way that leaves the domain inaccessible, it will stay that
> way.
> 
>> In out case the migration was triggeerd by resource placement
>> strategy.
>> 
>> The messages logged would look like this:
>>  warning: Unexpected result (error: v15: live migration to h18
>> failed: 1) was recorded for migrate_to of prm_xen_v15 on h16
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> ‑‑ 
> Ken Gaillot 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: @ maillist Admins ‑ DMARC (yahoo)

2021-07-13 Thread Ulrich Windl
>>>  wrote on 12.07.2021 at 16:50 in message
<08471514b28d1e3f6859707f5951f07887336865.ca...@redhat.com>:
> On Sat, 2021‑07‑10 at 12:34 +0100, lejeczek wrote:
>> Hi Admins(of this mailing list)
>> 
>> Could you please fix in DMARC(s) so those of us who are on 
>> Yahoo would be able to receive own emails/thread.
>> 
>> many thanks, L.
> 
> I suppose we should do something, since this is likely to be more of an
> issue as time goes on. Unfortunately, it's not as simple as flipping a
> switch. These are the two reasonable choices:
> 
> 
> (1) Change the "From" on list messages so that they appear to be from
> the list, rather than the poster. For example, your posts would show up
> as "From: lejeczek via ClusterLabs Users "
> rather than "From: lejeczek ". This is less
> intrusive but makes it more difficult to reply directly to the sender,
> add the sender to an address book, etc.
> 
> 
> (2) Stop adding [ClusterLabs] to subject lines, setting ReplyTo: to the
> list instead of original author, and adding the list signature. This is
> more standards‑compliant, since the List‑* headers can still be used
> for filtering, unsubscribing, and replying to the list, but not all
> mail clients make those easy to use.
> 
> 
> Anyone have preferences for one over the other?

I have no idea about DMARC, so I'm qualified for an opinion ;-)
My guess is that the changes mentioned to the original message make the DMARC
signature invalid.
IMHO the best solution would be to (if at all) check DMARC on receipt, and
"re-sign" before sending it out to the list.

Regards,
Ulrich

> 
> (Less reasonable options include wrapping every post in MIME, and
> disallowing users from DMARC domains to post to the list.)
> ‑‑ 
> Ken Gaillot 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-07-13 Thread Gang He



On 2021/7/12 15:52, Ulrich Windl wrote:

Hi!

can you give some details on what is necessary to trigger the problem?
There is an ABBA deadlock between the reflink command and the ocfs2
ocfs2_complete_recovery routine (this routine can be triggered by a timer,
a mount, or node recovery); the deadlock is not always encountered.

For more details, refer to the link below:
https://oss.oracle.com/pipermail/ocfs2-devel/2021-July/015671.html

Thanks
Gang


(I/O load, CPU load, concurrent operations on one node or on multiple nodes,
using reflink snapshots, using ioctl(FS_IOC_FIEMAP), etc.)

Regards,
Ulrich


Gang He  wrote on 11.07.2021 at 10:55 in message




Hi Ulrich,

Thanks for your update.
Based on some feedback from upstream, there is a patch (ocfs2:
initialize ip_next_orphan) which should fix this problem.
I can confirm the patch looks very relevant to your problem.
I will verify it next week and then let you know the result.

Thanks
Gang


From: Users  on behalf of Ulrich Windl

Sent: Friday, July 9, 2021 15:56
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any
one else?

Hi!

An update on the issue:
SUSE support found out that the reason for the hanging processes is a
deadlock caused by a race condition (Kernel 5.3.18‑24.64‑default). Support
is working on a fix.
Today the cluster "fixed" the problem in an unusual way:

h19 kernel: Out of memory: Killed process 6838 (corosync) total‑vm:261212kB,
anon‑rss:31444kB, file‑rss:7700kB, shmem‑rss:121872kB

I doubt that was the best possible choice ;‑)

The dead corosync caused the DC (h18) to fence h19 (which was successful),
but the DC was fenced while it tried to recover resources, so the complete
cluster rebooted.

Regards,
Ulrich




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/