[ClusterLabs] Antw: [EXT] Re: Stonith external/ssh "device"?

2022-12-21 Thread Ulrich Windl
>>> Antony Stone wrote on 21.12.2022 at 17:19 in message <202212211719.34369.antony.st...@ha.open.source.it>:
> On Wednesday 21 December 2022 at 16:59:16, Antony Stone wrote:
> 
>> Hi.
>> 
>> I'm implementing fencing on a 7-node cluster as described recently:
>> https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html 
>> 
>> I'm using external/ssh for the time being, and it works if I test it using:
>> 
>> stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB
>> 
>> 
>> However, when it's supposed to be invoked because a node has got stuck, I
>> simply find syslog full of the following (one from each of the other six
>> nodes in the cluster):
>> 
>> pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one> for
>> pacemaker-controld.26852@nodeA.93b391b2: No such device
>> 
>> I have defined seven stonith resources, one for rebooting each machine, and
>> I can see from "crm status" that they have been assigned randomly amongst
>> the other servers, usually one per server, so that looks good.
>> 
>> 
>> The main things that puzzle me about the log message are:
>> 
>> a) why does it say "<no-one>"?  Is this more like "anyone", meaning that
>> no-one in particular is required to do this task, provided that at least
>> someone does it?  Does this indicate a configuration problem?
> 
> PS: I've just noticed that I'm also getting log entries immediately 
> afterwards:
> 
> pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot)
> by <anyone> on behalf of pacemaker-controld.26852: No such device
> 
>> b) what is this "device" referred to?  I'm using "external/ssh" so there is
>> no actual Stonith device for power-cycling hardware machines - am I
>> supposed to define some sort of dummy device somewhere?
>> 
>> For clarity, this is what I have added to my cluster configuration to set
>> this up:
>> 
>> primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
>> location only_nodeA reboot_nodeA -inf: nodeA

"location only_nodeA" meaning "location not_nodeA"? ;-)


>> 
>> ...repeated for all seven nodes.
>> 
>> I also have "stonith-enabled=yes" in the cib-bootstrap-options.
>> 
>> 
>> Ideas, anyone?
>> 
>> Thanks,
>> 
>> 
>> Antony.
> 
> -- 
> Normal people think "If it ain't broke, don't fix it".
> Engineers think "If it ain't broke, it doesn't have enough features yet".
> 
> Please reply to the list;
> please *don't* CC me.





[ClusterLabs] Antw: [EXT] Re: Bug pacemaker with multiple IP

2022-12-21 Thread Ulrich Windl
You could also try something like "watch fuser $(which ip)" or (if you can)
write a program using inotify and IN_OPEN to see which processes are opening
the binary.

>>> Thomas CAS wrote on 21.12.2022 at 09:24 in message:

> Ken,
> 
> Antivirus (sophos-av) is running but not in "real time access scanning", the
> scheduled scan is however at 9pm every day.
> 7 minutes later, we got these alerts. 
> The anti virus may indeed be the cause.
> 
> I had the case on December 13 (with systemctl here):
> 
> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
> 
> After, this happens rarely, we had the case in August:
> 
> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> 
> It's always around 9:00‑9:07 pm, 
> I'll move the virus scan to 10pm and see.
> 
> Thanks,
> Best regards,
> 
> Thomas Cas  |  Managed Services Support Technician
> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en 
> IKOULA Data Center 34 rue Pont Assy ‑ 51100 Reims ‑ FRANCE
> Before printing this letter, think about the impact on the environment!
> 
> -----Original Message-----
> From: Reid Wahl 
> Sent: Tuesday, 20 December 2022 20:34
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> Cc: Ken Gaillot ; Service Infogérance 
> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
> 
> On Tue, Dec 20, 2022 at 6:25 AM Thomas CAS  wrote:
>>
>> Hello Ken,
>>
>> Thanks for your answer.
>> There was no update running at the time of the bug, which is why I thought
>> that having too many IPs caused this type of error.
>> The /usr/sbin/ip executable was not being modified either.
>>
>> We have many clusters, and only this one has so many IPs and this problem.
> 
> How often does this happen, and is it reliably reproducible under any 
> circumstances? Any antivirus software running? It'd be nice to check 
> something like lsof or strace while it's happening, but that may not be 
> feasible if it's sporadic; running those at every monitor would generate
> lots of logs.
> 
> AFAICT, having multiple processes execute (or read) the `ip` binary 
> simultaneously *shouldn't* cause problems, as long as nothing opens it for 
> write.
> 
>>
>> Best regards,
>>
>> Thomas Cas  |  Managed Services Support Technician
>> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
>> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
>> Before printing this letter, think about the impact on the environment!
>>
>> -----Original Message-----
>> From: Ken Gaillot 
>> Sent: Monday, 19 December 2022 22:08
>> To: Cluster Labs - All topics related to open-source clustering welcomed 
>> Cc: Service Infogérance 
>> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
>>
>> On Mon, 2022-12-19 at 09:48 +0000, Thomas CAS wrote:
>> > Hello Clusterlabs,
>> >
>> > I would like to report a bug on Pacemaker with the "IPaddr2"
>> > resource:
>> >
>> > OS: Debian 10
>> > Kernel: Linux wd-websqlng01 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1
>> > (2021-09-29) x86_64 GNU/Linux
>> > 

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Stonith

2022-12-21 Thread Klaus Wenninger
On Wed, Dec 21, 2022 at 4:51 PM Ken Gaillot  wrote:

> On Wed, 2022-12-21 at 10:45 +0100, Ulrich Windl wrote:
> > >>> Ken Gaillot wrote on 20.12.2022 at 16:21 in message
> > <3a5960c2331f97496119720f6b5a760b3fe3bbcf.ca...@redhat.com>:
> > > On Tue, 2022-12-20 at 11:33 +0300, Andrei Borzenkov wrote:
> > > > On Tue, Dec 20, 2022 at 10:07 AM Ulrich Windl
> > > >  wrote:
> > > > > > But keep in mind that if the whole site is down (or
> > > > > > unaccessible)
> > > > > > you
> > > > > > will not have access to IPMI/PDU/whatever on this site so
> > > > > > your
> > > > > > stonith
> > > > > > agents will fail ...
> > > > >
> > > > > But, considering the design, such site won't have a quorum and
> > > > > should commit suicide, right?
> > > > >
> > > >
> > > > Not by default.
> > >
> > > And even if it does, the rest of the cluster can't assume that it
> > > did,
> > > so resources can't be recovered. It could work with sbd, but the
> > > poster
> > > said that the physical hosts aren't accessible.
> >
> > Why? Assuming fencing is configured, the nodes part of the quorum
> > should wait
> > for fencing delay, assuming fencing (or suicide) was done.
> > Then they can manage resources. OK, a non-working fencing or suicide
> > mechanism
> > is a different story...
> >
> > Regards,
> > Ulrich
>
> Right, that would be using watchdog-based SBD for self-fencing, but the
> poster can't use SBD in this case.
>

I read it in a way that this would just be a PoC setup.
Just as ssh-fencing stands in for a real fencing device, one can use softdog
(or whatever the virtual environment offers that is supported by the kernel
as a watchdog device) with watchdog-fencing, at least for PoC purposes.
I guess it depends on how the final setup is gonna differ from the PoC
setup. Knowing that things like live-migration, pausing a machine,
running on heavily overcommitted hosts, snapshots, ... would
be critical for the scenario, one could simply try to avoid these things
during PoC tests if they are not relevant for a final production setup.
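
As a rough sketch of such a PoC setup (assuming a Debian-like guest with the
sbd package and the crm shell; device path and timeouts are only examples):

# load a software watchdog if the VM offers no emulated one
modprobe softdog
# point sbd at it, e.g. in /etc/default/sbd (or /etc/sysconfig/sbd):
#   SBD_WATCHDOG_DEV=/dev/watchdog
#   SBD_WATCHDOG_TIMEOUT=5
# then enable diskless watchdog self-fencing in the cluster:
crm configure property stonith-enabled=true
crm configure property stonith-watchdog-timeout=10s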

Klaus


> --
> Ken Gaillot 
>


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 17:19:34, Antony Stone wrote:

> > pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one>
> > for pacemaker-controld.26852@nodeA.93b391b2: No such device

> pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot)
> by <anyone> on behalf of pacemaker-controld.26852: No such device

I have resolved this - there was a discrepancy between the node names (some 
simple hostnames, some FQDNs) in my main cluster configuration, and the 
hostlist parameter for the external/ssh fencing plugin.

I have set them all to be simple hostnames with no domain and now all is 
working as expected.
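
For anyone hitting the same thing: a quick way to spot such a mismatch is to
compare the names the cluster knows its members by with each hostlist (a
sketch, using the resource name from my earlier mail):

crm_node -l                       # node names as corosync/pacemaker see them
crm configure show reboot_nodeA   # hostlist="..." must use exactly those names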

I still find the log message "no such device" rather confusing.


Thanks,


Antony.

-- 
 yes, but this is #lbw, we don't do normal

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
On Wednesday 21 December 2022 at 16:59:16, Antony Stone wrote:

> Hi.
> 
> I'm implementing fencing on a 7-node cluster as described recently:
> https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html
> 
> I'm using external/ssh for the time being, and it works if I test it using:
> 
> stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB
> 
> 
> However, when it's supposed to be invoked because a node has got stuck, I
> simply find syslog full of the following (one from each of the other six
> nodes in the cluster):
> 
> pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one> for
> pacemaker-controld.26852@nodeA.93b391b2: No such device
> 
> I have defined seven stonith resources, one for rebooting each machine, and
> I can see from "crm status" that they have been assigned randomly amongst
> the other servers, usually one per server, so that looks good.
> 
> 
> The main things that puzzle me about the log message are:
> 
> a) why does it say "<no-one>"?  Is this more like "anyone", meaning that
> no-one in particular is required to do this task, provided that at least
> someone does it?  Does this indicate a configuration problem?

PS: I've just noticed that I'm also getting log entries immediately 
afterwards:

pacemaker-controld[3264]:   notice: Peer nodeB was not terminated (reboot) by 
<anyone> on behalf of pacemaker-controld.26852: No such device

> b) what is this "device" referred to?  I'm using "external/ssh" so there is
> no actual Stonith device for power-cycling hardware machines - am I
> supposed to define some sort of dummy device somewhere?
> 
> For clarity, this is what I have added to my cluster configuration to set
> this up:
> 
> primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
> location only_nodeA reboot_nodeA -inf: nodeA
> 
> ...repeated for all seven nodes.
> 
> I also have "stonith-enabled=yes" in the cib-bootstrap-options.
> 
> 
> Ideas, anyone?
> 
> Thanks,
> 
> 
> Antony.

-- 
Normal people think "If it ain't broke, don't fix it".
Engineers think "If it ain't broke, it doesn't have enough features yet".

   Please reply to the list;
 please *don't* CC me.


[ClusterLabs] Stonith external/ssh "device"?

2022-12-21 Thread Antony Stone
Hi.

I'm implementing fencing on a 7-node cluster as described recently:
https://lists.clusterlabs.org/pipermail/users/2022-December/030714.html

I'm using external/ssh for the time being, and it works if I test it using:

stonith -t external/ssh -p "nodeA nodeB nodeC" -T reset nodeB


However, when it's supposed to be invoked because a node has got stuck, I 
simply find syslog full of the following (one from each of the other six nodes 
in the cluster):

pacemaker-fenced[3262]:   notice: Operation reboot of nodeB by <no-one> for 
pacemaker-controld.26852@nodeA.93b391b2: No such device

I have defined seven stonith resources, one for rebooting each machine, and I 
can see from "crm status" that they have been assigned randomly amongst the 
other servers, usually one per server, so that looks good.


The main things that puzzle me about the log message are:

a) why does it say "<no-one>"?  Is this more like "anyone", meaning that
no-one in particular is required to do this task, provided that at least someone 
does it?  Does this indicate a configuration problem?

b) what is this "device" referred to?  I'm using "external/ssh" so there is no 
actual Stonith device for power-cycling hardware machines - am I supposed to 
define some sort of dummy device somewhere?

For clarity, this is what I have added to my cluster configuration to set this 
up:

primitive reboot_nodeA stonith:external/ssh params hostlist="nodeA"
location only_nodeA reboot_nodeA -inf: nodeA

...repeated for all seven nodes.

I also have "stonith-enabled=yes" in the cib-bootstrap-options.
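
A way to cross-check what the fencer actually sees (a sketch, assuming the
stonith_admin CLI that ships with Pacemaker):

stonith_admin --list-registered   # fencing devices registered with the fencer
stonith_admin --list nodeB        # devices the fencer would use against nodeB

If the second list comes back empty for a target, the fencer reports
"No such device".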


Ideas, anyone?

Thanks,


Antony.

-- 
This sentence contains exacly three erors.

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Stonith

2022-12-21 Thread Ken Gaillot
On Wed, 2022-12-21 at 10:45 +0100, Ulrich Windl wrote:
> >>> Ken Gaillot wrote on 20.12.2022 at 16:21 in message
> <3a5960c2331f97496119720f6b5a760b3fe3bbcf.ca...@redhat.com>:
> > On Tue, 2022-12-20 at 11:33 +0300, Andrei Borzenkov wrote:
> > > On Tue, Dec 20, 2022 at 10:07 AM Ulrich Windl
> > >  wrote:
> > > > > But keep in mind that if the whole site is down (or
> > > > > unaccessible)
> > > > > you
> > > > > will not have access to IPMI/PDU/whatever on this site so
> > > > > your
> > > > > stonith
> > > > > agents will fail ...
> > > > 
> > > > But, considering the design, such site won't have a quorum and
> > > > should commit suicide, right?
> > > > 
> > > 
> > > Not by default.
> > 
> > And even if it does, the rest of the cluster can't assume that it
> > did,
> > so resources can't be recovered. It could work with sbd, but the
> > poster
> > said that the physical hosts aren't accessible.
> 
> Why? Assuming fencing is configured, the nodes that are part of the quorum
> should wait for the fencing delay, assuming fencing (or suicide) was done.
> Then they can manage resources. OK, a non-working fencing or suicide
> mechanism is a different story...
> 
> Regards,
> Ulrich

Right, that would be using watchdog-based SBD for self-fencing, but the
poster can't use SBD in this case.
-- 
Ken Gaillot 



Re: [ClusterLabs] Bug pacemaker with multiple IP

2022-12-21 Thread Thomas CAS
Ken,

Antivirus (sophos-av) is running but not in "real time access scanning", the 
scheduled scan is however at 9pm every day.
7 minutes later, we got these alerts. 
The antivirus may indeed be the cause.

I had the case on December 13 (with systemctl here):

pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld  
[5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [ 
/etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text 
file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
/bin/systemctl: Text file busy\n ]
pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld  
[5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [ 
/etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text 
file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
/bin/systemctl: Text file busy\n ]

After, this happens rarely, we had the case in August:

pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld  
[3718] (process_lrm_event)  notice: 
wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
/usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld  
[3718] (process_lrm_event)  notice: 
wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [ 
/usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
/usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]

It's always around 9:00-9:07 pm, 
I'll move the virus scan to 10pm and see.

Thanks,
Best regards,

Thomas Cas  |  Managed Services Support Technician
PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
Before printing this letter, think about the impact on the environment!

-----Original Message-----
From: Reid Wahl 
Sent: Tuesday, 20 December 2022 20:34
To: Cluster Labs - All topics related to open-source clustering welcomed 
Cc: Ken Gaillot ; Service Infogérance 
Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP

On Tue, Dec 20, 2022 at 6:25 AM Thomas CAS  wrote:
>
> Hello Ken,
>
> Thanks for your answer.
> There was no update running at the time of the bug, which is why I thought 
> that having too many IPs caused this type of error.
> The /usr/sbin/ip executable was not being modified either.
>
> We have many clusters, and only this one has so many IPs and this problem.

How often does this happen, and is it reliably reproducible under any 
circumstances? Any antivirus software running? It'd be nice to check something 
like lsof or strace while it's happening, but that may not be feasible if it's 
sporadic; running those at every monitor would generate lots of logs.

AFAICT, having multiple processes execute (or read) the `ip` binary 
simultaneously *shouldn't* cause problems, as long as nothing opens it for 
write.
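
To illustrate that failure mode (a sketch against a throwaway copy, so the
real binary stays untouched): executing a file that some process holds open
for writing is exactly what yields ETXTBSY ("Text file busy"):

cp /usr/sbin/ip /tmp/ip-test
exec 3>>/tmp/ip-test    # hold the copy open for writing on fd 3
/tmp/ip-test -V         # fails with "Text file busy" (ETXTBSY)
exec 3>&-               # close fd 3
/tmp/ip-test -V         # now prints the version again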

>
> Best regards,
>
> Thomas Cas  |  Managed Services Support Technician
> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
> Before printing this letter, think about the impact on the environment!
>
> -----Original Message-----
> From: Ken Gaillot 
> Sent: Monday, 19 December 2022 22:08
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> Cc: Service Infogérance 
> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
>
> On Mon, 2022-12-19 at 09:48 +0000, Thomas CAS wrote:
> > Hello Clusterlabs,
> >
> > I would like to report a bug on Pacemaker with the "IPaddr2"
> > resource:
> >
> > OS: Debian 10
> > Kernel: Linux wd-websqlng01 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1
> > (2021-09-29) x86_64 GNU/Linux
> > Pacemaker version: 2.0.1-5+deb10u2
> >
> > You will find the configuration of our cluster with 2 nodes attached.
> >
> > Bug :
> >
> > We have several IPs configured in the cluster configuration (12).
> > Sometimes the cluster is unstable with the following errors in the 
> > pacemaker logs:
> >
> > Dec 18 21:07:51 **SENSITIVEDATA** pacemaker-execd [5079]
> > (operation_finished)   notice: NGINX-VIP-
> > 

[ClusterLabs] Fix for CVE-2022-30123 and CVE-2019-11358

2022-12-21 Thread A Gunasekar via Users
Hi Team,

Please be informed: our security tool has notified us that our pcs version
0.9 is affected by CVE-2022-30123 and CVE-2019-11358.
It would be great if you could help us get answers to the queries below.

  *   We are currently on RHEL 7.9 and using pcs version 0.9. Is there any
fix planned/available for this affected version (0.9.x) of pcs?
  *   Let us know in which release these CVE fixes are planned.

Our system details:
OS version: RHEL 7.9
ClusterLabs pcs version: 0.9


Gunasekar A
Senior Software Engineer
BDGS SA BSS PDU BSS PDG EC CH NGCRS
Mobile: +919894561292
Email ID: a.gunase...@ericsson.com




Re: [ClusterLabs] Antw: [EXT] Re: Bug pacemaker with multiple IP

2022-12-21 Thread Klaus Wenninger
On Wed, Dec 21, 2022 at 11:26 AM Reid Wahl  wrote:

> On Wed, Dec 21, 2022 at 2:15 AM Ulrich Windl wrote:
> >
> > Hi!
> >
> > I wonder: Could the error message be triggered by adding an exclusive
> > mandatory lock on the ip binary?
> > If that triggers the bug, I'm rather sure that the error message is bad.
> > Shouldn't that be EWOULDBLOCK then?
>
> I did some cursory reading earlier today, and it seems that ETXTBSY is
> becoming less common: https://lwn.net/Articles/866493/
>
> Either way, that would be a question for kernel maintainers.
>

Maybe the network-stack guys there, or somebody with deeper insight into how
the ip tool currently interacts with the kernel.
Without knowing any details: certain things might be handled by calling
bpf binaries, and with ip being the userspace application, the error might
still be shown against it even if it was actually a bpf binary that was
about to be executed. Thinking of race conditions on that front ...


>
> > (I have no idea how Sophos AV works, though. If they open the files to
> > check in write-mode, it's really stupid then IMHO)
> >
> > Regards,
> > Ulrich
> >
> >
> > >>> Reid Wahl wrote on 21.12.2022 at 10:19 in message:
> > > On Wed, Dec 21, 2022 at 12:24 AM Thomas CAS  wrote:
> > >>
> > >> Ken,
> > >>
> > >> Antivirus (sophos-av) is running but not in "real time access scanning",
> > >> the scheduled scan is however at 9pm every day.
> > >> 7 minutes later, we got these alerts.
> > >> The anti virus may indeed be the cause.
> > >
> > > I see. That does seem fairly likely. At least, there's no other
> > > obvious candidate for the cause.
> > >
> > > I used to work on a customer-facing support team for the ClusterLabs
> > > suite, and we received a fair number of cases where bizarre issues
> > > (such as hangs and access errors) were apparently caused by an
> > > antivirus. In those cases, all other usual lines of investigation were
> > > exhausted, and when we asked the customer to disable their AV, the
> > > issue disappeared. This happened with several different AV products.
> > >
> > > I can't say with any certainty that the AV is causing your issue, and
> > > I know it's frustrating that you won't know whether any given
> > > intervention worked, since this only happens once every few months.
> > >
> > > You may want to either exclude certain files from the scan, or write a
> > > short script to place the cluster in maintenance mode before the scan
> > > and take it out of maintenance after the scan is complete.
> > >
> > >>
> > >> I had the case on December 13 (with systemctl here):
> > >>
> > >> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> > > [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> > > /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> > > file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd:
> > > /bin/systemctl: Text file busy\n ]
> > >> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> > > [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> > > /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> > > file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd:
> > > /bin/systemctl: Text file busy\n ]
> > >>
> > >> After, this happens rarely, we had the case in August:
> > >>
> > >> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> > > [3718] (process_lrm_event)  notice:
> > > wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [
> > > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1:
> > > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file
> > > busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> > >> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> > > [3718] (process_lrm_event)  notice:
> > > wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [
> > > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1:
> > > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file
> > > busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> > >>
> > >> It's always around 9:00-9:07 pm,
> > >> I'll move the virus scan to 10pm and see.
> > >
> > > That also sounds like a good plan to confirm the cause :) It might
> > > take a while to find out though.
> > >
> > >>
> > >> Thanks,
> > >> Best regards,
> > >>
> > >> Thomas Cas  |  Managed Services Support Technician
> > >> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
> > >> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
> > >> Before printing this letter, think about the impact on the environment!
> > >>
> > >> -----Original Message-----
> > >> From: Reid Wahl 
> > >> Sent: Tuesday, 20 December 2022 20:34
> > >> To: Cluster Labs - All topics related to open-source clustering welcomed 
> > >> Cc: Ken Gaillot ; Service Infogérance 
> > >> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP

Re: [ClusterLabs] Antw: [EXT] Re: Bug pacemaker with multiple IP

2022-12-21 Thread Reid Wahl
On Wed, Dec 21, 2022 at 2:15 AM Ulrich Windl wrote:
>
> Hi!
>
> I wonder: Could the error message be triggered by adding an exclusive
> mandatory lock on the ip binary?
> If that triggers the bug, I'm rather sure that the error message is bad.
> Shouldn't that be EWOULDBLOCK then?

I did some cursory reading earlier today, and it seems that ETXTBSY is
becoming less common: https://lwn.net/Articles/866493/

Either way, that would be a question for kernel maintainers.

> (I have no idea how Sophos AV works, though. If they open the files to check
> in write-mode, it's really stupid then IMHO)
>
> Regards,
> Ulrich
>
>
> >>> Reid Wahl wrote on 21.12.2022 at 10:19 in message:
> > On Wed, Dec 21, 2022 at 12:24 AM Thomas CAS  wrote:
> >>
> >> Ken,
> >>
> >> Antivirus (sophos-av) is running but not in "real time access scanning",
> >> the scheduled scan is however at 9pm every day.
> >> 7 minutes later, we got these alerts.
> >> The anti virus may indeed be the cause.
> >
> > I see. That does seem fairly likely. At least, there's no other
> > obvious candidate for the cause.
> >
> > I used to work on a customer-facing support team for the ClusterLabs
> > suite, and we received a fair number of cases where bizarre issues
> > (such as hangs and access errors) were apparently caused by an
> > antivirus. In those cases, all other usual lines of investigation were
> > exhausted, and when we asked the customer to disable their AV, the
> > issue disappeared. This happened with several different AV products.
> >
> > I can't say with any certainty that the AV is causing your issue, and
> > I know it's frustrating that you won't know whether any given
> > intervention worked, since this only happens once every few months.
> >
> > You may want to either exclude certain files from the scan, or write a
> > short script to place the cluster in maintenance mode before the scan
> > and take it out of maintenance after the scan is complete.
> >
> >>
> >> I had the case on December 13 (with systemctl here):
> >>
> >> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> > [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> > /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> > file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd:
> > /bin/systemctl: Text file busy\n ]
> >> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> > [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> > /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> > file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd:
> > /bin/systemctl: Text file busy\n ]
> >>
> >> After, this happens rarely, we had the case in August:
> >>
> >> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> > [3718] (process_lrm_event)  notice:
> > wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [
> > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1:
> > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file
> > busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> >> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> > [3718] (process_lrm_event)  notice:
> > wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [
> > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1:
> > /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file
> > busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> >>
> >> It's always around 9:00-9:07 pm,
> >> I'll move the virus scan to 10pm and see.
> >
> > That also sounds like a good plan to confirm the cause :) It might
> > take a while to find out though.
> >
> >>
> >> Thanks,
> >> Best regards,
> >>
> >> Thomas Cas  |  Managed Services Support Technician
> >> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
> >> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
> >> Before printing this letter, think about the impact on the environment!
> >>
> >> -----Original Message-----
> >> From: Reid Wahl 
> >> Sent: Tuesday, 20 December 2022 20:34
> >> To: Cluster Labs - All topics related to open-source clustering welcomed 
> >> Cc: Ken Gaillot ; Service Infogérance 
> >> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
> >>
> >> On Tue, Dec 20, 2022 at 6:25 AM Thomas CAS  wrote:
> >> >
> >> > Hello Ken,
> >> >
> >> > Thanks for your answer.
> >> > There was no update running at the time of the bug, which is why I
> >> > thought that having too many IPs caused this type of error.
> >> > The /usr/sbin/ip executable was not being modified either.
> >> >
> >> > We have many clusters, and only this one has so many IPs and this
> >> > problem.
> >>
> >> How often does this happen, and is it reliably reproducible under any
> > 

[ClusterLabs] Antw: [EXT] Re: Bug pacemaker with multiple IP

2022-12-21 Thread Ulrich Windl
Hi!

I wonder: Could the error message be triggered by adding an exclusive
mandatory lock on the ip binary?
If that triggers the bug, I'm rather sure that the error message is bad.
Shouldn't that be EWOULDBLOCK then?
(I have no idea how Sophos AV works, though. If they open the files to check
in write-mode, it's really stupid then IMHO)

Regards,
Ulrich


>>> Reid Wahl wrote on 21.12.2022 at 10:19 in message:
> On Wed, Dec 21, 2022 at 12:24 AM Thomas CAS  wrote:
>>
>> Ken,
>>
>> Antivirus (sophos-av) is running but not in "real time access scanning", the
>> scheduled scan is however at 9pm every day.
>> 7 minutes later, we got these alerts.
>> The anti virus may indeed be the cause.
> 
> I see. That does seem fairly likely. At least, there's no other
> obvious candidate for the cause.
> 
> I used to work on a customer-facing support team for the ClusterLabs
> suite, and we received a fair number of cases where bizarre issues
> (such as hangs and access errors) were apparently caused by an
> antivirus. In those cases, all other usual lines of investigation were
> exhausted, and when we asked the customer to disable their AV, the
> issue disappeared. This happened with several different AV products.
> 
> I can't say with any certainty that the AV is causing your issue, and
> I know it's frustrating that you won't know whether any given
> intervention worked, since this only happens once every few months.
> 
> You may want to either exclude certain files from the scan, or write a
> short script to place the cluster in maintenance mode before the scan
> and take it out of maintenance after the scan is complete.
> 
>>
>> I had the case on December 13 (with systemctl here):
>>
>> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
>> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
>>
>> After, this happens rarely, we had the case in August:
>>
>> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
>> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
>>
>> It's always around 9:00-9:07 pm,
>> I'll move the virus scan to 10pm and see.
> 
> That also sounds like a good plan to confirm the cause :) It might
> take a while to find out though.
> 
>>
>> Thanks,
>> Best regards,
>>
>> Thomas Cas  |  Managed Services Support Technician
>> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en 
>> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
>> Before printing this letter, think about the impact on the environment!
>>
>> -----Original Message-----
>> From: Reid Wahl 
>> Sent: Tuesday, 20 December 2022 20:34
>> To: Cluster Labs - All topics related to open-source clustering welcomed 
>> Cc: Ken Gaillot ; Service Infogérance 
>> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
>>
>> On Tue, Dec 20, 2022 at 6:25 AM Thomas CAS  wrote:
>> >
>> > Hello Ken,
>> >
>> > Thanks for your answer.
>> > There was no update running at the time of the bug, which is why I thought
>> > that having too many IPs caused this type of error.
>> > The /usr/sbin/ip executable was not being modified either.
>> >
>> > We have many clusters, and only this one has so many IPs and this problem.
>>
>> How often does this happen, and is it reliably reproducible under any 
> circumstances? Any antivirus software running? It'd be nice to check 
> something like lsof or strace while it's happening, but that may not be 
> feasible if it's sporadic; running those at every monitor would generate
> lots of logs.
>>
>> AFAICT, having multiple processes execute (or read) the `ip` binary 
> simultaneously *shouldn't* cause problems, as long as nothing opens it for 
> write.
>>
>> >
>> > Best regards,
>> >
>> > Thomas Cas  |  

[ClusterLabs] Antw: Re: Antw: [EXT] Re: Stonith

2022-12-21 Thread Ulrich Windl
>>> Ken Gaillot wrote on 20.12.2022 at 16:21 in message
<3a5960c2331f97496119720f6b5a760b3fe3bbcf.ca...@redhat.com>:
> On Tue, 2022-12-20 at 11:33 +0300, Andrei Borzenkov wrote:
>> On Tue, Dec 20, 2022 at 10:07 AM Ulrich Windl
>>  wrote:
>> > > But keep in mind that if the whole site is down (or unaccessible)
>> > > you
>> > > will not have access to IPMI/PDU/whatever on this site so your
>> > > stonith
>> > > agents will fail ...
>> > 
>> > But, considering the design, such site won't have a quorum and
>> > should commit suicide, right?
>> > 
>> 
>> Not by default.
> 
> And even if it does, the rest of the cluster can't assume that it did,
> so resources can't be recovered. It could work with sbd, but the poster
> said that the physical hosts aren't accessible.

Why? Assuming fencing is configured, the nodes that are part of the quorum
should wait for the fencing delay, assuming fencing (or suicide) was done.
Then they can manage resources. OK, a non-working fencing or suicide
mechanism is a different story...

Regards,
Ulrich


> -- 
> Ken Gaillot 
> 





Re: [ClusterLabs] Bug pacemaker with multiple IP

2022-12-21 Thread Reid Wahl
On Wed, Dec 21, 2022 at 12:24 AM Thomas CAS  wrote:
>
> Ken,
>
> Antivirus (sophos-av) is running but not in "real time access scanning", the 
> scheduled scan is however at 9pm every day.
> 7 minutes later, we got these alerts.
> The anti virus may indeed be the cause.

I see. That does seem fairly likely. At least, there's no other
obvious candidate for the cause.

I used to work on a customer-facing support team for the ClusterLabs
suite, and we received a fair number of cases where bizarre issues
(such as hangs and access errors) were apparently caused by an
antivirus. In those cases, all other usual lines of investigation were
exhausted, and when we asked the customer to disable their AV, the
issue disappeared. This happened with several different AV products.

I can't say with any certainty that the AV is causing your issue, and
I know it's frustrating that you won't know whether any given
intervention worked, since this only happens once every few months.

You may want to either exclude certain files from the scan, or write a
short script to place the cluster in maintenance mode before the scan
and take it out of maintenance after the scan is complete.
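
A minimal sketch of that second option (assuming a cron-driven scan; the scan
command itself is a placeholder for whatever sophos-av provides):

#!/bin/sh
# unmanage all resources for the duration of the scan
crm_attribute --type crm_config --name maintenance-mode --update true
/path/to/av-scan   # hypothetical: replace with the real scheduled scan command
# resume normal resource management
crm_attribute --type crm_config --name maintenance-mode --update false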

>
> I had the case on December 13 (with systemctl here):
>
> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld  
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [ 
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text 
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
> pacemaker.log-20221217.gz:Dec 13 21:07:53 wd-websqlng01 pacemaker-controld  
> [5082] (process_lrm_event)  notice: wd-websqlng01-NGINX_monitor_15000:454 [ 
> /etc/init.d/nginx: 33: /lib/lsb/init-functions.d/40-systemd: systemctl: Text 
> file busy\n/etc/init.d/nginx: 82: /lib/lsb/init-functions.d/40-systemd: 
> /bin/systemctl: Text file busy\n ]
>
> After, this happens rarely, we had the case in August:
>
> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld  
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-232_monitor_1:2877 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
> pacemaker.log-20220826.gz:Aug 25 21:06:31 wd-websqlng01 pacemaker-controld  
> [3718] (process_lrm_event)  notice: 
> wd-websqlng01-NGINX-VIP-231_monitor_1:2880 [ 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: 1: 
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2: uname: Text file 
> busy\nocf-exit-reason:IPaddr2 only supported Linux.\n ]
>
> It's always around 9:00-9:07 pm,
> I'll move the virus scan to 10pm and see.

That also sounds like a good plan to confirm the cause :) It might
take a while to find out though.

>
> Thanks,
> Best regards,
>
> Thomas Cas  |  Managed Services Support Technician
> PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
> IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
> Before printing this letter, think about the impact on the environment!
>
> -----Original Message-----
> From: Reid Wahl 
> Sent: Tuesday, 20 December 2022 20:34
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> Cc: Ken Gaillot ; Service Infogérance 
> Subject: Re: [ClusterLabs] Bug pacemaker with multiple IP
>
> On Tue, Dec 20, 2022 at 6:25 AM Thomas CAS  wrote:
> >
> > Hello Ken,
> >
> > Thanks for your answer.
> > There was no update running at the time of the bug, which is why I thought 
> > that having too many IPs caused this type of error.
> > The /usr/sbin/ip executable was not being modified either.
> >
> > We have many clusters, and only this one has so many IPs and this problem.
>
> How often does this happen, and is it reliably reproducible under any 
> circumstances? Any antivirus software running? It'd be nice to check 
> something like lsof or strace while it's happening, but that may not be 
> feasible if it's sporadic; running those at every monitor would generate lots 
> of logs.
>
> AFAICT, having multiple processes execute (or read) the `ip` binary 
> simultaneously *shouldn't* cause problems, as long as nothing opens it for 
> write.
>
> >
> > Best regards,
> >
> > Thomas Cas  |  Managed Services Support Technician
> > PHONE : +33 3 51 25 23 26   WEB : www.ikoula.com/en
> > IKOULA Data Center 34 rue Pont Assy - 51100 Reims - FRANCE
> > Before printing this letter, think about the impact on the environment!