On 4/16/21 8:09 AM, Steffen Vinther Sørensen wrote:
On Fri, Apr 16, 2021 at 6:56 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
On 4/15/21 3:26 PM, Ulrich Windl wrote:
Steffen Vinther Sørensen <svint...@gmail.com> wrote on 15.04.2021 at
14:56 in
message
<calhdmbixzoyf-gxg82ont4mgfm6q-_imceuvhypgwky41jj...@mail.gmail.com>:
On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
Steffen Vinther Sørensen <svint...@gmail.com> wrote on 15.04.2021 at
13:10 in
message
<CALhdMBhMQRwmgoWEWuiGMDr7HfVOTTKvW8=nqms2p2e9p8y...@mail.gmail.com>:
Hi there,
In this 3-node cluster, node03 has been offline for a while and is
being brought back into service. Then a migration of a VirtualDomain
is attempted, and node02 is fenced.
Provided are logs from all 3 nodes, the 'pcs config', and a bzcatted
pe-warn. Anyone with an idea of why the node was fenced? Is it because
of the failed IPMI monitor warning?
After a short glance it looks as if the network traffic from the VM
migration killed the corosync (or other) communication.
May I ask what makes you think so?
Mainly that I saw no reason for an intended fencing.
And it looks like node02 was cut off from all network
communication - both corosync and IPMI.
It may really be the networking load, although I would
rather bet on something more systematic, like a
MAC/IP conflict with the VM or something similar.
I see you have libvirtd under cluster control.
Maybe bringing up the network topology destroys the
connection between the nodes.
Has the cluster been working with all 3 nodes before?
Klaus
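If a MAC/IP conflict is on the suspect list, a quick probe from one of the hosts can help rule it out. This is only a sketch: the interface name (bond0) and the VM address (192.0.2.10) are placeholders, and -D is iputils-arping's duplicate address detection mode:

```
# Send ARP probes for the VM's address from the host; a reply from
# another MAC while the VM is down would indicate an address conflict.
arping -D -c 3 -I bond0 192.0.2.10
```

arping -D exits non-zero if any reply is received, so it is easy to script into a pre-migration check.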
Hi Klaus
Yes, it has been working before with all 3 nodes and migrations back
and forth, but a few more VirtualDomains have been deployed since the
last migration test.
It happens very fast, almost immediately after the migration starts.
Could it be that some timeout values should be adjusted?
I just don't have any idea where to start looking, as to me there is
nothing obviously suspicious in the logs.
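If the suspicion is that migration traffic crowds out corosync heartbeats, the usual knob is the totem token timeout in /etc/corosync/corosync.conf. A sketch, with illustrative values only (the cluster name is a placeholder, and on recent corosync the default token is typically 1000 ms plus a per-node coefficient):

```
totem {
    version: 2
    cluster_name: mycluster   # placeholder

    # token: time (ms) to wait for the totem token before declaring
    # a node lost; raising it makes the cluster more tolerant of
    # short network stalls at the cost of slower failure detection
    token: 5000

    # retransmits before the token is considered lost (default 4)
    token_retransmits_before_loss_const: 10
}
```

On a pcs-managed cluster the file would normally be edited on all nodes (or synced with pcs) and corosync restarted, so treat this as a direction to investigate rather than a recommended setting.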
I would look at performance stats; maybe node02 was overloaded and
could not answer in time. Although standard sar stats are collected
every 15 minutes, which is usually too coarse for this.
Migration can stress the network. Talk to your network support: were
there any errors around this time?
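For finer-grained data than the periodic sysstat collection, sar can also sample live while reproducing the problem. A sketch, assuming sysstat is installed and the bond interface is named bond0:

```
# Sample network device stats once per second for 30 seconds,
# keeping only the header and the bond interface
sar -n DEV 1 30 | grep -E 'IFACE|bond0'

# Sample load/run-queue pressure at the same resolution
sar -q 1 30
```

Running this on both nodes during a test migration would show whether the traffic spike and any CPU saturation line up with the fencing.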
I see no network errors around that time when checking e-mails and
syslogs from the network equipment.
Last night I tried to bring up node02, which was fenced earlier, with
'pcs cluster start', and initiated a migration. The same thing
happened: node03 was fenced almost immediately.
Then I brought node03 back up and left it for the night. This morning
I did several migrations successfully. So it might be something that
needs more time to come up, maybe the cluster-managed libvirtd network
components.
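One way to check whether the libvirt-managed network pieces are actually up before attempting a migration (standard libvirt/iproute2 commands; the exact networks and bridges depend on your configuration):

```
# All libvirt networks should report "active" (and autostart if expected)
virsh net-list --all

# Bridges actually present on the host, in brief form
ip -br link show type bridge
```

If a freshly started node shows its networks inactive for a while after 'pcs cluster start', that would fit the "needs more time to come up" theory.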
Did you - instead of migrating - try to stop the VM on one node and
bring it up on the other? Just to see if it is the traffic or the VM
alone in that state ... maybe starting the VM already does something
harmful to networking if libvirtd didn't have the time to bring
up the networking topology (or something similar during startup,
like the VM being attached somewhere wrong - still missing the right
attachment point - so that it grabs all packets ...).
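With pcs, one way to force a stop/start instead of a live migration is to temporarily turn off the resource's allow-migrate meta attribute. A sketch, where the resource name vm01 is a placeholder:

```
# Disable live migration so a move becomes stop-on-source/start-on-target
pcs resource update vm01 meta allow-migrate=false

# Move the resource; without allow-migrate this is a full stop/start
pcs resource move vm01 node03

# Remove the location constraint created by the move, then restore migration
pcs resource clear vm01
pcs resource update vm01 meta allow-migrate=true
```

If the stop/start variant never triggers fencing while live migration does, that points at the migration traffic rather than the VM's network attachment.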
When doing a migration you are possibly moving a MAC address from
one switch port to another. Maybe bringing up the network topology
does some testing/gratuitous ARP with the same MAC, which looks
like one MAC move too many within a certain time window - and the
switch may be configured to punish that with a temporary locking of
the port.
Just a few unsorted thoughts ... at least nothing that rings a
bell immediately ...
Klaus
I have Prometheus scraping node_exporter on all 3 nodes, and I can
dig into the network traffic around the incidents. For the 2 failing
incidents, upon migration the traffic rises to a stable 250 Mb/s or
600 Mb/s for a couple of minutes.
For successful migrations, network traffic always goes to 1000 Mb/s,
which is the max for a single connection; the nodes have 4x1000 Mb
NICs bonded, and there is otherwise very low traffic on them around
any of the incidents.
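For reference, node_exporter exposes transmit bytes as a cumulative counter (node_network_transmit_bytes_total), so a PromQL query like rate(node_network_transmit_bytes_total{device="bond0"}[1m]) * 8 gives bits per second. The same arithmetic can be sanity-checked by hand from two counter samples; the values below are made up to illustrate a 600 Mb/s link:

```shell
# Two made-up samples of the transmit-bytes counter, taken 60 s apart
t0=1000000000        # bytes at first scrape
t1=5500000000        # bytes 60 s later
interval=60

# rate (bytes/s) = delta / interval; * 8 for bits; / 1e6 for Mb/s
awk -v a="$t0" -v b="$t1" -v s="$interval" \
    'BEGIN { printf "%.0f Mb/s\n", (b - a) * 8 / s / 1000000 }'
# prints: 600 Mb/s
```

This matches the ~600 Mb/s plateau seen during one of the failing migrations, and the same formula applied to the bond's aggregate counters would show whether the bond as a whole ever saturates.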
/Steffen
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/