Three Gen8 HP360 servers recalled from retirement, each with a single 1TB TLC 
SATA SSD for boot and the oVirt /engine volume, and 7x4TB HDD in RAID6 for 
/vmstore and /data; 10Gbit NICs and network.

All run CentOS 7.7, updated daily.

These machines may not be used exclusively for oVirt, so I don't want to 
re-install the OS whenever an oVirt setup fails: instead I try my best to 
clean up the nodes before doing another oVirt installation run.

They ran oVirt for a week or two using a completely distinct set of storage, so 
they are fundamentally sound, but we wanted higher storage capacity so I 
swapped everything and re-installed CentOS very much the same way as before.

The first oVirt setup went smoothly, but the cluster crumbled without much 
usage. I won't go into details here, because I didn't want to investigate for 
now; instead I focused on redoing the installation and cleaning up the old 
setup.

I know the docs actually recommend starting with wiped hardware, but 
operationally that would be a show-stopper for the intended use case.

So I cleaned up as best I can: ovirt-hosted-engine-cleanup, both with and 
without redoing the whole Gluster storage setup (which, apart from SSD caching 
not working, gives me no issues).

Undoing the network changes in such a way that the oVirt HCI wizard ceases 
complaining is a bit more involved. I typically run:
- vdsm-tool ovn-unconfigure
- vdsm-tool clear-nets (this drops the network, so you need to switch to the console at this point)
- vdsm-tool remove-config

and then I still need to edit 
/etc/sysconfig/network-scripts/ifcfg-<ethernet-device> to bring the physical 
adapter back to life.
Sometimes I still need to remove the ovirtmgmt bridge manually, etc.
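For reference, the whole cleanup sequence looks roughly like this on my hosts. 
Treat it as a sketch, not a verified recipe: the device name eno1 is just an 
example, and the final bridge removal is only needed when the cleanup tools 
leave ovirtmgmt behind.

```shell
# Run from the physical console: clear-nets drops network connectivity.
ovirt-hosted-engine-cleanup       # remove hosted-engine remnants

vdsm-tool ovn-unconfigure         # undo the OVN configuration
vdsm-tool clear-nets              # remove all VDSM-defined networks
vdsm-tool remove-config           # drop the persisted VDSM configuration

# Bring the physical adapter back to life: restore its original ifcfg
# file (eno1 is an example device name), then restart networking.
vi /etc/sysconfig/network-scripts/ifcfg-eno1
systemctl restart network

# If the ovirtmgmt bridge is still lingering, remove it manually.
ip link set ovirtmgmt down 2>/dev/null
ip link delete ovirtmgmt type bridge 2>/dev/null
```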

Whether I remove and redo the Gluster setup has a bit of an effect on the 
re-installation, but it doesn't make a difference in what follows.

So here is where I am currently getting stuck consistently:

The wizard has gone through preparing the Gluster storage (which is completely 
functional at that point), has created the local VM on the installation node, 
installed the Postgres database, populated it, etc.; basically it has oVirt up 
and running on the primary Gluster node and now wants to add the second and 
third nodes.

At that point I get "Connection lost" in the web wizard, evidently as a 
consequence of Ansible fiddling around heavily to set up the local bridge for 
the VM. I remember that for the scripted variant of the setup it is recommended 
to run the script inside 'screen' or 'tmux' to ensure its execution isn't 
interrupted by that. But for the GUI variant, evidently there *should* be some 
other type of protection, perhaps via the re-connecting nature of HTTP...

Pushing the "Reconnect" button in the GUI at that point doesn't return you to 
where the setup left off, but only offers to redeploy, while the 
HostedEngineLocal VM is still there and running.

I ssh'd into the machine and started looking for errors and warnings, and saw 
that the installation had gone rather far without incident. OTOPI had 
completely finished, the WildFly server was up and running, and the Postgres 
database was fully installed and running smoothly. The only problem I can find 
is that it tries to add the additional Gluster nodes, but complains that these 
nodes (quoting Gluster UUIDs) are not part of the "cluster". An investigation 
into the Postgres database shows that the 'gluster_server' table indeed 
contains only the primary node.
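For anyone wanting to check the same thing: this is roughly how I inspected 
the table on the (still local) engine VM. The database name 'engine' and the 
plain psql invocation are what I found on my version; paths and the psql 
wrapper may differ on other releases.

```shell
# On the engine VM: dump the Gluster peers the engine knows about.
# Right after the failure this showed only the primary node's row.
sudo -u postgres psql engine -c "SELECT * FROM gluster_server;"
```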

I don't know what part of the process should have added the other two nodes, 
but there seems to be no *remaining* connectivity issue with the Gluster 
members: I installed gscli and connected to all three nodes and volumes 
without issue.
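Independent of any client tool, the plain gluster CLI on the primary node 
confirms the same thing (volume names are the ones from my setup):

```shell
gluster peer status            # every peer should be "Peer in Cluster (Connected)"
gluster pool list              # lists the peer UUIDs the engine complains about
gluster volume status engine   # repeat for vmstore and data; bricks should be online
```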

I am guessing at this point, that the complex rewiring of the software defined 
network is causing a temporary issue and a race condition that I don't know how 
to recover from.

Since the oVirt management GUI is actually fully operational and can be 
reached from the primary node via the temporary bridge, I went into the GUI 
and even managed to add the two additional nodes without any problems. Their 
installation went through without any issues, they showed up in the 
gluster_server table in Postgres, and basically the installation could have 
proceeded from that point, except... that I don't know how to restart the 
process from there: it still has to 'beam' the local VM onto the Gluster 
storage and restart it there.

I have gone through the process three times now, with absolutely identical 
results.

I could use some help recovering from this situation, which looks like a race 
condition, and nothing a re-installation of everything would really resolve.

In the meantime, I'll try the scripted variant inside 'screen' to see if that 
fares better.
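For anyone following along, the scripted run under screen is simply (the 
session name is my choice; hosted-engine --deploy is the standard scripted 
entry point):

```shell
screen -S he-deploy      # start a named, detachable session
hosted-engine --deploy   # run the scripted setup inside it

# If the SSH session drops while Ansible rewires the bridge,
# reattach from a new login:
screen -r he-deploy
```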
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/TRUEYBHAYPJUAIXGZ5EI5LRM5ZFE62YM/
