Three Gen8 HP360 servers recalled from retirement, each with a single 1TB TLC 
SATA SSD for boot and the oVirt /engine volume, and 7x4TB HDD in RAID6 for 
/vmstore and /data; 10Gbit NICs and network.

All run CentOS 7.7, updated daily.

These machines may not be used exclusively for oVirt, so I don't want to 
re-install the OS whenever an oVirt setup fails: instead I try my best to 
clean up the nodes before doing another oVirt installation run.

They ran oVirt for a week or two using a completely distinct set of storage, so 
they are fundamentally sound, but we wanted higher storage capacity so I 
swapped everything and re-installed CentOS very much the same way as before.

The first oVirt setup went smoothly, but the cluster crumbled without much 
usage. I won't go into details here, because I didn't want to investigate for 
now; instead I focused on redoing the installation and cleaning up the old 
setup.

I know the docs actually recommend starting with wiped hardware, but 
operationally that would be a show-stopper for the intended use case.

So I cleaned up as best I can: ovirt-hosted-engine-cleanup, both with and 
without redoing the whole Gluster storage setup (which, apart from SSD caching 
not working, gives me no issues).

Undoing the network changes in such a way that the oVirt HCI wizard ceases 
complaining is a bit more involved. I typically run:
- vdsm-tool ovn-unconfigure
- vdsm-tool clear-nets (this drops the network, so you need to switch to the console at this point)
- vdsm-tool remove-config

and then I still need to edit 
/etc/sysconfig/network-scripts/ifcfg-<ethernet-device> to bring the physical 
adapter back to life.
Sometimes I still need to remove the ovirtmgmt bridge manually, etc.
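For reference, the whole cleanup sequence looks roughly like this on my hosts. 
Treat it as a sketch, not a verified recipe: the device name eno1 is just an 
example, and the final bridge removal is only needed when the cleanup tools 
leave ovirtmgmt behind.

```shell
# Run from the physical console: clear-nets drops network connectivity.
ovirt-hosted-engine-cleanup       # remove hosted-engine remnants

vdsm-tool ovn-unconfigure         # undo the OVN configuration
vdsm-tool clear-nets              # remove all VDSM-defined networks
vdsm-tool remove-config           # drop the persisted VDSM configuration

# Bring the physical adapter back to life: restore its original ifcfg
# file (eno1 is an example device name), then restart networking.
vi /etc/sysconfig/network-scripts/ifcfg-eno1
systemctl restart network

# If the ovirtmgmt bridge is still lingering, remove it manually.
ip link set ovirtmgmt down 2>/dev/null
ip link delete ovirtmgmt type bridge 2>/dev/null
```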

Whether I remove and redo the Gluster setup has a bit of an effect on the 
re-installation, but it doesn't make a difference in what follows.

So here is where I am currently getting stuck consistently:

The wizard has gone through preparing the Gluster storage (which is completely 
functional at that point), has created the local VM on the installation node, 
installed the Postgres database, populated it, etc.; basically it has oVirt up 
and running on the primary Gluster node and now wants to add the second and 
third nodes.

At that point I get "Connection lost" in the web wizard, evidently as a 
consequence of Ansible fiddling around heavily to set up the local bridge for 
the VM. I remember that for the scripted variant of the setup it is recommended 
to run the script inside 'screen' or 'tmux' to ensure its execution isn't 
interrupted by that. But for the GUI variant, evidently there *should* be some 
other type of protection, perhaps via the re-connecting nature of HTTP...

Pushing the "Reconnect" button in the GUI at that point doesn't return you to 
where the setup left off, but only offers to redeploy, while the 
HostedEngineLocal VM is still there and running.

I ssh'd into the machine and started looking for errors and warnings, and saw 
that the installation had gone rather far without incident. OTOPI had 
completely finished, the WildFly server was up and running, and the Postgres 
database was fully installed and running smoothly. The only problem I can find 
is that it tries to add the additional Gluster nodes, but complains that these 
nodes (quoting Gluster UUIDs) are not part of the "cluster". An investigation 
into the Postgres database shows that the 'gluster_server' table indeed 
contains only the primary node.
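For anyone wanting to check the same thing: this is roughly how I inspected 
the table on the (still local) engine VM. The database name 'engine' and the 
plain psql invocation are what I found on my version; paths and the psql 
wrapper may differ on other releases.

```shell
# On the engine VM: dump the Gluster peers the engine knows about.
# Right after the failure this showed only the primary node's row.
sudo -u postgres psql engine -c "SELECT * FROM gluster_server;"
```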

I don't know what part of the process should have added the other two nodes, 
but there seems to be no *remaining* connectivity issue with the Gluster 
members: I installed gscli and connected to all three nodes and volumes 
without issue.
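Independent of any client tool, the plain gluster CLI on the primary node 
confirms the same thing (volume names are the ones from my setup):

```shell
gluster peer status            # every peer should be "Peer in Cluster (Connected)"
gluster pool list              # lists the peer UUIDs the engine complains about
gluster volume status engine   # repeat for vmstore and data; bricks should be online
```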

I am guessing at this point, that the complex rewiring of the software defined 
network is causing a temporary issue and a race condition that I don't know how 
to recover from.

Since the oVirt management GUI is actually fully operational and can be 
reached from the primary node via the temporary bridge, I went into the GUI 
and even managed to add the two additional nodes without any problems. Their 
installation went through without any issues, they showed up in the 
gluster_server table in Postgres, and basically the installation could have 
proceeded from that point, except... that I don't know how to restart the 
process from there: it still has to 'beam' the local VM onto the Gluster 
storage and restart it there.

I have gone through the process three times now, with absolutely identical 
results.

I could use some help recovering from this situation, which looks like a race 
condition, and nothing a re-installation of everything would really resolve.

In the meantime, I'll try the scripted variant inside 'screen' to see if that 
fares better.
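For anyone following along, the scripted run under screen is simply (the 
session name is my choice; hosted-engine --deploy is the standard scripted 
entry point):

```shell
screen -S he-deploy      # start a named, detachable session
hosted-engine --deploy   # run the scripted setup inside it

# If the SSH session drops while Ansible rewires the bridge,
# reattach from a new login:
screen -r he-deploy
```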
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/TRUEYBHAYPJUAIXGZ5EI5LRM5ZFE62YM/
