Yesterday I went through the same situation after our router crashed and broke the connections to the hosts.
The solution is quite simple and already documented by Red Hat [1]. Just restarting the hosted engine solves the problem: `systemctl restart hosted-engine` (I've sketched the commands I used below, after the quoted message).

[1] https://access.redhat.com/solutions/4292981

On Tue, May 2, 2023 at 09:14, <ivan.lezhnjov...@gmail.com> wrote:
> Hi!
>
> We have a problem with multiple hosts stuck in Connecting state, which I
> hoped somebody here could help us wrap our heads around.
>
> All hosts, except one, seem to have very similar symptoms, but I'll focus
> on one host that represents the rest.
>
> So, the host is stuck in Connecting state and this is what we see in the oVirt
> log files.
>
> /var/log/ovirt-engine/engine.log:
>
> 2023-04-20 09:51:53,021+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = ABC010-176-XYZ, VdsIdAndVdsVDSCommandParametersBase:{hostId='2c458562-3d4d-4408-afc9-9a9484984a91', vds='Host[ABC010-176-XYZ,2c458562-3d4d-4408-afc9-9a9484984a91]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: SSL session is invalid
> 2023-04-20 09:55:16,556+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ABC010-176-XYZ command Get Host Capabilities failed: Message timeout which can be caused by communication issues
>
> /var/log/vdsm/vdsm.log:
>
> 2023-04-20 17:48:51,977+0300 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList() from=internal, task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:31)
> 2023-04-20 17:48:51,977+0300 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:37)
> 2023-04-20 17:48:51,978+0300 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)
>
> Both engine.log and vdsm.log are flooded with these messages, repeated at
> regular intervals ad infinitum. This is one common symptom shared by multiple
> hosts in our deployment: they all have these message loops in their engine.log
> and vdsm.log files.
>
> Running vdsm-client Host getConnectedStoragePools also returns an empty list,
> represented by [], on all hosts (interestingly, there is one host that showed a
> Storage Pool UUID and yet was still stuck in Connecting state).
>
> This particular host (ABC010-176-XYZ) is connected to 3 CEPH iSCSI Storage
> Domains, and lsblk shows 3 block devices with matching UUIDs among their device
> components. So the storage seems to be connected, but the Storage Pool is not?
> How is that even possible?
>
> Now, what's even weirder is that we tried rebooting the host (via the
> Administration Portal) and it didn't help. We even tried removing and
> re-adding the host in the Administration Portal, but to no avail.
>
> Additionally, the host refused to go into Maintenance mode, so we had to
> force it by manually updating the Engine DB.
>
> We also tried reinstalling the host via the Administration Portal and ran into
> another weird problem, which I'm not sure is related or deserves a dedicated
> discussion thread, but, basically, the underlying Ansible playbook exited with
> the following error message:
>
> "stdout" : "fatal: [10.10.10.176]: UNREACHABLE!
=> {\"changed\": false, > \"msg\": \"Data could not be sent to remote host \\\"10.10.10.176\\\". Make > sure this host can be reached over ssh: \", \"unreachable\": true}", > > Counterintuitively, just before running Reinstall via Administrator Portal > we had been able to reboot the same host (which as you know oVirt does via > Ansible as well). So, no changes on the host in between just different > Ansible playbooks. To confirm that we actually had access to the host over > ssh we successfully ran ssh -p $PORT root@10.10.10.176 -i > /etc/pki/ovirt-engine/keys/engine_id_rsa and it worked. > > That made us scratch our heads for a while but what seems to had fixed > Ansible's ssh access problems was manual full stop of all VDSM-related > systemd services on the host. It was just a wild guess but as soon as we > stopped all VDSM services Ansible stopped complaining about not being able > to reach the target host and successfully did its job. > > I'm sure you'd like to see more logs but I'm not certain what exactly is > relevant. There are a ton of logs as this deployment is comprised of nearly > 80 hosts. So, I guess it's best if you just request to see specific logs, > messages or configuration details and I'll cherry-pick what's relevant. > > We don't really understand what's going on and would appreciate any help. > We tried just about anything we could think of to resolve this issue and > are running out of ideas what to do next. > > If you have any questions just ask and I'll do my best to answer them. > _______________________________________________ > Users mailing list -- users@ovirt.org > To unsubscribe send an email to users-le...@ovirt.org > Privacy Statement: https://www.ovirt.org/privacy-policy.html > oVirt Code of Conduct: > https://www.ovirt.org/community/about/community-guidelines/ > List Archives: > https://lists.ovirt.org/archives/list/users@ovirt.org/message/ECE6466SEJ4MW4MN23GQ7IFDCVO5HOTU/ >
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/GVO5EEMIPIQ2SFUIKC6QQ3FHPXLRT5ST/