Hi!

We have a problem with multiple hosts stuck in the Connecting state, and I hope
somebody here can help us wrap our heads around it.

All hosts, except one, seem to have very similar symptoms but I'll focus on one 
host that represents the rest.

So, the host is stuck in the Connecting state, and this is what we see in the
oVirt log files.

/var/log/ovirt-engine/engine.log:

2023-04-20 09:51:53,021+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = ABC010-176-XYZ, VdsIdAndVdsVDSCommandParametersBase:{hostId='2c458562-3d4d-4408-afc9-9a9484984a91', vds='Host[ABC010-176-XYZ,2c458562-3d4d-4408-afc9-9a9484984a91]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: SSL session is invalid
2023-04-20 09:55:16,556+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ABC010-176-XYZ command Get Host Capabilities failed: Message timeout which can be caused by communication issues

/var/log/vdsm/vdsm.log:

2023-04-20 17:48:51,977+0300 INFO  (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList() from=internal, task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:31)
2023-04-20 17:48:51,977+0300 INFO  (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:37)
2023-04-20 17:48:51,978+0300 INFO  (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)

Both engine.log and vdsm.log are flooded with these messages. They are repeated
at regular intervals ad infinitum. This is one common symptom shared by
multiple hosts in our deployment: they all have these message loops in their
engine.log and vdsm.log files.
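
For anyone who wants to see how widespread the loop is on their own engine, here
is a minimal Python sketch that counts ERROR entries per host in engine.log. It
only recognises the two message shapes quoted above; the function name and the
regular expressions are ours, not anything shipped with oVirt, and it assumes
every log entry sits on a single line in the real file.

```python
import re
import sys
from collections import Counter

# Assumption: one log entry per line, as in the real engine.log (the excerpts
# in this mail may have been wrapped by the mail client).
HOST_PATTERNS = [
    re.compile(r"HostName = ([^,)]+)"),  # GetCapabilitiesAsyncVDSCommand failures
    re.compile(r"VDSM (\S+) command "),  # VDS_BROKER_COMMAND_FAILURE audit events
]


def count_errors_by_host(lines):
    """Return a Counter mapping host name -> number of ERROR entries naming it."""
    counts = Counter()
    for line in lines:
        if " ERROR " not in line:
            continue
        for pattern in HOST_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1).strip()] += 1
                break  # count each log entry at most once
    return counts


# Usage: python3 count_errors.py /var/log/ovirt-engine/engine.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        for host, n in count_errors_by_host(log).most_common():
            print(f"{n:8d}  {host}")
```

Sorting by count makes it easy to see whether the failures are spread evenly
across hosts or concentrated on a few of them.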

Running vdsm-client Host getConnectedStoragePools also returns an empty list
([]) on all of these hosts. Interestingly, there is one host that did show a
Storage Pool UUID and yet was still stuck in the Connecting state.
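
To repeat that check in a scriptable way, something like the following Python
sketch could be run on a host. It wraps the same vdsm-client call and assumes
the command prints its result as JSON on stdout (which matches the [] we see);
the helper names are made up for illustration.

```python
import json
import subprocess


def parse_pool_list(raw_output):
    """Parse the text printed by vdsm-client into a list of pool UUIDs."""
    pools = json.loads(raw_output)
    if not isinstance(pools, list):
        raise ValueError(f"unexpected vdsm-client output: {raw_output!r}")
    return pools


def connected_storage_pools():
    """Run the same command as above on the local host and parse its output."""
    result = subprocess.run(
        ["vdsm-client", "Host", "getConnectedStoragePools"],
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if vdsm-client exits non-zero
        timeout=60,   # do not hang forever if VDSM is unresponsive
    )
    return parse_pool_list(result.stdout)
```

An empty list from connected_storage_pools() corresponds to the [] described
above; a non-zero exit or a timeout would instead point at VDSM itself not
answering.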

This particular host (ABC010-176-XYZ) is connected to 3 Ceph iSCSI Storage
Domains, and lsblk shows 3 block devices with matching UUIDs in their device
components. So the storage seems to be connected, but the Storage Pool is not?
How is that even possible?

Now, what's even weirder is that we tried rebooting the host (via the
Administration Portal) and it didn't help. We even tried removing and re-adding
the host in the Administration Portal, but to no avail.

Additionally, the host refused to go into Maintenance mode, so we had to force
it by manually updating the Engine DB.

We also tried reinstalling the host via the Administration Portal and ran into
another weird problem. I'm not sure whether it is related or deserves a
dedicated discussion thread but, basically, the underlying Ansible playbook
exited with the following error message:

"stdout" : "fatal: [10.10.10.176]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"10.10.10.176\\\". Make sure this host can be reached over ssh: \", \"unreachable\": true}",

Counterintuitively, just before running Reinstall via the Administration Portal
we had been able to reboot the same host (which, as you know, oVirt does via
Ansible as well). So nothing changed on the host in between; only the Ansible
playbook was different. To confirm that we actually had access to the host over
ssh, we ran ssh -p $PORT root@10.10.10.176 -i
/etc/pki/ovirt-engine/keys/engine_id_rsa and it worked.
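
Working ssh says nothing about the channel behind the "SSL session is invalid"
error, since the engine talks to VDSM over TLS on a separate port (54321 by
default). A rough Python sketch for probing that port from the engine machine
is below. It is only a reachability check: it does not present the engine's
client certificate and does not verify the host certificate, so a successful
handshake does not prove the engine's own session would be accepted. The
function name and the stage labels are ours.

```python
import socket
import ssl

VDSM_PORT = 54321  # default VDSM management port


def probe_vdsm_tls(host, port=VDSM_PORT, timeout=5.0):
    """Return (stage, detail) describing how far a connection to VDSM gets.

    stage is "tcp_failed", "tls_failed" or "tls_ok".
    """
    # Verification is deliberately disabled: this only tests whether the
    # port answers and speaks TLS, not whether the certificates are valid.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return "tcp_failed", str(exc)

    try:
        with context.wrap_socket(sock) as tls:
            return "tls_ok", tls.version()
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        return "tls_failed", str(exc)
    finally:
        sock.close()
```

For example, probe_vdsm_tls("10.10.10.176") returning "tcp_failed" would point
at the network or a stopped VDSM, while "tls_failed" would point at the TLS
layer itself.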

That made us scratch our heads for a while, but what seems to have fixed
Ansible's ssh access problems was a manual full stop of all VDSM-related systemd
services on the host. It was just a wild guess, but as soon as we stopped all
VDSM services, Ansible stopped complaining about not being able to reach the
target host and successfully did its job.

I'm sure you'd like to see more logs but I'm not certain what exactly is 
relevant. There are a ton of logs as this deployment is comprised of nearly 80 
hosts. So, I guess it's best if you just request to see specific logs, messages 
or configuration details and I'll cherry-pick what's relevant.

We don't really understand what's going on and would appreciate any help. We
have tried just about everything we could think of to resolve this issue and are
running out of ideas about what to do next.

If you have any questions just ask and I'll do my best to answer them.
_______________________________________________
Users mailing list -- users@ovirt.org