Yesterday I went through the same situation after our router crashed and
broke the connections between the Engine and the hosts.

The solution is quite simple and already documented by Red Hat. [1]

Just restarting the engine (the ovirt-engine service inside the hosted
engine VM) solves the problem:
`systemctl restart ovirt-engine`
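
In case it helps, the sequence is roughly this, run inside the hosted
engine VM (a minimal sketch assuming the standard ovirt-engine unit name):

  # confirm the engine is hitting the same SSL errors as below
  grep -c 'SSL session is invalid' /var/log/ovirt-engine/engine.log

  # restart the engine and watch the hosts come back from Connecting
  systemctl restart ovirt-engine
  tail -f /var/log/ovirt-engine/engine.log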

[1] https://access.redhat.com/solutions/4292981

On Tue, May 2, 2023 at 09:14, <ivan.lezhnjov...@gmail.com> wrote:

> Hi!
>
> We have a problem with multiple hosts stuck in Connecting state, which I
> hoped somebody here could help us wrap our heads around.
>
> All hosts, except one, seem to have very similar symptoms but I'll focus
> on one host that represents the rest.
>
> So, the host is stuck in Connecting state and this is what we see in the
> oVirt log files.
>
>  /var/log/ovirt-engine/engine.log:
>
> 2023-04-20 09:51:53,021+03 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) []
> Command 'GetCapabilitiesAsyncVDSCommand(HostName = ABC010-176-XYZ,
> VdsIdAndVdsVDSCommandParametersBase:{hostId='2c458562-3d4d-4408-afc9-9a9484984a91',
> vds='Host[ABC010-176-XYZ,2c458562-3d4d-4408-afc9-9a9484984a91]'})'
> execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException:
> SSL session is invalid
> 2023-04-20 09:55:16,556+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) []
> EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ABC010-176-XYZ command
> Get Host Capabilities failed: Message timeout which can be caused by
> communication issues
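
Before restarting anything, it may also be worth ruling out an expired
VDSM certificate, which can produce similar SSL failures. A quick check,
using the host IP from your mail and the default vdsm certificate path
(adjust if yours differ):

  # from the engine: inspect the certificate vdsm presents on port 54321
  echo | openssl s_client -connect 10.10.10.176:54321 2>/dev/null \
      | openssl x509 -noout -dates -subject

  # on the host: check the local vdsm certificate directly
  openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -dates
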
>
> /var/log/vdsm/vdsm.log:
>
> 2023-04-20 17:48:51,977+0300 INFO  (vmrecovery) [vdsm.api] START
> getConnectedStoragePoolsList() from=internal,
> task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:31)
> 2023-04-20 17:48:51,977+0300 INFO  (vmrecovery) [vdsm.api] FINISH
> getConnectedStoragePoolsList return={'poollist': []} from=internal,
> task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:37)
> 2023-04-20 17:48:51,978+0300 INFO  (vmrecovery) [vds] recovery: waiting
> for storage pool to go up (clientIF:723)
>
> Both engine.log and vdsm.log are flooded with these messages. They are
> repeated at regular intervals ad infinitum. This is one common symptom
> shared by multiple hosts in our deployment: they all have these message
> loops in their engine.log and vdsm.log files.
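
Just to quantify the loop, a quick count of those exact messages (strings
taken verbatim from your excerpt) shows how long a host has been stuck:

  # on a host: how many recovery iterations, and when the loop started
  grep -c 'waiting for storage pool to go up' /var/log/vdsm/vdsm.log
  grep 'waiting for storage pool to go up' /var/log/vdsm/vdsm.log | head -1
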
>
> Running vdsm-client Host getConnectedStoragePools also returns an empty
> list ([]) on all hosts (interestingly, there is one host that showed a
> Storage Pool UUID and yet was still stuck in Connecting state).
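
Since you have nearly 80 hosts, it might be quicker to collect that state
from the engine in one pass. A rough sketch, where hosts.txt is a
hypothetical file with one host address per line and the key path is the
one from your mail:

  while read -r h; do
      echo "== $h =="
      ssh -i /etc/pki/ovirt-engine/keys/engine_id_rsa "root@$h" \
          'vdsm-client Host getConnectedStoragePools'
  done < hosts.txt
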
>
> This particular host (ABC010-176-XYZ) is connected to 3 CEPH iSCSI Storage
> Domains and lsblk shows 3 block devices with matching UUIDs in their device
> components. So, the storage seems to be connected but the Storage Pool is
> not? How is that even possible?
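
On the storage side it can help to separate "the LUNs are visible" from
"vdsm has activated the domain". Roughly, assuming block/iSCSI domains
(where each storage domain is an LVM VG named after the domain UUID):

  iscsiadm -m session   # are the iSCSI sessions to the Ceph gateways up?
  multipath -ll         # are the LUNs mapped through multipath?
  vgs                   # each block storage domain should appear as a VG
                        # named after the storage domain UUID
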
>
> Now, what's even more weird is that we tried rebooting the host (via
> Administrator Portal) and it didn't help. We even tried removing and
> re-adding the host in Administrator Portal but to no avail.
>
> Additionally, the host refused to go into Maintenance mode, so we had to
> force it by manually updating the Engine DB.
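
For the archives, I assume that was the usual vds_dynamic trick; a heavily
hedged sketch only (table layout and status codes vary between versions,
so verify against your engine first and do it with the engine stopped):

  # status 2 is Maintenance in the versions I have seen; double-check.
  # The host UUID below is the one from your engine.log excerpt.
  sudo -u postgres psql engine -c \
      "UPDATE vds_dynamic SET status = 2 \
       WHERE vds_id = '2c458562-3d4d-4408-afc9-9a9484984a91';"
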
>
> We also tried reinstalling the host via Administrator Portal and ran into
> another weird problem, which I'm not sure is related or deserves a
> dedicated discussion thread of its own, but basically the underlying
> Ansible playbook exited with the following error message:
>
> "stdout" : "fatal: [10.10.10.176]: UNREACHABLE! => {\"changed\": false,
> \"msg\": \"Data could not be sent to remote host \\\"10.10.10.176\\\". Make
> sure this host can be reached over ssh: \", \"unreachable\": true}",
>
> Counterintuitively, just before running Reinstall via Administrator Portal
> we had been able to reboot the same host (which, as you know, oVirt also
> does via Ansible). So, no changes on the host in between, just different
> Ansible playbooks. To confirm that we actually had ssh access to the host,
> we ran ssh -p $PORT root@10.10.10.176 -i
> /etc/pki/ovirt-engine/keys/engine_id_rsa and it worked.
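
One thing that can help separate "ssh works" from "Ansible works" is an
ad-hoc Ansible ping from the engine machine with the same key. This is
only a sketch and does not replicate the engine's ansible-runner
environment exactly, but it rules out plain ssh/key problems:

  ansible all -i '10.10.10.176,' -m ping -u root \
      --private-key /etc/pki/ovirt-engine/keys/engine_id_rsa \
      -e ansible_port=$PORT
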
>
> That made us scratch our heads for a while, but what seems to have fixed
> Ansible's ssh access problems was manually stopping all VDSM-related
> systemd services on the host. It was just a wild guess, but as soon as we
> stopped all VDSM services Ansible stopped complaining about not being able
> to reach the target host and successfully did its job.
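
For anyone else reading this later, "all VDSM-related services" presumably
means something like the set below (the ovirt-ha-* units only exist on
hosted-engine hosts, and stopping vdsmd will make the host NonResponsive
in the UI until it is started again):

  systemctl stop vdsmd supervdsmd mom-vdsm
  # only on hosted-engine hosts:
  systemctl stop ovirt-ha-agent ovirt-ha-broker
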
>
> I'm sure you'd like to see more logs, but I'm not certain what exactly is
> relevant. There are a ton of logs, as this deployment consists of nearly
> 80 hosts. So, I guess it's best if you just request specific logs,
> messages or configuration details and I'll cherry-pick what's relevant.
>
> We don't really understand what's going on and would appreciate any help.
> We have tried just about everything we could think of to resolve this
> issue and are running out of ideas for what to do next.
>
> If you have any questions just ask and I'll do my best to answer them.
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/ECE6466SEJ4MW4MN23GQ7IFDCVO5HOTU/
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/GVO5EEMIPIQ2SFUIKC6QQ3FHPXLRT5ST/
