Hi All, I've run into some issues connecting my oVirt Cluster to my Ceph Cluster via iSCSI. There are two problems, and I don't know whether one is causing the other, whether they are related at all, or whether they are two separate, unrelated issues. Let me explain.
The Situation
-------------
- I have a working three-node Ceph Cluster (Ceph Quincy on Rocky Linux 8.6)
- The Ceph Cluster has four Storage Pools of between 4 and 8 TB each
- The Ceph Cluster has three iSCSI Gateways
- There is a single iSCSI Target on the Ceph Cluster
- The iSCSI Target has all three iSCSI Gateways attached
- The iSCSI Target has all four Storage Pools attached
- The four Storage Pools have been assigned LUNs 0-3
- I have set up (Discovery) CHAP Authorisation on the iSCSI Target
- I have a working three-node self-hosted oVirt Cluster (oVirt v4.5.3 on Rocky Linux 8.6)
- The oVirt Cluster has (in addition to the hosted_storage Storage Domain) three GlusterFS Storage Domains
- I can ping all three Ceph Cluster Nodes to/from all three oVirt Hosts
- The iSCSI Target on the Ceph Cluster has all three oVirt Hosts' Initiators attached
- Each Initiator has all four Ceph Storage Pools attached
- I have set up CHAP Authorisation on the iSCSI Target's Initiators
- The Ceph Cluster Admin Portal reports that all three Initiators are "logged_in"
- I have previously connected Ceph iSCSI LUNs to the oVirt Cluster successfully (as an experiment), but had to remove and re-instate them for the "final" version(?).
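For reference, this is the sort of per-host discovery check I have been running to confirm each oVirt Host can see each Gateway with discovery CHAP. The portal address and CHAP credentials below are placeholders, not my real values, and the `command -v` guard just makes the sketch safe to run on a box without `iscsiadm`:

```shell
#!/bin/sh
# Placeholders -- substitute your own gateway portal and discovery CHAP
# credentials before running on an oVirt Host.
PORTAL="192.0.2.53:3260"    # hypothetical address of iSCSI Gateway 3
CHAP_USER="myiscsiuser"     # hypothetical discovery CHAP username
CHAP_PASS="myiscsipass"     # hypothetical discovery CHAP password

# Re-create the discovery record for this portal, attach the discovery
# CHAP settings, then run a fresh SendTargets discovery against it.
if command -v iscsiadm >/dev/null 2>&1; then
    iscsiadm -m discoverydb -t sendtargets -p "$PORTAL" -o new
    iscsiadm -m discoverydb -t sendtargets -p "$PORTAL" -o update \
        -n discovery.sendtargets.auth.authmethod -v CHAP
    iscsiadm -m discoverydb -t sendtargets -p "$PORTAL" -o update \
        -n discovery.sendtargets.auth.username -v "$CHAP_USER"
    iscsiadm -m discoverydb -t sendtargets -p "$PORTAL" -o update \
        -n discovery.sendtargets.auth.password -v "$CHAP_PASS"
    iscsiadm -m discoverydb -t sendtargets -p "$PORTAL" --discover
fi
```

On Hosts 1 & 2 this returns all four Targets/LUNs from every Gateway; on Host 3 the Gateway 3 portal is where things go wrong.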
- The oVirt Admin Portal (ie the HostedEngine) reports that Initiators 1 & 2 (ie oVirt Hosts 1 & 2) are "logged_in" to all three iSCSI Gateways
- The oVirt Admin Portal reports that Initiator 3 (ie oVirt Host 3) is "logged_in" to iSCSI Gateways 1 & 2 only
- I can "force" Initiator 3 to become "logged_in" to iSCSI Gateway 3, but when I do this it is *not* persistent
- oVirt Hosts 1 & 2 can/have discovered all three iSCSI Gateways
- oVirt Hosts 1 & 2 can/have discovered all four LUNs/Targets on all three iSCSI Gateways
- oVirt Host 3 can only discover 2 of the iSCSI Gateways
- For Target/LUN 0, oVirt Host 3 can only "see" the LUN provided by iSCSI Gateway 1
- For Targets/LUNs 1-3, oVirt Host 3 can only "see" the LUNs provided by iSCSI Gateways 1 & 2
- oVirt Host 3 can *not* "see" any of the Targets/LUNs provided by iSCSI Gateway 3
- When I create a new oVirt Storage Domain for any of the four LUNs:
  - I am presented with a message saying "The following LUNs are already in use..."
  - I am asked to "Approve operation" via a checkbox, which I do
  - As I watch the oVirt Admin Portal I can see the new iSCSI Storage Domain appear in the Storage Domain list, and then after a few minutes it is removed
  - After those few minutes I am presented with this failure message: "Error while executing action New SAN Storage Domain: Network error during communication with the Host."
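For completeness, this is roughly how I have been "forcing" the Host 3 login to Gateway 3, plus what I would have expected to make it persistent (the target IQN and portal are placeholders for my real values, and I'm not certain `node.startup` is the whole story here, since vdsm manages its own iSCSI sessions):

```shell
#!/bin/sh
# Placeholders -- substitute the real target IQN and the Gateway 3 portal.
TARGET="iqn.2003-01.com.example.iscsi-gw:ceph-igw"   # hypothetical IQN
GW3="192.0.2.53:3260"                                # hypothetical portal

if command -v iscsiadm >/dev/null 2>&1; then
    # The manual (non-persistent) login I've been doing on Host 3:
    iscsiadm -m node -T "$TARGET" -p "$GW3" --login

    # What I'd expect to make the login survive reboots/rescans:
    iscsiadm -m node -T "$TARGET" -p "$GW3" -o update \
        -n node.startup -v automatic

    # Check what the node record actually says afterwards:
    iscsiadm -m node -T "$TARGET" -p "$GW3" | grep node.startup
fi
```

Even after this, the session is gone again the next time I look, which is the non-persistence I mention above.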
- I have looked in the engine.log, and all I could find that was relevant (as far as I know) was this:

~~~
2022-11-28 19:59:20,506+11 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (default task-1) [77b0c12d] Command 'CreateStorageDomainVDSCommand(HostName = ovirt_node_1.mynet.local, CreateStorageDomainVDSCommandParameters:{hostId='967301de-be9f-472a-8e66-03c24f01fa71', storageDomain='StorageDomainStatic:{name='data', id='2a14e4bd-c273-40a0-9791-6d683d145558'}', args='s0OGKR-80PH-KVPX-Fi1q-M3e4-Jsh7-gv337P'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2022-11-28 19:59:20,507+11 ERROR [org.ovirt.engine.core.bll.storage.domain.AddSANStorageDomainCommand] (default task-1) [77b0c12d] Command 'org.ovirt.engine.core.bll.storage.domain.AddSANStorageDomainCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

I cannot see/detect any "communication issue" - but then again I'm not 100% sure what I should be looking for.

I have looked online for an answer, and apart from not being able to get past Red Hat's "wall" to see the solutions that they have, all I could find that was relevant was this: https://lists.ovirt.org/archives/list/de...@ovirt.org/thread/AVLORQNOLJHRWMHTM4WCDRVP7VSIZBGR/ . If this *is* relevant, then there is not enough context there for me to proceed (ie/eg *where* (on which host/VM) should that command be run?). I also found (for a previous version of oVirt) notes about manually modifying the Postgres DB to resolve a similar issue.
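Beyond engine.log, these are the host-side checks I have started poking at on the host named in the error (ovirt_node_1); the log paths are the stock vdsm/sanlock locations as far as I know, so please correct me if the interesting bits live elsewhere:

```shell
#!/bin/sh
# Host-side checks, run on the host named in the engine.log error.
# Each check is guarded so the script is harmless on other machines.

# vdsm's own view of the failed createStorageDomain call:
if [ -f /var/log/vdsm/vdsm.log ]; then
    grep -B 2 -A 10 'createStorageDomain' /var/log/vdsm/vdsm.log
fi

# A stuck multipath device can make vdsm hang long enough to trip the
# engine's message timeout -- check every LUN shows all expected paths
# and none are marked "faulty":
if command -v multipath >/dev/null 2>&1; then
    multipath -ll
fi

# sanlock problems can also surface as storage-domain creation timeouts:
if [ -f /var/log/sanlock.log ]; then
    grep -i error /var/log/sanlock.log
fi
```

If anyone can tell me what a genuine "communication issue" looks like in those logs, that would help enormously.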
While I am more than comfortable doing this (I've been an SQL DBA for well over 20 years), it seems like asking for trouble - at least until I hear back from the oVirt Devs that it is OK to do - and of course I'd need the relevant commands / locations / authorisations to get into the DB.

Questions
---------
- Are the two issues (oVirt Host 3 not having a full picture of the Ceph iSCSI environment, and the oVirt iSCSI Storage Domain creation failure) related?
- Do I need to "refresh" the iSCSI info on the oVirt Hosts, and if so, how do I do this?
- Do I need to "flush" the old LUNs from the oVirt Cluster, and if so, how do I do this?
- Where else should I be looking for info in the logs (& which logs)?
- Does *anyone* have any other ideas how to resolve the situation - especially when using the Ceph iSCSI Gateways?

Thanks in advance

Cheers

Dulux-Oz
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MCOJD4R6PS4BKUTUM3BXYWSX5RDPWR2N/