Hi all,

I run an oVirt 4.3.6.7-1.el7 installation (50+ hosts, 40+ FC storage domains 
on two all-flash arrays) and experienced a problem accessing individual storage 
domains.

As a result, hosts were set to "Non-Operational" because they could not see all 
storage domains, and the SPM role started to move between hosts.

oVirt messages start with:
2019-11-04 15:10:22.739+01 | VDSM HOST082 command SpmStatusVDS failed: (-202, 
'Sanlock resource read failure', 'IO timeout')
2019-11-04 15:10:22.781+01 | Invalid status on Data Center <name>. Setting Data 
Center status to Non Responsive (On host HOST82, Error: General Exception).
...
2019-11-04 15:13:58.836+01 | Host HOST017 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.85+01  | Host HOST005 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.85+01  | Host HOST012 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST002 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST010 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST011 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:58.852+01 | Host HOST004 cannot access the Storage Domain(s) 
HOST_LUN_204 attached to the Data Center <name>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.011+01 | Host HOST017 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.238+01 | Host HOST004 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.249+01 | Host HOST005 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.255+01 | Host HOST012 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.273+01 | Host HOST002 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.279+01 | Host HOST010 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:13:59.386+01 | Host HOST011 cannot access the Storage Domain(s) 
<UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to 
Non-Operational.
2019-11-04 15:15:14.145+01 | Storage domain HOST_LUN_221 experienced a high 
latency of 9.60953 seconds from host HOST038. This may cause performance and 
functional issues. Please consult your Storage Administrator.

The problem mainly affected two storage domains (on the same array), but I also 
saw isolated messages for other storage domains (on the other array as well).

Storage domains stayed available to the hosts, and all VMs continued to run.

While repeatedly reading from the storage domains (/bin/dd iflag=direct 
if=<metadata>  bs=4096 count=1 of=/dev/null), we got the expected 20+ MBytes/s 
on all but a few storage domains. One of them showed "transfer rates" of around 
200 Bytes/s, but returned to normal performance from time to time. The transfer 
rate to this domain differed between hosts.
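
The read test above can be approximated with a short script that times a 
single 4096-byte direct read, similar to the dd invocation. This is a minimal 
sketch, not the exact procedure used: the function names time_read and sample 
are hypothetical, and it assumes Linux (os.O_DIRECT) and Python 3.

```python
import mmap
import os
import time


def time_read(path, block_size=4096, direct=True):
    """Read one block from the start of `path`; return (bytes_read, seconds)."""
    flags = os.O_RDONLY
    if direct:
        # Bypass the page cache, comparable to dd iflag=direct (Linux only).
        flags |= os.O_DIRECT
    # An anonymous mmap is page-aligned, which O_DIRECT needs for its buffer.
    buf = mmap.mmap(-1, block_size)
    fd = os.open(path, flags)
    try:
        start = time.monotonic()
        n = os.readv(fd, [buf])
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        buf.close()
    return n, elapsed


def sample(path, count=10, interval=1.0, direct=True):
    """Repeat the timed read `count` times; return the list of latencies."""
    latencies = []
    for _ in range(count):
        _, elapsed = time_read(path, direct=direct)
        latencies.append(elapsed)
        time.sleep(interval)
    return latencies
```

Running sample() against the metadata volume of each storage domain on several 
hosts would show whether the latency spikes follow the domain, the host, or 
both.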

/var/log/messages contains qla2xxx abort messages on almost all hosts. There 
are no errors on the SAN switches or the storage array (but the vendor is still 
investigating). I did not see high load on the storage array.

The system seemed to stabilize when I stopped all VMs on the affected storage 
domain and this storage domain became "inactive". Currently, this storage 
domain is still inactive, and we can neither place it in maintenance mode 
("Failed to deactivate Storage Domain") nor activate it. The OVF metadata seems 
to be corrupt as well (failed to update OVF disks <id>, OVF data isn't updated 
on those OVF stores). The first six 512-byte blocks of /dev/<id>/metadata seem 
to contain only zeros.
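
The zero-block observation can be checked with a few lines of Python. This is 
a minimal sketch under stated assumptions: zero_blocks is a hypothetical helper 
name, and the path passed to it is whatever the metadata volume is on the host 
(shown above as /dev/<id>/metadata).

```python
def zero_blocks(path, count=6, block_size=512):
    """Return the indices of the leading blocks that contain only zero bytes."""
    zeros = []
    with open(path, "rb") as f:
        for i in range(count):
            block = f.read(block_size)
            if len(block) < block_size:
                break  # device or file is shorter than expected
            if block.count(0) == block_size:
                zeros.append(i)
    return zeros
```

A result of [0, 1, 2, 3, 4, 5] would correspond to the situation described 
here, where all six leading blocks are zeroed.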

Any advice on how to proceed here?
Is there a way to recover this storage domain?

All the best,
Oliver
