Hello. We have a cluster of 9 oVirt hosts and a storage cluster of 3 storage servers, running oVirt 4.3.6.6-1. The storage is set up as replica 3 with arbiter. There is also an unmanaged Gluster volume, replica 3 on the local disks of the oVirt hosts, on which the engine and a few small pfSense VMs reside.
Yesterday we experienced a hardware failure of a RAID controller on one of the storage servers, so that server stopped working. That particular server also exported, from its local disks, an NFS storage domain used as the ISO domain. After the storage server failure, multiple machines got stuck in a not-responding state; others kept working, but we could not manage them from oVirt. Many machines had an ISO mounted, so it may be related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1337073
But machines with a mounted ISO were not the only ones affected. Almost every host in the oVirt cluster switched from Non Responding to OK status, and many machines were in Unknown status even if they had no ISO mounted and did not even reside on any storage domain of the Gluster cluster, e.g. the pfSense machines residing on local drives alongside the engine. These machines could not be powered off from the engine. In any case, I could not put the ISO domain into maintenance because ISOs were still mounted on various virtual machines.
I'm wondering what caused the almost total failure. Was it the NFS domain that could not be reached? And of course, how could I prevent such a situation? I'm sending the logs from the critical time period from one of the hosts. Please let me know if you need more debug info.
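For context on the Timeout errors in the logs below: vdsm runs its storage checks in a separate ioprocess helper precisely because a stat()/glob() on a hung, hard-mounted NFS export blocks in uninterruptible sleep and cannot be cancelled from the calling thread. A minimal sketch of that pattern, assuming nothing beyond the Python standard library (the path used is purely illustrative, not taken from the logs):

```python
# Sketch: probe a storage-domain mount with a timeout, so a hung NFS
# export is reported as blocked instead of freezing the caller.
# The caller thread times out; the worker thread may stay stuck in
# D state on the dead mount, which is why we must not join it.
import concurrent.futures
import os


def probe_mount(path, timeout=10.0):
    """Return True if `path` answers a stat() within `timeout` seconds,
    False if the call times out (blocked mount) or fails (path gone)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.stat, path)
    try:
        future.result(timeout=timeout)
        return True
    except concurrent.futures.TimeoutError:
        return False  # mount blocked, like the "Checker ... is blocked" warning
    except OSError:
        return False  # path missing, like StorageDomainDoesNotExist
    finally:
        pool.shutdown(wait=False)  # never join a thread stuck on a dead mount


if __name__ == "__main__":
    print(probe_mount("/rhev/data-center/mnt"))
```

This is only a simplified model of what vdsm's checker does; the real implementation uses a separate ioprocess child process rather than a thread, so a stuck helper can be killed outright.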
2019-10-17 06:01:43,039+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getStats succeeded in 0.25 seconds (__init__:312)
2019-10-17 06:01:46,638+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos/ec67be6b-367f-4e41-a686-695ab50c0034/dom_md/metadata' is blocked for 390.00 seconds (check:282)
2019-10-17 06:01:49,318+0300 WARN (itmap/0) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos (fileSD:913)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 902, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 110, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 561, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
2019-10-17 06:01:49,319+0300 ERROR (monitor/ec67be6) [storage.Monitor] Error checking domain ec67be6b-367f-4e41-a686-695ab50c0034 (monitor:425)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 145, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 135, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'ec67be6b-367f-4e41-a686-695ab50c0034',)
2019-10-17 06:01:50,253+0300 INFO (periodic/42) [vdsm.api] START repoStats(domains=()) from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:48)
2019-10-17 06:01:50,253+0300 INFO (periodic/42) [vdsm.api] FINISH repoStats return={u'3c973b94-a57d-4462-bcb3-3cef13290327': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000932761', 'lastCheck': '3.6', 'valid': True}, u'ec67be6b-367f-4e41-a686-695ab50c0034': {'code': 2001, 'actual': True, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '0.9', 'valid': False}, u'a1ecca6f-1487-492d-b9bf-d3fb60f1bc99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00142361', 'lastCheck': '3.7', 'valid': True}, u'd5c52bc8-1a83-4915-b82f-d98e33db7c99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00124982', 'lastCheck': '3.6', 'valid': True}, u'453eb3c2-c3a5-4d5f-8c1d-b16588fc5eff': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.001343', 'lastCheck': '3.6', 'valid': True}} from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:54)
2019-10-17 06:01:50,403+0300 WARN (vdsm.Scheduler) [Executor] executor state: count=8 workers=set([<Worker name=qgapoller/58 waiting task#=1 at 0x7fcd61483310>, <Worker name=qgapoller/53 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=30.00 at 0x7fcd615adcd0> discarded task#=11 at 0x7fcd615ad2d0>, <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0>, <Worker name=qgapoller/60 waiting task#=0 at 0x7fcd612cd910>, <Worker name=qgapoller/57 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=20.00 at 0x7fcd8003f190> discarded task#=0 at 0x7fcd3c355310>, <Worker name=qgapoller/55 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=40.00 at 0x7fcd61483bd0> discarded task#=57 at 0x7fcd611dd150>, <Worker name=qgapoller/59 waiting task#=0 at 0x7fcd614318d0>, <Worker name=qgapoller/54 waiting task#=14 at 0x7fcd3c3f5fd0>]) (executor:213)
2019-10-17 06:01:50,403+0300 INFO (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0> (executor:355)
2019-10-17 06:01:52,094+0300 WARN (vdsm.Scheduler) [Executor] executor state: count=10 workers=set([<Worker name=periodic/42 waiting task#=4 at 0x7fcd3c592850>, <Worker name=periodic/43 waiting task#=0 at 0x7fcd6124a090>, <Worker name=periodic/39 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=37.50 at 0x7fcd61483610> discarded task#=10 at 0x7fcd8003fed0>, <Worker name=periodic/2 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=832.50 at 0x7fcd3c505f10> discarded task#=287729 at 0x7fcd806020d0>, <Worker name=periodic/41 waiting task#=7 at 0x7fcd3c5927d0>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=802.50 at 0x7fcd6162d550> discarded task#=287772 at 0x7fcd3c592910>, <Worker name=periodic/34 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=22.50 at 0x7fcd3c529d10> discarded task#=43 at 0x7fcd3c4add50>, <Worker name=periodic/40 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.UpdateVolumes'> at 0x7fcd3c5924d0> at 0x7fcd3c592e10> timeout=30.0, duration=2.89 at 0x7fcd3c433710> task#=20 at 0x7fcd612cd890>, <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0>, <Worker name=periodic/4 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=817.50 at 0x7fcd615d62d0> discarded task#=284811 at 0x7fcd614495d0>]) (executor:213)
2019-10-17 06:01:52,094+0300 INFO (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0> (executor:355)
2019-10-17 06:01:52,495+0300 INFO (monitor/453eb3c) [IOProcessClient] (/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos) Closing client (__init__:610)
2019-10-17 06:01:52,495+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Closing client (__init__:610)
2019-10-17 06:01:53,123+0300 INFO (itmap/0) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting client (__init__:308)
2019-10-17 06:01:53,133+0300 INFO (itmap/1) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Starting client (__init__:308)
2019-10-17 06:01:53,136+0300 INFO (ioprocess/28525) [IOProcess] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,149+0300 INFO (itmap/2) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Starting client (__init__:308)
2019-10-17 06:01:53,152+0300 INFO (ioprocess/28534) [IOProcess] (/glusterSD/10.252.166.130:_volume3) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,161+0300 INFO (itmap/3) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Starting client (__init__:308)
2019-10-17 06:01:53,165+0300 WARN (monitor/a1ecca6) [storage.PersistentDict] Could not parse line `^@^@^@^@^@^@^@^@...` (long run of NUL bytes, truncated here)
2019-10-17 06:01:53,166+0300 INFO (ioprocess/28540) [IOProcess] (/glusterSD/10.252.166.128:_volume1) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,175+0300 INFO (ioprocess/28547) [IOProcess] (/glusterSD/10.252.166.129:_volume2) Starting ioprocess (__init__:434)
2019-10-17 06:01:54,763+0300 INFO (monitor/a1ecca6) [storage.StorageDomain] Removing remnants of deleted images [] (fileSD:726)
2019-10-17 06:01:55,271+0300 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,32814 (api:48)
2019-10-17 06:01:55,274+0300 WARN (jsonrpc/4) [virt.vm] (vmId='38586008-37a0-4400-83b2-771090c50a9a') monitor became unresponsive (command timeout, age=850.680000001) (vm:6061)
2019-10-17 06:01:55,277+0300 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmStats', 'id': u'33b72799-68cd-419c-bb6c-d9c9f61664b5'} at 0x7fcd3c329d10> timeout=60, duration=60.01 at 0x7fcd3c3298d0> task#=39322 at 0x7fcd3c581ed0>, traceback:
  File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File: "/usr/lib64/python2.7/threading.py", line 765, in run
    self.__target(*self.__args, **self.__kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
    ret = func(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
    self._execute_task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
    task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
    self._callable()
  File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
    self._handler(self._ctx, self._req)
  File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
    response = self._handle_request(req, ctx)
  File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
    res = method(**params)
  File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
    result = fn(*methodArgs)
  File: "<string>", line 2, in getAllVmStats
  File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1395, in getAllVmStats
    statsList = self._cif.getAllVmStats()
  File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 569, in getAllVmStats
    return [v.getStats() for v in self.getVMs().values()]
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1781, in getStats
    oga_stats = self._getGuestStats()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2000, in _getGuestStats
    self._update_guest_disk_mapping()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2015, in _update_guest_disk_mapping
    self._sync_metadata()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5142, in _sync_metadata
    self._md_desc.dump(self._dom)
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 515, in dump
    0)
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f
    ret = attr(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2364, in setMetadata
    ret = libvirtmod.virDomainSetMetadata(self._o, type, metadata, key, uri, flags) (executor:363)
2019-10-17 06:01:55,279+0300 WARN (jsonrpc/3) [virt.vm] (vmId='b14cbec9-7acb-4f80-ad5e-3169e17d5328') Couldn't update metadata: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats) (vm:2018)
2019-10-17 06:01:55,285+0300 INFO (jsonrpc/3) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,32814 (api:54)
2019-10-17 06:01:55,290+0300 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 60.02 seconds (__init__:312)
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZDH2QMC3DD5Y6PFBDP6667QKPS3ISW3L/