Hello,
We have a cluster with 9 oVirt hosts and a storage cluster with 3 storage
servers, running oVirt 4.3.6.6-1.
The storage is set up as replica 3 with arbiter.
There is another, unmanaged Gluster volume, replica 3 on the local disks of
the oVirt hosts; the engine and a few small pfSense VMs reside on it.

Yesterday we experienced a hardware failure of a RAID controller on one of the
storage servers, so that server stopped working.
On that particular server's local disks there was an NFS storage domain used
for ISOs.

After the storage server failure, multiple machines got stuck in a
not-responding state; others kept working, but we could not manage them from
oVirt. Many machines had an ISO mounted, so it may be related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1337073
But machines with an ISO mounted were not the only ones affected. Almost every
host in the oVirt cluster was switching between non-responding and OK status,
and many machines were in Unknown status even if they had no ISO mounted and
did not reside on any storage domain of the Gluster cluster at all, e.g. the
pfSense machines residing on local drives like the engine. These machines
could not be powered off from the engine. In any case, I could not put the ISO
domain into maintenance because ISOs were still mounted on various virtual
machines.
I'm wondering what caused the almost total failure: was it the NFS domain that
could not be mounted? And of course, how could I prevent such a situation?
I'm sending you the logs from the critical time period from one of the hosts.
Please let me know if you need more debug info.
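For what it's worth, my reading is that the "Timeout: Connection timed out" entries in the logs below are vdsm's ioprocess giving up on a stat of the dead NFS export rather than a network error: a hung hard NFS mount blocks any thread that touches it, which is presumably why vdsm does path checks in helper processes. A minimal Python sketch of that isolation pattern (the paths and timeout here are made up for illustration, this is not vdsm code):

```python
import multiprocessing
import os

def _stat_worker(path, queue):
    """Runs in a child process; may hang forever on a dead NFS mount."""
    try:
        os.stat(path)
        queue.put("ok")
    except OSError as exc:
        queue.put("error: %s" % exc)

def probe_path(path, timeout=5.0):
    """Stat `path` in a child process so a hung mount cannot block the
    caller; returns "ok", "error: ...", or "timeout"."""
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=_stat_worker, args=(path, queue))
    child.start()
    child.join(timeout)
    if child.is_alive():
        # The stat is stuck in the kernel (uninterruptible sleep on a
        # dead mount); all we can do is abandon the helper process.
        child.terminate()
        child.join()
        return "timeout"
    return queue.get()

if __name__ == "__main__":
    print(probe_path("/tmp"))             # "ok" on a healthy local filesystem
    print(probe_path("/does/not/exist"))  # "error: ..."
```

The point of the pattern is that the caller always gets an answer within the timeout; the stuck helper is discarded, which is what the "Worker discarded" executor messages in the logs look like to me.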

2019-10-17 06:01:43,039+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getStats succeeded in 0.25 seconds (__init__:312)
2019-10-17 06:01:46,638+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos/ec67be6b-367f-4e41-a686-695ab50c0034/dom_md/metadata' is blocked for 390.00 seconds (check:282)
2019-10-17 06:01:49,318+0300 WARN  (itmap/0) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos (fileSD:913)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 902, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 110, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 561, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
2019-10-17 06:01:49,319+0300 ERROR (monitor/ec67be6) [storage.Monitor] Error checking domain ec67be6b-367f-4e41-a686-695ab50c0034 (monitor:425)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 145, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 135, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'ec67be6b-367f-4e41-a686-695ab50c0034',)
2019-10-17 06:01:50,253+0300 INFO  (periodic/42) [vdsm.api] START repoStats(domains=()) from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:48)
2019-10-17 06:01:50,253+0300 INFO  (periodic/42) [vdsm.api] FINISH repoStats return={u'3c973b94-a57d-4462-bcb3-3cef13290327': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000932761', 'lastCheck': '3.6', 'valid': True}, u'ec67be6b-367f-4e41-a686-695ab50c0034': {'code': 2001, 'actual': True, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '0.9', 'valid': False}, u'a1ecca6f-1487-492d-b9bf-d3fb60f1bc99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00142361', 'lastCheck': '3.7', 'valid': True}, u'd5c52bc8-1a83-4915-b82f-d98e33db7c99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00124982', 'lastCheck': '3.6', 'valid': True}, u'453eb3c2-c3a5-4d5f-8c1d-b16588fc5eff': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.001343', 'lastCheck': '3.6', 'valid': True}} from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:54)
2019-10-17 06:01:50,403+0300 WARN  (vdsm.Scheduler) [Executor] executor state: count=8 workers=set([<Worker name=qgapoller/58 waiting task#=1 at 0x7fcd61483310>, <Worker name=qgapoller/53 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=30.00 at 0x7fcd615adcd0> discarded task#=11 at 0x7fcd615ad2d0>, <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0>, <Worker name=qgapoller/60 waiting task#=0 at 0x7fcd612cd910>, <Worker name=qgapoller/57 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=20.00 at 0x7fcd8003f190> discarded task#=0 at 0x7fcd3c355310>, <Worker name=qgapoller/55 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=40.00 at 0x7fcd61483bd0> discarded task#=57 at 0x7fcd611dd150>, <Worker name=qgapoller/59 waiting task#=0 at 0x7fcd614318d0>, <Worker name=qgapoller/54 waiting task#=14 at 0x7fcd3c3f5fd0>]) (executor:213)
2019-10-17 06:01:50,403+0300 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0> (executor:355)
2019-10-17 06:01:52,094+0300 WARN  (vdsm.Scheduler) [Executor] executor state: count=10 workers=set([<Worker name=periodic/42 waiting task#=4 at 0x7fcd3c592850>, <Worker name=periodic/43 waiting task#=0 at 0x7fcd6124a090>, <Worker name=periodic/39 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=37.50 at 0x7fcd61483610> discarded task#=10 at 0x7fcd8003fed0>, <Worker name=periodic/2 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=832.50 at 0x7fcd3c505f10> discarded task#=287729 at 0x7fcd806020d0>, <Worker name=periodic/41 waiting task#=7 at 0x7fcd3c5927d0>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=802.50 at 0x7fcd6162d550> discarded task#=287772 at 0x7fcd3c592910>, <Worker name=periodic/34 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=22.50 at 0x7fcd3c529d10> discarded task#=43 at 0x7fcd3c4add50>, <Worker name=periodic/40 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.UpdateVolumes'> at 0x7fcd3c5924d0> at 0x7fcd3c592e10> timeout=30.0, duration=2.89 at 0x7fcd3c433710> task#=20 at 0x7fcd612cd890>, <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0>, <Worker name=periodic/4 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=817.50 at 0x7fcd615d62d0> discarded task#=284811 at 0x7fcd614495d0>]) (executor:213)
2019-10-17 06:01:52,094+0300 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0> (executor:355)
2019-10-17 06:01:52,495+0300 INFO  (monitor/453eb3c) [IOProcessClient] (/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos) Closing client (__init__:610)
2019-10-17 06:01:52,495+0300 INFO  (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO  (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO  (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO  (monitor/453eb3c) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Closing client (__init__:610)
2019-10-17 06:01:53,123+0300 INFO  (itmap/0) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting client (__init__:308)
2019-10-17 06:01:53,133+0300 INFO  (itmap/1) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Starting client (__init__:308)
2019-10-17 06:01:53,136+0300 INFO  (ioprocess/28525) [IOProcess] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,149+0300 INFO  (itmap/2) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Starting client (__init__:308)
2019-10-17 06:01:53,152+0300 INFO  (ioprocess/28534) [IOProcess] (/glusterSD/10.252.166.130:_volume3) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,161+0300 INFO  (itmap/3) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Starting client (__init__:308)
2019-10-17 06:01:53,165+0300 WARN  (monitor/a1ecca6) [storage.PersistentDict] Could not parse line `^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@` and so on (the line is all NUL bytes; trimmed here)

2019-10-17 06:01:53,166+0300 INFO  (ioprocess/28540) [IOProcess] (/glusterSD/10.252.166.128:_volume1) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,175+0300 INFO  (ioprocess/28547) [IOProcess] (/glusterSD/10.252.166.129:_volume2) Starting ioprocess (__init__:434)
2019-10-17 06:01:54,763+0300 INFO  (monitor/a1ecca6) [storage.StorageDomain] Removing remnants of deleted images [] (fileSD:726)
2019-10-17 06:01:55,271+0300 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,32814 (api:48)
2019-10-17 06:01:55,274+0300 WARN  (jsonrpc/4) [virt.vm] (vmId='38586008-37a0-4400-83b2-771090c50a9a') monitor became unresponsive (command timeout, age=850.680000001) (vm:6061)
2019-10-17 06:01:55,277+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmStats', 'id': u'33b72799-68cd-419c-bb6c-d9c9f61664b5'} at 0x7fcd3c329d10> timeout=60, duration=60.01 at 0x7fcd3c3298d0> task#=39322 at 0x7fcd3c581ed0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getAllVmStats
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1395, in getAllVmStats
  statsList = self._cif.getAllVmStats()
File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 569, in getAllVmStats
  return [v.getStats() for v in self.getVMs().values()]
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1781, in getStats
  oga_stats = self._getGuestStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2000, in _getGuestStats
  self._update_guest_disk_mapping()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2015, in _update_guest_disk_mapping
  self._sync_metadata()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5142, in _sync_metadata
  self._md_desc.dump(self._dom)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 515, in dump
  0)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2364, in setMetadata
  ret = libvirtmod.virDomainSetMetadata(self._o, type, metadata, key, uri, flags) (executor:363)
2019-10-17 06:01:55,279+0300 WARN  (jsonrpc/3) [virt.vm] (vmId='b14cbec9-7acb-4f80-ad5e-3169e17d5328') Couldn't update metadata: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats) (vm:2018)
2019-10-17 06:01:55,285+0300 INFO  (jsonrpc/3) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,32814 (api:54)
2019-10-17 06:01:55,290+0300 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 60.02 seconds (__init__:312)
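One more observation on the repoStats output above: the ISO domain ec67be6b-367f-4e41-a686-695ab50c0034 is the only one reporting a nonzero code (2001) with 'valid': False, which lines up with the StorageDomainDoesNotExist traceback. A small, purely illustrative helper for pulling the unhealthy domains out of such a dict (the two entries are transcribed from the log; the other three healthy domains are omitted for brevity):

```python
# Entries transcribed from the repoStats FINISH line in the log above.
repo_stats = {
    'ec67be6b-367f-4e41-a686-695ab50c0034': {
        'code': 2001, 'actual': True, 'version': -1,
        'acquired': False, 'delay': '0', 'lastCheck': '0.9',
        'valid': False,
    },
    '3c973b94-a57d-4462-bcb3-3cef13290327': {
        'code': 0, 'actual': True, 'version': 5,
        'acquired': True, 'delay': '0.000932761', 'lastCheck': '3.6',
        'valid': True,
    },
}

def unhealthy_domains(stats):
    """Return the UUIDs of domains vdsm flags as broken
    (nonzero error code, or valid == False)."""
    return sorted(uuid for uuid, entry in stats.items()
                  if entry['code'] != 0 or not entry['valid'])

print(unhealthy_domains(repo_stats))
# ['ec67be6b-367f-4e41-a686-695ab50c0034']
```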
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZDH2QMC3DD5Y6PFBDP6667QKPS3ISW3L/
