We had a hard crash (network, then power) on our 2-node oVirt cluster. We have an NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the hosts to activate, because they are unable to activate the "master" domain. The master storage domain shows "Locked", while the other storage domains show Unknown (disks) and Inactive (ISO). All the domains are on the same NFS server; we are able to mount it, and the permissions are good.

We believe we might be getting bitten by https://bugzilla.redhat.com/show_bug.cgi?id=920694 or http://gerrit.ovirt.org/#/c/13709/, which says to cease working on it:
Michael Kublin, Apr 10, Patch Set 5: Do not submit
Liron, please abandon this work. This interacts with host life cycle which will be changed, during a change a following problem will be solved as well.

So, we were wondering what we can do to get our oVirt back online, or rather what the correct way is to solve this. We have a few VMs that are down, and we are looking for ways to recover them as quickly as possible.

Thanks in advance,
Tommy

Here are the ovirt-engine logs:

2013-04-23 21:30:04,041 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
2013-04-23 21:30:04,043 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
2013-04-23 21:30:04,049 WARN [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked

Here are the logs from vdsm:

Thread-29::DEBUG::2013-04-23 21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpt1-vmdisks1 /rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1' (cwd None)
Thread-29::DEBUG::2013-04-23 21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso' (cwd None)
Thread-29::INFO::2013-04-23 21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::1151::TaskManager.Task::(prepare) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist': [{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23 21:36:06,071::task::568::TaskManager.Task::(_updateState) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing -> state finished
Thread-29::DEBUG::2013-04-23 21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-29::DEBUG::2013-04-23 21:36:06,072::task::957::TaskManager.Task::(_decref) Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
Thread-30::DEBUG::2013-04-23 21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
Thread-30::DEBUG::2013-04-23 21:36:06,112::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init -> state preparing
Thread-30::INFO::2013-04-23 21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1, scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f', msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73, options=None)
Thread-30::DEBUG::2013-04-23 21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at '__init__'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::504::ResourceManager::(registerResource) Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' for lock type 'exclusive'
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::547::ResourceManager::(registerResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now locking as 'exclusive' (1 active user)
Thread-30::DEBUG::2013-04-23 21:36:06,114::resourceManager::227::ResourceManager.Request::(grant) ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted request
Thread-30::INFO::2013-04-23 21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain: 774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-30::DEBUG::2013-04-23 21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling method
Thread-30::DEBUG::2013-04-23 21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = 'iscsiadm: No session found.\n'; <rc> = 21
Thread-30::DEBUG::2013-04-23 21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host0/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host1/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd of=/sys/class/scsi_host/host2/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23 21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI scan, this will take up to 30 seconds
Thread-30::DEBUG::2013-04-23 21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
Thread-30::DEBUG::2013-04-23 21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
Thread-30::DEBUG::2013-04-23 21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = ' Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
Thread-30::WARNING::2013-04-23 21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
Thread-30::DEBUG::2013-04-23 21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-30::DEBUG::2013-04-23 21:36:08,524::resourceManager::557::ResourceManager::(releaseResource) Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::573::ResourceManager::(releaseResource) Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active users)
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::578::ResourceManager::(releaseResource) Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding out if anyone is waiting for it.
Thread-30::DEBUG::2013-04-23 21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f', Clearing records.
Thread-30::ERROR::2013-04-23 21:36:08,526::task::833::TaskManager.Task::(_setError) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 642, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
Thread-30::DEBUG::2013-04-23 21:36:08,527::task::852::TaskManager.Task::(_run) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run: f551fa3f-9d8c-4de3-895a-964c821060d4 ('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1, '0f63de0e-7d98-48ce-99ec-add109f83c4f', '774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::1177::TaskManager.Task::(stop) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing (force False)
Thread-30::DEBUG::2013-04-23 21:36:08,528::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
Thread-30::INFO::2013-04-23 21:36:08,528::task::1134::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted: 'Cannot find master domain' - code 304
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::1139::TaskManager.Task::(prepare) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find master domain
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::957::TaskManager.Task::(_decref) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
Thread-30::DEBUG::2013-04-23 21:36:08,529::task::892::TaskManager.Task::(_doAbort) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
Thread-30::DEBUG::2013-04-23 21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing -> state aborting
Thread-30::DEBUG::2013-04-23 21:36:08,530::task::523::TaskManager.Task::(__state_aborting) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
Thread-30::DEBUG::2013-04-23 21:36:08,531::task::568::TaskManager.Task::(_updateState) Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting -> state failed
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-30::DEBUG::2013-04-23 21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-30::ERROR::2013-04-23 21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status': {'message': "Cannot find master domain: 'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f, msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
[root@vmserver3 vdsm]#
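
In the vdsm log, both NFS exports mount successfully, and vdsm then looks for the master domain UUID as an LVM volume group, which fails. One thing that may be worth checking on a host is whether a directory named after that UUID is actually visible under the mounts. Below is a minimal sketch of such a check. It assumes the usual file-domain layout of `<mount>/<sdUUID>/dom_md/metadata` under `/rhev/data-center/mnt`, and `find_domain_dirs` is a hypothetical helper name for illustration, not part of vdsm.

```python
import glob
import os


def find_domain_dirs(sd_uuid, mnt_root="/rhev/data-center/mnt"):
    """Look for <mnt_root>/<any mount>/<sd_uuid> directories.

    Returns a sorted list of (path, has_metadata) tuples, where
    has_metadata says whether <path>/dom_md/metadata exists as a file.
    The dom_md/metadata layout is an assumption about file domains.
    """
    # Escape both parts so odd characters are not treated as glob patterns.
    pattern = os.path.join(glob.escape(mnt_root), "*", glob.escape(sd_uuid))
    hits = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path):
            metadata = os.path.join(path, "dom_md", "metadata")
            hits.append((path, os.path.isfile(metadata)))
    return hits


if __name__ == "__main__":
    # msdUUID taken from the connectStoragePool call in the log above.
    msd_uuid = "774e3604-f449-4b3e-8c06-7cd16f98720c"
    found = find_domain_dirs(msd_uuid)
    if not found:
        print("No directory named %s under any mount" % msd_uuid)
    for path, has_metadata in found:
        print("%s (dom_md/metadata present: %s)" % (path, has_metadata))
```

If nothing is reported while the exports are mounted, the domain directory itself may be missing or unreadable on the NFS server, which would be a different problem from the engine-side "Locked" status.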