We had a hard crash (network, then power) on our 2 node Ovirt Cluster. We
have NFS datastore on CentOS 6 (3.2.0-1.39.el6). We can no longer get the
hosts to activate. They are unable to activate the "master" domain. The
master storage domain show "locked" while the other storage domains show
Unknown (disks) and inactive (ISO) All the domains are on the same NFS
server, we are able to mount it, the permissions are good. We believe we
might be getting bit by https://bugzilla.redhat.com/show_bug.cgi?id=920694or
http://gerrit.ovirt.org/#/c/13709/ which says to cease working on it:

Michael KublinApr 10

Patch Set 5: Do not submit

Liron, please abondon this work. This interacts with host life cycle which
will be changed, during a change a following problem will be solved as well.


So, We were wondering what we can do to get our oVirt back online, or
rather what the correct way is to solve this. We have a few VMs that are
down which we are looking for ways to recover as quickly as possible.

Thanks in advance,
Tommy

Here are the ovirt-engine logs:

2013-04-23 21:30:04,041 ERROR
[org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command
ConnectStoragePoolVDS execution failed. Exception:
IRSNoMasterDomainException: IRSGenericException: IRSErrorException:
IRSNoMasterDomainException: Cannot find master domain:
'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
2013-04-23 21:30:04,043 INFO
 [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
(pool-3-thread-49) FINISH, ConnectStoragePoolVDSCommand, log id: 50524b34
2013-04-23 21:30:04,049 WARN
 [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand]
(pool-3-thread-49) [7c5867d6] CanDoAction of action ReconstructMasterDomain
failed.
Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status
Locked



Here are the logs from vdsm:

Thread-29::DEBUG::2013-04-23
21:36:05,906::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
10.101.0.148:/c/vpt1-vmdisks1
/rhev/data-center/mnt/10.101.0.148:_c_vpt1-vmdisks1'
(cwd None)
Thread-29::DEBUG::2013-04-23
21:36:06,008::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3
10.101.0.148:/c/vpool-iso /rhev/data-center/mnt/10.101.0.148:_c_vpool-iso'
(cwd None)
Thread-29::INFO::2013-04-23
21:36:06,065::logUtils::44::dispatcher::(wrapper) Run and protect:
connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id':
'7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0, 'id':
'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23
21:36:06,071::task::1151::TaskManager.Task::(prepare)
Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::finished: {'statuslist':
[{'status': 0, 'id': '7c19bd42-c3dc-41b9-b81b-d9b75214b8dc'}, {'status': 0,
'id': 'eff2ef61-0b12-4429-b087-8742be17ae90'}]}
Thread-29::DEBUG::2013-04-23
21:36:06,071::task::568::TaskManager.Task::(_updateState)
Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::moving from state preparing ->
state finished
Thread-29::DEBUG::2013-04-23
21:36:06,071::resourceManager::830::ResourceManager.Owner::(releaseAll)
Owner.releaseAll requests {} resources {}
Thread-29::DEBUG::2013-04-23
21:36:06,072::resourceManager::864::ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
Thread-29::DEBUG::2013-04-23
21:36:06,072::task::957::TaskManager.Task::(_decref)
Task=`48337e40-2446-4357-b6dc-2c86f4da67e2`::ref 0 aborting False
Thread-30::DEBUG::2013-04-23
21:36:06,112::BindingXMLRPC::161::vds::(wrapper) [10.101.0.197]
Thread-30::DEBUG::2013-04-23
21:36:06,112::task::568::TaskManager.Task::(_updateState)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state init ->
state preparing
Thread-30::INFO::2013-04-23
21:36:06,113::logUtils::41::dispatcher::(wrapper) Run and protect:
connectStoragePool(spUUID='0f63de0e-7d98-48ce-99ec-add109f83c4f', hostID=1,
scsiKey='0f63de0e-7d98-48ce-99ec-add109f83c4f',
msdUUID='774e3604-f449-4b3e-8c06-7cd16f98720c', masterVersion=73,
options=None)
Thread-30::DEBUG::2013-04-23
21:36:06,113::resourceManager::190::ResourceManager.Request::(__init__)
ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Request
was made in '/usr/share/vdsm/storage/resourceManager.py' line '189' at
'__init__'
Thread-30::DEBUG::2013-04-23
21:36:06,114::resourceManager::504::ResourceManager::(registerResource)
Trying to register resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
for lock type 'exclusive'
Thread-30::DEBUG::2013-04-23
21:36:06,114::resourceManager::547::ResourceManager::(registerResource)
Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free. Now
locking as 'exclusive' (1 active user)
Thread-30::DEBUG::2013-04-23
21:36:06,114::resourceManager::227::ResourceManager.Request::(grant)
ResName=`Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f`ReqID=`ee74329a-0a92-465a-be50-b8acc6d7246a`::Granted
request
Thread-30::INFO::2013-04-23
21:36:06,115::sp::625::Storage.StoragePool::(connect) Connect host #1 to
the storage pool 0f63de0e-7d98-48ce-99ec-add109f83c4f with master domain:
774e3604-f449-4b3e-8c06-7cd16f98720c (ver = 73)
Thread-30::DEBUG::2013-04-23
21:36:06,116::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,116::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,117::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,117::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,117::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,118::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:06,118::misc::1054::SamplingMethod::(__call__) Trying to enter
sampling method (storage.sdc.refreshStorage)
Thread-30::DEBUG::2013-04-23
21:36:06,118::misc::1056::SamplingMethod::(__call__) Got in to sampling
method
Thread-30::DEBUG::2013-04-23
21:36:06,119::misc::1054::SamplingMethod::(__call__) Trying to enter
sampling method (storage.iscsi.rescan)
Thread-30::DEBUG::2013-04-23
21:36:06,119::misc::1056::SamplingMethod::(__call__) Got in to sampling
method
Thread-30::DEBUG::2013-04-23
21:36:06,119::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
/sbin/iscsiadm -m session -R' (cwd None)
Thread-30::DEBUG::2013-04-23
21:36:06,136::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> =
'iscsiadm: No session found.\n'; <rc> = 21
Thread-30::DEBUG::2013-04-23
21:36:06,136::misc::1064::SamplingMethod::(__call__) Returning last result
MainProcess|Thread-30::DEBUG::2013-04-23
21:36:06,139::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
of=/sys/class/scsi_host/host0/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23
21:36:06,142::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
of=/sys/class/scsi_host/host1/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23
21:36:06,146::misc::84::Storage.Misc.excCmd::(<lambda>) '/bin/dd
of=/sys/class/scsi_host/host2/scan' (cwd None)
MainProcess|Thread-30::DEBUG::2013-04-23
21:36:06,149::iscsi::402::Storage.ISCSI::(forceIScsiScan) Performing SCSI
scan, this will take up to 30 seconds
Thread-30::DEBUG::2013-04-23
21:36:08,152::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
/sbin/multipath' (cwd None)
Thread-30::DEBUG::2013-04-23
21:36:08,254::misc::84::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> =
''; <rc> = 0
Thread-30::DEBUG::2013-04-23
21:36:08,256::lvm::477::OperationMutex::(_invalidateAllPvs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,256::lvm::479::OperationMutex::(_invalidateAllPvs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,257::lvm::488::OperationMutex::(_invalidateAllVgs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,257::lvm::490::OperationMutex::(_invalidateAllVgs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,258::lvm::508::OperationMutex::(_invalidateAllLvs) Operation 'lvm
invalidate operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,258::lvm::510::OperationMutex::(_invalidateAllLvs) Operation 'lvm
invalidate operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,258::misc::1064::SamplingMethod::(__call__) Returning last result
Thread-30::DEBUG::2013-04-23
21:36:08,259::lvm::368::OperationMutex::(_reloadvgs) Operation 'lvm reload
operation' got the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,261::misc::84::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n
/sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"]
ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
filter = [ \\"r%.*%\\" ] }  global {  locking_type=1
 prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50
 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o
uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free
774e3604-f449-4b3e-8c06-7cd16f98720c' (cwd None)
Thread-30::DEBUG::2013-04-23
21:36:08,514::misc::84::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '
 Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found\n'; <rc> = 5
Thread-30::WARNING::2013-04-23
21:36:08,516::lvm::373::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
 Volume group "774e3604-f449-4b3e-8c06-7cd16f98720c" not found']
Thread-30::DEBUG::2013-04-23
21:36:08,518::lvm::397::OperationMutex::(_reloadvgs) Operation 'lvm reload
operation' released the operation mutex
Thread-30::DEBUG::2013-04-23
21:36:08,524::resourceManager::557::ResourceManager::(releaseResource)
Trying to release resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f'
Thread-30::DEBUG::2013-04-23
21:36:08,525::resourceManager::573::ResourceManager::(releaseResource)
Released resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' (0 active
users)
Thread-30::DEBUG::2013-04-23
21:36:08,525::resourceManager::578::ResourceManager::(releaseResource)
Resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f' is free, finding
out if anyone is waiting for it.
Thread-30::DEBUG::2013-04-23
21:36:08,525::resourceManager::585::ResourceManager::(releaseResource) No
one is waiting for resource 'Storage.0f63de0e-7d98-48ce-99ec-add109f83c4f',
Clearing records.
Thread-30::ERROR::2013-04-23
21:36:08,526::task::833::TaskManager.Task::(_setError)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 926, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 973, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 642, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1166, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1505, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain:
'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'
Thread-30::DEBUG::2013-04-23
21:36:08,527::task::852::TaskManager.Task::(_run)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._run:
f551fa3f-9d8c-4de3-895a-964c821060d4
('0f63de0e-7d98-48ce-99ec-add109f83c4f', 1,
'0f63de0e-7d98-48ce-99ec-add109f83c4f',
'774e3604-f449-4b3e-8c06-7cd16f98720c', 73) {} failed - stopping task
Thread-30::DEBUG::2013-04-23
21:36:08,528::task::1177::TaskManager.Task::(stop)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::stopping in state preparing
(force False)
Thread-30::DEBUG::2013-04-23
21:36:08,528::task::957::TaskManager.Task::(_decref)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 1 aborting True
Thread-30::INFO::2013-04-23
21:36:08,528::task::1134::TaskManager.Task::(prepare)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::aborting: Task is aborted:
'Cannot find master domain' - code 304
Thread-30::DEBUG::2013-04-23
21:36:08,529::task::1139::TaskManager.Task::(prepare)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Prepare: aborted: Cannot find
master domain
Thread-30::DEBUG::2013-04-23
21:36:08,529::task::957::TaskManager.Task::(_decref)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::ref 0 aborting True
Thread-30::DEBUG::2013-04-23
21:36:08,529::task::892::TaskManager.Task::(_doAbort)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::Task._doAbort: force False
Thread-30::DEBUG::2013-04-23
21:36:08,530::resourceManager::864::ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
Thread-30::DEBUG::2013-04-23
21:36:08,530::task::568::TaskManager.Task::(_updateState)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state preparing ->
state aborting
Thread-30::DEBUG::2013-04-23
21:36:08,530::task::523::TaskManager.Task::(__state_aborting)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::_aborting: recover policy none
Thread-30::DEBUG::2013-04-23
21:36:08,531::task::568::TaskManager.Task::(_updateState)
Task=`f551fa3f-9d8c-4de3-895a-964c821060d4`::moving from state aborting ->
state failed
Thread-30::DEBUG::2013-04-23
21:36:08,531::resourceManager::830::ResourceManager.Owner::(releaseAll)
Owner.releaseAll requests {} resources {}
Thread-30::DEBUG::2013-04-23
21:36:08,531::resourceManager::864::ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
Thread-30::ERROR::2013-04-23
21:36:08,532::dispatcher::67::Storage.Dispatcher.Protect::(run) {'status':
{'message': "Cannot find master domain:
'spUUID=0f63de0e-7d98-48ce-99ec-add109f83c4f,
msdUUID=774e3604-f449-4b3e-8c06-7cd16f98720c'", 'code': 304}}
[root@vmserver3 vdsm]#
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Reply via email to