On 1/26/2014 5:17 PM, Itamar Heim wrote:
On 01/27/2014 12:00 AM, Ted Miller wrote:

On 1/26/2014 4:00 PM, Itamar Heim wrote:
On 01/26/2014 10:51 PM, Ted Miller wrote:

On 1/26/2014 3:10 PM, Itamar Heim wrote:
On 01/26/2014 10:08 PM, Ted Miller wrote:
My Data Center is down, and won't come back up.

Data Center Status on the GUI flips between "Non Responsive" and
"Contending"

Also noted:
Host sometimes seen flipping between "Low" and "Contending" in SPM
column.
Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
VM2 is "up" under "Volumes" tab

Created another volume for VM storage.  It shows up in the "Volumes" tab,
but when I try to add a "New Domain" in the Storage tab, it says "There
are No Data Centers to which the Storage Domain can be attached"

Setup:
2 hosts w/ glusterfs storage
1 engine
all 3 computers CentOS 6.5, just updated
ovirt-engine                       3.3.0.1-1.el6
ovirt-engine-lib                 3.3.2-1.el6
ovirt-host-deploy.noarch  1.1.3-1.el6
glusterfs.x86_64               3.4.2-1.el6

This loop seems to repeat in the ovirt-engine log (a grep of the log
showing only the DefaultQuartzScheduler_Worker-79 thread; the command is
shown after the excerpt):

2014-01-26 14:44:58,416 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-79) Irs placed on server
9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
2014-01-26 14:44:58,511 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds -
office4a,
spmStatus Free, storage pool mill
2014-01-26 14:44:58,550 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-79) SpmStatus on vds
127ed939-34af-41a8-87a0-e2f6174b1877: Free
2014-01-26 14:44:58,571 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-79) starting spm on vds office4a,
storage
pool mill, prevId 2, LVER 15
2014-01-26 14:44:58,579 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) START,
SpmStartVDSCommand(HostName =
office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877,
storagePoolId =
536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15,
storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
id: 74c38eb7
2014-01-26 14:44:58,617 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId =
e8986753-fc80-4b11-a11d-6d3470b1728c
2014-01-26 14:45:00,662 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
(DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS
method
2014-01-26 14:45:00,664 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
(DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and
error message VDSGenericException: VDSErrorException: Failed to
HSMGetTaskStatusVDS, error = Cannot acquire host id
2014-01-26 14:45:00,665 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId =
e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
2014-01-26 14:45:00,666 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result:
cleanSuccess, message: VDSGenericException: VDSErrorException:
Failed to
HSMGetTaskStatusVDS, error = Cannot acquire host id
2014-01-26 14:45:00,695 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm
status: Free
2014-01-26 14:45:00,702 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
(DefaultQuartzScheduler_Worker-79) START,
HSMClearTaskVDSCommand(HostName = office4a, HostId =
127ed939-34af-41a8-87a0-e2f6174b1877,
taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
2014-01-26 14:45:00,722 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
(DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log
id: 336ec5a6
2014-01-26 14:45:00,724 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return:
org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652, log id: 74c38eb7
2014-01-26 14:45:00,733 INFO
[org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
(DefaultQuartzScheduler_Worker-79) Running command:
SetStoragePoolStatusCommand internal: true. Entities affected : ID:
536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
2014-01-26 14:45:00,778 ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-79)
IrsBroker::Failed::GetStoragePoolInfoVDS due to:
IrsSpmStartFailedException: IRSGenericException: IRSErrorException:
SpmStart failed
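
The grep I used was along these lines, assuming the engine log in its
default location:

  grep 'DefaultQuartzScheduler_Worker-79' /var/log/ovirt-engine/engine.log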

Ted Miller
Elkhart, IN, USA





is this gluster storage (guessing, since you mentioned a 'volume')?
yes (mentioned under "setup" above)
does it have a quorum?
Volume Name: VM2
Type: Replicate
Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.41.65.2:/bricks/01/VM2
Brick2: 10.41.65.4:/bricks/01/VM2
Brick3: 10.41.65.4:/bricks/101/VM2
Options Reconfigured:
cluster.server-quorum-type: server
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: off
nfs.disa
(there were reports of split brain on the domain metadata before, when
no quorum existed for gluster)
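(For a replica-3 volume like this one, a sketch of tightening quorum
beyond the server-quorum-type already set above; double-check the option
names against your gluster version:

  gluster volume set VM2 cluster.quorum-type auto
  gluster volume set all cluster.server-quorum-ratio 51%
)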
after full heal:

[root@office4a ~]$ gluster volume heal VM2 info
Gathering Heal info on volume VM2 has been successful

Brick 10.41.65.2:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/101/VM2
Number of entries: 0
[root@office4a ~]$ gluster volume heal VM2 info split-brain
Gathering Heal info on volume VM2 has been successful

Brick 10.41.65.2:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/101/VM2
Number of entries: 0
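
For completeness, brick and self-heal daemon state can also be checked
with:

  gluster volume status VM2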

noticed this in the host's /var/log/messages (while looking for something
else).  The loop seems to repeat over and over.

Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678 [30419]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679 [3771]: s1997 add_lockspace fail result -90
Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2111, in getAllTasksStatuses
    allTasksStatus = sp.getAllTasksStatuses()
  File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
    raise SecureError()
SecureError
Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686 [30495]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687 [3772]: s1998 add_lockspace fail result -90
Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
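
In case it is useful, sanlock's own view of things can be inspected with
something like the following (the dump target is the ids file from the
traceback above):

  sanlock client status
  sanlock direct dump /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids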

fede - thoughts on above?
(vojtech reported something similar, but it sorted itself out for him after some retries)


Ted Miller
Elkhart, IN, USA


this is the new storage domain? what about the previous volume for the
first SD?

The default/default data center/cluster had to be abandoned because of a
split-brain that could not be healed.  I can't remove the old storage from
the database, and I can't get the data center up because of the corrupt
storage, so it ends up a circular argument.

I started over with the same hosts and totally new storage in a new data
center.  This mill/one data center/cluster was working fine with the VM2
storage, then died.

Ted Miller
Looking at the problems with Sanlock, I read the page on it, then went poking around.

I find the sanlock.log is full of:

2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512
2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90
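
(Side note: rv -90 is -EMSGSIZE, the same "Message too long" seen in the
vdsm traceback earlier, e.g.:

  python -c 'import errno, os; print errno.errorcode[90], os.strerror(90)'
  EMSGSIZE Message too long
)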

Just looking at the log, it looks like sanlock is trying to "read_sectors delta_leader offset 512", probably in this file: /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids

Well, I can see why it fails -- the ids file is empty.
[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l
total 1029
-rw-rw---- 1 vdsm kvm       0 Jan 22 00:44 ids
-rw-rw---- 1 vdsm kvm       0 Jan 16 18:50 inbox
-rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
-rw-r--r-- 1 vdsm kvm     491 Jan 21 18:20 metadata
-rw-rw---- 1 vdsm kvm       0 Jan 16 18:50 outbox

I have no idea why it is empty, or what is supposed to be there, but anything trying to read at an offset of 512 is going to kick back an error on a 0-length file. :(
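
A quick demo of the short read (the seek goes past EOF, so the read comes
back empty, which sanlock evidently turns into that -EMSGSIZE):

  python -c "f = open('/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids'); f.seek(512); print repr(f.read(512))"
  ''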

I recall that the previous ids file, on the domain that went up in smoke (an improperly recovered split-brain), was about 1 megabyte.

The date on the 'ids' file is probably about when the Data Center went belly-up.

I hope this helps someone figure out what I can do to get out of this endless circle of trying to start the Data Center. Maybe there is a way to force an initialization of the ids file, and then everyone could be happy?
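
If such a thing exists, I would guess it looks something like sanlock's
"direct init", with the storage domain UUID as the lockspace name -- but
that is only a guess on my part, so please check it before anyone runs it
against real data:

  sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0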

Ted Miller
