Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-31 Thread Dan Kenigsberg
On Fri, Jan 25, 2013 at 12:30:39PM +0800, Royce Lv wrote:
 
 
 
 3. Actions we will take:
 (1) As a workaround we can first remove the zombie reaper from
 the supervdsm server.
 (2) I'll see whether Python has a fixed version for this.

It does not seem so:
http://hg.python.org/cpython/file/2c0197c95ec6/Lib/multiprocessing/managers.py#l225

Please open a python bug about it, and report it to the mailing list.
The python bug should have a simple reproducer, not something that
requires vdsm.


 (3) Yaniv is working on changing the vdsm/svdsm communication channel to
 a pipe and handling it ourselves; I believe we'll get rid of this once
 that is handled properly.
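
For illustration, "handling it ourselves" over a pipe mostly means retrying any
blocking read that a signal interrupts. A minimal sketch under that assumption
(eintr_safe_read is a hypothetical helper, not the actual vdsm/svdsm code):

import errno
import os

def eintr_safe_read(fd, size):
    # Retry the read if a signal such as SIGCHLD interrupts it (EINTR),
    # so the interruption never surfaces as an error to the caller.
    while True:
        try:
            return os.read(fd, size)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise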
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-25 Thread Mark Wu

On Fri 25 Jan 2013 05:23:24 PM CST, Royce Lv wrote:

I patched the Python source (managers.py) to retry recv() after EINTR;
supervdsm works well and the issue is gone.
The Python docs even declare that "only the main thread can set a
new signal handler, and the main thread will be the only one to
receive signals ... even if the underlying thread implementation
supports sending signals to individual threads"
(http://docs.python.org/2/library/signal.html#module-signal).
But according to my test script below, the thread that forked the child
process, even though it is not the main thread, also got SIGCHLD. Is this
a Python bug or a feature?


Thanks for sharing your test code. After testing your script and
looking into the Python signal module,
I found that the signal can be received by any thread! Python
installs a signal-handler wrapper for
every user-defined signal handler. The wrapper just adds the actual
handler to the queue of pending
calls, and the main thread checks the pending calls and runs the actual
handler before executing its next instruction.
So the signal interruption can happen in any thread, and therefore the
wrapper can run in any thread.
But the user-defined handler only runs in the main thread. If the
signal occurs in a non-main thread, the main
thread is not interrupted, and the signal is only handled
when the main thread finishes its current instruction.


Please correct me if I am wrong.
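
For illustration, a minimal sketch (not from this thread) that makes the above
observable: the kernel may interrupt any thread, but the Python-level handler
always reports MainThread. Assumes CPython 2.x behaviour as described above;
the names below are made up:

import os
import signal
import threading
import time

def handler(signum, frame):
    # CPython queues the user-defined handler as a pending call, so it runs
    # in the main thread regardless of which thread the OS interrupted.
    print('handler ran in: %s' % threading.current_thread().name)

def spawn_child():
    pid = os.fork()
    if pid == 0:
        os._exit(0)   # child exits at once, so the parent receives SIGCHLD

if __name__ == '__main__':
    signal.signal(signal.SIGCHLD, handler)
    t = threading.Thread(target=spawn_child)
    t.start()
    t.join()
    time.sleep(1)     # give the main thread a chance to run the pending handler
    # expected output: "handler ran in: MainThread"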



I agree with Mark; maybe we should deal with child processes in a
synchronised way instead of with a signal handler (one such approach is
sketched after the test script below).
import threading
import signal
import time
import os
from multiprocessing import Process, Pipe

def _zombieReaper(signum, frame):
    print 'sigchild!!!'

def child():
    time.sleep(5)

def sleepThread():
    proc = Process(target=child)
    proc.start()
    pip, pop = Pipe()
    pip.recv()  # this line gets IOError (EINTR) when SIGCHLD arrives

def main():
    signal.signal(signal.SIGCHLD, _zombieReaper)
    servThread = threading.Thread(target=sleepThread)
    servThread.setDaemon(True)
    servThread.start()
    time.sleep(30)

if __name__ == '__main__':
    main()
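
One hedged sketch of the "synchronised way" mentioned above (not vdsm code):
poll and reap exited children from a dedicated loop instead of installing a
SIGCHLD handler, so no blocking recv() elsewhere in the process is interrupted:

import errno
import os
import threading
import time

def reaper_loop(stop_event, interval=2.0):
    # Reap any exited children without blocking and without a signal handler.
    while not stop_event.is_set():
        try:
            while True:
                pid, _ = os.waitpid(-1, os.WNOHANG)
                if pid == 0:          # children exist but none have exited yet
                    break
        except OSError as e:
            if e.errno != errno.ECHILD:   # "no child processes" is fine
                raise
        time.sleep(interval)

# usage sketch:
# stop = threading.Event()
# threading.Thread(target=reaper_loop, args=(stop,)).start()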
On 01/25/2013 03:20 PM, Mark Wu wrote:

Great work!
The default action for SIGCHLD is ignore, so there are no problems
reported before a signal handler is installed by the zombie reaper.
But I still have one problem: the Python multiprocessing manager code
runs in a new thread, and according to the implementation of
Python's signal handling, only the main thread can receive the signal.
So how is the signal delivered to the server thread?


On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:


Hi,
   I reproduced this issue, and I believe it's a Python bug.
   1. How to reproduce:
   With the test case attached, put it under /usr/share/vdsm/tests/,
run #./run_tests.sh superVdsmTests.py,
   and the issue will be reproduced.
   2. Log analysis:
   We notice a strange pattern in this log: connectStorageServer is
called twice; the first supervdsm call succeeds, the second fails because of
validateAccess().
   That is because for the first call validateAccess returns normally
and leaves a child process behind; when the second validateAccess call arrives
and the multiprocessing manager is receiving the method message, that is
just when the first child exits and SIGCHLD arrives. This signal
interrupts the multiprocessing recv() system call; Python's managers.py
should handle EINTR and retry recv() like we do in vdsm, but it does not,
so the second call raises an error.
Thread-18::DEBUG::2013-01-22
10:41:03,570::misc::85::Storage.Misc.excCmd::(lambda)
'/usr/bin/sudo -n /bin/mount -t nfs -o
soft,nosharecache,timeo=600,retrans=6,nfsvers=3
192.168.0.1:/ovirt/silvermoon
/rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
Thread-18::DEBUG::2013-01-22
10:41:03,607::misc::85::Storage.Misc.excCmd::(lambda)
'/usr/bin/sudo -n /bin/mount -t nfs -o
soft,nosharecache,timeo=600,retrans=6,nfsvers=3
192.168.0.1:/ovirt/undercity
/rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
Thread-18::ERROR::2013-01-22
10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could
not connect to storageServer
Traceback (most recent call last):
  File /usr/share/vdsm/storage/hsm.py, line 2211, in
connectStorageServer
conObj.connect()
  File /usr/share/vdsm/storage/storageServer.py, line 303, in
connect
return self._mountCon.connect()
  File /usr/share/vdsm/storage/storageServer.py, line 209, in
connect
 fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
  File /usr/share/vdsm/storage/fileSD.py, line 55, in
validateDirAccess
(os.R_OK | os.X_OK))
  File /usr/share/vdsm/supervdsm.py, line 81, in __call__
return callMethod()
  File /usr/share/vdsm/supervdsm.py, line 72, in lambda
**kwargs)
  File string, line 2, in validateAccess
  File /usr/lib64/python2.6/multiprocessing/managers.py, line
740, in _callmethod
raise convert_to_error(kind, result)
the 

Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Dan Kenigsberg
On Wed, Jan 23, 2013 at 04:44:29PM -0600, Dead Horse wrote:
 I narrowed down on the commit where the originally reported issue crept in:
 commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with
 this commit or subsequent commits yields the original issue.

Could you provide more info on the failure to attach storage?
Does it happen only with NFS? Only with particular server or options?
It may have been a regression due to

commit fc3a44f71d2ef202cff18d7203b9e4165b546621
Author: Royce Lv lvro...@linux.vnet.ibm.com

integrate zombie reaper in supervdsmServer

not that I understand how. Let's ask the relevant parties (Royce and
Yaniv).
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Royce Lv

On 01/24/2013 05:21 PM, Dan Kenigsberg wrote:


Hi,
Will you provide the log, or let me access the test env if
possible (since we don't have IB in our lab)? I'll look at it immediately.

Sorry for the inconvenience if I have introduced the regression.

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread ybronhei

On 01/24/2013 12:44 AM, Dead Horse wrote:

I narrowed down on the commit where the originally reported issue crept in:
 commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with
this commit or subsequent commits yields the original issue.
Interesting... it might be related to this commit, and we're trying to
reproduce it.


Did you try to remove that code and run again? Does it work without the
addition of zombieReaper?
Does the connectivity to the storage work well? When you run 'ls' on the
mounted folder, do you see the files without a long delay? It might be
related to a too-long timeout when validating access to this mount.

We are working on that; any additional info can help.

Thanks.


- DHC


On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
deadhorseconsult...@gmail.com wrote:


Indeed reverting back to an older vdsm clears up the above issue. However
now the issue I see is:
Thread-18::ERROR::2013-01-23
15:50:42,885::task::833::TaskManager.Task::(_setError)
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error

Traceback (most recent call last):
   File /usr/share/vdsm/storage/task.py, line 840, in _run
 return fn(*args, **kargs)
   File /usr/share/vdsm/logUtils.py, line 42, in wrapper
 res = f(*args, **kwargs)
   File /usr/share/vdsm/storage/hsm.py, line 923, in connectStoragePool
 masterVersion, options)
   File /usr/share/vdsm/storage/hsm.py, line 970, in _connectStoragePool
 res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 643, in connect
 self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 1167, in __rebuild
 self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
masterVersion=masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 1506, in getMasterDomain
 raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain:
'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
msdUUID=67534cca-1327-462a-b455-a04464084b31'
Thread-18::DEBUG::2013-01-23
15:50:42,887::task::852::TaskManager.Task::(_run)
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
08709e68-bcbc-40d8-843a-d69d4df40ac6
('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
'f90a0d1c-06ca-11e2-a05b-00151712f280',
'67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task

This is with vdsm built from
commit 25a2d8572ad32352227c98a86631300fbd6523c1
- DHC


On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse 
deadhorseconsult...@gmail.com wrote:


VDSM was built from:
commit 166138e37e75767b32227746bb671b1dab9cdd5e

Attached is the full vdsm log

 I should also note that the engine sees the master
 storage domain as locked and the others as unknown.


 On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg dan...@redhat.com wrote:


On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:

Any ideas on this one? (from VDSM log):
Thread-25::DEBUG::2013-01-22
15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client

[3.57.111.30]::call

getCapabilities with () {}
Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)
cannot read ib0 speed
Traceback (most recent call last):
   File /usr/lib64/python2.6/site-packages/vdsm/netinfo.py, line 155,

in

speed
 s = int(file('/sys/class/net/%s/speed' % dev).read())
IOError: [Errno 22] Invalid argument

Causes VDSM to fail to attach storage


I doubt that this is the cause of the failure, as vdsm has always
reported 0 for ib devices, and still is.
It happens only when you call getCapabilities, so it isn't related
to the flow, and it can't affect the storage.

Dan: I guess this is not the issue, but why the IOError?
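
For reference, the traceback above shows the read of /sys/class/net/ib0/speed
failing with EINVAL; a caller that wants to tolerate that (as vdsm apparently
does by reporting 0) could look roughly like this sketch (read_speed is a
hypothetical helper, not vdsm's netinfo code):

import errno

def read_speed(dev):
    # Some devices (e.g. InfiniBand ports, or NICs without a carrier) make
    # /sys/class/net/<dev>/speed unreadable; the read fails with EINVAL.
    # Treat that as "unknown" and report 0.
    try:
        with open('/sys/class/net/%s/speed' % dev) as f:
            return int(f.read())
    except IOError as e:
        if e.errno == errno.EINVAL:
            return 0
        raise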


Does a former version work with your Engine?
Could you share more of your vdsm.log? I suppose the culprit lies in
one of the storage-related commands, not in statistics retrieval.



Engine side sees:
ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
(QuartzScheduler_Worker-96) [553ef26e] The connection with details
192.168.0.1:/ovirt/ds failed because of error code 100 and error

message

is: general exception
2013-01-22 15:35:30,160 INFO
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
(QuartzScheduler_Worker-96) [1ab78378] Running command:
SetNonOperationalVdsCommand internal: true. Entities affected :  ID:
8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
2013-01-22 15:35:30,200 INFO
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
(QuartzScheduler_Worker-96) [1ab78378] START,
SetVdsStatusVDSCommand(HostName = kezan, HostId =
8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
2013-01-22 15:35:30,211 INFO
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
(QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,

log

id: 4af5c4cd
2013-01-22 15:35:30,242 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]

Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Dead Horse
This test harness setup here consists of two servers tied to NFS storage
via IB (NFS mounts are via IPoIB; NFS over RDMA is disabled). All storage
domains are NFS. The issue does occur with both servers when attempting
to bring them out of maintenance mode, with the end result being
non-operational due to storage attach failure.

The current issue is that with a working older commit the master
storage domain is stuck in state locked, and I see the secondary issue
wherein VDSM cannot seem to find or contact the master storage domain even
though it is there. I can mount the master storage domain manually, and
all content appears to be accounted for on either host.

Here is the current contents of the master storage domain metadata:
CLASS=Data
DESCRIPTION=orgrimmar
IOOPTIMEOUTSEC=1
LEASERETRIES=3
LEASETIMESEC=5
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=417
POOL_DESCRIPTION=Azeroth
POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
POOL_SPM_ID=1
POOL_SPM_LVER=6
POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
ROLE=Master
SDUUID=67534cca-1327-462a-b455-a04464084b31
TYPE=NFS
VERSION=3
_SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721

So now, with the older working commit, I get the
StoragePoolMasterNotFound: Cannot find master domain error (prior details
above, from when I worked backwards to that commit).

This is odd as the nodes can definitely reach the master storage domain:

showmount from one of the el6.3 nodes:
[root@kezan ~]# showmount -e 192.168.0.1
Export list for 192.168.0.1:
/ovirt/orgrimmar  192.168.0.0/16

mount/ls from one of the nodes:
[root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
[root@kezan ~]# ls -al /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
total 1100
drwxr-xr-x 2 vdsm kvm    4096 Jan 24 11:44 .
drwxr-xr-x 5 vdsm kvm    4096 Oct 19 16:16 ..
-rw-rw 1 vdsm kvm 1048576 Jan 19 22:09 ids
-rw-rw 1 vdsm kvm   0 Sep 25 00:46 inbox
-rw-rw 1 vdsm kvm 2097152 Jan 10 13:33 leases
-rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
-rw-rw 1 vdsm kvm   0 Sep 25 00:46 outbox


- DHC



On Thu, Jan 24, 2013 at 7:51 AM, ybronhei ybron...@redhat.com wrote:

 On 01/24/2013 12:44 AM, Dead Horse wrote:

 I narrowed down on the commit where the originally reported issue crept
 in:
 commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with
 this commit or subsequent commits yields the original issue.

 Interesting... it might be related to this commit, and we're trying to
 reproduce it.

 Did you try to remove that code and run again? Does it work without the
 addition of zombieReaper?
 Does the connectivity to the storage work well? When you run 'ls' on the
 mounted folder, do you see the files without a long delay? It might be
 related to a too-long timeout when validating access to this mount.
 We are working on that; any additional info can help.

 Thanks.


 - DHC


 On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
 deadhorseconsult...@gmail.com wrote:

  Indeed reverting back to an older vdsm clears up the above issue. However
 now the issue I see is:
 Thread-18::ERROR::2013-01-23
 15:50:42,885::task::833::TaskManager.Task::(_setError)
 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error

 Traceback (most recent call last):
    File /usr/share/vdsm/storage/task.py, line 840, in _run
  return fn(*args, **kargs)
    File /usr/share/vdsm/logUtils.py, line 42, in wrapper
  res = f(*args, **kwargs)
    File /usr/share/vdsm/storage/hsm.py, line 923, in
 connectStoragePool
  masterVersion, options)
    File /usr/share/vdsm/storage/hsm.py, line 970, in
 _connectStoragePool
  res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
    File /usr/share/vdsm/storage/sp.py, line 643, in connect
  self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
    File /usr/share/vdsm/storage/sp.py, line 1167, in __rebuild
  self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
 masterVersion=masterVersion)
    File /usr/share/vdsm/storage/sp.py, line 1506, in getMasterDomain
  raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
 StoragePoolMasterNotFound: Cannot find master domain:
 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
 msdUUID=67534cca-1327-462a-b455-a04464084b31'
 Thread-18::DEBUG::2013-01-23
 15:50:42,887::task::852::TaskManager.Task::(_run)
 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
 08709e68-bcbc-40d8-843a-d69d4df40ac6
 

Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Dead Horse
Tried some manual edits to the SD states in the database. The net result was
that I was able to get a node active. However, when reconstruction of the
master storage domain kicked in, it was unable to complete. It was also not
able to recognize the other SDs, with failure modes similar to the
unrecognized master above. Guessing the newer VDSM version borked things
pretty good. Since this is a test harness and the SD data is not worth
saving, I just smoked all the SDs, ran engine-cleanup, and started fresh;
all is well now.

- DHC


On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse
deadhorseconsult...@gmail.com wrote:

 This test harness setup here consists of two servers tied to NFS storage
 via IB (NFS mounts are via IPoIB; NFS over RDMA is disabled). All storage
 domains are NFS. The issue does occur with both servers when attempting
 to bring them out of maintenance mode, with the end result being
 non-operational due to storage attach failure.

 The current issue is that with a working older commit the master
 storage domain is stuck in state locked, and I see the secondary issue
 wherein VDSM cannot seem to find or contact the master storage domain even
 though it is there. I can mount the master storage domain manually, and
 all content appears to be accounted for on either host.

 Here is the current contents of the master storage domain metadata:
 CLASS=Data
 DESCRIPTION=orgrimmar
 IOOPTIMEOUTSEC=1
 LEASERETRIES=3
 LEASETIMESEC=5
 LOCKPOLICY=
 LOCKRENEWALINTERVALSEC=5
 MASTER_VERSION=417
 POOL_DESCRIPTION=Azeroth

 POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
 POOL_SPM_ID=1
 POOL_SPM_LVER=6
 POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
 REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
 ROLE=Master
 SDUUID=67534cca-1327-462a-b455-a04464084b31
 TYPE=NFS
 VERSION=3
 _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721

 So now, with the older working commit, I get the
 StoragePoolMasterNotFound: Cannot find master domain error (prior details
 above, from when I worked backwards to that commit).

 This is odd as the nodes can definitely reach the master storage domain:

 showmount from one of the el6.3 nodes:
 [root@kezan ~]# showmount -e 192.168.0.1
 Export list for 192.168.0.1:
 /ovirt/orgrimmar  192.168.0.0/16

 mount/ls from one of the nodes:
 [root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
 [root@kezan ~]# ls -al /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
 total 1100
 drwxr-xr-x 2 vdsm kvm    4096 Jan 24 11:44 .
 drwxr-xr-x 5 vdsm kvm    4096 Oct 19 16:16 ..
 -rw-rw 1 vdsm kvm 1048576 Jan 19 22:09 ids
 -rw-rw 1 vdsm kvm   0 Sep 25 00:46 inbox
 -rw-rw 1 vdsm kvm 2097152 Jan 10 13:33 leases
 -rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
 -rw-rw 1 vdsm kvm   0 Sep 25 00:46 outbox


 - DHC



 On Thu, Jan 24, 2013 at 7:51 AM, ybronhei ybron...@redhat.com wrote:

 On 01/24/2013 12:44 AM, Dead Horse wrote:

 I narrowed down on the commit where the originally reported issue crept
 in:
 commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing
 with

 this commit or subsequent commits yields the original issue.

 Interesting... it might be related to this commit, and we're trying to
 reproduce it.

 Did you try to remove that code and run again? Does it work without the
 addition of zombieReaper?
 Does the connectivity to the storage work well? When you run 'ls' on the
 mounted folder, do you see the files without a long delay? It might be
 related to a too-long timeout when validating access to this mount.
 We are working on that; any additional info can help.

 Thanks.


 - DHC


 On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
 deadhorseconsult...@gmail.com wrote:

  Indeed reverting back to an older vdsm clears up the above issue. However
 now the issue I see is:
 Thread-18::ERROR::2013-01-23
 15:50:42,885::task::833::TaskManager.Task::(_setError)
 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error

 Traceback (most recent call last):
    File /usr/share/vdsm/storage/task.py, line 840, in _run
  return fn(*args, **kargs)
    File /usr/share/vdsm/logUtils.py, line 42, in wrapper
  res = f(*args, **kwargs)
    File /usr/share/vdsm/storage/hsm.py, line 923, in
 connectStoragePool
  masterVersion, options)
    File /usr/share/vdsm/storage/hsm.py, line 970, in
 _connectStoragePool
  res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
    File /usr/share/vdsm/storage/sp.py, line 643, in connect
  self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
    File 

Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Mark Wu

Great work!
The default action for SIGCHLD is ignore, so there are no problems
reported before a signal handler is installed by the zombie reaper.
But I still have one problem: the Python multiprocessing manager code
runs in a new thread, and according to the implementation of Python's
signal handling, only the main thread can receive the signal.

So how is the signal delivered to the server thread?


On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:


Hi,
   I reproduced this issue, and I believe it's a Python bug.
   1. How to reproduce:
   With the test case attached, put it under /usr/share/vdsm/tests/,
run #./run_tests.sh superVdsmTests.py,
   and the issue will be reproduced.
   2. Log analysis:
   We notice a strange pattern in this log: connectStorageServer is
called twice; the first supervdsm call succeeds, the second fails because of
validateAccess().
   That is because for the first call validateAccess returns normally
and leaves a child process behind; when the second validateAccess call arrives
and the multiprocessing manager is receiving the method message, that is
just when the first child exits and SIGCHLD arrives. This signal
interrupts the multiprocessing recv() system call; Python's managers.py
should handle EINTR and retry recv() like we do in vdsm, but it does not,
so the second call raises an error.
Thread-18::DEBUG::2013-01-22 
10:41:03,570::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n 
/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 
192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd 
None)
Thread-18::DEBUG::2013-01-22 
10:41:03,607::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n 
/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 
192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd 
None)
Thread-18::ERROR::2013-01-22 
10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to 
storageServer
Traceback (most recent call last):
  File /usr/share/vdsm/storage/hsm.py, line 2211, in connectStorageServer
conObj.connect()
  File /usr/share/vdsm/storage/storageServer.py, line 303, in connect
return self._mountCon.connect()
  File /usr/share/vdsm/storage/storageServer.py, line 209, in connect
fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
  File /usr/share/vdsm/storage/fileSD.py, line 55, in validateDirAccess
(os.R_OK | os.X_OK))
  File /usr/share/vdsm/supervdsm.py, line 81, in __call__
return callMethod()
  File /usr/share/vdsm/supervdsm.py, line 72, in lambda
**kwargs)
  File string, line 2, in validateAccess
  File /usr/lib64/python2.6/multiprocessing/managers.py, line 740, in 
_callmethod
raise convert_to_error(kind, result)
The vdsm side receives a RemoteError because the supervdsm server's
multiprocessing manager raises an error with KIND='TRACEBACK':
  RemoteError:
The upper part is the traceback from the client side; the following
part is from the server side:
---
Traceback (most recent call last):
  File /usr/lib64/python2.6/multiprocessing/managers.py, line 214, in 
serve_client
request = recv()
IOError: [Errno 4] Interrupted system call
---

Corresponding Python source code: managers.py (server side)
    def serve_client(self, conn):
        '''
        Handle requests from the proxies in a particular process/thread
        '''
        util.debug('starting server thread to service %r',
                   threading.current_thread().name)

        recv = conn.recv
        send = conn.send
        id_to_obj = self.id_to_obj

        while not self.stop:
            try:
                methodname = obj = None
                request = recv()   # <-- this line is interrupted by SIGCHLD
                ident, methodname, args, kwds = request
                obj, exposed, gettypeid = id_to_obj[ident]

                if methodname not in exposed:
                    raise AttributeError(
                        'method %r of %r object is not in exposed=%r' %
                        (methodname, type(obj), exposed)
                        )

                function = getattr(obj, methodname)

                try:
                    res = function(*args, **kwds)
                except Exception, e:
                    msg = ('#ERROR', e)
                else:
                    typeid = gettypeid and gettypeid.get(methodname, None)
                    if typeid:
                        rident, rexposed = self.create(conn, typeid, res)
                        token = Token(typeid, self.address, rident)
                        msg = ('#PROXY', (rexposed, token))
                    else:
                        msg = ('#RETURN', res)

            except AttributeError:
                if methodname is None:
                    msg = ('#TRACEBACK', format_exc())
Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-23 Thread Dan Kenigsberg
On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
 Any ideas on this one? (from VDSM log):
 Thread-25::DEBUG::2013-01-22
 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call
 getCapabilities with () {}
 Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)
 cannot read ib0 speed
 Traceback (most recent call last):
   File /usr/lib64/python2.6/site-packages/vdsm/netinfo.py, line 155, in
 speed
 s = int(file('/sys/class/net/%s/speed' % dev).read())
 IOError: [Errno 22] Invalid argument
 
 Causes VDSM to fail to attach storage

I doubt that this is the cause of the failure, as vdsm has always
reported 0 for ib devices, and still is.

Does a former version work with your Engine?
Could you share more of your vdsm.log? I suppose the culprit lies in
one of the storage-related commands, not in statistics retrieval.

 
 Engine side sees:
 ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
 (QuartzScheduler_Worker-96) [553ef26e] The connection with details
 192.168.0.1:/ovirt/ds failed because of error code 100 and error message
 is: general exception
 2013-01-22 15:35:30,160 INFO
 [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
 (QuartzScheduler_Worker-96) [1ab78378] Running command:
 SetNonOperationalVdsCommand internal: true. Entities affected :  ID:
 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
 2013-01-22 15:35:30,200 INFO
 [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
 (QuartzScheduler_Worker-96) [1ab78378] START,
 SetVdsStatusVDSCommand(HostName = kezan, HostId =
 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
 nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
 2013-01-22 15:35:30,211 INFO
 [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
 (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log
 id: 4af5c4cd
 2013-01-22 15:35:30,242 ERROR
 [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
 (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log
 values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:
 storagepoolname
 
 Engine = latest master
 VDSM = latest master

Since latest master is an unstable reference by definition, I'm sure
that History would thank you if you post the exact version (git hash?)
of the code.

 node = el6

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-23 Thread Dead Horse
Indeed reverting back to an older vdsm clears up the above issue. However
now the issue I see is:
Thread-18::ERROR::2013-01-23
15:50:42,885::task::833::TaskManager.Task::(_setError)
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error
Traceback (most recent call last):
  File /usr/share/vdsm/storage/task.py, line 840, in _run
return fn(*args, **kargs)
  File /usr/share/vdsm/logUtils.py, line 42, in wrapper
res = f(*args, **kwargs)
  File /usr/share/vdsm/storage/hsm.py, line 923, in connectStoragePool
masterVersion, options)
  File /usr/share/vdsm/storage/hsm.py, line 970, in _connectStoragePool
res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File /usr/share/vdsm/storage/sp.py, line 643, in connect
self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File /usr/share/vdsm/storage/sp.py, line 1167, in __rebuild
self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
masterVersion=masterVersion)
  File /usr/share/vdsm/storage/sp.py, line 1506, in getMasterDomain
raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain:
'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
msdUUID=67534cca-1327-462a-b455-a04464084b31'
Thread-18::DEBUG::2013-01-23
15:50:42,887::task::852::TaskManager.Task::(_run)
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
08709e68-bcbc-40d8-843a-d69d4df40ac6
('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
'f90a0d1c-06ca-11e2-a05b-00151712f280',
'67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task

This is with vdsm built from
commit 25a2d8572ad32352227c98a86631300fbd6523c1
- DHC


On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse
deadhorseconsult...@gmail.com wrote:

 VDSM was built from:
 commit 166138e37e75767b32227746bb671b1dab9cdd5e

 Attached is the full vdsm log

 I should also note that the engine sees the master storage
 domain as locked and the others as unknown.


 On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg dan...@redhat.com wrote:

 On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
  Any ideas on this one? (from VDSM log):
  Thread-25::DEBUG::2013-01-22
  15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client
 [3.57.111.30]::call
  getCapabilities with () {}
  Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)
  cannot read ib0 speed
  Traceback (most recent call last):
File /usr/lib64/python2.6/site-packages/vdsm/netinfo.py, line 155,
 in
  speed
  s = int(file('/sys/class/net/%s/speed' % dev).read())
  IOError: [Errno 22] Invalid argument
 
  Causes VDSM to fail to attach storage

 I doubt that this is the cause of the failure, as vdsm has always
 reported 0 for ib devices, and still is.

 Does a former version work with your Engine?
 Could you share more of your vdsm.log? I suppose the culprit lies in
 one of the storage-related commands, not in statistics retrieval.

 
  Engine side sees:
  ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
  (QuartzScheduler_Worker-96) [553ef26e] The connection with details
  192.168.0.1:/ovirt/ds failed because of error code 100 and error
 message
  is: general exception
  2013-01-22 15:35:30,160 INFO
  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
  (QuartzScheduler_Worker-96) [1ab78378] Running command:
  SetNonOperationalVdsCommand internal: true. Entities affected :  ID:
  8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
  2013-01-22 15:35:30,200 INFO
  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-96) [1ab78378] START,
  SetVdsStatusVDSCommand(HostName = kezan, HostId =
  8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
  nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
  2013-01-22 15:35:30,211 INFO
  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,
 log
  id: 4af5c4cd
  2013-01-22 15:35:30,242 ERROR
  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
  (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log
  values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:
  storagepoolname
 
  Engine = latest master
  VDSM = latest master

 Since latest master is an unstable reference by definition, I'm sure
 that History would thank you if you post the exact version (git hash?)
 of the code.

  node = el6



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-23 Thread Dead Horse
I narrowed down on the commit where the originally reported issue crept in:
commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with
this commit or subsequent commits yields the original issue.

- DHC


On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
deadhorseconsult...@gmail.com wrote:

 Indeed reverting back to an older vdsm clears up the above issue. However
 now the issue I see is:
 Thread-18::ERROR::2013-01-23
 15:50:42,885::task::833::TaskManager.Task::(_setError)
 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error

 Traceback (most recent call last):
   File /usr/share/vdsm/storage/task.py, line 840, in _run
 return fn(*args, **kargs)
   File /usr/share/vdsm/logUtils.py, line 42, in wrapper
 res = f(*args, **kwargs)
   File /usr/share/vdsm/storage/hsm.py, line 923, in connectStoragePool
 masterVersion, options)
   File /usr/share/vdsm/storage/hsm.py, line 970, in _connectStoragePool
 res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 643, in connect
 self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 1167, in __rebuild
 self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
 masterVersion=masterVersion)
   File /usr/share/vdsm/storage/sp.py, line 1506, in getMasterDomain
 raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
 StoragePoolMasterNotFound: Cannot find master domain:
 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
 msdUUID=67534cca-1327-462a-b455-a04464084b31'
 Thread-18::DEBUG::2013-01-23
 15:50:42,887::task::852::TaskManager.Task::(_run)
 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
 08709e68-bcbc-40d8-843a-d69d4df40ac6
 ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
 'f90a0d1c-06ca-11e2-a05b-00151712f280',
 '67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task

 This is with vdsm built from
 commit 25a2d8572ad32352227c98a86631300fbd6523c1
 - DHC


 On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse 
 deadhorseconsult...@gmail.com wrote:

 VDSM was built from:
 commit 166138e37e75767b32227746bb671b1dab9cdd5e

 Attached is the full vdsm log

 I should also note that the engine sees the master
 storage domain as locked and the others as unknown.


 On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg dan...@redhat.com wrote:

 On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
  Any ideas on this one? (from VDSM log):
  Thread-25::DEBUG::2013-01-22
  15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client
 [3.57.111.30]::call
  getCapabilities with () {}
  Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)
  cannot read ib0 speed
  Traceback (most recent call last):
File /usr/lib64/python2.6/site-packages/vdsm/netinfo.py, line 155,
 in
  speed
  s = int(file('/sys/class/net/%s/speed' % dev).read())
  IOError: [Errno 22] Invalid argument
 
  Causes VDSM to fail to attach storage

 I doubt that this is the cause of the failure, as vdsm has always
 reported 0 for ib devices, and still is.

 Does a former version work with your Engine?
 Could you share more of your vdsm.log? I suppose the culprit lies in
 one of the storage-related commands, not in statistics retrieval.

 
  Engine side sees:
  ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
  (QuartzScheduler_Worker-96) [553ef26e] The connection with details
  192.168.0.1:/ovirt/ds failed because of error code 100 and error
 message
  is: general exception
  2013-01-22 15:35:30,160 INFO
  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
  (QuartzScheduler_Worker-96) [1ab78378] Running command:
  SetNonOperationalVdsCommand internal: true. Entities affected :  ID:
  8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
  2013-01-22 15:35:30,200 INFO
  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-96) [1ab78378] START,
  SetVdsStatusVDSCommand(HostName = kezan, HostId =
  8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
  nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
  2013-01-22 15:35:30,211 INFO
  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,
 log
  id: 4af5c4cd
  2013-01-22 15:35:30,242 ERROR
  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
  (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log
  values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:
  storagepoolname
 
  Engine = latest master
  VDSM = latest master

 Since latest master is an unstable reference by definition, I'm sure
 that History would thank you if you post the exact version (git hash?)
 of the code.

  node = el6




___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users