Re: [vdsm] [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Mark Wu

Great work!
The default action for SIGCHLD is to ignore it, so no problems were 
reported before the zombie reaper installed a signal handler.
But I still have one question: the python multiprocessing manager code 
runs in a new thread, and according to the implementation of Python's 
signal handling, only the main thread can receive signals.

So how is the signal delivered to the server thread?
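For context, the zombie reaper discussed here amounts to a SIGCHLD handler along these lines (a minimal sketch for illustration; `zombie_reaper` is a hypothetical name, not vdsm's actual code):

```python
import errno
import os
import signal

def zombie_reaper(signum, frame):
    """SIGCHLD handler: collect every exited child so none stays a zombie.

    Note that merely installing a SIGCHLD handler is what makes blocking
    system calls elsewhere in the process start failing with EINTR.
    """
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:   # no children left to wait for
                return
            raise
        if pid == 0:                      # remaining children still running
            return

# Installed from the main thread, e.g.:
# signal.signal(signal.SIGCHLD, zombie_reaper)
```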


On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:


Hi,
   I reproduced this issue, and I believe it's a Python bug.
   1. How to reproduce:
   With the test case attached, put it under /usr/share/vdsm/tests/ and
run #./run_tests.sh superVdsmTests.py
   to reproduce the issue.
   2. Log analysis:
   We notice a strange pattern in this log: connectStorageServer is
called twice; the first supervdsm call succeeds, and the second fails
because of validateAccess().
   That is because the first validateAccess() call returns normally and
leaves a child process behind. When the second validateAccess() call
arrives and the multiprocessing manager is receiving the method message,
that is just when the first child exits and SIGCHLD arrives. The signal
interrupts the manager's blocking recv() system call; Python's
managers.py should handle EINTR and retry recv() as we do in vdsm, but
it does not, so the second call raises an error.
Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
The vdsm side receives a RemoteError because the supervdsm server's
multiprocessing manager raised an error with KIND='TRACEBACK':
  RemoteError:
The upper part is the traceback from the client side; the following
part is from the server side:
---
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---
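The client-side half of this can be observed directly: the manager turns a server-sent '#TRACEBACK' message into a RemoteError via the same convert_to_error seen in the client traceback above (a small demonstration against the stock multiprocessing module):

```python
from multiprocessing.managers import RemoteError, convert_to_error

# The server sends kind '#TRACEBACK' with the formatted server-side
# traceback as its payload; the client wraps it in a RemoteError.
err = convert_to_error('#TRACEBACK',
                       'IOError: [Errno 4] Interrupted system call\n')
assert isinstance(err, RemoteError)
```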

Corresponding Python source code: managers.py (server side):
    def serve_client(self, conn):
        '''
        Handle requests from the proxies in a particular process/thread
        '''
        util.debug('starting server thread to service %r',
                   threading.current_thread().name)

        recv = conn.recv
        send = conn.send
        id_to_obj = self.id_to_obj

        while not self.stop:
            try:
                methodname = obj = None
                request = recv()   # <-- this line is interrupted by SIGCHLD
                ident, methodname, args, kwds = request
                obj, exposed, gettypeid = id_to_obj[ident]

                if methodname not in exposed:
                    raise AttributeError(
                        'method %r of %r object is not in exposed=%r' %
                        (methodname, type(obj), exposed)
                        )

                function = getattr(obj, methodname)

                try:
                    res = function(*args, **kwds)
                except Exception, e:
                    msg = ('#ERROR', e)
                else:
                    typeid = gettypeid and gettypeid.get(methodname, None)
                    if typeid:
                        rident, rexposed = self.create(conn, typeid, res)
                        token = Token(typeid, self.address, rident)
                        msg = ('#PROXY', (rexposed, token))
                    else:
                        msg = ('#RETURN', res)
            except AttributeError:
                if methodname is None:
                    msg = ('#TRACEBACK', format_exc())
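A fix along the lines Royce describes would wrap the interrupted recv() in an EINTR retry loop, the way vdsm guards its own calls. A minimal sketch (`retry_on_eintr` is a hypothetical helper, not the actual vdsm or upstream patch; Python 3.5+ retries EINTR automatically per PEP 475):

```python
import errno

def retry_on_eintr(func, *args, **kwargs):
    """Call func(), restarting it if a signal interrupts the underlying
    system call (EINTR) instead of letting the IOError propagate."""
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise
            # interrupted by a signal (e.g. SIGCHLD): just retry

# Inside serve_client() the interrupted line would become:
# request = retry_on_eintr(recv)
```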

Re: [vdsm] [Users] latest vdsm cannot read ib device speeds causing storage attach fail

2013-01-24 Thread Mark Wu

On 01/25/2013 03:20 PM, Mark Wu wrote:

Great work!
The default action for SIGCHLD is to ignore it, so no problems were 
reported before the zombie reaper installed a signal handler.
But I still have one question: the python multiprocessing manager code 
runs in a new thread, and according to the implementation of Python's 
signal handling, only the main thread can receive signals.

So how is the signal delivered to the server thread?

Is it possible to reap the zombie processes in the main thread 
asynchronously and periodically? That could be safer than using a 
signal handler.
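That suggestion could look roughly like the following (a hypothetical sketch, not vdsm code): a dedicated thread polls waitpid() with WNOHANG, so no SIGCHLD handler is installed and no recv() ever gets interrupted.

```python
import errno
import os
import threading

class PeriodicReaper(object):
    """Reap exited children by polling from a dedicated thread, instead
    of relying on a SIGCHLD handler that can interrupt blocking system
    calls elsewhere in the process."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.wait(self.interval):
            while True:  # collect every child that has exited so far
                try:
                    pid, _status = os.waitpid(-1, os.WNOHANG)
                except OSError as e:
                    if e.errno == errno.ECHILD:  # no children at all
                        break
                    raise
                if pid == 0:  # children exist, but none has exited yet
                    break
```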



