Re: [vdsm] [Users] latest vdsm cannot read ib device speeds causing storage attach fail
Great work! The default action for SIGCHLD is ignore, so no problems are reported before the zombie reaper installs a signal handler. But I still have one question: the python multiprocessing.manager code runs the server in a new thread, and according to the implementation of Python's signal module, only the main thread can receive signals. So how is the signal delivered to the server thread?

On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:

Hi, I reproduced this issue, and I believe it's a Python bug.

1. How to reproduce: with the test case attached, put it under /usr/share/vdsm/tests/, run #./run_tests.sh superVdsmTests.py and this issue will be reproduced.

2. Log analysis: We notice a strange pattern in this log: connectStorageServer is called twice; the first supervdsm call succeeds, the second fails because of validateAccess(). For the first call, validateAccess returns normally and leaves a child process behind. When the second validateAccess call arrives and the multiprocessing manager is receiving the method message, that is just when the first child exits and SIGCHLD comes in. The signal interrupts the multiprocessing recv() system call. Python's managers.py should handle EINTR and retry recv(), as we do in vdsm, but it does not, so the second call raises an error.
Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(lambda) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
    (os.R_OK | os.X_OK))
  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
    **kwargs)
  File "<string>", line 2, in validateAccess
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)

The vdsm side receives a RemoteError because the supervdsm server's multiprocessing manager raises an error with KIND='TRACEBACK'. The upper part is the traceback from the client side; the following part is from the server side:

---
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
    request = recv()
IOError: [Errno 4] Interrupted system call
---

Corresponding Python source code, managers.py (server side):

    def serve_client(self, conn):
        '''
        Handle requests from the proxies in a particular process/thread
        '''
        util.debug('starting server thread to service %r',
                   threading.current_thread().name)
        recv = conn.recv
        send = conn.send
        id_to_obj = self.id_to_obj
        while not self.stop:
            try:
                methodname = obj = None
                request = recv()    # <-- this line was interrupted by SIGCHLD
                ident, methodname, args, kwds = request
                obj, exposed, gettypeid = id_to_obj[ident]
                if methodname not in exposed:
                    raise AttributeError(
                        'method %r of %r object is not in exposed=%r' %
                        (methodname, type(obj), exposed)
                        )
                function = getattr(obj, methodname)
                try:
                    res = function(*args, **kwds)
                except Exception, e:
                    msg = ('#ERROR', e)
                else:
                    typeid = gettypeid and gettypeid.get(methodname, None)
                    if typeid:
                        rident, rexposed = self.create(conn, typeid, res)
                        token = Token(typeid, self.address, rident)
                        msg = ('#PROXY', (rexposed, token))
                    else:
                        msg = ('#RETURN', res)
            except AttributeError:
                if methodname is None:
                    msg = ('#TRACEBACK', format_exc())
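The retry pattern Royce describes ("handle EINTR and retry recv() like we do in vdsm") can be sketched as follows. retry_on_eintr is a hypothetical helper for illustration, not the actual vdsm code; the flaky_recv demo simulates a call that is interrupted twice before succeeding:

```python
import errno

def retry_on_eintr(func, *args, **kwargs):
    # Keep retrying the call while it fails with EINTR, i.e. while a
    # signal such as SIGCHLD interrupts the underlying system call.
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise

# Demo: a stand-in for conn.recv() that is "interrupted" twice.
calls = []
def flaky_recv():
    calls.append(1)
    if len(calls) < 3:
        raise IOError(errno.EINTR, 'Interrupted system call')
    return 'request'
```

Wrapping the `request = recv()` line in serve_client this way (`request = retry_on_eintr(recv)`) is the essence of the fix; Python 3.5's PEP 475 later made the interpreter retry interrupted system calls automatically.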
Re: [vdsm] [Users] latest vdsm cannot read ib device speeds causing storage attach fail
On 01/25/2013 03:20 PM, Mark Wu wrote:

Great work! The default action for SIGCHLD is ignore, so no problems are reported before the zombie reaper installs a signal handler. But I still have one question: the python multiprocessing.manager code runs the server in a new thread, and according to the implementation of Python's signal module, only the main thread can receive signals. So how is the signal delivered to the server thread?

Is it possible to reap the zombie process in the main thread asynchronously and periodically? It could be safer than using the signal handler.

On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:

Hi, I reproduced this issue, and I believe it's a Python bug.

1. How to reproduce: with the test case attached, put it under /usr/share/vdsm/tests/, run #./run_tests.sh superVdsmTests.py and this issue will be reproduced.

2. Log analysis: We notice a strange pattern in this log: connectStorageServer is called twice; the first supervdsm call succeeds, the second fails because of validateAccess(). For the first call, validateAccess returns normally and leaves a child process behind. When the second validateAccess call arrives and the multiprocessing manager is receiving the method message, that is just when the first child exits and SIGCHLD comes in. The signal interrupts the multiprocessing recv() system call. Python's managers.py should handle EINTR and retry recv(), as we do in vdsm, but it does not, so the second call raises an error.
[The quoted log excerpt and managers.py source follow, identical to the first message above.]
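The periodic-reaping idea suggested above can be sketched as follows. reap_zombies is a hypothetical helper for illustration, not actual vdsm code; it uses os.waitpid with WNOHANG so it never blocks:

```python
import errno
import os

def reap_zombies():
    # Collect every child that has already exited, without blocking.
    # Meant to be called periodically from the main thread, as an
    # alternative to reaping inside a SIGCHLD handler.
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                break          # no child processes at all
            raise
        if pid == 0:
            break              # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```

A main loop could call reap_zombies() every few seconds; because no SIGCHLD handler is installed, the manager thread's recv() is never interrupted, at the cost of zombies lingering until the next sweep.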