Hi,
   I reproduced this issue, and I believe it's a Python bug.
   1. How to reproduce:
Put the attached test case under /usr/share/vdsm/tests/ and run #./run_tests.sh superVdsmTests.py,
   and this issue will be reproduced.
   2. Log analysis:
We notice a strange pattern in this log: connectStorageServer is called twice; the first supervdsm call succeeds, and the second fails because of validateAccess(). The reason is that the first validateAccess() returns normally but leaves a child process behind. When the second validateAccess() call arrives and the multiprocessing manager is receiving the method message, that is just when the first child exits and SIGCHLD arrives; the signal interrupts the manager's recv() system call. Python's managers.py should handle EINTR and retry recv(), like we do in vdsm, but it does not, so the second call raises an error (a fix sketch follows the managers.py source below).

Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
 File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
   conObj.connect()
 File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
   return self._mountCon.connect()
 File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
   fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
 File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
   (os.R_OK | os.X_OK))
 File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
   return callMethod()
 File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
   **kwargs)
 File "<string>", line 2, in validateAccess
 File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in 
_callmethod
   raise convert_to_error(kind, result)

The vdsm side receives a RemoteError because the supervdsm server's multiprocessing manager raises an error with KIND='TRACEBACK':

RemoteError:

The part above is the traceback from the client side; the following part is from the server side:

---------------------------------------------------------------------------
Traceback (most recent call last):
 File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in 
serve_client
   request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------
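
For reference, the RemoteError on the client side is constructed by convert_to_error(), which _callmethod() raises (see the first traceback). In the Python 2.6 source it looks roughly like this:

def convert_to_error(kind, result):
    if kind == '#ERROR':
        return result
    elif kind == '#TRACEBACK':
        assert type(result) is str
        return RemoteError(result)
    elif kind == '#UNSERIALIZABLE':
        assert type(result) is str
        return RemoteError('Unserializable message: %s\n' % result)
    else:
        return ValueError('Unrecognized message type')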

Corresponding Python source code, managers.py (server side):

    def serve_client(self, conn):
        '''
        Handle requests from the proxies in a particular process/thread
        '''
        util.debug('starting server thread to service %r',
                   threading.current_thread().name)

        recv = conn.recv
        send = conn.send
        id_to_obj = self.id_to_obj

        while not self.stop:
            try:
                methodname = obj = None
                request = recv()    # <-- this line gets interrupted by SIGCHLD
                ident, methodname, args, kwds = request
                obj, exposed, gettypeid = id_to_obj[ident]

                if methodname not in exposed:
                    raise AttributeError(
                        'method %r of %r object is not in exposed=%r' %
                        (methodname, type(obj), exposed)
                        )

                function = getattr(obj, methodname)

                try:
                    res = function(*args, **kwds)
                except Exception, e:
                    msg = ('#ERROR', e)
                else:
                    typeid = gettypeid and gettypeid.get(methodname, None)
                    if typeid:
                        rident, rexposed = self.create(conn, typeid, res)
                        token = Token(typeid, self.address, rident)
                        msg = ('#PROXY', (rexposed, token))
                    else:
                        msg = ('#RETURN', res)

            except AttributeError:
                if methodname is None:
                    msg = ('#TRACEBACK', format_exc())
                else:
                    try:
                        fallback_func = self.fallback_mapping[methodname]
                        result = fallback_func(
                            self, conn, ident, obj, *args, **kwds
                            )
                        msg = ('#RETURN', result)
                    except Exception:
                        msg = ('#TRACEBACK', format_exc())

            except EOFError:
                util.debug('got EOF -- exiting thread serving %r',
                           threading.current_thread().name)
                sys.exit(0)

            except Exception:   # <-- does not handle IOError/EINTR here; it should retry recv()
                msg = ('#TRACEBACK', format_exc())
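
A minimal sketch of the fix one would expect here: treat an interrupted recv() as retryable instead of reporting it to the client (untested; assumes errno is imported in managers.py):

            except IOError, e:
                # An interrupted recv() is not a request failure; go back
                # to the top of the while loop and call recv() again.
                if e.errno == errno.EINTR:
                    continue
                msg = ('#TRACEBACK', format_exc())

            except Exception:
                msg = ('#TRACEBACK', format_exc())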



3. Actions we will take:
(1) As a workaround, we can first remove the zombie reaper from the supervdsm server.
(2) I'll check whether Python has a fixed version for this.
(3) Yaniv is working on changing the vdsm/svdsm communication channel to a pipe and handling it ourselves (e.g. retrying interrupted reads, as sketched below); I believe we'll get rid of this issue once that is properly handled.
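
To illustrate what "handle it ourselves" means for item (3), this is the retry-on-EINTR pattern; a minimal sketch, where the helper name is hypothetical and not vdsm's actual API:

import errno

def retry_on_eintr(func, *args, **kwargs):
    # Retry a call that fails with EINTR because a signal (here
    # SIGCHLD) arrived while the system call was blocked.
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError), e:
            if e.errno != errno.EINTR:
                raise

# usage, e.g.: request = retry_on_eintr(conn.recv)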


-------- Original Message --------
Subject: Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail
Resent-Date:    Thu, 24 Jan 2013 12:24:10 +0200
Resent-From:    Dan Kenigsberg <dan...@redhat.com>
Resent-To:      Royce Lv <lvro...@linux.vnet.ibm.com>
Date:   Wed, 23 Jan 2013 10:44:57 -0600
From:   Dead Horse <deadhorseconsult...@gmail.com>
To:     Dan Kenigsberg <dan...@redhat.com>
CC:     <users@ovirt.org> <users@ovirt.org>



VDSM was built from:
commit 166138e37e75767b32227746bb671b1dab9cdd5e

Attached is the full vdsm log

I should also note that from the engine's perspective it sees the master
storage domain as locked and the others as unknown.


On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <dan...@redhat.com> wrote:

On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
> Any ideas on this one? (from VDSM log):
> Thread-25::DEBUG::2013-01-22 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call getCapabilities with () {}
> Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed) cannot read ib0 speed
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155, in speed
>     s = int(file('/sys/class/net/%s/speed' % dev).read())
> IOError: [Errno 22] Invalid argument
>
> Causes VDSM to fail to attach storage

I doubt that this is the cause of the failure, as vdsm has always
reported "0" for ib devices, and still is.

Does a former version work with your Engine?
Could you share more of your vdsm.log? I suppose the culprit lies in
one of the storage-related commands, not in statistics retrieval.

>
> Engine side sees:
> ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (QuartzScheduler_Worker-96) [553ef26e] The connection with details 192.168.0.1:/ovirt/ds failed because of error code 100 and error message is: general exception
> 2013-01-22 15:35:30,160 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-96) [1ab78378] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
> 2013-01-22 15:35:30,200 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] START, SetVdsStatusVDSCommand(HostName = kezan, HostId = 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
> 2013-01-22 15:35:30,211 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log id: 4af5c4cd
> 2013-01-22 15:35:30,242 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value: storagepoolname
>
> Engine = latest master
> VDSM = latest master

Since "latest master" is an unstable reference by definition, I'm sure
that History would thank you if you post the exact version (git hash?)
of the code.

> node = el6






-------- Attached test case: superVdsmTests.py --------

from testrunner import VdsmTestCase as TestCaseBase
import supervdsm
import testValidation
import tempfile
from vdsm import utils
import os
import uuid
from vdsm import constants
from storage import misc
from monkeypatch import MonkeyPatch
from time import sleep
from vdsm.constants import DISKIMAGE_GROUP, METADATA_GROUP,\
    QEMU_PROCESS_USER, EXT_PS


@utils.memoized
def getNeededPythonPath():
    testDir = os.path.dirname(__file__)
    base = os.path.dirname(testDir)
    vdsmPath = os.path.join(base, 'vdsm')
    cliPath = os.path.join(base, 'vdsm_cli')
    pyPath = "PYTHONPATH=" + ':'.join([base, vdsmPath, cliPath])
    return pyPath


def monkeyStart(self):
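    # Replacement for SuperVdsmProxy._start (applied via MonkeyPatch below):
    # launch a temporary supervdsmServer with the tests' PYTHONPATH and give
    # it a moment to come up before the proxy connects.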
    self._authkey = str(uuid.uuid4())
    self._log.debug("Launching Super Vdsm")

    superVdsmCmd = [getNeededPythonPath(), constants.EXT_PYTHON,
                    supervdsm.SUPERVDSM,
                    self._authkey, str(os.getpid()),
                    self.pidfile, self.timestamp, self.address,
                    str(os.getuid())]
    misc.execCmd(superVdsmCmd, sync=False, sudo=True)
    sleep(2)


class TestSuperVdsm(TestCaseBase):
    def setUp(self):
        testValidation.checkSudo(['python', supervdsm.SUPERVDSM])
        self._proxy = supervdsm.getProxy()

        # temporary values to run temporary svdsm
        self.pidfd, pidfile = tempfile.mkstemp()
        self.timefd, timestamp = tempfile.mkstemp()
        self.addfd, address = tempfile.mkstemp()

        self._proxy.setIPCPaths(pidfile, timestamp, address)

    def tearDown(self):
        supervdsm.extraPythonPathList = []
        for fd in (self.pidfd, self.timefd, self.addfd):
            try:
                os.close(fd)
            except OSError:
                pass
        self._proxy.kill()  # cleaning up old temp files

    @MonkeyPatch(supervdsm.SuperVdsmProxy, '_start', monkeyStart)
    def testIsSuperUp(self):
        self._proxy.ping()  # this call initiates svdsm
        self.assertTrue(self._proxy.isRunning())

    @MonkeyPatch(supervdsm.SuperVdsmProxy, '_start', monkeyStart)
    def testKillSuper(self):
        self._proxy.ping()
        self._proxy.kill()
        self.assertFalse(self._proxy.isRunning())
        self._proxy.ping()  # Launching vdsm after kill
        self.assertTrue(self._proxy.isRunning())

    @MonkeyPatch(supervdsm.SuperVdsmProxy, '_start', monkeyStart)
    def testNoPidFile(self):
        self._proxy.ping()  # svdsm is up
        self.assertTrue(self._proxy.isRunning())
        utils.rmFile(self._proxy.timestamp)
        self.assertRaises(IOError, self._proxy.isRunning)

    @MonkeyPatch(supervdsm.SuperVdsmProxy, '_start', monkeyStart)
    def testZombieReaper(self):
        REPO_PATH = '/rhev/data-center'
        self._proxy.ping()
        self.assertTrue(self._proxy.isRunning())
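        # The first validateAccess() leaves a child process behind in
        # supervdsmServer; when it exits, SIGCHLD can interrupt the manager
        # thread's recv() just as the second call below is being received.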
        self._proxy.validateAccess(
                QEMU_PROCESS_USER,
                (DISKIMAGE_GROUP, METADATA_GROUP), REPO_PATH,
                (os.R_OK | os.X_OK))
        sleep(4)
        self._proxy.validateAccess(
                QEMU_PROCESS_USER,
                (DISKIMAGE_GROUP, METADATA_GROUP), REPO_PATH,
                (os.R_OK | os.X_OK))

        with os.fdopen(self.pidfd, "r") as f:
            svdsmPid = f.read().strip()

        reapPid = utils.execCmd(
                         [EXT_PS, '--ppid', svdsmPid, '-o', 'pid='])[1]
        if reapPid:
            sleep(6)  # supervdsmServer checks SIGCHLD every 5 secs
            self.assertRaises(OSError, os.waitpid, int(reapPid[0]), os.WNOHANG)