Dear all,

We are experiencing this problem over and over again with different VMs; the 
situation is as follows:

  *   We back up all VMs by iterating over them (see the attached Python 
file), which basically follows the recommendations.
  *   This process runs well for a while, but at some point a random VM (it 
is always a different machine) hits a problem while the backup process tries 
to remove all snapshots, and this is what appears in the log files:

vdsm.log
<— snip —>
Thread-8246::DEBUG::2015-05-27 
16:56:00,003::libvirtconnection::143::root::(wrapper) Unknown libvirterror: 
ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire 
state change lock
Thread-8246::ERROR::2015-05-27 16:56:00,016::vm::5761::vm.Vm::(queryBlockJobs) 
vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
VM Channels Listener::DEBUG::2015-05-27 
16:56:00,561::vmchannels::96::vds::(_handle_timeouts) Timeout on fileno 55.
<— snip —>

Syslog / journalctl
<— snip —>
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Cannot 
start job (modify, none) for domain fab-cms-app-01-live-fab-mcon-net; current 
job is (modify, none) owned by (1780, 0)
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Timed out 
during operation: cannot acquire state change lock
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net vdsm[10478]: vdsm vm.Vm ERROR vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
        liveInfo = self._dom.blockJobInfo(drive.name, 0)
      File "/usr/share/vdsm/virt/vm.py", line 697, in f
        raise toe
    TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>

The result is that the VM is non-operational: the qemu process is still 
running and oVirt shows the VM with a "?". This by itself would not be that 
bad if we could recover from the problem, but the only way we have found so 
far to resolve it is to put the hypervisor host into maintenance and then, 
once only the VM showing the "?" is left, reboot the host; we found no other 
way that allowed us to reboot this VM. Even after actually killing the qemu 
process, there is no way to do anything with this VM.

I think I understand that the problem arises when two threads make requests 
against the same VM; however, in this last case the VM was not doing anything 
else as far as we can see.
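
Just to illustrate what we mean: if concurrent requests from our side were 
the cause, the obvious guard would be a per-VM lock in the backup script, so 
that only one thread at a time touches a given VM. A minimal sketch (the 
helper names below are made up for this example, they are not from the 
attached script):

import threading

# One lock per VM name, created lazily; _vm_locks_guard protects the dict itself.
_vm_locks = {}
_vm_locks_guard = threading.Lock()

def _lock_for(vm_name):
    with _vm_locks_guard:
        if vm_name not in _vm_locks:
            _vm_locks[vm_name] = threading.Lock()
        return _vm_locks[vm_name]

def backup_vm_serialised(vm_name, backup_fn):
    # backup_fn stands in for the snapshot -> clone -> export -> delete-snapshots
    # steps from the attached script; only one thread at a time may run it per VM.
    with _lock_for(vm_name):
        backup_fn(vm_name)

In the attached script each VM is only ever handed to a single worker thread 
anyway, so as far as we can tell this should already hold, which is why the 
timeout surprises us.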

I found a bug on Launchpad that seems to be very similar (a little older 
though), but it is libvirt related.

https://bugs.launchpad.net/nova/+bug/1254872

These are the libvirt versions on the server

libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64

VDSM version
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-cli-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch

Kernel
3.10.0-229.4.2.el7.x86_64


Any idea where to go from here?

Regards
Soeren


From: Soeren Malchow <soeren.malc...@mcon.net>
Date: Monday 25 May 2015 22:27
To: "users@ovirt.org" <users@ovirt.org>
Subject: [ovirt-users] VM crashes during snapshot/clone/export and show only "?"

Dear all,

In version 3.5.2 on CentOS 7.1 we now have the problem that the backup script 
seems to trigger a crash of VMs. This is the second time; the first time I 
could only solve the problem by rebooting the hypervisor host and 
acknowledging "host has been rebooted".

This problem happens while removing the snapshots after the snapshot -> 
clone -> export procedure.

The actual qemu process is still running after the following log output, but 
the VM is not responsive anymore; I can kill the process without a problem.

Two questions for this:

How can I avoid this problem?
Is there a way to tell oVirt that the qemu process is gone and that the VM can 
be started again?

<— snip —>
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: internal 
error: End of file from monitor
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: 
port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: device vnet3 left 
promiscuous mode
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: 
port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kvm[22973]: 15 guests now 
active
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net systemd-machined[14412]: 
Machine qemu-mc-glpi-app-01-live.mc.mcon.net terminated.
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out 
during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out 
during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Error getting block job info
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
        liveInfo = self._dom.blockJobInfo(drive.name, 0)
      File "/usr/share/vdsm/virt/vm.py", line 697, in f
        raise toe
    TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>


<— snip —>
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
        statsFunction()
      File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
        retValue = self._function(*args, **kwargs)
      File "/usr/share/vdsm/virt/vm.py", line 346, in _sampleVmJobs
        return self._vm.queryBlockJobs()
      File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
        liveInfo = self._dom.blockJobInfo(drive.name, 0)
    AttributeError: 'NoneType' object has no attribute 'blockJobInfo'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata 
not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
        statsFunction()
      File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
        retValue = self._function(*args, **kwargs)
      File "/usr/share/vdsm/virt/vm.py", line 338, in _sampleVcpuPinning
        vCpuInfos = self._vm._dom.vcpus()
    AttributeError: 'NoneType' object has no attribute 'vcpus'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
        statsFunction()
      File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
        retValue = self._function(*args, **kwargs)
      File "/usr/share/vdsm/virt/vm.py", line 349, in _sampleCpuTune
        infos = self._vm._dom.schedulerParameters()
    AttributeError: 'NoneType' object has no attribute 'schedulerParameters'

<— snip —>
#!/usr/bin/python

import Queue
import threading
import time
from ovirtsdk.api import API
from ovirtsdk.xml import params
import sys
import datetime
import smtplib
from email.mime.text import MIMEText


global SNAPSHOT_NAME

VERSION             = params.Version(major='3', minor='0')
ENGINE_SERVER       = ''
ENGINE_USER         = ''
ENGINE_PASSWORD     = ''
ENGINE_CERT         = ''
NOW                 = datetime.datetime.now()
SNAPSHOT_NAME       = 'BACKUP_' + NOW.strftime("%Y-%m-%d-%H%M")
DAY_OF_WEEK         = NOW.strftime("%w")
BACKUP              = "FULL"

exitFlag = 0

class myThread (threading.Thread):
    def __init__(self, threadID, name, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q
        self.api = api
        global message
    def run(self):
        print "Starting " + self.name
        process_data(self.name, self.q)
        print "Exiting " + self.name

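# Worker loop: pull VM names off the shared queue and, for each one, run the
# snapshot -> clone -> export -> delete-snapshots sequence.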
def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if not workQueue.empty():
            data = q.get()
            queueLock.release()
            print "%s processing %s" % (threadName, data.name)
            vm = api.vms.get(name=data.name)
            vmname = data.name +"_"
            newvmname = vmname + SNAPSHOT_NAME
            cluster = api.clusters.get(id=vm.cluster.id)
            dc = api.datacenters.get(id=cluster.data_center.id)
            export = None
            for sd in dc.storagedomains.list():
                if sd.type_ == "export":
                    export = sd
            if not export:
                print("Export domain required, and none found, exitting...\n")
                sys.exit(1)
            print "Cluster: %s" % cluster.name
            if (data.name != 'HostedEngine' and cluster.name == 'CC-01'):
                vm.snapshots.add(params.Snapshot(description=SNAPSHOT_NAME, vm=vm ))
                snap = vm.snapshots.list(description=SNAPSHOT_NAME)[0]
                while vm.snapshots.get(id=snap.id).snapshot_status == "locked":
                    print("%s Waiting for snapshot of %s to finish") % (threadName, vm.name)
                    time.sleep(10)
                print("%s Snapshotting %s is done") % (threadName,vm.name)
                try:
                    snapshots = params.Snapshots(snapshot=[params.Snapshot(id=snap.id)])
                    api.vms.add(params.VM(name=newvmname, snapshots=snapshots, cluster=cluster, template=api.templates.get(name="Blank")))
                    while api.vms.get(name=newvmname).status.state == "image_locked":
                        print("%s Waiting for clone of %s to finish") % (threadName, vm.name)
                        time.sleep(60)
                    print("%s Cloning of %s  done") % (threadName, vm.name)
                    api.vms.get(name=newvmname).export(params.Action(storage_domain=export))
                    while api.vms.get(name=newvmname).status.state == "image_locked":
                        print("%s Waiting for export of %s finish") % (threadName, vm.name)
                        time.sleep(60)
                    print("%s Exporting %s done") % (threadName, vm.name)
                    api.vms.get(name=newvmname).delete()
                except Exception as e:
                    print ("Something went wrong with the coling or exporting\n%s") % str(e)
                snapshotlist = vm.snapshots.list()
                for snapshot in snapshotlist:
                    if snapshot.description != "Active VM":
                        snapshot.delete()
                        time.sleep(3)
                        try:
                            while api.vms.get(name=vm.name).snapshots.get(id=snapshot.id).snapshot_status == "locked":
                                print("%s Waiting for snapshot %s on %s deletion to finish") % (threadName, snapshot.name, vm.name)
                                time.sleep(10)
                        except Exception as e:
                            print ("%s Snapshot status request might have failed, this usually means that the snpashot was deleted properly") % threadName
                        print("%s Deleting snapshot %s on %s done") % (threadName, snapshot.name, vm.name)

        else:
            queueLock.release()
        time.sleep(1)

threadList = ["Backup-Thread-1", "Backup-Thread-2", "Backup-Thread-3"]

def Connect():
    global api
    api = API(url=ENGINE_SERVER, username=ENGINE_USER, password=ENGINE_PASSWORD, ca_file=ENGINE_CERT)

def Disconnect(exitcode):
    api.disconnect()
    sys.exit(exitcode)

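# Connect to the engine and fetch the list of VMs to back up.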
try:
    Connect()
    vms = api.vms.list()

except Exception as e:
    print 'Failed to connect:\n%s' % str(e)
    sys.exit(1)

nameList = vms 
queueLock = threading.Lock()
workQueue = Queue.Queue(0)
threads = []
threadID = 1

# Create new threads
for tName in threadList:
    thread = myThread(threadID, tName, workQueue)
    thread.start()
    threads.append(thread)
    threadID += 1

# Fill the queue
queueLock.acquire()
for word in nameList:
    workQueue.put(word)
queueLock.release()

# Wait for queue to empty
while not workQueue.empty():
    pass

# Notify threads it's time to exit
exitFlag = 1

# Wait for all threads to complete
for t in threads:
    t.join()
print "Exiting Main Thread"
api.disconnect()
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
