Dear all,

We are experiencing this problem over and over again with different VMs. The situation is as follows:
* We are backing up all VMs by iterating through them (see the attached Python file), which basically follows the recommendations.
* This process runs well for a while, but at some point we get a problem with a random VM (it is always a different machine). The backup process tries to remove all snapshots, and this appears in the log files:

vdsm.log

<— snip —>
Thread-8246::DEBUG::2015-05-27 16:56:00,003::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock
Thread-8246::ERROR::2015-05-27 16:56:00,016::vm::5761::vm.Vm::(queryBlockJobs) vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
VM Channels Listener::DEBUG::2015-05-27 16:56:00,561::vmchannels::96::vds::(_handle_timeouts) Timeout on fileno 55.
<— snip —>

Syslog / journalctl

<— snip —>
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Cannot start job (modify, none) for domain fab-cms-app-01-live-fab-mcon-net; current job is (modify, none) owned by (1780, 0)
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Timed out during operation: cannot acquire state change lock
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net vdsm[10478]: vdsm vm.Vm ERROR vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>

The result is that the VM is non-operational: the qemu process is still running, but oVirt shows the VM with a “?”. This by itself would not be that bad if we could recover from it, but the only way we have found so far to resolve it is to put the hypervisor host into maintenance and then, with this last VM still showing the “?”, reboot the host; we found no other way that allowed us to restart this VM. Even after actually killing the qemu process, there is no way to do anything with this VM.

I think I understand that the problem arises when two threads make requests against the same VM; however, in this last case the VM was not doing anything else as far as we can see.
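The only mitigation I can think of on our side is to retry the snapshot removal instead of firing it once and giving up. A minimal sketch of what I mean (delete_with_retry and its attempts/pause parameters are made up for illustration; snapshot.delete() is the same SDK call the attached script already uses):

import time

def delete_with_retry(snapshot, attempts=5, pause=30):
    """Try to delete a snapshot several times before giving up.

    A hypothetical helper: retrying only papers over the
    'cannot acquire state change lock' timeouts, it does not
    explain them.
    """
    for i in range(attempts):
        try:
            snapshot.delete()
            return True
        except Exception as e:
            print("delete attempt %d failed: %s" % (i + 1, e))
            time.sleep(pause)
    return False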
I found a bug on Launchpad that seems very similar (a little older though), but it is libvirt related:

https://bugs.launchpad.net/nova/+bug/1254872

These are the libvirt versions on the server:

libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64

VDSM versions:

vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-cli-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch

Kernel:

3.10.0-229.4.2.el7.x86_64

Any idea where to go?

Regards
Soeren

From: Soeren Malchow <soeren.malc...@mcon.net>
Date: Monday 25 May 2015 22:27
To: "users@ovirt.org" <users@ovirt.org>
Subject: [ovirt-users] VM crashes during snapshot/clone/export and shows only "?"

Dear all,

In version 3.5.2 on CentOS 7.1 we now have the problem that the backup script seems to trigger a crash of VMs; this is the second time. The first time I could only solve the problem by rebooting the hypervisor host and acknowledging "host has been rebooted".

This problem happens while removing snapshots after the snapshot -> clone -> export procedure. The actual qemu process is still running after the following log output, but the VM is not responsive anymore; I can kill the process without a problem.

Two questions for this:

* How can I avoid this problem?
* Is there a way to tell oVirt that the qemu process is gone and that the VM can be started again?

<— snip —>
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: internal error: End of file from monitor
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: device vnet3 left promiscuous mode
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kvm[22973]: 15 guests now active
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net systemd-machined[14412]: Machine qemu-mc-glpi-app-01-live.mc.mcon.net terminated.
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Error getting block job info
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>

<— snip —>
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 346, in _sampleVmJobs
    return self._vm.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
AttributeError: 'NoneType' object has no attribute 'blockJobInfo'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 338, in _sampleVcpuPinning
    vCpuInfos = self._vm._dom.vcpus()
AttributeError: 'NoneType' object has no attribute 'vcpus'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 349, in _sampleCpuTune
    infos = self._vm._dom.schedulerParameters()
AttributeError: 'NoneType' object has no attribute 'schedulerParameters'
<— snip —>
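For diagnosis, the block job state can also be queried directly on the hypervisor with libvirt-python, which is the same call vdsm is timing out on above. A minimal sketch (the domain and disk names are placeholders, and note this call can itself block while the state change lock is held):

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('fab-cms-app-01-live-fab-mcon-net')  # placeholder domain name
# An empty result means no block job is active on that disk;
# otherwise the dict contains type, bandwidth, cur and end.
info = dom.blockJobInfo('vda', 0)  # 'vda' is a placeholder disk target
print(info)
conn.close()

For reference, the backup script mentioned at the top (the attached Python file) follows: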
#!/usr/bin/python

import Queue
import threading
import time
from ovirtsdk.api import API
from ovirtsdk.xml import params
import sys
import datetime
import smtplib
from email.mime.text import MIMEText

global SNAPSHOT_NAME

VERSION = params.Version(major='3', minor='0')

ENGINE_SERVER = ''
ENGINE_USER = ''
ENGINE_PASSWORD = ''
ENGINE_CERT = ''

NOW = datetime.datetime.now()
SNAPSHOT_NAME = 'BACKUP_' + NOW.strftime("%Y-%m-%d-%H%M")
DAY_OF_WEEK = NOW.strftime("%w")
BACKUP = "FULL"

exitFlag = 0


class myThread(threading.Thread):
    def __init__(self, threadID, name, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q
        self.api = api

    def run(self):
        print "Starting " + self.name
        process_data(self.name, self.q)
        print "Exiting " + self.name


def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if not workQueue.empty():
            data = q.get()
            queueLock.release()
            print "%s processing %s" % (threadName, data.name)
            vm = api.vms.get(name=data.name)
            vmname = data.name + "_"
            newvmname = vmname + SNAPSHOT_NAME
            cluster = api.clusters.get(id=vm.cluster.id)
            dc = api.datacenters.get(id=cluster.data_center.id)
            # Find the export storage domain of the data center.
            export = None
            for sd in dc.storagedomains.list():
                if sd.type_ == "export":
                    export = sd
            if not export:
                print("Export domain required, and none found, exiting...\n")
                sys.exit(1)
            print "Cluster: %s" % cluster.name
            if (data.name != 'HostedEngine' and cluster.name == 'CC-01'):
                # Create the backup snapshot and wait for it to finish.
                vm.snapshots.add(params.Snapshot(description=SNAPSHOT_NAME, vm=vm))
                snap = vm.snapshots.list(description=SNAPSHOT_NAME)[0]
                while vm.snapshots.get(id=snap.id).snapshot_status == "locked":
                    print("%s Waiting for snapshot of %s to finish" % (threadName, vm.name))
                    time.sleep(10)
                print("%s Snapshotting %s is done" % (threadName, vm.name))
                try:
                    # Clone a new VM from the snapshot, export it, then delete the clone.
                    snapshots = params.Snapshots(snapshot=[params.Snapshot(id=snap.id)])
                    api.vms.add(params.VM(name=newvmname, snapshots=snapshots,
                                          cluster=cluster,
                                          template=api.templates.get(name="Blank")))
                    while api.vms.get(name=newvmname).status.state == "image_locked":
                        print("%s Waiting for clone of %s to finish" % (threadName, vm.name))
                        time.sleep(60)
                    print("%s Cloning of %s done" % (threadName, vm.name))
                    api.vms.get(name=newvmname).export(params.Action(storage_domain=export))
                    while api.vms.get(name=newvmname).status.state == "image_locked":
                        print("%s Waiting for export of %s to finish" % (threadName, vm.name))
                        time.sleep(60)
                    print("%s Exporting %s done" % (threadName, vm.name))
                    api.vms.get(name=newvmname).delete()
                except Exception as e:
                    print("Something went wrong with the cloning or exporting\n%s" % str(e))
                # Remove the backup snapshots again; only the "Active VM" entry stays.
                snapshotlist = vm.snapshots.list()
                for snapshot in snapshotlist:
                    if snapshot.description != "Active VM":
                        snapshot.delete()
                        time.sleep(3)
                        try:
                            while api.vms.get(name=vm.name).snapshots.get(id=snapshot.id).snapshot_status == "locked":
                                print("%s Waiting for snapshot %s on %s deletion to finish" % (threadName, snapshot.name, vm.name))
                                time.sleep(10)
                        except Exception as e:
                            print("%s Snapshot status request failed, which usually means the snapshot was deleted properly" % threadName)
                        print("%s Deleting snapshot %s on %s done" % (threadName, snapshot.name, vm.name))
        else:
            queueLock.release()
            time.sleep(1)


threadList = ["Backup-Thread-1", "Backup-Thread-2", "Backup-Thread-3"]


def Connect():
    global api
    api = API(url=ENGINE_SERVER, username=ENGINE_USER,
              password=ENGINE_PASSWORD, ca_file=ENGINE_CERT)


def Disconnect(exitcode):
    api.disconnect()
    sys.exit(exitcode)


try:
    Connect()
    vms = api.vms.list()
except Exception as e:
    print 'Failed:\n%s' % str(e)

nameList = vms

queueLock = threading.Lock()
workQueue = Queue.Queue(0)
threads = []
threadID = 1

# Create new threads
for tName in threadList:
    thread = myThread(threadID, tName, workQueue)
    thread.start()
    threads.append(thread)
    threadID += 1

# Fill the queue
queueLock.acquire()
for word in nameList:
    workQueue.put(word)
queueLock.release()

# Wait for the queue to empty
while not workQueue.empty():
    time.sleep(1)

# Notify threads it's time to exit
exitFlag = 1

# Wait for all threads to complete
for t in threads:
    t.join()

print "Exiting Main Thread"

api.disconnect()
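One thing I would consider changing in the script (just a suggestion, not anything oVirt-specific): the main loop busy-waits on workQueue.empty(), and the stdlib Queue already has a join()/task_done() pair for exactly this. A sketch of the pattern:

# Producer side: fill the queue, then block until every item
# has been marked done by a worker.
for vm in nameList:
    workQueue.put(vm)
workQueue.join()

# Worker side: mark each item as processed, even on failure.
data = workQueue.get()
try:
    pass  # ... snapshot / clone / export work ...
finally:
    workQueue.task_done()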
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users