Reviewed: https://review.openstack.org/363926 Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b20239a5d293f55889cd1bffa59e4792c75edbf Submitter: Jenkins Branch: master
commit 6b20239a5d293f55889cd1bffa59e4792c75edbf Author: Sławek Kapłoński <sla...@kaplonski.pl> Date: Wed Aug 31 20:28:36 2016 +0000 Fix race condition bug during live_snapshot During live_snapshot creation, when nova starts block_rebase operation in libvirt there is possibility that block_job is not yet started and libvirt blockJobInfo method will return status.end = 0 and status.cur = 0. In such case libvirt driver does not wait to finish block rebase operation and that causes a problem because created snapshot is corrupted. This patch adds check if status.end != 0 to return information that job is already finished. Change-Id: I45ac06eae0b1949f746dae305469718649bfcf23 Closes-Bug: #1530275 ** Changed in: nova Status: In Progress => Fix Released -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1530275 Title: Live snapshot is corrupted (possibly race condition?) Status in OpenStack Compute (nova): Fix Released Bug description: We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored in qcow2 files on ext4 filesystem. When we live snapshot, 90% of the time the produced image is corrupted; specifically, the image is only a few megabytes (e.g. 30 MB) in size, while the disk size is several GB. Here is the log from a corrupted snapshot: 2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting 2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process 2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload 2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete The entire operation completes in a couple of seconds, which is unexpected. While testing, I added some sleep calls to the _live_snapshot function in virt/libvirt/driver.py to debug the problem. A few live snapshot runs were successful, but I'm not confident that it fixed the problem. Anyway, here is the code that I changed: try: # NOTE (rmk): blockRebase cannot be executed on persistent # domains, so we need to temporarily undefine it. # If any part of this block fails, the domain is # re-defined regardless. if guest.has_persistent_configuration(): guest.delete_configuration() # NOTE (rmk): Establish a temporary mirror of our root disk and # issue an abort once we have a complete copy. dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True) + time.sleep(10.0) while dev.wait_for_job(): - time.sleep(0.5) + time.sleep(5.0) dev.abort_job() libvirt_utils.chown(disk_delta, os.getuid()) finally: self._host.write_instance_config(xml) if require_quiesce: self.unquiesce(context, instance, image_meta) And the resulting log (which indicates that it is sleeping for not just the initial 10 second call, but even more than that; this means wait_for_job is returning false immediately before applying the modification, but after the modification it is actually returning true after the initial sleep and seems to be performing correctly): 2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting 2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process 2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload 2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete Since sleeping 10 seconds before polling wait_for_job seemed to resolve it, I think there may be a race condition where wait_for_job may be called before the job is fully initialized from the rebase call. I have not had a chance to explore that possibility further though. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1530275/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp