** Also affects: nova/rocky
   Importance: Undecided
       Status: New

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Also affects: nova/ussuri
   Importance: Undecided
       Status: New

** Also affects: nova/victoria
   Importance: Undecided
       Status: New

** Also affects: nova/stein
   Importance: Undecided
       Status: New

** Also affects: nova/trunk
   Importance: Undecided
       Status: New

** Changed in: nova/ussuri
     Assignee: (unassigned) => Lee Yarwood (lyarwood)

** Changed in: nova/victoria
     Assignee: (unassigned) => Lee Yarwood (lyarwood)

** Changed in: nova/rocky
       Status: New => In Progress

** Changed in: nova/trunk
     Assignee: (unassigned) => Lee Yarwood (lyarwood)

** Changed in: nova/ussuri
       Status: New => In Progress

** Changed in: nova/victoria
       Status: New => In Progress

https://bugs.launchpad.net/bugs/1885528

Title:
  snapshot delete fails on shutdown VM

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  New
Status in OpenStack Compute (nova) trunk series:
  New
Status in OpenStack Compute (nova) ussuri series:
  In Progress
Status in OpenStack Compute (nova) victoria series:
  In Progress

Bug description:
  Description:
  When we try to delete the last snapshot of a VM that is in shutdown state, the snapshot delete fails and the snapshot gets stuck in the error-deleting state. If we then set the snapshot state to "available" and delete it again, the volume is corrupted and the VM will never start again. Volumes are stored on NFS.
  (for the root cause and fix, see the bottom of this post)

  To reproduce (a command-level sketch follows the list):
  - storage on NFS
  - create a VM and some snapshots
  - shut down the VM (i.e. the volume is still considered "attached" but the VM is no longer "active")
  - delete the last snapshot
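
  Roughly, with the openstack CLI (names and sizes are examples; this
  assumes an NFS-backed Cinder backend and a bootable image already in
  Glance):

    openstack volume create --size 10 --image cirros nfs-vol
    openstack server create --flavor m1.small --volume nfs-vol testvm
    openstack volume snapshot create --volume nfs-vol --force snap1
    openstack volume snapshot create --volume nfs-vol --force snap2
    openstack server stop testvm
    openstack volume snapshot delete snap2   # sticks in error_deleting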

  Expected Result:
  snapshot is deleted, vm still works

  Actual result:
  The snapshot is stuck in the error-deleting state. After setting the snapshot state to "available" and deleting the snapshot again, the volume is corrupted and the VM will never start again (the qcow2 on disk references a non-existing backing file).

  Environment:
  - OpenStack version: Stein, deployed via kolla-ansible. I suspect this downloads from git, but I don't know the exact version.
  - hypervisor: libvirt + KVM
  - storage: NFS
  - networking: Neutron with Open vSwitch

  Nova debug logs:
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [req-d38b5ec8-afdb-4dfe-af12-0c47598c6a47 6dd1c995b2ea4ddfbeb0685bc52e5fbf 6bebb564667d4a75b9281fd826b32ecf - default default] [instance: 711651a3-8440-42dd-a210-e7e550a8624e] Error occurred during volume_snapshot_delete, sending error status to Cinder.: DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] Traceback (most recent call last):
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2726, in volume_snapshot_delete
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]     snapshot_id, delete_info=delete_info)
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2686, in _volume_snapshot_delete
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]     rebase_base)
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2519, in _rebase_with_qemu_img
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]     b_file_fmt = images.qemu_img_info(backing_file).file_format
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 58, in qemu_img_info
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]     raise exception.DiskNotFound(location=path)
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
  2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server [req-d38b5ec8-afdb-4dfe-af12-0c47598c6a47 6dd1c995b2ea4ddfbeb0685bc52e5fbf 6bebb564667d4a75b9281fd826b32ecf - default default] Exception during message handling: DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 229, in inner
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     function_name, call_dict, binary, tb)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     self.force_reraise()
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3686, in volume_snapshot_delete
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     snapshot_id, delete_info)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2734, in volume_snapshot_delete
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     context, snapshot_id, 'error_deleting')
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     self.force_reraise()
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2726, in volume_snapshot_delete
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     snapshot_id, delete_info=delete_info)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2686, in _volume_snapshot_delete
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     rebase_base)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2519, in _rebase_with_qemu_img
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     b_file_fmt = images.qemu_img_info(backing_file).file_format
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 58, in qemu_img_info
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server     raise exception.DiskNotFound(location=path)
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
  2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server

  Root Cause:
  Look at the first line of the debug log: "No disk at volume-volid.snapid". Notice that this is a relative path. When the code does an `os.path.exists("volume-volid.snapid")`, the check inevitably fails, because nova runs with cwd="/" while the volumes are stored under "/usr/lib/nova/mnt/xxxxx/". If an absolute path were used, the code would be much happier :)
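
  A minimal sketch of the failing check (illustrative only; the mount
  directory below stands in for the real hashed NFS mount point):

    import os

    # nova-compute runs with cwd="/"; the qcow2 chain lives on the NFS mount.
    backing_file = "volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692"
    mount_dir = "/usr/lib/nova/mnt/xxxxx"  # illustrative, per this report

    print(os.path.exists(backing_file))                           # False -> DiskNotFound
    print(os.path.exists(os.path.join(mount_dir, backing_file)))  # True on the compute host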

  Code path:
    - cinder -> snapshot_delete
    - in remotefs.py (used for NFS): if the volume is attached -> call nova's snapshot_delete
    - nova -> snapshot_delete: if snapshot_to_merge == active -> do a rebase
    - if the VM is not active: call _rebase_with_qemu_img (virt/libvirt/driver.py)
    - rebase_base/backing_file is now usually a relative path (this depends on whether qemu_img_info is called with json or human output, and can also depend on the cwd; in all instances I saw, it was relative). When _rebase_with_qemu_img calls images.qemu_img_info(backing_file), this fails unless it is supplied with an absolute path: the first thing qemu_img_info does is an os.path.exists, which obviously fails on a relative path (see the sketch below). This causes the snapshot delete to fail.
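
  For reference, the guard in question behaves roughly like this
  (a self-contained paraphrase of nova/virt/images.py in Stein, line 58 in
  the traceback above; not a verbatim excerpt):

    import os

    class DiskNotFound(Exception):
        pass

    def qemu_img_info(path):
        # The path must exist before qemu-img is invoked. A backing file
        # given as a bare relative name never exists from nova's cwd ("/"),
        # so this raises before qemu-img even runs. (The real code also
        # exempts rbd-backed images from this check.)
        if not os.path.exists(path):
            raise DiskNotFound("No disk at %s" % path)
        # ... the real function goes on to run "qemu-img info" on the path.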

  Potential cinder-based fix:
  In cinder, qemu_img_info is typically wrapped so that the backing_file is always converted to a relative path. The code there typically keeps two variables, backing_file (relative) and backing_path (absolute), and then decides which variable to use where; the sketch below shows that pattern.
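
  A minimal sketch of the two-variable pattern (the helper name is ours, not
  a verbatim excerpt from cinder):

    import os

    def split_backing(info_backing_file, volume_dir):
        """Keep both forms of the backing reference and use each where it fits."""
        backing_file = os.path.basename(info_backing_file)     # relative: what qcow2 headers store
        backing_path = os.path.join(volume_dir, backing_file)  # absolute: for filesystem access
        return backing_file, backing_path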

  Patch:
  --- /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.orig   2020-06-22 14:48:09.760186268 +0000
  +++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py        2020-06-25 15:26:42.344963030 +0000
  @@ -2515,8 +2515,10 @@
               # If the rebased image is going to have a backing file then
               # explicitly set the backing file format to avoid any security
               # concerns related to file format auto detection.
  -            backing_file = rebase_base
  -            b_file_fmt = images.qemu_img_info(backing_file).file_format
  +            backing_file = os.path.basename(rebase_base)  # relative path
  +            volume_path = os.path.dirname(active_disk_object.source_path)
  +            backing_path = os.path.join(volume_path, backing_file)  # absolute
  +            b_file_fmt = images.qemu_img_info(backing_path).file_format
               qemu_img_extra_arg = ['-F', b_file_fmt]
   
           qemu_img_extra_arg.append(active_disk_object.source_path)
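
  For context: active_disk_object.source_path is the absolute path of the
  active disk as libvirt knows it, so joining its dirname with the basename
  of rebase_base yields an absolute backing path regardless of nova's cwd,
  mirroring the backing_file/backing_path split described above.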

  Note:
  We have only been able to test this on Stein. However, based on an analysis of the current code in git, this bug is probably still present there as well.

