[Yahoo-eng-team] [Bug 1922053] [NEW] Operators can unset forced-down with `done` evacuation migration records against the host

Lee Yarwood Wed, 31 Mar 2021 03:41:20 -0700

Public bug reported:

Description
===========


Another PEBKAC issue, but the current evacuation flow allows an admin to
force down a compute service, evacuate its instances and unset forced
down *without* ever restarting the compute service. While it is clearly
documented that operators need to fence the source compute service ahead
of evacuation (see below), which should cause a service restart, this
isn't enforced anywhere in the current flow:

https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuate-action-detail#evacuate-server-evacuate-action

This leaves evacuation migration records marked as `done` instead of
`completed`, as the source host is never given a chance to clean up. The
request to unset forced down should be rejected until that cleanup has
run and the evacuation migration records are marked as `completed`.

This could ultimately lead to data loss if the instance is migrated back
to the host ahead of the next service restart. That restart causes the
evacuation cleanup logic to fire, potentially removing storage from
under the running instance.
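
The sequence described above can be sketched as a toy model. This is not
nova code: the `Migration` and `Host` classes, the host and instance
names, and `restart_compute_service` are hypothetical stand-ins, and the
cleanup rule is only an assumed simplification of what the description
says happens on restart.

```python
from dataclasses import dataclass, field


@dataclass
class Migration:
    # Hypothetical, simplified view of an evacuation migration record.
    instance_uuid: str
    source_host: str
    migration_type: str = "evacuation"
    status: str = "done"  # left as "done" until the source host cleans up


@dataclass
class Host:
    name: str
    # UUIDs of instances whose storage currently lives on this host.
    local_instances: set = field(default_factory=set)

    def restart_compute_service(self, migrations):
        """Toy stand-in for the evacuation cleanup that runs on restart."""
        destroyed = []
        for m in migrations:
            if (m.migration_type == "evacuation"
                    and m.source_host == self.name
                    and m.status == "done"):
                if m.instance_uuid in self.local_instances:
                    # Assumed behaviour: local storage for the evacuated
                    # instance is removed, even if it is running here again.
                    self.local_instances.discard(m.instance_uuid)
                    destroyed.append(m.instance_uuid)
                m.status = "completed"
        return destroyed


def simulate():
    host_a = Host("host-a")
    # inst-1 was evacuated away from host-a; the record is stuck at "done"
    # because the compute service on host-a was never restarted.
    migrations = [Migration("inst-1", "host-a")]
    # Forced down is unset without a restart and inst-1 is migrated back.
    host_a.local_instances.add("inst-1")
    # The next restart fires the cleanup against the now-running instance.
    return host_a.restart_compute_service(migrations), migrations[0].status
```

In this model `simulate()` reports `inst-1` as destroyed on the restart,
which is the data loss scenario the report warns about.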

Steps to reproduce
==================

- Mark a given host as forced down
- Evacuate instances from this host
- Unset forced down on the host
- Check that the migration records associated with the evacuations are
  still marked as done

Expected result
===============

The request to unset forced down is rejected until the service is
restarted and the evacuation migration records are moved to completed.
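
A minimal sketch of the kind of check this expected result implies,
assuming the caller can list migration records for the host. The names
`MigrationRecord`, `ForcedDownConflict` and `check_can_unset_forced_down`
are hypothetical and are not part of nova's API; this only illustrates
the proposed rule, not how a fix would actually be implemented.

```python
from dataclasses import dataclass


class ForcedDownConflict(Exception):
    """Hypothetical error raised when forced down must stay set."""


@dataclass(frozen=True)
class MigrationRecord:
    # Hypothetical, simplified migration record.
    source_compute: str
    migration_type: str
    status: str


def check_can_unset_forced_down(host, migrations):
    """Reject the request while `done` evacuation records point at host."""
    pending = [
        m for m in migrations
        if m.source_compute == host
        and m.migration_type == "evacuation"
        and m.status == "done"
    ]
    if pending:
        raise ForcedDownConflict(
            f"{len(pending)} evacuation migration record(s) against {host} "
            "are still 'done'; restart the compute service first"
        )
```

Once the compute service has restarted and the records have moved to
`completed`, the same call returns without raising and the request can
proceed.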

Actual result
=============

The request to unset forced down is allowed and evacuation migration
records remain marked as done. This could eventually lead to data loss
if the instance is migrated back to the host prior to the next service
restart.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   Master

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   N/A

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1922053
