Public bug reported:

Instance "evacuation" is a great feature and we are trying to take advantage of 
it.
But it has some limitations, depending on how "broken" the node is.

Let me give some context...

In the scenario where the compute node loses connectivity (broken switch
port, loose network cable, ...) or nova-compute is stuck (filesystem
issue), evacuating instances can have unexpected consequences and
lead to data corruption in the application (for example in a DB
application).

If a compute node loses connectivity (or an entire set of compute nodes does), 
nova-compute and the instances become "not available".
If the node runs critical applications (let's suppose a MySQL DB), the cloud 
operator could be tempted to "evacuate" the instance to recover the critical 
application for the user. At this point the cloud operator may not yet know 
what is wrong with the compute node, and it may not be possible to shut it down 
(management network affected?, ...), or the operator may simply not want to 
interfere with the work of the repair team.

The repair team then fixes the issue (which can take a few minutes or several
hours), and nova-compute and the instances become available again.

The problem is that nova-compute doesn't destroy the evacuated instances
on the source host.

```
2021-10-19 11:17:51.519 3050 WARNING nova.compute.resource_tracker 
[req-0ed10e35-2715-466a-918b-69eb1fc770e8 - - - - -] Instance 
fc3be091-56d3-4c69-8adb-2fdb8b0a35d2 has been moved to another host 
foo.cern.ch(foo.cern.ch). There are allocations remaining against the source 
host that might need to be removed: {u'resources': {u'VCPU': 1, u'MEMORY_MB': 
1875}}.
```

At this point we have 2 instances sharing the same IP and possibly
writing into the same volume.

Only when nova-compute is restarted (I guess that was always the
assumption... that the compute node was really broken) are the evacuated
instances on the affected node removed.
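The restart-time behaviour described above can be modelled roughly as follows. This is a minimal self-contained sketch, not actual nova code; the function name and data shapes are hypothetical. The idea is that on startup, any instance still present on the local hypervisor whose database record now points at another host is treated as evacuated and destroyed:

```python
def find_stale_evacuated(local_host, hypervisor_instances, db_host_by_uuid):
    """Return UUIDs of instances still running on this hypervisor whose
    owning host in the database is no longer this node (i.e. they were
    evacuated away while nova-compute was down or unreachable).

    Hypothetical model of the cleanup decision, not nova's real API.
    """
    return [uuid for uuid in hypervisor_instances
            if db_host_by_uuid.get(uuid, local_host) != local_host]


# Example: instance "a" was evacuated to another host, "b" still belongs here.
stale = find_stale_evacuated(
    "src.cern.ch",
    ["a", "b"],
    {"a": "dst.cern.ch", "b": "src.cern.ch"},
)
```

The point of the bug is that this check only runs at service startup, so the stale instance keeps running until someone restarts nova-compute.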

```
2021-10-19 15:39:49.257 21189 INFO nova.compute.manager 
[req-ded45b0c-20ab-4587-9533-8c613d977f79 - - - - -] Destroying instance as it 
has been evacuated from this host but still exists in the hypervisor
2021-10-19 15:39:52.949 21189 INFO nova.virt.libvirt.driver [ ] Instance 
destroyed successfully.
```

I would expect nova-compute to periodically check for evacuated instances
and remove them.
Otherwise, this requires a lot of coordination between different support teams.

Should this be moved to a periodic task?
https://github.com/openstack/nova/blob/e14eef0719eceef35e7e96b3e3d242ec79a80969/nova/compute/manager.py#L1440


I'm running Stein, but looking into the code, we have the same behaviour in 
master.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1947753

Title:
  Evacuated instances are not removed from the source

Status in OpenStack Compute (nova):
  New

