I'm happy the main root cause (deleting the source disks) is fixed.
To be clear, you can configure Nova to resume guest state on compute
service restarts with this flag:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot
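For example, in nova.conf on each compute host:

    [DEFAULT]
    # Bring guests that were running back up when the compute
    # service starts (the default is False).
    resume_guests_state_on_host_boot = True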
Closing the bug.
** Changed in: nova
Status: New => Won't Fix
https://bugs.launchpad.net/bugs/1738297
Title:
Nova Destroys Local Disks for Instance with Broken iSCSI Connection to
Cinder Volume Upon Resume from Suspend
Status in OpenStack Compute (nova):
Won't Fix
Bug description:
Background: Libvirt + KVM cloud running Newton (but the relevant code
appears the same on master). Earlier this week we had some issues with
a Cinder storage server (it uses LVM+iSCSI). The tgt service was
consuming 100% CPU (after running for several months) and compute
nodes lost their iSCSI connections. I had to restart tgt, the
cinder-volume service, and a number of compute hosts and instances.
Today, a user tried resuming their instance, which was suspended
before the aforementioned trouble. (Note: this instance has root and
ephemeral disks stored locally, and a third disk on shared Cinder
storage.) It appears (per the logs linked below) that the iSCSI
connection from the compute host to the Cinder storage server was
broken/missing, and because of this, Nova apparently "cleaned up" the
instance, including *destroying its disk files*. The instance is now
in an error state.
nova-compute.log: http://paste.openstack.org/show/628991/
/var/log/syslog: http://paste.openstack.org/show/628992/
Based on the log messages ("Deleting instance files" and "Deletion of
/var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del complete"), it
appears that we ended up in this function, `delete_instance_files`:
https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
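Roughly, that function does the following (my paraphrase of the linked
code, not the exact implementation):

    import os
    import shutil

    def delete_instance_files(instance_path):
        # Paraphrase of nova/virt/libvirt/driver.py; the real code logs
        # "Deleting instance files ...", retries the removal, and
        # handles in-progress resizes.
        renamed = instance_path + '_del'  # matches the "_del" path in our logs
        if os.path.exists(instance_path):
            os.rename(instance_path, renamed)
        shutil.rmtree(renamed, ignore_errors=True)
        # On success the driver logs "Deletion of <path>_del complete".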
A trace wasn't logged for this, but I'm guessing we got here from the
`cleanup` function:
https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
One of `cleanup`'s parameters, `destroy_disks`, defaults to `True`, so
I'm guessing it was called with the default or the value was not
overridden.
(Someone, please correct me if the available data suggest otherwise!)
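If that guess is right, the shape of the code path is roughly this (a
simplified sketch based on the linked driver, with the parameter list
abridged; not the exact Nova code):

    def cleanup(self, context, instance, network_info,
                block_device_info=None, destroy_disks=True):
        # ... undefine the domain, disconnect volumes, etc. ...
        if destroy_disks:
            # The branch we suspect removed the local disks.
            self.delete_instance_files(instance)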
Nobody requested a Delete action, so this appears to be Nova deciding
to destroy an instance's local disks after encountering an otherwise-
unhandled exception related to the iSCSI device being unavailable. I
will try to reproduce and update the bug if successful.
For us, losing an instance's data is a Problem -- our users
(scientists) often store unique data on instances that are configured
by hand. If an instance cannot be resumed, I would much rather Nova
leave the instance's disks intact for investigation / data recovery,
instead of throwing everything out. For deployments whose instances
may contain important data, could this behavior be made configurable?
Perhaps "destroy_disks_on_failed_resume = False" in nova.conf?
Thank you!
Chris Martin
(P.S. This is really a Cinder question, but someone here may know: is
there anything that can or should be done to re-initialize iSCSI
connections between compute nodes and a Cinder storage server after
the iSCSI target service on that server has failed and recovered?)