Re: [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone
On 10/16/2017 11:22 AM, Matt Riedemann wrote:
> This is interesting from the user point of view:
>
> https://bugs.launchpad.net/nova/+bug/1723880
>
> - The user creates an instance in a non-default AZ.
> - They shelve offload the instance.
> - The admin deletes the AZ that the instance was using, for whatever reason.
> - The user unshelves the instance which goes back through scheduling and fails with NoValidHost because the AZ on the original request spec no longer exists.
>
> Now the question is what, if anything, do we do about this bug? Some notes:
>
> 1. How reasonable is it for a user to expect in a stable production environment that AZs are going to be deleted from under them? We actually have a spec related to this but with AZ renames:
>
> https://review.openstack.org/#/c/446446/

I don't think it's reasonable for a user to expect an AZ suddenly gets *deleted* from under them, no. That said, I think it's reasonable for operators to want to *rename* an AZ. And because AZs in Nova aren't really *things* [1], attempting to change the name of an AZ involves a bunch of nasty DB updates (including shadow tables). [2]

> 2. Should we null out the instance.availability_zone when it's shelved offloaded like we do for the instance.host and instance.node attributes? Similarly, we would not take into account the RequestSpec.availability_zone when scheduling during unshelve. I tend to prefer this option because once you shelve offload an instance, it's no longer associated with a host and therefore no longer associated with an AZ. However, is it reasonable to assume that the user doesn't care that the instance, once unshelved, is no longer in the originally requested AZ? Probably not a safe assumption.

Yeah, I don't think this is appropriate.

> 3. When a user unshelves, they can't propose a new AZ (and I don't think we want to add that capability to the unshelve API). So if the original AZ is gone, should we automatically remove the RequestSpec.availability_zone when scheduling? I tend to not like this as it's very implicit and the user could see the AZ on their instance change before and after unshelve and be confused.

I don't think this is something we should add to the public API (for reasons Matt stated in a followup email to Dean). Instead, I think the "rename AZ" functionality should do the needful DB-related tasks to change the instance.availability_zone for shelved instances to the new AZ name...

> 4. We could simply do nothing about this specific bug and assert the behavior is correct. The user requested an instance in a specific AZ, shelved that instance and when they wanted to unshelve it, it's no longer available so it fails. The user would have to delete the instance and create a new instance from the shelve snapshot image in a new AZ. If we implemented Sylvain's spec in #1 above, maybe we don't have this problem going forward since you couldn't remove/delete an AZ when there are even shelved offloaded instances still tied to it.

I think it's reasonable to prevent deletion of an AZ (whatever that actually means... see [1]) when the AZ "has instances in it" (whatever that means... see [1])

Best,
-jay

> Other options?

[1] AZs in Nova are just metadata key/values on aggregates and string values in the instance.availability_zone DB table field that have no FK relationship to said metadata key/values

[2] Note that, as I've said before, the entire concept of an availability zone in Nova/Cinder/Neutron is completely fictional and improperly pretending to be an AWS EC2 availability zone. AZs in Nova pretend to be failure domains. They are not anything of the sort.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone
On 10/16/2017 09:22 AM, Matt Riedemann wrote:
> 2. Should we null out the instance.availability_zone when it's shelved offloaded like we do for the instance.host and instance.node attributes? Similarly, we would not take into account the RequestSpec.availability_zone when scheduling during unshelve. I tend to prefer this option because once you shelve offload an instance, it's no longer associated with a host and therefore no longer associated with an AZ.

This statement isn't true in the case where the user specifically requested a non-default AZ at boot time.

> However, is it reasonable to assume that the user doesn't care that the instance, once unshelved, is no longer in the originally requested AZ? Probably not a safe assumption.

If they didn't request a non-default AZ then I think we could remove it.

> 3. When a user unshelves, they can't propose a new AZ (and I don't think we want to add that capability to the unshelve API). So if the original AZ is gone, should we automatically remove the RequestSpec.availability_zone when scheduling? I tend to not like this as it's very implicit and the user could see the AZ on their instance change before and after unshelve and be confused.

I think allowing the user to specify an AZ on unshelve might be a reasonable option. Or maybe just allow modifying the AZ of a shelved instance without unshelving it via a PUT on /servers/{server_id}.

> 4. We could simply do nothing about this specific bug and assert the behavior is correct. The user requested an instance in a specific AZ, shelved that instance and when they wanted to unshelve it, it's no longer available so it fails. The user would have to delete the instance and create a new instance from the shelve snapshot image in a new AZ.

I'm inclined to feel that this is operator error. If they want to delete an AZ that has shelved instances then they should talk with their customers and update the stored AZ in the DB to a new "valid" one. (Though currently this would require manual DB operations.)

> If we implemented Sylvain's spec in #1 above, maybe we don't have this problem going forward since you couldn't remove/delete an AZ when there are even shelved offloaded instances still tied to it.

I kind of think it would be okay to disallow deleting AZs with shelved instances in them.

Chris
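A rough sketch of the two operator-side ideas in this message: refuse to remove an AZ while instances still name it, and re-point the stored AZ to a new "valid" one. This is a toy in-memory model, not Nova code. Instance, AZInUse, check_az_removable and retarget_az are hypothetical names made up for illustration. Because a shelved offloaded instance has no host, the only link back to the AZ is the stored string, so the check has to scan for that string. A real change would presumably also have to cover RequestSpec.availability_zone.

```python
from dataclasses import dataclass
from typing import Optional


class AZInUse(Exception):
    """Hypothetical error: the AZ is still named by at least one instance."""


@dataclass
class Instance:
    # Toy stand-in for an instance record; not the Nova object.
    uuid: str
    availability_zone: Optional[str]
    host: Optional[str]  # None once the instance is shelved offloaded
    vm_state: str


def instances_pinned_to_az(instances, az):
    """Return every instance whose stored AZ string equals az."""
    return [inst for inst in instances if inst.availability_zone == az]


def check_az_removable(instances, az):
    """Raise AZInUse if any instance, shelved offloaded or not, names az."""
    pinned = instances_pinned_to_az(instances, az)
    if pinned:
        uuids = ", ".join(inst.uuid for inst in pinned)
        raise AZInUse(f"AZ {az!r} is still referenced by: {uuids}")


def retarget_az(instances, old_az, new_az):
    """Point instances stored under old_az at new_az; return how many changed."""
    moved = 0
    for inst in instances_pinned_to_az(instances, old_az):
        inst.availability_zone = new_az
        moved += 1
    return moved
```

With this shape an operator tool would call check_az_removable() first and fall back to retarget_az() only after talking to the affected users.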
Re: [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone
On 10/16/2017 11:00 AM, Dean Troyer wrote:
> [not having a dog in this hunt, this is what I would expect as a cloud consumer]

Thanks for the user perspective, that's what I'm looking for here, and operator perspective of course.

> On Mon, Oct 16, 2017 at 10:22 AM, Matt Riedemann wrote:
>> - The user creates an instance in a non-default AZ.
>> - They shelve offload the instance.
>> - The admin deletes the AZ that the instance was using, for whatever reason.
>> - The user unshelves the instance which goes back through scheduling and fails with NoValidHost because the AZ on the original request spec no longer exists.
>>
>> 1. How reasonable is it for a user to expect in a stable production environment that AZs are going to be deleted from under them? We actually have a spec related to this but with AZ renames:
>
> Change happens...
>
>> 2. Should we null out the instance.availability_zone when it's shelved offloaded like we do for the instance.host and instance.node attributes? Similarly, we would not take into account the RequestSpec.availability_zone when scheduling during unshelve. I tend to prefer this option because once you shelve offload an instance, it's no longer associated with a host and therefore no longer associated with an AZ. However, is it reasonable to assume that the user doesn't care that the instance, once unshelved, is no longer in the originally requested AZ? Probably not a safe assumption.
>
> Agreed, unless we keep track that the user specified a default or no AZ at create.

We do keep track of what the user originally requested, that is this RequestSpec object thing I keep referring to.

> I think nulling the AZ when the original doesn't exist would be reasonable from a user standpoint, but I'd feel handcuffed if that happens and I can not select a new AZ. Or throwing a specific error and letting the user handle it in #3 below:

At the point of failure, the API has done an RPC cast and returned a 202 to the user, so the only way to provide a message like this to the user would be to check if the original AZ still exists in the API. We could do that, it would just be something to be aware of.

>> 3. When a user unshelves, they can't propose a new AZ (and I don't think we want to add that capability to the unshelve API). So if the original AZ is
>
> Here is my question... if I can specify an AZ on create, why not on unshelve? Is it the image location movement under the hood?

I just don't think it's ever come up. The reason I hesitate to add the ability to the unshelve API is more or less rooted in my bias toward not liking shelve/unshelve in general because of how complicated and half-baked it is (we've had a lot of bugs from these APIs, some of which are still unresolved). That's not the user's fault though, so one could argue that if we're not going to deprecate these APIs, we need to make them more robust. We, as developers, also don't have any idea how many users are actually using the shelve API, so it's hard to know if we should spend any time on improving it.

>> gone, should we automatically remove the RequestSpec.availability_zone when scheduling? I tend to not like this as it's very implicit and the user could see the AZ on their instance change before and after unshelve and be confused.
>
> Agreed that explicit is better than implicit.
>
>> 4. We could simply do nothing about this specific bug and assert the behavior is correct. The user requested an instance in a specific AZ, shelved that instance and when they wanted to unshelve it, it's no longer available so it fails. The user would have to delete the instance and create a new instance from the shelve snapshot image in a new AZ. If we implemented
>
> I do not have the list of things in my head that are preserved in shelve/unshelve that would be lost in a recreate, but that's where my worry would come. Presumably that is why I shelved in the first place rather than snapshotting the server and removing it. Depends on the cost models too, if I lose my grandfathered-in pricing by being forced to recreate I may be unhappy.

The volumes and ports remain attached to the shelved instance, only the guest on the hypervisor is destroyed. It doesn't change anything about quota - you retain quota usage for a shelved instance so you have room in your quota to unshelve it later. From what I can tell, the os-simple-tenant-usage API will still count the instance and its consumed disk/ram/cpu against you even though the guest is deleted from the hypervisor while the instance is shelved offloaded. So the operator is happy about shelved offloaded instances because that means they have more free capacity for new instances and moving things, but the user is still getting charged the same, if your billing model is based on os-simple-tenant-usage (which Telemetry uses I believe).

>> Sylvain's spec in #1 above, maybe we don't have this problem going forward since you couldn't remove/delete an AZ when there
Re: [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone
[not having a dog in this hunt, this is what I would expect as a cloud consumer]

On Mon, Oct 16, 2017 at 10:22 AM, Matt Riedemann wrote:
> - The user creates an instance in a non-default AZ.
> - They shelve offload the instance.
> - The admin deletes the AZ that the instance was using, for whatever reason.
> - The user unshelves the instance which goes back through scheduling and
> fails with NoValidHost because the AZ on the original request spec no longer
> exists.

> 1. How reasonable is it for a user to expect in a stable production
> environment that AZs are going to be deleted from under them? We actually
> have a spec related to this but with AZ renames:

Change happens...

> 2. Should we null out the instance.availability_zone when it's shelved
> offloaded like we do for the instance.host and instance.node attributes?
> Similarly, we would not take into account the RequestSpec.availability_zone
> when scheduling during unshelve. I tend to prefer this option because once
> you shelve offload an instance, it's no longer associated with a host and
> therefore no longer associated with an AZ. However, is it reasonable to
> assume that the user doesn't care that the instance, once unshelved, is no
> longer in the originally requested AZ? Probably not a safe assumption.

Agreed, unless we keep track that the user specified a default or no AZ at create. I think nulling the AZ when the original doesn't exist would be reasonable from a user standpoint, but I'd feel handcuffed if that happens and I can not select a new AZ. Or throwing a specific error and letting the user handle it in #3 below:

> 3. When a user unshelves, they can't propose a new AZ (and I don't think we
> want to add that capability to the unshelve API). So if the original AZ is

Here is my question... if I can specify an AZ on create, why not on unshelve? Is it the image location movement under the hood?

> gone, should we automatically remove the RequestSpec.availability_zone when
> scheduling? I tend to not like this as it's very implicit and the user could
> see the AZ on their instance change before and after unshelve and be
> confused.

Agreed that explicit is better than implicit.

> 4. We could simply do nothing about this specific bug and assert the
> behavior is correct. The user requested an instance in a specific AZ,
> shelved that instance and when they wanted to unshelve it, it's no longer
> available so it fails. The user would have to delete the instance and create
> a new instance from the shelve snapshot image in a new AZ. If we implemented

I do not have the list of things in my head that are preserved in shelve/unshelve that would be lost in a recreate, but that's where my worry would come. Presumably that is why I shelved in the first place rather than snapshotting the server and removing it. Depends on the cost models too, if I lose my grandfathered-in pricing by being forced to recreate I may be unhappy.

> Sylvain's spec in #1 above, maybe we don't have this problem going forward
> since you couldn't remove/delete an AZ when there are even shelved offloaded
> instances still tied to it.

As a user I probably do not mind this, as an operator I'd likely be unhappy.

dt

--
Dean Troyer
dtro...@gmail.com
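For illustration, the "specific error" idea could look roughly like the sketch below. As the reply to this message points out, the API has already done its RPC cast and returned 202 by the time scheduling fails, so the check has to run synchronously before the cast. Everything here is a toy model: AvailabilityZoneGone, unshelve_request, and the cast callable are hypothetical names, and how the error would map to an HTTP response is left open.

```python
class AvailabilityZoneGone(Exception):
    """Hypothetical error: the AZ stored for the instance no longer exists."""


def unshelve_request(requested_az, existing_azs, cast):
    """Validate the stored AZ up front, then hand off asynchronously.

    requested_az: AZ recorded in the original request (None if not pinned).
    existing_azs: set of AZ names that currently exist.
    cast: callable standing in for the fire-and-forget RPC cast.
    """
    if requested_az is not None and requested_az not in existing_azs:
        # Fail while the caller is still waiting, instead of later in
        # scheduling where only a generic NoValidHost would be recorded.
        raise AvailabilityZoneGone(requested_az)
    cast(requested_az)
    return 202  # accepted; the rest of the unshelve happens asynchronously
```

The trade-off is an extra lookup on every unshelve, and the check can still race with an AZ that is removed between the check and the scheduling step.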
[openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone
This is interesting from the user point of view:

https://bugs.launchpad.net/nova/+bug/1723880

- The user creates an instance in a non-default AZ.
- They shelve offload the instance.
- The admin deletes the AZ that the instance was using, for whatever reason.
- The user unshelves the instance which goes back through scheduling and fails with NoValidHost because the AZ on the original request spec no longer exists.

Now the question is what, if anything, do we do about this bug? Some notes:

1. How reasonable is it for a user to expect in a stable production environment that AZs are going to be deleted from under them? We actually have a spec related to this but with AZ renames:

https://review.openstack.org/#/c/446446/

2. Should we null out the instance.availability_zone when it's shelved offloaded like we do for the instance.host and instance.node attributes? Similarly, we would not take into account the RequestSpec.availability_zone when scheduling during unshelve. I tend to prefer this option because once you shelve offload an instance, it's no longer associated with a host and therefore no longer associated with an AZ. However, is it reasonable to assume that the user doesn't care that the instance, once unshelved, is no longer in the originally requested AZ? Probably not a safe assumption.

3. When a user unshelves, they can't propose a new AZ (and I don't think we want to add that capability to the unshelve API). So if the original AZ is gone, should we automatically remove the RequestSpec.availability_zone when scheduling? I tend to not like this as it's very implicit and the user could see the AZ on their instance change before and after unshelve and be confused.

4. We could simply do nothing about this specific bug and assert the behavior is correct. The user requested an instance in a specific AZ, shelved that instance and when they wanted to unshelve it, it's no longer available so it fails. The user would have to delete the instance and create a new instance from the shelve snapshot image in a new AZ. If we implemented Sylvain's spec in #1 above, maybe we don't have this problem going forward since you couldn't remove/delete an AZ when there are even shelved offloaded instances still tied to it.

Other options?

--
Thanks,
Matt
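To make the failure sequence above concrete, here is a toy model of it. It is not Nova's scheduler. It assumes, as noted elsewhere in this thread, that an AZ is only a metadata key on a host aggregate, and Aggregate, hosts_in_az and schedule are hypothetical names. Only the NoValidHost name is borrowed from the bug report.

```python
from dataclasses import dataclass, field


class NoValidHost(Exception):
    """Toy stand-in for the scheduling failure named in the bug report."""


@dataclass
class Aggregate:
    # In this model an AZ exists only as a metadata key on an aggregate.
    hosts: set
    metadata: dict = field(default_factory=dict)


def hosts_in_az(aggregates, az):
    """Return the hosts of every aggregate tagged with availability_zone == az."""
    found = set()
    for agg in aggregates:
        if agg.metadata.get("availability_zone") == az:
            found |= agg.hosts
    return found


def schedule(aggregates, all_hosts, requested_az):
    """Pick a host, restricted to requested_az when the request names one."""
    if requested_az is None:
        candidates = set(all_hosts)
    else:
        candidates = hosts_in_az(aggregates, requested_az)
    if not candidates:
        raise NoValidHost(f"no hosts available in AZ {requested_az!r}")
    return sorted(candidates)[0]


def demo():
    """Replay the four steps: boot in az1, shelve, delete az1, unshelve."""
    agg = Aggregate(hosts={"host1"}, metadata={"availability_zone": "az1"})
    all_hosts = ["host1", "host2"]
    stored_az = "az1"  # kept from the original request while shelved
    first = schedule([agg], all_hosts, stored_az)
    del agg.metadata["availability_zone"]  # the admin "deletes" the AZ
    try:
        schedule([agg], all_hosts, stored_az)
    except NoValidHost:
        return first, "NoValidHost"
    return first, "scheduled"
```

In this model, passing requested_az=None on unshelve (option 3) would succeed, but the instance could land on any host, which is the implicit AZ change option 3 worries about.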