Re: [openstack-dev] [nova] periodic task
On 8/31/15, 9:22 PM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
> On 8/27/2015 1:22 AM, Gary Kotton wrote:
>> On 8/25/15, 2:43 PM, "Andrew Laski" <and...@lascii.com> wrote:
>>> On 08/25/15 at 06:08pm, Gary Kotton wrote:
>>>> On 8/25/15, 9:10 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>>>> On 8/25/2015 10:03 AM, Gary Kotton wrote:
>>>>>> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>>>>>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>>>>>>> In item #2 below the reboot is done via the guest and not the nova API :)
>>>>>>>>
>>>>>>>> From: Gary Kotton <gkot...@vmware.com>
>>>>>>>> Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
>>>>>>>> Date: Monday, August 24, 2015 at 7:18 PM
>>>>>>>> To: OpenStack List <openstack-dev@lists.openstack.org>
>>>>>>>> Subject: [openstack-dev] [nova] periodic task
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> A couple of months ago I posted a patch for bug
>>>>>>>> https://launchpad.net/bugs/1463688. The issue is as follows: the
>>>>>>>> periodic task detects that the instance state does not match the
>>>>>>>> state on the hypervisor and it shuts down the running VM. There are
>>>>>>>> a number of ways that this may happen and I will try to explain:
>>>>>>>>
>>>>>>>> 1. VMware driver example: a host where the instances are running
>>>>>>>>    goes down. This could be a power outage, host failure, etc. The
>>>>>>>>    first iteration of the periodic task will determine that the
>>>>>>>>    actual instance is down. This will update the state of the
>>>>>>>>    instance to DOWN. The VC has the ability to do HA and it will
>>>>>>>>    start the instance up and running again. The next iteration of
>>>>>>>>    the periodic task will determine that the instance is up and the
>>>>>>>>    compute manager will stop the instance.
>>>>>>>> 2. All drivers: the tenant decides to do a reboot of the instance
>>>>>>>>    and that coincides with the periodic task state validation. At
>>>>>>>>    this point in time the instance will not be up and the compute
>>>>>>>>    node will update the state of the instance as DOWN. In the next
>>>>>>>>    iteration the states will differ and the instance will be shut
>>>>>>>>    down.
>>>>>>>>
>>>>>>>> Basically the issue hit us with our CI and there was no CI running
>>>>>>>> for a couple of hours because the compute node decided to shut down
>>>>>>>> the running instances. The hypervisor should be the source of truth
>>>>>>>> and it should not be the compute node that decides to shut down
>>>>>>>> instances. I posted a patch to deal with this,
>>>>>>>> https://review.openstack.org/#/c/190047/, which is the reason for
>>>>>>>> this mail.
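The two-pass sequence from scenario 1 can be sketched as a small simulation. This is only an illustration of the race as described in the mail, not nova's actual code: the `Instance` and `FakeHypervisor` classes, the state names, and `sync_power_state` are all hypothetical stand-ins.

```python
from dataclasses import dataclass

RUNNING = "running"
SHUTDOWN = "shutdown"


@dataclass
class Instance:
    """Toy stand-in for a database record; not nova's Instance object."""
    name: str
    db_power_state: str


class FakeHypervisor:
    """Toy hypervisor that records stop requests (hypothetical)."""

    def __init__(self, power_state):
        self.power_state = power_state
        self.stop_calls = 0

    def stop(self):
        self.stop_calls += 1
        self.power_state = SHUTDOWN


def sync_power_state(instance, hypervisor):
    """One pass of a simplified periodic sync (assumed logic)."""
    actual = hypervisor.power_state
    if instance.db_power_state == RUNNING and actual == SHUTDOWN:
        # Pass 1: the VM is seen as down, so the record is updated.
        instance.db_power_state = SHUTDOWN
    elif instance.db_power_state == SHUTDOWN and actual == RUNNING:
        # Pass 2: the record says "down" but the VM is up, so it is stopped.
        hypervisor.stop()


def simulate_ha_restart():
    """Replay scenario 1: host failure, sync, HA restart, sync."""
    instance = Instance("vm-1", RUNNING)
    hypervisor = FakeHypervisor(RUNNING)
    hypervisor.power_state = SHUTDOWN  # the host goes down
    sync_power_state(instance, hypervisor)
    hypervisor.power_state = RUNNING   # HA brings the VM back up
    sync_power_state(instance, hypervisor)
    return instance, hypervisor
```

Running `simulate_ha_restart()` leaves the VM stopped even though HA had just restarted it, which is the behaviour the bug report objects to.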
Re: [openstack-dev] [nova] periodic task
On 08/25/2015 08:04 AM, Matt Riedemann wrote:
> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>> In item #2 below the reboot is done via the guest and not the nova API :)
>>
>> From: Gary Kotton <gkot...@vmware.com>
>> Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
>> Date: Monday, August 24, 2015 at 7:18 PM
>> To: OpenStack List <openstack-dev@lists.openstack.org>
>> Subject: [openstack-dev] [nova] periodic task
>>
>> Hi,
>> A couple of months ago I posted a patch for bug
>> https://launchpad.net/bugs/1463688. The issue is as follows: the
>> periodic task detects that the instance state does not match the
>> state on the hypervisor and it shuts down the running VM. There are
>> a number of ways that this may happen and I will try to explain:
>>
>> 1. VMware driver example: a host where the instances are running
>>    goes down. This could be a power outage, host failure, etc. The
>>    first iteration of the periodic task will determine that the
>>    actual instance is down. This will update the state of the
>>    instance to DOWN. The VC has the ability to do HA and it will
>>    start the instance up and running again. The next iteration of
>>    the periodic task will determine that the instance is up and the
>>    compute manager will stop the instance.
>> 2. All drivers: the tenant decides to do a reboot of the instance
>>    and that coincides with the periodic task state validation. At
>>    this point in time the instance will not be up and the compute
>>    node will update the state of the instance as DOWN. In the next
>>    iteration the states will differ and the instance will be shut
>>    down.
>
> In #2 the guest shouldn't be rebooted by the user (tenant) outside of
> the nova-api. I'm not sure if it's formally documented in the nova
> documentation, but from what I've always heard/known, nova is the
> control plane and you should be doing everything with your instances
> via the nova-api. If the user rebooted via nova-api, the task_state
> would be set and the periodic task would ignore the instance.

If we're talking about the guest rebooting itself (i.e. someone issuing
a "reboot" from within the guest) then I think we should be able to
handle that. Guests might want to reboot for any number of reasons, and
there's no reason to force every guest to require access to the nova API
in order to reboot.

If we're talking about someone logging onto a compute node and running a
virsh command (or similar) then I agree, that sort of thing should be
done via nova.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
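The task_state guard mentioned above can be illustrated with a short sketch. It assumes an instance is represented as a plain dict with hypothetical `task_state` and `power_state` keys; the function names are made up for the example and are not nova's API.

```python
from typing import Optional


def should_sync(task_state: Optional[str]) -> bool:
    """An instance with a task in flight (e.g. a reboot) is left alone."""
    return task_state is None


def sync_if_idle(instance: dict, actual_power_state: str) -> str:
    """Reconcile the recorded power state unless an API task is pending."""
    if not should_sync(instance["task_state"]):
        # A reboot issued through the API sets task_state, so the
        # periodic pass skips the instance instead of acting on it.
        return "skipped"
    if instance["power_state"] != actual_power_state:
        instance["power_state"] = actual_power_state
        return "updated"
    return "in sync"
```

A reboot issued from inside the guest never sets `task_state`, so in this sketch it would fall through to the "updated" branch, which is the gap Chris points out.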
Re: [openstack-dev] [nova] periodic task
On 8/27/2015 1:22 AM, Gary Kotton wrote:
> On 8/25/15, 2:43 PM, "Andrew Laski" <and...@lascii.com> wrote:
>> On 08/25/15 at 06:08pm, Gary Kotton wrote:
>>> On 8/25/15, 9:10 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>>> On 8/25/2015 10:03 AM, Gary Kotton wrote:
>>>>> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>>>>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>>>>>> In item #2 below the reboot is done via the guest and not the nova API :)
>>>>>>>
>>>>>>> From: Gary Kotton <gkot...@vmware.com>
>>>>>>> Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
>>>>>>> Date: Monday, August 24, 2015 at 7:18 PM
>>>>>>> To: OpenStack List <openstack-dev@lists.openstack.org>
>>>>>>> Subject: [openstack-dev] [nova] periodic task
>>>>>>>
>>>>>>> Hi,
>>>>>>> A couple of months ago I posted a patch for bug
>>>>>>> https://launchpad.net/bugs/1463688. The issue is as follows: the
>>>>>>> periodic task detects that the instance state does not match the
>>>>>>> state on the hypervisor and it shuts down the running VM. There are
>>>>>>> a number of ways that this may happen and I will try to explain:
>>>>>>>
>>>>>>> 1. VMware driver example: a host where the instances are running
>>>>>>>    goes down. This could be a power outage, host failure, etc. The
>>>>>>>    first iteration of the periodic task will determine that the
>>>>>>>    actual instance is down. This will update the state of the
>>>>>>>    instance to DOWN. The VC has the ability to do HA and it will
>>>>>>>    start the instance up and running again. The next iteration of
>>>>>>>    the periodic task will determine that the instance is up and the
>>>>>>>    compute manager will stop the instance.
>>>>>>> 2. All drivers: the tenant decides to do a reboot of the instance
>>>>>>>    and that coincides with the periodic task state validation. At
>>>>>>>    this point in time the instance will not be up and the compute
>>>>>>>    node will update the state of the instance as DOWN. In the next
>>>>>>>    iteration the states will differ and the instance will be shut
>>>>>>>    down.
>>>>>>>
>>>>>>> Basically the issue hit us with our CI and there was no CI running
>>>>>>> for a couple of hours because the compute node decided to shut down
>>>>>>> the running instances. The hypervisor should be the source of truth
>>>>>>> and it should not be the compute node that decides to shut down
>>>>>>> instances. I posted a patch to deal with this,
>>>>>>> https://review.openstack.org/#/c/190047/, which is the reason for
>>>>>>> this mail. The patch is backwards compatible: existing deployments,
>>>>>>> random shutdowns included, continue to work as they do today, and
>>>>>>> the admin now has the ability to just log when there is an
>>>>>>> inconsistency.
>>>>>>>
>>>>>>> We do not want to disable the periodic task, as knowing the current
>>>>>>> state of the instance is very important and has a ton of value; we
>>>>>>> just do not want the periodic task to shut
Re: [openstack-dev] [nova] periodic task
On 8/25/15, 2:43 PM, "Andrew Laski" <and...@lascii.com> wrote:
> On 08/25/15 at 06:08pm, Gary Kotton wrote:
>> On 8/25/15, 9:10 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>> On 8/25/2015 10:03 AM, Gary Kotton wrote:
>>>> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>>>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>>>>> In item #2 below the reboot is done via the guest and not the nova API :)
>>>>>
>>>>> In #2 the guest shouldn't be rebooted by the user (tenant) outside
>>>>> of the nova-api. I'm not sure if it's formally documented in the
>>>>> nova documentation, but from what I've always heard/known, nova is
>>>>> the control plane and you should be doing everything with your
>>>>> instances via the nova-api. If the user rebooted via nova-api, the
>>>>> task_state would be set and the periodic task would ignore the
>>>>> instance.
>>>>
>>>> Matt, this is one case that I showed where the problem occurs. There
>>>> are others and I can invest time to look into them. The fact that the
>>>> periodic task is there is important. What I don't understand is why
>>>> an option that only logs the mismatch for the admin is considered not
>>>> useful, and why we instead have the compute node shut down an
>>>> instance when this should not happen. Our infrastructure is behaving
>>>> like cattle. That should not be the case and the hypervisor should be
>>>> the source of truth.
>>>>
>>>> This is a serious issue and instances in production can and will go
>>>> down.
>>>>
>>>>> --
>>>>> Thanks,
>>>>> Matt Riedemann
>>>
>>> For the HA case #1, the periodic task compares instance.host with the
>>> compute service host [1] and skips the instance if they don't match.
>>> Shouldn't your HA scenario be updating which host the instance is
>>> running on? Or is this a vCenter-ism?
>>
>> The nova compute node has not changed
Re: [openstack-dev] [nova] periodic task
On 8/24/2015 9:32 PM, Gary Kotton wrote:
> In item #2 below the reboot is done via the guest and not the nova API :)
>
> From: Gary Kotton <gkot...@vmware.com>
> Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
> Date: Monday, August 24, 2015 at 7:18 PM
> To: OpenStack List <openstack-dev@lists.openstack.org>
> Subject: [openstack-dev] [nova] periodic task
>
> Hi,
> A couple of months ago I posted a patch for bug
> https://launchpad.net/bugs/1463688. The issue is as follows: the
> periodic task detects that the instance state does not match the
> state on the hypervisor and it shuts down the running VM. There are
> a number of ways that this may happen and I will try to explain:
>
> 1. VMware driver example: a host where the instances are running
>    goes down. This could be a power outage, host failure, etc. The
>    first iteration of the periodic task will determine that the
>    actual instance is down. This will update the state of the
>    instance to DOWN. The VC has the ability to do HA and it will
>    start the instance up and running again. The next iteration of
>    the periodic task will determine that the instance is up and the
>    compute manager will stop the instance.
> 2. All drivers: the tenant decides to do a reboot of the instance
>    and that coincides with the periodic task state validation. At
>    this point in time the instance will not be up and the compute
>    node will update the state of the instance as DOWN. In the next
>    iteration the states will differ and the instance will be shut
>    down.
>
> Basically the issue hit us with our CI and there was no CI running
> for a couple of hours because the compute node decided to shut down
> the running instances. The hypervisor should be the source of truth
> and it should not be the compute node that decides to shut down
> instances. I posted a patch to deal with this,
> https://review.openstack.org/#/c/190047/, which is the reason for
> this mail. The patch is backwards compatible: existing deployments,
> random shutdowns included, continue to work as they do today, and
> the admin now has the ability to just log when there is an
> inconsistency.
>
> We do not want to disable the periodic task, as knowing the current
> state of the instance is very important and has a ton of value; we
> just do not want the periodic task to shut down a running instance.
>
> Thanks
> Gary

In #2 the guest shouldn't be rebooted by the user (tenant) outside of
the nova-api. I'm not sure if it's formally documented in the nova
documentation, but from what I've always heard/known, nova is the
control plane and you should be doing everything with your instances via
the nova-api. If the user rebooted via nova-api, the task_state would be
set and the periodic task would ignore the instance.

--
Thanks,
Matt Riedemann
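The backwards-compatible, log-only behaviour described in the original mail might look roughly like the following. This is a sketch under stated assumptions: the `log_only` flag, the `handle_mismatch` function, and the state strings are hypothetical and are not taken from the actual patch or from nova's configuration options.

```python
import logging

LOG = logging.getLogger("power_sync_sketch")


def handle_mismatch(instance_name, db_state, actual_state,
                    stop_instance, log_only=False):
    """React to a record that says 'shutdown' while the VM is running.

    log_only is a hypothetical switch: False keeps the existing
    behaviour (stop the VM), True only reports the inconsistency.
    """
    if not (db_state == "shutdown" and actual_state == "running"):
        return "noop"
    if log_only:
        LOG.warning("Instance %s is running on the hypervisor but "
                    "recorded as shut down; leaving it alone.",
                    instance_name)
        return "logged"
    stop_instance(instance_name)
    return "stopped"
```

Defaulting `log_only` to `False` is what would keep existing deployments unchanged, in line with the backwards-compatibility point made in the mail.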
Re: [openstack-dev] [nova] periodic task
On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>> In item #2 below the reboot is done via the guest and not the nova API :)
>
> In #2 the guest shouldn't be rebooted by the user (tenant) outside of
> the nova-api. I'm not sure if it's formally documented in the nova
> documentation, but from what I've always heard/known, nova is the
> control plane and you should be doing everything with your instances
> via the nova-api. If the user rebooted via nova-api, the task_state
> would be set and the periodic task would ignore the instance.

Matt, this is one case that I showed where the problem occurs. There are
others and I can invest time to look into them. The fact that the
periodic task is there is important. What I don't understand is why an
option that only logs the mismatch for the admin is considered not
useful, and why we instead have the compute node shut down an instance
when this should not happen. Our infrastructure is behaving like cattle.
That should not be the case and the hypervisor should be the source of
truth.

This is a serious issue and instances in production can and will go
down.

> --
> Thanks,
> Matt Riedemann
Re: [openstack-dev] [nova] periodic task
On 8/25/2015 10:03 AM, Gary Kotton wrote:
> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>> In item #2 below the reboot is done via the guest and not the nova API :)
>>
>> In #2 the guest shouldn't be rebooted by the user (tenant) outside of
>> the nova-api. I'm not sure if it's formally documented in the nova
>> documentation, but from what I've always heard/known, nova is the
>> control plane and you should be doing everything with your instances
>> via the nova-api. If the user rebooted via nova-api, the task_state
>> would be set and the periodic task would ignore the instance.
>
> Matt, this is one case that I showed where the problem occurs. There
> are others and I can invest time to look into them. The fact that the
> periodic task is there is important. What I don't understand is why an
> option that only logs the mismatch for the admin is considered not
> useful, and why we instead have the compute node shut down an instance
> when this should not happen. Our infrastructure is behaving like
> cattle. That should not be the case and the hypervisor should be the
> source of truth.
>
> This is a serious issue and instances in production can and will go
> down.
>
>> --
>> Thanks,
>> Matt Riedemann

For the HA case #1, the periodic task compares instance.host with the
compute service host [1] and skips the instance if they don't match.
Shouldn't your HA scenario be updating which host the instance is
running on? Or is this a vCenter-ism?

[1] http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n5871

--
Thanks,
Matt Riedemann
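The host check described for the HA case can be sketched as a simple filter. The dict layout and the `instances_to_sync` name are assumptions made for illustration; the real check lives in the nova compute manager referenced at [1].

```python
def instances_to_sync(instances, service_host):
    """Keep only instances whose recorded host is this compute service.

    Instances recorded against another host are skipped, so a VM that
    nova knows has moved elsewhere is not touched by this service.
    """
    return [inst for inst in instances if inst["host"] == service_host]


# Hypothetical data: two instances recorded on different compute services.
example = [
    {"name": "vm-a", "host": "compute-1"},
    {"name": "vm-b", "host": "compute-2"},
]
mine = instances_to_sync(example, "compute-1")
```

The filter only helps when a move changes the recorded host. If a VM is restarted elsewhere behind the same compute service, the recorded host stays the same and the instance still passes the check.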
Re: [openstack-dev] [nova] periodic task
On 8/25/15, 9:10 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
> On 8/25/2015 10:03 AM, Gary Kotton wrote:
>> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:
>>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>>> In item #2 below the reboot is done via the guest and not the nova API :)
>>>
>>> In #2 the guest shouldn't be rebooted by the user (tenant) outside of
>>> the nova-api. I'm not sure if it's formally documented in the nova
>>> documentation, but from what I've always heard/known, nova is the
>>> control plane and you should be doing everything with your instances
>>> via the nova-api. If the user rebooted via nova-api, the task_state
>>> would be set and the periodic task would ignore the instance.
>>
>> Matt, this is one case that I showed where the problem occurs. There
>> are others and I can invest time to look into them. The fact that the
>> periodic task is there is important. What I don't understand is why an
>> option that only logs the mismatch for the admin is considered not
>> useful, and why we instead have the compute node shut down an instance
>> when this should not happen. Our infrastructure is behaving like
>> cattle. That should not be the case and the hypervisor should be the
>> source of truth.
>>
>> This is a serious issue and instances in production can and will go
>> down.
>>
>>> --
>>> Thanks,
>>> Matt Riedemann
>
> For the HA case #1, the periodic task compares instance.host with the
> compute service host [1] and skips the instance if they don't match.
> Shouldn't your HA scenario be updating which host the instance is
> running on? Or is this a vCenter-ism?

The nova compute node has not changed. It is not the compute node's
host. The host that the instance was running on was down and those
instances were moved
Re: [openstack-dev] [nova] periodic task
On 08/25/15 at 06:08pm, Gary Kotton wrote: On 8/25/15, 9:10 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote: On 8/25/2015 10:03 AM, Gary Kotton wrote: On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote: On 8/24/2015 9:32 PM, Gary Kotton wrote: In item #2 below the reboot is down via the guest and not the nova api¹s :) From: Gary Kotton gkot...@vmware.com mailto:gkot...@vmware.com Reply-To: OpenStack List openstack-dev@lists.openstack.org mailto:openstack-dev@lists.openstack.org Date: Monday, August 24, 2015 at 7:18 PM To: OpenStack List openstack-dev@lists.openstack.org mailto:openstack-dev@lists.openstack.org Subject: [openstack-dev] [nova] periodic task Hi, A couple of months ago I posted a patch for bug https://launchpad.net/bugs/1463688. The issue is as follows: the periodic task detects that the instance state does not match the state on the hypervisor and it shuts down the running VM. There are a number of ways that this may happen and I will try and explain: 1. Vmware driver example: a host where the instances are running goes down. This could be a power outage, host failure, etc. The first iteration of the perdioc task will determine that the actual instacne is down. This will update the state of the instance to DOWN. The VC has the ability to do HA and it will start the instance up and running again. The next iteration of the periodic task will determine that the instance is up and the compute manager will stop the instance. 2. All drivers. The tenant decides to do a reboot of the instance and that coincides with the periodic task state validation. At this point in time the instance will not be up and the compute node will update the state of the instance as DWON. Next iteration the states will differ and the instance will be shutdown Basically the issue hit us with our CI and there was no CI running for a couple of hours due to the fact that the compute node decided to shutdown the running instances. 
The hypervisor should be the source of truth and it should not be the compute node that decides to shut down instances. I posted a patch to deal with this, https://review.openstack.org/#/c/190047/, which is the reason for this mail. The patch is backwards compatible: existing deployments keep today's behavior, random shutdowns included, and the admin now has the ability to just log when there is an inconsistency. We do not want to disable the periodic task, as knowing the current state of the instance is very important and has a ton of value; we just do not want the periodic task to shut down a running instance.

Thanks
Gary

OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

In #2 the guest shouldn't be rebooted by the user (tenant) outside of the nova-api. I'm not sure if it's actually formally documented in the nova documentation, but from what I've always heard/known, nova is the control plane and you should be doing everything with your instances via the nova-api. If the user rebooted via nova-api, the task_state would be set and the periodic task would ignore the instance.

Matt, this is one case that I showed where the problem occurs. There are others and I can invest time to see them. The fact that the periodic task is there is important. What I don't understand is why a log-only option for the admin is not considered useful, and why we instead have the compute node shut down instances when this should not happen. Our infrastructure is behaving like cattle. That should not be the case and the hypervisor should be the source of truth. This is a serious issue and instances in production can and will go down.
--
Thanks,
Matt Riedemann

For the HA case #1, the periodic task checks to see if the instance.host doesn't match the compute service host [1] and skips if they don't match. Shouldn't your HA scenario be updating which host the instance is running on? Or is this a vCenter-ism?

The nova compute node has not changed. It is not the compute node's host. The host that the instance was running on was down and those instances were moved. So this is a case
[openstack-dev] [nova] periodic task
Hi,

A couple of months ago I posted a patch for bug https://launchpad.net/bugs/1463688. The issue is as follows: the periodic task detects that the instance state does not match the state on the hypervisor and it shuts down the running VM. There are a number of ways that this may happen and I will try to explain:

1. VMware driver example: a host where the instances are running goes down. This could be a power outage, host failure, etc. The first iteration of the periodic task will determine that the actual instance is down. This will update the state of the instance to DOWN. The VC has the ability to do HA and it will start the instance up and running again. The next iteration of the periodic task will determine that the instance is up and the compute manager will stop the instance.
2. All drivers: the tenant decides to do a reboot of the instance and that coincides with the periodic task state validation. At this point in time the instance will not be up and the compute node will update the state of the instance as DOWN. On the next iteration the states will differ and the instance will be shut down.

Basically the issue hit us with our CI and there was no CI running for a couple of hours due to the fact that the compute node decided to shut down the running instances. The hypervisor should be the source of truth and it should not be the compute node that decides to shut down instances. I posted a patch to deal with this, https://review.openstack.org/#/c/190047/, which is the reason for this mail.

The patch is backwards compatible: existing deployments keep today's behavior, random shutdowns included, and the admin now has the ability to just log when there is an inconsistency. We do not want to disable the periodic task, as knowing the current state of the instance is very important and has a ton of value; we just do not want the periodic task to shut down a running instance.
Thanks
Gary
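The behavior the patch is described as adding can be pictured with a small sketch. This is only an illustration under assumptions: the names `Power`, `Action`, `reconcile` and the `log_only` flag are hypothetical and are not taken from nova or from the patch under review; the real periodic task handles more states than the two shown here.

```python
from enum import Enum
import logging

LOG = logging.getLogger(__name__)


class Power(Enum):
    RUNNING = "running"
    SHUTDOWN = "shutdown"


class Action(Enum):
    NONE = "none"
    MARK_STOPPED = "mark_stopped"  # update the database record only
    STOP_VM = "stop_vm"            # power the VM off on the hypervisor
    LOG_ONLY = "log_only"          # report the mismatch, leave the VM alone


def reconcile(db_state: Power, hv_state: Power, log_only: bool) -> Action:
    """Decide what one periodic iteration does for one instance (hypothetical)."""
    if db_state == hv_state:
        return Action.NONE
    if db_state is Power.RUNNING and hv_state is Power.SHUTDOWN:
        # Database says running, hypervisor says down: record it as stopped.
        return Action.MARK_STOPPED
    # Remaining case: database says stopped, hypervisor says running.
    if log_only:
        LOG.warning("instance runs on the hypervisor but is recorded as stopped")
        return Action.LOG_ONLY
    return Action.STOP_VM
```

With `log_only=False` the sketch keeps the behavior described in the mail (the running VM is stopped); with `log_only=True` the mismatch is only reported and the hypervisor's view is left untouched.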
Re: [openstack-dev] [nova] periodic task
In item #2 below the reboot is done via the guest and not the nova APIs :)

From: Gary Kotton gkot...@vmware.com
Reply-To: OpenStack List openstack-dev@lists.openstack.org
Date: Monday, August 24, 2015 at 7:18 PM
To: OpenStack List openstack-dev@lists.openstack.org
Subject: [openstack-dev] [nova] periodic task

Hi,

A couple of months ago I posted a patch for bug https://launchpad.net/bugs/1463688. The issue is as follows: the periodic task detects that the instance state does not match the state on the hypervisor and it shuts down the running VM. There are a number of ways that this may happen and I will try to explain:

1. VMware driver example: a host where the instances are running goes down. This could be a power outage, host failure, etc. The first iteration of the periodic task will determine that the actual instance is down. This will update the state of the instance to DOWN. The VC has the ability to do HA and it will start the instance up and running again. The next iteration of the periodic task will determine that the instance is up and the compute manager will stop the instance.
2. All drivers: the tenant decides to do a reboot of the instance and that coincides with the periodic task state validation. At this point in time the instance will not be up and the compute node will update the state of the instance as DOWN. On the next iteration the states will differ and the instance will be shut down.

Basically the issue hit us with our CI and there was no CI running for a couple of hours due to the fact that the compute node decided to shut down the running instances. The hypervisor should be the source of truth and it should not be the compute node that decides to shut down instances. I posted a patch to deal with this, https://review.openstack.org/#/c/190047/, which is the reason for this mail.
The patch is backwards compatible: existing deployments keep today's behavior, random shutdowns included, and the admin now has the ability to just log when there is an inconsistency. We do not want to disable the periodic task, as knowing the current state of the instance is very important and has a ton of value; we just do not want the periodic task to shut down a running instance.

Thanks
Gary
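The difference between a reboot issued inside the guest and one issued through the nova API can be traced with a short simulation. It is a sketch under assumptions, not nova's code: `Instance`, `periodic_iteration` and `run` are hypothetical names, power state is reduced to a boolean, and the only rule borrowed from the thread is that a set task_state makes the periodic task ignore the instance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    recorded_running: bool            # what the database believes
    task_state: Optional[str] = None  # set while an API operation is in flight


def periodic_iteration(inst: Instance, hv_running: bool) -> str:
    """One pass of the state check; returns a label for what happened."""
    if inst.task_state is not None:
        return "skipped"               # an API-driven operation is in progress
    if inst.recorded_running and not hv_running:
        inst.recorded_running = False  # record the instance as down
        return "marked down"
    if not inst.recorded_running and hv_running:
        return "stop requested"        # the running VM would be shut down
    return "in sync"


def run(inst: Instance, observations: List[bool]) -> List[str]:
    """Feed successive hypervisor observations to the periodic check."""
    return [periodic_iteration(inst, hv) for hv in observations]


# Reboot from inside the guest: no task_state, the check sees "down" then "up".
guest_reboot = run(Instance(recorded_running=True), [False, True])

# Reboot through the API: task_state is set while the VM is briefly down.
api_inst = Instance(recorded_running=True, task_state="rebooting")
api_reboot = run(api_inst, [False])
api_inst.task_state = None             # the operation has finished
api_reboot += run(api_inst, [True])
```

In this model `guest_reboot` ends with a stop request against a VM that is up again, while `api_reboot` never touches the instance, which is the distinction the reply above is drawing.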