Re: [openstack-dev] [nova] periodic task

2015-09-01 Thread Gary Kotton


On 8/31/15, 9:22 PM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com> wrote:

>
>
>On 8/27/2015 1:22 AM, Gary Kotton wrote:
>> 
>> 
>> On 8/25/15, 2:43 PM, "Andrew Laski" <and...@lascii.com> wrote:
>> 
>>> On 08/25/15 at 06:08pm, Gary Kotton wrote:
>>>>
>>>>
>>>> On 8/25/15, 9:10 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com>
>>>>wrote:
>>>>
>>>>>
>>>>>
>>>>> On 8/25/2015 10:03 AM, Gary Kotton wrote:
>>>>>>
>>>>>>
>>>>>> On 8/25/15, 7:04 AM, "Matt Riedemann" <mrie...@linux.vnet.ibm.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 8/24/2015 9:32 PM, Gary Kotton wrote:
>>>>>>>> In item #2 below the reboot is done via the guest and not the nova
>>>>>>>> APIs :)
>>>>>>>>
>>>>>>>> From: Gary Kotton <gkot...@vmware.com <mailto:gkot...@vmware.com>>
>>>>>>>> Reply-To: OpenStack List <openstack-dev@lists.openstack.org
>>>>>>>> <mailto:openstack-dev@lists.openstack.org>>
>>>>>>>> Date: Monday, August 24, 2015 at 7:18 PM
>>>>>>>> To: OpenStack List <openstack-dev@lists.openstack.org
>>>>>>>> <mailto:openstack-dev@lists.openstack.org>>
>>>>>>>> Subject: [openstack-dev] [nova] periodic task
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> A couple of months ago I posted a patch for bug
>>>>>>>> https://launchpad.net/bugs/1463688. The issue is as follows: the
>>>>>>>> periodic task detects that the instance state does not match the
>>>>>>>> state on the hypervisor and it shuts down the running VM. There are
>>>>>>>> a number of ways that this may happen and I will try to explain:
>>>>>>>>
>>>>>>>>  1. VMware driver example: a host where the instances are running
>>>>>>>>     goes down. This could be a power outage, host failure, etc. The
>>>>>>>>     first iteration of the periodic task will determine that the
>>>>>>>>     actual instance is down. This will update the state of the
>>>>>>>>     instance to DOWN. The VC has the ability to do HA and it will
>>>>>>>>     start the instance up and running again. The next iteration of
>>>>>>>>     the periodic task will determine that the instance is up and the
>>>>>>>>     compute manager will stop the instance.
>>>>>>>>  2. All drivers: the tenant decides to do a reboot of the instance
>>>>>>>>     and that coincides with the periodic task state validation. At
>>>>>>>>     this point in time the instance will not be up and the compute
>>>>>>>>     node will update the state of the instance as DOWN. On the next
>>>>>>>>     iteration the states will differ and the instance will be shut
>>>>>>>>     down.
>>>>>>>>
>>>>>>>> Basically the issue hit us with our CI and there was no CI running
>>>>>>>> for a couple of hours due to the fact that the compute node decided
>>>>>>>> to shut down the running instances. The hypervisor should be the
>>>>>>>> source of truth and it should not be the compute node that decides
>>>>>>>> to shut down instances. I posted a patch to deal with this,
>>>>>>>> https://review.openstack.org/#/c/190047/, which is the reason for
>>>>>>>> this mail.

Re: [openstack-dev] [nova] periodic task

2015-08-31 Thread Chris Friesen

On 08/25/2015 08:04 AM, Matt Riedemann wrote:



In #2 the guest shouldn't be rebooted by the user (tenant) outside of the
nova-api.  I'm not sure if it's actually formally documented in the nova
documentation, but from what I've always heard/known, nova is the control plane
and you should be doing everything with your instances via the nova-api.  If the
user rebooted via nova-api, the task_state would be set and the periodic task
would ignore the instance.


If we're talking about the guest rebooting itself (ie someone issuing a "reboot" 
from within the guest) then I think we should be able to handle that.  Guests 
might want to reboot for any number of reasons, and there's no reason to force 
every guest to require access to the nova API in order to reboot.


If we're talking about someone logging onto a compute node and running a virsh 
command (or similar) then I agree, that sort of thing should be done via nova.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] periodic task

2015-08-31 Thread Matt Riedemann



Re: [openstack-dev] [nova] periodic task

2015-08-27 Thread Gary Kotton


On 8/25/15, 2:43 PM, Andrew Laski and...@lascii.com wrote:

On 08/25/15 at 06:08pm, Gary Kotton wrote:


On 8/25/15, 9:10 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote:



On 8/25/2015 10:03 AM, Gary Kotton wrote:


 On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com
wrote:



 In #2 the guest shouldn't be rebooted by the user (tenant) outside of
 the nova-api.  I'm not sure if it's actually formally documented in the
 nova documentation, but from what I've always heard/known, nova is the
 control plane and you should be doing everything with your instances via
 the nova-api.  If the user rebooted via nova-api, the task_state would
 be set and the periodic task would ignore the instance.

 Matt, this is one case that I showed where the problem occurs. There are
 others and I can invest time to see them. The fact that the periodic task
 is there is important. What I don't understand is why having an option of
 log indication for an admin is something that is not useful and instead we
 are going with having the compute node shut down instances when this should
 not happen. Our infrastructure is behaving like cattle. That should not be
 the case and the hypervisor should be the source of truth.

 This is a serious issue and instances in production can and will go down.


 --

 Thanks,

 Matt Riedemann



For the HA case #1, the periodic task checks to see if the instance.host
doesn't match the compute service host [1] and skips if they don't match.

Shouldn't your HA scenario be updating which host the instance is
running on?  Or is this a vCenter-ism?

The nova compute node has not changed

Re: [openstack-dev] [nova] periodic task

2015-08-25 Thread Matt Riedemann



On 8/24/2015 9:32 PM, Gary Kotton wrote:

In item #2 below the reboot is done via the guest and not the nova APIs :)

From: Gary Kotton gkot...@vmware.com mailto:gkot...@vmware.com
Reply-To: OpenStack List openstack-dev@lists.openstack.org
mailto:openstack-dev@lists.openstack.org
Date: Monday, August 24, 2015 at 7:18 PM
To: OpenStack List openstack-dev@lists.openstack.org
mailto:openstack-dev@lists.openstack.org
Subject: [openstack-dev] [nova] periodic task

Hi,
A couple of months ago I posted a patch for bug
https://launchpad.net/bugs/1463688. The issue is as follows: the
periodic task detects that the instance state does not match the state
on the hypervisor and it shuts down the running VM. There are a number
of ways that this may happen and I will try to explain:

 1. VMware driver example: a host where the instances are running goes
    down. This could be a power outage, host failure, etc. The first
    iteration of the periodic task will determine that the actual
    instance is down. This will update the state of the instance to
    DOWN. The VC has the ability to do HA and it will start the instance
    up and running again. The next iteration of the periodic task will
    determine that the instance is up and the compute manager will stop
    the instance.
 2. All drivers: the tenant decides to do a reboot of the instance and
    that coincides with the periodic task state validation. At this
    point in time the instance will not be up and the compute node will
    update the state of the instance as DOWN. On the next iteration the
    states will differ and the instance will be shut down.

Basically the issue hit us with our CI and there was no CI running for a
couple of hours due to the fact that the compute node decided to
shut down the running instances. The hypervisor should be the source of
truth and it should not be the compute node that decides to shut down
instances. I posted a patch to deal with this,
https://review.openstack.org/#/c/190047/, which is the reason for this
mail. The patch is backwards compatible: existing deployments keep
today's behavior, random shutdowns included, and the admin now has the
option to just log when there is an inconsistency.

We do not want to disable the periodic task as knowing the current state
of the instance is very important and has a ton of value; we just do not
want the periodic task to shut down a running instance.

Thanks
Gary
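
The two-pass sequence in item #2 can be sketched in Python. This is a
minimal model of the behavior described in the mail above, not Nova's code:
`Instance`, `Hypervisor`, `sync_power_state` and the `stop_on_mismatch` flag
are hypothetical names, and the flag only stands in for the log-only option
that the patch is said to add.

```python
from dataclasses import dataclass, field

RUNNING = "running"
SHUTDOWN = "shutdown"


@dataclass
class Instance:
    # Hypothetical stand-in for the instance record, not Nova's object model.
    name: str
    db_power_state: str = RUNNING


@dataclass
class Hypervisor:
    # Hypothetical hypervisor: maps guest name to its actual power state.
    states: dict = field(default_factory=dict)

    def stop(self, name):
        self.states[name] = SHUTDOWN


def sync_power_state(instance, hypervisor, stop_on_mismatch=True):
    """Run one pass of the periodic check and return the action taken."""
    actual = hypervisor.states[instance.name]
    if actual == instance.db_power_state:
        return "in sync"
    if actual == SHUTDOWN:
        # The guest was caught while down, so the record is updated.
        instance.db_power_state = SHUTDOWN
        return "recorded shutdown"
    # Recorded as down but actually running: the step the thread disputes.
    if stop_on_mismatch:
        hypervisor.stop(instance.name)
        return "stopped running guest"
    return "logged mismatch only"


# Item #2: the check fires while the guest is mid-reboot, then again later.
hv = Hypervisor({"vm1": SHUTDOWN})
vm = Instance("vm1")
first = sync_power_state(vm, hv)    # "recorded shutdown"
hv.states["vm1"] = RUNNING          # the guest finishes rebooting
second = sync_power_state(vm, hv)   # "stopped running guest"
print(first, "->", second)
```

With `stop_on_mismatch=False` the second pass only reports the mismatch and
the guest keeps running, which is the outcome the mail argues for.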



In #2 the guest shouldn't be rebooted by the user (tenant) outside of 
the nova-api.  I'm not sure if it's actually formally documented in the 
nova documentation, but from what I've always heard/known, nova is the 
control plane and you should be doing everything with your instances via 
the nova-api.  If the user rebooted via nova-api, the task_state would 
be set and the periodic task would ignore the instance.


--

Thanks,

Matt Riedemann
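
The task_state argument above can be illustrated with a small sketch. It
assumes only what the reply states, namely that a reboot issued through
nova-api sets task_state and that the periodic task then ignores the
instance. `periodic_check_applies` is a hypothetical helper, not a Nova
function, and the "rebooting" value is illustrative.

```python
from typing import Optional


def periodic_check_applies(task_state: Optional[str]) -> bool:
    """Return True when the periodic check is allowed to act.

    Hypothetical helper: a non-None task_state means an API operation is
    in flight, so the check leaves the instance alone.
    """
    return task_state is None


# Reboot requested through the API: task_state is set for the duration.
assert periodic_check_applies("rebooting") is False

# Reboot issued inside the guest: Nova is not told, task_state stays None,
# so the check can observe the guest while it is down.
assert periodic_check_applies(None) is True
```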



Re: [openstack-dev] [nova] periodic task

2015-08-25 Thread Gary Kotton


On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote:



In #2 the guest shouldn't be rebooted by the user (tenant) outside of
the nova-api.  I'm not sure if it's actually formally documented in the
nova documentation, but from what I've always heard/known, nova is the
control plane and you should be doing everything with your instances via
the nova-api.  If the user rebooted via nova-api, the task_state would
be set and the periodic task would ignore the instance.

Matt, this is one case that I showed where the problem occurs. There are
others and I can invest time to see them. The fact that the periodic task
is there is important. What I don't understand is why having an option of
log indication for an admin is something that is not useful and instead we
are going with having the compute node shut down instances when this should
not happen. Our infrastructure is behaving like cattle. That should not be
the case and the hypervisor should be the source of truth.

This is a serious issue and instances in production can and will go down.


-- 

Thanks,

Matt Riedemann



Re: [openstack-dev] [nova] periodic task

2015-08-25 Thread Matt Riedemann



On 8/25/2015 10:03 AM, Gary Kotton wrote:



On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote:




In #2 the guest shouldn't be rebooted by the user (tenant) outside of
the nova-api.  I'm not sure if it's actually formally documented in the
nova documentation, but from what I've always heard/known, nova is the
control plane and you should be doing everything with your instances via
the nova-api.  If the user rebooted via nova-api, the task_state would
be set and the periodic task would ignore the instance.


Matt, this is one case that I showed where the problem occurs. There are
others and I can invest time to see them. The fact that the periodic task
is there is important. What I don't understand is why having an option of
log indication for an admin is something that is not useful and instead we
are going with having the compute node shut down instances when this should
not happen. Our infrastructure is behaving like cattle. That should not be
the case and the hypervisor should be the source of truth.

This is a serious issue and instances in production can and will go down.



--

Thanks,

Matt Riedemann



For the HA case #1, the periodic task checks to see if the instance.host 
doesn't match the compute service host [1] and skips if they don't match.


Shouldn't your HA scenario be updating which host the instance is 
running on?  Or is this a vCenter-ism?


[1] 
http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py#n5871


--

Thanks,

Matt Riedemann
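
The host comparison mentioned for HA case #1 can be sketched as follows.
This is only an illustration of the check as described in this message;
`check_skips_instance` and the host names are hypothetical, and whether a
vCenter HA failover changes the recorded host is exactly the open question.

```python
def check_skips_instance(instance_host: str, service_host: str) -> bool:
    """Return True when the periodic check skips the instance.

    Hypothetical helper mirroring the comparison described above.
    """
    return instance_host != service_host


# Instance recorded on another compute service: this service skips it.
skipped = check_skips_instance("compute-2", "compute-1")

# Recorded host unchanged (for example if a failover happens below the
# level Nova tracks): the check still applies to the instance.
applies = not check_skips_instance("compute-1", "compute-1")
print(skipped, applies)
```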



Re: [openstack-dev] [nova] periodic task

2015-08-25 Thread Gary Kotton


On 8/25/15, 9:10 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote:



On 8/25/2015 10:03 AM, Gary Kotton wrote:


 On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com
wrote:



 In #2 the guest shouldn't be rebooted by the user (tenant) outside of
 the nova-api.  I'm not sure if it's actually formally documented in the
 nova documentation, but from what I've always heard/known, nova is the
 control plane and you should be doing everything with your instances via
 the nova-api.  If the user rebooted via nova-api, the task_state would
 be set and the periodic task would ignore the instance.

 Matt, this is one case that I showed where the problem occurs. There are
 others and I can invest time to see them. The fact that the periodic task
 is there is important. What I don't understand is why having an option of
 log indication for an admin is something that is not useful and instead we
 are going with having the compute node shut down instances when this should
 not happen. Our infrastructure is behaving like cattle. That should not be
 the case and the hypervisor should be the source of truth.

 This is a serious issue and instances in production can and will go down.


 --

 Thanks,

 Matt Riedemann


 


For the HA case #1, the periodic task checks to see if the instance.host
doesn't match the compute service host [1] and skips if they don't match.

Shouldn't your HA scenario be updating which host the instance is
running on?  Or is this a vCenter-ism?

The nova compute node has not changed. It is not the compute nodes host.
The host that the instance was running on was down and those instances
were moved

Re: [openstack-dev] [nova] periodic task

2015-08-25 Thread Andrew Laski

On 08/25/15 at 06:08pm, Gary Kotton wrote:



On 8/25/15, 9:10 AM, Matt Riedemann mrie...@linux.vnet.ibm.com wrote:




On 8/25/2015 10:03 AM, Gary Kotton wrote:



On 8/25/15, 7:04 AM, Matt Riedemann mrie...@linux.vnet.ibm.com
wrote:




On 8/24/2015 9:32 PM, Gary Kotton wrote:

In item #2 below the reboot is done via the guest and not the nova
APIs :)

From: Gary Kotton <gkot...@vmware.com>
Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
Date: Monday, August 24, 2015 at 7:18 PM
To: OpenStack List <openstack-dev@lists.openstack.org>
Subject: [openstack-dev] [nova] periodic task

Hi,
A couple of months ago I posted a patch for bug
https://launchpad.net/bugs/1463688. The issue is as follows: the
periodic task detects that the instance state does not match the state
on the hypervisor and it shuts down the running VM. There are a number
of ways that this may happen and I will try to explain:

  1. VMware driver example: a host where the instances are running goes
     down. This could be a power outage, host failure, etc. The first
     iteration of the periodic task will determine that the actual
     instance is down. This will update the state of the instance to
     DOWN. The VC has the ability to do HA and it will start the
     instance up and running again. The next iteration of the periodic
     task will determine that the instance is up and the compute manager
     will stop the instance.
  2. All drivers: the tenant decides to do a reboot of the instance and
     that coincides with the periodic task state validation. At this
     point in time the instance will not be up and the compute node will
     update the state of the instance as DOWN. On the next iteration the
     states will differ and the instance will be shut down.

Basically the issue hit us with our CI and there was no CI running for a
couple of hours due to the fact that the compute node decided to
shut down the running instances. The hypervisor should be the source of
truth and it should not be the compute node that decides to shut down
instances. I posted a patch to deal with this,
https://review.openstack.org/#/c/190047/, which is the reason for this
mail. The patch is backwards compatible, so for existing deployments the
random shutdown continues to work as it does today, and the admin now
has the ability to just log if there is an inconsistency.

We do not want to disable the periodic task, as knowing the current
state of the instance is very important and has a ton of value; we just
do not want the periodic task to shut down a running instance.

Thanks
Gary







In #2 the guest shouldn't be rebooted by the user (tenant) outside of
the nova-api.  I'm not sure if it's actually formally documented in the
nova documentation, but from what I've always heard/known, nova is the
control plane and you should be doing everything with your instances via
the nova-api.  If the user rebooted via nova-api, the task_state would
be set and the periodic task would ignore the instance.


Matt, this is one case that I showed where the problem occurs. There are
others and I can invest time to see them. The fact that the periodic task
is there is important. What I don't understand is why having the option of
a log indication for an admin is not considered useful, and why we are
instead having the compute node shut down the instance when this should
not happen. Our infrastructure is behaving like cattle. That should not be
the case and the hypervisor should be the source of truth.

This is a serious issue and instances in production can and will go down.



--

Thanks,

Matt Riedemann







For the HA case #1, the periodic task checks to see if the instance.host
doesn't match the compute service host [1] and skips if they don't match.

Shouldn't your HA scenario be updating which host the instance is
running on?  Or is this a vCenter-ism?
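
As a rough illustration of the two skip conditions mentioned in the replies above (a task_state that is set because an action went through the nova-api, and an instance.host that does not match the compute service host), here is a minimal Python sketch. The Instance class and should_skip_sync are hypothetical names made up for this example; this is not nova's actual code, only a simplified model of the checks as they are described in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    # Hypothetical, simplified stand-in for a nova instance record.
    uuid: str
    host: str                  # compute service host that owns the instance
    task_state: Optional[str]  # e.g. "rebooting" while an API action is in flight


def should_skip_sync(instance: Instance, service_host: str) -> bool:
    """Return True when the periodic task should leave the instance alone."""
    # An action started through the nova-api sets task_state, so the
    # instance is in transition and its power state is not compared.
    if instance.task_state is not None:
        return True
    # The instance belongs to a different compute service host.
    if instance.host != service_host:
        return True
    return False
```

Under this model a reboot issued from inside the guest leaves task_state unset, so the instance is not skipped, which is the situation described in item #2.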


The nova compute node has not changed. It is not the compute node's host.
The host that the instance was running on was down and those instances
were moved.


So this is a case

[openstack-dev] [nova] periodic task

2015-08-24 Thread Gary Kotton
Hi,
A couple of months ago I posted a patch for bug
https://launchpad.net/bugs/1463688. The issue is as follows: the periodic task
detects that the instance state does not match the state on the hypervisor and
it shuts down the running VM. There are a number of ways that this may happen
and I will try to explain:

  1.  VMware driver example: a host where the instances are running goes down.
This could be a power outage, host failure, etc. The first iteration of the
periodic task will determine that the actual instance is down. This will update
the state of the instance to DOWN. The VC has the ability to do HA and it will
start the instance up and running again. The next iteration of the periodic
task will determine that the instance is up and the compute manager will stop
the instance.
  2.  All drivers: the tenant decides to do a reboot of the instance and that
coincides with the periodic task state validation. At this point in time the
instance will not be up and the compute node will update the state of the
instance as DOWN. On the next iteration the states will differ and the instance
will be shut down.

Basically the issue hit us with our CI and there was no CI running for a couple
of hours due to the fact that the compute node decided to shut down the running
instances. The hypervisor should be the source of truth and it should not be
the compute node that decides to shut down instances. I posted a patch to deal
with this, https://review.openstack.org/#/c/190047/, which is the reason for
this mail. The patch is backwards compatible, so for existing deployments the
random shutdown continues to work as it does today, and the admin now has the
ability to just log if there is an inconsistency.

We do not want to disable the periodic task, as knowing the current state of the
instance is very important and has a ton of value; we just do not want the
periodic task to shut down a running instance.

Thanks
Gary
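
To make the two-iteration sequence from item #1 concrete, here is a minimal Python sketch of the behaviour as it is described in this mail, together with a log-only switch in the spirit of the proposed patch. PowerState, sync_power_state and the log_only flag are hypothetical names chosen for illustration; they are not nova's real identifiers or the actual option added by the review, and the real compute manager logic is more involved.

```python
import logging
from enum import Enum

LOG = logging.getLogger(__name__)


class PowerState(Enum):
    RUNNING = "running"
    SHUTDOWN = "shutdown"


def sync_power_state(db_state, hv_state, log_only=False):
    """Model one periodic-task iteration.

    Returns (new_db_state, action), where action is "none", "stop" or "log".
    """
    if db_state is hv_state:
        return db_state, "none"
    if hv_state is PowerState.SHUTDOWN:
        # Hypervisor reports the instance as down: record that in the database.
        return PowerState.SHUTDOWN, "none"
    # Database says SHUTDOWN but the hypervisor says RUNNING.
    if log_only:
        # Assumed log-only mode: report the mismatch, leave the VM alone.
        LOG.warning("Instance is %s on the hypervisor but %s in the database",
                    hv_state.value, db_state.value)
        return db_state, "log"
    # Default behaviour described in the mail: stop the running instance.
    return db_state, "stop"


# Item #1: the host fails, then HA restarts the instance.
state, action = sync_power_state(PowerState.RUNNING, PowerState.SHUTDOWN)
state, action = sync_power_state(state, PowerState.RUNNING)
```

With the default behaviour the second call asks for the running instance to be stopped; with log_only=True the same mismatch would only produce a warning.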


Re: [openstack-dev] [nova] periodic task

2015-08-24 Thread Gary Kotton
In item #2 below the reboot is done via the guest and not the nova APIs :)

From: Gary Kotton <gkot...@vmware.com>
Reply-To: OpenStack List <openstack-dev@lists.openstack.org>
Date: Monday, August 24, 2015 at 7:18 PM
To: OpenStack List <openstack-dev@lists.openstack.org>
Subject: [openstack-dev] [nova] periodic task

Hi,
A couple of months ago I posted a patch for bug
https://launchpad.net/bugs/1463688. The issue is as follows: the periodic task
detects that the instance state does not match the state on the hypervisor and
it shuts down the running VM. There are a number of ways that this may happen
and I will try to explain:

  1.  VMware driver example: a host where the instances are running goes down.
This could be a power outage, host failure, etc. The first iteration of the
periodic task will determine that the actual instance is down. This will update
the state of the instance to DOWN. The VC has the ability to do HA and it will
start the instance up and running again. The next iteration of the periodic
task will determine that the instance is up and the compute manager will stop
the instance.
  2.  All drivers: the tenant decides to do a reboot of the instance and that
coincides with the periodic task state validation. At this point in time the
instance will not be up and the compute node will update the state of the
instance as DOWN. On the next iteration the states will differ and the instance
will be shut down.

Basically the issue hit us with our CI and there was no CI running for a couple
of hours due to the fact that the compute node decided to shut down the running
instances. The hypervisor should be the source of truth and it should not be
the compute node that decides to shut down instances. I posted a patch to deal
with this, https://review.openstack.org/#/c/190047/, which is the reason for
this mail. The patch is backwards compatible, so for existing deployments the
random shutdown continues to work as it does today, and the admin now has the
ability to just log if there is an inconsistency.

We do not want to disable the periodic task, as knowing the current state of the
instance is very important and has a ton of value; we just do not want the
periodic task to shut down a running instance.

Thanks
Gary