Re: [openstack-dev] [nova] Service group foundations and features

2015-06-11 Thread Vilobh Meshram
A few more places that can trigger inconsistent behaviour:

- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/services.py#L44
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hypervisors.py#L98
- https://github.com/openstack/nova/blob/stable/kilo/nova/availability_zones.py#L130
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/availability_zone.py#L68
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hosts.py#L88-L89
- https://github.com/openstack/nova/blob/stable/kilo/nova/compute/api.py#L3399-L3421


Blueprint that plans to fix this:
https://blueprints.launchpad.net/nova/+spec/servicegroup-api-control-plane

Related specs:
1) https://review.openstack.org/#/c/190322/
2) https://review.openstack.org/#/c/138607/

-Vilobh


On Mon, May 11, 2015 at 8:08 AM, Chris Friesen chris.frie...@windriver.com
wrote:

 On 05/11/2015 07:13 AM, Attila Fazekas wrote:

 From: John Garbutt j...@johngarbutt.com


  * From the RPC api point of view, do we want to send a cast to
 something that we know is dead, maybe we want to? Should we wait for
 calls to timeout, or give up quicker?


 How to fail sooner:
 https://bugs.launchpad.net/oslo.messaging/+bug/1437955

 We do not need a dedicated is_up just for this.


 Is that really going to help?  As I understand it if nova-compute dies (or
 is isolated) then the queue remains present on the server but nothing will
 process messages from it.

 Chris



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Service group foundations and features

2015-06-11 Thread Sylvain Bauza



On 11/06/2015 18:52, Vilobh Meshram wrote:

A few more places that can trigger inconsistent behaviour:

- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/services.py#L44
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hypervisors.py#L98
- https://github.com/openstack/nova/blob/stable/kilo/nova/availability_zones.py#L130
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/availability_zone.py#L68
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hosts.py#L88-L89
- https://github.com/openstack/nova/blob/stable/kilo/nova/compute/api.py#L3399-L3421



Blueprint that plans to fix this:
https://blueprints.launchpad.net/nova/+spec/servicegroup-api-control-plane


Related specs:
1) https://review.openstack.org/#/c/190322/
2) https://review.openstack.org/#/c/138607/

-Vilobh




tl;dr: checking a Service (is_up) should only be about making sure we can
send a message to it, not about checking whether the related hypervisor(s)
is/are up. Having a reference in the services table mapping 1:1 to a
reference in a separate datastore is fine by me.



So, I'm going to review the specs above and leave my comments there.
That said, I also want to offer a humble opinion about what the
relationship should be between a Service and what is called the
ServiceGroup API (badly named IMHO, since it only checks a service,
not a group ;-)).


From my perspective, the Service object is related to the AMQP service
tied to the queue and... that's it.
It has nothing to do with a hypervisor (since hypervisors can be
distributed behind a single service). It only represents the single
point of failure for messages sent to a nova-compute service (not a
compute node, remember the distributed case), and since this is the only
way to communicate with the related hypervisor(s), we have to know its
status.


Again, that doesn't necessarily imply that if the service (which listens
to the AMQP queue) is up, the hypervisors will be up as well, but it is
strong enough to say that if the service is down, we are sure that the
hypervisor(s) won't receive messages.
Whether the hypervisor keeps working while the service is down is a
corner case that the service status should not cover, IMHO.


That's exactly why we should treat the service as a reference that can
be used as-is for any relationship with a list of hypervisors (call that
ComputeNode now), and checking its state (with whatever driver) should
only be used to know whether a message can be sent to it - *not for
checking whether the related hypervisor(s) are running*.
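
To make that distinction concrete, here is a minimal sketch (in the spirit
of the kilo code, with a hypothetical rpcapi/instance pair as arguments) of
using the servicegroup check purely as a "can this queue be consumed?"
guard before casting, never as a hypervisor health check:

    from nova import exception
    from nova import servicegroup

    servicegroup_api = servicegroup.API()

    def cast_if_reachable(context, compute_rpcapi, service, instance):
        # service_is_up() only tells us whether the nova-compute service
        # has recently reported in, i.e. whether a message sent to its
        # queue has a chance of being consumed.  It says nothing about
        # hypervisor health.
        if not servicegroup_api.service_is_up(service):
            # Fail fast instead of casting into a queue nobody reads.
            raise exception.ServiceUnavailable()
        compute_rpcapi.stop_instance(context, instance)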


Given that disclaimer (which implies that we need to be very clear about
when to ask is_up(service)), I'm fine with treating the reference stored
in the DB (i.e. the services table) as only a list of references pointing
to a separate object that can be stored in any datastore
(DB/Memcache/ZK/pick your favorite).


The only thing we need to make sure of is that there is a 1:1 mapping
between the two objects (e.g. the DB service item and the object in the
datastore), which can only be done logically.


My 2 cts,
-Sylvain




On Mon, May 11, 2015 at 8:08 AM, Chris Friesen
chris.frie...@windriver.com wrote:


On 05/11/2015 07:13 AM, Attila Fazekas wrote:

From: John Garbutt j...@johngarbutt.com


* From the RPC api point of view, do we want to send a cast to
something that we know is dead, maybe we want to? Should
we wait for
calls to timeout, or give up quicker?


How to fail sooner:
https://bugs.launchpad.net/oslo.messaging/+bug/1437955

We do not need a dedicated is_up just for this.


Is that really going to help?  As I understand it if nova-compute
dies (or is isolated) then the queue remains present on the server
but nothing will process messages from it.

Chris






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [nova] Service group foundations and features

2015-05-11 Thread Attila Fazekas




- Original Message -
 From: John Garbutt j...@johngarbutt.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Saturday, May 9, 2015 1:18:48 PM
 Subject: Re: [openstack-dev] [nova] Service group foundations and features
 
 On 7 May 2015 at 22:52, Joshua Harlow harlo...@outlook.com wrote:
  Hi all,
 
  In seeing the following:
 
  - https://review.openstack.org/#/c/169836/
  - https://review.openstack.org/#/c/163274/
  - https://review.openstack.org/#/c/138607/
 
  Vilobh and I are starting to come to the conclusion that the service group
  layers in nova really need to be cleaned up (without adding more features
  that only work in one driver), or removed or other... Spec[0] has
  interesting findings on this:
 
  A summary/highlights:
 
  * The zookeeper service driver in nova has probably been broken for 1 or
  more releases, due to eventlet attributes that it was using (via the
  evzookeeper[1] library) which are now gone. Evzookeeper only works with
  eventlet < 0.17.1. Please refer to [0] for details.
  * The memcache service driver really only uses memcache for a tiny piece of
  the service liveness information (and does a database service table scan to
  get the list of services). Please refer to [0] for details.
  * Nova-manage service disable (CLI admin api) does interact with the
  service
  group layer for the 'is_up'[3] API (but it also does a database service
  table scan[4] to get the list of services, so this is inconsistent with the
  service group driver API 'get_all'[2] view on what is enabled/disabled).
  Please refer to [9][10] for nova manage service enable disable for details.
* Nova service delete (REST api) seems to follow a similar broken pattern
  (it also avoids calling into the service group layer to delete a service,
  which means it only works with the database layer[5], and therefore is
  inconsistent with the service group 'get_all'[2] API).
 
  ^^ Doing the above makes both disable/delete agnostic about other backends
  available that may/might manage service group data for example zookeeper,
  memcache, redis etc... Please refer [6][7] for details. Ideally the API
  should follow the model used in [8] so that the extension, admin interface
  as well as the API interface use the same servicegroup interface which
  should be *fully* responsible for managing services. Doing so we will have
  a
  consistent view of services data, liveness, disabled/enabled and so-on...
 
  So with no disrespect to the authors of 169836 and 163274 (or anyone else
  involved), I am wondering if we can put a request in to figure out how to
  get the foundation of the service group concepts stabilized (or other...)
  before adding more features (that only work with the DB layer).
 
  What is the path to request some kind of larger coordination effort by the
  nova folks to fix the service group layers (and the concepts that are not
  disjoint/don't work across them) before continuing to add features on-top
  of
  a 'shakey' foundation?
 
  If I could propose something it would probably work out like the following:
 
  Step 0: Figure out if the service group API + layer(s) should be
  maintained/tweaked at all (nova-core decides?)
 
  If maintain it:
 
   - Have an agreement that nova service extension, admin
  interface(nova-manage) and API go through a common path for
  update/delete/read.
* This common path should likely be the servicegroup API so as to have a
  consistent view of data and that also helps nova to add different
  data-stores (keeping the services data in a DB and getting numerous updates
  about liveliness every few seconds of N number of compute where N is pretty
  high can be detrimental to Nova's performance)
   - At the same time allow 163274 to be worked on (since it fixes a
   edge-case
  that was asked about in the initial addition of the delete API in its
  initial code commit @ https://review.openstack.org/#/c/39998/)
   - Delay 169836 until the above two/three are fixed (and stabilized); it's
  down concept (and all other usages of services that are hitting a database
  mentioned above) will need to go through the same service group foundation
  that is currently being skipped.
 
  Else:
- Discard 138607 and start removing the service group code (and just use
  the DB for all the things).
- Allow 163274 and 138607 (since those would be additions on-top of the
DB
  layer that will be preserved).
 
  Thoughts?
 
 I wonder about this approach:
 
 * I think we need to go back and document what we want from the
 service group concept.
 * Then we look at the best approach to implement that concept.
 * Then look at the best way to get to a happy place from where we are now,
 ** Noting we will need live upgrade for (at least) the most widely
 used drivers
 
 Does that make any sense?
 
 Things that pop into my head, include:
 * The operators have been asking questions like: Should new services

Re: [openstack-dev] [nova] Service group foundations and features

2015-05-11 Thread Chris Friesen

On 05/11/2015 07:13 AM, Attila Fazekas wrote:

From: John Garbutt j...@johngarbutt.com



* From the RPC api point of view, do we want to send a cast to
something that we know is dead, maybe we want to? Should we wait for
calls to timeout, or give up quicker?


How to fail sooner:
https://bugs.launchpad.net/oslo.messaging/+bug/1437955

We do not need a dedicated is_up just for this.


Is that really going to help?  As I understand it if nova-compute dies (or is 
isolated) then the queue remains present on the server but nothing will process 
messages from it.
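
(For the "give up quicker" side of the question, one knob that already
exists is a shorter per-call timeout in oslo.messaging - a minimal sketch,
with a made-up 'ping' method purely for illustration. It only bounds how
long a call blocks; it does not detect that the queue's consumer is gone,
which is exactly the problem described above.)

    from oslo_config import cfg
    import oslo_messaging as messaging

    def ping_compute(host):
        transport = messaging.get_transport(cfg.CONF)
        target = messaging.Target(topic='compute', server=host)
        client = messaging.RPCClient(transport, target)
        # Bound the wait to 10 seconds instead of the default RPC timeout.
        cctxt = client.prepare(timeout=10)
        # If nova-compute died but its queue still exists, this still
        # blocks for the full 10 seconds and then raises MessagingTimeout.
        return cctxt.call({}, 'ping')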


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Service group foundations and features

2015-05-11 Thread Juvonen, Tomi (Nokia - FI/Espoo)
From: ext Chris Friesen [mailto:chris.frie...@windriver.com] 
Sent: Monday, May 11, 2015 6:09 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [nova] Service group foundations and features

On 05/11/2015 07:13 AM, Attila Fazekas wrote:
 From: John Garbutt j...@johngarbutt.com

 * From the RPC api point of view, do we want to send a cast to
 something that we know is dead, maybe we want to? Should we wait for
 calls to timeout, or give up quicker?

 How to fail sooner:
 https://bugs.launchpad.net/oslo.messaging/+bug/1437955

 We do not need a dedicated is_up just for this.

Is that really going to help?  As I understand it if nova-compute dies (or is 
isolated) then the queue remains present on the server but nothing will 
process 
messages from it.

Chris

As for queued messages: if the forced_down flag proposed in
https://review.openstack.org/#/c/169836/ is set to true, the "I'm up"
message should be ignored until forced_down is cleared, since the flag is
also there to prevent the service state from turning 'up'. So the flag
gives a fast way to declare a service down in order to enable evacuation
(and prevent scheduling VMs to the host), but also prevents invalid state.
It could even abort nova-compute startup, as mentioned in the review
comments. This should make things quite safe.
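
A rough sketch of how a DB-style servicegroup driver could honour such a
flag (forced_down here is the field proposed in that review, not something
that exists in kilo; the rest roughly mirrors the existing heartbeat check):

    from oslo_config import cfg
    from oslo_utils import timeutils

    CONF = cfg.CONF

    def is_up(service_ref):
        # Operator override: a forcibly-downed service stays down until
        # the flag is cleared, regardless of heartbeats, so the scheduler
        # skips the host and evacuation can proceed.
        if service_ref.get('forced_down'):
            return False
        # Normal heartbeat-based liveness check, roughly what the kilo
        # DB driver does today.
        last_heartbeat = (service_ref.get('updated_at') or
                          service_ref.get('created_at'))
        elapsed = timeutils.delta_seconds(last_heartbeat,
                                          timeutils.utcnow())
        return abs(elapsed) <= CONF.service_down_time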

-Tomi


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Service group foundations and features

2015-05-09 Thread John Garbutt
On 7 May 2015 at 22:52, Joshua Harlow harlo...@outlook.com wrote:
 Hi all,

 In seeing the following:

 - https://review.openstack.org/#/c/169836/
 - https://review.openstack.org/#/c/163274/
 - https://review.openstack.org/#/c/138607/

 Vilobh and I are starting to come to the conclusion that the service group
 layers in nova really need to be cleaned up (without adding more features
 that only work in one driver), or removed or other... Spec[0] has
 interesting findings on this:

 A summary/highlights:

 * The zookeeper service driver in nova has probably been broken for 1 or
 more releases, due to eventlet attributes that it was using (via the
 evzookeeper[1] library) which are now gone. Evzookeeper only works with
 eventlet < 0.17.1. Please refer to [0] for details.
 * The memcache service driver really only uses memcache for a tiny piece of
 the service liveness information (and does a database service table scan to
 get the list of services). Please refer to [0] for details.
 * Nova-manage service disable (CLI admin api) does interact with the service
 group layer for the 'is_up'[3] API (but it also does a database service
 table scan[4] to get the list of services, so this is inconsistent with the
 service group driver API 'get_all'[2] view on what is enabled/disabled).
 Please refer to [9][10] for nova manage service enable disable for details.
   * Nova service delete (REST api) seems to follow a similar broken pattern
 (it also avoids calling into the service group layer to delete a service,
 which means it only works with the database layer[5], and therefore is
 inconsistent with the service group 'get_all'[2] API).

 ^^ Doing the above makes both disable/delete agnostic about other backends
 available that may/might manage service group data for example zookeeper,
 memcache, redis etc... Please refer [6][7] for details. Ideally the API
 should follow the model used in [8] so that the extension, admin interface
 as well as the API interface use the same servicegroup interface which
 should be *fully* responsible for managing services. Doing so we will have a
 consistent view of services data, liveness, disabled/enabled and so-on...

 So with no disrespect to the authors of 169836 and 163274 (or anyone else
 involved), I am wondering if we can put a request in to figure out how to
 get the foundation of the service group concepts stabilized (or other...)
 before adding more features (that only work with the DB layer).

 What is the path to request some kind of larger coordination effort by the
 nova folks to fix the service group layers (and the concepts that are not
 disjoint/don't work across them) before continuing to add features on-top of
 a 'shakey' foundation?

 If I could propose something it would probably work out like the following:

 Step 0: Figure out if the service group API + layer(s) should be
 maintained/tweaked at all (nova-core decides?)

 If maintain it:

  - Have an agreement that nova service extension, admin
 interface(nova-manage) and API go through a common path for
 update/delete/read.
   * This common path should likely be the servicegroup API so as to have a
 consistent view of data and that also helps nova to add different
 data-stores (keeping the services data in a DB and getting numerous updates
 about liveliness every few seconds of N number of compute where N is pretty
 high can be detrimental to Nova's performance)
  - At the same time allow 163274 to be worked on (since it fixes a edge-case
 that was asked about in the initial addition of the delete API in its
 initial code commit @ https://review.openstack.org/#/c/39998/)
  - Delay 169836 until the above two/three are fixed (and stabilized); it's
 down concept (and all other usages of services that are hitting a database
 mentioned above) will need to go through the same service group foundation
 that is currently being skipped.

 Else:
   - Discard 138607 and start removing the service group code (and just use
 the DB for all the things).
   - Allow 163274 and 138607 (since those would be additions on-top of the DB
 layer that will be preserved).

 Thoughts?

I wonder about this approach:

* I think we need to go back and document what we want from the
service group concept.
* Then we look at the best approach to implement that concept.
* Then look at the best way to get to a happy place from where we are now,
** Noting we will need live upgrade for (at least) the most widely
used drivers

Does that make any sense?

Things that pop into my head, include:
* The operators have been asking questions like: "Should new services
not be disabled by default?" and "Can't my admins tell you that I
just killed it?"
* And from the scheduler point of view, how do we interact with the
provider that tells us if something is alive or not?
* From the RPC api point of view, do we want to send a cast to
something that we know is dead, maybe we want to? Should we wait for
calls to timeout, or give up quicker?
* Polling the DB kinda sucks, 

[openstack-dev] [nova] Service group foundations and features

2015-05-07 Thread Joshua Harlow

Hi all,

In seeing the following:

- https://review.openstack.org/#/c/169836/
- https://review.openstack.org/#/c/163274/
- https://review.openstack.org/#/c/138607/

Vilobh and I are starting to come to the conclusion that the service 
group layers in nova really need to be cleaned up (without adding more 
features that only work in one driver), or removed or other... Spec[0] 
has interesting findings on this:


A summary/highlights:

* The zookeeper service driver in nova has probably been broken for 1 or
more releases, due to eventlet attributes that it was using (via the
evzookeeper[1] library) which are now gone. Evzookeeper only works with
eventlet < 0.17.1. Please refer to [0] for details.
* The memcache service driver really only uses memcache for a tiny piece 
of the service liveness information (and does a database service table 
scan to get the list of services). Please refer to [0] for details.
* Nova-manage service disable (the CLI admin API) does interact with the
service group layer for the 'is_up'[3] API (but it also does a database
service table scan[4] to get the list of services, so it is inconsistent
with the service group driver API's 'get_all'[2] view of what is
enabled/disabled). Please refer to [9][10] for details on nova-manage
service enable/disable.
  * Nova service delete (REST API) seems to follow a similar broken
pattern (it also avoids calling into the service group layer to delete a
service, which means it only works with the database layer[5], and
therefore is inconsistent with the service group 'get_all'[2] API).


^^ The above makes both disable and delete unaware of any other backends
that may be managing service group data, for example zookeeper,
memcache, redis, etc. Please refer to [6][7] for details.
Ideally the API should follow the model used in [8] so that the
extension, the admin interface, and the API interface all use the same
servicegroup interface, which should be *fully* responsible for managing
services. Doing so gives us a consistent view of service data,
liveness, disabled/enabled, and so on...
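
To make the "single common path" idea concrete, here is a hypothetical
sketch (the set_disabled/delete helpers below do not exist today; only
get_all and service_is_up are real kilo servicegroup API calls) of routing
reads and updates through the servicegroup API instead of direct DB table
scans:

    from nova import servicegroup

    class ServiceController(object):
        """Hypothetical controller where every read/update goes through
        the servicegroup API, so the backing datastore (DB, memcache,
        ZK, ...) stays a driver detail instead of something each caller
        reaches into directly."""

        def __init__(self):
            self.servicegroup_api = servicegroup.API()

        def index(self, context):
            # One consistent view of services, instead of a direct
            # service-table scan plus a separate is_up() per row.
            return self.servicegroup_api.get_all('compute')

        def disable(self, context, host):
            # Hypothetical call: today disable writes straight to the DB.
            self.servicegroup_api.set_disabled(context, host, disabled=True)

        def delete(self, context, service_id):
            # Hypothetical call: today delete also bypasses the
            # servicegroup layer entirely.
            self.servicegroup_api.delete(context, service_id)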


So with no disrespect to the authors of 169836 and 163274 (or anyone 
else involved), I am wondering if we can put a request in to figure out 
how to get the foundation of the service group concepts stabilized (or 
other...) before adding more features (that only work with the DB layer).


What is the path to request some kind of larger coordination effort by
the nova folks to fix the service group layers (and the concepts that
are not disjoint/don't work across them) before continuing to add
features on top of a 'shaky' foundation?


If I could propose something it would probably work out like the following:

Step 0: Figure out if the service group API + layer(s) should be 
maintained/tweaked at all (nova-core decides?)


If maintain it:

 - Have an agreement that the nova service extension, the admin
interface (nova-manage), and the API go through a common path for
update/delete/read.
  * This common path should likely be the servicegroup API, so as to
have a consistent view of data; that also helps nova add different
data stores (keeping the services data in a DB and getting numerous
liveness updates every few seconds from N compute nodes, where N is
pretty high, can be detrimental to Nova's performance)
 - At the same time, allow 163274 to be worked on (since it fixes an
edge case that was asked about in the initial addition of the delete API
in its initial code commit @ https://review.openstack.org/#/c/39998/)
 - Delay 169836 until the above two/three are fixed (and stabilized);
its 'down' concept (and all other usages of services that hit the
database, mentioned above) will need to go through the same service group
foundation that is currently being skipped.


Else:
  - Discard 138607 and start removing the service group code (and just 
use the DB for all the things).
  - Allow 163274 and 138607 (since those would be additions on-top of 
the DB layer that will be preserved).


Thoughts?

- Josh (and Vilobh, who is spending the most time on this recently)

[0] Replace service group with tooz: https://review.openstack.org/#/c/138607/
[1] https://pypi.python.org/pypi/evzookeeper/
[2] https://github.com/openstack/nova/blob/stable/kilo/nova/servicegroup/api.py#L93
[3] https://github.com/openstack/nova/blob/stable/kilo/nova/servicegroup/api.py#L87
[4] https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L711
[5] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L106
[6] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L107
[7] https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3436
[8] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L61
[9] Nova manage enable: https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L742
[10] Nova manage disable: