Re: [openstack-dev] [nova] Service group foundations and features
Few more places which can trigger inconsistent behaviour:

- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/services.py#L44
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hypervisors.py#L98
- https://github.com/openstack/nova/blob/stable/kilo/nova/availability_zones.py#L130
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/availability_zone.py#L68
- https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hosts.py#L88-L89
- https://github.com/openstack/nova/blob/stable/kilo/nova/compute/api.py#L3399-L3421

Blueprint which plans to fix this: https://blueprints.launchpad.net/nova/+spec/servicegroup-api-control-plane

Related specs:
1) https://review.openstack.org/#/c/190322/
2) https://review.openstack.org/#/c/138607/

-Vilobh

On Mon, May 11, 2015 at 8:08 AM, Chris Friesen <chris.frie...@windriver.com> wrote:

> On 05/11/2015 07:13 AM, Attila Fazekas wrote:
>> From: John Garbutt <j...@johngarbutt.com>
>>> * From the RPC api point of view, do we want to send a cast to
>>> something that we know is dead, maybe we want to? Should we wait
>>> for calls to timeout, or give up quicker?
>>
>> How to fail sooner: https://bugs.launchpad.net/oslo.messaging/+bug/1437955
>> We do not need a dedicated is_up just for this.
>
> Is that really going to help? As I understand it if nova-compute dies
> (or is isolated) then the queue remains present on the server but
> nothing will process messages from it.
>
> Chris
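For reference, the shape of the inconsistency in the nova code paths Vilobh lists above is roughly the following: the list of services comes from a direct DB table scan, while the up/down answer comes from whichever servicegroup driver is configured. This is a minimal sketch assuming kilo-era imports (nova.objects.ServiceList.get_all and nova.servicegroup.API().service_is_up); it is illustrative only, not a copy of any of the linked functions:

    from nova import objects
    from nova import servicegroup

    servicegroup_api = servicegroup.API()

    def list_compute_services(context):
        # The listing comes straight from a DB table scan...
        services = objects.ServiceList.get_all(context)
        # ...while the up/down answer comes from the configured servicegroup
        # driver (DB, memcache, zookeeper, ...), so the two views can disagree.
        return [(svc.host, servicegroup_api.service_is_up(svc))
                for svc in services]

The blueprint above is essentially about routing the first half (the listing and enable/disable/delete paths) through the same servicegroup interface as the second half.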
Re: [openstack-dev] [nova] Service group foundations and features
On 11/06/2015 18:52, Vilobh Meshram wrote:
> Few more places which can trigger inconsistent behaviour:
> - https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/services.py#L44
> - https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hypervisors.py#L98
> - https://github.com/openstack/nova/blob/stable/kilo/nova/availability_zones.py#L130
> - https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/availability_zone.py#L68
> - https://github.com/openstack/nova/blob/stable/kilo/nova/api/openstack/compute/contrib/hosts.py#L88-L89
> - https://github.com/openstack/nova/blob/stable/kilo/nova/compute/api.py#L3399-L3421
>
> Blueprint which plans to fix this: https://blueprints.launchpad.net/nova/+spec/servicegroup-api-control-plane
>
> Related specs:
> 1) https://review.openstack.org/#/c/190322/
> 2) https://review.openstack.org/#/c/138607/
>
> -Vilobh

tl;dr: checking a Service (is_up) should only be about making sure we can send a message to it, not about checking whether the related hypervisor(s) is/are up. Having a reference in the services table mapping 1:1 to a reference in a separate datastore is fine by me.

So, I'm going to review the specs above and leave my comments there. That said, I also want to point out a humble opinion about what the relationship should be between a Service and what could be called the ServiceGroup API (badly named IMHO, since it only checks a service, not a group ;-) ).

From my perspective, the Service object is related to the AMQP service tied to the queue and... that's it. It has nothing to do with a hypervisor (since hypervisors can be distributed behind a single service). It only represents the single point of failure for messages sent to a nova-compute service (and not to a compute node, remember the distributed case), and since this is the only way to communicate with the related hypervisor(s), we have to know its status. Again, that doesn't necessarily imply that if the service (which listens to the AMQP queue) is up, the hypervisors are up as well, but it is strong enough to say that if it is down, we are sure the hypervisor(s) won't receive messages. Whether the hypervisor keeps working while the service is down is a corner case that the service status should not cover, IMHO.

That's exactly why we need to consider the service as a reference which can be used as-is for any relationship with a list of hypervisors (call that ComputeNode now), and checking its state (using any driver for it) should only be used for knowing whether a message can be sent to it - *and not for checking whether the related hypervisor(s) are running or not*.

Given that disclaimer (which implies that we need to be very clear about when we ask is_up(service)), I'm fine with considering the reference stored in the DB (i.e. the services table) as only a list of references pointing to a separate object which can be stored in any datastore (DB/memcache/ZK/pick your favorite). The only thing we need to make sure of is that there is a 1:1 mapping between the two objects (e.g. the DB service row and the datastored object), which can only be done logically.
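A minimal sketch of that distinction - is_up() gating only whether we try to message the service, and saying nothing about hypervisor health. cast_to_service() is an invented helper for illustration; servicegroup.API().service_is_up() and ComputeServiceUnavailable mirror kilo-era nova names, but treat the whole thing as an assumption rather than existing code:

    from nova import exception
    from nova import servicegroup

    servicegroup_api = servicegroup.API()

    def cast_to_service(rpc_client, context, service, method, **kwargs):
        # The only question we ask the servicegroup layer: can a message
        # reach the nova-compute service listening on this queue?
        if not servicegroup_api.service_is_up(service):
            raise exception.ComputeServiceUnavailable(host=service.host)
        # Says nothing about the hypervisor(s) behind the service; their
        # health is a ComputeNode concern, not a Service one.
        rpc_client.cast(context, method, **kwargs)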
My 2 cts,
-Sylvain

On Mon, May 11, 2015 at 8:08 AM, Chris Friesen <chris.frie...@windriver.com> wrote:

> On 05/11/2015 07:13 AM, Attila Fazekas wrote:
>> From: John Garbutt <j...@johngarbutt.com>
>>> * From the RPC api point of view, do we want to send a cast to
>>> something that we know is dead, maybe we want to? Should we wait
>>> for calls to timeout, or give up quicker?
>>
>> How to fail sooner: https://bugs.launchpad.net/oslo.messaging/+bug/1437955
>> We do not need a dedicated is_up just for this.
>
> Is that really going to help? As I understand it if nova-compute dies
> (or is isolated) then the queue remains present on the server but
> nothing will process messages from it.
>
> Chris
Re: [openstack-dev] [nova] Service group foundations and features
----- Original Message -----
From: John Garbutt j...@johngarbutt.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Saturday, May 9, 2015 1:18:48 PM
Subject: Re: [openstack-dev] [nova] Service group foundations and features

On 7 May 2015 at 22:52, Joshua Harlow harlo...@outlook.com wrote:
> Hi all,
>
> In seeing the following:
>
> - https://review.openstack.org/#/c/169836/
> - https://review.openstack.org/#/c/163274/
> - https://review.openstack.org/#/c/138607/
>
> Vilobh and I are starting to come to the conclusion that the service group
> layers in nova really need to be cleaned up (without adding more features
> that only work in one driver), or removed, or other... Spec [0] has
> interesting findings on this. A summary/highlights:
>
> * The zookeeper service driver in nova has probably been broken for one or
>   more releases, due to eventlet attributes that are gone but which it was
>   using via the evzookeeper [1] library. Evzookeeper only works with
>   eventlet 0.17.1. Please refer to [0] for details.
> * The memcache service driver really only uses memcache for a tiny piece of
>   the service liveness information (and does a database service table scan
>   to get the list of services). Please refer to [0] for details.
> * Nova-manage service disable (CLI admin API) does interact with the
>   service group layer for the 'is_up' [3] API (but it also does a database
>   service table scan [4] to get the list of services, so this is
>   inconsistent with the service group driver API 'get_all' [2] view on what
>   is enabled/disabled). Please refer to [9][10] for nova-manage service
>   enable/disable details.
> * Nova service delete (REST API) seems to follow a similar broken pattern
>   (it also avoids calling into the service group layer to delete a service,
>   which means it only works with the database layer [5], and is therefore
>   inconsistent with the service group 'get_all' [2] API).
>
> Doing the above makes both disable/delete agnostic about other backends
> that may/might manage service group data, for example zookeeper, memcache,
> redis etc... Please refer to [6][7] for details. Ideally the API should
> follow the model used in [8] so that the extension, the admin interface and
> the API interface all use the same servicegroup interface, which should be
> *fully* responsible for managing services. Doing so gives us a consistent
> view of service data, liveness, disabled/enabled and so on...
>
> So, with no disrespect to the authors of 169836 and 163274 (or anyone else
> involved), I am wondering if we can put in a request to figure out how to
> get the foundation of the service group concepts stabilized (or other...)
> before adding more features (that only work with the DB layer). What is the
> path to request some kind of larger coordination effort by the nova folks
> to fix the service group layers (and the concepts that are not
> disjoint/don't work across them) before continuing to add features on top
> of a 'shaky' foundation?
>
> If I could propose something, it would probably work out like the following:
>
> Step 0: Figure out if the service group API + layer(s) should be
> maintained/tweaked at all (nova-core decides?)
>
> If we maintain it:
> - Have an agreement that the nova service extension, the admin interface
>   (nova-manage) and the API go through a common path for update/delete/read.
>   * This common path should likely be the servicegroup API, so as to have a
>     consistent view of the data; that also helps nova add different
>     data-stores (keeping the services data in a DB and getting numerous
>     liveness updates every few seconds from N computes, where N is pretty
>     high, can be detrimental to nova's performance).
> - At the same time, allow 163274 to be worked on (since it fixes an edge
>   case that was asked about in the initial addition of the delete API in
>   its initial code commit @ https://review.openstack.org/#/c/39998/).
> - Delay 169836 until the above two/three are fixed (and stabilized); its
>   'down' concept (and all other usages of services that hit the database,
>   mentioned above) will need to go through the same service group
>   foundation that is currently being skipped.
>
> Else:
> - Discard 138607 and start removing the service group code (and just use
>   the DB for all the things).
> - Allow 163274 and 138607 (since those would be additions on top of the DB
>   layer that will be preserved).
>
> Thoughts?

I wonder about this approach:

* I think we need to go back and document what we want from the service group concept.
* Then we look at the best approach to implement that concept.
* Then look at the best way to get to a happy place from where we are now,
** noting we will need live upgrade for (at least) the most widely used drivers.

Does that make any sense?

Things that pop into my head include:

* The operators have been asking questions like: Should new services
Re: [openstack-dev] [nova] Service group foundations and features
On 05/11/2015 07:13 AM, Attila Fazekas wrote:
> From: John Garbutt j...@johngarbutt.com
>> * From the RPC api point of view, do we want to send a cast to
>> something that we know is dead, maybe we want to? Should we wait
>> for calls to timeout, or give up quicker?
>
> How to fail sooner: https://bugs.launchpad.net/oslo.messaging/+bug/1437955
> We do not need a dedicated is_up just for this.

Is that really going to help? As I understand it, if nova-compute dies (or is isolated) then the queue remains present on the server but nothing will process messages from it.

Chris
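To make the "give up quicker" option concrete: with oslo.messaging a caller can already shorten the per-call timeout via prepare(), but that only helps for call() - a cast to an abandoned queue produces no error at all, which is Chris's point. A rough sketch, with a placeholder transport URL, target and 'ping' method name (none of them taken from nova):

    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_transport(cfg.CONF)  # e.g. rabbit://...
    target = messaging.Target(topic='compute', server='compute-host-1')
    client = messaging.RPCClient(transport, target)

    def quick_check(context):
        # call() raises MessagingTimeout after 5s instead of waiting for the
        # configured rpc_response_timeout; a cast to a queue nobody consumes,
        # by contrast, never fails -- the message just sits there.
        return client.prepare(timeout=5).call(context, 'ping')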
Re: [openstack-dev] [nova] Service group foundations and features
From: ext Chris Friesen [mailto:chris.frie...@windriver.com]
Sent: Monday, May 11, 2015 6:09 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [nova] Service group foundations and features

> On 05/11/2015 07:13 AM, Attila Fazekas wrote:
>> From: John Garbutt j...@johngarbutt.com
>>> * From the RPC api point of view, do we want to send a cast to
>>> something that we know is dead, maybe we want to? Should we wait
>>> for calls to timeout, or give up quicker?
>>
>> How to fail sooner: https://bugs.launchpad.net/oslo.messaging/+bug/1437955
>> We do not need a dedicated is_up just for this.
>
> Is that really going to help? As I understand it if nova-compute dies
> (or is isolated) then the queue remains present on the server but
> nothing will process messages from it.
>
> Chris

For queued messages: if the forced_down flag proposed in https://review.openstack.org/#/c/169836/ is set to true, the "I'm up" message should be ignored until forced_down is cleared, since the flag is also there to prevent the service state from turning 'up'. So the flag gives a fast way to declare a service down in order to enable evacuate (and to prevent scheduling VMs to the host), but also to prevent an invalid state - it could even abort nova-compute startup, as mentioned in the review comments. This should make things quite safe.

-Tomi
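Something like the following is one way to read that rule. Only forced_down comes from the 169836 review; handle_service_heartbeat(), last_seen_up and the exact semantics are invented for this sketch and are not how nova's current drivers record liveness:

    from oslo_utils import timeutils

    def handle_service_heartbeat(context, service):
        # An operator has forced the service down (e.g. to unblock evacuate):
        # drop the liveness report instead of refreshing it, so the service
        # cannot flip back to 'up' until the flag is cleared.
        if getattr(service, 'forced_down', False):
            return False
        service.last_seen_up = timeutils.utcnow()
        service.save()
        return True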
Re: [openstack-dev] [nova] Service group foundations and features
On 7 May 2015 at 22:52, Joshua Harlow harlo...@outlook.com wrote:
> Hi all,
>
> In seeing the following:
>
> - https://review.openstack.org/#/c/169836/
> - https://review.openstack.org/#/c/163274/
> - https://review.openstack.org/#/c/138607/
>
> Vilobh and I are starting to come to the conclusion that the service group
> layers in nova really need to be cleaned up (without adding more features
> that only work in one driver), or removed, or other... Spec [0] has
> interesting findings on this. A summary/highlights:
>
> * The zookeeper service driver in nova has probably been broken for one or
>   more releases, due to eventlet attributes that are gone but which it was
>   using via the evzookeeper [1] library. Evzookeeper only works with
>   eventlet 0.17.1. Please refer to [0] for details.
> * The memcache service driver really only uses memcache for a tiny piece of
>   the service liveness information (and does a database service table scan
>   to get the list of services). Please refer to [0] for details.
> * Nova-manage service disable (CLI admin API) does interact with the
>   service group layer for the 'is_up' [3] API (but it also does a database
>   service table scan [4] to get the list of services, so this is
>   inconsistent with the service group driver API 'get_all' [2] view on what
>   is enabled/disabled). Please refer to [9][10] for nova-manage service
>   enable/disable details.
> * Nova service delete (REST API) seems to follow a similar broken pattern
>   (it also avoids calling into the service group layer to delete a service,
>   which means it only works with the database layer [5], and is therefore
>   inconsistent with the service group 'get_all' [2] API).
>
> Doing the above makes both disable/delete agnostic about other backends
> that may/might manage service group data, for example zookeeper, memcache,
> redis etc... Please refer to [6][7] for details. Ideally the API should
> follow the model used in [8] so that the extension, the admin interface and
> the API interface all use the same servicegroup interface, which should be
> *fully* responsible for managing services. Doing so gives us a consistent
> view of service data, liveness, disabled/enabled and so on...
>
> So, with no disrespect to the authors of 169836 and 163274 (or anyone else
> involved), I am wondering if we can put in a request to figure out how to
> get the foundation of the service group concepts stabilized (or other...)
> before adding more features (that only work with the DB layer). What is the
> path to request some kind of larger coordination effort by the nova folks
> to fix the service group layers (and the concepts that are not
> disjoint/don't work across them) before continuing to add features on top
> of a 'shaky' foundation?
>
> If I could propose something, it would probably work out like the following:
>
> Step 0: Figure out if the service group API + layer(s) should be
> maintained/tweaked at all (nova-core decides?)
>
> If we maintain it:
> - Have an agreement that the nova service extension, the admin interface
>   (nova-manage) and the API go through a common path for update/delete/read.
>   * This common path should likely be the servicegroup API, so as to have a
>     consistent view of the data; that also helps nova add different
>     data-stores (keeping the services data in a DB and getting numerous
>     liveness updates every few seconds from N computes, where N is pretty
>     high, can be detrimental to nova's performance).
> - At the same time, allow 163274 to be worked on (since it fixes an edge
>   case that was asked about in the initial addition of the delete API in
>   its initial code commit @ https://review.openstack.org/#/c/39998/).
> - Delay 169836 until the above two/three are fixed (and stabilized); its
>   'down' concept (and all other usages of services that hit the database,
>   mentioned above) will need to go through the same service group
>   foundation that is currently being skipped.
>
> Else:
> - Discard 138607 and start removing the service group code (and just use
>   the DB for all the things).
> - Allow 163274 and 138607 (since those would be additions on top of the DB
>   layer that will be preserved).
>
> Thoughts?

I wonder about this approach:

* I think we need to go back and document what we want from the service group concept.
* Then we look at the best approach to implement that concept.
* Then look at the best way to get to a happy place from where we are now,
** noting we will need live upgrade for (at least) the most widely used drivers.

Does that make any sense?

Things that pop into my head include:

* The operators have been asking questions like: "Should new services not be disabled by default?" and "Can't my admins tell you that I just killed it?"
* And from the scheduler point of view, how do we interact with the provider that tells us if something is alive or not?
* From the RPC api point of view, do we want to send a cast to something that we know is dead, maybe we want to? Should we wait for calls to timeout, or give up quicker?
* Polling the DB kinda sucks,
[openstack-dev] [nova] Service group foundations and features
Hi all,

In seeing the following:

- https://review.openstack.org/#/c/169836/
- https://review.openstack.org/#/c/163274/
- https://review.openstack.org/#/c/138607/

Vilobh and I are starting to come to the conclusion that the service group layers in nova really need to be cleaned up (without adding more features that only work in one driver), or removed, or other... Spec [0] has interesting findings on this. A summary/highlights:

* The zookeeper service driver in nova has probably been broken for one or more releases, due to eventlet attributes that are gone but which it was using via the evzookeeper [1] library. Evzookeeper only works with eventlet 0.17.1. Please refer to [0] for details.
* The memcache service driver really only uses memcache for a tiny piece of the service liveness information (and does a database service table scan to get the list of services). Please refer to [0] for details.
* Nova-manage service disable (CLI admin API) does interact with the service group layer for the 'is_up' [3] API (but it also does a database service table scan [4] to get the list of services, so this is inconsistent with the service group driver API 'get_all' [2] view on what is enabled/disabled). Please refer to [9][10] for nova-manage service enable/disable details.
* Nova service delete (REST API) seems to follow a similar broken pattern (it also avoids calling into the service group layer to delete a service, which means it only works with the database layer [5], and is therefore inconsistent with the service group 'get_all' [2] API).

Doing the above makes both disable/delete agnostic about other backends that may/might manage service group data, for example zookeeper, memcache, redis etc... Please refer to [6][7] for details. Ideally the API should follow the model used in [8] so that the extension, the admin interface and the API interface all use the same servicegroup interface, which should be *fully* responsible for managing services. Doing so gives us a consistent view of service data, liveness, disabled/enabled and so on...

So, with no disrespect to the authors of 169836 and 163274 (or anyone else involved), I am wondering if we can put in a request to figure out how to get the foundation of the service group concepts stabilized (or other...) before adding more features (that only work with the DB layer). What is the path to request some kind of larger coordination effort by the nova folks to fix the service group layers (and the concepts that are not disjoint/don't work across them) before continuing to add features on top of a 'shaky' foundation?

If I could propose something, it would probably work out like the following:

Step 0: Figure out if the service group API + layer(s) should be maintained/tweaked at all (nova-core decides?)

If we maintain it:
- Have an agreement that the nova service extension, the admin interface (nova-manage) and the API go through a common path for update/delete/read.
  * This common path should likely be the servicegroup API, so as to have a consistent view of the data; that also helps nova add different data-stores (keeping the services data in a DB and getting numerous liveness updates every few seconds from N computes, where N is pretty high, can be detrimental to nova's performance).
- At the same time, allow 163274 to be worked on (since it fixes an edge case that was asked about in the initial addition of the delete API in its initial code commit @ https://review.openstack.org/#/c/39998/).
- Delay 169836 until the above two/three are fixed (and stabilized); its 'down' concept (and all other usages of services that hit the database, mentioned above) will need to go through the same service group foundation that is currently being skipped.

Else:
- Discard 138607 and start removing the service group code (and just use the DB for all the things).
- Allow 163274 and 138607 (since those would be additions on top of the DB layer that will be preserved).

Thoughts?

- Josh (and Vilobh, who is spending the most time on this recently)

[0] Replace service group with tooz: https://review.openstack.org/#/c/138607/
[1] https://pypi.python.org/pypi/evzookeeper/
[2] https://github.com/openstack/nova/blob/stable/kilo/nova/servicegroup/api.py#L93
[3] https://github.com/openstack/nova/blob/stable/kilo/nova/servicegroup/api.py#L87
[4] https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L711
[5] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L106
[6] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L107
[7] https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3436
[8] https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/contrib/services.py#L61
[9] Nova manage enable: https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L742
[10] Nova manage disable:
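For a sense of what the direction in [0] could look like in practice, here is a minimal sketch of service liveness as tooz group membership rather than periodic DB writes; the backend URL, member id and group name are placeholders and error handling is trimmed, so treat it as an assumption about the spec's approach rather than its implementation:

    from tooz import coordination

    # One coordinator per service process; the member id identifies it.
    coordinator = coordination.get_coordinator('memcached://127.0.0.1:11211',
                                               b'nova-compute:host-1')
    coordinator.start()

    group = b'nova-compute'
    try:
        coordinator.create_group(group).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(group).get()

    # The service keeps heartbeating (e.g. from its periodic report loop);
    # 'get_all' then becomes a group-membership lookup instead of a services
    # table scan, with the services table left as the 1:1 reference Sylvain
    # describes earlier in the thread.
    coordinator.heartbeat()
    members = coordinator.get_members(group).get()

The memcache and zookeeper drivers would then become thin wrappers over whichever tooz backend the operator configures, instead of each re-implementing part of the liveness logic.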