Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-27 Thread Matt Riedemann

On 1/26/2017 8:40 AM, Matt Riedemann wrote:


And circling back on *that*, we've agreed to introduce a new service
version for the compute to indicate whether it's Ocata or not. Then we'll:

* check in the scheduler whether the minimum compute service version is Ocata,
* if the minimum is Ocata, then use placement; otherwise fall back to the old
resource tracker data in the compute_nodes table (sketched below) - then we
remove that fallback in Pike.
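
To illustrate the idea, here is a rough Python sketch of that gate. The
version constant and helper names are hypothetical stand-ins, not the
actual nova code:

    # Illustrative sketch only; names are made up for this example.
    MIN_OCATA_COMPUTE_VERSION = 16  # assumed value for illustration

    def pick_host_source(min_compute_version, placement_hosts,
                         compute_node_hosts):
        """Choose where the scheduler gets its candidate hosts from.

        placement_hosts: hosts built from placement resource providers.
        compute_node_hosts: hosts built from the legacy compute_nodes table.
        """
        if min_compute_version >= MIN_OCATA_COMPUTE_VERSION:
            # Every compute reports at least the Ocata service version,
            # so trust the placement API results.
            return placement_hosts
        # At least one pre-Ocata compute is still running: fall back to
        # the old resource tracker data (this branch goes away in Pike).
        return compute_node_hosts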

We'll also have a check for the placement config during init_host on the
ocata compute such that if you are upgrading to ocata code for the
compute but don't have placement configured, it's a hard fail and the
nova-compute service is going to die.
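
A minimal sketch of that hard fail, assuming illustrative option and
function names rather than the exact nova implementation:

    import sys

    def assert_placement_configured(conf):
        """Meant to be called from the compute manager's init_host()."""
        placement = getattr(conf, 'placement', None)
        if placement is None or not getattr(placement, 'auth_type', None):
            sys.stderr.write(
                'nova-compute on Ocata requires the [placement] section '
                'of nova.conf to be configured.\n')
            # Hard fail instead of silently running without reporting
            # resources to the placement service.
            raise SystemExit(1)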

I'm pretty sure we've come full circle on this now.



Just an update:

The two nova changes for the filter scheduler using placement are 
passing CI testing now and are approved. They are held up from being 
merged by the grenade change at the bottom of the series, which 
installs placement before upgrading to Ocata and configures nova-compute:


https://review.openstack.org/424730

We need that in as soon as possible, so Sean, if you're reading this, 
you know what to do.


--

Thanks,

Matt Riedemann

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Jay Pipes

On 01/26/2017 09:14 AM, Ed Leafe wrote:

On Jan 26, 2017, at 7:50 AM, Sylvain Bauza  wrote:


That's where I think we have another problem, which is bigger than the
corner case you mentioned above: when upgrading from Newton to Ocata,
we said that all Newton computes have to be upgraded to the latest point
release. Great. But we forgot to identify that it would also require
*modifying* their nova.conf so they would be able to call the placement API.

That looks to me like more than just a rolling upgrade mechanism. In theory,
a rolling upgrade process accepts that N-1 versioned computes can talk
to N versioned other services. That doesn't imply a necessary
configuration change (except the upgrade_levels flag) on the computes to
achieve that, right?

http://docs.openstack.org/developer/nova/upgrade.html


Reading that page: "At this point, you must also ensure you update the 
configuration, to stop using any deprecated features or options, and perform any 
required work to transition to alternative features.”

So yes, "updating your configuration” is an expected action. I’m not sure why 
this is so alarming.


Me neither.

-jay



Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Matt Riedemann

On 1/26/2017 7:41 AM, Sylvain Bauza wrote:


Circling back to the problem as time flies. As the patch Matt proposed
for option #4 is not fully working yet, I'm implementing option #3 by
making the HostManager.get_filtered_hosts() method resilient to the
fact that there may be no hosts given by the placement API, if and only
if the user asked for forced destinations.

-Sylvain


And circling back on *that*, we've agreed to introduce a new service 
version for the compute to indicate whether it's Ocata or not. Then we'll:


* check in the scheduler whether the minimum compute service version is Ocata,
* if the minimum is Ocata, then use placement; otherwise fall back to the old
resource tracker data in the compute_nodes table - then we remove that
fallback in Pike.


We'll also have a check for the placement config during init_host on the 
ocata compute such that if you are upgrading to ocata code for the 
compute but don't have placement configured, it's a hard fail and the 
nova-compute service is going to die.


I'm pretty sure we've come full circle on this now.

--

Thanks,

Matt Riedemann



Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Sylvain Bauza


Le 26/01/2017 05:42, Matt Riedemann a écrit :
> This is my public hand off to Sylvain for the work done tonight.
> 
> Starting with the multinode grenade failure in the nova patch to
> integrate placement with the filter scheduler:
> 
> https://review.openstack.org/#/c/417961/
> 
> The test_schedule_to_all_nodes tempest test was failing in there because
> that test explicitly forces hosts using AZs to build two instances.
> Because we didn't have nova.conf on the Newton subnode in the multinode
> grenade job configured to talk to placement, there was no resource
> provider for that Newton subnode when we started running smoke tests
> after the upgrade to Ocata, so that test failed since the request to the
> subnode had a NoValidHost (because no resource provider was checking in
> from the Newton node).
> 
> Grenade is not topology aware so it doesn't know anything about the
> subnode. When the subnode is stacked, it does so via a post-stack hook
> script that devstack-gate writes into the grenade run, so after stacking
> the primary Newton node, it then uses Ansible to ssh into the subnode
> and stack Newton there too:
> 
> https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L629
> 
> 
> logs.openstack.org/61/417961/26/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/15545e4/logs/grenade.sh.txt.gz#_2017-01-26_00_26_59_296
> 
> 
> And placement was optional in Newton so, you know, problems.
> 
> Some options came to mind:
> 
> 1. Change the test to not be a smoke test which would exclude it from
> running during grenade. QA would barf on this.
> 
> 2. Hack some kind of pre-upgrade callback from d-g into grenade just for
> configuring placement on the compute subnode. This would probably
> require adding a script to devstack just so d-g has something to call so
> we could keep branch logic out of d-g, like what we did for the
> discover_hosts stuff for cells v2. This is more complicated than what I
> wanted to deal with tonight with limited time on my hands.
> 
> 3. Change the nova filter scheduler patch to fall back to getting all compute
> nodes if there are no resource providers. We've already talked about
> this a few times already in other threads and I consider it a safety net
> we'd like to avoid if all else fails. If we did this, we could
> potentially restrict it to just the forced-host case...
> 
> 4. Setup the Newton subnode in the grenade run to configure placement,
> which I think we can do from d-g using the features yaml file. That's
> what I opted to go with and the patch is here:
> 
> https://review.openstack.org/#/c/425524/
> 
> I've made the nova patch dependent on that *and* the other grenade patch
> to install and configure placement on the primary node when upgrading
> from Newton to Ocata.
> 
> -- 
> 
> That's where we're at right now. If #4 fails, I think we are stuck with
> adding a workaround for #3 into Ocata and then remove that in Pike when
> we know/expect computes to be running placement (they would be in our
> grenade runs from ocata->pike at least).
> 

Circling back to the problem as time flies. As the patch Matt proposed
for option #4 is not fully working yet, I'm implementing option #3 by
making the HostManager.get_filtered_hosts() method resilient to the
fact that there may be no hosts given by the placement API, if and only
if the user asked for forced destinations.
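
To make the idea concrete, roughly (simplified stand-in signature, not
the real HostManager code):

    def get_filtered_hosts(all_hosts, placement_hosts, spec, filter_fn):
        """Tolerate missing placement data only for forced destinations.

        all_hosts: every known compute host (legacy compute_nodes view).
        placement_hosts: hosts backed by placement resource providers.
        spec: request spec carrying force_hosts / force_nodes.
        filter_fn: applies the usual filters, including force_hosts handling.
        """
        forced = bool(getattr(spec, 'force_hosts', None) or
                      getattr(spec, 'force_nodes', None))
        if forced and not placement_hosts:
            # The forced compute may be a Newton node that never
            # registered a resource provider, so don't fail the request:
            # fall back to the full host list, as before placement.
            return filter_fn(all_hosts, spec)
        return filter_fn(placement_hosts, spec)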

-Sylvain



Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Sylvain Bauza


Le 26/01/2017 15:14, Ed Leafe a écrit :
> On Jan 26, 2017, at 7:50 AM, Sylvain Bauza  wrote:
>>
>> That's where I think we have another problem, which is bigger than the
>> corner case you mentioned above: when upgrading from Newton to Ocata,
>> we said that all Newton computes have to be upgraded to the latest point
>> release. Great. But we forgot to identify that it would also require
>> *modifying* their nova.conf so they would be able to call the placement API.
>>
>> That looks to me like more than just a rolling upgrade mechanism. In theory,
>> a rolling upgrade process accepts that N-1 versioned computes can talk
>> to N versioned other services. That doesn't imply a necessary
>> configuration change (except the upgrade_levels flag) on the computes to
>> achieve that, right?
>>
>> http://docs.openstack.org/developer/nova/upgrade.html
> 
> Reading that page: "At this point, you must also ensure you update the 
> configuration, to stop using any deprecated features or options, and perform 
> any required work to transition to alternative features.”
> 
> So yes, "updating your configuration” is an expected action. I’m not sure why 
> this is so alarming.
> 

You give that phrase out of context. To give more details, that specific
sentence is related to what you should do *after* having your
maintenance window (i.e. upgrading your controller while your API is
down), and the introduction paragraph mentions that all the bullet items
relate to all the nova services except the hypervisors.

And I'm not alarmed. I'm just trying to identify the correct upgrade
path that we should ask our operators to follow. If that means adding an
extra step beyond the regular upgrade process, then I think everyone
should be aware of it.
Take me, for example: I'm probably exhausted and very narrow-eyed, so I
missed that implication. I apologize for that and I want to clarify it.

-Sylvain

> 
> -- Ed Leafe



Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread John Garbutt
On 26 January 2017 at 14:14, Ed Leafe  wrote:
> On Jan 26, 2017, at 7:50 AM, Sylvain Bauza  wrote:
>>
>> That's where I think we have another problem, which is bigger than the
>> corner case you mentioned above: when upgrading from Newton to Ocata,
>> we said that all Newton computes have to be upgraded to the latest point
>> release. Great. But we forgot to identify that it would also require
>> *modifying* their nova.conf so they would be able to call the placement API.
>>
>> That looks to me like more than just a rolling upgrade mechanism. In theory,
>> a rolling upgrade process accepts that N-1 versioned computes can talk
>> to N versioned other services. That doesn't imply a necessary
>> configuration change (except the upgrade_levels flag) on the computes to
>> achieve that, right?
>>
>> http://docs.openstack.org/developer/nova/upgrade.html
>
> Reading that page: "At this point, you must also ensure you update the 
> configuration, to stop using any deprecated features or options, and perform 
> any required work to transition to alternative features.”
>
> So yes, "updating your configuration” is an expected action. I’m not sure why 
> this is so alarming.

We did make this promise:
https://governance.openstack.org/tc/reference/tags/assert_supports-upgrade.html#requirements

It's bending that configuration requirement a little bit.
That requirement was originally added at the direct request of operators.

Now there is a need to tidy up your configuration after completing the
upgrade to N+1, before upgrading to N+2, but I believe that was assumed
to happen at the end of the N+1 upgrade, using the N+1 release notes.
The idea being that warning messages in the logs, etc., would help get
that all fixed before attempting the next upgrade. But I agree that's not
what the docs are currently saying.

Thanks,
John



Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Ed Leafe
On Jan 26, 2017, at 7:50 AM, Sylvain Bauza  wrote:
> 
> That's where I think we have another problem, which is bigger than the
> corner case you mentioned above: when upgrading from Newton to Ocata,
> we said that all Newton computes have to be upgraded to the latest point
> release. Great. But we forgot to identify that it would also require
> *modifying* their nova.conf so they would be able to call the placement API.
>
> That looks to me like more than just a rolling upgrade mechanism. In theory,
> a rolling upgrade process accepts that N-1 versioned computes can talk
> to N versioned other services. That doesn't imply a necessary
> configuration change (except the upgrade_levels flag) on the computes to
> achieve that, right?
> 
> http://docs.openstack.org/developer/nova/upgrade.html

Reading that page: "At this point, you must also ensure you update the 
configuration, to stop using any deprecated features or options, and perform 
any required work to transition to alternative features.”

So yes, "updating your configuration” is an expected action. I’m not sure why 
this is so alarming.


-- Ed Leafe








Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread John Garbutt
On 26 January 2017 at 13:50, Sylvain Bauza  wrote:
> Le 26/01/2017 05:42, Matt Riedemann a écrit :
>> This is my public hand off to Sylvain for the work done tonight.
>>
>
> Thanks Matt for your help yesterday; it was awesome to count you in even
> though you're personally away.
>
>
>> Starting with the multinode grenade failure in the nova patch to
>> integrate placement with the filter scheduler:
>>
>> https://review.openstack.org/#/c/417961/
>>
>> The test_schedule_to_all_nodes tempest test was failing in there because
>> that test explicitly forces hosts using AZs to build two instances.
>> Because we didn't have nova.conf on the Newton subnode in the multinode
>> grenade job configured to talk to placement, there was no resource
>> provider for that Newton subnode when we started running smoke tests
>> after the upgrade to Ocata, so that test failed since the request to the
>> subnode had a NoValidHost (because no resource provider was checking in
>> from the Newton node).
>>
>
> That's where I think the current implementation is weird: if you force
> the scheduler to return you a destination (without even calling the
> filters) by just verifying that the corresponding service is up, then why
> do you need to get the full list of computes before that?
>
> As far as placement is concerned, if you just *force* the scheduler to
> return you a destination, then why should we verify that the resources
> are happy? FWIW, we now have fully different semantics replacing the
> "force_hosts" thing that I hate: it's called
> RequestSpec.requested_destination, and it actually runs the filters
> only against that destination. No straight bypass of the filters like
> force_hosts does.

That's just a symptom though, as I understand it?

It seems the real problem is that placement isn't configured on the old
node, which by accident is what most deployers are likely to hit if they
didn't set up placement when upgrading last cycle.

>> Grenade is not topology aware so it doesn't know anything about the
>> subnode. When the subnode is stacked, it does so via a post-stack hook
>> script that devstack-gate writes into the grenade run, so after stacking
>> the primary Newton node, it then uses Ansible to ssh into the subnode
>> and stack Newton there too:
>>
>> https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L629
>>
>>
>> logs.openstack.org/61/417961/26/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/15545e4/logs/grenade.sh.txt.gz#_2017-01-26_00_26_59_296
>>
>>
>> And placement was optional in Newton so, you know, problems.
>>
>
> That's where I think we have another problem, which is bigger than the
> corner case you mentioned above: when upgrading from Newton to Ocata,
> we said that all Newton computes have to be upgraded to the latest point
> release. Great. But we forgot to identify that it would also require
> *modifying* their nova.conf so they would be able to call the placement API.
>
> That looks to me like more than just a rolling upgrade mechanism. In theory,
> a rolling upgrade process accepts that N-1 versioned computes can talk
> to N versioned other services. That doesn't imply a necessary
> configuration change (except the upgrade_levels flag) on the computes to
> achieve that, right?
>
> http://docs.openstack.org/developer/nova/upgrade.html

We normally say the config that worked last cycle should be fine.

We probably should have said placement was required last cycle, then
this wouldn't have been an issue.

>> Some options came to mind:
>>
>> 1. Change the test to not be a smoke test which would exclude it from
>> running during grenade. QA would barf on this.
>>
>> 2. Hack some kind of pre-upgrade callback from d-g into grenade just for
>> configuring placement on the compute subnode. This would probably
>> require adding a script to devstack just so d-g has something to call so
>> we could keep branch logic out of d-g, like what we did for the
>> discover_hosts stuff for cells v2. This is more complicated than what I
>> wanted to deal with tonight with limited time on my hands.
>>
>> 3. Change the nova filter scheduler patch to fall back to getting all compute
>> nodes if there are no resource providers. We've already talked about
>> this a few times already in other threads and I consider it a safety net
>> we'd like to avoid if all else fails. If we did this, we could
>> potentially restrict it to just the forced-host case...
>>
>> 4. Setup the Newton subnode in the grenade run to configure placement,
>> which I think we can do from d-g using the features yaml file. That's
>> what I opted to go with and the patch is here:
>>
>> https://review.openstack.org/#/c/425524/
>>
>> I've made the nova patch dependent on that *and* the other grenade patch
>> to install and configure placement on the primary node when upgrading
>> from Newton to Ocata.
>>
>> --
>>
>> That's where we're at right now. If #4 fails, I think we are stuck with
>> adding a workaround for 

Re: [openstack-dev] [nova] Latest and greatest on trying to get n-sch to require placement

2017-01-26 Thread Sylvain Bauza


Le 26/01/2017 05:42, Matt Riedemann a écrit :
> This is my public hand off to Sylvain for the work done tonight.
> 

Thanks Matt for your help yesterday; it was awesome to count you in even
though you're personally away.


> Starting with the multinode grenade failure in the nova patch to
> integrate placement with the filter scheduler:
> 
> https://review.openstack.org/#/c/417961/
> 
> The test_schedule_to_all_nodes tempest test was failing in there because
> that test explicitly forces hosts using AZs to build two instances.
> Because we didn't have nova.conf on the Newton subnode in the multinode
> grenade job configured to talk to placement, there was no resource
> provider for that Newton subnode when we started running smoke tests
> after the upgrade to Ocata, so that test failed since the request to the
> subnode had a NoValidHost (because no resource provider was checking in
> from the Newton node).
> 

That's where I think the current implementation is weird: if you force
the scheduler to return you a destination (without even calling the
filters) by just verifying that the corresponding service is up, then why
do you need to get the full list of computes before that?

As far as placement is concerned, if you just *force* the scheduler to
return you a destination, then why should we verify that the resources
are happy? FWIW, we now have fully different semantics replacing the
"force_hosts" thing that I hate: it's called
RequestSpec.requested_destination, and it actually runs the filters
only against that destination. No straight bypass of the filters like
force_hosts does.
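
To illustrate the difference between the two semantics, a rough sketch
(simplified stand-in logic, not the actual scheduler code):

    def pick_destination(hosts, spec, run_filters):
        """run_filters(candidates, spec) applies the enabled filters."""
        if getattr(spec, 'requested_destination', None):
            # requested_destination: restrict the candidate set to that
            # one host, but still run every filter against it.
            candidates = [h for h in hosts
                          if h.hostname == spec.requested_destination.host]
            return run_filters(candidates, spec)
        if getattr(spec, 'force_hosts', None):
            # force_hosts: pick the host directly and bypass the
            # filters entirely.
            return [h for h in hosts if h.hostname in spec.force_hosts]
        return run_filters(hosts, spec)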

> Grenade is not topology aware so it doesn't know anything about the
> subnode. When the subnode is stacked, it does so via a post-stack hook
> script that devstack-gate writes into the grenade run, so after stacking
> the primary Newton node, it then uses Ansible to ssh into the subnode
> and stack Newton there too:
> 
> https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate.sh#L629
> 
> 
> logs.openstack.org/61/417961/26/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/15545e4/logs/grenade.sh.txt.gz#_2017-01-26_00_26_59_296
> 
> 
> And placement was optional in Newton so, you know, problems.
> 

That's where I think we have another problem, which is bigger than the
corner case you mentioned above: when upgrading from Newton to Ocata,
we said that all Newton computes have to be upgraded to the latest point
release. Great. But we forgot to identify that it would also require
*modifying* their nova.conf so they would be able to call the placement API.

That looks to me like more than just a rolling upgrade mechanism. In theory,
a rolling upgrade process accepts that N-1 versioned computes can talk
to N versioned other services. That doesn't imply a necessary
configuration change (except the upgrade_levels flag) on the computes to
achieve that, right?

http://docs.openstack.org/developer/nova/upgrade.html


> Some options came to mind:
> 
> 1. Change the test to not be a smoke test which would exclude it from
> running during grenade. QA would barf on this.
> 
> 2. Hack some kind of pre-upgrade callback from d-g into grenade just for
> configuring placement on the compute subnode. This would probably
> require adding a script to devstack just so d-g has something to call so
> we could keep branch logic out of d-g, like what we did for the
> discover_hosts stuff for cells v2. This is more complicated than what I
> wanted to deal with tonight with limited time on my hands.
> 
> 3. Change the nova filter scheduler patch to fall back to getting all compute
> nodes if there are no resource providers. We've already talked about
> this a few times already in other threads and I consider it a safety net
> we'd like to avoid if all else fails. If we did this, we could
> potentially restrict it to just the forced-host case...
> 
> 4. Setup the Newton subnode in the grenade run to configure placement,
> which I think we can do from d-g using the features yaml file. That's
> what I opted to go with and the patch is here:
> 
> https://review.openstack.org/#/c/425524/
> 
> I've made the nova patch dependent on that *and* the other grenade patch
> to install and configure placement on the primary node when upgrading
> from Newton to Ocata.
> 
> -- 
> 
> That's where we're at right now. If #4 fails, I think we are stuck with
> adding a workaround for #3 into Ocata and then remove that in Pike when
> we know/expect computes to be running placement (they would be in our
> grenade runs from ocata->pike at least).
> 


Given the above two problems that I stated, I think I'm now in favor of a
#3 approach that would do the following (see the sketch after this list):

 - modify the scheduler so that it's acceptable to have the placement
API return nothing if you force hosts

 - modify the scheduler so that, in the event of an empty list returned
by the placement API, it falls back to getting the list of all computes
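
Reading the two bullets together, the behaviour would roughly be the
following (hypothetical helper names; only a sketch of the intent):

    def hosts_for_request(ctxt, spec, placement_client, host_manager):
        # Ask placement which resource providers can satisfy the request;
        # the result may be empty on a partially upgraded cloud.
        provider_uuids = placement_client.get_provider_uuids(spec)
        if provider_uuids:
            return host_manager.get_host_states_by_uuids(ctxt, provider_uuids)
        if getattr(spec, 'force_hosts', None) or getattr(spec, 'force_nodes', None):
            # Forced destination: the target may be a Newton compute that
            # never registered with placement, so fall back to every
            # compute known to the compute_nodes table.
            return host_manager.get_all_host_states(ctxt)
        # Nothing forced and placement knows nothing: genuine NoValidHost.
        return []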


That still leaves the problem where a few computes are not all