Re: [openstack-dev] [nova] Is there any reason to exclude originally failed build hosts during live migration?

2017-09-20 Thread Sylvain Bauza
On Wed, Sep 20, 2017 at 10:15 PM, melanie witt wrote:

> On Wed, 20 Sep 2017 13:47:18 -0500, Matt Riedemann wrote:
>
>> Presumably there was a good reason why the instance failed to build on a
>> host originally, but that could be for any number of reasons: resource
>> claim failed during a race, configuration issues, etc. Since we don't
>> really know what originally happened, it seems reasonable to not exclude
>> originally attempted build targets since the scheduler filters should still
>> validate them during live migration (this is all assuming you're not using
>> the 'force' flag with live migration - and if you are, all bets are off).
>>
>
> Yeah, I think because an original failure to build could have been a
> failed claim during a race, config issue, or just been a very long time
> ago, we shouldn't continue to exclude those hosts forever.
>
>> If people agree with doing this fix, then we also have to consider making
>> a similar fix for other move operations like cold migrate, evacuate and
>> unshelve. However, out of those other move operations, only cold migrate
>> attempts any retries. If evacuate or unshelve fail on the target host,
>> there is no retry.
>>
>
> I agree with doing that fix for all of the move operations.
>
>
Yeah, a host could have been failing when we created that instance a year
ago; that doesn't mean the host won't be available this time.

> -melanie
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


Re: [openstack-dev] [nova] Is there any reason to exclude originally failed build hosts during live migration?

2017-09-20 Thread melanie witt

On Wed, 20 Sep 2017 13:47:18 -0500, Matt Riedemann wrote:
> Presumably there was a good reason why the instance failed to build on a
> host originally, but that could be for any number of reasons: resource
> claim failed during a race, configuration issues, etc. Since we don't
> really know what originally happened, it seems reasonable to not exclude
> originally attempted build targets since the scheduler filters should
> still validate them during live migration (this is all assuming you're
> not using the 'force' flag with live migration - and if you are, all
> bets are off).


Yeah, I think because an original failure to build could have been a 
failed claim during a race, config issue, or just been a very long time 
ago, we shouldn't continue to exclude those hosts forever.


> If people agree with doing this fix, then we also have to consider
> making a similar fix for other move operations like cold migrate,
> evacuate and unshelve. However, out of those other move operations, only
> cold migrate attempts any retries. If evacuate or unshelve fail on the
> target host, there is no retry.


I agree with doing that fix for all of the move operations.

-melanie



Re: [openstack-dev] [nova] Is there any reason to exclude originally failed build hosts during live migration?

2017-09-20 Thread Chris Friesen

On 09/20/2017 12:47 PM, Matt Riedemann wrote:


> I wanted to bring it up here in case anyone had a good reason why we should not
> continue to exclude originally failed hosts during live migration, even if the
> admin is specifying one of those hosts for the live migration destination.
>
> Presumably there was a good reason why the instance failed to build on a host
> originally, but that could be for any number of reasons: resource claim failed
> during a race, configuration issues, etc. Since we don't really know what
> originally happened, it seems reasonable to not exclude originally attempted
> build targets since the scheduler filters should still validate them during live
> migration (this is all assuming you're not using the 'force' flag with live
> migration - and if you are, all bets are off).


As you say, a failure on a host during the original instance creation (which 
could have been a long time ago) is not a reason to bypass that host during 
subsequent operations.


In other words, I think the list of hosts to ignore should be scoped to a single 
"operation" that requires scheduling (which would include any necessary 
rescheduling for that "operation").
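
A minimal sketch of that scoping idea, assuming a simplified request spec.
The RequestSpec class, its attempted_hosts field and the
spec_for_new_operation helper are hypothetical names used only for
illustration; they are not nova's actual objects or the code in the
proposed fix.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class RequestSpec:
    """Hypothetical stand-in for a persisted request spec."""
    instance_uuid: str
    # Hosts already tried within one scheduling operation.
    attempted_hosts: List[str] = field(default_factory=list)


def spec_for_new_operation(persisted: RequestSpec) -> RequestSpec:
    """Return a copy whose attempted-host list starts empty.

    Each operation that needs scheduling (build, live migrate, ...) gets
    its own list, so hosts that failed in an earlier operation are not
    excluded from this one.
    """
    return replace(persisted, attempted_hosts=[])


# The original build tried "compute1" and failed there.
persisted = RequestSpec(instance_uuid="abc", attempted_hosts=["compute1"])

# A later live migration starts with a clean list...
migration_spec = spec_for_new_operation(persisted)

# ...and any reschedule within that migration is recorded only on its own copy.
migration_spec.attempted_hosts.append("compute7")
```

With this shape, "compute1" is a valid candidate again for the live
migration, while "compute7" is skipped only for the remainder of that one
migration.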


Chris



[openstack-dev] [nova] Is there any reason to exclude originally failed build hosts during live migration?

2017-09-20 Thread Matt Riedemann

It's a weird question, so I'll explain.

An issue came up in IRC today where someone was trying to live migrate 
an instance to a specified host, and the RetryFilter in the scheduler 
was kicking out the specified host, even though other similar instances 
were live migrating to that specified host successfully.


After some DB debugging, we figured out that the instance that failed to 
live migrate has a persisted request spec which listed the specified 
host as an originally attempted host during the initial instance create. 
The RetryFilter was tripping up on this during live migration saying, 
essentially, "you've already tried that host, sorry".
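
To illustrate the behavior described above, here is a simplified sketch of
a retry-style filter check. It is not the real RetryFilter code; the
RequestSpec class, its attempted_hosts field and the retry_filter_passes
function are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestSpec:
    """Hypothetical stand-in for the persisted request spec."""
    # Hosts tried during earlier scheduling; None means no retry info.
    attempted_hosts: Optional[List[str]] = None


def retry_filter_passes(host: str, spec: RequestSpec) -> bool:
    """Reject any host the request spec lists as already attempted."""
    if not spec.attempted_hosts:
        return True
    return host not in spec.attempted_hosts


# Spec persisted at build time: "compute1" was attempted and failed.
stale_spec = RequestSpec(attempted_hosts=["compute1"])
# Spec of a similar instance that never had a failed build attempt.
clean_spec = RequestSpec()

rejected = retry_filter_passes("compute1", stale_spec)
accepted = retry_filter_passes("compute1", clean_spec)
```

Here the same destination host is rejected for the instance with the stale
attempted-host list and accepted for the other one, which matches the
symptom seen in IRC.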


This was confusing because the live migration task in conductor actually 
manually handles retries if pre-migration checks fail on the selected 
destination host. This is why we have the "migrate_max_retries" config 
option.
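
A rough sketch of that kind of retry loop follows. It assumes a much
simplified scheduler and pre-migration check; find_destination,
select_host, pre_check and the exception classes are hypothetical names,
and max_retries only plays the role of "migrate_max_retries".

```python
from typing import Callable, List


class NoValidHost(Exception):
    """No remaining candidate host."""


class MaxRetriesExceeded(Exception):
    """Gave up after too many failed destinations."""


def find_destination(
    source: str,
    select_host: Callable[[List[str]], str],
    pre_check: Callable[[str], bool],
    max_retries: int,
) -> str:
    """Pick a destination, retrying when pre-migration checks fail."""
    # The source host is never a valid destination.
    attempted = [source]
    while True:
        retries = len(attempted) - 1
        if retries > max_retries:
            raise MaxRetriesExceeded(f"gave up after {retries} retries")
        host = select_host(attempted)
        if pre_check(host):
            return host
        # Remember the failure for the rest of this operation only.
        attempted.append(host)


def make_scheduler(all_hosts: List[str]) -> Callable[[List[str]], str]:
    """Toy scheduler: the first host that is not in the ignore list."""
    def select_host(ignore_hosts: List[str]) -> str:
        for host in all_hosts:
            if host not in ignore_hosts:
                return host
        raise NoValidHost("all hosts have been tried")
    return select_host


scheduler = make_scheduler(["src", "a", "b", "c"])
# Only host "c" passes the pre-migration checks in this toy setup.
chosen = find_destination("src", scheduler, lambda h: h == "c", max_retries=2)
```

The point of the sketch is that the attempted-host list is built up inside
the live migration itself, so a list persisted from the original build adds
nothing useful to it.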


The actual fix for this is trivial:

https://review.openstack.org/#/c/505771/

I wanted to bring it up here in case anyone had a good reason why we 
should not continue to exclude originally failed hosts during live 
migration, even if the admin is specifying one of those hosts for the 
live migration destination.


Presumably there was a good reason why the instance failed to build on a 
host originally, but that could be for any number of reasons: resource 
claim failed during a race, configuration issues, etc. Since we don't 
really know what originally happened, it seems reasonable to not exclude 
originally attempted build targets since the scheduler filters should 
still validate them during live migration (this is all assuming you're 
not using the 'force' flag with live migration - and if you are, all 
bets are off).


If people agree with doing this fix, then we also have to consider 
making a similar fix for other move operations like cold migrate, 
evacuate and unshelve. However, out of those other move operations, only 
cold migrate attempts any retries. If evacuate or unshelve fail on the 
target host, there is no retry.


--

Thanks,

Matt
