Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-31 Thread Nadathur, Sundar

Hi Eric and all,
    Thank you very much for considering my concerns and coming back 
with an improved solution. Glad that no blood was shed in the process.


I took this proposal and worked out its details, as I understand them, 
in this etherpad:

 https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction
The intention of this detailed scheme is to include GPUs, FPGAs and all 
devices, but the focus may be more on FPGAs.


This scheme at first keeps the restriction that a multi-function device 
cannot be reprogrammed but, in the last section, explores which part of 
the sky will fall down if we do allow that. Maybe we'll get through 
this with tears but no blood!


Have a good rest of the weekend.

Regards,
Sundar

On 3/29/2018 9:43 AM, Eric Fried wrote:

We discussed this on IRC [1], hangout, and etherpad [2].  Here is the
summary, which we mostly seem to agree on:

There are two different classes of device we're talking about
modeling/managing.  (We don't know the real nomenclature, so forgive
errors in that regard.)

==> Fully dynamic: You can program one region with one function, and
then still program a different region with a different function, etc.

==> Single program: Once you program the card with a function, *all* its
virtual slots are *only* capable of that function until the card is
reprogrammed.  And while any slot is in use, you can't reprogram.  This
is Sundar's FPGA use case.  It is also Sylvain's VGPU use case.

The "fully dynamic" case is straightforward (in the sense of being what
placement was architected to handle).
* Model the PF/region as a resource provider.
* The RP has inventory of some generic resource class (e.g. "VGPU",
"SRIOV_NET_VF", "FPGA_FUNCTION").  Allocations consume that inventory,
plain and simple.
* As a region gets programmed dynamically, it's acceptable for the thing
doing the programming to set a trait indicating that that function is in
play.  (Sundar, this is the thing I originally said would get
resistance; but we've agreed it's okay.  No blood was shed :)
* Requests *may* use preferred traits to help them land on a card that
already has their function flashed on it. (Prerequisite: preferred
traits, which can be implemented in placement.  Candidates with the most
preferred traits get sorted highest.)

The "single program" case needs to be handled more like what Alex
describes below.  TL;DR: We do *not* support dynamic programming,
traiting, or inventorying at instance boot time - it all has to be done
"up front".
* The PFs can be initially modeled as "empty" resource providers.  Or
maybe not at all.  Either way, *they can not be deployed* in this state.
* An operator or admin (via a CLI, config file, agent like blazar or
cyborg, etc.) preprograms the PF to have the specific desired
function/configuration.
   * This may be cyborg/blazar pre-programming devices to maintain an
available set of each function
   * This may be in response to a user requesting some function, which
causes a new image to be laid down on a device so it will be available
for scheduling
   * This may be a human doing it at cloud-build time
* This results in the resource provider being (created and) set up with
the inventory and traits appropriate to that function.
* Now deploys can happen, using required traits representing the desired
function.
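
A minimal sketch of what this "up front" flow could look like against the
placement REST API (a sketch only: the endpoint, token, provider UUID and
the CUSTOM_FPGA_CRYPTO / CUSTOM_FUNCTION_CRYPTO names are illustrative
assumptions, not agreed names):

    import requests

    PLACEMENT = "http://placement.example.com/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>",                   # placeholder auth
               "OpenStack-API-Version": "placement 1.17"}
    RP_UUID = "11111111-2222-3333-4444-555555555555"        # the FPGA region RP

    # Pre-programming step (cyborg/blazar/admin): register the custom resource
    # class and trait, then record the flashed function on the provider.
    # (Generations shown literally; in practice read the current one first.)
    requests.put(f"{PLACEMENT}/resource_classes/CUSTOM_FPGA_CRYPTO", headers=HEADERS)
    requests.put(f"{PLACEMENT}/traits/CUSTOM_FUNCTION_CRYPTO", headers=HEADERS)
    requests.put(f"{PLACEMENT}/resource_providers/{RP_UUID}/inventories",
                 headers=HEADERS,
                 json={"resource_provider_generation": 0,
                       "inventories": {"CUSTOM_FPGA_CRYPTO": {"total": 4}}})
    requests.put(f"{PLACEMENT}/resource_providers/{RP_UUID}/traits",
                 headers=HEADERS,
                 json={"resource_provider_generation": 1,
                       "traits": ["CUSTOM_FUNCTION_CRYPTO"]})

    # Deploy time: the flavor carries the required trait, e.g.
    #   openstack flavor set --property resources:CUSTOM_FPGA_CRYPTO=1 \
    #       --property trait:CUSTOM_FUNCTION_CRYPTO=required my-flavor
    # which turns into an allocation-candidates query roughly like:
    requests.get(f"{PLACEMENT}/allocation_candidates"
                 "?resources=CUSTOM_FPGA_CRYPTO:1&required=CUSTOM_FUNCTION_CRYPTO",
                 headers=HEADERS)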

-efried

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-29.log.html#t2018-03-29T12:52:56
[2] https://etherpad.openstack.org/p/placement-dynamic-traiting

On 03/29/2018 07:38 AM, Alex Xu wrote:

Agreed; whether we tweak inventory or traits, neither approach works.

As with VGPU, we can support a pre-programmed mode for multi-function
regions, where each region can support only one function type.

There are two reasons why Cyborg has a filter:
* to record the usage of functions in a region
* to record which function is programmed.

For #1, each region provides multiple functions, and each function can be
assigned to a VM. So we should create a ResourceProvider for the region,
with the function as the resource class. That is similar to the SR-IOV
device: the region (the PF) provides functions (VFs).

For #2, we should use a trait to distinguish the function type.

Then we no longer keep any inventory info in Cyborg, we don't need any
filter in Cyborg, and there is no race condition anymore.

2018-03-29 2:48 GMT+08:00 Eric Fried:

 Sundar-

We're running across this issue in several places right now.  One
thing that's definitely not going to get traction is
 automatically/implicitly tweaking inventory in one resource class when
 an allocation is made on a different resource class (whether in the same
 or different RPs).

Slightly less of a nonstarter, but still likely to get significant
push-back, is the idea of tweaking traits on the fly.  For example, your
 vGPU case might be modeled 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Ed Leafe
On Mar 29, 2018, at 12:57 PM, Eric Fried  wrote:
> 
>> That means that for the (re)-programming scenarios you need to
>> dynamically adjust the inventory of a particular FPGA resource provider.
> 
> Oh, see, this is something I had *thought* was a non-starter. 

I need to work on my communication skills. This is what I’ve been saying all 
along.

-- Ed Leafe








Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Dan Smith
> ==> Fully dynamic: You can program one region with one function, and
> then still program a different region with a different function, etc.

Note that this is also the case if you don't have virtualized multi-slot
devices. Like, if you had one that only has one region. Consuming it
consumes the one and only inventory.

> ==> Single program: Once you program the card with a function, *all* its
> virtual slots are *only* capable of that function until the card is
> reprogrammed.  And while any slot is in use, you can't reprogram.  This
> is Sundar's FPGA use case.  It is also Sylvain's VGPU use case.
>
> The "fully dynamic" case is straightforward (in the sense of being what
> placement was architected to handle).
> * Model the PF/region as a resource provider.
> * The RP has inventory of some generic resource class (e.g. "VGPU",
> "SRIOV_NET_VF", "FPGA_FUNCTION").  Allocations consume that inventory,
> plain and simple.
> * As a region gets programmed dynamically, it's acceptable for the thing
> doing the programming to set a trait indicating that that function is in
> play.  (Sundar, this is the thing I originally said would get
> resistance; but we've agreed it's okay.  No blood was shed :)
> * Requests *may* use preferred traits to help them land on a card that
> already has their function flashed on it. (Prerequisite: preferred
> traits, which can be implemented in placement.  Candidates with the most
> preferred traits get sorted highest.)

Yup.

> The "single program" case needs to be handled more like what Alex
> describes below.  TL;DR: We do *not* support dynamic programming,
> traiting, or inventorying at instance boot time - it all has to be done
> "up front".
> * The PFs can be initially modeled as "empty" resource providers.  Or
> maybe not at all.  Either way, *they can not be deployed* in this state.
> * An operator or admin (via a CLI, config file, agent like blazar or
> cyborg, etc.) preprograms the PF to have the specific desired
> function/configuration.
>   * This may be cyborg/blazar pre-programming devices to maintain an
> available set of each function
>   * This may be in response to a user requesting some function, which
> causes a new image to be laid down on a device so it will be available
> for scheduling
>   * This may be a human doing it at cloud-build time
> * This results in the resource provider being (created and) set up with
> the inventory and traits appropriate to that function.
> * Now deploys can happen, using required traits representing the desired
> function.

...and it could be in response to something noticing that a recent nova
boot failed to find any candidates with a particular function, which
provisions that thing so it can be retried. This is kindof the "spot
instances" approach -- that same workflow would work here as well,
although I expect most people would fit into the above cases.

--Dan



Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Eric Fried
> That means that for the (re)-programming scenarios you need to
> dynamically adjust the inventory of a particular FPGA resource provider.

Oh, see, this is something I had *thought* was a non-starter.  This
makes the "single program" case way easier to deal with, and allows it
to be handled on the fly:

* Model your region as a provider with separate resource classes for
each function it supports.  The inventory totals for each would be the
total number of virtual slots (or whatever they're called) of that type
that are possible when the device is flashed with that function.
* An allocation is made for one unit of class X.  This percolates down
to cyborg to do the flashing/attaching.  At this time, cyborg *deletes*
the inventories for all the other resource classes.
* In a race with different resource classes, whoever gets to cyborg
first, wins.  The second one will see that the device is already flashed
with X, and fail.  The failure will bubble up, causing the allocation to
be released.
* Requests for multiple different resource classes at once will have to
filter out allocation candidates that put both on the same device.  Not
completely sure how this happens.  Otherwise they would have to fail at
cyborg, resulting in the same bubble/deallocate as above.
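
A rough sketch of that cyborg-side step, assuming the placement REST API and
hypothetical CUSTOM_FPGA_FN_X / CUSTOM_FPGA_FN_Y resource classes (the
endpoint, token and UUID are placeholders, not part of the proposal):

    import requests

    PLACEMENT = "http://placement.example.com/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>",
               "OpenStack-API-Version": "placement 1.17"}

    def keep_only(rp_uuid, flashed_rc):
        """After flashing one function, drop the inventories of every other
        resource class on the region provider (sketch only)."""
        url = f"{PLACEMENT}/resource_providers/{rp_uuid}/inventories"
        current = requests.get(url, headers=HEADERS).json()
        kept = {rc: inv for rc, inv in current["inventories"].items()
                if rc == flashed_rc}
        # PUT replaces the whole inventory set; the provider generation guards
        # against concurrent updates (placement answers 409 on a stale one).
        resp = requests.put(url, headers=HEADERS,
                            json={"resource_provider_generation":
                                      current["resource_provider_generation"],
                                  "inventories": kept})
        if resp.status_code == 409:
            raise RuntimeError("lost the race; bubble up and release the allocation")

    keep_only("11111111-2222-3333-4444-555555555555", "CUSTOM_FPGA_FN_X")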

-efried



Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Jay Pipes

On 03/28/2018 07:03 PM, Nadathur, Sundar wrote:
Thanks, Eric. Looks like there are no good solutions even as candidates, 
but only options with varying levels of unacceptability. It is funny 
that the option that is considered the least unacceptable is to let 
the problem happen and then fail the request (last one in your list).


Could I ask what is the objection to the scheme that applies multiple 
traits and removes one as needed, apart from the fact that it has races?


The fundamental objection that I've had to various discussions that 
involve abusing traits in this fashion is that you are essentially 
trying to "consume" traits. But traits are *not consumable things*. Only 
resource classes are consumable things.


If you want to track the inventory of a certain thing -- and consume 
those things during scheduling -- then you need to use resource classes 
for that thing. The inventory management system in placement already has 
race protections in it. This means that you won't be able to 
over-allocate a particular consumable accelerated function if there 
isn't inventory capacity for that particular function on an FPGA. 
Likewise, you would not be able to *remove* inventory for a particular 
function on an FPGA if some instance is consuming that particular 
function. This protection does *not* exist if you are tracking 
particular functions with traits; the reason is because an instance 
doesn't *consume* a trait. There's no such thing as "I started an 
instance with accelerated function X and therefore I am consuming trait 
Y on this FPGA."


So, bottom line for me is make sure we're using resource classes for 
consumable items and traits for representing non-consumable capabilities 
**of the resource provider**.


That means that for the (re)-programming scenarios you need to 
dynamically adjust the inventory of a particular FPGA resource provider.


You will need to *add* an inventory item of a custom resource class 
representing the specific function you are flashing *to an empty region*.


You *may* want to *delete* an inventory item of a custom resource class 
representing the specific function *when an instance that was using that 
specific function is terminated*. When the instance is terminated, Nova 
will *automatically* delete allocations of that custom resource class 
associated with the instance if you use a custom resource class to 
represent the particular accelerated function. No such automatic removal 
of allocations is done if you use traits to represent particular 
accelerated functions (again, because traits aren't consumable things).
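
A small sketch of those two adjustments, using the placement REST API
directly (the endpoint, token, provider UUID and the CUSTOM_FPGA_FN_X name
are illustrative assumptions):

    import requests

    PLACEMENT = "http://placement.example.com/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>",
               "OpenStack-API-Version": "placement 1.17"}
    RP = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"             # the FPGA region RP

    # Flashing a function onto an empty region: *add* inventory of a custom
    # resource class for it (generation shown literally; read it first in practice).
    requests.put(f"{PLACEMENT}/resource_providers/{RP}/inventories/CUSTOM_FPGA_FN_X",
                 headers=HEADERS,
                 json={"resource_provider_generation": 3, "total": 4})

    # Later, *removing* that inventory is refused while any allocation still
    # consumes it, which is the race protection described above.
    resp = requests.delete(
        f"{PLACEMENT}/resource_providers/{RP}/inventories/CUSTOM_FPGA_FN_X",
        headers=HEADERS)
    if resp.status_code == 409:
        pass  # some instance still holds CUSTOM_FPGA_FN_X; don't reprogram yet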


Best,
-jay



Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Eric Fried
We discussed this on IRC [1], hangout, and etherpad [2].  Here is the
summary, which we mostly seem to agree on:

There are two different classes of device we're talking about
modeling/managing.  (We don't know the real nomenclature, so forgive
errors in that regard.)

==> Fully dynamic: You can program one region with one function, and
then still program a different region with a different function, etc.

==> Single program: Once you program the card with a function, *all* its
virtual slots are *only* capable of that function until the card is
reprogrammed.  And while any slot is in use, you can't reprogram.  This
is Sundar's FPGA use case.  It is also Sylvain's VGPU use case.

The "fully dynamic" case is straightforward (in the sense of being what
placement was architected to handle).
* Model the PF/region as a resource provider.
* The RP has inventory of some generic resource class (e.g. "VGPU",
"SRIOV_NET_VF", "FPGA_FUNCTION").  Allocations consume that inventory,
plain and simple.
* As a region gets programmed dynamically, it's acceptable for the thing
doing the programming to set a trait indicating that that function is in
play.  (Sundar, this is the thing I originally said would get
resistance; but we've agreed it's okay.  No blood was shed :)
* Requests *may* use preferred traits to help them land on a card that
already has their function flashed on it. (Prerequisite: preferred
traits, which can be implemented in placement.  Candidates with the most
preferred traits get sorted highest.)

The "single program" case needs to be handled more like what Alex
describes below.  TL;DR: We do *not* support dynamic programming,
traiting, or inventorying at instance boot time - it all has to be done
"up front".
* The PFs can be initially modeled as "empty" resource providers.  Or
maybe not at all.  Either way, *they can not be deployed* in this state.
* An operator or admin (via a CLI, config file, agent like blazar or
cyborg, etc.) preprograms the PF to have the specific desired
function/configuration.
  * This may be cyborg/blazar pre-programming devices to maintain an
available set of each function
  * This may be in response to a user requesting some function, which
causes a new image to be laid down on a device so it will be available
for scheduling
  * This may be a human doing it at cloud-build time
* This results in the resource provider being (created and) set up with
the inventory and traits appropriate to that function.
* Now deploys can happen, using required traits representing the desired
function.

-efried

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-29.log.html#t2018-03-29T12:52:56
[2] https://etherpad.openstack.org/p/placement-dynamic-traiting

On 03/29/2018 07:38 AM, Alex Xu wrote:
> Agreed; whether we tweak inventory or traits, neither approach works.
> 
> As with VGPU, we can support a pre-programmed mode for multi-function
> regions, where each region can support only one function type.
> 
> There are two reasons why Cyborg has a filter:
> * to record the usage of functions in a region
> * to record which function is programmed.
> 
> For #1, each region provides multiple functions, and each function can be
> assigned to a VM. So we should create a ResourceProvider for the region,
> with the function as the resource class. That is similar to the SR-IOV
> device: the region (the PF) provides functions (VFs).
> 
> For #2, we should use a trait to distinguish the function type.
> 
> Then we no longer keep any inventory info in Cyborg, we don't need any
> filter in Cyborg, and there is no race condition anymore.
> 
> 2018-03-29 2:48 GMT+08:00 Eric Fried:
> 
> Sundar-
> 
> We're running across this issue in several places right now.  One
> thing that's definitely not going to get traction is
> automatically/implicitly tweaking inventory in one resource class when
> an allocation is made on a different resource class (whether in the same
> or different RPs).
> 
> Slightly less of a nonstarter, but still likely to get significant
> push-back, is the idea of tweaking traits on the fly.  For example, your
> vGPU case might be modeled as:
> 
> PGPU_RP: {
>   inventory: {
>       CUSTOM_VGPU_TYPE_A: 2,
>       CUSTOM_VGPU_TYPE_B: 4,
>   }
>   traits: [
>       CUSTOM_VGPU_TYPE_A_CAPABLE,
>       CUSTOM_VGPU_TYPE_B_CAPABLE,
>   ]
> }
> 
>         The request would come in for
> resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
> in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
> that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
> So it doesn't matter that there's still inventory of
> CUSTOM_VGPU_TYPE_B:4, because a request including
> required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
> There's of 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Eric Fried
Sundar-

To be clear, *all* of the solutions will have race conditions.  There's
no getting around the fact that we need to account for situations where
an allocation is made, but then can't be satisfied by cyborg (or
neutron, or nova, or cinder, or whoever).  That failure has to bubble up
and cause retry or failure of the overarching flow.

The objection to "dynamic trait setting" is that traits are intended to
indicate characteristics, not states.

https://www.google.com/search?q=estar+vs+ser

I'll have to let Jay or Dan explain it further.  Because TBH, I don't
see the harm in mucking with traits/inventories dynamically.

The solutions I discussed here are if it's critical that everything be
dynamic and ultimately flexible.  Alex brings up a different option in
another subthread which is more likely how we're going to handle this
for our Nova scenarios in Rocky.  I'll comment further in that subthread.

-efried

On 03/28/2018 06:03 PM, Nadathur, Sundar wrote:
> Thanks, Eric. Looks like there are no good solutions even as candidates,
> but only options with varying levels of unacceptability. It is funny
> that the option that is considered the least unacceptable is to let
> the problem happen and then fail the request (last one in your list).
> 
> Could I ask what is the objection to the scheme that applies multiple
> traits and removes one as needed, apart from the fact that it has races?
> 
> Regards,
> Sundar
> 
> On 3/28/2018 11:48 AM, Eric Fried wrote:
>> Sundar-
>>
>> We're running across this issue in several places right now.   One
>> thing that's definitely not going to get traction is
>> automatically/implicitly tweaking inventory in one resource class when
>> an allocation is made on a different resource class (whether in the same
>> or different RPs).
>>
>> Slightly less of a nonstarter, but still likely to get significant
>> push-back, is the idea of tweaking traits on the fly.  For example, your
>> vGPU case might be modeled as:
>>
>> PGPU_RP: {
>>    inventory: {
>>    CUSTOM_VGPU_TYPE_A: 2,
>>    CUSTOM_VGPU_TYPE_B: 4,
>>    }
>>    traits: [
>>    CUSTOM_VGPU_TYPE_A_CAPABLE,
>>    CUSTOM_VGPU_TYPE_B_CAPABLE,
>>    ]
>> }
>>
>> The request would come in for
>> resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
>> in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
>> that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
>> So it doesn't matter that there's still inventory of
>> CUSTOM_VGPU_TYPE_B:4, because a request including
>> required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
>> There's of course a window between when the initial allocation is made
>> and when you tweak the trait list.  In that case you'll just have to
>> fail the loser.  This would be like any other failure in e.g. the spawn
>> process; it would bubble up, the allocation would be removed; retries
>> might happen or whatever.
>>
>> Like I said, you're likely to get a lot of resistance to this idea as
>> well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
>> patches; there's nothing about placement that disallows it.)
>>
>> The simple-but-inefficient solution is simply that we'd still be able
>> to make allocations for vGPU type B, but you would have to fail right
>> away when it came down to cyborg to attach the resource.  Which is code
>> you pretty much have to write anyway.  It's an improvement if cyborg
>> gets to be involved in the post-get-allocation-candidates
>> weighing/filtering step, because you can do that check at that point to
>> help filter out the candidates that would fail.  Of course there's still
>> a race condition there, but it's no different than for any other
>> resource.
>>
>> efried
>>
>> On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:
>>> Hi Eric and all,
>>>  I should have clarified that this race condition happens only for
>>> the case of devices with multiple functions. There is a prior thread
>>> 
>>>
>>> about it. I was trying to get a solution within Cyborg, but that faces
>>> this race condition as well.
>>>
>>> IIUC, this situation is somewhat similar to the issue with vGPU types
>>> 
>>>
>>> (thanks to Alex Xu for pointing this out). In the latter case, we could
>>> start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
>>> consuming a unit of  vGPU-type-a, ideally the inventory should change
>>> to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
>>> we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
>>> after consuming a unit of that function, ideally the inventory should
>>> change to: (region-type-A: 0, function-X: 3).
>>>
>>> I understand that this approach is controversial 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-29 Thread Alex Xu
Agreed; whether we tweak inventory or traits, neither approach works.

As with VGPU, we can support a pre-programmed mode for multi-function
regions, where each region can support only one function type.

There are two reasons why Cyborg has a filter:
* to record the usage of functions in a region
* to record which function is programmed.

For #1, each region provides multiple functions, and each function can be
assigned to a VM. So we should create a ResourceProvider for the region,
with the function as the resource class. That is similar to the SR-IOV
device: the region (the PF) provides functions (VFs).

For #2, we should use a trait to distinguish the function type.

Then we no longer keep any inventory info in Cyborg, we don't need any
filter in Cyborg, and there is no race condition anymore.
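
For illustration, under this model a region provider might end up looking
roughly like this (a sketch; the class and trait names are not agreed):

    # One resource provider per region (the PF). The consumable thing is the
    # function, exposed as VF-like units; a trait says which function type is
    # currently programmed.
    FPGA_REGION_RP = {
        "inventories": {
            "CUSTOM_FPGA_INTEL_VF": {"total": 4},      # the region's VFs
        },
        "traits": [
            "CUSTOM_FUNCTION_CRYPTO",                  # programmed function type
        ],
    }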

2018-03-29 2:48 GMT+08:00 Eric Fried :

> Sundar-
>
> We're running across this issue in several places right now.   One
> thing that's definitely not going to get traction is
> automatically/implicitly tweaking inventory in one resource class when
> an allocation is made on a different resource class (whether in the same
> or different RPs).
>
> Slightly less of a nonstarter, but still likely to get significant
> push-back, is the idea of tweaking traits on the fly.  For example, your
> vGPU case might be modeled as:
>
> PGPU_RP: {
>   inventory: {
>   CUSTOM_VGPU_TYPE_A: 2,
>   CUSTOM_VGPU_TYPE_B: 4,
>   }
>   traits: [
>   CUSTOM_VGPU_TYPE_A_CAPABLE,
>   CUSTOM_VGPU_TYPE_B_CAPABLE,
>   ]
> }
>
> The request would come in for
> resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
> in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
> that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
> So it doesn't matter that there's still inventory of
> CUSTOM_VGPU_TYPE_B:4, because a request including
> required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
> There's of course a window between when the initial allocation is made
> and when you tweak the trait list.  In that case you'll just have to
> fail the loser.  This would be like any other failure in e.g. the spawn
> process; it would bubble up, the allocation would be removed; retries
> might happen or whatever.
>
> Like I said, you're likely to get a lot of resistance to this idea as
> well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
> patches; there's nothing about placement that disallows it.)
>
> The simple-but-inefficient solution is simply that we'd still be able
> to make allocations for vGPU type B, but you would have to fail right
> away when it came down to cyborg to attach the resource.  Which is code
> you pretty much have to write anyway.  It's an improvement if cyborg
> gets to be involved in the post-get-allocation-candidates
> weighing/filtering step, because you can do that check at that point to
> help filter out the candidates that would fail.  Of course there's still
> a race condition there, but it's no different than for any other resource.
>
> efried
>
> On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:
> > Hi Eric and all,
> > I should have clarified that this race condition happens only for
> > the case of devices with multiple functions. There is a prior thread
> >  March/127882.html>
> > about it. I was trying to get a solution within Cyborg, but that faces
> > this race condition as well.
> >
> > IIUC, this situation is somewhat similar to the issue with vGPU types
> >  %23openstack-nova.2018-03-27.log.html#t2018-03-27T13:41:00>
> > (thanks to Alex Xu for pointing this out). In the latter case, we could
> > start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
> > consuming a unit of  vGPU-type-a, ideally the inventory should change
> > to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
> > we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
> > after consuming a unit of that function, ideally the inventory should
> > change to: (region-type-A: 0, function-X: 3).
> >
> > I understand that this approach is controversial :) Also, one difference
> > from the vGPU case is that the number and count of vGPU types is static,
> > whereas with FPGAs, one could reprogram it to result in more or fewer
> > functions. That said, we could hopefully keep this analogy in mind for
> > future discussions.
> >
> > We probably will not support multi-function accelerators in Rocky. This
> > discussion is for the longer term.
> >
> > Regards,
> > Sundar
> >
> > On 3/23/2018 12:44 PM, Eric Fried wrote:
> >> Sundar-
> >>
> >> First thought is to simplify by NOT keeping inventory information in
> >> the cyborg db at all.  The provider record in the placement service
> >> already knows the device (the provider ID, 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-28 Thread Nadathur, Sundar
Thanks, Eric. Looks like there are no good solutions even as candidates, 
but only options with varying levels of unacceptability. It is funny 
that the option that is considered the least unacceptable is to let 
the problem happen and then fail the request (last one in your list).


Could I ask what is the objection to the scheme that applies multiple 
traits and removes one as needed, apart from the fact that it has races?


Regards,
Sundar

On 3/28/2018 11:48 AM, Eric Fried wrote:

Sundar-

We're running across this issue in several places right now.   One
thing that's definitely not going to get traction is
automatically/implicitly tweaking inventory in one resource class when
an allocation is made on a different resource class (whether in the same
or different RPs).

Slightly less of a nonstarter, but still likely to get significant
push-back, is the idea of tweaking traits on the fly.  For example, your
vGPU case might be modeled as:

PGPU_RP: {
   inventory: {
   CUSTOM_VGPU_TYPE_A: 2,
   CUSTOM_VGPU_TYPE_B: 4,
   }
   traits: [
   CUSTOM_VGPU_TYPE_A_CAPABLE,
   CUSTOM_VGPU_TYPE_B_CAPABLE,
   ]
}

The request would come in for
resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
So it doesn't matter that there's still inventory of
CUSTOM_VGPU_TYPE_B:4, because a request including
required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
There's of course a window between when the initial allocation is made
and when you tweak the trait list.  In that case you'll just have to
fail the loser.  This would be like any other failure in e.g. the spawn
process; it would bubble up, the allocation would be removed; retries
might happen or whatever.

Like I said, you're likely to get a lot of resistance to this idea as
well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
patches; there's nothing about placement that disallows it.)

The simple-but-inefficient solution is simply that we'd still be able
to make allocations for vGPU type B, but you would have to fail right
away when it came down to cyborg to attach the resource.  Which is code
you pretty much have to write anyway.  It's an improvement if cyborg
gets to be involved in the post-get-allocation-candidates
weighing/filtering step, because you can do that check at that point to
help filter out the candidates that would fail.  Of course there's still
a race condition there, but it's no different than for any other resource.

efried

On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:

Hi Eric and all,
     I should have clarified that this race condition happens only for
the case of devices with multiple functions. There is a prior thread

about it. I was trying to get a solution within Cyborg, but that faces
this race condition as well.

IIUC, this situation is somewhat similar to the issue with vGPU types

(thanks to Alex Xu for pointing this out). In the latter case, we could
start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
consuming a unit of  vGPU-type-a, ideally the inventory should change
to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
after consuming a unit of that function, ideally the inventory should
change to: (region-type-A: 0, function-X: 3).

I understand that this approach is controversial :) Also, one difference
from the vGPU case is that the number and count of vGPU types is static,
whereas with FPGAs, one could reprogram it to result in more or fewer
functions. That said, we could hopefully keep this analogy in mind for
future discussions.

We probably will not support multi-function accelerators in Rocky. This
discussion is for the longer term.

Regards,
Sundar

On 3/23/2018 12:44 PM, Eric Fried wrote:

Sundar-

First thought is to simplify by NOT keeping inventory information in
the cyborg db at all.  The provider record in the placement service
already knows the device (the provider ID, which you can look up in the
cyborg db) the host (the root_provider_uuid of the provider representing
the device) and the inventory, and (I hope) you'll be augmenting it with
traits indicating what functions it's capable of.  That way, you'll
always get allocation candidates with devices that *can* load the
desired function; now you just have to engage your weigher to prioritize
the ones that already have it loaded so you can prefer those.

Am I missing something?

efried

On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:

Hi all,
     There seems to be a possibility of a race 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-28 Thread Eric Fried
Sundar-

We're running across this issue in several places right now.   One
thing that's definitely not going to get traction is
automatically/implicitly tweaking inventory in one resource class when
an allocation is made on a different resource class (whether in the same
or different RPs).

Slightly less of a nonstarter, but still likely to get significant
push-back, is the idea of tweaking traits on the fly.  For example, your
vGPU case might be modeled as:

PGPU_RP: {
  inventory: {
  CUSTOM_VGPU_TYPE_A: 2,
  CUSTOM_VGPU_TYPE_B: 4,
  }
  traits: [
  CUSTOM_VGPU_TYPE_A_CAPABLE,
  CUSTOM_VGPU_TYPE_B_CAPABLE,
  ]
}

The request would come in for
resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
So it doesn't matter that there's still inventory of
CUSTOM_VGPU_TYPE_B:4, because a request including
required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
There's of course a window between when the initial allocation is made
and when you tweak the trait list.  In that case you'll just have to
fail the loser.  This would be like any other failure in e.g. the spawn
process; it would bubble up, the allocation would be removed; retries
might happen or whatever.
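
For concreteness, the trait tweak described above might look roughly like
this against the placement API (a sketch of the idea, which as noted is
likely to get push-back; endpoint, token and UUID are placeholders):

    import requests

    PLACEMENT = "http://placement.example.com/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>",
               "OpenStack-API-Version": "placement 1.17"}
    PGPU_RP = "ffffffff-0000-1111-2222-333333333333"        # placeholder UUID

    url = f"{PLACEMENT}/resource_providers/{PGPU_RP}/traits"
    current = requests.get(url, headers=HEADERS).json()
    traits = [t for t in current["traits"] if t != "CUSTOM_VGPU_TYPE_B_CAPABLE"]
    # PUT replaces the full trait list; a 409 on a stale generation means
    # somebody raced us, and the loser is failed as described above.
    requests.put(url, headers=HEADERS,
                 json={"resource_provider_generation":
                           current["resource_provider_generation"],
                       "traits": traits})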

Like I said, you're likely to get a lot of resistance to this idea as
well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
patches; there's nothing about placement that disallows it.)

The simple-but-inefficient solution is simply that we'd still be able
to make allocations for vGPU type B, but you would have to fail right
away when it came down to cyborg to attach the resource.  Which is code
you pretty much have to write anyway.  It's an improvement if cyborg
gets to be involved in the post-get-allocation-candidates
weighing/filtering step, because you can do that check at that point to
help filter out the candidates that would fail.  Of course there's still
a race condition there, but it's no different than for any other resource.

efried

On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:
> Hi Eric and all,
>     I should have clarified that this race condition happens only for
> the case of devices with multiple functions. There is a prior thread
> 
> about it. I was trying to get a solution within Cyborg, but that faces
> this race condition as well.
> 
> IIUC, this situation is somewhat similar to the issue with vGPU types
> 
> (thanks to Alex Xu for pointing this out). In the latter case, we could
> start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
> consuming a unit of  vGPU-type-a, ideally the inventory should change
> to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
> we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
> after consuming a unit of that function, ideally the inventory should
> change to: (region-type-A: 0, function-X: 3).
> 
> I understand that this approach is controversial :) Also, one difference
> from the vGPU case is that the number and count of vGPU types is static,
> whereas with FPGAs, one could reprogram it to result in more or fewer
> functions. That said, we could hopefully keep this analogy in mind for
> future discussions.
> 
> We probably will not support multi-function accelerators in Rocky. This
> discussion is for the longer term.
> 
> Regards,
> Sundar
> 
> On 3/23/2018 12:44 PM, Eric Fried wrote:
>> Sundar-
>>
>>  First thought is to simplify by NOT keeping inventory information in
>> the cyborg db at all.  The provider record in the placement service
>> already knows the device (the provider ID, which you can look up in the
>> cyborg db) the host (the root_provider_uuid of the provider representing
>> the device) and the inventory, and (I hope) you'll be augmenting it with
>> traits indicating what functions it's capable of.  That way, you'll
>> always get allocation candidates with devices that *can* load the
>> desired function; now you just have to engage your weigher to prioritize
>> the ones that already have it loaded so you can prefer those.
>>
>>  Am I missing something?
>>
>>  efried
>>
>> On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:
>>> Hi all,
>>>     There seems to be a possibility of a race condition in the
>>> Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
>>> the proposed Cyborg/Nova spec
>>> 
>>> for details.)
>>>
>>> Consider the scenario where the flavor specifies a resource class for a
>>> device type, and also specifies a function (e.g. encrypt) in the extra
>>> specs. The Nova 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-28 Thread Nadathur, Sundar

Hi Shaohe,
  I have responded in the Etherpad. The Cyborg/Nova scheduling spec 
details the 4 types of user requests.



I believe you are looking for more details on what the RC names, traits 
and flavors will look like. I will add that to the spec itself.


Thanks,
Sundar

On 3/28/2018 2:10 AM, 少合冯 wrote:

I have summarized some scenarios for FPGA device requests.
https://etherpad.openstack.org/p/cyborg-fpga-request-scenarios

Please add more scenarios to find out the exceptions where placement 
cannot satisfy the filter and weight.


IMHO, I prefer placement to do the filtering and weighing. If we have to 
let Cyborg do filtering and weighing, the Nova scheduler just needs to 
call Cyborg once for all hosts, even though the weighing is done host by host.



2018-03-23 12:27 GMT+08:00 Nadathur, Sundar:


Hi all,
    There seems to be a possibility of a race condition in the
Cyborg/Nova flow. Apologies for missing this earlier. (You can
refer to the proposed Cyborg/Nova spec


for details.)

Consider the scenario where the flavor specifies a resource class
for a device type, and also specifies a function (e.g. encrypt) in
the extra specs. The Nova scheduler would only track the device
type as a resource, and Cyborg needs to track the availability of
functions. Further, to keep it simple, say all the functions exist
all the time (no reprogramming involved).

To recap, here is the scheduler flow for this case:

  * A request spec with a flavor comes to Nova
conductor/scheduler. The flavor has a device type as a
resource class, and a function in the extra specs.
  * Placement API returns the list of RPs (compute nodes) which
contain the requested device types (but not necessarily the
function).
  * Cyborg will provide a custom filter which queries Cyborg DB.
This needs to check which hosts contain the needed function,
and filter out the rest.
  * The scheduler selects one node from the filtered list, and the
request goes to the compute node.

For the filter to work, the Cyborg DB needs to maintain a table
with triples of (host, function type, #free units). The filter
checks if a given host has one or more free units of the requested
function type. But, to keep the # free units up to date, Cyborg on
the selected compute node needs to notify the Cyborg API to
decrement the #free units when an instance is spawned, and to
increment them when resources are released.

Therein lies the catch: this loop from the compute node to
controller is susceptible to race conditions. For example, if two
simultaneous requests each ask for function A, and there is only
one unit of that available, the Cyborg filter will approve both,
both may land on the same host, and one will fail. This is because
Cyborg on the controller does not decrement resource usage due to
one request before processing the next request.
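
To make the race concrete, here is a minimal illustration of the
check-then-act gap (plain Python standing in for the Cyborg DB and filter;
nothing here is actual Cyborg code):

    # (host, function type) -> #free units, mimicking the Cyborg DB triple.
    free_units = {("host1", "function-A"): 1}

    def filter_passes(host, fn):
        # The Cyborg filter only *reads* the count.
        return free_units[(host, fn)] > 0

    def claim(host, fn):
        # The decrement only happens later, when the compute node notifies
        # the Cyborg API after spawn.
        free_units[(host, fn)] -= 1

    # Two simultaneous requests both pass the filter before either claim runs:
    assert filter_passes("host1", "function-A")   # request 1: approved
    assert filter_passes("host1", "function-A")   # request 2: also approved
    claim("host1", "function-A")
    claim("host1", "function-A")   # only one unit existed; one boot will fail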

This is similar to this previous Nova scheduling issue.
That was solved by having the scheduler claim a resource in
Placement for the selected node. I don't see an analog for Cyborg,
since it would not know which node is selected.

Thanks in advance for suggestions and solutions.

Regards,
Sundar
















Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-28 Thread Nadathur, Sundar

Hi Eric and all,
    I should have clarified that this race condition happens only for 
the case of devices with multiple functions. There is a prior thread 
 
about it. I was trying to get a solution within Cyborg, but that faces 
this race condition as well.


IIUC, this situation is somewhat similar to the issue with vGPU types 
 
(thanks to Alex Xu for pointing this out). In the latter case, we could 
start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after 
consuming a unit of vGPU-type-a, ideally the inventory should change to: 
(vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators, we 
start with an RP inventory of (region-type-A: 1, function-X: 4). But, 
after consuming a unit of that function, ideally the inventory should 
change to: (region-type-A: 0, function-X: 3).


I understand that this approach is controversial :) Also, one difference 
from the vGPU case is that the number and count of vGPU types is static, 
whereas with FPGAs, one could reprogram it to result in more or fewer 
functions. That said, we could hopefully keep this analogy in mind for 
future discussions.


We probably will not support multi-function accelerators in Rocky. This 
discussion is for the longer term.


Regards,
Sundar

On 3/23/2018 12:44 PM, Eric Fried wrote:

Sundar-

First thought is to simplify by NOT keeping inventory information in
the cyborg db at all.  The provider record in the placement service
already knows the device (the provider ID, which you can look up in the
cyborg db) the host (the root_provider_uuid of the provider representing
the device) and the inventory, and (I hope) you'll be augmenting it with
traits indicating what functions it's capable of.  That way, you'll
always get allocation candidates with devices that *can* load the
desired function; now you just have to engage your weigher to prioritize
the ones that already have it loaded so you can prefer those.

Am I missing something?

efried

On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:

Hi all,
     There seems to be a possibility of a race condition in the
Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
the proposed Cyborg/Nova spec

for details.)

Consider the scenario where the flavor specifies a resource class for a
device type, and also specifies a function (e.g. encrypt) in the extra
specs. The Nova scheduler would only track the device type as a
resource, and Cyborg needs to track the availability of functions.
Further, to keep it simple, say all the functions exist all the time (no
reprogramming involved).

To recap, here is the scheduler flow for this case:

   * A request spec with a flavor comes to Nova conductor/scheduler. The
 flavor has a device type as a resource class, and a function in the
 extra specs.
   * Placement API returns the list of RPs (compute nodes) which contain
 the requested device types (but not necessarily the function).
   * Cyborg will provide a custom filter which queries Cyborg DB. This
 needs to check which hosts contain the needed function, and filter
 out the rest.
   * The scheduler selects one node from the filtered list, and the
 request goes to the compute node.

For the filter to work, the Cyborg DB needs to maintain a table with
triples of (host, function type, #free units). The filter checks if a
given host has one or more free units of the requested function type.
But, to keep the # free units up to date, Cyborg on the selected compute
node needs to notify the Cyborg API to decrement the #free units when an
instance is spawned, and to increment them when resources are released.

Therein lies the catch: this loop from the compute node to controller is
susceptible to race conditions. For example, if two simultaneous
requests each ask for function A, and there is only one unit of that
available, the Cyborg filter will approve both, both may land on the
same host, and one will fail. This is because Cyborg on the controller
does not decrement resource usage due to one request before processing
the next request.

This is similar to this previous Nova scheduling issue.
That was solved by having the scheduler claim a resource in Placement
for the selected node. I don't see an analog for Cyborg, since it would
not know which node is selected.

Thanks in advance for suggestions and solutions.

Regards,
Sundar









Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-28 Thread 少合冯
I have summarized some scenarios for FPGA device requests.
https://etherpad.openstack.org/p/cyborg-fpga-request-scenarios

Please add more scenarios to find out the exceptions where placement
cannot satisfy the filter and weight.

IMHO, I prefer placement to do the filtering and weighing. If we have to let
Cyborg do filtering and weighing, the Nova scheduler just needs to call Cyborg
once for all hosts, even though the weighing is done host by host.


2018-03-23 12:27 GMT+08:00 Nadathur, Sundar :

> Hi all,
> There seems to be a possibility of a race condition in the Cyborg/Nova
> flow. Apologies for missing this earlier. (You can refer to the proposed
> Cyborg/Nova spec
> 
> for details.)
>
> Consider the scenario where the flavor specifies a resource class for a
> device type, and also specifies a function (e.g. encrypt) in the extra
> specs. The Nova scheduler would only track the device type as a resource,
> and Cyborg needs to track the availability of functions. Further, to keep
> it simple, say all the functions exist all the time (no reprogramming
> involved).
>
> To recap, here is the scheduler flow for this case:
>
>- A request spec with a flavor comes to Nova conductor/scheduler. The
>flavor has a device type as a resource class, and a function in the extra
>specs.
>- Placement API returns the list of RPs (compute nodes) which contain
>the requested device types (but not necessarily the function).
>- Cyborg will provide a custom filter which queries Cyborg DB. This
>needs to check which hosts contain the needed function, and filter out the
>rest.
>- The scheduler selects one node from the filtered list, and the
>request goes to the compute node.
>
> For the filter to work, the Cyborg DB needs to maintain a table with
> triples of (host, function type, #free units). The filter checks if a given
> host has one or more free units of the requested function type. But, to
> keep the # free units up to date, Cyborg on the selected compute node needs
> to notify the Cyborg API to decrement the #free units when an instance is
> spawned, and to increment them when resources are released.
>
> Therein lies the catch: this loop from the compute node to controller is
> susceptible to race conditions. For example, if two simultaneous requests
> each ask for function A, and there is only one unit of that available, the
> Cyborg filter will approve both, both may land on the same host, and one
> will fail. This is because Cyborg on the controller does not decrement
> resource usage due to one request before processing the next request.
>
> This is similar to this previous Nova scheduling issue.
> That was solved by having the scheduler claim a resource in Placement for
> the selected node. I don't see an analog for Cyborg, since it would not
> know which node is selected.
>
> Thanks in advance for suggestions and solutions.
>
> Regards,
> Sundar
>
>
>
>
>
>
>
>
>


Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-27 Thread 少合冯
As I understand it, placement and the Nova scheduler are dedicated to
filtering and weighing, and they are responsible for avoiding races.

Nested providers + traits should cover most scenarios.

For any special case, please let the Nova and Cyborg developers know,
and let's work together to get a solution.



I re-paste our design (for a POC) that I have sent before, as follows; hopefully
it can be helpful.
We do not let Cyborg do any scheduler function (including filter and weight).
It is just responsible for binding the FPGA device to the VM instance (or call
it FPGA device assignment).

===
hi all

IMHO, we can consider upstreaming image management and resource
provider management, and even scheduler weighing.

1.  Image management
For image management, I missed one thing in the meeting.

We have discussed it before.
Li Liu suggested adding a Cyborg wrapper to upload the FPGA image.
This is a good idea.
For example:
PUT /cyborg/v1/images/{image_id}/file


It will call the Glance upload API to upload the image.
This is helpful for us to normalize the image tags and properties.
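
A sketch of what calling the proposed wrapper could look like (the endpoint
itself is only proposed above; the URL, port, token, image id and file name
are placeholders):

    import requests

    CYBORG = "http://cyborg-api.example.com:6666/cyborg/v1"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>",
               "Content-Type": "application/octet-stream"}
    image_id = "0de6b5d0-1111-2222-3333-444444444444"          # hypothetical image

    # The wrapper would forward the bitstream to the Glance upload API and
    # normalize the tags/properties discussed below.
    with open("crypto_bitstream.bin", "rb") as f:
        requests.put(f"{CYBORG}/images/{image_id}/file", headers=HEADERS, data=f)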

To Dutch, Li Liu, Dolpher, Sundar and other FPGA experts:
 How about getting agreement on the standardization of Glance image
metadata, especially tags and properties?

For the tags:
IMHO, the "FPGA" tag is necessary, for there may be many images managed by
Glance, not only FPGA images but also VM images. This tag can be a filter to
help us get only FPGA images.
Is the vendor name necessary as a tag? Such as "INTEL" or "XILINX".
Is the product model necessary as a tag? Such as "STRATIX10".
Should anything else be in the image tags?
For the properties:
They should include the function name (this means the accelerator type).
Should they also include the stream id and vendor name?
Such as: --property vendor=xilinx --property type=crypto,transcoding
Should anything else be in the image properties?
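
Putting the questions above together, the metadata convention might look
something like this (a sketch only; none of these names are agreed yet):

    # Candidate Glance metadata for an FPGA bitstream image.
    fpga_image_metadata = {
        "tags": ["FPGA", "INTEL", "STRATIX10"],        # device marker, vendor, model
        "properties": {
            "vendor": "intel",
            "type": "crypto,transcoding",              # function name(s)
            "stream_id": "0x1234",                     # hypothetical stream id
        },
    }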

Li Liu is working on the spec.


2.  Provider management.
  Resource classes, maybe with nested providers supported.
  We can define them as follows:
  The level 1 provider resource class is CUSTOM_FPGA_<TYPE>, level 2
is CUSTOM_FPGA_<VENDOR>_<TYPE>, and level 3 is
CUSTOM_FPGA_<VENDOR>_<MODEL>_<TYPE>.
  { "CUSTOM_FPGA_VF": {
        "num": 3,
        "CUSTOM_FPGA_XILINX_VF": { "num": 1 },
        "CUSTOM_FPGA_INTEL_VF": {
            "CUSTOM_FPGA_INTEL_STRATIX10_VF": { "num": 1 },
            "CUSTOM_FPGA_INTEL_STRATIX11_VF": { "num": 1 }
        }
    }
  }
  Not sure I understand correctly.

  And traits should include: CUSTOM_<DOMAIN>_FUNCTION_<FUNCTION>
  <DOMAIN> means which project consumes these traits: CYBORG or
ACCELERATOR, which is better? Here it means Cyborg cares about these traits; Nova,
Neutron and Cinder can ignore them.
  <FUNCTION> can be CRYPTO, TRANSCODING.

To Jay Pipes, Dutch, Li Liu, Dolpher, Sunder and other FPGA/placement
experts:
   Any suggestion on it?

3.  Scheduler weighing.
I think this is not a high priority at present for Cyborg.
Zhipeng, Li Liu, Zhuli, Dolpher and I have discussed it before for the
deployable model implementation.
We need to add stream or image information for the deployable.
In Li Liu and Zhuli's design, they do add extra info for the deployable, so
it can be used for stream or image information.

And the Cyborg API had better support filters for scheduler weighing.
Such as:
GET /cyborg/v1/accelerators?hosts=cyborg-1,cyborg-2,cyborg-3&functions=crypto,transcoding
This queries all the hosts cyborg-1, cyborg-2, cyborg-3 to get all
accelerators that support the crypto and transcoding functions.
The Cyborg API calls the conductor to get the accelerator information by these
filters, and the scheduler can leverage that information for weighing.
Maybe the Cyborg API can also help to do the weighing, but I think this is
not a good idea.
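
A sketch of how a scheduler-side weigher might consume that proposed API
(the endpoint, the query parameters and the response shape are assumptions
layered on the proposal above, not an existing interface):

    import requests

    CYBORG = "http://cyborg-api.example.com:6666/cyborg/v1"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<token>"}

    def weigh(hosts, functions):
        # One call covering all candidate hosts, per the suggestion above.
        resp = requests.get(f"{CYBORG}/accelerators", headers=HEADERS,
                            params={"hosts": ",".join(hosts),
                                    "functions": ",".join(functions)})
        accels = resp.json().get("accelerators", [])   # assumed response shape
        counts = {h: 0 for h in hosts}
        for a in accels:
            counts[a["host"]] = counts.get(a["host"], 0) + 1
        # Favor hosts that already expose more matching accelerators.
        return sorted(hosts, key=counts.get, reverse=True)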

To Sundar:
I know you are interested in scheduler weighing and you have some other
weighing solutions.
Hopefully this can be useful for you.
REF: https://etherpad.openstack.org/p/cyborg-nova-poc



2018-03-23 12:27 GMT+08:00 Nadathur, Sundar :

> Hi all,
> There seems to be a possibility of a race condition in the Cyborg/Nova
> flow. Apologies for missing this earlier. (You can refer to the proposed
> Cyborg/Nova spec
> 
> for details.)
>
> Consider the scenario where the flavor specifies a resource class for a
> device type, and also specifies a function (e.g. encrypt) in the extra
> specs. The Nova scheduler would only track the device type as a resource,
> and Cyborg needs to track the availability of functions. Further, to keep
> it simple, say all the functions exist all the time (no reprogramming
> involved).
>
> To recap, here is the scheduler flow for this case:
>
>- A request spec with a flavor comes to Nova conductor/scheduler. The
>flavor has a device type as a resource class, and a function in the extra
>specs.
>- Placement 

Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-25 Thread Nadathur, Sundar

On 3/23/2018 12:44 PM, Eric Fried wrote:

Sundar-

First thought is to simplify by NOT keeping inventory information in
the cyborg db at all.  The provider record in the placement service
already knows the device (the provider ID, which you can look up in the
cyborg db) the host (the root_provider_uuid of the provider representing
the device) and the inventory, and (I hope) you'll be augmenting it with
traits indicating what functions it's capable of.  That way, you'll
always get allocation candidates with devices that *can* load the
desired function; now you just have to engage your weigher to prioritize
the ones that already have it loaded so you can prefer those.

Eric,
   Thanks for the response.

   Traits only indicate whether a qualitative capability exists. To 
check if a free instance of the requested function exists on the host, 
we have to track the total count and free count of the needed function. 
Otherwise, we may pick a host because it *can* host a function, though 
it doesn't have a free instance of the function.


IIUC, your reply seems to expect that we can always reprogram a function 
as needed. The specific case we are looking at here is one where no 
reprogramming is involved. In the terminology of the Cyborg/Nova 
scheduling spec, this is 
the pre-programmed scenario (reasons why an operator may want this are 
stated in the spec). However, even if reprogramming is allowed, to 
prioritize hosts with free instances of the needed function, we will 
need to count how many free instances there are.


Since we said that only device types will be tracked as resource 
classes, and not functions, the scheduler will count available instances 
of device types, and Cyborg would have to count the functions separately.


Please let me know if I missed something.

Thanks & Regards,
Sundar


Am I missing something?

efried

On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:

Hi all,
     There seems to be a possibility of a race condition in the
Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
the proposed Cyborg/Nova spec

for details.)

Consider the scenario where the flavor specifies a resource class for a
device type, and also specifies a function (e.g. encrypt) in the extra
specs. The Nova scheduler would only track the device type as a
resource, and Cyborg needs to track the availability of functions.
Further, to keep it simple, say all the functions exist all the time (no
reprogramming involved).

To recap, here is the scheduler flow for this case:

   * A request spec with a flavor comes to Nova conductor/scheduler. The
 flavor has a device type as a resource class, and a function in the
 extra specs.
   * Placement API returns the list of RPs (compute nodes) which contain
 the requested device types (but not necessarily the function).
   * Cyborg will provide a custom filter which queries Cyborg DB. This
 needs to check which hosts contain the needed function, and filter
 out the rest.
   * The scheduler selects one node from the filtered list, and the
 request goes to the compute node.

For the filter to work, the Cyborg DB needs to maintain a table with
triples of (host, function type, #free units). The filter checks if a
given host has one or more free units of the requested function type.
But, to keep the # free units up to date, Cyborg on the selected compute
node needs to notify the Cyborg API to decrement the #free units when an
instance is spawned, and to increment them when resources are released.

Therein lies the catch: this loop from the compute node to controller is
susceptible to race conditions. For example, if two simultaneous
requests each ask for function A, and there is only one unit of that
available, the Cyborg filter will approve both, both may land on the
same host, and one will fail. This is because Cyborg on the controller
does not decrement resource usage due to one request before processing
the next request.

This is similar to this previous Nova scheduling issue
.
That was solved by having the scheduler claim a resource in Placement
for the selected node. I don't see an analog for Cyborg, since it would
not know which node is selected.

Thanks in advance for suggestions and solutions.

Regards,
Sundar








__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-23 Thread Eric Fried
Sundar-

First thought is to simplify by NOT keeping inventory information in
the cyborg db at all.  The provider record in the placement service
already knows the device (the provider ID, which you can look up in the
cyborg db), the host (the root_provider_uuid of the provider representing
the device), and the inventory, and (I hope) you'll be augmenting it with
traits indicating what functions it's capable of.  That way, you'll
always get allocation candidates with devices that *can* load the
desired function; now you just have to engage your weigher to prioritize
the ones that already have it loaded so you can prefer those.
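
A minimal sketch of what I mean, assuming Nova's BaseHostWeigher interface
plus a made-up Cyborg lookup and a made-up extra-spec key (neither exists
today):

from nova.scheduler import weights


def _function_is_loaded(host, function):
    # Placeholder for whatever query cyborg ends up exposing: "does this
    # host already have 'function' flashed on one of its devices?"
    return False


class FunctionAffinityWeigher(weights.BaseHostWeigher):
    """Prefer hosts that already have the requested function loaded."""

    def _weigh_object(self, host_state, request_spec):
        # 'accel:function' is an invented extra-spec key for this sketch.
        function = request_spec.flavor.extra_specs.get('accel:function')
        if function and _function_is_loaded(host_state.host, function):
            return 1.0   # already programmed; no reflash needed
        return 0.0       # capable (trait matched) but would need programming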

Am I missing something?

efried

On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:
> Hi all,
>     There seems to be a possibility of a race condition in the
> Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
> the proposed Cyborg/Nova spec
> 
> for details.)
> 
> Consider the scenario where the flavor specifies a resource class for a
> device type, and also specifies a function (e.g. encrypt) in the extra
> specs. The Nova scheduler would only track the device type as a
> resource, and Cyborg needs to track the availability of functions.
> Further, to keep it simple, say all the functions exist all the time (no
> reprogramming involved).
> 
> To recap, here is the scheduler flow for this case:
> 
>   * A request spec with a flavor comes to Nova conductor/scheduler. The
> flavor has a device type as a resource class, and a function in the
> extra specs.
>   * Placement API returns the list of RPs (compute nodes) which contain
> the requested device types (but not necessarily the function).
>   * Cyborg will provide a custom filter which queries Cyborg DB. This
> needs to check which hosts contain the needed function, and filter
> out the rest.
>   * The scheduler selects one node from the filtered list, and the
> request goes to the compute node.
> 
> For the filter to work, the Cyborg DB needs to maintain a table with
> triples of (host, function type, #free units). The filter checks if a
> given host has one or more free units of the requested function type.
> But, to keep the # free units up to date, Cyborg on the selected compute
> node needs to notify the Cyborg API to decrement the #free units when an
> instance is spawned, and to increment them when resources are released.
> 
> Therein lies the catch: this loop from the compute node to controller is
> susceptible to race conditions. For example, if two simultaneous
> requests each ask for function A, and there is only one unit of that
> available, the Cyborg filter will approve both, both may land on the
> same host, and one will fail. This is because Cyborg on the controller
> does not decrement resource usage due to one request before processing
> the next request.
> 
> This is similar to this previous Nova scheduling issue
> .
> That was solved by having the scheduler claim a resource in Placement
> for the selected node. I don't see an analog for Cyborg, since it would
> not know which node is selected.
> 
> Thanks in advance for suggestions and solutions.
> 
> Regards,
> Sundar
> 
> 
> 
> 
> 
> 
> 
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

2018-03-22 Thread Nadathur, Sundar

Hi all,
    There seems to be a possibility of a race condition in the 
Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to 
the proposed Cyborg/Nova spec 
 
for details.)


Consider the scenario where the flavor specifies a resource class for a 
device type, and also specifies a function (e.g. encrypt) in the extra 
specs. The Nova scheduler would only track the device type as a 
resource, and Cyborg needs to track the availability of functions. 
Further, to keep it simple, say all the functions exist all the time (no 
reprogramming involved).
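
For illustration only (the resource class and extra-spec key below are
invented; the real names are defined in the spec), such a flavor might
carry:

flavor_properties = {
    'resources:CUSTOM_ACCELERATOR_FPGA': '1',  # device type, counted by Placement
    'accel:function': 'encrypt',               # function, tracked by Cyborg
}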


To recap, here is the scheduler flow for this case:

 * A request spec with a flavor comes to Nova conductor/scheduler. The
   flavor has a device type as a resource class, and a function in the
   extra specs.
 * Placement API returns the list of RPs (compute nodes) which contain
   the requested device types (but not necessarily the function).
 * Cyborg will provide a custom filter which queries Cyborg DB. This
   needs to check which hosts contain the needed function, and filter
   out the rest.
 * The scheduler selects one node from the filtered list, and the
   request goes to the compute node.

For the filter to work, the Cyborg DB needs to maintain a table with 
triples of (host, function type, #free units). The filter checks if a 
given host has one or more free units of the requested function type. 
But, to keep the # free units up to date, Cyborg on the selected compute 
node needs to notify the Cyborg API to decrement the #free units when an 
instance is spawned, and to increment them when resources are released.
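
Roughly, the accounting loop looks like this (names invented), with the
update arriving only after the scheduler has already made its decision:

# (host, function type) -> # free units, maintained in the Cyborg DB.
free_units = {('node1', 'encrypt'): 1}


def cyborg_filter_passes(host, function):
    # The custom filter's check during scheduling.
    return free_units.get((host, function), 0) > 0


def on_instance_spawned(host, function):
    # Reported back by Cyborg on the selected compute node, some time
    # after the scheduler already approved the request.
    free_units[(host, function)] -= 1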


Therein lies the catch: this loop from the compute node to controller is 
susceptible to race conditions. For example, if two simultaneous 
requests each ask for function A, and there is only one unit of that 
available, the Cyborg filter will approve both, both may land on the 
same host, and one will fail. This is because Cyborg on the controller 
does not decrement resource usage due to one request before processing 
the next request.


This is similar to this previous Nova scheduling issue 
. 
That was solved by having the scheduler claim a resource in Placement 
for the selected node. I don't see an analog for Cyborg, since it would 
not know which node is selected.
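
What the Placement claim provides, and what Cyborg lacks here, is an
atomic check-and-decrement at decision time. Purely as a conceptual
sketch (not an existing Cyborg API; driver details glossed over):

CLAIM_SQL = """
UPDATE function_inventory
   SET free = free - 1
 WHERE host = %s AND function = %s AND free > 0
"""


def try_claim(cursor, host, function):
    """Consume one free unit atomically, or fail instead of oversubscribing."""
    cursor.execute(CLAIM_SQL, (host, function))
    # rowcount == 0 means another request took the last free unit first.
    return cursor.rowcount == 1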


Thanks in advance for suggestions and solutions.

Regards,
Sundar






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev