Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-29 Thread sfinucan
On Wed, 2017-06-21 at 07:01 -0400, Sean Dague wrote:
> On 06/21/2017 04:43 AM, sfinu...@redhat.com wrote:
> > On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
> > > On 06/20/2017 09:51 AM, Eric Fried wrote:
> > > > Nice Stephen!
> > > > 
> > > > For those who aren't aware, the rendered version (pretty, so pretty)
> > > > can
> > > > be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
> > > > 
> > > > http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
> > > 
> > > Can we teach it to not put line breaks in the middle of words in the text
> > > boxes?
> > 
> > Doesn't seem configurable in its current form :( This, and the defaulting
> > to PNG output instead of SVG (which makes things ungreppable), are my
> > biggest bugbears.
> > 
> > I'll go have a look at the sauce and see what can be done about it. If not,
> > still better than nothing?
> 
> I've actually looked through the blockdiag source (to try to solve a
> similar problem). There is no easy way to change it.
> 
> If people find it confusing, the best thing to do would be short labels
> on boxes, then explain in more detail in footnotes.

I managed to get this working through some monkey patching of the module [1].
It's not perfect, and efried and I want to do something else to prevent
truncation [2], but it's much better now.

Stephen

[1] https://review.openstack.org/#/c/476159/
[2] https://review.openstack.org/#/c/476204/
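
For illustration, here's a minimal sketch of the general monkey-patching
pattern in a Sphinx conf.py. The module and function names (blockdiag_mod,
wrap_label) are hypothetical placeholders, not the actual change in [1]:

    # conf.py -- illustrative only; 'blockdiag_mod' and 'wrap_label' are
    # hypothetical stand-ins for the real code patched in [1].
    import textwrap


    def setup(app):
        try:
            import blockdiag_mod  # hypothetical module to patch
        except ImportError:
            return  # nothing to patch if the extension isn't installed

        def wrap_on_word_boundaries(text, width):
            # Fold long labels on whitespace instead of mid-word.
            return textwrap.wrap(text, width, break_long_words=False) or [text]

        blockdiag_mod.wrap_label = wrap_on_word_boundaries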



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-21 Thread Sean Dague
On 06/21/2017 04:43 AM, sfinu...@redhat.com wrote:
> On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
>> On 06/20/2017 09:51 AM, Eric Fried wrote:
>>> Nice Stephen!
>>>
>>> For those who aren't aware, the rendered version (pretty, so pretty) can
>>> be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
>>>
>>> http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
>>
>> Can we teach it to not put line breaks in the middle of words in the text
>> boxes?
> 
> Doesn't seem configurable in its current form :( This, and the defaulting to
> PNG output instead of SVG (which makes things ungreppable), are my biggest
> bugbears.
> 
> I'll go have a look at the sauce and see what can be done about it. If not,
> still better than nothing?

I've actually looked through the blockdiag source (to try to solve a
similar problem). There is no easy way to change it.

If people find it confusing, the best thing to do would be short labels
on boxes, then explain in more detail in footnotes.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-21 Thread sfinucan
On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
> On 06/20/2017 09:51 AM, Eric Fried wrote:
> > Nice Stephen!
> > 
> > For those who aren't aware, the rendered version (pretty, so pretty) can
> > be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
> > 
> > http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
> 
> Can we teach it to not put line breaks in the middle of words in the text
> boxes?

Doesn't seem configurable in its current form :( This, and the defaulting to
PNG output instead of SVG (which makes things ungreppable), are my biggest
bugbears.

I'll go have a look at the sauce and see what can be done about it. If not,
still better than nothing?

Stephen



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Chris Friesen

On 06/20/2017 09:51 AM, Eric Fried wrote:

Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling


Can we teach it to not put line breaks in the middle of words in the text boxes?

Chris



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Eric Fried
Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling

On 06/20/2017 09:09 AM, sfinu...@redhat.com wrote:

> 
> I have a document (with a nifty activity diagram in tow) for all the above
> available here:
> 
>   https://review.openstack.org/475810 
> 
> Should be more Google'able than mailing list posts for future us :)
> 
> Stephen
> 
> 



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/20/2017 09:51 AM, Alex Xu wrote:
2017-06-19 22:17 GMT+08:00 Jay Pipes:

* Scheduler then creates a list of N of these data structures,
with the first being the data for the selected host, and the
rest being data structures representing alternates consisting of
the next hosts in the ranked list that are in the same cell as
the selected host.

Yes, this is the proposed solution for allowing retries within a cell.

Is it possible to use traits to distinguish different cells? Then the 
retry could be done in the cell by querying placement directly with a trait 
that indicates the specific cell.


Those traits would be custom traits, generated from the cell name.


No, we're not going to use traits in this way, for a couple of reasons:

1) Placement doesn't and shouldn't know about Nova's internals. Cells 
are internal structures of Nova. Users don't know about them, and neither 
should placement.


2) Traits describe a resource provider. A cell ID doesn't describe a 
resource provider, just like an aggregate ID doesn't describe a resource 
provider.



* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends
that list to the target cell.
* Target cell tries to build the instance on the selected host.
If it fails, it uses the allocation data in the data structure
to unclaim the resources for the selected host, and tries to
claim the resources for the next host in the list using its
allocation data. It then tries to build the instance on the next
host in the list of alternates. Only when all alternates fail
does the build request fail.

On the compute node, will we get rid of the allocation update in the 
periodic task "update_available_resource"? Otherwise, we will have a race 
between the claim in the nova-scheduler and that periodic task.


Yup, good point, and yes, we will be removing the call to PUT 
/allocations in the compute node resource tracker. Only DELETE 
/allocations/{instance_uuid} will be called if something goes terribly 
wrong on instance launch.
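
For anyone picturing that failure path, a rough sketch (assuming direct use of
the placement REST API via python-requests; the service URL and auth handling
are simplified placeholders):

    import requests

    PLACEMENT_URL = "http://placement.example.com"  # placeholder


    def cleanup_failed_launch(instance_uuid):
        # On a failed launch, the compute side only removes the consumer's
        # allocations; it no longer PUTs allocations from the periodic task.
        resp = requests.delete(
            "%s/allocations/%s" % (PLACEMENT_URL, instance_uuid))
        # 204: allocations removed; 404: there were none to remove.
        return resp.status_code in (204, 404)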


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread sfinucan
On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:
> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
> 
> As noted in the nova-scheduler meeting this morning, this should have 
> been called "original plan" rather than "current flow", as Jay pointed 
> out inline.
> 
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> > 
> > Not root RPs. Non-sharing resource providers, which currently 
> > effectively means compute node providers. Nested resource providers 
> > isn't yet merged, so there is currently no concept of a hierarchy of 
> > providers.
> > 
> > > * Placement returns a list of the UUIDs for those root providers to 
> > > scheduler
> > 
> > It returns the provider names and UUIDs, yes.
> > 
> > > * Scheduler uses those UUIDs to create HostState objects for each
> > 
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
> > in a list of the provider UUIDs it got back from the placement service. 
> > The scheduler then builds a set of HostState objects from the results of 
> > ComputeNodeList.get_all_by_uuid().
> > 
> > The scheduler also keeps a set of AggregateMetadata objects in memory, 
> > including the association of aggregate to host (note: this is the 
> > compute node's *service*, not the compute node object itself, thus the 
> > reason aggregates don't work properly for Ironic nodes).
> > 
> > > * Scheduler runs those HostState objects through filters to remove 
> > > those that don't meet requirements not selected for by placement
> > 
> > Yep.
> > 
> > > * Scheduler runs the remaining HostState objects through weighers to 
> > > order them in terms of best fit.
> > 
> > Yep.
> > 
> > > * Scheduler takes the host at the top of that ranked list, and tries 
> > > to claim the resources in placement. If that fails, there is a race, 
> > > so that HostState is discarded, and the next is selected. This is 
> > > repeated until the claim succeeds.
> > 
> > No, this is not how things work currently. The scheduler does not claim 
> > resources. It selects the top (or random host depending on the selection 
> > strategy) and sends the launch request to the target compute node. The 
> > target compute node then attempts to claim the resources and in doing so 
> > writes records to the compute_nodes table in the Nova cell database as 
> > well as the Placement API for the compute node resource provider.
> 
> Not to nit pick, but today the scheduler sends the selected destinations 
> to the conductor. Conductor looks up the cell that a selected host is 
> in, creates the instance record and friends (bdms) in that cell and then 
> sends the build request to the compute host in that cell.
> 
> > 
> > > * Scheduler then creates a list of N UUIDs, with the first being the 
> > > selected host, and the rest being alternates consisting of the 
> > > next hosts in the ranked list that are in the same cell as the 
> > > selected host.
> > 
> > This isn't currently how things work, no. This has been discussed, however.
> > 
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it unclaims the resources for the selected host, and tries to 
> > > claim the resources for the next host in the list. It then tries to 
> > > build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> > 
> > This isn't currently how things work, no. There has been discussion of 
> > having the compute node retry alternatives locally, but nothing more 
> > than discussion.
> 
> Correct that this isn't how things currently work, but it was/is the 
> original plan. And the retry happens within the cell conductor, not on 
> the compute node itself. The top-level conductor is what's getting 
> selected hosts from the scheduler. The cell-level conductor is what's 
> getting a retry request from the compute. The cell-level conductor would 
> deallocate from placement for the currently claimed providers, and then 
> pick one of the alternatives passed down from the top and then make 
> allocations (a claim) against those, then send to an alternative compute 
> host for another build attempt.
> 
> So with this plan, there are two places to make allocations - the 
> scheduler first, and then the cell conductors for retries. This 
> duplication is why some people were originally pushing to move all 
> allocation-related work happen in the conductor service.
> 
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Edward Leafe
On Jun 20, 2017, at 8:38 AM, Jay Pipes  wrote:
> 
>>> The example I posted used 3 resource providers. 2 compute nodes with no 
>>> local disk and a shared storage pool.
>> Now I’m even more confused. In the straw man example 
>> (https://review.openstack.org/#/c/471927/) I see only one variable 
>> ($COMPUTE_NODE_UUID) referencing a compute node in the response.
> 
> I'm referring to the example I put in this email thread on 
> paste.openstack.org with numbers showing 1600 
> bytes for 3 resource providers:
> 
> http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html 
> 


And I’m referring to the comment I made on the spec back on June 13 that was 
never corrected/clarified. I’m glad you gave an example yesterday after I 
expressed my confusion; that was the whole purpose of starting this thread. 
Things may be clear to you, but they have confused me and others. We can’t help 
if we don’t understand.


-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Alex Xu
2017-06-19 22:17 GMT+08:00 Jay Pipes:

> On 06/19/2017 09:04 AM, Edward Leafe wrote:
>
>> Current flow:
>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy those
>> requirements
>>
>
> Not root RPs. Non-sharing resource providers, which currently effectively
> means compute node providers. Nested resource providers isn't yet merged,
> so there is currently no concept of a hierarchy of providers.
>
> * Placement returns a list of the UUIDs for those root providers to
>> scheduler
>>
>
> It returns the provider names and UUIDs, yes.
>
> * Scheduler uses those UUIDs to create HostState objects for each
>>
>
> Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
> in a list of the provider UUIDs it got back from the placement service. The
> scheduler then builds a set of HostState objects from the results of
> ComputeNodeList.get_all_by_uuid().
>
> The scheduler also keeps a set of AggregateMetadata objects in memory,
> including the association of aggregate to host (note: this is the compute
> node's *service*, not the compute node object itself, thus the reason
> aggregates don't work properly for Ironic nodes).
>
> * Scheduler runs those HostState objects through filters to remove those
>> that don't meet requirements not selected for by placement
>>
>
> Yep.
>
> * Scheduler runs the remaining HostState objects through weighers to order
>> them in terms of best fit.
>>
>
> Yep.
>
> * Scheduler takes the host at the top of that ranked list, and tries to
>> claim the resources in placement. If that fails, there is a race, so that
>> HostState is discarded, and the next is selected. This is repeated until
>> the claim succeeds.
>>
>
> No, this is not how things work currently. The scheduler does not claim
> resources. It selects the top (or random host depending on the selection
> strategy) and sends the launch request to the target compute node. The
> target compute node then attempts to claim the resources and in doing so
> writes records to the compute_nodes table in the Nova cell database as well
> as the Placement API for the compute node resource provider.
>
> * Scheduler then creates a list of N UUIDs, with the first being the
>> selected host, and the rest being alternates consisting of the next
>> hosts in the ranked list that are in the same cell as the selected host.
>>
>
> This isn't currently how things work, no. This has been discussed, however.
>
> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list
>> to the target cell.
>> * Target cell tries to build the instance on the selected host. If it
>> fails, it unclaims the resources for the selected host, and tries to claim
>> the resources for the next host in the list. It then tries to build the
>> instance on the next host in the list of alternates. Only when all
>> alternates fail does the build request fail.
>>
>
> This isn't currently how things work, no. There has been discussion of
> having the compute node retry alternatives locally, but nothing more than
> discussion.
>
> Proposed flow:
>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy those
>> requirements
>>
>
> Yes.
>
> * Placement then constructs a data structure for each root provider as
>> documented in the spec. [0]
>>
>
> Yes.
>
> * Placement returns a number of these data structures as JSON blobs. Due
>> to the size of the data, a page size will have to be determined, and
>> placement will have to either maintain that list of structured data for
>> subsequent requests, or re-run the query and only calculate the data
>> structures for the hosts that fit in the requested page.
>>
>
> "of these data structures as JSON blobs" is kind of redundant... all our
> REST APIs return data structures as JSON blobs.
>
> While we discussed the fact that there may be a lot of entries, we did not
> say we'd immediately support a paging mechanism.
>
> * Scheduler continues to request the paged results until it has them all.
>>
>
> See above. Was discussed briefly as a concern but not work to do for first
> patches.
>
> * Scheduler then runs this data through the filters and weighers. No
>> HostState objects are required, as the data structures will contain all the
>> information that scheduler will need.
>>
>
> No, this isn't correct. The scheduler will have *some* of the information
> it requires for weighing from the returned data from the GET
> /allocation_candidates call, but not all of it.
>
> Again, operators have insisted on keeping the flexibility currently in the
> Nova scheduler to weigh/sort compute nodes by things like thermal metrics
> and kinds of data that the 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/20/2017 08:43 AM, Edward Leafe wrote:
On Jun 20, 2017, at 6:54 AM, Jay Pipes wrote:



It was the "per compute host" that I objected to.
I guess it would have helped to see an example of the data returned 
for multiple compute nodes. The straw man example was for a single 
compute node with SR-IOV, NUMA and shared storage. There was no 
indication how multiple hosts meeting the requested resources would 
be returned.


The example I posted used 3 resource providers. 2 compute nodes with 
no local disk and a shared storage pool.


Now I’m even more confused. In the straw man example 
(https://review.openstack.org/#/c/471927/) I see 
only one variable ($COMPUTE_NODE_UUID) referencing a compute node in the 
response.


I'm referring to the example I put in this email thread on 
paste.openstack.org with numbers showing 1600 bytes for 3 resource 
providers:


http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html

Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Edward Leafe
On Jun 20, 2017, at 6:54 AM, Jay Pipes  wrote:
> 
>>> It was the "per compute host" that I objected to.
>> I guess it would have helped to see an example of the data returned for 
>> multiple compute nodes. The straw man example was for a single compute node 
>> with SR-IOV, NUMA and shared storage. There was no indication how multiple 
>> hosts meeting the requested resources would be returned.
> 
> The example I posted used 3 resource providers. 2 compute nodes with no local 
> disk and a shared storage pool.


Now I’m even more confused. In the straw man example 
(https://review.openstack.org/#/c/471927/) I see only one variable 
($COMPUTE_NODE_UUID) referencing a compute node in the 
response.

-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/19/2017 09:26 PM, Boris Pavlovic wrote:

Hi,

Does this look too complicated and a bit over designed.


Is that a question?

For example, why can't we store all the data in memory in a single Python 
application with a simple REST API, and have a
simple mechanism for filtering plugins? Basically there is no 
problem with storing it all on a single host.


You mean how things currently work minus the REST API?

Even if we have 100k hosts, at about 10KB per host that's only ~1GB of RAM 
(I could just use a phone)


There are easy ways to copy the state across different instances (sharing 
updates)


We already do this. It isn't as easy as you think. It's introduced a 
number of race conditions that we're attempting to address by doing 
claims in the scheduler.


And I thought that the Placement project was going to be a centralized, 
small, simple app for collecting all
resource information and doing this very simple and easy placement 
selection...


1) Placement doesn't collect anything.
2) Placement is indeed a simple small app with a global view of resources
3) Placement doesn't do the sorting/weighing of destinations. The 
scheduler does that. See this thread for reasons why this is the case 
(operators didn't want to give up their complexity/flexibility in how 
they tweak selection decisions)
4) Placement simply tells the scheduler which providers have enough 
capacity for a requested set of resource amounts and required 
qualitative traits. It actually is pretty simple.
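
As a concrete illustration of 4), a rough sketch of such a request (the
service URL, auth and microversion header are simplified placeholders; the
resources query string follows the allocation candidates format discussed in
this thread):

    import requests

    PLACEMENT_URL = "http://placement.example.com"  # placeholder


    def get_candidates(vcpus, ram_mb, disk_gb):
        resp = requests.get(
            "%s/allocation_candidates" % PLACEMENT_URL,
            params={"resources": "VCPU:%d,MEMORY_MB:%d,DISK_GB:%d"
                    % (vcpus, ram_mb, disk_gb)},
            headers={"OpenStack-API-Version": "placement 1.10"},
        )
        resp.raise_for_status()
        body = resp.json()
        # allocation_requests: candidate claims the scheduler can attempt;
        # provider_summaries: capacity/usage info used for weighing.
        return body["allocation_requests"], body["provider_summaries"]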


Best,
-jay


Best regards,
Boris Pavlovic

On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe wrote:


On Jun 19, 2017, at 5:27 PM, Jay Pipes wrote:



It was from the straw man example. Replacing the $FOO_UUID with
UUIDs, and then stripping out all whitespace resulted in about
1500 bytes. Your example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.


I guess it would have helped to see an example of the data returned
for multiple compute nodes. The straw man example was for a single
compute node with SR-IOV, NUMA and shared storage. There was no
indication how multiple hosts meeting the requested resources would
be returned.

-- Ed Leafe
















Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/19/2017 08:05 PM, Edward Leafe wrote:
On Jun 19, 2017, at 5:27 PM, Jay Pipes wrote:


It was from the straw man example. Replacing the $FOO_UUID with 
UUIDs, and then stripping out all whitespace resulted in about 1500 
bytes. Your example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.


I guess it would have helped to see an example of the data returned for 
multiple compute nodes. The straw man example was for a single compute 
node with SR-IOV, NUMA and shared storage. There was no indication how 
multiple hosts meeting the requested resources would be returned.


The example I posted used 3 resource providers. 2 compute nodes with no 
local disk and a shared storage pool.


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Boris Pavlovic
Hi,

Does this look too complicated and a bit over designed.

For example, why can't we store all the data in memory in a single Python
application with a simple REST API, and have a
simple mechanism for filtering plugins? Basically there is no
problem with storing it all on a single host.

Even if we have 100k hosts, at about 10KB per host that's only ~1GB of RAM (I
could just use a phone)

There are easy ways to copy the state across different instances (sharing
updates)

And I thought that the Placement project was going to be a centralized small,
simple app for collecting all
resource information and doing this very simple and easy placement
selection...


Best regards,
Boris Pavlovic

On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe  wrote:

> On Jun 19, 2017, at 5:27 PM, Jay Pipes  wrote:
>
>
> It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and
> then stripping out all whitespace resulted in about 1500 bytes. Your
> example, with whitespace included, is 1600 bytes.
>
>
> It was the "per compute host" that I objected to.
>
>
> I guess it would have helped to see an example of the data returned for
> multiple compute nodes. The straw man example was for a single compute node
> with SR-IOV, NUMA and shared storage. There was no indication how multiple
> hosts meeting the requested resources would be returned.
>
> -- Ed Leafe
>
>
>
>
>
>
>
>


Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 5:27 PM, Jay Pipes  wrote:
> 
>> It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and 
>> then stripping out all whitespace resulted in about 1500 bytes. Your 
>> example, with whitespace included, is 1600 bytes.
> 
> It was the "per compute host" that I objected to.

I guess it would have helped to see an example of the data returned for 
multiple compute nodes. The straw man example was for a single compute node 
with SR-IOV, NUMA and shared storage. There was no indication how multiple 
hosts meeting the requested resources would be returned.

-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 05:24 PM, Edward Leafe wrote:
On Jun 19, 2017, at 1:34 PM, Jay Pipes wrote:


OK, thanks for clarifying that. When we discussed returning 1.5K per 
compute host instead of a couple of hundred bytes, there was 
discussion that paging would be necessary.


Not sure where you're getting the whole 1.5K per compute host thing from.


It was from the straw man example. Replacing the $FOO_UUID with UUIDs, 
and then stripping out all whitespace resulted in about 1500 bytes. Your 
example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.

OK, that’s informative, too. Is there anything decided on how much 
host info will be in the response from placement, and how much will 
be in HostState? Or how the reporting of resources by the compute 
nodes will have to change to feed this information to placement? Or 
how the two sources of information will be combined so that the 
filters and weighers can process it? Or is that still to be worked out?


I'm currently working on a patch that integrates the REST API into 
the scheduler.


The merging of data will essentially start with the resource amounts 
that the host state objects contain (stuff like total_usable_ram etc) 
with the accurate data from the provider_summaries section.


So in the near-term, we will be using provider_summaries to update the 
corresponding HostState objects with those values. Is the long-term plan 
to have most of the HostState information moved to placement?


Some things will move to placement sooner rather than later:

* Quantitative things that can be consumed
* Simple traits

Later rather than sooner:

* Distances between aggregates (affinity/anti-affinity)

Never:

* Filtering hosts based on how many instances use a particular image
* Filtering hosts based on something that is hypervisor-dependent
* Sorting hosts based on the number of instances in a particular state 
(e.g. how many instances are live-migrating or shelving at any given time)
* Weighing hosts based on the current temperature of a power supply in a 
rack

* Sorting hosts based on the current weather conditions in Zimbabwe
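
For context on the kind of weigher that stays on the scheduler side, a sketch
using what I understand to be Nova's standard weigher plugin interface
(BaseHostWeigher); the temperature lookup is a made-up placeholder for
whatever out-of-band metric source an operator wires in:

    from nova.scheduler import weights


    def get_inlet_temperature(host_state):
        # Placeholder: a real deployment would read this from its own
        # telemetry/metrics source, which placement will never track.
        return getattr(host_state, "inlet_temp_c", 25.0)


    class CoolestHostWeigher(weights.BaseHostWeigher):
        def _weigh_object(self, host_state, weight_properties):
            # Higher weight wins, so prefer cooler hosts.
            return -get_inlet_temperature(host_state)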

Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 1:34 PM, Jay Pipes  wrote:
> 
>> OK, thanks for clarifying that. When we discussed returning 1.5K per compute 
>> host instead of a couple of hundred bytes, there was discussion that paging 
>> would be necessary.
> 
> Not sure where you're getting the whole 1.5K per compute host thing from.

It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and then 
stripping out all whitespace resulted in about 1500 bytes. Your example, with 
whitespace included, is 1600 bytes. 

> Here's a paste with the before and after of what we're talking about:
> 
> http://paste.openstack.org/show/613129/ 
> 
> 
> Note that I'm using a situation with shared storage and two compute nodes 
> providing VCPU and MEMORY. In the current situation, the shared storage 
> provider isn't returned, as you know.
> 
> The before is 231 bytes. The after (again, with three providers, not 1) is 
> 1651 bytes.

So in the basic non-shared, non-nested case, if there are, let’s say, 200 
compute nodes that can satisfy the request, will there be 1 
“allocation_requests” key returned, with 200 “allocations” sub-keys? And one 
“provider_summaries” key, with 200 sub-keys keyed on the compute node UUIDs?
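
For concreteness, a rough sketch of the shape in question, based on the straw
man spec (field names simplified, UUIDs and amounts are placeholders):

    CANDIDATES = {
        "allocation_requests": [
            {
                "allocations": [
                    {
                        "resource_provider": {"uuid": "$COMPUTE_NODE_1_UUID"},
                        "resources": {"VCPU": 1, "MEMORY_MB": 2048,
                                      "DISK_GB": 20},
                    },
                ],
            },
            # ... one entry like the above per candidate host ...
        ],
        "provider_summaries": {
            "$COMPUTE_NODE_1_UUID": {
                "resources": {
                    "VCPU": {"capacity": 16, "used": 4},
                    "MEMORY_MB": {"capacity": 65536, "used": 8192},
                    "DISK_GB": {"capacity": 2000, "used": 100},
                },
            },
            # ... one entry per candidate provider ...
        },
    }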

> gzipping the after contents results in 358 bytes.
> 
> So, honestly I'm not concerned.

Ok, just wanted to be clear.

>> OK, that’s informative, too. Is there anything decided on how much host info 
>> will be in the response from placement, and how much will be in HostState? 
>> Or how the reporting of resources by the compute nodes will have to change 
>> to feed this information to placement? Or how the two sources of information 
>> will be combined so that the filters and weighers can process it? Or is that 
>> still to be worked out?
> 
> I'm currently working on a patch that integrates the REST API into the 
> scheduler.
> 
> The merging of data will essentially start with the resource amounts that the 
> host state objects contain (stuff like total_usable_ram etc) with the 
> accurate data from the provider_summaries section.


So in the near-term, we will be using provider_summaries to update the 
corresponding HostState objects with those values. Is the long-term plan to 
have most of the HostState information moved to placement?


-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 01:59 PM, Edward Leafe wrote:
While we discussed the fact that there may be a lot of entries, we did 
not say we'd immediately support a paging mechanism.


OK, thanks for clarifying that. When we discussed returning 1.5K per 
compute host instead of a couple of hundred bytes, there was discussion 
that paging would be necessary.


Not sure where you're getting the whole 1.5K per compute host thing from.

Here's a paste with the before and after of what we're talking about:

http://paste.openstack.org/show/613129/

Note that I'm using a situation with shared storage and two compute 
nodes providing VCPU and MEMORY. In the current situation, the shared 
storage provider isn't returned, as you know.


The before is 231 bytes. The after (again, with three providers, not 1) 
is 1651 bytes.


gzipping the after contents results in 358 bytes.

So, honestly I'm not concerned.
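
That measurement is easy to reproduce; a quick sketch (the payload here is a
stand-in, not the actual contents of the paste above):

    import gzip
    import json

    payload = {"allocation_requests": [], "provider_summaries": {}}

    raw = json.dumps(payload).encode("utf-8")
    compressed = gzip.compress(raw)
    print("raw: %d bytes, gzipped: %d bytes" % (len(raw), len(compressed)))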

Again, operators have insisted on keeping the flexibility currently in 
the Nova scheduler to weigh/sort compute nodes by things like thermal 
metrics and kinds of data that the Placement API will never be 
responsible for.


The scheduler will need to merge information from the 
"provider_summaries" part of the HTTP response with information it has 
already in its HostState objects (gotten from 
ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).


OK, that’s informative, too. Is there anything decided on how much host 
info will be in the response from placement, and how much will be in 
HostState? Or how the reporting of resources by the compute nodes will 
have to change to feed this information to placement? Or how the two 
sources of information will be combined so that the filters and weighers 
can process it? Or is that still to be worked out?


I'm currently working on a patch that integrates the REST API into the 
scheduler.


The merging of data will essentially start with the resource amounts 
that the host state objects contain (stuff like total_usable_ram etc) 
with the accurate data from the provider_summaries section.
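
As a rough sketch of that merge step (attribute names and the
provider_summaries layout are simplified here and may not match the eventual
patch exactly):

    def apply_provider_summary(host_state, summary):
        # Overwrite the resource amounts the HostState already carries with
        # the accurate figures reported by placement.
        resources = summary.get("resources", {})
        if "MEMORY_MB" in resources:
            host_state.total_usable_ram_mb = resources["MEMORY_MB"]["capacity"]
            host_state.free_ram_mb = (resources["MEMORY_MB"]["capacity"]
                                      - resources["MEMORY_MB"]["used"])
        if "VCPU" in resources:
            host_state.vcpus_total = resources["VCPU"]["capacity"]
            host_state.vcpus_used = resources["VCPU"]["used"]
        if "DISK_GB" in resources:
            host_state.free_disk_mb = (resources["DISK_GB"]["capacity"]
                                       - resources["DISK_GB"]["used"]) * 1024
        return host_state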


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 9:17 AM, Jay Pipes  wrote:

As Matt pointed out, I mis-wrote when I said “current flow”. I meant “current 
agreed-to design flow”. So no need to rehash that.

>> * Placement returns a number of these data structures as JSON blobs. Due to 
>> the size of the data, a page size will have to be determined, and placement 
>> will have to either maintain that list of structured data for subsequent 
>> requests, or re-run the query and only calculate the data structures for the 
>> hosts that fit in the requested page.
> 
> "of these data structures as JSON blobs" is kind of redundant... all our REST 
> APIs return data structures as JSON blobs.

Well, I was trying to be specific. I didn’t mean to imply that this was a 
radical departure or anything.

> While we discussed the fact that there may be a lot of entries, we did not 
> say we'd immediately support a paging mechanism.

OK, thanks for clarifying that. When we discussed returning 1.5K per compute 
host instead of a couple of hundred bytes, there was discussion that paging 
would be necessary.

>> * Scheduler continues to request the paged results until it has them all.
> 
> See above. Was discussed briefly as a concern but not work to do for first 
> patches.
> 
>> * Scheduler then runs this data through the filters and weighers. No 
>> HostState objects are required, as the data structures will contain all the 
>> information that scheduler will need.
> 
> No, this isn't correct. The scheduler will have *some* of the information it 
> requires for weighing from the returned data from the GET 
> /allocation_candidates call, but not all of it.
> 
> Again, operators have insisted on keeping the flexibility currently in the 
> Nova scheduler to weigh/sort compute nodes by things like thermal metrics and 
> kinds of data that the Placement API will never be responsible for.
> 
> The scheduler will need to merge information from the "provider_summaries" 
> part of the HTTP response with information it has already in its HostState 
> objects (gotten from ComputeNodeList.get_all_by_uuid() and 
> AggregateMetadataList).

OK, that’s informative, too. Is there anything decided on how much host info 
will be in the response from placement, and how much will be in HostState? Or 
how the reporting of resources by the compute nodes will have to change to feed 
this information to placement? Or how the two sources of information will be 
combined so that the filters and weighers can process it? Or is that still to 
be worked out?

>> * Scheduler then selects the data structure at the top of the ranked list. 
>> Inside that structure is a dict of the allocation data that scheduler will 
>> need to claim the resources on the selected host. If the claim fails, the 
>> next data structure in the list is chosen, and repeated until a claim 
>> succeeds.
> 
> Kind of, yes. The scheduler will select a *host* that meets its needs.
> 
> There may be more than one allocation request that includes that host 
> resource provider, because of shared providers and (soon) nested providers. 
> The scheduler will choose one of these allocation requests and attempt a 
> claim of resources by simply PUT /allocations/{instance_uuid} with the 
> serialized body of that allocation request. If 202 returned, cool. If not, 
> repeat for the next allocation request.

Ah, yes, good point. A host with multiple nested providers, or with shared and 
local storage, will have to have multiple copies of the data structure returned 
to reflect those permutations. 
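
To make that claim-with-fallback loop concrete, a rough sketch (the service
URL, auth and microversion header are simplified placeholders; the thread
mentions 202 as the success code, so both 202 and 204 are accepted here):

    import requests

    PLACEMENT_URL = "http://placement.example.com"  # placeholder


    def claim_first_available(instance_uuid, ranked_allocation_requests):
        for alloc_req in ranked_allocation_requests:
            resp = requests.put(
                "%s/allocations/%s" % (PLACEMENT_URL, instance_uuid),
                json=alloc_req,
                headers={"OpenStack-API-Version": "placement 1.10"},
            )
            if resp.status_code in (202, 204):
                return alloc_req  # claim succeeded for this candidate
            # Otherwise the claim was rejected (e.g. a race); try the next.
        return None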

>> * Scheduler then creates a list of N of these data structures, with the 
>> first being the data for the selected host, and the rest being data 
>> structures representing alternates consisting of the next hosts in the 
>> ranked list that are in the same cell as the selected host.
> 
> Yes, this is the proposed solution for allowing retries within a cell.

OK.

>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list to 
>> the target cell.
>> * Target cell tries to build the instance on the selected host. If it fails, 
>> it uses the allocation data in the data structure to unclaim the resources 
>> for the selected host, and tries to claim the resources for the next host in 
>> the list using its allocation data. It then tries to build the instance on 
>> the next host in the list of alternates. Only when all alternates fail does 
>> the build request fail.
> 
> I'll let Dan discuss this last part.


Well, that’s not substantially different than the original plan, so no 
additional explanation is required.

One other thing: since this new functionality is exposed via a new API call, is 
the existing method of filtering RPs by passing in resources going to be 
deprecated? And is the code for adding filtering by traits to that also no longer 
useful?


-- Ed Leafe






Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Matt Riedemann

On 6/19/2017 9:17 AM, Jay Pipes wrote:

On 06/19/2017 09:04 AM, Edward Leafe wrote:

Current flow:


As noted in the nova-scheduler meeting this morning, this should have 
been called "original plan" rather than "current flow", as Jay pointed 
out inline.


* Scheduler gets a req spec from conductor, containing resource 
requirements

* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy 
those requirements


Not root RPs. Non-sharing resource providers, which currently 
effectively means compute node providers. Nested resource providers 
isn't yet merged, so there is currently no concept of a hierarchy of 
providers.


* Placement returns a list of the UUIDs for those root providers to 
scheduler


It returns the provider names and UUIDs, yes.


* Scheduler uses those UUIDs to create HostState objects for each


Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
in a list of the provider UUIDs it got back from the placement service. 
The scheduler then builds a set of HostState objects from the results of 
ComputeNodeList.get_all_by_uuid().


The scheduler also keeps a set of AggregateMetadata objects in memory, 
including the association of aggregate to host (note: this is the 
compute node's *service*, not the compute node object itself, thus the 
reason aggregates don't work properly for Ironic nodes).


* Scheduler runs those HostState objects through filters to remove 
those that don't meet requirements not selected for by placement


Yep.

* Scheduler runs the remaining HostState objects through weighers to 
order them in terms of best fit.


Yep.

* Scheduler takes the host at the top of that ranked list, and tries 
to claim the resources in placement. If that fails, there is a race, 
so that HostState is discarded, and the next is selected. This is 
repeated until the claim succeeds.


No, this is not how things work currently. The scheduler does not claim 
resources. It selects the top (or random host depending on the selection 
strategy) and sends the launch request to the target compute node. The 
target compute node then attempts to claim the resources and in doing so 
writes records to the compute_nodes table in the Nova cell database as 
well as the Placement API for the compute node resource provider.


Not to nit pick, but today the scheduler sends the selected destinations 
to the conductor. Conductor looks up the cell that a selected host is 
in, creates the instance record and friends (bdms) in that cell and then 
sends the build request to the compute host in that cell.




* Scheduler then creates a list of N UUIDs, with the first being the 
selected host, and the rest being alternates consisting of the 
next hosts in the ranked list that are in the same cell as the 
selected host.


This isn't currently how things work, no. This has been discussed, however.


* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that 
list to the target cell.
* Target cell tries to build the instance on the selected host. If it 
fails, it unclaims the resources for the selected host, and tries to 
claim the resources for the next host in the list. It then tries to 
build the instance on the next host in the list of alternates. Only 
when all alternates fail does the build request fail.


This isn't currently how things work, no. There has been discussion of 
having the compute node retry alternatives locally, but nothing more 
than discussion.


Correct that this isn't how things currently work, but it was/is the 
original plan. And the retry happens within the cell conductor, not on 
the compute node itself. The top-level conductor is what's getting 
selected hosts from the scheduler. The cell-level conductor is what's 
getting a retry request from the compute. The cell-level conductor would 
deallocate from placement for the currently claimed providers, and then 
pick one of the alternatives passed down from the top and then make 
allocations (a claim) against those, then send to an alternative compute 
host for another build attempt.


So with this plan, there are two places to make allocations - the 
scheduler first, and then the cell conductors for retries. This 
duplication is why some people were originally pushing to move all 
allocation-related work happen in the conductor service.





Proposed flow:
* Scheduler gets a req spec from conductor, containing resource 
requirements

* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy 
those requirements


Yes.

* Placement then constructs a data structure for each root provider as 
documented in the spec. [0]


Yes.

* Placement returns a number of these data structures as JSON blobs. 
Due to the size of the data, a page size will have to be determined, 
and placement will have to either maintain that list of structured 
data for 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 09:04 AM, Edward Leafe wrote:

Current flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements


Not root RPs. Non-sharing resource providers, which currently 
effectively means compute node providers. Nested resource providers 
isn't yet merged, so there is currently no concept of a hierarchy of 
providers.



* Placement returns a list of the UUIDs for those root providers to scheduler


It returns the provider names and UUIDs, yes.


* Scheduler uses those UUIDs to create HostState objects for each


Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
in a list of the provider UUIDs it got back from the placement service. 
The scheduler then builds a set of HostState objects from the results of 
ComputeNodeList.get_all_by_uuid().


The scheduler also keeps a set of AggregateMetadata objects in memory, 
including the association of aggregate to host (note: this is the 
compute node's *service*, not the compute node object itself, thus the 
reason aggregates don't work properly for Ironic nodes).



* Scheduler runs those HostState objects through filters to remove those that 
don't meet requirements not selected for by placement


Yep.


* Scheduler runs the remaining HostState objects through weighers to order them 
in terms of best fit.


Yep.


* Scheduler takes the host at the top of that ranked list, and tries to claim 
the resources in placement. If that fails, there is a race, so that HostState 
is discarded, and the next is selected. This is repeated until the claim 
succeeds.


No, this is not how things work currently. The scheduler does not claim 
resources. It selects the top (or random host depending on the selection 
strategy) and sends the launch request to the target compute node. The 
target compute node then attempts to claim the resources and in doing so 
writes records to the compute_nodes table in the Nova cell database as 
well as the Placement API for the compute node resource provider.



* Scheduler then creates a list of N UUIDs, with the first being the selected 
host, and the rest being alternates consisting of the next hosts in the 
ranked list that are in the same cell as the selected host.


This isn't currently how things work, no. This has been discussed, however.


* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to 
the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it 
unclaims the resources for the selected host, and tries to claim the resources 
for the next host in the list. It then tries to build the instance on the next 
host in the list of alternates. Only when all alternates fail does the build 
request fail.


This isn't currently how things work, no. There has been discussion of 
having the compute node retry alternatives locally, but nothing more 
than discussion.



Proposed flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements


Yes.


* Placement then constructs a data structure for each root provider as 
documented in the spec. [0]


Yes.


* Placement returns a number of these data structures as JSON blobs. Due to the 
size of the data, a page size will have to be determined, and placement will 
have to either maintain that list of structured data for subsequent requests, or 
re-run the query and only calculate the data structures for the hosts that fit 
in the requested page.


"of these data structures as JSON blobs" is kind of redundant... all our 
REST APIs return data structures as JSON blobs.


While we discussed the fact that there may be a lot of entries, we did 
not say we'd immediately support a paging mechanism.



* Scheduler continues to request the paged results until it has them all.


See above. Was discussed briefly as a concern but not work to do for 
first patches.



* Scheduler then runs this data through the filters and weighers. No HostState 
objects are required, as the data structures will contain all the information 
that scheduler will need.


No, this isn't correct. The scheduler will have *some* of the 
information it requires for weighing from the returned data from the GET 
/allocation_candidates call, but not all of it.


Again, operators have insisted on keeping the flexibility currently in 
the Nova scheduler to weigh/sort compute nodes by things like thermal 
metrics and kinds of data that the Placement API will never be 
responsible for.


The scheduler will need to merge information from the 
"provider_summaries" part of the HTTP response with information it has 
already in its HostState objects (gotten from 
ComputeNodeList.get_all_by_uuid() 

[openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
There is a lot going on lately in placement-land, and some of the changes being 
proposed are complex enough that it is difficult to understand what the final 
result is supposed to look like. I have documented my understanding of the 
current way that the placement/scheduler interaction works, and also what I 
understand of how it will work when the proposed changes are all implemented. I 
don’t know how close that understanding is to what the design is, so I’m hoping 
that this will serve as a starting point for clarifying things, so that 
everyone involved in these efforts has a clear view of the target we are aiming 
for. So please reply to this thread with any corrections or additions, so that 
all can see.

I do realize that some of this is to be done in Pike, and the rest in Queens, 
but that timetable is not relevant to the overall understanding of the design.

-- Ed Leafe

Current flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements
* Placement returns a list of the UUIDs for those root providers to scheduler
* Scheduler uses those UUIDs to create HostState objects for each
* Scheduler runs those HostState objects through filters to remove those that 
don't meet requirements not selected for by placement
* Scheduler runs the remaining HostState objects through weighers to order them 
in terms of best fit.
* Scheduler takes the host at the top of that ranked list, and tries to claim 
the resources in placement. If that fails, there is a race, so that HostState 
is discarded, and the next is selected. This is repeated until the claim 
succeeds.
* Scheduler then creates a list of N UUIDs, with the first being the selected 
host, and the rest being alternates consisting of the next hosts in the 
ranked list that are in the same cell as the selected host.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to 
the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it 
unclaims the resources for the selected host, and tries to claim the resources 
for the next host in the list. It then tries to build the instance on the next 
host in the list of alternates. Only when all alternates fail does the build 
request fail.

Proposed flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements
* Placement then constructs a data structure for each root provider as 
documented in the spec. [0]
* Placement returns a number of these data structures as JSON blobs. Due to the 
size of the data, a page size will have to be determined, and placement will 
have to either maintain that list of structured data for subsequent requests, or 
re-run the query and only calculate the data structures for the hosts that fit 
in the requested page.
* Scheduler continues to request the paged results until it has them all.
* Scheduler then runs this data through the filters and weighers. No HostState 
objects are required, as the data structures will contain all the information 
that scheduler will need.
* Scheduler then selects the data structure at the top of the ranked list. 
Inside that structure is a dict of the allocation data that scheduler will need 
to claim the resources on the selected host. If the claim fails, the next data 
structure in the list is chosen, and repeated until a claim succeeds.
* Scheduler then creates a list of N of these data structures, with the first 
being the data for the selected host, and the rest being data structures 
representing alternates consisting of the next hosts in the ranked list that 
are in the same cell as the selected host.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to 
the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it 
uses the allocation data in the data structure to unclaim the resources for the 
selected host, and tries to claim the resources for the next host in the list 
using its allocation data. It then tries to build the instance on the next host 
in the list of alternates. Only when all alternates fail does the build request 
fail.
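
A rough guess, for illustration only, at the shape of one of those returned
structures (the field names here are invented; the real format is whatever
gets defined in the spec [0]):

    # One candidate "allocation request" plus summary data for weighing.
    allocation_request = {
        "allocations": [
            {"resource_provider": {"uuid": "<compute node RP UUID>"},
             "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
            {"resource_provider": {"uuid": "<shared storage RP UUID>"},
             "resources": {"DISK_GB": 100}},
        ],
    }
    provider_summary = {
        "<compute node RP UUID>": {
            "traits": ["HW_CPU_X86_AVX2"],
            "capacity": {"VCPU": 16, "MEMORY_MB": 65536},
        },
    }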


[0] https://review.openstack.org/#/c/471927/





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Jay Pipes

On 06/12/2017 02:17 PM, Edward Leafe wrote:
On Jun 12, 2017, at 10:20 AM, Jay Pipes wrote:


The RP uuid is part of the provider: the compute node's uuid, and 
(after https://review.openstack.org/#/c/469147/ merges) the PCI 
device's uuid. So in the code that passes the PCI device information 
to the scheduler, we could add that new uuid field, and then the 
scheduler would have the information to a) select the best fit and 
then b) claim it with the specific uuid. Same for all the other 
nested/shared devices.


How would the scheduler know that a particular SRIOV PF resource 
provider UUID is on a particular compute node unless the placement API 
returns information indicating that SRIOV PF is a child of a 
particular compute node resource provider?


Because PCI devices are per compute node. The HostState object populates 
itself from the compute node here:


https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L224-L225

If we add the UUID information to the PCI device, as the above-mentioned 
patch proposes, when the scheduler selects a particular compute node 
that has the device, it uses the PCI device’s UUID. I thought that 
having that information in the scheduler was what that patch was all about.


I would hope that over time, there'd be little to no need for the 
scheduler to read either the compute_nodes or the pci_devices tables 
(which, btw, are in the cell databases). The information that the 
scheduler currently keeps in the host state objects should eventually be 
able to be primarily constructed by the returned results from the 
placement API instead of the existing situation where the scheduler must 
make multiple calls to the multiple cells databases in order to fill 
that information in.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Edward Leafe
On Jun 12, 2017, at 10:20 AM, Jay Pipes  wrote:

>> The RP uuid is part of the provider: the compute node's uuid, and (after 
>> https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So 
>> in the code that passes the PCI device information to the scheduler, we 
>> could add that new uuid field, and then the scheduler would have the 
>> information to a) select the best fit and then b) claim it with the specific 
>> uuid. Same for all the other nested/shared devices.
> 
> How would the scheduler know that a particular SRIOV PF resource provider 
> UUID is on a particular compute node unless the placement API returns 
> information indicating that SRIOV PF is a child of a particular compute node 
> resource provider?


Because PCI devices are per compute node. The HostState object populates itself 
from the compute node here:

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L224-L225
 


If we add the UUID information to the PCI device, as the above-mentioned patch 
proposes, when the scheduler selects a particular compute node that has the 
device, it uses the PCI device’s UUID. I thought that having that information 
in the scheduler was what that patch was all about.

-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-12 Thread Jay Pipes

On 06/09/2017 06:31 PM, Ed Leafe wrote:

On Jun 9, 2017, at 4:35 PM, Jay Pipes  wrote:


We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.


a) we can't assume that
b) a compute node could very well have both local disk and shared disk. how 
would the placement API know which one to pick? This is a sorting/weighing 
decision and thus is something the scheduler is responsible for.


I remember having this discussion, and we concluded that a compute node could 
either have local or shared resources, but not both. There would be a trait to 
indicate shared disk. Has this changed?


I'm not sure it's changed per-se :) It's just that there's nothing 
preventing this from happening. A compute node can theoretically have 
local disk and also be associated with a shared storage pool.



* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?


The filter scheduler has most of the information, yes. What it doesn't have is the 
*identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will 
use to distinguish between things. In other words, the filter scheduler currently does 
things like unpack a NUMATopology object into memory and determine a NUMA cell to place 
an instance to. However, it has no concept that that NUMA cell is (or will soon be once 
nested-resource-providers is done) a resource provider in the placement API. Same for 
SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need to return information 
to the scheduler from the placement API that will allow the scheduler to understand 
"hey, this NUMA cell on compute node X is resource provider $UUID".


I guess that this was the point that confused me. The RP uuid is part of the 
provider: the compute node's uuid, and (after 
https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in 
the code that passes the PCI device information to the scheduler, we could add 
that new uuid field, and then the scheduler would have the information to a) 
select the best fit and then b) claim it with the specific uuid. Same for all 
the other nested/shared devices.


How would the scheduler know that a particular SRIOV PF resource 
provider UUID is on a particular compute node unless the placement API 
returns information indicating that SRIOV PF is a child of a particular 
compute node resource provider?



I don't mean to belabor this, but to my mind this seems a lot less disruptive 
to the existing code.


Belabor away :) I don't mind talking through the details. It's important 
to do.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Chris Dent

On Fri, 9 Jun 2017, Dan Smith wrote:


In other words, I would expect to be able to explain the purpose of the
scheduler as "applies nova-specific logic to the generic resources that
placement says are _valid_, with the goal of determining which one is
_best_".


This sounds great as an explanation. If we can reach this we done good.

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Chris Dent

On Fri, 9 Jun 2017, Jay Pipes wrote:


Sorry, been in a three-hour meeting. Comments inline...


Thanks for getting to this, it's very helpful to me.


* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).


Mmm, kinda, yeah.


What I meant by this was that if it didn't matter which of more than
one nested rp was used, then it would be easier to simply consider
the group of them as members of an inventory (that came out a bit
more in one of the later questions).


* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?


Good question. I think the answer to this is probably pretty theoretical at 
this point. My gut instinct is that we should treat the consumption of 
resources in an atomic fashion, and that transactional nature of allocation 
will result in fewer race conditions and cleaner code. But, admittedly, this 
is just my gut reaction.


I suppose if we were more spread oriented than pack oriented, an
allocation of vcpu and ram would almost operate as a proxy for a
lock, allowing the later correcting allocation proposed above to be
somewhat safe because other near concurrent emplacements would be
happening on some other machine. But we don't have that reality.
I've always been in favor of making the allocation as early as
possible. I remember those halcyon days when we even thought it
might be possible to make a request and claim of resources in one
HTTP request.


  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)


See above. The sorting/weighing logic, which is very much deployer-defined 
and reeks of customization, is what would need to be added to the placement 
API.


And enough of that sorting/weighing logic likely has to do with child or
shared providers that it's not possible to constrain the weighing
and sorting to solely compute nodes? Not just whether the host is on
fire, but the shared disk farm too?

Okay, thank you, that helps set the stage more clearly and leads
straight to my remaining big question, which is asked on the spec
you've proposed:

https://review.openstack.org/#/c/471927/

What are the broad-strokes mechanisms for connecting the non-allocation
data in the response to GET /allocation_requests to the sorting/weighing
logic? Answering on the spec works fine for me; I'm just repeating it here
in case people following along want the transition over to the spec.

Thanks again.

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Dan Smith
>> b) a compute node could very well have both local disk and shared 
>> disk. how would the placement API know which one to pick? This is a
>> sorting/weighing decision and thus is something the scheduler is 
>> responsible for.

> I remember having this discussion, and we concluded that a 
> compute node could either have local or shared resources, but not 
> both. There would be a trait to indicate shared disk. Has this 
> changed?

I've always thought we discussed that one of the benefits of this
approach was that it _could_ have both. Maybe we said "initially we
won't implement stuff so it can have both" but I think the plan has been
that we'd be able to support it.

>>> * We already have the information the filter scheduler needs now
>>>  by some other means, right?  What are the reasons we don't want
>>>  to use that anymore?
>> 
>> The filter scheduler has most of the information, yes. What it 
>> doesn't have is the *identifier* (UUID) for things like SRIOV PFs 
>> or NUMA cells that the Placement API will use to distinguish 
>> between things. In other words, the filter scheduler currently does
>> things like unpack a NUMATopology object into memory and determine
>> a NUMA cell to place an instance to. However, it has no concept
>> that that NUMA cell is (or will soon be once 
>> nested-resource-providers is done) a resource provider in the 
>> placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs,
>>  etc. That's why we need to return information to the scheduler 
>> from the placement API that will allow the scheduler to understand 
>> "hey, this NUMA cell on compute node X is resource provider 
>> $UUID".

Why shouldn't scheduler know those relationships? You were the one (well
one of them :P) that specifically wanted to teach the nova scheduler to
be in the business of arranging and making claims (allocations) against
placement before returning. Why should some parts of the scheduler know
about resource providers, but not others? And, how would scheduler be
able to make the proper decisions (which require knowledge of
hierarchical relationships) without that knowledge? I'm sure I'm missing
something obvious, so please correct me.

IMHO, the scheduler should eventually evolve into a thing that mostly
deals in the currency of placement, translating those into nova concepts
where needed to avoid placement having to know anything about them.
In other words, I would expect to be able to explain the purpose of the
scheduler as "applies nova-specific logic to the generic resources that
placement says are _valid_, with the goal of determining which one is
_best_".

--Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Ed Leafe
On Jun 9, 2017, at 4:35 PM, Jay Pipes  wrote:

>> We can declare that allocating for shared disk is fairly deterministic
>> if we assume that any given compute node is only associated with one
>> shared disk provider.
> 
> a) we can't assume that
> b) a compute node could very well have both local disk and shared disk. how 
> would the placement API know which one to pick? This is a sorting/weighing 
> decision and thus is something the scheduler is responsible for.

I remember having this discussion, and we concluded that a compute node could 
either have local or shared resources, but not both. There would be a trait to 
indicate shared disk. Has this changed?

>> * We already have the information the filter scheduler needs now by
>>  some other means, right?  What are the reasons we don't want to
>>  use that anymore?
> 
> The filter scheduler has most of the information, yes. What it doesn't have 
> is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the 
> Placement API will use to distinguish between things. In other words, the 
> filter scheduler currently does things like unpack a NUMATopology object into 
> memory and determine a NUMA cell to place an instance to. However, it has no 
> concept that that NUMA cell is (or will soon be once 
> nested-resource-providers is done) a resource provider in the placement API. 
> Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need 
> to return information to the scheduler from the placement API that will allow 
> the scheduler to understand "hey, this NUMA cell on compute node X is 
> resource provider $UUID".

I guess that this was the point that confused me. The RP uuid is part of the 
provider: the compute node's uuid, and (after 
https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in 
the code that passes the PCI device information to the scheduler, we could add 
that new uuid field, and then the scheduler would have the information to a) 
select the best fit and then b) claim it with the specific uuid. Same for all 
the other nested/shared devices.

I don't mean to belabor this, but to my mind this seems a lot less disruptive 
to the existing code.


-- Ed Leafe







__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Jay Pipes

Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:

On Mon, 5 Jun 2017, Ed Leafe wrote:


One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".


What would help for me is greater explanation of if and if so, how and
why, "Plan A" doesn't work for nested resource providers.


We'd have to add all the sorting/weighing logic from the existing 
scheduler into the Placement API. Otherwise, the Placement API won't 
understand which child provider to pick out of many providers that meet 
resource/trait requirements.



We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.


a) we can't assume that
b) a compute node could very well have both local disk and shared disk. 
how would the placement API know which one to pick? This is a 
sorting/weighing decision and thus is something the scheduler is 
responsible for.



My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. < Is this
correct?


No, it's not about determinism or how late in the game a claim decision 
is made. It's really just that the scheduler is the thing that does 
sorting/weighing, not the placement API. We made this decision due to 
the operator feedback that they were not willing to give up their 
ability to add custom weighers and be able to have scheduling policies 
that rely on transient data like thermal metrics collection.



The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?


Correct.


Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?


Partly, yes. But, more than anything it's about the placement API 
returning resource provider UUIDs for child providers and sharing 
providers so that the scheduler, when it picks one of those SRIOV 
physical functions, or NUMA cells, or shared storage pools, has the 
identifier with which to tell the placement API "ok, claim *this* 
resource against *this* provider".



* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?


The filter scheduler has most of the information, yes. What it doesn't 
have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells 
that the Placement API will use to distinguish between things. In other 
words, the filter scheduler currently does things like unpack a 
NUMATopology object into memory and determine a NUMA cell to place an 
instance to. However, it has no concept that that NUMA cell is (or will 
soon be once nested-resource-providers is done) a resource provider in 
the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, 
etc. That's why we need to return information to the scheduler from the 
placement API that will allow the scheduler to understand "hey, this 
NUMA cell on compute node X is resource provider $UUID".



* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells).


Mmm, kinda, yeah.

>  If I

  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide".


Not really. Filtering is still going to be done in the placement API. 
It's the thing that says "hey, these providers (or trees of providers) 
meet these resource and trait requirements". The scheduler however is 
what takes that set of filtered providers and does its sorting/weighing 
magic and selects one.


> If that's the case, is the

  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
inventories of a particular class of resource?

Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Dan Smith
>> My current feeling is that we got ourselves into our existing mess
>> of ugly, convoluted code when we tried to add these complex 
>> relationships into the resource tracker and the scheduler. We set
>> out to create the placement engine to bring some sanity back to how
>> we think about things we need to virtualize.
> 
> Sorry, I completely disagree with your assessment of why the
> placement engine exists. We didn't create it to bring some sanity
> back to how we think about things we need to virtualize. We created
> it to add consistency and structure to the representation of
> resources in the system.
> 
> I don't believe that exposing this structured representation of 
> resources is a bad thing or that it is leaking "implementation
> details" out of the placement API. It's not an implementation detail
> that a resource provider is a child of another or that a different
> resource provider is supplying some resource to a group of other
> providers. That's simply an accurate representation of the underlying
> data structures.

This ^.

With the proposal Jay has up, placement is merely exposing some of its
own data structures to a client that has declared what it wants. The
client has made a request for resources, and placement is returning some
allocations that would be valid. None of them are nova-specific at all
-- they're all data structures that you would pass to and/or retrieve
from placement already.

>> I don't know the answer. I'm hoping that we can have a discussion 
>> that might uncover a clear approach, or, at the very least, one
>> that is less murky than the others.
> 
> I really like Dan's idea of returning a list of HTTP request bodies
> for POST /allocations/{consumer_uuid} calls along with a list of
> provider information that the scheduler can use in its
> sorting/weighing algorithms.
> 
> We've put this straw-man proposal here:
> 
> https://review.openstack.org/#/c/471927/
> 
> I'm hoping to keep the conversation going there.

This is the most clear option that we have, in my opinion. It simplifies
what the scheduler has to do, it simplifies what conductor has to do
during a retry, and it minimizes the amount of work that something else
like cinder would have to do to use placement to schedule resources.
Without this, cinder/neutron/whatever has to know about things like
aggregates and hierarchical relationships between providers in order to
make *any* sane decision about selecting resources. If placement returns
valid options with that stuff figured out, then those services can look
at the bits they care about and make a decision.

I'd really like us to use the existing strawman spec as a place to
iterate on what that API would look like, assuming we're going to go
that route, and work on actual code in both placement and the scheduler
to use it. I'm hoping that doing so will help clarify whether this is
the right approach or not, and whether there are other gotchas that we
don't yet have on our radar. We're rapidly running out of runway for
pike here and I feel like we've got to get moving on this or we're going
to have to punt. Since several other things depend on this work, we need
to consider the impact to a lot of our pike commitments if we're not
able to get something merged.

--Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-09 Thread Jay Pipes

On 06/05/2017 05:22 PM, Ed Leafe wrote:

Another proposal involved a change to how placement responds to the
scheduler. Instead of just returning the UUIDs of the compute nodes
that satisfy the required resources, it would include a whole bunch
of additional information in a structured response. A straw man
example of such a response is here:
https://etherpad.openstack.org/p/placement-allocations-straw-man.
This was referred to as "Plan B".


Actually, this was Plan "C". Plan "B" was to modify the return of the 
GET /resource_providers Placement REST API endpoint.


> The main feature of this approach

is that part of that response would be the JSON dict for the
allocation call, containing the specific resource provider UUID for
each resource. This way, when the scheduler selects a host


Important clarification is needed here. The proposal is to have the 
scheduler actually select *more than just the compute host*. The 
scheduler would select the host, any sharing providers and any child 
providers within a host that actually contained the resources/traits 
that the request demanded.


>, it would

simply pass that dict back to the /allocations call, and placement
would be able to do the allocations directly against that
information.

There was another issue raised: simply providing the host UUIDs
didn't give the scheduler enough information in order to run its
filters and weighers. Since the scheduler uses those UUIDs to
construct HostState objects, the specific missing information was
never completely clarified, so I'm just including this aspect of the
conversation for completeness. It is orthogonal to the question of
how to allocate when the resource provider is not "simple".


The specific missing information is the following, but not limited to:

* Whether or not a resource can be provided by a sharing provider or a 
"local provider" or either. For example, assume a compute node that is 
associated with a shared storage pool via an aggregate but that also has 
local disk for instances. The Placement API currently returns just the 
compute host UUID but no indication of whether the compute host has 
local disk to consume from, has shared disk to consume from, or both. 
The scheduler is the thing that must weigh these choices and make a 
choice. The placement API gives the scheduler the choices and the 
scheduler makes a decision based on sorting/weighing algorithms.


It is imperative to remember the reason *why* we decided (way back in 
Portland at the Nova mid-cycle last year) to keep sorting/weighing in 
the Nova scheduler. The reason is because operators (and some 
developers) insisted on being able to weigh the possible choices in ways 
that "could not be pre-determined". In other words, folks wanted to keep 
the existing uber-flexibility and customizability that the scheduler 
weighers (and home-grown weigher plugins) currently allow, including 
being able to sort possible compute hosts by such things as the average 
thermal temperature of the power supply the hardware was connected to 
over the last five minutes (I kid you friggin not.)
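
As a concrete illustration of the kind of home-grown weigher being talked
about here, something along these lines (a sketch only; the temperature
lookup is a made-up stand-in for whatever out-of-band metrics collection an
operator actually has):

    from nova.scheduler import weights

    def _psu_temperature(hostname):
        # Stand-in for an operator's own metrics source (IPMI, telemetry,
        # ...); returns the power supply temperature in degrees C.
        return 50.0

    class PowerSupplyTempWeigher(weights.BaseHostWeigher):
        """Prefer hosts whose power supply has been running cooler."""

        def _weigh_object(self, host_state, weight_properties):
            # Higher weight sorts earlier, so negate the temperature.
            return -_psu_temperature(host_state.host)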


* Which SR-IOV physical function should provide an SRIOV_NET_VF 
resource to an instance. Imagine a situation where a compute host has 4 
SR-IOV physical functions, each having some traits representing hardware 
offload support and each having an inventory of 8 SRIOV_NET_VF. 
Currently the scheduler absolutely has the information to pick one of 
these SRIOV physical functions to assign to a workload. What the 
scheduler does *not* have, however, is a way to tell the Placement API 
to consume an SRIOV_NET_VF from that particular physical function. Why? 
Because the scheduler doesn't know that a particular physical function 
even *is* a resource provider in the placement API. *Something* needs to 
inform the scheduler that the physical function is a resource provider 
and has a particular UUID to identify it. This is precisely what the 
proposed GET /allocation_requests HTTP response data provides to the 
scheduler.
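
For illustration, once the scheduler has that physical function's provider
UUID, the claim it sends back to placement could name that provider directly,
roughly along these lines (a sketch, not the exact allocation payload):

    # The VF is claimed against the chosen PF's own provider, while CPU and
    # RAM are claimed against the compute node provider.
    allocations = {
        "allocations": [
            {"resource_provider": {"uuid": "<compute node RP UUID>"},
             "resources": {"VCPU": 4, "MEMORY_MB": 8192}},
            {"resource_provider": {"uuid": "<chosen PF RP UUID>"},
             "resources": {"SRIOV_NET_VF": 1}},
        ],
    }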



My current feeling is that we got ourselves into our existing mess of
ugly, convoluted code when we tried to add these complex
relationships into the resource tracker and the scheduler. We set out
to create the placement engine to bring some sanity back to how we
think about things we need to virtualize.


Sorry, I completely disagree with your assessment of why the placement 
engine exists. We didn't create it to bring some sanity back to how we 
think about things we need to virtualize. We created it to add 
consistency and structure to the representation of resources in the system.


I don't believe that exposing this structured representation of 
resources is a bad thing or that it is leaking "implementation details" 
out of the placement API. It's not an implementation detail that a 
resource provider is a child of another or that a different resource 
provider is supplying some resource to a group of other providers. 
That's simply an accurate representation of the underlying data structures.

Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-08 Thread Edward Leafe
Sorry for the top-post, but it seems that nobody has responded to this, and 
there are a lot of important questions that need answers. So I’m simply 
re-posting this so that we don’t get too ahead of ourselves, by planning 
implementations before we fully understand the problem and the implications of 
any proposed solution.


-- Ed Leafe


> On Jun 6, 2017, at 9:56 AM, Chris Dent  wrote:
> 
> On Mon, 5 Jun 2017, Ed Leafe wrote:
> 
>> One proposal is to essentially use the same logic in placement
>> that was used to include that host in those matching the
>> requirements. In other words, when it tries to allocate the amount
>> of disk, it would determine that that host is in a shared storage
>> aggregate, and be smart enough to allocate against that provider.
>> This was referred to in our discussion as "Plan A".
> 
> What would help for me is greater explanation of if and if so, how and
> why, "Plan A" doesn't work for nested resource providers.
> 
> We can declare that allocating for shared disk is fairly deterministic
> if we assume that any given compute node is only associated with one
> shared disk provider.
> 
> My understanding is this determinism is not the case with nested
> resource providers because there's some fairly late in the game
> choosing of which pci device or which numa cell is getting used.
> The existing resource tracking doesn't have this problem because the
> claim of those resources is made very late in the game. < Is this
> correct?
> 
> The problem comes into play when we want to claim from the scheduler
> (or conductor). Additional information is required to choose which
> child providers to use. <- Is this correct?
> 
> Plan B overcomes the information deficit by including more
> information in the response from placement (as straw-manned in the
> etherpad [1]) allowing code in the filter scheduler to make accurate
> claims. <- Is this correct?
> 
> For clarity and completeness in the discussion some questions for
> which we have explicit answers would be useful. Some of these may
> appear ignorant or obtuse and are mostly things we've been over
> before. The goal is to draw out some clear statements in the present
> day to be sure we are all talking about the same thing (or get us
> there if not) modified for what we know now, compared to what we
> knew a week or month ago.
> 
> * We already have the information the filter scheduler needs now by
>  some other means, right?  What are the reasons we don't want to
>  use that anymore?
> 
> * Part of the reason for having nested resource providers is because
>  it can allow affinity/anti-affinity below the compute node (e.g.,
>  workloads on the same host but different numa cells). If I
>  remember correctly, the modelling and tracking of this kind of
>  information in this way comes out of the time when we imagined the
>  placement service would be doing considerably more filtering than
>  is planned now. Plan B appears to be an acknowledgement of "on
>  some of this stuff, we can't actually do anything but provide you
>  some info, you need to decide". If that's the case, is the
>  topological modelling on the placement DB side of things solely a
>  convenient place to store information? If there were some other
>  way to model that topology could things currently being considered
>  for modelling as nested providers be instead simply modelled as
>  inventories of a particular class of resource?
>  (I'm not suggesting we do this, rather that the answer that says
>  why we don't want to do this is useful for understanding the
>  picture.)
> 
> * Does a claim made in the scheduler need to be complete? Is there
>  value in making a partial claim from the scheduler that consumes a
>  vcpu and some ram, and then in the resource tracker is corrected
>  to consume a specific pci device, numa cell, gpu and/or fpga?
>  Would this be better or worse than what we have now? Why?
> 
> * What is lacking in placement's representation of resource providers
>  that makes it difficult or impossible for an allocation against a
>  parent provider to be able to determine the correct child
>  providers to which to cascade some of the allocation? (And by
>  extension make the earlier scheduling decision.)
> 
> That's a start. With answers to at last some of these questions I
> think the straw man in the etherpad can be more effectively
> evaluated. As things stand right now it is a proposed solution
> without a clear problem statement. I feel like we could do with a
> more clear problem statement.
> 
> Thanks.
> 
> [1] https://etherpad.openstack.org/p/placement-allocations-straw-man
> 
> -- 
> Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
> freenode: cdent tw: 
> @anticdent

Re: [openstack-dev] [Nova][Scheduler]

2017-06-08 Thread Matt Riedemann

On 6/8/2017 3:36 AM, Narendra Pal Singh wrote:
Do the Ocata bits support adding a custom resource monitor, say network 
bandwidth?


I don't believe so in the upstream code. There is only a CPU bandwidth 
monitor in-tree today, but only supported by the libvirt driver and 
untested anywhere in our integration testing.


Nova Scheduler should consider new metric data for cost calculation of each 
filtered host.


There was an attempt in Liberty, Mitaka and Newton to add a new memory 
bandwidth monitor:


https://specs.openstack.org/openstack/nova-specs/specs/newton/approved/memory-bw.html

But we eventually said no to that, and stated why here:

https://docs.openstack.org/developer/nova/policies.html#metrics-gathering

--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Nova][Scheduler]

2017-06-08 Thread Narendra Pal Singh
Hello,

Do the Ocata bits support adding a custom resource monitor, say network
bandwidth?
Nova Scheduler should consider new metric data for cost calculation of each
filtered host.

-- 
Regards,
NPS.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Edward Leafe
On Jun 7, 2017, at 1:44 PM, Mooney, Sean K wrote:

> [Mooney, Sean K] neutron will need to use nested resource providers to track
> network-backend-specific consumable resources in the future also. One example
> is hardware-offloaded virtual (e.g. virtio/vhost-user) interfaces which, due
> to their hardware-based implementation, are both a finite consumable resource
> and have NUMA affinity, and therefore need to be tracked as nested providers.
> 
> Another example for neutron would be bandwidth-based scheduling / SLA
> enforcement, where we want to guarantee that a specific bandwidth is available
> on the selected host for a VM to consume. From an ovs/vpp/linux bridge
> perspective this would likely be tracked at the physnet level, so when
> selecting a host we would want to ensure that the physnet is both available
> from the host and has enough bandwidth available to reserve for the instance.


OK, thanks, this is excellent information.

New question: will the placement service always be able to pick an acceptable 
provider, given that the request needs X amount of bandwidth? IOW, are there 
other considerations besides quantitative amount (and possibly traits for 
qualitative concerns) that placement simply doesn’t know about? The example I 
have in mind is the case of stack vs. spread, where there are a few available 
providers that can meet the request. The logic for which one to pick can’t be 
in placement, though, as it’s a detail of the calling service. In the case of 
Nova, the assignment of VFs on vNICs usually should be spread, but that is not 
something placement knows; it’s handled by filters/weighers in Nova’s scheduler.

OK, that was a really long way of asking: will Neutron ever need to be able to 
determine the “best” choice from a selection of resource providers? Or will the 
fact that a resource provider has enough of a given resource be all that is 
needed?


-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Mooney, Sean K


> -Original Message-
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Wednesday, June 7, 2017 6:47 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [nova][scheduler][placement] Allocating
> Complex Resources
> 
> On 06/07/2017 01:00 PM, Edward Leafe wrote:
> > On Jun 6, 2017, at 9:56 AM, Chris Dent <cdent...@anticdent.org> wrote:
> >>
> >> For clarity and completeness in the discussion some questions for
> >> which we have explicit answers would be useful. Some of these may
> >> appear ignorant or obtuse and are mostly things we've been over
> >> before. The goal is to draw out some clear statements in the present
> >> day to be sure we are all talking about the same thing (or get us
> >> there if not) modified for what we know now, compared to what we
> knew
> >> a week or month ago.
> >
> > One other question that came up: do we have any examples of any
> > service (such as Neutron or Cinder) that would require the modeling
> > for nested providers? Or is this confined to Nova?
> 
> The Cyborg project (accelerators like FPGAs and some vGPUs) need nested
> resource providers to model the relationship between a virtual resource
> context against an accelerator and the compute node itself.
[Mooney, Sean K] neutron will need to use nested resource providers to track
network-backend-specific consumable resources in the future also. One example
is hardware-offloaded virtual (e.g. virtio/vhost-user) interfaces which, due to
their hardware-based implementation, are both a finite consumable resource and
have NUMA affinity, and therefore need to be tracked as nested providers.

Another example for neutron would be bandwidth-based scheduling / SLA
enforcement, where we want to guarantee that a specific bandwidth is available
on the selected host for a VM to consume. From an ovs/vpp/linux bridge
perspective this would likely be tracked at the physnet level, so when
selecting a host we would want to ensure that the physnet is both available
from the host and has enough bandwidth available to reserve for the instance.

Today nova and neutron do not track either of the above, but at least the
latter has been started in the SR-IOV context without placement, and should be
extended to other non-SR-IOV backends.
Snabb switch actually supports this already with vendor extensions via the
neutron binding:profile:
https://github.com/snabbco/snabb/blob/b7d6d77ba5fd6a6b9306f92466c1779bba2caa31/src/program/snabbnfv/doc/neutron-api-extensions.md#bandwidth-reservation
but nova is not aware of the capacity or availability info when placing the
instance, so if the host cannot fulfill the request it degrades to the least
oversubscribed port:
https://github.com/snabbco/snabb-neutron/blob/master/snabb_neutron/mechanism_snabb.py#L194-L200

With nested resource providers they could harden this request from best effort
to a guaranteed bandwidth reservation, by informing the placement API of the
bandwidth availability of the physical interfaces and also their NUMA affinity,
by creating nested resource providers.
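
As an illustration of that last point, such a physnet could be modelled as a
child resource provider of the compute node with a bandwidth inventory,
roughly like this (the names and resource class below are invented for the
example):

    # Sketch: a nested provider for physnet0 on a compute node, carrying a
    # custom bandwidth resource class that ports would then consume from.
    physnet_provider = {
        "name": "compute-1:physnet0",
        "parent_provider_uuid": "<compute node RP UUID>",
        "inventories": {
            "CUSTOM_NET_BANDWIDTH_MBPS": {
                "total": 10000,
                "reserved": 0,
                "allocation_ratio": 1.0,
            },
        },
    }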

> 
> Best,
> -jay
> 
> ___
> ___
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-
> requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Jay Pipes

On 06/07/2017 01:00 PM, Edward Leafe wrote:

On Jun 6, 2017, at 9:56 AM, Chris Dent wrote:


For clarity and completeness in the discussion some questions for
which we have explicit answers would be useful. Some of these may
appear ignorant or obtuse and are mostly things we've been over
before. The goal is to draw out some clear statements in the present
day to be sure we are all talking about the same thing (or get us
there if not) modified for what we know now, compared to what we
knew a week or month ago.


One other question that came up: do we have any examples of any service
(such as Neutron or Cinder) that would require the modeling for nested
providers? Or is this confined to Nova?


The Cyborg project (accelerators like FPGAs and some vGPUs) need nested 
resource providers to model the relationship between a virtual resource 
context against an accelerator and the compute node itself.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-07 Thread Edward Leafe
On Jun 6, 2017, at 9:56 AM, Chris Dent  wrote:
> 
> For clarity and completeness in the discussion some questions for
> which we have explicit answers would be useful. Some of these may
> appear ignorant or obtuse and are mostly things we've been over
> before. The goal is to draw out some clear statements in the present
> day to be sure we are all talking about the same thing (or get us
> there if not) modified for what we know now, compared to what we
> knew a week or month ago.


One other question that came up: do we have any examples of any service (such 
as Neutron or Cinder) that would require the modeling for nested providers? Or 
is this confined to Nova?


-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Chris Dent

On Mon, 5 Jun 2017, Ed Leafe wrote:


One proposal is to essentially use the same logic in placement
that was used to include that host in those matching the
requirements. In other words, when it tries to allocate the amount
of disk, it would determine that that host is in a shared storage
aggregate, and be smart enough to allocate against that provider.
This was referred to in our discussion as "Plan A".


What would help for me is greater explanation of if and if so, how and
why, "Plan A" doesn't work for nested resource providers.

We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.

My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. < Is this
correct?

The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?

Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?

For clarity and completeness in the discussion some questions for
which we have explicit answers would be useful. Some of these may
appear ignorant or obtuse and are mostly things we've been over
before. The goal is to draw out some clear statements in the present
day to be sure we are all talking about the same thing (or get us
there if not) modified for what we know now, compared to what we
knew a week or month ago.

* We already have the information the filter scheduler needs now by
  some other means, right?  What are the reasons we don't want to
  use that anymore?

* Part of the reason for having nested resource providers is because
  it can allow affinity/anti-affinity below the compute node (e.g.,
  workloads on the same host but different numa cells). If I
  remember correctly, the modelling and tracking of this kind of
  information in this way comes out of the time when we imagined the
  placement service would be doing considerably more filtering than
  is planned now. Plan B appears to be an acknowledgement of "on
  some of this stuff, we can't actually do anything but provide you
  some info, you need to decide". If that's the case, is the
  topological modelling on the placement DB side of things solely a
  convenient place to store information? If there were some other
  way to model that topology could things currently being considered
  for modelling as nested providers be instead simply modelled as
  inventories of a particular class of resource?
  (I'm not suggesting we do this, rather that the answer that says
  why we don't want to do this is useful for understanding the
  picture.)

* Does a claim made in the scheduler need to be complete? Is there
  value in making a partial claim from the scheduler that consumes a
  vcpu and some ram, and then in the resource tracker is corrected
  to consume a specific pci device, numa cell, gpu and/or fpga?
  Would this be better or worse than what we have now? Why?

* What is lacking in placement's representation of resource providers
  that makes it difficult or impossible for an allocation against a
  parent provider to be able to determine the correct child
  providers to which to cascade some of the allocation? (And by
  extension make the earlier scheduling decision.)

That's a start. With answers to at last some of these questions I
think the straw man in the etherpad can be more effectively
evaluated. As things stand right now it is a proposed solution
without a clear problem statement. I feel like we could do with a
more clear problem statement.

Thanks.

[1] https://etherpad.openstack.org/p/placement-allocations-straw-man

--
Chris Dent  ┬──┬◡ノ(° -°ノ)   https://anticdent.org/
freenode: cdent tw: @anticdent
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Sylvain Bauza


Le 06/06/2017 15:03, Edward Leafe a écrit :
> On Jun 6, 2017, at 4:14 AM, Sylvain Bauza wrote:
>>
>> The Plan A option you mention hides the complexity of the
>> shared/non-shared logic but at the price that it would make scheduling
>> decisions on those criteria impossible unless you put
>> filtering/weighting logic into Placement, which AFAIK we strongly
>> disagree with.
> 
> Not necessarily. Well, not now, for sure, but that’s why we need Traits
> to be integrated into Flavors as soon as possible so that we can make
> requests with qualitative requirements, not just quantitative. When that
> work is done, we can add traits to differentiate local from shared
> storage, just like we have traits to distinguish HDD from SSD. So if a
> VM with only local disk is needed, that will be in the request, and
> placement will never return hosts with shared storage. 
> 

Well, there is a big difference between defining constraints in flavors
and making a general constraint on a filter basis that is opt-in by
config.

Operators could argue that they would need to update all their N flavors
in order to achieve a strict separation from shared resource providers,
which would leak that distinction into the flavors that users see.

I'm not saying it's bad to put traits into flavor extra specs, sometimes
they're exactly right, but I do worry about flavor count explosion if we
begin putting all the filtering logic into extra specs (plus the fact
that it can't be managed through config the way filters can at the moment).

-Sylvain

> -- Ed Leafe
> 
> 
> 
> 
> 
> 
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Edward Leafe
On Jun 6, 2017, at 4:14 AM, Sylvain Bauza  wrote:
> 
> The Plan A option you mention hides the complexity of the
> shared/non-shared logic but at the price that it would make scheduling
> decisions on those criteria impossible unless you put
> filtering/weighting logic into Placement, which AFAIK we strongly
> disagree with.


Not necessarily. Well, not now, for sure, but that’s why we need Traits to be 
integrated into Flavors as soon as possible so that we can make requests with 
qualitative requirements, not just quantitative. When that work is done, we can 
add traits to differentiate local from shared storage, just like we have traits 
to distinguish HDD from SSD. So if a VM with only local disk is needed, that 
will be in the request, and placement will never return hosts with shared 
storage. 
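
For illustration, assuming a syntax along the lines of what has been discussed
for putting traits into flavor extra specs (the exact keys and trait names
here are hypothetical, not settled), such a request might look like:

    # Hypothetical flavor extra specs expressing qualitative requirements.
    extra_specs = {
        "trait:STORAGE_DISK_SSD": "required",
        "trait:CUSTOM_LOCAL_DISK": "required",
    }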

-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-06 Thread Sylvain Bauza
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256



Le 05/06/2017 23:22, Ed Leafe a écrit :
> We had a very lively discussion this morning during the Scheduler
> subteam meeting, which was continued in a Google hangout. The
> subject was how to handle claiming resources when the Resource
> Provider is not "simple". By "simple", I mean a compute node that
> provides all of the resources itself, as contrasted with a compute
> node that uses a shared storage for disk space, or which has
> complex nested relationships with things such as PCI devices or
> NUMA nodes. The current situation is as follows:
> 
> a) scheduler gets a request with certain resource requirements
> (RAM, disk, CPU, etc.) b) scheduler passes these resource
> requirements to placement, which returns a list of hosts (compute
> nodes) that can satisfy the request. c) scheduler runs these
> through some filters and weighers to get a list ordered by best
> "fit" d) it then tries to claim the resources, by posting to
> placement allocations for these resources against the selected
> host e) once the allocation succeeds, scheduler returns that host
> to conductor to then have the VM built
> 
> (some details for edge cases left out for clarity of the overall
> process)
> 
> The problem we discussed comes into play when the compute node
> isn't the actual provider of the resources. The easiest example to
> consider is when the computes are associated with a shared storage
> provider. The placement query is smart enough to know that even if
> the compute node doesn't have enough local disk, it will get it
> from the shared storage, so it will return that host in step b)
> above. If the scheduler then chooses that host, when it tries to
> claim it, it will pass the resources and the compute node UUID back
> to placement to make the allocations. This is the point where the
> current code would fall short: somehow, placement needs to know to
> allocate the disk requested against the shared storage provider,
> and not the compute node.
> 
> One proposal is to essentially use the same logic in placement that
> was used to include that host in those matching the requirements.
> In other words, when it tries to allocate the amount of disk, it
> would determine that that host is in a shared storage aggregate,
> and be smart enough to allocate against that provider. This was
> referred to in our discussion as "Plan A".
> 
> Another proposal involved a change to how placement responds to the
> scheduler. Instead of just returning the UUIDs of the compute nodes
> that satisfy the required resources, it would include a whole bunch
> of additional information in a structured response. A straw man
> example of such a response is here:
> https://etherpad.openstack.org/p/placement-allocations-straw-man.
> This was referred to as "Plan B". The main feature of this approach
> is that part of that response would be the JSON dict for the
> allocation call, containing the specific resource provider UUID for
> each resource. This way, when the scheduler selects a host, it
> would simply pass that dict back to the /allocations call, and
> placement would be able to do the allocations directly against that
> information.
> 
> There was another issue raised: simply providing the host UUIDs
> didn't give the scheduler enough information in order to run its
> filters and weighers. Since the scheduler uses those UUIDs to
> construct HostState objects, the specific missing information was
> never completely clarified, so I'm just including this aspect of
> the conversation for completeness. It is orthogonal to the question
> of how to allocate when the resource provider is not "simple".
> 
> My current feeling is that we got ourselves into our existing mess
> of ugly, convoluted code when we tried to add these complex
> relationships into the resource tracker and the scheduler. We set
> out to create the placement engine to bring some sanity back to how
> we think about things we need to virtualize. I would really hate to
> see us make the same mistake again, by adding a good deal of
> complexity to handle a few non-simple cases. What I would like to
> avoid, no matter what the eventual solution chosen, is representing
> this complexity in multiple places. Currently the only two
> candidates for this logic are the placement engine, which knows
> about these relationships already, or the compute service itself,
> which has to handle the management of these complex virtualized
> resources.
> 
> I don't know the answer. I'm hoping that we can have a discussion
> that might uncover a clear approach, or, at the very least, one
> that is less murky than the others.
> 

I wasn't part of either the scheduler meeting or the hangout (I was hit
by a French holiday), so I don't have all the details in mind and could
be making wrong assumptions; I apologize in advance if I say anything silly.

That said, I still have some opinions and I'll put them here. Thanks
for having brought 

[openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

2017-06-05 Thread Ed Leafe
We had a very lively discussion this morning during the Scheduler subteam 
meeting, which was continued in a Google hangout. The subject was how to handle 
claiming resources when the Resource Provider is not "simple". By "simple", I 
mean a compute node that provides all of the resources itself, as contrasted 
with a compute node that uses a shared storage for disk space, or which has 
complex nested relationships with things such as PCI devices or NUMA nodes. The 
current situation is as follows:

a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, 
etc.)
b) scheduler passes these resource requirements to placement, which returns a 
list of hosts (compute nodes) that can satisfy the request.
c) scheduler runs these through some filters and weighers to get a list ordered 
by best "fit"
d) it then tries to claim the resources, by posting to placement allocations 
for these resources against the selected host
e) once the allocation succeeds, scheduler returns that host to conductor to 
then have the VM built

(some details for edge cases left out for clarity of the overall process)
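
(For the simple case, the claim in step d) amounts to writing an allocation against
the chosen compute node's resource provider. A rough sketch of that request follows,
with made-up values and a simplified payload; the real placement API is
microversioned and the exact shape has evolved, so this is only illustrative:)

    PUT /allocations/{instance_uuid}
    {
        "allocations": [
            {
                "resource_provider": {"uuid": "<compute-node-rp-uuid>"},
                "resources": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 40}
            }
        ]
    }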

The problem we discussed comes into play when the compute node isn't the actual 
provider of the resources. The easiest example to consider is when the computes 
are associated with a shared storage provider. The placement query is smart 
enough to know that even if the compute node doesn't have enough local disk, it 
will get it from the shared storage, so it will return that host in step b) 
above. If the scheduler then chooses that host, when it tries to claim it, it 
will pass the resources and the compute node UUID back to placement to make the 
allocations. This is the point where the current code would fall short: 
somehow, placement needs to know to allocate the disk requested against the 
shared storage provider, and not the compute node.

One proposal is to essentially use the same logic in placement that was used to 
include that host in those matching the requirements. In other words, when it 
tries to allocate the amount of disk, it would determine that that host is in a 
shared storage aggregate, and be smart enough to allocate against that 
provider. This was referred to in our discussion as "Plan A".

Another proposal involved a change to how placement responds to the scheduler. 
Instead of just returning the UUIDs of the compute nodes that satisfy the 
required resources, it would include a whole bunch of additional information in 
a structured response. A straw man example of such a response is here: 
https://etherpad.openstack.org/p/placement-allocations-straw-man. This was 
referred to as "Plan B". The main feature of this approach is that part of that 
response would be the JSON dict for the allocation call, containing the 
specific resource provider UUID for each resource. This way, when the scheduler 
selects a host, it would simply pass that dict back to the /allocations call, 
and placement would be able to do the allocations directly against that 
information.
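
(To make the contrast concrete, here is a purely illustrative sketch of the kind of
allocation dict "Plan B" would hand back for the shared-storage case; the actual
straw-man format is in the etherpad above. The point is that the disk is already
attributed to the shared storage provider, so the scheduler can pass the dict
straight through to the /allocations call:)

    {
        "allocations": [
            {
                "resource_provider": {"uuid": "<compute-node-rp-uuid>"},
                "resources": {"VCPU": 2, "MEMORY_MB": 4096}
            },
            {
                "resource_provider": {"uuid": "<shared-storage-rp-uuid>"},
                "resources": {"DISK_GB": 40}
            }
        ]
    }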

There was another issue raised: simply providing the host UUIDs didn't give the 
scheduler enough information in order to run its filters and weighers. Since 
the scheduler uses those UUIDs to construct HostState objects, the specific 
missing information was never completely clarified, so I'm just including this 
aspect of the conversation for completeness. It is orthogonal to the question 
of how to allocate when the resource provider is not "simple".

My current feeling is that we got ourselves into our existing mess of ugly, 
convoluted code when we tried to add these complex relationships into the 
resource tracker and the scheduler. We set out to create the placement engine 
to bring some sanity back to how we think about things we need to virtualize. I 
would really hate to see us make the same mistake again, by adding a good deal 
of complexity to handle a few non-simple cases. What I would like to avoid, no 
matter what the eventual solution chosen, is representing this complexity in 
multiple places. Currently the only two candidates for this logic are the 
placement engine, which knows about these relationships already, or the compute 
service itself, which has to handle the management of these complex virtualized 
resources.

I don't know the answer. I'm hoping that we can have a discussion that might 
uncover a clear approach, or, at the very least, one that is less murky than 
the others.


-- Ed Leafe







signature.asc
Description: Message signed with OpenPGP
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Nova][Scheduler]

2017-05-31 Thread Narendra Pal Singh
Hello,

I am looking for some suggestions. Let's say I have multiple compute nodes:
Pool-A has 5 nodes and Pool-B has 4 nodes, categorized based on some
property.
Now, when there is a request for a new instance, I always want that instance to be
placed on a compute node in Pool-A and never in Pool-B.
What would be the best approach to address this situation?

-- 
Best Regards,
Narendra Pal Singh
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] [Scheduler]

2017-05-31 Thread Jay Pipes

On 05/31/2017 09:49 AM, Narendra Pal Singh wrote:

Hello,

let's say I have multiple compute nodes: Pool-A has 5 nodes and Pool-B
has 4 nodes, categorized based on some property.
Now, when there is a request for a new instance, I always want that instance to be
placed on a compute node in Pool-A.

What would be the best approach to address this situation?


FYI, this question is best asked on openstack@ ML. openstack-dev@ is for 
development questions.


You can use the AggregateInstanceExtraSpecsFilter and aggregate metadata 
to accomplish your needs.


Read more about that here:

https://docs.hpcloud.com/hos-3.x/helion/operations/compute/creating_aggregates.html
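
(As a rough sketch of that approach -- the aggregate, host, flavor and property
names below are illustrative, and the AggregateInstanceExtraSpecsFilter has to be
in the scheduler's enabled filter list:)

    # Group the Pool-A hosts into an aggregate and tag it
    openstack aggregate create pool-a
    openstack aggregate set --property pool=a pool-a
    openstack aggregate add host pool-a compute1
    openstack aggregate add host pool-a compute2
    # Tie a flavor to that tag; instances booted with this flavor land only in Pool-A
    openstack flavor set my-flavor --property aggregate_instance_extra_specs:pool=a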

Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Nova] [Scheduler]

2017-05-31 Thread Narendra Pal Singh
Hello,

let's say I have multiple compute nodes: Pool-A has 5 nodes and Pool-B has 4
nodes, categorized based on some property.
Now, when there is a request for a new instance, I always want that instance to be
placed on a compute node in Pool-A.
What would be the best approach to address this situation?

-- 
Regards,
NPS
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] Anyone relying on the host_subset_size config option?

2017-05-26 Thread Rui Chen
Besides eliminating race conditions, we use host_subset_size in some special
cases where we have hardware of different capacities in a deployment.
Imagine a simple case: two compute hosts (48G vs 16G RAM free) and only the
RAM weigher enabled for nova-scheduler. If we launch
10 instances (1G RAM flavor) one by one, all 10 instances will be
launched on the 48G RAM compute host, which is not what we want;
host_subset_size helps distribute the load to random available hosts in that
situation.

Thank you for sending the mail to the operators list; that lets us get more
feedback before making any changes.

2017-05-27 4:46 GMT+08:00 Ben Nemec :

>
>
> On 05/26/2017 12:17 PM, Edward Leafe wrote:
>
>> [resending to include the operators list]
>>
>> The host_subset_size configuration option was added to the scheduler to
>> help eliminate race conditions when two requests for a similar VM would be
>> processed close together, since the scheduler’s algorithm would select the
>> same host in both cases, leading to a race and a likely failure to build
>> for the second request. By randomly choosing from the top N hosts, the
>> likelihood of a race would be reduced, leading to fewer failed builds.
>>
>> Current changes in the scheduling process now have the scheduler claiming
>> the resources as soon as it selects a host. So in the case above with 2
>> similar requests close together, the first request will claim successfully,
>> but the second will fail *while still in the scheduler*. Upon failing the
>> claim, the scheduler will simply pick the next host in its weighed list
>> until it finds one that it can claim the resources from. So the
>> host_subset_size configuration option is no longer needed.
>>
>> However, we have heard that some operators are relying on this option to
>> help spread instances across their hosts, rather than using the RAM
>> weigher. My question is: will removing this randomness from the scheduling
>> process hurt any operators out there? Or can we safely remove that logic?
>>
>
> We used host_subset_size to schedule randomly in one of the TripleO CI
> clouds.  Essentially we had a heterogeneous set of hardware where the
> numerically larger (more RAM, more disk, equal CPU cores) systems were
> significantly slower.  This caused them to be preferred by the scheduler
> with a normal filter configuration, which is obviously not what we wanted.
> I'm not sure if there's a smarter way to handle it, but setting
> host_subset_size to the number of compute nodes and disabling basically all
> of the weighers allowed us to equally distribute load so at least the slow
> nodes weren't preferred.
>
> That said, we're migrating away from that frankencloud so I certainly
> wouldn't block any scheduler improvements on it.  I'm mostly chiming in to
> describe a possible use case.  And please feel free to point out if there's
> a better way to do this. :-)
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] Anyone relying on the host_subset_size config option?

2017-05-26 Thread Ben Nemec



On 05/26/2017 12:17 PM, Edward Leafe wrote:

[resending to include the operators list]

The host_subset_size configuration option was added to the scheduler to help 
eliminate race conditions when two requests for a similar VM would be processed 
close together, since the scheduler’s algorithm would select the same host in 
both cases, leading to a race and a likely failure to build for the second 
request. By randomly choosing from the top N hosts, the likelihood of a race 
would be reduced, leading to fewer failed builds.

Current changes in the scheduling process now have the scheduler claiming the 
resources as soon as it selects a host. So in the case above with 2 similar 
requests close together, the first request will claim successfully, but the 
second will fail *while still in the scheduler*. Upon failing the claim, the 
scheduler will simply pick the next host in its weighed list until it finds one 
that it can claim the resources from. So the host_subset_size configuration 
option is no longer needed.

However, we have heard that some operators are relying on this option to help 
spread instances across their hosts, rather than using the RAM weigher. My 
question is: will removing this randomness from the scheduling process hurt any 
operators out there? Or can we safely remove that logic?


We used host_subset_size to schedule randomly in one of the TripleO CI 
clouds.  Essentially we had a heterogeneous set of hardware where the 
numerically larger (more RAM, more disk, equal CPU cores) systems were 
significantly slower.  This caused them to be preferred by the scheduler 
with a normal filter configuration, which is obviously not what we 
wanted.  I'm not sure if there's a smarter way to handle it, but setting 
host_subset_size to the number of compute nodes and disabling basically 
all of the weighers allowed us to equally distribute load so at least 
the slow nodes weren't preferred.
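
(For reference, a minimal sketch of that sort of configuration; the option names and
the [filter_scheduler] section are as of roughly Ocata/Pike -- older releases spelled
it scheduler_host_subset_size under [DEFAULT] -- and the values are purely
illustrative:)

    [filter_scheduler]
    # roughly the number of compute nodes, so the final pick is effectively random
    host_subset_size = 30
    # neutralize the RAM/disk preference rather than removing the weighers outright
    ram_weight_multiplier = 0.0
    disk_weight_multiplier = 0.0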


That said, we're migrating away from that frankencloud so I certainly 
wouldn't block any scheduler improvements on it.  I'm mostly chiming in 
to describe a possible use case.  And please feel free to point out if 
there's a better way to do this. :-)


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] Anyone relying on the host_subset_size config option?

2017-05-26 Thread Jay Pipes

On 05/26/2017 01:14 PM, Edward Leafe wrote:

The host_subset_size configuration option was added to the scheduler to help 
eliminate race conditions when two requests for a similar VM would be processed 
close together, since the scheduler’s algorithm would select the same host in 
both cases, leading to a race and a likely failure to build for the second 
request. By randomly choosing from the top N hosts, the likelihood of a race 
would be reduced, leading to fewer failed builds.

Current changes in the scheduling process now have the scheduler claiming the 
resources as soon as it selects a host. So in the case above with 2 similar 
requests close together, the first request will claim successfully, but the 
second will fail *while still in the scheduler*. Upon failing the claim, the 
scheduler will simply pick the next host in its weighed list until it finds one 
that it can claim the resources from. So the host_subset_size configuration 
option is no longer needed.

However, we have heard that some operators are relying on this option to help 
spread instances across their hosts, rather than using the RAM weigher. My 
question is: will removing this randomness from the scheduling process hurt any 
operators out there? Or can we safely remove that logic?


Actually, I don't believe this should be removed. The randomness that is 
injected into the placement decision using this configuration setting is 
useful for reducing contention even in the scheduler claim process.
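
(A toy illustration of that point -- this is not nova code, just a quick simulation of
two schedulers that weigh the same hosts identically; the chance that they race for
the same host drops roughly as 1/host_subset_size:)

    import random

    def pick(weighed_hosts, subset_size):
        # emulate host_subset_size: choose randomly among the top-N weighed hosts
        return random.choice(weighed_hosts[:subset_size])

    hosts = ["host%02d" % i for i in range(20)]  # already sorted best-first
    trials = 10000
    for n in (1, 2, 5, 10):
        collisions = sum(pick(hosts, n) == pick(hosts, n) for _ in range(trials))
        print("subset_size=%2d -> ~%.0f%% of concurrent picks collide"
              % (n, 100.0 * collisions / trials))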


When benchmarking claims in the scheduler here:

https://github.com/jaypipes/placement-bench

I found that the use of a "partitioning strategy" resulted in dramatic 
reduction in lock contention in the claim process. The modulo and random 
partitioning strategies both seemed to work pretty well for reducing 
lock retries.


So, in short, I'd say keep it.

Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Anyone relying on the host_subset_size config option?

2017-05-26 Thread Edward Leafe
[resending to include the operators list]

The host_subset_size configuration option was added to the scheduler to help 
eliminate race conditions when two requests for a similar VM would be processed 
close together, since the scheduler’s algorithm would select the same host in 
both cases, leading to a race and a likely failure to build for the second 
request. By randomly choosing from the top N hosts, the likelihood of a race 
would be reduced, leading to fewer failed builds.

Current changes in the scheduling process now have the scheduler claiming the 
resources as soon as it selects a host. So in the case above with 2 similar 
requests close together, the first request will claim successfully, but the 
second will fail *while still in the scheduler*. Upon failing the claim, the 
scheduler will simply pick the next host in its weighed list until it finds one 
that it can claim the resources from. So the host_subset_size configuration 
option is no longer needed.

However, we have heard that some operators are relying on this option to help 
spread instances across their hosts, rather than using the RAM weigher. My 
question is: will removing this randomness from the scheduling process hurt any 
operators out there? Or can we safely remove that logic?


-- Ed Leafe


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Anyone relying on the host_subset_size config option?

2017-05-26 Thread Edward Leafe
The host_subset_size configuration option was added to the scheduler to help 
eliminate race conditions when two requests for a similar VM would be processed 
close together, since the scheduler’s algorithm would select the same host in 
both cases, leading to a race and a likely failure to build for the second 
request. By randomly choosing from the top N hosts, the likelihood of a race 
would be reduced, leading to fewer failed builds.

Current changes in the scheduling process now have the scheduler claiming the 
resources as soon as it selects a host. So in the case above with 2 similar 
requests close together, the first request will claim successfully, but the 
second will fail *while still in the scheduler*. Upon failing the claim, the 
scheduler will simply pick the next host in its weighed list until it finds one 
that it can claim the resources from. So the host_subset_size configuration 
option is no longer needed.

However, we have heard that some operators are relying on this option to help 
spread instances across their hosts, rather than using the RAM weigher. My 
question is: will removing this randomness from the scheduling process hurt any 
operators out there? Or can we safely remove that logic?


-- Ed Leafe


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Discussion on how to do claims

2017-05-17 Thread Edward Leafe
I missed the summit, so I also missed out on some of the decisions that were 
made. I don’t feel that some of them were ideal, and in talking to the people 
involved I’ve gotten various degrees of certainty about how settled things 
were. So I’ve not only pushed a series of patches as POC code [0] for my 
approach, but I’ve written a summary of my concerns [1].

Speaking with a few people today, since some people are not around, and I’m 
leaving for PyCon tomorrow, we felt it would be worthwhile to have everyone 
read this, and review the current proposed code by Sylvain [2], my code, and 
the blog post summary. Next week when we are all back in town we can set up a 
time to discuss, either in IRC or perhaps a hangout.

[0] https://review.openstack.org/#/c/464086/
[1] https://blog.leafe.com/claims-in-the-scheduler/
[2] https://review.openstack.org/#/c/460177/8 


-- Ed Leafe





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] Scheduler meeting canceled for next Monday

2017-05-04 Thread Edward Leafe
Due to most participants being at the Forum this coming week, we will not hold 
our weekly Scheduler sub team meeting on Monday, May 8. Please join us the 
following Monday (May 15) in #openstack-meeting-alt at 1400 UTC.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-04 Thread Jay Pipes

On 05/04/2017 04:59 AM, Giuseppe Di Lena wrote:

Hi Chris,


I'm pretty sure a regular user can create a server group and specify the 
anti-affinity filter.


yes, but we want the user to specify just the Robustness; the way in which
we assign the instances to the compute nodes should be a black box for the
regular user (and also for the admin).


Server groups *are* a black box though. You create a server group and 
set the policy of the group to "anti-affinity" and that's it. There's no 
need for the user or admin to know anything else...



Why do you need to track which compute nodes the instances are on?


Because putting the instances on the correct compute nodes is just the first
step of the algorithm that we are implementing; for the next steps we need to
know where each instance is.


In a cloud, it shouldn't matter which specific compute node an instance 
is on -- in fact, in clouds, an instance (workload) may not even know 
it's on a hypervisor vs. a baremetal machine vs. a privileged container.


What is important for the user in a cloud to specify is the amount of 
resources the workload will consume (this is the flavor in Nova) and a 
set of characteristics (traits) that the eventual host system should have.


I think it would help if you describe in a little more detail what is 
the eventual outcome you are trying to achieve and what use case that 
outcome serves. Then we can assist you in showing you how to get to that 
outcome.


Best,
-jay


Thank you for the question.

Best regards Giuseppe


On 03 May 2017, at 21:01, Chris Friesen wrote:

On 05/03/2017 03:08 AM, Giuseppe Di Lena wrote:

Thank you a lot for the help!

I think that the problem can be solved using the anti-affinity filter, but we want
a regular user to be able to choose an instance, set its properties (image, flavour,
network, etc.) and set a parameter Robustness >= 1 (that is, the number of copies of
this particular instance).


I'm pretty sure a regular user can create a server group and specify the 
anti-affinity filter.  And a regular user can certainly specify --min-count and 
--max-count to specify the number of copies.


After that, we put every copy of this instance on a different compute node, but we
need to track where we put every copy of the instance (we need to know it for
the algorithm that we are implementing);


Normally only admin-level users are allowed to know which compute nodes a given 
instance is placed on.  Why do you need to track which compute nodes the 
instances are on?

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-04 Thread Giuseppe Di Lena
Hi Chris,

> I'm pretty sure a regular user can create a server group and specify the 
> anti-affinity filter. 

yes, but we want the user to specify just the Robustness; the way in which
we assign the instances to the compute nodes should be a black box for the
regular user (and also for the admin).

> Why do you need to track which compute nodes the instances are on?

Because putting the instances on the correct compute nodes is just the first
step of the algorithm that we are implementing; for the next steps we need to
know where each instance is.

Thank you for the question.

Best regards Giuseppe 

> On 03 May 2017, at 21:01, Chris Friesen wrote:
> 
> On 05/03/2017 03:08 AM, Giuseppe Di Lena wrote:
>> Thank you a lot for the help!
>> 
>> I think that the problem can be solved using the anti-affinity filter, but
>> we want a regular user to be able to choose an instance, set its properties (image,
>> flavour, network, etc.) and set a parameter Robustness >= 1 (that is, the number
>> of copies of this particular instance).
> 
> I'm pretty sure a regular user can create a server group and specify the 
> anti-affinity filter.  And a regular user can certainly specify --min-count 
> and --max-count to specify the number of copies.
> 
>> After that, we put every copy of this instance on a different compute node, but
>> we need to track where we put every copy of the instance (we need to know it
>> for the algorithm that we are implementing);
> 
> Normally only admin-level users are allowed to know which compute nodes a 
> given instance is placed on.  Why do you need to track which compute nodes 
> the instances are on?
> 
> Chris
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-03 Thread Chris Friesen

On 05/03/2017 03:08 AM, Giuseppe Di Lena wrote:

Thank you a lot for the help!

I think that the problem can be solved using the anti-affinity filter, but we want
a regular user to be able to choose an instance, set its properties (image, flavour,
network, etc.) and set a parameter Robustness >= 1 (that is, the number of copies of
this particular instance).


I'm pretty sure a regular user can create a server group and specify the 
anti-affinity filter.  And a regular user can certainly specify --min-count and 
--max-count to specify the number of copies.



After that, we put every copy of this instance on a different compute node, but we
need to track where we put every copy of the instance (we need to know it for
the algorithm that we are implementing);


Normally only admin-level users are allowed to know which compute nodes a given 
instance is placed on.  Why do you need to track which compute nodes the 
instances are on?


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-03 Thread Giuseppe Di Lena
Thank you a lot for the help!

I think that the problem can be solved using the anti-affinity filter, but we
want a regular user to be able to choose an instance, set its properties (image,
flavour, network, etc.) and set a parameter Robustness >= 1 (that is, the number
of copies of this particular instance).

After that, we put every copy of this instance on a different compute node, but we
need to track where we put every copy of the instance (we need to know it for
the algorithm that we are implementing);

EXAMPLE. With 4 compute nodes

The user creates an instance (Server_1) with Robustness = 3.

Server_1_copy1 ==> compute2
Server_1_copy2 ==> compute1
Server_1_copy3 ==> compute4

The user creates another instance (Host_1) with Robustness = 2.

Host_1_copy1 ==> compute2
Host_1_copy2 ==> compute3

and we want to track in a table where the copies of a particular instance are.
For Server_1:

copy1 ==> compute2 
copy2 ==> compute1
copy3 ==> compute4

Same thing for Host_1:
copy1 ==> compute2
copy2 ==> compute3

I hope that is clearer now.
In your opinion, what is the best way to do it?

Best Regards Giuseppe

> On 02 May 2017, at 22:08, Matt Riedemann wrote:
> 
> On 5/2/2017 2:15 PM, Chris Friesen wrote:
>> It sounds to me that the problem could be solved by specifying
>> --min-count and --max-count to specify the number of copies, and using
>> server groups with the anti-affinity filter to ensure that they end up
>> on different compute nodes.
> 
> This is exactly what I was thinking. What else is involved or needed that 
> existing parts of the compute API don't already provide for that use case?
> 
> -- 
> 
> Thanks,
> 
> Matt
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Matt Riedemann

On 5/2/2017 2:15 PM, Chris Friesen wrote:

It sounds to me that the problem could be solved by specifying
--min-count and --max-count to specify the number of copies, and using
server groups with the anti-affinity filter to ensure that they end up
on different compute nodes.


This is exactly what I was thinking. What else is involved or needed 
that existing parts of the compute API don't already provide for that 
use case?


--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Chris Friesen

On 05/02/2017 10:59 AM, Jay Pipes wrote:

On 05/02/2017 12:33 PM, Giuseppe Di Lena wrote:

Thank you a lot! :-).
Actually, we are also working in parallel to implement the algorithm with
tacker, but for this project we will only use the basic modules in OpenStack
and Heat.

If the scheduler hints are no longer supported, what is the correct way to
give the scheduler personalized input with the instance details?
Best regards Giuseppe


Again, it depends on what problem you are trying to solve with this personalized
input... what would the scheduler do with the length of the service chain as an
input? What are you attempting to solve?


It sounds to me that he's not specifying the length of the service chain, but 
the amount of instance redundancy (i.e. the number of copies of the "same" 
instance).


From the earlier message:

"When we create an instance, we give as input the robustness(integer value >=1) 
 that is the number of copy of an instance. For example Robustness = 3, it will 
create 3 instances and put them in 3 different compute nodes."


It sounds to me that the problem could be solved by specifying --min-count and 
--max-count to specify the number of copies, and using server groups with the 
anti-affinity filter to ensure that they end up on different compute nodes.
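
(A rough sketch of that combination, with illustrative names; the
ServerGroupAntiAffinityFilter has to be enabled in the scheduler for the policy to
be enforced:)

    # Create an anti-affinity group, then boot N copies of the instance into it
    openstack server group create --policy anti-affinity robust-group
    openstack server create --flavor my-flavor --image my-image \
        --hint group=<robust-group-uuid> --min 3 --max 3 server-1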


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Mooney, Sean K


> -Original Message-
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Tuesday, May 2, 2017 5:59 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [nova-scheduler] Get scheduler hint
> 
> On 05/02/2017 12:33 PM, Giuseppe Di Lena wrote:
> > Thank you a lot! :-).
> > Actually, we are also working in parallel to implement the algorithm
> with tacker, but for this project we will only use the basic modules in
> OpenStack and Heat.
[Mooney, Sean K] If you can use Heat, you can use the server anti-affinity filter
to create a server group per port-pair group and ensure that two instances of
the same
port-pair-group do not reside on the same server. You could also use
Heat's/Senlin's
scaling groups to define the instance count of each SF in the chain.
There was a presentation on this in Barcelona:
https://www.openstack.org/summit/barcelona-2016/summit-schedule/events/15037/on-building-an-auto-healing-resource-cluster-using-senlin
https://www.youtube.com/watch?v=bmdU_m6vRZc
But as Jay says below, it depends on what you are trying to solve.
If you are trying to model HA constraints for a service chain, the above may
help; however, it is likely outside the scope of the nova/placement API to support
this directly.

> >
> > If the scheduler hints are no longer supported, what is the correct
> way to give the scheduler personalized input with the instance details?
> > Best regards Giuseppe
> 
> Again, it depends on what problem you are trying to solve with this
> personalized input... what would the scheduler do with the length of
> the service chain as an input? What are you attempting to solve?
> 
> Best,
> -jay
> 
> ___
> ___
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-
> requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Jay Pipes

On 05/02/2017 12:33 PM, Giuseppe Di Lena wrote:

Thank you a lot! :-).
Actually, we are also working in parallel to implement the algorithm with 
tacker, but for this project we will only use the basic modules in OpenStack 
and Heat.

If the scheduler hints are no longer supported, what is the correct way to give 
the scheduler personalized input with the instance details?
Best regards Giuseppe


Again, it depends on what problem you are trying to solve with this 
personalized input... what would the scheduler do with the length of the 
service chain as an input? What are you attempting to solve?


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Giuseppe Di Lena
Thank you a lot! :-).
Actually, we are also working in parallel to implement the algorithm with 
tacker, but for this project we will only use the basic modules in OpenStack 
and Heat.

If the scheduler hints are no longer supported, what is the correct way to give 
the scheduler personalized input with the instance details? 
Best regards Giuseppe

> On 02 May 2017, at 16:47, Jay Pipes wrote:
> 
> On 05/02/2017 09:58 AM, Giuseppe Di Lena wrote:
>> Thank you for your answer :-)
>> 
>> I'm pretty new to OpenStack and open source in general (I started two
>> months ago, and I don't have any experience developing big projects).
>> What we are trying to do is implement our algorithm for a Robust NFV
>> chain; to do that we need to create multiple copies of the same instance
>> and put them on different compute nodes.
>> When we create an instance, we give as input the robustness (an integer
>> value >= 1) that is the number of copies of the instance.
>> For example, with Robustness = 3, it will create 3 instances and put them on 3
>> different compute nodes.
> 
> OK, got it. What you want to do is actually not in Nova but rather in the 
> Neutron SFC and Tacker projects:
> 
> https://github.com/openstack/networking-sfc
> https://github.com/openstack/tacker
> 
> Go ahead and read up on those two projects and get a feel for how they are 
> structured. You can head over to Freenode IRC on the #openstack-neutron 
> channel to chat about Neutron service function chaining. The #tacker channel 
> has folks interested in NFV orchestration and should be able to help you out.
> 
> All the best,
> -jay
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Jay Pipes

On 05/02/2017 09:58 AM, Giuseppe Di Lena wrote:

Thank you for your answer :-)

I'm pretty new to OpenStack and open source in general (I started two
months ago, and I don't have any experience developing big projects).
What we are trying to do is implement our algorithm for a Robust NFV
chain; to do that we need to create multiple copies of the same instance
and put them on different compute nodes.
When we create an instance, we give as input the robustness (an integer
value >= 1) that is the number of copies of the instance.
For example, with Robustness = 3, it will create 3 instances and put them on 3
different compute nodes.


OK, got it. What you want to do is actually not in Nova but rather in 
the Neutron SFC and Tacker projects:


https://github.com/openstack/networking-sfc
https://github.com/openstack/tacker

Go ahead and read up on those two projects and get a feel for how they 
are structured. You can head over to Freenode IRC on the 
#openstack-neutron channel to chat about Neutron service function 
chaining. The #tacker channel has folks interested in NFV orchestration 
and should be able to help you out.


All the best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Giuseppe Di Lena
Thank you for your answer :-)

I'm pretty new to OpenStack and open source in general (I started two months
ago, and I don't have any experience developing big projects).
What we are trying to do is implement our algorithm for a Robust NFV chain; to
do that we need to create multiple copies of the same instance and put them on
different compute nodes.
When we create an instance, we give as input the robustness (an integer value >= 1)
that is the number of copies of the instance.
For example, with Robustness = 3, it will create 3 instances and put them on 3
different compute nodes.

Best regards


> On 02 May 2017, at 14:25, Jay Pipes wrote:
> 
> Scheduler hints are neither portable nor going to be supported in Nova 
> long-term, so describe to us what problem you're trying to solve and maybe we 
> can help guide you in what you need to change in the scheduler.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Jay Pipes

On 05/02/2017 05:14 AM, Giuseppe Di Lena wrote:

Hello all,
I'm modifying nova-scheduler, implementing my own scheduler;
is there a way to get the full list of scheduler hints (for example, when I
launch a new instance, I add a custom hint MY_HINT with value 100)?

I tried with

def select_destination(self, context, spec_obj):
…..
spec_obj.get_scheduler_hint('MY_HINT')  # but it returns None
…..

thank you


Scheduler hints are neither portable nor going to be supported in Nova 
long-term, so describe to us what problem you're trying to solve and 
maybe we can help guide you in what you need to change in the scheduler.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova-scheduler] Get scheduler hint

2017-05-02 Thread Giuseppe Di Lena
Hello all,
I'm modifying nova-scheduler, implementing my own scheduler;
is there a way to get the full list of scheduler hints (for example, when I
launch a new instance, I add a custom hint MY_HINT with value 100)?

I tried with 
 
def select_destination(self, context, spec_obj):
…..
spec_obj.get_scheduler_hint('MY_HINT')  # but it returns None
…..

thank you 

Giuseppe Di Lena
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler subteam meeting

2016-11-11 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, November 14 
at 1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20161114T14

As always, the agenda is here: 
https://wiki.openstack.org/wiki/Meetings/NovaScheduler

Please add any items you’d like to discuss to the agenda before the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] ResourceProvider design issues

2016-10-18 Thread Ed Leafe

> On Oct 17, 2016, at 8:45 PM, Jay Pipes  wrote:
> 
> On 10/17/2016 11:14 PM, Ed Leafe wrote:
>> Now that we’re starting to model some more complex resources, it seems that 
>> some of the original design decisions may have been mistaken. One approach 
>> to work around this is to create multiple levels of resource providers. 
>> While that works, it is unnecessarily complicated IMO. I think we need to 
>> revisit some basic assumptions about the design before we dig ourselves a 
>> big design hole that will be difficult to get out of. I’ve tried to 
>> summarize my thoughts in a blog post. I don’t presume that this is the only 
>> possible solution, but I feel it is better than the current approach.
>> 
>> https://blog.leafe.com/virtual-bike-sheds/
> 
> I commented on your blog, but leave it here for posterity:

Likewise, responded on the blog, but following your lead by posting in both 
places.

You didn't include this in your email, but I think you misunderstood my comment
about how "those of us experienced in OOP" might object to having multiple classes
that differ solely on a single attribute. Since you are the one who is doing
the objecting to multiple class names, I was merely saying that anyone with 
background in object-oriented programming might have a reflexive aversion to 
having slight variations on something with 'Class' in its name. That was the 
reason I said that if they had been named 'ResourceTypes' instead, the aversion 
might not be as strong. Sorry for the misunderstanding. I was in no way trying 
to minimize your understanding of OOPy things.

Regarding your comments on standardization, I'm not sure that I can see the 
difference between what you've described and what I have. In your design, you 
would have a standard class name for the SR-IOV-VF, and standard trait names 
for the networks. So with a two-network deployment, there would need to be 3 
standardized names. With multiple classes, there would need to be 2 
standardized names: not a huge difference. Now if there might be a more complex 
deployment than simply 'public' and 'private' networks for SR-IOV devices, then 
things are less clear. For things to be standardized across clouds, the way you 
request a resource has to be standardized. How would the various network names 
be constrained across clouds? Let's say there are N network types; the same 
math would apply. Nested providers would need N+1 standard names and multiple 
classes would need N in order to distinguish. If there are no restrictions on 
network names, then both approaches will fail on standardization, since a 
provider could call a network whatever they want.

As far as NUMA cells and their inventory accounting are concerned, that sounds 
like something where a whiteboard discussion will really help. Most of the 
people working on the placement engine, myself included, have only a passing 
understanding of the intricacies of NUMA arrangements. But even without that, I 
don't see the need to have multiple awkward names for the different NUMA 
resource classes. Based on my understanding, a slightly different approach 
would be sufficient. Instead of having multiple classes, we could remove the 
restriction that a ResourceProvider can only have one of any individual 
ResourceClass. In other words, the host would have two ResourceClass records of 
type NUMA_SOCKET (is that the right class?), one for each NUMA cell, and each 
of those would have their individual inventory records. So a request for 
MEMORY_PAGE_1G would involve a ResourceProvider seeing if any of their 
ResourceClass records has enough of that type of inventory available.

I think the same approach applies to the NIC bandwidth example you gave. By 
allowing multiple ResourceClass records representing the different NICs, the 
total bandwidth will also be a simple aggregate.

Finally, regarding the SQL complexity, I spent years as a SQL DBA and yet I am 
always impressed by how much better your SQL solutions are than the ones I 
might come up with. I'm not saying that the SQL is so complex as to be 
unworkable; I'm simply saying that it is more complex than it needs to be.

In any event, I am looking forward to carrying on these discussions in 
Barcelona with you and the rest of the scheduler subteam.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] ResourceProvider design issues

2016-10-17 Thread Jay Pipes

On 10/17/2016 11:14 PM, Ed Leafe wrote:

Now that we’re starting to model some more complex resources, it seems that 
some of the original design decisions may have been mistaken. One approach to 
work around this is to create multiple levels of resource providers. While that 
works, it is unnecessarily complicated IMO. I think we need to revisit some 
basic assumptions about the design before we dig ourselves a big design hole 
that will be difficult to get out of. I’ve tried to summarize my thoughts in a 
blog post. I don’t presume that this is the only possible solution, but I feel 
it is better than the current approach.

https://blog.leafe.com/virtual-bike-sheds/


I commented on your blog, but leave it here for posterity:

First, one of the reasons for the resource providers work was to 
*standardize* as much as possible the classes of resource that a cloud 
provides. Without standardized resource classes, there is no 
interoperability between clouds. The proposed solution of creating 
resource classes for each combination of actual resource class (the 
SRIOV VF) and the collection of traits that the VF might have (physical 
network tag, speed, product and vendor ID, etc) means there would be no 
interoperable way of referring to a VF resource in one OpenStack cloud 
as provided the same thing in another OpenStack cloud. The fact that a 
VF might be tagged to physical network A or physical network B doesn’t 
change the fundamentals: it’s a virtual function on an SR-IOV-enabled 
NIC that a guest consumes. If I don’t have a single resource class that 
represents a virtual function on an SR-IOV-enabled NIC (and instead I 
have dozens of different resource classes that refer to variations of 
VFs based on network tag and other traits) then I cannot have a 
normalized multi-OpenStack cloud environment because there’s no 
standardization.


Secondly, the compute host to SR-IOV PF is only one relationship that 
can be represented by nested resource providers. Other relationships 
that need to be represented include:


* Compute host to NUMA cell relations where a NUMA cell provides both 
VCPU, MEMORY_MB and MEMORY_PAGE_2M and MEMORY_PAGE_1G inventories that 
are separate from each other but accounted for in the parent provider 
(meaning the compute host’s MEMORY_MB inventory is logically the 
aggregate of both NUMA cells’ inventories of MEMORY_MB). In your data 
modeling, how would you represent two NUMA cells, each with their own 
inventories and allocations? Would you create resource classes called 
NUMA_CELL_0_MEMORY_MB and NUMA_CELL_1_MEMORY_MB etc? See point above 
about one of the purposes of the resource providers work being the 
standardization of resource classification.


* NIC bandwidth and NIC bandwidth per physical network. If I have 4 
physical NICs on a compute host and I want to track network bandwidth as 
a consumable resource on each of those NICs, how would I go about doing 
that? Again, would you suggest auto-creating a set of resource classes 
representing the NICs? So, NET_BW_KB_ENP3S1, NET_BW_KB_ENP4S0, and
so on? If I wanted to see the total aggregate bandwidth of the compute 
host, the system will now have to have tribal knowledge built into it to 
know that all the NET_BW_KB* resource classes are all describing the 
same exact resource class (network bandwidth in KB) but that the 
resource class names should be interpreted in a certain way. Again, not 
standardizable. In the nested resource providers modeling, we would have 
a parent compute host resource provider and 4 child resource providers — 
one for each of the NICs. Each NIC would have a set of traits 
indicating, for example, the interface name or physical network tag. 
However, the inventory (quantitative) amounts for network bandwidth 
would be a single standardized resource class, say NET_BW_KB. This 
nested resource providers system accurately models the real world setup 
of things that are providing the consumable resource, which is network 
bandwidth.
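
(A rough sketch of the tree being described here, in no particular API format; the
provider names, trait names and numbers are illustrative. The point is that the
quantitative inventory always uses the same standard resource class, while the
qualitative differences live as traits on the child providers:)

    compute-node (root resource provider)
      +- numa-cell-0    inventory: VCPU=16, MEMORY_MB=65536, MEMORY_PAGE_1G=32
      +- numa-cell-1    inventory: VCPU=16, MEMORY_MB=65536, MEMORY_PAGE_1G=32
      +- nic-enp3s1     inventory: NET_BW_KB=10000000    traits: CUSTOM_PHYSNET_A
      +- nic-enp4s0     inventory: NET_BW_KB=10000000    traits: CUSTOM_PHYSNET_B

The host-level totals (all MEMORY_MB, all NET_BW_KB) then fall out as simple
aggregates over the children.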


Finally, I think you are overstating the complexity of the SQL that is 
involved in the placement queries.  I’ve tried to design the DB schema 
with an eye to efficient and relatively simple SQL queries — and keeping 
quantitative and qualitative things decoupled in the schema was a big 
part of that efficiency. I’d like to see specific examples of how you 
would solve the above scenarios by combining the qualitative and 
quantitative aspects into a single resource type but still manage to 
have some interoperable standards that multiple OpenStack clouds can expose.


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] ResourceProvider design issues

2016-10-17 Thread Ed Leafe
Now that we’re starting to model some more complex resources, it seems that 
some of the original design decisions may have been mistaken. One approach to 
work around this is to create multiple levels of resource providers. While that 
works, it is unnecessarily complicated IMO. I think we need to revisit some 
basic assumptions about the design before we dig ourselves a big design hole 
that will be difficult to get out of. I’ve tried to summarize my thoughts in a 
blog post. I don’t presume that this is the only possible solution, but I feel 
it is better than the current approach.

https://blog.leafe.com/virtual-bike-sheds/


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler subteam meeting

2016-10-14 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, October 17 at 
1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20161017-T14

This will probably be a brief meeting, unless we start arguing about nested 
resource providers. I’d prefer to have those discussions in person at the 
summit. But if people are interested...

As always, the agenda is here: 
https://wiki.openstack.org/wiki/Meetings/NovaScheduler

Please add any items you’d like to discuss to the agenda before the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler subteam meeting

2016-10-07 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, October 10 at 
1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20161010-T14

As always, the agenda is here: 
https://wiki.openstack.org/wiki/Meetings/NovaScheduler

Please add any items you’d like to discuss to the agenda before the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler subteam meeting

2016-09-30 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, October 3 at 
1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20161003T14

As always, the agenda is here: 
https://wiki.openstack.org/wiki/Meetings/NovaScheduler

Please add any items you’d like to discuss to the agenda before the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next scheduler subteam meeting

2016-09-23 Thread Ed Leafe
The next meeting of the Nova scheduler subteam will be on Monday, September 26 
at 1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160926T14

The agenda is here: https://wiki.openstack.org/wiki/Meetings/NovaScheduler

Please add any issues you would like to discuss to the agenda.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler sub-team meeting

2016-09-16 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be Monday, September 19 at 
1400 UTC in #openstack-meeting-alt

http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160919T14

There probably won’t be much to discuss for Newton, but we should probably 
start thinking about Summit topics. If you have some ideas, please add them to 
the agenda at https://wiki.openstack.org/wiki/Meetings/NovaScheduler


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler subteam meeting Monday 1400 UTC

2016-09-09 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be Monday, September 12, at 
1400 UTC in #openstack-meeting-alt
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160912T14

The main topic for discussion will be the status of the patches for the 
placement API: https://etherpad.openstack.org/p/placement-next

Add anything else to the agenda at 
https://wiki.openstack.org/wiki/Meetings/NovaScheduler


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] Nova scheduler 3rd-party (custom) driver connection in Newton

2016-09-07 Thread Anatoly Smolyaninov
Hello! 

Before the Newton release, I simply used the full module path to my plugin in 
nova.conf. But now the blueprints suggest using built-in drivers via entry 
points from the `nova.scheduler` namespace. 

**How should I plug in third-party (not built-in) drivers?** 

I actually tried adding the entry points manually to e.g. 
`/usr/lib/python2.7/site-packages/nova-13.1.0-py2.7.egg-info/entry_points.txt`, 
and it works, but it doesn't look like the correct way.
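
For reference, the packaging approach I am experimenting with looks roughly
like this (package, module and class names are made up, and I am assuming the
entry-point namespace is `nova.scheduler.driver`, which is what nova's own
setup.cfg appears to use for the built-in drivers; please correct me if the
namespace is different):

    # setup.py of a hypothetical out-of-tree driver package; installing it
    # (e.g. `pip install .`) registers the entry point, so there is no need to
    # hand-edit entry_points.txt inside nova's egg-info directory.
    # All names below are made up, and the 'nova.scheduler.driver' namespace
    # is an assumption; check nova's setup.cfg for the exact namespace.
    from setuptools import setup, find_packages

    setup(
        name='my-nova-scheduler-driver',
        version='0.1.0',
        packages=find_packages(),
        entry_points={
            'nova.scheduler.driver': [
                'my_driver = my_scheduler.driver:MyScheduler',
            ],
        },
    )

With the package installed, the scheduler driver option in nova.conf would
then be set to the entry-point name (`my_driver`) instead of a full import
path.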
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Nova Scheduler Subteam Meeting

2016-08-19 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, August 22 at 
1400 UTC in #openstack-meeting-alt

http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160822T14

The agenda is here: https://wiki.openstack.org/wiki/Meetings/NovaScheduler

If you have any items you wish to discuss, please add them to the agenda before 
the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next Scheduler Subteam meeting

2016-08-12 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be on Monday, August 15 at 
1400 UTC in #openstack-meeting-alt

http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160815T14

The agenda is here: https://wiki.openstack.org/wiki/Meetings/NovaScheduler

If you have any items you wish to discuss, please add them to the agenda before 
the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-08 Thread Chris Dent

On Mon, 8 Aug 2016, Ed Leafe wrote:


Me too. I think that this is one case where thinking in SQL messes you
up. Sure, you can probably make it work by hacking in the concept of
infinity into the code, but there will still be a conceptual
disconnect. And later, when we inevitably need to enhance resource
providers and their capabilities, we will end up creating another hack
to work around what is an actual inventory and what is this new
infinite inventory thing.


/me shrugs

This has been an interesting exercise because it does get people to
express their ideas and their concerns a bit more, which, even if it
doesn't change this round of the stuff we create, helps for the next
round.

In my case my interest in exploring the model I've been describing is
largely driven by wanting to _not_ think in SQL, but in something that is
more like simple math with a very constrained grammar.

Part of the goal here is to constrain things to such an extent that the
"inevitable enhancement" which involves a different conceptual model might
be impossible here, and thus has to happen in a separate service, so that
we can avoid monoliths.

This constrained grammar thing is really important.


Oh, and think of the person coming in new to the placement engine, and
having to explain what an infinite inventory is. Imagine their face.


Weird, I wonder what experience I had that makes that so natural for me.

I conceded a long time ago, but you guys keep saying things that
make me want to keep talking about it.

--
Chris Dent   ┬─┬ノ( º _ ºノ) http://anticdent.org/
freenode: cdent tw: @anticdent
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-08 Thread Ed Leafe
On Aug 8, 2016, at 9:42 AM, Jay Pipes  wrote:

>> When some ssd-ness is consumed, all of it (infinity) is still left
>> over.
>> 
>> For me it is easier: a resource provider has just one way of
>> describing what it can do: classes of inventory (that provide
>> gigabytes of disk that are ssd). When we add tags/capabilities
>> we have another mode of description.
> 
> I could not disagree more. :)

Me too. I think that this is one case where thinking in SQL messes you up. 
Sure, you can probably make it work by hacking the concept of infinity into 
the code, but there will still be a conceptual disconnect. And later, when we 
inevitably need to enhance resource providers and their capabilities, we will 
end up creating another hack to work around the difference between an actual 
inventory and this new infinite inventory thing.

Oh, and think about explaining to someone who is new to the placement engine 
what an infinite inventory is. Imagine their face.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-08 Thread Jay Pipes

On 08/08/2016 06:14 AM, Chris Dent wrote:

On Mon, 8 Aug 2016, Alex Xu wrote:


Chris, thanks for the blog to explain your idea! It helps me understand
your idea better.


Thanks for reading it. As I think I've mentioned a few times I'm not
really trying to sell the idea, just make sure it is clear enough to
be evaluated.


I agree the goal for API interface design in your blog. But one point I
guess you also agree, that is "The interface is easy to understand for
API
user". So look at the example of API request flow with gabbi,  it is
pretty
clear for me even I didn't spend any time to learn the gabbi. That means:
gabbi is cool and the interface is clear! But the only confuse is "total:
∞". And the related ResourceClass is "ssd", does it mean disk size is
infinite? For a user, he is learning our API, he needs to search the
document, due to he want to know "what is this special usage way means
to".
If user can understand our API without any document, so that is perfect.


I think the main source of confusion is that where I find it pretty
easy to think of qualitative characteristics as being an "infinity
of a quantity" that's not an easy concept in general.

In the "ssd" example what it means is not that disk size is infinite
but that the ssd-ness of the resource provider is infinite. So, for
example a resource provider which provides disk space that is hosted
on ssd has (at least) two resource classes:

   DISK_GB: <some finite total>
   SSD: <infinity>

When some ssd-ness is consumed, all of it (infinity) is still left
over.

For me it is easier: a resource provider has just one way of
describing what it can do: classes of inventory (that provide
gigabytes of disk that are ssd). When we add tags/capabilities
we have another mode of description.


I could not disagree more. :)

You don't have an "inventory" of SSD. You have an inventory of DISK_GB. 
Whether the resource provider of that DISK_GB resource uses SSD or HDD 
disks is an *adjective* -- i.e. a capability -- that describes the provider.


Saying capabilities are just resources with infinite inventory just 
makes the API more confusing IMHO.


It's like saying "hey, give me two apples and an infinite amount of 
red." Just doesn't make sense.


My two pesos,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-08 Thread Chris Dent

On Mon, 8 Aug 2016, Alex Xu wrote:


Chris, thanks for the blog to explain your idea! It helps me understand
your idea better.


Thanks for reading it. As I think I've mentioned a few times I'm not
really trying to sell the idea, just make sure it is clear enough to
be evaluated.


I agree the goal for API interface design in your blog. But one point I
guess you also agree, that is "The interface is easy to understand for API
user". So look at the example of API request flow with gabbi,  it is pretty
clear for me even I didn't spend any time to learn the gabbi. That means:
gabbi is cool and the interface is clear! But the only confuse is "total:
∞". And the related ResourceClass is "ssd", does it mean disk size is
infinite? For a user, he is learning our API, he needs to search the
document, due to he want to know "what is this special usage way means to".
If user can understand our API without any document, so that is perfect.


I think the main source of confusion is that, while I find it pretty
easy to think of qualitative characteristics as being an "infinity
of a quantity", that's not an easy concept in general.

In the "ssd" example what it means is not that disk size is infinite
but that the ssd-ness of the resource provider is infinite. So, for
example, a resource provider which provides disk space that is hosted
on ssd has (at least) two resource classes:

   DISK_GB: <some finite total>
   SSD: <infinity>

When some ssd-ness is consumed, all of it (infinity) is still left
over.
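
If it helps, the arithmetic I have in mind is no more than this (a toy
illustration only, not placement code):

    # Toy illustration only (not placement code): a provider with a finite
    # DISK_GB inventory and an "infinite" SSD inventory.
    INF = float('inf')

    inventory = {'DISK_GB': 2048, 'SSD': INF}
    used = {'DISK_GB': 0, 'SSD': 0}

    def allocate(request):
        # A request consumes some DISK_GB and "some" ssd-ness.
        for rc, amount in request.items():
            assert inventory[rc] - used[rc] >= amount
            used[rc] += amount

    allocate({'DISK_GB': 100, 'SSD': 1})
    remaining = dict((rc, inventory[rc] - used[rc]) for rc in inventory)
    print(remaining)   # {'DISK_GB': 1948, 'SSD': inf}: all the ssd-ness remains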

For me it is easier: a resource provider has just one way of
describing what it can do: classes of inventory (that provide
gigabytes of disk that are ssd). When we add tags/capabilities
we have another mode of description.

/me yields

--
Chris Dent   ┬─┬ノ( º _ ºノ) http://anticdent.org/
freenode: cdent tw: @anticdent
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-07 Thread Alex Xu
Chris, thanks for the blog post explaining your idea! It helps me understand
it better.

I agree with the goals for API interface design in your blog. But one point I
guess you also agree with is that the interface should be easy for API users
to understand. Looking at the example API request flow with gabbi, it is
pretty clear to me even though I didn't spend any time learning gabbi. That
means gabbi is cool and the interface is clear! The only confusing part is
"total: ∞" with the related ResourceClass "ssd": does it mean the disk size is
infinite? A user learning our API would have to search the documentation just
to find out what this special usage means. If users can understand our API
without any documentation, that is perfect.

I agree with all your other points: limited resources, a unified concept. If we
want to reach that goal, I think the way to do it is "use ResourceProviderTags
instead of ResourceClass", not "use ResourceClass instead of ResourceProviderTags".

2016-08-05 21:16 GMT+08:00 Chris Dent :

> On Tue, 2 Aug 2016, Alex Xu wrote:
>
> Chris have a thought about using ResourceClass to describe Capabilities
>> with an infinite inventory. In the beginning we brain storming the idea of
>> Tags, Tan Lin have same thought, but we say no very quickly, due to the
>> ResourceClass is really about Quantitative stuff. But Chris give very good
>> point about simplify the ResourceProvider model and the API.
>>
>
> I'm still leaning in this direction. I realized I wasn't explaining
> myself very well and "because I like it" isn't really a good enough
> for doing anything, so I wrote something up about it:
>
>https://anticdent.org/simple-resource-provision.html
>
> --
> Chris Dent   ┬─┬ノ( º _ ºノ) http://anticdent.org/
> freenode: cdent tw: @anticdent
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-07 Thread Yingxin Cheng
2016-08-05 21:16 GMT+08:00 Chris Dent :

> On Tue, 2 Aug 2016, Alex Xu wrote:
>
> Chris have a thought about using ResourceClass to describe Capabilities
>> with an infinite inventory. In the beginning we brain storming the idea of
>> Tags, Tan Lin have same thought, but we say no very quickly, due to the
>> ResourceClass is really about Quantitative stuff. But Chris give very good
>> point about simplify the ResourceProvider model and the API.
>>
>
> I'm still leaning in this direction. I realized I wasn't explaining
> myself very well and "because I like it" isn't really a good enough
> for doing anything, so I wrote something up about it:
>
>https://anticdent.org/simple-resource-provision.html
>
>
Reusing the existing infrastructure of resource classes, inventories and
allocations does make it easier to implement capabilities, along with their
calculations and representations, at least at the beginning.

But I'm still not convinced by this direction, because it introduces
unnecessary reuse as well as overhead for capabilities. Instead of attaching a
capability directly to a resource provider, we would have to create an
inventory and assign the capability to that inventory, indirectly. Moreover,
it reuses allocations and even the "compare-and-swap" strategy implemented via
the "generation" field in the resource provider. And it introduces further
complexity and obscurity if we decide to disable the unnecessary consumable
features for capabilities.

The existing resource provider architecture is mainly for consumable
resources, and we don't want capabilities to be consumable by mistake.
Non-consumable capabilities are an inherently different implementation, so I
tend to agree with implementing the qualitative part of resource providers
from a fresh start, to keep it simple and direct, and adding features
incrementally if they are thought necessary.


---
Regards
Yingxin
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-05 Thread Chris Dent

On Tue, 2 Aug 2016, Alex Xu wrote:


Chris have a thought about using ResourceClass to describe Capabilities
with an infinite inventory. In the beginning we brain storming the idea of
Tags, Tan Lin have same thought, but we say no very quickly, due to the
ResourceClass is really about Quantitative stuff. But Chris give very good
point about simplify the ResourceProvider model and the API.


I'm still leaning in this direction. I realized I wasn't explaining
myself very well and "because I like it" isn't really a good enough
for doing anything, so I wrote something up about it:

   https://anticdent.org/simple-resource-provision.html

--
Chris Dent   ┬─┬ノ( º _ ºノ) http://anticdent.org/
freenode: cdent tw: @anticdent
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-03 Thread Alex Xu
2016-08-03 2:12 GMT+08:00 Jay Pipes :

> On 08/02/2016 08:19 AM, Alex Xu wrote:
>
>> Chris have a thought about using ResourceClass to describe Capabilities
>> with an infinite inventory. In the beginning we brain storming the idea
>> of Tags, Tan Lin have same thought, but we say no very quickly, due to
>> the ResourceClass is really about Quantitative stuff. But Chris give
>> very good point about simplify the ResourceProvider model and the API.
>>
>> After rethinking about those idea, I like simplify the ResourceProvider
>> model and the API. But I think the direction is opposite. ResourceClass
>> with infinite inventory is really hacky. The Placement API is simple,
>> but the usage of those API isn't simple for user, they need create a
>> ResourceClass, then create an infinite inventory. And ResourceClass
>> isn't managable like Tags, look at the Tags API, there are many query
>> parameter.
>>
>> But look at the ResourceClass and ResourceProviderTags, they are totally
>> same, two columns: one is integer id, another one is string.
>> ResourceClass is just for naming the quantitative stuff. So what we need
>> is thing used to 'naming'. ResourceProviderTags is higher abstract, Tags
>> is generic thing to name something, we totally can use Tag instead of
>> ResourceClass. So user can create inventory with tags, also user can
>> create ResourceProvider with tags.
>>
>
> No, this sounds like actually way more complexity than is needed and will
> make the schema less explicit.


No, it simplifies the ResourceProvider model and the API; maybe the
complexity you are pointing at is somewhere else.

Yes, it makes the schema less explicit. Using a higher-layer abstraction, we
lose some characteristics. That is the price we have to pay.

Anyway, let me put this in the alternatives section...


>
>
> But yes, there may still have problem isn't resolved, one of problem is
>> pointed out when I discuss with YingXin about how to distinguish the Tag
>> is about quantitative or qualitative. He think we need attribute for Tag
>> to distinguish it. But the attribute isn't thing I like, I prefer leave
>> that alone due to the user of placement API is admin-user.
>>
>> Any thought? or I'm too crazy at here...maybe I just need put this in
>> the alternative section in the spec...
>>
>
> A resource class is not a capability, though. It's an indication of a type
> of quantitative consumable that is exposed on a resource provider.
>
> A capability is a string that indicates a feature that a resource provider
> offers. A capability isn't "consumed".
>

Agreed on the definitions of resource class and capability. I think they are
pretty clear to us.

What I want to say is that the placement engine really doesn't know what a
ResourceClass or a Capability is. It just needs an indication of something.
You can think of ResourceClass and Capability as sub-classes, with Tag as
their base class. And consider a case where the user inputs 'cookie' as the
name of a ResourceClass: the placement engine won't say no, because it really
doesn't care about the meaning of the ResourceClass name. The placement engine
just needs a 'tag' to distinguish the ResourceClass from the Capability.


> BTW, this is why I continue to think that using the term "tags" in the
> placement API is wrong. The placement API should clearly indicate that a
> resource provider has a set of capabilities. Tags, in Nova at least, are
> end-user-defined simple categorization strings that have no standardization
> and no cataloguing or collation to them.
>

Yes, but we don't have standard strings for all the capabilities. Shared
storage, for example, is set up by the deployer, not by OpenStack, so the
capabilities of shared storage won't be defined by OpenStack; they are defined
by the deployer.


>
> Capabilities are not end-user-defined -- they can be defined by an
> operator but they are not things that a normal end-user can simply create.
> And capabilities are specifically *not* for categorization purposes. They
> are an indication of a set of features that a resource provider exposes.
>

I totally see your point. But there is one question I can't answer. If we
call them capabilities instead of tags, a user is still free to input any
string. A user can input 'fish' as a capability, and the placement API won't
say no. Is this OK? Why is it OK for a user to put a non-capability into a
capability field? Actually, the same question applies to ResourceClass. (Ah,
this is back to ResourceClass and Capability having the same base class, Tags.)

I think this is the only question I can't get past; otherwise I will update
the spec :)


>
> This is why I think the placement API for capabilities should use the term
> "capabilities" and not "tags".
>
> Best,
> -jay
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> 

Re: [openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-02 Thread Jay Pipes

On 08/02/2016 08:19 AM, Alex Xu wrote:

Chris have a thought about using ResourceClass to describe Capabilities
with an infinite inventory. In the beginning we brain storming the idea
of Tags, Tan Lin have same thought, but we say no very quickly, due to
the ResourceClass is really about Quantitative stuff. But Chris give
very good point about simplify the ResourceProvider model and the API.

After rethinking about those idea, I like simplify the ResourceProvider
model and the API. But I think the direction is opposite. ResourceClass
with infinite inventory is really hacky. The Placement API is simple,
but the usage of those API isn't simple for user, they need create a
ResourceClass, then create an infinite inventory. And ResourceClass
isn't managable like Tags, look at the Tags API, there are many query
parameter.

But look at the ResourceClass and ResourceProviderTags, they are totally
same, two columns: one is integer id, another one is string.
ResourceClass is just for naming the quantitative stuff. So what we need
is thing used to 'naming'. ResourceProviderTags is higher abstract, Tags
is generic thing to name something, we totally can use Tag instead of
ResourceClass. So user can create inventory with tags, also user can
create ResourceProvider with tags.


No, this sounds like actually way more complexity than is needed and 
will make the schema less explicit.



But yes, there may still have problem isn't resolved, one of problem is
pointed out when I discuss with YingXin about how to distinguish the Tag
is about quantitative or qualitative. He think we need attribute for Tag
to distinguish it. But the attribute isn't thing I like, I prefer leave
that alone due to the user of placement API is admin-user.

Any thought? or I'm too crazy at here...maybe I just need put this in
the alternative section in the spec...


A resource class is not a capability, though. It's an indication of a 
type of quantitative consumable that is exposed on a resource provider.


A capability is a string that indicates a feature that a resource 
provider offers. A capability isn't "consumed".


BTW, this is why I continue to think that using the term "tags" in the 
placement API is wrong. The placement API should clearly indicate that a 
resource provider has a set of capabilities. Tags, in Nova at least, are 
end-user-defined simple categorization strings that have no 
standardization and no cataloguing or collation to them.


Capabilities are not end-user-defined -- they can be defined by an 
operator but they are not things that a normal end-user can simply 
create. And capabilities are specifically *not* for categorization 
purposes. They are an indication of a set of features that a resource 
provider exposes.


This is why I think the placement API for capabilities should use the 
term "capabilities" and not "tags".


Best,
-jay

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] [scheduler] Use ResourceProviderTags instead of ResourceClass?

2016-08-02 Thread Alex Xu
Chris has a thought about using ResourceClass to describe capabilities with
an infinite inventory. When we first brainstormed the idea of tags, Tan Lin
had the same thought, but we said no very quickly, because ResourceClass is
really about quantitative stuff. But Chris makes a very good point about
simplifying the ResourceProvider model and the API.

After rethinking those ideas, I like simplifying the ResourceProvider model
and the API. But I think the direction is the opposite one. A ResourceClass
with infinite inventory is really hacky. The placement API is simple, but
using it isn't simple for the user: they need to create a ResourceClass, then
create an infinite inventory. And ResourceClass isn't manageable like Tags;
look at the Tags API, which has many query parameters.

But look at ResourceClass and ResourceProviderTags: they are exactly the same,
two columns, one an integer id and the other a string. ResourceClass is just
for naming the quantitative stuff, so what we need is something to do the
'naming'. ResourceProviderTags is the higher abstraction; a tag is a generic
thing used to name something, so we could use Tag instead of ResourceClass.
Then a user can create an inventory with tags, and can also create a
ResourceProvider with tags.

But yes, there may still be unresolved problems. One of them came up when I
discussed with Yingxin how to distinguish whether a tag is quantitative or
qualitative. He thinks we need an attribute on the tag to distinguish them.
But I don't like the attribute; I would prefer to leave that alone, since the
user of the placement API is an admin user.

Any thoughts? Or am I too crazy here... maybe I just need to put this in the
alternatives section of the spec...

Thanks
Alex
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Scheduler subteam meeting Monday at 1400 UTC

2016-07-31 Thread Edward Leafe
Sorry for the late email, but I was swamped last week after getting back from 
the mid-cycle. The next meeting of the Nova Scheduler subteam will be Monday, 
August 1, at 1400 UTC [0], in #openstack-meeting-alt

I've updated the agenda [1] with the main items; if you have other issues to 
discuss, please add them.

[0] http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160801T14
[1] https://wiki.openstack.org/wiki/Meetings/NovaScheduler


-- Ed Leafe







__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] [scheduler] New filter: AggregateInstanceTypeFilter

2016-07-27 Thread Alonso Hernandez, Rodolfo
Hello:

We have developed a new filter for the Nova scheduler (see the spec). We have 
a POC at https://review.openstack.org/#/c/346662/.

My question is how to proceed with this code:

1)  Merge into nova code. This solution seems not to be accepted (see spec 
comments; also previous versions were merged and reverted in previous releases).

2)  Merge into networking-ovs-dpdk 
(https://github.com/openstack/networking-ovs-dpdk) repo.

3)  Create a new repo to support this new filter.

Which option should we take?

Thank you in advance.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] A simple solution for better scheduler performance

2016-07-15 Thread Cheng, Yingxin
Hi John,

Thanks for the reply.

There are two rounds of experiments:
Experiment A [3] is deployed with devstack. There are 1000 compute services with 
the fake virt driver. The DB driver is the devstack default, PyMySQL, and the 
scheduler driver is the default filter scheduler.
Experiment B [4] is a real production environment from China Mobile with 
about 600 active compute nodes. The DB driver is the SQLAlchemy default, i.e. 
the C-based python-mysql driver. The scheduler is also the filter scheduler.

In the analysis at 
https://docs.google.com/document/d/1N_ZENg-jmFabyE0kLMBgIjBGXfL517QftX3DW7RVCzU/edit?usp=sharing
figures 1 and 2 are from experiment B and figures 3 and 4 are from experiment A, 
so both kinds of DB driver are covered.

My point is simple: when the host manager is querying host states for request 
A and another request B comes in, the host manager won't launch a second 
cache refresh; instead, it simply reuses the first one and returns the same 
result to both A and B. In this way, we can reduce the expensive cache-refresh 
queries to a minimum while keeping the scheduler host states fresh. It becomes 
more effective when there are more compute nodes and heavier request pressure.
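
In rough Python, the idea is just to coalesce concurrent refreshes (this is a
sketch of the idea only, not the prototype code):

    # Sketch of the idea only (not the prototype code): requests that arrive
    # while a cache refresh is already in flight wait for that refresh and
    # reuse its result, instead of issuing another expensive DB query.
    import threading

    class HostStateCache(object):
        def __init__(self, refresh_from_db):
            self._refresh_from_db = refresh_from_db   # the expensive query
            self._lock = threading.Lock()
            self._in_flight = None                    # Event for current refresh
            self._host_states = None

        def get_host_states(self):
            with self._lock:
                if self._in_flight is None:
                    # This request (A) starts the refresh; later ones wait on it.
                    self._in_flight = threading.Event()
                    i_do_refresh, event = True, self._in_flight
                else:
                    # A refresh is already running; request B just reuses it.
                    i_do_refresh, event = False, self._in_flight
            if i_do_refresh:
                try:
                    result = self._refresh_from_db()
                    with self._lock:
                        self._host_states = result
                finally:
                    with self._lock:
                        self._in_flight = None
                    event.set()
            else:
                event.wait()
            return self._host_states

Under eventlet the same pattern applies with green threads; the point is only
that N concurrent scheduling requests trigger one DB query instead of N.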

I also have runnable code that can better explain my idea: 
https://github.com/cyx1231st/making-food 

-- 
Regards
Yingxin

On 7/15/16, 17:19, "John Garbutt"  wrote:

On 15 July 2016 at 09:26, Cheng, Yingxin  wrote:
> It is easy to understand that scheduling in nova-scheduler service 
consists of 2 major phases:
> A. Cache refresh, in code [1].
> B. Filtering and weighing, in code [2].
>
> Couple of previous experiments [3] [4] shows that “cache-refresh” is the 
major bottleneck of nova scheduler. For example, the 15th page of presentation 
[3] says the time cost of “cache-refresh” takes 98.5% of time of the entire 
`_schedule` function [6], when there are 200-1000 nodes and 50+ concurrent 
requests. The latest experiments [5] in China Mobile’s 1000-node environment 
also prove the same conclusion, and it’s even 99.7% when there’re 40+ 
concurrent requests.
>
> Here’re some existing solutions for the “cache-refresh” bottleneck:
> I. Caching scheduler.
> II. Scheduler filters in DB [7].
> III. Eventually consistent scheduler host state [8].
>
> I can discuss their merits and drawbacks in a separate thread, but here I 
want to show a simplest solution based on my findings during the experiments 
[5]. I wrapped the expensive function [1] and tried to see the behavior of 
cache-refresh under pressure. It is very interesting to see a single 
cache-refresh only costs about 0.3 seconds. And when there’re concurrent 
cache-refresh operations, this cost can be suddenly increased to 8 seconds. 
I’ve seen it even reached 60 seconds for one cache-refresh under higher 
pressure. See the below section for details.

I am curious about what DB driver you are using?
Using PyMySQL should remove at lot of those issues.
This is the driver we use in the gate now, but it didn't used to be the 
default.

If you use the C based MySQL driver, you will find it locks the whole
process when making a DB call, then eventlet schedules the next DB
call, etc, etc, and then it loops back and allows the python code to
process the first db call, etc. In extreme cases you will find the
code processing the DB query considers some of the hosts to be down
since its so long since the DB call was returned.

Switching the driver should dramatically increase the performance of (II)

> It raises a question in the current implementation: Do we really need a 
cache-refresh operation [1] for *every* requests? If those concurrent 
operations are replaced by one database query, the scheduler is still happy 
with the latest resource view from database. Scheduler is even happier because 
those expensive cache-refresh operations are minimized and much faster (0.3 
seconds). I believe it is the simplest optimization to scheduler performance, 
which doesn’t make any changes in filter scheduler. Minor improvements inside 
host manager is enough.

So it depends on the usage patterns in your cloud.

The caching scheduler is one way to avoid the cache-refresh operation
on every request. It has an upper limit on throughput as you are
forced into having a single active nova-scheduler process.

But the caching means you can only have a single nova-scheduler
process, where as (II) allows you to have multiple nova-scheduler
workers to increase the concurrency.

> [1] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
> [2] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L112-L123
> [3] 
https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
> [4] 

[openstack-dev] [nova][scheduler] Capability Modeling - Newton Mid-cycle Status

2016-07-15 Thread Ed Leafe
The current focus of improving the Nova Scheduler is progressing, with 
ResourceProviders now representing the quantities of the different resources 
available for virtualization. But these resources also have varying qualities 
that are not consumed by a VM, but describe something about the resource. The 
most common example is disk space, which can be consumed. This disk space can 
be either spinning disk or SSD, so we need to find a way to a) represent that 
quality and b) allow the user to specify those qualities that they desire.

We are proposing to add qualities to a ResourceProvider through the use of tags 
[0]. These tags are bits of text that can represent the qualities of that 
ResourceProvider, which can be queried in order to find those resources that 
have the desired qualities. In the example above, a host with SSD disk storage 
would have the tags ‘compute’ and ‘ssd’ (among others), which would be queried 
for if the user asks for a VM backed by SSD.

The way that a user makes such a request will be changing, too. Instead of 
cramming all the possible qualitative aspects of a resource into a flavor, in 
the poorly-name ‘extra_specs’, the use of qualitative tags will allow flavors 
to be greatly simplified. In order to specify those qualitative aspects in a 
request, the API will add two new multi-value keys to the server request body: 
‘requirements’ and ‘preferences’ [1]. As the names suggest, any quality 
specified in the requirements key must be present in the ResourceProvider, 
while any quality in the preferences key *may* will be used to select, but not 
disqualify, a host. In practical terms, if you specify ‘ssd’ as a requirement, 
you are guaranteed that the VM you get will have SSD, or the request will fail. 
With ‘ssd’ as a preference, though, you are not guaranteed that the VM you get 
will have SSD, but if any are available, you will get SSD.
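
As a purely illustrative example (the final key names and their placement in
the request body are whatever the API spec [1] settles on, and the tag values
below are made up), a boot request might look something like:

    # Illustrative only; the real request schema is defined by the spec [1].
    server_create_request = {
        'server': {
            'name': 'db-01',
            'flavorRef': 'small',
            'imageRef': 'my-image-uuid',
            'requirements': ['ssd'],     # host must have this, or the boot fails
            'preferences': ['10g_nic'],  # used to rank hosts, never to reject one
        },
    }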

Many host capabilities can be retrieved from the virt layer, and there is a 
proposal [2] to add a new virt method to retrieve those from the hypervisor. 
These can be exposed via the API, or be used to automatically tag the host’s 
ResourceProvider record with the appropriate tags. Some more work needs to be 
done to determine how best to turn the different combinations of hardware 
and hypervisor capabilities into something that can be standardized as much as 
possible (for cross-cloud interop) without severely limiting the flexibility of 
the design (to keep current). One such proposal along these lines [3] is to 
standardize these capabilities into a set of defined enums. Again, the concern 
is the trade-off of cloud interop vs. flexibility, as defining enums in code 
means that adding a new capability will require a code change, along with the 
necessary doc changes and review.

So we certainly have enough to discuss next week at the mid-cycle!

[0] http://lists.openstack.org/pipermail/openstack-dev/2016-July/099032.html
[1] https://review.openstack.org/313784
[2] https://review.openstack.org/#/c/286520
[3] https://review.openstack.org/#/c/309762

-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler] A simple solution for better scheduler performance

2016-07-15 Thread John Garbutt
On 15 July 2016 at 09:26, Cheng, Yingxin  wrote:
> It is easy to understand that scheduling in nova-scheduler service consists 
> of 2 major phases:
> A. Cache refresh, in code [1].
> B. Filtering and weighing, in code [2].
>
> Couple of previous experiments [3] [4] shows that “cache-refresh” is the 
> major bottleneck of nova scheduler. For example, the 15th page of 
> presentation [3] says the time cost of “cache-refresh” takes 98.5% of time of 
> the entire `_schedule` function [6], when there are 200-1000 nodes and 50+ 
> concurrent requests. The latest experiments [5] in China Mobile’s 1000-node 
> environment also prove the same conclusion, and it’s even 99.7% when there’re 
> 40+ concurrent requests.
>
> Here’re some existing solutions for the “cache-refresh” bottleneck:
> I. Caching scheduler.
> II. Scheduler filters in DB [7].
> III. Eventually consistent scheduler host state [8].
>
> I can discuss their merits and drawbacks in a separate thread, but here I 
> want to show a simplest solution based on my findings during the experiments 
> [5]. I wrapped the expensive function [1] and tried to see the behavior of 
> cache-refresh under pressure. It is very interesting to see a single 
> cache-refresh only costs about 0.3 seconds. And when there’re concurrent 
> cache-refresh operations, this cost can be suddenly increased to 8 seconds. 
> I’ve seen it even reached 60 seconds for one cache-refresh under higher 
> pressure. See the below section for details.

I am curious about what DB driver you are using?
Using PyMySQL should remove a lot of those issues.
This is the driver we use in the gate now, but it didn't used to be the default.

If you use the C-based MySQL driver, you will find it locks the whole
process when making a DB call, then eventlet schedules the next DB
call, etc, etc, and then it loops back and allows the python code to
process the first DB call, etc. In extreme cases you will find the
code processing the DB query considers some of the hosts to be down
since it's so long since the DB call was returned.

Switching the driver should dramatically increase the performance of (II)
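
To make the difference concrete at the SQLAlchemy level (connection details
below are made up; in nova this is just the [database] connection URL), it is
only the dialect prefix that changes:

    # Connection details are made up; the point is only which driver is loaded.
    from sqlalchemy import create_engine

    # Pure-Python PyMySQL driver: DB calls can yield to eventlet, so one slow
    # query does not freeze every other greenthread in the process.
    engine = create_engine('mysql+pymysql://nova:secret@127.0.0.1/nova')

    # C-based MySQLdb driver ('mysql://' defaults to it): each call blocks the
    # whole process until it returns, giving the behaviour described above.
    # engine = create_engine('mysql://nova:secret@127.0.0.1/nova')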

> It raises a question in the current implementation: Do we really need a 
> cache-refresh operation [1] for *every* requests? If those concurrent 
> operations are replaced by one database query, the scheduler is still happy 
> with the latest resource view from database. Scheduler is even happier 
> because those expensive cache-refresh operations are minimized and much 
> faster (0.3 seconds). I believe it is the simplest optimization to scheduler 
> performance, which doesn’t make any changes in filter scheduler. Minor 
> improvements inside host manager is enough.

So it depends on the usage patterns in your cloud.

The caching scheduler is one way to avoid the cache-refresh operation
on every request. It has an upper limit on throughput as you are
forced into having a single active nova-scheduler process.

But the caching means you can only have a single nova-scheduler
process, whereas (II) allows you to have multiple nova-scheduler
workers to increase the concurrency.

> [1] 
> https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
> [2] 
> https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L112-L123
> [3] 
> https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
> [4] http://lists.openstack.org/pipermail/openstack-dev/2016-June/098202.html
> [5] Please refer to Barcelona summit session ID 15334 later: “A tool to test 
> and tune your OpenStack Cloud? Sharing our 1000 node China Mobile experience.”
> [6] 
> https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L53
> [7] https://review.openstack.org/#/c/300178/
> [8] https://review.openstack.org/#/c/306844/
>
>
> ** Here is the discovery from latest experiments [5] **
> https://docs.google.com/document/d/1N_ZENg-jmFabyE0kLMBgIjBGXfL517QftX3DW7RVCzU/edit?usp=sharing
>
> The figure 1 illustrates the concurrent cache-refresh operations in a nova 
> scheduler service. There’re at most 23 requests waiting for the cache-refresh 
> operations at time 43s.
>
> The figure 2 illustrates the time cost of every requests in the same 
> experiment. It shows that the cost is increased with the growth of 
> concurrency. It proves the vicious circle that a request will wait longer for 
> the database when there’re more waiting requests.
>
> The figure 3/4 illustrate a worse case when the cache-refresh operation costs 
> reach 60 seconds because of excessive cache-refresh operations.

Sorry, it's not clear to me if this was using I, II, or III? It seems
like it's just using the default system?

This looks like the problems I have seen when you don't use PyMySQL
for your DB driver.

Thanks,
John

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

[openstack-dev] [nova][scheduler] A simple solution for better scheduler performance

2016-07-15 Thread Cheng, Yingxin
It is easy to understand that scheduling in nova-scheduler service consists of 
2 major phases:
A. Cache refresh, in code [1].
B. Filtering and weighing, in code [2].

A couple of previous experiments [3][4] show that “cache-refresh” is the major 
bottleneck of the nova scheduler. For example, the 15th page of presentation [3] 
shows that the time cost of “cache-refresh” takes 98.5% of the time of the entire 
`_schedule` function [6] when there are 200-1000 nodes and 50+ concurrent 
requests. The latest experiments [5] in China Mobile’s 1000-node environment 
also prove the same conclusion, and it’s even 99.7% when there are 40+ 
concurrent requests.

Here’re some existing solutions for the “cache-refresh” bottleneck:
I. Caching scheduler.
II. Scheduler filters in DB [7].
III. Eventually consistent scheduler host state [8].

I can discuss their merits and drawbacks in a separate thread, but here I want 
to show the simplest solution, based on my findings during the experiments [5]. 
I wrapped the expensive function [1] and tried to see the behavior of 
cache-refresh under pressure. It is very interesting to see that a single 
cache-refresh only costs about 0.3 seconds, and that when there are concurrent 
cache-refresh operations, this cost can suddenly increase to 8 seconds. 
I’ve even seen it reach 60 seconds for one cache-refresh under higher 
pressure. See the section below for details.

This raises a question about the current implementation: do we really need a 
cache-refresh operation [1] for *every* request? If those concurrent 
operations are replaced by one database query, the scheduler is still happy 
with the latest resource view from the database. The scheduler is even happier 
because those expensive cache-refresh operations are minimized and much faster 
(0.3 seconds). I believe it is the simplest optimization to scheduler 
performance, and it doesn’t make any changes to the filter scheduler; minor 
improvements inside the host manager are enough.

[1] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
 
[2] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L112-L123
[3] 
https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
 
[4] http://lists.openstack.org/pipermail/openstack-dev/2016-June/098202.html 
[5] Please refer to Barcelona summit session ID 15334 later: “A tool to test 
and tune your OpenStack Cloud? Sharing our 1000 node China Mobile experience.”
[6] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L53
[7] https://review.openstack.org/#/c/300178/
[8] https://review.openstack.org/#/c/306844/


** Here is the discovery from latest experiments [5] **
https://docs.google.com/document/d/1N_ZENg-jmFabyE0kLMBgIjBGXfL517QftX3DW7RVCzU/edit?usp=sharing
 

Figure 1 illustrates the concurrent cache-refresh operations in a nova 
scheduler service. There are at most 23 requests waiting for the cache-refresh 
operations at time 43s.

Figure 2 illustrates the time cost of every request in the same experiment. 
It shows that the cost increases with the growth of concurrency, which proves 
the vicious circle that a request waits longer for the database when there 
are more waiting requests.

Figures 3 and 4 illustrate a worse case where the cost of a cache-refresh 
operation reaches 60 seconds because of excessive concurrent cache-refresh 
operations.


-- 
Regards
Yingxin

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] Next meeting: Monday, July 11 at 1400 UTC

2016-07-08 Thread Ed Leafe
The next meeting of the Nova Scheduler subteam will be Monday, July 11 at 
1400UTC in the #openstack-meeting-alt channel.
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160711T14

The agenda is here:
https://wiki.openstack.org/wiki/Meetings/NovaScheduler#Agenda_for_next_meeting

If you have anything specific to discuss, please update the agenda as soon as 
possible so that others may review before the meeting.


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] No meeting this Monday, July 4

2016-07-01 Thread Ed Leafe
Due to the US Independence Day holiday, we will be skipping the Nova Scheduler 
subteam meeting on this upcoming Monday, July 4. The next meeting will be July 
11 at 1400 UTC [0].

[0] http://www.timeanddate.com/worldclock/fixedtime.html?iso=20160711T14


-- Ed Leafe






__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][scheduler] More test results for the "eventually consistent host state" prototype

2016-06-27 Thread Cheng, Yingxin
Hi,

According to the feedback [1] from the Austin design summit, I prepared my 
environment with pre-loaded computes and finished a new round of performance 
profiling using the tool [7]. I also updated the prototype [2] to simplify the 
implementation on the compute-node side, which makes it closer to the design 
described in the spec [6].

This set of results is more comprehensive: it includes analysis of the 
“eventually consistent host states” prototype [2], the default filter scheduler, 
and the caching scheduler. They are tested in various scenarios in a 
1000-compute-node environment, with real controller services, a real RabbitMQ 
and a real MySQL database. The new set of experiments contains 55 repeatable 
results [3]. Don’t be put off by the verbose data; I’ve dug out the 
conclusions.

To better understand what’s happening during scheduling in the different 
scenarios, all of them are visualized in the doc [4]. They are complementary to 
what I presented at the Austin design summit, on the 7th page of the ppt [5].

Note that the “pre-load scenario” allows only 49 new instances to be launched 
in the 1000-node environment. It means when 50 requests are sent, there should 
be 1 and only 1 failed request if the scheduler decision is accurate.


Detailed analysis with illustration [4]: 
https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
 
==
In all test cases, nova is dispatching 50 instant requests to 1000 compute 
nodes. The aim is to compare the behavior of 3 types of schedulers, with 
preloaded or empty-loaded scenarios, and with 1 or 2 scheduler services. So 
that’s 3*2*2=12 sets of experiments, and each set is tested multiple times. 

In scenario S1 (i.e. 1 scheduler with empty-loaded compute nodes), we can see 
from A2 very clearly that the entire boot process is bottlenecked in the 
nova-scheduler service when the filter scheduler is used. The filter scheduler 
consumes those 50 requests very slowly, causing all the requests to be blocked 
ahead of the scheduler service in the yellow area. The ROOT CAUSE of this is the 
“cache-refresh” step before filtering (i.e. 
`nova.scheduler.filter_scheduler.FilterScheduler._get_all_host_states`). I’ve 
discussed this bottleneck in detail in the Austin summit session “Dive into 
nova scheduler performance: where is the bottleneck” [8]. This is also proved 
by the caching scheduler, because it excludes the “cache-refresh” bottleneck and 
only uses in-memory filtering. By simply excluding “cache-refresh”, the 
performance benefits are huge: the query time is reduced by 87%, and the 
overall throughput (i.e. the delivered requests per second in this cloud) is 
multiplied by 8.24; see A3 for an illustration. The “eventually consistent host 
states” prototype also excludes this bottleneck and takes a more meticulous way 
of synchronizing scheduler caches. It is slightly slower than the caching 
scheduler, because there is an overhead to apply incremental updates from 
compute nodes. The query time is reduced by 79% and the overall throughput is 
multiplied by 5.63 on average in S1.

In preload scenario S2, we can see that all 3 types of scheduler are faster than 
in their empty-loaded scenario. That’s because the filters can now prune the 
hosts from 1000 to only 49, so the last few filters don’t need to process 1000 
host states and can be much faster. But the filter scheduler (B2) cannot benefit 
much from faster filtering, because its bottleneck is still in “cache refresh”. 
However, it is different for the caching scheduler and the prototype, because 
their performance heavily depends on in-memory filtering. For the caching 
scheduler (B3), the query time is reduced by 81% and the overall throughput is 
multiplied by 7.52 compared with the filter scheduler. And for the prototype 
(B1), the query time is reduced by 83% and the throughput is multiplied by 7.92 
on average. Also, all those scheduler decisions are accurate: their first 
decisions are all correct without any retries in the preload scenario, and only 
1 of 50 requests fails, due to the “no valid host” error.

In scenario S3, with 2 scheduler services and empty-loaded compute nodes, the 
overall scheduling bandwidths are all multiplied by 2 internally. The filter 
scheduler (C2) shows a major improvement, because its scheduler bandwidth is 
multiplied. But the other two types don’t show a similar improvement, because 
their bottleneck is now in the nova-api service instead. It is a wrong decision 
to add more schedulers when the actual bottleneck is happening elsewhere. And 
worse, multiple schedulers will introduce more race conditions as well as other 
overhead. However, the performance of the caching scheduler (C3) and the 
prototype (C1) is still much better: the query time is reduced by 65% and the 
overall throughput is multiplied by 3.67 on average.

In preload scenario S4, with 2 schedulers, the race condition surfaces because 
there are only 49 slots across the 1000 hosts in the cloud, and they will all

  1   2   3   4   >