Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-29 Thread sfinucan
On Wed, 2017-06-21 at 07:01 -0400, Sean Dague wrote:
> On 06/21/2017 04:43 AM, sfinu...@redhat.com wrote:
> > On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
> > > On 06/20/2017 09:51 AM, Eric Fried wrote:
> > > > Nice Stephen!
> > > > 
> > > > For those who aren't aware, the rendered version (pretty, so pretty)
> > > > can
> > > > be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
> > > > 
> > > > http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
> > > 
> > > Can we teach it to not put line breaks in the middle of words in the text
> > > boxes?
> > 
> > Doesn't seem configurable in its current form :( This, and the defaulting
> > to PNG output instead of SVG (which makes things ungreppable) are my
> > biggest bugbears.
> > 
> > I'll go have a look at the sauce and see what can be done about it. If not,
> > still better than nothing?
> 
> I've actually looked through the blockdiag source (to try to solve a
> similar problem). There is no easy way to change it.
> 
> If people find it confusing, the best thing to do would be short labels
> on boxes, then explain in more detail in footnotes.

I managed to get this working through some monkey patching of the module [1].
It's not perfect, and efried and I want to do something else to prevent
truncation [2], but it's much better now.

Stephen

[1] https://review.openstack.org/#/c/476159/
[2] https://review.openstack.org/#/c/476204/
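
For anyone curious what "monkey patching of the module" looks like in
practice, here is a minimal sketch of the general hook from a Sphinx
conf.py. The module path and attribute being patched are illustrative
assumptions, not the actual change in [1]:

    # conf.py -- sketch of the monkey-patching hook only; a real fix would
    # implement a different word-splitting policy inside the wrapper.

    def _patched(original):
        def wrapper(*args, **kwargs):
            # This sketch simply delegates to the original implementation;
            # the point is where blockdiag's behaviour can be swapped out.
            return original(*args, **kwargs)
        return wrapper

    def setup(app):
        # Sphinx calls setup(app) from conf.py if it is defined there.
        try:
            from blockdiag.imagedraw import textfolder  # assumed location
            textfolder.splittext = _patched(textfolder.splittext)
        except (ImportError, AttributeError):
            pass  # blockdiag missing or its internals differ; build anyway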



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-21 Thread Sean Dague
On 06/21/2017 04:43 AM, sfinu...@redhat.com wrote:
> On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
>> On 06/20/2017 09:51 AM, Eric Fried wrote:
>>> Nice Stephen!
>>>
>>> For those who aren't aware, the rendered version (pretty, so pretty) can
>>> be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
>>>
>>> http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
>>
>> Can we teach it to not put line breaks in the middle of words in the text
>> boxes?
> 
> Doesn't seem configurable in its current form :( This, and the defaulting to
> PNG output instead of SVG (which makes things ungreppable) are my biggest
> bugbears.
> 
> I'll go have a look at the sauce and see what can be done about it. If not,
> still better than nothing?

I've actually looked through the blockdiag source (to try to solve a
similar problem). There is no easy way to change it.

If people find it confusing, the best thing to do would be short labels
on boxes, then explain in more detail in footnotes.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-21 Thread sfinucan
On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:
> On 06/20/2017 09:51 AM, Eric Fried wrote:
> > Nice Stephen!
> > 
> > For those who aren't aware, the rendered version (pretty, so pretty) can
> > be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
> > 
> > http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
> 
> Can we teach it to not put line breaks in the middle of words in the text
> boxes?

Doesn't seem configurable in its current form :( This, and the defaulting to
PNG output instead of SVG (which makes things ungreppable) are my biggest
bugbears.

I'll go have a look at the sauce and see what can be done about it. If not,
still better than nothing?

Stephen



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Chris Friesen

On 06/20/2017 09:51 AM, Eric Fried wrote:

Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling


Can we teach it to not put line breaks in the middle of words in the text boxes?

Chris



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Eric Fried
Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling

On 06/20/2017 09:09 AM, sfinu...@redhat.com wrote:

> 
> I have a document (with a nifty activity diagram in tow) for all the above
> available here:
> 
>   https://review.openstack.org/475810 
> 
> Should be more Google'able than mailing list posts for future us :)
> 
> Stephen
> 



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/20/2017 09:51 AM, Alex Xu wrote:
2017-06-19 22:17 GMT+08:00 Jay Pipes:

* Scheduler then creates a list of N of these data structures,
with the first being the data for the selected host, and the
rest being data structures representing alternates consisting of
the next hosts in the ranked list that are in the same cell as
the selected host.

Yes, this is the proposed solution for allowing retries within a cell.

Is it possible to use traits to distinguish different cells? Then the 
retry could be done in the cell by querying placement directly with a trait 
that indicates the specific cell.


Those traits would be custom traits, generated from the cell name.


No, we're not going to use traits in this way, for a couple reasons:

1) Placement doesn't and shouldn't know about Nova's internals. Cells 
are internal structures of Nova. Users don't know about them, neither 
should placement.


2) Traits describe a resource provider. A cell ID doesn't describe a 
resource provider, just like an aggregate ID doesn't describe a resource 
provider.



* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends
that list to the target cell.
* Target cell tries to build the instance on the selected host.
If it fails, it uses the allocation data in the data structure
to unclaim the resources for the selected host, and tries to
claim the resources for the next host in the list using its
allocation data. It then tries to build the instance on the next
host in the list of alternates. Only when all alternates fail
does the build request fail.

On the compute node, will we get rid of the allocation update in the 
periodic task "update_available_resource"? Otherwise, we will have a race 
between the claim in the nova-scheduler and that periodic task.


Yup, good point, and yes, we will be removing the call to PUT 
/allocations in the compute node resource tracker. Only DELETE 
/allocations/{instance_uuid} will be called if something goes terribly 
wrong on instance launch.
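
A minimal sketch of that failure path, using plain 'requests' against a
placeholder endpoint and token (nova itself goes through its placement
report client rather than raw HTTP calls):

    import requests

    PLACEMENT = "http://placement.example.com"           # placeholder
    HEADERS = {"X-Auth-Token": "ADMIN_TOKEN",             # placeholder
               "OpenStack-API-Version": "placement 1.10"}

    def drop_claim(instance_uuid):
        """Remove all allocations held by this instance (the consumer)."""
        resp = requests.delete(
            "%s/allocations/%s" % (PLACEMENT, instance_uuid),
            headers=HEADERS)
        # 204: allocations removed; 404: there was nothing to remove.
        return resp.status_code in (204, 404)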


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread sfinucan
On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:
> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
> 
> As noted in the nova-scheduler meeting this morning, this should have 
> been called "original plan" rather than "current flow", as Jay pointed 
> out inline.
> 
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> > 
> > Not root RPs. Non-sharing resource providers, which currently 
> > effectively means compute node providers. Nested resource providers 
> > isn't yet merged, so there is currently no concept of a hierarchy of 
> > providers.
> > 
> > > * Placement returns a list of the UUIDs for those root providers to 
> > > scheduler
> > 
> > It returns the provider names and UUIDs, yes.
> > 
> > > * Scheduler uses those UUIDs to create HostState objects for each
> > 
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
> > in a list of the provider UUIDs it got back from the placement service. 
> > The scheduler then builds a set of HostState objects from the results of 
> > ComputeNodeList.get_all_by_uuid().
> > 
> > The scheduler also keeps a set of AggregateMetadata objects in memory, 
> > including the association of aggregate to host (note: this is the 
> > compute node's *service*, not the compute node object itself, thus the 
> > reason aggregates don't work properly for Ironic nodes).
> > 
> > > * Scheduler runs those HostState objects through filters to remove 
> > > those that don't meet requirements not selected for by placement
> > 
> > Yep.
> > 
> > > * Scheduler runs the remaining HostState objects through weighers to 
> > > order them in terms of best fit.
> > 
> > Yep.
> > 
> > > * Scheduler takes the host at the top of that ranked list, and tries 
> > > to claim the resources in placement. If that fails, there is a race, 
> > > so that HostState is discarded, and the next is selected. This is 
> > > repeated until the claim succeeds.
> > 
> > No, this is not how things work currently. The scheduler does not claim 
> > resources. It selects the top (or random host depending on the selection 
> > strategy) and sends the launch request to the target compute node. The 
> > target compute node then attempts to claim the resources and in doing so 
> > writes records to the compute_nodes table in the Nova cell database as 
> > well as the Placement API for the compute node resource provider.
> 
> Not to nit pick, but today the scheduler sends the selected destinations 
> to the conductor. Conductor looks up the cell that a selected host is 
> in, creates the instance record and friends (bdms) in that cell and then 
> sends the build request to the compute host in that cell.
> 
> > 
> > > * Scheduler then creates a list of N UUIDs, with the first being the 
> > > selected host, and the rest being alternates consisting of the 
> > > next hosts in the ranked list that are in the same cell as the 
> > > selected host.
> > 
> > This isn't currently how things work, no. This has been discussed, however.
> > 
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it unclaims the resources for the selected host, and tries to 
> > > claim the resources for the next host in the list. It then tries to 
> > > build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> > 
> > This isn't currently how things work, no. There has been discussion of 
> > having the compute node retry alternatives locally, but nothing more 
> > than discussion.
> 
> Correct that this isn't how things currently work, but it was/is the 
> original plan. And the retry happens within the cell conductor, not on 
> the compute node itself. The top-level conductor is what's getting 
> selected hosts from the scheduler. The cell-level conductor is what's 
> getting a retry request from the compute. The cell-level conductor would 
> deallocate from placement for the currently claimed providers, and then 
> pick one of the alternatives passed down from the top and then make 
> allocations (a claim) against those, then send to an alternative compute 
> host for another build attempt.
> 
> So with this plan, there are two places to make allocations - the 
> scheduler first, and then the cell conductors for retries. This 
> duplication is why some people were originally pushing to move all 
> allocation-related work happen in the conductor service.
> 
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Edward Leafe
On Jun 20, 2017, at 8:38 AM, Jay Pipes  wrote:
> 
>>> The example I posted used 3 resource providers. 2 compute nodes with no 
>>> local disk and a shared storage pool.
>> Now I’m even more confused. In the straw man example 
>> (https://review.openstack.org/#/c/471927/) I see only one variable 
>> ($COMPUTE_NODE_UUID) referencing a compute node in the response.
> 
> I'm referring to the example I put in this email thread on 
> paste.openstack.org with numbers showing 1600 
> bytes for 3 resource providers:
> 
> http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html 
> 


And I’m referring to the comment I made on the spec back on June 13 that was 
never corrected/clarified. I’m glad you gave an example yesterday after I 
expressed my confusion; that was the whole purpose of starting this thread. 
Things may be clear to you, but they have confused me and others. We can’t help 
if we don’t understand.


-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Alex Xu
2017-06-19 22:17 GMT+08:00 Jay Pipes:

> On 06/19/2017 09:04 AM, Edward Leafe wrote:
>
>> Current flow:
>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy those
>> requirements
>>
>
> Not root RPs. Non-sharing resource providers, which currently effectively
> means compute node providers. Nested resource providers isn't yet merged,
> so there is currently no concept of a hierarchy of providers.
>
> * Placement returns a list of the UUIDs for those root providers to
>> scheduler
>>
>
> It returns the provider names and UUIDs, yes.
>
> * Scheduler uses those UUIDs to create HostState objects for each
>>
>
> Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
> in a list of the provider UUIDs it got back from the placement service. The
> scheduler then builds a set of HostState objects from the results of
> ComputeNodeList.get_all_by_uuid().
>
> The scheduler also keeps a set of AggregateMetadata objects in memory,
> including the association of aggregate to host (note: this is the compute
> node's *service*, not the compute node object itself, thus the reason
> aggregates don't work properly for Ironic nodes).
>
> * Scheduler runs those HostState objects through filters to remove those
>> that don't meet requirements not selected for by placement
>>
>
> Yep.
>
> * Scheduler runs the remaining HostState objects through weighers to order
>> them in terms of best fit.
>>
>
> Yep.
>
> * Scheduler takes the host at the top of that ranked list, and tries to
>> claim the resources in placement. If that fails, there is a race, so that
>> HostState is discarded, and the next is selected. This is repeated until
>> the claim succeeds.
>>
>
> No, this is not how things work currently. The scheduler does not claim
> resources. It selects the top (or random host depending on the selection
> strategy) and sends the launch request to the target compute node. The
> target compute node then attempts to claim the resources and in doing so
> writes records to the compute_nodes table in the Nova cell database as well
> as the Placement API for the compute node resource provider.
>
> * Scheduler then creates a list of N UUIDs, with the first being the
>> selected host, and the rest being alternates consisting of the next
>> hosts in the ranked list that are in the same cell as the selected host.
>>
>
> This isn't currently how things work, no. This has been discussed, however.
>
> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list
>> to the target cell.
>> * Target cell tries to build the instance on the selected host. If it
>> fails, it unclaims the resources for the selected host, and tries to claim
>> the resources for the next host in the list. It then tries to build the
>> instance on the next host in the list of alternates. Only when all
>> alternates fail does the build request fail.
>>
>
> This isn't currently how things work, no. There has been discussion of
> having the compute node retry alternatives locally, but nothing more than
> discussion.
>
> Proposed flow:
>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy those
>> requirements
>>
>
> Yes.
>
> * Placement then constructs a data structure for each root provider as
>> documented in the spec. [0]
>>
>
> Yes.
>
> * Placement returns a number of these data structures as JSON blobs. Due
>> to the size of the data, a page size will have to be determined, and
>> placement will have to either maintain that list of structured data for
>> subsequent requests, or re-run the query and only calculate the data
>> structures for the hosts that fit in the requested page.
>>
>
> "of these data structures as JSON blobs" is kind of redundant... all our
> REST APIs return data structures as JSON blobs.
>
> While we discussed the fact that there may be a lot of entries, we did not
> say we'd immediately support a paging mechanism.
>
> * Scheduler continues to request the paged results until it has them all.
>>
>
> See above. Was discussed briefly as a concern but not work to do for first
> patches.
>
> * Scheduler then runs this data through the filters and weighers. No
>> HostState objects are required, as the data structures will contain all the
>> information that scheduler will need.
>>
>
> No, this isn't correct. The scheduler will have *some* of the information
> it requires for weighing from the returned data from the GET
> /allocation_candidates call, but not all of it.
>
> Again, operators have insisted on keeping the flexibility currently in the
> Nova scheduler to weigh/sort compute nodes by things like thermal metrics
> and kinds of data that the 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/20/2017 08:43 AM, Edward Leafe wrote:
On Jun 20, 2017, at 6:54 AM, Jay Pipes wrote:



It was the "per compute host" that I objected to.
I guess it would have helped to see an example of the data returned 
for multiple compute nodes. The straw man example was for a single 
compute node with SR-IOV, NUMA and shared storage. There was no 
indication how multiple hosts meeting the requested resources would 
be returned.


The example I posted used 3 resource providers. 2 compute nodes with 
no local disk and a shared storage pool.


Now I’m even more confused. In the straw man example 
(https://review.openstack.org/#/c/471927/) I see only one variable 
($COMPUTE_NODE_UUID) referencing a compute node in the response.


I'm referring to the example I put in this email thread on 
paste.openstack.org with numbers showing 1600 bytes for 3 resource 
providers:


http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html

Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Edward Leafe
On Jun 20, 2017, at 6:54 AM, Jay Pipes  wrote:
> 
>>> It was the "per compute host" that I objected to.
>> I guess it would have helped to see an example of the data returned for 
>> multiple compute nodes. The straw man example was for a single compute node 
>> with SR-IOV, NUMA and shared storage. There was no indication how multiple 
>> hosts meeting the requested resources would be returned.
> 
> The example I posted used 3 resource providers. 2 compute nodes with no local 
> disk and a shared storage pool.


Now I’m even more confused. In the straw man example 
(https://review.openstack.org/#/c/471927/) I see only one variable 
($COMPUTE_NODE_UUID) referencing a compute node in the response.

-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/19/2017 09:26 PM, Boris Pavlovic wrote:

Hi,

Does this look too complicated and a bit over designed.


Is that a question?

For example, why can't we store all the data in memory in a single Python 
application with a simple REST API and have
a simple plugin mechanism for filtering? Basically there is no 
problem with storing it all on a single host.


You mean how things currently work minus the REST API?

Even if we have 100k hosts and every host takes about 10KB -> that's ~1GB of RAM 
(I could just use a phone)


There are easy ways to copy the state across different instances (sharing 
updates)


We already do this. It isn't as easy as you think. It's introduced a 
number of race conditions that we're attempting to address by doing 
claims in the scheduler.


And I thought that the Placement project was going to be such a centralized, 
small, simple app for collecting all
resource information and doing this very, very simple and easy placement 
selection...


1) Placement doesn't collect anything.
2) Placement is indeed a simple small app with a global view of resources
3) Placement doesn't do the sorting/weighing of destinations. The 
scheduler does that. See this thread for reasons why this is the case 
(operators didn't want to give up their complexity/flexibility in how 
they tweak selection decisions)
4) Placement simply tells the scheduler which providers have enough 
capacity for a requested set of resource amounts and required 
qualitative traits. It actually is pretty simple.
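
Concretely, the proposed interaction is roughly a single request along these
lines (endpoint, token and microversion below are placeholders for
illustration):

    import requests

    resp = requests.get(
        "http://placement.example.com/allocation_candidates",  # placeholder
        headers={"X-Auth-Token": "ADMIN_TOKEN",                 # placeholder
                 "OpenStack-API-Version": "placement 1.10"},
        params={"resources": "VCPU:1,MEMORY_MB:2048,DISK_GB:20"})
    data = resp.json()
    # data["allocation_requests"]  -> candidate claims the scheduler can PUT
    # data["provider_summaries"]   -> capacity/usage per provider, for weighing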


Best,
-jay


Best regards,
Boris Pavlovic

On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe wrote:


On Jun 19, 2017, at 5:27 PM, Jay Pipes wrote:



It was from the straw man example. Replacing the $FOO_UUID with
UUIDs, and then stripping out all whitespace resulted in about
1500 bytes. Your example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.


I guess it would have helped to see an example of the data returned
for multiple compute nodes. The straw man example was for a single
compute node with SR-IOV, NUMA and shared storage. There was no
indication how multiple hosts meeting the requested resources would
be returned.

-- Ed Leafe
















Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Jay Pipes

On 06/19/2017 08:05 PM, Edward Leafe wrote:
On Jun 19, 2017, at 5:27 PM, Jay Pipes wrote:


It was from the straw man example. Replacing the $FOO_UUID with 
UUIDs, and then stripping out all whitespace resulted in about 1500 
bytes. Your example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.


I guess it would have helped to see an example of the data returned for 
multiple compute nodes. The straw man example was for a single compute 
node with SR-IOV, NUMA and shared storage. There was no indication how 
multiple hosts meeting the requested resources would be returned.


The example I posted used 3 resource providers. 2 compute nodes with no 
local disk and a shared storage pool.


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Boris Pavlovic
Hi,

Does this look too complicated and a bit over designed.

For example, why can't we store all the data in memory in a single Python
application with a simple REST API and have
a simple plugin mechanism for filtering? Basically there is no
problem with storing it all on a single host.

Even if we have 100k hosts and every host takes about 10KB -> that's ~1GB of RAM (I
could just use a phone)

There are easy ways to copy the state across different instances (sharing
updates)

And I thought that the Placement project was going to be such a centralized,
small, simple app for collecting all
resource information and doing this very, very simple and easy placement
selection...


Best regards,
Boris Pavlovic

On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe  wrote:

> On Jun 19, 2017, at 5:27 PM, Jay Pipes  wrote:
>
>
> It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and
> then stripping out all whitespace resulted in about 1500 bytes. Your
> example, with whitespace included, is 1600 bytes.
>
>
> It was the "per compute host" that I objected to.
>
>
> I guess it would have helped to see an example of the data returned for
> multiple compute nodes. The straw man example was for a single compute node
> with SR-IOV, NUMA and shared storage. There was no indication how multiple
> hosts meeting the requested resources would be returned.
>
> -- Ed Leafe
>
>
>
>
>
>


Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 5:27 PM, Jay Pipes  wrote:
> 
>> It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and 
>> then stripping out all whitespace resulted in about 1500 bytes. Your 
>> example, with whitespace included, is 1600 bytes.
> 
> It was the "per compute host" that I objected to.

I guess it would have helped to see an example of the data returned for 
multiple compute nodes. The straw man example was for a single compute node 
with SR-IOV, NUMA and shared storage. There was no indication how multiple 
hosts meeting the requested resources would be returned.

-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 05:24 PM, Edward Leafe wrote:
On Jun 19, 2017, at 1:34 PM, Jay Pipes wrote:


OK, thanks for clarifying that. When we discussed returning 1.5K per 
compute host instead of a couple of hundred bytes, there was 
discussion that paging would be necessary.


Not sure where you're getting the whole 1.5K per compute host thing from.


It was from the straw man example. Replacing the $FOO_UUID with UUIDs, 
and then stripping out all whitespace resulted in about 1500 bytes. Your 
example, with whitespace included, is 1600 bytes.


It was the "per compute host" that I objected to.

OK, that’s informative, too. Is there anything decided on how much 
host info will be in the response from placement, and how much will 
be in HostState? Or how the reporting of resources by the compute 
nodes will have to change to feed this information to placement? Or 
how the two sources of information will be combined so that the 
filters and weighers can process it? Or is that still to be worked out?


I'm currently working on a patch that integrates the REST API into 
the scheduler.


The merging of data will essentially start by updating the resource amounts 
that the host state objects contain (stuff like total_usable_ram etc) 
with the accurate data from the provider_summaries section.


So in the near-term, we will be using provider_summaries to update the 
corresponding HostState objects with those values. Is the long-term plan 
to have most of the HostState information moved to placement?


Some things will move to placement sooner rather than later:

* Quantitative things that can be consumed
* Simple traits

Later rather than sooner:

* Distances between aggregates (affinity/anti-affinity)

Never:

* Filtering hosts based on how many instances use a particular image
* Filtering hosts based on something that is hypervisor-dependent
* Sorting hosts based on the number of instances in a particular state 
(e.g. how many instances are live-migrating or shelving at any given time)
* Weighing hosts based on the current temperature of a power supply in a 
rack

* Sorting hosts based on the current weather conditions in Zimbabwe

Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 1:34 PM, Jay Pipes  wrote:
> 
>> OK, thanks for clarifying that. When we discussed returning 1.5K per compute 
>> host instead of a couple of hundred bytes, there was discussion that paging 
>> would be necessary.
> 
> Not sure where you're getting the whole 1.5K per compute host thing from.

It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and then 
stripping out all whitespace resulted in about 1500 bytes. Your example, with 
whitespace included, is 1600 bytes. 

> Here's a paste with the before and after of what we're talking about:
> 
> http://paste.openstack.org/show/613129/ 
> 
> 
> Note that I'm using a situation with shared storage and two compute nodes 
> providing VCPU and MEMORY. In the current situation, the shared storage 
> provider isn't returned, as you know.
> 
> The before is 231 bytes. The after (again, with three providers, not 1) is 
> 1651 bytes.

So in the basic non-shared, non-nested case, if there are, let’s say, 200 
compute nodes that can satisfy the request, will there be 1 
“allocation_requests” key returned, with 200 “allocations” sub-keys? And one 
“provider_summaries” key, with 200 sub-keys, one per compute node UUID?

> gzipping the after contents results in 358 bytes.
> 
> So, honestly I'm not concerned.

Ok, just wanted to be clear.

>> OK, that’s informative, too. Is there anything decided on how much host info 
>> will be in the response from placement, and how much will be in HostState? 
>> Or how the reporting of resources by the compute nodes will have to change 
>> to feed this information to placement? Or how the two sources of information 
>> will be combined so that the filters and weighers can process it? Or is that 
>> still to be worked out?
> 
> I'm currently working on a patch that integrates the REST API into the 
> scheduler.
> 
> The merging of data will essentially start by updating the resource amounts that the 
> host state objects contain (stuff like total_usable_ram etc) with the 
> accurate data from the provider_summaries section.


So in the near-term, we will be using provider_summaries to update the 
corresponding HostState objects with those values. Is the long-term plan to 
have most of the HostState information moved to placement?


-- Ed Leafe







Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 01:59 PM, Edward Leafe wrote:
While we discussed the fact that there may be a lot of entries, we did 
not say we'd immediately support a paging mechanism.


OK, thanks for clarifying that. When we discussed returning 1.5K per 
compute host instead of a couple of hundred bytes, there was discussion 
that paging would be necessary.


Not sure where you're getting the whole 1.5K per compute host thing from.

Here's a paste with the before and after of what we're talking about:

http://paste.openstack.org/show/613129/

Note that I'm using a situation with shared storage and two compute 
nodes providing VCPU and MEMORY. In the current situation, the shared 
storage provider isn't returned, as you know.


The before is 231 bytes. The after (again, with three providers, not 1) 
is 1651 bytes.


gzipping the after contents results in 358 bytes.

So, honestly I'm not concerned.
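
For readers without the paste handy, here is an illustration of the shape
being discussed (two compute nodes sharing a storage pool) plus the size
check; the UUIDs and amounts are made up, not the actual paste contents:

    import gzip, json, uuid

    cn1, cn2, shared = (str(uuid.uuid4()) for _ in range(3))

    candidates = {
        "allocation_requests": [
            {"allocations": [
                {"resource_provider": {"uuid": cn1},
                 "resources": {"VCPU": 1, "MEMORY_MB": 2048}},
                {"resource_provider": {"uuid": shared},
                 "resources": {"DISK_GB": 20}}]},
            {"allocations": [
                {"resource_provider": {"uuid": cn2},
                 "resources": {"VCPU": 1, "MEMORY_MB": 2048}},
                {"resource_provider": {"uuid": shared},
                 "resources": {"DISK_GB": 20}}]},
        ],
        "provider_summaries": {
            cn1: {"resources": {"VCPU": {"capacity": 64, "used": 10},
                                "MEMORY_MB": {"capacity": 131072,
                                              "used": 4096}}},
            cn2: {"resources": {"VCPU": {"capacity": 64, "used": 2},
                                "MEMORY_MB": {"capacity": 131072,
                                              "used": 8192}}},
            shared: {"resources": {"DISK_GB": {"capacity": 10000,
                                               "used": 500}}},
        },
    }

    raw = json.dumps(candidates).encode()
    print(len(raw), len(gzip.compress(raw)))  # raw vs. gzipped size in bytes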

Again, operators have insisted on keeping the flexibility currently in 
the Nova scheduler to weigh/sort compute nodes by things like thermal 
metrics and kinds of data that the Placement API will never be 
responsible for.


The scheduler will need to merge information from the 
"provider_summaries" part of the HTTP response with information it has 
already in its HostState objects (gotten from 
ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).


OK, that’s informative, too. Is there anything decided on how much host 
info will be in the response from placement, and how much will be in 
HostState? Or how the reporting of resources by the compute nodes will 
have to change to feed this information to placement? Or how the two 
sources of information will be combined so that the filters and weighers 
can process it? Or is that still to be worked out?


I'm currently working on a patch that integrates the REST API into the 
scheduler.


The merging of data will essentially start by updating the resource amounts 
that the host state objects contain (stuff like total_usable_ram etc) 
with the accurate data from the provider_summaries section.
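
A sketch of what that merge step might look like; the attribute names on the
host state object here are simplified assumptions, not the exact nova fields:

    def apply_provider_summary(host_state, summary):
        """Overwrite coarse HostState amounts with placement's numbers."""
        res = summary["resources"]
        if "VCPU" in res:
            host_state.vcpus_total = res["VCPU"]["capacity"]
            host_state.vcpus_used = res["VCPU"]["used"]
        if "MEMORY_MB" in res:
            host_state.total_usable_ram_mb = res["MEMORY_MB"]["capacity"]
            host_state.free_ram_mb = (res["MEMORY_MB"]["capacity"]
                                      - res["MEMORY_MB"]["used"])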


Best,
-jay



Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Edward Leafe
On Jun 19, 2017, at 9:17 AM, Jay Pipes  wrote:

As Matt pointed out, I mis-wrote when I said “current flow”. I meant “current 
agreed-to design flow”. So no need to rehash that.

>> * Placement returns a number of these data structures as JSON blobs. Due to 
>> the size of the data, a page size will have to be determined, and placement 
>> will have to either maintain that list of structured data for subsequent 
>> requests, or re-run the query and only calculate the data structures for the 
>> hosts that fit in the requested page.
> 
> "of these data structures as JSON blobs" is kind of redundant... all our REST 
> APIs return data structures as JSON blobs.

Well, I was trying to be specific. I didn’t mean to imply that this was a 
radical departure or anything.

> While we discussed the fact that there may be a lot of entries, we did not 
> say we'd immediately support a paging mechanism.

OK, thanks for clarifying that. When we discussed returning 1.5K per compute 
host instead of a couple of hundred bytes, there was discussion that paging 
would be necessary.

>> * Scheduler continues to request the paged results until it has them all.
> 
> See above. Was discussed briefly as a concern but not work to do for first 
> patches.
> 
>> * Scheduler then runs this data through the filters and weighers. No 
>> HostState objects are required, as the data structures will contain all the 
>> information that scheduler will need.
> 
> No, this isn't correct. The scheduler will have *some* of the information it 
> requires for weighing from the returned data from the GET 
> /allocation_candidates call, but not all of it.
> 
> Again, operators have insisted on keeping the flexibility currently in the 
> Nova scheduler to weigh/sort compute nodes by things like thermal metrics and 
> kinds of data that the Placement API will never be responsible for.
> 
> The scheduler will need to merge information from the "provider_summaries" 
> part of the HTTP response with information it has already in its HostState 
> objects (gotten from ComputeNodeList.get_all_by_uuid() and 
> AggregateMetadataList).

OK, that’s informative, too. Is there anything decided on how much host info 
will be in the response from placement, and how much will be in HostState? Or 
how the reporting of resources by the compute nodes will have to change to feed 
this information to placement? Or how the two sources of information will be 
combined so that the filters and weighers can process it? Or is that still to 
be worked out?

>> * Scheduler then selects the data structure at the top of the ranked list. 
>> Inside that structure is a dict of the allocation data that scheduler will 
>> need to claim the resources on the selected host. If the claim fails, the 
>> next data structure in the list is chosen, and repeated until a claim 
>> succeeds.
> 
> Kind of, yes. The scheduler will select a *host* that meets its needs.
> 
> There may be more than one allocation request that includes that host 
> resource provider, because of shared providers and (soon) nested providers. 
> The scheduler will choose one of these allocation requests and attempt a 
> claim of resources by simply PUT /allocations/{instance_uuid} with the 
> serialized body of that allocation request. If 202 returned, cool. If not, 
> repeat for the next allocation request.

Ah, yes, good point. A host with multiple nested providers, or with shared and 
local storage, will have to have multiple copies of the data structure returned 
to reflect those permutations. 
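
A rough sketch of that claim loop; the session handling and helper shape are
assumptions, while the try-the-next-allocation-request-on-failure behaviour
is the part described above:

    def claim_for_host(session, placement_url, instance_uuid, project_id,
                       user_id, allocation_requests, host_uuid):
        """Try each allocation request involving host_uuid until one sticks."""
        for req in allocation_requests:
            providers = [a["resource_provider"]["uuid"]
                         for a in req["allocations"]]
            if host_uuid not in providers:
                continue                  # this request is for another host
            body = dict(req, project_id=project_id, user_id=user_id)
            resp = session.put("%s/allocations/%s"
                               % (placement_url, instance_uuid), json=body)
            if resp.ok:                   # e.g. the 202 mentioned above
                return True
            # a conflict here usually means we lost a race; try the next one
        return False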

>> * Scheduler then creates a list of N of these data structures, with the 
>> first being the data for the selected host, and the rest being data 
>> structures representing alternates consisting of the next hosts in the 
>> ranked list that are in the same cell as the selected host.
> 
> Yes, this is the proposed solution for allowing retries within a cell.

OK.

>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list to 
>> the target cell.
>> * Target cell tries to build the instance on the selected host. If it fails, 
>> it uses the allocation data in the data structure to unclaim the resources 
>> for the selected host, and tries to claim the resources for the next host in 
>> the list using its allocation data. It then tries to build the instance on 
>> the next host in the list of alternates. Only when all alternates fail does 
>> the build request fail.
> 
> I'll let Dan discuss this last part.


Well, that’s not substantially different than the original plan, so no 
additional explanation is required.

One other thing: since this new functionality is exposed via a new API call, is 
the existing method of filtering RPs by passing in resources going to be 
deprecated? And is the code for adding filtering by traits to that also no longer 
useful?


-- Ed Leafe






Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Matt Riedemann

On 6/19/2017 9:17 AM, Jay Pipes wrote:

On 06/19/2017 09:04 AM, Edward Leafe wrote:

Current flow:


As noted in the nova-scheduler meeting this morning, this should have 
been called "original plan" rather than "current flow", as Jay pointed 
out inline.


* Scheduler gets a req spec from conductor, containing resource 
requirements

* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy 
those requirements


Not root RPs. Non-sharing resource providers, which currently 
effectively means compute node providers. Nested resource providers 
isn't yet merged, so there is currently no concept of a hierarchy of 
providers.


* Placement returns a list of the UUIDs for those root providers to 
scheduler


It returns the provider names and UUIDs, yes.


* Scheduler uses those UUIDs to create HostState objects for each


Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
in a list of the provider UUIDs it got back from the placement service. 
The scheduler then builds a set of HostState objects from the results of 
ComputeNodeList.get_all_by_uuid().


The scheduler also keeps a set of AggregateMetadata objects in memory, 
including the association of aggregate to host (note: this is the 
compute node's *service*, not the compute node object itself, thus the 
reason aggregates don't work properly for Ironic nodes).


* Scheduler runs those HostState objects through filters to remove 
those that don't meet requirements not selected for by placement


Yep.

* Scheduler runs the remaining HostState objects through weighers to 
order them in terms of best fit.


Yep.

* Scheduler takes the host at the top of that ranked list, and tries 
to claim the resources in placement. If that fails, there is a race, 
so that HostState is discarded, and the next is selected. This is 
repeated until the claim succeeds.


No, this is not how things work currently. The scheduler does not claim 
resources. It selects the top (or random host depending on the selection 
strategy) and sends the launch request to the target compute node. The 
target compute node then attempts to claim the resources and in doing so 
writes records to the compute_nodes table in the Nova cell database as 
well as the Placement API for the compute node resource provider.


Not to nit pick, but today the scheduler sends the selected destinations 
to the conductor. Conductor looks up the cell that a selected host is 
in, creates the instance record and friends (bdms) in that cell and then 
sends the build request to the compute host in that cell.




* Scheduler then creates a list of N UUIDs, with the first being the 
selected host, and the rest being alternates consisting of the 
next hosts in the ranked list that are in the same cell as the 
selected host.


This isn't currently how things work, no. This has been discussed, however.


* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that 
list to the target cell.
* Target cell tries to build the instance on the selected host. If it 
fails, it unclaims the resources for the selected host, and tries to 
claim the resources for the next host in the list. It then tries to 
build the instance on the next host in the list of alternates. Only 
when all alternates fail does the build request fail.


This isn't currently how things work, no. There has been discussion of 
having the compute node retry alternatives locally, but nothing more 
than discussion.


Correct that this isn't how things currently work, but it was/is the 
original plan. And the retry happens within the cell conductor, not on 
the compute node itself. The top-level conductor is what's getting 
selected hosts from the scheduler. The cell-level conductor is what's 
getting a retry request from the compute. The cell-level conductor would 
deallocate from placement for the currently claimed providers, and then 
pick one of the alternatives passed down from the top and then make 
allocations (a claim) against those, then send to an alternative compute 
host for another build attempt.


So with this plan, there are two places to make allocations - the 
scheduler first, and then the cell conductors for retries. This 
duplication is why some people were originally pushing to move all 
allocation-related work happen in the conductor service.
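
A very rough sketch of that retry flow; unclaim_allocations,
claim_allocations and send_build_request are hypothetical stand-ins for the
cell conductor's placement and RPC calls, not actual nova functions:

    def retry_build(instance_uuid, alternates):
        """alternates: ordered (host, allocation_request) pairs from the
        scheduler, all in the same cell as the originally selected host."""
        unclaim_allocations(instance_uuid)        # drop the failed claim
        for host, alloc_req in alternates:
            # Claim against placement for this alternate; a failure (for
            # example, losing a race to another claim) moves us along.
            if claim_allocations(instance_uuid, alloc_req):
                send_build_request(host, instance_uuid)
                return True
        return False                              # all alternates exhausted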





Proposed flow:
* Scheduler gets a req spec from conductor, containing resource 
requirements

* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy 
those requirements


Yes.

* Placement then constructs a data structure for each root provider as 
documented in the spec. [0]


Yes.

* Placement returns a number of these data structures as JSON blobs. 
Due to the size of the data, a page size will have to be determined, 
and placement will have to either maintain that list of structured 
data for 

Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-19 Thread Jay Pipes

On 06/19/2017 09:04 AM, Edward Leafe wrote:

Current flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements


Not root RPs. Non-sharing resource providers, which currently 
effectively means compute node providers. Nested resource providers 
isn't yet merged, so there is currently no concept of a hierarchy of 
providers.



* Placement returns a list of the UUIDs for those root providers to scheduler


It returns the provider names and UUIDs, yes.


* Scheduler uses those UUIDs to create HostState objects for each


Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
in a list of the provider UUIDs it got back from the placement service. 
The scheduler then builds a set of HostState objects from the results of 
ComputeNodeList.get_all_by_uuid().


The scheduler also keeps a set of AggregateMetadata objects in memory, 
including the association of aggregate to host (note: this is the 
compute node's *service*, not the compute node object itself, thus the 
reason aggregates don't work properly for Ironic nodes).



* Scheduler runs those HostState objects through filters to remove those that 
don't meet requirements not selected for by placement


Yep.


* Scheduler runs the remaining HostState objects through weighers to order them 
in terms of best fit.


Yep.


* Scheduler takes the host at the top of that ranked list, and tries to claim 
the resources in placement. If that fails, there is a race, so that HostState 
is discarded, and the next is selected. This is repeated until the claim 
succeeds.


No, this is not how things work currently. The scheduler does not claim 
resources. It selects the top (or random host depending on the selection 
strategy) and sends the launch request to the target compute node. The 
target compute node then attempts to claim the resources and in doing so 
writes records to the compute_nodes table in the Nova cell database as 
well as the Placement API for the compute node resource provider.



* Scheduler then creates a list of N UUIDs, with the first being the selected 
host, and the rest being alternates consisting of the next hosts in the 
ranked list that are in the same cell as the selected host.


This isn't currently how things work, no. This has been discussed, however.


* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to 
the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it 
unclaims the resources for the selected host, and tries to claim the resources 
for the next host in the list. It then tries to build the instance on the next 
host in the list of alternates. Only when all alternates fail does the build 
request fail.


This isn't currently how things work, no. There has been discussion of 
having the compute node retry alternatives locally, but nothing more 
than discussion.



Proposed flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those 
requirements


Yes.


* Placement then constructs a data structure for each root provider as 
documented in the spec. [0]


Yes.


* Placement returns a number of these data structures as JSON blobs. Due to the 
size of the data, a page size will have to be determined, and placement will 
have to either maintain that list of structured data for subsequent requests, or 
re-run the query and only calculate the data structures for the hosts that fit 
in the requested page.


"of these data structures as JSON blobs" is kind of redundant... all our 
REST APIs return data structures as JSON blobs.


While we discussed the fact that there may be a lot of entries, we did 
not say we'd immediately support a paging mechanism.



* Scheduler continues to request the paged results until it has them all.


See above. Was discussed briefly as a concern but not work to do for 
first patches.



* Scheduler then runs this data through the filters and weighers. No HostState 
objects are required, as the data structures will contain all the information 
that scheduler will need.


No, this isn't correct. The scheduler will have *some* of the 
information it requires for weighing from the returned data from the GET 
/allocation_candidates call, but not all of it.


Again, operators have insisted on keeping the flexibility currently in 
the Nova scheduler to weigh/sort compute nodes by things like thermal 
metrics and kinds of data that the Placement API will never be 
responsible for.


The scheduler will need to merge information from the 
"provider_summaries" part of the HTTP response with information it has 
already in its HostState objects (gotten from 
ComputeNodeList.get_all_by_uuid()