Re: [openstack-dev] [nova][scheduler] A simple solution for better scheduler performance

2016-07-15 Thread Cheng, Yingxin
Hi John,

Thanks for the reply.

There are two rounds of experiments:
Experiment A [3] is deployed by devstack. There are 1000 compute services with a
fake virt driver, the DB driver is the devstack default PyMySQL, and the
scheduler driver is the default filter scheduler.
Experiment B [4] is a real production environment from China Mobile with about
600 active compute nodes. The DB driver is the SQLAlchemy default, i.e. the
C-based python-mysql, and the scheduler is also the filter scheduler.
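
(For reference, the driver difference is just the SQLAlchemy connection URL in
nova.conf; a hedged example, the connection values here are illustrative:)

    [database]
    # C-based MySQL-Python driver, blocks the whole process under eventlet:
    # connection = mysql://nova:secret@127.0.0.1/nova
    # Pure-python PyMySQL driver, yields to eventlet during DB I/O:
    connection = mysql+pymysql://nova:secret@127.0.0.1/nova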

In the analysis at
https://docs.google.com/document/d/1N_ZENg-jmFabyE0kLMBgIjBGXfL517QftX3DW7RVCzU/edit?usp=sharing
figures 1/2 are from experiment B and figures 3/4 are from experiment A, so both
kinds of DB drivers are covered.

My point is simple: when the host manager is querying host states for request A
and another request B comes in, the host manager won’t launch a second
cache-refresh; instead, it simply reuses the first one and returns the same
result to both A and B. In this way, we can reduce the expensive cache-refresh
queries to a minimum while keeping the scheduler host states fresh. It becomes
more effective as the number of compute nodes and the request pressure grow.
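
Here is a minimal sketch of the idea (illustrative only, not the nova code; the
class and method names are made up). Concurrent callers share the result of one
in-flight refresh instead of each issuing the expensive DB query. Nova runs
under eventlet, where these threading primitives become green once the standard
library is monkey-patched:

    import threading

    class CoalescingHostManager(object):
        def __init__(self, db_query):
            self._db_query = db_query      # the expensive cache-refresh call
            self._lock = threading.Lock()
            self._in_flight = None         # Event set when a refresh finishes
            self._result = None

        def get_host_states(self):
            with self._lock:
                if self._in_flight is None:
                    # First caller: start a new refresh and become the leader.
                    self._in_flight = threading.Event()
                    refresh_done, leader = self._in_flight, True
                else:
                    # A refresh is already running: just wait for its result.
                    refresh_done, leader = self._in_flight, False

            if leader:
                try:
                    # Error handling omitted for brevity.
                    self._result = self._db_query()
                finally:
                    with self._lock:
                        self._in_flight = None
                    refresh_done.set()
            else:
                refresh_done.wait()
            return self._result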

I also have runnable code that can better explain my idea: 
https://github.com/cyx1231st/making-food 

-- 
Regards
Yingxin

On 7/15/16, 17:19, "John Garbutt"  wrote:

On 15 July 2016 at 09:26, Cheng, Yingxin  wrote:
> It is easy to understand that scheduling in nova-scheduler service 
consists of 2 major phases:
> A. Cache refresh, in code [1].
> B. Filtering and weighing, in code [2].
>
> A couple of previous experiments [3][4] show that “cache-refresh” is the
major bottleneck of the nova scheduler. For example, the 15th page of
presentation [3] shows that “cache-refresh” accounts for 98.5% of the time of
the entire `_schedule` function [6] when there are 200-1000 nodes and 50+
concurrent requests. The latest experiments [5] in China Mobile’s 1000-node
environment confirm the same conclusion, and the share is even 99.7% with 40+
concurrent requests.
>
> Here’re some existing solutions for the “cache-refresh” bottleneck:
> I. Caching scheduler.
> II. Scheduler filters in DB [7].
> III. Eventually consistent scheduler host state [8].
>
> I can discuss their merits and drawbacks in a separate thread, but here I
want to show the simplest solution, based on my findings during the experiments
[5]. I wrapped the expensive function [1] to observe the behavior of
cache-refresh under pressure. Interestingly, a single cache-refresh only costs
about 0.3 seconds, but when there are concurrent cache-refresh operations, the
cost can suddenly increase to 8 seconds. I’ve even seen it reach 60 seconds for
one cache-refresh under higher pressure. See the section below for details.

I am curious about what DB driver you are using?
Using PyMySQL should remove a lot of those issues.
This is the driver we use in the gate now, but it didn't use to be the
default.

If you use the C-based MySQL driver, you will find it locks the whole
process when making a DB call; eventlet then schedules the next DB
call, and so on, and only later loops back and allows the python code to
process the first DB call. In extreme cases you will find the
code processing the DB query considers some of the hosts to be down
since it has been so long since the DB call returned.

Switching the driver should dramatically increase the performance of (II).

> This raises a question about the current implementation: do we really need a
cache-refresh operation [1] for *every* request? If those concurrent
operations are replaced by one database query, the scheduler is still happy
with the latest resource view from the database. The scheduler is even happier
because those expensive cache-refresh operations are minimized and much faster
(0.3 seconds). I believe it is the simplest optimization of scheduler
performance, and it doesn’t require any changes to the filter scheduler; minor
improvements inside the host manager are enough.

So it depends on the usage patterns in your cloud.

The caching scheduler is one way to avoid the cache-refresh operation
on every request. It has an upper limit on throughput as you are
forced into having a single active nova-scheduler process.

But the caching means you can only have a single nova-scheduler
process, whereas (II) allows you to have multiple nova-scheduler
workers to increase the concurrency.

> [1] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
> [2] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L112-L123
> [3] 
https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
>

[openstack-dev] [nova][scheduler] A simple solution for better scheduler performance

2016-07-15 Thread Cheng, Yingxin
It is easy to understand that scheduling in the nova-scheduler service consists
of 2 major phases:
A. Cache refresh, in code [1].
B. Filtering and weighing, in code [2].

A couple of previous experiments [3][4] show that “cache-refresh” is the major
bottleneck of the nova scheduler. For example, the 15th page of presentation
[3] shows that “cache-refresh” accounts for 98.5% of the time of the entire
`_schedule` function [6] when there are 200-1000 nodes and 50+ concurrent
requests. The latest experiments [5] in China Mobile’s 1000-node environment
confirm the same conclusion, and the share is even 99.7% with 40+ concurrent
requests.

Here are some existing solutions for the “cache-refresh” bottleneck:
I. Caching scheduler.
II. Scheduler filters in DB [7].
III. Eventually consistent scheduler host state [8].

I can discuss their merits and drawbacks in a separate thread, but here I want
to show the simplest solution, based on my findings during the experiments [5].
I wrapped the expensive function [1] to observe the behavior of cache-refresh
under pressure. Interestingly, a single cache-refresh only costs about 0.3
seconds, but when there are concurrent cache-refresh operations, the cost can
suddenly increase to 8 seconds. I’ve even seen it reach 60 seconds for one
cache-refresh under higher pressure. See the section below for details.
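
(The wrapping itself is trivial; here is a hedged sketch of the kind of timing
wrapper used, illustrative only and not the actual profiling code from [5]:)

    import functools
    import time

    def timed(func):
        """Log how long each call takes and how many calls were in flight."""
        in_flight = [0]

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            in_flight[0] += 1
            concurrent = in_flight[0]
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                in_flight[0] -= 1
                print('%s took %.3fs (%d calls in flight)'
                      % (func.__name__, time.time() - start, concurrent))
        return wrapper

    # e.g. host_manager.get_all_host_states = timed(host_manager.get_all_host_states)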

This raises a question about the current implementation: do we really need a
cache-refresh operation [1] for *every* request? If those concurrent operations
are replaced by one database query, the scheduler is still happy with the
latest resource view from the database. The scheduler is even happier because
those expensive cache-refresh operations are minimized and much faster (0.3
seconds). I believe it is the simplest optimization of scheduler performance,
and it doesn’t require any changes to the filter scheduler; minor improvements
inside the host manager are enough.

[1] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
 
[2] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L112-L123
[3] 
https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
 
[4] http://lists.openstack.org/pipermail/openstack-dev/2016-June/098202.html 
[5] Please refer to Barcelona summit session ID 15334 later: “A tool to test 
and tune your OpenStack Cloud? Sharing our 1000 node China Mobile experience.”
[6] 
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L53
[7] https://review.openstack.org/#/c/300178/
[8] https://review.openstack.org/#/c/306844/


** Here is the discovery from latest experiments [5] **
https://docs.google.com/document/d/1N_ZENg-jmFabyE0kLMBgIjBGXfL517QftX3DW7RVCzU/edit?usp=sharing
 

Figure 1 illustrates the concurrent cache-refresh operations in a nova
scheduler service. There are at most 23 requests waiting for cache-refresh
operations at time 43s.

Figure 2 illustrates the time cost of every request in the same experiment.
It shows that the cost increases with the growth of concurrency, proving the
vicious circle: a request waits longer for the database when there are more
waiting requests.

Figures 3/4 illustrate a worse case in which the cache-refresh cost reaches
60 seconds because of excessive concurrent cache-refresh operations.


-- 
Regards
Yingxin



Re: [openstack-dev] [Nova] [RFC] ResourceProviderTags - Manage Capabilities with ResourceProvider

2016-07-14 Thread Cheng, Yingxin
On 7/14/16, 12:18, "Edward Leafe"  wrote:
On Jul 13, 2016, at 9:38 PM, Cheng, Yingxin  wrote:
>>Thinking about that a bit, it would seem that a host aggregate could 
also be represented as a namespace:name tag. That makes sense, since the fact 
that a host belongs to a particular aggregate is a qualitative aspect of that 
host.
>> 
> 
> Thanks for the feedback!
> 
> We’ve thought about the relationship between capability tags and host
aggregates carefully, and we decided not to blend them with host aggregates,
for several reasons:
> 1. We want to manage capabilities in only ONE place: either in host
aggregates, compute_node records or resource_provider records.
> 2. Compute services may need to attach discovered capabilities to their
hosts. It is inconvenient if we store caps with host aggregates, because
nova-compute would need to create/search host aggregates first; it can’t
directly attach caps.
> 3. Other services may need to attach discovered capabilities to their
resources. So the best place is their related resource pool, not aggregates,
nor compute_node records. Note that the relationship between resource pools
and host aggregates is N:N.
> 4. It’s logically correct to store caps with resource_providers, because
caps are actually owned by nodes or resource pools.
> 5. Scheduling will be faster if resource providers have caps directly
attached.
> 
> However, for user-defined caps, it still seems easier to manage them with
aggregates. We may want to manage them differently from pre-defined caps, or
we can manage them indirectly through aggregates while they are actually
stored with compute-node resource providers in the placement db.

Oh, I think you misunderstood me. Capabilities definitely belong with 
resource providers, not host aggregates, because not all RPs are hosts.
I'm thinking that host aggregates themselves are equivalent to capabilities 
for hosts. Imagine we have 10 hosts, and put 3 of them in an aggregate. How is
that different from giving those three a tag in the 'host_agg' namespace, with
the tag named for the agg?
I'm just thinking out loud here. There might be opportunities to simplify a 
lot of the code between capability tags and host aggregates in the future, 
since it looks like host aggs are a structural subset of RP capability tags.

-- Ed Leafe


Your concerns are correct. The major goal of the “Capability Tags” series is to
*replace* the existing capability-like functionality in Nova and the scheduler
with a more generic and extensible implementation.

As you said, host aggregates themselves are equivalent to capabilities for
hosts. We should continue to support this usage with the new “Capability Tags”
implementation. Currently users can write free-form metadata to host
aggregates, the scheduler can process that metadata through
“AggregateImagePropertiesIsolation” and “AggregateInstanceExtraSpecsFilter”,
and users can specify those caps in image properties and flavor extra specs.
This means we need to support capability tags at group granularity, i.e.
tagging caps onto host aggregates. It can be a separate implementation called
“Aggregate Capability Tags”, replacing the current implementation built on the
two aggregate filters mentioned above. An example of today’s aggregate-based
workflow is sketched below.
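
(A hedged example of the current aggregate-based workflow; the aggregate,
flavor and metadata key names are made up:)

    # Mark an aggregate with a capability and request it via a flavor:
    nova aggregate-create fast-storage-agg
    nova aggregate-set-metadata fast-storage-agg ssd=true
    nova flavor-key m1.fast set aggregate_instance_extra_specs:ssd=true
    # AggregateInstanceExtraSpecsFilter then only passes hosts belonging to
    # an aggregate whose metadata matches the flavor extra spec.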

As for “Resource Provider Capability Tags”, we are managing capabilities at a
finer granularity: the host and resource pool level. Currently users can only
use pre-defined caps such as “architecture”, “hypervisor-types”,
“hypervisor-versions” and “vm-mode” in host states, which can be processed by
“ImagePropertiesFilter”, “ComputeCapabilitiesFilter” and “JsonFilter”, while
users specify them in image properties, flavor extra specs and scheduler hints.
We are designing “Resource Provider Capability Tags” to replace them and to
provide extensibility for adding more service-defined and user-defined caps in
a generic way.

The above also means we may want to manage caps in a separate table and
maintain their relationships with resource providers and host aggregates, so we
can query existing caps, validate them in image properties, flavor extra specs
and scheduler hints, and manage them in a consistent way.


---
Regards
Yingxin







Re: [openstack-dev] [Nova] [RFC] ResourceProviderTags - Manage Capabilities with ResourceProvider

2016-07-13 Thread Cheng, Yingxin

On 7/14/16, 05:42, "Ed Leafe"  wrote:
On Jul 12, 2016, at 2:43 AM, Cheng, Yingxin  wrote:
> 4. Capabilities are managed/grouped/abstracted by namespaces, and the
scheduler can make decisions based on either cap_names or cap_namespaces.
> 5. The placement service DOESN’T have any specific knowledge of a capability;
it only knows its name, its namespace and its relationship to resource
providers. They are used for scheduling, capability management and reporting.

Thinking about that a bit, it would seem that a host aggregate could also 
be represented as a namespace:name tag. That makes sense, since the fact that a 
host belongs to a particular aggregate is a qualitative aspect of that host.


Thanks for the feedback!

We’ve thought about the relationship between capability tags and host
aggregates carefully, and we decided not to blend them with host aggregates,
for several reasons:
1. We want to manage capabilities in only ONE place: either in host aggregates,
compute_node records or resource_provider records.
2. Compute services may need to attach discovered capabilities to their hosts.
It is inconvenient if we store caps with host aggregates, because nova-compute
would need to create/search host aggregates first; it can’t directly attach
caps.
3. Other services may need to attach discovered capabilities to their
resources. So the best place is their related resource pool, not aggregates,
nor compute_node records. Note that the relationship between resource pools and
host aggregates is N:N.
4. It’s logically correct to store caps with resource_providers, because caps
are actually owned by nodes or resource pools.
5. Scheduling will be faster if resource providers have caps directly attached.

However, for user-defined caps, it still seems easier to manage them with
aggregates. We may want to manage them differently from pre-defined caps, or
we can manage them indirectly through aggregates while they are actually
stored with compute-node resource providers in the placement db.


-- 
Regards
Yingxin 



Re: [openstack-dev] [Nova] [RFC] ResourceProviderTags - Manage Capabilities with ResourceProvider

2016-07-12 Thread Cheng, Yingxin
Some thoughts on “Capability” use cases:
1. The nova-compute service can discover host capabilities and tag them onto
its compute-node resource provider record.
2. Other services such as ironic and cinder may want to manage their resource
pools, and they can tag capabilities onto those pools themselves.
3. Admins/users can register user-defined capabilities on resource providers
(i.e. a pool or a host).
4. Capabilities are managed/grouped/abstracted by namespaces, and the scheduler
can make decisions based on either cap_names or cap_namespaces.
5. The placement service DOESN’T have any specific knowledge of a capability;
it only knows its name, its namespace and its relationship to resource
providers. They are used for scheduling, capability management and reporting.
6. The placement service needs to know where a capability comes from
(user-defined, nova-defined, or others), so it can control modifications of
capabilities and list existing capabilities by type.

I think the above are the normal use cases; please correct me if there are
mistakes, or add more items.


Regards,
-Yingxin

From: Alex Xu [mailto:sou...@gmail.com]
Sent: Monday, July 11, 2016 11:22 PM
To: OpenStack Development Mailing List (not for usage questions) 

Cc: Bhandaru, Malini K ; Cheng, Yingxin 
; Jin, Yuntong ; Tan, Lin 

Subject: Re: [Nova] [RFC] ResourceProviderTags - Manage Capabilities with 
ResourceProvider

Matt mentioned Aggregates in the scheduler meeting; that is a good question and
also reminds me that I should explain the relationship between Aggregates and
ResourceProviderTags.

Aggregates are expected to remain a tool for grouping hosts, and just for
grouping hosts. People used to manage capabilities with aggregates: they put
hosts with a certain capability into the same aggregate and used the metadata
to identify the capability. But aggregates with metadata are really not easy
to manage.

Think of this case:

Host1 has Capability1
Host2 has Capability1 and Capability2
Host3 has Capability2 and Capability3.

With this case, we need 3 aggregates, one for each capability: agg_cap1,
agg_cap2, agg_cap3. Then we need to add hosts to the aggregates as below:

agg_cap1: host1, host2
agg_cap2: host2, host3
agg_cap3: host3

When there are more capabilities and more hosts to manage, the mapping of hosts
to aggregates becomes more complicated. And there isn't an easy interface to
let the user know which capabilities a specific host has.

ResourceProviderTags will be a substitute for aggregates for managing
capabilities. With tags, there is no complex mapping.

For the same case, we just need to add tags to the ResourceProvider. The tags
interface is pretty easy; check out the api-wg guideline
https://github.com/openstack/api-wg/blob/master/guidelines/tags.rst. And the
query parameters make the management easy.

There are also some users who want to write scripts to manage capabilities.
With aggregates, such a script is very hard to write because it has to manage
the mapping between aggregates and hosts. The script is very easy with tags,
because tags are just a set of strings. The difference in data shape is
sketched below.
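
(A hedged sketch of the two data shapes for the case above, in plain Python
structures, just for illustration:)

    # Aggregate-based: one aggregate per capability, each holding a host list.
    aggregates = {
        'agg_cap1': ['host1', 'host2'],
        'agg_cap2': ['host2', 'host3'],
        'agg_cap3': ['host3'],
    }
    # Answering "what can host2 do?" means scanning every aggregate:
    host2_caps = [agg for agg, hosts in aggregates.items() if 'host2' in hosts]

    # Tag-based: capabilities hang directly off each resource provider.
    provider_tags = {
        'host1': {'CAP1'},
        'host2': {'CAP1', 'CAP2'},
        'host3': {'CAP2', 'CAP3'},
    }
    host2_caps = provider_tags['host2']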


2016-07-11 19:08 GMT+08:00 Alex Xu :
This proposal is about using ResourceProviderTags as a solution to manage
capabilities (qualitative aspects) in ResourceProvider. ResourceProviderTags
describe the capabilities which are defined by an OpenStack service (Compute
Service, Storage Service, Network Service etc.) or by users. A ResourceProvider
provides resources exposed by a single compute node, some shared resource pool
or an external resource-providing service of some sort. As such,
ResourceProviderTags are also expected to describe the capabilities of a single
ResourceProvider or the capabilities of a ResourcePool.

ResourceProviderTags are similar to ServerTags [0], which are implemented in
Nova. The only difference is that the tags are attached to the
ResourceProvider. The API endpoint will be "/ResourceProvider/{uuid}/tags", and
it will follow the API-WG guideline about tags [1].

As the tags are just strings, the meaning of a tag isn't defined by the
scheduler; it is defined by OpenStack services or users. ResourceProviderTags
will only be used for scheduling, with a ResourceProviderTags filter, as in the
sketch below.
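
(A hedged sketch of what the API could look like, following the api-wg tags
guideline; this was never merged, so the paths, tag names and query parameters
are illustrative only:)

    # Replace the full set of tags on a resource provider:
    PUT /resource_providers/{uuid}/tags
    {"tags": ["COMPUTE:HW:SSD", "CUSTOM:FAST_NIC"]}

    # List resource providers filtered by tags:
    GET /resource_providers?tags=COMPUTE:HW:SSD
    GET /resource_providers?tags-any=COMPUTE:HW:SSD,CUSTOM:FAST_NIC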

ResourceProviderTags easily cover the cases of a single ResourceProvider, a
ResourcePool and DynamicResources. Let's see those cases one by one.

For the single ResourceProvider case, just look at how Nova reports a
ComputeNode's capabilities. Firstly, Nova is expected to define a standard way
to describe the capabilities provided by the hypervisor or hardware, so that
those capability descriptions can be used across an OpenStack deployment. Nova
will therefore define a set of tags. Those tags should carry a prefix to
indicate that they come from Nova, and the naming rule of the prefix can be
used to catalogue the capabilities.

[openstack-dev] [nova][scheduler] More test results for the "eventually consistent host state" prototype

2016-06-27 Thread Cheng, Yingxin
Hi,

According to the feedback [1] from the Austin design summit, I prepared my
environment with pre-loaded computes and finished a new round of performance
profiling using the tool [7]. I also updated the prototype [2] to simplify the
implementation on the compute-node side, which makes it closer to the design
described in the spec [6].

This set of results is more comprehensive; it includes the analysis of the
“eventually consistent host states” prototype [2], the default filter
scheduler, and the caching scheduler. They are tested with various scenarios in
a 1000-compute-node environment, with real controller services, a real
RabbitMQ and a real MySQL database. The new set of experiments contains 55
repeatable results [3]. Don’t be put off by the verbose data; I’ve dug out the
conclusions.

To better understand what’s happening during scheduling in different
scenarios, all of them are visualized in the doc [4]. They complement what I
presented at the Austin design summit, on the 7th page of the ppt [5].

Note that the “pre-load scenario” allows only 49 new instances to be launched 
in the 1000-node environment. It means when 50 requests are sent, there should 
be 1 and only 1 failed request if the scheduler decision is accurate.


Detailed analysis with illustration [4]: 
https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
 
==
In all test cases, nova dispatches 50 instant requests to 1000 compute nodes.
The aim is to compare the behavior of 3 types of schedulers, with preloaded or
empty-loaded scenarios, and with 1 or 2 scheduler services. So that’s 3*2*2=12
sets of experiments, and each set is tested multiple times.

In scenario S1 (i.e. 1 scheduler with empty-loaded compute nodes), we can see
from A2 very clearly that the entire boot process with the filter scheduler is
bottlenecked in the nova-scheduler service. The filter scheduler consumes those
50 requests very slowly, causing all the requests to be blocked in front of the
scheduler service (the yellow area). The ROOT CAUSE is the “cache-refresh” step
before filtering (i.e.
`nova.scheduler.filter_scheduler.FilterScheduler._get_all_host_states`). I’ve
discussed this bottleneck in detail in the Austin summit session “Dive into
nova scheduler performance: where is the bottleneck” [8]. This is also proved
by the caching scheduler, because it excludes the “cache-refresh” bottleneck
and only uses in-memory filtering. By simply excluding “cache-refresh”, the
performance benefits are huge: the query time is reduced by 87%, and the
overall throughput (i.e. the delivered requests per second in this cloud) is
multiplied by 8.24; see A3 for illustration. The “eventually consistent host
states” prototype also excludes this bottleneck and takes a more meticulous
approach to synchronizing scheduler caches. It is slightly slower than the
caching scheduler, because there is an overhead in applying incremental updates
from compute nodes. The query time is reduced by 79% and the overall throughput
is multiplied by 5.63 on average in S1.

In preload scenario S2, we can see all 3 types of schedulers are faster than in
their empty-loaded scenarios. That’s because the filters can now prune the
hosts from 1000 to only 49, so the last few filters don’t need to process 1000
host states and can be much faster. But the filter scheduler (B2) cannot
benefit much from faster filtering, because its bottleneck is still
“cache-refresh”. However, it makes a real difference for the caching scheduler
and the prototype, because their performance depends heavily on in-memory
filtering. For the caching scheduler (B3), the query time is reduced by 81% and
the overall throughput is multiplied by 7.52 compared with the filter
scheduler. And for the prototype (B1), the query time is reduced by 83% and the
throughput is multiplied by 7.92 on average. Also, all those scheduler
decisions are accurate, because the first decisions are all correct without any
retries in the preload scenario, and only 1 of 50 requests fails due to the
“no valid host” error.

In scenario S3, with 2 scheduler services and empty-loaded compute nodes, the
overall scheduling bandwidth is internally multiplied by 2. The filter
scheduler (C2) shows a major improvement, because its scheduler bandwidth is
multiplied. But the other two types don’t show a similar improvement, because
their bottleneck is now in the nova-api service instead. It is a wrong decision
to add more schedulers when the actual bottleneck is elsewhere. And worse,
multiple schedulers will introduce more race conditions as well as other
overhead. However, the performance of the caching scheduler (C3) and the
prototype (C1) is still much better: the query time is reduced by 65% and the
overall throughput is multiplied by 3.67 on average.

In preload scenario S4 with 2 schedulers, the race condition surfaces because
there are only 49 slots among 1000 hosts in the cloud, and they will all
resul

Re: [openstack-dev] [nova] [placement] aggregates associated with multiple resource providers

2016-05-30 Thread Cheng, Yingxin
Hi, cdent:

This problem arises because the RT (resource tracker) only knows to consume
the DISK resource on its host, but it still doesn’t know exactly which resource
provider the consumption should be placed on. That is to say, the RT still
needs to *find* the correct resource provider in step 4. Step 4 finally causes
the explicit problem that “the RT can find two resource providers providing
DISK_GB, but it doesn’t know which is right”, as you’ve encountered.

The problem is that the RT has to choose a resource provider when it finds
multiple of them according to *step 4*. However, the scheduler should already
know which resource provider to choose when it makes its decision, and it
doesn’t send this information to compute nodes either. That is to say, there is
a missing step in the generic-resource-pools bp, namely “improve the filter
scheduler so that it can make correct decisions with generic resource pools”:
the scheduler should tell the compute-node RT not only about the resource
consumptions on the compute-node resource provider, but also where to consume
shared resources, i.e. their related resource-provider ids.
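
(A hedged sketch of what that extra piece of information could look like in the
data the scheduler hands down to the compute node; the field and uuid names are
made up, this is not the actual spec:)

    # Hypothetical: the scheduler annotates its selection with the resource
    # provider that should receive each claimed resource class.
    selection = {
        'host': 'compute-12',
        'nodename': 'compute-12',
        'allocations': {
            # compute-local resources go against the compute-node provider
            'compute-node-rp-uuid': {'VCPU': 2, 'MEMORY_MB': 4096},
            # shared resources go against the shared-storage pool provider
            'shared-storage-rp-uuid': {'DISK_GB': 20},
        },
    }
    # The RT then claims DISK_GB against 'shared-storage-rp-uuid' directly,
    # instead of guessing among the providers associated with its aggregates.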

Hope it can help you.

-- 
Regards
Yingxin


On 5/30/16, 06:19, "Chris Dent"  wrote:
>
>I'm currently doing some thinking on step 4 ("Modify resource tracker
>to pull information on aggregates the compute node is associated with
>and the resource pools available for those aggregatesa.") of the
>work items for the generic resource pools spec[1] and I've run into
>a brain teaser that I need some help working out.
>
>I'm not sure if I've run into an issue, or am just being ignorant. The
>latter is quite likely.
>
>This gets a bit complex (to me) but: The idea for step 4 is that the
>resource tracker will be modified such that:
>
>* if the compute node being claimed by an instance is a member of some
>   aggregates
>* and one of those  aggregates is associated with a resource provider 
>* and the resource provider has inventory of resource class DISK_GB
>
>then rather than claiming disk on the compute node, claim it on the
>resource provider.
>
>The first hurdle to overcome when doing this is to trace the path
>from compute node, through aggregates, to a resource provider. We
>can get a list of aggregates by host, and then we can use those
>aggregates to get a list of resource providers by joining across
>ResourceProviderAggregates, and we can join further to get just
>those ResourceProviders which have Inventory of resource class
>DISK_GB.
>
>The issue here is that the result is a list. As far as I can tell
>we can end up with >1 ResourceProviders providing DISK_GB for this
>host because it is possible for a host to be in more than one
>aggregate and it is necessary for an aggregate to be able to associate
>with more than one resource provider.
>
>If the above is true and we can find two resource providers providing
>DISK_GB how does:
>
>* the resource tracker know where (to which provider) to write its
>   disk claim?
>* the scheduler (the next step in the work items) make choices and
>   declarations amongst providers? (Yes, place on that node, but use disk 
> provider
>   X, not Y)
>
>If the above is not true, why is it not true? (show me the code
>please)
>
>If the above is an issue, but we'd like to prevent it, how do we fix it?
>Do we need to make it so that when we associate an aggregate with a
>resource provider we check to see that it is not already associated with
>some other provider of the same resource class? This would be a
>troubling approach because as things currently stand we can add Inventory
>of any class and aggregates to a provider at any time and the amount of
>checking that would need to happen is at least bi-directional if not multi
>and that level of complexity is not a great direction to be going.
>
>So, yeah, if someone could help me tease this out, that would be
>great, thanks.
>
>
>[1] 
>http://specs.openstack.org/openstack/nova-specs/specs/newton/approved/generic-resource-pools.html#work-items
>
>-- 
>Chris Dent   (╯°□°)╯︵┻━┻http://anticdent.org/
>freenode: cdent tw: @anticdent



Re: [openstack-dev] [Nova] Grant FFE to "Host-state level locking" BP

2016-03-04 Thread Cheng, Yingxin
I'll make sure to deliver the patches if FFE is granted.

Regards,
-Yingxin

On Friday, March 4, 2016 6:56 PM Nikola Đipanov wrote:
> 
> Hi,
> 
> The actual BP that links to the approved spec is here: [1] and 2 outstanding
> patches are [2][3].
> 
> Apart from the usual empathy-inspired reasons to allow this (code's been up 
> for
> a while, yet only had real review on the last day etc.) which are not related 
> to
> the technical merit of the work, there is also the fact that two initial 
> patches
> that add locking around updates of the in-memory host map ([4] and [5]) have
> already been merged.
> 
> They add the overhead of locking to the scheduler, but without the final work
> they don't provide any benefits (races will not be detected, without [2]).
> 
> I don't have any numbers on this but the result is likely that we made things
> worse, for the sake of adhering to random and made-up dates. With this in mind
> I think it only makes sense to do our best to merge the 2 outstanding patches.
> 
> Cheers,
> N.
> 
> [1]
> https://blueprints.launchpad.net/openstack/?searchtext=host-state-level-
> locking
> [2] https://review.openstack.org/#/c/262938/
> [3] https://review.openstack.org/#/c/262939/
> 
> [4] https://review.openstack.org/#/c/259891/
> [5] https://review.openstack.org/#/c/259892/
> 



Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-04 Thread Cheng, Yingxin
Hi,

First of all, many belated thanks to Jay Pipes for his benchmarking framework;
I learnt a lot from it :)

Other comments inline.

On Friday, March 4, 2016 8:42 AM Jay Pipes wrote:
> Hi again, Yingxin, sorry for the delayed response... been traveling.
> Comments inline :)
> 
> On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:
> > Hi,
> >
> > I have simulated the distributed resource management with the incremental
> update model based on Jay's benchmarking framework:
> https://github.com/cyx1231st/placement-bench/tree/shared-state-
> demonstration. The complete result lies at
> http://paste.openstack.org/show/488677/. It's ran by a VM with 4 cores and
> 4GB RAM, and the mysql service is using the default settings with the
> "innodb_buffer_pool_size" setting to "2G". The number of simulated compute
> nodes are set to "300".
> 
> A few things.
> 
> First, in order to make any predictions or statements about a potential
> implementation's scaling characteristics, you need to run the benchmarks with
> increasing levels of compute nodes. The results you show give us only a single
> dimension of scaling (300 compute nodes). What you want to do is run the
> benchmarks at 100, 200, 400, 800 and 1600 compute node scales. You don't
> need to run *all* of the different permutations of placement/partition/workers
> scenarios, of course. I'd suggest just running the none partition strategy 
> and the
> pack placement strategy at 8 worker processes. Those results will give you 
> (and
> us!) the data points that will indicate the scaling behaviour of the 
> shared-state-
> scheduler implementation proposal as the number of compute nodes in the
> deployment increases. The "none" partitioning strategy represents the reality 
> of
> the existing scheduler implementation, which does not shard the deployment
> into partitions but retrieves all compute nodes for the entire deployment on
> every request to the scheduler's
> select_destinations() method.

Hmm... good suggestion. I don't like running all the benchmarks either; it
makes me wait for a whole day and leaves so much data to evaluate.

300 is the max number I can test in my environment; beyond that the db will
refuse to work because of connection limits, since all those nodes are asking
for connections. Should I emulate "conductors" to limit the db connections,
build up a connection pool, or edit the db configuration? I'm wondering whether
I should write a new tool to run tests in a more realistic environment.

> Secondly, and I'm not sure if you intended this, the code in your
> compute_node.py file in the placement-bench project is not thread-safe.
> In other words, your code assumes that only a single process on each compute
> node could ever actually run the database transaction that inserts allocation
> records at any time.

[a]
So a single thread in each node is already good enough to support the
"shared-state" scheduler in making 1000+ decisions per second. And because
those claims are made distributedly in the nodes, they are actually written to
the db by 300 nodes in parallel. AFAIK, the compute node is single threaded; it
actually uses greenthreads instead of real threads.

> If you want more than a single process on the compute
> node to be able to handle claims of resources, you will need to modify that 
> code
> to use a compare-and-update strategy, checking a "generation" attribute on the
> inventory record to ensure that another process on the compute node hasn't
> simultaneously updated the allocations information for that compute node.

I still don't think the compare-and-update strategy should be forced onto
"compute-local" resources, even if the compute service is changed to use
multiple processes. The scheduler decisions for those "compute-local" resources
can be checked and confirmed against the accurate in-memory view of local
resources in the resource tracker, which is far faster than db operations. And
the subsequent inventory insertion can be concurrent without locks.

The db is only responsible for using the "compare-and-update" strategy to claim
shared resources, persisting the confirmed scheduler decision and its
consumption into the inventories, and then telling the compute service that
it's OK to start the long job of spawning the VM.
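
(For reference, a hedged sketch of the compare-and-update pattern being
discussed; the table and column names are illustrative, not the actual
placement schema:)

    from sqlalchemy import update

    def claim_shared_disk(conn, inventories, rp_id, gen, used_delta):
        # Claim DISK_GB against a shared provider only if nobody else has
        # modified its inventory since we read generation `gen`.
        result = conn.execute(
            update(inventories)
            .where(inventories.c.resource_provider_id == rp_id)
            .where(inventories.c.generation == gen)        # compare ...
            .values(used=inventories.c.used + used_delta,
                    generation=gen + 1))                   # ... and update
        if result.rowcount == 0:
            # Someone raced us: re-read the generation and retry or fail.
            raise RuntimeError('concurrent update, retry the claim')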

> Third, you have your scheduler workers consuming messages off the request
> queue using get_nowait(), while you left the original placement scheduler 
> using
> the blocking get() call. :) Probably best to compare apples to apples and have
> them both using the blocking get() call.

Sorry, I don't agree with this. Consuming messages and getting requests are
two entirely different things. I've trie

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-02 Thread Cheng, Yingxin
On Tuesday, March 1, 2016 7:29 PM, John Garbutt wrote:
> On 1 March 2016 at 08:34, Cheng, Yingxin  wrote:
> > Hi,
> >
> > I have simulated the distributed resource management with the incremental
> update model based on Jay's benchmarking framework:
> https://github.com/cyx1231st/placement-bench/tree/shared-state-
> demonstration. The complete result lies at
> http://paste.openstack.org/show/488677/. It's ran by a VM with 4 cores and
> 4GB RAM, and the mysql service is using the default settings with the
> "innodb_buffer_pool_size" setting to "2G". The number of simulated compute
> nodes are set to "300".
> >
> > [...]
> >
> > Second, here's what I've found in the centralized db claim design(i.e. rows 
> > that
> "do claim in compute?" = No):
> > 1. The speed of legacy python filtering is not slow(see rows that
> > "Filter strategy" = python): "Placement total query time" records the
> > cost of all query time including fetching states from db and filtering
> > using python. The actual cost of python filtering is
> > (Placement_total_query_time - Placement_total_db_query_time), and
> > that's only about 1/4 of total cost or even less. It also means python
> > in-memory filtering is much faster than db filtering in this
> > experiment. See http://paste.openstack.org/show/488710/
> > 2. The speed of `db filter strategy` and the legacy `python filter
> > strategy` are in the same order of magnitude, not a very huge
> > improvement. See the comparison of column "Placement total query
> > time". Note that the extra cost of `python filter strategy` mainly
> > comes from "Placement total db query time"(i.e. fetching states from
> > db). See http://paste.openstack.org/show/488709/
> 
> I think it might be time to run this in a nova-scheduler like
> environment: eventlet threads responding to rabbit, using pymysql backend, 
> etc.
> Note we should get quite a bit of concurrency within a single nova-scheduler
> process with the db approach.
> 
> I suspect clouds that are largely full of pets, pack/fill first, with a 
> smaller
> percentage of cattle on top, will benefit the most, as that initial DB filter 
> will
> bring back a small list of hosts.
> 
> > Third, my major concern of "centralized db claim" design is: Putting too 
> > much
> scheduling works into the centralized db, and it is not scalable by simply 
> adding
> conductors and schedulers.
> > 1. The filtering works are majorly done inside db by executing complex 
> > sqls. If
> the filtering logic is much more complex(currently only CPU and RAM are
> accounted in the experiment), the db overhead will be considerable.
> 
> So, to clarify, only resources we have claims for in the DB will be filtered 
> in the
> DB. All other filters will still occur in python.
> 
> The good news, is that if that turns out to be the wrong trade off, its easy 
> to
> revert back to doing all the filtering in python, with zero impact on the DB
> schema.

Another point is that the db filtering will recalculate every resource to get
its free value from inventories and allocations each time a schedule request
comes in. This overhead is unnecessary if the scheduler can accept incremental
updates to adjust its cache of free resources.
It also means there must be a mechanism based on strict version control of the
scheduler caches to make sure those updates are handled correctly.
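
(A hedged sketch of the kind of versioned incremental update I mean; it is
simplified and illustrative, not code from the prototype:)

    class HostStateCache(object):
        """Apply per-host incremental updates only if they arrive in order."""

        def __init__(self):
            self.free = {}   # host -> {'VCPU': n, 'MEMORY_MB': n, ...}
            self.seen = {}   # host -> version of the last applied update

        def apply_update(self, host, version, deltas):
            expected = self.seen.get(host, 0) + 1
            if version != expected:
                # A gap or duplicate: the cache may be wrong, so ask the
                # compute node for a full refresh instead of guessing.
                return False
            free = self.free.setdefault(host, {})
            for resource, delta in deltas.items():
                free[resource] = free.get(resource, 0) + delta
            self.seen[host] = version
            return True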

> > 2. The racing of centralized claims are resolved by rolling back 
> > transactions
> and by checking the generations(see the design of "compare and update"
> strategy in https://review.openstack.org/#/c/283253/), it also causes 
> additional
> overhead to db.
> 
> Its worth noting this pattern is designed to work well with a Galera DB 
> cluster,
> including one that has writes going to all the nodes.

I know; my point is that "distributed resource management" with resource
trackers doesn't need db locks or db rollbacks for compute-local resources, nor
the additional overhead, regardless of the type of database.
 
> > 3. The db overhead of filtering operation can be relaxed by moving
> > them to schedulers, that will be 38 times faster and can be executed
> > in parallel by schedulers according to the column "Placement avg query
> > time". See http://paste.openstack.org/show/488715/
> > 4. The "compare and update" overhead can be partially relaxed by using
> distributed resource claims in resource trackers. There is no need to roll 
> back
> transa

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-03-01 Thread Cheng, Yingxin
ions in updating inventories of compute-local resources in order to be
accurate. This is confirmed by checking the db records at the end of each run
of the eventually consistent scheduler state design.
5. If a) all the filtering operations are done inside schedulers,
b) schedulers do not need to refresh caches from the db because of
incremental updates, and
c) there is no need to do "compare and update" for compute-local
resources (i.e. non-shared resources),
then here is the performance comparison using 1 scheduler instance:
http://paste.openstack.org/show/488717/


Finally, it is not fair to directly compare the actual ability of the
resource-provider scheduler and the shared-state scheduler using this
benchmarking tool, because 300 more processes would need to be created to
simulate the distributed resource management of 300 compute nodes, and there
are no conductors or MQ in the simulation. But I think it is still useful to
provide at least some statistics.


Regards,
-Yingxin


On Wednesday, February 24, 2016 5:04 PM, Cheng, Yingxin wrote:
> Very sorry for the delay, it feels hard for me to reply to all the concerns 
> raised,
> most of you have years more experiences. I've tried hard to present that 
> there is
> a direction to solve the issues of existing filter scheduler in multiple areas
> including performance, decision accuracy, multiple scheduler support and race
> condition. I'll also support any other solutions if they can solve the same 
> issue
> elegantly.
> 
> 
> @Jay Pipes:
> I feel that scheduler team will agree with the design that can fulfill 
> thousands of
> placement queries with thousands of nodes. But as a scheduler that will be
> splitted out to be used in wider areas, it's not simple to predict the that
> requirement, so I'm not agree with the statement "It doesn't need to be high-
> performance at all". There is no system that can be existed without 
> performance
> bottleneck, including resource-provider scheduler and shared-state scheduler. 
> I
> was trying to point out where is the potential bottleneck in each design and 
> how
> to improve if the worst thing happens, quote:
> "The performance bottleneck of resource-provider and legacy scheduler is from
> the centralized db (REMOVED: and scheduler cache refreshing). It can be
> alleviated by changing to a stand-alone high performance database. And the
> cache refreshing is designed to be replaced by to direct SQL queries 
> according to
> resource-provider scheduler spec. The performance bottleneck of shared-state
> scheduler may come from the overwhelming update messages, it can also be
> alleviated by changing to stand-alone distributed message queue and by using
> the "MessagePipe" to merge messages."
> 
> I'm not saying that there is a bottleneck of resource-provider scheduler in
> fulfilling current design goal. The ability of resource-provider scheduler is
> already proven by a nice modeling tool implemented by Jay, I trust it. But I 
> care
> more about the actual limit of each design and how easily they can be extended
> to increase that limit. That's why I turned to make efforts in scheduler 
> functional
> test framework(https://review.openstack.org/#/c/281825/). I finally want to
> test scheduler functionality using greenthreads in the gate, and test the
> performance and placement of each design using real processes. And I hope the
> scheduler can open to both centralized and distributed design.
> 
> I've updated my understanding of three designs:
> https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnq
> a6oZJWtTumAkw/edit?usp=sharing
> The "cache updates" arrows are changed to "resource updates" in resource-
> provider scheduler, because I think resource updates from virt driver are 
> still
> needed to be updated to the central database. Hope this time it's right.
> 
> 
> @Sylvain Bauza:
> As the first step towards shared-state scheduler, the required changes are 
> kept
> at minimum. There are no db modifications needed, so no rolling upgrade issues
> in data migration. The new scheduler can decide not to make decisions to old
> compute nodes, or try to refresh host states from db and use legacy way to
> make decisions until all the compute nodes are upgraded.
> 
> 
> I have to admit that my prototype still lack efforts to deal with overwhelming
> messages. This design works best using the distributed message queue. Also if
> we try to initiate multiple scheduler processes/workers in a single host, 
> there are
> a lot more to be done to reduce update messages between compute nodes and
> scheduler workers. But I see the potential of distr

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Cheng, Yingxin
Sorry for the late reply.

On 22 February 2016 at 18:45, John Garbutt wrote:
> On 21 February 2016 at 13:51, Cheng, Yingxin  wrote:
> > On 19 February 2016 at 5:58, John Garbutt wrote:
> >> On 17 February 2016 at 17:52, Clint Byrum  wrote:
> >> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Long term, I see a world where there are multiple scheduler Nova is
> >> able to use, depending on the deployment scenario.
> >
> > Technically, what I've implemented is a new type of scheduler host
> > manager `shared_state_manager.SharedHostManager`[1] with the ability
> > to synchronize host states directly from resource trackers.
> 
> Thats fine. You just get to re-use more code.
> 
> Maybe I should say multiple scheduling strategies, or something like that.
> 
> >> So a big question for me is, does the new scheduler interface work if
> >> you look at slotting in your prototype scheduler?
> >>
> >> Specifically I am thinking about this interface:
> >> https://github.com/openstack/nova/blob/master/nova/scheduler/client/_
> >> _init__.py
> 
> I am still curious if this interface is OK for your needs?
> 

The added interfaces on the scheduler side are:
https://review.openstack.org/#/c/280047/2/nova/scheduler/client/__init__.py
1. I can remove "notify_schedulers" because the same message can be sent
through "send_commit" instead.
2. The "send_commit" interface is required because there should be a way to
send state updates from a compute node to a specific scheduler.

The added/changed interfaces on the compute side are:
https://review.openstack.org/#/c/280047/2/nova/compute/rpcapi.py
1. The "report_host_state" interface is required. When a scheduler comes up, it
must ask the compute node for the latest host state. It is also required when
the scheduler detects that its host state is out of sync and should ask the
compute node for a synced state (rare, due to network issues or bugs).
2. The new parameter "claim" should be added to the "build_and_run_instance"
interface because the compute node should reply whether a scheduler claim is
successful. The scheduler can thus track its claims and be updated immediately
by successful claims from other schedulers. The compute node can also decide
whether a scheduler decision was made by the "shared-state" scheduler; that's
the *tricky* part of supporting both types of schedulers. A rough sketch of
these interfaces follows below.
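
(A hedged sketch only; the names and signatures are simplified, and the real
changes live in the reviews linked above:)

    class SchedulerClient(object):
        def send_commit(self, context, commit, scheduler):
            """Push an incremental host-state update to one scheduler."""

    class ComputeAPI(object):
        def report_host_state(self, context, host):
            """Ask a compute node for its full, up-to-date host state."""

        def build_and_run_instance(self, context, instance, host,
                                   claim=None, **kwargs):
            """`claim` carries the scheduler's resource claim so the compute
            node can confirm or reject it and notify the schedulers."""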

> Making this work across both types of scheduler might be tricky, but I think 
> it is
> worthwhile.
> 
> >> > This mostly agrees with recent tests I've been doing simulating
> >> > 1000 compute nodes with the fake virt driver.
> >>
> >> Overall this agrees with what I saw in production before moving us to
> >> the caching scheduler driver.
> >>
> >> I would love a nova functional test that does that test. It will help
> >> us compare these different schedulers and find the strengths and 
> >> weaknesses.
> >
> > I'm also working on implementing the functional tests of nova
> > scheduler, there is a patch showing my latest progress:
> > https://review.openstack.org/#/c/281825/
> >
> > IMO scheduler functional tests are not good at testing real
> > performance of different schedulers, because all of the services are
> > running as green threads instead of real processes. I think the better
> > way to analysis the real performance and the strengths and weaknesses
> > is to start services in different processes with fake virt driver(i.e.
> > Clint Byrum's work) or Jay Pipe's work in emulating different designs.
> 
> Having an option to run multiple process seems OK, if its needed.
> Although starting with a greenlet version that works in the gate seems best.
> 
> Lets try a few things, and see what predicts the results in real environments.

Sure.

> >> I am really interested how your prototype and the caching scheduler 
> >> compare?
> >> It looks like single node scheduler will perform in a very similar
> >> way, but multiple schedulers are less likely to race each other,
> >> although there are quite a few races?
> >
> > I think the major weakness of caching scheduler comes from its host
> > state update model, i.e. updating host states from db every `
> > CONF.scheduler_driver_task_period`
> > seconds.
> 
> The trade off is that consecutive scheduler decisions don't race each other, 
> at all.
> Say you have a burst of 1000 instance builds and you want to avoid build 
> failures
> (but accept sub optimal placement, and you are using fill first), thats

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-24 Thread Cheng, Yingxin
Very sorry for the delay; it feels hard for me to reply to all the concerns
raised, as most of you have years more experience. I've tried hard to show that
there is a direction that solves the issues of the existing filter scheduler in
multiple areas, including performance, decision accuracy, multiple-scheduler
support and race conditions. I'll also support any other solution if it can
solve the same issues elegantly.


@Jay Pipes:
I feel that the scheduler team will agree with a design that can fulfill
thousands of placement queries with thousands of nodes. But for a scheduler
that will be split out to be used in wider areas, it's not simple to predict
that requirement, so I don't agree with the statement "It doesn't need to be
high-performance at all". No system exists without a performance bottleneck,
including the resource-provider scheduler and the shared-state scheduler. I was
trying to point out where the potential bottleneck is in each design and how to
improve it if the worst happens, quote:
"The performance bottleneck of the resource-provider and legacy schedulers is
the centralized db (REMOVED: and scheduler cache refreshing). It can be
alleviated by changing to a stand-alone high-performance database. And the
cache refreshing is designed to be replaced by direct SQL queries according to
the resource-provider scheduler spec. The performance bottleneck of the
shared-state scheduler may come from the overwhelming update messages; it can
also be alleviated by changing to a stand-alone distributed message queue and
by using the "MessagePipe" to merge messages."

I'm not saying that the resource-provider scheduler has a bottleneck in
fulfilling the current design goal. The ability of the resource-provider
scheduler is already proven by a nice modeling tool implemented by Jay, and I
trust it. But I care more about the actual limit of each design and how easily
each can be extended to raise that limit. That's why I turned to making efforts
on a scheduler functional test framework
(https://review.openstack.org/#/c/281825/). I ultimately want to test scheduler
functionality using greenthreads in the gate, and test the performance and
placement of each design using real processes. And I hope the scheduler can be
open to both centralized and distributed designs.

I've updated my understanding of the three designs:
https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
The "cache updates" arrows are changed to "resource updates" in the
resource-provider scheduler, because I think resource updates from the virt
driver still need to be written to the central database. Hope it's right this
time.


@Sylvain Bauza:
As the first step towards the shared-state scheduler, the required changes are
kept to a minimum. No db modifications are needed, so there are no rolling
upgrade issues in data migration. The new scheduler can decide not to make
decisions for old compute nodes, or it can refresh host states from the db and
use the legacy way to make decisions until all the compute nodes are upgraded.


I have to admit that my prototype still lacks work on dealing with overwhelming
messages. This design works best with a distributed message queue. Also, if we
try to start multiple scheduler processes/workers on a single host, there is a
lot more to be done to reduce the update messages between compute nodes and
scheduler workers. But I see the potential of distributed resource
management/scheduling and would like to make efforts in this direction.

If we agree that the final decision accuracy is guaranteed in both directions,
we should care more about the final decision throughput of both designs.
Theoretically the distributed design is better because the final consumptions
are made distributedly, but there are difficulties in reaching that limit. The
centralized design, however, finds it easier to approach its theoretical
performance because of the lightweight implementation inside the scheduler and
the powerful underlying database.



Regards,
-Yingxin


> -Original Message-
> From: Jay Pipes [mailto:jaypi...@gmail.com]
> Sent: Wednesday, February 24, 2016 8:11 AM
> To: Sylvain Bauza ; OpenStack Development Mailing List
> (not for usage questions) ; Cheng, Yingxin
> 
> Subject: Re: [openstack-dev] [nova] A prototype implementation towards the
> "shared state scheduler"
> 
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> > I won't argue against performance here. You made a very nice PoC for
> > testing scaling DB writes within a single python process and I trust
> > your findings. While I would be naturally preferring some
> > shared-nothing approach that can horizontally scale, one could mention
> > that we can do the same with Galera clusters.
> 
> a) My benchmarks aren't single process comparison

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-21 Thread Cheng, Yingxin
On 19 February 2016 at 5:58, John Garbutt wrote:
> On 17 February 2016 at 17:52, Clint Byrum  wrote:
> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Hi,
> >>
> >> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
> >> testify its design goals in accuracy, performance, reliability and
> >> compatibility improvements. It will also be an Austin Summit Session
> >> if elected:
> >> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presen
> >> tation/7316
> 
> Long term, I see a world where there are multiple scheduler Nova is able to 
> use,
> depending on the deployment scenario.
> 
> We have tried to stop any more scheduler going in tree (like the solver 
> scheduler)
> while we get the interface between the nova-scheduler and the rest of Nova
> straightened out, to make that much easier.

Technically, what I've implemented is a new type of scheduler host manager,
`shared_state_manager.SharedHostManager` [1], with the ability to synchronize
host states directly from resource trackers. The filter scheduler driver can
choose to load this manager from stevedore [2], and thus get a different update
model for its internal caches. This new manager is highly compatible with the
current scheduler architecture, because a filter scheduler with HostManager can
even run alongside schedulers loaded with SharedHostManager at the same time
(tested).
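
For anyone unfamiliar with stevedore, the loading step can be sketched roughly
as below. This is only an illustration: the entry-point namespace
"nova.scheduler.host_manager" and the entry-point names are assumptions of
mine, the real ones are defined in the patch's setup.cfg [2].

# Minimal sketch: loading a pluggable host manager through stevedore.
# The namespace and entry-point names are illustrative assumptions only.
from stevedore import driver

def load_host_manager(name="shared_host_manager"):
    # DriverManager resolves the named entry point and instantiates it.
    mgr = driver.DriverManager(
        namespace="nova.scheduler.host_manager",
        name=name,
        invoke_on_load=True,
    )
    return mgr.driver

# A filter scheduler could then use either manager interchangeably:
# host_manager = load_host_manager("host_manager")         # legacy behavior
# host_manager = load_host_manager("shared_host_manager")  # prototype behavior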

So why not have this in tree to give operators more options in choosing host
managers? I am also of the opinion that the caching scheduler is not exactly a
new kind of scheduler driver; it only has a different behavior in updating host
states, and it should be implemented as a new kind of host manager instead.

What concerns me is that the resource-provider scheduler is going to change the
architecture of the filter scheduler in Jay Pipes' bp [3]. There will be no
host manager, and even no host state caches, in the future. So what I've done
to keep compatibility will become an incompatibility in the future.

[1] https://review.openstack.org/#/c/280047/2/nova/scheduler/shared_state_manager.py L55
[2] https://review.openstack.org/#/c/280047/2/setup.cfg L194
[3] https://review.openstack.org/#/c/271823 

> 
> So a big question for me is, does the new scheduler interface work if you 
> look at
> slotting in your prototype scheduler?
> 
> Specifically I am thinking about this interface:
> https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init_
> _.py


> There are several problems in this update model, proven in experiments[3]:
> >> 1. Performance: The scheduler performance is largely affected by db access
> in retrieving compute node records. The db block time of a single request is
> 355ms in average in the deployment of 3 compute nodes, compared with only
> 3ms in in-memory decision-making. Imagine there could be at most 1k nodes,
> even 10k nodes in the future.
> >> 2. Race conditions: This is not only a parallel-scheduler problem,
> >> but also a problem using only one scheduler. The detailed analysis of one-
> scheduler-problem is located in bug analysis[2]. In short, there is a gap 
> between
> the scheduler makes a decision in host state cache and the compute node
> updates its in-db resource record according to that decision in resource 
> tracker.
> A recent scheduler resource consumption in cache can be lost and overwritten
> by compute node data because of it, result in cache inconsistency and
> unexpected retries. In a one-scheduler experiment using 3-node deployment,
> there are 7 retries out of 31 concurrent schedule requests recorded, results 
> in
> 22.6% extra performance overhead.
> >> 3. Parallel scheduler support: The design of filter scheduler leads to an 
> >> "even
> worse" performance result using parallel schedulers. In the same experiment
> with 4 schedulers on separate machines, the average db block time is increased
> to 697ms per request and there are 16 retries out of 31 schedule requests,
> namely 51.6% extra overhead.
> >
> > This mostly agrees with recent tests I've been doing simulating 1000
> > compute nodes with the fake virt driver.
> 
> Overall this agrees with what I saw in production before moving us to the
> caching scheduler driver.
> 
> I would love a nova functional test that does that test. It will help us 
> compare
> these different schedulers and find the strengths and weaknesses.

I'm also working on implementing functional tests for the nova scheduler; there
is a patch showing my latest progress: https://review.openstack.org/#/c/281825/ 

IMO scheduler functional tests are not good at testing the real performance of
different schedulers, because all of the services are running as green threads
instead of real processes. I think the better way to analyze the real
performance and the strengths and weaknesses is to start services in different
processes with the fake virt driver (i.e. Clint Byrum's work) or to use Jay
Pipes' work in emulating different designs.

> >> 2. Since the scheduler claims 

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin
> -Original Message-
> From: Clint Byrum [mailto:cl...@fewbar.com]
> Sent: Thursday, February 18, 2016 1:53 AM
> To: openstack-dev 
> Subject: Re: [openstack-dev] [nova] A prototype implementation towards the
> "shared state scheduler"
> 
> Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> > Hi,
> >
> > I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
> > testify its design goals in accuracy, performance, reliability and
> > compatibility improvements. It will also be an Austin Summit Session
> > if elected:
> > https://www.openstack.org/summit/austin-2016/vote-for-speakers/Present
> > ation/7316
> >
> > I want to gather opinions about this idea:
> > 1. Is this feature possible to be accepted in the Newton release?
> > 2. Suggestions to improve its design and compatibility.
> > 3. Possibilities to integrate with resource-provider bp series: I know 
> > resource-
> provider is the major direction of Nova scheduler, and there will be 
> fundamental
> changes in the future, especially according to the bp
> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-
> providers-scheduler.rst. However, this prototype proposes a much faster and
> compatible way to make schedule decisions based on scheduler caches. The in-
> memory decisions are made at the same speed with the caching scheduler, but
> the caches are kept consistent with compute nodes as quickly as possible
> without db refreshing.
> >
> > Here is the detailed design of the mentioned prototype:
> >
> > >>
> > Background:
> > The host state cache maintained by host manager is the scheduler resource
> view during schedule decision making. It is updated whenever a request is
> received[1], and all the compute node records are retrieved from db every 
> time.
> There are several problems in this update model, proven in experiments[3]:
> > 1. Performance: The scheduler performance is largely affected by db access 
> > in
> retrieving compute node records. The db block time of a single request is 
> 355ms
> in average in the deployment of 3 compute nodes, compared with only 3ms in
> in-memory decision-making. Imagine there could be at most 1k nodes, even 10k
> nodes in the future.
> > 2. Race conditions: This is not only a parallel-scheduler problem, but
> > also a problem using only one scheduler. The detailed analysis of one-
> scheduler-problem is located in bug analysis[2]. In short, there is a gap 
> between
> the scheduler makes a decision in host state cache and the compute node
> updates its in-db resource record according to that decision in resource 
> tracker.
> A recent scheduler resource consumption in cache can be lost and overwritten
> by compute node data because of it, result in cache inconsistency and
> unexpected retries. In a one-scheduler experiment using 3-node deployment,
> there are 7 retries out of 31 concurrent schedule requests recorded, results 
> in
> 22.6% extra performance overhead.
> > 3. Parallel scheduler support: The design of filter scheduler leads to an 
> > "even
> worse" performance result using parallel schedulers. In the same experiment
> with 4 schedulers on separate machines, the average db block time is increased
> to 697ms per request and there are 16 retries out of 31 schedule requests,
> namely 51.6% extra overhead.
> 
> 
> This mostly agrees with recent tests I've been doing simulating 1000 compute
> nodes with the fake virt driver. My retry rate is much lower, because there's 
> less
> window for race conditions since there is no latency for the time between 
> nova-
> compute getting the message that the VM is scheduled to it, and responding
> with a host update. Note that your database latency numbers seem much higher,
> we see about 200ms, and I wonder if you are running in a very resource
> constrained database instance.

Yes, I only have 4 cores and 16GB RAM on my desktop, and I booted up 4 VMs for
developing, testing and debugging this prototype. Moreover, those schedulers
are deployed on separate hosts, so there is more latency in my environment.

You have a great test environment for schedulers; you must have spent a lot of
effort managing 1000 compute nodes, collecting logs and automating the
analysis.

> 
> >
> > Improvements:
> > This prototype solved the mentioned issues above by implementing a new
> update model to scheduler host state cache. Instead of refreshing caches from
> db, every compute node maintains its accurate version of host state cache
> updated by the resource tracker, and sends incremental updates directly to
> schedulers. So the scheduler cache are synchronized to the correct state as 
> soon
> as possible with the lowest overhead. Also, scheduler will send resource claim
> with its decision to the target compute node. The compute node can decide
> whether the resource claim is successful immediately by its local host state
> cache and send responds back ASAP. With all the c

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin

On Wed, 17 February 2016, Sylvain Bauza wrote

(sorry, quoting off-context, but I feel it's a side point, not the main 
discussion)
On 17/02/2016 16:40, Cheng, Yingxin wrote:
IMHO, the authority to allocate resources is not limited to compute nodes; it
also includes the network service, storage service and all other services which
have the authority to manage their own resources. Those "shared" resources come
from external services (i.e. systems) which are not the compute service. They
all have the responsibility to push their own resource updates to schedulers,
and to make resource reservations and consumptions. The resource-provider
series provides a flexible representation of all kinds of resources, so that
the scheduler can handle them without specific knowledge of each resource.

No, IMHO, the authority has to stay with the entity which physically creates
the instance and owns its lifecycle. What the user wants when booting is an
instance, not something else. He can express some SLA by providing more
context, implicitly (through aggregates or flavors) or explicitly (through
hints or AZs), that could be not compute-related (say a network segment
locality or a volume-related thing), but in the end it will create an instance
on a compute node that matches the requirements.

Cinder and Neutron shouldn't manage which instances are on which hosts, they 
just have to provide the resource types and possible allocations (like a taken 
port)

-Sylvain

Yes, on second thought, the Cinder project also has its own scheduler, so it is
not the responsibility of the nova-scheduler to schedule all pieces of
resources. The nova-scheduler is responsible for booting instances; its scope
is limited to compute services.
-Yingxin


Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-17 Thread Cheng, Yingxin
On Wed, 17 February 2016, Sylvain Bauza wrote
On 17/02/2016 12:59, Chris Dent wrote:
On Wed, 17 Feb 2016, Cheng, Yingxin wrote:


To better illustrate the differences between shared-state, resource-
provider and legacy scheduler, I've drew 3 simplified pictures [1] in
emphasizing the location of resource view, the location of claim and
resource consumption, and the resource update/refresh pattern in three
kinds of schedulers. Hoping I'm correct in the "resource-provider
scheduler" part.

That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.

So, to be clear, I'm really happy to see the resource-providers series, for
many reasons:
 - it will help us get a nice facade for getting the resources and attributing
them
 - it will help a shared-storage deployment by making sure that we don't have
resource problems when the resource is shared
 - it will create a possibility for external resource providers to provide some
resource types to Nova so the Nova scheduler could use them (like Neutron
related resources)

That, I really want to have implemented in Mitaka and Newton, and I'm totally
on-board and supporting it.

TBC, the only problem I see with the series is [2], not the whole, please.

@cdent:
As far as I know, some resources are defined as "shared" simply because they
are not resources of the compute node service. In other words, the compute node
resource tracker does not have authority over those "shared" resources. For
example, the "shared" storage resources are actually managed by the storage
service, and the "shared" network resource "IP pool" is actually owned by the
network service. If all the resources labeled "shared" are labeled so only
because they are not owned by compute node services, the
shared-resource-tracking/consumption problem can be solved by implementing
resource trackers in all the authorized services. Those resource trackers would
constantly provide incremental updates to schedulers, and would have the
responsibility to reserve and consume resources independently and in a
distributed way, no matter where they come from: compute service, storage
service or network service, etc.
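
To make that concrete, here is a hypothetical sketch (not the prototype's code)
of an inventory tracker that any authorized service -- compute, storage or
network -- could run to own its resources and push versioned incremental
updates. The callback and field names are made up for illustration.

# Hypothetical sketch of a per-service inventory tracker that owns its
# resources and pushes versioned incremental updates to schedulers.
import threading

class InventoryTracker(object):
    def __init__(self, provider_id, inventory, notify_schedulers):
        self.provider_id = provider_id      # e.g. a compute node or an IP pool
        self.inventory = dict(inventory)    # e.g. {"vcpu": 16, "ram_mb": 32768}
        self.notify_schedulers = notify_schedulers  # fan-out callback, e.g. RPC
        self.seq = 0                        # monotonic version of this inventory
        self._lock = threading.Lock()

    def consume(self, deltas):
        """Apply a local consumption and broadcast only the delta."""
        with self._lock:
            if any(self.inventory[r] < amount for r, amount in deltas.items()):
                return False                # not enough left; the caller retries
            for resource, amount in deltas.items():
                self.inventory[resource] -= amount
            self.seq += 1
            # Schedulers apply deltas in order; a gap in seq means a full
            # re-sync of this provider is needed.
            self.notify_schedulers({
                "provider": self.provider_id,
                "seq": self.seq,
                "deltas": {r: -a for r, a in deltas.items()},
            })
            return True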

As can be seen in the illustrations [1], the main compatibility issue
between shared-state and resource-provider scheduler is caused by the
different location of claim/consumption and the assumed consistent
resource view. IMO unless the claims are allowed to happen in both
places(resource tracker and resource-provider db), it seems difficult
to make shared-state and resource-provider scheduler work together.

Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.

So, IMHO, we should only have the compute nodes being the authority for
allocating resources. There are many reasons for that which I provided in the
spec review, but I can reply again:
#1 If we consider that an external system, as a resource provider, will provide
a single resource class usage (like network segment availability), it will
still require the instance to be spawned *for* consuming that resource class,
even if the scheduler accounts for it. That would mean that the scheduler would
have to manage a list of allocations with TTL, and periodically verify that the
allocation succeeded by asking the external system (or getting feedback from
the external system). See, that's racy.
#2 The scheduler is just a decision maker; in any case it doesn't account for
the real instance creation (it doesn't hold the ownership of the instance).
Having it be accountable for the instances' usage is heavily difficult. Take
for example a request for CPU pinning or NUMA affinity. The user can't really
express which pin of the pCPU he will get, that's

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-16 Thread Cheng, Yingxin
To better illustrate the differences between the shared-state, resource-provider
and legacy schedulers, I've drawn 3 simplified pictures [1] emphasizing the
location of the resource view, the location of claim and resource consumption,
and the resource update/refresh pattern in the three kinds of schedulers. I
hope I'm correct in the "resource-provider scheduler" part.


A point of view from my analysis comparing the three schedulers (before real
experiments):
1. Performance: The performance bottleneck of the resource-provider and legacy
schedulers comes from the centralized db and scheduler cache refreshing. It can
be alleviated by switching to a stand-alone high-performance database, and the
cache refreshing is designed to be replaced by direct SQL queries according to
the resource-provider scheduler spec [2]. The performance bottleneck of the
shared-state scheduler may come from the overwhelming update messages; it can
also be alleviated by switching to a stand-alone distributed message queue and
by using the "MessagePipe" to merge messages (see the sketch after this list).
2. Final decision accuracy: I think the accuracy of the final decision is high
in all three schedulers, because until now the consistent resource view and the
final resource consumption with claims are all in the same place: the resource
trackers in the shared-state and legacy schedulers, and the resource-provider
db in the resource-provider scheduler.
3. Scheduler decision accuracy: IMO the order of accuracy of a single schedule
decision is resource-provider > shared-state >> legacy scheduler. The
resource-provider scheduler can get the accurate resource view directly from
the db. The shared-state scheduler gets the most accurate resource view by
constantly collecting updates from resource trackers and by tracking the
scheduler claims from schedulers to RTs. The legacy scheduler's decision is the
worst because it doesn't track its claims and gets its resource view from
compute node records, which are not that accurate.
4. Design goal difference:
The fundamental design goals of the two new schedulers are different. To copy
my views from [2], I think it is the choice between "the loose distributed
consistency with retries" and "the strict centralized consistency with locks".
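
The merging idea can be sketched roughly as below. This is illustrative only:
the class name mirrors the prototype's "MessagePipe", but the real flush policy
and payload format may differ. Pending deltas are coalesced per compute node so
that a burst of updates becomes a single message.

# Rough sketch of merging update messages before sending them to a scheduler.
# Illustrative only; not the prototype's actual MessagePipe implementation.
import collections
import threading

class MessagePipe(object):
    def __init__(self, send):
        self.send = send                    # e.g. an RPC cast to the scheduler
        self.pending = collections.defaultdict(collections.Counter)
        self._lock = threading.Lock()

    def push(self, node, deltas):
        """Coalesce deltas per compute node instead of sending each one."""
        with self._lock:
            self.pending[node].update(deltas)

    def flush(self):
        """Called periodically, e.g. by a timer or a greenthread."""
        with self._lock:
            batch, self.pending = (
                self.pending, collections.defaultdict(collections.Counter))
        if batch:
            self.send({node: dict(deltas) for node, deltas in batch.items()})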


As can be seen in the illustrations [1], the main compatibility issue between
the shared-state and resource-provider schedulers is caused by the different
locations of claim/consumption and the assumed consistent resource view. IMO,
unless the claims are allowed to happen in both places (the resource tracker
and the resource-provider db), it seems difficult to make the shared-state and
resource-provider schedulers work together.


[1] https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
[2] https://review.openstack.org/#/c/271823/


Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 9:48 PM
To: OpenStack Development Mailing List (not for usage questions) 

Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 10:48, Cheng, Yingxin wrote:
Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.

Nice, looking forward to it then :-)


2. Thanks for providing concerns I've not thought it yet, they will be in the 
spec soon.

3. Let me copy my thoughts from another thread about the integration with 
resource-provider:
The idea is about "Only compute node knows its own final compute-node resource 
view" or "The accurate resource view only exists at the place where it is 
actually consumed." I.e., The incremental updates can only come from the actual 
"consumption" action, no matter where it is(e.g. compute node, storage service, 
network service, etc.). Borrow the terms from resource-provider, compute nodes 
can maintain its accurate version of "compute-node-inventory" cache, and can 
send incremental updates because it actually consumes compute resources, 
furthermore, storage service can also maintain an accurate version of 
"storage-inventory" cache and send incremental updates if it also consumes 
storage resources. If there are central services in charge of consuming all the 
resources, the accurate cache and updates must come from them.


That is one of the things I'd like to see in your spec, and how you could 
interact with the new model.
Thanks,
-Sylvain




Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 5:28 PM
To: OpenStack Development Mailing List (not for usage questions) 
<mailto:openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,

I've uploaded a prototype https://review.openstack.org/#

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-15 Thread Cheng, Yingxin
Thanks Sylvain,

1. The below ideas will be extended to a spec ASAP.

2. Thanks for raising concerns I haven't thought of yet; they will be in the
spec soon.

3. Let me copy my thoughts from another thread about the integration with
resource-provider:
The idea is about "Only the compute node knows its own final compute-node
resource view" or "The accurate resource view only exists at the place where it
is actually consumed." I.e., the incremental updates can only come from the
actual "consumption" action, no matter where it happens (e.g. compute node,
storage service, network service, etc.). Borrowing the terms from
resource-provider, compute nodes can maintain an accurate version of the
"compute-node-inventory" cache and can send incremental updates because they
actually consume compute resources; furthermore, the storage service can also
maintain an accurate version of a "storage-inventory" cache and send
incremental updates if it also consumes storage resources. If there are central
services in charge of consuming all the resources, the accurate cache and
updates must come from them.


Regards,
-Yingxin

From: Sylvain Bauza [mailto:sba...@redhat.com]
Sent: Monday, February 15, 2016 5:28 PM
To: OpenStack Development Mailing List (not for usage questions) 

Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"


On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,

I've uploaded a prototype https://review.openstack.org/#/c/280047/ to testify 
its design goals in accuracy, performance, reliability and compatibility 
improvements. It will also be an Austin Summit Session if elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is this feature possible to be accepted in the Newton release?

Such feature requires a spec file to be written 
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged

Ideally, I'd like to see your below ideas written in that spec file so it would 
be the best way to discuss on the design.



2. Suggestions to improve its design and compatibility.

I don't want to go into details here (that's rather the goal of the spec for 
that), but my biggest concerns would be when reviewing the spec :
 - how this can meet the OpenStack mission statement (ie. ubiquitous solution 
that would be easy to install and massively scalable)
 - how this can be integrated with the existing (filters, weighers) to provide 
a clean and simple path for operators to upgrade
 - how this can be supporting rolling upgrades (old computes sending updates to 
new scheduler)
 - how can we test it
 - can we have the feature optional for operators



3. Possibilities to integrate with resource-provider bp series: I know 
resource-provider is the major direction of Nova scheduler, and there will be 
fundamental changes in the future, especially according to the bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
 However, this prototype proposes a much faster and compatible way to make 
schedule decisions based on scheduler caches. The in-memory decisions are made 
at the same speed with the caching scheduler, but the caches are kept 
consistent with compute nodes as quickly as possible without db refreshing.


That's the key point, thanks for noticing our priorities. So, you know that our 
resource modeling is drastically subject to change in Mitaka and Newton. That 
is the new game, so I'd love to see how you plan to interact with that.
Ideally, I'd appreciate if Jay Pipes, Chris Dent and you could share your ideas 
because all of you are having great ideas to improve a current frustrating 
solution.

-Sylvain



Here is the detailed design of the mentioned prototype:

>>
Background:
The host state cache maintained by host manager is the scheduler resource view 
during schedule decision making. It is updated whenever a request is 
received[1], and all the compute node records are retrieved from db every time. 
There are several problems in this update model, proven in experiments[3]:
1. Performance: The scheduler performance is largely affected by db access in 
retrieving compute node records. The db block time of a single request is 355ms 
in average in the deployment of 3 compute nodes, compared with only 3ms in 
in-memory decision-making. Imagine there could be at most 1k nodes, even 10k 
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a 
problem using only one scheduler. The detailed analysis of 
one-scheduler-problem is located in bug analysis[2]. In short, there is a gap 
between the scheduler makes a decision in host state cache and the
compute node updates its in-db resource record according to that decision in 
resource tracker. A rece

Re: [openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-14 Thread Cheng, Yingxin
Thanks Boris, the idea is quite similar in “do not have db accesses during
scheduler decision making”, because db accesses introduce blocking during
decision making, and this is very bad for the lock-free design of the nova
scheduler.
Another important idea is that “Only the compute node knows its own final
compute-node resource view” or “The accurate resource view only exists at the
place where it is actually consumed.” I.e., the incremental updates can only
come from the actual “consumption” action, no matter where it happens (e.g.
compute node, storage service, network service, etc.). Borrowing the terms from
resource-provider, compute nodes can maintain an accurate version of the
“compute-node-inventory” cache and can send incremental updates because they
actually consume compute resources; furthermore, the storage service can also
maintain an accurate version of a “storage-inventory” cache and send
incremental updates if it also consumes storage resources. If there are central
services in charge of consuming all the resources, the accurate cache and
updates must come from them.

The third idea is “compatibility”. This prototype focuses on a very small scope
by only introducing a new host_manager driver “shared_host_manager” with minor
other changes. The driver can be changed back to “host_manager” very easily. It
can also run alongside filter schedulers and caching schedulers. Most
importantly, the filtering and weighing algorithms are kept unchanged, so more
changes can be introduced later for the complete version of the “shared state
scheduler”, because it is evolving in a gradual way.


Regards,
-Yingxin

From: Boris Pavlovic [mailto:bo...@pavlovic.me]
Sent: Monday, February 15, 2016 1:59 PM
To: OpenStack Development Mailing List (not for usage questions) 

Subject: Re: [openstack-dev] [nova] A prototype implementation towards the 
"shared state scheduler"

Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler

It's really nice that somebody is still trying to push scheduler refactoring in 
this way.
Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin 
mailto:yingxin.ch...@intel.com>> wrote:
Hi,

I’ve uploaded a prototype https://review.openstack.org/#/c/280047/ to testify 
its design goals in accuracy, performance, reliability and compatibility 
improvements. It will also be an Austin Summit Session if elected: 
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is this feature possible to be accepted in the Newton release?
2. Suggestions to improve its design and compatibility.
3. Possibilities to integrate with resource-provider bp series: I know 
resource-provider is the major direction of Nova scheduler, and there will be 
fundamental changes in the future, especially according to the bp 
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
 However, this prototype proposes a much faster and compatible way to make 
schedule decisions based on scheduler caches. The in-memory decisions are made 
at the same speed with the caching scheduler, but the caches are kept 
consistent with compute nodes as quickly as possible without db refreshing.

Here is the detailed design of the mentioned prototype:

>>
Background:
The host state cache maintained by host manager is the scheduler resource view 
during schedule decision making. It is updated whenever a request is 
received[1], and all the compute node records are retrieved from db every time. 
There are several problems in this update model, proven in experiments[3]:
1. Performance: The scheduler performance is largely affected by db access in 
retrieving compute node records. The db block time of a single request is 355ms 
in average in the deployment of 3 compute nodes, compared with only 3ms in 
in-memory decision-making. Imagine there could be at most 1k nodes, even 10k 
nodes in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a 
problem using only one scheduler. The detailed analysis of 
one-scheduler-problem is located in bug analysis[2]. In short, there is a gap 
between the scheduler makes a decision in host state cache and the compute node 
updates its in-db resource record according to that decision in resource 
tracker. A recent scheduler resource consumption in cache can be lost and 
overwritten by compute node data because of it, result in cache inconsistency 
and unexpected retries. In a one-scheduler experiment using 3-node deployment, 
there are 7 retries out of 31 concurrent schedule requests recorded, results in 
22.6% extra performance overhead.
3. Parallel scheduler support: The design of filter scheduler leads to an "even 
worse" performance result using parallel schedulers. In the same experiment 
with 4 schedulers on separate machines, the a

[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

2016-02-14 Thread Cheng, Yingxin
Hi,

I've uploaded a prototype https://review.openstack.org/#/c/280047/ to verify
its design goals in accuracy, performance, reliability and compatibility
improvements. It will also be an Austin Summit session if elected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

I want to gather opinions about this idea:
1. Is it possible for this feature to be accepted in the Newton release?
2. Suggestions to improve its design and compatibility.
3. Possibilities to integrate with the resource-provider bp series: I know
resource-provider is the major direction of the Nova scheduler, and there will
be fundamental changes in the future, especially according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way to make
schedule decisions based on scheduler caches. The in-memory decisions are made
at the same speed as in the caching scheduler, but the caches are kept
consistent with compute nodes as quickly as possible without db refreshing.

Here is the detailed design of the mentioned prototype:

>>
Background:
The host state cache maintained by the host manager is the scheduler's resource
view during schedule decision making. It is updated whenever a request is
received [1], and all the compute node records are retrieved from the db every
time. There are several problems in this update model, proven in experiments [3]:
1. Performance: The scheduler performance is largely affected by db access in
retrieving compute node records. The db block time of a single request is 355ms
on average in a deployment of 3 compute nodes, compared with only 3ms for
in-memory decision-making. Imagine there could be at most 1k nodes, even 10k
nodes, in the future.
2. Race conditions: This is not only a parallel-scheduler problem, but also a
problem when using only one scheduler. The detailed analysis of the
one-scheduler problem is located in the bug analysis [2]. In short, there is a
gap between when the scheduler makes a decision in the host state cache and
when the compute node updates its in-db resource record according to that
decision in the resource tracker. A recent scheduler resource consumption in
the cache can be lost and overwritten by compute node data because of this gap,
resulting in cache inconsistency and unexpected retries (a tiny simulation of
this "false overwrite" is sketched after this list). In a one-scheduler
experiment using a 3-node deployment, there were 7 retries out of 31 concurrent
schedule requests recorded, resulting in 22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler leads to an
"even worse" performance result when using parallel schedulers. In the same
experiment with 4 schedulers on separate machines, the average db block time
increased to 697ms per request and there were 16 retries out of 31 schedule
requests, namely 51.6% extra overhead.
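
To illustrate problem 2, here is a tiny, self-contained simulation of the
"false overwrite" (not nova code, all names are made up): the scheduler
consumes resources in its cache, but a later cache refresh from a stale
compute-node record silently discards that consumption.

# Tiny simulation of the "false overwrite" described in problem 2 above.
# All names are illustrative; this is not nova code.

db_record = {"free_ram_mb": 2048}   # compute node has not reported the claim yet
host_state = dict(db_record)        # scheduler's in-memory cache of that host

# 1) The scheduler places a 1024 MB instance and consumes it in its cache.
host_state["free_ram_mb"] -= 1024   # cache now says 1024 MB free

# 2) Before the compute node's resource tracker writes the claim to the db,
#    another request triggers a cache refresh from the (stale) db record.
host_state = dict(db_record)        # the consumption is lost: 2048 MB again

# 3) The scheduler may now place another 2048 MB instance on the same host;
#    that claim fails on the compute node and the request is retried.
assert host_state["free_ram_mb"] == 2048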

Improvements:
This prototype solves the issues mentioned above by implementing a new update
model for the scheduler host state cache. Instead of refreshing caches from the
db, every compute node maintains its own accurate version of the host state
cache, updated by its resource tracker, and sends incremental updates directly
to schedulers. So the scheduler caches are synchronized to the correct state as
soon as possible with the lowest overhead. Also, the scheduler will send a
resource claim with its decision to the target compute node. The compute node
can decide whether the resource claim is successful immediately from its local
host state cache and send a response back ASAP (a distilled sketch of this
claim handling follows the list below). With all the claims tracked from
schedulers to compute nodes, no false overwrites will happen, and thus the gaps
between the scheduler caches and the real compute node states are minimized.
The benefits are obvious from the recorded experiments [3] compared with the
caching scheduler and the filter scheduler:
1. There is no db block time during scheduler decision making; the average
decision time per request is about 3ms in both the single- and
multiple-scheduler scenarios, which is equal to the in-memory decision time of
the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false overwrite" is
eliminated, there should be 0 retries in a one-scheduler deployment, as proven
in the experiment. Thanks to the quick claim-responding implementation, there
are only 2 retries out of 31 requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible because the data
structure of HostState is unchanged. In fact, this prototype even supports the
filter scheduler running at the same time (already tested). Other operations
with resource changes, such as migration, resizing or shelving, make claims in
the resource tracker directly and update the compute node host state
immediately without major changes.
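
The claim handling on the compute side can be distilled into something like the
following sketch. It is illustrative only, not the prototype's implementation
(that lives in the review linked above): the compute node checks the claim
against its local cache, answers immediately, and only successful claims mutate
the local state.

# Distilled sketch of immediate claim confirmation on a compute node.
# Illustrative only; not the prototype's actual code.
import threading

class LocalHostState(object):
    def __init__(self, free_vcpus, free_ram_mb):
        self.free_vcpus = free_vcpus
        self.free_ram_mb = free_ram_mb
        self._lock = threading.Lock()

    def try_claim(self, vcpus, ram_mb):
        """Return True and consume locally if the claim fits, else False."""
        with self._lock:
            if vcpus > self.free_vcpus or ram_mb > self.free_ram_mb:
                return False        # scheduler gets a fast NACK and retries
            self.free_vcpus -= vcpus
            self.free_ram_mb -= ram_mb
            return True             # ACK; an incremental update follows

# The scheduler sends (decision, claim) to the chosen node and waits briefly:
# if try_claim() returns False, it removes the host from its own cached view
# and picks the next best host instead of going back to the database.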

Extra features:
More efforts are made to better adjust the implementation to real-world 
scenarios, such as network issues, service unexpectedly down and over

Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2016-01-27 Thread Cheng, Yingxin
Thank you Nikola! I'm very interested in this.


According to my current understanding, a complete functional test for the nova
scheduler should include nova-api, the scheduler service, the part of the
conductor service which forwards scheduler decisions to compute services, and
the part of the compute service that covers claims, claim aborts and compute
node resource consumption inside the resource tracker.

The inputs of this series of tests are the initial resource view, the existing
resource consumptions from fake instances, and the incoming schedule requests
with flavors.

The outputs are the statistics of elapsed time in every schedule phase, the
statistics of request lifecycles, and a sanity check of the final resource view
with the booted fake instances.
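
To make the "outputs" part concrete, a per-phase timing helper could look like
the following sketch. The helper names are hypothetical and this is not nova's
test code; the commented calls only show where such timers might wrap the
schedule phases.

# Hypothetical helper for collecting per-phase elapsed-time statistics in a
# scheduler functional test; not part of nova's test framework.
import collections
import contextlib
import statistics
import time

phase_times = collections.defaultdict(list)

@contextlib.contextmanager
def timed(phase):
    start = time.time()
    try:
        yield
    finally:
        phase_times[phase].append(time.time() - start)

# Inside the test, each schedule request would be wrapped per phase, e.g.:
#   with timed("cache_refresh"):
#       host_states = host_manager.get_all_host_states(ctxt)
#   with timed("filter_weigh"):
#       hosts = scheduler_driver.select_destinations(ctxt, request_spec)
# and the report at the end becomes part of the test output:
def report():
    for phase, samples in sorted(phase_times.items()):
        print("%s: mean=%.3fs max=%.3fs n=%d" % (
            phase, statistics.mean(samples), max(samples), len(samples)))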

Extra features should also be taken into consideration, including, but not
limited to, image properties, host aggregates, availability zones, compute
capabilities, server groups, compute service status, forced hosts, metrics, etc.

Please correct me if anything is wrong; I also want to know the existing
decisions/ideas from the mid-cycle sprint.


I'll start by investigating the existing functional test infrastructure; this
could be much quicker if anyone (maybe Sean Dague) can provide help with an
introduction to the existing features. I've also seen others showing interest
in this area -- Chris Dent (cdent). It would be great to work with other
experienced contributors in the community.



Regards,
-Yingxin


> -Original Message-
> From: Nikola Đipanov [mailto:ndipa...@redhat.com]
> Sent: Wednesday, January 27, 2016 9:58 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Cc: Cheng, Yingxin
> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
> conditions)?
> 
> Top posting since better scheduler testing just got brought up during the
> midcycle meetup, so it might be useful to re-kindle this thread.
> 
> Sean (Dague) brought up that there is some infrastructure already that could
> help us do what you propose bellow, but work may be needed to make it viable
> for proper reasource accounting tests.
> 
> Yingxin - in case you are still interested in doing some of this stuff, we can
> discuss here or on IRC.
> 
> Thanks,
> Nikola
> 
> On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
> >
> >> -Original Message-
> >> From: Nikola Đipanov [mailto:ndipa...@redhat.com]
> >> Sent: Monday, December 14, 2015 11:11 PM
> >> To: OpenStack Development Mailing List (not for usage questions)
> >> Subject: Re: [openstack-dev] [nova] Better tests for nova
> >> scheduler(esp. race conditions)?
> >>
> >> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
> >>> Hi All,
> >>>
> >>>
> >>>
> >>> When I was looking at bugs related to race conditions of scheduler
> >>> [1-3], it feels like nova scheduler lacks sanity checks of schedule
> >>> decisions according to different situations. We cannot even make
> >>> sure that some fixes successfully mitigate race conditions to an
> >>> acceptable scale. For example, there is no easy way to test whether
> >>> server-group race conditions still exists after a fix for bug[1], or
> >>> to make sure that after scheduling there will be no violations of
> >>> allocation ratios reported by bug[2], or to test that the retry rate
> >>> is acceptable in various corner cases proposed by bug[3]. And there
> >>> will be much more in this list.
> >>>
> >>>
> >>>
> >>> So I'm asking whether there is a plan to add those tests in the
> >>> future, or is there a design exist to simplify writing and executing
> >>> those kinds of tests? I'm thinking of using fake databases and fake
> >>> interfaces to isolate the entire scheduler service, so that we can
> >>> easily build up a disposable environment with all kinds of fake
> >>> resources and fake compute nodes to test scheduler behaviors. It is
> >>> even a good way to test whether scheduler is capable to scale to 10k
> >>> nodes without setting up 10k real compute nodes.
> >>>
> >>
> >> This would be a useful effort - however do not assume that this is
> >> going to be an easy task. Even in the paragraph above, you fail to
> >> take into account that in order to test the scheduling you also need
> >> to run all compute services since claims work like a kind of 2 phase
> >> commit where a scheduling decision gets checked on the destination
> >> compute host (through Claims logic), which involves locking in each compute
> process.
> >

Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2015-12-14 Thread Cheng, Yingxin

> -Original Message-
> From: Nikola Đipanov [mailto:ndipa...@redhat.com]
> Sent: Monday, December 14, 2015 11:11 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
> conditions)?
> 
> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
> > Hi All,
> >
> >
> >
> > When I was looking at bugs related to race conditions of scheduler
> > [1-3], it feels like nova scheduler lacks sanity checks of schedule
> > decisions according to different situations. We cannot even make sure
> > that some fixes successfully mitigate race conditions to an acceptable
> > scale. For example, there is no easy way to test whether server-group
> > race conditions still exists after a fix for bug[1], or to make sure
> > that after scheduling there will be no violations of allocation ratios
> > reported by bug[2], or to test that the retry rate is acceptable in
> > various corner cases proposed by bug[3]. And there will be much more
> > in this list.
> >
> >
> >
> > So I'm asking whether there is a plan to add those tests in the
> > future, or is there a design exist to simplify writing and executing
> > those kinds of tests? I'm thinking of using fake databases and fake
> > interfaces to isolate the entire scheduler service, so that we can
> > easily build up a disposable environment with all kinds of fake
> > resources and fake compute nodes to test scheduler behaviors. It is
> > even a good way to test whether scheduler is capable to scale to 10k
> > nodes without setting up 10k real compute nodes.
> >
> 
> This would be a useful effort - however do not assume that this is going to 
> be an
> easy task. Even in the paragraph above, you fail to take into account that in
> order to test the scheduling you also need to run all compute services since
> claims work like a kind of 2 phase commit where a scheduling decision gets
> checked on the destination compute host (through Claims logic), which involves
> locking in each compute process.
> 

Yes, the final goal is to test the entire scheduling process including the 2PC.
As the scheduler is still in the process of being decoupled, some parts such as
the RT and the retry mechanism are highly coupled with nova, so IMO it is not a
good idea to include them at this stage. Thus I'll try to isolate the filter
scheduler as the first step, and I hope this will be supported by the community.


> >
> >
> > I'm also interested in the bp[4] to reduce scheduler race conditions
> > in green-thread level. I think it is a good start point in solving the
> > huge racing problem of nova scheduler, and I really wish I could help on 
> > that.
> >
> 
> I proposed said blueprint but am very unlikely to have any time to work on it 
> this
> cycle, so feel free to take a stab at it. I'd be more than happy to 
> prioritize any
> reviews related to the above BP.
> 
> Thanks for your interest in this
> 
> N.
> 

Many thanks Nikola! I'm still looking at the claim logic and trying to find a
way to merge it with the scheduler host state; I will upload patches as soon as
I figure it out.


> >
> >
> >
> >
> > [1] https://bugs.launchpad.net/nova/+bug/1423648
> >
> > [2] https://bugs.launchpad.net/nova/+bug/1370207
> >
> > [3] https://bugs.launchpad.net/nova/+bug/1341420
> >
> > [4]
> > https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
> >
> >
> >
> >
> >
> > Regards,
> >
> > -Yingxin
> >



Regards,
-Yingxin



[openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

2015-12-14 Thread Cheng, Yingxin
Hi All,

When I was looking at bugs related to race conditions in the scheduler [1-3],
it felt like the nova scheduler lacks sanity checks of its schedule decisions
in different situations. We cannot even make sure that some fixes successfully
mitigate race conditions to an acceptable scale. For example, there is no easy
way to test whether the server-group race condition still exists after a fix
for bug [1], or to make sure that after scheduling there will be no violations
of the allocation ratios reported by bug [2], or to test that the retry rate is
acceptable in the various corner cases proposed by bug [3]. And there could be
much more on this list.

So I'm asking whether there is a plan to add those tests in the future, or
whether a design exists to simplify writing and executing those kinds of tests.
I'm thinking of using fake databases and fake interfaces to isolate the entire
scheduler service, so that we can easily build up a disposable environment with
all kinds of fake resources and fake compute nodes to test scheduler behaviors.
It is even a good way to test whether the scheduler is capable of scaling to
10k nodes without setting up 10k real compute nodes.
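
As a rough illustration of the "disposable environment" idea (everything below
is made up for illustration, not nova code), fake compute node records and a
fake db layer could be generated like this, which also makes a 10k-node run
cheap:

# Rough illustration of a disposable fake environment for scheduler tests.
# FakeComputeDB is a made-up stand-in for the real db layer; nothing here is
# nova code.
import random

def make_fake_nodes(count):
    return [
        {
            "host": "node-%05d" % i,
            "vcpus": 32,
            "memory_mb": 128 * 1024,
            "local_gb": 2000,
            "vcpus_used": random.randint(0, 16),
        }
        for i in range(count)
    ]

class FakeComputeDB(object):
    """Replaces the real db layer so the scheduler can run in isolation."""

    def __init__(self, nodes):
        self.nodes = nodes

    def compute_node_get_all(self, context):
        return self.nodes

# A 10k-node "deployment" costs nothing to build:
fake_db = FakeComputeDB(make_fake_nodes(10000))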

I'm also interested in the bp [4] to reduce scheduler race conditions at the
green-thread level. I think it is a good starting point for solving the huge
racing problem of the nova scheduler, and I really wish I could help with that.


[1] https://bugs.launchpad.net/nova/+bug/1423648
[2] https://bugs.launchpad.net/nova/+bug/1370207
[3] https://bugs.launchpad.net/nova/+bug/1341420
[4] https://blueprints.launchpad.net/nova/+spec/host-state-level-locking


Regards,
-Yingxin



Re: [openstack-dev] [nova] How can I contribute to the scheduler codebase?

2015-11-12 Thread Cheng, Yingxin
Hi Sylvain,

Could you assign scheduler-driver-use-stevedore to me? I can take that as my
first step into nova. I wish I could help more, but there are still many things
to learn, so the simplest work comes first.

Yours,
-Yingxin

> -Original Message-
> From: Sylvain Bauza [mailto:sba...@redhat.com]
> Sent: Monday, November 9, 2015 11:09 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: [openstack-dev] [nova] How can I contribute to the scheduler 
> codebase?
> 
> Hi,
> 
> During the last Nova scheduler meeting (held every Mondays 1400UTC on
> #openstack-meeting-alt), we identified some on-going effort that could 
> possibly
> be addressed by anyone wanting to step in. For the moment, we are still
> polishing the last bits of agreement, but those blueprints should be splitted 
> into
> small actional items that could be seen as low-hanging-fruits.
> 
> Given those tasks require a bit of context understanding, the best way to
> consider joining us to join the Nova scheduler weekly meeting (see above the
> timing) and join our team. We'll try to provide you a bit of guidance and
> explanations whenever needed so that you could get some work assigned to you.
> 
>  From an overall point of view, you can still get many ways to begin your Nova
> journey by reading
> https://wiki.openstack.org/wiki/Nova/Mentoring#What_should_I_work_on.3F
> 
> HTH,
> -Sylvain
>



Re: [openstack-dev] [Congress] Need reviews of bp rpc-for-dse

2015-07-23 Thread Cheng, Yingxin
OK, got it.

-Yingxin

From: Tim Hinrichs [mailto:t...@styra.com]
Sent: Thursday, July 23, 2015 11:18 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Congress] Need reviews of bp rpc-for-dse

Hi Yingxin,

We definitely want details for what you have in mind.  In case you haven't done 
this before, we write a 'spec' to explain the details.  Here's some info on how 
to do that.

https://wiki.openstack.org/wiki/Congress#How_To_Propose_a_New_Feature

Tim


On Thu, Jul 23, 2015 at 12:13 AM Cheng, Yingxin 
mailto:yingxin.ch...@intel.com>> wrote:
Hi all,


I have some thoughts about congress dse improvement after having read the code 
for several days.

Please refer to 
bp/rpc-for-dse<https://blueprints.launchpad.net/congress/+spec/rpc-for-dse>, 
its idea is to implement RPC in deepsix. Congress can benefit from 
lower-coupling between dse services. And the steps towards oslo-messaging 
integration can be milder too.

Eager to learn everything from Community. Anything wrong, please kindly point 
it out.


Thank you
Yingxin


Re: [openstack-dev] [Congress] Need reviews of bp rpc-for-dse

2015-07-23 Thread Cheng, Yingxin
Tim and Masa,


Thanks very much for your invitation. I really wish I could come, but there is
a visa problem.

I'm interested in the Congress implementation and will join the sprint remotely.

I hope my idea is helpful.


Yingxin


From: Tim Hinrichs [mailto:t...@styra.com]
Sent: Thursday, July 23, 2015 9:43 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Congress] Need reviews of bp rpc-for-dse

Hi Yingxin,

+1 to joining the sprint.  If you can't make the mid-cycle sprint in person, 
we're planning to have a remote option.  Keep in mind though that it will be 
harder to engage with everyone if you're remote.

We'll review your spec and get you feedback, but the purpose of the sprint (Aug 
6-7) is to decide and hopefully build the next generation of communication 
between the policy engines and the datasources.

Tim

On Thu, Jul 23, 2015 at 1:22 AM Masahito MUROI 
mailto:muroi.masah...@lab.ntt.co.jp>> wrote:
Hi Yingxin,

I think moving Congress from dse to rpc is a good idea.

Congress team will discuss its scalability in mid cycle sprint[1], [2].
I guess using rpc is one of key item for congress to get the ability.
Why don't you join the meetup?

[1] https://wiki.openstack.org/wiki/Sprints/CongressLibertySprint
[2]
http://www.eventbrite.com/e/congress-liberty-midcycle-sprint-tickets-17654731778

best regard,
masa

On 2015/07/23 16:07, Cheng, Yingxin wrote:
> Hi all,
>
>
> I have some thoughts about congress dse improvement after having read the 
> code for several days.
>
> Please refer to 
> bp/rpc-for-dse<https://blueprints.launchpad.net/congress/+spec/rpc-for-dse>, 
> its idea is to implement RPC in deepsix. Congress can benefit from 
> lower-coupling between dse services. And the steps towards oslo-messaging 
> integration can be milder too.
>
> Eager to learn everything from Community. Anything wrong, please kindly point 
> it out.
>
>
> Thank you
> Yingxin
>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: 
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe<http://openstack-dev-requ...@lists.openstack.org?subject:unsubscribe>
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


--
室井 雅仁(Masahito MUROI)
Software Innovation Center, NTT
Tel: +81-422-59-4539,FAX: +81-422-59-2699




[openstack-dev] [Congress] Need reviews of bp rpc-for-dse

2015-07-23 Thread Cheng, Yingxin
Hi all,


I have some thoughts about a Congress DSE improvement after having read the
code for several days.

Please refer to bp/rpc-for-dse; its idea is to implement RPC in deepsix.
Congress can benefit from lower coupling between DSE services, and the steps
towards oslo.messaging integration can be gentler too.
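
For context, a generic oslo.messaging RPC server/client pair looks roughly like
the sketch below. This is not Congress's DSE code; the topic, server name and
endpoint method are placeholders I made up, and a real deployment also needs a
configured transport_url.

# Generic oslo.messaging RPC sketch (not Congress/DSE code); the topic, server
# name and endpoint method are placeholders for illustration.
from oslo_config import cfg
import oslo_messaging as messaging

class DataSourceEndpoint(object):
    def get_status(self, ctxt, **kwargs):
        return {"status": "ok"}

def run_server():
    # Runs inside one dse service process.
    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic="dse_datasource", server="node-1")
    server = messaging.get_rpc_server(
        transport, target, [DataSourceEndpoint()], executor="blocking")
    server.start()
    server.wait()   # serve requests until stopped

def call_peer():
    # Runs inside another dse service process, with no knowledge of the
    # peer's internals beyond the topic and method name.
    transport = messaging.get_transport(cfg.CONF)
    client = messaging.RPCClient(transport,
                                 messaging.Target(topic="dse_datasource"))
    return client.call({}, "get_status")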

I'm eager to learn everything from the community. If anything is wrong, please
kindly point it out.


Thank you
Yingxin