Hi,

Following the feedback [1] from the Austin design summit, I prepared my 
environment with pre-loaded computes and finished a new round of performance 
profiling using the tool [7]. I also updated the prototype [2] to simplify the 
implementation on the compute-node side, which brings it closer to the design 
described in the spec [6].

This set of results is more comprehensive: it covers the analysis of the 
“eventually consistent host states” prototype [2], the default filter 
scheduler, and the caching scheduler. They are tested under various scenarios 
in a 1000-compute-node environment, with real controller services, a real 
RabbitMQ and a real MySQL database. The new set of experiments contains 55 
repeatable results [3]. Don’t be put off by the verbose data; I’ve distilled 
the conclusions from it.

To better understand what happens during scheduling in each scenario, all of 
them are visualized in the doc [4]. They complement what I presented at the 
Austin design summit, on the 7th page of the slides [5].

Note that the “pre-load scenario” leaves room for only 49 new instances in the 
1000-node environment. This means that when 50 requests are sent, there should 
be exactly 1 failed request if the scheduler decisions are accurate.


Detailed analysis with illustration [4]: 
https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
 
======
In all test cases, nova dispatches 50 instant boot requests to 1000 compute 
nodes. The aim is to compare the behavior of 3 types of schedulers, with 
pre-loaded or empty-loaded compute nodes, and with 1 or 2 scheduler services. 
That makes 3*2*2=12 sets of experiments, and each set is run multiple times.
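
Just to make the matrix explicit, the 12 sets are the cross product of the 
three dimensions; a trivial sketch (the labels are mine, not the names of the 
result files):

    # Enumerate the 12 experiment sets: 3 schedulers x 2 load states
    # x 2 scheduler-service counts. Labels are illustrative only.
    import itertools

    schedulers = ['prototype', 'filter', 'caching']
    loads = ['empty-loaded', 'pre-loaded']
    services = [1, 2]

    for sched, load, n in itertools.product(schedulers, loads, services):
        print('%s scheduler, %s, %d scheduler service(s)' % (sched, load, n))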

In scenario S1 (i.e. 1 scheduler with empty-loaded compute nodes), A2 shows 
very clearly that the entire boot process of the filter scheduler is throttled 
by the nova-scheduler service. The filter scheduler consumes those 50 requests 
very slowly, so all the requests pile up in front of the scheduler service in 
the yellow area. The ROOT CAUSE is the “cache-refresh” step before filtering 
(i.e. `nova.scheduler.filter_scheduler.FilterScheduler._get_all_host_states`). 
I discussed this bottleneck in detail in the Austin summit session “Dive into 
nova scheduler performance: where is the bottleneck” [8]. The caching 
scheduler confirms this, because it excludes the “cache-refresh” bottleneck 
and only uses in-memory filtering. Simply excluding “cache-refresh” brings 
huge performance benefits: the query time is reduced by 87%, and the overall 
throughput (i.e. the delivered requests per second in this cloud) is 
multiplied by 8.24; see A3 for illustration. The “eventually consistent host 
states” prototype also excludes this bottleneck, and synchronizes the 
scheduler caches in a more fine-grained way. It is slightly slower than the 
caching scheduler because of the overhead of applying incremental updates from 
compute nodes: in S1 the query time is reduced by 79% and the overall 
throughput is multiplied by 5.63 on average.
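
To make the “cache-refresh” cost concrete, here is a simplified sketch of the 
two code paths (my illustration, not actual nova code; `db.get_all_compute_nodes` 
and the host-state shape are hypothetical stand-ins):

    # Why "cache-refresh" hurts: the filter scheduler rebuilds its view
    # of ALL hosts from the database for every single request, while the
    # caching scheduler filters a periodically refreshed in-memory view.

    def filter_and_weigh(host_states, request):
        # Stand-in for the real filter/weigher pipeline.
        fits = [h for h in host_states if h['free_ram'] >= request['ram']]
        return max(fits, key=lambda h: h['free_ram']) if fits else None

    def filter_scheduler_select(request, db):
        # One heavyweight refresh per request: ~1000 compute-node records
        # are fetched and rebuilt into host states before any filtering.
        # This is the `_get_all_host_states` bottleneck.
        host_states = db.get_all_compute_nodes()
        return filter_and_weigh(host_states, request)

    def caching_scheduler_select(request, cached_host_states):
        # No per-request DB access, only in-memory filtering, at the
        # cost of staleness between periodic refreshes.
        return filter_and_weigh(cached_host_states, request)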

In pre-load scenario S2, all 3 types of schedulers are faster than in their 
empty-loaded scenarios. That’s because the filters can now prune the hosts 
from 1000 down to only 49, so the last few filters don’t need to process 1000 
host states and can be much faster. But the filter scheduler (B2) cannot 
benefit much from faster filtering, because its bottleneck is still 
“cache-refresh”. It is different for the caching scheduler and the prototype, 
whose performance depends heavily on in-memory filtering. For the caching 
scheduler (B3), the query time is reduced by 81% and the overall throughput is 
multiplied by 7.52 compared with the filter scheduler. For the prototype (B1), 
the query time is reduced by 83% and the throughput is multiplied by 7.92 on 
average. Also, all the scheduler decisions are accurate: their first decisions 
are all correct without any retries in the pre-load scenario, and exactly 1 of 
the 50 requests fails with the expected “no valid host” error.
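
The pruning effect is easy to see in a toy filter chain (again my 
illustration, not nova’s filter code): every filter only processes the hosts 
that survived the previous one, so cutting 1000 hosts down to 49 early makes 
all later filters cheap.

    # Toy filter chain: the host list shrinks as filters run, so later
    # filters get cheaper once an early filter prunes most hosts.
    def ram_filter(hosts, req):
        return [h for h in hosts if h['free_ram'] >= req['ram']]

    def disk_filter(hosts, req):
        return [h for h in hosts if h['free_disk'] >= req['disk']]

    def run_filters(hosts, req, filters=(ram_filter, disk_filter)):
        for f in filters:
            hosts = f(hosts, req)
            if not hosts:
                break
        return hosts

    hosts = [{'free_ram': 4096, 'free_disk': 80},
             {'free_ram': 512, 'free_disk': 10}]
    print(run_filters(hosts, {'ram': 1024, 'disk': 20}))  # 1 host left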

In scenario S3, with 2 scheduler services and empty-loaded compute nodes, the 
internal scheduling bandwidth of all three schedulers is doubled. The filter 
scheduler (C2) improves the most, because its bottleneck was scheduler 
bandwidth. The other two types don’t improve similarly, because their 
bottleneck has moved to the nova-api service instead. It is a wrong decision 
to add more schedulers when the actual bottleneck is elsewhere; worse, 
multiple schedulers introduce more race conditions as well as other overhead. 
Still, the caching scheduler (C3) and the prototype (C1) perform much better 
than the filter scheduler: the query time is reduced by 65% and the overall 
throughput is multiplied by 3.67 on average.
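
The underlying arithmetic is simply that end-to-end throughput is capped by 
the slowest stage of the pipeline; the numbers below are made up for 
illustration, not measured values:

    # End-to-end throughput is the minimum over the pipeline stages
    # (requests per second; illustrative numbers only).
    def pipeline_throughput(stages):
        return min(stages.values())

    slow_sched = {'nova-api': 12.0, 'scheduler': 5.0, 'compute': 30.0}
    print(pipeline_throughput(slow_sched))                        # 5.0
    print(pipeline_throughput(dict(slow_sched, scheduler=10.0)))  # 10.0

    # If the scheduler is already fast (caching scheduler/prototype),
    # doubling it cannot move the nova-api cap:
    fast_sched = {'nova-api': 12.0, 'scheduler': 40.0, 'compute': 30.0}
    print(pipeline_throughput(dict(fast_sched, scheduler=80.0)))  # 12.0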

In pre-load scenario S4 with 2 schedulers, the race condition surfaces, 
because there are only 49 free slots among the 1000 hosts, and the conflicting 
decisions all result in retries. The results (D1, D2, D3) show that the retry 
rates are similar under 50 instant requests, but I already have a further idea 
to improve the prototype here, and the results should differ under a 
continuous stream of boot and delete requests. Some tests also show that the 
caches in the caching scheduler are still outdated even after 1 minute. For 
example, in test “results-1s-1000n-50r-0p222-preload-caching3”, 19 requests 
failed because of outdated caches, and in test 
“results-2s-1000n-50r-0p222-preload-caching4”, 21 requests failed for the same 
reason.


Quick conclusion here
======
In short, this prototype [2] has the following improvements and guarantees:
1. When empty-loaded, its performance is much better than the filter scheduler 
(5.63x better at 1000 nodes), and close to the caching scheduler.
2. With pre-load, its advantage over the filter scheduler is even bigger 
(7.92x), and it is closer still to the caching scheduler.
3. Its placement accuracy is 100% in the 1-scheduler scenarios.
4. There is no major change to the scheduling process; it is highly compatible 
with the existing scheduler architecture (before resource providers).
5. The biggest bottleneck, “cache-refresh”, is resolved by this prototype; 
nova-api becomes the new bottleneck limiting throughput instead.
6. Racing is allowed among schedulers because of the lock-free design, and the 
racing rate in the 2-scheduler scenario is acceptable; see the sketch right 
after this list.
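
For point 6, here is a rough sketch of the lock-free claim/retry flow as I 
understand it from the spec [6] (illustrative names, not the prototype’s 
actual code): each scheduler decides from its own eventually consistent cache, 
the compute node is the final arbiter, and losing a race only costs a retry 
plus an incremental cache update.

    class NoValidHost(Exception):
        pass

    class FakeNode(object):
        """Stand-in compute node: the single source of truth."""
        def __init__(self, free_ram):
            self.free_ram = free_ram

        def try_claim(self, ram):
            # Accept or reject the claim, and always report back the
            # authoritative free_ram as an incremental update.
            if self.free_ram >= ram:
                self.free_ram -= ram
                return True, self.free_ram
            return False, self.free_ram

    def schedule(request_ram, cache, nodes, max_retries=3):
        """cache: {host: free_ram}, owned by THIS scheduler only."""
        for _ in range(max_retries):
            fits = [h for h, free in cache.items() if free >= request_ram]
            if not fits:
                raise NoValidHost()
            host = max(fits, key=cache.get)        # in-memory decision
            claimed, free = nodes[host].try_claim(request_ram)
            cache[host] = free                     # apply the update
            if claimed:
                return host
            # else: lost the race to another scheduler; just retry.
        raise NoValidHost()

    nodes = {'h1': FakeNode(2048), 'h2': FakeNode(1024)}
    print(schedule(512, {'h1': 2048, 'h2': 1024}, nodes))  # 'h1'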


About the profiling tool [7]
======
It is worth noting that this tool is much more precise and verbose than Rally. 
The whole analysis is offline, whereas Rally adds extra pressure on the API by 
polling for status. And the analysis is fine-grained, based on injected logs, 
whereas Rally only sees API-level responses.
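
Roughly, the offline analysis works by pairing timestamped checkpoints that 
the injected logs emit per request; a minimal sketch of the idea (my 
illustration, the actual log format of [7] differs):

    # Reconstruct per-stage latency from "timestamp request_id checkpoint"
    # log lines after the test run; no extra load during the run itself.
    from collections import defaultdict

    def parse(lines):
        points = defaultdict(dict)   # request_id -> {checkpoint: ts}
        for line in lines:
            ts, req_id, checkpoint = line.split()
            points[req_id][checkpoint] = float(ts)
        return points

    def stage_latency(points, start, end):
        return [p[end] - p[start] for p in points.values()
                if start in p and end in p]

    logs = ['100.0 req-1 api_recv', '100.2 req-1 sched_start',
            '101.5 req-1 sched_done', '102.0 req-1 compute_done']
    print(stage_latency(parse(logs), 'sched_start', 'sched_done'))  # ~[1.3]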

The tool can simulate compute nodes by attaching the fake virt driver and 
launching the nodes as separate processes. It can also be deployed in a real 
OpenStack cluster by monkey-patching the related nova services; it has already 
been deployed successfully in a China Mobile 1000-node environment that uses 
multiple controllers. The tool can profile all the existing schedulers from 
the Kilo to the Mitaka release, under various configurations, using the same 
analysis framework. This also shows that the prototype doesn’t introduce major 
changes to the existing nova architecture.
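
The simulation idea, very roughly (a sketch under my own assumptions, not the 
tool’s actual entry point): run many nova-compute processes on one box, each 
with the in-tree fake virt driver and a distinct host name, so the controller 
sees N independent “compute nodes”:

    # Spawn simulated compute nodes (sketch only). Each config file is
    # assumed to set, at minimum:
    #   [DEFAULT]
    #   compute_driver = fake.FakeDriver
    #   host = fake-node-<i>
    import subprocess

    procs = [subprocess.Popen(['nova-compute',
                               '--config-file', 'nova-fake-%d.conf' % i])
             for i in range(1000)]
    for p in procs:
        p.wait()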


Current plan
======
This prototype isn’t a design of the “shared-state scheduler”; it is an 
improvement to the existing host manager. But it is a very important step 
towards the “shared-state scheduler” described on the 11th page of the slides 
[9], which basically leverages inter-process communication between workers to 
improve scheduler throughput. Inter-process communication is much faster than 
the network, which is why my previous analysis [10], based on placement bench 
[11], showed extremely low race rates (only 3% of 12000 requests using 8 
workers) and much better performance (38 times better according to the data 
[12]). The end goal is to use this model inside the scheduler service.
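
A back-of-envelope way to see why the in-process model races so rarely (the 
numbers are illustrative, not measurements): the race window is roughly the 
decide-to-claim round trip, and IPC shrinks it by orders of magnitude compared 
with going over MQ and network to a compute node.

    # Illustrative numbers only: the shorter the decide->claim round
    # trip, the smaller the window in which two workers can collide.
    ipc_round_trip = 0.0005   # ~0.5 ms between workers in one service
    rpc_round_trip = 0.050    # ~50 ms over the MQ/network to a node
    print(rpc_round_trip / ipc_round_trip)  # race window ~100x smaller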

The current plan still relies on the progress of the resource-provider series. 
Substantial design changes might need to be addressed for the new 
architecture, but that doesn’t matter much from my side; it is more important 
and beneficial to implement a generic scheduler service as early as possible.


[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/093595.html
[2] https://review.openstack.org/#/c/306301/
[3] http://paste.openstack.org/show/506846/
[4] https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
[5] https://docs.google.com/presentation/d/1UG1HkEWyxPVMXseLwJ44ZDm-ek_MPc4M65H8EiwZnWs/edit?ts=571fcdd5#slide=id.g12d2cf15cd_2_223
[6] https://review.openstack.org/#/c/306844/
[7] https://github.com/cyx1231st/nova-scheduler-bench
[8] https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
[9] https://docs.google.com/presentation/d/1UG1HkEWyxPVMXseLwJ44ZDm-ek_MPc4M65H8EiwZnWs/edit?ts=571fcdd5#slide=id.g12d2cf15cd_2_263
[10] http://lists.openstack.org/pipermail/openstack-dev/2016-March/087889.html
[11] https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration
[12] http://paste.openstack.org/show/488715/


-- 
Regards
Yingxin
