Hi,

Following the feedback [1] from the Austin design summit, I prepared my 
environment with pre-loaded computes and finished a new round of performance 
profiling using the tool [7]. I also updated the prototype [2] to simplify the 
implementation on the compute-node side, which brings it closer to the design 
described in the spec [6].

This set of results is more comprehensive: it covers the analysis of the 
“eventually consistent host states” prototype [2], the default filter 
scheduler, and the caching scheduler. They are tested under various scenarios 
in a 1000-compute-node environment, with real controller services, a real 
RabbitMQ and a real MySQL database. The new set of experiments contains 55 
repeatable results [3]. Don’t be put off by the verbose data; I’ve distilled 
the conclusions from it.

To better understand what happens during scheduling in each scenario, all of 
them are visualized in the doc [4]. They complement what I presented at the 
Austin design summit, on the 7th page of the slides [5].

Note that the “pre-load scenario” leaves room for only 49 new instances in the 
1000-node environment. This means that when 50 requests are sent, there should 
be exactly 1 failed request if the scheduler decisions are accurate.


Detailed analysis with illustration [4]: 
https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
 
======
In all test cases, nova dispatches 50 instant boot requests to 1000 compute 
nodes. The aim is to compare the behavior of 3 types of schedulers, with 
pre-loaded or empty-loaded compute nodes, and with 1 or 2 scheduler services. 
That makes 3*2*2=12 sets of experiments, and each set is run multiple times.
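
Just to make the matrix explicit, the 12 sets are the cross product of the 
three dimensions; a trivial sketch (the labels are mine, not the names of the 
result files):

    # Enumerate the 12 experiment sets: 3 schedulers x 2 load states
    # x 2 scheduler-service counts. Labels are illustrative only.
    import itertools

    schedulers = ['prototype', 'filter', 'caching']
    loads = ['empty-loaded', 'pre-loaded']
    services = [1, 2]

    for sched, load, n in itertools.product(schedulers, loads, services):
        print('%s scheduler, %s, %d scheduler service(s)' % (sched, load, n))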

In scenario S1 (i.e. 1 scheduler with empty-loaded compute nodes), A2 shows 
very clearly that the entire boot process of the filter scheduler is throttled 
by the nova-scheduler service. The filter scheduler consumes those 50 requests 
very slowly, so all the requests pile up in front of the scheduler service in 
the yellow area. The ROOT CAUSE is the “cache-refresh” step before filtering 
(i.e. `nova.scheduler.filter_scheduler.FilterScheduler._get_all_host_states`). 
I discussed this bottleneck in detail in the Austin summit session “Dive into 
nova scheduler performance: where is the bottleneck” [8]. The caching 
scheduler confirms this, because it excludes the “cache-refresh” bottleneck 
and only uses in-memory filtering. Simply excluding “cache-refresh” brings 
huge performance benefits: the query time is reduced by 87%, and the overall 
throughput (i.e. the delivered requests per second in this cloud) is 
multiplied by 8.24; see A3 for illustration. The “eventually consistent host 
states” prototype also excludes this bottleneck, and synchronizes the 
scheduler caches in a more fine-grained way. It is slightly slower than the 
caching scheduler because of the overhead of applying incremental updates from 
compute nodes: in S1 the query time is reduced by 79% and the overall 
throughput is multiplied by 5.63 on average.
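
To make the “cache-refresh” cost concrete, here is a simplified sketch of the 
two code paths (my illustration, not actual nova code; `db.get_all_compute_nodes` 
and the host-state shape are hypothetical stand-ins):

    # Why "cache-refresh" hurts: the filter scheduler rebuilds its view
    # of ALL hosts from the database for every single request, while the
    # caching scheduler filters a periodically refreshed in-memory view.

    def filter_and_weigh(host_states, request):
        # Stand-in for the real filter/weigher pipeline.
        fits = [h for h in host_states if h['free_ram'] >= request['ram']]
        return max(fits, key=lambda h: h['free_ram']) if fits else None

    def filter_scheduler_select(request, db):
        # One heavyweight refresh per request: ~1000 compute-node records
        # are fetched and rebuilt into host states before any filtering.
        # This is the `_get_all_host_states` bottleneck.
        host_states = db.get_all_compute_nodes()
        return filter_and_weigh(host_states, request)

    def caching_scheduler_select(request, cached_host_states):
        # No per-request DB access, only in-memory filtering, at the
        # cost of staleness between periodic refreshes.
        return filter_and_weigh(cached_host_states, request)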

In pre-load scenario S2, all 3 types of schedulers are faster than in their 
empty-loaded scenarios. That’s because the filters can now prune the hosts 
from 1000 down to only 49, so the last few filters don’t need to process 1000 
host states and can be much faster. But the filter scheduler (B2) cannot 
benefit much from faster filtering, because its bottleneck is still 
“cache-refresh”. It is different for the caching scheduler and the prototype, 
whose performance depends heavily on in-memory filtering. For the caching 
scheduler (B3), the query time is reduced by 81% and the overall throughput is 
multiplied by 7.52 compared with the filter scheduler. For the prototype (B1), 
the query time is reduced by 83% and the throughput is multiplied by 7.92 on 
average. Also, all the scheduler decisions are accurate: their first decisions 
are all correct without any retries in the pre-load scenario, and exactly 1 of 
the 50 requests fails with the expected “no valid host” error.
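
The pruning effect is easy to see in a toy filter chain (again my 
illustration, not nova’s filter code): every filter only processes the hosts 
that survived the previous one, so cutting 1000 hosts down to 49 early makes 
all later filters cheap.

    # Toy filter chain: the host list shrinks as filters run, so later
    # filters get cheaper once an early filter prunes most hosts.
    def ram_filter(hosts, req):
        return [h for h in hosts if h['free_ram'] >= req['ram']]

    def disk_filter(hosts, req):
        return [h for h in hosts if h['free_disk'] >= req['disk']]

    def run_filters(hosts, req, filters=(ram_filter, disk_filter)):
        for f in filters:
            hosts = f(hosts, req)
            if not hosts:
                break
        return hosts

    hosts = [{'free_ram': 4096, 'free_disk': 80},
             {'free_ram': 512, 'free_disk': 10}]
    print(run_filters(hosts, {'ram': 1024, 'disk': 20}))  # 1 host left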

In scenario S3, with 2 scheduler services and empty-loaded compute nodes, the 
internal scheduling bandwidth of all three schedulers is doubled. The filter 
scheduler (C2) improves the most, because its bottleneck was scheduler 
bandwidth. The other two types don’t improve similarly, because their 
bottleneck has moved to the nova-api service instead. It is a wrong decision 
to add more schedulers when the actual bottleneck is elsewhere; worse, 
multiple schedulers introduce more race conditions as well as other overhead. 
Still, the caching scheduler (C3) and the prototype (C1) perform much better 
than the filter scheduler: the query time is reduced by 65% and the overall 
throughput is multiplied by 3.67 on average.
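
The underlying arithmetic is simply that end-to-end throughput is capped by 
the slowest stage of the pipeline; the numbers below are made up for 
illustration, not measured values:

    # End-to-end throughput is the minimum over the pipeline stages
    # (requests per second; illustrative numbers only).
    def pipeline_throughput(stages):
        return min(stages.values())

    slow_sched = {'nova-api': 12.0, 'scheduler': 5.0, 'compute': 30.0}
    print(pipeline_throughput(slow_sched))                        # 5.0
    print(pipeline_throughput(dict(slow_sched, scheduler=10.0)))  # 10.0

    # If the scheduler is already fast (caching scheduler/prototype),
    # doubling it cannot move the nova-api cap:
    fast_sched = {'nova-api': 12.0, 'scheduler': 40.0, 'compute': 30.0}
    print(pipeline_throughput(dict(fast_sched, scheduler=80.0)))  # 12.0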

In pre-load scenario S4 with 2 schedulers, the race condition surfaces, 
because there are only 49 free slots among the 1000 hosts, and the conflicting 
decisions all result in retries. The results (D1, D2, D3) show that the retry 
rates are similar under 50 instant requests, but I already have a further idea 
to improve the prototype here, and the results should differ under a 
continuous stream of boot and delete requests. Some tests also show that the 
caches in the caching scheduler are still outdated even after 1 minute. For 
example, in test “results-1s-1000n-50r-0p222-preload-caching3”, 19 requests 
failed because of outdated caches, and in test 
“results-2s-1000n-50r-0p222-preload-caching4”, 21 requests failed for the same 
reason.


Quick conclusion here
======
In short, this prototype [2] has the following improvements and guarantees:
1. When empty-loaded, its performance is much better than the filter scheduler 
(5.63x better at 1000 nodes), and close to the caching scheduler.
2. With pre-load, its advantage over the filter scheduler is even bigger 
(7.92x), and it is closer still to the caching scheduler.
3. Its placement accuracy is 100% in the 1-scheduler scenarios.
4. There is no major change to the scheduling process; it is highly compatible 
with the existing scheduler architecture (before resource providers).
5. The biggest bottleneck, “cache-refresh”, is resolved by this prototype; 
nova-api becomes the new bottleneck limiting throughput instead.
6. Racing is allowed among schedulers because of the lock-free design, and the 
racing rate in the 2-scheduler scenario is acceptable; see the sketch right 
after this list.
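
For point 6, here is a rough sketch of the lock-free claim/retry flow as I 
understand it from the spec [6] (illustrative names, not the prototype’s 
actual code): each scheduler decides from its own eventually consistent cache, 
the compute node is the final arbiter, and losing a race only costs a retry 
plus an incremental cache update.

    class NoValidHost(Exception):
        pass

    class FakeNode(object):
        """Stand-in compute node: the single source of truth."""
        def __init__(self, free_ram):
            self.free_ram = free_ram

        def try_claim(self, ram):
            # Accept or reject the claim, and always report back the
            # authoritative free_ram as an incremental update.
            if self.free_ram >= ram:
                self.free_ram -= ram
                return True, self.free_ram
            return False, self.free_ram

    def schedule(request_ram, cache, nodes, max_retries=3):
        """cache: {host: free_ram}, owned by THIS scheduler only."""
        for _ in range(max_retries):
            fits = [h for h, free in cache.items() if free >= request_ram]
            if not fits:
                raise NoValidHost()
            host = max(fits, key=cache.get)        # in-memory decision
            claimed, free = nodes[host].try_claim(request_ram)
            cache[host] = free                     # apply the update
            if claimed:
                return host
            # else: lost the race to another scheduler; just retry.
        raise NoValidHost()

    nodes = {'h1': FakeNode(2048), 'h2': FakeNode(1024)}
    print(schedule(512, {'h1': 2048, 'h2': 1024}, nodes))  # 'h1'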


About the profiling tool [7]
======
It is worth noting that this tool is much more precise and verbose than Rally. 
The whole analysis is offline, whereas Rally adds extra pressure on the API by 
polling for status. And the analysis is fine-grained, based on injected logs, 
whereas Rally only sees API-level responses.
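
Roughly, the offline analysis works by pairing timestamped checkpoints that 
the injected logs emit per request; a minimal sketch of the idea (my 
illustration, the actual log format of [7] differs):

    # Reconstruct per-stage latency from "timestamp request_id checkpoint"
    # log lines after the test run; no extra load during the run itself.
    from collections import defaultdict

    def parse(lines):
        points = defaultdict(dict)   # request_id -> {checkpoint: ts}
        for line in lines:
            ts, req_id, checkpoint = line.split()
            points[req_id][checkpoint] = float(ts)
        return points

    def stage_latency(points, start, end):
        return [p[end] - p[start] for p in points.values()
                if start in p and end in p]

    logs = ['100.0 req-1 api_recv', '100.2 req-1 sched_start',
            '101.5 req-1 sched_done', '102.0 req-1 compute_done']
    print(stage_latency(parse(logs), 'sched_start', 'sched_done'))  # ~[1.3]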

The tool can simulate compute nodes by attaching the fake virt driver and 
launching the nodes as separate processes. It can also be deployed in a real 
OpenStack cluster by monkey-patching the related nova services; it has already 
been deployed successfully in a China Mobile 1000-node environment that uses 
multiple controllers. The tool can profile all the existing schedulers from 
the Kilo to the Mitaka release, under various configurations, using the same 
analysis framework. This also shows that the prototype doesn’t introduce major 
changes to the existing nova architecture.
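
The simulation idea, very roughly (a sketch under my own assumptions, not the 
tool’s actual entry point): run many nova-compute processes on one box, each 
with the in-tree fake virt driver and a distinct host name, so the controller 
sees N independent “compute nodes”:

    # Spawn simulated compute nodes (sketch only). Each config file is
    # assumed to set, at minimum:
    #   [DEFAULT]
    #   compute_driver = fake.FakeDriver
    #   host = fake-node-<i>
    import subprocess

    procs = [subprocess.Popen(['nova-compute',
                               '--config-file', 'nova-fake-%d.conf' % i])
             for i in range(1000)]
    for p in procs:
        p.wait()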


Current plan
======
This prototype isn’t a design of the “shared-state scheduler”; it is an 
improvement to the existing host manager. But it is a very important step 
towards the “shared-state scheduler” described on the 11th page of the slides 
[9], which basically leverages inter-process communication between workers to 
improve scheduler throughput. Inter-process communication is much faster than 
the network, which is why my previous analysis [10], based on placement bench 
[11], showed extremely low race rates (only 3% of 12000 requests using 8 
workers) and much better performance (38 times better according to the data 
[12]). The end goal is to use this model inside the scheduler service.
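
A back-of-envelope way to see why the in-process model races so rarely (the 
numbers are illustrative, not measurements): the race window is roughly the 
decide-to-claim round trip, and IPC shrinks it by orders of magnitude compared 
with going over MQ and network to a compute node.

    # Illustrative numbers only: the shorter the decide->claim round
    # trip, the smaller the window in which two workers can collide.
    ipc_round_trip = 0.0005   # ~0.5 ms between workers in one service
    rpc_round_trip = 0.050    # ~50 ms over the MQ/network to a node
    print(rpc_round_trip / ipc_round_trip)  # race window ~100x smaller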

The current plan still relies on the progress of the resource-provider series. 
Substantial design changes might need to be addressed for the new 
architecture, but that doesn’t matter much from my side; it is more important 
and beneficial to implement a generic scheduler service as early as possible.


[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/093595.html
[2] https://review.openstack.org/#/c/306301/
[3] http://paste.openstack.org/show/506846/
[4] https://docs.google.com/document/d/1qFNROdJxj4m1lXe250DW3XAAY02QHmlTm1N2nEHVVPg/edit?usp=sharing
[5] https://docs.google.com/presentation/d/1UG1HkEWyxPVMXseLwJ44ZDm-ek_MPc4M65H8EiwZnWs/edit?ts=571fcdd5#slide=id.g12d2cf15cd_2_223
[6] https://review.openstack.org/#/c/306844/
[7] https://github.com/cyx1231st/nova-scheduler-bench
[8] https://www.openstack.org/assets/presentation-media/7129-Dive-into-nova-scheduler-performance-summit.pdf
[9] https://docs.google.com/presentation/d/1UG1HkEWyxPVMXseLwJ44ZDm-ek_MPc4M65H8EiwZnWs/edit?ts=571fcdd5#slide=id.g12d2cf15cd_2_263
[10] http://lists.openstack.org/pipermail/openstack-dev/2016-March/087889.html
[11] https://github.com/cyx1231st/placement-bench/tree/shared-state-demonstration
[12] http://paste.openstack.org/show/488715/


-- 
Regards
Yingxin
