Hello everyone,

It took a bit longer than expected, for reasons I will explain below, but we
have some more results and better tooling.

Points open from the last round (not in strict priority order):
---------------------------------------------------------------

1. find a reliable application-wide profiling approach: profile the whole of
VDSM, not just a specific function (_startUnderlyingVm now, something else
in future runs)
2. make the profiling more consistent and reliable; do more runs (at least
10, the more the better); report the variance?
3. limit the profiling reports to the hotspots (the 32 most expensive calls)
4. show the callers, to learn more about the wait calls
5. investigate where and why we spend time in StringIO
6. re-verify the impact of cPickle vs pickle
7. benchmark BoundedSemaphore(s) (a rough timeit sketch follows this list)
8. benchmark XML processing
9. add more scenarios, and more configurations within those scenarios
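
For open point 7, a timeit loop should be a good starting point. A minimal
sketch (the iteration count and the uncontended workload are just my
assumptions, not an agreed-upon methodology):

    import threading
    import timeit

    # compare the plain and the bounded semaphore on an uncontended
    # acquire/release cycle
    for factory in (threading.Semaphore, threading.BoundedSemaphore):
        sem = factory(1)

        def cycle():
            sem.acquire()
            sem.release()

        secs = timeit.timeit(cycle, number=100000)
        print('%-20s 100000 acquire/release cycles: %.3fs'
              % (factory.__name__, secs))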


Quick summary of this round:
----------------------------

- added more user-visible metrics: the time for a VM to come 'Up' from the
time it is started
- the benchmarking script is now automated and starting to be trustworthy
- result data available as CSV files
- existing patches on gerrit deliver improvements (6-7% for cPickle, ~10% for
XML caching)

The testing scenario is still the same as in the previous round (the next
step is to add more of them).

Please continue reading for more details.


Application-wide profiling
---------------------------

From a Python perspective, yappi still looks like the best shot. Its selling
point is that it is designed to be low-overhead (so they claim) and to work
nicely with long-running multi-threaded daemons, like VDSM.

We have a nice patchset on gerrit to integrate yappi into VDSM, courtesy of
Nir Soffer (http://gerrit.ovirt.org/26113).
With it we should be able to capture VDSM-wide profiles more easily. I am
going to integrate my benchmark script(s) with it (see below).
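
As a teaser, here is a minimal sketch of what a yappi session wrapped around
a scenario could look like (an illustration only, not the code from the
patchset; run_scenario is a made-up placeholder for the real workload):

    import yappi

    def run_scenario():
        # placeholder for the real work (e.g. starting a batch of VMs)
        sum(i * i for i in xrange(1000000))

    yappi.set_clock_type('cpu')  # 'wall' would also account blocked time
    yappi.start(builtins=False)  # profiles all threads by default
    run_scenario()
    yappi.stop()
    # top functions by time spent in the function itself, plus a
    # per-thread breakdown
    yappi.get_func_stats().sort('tsub').print_all()
    yappi.get_thread_stats().print_all()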

It is not clear whether yappi (or any Python profiler) can give us a deep
enough understanding of how threads interact (or misbehave) with each other
and with the GIL, and of where we waste time.
Exploring a system-wide profiler, like sysprof or oprofile, may be a useful
next step.


Improvements in profile/results collection
------------------------------------------

In this round I focused on the hotspots we found in the previous round, and
on improving the benchmark tool to make it more reliable.

The scenario is now run 32 times and the results are averaged.
We define the startup time as T_up - T_start, where:
T_start: time of the submission of the create command to the engine through
the REST API
T_up: time at which the VM is reported as Up by the engine

The purpose is to model what a user will see in a real scenario.
Results follow. The scripts and the raw CSV data are available, lacking a
better place, at:
https://github.com/mojaves/ovirt-tools/tree/master/benchmark

Considering the test results of March 24:

baseline data: vanilla VDSM
$ lsres.py 20140324/*/*.csv
20140324/baseline/bench_20140324_102249.csv:
mean:   33.037s sd=2.133s (6.5%)
best:   14.181s sd=1.880s (13.3%)
worst:  50.356s sd=2.713s (5.4%)
total:  1057.188s sd=68.249s (6.5%)

sd is the standard deviation, considering one sample per run (32 in this case).

We take one sample per run of each of the following:
mean: mean of the startup times in the run
best: the best startup time in the run (fastest VM)
worst: the worst startup time in the run (slowest VM)
total: sum of all the startup times in the run
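
In code, the per-run summary boils down to something like the sketch below
(a simplified version of what lsres.py computes; CSV parsing is omitted and
the data layout is assumed):

    import math

    def mean_sd(samples):
        # mean and sample standard deviation of one metric across runs
        n = len(samples)
        m = sum(samples) / float(n)
        sd = math.sqrt(sum((s - m) ** 2 for s in samples) / (n - 1))
        return m, sd

    def summarize(runs):
        # runs: for each run, the list of per-VM startup times (T_up - T_start)
        metrics = {
            'mean':  [sum(r) / float(len(r)) for r in runs],
            'best':  [min(r) for r in runs],
            'worst': [max(r) for r in runs],
            'total': [sum(r) for r in runs],
        }
        for name in ('mean', 'best', 'worst', 'total'):
            m, sd = mean_sd(metrics[name])
            print('%-6s %.3fs sd=%.3fs (%.1f%%)'
                  % (name + ':', m, sd, 100.0 * sd / m))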

Now let's consider the impact of the performance patches

Applying the cPickle patch: http://gerrit.ovirt.org/#/c/25860/
20140324/cpickle/bench_20140324_115215.csv:
mean:   30.645s sd=2.422s (7.9%)
best:   13.048s sd=4.302s (33.0%)
worst:  46.404s sd=2.114s (4.6%)
total:  980.655s sd=77.510s (7.9%)

The improvement is:
* hard to call for the best case (the 33% sd dwarfs the ~8% gain)
* roughly 7% for the mean
* roughly 8% for the worst
* roughly 7% for the total
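
(For reference, and without pasting the actual patch: this kind of change
usually boils down to the classic guarded import below; Python 3 later
folded the C implementation into the pickle module itself.)

    # use the C pickle implementation when available
    try:
        import cPickle as pickle
    except ImportError:
        import pickle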

On top of cPickle, we add XML caching: http://gerrit.ovirt.org/#/c/17694/

20140324/xmlcache/bench_20140324_125232.csv:
mean:   27.630s sd=1.242s (4.5%)
best:   11.320s sd=1.224s (10.8%)
worst:  41.554s sd=1.873s (4.5%)
total:  884.155s sd=39.745s (4.5%)

The improvement (relative to the cPickle run) is:
* roughly 10% for the mean
* roughly 13% for the best
* roughly 10% for the worst
* roughly 10% for the total
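
I will not summarize the patch here, but the general idea of XML caching can
be illustrated by a tiny memoization layer (an illustration only, not the
code on gerrit):

    from xml.dom import minidom

    _dom_cache = {}

    def parse_cached(xml_string):
        # reuse the parsed DOM when the very same XML string shows up
        # again, instead of re-parsing it every time
        dom = _dom_cache.get(xml_string)
        if dom is None:
            dom = _dom_cache[xml_string] = minidom.parseString(xml_string)
        return dom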

Given that both patches affect the most basic creation flow, and are
therefore beneficial in all possible flows/scenarios, I think we have some
tangible benefits here.

During the benchmarks I was quite concerned about the reliability and
repeatability of these tests, so I ran them over and over again (which is
one of the reasons it took longer than expected).

In particular, I ran some benchmarks again on March 25 (i.e. yesterday),
with these results:

20140325/baseline/bench_20140325_180500.csv:
mean:   27.984s sd=1.074s (3.8%)
best:   10.507s sd=1.604s (15.3%)
worst:  42.711s sd=1.996s (4.7%)
total:  895.479s sd=34.375s (3.8%)

20140325/cpickle/bench_20140325_185941.csv:
mean:   26.423s sd=1.413s (5.3%)
best:   9.785s sd=1.669s (17.1%)
worst:  40.833s sd=2.452s (6.0%)
total:  845.523s sd=45.218s (5.3%)

We can easily see that the absolute values are better (the baseline is now
close to the previous day's XML caching figures!).

The main change is that I rebased VDSM against yesterday's master, but since
no performance patches were merged in the meantime (at least none I was
aware of after the namedtuple fix), I think there are external factors in
play.

When I run benchmarks, the hypervisor and the engine hosts do nothing but
benchmarking, yet there are still many factors that can explain the variance
(for example, neither host is specifically tuned for benchmarking, so
daemons keep running in the background, and so on).

What I think matters most is that the gain from the cPickle patch is still there:
* roughly 5% for the mean
* roughly 7% for the best case
* roughly 5% for the worst case
* roughly 6% for the total

So I think we can move on to the next steps:
* add more scenarios
* add more test cases inside the scenarios
* add more metrics? (maybe as part of the test cases)
* integrate profile collection with benchmarking


Suggestions and comments are welcome.

Thanks,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani