----- Original Message -----
> From: "Francesco Romani" <from...@redhat.com>
> To: "vdsm-devel" <vdsm-devel@lists.fedorahosted.org>
> Sent: Wednesday, March 26, 2014 4:00:29 PM
> Subject: [vdsm] VDSM benchmarking and profiling, round 2
> 
> Hello everyone,
> 
> It took a bit longer than expected, for reasons I will explain below, but we
> have some more results and better tooling.
> 
> Points open from the last round (not in strict priority order):
> ---------------------------------------------------------------
> 
> 1. find a reliable application-wide profiling approach: profile the whole of
> VDSM, not just a specific function (_startUnderlyingVm now, something else
> in future runs)
> 2. make the profiling more consistent and reliable;
> do more runs (at least 10, the more the better); add the variance?
> 3. limit the profiling reports to the hotspots (the 32 most expensive calls)
> 4. show the callers to learn more about the wait calls
> 5. investigate where and why we spend time in StringIO
> 6. re-verify the impact of cPickle vs Pickle (see the sketch after this list)
> 7. benchmark BoundedSemaphore(s)
> 8. benchmark XML processing
> 9. add more scenarios, and more configurations in those scenarios
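> 
> For item 6, a rough micro-benchmark along these lines should be enough to
> re-verify the cPickle numbers (the payload below is only a made-up stand-in;
> the real comparison should use VDSM's actual conf dictionaries):
> 
>     import timeit
> 
>     # build the setup code as a plain string so timeit can exec it as-is
>     setup = (
>         "import pickle, cPickle\n"
>         "data = {'vmId': 'x' * 36,"
>         " 'devices': [{'type': 'disk', 'index': i} for i in range(64)]}\n"
>         "blob = pickle.dumps(data)\n"
>     )
> 
>     for mod in ('pickle', 'cPickle'):
>         t = timeit.timeit('%s.loads(blob)' % mod, setup=setup, number=10000)
>         print('%-8s %.3f s for 10000 loads' % (mod, t))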
> 
> 
> Quick summary of this round:
> ----------------------------
> 
> - added more user-visible metrics: time for a VM to come 'Up' from the moment
> it is started
> - benchmarking script is now automated and starting to be trustworthy
> - result data available as CSV files
> - existing patches on gerrit deliver improvement (6-7% for cPickle, ~10% for
> xml caching)
> 
> The testing scenario is still the same as in the previous round (the next step
> is to add more of them)
> 
> Please continue reading for more details.
> 
> 
> Application-wide profiling
> ---------------------------
> 
> From a Python perspective, it looks like yappi is still our best shot. The
> selling point of yappi is that it is designed to be low-overhead (at least,
> so they claim) and to work nicely with long-running multi-threaded daemons,
> like VDSM.
> 
> We have a nice patchset on gerrit to integrate yappi into VDSM, courtesy of
> Nir Soffer (http://gerrit.ovirt.org/26113).
> With this one we should be able to capture VDSM-wide profiles more easily. I
> am going to integrate my benchmark script(s) with it (see below).
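> 
> For the curious, the profiling itself boils down to wrapping a window of VDSM
> activity with the yappi calls, something like the sketch below (this is not a
> copy of the patchset, just the bare idea):
> 
>     import yappi
> 
>     def start_profiling():
>         # wall clock is usually what we want for a daemon that mostly waits
>         yappi.set_clock_type('wall')
>         yappi.start(builtins=True)
> 
>     def stop_profiling(path='/tmp/vdsm.prof'):
>         yappi.stop()
>         stats = yappi.get_func_stats()
>         # 'pstat' output can be read back with the stdlib pstats module
>         stats.save(path, type='pstat')
>         yappi.get_thread_stats().print_all()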
> 
> It is not clear if yappi (or any Python profiler) can help us to understand
> properly and deeply enough how threads interact (or misbehave) with each
> other and with the GIL, and where we waste time.
> Exploring a system-wide profiler, like sysprof or oprofile, may be a useful
> next step.
> 
> 
> Improvements in profile/results collection
> ------------------------------------------
> 
> In this round I focused on the hotspots we found in the previous round and on
> improving the benchmark tool to make it more reliable.
> 
> The scenario is now run 32 times and the results are averaged.
> We define the startup time as T_up - T_start, where:
> T_start: time the create command is submitted to the engine through the REST
> API
> T_up: time the VM is reported as Up by the engine
> 
> The purpose is to model what a user would see in a real scenario.
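> 
> In pseudo-Python, the per-VM measurement is just the following (start_vm and
> vm_is_up are hypothetical stand-ins for the REST calls the script actually
> makes):
> 
>     import time
> 
>     def measure_startup(vm_id):
>         t_start = time.time()        # T_start: create command submitted
>         start_vm(vm_id)              # hypothetical REST helper
>         while not vm_is_up(vm_id):   # poll the engine for the 'Up' state
>             time.sleep(1)
>         t_up = time.time()           # T_up: engine reports the VM as Up
>         return t_up - t_start
> 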
> Results (the scripts and the raw CSV data are here, lacking a better place:
> https://github.com/mojaves/ovirt-tools/tree/master/benchmark):
> 
> Considering the test results of March 24:
> 
> baseline data: vanilla VDSM
> $ lsres.py 20140324/*/*.csv
> 20140324/baseline/bench_20140324_102249.csv:
> mean:  33.037s    sd=2.133s   (6.5%)
> best:  14.181s    sd=1.880s   (13.3%)
> worst: 50.356s    sd=2.713s   (5.4%)
> total: 1057.188s  sd=68.249s  (6.5%)
> 
> sd is the standard deviation, considering one sample per run (32 in this case)
> 
> We consider one sample per run for each of the following:
> mean: mean of the startup times in the run
> best: the best startup time in the run (fastest VM)
> worst: the worst startup time in the run (slowest VM)
> total: sum of all the startup times in the run
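> 
> For reference, the per-metric aggregation done by lsres.py is nothing more
> than a plain mean/standard deviation over the 32 per-run samples, roughly
> like this (population sd shown; whether it divides by n or n-1 is a detail
> I am glossing over):
> 
>     import math
> 
>     def mean_sd(samples):
>         # one sample per run; sd is reported as absolute and as % of the mean
>         n = len(samples)
>         mean = sum(samples) / float(n)
>         var = sum((x - mean) ** 2 for x in samples) / float(n)
>         sd = math.sqrt(var)
>         return mean, sd, 100.0 * sd / mean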

I think we need a simpler way to evaluate progress (or regressions).

If the test is starting 32 VMs concurrently, let's focus on the time to
complete the test (all VMs are up) and the mean time to get one VM up.
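
Something like this, given the per-vm (t_start, t_up) pairs collected in one
run (names here are just for illustration):

    def run_metrics(times):
        # times: list of (t_start, t_up) pairs, one per vm in the run
        t_starts = [s for s, _ in times]
        t_ups = [u for _, u in times]
        test_time = max(t_ups) - min(t_starts)
        mean_up = sum(u - s for s, u in times) / float(len(times))
        return test_time, mean_up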

> 
> Now let's consider the impact of the performance patches
> 
> Applying the cPickle patch: http://gerrit.ovirt.org/#/c/25860/
> 20140324/cpickle/bench_20140324_115215.csv:
> mean:  30.645s   sd=2.422s   (7.9%)
> best:  13.048s   sd=4.302s   (33.0%)
> worst: 46.404s   sd=2.114s   (4.6%)
> total: 980.655s  sd=77.510s  (7.9%)
> 
> The improvement over the baseline, computed as (baseline - patched) / baseline
> (e.g. for the mean: (33.037 - 30.645) / 33.037 ~= 7%), is
> * roughly 7% for the mean
> * roughly 8% for the best, though within the noise given its 33% sd
> * roughly 8% for the worst
> * roughly 7% for the total
> 
> On top of cPickle, we add XML caching: http://gerrit.ovirt.org/#/c/17694/
> 
> 20140324/xmlcache/bench_20140324_125232.csv:
> mean:  27.630s   sd=1.242s   (4.5%)
> best:  11.320s   sd=1.224s   (10.8%)
> worst: 41.554s   sd=1.873s   (4.5%)
> total: 884.155s  sd=39.745s  (4.5%)
> 
> The improvement relative to the cPickle run is
> * roughly 10% for the mean
> * roughly 13% for the best
> * roughly 11% for the worst
> * roughly 10% for the total

Nice, that is roughly a 16% improvement from baseline so far.

> 
> Given that both patches are beneficial in all the possible flows/scenarios,
> because they affect the most basic creation flow, I think we have some
> tangible benefits here.
> 
> During the benchmarks I was quite concerned about the reliability and
> repeatability of those tests, so I ran them over and over again (that is one
> of the reasons it took longer than expected).
> 
> In particular, I ran some benchmarks again on March 25 (i.e. yesterday), with
> these results:
> 
> 20140325/baseline/bench_20140325_180500.csv:
> mean:  27.984s   sd=1.074s   (3.8%)
> best:  10.507s   sd=1.604s   (15.3%)
> worst: 42.711s   sd=1.996s   (4.7%)
> total: 895.479s  sd=34.375s  (3.8%)
> 
> 20140325/cpickle/bench_20140325_185941.csv:
> mean:  26.423s   sd=1.413s   (5.3%)
> best:  9.785s    sd=1.669s   (17.1%)
> worst: 40.833s   sd=2.452s   (6.0%)
> total: 845.523s  sd=45.218s  (5.3%)
> 
> We can easily see that the absolute values are better (this baseline is close
> to the previous day's XML-cache run!)

You probably need to run all the tests at the same time to compare results,
and maybe include a restart of the test machine before testing.

> 
> The main change is that I rebased VDSM against yesterday's master, but given
> that no performance patches have been merged (at least none I was aware of
> after the namedtuple fix), I think there are external factors in play.

You should probably discuss the methodology with the scale team.

Nir
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
