----- Original Message -----
> From: "Francesco Romani" <from...@redhat.com>
> To: "vdsm-devel" <vdsm-devel@lists.fedorahosted.org>
> Sent: Wednesday, March 19, 2014 10:33:51 AM
> Subject: [vdsm] VDSM profiling results, round 1
> 
> (sending again WITHOUT the attachments)
> 
> Hi everyone
> 
> I'd like to share the first round of profiling results for VDSM and my next
> steps.
> 
> Summary:
> - experimented with a couple of profiling approaches and found a good one
> - benchmarked http://gerrit.ovirt.org/#/c/25678/ : it is beneficial and was
>   merged
> - found a few low-hanging fruits which seem quite safe to merge and are
>   beneficial to *all* flows
> - started engagement with infra (see other thread) to have common and
>   polished performance tools
> - test roadmap is shaping up, wiki/ML will be updated in the coming days
> 
> Please read through for a more detailed discussion. Every comment is welcome.
> 
> Disclaimer:
> long mail, lots of content; please point out if something is missing, not
> clear enough, or deserves more discussion.
> 
> +++
> 
> == First round results ==
> 
> The first round of profiling was a follow-up of what I showed during the
> VDSM gathering.
> The results file contains a full profile ordered by descending time.
> In a nutshell: parallel start of 32 tiny VMs using the engine REST API and a
> single hypervisor host.
> 
> The VMs are tiny just because I want to stuff as many VMs as I can into my
> mini-Dell (16 GB RAM, 4 cores + HT).
> 
> It is worth pointing out a few differences with respect to the *profile*
> (NOT the graphs) I showed during the gathering:
> 
> - profile data is now collected using the profile decorator (see
> http://www.ovirt.org/Profiling_Vdsm)
>   just around Vm._startUnderlyingVm.

Profiling one function is misleading without the big picture. You should profile
the whole application before drilling down and checking one function.
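
A rough sketch of what I mean (the path and hook names are made up; note that
cProfile records only the thread that calls enable(), which is one reason
yappi below is interesting):

    # Sketch: one application-wide profiler, enabled at vdsmd startup and
    # dumped at shutdown, instead of decorating individual methods.
    # Caveat: cProfile records only the thread that calls enable().
    import cProfile

    _profiler = cProfile.Profile()

    def start_profiling():
        _profiler.enable()

    def stop_profiling(path='/var/run/vdsm/vdsmd.prof'):   # made-up path
        _profiler.disable()
        _profiler.dump_stats(path)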

> The gathering profile was obtained using
>   the yappi application-wide

I don't see such a profile - can you put it somewhere?

>   profiler (see https://code.google.com/p/yappi/) and 40 VMs.
>   * why yappi?
>     I thought an application-wide profiler gathers more information and lets
>     us have a better picture.
>     I actually still think that, but I hit some yappi misbehaviour which I
>     want to fix later;
>     for now the function-level profile is easier to collect (just grab the
>     data dumped to file).
>   * why 40 VMs?
>     I started with 64 but exhausted my storage backing store :)
>     I will add more storage space in the coming days; for the moment I
>     stepped back to 32.
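
For reference, the kind of application-wide yappi session I have in mind is
roughly this (untested sketch; run_the_scenario is a placeholder for the
mass-start test):

    # yappi profiles all threads, so one session covers the whole application.
    import yappi

    yappi.start()                           # start profiling every thread
    run_the_scenario()                      # placeholder: mass-start the 32 VMs
    yappi.stop()
    yappi.get_func_stats().print_all()      # per-function statistics
    yappi.get_thread_stats().print_all()    # per-thread statistics
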
> 
> It is worth noting that while the numbers change a bit (if you remember the
> old profile data and the scary 80 seconds wasted on namedtuple), the
> suspects are the same and the relative positions are roughly the same.
> So I believe our initial findings (the namedtuple patch) and the plan are
> still valid.
> 
> == how it was done ==
> 
> I am still focusing just on the "Monday morning" scenario (mass start of many
> VMs at the same time).
> Each run consisted of a parallel start of 32 VMs as described in the result
> data.
> VDSM was restarted between runs.
> Engine was *NOT* restarted between runs.
> Individual profiles were gathered after all the runs, and the final profile
> was extracted by aggregating them.

How did you aggregate the profiles?
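
If it was something like the stdlib pstats aggregation below, please confirm
(the paths are made up):

    # Guess: aggregating the per-VM profile dumps with the stdlib pstats module.
    import glob
    import pstats

    dumps = glob.glob('/tmp/vm-profiles/*.prof')    # made-up location of the dumps
    stats = pstats.Stats(dumps[0])
    for dump in dumps[1:]:
        stats.add(dump)                             # merge the remaining dumps
    stats.dump_stats('aggregate.prof')              # one merged profile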

> 
> Profile dumps are available to everyone; just drop me a note and I'll put the
> tarball somewhere.
> 
> Please find the profile data attached in txt format. For easier consumption,
> it is also available on pastebin:
> 
> baseline      : http://paste.fedoraproject.org/86318/
> namedtuple fix: http://paste.fedoraproject.org/86378/
> pickle fix    : http://paste.fedoraproject.org/86600/ (see below)

Please limit the number of calls in print_stats to 20 or 30. Showing everything
just makes it harder to compare, and leads to optimizing unimportant stuff. We
want to optimize only the hotspots.
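
Something like this when post-processing the dumps (the dump name is made up):

    # Post-processing a dump: show only the 30 most expensive entries.
    import pstats

    stats = pstats.Stats('vdsm-baseline.prof')      # made-up dump name
    stats.sort_stats('time').print_stats(30)        # top 30 by internal time
    stats.sort_stats('cumulative').print_stats(30)  # top 30 by cumulative time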

> 
> == hotspots ==
> 
> The baseline profile data highlights five major areas and hotspots:
> 
> 1. internal concurrency (possible patch: http://gerrit.ovirt.org/#/c/25857/ -
> see below)

If you also add print_callers(30) to the profiles, we may learn more about
these wait calls.
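
E.g. (again with a made-up dump name):

    # Same kind of pstats object, but show who calls the expensive functions.
    import pstats

    stats = pstats.Stats('vdsm-baseline.prof')      # made-up dump name
    stats.sort_stats('time').print_callers(30)      # callers of the 30 hottest entries
    stats.print_callers('wait')                     # or restrict by name, e.g. the wait calls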

> 2. libvirt

This is the main issue - we should invest time in this.

> 3. XML processing (initial patch: http://gerrit.ovirt.org/#/c/17694/)

Consider using cElementTree:
http://effbot.org/zone/celementtree.htm
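
A minimal sketch of the kind of change I mean (disk_devices and the lookup
path are just an illustration, not the actual vdsm code):

    # Sketch: parsing a domain XML with cElementTree instead of minidom.
    # cElementTree implements the ElementTree API in C and is much faster.
    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET

    def disk_devices(domain_xml):
        # domain_xml: the string returned by virDomainGetXMLDesc (placeholder name)
        root = ET.fromstring(domain_xml)
        return [disk.get('device') for disk in root.findall('devices/disk')]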

> 4. namedtuple (patch: http://gerrit.ovirt.org/#/c/25678/ - fixed, merged)
> 5. pickling (patch: http://gerrit.ovirt.org/#/c/25860/ - see below)

These are much more expensive than pickle:

   452311   39.222    0.000   41.883    0.000 /usr/lib64/python2.6/StringIO.py:208(write)
     7299   38.946    0.005   38.946    0.005 {built-in method acquire}

StringIO should be easy to replace with cStringIO (assuming that we write only
byte strings - no unicode).
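
I.e. the usual guarded import (sketch):

    # cStringIO is a drop-in replacement as long as we only write byte
    # strings (it cannot store non-ASCII unicode).
    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO

    buf = StringIO()
    buf.write('<domain type="kvm"/>')   # byte string: fine for cStringIO
    xml = buf.getvalue()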

acquire will probably be hard to fix; showing its callers may be interesting.

> 
> #4 is beneficial in the iSCSI path and was already merged.
> #1 shows some potential but needs to be carefully evaluated to avoid
> performance regressions in different scenarios (e.g. bigger machines than
> mine :))
> #2 is basically outside of our control but needs to be watched
> #3 and #5 are beneficial for all flows and scenarios and are safe to merge.
> #5 is almost a no-brainer IMO
> 
> == Note about the third profile ==
> 
> When profiling the cPickle patch (http://paste.fedoraproject.org/86600/),
> the tests actually turned out *slower* with respect to the second profile,
> which had just the namedtuple patch.
> 
> The hotspots seem to be around concurrency and libvirt:
> location                                      profile2(s)    profile3(s)    diff(s)
> pthread.py:129(wait)                          1230.640       1377.992       +147.28  (BAD)
> virDomainCreateXML                            155.171        175.681        +20.51   (BAD)
> 'select.epoll' objects                        52.523         53.635         +1.112   (negligible)
> expatbuilder.py:743(start_element_handler)    28.172         33.975         +5.803   (BAD?)
> virDomainGetXMLDesc                           23.947         23.217         -0.73    (negligible)

Looking at the number of calls, it is clear that you are not 
comparing the same thing:

baseline:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      330 1258.791    3.815 1258.793    3.815 /usr/lib/python2.6/site-packages/pthread.py:129(wait)
      157  240.347    1.531  240.347    1.531 {libvirtmod.virDomainCreateXML}
     6153   89.501    0.015   89.501    0.015 {method 'poll' of 'select.epoll' objects}
   259014   63.736    0.000  113.172    0.000 /usr/lib64/python2.6/xml/dom/expatbuilder.py:743(start_element_handler)
      700   50.530    0.072   50.530    0.072 {libvirtmod.virDomainGetXMLDesc}
    21831   43.144    0.002  108.839    0.005 /usr/lib64/python2.6/collections.py:223(namedtuple)
  1659132   39.388    0.000   39.651    0.000 /usr/lib64/python2.6/collections.py:253(<genexpr>)
   452311   39.222    0.000   41.883    0.000 /usr/lib64/python2.6/StringIO.py:208(write)
     7299   38.946    0.005   38.946    0.005 {built-in method acquire}

namedtuple + pickle:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      179 1377.992    7.698 1377.993    7.698 /usr/lib/python2.6/site-packages/pthread.py:129(wait)
       96  175.681    1.830  175.681    1.830 {libvirtmod.virDomainCreateXML}
     3697   53.635    0.015   53.635    0.015 {method 'poll' of 'select.epoll' objects}
   141120   33.975    0.000   59.975    0.000 /usr/lib64/python2.6/xml/dom/expatbuilder.py:743(start_element_handler)
      384   23.217    0.060   23.217    0.060 {libvirtmod.virDomainGetXMLDesc}
   249397   22.812    0.000   24.503    0.000 /usr/lib64/python2.6/StringIO.py:208(write)
       96   18.541    0.193   18.541    0.193 {libvirtmod.virDomainSetSchedulerParameters}
547584/9408   18.108    0.000   18.112    0.002 /usr/lib64/python2.6/xml/dom/minidom.py:305(_get_elements_by_tagName_helper)
   348288   13.614    0.000   13.723    0.000 /usr/lib64/python2.6/xml/dom/minidom.py:281(_append_child)
   199584   13.472    0.000   20.935    0.000 /usr/lib64/python2.6/xml/dom/expatbuilder.py:274(character_data_handler_cdata)
   254400   13.368    0.000   13.470    0.000 /usr/lib64/python2.6/xml/dom/minidom.py:349(__init__)
     4213   11.543    0.003   11.543    0.003 {built-in method acquire}

Note: 157 calls to libvirtmod.virDomainCreateXML in first profile vs 96
in the second. 700 calls to virDomainGetXMLDesc in the first vs 384 in
the second.

You should start the profiler before you start the test, and stop it when
the test is done. The approach of profiling one function run concurrently
on different threads and aggregating the profiles seems to give wrong results.

I would use the profiler to find the hotspots and compare the performance
by timing the whole operation without the profiler.

You should do multiple runs (at least 10) and report the average and variance
for the whole operation.
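
Something along these lines (rough sketch; run_test stands for the whole
mass-start flow):

    # Time the whole operation N times without the profiler, report mean and variance.
    import time

    def benchmark(run_test, runs=10):
        timings = []
        for _ in range(runs):
            start = time.time()
            run_test()                      # e.g. start the 32 VMs and wait until they are up
            timings.append(time.time() - start)
        mean = sum(timings) / len(timings)
        variance = sum((t - mean) ** 2 for t in timings) / len(timings)
        return mean, variance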

> 
> I'm OK with some variance (it is expected) but this is also a warning sign to
> be extra careful in tuning the concurrency patch (bullet point #1 above). We
> should definitely evaluate more scenarios before merging it.
> 
> If we factor out those diffs, we see the cPickle patch has the (small)
> benefits we expect, and I think it is 100% safe to merge. I already did some
> minimal extra verification just in case.
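
For the record - I have not checked that the patch does exactly this, but the
usual swap looks like:

    # Usual stdlib swap: cPickle is a C implementation of the same pickle API.
    try:
        import cPickle as pickle
    except ImportError:
        import pickle

    blob = pickle.dumps({'vmId': 'placeholder'}, pickle.HIGHEST_PROTOCOL)
    data = pickle.loads(blob)
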
> 
> == Next steps ==
> 
> For the near term (the coming days/next weeks):
> * benchmark the remaining easy fixes which are beneficial for all flows
>   and quite safe to merge (XML processing being first), and work to have
>   them merged.
> * polish scripts and benchmarking code, start to submit to infra for review
> * continue the investigation of our (in)famous BoundedSemaphore
>   (http://gerrit.ovirt.org/#/c/25857/)
>   to see if dropping it causes regressions or other bad effects
> * find other test scenarios
> 
> I have also noted all the suggestions received so far, and I am planning more
> test cases just for this scenario.
> 
> For example:
> 1. just start N QEMUs to obtain our lower bound (we cannot get faster than
> this)
> 2. run with different storage (NFS)
> 3. run with no storage
> 4. run with Guest OS installed on disks
> 
> And of course we need more scenarios.
> Let me just repeat myself: those are just the first steps of a long journey.
> 
> 
> --
> Francesco Romani
> RedHat Engineering Virtualization R & D
> Phone: 8261328
> IRC: fromani
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
