> pthread.py:129(wait)    1230.640    1377.992    +147.28 (BAD)

The threadpool would just get stuck on wait() if there are no tasks, since Queues use Conditions internally.
This might explain why the average wait time is so long.

----- Original Message -----
> From: "Francesco Romani" <from...@redhat.com>
> To: "vdsm-devel" <vdsm-devel@lists.fedorahosted.org>
> Sent: Wednesday, March 19, 2014 10:33:51 AM
> Subject: [vdsm] VDSM profiling results, round 1
>
> (sending again WITHOUT the attachments)
>
> Hi everyone
>
> I'd like to share the first round of profiling results for VDSM and my next
> steps.
>
> Summary:
> - experimented with a couple of profiling approaches and found a good one
> - benchmarked http://gerrit.ovirt.org/#/c/25678/ : it is beneficial, was
>   merged
> - found a few low-hanging fruits which seem quite safe to merge and
>   beneficial to *all* flows
> - started engagement with infra (see other thread) to have common and
>   polished performance tools
> - test roadmap is shaping up, wiki/ML will be updated in the coming days
>
> Please read through for a more detailed discussion. Every comment is welcome.
>
> Disclaimer:
> long mail, lot of content, please point out if something is missing or not
> clear enough or if it deserves more discussion.
>
> +++
>
> == First round results ==
>
> The first round of profiling was a follow-up of what I showed during the VDSM
> gathering.
> The results file contains a full profile ordered by descending time.
> In a nutshell: parallel start of 32 tiny VMs using engine REST API and a
> single hypervisor host.
>
> VMs are tiny just because I want to stuff as many VMs as I can in my mini-dell
> (16 GB ram, 4 core + HT CPUs)
>
> It is worth pointing out a few differences with respect to the *profile* (NOT
> the graphs) I showed during the gathering:
>
> - profile data is now collected using the profile decorator (see
>   http://www.ovirt.org/Profiling_Vdsm)
>   just around Vm._startUnderlyingVm. The gathering profile was obtained using
>   the yappi application-wide profiler (see https://code.google.com/p/yappi/)
>   and 40 VMs.
>   * why yappi?
>     I thought an application-wide profiler gathers more information and lets
>     us have a better picture.
>     I actually still think that, but I faced some yappi misbehaviour which I
>     want to fix later;
>     function-level profile so far is easier to collect (just grab the data
>     dumped to file).
>   * why 40 VMs?
>     I started with 64 but exhausted my storage backing store :)
>     Will add more storage space in the next days, for the moment I stepped
>     back to 32.
>
> It is worth noting that while on one hand the numbers change a bit (if you
> remember the old profile data and the scary 80secs wasted on namedtuple),
> on the other hand the suspects are the same and the relative positions are
> roughly the same.
> So I believe our initial findings (namedtuple patch) and the plan are still
> valid.
>
> == how it was done ==
>
> I am still focusing just on the "monday morning" scenario (mass start of many
> VMs at the same time).
> Each run consisted of a parallel start of 32 VMs as described in the result
> data.
> VDSM was restarted between one run and another.
> engine was *NOT* restarted between runs.
> individual profiles have been gathered after all the runs and the profile was
> extracted from the aggregation of them.
>
> profile dumps are available to everyone, just drop me a note and I'll put the
> tarball somewhere.
>
> please find attached the profile data in txt format. For easier consumption,
> they are also available on pastebin:
>
> baseline      : http://paste.fedoraproject.org/86318/
> namedtuple fix: http://paste.fedoraproject.org/86378/
> pickle fix    : http://paste.fedoraproject.org/86600/ (see below)
>
> == hotspots ==
>
> the baseline profile data highlights five major areas and hotspots:
>
> 1. internal concurrency (possible patch: http://gerrit.ovirt.org/#/c/25857/ -
>    see below)
> 2. libvirt
> 3. XML processing (initial patch: http://gerrit.ovirt.org/#/c/17694/)
> 4. namedtuple (patch: http://gerrit.ovirt.org/#/c/25678/ - fixed, merged)
> 5. pickling (patch: http://gerrit.ovirt.org/#/c/25860/ - see below)
>
> #4 is beneficial in the iSCSI path and it was already merged.
> #1 shows some potential but it needs to be carefully evaluated to avoid
> performance regressions on different scenarios (e.g. bigger machines than
> mine :))
> #2 is basically outside of our control but it needs to be watched
> #3 and #5 are beneficial for all flows and scenarios and are safe to merge.
> #5 is almost a no-brainer IMO
>
> == Note about the third profile ==
>
> When profiling the cPickle patch http://paste.fedoraproject.org/86600/
> the tests turned out actually *slower* with respect to the second profile with
> just the namedtuple patch.
>
> The hotspots seem to be around concurrency and libvirt:
>
> location                                    profile2(s)  profile3(s)  diff(s)
> pthread.py:129(wait)                        1230.640     1377.992     +147.28 (BAD)
> virDomainCreateXML                          155.171      175.681      +20.51 (BAD)
> 'select.epoll' objects                      52.523       53.635       +1.112 (negligible)
> expatbuilder.py:743(start_element_handler)  28.172       33.975       +5.803 (BAD?)
> virDomainGetXMLDesc                         23.947       23.217       -0.73 (negligible)
>
> I'm OK with some variance (it is expected) but this is also a warning sign to
> be extra careful in tuning the concurrency patch (bullet point #1 above).
> We should definitely evaluate more scenarios before merging it.
>
> If we leave out those diffs, we see the cPickle patch has the (small) benefits
> we expect, and I think it is 100% safe to merge. I already did some minimal
> extra verification just in case.
>
> == Next steps ==
>
> For the near term (the coming days/next weeks)
> * benchmark the remaining easy fixes which are beneficial for all flows
>   and quite safe to merge (XML processing being first) and work to have them
>   merged.
> * polish scripts and benchmarking code, start to submit to infra for review
> * continue investigation about our (in)famous BoundedSemaphore
>   (http://gerrit.ovirt.org/#/c/25857/)
>   to see if dropping it has regressions or other bad effects
> * find other test scenarios
>
> I have also noted all the suggestions received so far and I am planning more
> test cases just for this scenario.
>
> For example:
> 1. just start N QEMUs to obtain our lower bound (we cannot get faster than
>    this)
> 2. run with different storage (NFS)
> 3. run with no storage
> 4. run with Guest OS installed on disks
>
> And of course we need more scenarios.
> Let me just repeat myself: those are just the first steps of a long journey.
>
>
> --
> Francesco Romani
> RedHat Engineering Virtualization R & D
> Phone: 8261328
> IRC: fromani
> _______________________________________________
> vdsm-devel mailing list
> vdsm-devel@lists.fedorahosted.org
> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
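
To illustrate the point at the top about idle workers sitting in wait(): below is a minimal Python 3 sketch, not VDSM code. It assumes a worker that pulls tasks from a standard library queue.Queue, whose get() blocks on an internal Condition while the queue is empty. The names worker, run_demo and the stats dict are made up for the example, and the standard library queue only stands in for whatever the VDSM thread pool really uses.

```python
import queue
import threading
import time


def worker(tasks, stats):
    """Run tasks until a None sentinel arrives, timing blocked vs. busy periods."""
    while True:
        started = time.monotonic()
        task = tasks.get()  # blocks while the queue is empty
        stats["blocked"] += time.monotonic() - started
        if task is None:
            break
        started = time.monotonic()
        task()
        stats["busy"] += time.monotonic() - started


def run_demo(idle_seconds=0.3):
    """Start one worker, leave it idle for a while, then feed it one tiny task."""
    tasks = queue.Queue()
    stats = {"blocked": 0.0, "busy": 0.0}
    thread = threading.Thread(target=worker, args=(tasks, stats))
    thread.start()
    time.sleep(idle_seconds)             # nothing submitted: the worker waits in get()
    tasks.put(lambda: sum(range(1000)))  # one short task
    tasks.put(None)                      # sentinel: tell the worker to exit
    thread.join()
    return stats


if __name__ == "__main__":
    result = run_demo()
    print("blocked in get(): %.3fs, running tasks: %.3fs"
          % (result["blocked"], result["busy"]))
```

In this sketch almost all of the worker's wall-clock time is spent blocked in get(). If the profiler counts wall-clock time, a large total under wait() could therefore simply reflect idle threads and not contention, which would be worth ruling out before tuning the concurrency patch.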