>>> On 13.02.18 at 12:36, <jgr...@suse.com> wrote:
> On 12/02/18 18:54, Dario Faggioli wrote:
>> On Fri, 2018-02-09 at 15:01 +0100, Juergen Gross wrote:
>>> This series is available via github:
>>>
>>> https://github.com/jgross1/xen.git xpti
>>>
>>> Dario wants to do some performance tests for this series to compare
>>> performance with Jan's series with all optimizations posted.
>>>
>> And some of this is indeed ready.
>>
>> So, this is again on my testbox, with 16 pCPUs and 12GB of RAM, and I
>> used a guest with 16 vCPUs and 10GB of RAM.
>>
>> I benchmarked Jan's patch *plus* all the optimizations and overhead
>> mitigation patches he posted on xen-devel (the ones that are already in
>> staging, and also the ones that are not yet there). That's "XPTI-Light"
>> in the table and in the graphs. Booting this with 'xpti=false' is
>> considered the baseline, while booting with 'xpti=true' is the actual
>> thing we want to measure. :-)
>>
>> Then I ran the same benchmarks on Juergen's branch above, enabled at
>> boot. That's "XPYI" in the table and graphs (yes, I know, sorry for the
>> typo!).
>>
>> http://openbenchmarking.org/result/1802125-DARI-180211144
>>
>> http://openbenchmarking.org/result/1802125-DARI-180211144&obr_hgv=XPTI-Light+xpti%3Dfalse&obr_nor=y&obr_hgv=XPTI-Light+xpti%3Dfalse
>
> ...
>
>> Or, actually, that's not it! :-O In fact, right while I was writing
>> this report, it came out on IRC that something can be done, on
>> Juergen's XPTI series, to mitigate the performance impact a bit.
>>
>> Juergen sent me a patch already, and I'm re-running the benchmarks with
>> that applied. I'll let know how the results ends up looking like.
>
> It turned out the results are not basically different. So the general
> problem with context switches is still there (which I expected, BTW).
>
> So I guess the really bad results with benchmarks triggering a lot of
> vcpu scheduling show that my approach isn't going to fly, as the most
> probable cause for the slow context switches are the introduced
> serializing instructions (LTR, WRMSRs) which can't be avoided when we
> want to use per-vcpu stacks.
>
> OTOH the results of the other benchmarks showing some advantage over
> Jan's solution indicate there is indeed an aspect which can be improved.
>
> Instead of preferring one approach over the other I have thought about
> a way to use the best parts of each solution in a combined variant. In
> case nobody is feeling strong to pursue my current approach further I'd
> like to suggest the following scheme:
>
> - Whenever a L4 page table of the guest is in use on one physical cpu
>   only use the L4 shadow cache of my series in order to avoid having to
>   copy the L4 contents each time the hypervisor is left.
>
> - As soon as a L4 page table is being activated on a second cpu fall
>   back to use the per-cpu page table on that cpu (the cpu already using
>   the L4 page table can continue doing so).
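
For reference, the two quoted points can be read as a simple selection
rule. Below is a minimal sketch of that reading in C. It is only a
model: every name in it (guest_l4, l4_activate, l4_deactivate) is
hypothetical and taken from neither series, and it deliberately ignores
locking and any cross-CPU synchronization. It also assumes that the
shadow L4 is only handed out when no CPU at all is running on the guest
L4, which the proposal does not spell out.

```c
/* Hypothetical model of the proposed scheme; not code from either series. */
enum l4_mode { L4_SHADOW, L4_PERCPU };

struct guest_l4 {
    int shadow_cpu;      /* CPU running on the shadow L4, or -1 if none */
    unsigned int users;  /* number of CPUs currently running on this L4 */
};

/* Pick the root page table kind for `cpu` when it activates `l4`. */
static enum l4_mode l4_activate(struct guest_l4 *l4, int cpu)
{
    if (l4->users == 0) {
        /* First and only user: run on the shadow L4, no copying on exit. */
        l4->shadow_cpu = cpu;
        l4->users = 1;
        return L4_SHADOW;
    }

    /* The L4 is already in use elsewhere: fall back to the per-CPU copy. */
    l4->users++;
    return L4_PERCPU;
}

/* Drop the reference `cpu` holds on `l4`. */
static void l4_deactivate(struct guest_l4 *l4, int cpu)
{
    l4->users--;
    if (l4->shadow_cpu == cpu)
        l4->shadow_cpu = -1;  /* assumption: the shadow slot is simply released */
}
```

In this model the first CPU stays on the shadow L4 after a second one
joins, which is exactly the case the questions below are about.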

Would the first of these CPUs continue to run on the shadow L4 in that
case? If so, would there be no synchronization issues? If not, how do
you envision "telling" it to move to the per-CPU L4 (which, afaict,
includes knowing which vCPU / pCPU that is)?

> - Before activation of a L4 shadow page table it is modified to map the
>   per-cpu data needed in guest mode for the local cpu only.

I had been considering doing this in XPTI light for other purposes
too (for example it might be possible to short circuit the guest
system call path to get away without multiple page table switches).
We really first need to settle on how much we feel is safe to expose
while the guest is running. So far I've been under the impression
that people actually think we should further reduce exposed pieces
of code/data, rather than widen the "window".

> - Use INVPCID instead of %cr4 PGE toggling to speed up purging global
>   TLB entries (depending on the availability of the feature, of course).

That's something we should do independently of which XPTI model we'd
like to retain long term.

> - Use the PCID feature for being able to avoid purging TLB entries which
>   might be needed later (depending on hardware again).

Which first of all raises the question: Does PCID (other than the U
bit) prevent use of TLB entries in the wrong context? IOW is the
PCID check done early (during TLB lookup) rather than late (during
insn retirement)?

Jan
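
For readers less familiar with the two hardware features named in the
last two quoted points, here is a minimal sketch in C of the values
involved. It only shows how a %cr3 value carrying a PCID and the
"no flush" bit is composed, and lists the architectural INVPCID
invalidation types. The helper name make_cr3 and the macro names are
hypothetical, no privileged instruction is executed, and the sketch
says nothing about when the CPU performs the PCID check, which is the
open question above.

```c
#include <stdbool.h>
#include <stdint.h>

/* With CR4.PCIDE=1, CR3 bits 11:0 hold the PCID. */
#define CR3_PCID_MASK  0xfffULL
/* Bit 63 of the value written to CR3: do not invalidate entries of the new PCID. */
#define CR3_NOFLUSH    (1ULL << 63)

/* Architectural INVPCID types (value passed in the register operand). */
enum invpcid_type {
    INVPCID_INDIV_ADDR      = 0,  /* one linear address in one PCID */
    INVPCID_SINGLE_CONTEXT  = 1,  /* all non-global entries of one PCID */
    INVPCID_ALL_INCL_GLOBAL = 2,  /* everything, including global entries */
    INVPCID_ALL_NON_GLOBAL  = 3,  /* everything except global entries */
};

/* Hypothetical helper: build a CR3 value from an L4 address and a PCID. */
static uint64_t make_cr3(uint64_t l4_maddr, unsigned int pcid, bool noflush)
{
    /* Drop the low 12 bits of the address and put the PCID there instead. */
    uint64_t cr3 = (l4_maddr & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);

    if (noflush)
        cr3 |= CR3_NOFLUSH;

    return cr3;
}
```

Type 2 is the one that would stand in for toggling CR4.PGE, since it
also drops global entries.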