On 12/02/18 18:54, Dario Faggioli wrote:
> On Fri, 2018-02-09 at 15:01 +0100, Juergen Gross wrote:
>> This series is available via github:
>> https://github.com/jgross1/xen.git xpti
>> Dario wants to do some performance tests for this series to compare
>> performance with Jan's series with all optimizations posted.
> And some of this is indeed ready.
> So, this is again on my testbox, with 16 pCPUs and 12GB of RAM, and I
> used a guest with 16 vCPUs and 10GB of RAM.
> I benchmarked Jan's patch *plus* all the optimizations and overhead
> mitigation patches he posted on xen-devel (the ones that are already in
> staging, and also the ones that are not yet there). That's "XPTI-Light"
> in the table and in the graphs. Booting this with 'xpti=false' is
> considered the baseline, while booting with 'xpti=true' is the actual
> thing we want to measure. :-)
> Then I ran the same benchmarks on Juergen's branch above, enabled at
> boot. That's "XPYI" in the table and graphs (yes, I know, sorry for the
> Or, actually, that's not it! :-O In fact, right while I was writing
> this report, it came out on IRC that something can be done, on
> Juergen's XPTI series, to mitigate the performance impact a bit.
> Juergen sent me a patch already, and I'm re-running the benchmarks with
> that applied. I'll let you know how the results end up looking.
It turned out the results are basically not different. So the general
problem with context switches is still there (which I expected, BTW).
So I guess the really bad results with benchmarks triggering a lot of
vcpu scheduling show that my approach isn't going to fly, as the most
probable cause for the slow context switches is the introduced
serializing instructions (LTR, WRMSRs), which can't be avoided when we
want to use per-vcpu stacks.
OTOH the results of the other benchmarks showing some advantage over
Jan's solution indicate there is indeed an aspect which can be improved.
Instead of preferring one approach over the other I have thought about
a way to use the best parts of each solution in a combined variant. In
case nobody feels strongly about pursuing my current approach further,
I'd like to suggest the following scheme:
- Whenever an L4 page table of the guest is in use on one physical cpu
only, use the L4 shadow cache of my series in order to avoid having to
copy the L4 contents each time the hypervisor is left.
- As soon as an L4 page table is being activated on a second cpu, fall
back to using the per-cpu page table on that cpu (the cpu already using
the L4 page table can continue doing so).
- Before activation of a L4 shadow page table it is modified to map the
per-cpu data needed in guest mode for the local cpu only.
- Use INVPCID instead of %cr4 PGE toggling to speed up purging global
TLB entries (depending on the availability of the feature, of course).
- Use the PCID feature to avoid purging TLB entries which
might be needed later (depending on hardware again). I expect this
will help especially for cases where the guest often switches between
kernel and user mode. Whether we want 3 or 4 PCID values for each
guest address space has to be discussed: do we need 2 different Xen
variants for guest user and guest kernel (IOW: are there any problems
possible when the hypervisor is using a guest kernel's permission to
access guest data when the guest was running in user mode before
entering the hypervisor)?
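
To make the first two points and the PCID idea a bit more concrete, here
is a rough sketch in plain C. It is only a model of the decision logic,
not code from either series: struct l4_state, l4_activate(),
l4_deactivate() and make_cr3() are names made up for illustration. The
only hardware facts assumed are that, with CR4.PCIDE set, %cr3 bits 11:0
carry the PCID and bit 63 of the value written asks the cpu not to flush
the entries tagged with that PCID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU (-1)

/* Hypothetical per-L4 bookkeeping; not a structure from either series. */
struct l4_state {
    int shadow_owner;   /* cpu running on the L4 shadow, or NO_CPU */
};

enum l4_mode { USE_SHADOW_L4, USE_PERCPU_COPY };

/* Decide how 'cpu' runs a guest L4 it is about to activate. */
static enum l4_mode l4_activate(struct l4_state *s, int cpu)
{
    assert(cpu >= 0);

    if (s->shadow_owner == NO_CPU || s->shadow_owner == cpu) {
        /*
         * L4 in use on one cpu only: use the shadow. A real
         * implementation would patch the shadow here so that it maps
         * the per-cpu data of this cpu only.
         */
        s->shadow_owner = cpu;
        return USE_SHADOW_L4;
    }

    /*
     * A second cpu activates the same L4: fall back to the per-cpu
     * page table on that cpu. The first cpu keeps using the shadow.
     */
    return USE_PERCPU_COPY;
}

/* 'cpu' stops using the L4; release the shadow if it was the owner. */
static void l4_deactivate(struct l4_state *s, int cpu)
{
    if (s->shadow_owner == cpu)
        s->shadow_owner = NO_CPU;
}

#define CR3_PCID_MASK 0xfffULL
#define CR3_NOFLUSH   (1ULL << 63)

/* Compose a %cr3 value from an L4 machine address and a PCID. */
static uint64_t make_cr3(uint64_t l4_maddr, unsigned int pcid, bool noflush)
{
    return (l4_maddr & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK) |
           (noflush ? CR3_NOFLUSH : 0);
}
```

With this model the first cpu activating an L4 gets USE_SHADOW_L4, any
other cpu activating it meanwhile gets USE_PERCPU_COPY, and the shadow
becomes available again once its owner has deactivated the L4. Whether
3 or 4 PCID values are fed into something like make_cr3() per guest
address space is exactly the open question above.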