On Fri, 2018-02-09 at 15:01 +0100, Juergen Gross wrote:
> This series is available via github:
>
> https://github.com/jgross1/xen.git xpti
>
> Dario wants to do some performance tests for this series to compare
> performance with Jan's series with all optimizations posted.
>
And some of this is indeed ready.
So, this is again on my testbox, with 16 pCPUs and 12GB of RAM, and I used a guest with 16 vCPUs and 10GB of RAM.

I benchmarked Jan's patch *plus* all the optimizations and overhead mitigation patches he posted on xen-devel (the ones that are already in staging, and also the ones that are not yet there). That's "XPTI-Light" in the table and in the graphs. Booting this with 'xpti=false' is considered the baseline, while booting with 'xpti=true' is the actual thing we want to measure. :-)

Then I ran the same benchmarks on Juergen's branch above, enabled at boot. That's "XPYI" in the table and graphs (yes, I know, sorry for the typo!).

http://openbenchmarking.org/result/1802125-DARI-180211144
http://openbenchmarking.org/result/1802125-DARI-180211144&obr_hgv=XPTI-Light+xpti%3Dfalse&obr_nor=y&obr_hgv=XPTI-Light+xpti%3Dfalse

As far as the following benchmarks go:
- [disk] I/O benchmarks (like aio-stress, fio, iozone)
- compress/uncompress benchmarks
- sw building benchmarks
- system benchmarks (pgbench, nginx, most of the stress-ng cases)
- scheduling latency benchmarks (schbench)
the two approaches are very, very close. It may be said that 'XPTI-Light optimized' has, overall, still a little bit of an edge. But really, that varies from test to test, and most of the time it is marginal (either way).

System-V message passing and semaphores, as well as socket activity tests, together with the hackbench ones, seem to cause Juergen's XPTI serious problems, though.

With Juergen, we decided to dig into this a bit more. He hypothesized that, currently, (vCPU) context switching costs are high in his solution. Therefore, I went and checked (roughly) how many context switches occur in Xen during a few of the benchmarks.
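In case anyone wants to reproduce this: the "sched:" lines in the summary below come from Xen's performance counters. What follows is just a sketch of how they can be collected (assuming a hypervisor built with performance counters enabled, i.e., CONFIG_PERF_COUNTERS, in which case the 'P' and 'p' debug keys reset and dump the counters):

  # reset Xen's performance counters right before starting the benchmark
  xl debug-keys P
  # ... run the benchmark in the guest (e.g., one 30s stress-ng stressor) ...
  # dump the counters into the hypervisor console ring, then read them back
  xl debug-keys p
  xl dmesg | grep -E 'sched: (runs through scheduler|context switches)'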
Here's a summary.

******** stress-ng CPU ********
== XPTI
stress-ng: info: cpu 1795.71 bogo ops/s
sched: runs through scheduler 29822
sched: context switches 14391

== XPTI-Light
stress-ng: info: cpu 1821.60 bogo ops/s
sched: runs through scheduler 24544
sched: context switches 9128

******** stress-ng Memory Copying ********
== XPTI
stress-ng: info: memcpy 831.79 bogo ops/s
sched: runs through scheduler 22875
sched: context switches 8230

== XPTI-Light
stress-ng: info: memcpy 827.68 bogo ops/s
sched: runs through scheduler 23142
sched: context switches 8279

******** schbench ********
== XPTI
Latency percentiles (usec)
        50.0000th: 36672
        75.0000th: 79488
        90.0000th: 124032
        95.0000th: 154880
        *99.0000th: 232192
        99.5000th: 259328
        99.9000th: 332288
        min=0, max=568244
sched: runs through scheduler 25736
sched: context switches 10622

== XPTI-Light
Latency percentiles (usec)
        50.0000th: 37824
        75.0000th: 81024
        90.0000th: 127872
        95.0000th: 156416
        *99.0000th: 235776
        99.5000th: 271872
        99.9000th: 348672
        min=0, max=643999
sched: runs through scheduler 25604
sched: context switches 10741

******** hackbench ********
== XPTI
Running with 4*40 (== 160) tasks 250.707 s
sched: runs through scheduler 1322606
sched: context switches 1208853

== XPTI-Light
Running with 4*40 (== 160) tasks 60.961 s
sched: runs through scheduler 1680535
sched: context switches 1668358

******** stress-ng SysV Msg Passing ********
== XPTI
stress-ng: info: msg 276321.24 bogo ops/s
sched: runs through scheduler 25144
sched: context switches 10391

== XPTI-Light
stress-ng: info: msg 1775035.18 bogo ops/s
sched: runs through scheduler 33453
sched: context switches 18566

******** schbench -p ********
== XPTI
Latency percentiles (usec)
        50.0000th: 53
        75.0000th: 56
        90.0000th: 103
        95.0000th: 161
        *99.0000th: 1326
        99.5000th: 2172
        99.9000th: 4760
        min=0, max=124594
avg worker transfer: 478.63 ops/sec 1.87KB/s
sched: runs through scheduler 34161
sched: context switches 19556

== XPTI-Light
Latency percentiles (usec)
        50.0000th: 16
        75.0000th: 17
        90.0000th: 18
        95.0000th: 35
        *99.0000th: 258
        99.5000th: 424
        99.9000th: 1005
        min=0, max=110505
avg worker transfer: 1791.82 ops/sec 7.00KB/s
sched: runs through scheduler 41905
sched: context switches 27013

So, basically, the intuition seems to be confirmed. In fact, we see that, as long as the number of context switches happening during the specific benchmark stays below ~10k, Juergen's XPTI is fine, and on par with or better than Jan's XPTI-Light (see stress-ng:cpu, stress-ng:memorycopying, schbench). Above 10k, XPTI begins to suffer; and the more context switches there are, the worse it gets (e.g., see how badly it does in the hackbench case).

Note that, in the stress-ng:sysvmsg case, there are ~20k context switches with XPTI-Light, but only ~10k with XPTI. I believe that is because, with context switching being slower, the benchmark did less work within its 30s of execution. We can find confirmation of that in the schbench -p case, where the slowdown is evident in the average data transferred by the workers.

So, that's it for now. Thoughts are welcome. :-)

... Or, actually, that's not it! :-O

In fact, right while I was writing this report, it came out on IRC that something can be done, on Juergen's XPTI series, to mitigate the performance impact a bit. Juergen sent me a patch already, and I'm re-running the benchmarks with that applied. I'll let you know how the results end up looking.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/