Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-06-14 Thread Raghavendra K T

On 05/30/2012 04:56 PM, Raghavendra K T wrote:

On 05/16/2012 08:49 AM, Raghavendra K T wrote:

On 05/14/2012 12:15 AM, Raghavendra K T wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE).


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

[...]

To summarise,
with a 32 vcpu guest with nr thread=32 we get around 27% improvement. In
very low/undercommitted systems we may see a very small improvement or
a small, acceptable degradation (which it deserves).



For large guests, the current SPIN_THRESHOLD value, along with ple_window,
needed some research/experimentation.

[Thanks to Jeremy/Nikunj for inputs and help in result analysis ]

I started with the debugfs spinlock histograms, and ran experiments with 32
and 64 vcpu guests, for spin thresholds of 2k, 4k, 8k, 16k, and 32k, with
1vm/2vm/4vm, for kernbench, sysbench, ebizzy, and hackbench.
[ The spinlock histogram gives a logarithmic view of lock-wait times. ]

machine: PLE machine with 32 cores.

Here is the result summary.
The summary has two parts:
(1) %improvement w.r.t. the 2k spin threshold,
(2) %improvement w.r.t. the sum of the histogram counts in debugfs (which
gives a rough indication of contention/cpu time wasted).

For example, 98% for the 4k threshold in the kbench-1vm row would imply a
98% reduction in sigma(histogram values) compared to the 2k case.
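(As an illustration with made-up numbers: if the sum of histogram entries
at the 2k threshold were 50,000 and it dropped to 1,000 at 4k, that would
be a (50000 - 1000) * 100 / 50000 = 98% reduction.)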

Result for 32 vcpu guest
==
+----------------+-------+-------+-------+-------+
|        Base-2k |    4k |    8k |   16k |   32k |
+----------------+-------+-------+-------+-------+
|     kbench-1vm |    44 |    50 |    46 |    41 |
|  SPINHisto-1vm |    98 |    99 |    99 |    99 |
|     kbench-2vm |    25 |    45 |    49 |    45 |
|  SPINHisto-2vm |    31 |    91 |    99 |    99 |
|     kbench-4vm |   -13 |   -27 |    -2 |    -4 |
|  SPINHisto-4vm |    29 |    66 |    95 |    99 |
+----------------+-------+-------+-------+-------+
|     ebizzy-1vm |   954 |   942 |   913 |   915 |
|  SPINHisto-1vm |    96 |    99 |    99 |    99 |
|     ebizzy-2vm |   158 |   135 |   123 |   106 |
|  SPINHisto-2vm |    90 |    98 |    99 |    99 |
|     ebizzy-4vm |   -13 |   -28 |   -33 |   -37 |
|  SPINHisto-4vm |    83 |    98 |    99 |    99 |
+----------------+-------+-------+-------+-------+
|     hbench-1vm |    48 |    56 |    52 |    64 |
|  SPINHisto-1vm |    92 |    95 |    99 |    99 |
|     hbench-2vm |    32 |    40 |    39 |    21 |
|  SPINHisto-2vm |    74 |    96 |    99 |    99 |
|     hbench-4vm |    27 |    15 |     3 |   -57 |
|  SPINHisto-4vm |    68 |    88 |    94 |    97 |
+----------------+-------+-------+-------+-------+
|    sysbnch-1vm |     0 |     0 |     1 |     0 |
|  SPINHisto-1vm |    76 |    98 |    99 |    99 |
|    sysbnch-2vm |    -1 |     3 |    -1 |    -4 |
|  SPINHisto-2vm |    82 |    94 |    96 |    99 |
|    sysbnch-4vm |     0 |    -2 |    -8 |   -14 |
|  SPINHisto-4vm |    57 |    79 |    88 |    95 |
+----------------+-------+-------+-------+-------+

result for 64 vcpu guest
=
+----------------+-------+-------+-------+-------+
|        Base-2k |    4k |    8k |   16k |   32k |
+----------------+-------+-------+-------+-------+
|     kbench-1vm |     1 |   -11 |   -25 |    31 |
|  SPINHisto-1vm |     3 |    10 |    47 |    99 |
|     kbench-2vm |    15 |    -9 |   -66 |   -15 |
|  SPINHisto-2vm |     2 |    11 |    19 |    90 |
+----------------+-------+-------+-------+-------+
|     ebizzy-1vm |   784 |  1097 |   978 |   930 |
|  SPINHisto-1vm |    74 |    97 |    98 |    99 |
|     ebizzy-2vm |    43 |    48 |    56 |    32 |
|  SPINHisto-2vm |    58 |    93 |    97 |    98 |
+----------------+-------+-------+-------+-------+
|     hbench-1vm |     8 |    55 |    56 |    62 |
|  SPINHisto-1vm |    18 |    69 |    96 |    99 |
|     hbench-2vm |    13 |   -14 |   -75 |   -29 |
|  SPINHisto-2vm |    57 |    74 |    80 |    97 |
+----------------+-------+-------+-------+-------+
|    sysbnch-1vm |     9 |    11 |    15 |    10 |
|  SPINHisto-1vm |    80 |    93 |    98 |    99 |
|    sysbnch-2vm |     3 |     3 |     4 |     2 |
|  SPINHisto-2vm |    72 |    89 |    94 |    97 |
+----------------+-------+-------+-------+-------+

From this, a value around the 4k-8k threshold seems to be the optimal one
[this is almost in line with the ple_window default].
(The lower the spin threshold, the smaller the percentage of spinlock waits
we cover, which results in more halt exits/wakeups.)

[ www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical
detail on covering spinlock waits ]

Beyond the 8k threshold we see no more contention, but that would mean we
have wasted a lot of cpu time in busy-waiting.
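For reference, the knob being tuned here is the busy-wait bound in the PV
ticketlock slowpath, excerpted from the cover letter at the end of this
thread: a waiter spins for at most SPIN_THRESHOLD iterations before calling
__ticket_lock_spinning() and blocking, so a lower threshold means less
spinning but more halt exits/wakeups:

	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}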

I will get hold of a PLE machine again, and will continue experimenting with
further tuning of SPIN_THRESHOLD.


Sorry for the delayed response; I was doing a lot of analysis and
experiments.

I continued my experiments with the spin threshold. Unfortunately I could
not settle on which of the 4k/8k thresholds is better, since it depends on
the load and the type of workload.

Here is the result for a 32 vcpu guest, for sysbench and kernbench, with
four 8GB-RAM VMs on the same PLE machine, with:


1x: benchmark running on 1 guest
2x: same benchmark running on 2 guests, and so on

The 1x result is the average over 8*3 runs,
the 2x result over 4*3 runs,
the 3x result over 6*3 runs,
and the 4x result over 4*3 runs.


kernbench
=
total 

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-30 Thread Raghavendra K T

On 05/16/2012 08:49 AM, Raghavendra K T wrote:

On 05/14/2012 12:15 AM, Raghavendra K T wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE).


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

[...]

To summarise,
with a 32 vcpu guest with nr thread=32 we get around 27% improvement. In
very low/undercommitted systems we may see a very small improvement or
a small, acceptable degradation (which it deserves).



For large guests, the current SPIN_THRESHOLD value, along with ple_window,
needed some research/experimentation.


[Thanks to Jeremy/Nikunj for inputs and help in result analysis ]

I started with the debugfs spinlock histograms, and ran experiments with 32
and 64 vcpu guests, for spin thresholds of 2k, 4k, 8k, 16k, and 32k, with
1vm/2vm/4vm, for kernbench, sysbench, ebizzy, and hackbench.
[ The spinlock histogram gives a logarithmic view of lock-wait times. ]

machine: PLE machine  with 32 cores.

Here is the result summary.
The summary has two parts:
(1) %improvement w.r.t. the 2k spin threshold,
(2) %improvement w.r.t. the sum of the histogram counts in debugfs (which
gives a rough indication of contention/cpu time wasted).

For example, 98% for the 4k threshold in the kbench-1vm row would imply a
98% reduction in sigma(histogram values) compared to the 2k case.


Result for 32 vcpu guest
==
+----------------+-------+-------+-------+-------+
|        Base-2k |    4k |    8k |   16k |   32k |
+----------------+-------+-------+-------+-------+
|     kbench-1vm |    44 |    50 |    46 |    41 |
|  SPINHisto-1vm |    98 |    99 |    99 |    99 |
|     kbench-2vm |    25 |    45 |    49 |    45 |
|  SPINHisto-2vm |    31 |    91 |    99 |    99 |
|     kbench-4vm |   -13 |   -27 |    -2 |    -4 |
|  SPINHisto-4vm |    29 |    66 |    95 |    99 |
+----------------+-------+-------+-------+-------+
|     ebizzy-1vm |   954 |   942 |   913 |   915 |
|  SPINHisto-1vm |    96 |    99 |    99 |    99 |
|     ebizzy-2vm |   158 |   135 |   123 |   106 |
|  SPINHisto-2vm |    90 |    98 |    99 |    99 |
|     ebizzy-4vm |   -13 |   -28 |   -33 |   -37 |
|  SPINHisto-4vm |    83 |    98 |    99 |    99 |
+----------------+-------+-------+-------+-------+
|     hbench-1vm |    48 |    56 |    52 |    64 |
|  SPINHisto-1vm |    92 |    95 |    99 |    99 |
|     hbench-2vm |    32 |    40 |    39 |    21 |
|  SPINHisto-2vm |    74 |    96 |    99 |    99 |
|     hbench-4vm |    27 |    15 |     3 |   -57 |
|  SPINHisto-4vm |    68 |    88 |    94 |    97 |
+----------------+-------+-------+-------+-------+
|    sysbnch-1vm |     0 |     0 |     1 |     0 |
|  SPINHisto-1vm |    76 |    98 |    99 |    99 |
|    sysbnch-2vm |    -1 |     3 |    -1 |    -4 |
|  SPINHisto-2vm |    82 |    94 |    96 |    99 |
|    sysbnch-4vm |     0 |    -2 |    -8 |   -14 |
|  SPINHisto-4vm |    57 |    79 |    88 |    95 |
+----------------+-------+-------+-------+-------+

result for 64  vcpu guest
=
+----------------+-------+-------+-------+-------+
|        Base-2k |    4k |    8k |   16k |   32k |
+----------------+-------+-------+-------+-------+
|     kbench-1vm |     1 |   -11 |   -25 |    31 |
|  SPINHisto-1vm |     3 |    10 |    47 |    99 |
|     kbench-2vm |    15 |    -9 |   -66 |   -15 |
|  SPINHisto-2vm |     2 |    11 |    19 |    90 |
+----------------+-------+-------+-------+-------+
|     ebizzy-1vm |   784 |  1097 |   978 |   930 |
|  SPINHisto-1vm |    74 |    97 |    98 |    99 |
|     ebizzy-2vm |    43 |    48 |    56 |    32 |
|  SPINHisto-2vm |    58 |    93 |    97 |    98 |
+----------------+-------+-------+-------+-------+
|     hbench-1vm |     8 |    55 |    56 |    62 |
|  SPINHisto-1vm |    18 |    69 |    96 |    99 |
|     hbench-2vm |    13 |   -14 |   -75 |   -29 |
|  SPINHisto-2vm |    57 |    74 |    80 |    97 |
+----------------+-------+-------+-------+-------+
|    sysbnch-1vm |     9 |    11 |    15 |    10 |
|  SPINHisto-1vm |    80 |    93 |    98 |    99 |
|    sysbnch-2vm |     3 |     3 |     4 |     2 |
|  SPINHisto-2vm |   72  

Re: [Xen-devel] [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-15 Thread Jan Beulich
 On 07.05.12 at 19:25, Ingo Molnar mi...@kernel.org wrote:

(apologies for the late reply, the mail just now made it to my inbox
via xen-devel)

 I'll hold off on the whole thing - frankly, we don't want this 
 kind of Xen-only complexity. If KVM can make use of PLE then Xen 
 ought to be able to do it as well.

It does - for fully virtualized guests. For para-virtualized ones,
it can't (as the hardware feature is an extension to VMX/SVM).

 If both Xen and KVM makes good use of it then that's a different 
 matter.

I saw in a later reply that you're now tending towards trying it
out at least - thanks.

Jan



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-15 Thread Raghavendra K T

On 05/14/2012 12:15 AM, Raghavendra K T wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE).


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.



3 guests 8GB RAM, 1 used for kernbench
(kernbench -f -H -M -o 20) other for cpuhog (shell script with while
true do hackbench)

1x: no hogs
2x: 8hogs in one guest
3x: 8hogs each in two guest

kernbench on PLE:
Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
core, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 runs each (4*3=12), and the
stdev is calculated over the means reported in each run.


A): 8 vcpu guest

BASE BASE+patch %improvement w.r.t
mean (sd) mean (sd) patched kernel time
case 1*1x: 61.7075 (1.17872) 60.93 (1.475625) 1.27605
case 1*2x: 107.2125 (1.3821349) 97.506675 (1.3461878) 9.95401
case 1*3x: 144.3515 (1.8203927) 138.9525 (0.58309319) 3.8855


B): 16 vcpu guest
BASE BASE+patch %improvement w.r.t
mean (sd) mean (sd) patched kernel time
case 2*1x: 70.524 (1.5941395) 69.68866 (1.9392529) 1.19867
case 2*2x: 133.0738 (1.4558653) 124.8568 (1.4544986) 6.58114
case 2*3x: 206.0094 (1.3437359) 181.4712 (2.9134116) 13.5218

C): 32 vcpu guest
BASE BASE+patch %improvement w.r.t
mean (sd) mean (sd) patched kernel time
case 4*1x: 100.61046 (2.7603485) 85.48734 (2.6035035) 17.6905

It seems that while we do not see any improvement in the low-contention
case, the benefit becomes evident with overcommit and large guests. I am
continuing the analysis with other benchmarks (now with pgbench, to check
whether it has an acceptable improvement/degradation in the low-contention
case).

Here are the results for pgbench and sysbench. These results are on a
single guest.


Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores,
with 8 online cpus and 4*64GB RAM.

Guest config: 8GB RAM

pgbench
==

  unit=tps (higher is better)
  pgbench based on pgsql 9.2-dev:
http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilo)

  tool used to collect the benchmark: 
git://git.postgresql.org/git/pgbench-tools.git

  config: MAX_WORKER=16 SCALE=32 run for NRCLIENTS = 1, 8, 64

Average taken over 10 iterations.

 8 vcpu guest

 N     base      patch     improvement
 1     5271      5235      -0.687679
 8     37953     38202      0.651798
 64    37546     37774      0.60359


 16 vcpu guest

 N     base      patch     improvement
 1     5229      5239       0.190876
 8     34908     36048      3.16245
 64    51796     52852      1.99803
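(The improvement column above is in percent and is computed against the
patched throughput, e.g. (38202 - 37953) * 100 / 38202 gives the 0.651798
figure for the 8-client run on the 8 vcpu guest.)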

sysbench
==
sysbench 0.4.12 configured for the postgres driver, run with
sysbench --num-threads=8/16/32 --max-requests=10 --test=oltp 
--oltp-table-size=50 --db-driver=pgsql --oltp-read-only run

analysed with ministat, with
x patch
+ base

8 vcpu guest
---
1) num_threads = 8
N           Min           Max        Median           Avg        Stddev
x  10     20.7805       21.55       20.9667      21.03502    0.22682186
+  10      21.025      22.3122      21.29535      21.41793    0.39542349
Difference at 98.0% confidence
1.82035% +/- 1.74892%

2) num_threads = 16
N           Min           Max        Median           Avg        Stddev
x  10     20.8786      21.3967       21.1566      21.14441    0.15490983
+  10     21.3992      21.9437      21.46235      21.58724     0.2089425
Difference at 98.0% confidence
2.09431% +/- 0.992732%

3) num_threads = 32
N           Min           Max        Median           Avg        Stddev
x  10     21.1329      21.3726      21.33415       21.2893    0.08324195
+  10     21.5692      21.8966       21.6441      21.65679   0.093430003
Difference at 98.0% confidence
1.72617% +/- 0.474343%


16 vcpu guest
---
1) num_threads = 8
N           Min           Max        Median           Avg        Stddev
x  10     23.5314      25.6118      24.76145      24.64517    0.74856264
+  10     22.2675      26.6204       22.9131      23.50554      1.345386
No difference proven at 98.0% confidence

2) num_threads = 16
N           Min           Max        Median           Avg        Stddev
x  10     12.0095      12.2305      12.15575      12.13926   0.070872722
+  10      11.413      11.6986       11.4817        11.493   0.080007819
Difference at 98.0% confidence
-5.32372% +/- 0.710561%

3) num_threads = 32
N           Min           Max        Median           Avg        Stddev
x  10     12.1378      12.3567      12.21675      12.22703     0.0670695
+  10      11.573      11.7438       11.6306      11.64905   0.062780221
Difference at 98.0% confidence
-4.72707% +/- 0.606349%


32 vcpu guest
---
1) num_threads = 8
N           Min           Max        Median           Avg        Stddev
x  10   30.5602   41.4756

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-14 Thread Jeremy Fitzhardinge
On 05/13/2012 11:45 AM, Raghavendra K T wrote:
 On 05/07/2012 08:22 PM, Avi Kivity wrote:

 I could not come with pv-flush results (also Nikunj had clarified that
 the result was on NOn PLE

 I'd like to see those numbers, then.

 Ingo, please hold on the kvm-specific patches, meanwhile.


 3 guests 8GB RAM, 1 used for kernbench
 (kernbench -f -H -M -o 20) other for cpuhog (shell script with  while
 true do hackbench)

 1x: no hogs
 2x: 8hogs in one guest
 3x: 8hogs each in two guest

 kernbench on PLE:
 Machine : IBM xSeries with Intel(R) Xeon(R)  X7560 2.27GHz CPU with 32
 core, with 8 online cpus and 4*64GB RAM.

 The average is taken over 4 iterations with 3 run each (4*3=12). and
 stdev is calculated over mean reported in each run.


 A): 8 vcpu guest

             BASE                   BASE+patch             %improvement w.r.t
             mean (sd)              mean (sd)               patched kernel time
 case 1*1x:  61.7075   (1.17872)    60.93     (1.475625)    1.27605
 case 1*2x:  107.2125  (1.3821349)  97.506675 (1.3461878)   9.95401
 case 1*3x:  144.3515  (1.8203927)  138.9525  (0.58309319)  3.8855


 B): 16 vcpu guest
             BASE                   BASE+patch             %improvement w.r.t
             mean (sd)              mean (sd)               patched kernel time
 case 2*1x:  70.524    (1.5941395)  69.68866  (1.9392529)   1.19867
 case 2*2x:  133.0738  (1.4558653)  124.8568  (1.4544986)   6.58114
 case 2*3x:  206.0094  (1.3437359)  181.4712  (2.9134116)   13.5218

 C): 32 vcpu guest
             BASE                   BASE+patch             %improvement w.r.t
             mean (sd)              mean (sd)               patched kernel time
 case 4*1x:  100.61046 (2.7603485)  85.48734  (2.6035035)   17.6905

What does the 4*1x notation mean? Do these workloads have overcommit
of the PCPU resources?

When I measured it, even quite small amounts of overcommit led to large
performance drops with non-pv ticket locks (on the order of 10%
improvements when there were 5 busy VCPUs on a 4 cpu system).  I never
tested it on larger machines, but I guess that represents around 25%
overcommit, or 40 busy VCPUs on a 32-PCPU system.

J


Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-14 Thread Raghavendra K T

On 05/14/2012 01:08 PM, Jeremy Fitzhardinge wrote:

On 05/13/2012 11:45 AM, Raghavendra K T wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come with pv-flush results (also Nikunj had clarified that
the result was on NOn PLE


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.



3 guests 8GB RAM, 1 used for kernbench
(kernbench -f -H -M -o 20) other for cpuhog (shell script with  while
true do hackbench)

1x: no hogs
2x: 8hogs in one guest
3x: 8hogs each in two guest

kernbench on PLE:
Machine : IBM xSeries with Intel(R) Xeon(R)  X7560 2.27GHz CPU with 32
core, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 run each (4*3=12). and
stdev is calculated over mean reported in each run.


A): 8 vcpu guest

            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time
case 1*1x:  61.7075   (1.17872)    60.93     (1.475625)    1.27605
case 1*2x:  107.2125  (1.3821349)  97.506675 (1.3461878)   9.95401
case 1*3x:  144.3515  (1.8203927)  138.9525  (0.58309319)  3.8855


B): 16 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time
case 2*1x:  70.524    (1.5941395)  69.68866  (1.9392529)   1.19867
case 2*2x:  133.0738  (1.4558653)  124.8568  (1.4544986)   6.58114
case 2*3x:  206.0094  (1.3437359)  181.4712  (2.9134116)   13.5218

C): 32 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time
case 4*1x:  100.61046 (2.7603485)  85.48734  (2.6035035)   17.6905


What does the 4*1x notation mean? Do these workloads have overcommit
of the PCPU resources?

When I measured it, even quite small amounts of overcommit lead to large
performance drops with non-pv ticket locks (on the order of 10%
improvements when there were 5 busy VCPUs on a 4 cpu system).  I never
tested it on larger machines, but I guess that represents around 25%
overcommit, or 40 busy VCPUs on a 32-PCPU system.


All the above measurements are on a PLE machine. It is a single 32 vcpu
guest on 8 pcpus.

(PS: one problem I saw in my kernbench run itself is that the number of
threads spawned was 20 instead of 2 * number of vcpus. I'll correct that
during the next measurement.)

"even quite small amounts of overcommit lead to large performance drops
with non-pv ticket locks":

This is very much true on a non-PLE machine; a compilation there can
probably take a day versus just one hour. (With just 1:3x overcommit I had
got a 25x speedup.)



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-14 Thread Raghavendra K T

On 05/14/2012 10:27 AM, Nikunj A Dadhania wrote:

On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K 
Traghavendra...@linux.vnet.ibm.com  wrote:

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come with pv-flush results (also Nikunj had clarified that
the result was on NOn PLE


Did you see any issues on PLE?



No, I did not see issues in the setup, but I did not get time to check
that out yet.



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-13 Thread Raghavendra K T

On 05/07/2012 05:36 PM, Avi Kivity wrote:

On 05/07/2012 01:58 PM, Raghavendra K T wrote:

On 05/07/2012 02:02 PM, Avi Kivity wrote:

On 05/07/2012 11:29 AM, Ingo Molnar wrote:

(Less is better. Below is time elapsed in sec for x86_64_defconfig
(3+3 runs)).

         BASE                 BASE+patch          %improvement
         mean (sd)            mean (sd)
case 1x: 66.0566 (74.0304)  61.3233 (68.8299) 7.16552
case 2x: 1253.2 (1795.74)  131.606 (137.358) 89.4984
case 3x: 3431.04 (5297.26)  134.964 (149.861) 96.0664



You're calculating the improvement incorrectly.  In the last case, it's
not 96%, rather it's 2400% (25x).  Similarly the second case is about
900% faster.



The speedup calculation is clear.

I think the confusion for me was more because of the types of benchmarks.

I always did

|(patched - base)| * 100 / base

So, for

(1) "lower is better" benchmarks, the improvement calculation would be

|(patched - base)| * 100 / patched

e.g. for kernbench,

suppose base = 150 sec
patched = 100 sec
improvement = 50% w.r.t. the patched time (or 33% w.r.t. the base time)

(2) for "higher is better" benchmarks, the improvement calculation would be

|(patched - base)| * 100 / base

e.g. for pgbench / ebizzy...

base = 100 tps (transactions per sec)
patched = 150 tps
improvement = 50% w.r.t. the base throughput (or 33% w.r.t. the patched
throughput)

Is this what is generally done? I just wanted to be on the same page before
publishing benchmark results other than kernbench.
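A minimal C sketch of the two conventions just described (the helper names
and the sample numbers are illustrative only, not from the patch series):

	/*
	 * Illustrative helpers for the two "%improvement" conventions
	 * discussed above, plus the plain speedup factor.
	 */
	#include <stdio.h>

	/* lower-is-better metric (e.g. kernbench seconds): w.r.t. patched run */
	static double improvement_lower_is_better(double base, double patched)
	{
		return (base - patched) * 100.0 / patched;
	}

	/* higher-is-better metric (e.g. pgbench tps): w.r.t. base run */
	static double improvement_higher_is_better(double base, double patched)
	{
		return (patched - base) * 100.0 / base;
	}

	int main(void)
	{
		/* kernbench-style example from the mail: 150 sec -> 100 sec */
		printf("time: %.1f%% improvement, %.2fx speedup\n",
		       improvement_lower_is_better(150.0, 100.0), 150.0 / 100.0);

		/* pgbench-style example from the mail: 100 tps -> 150 tps */
		printf("tps:  %.1f%% improvement\n",
		       improvement_higher_is_better(100.0, 150.0));
		return 0;
	}

Both examples print 50%, illustrating how the same data can be reported
differently depending on the chosen denominator.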




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-13 Thread Raghavendra K T

On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also, Nikunj had clarified that
the result was on non-PLE).


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.



3 guests with 8GB RAM each, 1 used for kernbench
(kernbench -f -H -M -o 20), the others for a cpuhog (a shell script running
hackbench in a "while true" loop)

1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests

kernbench on PLE:
Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores,
with 8 online cpus and 4*64GB RAM.


The average is taken over 4 iterations with 3 runs each (4*3=12), and the
stdev is calculated over the means reported in each run.



A): 8 vcpu guest

            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time

case 1*1x:  61.7075   (1.17872)    60.93     (1.475625)    1.27605
case 1*2x:  107.2125  (1.3821349)  97.506675 (1.3461878)   9.95401
case 1*3x:  144.3515  (1.8203927)  138.9525  (0.58309319)  3.8855


B): 16 vcpu guest

            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time

case 2*1x:  70.524    (1.5941395)  69.68866  (1.9392529)   1.19867
case 2*2x:  133.0738  (1.4558653)  124.8568  (1.4544986)   6.58114
case 2*3x:  206.0094  (1.3437359)  181.4712  (2.9134116)   13.5218

C): 32 vcpu guest

            BASE                   BASE+patch             %improvement w.r.t
            mean (sd)              mean (sd)               patched kernel time

case 4*1x:  100.61046 (2.7603485)  85.48734  (2.6035035)   17.6905
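(The %improvement column is w.r.t. the patched kernel time; for example,
for case 4*1x: (100.61046 - 85.48734) * 100 / 85.48734 = 17.6905.)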

It seems that while we do not see any improvement in the low-contention
case, the benefit becomes evident with overcommit and large guests. I am
continuing the analysis with other benchmarks (now with pgbench, to check
whether it has an acceptable improvement/degradation in the low-contention
case).

Avi,
Can the patch series go ahead for inclusion into the tree, for the
following reasons:

The patch series brings fairness with the ticketlock (and hence
predictability: during contention, a vcpu trying to acquire the lock is
sure to get its turn within, at most, the total number of vcpus contending
for the lock), which is very much desired irrespective of its low
benefit/degradation (if any) in low-contention scenarios.

Of course, ticketlocks had the undesirable effect of aggravating the LHP
problem, and the series addresses that by improved scheduling and by
sleeping instead of burning cpu time.

Finally, a less famous point: it brings an almost PLE-equivalent capability
to all non-PLE hardware (TBH, I always preferred my experimental kernel to
be compiled in my pv guest; that saves more than 30 min of time for each
run).

It would be nice to see results if anybody benefited/suffered with the
patchset.




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-13 Thread Nikunj A Dadhania
On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T 
raghavendra...@linux.vnet.ibm.com wrote:
 On 05/07/2012 08:22 PM, Avi Kivity wrote:
 
 I could not come with pv-flush results (also Nikunj had clarified that
 the result was on NOn PLE
 
Did you see any issues on PLE?

Regards,
Nikunj



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-08 Thread Nikunj A Dadhania
On Mon, 7 May 2012 22:42:30 +0200 (CEST), Thomas Gleixner t...@linutronix.de 
wrote:
 On Mon, 7 May 2012, Ingo Molnar wrote:
  * Avi Kivity a...@redhat.com wrote:
  
PS: Nikunj had experimented that pv-flush tlb + 
paravirt-spinlock is a win on PLE where only one of them 
alone could not prove the benefit.
   
I do not have PLE numbers yet for pvflush and pvspinlock.

On non-PLE, with the pvflush and pvspinlock patches, kernbench, ebizzy,
specjbb, hackbench and dbench all improved.

I am currently chasing a race on the pv-flush path; it is causing
file-system corruption. I will post these numbers along with my v2 post.

   I'd like to see those numbers, then.
   
   Ingo, please hold on the kvm-specific patches, meanwhile.
  
  I'll hold off on the whole thing - frankly, we don't want this 
  kind of Xen-only complexity. If KVM can make use of PLE then Xen 
  ought to be able to do it as well.
  
  If both Xen and KVM makes good use of it then that's a different 
  matter.
 
 Aside of that, it's kinda strange that a dude named Nikunj is
 referenced in the argument chain, but I can't find him on the CC list.
 
/me waves my hand

Regards
Nikunj



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-08 Thread Avi Kivity
On 05/08/2012 02:15 AM, Jeremy Fitzhardinge wrote:
 On 05/07/2012 06:49 AM, Avi Kivity wrote:
  On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
  * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 
  19:08:51]:
 
  I 'll get hold of a PLE mc  and come up with the numbers soon. but I
  'll expect the improvement around 1-3% as it was in last version.
  Deferring preemption (when vcpu is holding lock) may give us better than 
  1-3% 
  results on PLE hardware. Something worth trying IMHO.
  Is the improvement so low, because PLE is interfering with the patch, or
  because PLE already does a good job?

 How does PLE help with ticket scheduling on unlock?  I thought it would
 just help with the actual spin loops.

PLE yields to a random vcpu, hoping it is the lock holder.  This
patchset wakes up the right vcpu.  For small vcpu counts the difference
is a few bad wakeups (and even a bad wakeup sometimes works, since it
can put the spinner to sleep for a bit).  I expect that large vcpu
counts would show a greater difference.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Ingo Molnar

* Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

 This series replaces the existing paravirtualized spinlock mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.(targeted for 3.5 window)
 
 Note: This needs the debugfs changes patch that should be in Xen / linux-next
https://lkml.org/lkml/2012/3/30/687
 
 Changes in V8:
  - Rebased patches to 3.4-rc4
  - Combined the KVM changes with ticketlock + Xen changes (Ingo)
  - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we
 need newer qemu which uses KVM_GET_SUPPORTED_CPUID ioctl.
  - Rewrite GET_MP_STATE condition (Avi)
  - Make pv_unhalt = bool (Avi)
  - Move out reset pv_unhalt code to vcpu_run from vcpu_block (Gleb)
  - Documentation changes (Rob Landley)
  - Have a printk to recognize that paravirt spinlock is enabled (Nikunj)
  - Move out kick hypercall out of CONFIG_PARAVIRT_SPINLOCK now
so that it can be used for other optimizations such as 
flush_tlb_ipi_others etc. (Nikunj)
 
 Ticket locks have an inherent problem in a virtualized case, because
 the vCPUs are scheduled rather than running concurrently (ignoring
 gang scheduled vCPUs).  This can result in catastrophic performance
 collapses when the vCPU scheduler doesn't schedule the correct next
 vCPU, and ends up scheduling a vCPU which burns its entire timeslice
 spinning.  (Note that this is not the same problem as lock-holder
 preemption, which this series also addresses; that's also a problem,
 but not catastrophic).
 
 (See Thomas Friebel's talk Prevent Guests from Spinning Around
 http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
 
 Currently we deal with this by having PV spinlocks, which adds a layer
 of indirection in front of all the spinlock functions, and defining a
 completely new implementation for Xen (and for other pvops users, but
 there are none at present).
 
 PV ticketlocks keeps the existing ticketlock implementation
 (fastpath) as-is, but adds a couple of pvops for the slow paths:
 
 - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
   iterations, then call out to the __ticket_lock_spinning() pvop,
   which allows a backend to block the vCPU rather than spinning.  This
   pvop can set the lock into slowpath state.
 
 - When releasing a lock, if it is in slowpath state, then call
   __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
   lock is no longer in contention, it also clears the slowpath flag.
 
 The slowpath state is stored in the LSB of the lock tail ticket.  This
 has the effect of reducing the max number of CPUs by half (so, a small
 ticket can deal with 128 CPUs, and a large ticket with 32768).
 
 For KVM, one hypercall is introduced in the hypervisor that allows a vcpu
 to kick another vcpu out of the halt state.
 The blocking of a vcpu is done using halt() in the (lock_spinning) slowpath.
 
 Overall, it results in a large reduction in code, it makes the native
 and virtualized cases closer, and it removes a layer of indirection
 around all the spinlock functions.
 
 The fast path (taking an uncontended lock which isn't in slowpath
 state) is optimal, identical to the non-paravirtualized case.
 
 The inner part of ticket lock code becomes:
	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
 which results in:
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END
 
 with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
 the fastpath case is straight through (taking the lock without
 contention), and the spin loop is out of line:
 
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
	### SLOWPATH END
 
 The unlock code is complicated by the need to 

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 11:29 AM, Ingo Molnar wrote:
 This is looking pretty good and complete now - any objections 
 from anyone to trying this out in a separate x86 topic tree?

No objections, instead an

Acked-by: Avi Kivity a...@redhat.com

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 02:02 PM, Avi Kivity wrote:

On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections
from anyone to trying this out in a separate x86 topic tree?


No objections, instead an

Acked-by: Avi Kivitya...@redhat.com



Thank you.

Here is a benchmark result with the patches.

3 guests with 8 VCPUs and 8GB RAM each, 1 used for kernbench
(kernbench -f -H -M -o 20), the others for a cpuhog (a shell script with a
"while true" loop executing an instruction)

unpinned scenario
1x: no hogs
2x: 8hogs in one guest
3x: 8hogs each in two guest

BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n
BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y

Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE),
with 8 cores, 64GB RAM


(Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 
runs)).


         BASE                 BASE+patch           %improvement
         mean (sd)            mean (sd)
case 1x: 66.0566 (74.0304)    61.3233 (68.8299)    7.16552
case 2x: 1253.2  (1795.74)    131.606 (137.358)    89.4984
case 3x: 3431.04 (5297.26)    134.964 (149.861)    96.0664
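(Here %improvement was computed w.r.t. the BASE time, e.g. for case 2x:
(1253.2 - 131.606) * 100 / 1253.2 = 89.4984; the follow-up replies
recompute the large cases as roughly 10x and 25x speedups.)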


I will be working on further analysis with other benchmarks
(pgbench/sysbench/ebizzy...) and on further optimization.




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 01:58 PM, Raghavendra K T wrote:
 On 05/07/2012 02:02 PM, Avi Kivity wrote:
 On 05/07/2012 11:29 AM, Ingo Molnar wrote:
 This is looking pretty good and complete now - any objections
 from anyone to trying this out in a separate x86 topic tree?

 No objections, instead an

 Acked-by: Avi Kivitya...@redhat.com


 Thank you.

 Here is a benchmark result with the patches.

 3 guests with 8VCPU, 8GB RAM, 1 used for kernbench
 (kernbench -f -H -M -o 20) other for cpuhog (shell script while
 true with an instruction)

 unpinned scenario
 1x: no hogs
 2x: 8hogs in one guest
 3x: 8hogs each in two guest

 BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n
 BASE+patch: 3.4-rc4 + debugfs + pv patches with
 CONFIG_PARAVIRT_SPINLOCK=y

 Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (Non
 PLE) with 8 core , 64GB RAM

 (Less is better. Below is time elapsed in sec for x86_64_defconfig
 (3+3 runs)).

          BASE                BASE+patch          %improvement
          mean (sd)           mean (sd)
 case 1x: 66.0566 (74.0304)  61.3233 (68.8299) 7.16552
 case 2x: 1253.2 (1795.74)  131.606 (137.358) 89.4984
 case 3x: 3431.04 (5297.26)  134.964 (149.861) 96.0664


You're calculating the improvement incorrectly.  In the last case, it's
not 96%, rather it's 2400% (25x).  Similarly the second case is about
900% faster.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 05:36 PM, Avi Kivity wrote:

On 05/07/2012 01:58 PM, Raghavendra K T wrote:

On 05/07/2012 02:02 PM, Avi Kivity wrote:

On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections
from anyone to trying this out in a separate x86 topic tree?


No objections, instead an

Acked-by: Avi Kivitya...@redhat.com


[...]


(Less is better. Below is time elapsed in sec for x86_64_defconfig
(3+3 runs)).

          BASE                BASE+patch          %improvement
          mean (sd)           mean (sd)
case 1x: 66.0566 (74.0304)  61.3233 (68.8299) 7.16552
case 2x: 1253.2 (1795.74)  131.606 (137.358) 89.4984
case 3x: 3431.04 (5297.26)  134.964 (149.861) 96.0664



You're calculating the improvement incorrectly.  In the last case, it's
not 96%, rather it's 2400% (25x).  Similarly the second case is about
900% faster.



You are right;
my %improvement was intended to read like:
1) base takes 100 sec ==> patch takes 93 sec
2) base takes 100 sec ==> patch takes 11 sec
3) base takes 100 sec ==> patch takes 4 sec

The above is more confusing (and incorrect!).

Better is what you suggested, which boils down to 10x and 25x improvements
in case 2 and case 3. And IMO, this *really* gives a feeling for the
magnitude of the improvement with the patches.

I'll change the script to report it that way :).
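(As a quick check from the kernbench table quoted above: 1253.2 / 131.606
is roughly a 9.5x speedup for case 2x, and 3431.04 / 134.964 is roughly
25.4x for case 3x.)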



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 04:20 PM, Raghavendra K T wrote:
 On 05/07/2012 05:36 PM, Avi Kivity wrote:
 On 05/07/2012 01:58 PM, Raghavendra K T wrote:
 On 05/07/2012 02:02 PM, Avi Kivity wrote:
 On 05/07/2012 11:29 AM, Ingo Molnar wrote:
 This is looking pretty good and complete now - any objections
 from anyone to trying this out in a separate x86 topic tree?

 No objections, instead an

 Acked-by: Avi Kivitya...@redhat.com

 [...]

 (Less is better. Below is time elapsed in sec for x86_64_defconfig
 (3+3 runs)).

           BASE                BASE+patch          %improvement
           mean (sd)           mean (sd)
 case 1x: 66.0566 (74.0304)  61.3233 (68.8299) 7.16552
 case 2x: 1253.2 (1795.74)  131.606 (137.358) 89.4984
 case 3x: 3431.04 (5297.26)  134.964 (149.861) 96.0664


 You're calculating the improvement incorrectly.  In the last case, it's
 not 96%, rather it's 2400% (25x).  Similarly the second case is about
 900% faster.


 You are right,
 my %improvement was intended to be like
 if
 1) base takes 100 sec == patch takes 93 sec
 2) base takes 100 sec == patch takes 11 sec
 3) base takes 100 sec == patch takes 4 sec

 The above is more confusing (and incorrect!).

 Better is what you told which boils to 10x and 25x improvement in case
 2 and case 3. And IMO, this *really* gives the feeling of magnitude of
 improvement with patches.

 I ll change script to report that way :).


btw, this is on non-PLE hardware, right?  What are the numbers for PLE?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 06:52 PM, Avi Kivity wrote:

On 05/07/2012 04:20 PM, Raghavendra K T wrote:

On 05/07/2012 05:36 PM, Avi Kivity wrote:

On 05/07/2012 01:58 PM, Raghavendra K T wrote:

On 05/07/2012 02:02 PM, Avi Kivity wrote:

On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections
from anyone to trying this out in a separate x86 topic tree?


No objections, instead an

Acked-by: Avi Kivitya...@redhat.com


[...]


(Less is better. Below is time elapsed in sec for x86_64_defconfig
(3+3 runs)).

           BASE                BASE+patch          %improvement
           mean (sd)           mean (sd)
case 1x: 66.0566 (74.0304)  61.3233 (68.8299) 7.16552
case 2x: 1253.2 (1795.74)  131.606 (137.358) 89.4984
case 3x: 3431.04 (5297.26)  134.964 (149.861) 96.0664



You're calculating the improvement incorrectly.  In the last case, it's
not 96%, rather it's 2400% (25x).  Similarly the second case is about
900% faster.



You are right,
my %improvement was intended to be like
if
1) base takes 100 sec ==  patch takes 93 sec
2) base takes 100 sec ==  patch takes 11 sec
3) base takes 100 sec ==  patch takes 4 sec

The above is more confusing (and incorrect!).

Better is what you told which boils to 10x and 25x improvement in case
2 and case 3. And IMO, this *really* gives the feeling of magnitude of
improvement with patches.

I ll change script to report that way :).



btw, this is on non-PLE hardware, right?  What are the numbers for PLE?


Sure.
I'll get hold of a PLE machine and come up with the numbers soon, but I
expect the improvement to be around 1-3%, as it was in the last version.



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Srivatsa Vaddagiri
* Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]:

 I 'll get hold of a PLE mc  and come up with the numbers soon. but I
 'll expect the improvement around 1-3% as it was in last version.

Deferring preemption (when vcpu is holding lock) may give us better than 1-3% 
results on PLE hardware. Something worth trying IMHO.

- vatsa



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
 * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]:

  I 'll get hold of a PLE mc  and come up with the numbers soon. but I
  'll expect the improvement around 1-3% as it was in last version.

 Deferring preemption (when vcpu is holding lock) may give us better than 1-3% 
 results on PLE hardware. Something worth trying IMHO.

Is the improvement so low, because PLE is interfering with the patch, or
because PLE already does a good job?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 07:19 PM, Avi Kivity wrote:

On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:

* Raghavendra K Traghavendra...@linux.vnet.ibm.com  [2012-05-07 19:08:51]:


I 'll get hold of a PLE mc  and come up with the numbers soon. but I
'll expect the improvement around 1-3% as it was in last version.


Deferring preemption (when vcpu is holding lock) may give us better than 1-3%
results on PLE hardware. Something worth trying IMHO.


Is the improvement so low, because PLE is interfering with the patch, or
because PLE already does a good job?



It is because PLE already does a good job (of not burning cpu). The
1-3% improvement is because the patchset knows at least who is next to hold
the lock, which is lacking in PLE.



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Srivatsa Vaddagiri
* Avi Kivity a...@redhat.com [2012-05-07 16:49:25]:

  Deferring preemption (when vcpu is holding lock) may give us better than 
  1-3% 
  results on PLE hardware. Something worth trying IMHO.
 
 Is the improvement so low, because PLE is interfering with the patch, or
 because PLE already does a good job?

I think it's the latter (PLE already doing a good job).

- vatsa



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 07:16 PM, Srivatsa Vaddagiri wrote:

* Raghavendra K Traghavendra...@linux.vnet.ibm.com  [2012-05-07 19:08:51]:


I 'll get hold of a PLE mc  and come up with the numbers soon. but I
'll expect the improvement around 1-3% as it was in last version.


Deferring preemption (when vcpu is holding lock) may give us better than 1-3%
results on PLE hardware. Something worth trying IMHO.



Yes, sure. I'll take this up, along with any further scalability
improvements that are possible.




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 04:53 PM, Raghavendra K T wrote:
 Is the improvement so low, because PLE is interfering with the patch, or
 because PLE already does a good job?



 It is because PLE already does a good job (of not burning cpu). The
 1-3% improvement is because, patchset knows atleast who is next to hold
 lock, which is lacking in PLE.


Not good.  Solving a problem in software that is already solved by
hardware?  It's okay if there are no costs involved, but here we're
introducing a new ABI that we'll have to maintain for a long time.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 07:28 PM, Avi Kivity wrote:

On 05/07/2012 04:53 PM, Raghavendra K T wrote:

Is the improvement so low, because PLE is interfering with the patch, or
because PLE already does a good job?




It is because PLE already does a good job (of not burning cpu). The
1-3% improvement is because, patchset knows atleast who is next to hold
lock, which is lacking in PLE.



Not good.  Solving a problem in software that is already solved by
hardware?  It's okay if there are no costs involved, but here we're
introducing a new ABI that we'll have to maintain for a long time.



Hmm, agreed that being just a step ahead of mighty hardware (with an
improvement of only 1-3%) is no good for the long term (where PLE is the
future).

Having said that, it is hard for me to resist saying:
the bottleneck is somewhere else on the PLE machine, and IMHO the answer
would be a combination of paravirt-spinlock + pv-flush-tlb.

But I need to come up with good numbers to argue in favour of that claim.

PS: Nikunj had found experimentally that pv-flush-tlb + paravirt-spinlock is
a win on PLE, where neither of them alone could prove the benefit.




Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 05:52 PM, Avi Kivity wrote:
  Having said that, it is hard for me to resist saying :
   bottleneck is somewhere else on PLE m/c and IMHO answer would be
  combination of paravirt-spinlock + pv-flush-tb.
 
  But I need to come up with good number to argue in favour of the claim.
 
  PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a
  win on PLE where only one of them alone could not prove the benefit.
 

 I'd like to see those numbers, then.


Note: it's probably best to try very wide guests, where the overhead of
iterating on all vcpus begins to show.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Avi Kivity
On 05/07/2012 05:47 PM, Raghavendra K T wrote:
 Not good.  Solving a problem in software that is already solved by
 hardware?  It's okay if there are no costs involved, but here we're
 introducing a new ABI that we'll have to maintain for a long time.



 Hmm agree that being a step ahead of mighty hardware (and just an
 improvement of 1-3%) is no good for long term (where PLE is future).


PLE is the present, not the future.  It was introduced on later Nehalems
and is present on all Westmeres.  Two more processor generations have
passed meanwhile.  The AMD equivalent was also introduced around that
timeframe.

 Having said that, it is hard for me to resist saying :
  bottleneck is somewhere else on PLE m/c and IMHO answer would be
 combination of paravirt-spinlock + pv-flush-tb.

 But I need to come up with good number to argue in favour of the claim.

 PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a
 win on PLE where only one of them alone could not prove the benefit.


I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

  PS: Nikunj had experimented that pv-flush tlb + 
  paravirt-spinlock is a win on PLE where only one of them 
  alone could not prove the benefit.
 
 I'd like to see those numbers, then.
 
 Ingo, please hold on the kvm-specific patches, meanwhile.

I'll hold off on the whole thing - frankly, we don't want this 
kind of Xen-only complexity. If KVM can make use of PLE then Xen 
ought to be able to do it as well.

If both Xen and KVM makes good use of it then that's a different 
matter.

Thanks,

Ingo


Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Thomas Gleixner
On Mon, 7 May 2012, Ingo Molnar wrote:
 * Avi Kivity a...@redhat.com wrote:
 
   PS: Nikunj had experimented that pv-flush tlb + 
   paravirt-spinlock is a win on PLE where only one of them 
   alone could not prove the benefit.
  
  I'd like to see those numbers, then.
  
  Ingo, please hold on the kvm-specific patches, meanwhile.
 
 I'll hold off on the whole thing - frankly, we don't want this 
 kind of Xen-only complexity. If KVM can make use of PLE then Xen 
 ought to be able to do it as well.
 
 If both Xen and KVM makes good use of it then that's a different 
 matter.

Aside of that, it's kinda strange that a dude named Nikunj is
referenced in the argument chain, but I can't find him on the CC list.

Thanks,

tglx


Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Jeremy Fitzhardinge
On 05/07/2012 06:49 AM, Avi Kivity wrote:
 On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
 * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]:

 I 'll get hold of a PLE mc  and come up with the numbers soon. but I
 'll expect the improvement around 1-3% as it was in last version.
 Deferring preemption (when vcpu is holding lock) may give us better than 
 1-3% 
 results on PLE hardware. Something worth trying IMHO.
 Is the improvement so low, because PLE is interfering with the patch, or
 because PLE already does a good job?

How does PLE help with ticket scheduling on unlock?  I thought it would
just help with the actual spin loops.

J


Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/08/2012 04:45 AM, Jeremy Fitzhardinge wrote:

On 05/07/2012 06:49 AM, Avi Kivity wrote:

On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:

* Raghavendra K Traghavendra...@linux.vnet.ibm.com  [2012-05-07 19:08:51]:


I 'll get hold of a PLE mc  and come up with the numbers soon. but I
'll expect the improvement around 1-3% as it was in last version.

Deferring preemption (when vcpu is holding lock) may give us better than 1-3%
results on PLE hardware. Something worth trying IMHO.

Is the improvement so low, because PLE is interfering with the patch, or
because PLE already does a good job?


How does PLE help with ticket scheduling on unlock?  I thought it would
just help with the actual spin loops.


Hmm, this strikes a chord with me. I think I should replace the while-1 hog
with some *real job* to measure the overcommit case. I hope to see
greater improvements because of the fairness and scheduling of the
patch set.

Maybe all along I was measuring something equivalent to the 1x case.



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-07 Thread Raghavendra K T

On 05/07/2012 08:22 PM, Avi Kivity wrote:

On 05/07/2012 05:47 PM, Raghavendra K T wrote:

Not good.  Solving a problem in software that is already solved by
hardware?  It's okay if there are no costs involved, but here we're
introducing a new ABI that we'll have to maintain for a long time.




Hmm agree that being a step ahead of mighty hardware (and just an
improvement of 1-3%) is no good for long term (where PLE is future).



PLE is the present, not the future.  It was introduced on later Nehalems
and is present on all Westmeres.  Two more processor generations have
passed meanwhile.  The AMD equivalent was also introduced around that
timeframe.


Having said that, it is hard for me to resist saying :
  bottleneck is somewhere else on PLE m/c and IMHO answer would be
combination of paravirt-spinlock + pv-flush-tb.

But I need to come up with good number to argue in favour of the claim.

PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a
win on PLE where only one of them alone could not prove the benefit.



I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.




Hmm. I think I messed up the facts when saying 1-3% improvement on PLE.

Going by what I had posted in https://lkml.org/lkml/2012/4/5/73 (with the
correct calculation):

  1x 70.475  (85.6979)   63.5033 (72.7041)   15.7%
  2x 110.971 (132.829)   105.099 (128.738)    5.56%
  3x 150.265 (184.766)   138.341 (172.69)     8.62%


It was around 12% with the optimization patch posted separately on top of
that (that one needs more experiments though).

But anyway, I will come up with results for the current patch series.



[PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-02 Thread Raghavendra K T

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.(targeted for 3.5 window)

Note: This needs the debugfs changes patch that should be in Xen / linux-next
   https://lkml.org/lkml/2012/3/30/687

Changes in V8:
 - Rebased patches to 3.4-rc4
 - Combined the KVM changes with ticketlock + Xen changes (Ingo)
 - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we
need newer qemu which uses KVM_GET_SUPPORTED_CPUID ioctl.
 - Rewrite GET_MP_STATE condition (Avi)
 - Make pv_unhalt = bool (Avi)
 - Move out reset pv_unhalt code to vcpu_run from vcpu_block (Gleb)
 - Documentation changes (Rob Landley)
 - Have a printk to recognize that paravirt spinlock is enabled (Nikunj)
 - Move out kick hypercall out of CONFIG_PARAVIRT_SPINLOCK now
   so that it can be used for other optimizations such as 
   flush_tlb_ipi_others etc. (Nikunj)

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct next
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk Prevent Guests from Spinning Around
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keeps the existing ticketlock implementation
(fastpath) as-is, but adds a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, then call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock tail ticket.  This has
the effect of reducing the max number of CPUs by half (so, a small ticket
can deal with 128 CPUs, and a large ticket with 32768).
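A minimal, illustrative C sketch of this layout (the names and exact types
here are assumptions for illustration, not necessarily the kernel's
definitions):

	#include <stdint.h>

	typedef uint8_t  __ticket_t;		/* "small ticket" case */
	typedef uint16_t __ticketpair_t;

	#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)	/* LSB of the tail */
	#define TICKET_LOCK_INC		((__ticket_t)2)	/* tickets step by 2 */

	typedef struct arch_spinlock {
		union {
			__ticketpair_t head_tail;
			struct __raw_tickets {
				__ticket_t head, tail;
			} tickets;
		};
	} arch_spinlock_t;

	/* An unlocker can test the flag without knowing the ticket value: */
	static inline int lock_in_slowpath(const arch_spinlock_t *lock)
	{
		return lock->tickets.tail & TICKET_SLOWPATH_FLAG;
	}

Because the flag occupies the low bit, tickets advance in steps of 2, which
is why the paravirt fastpath shown below uses an xadd of 0x200 where the
CONFIG_PARAVIRT_SPINLOCKS=n version uses 0x100.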

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to
kick another vcpu out of the halt state.
The blocking of a vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in slowpath
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
which results in:
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
the fastpath case is straight through (taking the lock without
contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
head and fetch the slowpath flag from tail.  This version of the
patch