Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-24 Thread Ingo Molnar

* Mike Galbraith <bitbuc...@online.de> wrote:

> On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote: 
> > On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> > > On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> > >
> > >> Please let me know if you manage to break this code in any way,
> > >> so I can fix it...
> > >
> > > I didn't break it, but did let it play with rq->lock contention.  Using
> > > cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> > > pull around appears to have been a ~dead heat.
> > 
> > Good to hear that the code seems to be robust. It seems to
> > help prevent performance degradation in some workloads, and
> > nobody seems to have found regressions yet.
> 
> I had hoped for a bit of positive, but a wash isn't surprising 
> given the profile.  I tried tbench too, didn't expect to see 
> anything at all there, and got that.. so both results are 
> positive in that respect.

Ok, that's good.

Rik, mind re-sending the latest series with all the acks and 
Reviewed-by's added?

Thanks,

Ingo


Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-13 Thread Raghavendra K T

On 01/12/2013 01:41 AM, Rik van Riel wrote:
> On 01/10/2013 12:36 PM, Raghavendra K T wrote:
>> * Rafael Aquini <aqu...@redhat.com> [2013-01-10 00:27:23]:
>>
>>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>>> x base_3.8rc2
>>>> + rik_backoff
>>>> N   Min       Max       Median    Avg        Stddev
>>>> x   8   222.977   231.16    227.735   227.388    3.1512986
>>>> +   8   218.75    232.347   229.1035  228.25425  4.2730225
>>>> No difference proven at 95.0% confidence
>>>
>>> I got similar results on smaller systems (1 socket, dual-cores and
>>> quad-cores) when running Rik's latest series, no big difference for
>>> good nor for worse, but I also think Rik's work is meant to address
>>> bigger systems with more cores contending for any given spinlock.
>>
>> I was able to do the test on the same 32 core machine with
>> 4 guests (8GB RAM, 32 vcpu).
>> Here are the results
>>
>> base = 3.8-rc2
>> patched = base + Rik V3 backoff series [patch 1-4]
>
> I believe I understand why this is happening.
>
> Modern Intel and AMD CPUs have a feature called Pause Loop Exiting
> (PLE) and Pause Filter (PF), respectively.  This feature is used to
> trap to the host when the guest is spinning on a spinlock.
>
> This allows the host to run something else, with the spinner
> temporarily yielding the CPU.  Effectively, this means the KVM code
> already does a limited amount of spinlock backoff, in the host.
>
> Adding more backoff code in the guest can lead to wild delays in
> acquiring locks, and generally bad performance.

Yes, agree with you.

> I suspect that when running in a virtual machine, we should limit
> the delay factor to something much smaller, since the host will take
> care of most of the backoff for us.

Even for the non-PLE case I believe it would be difficult to tune the
delay, because of VCPU scheduling and LHP (lock holder preemption).

> Maybe a maximum delay value of ~10 would do the trick for KVM
> guests.
>
> We should be able to get this right by placing the value for the
> maximum delay in a __read_mostly section and setting it to something
> small from an init function when we detect we are running in a
> virtual machine.
>
> Let me cook up, and test, a patch that does that...

Sure.. Awaiting and happy to test the patches.
I also tried a few things on my own, and also checked how it behaves
without patch 4. Nothing helped.




Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-11 Thread Rik van Riel

On 01/10/2013 12:36 PM, Raghavendra K T wrote:
> * Rafael Aquini <aqu...@redhat.com> [2013-01-10 00:27:23]:
>
>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>> x base_3.8rc2
>>> + rik_backoff
>>> N   Min       Max       Median    Avg        Stddev
>>> x   8   222.977   231.16    227.735   227.388    3.1512986
>>> +   8   218.75    232.347   229.1035  228.25425  4.2730225
>>> No difference proven at 95.0% confidence
>>
>> I got similar results on smaller systems (1 socket, dual-cores and
>> quad-cores) when running Rik's latest series, no big difference for
>> good nor for worse, but I also think Rik's work is meant to address
>> bigger systems with more cores contending for any given spinlock.
>
> I was able to do the test on the same 32 core machine with
> 4 guests (8GB RAM, 32 vcpu).
> Here are the results
>
> base = 3.8-rc2
> patched = base + Rik V3 backoff series [patch 1-4]


I believe I understand why this is happening.

Modern Intel and AMD CPUs have a feature called Pause Loop Exiting (PLE)
and Pause Filter (PF), respectively.  This feature is used to trap to
the host when the guest is spinning on a spinlock.

This allows the host to run something else, with the spinner
temporarily yielding the CPU.  Effectively, this means the KVM
code already does a limited amount of spinlock backoff, in the
host.

Adding more backoff code in the guest can lead to wild delays in
acquiring locks, and generally bad performance.

I suspect that when running in a virtual machine, we should limit
the delay factor to something much smaller, since the host will take
care of most of the backoff for us.

Maybe a maximum delay value of ~10 would do the trick for KVM
guests.

We should be able to get this right by placing the value for the
maximum delay in a __read_mostly section and setting it to something
small from an init function when we detect we are running in a
virtual machine.

Let me cook up, and test, a patch that does that...
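
A minimal sketch of the idea described above (illustrative only: the
variable name, the init hook, and the boot-time default are assumptions,
not taken from Rik's eventual patch):

/*
 * Sketch only -- hypothetical names.  The delay cap is read on every
 * contended acquisition but written once at boot, hence __read_mostly.
 */
static int __read_mostly spinlock_delay_max = 16000;

static int __init spinlock_delay_init(void)
{
	/* Under a hypervisor, PLE/PF already throttles spinning vcpus. */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		spinlock_delay_max = 10;	/* the ~10 suggested above */
	return 0;
}
early_initcall(spinlock_delay_init);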

--
All rights reversed



Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Chegu Vinod

On 1/8/2013 2:26 PM, Rik van Riel wrote:
<...>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...



Attached below is some preliminary data with one of the AIM7 micro-benchmark
workloads (i.e. high_systime). This is a kernel-intensive workload which
does tons of forks/execs etc., and stresses quite a few of the same set
of spinlocks and semaphores.

Observed a drop in performance as we go to 40way and 80way. Wondering
if the backoff keeps increasing to such an extent that it actually starts
to hurt, given the nature of this workload?  Also, in the 80way case,
observed quite a bit of variation from run to run...

Also ran it inside a single KVM guest. There were some perf dips, but
interestingly didn't observe the same level of drop (compared to the
drop in the native case) as the guest size was scaled up to 40vcpu or
80vcpu.

FYI
Vinod



---

Platform : 8 socket (80 Core) Westmere with 1TB RAM.

Workload: AIM7-highsystime microbenchmark - 2000 users & 100 jobs per user.  

Values reported are Jobs Per Minute (Higher is better).  The values
are average of 3 runs.

1) Native run:
--

Config 1:  3.7 kernel
Config 2:  3.7 + Rik's 1-4 patches


            20way     40way     80way

Config 1    ~179K     ~159K     ~146K

Config 2    ~180K     ~134K     ~21K-43K  <- high variation!


(Note: Used numactl to restrict the workload to
2 sockets (20way) and 4 sockets (40way))

--

2) KVM run : 


Single guest of different sizes (No over commit, NUMA enabled in the guest).

Note: This kernel-intensive micro benchmark exposes the PLE handler issue,
  esp. for large guests. Since Raghu's PLE changes are not yet upstream,
  I have just run with the current PLE handler and then with PLE
  disabled (ple_gap=0).

Config 1 : Host & Guest at 3.7
Config 2 : Host & Guest are at 3.7 + Rik's 1-4 patches

--
 20vcpu/128G  40vcpu/256G  80vcpu/512G
(on 2 sockets)   (on 4 sockets)   (on 8 sockets)
--
Config 1   ~144K ~39K ~10K
--
Config 2   ~143K ~37.5K   ~11K
--

Config 3 : Host & Guest at 3.7 AND ple_gap=0
Config 4 : Host & Guest are at 3.7 + Rik's 1-4 patches AND ple_gap=0

--
Config 3   ~154K~131K~116K 
--
Config 4   ~151K~130K~115K
--


(Note: Used numactl to restrict qemu to
2 sockets (20way) and 4 sockets (40way))


Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Mike Galbraith
On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote: 
> On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> > On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> >
> >> Please let me know if you manage to break this code in any way,
> >> so I can fix it...
> >
> > I didn't break it, but did let it play with rq->lock contention.  Using
> > cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> > pull around appears to have been a ~dead heat.
> 
> Good to hear that the code seems to be robust. It seems to
> help prevent performance degradation in some workloads, and
> nobody seems to have found regressions yet.

I had hoped for a bit of positive, but a wash isn't surprising given the
profile.  I tried tbench too, didn't expect to see anything at all
there, and got that.. so both results are positive in that respect.

-Mike



Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Raghavendra K T
* Rafael Aquini <aqu...@redhat.com> [2013-01-10 00:27:23]:

> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
> > I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
> > x base_3.8rc2
> > + rik_backoff
> > N   Min       Max       Median    Avg        Stddev
> > x   8   222.977   231.16    227.735   227.388    3.1512986
> > +   8   218.75    232.347   229.1035  228.25425  4.2730225
> > No difference proven at 95.0% confidence
> 
> I got similar results on smaller systems (1 socket, dual-cores and quad-cores)
> when running Rik's latest series, no big difference for good nor for worse,
> but I also think Rik's work is meant to address bigger systems with more cores
> contending for any given spinlock.

I was able to do the test on the same 32 core machine with
4 guests (8GB RAM, 32 vcpu).
Here are the results:

base = 3.8-rc2
patched = base + Rik V3 backoff series [patch 1-4]

+------------+-----------+------------+-----------+-----------+
              kernbench (sec, lower is better)
+------------+-----------+------------+-----------+-----------+
    base        stdev       patched      stdev      %improve
+------------+-----------+------------+-----------+-----------+
     44.3000     2.0404      46.7928     1.7518     -5.62709
     94.8262     5.1444     102.4737     7.8406     -8.06475
    156.0540    14.5797     167.6888     9.7110     -7.45562
    202.3225    15.8906     213.1435    17.1778     -5.34839
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
              sysbench (sec, lower is better)
+------------+-----------+------------+-----------+-----------+
    base        stdev       patched      stdev      %improve
+------------+-----------+------------+-----------+-----------+
     16.8512     0.4164      17.7301     0.3761     -5.21565
     13.0411     0.4115      12.9380     0.1554      0.79058
     18.4573     0.2123      18.4662     0.2005     -0.04822
     24.2021     0.1713      24.3690     0.3270     -0.68961
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
              ebizzy (records/sec, higher is better)
+------------+-----------+------------+-----------+-----------+
    base        stdev       patched      stdev      %improve
+------------+-----------+------------+-----------+-----------+
   2494.4000    27.5447    2400.6000    83.4255     -3.76042
   2636.6000   302.9658    2757.5000   147.5137      4.58545
   2236.8333   239.6531    2131.6667   156.1534     -4.70158
   1768.8750   142.5437    1901.3750   295.2147      7.49064
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
              dbench (throughput in MB/sec, higher is better)
+------------+-----------+------------+-----------+-----------+
    base        stdev       patched      stdev      %improve
+------------+-----------+------------+-----------+-----------+
  10076.9180  2410.9655    5870.7460  4297.4532       xxx
   2152.5220    88.2853    1517.8270    61.9742    -29.48611
   1334.9608    34.3247    1078.4275    38.2288    -19.21654
    946.6355    32.0426     753.0757    25.5302    -20.44713
+------------+-----------+------------+-----------+-----------+

Please note that I have marked the dbench_1x result as xxx since I
observed very high variance in that run.



Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Rik van Riel

On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
>
>> Please let me know if you manage to break this code in any way,
>> so I can fix it...
>
> I didn't break it, but did let it play with rq->lock contention.  Using
> cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> pull around appears to have been a ~dead heat.


Good to hear that the code seems to be robust. It seems to
help prevent performance degradation in some workloads, and
nobody seems to have found regressions yet.

Thank you for testing.


Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-10 Thread Mike Galbraith
On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:

> Please let me know if you manage to break this code in any way,
> so I can fix it...

I didn't break it, but did let it play with rq->lock contention.  Using
cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
pull around appears to have been a ~dead heat.

                     3.6.11                                        3.6.11-spinlock

   PerfTop:   78852 irqs/sec  kernel:96.4%  exact:  0.0% [1000Hz cycles],  (all, 80 CPUs)
-------------------------------------------------------------------------------------------

  samples   pcnt function                       samples   pcnt function
  _______   ____ _________________________      _______   ____ _________________________

 468341.00 52.0% cpupri_set                    471786.00 52.0% cpupri_set
 110259.00 12.2% _raw_spin_lock                 88963.00  9.8% ticket_spin_lock_wait
  78863.00  8.8% native_write_msr_safe          77109.00  8.5% native_write_msr_safe
  42882.00  4.8% __schedule                     48858.00  5.4% native_write_cr0
  40930.00  4.5% native_write_cr0               47038.00  5.2% __schedule
  13718.00  1.5% finish_task_switch             24775.00  2.7% _raw_spin_lock
  13188.00  1.5% plist_del                      13117.00  1.4% plist_del
  13078.00  1.5% _raw_spin_lock_irqsave         12372.00  1.4% ttwu_do_wakeup
  12083.00  1.3% ttwu_do_wakeup                 11553.00  1.3% _raw_spin_lock_irqsave
   8359.00  0.9% pull_rt_task                    8186.00  0.9% pull_rt_task
   6979.00  0.8% apic_timer_interrupt            7989.00  0.9% finish_task_switch
   4623.00  0.5% __enqueue_rt_entity             6430.00  0.7% apic_timer_interrupt
   3961.00  0.4% resched_task                    4721.00  0.5% resched_task
   3942.00  0.4% __switch_to                     4109.00  0.5% __switch_to
   3128.00  0.3% _raw_spin_trylock               2917.00  0.3% rcu_idle_exit_common
   3081.00  0.3% __tick_nohz_idle_enter          2897.00  0.3% __local_bh_enable
   2561.00  0.3% update_curr_rt                  2873.00  0.3% _raw_spin_trylock
   2385.00  0.3% _raw_spin_lock_irq              2674.00  0.3% __enqueue_rt_entity
   2190.00  0.2% __local_bh_enable               2434.00  0.3% update_curr_rt
   1904.00  0.2% rcu_idle_exit_common            2161.00  0.2% hrtimer_interrupt
   1870.00  0.2% clockevents_program_event       2106.00  0.2% ktime_get_update_offsets
   1828.00  0.2% hrtimer_interrupt               1766.00  0.2% tick_nohz_idle_exit
   1741.00  0.2% do_nanosleep                    1608.00  0.2% __tick_nohz_idle_enter
   1681.00  0.2% sys_clock_nanosleep             1437.00  0.2% do_nanosleep
   1639.00  0.2% pick_next_task_rt               1428.00  0.2% hrtimer_init
   1630.00  0.2% pick_next_task_stop             1320.00  0.1% sched_clock_idle_sleep_event
   1535.00  0.2% _raw_spin_unlock_irqrestore     1290.00  0.1% sys_clock_nanosleep













Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-09 Thread Rafael Aquini
On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
> x base_3.8rc2
> + rik_backoff
> N   Min       Max       Median    Avg        Stddev
> x   8   222.977   231.16    227.735   227.388    3.1512986
> +   8   218.75    232.347   229.1035  228.25425  4.2730225
> No difference proven at 95.0% confidence

I got similar results on smaller systems (1 socket, dual-cores and quad-cores)
when running Rik's latest series, no big difference for good nor for worse,
but I also think Rik's work is meant to address bigger systems with more cores
contending for any given spinlock.




Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-09 Thread Raghavendra K T

On 01/09/2013 03:56 AM, Rik van Riel wrote:
> [...]
>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...



The patch series no longer shows the weird behaviour caused by the
underflow (pointed out by Michel) and looks fine.

I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
x base_3.8rc2
+ rik_backoff
N   Min       Max       Median    Avg        Stddev
x   8   222.977   231.16    227.735   227.388    3.1512986
+   8   218.75    232.347   229.1035  228.25425  4.2730225
No difference proven at 95.0% confidence

The run did not show much difference, but I believe a spinlock stress
test would have shown the benefit.
I'll start running benchmarks on KVM guests now and come back with a
report.







[PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-08 Thread Rik van Riel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks is delaying the time between
checking the ticket based on a delay factor, and the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
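
As a simplified illustration of that idea (a sketch only, not the
actual patch from this series; the fixed delay factor is an assumption
for illustration):

static void ticket_spin_lock_wait(arch_spinlock_t *lock, __ticket_t ticket)
{
	/* Fixed factor for illustration; the series auto-tunes this. */
	unsigned int delay_factor = 50;

	for (;;) {
		__ticket_t head = ACCESS_ONCE(lock->tickets.head);
		unsigned int waiters_ahead = ticket - head;
		unsigned int loops;

		if (!waiters_ahead)
			break;	/* our ticket came up; the lock is ours */

		/* Wait proportionally to our queue position, then re-check. */
		loops = delay_factor * waiters_ahead;
		while (loops--)
			cpu_relax();
	}
}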

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
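
(For flavor, a hypothetical reconstruction of such a test program, not
the one actually used: N threads doing down/up on one SysV semaphore,
so every operation goes through in-kernel locking.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define MAX_THREADS 64

static int semid;
static volatile int stop;
static long counts[MAX_THREADS];	/* per-thread op counters */

static void *worker(void *arg)
{
	long *count = &counts[(long)arg];
	struct sembuf down = { 0, -1, 0 };	/* "lock" */
	struct sembuf up   = { 0, +1, 0 };	/* "unlock" */

	while (!stop) {
		semop(semid, &down, 1);
		semop(semid, &up, 1);
		(*count)++;
	}
	return NULL;
}

int main(int argc, char **argv)
{
	long nthreads = argc > 1 ? atoi(argv[1]) : 1;
	union semun { int val; } arg = { .val = 1 };
	pthread_t tid[MAX_THREADS];
	long total = 0;

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	semctl(semid, 0, SETVAL, arg);	/* semaphore starts "unlocked" */

	for (long i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(10);
	stop = 1;
	for (long i = 0; i < nthreads; i++) {
		pthread_join(tid[i], NULL);
		total += counts[i];
	}
	printf("%ld threads: %ld ops/sec\n", nthreads, total / 10);
	semctl(semid, 0, IPC_RMID);
	return 0;
}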

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
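
A rough sketch of what per-cpu auto-tuning can look like (the update
rule and the constants here are illustrative guesses, not the actual
heuristic used in the series):

static DEFINE_PER_CPU(unsigned int, spinlock_delay) = 50;

static unsigned int tune_delay(unsigned int delay, bool lock_was_free_early)
{
	if (lock_was_free_early)
		delay -= delay / 8;	/* we kept the lock waiting: shorten */
	else
		delay += delay / 16 + 1;	/* still contended: lengthen */

	/* clamp to a hypothetical sane range */
	return clamp(delay, 1U, 16000U);
}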

The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...

-- 
All rights reversed.

attachment: spinlock-backoff-v2.png


Re: [RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-03 Thread Raghavendra KT
[CCing my ibm id]
On Thu, Jan 3, 2013 at 10:45 AM, Rik van Riel <r...@redhat.com> wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> [...]
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...

Hi Rik,
 The whole series looks very interesting. Thanks for posting the
spinlock series. I am also curious how the series affects
virtualization cases.
(I was about to reply to V1 with Eric's changes, but was delayed
because of vacation.)

I am planning to try V2 on bare metal and guests, and will come back.

On a related note,
while experimenting with PV spinlocks, I had tried a "weighted spinlock".
The rationale behind the patch is that in a virtualized environment the
use case is exactly the opposite:

if the head and tail difference is large, the probability of getting the
lock soon is very low, so spin for only a little time when the difference
is high; and after the loop is over, if we have not got the lock, halt
(PV spinlock) / yield to a better guy.

Here is the patch, for reference, that I tried on top of PV spinlocks.

Summary of the patch: the spin count is proportional to

         2 * SPIN_THRESHOLD
        --------------------
        ( head - tail - 1 )

---8<---
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e6881fd..2f637ce 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -53,6 +53,18 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 	set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 }
 
+static inline unsigned get_spin_threshold(int diff)
+{
+	unsigned count = SPIN_THRESHOLD;
+
+	/* handle wrap around */
+	if (unlikely(diff < 0))
+		diff += TICKETLOCK_MAX_VAL;
+
+	count = count >> ((diff - 1) >> 1);
+	return count;
+}
+
 #else	/* !CONFIG_PARAVIRT_SPINLOCKS */
 static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
 						   __ticket_t ticket)
@@ -62,6 +74,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
 					__ticket_t ticket)
 {
 }
+static inline unsigned get_spin_threshold(int diff)
+{
Re: [RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-03 Thread Michel Lespinasse
On Wed, Jan 2, 2013 at 9:15 PM, Rik van Riel <r...@redhat.com> wrote:
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...

I'm seeing some very weird things on my 3 test machines. Looks like
the spinlock delay sometimes gets crazy, at which point spinlock
performance becomes unbearable and the network driver freaks out.

# ./spinlock_load_test
 1 spinlocks: 24159990
 2 spinlocks: 12900657
 3 spinlocks: 11547771
 4 spinlocks: 9113
 6 spinlocks: 259
 8 spinlocks: 310
12 spinlocks: 283
16 spinlocks:
seems to be stuck here; meanwhile my serial console fills up with
network driver cries

Well, I take it as an incentive to pay special attention to this code review :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


[RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-02 Thread Rik van Riel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks is delaying the time between
checking the ticket based on a delay factor, and the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
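
A minimal sketch of that idea (an illustration, not the actual patch:
ticket_spin_backoff and the plain 'delay' argument stand in for the
real slow path and its auto-tuned per-cpu factor):

static inline void ticket_spin_backoff(arch_spinlock_t *lock,
				       __ticket_t my_ticket,
				       unsigned int delay)
{
	for (;;) {
		__ticket_t head = ACCESS_ONCE(lock->tickets.head);
		__ticket_t waiters_ahead = my_ticket - head;
		unsigned int loops;

		if (!waiters_ahead)
			break;	/* our ticket came up */

		/* spin privately, off the shared cache line */
		loops = delay * waiters_ahead;
		while (loops--)
			cpu_relax();
	}
}

The key property is that each waiter touches the lock's cache line
roughly once per expected turn-around instead of hammering it
continuously.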

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
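
One plausible shape for that tuning, shown purely as an assumption
(the actual heuristic in the series may differ), is a per-cpu delay
that gets nudged after every contended acquisition:

/* Sketch only: the seed of 50 and the +/-1 steps are assumed. */
static DEFINE_PER_CPU(unsigned int, spinlock_delay) = 50;

static inline void tune_delay(bool busy_at_our_turn)
{
	unsigned int delay = __this_cpu_read(spinlock_delay);

	if (busy_at_our_turn)
		delay++;	/* re-checked too early; wait longer */
	else if (delay > 1)
		delay--;	/* over-waited; back off a little */

	__this_cpu_write(spinlock_delay, delay);
}

Either way, the factor drifts toward the hardware's actual lock hold
time instead of being guessed at compile time.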

The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

Please let me know if you manage to break this code in any way,
so I can fix it...

-- 
All rights reversed.
attachment: spinlock-backoff-v2.png
