Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-10 Thread Michel Lespinasse
On Thu, Jan 10, 2013 at 5:05 AM, Rik van Riel  wrote:
> Eric,
>
> with just patches 1-3, can you still reproduce the
> regression on your system?
>
> In other words, could we get away with dropping the
> complexity of patch 4, or do we still need it?

To be clear, I must say that I'm not opposing patch 4 per se. I think
we should not rely on it to avoid regressions, as patch 3 needs to be
robust enough to do that on its own. However, it may very well be that
having different constants for each lock (or for each hash bucket as a
proxy) helps - if lock B has a consistently longer hold time than lock
A, having them in separate hash buckets will allow us to use optimal
tunings for both, but if they collide or if we don't have a hash
table, we'll use a delay that is close to A's value for both, which is
safe (shouldn't introduce regressions) but not optimal.

In other words, I really don't want us to depend on the hash table for
robustness, but I think it's fine to have it for extra performance (as
the code for it is actually very short).
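
As a rough illustration of the bucketing (a user-space sketch only, not
kernel code - the multiplier is a golden-ratio style constant in the
spirit of linux/hash.h rather than a value taken from the patch): the
lock address is folded to 32 bits and the top DELAY_HASH_SHIFT bits of
a multiplicative hash pick one of 64 per-cpu slots, so a handful of hot
locks will usually end up in different buckets.

#include <stdio.h>
#include <stdint.h>

#define DELAY_HASH_SHIFT 6	/* 1 << 6 = 64 slots, as in the patch */

/* Fold a pointer to 32 bits, then spread it with a multiplicative hash. */
static uint32_t slot_of(const void *lock)
{
	uint64_t val = (uint64_t)(uintptr_t)lock;
	uint32_t hash = (uint32_t)(val ^ (val >> 32));

	return (hash * 0x9e370001u) >> (32 - DELAY_HASH_SHIFT);
}

int main(void)
{
	static int lock_a, lock_b;	/* stand-ins for two contended spinlocks */

	printf("lock A -> slot %u\n", (unsigned)slot_of(&lock_a));
	printf("lock B -> slot %u\n", (unsigned)slot_of(&lock_b));
	return 0;
}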

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-10 Thread Rik van Riel

On 01/10/2013 08:01 AM, Michel Lespinasse wrote:

> On Tue, Jan 8, 2013 at 2:31 PM, Rik van Riel  wrote:
>
>> From: Eric Dumazet 
>>
>> Eric Dumazet found a regression with the first version of the spinlock
>> backoff code, in a workload where multiple spinlocks were contended,
>> each having a different wait time.
>>
>> This patch has multiple delay values per cpu, indexed on a hash
>> of the lock address, to avoid that problem.
>>
>> Eric Dumazet wrote:
>>
>> I did some tests with your patches with following configuration :
>>
>> tc qdisc add dev eth0 root htb r2q 1000 default 3
>> (to force a contention on qdisc lock, even with a multi queue net
>> device)
>>
>> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>>
>> Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
>> (24 threads), and a fast NIC (10Gbps)
>>
>> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>>
>> In this workload we have at least two contended spinlocks, with
>> different delays. (spinlocks are not held for the same duration)
>>
>> It clearly defeats your assumption of a single per cpu delay being OK :
>> Some cpus are spinning too long while the lock was released.
>>
>> We might try to use a hash on lock address, and an array of 16 different
>> delays so that different spinlocks have a chance of not sharing the same
>> delay.
>>
>> With following patch, I get 982 Mbits/s with same bench, so an increase
>> of 45 % instead of a 13 % regression.
>
>
> Note that these results were with your v1 proposal. With v3 proposal,
> on a slightly different machine (2 socket sandybridge) with a similar
> NIC, I am not seeing the regression when not using the hash table. I
> think this is because v3 got more conservative about mixed spinlock
> hold times, and converges towards the shortest of the hold times in
> that case.


Eric,

with just patches 1-3, can you still reproduce the
regression on your system?

In other words, could we get away with dropping the
complexity of patch 4, or do we still need it?


--
All rights reversed


Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-10 Thread Michel Lespinasse
On Tue, Jan 8, 2013 at 2:31 PM, Rik van Riel  wrote:
> From: Eric Dumazet 
>
> Eric Dumazet found a regression with the first version of the spinlock
> backoff code, in a workload where multiple spinlocks were contended,
> each having a different wait time.
>
> This patch has multiple delay values per cpu, indexed on a hash
> of the lock address, to avoid that problem.
>
> Eric Dumazet wrote:
>
> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
>
> It clearly defeats your assumption of a single per cpu delay being OK :
> Some cpus are spinning too long while the lock was released.
>
> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
>
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.

Note that these results were with your v1 proposal. With v3 proposal,
on a slightly different machine (2 socket sandybridge) with a similar
NIC, I am not seeing the regression when not using the hash table. I
think this is because v3 got more conservative about mixed spinlock
hold times, and converges towards the shortest of the hold times in
that case.

> -   unsigned delay = __this_cpu_read(spinlock_delay);
> +   u32 hash = hash32_ptr(lock);
> +   u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> +   struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> +   u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;

I think we should actually always use ent->delay here. My reasoning is
that I think the base algorithm (from the previous patch) should have
robustness against mixed spinlock hold times. We can't rely on
detecting hash collisions to protect us against varying hold times,
because this case could happen even with a single spinlock. So we need
to make sure the base algorithm is robust and converges towards using
the shorter of the spinlock hold times; if we have that then forcing a
reset to MIN_SPINLOCK_DELAY only hurts since it reduces the delay
below what is otherwise necessary.
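
Concretely, the suggestion amounts to something like this (a sketch of
the idea against the quoted hunk, not a tested patch):

-	u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
+	u32 delay = ent->delay;

A collision then simply means two locks share one auto-tuned value,
which the backoff algorithm is expected to pull toward the shorter of
the two hold times rather than resetting to the minimum.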


Other than that:
Reviewed-by: Michel Lespinasse 

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-09 Thread Rafael Aquini
On Tue, Jan 08, 2013 at 05:31:19PM -0500, Rik van Riel wrote:
> From: Eric Dumazet 
> 
> Eric Dumazet found a regression with the first version of the spinlock
> backoff code, in a workload where multiple spinlocks were contended,
> each having a different wait time.
> 
> This patch has multiple delay values per cpu, indexed on a hash
> of the lock address, to avoid that problem.
> 
> Eric Dumazet wrote:
> 
> I did some tests with your patches with following configuration :
> 
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
> 
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
> 
> Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
> 
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
> 
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
> 
> It clearly defeats your assumption of a single per cpu delay being OK :
> Some cpus are spinning too long while the lock was released.
> 
> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
> 
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.
> 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Rik van Riel 
> ---

Acked-by: Rafael Aquini 


>  arch/x86/kernel/smp.c |   22 +++---
>  1 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 05f828b..1877890 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -23,6 +23,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/cpu.h>
>  #include <linux/gfp.h>
> +#include <linux/hash.h>
>  
>  #include <asm/mtrr.h>
>  #include <asm/tlbflush.h>
> @@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
>  #define DELAY_FIXED_1 (1<<DELAY_SHIFT)
>  #define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
>  #define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
> -DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> +#define DELAY_HASH_SHIFT 6
> +struct delay_entry {
> + u32 hash;
> + u32 delay;
> +};
> +static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], 
> spinlock_delay) = {
> + [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
> + .hash = 0,
> + .delay = MIN_SPINLOCK_DELAY,
> + },
> +};
> +
>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
>   __ticket_t head = inc.head, ticket = inc.tail;
>   __ticket_t waiters_ahead;
> - unsigned delay = __this_cpu_read(spinlock_delay);
> + u32 hash = hash32_ptr(lock);
> + u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> + struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
>   unsigned loops = 1;
>  
>   for (;;) {
> @@ -175,7 +190,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct 
> __raw_tickets inc)
>   break;
>   }
>   }
> - __this_cpu_write(spinlock_delay, delay);
> + ent->hash = hash;
> + ent->delay = delay;
>  }
>  
>  /*
> 


[PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-08 Thread Rik van Riel
From: Eric Dumazet 

Eric Dumazet found a regression with the first version of the spinlock
backoff code, in a workload where multiple spinlocks were contended,
each having a different wait time.

This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.

Eric Dumazet wrote:

I did some tests with your patches with following configuration :

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)

Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)

It clearly defeats your assumption of a single per cpu delay being OK :
Some cpus are spinning too long while the lock was released.

We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.

With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.

Signed-off-by: Eric Dumazet 
Signed-off-by: Rik van Riel 
---
 arch/x86/kernel/smp.c |   22 +++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 05f828b..1877890 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
 #define DELAY_FIXED_1 (1<<DELAY_SHIFT)
 #define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
 #define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+   u32 hash;
+   u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+   [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+   .hash = 0,
+   .delay = MIN_SPINLOCK_DELAY,
+   },
+};
+
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
	__ticket_t head = inc.head, ticket = inc.tail;
	__ticket_t waiters_ahead;
-   unsigned delay = __this_cpu_read(spinlock_delay);
+   u32 hash = hash32_ptr(lock);
+   u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+   struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+   u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
unsigned loops = 1;
 
for (;;) {
@@ -175,7 +190,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct 
__raw_tickets inc)
break;
}
}
-   __this_cpu_write(spinlock_delay, delay);
+   ent->hash = hash;
+   ent->delay = delay;
 }
 
 /*



Re: [RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-03 Thread Eric Dumazet
On Thu, 2013-01-03 at 04:48 -0800, Michel Lespinasse wrote:
> On Wed, Jan 2, 2013 at 9:24 PM, Rik van Riel  wrote:
> > From: Eric Dumazet 
> >
> > Eric Dumazet found a regression with the spinlock backoff code,
> > in workloads where multiple spinlocks were contended, each having
> > a different wait time.
> 
> I think you should really clarify that the regression was observed
> with version 1 of your proposal. At that time,
> 
> 1- the autotune code tended to use too long delays for long held locks, and
> 
> 2- there was no exponential backoff, which meant that sharing stats
> between a long held and a short held spinlock could really hurt
> throughput on the short held spinlock
> 
> 
> I believe that with autotune v2, this really shouldnt be a problem and
> stats sharing should result in just using a delay that's appropriate
> for the shorter of the two lock hold times - which is not the optimal
> value, but is actually close enough performance wise and, most
> importantly, should not cause any regression when compared to current
> mainline.
> 
> (it's important to point that out because otherwise, you might trick
> Linus into thinking your patches are risky, which I think they
> shouldn't be after you implemented exponential backoff)
> 
> >  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> >  {
> > __ticket_t head = inc.head, ticket = inc.tail;
> > __ticket_t waiters_ahead;
> > -   int delay = __this_cpu_read(spinlock_delay);
> > +   u32 hash = hash32_ptr(lock);
> > +   u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> > +   struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> > +   u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
> 
> IMO we want to avoid MIN_SPINLOCK_DELAY here. The exponential backoff
> autotune should make us resilient to collisions (if they happen we'll
> just end up with something very close to the min of the delays that
> would have been appropriate for either locks), so it should be better
> to just let collisions happen rather than force the use of
> MIN_SPINLOCK_DELAY.
> 

Exponential backoff won't help. I tried this idea last week and found
that it's better to detect the hash collision and safely use
MIN_SPINLOCK_DELAY in that case.

It's better not to overestimate the delay and spin much longer than
needed.

On a hash collision, we don't know the contention history of this
lock at all, unless we store the EWMA delay inside the lock.

(On x86 and NR_CPUS <= 256, we have a 16 bit hole in the spinlock that
 we could use for this)
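
To sketch that last idea (a purely hypothetical layout, not actual
kernel code - the delay_hint field name is invented here): with 8-bit
tickets, a per-lock delay estimate could sit right next to them, so a
waiter would still have history even when its per-cpu hash slot has
been taken over by another lock.

/* Hypothetical kernel-style sketch only. */
struct hypothetical_ticket_lock {
	u8  head;	/* ticket currently being served */
	u8  tail;	/* next ticket to hand out */
	u16 delay_hint;	/* EWMA of the observed hold time */
};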





Re: [RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-03 Thread Michel Lespinasse
On Wed, Jan 2, 2013 at 9:24 PM, Rik van Riel  wrote:
> From: Eric Dumazet 
>
> Eric Dumazet found a regression with the spinlock backoff code,
> in workloads where multiple spinlocks were contended, each having
> a different wait time.

I think you should really clarify that the regression was observed
with version 1 of your proposal. At that time,

1- the autotune code tended to use too long delays for long held locks, and

2- there was no exponential backoff, which meant that sharing stats
between a long held and a short held spinlock could really hurt
throughput on the short held spinlock


I believe that with autotune v2, this really shouldn't be a problem:
stats sharing should result in just using a delay that's appropriate
for the shorter of the two lock hold times - which is not the optimal
value, but is actually close enough performance-wise and, most
importantly, should not cause any regression when compared to current
mainline.

(it's important to point that out because otherwise, you might trick
Linus into thinking your patches are risky, which I think they
shouldn't be after you implemented exponential backoff)

>  void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>  {
> __ticket_t head = inc.head, ticket = inc.tail;
> __ticket_t waiters_ahead;
> -   int delay = __this_cpu_read(spinlock_delay);
> +   u32 hash = hash32_ptr(lock);
> +   u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> +   struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> +   u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;

IMO we want to avoid MIN_SPINLOCK_DELAY here. The exponential backoff
autotune should make us resilient to collisions (if they happen we'll
just end up with something very close to the min of the delays that
would have been appropriate for either locks), so it should be better
to just let collisions happen rather than force the use of
MIN_SPINLOCK_DELAY.
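
As a sanity check of that argument, a throwaway user-space simulation
(the update rule and constants below are made up purely to illustrate
the point, they are not the kernel algorithm): two locks with very
different hold times share a single delay estimate, and a rule that
creeps the delay up slowly but cuts it quickly on overshoot settles
just above the shorter hold time.

#include <stdio.h>

int main(void)
{
	unsigned int hold[2] = { 100, 1600 };	/* cycles each lock is held */
	unsigned int delay = 1;			/* shared delay estimate */
	int i;

	for (i = 0; i < 10000; i++) {
		unsigned int h = hold[i & 1];	/* alternate between the two locks */

		if (delay > h)
			delay -= (delay - h) / 2;	/* overshot: back off quickly */
		else
			delay += 1;			/* undershot: creep up slowly */
	}

	printf("shared delay settles near %u (shorter hold time is %u)\n",
	       delay, hold[0]);
	return 0;
}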

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


[RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address

2013-01-02 Thread Rik van Riel
From: Eric Dumazet 

Eric Dumazet found a regression with the spinlock backoff code,
in workloads where multiple spinlocks were contended, each having
a different wait time.

This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.

Eric Dumazet wrote:

I did some tests with your patches with following configuration :

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine : 2 Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)

Resulting in a 13 % regression (676 Mbits -> 595 Mbits)

In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)

It clearly defeats your assumption of a single per cpu delay being OK :
Some cpus are spinning too long while the lock was released.

We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.

With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.

Signed-off-by: Eric Dumazet 
Signed-off-by: Rik van Riel 
---
 arch/x86/kernel/smp.c |   22 +++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 6065291..29360c4 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -135,12 +136,26 @@ static bool smp_no_nmi_ipi = false;
  */
 #define MIN_SPINLOCK_DELAY 1
 #define MAX_SPINLOCK_DELAY 16000
-DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+   u32 hash;
+   u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], 
spinlock_delay) = {
+   [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+   .hash = 0,
+   .delay = MIN_SPINLOCK_DELAY,
+   },
+};
+
 void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
 {
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
-   int delay = __this_cpu_read(spinlock_delay);
+   u32 hash = hash32_ptr(lock);
+   u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+   struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+   u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
unsigned loops;
 
for (;;) {
@@ -178,7 +193,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct 
__raw_tickets inc)
break;
}
}
-   __this_cpu_write(spinlock_delay, delay);
+   ent->hash = hash;
+   ent->delay = delay;
 }
 
 /*

