[PATCH v2 1/3] mutex: Make more scalable by doing less atomic operations

2013-04-15 Thread Waiman Long
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch reduces the number of atomic_xchg() instructions issued by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be two orders of magnitude or more
costly than atomic_read() because it must take exclusive ownership of
the cache line. By using atomic_read() to check the value before
calling atomic_xchg(), we can avoid a lot of unnecessary cache
coherency traffic. The only downside of this change is that a task on
the slow path has a slightly lower chance of getting the mutex when
competing with a task in the fast path.
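
As an illustrative sketch (not the literal patch hunk, and the helper
name is invented for this description), the slow-path exchange becomes
something like:

    /*
     * Sketch only: returns 1 if the mutex was acquired.  The plain
     * atomic_read() keeps the cache line in a shared state; the
     * cache-line-invalidating atomic_xchg() is issued only when the
     * count is not already -1, i.e. when the exchange can either
     * acquire the mutex or still needs to mark it as contended for
     * the benefit of the unlock path.
     */
    static inline int __mutex_slowpath_try_xchg(struct mutex *lock)
    {
            if (atomic_read(&lock->count) == -1)
                    return 0;       /* already contended, skip the RMW */
            return atomic_xchg(&lock->count, -1) == 1;
    }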

The same is true for the atomic_cmpxchg() call in the mutex
spin-on-owner loop, so an atomic_read() is also performed before
calling atomic_cmpxchg() there.
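
Again as a sketch rather than the literal diff, the acquisition attempt
inside the spinning loop takes roughly this shape:

    /*
     * Sketch only: count == 1 means the mutex is free.  The cheap
     * read filters out the common contended case so that the
     * expensive atomic_cmpxchg() is attempted only when it has a
     * realistic chance of succeeding.
     */
    if ((atomic_read(&lock->count) == 1) &&
        (atomic_cmpxchg(&lock->count, 1, 0) == 1))
            break;          /* mutex acquired, leave the spin loop */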

The mutex locking and unlocking code for the x86 architecture allows
any negative number in the mutex count to indicate that some tasks are
waiting for the mutex. I am not sure whether that is the case for the
other architectures, so the default is to skip atomic_xchg() only when
the count has already been set to -1. For x86, the check is extended to
cover all negative numbers.
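
One way to express that arch-dependent guard is a small macro along the
following lines; the name and the CONFIG_X86 switch are purely
illustrative of the behaviour described above, not necessarily how the
patch spells it:

    /*
     * Sketch only: decide whether the slow-path atomic_xchg() is
     * worth issuing.  The default skips it only when the count is
     * already -1; x86, where any negative count may mean "waiters
     * present", skips it for all negative values.
     */
    #ifdef CONFIG_X86
    #define MUTEX_MAY_XCHG(lock)   (atomic_read(&(lock)->count) >= 0)
    #else
    #define MUTEX_MAY_XCHG(lock)   (atomic_read(&(lock)->count) != -1)
    #endif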

The following table shows the jobs-per-minute (JPM) scalability data
on an 8-node, 80-core Westmere box running a 3.7.10 kernel. The numactl
command is used to restrict the high_systime workload to 1/2/4/8 nodes,
with hyperthreading on and off.

+-----------------+-----------+------------+----------+
|  Configuration  | Mean JPM  |  Mean JPM  | % Change |
|                 | w/o patch | with patch |          |
+-----------------+-----------+------------+----------+
|                 |      User Range 1100 - 2000       |
+-----------------+-----------+------------+----------+
| 8 nodes, HT on  |   36980   |   148590   | +301.8%  |
| 8 nodes, HT off |   42799   |   145011   | +238.8%  |
| 4 nodes, HT on  |   61318   |   118445   |  +51.1%  |
| 4 nodes, HT off |  158481   |   158592   |   +0.1%  |
| 2 nodes, HT on  |  180602   |   173967   |   -3.7%  |
| 2 nodes, HT off |  198409   |   198073   |   -0.2%  |
| 1 node , HT on  |  149042   |   147671   |   -0.9%  |
| 1 node , HT off |  126036   |   126533   |   +0.4%  |
+-----------------+-----------+------------+----------+
|                 |       User Range 200 - 1000       |
+-----------------+-----------+------------+----------+
| 8 nodes, HT on  |   41525   |   122349   | +194.6%  |
| 8 nodes, HT off |   49866   |   124032   | +148.7%  |
| 4 nodes, HT on  |   66409   |   106984   |  +61.1%  |
| 4 nodes, HT off |  119880   |   130508   |   +8.9%  |
| 2 nodes, HT on  |  138003   |   133948   |   -2.9%  |
| 2 nodes, HT off |  132792   |   131997   |   -0.6%  |
| 1 node , HT on  |  116593   |   115859   |   -0.6%  |
| 1 node , HT off |  104499   |   104597   |   +0.1%  |
+-----------------+-----------+------------+----------+

At the low user range of 10-100, the JPM differences were within
+/-1%, so they are not particularly interesting.

The AIM7 benchmark has a fairly large run-to-run variance due to the
random nature of the subtests executed, so a difference of less than
+/-5% may not be significant.

This patch improves high_systime workload performance at 4 nodes and
up by maintaining transaction rates without a significant drop-off at
high node counts. The patch has practically no impact on 1- and 2-node
systems.

The table below shows the percentage of time (as reported by perf
record -a -s -g) spent in the __mutex_lock_slowpath() function by the
high_systime workload at 1500 users for the 2/4/8-node configurations
with hyperthreading off.

+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
|    8 nodes    |     65.34%      |      0.69%       |  -99%   |
|    4 nodes    |      8.70%      |      1.02%       |  -88%   |
|    2 nodes    |      0.41%      |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
|   Workload   | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |

Re: [PATCH v2 1/3] mutex: Make more scalable by doing less atomic operations

2013-04-15 Thread Rik van Riel

On 04/15/2013 10:37 AM, Waiman Long wrote:

In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.

This patch reduces the number of atomic_xchg() instructions issued by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be two orders of magnitude or more
costly than atomic_read() because it must take exclusive ownership of
the cache line. By using atomic_read() to check the value before
calling atomic_xchg(), we can avoid a lot of unnecessary cache
coherency traffic. The only downside of this change is that a task on
the slow path has a slightly lower chance of getting the mutex when
competing with a task in the fast path.




Signed-off-by: Waiman Long <waiman.l...@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bu...@hp.com>


Reviewed-by: Rik van Riel <r...@redhat.com>


--
All rights reversed