Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

2019-07-12 Thread Hanjun Guo
On 2019/7/3 19:58, Jan Glauber wrote:
> Hi Alex,
> I've tried this series on arm64 (ThunderX2 with up to SMT=4 and 224 CPUs)
> with the borderline testcase of accessing a single file from all threads.
> With that testcase the qspinlock slowpath is the top spot in the kernel.
> 
> The results look really promising:
> 
> CPUs    normal    numa-qspinlocks
> ----------------------------------
>  56     149.41     73.90
> 224     576.95    290.31
> 
> Also, frontend stalls are reduced to 50% and interconnect traffic is
> greatly reduced.
> Tested-by: Jan Glauber 

Tested this patchset on a Kunpeng920 ARM64 server (96 cores,
4 NUMA nodes), and with the same test case from Jan, I can
see a 150%+ boost! (This needs the additional patch below [1].)

For a real workload such as Nginx, I can see about a 10%
performance improvement as well.

Tested-by: Hanjun Guo 

Please Cc me on new versions; I'm willing to test them.

Thanks
Hanjun

[1]
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 657bbc5..72c1346 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -792,6 +792,20 @@ config NODES_SHIFT
  Specify the maximum number of NUMA Nodes available on the target
  system.  Increases memory reserved to accommodate various tables.

+config NUMA_AWARE_SPINLOCKS
+ bool "Numa-aware spinlocks"
+ depends on NUMA
+ default y
+ help
+   Introduce NUMA (Non Uniform Memory Access) awareness into
+   the slow path of spinlocks.
+
+   The kernel will try to keep the lock on the same node,
+   thus reducing the number of remote cache misses, while
+   trading some of the short term fairness for better performance.
+
+   Say N if you want absolute first come first serve fairness.
+
 config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 2994167..be5dd44 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,7 +4,7 @@
 #endif

 #include 
-
+#include 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
  *
@@ -170,7 +170,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
  u32 tail)
 {
if (decode_numa_node(node->node_and_count) == -1)
-   store_numa_node(node, numa_cpu_node(cpuid));
+   store_numa_node(node, cpu_to_node(cpuid));
node->encoded_tail = tail;
 }
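
For context (this note is not part of the patch above): cpu_to_node() is the
architecture-independent helper declared in <linux/topology.h> that maps a CPU
id to its NUMA node, which is presumably why the hunk above substitutes it for
numa_cpu_node(). A minimal sketch of the lookup the CNA node-initialization
path needs might look like the following; waiter_numa_node() is an illustrative
name, not a function from the series:

#include <linux/topology.h>

/*
 * Illustrative sketch only: resolve the NUMA node a waiter's CPU belongs to,
 * using the generic cpu_to_node() helper so the same code builds on arm64 as
 * well as x86.
 */
static inline int waiter_numa_node(int cpuid)
{
	return cpu_to_node(cpuid);
}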



Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

2019-07-03 Thread Jan Glauber
Hi Alex,
I've tried this series on arm64 (ThunderX2 with up to SMT=4 and 224 CPUs)
with the borderline testcase of accessing a single file from all threads.
With that testcase the qspinlock slowpath is the top spot in the kernel.

The results look really promising:

CPUs    normal    numa-qspinlocks
----------------------------------
 56     149.41     73.90
224     576.95    290.31

Also, frontend stalls are reduced to 50% and interconnect traffic is
greatly reduced.
Tested-by: Jan Glauber 

--Jan

On Fri, Mar 29, 2019 at 16:23, Alex Kogan  wrote:
>
> This version addresses feedback from Peter and Waiman. In particular,
> the CNA functionality has been moved to a separate file, and is controlled
> by a config option (enabled by default if NUMA is enabled).
> An optimization has been introduced to reduce the overhead of shuffling
> threads between waiting queues when the lock is only lightly contended.
>
> Summary
> ---
>
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It can be
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>
> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
> organized in two queues, a main queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. At unlock time, the lock
> holder scans the main queue looking for a thread running on the same
> node. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If no such T is found, the
> lock is passed to the first node in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue. To avoid starvation of threads in the secondary queue,
> those threads are moved back to the head of the main queue
> after a certain expected number of intra-node lock hand-offs.
>
> More details are available at https://arxiv.org/abs/1810.05600.
>
> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The
> standard deviation is also reported in (), and in general, with a few
> exceptions, is about 3%. The 'stock' kernel is v5.0-rc8,
> commit 28d49e282665 ("locking/lockdep: Shrink struct lock_class_key"),
> compiled in the default configuration. 'patch' is the modified
> kernel compiled with NUMA_AWARE_SPINLOCKS not set; it is included to show
> that any performance changes to the existing qspinlock implementation are
> essentially noise. 'patch-CNA' is the modified kernel with
> NUMA_AWARE_SPINLOCKS set; the speedup is calculated by dividing
> 'patch-CNA' by 'stock'.
>
> #thr  stock          patch          patch-CNA       speedup (patch-CNA/stock)
>   1  2.731 (0.102)  2.732 (0.093)   2.716 (0.082)  0.995
>   2  3.071 (0.124)  3.084 (0.109)   3.079 (0.113)  1.003
>   4  4.221 (0.138)  4.229 (0.087)   4.408 (0.103)  1.044
>   8  5.366 (0.154)  5.274 (0.094)   6.958 (0.233)  1.297
>  16  6.673 (0.164)  6.689 (0.095)   8.547 (0.145)  1.281
>  32  7.365 (0.177)  7.353 (0.183)   9.305 (0.202)  1.263
>  36  7.473 (0.198)  7.422 (0.181)   9.441 (0.196)  1.263
>  72  6.805 (0.182)  6.699 (0.170)  10.020 (0.218)  1.472
> 108  6.509 (0.082)  6.480 (0.115)  10.027 (0.194)  1.540
> 142  6.223 (0.109)  6.294 (0.100)   9.874 (0.183)  1.587
>
> The following tables contain throughput results (ops/us) from the same
> setup for will-it-scale/open1_threads:
>
> #thr  stock          patch          patch-CNA      speedup (patch-CNA/stock)
>   1  0.565 (0.004)  0.567 (0.001)  0.565 (0.003)  0.999
>   2  0.892 (0.021)  0.899 (0.022)  0.900 (0.018)  1.009
>   4  1.503 (0.031)  1.527 (0.038)  1.481 (0.025)  0.985
>   8  1.755 (0.105)  1.714 (0.079)  1.683 (0.106)  0.959
>  16  1.740 (0.095)  1.752 (0.087)  1.693 (0.098)  0.973
>  32  0.884 (0.080)  0.908 (0.090)  1.686 (0.092)  1.906
>  36  0.907 (0.095)  0.894 (0.088)  1.709 (0.081)  1.885
>  72  0.856 (0.041)  0.858 (0.043)  1.707 (0.082)  1.994
> 108  0.858 (0.039)  0.869 (0.037)  1.732 (0.076)  2.020
> 142  0.809 (0.044)  0.854 (0.044)  1.728 (0.083)  2.135
>
> and will-it-scale/lock2_threads:
>
> #thr  stock          patch          patch-CNA      speedup (patch-CNA/stock)
>   1  1.713 (0.004)  1.715 (0.004)  1.711 (0.004)  0.999
>   2  2.889 (0.057)  2.864 (0.0

Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

2019-04-03 Thread Alex Kogan


> On Apr 1, 2019, at 5:09 AM, Peter Zijlstra  wrote:
> 
> On Fri, Mar 29, 2019 at 11:20:01AM -0400, Alex Kogan wrote:
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). 
> 
> The other interesting number is on a !NUMA machine. What do these
> patches do there? Remember, most people do not in fact have numa.
I will make sure to include numbers from a !NUMA machine in the next revision 
of the patch.

Regards,
— Alex

Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

2019-04-01 Thread Peter Zijlstra
On Fri, Mar 29, 2019 at 11:20:01AM -0400, Alex Kogan wrote:
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). 

The other interesting number is on a !NUMA machine. What do these
patches do there? Remember, most people do not in fact have numa.