Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-15 Thread Alex Kogan



> On Apr 13, 2021, at 8:03 AM, Peter Zijlstra  wrote:
> 
> On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:
> 
>> @@ -49,13 +55,33 @@ struct cna_node {
>>  u16 real_numa_node;
>>  u32 encoded_tail;   /* self */
>>  u32 partial_order;  /* enum val */
>> +s32 start_time;
>> };
> 
>> +/*
>> + * Controls the threshold time in ms (default = 10) for intra-node lock
>> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
>> + * passed to a thread on another NUMA node. The default setting can be
>> + * changed with the "numa_spinlock_threshold" boot option.
>> + */
>> +#define MSECS_TO_JIFFIES(m) \
>> +(((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
>> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
>> +
>> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> +{
>> +s32 current_time = (s32)jiffies;
>> +s32 threshold = cn->start_time + intra_node_handoff_threshold;
>> +
>> +return current_time - threshold > 0;
>> +}
> 
> None of this makes any sense:
> 
> - why do you track time elapsed as a signed entity?
> - why are you using jiffies; that's terrible granularity.
Good points. I will address that (see below). I will just mention that 
those suggestions came from senior folks on this mailing list,
and it seemed prudent to take their counsel. 
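For context: the signed subtraction follows the kernel's own wraparound-safe
time_after() idiom from <linux/jiffies.h>. A minimal equivalent sketch:

    #include <linux/jiffies.h>

    /* true once more than `threshold` jiffies have passed since `start` */
    static inline bool threshold_reached(unsigned long start, int threshold)
    {
            return time_after(jiffies, start + threshold);
    }

time_after() compares via signed subtraction, so the check stays correct
even when the jiffies counter wraps.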

> 
> As Andi already said, 10ms is silly large. You've just inflated the
> lock-acquire time for every contended lock to stupid land just because
> NUMA.
I just ran a few quick tests — local_clock() (a wrapper around sched_clock()) 
works well, so I will switch to using that.
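For reference, a minimal sketch of how the check could look on top of
local_clock(), which returns nanoseconds (the constant and field names here
are illustrative, not from the patch):

    #include <linux/sched/clock.h>
    #include <linux/time64.h>   /* NSEC_PER_MSEC */

    /* hypothetical ns-based threshold; 10ms matches the current default */
    static u64 handoff_threshold_ns __ro_after_init = 10 * NSEC_PER_MSEC;

    static inline bool intra_node_threshold_reached_ns(u64 start_time_ns)
    {
            /* local_clock() is a fast per-CPU clock in ns */
            return local_clock() - start_time_ns > handoff_threshold_ns;
    }

On platforms with a proper sched_clock(), local_clock() keeps advancing even
with interrupts disabled on the local CPU, which also addresses the jiffies
concern Andi raised.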

I also collected numbers with different thresholds. It looks like we can drop
the threshold to 1ms with only a minor performance penalty; pushing it to
100us, however, has a more significant cost. Here are
the numbers for reference:

will-it-scale/lock2_threads:
  threshold:                10ms     1ms      100us
  speedup at 142 threads:   2.184    1.974    1.1418

will-it-scale/open1_threads:
  threshold:                10ms     1ms      100us
  speedup at 142 threads:   2.146    1.974    1.291

Would you be more comfortable with setting the default at 1ms?

> And this also brings me to the whole premise of this series; *why* are
> we optimizing this? What locks are so contended that this actually helps
> and shouldn't you be spending your time breaking those locks? That would
> improve throughput more than this ever can.

I think for the same reason the kernel switched from ticket locks to queue locks
several years back. There will always be applications with contended locks.
Sometimes the workarounds are easy, but often they are not, as with
legacy applications or when the workload is inherently skewed (e.g., every client
tries to update the metadata of the same file, protected by the same lock). The
results show that in those cases we leave > 2x performance on the table. And those
are not just our numbers; LKP reports show similar or even better results
across a wide range of benchmarks, e.g.:
https://lists.01.org/hyperkitty/list/l...@lists.01.org/thread/HGVOCYDEE5KTLYPTAFBD2RXDQOCDPFUJ/
https://lists.01.org/hyperkitty/list/l...@lists.01.org/thread/OUPS7MZ3GJA2XYWM52GMU7H7EI25IT37/
https://lists.01.org/hyperkitty/list/l...@lists.01.org/thread/DNMEQPXJRQY2IKHZ3ERGRY6TUPWDTFUN/

Regards,
— Alex



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Waiman Long

On 4/14/21 1:26 PM, Andi Kleen wrote:
>> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
>> result, I think a module parameter will be no different from a kernel
>> command line parameter in this regard.
>
> You can still change it in /sys at runtime, even if it's in the vmlinux.

I see, thanks for the clarification.

Cheers,
Longman



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Andi Kleen
> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
> result, I think a module parameter will be no different from a kernel
> command line parameter in this regard.

You can still change it in /sys at runtime, even if it's in the vmlinux.
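A minimal sketch, reusing the threshold variable from the patch (built-in
code can use module_param() as well; the knob then shows up under
/sys/module/<name>/parameters, and mode 0644 makes it writable by root at
runtime; note that __ro_after_init would have to go):

    #include <linux/moduleparam.h>

    static int intra_node_handoff_threshold = 10;       /* ms */
    module_param(intra_node_handoff_threshold, int, 0644);
    MODULE_PARM_DESC(intra_node_handoff_threshold,
                     "intra-node lock handoff threshold in ms");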

-Andi


Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Waiman Long

On 4/13/21 5:01 PM, Alex Kogan wrote:
> Hi, Andi.
>
> Thanks for your comments!
>
>> On Apr 13, 2021, at 2:03 AM, Andi Kleen  wrote:
>>
>> Alex Kogan  writes:
>>
>>> +   numa_spinlock_threshold=[NUMA, PV_OPS]
>>> +   Set the time threshold in milliseconds for the
>>> +   number of intra-node lock hand-offs before the
>>> +   NUMA-aware spinlock is forced to be passed to
>>> +   a thread on another NUMA node.  Valid values
>>> +   are in the [1..100] range. Smaller values result
>>> +   in a more fair, but less performant spinlock,
>>> +   and vice versa. The default value is 10.
>>
>> ms granularity seems very coarse grained for this. Surely
>> at some point of spinning you can afford a ktime_get? But ok.
>
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.
>
>> Could you turn that into a module_param which can be changed at runtime?
>> Would be strange to have to reboot just to play with this parameter.
>
> Yes, good suggestion, thanks.
>
>> This would also make the code a lot shorter I guess.
>
> So you don’t think we need the command-line parameter, just the module_param?

The CNA code, if enabled, will be in vmlinux, not in a kernel module. As
a result, I think a module parameter will be no different from a kernel
command line parameter in this regard.


Cheers,
Longman




Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Alex Kogan



> On Apr 13, 2021, at 5:22 PM, Andi Kleen  wrote:
> 
>>> ms granularity seems very coarse grained for this. Surely
>>> at some point of spinning you can afford a ktime_get? But ok.
>> We are reading time when we are at the head of the (main) queue, but
>> don’t have the lock yet. Not sure about the latency of ktime_get(), but
>> anything reasonably fast but not necessarily precise should work.
> 
> Actually cpu_clock / sched_clock (see my other email). These should
> be fast without corner cases and also monotonic.
I see, thanks.

> 
>> 
>>> Could you turn that into a module_param which can be changed at runtime?
>>> Would be strange to have to reboot just to play with this parameter.
>> Yes, good suggestion, thanks.
>> 
>>> This would also make the code a lot shorter I guess.
>> So you don’t think we need the command-line parameter, just the module_param?
> 
> module_params can be changed at the command line too, so yes.
Got it, thanks again.

— Alex



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Andi Kleen
> > ms granularity seems very coarse grained for this. Surely
> > at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.

Actually cpu_clock / sched_clock (see my other email). These should
be fast without corner cases and also monotonic.

> 
> > Could you turn that into a module_param which can be changed at runtime?
> > Would be strange to have to reboot just to play with this parameter.
> Yes, good suggestion, thanks.
> 
> > This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

module_params can be changed at the command line too, so yes.
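For built-in code, the same parameter can be set at boot with
<modname>.<param>=<value> on the kernel command line, e.g. (the module name
here is hypothetical; it is derived from the object file the code is built
into):

    qspinlock.intra_node_handoff_threshold=1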

-Andi


Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Alex Kogan
Hi, Andi.

Thanks for your comments!

> On Apr 13, 2021, at 2:03 AM, Andi Kleen  wrote:
> 
> Alex Kogan  writes:
>> 
>> +numa_spinlock_threshold=[NUMA, PV_OPS]
>> +Set the time threshold in milliseconds for the
>> +number of intra-node lock hand-offs before the
>> +NUMA-aware spinlock is forced to be passed to
>> +a thread on another NUMA node.  Valid values
>> +are in the [1..100] range. Smaller values result
>> +in a more fair, but less performant spinlock,
>> +and vice versa. The default value is 10.
> 
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.
We are reading time when we are at the head of the (main) queue, but
don’t have the lock yet. Not sure about the latency of ktime_get(), but
anything reasonably fast but not necessarily precise should work.

> Could you turn that into a module_param which can be changed at runtime?
> Would be strange to have to reboot just to play with this parameter.
Yes, good suggestion, thanks.

> This would also make the code a lot shorter I guess.
So you don’t think we need the command-line parameter, just the module_param?

Regards,
— Alex



Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Peter Zijlstra
On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:

> @@ -49,13 +55,33 @@ struct cna_node {
>   u16 real_numa_node;
>   u32 encoded_tail;   /* self */
>   u32 partial_order;  /* enum val */
> + s32 start_time;
>  };

> +/*
> + * Controls the threshold time in ms (default = 10) for intra-node lock
> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
> + * passed to a thread on another NUMA node. The default setting can be
> + * changed with the "numa_spinlock_threshold" boot option.
> + */
> +#define MSECS_TO_JIFFIES(m)  \
> + (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
> +
> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
> +{
> + s32 current_time = (s32)jiffies;
> + s32 threshold = cn->start_time + intra_node_handoff_threshold;
> +
> + return current_time - threshold > 0;
> +}

None of this makes any sense:

 - why do you track time elapsed as a signed entity?
 - why are you using jiffies; that's terrible granularity.

As Andi already said, 10ms is silly large. You've just inflated the
lock-acquire time for every contended lock to stupid land just because
NUMA.

And this also brings me to the whole premise of this series; *why* are
we optimizing this? What locks are so contended that this actually helps
and shouldn't you be spending your time breaking those locks? That would
improve throughput more than this ever can.

>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>   struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);

> @@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
>  static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>struct mcs_spinlock *node)
>  {
> - /*
> -  * Try and put the time otherwise spent spin waiting on
> -  * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> -  */
> - cna_order_queue(node);
> + struct cna_node *cn = (struct cna_node *)node;
> +
> + if (!cn->start_time || !intra_node_threshold_reached(cn)) {
> + /*
> +  * Try and put the time otherwise spent spin waiting on
> +  * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> +  */
> + cna_order_queue(node);
> + } else {
> + cn->partial_order = FLUSH_SECONDARY_QUEUE;

This is even worse naming.

> + }
>  
>   return 0; /* we lied; we didn't wait, go do so now */
>  }


Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-12 Thread Andi Kleen
Andi Kleen  writes:

> Alex Kogan  writes:
>>  
>> +numa_spinlock_threshold=[NUMA, PV_OPS]
>> +Set the time threshold in milliseconds for the
>> +number of intra-node lock hand-offs before the
>> +NUMA-aware spinlock is forced to be passed to
>> +a thread on another NUMA node.  Valid values
>> +are in the [1..100] range. Smaller values result
>> +in a more fair, but less performant spinlock,
>> +and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.

Actually, thinking about it more, using jiffies is likely broken
anyway: if interrupts are disabled on the CPU that runs the main
timer interrupt, jiffies won't advance.

cpu_clock (better than ktime_get) or sched_clock would work.

-Andi


Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-12 Thread Andi Kleen
Alex Kogan  writes:
>  
> + numa_spinlock_threshold=[NUMA, PV_OPS]
> + Set the time threshold in milliseconds for the
> + number of intra-node lock hand-offs before the
> + NUMA-aware spinlock is forced to be passed to
> + a thread on another NUMA node.  Valid values
> + are in the [1..100] range. Smaller values result
> + in a more fair, but less performant spinlock,
> + and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a module_param which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter.

This would also make the code a lot shorter I guess.

-Andi


[PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-01 Thread Alex Kogan
Keep track of the time the thread at the head of the secondary queue
has been waiting, and force inter-node handoff once this time passes
a preset threshold. The default value for the threshold (10ms) can be
overridden with the new kernel boot command-line option
"numa_spinlock_threshold". The ms value is translated internally to the
nearest rounded-up jiffies.
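As a worked example of that rounding, assuming HZ=250 (i.e., 4 ms per
jiffy, since MSEC_PER_SEC / HZ = 1000 / 250 = 4):

    MSECS_TO_JIFFIES(10) = (10 + 4 - 1) / 4 = 3 jiffies (12 ms)
    MSECS_TO_JIFFIES(1)  = (1 + 4 - 1) / 4  = 1 jiffy   (4 ms)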

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 .../admin-guide/kernel-parameters.txt |  9 ++
 kernel/locking/qspinlock_cna.h| 96 ---
 2 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ace55afd4441..5c959631a8c8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3485,6 +3485,15 @@
Not specifying this option is equivalent to
numa_spinlock=auto.
 
+   numa_spinlock_threshold=[NUMA, PV_OPS]
+   Set the time threshold in milliseconds for the
+   number of intra-node lock hand-offs before the
+   NUMA-aware spinlock is forced to be passed to
+   a thread on another NUMA node.  Valid values
+   are in the [1..100] range. Smaller values result
+   in a more fair, but less performant spinlock,
+   and vice versa. The default value is 10.
+
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index d689861a7b3d..0513360c11fe 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -37,6 +37,12 @@
  * gradually filter the primary queue, leaving only waiters running on the same
  * preferred NUMA node.
  *
+ * We change the NUMA node preference after a waiter at the head of the
+ * secondary queue spins for a certain amount of time (10ms, by default).
+ * We do that by flushing the secondary queue into the head of the primary queue,
+ * effectively changing the preference to the NUMA node of the waiter at the head
+ * of the secondary queue at the time of the flush.
+ *
  * For more details, see https://arxiv.org/abs/1810.05600.
  *
  * Authors: Alex Kogan 
@@ -49,13 +55,33 @@ struct cna_node {
u16 real_numa_node;
u32 encoded_tail;   /* self */
u32 partial_order;  /* enum val */
+   s32 start_time;
 };
 
 enum {
LOCAL_WAITER_FOUND,
LOCAL_WAITER_NOT_FOUND,
+   FLUSH_SECONDARY_QUEUE
 };
 
+/*
+ * Controls the threshold time in ms (default = 10) for intra-node lock
+ * hand-offs before the NUMA-aware variant of spinlock is forced to be
+ * passed to a thread on another NUMA node. The default setting can be
+ * changed with the "numa_spinlock_threshold" boot option.
+ */
+#define MSECS_TO_JIFFIES(m)\
+   (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
+static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+   s32 current_time = (s32)jiffies;
+   s32 threshold = cn->start_time + intra_node_handoff_threshold;
+
+   return current_time - threshold > 0;
+}
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -99,6 +125,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)
 
cn->numa_node = cn->real_numa_node;
cn->partial_order = LOCAL_WAITER_FOUND;
+   cn->start_time = 0;
 }
 
 /*
@@ -198,8 +225,15 @@ static void cna_splice_next(struct mcs_spinlock *node,
 
/* stick `next` on the secondary queue tail */
if (node->locked <= 1) { /* if secondary queue is empty */
+   struct cna_node *cn = (struct cna_node *)node;
+
/* create secondary queue */
next->next = next;
+
+   cn->start_time = (s32)jiffies;
+   /* make sure start_time != 0 iff secondary queue is not empty */
+   if (!cn->start_time)
+   cn->start_time = 1;
} else {
/* add to the tail of the secondary queue */
struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
 static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 struct mcs_spinlock *node)
 {
-   /*
-* Try and put the time otherwise spent spin waiting on
-* _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-*/
-   cna_order_queue(node);
+   struct cna_node *cn = (struct cna_node *)node;
+
+   if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+   /*
+* Try and put the time otherwise spent spin waiting on
+* _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+*/
+   cna_order_queue(node);
+   } else {
+   cn->partial_order = FLUSH_SECONDARY_QUEUE;
+   }
 
return 0; /* we lied; we didn't wait, go do so now */
 }