Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Mon, May 25, 2020 at 04:05:49PM +0200, Frederic Weisbecker wrote:
> On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> > @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> >  
> >  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
> >  		if (!set_nr_if_polling(rq->idle))
> > -			smp_call_function_single_async(cpu, &rq->wake_csd);
> > +			smp_call_function_single_async(cpu, &p->wake_csd);
> >  		else
> >  			trace_sched_wake_idle_without_ipi(cpu);
> 
> Ok that's of course very unlikely but could it be possible to have the
> following:
> 
> CPU 0                        CPU 1                              CPU 2
> -----                        -----                              -----
> 
> //Wake up A
> ttwu_queue(TASK A, CPU 1)    idle_loop {
>                                  ttwu_queue_pending {
>                                      raw_spin_unlock_irqrestore(rq)
>                                      # VMEXIT (with IPI still pending)
> 
>                                                                 //task A migrates here
> 
>                                                                 wait_event()
> 
>                                                                 //sleep
> 
> //Wake up A
> ttwu_queue(TASK A, CPU 2) {
>     //IPI on CPU 2 ignored
>     // due to csd->flags == CSD_LOCK
> 

Right you are.

Bah!

More thinking


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> - flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> - if (!(flags & NOHZ_KICK_MASK))
> + if (idle != CPU_IDLE)
>   return false;
>  
>   _nohz_idle_balance(this_rq, flags, idle);

Bah, I think I broke something there. Lemme go mend.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Frederic Weisbecker
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
>  		if (!set_nr_if_polling(rq->idle))
> -			smp_call_function_single_async(cpu, &rq->wake_csd);
> +			smp_call_function_single_async(cpu, &p->wake_csd);
>  		else
>  			trace_sched_wake_idle_without_ipi(cpu);

Ok that's of course very unlikely but could it be possible to have the
following:

CPU 0                        CPU 1                              CPU 2
-----                        -----                              -----

//Wake up A
ttwu_queue(TASK A, CPU 1)    idle_loop {
                                 ttwu_queue_pending {
                                     raw_spin_unlock_irqrestore(rq)
                                     # VMEXIT (with IPI still pending)

                                                                //task A migrates here

                                                                wait_event()

                                                                //sleep

//Wake up A
ttwu_queue(TASK A, CPU 2) {
    //IPI on CPU 2 ignored
    // due to csd->flags == CSD_LOCK


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-25 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:41:14PM +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 01f94cf52783..b6d8a7b991f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> >  * is idle. And the softirq performing nohz idle load balance
> >  * will be run before returning from the IPI.
> >  */
> > -	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> > +	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
> 
> My fear here is that if a previous call from the same CPU but to another
> target is still pending, the new one will be spuriously ignored.
> 

Urgh, indeed!

> But I believe we can still keep the remote csd if nohz_flags() are
> strictly only set before the IPI and strictly only cleared from it.
> 
> And I still don't understand why trigger_load_balance() raises the
> softirq without setting the current CPU as ilb. run_rebalance_domains()
> thus ignores it most of the time in the end, or it spuriously clears the
> nohz_flags set by an IPI sender. Or maybe there is something I misunderstood
> there.

That is because it is simple and didn't matter before. Whoever got there
first got to run the ILB whenever the flag was set.

But now we have this race due to having to serialize access to the csd.

We want the IPI to clear the flag, but then the softirq no longer knows
it was supposed to do ILB.

How's this then?

---
 include/linux/sched.h |  4 
 kernel/sched/core.c   | 41 +
 kernel/sched/fair.c   | 15 +++
 kernel/sched/sched.h  |  2 +-
 4 files changed, 25 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
+#ifdef CONFIG_SMP
+   call_single_data_t  wake_csd;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head   preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..90484b988b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -637,41 +637,25 @@ void wake_up_nohz_cpu(int cpu)
wake_up_idle_cpu(cpu);
 }
 
-static inline bool got_nohz_idle_kick(void)
+static void nohz_csd_func(void *info)
 {
-   int cpu = smp_processor_id();
-
-   if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
-   return false;
-
-   if (idle_cpu(cpu) && !need_resched())
-   return true;
+   struct rq *rq = info;
+   int cpu = cpu_of(rq);
 
+   WARN_ON(!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK));
/*
-* We can't run Idle Load Balance on this CPU for this time so we
-* cancel it and clear NOHZ_BALANCE_KICK
+* Release the rq::nohz_csd.
 */
+   smp_mb__before_atomic();
atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
-   return false;
-}
-
-static void nohz_csd_func(void *info)
-{
-   struct rq *rq = info;
 
-   if (got_nohz_idle_kick()) {
-   rq->idle_balance = 1;
+   rq->idle_balance = idle_cpu(cpu);
+   if (rq->idle_balance && !need_resched()) {
+   rq->nohz_idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
 }
 
-#else /* CONFIG_NO_HZ_COMMON */
-
-static inline bool got_nohz_idle_kick(void)
-{
-   return false;
-}
-
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 
 	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
-			smp_call_function_single_async(cpu, &rq->wake_csd);
+			smp_call_function_single_async(cpu, &p->wake_csd);
 		else
 			trace_sched_wake_idle_without_ipi(cpu);
}
@@ -2921,6 +2905,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #if defined(CONFIG_SMP)
p->on_cpu = 0;
+   p->wake_csd = (struct __call_single_data) {
+   .func = wake_csd_func,
+   };
 #endif
init_task_preempt_count(p);
 #ifdef CONFIG_SMP
@@ -6723,8 +6710,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-	rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
 	rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..93525549a023 100644
--- a/kernel/sched/fair.c

Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Qian Cai
On Thu, May 21, 2020 at 11:39:38AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > > Just a heads up. Repeatedly compiling kernels for a while would trigger
> > > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > > .config are in,
> > > 
> > > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > > not seen anything like that myself. Let me go have a look.
> > > 
> > > 
> > > In as far as the logs are readable (they're a wrapped mess, please don't
> > > do that!), they contain very little useful, as is typical with IPIs :/
> > > 
> > > > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > > flush_smp_call_function_queue+0x1fa/0x2e0
> > 
> > So I've tried to think of a race that could produce that and here is
> > the only thing I could come up with. It's a bit complicated unfortunately:
> 
> This:
> 
> > smp_call_function_single_async() {       smp_call_function_single_async() {
> >     // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
> >     csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK
> 
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well. Now in the ILB case there is an easy cure:
> 
> (because there is only a single ilb target)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>* is idle. And the softirq performing nohz idle load balance
>* will be run before returning from the IPI.
>*/
> -	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> +	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
>  }
>  
>  /*
> 
> Qian, can you give that a spin?

Running for a few hours now. It works fine.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Frederic Weisbecker
On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > 
> > > This:
> > > 
> > > > smp_call_function_single_async() {       smp_call_function_single_async() {
> > > >     // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
> > > >     csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK
> > > 
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> > 
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> > 
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_queue_remote() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> > 
> > So it's either this:
> 
> Or this:
> 
> ---
>  include/linux/sched.h | 4 ++++
>  kernel/sched/core.c   | 7 ++++---
>  kernel/sched/fair.c   | 2 +-
>  kernel/sched/sched.h  | 1 -
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
>  	struct uclamp_se		uclamp[UCLAMP_CNT];
>  #endif
>  
> +#ifdef CONFIG_SMP
> + call_single_data_t  wake_csd;
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>   /* List of struct preempt_notifier: */
>   struct hlist_head   preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>  
>  	if (llist_add(&p->wake_entry, &rq->wake_list)) {
>  		if (!set_nr_if_polling(rq->idle))
> -			smp_call_function_single_async(cpu, &rq->wake_csd);
> +			smp_call_function_single_async(cpu, &p->wake_csd);
>  		else
>  			trace_sched_wake_idle_without_ipi(cpu);
>  	}
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #endif
>  #if defined(CONFIG_SMP)
>   p->on_cpu = 0;
> + p->wake_csd = (struct __call_single_data) {
> + .func = wake_csd_func,
> + };
>  #endif
>   init_task_preempt_count(p);
>  #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
>   rq->avg_idle = 2*sysctl_sched_migration_cost;
>   rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>  
> -	rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
>  	INIT_LIST_HEAD(&rq->cfs_tasks);
>  
>  	rq_attach_root(rq, &def_root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>* is idle. And the softirq performing nohz idle load balance
>* will be run before returning from the IPI.
>*/
> -	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> +	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);

My fear here is that if a previous call from the same CPU but to another
target is still pending, the new one will be spuriously ignored.

Namely this could happen:

CPU 0                                          CPU 1
-----                                          -----
                                               local_irq_disable() or VMEXIT
kick_ilb() {
    smp_call_function_single_async(CPU 1,
                                   &this_rq()->nohz_csd);
}

kick_ilb() {
    smp_call_function_single_async(CPU 2,
                                   &this_rq()->nohz_csd) {
        // IPI to CPU 2 ignored
        if (csd->flags == CSD_LOCK)
            return -EBUSY;
    }
}

                                               local_irq_enable();
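
(For reference, the reason the second call is simply ignored: a csd that is
still locked makes smp_call_function_single_async() bail with -EBUSY. This is
a paraphrased sketch of the function before Peter's change, based on
kernel/smp.c around that time:)

int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
	int err = 0;

	preempt_disable();

	/* A previous async call on this csd is still in flight: refuse. */
	if (csd->flags & CSD_FLAG_LOCK) {
		err = -EBUSY;
		goto out;
	}

	csd->flags = CSD_FLAG_LOCK;
	smp_wmb();

	/* Queue the csd on the target CPU and send the IPI (or run in place). */
	err = generic_exec_single(cpu, csd, csd->func, csd->info);

out:
	preempt_enable();

	return err;
}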




But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.

And I still don't understand why trigger_load_balance() raises the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end, or it spuriously clears the
nohz_flags set by an IPI sender. Or maybe there is something I misunderstood
there.

(Haven't checked the wake up case yet).


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> 
> > This:
> > 
> > > smp_call_function_single_async() {       smp_call_function_single_async() {
> > >     // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
> > >     csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK
> > 
> > concurrent smp_call_function_single_async() using the same csd is what
> > I'm looking at as well.
> 
> So something like this ought to cure the fundamental problem and make
> smp_call_function_single_async() more user friendly, but also more
> expensive.
> 
> The problem is that while the ILB case is easy to fix, I can't seem to
> find an equally nice solution for the ttwu_queue_remote() case; that
> would basically require sticking the wake_csd in task_struct, I'll also
> post that.
> 
> So it's either this:

Or this:

---
 include/linux/sched.h | 4 ++++
 kernel/sched/core.c   | 7 ++++---
 kernel/sched/fair.c   | 2 +-
 kernel/sched/sched.h  | 1 -
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
+#ifdef CONFIG_SMP
+   call_single_data_t  wake_csd;
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head   preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..a7129652e89b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 
 	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
-			smp_call_function_single_async(cpu, &rq->wake_csd);
+			smp_call_function_single_async(cpu, &p->wake_csd);
 		else
 			trace_sched_wake_idle_without_ipi(cpu);
 	}
@@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #if defined(CONFIG_SMP)
p->on_cpu = 0;
+   p->wake_csd = (struct __call_single_data) {
+   .func = wake_csd_func,
+   };
 #endif
init_task_preempt_count(p);
 #ifdef CONFIG_SMP
@@ -6723,8 +6726,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-	rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
 	INIT_LIST_HEAD(&rq->cfs_tasks);
 
 	rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 * is idle. And the softirq performing nohz idle load balance
 * will be run before returning from the IPI.
 */
-	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7ab6334e992..c35f0ef43ab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1021,7 +1021,6 @@ struct rq {
 #endif
 
 #ifdef CONFIG_SMP
-   call_single_data_t  wake_csd;
struct llist_head   wake_list;
 #endif
 


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:

> This:
> 
> > smp_call_function_single_async() {       smp_call_function_single_async() {
> >     // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
> >     csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK
> 
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well.

So something like this ought to cure the fundamental problem and make
smp_call_function_single_async() more user friendly, but also more
expensive.

The problem is that while the ILB case is easy to fix, I can't seem to
find an equally nice solution for the ttwu_queue_remote() case; that
would basically require sticking the wake_csd in task_struct, I'll also
post that.

So it's either this:

---
 kernel/smp.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 84303197caf9..d1ca2a2d1cc7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -109,6 +109,12 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
 	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
 }
 
+/*
+ * csd_lock() can use non-atomic operations to set CSD_FLAG_LOCK because it's
+ * users are careful to only use CPU-local data. IOW, there is no cross-cpu
+ * lock usage. Also, you're not allowed to use smp_call_function*() from IRQs,
+ * and must be extra careful from SoftIRQ.
+ */
 static __always_inline void csd_lock(call_single_data_t *csd)
 {
csd_lock_wait(csd);
@@ -318,7 +324,7 @@ EXPORT_SYMBOL(smp_call_function_single);
 
 /**
  * smp_call_function_single_async(): Run an asynchronous function on a
- *  specific CPU.
+ *  specific CPU.
  * @cpu: The CPU to run on.
  * @csd: Pre-allocated and setup data structure
  *
@@ -339,18 +345,23 @@ EXPORT_SYMBOL(smp_call_function_single);
  */
 int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 {
+   unsigned int csd_flags;
int err = 0;
 
preempt_disable();
 
-   if (csd->flags & CSD_FLAG_LOCK) {
+   /*
+* Unlike the regular smp_call_function*() APIs, this one is actually
+* usable from IRQ context, also the -EBUSY return value suggests
+* it is safe to share csd's.
+*/
+	csd_flags = READ_ONCE(csd->flags);
+	if (csd_flags & CSD_FLAG_LOCK ||
+	    cmpxchg(&csd->flags, csd_flags, csd_flags | CSD_FLAG_LOCK) != csd_flags) {
 		err = -EBUSY;
 		goto out;
 	}
 
-   csd->flags = CSD_FLAG_LOCK;
-   smp_wmb();
-
err = generic_exec_single(cpu, csd, csd->func, csd->info);
 
 out:


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:

>                                       atomic_fetch_or(flags, nohz_flags(0))
> softirq() {                           #VMEXIT or anything that could stop a CPU for a while
>     run_rebalance_domains() {
>         nohz_idle_balance() {
>             atomic_andnot(NOHZ_KICK_MASK, nohz_flags(0))

I'm an idiot and didn't have enough wake-up-juice; I missed that andnot
clearing the flag again.

Yes, fun fun fun..


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Peter Zijlstra
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > Just a heads up. Repeatedly compiling kernels for a while would trigger
> > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > .config are in,
> > 
> > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > not seen anything like that myself. Let me go have a look.
> > 
> > 
> > In as far as the logs are readable (they're a wrapped mess, please don't
> > do that!), they contain very little useful, as is typical with IPIs :/
> > 
> > > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > flush_smp_call_function_queue+0x1fa/0x2e0
> 
> So I've tried to think of a race that could produce that and here is
> the only thing I could come up with. It's a bit complicated unfortunately:

This:

> smp_call_function_single_async() {       smp_call_function_single_async() {
>     // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
>     csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK

concurrent smp_call_function_single_async() using the same csd is what
I'm looking at as well. Now in the ILB case there is an easy cure:

(because there is only a single ilb target)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 * is idle. And the softirq performing nohz idle load balance
 * will be run before returning from the IPI.
 */
-	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*

Qian, can you give that a spin?

But I'm still not convinced of your scenario:

>                                        kick_ilb() {
>                                            atomic_fetch_or(flags, nohz_flags(0))
> 
>         atomic_fetch_or(flags, nohz_flags(0))   #VMENTER
>         smp_call_function_single_async() {      smp_call_function_single_async() {
>             // verified csd->flags != CSD_LOCK      // verified csd->flags != CSD_LOCK
>             csd->flags = CSD_LOCK                   csd->flags = CSD_LOCK

Note that we check the return value of atomic_fetch_or() and bail if
someone else set a flag in KICK_MASK before us.

Aah, I suppose you're saying this can happen when:

  !(flags & NOHZ_KICK_MASK)

? That's not supposed to happen though.
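
(For reference, a trimmed and paraphrased sketch of kick_ilb() as it stands in
that tree; details are from memory and may differ slightly:)

static void kick_ilb(unsigned int flags)
{
	unsigned int ilb_cpu;

	ilb_cpu = find_new_ilb();

	/* No idle CPU available to run the ILB: nothing to kick. */
	if (ilb_cpu >= nr_cpu_ids)
		return;

	/*
	 * If a NOHZ_KICK_MASK bit was already set for ilb_cpu, an earlier
	 * kick is still in flight, so we must not send another IPI (and must
	 * not touch the csd again).
	 */
	flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
	if (flags & NOHZ_KICK_MASK)
		return;

	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
}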


Anyway, let me go stare at the remote wake-up case, because I'm afraid
that might have the same problem too...


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Frederic Weisbecker
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a heads up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
> 
> > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0

So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:

CPU 0                                        CPU 1
-----                                        -----

tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                             kick_ilb() {
                                                 atomic_fetch_or(flags, nohz_flags(0))
    softirq() {                                  #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domains() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flags(0))
            }
        }
    }
}

// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(flags, nohz_flags(0))    #VMENTER
        smp_call_function_single_async() {       smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK       // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                    csd->flags = CSD_LOCK
            //execute in place                       //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
        }
    }
}

IPI {
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK) <-!
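
(For context, that WARN_ON is the one reported at kernel/smp.c:127 in the
splat; it lives in csd_unlock(). A paraphrased sketch from kernel/smp.c around
that time:)

static __always_inline void csd_unlock(call_single_data_t *csd)
{
	/* Fires when the csd we are about to release is not locked anymore. */
	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

	/* Ensure we're all done before releasing the data: */
	smp_store_release(&csd->flags, 0);
}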



The root cause here would be that trigger_load_balance() unconditionally raises
the softirq. And I have to confess I'm not clear why, since the softirq is
essentially a no-op when nohz_flags() is 0.
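
For reference, a paraphrased sketch of the two sides, based on the kernel
around that time (details may differ slightly from the exact -next tree): the
tick path raises the softirq whenever a rebalance is due, regardless of
nohz_flags(), while the nohz part of the softirq bails out early unless a kick
bit is set.

void trigger_load_balance(struct rq *rq)
{
	/* Don't need to rebalance while attached to NULL domain */
	if (unlikely(on_null_domain(rq)))
		return;

	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);	/* raised even when nohz_flags() is 0 */

	nohz_balancer_kick(rq);
}

static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
	int this_cpu = this_rq->cpu;
	unsigned int flags;

	/* No kick recorded for this CPU: the nohz part does nothing. */
	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
		return false;

	if (idle != CPU_IDLE) {
		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
		return false;
	}

	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
	if (!(flags & NOHZ_KICK_MASK))
		return false;

	_nohz_idle_balance(this_rq, flags, idle);

	return true;
}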

Thanks.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Qian Cai
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a heads up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.

Yes, I ended up figuring out the same commit a bit earlier. Since then I
reverted that commit and its dependency,

2a0a24ebb499 ("sched: Make scheduler_ipi inline")

Everything works fine so far.

> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/

Sorry about that. I forgot that gmail webUI will wrap things around. I will
switch to mutt.

> 
> > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.00][C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> > fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> > acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> > scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> > dm_region_hash dm_log dm_mod
> > [ 1168.029492][C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> > 5.7.0-rc6-next-20200519 #1
> > [ 1168.037665][C1] Hardware name: HPE ProLiant DL385
> > Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > [ 1168.046978][C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.053658][C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> > ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> > e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> > 10 52 3f 71
> > [ 1168.073262][C1] RSP: 0018:c9178918 EFLAGS: 00010046
> > [ 1168.079253][C1] RAX:  RBX: 430c58f8
> > RCX: 8ec26083
> > [ 1168.087156][C1] RDX: 0003 RSI: dc00
> > RDI: 430c58f8
> > [ 1168.095054][C1] RBP: c91789a8 R08: ed1108618cec
> > R09: ed1108618cec
> > [ 1168.102964][C1] R10: 430c675b R11: 
> > R12: 430c58e0
> > [ 1168.110866][C1] R13: 8eb30c40 R14: 430c5880
> > R15: 430c58e0
> > [ 1168.118767][C1] FS:  ()
> > GS:4308() knlGS:
> > [ 1168.127628][C1] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1168.134129][C1] CR2: 55b169604560 CR3: 000d08a14000
> > CR4: 003406e0
> > [ 1168.142026][C1] Call Trace:
> > [ 1168.145206][C1]  
> > [ 1168.147957][C1]  ? smp_call_on_cpu_callback+0xd0/0xd0
> > [ 1168.153421][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
> > [ 1168.158880][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
> > [ 1168.164076][C1]  generic_smp_call_function_single_interrupt+0x13/0x2b
> > [ 1168.170938][C1]  smp_call_function_single_interrupt+0x157/0x4e0
> > [ 1168.177278][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
> > [ 1168.183172][C1]  ? interrupt_entry+0xe4/0xf0
> > [ 1168.187846][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
> > [ 1168.193478][C1]  ? trace_hardirqs_on_caller+0x1f0/0x1f0
> > [ 1168.199116][C1]  ? _nohz_idle_balance+0x221/0x360
> > [ 1168.204228][C1]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [ 1168.209690][C1]  call_function_single_interrupt+0xf/0x20


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Peter Zijlstra
On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> Just a heads up. Repeatedly compiling kernels for a while would trigger
> endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> .config are in,

Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
not seen anything like that myself. Let me go have a look.


In as far as the logs are readable (they're a wrapped mess, please don't
do that!), they contain very little useful, as is typical with IPIs :/

> [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.00][C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> dm_region_hash dm_log dm_mod
> [ 1168.029492][C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> 5.7.0-rc6-next-20200519 #1
> [ 1168.037665][C1] Hardware name: HPE ProLiant DL385
> Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> [ 1168.046978][C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.053658][C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> 10 52 3f 71
> [ 1168.073262][C1] RSP: 0018:c9178918 EFLAGS: 00010046
> [ 1168.079253][C1] RAX:  RBX: 430c58f8
> RCX: 8ec26083
> [ 1168.087156][C1] RDX: 0003 RSI: dc00
> RDI: 430c58f8
> [ 1168.095054][C1] RBP: c91789a8 R08: ed1108618cec
> R09: ed1108618cec
> [ 1168.102964][C1] R10: 430c675b R11: 
> R12: 430c58e0
> [ 1168.110866][C1] R13: 8eb30c40 R14: 430c5880
> R15: 430c58e0
> [ 1168.118767][C1] FS:  ()
> GS:4308() knlGS:
> [ 1168.127628][C1] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1168.134129][C1] CR2: 55b169604560 CR3: 000d08a14000
> CR4: 003406e0
> [ 1168.142026][C1] Call Trace:
> [ 1168.145206][C1]  
> [ 1168.147957][C1]  ? smp_call_on_cpu_callback+0xd0/0xd0
> [ 1168.153421][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
> [ 1168.158880][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
> [ 1168.164076][C1]  generic_smp_call_function_single_interrupt+0x13/0x2b
> [ 1168.170938][C1]  smp_call_function_single_interrupt+0x157/0x4e0
> [ 1168.177278][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
> [ 1168.183172][C1]  ? interrupt_entry+0xe4/0xf0
> [ 1168.187846][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
> [ 1168.193478][C1]  ? trace_hardirqs_on_caller+0x1f0/0x1f0
> [ 1168.199116][C1]  ? _nohz_idle_balance+0x221/0x360
> [ 1168.204228][C1]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1168.209690][C1]  call_function_single_interrupt+0xf/0x20