Re: WARN_ON_ONCE() in process_one_work()?

2018-07-03 Thread Paul E. McKenney
On Tue, Jul 03, 2018 at 01:12:05PM -0700, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote:
> > > I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> > > trigger since.  :-/
> > 
> > But I get a build error:
> > 
> > kernel/workqueue.o: In function `worker_attach_to_pool':
> > workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
> > workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
> > /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' failed
> > 
> > My guess is that the worker_attach_to_pool() code needs to change so
> > as to invoke cpuhp_current_state() and cpuhp_target_state() only in
> > CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.
> 
> Just to clarify, the bug happened on a kernel with cpuhp enabled,
> right?

It seems to be a bit more complicated than that, despite my initial
confidence that I knew what was going on.  The build failed for these
rcutorture scenarios:

SRCU-N: (No idea!)

CONFIG_RCU_TRACE=n
CONFIG_SMP=y
CONFIG_NR_CPUS=4
CONFIG_HOTPLUG_CPU=y
CONFIG_PREEMPT_NONE=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n

SRCU-t:  (Expected, CONFIG_SMP=n)
SRCU-u:  (Expected, CONFIG_SMP=n)
TASKS02:  (Expected, CONFIG_SMP=n)
TASKS03:  (Expected, CONFIG_HOTPLUG_CPU=n)
TINY01:  (Expected, CONFIG_SMP=n)
TINY02:  (Expected, CONFIG_SMP=n)
TREE01:  (No idea!)

CONFIG_SMP=y
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_RCU_TRACE=y
CONFIG_HOTPLUG_CPU=y
CONFIG_MAXSMP=y
CONFIG_CPUMASK_OFFSTACK=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y

TREE02 didn't fail despite having CONFIG_HOTPLUG_CPU=n!
TREE03:  (Expected, CONFIG_HOTPLUG_CPU=n)
TREE04:  (Expected, CONFIG_HOTPLUG_CPU=n)

TREE07:  (No idea!)

CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=n
CONFIG_NO_HZ_FULL=y
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=y
CONFIG_HOTPLUG_CPU=y
CONFIG_RCU_FANOUT=2
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y

TREE08:  (Expected, CONFIG_HOTPLUG_CPU=n)
TREE09:  (Expected, CONFIG_SMP=n)

Thoughts?

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2018-07-03 Thread Tejun Heo
Hello, Paul.

On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote:
> > I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> > trigger since.  :-/
> 
> But I get a build error:
> 
> kernel/workqueue.o: In function `worker_attach_to_pool':
> workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
> workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
> /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' failed
> 
> My guess is that the worker_attach_to_pool() code needs to change so
> as to invoke cpuhp_current_state() and cpuhp_target_state() only in
> CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.

Just to clarify, the bug happened on a kernel with cpuhp enabled,
right?

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2018-07-03 Thread Paul E. McKenney
On Mon, Jul 02, 2018 at 09:05:18PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 02, 2018 at 02:05:40PM -0700, Tejun Heo wrote:
> > Hello, Paul.
> > 
> > Sorry about the late reply.
> > 
> > On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> > > I have hit this WARN_ON_ONCE() in process_one_work:
> > > 
> > >   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> > >raw_smp_processor_id() != pool->cpu);
> > > 
> > > This looks like it is my rcu_gp workqueue (see splat below), and it
> > > appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> > > which does random CPU-hotplug operations (in case that helps).
> > > 
> > > Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> > > what should I do to further debug this?
> > 
> > Hmm... I checked the code paths but couldn't spot anything suspicious.
> > Can you please apply the following patch and see whether it triggers
> > before hitting the warn and if so report what it says?
> 
> I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> trigger since.  :-/

But I get a build error:

kernel/workqueue.o: In function `worker_attach_to_pool':
workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
/home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' failed

My guess is that the worker_attach_to_pool() code needs to change so
as to invoke cpuhp_current_state() and cpuhp_target_state() only in
CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.
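
Something like the following completely untested sketch is what I had in
mind, that is, compiling the debug printk out entirely in
CONFIG_HOTPLUG_CPU=n kernels, but you might well prefer a different cut:

	ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
#ifdef CONFIG_HOTPLUG_CPU
	if (ret && pool->cpu >= 0 && worker->rescue_wq)
		printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d cpuhp_target=%d\n",
		       ret, pool->id, raw_smp_processor_id(), pool->cpu,
		       cpuhp_current_state(pool->cpu),
		       cpuhp_target_state(pool->cpu));
#endif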

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2018-07-02 Thread Paul E. McKenney
On Mon, Jul 02, 2018 at 02:05:40PM -0700, Tejun Heo wrote:
> Hello, Paul.
> 
> Sorry about the late reply.
> 
> On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> > I have hit this WARN_ON_ONCE() in process_one_work:
> > 
> > WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> >  raw_smp_processor_id() != pool->cpu);
> > 
> > This looks like it is my rcu_gp workqueue (see splat below), and it
> > appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> > which does random CPU-hotplug operations (in case that helps).
> > 
> > Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> > what should I do to further debug this?
> 
> Hmm... I checked the code paths but couldn't spot anything suspicious.
> Can you please apply the following patch and see whether it triggers
> before hitting the warn and if so report what it says?

I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
trigger since.  :-/

Thanx, Paul

> Thanks.
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 0db8938fbb23..81caab9643b2 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
>  static struct lockdep_map cpuhp_state_down_map =
> 	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
> 
> +int cpuhp_current_state(int cpu)
> +{
> +	return per_cpu_ptr(&cpuhp_state, cpu)->state;
> +}
> +
> +int cpuhp_target_state(int cpu)
> +{
> +	return per_cpu_ptr(&cpuhp_state, cpu)->target;
> +}
> 
>  static inline void cpuhp_lock_acquire(bool bringup)
>  {
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 78b192071ef7..365cf6342808 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
>   return worker;
>  }
> 
> +int cpuhp_current_state(int cpu);
> +int cpuhp_target_state(int cpu);
> +
>  /**
>   * worker_attach_to_pool() - attach a worker to a pool
>   * @worker: worker to be attached
> @@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
>  static void worker_attach_to_pool(struct worker *worker,
>  struct worker_pool *pool)
>  {
> + int ret;
> +
> 	mutex_lock(&wq_pool_attach_mutex);
> 
>   /*
>* set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>* online CPUs.  It'll be re-applied when any of the CPUs come up.
>*/
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + if (ret && pool->cpu >= 0 && worker->rescue_wq)
> +	printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d cpuhp_target=%d\n",
> +ret, pool->id, raw_smp_processor_id(), pool->cpu,
> +cpuhp_current_state(pool->cpu),
> +cpuhp_target_state(pool->cpu));
> 
>   /*
>* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
> 



Re: WARN_ON_ONCE() in process_one_work()?

2018-07-02 Thread Tejun Heo
Hello, Paul.

Sorry about the late reply.

On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> I have hit this WARN_ON_ONCE() in process_one_work:
> 
>   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>raw_smp_processor_id() != pool->cpu);
> 
> This looks like it is my rcu_gp workqueue (see splat below), and it
> appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> which does random CPU-hotplug operations (in case that helps).
> 
> Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> what should I do to further debug this?

Hmm... I checked the code paths but couldn't spot anything suspicious.
Can you please apply the following patch and see whether it triggers
before hitting the warn and if so report what it says?

Thanks.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0db8938fbb23..81caab9643b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
 static struct lockdep_map cpuhp_state_down_map =
	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
 
+int cpuhp_current_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->state;
+}
+
+int cpuhp_target_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->target;
+}
 
 static inline void cpuhp_lock_acquire(bool bringup)
 {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 78b192071ef7..365cf6342808 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
return worker;
 }
 
+int cpuhp_current_state(int cpu);
+int cpuhp_target_state(int cpu);
+
 /**
  * worker_attach_to_pool() - attach a worker to a pool
  * @worker: worker to be attached
@@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
   struct worker_pool *pool)
 {
+   int ret;
+
	mutex_lock(&wq_pool_attach_mutex);
 
/*
 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 */
-   set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   if (ret && pool->cpu >= 0 && worker->rescue_wq)
+	printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d cpuhp_target=%d\n",
+  ret, pool->id, raw_smp_processor_id(), pool->cpu,
+  cpuhp_current_state(pool->cpu),
+  cpuhp_target_state(pool->cpu));
 
/*
 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains


WARN_ON_ONCE() in process_one_work()?

2018-06-20 Thread Paul E. McKenney
Hello!

I have hit this WARN_ON_ONCE() in process_one_work:

WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 raw_smp_processor_id() != pool->cpu);

This looks like it is my rcu_gp workqueue (see splat below), and it
appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
which does random CPU-hotplug operations (in case that helps).

Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
what should I do to further debug this?
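
For reference, the rcu_gp workqueue in question is created roughly as
follows (going from memory, so the exact call may differ), and it is the
WQ_MEM_RECLAIM flag that gives it the rescuer showing up in the stack
trace below:

	rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);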

Thanx, Paul

[  828.791694] WARNING: CPU: 2 PID: 3 at /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2069 process_one_work+0x49/0x3c0
[  828.794850] Modules linked in:
[  828.795748] CPU: 2 PID: 3 Comm: rcu_gp Not tainted 4.18.0-rc1+ #1
[  828.797436] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  828.800059] Workqueue:(null) (rcu_gp)
[  828.801373] RIP: 0010:process_one_work+0x49/0x3c0
[  828.802690] Code: 00 00 00 00 4c 0f 44 e0 49 8b 44 24 08 44 8b a8 00 01 00 00 41 83 e5 20 f6 45 10 04 75 0e 65 8b 05 6c 99 79 68 3b 45 04 74 02 <0f> 0b 48 ba eb 83 b5 80 46 86 c8 61 48 0f af d6 48 c1 ea 3a 48 8b
[  828.807871] RSP: :8a3c400e7e70 EFLAGS: 00010002
[  828.809314] RAX: 0002 RBX: 88acdfc206e0 RCX: 88acdfc206e0
[  828.811307] RDX: 88acdfc206d8 RSI: 88acdfc197b8 RDI: 88acde80c6c0
[  828.813253] RBP: 88acdfc206c0 R08:  R09: 
[  828.815239] R10: 8a3c400e7e48 R11:  R12: 88acdfc24900
[  828.817073] R13:  R14: 88acde80c6c0 R15: 88acde80c6e8
[  828.819071] FS:  () GS:88acdfd0() knlGS:
[  828.821361] CS:  0010 DS:  ES:  CR0: 80050033
[  828.822971] CR2: 00b0 CR3: 1040a000 CR4: 06e0
[  828.824935] Call Trace:
[  828.825653]  rescuer_thread+0x201/0x320
[  828.826744]  kthread+0xf3/0x130
[  828.827618]  ? process_one_work+0x3c0/0x3c0
[  828.828758]  ? kthread_destroy_worker+0x40/0x40
[  828.830019]  ret_from_fork+0x35/0x40
[  828.831028] ---[ end trace 4197eb078bf0edac ]---



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-27 Thread Paul E. McKenney
On Fri, Jun 23, 2017 at 09:41:42AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > > > And no test failures from yesterday evening.  So it looks like we 
> > > > > > > get
> > > > > > > somewhere on the order of one failure per 138 hours of TREE07 
> > > > > > > rcutorture
> > > > > > > runtime with your printk() in the mix.
> > > > > > >
> > > > > > > Was the above output from your printk() output of any help?
> > > > > > 
> > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > > > racing against CPU offline, which would explain why it's so 
> > > > > > difficult
> > > > > > to repro.  Can you please see whether the following patch resolves 
> > > > > > the
> > > > > > issue?
> > > > > 
> > > > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > > > tests.  ;-)
> > > > 
> > > > I was a bit confused.  It has to be racing against either new kworker
> > > > being created on the wrong CPU or rescuer trying to migrate to the
> > > > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > > > yeah, this would only get triggered rarely.  Another contributing
> > > > factor could be the vmstat work putting on a workqueue w/ rescuer
> > > > recently.  It runs quite often, so probably has increased the chance
> > > > of hitting the right condition.
> > > 
> > > Sounds like too much fun!  ;-)
> > > 
> > > But more constructively...  If I understand correctly, it is now possible
> > > to take a CPU partially offline and put it back online again.  This should
> > > allow much more intense testing of this sort of interaction.
> > > 
> > > And no, I haven't yet tried this with RCU because I would probably need
> > > to do some mix of just-RCU online/offline and full-up online-offline.
> > > Plus RCU requires pretty much a full online/offline cycle to fully
> > > exercise it.  :-/
> > > 
> > > > > I have a weekend-long run going, but will give this a shot overnight 
> > > > > on
> > > > > Monday, Pacific Time.  Thank you for putting it together, looking 
> > > > > forward
> > > > > to seeing what it does!
> > > > 
> > > > Thanks a lot for the testing and patience.  Sorry that it took so
> > > > long.  I'm not completely sure the patch is correct.  It might have to
> > > > be more specific about which type of migration or require further
> > > > synchronization around migration, but hopefully it'll at least be able
> > > > to show that this was the cause of the problem.
> > > 
> > > And last night's tests had no failures.  Which might actually mean
> > > something, will get more info when I run without your patch this
> > > evening.  ;-)
> > 
> > And it didn't fail without the patch, either.  45 hours of test vs.
> > 60 hours with the patch.  This one is not going to be easy to prove
> > either way.  I will try again this evening without the patch and see
> > what that gets us.
> 
> And another 36 hours (total of 81 hours) without the patch, still no
> failure.  Sigh.
> 
> In the sense that the patch doesn't cause any new problem:
> 
> Tested-by: Paul E. McKenney 
> 
> But I clearly have nothing of statistical significance, so any confidence
> in the fix is coming from your reproducer.

And for whatever it is worth, I did finally get a reproduction without
the patch.  The probability of occurrence is quite low with my test setup,
so please queue this patch.  I will accumulate test time on it over the
months to come.  :-/

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-23 Thread Paul E. McKenney
On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > > And no test failures from yesterday evening.  So it looks like we 
> > > > > > get
> > > > > > somewhere on the order of one failure per 138 hours of TREE07 
> > > > > > rcutorture
> > > > > > runtime with your printk() in the mix.
> > > > > >
> > > > > > Was the above output from your printk() output of any help?
> > > > > 
> > > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > > racing against CPU offline, which would explain why it's so difficult
> > > > > to repro.  Can you please see whether the following patch resolves the
> > > > > issue?
> > > > 
> > > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > > tests.  ;-)
> > > 
> > > I was a bit confused.  It has to be racing against either new kworker
> > > being created on the wrong CPU or rescuer trying to migrate to the
> > > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > > yeah, this would only get triggered rarely.  Another contributing
> > > factor could be the vmstat work putting on a workqueue w/ rescuer
> > > recently.  It runs quite often, so probably has increased the chance
> > > of hitting the right condition.
> > 
> > Sounds like too much fun!  ;-)
> > 
> > But more constructively...  If I understand correctly, it is now possible
> > to take a CPU partially offline and put it back online again.  This should
> > allow much more intense testing of this sort of interaction.
> > 
> > And no, I haven't yet tried this with RCU because I would probably need
> > to do some mix of just-RCU online/offline and full-up online-offline.
> > Plus RCU requires pretty much a full online/offline cycle to fully
> > exercise it.  :-/
> > 
> > > > I have a weekend-long run going, but will give this a shot overnight on
> > > > Monday, Pacific Time.  Thank you for putting it together, looking 
> > > > forward
> > > > to seeing what it does!
> > > 
> > > Thanks a lot for the testing and patience.  Sorry that it took so
> > > long.  I'm not completely sure the patch is correct.  It might have to
> > > be more specific about which type of migration or require further
> > > synchronization around migration, but hopefully it'll at least be able
> > > to show that this was the cause of the problem.
> > 
> > And last night's tests had no failures.  Which might actually mean
> > something, will get more info when I run without your patch this
> > evening.  ;-)
> 
> And it didn't fail without the patch, either.  45 hours of test vs.
> 60 hours with the patch.  This one is not going to be easy to prove
> either way.  I will try again this evening without the patch and see
> what that gets us.

And another 36 hours (total of 81 hours) without the patch, still no
failure.  Sigh.

In the sense that the patch doesn't cause any new problem:

Tested-by: Paul E. McKenney 

But I clearly have nothing of statistical significance, so any confidence
in the fix is coming from your reproducer.
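
(To put a rough number on it: if the base rate really is about one
failure per 138 hours, an 81-hour failure-free run would be expected
with probability exp(-81/138), or a bit better than 55%, so this result
rules out very little either way.)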

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-21 Thread Paul E. McKenney
On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > And no test failures from yesterday evening.  So it looks like we get
> > > > > somewhere on the order of one failure per 138 hours of TREE07 
> > > > > rcutorture
> > > > > runtime with your printk() in the mix.
> > > > >
> > > > > Was the above output from your printk() output of any help?
> > > > 
> > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > racing against CPU offline, which would explain why it's so difficult
> > > > to repro.  Can you please see whether the following patch resolves the
> > > > issue?
> > > 
> > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > tests.  ;-)
> > 
> > I was a bit confused.  It has to be racing against either new kworker
> > being created on the wrong CPU or rescuer trying to migrate to the
> > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > yeah, this would only get triggered rarely.  Another contributing
> > factor could be the vmstat work putting on a workqueue w/ rescuer
> > recently.  It runs quite often, so probably has increased the chance
> > of hitting the right condition.
> 
> Sounds like too much fun!  ;-)
> 
> But more constructively...  If I understand correctly, it is now possible
> to take a CPU partially offline and put it back online again.  This should
> allow much more intense testing of this sort of interaction.
> 
> And no, I haven't yet tried this with RCU because I would probably need
> to do some mix of just-RCU online/offline and full-up online-offline.
> Plus RCU requires pretty much a full online/offline cycle to fully
> exercise it.  :-/
> 
> > > I have a weekend-long run going, but will give this a shot overnight on
> > > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > > to seeing what it does!
> > 
> > Thanks a lot for the testing and patience.  Sorry that it took so
> > long.  I'm not completely sure the patch is correct.  It might have to
> > be more specific about which type of migration or require further
> > synchronization around migration, but hopefully it'll at least be able
> > to show that this was the cause of the problem.
> 
> And last night's tests had no failures.  Which might actually mean
> something, will get more info when I run without your patch this
> evening.  ;-)

And it didn't fail without the patch, either.  45 hours of test vs.
60 hours with the patch.  This one is not going to be easy to prove
either way.  I will try again this evening without the patch and see
what that gets us.

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-20 Thread Paul E. McKenney
On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > And no test failures from yesterday evening.  So it looks like we get
> > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > runtime with your printk() in the mix.
> > > >
> > > > Was the above output from your printk() output of any help?
> > > 
> > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > racing against CPU offline, which would explain why it's so difficult
> > > to repro.  Can you please see whether the following patch resolves the
> > > issue?
> > 
> > That could explain why only Steve Rostedt and I saw the issue.  As far
> > as I know, we are the only ones who regularly run CPU-hotplug stress
> > tests.  ;-)
> 
> I was a bit confused.  It has to be racing against either new kworker
> being created on the wrong CPU or rescuer trying to migrate to the
> CPU, and it looks like we're mostly seeing the rescuer condition, but,
> yeah, this would only get triggered rarely.  Another contributing
> factor could be the vmstat work putting on a workqueue w/ rescuer
> recently.  It runs quite often, so probably has increased the chance
> of hitting the right condition.

Sounds like too much fun!  ;-)

But more constructively...  If I understand correctly, it is now possible
to take a CPU partially offline and put it back online again.  This should
allow much more intense testing of this sort of interaction.
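
For example, assuming a kernel built with CONFIG_CPU_HOTPLUG_STATE_CONTROL=y
so that each CPU's "target" file is writable, an untested userspace sketch
along the following lines should be able to park a CPU at an intermediate
hotplug state and then bring it back; the cpuhp_set_target() helper and both
state numbers are purely illustrative, with the real numbers coming from
/sys/devices/system/cpu/hotplug/states:

#include <stdio.h>

/* Illustrative helper: write a cpuhp state number to the given CPU's
 * "target" file under sysfs.
 */
static int cpuhp_set_target(int cpu, int state)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/hotplug/target", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", state);
	return fclose(f);
}

int main(void)
{
	/* Both state numbers are placeholders: read the real ones from
	 * the "states" file on the system under test.
	 */
	if (cpuhp_set_target(1, 50))	/* partway down */
		perror("partial offline");
	if (cpuhp_set_target(1, 200))	/* back toward CPUHP_ONLINE */
		perror("back online");
	return 0;
}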

And no, I haven't yet tried this with RCU because I would probably need
to do some mix of just-RCU online/offline and full-up online-offline.
Plus RCU requires pretty much a full online/offline cycle to fully
exercise it.  :-/

> > I have a weekend-long run going, but will give this a shot overnight on
> > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > to seeing what it does!
> 
> Thanks a lot for the testing and patience.  Sorry that it took so
> long.  I'm not completely sure the patch is correct.  It might have to
> be more specifc about which type of migration or require further
> synchronization around migration, but hopefully it'll at least be able
> to show that this was the cause of the problem.

And last night's tests had no failures.  Which might actually mean
something, will get more info when I run without your patch this
evening.  ;-)

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-18 Thread Tejun Heo
Hello,

On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > And no test failures from yesterday evening.  So it looks like we get
> > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > runtime with your printk() in the mix.
> > >
> > > Was the above output from your printk() output of any help?
> > 
> > Yeah, if my suspicion is correct, it'd require new kworker creation
> > racing against CPU offline, which would explain why it's so difficult
> > to repro.  Can you please see whether the following patch resolves the
> > issue?
> 
> That could explain why only Steve Rostedt and I saw the issue.  As far
> as I know, we are the only ones who regularly run CPU-hotplug stress
> tests.  ;-)

I was a bit confused.  It has to be racing against either new kworker
being created on the wrong CPU or rescuer trying to migrate to the
CPU, and it looks like we're mostly seeing the rescuer condition, but,
yeah, this would only get triggered rarely.  Another contributing
factor could be the vmstat work putting on a workqueue w/ rescuer
recently.  It runs quite often, so probably has increased the chance
of hitting the right condition.

> I have a weekend-long run going, but will give this a shot overnight on
> Monday, Pacific Time.  Thank you for putting it together, looking forward
> to seeing what it does!

Thanks a lot for the testing and patience.  Sorry that it took so
long.  I'm not completely sure the patch is correct.  It might have to
be more specifc about which type of migration or require further
synchronization around migration, but hopefully it'll at least be able
to show that this was the cause of the problem.

Thanks!

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2017-06-17 Thread Paul E. McKenney
On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > And no test failures from yesterday evening.  So it looks like we get
> > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > runtime with your printk() in the mix.
> >
> > Was the above output from your printk() output of any help?
> 
> Yeah, if my suspicion is correct, it'd require new kworker creation
> racing against CPU offline, which would explain why it's so difficult
> to repro.  Can you please see whether the following patch resolves the
> issue?

That could explain why only Steve Rostedt and I saw the issue.  As far
as I know, we are the only ones who regularly run CPU-hotplug stress
tests.  ;-)

I have a weekend-long run going, but will give this a shot overnight on
Monday, Pacific Time.  Thank you for putting it together, looking forward
to seeing what it does!

Thanx, Paul

> Thanks.
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 803c3bc274c4..1500217ce4b4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -980,8 +980,13 @@ struct migration_arg {
>  static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
>struct task_struct *p, int dest_cpu)
>  {
> - if (unlikely(!cpu_active(dest_cpu)))
> - return rq;
> + if (p->flags & PF_KTHREAD) {
> + if (unlikely(!cpu_online(dest_cpu)))
> + return rq;
> + } else {
> + if (unlikely(!cpu_active(dest_cpu)))
> + return rq;
> + }
> 
>   /* Affinity changed (again). */
>   if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
> 



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-17 Thread Tejun Heo
Hello,

On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> And no test failures from yesterday evening.  So it looks like we get
> somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> runtime with your printk() in the mix.
>
> Was the above output from your printk() output of any help?

Yeah, if my suspicion is correct, it'd require new kworker creation
racing against CPU offline, which would explain why it's so difficult
to repro.  Can you please see whether the following patch resolves the
issue?

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803c3bc274c4..1500217ce4b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -980,8 +980,13 @@ struct migration_arg {
 static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
 struct task_struct *p, int dest_cpu)
 {
-   if (unlikely(!cpu_active(dest_cpu)))
-   return rq;
+   if (p->flags & PF_KTHREAD) {
+   if (unlikely(!cpu_online(dest_cpu)))
+   return rq;
+   } else {
+   if (unlikely(!cpu_active(dest_cpu)))
+   return rq;
+   }
 
/* Affinity changed (again). */
	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed))


Re: WARN_ON_ONCE() in process_one_work()?

2017-06-16 Thread Paul E. McKenney
On Thu, Jun 15, 2017 at 08:38:57AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 14, 2017 at 08:15:48AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote:
> > > > Hello, Paul.
> > > > 
> > > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote:
> > > > > Just following up...  I have hit this bug a couple of times over the
> > > > > past few days.  Anything I can do to help?
> > > > 
> > > > My apologies for dropping the ball on this.  I've gone over the hot
> > > > plug code in workqueue several times but can't really find how this
> > > > would happen.  Can you please apply the following patch and see what
> > > > it says when the problem happens?
> > > 
> > > I have fired it up, thank you!
> > > 
> > > Last time I saw one failure in 21 hours of test runs, so I have kicked
> > > off 42 one-hour test runs.  Will see what happens tomorrow morning,
> > > Pacific Time.
> > 
> > And none of the 42 runs resulted in a workqueue splat.  I will try again
> > this evening, Pacific Time.
> > 
> > Who knows, maybe your diagnostic patch is the fix.  ;-)
> 
> And this time, we did get something!  Here is the printk() output:
> 
> [ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 
> online=0-2,7 active=0,2,7
> 
> Please see below for the full splat from dmesg.
> 
> Please let me know if you need additional email.  My test ID is KSIC
> 2017.06.14-15:50:08/TREE07.14, just to help me find it in my large pile
> of test results.  ;-)

And no test failures from yesterday evening.  So it looks like we get
somewhere on the order of one failure per 138 hours of TREE07 rcutorture
runtime with your printk() in the mix.

Was the above output from your printk() output of any help?

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-15 Thread Paul E. McKenney
On Wed, Jun 14, 2017 at 08:15:48AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote:
> > > Hello, Paul.
> > > 
> > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote:
> > > > Just following up...  I have hit this bug a couple of times over the
> > > > past few days.  Anything I can do to help?
> > > 
> > > My apologies for dropping the ball on this.  I've gone over the hot
> > > plug code in workqueue several times but can't really find how this
> > > would happen.  Can you please apply the following patch and see what
> > > it says when the problem happens?
> > 
> > I have fired it up, thank you!
> > 
> > Last time I saw one failure in 21 hours of test runs, so I have kicked
> > off 42 one-hour test runs.  Will see what happens tomorrow morning,
> > Pacific Time.
> 
> And none of the 42 runs resulted in a workqueue splat.  I will try again
> this evening, Pacific Time.
> 
> Who knows, maybe your diagnostic patch is the fix.  ;-)

And this time, we did get something!  Here is the printk() output:

[ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 
online=0-2,7 active=0,2,7

Please see below for the full splat from dmesg.

Please let me know if you need additional email.  My test ID is KSIC
2017.06.14-15:50:08/TREE07.14, just to help me find it in my large pile
of test results.  ;-)

Thanx, Paul



[ 2126.840625] [ cut here ]
[ 2126.841373] WARNING: CPU: 2 PID: 6 at 
/home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2048 
process_one_work+0x5f/0x440
[ 2126.843042] Modules linked in:
[ 2126.843521] CPU: 2 PID: 6 Comm: mm_percpu_wq Not tainted 4.12.0-rc4+ #1
[ 2126.844493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2126.845898] task: 985c5e59bc00 task.stack: b8be000fc000
[ 2126.846773] RIP: 0010:process_one_work+0x5f/0x440
[ 2126.847470] RSP: 0018:b8be000ffe60 EFLAGS: 00010002
[ 2126.848240] RAX: 985c5e595200 RBX: 985c5e56e840 RCX: 
[ 2126.849284] RDX: 0001 RSI: 985c5e8952c0 RDI: 985c5e56e840
[ 2126.850337] RBP: b8be000ffe90 R08: 0002 R09: 
[ 2126.851378] R10: b8be000ffe18 R11:  R12: 985c5e899c00
[ 2126.852410] R13: 985c5e89e800 R14:  R15: 985c5e8952c0
[ 2126.853440] FS:  () GS:985c5e90() 
knlGS:
[ 2126.854650] CS:  0010 DS:  ES:  CR0: 80050033
[ 2126.855500] CR2: 00b0 CR3: 1dcfc000 CR4: 06e0
[ 2126.856558] Call Trace:
[ 2126.856944]  rescuer_thread+0x207/0x350
[ 2126.857520]  kthread+0x104/0x140
[ 2126.858024]  ? process_one_work+0x440/0x440
[ 2126.858644]  ? kthread_create_on_node+0x40/0x40
[ 2126.859343]  ret_from_fork+0x22/0x30
[ 2126.859879] Code: e8 49 8b 45 08 44 8b b0 00 01 00 00 41 83 e6 20 f6 c1 04 
44 89 75 d4 75 21 41 8b 54 24 04 65 44 8b 05 de 8a f9 74 41 39 d0 74 0f <0f> ff 
80 3d b8 d5 ec 00 00 0f 84 92 03 00 00 48 ba eb 83 b5 80 
[ 2126.862715] ---[ end trace df0aaa02bf2d28ee ]---
[ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 
online=0-2,7 active=0,2,7
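
(Aside on reading that line: cpumask_pr_args() pairs a cpumask with the
"%*pbl" printk format so the mask prints as a CPU list.  Here it reads
as: the work's pool is bound to CPU 1 (pool->cpu/flags=1/0x0, so not
disassociated), CPU 1 is still online (0-2,7) but already cleared from
the active mask (0,2,7), i.e. partway through going offline, while the
rescuer actually ran the work on CPU 2.  A one-line kernel fragment of
the same formatting, not taken from the patch:)

	pr_info("online=%*pbl active=%*pbl\n",
		cpumask_pr_args(cpu_online_mask),
		cpumask_pr_args(cpu_active_mask));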



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-14 Thread Paul E. McKenney
On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote:
> > Hello, Paul.
> > 
> > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote:
> > > Just following up...  I have hit this bug a couple of times over the
> > > past few days.  Anything I can do to help?
> > 
> > My apologies for dropping the ball on this.  I've gone over the hot
> > plug code in workqueue several times but can't really find how this
> > would happen.  Can you please apply the following patch and see what
> > it says when the problem happens?
> 
> I have fired it up, thank you!
> 
> Last time I saw one failure in 21 hours of test runs, so I have kicked
> off 42 one-hour test runs.  Will see what happens tomorrow morning,
> Pacific Time.

And none of the 42 runs resulted in a workqueue splat.  I will try again
this evening, Pacific Time.

Who knows, maybe your diagnostic patch is the fix.  ;-)

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-13 Thread Paul E. McKenney
On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote:
> > Just following up...  I have hit this bug a couple of times over the
> > past few days.  Anything I can do to help?
> 
> My apologies for dropping the ball on this.  I've gone over the hot
> plug code in workqueue several times but can't really find how this
> would happen.  Can you please apply the following patch and see what
> it says when the problem happens?

I have fired it up, thank you!

Last time I saw one failure in 21 hours of test runs, so I have kicked
off 42 one-hour test runs.  Will see what happens tomorrow morning,
Pacific Time.

Thanx, Paul

> Thanks.
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c74bf39ef764..bd2ce3cbfb41 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1691,13 +1691,20 @@ static struct worker *alloc_worker(int node)
>  static void worker_attach_to_pool(struct worker *worker,
>  struct worker_pool *pool)
>  {
> + int ret;
> +
>   mutex_lock(&pool->attach_mutex);
> 
>   /*
>* set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>* online CPUs.  It'll be re-applied when any of the CPUs come up.
>*/
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +
> + WARN(ret && !(pool->flags & POOL_DISASSOCIATED),
> +  "set_cpus_allowed_ptr failed, ret=%d pool->cpu/flags=%d/0x%x 
> cpumask=%*pbl online=%*pbl active=%*pbl\n",
> +  ret, pool->cpu, pool->flags, cpumask_pr_args(pool->attrs->cpumask),
> +  cpumask_pr_args(cpu_online_mask), 
> cpumask_pr_args(cpu_active_mask));
> 
>   /*
>* The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
> @@ -2037,8 +2044,11 @@ __acquires(&pool->lock)
>   lockdep_copy_map(&lockdep_map, &work->lockdep_map);
>  #endif
>   /* ensure we're on the correct CPU */
> - WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> -  raw_smp_processor_id() != pool->cpu);
> + if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> +  raw_smp_processor_id() != pool->cpu))
> + printk_once("XXX workfn=%pf pool->cpu/flags=%d/0x%x curcpu=%d 
> online=%*pbl active=%*pbl\n",
> + work->func, pool->cpu, pool->flags, 
> raw_smp_processor_id(),
> + cpumask_pr_args(cpu_online_mask), 
> cpumask_pr_args(cpu_active_mask));
> 
>   /*
>* A single work shouldn't be executed concurrently by
> 



Re: WARN_ON_ONCE() in process_one_work()?

2017-06-13 Thread Tejun Heo
Hello, Paul.

On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote:
> Just following up...  I have hit this bug a couple of times over the
> past few days.  Anything I can do to help?

My apologies for dropping the ball on this.  I've gone over the hot
plug code in workqueue several times but can't really find how this
would happen.  Can you please apply the following patch and see what
it says when the problem happens?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c74bf39ef764..bd2ce3cbfb41 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1691,13 +1691,20 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
   struct worker_pool *pool)
 {
+   int ret;
+
	mutex_lock(&pool->attach_mutex);
 
/*
 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 */
-   set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+
+   WARN(ret && !(pool->flags & POOL_DISASSOCIATED),
+"set_cpus_allowed_ptr failed, ret=%d pool->cpu/flags=%d/0x%x 
cpumask=%*pbl online=%*pbl active=%*pbl\n",
+ret, pool->cpu, pool->flags, cpumask_pr_args(pool->attrs->cpumask),
+cpumask_pr_args(cpu_online_mask), 
cpumask_pr_args(cpu_active_mask));
 
/*
 * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -2037,8 +2044,11 @@ __acquires(&pool->lock)
	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
 #endif
/* ensure we're on the correct CPU */
-   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
-raw_smp_processor_id() != pool->cpu);
+   if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
+raw_smp_processor_id() != pool->cpu))
+   printk_once("XXX workfn=%pf pool->cpu/flags=%d/0x%x curcpu=%d 
online=%*pbl active=%*pbl\n",
+   work->func, pool->cpu, pool->flags, 
raw_smp_processor_id(),
+   cpumask_pr_args(cpu_online_mask), 
cpumask_pr_args(cpu_active_mask));
 
/*
 * A single work shouldn't be executed concurrently by


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-05 Thread Paul E. McKenney
On Mon, May 01, 2017 at 11:58:19AM -0700, Paul E. McKenney wrote:
> On Mon, May 01, 2017 at 02:44:02PM -0400, Tejun Heo wrote:
> > Hello, Paul.
> > 
> > On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote:
> > > On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > > 
> > > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
> > > > what I did wrong to make this happen:
> > > 
> > > Oh, wait...  Rescuer, it says.  Might this be due to the fact that RCU's
> > > expedited grace periods block within a workqueue handler?  Might this
> > > in turn run the system out of workqueue kthreads?  If this is the likely
> > > cause, my approach would be to rework the expedited-grace-period workqueue
> > > handler to return when waiting for the grace period to complete, and to
> > > replace the current wakeup with a schedule_work() or something similar.
> > 
> > That should be completely fine.  It could just be that the rescuer
> > path has a bug around CPU hotplug handling.  Can you please confirm
> > either way on the cpuset usage?
> 
> I have no explicit cpuset usage or affinity of the workqueue handlers
> themselves.
> 
> However, this is thus far only happening in CONFIG_NO_HZ_FULL=y runs, in
> this case, with the kernel boot parameter nohz_full=2-9 out of 16 CPUs.
> IIRC, this sets up a "housekeeping" cpuset that pushes normal tasks away
> from the nohz_full CPUs.
> 
> I do build with CONFIG_HOTPLUG_CPU=y, and the test does a lot of
> hotplugging.  Also, other kthreads (but again, not the workqueue handlers)
> do a lot of explicit CPU-affinity manipulation.

Just following up...  I have hit this bug a couple of times over the
past few days.  Anything I can do to help?

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Steven Rostedt
On Mon, 1 May 2017 15:50:58 -0400
Tejun Heo  wrote:
> 
> Were you also using nohz_full?

Do you mean, "was it compiled in" or "was I actually using it, i.e., had
it on the command line"?

It may have been compiled in, but I didn't do anything to activate it.

-- Steve


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Tejun Heo
On Mon, May 01, 2017 at 03:42:37PM -0400, Steven Rostedt wrote:
> On Mon, 1 May 2017 14:42:50 -0400
> Tejun Heo  wrote:
> 
> > Steven's involved a rescuer too.  One possibility was cpuset being
> > involved somehow and messing up the affinity of the rescuer kthread
> > unexpectedly.  Is cpuset involved in any way?
> 
> I didn't do anything with cpuset. I did take CPUs offline though.
> 
> I'm currently testing changes, but when it's done, I'll remove the
> "WARN_ON() comment out" and see if I can reproduce it again.

Were you also using nohz_full?

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Steven Rostedt
On Mon, 1 May 2017 14:42:50 -0400
Tejun Heo  wrote:

> Steven's involved a rescuer too.  One possibility was cpuset being
> involved somehow and messing up the affinity of the rescuer kthread
> unexpectedly.  Is cpuset involved in any way?

I didn't do anything with cpuset. I did take CPUs offline though.

I'm currently testing changes, but when it's done, I'll remove the
"WARN_ON() comment out" and see if I can reproduce it again.

-- Steve


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Paul E. McKenney
On Mon, May 01, 2017 at 02:44:02PM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote:
> > On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
> > > what I did wrong to make this happen:
> > 
> > Oh, wait...  Rescuer, it says.  Might this be due to the fact that RCU's
> > expedited grace periods block within a workqueue handler?  Might this
> > in turn run the system out of workqueue kthreads?  If this is the likely
> > cause, my approach would be to rework the expedited-grace-period workqueue
> > handler to return when waiting for the grace period to complete, and to
> > replace the current wakeup with a schedule_work() or something similar.
> 
> That should be completely fine.  It could just be that the rescuer
> path has a bug around CPU hotplug handling.  Can you please confirm
> either way on the cpuset usage?

I have no explicit cpuset usage or affinity of the workqueue handlers
themselves.

However, this is thus far only happening in CONFIG_NO_HZ_FULL=y runs, in
this case, with the kernel boot parameter nohz_full=2-9 out of 16 CPUs.
IIRC, this sets up a "housekeeping" cpuset that pushes normal tasks away
from the nohz_full CPUs.

I do build with CONFIG_HOTPLUG_CPU=y, and the test does a lot of
hotplugging.  Also, other kthreads (but again, not the workqueue handlers)
do a lot of explicit CPU-affinity manipulation.
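
(For reference, a minimal user-space sketch of the kind of hotplug
stress involved -- not rcutorture's actual mechanism, just one CPU
being toggled through the standard sysfs hotplug interface; needs
root:)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return;		/* e.g. cpu0 often cannot be offlined */
	fprintf(f, "%d\n", online);
	fclose(f);
}

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 1;

	for (;;) {	/* run alongside the workload under test */
		set_cpu_online(cpu, 0);
		usleep(100 * 1000);
		set_cpu_online(cpu, 1);
		usleep(100 * 1000);
	}
	return 0;
}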

Thanx, Paul



Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Tejun Heo
Hello, Paul.

On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote:
> On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
> > what I did wrong to make this happen:
> 
> Oh, wait...  Rescuer, it says.  Might this be due to the fact that RCU's
> expedited grace periods block within a workqueue handler?  Might this
> in turn run the system out of workqueue kthreads?  If this is the likely
> cause, my approach would be to rework the expedited-grace-period workqueue
> handler to return when waiting for the grace period to complete, and to
> replace the current wakeup with a schedule_work() or something similar.

That should be completely fine.  It could just be that the rescuer
path has a bug around CPU hotplug handling.  Can you please confirm
either way on the cpuset usage?

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Tejun Heo
Hello, Paul.

Hmmm... Steven reported a similar issue.

 http://lkml.kernel.org/r/20170405151628.33df7...@gandalf.local.home

On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote:
> Hello!
> 
> I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
> what I did wrong to make this happen:
> 
> 
> 
> static void process_one_work(struct worker *worker, struct work_struct *work)
> __releases(&pool->lock)
> __acquires(&pool->lock)
> {
>   struct pool_workqueue *pwq = get_work_pwq(work);
>   struct worker_pool *pool = worker->pool;
>   bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
>   int work_color;
>   struct worker *collision;
> #ifdef CONFIG_LOCKDEP
>   /*
>* It is permissible to free the struct work_struct from
>* inside the function that is called from it, this we need to
>* take into account for lockdep too.  To avoid bogus "held
>* lock freed" warnings as well as problems when looking into
>* work->lockdep_map, make a copy and use that here.
>*/
>   struct lockdep_map lockdep_map;
> 
>   lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
>   /* ensure we're on the correct CPU */
>   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>raw_smp_processor_id() != pool->cpu);
> 
> 
> 
> Here is the splat:
> 
> 
> 
> [12600.593006] WARNING: CPU: 0 PID: 6 at 
> /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 
> process_one_work+0x46c/0x4d0
> [12600.593006] Modules linked in:
> [12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1
> [12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [12600.593006] Call Trace:
> [12600.593006]  dump_stack+0x4f/0x72
> [12600.593006]  __warn+0xc6/0xe0
> [12600.593006]  warn_slowpath_null+0x18/0x20
> [12600.593006]  process_one_work+0x46c/0x4d0
> [12600.593006]  rescuer_thread+0x20e/0x3b0
> [12600.593006]  kthread+0x104/0x140
> [12600.593006]  ? worker_thread+0x4e0/0x4e0
> [12600.593006]  ? kthread_create_on_node+0x40/0x40
> [12600.593006]  ret_from_fork+0x29/0x40
> 
> 
> 
> This happens about 3.5 hours into the TREE03 rcutorture scenario, .config
> attached.

Steven's involved a rescuer too.  One possibility was cpuset being
involved somehow and messing up the affinity of the rescuer kthread
unexpectedly.  Is cpuset involved in any way?

Thanks.

-- 
tejun


Re: WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Paul E. McKenney
On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote:
> Hello!
> 
> I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
> what I did wrong to make this happen:

Oh, wait...  Rescuer, it says.  Might this be due to the fact that RCU's
expedited grace periods block within a workqueue handler?  Might this
in turn run the system out of workqueue kthreads?  If this is the likely
cause, my approach would be to rework the expedited-grace-period workqueue
handler to return when waiting for the grace period to complete, and to
replace the current wakeup with a schedule_work() or something similar.
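
(A minimal sketch of that rework -- every name here is invented, and
this is not presented as what the RCU code actually does: the handler
polls and re-queues itself instead of sleeping in the workqueue, so it
never ties up a kworker or the rescuer for the length of a grace
period.)

#include <linux/workqueue.h>
#include <linux/compiler.h>

static void exp_gp_poll_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(exp_gp_poll_work, exp_gp_poll_fn);

static bool exp_gp_done;	/* stand-in for "expedited GP completed" */

static void exp_gp_poll_fn(struct work_struct *work)
{
	if (!READ_ONCE(exp_gp_done)) {
		/* Not done yet: re-queue and return rather than block
		 * here, so the kworker (or rescuer) is released. */
		schedule_delayed_work(&exp_gp_poll_work, 1);
		return;
	}
	/* Grace period complete: do the wakeup that previously
	 * happened after the blocking wait.  The initial queueing
	 * would go where the blocking wait used to start. */
}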

Thoughts?

Thanx, Paul

> 
> 
> static void process_one_work(struct worker *worker, struct work_struct *work)
> __releases(&pool->lock)
> __acquires(&pool->lock)
> {
>   struct pool_workqueue *pwq = get_work_pwq(work);
>   struct worker_pool *pool = worker->pool;
>   bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
>   int work_color;
>   struct worker *collision;
> #ifdef CONFIG_LOCKDEP
>   /*
>* It is permissible to free the struct work_struct from
>* inside the function that is called from it, this we need to
>* take into account for lockdep too.  To avoid bogus "held
>* lock freed" warnings as well as problems when looking into
>* work->lockdep_map, make a copy and use that here.
>*/
>   struct lockdep_map lockdep_map;
> 
>   lockdep_copy_map(&lockdep_map, &work->lockdep_map);
> #endif
>   /* ensure we're on the correct CPU */
>   WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>raw_smp_processor_id() != pool->cpu);
> 
> 
> 
> Here is the splat:
> 
> 
> 
> [12600.593006] WARNING: CPU: 0 PID: 6 at 
> /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 
> process_one_work+0x46c/0x4d0
> [12600.593006] Modules linked in:
> [12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1
> [12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [12600.593006] Call Trace:
> [12600.593006]  dump_stack+0x4f/0x72
> [12600.593006]  __warn+0xc6/0xe0
> [12600.593006]  warn_slowpath_null+0x18/0x20
> [12600.593006]  process_one_work+0x46c/0x4d0
> [12600.593006]  rescuer_thread+0x20e/0x3b0
> [12600.593006]  kthread+0x104/0x140
> [12600.593006]  ? worker_thread+0x4e0/0x4e0
> [12600.593006]  ? kthread_create_on_node+0x40/0x40
> [12600.593006]  ret_from_fork+0x29/0x40
> 
> 
> 
> This happens about 3.5 hours into the TREE03 rcutorture scenario, .config
> attached.
> 
> Enlightenment?
> 
>   Thanx, Paul

> #
> # Automatically generated file; DO NOT EDIT.
> # Linux/x86 4.11.0-rc7 Kernel Configuration
> #
> CONFIG_64BIT=y
> CONFIG_X86_64=y
> CONFIG_X86=y
> CONFIG_INSTRUCTION_DECODER=y
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_MMU=y
> CONFIG_ARCH_MMAP_RND_BITS_MIN=28
> CONFIG_ARCH_MMAP_RND_BITS_MAX=32
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
> CONFIG_NEED_DMA_MAP_STATE=y
> CONFIG_NEED_SG_DMA_LENGTH=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_GENERIC_BUG=y
> CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
> CONFIG_GENERIC_HWEIGHT=y
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_GENERIC_CALIBRATE_DELAY=y
> CONFIG_ARCH_HAS_CPU_RELAX=y
> CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
> CONFIG_HAVE_SETUP_PER_CPU_AREA=y
> CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
> CONFIG_ARCH_HIBERNATION_POSSIBLE=y
> CONFIG_ARCH_SUSPEND_POSSIBLE=y
> CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
> CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
> CONFIG_ZONE_DMA32=y
> CONFIG_AUDIT_ARCH=y
> CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_HAVE_INTEL_TXT=y
> CONFIG_X86_64_SMP=y
> CONFIG_ARCH_SUPPORTS_UPROBES=y
> CONFIG_FIX_EARLYCON_MEM=y
> CONFIG_PGTABLE_LEVELS=4
> CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
> CONFIG_IRQ_WORK=y
> CONFIG_BUILDTIME_EXTABLE_SORT=y
> CONFIG_THREAD_INFO_IN_TASK=y
> 
> #
> # General setup
> #
> C

WARN_ON_ONCE() in process_one_work()?

2017-05-01 Thread Paul E. McKenney
Hello!

I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
what I did wrong to make this happen:



static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&pool->lock)
__acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
int work_color;
struct worker *collision;
#ifdef CONFIG_LOCKDEP
/*
 * It is permissible to free the struct work_struct from
 * inside the function that is called from it, this we need to
 * take into account for lockdep too.  To avoid bogus "held
 * lock freed" warnings as well as problems when looking into
 * work->lockdep_map, make a copy and use that here.
 */
struct lockdep_map lockdep_map;

	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
/* ensure we're on the correct CPU */
WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
 raw_smp_processor_id() != pool->cpu);



Here is the splat:



[12600.593006] WARNING: CPU: 0 PID: 6 at 
/home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 
process_one_work+0x46c/0x4d0
[12600.593006] Modules linked in:
[12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1
[12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[12600.593006] Call Trace:
[12600.593006]  dump_stack+0x4f/0x72
[12600.593006]  __warn+0xc6/0xe0
[12600.593006]  warn_slowpath_null+0x18/0x20
[12600.593006]  process_one_work+0x46c/0x4d0
[12600.593006]  rescuer_thread+0x20e/0x3b0
[12600.593006]  kthread+0x104/0x140
[12600.593006]  ? worker_thread+0x4e0/0x4e0
[12600.593006]  ? kthread_create_on_node+0x40/0x40
[12600.593006]  ret_from_fork+0x29/0x40



This happens about 3.5 hours into the TREE03 rcutorture scenario, .config
attached.

Enlightenment?

Thanx, Paul
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.11.0-rc7 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_S
