Re: WARN_ON_ONCE() in process_one_work()?
On Tue, Jul 03, 2018 at 01:12:05PM -0700, Tejun Heo wrote:
> Hello, Paul.
>
> On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote:
> > > I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> > > trigger since.  :-/
> >
> > But I get a build error:
> >
> > kernel/workqueue.o: In function `worker_attach_to_pool':
> > workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
> > workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
> > /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target
> > 'vmlinux' failed
> >
> > My guess is that the worker_attach_to_pool() code needs to change so
> > as to invoke cpuhp_current_state() and cpuhp_target_state() only in
> > CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.
>
> Just to clarify, the bug happened on a kernel with cpuhp enabled,
> right?

It seems to be a bit more complicated than that, despite my initial
confidence that I knew what was going on.  The build failed for these
rcutorture scenarios:

SRCU-N: (No idea!)
	CONFIG_RCU_TRACE=n CONFIG_SMP=y CONFIG_NR_CPUS=4
	CONFIG_HOTPLUG_CPU=y CONFIG_PREEMPT_NONE=y
	CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n
SRCU-t: (Expected, CONFIG_SMP=n)
SRCU-u: (Expected, CONFIG_SMP=n)
TASKS02: (Expected, CONFIG_SMP=n)
TASKS03: (Expected, CONFIG_HOTPLUG_CPU=n)
TINY01: (Expected, CONFIG_SMP=n)
TINY02: (Expected, CONFIG_SMP=n)
TREE01: (No idea!)
	CONFIG_SMP=y CONFIG_PREEMPT_NONE=n CONFIG_PREEMPT_VOLUNTARY=n
	CONFIG_PREEMPT=y CONFIG_HZ_PERIODIC=n CONFIG_NO_HZ_IDLE=y
	CONFIG_NO_HZ_FULL=n CONFIG_RCU_FAST_NO_HZ=y CONFIG_RCU_TRACE=y
	CONFIG_HOTPLUG_CPU=y CONFIG_MAXSMP=y CONFIG_CPUMASK_OFFSTACK=y
	CONFIG_RCU_NOCB_CPU=y CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_RCU_BOOST=n
	CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y
TREE02 didn't fail despite having CONFIG_HOTPLUG_CPU=n!
TREE03: (Expected, CONFIG_HOTPLUG_CPU=n)
TREE04: (Expected, CONFIG_HOTPLUG_CPU=n)
TREE07: (No idea!)
	CONFIG_NR_CPUS=16 CONFIG_PREEMPT_NONE=y CONFIG_PREEMPT_VOLUNTARY=n
	CONFIG_PREEMPT=n CONFIG_HZ_PERIODIC=n CONFIG_NO_HZ_IDLE=n
	CONFIG_NO_HZ_FULL=y CONFIG_RCU_FAST_NO_HZ=n CONFIG_RCU_TRACE=y
	CONFIG_HOTPLUG_CPU=y CONFIG_RCU_FANOUT=2 CONFIG_RCU_FANOUT_LEAF=2
	CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
	CONFIG_RCU_EXPERT=y
TREE08: (Expected, CONFIG_HOTPLUG_CPU=n)
TREE09: (Expected, CONFIG_SMP=n)

Thoughts?

							Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul.

On Tue, Jul 03, 2018 at 09:40:44AM -0700, Paul E. McKenney wrote:
> > I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> > trigger since.  :-/
>
> But I get a build error:
>
> kernel/workqueue.o: In function `worker_attach_to_pool':
> workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
> workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
> /home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux'
> failed
>
> My guess is that the worker_attach_to_pool() code needs to change so
> as to invoke cpuhp_current_state() and cpuhp_target_state() only in
> CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.

Just to clarify, the bug happened on a kernel with cpuhp enabled,
right?

Thanks.

-- 
tejun
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, Jul 02, 2018 at 09:05:18PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 02, 2018 at 02:05:40PM -0700, Tejun Heo wrote:
> > Hello, Paul.
> >
> > Sorry about the late reply.
> >
> > On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> > > I have hit this WARN_ON_ONCE() in process_one_work:
> > >
> > >	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> > >		     raw_smp_processor_id() != pool->cpu);
> > >
> > > This looks like it is my rcu_gp workqueue (see splat below), and it
> > > appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> > > which does random CPU-hotplug operations (in case that helps).
> > >
> > > Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> > > what should I do to further debug this?
> >
> > Hmm... I checked the code paths but couldn't spot anything suspicious.
> > Can you please apply the following patch and see whether it triggers
> > before hitting the warn and if so report what it says?
>
> I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
> trigger since.  :-/

But I get a build error:

kernel/workqueue.o: In function `worker_attach_to_pool':
workqueue.c:(.text+0x63c): undefined reference to `cpuhp_target_state'
workqueue.c:(.text+0x647): undefined reference to `cpuhp_current_state'
/home/paulmck/public_git/linux-rcu/Makefile:1015: recipe for target 'vmlinux' failed

My guess is that the worker_attach_to_pool() code needs to change so
as to invoke cpuhp_current_state() and cpuhp_target_state() only in
CONFIG_HOTPLUG_CPU=y kernels, but I figured I should defer to you.

							Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, Jul 02, 2018 at 02:05:40PM -0700, Tejun Heo wrote:
> Hello, Paul.
>
> Sorry about the late reply.
>
> On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> > I have hit this WARN_ON_ONCE() in process_one_work:
> >
> >	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
> >		     raw_smp_processor_id() != pool->cpu);
> >
> > This looks like it is my rcu_gp workqueue (see splat below), and it
> > appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> > which does random CPU-hotplug operations (in case that helps).
> >
> > Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> > what should I do to further debug this?
>
> Hmm... I checked the code paths but couldn't spot anything suspicious.
> Can you please apply the following patch and see whether it triggers
> before hitting the warn and if so report what it says?

I will apply this, but be advised that I have not seen that WARN_ON_ONCE()
trigger since.  :-/

							Thanx, Paul

> Thanks.
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 0db8938fbb23..81caab9643b2 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
>  static struct lockdep_map cpuhp_state_down_map =
>  	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
>
> +int cpuhp_current_state(int cpu)
> +{
> +	return per_cpu_ptr(&cpuhp_state, cpu)->state;
> +}
> +
> +int cpuhp_target_state(int cpu)
> +{
> +	return per_cpu_ptr(&cpuhp_state, cpu)->target;
> +}
>
>  static inline void cpuhp_lock_acquire(bool bringup)
>  {
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 78b192071ef7..365cf6342808 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
>  	return worker;
>  }
>
> +int cpuhp_current_state(int cpu);
> +int cpuhp_target_state(int cpu);
> +
>  /**
>   * worker_attach_to_pool() - attach a worker to a pool
>   * @worker: worker to be attached
> @@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
>  static void worker_attach_to_pool(struct worker *worker,
>  				  struct worker_pool *pool)
>  {
> +	int ret;
> +
>  	mutex_lock(&wq_pool_attach_mutex);
>
>  	/*
>  	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>  	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
>  	 */
> -	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +	ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +	if (ret && pool->cpu >= 0 && worker->rescue_wq)
> +		printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d cpuhp_target=%d\n",
> +		       ret, pool->id, raw_smp_processor_id(), pool->cpu,
> +		       cpuhp_current_state(pool->cpu),
> +		       cpuhp_target_state(pool->cpu));
>
>  	/*
>  	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul.

Sorry about the late reply.

On Wed, Jun 20, 2018 at 12:29:01PM -0700, Paul E. McKenney wrote:
> I have hit this WARN_ON_ONCE() in process_one_work:
>
>	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
>		     raw_smp_processor_id() != pool->cpu);
>
> This looks like it is my rcu_gp workqueue (see splat below), and it
> appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
> which does random CPU-hotplug operations (in case that helps).
>
> Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
> what should I do to further debug this?

Hmm... I checked the code paths but couldn't spot anything suspicious.
Can you please apply the following patch and see whether it triggers
before hitting the warn and if so report what it says?

Thanks.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0db8938fbb23..81caab9643b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -79,6 +79,15 @@ static struct lockdep_map cpuhp_state_up_map =
 static struct lockdep_map cpuhp_state_down_map =
 	STATIC_LOCKDEP_MAP_INIT("cpuhp_state-down", &cpuhp_state_down_map);
 
+int cpuhp_current_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->state;
+}
+
+int cpuhp_target_state(int cpu)
+{
+	return per_cpu_ptr(&cpuhp_state, cpu)->target;
+}
 
 static inline void cpuhp_lock_acquire(bool bringup)
 {
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 78b192071ef7..365cf6342808 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1712,6 +1712,9 @@ static struct worker *alloc_worker(int node)
 	return worker;
 }
 
+int cpuhp_current_state(int cpu);
+int cpuhp_target_state(int cpu);
+
 /**
  * worker_attach_to_pool() - attach a worker to a pool
  * @worker: worker to be attached
@@ -1724,13 +1727,20 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
 				  struct worker_pool *pool)
 {
+	int ret;
+
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
 	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
 	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	if (ret && pool->cpu >= 0 && worker->rescue_wq)
+		printk("XXX rescuer failed to attach: ret=%d pool=%d this_cpu=%d target_cpu=%d cpuhp_state=%d cpuhp_target=%d\n",
+		       ret, pool->id, raw_smp_processor_id(), pool->cpu,
+		       cpuhp_current_state(pool->cpu),
+		       cpuhp_target_state(pool->cpu));
 
 	/*
 	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
WARN_ON_ONCE() in process_one_work()?
Hello!

I have hit this WARN_ON_ONCE() in process_one_work:

	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
		     raw_smp_processor_id() != pool->cpu);

This looks like it is my rcu_gp workqueue (see splat below), and it
appears to be intermittent.  This happens on rcutorture scenario SRCU-N,
which does random CPU-hotplug operations (in case that helps).

Is this related to the recent addition of WQ_MEM_RECLAIM?  Either way,
what should I do to further debug this?

							Thanx, Paul

[  828.791694] WARNING: CPU: 2 PID: 3 at /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2069 process_one_work+0x49/0x3c0
[  828.794850] Modules linked in:
[  828.795748] CPU: 2 PID: 3 Comm: rcu_gp Not tainted 4.18.0-rc1+ #1
[  828.797436] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  828.800059] Workqueue: (null) (rcu_gp)
[  828.801373] RIP: 0010:process_one_work+0x49/0x3c0
[  828.802690] Code: 00 00 00 00 4c 0f 44 e0 49 8b 44 24 08 44 8b a8 00 01 00 00 41 83 e5 20 f6 45 10 04 75 0e 65 8b 05 6c 99 79 68 3b 45 04 74 02 <0f> 0b 48 ba eb 83 b5 80 46 86 c8 61 48 0f af d6 48 c1 ea 3a 48 8b
[  828.807871] RSP: :8a3c400e7e70 EFLAGS: 00010002
[  828.809314] RAX: 0002 RBX: 88acdfc206e0 RCX: 88acdfc206e0
[  828.811307] RDX: 88acdfc206d8 RSI: 88acdfc197b8 RDI: 88acde80c6c0
[  828.813253] RBP: 88acdfc206c0 R08:  R09: 
[  828.815239] R10: 8a3c400e7e48 R11:  R12: 88acdfc24900
[  828.817073] R13:  R14: 88acde80c6c0 R15: 88acde80c6e8
[  828.819071] FS: () GS:88acdfd0() knlGS:
[  828.821361] CS: 0010 DS:  ES:  CR0: 80050033
[  828.822971] CR2: 00b0 CR3: 1040a000 CR4: 06e0
[  828.824935] Call Trace:
[  828.825653]  rescuer_thread+0x201/0x320
[  828.826744]  kthread+0xf3/0x130
[  828.827618]  ? process_one_work+0x3c0/0x3c0
[  828.828758]  ? kthread_destroy_worker+0x40/0x40
[  828.830019]  ret_from_fork+0x35/0x40
[  828.831028] ---[ end trace 4197eb078bf0edac ]---
Re: WARN_ON_ONCE() in process_one_work()?
On Fri, Jun 23, 2017 at 09:41:42AM -0700, Paul E. McKenney wrote:
> On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > > > And no test failures from yesterday evening.  So it looks like we get
> > > > > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > > > > runtime with your printk() in the mix.
> > > > > > >
> > > > > > > Was the above output from your printk() output of any help?
> > > > > >
> > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > > > racing against CPU offline, which would explain why it's so difficult
> > > > > > to repro.  Can you please see whether the following patch resolves the
> > > > > > issue?
> > > > >
> > > > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > > > tests.  ;-)
> > > >
> > > > I was a bit confused.  It has to be racing against either new kworker
> > > > being created on the wrong CPU or rescuer trying to migrate to the
> > > > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > > > yeah, this would only get triggered rarely.  Another contributing
> > > > factor could be the vmstat work putting on a workqueue w/ rescuer
> > > > recently.  It runs quite often, so probably has increased the chance
> > > > of hitting the right condition.
> > >
> > > Sounds like too much fun!  ;-)
> > >
> > > But more constructively...  If I understand correctly, it is now possible
> > > to take a CPU partially offline and put it back online again.  This should
> > > allow much more intense testing of this sort of interaction.
> > >
> > > And no, I haven't yet tried this with RCU because I would probably need
> > > to do some mix of just-RCU online/offline and full-up online-offline.
> > > Plus RCU requires pretty much a full online/offline cycle to fully
> > > exercise it.  :-/
> > >
> > > > > I have a weekend-long run going, but will give this a shot overnight on
> > > > > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > > > > to seeing what it does!
> > > >
> > > > Thanks a lot for the testing and patience.  Sorry that it took so
> > > > long.  I'm not completely sure the patch is correct.  It might have to
> > > > be more specific about which type of migration or require further
> > > > synchronization around migration, but hopefully it'll at least be able
> > > > to show that this was the cause of the problem.
> > >
> > > And last night's tests had no failures.  Which might actually mean
> > > something, will get more info when I run without your patch this
> > > evening.  ;-)
> >
> > And it didn't fail without the patch, either.  45 hours of test vs.
> > 60 hours with the patch.  This one is not going to be easy to prove
> > either way.  I will try again this evening without the patch and see
> > what that gets us.
>
> And another 36 hours (total of 81 hours) without the patch, still no
> failure.  Sigh.
>
> In the sense that the patch doesn't cause any new problem:
>
> Tested-by: Paul E. McKenney
>
> But I clearly have nothing of statistical significance, so any confidence
> in the fix is coming from your reproducer.

And for whatever it is worth, I did finally get a reproduction without
the patch.  The probability of occurrence is quite low with my test
setup, so please queue this patch.  I will accumulate test time on it
over the months to come.  :-/

							Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Wed, Jun 21, 2017 at 08:30:35AM -0700, Paul E. McKenney wrote: > On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote: > > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote: > > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > > > > > > And no test failures from yesterday evening. So it looks like we > > > > > > get > > > > > > somewhere on the order of one failure per 138 hours of TREE07 > > > > > > rcutorture > > > > > > runtime with your printk() in the mix. > > > > > > > > > > > > Was the above output from your printk() output of any help? > > > > > > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation > > > > > racing against CPU offline, which would explain why it's so difficult > > > > > to repro. Can you please see whether the following patch resolves the > > > > > issue? > > > > > > > > That could explain why only Steve Rostedt and I saw the issue. As far > > > > as I know, we are the only ones who regularly run CPU-hotplug stress > > > > tests. ;-) > > > > > > I was a bit confused. It has to be racing against either new kworker > > > being created on the wrong CPU or rescuer trying to migrate to the > > > CPU, and it looks like we're mostly seeing the rescuer condition, but, > > > yeah, this would only get triggered rarely. Another contributing > > > factor could be the vmstat work putting on a workqueue w/ rescuer > > > recently. It runs quite often, so probably has increased the chance > > > of hitting the right condition. > > > > Sounds like too much fun! ;-) > > > > But more constructively... If I understand correctly, it is now possible > > to take a CPU partially offline and put it back online again. This should > > allow much more intense testing of this sort of interaction. 
> > > > And no, I haven't yet tried this with RCU because I would probably need > > to do some mix of just-RCU online/offline and full-up online-offline. > > Plus RCU requires pretty much a full online/offline cycle to fully > > exercise it. :-/ > > > > > > I have a weekend-long run going, but will give this a shot overnight on > > > > Monday, Pacific Time. Thank you for putting it together, looking > > > > forward > > > > to seeing what it does! > > > > > > Thanks a lot for the testing and patience. Sorry that it took so > > > long. I'm not completely sure the patch is correct. It might have to > > > be more specifc about which type of migration or require further > > > synchronization around migration, but hopefully it'll at least be able > > > to show that this was the cause of the problem. > > > > And last night's tests had no failures. Which might actually mean > > something, will get more info when I run without your patch this > > evening. ;-) > > And it didn't fail without the patch, either. 45 hours of test vs. > 60 hours with the patch. This one is not going to be easy to prove > either way. I will try again this evening without the patch and see > what that gets us. And another 36 hours (total of 81 hours) without the patch, still no failure. Sigh. In the sense that the patch doesn't cause any new problem: Tested-by: Paul E. McKenney But I clearly have nothing of statistical significance, so any confidence in the fix is coming from your reproducer. Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote: > On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote: > > Hello, > > > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote: > > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > > > > > And no test failures from yesterday evening. So it looks like we get > > > > > somewhere on the order of one failure per 138 hours of TREE07 > > > > > rcutorture > > > > > runtime with your printk() in the mix. > > > > > > > > > > Was the above output from your printk() output of any help? > > > > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation > > > > racing against CPU offline, which would explain why it's so difficult > > > > to repro. Can you please see whether the following patch resolves the > > > > issue? > > > > > > That could explain why only Steve Rostedt and I saw the issue. As far > > > as I know, we are the only ones who regularly run CPU-hotplug stress > > > tests. ;-) > > > > I was a bit confused. It has to be racing against either new kworker > > being created on the wrong CPU or rescuer trying to migrate to the > > CPU, and it looks like we're mostly seeing the rescuer condition, but, > > yeah, this would only get triggered rarely. Another contributing > > factor could be the vmstat work putting on a workqueue w/ rescuer > > recently. It runs quite often, so probably has increased the chance > > of hitting the right condition. > > Sounds like too much fun! ;-) > > But more constructively... If I understand correctly, it is now possible > to take a CPU partially offline and put it back online again. This should > allow much more intense testing of this sort of interaction. > > And no, I haven't yet tried this with RCU because I would probably need > to do some mix of just-RCU online/offline and full-up online-offline. 
> Plus RCU requires pretty much a full online/offline cycle to fully > exercise it. :-/ > > > > I have a weekend-long run going, but will give this a shot overnight on > > > Monday, Pacific Time. Thank you for putting it together, looking forward > > > to seeing what it does! > > > > Thanks a lot for the testing and patience. Sorry that it took so > > long. I'm not completely sure the patch is correct. It might have to > > be more specifc about which type of migration or require further > > synchronization around migration, but hopefully it'll at least be able > > to show that this was the cause of the problem. > > And last night's tests had no failures. Which might actually mean > something, will get more info when I run without your patch this > evening. ;-) And it didn't fail without the patch, either. 45 hours of test vs. 60 hours with the patch. This one is not going to be easy to prove either way. I will try again this evening without the patch and see what that gets us. Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote: > Hello, > > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote: > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote: > > > Hello, > > > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > > > > And no test failures from yesterday evening. So it looks like we get > > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture > > > > runtime with your printk() in the mix. > > > > > > > > Was the above output from your printk() output of any help? > > > > > > Yeah, if my suspicion is correct, it'd require new kworker creation > > > racing against CPU offline, which would explain why it's so difficult > > > to repro. Can you please see whether the following patch resolves the > > > issue? > > > > That could explain why only Steve Rostedt and I saw the issue. As far > > as I know, we are the only ones who regularly run CPU-hotplug stress > > tests. ;-) > > I was a bit confused. It has to be racing against either new kworker > being created on the wrong CPU or rescuer trying to migrate to the > CPU, and it looks like we're mostly seeing the rescuer condition, but, > yeah, this would only get triggered rarely. Another contributing > factor could be the vmstat work putting on a workqueue w/ rescuer > recently. It runs quite often, so probably has increased the chance > of hitting the right condition. Sounds like too much fun! ;-) But more constructively... If I understand correctly, it is now possible to take a CPU partially offline and put it back online again. This should allow much more intense testing of this sort of interaction. And no, I haven't yet tried this with RCU because I would probably need to do some mix of just-RCU online/offline and full-up online-offline. Plus RCU requires pretty much a full online/offline cycle to fully exercise it. 
:-/ > > I have a weekend-long run going, but will give this a shot overnight on > > Monday, Pacific Time. Thank you for putting it together, looking forward > > to seeing what it does! > > Thanks a lot for the testing and patience. Sorry that it took so > long. I'm not completely sure the patch is correct. It might have to > be more specifc about which type of migration or require further > synchronization around migration, but hopefully it'll at least be able > to show that this was the cause of the problem. And last night's tests had no failures. Which might actually mean something, will get more info when I run without your patch this evening. ;-) Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
Hello, On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote: > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote: > > Hello, > > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > > > And no test failures from yesterday evening. So it looks like we get > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture > > > runtime with your printk() in the mix. > > > > > > Was the above output from your printk() output of any help? > > > > Yeah, if my suspicion is correct, it'd require new kworker creation > > racing against CPU offline, which would explain why it's so difficult > > to repro. Can you please see whether the following patch resolves the > > issue? > > That could explain why only Steve Rostedt and I saw the issue. As far > as I know, we are the only ones who regularly run CPU-hotplug stress > tests. ;-) I was a bit confused. It has to be racing against either new kworker being created on the wrong CPU or rescuer trying to migrate to the CPU, and it looks like we're mostly seeing the rescuer condition, but, yeah, this would only get triggered rarely. Another contributing factor could be the vmstat work putting on a workqueue w/ rescuer recently. It runs quite often, so probably has increased the chance of hitting the right condition. > I have a weekend-long run going, but will give this a shot overnight on > Monday, Pacific Time. Thank you for putting it together, looking forward > to seeing what it does! Thanks a lot for the testing and patience. Sorry that it took so long. I'm not completely sure the patch is correct. It might have to be more specific about which type of migration or require further synchronization around migration, but hopefully it'll at least be able to show that this was the cause of the problem. Thanks! -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote: > Hello, > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > > And no test failures from yesterday evening. So it looks like we get > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture > > runtime with your printk() in the mix. > > > > Was the above output from your printk() output of any help? > > Yeah, if my suspicion is correct, it'd require new kworker creation > racing against CPU offline, which would explain why it's so difficult > to repro. Can you please see whether the following patch resolves the > issue? That could explain why only Steve Rostedt and I saw the issue. As far as I know, we are the only ones who regularly run CPU-hotplug stress tests. ;-) I have a weekend-long run going, but will give this a shot overnight on Monday, Pacific Time. Thank you for putting it together, looking forward to seeing what it does! Thanx, Paul > Thanks. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 803c3bc274c4..1500217ce4b4 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -980,8 +980,13 @@ struct migration_arg { > static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf, > struct task_struct *p, int dest_cpu) > { > - if (unlikely(!cpu_active(dest_cpu))) > - return rq; > + if (p->flags & PF_KTHREAD) { > + if (unlikely(!cpu_online(dest_cpu))) > + return rq; > + } else { > + if (unlikely(!cpu_active(dest_cpu))) > + return rq; > + } > > /* Affinity changed (again). */ > if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) >
Re: WARN_ON_ONCE() in process_one_work()?
Hello, On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote: > And no test failures from yesterday evening. So it looks like we get > somewhere on the order of one failure per 138 hours of TREE07 rcutorture > runtime with your printk() in the mix. > > Was the above output from your printk() output of any help? Yeah, if my suspicion is correct, it'd require new kworker creation racing against CPU offline, which would explain why it's so difficult to repro. Can you please see whether the following patch resolves the issue? Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 803c3bc274c4..1500217ce4b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -980,8 +980,13 @@ struct migration_arg {
 static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
 				 struct task_struct *p, int dest_cpu)
 {
-	if (unlikely(!cpu_active(dest_cpu)))
-		return rq;
+	if (p->flags & PF_KTHREAD) {
+		if (unlikely(!cpu_online(dest_cpu)))
+			return rq;
+	} else {
+		if (unlikely(!cpu_active(dest_cpu)))
+			return rq;
+	}

 	/* Affinity changed (again). */
 	if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
Re: WARN_ON_ONCE() in process_one_work()?
On Thu, Jun 15, 2017 at 08:38:57AM -0700, Paul E. McKenney wrote: > On Wed, Jun 14, 2017 at 08:15:48AM -0700, Paul E. McKenney wrote: > > On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote: > > > On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote: > > > > Hello, Paul. > > > > > > > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote: > > > > > Just following up... I have hit this bug a couple of times over the > > > > > past few days. Anything I can do to help? > > > > > > > > My apologies for dropping the ball on this. I've gone over the hot > > > > plug code in workqueue several times but can't really find how this > > > > would happen. Can you please apply the following patch and see what > > > > it says when the problem happens? > > > > > > I have fired it up, thank you! > > > > > > Last time I saw one failure in 21 hours of test runs, so I have kicked > > > of 42 one-hour test runs. Will see what happens tomorrow morning, > > > Pacific Time. > > > > And none of the 42 runs resulted in a workqueue splat. I will try again > > this evening, Pacific Time. > > > > Who knows, maybe your diagnostic patch is the fix. ;-) > > And this time, we did get something! Here is the printk() output: > > [ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 > online=0-2,7 active=0,2,7 > > Please see below for the full splat from dmesg. > > Please let me know if you need additional email. My test ID is KSIC > 2017.06.14-15:50:08/TREE07.14, just to help me find it in my large pile > of test results. ;-) And no test failures from yesterday evening. So it looks like we get somewhere on the order of one failure per 138 hours of TREE07 rcutorture runtime with your printk() in the mix. Was the above output from your printk() output of any help? Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Wed, Jun 14, 2017 at 08:15:48AM -0700, Paul E. McKenney wrote: > On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote: > > On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote: > > > Hello, Paul. > > > > > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote: > > > > Just following up... I have hit this bug a couple of times over the > > > > past few days. Anything I can do to help? > > > > > > My apologies for dropping the ball on this. I've gone over the hot > > > plug code in workqueue several times but can't really find how this > > > would happen. Can you please apply the following patch and see what > > > it says when the problem happens? > > > > I have fired it up, thank you! > > > > Last time I saw one failure in 21 hours of test runs, so I have kicked > > of 42 one-hour test runs. Will see what happens tomorrow morning, > > Pacific Time. > > And none of the 42 runs resulted in a workqueue splat. I will try again > this evening, Pacific Time. > > Who knows, maybe your diagnostic patch is the fix. ;-) And this time, we did get something! Here is the printk() output: [ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 online=0-2,7 active=0,2,7 Please see below for the full splat from dmesg. Please let me know if you need additional email. My test ID is KSIC 2017.06.14-15:50:08/TREE07.14, just to help me find it in my large pile of test results. 
;-) Thanx, Paul [ 2126.840625] [ cut here ] [ 2126.841373] WARNING: CPU: 2 PID: 6 at /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2048 process_one_work+0x5f/0x440 [ 2126.843042] Modules linked in: [ 2126.843521] CPU: 2 PID: 6 Comm: mm_percpu_wq Not tainted 4.12.0-rc4+ #1 [ 2126.844493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 2126.845898] task: 985c5e59bc00 task.stack: b8be000fc000 [ 2126.846773] RIP: 0010:process_one_work+0x5f/0x440 [ 2126.847470] RSP: 0018:b8be000ffe60 EFLAGS: 00010002 [ 2126.848240] RAX: 985c5e595200 RBX: 985c5e56e840 RCX: [ 2126.849284] RDX: 0001 RSI: 985c5e8952c0 RDI: 985c5e56e840 [ 2126.850337] RBP: b8be000ffe90 R08: 0002 R09: [ 2126.851378] R10: b8be000ffe18 R11: R12: 985c5e899c00 [ 2126.852410] R13: 985c5e89e800 R14: R15: 985c5e8952c0 [ 2126.853440] FS: () GS:985c5e90() knlGS: [ 2126.854650] CS: 0010 DS: ES: CR0: 80050033 [ 2126.855500] CR2: 00b0 CR3: 1dcfc000 CR4: 06e0 [ 2126.856558] Call Trace: [ 2126.856944] rescuer_thread+0x207/0x350 [ 2126.857520] kthread+0x104/0x140 [ 2126.858024] ? process_one_work+0x440/0x440 [ 2126.858644] ? kthread_create_on_node+0x40/0x40 [ 2126.859343] ret_from_fork+0x22/0x30 [ 2126.859879] Code: e8 49 8b 45 08 44 8b b0 00 01 00 00 41 83 e6 20 f6 c1 04 44 89 75 d4 75 21 41 8b 54 24 04 65 44 8b 05 de 8a f9 74 41 39 d0 74 0f <0f> ff 80 3d b8 d5 ec 00 00 0f 84 92 03 00 00 48 ba eb 83 b5 80 [ 2126.862715] ---[ end trace df0aaa02bf2d28ee ]--- [ 2126.863410] XXX workfn=vmstat_update pool->cpu/flags=1/0x0 curcpu=2 online=0-2,7 active=0,2,7
Re: WARN_ON_ONCE() in process_one_work()?
On Tue, Jun 13, 2017 at 03:31:03PM -0700, Paul E. McKenney wrote: > On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote: > > Hello, Paul. > > > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote: > > > Just following up... I have hit this bug a couple of times over the > > > past few days. Anything I can do to help? > > > > My apologies for dropping the ball on this. I've gone over the hot > > plug code in workqueue several times but can't really find how this > > would happen. Can you please apply the following patch and see what > > it says when the problem happens? > > I have fired it up, thank you! > > Last time I saw one failure in 21 hours of test runs, so I have kicked > off 42 one-hour test runs. Will see what happens tomorrow morning, > Pacific Time. And none of the 42 runs resulted in a workqueue splat. I will try again this evening, Pacific Time. Who knows, maybe your diagnostic patch is the fix. ;-) Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Tue, Jun 13, 2017 at 04:58:37PM -0400, Tejun Heo wrote: > Hello, Paul. > > On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote: > > Just following up... I have hit this bug a couple of times over the > > past few days. Anything I can do to help? > > My apologies for dropping the ball on this. I've gone over the hot > plug code in workqueue several times but can't really find how this > would happen. Can you please apply the following patch and see what > it says when the problem happens? I have fired it up, thank you! Last time I saw one failure in 21 hours of test runs, so I have kicked off 42 one-hour test runs. Will see what happens tomorrow morning, Pacific Time. Thanx, Paul > Thanks. > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index c74bf39ef764..bd2ce3cbfb41 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1691,13 +1691,20 @@ static struct worker *alloc_worker(int node) > static void worker_attach_to_pool(struct worker *worker, > struct worker_pool *pool) > { > + int ret; > + > mutex_lock(&pool->attach_mutex); > > /* > * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any > * online CPUs. It'll be re-applied when any of the CPUs come up. 
> */ > - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); > + ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); > + > + WARN(ret && !(pool->flags & POOL_DISASSOCIATED), > + "set_cpus_allowed_ptr failed, ret=%d pool->cpu/flags=%d/0x%x > cpumask=%*pbl online=%*pbl active=%*pbl\n", > + ret, pool->cpu, pool->flags, cpumask_pr_args(pool->attrs->cpumask), > + cpumask_pr_args(cpu_online_mask), > cpumask_pr_args(cpu_active_mask)); > > /* > * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains > @@ -2037,8 +2044,11 @@ __acquires(&pool->lock) > lockdep_copy_map(&lockdep_map, &work->lockdep_map); > #endif > /* ensure we're on the correct CPU */ > - WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && > - raw_smp_processor_id() != pool->cpu); > + if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && > + raw_smp_processor_id() != pool->cpu)) > + printk_once("XXX workfn=%pf pool->cpu/flags=%d/0x%x curcpu=%d > online=%*pbl active=%*pbl\n", > + work->func, pool->cpu, pool->flags, > raw_smp_processor_id(), > + cpumask_pr_args(cpu_online_mask), > cpumask_pr_args(cpu_active_mask)); > > /* > * A single work shouldn't be executed concurrently by >
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul. On Fri, May 05, 2017 at 10:11:59AM -0700, Paul E. McKenney wrote: > Just following up... I have hit this bug a couple of times over the > past few days. Anything I can do to help? My apologies for dropping the ball on this. I've gone over the hot plug code in workqueue several times but can't really find how this would happen. Can you please apply the following patch and see what it says when the problem happens? Thanks. diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c74bf39ef764..bd2ce3cbfb41 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1691,13 +1691,20 @@ static struct worker *alloc_worker(int node) static void worker_attach_to_pool(struct worker *worker, struct worker_pool *pool) { + int ret; + mutex_lock(&pool->attach_mutex); /* * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any * online CPUs. It'll be re-applied when any of the CPUs come up. */ - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); + ret = set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask); + + WARN(ret && !(pool->flags & POOL_DISASSOCIATED), + "set_cpus_allowed_ptr failed, ret=%d pool->cpu/flags=%d/0x%x cpumask=%*pbl online=%*pbl active=%*pbl\n", + ret, pool->cpu, pool->flags, cpumask_pr_args(pool->attrs->cpumask), + cpumask_pr_args(cpu_online_mask), cpumask_pr_args(cpu_active_mask)); /* * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains @@ -2037,8 +2044,11 @@ __acquires(&pool->lock) lockdep_copy_map(&lockdep_map, &work->lockdep_map); #endif /* ensure we're on the correct CPU */ - WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && - raw_smp_processor_id() != pool->cpu); + if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && + raw_smp_processor_id() != pool->cpu)) + printk_once("XXX workfn=%pf pool->cpu/flags=%d/0x%x curcpu=%d online=%*pbl active=%*pbl\n", + work->func, pool->cpu, pool->flags, raw_smp_processor_id(), + cpumask_pr_args(cpu_online_mask), cpumask_pr_args(cpu_active_mask)); /* * A single work shouldn't be executed concurrently by
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, May 01, 2017 at 11:58:19AM -0700, Paul E. McKenney wrote: > On Mon, May 01, 2017 at 02:44:02PM -0400, Tejun Heo wrote: > > Hello, Paul. > > > > On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote: > > > On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote: > > > > Hello! > > > > > > > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering > > > > what I did wrong to make this happen: > > > > > > Oh, wait... Rescuer, it says. Might this be due to the fact that RCU's > > > expedited grace periods block within a workqueue handler? Might this > > > in turn run the system out of workqueue kthreads? If this is the likely > > > cause, my approach would be to rework the expected-grace-period workqueue > > > handler to return when waiting for the grace period to complete, and to > > > replace the current wakeup with a schedule_work() or something similar. > > > > That should be completely fine. It could just be that the rescuer > > path has a bug around CPU hotplug handling. Can you please confirm > > either way on the cpuset usage? > > I have no explicit cpuset usage or affinity of the workqueue handlers > themselves. > > However, this is thus far only happening in CONFIG_NO_HZ_FULL=y runs, in > this case, with the kernel boot parameter nohz_full=2-9 out of 16 CPUs. > IIRC, this sets up a "housekeeping" cpuset that pushes normal tasks away > from the nohz_full CPUs. > > I do build with CONFIG_HOTPLUG_CPU=y, and the test does a lot of > hotplugging. Also, other kthreads (but again, not the workqueue handlers) > do a lot of explicit CPU-affinity manipulation. Just following up... I have hit this bug a couple of times over the past few days. Anything I can do to help? Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, 1 May 2017 15:50:58 -0400 Tejun Heo wrote: > > Were you also using nohz_full? Do you mean, "was it compiled in" or "was I actually using it, i.e., had it on the command line"? It may have been compiled in, but I didn't do anything to activate it. -- Steve
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, May 01, 2017 at 03:42:37PM -0400, Steven Rostedt wrote: > On Mon, 1 May 2017 14:42:50 -0400 > Tejun Heo wrote: > > > Steven's involved a rescuer too. One possibility was cpuset being > > involved somehow and messing up the affinity of the rescuer kthread > > unexpectedly. Is cpuset involved in any way? > > I didn't do anything with cpuset. I did take CPUs offline though. > > I'm currently testing changes, but when it's done, I'll remove the > "WARN_ON() comment out" and see if I can reproduce it again. Were you also using nohz_full? Thanks. -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, 1 May 2017 14:42:50 -0400 Tejun Heo wrote: > Steven's involved a rescuer too. One possibility was cpuset being > involved somehow and messing up the affinity of the rescuer kthread > unexpectedly. Is cpuset involved in any way? I didn't do anything with cpuset. I did take CPUs offline though. I'm currently testing changes, but when it's done, I'll remove the "WARN_ON() comment out" and see if I can reproduce it again. -- Steve
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, May 01, 2017 at 02:44:02PM -0400, Tejun Heo wrote: > Hello, Paul. > > On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote: > > On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote: > > > Hello! > > > > > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering > > > what I did wrong to make this happen: > > > > Oh, wait... Rescuer, it says. Might this be due to the fact that RCU's > > expedited grace periods block within a workqueue handler? Might this > > in turn run the system out of workqueue kthreads? If this is the likely > > cause, my approach would be to rework the expected-grace-period workqueue > > handler to return when waiting for the grace period to complete, and to > > replace the current wakeup with a schedule_work() or something similar. > > That should be completely fine. It could just be that the rescuer > path has a bug around CPU hotplug handling. Can you please confirm > either way on the cpuset usage? I have no explicit cpuset usage or affinity of the workqueue handlers themselves. However, this is thus far only happening in CONFIG_NO_HZ_FULL=y runs, in this case, with the kernel boot parameter nohz_full=2-9 out of 16 CPUs. IIRC, this sets up a "housekeeping" cpuset that pushes normal tasks away from the nohz_full CPUs. I do build with CONFIG_HOTPLUG_CPU=y, and the test does a lot of hotplugging. Also, other kthreads (but again, not the workqueue handlers) do a lot of explicit CPU-affinity manipulation. Thanx, Paul
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul. On Mon, May 01, 2017 at 11:38:07AM -0700, Paul E. McKenney wrote: > On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote: > > Hello! > > > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering > > what I did wrong to make this happen: > > Oh, wait... Rescuer, it says. Might this be due to the fact that RCU's > expedited grace periods block within a workqueue handler? Might this > in turn run the system out of workqueue kthreads? If this is the likely > cause, my approach would be to rework the expected-grace-period workqueue > handler to return when waiting for the grace period to complete, and to > replace the current wakeup with a schedule_work() or something similar. That should be completely fine. It could just be that the rescuer path has a bug around CPU hotplug handling. Can you please confirm either way on the cpuset usage? Thanks. -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
Hello, Paul. Hmmm... Steven reported a similar issue. http://lkml.kernel.org/r/20170405151628.33df7...@gandalf.local.home On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote: > Hello! > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering > what I did wrong to make this happen: > > > > static void process_one_work(struct worker *worker, struct work_struct *work) > __releases(&pool->lock) > __acquires(&pool->lock) > { > struct pool_workqueue *pwq = get_work_pwq(work); > struct worker_pool *pool = worker->pool; > bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; > int work_color; > struct worker *collision; > #ifdef CONFIG_LOCKDEP > /* > * It is permissible to free the struct work_struct from > * inside the function that is called from it, this we need to > * take into account for lockdep too. To avoid bogus "held > * lock freed" warnings as well as problems when looking into > * work->lockdep_map, make a copy and use that here. > */ > struct lockdep_map lockdep_map; > > lockdep_copy_map(&lockdep_map, &work->lockdep_map); > #endif > /* ensure we're on the correct CPU */ > WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && > raw_smp_processor_id() != pool->cpu); > > > > Here is the splat: > > > > [12600.593006] WARNING: CPU: 0 PID: 6 at > /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 > process_one_work+0x46c/0x4d0 > [12600.593006] Modules linked in: > [12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1 > [12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > [12600.593006] Call Trace: > [12600.593006] dump_stack+0x4f/0x72 > [12600.593006] __warn+0xc6/0xe0 > [12600.593006] warn_slowpath_null+0x18/0x20 > [12600.593006] process_one_work+0x46c/0x4d0 > [12600.593006] rescuer_thread+0x20e/0x3b0 > [12600.593006] kthread+0x104/0x140 > [12600.593006] ? worker_thread+0x4e0/0x4e0 > [12600.593006] ? 
kthread_create_on_node+0x40/0x40 > [12600.593006] ret_from_fork+0x29/0x40 > > > > This happens about 3.5 hours into the TREE03 rcutorture scenario, .config > attached. Steven's involved a rescuer too. One possibility was cpuset being involved somehow and messing up the affinity of the rescuer kthread unexpectedly. Is cpuset involved in any way? Thanks. -- tejun
Re: WARN_ON_ONCE() in process_one_work()?
On Mon, May 01, 2017 at 09:57:47AM -0700, Paul E. McKenney wrote: > Hello! > > I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering > what I did wrong to make this happen: Oh, wait... Rescuer, it says. Might this be due to the fact that RCU's expedited grace periods block within a workqueue handler? Might this in turn run the system out of workqueue kthreads? If this is the likely cause, my approach would be to rework the expedited-grace-period workqueue handler to return when waiting for the grace period to complete, and to replace the current wakeup with a schedule_work() or something similar. Thoughts? Thanx, Paul > > > static void process_one_work(struct worker *worker, struct work_struct *work) > __releases(&pool->lock) > __acquires(&pool->lock) > { > struct pool_workqueue *pwq = get_work_pwq(work); > struct worker_pool *pool = worker->pool; > bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE; > int work_color; > struct worker *collision; > #ifdef CONFIG_LOCKDEP > /* > * It is permissible to free the struct work_struct from > * inside the function that is called from it, this we need to > * take into account for lockdep too. To avoid bogus "held > * lock freed" warnings as well as problems when looking into > * work->lockdep_map, make a copy and use that here. 
> */ > struct lockdep_map lockdep_map; > > lockdep_copy_map(&lockdep_map, &work->lockdep_map); > #endif > /* ensure we're on the correct CPU */ > WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && > raw_smp_processor_id() != pool->cpu); > > > > Here is the splat: > > > > [12600.593006] WARNING: CPU: 0 PID: 6 at > /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 > process_one_work+0x46c/0x4d0 > [12600.593006] Modules linked in: > [12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1 > [12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > [12600.593006] Call Trace: > [12600.593006] dump_stack+0x4f/0x72 > [12600.593006] __warn+0xc6/0xe0 > [12600.593006] warn_slowpath_null+0x18/0x20 > [12600.593006] process_one_work+0x46c/0x4d0 > [12600.593006] rescuer_thread+0x20e/0x3b0 > [12600.593006] kthread+0x104/0x140 > [12600.593006] ? worker_thread+0x4e0/0x4e0 > [12600.593006] ? kthread_create_on_node+0x40/0x40 > [12600.593006] ret_from_fork+0x29/0x40 > > > > This happens about 3.5 hours into the TREE03 rcutorture scenario, .config > attached. > > Enlightenment? > > Thanx, Paul > # > # Automatically generated file; DO NOT EDIT. 
> # Linux/x86 4.11.0-rc7 Kernel Configuration
> #
> CONFIG_64BIT=y
> CONFIG_X86_64=y
> CONFIG_X86=y
> CONFIG_INSTRUCTION_DECODER=y
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_MMU=y
> CONFIG_ARCH_MMAP_RND_BITS_MIN=28
> CONFIG_ARCH_MMAP_RND_BITS_MAX=32
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
> CONFIG_NEED_DMA_MAP_STATE=y
> CONFIG_NEED_SG_DMA_LENGTH=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_GENERIC_BUG=y
> CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
> CONFIG_GENERIC_HWEIGHT=y
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_GENERIC_CALIBRATE_DELAY=y
> CONFIG_ARCH_HAS_CPU_RELAX=y
> CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
> CONFIG_HAVE_SETUP_PER_CPU_AREA=y
> CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
> CONFIG_ARCH_HIBERNATION_POSSIBLE=y
> CONFIG_ARCH_SUSPEND_POSSIBLE=y
> CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
> CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
> CONFIG_ZONE_DMA32=y
> CONFIG_AUDIT_ARCH=y
> CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_HAVE_INTEL_TXT=y
> CONFIG_X86_64_SMP=y
> CONFIG_ARCH_SUPPORTS_UPROBES=y
> CONFIG_FIX_EARLYCON_MEM=y
> CONFIG_PGTABLE_LEVELS=4
> CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
> CONFIG_IRQ_WORK=y
> CONFIG_BUILDTIME_EXTABLE_SORT=y
> CONFIG_THREAD_INFO_IN_TASK=y
>
> #
> # General setup
> #
> C
WARN_ON_ONCE() in process_one_work()?
Hello!

I am hitting this WARN_ON_ONCE() in process_one_work() and am wondering
what I did wrong to make this happen:

static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(pool->lock)
__acquires(pool->lock)
{
	struct pool_workqueue *pwq = get_work_pwq(work);
	struct worker_pool *pool = worker->pool;
	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
	int work_color;
	struct worker *collision;
#ifdef CONFIG_LOCKDEP
	/*
	 * It is permissible to free the struct work_struct from
	 * inside the function that is called from it, this we need to
	 * take into account for lockdep too.  To avoid bogus "held
	 * lock freed" warnings as well as problems when looking into
	 * work->lockdep_map, make a copy and use that here.
	 */
	struct lockdep_map lockdep_map;

	lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
	/* ensure we're on the correct CPU */
	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
		     raw_smp_processor_id() != pool->cpu);

Here is the splat:

[12600.593006] WARNING: CPU: 0 PID: 6 at /home/paulmck/public_git/linux-rcu/kernel/workqueue.c:2041 process_one_work+0x46c/0x4d0
[12600.593006] Modules linked in:
[12600.593006] CPU: 0 PID: 6 Comm: mm_percpu_wq Not tainted 4.11.0-rc7+ #1
[12600.593006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[12600.593006] Call Trace:
[12600.593006]  dump_stack+0x4f/0x72
[12600.593006]  __warn+0xc6/0xe0
[12600.593006]  warn_slowpath_null+0x18/0x20
[12600.593006]  process_one_work+0x46c/0x4d0
[12600.593006]  rescuer_thread+0x20e/0x3b0
[12600.593006]  kthread+0x104/0x140
[12600.593006]  ? worker_thread+0x4e0/0x4e0
[12600.593006]  ? kthread_create_on_node+0x40/0x40
[12600.593006]  ret_from_fork+0x29/0x40

This happens about 3.5 hours into the TREE03 rcutorture scenario, .config
attached.

Enlightenment?

							Thanx, Paul

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.11.0-rc7 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_S