Re: cpu stopper threads and setaffinity leads to deadlock
On 2018-08-03 04:41, Thomas Gleixner wrote:
> Prasad.
>
> On Thu, 2 Aug 2018, Peter Zijlstra wrote:
>>
>> So why didn't you do the 'obvious' parallel to what you did for
>> cpu_stop_queue_two_works(), namely:
>
> Is that patch fixing the issue for you?

Hi Thomas and Peter,

Yes. Tested both versions of patches and both variants are working on
Qualcomm devices with stress testing of set affinity and tasks
cross-migration, which were previously leading to the deadlock.

-Thanks, Prasad

>> --- a/kernel/stop_machine.c
>> +++ b/kernel/stop_machine.c
>> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	unsigned long flags;
>>  	bool enabled;
>>
>> +	preempt_disable();
>>  	raw_spin_lock_irqsave(&stopper->lock, flags);
>>  	enabled = stopper->enabled;
>>  	if (enabled)
>> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>>
>>  	wake_up_q(&wakeq);
>> +	preempt_enable();
>>
>>  	return enabled;
>>  }

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project

Re: cpu stopper threads and setaffinity leads to deadlock
Prasad.

On Thu, 2 Aug 2018, Peter Zijlstra wrote:
>
> So why didn't you do the 'obvious' parallel to what you did for
> cpu_stop_queue_two_works(), namely:

Is that patch fixing the issue for you?

> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>  	unsigned long flags;
>  	bool enabled;
>
> +	preempt_disable();
>  	raw_spin_lock_irqsave(&stopper->lock, flags);
>  	enabled = stopper->enabled;
>  	if (enabled)
> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>  	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>
>  	wake_up_q(&wakeq);
> +	preempt_enable();
>
>  	return enabled;
>  }
>

Re: cpu stopper threads and setaffinity leads to deadlock
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> Due to cross migration of tasks between cpu7 and cpu3, migration/7 has
> started executing and waits for the migration/3 task, so that they can
> proceed within the multi cpu stop state machine together.
> Unfortunately stress-ng-affin is affine to cpu7, and since migration 7 has
> started running, and has monopolized cpu7’s execution, stress-ng will never
> run on cpu7, and cpu3’s migration task is never woken up.

> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct
> cpu_stop_work *work)
>  		__cpu_stop_queue_work(stopper, work, &wakeq);
>  	else if (work->done)
>  		cpu_stop_signal_done(work->done);
> -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>
>  	wake_up_q(&wakeq);
> +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>

So why didn't you do the 'obvious' parallel to what you did for
cpu_stop_queue_two_works(), namely:

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
 	unsigned long flags;
 	bool enabled;

+	preempt_disable();
 	raw_spin_lock_irqsave(&stopper->lock, flags);
 	enabled = stopper->enabled;
 	if (enabled)
@@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
 	raw_spin_unlock_irqrestore(&stopper->lock, flags);

 	wake_up_q(&wakeq);
+	preempt_enable();

 	return enabled;
 }

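For readers comparing the orderings discussed in this thread, the following is a minimal userspace sketch, not kernel code. It assumes a simplified model in which taking the stopper lock raises a preempt count and releasing it lowers the count again, which is what makes the unlock a preemption point in the reported stack trace. All `model_*` names, `enum variant`, and `struct wake_context` are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* The three orderings of unlock and wakeup that appear in this thread. */
enum variant {
    VARIANT_ORIGINAL,        /* unlock, then wake; no outer preempt_disable() */
    VARIANT_WAKE_UNDER_LOCK, /* wake before unlocking the stopper lock */
    VARIANT_PREEMPT_DISABLE  /* unlock, then wake, inside preempt_disable()/enable() */
};

/* What the model observes at the moment the wakeup is issued. */
struct wake_context {
    bool under_stopper_lock; /* the stopper lock was still held */
    bool preemptible;        /* the waker could have been preempted before waking */
};

/* Hypothetical stand-in for the state the real primitives manipulate. */
struct model_state {
    int preempt_count; /* 0 means the current task may be preempted */
    bool lock_held;    /* models stopper->lock */
};

/* Models raw_spin_lock_irqsave(): holding the lock implies no preemption. */
static void model_lock(struct model_state *s)
{
    s->preempt_count++;
    s->lock_held = true;
}

/* Models raw_spin_unlock_irqrestore(): dropping the lock may allow preemption. */
static void model_unlock(struct model_state *s)
{
    s->lock_held = false;
    s->preempt_count--;
}

/* Models wake_up_q(): records the context in which the wakeup happens. */
static struct wake_context model_wake(const struct model_state *s)
{
    struct wake_context ctx = { s->lock_held, s->preempt_count == 0 };
    return ctx;
}

/* Walks through one variant of the queueing function and reports the wake context. */
struct wake_context model_queue_work(enum variant v)
{
    struct model_state s = { 0, false };
    struct wake_context ctx;

    if (v == VARIANT_PREEMPT_DISABLE)
        s.preempt_count++;    /* preempt_disable() */

    model_lock(&s);
    /* ... the work is queued and the stopper thread is put on a local wake queue ... */
    if (v == VARIANT_WAKE_UNDER_LOCK) {
        ctx = model_wake(&s); /* wakeup issued with the lock held */
        model_unlock(&s);
    } else {
        model_unlock(&s);
        ctx = model_wake(&s); /* wakeup issued after the lock is dropped */
    }

    if (v == VARIANT_PREEMPT_DISABLE)
        s.preempt_count--;    /* preempt_enable() */

    return ctx;
}
```

In this model only `VARIANT_PREEMPT_DISABLE` issues the wakeup both outside the stopper lock and with preemption off. `VARIANT_ORIGINAL` leaves the window in which the waker can be preempted, and `VARIANT_WAKE_UNDER_LOCK` moves the wakeup back under the lock, which another reply in this thread says causes a different deadlock.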
Re: cpu stopper threads and setaffinity leads to deadlock
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> the Linux-4.14.56 kernel.

Can you also please run on something recent...

Re: cpu stopper threads and setaffinity leads to deadlock
On Thu, 2018-08-02 at 10:12 +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index e190d1e..f932e1e 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct
> > cpu_stop_work *work)
> >  		__cpu_stop_queue_work(stopper, work, &wakeq);
> >  	else if (work->done)
> >  		cpu_stop_signal_done(work->done);
> > -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> >
> >  	wake_up_q(&wakeq);
> > +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> >
>
> That puts the wakeup back under stopper lock, which causes another
> deadlock iirc.

Yup, one you fixed.

0b26351b910fb (Peter Zijlstra 2018-04-20 11:50:05 +0200 92) 	wake_up_q(&wakeq);

Re: cpu stopper threads and setaffinity leads to deadlock
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct
> cpu_stop_work *work)
>  		__cpu_stop_queue_work(stopper, work, &wakeq);
>  	else if (work->done)
>  		cpu_stop_signal_done(work->done);
> -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>
>  	wake_up_q(&wakeq);
> +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>

That puts the wakeup back under stopper lock, which causes another
deadlock iirc.

cpu stopper threads and setaffinity leads to deadlock
Hi Peter and Tglx,

We are observing another deadlock issue due to commit
0b26351b91 (stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock),
even after taking the following fix
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1740526.html
on the Linux-4.14.56 kernel.

Here is the scenario that leads to this deadlock.
We have used the stress-ng-64 --affinity test case to reproduce this issue
in a controlled environment, while simultaneously running CPU hot plug and
task migrations.

Stress-ng-affin (call stack shown below) is changing its own affinity from
cpu3 to cpu7. Stress-ng-affin is preempted in the cpu_stop_queue_work()
function as soon as the stopper lock for migration/3 is released. At the
same time, on CPU 7, cross migration of tasks happens between cpu3 and cpu7.

===
Process: stress-ng-affin, cpu: 3 pid: 1748 start: 0xffd8817e4480
=
    Task name: stress-ng-affin pid: 1748 cpu: 3 start: ffd8817e4480
    state: 0x0 exit_state: 0x0 stack base: 0xff801c8e8000 Prio: 120
    Stack:
    [] __switch_to+0xb8
    [] __schedule+0x690
    [] preempt_schedule_common+0x100
    [] preempt_schedule+0x24
    [] _raw_spin_unlock_irqrestore+0x64
    [] cpu_stop_queue_work+0x9c
    [] stop_one_cpu+0x58
    [] __set_cpus_allowed_ptr+0x234
    [] sched_setaffinity+0x150
    [] SyS_sched_setaffinity+0xcc
    [] el0_svc_naked+0x34
    [<0>] UNKNOWN+0x0

Due to cross migration of tasks between cpu7 and cpu3, migration/7 has
started executing and waits for the migration/3 task, so that they can
proceed within the multi cpu stop state machine together.
Unfortunately stress-ng-affin is affine to cpu7, and since migration 7 has
started running, and has monopolized cpu7’s execution, stress-ng will never
run on cpu7, and cpu3’s migration task is never woken up.

Essentially:
Due to the nature of the wake_q interface, a thread can only be in at most
one wake queue at a time.
migration/3 is currently in stress-ng-affin’s wake_q.
This means that no other thread can add migration/3 to their wake queue.
Thus, even if any attempt is made to stop CPU 3 (e.g. cross-migration, hot
plugging, etc), no thread will wake up migration/3.

Below change helped to fix this deadlock.

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index e190d1e..f932e1e 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 		__cpu_stop_queue_work(stopper, work, &wakeq);
 	else if (work->done)
 		cpu_stop_signal_done(work->done);
-	raw_spin_unlock_irqrestore(&stopper->lock, flags);

 	wake_up_q(&wakeq);
+	raw_spin_unlock_irqrestore(&stopper->lock, flags);

-Thanks, Prasad

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project
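The "at most one wake queue at a time" property described in the report can be illustrated with a small userspace sketch. It is loosely modeled on how a wake queue links tasks through a single per-task pointer, but it is single-threaded and simplified. Every `model_*` name and `MODEL_WAKE_Q_TAIL` is hypothetical, and the real kernel code uses an atomic compare-and-exchange where this sketch uses a plain check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; only the wake-queue link matters here. */
struct model_task {
    const char *name;
    struct model_task *wake_next; /* NULL means "not on any wake queue" */
};

/* A waker's private list of tasks to wake once it has dropped its locks. */
struct model_wake_q {
    struct model_task *first;
    struct model_task **lastp;
};

/* Non-NULL end-of-list marker, so a queued task never has wake_next == NULL. */
#define MODEL_WAKE_Q_TAIL ((struct model_task *)0x1)

static void model_wake_q_init(struct model_wake_q *q)
{
    q->first = MODEL_WAKE_Q_TAIL;
    q->lastp = &q->first;
}

/* Queue a task for a later wakeup; returns false if it is already queued somewhere. */
static bool model_wake_q_add(struct model_wake_q *q, struct model_task *t)
{
    if (t->wake_next != NULL)
        return false; /* the caller has to rely on the other waker's wakeup */
    t->wake_next = MODEL_WAKE_Q_TAIL;
    *q->lastp = t;
    q->lastp = &t->wake_next;
    return true;
}

/* "Wake" every queued task and release it from the queue; returns how many were woken. */
static int model_wake_up_q(struct model_wake_q *q)
{
    int woken = 0;
    struct model_task *t = q->first;

    while (t != MODEL_WAKE_Q_TAIL) {
        struct model_task *next = t->wake_next;

        t->wake_next = NULL; /* the task may now be queued again */
        woken++;
        t = next;
    }
    model_wake_q_init(q);
    return woken;
}

/* Replays the reported sequence; returns the wakeups delivered while the first waker is stalled. */
int model_stalled_waker_wakeups(void)
{
    struct model_task migration3 = { "migration/3", NULL };
    struct model_wake_q affin_q, other_q;

    model_wake_q_init(&affin_q);
    model_wake_q_init(&other_q);

    model_wake_q_add(&affin_q, &migration3); /* stress-ng-affin queues migration/3 ... */
    /* ... and is preempted here, before it flushes its own wake queue. */

    model_wake_q_add(&other_q, &migration3); /* a second waker is refused */
    return model_wake_up_q(&other_q);        /* so its own flush wakes nobody */
}
```

Under these assumptions, as long as the first waker never gets to flush its queue, nobody else can wake migration/3, which matches the behaviour described above.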