Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-03 Thread Sodagudi Prasad

On 2018-08-03 04:41, Thomas Gleixner wrote:

> Prasad.
>
> On Thu, 2 Aug 2018, Peter Zijlstra wrote:
>>
>> So why didn't you do the 'obvious' parallel to what you did for
>> cpu_stop_queue_two_works(), namely:
>
> Is that patch fixing the issue for you?

Hi Thomas and Peter,

Yes. I tested both versions of the patches, and both variants work on
Qualcomm devices under stress testing of setaffinity and task
cross-migration, which previously led to the deadlock.


-Thanks, Prasad



>> --- a/kernel/stop_machine.c
>> +++ b/kernel/stop_machine.c
>> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	unsigned long flags;
>>  	bool enabled;
>>  
>> +	preempt_disable();
>>  	raw_spin_lock_irqsave(&stopper->lock, flags);
>>  	enabled = stopper->enabled;
>>  	if (enabled)
>> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>>  	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>>  
>>  	wake_up_q();
>> +	preempt_enable();
>>  
>>  	return enabled;
>>  }



--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
Linux Foundation Collaborative Project


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-03 Thread Thomas Gleixner
Prasad.

On Thu, 2 Aug 2018, Peter Zijlstra wrote:
> 
> So why didn't you do the 'obvious' parallel to what you did for
> cpu_stop_queue_two_works(), namely:

Is that patch fixing the issue for you?

> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
>   unsigned long flags;
>   bool enabled;
>  
> + preempt_disable();
>   raw_spin_lock_irqsave(&stopper->lock, flags);
>   enabled = stopper->enabled;
>   if (enabled)
> @@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
>   raw_spin_unlock_irqrestore(&stopper->lock, flags);
>  
>   wake_up_q();
> + preempt_enable();
>  
>   return enabled;
>  }
> 


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-02 Thread Peter Zijlstra
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> Due to cross migration of tasks between cpu7 and cpu3, migration/7 has
> started executing and waits for the migration/3 task, so that they can
> proceed within the multi cpu stop state machine together.
> Unfortunately stress-ng-affin is affine to cpu7, and since migration 7 has
> started running, and has monopolized cpu7’s execution, stress-ng will never
> run on cpu7, and cpu3’s migration task is never woken up.

> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
>  		__cpu_stop_queue_work(stopper, work, );
>  	else if (work->done)
>  		cpu_stop_signal_done(work->done);
> -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>  
>  	wake_up_q();
> +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 

So why didn't you do the 'obvious' parallel to what you did for
cpu_stop_queue_two_works(), namely:

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -81,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned
 	unsigned long flags;
 	bool enabled;
 
+	preempt_disable();
 	raw_spin_lock_irqsave(&stopper->lock, flags);
 	enabled = stopper->enabled;
 	if (enabled)
@@ -90,6 +91,7 @@ static bool cpu_stop_queue_work(unsigned
 	raw_spin_unlock_irqrestore(&stopper->lock, flags);
 
 	wake_up_q();
+	preempt_enable();
 
 	return enabled;
 }


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-02 Thread Peter Zijlstra
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> the Linux-4.14.56 kernel.

Can you also please run on something recent...


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-02 Thread Mike Galbraith
On Thu, 2018-08-02 at 10:12 +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index e190d1e..f932e1e 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
> >  		__cpu_stop_queue_work(stopper, work, );
> >  	else if (work->done)
> >  		cpu_stop_signal_done(work->done);
> > -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> >  
> >  	wake_up_q();
> > +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> > 
> 
> That puts the wakeup back under stopper lock, which causes another
> deadlock iirc.

Yup, one you fixed.

0b26351b910fb (Peter Zijlstra 2018-04-20 11:50:05 +0200 92) 	wake_up_q();


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-02 Thread Peter Zijlstra
On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index e190d1e..f932e1e 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
>  		__cpu_stop_queue_work(stopper, work, );
>  	else if (work->done)
>  		cpu_stop_signal_done(work->done);
> -	raw_spin_unlock_irqrestore(&stopper->lock, flags);
>  
>  	wake_up_q();
> +	raw_spin_unlock_irqrestore(&stopper->lock, flags);
> 

That puts the wakeup back under stopper lock, which causes another
deadlock iirc.


cpu stopper threads and setaffinity leads to deadlock

2018-08-01 Thread Sodagudi Prasad

Hi Peter and Tglx,

We are observing another deadlock caused by commit 0b26351b91
("stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock"),
even after applying the following fix
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1740526.html
on the Linux-4.14.56 kernel.


Here is the scenario that leads to this deadlock.
We used the stress-ng-64 --affinity test case to reproduce this issue in a
controlled environment while simultaneously running CPU hotplug and task
migrations.


Stress-ng-affin (call stack shown below) is changing its own affinity from
cpu3 to cpu7. It is preempted in cpu_stop_queue_work() as soon as the
stopper lock for migration/3 is released, before it has called
wake_up_q(). At the same time, on CPU 7, tasks are being cross-migrated
between cpu3 and cpu7.


===
Process: stress-ng-affin, cpu: 3 pid: 1748 start: 0xffd8817e4480
=
Task name: stress-ng-affin pid: 1748 cpu: 3 start: ffd8817e4480
state: 0x0 exit_state: 0x0 stack base: 0xff801c8e8000 Prio: 120
Stack:
[] __switch_to+0xb8
[] __schedule+0x690
[] preempt_schedule_common+0x100
[] preempt_schedule+0x24
[] _raw_spin_unlock_irqrestore+0x64
[] cpu_stop_queue_work+0x9c
[] stop_one_cpu+0x58
[] __set_cpus_allowed_ptr+0x234
[] sched_setaffinity+0x150
[] SyS_sched_setaffinity+0xcc
[] el0_svc_naked+0x34
[<0>] UNKNOWN+0x0

Because of the cross-migration of tasks between cpu7 and cpu3, migration/7
has started executing and waits for the migration/3 task so that both can
proceed through the multi-CPU stop state machine together.
Unfortunately, stress-ng-affin is affine to cpu7, and since migration/7 has
started running and monopolizes cpu7, stress-ng-affin never runs on cpu7
again. It therefore never reaches its pending wake_up_q(), and cpu3's
migration task is never woken up.


Essentially:
Due to the nature of the wake_q interface, a thread can be on at most one
wake queue at a time. migration/3 is currently on stress-ng-affin’s wake_q,
which means no other thread can add migration/3 to its own wake queue.
Thus, whatever attempt is made to stop CPU 3 (e.g. cross-migration,
hotplug, etc.), no other thread can wake up migration/3.


The change below fixes this deadlock in our testing:
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index e190d1e..f932e1e 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 		__cpu_stop_queue_work(stopper, work, );
 	else if (work->done)
 		cpu_stop_signal_done(work->done);
-	raw_spin_unlock_irqrestore(&stopper->lock, flags);
 
 	wake_up_q();
+	raw_spin_unlock_irqrestore(&stopper->lock, flags);


-Thanks, Prasad


