Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steve,

On Fri, Jan 19, 2018 at 02:51:15PM -0500, Steven Rostedt wrote:
> On Sat, 20 Jan 2018 00:27:56 +0530
> Pavan Kondeti  wrote:
> 
> > Hi Steve,
> > 
> > Thanks for the patch.
> > 
> > On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > > On Fri, 19 Jan 2018 13:11:21 -0500
> > > Steven Rostedt  wrote:
> > >   
> > > >  void rto_push_irq_work_func(struct irq_work *work)
> > > >  {
> > > > +   struct root_domain *rd =
> > > > +   container_of(work, struct root_domain, rto_push_work);
> > > > struct rq *rq;  
> > > 
> > > Notice that I also remove the dependency on rq from getting the rd.
> > >   
> > 
> > Nice. This snippet itself solves the original problem I reported.
> > I will test your patch and let you know the results.
> > 
> >
> 
> I'll break the patch up into two then. One with this snippet, and the
> other with the rd ref counting.
> 

Yeah, this snippet fixed the original problem.

I have not seen a "use after free" of rd in my testing, but I could see that
we were operating on an rd whose refcount was 0. After applying your
refcount patch, that never happened. I also verified that we free the
rd from the IRQ work when it drops the last reference.

Thanks for your help with the patches. Please copy linux-stable for
the 1st patch.

Feel free to use
Tested-by: Pavankumar Kondeti 

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Steven Rostedt
On Sat, 20 Jan 2018 00:27:56 +0530
Pavan Kondeti  wrote:

> Hi Steve,
> 
> Thanks for the patch.
> 
> On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > On Fri, 19 Jan 2018 13:11:21 -0500
> > Steven Rostedt  wrote:
> >   
> > >  void rto_push_irq_work_func(struct irq_work *work)
> > >  {
> > > + struct root_domain *rd =
> > > + container_of(work, struct root_domain, rto_push_work);
> > >   struct rq *rq;  
> > 
> > Notice that I also remove the dependency on rq from getting the rd.
> >   
> 
> Nice. This snippet itself solves the original problem I reported.
> I will test your patch and let you know the results.
> 
>

I'll break the patch up into two then. One with this snippet, and the
other with the rd ref counting.

-- Steve


Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steve,

Thanks for the patch.

On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 13:11:21 -0500
> Steven Rostedt  wrote:
> 
> >  void rto_push_irq_work_func(struct irq_work *work)
> >  {
> > +   struct root_domain *rd =
> > +   container_of(work, struct root_domain, rto_push_work);
> > struct rq *rq;
> 
> Notice that I also remove the dependency on rq from getting the rd.
> 

Nice. This snippet itself solves the original problem I reported.
I will test your patch and let you know the results.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
On Fri, Jan 19, 2018 at 01:11:21PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 23:16:17 +0530
> Pavan Kondeti  wrote:
> 
> > I am thinking of another problem because of the race between
> > rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
> > 
> > Let's say we cache rq->rd here and queue the IRQ work on a remote
> > CPU. In the meantime, rq_attach_root() might drop all the references
> > to this cached (old) rd and want to free it. The rq->rd is freed in an
> > RCU-sched callback. If that remote CPU is in an RCU quiescent state, the
> > rq->rd can get freed before the IRQ work is executed. This results in the
> > corruption of the remote CPU's IRQ work list. Right?
> > 
> > Taking rq->lock in rto_push_irq_work_func() also does not help here.
> > Probably we have to wait for the IRQ work to finish before freeing the
> > older root domain in the RCU-sched callback.
> 
> I was wondering about this too. Yeah, it would require an RCU-like
> update. Once the rd was unreferenced, it would need to wait for the
> irq works to finish before freeing it.
> 
> The easy way to do this is to simply up the refcount when sending the
> domain. Something like this:
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513adca3..89a086ed2b16 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
>   * the rt_loop_next will cause the iterator to perform another scan.
>   *
>   */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
>  {
> - struct root_domain *rd = rq->rd;
>   int next;
>   int cpu;
>  
> @@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
>* Otherwise it is finishing up and an ipi needs to be sent.
>*/
>   if (rq->rd->rto_cpu < 0)
> - cpu = rto_next_cpu(rq);
> + cpu = rto_next_cpu(rq->rd);
>  
> 	raw_spin_unlock(&rq->rd->rto_lock);
> 
> 	rto_start_unlock(&rq->rd->rto_loop_start);
> 
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
> +		/* Make sure the rd does not get freed while pushing */
> +		sched_get_rd(rq->rd);
> 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	}
>  }

Since this is covered by rq->lock, it is guaranteed that we increment the
refcount on the older rd before the RCU-sched callback is queued in
rq_attach_root(). Either we keep the older rd alive or we use the updated rd.

We are good here, I think.
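
For reference, the ordering in rq_attach_root() that this argument relies on
looks roughly like this (an abridged sketch from memory of
kernel/sched/topology.c around that time, not the exact source; the cpumask
and online/offline bookkeeping is trimmed):

void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
	struct root_domain *old_rd = NULL;
	unsigned long flags;

	raw_spin_lock_irqsave(&rq->lock, flags);

	if (rq->rd) {
		old_rd = rq->rd;
		/* Keep old_rd alive if someone else still holds a reference */
		if (!atomic_dec_and_test(&old_rd->refcount))
			old_rd = NULL;
	}

	atomic_inc(&rd->refcount);
	rq->rd = rd;

	raw_spin_unlock_irqrestore(&rq->lock, flags);

	/* The RCU-sched free is only queued once the last reference is gone */
	if (old_rd)
		call_rcu_sched(&old_rd->rcu, free_rootdomain);
}

Both the rq->rd swap and the refcount drop happen under rq->lock, which is
what the ordering argument above depends on.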

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Steven Rostedt
On Fri, 19 Jan 2018 13:11:21 -0500
Steven Rostedt  wrote:

>  void rto_push_irq_work_func(struct irq_work *work)
>  {
> + struct root_domain *rd =
> + container_of(work, struct root_domain, rto_push_work);
>   struct rq *rq;

Notice that I also remove the dependency on rq from getting the rd.

-- Steve


Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Steven Rostedt
On Fri, 19 Jan 2018 23:16:17 +0530
Pavan Kondeti  wrote:

> I am thinking of another problem because of the race between
> rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
> 
> Let's say we cache rq->rd here and queue the IRQ work on a remote
> CPU. In the meantime, rq_attach_root() might drop all the references
> to this cached (old) rd and want to free it. The rq->rd is freed in an
> RCU-sched callback. If that remote CPU is in an RCU quiescent state, the
> rq->rd can get freed before the IRQ work is executed. This results in the
> corruption of the remote CPU's IRQ work list. Right?
> 
> Taking rq->lock in rto_push_irq_work_func() also does not help here.
> Probably we have to wait for the IRQ work to finish before freeing the
> older root domain in the RCU-sched callback.

I was wondering about this too. Yeah, it would require an RCU-like
update. Once the rd was unreferenced, it would need to wait for the
irq works to finish before freeing it.

The easy way to do this is to simply up the refcount when sending the
domain. Something like this:

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..89a086ed2b16 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-   struct root_domain *rd = rq->rd;
int next;
int cpu;
 
@@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
 * Otherwise it is finishing up and an ipi needs to be sent.
 */
if (rq->rd->rto_cpu < 0)
-   cpu = rto_next_cpu(rq);
+   cpu = rto_next_cpu(rq->rd);
 
	raw_spin_unlock(&rq->rd->rto_lock);
 
	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+   struct root_domain *rd =
+   container_of(work, struct root_domain, rto_push_work);
struct rq *rq;
int cpu;
 
@@ -2013,18 +2017,20 @@ void rto_push_irq_work_func(struct irq_work *work)
		raw_spin_unlock(&rq->lock);
	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
		return;
+	}
 
	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..fb5fc458547f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..519b024f4e94 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
	call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

-- Steve


Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti  wrote:
> 
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change between
> > the spinlock being acquired and released in the rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported a couple of
> > times during long stress testing. The issue can be easily reproduced if
> > an artificial delay is introduced between the lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The patch below solved the problem. We are already
> > taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the
> > same here. Please let me know your thoughts on this.
> 
> Ah, so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > 		raw_spin_unlock(&rq->lock);
> > 	}
> > 
> > +	raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
>   struct root_domain *rd;
> 
>   rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

I am thinking of another problem because of the race between
rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.

Let's say we cache rq->rd here and queue the IRQ work on a remote
CPU. In the meantime, rq_attach_root() might drop all the references
to this cached (old) rd and want to free it. The rq->rd is freed in an
RCU-sched callback. If that remote CPU is in an RCU quiescent state, the
rq->rd can get freed before the IRQ work is executed. This results in the
corruption of the remote CPU's IRQ work list. Right?

Taking rq->lock in rto_push_irq_work_func() also does not help here.
Probably we have to wait for the IRQ work to finish before freeing the
older root domain in the RCU-sched callback.
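
For context, the RCU-sched callback frees the whole root_domain, including
the rto_push_work item embedded in it, which is why a pending IRQ work on a
remote CPU's list would be left pointing into freed memory. Roughly (a sketch
of free_rootdomain() from memory, not the exact source):

static void free_rootdomain(struct rcu_head *rcu)
{
	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);

	cpupri_cleanup(&rd->cpupri);
	cpudl_cleanup(&rd->cpudl);
	free_cpumask_var(rd->dlo_mask);
	free_cpumask_var(rd->rto_mask);
	free_cpumask_var(rd->online);
	free_cpumask_var(rd->span);
	kfree(rd);	/* rd->rto_push_work goes away with the rest of rd */
}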

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Steven Rostedt
On Fri, 19 Jan 2018 21:14:30 +0530
Pavan Kondeti  wrote:


> Yeah, it should work. I will give it a try and send the patch
> for review.

Add a comment above the assignment saying something to this effect:


/* Due to CPU hotplug, rq->rd could possibly change */
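
That is, something along these lines in rto_push_irq_work_func() (just a
sketch of the approach being discussed here, assuming rto_next_cpu() is also
changed to take the root domain; the actual patch may end up different):

void rto_push_irq_work_func(struct irq_work *work)
{
	struct root_domain *rd;
	struct rq *rq = this_rq();
	int cpu;

	/* Due to CPU hotplug, rq->rd could possibly change */
	rd = READ_ONCE(rq->rd);

	if (has_pushable_tasks(rq)) {
		raw_spin_lock(&rq->lock);
		push_rt_tasks(rq);
		raw_spin_unlock(&rq->lock);
	}

	raw_spin_lock(&rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rd);

	raw_spin_unlock(&rd->rto_lock);

	if (cpu < 0)
		return;

	/* Try the next RT overloaded CPU */
	irq_work_queue_on(&rd->rto_push_work, cpu);
}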

-- Steve



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steven,

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti  wrote:
> 
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change between
> > the spinlock being acquired and released in the rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported a couple of
> > times during long stress testing. The issue can be easily reproduced if
> > an artificial delay is introduced between the lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The patch below solved the problem. We are already
> > taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the
> > same here. Please let me know your thoughts on this.
> 
> Ah, so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > 		raw_spin_unlock(&rq->lock);
> > 	}
> > 
> > +	raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
>   struct root_domain *rd;
> 
>   rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

Yeah, it should work. I will give it a try and send the patch
for review.

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Steven Rostedt
On Fri, 19 Jan 2018 14:53:05 +0530
Pavan Kondeti  wrote:

> I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> stable kernel based system. This issue is observed only after the
> inclusion of this patch. It appears to me that rq->rd can change between
> the spinlock being acquired and released in the rto_push_irq_work_func()
> IRQ work if hotplug is in progress. It was only reported a couple of
> times during long stress testing. The issue can be easily reproduced if
> an artificial delay is introduced between the lock and unlock of
> rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> race with rq->lock. The patch below solved the problem. We are already
> taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the
> same here. Please let me know your thoughts on this.

Ah, so rq->rd can change. Interesting.

> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d863d39..478192b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> 		raw_spin_unlock(&rq->lock);
> 	}
> 
> +	raw_spin_lock(&rq->lock);


What about just saving the rd then?

struct root_domain *rd;

rd = READ_ONCE(rq->rd);

then use that. Then we don't need to worry about it changing.

-- Steve


> 	raw_spin_lock(&rq->rd->rto_lock);
> 
> 	/* Pass the IPI to the next rt overloaded queue */
> @@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
> 
> 	raw_spin_unlock(&rq->rd->rto_lock);
> 
> -	if (cpu < 0)
> -		return;
> -
> 	/* Try the next RT overloaded CPU */
> -	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	if (cpu >= 0)
> +		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	raw_spin_unlock(&rq->lock);
>  }
>  #endif /* HAVE_RT_PUSH_IPI */
> 
> 
> Thanks,
> Pavan
> 



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steven,

>  /* Called from hardirq context */
> -static void try_to_push_tasks(void *arg)
> +void rto_push_irq_work_func(struct irq_work *work)
>  {
> -   struct rt_rq *rt_rq = arg;
> -   struct rq *rq, *src_rq;
> -   int this_cpu;
> +   struct rq *rq;
> int cpu;
>
> -   this_cpu = rt_rq->push_cpu;
> +   rq = this_rq();
>
> -   /* Paranoid check */
> -   BUG_ON(this_cpu != smp_processor_id());
> -
> -   rq = cpu_rq(this_cpu);
> -   src_rq = rq_of_rt_rq(rt_rq);
> -
> -again:
> +   /*
> +* We do not need to grab the lock to check for has_pushable_tasks.
> +* When it gets updated, a check is made if a push is possible.
> +*/
> if (has_pushable_tasks(rq)) {
> 		raw_spin_lock(&rq->lock);
> -		push_rt_task(rq);
> +		push_rt_tasks(rq);
> 		raw_spin_unlock(&rq->lock);
> 	}
>
> -	/* Pass the IPI to the next rt overloaded queue */
> -	raw_spin_lock(&rt_rq->push_lock);
> -	/*
> -	 * If the source queue changed since the IPI went out,
> -	 * we need to restart the search from that CPU again.
> -	 */
> -	if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
> -		rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
> -		rt_rq->push_cpu = src_rq->cpu;
> -	}
> +	raw_spin_lock(&rq->rd->rto_lock);
>
> -	cpu = find_next_push_cpu(src_rq);
> +	/* Pass the IPI to the next rt overloaded queue */
> +	cpu = rto_next_cpu(rq);
>
> -	if (cpu >= nr_cpu_ids)
> -		rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
> -	raw_spin_unlock(&rt_rq->push_lock);
> +	raw_spin_unlock(&rq->rd->rto_lock);
>
> -   if (cpu >= nr_cpu_ids)
> +   if (cpu < 0)
> return;

I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
stable kernel based system. This issue is observed only after the
inclusion of this patch. It appears to me that rq->rd can change between
the spinlock being acquired and released in the rto_push_irq_work_func()
IRQ work if hotplug is in progress. It was only reported a couple of
times during long stress testing. The issue can be easily reproduced if
an artificial delay is introduced between the lock and unlock of
rto_lock. The rq->rd is changed under rq->lock, so we can protect this
race with rq->lock. The patch below solved the problem. We are already
taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the
same here. Please let me know your thoughts on this.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d863d39..478192b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
		raw_spin_unlock(&rq->lock);
	}
 
+	raw_spin_lock(&rq->lock);
	raw_spin_lock(&rq->rd->rto_lock);
 
	/* Pass the IPI to the next rt overloaded queue */
@@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 
	raw_spin_unlock(&rq->rd->rto_lock);
 
-	if (cpu < 0)
-		return;
-
	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	if (cpu >= 0)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	raw_spin_unlock(&rq->lock);
 }
 #endif /* HAVE_RT_PUSH_IPI */


Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


[tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2017-10-10 Thread tip-bot for Steven Rostedt (Red Hat)
Commit-ID:  4bdced5c9a2922521e325896a7bbbf0132c94e56
Gitweb: https://git.kernel.org/tip/4bdced5c9a2922521e325896a7bbbf0132c94e56
Author: Steven Rostedt (Red Hat) 
AuthorDate: Fri, 6 Oct 2017 14:05:04 -0400
Committer:  Ingo Molnar 
CommitDate: Tue, 10 Oct 2017 11:45:40 +0200

sched/rt: Simplify the IPI based RT balancing logic

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non-running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with a large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:

  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless of whether there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock       - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock;
                   the first CPU scheduling in a lower priority task
                   is the one to kick off the process.
  rto_loop_next  - an atomic that gets incremented for each CPU that
                   schedules in a lower priority task.
  rto_loop       - a variable protected by rto_lock that is used to
                   compare against rto_loop_next
  rto_cpu        - the CPU to send the next IPI to, also protected by
                   the rto_lock.
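
Roughly, the fields added to struct root_domain look like this (an
illustrative sketch, not the exact hunk from the patch):

#ifdef HAVE_RT_PUSH_IPI
	/*
	 * For IPI pull requests, loop across the rto_mask.
	 */
	struct irq_work		rto_push_work;
	raw_spinlock_t		rto_lock;
	/* These are only updated and read within rto_lock */
	int			rto_loop;
	int			rto_cpu;
	/* These atomics are updated outside of a lock */
	atomic_t		rto_loop_next;
	atomic_t		rto_loop_start;
#endif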

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
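
The rto_loop_start handshake is implemented by a pair of small helpers,
roughly like this (an illustrative sketch; the exact atomic variants may
differ):

static inline bool rto_start_trylock(atomic_t *v)
{
	/* Only the 0 -> 1 transition wins the right to start the IPI loop */
	return !atomic_cmpxchg_acquire(v, 0, 1);
}

static inline void rto_start_unlock(atomic_t *v)
{
	atomic_set_release(v, 0);
}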

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that are
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
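
The scan itself is driven by the rto_next_cpu() iterator, which looks roughly
like this (an abridged, illustrative sketch, not the exact source):

static int rto_next_cpu(struct root_domain *rd)
{
	int next;
	int cpu;

	for (;;) {
		/* When rto_cpu is -1 this acts like cpumask_first() */
		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
		rd->rto_cpu = cpu;

		if (cpu < nr_cpu_ids)
			return cpu;

		rd->rto_cpu = -1;

		/*
		 * No more CPUs in rto_mask: stop only if no new CPU lowered
		 * its priority (rto_loop_next) since this loop started.
		 */
		next = atomic_read_acquire(&rd->rto_loop_next);
		if (rd->rto_loop == next)
			break;

		rd->rto_loop = next;
	}

	return -1;
}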

This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Clark Williams 
Cc: