Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Hi Steve,

On Fri, Jan 19, 2018 at 02:51:15PM -0500, Steven Rostedt wrote:
> On Sat, 20 Jan 2018 00:27:56 +0530
> Pavan Kondeti wrote:
>
> > Hi Steve,
> >
> > Thanks for the patch.
> >
> > On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > > On Fri, 19 Jan 2018 13:11:21 -0500
> > > Steven Rostedt wrote:
> > >
> > > >  void rto_push_irq_work_func(struct irq_work *work)
> > > >  {
> > > > +	struct root_domain *rd =
> > > > +		container_of(work, struct root_domain, rto_push_work);
> > > >  	struct rq *rq;
> > >
> > > Notice that I also remove the dependency on rq from getting the rd.
> >
> > Nice. This snippet by itself solves the original problem I reported.
> > I will test your patch and let you know the results.
>
> I'll break the patch up into two then. One with this snippet, and the
> other with the rd ref counting.

Yeah, this snippet fixed the original problem. I have not seen the "use
after free" of the rd in my testing. But I could see that we were
operating on a rd whose refcount was 0. After applying your refcount
patch, that never happened. I also verified that we free the rd via the
IRQ work by dropping the last reference.

Thanks for your help with the patches. Please copy linux-stable for the
1st patch.

Feel free to use

Tested-by: Pavankumar Kondeti

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Sat, 20 Jan 2018 00:27:56 +0530
Pavan Kondeti wrote:

> Hi Steve,
>
> Thanks for the patch.
>
> On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > On Fri, 19 Jan 2018 13:11:21 -0500
> > Steven Rostedt wrote:
> >
> > >  void rto_push_irq_work_func(struct irq_work *work)
> > >  {
> > > +	struct root_domain *rd =
> > > +		container_of(work, struct root_domain, rto_push_work);
> > >  	struct rq *rq;
> >
> > Notice that I also remove the dependency on rq from getting the rd.
>
> Nice. This snippet by itself solves the original problem I reported.
> I will test your patch and let you know the results.

I'll break the patch up into two then. One with this snippet, and the
other with the rd ref counting.

-- Steve
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Hi Steve,

Thanks for the patch.

On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 13:11:21 -0500
> Steven Rostedt wrote:
>
> >  void rto_push_irq_work_func(struct irq_work *work)
> >  {
> > +	struct root_domain *rd =
> > +		container_of(work, struct root_domain, rto_push_work);
> >  	struct rq *rq;
>
> Notice that I also remove the dependency on rq from getting the rd.

Nice. This snippet by itself solves the original problem I reported.
I will test your patch and let you know the results.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, Jan 19, 2018 at 01:11:21PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 23:16:17 +0530
> Pavan Kondeti wrote:
>
> > I am thinking of another problem because of the race between
> > rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
> >
> > Let's say we cache the rq->rd here and queue the IRQ work on a remote
> > CPU. In the meantime, rq_attach_root() might drop all the references
> > to this cached (old) rd and want to free it. The rq->rd is freed in an
> > RCU-sched callback. If that remote CPU is in an RCU quiescent state,
> > the rq->rd can get freed before the IRQ work is executed. This results
> > in the corruption of the remote CPU's IRQ work list. Right?
> >
> > Taking rq->lock in rto_push_irq_work_func() also does not help here.
> > Probably we have to wait for the IRQ work to finish before freeing the
> > older root domain in the RCU-sched callback.
>
> I was wondering about this too. Yeah, it would require an RCU-like
> update. Once the rd was unreferenced, it would need to wait for the
> irq works to finish before freeing it.
>
> The easy way to do this is to simply up the refcount when sending the
> domain. Something like this:
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513adca3..89a086ed2b16 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
>   * the rt_loop_next will cause the iterator to perform another scan.
>   *
>   */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
>  {
> -	struct root_domain *rd = rq->rd;
>  	int next;
>  	int cpu;
>
> @@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
>  	 * Otherwise it is finishing up and an ipi needs to be sent.
>  	 */
>  	if (rq->rd->rto_cpu < 0)
> -		cpu = rto_next_cpu(rq);
> +		cpu = rto_next_cpu(rq->rd);
>
>  	raw_spin_unlock(&rq->rd->rto_lock);
>
>  	rto_start_unlock(&rq->rd->rto_loop_start);
>
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
> +		/* Make sure the rd does not get freed while pushing */
> +		sched_get_rd(rq->rd);
>  		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	}
>  }

Since this is covered by rq->lock, it is guaranteed that we increment
the refcount on the older rd before the RCU-sched callback is queued in
rq_attach_root(). Either we keep the older rd alive or we use the
updated rd. We are good here, I think.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
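[Pavan's argument rests on rq_attach_root() swapping rq->rd and dropping
the old reference under rq->lock, the same lock tell_cpu_to_push() holds
when it calls sched_get_rd(). For reference, a trimmed sketch of
rq_attach_root() from kernels of this era follows; the cpumask and
online/offline handling is elided, so treat it as an illustration of the
locking shape rather than the exact function body.]

void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
	struct root_domain *old_rd = NULL;
	unsigned long flags;

	raw_spin_lock_irqsave(&rq->lock, flags);

	if (rq->rd) {
		old_rd = rq->rd;
		/* ... detach rq->cpu from old_rd->span/online ... */

		/* Keep old_rd non-NULL only if this dropped the last ref. */
		if (!atomic_dec_and_test(&old_rd->refcount))
			old_rd = NULL;
	}

	atomic_inc(&rd->refcount);
	rq->rd = rd;		/* the swap happens under rq->lock */

	/* ... attach rq->cpu to rd->span, maybe set rq online ... */

	raw_spin_unlock_irqrestore(&rq->lock, flags);

	/*
	 * Because both the swap and the refcount drop happen inside the
	 * rq->lock section, a tell_cpu_to_push() holding rq->lock either
	 * sees the new rd or takes its reference on the old rd before
	 * the free below can be queued.
	 */
	if (old_rd)
		call_rcu_sched(&old_rd->rcu, free_rootdomain);
}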
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, 19 Jan 2018 13:11:21 -0500
Steven Rostedt wrote:

>  void rto_push_irq_work_func(struct irq_work *work)
>  {
> +	struct root_domain *rd =
> +		container_of(work, struct root_domain, rto_push_work);
>  	struct rq *rq;

Notice that I also remove the dependency on rq from getting the rd.

-- Steve
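[For readers unfamiliar with the idiom: container_of() recovers a pointer
to the enclosing structure from a pointer to one of its embedded members,
which is what lets the IRQ work callback find its root_domain without ever
dereferencing rq->rd (which can change under hotplug). A minimal
standalone C sketch of the pattern; the struct layout and the simplified
macro are illustrative, not the kernel's actual definitions.]

#include <stddef.h>

/* Illustrative stand-ins for the kernel types. */
struct irq_work { int pending; };

struct root_domain {
	int refcount;
	struct irq_work rto_push_work;	/* embedded member */
};

/*
 * container_of(): subtract the member's offset within the type from the
 * member pointer to get back to the start of the enclosing structure.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

void rto_push_irq_work_func(struct irq_work *work)
{
	/*
	 * 'work' points at rd->rto_push_work, so this recovers the rd
	 * the work item is embedded in, with no rq->rd access at all.
	 */
	struct root_domain *rd =
		container_of(work, struct root_domain, rto_push_work);
	(void)rd;
}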
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, 19 Jan 2018 23:16:17 +0530
Pavan Kondeti wrote:

> I am thinking of another problem because of the race between
> rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
>
> Let's say we cache the rq->rd here and queue the IRQ work on a remote
> CPU. In the meantime, rq_attach_root() might drop all the references to
> this cached (old) rd and want to free it. The rq->rd is freed in an
> RCU-sched callback. If that remote CPU is in an RCU quiescent state,
> the rq->rd can get freed before the IRQ work is executed. This results
> in the corruption of the remote CPU's IRQ work list. Right?
>
> Taking rq->lock in rto_push_irq_work_func() also does not help here.
> Probably we have to wait for the IRQ work to finish before freeing the
> older root domain in the RCU-sched callback.

I was wondering about this too. Yeah, it would require an RCU-like
update. Once the rd was unreferenced, it would need to wait for the
irq works to finish before freeing it.

The easy way to do this is to simply up the refcount when sending the
domain. Something like this:

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..89a086ed2b16 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-	struct root_domain *rd = rq->rd;
 	int next;
 	int cpu;
 
@@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
 	 * Otherwise it is finishing up and an ipi needs to be sent.
 	 */
 	if (rq->rd->rto_cpu < 0)
-		cpu = rto_next_cpu(rq);
+		cpu = rto_next_cpu(rq->rd);
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
 	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+	struct root_domain *rd =
+		container_of(work, struct root_domain, rto_push_work);
 	struct rq *rq;
 	int cpu;
 
@@ -2013,18 +2017,20 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
 		return;
+	}
 
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..fb5fc458547f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..519b024f4e94 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 		call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
 	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

-- Steve
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti wrote:
>
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a
> > 4.9 stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change
> > between when the spinlock is acquired and when it is released in the
> > rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> > only reported a couple of times during long stress testing. The issue
> > can be easily reproduced if an artificial delay is introduced between
> > the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> > so we can protect this race with rq->lock. The below patch solved the
> > problem. We are taking rq->lock in pull_rt_task()->tell_cpu_to_push(),
> > so I extended the same here. Please let me know your thoughts on this.
>
> Ah, so rq->rd can change. Interesting.
>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> >  		raw_spin_unlock(&rq->lock);
> >  	}
> >
> > +	raw_spin_lock(&rq->lock);
>
> What about just saving the rd then?
>
> 	struct root_domain *rd;
>
> 	rd = READ_ONCE(rq->rd);
>
> then use that. Then we don't need to worry about it changing.

I am thinking of another problem because of the race between
rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.

Let's say we cache the rq->rd here and queue the IRQ work on a remote
CPU. In the meantime, rq_attach_root() might drop all the references to
this cached (old) rd and want to free it. The rq->rd is freed in an
RCU-sched callback. If that remote CPU is in an RCU quiescent state, the
rq->rd can get freed before the IRQ work is executed. This results in
the corruption of the remote CPU's IRQ work list. Right?

Taking rq->lock in rto_push_irq_work_func() also does not help here.
Probably we have to wait for the IRQ work to finish before freeing the
older root domain in the RCU-sched callback.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
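[To make the ordering of the hazard concrete, the race Pavan describes
can be laid out as a timeline. The CPU numbers are illustrative, not from
the thread.]

/*
 * CPU0 (pull side)                    CPU1 (hotplug path)
 * --------------------------------   --------------------------------
 * rd = rq->rd;    // cached snapshot
 * irq_work_queue_on(&rd->rto_push_work, 2);
 *                                    rq_attach_root(rq, new_rd);
 *                                      // drops the last reference
 *                                      call_rcu_sched(&rd->rcu,
 *                                                     free_rootdomain);
 *
 * CPU2 is idle, i.e. in an RCU-sched quiescent state, so the grace
 * period can end before CPU2 gets around to handling the queued work:
 *
 *                                    free_rootdomain(rd);
 *
 * CPU2 eventually processes its irq_work list, which still contains
 * the rto_push_work embedded in the freed rd: use-after-free, and the
 * remote CPU's irq_work list is corrupted.
 */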
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, 19 Jan 2018 21:14:30 +0530
Pavan Kondeti wrote:

> Yeah, it should work. I will give it a try and send the patch
> for review.

Add a comment above the assignment saying something to the effect:

	/* Due to CPU hotplug, rq->rd could possibly change */

-- Steve
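[Putting Steve's two suggestions together, the interim fix Pavan was
being asked to post would look roughly like the sketch below. This is
illustrative only: it assumes rto_next_cpu() has been converted to take
the root_domain (as in the refcounting patch elsewhere in this thread),
and the thread ultimately settled on deriving the rd via container_of()
instead, as shown earlier in the archive.]

void rto_push_irq_work_func(struct irq_work *work)
{
	struct rq *rq = this_rq();
	/* Due to CPU hotplug, rq->rd could possibly change */
	struct root_domain *rd = READ_ONCE(rq->rd);
	int cpu;

	if (has_pushable_tasks(rq)) {
		raw_spin_lock(&rq->lock);
		push_rt_tasks(rq);
		raw_spin_unlock(&rq->lock);
	}

	/* Everything below uses the snapshot, never rq->rd again. */
	raw_spin_lock(&rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rd);

	raw_spin_unlock(&rd->rto_lock);

	if (cpu < 0)
		return;

	/* Try the next RT overloaded CPU */
	irq_work_queue_on(&rd->rto_push_work, cpu);
}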
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Hi Steven,

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti wrote:
>
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a
> > 4.9 stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change
> > between when the spinlock is acquired and when it is released in the
> > rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> > only reported a couple of times during long stress testing. The issue
> > can be easily reproduced if an artificial delay is introduced between
> > the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> > so we can protect this race with rq->lock. The below patch solved the
> > problem. We are taking rq->lock in pull_rt_task()->tell_cpu_to_push(),
> > so I extended the same here. Please let me know your thoughts on this.
>
> Ah, so rq->rd can change. Interesting.
>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> >  		raw_spin_unlock(&rq->lock);
> >  	}
> >
> > +	raw_spin_lock(&rq->lock);
>
> What about just saving the rd then?
>
> 	struct root_domain *rd;
>
> 	rd = READ_ONCE(rq->rd);
>
> then use that. Then we don't need to worry about it changing.

Yeah, it should work. I will give it a try and send the patch
for review.

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
On Fri, 19 Jan 2018 14:53:05 +0530
Pavan Kondeti wrote:

> I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a
> 4.9 stable kernel based system. This issue is observed only after the
> inclusion of this patch. It appears to me that rq->rd can change
> between when the spinlock is acquired and when it is released in the
> rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> only reported a couple of times during long stress testing. The issue
> can be easily reproduced if an artificial delay is introduced between
> the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> so we can protect this race with rq->lock. The below patch solved the
> problem. We are taking rq->lock in pull_rt_task()->tell_cpu_to_push(),
> so I extended the same here. Please let me know your thoughts on this.

Ah, so rq->rd can change. Interesting.

> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d863d39..478192b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
>  		raw_spin_unlock(&rq->lock);
>  	}
>
> +	raw_spin_lock(&rq->lock);

What about just saving the rd then?

	struct root_domain *rd;

	rd = READ_ONCE(rq->rd);

then use that. Then we don't need to worry about it changing.

-- Steve

>  	raw_spin_lock(&rq->rd->rto_lock);
>
>  	/* Pass the IPI to the next rt overloaded queue */
> @@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
>
>  	raw_spin_unlock(&rq->rd->rto_lock);
>
> -	if (cpu < 0)
> -		return;
> -
>  	/* Try the next RT overloaded CPU */
> -	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	if (cpu >= 0)
> +		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> +	raw_spin_unlock(&rq->lock);
>  }
>  #endif /* HAVE_RT_PUSH_IPI */
>
>
> Thanks,
> Pavan
>
Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Hi Steven,

>  /* Called from hardirq context */
> -static void try_to_push_tasks(void *arg)
> +void rto_push_irq_work_func(struct irq_work *work)
>  {
> -	struct rt_rq *rt_rq = arg;
> -	struct rq *rq, *src_rq;
> -	int this_cpu;
> +	struct rq *rq;
>  	int cpu;
>
> -	this_cpu = rt_rq->push_cpu;
> +	rq = this_rq();
>
> -	/* Paranoid check */
> -	BUG_ON(this_cpu != smp_processor_id());
> -
> -	rq = cpu_rq(this_cpu);
> -	src_rq = rq_of_rt_rq(rt_rq);
> -
> -again:
> +	/*
> +	 * We do not need to grab the lock to check for has_pushable_tasks.
> +	 * When it gets updated, a check is made if a push is possible.
> +	 */
>  	if (has_pushable_tasks(rq)) {
>  		raw_spin_lock(&rq->lock);
> -		push_rt_task(rq);
> +		push_rt_tasks(rq);
>  		raw_spin_unlock(&rq->lock);
>  	}
>
> -	/* Pass the IPI to the next rt overloaded queue */
> -	raw_spin_lock(&rt_rq->push_lock);
> -	/*
> -	 * If the source queue changed since the IPI went out,
> -	 * we need to restart the search from that CPU again.
> -	 */
> -	if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
> -		rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
> -		rt_rq->push_cpu = src_rq->cpu;
> -	}
> +	raw_spin_lock(&rq->rd->rto_lock);
>
> -	cpu = find_next_push_cpu(src_rq);
> +	/* Pass the IPI to the next rt overloaded queue */
> +	cpu = rto_next_cpu(rq);
>
> -	if (cpu >= nr_cpu_ids)
> -		rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
> -	raw_spin_unlock(&rt_rq->push_lock);
> +	raw_spin_unlock(&rq->rd->rto_lock);
>
> -	if (cpu >= nr_cpu_ids)
> +	if (cpu < 0)
>  		return;

I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
stable kernel based system. This issue is observed only after the
inclusion of this patch. It appears to me that rq->rd can change between
when the spinlock is acquired and when it is released in the
rto_push_irq_work_func() IRQ work if hotplug is in progress. It was only
reported a couple of times during long stress testing. The issue can be
easily reproduced if an artificial delay is introduced between the lock
and unlock of rto_lock. The rq->rd is changed under rq->lock, so we can
protect this race with rq->lock. The below patch solved the problem. We
are taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended
the same here. Please let me know your thoughts on this.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d863d39..478192b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
+	raw_spin_lock(&rq->lock);
 	raw_spin_lock(&rq->rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
@@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
-	if (cpu < 0)
-		return;
-
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	if (cpu >= 0)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	raw_spin_unlock(&rq->lock);
 }
 #endif /* HAVE_RT_PUSH_IPI */

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
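[Pavan does not show the artificial delay he used to reproduce the bug; a
hypothetical debug-only hack along the lines he describes would simply
widen the race window between the lock and the unlock, for example:]

	raw_spin_lock(&rq->rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rq);

	/*
	 * HYPOTHETICAL debug-only delay (not from the thread): busy-wait
	 * long enough for a concurrent rq_attach_root() to swap rq->rd,
	 * so the unlock below operates on a different root domain than
	 * the lock above. Needs <linux/delay.h>.
	 */
	mdelay(10);

	raw_spin_unlock(&rq->rd->rto_lock);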
[tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Commit-ID:  4bdced5c9a2922521e325896a7bbbf0132c94e56
Gitweb:     https://git.kernel.org/tip/4bdced5c9a2922521e325896a7bbbf0132c94e56
Author:     Steven Rostedt (Red Hat)
AuthorDate: Fri, 6 Oct 2017 14:05:04 -0400
Committer:  Ingo Molnar
CommitDate: Tue, 10 Oct 2017 11:45:40 +0200

sched/rt: Simplify the IPI based RT balancing logic

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has
overloaded RT tasks (more than one). It checks the rto_mask to determine
this, and if so it will request to pull one of those tasks to itself if
the non-running RT task is of higher priority than the new priority of
the next task to run on the current CPU.

When we deal with a large number of CPUs, the original pull logic
suffered from large lock contention on a single CPU run queue, which
caused a huge latency across all CPUs. This was caused by only one CPU
having overloaded RT tasks and a bunch of other CPUs lowering their
priority. To solve this issue, commit:

  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the
work.

Although the IPI logic worked very well in removing the large latency
build up, it still could suffer from a large number of IPIs being sent
to a single CPU. On an 80 CPU box, I measured over 200us of processing
IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test
that had lots of RT tasks scheduling on all CPUs, it actually triggered
the hard lockup detector! One CPU had so many IPIs sent to it, and due
to the restart mechanism that is triggered when the source run queue has
a priority status change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run
queue to send its own IPI. As all CPUs with overloaded tasks must be
scanned regardless of whether there's one or many CPUs lowering their
priority, because there's no current way to find the CPU with the
highest priority task that can schedule to one of these CPUs, there
really only needs to be one IPI being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as
the rto_mask is per root domain. The root domain has the following
fields attached to it:

  rto_push_work  - the irq work to process each CPU set in rto_mask
  rto_lock       - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock;
                   the first CPU scheduling in a lower priority task
                   is the one to kick off the process.
  rto_loop_next  - an atomic that gets incremented for each CPU that
                   schedules in a lower priority task.
  rto_loop       - a variable protected by rto_lock that is used to
                   compare against rto_loop_next
  rto_cpu        - The CPU to send the next IPI to, also protected by
                   the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it, it increments rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not
"0", then it is done, as another CPU is kicking off the IPI loop. If
the old value is "0", then it will take the rto_lock to synchronize
with a possible IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either
no IPI being sent around, or one is about to finish. Then rto_cpu is
set to the first CPU in rto_mask, and an IPI is sent to that CPU. If
there are no CPUs set in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks
that are queued on the CPU but can't run because a higher priority RT
task is currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask.
If it finds one, it simply sends an IPI to that CPU and the process
continues.

If there are no more CPUs in the rto_mask, then rto_loop is compared
with rto_loop_next. If they match, everything is done and the process is
over. If they do not match, then a CPU scheduled in a lower priority
task as the IPI was being passed around, and the process needs to start
again. The first CPU in rto_mask is sent the IPI.

This change removes this duplication of work in the IPI logic, and
greatly lowers the latency caused by the IPIs. This removed the lockup
happening on the 120 CPU machine. It also simplifies the code
tremendously. What else could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic
and supplying me with the rto_start_trylock() and rto_start_unlock()
helper functions.

Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Peter Zijlstra (Intel)
Cc: Clark Williams
Cc:
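[The changelog credits Peter Zijlstra with the rto_start_trylock() and
rto_start_unlock() helpers but does not show them. Reconstructed from the
described semantics (a cmpxchg from 0 that only the first caller wins,
released by setting the atomic back to 0), they would look something
like the sketch below; the acquire/release variants are an assumption
consistent with using the atomic as a lightweight lock.]

static inline bool rto_start_trylock(atomic_t *v)
{
	/* Only the caller that transitions 0 -> 1 wins. */
	return !atomic_cmpxchg_acquire(v, 0, 1);
}

static inline void rto_start_unlock(atomic_t *v)
{
	/* Release the "loop start" ownership taken above. */
	atomic_set_release(v, 0);
}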