Re: [PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-15 Thread Steven Rostedt
On Fri, 2013-02-15 at 12:51 +0100, Peter Zijlstra wrote:
> On Wed, 2013-02-13 at 14:05 -0500, Steven Rostedt wrote:

> Ah, so that's push_rt_task() which wants to move extra rt tasks to other
> cpus. Doing that from where we have idle_balance() won't actually work I
> think since we might need to move current, which we cannot at that point
> -- I'm thinking a higher prio task (than current) waking to this cpu and
> then cascading current to another cpu, can that happen?

Yep, that's exactly why we do the push after the switch. It also catches
a bunch of races that can happen if in the middle of the switch another
CPU drops in priority. A drop in priority can cause that CPU to search
for pending rt tasks on other CPUs, but because current hasn't been
taken off the queue yet, it can miss trying to pull it, and current
could get stuck on its CPU even though there's another CPU it could run
on. Having the higher priority task do the check solves that race.

> 
> If we never need to migrate current because we don't do the cascade by
> ensuring we wake the higher prio task to the appropriate cpu we might
> just get away with it.

I'm not exactly sure what you mean by this. Migrate the higher priority
task? But what if it's pinned? We used to do that, until we found out
that higher priority tasks were ping-ponging around like crazy.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-15 Thread Peter Zijlstra
On Wed, 2013-02-13 at 14:05 -0500, Steven Rostedt wrote:
> That is, the CPU is about to go idle, thus a load balance is done, and
> perhaps a task is pulled to the current queue. To do this, rq locks
> and
> such need to be grabbed across CPUs.

Right, grabbing the rq locks and all isn't my main worry; we do that in
either case. My worry was the two extra switches we do for no good
reason at all.

Now it's not as if we'll actually run the idle thread, that would be very
expensive indeed, so it's just the two context_switch() calls. But still,
I somehow remember us spending quite a lot of effort to keep
idle_balance() where it is now, if only I could remember the benchmark we
had for it :/

Can't you do the opposite and fold post_schedule() into idle_balance()?

/me goes stare at code to help remember what the -rt requirements were
again..

Ah, so that's push_rt_task() which wants to move extra rt tasks to other
cpus. Doing that from where we have idle_balance() won't actually work I
think since we might need to move current, which we cannot at that point
-- I'm thinking a higher prio task (than current) waking to this cpu and
then cascading current to another cpu, can that happen?

If we never need to migrate current because we don't do the cascade by
ensuring we wake the higher prio task to the appropriate cpu we might
just get away with it.



Re: [PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-14 Thread Steven Rostedt
On Wed, 2013-02-13 at 19:43 +0100, Peter Zijlstra wrote:

> How does that follow? We can have to-idle switches _far_ more often than
> we balance.

I ran "perf stat -a -r 100 hackbench 500" on an i7 4 core hyperthreaded
box, and got the following results:

With no patches applied:

 Performance counter stats for '/work/c/hackbench 500' (100 runs):

 199820.045583 task-clock#8.016 CPUs utilized   
 ( +-  5.29% ) [100.00%]
 3,594,264 context-switches  #0.018 M/sec   
 ( +-  5.94% ) [100.00%]
   352,240 cpu-migrations#0.002 M/sec   
 ( +-  3.31% ) [100.00%]
 1,006,732 page-faults   #0.005 M/sec   
 ( +-  0.56% )
   293,801,912,874 cycles#1.470 GHz 
 ( +-  4.20% ) [100.00%]
   261,808,125,109 stalled-cycles-frontend   #   89.11% frontend cycles idle
 ( +-  4.38% ) [100.00%]
  <not supported> stalled-cycles-backend
   135,521,344,089 instructions  #0.46  insns per cycle
 #1.93  stalled cycles per insn 
 ( +-  4.37% ) [100.00%]
26,198,116,586 branches  #  131.109 M/sec   
 ( +-  4.59% ) [100.00%]
   115,326,812 branch-misses #0.44% of all branches 
 ( +-  4.12% )

  24.929136087 seconds time elapsed 
 ( +-  5.31% )


With the two idle_balance patches applied:

 Performance counter stats for '/work/c/hackbench 500' (100 runs):

 178619.929457 task-clock#8.011 CPUs utilized   
 ( +-  4.16% ) [100.00%]
 3,171,229 context-switches  #0.018 M/sec   
 ( +-  3.30% ) [100.00%]
   323,115 cpu-migrations#0.002 M/sec   
 ( +-  2.47% ) [100.00%]
   994,506 page-faults   #0.006 M/sec   
 ( +-  0.52% )
   269,744,391,573 cycles#1.510 GHz 
 ( +-  2.12% ) [100.00%]
   238,589,242,461 stalled-cycles-frontend   #   88.45% frontend cycles idle
 ( +-  2.26% ) [100.00%]
  <not supported> stalled-cycles-backend
   124,298,251,712 instructions  #0.46  insns per cycle
 #1.92  stalled cycles per insn 
 ( +-  1.99% ) [100.00%]
23,918,305,712 branches  #  133.906 M/sec   
 ( +-  2.13% ) [100.00%]
   105,863,415 branch-misses #0.44% of all branches 
 ( +-  2.14% )

  22.296151996 seconds time elapsed 
 ( +-  4.18% )


And finally with the patch at the bottom of this email applied:

 Performance counter stats for '/work/c/hackbench 500' (100 runs):

 170170.547682 task-clock#8.021 CPUs utilized   
 ( +-  5.59% ) [100.00%]
 3,118,923 context-switches  #0.018 M/sec   
 ( +-  4.82% ) [100.00%]
   318,479 cpu-migrations#0.002 M/sec   
 ( +-  2.58% ) [100.00%]
   988,187 page-faults   #0.006 M/sec   
 ( +-  0.62% )
   271,343,352,987 cycles#1.595 GHz 
 ( +-  3.84% ) [100.00%]
   240,599,089,430 stalled-cycles-frontend   #   88.67% frontend cycles idle
 ( +-  4.24% ) [100.00%]
  <not supported> stalled-cycles-backend
   125,888,645,748 instructions  #0.46  insns per cycle
 #1.91  stalled cycles per insn 
 ( +-  3.97% ) [100.00%]
24,219,147,811 branches  #  142.323 M/sec   
 ( +-  4.22% ) [100.00%]
   105,077,636 branch-misses #0.43% of all branches 
 ( +-  3.70% )

  21.214840224 seconds time elapsed 
 ( +-  5.61% )

Yeah, it seems the extra check for rq empty helps. But even without
that, the current idle patches still seem pretty good.

-- Steve

diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 66b5220..025350b 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -16,7 +16,9 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int 
flags)
 
 static void post_schedule_idle(struct rq *rq)
 {
-   idle_balance(smp_processor_id(), rq);
+   /* rq lock was released, make sure system is still idle */
+   if (likely(!rq->nr_running))
+   idle_balance(smp_processor_id(), rq);
 }
 #endif /* CONFIG_SMP */
 /*




Re: [PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-13 Thread Steven Rostedt
On Wed, 2013-02-13 at 19:43 +0100, Peter Zijlstra wrote:
> On Tue, 2013-02-12 at 17:54 -0500, Steven Rostedt wrote:
> > There's no real reason that the idle_balance() needs to be called in
> > the
> > middle of schedule anyway. The only benefit is that if a task is
> > pulled
> > to this CPU, it can be scheduled without the need to schedule the idle
> > task. 
> 
> Uhm, I seem to recall that extra schedule being an issue somewhere...
> Make very sure you don't regress anything silly like sysbench or
> hackbench. Maybe ask Mike, he seems to have better recall for benchmark
> weirdness than me.

I'm all for benchmarks :-)

> 
> > But load balancing and migrating the task makes a switch to idle
> > and back negligible.
> 
> How does that follow? We can have to-idle switches _far_ more often than
> we balance.

But we have the to-idle switch regardless in that case, don't we?

That is, the CPU is about to go idle, thus a load balance is done, and
perhaps a task is pulled to the current queue. To do this, rq locks and
such need to be grabbed across CPUs.

If it didn't balance, the switch would happen anyway.

The current method is:

if (!rq->nr_running)
idle_balance();

/* if something pulled, then run that instead,
   if not, continue to switch to idle. */

What this change did was:

pick_next_task_idle()
rq->post_schedule = 1;

post_schedule()
idle_balance();

Now if a balance occurred, we would have to switch back from idle, to
what we just pulled.

Hence the change is that a switch to and from idle is done only in the
case that load balancing occurred, otherwise it just goes to idle like
it always has.

Or perhaps I need to add another check to make sure rq->nr_running is still
0 before doing the balance, because the rq lock was released. That is:

post_schedule()
if (!rq->nr_running)
idle_balance();

needs to be added.

-- Steve





Re: [PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-13 Thread Peter Zijlstra
On Tue, 2013-02-12 at 17:54 -0500, Steven Rostedt wrote:
> There's no real reason that the idle_balance() needs to be called in
> the
> middle of schedule anyway. The only benefit is that if a task is
> pulled
> to this CPU, it can be scheduled without the need to schedule the idle
> task. 

Uhm, I seem to recall that extra schedule being an issue somewhere...
Make very sure you don't regress anything silly like sysbench or
hackbench. Maybe ask Mike, he seems to have better recall for benchmark
weirdness than me.

> But load balancing and migrating the task makes a switch to idle
> and back negligible.

How does that follow? We can have to-idle switches _far_ more often than
we balance.



[PATCH 2/3] sched: Move idle_balance() to post_schedule

2013-02-12 Thread Steven Rostedt
From: "Steven Rostedt (Red Hat)" 

The idle_balance() code is called to do task load balancing just before
going to idle. This makes sense as the CPU is about to sleep anyway.
But currently it's called in the middle of the scheduler and in a place
that must have interrupts disabled. That means, while the load balancing
is going on, if a task wakes up on this CPU, it won't get to run while
the interrupts are disabled. The idle task doing the balancing will be
clueless about it.

There's no real reason that the idle_balance() needs to be called in the
middle of schedule anyway. The only benefit is that if a task is pulled
to this CPU, it can be scheduled without the need to schedule the idle
task. But load balancing and migrating the task makes a switch to idle
and back negligible.

By using the post_schedule function pointer of the sched class, the
unlikely branch in the hot path of the scheduler can be removed, and
the idle task itself can do the load balancing.

Another advantage is that by moving idle_balance() to the post_schedule
routine, interrupts can now be enabled during the load balance, allowing
interrupts and wakeups to still occur on that CPU while a balance is
taking place. The enabling of interrupts will come as a separate patch.

Cc: Peter Zijlstra 
Signed-off-by: Steven Rostedt 
---
 kernel/sched/core.c  |3 ---
 kernel/sched/idle_task.c |   10 ++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..a9317b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2927,9 +2927,6 @@ need_resched:
 
pre_schedule(rq, prev);
 
-   if (unlikely(!rq->nr_running))
-   idle_balance(cpu, rq);
-
put_prev_task(rq, prev);
next = pick_next_task(rq);
clear_tsk_need_resched(prev);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..66b5220 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,6 +13,11 @@ select_task_rq_idle(struct task_struct *p, int sd_flag, int 
flags)
 {
return task_cpu(p); /* IDLE tasks as never migrated */
 }
+
+static void post_schedule_idle(struct rq *rq)
+{
+   idle_balance(smp_processor_id(), rq);
+}
 #endif /* CONFIG_SMP */
 /*
  * Idle tasks are unconditionally rescheduled:
@@ -25,6 +30,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct 
task_struct *p, int fl
 static struct task_struct *pick_next_task_idle(struct rq *rq)
 {
schedstat_inc(rq, sched_goidle);
+#ifdef CONFIG_SMP
+   /* Trigger the post schedule to do an idle_balance */
+   rq->post_schedule = 1;
+#endif
return rq->idle;
 }
 
@@ -86,6 +95,7 @@ const struct sched_class idle_sched_class = {
 
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_idle,
+   .post_schedule  = post_schedule_idle,
 #endif
 
.set_curr_task  = set_curr_task_idle,
-- 
1.7.10.4





