Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Paul Menage
On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
 Steven wrote:
  +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
  +{
  + cpu_set(cpu, task_cs(tsk)->rt_overload);
  +}

 Question for Steven:

   What locks are held when cpuset_rt_set_overload() is called?

 Questions for Paul Menage:

   Does 'tsk' need to be locked for the above task_cs() call?

Cgroups doesn't change the locking rules for accessing a cpuset from a
task - you have to have one of:

- task_lock(task)

- callback_mutex

- be in an RCU section from the point when you call task_cs to the
point when you stop using its result. (Additionally, in this case
there's no guarantee that the task stays in this cpuset for the
duration of the RCU section).

Paul
-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Paul Jackson
Paul M wrote:
 Cgroups doesn't change the locking rules for accessing a cpuset from a
 task - you have to have one of:

Good - could you comment task_cs() with this explanation?

The rules are derived from the cpuset rules, as you explain,
and as I suspected, but now task_cs() is the most popular
way to access a task's cpuset from code within kernel/cpuset.c,
and that's code you added.

The reason that I started this subthread is that I didn't see
any immediate evidence that the RT code was honoring this locking,
and I suspected that a clear comment over task_cs() could have
clarified that for them.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401


Re: KVM and Prempt?

2007-10-23 Thread Jan Kiszka
Sven-Thorsten Dietrich wrote:
 On Mon, 2007-10-22 at 09:01 +0200, Back, Michael (ext) wrote:
 Hallo,
 I tried to run Windows XP with KVM on Linux 2.6.31.1 on a 
 
 You mean .21.1 ? 

Classic typo I interestingly also did several times the last week. :)

 
 AMD Opteron and on a Intel Xeon, on both it works fine!
 
 After this test I patch the kernel with the current prempt-patch and on
 both it doesn't works! 
 
 Did you try against 2.6.23-rt1.

kvm in -rt1 is not usable. It's too old, lacking PREEMPT_NOTIFIER
support, thus quickly triggering lockdep.

 
 If you must stay on .21, you might have some other issues with the AMD
 and NUMA.
 
 At the very least, you will need to apply the attached patch from git
 somehow, although this patch is against a new scheduler post 2.6.22, so
 good luck :)

--snip--

Those patches are already mainline... :-

What you rather need are the latest kvm patches, or - if building the kvm
distribution out of tree - a patch to enable CONFIG_PREEMPT_NOTIFIERS
unconditionally:

--- linux-2.6.23.1-rt/kernel/Kconfig.preempt.orig
+++ linux-2.6.23.1-rt/kernel/Kconfig.preempt
@@ -136,6 +136,7 @@

 config PREEMPT_NOTIFIERS
bool
+   default y

 config PREEMPT_BKL
bool


Still, I'm seeing oopses here (more precisely, lock validator
complaints), but I need to re-test, better using kvm from git instead of
kvm-48.

Beyond this, I'm struggling to understand 300-400 us vm-exit latencies
(over Intel VMX), which appear to be independent of the underlying
system. See kvm-devel. Such latencies would limit the RT usability of
kvm - unless you dedicate CPUs to it.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [PATCH -v2 0/7] New RT Task Balancing -v2

2007-10-23 Thread Ingo Molnar

* Steven Rostedt [EMAIL PROTECTED] wrote:

   Changes since V1:
 Updated to git tree 55b70a0300b873c0ec7ea6e33752af56f41250ce
 
 Various clean ups suggested by Gregory Haskins, Dmitry Adamushko,
 and Peter Zijlstra.

ok, i like this new approach - nice work! I'd suggest we test it in -rt 
for some time and then merge it into v2.6.25?

Ingo


Re: KVM and Prempt?

2007-10-23 Thread Gregory Haskins
On Tue, 2007-10-23 at 10:35 +0200, Jan Kiszka wrote:
 Sven-Thorsten Dietrich wrote:
  On Mon, 2007-10-22 at 09:01 +0200, Back, Michael (ext) wrote:
  Hallo,
  I tried to run Windows XP with KVM on Linux 2.6.31.1 on a 
  
  You mean .21.1 ? 
 
 Classic typo I interestingly also did several times the last week. :)
 
  
  AMD Opteron and on a Intel Xeon, on both it works fine!
  
  After this test I patch the kernel with the current prempt-patch and on
  both it doesn't works! 
  
  Did you try against 2.6.23-rt1.
 
 kvm in -rt1 is not usable. It's too old, lacking PREEMPT_NOTIFIER
 support, thus quickly triggering lockdep.
 
  
  If you must stay on .21, you might have some other issues with the AMD
  and NUMA.
  
  At the very least, you will need to apply the attached patch from git
  somehow, although this patch is against a new scheduler post 2.6.22, so
  good luck :)
 
 --snip--
 
 Those patches are already mainline... :-
 
 What you rather need are latest kvm patches, or - if building the kvm
 distribution out of tree - a patch to enabled CONFIG_PREEMPT_NOTIFIERS
 unconditionally:
 
 --- linux-2.6.23.1-rt/kernel/Kconfig.preempt.orig
 +++ linux-2.6.23.1-rt/kernel/Kconfig.preempt
 @@ -136,6 +136,7 @@
 
  config PREEMPT_NOTIFIERS
   bool
 + default y
 
  config PREEMPT_BKL
   bool
 
 
 Still, I'm seeing oopses here (more precisely, lock validator
 complaints), but I need to re-test, better using kvm from git instead of
 kvm-48.
 
 Beyond this, I'm struggling to understand 300-400 us vm-exit latencies
 (over Intel VMX), which appear to be independent of the underlying
 system. See kvm-devel. Such latencies would limit the RT usability of
 kvm - unless you spent dedicated CPUs.

Some work is still left to be done in this area.  Until then you will
see problems like you are describing.

At one point I had a patch series that allowed KVM to actually work in
-rt without crashes, and with decent latencies (both host, and guest).
Some point soon I will revive the series and port it to the latest
kvm.git.

HTH

Regards,
-Greg






Re: [PATCH -v2 4/7] RT overloaded runqueues accounting

2007-10-23 Thread Steven Rostedt

--
On Mon, 22 Oct 2007, Paul Menage wrote:

 On 10/22/07, Paul Jackson [EMAIL PROTECTED] wrote:
  Steven wrote:
   +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
   +{
   + cpu_set(cpu, task_cs(tsk)->rt_overload);
   +}
 
  Question for Steven:
 
What locks are held when cpuset_rt_set_overload() is called?

Right now only the rq lock. I know it's not enough. This needs to be
fixed. These are called in the heart of the scheduler so...

 
  Questions for Paul Menage:
 
Does 'tsk' need to be locked for the above task_cs() call?

 Cgroups doesn't change the locking rules for accessing a cpuset from a
 task - you have to have one of:

 - task_lock(task)

I'm not sure of the lock nesting between task_lock(task) and rq->lock.
If this nesting doesn't yet exist, then I can add code to do the
task_lock.  But if the rq->lock is taken within the task_lock somewhere,
then this can't be used.


 - callback_mutex

Can schedule, so it's definitely out.


 - be in an RCU section from the point when you call task_cs to the
 point when you stop using its result. (Additionally, in this case
 there's no guarantee that the task stays in this cpuset for the
 duration of the RCU section).

This may also be an option. Having interrupts disabled for the entire
time the cpusets are used should already keep RCU grace periods from
moving forward.

But let me give you two some background to what I'm trying to solve. And
then I'll get to where cpusets come in.

Currently the mainline vanilla kernel does not handle migrating RT tasks
well. And this problem also exists (but not as badly) in the -rt patch.
When a RT task is queued on a CPU that is running an even higher priority
RT task, it should be pushed off to another CPU that is running a lower
priority task. But this does not always happen and a RT task may take
several milliseconds before it gets a chance to run.

This latency is not acceptable for RT tasks. So I added logic to push and
pull RT tasks to and from CPUs.  The push happens when an RT task wakes up
and can't preempt the RT task running on the same CPU, or when a lower RT
task is preempted by a higher one. The lower may be pushed to another CPU.

This alone is not enough to cover RT migration. We also need to pull RT
tasks to a CPU if that CPU is lowering its priority (a high priority RT
task has just gone to sleep).

Ideally, and when CONFIG_CPUSETS is not defined, I keep track of all CPUs
that have more than one RT task queued to run on them.  This I call an RT
overload.  There's an RT overload bitmask that keeps track of the CPUs
that have more than one RT task queued.  When a CPU stops running a high
priority RT task, a search is made of all the CPUs that are in the RT
overload state to see if there exists an RT task that can migrate to the
CPU that is lowering its priority.
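
The overload bookkeeping described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions,
not the kernel code: NR_CPUS, NO_WAITER, struct cpu_model,
set_cpu_state() and find_pull_source() are hypothetical names invented
for the sketch, there is no locking, and a lower number means a higher
priority as in the scheduler.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS   8     /* hypothetical machine size for this model */
#define NO_WAITER 1000  /* sentinel: no queued-but-not-running RT task */

struct cpu_model {
	int rt_nr_running;      /* RT tasks queued here, including the running one */
	int best_waiting_prio;  /* best (numerically lowest) prio among waiting tasks */
};

static struct cpu_model cpus[NR_CPUS];
static uint32_t rt_overload_mask;  /* bit n set: CPU n has more than one RT task */

/* Record a CPU's state and keep its bit in the overload mask in sync. */
static void set_cpu_state(int cpu, int rt_nr_running, int best_waiting_prio)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].rt_nr_running = rt_nr_running;
	cpus[cpu].best_waiting_prio =
		rt_nr_running > 1 ? best_waiting_prio : NO_WAITER;

	if (rt_nr_running > 1)
		rt_overload_mask |= (uint32_t)1 << cpu;
	else
		rt_overload_mask &= ~((uint32_t)1 << cpu);
}

/*
 * this_cpu is about to drop to new_prio (lower number = higher priority).
 * Search only the overloaded CPUs for the best waiting task that would
 * outrank new_prio. Returns the CPU to pull from, or -1 if there is none.
 */
static int find_pull_source(int this_cpu, int new_prio)
{
	int best_cpu = -1;
	int best_prio = new_prio;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu)
			continue;
		if (!(rt_overload_mask & ((uint32_t)1 << cpu)))
			continue;  /* not overloaded: nothing to pull */
		if (cpus[cpu].best_waiting_prio < best_prio) {
			best_prio = cpus[cpu].best_waiting_prio;
			best_cpu = cpu;
		}
	}
	return best_cpu;
}
```

In this model a CPU with a single RT task never appears in the mask, so
the pull path only ever inspects CPUs that actually have a task waiting.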

Ingo Molnar and Peter Zijlstra pointed out to me that this global cpumask
would kill performance on 64 CPU boxes due to cacheline bouncing. To
solve this issue, I placed the RT overload mask into the cpusets.

Ideally, the RT overload mask would keep track of all CPUs that have tasks
that can run on a given CPU. In other words, a cpuset from the point of
view of the CPU (or runqueue).  But this is overkill for the RT migration
code.  Large # CPU boxes probably don't have the RT balancing problems
that small # CPU boxes have, since there is more CPU resource to run RT
tasks on.

Using cpusets seemed to be a nice place to add functionality to keep the
RT overload code from crippling large # CPU boxes.

The RT overload mask is bound to the CPU (runqueue) and not to the task.
But to get to the cpuset, I needed to go through the task. The task I used
was whatever was currently running on the given CPU, or sometimes the task
that was being pushed to a CPU.

Due to overlapping cpusets, there can be inconsistencies between the RT
overload mask and actual CPUS that are in the overload state. This is
tolerated, as the more CPUS you have, the less of a problem it should be
to have overloaded CPUS. Remember, the push task doesn't use the overload.
Only the pull does, and that happens when a push didn't succeed. With more
CPUS, pushes are more likely to succeed.

So the switch to use cpusets was more to keep the RT balancing code from
hurting large SMP boxes than to actually be correct on those boxes.
The RT balance is much more important when the CPU resource is limited.
The cpusets were picked just because it seemed reasonable that most of the
time a cpuset of one task on a runqueue would equal that of another task
on the same runqueue. But the code is good enough if that's not the
case.

My code can handle inconsistencies between the RT overload mask and actual
overloaded CPUS. So what I need to really protect with regards to cpusets
is from them disappearing and causing an oops.  Whether or not a task
comes and goes from a cpuset is not the important part.  The RT balance
code only uses the cpuset to determine what other RT tasks are 

How to debug complete kernel lock-ups

2007-10-23 Thread John Sigler

Hello everyone,

I have an x86 system with two PCI slots, in which I inserted two
specialized output cards (Dektec DTA-105).

http://www.dektec.com/Products/DTA-105/
(They provide an open source driver.)

My problem is: when I write to the 4 ports (each card has 2 ports) at 
the same time (not really at the same time because I have a 
uni-processor system, so within a short time frame is more accurate) 
the system *completely* locks up.


The manufacturer told me they had seen the problem in their lab. I'm 
just trying to provide some helpful debug output to speed up the process 
of fixing the problem :-)


I've built a debug 2.6.22.1-rt9 kernel, hoping to get the kernel to dump 
something, anything.


+CONFIG_KALLSYMS_ALL=y
+CONFIG_PCI_DEBUG=y
+CONFIG_DEBUG_DRIVER=y
+CONFIG_PRINTK_TIME=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_SHIRQ=y
+CONFIG_DETECT_SOFTLOCKUP=y
+CONFIG_DEBUG_SLAB=y
+CONFIG_DEBUG_SLAB_LEAK=y
+CONFIG_DEBUG_PREEMPT=y
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_PI_LIST=y
+CONFIG_RT_MUTEX_TESTER=y
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_LOCKDEP=y
+CONFIG_TRACE_IRQFLAGS=y
+CONFIG_DEBUG_SPINLOCK_SLEEP=y
+CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
+CONFIG_STACKTRACE=y
+CONFIG_PREEMPT_TRACE=y
+CONFIG_DEBUG_BUGVERBOSE=y
+CONFIG_DEBUG_INFO=y
+CONFIG_FRAME_POINTER=y
+CONFIG_FORCED_INLINING=y
+CONFIG_DEBUG_STACKOVERFLOW=y
+CONFIG_DEBUG_RODATA=y
+CONFIG_4KSTACKS=y

I've enabled the serial console, and used SysRq to bump the console 
level to 9 (I want everything, even KERN_DEBUG output).


I've enabled the IO-APIC watchdog (nmi_watchdog=1).

Once the system locks up, I get no output, no panic, no oops.
The serial console is frozen, my ssh sessions are frozen.

Suppose the PCI bus crashes (whatever that means) or locks up.
Would that make the system completely unresponsive? The I/O does have to 
get to/from the south bridge, through the PCI bus AFAIU. I can imagine 
that a locked PCI bus would be slightly problematic.


Does this mean I need some kind of PCI bus analyzer (i.e. hardware) at 
this point? Is there anything more I can try?


Regards.


Test posted to wiki

2007-10-23 Thread Gregory Haskins
http://rt.wiki.kernel.org/index.php/Preemption_Test

Thanks to Darren Hart for fixing the permissions on the site for me.
And thanks to Steven Rostedt for inspiring this test.

(Steve, feel free to edit the page to include your test as well)

-Greg




[PATCH 00/13] Balance RT tasks v5

2007-10-23 Thread Gregory Haskins
This is version 5 of the patch series against 23-rt1.

There have been numerous fixes/tweaks since v4, though we still are based on
the global rto_cpumask logic instead of Steve/Ingo's cpuset logic.  Otherwise,
it's in pretty good shape.

Without the series applied, the following test will fail:

ftp://ftp.novell.com/dev/ghaskins/preempt-test-latest.tar.bz2

After it is applied, it will pass.

NOTE: it appears that the series also introduces wake-latency spikes that
are not present in the baseline code, so this is still RFC quality.
However, the baseline scheduler also violates priority order, so it's hard to
determine if the numbers translate apples to apples.  These issues are still
under investigation, but I am sharing the series now so that Steven Rostedt
and Darren Hart can have access to my current tree.  The issues appear to be
caused by some other strange scheduling decisions (such as running the idle
thread while we are busy).  TBD 


[PATCH 01/13] RT: push-rt

2007-10-23 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched.c|  141 ++---
 kernel/sched_rt.c |   44 +
 2 files changed, 178 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3e75c62..0dabf89 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -304,6 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
unsigned long rt_nr_running;
unsigned long rt_nr_uninterruptible;
+   int curr_prio;
 #endif
 
unsigned long switch_timestamp;
@@ -1484,6 +1485,123 @@ next_in_queue:
 
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 
+/* Only try this algorithm three times */
+#define RT_PUSH_MAX_TRIES 3
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
+ struct task_struct *task,
+ struct rq *this_rq)
+{
+   struct rq *lowest_rq = NULL;
+   int dst_cpu = -1;
+   int cpu;
+   int tries;
+
+   for (tries = 0; tries < RT_PUSH_MAX_TRIES; tries++) {
+   /*
+* Scan each rq for the lowest prio.
+*/
+   for_each_cpu_mask(cpu, *cpu_mask) {
+   struct rq *rq = &per_cpu(runqueues, cpu);
+
+   if (cpu == smp_processor_id())
+   continue;
+
+   /* We look for lowest RT prio or non-rt CPU */
+   if (rq->curr_prio >= MAX_RT_PRIO) {
+   lowest_rq = rq;
+   dst_cpu = cpu;
+   break;
+   }
+
+   /* no locking for now */
+   if (rq->curr_prio > task->prio &&
+   (!lowest_rq || rq->curr_prio > lowest_rq->curr_prio)) {
+   lowest_rq = rq;
+   dst_cpu = cpu;
+   }
+   }
+
+   if (!lowest_rq)
+   break;
+
+   /* if the prio of this runqueue changed, try again */
+   if (double_lock_balance(this_rq, lowest_rq)) {
+   /*
+* We had to unlock the run queue. In
+* the mean time, task could have
+* migrated already or had its affinity changed.
+*/
+   if (unlikely(task_rq(task) != this_rq ||
+!cpu_isset(dst_cpu, task->cpus_allowed))) {
+   spin_unlock(&lowest_rq->lock);
+   lowest_rq = NULL;
+   break;
+   }
+
+   }
+
+   /* If this rq is still suitable use it. */
+   if (lowest_rq->curr_prio > task->prio)
+   break;
+
+   /* try again */
+   spin_unlock(&lowest_rq->lock);
+   lowest_rq = NULL;
+   }
+
+   return lowest_rq;
+}
+
+/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+   struct task_struct *next_task;
+   struct rq *lowest_rq;
+   int dst_cpu;
+   int ret = 0;
+   cpumask_t cpu_mask;
+
+   assert_spin_locked(&this_rq->lock);
+
+   next_task = rt_next_highest_task(this_rq);
+   if (!next_task)
+   return 0;
+
+   cpus_and(cpu_mask, cpu_online_map, next_task->cpus_allowed);
+
+   /* We might release this_rq lock */
+   get_task_struct(next_task);
+
+   /* find_lock_lowest_rq locks the rq if found */
+   lowest_rq = find_lock_lowest_rq(&cpu_mask, next_task, this_rq);
+   if (!lowest_rq)
+   goto out;
+
+   dst_cpu = lowest_rq->cpu;
+
+   assert_spin_locked(&lowest_rq->lock);
+
+   deactivate_task(this_rq, next_task, 0);
+   set_task_cpu(next_task, dst_cpu);
+   activate_task(lowest_rq, next_task, 0);
+
+   resched_task(lowest_rq->curr);
+
+   spin_unlock(&lowest_rq->lock);
+
+   ret = 1;
+out:
+   put_task_struct(next_task);
+
+   return ret;
+}
+
 /*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
@@ -2202,19 +2320,28 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 * be dropped twice.
 *  Manfred Spraul [EMAIL PROTECTED]
 */
+   prev_state = prev->state;
+   _finish_arch_switch(prev);
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+   rq->curr_prio = current->prio;
+#endif
+   finish_lock_switch(rq, prev);
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
/*
 * If we 

[PATCH 02/13] RT: Condense the next-task search into one function

2007-10-23 Thread Gregory Haskins
We inadvertently added a redundant function, so clean it up

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|9 +
 kernel/sched_rt.c |   44 
 2 files changed, 5 insertions(+), 48 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0dabf89..daeb8ed 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1471,9 +1471,10 @@ next_in_queue:
 * Return the highest-prio non-running RT task (if task
 * may run on this CPU):
 */
-   if (!task_running(src_rq, tmp) &&
-   cpu_isset(this_cpu, tmp->cpus_allowed))
-   return tmp;
+   if (!task_running(src_rq, tmp)) {
+   if ((this_cpu == -1) || cpu_isset(this_cpu, tmp->cpus_allowed))
+   return tmp;
+   }
 
curr = curr->next;
if (curr != head)
@@ -1569,7 +1570,7 @@ static int push_rt_task(struct rq *this_rq)
 
assert_spin_locked(&this_rq->lock);
 
-   next_task = rt_next_highest_task(this_rq);
+   next_task = pick_rt_task(this_rq, -1);
if (!next_task)
return 0;
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 8d59e62..369827b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -96,50 +96,6 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
return next;
 }
 
-#ifdef CONFIG_PREEMPT_RT
-static struct task_struct *rt_next_highest_task(struct rq *rq)
-{
-   struct rt_prio_array *array = &rq->rt.active;
-   struct task_struct *next;
-   struct list_head *queue;
-   int idx;
-
-   if (likely (rq->rt_nr_running < 2))
-   return NULL;
-
-   idx = sched_find_first_bit(array->bitmap);
-   if (idx >= MAX_RT_PRIO) {
-   WARN_ON(1); /* rt_nr_running is bad */
-   return NULL;
-   }
-
-   queue = array->queue + idx;
-   next = list_entry(queue->next, struct task_struct, run_list);
-   if (unlikely(next != current))
-   return next;
-
-   if (queue->next->next != queue) {
-   /* same prio task */
-   next = list_entry(queue->next->next, struct task_struct, run_list);
-   goto out;
-   }
-
-   /* slower, but more flexible */
-   idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-   if (idx >= MAX_RT_PRIO) {
-   WARN_ON(1); /* rt_nr_running was 2 and above! */
-   return NULL;
-   }
-
-   queue = array->queue + idx;
-   next = list_entry(queue->next, struct task_struct, run_list);
-
- out:
-   return next;
-
-}
-#endif /* CONFIG_PREEMPT_RT */
-
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
update_curr_rt(rq);



[PATCH 04/13] RT: Wrap the RQ notion of priority to make it conditional

2007-10-23 Thread Gregory Haskins
A little cleanup to avoid #ifdef proliferation later in the series

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c |   16 +---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e22eec7..dfd0b92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -365,6 +365,16 @@ struct rq {
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 static DEFINE_MUTEX(sched_hotcpu_mutex);
 
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+static inline void set_rq_prio(struct rq *rq, int prio)
+{
+   rq->curr_prio = prio;
+}
+
+#else
+#define set_rq_prio(rq, prio) do { } while(0)
+#endif
+
 static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
 {
rq->curr->sched_class->check_preempt_curr(rq, p);
@@ -2329,9 +2339,9 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 */
prev_state = prev->state;
_finish_arch_switch(prev);
-#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
-   rq->curr_prio = current->prio;
-#endif
+
+   set_rq_prio(rq, current->prio);
+
finish_lock_switch(rq, prev);
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
/*



[PATCH 06/13] RT: Maintain the highest RQ priority

2007-10-23 Thread Gregory Haskins
This is an implementation of Steve's idea where we should update the RQ
concept of priority to show the highest-task, even if that task is not (yet)
running.  This prevents us from pushing multiple tasks to the RQ before it
gets a chance to reschedule.
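
A minimal userspace sketch of that idea follows. It is a model under
stated assumptions, not the patch itself: struct rq_model and rq_init()
are hypothetical, a per-priority counter array with a linear scan stands
in for the kernel's priority bitmap search, and MAX_RT_PRIO is used as
the "no RT task queued" value. The point it illustrates is that the
cached priority changes at enqueue/dequeue time, before any context
switch happens.

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* RT priorities 0..99; lower number = higher priority */

struct rq_model {
	unsigned int nr_queued[MAX_RT_PRIO];  /* queued RT tasks per priority */
	int highest_prio;                     /* MAX_RT_PRIO when nothing is queued */
};

static void rq_init(struct rq_model *rq)
{
	for (int i = 0; i < MAX_RT_PRIO; i++)
		rq->nr_queued[i] = 0;
	rq->highest_prio = MAX_RT_PRIO;
}

/* Recompute the cached value from the queue contents. */
static void update_rq_prio(struct rq_model *rq)
{
	int prio = 0;

	while (prio < MAX_RT_PRIO && rq->nr_queued[prio] == 0)
		prio++;
	rq->highest_prio = prio;
}

/* Queueing a task updates the priority at once, even if it is not running yet. */
static void enqueue_task(struct rq_model *rq, int prio)
{
	assert(prio >= 0 && prio < MAX_RT_PRIO);
	rq->nr_queued[prio]++;
	update_rq_prio(rq);
}

static void dequeue_task(struct rq_model *rq, int prio)
{
	assert(prio >= 0 && prio < MAX_RT_PRIO);
	assert(rq->nr_queued[prio] > 0);
	rq->nr_queued[prio]--;
	update_rq_prio(rq);
}
```

Because a remote CPU sees the new value as soon as the task is queued,
a second push aimed at the same runqueue compares against the pending
task rather than against whatever is still running there.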

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c |   34 +-
 1 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c4fba8..c17e2e4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -304,7 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
unsigned long rt_nr_running;
unsigned long rt_nr_uninterruptible;
-   int curr_prio;
+   int highest_prio;
 #endif
 
unsigned long switch_timestamp;
@@ -368,11 +368,20 @@ static DEFINE_MUTEX(sched_hotcpu_mutex);
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static inline void set_rq_prio(struct rq *rq, int prio)
 {
-   rq->curr_prio = prio;
+   rq->highest_prio = prio;
+}
+
+static inline void update_rq_prio(struct rq *rq)
+{
+   struct rt_prio_array *array = &rq->rt.active;
+   int   prio  = sched_find_first_bit(array->bitmap);
+
+   set_rq_prio(rq, prio);
 }
 
 #else
 #define set_rq_prio(rq, prio) do { } while(0)
+#define update_rq_prio(rq)do { } while(0)
 #endif
 
 static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
@@ -1023,12 +1032,14 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
+   update_rq_prio(rq);
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
+   update_rq_prio(rq);
 }
 
 /*
@@ -1526,15 +1537,15 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
continue;
 
/* We look for lowest RT prio or non-rt CPU */
-   if (rq->curr_prio >= MAX_RT_PRIO) {
+   if (rq->highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
dst_cpu = cpu;
break;
}
 
/* no locking for now */
-   if (rq->curr_prio > task->prio &&
-   (!lowest_rq || rq->curr_prio > lowest_rq->curr_prio)) {
+   if (rq->highest_prio > task->prio &&
+   (!lowest_rq || rq->highest_prio > lowest_rq->highest_prio)) {
lowest_rq = rq;
dst_cpu = cpu;
}
@@ -1560,7 +1571,7 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
}
 
/* If this rq is still suitable use it. */
-   if (lowest_rq->curr_prio > task->prio)
+   if (lowest_rq->highest_prio > task->prio)
break;
 
/* try again */
@@ -2339,10 +2350,8 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 */
prev_state = prev->state;
_finish_arch_switch(prev);
-
-   set_rq_prio(rq, current->prio);
-
finish_lock_switch(rq, prev);
+
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
/*
 * If we pushed an RT task off the runqueue,
@@ -4647,6 +4656,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
prev_resched = _need_resched();
 
if (on_rq) {
+   /*
+* Note: RQ priority gets updated in the enqueue/dequeue logic
+*/
enqueue_task(rq, p, 0);
/*
 * Reschedule if we are currently running on this runqueue and
@@ -4713,6 +4725,10 @@ void set_user_nice(struct task_struct *p, long nice)
 */
if (delta < 0 || (delta > 0 && task_running(rq, p)))
resched_task(rq->curr);
+
+   /*
+* Note: RQ priority gets updated in the enqueue/dequeue logic
+*/
}
 out_unlock:
task_rq_unlock(rq, flags);



[PATCH 08/13] RT: Add support for low-priority wake-up to push_rt feature

2007-10-23 Thread Gregory Haskins
There are three events that require consideration for redistributing RT
tasks:

1) When one or more higher-priority tasks preempts a lower-one from a
   RQ
2) When a lower-priority task is woken up on a RQ
3) When a RQ downgrades its current priority

Steve Rostedt's push_rt patch addresses (1).  It hooks in right after
a new task has been switched-in.  If this was the result of an RT
preemption, or if more than one task was awoken at the same time, we
can try to push some of those other tasks away.

This patch addresses (2).  When we wake up a task, we check to see
if it would preempt the current task on the queue.  If it will not, we
attempt to find a better suited CPU (e.g. one running something lower
priority than the task being woken) and try to activate the task there.

Finally, we have (3).  In theory, we only need to balance_rt_tasks() if
the following conditions are met:
   1) One or more CPUs are in overload, AND
   2) We are about to switch to a task that lowers our priority.

(3) will be addressed in a later patch.
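
The wake-up placement decision for case (2) can be sketched in
userspace roughly as follows. This is a simplified model, not the code
in the patch: select_wake_cpu() and its array-of-priorities interface
are hypothetical, affinity masks and locking are left out, equal
priority is treated as "stay on the original CPU" (an assumption of the
sketch), and a lower number means a higher priority.

```c
#include <assert.h>

#define MAX_RT_PRIO 100  /* values >= MAX_RT_PRIO mean "no RT task on that CPU" */

/*
 * highest_prio[cpu] is the best priority currently queued on each CPU.
 * Returns the CPU on which a waking task of priority task_prio should
 * be activated.
 */
static int select_wake_cpu(const int *highest_prio, int nr_cpus,
			   int target_cpu, int task_prio)
{
	int best_cpu = target_cpu;
	int best_prio = task_prio;

	assert(nr_cpus > 0 && target_cpu >= 0 && target_cpu < nr_cpus);

	/* Nothing more urgent is queued on the original runqueue: stay. */
	if (task_prio <= highest_prio[target_cpu])
		return target_cpu;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu == target_cpu)
			continue;
		/* A CPU running no RT work at all is the best possible target. */
		if (highest_prio[cpu] >= MAX_RT_PRIO)
			return cpu;
		/* Otherwise prefer the CPU running the least urgent RT task. */
		if (highest_prio[cpu] > best_prio) {
			best_prio = highest_prio[cpu];
			best_cpu = cpu;
		}
	}
	return best_cpu;  /* target_cpu if no CPU can be preempted */
}
```

If no other CPU runs something lower in priority than the waking task,
the function falls back to the original CPU and the task simply waits
there, which is where the later pull logic for case (3) would apply.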

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c |  109 
 1 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0600062..e536142 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1626,6 +1626,13 @@ out:
return ret;
 }
 
+/* Push all tasks that we can to other CPUs */
+static void push_rt_tasks(struct rq *this_rq)
+{
+   while (push_rt_task(this_rq))
+   ;
+}
+
 /*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
@@ -1986,6 +1993,46 @@ out_set_cpu:
this_cpu = smp_processor_id();
cpu = task_cpu(p);
}
+   
+#if defined(CONFIG_PREEMPT_RT)
+   /*
+* If a newly woken up RT task will not run immediately on its affined
+* RQ, try to find another CPU it can preempt:
+*/
+   if (rt_task(p) && (p->prio > rq->highest_prio)) {
+   struct rq *lowest_rq = find_lock_lowest_rq(p, rq);
+
+   if (lowest_rq) {
+   /*
+* We may have dropped this_rq->lock, so check to be
+* sure we are still eligible to wake up this task
+* somewhere...
+*
+* Basically the task could already be running on this
+* RQ, or it could have already migrated away to a
+* different RQ.
+*/
+   if (!task_running(rq, p) && (task_rq(p) == rq)) {
+   set_task_cpu(p, lowest_rq->cpu);
+   spin_unlock(&rq->lock);
+
+   /*
+* The new lock was already acquired in
+* find_lowest
+*/
+   rq  = lowest_rq;
+   cpu = task_cpu(p);
+   } else
+   spin_unlock(&lowest_rq->lock);
+   }
+
+   old_state = p->state;
+   if (!(old_state & state))
+   goto out;
+   if (p->se.on_rq)
+   goto out_running;
+   }
+#endif /* defined(CONFIG_PREEMPT_RT) */
 
 out_activate:
 #endif /* CONFIG_SMP */
@@ -1995,51 +2042,20 @@ out_activate:
trace_start_sched_wakeup(p, rq);
 
/*
-* If a newly woken up RT task cannot preempt the
-* current (RT) task (on a target runqueue) then try
-* to find another CPU it can preempt:
+* Sync wakeups (i.e. those types of wakeups where the waker
+* has indicated that it will leave the CPU in short order)
+* don't trigger a preemption, if the woken up task will run on
+* this cpu. (in this case the 'I will reschedule' promise of
+* the waker guarantees that the freshly woken up task is going
+* to be considered on this CPU.)
 */
-   if (rt_task(p) && !TASK_PREEMPTS_CURR(p, rq)) {
-   struct rq *this_rq = cpu_rq(this_cpu);
-   /*
-* Special-case: the task on this CPU can be
-* preempted. In that case there's no need to
-* trigger reschedules on other CPUs, we can
-* mark the current task for reschedule.
-*
-* (Note that it's safe to access this_rq without
-* extra locking in this particular case, because
-* we are on the current CPU.)
-*/
-   if (TASK_PREEMPTS_CURR(p, this_rq))
-   set_tsk_need_resched(this_rq->curr);
-   else
-   /*
-* Neither the intended target runqueue
-* 

[PATCH 09/13] RT: Only dirty a cacheline if the priority is actually changing

2007-10-23 Thread Gregory Haskins
We can avoid dirtying a rq related cacheline with a simple check, so why not.
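
As a rough illustration of the check-before-write pattern, here is a
small userspace model. It is a sketch only: struct rq_model and its
stores counter are hypothetical, and the counter exists purely so the
number of writes to highest_prio can be observed. Real cacheline
behaviour cannot be demonstrated this way; the model only shows that a
redundant update performs no store to the field.

```c
#include <assert.h>

struct rq_model {
	int highest_prio;
	unsigned long stores;  /* instrumentation only: writes to highest_prio */
};

static void set_rq_prio(struct rq_model *rq, int prio)
{
	rq->highest_prio = prio;
	rq->stores++;
}

/* Only write when the value changes, so an unchanged line is merely read. */
static void update_rq_prio(struct rq_model *rq, int prio)
{
	assert(prio >= 0);
	if (rq->highest_prio != prio)
		set_rq_prio(rq, prio);
}
```

A read of a shared line can be satisfied from a shared copy, while a
write forces other CPUs to drop theirs, which is why skipping the
redundant store matters when remote CPUs poll this field.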

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e536142..1058a1f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -376,7 +376,8 @@ static inline void update_rq_prio(struct rq *rq)
struct rt_prio_array *array = &rq->rt.active;
int   prio  = sched_find_first_bit(array->bitmap);
 
-   set_rq_prio(rq, prio);
+   if (rq->highest_prio != prio)
+   set_rq_prio(rq, prio);
 }
 
 #else



[PATCH 10/13] RT: Fixes for push-rt patch

2007-10-23 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

Steve found these errors in the original patch

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|   15 -
 kernel/sched_rt.c |   90 +
 2 files changed, 15 insertions(+), 90 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1058a1f..a1f1d92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1535,7 +1535,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
for_each_cpu_mask(cpu, cpu_mask) {
struct rq *rq = &per_cpu(runqueues, cpu);
 
-   if (cpu == smp_processor_id())
+   if (cpu == this_rq->cpu)
continue;
 
/* We look for lowest RT prio or non-rt CPU */
@@ -1561,7 +1561,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 * the mean time, task could have
 * migrated already or had its affinity changed.
 */
-   if (unlikely(task_rq(task) != this_rq ||
+   if (unlikely(task_running(this_rq, task) ||
+task_rq(task) != this_rq ||
 !cpu_isset(lowest_rq->cpu, task->cpus_allowed))) {
spin_unlock(&lowest_rq->lock);
lowest_rq = NULL;
@@ -2380,6 +2381,7 @@ static inline void finish_task_switch(struct rq *rq, 
struct task_struct *prev)
}
 
 #endif
+
fire_sched_in_preempt_notifiers(current);
trace_stop_sched_switched(current);
/*
@@ -4102,6 +4104,15 @@ asmlinkage void __sched __schedule(void)
context_switch(rq, prev, next); /* unlocks the rq */
__preempt_enable_no_resched();
} else {
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+   /*
+* If we hit the condition where we do not need to actually
+* reschedule, we need to check if there are any tasks that
+* should be pushed away
+*/
+   if (unlikely(rq->rt_nr_running > 1))
+   push_rt_tasks(rq);
+#endif
__preempt_enable_no_resched();
spin_unlock(&rq->lock);
trace_stop_sched_switched(next);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 369827b..9b677c1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -102,100 +102,14 @@ static void put_prev_task_rt(struct rq *rq, struct 
task_struct *p)
p->se.exec_start = 0;
 }
 
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = &rq->rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = sched_find_first_bit(array->bitmap);
-   if (idx >= MAX_RT_PRIO)
-   return NULL;
-
-   head = array->queue + idx;
-   curr = head->prev;
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr->prev;
-
-   rq->rt.rt_load_balance_idx = idx;
-   rq->rt.rt_load_balance_head = head;
-   rq->rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = &rq->rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = rq->rt.rt_load_balance_idx;
-   head = rq->rt.rt_load_balance_head;
-   curr = rq->rt.rt_load_balance_curr;
-
-   /*
-* If we arrived back to the head again then
-* iterate to the next queue (if any):
-*/
-   if (unlikely(head == curr)) {
-   int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
-   if (next_idx >= MAX_RT_PRIO)
-   return NULL;
-
-   idx = next_idx;
-   head = array->queue + idx;
-   curr = head->prev;
-
-   rq->rt.rt_load_balance_idx = idx;
-   rq->rt.rt_load_balance_head = head;
-   }
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr->prev;
-
-   rq->rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_nr_move, unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio)
 {

[PATCH 11/13] RT: Condense NORMAL and IDLE priorities

2007-10-23 Thread Gregory Haskins
We only need to track whether the CPU is in a non-RT state, as opposed to its
priority within the non-RT state.  So simplify the setting in an effort to
reduce cache thrashing.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
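Editor's sketch of the flattening rule described above, as a standalone
function.  The FLAT_* constants are assumptions that mirror the usual values
of MAX_RT_PRIO (100) and MAX_PRIO (MAX_RT_PRIO + 40); flat_prio() is a
hypothetical helper, not something the patch adds.

```c
#include <assert.h>

/* Assumed values, mirroring MAX_RT_PRIO and MAX_PRIO. */
#define FLAT_MAX_RT_PRIO 100
#define FLAT_MAX_PRIO    (FLAT_MAX_RT_PRIO + 40)

/*
 * Collapse every non-RT priority onto one value so that changes among
 * normal priorities never alter the tracked number.  The idle marker
 * (FLAT_MAX_PRIO) and the RT range pass through unchanged.
 */
static int flat_prio(int prio)
{
	assert(prio >= 0 && prio <= FLAT_MAX_PRIO);
	if (prio != FLAT_MAX_PRIO && prio > FLAT_MAX_RT_PRIO)
		prio = FLAT_MAX_RT_PRIO;
	return prio;
}
```

With this rule a runqueue moving between, say, priority 110 and 125 keeps
reporting the same flattened value, so a change-only store has nothing to
write.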

 kernel/sched.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a1f1d92..4abe738 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -371,11 +371,19 @@ static inline void set_rq_prio(struct rq *rq, int prio)
rq->highest_prio = prio;
 }
 
+/*
+ * We don't care what the exact normal priority is.  We only care about
+ * RT-priority, vs non-RT (normal or idle).  So flatten the priority if it's a
+ * non-RT variety. This will reduce cache-thrashing on the rq->highest_prio.
+ */
 static inline void update_rq_prio(struct rq *rq)
 {
struct rt_prio_array *array = &rq->rt.active;
int   prio  = sched_find_first_bit(array->bitmap);
 
+   if ((prio != MAX_PRIO) && (prio > MAX_RT_PRIO))
+   prio = MAX_RT_PRIO;
+
if (rq->highest_prio != prio)
set_rq_prio(rq, prio);
 }



[PATCH 12/13] RT: CPU priority management

2007-10-23 Thread Gregory Haskins
This code tracks the priority of each CPU so that global migration
  decisions are easy to calculate.  Each CPU can be in a state as follows:

 (INVALID), IDLE, NORMAL, RT1, ... RT99

  going from the lowest priority to the highest.  CPUs in the INVALID state
  are not eligible for routing.  The system maintains this state with
  a 2 dimensional bitmap (the first for priority class, the second for cpus
  in that class).  Therefore a typical application without affinity
  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  searches).  For tasks with affinity restrictions, the algorithm has a
  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
  yields the worst case search is fairly contrived.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
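Editor's sketch of the two-level bitmap described above: one bitmap says
which priority classes currently hold CPUs, and one CPU mask per class says
which CPUs those are.  Everything named cpm_* / CPM_* is hypothetical and
sized for a toy 8-CPU system; the portable loops stand in for the bit-search
primitives a real implementation would use, and the class numbering (0 =
IDLE, 1 = NORMAL, 2..100 = RT1..RT99, -1 = INVALID) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define CPM_NR_CPUS    8
#define CPM_NR_CLASSES 101	/* IDLE, NORMAL, RT1..RT99 */
#define CPM_INVALID    (-1)	/* CPU not eligible for routing */

struct cpm_map {
	uint64_t class_map[2];		/* bit c set: class c has CPUs */
	uint64_t cpus[CPM_NR_CLASSES];	/* per-class CPU mask */
	int cpu_class[CPM_NR_CPUS];	/* current class of each CPU */
};

static void cpm_init(struct cpm_map *m)
{
	int i;

	m->class_map[0] = m->class_map[1] = 0;
	for (i = 0; i < CPM_NR_CLASSES; i++)
		m->cpus[i] = 0;
	for (i = 0; i < CPM_NR_CPUS; i++)
		m->cpu_class[i] = CPM_INVALID;
}

/* Move a CPU to a new class, keeping both bitmap levels in sync. */
static void cpm_set(struct cpm_map *m, int cpu, int cls)
{
	int old;

	assert(cpu >= 0 && cpu < CPM_NR_CPUS);
	assert(cls >= CPM_INVALID && cls < CPM_NR_CLASSES);

	old = m->cpu_class[cpu];
	if (old == cls)
		return;
	if (old != CPM_INVALID) {
		m->cpus[old] &= ~(UINT64_C(1) << cpu);
		if (!m->cpus[old])
			m->class_map[old / 64] &= ~(UINT64_C(1) << (old % 64));
	}
	if (cls != CPM_INVALID) {
		m->cpus[cls] |= UINT64_C(1) << cpu;
		m->class_map[cls / 64] |= UINT64_C(1) << (cls % 64);
	}
	m->cpu_class[cpu] = cls;
}

/*
 * Return a CPU from the lowest-priority populated class that intersects
 * 'allowed', or -1 if there is none.  Without affinity limits the first
 * populated class always matches; with them, later classes may be tried.
 */
static int cpm_find(const struct cpm_map *m, uint64_t allowed)
{
	int cls, cpu;

	for (cls = 0; cls < CPM_NR_CLASSES; cls++) {
		uint64_t hit;

		if (!(m->class_map[cls / 64] & (UINT64_C(1) << (cls % 64))))
			continue;
		hit = m->cpus[cls] & allowed;
		if (!hit)
			continue;
		for (cpu = 0; cpu < CPM_NR_CPUS; cpu++)
			if (hit & (UINT64_C(1) << cpu))
				return cpu;
	}
	return -1;
}
```

Note that cpm_find() takes a plain mask and returns -1 on failure, which is
a simplification; the cpupri_find() call in the diff below takes the current
CPU and the task instead.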

 kernel/Makefile   |2 
 kernel/sched.c|   37 +++--
 kernel/sched_cpupri.c |  200 +
 kernel/sched_cpupri.h |   10 ++
 4 files changed, 222 insertions(+), 27 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o 
profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
-   utsname.o
+   utsname.o sched_cpupri.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 4abe738..bdb6be0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -67,6 +67,8 @@
 
 #include <asm/tlb.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -384,8 +386,10 @@ static inline void update_rq_prio(struct rq *rq)
if ((prio != MAX_PRIO) && (prio > MAX_RT_PRIO))
prio = MAX_RT_PRIO;
 
-   if (rq->highest_prio != prio)
+   if (rq->highest_prio != prio) {
+   cpupri_set(rq->cpu, prio);
set_rq_prio(rq, prio);
+   }
 }
 
 #else
@@ -1532,36 +1536,15 @@ static struct rq *find_lock_lowest_rq(struct 
task_struct *task,
struct rq *lowest_rq = NULL;
int cpu;
int tries;
-   cpumask_t cpu_mask;
-
-   cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
 
for (tries = 0; tries < RT_PUSH_MAX_TRIES; tries++) {
-   /*
-* Scan each rq for the lowest prio.
-*/
-   for_each_cpu_mask(cpu, cpu_mask) {
-   struct rq *rq = &per_cpu(runqueues, cpu);
-
-   if (cpu == this_rq->cpu)
-   continue;
+   cpu = cpupri_find(this_rq->cpu, task);
 
-   /* We look for lowest RT prio or non-rt CPU */
-   if (rq->highest_prio >= MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
-   }
-
-   /* no locking for now */
-   if (rq->highest_prio > task->prio &&
-   (!lowest_rq || rq-highest_prio  
lowest_rq-highest_prio)) {
-   lowest_rq = rq;
-   }
-   }
-
-   if (!lowest_rq)
+   if (cpu == this_rq->cpu)
break;
 
+   lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
if (double_lock_balance(this_rq, lowest_rq)) {
/*
@@ -7395,6 +7378,8 @@ void __init sched_init(void)
fair_sched_class.next = &idle_sched_class;
idle_sched_class.next = NULL;
 
+   cpupri_init();
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..5cb51ae
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,200 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  

[PATCH 13/13] RT: Cache cpus_allowed weight for optimizing migration

2007-10-23 Thread Gregory Haskins
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for one or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then during the handling of the timer, the system determines that it's
time to smp-rebalance the system so it schedules softirq_sched to run
from within the softirq_timer kthread context. So we are in a situation
where we have two RT50 tasks queued, and the system will go into the
rt-overload condition to ask other CPUs for help.

The problem is that these tasks cannot ever be pulled away since they
are already running on their one and only valid RQ.  However, the other
CPUs cannot determine that the tasks are unpullable without going
through expensive checks/locking.  Therefore the helping CPUs
experience unnecessary overhead/latencies regardless as they
ineffectively try to process the overload condition.

This patch tries to optimize the situation by utilizing the Hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated, which may be utilized by the overload
handling code to eliminate unnecessary rebalance attempts.  We also
introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

Calculating the weight is probably relatively expensive, so it is only
done when the cpus_allowed mask is updated (which should be relatively
infrequent, especially compared to scheduling frequency) and cached in
the task_struct.


Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
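Editor's sketch of the accounting described above: cache the mask weight
when the affinity changes, count migratable RT tasks per runqueue, and call
the runqueue overloaded only when it has more than one RT task and at least
one of them can move.  All ovl_* names are hypothetical, an unsigned long
stands in for cpumask_t, and the flag is simply recomputed after every
change, which is a simplification of the inc/dec helpers in the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's struct rq / task_struct. */
struct ovl_rq {
	unsigned long rt_nr_running;
	unsigned long rt_nr_migratory;
	bool overloaded;
};

struct ovl_task {
	unsigned long cpus_allowed;	/* toy mask, one bit per CPU */
	int nr_cpus_allowed;		/* cached Hamming weight */
};

/* Count set bits; done only when the mask changes, never per wakeup. */
static int ovl_weight(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

static void ovl_set_cpus_allowed(struct ovl_task *p, unsigned long mask)
{
	p->cpus_allowed = mask;
	p->nr_cpus_allowed = ovl_weight(mask);
}

/* More than one RT task AND at least one of them migratable. */
static void ovl_update(struct ovl_rq *rq)
{
	rq->overloaded = rq->rt_nr_running > 1 && rq->rt_nr_migratory > 0;
}

static void ovl_enqueue_rt(struct ovl_rq *rq, const struct ovl_task *p)
{
	rq->rt_nr_running++;
	if (p->nr_cpus_allowed > 1)
		rq->rt_nr_migratory++;
	ovl_update(rq);
}

static void ovl_dequeue_rt(struct ovl_rq *rq, const struct ovl_task *p)
{
	assert(rq->rt_nr_running > 0);
	rq->rt_nr_running--;
	if (p->nr_cpus_allowed > 1) {
		assert(rq->rt_nr_migratory > 0);
		rq->rt_nr_migratory--;
	}
	ovl_update(rq);
}
```

Two CPU-bound RT tasks queued together leave overloaded false in this
sketch, which is the softirq case from the description: other CPUs are never
asked to pull tasks that cannot move.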

 include/linux/sched.h |1 
 kernel/fork.c |1 
 kernel/sched.c|  121 +
 3 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a3829f..b657c13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1144,6 +1144,7 @@ struct task_struct {
 
unsigned int policy;
cpumask_t cpus_allowed;
+   int nr_cpus_allowed;
unsigned int time_slice;
 
 #ifdef CONFIG_PREEMPT_RCU
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f11f23..f808e18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1257,6 +1257,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
 */
preempt_disable();
p->cpus_allowed = current->cpus_allowed;
+   p->nr_cpus_allowed = current->nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p))))
set_task_cpu(p, smp_processor_id());
diff --git a/kernel/sched.c b/kernel/sched.c
index bdb6be0..9120b41 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -305,6 +305,7 @@ struct rq {
 
 #ifdef CONFIG_PREEMPT_RT
unsigned long rt_nr_running;
+   unsigned long rt_nr_migratory;
unsigned long rt_nr_uninterruptible;
int highest_prio;
 #endif
@@ -665,20 +666,43 @@ static inline struct rq *this_rq_lock(void)
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static __cacheline_aligned_in_smp atomic_t rt_overload;
 static cpumask_t rto_cpus;
-#endif
+
+static void inc_rt_migration(struct rq *rq)
+{
+   rq->rt_nr_migratory++;
+
+   if ((rq->rt_nr_running > 1) && !cpu_isset(rq->cpu, rto_cpus)) {
+   cpu_set(rq->cpu, rto_cpus);
+   smp_wmb();
+   atomic_inc(&rt_overload);
+   }
+}
+
+static void dec_rt_migration(struct rq *rq)
+{
+   WARN_ON(!rq->rt_nr_migratory);
+   rq->rt_nr_migratory--;
+
+   if (((rq->rt_nr_running <= 1) || !rq->rt_nr_migratory)
+   && cpu_isset(rq->cpu, rto_cpus)) {
+   atomic_dec(&rt_overload);
+   cpu_clear(rq->cpu, rto_cpus);
+   }
+}
+
+#else
+#define inc_rt_migration(rq) do { } while(0)
+#define dec_rt_migration(rq) do { } while(0)
+#endif /* defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP) */
+
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 {
 #ifdef CONFIG_PREEMPT_RT
if (rt_task(p)) {
rq->rt_nr_running++;
-# ifdef CONFIG_SMP
-   if (rq->rt_nr_running == 2) {
-   cpu_set(rq->cpu, rto_cpus);
-   smp_wmb();
-   atomic_inc(&rt_overload);
-   }
-# endif
+   if (p->nr_cpus_allowed > 1)
+   inc_rt_migration(rq);
}
 #endif
 }
@@ -689,12 +713,8 @@ static inline void dec_rt_tasks(struct task_struct *p, 
struct rq *rq)
if (rt_task(p)) {
WARN_ON(!rq->rt_nr_running);
rq->rt_nr_running--;
-# ifdef CONFIG_SMP
-   if (rq->rt_nr_running == 1) {
-   atomic_dec(&rt_overload);
-   cpu_clear(rq->cpu, rto_cpus);
-   }
-# endif