Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-26 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:03 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 

 +static inline void
 +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
 *adaptive)
 ...
 +#define prepare_adaptive_wait(lock, busy) {}
 
 This is evil. Use empty inline function instead (same for the other
 function, there you can maybe get away with it).
 

I went to implement your suggested change and I remembered why I did it this 
way:  I wanted a macro so that the struct adaptive_waiter local variable will 
fall away without an #ifdef in the main body of code.  So I have left this 
logic alone for now.
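To illustrate (a reduced, hypothetical sketch in plain C, not the actual patch
code): with the feature compiled out, both the declaration and the call expand
to nothing, so the local never has to exist and the caller needs no #ifdef,
whereas an empty inline function would still require "&adaptive" to name a
real variable:

#ifdef USE_ADAPTIVE
struct adaptive_waiter { void *owner; };
#define DECLARE_ADAPTIVE_WAITER(name) \
	struct adaptive_waiter name = { .owner = 0 }
#define prepare_adaptive_wait(lock, adaptive) \
	do { (adaptive)->owner = (lock); } while (0)
#else
#define DECLARE_ADAPTIVE_WAITER(name)
#define prepare_adaptive_wait(lock, adaptive) do { } while (0)
#endif

static void slowlock_example(void *lock)
{
	DECLARE_ADAPTIVE_WAITER(adaptive);	/* gone when USE_ADAPTIVE is off */

	prepare_adaptive_wait(lock, &adaptive);	/* "&adaptive" never expands   */
}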



Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

2008-02-26 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:06 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 
 I believe you have _way_ too many config variables. If this can be set
 at runtime, does it need a config option, too?

Generally speaking, I think until this algorithm has an adaptive-timeout in 
addition to an adaptive-spin/sleep, these .config based defaults are a good 
idea.  Sometimes setting these things at runtime is a PITA when you are 
talking about embedded systems that might not have/want a nice userspace 
sysctl-config infrastructure.  And changing the defaults in the code is 
unattractive for some users.  I don't think it's a big deal either way, so if 
people hate the config options, they should go.  But I thought I would throw 
this use-case out there to ponder.

Regards,
-Greg



Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-26 Thread Gregory Haskins
 On Tue, Feb 26, 2008 at  1:06 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 On Tue 2008-02-26 08:03:43, Gregory Haskins wrote:
  On Mon, Feb 25, 2008 at  5:03 PM, in message
 [EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 
  +static inline void
  +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
  *adaptive)
  ...
  +#define prepare_adaptive_wait(lock, busy) {}
  
  This is evil. Use empty inline function instead (same for the other
  function, there you can maybe get away with it).
  
 
 I went to implement your suggested change and I remembered why I did it this 
 way:  I wanted a macro so that the struct adaptive_waiter local variable 
 will fall away without an #ifdef in the main body of code.  So I have left 
 this logic alone for now.
 
 Hmm, but inline function will allow dead code elimination,  too, no?

I was getting compile errors.  Might be operator-error ;)

 
 Anyway non-evil way to do it with macro is 
 
 #define prepare_adaptive_wait(lock, busy) do {} while (0)
 
 ...that behaves properly in complex statements.

Ah, I was wondering why people use that.  Will do.  Thanks!
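For anyone else following along, a small hypothetical example of why the
bare-braces form is fragile while the do/while form is not:

#define prepare_braces(lock, busy)  {}                 /* the problematic form */
#define prepare_dowhile(lock, busy) do { } while (0)   /* the safe form        */

void demo(int contended, void *lock, void *busy)
{
	(void)lock;

	if (contended)
		prepare_dowhile(lock, busy);	/* one statement; the ';' belongs to it */
	else
		busy = 0;

	/*
	 * The braces form would expand here to "{} ;": the empty block closes
	 * the if, the stray ';' becomes its own statement, and the following
	 * "else" no longer pairs with anything, so it fails to compile:
	 *
	 *	if (contended)
	 *		prepare_braces(lock, busy);
	 *	else
	 *		busy = 0;
	 */
}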

-Greg



[(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
From: Sven-Thorsten Dietrich [EMAIL PROTECTED]

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich [EMAIL PROTECTED]
---

 kernel/rtmutex.c |7 ++-
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 6624c66..cd39c26 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -18,6 +18,10 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
+
 /*
  * lock-owner state tracking:
  *
@@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+  !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c913d48..c24c53d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -836,6 +840,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
+   .data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current->state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cd39c26..ef52db6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -681,6 +681,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+   unsigned long state = xchg(&current->state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+   saved_state = current->state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -752,14 +761,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
 spin_lock_irqsave(&lock->wait_lock, flags);
 current->flags |= saved_flags;
 current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
 }
 
 state = xchg(&current->state, saved_state);



[(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks.  However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention.  Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent.  Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   20 +++
 kernel/rtmutex.c  |   30 +++---
 kernel/rtmutex_adaptive.h |  138 +
 3 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e493257..d2432fa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -183,6 +183,26 @@ config RCU_TRACE
  Say Y/M here if you want to enable RCU tracing in-kernel/module.
  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+bool "Adaptive real-time locks"
+default y
+depends on PREEMPT_RT && SMP
+help
+  PREEMPT_RT allows for greater determinism by transparently
+  converting normal spinlock_ts into preemptible rtmutexes which
+  sleep any waiters under contention.  However, in many cases the
+  lock will be released in less time than it takes to context
+  switch.  Therefore, the sleep under contention policy may also
+  degrade throughput performance due to the extra context switches.
+
+  This option alters the rtmutex derived spinlock_t replacement
+  code to use an adaptive spin/sleep algorithm.  It will spin
+  unless it determines it must sleep to avoid deadlock.  This
+  offers a best of both worlds solution since we achieve both
+  high-throughput and low-latency.
+
+  If unsure, say Y.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index bf9e230..3802ef8 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,8 @@
  *  Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner [EMAIL PROTECTED]
  *  Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  *  Copyright (C) 2006 Esben Nielsen
+ *  Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale,
+ *   and Gregory Haskins
  *
  *  See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +19,7 @@
 #include <linux/hardirq.h>
 
 #include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"
 
 #ifdef CONFIG_RTLOCK_LATERAL_STEAL
 int rtmutex_lateral_steal __read_mostly = 1;
@@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+   DECLARE_ADAPTIVE_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(waiter);
waiter.task = NULL;
@@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
 
+   prepare_adaptive_wait(lock, &adaptive);
+
/*
 * Prevent schedule() to drop BKL, while waiting for
 * the lock ! We restore lock_depth when we come back.
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(waiter);
 
-   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-   /*
-* The xchg() in update_current() is an implicit barrier
-* which we rely upon to ensure current->state is visible
-* before we test waiter.task.
-*/
-   if (waiter.task)
-   schedule_rt_mutex(lock);
-   else
-   update_current(TASK_RUNNING_MUTEX, &saved_state);
+   /* adaptive_wait() returns 1

[(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

2008-02-25 Thread Gregory Haskins
From: Sven Dietrich [EMAIL PROTECTED]

Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2432fa..ac1cbad 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK
 
   If unsure, say Y.
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default 1
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3802ef8..4a16b13 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -25,6 +25,10 @@
 int rtmutex_lateral_steal __read_mostly = 1;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 862c088..60c6328 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c24c53d..55189ea 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -56,6 +56,8 @@
 #include asm/stacktrace.h
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   37 ++
 kernel/rtmutex.c  |   76 +
 kernel/rtmutex_adaptive.h |   32 ++-
 kernel/sysctl.c   |   10 ++
 4 files changed, 119 insertions(+), 36 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ac1cbad..864bf14 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -214,6 +214,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool "Adaptive real-time mutexes"
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int "Default delay (in loops) for adaptive mutexes"
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default 3000
+help
+ This allows you to specify the maximum delay a task will use
+to wait for a rt mutex before going to sleep.  Note that that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a16b13..ea593e0 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1;
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
@@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* We can skip the actual (expensive) wakeup if the
+* waiter is already running, but we have to be careful
+* of race conditions because they may be about to sleep.
+*
+* The waiter-side protocol has the following pattern:
+* 1: Set state != RUNNING
+* 2: Conditionally sleep if waiter->task != NULL;
+*
+* And the owner-side has the following:
+* A: Set waiter->task = NULL
+* B: Conditionally wake if the state != RUNNING
+*
+* As long as we ensure 1-2 order, and A-B order, we
+* will never miss a wakeup.
+*
+* Therefore, this barrier ensures that waiter->task = NULL
+* is visible before we test the pendowner->state.  The
+* corresponding barrier is in the sleep logic.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   /*
-* We can skip the actual (expensive) wakeup if the
-* waiter is already running, but we have to be careful
-* of race conditions because they may be about to sleep.
-*
-* The waiter-side protocol has the following pattern:
-* 1: Set state != RUNNING
-* 2: Conditionally sleep if waiter->task != NULL;
-*
-* And the owner-side has the following:
-* A: Set waiter->task = NULL
-* B: Conditionally wake if the state != RUNNING
-*
-* As long as we ensure 1-2 order, and A-B order, we
-* will never miss a wakeup.
-*
-   

[(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ea593e0..b81bbef 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
spin_lock(&current->pi_lock);
 
@@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
* waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
spin_lock(&pendowner->pi_lock);
 
WARN_ON(!pendowner->pi_blocked_on);
@@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
spin_unlock(&pendowner->pi_lock);
 }
 



Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  4:54 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
   * saved_state accordingly. If we did not get a real wakeup
   * then we return with the saved state.
   */
 -saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
 +saved_state = current->state;
 +smp_mb();
  
  for (;;) {
  unsigned long saved_flags;
 
 Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed.  I think that is a relic 
from back when I was debugging some other problems.  Let me wrap my head around 
the implications of removing it, and either remove it or document appropriately.

 
 Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so.  We are substituting a write with a read, and word reads are 
always atomic anyway IIUC (or is that only true on certain architectures)?  
Note that we are moving the atomic-write to be done later in the 
update_current() calls.

-Greg





Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:03 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 +/*
 + * Adaptive-rtlocks will busywait when possible, and sleep only if
 + * necessary. Note that the busyloop looks racy, and it is, but we do
 + * not care. If we lose any races it simply means that we spin one more
 + * time before seeing that we need to break-out on the next iteration.
 + *
 + * We realize this is a relatively large function to inline, but note that
 + * it is only instantiated 1 or 2 times max, and it makes a measurable
 + * performance difference to avoid the call.
 + *
 + * Returns 1 if we should sleep
 + *
 + */
 +static inline int
 +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 +  struct adaptive_waiter *adaptive)
 +{
 +int sleep = 0;
 +
 +for (;;) {
 +/*
 + * If the task was re-awoken, break out completely so we can
 + * reloop through the lock-acquisition code.
 + */
 +if (!waiter->task)
 +break;
 +
 +/*
 + * We need to break if the owner changed so we can reloop
 + * and safely acquire the owner-pointer again with the
 + * wait_lock held.
 + */
 +if (adaptive->owner != rt_mutex_owner(lock))
 +break;
 +
 +/*
 + * If we got here, presumably the lock ownership is still
 + * current.  We will use it to our advantage to be able to
 + * spin without disabling preemption...
 + */
 +
 +/*
 + * .. sleep if the owner is not running..
 + */
 +if (!adaptive->owner->se.on_rq) {
 +sleep = 1;
 +break;
 +}
 +
 +/*
 + * .. or is running on our own cpu (to prevent deadlock)
 + */
 +if (task_cpu(adaptive->owner) == task_cpu(current)) {
 +sleep = 1;
 +break;
 +}
 +
 +cpu_relax();
 +}
 +
 +put_task_struct(adaptive->owner);
 +
 +return sleep;
 +}
 +
 
 You want to inline this?

Yes.  As the comment indicates, there are 1-2 users tops, and it has a 
significant impact on throughput (> 5%) to take the hit with a call.  I don't 
think it's actually much code anyway...it's all comments.

 
 +static inline void
 +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
 *adaptive)
 ...
 +#define prepare_adaptive_wait(lock, busy) {}
 
 This is evil. Use empty inline function instead (same for the other
 function, there you can maybe get away with it).

Ok.


   Pavel





Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:09 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 From: Peter W.Morreale [EMAIL PROTECTED]
 
 This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
 a new tunable: rtmutex_timeout, which is the companion to the
 rtlock_timeout tunable.
 
 Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
 
 Not signed off by you?

I wasn't sure if this was appropriate for me to do.  This is the first time I 
was acting as upstream to someone.  If that is what I am expected to do, 
consider this an ack for your remaining comments related to this.

 
 diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
 index ac1cbad..864bf14 100644
 --- a/kernel/Kconfig.preempt
 +++ b/kernel/Kconfig.preempt
 @@ -214,6 +214,43 @@ config RTLOCK_DELAY
   tunable at runtime via a sysctl.  A setting of 0 (zero) disables
   the adaptive algorithm entirely.
  
 +config ADAPTIVE_RTMUTEX
 +bool Adaptive real-time mutexes
 +default y
 +depends on ADAPTIVE_RTLOCK
 +help
 + This option adds the adaptive rtlock spin/sleep algorithm to
 + rtmutexes.  In rtlocks, a significant gain in throughput
 + can be seen by allowing rtlocks to spin for a distinct
 + amount of time prior to going to sleep for deadlock avoidence.
 + 
 + Typically, mutexes are used when a critical section may need to
 + sleep due to a blocking operation.  In the event the critical 
 + section does not need to sleep, an additional gain in throughput 
 + can be seen by avoiding the extra overhead of sleeping.
 
 Watch the whitespace. ... and do we need yet another config options?
 
 +config RTMUTEX_DELAY
 +int Default delay (in loops) for adaptive mutexes
 +range 0 1000
 +depends on ADAPTIVE_RTMUTEX
 +default 3000
 +help
 + This allows you to specify the maximum delay a task will use
 + to wait for a rt mutex before going to sleep.  Note that that
 + although the delay is implemented as a preemptable loop, tasks
 + of like priority cannot preempt each other and this setting can
 + result in increased latencies.
 + 
 + The value is tunable at runtime via a sysctl.  A setting of 0
 + (zero) disables the adaptive algorithm entirely.
 
 Ouch.

?  Is this a reference to the whitespace damage, or does the content need addressing?

 
 +#ifdef CONFIG_ADAPTIVE_RTMUTEX
 +
 +#define mutex_adaptive_wait adaptive_wait
 +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
 +
 +extern int rtmutex_timeout;
 +
 +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
 + struct adaptive_waiter name = { .owner = NULL,   \
 + .timeout = rtmutex_timeout, }
 +
 +#else
 +
 +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
 +
 +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
 +#define mutex_prepare_adaptive_wait(lock, busy) {}
 
 More evil macros. Macro does not behave like a function, make it
 inline function if you are replacing a function.

Ok


   Pavel





Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:57 PM, in message
[EMAIL PROTECTED], Sven-Thorsten Dietrich
[EMAIL PROTECTED] wrote: 

 But Greg may need to enforce it on his git tree that he mails these from
 - are you referring to anything specific in this patch?
 

That's what I don't get.  I *did* checkpatch all of these before sending them 
out (and I have for every release).

I am aware of two tabs vs spaces warnings, but the rest checked clean.  Why 
do some people still see errors when I don't?  Is there a set of switches I 
should supply to checkpatch to make it more aggressive or something?

-Greg



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Gregory Haskins

Paul E. McKenney wrote:

Governing the timeout by context-switch overhead sounds even better to me.
Really easy to calibrate, and short critical sections are of much shorter
duration than are a context-switch pair.


Yeah, fully agree.  This is on my research todo list.  My theory is 
that the ultimate adaptive-timeout algorithm here would essentially be 
the following:


*) compute the context-switch pair time average for the system.  This is 
your time threshold (CSt).


*) For each lock, maintain an average hold-time (AHt) statistic (I am 
assuming this can be done cheaply...perhaps not).


The adaptive code would work as follows:

if (AHt > CSt) /* don't even bother if the average is greater than CSt */
        timeout = 0;
else
        timeout = AHt;

if (adaptive_wait(timeout))
        sleep();

Anyone have some good ideas on how to compute CSt?  I was thinking you 
could create two kthreads that message one another (measuring round-trip 
time) for some number (say 100) to get an average.  You could probably 
just approximate it with flushing workqueue jobs.
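
Something along these lines is roughly what I have in mind -- an untested,
hypothetical sketch only (cst_measure()/cst_echo_thread() and CST_SAMPLES are
made-up names), bouncing between two contexts with completions and dividing
the total round-trip time by the sample count:

#include <linux/kthread.h>
#include <linux/completion.h>
#include <linux/ktime.h>
#include <linux/err.h>

#define CST_SAMPLES 100

static DECLARE_COMPLETION(cst_ping);
static DECLARE_COMPLETION(cst_pong);

/* Bounce every ping straight back as a pong, then exit. */
static int cst_echo_thread(void *unused)
{
	int i;

	for (i = 0; i < CST_SAMPLES; i++) {
		wait_for_completion(&cst_ping);
		complete(&cst_pong);
	}
	return 0;
}

/* Returns an estimate (in ns) of one context-switch pair, or 0 on error. */
static u64 cst_measure(void)
{
	struct task_struct *t;
	ktime_t start;
	int i;

	t = kthread_run(cst_echo_thread, NULL, "cst_echo");
	if (IS_ERR(t))
		return 0;

	start = ktime_get();
	for (i = 0; i < CST_SAMPLES; i++) {
		complete(&cst_ping);		/* wake the echo thread...       */
		wait_for_completion(&cst_pong);	/* ...and sleep until it answers */
	}

	return ktime_to_ns(ktime_sub(ktime_get(), start)) / CST_SAMPLES;
}

Each round trip is one switch out and one switch back, so the per-sample figure
is roughly one context-switch pair (plus wakeup overhead, which arguably
belongs in CSt anyway).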


-Greg





Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-22 Thread Gregory Haskins

Pavel Machek wrote:

Hi!


Decorate the printk path with an unlikely()

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (!current->in_printk)
-   might_sleep();
-   else if (in_atomic() || irqs_disabled())
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 
+	might_sleep();


I think you changed the code here... you call might_sleep() in
different cases afaict.


Agreed, but it's still correct afaict.  I added an extra might_sleep() 
to a path that really might sleep.  I should have mentioned that in the 
header.


In any case, its moot.  Andi indicated this patch is probably a no-op so 
I was considering dropping it on the v2 pass.


Regards,
-Greg





[PATCH [RT] 09/14] adaptive mutexes

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   37 +
 kernel/rtmutex.c  |   44 ++--
 kernel/rtmutex_adaptive.h |   32 ++--
 kernel/sysctl.c   |   10 ++
 4 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index eebec19..d2b0daa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -223,6 +223,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool "Adaptive real-time mutexes"
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int "Default delay (in loops) for adaptive mutexes"
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default 3000
+help
+ This allows you to specify the maximum delay a task will use
+to wait for a rt mutex before going to sleep.  Note that that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a7423f..a7ed7b2 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -24,6 +24,10 @@
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
@@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* This may appear to be a race, but the barriers close the
+* window.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   smp_mb();
-   /*
-* This may appear to be a race, but the barriers close the
-* window.
-*/
-   if ((pendowner->state != TASK_RUNNING)
-       && (pendowner->state != TASK_RUNNING_MUTEX))
+   smp_mb();
+   if ((pendowner->state != TASK_RUNNING)
+       && (pendowner->state != TASK_RUNNING_MUTEX)) {
+   if (!savestate)
+   wake_up_process(pendowner);
+   else
wake_up_process_mutex(pendowner);
}
 
@@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(waiter);
 
/* adaptive_wait() returns 1 if we need to sleep */
-   if (adaptive_wait(lock, &waiter, &adaptive)) {
+   if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
 update_current(TASK_UNINTERRUPTIBLE, &saved_state);
if (waiter.task)
schedule_rt_mutex(lock);
@@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
+   DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(waiter);
waiter.task = NULL;
@@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(current->lock_depth >= 0))

[PATCH [RT] 10/14] adjust pi_lock usage in wakeup

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a7ed7b2..122f143 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
spin_lock(&current->pi_lock);
 
@@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
* waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
spin_lock(&pendowner->pi_lock);
 
WARN_ON(!pendowner->pi_blocked_on);
@@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
spin_unlock(&pendowner->pi_lock);
 }
 



[PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing

2008-02-21 Thread Gregory Haskins
From: Sven-Thorsten Dietrich [EMAIL PROTECTED]

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich [EMAIL PROTECTED]
---

 kernel/rtmutex.c |8 ++--
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index da077e5..62e7af5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #ifdef CONFIG_ADAPTIVE_RTMUTEX
 int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
 
 /*
  * lock-owner state tracking:
@@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+  !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
@@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, 
int unfair)
return 1;
 
spin_lock(pendowner-pi_lock);
-   if (!lock_is_stealable(pendowner, unfair)) {
+   if (!lock_is_stealable(pendowner, (unfair && rtmutex_lateral_steal))) {
spin_unlock(pendowner-pi_lock);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3465af2..c1a1c6d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
+   .data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[PATCH [RT] 06/14] optimize rt lock wakeup

2008-02-21 Thread Gregory Haskins
It is redundant to wake the grantee task if it is already running

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 15fc6e6..cb27b08 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -511,6 +511,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
pendowner = waiter->task;
waiter->task = NULL;
 
+   /*
+* Do the wakeup before the ownership change to give any spinning
+* waiter grantees a headstart over the other threads that will
+* trigger once owner changes.
+*/
+   if (!savestate)
+   wake_up_process(pendowner);
+   else {
+   smp_mb();
+   /*
+* This may appear to be a race, but the barriers close the
+* window.
+*/
+   if ((pendowner->state != TASK_RUNNING)
+       && (pendowner->state != TASK_RUNNING_MUTEX))
+   wake_up_process_mutex(pendowner);
+   }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
spin_unlock(&current->pi_lock);
@@ -537,11 +555,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
}
spin_unlock(&pendowner->pi_lock);
-
-   if (savestate)
-   wake_up_process_mutex(pendowner);
-   else
-   wake_up_process(pendowner);
 }
 
 /*



[PATCH [RT] 03/14] x86: FIFO ticket spinlocks

2008-02-21 Thread Gregory Haskins
From: Nick Piggin [EMAIL PROTECTED]

Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
with a userspace test having a difference of up to 2x runtime per thread, and
some threads are starved or unfairly granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled
and cannot reenable interrupts when spinning on an irq safe lock, because
at that point we have already taken a ticket and we would deadlock if
the same CPU tries to take the lock again.  These are questionable anyway:
if the lock happens to be called under a preempt or interrupt disabled section,
then it will just have the same latency problems. The real fix is to keep
critical sections short, and ensure locks are reasonably fair (which this
patch does).

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 include/asm-x86/spinlock.h   |  225 ++
 include/asm-x86/spinlock_32.h|  221 -
 include/asm-x86/spinlock_64.h|  167 
 include/asm-x86/spinlock_types.h |2 
 4 files changed, 224 insertions(+), 391 deletions(-)

diff --git a/include/asm-x86/spinlock.h b/include/asm-x86/spinlock.h
index d74d85e..72fe445 100644
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -1,5 +1,226 @@
+#ifndef _X86_SPINLOCK_H_
+#define _X86_SPINLOCK_H_
+
+#include <asm/atomic.h>
+#include <asm/rwlock.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+#include <linux/compiler.h>
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations.  There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
 #ifdef CONFIG_X86_32
-# include "spinlock_32.h"
+typedef char _slock_t;
+# define LOCK_INS_DEC "decb"
+# define LOCK_INS_XCH "xchgb"
+# define LOCK_INS_MOV "movb"
+# define LOCK_INS_CMP "cmpb"
+# define LOCK_PTR_REG "a"
 #else
-# include "spinlock_64.h"
+typedef int _slock_t;
+# define LOCK_INS_DEC "decl"
+# define LOCK_INS_XCH "xchgl"
+# define LOCK_INS_MOV "movl"
+# define LOCK_INS_CMP "cmpl"
+# define LOCK_PTR_REG "D"
+#endif
+
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) != (tmp & 0xff));
+}
+
+static inline int __raw_spin_is_contended(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
+}
+
+static inline void __raw_spin_lock(__raw_spinlock_t *lock)
+{
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX "xaddw %w0, %1\n"
+   "1:\t"
+   "cmpb %h0, %b0\n\t"
+   "je 2f\n\t"
+   "rep ; nop\n\t"
+   "movb %1, %b0\n\t"
+   /* don't need lfence here, because loads are in-order */
+   "jmp 1b\n"
+   "2:"
+   : "+Q" (inc), "+m" (lock->slock)
+   :
+   : "memory", "cc");
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int 

[PATCH [RT] 04/14] disable PREEMPT_SPINLOCK_WAITERS when x86 ticket/fifo spins are in use

2008-02-21 Thread Gregory Haskins
Preemptible spinlock waiters effectively bypasses the benefits of a fifo
spinlock.  Since we now have fifo spinlocks for x86 enabled, disable the
preemption feature on x86.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Nick Piggin [EMAIL PROTECTED]
---

 arch/x86/Kconfig |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8d15667..d5b9a67 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -20,6 +20,7 @@ config X86
bool
default y
select HAVE_MCOUNT
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
 
 config GENERIC_TIME
bool



[PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
From: Sven Dietrich [EMAIL PROTECTED]

Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK
 
 If unsure, say Y
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default 1
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include asm/stacktrace.h
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
The Real Time patches to the Linux kernel converts the architecture
specific SMP-synchronization primitives commonly referred to as
spinlocks to an RT mutex implementation that supports a priority
inheritance protocol, and priority-ordered wait queues.  The RT mutex
implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains both
the priority inheritance protocol as well as the preemptive nature of
spinlocks and mutexes and adds a 300+% throughput increase to the Linux
Real time kernel.  It applies to 2.6.24-rt1.  

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall the I/O throughput seems to approach the
throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is
shipped by most distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are typically held for very short time spans.  Too often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.
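
In rough pseudocode, the waiter-side decision looks something like the sketch
below (illustrative only, not the patch itself; owner_running_elsewhere() is a
stand-in for the on_rq/task_cpu checks in the real code):

/* Decide whether a contended rtlock waiter should sleep or keep spinning. */
static int adaptive_should_sleep(struct rt_mutex *lock,
				 struct task_struct *owner, int *budget)
{
	while ((*budget)-- > 0) {
		if (rt_mutex_owner(lock) != owner)
			return 0;	/* owner changed: retry the acquisition loop */
		if (!owner_running_elsewhere(owner))
			return 1;	/* owner off-cpu (or on our cpu): spinning is pointless */
		cpu_relax();		/* poll politely; we stay fully preemptible */
	}
	return 1;			/* loop budget exhausted: fall back to sleeping */
}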

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context switch overhead by up to 90%, and therefore eliminate CPU
cycles as well as massive hot-spots in the scheduler / other bottlenecks
in the Kernel - even though we busy-wait (using CPU cycles) to poll the
lock.

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2Ghz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. 

Some tests show a marked improvement (for instance, dbench and hackbench),
whereas for some others (make -j 128) the results were not as profound, but
they were still net-positive. In all cases we have also verified that
deterministic latency is not impacted by using cyclic-test. 

This patch series also includes some re-work on the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock->wait_locks could cause rare but
serious latency spikes that are fixed by a fifo raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
fifo-lock support as well, which we will gladly post upon request.

Special thanks go to many people who were instrumental to this project,
including:
  *) the -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at Suse, Montavista, and Bill Huey for their
 time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg



[PATCH [RT] 05/14] rearrange rt_spin_lock sleep

2008-02-21 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current->state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..15fc6e6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -661,6 +661,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+   unsigned long state = xchg(&current->state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -700,7 +708,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+   saved_state = current->state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
	spin_lock_irqsave(&lock->wait_lock, flags);
	current->flags |= saved_flags;
	current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
}
 
state = xchg(&current->state, saved_state);

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled

2008-02-21 Thread Gregory Haskins
The logic is currently broken so that PREEMPT_RT disables preemptible
spinlock waiters, which is counter intuitive. 

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/spinlock.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index c9bcf1b..b0e7f02 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
 #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT)
+   defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option

2008-02-21 Thread Gregory Haskins
We introduce a configuration variable for the feature to make it easier for
various architectures and/or configs to enable or disable it based on their
requirements.  

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |9 +
 kernel/spinlock.c  |7 +++
 lib/Kconfig.debug  |1 +
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..5b45213 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,6 +86,15 @@ config PREEMPT
default y
depends on PREEMPT_DESKTOP || PREEMPT_RT
 
+config DISABLE_PREEMPT_SPINLOCK_WAITERS
+bool
+   default n
+
+config PREEMPT_SPINLOCK_WAITERS
+bool
+   default y
+   depends on PREEMPT && SMP && !DISABLE_PREEMPT_SPINLOCK_WAITERS
+
 config PREEMPT_SOFTIRQS
	bool "Thread Softirqs"
default n
diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index b0e7f02..2e6a904 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -116,8 +116,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if !defined(CONFIG_PREEMPT_SPINLOCK_WAITERS)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {
@@ -244,7 +243,7 @@ void __lockfunc __write_lock(raw_rwlock_t *lock)
 
 EXPORT_SYMBOL(__write_lock);
 
-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 /*
  * This could be a long-held lock. We both prepare to spin for a long
@@ -334,7 +333,7 @@ BUILD_LOCK_OPS(spin, raw_spinlock);
 BUILD_LOCK_OPS(read, raw_rwlock);
 BUILD_LOCK_OPS(write, raw_rwlock);
 
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9208791..f2889b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -233,6 +233,7 @@ config DEBUG_LOCK_ALLOC
	bool "Lock debugging: detect incorrect freeing of live locks"
	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT &&
	 LOCKDEP_SUPPORT
select DEBUG_SPINLOCK
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
select DEBUG_MUTEXES
select LOCKDEP
help

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at 10:26 AM, in message
[EMAIL PROTECTED], Gregory Haskins
[EMAIL PROTECTED] wrote: 

 We have put together some data from different types of benchmarks for
 this patch series, which you can find here:
 
 ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also places a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at 11:41 AM, in message [EMAIL PROTECTED],
Andi Kleen [EMAIL PROTECTED] wrote: 

 +config RTLOCK_DELAY
 +int "Default delay (in loops) for adaptive rtlocks"
 +range 0 10
 +depends on ADAPTIVE_RTLOCK
 
 I must say I'm not a big fan of putting such subtle configurable numbers
 into Kconfig. Compilation is usually the wrong place to configure
 such a thing. Just having it as a sysctl only should be good enough.
 
 +default 1
 
 Perhaps you can expand how you came up with that default number? 

Actually, the number doesn't seem to matter that much as long as it is 
long enough to make timeouts rare.  Most workloads present some threshold 
for hold-time, and you generally get the best performance if the value is 
at least as long as that threshold.  Anything beyond that yields no further 
gain, but there doesn't appear to be a penalty either.  So we picked 
1 because we found it to fit that criterion quite well for our range of GHz-
class x86 machines.  YMMV, but that is why it's configurable ;)
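(To be clear, the intent is that the Kconfig value only seeds a runtime
tunable.  A hedged sketch of that arrangement, with illustrative names
rather than the actual patch:)

/* Illustration only: compile-time default that remains runtime-tunable */
int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;

static struct ctl_table rtlock_ctl_table[] = {
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "rtlock_timeout",
		.data		= &rtlock_timeout,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};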

 It looks suspiciously round and worse the actual spin time depends a lot on 
 the 
 CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
 differently from a 2Ghz CPU) 

Yeah, fully agree.  We really wanted to use a time-value here but ran into 
various problems that have yet to be resolved.  We have it on the todo list to 
express this in terms of ns so it will at least scale with the architecture.

 Did you experiment with other spin times?

Of course ;)

 Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research todo.

 And at what point is real
 time behaviour visibly impacted? 

Well, if we did our jobs correctly, RT behavior should *never* be impacted.  
*Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier: there is a threshold that 
affects the probability of timing out.  Values lower than that threshold start 
to degrade throughput.  Values higher than that have no effect on throughput, 
but may drive CPU utilization higher, which can theoretically impact tasks 
of equal or lesser priority by taking that resource away from them.  To date, 
however, we have not observed any real-world implications of this.

 
 Most likely it would be better to switch to something that is more
 absolute time, like checking RDTSC every few iteration similar to what
 udelay does. That would be at least constant time.

I agree.  We need to move in the direction of a time basis.  The tradeoff is 
that it needs to be portable and low-impact (e.g. ktime_get() is too 
heavyweight).  I think one of the (not-included) patches converts a nanosecond 
value from the sysctl to an approximate loop-count using the bogomips data.  
This was a decent compromise between the non-scaling loop-counts and the 
heavyweight official timing APIs.  We dropped it because we support older 
kernels that conflicted with the patch.  We may have to resurrect it, however.
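For reference, the conversion being described is roughly the following.
This is a sketch only (the dropped patch may have differed in detail);
loops_per_jiffy, HZ, NSEC_PER_SEC, and do_div() are the standard kernel
facilities:

/*
 * Roughly convert a nanosecond budget into a polling loop count using
 * the calibrated delay loop.  Illustrative sketch, not the dropped patch.
 */
static unsigned long ns_to_loops(unsigned long ns)
{
	u64 loops = (u64)loops_per_jiffy * HZ;	/* ~loops per second */

	loops *= ns;
	do_div(loops, NSEC_PER_SEC);		/* do_div from asm/div64.h */

	return (unsigned long)loops;
}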

-Greg



-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at  4:24 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 hm. Why is the ticket spinlock patch included in this patchset? It just 
 skews your performance results unnecessarily. Ticket spinlocks are 
 independent conceptually, they are already upstream in 2.6.25-rc2 and 
 -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous.  I included them because we found the patch series 
without them can cause spikes due to the newly introduced pressure on the 
(raw_spinlock_t)lock->wait_lock.  You can run the adaptive-spin patches without 
them just fine (in fact, in many cases things run faster without them...dbench 
*thrives* on chaos).  But you may also measure a cyclictest spike if you do 
so.  So I included them to present a complete package without spikes.  I 
tried to explain that detail in the prologue, but most people probably fell 
asleep before they got to the end ;)

 
 and if we take the ticket spinlock patch out of your series, the the 
 size of the patchset shrinks in half and touches only 200-300 lines of 
 code ;-) Considering the total size of the -rt patchset:
 
652 files changed, 23830 insertions(+), 4636 deletions(-)
 
 we can regard it a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

 
 regarding the concept: adaptive mutexes have been talked about in the 
 past, but their advantage is not at all clear, that's why we havent done 
 them. It's definitely not an unambigiously win-win concept.
 
 So lets get some real marketing-free benchmarking done, and we are not 
 just interested in the workloads where a bit of polling on contended 
 locks helps, but we are also interested in workloads where the polling 
 hurts ... And lets please do the comparisons without the ticket spinlock 
 patch ...

I'm open to suggestions, and this was just a sample of the testing we have 
done.  We have thrown plenty of workloads at this patch series beyond the 
slides I prepared at that URL, and they all seem to indicate a net positive 
improvement so far.  Some of those results I cannot share due to NDA, and some 
I didn't share simply because I never formally collected the data as I did for 
these tests.  If there is something you would like to see, please let me know 
and I will arrange for it to be run if at all possible.

Regards,
-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at  4:42 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 * Bill Huey (hui) [EMAIL PROTECTED] wrote:
 
 I came to the original conclusion that it wasn't originally worth it, 
 but the dbench number published say otherwise. [...]
 
 dbench is a notoriously unreliable and meaningless workload. It's being 
 frowned upon by the VM and the IO folks.

I agree...it's a pretty weak benchmark.  BUT, it does pound on dcache_lock and 
therefore was a good demonstration of the benefits of lower contention 
overhead.  Also note that we included other tests in that PDF if you scroll to 
the subsequent pages.

 If that's the only workload 
 where spin-mutexes help, and if it's only a 3% improvement [of which it 
 is unclear how much of that improvement was due to ticket spinlocks], 
 then adaptive mutexes are probably not worth it.

Note that the 3% figure being thrown around was from a single patch within 
the series.  We are actually getting a net average gain of 443% in dbench.  And 
note that the number goes *up* when you remove the ticketlocks.  The 
ticketlocks are there to prevent latency spikes, not improve throughput.

Also take a look at the hackbench numbers which are particularly promising.   
We get a net average gain of 493% faster for RT10 based hackbench runs.  The 
kernel build showed only a small gain, but it was all gain nonetheless.  We see 
similar results for the other workloads we throw at this thing.  I will gladly 
run any requested test that I have the ability to run, and I would 
encourage third-party results as well.


 
 I'd not exclude them fundamentally though, it's really the numbers that 
 matter. The code is certainly simple enough (albeit the .config and 
 sysctl controls are quite ugly and unacceptable - adaptive mutexes 
 should really be ... adaptive, with no magic constants in .configs or 
 else).

We can clean this up, per your suggestions.

 
 But ... i'm somewhat sceptic, after having played with spin-a-bit 
 mutexes before.

It's very subtle to get this concept to work.  For the first few weeks, we were 
getting 90% regressions ;)  Then we had a breakthrough and started to get this 
thing humming along quite nicely.

Regards,
-Greg




-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
This patch adds a new critical-section primitive pair:

migration_disable() and migration_enable()

This allows you to force a task to remain on the current cpu, while
still remaining fully preemptible.  This is a better alternative to
modifying current->cpus_allowed because you don't have to worry about
colliding with another entity also modifying the cpumask_t while in
the critical section.

In fact, modifying the cpumask_t while in the critical section is
fully supported, but note that set_cpus_allowed() now behaves slightly
differently.  In the old code, the mask update was synchronous: e.g.
the task would be on a legal cpu when the call returned.  The new
behavior makes this asynchronous if the task is currently in a
migration-disabled critical section.  The task will migrate to a
legal cpu once the critical section ends.

This concept will be used later in the series.
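An example of the intended usage pattern (a hypothetical caller, not part
of this patch; some_counter is an illustrative per-cpu variable):

/* Hypothetical example of the intended usage pattern */
void example_per_cpu_update(void)
{
	migration_disable(current);

	/*
	 * We may be preempted here, but we will always resume on the
	 * same CPU, so per-cpu state remains consistent.
	 */
	per_cpu(some_counter, smp_processor_id())++;

	migration_enable(current);
}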

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 include/linux/init_task.h |1 +
 include/linux/sched.h |8 +
 kernel/fork.c |1 +
 kernel/sched.c|   70 -
 kernel/sched_rt.c |6 +++-
 5 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 316a184..151197b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -137,6 +137,7 @@ extern struct group_info init_groups;
.usage  = ATOMIC_INIT(2),   \
.flags  = 0,\
.lock_depth = -1,   \
+   .migration_disable_depth = 0,   \
.prio   = MAX_PRIO-20,  \
.static_prio= MAX_PRIO-20,  \
.normal_prio= MAX_PRIO-20,  \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c87d46a..ab7768a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1109,6 +1109,7 @@ struct task_struct {
unsigned int ptrace;
 
int lock_depth; /* BKL lock depth */
+   int migration_disable_depth;
 
 #ifdef CONFIG_SMP
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
@@ -2284,10 +2285,17 @@ static inline void inc_syscw(struct task_struct *tsk)
 
 #ifdef CONFIG_SMP
 void migration_init(void);
+int migration_disable(struct task_struct *tsk);
+void migration_enable(struct task_struct *tsk);
 #else
 static inline void migration_init(void)
 {
 }
+static inline int migration_disable(struct task_struct *tsk)
+{
+   return 0;
+}
+#define migration_enable(tsk) do {} while (0)
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c00b55..7745937 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1127,6 +1127,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
	INIT_LIST_HEAD(&p->cpu_timers[2]);
	p->posix_timer_list = NULL;
	p->lock_depth = -1; /* -1 = no lock */
+   p->migration_disable_depth = 0;
	do_posix_clock_monotonic_gettime(&p->start_time);
	p->real_start_time = p->start_time;
	monotonic_to_bootbased(&p->real_start_time);
diff --git a/kernel/sched.c b/kernel/sched.c
index e6ad493..cf32000 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1231,6 +1231,8 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
  *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
u64 clock_offset;
 
+   BUG_ON(p->migration_disable_depth);
+
	clock_offset = old_rq->clock - new_rq->clock;
 
 #ifdef CONFIG_SCHEDSTATS
@@ -1632,7 +1634,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, 
int sync, int mutex)
if (unlikely(task_running(rq, p)))
goto out_activate;
 
-   cpu = p->sched_class->select_task_rq(p, sync);
+   if (!p->migration_disable_depth)
+   cpu = p->sched_class->select_task_rq(p, sync);
+
	if (cpu != orig_cpu) {
	set_task_cpu(p, cpu);
	task_rq_unlock(rq, &flags);
@@ -5422,11 +5426,12 @@ static inline void sched_init_granularity(void)
  */
 int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 {
-   struct migration_req req;
unsigned long flags;
struct rq *rq;
int ret = 0;
 
+   migration_disable(p);
+
	rq = task_rq_lock(p, &flags);
if (!cpus_intersects(new_mask, cpu_online_map)) {
ret = -EINVAL;
@@ -5440,21 +5445,11 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t 
new_mask)
	p->nr_cpus_allowed = cpus_weight(new_mask);
}
 
-   /* Can the task run on the task's current CPU? If so, we're done */
-   if (cpu_isset(task_cpu(p), new_mask))
-   goto out;
-
-   if (migrate_task(p, any_online_cpu(new_mask), &req)) {
-   /* Need help

Re: [PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
 On Tue, Feb 12, 2008 at  2:22 PM, in message
[EMAIL PROTECTED], Steven Rostedt
[EMAIL PROTECTED] wrote: 

 On Tue, 12 Feb 2008, Gregory Haskins wrote:
 
 This patch adds a new critical-section primitive pair:

 migration_disable() and migration_enable()
 
 This is similar to what Mathieu once posted:
 
 http://lkml.org/lkml/2007/7/11/13
 
 Not sure the arguments against (no time to read the thread again). But I'd
 recommend that you read it.
 
 -- Steve

Indeed, thanks for the link!  At quick glance, the concept looks identical, 
though the implementations are radically different.

-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] fix cpus_allowed settings

2008-02-12 Thread Gregory Haskins
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/kthread.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index dcfe724..b193b47 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
wait_task_inactive(k);
set_task_cpu(k, cpu);
	k->cpus_allowed = cpumask_of_cpu(cpu);
+   k->nr_cpus_allowed = 1;
 }
 EXPORT_SYMBOL(kthread_bind);
 

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] migration disabled critical sections

2008-02-12 Thread Gregory Haskins
Hi Ingo, Steven,

I had been working on some ideas related to saving context switches in the
bottom-half mechanisms on -rt.  So far, the ideas have been a flop, but a few
peripheral technologies did come out of it.  This series is one such
idea that I thought might have some merit on its own.  The header-comments
describe it in detail, so I wont bother replicating that here.

This series applies to 24-rt1.  Any comments/feedback welcome.

-Greg


-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-05 Thread Gregory Haskins
 On Mon, Feb 4, 2008 at  9:51 PM, in message
[EMAIL PROTECTED], Daniel Walker
[EMAIL PROTECTED] wrote: 
 On Mon, Feb 04, 2008 at 03:35:13PM -0800, Max Krasnyanskiy wrote:

[snip]


 Also the first thing I tried was to bring CPU1 off-line. Thats the fastest 
 way to get irqs, soft-irqs, timers, etc of a CPU. But the box hung 
 completely.

After applying my earlier submitted patch, I was able to reproduce the hang you 
mentioned.  I poked around in sysrq and it looked like a deadlock on an 
rt_mutex, so I turned on lockdep and it found:


===
[ INFO: possible circular locking dependency detected ]
[ 2.6.24-rt1-rt #3
---
bash/4604 is trying to acquire lock:
 (events){--..}, at: [802537b6] cleanup_workqueue_thread+0x16/0x80

but task is already holding lock:
 (workqueue_mutex){--..}, at: [80254615] 
workqueue_cpu_callback+0xe5/0x140

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

- #5 (workqueue_mutex){--..}:
   [80266752] __lock_acquire+0xf82/0x1090
   [802668b7] lock_acquire+0x57/0x80
   [80254615] workqueue_cpu_callback+0xe5/0x140
   [80486818] _mutex_lock+0x28/0x40
   [80254615] workqueue_cpu_callback+0xe5/0x140
   [8048a575] notifier_call_chain+0x45/0x90
   [8025d079] __raw_notifier_call_chain+0x9/0x10
   [8025d091] raw_notifier_call_chain+0x11/0x20
   [8026d157] _cpu_down+0x97/0x2d0
   [8026d3b5] cpu_down+0x25/0x60
   [8026d3c8] cpu_down+0x38/0x60
   [803d6719] store_online+0x49/0xa0
   [803d2774] sysdev_store+0x24/0x30
   [8031279f] sysfs_write_file+0xcf/0x140
   [802c0005] vfs_write+0xe5/0x1a0
   [802c0733] sys_write+0x53/0x90
   [8020c4fe] system_call+0x7e/0x83
   [] 0x

- #4 (cache_chain_mutex){--..}:
   [80266752] __lock_acquire+0xf82/0x1090
   [802668b7] lock_acquire+0x57/0x80
   [802bb7fa] kmem_cache_create+0x6a/0x480
   [80486818] _mutex_lock+0x28/0x40
   [802bb7fa] kmem_cache_create+0x6a/0x480
   [802872a6] __rcu_read_unlock+0x96/0xb0
   [8046b824] fib_hash_init+0xa4/0xe0
   [80467ee5] fib_new_table+0x35/0x70
   [80467fb1] fib_magic+0x91/0x100
   [80468093] fib_add_ifaddr+0x73/0x170
   [8046829b] fib_inetaddr_event+0x4b/0x260
   [8048a575] notifier_call_chain+0x45/0x90
   [8025d2ce] __blocking_notifier_call_chain+0x5e/0x90
   [8025d311] blocking_notifier_call_chain+0x11/0x20
   [8045f714] __inet_insert_ifa+0xd4/0x170
   [8045f7bd] inet_insert_ifa+0xd/0x10
   [8046083a] inetdev_event+0x45a/0x510
   [8041ee4d] fib_rules_event+0x6d/0x160
   [8048a575] notifier_call_chain+0x45/0x90
   [8025d079] __raw_notifier_call_chain+0x9/0x10
   [8025d091] raw_notifier_call_chain+0x11/0x20
   [8040f466] call_netdevice_notifiers+0x16/0x20
   [80410f6d] dev_open+0x8d/0xa0
   [8040f5e9] dev_change_flags+0x99/0x1b0
   [80460ffd] devinet_ioctl+0x5ad/0x760
   [80410d6a] dev_ioctl+0x4ba/0x590
   [8026523d] trace_hardirqs_on+0xd/0x10
   [8046162d] inet_ioctl+0x5d/0x80
   [80400f21] sock_ioctl+0xd1/0x260
   [802ce154] do_ioctl+0x34/0xa0
   [802ce239] vfs_ioctl+0x79/0x2f0
   [80485f30] trace_hardirqs_on_thunk+0x3a/0x3f
   [802ce532] sys_ioctl+0x82/0xa0
   [8020c4fe] system_call+0x7e/0x83
   [] 0x

- #3 ((inetaddr_chain).rwsem){..--}:
   [80266752] __lock_acquire+0xf82/0x1090
   [802668b7] lock_acquire+0x57/0x80
   [8026ca9b] rt_down_read+0xb/0x10
   [8026ca29] __rt_down_read+0x29/0x80
   [8026ca9b] rt_down_read+0xb/0x10
   [8025d2b8] __blocking_notifier_call_chain+0x48/0x90
   [8025d311] blocking_notifier_call_chain+0x11/0x20
   [8045f714] __inet_insert_ifa+0xd4/0x170
   [8045f7bd] inet_insert_ifa+0xd/0x10
   [8046083a] inetdev_event+0x45a/0x510
   [8041ee4d] fib_rules_event+0x6d/0x160
   [8048a575] notifier_call_chain+0x45/0x90
   [8025d079] __raw_notifier_call_chain+0x9/0x10
   [8025d091] raw_notifier_call_chain+0x11/0x20
   [8040f466] call_netdevice_notifiers+0x16/0x20
   [80410f6d] dev_open+0x8d/0xa0
   [8040f5e9] dev_change_flags+0x99/0x1b0
   [80460ffd] devinet_ioctl+0x5ad/0x760
   [80410d6a] dev_ioctl+0x4ba/0x590
   [8026523d] 

Re: CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-05 Thread Gregory Haskins
 On Tue, Feb 5, 2008 at 11:59 AM, in message
[EMAIL PROTECTED], Daniel Walker
[EMAIL PROTECTED] wrote: 
 On Mon, Feb 04, 2008 at 10:02:12PM -0700, Gregory Haskins wrote:
  On Mon, Feb 4, 2008 at  9:51 PM, in message
 [EMAIL PROTECTED], Daniel Walker
 [EMAIL PROTECTED] wrote: 
  I get the following when I tried it,
  
  BUG: sleeping function called from invalid context bash(5126) at
  kernel/rtmutex.c:638
  in_atomic():1 [0001], irqs_disabled():1
 
 Hi Daniel,
   Can you try this patch and let me know if it fixes your problem?
 
 ---
 
 use rcu for root-domain kfree
 
 Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
 
 diff --git a/kernel/sched.c b/kernel/sched.c
 index e6ad493..77e86c1 100644
 --- a/kernel/sched.c
 +++ b/kernel/sched.c
 @@ -339,6 +339,7 @@ struct root_domain {
 atomic_t refcount;
 cpumask_t span;
 cpumask_t online;
 +   struct rcu_head rcu;
 
 /*
  * The RT overload flag: it gets set if a CPU has more than
 @@ -6222,6 +6223,12 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
 sched_domain *parent)
 return 1;
  }
 
 +/* rcu callback to free a root-domain */
 +static void rq_free_root(struct rcu_head *rcu)
 +{
 +   kfree(container_of(rcu, struct root_domain, rcu));
 +}
 +
 
 I looked at the code a bit, and I'm not sure you need this complexity..
 Once you have replace the old_rq, there is no reason it needs to
 protection of the run queue spinlock .. So you could just move the kfree
 down below the spin_unlock_irqrestore() ..

Indeed.  When I looked last night at the stack, I thought the in_atomic was 
coming from further up in the trace.  I see the issue now, thanks Daniel.  
(Anyone have a spare brown bag?)

-Greg

 
 Daniel
 -
 To unsubscribe from this list: send the line unsubscribe linux-rt-users in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-05 Thread Gregory Haskins
 On Tue, Feb 5, 2008 at 11:59 AM, in message
[EMAIL PROTECTED], Daniel Walker
[EMAIL PROTECTED] wrote: 

 
 I looked at the code a bit, and I'm not sure you need this complexity..
 Once you have replace the old_rq, there is no reason it needs to
 protection of the run queue spinlock .. So you could just move the kfree
 down below the spin_unlock_irqrestore() ..

Here is a new version to address your observation:
---

we cannot kfree while in_atomic()

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]

diff --git a/kernel/sched.c b/kernel/sched.c
index e6ad493..0978912 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6226,6 +6226,7 @@ static void rq_attach_root(struct rq *rq, struct 
root_domain *rd)
 {
unsigned long flags;
const struct sched_class *class;
+   struct root_domain *reap = NULL;

	spin_lock_irqsave(&rq->lock, flags);

@@ -6241,7 +6242,7 @@ static void rq_attach_root(struct rq *rq, struct 
root_domain *rd)
	cpu_clear(rq->cpu, &old_rd->online);
 
	if (atomic_dec_and_test(&old_rd->refcount))
-   kfree(old_rd);
+   reap = old_rd;
}

	atomic_inc(&rd->refcount);
@@ -6257,6 +6258,10 @@ static void rq_attach_root(struct rq *rq, struct 
root_domain *rd)
}

	spin_unlock_irqrestore(&rq->lock, flags);
+
+   /* Don't try to free the memory while in-atomic() */
+   if (unlikely(reap))
+   kfree(reap);
 }




 
 Daniel
 -
 To unsubscribe from this list: send the line unsubscribe linux-rt-users in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-05 Thread Gregory Haskins
 On Tue, Feb 5, 2008 at  4:58 PM, in message
[EMAIL PROTECTED], Daniel Walker
[EMAIL PROTECTED] wrote: 
 On Tue, Feb 05, 2008 at 11:25:18AM -0700, Gregory Haskins wrote:
 @@ -6241,7 +6242,7 @@ static void rq_attach_root(struct rq *rq, struct 
 root_domain *rd)
  cpu_clear(rq->cpu, &old_rd->online);
  
  if (atomic_dec_and_test(&old_rd->refcount))
 -   kfree(old_rd);
 +   reap = old_rd;
 
 Unrelated to the in atomic issue, I was wondering if this if statement
 isn't true can the old_rd memory get leaked, or is it cleaned up
 someplace else?

Each RQ always holds a reference to one root-domain, and that is what 
rd->refcount represents.  When the last RQ drops its reference to a particular 
instance, we free the structure.  So this is the only place where we clean up, 
but it should also be the only place we need to (unless I am misunderstanding 
you?)

Note that there is one exception: the default root-domain is never freed, which 
is why we initialize it with a refcount = 1.  So it is theoretically possible 
to have this particular root-domain dangling with no RQs associated with it, 
but that is by design. 
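In other words, the lifecycle is simply the following (a simplified
restatement for illustration, not the actual rq_attach_root()):

/* Sketch of the refcount lifecycle described above */
static void example_attach(struct rq *rq, struct root_domain *rd)
{
	if (rq->rd && atomic_dec_and_test(&rq->rd->refcount))
		kfree(rq->rd);		/* last runqueue just left */

	atomic_inc(&rd->refcount);	/* new home for this runqueue */
	rq->rd = rd;
}

The default root-domain starts with its refcount initialized to 1, so the
kfree() above can never fire for it.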

Regards,
-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-04 Thread Gregory Haskins
Hi Daniel,

  See inline...

 On Mon, Feb 4, 2008 at  9:51 PM, in message
[EMAIL PROTECTED], Daniel Walker
[EMAIL PROTECTED] wrote: 
 On Mon, Feb 04, 2008 at 03:35:13PM -0800, Max Krasnyanskiy wrote:
 This is just an FYI. As part of the Isolated CPU extensions thread Daniel 
 suggest for me
 to check out latest RT kernels. So I did or at least tried to and 
 immediately spotted a couple
 of issues.

 The machine I'm running it on is:
  HP xw9300, Dual Opteron, NUMA

 It looks like with -rt kernel IRQ affinity masks are ignored on that 
 system. ie I write 1 to lets say /proc/irq/23/smp_affinity but the 
 interrupts keep coming to CPU1. Vanilla 2.6.24 does not have that issue.
 
 I tried this, and it works according to /proc/interrupts .. Are you
 looking at the interrupt threads affinity?
 
 Also the first thing I tried was to bring CPU1 off-line. Thats the fastest 
 way to get irqs, soft-irqs, timers, etc of a CPU. But the box hung 
 completely. It also managed to mess up my ext3 filesystem to the point 
 where it required manual fsck (have not see that for a couple of
 years now). I tried the same thing (ie echo 0  
 /sys/devices/cpu/cpu1/online) from the console. It hang again with the 
 message that looked something like:
  CPU1 is now off-line
  Thread IRQ-23 is on CPU1 ...
 
 I get the following when I tried it,
 
 BUG: sleeping function called from invalid context bash(5126) at
 kernel/rtmutex.c:638
 in_atomic():1 [0001], irqs_disabled():1
 Pid: 5126, comm: bash Not tainted 2.6.24-rt1 #1
  [c010506b] show_trace_log_lvl+0x1d/0x3a
  [c01059cd] show_trace+0x12/0x14
  [c0106151] dump_stack+0x6c/0x72
  [c011d153] __might_sleep+0xe8/0xef
  [c03b2326] __rt_spin_lock+0x24/0x59
  [c03b2363] rt_spin_lock+0x8/0xa
  [c0165b2f] kfree+0x2c/0x8d

Doh!  This is my bug.  I'll have to come up with a good way to free that memory 
while in_atomic(), or do this another way.  Stay tuned.

  [c011eacb] rq_attach_root+0x67/0xba
  [c01209ae] cpu_attach_domain+0x2b6/0x2f7
  [c0120a12] detach_destroy_domains+0x23/0x37
  [c0121368] update_sched_domains+0x2d/0x40
  [c013b482] notifier_call_chain+0x2b/0x55
  [c013b4d9] __raw_notifier_call_chain+0x19/0x1e
  [c01420d3] _cpu_down+0x84/0x24c
  [c01422c3] cpu_down+0x28/0x3a
  [c029f59e] store_online+0x27/0x5a
  [c029c9dc] sysdev_store+0x20/0x25
  [c019a695] sysfs_write_file+0xad/0xde
  [c0169929] vfs_write+0x82/0xb8
  [c0169e2a] sys_write+0x3d/0x61
  [c0104072] sysenter_past_esp+0x5f/0x85
  ===
 ---
 | preempt count: 0001 ]
 | 1-level deep critical section nesting:
 
 .. [c03b25e2]  __spin_lock_irqsave+0x14/0x3b
 .[c011ea76] ..   ( = rq_attach_root+0x12/0xba)
 
 Which is clearly a problem .. 
 
 (I added linux-rt-users to the CC)
 
 Daniel
 -
 To unsubscribe from this list: send the line unsubscribe linux-rt-users in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AW: How to enable HPET on an 2.6.23.9-rt12 with Intel ICH2

2008-01-22 Thread Gregory Haskins
 On Tue, Jan 22, 2008 at  7:14 AM, in message
[EMAIL PROTECTED],
Lampersperger Andreas [EMAIL PROTECTED] wrote: 
 No, there is no BIOS option which indicates HPET support. But the
 BIOS on my system is IMHO very very bad. It knows nothing about the
 features on the board. Is ist crucial to have here BIOS support? Or
 can these features be enabled by e.g. boot parameters?

I had a similar problem once.  It turned out that nmi_watchdog=1 in the 
kernel-args disables HPET such that you will see that warning from cyclictest.  
Not sure if you have a similar setup, but perhaps this info will help.

-Greg  

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc7-rt1

2008-01-15 Thread Gregory Haskins
 On Tue, Jan 15, 2008 at  4:28 AM, in message
[EMAIL PROTECTED], Mike Galbraith [EMAIL PROTECTED]
wrote: 

 debug resume trace
 
 static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
 {
   int first;
 
   /* this_cpu is cheaper to preempt than a remote processor */
  if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
   return this_cpu;
 
   first = first_cpu(*mask);
   if (first != NR_CPUS) {
   if (!cpu_online(first)) {
   WARN_ON_ONCE(1);
   return -1;
   }
   return first;
   }
 
   return -1;
 }
 
 [  156.344352] CPU1 is down
 [3.726557] Intel machine check architecture supported.
 [3.726565] Intel machine check reporting enabled on CPU#0.
 [3.726567] CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
 [3.726570] CPU0: Thermal monitoring enabled
 [3.726576] Back to C!
 [3.727107] Force enabled HPET at resume
 [3.727193] Enabling non-boot CPUs ...
 [3.727446] CPU0 attaching NULL sched-domain.
 [3.727550] WARNING: at kernel/sched_rt.c:385 pick_optimal_cpu()
 [3.727553] Pid: 27, comm: events/0 Not tainted 2.6.24-rt2-smp #73
 [3.727556]  [b010522a] show_trace_log_lvl+0x1a/0x30
 [3.727564]  [b0105ccd] show_trace+0x12/0x14
 [3.727567]  [b010641f] dump_stack+0x6c/0x72
 [3.727569]  [b0120b0f] find_lowest_rq+0x199/0x1af
 [3.727573]  [b0120beb] push_rt_task+0x6e/0x1ed
 [3.727575]  [b01276a3] switched_to_rt+0x39/0x55
 [3.727579]  [b0128cd1] task_setprio+0xbf/0x18f
 [3.727581]  [b014c21e] __rt_mutex_adjust_prio+0x19/0x1c
 [3.727585]  [b014c943] task_blocks_on_rt_mutex+0x14e/0x17e
 [3.727588]  [b02f2e18] rt_spin_lock_slowlock+0xed/0x16a
 [3.727593]  [b02f351a] __rt_spin_lock+0x41/0x48
 [3.727596]  [b02f3529] rt_spin_lock+0x8/0xa
 [3.727598]  [b013f2c2] finish_wait+0x25/0x49
 [3.727602]  [b013c746] worker_thread+0x64/0xe1
 [3.727605]  [b013efb2] kthread+0x39/0x5b
 [3.727607]  [b0104e83] kernel_thread_helper+0x7/0x14
 [3.727610]  ===
 [3.727624] SMP alternatives: switching to SMP code
 [3.728314] Booting processor 1/1 eip 3000

This gives me a much better hint, Mike.  Thanks!

-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc7-rt1

2008-01-14 Thread Gregory Haskins
 On Mon, Jan 14, 2008 at  3:27 AM, in message
[EMAIL PROTECTED], Mike Galbraith [EMAIL PROTECTED]
wrote: 

 On Sun, 2008-01-13 at 15:54 -0500, Steven Rostedt wrote:
 
 OK, -rt2 will take a bit more beating from me before I release it, so it
 might take some time to get it out (expect it out on Monday).
 
 Ah, that reminds me (tests, yup) I still need the patchlet below to
 resume from ram without black screen of death.

I had forgotten about this issue over the holidays...Sorry Mike.

So does that BUG_ON trip if you remove the first hunk?

 No idea why my P4 box
 seems to be the only box in the rt galaxy affected.  (haven't poked at
 it since the holidays)
 
 Index: linux-2.6.24.git-rt1/kernel/sched_rt.c
 ===
 --- linux-2.6.24.git-rt1.orig/kernel/sched_rt.c
 +++ linux-2.6.24.git-rt1/kernel/sched_rt.c
 @@ -33,6 +33,9 @@ static inline void rt_clear_overload(str
  
  static void update_rt_migration(struct rq *rq)
  {
 + if (unlikely(num_online_cpus() == 1))
 + return;
 +
   if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
   if (!rq->rt.overloaded) {
   rt_set_overload(rq);
 @@ -105,8 +108,10 @@ static inline void dec_rt_tasks(struct t
   } /* otherwise leave rq->highest prio alone */
   } else
   rq->rt.highest_prio = MAX_RT_PRIO;
 - if (p->nr_cpus_allowed > 1)
 + if (p->nr_cpus_allowed > 1) {
 + BUG_ON(!rq->rt.rt_nr_migratory);
   rq->rt.rt_nr_migratory--;
 + }
 
   if (rq->rt.highest_prio != highest_prio)
   cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio);
 
 
 btw, CONFIG_INTEL_IOATDMA compile booboo
 
   CC  drivers/dma/ioat_dma.o
 drivers/dma/ioat_dma.c: In function ‘ioat1_tx_submit’:
 drivers/dma/ioat_dma.c:300: error: too few arguments to function 
 ‘__list_splice’
 make[2]: *** [drivers/dma/ioat_dma.o] Error 1
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-rt-users in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Subject: SCHED - Use a 2-d bitmap for searching lowest-pri CPU

2007-12-05 Thread Gregory Haskins
 On Wed, Dec 5, 2007 at  4:34 AM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 * Gregory Haskins [EMAIL PROTECTED] wrote:
 
 The current code use a linear algorithm which causes scaling issues on 
 larger SMP machines.  This patch replaces that algorithm with a 
 2-dimensional bitmap to reduce latencies in the wake-up path.
 
 hm, what kind of scaling issues

Well, the linear algorithm scales with the number of online cpus, so as you add 
more cpus the latencies increase, thus decreasing determinism.
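For background, the 2-d structure is essentially a small array indexed by
"priority currently running on each CPU", where each entry holds a count
and a cpumask; the search then walks at most MAX_RT_PRIO levels instead of
all online cpus.  A simplified sketch (the field names only approximate the
real cpupri code, and the priority remapping is omitted):

/* Simplified sketch of the 2-d lowest-priority search */
struct cpupri_vec {
	int		count;		/* CPUs currently at this level */
	cpumask_t	mask;		/* which CPUs those are */
};

/*
 * pri_to_cpu[] is indexed by what the CPU is currently running, with
 * 0 = idle and larger indices = higher-priority work.  To place a task,
 * scan upward and stop at the first non-empty level below the task's
 * own (remapped) priority.
 */
static int cpupri_find_sketch(struct cpupri_vec *pri_to_cpu, int task_pri,
			      cpumask_t *allowed, cpumask_t *lowest_mask)
{
	int idx;

	for (idx = 0; idx < task_pri; idx++) {
		struct cpupri_vec *vec = &pri_to_cpu[idx];

		if (!vec->count)
			continue;

		cpus_and(*lowest_mask, *allowed, vec->mask);
		if (!cpus_empty(*lowest_mask))
			return 1;	/* O(MAX_RT_PRIO), not O(NR_CPUS) */
	}

	return 0;
}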

 - do you have any numbers?  What workload did you measure and on what 
 hardware?
 

The vast majority of my own work was on 4-way and 8-way systems with -rt which 
I can dig up and share with you if you like.  Generally speaking, I would 
present various loads (make -j 128 being the de facto standard) and 
simultaneously measure RT latency performance with tools such as cyclictest 
and/or preempt-test.  Typically, I would let these loads/measurements run 
overnight to really seek out the worst-case maximums.

As a synopsis, the 2-d alg helps even on the 4-way, but even more so on the 
8-way.  I expect that trend to continue as we add more cpus.  The general 
numbers I recall from memory are that the 2-d alg reduces max latencies by 
about 25-40% on the 8-way.

However, that said, Steven's testing work on the mainline port of our series 
sums it up very nicely, so I will present that in lieu of digging up my -rt 
numbers unless you specifically want them too.  Here they are:


http://people.redhat.com/srostedt/rt-benchmarks/

As I believe you are aware, these numbers were generated on a 64-way SMP beast. 
To break down what we are looking at, let's start with the first row of graphs 
(under Kernel and System Information).  What we see here are scatterplots of 
wake-latencies generated from Steven's rt-migrate test.  We start with the 
vanilla rc3 kernel, and make our way through sdr (the first 7-8 patches 
from v7), gh (patches 8-20), and finally cpupri (patches 21-23).  (Ignore 
the noavoid runs; that was an experiment that Steven and I agree was a flop, 
so it's dropped from the series). 

The test has a default run-interval of 20ms, which is employed here in this 
test.  What we expect to see is that the red points (high-priority) should be 
allowed to start immediately (close to 0 latency), whereas green points 
(low-priority) may sometimes happen to start first, or may start after the 20ms 
interval depending on luck of the draw.

You can see that rc3 has issues keeping the red-points near zero, which is 
part of the problem we were addressing in this series.

http://people.redhat.com/srostedt/rt-benchmarks/imgs/results-rt-migrate-test-2.6.24-rc3.out.png

Rolling in sdr we fix the wandering/20ms red-points, but you still see a 
degree of jitter in the red-points staying close to zero.

http://people.redhat.com/srostedt/rt-benchmarks/imgs/results-rt-migrate-test-2.6.24-rc3-sdr.out-sm.png

This is ok.  The code technically passes the test and does the right thing 
from a balancing perspective.  The remainder of the series is performance 
related.

We then apply gh which adds support for considering things like NUMA/MC 
distance and we can see a higher degree of determinism, particularly on the 
red-points.  

http://people.redhat.com/srostedt/rt-benchmarks/imgs/results-rt-migrate-test-2.6.24-rc3-gh.out-sm.png

Rolling in cpupri we substitute the linear search for the 2-d bitmap 
optimization, are we observe a further increase in determinism:

http://people.redhat.com/srostedt/rt-benchmarks/imgs/results-rt-migrate-test-2.6.24-rc3-gh-cpupri.out-sm.png

The numbers presented in these scatterplots can be summarized in the following 
bar-graph for easier comparison:

http://people.redhat.com/srostedt/rt-benchmarks/imgs/results-rt-migrate-hist-sm.png

You can clearly see the latencies dropping as we move through the 23 patches.  
This is similar to my findings under -rt with the 4/8 ways that I mentioned 
earlier, but I can find those numbers for you next if you feel it is necessary.

I hope this helps to show the difference.  Thanks for hanging in there on this 
long mail!

Regards,
-Greg






-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Subject: SCHED - Use a 2-d bitmap for searching lowest-pri CPU

2007-12-05 Thread Gregory Haskins
 On Wed, Dec 5, 2007 at  6:44 AM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 * Gregory Haskins [EMAIL PROTECTED] wrote:
 
 However, that said, Steven's testing work on the mainline port of our 
 series sums it up very nicely, so I will present that in lieu of 
 digging up my -rt numbers unless you specifically want them too.  Here 
 they are:
 
 i'm well aware of Steve's benchmarking efforts, but i dont think he's 
 finished with it and i'll let him present the results once he wants to 
 announce them. I asked about the effects of the 2-d patch in isolation 
 and i'm not sure the numbers show that individual patch in action.

Ah, sorry if I was not clear.  What I was trying to show was that you can 
compare gh to cpupri to see the effects of the 2-d patch in isolation (*) 
in Steven's tests.  I believe it shows a positive impact on some tests, and a 
negligible impact on others.  As long as we don't have a regression 
somewhere, I am happy :)

(*) Yes, cpupri in this test also has patches 21-22 (root-domain).  However, 
note that Steven is not configuring cpusets, and therefore the root-domain code 
is effectively marginalized in this data.  It's not a pure isolation, no.  But 
the results of my tests with *true* isolation present similar characteristics, 
so I felt they were representative.

 
 in any case, you are preaching to the choir, i wrote the first 
 rt-overload code and it's been in -rt forever so it's not like you need 
 to sell me the concept ;-) But upstream quality requirements are 
 different from -rt and we need to examine all aspects of scheduling, not 
 just latency. 

Understood and agree.  I designed the subsystem with the overall system in 
mind, so hopefully that is reflected in the numbers and the review comments 
that come out of this. :)

In any case, i'll wait for the rest of Steve's numbers.

Sounds good.  I'll try to dig up my 4/8-way numbers as well for another data 
point.

Thanks for taking the time to review all this stuff.  I know you are swamped 
these days.

-Greg

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/23] Subject: SCHED - Add rt_nr_running accounting

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|1 +
 kernel/sched_rt.c |   17 +
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b062856..4751c2f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -266,6 +266,7 @@ struct rt_rq {
struct rt_prio_array active;
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+   unsigned long rt_nr_running;
 };
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ee9c8b6..a6271f4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -26,12 +26,27 @@ static void update_curr_rt(struct rq *rq)
cpuacct_charge(curr, delta_exec);
 }
 
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+   WARN_ON(!rt_task(p));
+   rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+   WARN_ON(!rt_task(p));
+   WARN_ON(!rq->rt.rt_nr_running);
+   rq->rt.rt_nr_running--;
+}
+
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
	struct rt_prio_array *array = &rq->rt.active;
 
	list_add_tail(&p->run_list, array->queue + p->prio);
	__set_bit(p->prio, array->bitmap);
+
+   inc_rt_tasks(p, rq);
 }
 
 /*
@@ -46,6 +61,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct 
*p, int sleep)
	list_del(&p->run_list);
	if (list_empty(array->queue + p->prio))
	__clear_bit(p->prio, array->bitmap);
+
+   dec_rt_tasks(p, rq);
 }
 
 /*

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/23] Subject: SCHED - track highest prio queued on runqueue

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|3 +++
 kernel/sched_rt.c |   18 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4751c2f..90c04fd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -267,6 +267,8 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+   /* highest queued rt task prio */
+   int highest_prio;
 };
 
 /*
@@ -6783,6 +6785,7 @@ void __init sched_init(void)
rq-cpu = i;
	rq->migration_thread = NULL;
	INIT_LIST_HEAD(&rq->migration_queue);
+   rq->rt.highest_prio = MAX_RT_PRIO;
 #endif
atomic_set(rq-nr_iowait, 0);
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a6271f4..4d5b9b2 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -30,6 +30,10 @@ static inline void inc_rt_tasks(struct task_struct *p, 
struct rq *rq)
 {
WARN_ON(!rt_task(p));
	rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+   if (p->prio < rq->rt.highest_prio)
+   rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
 }
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -37,6 +41,20 @@ static inline void dec_rt_tasks(struct task_struct *p, 
struct rq *rq)
WARN_ON(!rt_task(p));
	WARN_ON(!rq->rt.rt_nr_running);
	rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+   if (rq->rt.rt_nr_running) {
+   struct rt_prio_array *array;
+
+   WARN_ON(p->prio < rq->rt.highest_prio);
+   if (p->prio == rq->rt.highest_prio) {
+   /* recalculate */
+   array = &rq->rt.active;
+   rq->rt.highest_prio =
+   sched_find_first_bit(array->bitmap);
+   } /* otherwise leave rq->highest prio alone */
+   } else
+   rq->rt.highest_prio = MAX_RT_PRIO;
+#endif /* CONFIG_SMP */
 }
 
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

-
To unsubscribe from this list: send the line unsubscribe linux-rt-users in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/23] Subject: SCHED - push RT tasks

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach and pushes the RT tasks that are not running onto other
run queues that have lower priority.  The way this works is that the highest
RT task that is not running is picked, and we examine the runqueues of the
CPUs in that task's affinity mask.  We find the runqueue with the lowest
prio within the picked task's CPU affinity, and if that prio is lower than
the picked task's, we push the task onto that CPU's runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUS.
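In rough pseudo-C, the loop described above looks like this (a control-flow
sketch only; the actual patch below handles locking, retries, and task
reference counts):

/* Control-flow sketch of the push algorithm (details omitted) */
static void push_rt_tasks_sketch(struct rq *this_rq)
{
	for (;;) {
		struct task_struct *next;
		struct rq *lowest;

		next = pick_next_highest_task_rt(this_rq, -1);
		if (!next)
			break;			/* nothing left to push */

		lowest = find_lock_lowest_rq(next, this_rq);
		if (!lowest)
			break;			/* no rq would be preempted */

		/* move 'next' and let the target reschedule */
		deactivate_task(this_rq, next, 0);
		set_task_cpu(next, lowest->cpu);
		activate_task(lowest, next, 0);
		resched_task(lowest->curr);

		spin_unlock(&lowest->lock);
	}
}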

TODO: The algorithm may stop when there are still RT tasks that can be
 migrated. Specifically, if the CPU affinity of the highest non-running RT
 task is restricted to CPUs that are running higher-priority tasks, there
 may be a lower-priority task queued whose affinity includes a CPU that is
 running an even lower-priority task it could be migrated to.  This
 patch set does not address that issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|8 ++
 kernel/sched_rt.c |  225 +
 2 files changed, 231 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 90c04fd..7748d14 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1883,6 +1883,8 @@ static void finish_task_switch(struct rq *rq, struct 
task_struct *prev)
	prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
+   schedule_tail_balance_rt(rq);
+
fire_sched_in_preempt_notifiers(current);
if (mm)
mmdrop(mm);
@@ -2116,11 +2118,13 @@ static void double_rq_unlock(struct rq *rq1, struct rq 
*rq2)
 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
  */
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
	__releases(this_rq->lock)
	__acquires(busiest->lock)
	__acquires(this_rq->lock)
 {
+   int ret = 0;
+
if (unlikely(!irqs_disabled())) {
	/* printk() doesn't work good under rq->lock */
	spin_unlock(&this_rq->lock);
@@ -2131,9 +2135,11 @@ static void double_lock_balance(struct rq *this_rq, 
struct rq *busiest)
	spin_unlock(&this_rq->lock);
	spin_lock(&busiest->lock);
	spin_lock(&this_rq->lock);
+   ret = 1;
	} else
	spin_lock(&busiest->lock);
}
+   return ret;
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 4d5b9b2..b5ef4b8 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -135,6 +135,227 @@ static void put_prev_task_rt(struct rq *rq, struct 
task_struct *p)
 }
 
 #ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+   struct rt_prio_array *array = &rq->rt.active;
+   struct task_struct *next;
+   struct list_head *queue;
+   int idx;
+
+   assert_spin_locked(&rq->lock);
+
+   if (likely(rq->rt.rt_nr_running < 2))
+   return NULL;
+
+   idx = sched_find_first_bit(array->bitmap);
+   if (unlikely(idx >= MAX_RT_PRIO)) {
+   WARN_ON(1); /* rt_nr_running is bad */
+   return NULL;
+   }
+
+   queue = array->queue + idx;
+   next = list_entry(queue->next, struct task_struct, run_list);
+   if (unlikely(next != rq->curr))
+   return next;
+
+   if (queue->next->next != queue) {
+   /* same prio task */
+   next = list_entry(queue->next->next, struct task_struct, 
run_list);
+   return next;
+   }
+
+   /* slower, but more flexible */
+   idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+   if (unlikely(idx >= MAX_RT_PRIO)) {
+   WARN_ON(1); /* rt_nr_running was 2 and above! */
+   return NULL;
+   }
+
+   queue = array->queue + idx;
+   next = list_entry(queue->next, struct task_struct, run_list);
+
+   return next;
+}
+
+static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct

[PATCH 05/23] Subject: SCHED - pull RT tasks

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull is initiated, all overloaded runqueues are examined for an RT
task that is higher in prio than the highest prio task queued on the
target runqueue.  If such a task is found on another runqueue, it is
pulled to the target runqueue.
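The control flow is roughly the mirror image of the push path (a sketch
only; rt_overload_mask stands in for the overload tracking, and the
double-locking and retry subtleties handled by the patch are omitted):

/* Control-flow sketch of the pull algorithm (details omitted) */
static int pull_rt_task_sketch(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu, cpu, pulled = 0;

	for_each_cpu_mask(cpu, rt_overload_mask) {	/* overloaded rqs only */
		struct rq *src_rq = cpu_rq(cpu);
		struct task_struct *p;

		if (cpu == this_cpu)
			continue;

		double_lock_balance(this_rq, src_rq);
		p = pick_next_highest_task_rt(src_rq, this_cpu);

		/* only worth pulling if it beats what we already have queued */
		if (p && p->prio < this_rq->rt.highest_prio) {
			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, this_cpu);
			activate_task(this_rq, p, 0);
			pulled = 1;
		}
		spin_unlock(&src_rq->lock);
	}

	return pulled;
}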

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|2 +
 kernel/sched_rt.c |  187 ++---
 2 files changed, 178 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7748d14..a30147e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3652,6 +3652,8 @@ need_resched_nonpreemptible:
	switch_count = &prev->nvcsw;
}
 
+   schedule_balance_rt(rq, prev);
+
	if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b8c758a..a2f1057 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -177,8 +177,17 @@ static void put_prev_task_rt(struct rq *rq, struct 
task_struct *p)
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+   if (!task_running(rq, p) &&
+       (cpu < 0 || cpu_isset(cpu, &p->cpus_allowed)))
+   return 1;
+   return 0;
+}
+
 /* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+int cpu)
 {
	struct rt_prio_array *array = &rq->rt.active;
struct task_struct *next;
@@ -197,26 +206,36 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq)
}
 
	queue = array->queue + idx;
+   BUG_ON(list_empty(queue));
+
	next = list_entry(queue->next, struct task_struct, run_list);
-   if (unlikely(next != rq->curr))
-   return next;
+   if (unlikely(pick_rt_task(rq, next, cpu)))
+   goto out;
 
	if (queue->next->next != queue) {
	/* same prio task */
	next = list_entry(queue->next->next, struct task_struct, 
run_list);
-   return next;
+   if (pick_rt_task(rq, next, cpu))
+   goto out;
}
 
+ retry:
/* slower, but more flexible */
idx = find_next_bit(array-bitmap, MAX_RT_PRIO, idx+1);
-   if (unlikely(idx = MAX_RT_PRIO)) {
-   WARN_ON(1); /* rt_nr_running was 2 and above! */
+   if (unlikely(idx = MAX_RT_PRIO))
return NULL;
-   }
 
queue = array-queue + idx;
-   next = list_entry(queue-next, struct task_struct, run_list);
+   BUG_ON(list_empty(queue));
+
+   list_for_each_entry(next, queue, run_list) {
+   if (pick_rt_task(rq, next, cpu))
+   goto out;
+   }
+
+   goto retry;
 
+ out:
return next;
 }
 
@@ -303,13 +322,15 @@ static int push_rt_task(struct rq *this_rq)
 
assert_spin_locked(this_rq-lock);
 
-   next_task = pick_next_highest_task_rt(this_rq);
+   next_task = pick_next_highest_task_rt(this_rq, -1);
if (!next_task)
return 0;
 
  retry:
-   if (unlikely(next_task == this_rq-curr))
+   if (unlikely(next_task == this_rq-curr)) {
+   WARN_ON(1);
return 0;
+   }
 
/*
 * It's possible that the next_task slipped in of
@@ -333,7 +354,7 @@ static int push_rt_task(struct rq *this_rq)
 * so it is possible that next_task has changed.
 * If it has, then try again.
 */
-   task = pick_next_highest_task_rt(this_rq);
+   task = pick_next_highest_task_rt(this_rq, -1);
if (unlikely(task != next_task)  task  paranoid--) {
put_task_struct(next_task);
next_task = task;
@@ -376,6 +397,149 @@ static void push_rt_tasks(struct rq *rq)
;
 }
 
+static int pull_rt_task(struct rq *this_rq)
+{
+   struct task_struct *next;
+   struct task_struct *p;
+   struct rq *src_rq;
+   cpumask_t *rto_cpumask;
+   int this_cpu = this_rq-cpu;
+   int cpu;
+   int ret = 0;
+
+   assert_spin_locked(this_rq-lock);
+
+   /*
+* If cpusets are used, and we have overlapping
+* run queue cpusets, then this algorithm may not catch all.
+* This is just the price you pay on trying to keep
+* dirtying caches down on large SMP machines.
+*/
+   if (likely(!rt_overloaded()))
+   return

[PATCH 07/23] Subject: SCHED - disable CFS RT load balancing.

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS.  In fact, this code was found to pull RT tasks
away from the CPUs that the active movement had just placed them on,
resulting in large latencies.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   95 ++---
 1 files changed, 4 insertions(+), 91 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a0b05ff..ea07ffa 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -565,109 +565,22 @@ static void wakeup_balance_rt(struct rq *rq, struct 
task_struct *p)
push_rt_tasks(rq);
 }
 
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = rq-rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = sched_find_first_bit(array-bitmap);
-   if (idx = MAX_RT_PRIO)
-   return NULL;
-
-   head = array-queue + idx;
-   curr = head-prev;
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr-prev;
-
-   rq-rt.rt_load_balance_idx = idx;
-   rq-rt.rt_load_balance_head = head;
-   rq-rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-   struct rq *rq = arg;
-   struct rt_prio_array *array = rq-rt.active;
-   struct list_head *head, *curr;
-   struct task_struct *p;
-   int idx;
-
-   idx = rq-rt.rt_load_balance_idx;
-   head = rq-rt.rt_load_balance_head;
-   curr = rq-rt.rt_load_balance_curr;
-
-   /*
-* If we arrived back to the head again then
-* iterate to the next queue (if any):
-*/
-   if (unlikely(head == curr)) {
-   int next_idx = find_next_bit(array-bitmap, MAX_RT_PRIO, idx+1);
-
-   if (next_idx = MAX_RT_PRIO)
-   return NULL;
-
-   idx = next_idx;
-   head = array-queue + idx;
-   curr = head-prev;
-
-   rq-rt.rt_load_balance_idx = idx;
-   rq-rt.rt_load_balance_head = head;
-   }
-
-   p = list_entry(curr, struct task_struct, run_list);
-
-   curr = curr-prev;
-
-   rq-rt.rt_load_balance_curr = curr;
-
-   return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio)
 {
-   struct rq_iterator rt_rq_iterator;
-
-   rt_rq_iterator.start = load_balance_start_rt;
-   rt_rq_iterator.next = load_balance_next_rt;
-   /* pass 'busiest' rq argument into
-* load_balance_[start|next]_rt iterators
-*/
-   rt_rq_iterator.arg = busiest;
-
-   return balance_tasks(this_rq, this_cpu, busiest, max_load_move, sd,
-idle, all_pinned, this_best_prio, rt_rq_iterator);
+   /* don't touch RT tasks */
+   return 0;
 }
 
 static int
 move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 struct sched_domain *sd, enum cpu_idle_type idle)
 {
-   struct rq_iterator rt_rq_iterator;
-
-   rt_rq_iterator.start = load_balance_start_rt;
-   rt_rq_iterator.next = load_balance_next_rt;
-   rt_rq_iterator.arg = busiest;
-
-   return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle,
- rt_rq_iterator);
+   /* don't touch RT tasks */
+   return 0;
 }
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)  do { } while (0)



[PATCH 08/23] Subject: SCHED - Cache cpus_allowed weight for optimizing migration

2007-12-04 Thread Gregory Haskins
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbound task queue
   up behind the running task, we will fail to ever relocate the unbound
   task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
   overloaded tasks over.  It is therefore optimal to strive to avoid this
   overhead altogether if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We then utilize this information to
skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.
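
To make the overload condition concrete, here is a minimal userspace sketch of
the two counters and the cached weight; the toy_rt_rq type and helper names
are illustrative stand-ins, not the kernel code:

/* Illustrative model of the "overloaded" test: more than one queued RT
 * task AND at least one of them can migrate.  Not kernel code. */
#include <stdio.h>

struct toy_rt_rq {
	unsigned long rt_nr_running;	/* queued RT tasks */
	unsigned long rt_nr_migratory;	/* of those, how many have weight > 1 */
};

static int rt_overload_needed(const struct toy_rt_rq *rt)
{
	return rt->rt_nr_migratory && rt->rt_nr_running > 1;
}

/* Cache the hamming weight once, when the affinity mask changes. */
static int cache_weight(unsigned long cpus_allowed_bits)
{
	return __builtin_popcountl(cpus_allowed_bits);
}

int main(void)
{
	struct toy_rt_rq rt = { .rt_nr_running = 2, .rt_nr_migratory = 0 };

	printf("overload? %d\n", rt_overload_needed(&rt));	/* 0: both bound */
	rt.rt_nr_migratory = 1;
	printf("overload? %d\n", rt_overload_needed(&rt));	/* 1 */
	printf("weight of 0xf0: %d\n", cache_weight(0xf0));	/* 4 */
	return 0;
}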

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 include/linux/init_task.h |1 +
 include/linux/sched.h |2 ++
 kernel/fork.c |1 +
 kernel/sched.c|9 +++-
 kernel/sched_rt.c |   50 +
 5 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index cae35b6..572c65b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -130,6 +130,7 @@ extern struct group_info init_groups;
.normal_prio= MAX_PRIO-20,  \
.policy = SCHED_NORMAL, \
.cpus_allowed   = CPU_MASK_ALL, \
+   .nr_cpus_allowed = NR_CPUS, \
.mm = NULL, \
.active_mm  = init_mm, \
.run_list   = LIST_HEAD_INIT(tsk.run_list), \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac3d496..a3dc53b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -847,6 +847,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
+   void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
 };
 
 struct load_weight {
@@ -956,6 +957,7 @@ struct task_struct {
 
unsigned int policy;
cpumask_t cpus_allowed;
+   int nr_cpus_allowed;
unsigned int time_slice;
 
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8ca1a14..12e3e58 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1237,6 +1237,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
 * parent's CPU). This avoids alot of nasty races.
 */
 	p->cpus_allowed = current->cpus_allowed;
+	p->nr_cpus_allowed = current->nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p-cpus_allowed) ||
!cpu_online(task_cpu(p
set_task_cpu(p, smp_processor_id());
diff --git a/kernel/sched.c b/kernel/sched.c
index ebd114b..70f08de 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -269,6 +269,7 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+   unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
 };
@@ -5080,7 +5081,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t 
new_mask)
goto out;
}
 
-	p->cpus_allowed = new_mask;
+	if (p->sched_class->set_cpus_allowed)
+		p->sched_class->set_cpus_allowed(p, &new_mask);
+	else {
+		p->cpus_allowed    = new_mask;
+		p->nr_cpus_allowed = cpus_weight(new_mask

[PATCH 10/23] Subject: SCHED - Remove some CFS specific code from the wakeup path of RT tasks

2007-12-04 Thread Gregory Haskins
The current wake-up code path tries to determine if it can optimize the
wake-up to this_cpu by computing load calculations.  The problem is that
these calculations are only relevant to CFS tasks where load is king.  For
RT tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation as a function
of CFS task's class.

(Note that there is one checkpatch error due to braces, but this code was
simply moved from kernel/sched.c to kernel/sched_fair.c so I don't want to
modify it)
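
The shape of the new interface is just a per-class function pointer that the
wake-up path consults before placing the task.  A small userspace sketch of
that idea follows; toy_class/toy_task and the defaulting behaviour are
illustrative assumptions, not the kernel's definitions:

/* Toy model of a per-class pre-wakeup routing hook.  The real hook is
 * p->sched_class->select_task_rq(); everything here is illustrative. */
#include <stdio.h>

struct toy_task;

struct toy_class {
	int (*select_task_rq)(struct toy_task *p, int sync);
};

struct toy_task {
	const struct toy_class *class;
	int cpu;
};

static int select_rt(struct toy_task *p, int sync)
{
	(void)sync;
	return p->cpu;		/* RT: priority-based routing, no load math */
}

static const struct toy_class rt_class = { .select_task_rq = select_rt };

static int route(struct toy_task *p, int sync)
{
	if (p->class && p->class->select_task_rq)
		return p->class->select_task_rq(p, sync);
	return p->cpu;		/* default: stay where we are */
}

int main(void)
{
	struct toy_task p = { .class = &rt_class, .cpu = 3 };

	printf("cpu = %d\n", route(&p, 0));
	return 0;
}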

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 include/linux/sched.h   |1 
 kernel/sched.c  |  167 ---
 kernel/sched_fair.c |  148 ++
 kernel/sched_idletask.c |9 +++
 kernel/sched_rt.c   |   10 +++
 5 files changed, 195 insertions(+), 140 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a3dc53b..809658c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -827,6 +827,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+   int  (*select_task_rq)(struct task_struct *p, int sync);
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 70f08de..6fa511d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -866,6 +866,13 @@ static void cpuacct_charge(struct task_struct *tsk, u64 
cputime);
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 #endif
 
+#ifdef CONFIG_SMP
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+#endif /* CONFIG_SMP */
+
 #include sched_stats.h
 #include sched_idletask.c
 #include sched_fair.c
@@ -1051,7 +1058,7 @@ static inline void __set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 /*
  * Is this task likely cache-hot:
  */
-static inline int
+static int
 task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 {
s64 delta;
@@ -1276,7 +1283,7 @@ static unsigned long target_load(int cpu, int type)
 /*
  * Return the average load per task on the cpu's run queue
  */
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1433,58 +1440,6 @@ static int sched_balance_self(int cpu, int flag)
 
 #endif /* CONFIG_SMP */
 
-/*
- * wake_idle() will wake a task on an idle cpu if task-cpu is
- * not idle and an idle cpu is available.  The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
-   cpumask_t tmp;
-   struct sched_domain *sd;
-   int i;
-
-   /*
-* If it is idle, then it is the best cpu to run this task.
-*
-* This cpu is also the best, if it has more than one task already.
-* Siblings must be also busy(in most cases) as they didn't already
-* pickup the extra load from this cpu and hence we need not check
-* sibling runqueue info. This will avoid the checks and cache miss
-* penalities associated with that.
-*/
-   if (idle_cpu(cpu) || cpu_rq(cpu)-nr_running  1)
-   return cpu;
-
-   for_each_domain(cpu, sd) {
-   if (sd-flags  SD_WAKE_IDLE) {
-   cpus_and(tmp, sd-span, p-cpus_allowed);
-   for_each_cpu_mask(i, tmp) {
-   if (idle_cpu(i)) {
-   if (i != task_cpu(p)) {
-   schedstat_inc(p,
-   se.nr_wakeups_idle);
-   }
-   return i;
-   }
-   }
-   } else {
-   break;
-   }
-   }
-   return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
-   return cpu;
-}
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1506,8 +1461,6 @@ static int try_to_wake_up(struct task_struct *p, unsigned 
int state, int sync)
long old_state;
struct rq *rq;
 #ifdef CONFIG_SMP

[PATCH 11/23] Subject: SCHED - Break out the search function

2007-12-04 Thread Gregory Haskins
Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().

(Checkpatch error is inherited from moved code)

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   66 +++--
 1 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index da21c7a..7e26c2c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -261,54 +261,66 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
 
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+static int find_lowest_rq(struct task_struct *task)
 {
-   struct rq *lowest_rq = NULL;
int cpu;
-   int tries;
cpumask_t *cpu_mask = __get_cpu_var(local_cpu_mask);
+   struct rq *lowest_rq = NULL;
 
cpus_and(*cpu_mask, cpu_online_map, task-cpus_allowed);
 
-   for (tries = 0; tries  RT_MAX_TRIES; tries++) {
-   /*
-* Scan each rq for the lowest prio.
-*/
-   for_each_cpu_mask(cpu, *cpu_mask) {
-   struct rq *rq = per_cpu(runqueues, cpu);
+   /*
+* Scan each rq for the lowest prio.
+*/
+   for_each_cpu_mask(cpu, *cpu_mask) {
+   struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == this_rq-cpu)
-   continue;
+   if (cpu == rq-cpu)
+   continue;
 
-   /* We look for lowest RT prio or non-rt CPU */
-   if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
-   }
+   /* We look for lowest RT prio or non-rt CPU */
+   if (rq-rt.highest_prio = MAX_RT_PRIO) {
+   lowest_rq = rq;
+   break;
+   }
 
-   /* no locking for now */
-   if (rq-rt.highest_prio  task-prio 
-   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   lowest_rq = rq;
-   }
+   /* no locking for now */
+   if (rq-rt.highest_prio  task-prio 
+   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
+   lowest_rq = rq;
}
+   }
+
+   return lowest_rq ? lowest_rq-cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *rq)
+{
+   struct rq *lowest_rq = NULL;
+   int cpu;
+   int tries;
 
-   if (!lowest_rq)
+   for (tries = 0; tries  RT_MAX_TRIES; tries++) {
+   cpu = find_lowest_rq(task);
+
+   if (cpu == -1)
break;
 
+   lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
-   if (double_lock_balance(this_rq, lowest_rq)) {
+   if (double_lock_balance(rq, lowest_rq)) {
/*
 * We had to unlock the run queue. In
 * the mean time, task could have
 * migrated already or had its affinity changed.
 * Also make sure that it wasn't scheduled on its rq.
 */
-   if (unlikely(task_rq(task) != this_rq ||
+   if (unlikely(task_rq(task) != rq ||
 !cpu_isset(lowest_rq-cpu, 
task-cpus_allowed) ||
-task_running(this_rq, task) ||
+task_running(rq, task) ||
 !task-se.on_rq)) {
spin_unlock(lowest_rq-lock);
lowest_rq = NULL;



[PATCH 12/23] Subject: SCHED - Allow current_cpu to be included in search

2007-12-04 Thread Gregory Haskins
It doesn't hurt if we allow the current CPU to be included in the
search.  We will just simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched_rt.c |5 +
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7e26c2c..7e444f4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -275,9 +275,6 @@ static int find_lowest_rq(struct task_struct *task)
for_each_cpu_mask(cpu, *cpu_mask) {
struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == rq-cpu)
-   continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
lowest_rq = rq;
@@ -305,7 +302,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
for (tries = 0; tries  RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
 
-   if (cpu == -1)
+   if ((cpu == -1) || (cpu == rq-cpu))
break;
 
lowest_rq = cpu_rq(cpu);



[PATCH 13/23] Subject: SCHED - Pre-route RT tasks on wakeup

2007-12-04 Thread Gregory Haskins
In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to low-priority wakeup path.  I utilized
pre-routing (push the task away to a less important RQ before activating)
approach, while Steve utilized a post-routing approach.  The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens.  The advantage of Steve's method is
that it neatly solves an issue preventing a pull optimization from being
deployed.

In the end, we ended up deploying Steve's idea.  But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple:  Use a light-weight lookup for pre-routing, since we
only need to approximate a good home for the task.  And we also retain the
post-routing push logic to clean up any inaccuracies caused by a condition
of priority mistargeting caused by the lightweight lookup.  Most of the
time, the pre-routing should work and yield lower overhead.  In the cases
where it doesn't, the post-router will bat cleanup.
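
A rough userspace sketch of the two-stage idea follows; the array of per-CPU
priorities and the pre_route() helper are illustrative stand-ins, and the push
logic that bats cleanup is only hinted at in a comment:

/* Illustrative two-stage placement: a cheap estimate at wake-up, with the
 * push path correcting any mistake afterwards.  Not kernel code. */
#include <stdio.h>

#define NCPU 4

static int highest_prio[NCPU] = { 10, 60, 140, 45 };	/* 140 ~ no RT queued */

/* Stage 1: approximate - pick the CPU whose best queued prio is weakest. */
static int pre_route(int task_prio, int cur_cpu)
{
	int cpu, best = cur_cpu;

	if (task_prio < highest_prio[cur_cpu])
		return cur_cpu;			/* we will preempt anyway */
	for (cpu = 0; cpu < NCPU; cpu++)
		if (highest_prio[cpu] > highest_prio[best])
			best = cpu;
	return best;
}

/* Stage 2 (sketch): once the task is queued, push_rt_tasks() re-checks and
 * moves it if this estimate turned out to be stale. */
int main(void)
{
	printf("RT50 waking on cpu0 -> cpu%d\n", pre_route(50, 0));	/* cpu2 */
	return 0;
}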

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7e444f4..ea40851 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -149,8 +149,27 @@ yield_task_rt(struct rq *rq)
 }
 
 #ifdef CONFIG_SMP
+static int find_lowest_rq(struct task_struct *task);
+
 static int select_task_rq_rt(struct task_struct *p, int sync)
 {
+   struct rq *rq = task_rq(p);
+
+   /*
+* If the task will not preempt the RQ, try to find a better RQ
+* before we even activate the task
+*/
+	if ((p->prio >= rq->rt.highest_prio) &&
+	    (p->nr_cpus_allowed > 1)) {
+   int cpu = find_lowest_rq(p);
+
+   return (cpu == -1) ? task_cpu(p) : cpu;
+   }
+
+   /*
+* Otherwise, just let it ride on the affined RQ and the
+* post-schedule router will push the preempted task away
+*/
return task_cpu(p);
 }
 #endif /* CONFIG_SMP */



[PATCH 14/23] Subject: SCHED - Optimize our cpu selection based on topology

2007-12-04 Thread Gregory Haskins
The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally.  In the real world, there are
various topologies and affinities that govern where a task is best suited to
run with the smallest amount of overhead.  NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, Linux is already structured to represent these topologies via
the sched_domains interface.  So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.
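
The resulting preference order can be sketched in a few lines of userspace C.
The real code walks the sched_domain spans between the "last CPU" and "any
candidate" steps; the bitmask helper below elides that walk and uses stand-in
names:

/* Illustrative preference order for choosing among the lowest-prio CPUs:
 * last CPU the task ran on, then the waking CPU, then the first candidate.
 * Masks are plain bitmaps here, not the kernel's cpumask_t. */
#include <stdio.h>

static int pick(unsigned long lowest_mask, int task_cpu, int this_cpu)
{
	if (lowest_mask & (1UL << task_cpu))
		return task_cpu;			/* cache-hot: stay put */
	if (lowest_mask & (1UL << this_cpu))
		return this_cpu;			/* cheaper than a remote CPU */
	if (lowest_mask)
		return __builtin_ctzl(lowest_mask);	/* first candidate */
	return -1;
}

int main(void)
{
	/* candidates are cpus 2 and 5; the task last ran on cpu 1 */
	printf("-> cpu%d\n", pick((1UL << 2) | (1UL << 5), 1, 5));	/* cpu5 */
	return 0;
}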

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched.c|1 +
 kernel/sched_rt.c |  100 +++--
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6fa511d..651270e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -24,6 +24,7 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-10-22  RT overload balancing by Steven Rostedt
  * (with thanks to Gregory Haskins)
+ *  2007-11-05  RT migration/wakeup tuning by Gregory Haskins
  */
 
 #include linux/mm.h
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ea40851..67daa66 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -279,35 +279,111 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int cpu;
-   cpumask_t *cpu_mask = __get_cpu_var(local_cpu_mask);
-   struct rq *lowest_rq = NULL;
+   int   cpu;
+   cpumask_t *valid_mask = __get_cpu_var(valid_cpu_mask);
+   int   lowest_prio = -1;
+   int   ret = 0;
 
-   cpus_and(*cpu_mask, cpu_online_map, task-cpus_allowed);
+   cpus_clear(*lowest_mask);
+   cpus_and(*valid_mask, cpu_online_map, task-cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *cpu_mask) {
+   for_each_cpu_mask(cpu, *valid_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
+   if (ret)
+   cpus_clear(*lowest_mask);
+   cpu_set(rq-cpu, *lowest_mask);
+   return 1;
}
 
/* no locking for now */
-   if (rq-rt.highest_prio  task-prio 
-   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   lowest_rq = rq;
+   if ((rq-rt.highest_prio  task-prio)
+(rq-rt.highest_prio = lowest_prio)) {
+   if (rq-rt.highest_prio  lowest_prio) {
+   /* new low - clear old data */
+   lowest_prio = rq-rt.highest_prio;
+   cpus_clear(*lowest_mask);
+   }
+   cpu_set(rq-cpu, *lowest_mask);
+   ret = 1;
+   }
+   }
+
+   return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+   int first;
+
+   /* this_cpu is cheaper to preempt than a remote processor */
+   if ((this_cpu != -1)  cpu_isset(this_cpu, *mask))
+   return this_cpu;
+
+   first = first_cpu(*mask);
+   if (first != NR_CPUS)
+   return first;
+
+   return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+   struct sched_domain *sd;
+   cpumask_t *lowest_mask = __get_cpu_var(local_cpu_mask);
+   int this_cpu = smp_processor_id();
+   int cpu  = task_cpu(task);
+
+   if (!find_lowest_cpus(task, lowest_mask))
+   return -1;
+
+   /*
+* At this point we have built a mask of cpus representing the
+* lowest priority tasks in the system.  Now we want to elect
+* the best one based on our affinity and topology.
+*
+* We prioritize the last cpu that the task executed on since
+* it is most likely cache-hot in that location.
+*/
+   if (cpu_isset(cpu, *lowest_mask))
+   return cpu;
+
+   /*
+* Otherwise, we consult the sched_domains span maps to figure
+* out which cpu is logically closest to our hot cache data.
+*/
+   if (this_cpu == cpu)
+   this_cpu = -1; /* Skip this_cpu opt if the same */
+
+   for_each_domain(cpu, sd) {
+   if (sd-flags  SD_WAKE_AFFINE

[PATCH 15/23] Subject: SCHED - Optimize rebalancing

2007-12-04 Thread Gregory Haskins
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 651270e..ed031bd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -273,6 +273,7 @@ struct rt_rq {
unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
+   int overloaded;
 };
 
 /*
@@ -6692,6 +6693,7 @@ void __init sched_init(void)
rq-migration_thread = NULL;
INIT_LIST_HEAD(rq-migration_queue);
rq-rt.highest_prio = MAX_RT_PRIO;
+   rq-rt.overloaded = 0;
 #endif
atomic_set(rq-nr_iowait, 0);
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 67daa66..19db3a9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(void)
 }
 static inline void rt_set_overload(struct rq *rq)
 {
+	rq->rt.overloaded = 1;
 	cpu_set(rq->cpu, rt_overload_mask);
 	/*
 	 * Make sure the mask is visible before we set
@@ -32,6 +33,7 @@ static inline void rt_clear_overload(struct rq *rq)
 	/* the order here really doesn't matter */
 	atomic_dec(&rto_count);
 	cpu_clear(rq->cpu, rt_overload_mask);
+	rq->rt.overloaded = 0;
 }
 
 static void update_rt_migration(struct rq *rq)
@@ -446,6 +448,9 @@ static int push_rt_task(struct rq *rq)
 
 	assert_spin_locked(&rq->lock);
 
+	if (!rq->rt.overloaded)
+   return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -673,7 +678,7 @@ static void schedule_tail_balance_rt(struct rq *rq)
 * the lock was owned by prev, we need to release it
 * first via finish_lock_switch and then reaquire it here.
 */
-	if (unlikely(rq->rt.rt_nr_running > 1)) {
+	if (unlikely(rq->rt.overloaded)) {
 		spin_lock_irq(&rq->lock);
 		push_rt_tasks(rq);
 		spin_unlock_irq(&rq->lock);
@@ -685,7 +690,8 @@ static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
 {
 	if (unlikely(rt_task(p)) &&
 	    !task_running(rq, p) &&
-	    (p->prio >= rq->curr->prio))
+	    (p->prio >= rq->rt.highest_prio) &&
+	    rq->rt.overloaded)
push_rt_tasks(rq);
 }
 



[PATCH 16/23] Subject: SCHED - Avoid overload

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch changes the searching for a run queue by a waking RT task
to try to pick another runqueue if the currently running task
is an RT task.

The reason is that RT tasks behave differently from normal
tasks.  Preempting a normal task to run an RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again unless the
migration thread comes along and pulls it off.

RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one, all in the hope
that the new high-priority RT task will keep its cache hot.

Remember that this high-priority RT task was just woken up.
So it may likely have been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   20 
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 19db3a9..e007d2b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -158,11 +158,23 @@ static int select_task_rq_rt(struct task_struct *p, int 
sync)
struct rq *rq = task_rq(p);
 
/*
-* If the task will not preempt the RQ, try to find a better RQ
-* before we even activate the task
+* If the current task is an RT task, then
+* try to see if we can wake this RT task up on another
+* runqueue. Otherwise simply start this RT task
+* on its current runqueue.
+*
+* We want to avoid overloading runqueues. Even if
+* the RT task is of higher priority than the current RT task.
+* RT tasks behave differently than other tasks. If
+* one gets preempted, we try to push it off to another queue.
+* So trying to keep a preempting RT task on the same
+* cache hot CPU will force the running RT task to
+* a cold CPU. So we waste all the cache for the lower
+* RT task in hopes of saving some of a RT task
+* that is just being woken and probably will have
+* cold cache anyway.
 */
-	if ((p->prio >= rq->rt.highest_prio) &&
-	    (p->nr_cpus_allowed > 1)) {
+	if (unlikely(rt_task(rq->curr))) {
int cpu = find_lowest_rq(p);
 
return (cpu == -1) ? task_cpu(p) : cpu;



[PATCH 17/23] Subject: SCHED - restore the migratable conditional

2007-12-04 Thread Gregory Haskins
We don't need to bother searching if the task cannot be migrated

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
---

 kernel/sched_rt.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index e007d2b..fe0b43f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -174,7 +174,8 @@ static int select_task_rq_rt(struct task_struct *p, int 
sync)
 * that is just being woken and probably will have
 * cold cache anyway.
 */
-	if (unlikely(rt_task(rq->curr))) {
+	if (unlikely(rt_task(rq->curr)) &&
+	    (p->nr_cpus_allowed > 1)) {
int cpu = find_lowest_rq(p);
 
return (cpu == -1) ? task_cpu(p) : cpu;



[PATCH 18/23] Subject: SCHED - Optimize cpu search with hamming weight

2007-12-04 Thread Gregory Haskins
We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs.  Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.
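
A minimal sketch of the short-circuit, with plain bitmaps standing in for
cpumask_t and the topology-aware search elided:

/* Illustrative only: if the candidate mask contains exactly one CPU there is
 * nothing left to optimize, so skip the domain walk entirely. */
#include <stdio.h>

static int choose(unsigned long lowest_mask)
{
	int count = __builtin_popcountl(lowest_mask);

	if (!count)
		return -1;				/* no targets found */
	if (count == 1)
		return __builtin_ctzl(lowest_mask);	/* only one choice: done */

	/* more than one candidate: this is where the topology-aware search
	 * (elided here) would run */
	return __builtin_ctzl(lowest_mask);
}

int main(void)
{
	printf("%d\n", choose(1UL << 7));	/* 7, no search needed */
	return 0;
}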

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   25 ++---
 1 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fe0b43f..0514b27 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -301,7 +301,7 @@ static int find_lowest_cpus(struct task_struct *task, 
cpumask_t *lowest_mask)
int   cpu;
cpumask_t *valid_mask = __get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
-   int   ret = 0;
+   int   count   = 0;
 
cpus_clear(*lowest_mask);
cpus_and(*valid_mask, cpu_online_map, task-cpus_allowed);
@@ -314,7 +314,7 @@ static int find_lowest_cpus(struct task_struct *task, 
cpumask_t *lowest_mask)
 
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   if (ret)
+   if (count)
cpus_clear(*lowest_mask);
cpu_set(rq-cpu, *lowest_mask);
return 1;
@@ -326,14 +326,17 @@ static int find_lowest_cpus(struct task_struct *task, 
cpumask_t *lowest_mask)
if (rq-rt.highest_prio  lowest_prio) {
/* new low - clear old data */
lowest_prio = rq-rt.highest_prio;
-   cpus_clear(*lowest_mask);
+   if (count) {
+   cpus_clear(*lowest_mask);
+   count = 0;
+   }
}
cpu_set(rq-cpu, *lowest_mask);
-   ret = 1;
+   count++;
}
}
 
-   return ret;
+   return count;
 }
 
 static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
@@ -357,9 +360,17 @@ static int find_lowest_rq(struct task_struct *task)
cpumask_t *lowest_mask = __get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu  = task_cpu(task);
+   int count= find_lowest_cpus(task, lowest_mask);
 
-   if (!find_lowest_cpus(task, lowest_mask))
-   return -1;
+   if (!count)
+   return -1; /* No targets found */
+
+   /*
+* There is no sense in performing an optimal search if only one
+* target is found.
+*/
+   if (count == 1)
+   return first_cpu(*lowest_mask);
 
/*
 * At this point we have built a mask of cpus representing the



[PATCH 19/23] Subject: SCHED - Optimize out cpu_clears

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

This patch removes several cpumask operations by keeping track
of the first CPU that is of the lowest priority.  When
the search for the lowest-priority runqueue is completed, all
the bits below that first lowest-priority CPU
are cleared.
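
A userspace model of the bookkeeping may help: remember where the current
lowest-priority run began, clear only the clearly-ineligible bits during the
scan, and sweep away everything below that first CPU at the end.  The array of
priorities and the 8-CPU mask below are illustrative stand-ins:

/* Illustrative only; bigger prio number = lower priority here. */
#include <stdio.h>

#define NCPU 8

int main(void)
{
	int prio[NCPU] = { 50, 90, 90, 20, 90, 90, 70, 90 };
	unsigned long mask = (1UL << NCPU) - 1;		/* start: all candidates */
	int lowest_prio = -1, lowest_cpu = -1, cpu;

	for (cpu = 0; cpu < NCPU; cpu++) {
		if (prio[cpu] > lowest_prio) {
			lowest_prio = prio[cpu];	/* new low */
			lowest_cpu = cpu;		/* remember where it starts */
		} else if (prio[cpu] < lowest_prio) {
			mask &= ~(1UL << cpu);		/* definitely not a candidate */
		}
	}
	if (lowest_cpu > 0)
		mask &= ~((1UL << lowest_cpu) - 1);	/* clear everything below it */

	printf("lowest prio %d, mask 0x%lx\n", lowest_prio, mask);	/* 90, 0xb6 */
	return 0;
}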

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   49 -
 1 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0514b27..039be04 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -294,29 +294,36 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
 }
 
 static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);
 
 static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int   cpu;
-   cpumask_t *valid_mask = __get_cpu_var(valid_cpu_mask);
int   lowest_prio = -1;
+   int   lowest_cpu  = -1;
int   count   = 0;
+   int   cpu;
 
-   cpus_clear(*lowest_mask);
-   cpus_and(*valid_mask, cpu_online_map, task-cpus_allowed);
+   cpus_and(*lowest_mask, cpu_online_map, task-cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, *valid_mask) {
+   for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   if (count)
+   /*
+* if we already found a low RT queue
+* and now we found this non-rt queue
+* clear the mask and set our bit.
+* Otherwise just return the queue as is
+* and the count==1 will cause the algorithm
+* to use the first bit found.
+*/
+   if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
-   cpu_set(rq-cpu, *lowest_mask);
+   cpu_set(rq-cpu, *lowest_mask);
+   }
return 1;
}
 
@@ -326,13 +333,29 @@ static int find_lowest_cpus(struct task_struct *task, 
cpumask_t *lowest_mask)
if (rq-rt.highest_prio  lowest_prio) {
/* new low - clear old data */
lowest_prio = rq-rt.highest_prio;
-   if (count) {
-   cpus_clear(*lowest_mask);
-   count = 0;
-   }
+   lowest_cpu = cpu;
+   count = 0;
}
-   cpu_set(rq-cpu, *lowest_mask);
count++;
+   } else
+   cpu_clear(cpu, *lowest_mask);
+   }
+
+   /*
+* Clear out all the set bits that represent
+* runqueues that were of higher prio than
+* the lowest_prio.
+*/
+   if (lowest_cpu  0) {
+   /*
+* Perhaps we could add another cpumask op to
+* zero out bits. Like cpu_zero_bits(cpumask, nrbits);
+* Then that could be optimized to use memset and such.
+*/
+   for_each_cpu_mask(cpu, *lowest_mask) {
+   if (cpu = lowest_cpu)
+   break;
+   cpu_clear(cpu, *lowest_mask);
}
}
 



[PATCH 20/23] Subject: SCHED - balance RT tasks no new wake up

2007-12-04 Thread Gregory Haskins
From: Steven Rostedt [EMAIL PROTECTED]

Run the RT balancing code on wake-up of a new RT task.

Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ed031bd..ba9eadb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1656,6 +1656,7 @@ void fastcall wake_up_new_task(struct task_struct *p, 
unsigned long clone_flags)
inc_nr_running(p, rq);
}
check_preempt_curr(rq, p);
+   wakeup_balance_rt(rq, p);
task_rq_unlock(rq, flags);
 }
 



[PATCH 21/23] Subject: SCHED - Add sched-domain roots

2007-12-04 Thread Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables.  Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset.  However, we currently still maintain some
policy/state as global variables which transcend all cpusets.  Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one.  This logic doesn't have any clients
yet but it will later in the series.
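
The attach/detach mechanics amount to refcounting plus a pair of class hooks.
A toy userspace model of the idea follows; toy_rd/toy_rq and the printed hook
bodies are illustrative stand-ins for the kernel structures and the
sched_class callbacks:

/* Illustrative model of switching a runqueue between refcounted root
 * domains, giving each class a chance to drop/add per-domain state. */
#include <stdio.h>
#include <stdlib.h>

struct toy_rd {
	int refcount;
};

struct toy_rq {
	struct toy_rd *rd;
};

static void leave_domain(struct toy_rq *rq) { printf("leave old domain\n"); }
static void join_domain(struct toy_rq *rq)  { printf("join new domain\n"); }

static void rq_attach_root(struct toy_rq *rq, struct toy_rd *rd)
{
	if (rq->rd) {
		struct toy_rd *old = rq->rd;

		leave_domain(rq);		/* class hook: drop old state */
		if (--old->refcount == 0)
			free(old);
	}
	rd->refcount++;
	rq->rd = rd;
	join_domain(rq);			/* class hook: add state to new rd */
}

int main(void)
{
	struct toy_rd *a = calloc(1, sizeof(*a)), *b = calloc(1, sizeof(*b));
	struct toy_rq rq = { .rd = NULL };

	rq_attach_root(&rq, a);
	rq_attach_root(&rq, b);	/* frees 'a' once its refcount drops to 0 */
	free(b);		/* toy teardown only */
	return 0;
}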

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
CC: Paul Jackson [EMAIL PROTECTED]
CC: Simon Derr [EMAIL PROTECTED]
---

 include/linux/sched.h |3 +
 kernel/sched.c|  121 -
 2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 809658c..b891d81 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -849,6 +849,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+   void (*join_domain)(struct rq *rq);
+   void (*leave_domain)(struct rq *rq);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index ba9eadb..79f3eba 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,28 @@ struct rt_rq {
int overloaded;
 };
 
+#ifdef CONFIG_SMP
+
+/*
+ * We add the notion of a root-domain which will be used to define per-domain
+ * variables.  Each exclusive cpuset essentially defines an island domain by
+ * fully partitioning the member cpus from any other cpuset.  Whenever a new
+ * exclusive cpuset is created, we also create and attach a new root-domain
+ * object.
+ *
+ * By default the system creates a single root-domain with all cpus as
+ * members (mimicking the global state we have today).
+ */
+struct root_domain {
+   atomic_t refcount;
+   cpumask_t span;
+   cpumask_t online;
+};
+
+static struct root_domain def_root_domain;
+
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -333,6 +355,7 @@ struct rq {
atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
+   struct root_domain  *rd;
struct sched_domain *sd;
 
/* For active balancing */
@@ -5489,6 +5512,15 @@ migration_call(struct notifier_block *nfb, unsigned long 
action, void *hcpu)
case CPU_ONLINE_FROZEN:
/* Strictly unnecessary, as first user will wake it. */
wake_up_process(cpu_rq(cpu)-migration_thread);
+
+   /* Update our root-domain */
+   rq = cpu_rq(cpu);
+   spin_lock_irqsave(rq-lock, flags);
+   if (rq-rd) {
+   BUG_ON(!cpu_isset(cpu, rq-rd-span));
+   cpu_set(cpu, rq-rd-online);
+   }
+   spin_unlock_irqrestore(rq-lock, flags);
break;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -5537,6 +5569,17 @@ migration_call(struct notifier_block *nfb, unsigned long 
action, void *hcpu)
}
spin_unlock_irq(rq-lock);
break;
+
+   case CPU_DOWN_PREPARE:
+   /* Update our root-domain */
+   rq = cpu_rq(cpu);
+   spin_lock_irqsave(rq-lock, flags);
+   if (rq-rd) {
+   BUG_ON(!cpu_isset(cpu, rq-rd-span));
+   cpu_clear(cpu, rq-rd-online);
+   }
+   spin_unlock_irqrestore(rq-lock, flags);
+   break;
 #endif
case CPU_LOCK_RELEASE:
mutex_unlock(sched_hotcpu_mutex);
@@ -5728,11 +5771,69 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
sched_domain *parent)
return 1;
 }
 
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+   unsigned long flags;
+   const struct sched_class *class;
+
+   spin_lock_irqsave(rq-lock, flags);
+
+   if (rq-rd) {
+   struct root_domain *old_rd = rq-rd;
+
+   for (class = sched_class_highest; class; class = class-next)
+   if (class-leave_domain)
+   class-leave_domain(rq);
+
+   if (atomic_dec_and_test(old_rd-refcount))
+   kfree(old_rd);
+   }
+
+   atomic_inc(rd-refcount);
+   rq-rd = rd;
+
+   for (class

[PATCH 22/23] Subject: SCHED - Only balance our RT tasks within our root-domain

2007-12-04 Thread Gregory Haskins
We move the rt-overload data, making it the first global state to be
reclassified as per-domain.  This limits the scope of overload-related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system.  Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down.  Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain.  However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution.  If it can be optimized later, so be it.
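
A small userspace model of the per-domain overload state follows; the toy_rd
type, the plain bitmap and the C11 atomics are stand-ins for the kernel's
cpumask_t and atomic_t, and the memory-ordering subtleties of the real code
are glossed over:

/* Illustrative only: a bitmap of overloaded member CPUs plus a counter so the
 * common case ("nobody is overloaded") is a single read. */
#include <stdatomic.h>
#include <stdio.h>

struct toy_rd {
	unsigned long rto_mask;		/* which member CPUs are overloaded */
	atomic_int    rto_count;
};

static void rt_set_overload(struct toy_rd *rd, int cpu)
{
	rd->rto_mask |= 1UL << cpu;
	atomic_fetch_add(&rd->rto_count, 1);	/* mask is published first */
}

static void rt_clear_overload(struct toy_rd *rd, int cpu)
{
	atomic_fetch_sub(&rd->rto_count, 1);
	rd->rto_mask &= ~(1UL << cpu);
}

static int rt_overloaded(const struct toy_rd *rd)
{
	return atomic_load(&rd->rto_count);
}

int main(void)
{
	struct toy_rd rd = { 0 };

	printf("%d\n", rt_overloaded(&rd));	/* 0: pull path bails out early */
	rt_set_overload(&rd, 2);
	printf("%d mask=0x%lx\n", rt_overloaded(&rd), rd.rto_mask);
	rt_clear_overload(&rd, 2);
	return 0;
}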

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   57 -
 2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 79f3eba..0c7e5e4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -292,6 +292,8 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+   cpumask_t rto_mask;
+   atomic_t  rto_count;
 };
 
 static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 039be04..9e8a59d 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
  */
 
 #ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
 {
-	return atomic_read(&rto_count);
+	return atomic_read(&rq->rd->rto_count);
 }
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
 {
-	return &rt_overload_mask;
+	return &rq->rd->rto_mask;
 }
 static inline void rt_set_overload(struct rq *rq)
 {
-	rq->rt.overloaded = 1;
-	cpu_set(rq->cpu, rt_overload_mask);
+	cpu_set(rq->cpu, rq->rd->rto_mask);
 	/*
 	 * Make sure the mask is visible before we set
 	 * the overload count. That is checked to determine
@@ -26,22 +24,24 @@ static inline void rt_set_overload(struct rq *rq)
 	 * updated yet.
 	 */
 	wmb();
-	atomic_inc(&rto_count);
+	atomic_inc(&rq->rd->rto_count);
 }
 static inline void rt_clear_overload(struct rq *rq)
 {
 	/* the order here really doesn't matter */
-	atomic_dec(&rto_count);
-	cpu_clear(rq->cpu, rt_overload_mask);
-	rq->rt.overloaded = 0;
+	atomic_dec(&rq->rd->rto_count);
+	cpu_clear(rq->cpu, rq->rd->rto_mask);
 }
 
 static void update_rt_migration(struct rq *rq)
 {
-	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
 		rt_set_overload(rq);
-	else
+		rq->rt.overloaded = 1;
+	} else {
 		rt_clear_overload(rq);
+		rq->rt.overloaded = 0;
+	}
 }
 #endif /* CONFIG_SMP */
 
@@ -302,7 +302,7 @@ static int find_lowest_cpus(struct task_struct *task, 
cpumask_t *lowest_mask)
int   count   = 0;
int   cpu;
 
-   cpus_and(*lowest_mask, cpu_online_map, task-cpus_allowed);
+   cpus_and(*lowest_mask, task_rq(task)-rd-online, task-cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
@@ -585,18 +585,12 @@ static int pull_rt_task(struct rq *this_rq)
 
assert_spin_locked(this_rq-lock);
 
-   /*
-* If cpusets are used, and we have overlapping
-* run queue cpusets, then this algorithm may not catch all.
-* This is just the price you pay on trying to keep
-* dirtying caches down on large SMP machines.
-*/
-   if (likely(!rt_overloaded()))
+   if (likely(!rt_overloaded(this_rq)))
return 0;
 
next = pick_next_task_rt(this_rq);
 
-   rto_cpumask = rt_overload();
+   rto_cpumask = rt_overload(this_rq);
 
for_each_cpu_mask(cpu, *rto_cpumask) {
if (this_cpu == cpu)
@@ -815,6 +809,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct 
*p)
}
 }
 
+/* Assumes rq-lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+   if (rq-rt.overloaded)
+   rt_set_overload(rq);
+}
+
+/* Assumes rq-lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+   if (rq-rt.overloaded)
+   rt_clear_overload(rq);
+}
+
 static void set_curr_task_rt(struct rq *rq)
 {
struct task_struct *p = rq-curr;
@@ -844,4 +852,7 @@ const struct sched_class rt_sched_class = {
 
.set_curr_task  = set_curr_task_rt,
.task_tick  = task_tick_rt,
+
+   .join_domain= join_domain_rt,
+   .leave_domain

[PATCH 23/23] Subject: SCHED - Use a 2-d bitmap for searching lowest-pri CPU

2007-12-04 Thread Gregory Haskins
The current code uses a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
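
The data structure is easiest to see in a userspace model: one CPU bitmap per
priority level plus a bitmap of non-empty levels, so a lookup is essentially
two find-first-bit operations.  Everything below (toy_cpupri, the level
numbering, the 64-CPU limit) is an illustrative stand-in; the kernel version
also has to cope with concurrent updates:

/* Illustrative 2-d lowest-priority map.  Lower level number = less
 * important work running on that CPU, so it is a better routing target. */
#include <stdio.h>

#define NR_LEVELS 102		/* roughly: INVALID, IDLE, NORMAL, RT1..RT99 */

struct toy_cpupri {
	unsigned long level_cpus[NR_LEVELS];	/* CPUs sitting at each level */
	unsigned long active_levels[2];		/* which levels are non-empty */
	int cpu_level[64];			/* current level of each CPU */
};

static void set_level(struct toy_cpupri *cp, int cpu, int level)
{
	int old = cp->cpu_level[cpu];

	cp->level_cpus[old] &= ~(1UL << cpu);
	if (!cp->level_cpus[old])
		cp->active_levels[old / 64] &= ~(1UL << (old % 64));
	cp->level_cpus[level] |= 1UL << cpu;
	cp->active_levels[level / 64] |= 1UL << (level % 64);
	cp->cpu_level[cpu] = level;
}

/* Return a CPU at the lowest active level that intersects the affinity mask. */
static int find_cpu(const struct toy_cpupri *cp, unsigned long allowed)
{
	for (int level = 0; level < NR_LEVELS; level++) {
		if (!(cp->active_levels[level / 64] & (1UL << (level % 64))))
			continue;
		unsigned long match = cp->level_cpus[level] & allowed;
		if (match)
			return __builtin_ctzl(match);
	}
	return -1;
}

int main(void)
{
	struct toy_cpupri cp = { 0 };

	set_level(&cp, 0, 80);	/* cpu0 is busy with important RT work */
	set_level(&cp, 1, 1);	/* cpu1 is nearly idle */
	set_level(&cp, 2, 2);	/* cpu2 runs a normal task */
	printf("-> cpu%d\n", find_cpu(&cp, 0x7));	/* cpu1: lowest level */
	return 0;
}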

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   11 +++
 kernel/Makefile|1 
 kernel/sched.c |8 ++
 kernel/sched_cpupri.c  |  174 
 kernel/sched_cpupri.h  |   37 ++
 kernel/sched_rt.c  |   36 +-
 6 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c64ce9c..f78ab80 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -63,3 +63,14 @@ config PREEMPT_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config CPUPRI
+   bool Optimize lowest-priority search
+   depends on SMP  EXPERIMENTAL
+   default n
+   help
+ This option attempts to reduce latency in the kernel by replacing
+  the linear lowest-priority search algorithm with a 2-d bitmap.
+
+ Say Y here if you want to try this experimental algorithm.
+ Say N if you are unsure.
+
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..78a385e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_CPUPRI) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra [EMAIL PROTECTED], the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 0c7e5e4..2ace03f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
 #include asm/tlb.h
 #include asm/irq_regs.h
 
+#include sched_cpupri.h
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -294,6 +296,9 @@ struct root_domain {
cpumask_t online;
cpumask_t rto_mask;
atomic_t  rto_count;
+#ifdef CONFIG_CPUPRI
+   struct cpupri cpupri;
+#endif
 };
 
 static struct root_domain def_root_domain;
@@ -5807,6 +5812,9 @@ static void init_rootdomain(struct root_domain *rd, const 
cpumask_t *map)
 
 	rd->span = *map;
 	cpus_and(rd->online, rd->span, cpu_online_map);
+#ifdef CONFIG_CPUPRI
+	cpupri_init(&rd->cpupri);
+#endif
 }
 
 static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..78e069e
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,174 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include sched_cpupri.h
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+
+	if (prio == CPUPRI_INVALID)
+		cpupri = CPUPRI_INVALID;
+	else if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio + 1;
+
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                        \
+	for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);   \
+	     idx < CPUPRI_NR_PRIORITIES;                           \
+	     idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invokation.  By the time the call returns, the CPUs

Re: [PATCH 00/23] RT balance v7

2007-12-04 Thread Gregory Haskins
Ingo Molnar wrote:
 * Gregory Haskins [EMAIL PROTECTED] wrote:
 
 Ingo,

 This series applies on GIT commit 
 2254c2e0184c603f92fc9b81016ff4bb53da622d (2.6.24-rc4 (ish) git HEAD)
 
 please post patches against sched-devel.git - it has part of your 
 previous patches included already, plus some cleanups i did to them, so 
 this series of yours wont apply. sched-devel.git is at:
 
  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git


Ah, will do.  Thanks!

-Greg


[PATCH 0/3] RT balance v7a

2007-12-04 Thread Gregory Haskins
 On Tue, Dec 4, 2007 at  4:27 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 * Gregory Haskins [EMAIL PROTECTED] wrote:
 
 Ingo,
 
 This series applies on GIT commit 
 2254c2e0184c603f92fc9b81016ff4bb53da622d (2.6.24-rc4 (ish) git HEAD)
 
 please post patches against sched-devel.git - it has part of your 
 previous patches included already, plus some cleanups i did to them, so 
 this series of yours wont apply. sched-devel.git is at:
 
  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
Hi Ingo,
  I have rebased to your sched-devel.git.  You were right, most of the patches
  were already there (1-20, in fact), so there remain only the last three.
  These are the ones that went by the cpupri moniker in Steven's testing.  Let me know if
  you have any questions.  Comments/review by anyone are of course welcome.

Regards,
-Greg



[PATCH 1/3] Subject: SCHED - Add sched-domain roots

2007-12-04 Thread Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables.  Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset.  However, we currently still maintain some
policy/state as global variables which transcend all cpusets.  Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one.  This logic doesn't have any clients
yet but it will later in the series.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
CC: Paul Jackson [EMAIL PROTECTED]
CC: Simon Derr [EMAIL PROTECTED]
---

 include/linux/sched.h |3 +
 kernel/sched.c|  121 -
 2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0bb7033..b6885ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -854,6 +854,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+   void (*join_domain)(struct rq *rq);
+   void (*leave_domain)(struct rq *rq);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 4849c72..dfb939f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -351,6 +351,28 @@ struct rt_rq {
int overloaded;
 };
 
+#ifdef CONFIG_SMP
+
+/*
+ * We add the notion of a root-domain which will be used to define per-domain
+ * variables.  Each exclusive cpuset essentially defines an island domain by
+ * fully partitioning the member cpus from any other cpuset.  Whenever a new
+ * exclusive cpuset is created, we also create and attach a new root-domain
+ * object.
+ *
+ * By default the system creates a single root-domain with all cpus as
+ * members (mimicking the global state we have today).
+ */
+struct root_domain {
+   atomic_t refcount;
+   cpumask_t span;
+   cpumask_t online;
+};
+
+static struct root_domain def_root_domain;
+
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -408,6 +430,7 @@ struct rq {
atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
+   struct root_domain  *rd;
struct sched_domain *sd;
 
/* For active balancing */
@@ -5540,6 +5563,15 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	case CPU_ONLINE_FROZEN:
 		/* Strictly unnecessary, as first user will wake it. */
 		wake_up_process(cpu_rq(cpu)->migration_thread);
+
+		/* Update our root-domain */
+		rq = cpu_rq(cpu);
+		spin_lock_irqsave(&rq->lock, flags);
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_set(cpu, rq->rd->online);
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -5588,6 +5620,17 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		}
 		spin_unlock_irq(&rq->lock);
 		break;
+
+	case CPU_DOWN_PREPARE:
+		/* Update our root-domain */
+		rq = cpu_rq(cpu);
+		spin_lock_irqsave(&rq->lock, flags);
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_clear(cpu, rq->rd->online);
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
+		break;
 #endif
 	}
 	return NOTIFY_OK;
@@ -5776,11 +5819,69 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
sched_domain *parent)
return 1;
 }
 
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+	unsigned long flags;
+	const struct sched_class *class;
+
+	spin_lock_irqsave(&rq->lock, flags);
+
+	if (rq->rd) {
+		struct root_domain *old_rd = rq->rd;
+
+		for (class = sched_class_highest; class; class = class->next)
+			if (class->leave_domain)
+				class->leave_domain(rq);
+
+		if (atomic_dec_and_test(&old_rd->refcount))
+			kfree(old_rd);
+	}
+
+	atomic_inc(&rd->refcount);
+	rq->rd = rd;
+
+	for (class = sched_class_highest; class; class = class->next

[PATCH 2/3] Subject: SCHED - Only balance our RT tasks within our root-domain

2007-12-04 Thread Gregory Haskins
We move the rt-overload data as the first global to per-domain
reclassification.  This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system.  Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down.  Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain.  However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution.  If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/sched.c|7 +++
 kernel/sched_rt.c |   57 +
 2 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dfb939f..9a09f82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -367,6 +367,13 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+
+	/*
+	 * The RT overload flag: it gets set if a CPU has more than
+	 * one runnable RT task.
+	 */
+	cpumask_t rto_mask;
+	atomic_t  rto_count;
 };
 
 static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9dcf522..f9728d0 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -5,22 +5,14 @@
 
 #ifdef CONFIG_SMP
 
-/*
- * The RT overload flag: it gets set if a CPU has more than
- * one runnable RT task.
- */
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-
-static inline int rt_overloaded(void)
+static inline int rt_overloaded(struct rq *rq)
 {
-	return atomic_read(&rto_count);
+	return atomic_read(&rq->rd->rto_count);
 }
 
 static inline void rt_set_overload(struct rq *rq)
 {
-	rq->rt.overloaded = 1;
-	cpu_set(rq->cpu, rt_overload_mask);
+	cpu_set(rq->cpu, rq->rd->rto_mask);
 	/*
 	 * Make sure the mask is visible before we set
 	 * the overload count. That is checked to determine
@@ -29,23 +21,25 @@ static inline void rt_set_overload(struct rq *rq)
 	 * updated yet.
 	 */
 	wmb();
-	atomic_inc(&rto_count);
+	atomic_inc(&rq->rd->rto_count);
 }
 
 static inline void rt_clear_overload(struct rq *rq)
 {
 	/* the order here really doesn't matter */
-	atomic_dec(&rto_count);
-	cpu_clear(rq->cpu, rt_overload_mask);
-	rq->rt.overloaded = 0;
+	atomic_dec(&rq->rd->rto_count);
+	cpu_clear(rq->cpu, rq->rd->rto_mask);
 }
 
 static void update_rt_migration(struct rq *rq)
 {
-	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
 		rt_set_overload(rq);
-	else
+		rq->rt.overloaded = 1;
+	} else {
 		rt_clear_overload(rq);
+		rq->rt.overloaded = 0;
+	}
 }
 #endif /* CONFIG_SMP */
 
@@ -306,7 +300,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 	int       count   = 0;
 	int       cpu;
 
-	cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
+	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
 
 	/*
 	 * Scan each rq for the lowest prio.
@@ -580,18 +574,12 @@ static int pull_rt_task(struct rq *this_rq)
 	struct task_struct *p, *next;
 	struct rq *src_rq;
 
-	/*
-	 * If cpusets are used, and we have overlapping
-	 * run queue cpusets, then this algorithm may not catch all.
-	 * This is just the price you pay on trying to keep
-	 * dirtying caches down on large SMP machines.
-	 */
-	if (likely(!rt_overloaded()))
+	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
 	next = pick_next_task_rt(this_rq);
 
-	for_each_cpu_mask(cpu, rt_overload_mask) {
+	for_each_cpu_mask(cpu, this_rq->rd->rto_mask) {
 		if (this_cpu == cpu)
 			continue;
 
@@ -809,6 +797,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
 	}
 }
 
+/* Assumes rq->lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+	if (rq->rt.overloaded)
+		rt_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+	if (rq->rt.overloaded)
+		rt_clear_overload(rq);
+}
+
 static void set_curr_task_rt(struct rq *rq)
 {
struct task_struct *p = rq-curr;
@@ -838,4 +840,7 @@ const struct sched_class rt_sched_class = {
 
.set_curr_task  = set_curr_task_rt,
.task_tick

[PATCH 3/3] Subject: SCHED - Use a 2-d bitmap for searching lowest-pri CPU

2007-12-04 Thread Gregory Haskins
The current code uses a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
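
To give a feel for the idea before the diff (this is only a simplified userspace sketch with made-up names, using a single 64-entry level map where the real code covers roughly 102 levels across several words): the structure keeps one bit per priority level saying "some CPU currently sits at this level", plus a per-level CPU mask, so a search is a short bit scan followed by one mask intersection.

#include <stdint.h>

#define NR_LEVELS 64    /* sketch only; the patch tracks ~102 levels */

struct cpupri_sketch {
	uint64_t active;                /* bit i set => some CPU sits at level i */
	uint64_t cpus_at[NR_LEVELS];    /* which CPUs sit at that level */
};

/*
 * Return a mask of CPUs whose level is below 'task_level' and that intersect
 * the task's affinity, scanning from the lowest level up; 0 if none qualify.
 */
static uint64_t find_lower_cpus(const struct cpupri_sketch *cp,
				int task_level, uint64_t affinity)
{
	for (int lvl = 0; lvl < task_level && lvl < NR_LEVELS; lvl++) {
		if (!(cp->active & (1ULL << lvl)))
			continue;                       /* nobody at this level */

		uint64_t eligible = cp->cpus_at[lvl] & affinity;
		if (eligible)
			return eligible;                /* lowest-priority eligible CPUs */
	}
	return 0;
}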

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   11 +++
 kernel/Makefile|1 
 kernel/sched.c |8 ++
 kernel/sched_cpupri.c  |  174 
 kernel/sched_cpupri.h  |   37 ++
 kernel/sched_rt.c  |   36 +-
 6 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c64ce9c..f78ab80 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -63,3 +63,14 @@ config PREEMPT_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config CPUPRI
+	bool "Optimize lowest-priority search"
+	depends on SMP && EXPERIMENTAL
+	default n
+	help
+	  This option attempts to reduce latency in the kernel by replacing
+	  the linear lowest-priority search algorithm with a 2-d bitmap.
+
+	  Say Y here if you want to try this experimental algorithm.
+	  Say N if you are unsure.
+
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..78a385e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_CPUPRI) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra [EMAIL PROTECTED], the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a09f82..1c173c1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -69,6 +69,8 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -374,6 +376,9 @@ struct root_domain {
 */
cpumask_t rto_mask;
atomic_t  rto_count;
+#ifdef CONFIG_CPUPRI
+   struct cpupri cpupri;
+#endif
 };
 
 static struct root_domain def_root_domain;
@@ -5860,6 +5865,9 @@ static void init_rootdomain(struct root_domain *rd, const cpumask_t *map)
 
 	rd->span = *map;
 	cpus_and(rd->online, rd->span, cpu_online_map);
+#ifdef CONFIG_CPUPRI
+	cpupri_init(&rd->cpupri);
+#endif
 }
 
 static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..78e069e
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,174 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+
+	if (prio == CPUPRI_INVALID)
+		cpupri = CPUPRI_INVALID;
+	else if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio + 1;
+
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                   \
+  for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invocation.  By the time the call returns, the CPUs may have

[PATCH 3/4] Subject: SCHED - Only balance our RT tasks within our root-domain

2007-11-30 Thread Gregory Haskins
We move the rt-overload data as the first global to per-domain
reclassification.  This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system.  Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down.  Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain.  However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution.  If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   57 -
 2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9fcf36a..e9d932d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -292,6 +292,8 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+   cpumask_t rto_mask;
+   atomic_t  rto_count;
 };
 
 static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a9675dc..78a188f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
  */
 
 #ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
 {
-	return atomic_read(&rto_count);
+	return atomic_read(&rq->rd->rto_count);
 }
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
 {
-	return &rt_overload_mask;
+	return &rq->rd->rto_mask;
 }
 static inline void rt_set_overload(struct rq *rq)
 {
-	rq->rt.overloaded = 1;
-	cpu_set(rq->cpu, rt_overload_mask);
+	cpu_set(rq->cpu, rq->rd->rto_mask);
 	/*
 	 * Make sure the mask is visible before we set
 	 * the overload count. That is checked to determine
@@ -26,22 +24,24 @@ static inline void rt_set_overload(struct rq *rq)
 	 * updated yet.
 	 */
 	wmb();
-	atomic_inc(&rto_count);
+	atomic_inc(&rq->rd->rto_count);
 }
 static inline void rt_clear_overload(struct rq *rq)
 {
 	/* the order here really doesn't matter */
-	atomic_dec(&rto_count);
-	cpu_clear(rq->cpu, rt_overload_mask);
-	rq->rt.overloaded = 0;
+	atomic_dec(&rq->rd->rto_count);
+	cpu_clear(rq->cpu, rq->rd->rto_mask);
 }
 
 static void update_rt_migration(struct rq *rq)
 {
-	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1))
+	if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) {
 		rt_set_overload(rq);
-	else
+		rq->rt.overloaded = 1;
+	} else {
 		rt_clear_overload(rq);
+		rq->rt.overloaded = 0;
+	}
 }
 #endif /* CONFIG_SMP */
 
@@ -325,7 +325,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 	int       count   = 0;
 	int       cpu;
 
-	cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);
+	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
@@ -608,18 +608,12 @@ static int pull_rt_task(struct rq *this_rq)
 
 	assert_spin_locked(&this_rq->lock);
 
-	/*
-	 * If cpusets are used, and we have overlapping
-	 * run queue cpusets, then this algorithm may not catch all.
-	 * This is just the price you pay on trying to keep
-	 * dirtying caches down on large SMP machines.
-	 */
-	if (likely(!rt_overloaded()))
+	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
 	next = pick_next_task_rt(this_rq);
 
-	rto_cpumask = rt_overload();
+	rto_cpumask = rt_overload(this_rq);
 
for_each_cpu_mask(cpu, *rto_cpumask) {
if (this_cpu == cpu)
@@ -838,6 +832,20 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
 	}
 }
 
+/* Assumes rq->lock is held */
+static void join_domain_rt(struct rq *rq)
+{
+	if (rq->rt.overloaded)
+		rt_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void leave_domain_rt(struct rq *rq)
+{
+	if (rq->rt.overloaded)
+		rt_clear_overload(rq);
+}
+
 static void set_curr_task_rt(struct rq *rq)
 {
struct task_struct *p = rq-curr;
@@ -867,4 +875,7 @@ const struct sched_class rt_sched_class = {
 
.set_curr_task  = set_curr_task_rt,
.task_tick  = task_tick_rt,
+
+   .join_domain= join_domain_rt,
+   .leave_domain

[PATCH 1/4] Subject: SCHED - Make the wake-up priority a config option

2007-11-30 Thread Gregory Haskins
We recently changed the behavior of the wake-up logic such that a higher
priority task does not preempt a lower-priority task if that task is RT.
Instead, it tries to pre-route the higher task to a different cpu.

This causes a performance regression for me in at least preempt-test.  I
suspect there may be other regressions as well.  We make it easier on people
to select which method they want by making the algorithm a config option,
with the default being the current behavior.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   31 +++
 kernel/sched_rt.c  |   32 
 2 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c64ce9c..c35b1d3 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -52,6 +52,37 @@ config PREEMPT
 
 endchoice
 
+choice
+	prompt "Realtime Wakeup Policy"
+	default RTWAKEUP_FAVOR_HOT_TASK
+
+config RTWAKEUP_FAVOR_HOT_TASK
+	bool "Favor hot tasks"
+	help
+	  This setting strives to avoid creating an RT overload condition
+	  by always favoring a hot RT task over a high priority RT task. The
+	  idea is that a newly woken RT task is not likely to be cache hot
+	  anyway.  Therefore it's cheaper to migrate the new task to some
+	  other processor rather than to preempt a currently executing RT
+	  task, even if the new task is of higher priority than the current.
+
+	  RT tasks behave differently than other tasks. If one gets preempted,
+	  we try to push it off to another queue. So trying to keep a
+	  preempting RT task on the same cache hot CPU will force the
+	  running RT task to a cold CPU. So we waste all the cache for the lower
+	  RT task in hopes of saving some of a RT task that is just being
+	  woken and probably will have cold cache anyway.
+
+config RTWAKEUP_FAVOR_HIGHER_TASK
+	bool "Favor highest priority"
+	help
+	  This setting strives to make sure the highest priority task has
+	  the shortest wakeup latency possible by honoring its affinity when
+	  possible.  Some tests reveal that this results in higher
+	  performance, but this is still experimental.
+
+endchoice
+
 config PREEMPT_BKL
bool Preempt The Big Kernel Lock
depends on SMP || PREEMPT
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0bd14bd..a9675dc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -150,12 +150,19 @@ yield_task_rt(struct rq *rq)
 }
 
 #ifdef CONFIG_SMP
-static int find_lowest_rq(struct task_struct *task);
 
-static int select_task_rq_rt(struct task_struct *p, int sync)
+#ifdef CONFIG_RTWAKEUP_FAVOR_HIGHER_TASK
+static inline int rt_wakeup_premigrate(struct task_struct *p, struct rq *rq)
 {
-	struct rq *rq = task_rq(p);
+	if ((p->prio >= rq->rt.highest_prio) &&
+	    (p->nr_cpus_allowed > 1))
+		return 1;
 
+	return 0;
+}
+#else
+static inline int rt_wakeup_premigrate(struct task_struct *p, struct rq *rq)
+{
 	/*
 	 * If the current task is an RT task, then
 	 * try to see if we can wake this RT task up on another
@@ -174,7 +181,24 @@ static int select_task_rq_rt(struct task_struct *p, int sync)
 	 * cold cache anyway.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
-	    (p->nr_cpus_allowed > 1)) {
+	    (p->nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
+}
+#endif
+
+static int find_lowest_rq(struct task_struct *task);
+
+static int select_task_rq_rt(struct task_struct *p, int sync)
+{
+   struct rq *rq = task_rq(p);
+
+	/*
+	 * Check to see if we should move this task away from its affined
+	 * RQ before we even initially wake it
+	 */
+   if (rt_wakeup_premigrate(p, rq)) {
int cpu = find_lowest_rq(p);
 
return (cpu == -1) ? task_cpu(p) : cpu;



[PATCH 2/4] Subject: SCHED - Add sched-domain roots

2007-11-30 Thread Gregory Haskins
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables.  Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset.  However, we currently still maintain some
policy/state as global variables which transcend all cpusets.  Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one.  This logic doesn't have any clients
yet but it will later in the series.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
CC: Paul Jackson [EMAIL PROTECTED]
CC: Simon Derr [EMAIL PROTECTED]
---

 include/linux/sched.h |3 +
 kernel/sched.c|  121 -
 2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 809658c..b891d81 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -849,6 +849,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+   void (*join_domain)(struct rq *rq);
+   void (*leave_domain)(struct rq *rq);
 };
 
 struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 19973a0..9fcf36a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,28 @@ struct rt_rq {
int overloaded;
 };
 
+#ifdef CONFIG_SMP
+
+/*
+ * We add the notion of a root-domain which will be used to define per-domain
+ * variables.  Each exclusive cpuset essentially defines an island domain by
+ * fully partitioning the member cpus from any other cpuset.  Whenever a new
+ * exclusive cpuset is created, we also create and attach a new root-domain
+ * object.
+ *
+ * By default the system creates a single root-domain with all cpus as
+ * members (mimicking the global state we have today).
+ */
+struct root_domain {
+   atomic_t refcount;
+   cpumask_t span;
+   cpumask_t online;
+};
+
+static struct root_domain def_root_domain;
+
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -333,6 +355,7 @@ struct rq {
atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
+   struct root_domain  *rd;
struct sched_domain *sd;
 
/* For active balancing */
@@ -5479,6 +5502,15 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	case CPU_ONLINE_FROZEN:
 		/* Strictly unnecessary, as first user will wake it. */
 		wake_up_process(cpu_rq(cpu)->migration_thread);
+
+		/* Update our root-domain */
+		rq = cpu_rq(cpu);
+		spin_lock_irqsave(&rq->lock, flags);
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_set(cpu, rq->rd->online);
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -5527,6 +5559,17 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		}
 		spin_unlock_irq(&rq->lock);
 		break;
+
+	case CPU_DOWN_PREPARE:
+		/* Update our root-domain */
+		rq = cpu_rq(cpu);
+		spin_lock_irqsave(&rq->lock, flags);
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_clear(cpu, rq->rd->online);
+		}
+		spin_unlock_irqrestore(&rq->lock, flags);
+		break;
 #endif
 	case CPU_LOCK_RELEASE:
 		mutex_unlock(&sched_hotcpu_mutex);
@@ -5718,11 +5761,69 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
sched_domain *parent)
return 1;
 }
 
+static void rq_attach_root(struct rq *rq, struct root_domain *rd)
+{
+	unsigned long flags;
+	const struct sched_class *class;
+
+	spin_lock_irqsave(&rq->lock, flags);
+
+	if (rq->rd) {
+		struct root_domain *old_rd = rq->rd;
+
+		for (class = sched_class_highest; class; class = class->next)
+			if (class->leave_domain)
+				class->leave_domain(rq);
+
+		if (atomic_dec_and_test(&old_rd->refcount))
+			kfree(old_rd);
+	}
+
+	atomic_inc(&rd->refcount);
+	rq->rd = rd;
+
+	for (class

[PATCH 4/4] Subject: SCHED - Use a 2-d bitmap for searching lowest-pri CPU

2007-11-30 Thread Gregory Haskins
The current code uses a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Christoph Lameter [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   11 +++
 kernel/Makefile|1 
 kernel/sched.c |8 ++
 kernel/sched_cpupri.c  |  174 
 kernel/sched_cpupri.h  |   37 ++
 kernel/sched_rt.c  |   38 ++
 6 files changed, 266 insertions(+), 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c35b1d3..578adba 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -94,3 +94,14 @@ config PREEMPT_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config CPUPRI
+	bool "Optimize lowest-priority search"
+	depends on SMP && EXPERIMENTAL
+	default n
+	help
+	  This option attempts to reduce latency in the kernel by replacing
+	  the linear lowest-priority search algorithm with a 2-d bitmap.
+
+	  Say Y here if you want to try this experimental algorithm.
+	  Say N if you are unsure.
+
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..78a385e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_CPUPRI) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra [EMAIL PROTECTED], the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index e9d932d..892f036 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -294,6 +296,9 @@ struct root_domain {
cpumask_t online;
cpumask_t rto_mask;
atomic_t  rto_count;
+#ifdef CONFIG_CPUPRI
+   struct cpupri cpupri;
+#endif
 };
 
 static struct root_domain def_root_domain;
@@ -5797,6 +5802,9 @@ static void init_rootdomain(struct root_domain *rd, const cpumask_t *map)
 
 	rd->span = *map;
 	cpus_and(rd->online, rd->span, cpu_online_map);
+#ifdef CONFIG_CPUPRI
+	cpupri_init(&rd->cpupri);
+#endif
 }
 
 static void init_defrootdomain(void)
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..78e069e
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,174 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, nr_domcpus)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include "sched_cpupri.h"
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+
+	if (prio == CPUPRI_INVALID)
+		cpupri = CPUPRI_INVALID;
+	else if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio + 1;
+
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                   \
+  for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cp: The cpupri context
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invocation.  By the time the call returns, the CPUs

[ANNOUNCE] Preempt-test v4 released

2007-11-28 Thread Gregory Haskins

v4 is a minor update over v3 to improve the accuracy of the latency 
measurement.  v3 used a different startup barrier mechanism which made thread 
#1 report a falsely large latency.

http://rt.wiki.kernel.org/index.php/Preemption_Test

Regards,
-Greg



[PATCH] RT: Fix a bug for properly setting the priority on rt-dequeue

2007-11-14 Thread Gregory Haskins
Hi Steven,
  This patch applies to 23-rt11 to fix that bug you found in the git-HEAD
  merge.  I will fold this patch into my 24 series so it is fixed there.  Feel
  free to fold this into patch #8 instead of maintaining it separately, if
  you prefer.

--

RT: Fix a bug for properly setting the priority on rt-dequeue

We need to update the priority on task-dequeue whenever it changes, not just
if more RT tasks are pending.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f05912a..864d18a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -85,6 +85,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 {
+	int highest_prio = rq->rt.highest_prio;
+
 	WARN_ON(!rt_task(p));
 	WARN_ON(!rq->rt.rt_nr_running);
 	rq->rt.rt_nr_running--;
@@ -98,13 +100,15 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 			array = rq->rt.active;
 			rq->rt.highest_prio =
 				sched_find_first_bit(array->bitmap);
-			cpupri_set(rq->cpu, rq->rt.highest_prio);
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
 	if (p->nr_cpus_allowed > 1)
 		rq->rt.rt_nr_migratory--;
 
+	if (rq->rt.highest_prio != highest_prio)
+		cpupri_set(rq->cpu, rq->rt.highest_prio);
+
 	update_rt_migration(p, rq);
 #endif /* CONFIG_SMP */
 }



[PATCH] RT: convert cpupri spinlock_t to raw_spinlock_t

2007-11-14 Thread Gregory Haskins
Hi Steven,
   This patch should fix the hang you were seeing in 24-rc2-rt1-pre9 with #8
applied.  I meant to bring this required -rt specific change up when we spoke
on IRC earlier, but it slipped through the cracks. :(  Sorry 'bout that.

Regards,
-Greg




RT: convert cpupri spinlock_t to raw_spinlock_t

We recently started preparing some of the scheduler changes for upstream
merging.  Part of this work involved changing the original raw_spinlock
used in cpupri to a spinlock, since that is the proper type for a mainline change.
However, in order to continue to use this patch back in -rt, we need to
restore the lock back to raw or the kernel will hang during bootup.

So this patch does the conversion, but should not go upstream with the rest
of the scheduler enhancements until the -rt spinlock work is also merged.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_cpupri.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index 09e27eb..91cc9c8 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -39,8 +39,8 @@
 
 struct pri_vec
 {
-   spinlock_t lock;
-   cpumask_t  mask;
+   raw_spinlock_t lock;
+   cpumask_t  mask;
 };
 
 struct cpu_priority {



Re: How to use latency trace

2007-11-13 Thread Gregory Haskins
 On Tue, Nov 13, 2007 at  8:22 AM, in message
[EMAIL PROTECTED], Jaswinder
Singh [EMAIL PROTECTED] wrote: 
 hello Sven,
 
 
 On Nov 7, 2007 2:43 AM, Sven-Thorsten Dietrich
 [EMAIL PROTECTED] wrote:
 
  you can request for interval for 1000 so it should come 1000 all the
  time, but it is not.

 Its should come CLOSE to 1000. What errors have you seen?

 
 My errors are min and avg are not equal to ZERO.
 And I will be more happy If max is also ZERO :)

Jaswinder,
  As we recently discussed on IRC, this is not an error per se, but a status
report of your OS/HW combo.  You will never see zero in cyclictest because the
resolution of cyclictest/hrt is finer than the jitter specification of your
chosen platform.

There is not an RT system in the world that has zero jitter to my knowledge.  
Rather, each OS/HW combo will have some arbitrary jitter specification.   It is 
then up to the application designer to pick the platform where the specified 
jitter is lower than the apps tolerances.  Some dedicated hardware and/or 
RTOS's might spec out in picoseconds or nanoseconds.  Others might in 
microseconds, milliseconds, etc.

In the case of linux-rt on a modern x86 PC, this spec is generally in the range 
of 10us-100us.  For instance, an application that needs hard-realtime latencies 
with no more than 500us jitter would probably work great on linux-rt/x86.  
Conversely, if you require no more than 500ns jitter, you need to look 
elsewhere.

So if you are seeing latency spikes > 100us, they should probably be 
investigated (using latency-trace) as potential bugs in -rt.  However, numbers 
below that range are probably normal for your system.  We are, of course, 
always looking to improve these numbers...but as of right now 10us-100us is 
state of the art.
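
If it helps to see what that number actually is: cyclictest arms a periodic timer and reports how late each wakeup was.  A stripped-down C sketch of that measurement (illustrative only, not cyclictest itself; a real run would also put itself into SCHED_FIFO the way cyclictest -p does):

#include <stdio.h>
#include <time.h>

static long long ns(const struct timespec *t)
{
	return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
	struct timespec next, now;
	long long max_late = 0;
	const long period_ns = 100000;   /* 100us period, like the -i 100 runs above */

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (int i = 0; i < 100000; i++) {
		next.tv_nsec += period_ns;
		while (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		/* sleep until the absolute deadline, then see how late we woke */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
		clock_gettime(CLOCK_MONOTONIC, &now);
		long long late = ns(&now) - ns(&next);
		if (late > max_late)
			max_late = late;
	}
	printf("max wakeup latency: %lld ns\n", max_late);
	return 0;
}

The Max column in the cyclictest output is essentially the worst value of "late" over the run, and it will never be zero because the clock resolution is finer than the platform's inherent jitter.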

I hope this helps.
-Greg




Re: cyclic test results on 8 way (2.6.23.1-rt7 through 2.6.23.1-rt11)

2007-11-08 Thread Gregory Haskins
 On Thu, Nov 8, 2007 at  2:34 AM, in message
[EMAIL PROTECTED], Darren Hart [EMAIL PROTECTED] wrote:

 Greg,
 
 Here are the results I promised you.

Darren,
   First off, thanks a bunch for running through those tests!

 I don't think they are particularly interesting

They *are* interesting in the sense that peaks of 40ish for an 8-way are 
excellent IMO ;)  But yeah, I see what you mean; there's no interesting 
difference between them.

Do you have any comparative data from runs in -rt1 or earlier kernels?

 perhaps more iterations, say 100? 

We generally load the system up a lot more (e.g. make -j 128)  or let it run 
for a long time with the make in a loop (or both).  For instance, here is the 
last data I gathered (with plot attached).  This was for a single pass of make 
mrproper; make allmodconfig; time make -j 128
 
23.1-rt7
-
real10m49.312s
user49m59.674s
sys 16m15.404s

26.38 76.02 56.11 1/304 21301

T: 0 ( 5959) P:90 I:100 C:7500381 Min:  2 Act:2 Avg:4 Max:  78
T: 1 ( 5960) P:89 I:200 C:3750191 Min:  2 Act:3 Avg:5 Max:  91
T: 2 ( 5961) P:88 I:300 C:2500127 Min:  2 Act:3 Avg:4 Max:  66
T: 3 ( 5962) P:87 I:400 C:1875096 Min:  2 Act:5 Avg:6 Max: 106
T: 4 ( 5963) P:86 I:500 C:1500077 Min:  2 Act:6 Avg:5 Max:  92
T: 5 ( 5964) P:85 I:600 C:1250064 Min:  2 Act:5 Avg:6 Max:  93
T: 6 ( 5965) P:84 I:700 C:1071483 Min:  2 Act:3 Avg:6 Max:  78
T: 7 ( 5966) P:83 I:800 C: 937548 Min:  2 Act:5 Avg:6 Max:  97

23.1-rt10
-

real10m39.046s
user56m31.662s
sys 12m51.158s

7.64 63.88 62.70 1/291 30774

T: 0 (15455) P:90 I:100 C:8001856 Min:  2 Act:3 Avg:4 Max:  65
T: 1 (15456) P:89 I:200 C:4000928 Min:  2 Act:3 Avg:4 Max:  46
T: 2 (15457) P:88 I:300 C:2667286 Min:  2 Act:3 Avg:5 Max:  54
T: 3 (15458) P:87 I:400 C:2000464 Min:  2 Act:3 Avg:5 Max:  46
T: 4 (15459) P:86 I:500 C:1600372 Min:  2 Act:3 Avg:5 Max:  46
T: 5 (15460) P:85 I:600 C:1333643 Min:  2 Act:5 Avg:5 Max:  75
T: 6 (15461) P:84 I:700 C:1143123 Min:  2 Act:5 Avg:6 Max:  52
T: 7 (15462) P:83 I:800 C:1000232 Min:  2 Act:4 Avg:5 Max:  66

23.1-rt11
-

real10m41.610s
user55m37.065s
sys 13m16.394s

101.92 101.56 61.65 1/308 21349

T: 0 ( 6028) P:90 I:100 C:6584414 Min:  2 Act:3 Avg:4 Max:  55
T: 1 ( 6029) P:89 I:200 C:3292207 Min:  2 Act:6 Avg:5 Max:  58
T: 2 ( 6030) P:88 I:300 C:2194805 Min:  2 Act:5 Avg:5 Max:  54
T: 3 ( 6031) P:87 I:400 C:1646104 Min:  2 Act:5 Avg:5 Max:  79
T: 4 ( 6032) P:86 I:500 C:1316883 Min:  2 Act:5 Avg:5 Max:  60
T: 5 ( 6033) P:85 I:600 C:1097403 Min:  2 Act:5 Avg:6 Max:  45
T: 6 ( 6034) P:84 I:700 C: 940631 Min:  2 Act:3 Avg:6 Max:  45
T: 7 ( 6035) P:83 I:800 C: 823052 Min:  2 Act:5 Avg:6 Max:  47

As Steven Rostedt pointed out on IRC, numbers coming from me are suspect ;).  
But we can at least use them to see if you can repro similar results.

Thanks again!

Regards,
-Greg

attachment: plot.png

[PATCH] RT: fix uniprocessor build issue with new scheduler enhancements

2007-11-07 Thread Gregory Haskins
The primary issue is that cpupri_init() is not defined; we also clean up
some warnings related to uniprocessor builds.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Dragan Noveski [EMAIL PROTECTED]
---

 kernel/sched.c|2 ++
 kernel/sched_cpupri.h |5 +
 kernel/sched_rt.c |2 +-
 3 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6f24aa0..365c987 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -863,9 +863,11 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, 
struct rq *busiest,
  int *all_pinned, unsigned long *load_moved,
  int *this_best_prio, struct rq_iterator *iterator);
 
+#ifdef CONFIG_SMP
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
 static unsigned long cpu_avg_load_per_task(int cpu);
+#endif /* CONFIG_SMP */
 
 #include sched_stats.h
 #include sched_rt.c
diff --git a/kernel/sched_cpupri.h b/kernel/sched_cpupri.h
index 8cdd15d..2119495 100644
--- a/kernel/sched_cpupri.h
+++ b/kernel/sched_cpupri.h
@@ -5,6 +5,11 @@
 
 int  cpupri_find(struct task_struct *p, cpumask_t *lowest_mask);
 void cpupri_set(int cpu, int pri);
+
+#ifdef CONFIG_SMP
 void cpupri_init(void);
+#else
+#define cpupri_init() do { } while(0)
+#endif
 
 #endif /* _LINUX_CPUPRI_H */
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 71ae9e6..0213aa2 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -207,9 +207,9 @@ yield_task_rt(struct rq *rq, struct task_struct *p)
requeue_task_rt(rq, p);
 }
 
+#ifdef CONFIG_SMP
 static int find_lowest_rq(struct task_struct *task);
 
-#ifdef CONFIG_SMP
 static int select_task_rq_rt(struct task_struct *p, int sync)
 {
struct rq *rq = task_rq(p);



Re: 2.6.23.1-rt9 (and others)

2007-11-07 Thread Gregory Haskins
 On Wed, Nov 7, 2007 at  9:17 AM, in message [EMAIL PROTECTED], Dragan
Noveski [EMAIL PROTECTED] wrote: 

 hallo, i just tried to compile rt9 and rt7 and both times i get this 
 error at about the end of the 'make  make modules' step:
 
   GEN .version
   CHK include/linux/compile.h
   UPD include/linux/compile.h
   CC  init/version.o
   LD  init/built-in.o
   LD  .tmp_vmlinux1
 kernel/built-in.o: In function `sched_init':
 (.init.text+0x1af): undefined reference to `cpupri_init'
 make: *** [.tmp_vmlinux1] Fehler 1
 
 
 i am on uniprozessor machine here, IBM-thinkpad-r50e + debian testing.
 if you need parts of my config file, just feel free to tell me so and  i 
 ll try to provide you the information!

Doh!  Your error made me realize that I broke uniprocessor in -rt9.  Will fix 
right away.

As far as -rt7 is concerned, that doesn't make a lot of sense since cpupri isn't 
introduced until -rt9.  Perhaps your tree was dirtied from a previous 
application of -rt9?  Let me know if that doesn't appear to be the case.

Regards,
-Greg





Re: 2.6.23.1-rt9 (and others)

2007-11-07 Thread Gregory Haskins
 On Wed, Nov 7, 2007 at 10:41 AM, in message
[EMAIL PROTECTED], Steven Rostedt
[EMAIL PROTECTED] wrote: 

 
 So please compare -rt7, -rt8 and -rt10.

Here are my results from running:

sudo chrt -f 80 ./cyclictest -n -p 90 -t 8 -d 100 -i 100

with

while true; do make mrproper; make allmodconfig; make -j 128 > /dev/null; done
 on a plain kernel.org kernel

running in the background on an 8-way (4core x2) C2D Xeon 5335 system

23.1-rt7
---

  144.60 135.30 116.38 30/814 8184

  T: 0 ( 5990) P:90 I:100 C:11560415 Min:2 Act:  5 Avg: 4 Max:  83
  T: 1 ( 5991) P:89 I:200 C:5780208 Min: 2 Act:  5 Avg: 4 Max:  66
  T: 2 ( 5992) P:88 I:300 C:3853472 Min: 2 Act:  5 Avg: 5 Max:  89
  T: 3 ( 5993) P:87 I:400 C:2890104 Min: 2 Act:  6 Avg: 5 Max:  70
  T: 4 ( 5994) P:86 I:500 C:2312083 Min: 2 Act:  4 Avg: 5 Max:  91
  T: 5 ( 5995) P:85 I:600 C:1926736 Min: 2 Act:  5 Avg: 5 Max:  94
  T: 6 ( 5996) P:84 I:700 C:1651488 Min: 2 Act: 12 Avg: 5 Max: 115
  T: 7 ( 5997) P:83 I:800 C:1445052 Min: 2 Act:  6 Avg: 5 Max:  79

23.1-rt8
---
  119.95 106.56 107.98 37/811 18445

  T: 0 ( 5052) P:90 I:100 C:29592746 Min: 2 Act: 21 Avg: 4 Max:  78
  T: 1 ( 5053) P:89 I:200 C:14796374 Min: 2 Act:  6 Avg: 4 Max:  81
  T: 2 ( 5054) P:88 I:300 C:9864249 Min:  2 Act: 10 Avg: 4 Max:  88
  T: 3 ( 5055) P:87 I:400 C:7398187 Min:  2 Act:  6 Avg: 4 Max:  86
  T: 4 ( 5056) P:86 I:500 C:5918550 Min:  2 Act: 13 Avg: 9 Max:  69
  T: 5 ( 5057) P:85 I:600 C:4932125 Min:  2 Act: 11 Avg: 4 Max:  71
  T: 6 ( 5058) P:84 I:700 C:4227536 Min:  2 Act:  8 Avg: 5 Max:  65
  T: 7 ( 5059) P:83 I:800 C:3699094 Min:  2 Act:  4 Avg: 4 Max: 114

23.1-rt10
---
  143.39 123.49 117.92 22/791 5802

  T: 0 ( 5305) P:90 I:100 C:45332860 Min: 2 Act:  4 Avg: 4 Max: 89
  T: 1 ( 5306) P:89 I:200 C:22666431 Min: 2 Act:  3 Avg: 5 Max: 49
  T: 2 ( 5307) P:88 I:300 C:15110954 Min: 2 Act:  6 Avg: 5 Max: 76
  T: 3 ( 5308) P:87 I:400 C:11333216 Min: 2 Act:  7 Avg: 5 Max: 81
  T: 4 ( 5309) P:86 I:500 C:9066572 Min:  2 Act:  8 Avg: 5 Max: 57
  T: 5 ( 5310) P:85 I:600 C:7555477 Min:  2 Act:  8 Avg: 5 Max: 55
  T: 6 ( 5311) P:84 I:700 C:6476123 Min:  2 Act: 14 Avg: 6 Max: 73
  T: 7 ( 5312) P:83 I:800 C:508 Min:  2 Act:  5 Avg: 6 Max: 78

note that the rt10 image was running for a much longer duration than the other 
two...which generally will push the max higher.  I know in general if I let 
-rt7 go that long it will hit 120-150+.  This means these tests were biased 
towards -rt7 and -rt8 performing better, but they still came in with higher 
latencies.

I will officially test -rt11 next, though in my small runs so far it looks great.

HTH

Regards,
-Greg



[PATCH 8/8] RT: Use a 2-d bitmap for searching lowest-pri CPU

2007-11-05 Thread Gregory Haskins
The current code uses a linear algorithm which causes scaling issues
on larger SMP machines.  This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Makefile   |1 
 kernel/sched.c|4 +
 kernel/sched_cpupri.c |  186 +
 kernel/sched_cpupri.h |   10 +++
 kernel/sched_rt.c |   52 ++
 5 files changed, 210 insertions(+), 43 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..a822706 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_SMP) += sched_cpupri.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra [EMAIL PROTECTED], the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 0eced8c..6f24aa0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
 
 #include <asm/tlb.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -6842,6 +6844,8 @@ void __init sched_init(void)
	fair_sched_class.next = &idle_sched_class;
idle_sched_class.next = NULL;
 
+   cpupri_init();
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..e6280b1
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,186 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include sched_cpupri.h
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG
+
+#define CPUPRI_INVALID -1
+#define CPUPRI_IDLE 0
+#define CPUPRI_NORMAL   1
+/* values 2-101 are RT priorities 0-99 */
+
+struct pri_vec
+{
+   raw_spinlock_t lock;
+   cpumask_t  mask;
+};
+
+struct cpu_priority {
+   struct pri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
+   long   pri_active[CPUPRI_NR_PRI_WORDS];
+   intcpu_to_pri[NR_CPUS];
+};
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+/* Convert between a 140 based task->prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+	int cpupri;
+
+	if (prio == MAX_PRIO)
+		cpupri = CPUPRI_IDLE;
+	else if (prio >= MAX_RT_PRIO)
+		cpupri = CPUPRI_NORMAL;
+	else
+		cpupri = MAX_RT_PRIO - prio + 1;
+
+	return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)                   \
+  for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);    \
+       idx < CPUPRI_NR_PRIORITIES;                           \
+       idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @p: The task
+ * @lowest_mask: A mask to fill in with selected CPUs
+ *
+ * Note: This function returns the recommended CPUs as calculated during the
+ * current invocation.  By the time the call returns, the CPUs may have in
+ * fact changed priorities any number of times.  While not ideal, it is not
+ * an issue of correctness since the normal rebalancer logic will correct
+ * any discrepancies created by racing against the uncertainty of the current
+ * priority configuration.
+ *
+ * Returns: (int)bool - CPUs were found
+ */
+int cpupri_find(struct task_struct *p, cpumask_t *lowest_mask)
+{
+	int                  idx      = 0;
+	struct cpu_priority *cp       = &cpu_priority;
+	int                  task_pri = convert_prio(p->prio

[PATCH 3/8] RT: Break out the search function

2007-11-05 Thread Gregory Haskins
Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   62 -
 1 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f1fc1b4..fbe7b8a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -312,43 +312,55 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
return next;
 }
 
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
-				      struct rq *rq)
+static int find_lowest_rq(struct task_struct *task)
 {
-	struct rq *lowest_rq = NULL;
-	cpumask_t cpu_mask;
 	int cpu;
-	int tries;
+	cpumask_t cpu_mask;
+	struct rq *lowest_rq = NULL;
 
 	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
 
-	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
-		/*
-		 * Scan each rq for the lowest prio.
-		 */
-		for_each_cpu_mask(cpu, cpu_mask) {
-			struct rq *curr_rq = per_cpu(runqueues, cpu);
+	/*
+	 * Scan each rq for the lowest prio.
+	 */
+	for_each_cpu_mask(cpu, cpu_mask) {
+		struct rq *rq = cpu_rq(cpu);
 
-			if (cpu == rq->cpu)
-				continue;
+		if (cpu == rq->cpu)
+			continue;
 
-			/* We look for lowest RT prio or non-rt CPU */
-			if (curr_rq->rt.highest_prio >= MAX_RT_PRIO) {
-				lowest_rq = curr_rq;
-				break;
-			}
+		/* We look for lowest RT prio or non-rt CPU */
+		if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+			lowest_rq = rq;
+			break;
+		}
 
-			/* no locking for now */
-			if (curr_rq->rt.highest_prio > task->prio &&
-			    (!lowest_rq || curr_rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				lowest_rq = curr_rq;
-			}
+		/* no locking for now */
+		if (rq->rt.highest_prio > task->prio &&
+		    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+			lowest_rq = rq;
 		}
+	}
+
+	return lowest_rq ? lowest_rq->cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+				      struct rq *rq)
+{
+	struct rq *lowest_rq = NULL;
+	int cpu;
+	int tries;
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		cpu = find_lowest_rq(task);
 
-		if (!lowest_rq)
+		if (cpu == -1)
 			break;
 
+		lowest_rq = cpu_rq(cpu);
+
 		/* if the prio of this runqueue changed, try again */
/* if the prio of this runqueue changed, try again */
if (double_lock_balance(rq, lowest_rq)) {
/*



[PATCH 0/8] RT: scheduler migration/wakeup enhancements

2007-11-05 Thread Gregory Haskins
Ingo, Steven, Thomas,

Please consider this series for inclusion in 23-rt6, as it has shown
a substantial improvement in our local testing.  Independent
verification and/or comments/review are more than welcome.

-Greg

-
RT: scheduler migration/wakeup enhancements

This series applies to 23.1-rt5 and includes numerous tweaks to the
scheduler.  The primary goal of this set of patches is to further improve
wake-up latencies (particularly on larger SMP systems) and decrease migration
overhead.  This is accomplished by making improvements on several fronts:

1) We factor in CPU topology in the routing decision to pick the best
   migration target according to cache hierarchy.

2) We moved some CFS load calculation code as a member function of the CFS
   sched_class.  This removes this unnecessary code from the RT fastpath for
   tasks in the RT sched_class.

3) We make further improvements against non-migratable tasks by factoring in
   the RQ overload state, instead of just the RQ depth.

4) We replace the linear priority search with a 2-d algorithm.

In past -rt releases, latencies could quickly become abysmal on larger SMP (8+
cpus) to the order of 350us+.  The recent work in -rt2 and -rt4 dropped this
figure by a large margin, bringing things in the order of approximately
~120us.  This new series improves upon this work even further, bringing
latencies down to the sub 80us mark on our reference 8-way Intel C2D 5335
Xeon.
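
To make point 1 concrete (operating on the candidate mask that the 2-d search in point 4 produces), the order of preference the series converges on is roughly: stay on the task's previous CPU if it is already among the lowest-priority candidates, otherwise prefer a candidate that shares cache with that CPU, otherwise take any candidate.  A hedged userspace sketch with made-up names ('shares_cache_with_prev' stands in for the sched_domain walk, and it assumes no more than 64 CPUs):

#include <stdint.h>

static int first_bit(uint64_t mask)
{
	return mask ? __builtin_ctzll(mask) : -1;   /* GCC builtin */
}

static int pick_target_cpu(uint64_t candidates, int prev_cpu, int this_cpu,
			   uint64_t shares_cache_with_prev)
{
	if (candidates & (1ULL << prev_cpu))
		return prev_cpu;                        /* cache-hot: stay put */

	uint64_t close = candidates & shares_cache_with_prev;
	if (close) {
		if (this_cpu >= 0 && (close & (1ULL << this_cpu)))
			return this_cpu;                /* cheaper to preempt locally */
		return first_bit(close);                /* first CPU sharing cache */
	}

	return first_bit(candidates);                   /* fall back to any candidate */
}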

These figures were obtained by simultaneous execution of:

# ./cyclictest -n -p 90 -t 8 -d 100 -i 100
# while true; do make mrproper; make alldefconfig; make -j 128; done

for long durations.  The following are the results from one particular run,
though the results are similar across various short and long term trials in
our labs.

23.1-rt5-baseline
--
138.60 110.62 70.96 23/808 10246

T: 0 ( 5179) P:90 I:100 C:9011636 Min:  2 Act:4 Avg:4 Max: 117
T: 1 ( 5180) P:89 I:200 C:4505819 Min:  2 Act:5 Avg:4 Max:  95
T: 2 ( 5181) P:88 I:300 C:3003879 Min:  2 Act:9 Avg:5 Max:  85
T: 3 ( 5182) P:87 I:400 C:2252910 Min:  2 Act:3 Avg:4 Max:  75
T: 4 ( 5183) P:86 I:500 C:1802328 Min:  2 Act:7 Avg:5 Max:  71
T: 5 ( 5184) P:85 I:600 C:1501940 Min:  2 Act:4 Avg:5 Max:  74
T: 6 ( 5185) P:84 I:700 C:1287377 Min:  2 Act:6 Avg:6 Max:  85
T: 7 ( 5186) P:83 I:800 C:1126455 Min:  2 Act:4 Avg:5 Max:  75

23.1-rt5-gh
--
146.47 127.99 85.35 30/815 32289

T: 0 ( 5027) P:90 I:100 C:10856538 Min:  2 Act:4 Avg:4 Max:   60
T: 1 ( 5028) P:89 I:200 C:5428270 Min:  2 Act:7 Avg:5 Max:  57
T: 2 ( 5029) P:88 I:300 C:3618846 Min:  2 Act:5 Avg:5 Max:  48
T: 3 ( 5030) P:87 I:400 C:2714135 Min:  2 Act:7 Avg:5 Max:  61
T: 4 ( 5031) P:86 I:500 C:2171308 Min:  2 Act:6 Avg:6 Max:  51
T: 5 ( 5032) P:85 I:600 C:1809424 Min:  2 Act:5 Avg:7 Max:  59
T: 6 ( 5033) P:84 I:700 C:1550935 Min:  2 Act:6 Avg:6 Max:  54
T: 7 ( 5034) P:83 I:800 C:1357068 Min:  2 Act:7 Avg:6 Max:  62

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]


[PATCH 4/8] RT: Allow current_cpu to be included in search

2007-11-05 Thread Gregory Haskins
It doesn't hurt if we allow the current CPU to be included in the
search.  We will just simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |5 +
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fbe7b8a..7dd67db 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -326,9 +326,6 @@ static int find_lowest_rq(struct task_struct *task)
for_each_cpu_mask(cpu, cpu_mask) {
struct rq *rq = cpu_rq(cpu);
 
-		if (cpu == rq->cpu)
-			continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
lowest_rq = rq;
@@ -356,7 +353,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
for (tries = 0; tries  RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);
 
-   if (cpu == -1)
+		if ((cpu == -1) || (cpu == rq->cpu))
break;
 
lowest_rq = cpu_rq(cpu);



[PATCH 6/8] RT: Optimize our cpu selection based on topology

2007-11-05 Thread Gregory Haskins
The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally.  In the real world, there are
various toplogies and affinities that govern where a task is best suited to
run with the smallest amount of overhead.  NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, linux is already structured to represent these topologies via
the sched_domains interface.  So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|1 +
 kernel/sched_rt.c |   99 +++--
 2 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d16c686..8a27f09 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -26,6 +26,7 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-10-22  RT overload balancing by Steven Rostedt
  * (with thanks to Gregory Haskins)
+ *  2007-11-05  RT migration/wakeup tuning by Gregory Haskins
  */
 
 #include linux/mm.h
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 43d4ea6..6ba5921 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -331,34 +331,109 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
return next;
 }
 
-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 {
-   int cpu;
-   cpumask_t cpu_mask;
-   struct rq *lowest_rq = NULL;
+   int   cpu;
+   cpumask_t valid_mask;
+   int   lowest_prio = -1;
+   int   ret = 0;
 
-   cpus_and(cpu_mask, cpu_online_map, task-cpus_allowed);
+   cpus_clear(*lowest_mask);
+   cpus_and(valid_mask, cpu_online_map, task-cpus_allowed);
 
/*
 * Scan each rq for the lowest prio.
 */
-   for_each_cpu_mask(cpu, cpu_mask) {
+   for_each_cpu_mask(cpu, valid_mask) {
struct rq *rq = cpu_rq(cpu);
 
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   lowest_rq = rq;
-   break;
+   if (ret)
+   cpus_clear(*lowest_mask);
+   cpu_set(rq-cpu, *lowest_mask);
+   return 1;
}
 
/* no locking for now */
-   if (rq-rt.highest_prio  task-prio 
-   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   lowest_rq = rq;
+   if ((rq-rt.highest_prio  task-prio)
+(rq-rt.highest_prio = lowest_prio)) {
+   if (rq-rt.highest_prio  lowest_prio) {
+   /* new low - clear old data */
+   lowest_prio = rq-rt.highest_prio;
+   cpus_clear(*lowest_mask);
+   }
+   cpu_set(rq-cpu, *lowest_mask);
+   ret = 1;
+   }
+   }
+
+   return ret;
+}
+
+static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+{
+   int first;
+
+   /* this_cpu is cheaper to preempt than a remote processor */
+   if ((this_cpu != -1)  cpu_isset(this_cpu, *mask))
+   return this_cpu;
+
+   first = first_cpu(*mask);
+   if (first != NR_CPUS)
+   return first;
+
+   return -1;
+}
+
+static int find_lowest_rq(struct task_struct *task)
+{
+   struct sched_domain *sd;
+   cpumask_t lowest_mask;
+   int this_cpu = smp_processor_id();
+   int cpu  = task_cpu(task);
+
+   if (!find_lowest_cpus(task, lowest_mask))
+   return -1;
+
+   /*
+* At this point we have built a mask of cpus representing the
+* lowest priority tasks in the system.  Now we want to elect
+* the best one based on our affinity and topology.
+*
+* We prioritize the last cpu that the task executed on since
+* it is most likely cache-hot in that location.
+*/
+   if (cpu_isset(cpu, lowest_mask))
+   return cpu;
+
+   /*
+* Otherwise, we consult the sched_domains span maps to figure
+* out which cpu is logically closest to our hot cache data.
+*/
+   if (this_cpu == cpu)
+   this_cpu = -1; /* Skip this_cpu opt if the same */
+
+   for_each_domain(cpu, sd) {
+   if (sd-flags  SD_WAKE_AFFINE) {
+   cpumask_t domain_mask;
+   int   best_cpu;
+
+   cpus_and(domain_mask, sd-span, lowest_mask);
+
+   best_cpu = pick_optimal_cpu(this_cpu

[PATCH 5/8] RT: Pre-route RT tasks on wakeup

2007-11-05 Thread Gregory Haskins
In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to the low-priority wakeup path.  I utilized a
pre-routing approach (push the task away to a less important RQ before
activating), while Steve utilized a post-routing approach.  The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens.  The advantage of Steve's method is
that it neatly solves an issue preventing a pull optimization from being
deployed.

In the end, we ended up deploying Steve's idea.  But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple:  Use a light-weight lookup for pre-routing, since we
only need to approximate a good home for the task.  And we also retain the
post-routing push logic to clean up any priority mistargeting caused by the
lightweight lookup.  Most of the time, the pre-routing should work and yield
lower overhead.  In the cases where it doesn't, the post-router will bat
cleanup.
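
As an illustration only (a simplified userspace model, not kernel code), the
wakeup-time decision added below boils down to the following;
find_lowest_cpu() is a stand-in for the real find_lowest_rq() lookup:

#include <stdio.h>

struct task { int prio; int nr_cpus_allowed; int cpu; };
struct runq { int highest_prio; };

/* stand-in for the real find_lowest_rq() lookup; pretend CPU 3 qualifies */
static int find_lowest_cpu(void)
{
    return 3;
}

static int select_cpu(const struct task *p, const struct runq *rq)
{
    /*
     * Only pre-route when the waking task would not preempt its current
     * runqueue (lower number == more important) and is allowed to move.
     */
    if (p->prio >= rq->highest_prio && p->nr_cpus_allowed > 1) {
        int cpu = find_lowest_cpu();

        return cpu == -1 ? p->cpu : cpu;
    }

    /* otherwise stay put; the post-schedule push logic bats cleanup */
    return p->cpu;
}

int main(void)
{
    struct task p  = { .prio = 50, .nr_cpus_allowed = 4, .cpu = 0 };
    struct runq rq = { .highest_prio = 40 };   /* queued RT40 beats an RT50 */

    printf("wake on cpu %d\n", select_cpu(&p, &rq));   /* prints 3 */
    return 0;
}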

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7dd67db..43d4ea6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -202,9 +202,28 @@ yield_task_rt(struct rq *rq, struct task_struct *p)
requeue_task_rt(rq, p);
 }
 
+static int find_lowest_rq(struct task_struct *task);
+
 #ifdef CONFIG_SMP
 static int select_task_rq_rt(struct task_struct *p, int sync)
 {
+   struct rq *rq = task_rq(p);
+
+   /*
+* If the task will not preempt the RQ, try to find a better RQ
+* before we even activate the task
+*/
+   if ((p-prio = rq-rt.highest_prio)
+(p-nr_cpus_allowed  1)) {
+   int cpu = find_lowest_rq(p);
+
+   return (cpu == -1) ? task_cpu(p) : cpu;
+   }
+
+   /*
+* Otherwise, just let it ride on the affined RQ and the
+* post-schedule router will push the preempted task away
+*/
return task_cpu(p);
 }
 #endif /* CONFIG_SMP */



[PATCH 1/8] RT: Consistency cleanup for this_rq usage

2007-11-05 Thread Gregory Haskins
this_rq is normally used to denote the RQ on the current cpu
(i.e. cpu_rq(this_cpu)).  So clean up the usage of this_rq to be
more consistent with the rest of the code.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   46 +++---
 1 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9b06d7c..0348423 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -307,7 +307,7 @@ static struct task_struct *pick_next_highest_task_rt(struct 
rq *rq,
 
 /* Will lock the rq it finds */
 static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+ struct rq *rq)
 {
struct rq *lowest_rq = NULL;
cpumask_t cpu_mask;
@@ -321,21 +321,21 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 * Scan each rq for the lowest prio.
 */
for_each_cpu_mask(cpu, cpu_mask) {
-   struct rq *rq = per_cpu(runqueues, cpu);
+   struct rq *curr_rq = per_cpu(runqueues, cpu);
 
-   if (cpu == this_rq-cpu)
+   if (cpu == rq-cpu)
continue;
 
/* We look for lowest RT prio or non-rt CPU */
-   if (rq-rt.highest_prio = MAX_RT_PRIO) {
-   lowest_rq = rq;
+   if (curr_rq-rt.highest_prio = MAX_RT_PRIO) {
+   lowest_rq = curr_rq;
break;
}
 
/* no locking for now */
-   if (rq-rt.highest_prio  task-prio 
-   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   lowest_rq = rq;
+   if (curr_rq-rt.highest_prio  task-prio 
+   (!lowest_rq || curr_rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
+   lowest_rq = curr_rq;
}
}
 
@@ -343,16 +343,16 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
break;
 
/* if the prio of this runqueue changed, try again */
-   if (double_lock_balance(this_rq, lowest_rq)) {
+   if (double_lock_balance(rq, lowest_rq)) {
/*
 * We had to unlock the run queue. In
 * the mean time, task could have
 * migrated already or had its affinity changed.
 * Also make sure that it wasn't scheduled on its rq.
 */
-   if (unlikely(task_rq(task) != this_rq ||
+   if (unlikely(task_rq(task) != rq ||
 !cpu_isset(lowest_rq-cpu, 
task-cpus_allowed) ||
-task_running(this_rq, task) ||
+task_running(rq, task) ||
 !task-se.on_rq)) {
spin_unlock(lowest_rq-lock);
lowest_rq = NULL;
@@ -377,21 +377,21 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
  * running task can migrate over to a CPU that is running a task
  * of lesser priority.
  */
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
 {
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
int paranoid = RT_MAX_TRIES;
 
-   assert_spin_locked(this_rq-lock);
+   assert_spin_locked(rq-lock);
 
-   next_task = pick_next_highest_task_rt(this_rq, -1);
+   next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
 
  retry:
-   if (unlikely(next_task == this_rq-curr)) {
+   if (unlikely(next_task == rq-curr)) {
WARN_ON(1);
return 0;
}
@@ -401,24 +401,24 @@ static int push_rt_task(struct rq *this_rq)
 * higher priority than current. If that's the case
 * just reschedule current.
 */
-   if (unlikely(next_task-prio  this_rq-curr-prio)) {
-   resched_task(this_rq-curr);
+   if (unlikely(next_task-prio  rq-curr-prio)) {
+   resched_task(rq-curr);
return 0;
}
 
-   /* We might release this_rq lock */
+   /* We might release rq lock */
get_task_struct(next_task);
 
/* find_lock_lowest_rq locks the rq if found */
-   lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+   lowest_rq = find_lock_lowest_rq(next_task, rq);
if (!lowest_rq) {
struct task_struct *task;
/*
-* find

[PATCH 7/8] RT: Optimize rebalancing

2007-11-05 Thread Gregory Haskins
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched.c|2 ++
 kernel/sched_rt.c |   10 --
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8a27f09..0eced8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -268,6 +268,7 @@ struct rt_rq {
unsigned long rt_nr_uninterruptible;
/* highest queued rt task prio */
int highest_prio;
+   int overloaded;
 };
 
 /*
@@ -6869,6 +6870,7 @@ void __init sched_init(void)
rq-migration_thread = NULL;
INIT_LIST_HEAD(rq-migration_queue);
rq-rt.highest_prio = MAX_RT_PRIO;
+   rq-rt.overloaded = 0;
 #endif
atomic_set(rq-nr_iowait, 0);
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6ba5921..698f4d9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -28,10 +28,12 @@ static inline cpumask_t *rt_overload(struct rq *rq)
 static inline void rt_set_overload(struct rq *rq)
 {
cpu_set(rq-cpu, *rt_overload_mask(rq-cpu));
+   rq-rt.overloaded = 1;
 }
 static inline void rt_clear_overload(struct rq *rq)
 {
cpu_clear(rq-cpu, *rt_overload_mask(rq-cpu));
+   rq-rt.overloaded = 0;
 }
 
 static void update_rt_migration(struct task_struct *p, struct rq *rq)
@@ -496,6 +498,9 @@ static int push_rt_task(struct rq *rq)
 
assert_spin_locked(rq-lock);
 
+   if (!rq-rt.overloaded)
+   return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -737,7 +742,7 @@ static void schedule_tail_balance_rt(struct rq *rq)
 * the lock was owned by prev, we need to release it
 * first via finish_lock_switch and then reaquire it here.
 */
-   if (unlikely(rq-rt.rt_nr_running  1)) {
+   if (unlikely(rq-rt.overloaded)) {
spin_lock(rq-lock);
push_rt_tasks(rq);
schedstat_inc(rq, rto_schedule_tail);
@@ -749,7 +754,8 @@ static void wakeup_balance_rt(struct rq *rq, struct 
task_struct *p)
 {
if (unlikely(rt_task(p)) 
!task_running(rq, p) 
-   (p-prio = rq-curr-prio)) {
+   (p-prio = rq-rt.highest_prio) 
+   rq-rt.overloaded) {
push_rt_tasks(rq);
schedstat_inc(rq, rto_wakeup);
}



[PATCH 2/8] RT: Remove some CFS specific code from the wakeup path of RT tasks

2007-11-05 Thread Gregory Haskins
The current wake-up code path tries to determine if it can optimize the
wake-up to this_cpu by performing load calculations.  The problem is that
these calculations are only relevant to CFS tasks where load is king.  For RT
tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions, and we move the load calculation into the CFS
sched_class as a class-specific function.
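
To illustrate what the new hook buys us, here is a minimal stand-alone sketch
(simplified types, not the kernel code): the generic wakeup path just
dispatches through the class, so an RT wakeup never touches the CFS load math.

#include <stdio.h>

struct task;

struct sched_class {
    int (*select_task_rq)(struct task *p, int sync);
};

struct task {
    const struct sched_class *sched_class;
    int cpu;
};

static int select_task_rq_fair(struct task *p, int sync)
{
    /* CFS would weigh load, wake_idle(), etc. here (elided) */
    return p->cpu;
}

static int select_task_rq_rt(struct task *p, int sync)
{
    /* RT routing is purely priority based; no load math at all */
    return p->cpu;
}

static const struct sched_class fair_sched_class = { select_task_rq_fair };
static const struct sched_class rt_sched_class   = { select_task_rq_rt };

/* the generic wakeup path just asks the class, whatever it is */
static int wakeup_cpu(struct task *p, int sync)
{
    return p->sched_class->select_task_rq(p, sync);
}

int main(void)
{
    struct task cfs_task = { &fair_sched_class, 0 };
    struct task rt_task  = { &rt_sched_class, 2 };

    printf("cfs -> cpu %d, rt -> cpu %d\n",
           wakeup_cpu(&cfs_task, 0), wakeup_cpu(&rt_task, 0));
    return 0;
}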

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 include/linux/sched.h   |1 
 kernel/sched.c  |  135 ---
 kernel/sched_fair.c |  134 +++
 kernel/sched_idletask.c |9 +++
 kernel/sched_rt.c   |   10 +++
 5 files changed, 165 insertions(+), 124 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe3fd1d..cfef8fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1033,6 +1033,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq, struct task_struct *p);
+   int  (*select_task_rq)(struct task_struct *p, int sync);
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 49e68be..d16c686 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -859,6 +859,10 @@ static int balance_tasks(struct rq *this_rq, int this_cpu, 
struct rq *busiest,
  int *all_pinned, unsigned long *load_moved,
  int *this_best_prio, struct rq_iterator *iterator);
 
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_avg_load_per_task(int cpu);
+
 #include sched_stats.h
 #include sched_rt.c
 #include sched_fair.c
@@ -1266,7 +1270,7 @@ void kick_process(struct task_struct *p)
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static inline unsigned long source_load(int cpu, int type)
+static unsigned long source_load(int cpu, int type)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1281,7 +1285,7 @@ static inline unsigned long source_load(int cpu, int type)
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and nice value.
  */
-static inline unsigned long target_load(int cpu, int type)
+static unsigned long target_load(int cpu, int type)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1295,7 +1299,7 @@ static inline unsigned long target_load(int cpu, int type)
 /*
  * Return the average load per task on the cpu's run queue
  */
-static inline unsigned long cpu_avg_load_per_task(int cpu)
+static unsigned long cpu_avg_load_per_task(int cpu)
 {
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);
@@ -1454,53 +1458,6 @@ static int sched_balance_self(int cpu, int flag)
 
 #endif /* CONFIG_SMP */
 
-/*
- * wake_idle() will wake a task on an idle cpu if task-cpu is
- * not idle and an idle cpu is available.  The span of cpus to
- * search starts with cpus closest then further out as needed,
- * so we always favor a closer, idle cpu.
- *
- * Returns the CPU we should wake onto.
- */
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-static int wake_idle(int cpu, struct task_struct *p)
-{
-   cpumask_t tmp;
-   struct sched_domain *sd;
-   int i;
-
-   /*
-* If it is idle, then it is the best cpu to run this task.
-*
-* This cpu is also the best, if it has more than one task already.
-* Siblings must be also busy(in most cases) as they didn't already
-* pickup the extra load from this cpu and hence we need not check
-* sibling runqueue info. This will avoid the checks and cache miss
-* penalities associated with that.
-*/
-   if (idle_cpu(cpu) || cpu_rq(cpu)-nr_running  1)
-   return cpu;
-
-   for_each_domain(cpu, sd) {
-   if (sd-flags  SD_WAKE_IDLE) {
-   cpus_and(tmp, sd-span, p-cpus_allowed);
-   for_each_cpu_mask(i, tmp) {
-   if (idle_cpu(i))
-   return i;
-   }
-   } else {
-   break;
-   }
-   }
-   return cpu;
-}
-#else
-static inline int wake_idle(int cpu, struct task_struct *p)
-{
-   return cpu;
-}
-#endif
-
 /***
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
@@ -1523,8 +1480,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, 
int sync, int mutex)
long old_state;
struct rq *rq;
 #ifdef

[PATCH 0/8] RT: scheduler migration/wakeup enhancements

2007-11-05 Thread Gregory Haskins
Ingo, Steven, Thomas,

Please consider this series for inclusion in 23-rt6, as it has been shown
to make a substantial improvement in our local testing.  Independent
verification and/or comments/review are more than welcome.

-Greg

-
RT: scheduler migration/wakeup enhancements

This series applies to 23.1-rt5 and includes numerous tweaks to the
scheduler.  The primary goal of this set of patches is to further improve
wake-up latencies (particularly on larger SMP systems) and decrease migration
overhead.  This is accomplished by making improvements on several fronts:

1) We factor in CPU topology in the routing decision to pick the best
   migration target according to cache hierarchy.

2) We move some CFS load-calculation code into the CFS sched_class as a
   member function.  This removes the unnecessary code from the RT fastpath
   for tasks in the RT sched_class.

3) We make further improvements against non-migratable tasks by factoring in
   the RQ overload state, instead of just the RQ depth.

4) We replace the linear priority search with a 2-d algorithm.

In past -rt releases, latencies could quickly become abysmal on larger SMP (8+
cpu) systems, on the order of 350us+.  The recent work in -rt2 and -rt4 dropped
this figure by a large margin, bringing things down to approximately 120us.
This new series improves upon that work even further, bringing latencies down
to the sub-80us mark on our reference 8-way Intel C2D 5335
Xeon.

These figures were obtained by simultaneous execution of:

# ./cyclictest -n -p 90 -t 8 -d 100 -i 100
# while true; do make mrproper; make alldefconfig; make -j 128; done

for long durations.  The following are the results from one particular run,
though the results are similar across various short and long term trials in
our labs.

23.1-rt5-baseline
--
138.60 110.62 70.96 23/808 10246

T: 0 ( 5179) P:90 I:100 C:9011636 Min:  2 Act:4 Avg:4 Max: 117
T: 1 ( 5180) P:89 I:200 C:4505819 Min:  2 Act:5 Avg:4 Max:  95
T: 2 ( 5181) P:88 I:300 C:3003879 Min:  2 Act:9 Avg:5 Max:  85
T: 3 ( 5182) P:87 I:400 C:2252910 Min:  2 Act:3 Avg:4 Max:  75
T: 4 ( 5183) P:86 I:500 C:1802328 Min:  2 Act:7 Avg:5 Max:  71
T: 5 ( 5184) P:85 I:600 C:1501940 Min:  2 Act:4 Avg:5 Max:  74
T: 6 ( 5185) P:84 I:700 C:1287377 Min:  2 Act:6 Avg:6 Max:  85
T: 7 ( 5186) P:83 I:800 C:1126455 Min:  2 Act:4 Avg:5 Max:  75

23.1-rt5-gh
--
146.47 127.99 85.35 30/815 32289

T: 0 ( 5027) P:90 I:100 C:10856538 Min:  2 Act:4 Avg:4 Max:  60
T: 1 ( 5028) P:89 I:200 C:5428270 Min:  2 Act:7 Avg:5 Max:  57
T: 2 ( 5029) P:88 I:300 C:3618846 Min:  2 Act:5 Avg:5 Max:  48
T: 3 ( 5030) P:87 I:400 C:2714135 Min:  2 Act:7 Avg:5 Max:  61
T: 4 ( 5031) P:86 I:500 C:2171308 Min:  2 Act:6 Avg:6 Max:  51
T: 5 ( 5032) P:85 I:600 C:1809424 Min:  2 Act:5 Avg:7 Max:  59
T: 6 ( 5033) P:84 I:700 C:1550935 Min:  2 Act:6 Avg:6 Max:  54
T: 7 ( 5034) P:83 I:800 C:1357068 Min:  2 Act:7 Avg:6 Max:  62

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]


[PATCH 3/8] RT: Break out the search function

2007-11-05 Thread Gregory Haskins
Isolate the search logic into a function so that it can be used later
in places other than find_lock_lowest_rq().

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   62 -
 1 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f1fc1b4..fbe7b8a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -312,43 +312,55 @@ static struct task_struct 
*pick_next_highest_task_rt(struct rq *rq,
return next;
 }
 
-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *rq)
+static int find_lowest_rq(struct task_struct *task)
 {
-   struct rq *lowest_rq = NULL;
-   cpumask_t cpu_mask;
int cpu;
-   int tries;
+   cpumask_t cpu_mask;
+   struct rq *lowest_rq = NULL;
 
cpus_and(cpu_mask, cpu_online_map, task-cpus_allowed);
 
-   for (tries = 0; tries  RT_MAX_TRIES; tries++) {
-   /*
-* Scan each rq for the lowest prio.
-*/
-   for_each_cpu_mask(cpu, cpu_mask) {
-   struct rq *curr_rq = per_cpu(runqueues, cpu);
+   /*
+* Scan each rq for the lowest prio.
+*/
+   for_each_cpu_mask(cpu, cpu_mask) {
+   struct rq *rq = cpu_rq(cpu);
 
-   if (cpu == rq-cpu)
-   continue;
+   if (cpu == rq-cpu)
+   continue;
 
-   /* We look for lowest RT prio or non-rt CPU */
-   if (curr_rq-rt.highest_prio = MAX_RT_PRIO) {
-   lowest_rq = curr_rq;
-   break;
-   }
+   /* We look for lowest RT prio or non-rt CPU */
+   if (rq-rt.highest_prio = MAX_RT_PRIO) {
+   lowest_rq = rq;
+   break;
+   }
 
-   /* no locking for now */
-   if (curr_rq-rt.highest_prio  task-prio 
-   (!lowest_rq || curr_rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   lowest_rq = curr_rq;
-   }
+   /* no locking for now */
+   if (rq-rt.highest_prio  task-prio 
+   (!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
+   lowest_rq = rq;
}
+   }
+
+   return lowest_rq ? lowest_rq-cpu : -1;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+ struct rq *rq)
+{
+   struct rq *lowest_rq = NULL;
+   int cpu;
+   int tries;
+
+   for (tries = 0; tries  RT_MAX_TRIES; tries++) {
+   cpu = find_lowest_rq(task);
 
-   if (!lowest_rq)
+   if (cpu == -1)
break;
 
+   lowest_rq = cpu_rq(cpu);
+
/* if the prio of this runqueue changed, try again */
if (double_lock_balance(rq, lowest_rq)) {
/*



[PATCH 2/2] RT: Cache cpus_allowed weight for optimizing migration

2007-10-26 Thread Gregory Haskins
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_timer to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbounded task queue
   up behind the running task, we will fail to ever relocate the unbounded
   task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
   overloaded tasks over.  It is therefore optimal to strive to avoid the
   overhead altogether if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the Hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We will then utilize this information to
skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the Hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.
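
For illustration, the two pieces of bookkeeping described above can be
sketched in stand-alone userspace C as follows (simplified types, my
illustration rather than the patch itself): cache the Hamming weight whenever
the affinity mask changes, and only declare RT overload when a runqueue has
more than one RT task and at least one of them can actually move.

#include <stdio.h>

struct task {
    unsigned long cpus_allowed;   /* one bit per CPU */
    int nr_cpus_allowed;          /* cached weight of the mask */
};

struct rt_rq {
    unsigned long rt_nr_running;
    unsigned long rt_nr_migratory;
    int overloaded;
};

static int hweight(unsigned long mask)
{
    int w = 0;

    for (; mask; mask &= mask - 1)   /* clear lowest set bit */
        w++;
    return w;
}

/* the expensive part happens only when the mask actually changes */
static void set_cpus_allowed(struct task *p, unsigned long new_mask)
{
    p->cpus_allowed    = new_mask;
    p->nr_cpus_allowed = hweight(new_mask);
}

static void update_rt_migration(struct rt_rq *rt)
{
    rt->overloaded = rt->rt_nr_migratory && rt->rt_nr_running > 1;
}

int main(void)
{
    struct task p = { 0, 0 };
    struct rt_rq rt = { 2, 0, 0 };    /* two queued RT tasks, none movable */

    set_cpus_allowed(&p, 0x1);        /* bound to CPU0: weight 1 */
    update_rt_migration(&rt);
    printf("bound task weight=%d overloaded=%d\n",
           p.nr_cpus_allowed, rt.overloaded);

    set_cpus_allowed(&p, 0xf);        /* now free to use CPUs 0-3 */
    rt.rt_nr_migratory = 1;
    update_rt_migration(&rt);
    printf("movable task weight=%d overloaded=%d\n",
           p.nr_cpus_allowed, rt.overloaded);
    return 0;
}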

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 include/linux/sched.h |2 ++
 kernel/fork.c |1 +
 kernel/sched.c|9 +++-
 kernel/sched_rt.c |   58 +
 4 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a3829f..829de6f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1048,6 +1048,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
+   void (*set_cpus_allowed)(struct task_struct *p, cpumask_t newmask);
 };
 
 struct load_weight {
@@ -1144,6 +1145,7 @@ struct task_struct {
 
unsigned int policy;
cpumask_t cpus_allowed;
+   int nr_cpus_allowed;
unsigned int time_slice;
 
 #ifdef CONFIG_PREEMPT_RCU
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f11f23..f808e18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1257,6 +1257,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
 */
preempt_disable();
p-cpus_allowed = current-cpus_allowed;
+   p-nr_cpus_allowed = current-nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p-cpus_allowed) ||
!cpu_online(task_cpu(p
set_task_cpu(p, smp_processor_id());
diff --git a/kernel/sched.c b/kernel/sched.c
index 30fa531..6c90093 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -262,6 +262,7 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+   unsigned long rt_nr_migratory;
unsigned long rt_nr_uninterruptible;
/* highest queued rt task prio */
int highest_prio;
@@ -5371,7 +5372,13 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t 
new_mask)
goto out;
}
 
-   p-cpus_allowed = new_mask;
+   if (p-sched_class-set_cpus_allowed)
+   p-sched_class-set_cpus_allowed(p, new_mask);
+   else {
+   p-cpus_allowed= new_mask;
+   p-nr_cpus_allowed = cpus_weight(new_mask);
+   }
+
/* Can the task run on the task's current CPU? If so, we're done */
if (cpu_isset(task_cpu(p), new_mask))
goto out;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b59dc20..64481c8 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -51,6 +51,16 @@ static inline void update_curr_rt(struct rq *rq)
curr-se.exec_start = rq-clock;
 }
 
+#ifdef CONFIG_SMP
+static void update_rt_migration(struct task_struct *p, struct rq *rq)
+{
+   if (rq-rt.rt_nr_migratory  (rq-rt.rt_nr_running  1))
+   rt_set_overload(p, rq-cpu);
+   else
+   rt_clear_overload(p, rq-cpu);
+}
+#endif
+
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 {
WARN_ON(!rt_task(p));
@@ -58,8 +68,10 @@ static inline void inc_rt_tasks(struct task_struct *p, 
struct rq *rq

[PATCH 1/2] RT: cleanup some push-rt logic

2007-10-26 Thread Gregory Haskins
Please fold into original -rt2 patches as appropriate

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 55da7d0..b59dc20 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -292,7 +292,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 {
struct rq *lowest_rq = NULL;
cpumask_t cpu_mask;
-   int dst_cpu = -1;
int cpu;
int tries;
 
@@ -311,14 +310,12 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
lowest_rq = rq;
-   dst_cpu = cpu;
break;
}
 
/* no locking for now */
if (rq-rt.highest_prio  task-prio 
(!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   dst_cpu = cpu;
lowest_rq = rq;
}
}
@@ -335,7 +332,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 * Also make sure that it wasn't scheduled on its rq.
 */
if (unlikely(task_rq(task) != this_rq ||
-!cpu_isset(dst_cpu, task-cpus_allowed) ||
+!cpu_isset(lowest_rq-cpu, 
task-cpus_allowed) ||
 task_running(this_rq, task) ||
 !task-se.on_rq)) {
spin_unlock(lowest_rq-lock);
@@ -365,7 +362,6 @@ static int push_rt_task(struct rq *this_rq)
 {
struct task_struct *next_task;
struct rq *lowest_rq;
-   int dst_cpu;
int ret = 0;
int paranoid = RT_MAX_TRIES;
 
@@ -412,12 +408,10 @@ static int push_rt_task(struct rq *this_rq)
goto out;
}
 
-   dst_cpu = lowest_rq-cpu;
-
assert_spin_locked(lowest_rq-lock);
 
deactivate_task(this_rq, next_task, 0);
-   set_task_cpu(next_task, dst_cpu);
+   set_task_cpu(next_task, lowest_rq-cpu);
activate_task(lowest_rq, next_task, 0);
 
resched_task(lowest_rq-curr);



[PATCH 0/3] RT: balance rt tasks enhancements v6

2007-10-25 Thread Gregory Haskins
This is a mini-release of my series, rebased on -rt2.  I have more changes
downstream which are not quite ready for primetime, but I need to work on some
other unrelated issues right now and I wanted to get what works out there. 

Changes since v5

*) Rebased to -rt2 - Many of the functions of the original series are now
 included in base -rt, so they have been dropped.



[PATCH 1/3] RT: cleanup some push-rt logic

2007-10-25 Thread Gregory Haskins
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/sched_rt.c |   10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 55da7d0..b59dc20 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -292,7 +292,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 {
struct rq *lowest_rq = NULL;
cpumask_t cpu_mask;
-   int dst_cpu = -1;
int cpu;
int tries;
 
@@ -311,14 +310,12 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
/* We look for lowest RT prio or non-rt CPU */
if (rq-rt.highest_prio = MAX_RT_PRIO) {
lowest_rq = rq;
-   dst_cpu = cpu;
break;
}
 
/* no locking for now */
if (rq-rt.highest_prio  task-prio 
(!lowest_rq || rq-rt.highest_prio  
lowest_rq-rt.highest_prio)) {
-   dst_cpu = cpu;
lowest_rq = rq;
}
}
@@ -335,7 +332,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task,
 * Also make sure that it wasn't scheduled on its rq.
 */
if (unlikely(task_rq(task) != this_rq ||
-!cpu_isset(dst_cpu, task-cpus_allowed) ||
+!cpu_isset(lowest_rq-cpu, 
task-cpus_allowed) ||
 task_running(this_rq, task) ||
 !task-se.on_rq)) {
spin_unlock(lowest_rq-lock);
@@ -365,7 +362,6 @@ static int push_rt_task(struct rq *this_rq)
 {
struct task_struct *next_task;
struct rq *lowest_rq;
-   int dst_cpu;
int ret = 0;
int paranoid = RT_MAX_TRIES;
 
@@ -412,12 +408,10 @@ static int push_rt_task(struct rq *this_rq)
goto out;
}
 
-   dst_cpu = lowest_rq-cpu;
-
assert_spin_locked(lowest_rq-lock);
 
deactivate_task(this_rq, next_task, 0);
-   set_task_cpu(next_task, dst_cpu);
+   set_task_cpu(next_task, lowest_rq-cpu);
activate_task(lowest_rq, next_task, 0);
 
resched_task(lowest_rq-curr);



[PATCH 3/3] RT: CPU priority management

2007-10-25 Thread Gregory Haskins
This code tracks the priority of each CPU so that global migration
  decisions are easy to calculate.  Each CPU can be in a state as follows:

 (INVALID), IDLE, NORMAL, RT1, ... RT99

  going from the lowest priority to the highest.  CPUs in the INVALID state
  are not eligible for routing.  The system maintains this state with
  a 2 dimensional bitmap (the first for priority class, the second for cpus
  in that class).  Therefore a typical application without affinity
  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  searches).  For tasks with affinity restrictions, the algorithm has a
  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
  yields the worst case search is fairly contrived.
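
As a stand-alone illustration of the data structure (my sketch, heavily
simplified: it omits the INVALID state and the pri_active bitmap that provides
the two-bit-search fast path, and uses a plain unsigned long in place of
cpumask_t), the lookup amounts to scanning the priority classes from least to
most important and, within the first non-empty class below the task's own
priority, picking any CPU that satisfies the affinity mask.

#include <stdio.h>

#define NR_CPUS     8
#define NR_CLASSES  101    /* IDLE, NORMAL, RT1 .. RT99 (INVALID omitted) */

static unsigned long pri_to_cpu[NR_CLASSES];  /* which CPUs sit in each class */
static int cpu_to_pri[NR_CPUS];               /* reverse map, for updates */

static void cpupri_set(int cpu, int pri)
{
    pri_to_cpu[cpu_to_pri[cpu]] &= ~(1ul << cpu);
    cpu_to_pri[cpu] = pri;
    pri_to_cpu[pri] |= 1ul << cpu;
}

/* find the lowest-class CPU that is below task_pri and within 'allowed' */
static int cpupri_find(int task_pri, unsigned long allowed)
{
    int pri, cpu;

    for (pri = 0; pri < task_pri; pri++) {
        unsigned long match = pri_to_cpu[pri] & allowed;

        if (!match)
            continue;
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (match & (1ul << cpu))
                return cpu;
    }
    return -1;
}

int main(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        cpupri_set(cpu, 1 + 50);   /* every CPU busy at RT50 */
    cpupri_set(3, 1);              /* CPU3 drops back to NORMAL */

    /* an incoming RT60 task should be routed to CPU3, the least busy */
    printf("best cpu for RT60: %d\n", cpupri_find(1 + 60, ~0ul));
    return 0;
}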

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Makefile   |2 
 kernel/sched.c|4 +
 kernel/sched_cpupri.c |  201 +
 kernel/sched_cpupri.h |   10 ++
 kernel/sched_rt.c |   34 ++--
 5 files changed, 224 insertions(+), 27 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o 
profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
-   utsname.o
+   utsname.o sched_cpupri.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 6c90093..acfc75d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -68,6 +68,8 @@
 
 #include asm/tlb.h
 
+#include sched_cpupri.h
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -6955,6 +6957,8 @@ void __init sched_init(void)
fair_sched_class.next = idle_sched_class;
idle_sched_class.next = NULL;
 
+   cpupri_init();
+
for_each_possible_cpu(i) {
struct rt_prio_array *array;
struct rq *rq;
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
new file mode 100644
index 000..7a226cb
--- /dev/null
+++ b/kernel/sched_cpupri.c
@@ -0,0 +1,201 @@
+/*
+ *  kernel/sched_cpupri.c
+ *
+ *  CPU priority management
+ *
+ *  Copyright (C) 2007 Novell
+ *
+ *  Author: Gregory Haskins [EMAIL PROTECTED]
+ *
+ *  This code tracks the priority of each CPU so that global migration
+ *  decisions are easy to calculate.  Each CPU can be in a state as follows:
+ *
+ * (INVALID), IDLE, NORMAL, RT1, ... RT99
+ *
+ *  going from the lowest priority to the highest.  CPUs in the INVALID state
+ *  are not eligible for routing.  The system maintains this state with
+ *  a 2 dimensional bitmap (the first for priority class, the second for cpus
+ *  in that class).  Therefore a typical application without affinity
+ *  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
+ *  searches).  For tasks with affinity restrictions, the algorithm has a
+ *  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
+ *  yields the worst case search is fairly contrived.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include asm/idle.h
+
+#include sched_cpupri.h
+
+#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO
+
+#define CPUPRI_INVALID -2
+#define CPUPRI_IDLE-1
+#define CPUPRI_NORMAL   0
+/* values 1-99 are RT priorities */
+
+struct cpu_priority {
+   raw_spinlock_t lock;
+   cpumask_t  pri_to_cpu[CPUPRI_NR_PRIORITIES];
+   long   pri_active[CPUPRI_NR_PRIORITIES/BITS_PER_LONG];
+   intcpu_to_pri[NR_CPUS];
+};
+
+static __cacheline_aligned_in_smp struct cpu_priority cpu_priority;
+
+/* Convert between a 140 based task-prio, and our 102 based cpupri */
+static int convert_prio(int prio)
+{
+   int cpupri;
+   
+   if (prio == MAX_PRIO)
+   cpupri = CPUPRI_IDLE;
+   else if (prio = MAX_RT_PRIO)
+   cpupri = CPUPRI_NORMAL;
+   else
+   cpupri = MAX_RT_PRIO - prio;
+
+   return cpupri;
+}
+
+#define for_each_cpupri_active(array, idx)   \
+  for( idx = find_first_bit(array, CPUPRI_NR_PRIORITIES);\
+   idx  CPUPRI_NR_PRIORITIES;   \
+   idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1))
+
+/**
+ * cpupri_find - find the best (lowest-pri) CPU in the system
+ * @cpu: The recommended/default CPU
+ * @task_pri: The priority of the task being scheduled (IDLE-RT99)
+ * @p: The task being scheduled
+ *
+ * Note: This function returns the recommended CPU as calculated during the
+ * current invokation.  By the time
