Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-26 Thread Gregory Haskins
>>> On Tue, Feb 26, 2008 at  1:06 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
> On Tue 2008-02-26 08:03:43, Gregory Haskins wrote:
>> >>> On Mon, Feb 25, 2008 at  5:03 PM, in message
>> <[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
>> 
>> >> +static inline void
>> >> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
>> > *adaptive)
>> > ...
>> >> +#define prepare_adaptive_wait(lock, busy) {}
>> > 
>> > This is evil. Use empty inline function instead (same for the other
>> > function, there you can maybe get away with it).
>> > 
>> 
>> I went to implement your suggested change and I remembered why I did it this 
> way:  I wanted a macro so that the "struct adaptive_waiter" local variable 
> will fall away without an #ifdef in the main body of code.  So I have left 
> this logic alone for now.
> 
> Hmm, but inline function will allow dead code elimination,  too, no?

I was getting compile errors.  Might be operator-error ;)

> 
> Anyway non-evil way to do it with macro is 
> 
> #define prepare_adaptive_wait(lock, busy) do {} while (0)
> 
> ...that behaves properly in complex statements.

Ah, I was wondering why people use that.  Will do.  Thanks!

-Greg
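
For readers following the archive: a minimal sketch (not from the series) of
why the empty-braces macro is fragile while the do/while(0) form behaves
properly inside compound statements:

/* Illustration only -- neither macro below is the one merged in the series. */
#define prepare_empty_braces(lock, busy)  {}                /* fragile */
#define prepare_do_while(lock, busy)      do {} while (0)   /* safe    */

static void example(int contended)
{
	if (contended)
		prepare_do_while(NULL, 0);  /* trailing ';' completes the statement */
	else
		return;

	/*
	 * Substituting prepare_empty_braces() above would expand to "{} ;",
	 * and the stray empty statement detaches the "else" from its "if",
	 * which is a compile error.
	 */
}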



Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

2008-02-26 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  5:06 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
> 
> I believe you have _way_ too many config variables. If this can be set
> at runtime, does it need a config option, too?

Generally speaking, I think that until this algorithm has an adaptive-timeout in 
addition to an adaptive-spin/sleep, these .config-based defaults are a good 
idea.  Sometimes setting these things at runtime is a PITA when you are 
talking about embedded systems that might not have/want a nice userspace 
sysctl-config infrastructure.  And changing the defaults in the code is 
unattractive for some users.  I don't think it's a big deal either way, so if 
people hate the config options, they should go.  But I thought I would throw 
this use-case out there to ponder.

Regards,
-Greg



Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-26 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  5:03 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 

>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
> 
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).
> 

I went to implement your suggested change and I remembered why I did it this 
way:  I wanted a macro so that the "struct adaptive_waiter" local variable will 
fall away without an #ifdef in the main body of code.  So I have left this 
logic alone for now.
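
For context, the sticking point is that when the feature is compiled out,
DECLARE_ADAPTIVE_WAITER() expands to nothing, so an empty inline helper has no
variable to take the address of.  A purely hypothetical sketch of how the two
goals could be reconciled (names mirror the patch; this is not what the series
actually does):

struct rt_mutex;
struct task_struct;

#ifdef CONFIG_ADAPTIVE_RTLOCK

struct adaptive_waiter {
	struct task_struct *owner;
};

#define DECLARE_ADAPTIVE_WAITER(name) \
	struct adaptive_waiter name = { .owner = NULL, }

void prepare_adaptive_wait(struct rt_mutex *lock,
			   struct adaptive_waiter *adaptive);

#else /* !CONFIG_ADAPTIVE_RTLOCK */

/* A dummy type keeps the call sites compiling without an #ifdef... */
struct adaptive_waiter { };

/* ...the declaration marks the local unused so it costs nothing... */
#define DECLARE_ADAPTIVE_WAITER(name) \
	struct adaptive_waiter name __attribute__((__unused__))

/* ...and the helper is a typed no-op the optimizer can discard entirely. */
static inline void
prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
{
}

#endif /* CONFIG_ADAPTIVE_RTLOCK */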



Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  5:57 PM, in message
<[EMAIL PROTECTED]>, Sven-Thorsten Dietrich
<[EMAIL PROTECTED]> wrote: 
>
> But Greg may need to enforce it on his git tree that he mails these from
> - are you referring to anything specific in this patch?
> 

That's what I don't get.  I *did* checkpatch all of these before sending them 
out (and I have for every release).

I am aware of two "tabs vs spaces" warnings, but the rest checked clean.  Why 
do some people still see errors when I don't?  Is there a set of switches I 
should supply to checkpatch to make it more aggressive or something?

-Greg



Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  5:09 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
> Hi!
> 
>> From: Peter W.Morreale <[EMAIL PROTECTED]>
>> 
>> This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
>> a new tunable: rtmutex_timeout, which is the companion to the
>> rtlock_timeout tunable.
>> 
>> Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
> 
> Not signed off by you?

I wasn't sure if this was appropriate for me to do.  This is the first time I 
was acting as "upstream" to someone.  If that is what I am expected to do, 
consider this an "ack" for your remaining comments related to this.

> 
>> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
>> index ac1cbad..864bf14 100644
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -214,6 +214,43 @@ config RTLOCK_DELAY
>>   tunable at runtime via a sysctl.  A setting of 0 (zero) disables
>>   the adaptive algorithm entirely.
>>  
>> +config ADAPTIVE_RTMUTEX
>> +bool "Adaptive real-time mutexes"
>> +default y
>> +depends on ADAPTIVE_RTLOCK
>> +help
>> + This option adds the adaptive rtlock spin/sleep algorithm to
>> + rtmutexes.  In rtlocks, a significant gain in throughput
>> + can be seen by allowing rtlocks to spin for a distinct
>> + amount of time prior to going to sleep for deadlock avoidance.
>> + 
>> + Typically, mutexes are used when a critical section may need to
>> + sleep due to a blocking operation.  In the event the critical 
>> + section does not need to sleep, an additional gain in throughput 
>> + can be seen by avoiding the extra overhead of sleeping.
> 
> Watch the whitespace. ... and do we need yet another config options?
> 
>> +config RTMUTEX_DELAY
>> +int "Default delay (in loops) for adaptive mutexes"
>> +range 0 1000
>> +depends on ADAPTIVE_RTMUTEX
>> +default "3000"
>> +help
>> + This allows you to specify the maximum delay a task will use
>> + to wait for a rt mutex before going to sleep.  Note that that
>> + although the delay is implemented as a preemptable loop, tasks
>> + of like priority cannot preempt each other and this setting can
>> + result in increased latencies.
>> + 
>> + The value is tunable at runtime via a sysctl.  A setting of 0
>> + (zero) disables the adaptive algorithm entirely.
> 
> Ouch.

?  Is this a reference to the whitespace damage, or does the content need addressing?

> 
>> +#ifdef CONFIG_ADAPTIVE_RTMUTEX
>> +
>> +#define mutex_adaptive_wait adaptive_wait
>> +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
>> +
>> +extern int rtmutex_timeout;
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
>> + struct adaptive_waiter name = { .owner = NULL,   \
>> + .timeout = rtmutex_timeout, }
>> +
>> +#else
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
>> +
>> +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
>> +#define mutex_prepare_adaptive_wait(lock, busy) {}
> 
> More evil macros. Macro does not behave like a function, make it
> inline function if you are replacing a function.

Ok


>   Pavel





Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  5:03 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
> Hi!
> 
>> +/*
>> + * Adaptive-rtlocks will busywait when possible, and sleep only if
>> + * necessary. Note that the busyloop looks racy, and it is... but we do
>> + * not care. If we lose any races it simply means that we spin one more
>> + * time before seeing that we need to break-out on the next iteration.
>> + *
>> + * We realize this is a relatively large function to inline, but note that
>> + * it is only instantiated 1 or 2 times max, and it makes a measurable
>> + * performance difference to avoid the call.
>> + *
>> + * Returns 1 if we should sleep
>> + *
>> + */
>> +static inline int
>> +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
>> +  struct adaptive_waiter *adaptive)
>> +{
>> +int sleep = 0;
>> +
>> +for (;;) {
>> +/*
>> + * If the task was re-awoken, break out completely so we can
>> + * reloop through the lock-acquisition code.
>> + */
>> +if (!waiter->task)
>> +break;
>> +
>> +/*
>> + * We need to break if the owner changed so we can reloop
>> + * and safely acquire the owner-pointer again with the
>> + * wait_lock held.
>> + */
>> +if (adaptive->owner != rt_mutex_owner(lock))
>> +break;
>> +
>> +/*
>> + * If we got here, presumably the lock ownership is still
>> + * current.  We will use it to our advantage to be able to
>> + * spin without disabling preemption...
>> + */
>> +
>> +/*
>> + * .. sleep if the owner is not running..
>> + */
>> +if (!adaptive->owner->se.on_rq) {
>> +sleep = 1;
>> +break;
>> +}
>> +
>> +/*
>> + * .. or is running on our own cpu (to prevent deadlock)
>> + */
>> +if (task_cpu(adaptive->owner) == task_cpu(current)) {
>> +sleep = 1;
>> +break;
>> +}
>> +
>> +cpu_relax();
>> +}
>> +
>> +put_task_struct(adaptive->owner);
>> +
>> +return sleep;
>> +}
>> +
> 
> You want to inline this?

Yes.  As the comment indicates, there are 1-2 users tops, and it has a 
significant impact on throughput (> 5%) to take the hit with a call.  I don't 
think it's actually much code anyway... it's all comments.

> 
>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
> 
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).

Ok.


>   Pavel





Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
>>> On Mon, Feb 25, 2008 at  4:54 PM, in message
<[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote: 
> Hi!
> 
>> @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>>   * saved_state accordingly. If we did not get a real wakeup
>>   * then we return with the saved state.
>>   */
>> -saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
>> +saved_state = current->state;
>> +smp_mb();
>>  
>>  for (;;) {
>>  unsigned long saved_flags;
> 
> Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed.  I think that is a relic 
from back when I was debugging some other problems.  Let me wrap my head around 
the implications of removing it, and either remove it or document appropriately.

> 
> Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so.  We are substituting a write with a read, and word reads are 
always atomic anyway IIUC (or is that only true on certain architectures)?  
Note that we are moving the atomic-write to be done later in the 
update_current() calls.

-Greg





[(RT RFC) PATCH v2 9/9] remove the extra call to try_to_take_lock

2008-02-25 Thread Gregory Haskins
From: Peter W. Morreale <[EMAIL PROTECTED]>

Remove the redundant attempt to get the lock.  While it is true that the
exit path with this patch adds an unnecessary xchg (in the event the
lock is granted without further traversal in the loop), experimentation
shows that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b81bbef..266ae31 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -756,12 +756,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
 
-   /* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
-   spin_unlock_irqrestore(&lock->wait_lock, flags);
-   return;
-   }
-
BUG_ON(rt_mutex_owner(lock) == current);
 
/*



[(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale <[EMAIL PROTECTED]>

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   37 ++
 kernel/rtmutex.c  |   76 +
 kernel/rtmutex_adaptive.h |   32 ++-
 kernel/sysctl.c   |   10 ++
 4 files changed, 119 insertions(+), 36 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ac1cbad..864bf14 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -214,6 +214,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool "Adaptive real-time mutexes"
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidance.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int "Default delay (in loops) for adaptive mutexes"
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default "3000"
+help
+ This allows you to specify the maximum delay a task will use
+to wait for an rt mutex before going to sleep.  Note that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a16b13..ea593e0 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1;
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* We can skip the actual (expensive) wakeup if the
+* waiter is already running, but we have to be careful
+* of race conditions because they may be about to sleep.
+*
+* The waiter-side protocol has the following pattern:
+* 1: Set state != RUNNING
+* 2: Conditionally sleep if waiter->task != NULL;
+*
+* And the owner-side has the following:
+* A: Set waiter->task = NULL
+* B: Conditionally wake if the state != RUNNING
+*
+* As long as we ensure 1->2 order, and A->B order, we
+* will never miss a wakeup.
+*
+* Therefore, this barrier ensures that waiter->task = NULL
+* is visible before we test the pendowner->state.  The
+* corresponding barrier is in the sleep logic.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   /*
-* We can skip the actual (expensive) wakeup if the
-* waiter is already running, but we have to be careful
-* of race conditions because they may be about to sleep.
-*
-* The waiter-side protocol has the following pattern:
-* 1: Set state != RUNNING
-* 2: Conditionally sleep if waiter->task != NULL;
-*
-* And the owner-side has the following:
-* A: Set waiter->task = NULL
-* B: Conditionally wake if the state != RUNNING
-*
-* As long as we ensure 1->2 order, and A->B order, we
-* will never miss a 

[(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale <[EMAIL PROTECTED]>

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ea593e0..b81bbef 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
spin_lock(&current->pi_lock);
 
@@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
spin_lock(&pendowner->pi_lock);
 
WARN_ON(!pendowner->pi_blocked_on);
@@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
spin_unlock(&pendowner->pi_lock);
 }
 



[(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks.  However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention.  Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent.  Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Peter Morreale <[EMAIL PROTECTED]>
Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   20 +++
 kernel/rtmutex.c  |   30 +++---
 kernel/rtmutex_adaptive.h |  138 +
 3 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e493257..d2432fa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -183,6 +183,26 @@ config RCU_TRACE
  Say Y/M here if you want to enable RCU tracing in-kernel/module.
  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+bool "Adaptive real-time locks"
+default y
+depends on PREEMPT_RT && SMP
+help
+  PREEMPT_RT allows for greater determinism by transparently
+  converting normal spinlock_ts into preemptible rtmutexes which
+  sleep any waiters under contention.  However, in many cases the
+  lock will be released in less time than it takes to context
+  switch.  Therefore, the "sleep under contention" policy may also
+  degrade throughput performance due to the extra context switches.
+
+  This option alters the rtmutex derived spinlock_t replacement
+  code to use an adaptive spin/sleep algorithm.  It will spin
+  unless it determines it must sleep to avoid deadlock.  This
+  offers a best of both worlds solution since we achieve both
+  high-throughput and low-latency.
+
+  If unsure, say Y.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index bf9e230..3802ef8 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,8 @@
  *  Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  *  Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  *  Copyright (C) 2006 Esben Nielsen
+ *  Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale,
+ *   and Gregory Haskins
  *
  *  See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +19,7 @@
 #include 
 
 #include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"
 
 #ifdef CONFIG_RTLOCK_LATERAL_STEAL
 int rtmutex_lateral_steal __read_mostly = 1;
@@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+   DECLARE_ADAPTIVE_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
 
+   prepare_adaptive_wait(lock, &adaptive);
+
/*
 * Prevent schedule() to drop BKL, while waiting for
 * the lock ! We restore lock_depth when we come back.
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(&waiter);
 
-   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-   /*
-* The xchg() in update_current() is an implicit barrier
-* which we rely upon to ensure current->state is visible
-* before we test waiter.task.
-*/
-   if (waiter.task)
-   schedule_rt_mutex(lock);
-   else
-   update_current(

[(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

2008-02-25 Thread Gregory Haskins
From: Sven Dietrich <[EMAIL PROTECTED]>

Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2432fa..ac1cbad 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK
 
   If unsure, say Y.
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default "1"
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3802ef8..4a16b13 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -25,6 +25,10 @@
 int rtmutex_lateral_steal __read_mostly = 1;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 862c088..60c6328 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c24c53d..55189ea 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -56,6 +56,8 @@
 #include 
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current->state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cd39c26..ef52db6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -681,6 +681,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+   unsigned long state = xchg(&current->state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+   saved_state = current->state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -752,14 +761,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(&waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
}
 
state = xchg(&current->state, saved_state);



[(RT RFC) PATCH v2 4/9] optimize rt lock wakeup

2008-02-25 Thread Gregory Haskins
It is redundant to wake the grantee task if it is already running.

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Peter Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   45 -
 1 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ef52db6..bf9e230 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -531,6 +531,41 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
pendowner = waiter->task;
waiter->task = NULL;
 
+   /*
+* Do the wakeup before the ownership change to give any spinning
+* waiter grantees a headstart over the other threads that will
+* trigger once owner changes.
+*/
+   if (!savestate)
+   wake_up_process(pendowner);
+   else {
+   /*
+* We can skip the actual (expensive) wakeup if the
+* waiter is already running, but we have to be careful
+* of race conditions because they may be about to sleep.
+*
+* The waiter-side protocol has the following pattern:
+* 1: Set state != RUNNING
+* 2: Conditionally sleep if waiter->task != NULL;
+*
+* And the owner-side has the following:
+* A: Set waiter->task = NULL
+* B: Conditionally wake if the state != RUNNING
+*
+* As long as we ensure 1->2 order, and A->B order, we
+* will never miss a wakeup.
+*
+* Therefore, this barrier ensures that waiter->task = NULL
+* is visible before we test the pendowner->state.  The
+* corresponding barrier is in the sleep logic.
+*/
+   smp_mb();
+
+   if ((pendowner->state != TASK_RUNNING)
+   && (pendowner->state != TASK_RUNNING_MUTEX))
+   wake_up_process_mutex(pendowner);
+   }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
spin_unlock(&current->pi_lock);
@@ -557,11 +592,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
}
spin_unlock(&pendowner->pi_lock);
-
-   if (savestate)
-   wake_up_process_mutex(pendowner);
-   else
-   wake_up_process(pendowner);
 }
 
 /*
@@ -762,6 +792,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(&waiter);
 
update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   /*
+* The xchg() in update_current() is an implicit barrier
+* which we rely upon to ensure current->state is visible
+* before we test waiter.task.
+*/
if (waiter.task)
schedule_rt_mutex(lock);
else



[(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
From: Sven-Thorsten Dietrich <[EMAIL PROTECTED]>

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |7 ++-
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 6624c66..cd39c26 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -18,6 +18,10 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+  !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c913d48..c24c53d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -836,6 +840,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
+   .data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,



[(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to include lateral priority

2008-02-25 Thread Gregory Haskins
The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain significant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority.  The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake up a different task.

This does add a degree of unfairness, yes.  But also note that the users
of these locks under non -rt environments have already been using unfair
raw spinlocks anyway so the tradeoff is probably worth it.

The way I like to think of this is that higher priority tasks should
clearly preempt, and lower priority tasks should clearly block.  However,
if tasks have an identical priority value, then we can think of the
scheduler decisions as the tie-breaking parameter. (e.g. tasks that the
scheduler picked to run first have a logically higher priority among tasks
of the same prio).  This helps to keep the system "primed" with tasks doing
useful work, and the end result is higher throughput.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt |   10 ++
 kernel/rtmutex.c   |   31 +++
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..e493257 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -196,3 +196,13 @@ config SPINLOCK_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config RTLOCK_LATERAL_STEAL
+bool "Allow equal-priority rtlock stealing"
+default y
+depends on PREEMPT_RT
+help
+  This option alters the rtlock lock-stealing logic to allow
+  equal priority tasks to preempt a pending owner in addition
+  to higher priority tasks.  This allows for a significant
+  boost in throughput under certain circumstances at the expense
+  of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..6624c66 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -313,12 +313,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct 
*task,
return ret;
 }
 
+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+   if (current->prio >= pendowner->prio)
+#else
+   if (current->prio > pendowner->prio)
+   return 0;
+
+   if (!unfair && (current->prio == pendowner->prio))
+#endif
+   return 0;
+
+   return 1;
+}
+
 /*
  * Optimization: check if we can steal the lock from the
  * assigned pending owner [which might not have taken the
  * lock yet]:
  */
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 {
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -330,7 +345,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;
 
spin_lock(&pendowner->pi_lock);
-   if (current->prio >= pendowner->prio) {
+   if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
@@ -383,7 +398,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
  *
  * Must be called with lock->wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
 {
/*
 * We have to be careful here if the atomic speedups are
@@ -406,7 +421,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
 */
mark_rt_mutex_waiters(lock);
 
-   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
return 0;
 
/* We got the lock. */
@@ -707,7 +722,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current->lock_depth;
 
/* Try to acquire the lock */
-   if (try_to_take_rt_mutex(lock))
+   if (try_to_take_rt_mutex(lock, 1))
break;
/*
 * waiter.task is NULL the first time we come here and
@@ -947,7 +962,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);
 
/* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
+   if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -970,7 +985,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_fla

[(RT RFC) PATCH v2 0/9] adaptive real-time locks

2008-02-25 Thread Gregory Haskins
You can download this series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v2.tar.bz2

Changes since v1:

*) Rebased from 24-rt1 to 24.2-rt2
*) Dropped controversial (and likely unnecessary) printk patch
*) Dropped (internally) controversial PREEMPT_SPINLOCK_WAITERS config options
*) Incorporated review feedback for comment/config cleanup from Pavel/PeterZ
*) Moved lateral-steal to front of queue
*) Fixed compilation issue with !defined(LATERAL_STEAL)
*) Moved spinlock rework into a separate series:
   ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Todo:
*) Convert loop based timeouts to use nanoseconds
*) Tie into lockstat infrastructure.
*) Long-term: research adaptive-timeout algorithms so a fixed/one-size-
   -fits-all value is not necessary.



Adaptive real-time locks

The Real Time patches to the Linux kernel convert the architecture-
specific SMP-synchronization primitives commonly referred to as
"spinlocks" to an "RT mutex" implementation that supports a priority
inheritance protocol and priority-ordered wait queues.  The RT mutex
implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains both
the priority inheritance protocol and the preemptive nature of
spinlocks and mutexes, and adds a 300+% throughput increase to the Linux
Real Time kernel.  It applies to 2.6.24-rt1.

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall the I/O throughput seems to approach the
throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is
shipped by most distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are typically held for very short time spans.  Too often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context switch overhead by up to 90%, and therefore eliminate CPU
cycles as well as massive hot-spots in the scheduler / other bottlenecks
in the Kernel - even though we busy-wait (using CPU cycles) to poll the
lock.
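
In rough terms, the contended slow path becomes the sketch below, condensed
from the adaptive_wait() code in patches 5/9 and 6/9 (wait_lock handling, PI
boosting, and BKL juggling omitted); a return of 1 means "give up and sleep",
0 means "reloop and retry the acquisition":

static int adaptive_wait_sketch(struct rt_mutex *lock,
				struct rt_mutex_waiter *waiter,
				struct task_struct *owner, int timeout)
{
	for (; timeout > 0; timeout--) {
		if (!waiter->task)
			return 0;	/* we were granted the lock; reloop */
		if (owner != rt_mutex_owner(lock))
			return 0;	/* ownership changed; reloop and retry */
		if (!owner->se.on_rq)
			return 1;	/* owner is not running; spinning is futile */
		if (task_cpu(owner) == task_cpu(current))
			return 1;	/* owner is on our CPU; sleep to avoid deadlock */
		cpu_relax();
	}
	return 1;			/* spun for the full timeout; fall back to sleeping */
}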

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. 

Some tests show a marked improvement (for instance, ~450% more throughput
for dbench, and ~500% faster for hackbench), whereas for some others
(make -j 128) the results were not as profound, but they were still
net-positive. In all cases we have also verified that deterministic
latency is not impacted by using cyclic-test. 

This patch series depends on some re-work on the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock->wait_locks could cause rare but
serious latency spikes that are fixed by a fifo raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
fifo-lock support as well, which we will gladly post upon request.

You can find this re-work as a separate series here:

ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Special thanks go to many people who were instrumental to this project,
including:
  *) the -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at Suse, Montavista, and Bill Huey for their
 time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 0/9] adaptive real-time locks

2008-02-25 Thread Gregory Haskins
You can download this series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v2.tar.bz2

Changes since v1:

*) Rebased from 24-rt1 to 24.2-rt2
*) Dropped controversial (and likely unecessary) printk patch
*) Dropped (internally) controversial PREEMPT_SPINLOCK_WAITERS config options
*) Incorporated review feedback for comment/config cleanup from Pavel/PeterZ
*) Moved lateral-steal to front of queue
*) Fixed compilation issue with !defined(LATERAL_STEAL)
*) Moved spinlock rework into a separate series:
   ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Todo:
*) Convert loop based timeouts to use nanoseconds
*) Tie into lockstat infrastructure.
*) Long-term: research adaptive-timeout algorithms so a fixed/one-size-
   -fits-all value is not necessary.



Adaptive real-time locks

The Real Time patches to the Linux kernel converts the architecture
specific SMP-synchronization primitives commonly referred to as
spinlocks to an RT mutex implementation that support a priority
inheritance protocol, and priority-ordered wait queues.  The RT mutex
implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains both
the priority inheritance protocol as well as the preemptive nature of
spinlocks and mutexes and adds a 300+% throughput increase to the Linux
Real time kernel.  It applies to 2.6.24-rt1.  

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall the I/O throughput seems to approach the
throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is
shipped by most distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are typically held for very short time spans.  Too often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context switch overhead by up to 90%, and therefore eliminate CPU
cycles as well as massive hot-spots in the scheduler / other bottlenecks
in the Kernel - even though we busy-wait (using CPU cycles) to poll the
lock.

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2Ghz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. 

Some tests show a marked improvement (for instance, ~450% more throughput
for dbench, and ~500% faster for hackbench), whereas some others
(make -j 128) the results were not as profound but they were still
net-positive. In all cases we have also verified that deterministic
latency is not impacted by using cyclic-test. 

This patch series depends on some re-work on the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock-wait_locks could cause rare but
serious latency spikes that are fixed by a fifo raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
fifo-lock support as well, which we will gladly post upon request.

You can find this re-work as a separate series here:

ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2

Special thanks go to many people who were instrumental to this project,
including:
  *) the -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at Suse, Montavista, and Bill Huey for their
 time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to include lateral priority

2008-02-25 Thread Gregory Haskins
The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain signficant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority.  The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake-up a different task.

This does add a degree of unfairness, yes.  But also note that the users
of these locks under non -rt environments have already been using unfair
raw spinlocks anyway so the tradeoff is probably worth it.

The way I like to think of this is that higher priority tasks should
clearly preempt, and lower priority tasks should clearly block.  However,
if tasks have an identical priority value, then we can think of the
scheduler decisions as the tie-breaking parameter. (e.g. tasks that the
scheduler picked to run first have a logically higher priority amoung tasks
of the same prio).  This helps to keep the system primed with tasks doing
useful work, and the end result is higher throughput.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   10 ++
 kernel/rtmutex.c   |   31 +++
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..e493257 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -196,3 +196,13 @@ config SPINLOCK_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config RTLOCK_LATERAL_STEAL
+bool Allow equal-priority rtlock stealing
+default y
+depends on PREEMPT_RT
+help
+  This option alters the rtlock lock-stealing logic to allow
+  equal priority tasks to preempt a pending owner in addition
+  to higher priority tasks.  This allows for a significant
+  boost in throughput under certain circumstances at the expense
+  of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..6624c66 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -313,12 +313,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct 
*task,
return ret;
 }
 
+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+   if (current-prio = pendowner-prio)
+#else
+   if (current-prio  pendowner-prio)
+   return 0;
+
+   if (!unfair  (current-prio == pendowner-prio))
+#endif
+   return 0;
+
+   return 1;
+}
+
 /*
  * Optimization: check if we can steal the lock from the
  * assigned pending owner [which might not have taken the
  * lock yet]:
  */
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 {
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -330,7 +345,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;
 
spin_lock(pendowner-pi_lock);
-   if (current-prio = pendowner-prio) {
+   if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(pendowner-pi_lock);
return 0;
}
@@ -383,7 +398,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
  *
  * Must be called with lock-wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
 {
/*
 * We have to be careful here if the atomic speedups are
@@ -406,7 +421,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
 */
mark_rt_mutex_waiters(lock);
 
-   if (rt_mutex_owner(lock)  !try_to_steal_lock(lock))
+   if (rt_mutex_owner(lock)  !try_to_steal_lock(lock, unfair))
return 0;
 
/* We got the lock. */
@@ -707,7 +722,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current-lock_depth;
 
/* Try to acquire the lock */
-   if (try_to_take_rt_mutex(lock))
+   if (try_to_take_rt_mutex(lock, 1))
break;
/*
 * waiter.task is NULL the first time we come here and
@@ -947,7 +962,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);
 
/* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
+   if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(lock-wait_lock, flags);
return 0;
}
@@ -970,7 +985,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_flags;
 
/* Try to acquire the lock: */
-   if (try_to_take_rt_mutex(lock

[(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
From: Sven-Thorsten Dietrich [EMAIL PROTECTED]

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich [EMAIL PROTECTED]
---

 kernel/rtmutex.c |7 ++-
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 6624c66..cd39c26 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -18,6 +18,10 @@
 
 #include "rtmutex_common.h"
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
	if (current->prio > pendowner->prio)
		return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+      !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c913d48..c24c53d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -836,6 +840,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
+   .data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current-state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cd39c26..ef52db6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -681,6 +681,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+   unsigned long state = xchg(&current->state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+   saved_state = current->state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -752,14 +761,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
		debug_rt_mutex_print_deadlock(&waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
		spin_lock_irqsave(&lock->wait_lock, flags);
		current->flags |= saved_flags;
		current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
}
 
	state = xchg(&current->state, saved_state);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 4/9] optimize rt lock wakeup

2008-02-25 Thread Gregory Haskins
It is redundant to wake the grantee task if it is already running

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   45 -
 1 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ef52db6..bf9e230 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -531,6 +531,41 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
	pendowner = waiter->task;
	waiter->task = NULL;
 
+   /*
+* Do the wakeup before the ownership change to give any spinning
+* waiter grantees a headstart over the other threads that will
+* trigger once owner changes.
+*/
+   if (!savestate)
+   wake_up_process(pendowner);
+   else {
+   /*
+* We can skip the actual (expensive) wakeup if the
+* waiter is already running, but we have to be careful
+* of race conditions because they may be about to sleep.
+*
+* The waiter-side protocol has the following pattern:
+* 1: Set state != RUNNING
+* 2: Conditionally sleep if waiter->task != NULL;
+*
+* And the owner-side has the following:
+* A: Set waiter->task = NULL
+* B: Conditionally wake if the state != RUNNING
+*
+* As long as we ensure 1->2 order, and A->B order, we
+* will never miss a wakeup.
+*
+* Therefore, this barrier ensures that waiter->task = NULL
+* is visible before we test the pendowner->state.  The
+* corresponding barrier is in the sleep logic.
+*/
+   smp_mb();
+
+   if ((pendowner->state != TASK_RUNNING)
+       && (pendowner->state != TASK_RUNNING_MUTEX))
+   wake_up_process_mutex(pendowner);
+   }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
	spin_unlock(&current->pi_lock);
@@ -557,11 +592,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
	}
	spin_unlock(&pendowner->pi_lock);
-
-   if (savestate)
-   wake_up_process_mutex(pendowner);
-   else
-   wake_up_process(pendowner);
 }
 
 /*
@@ -762,6 +792,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
	debug_rt_mutex_print_deadlock(&waiter);
 
	update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   /*
+* The xchg() in update_current() is an implicit barrier
+* which we rely upon to ensure current->state is visible
+* before we test waiter.task.
+*/
if (waiter.task)
schedule_rt_mutex(lock);
else

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks.  However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention.  Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent.  Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.
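
In its simplest form, the core loop this series adds boils down to
roughly the following (a simplified illustration only; the helper name
below is made up, and the real adaptive_wait() in rtmutex_adaptive.h
additionally tracks the waiter state and, later in the series, a
timeout):

/*
 * Sketch of the adaptive policy: keep spinning while the current owner
 * is still running on another CPU (so release is plausibly imminent);
 * otherwise fall back to sleeping.
 */
static int adaptive_wait_sketch(struct rt_mutex *lock,
				struct task_struct *owner)
{
	for (;;) {
		if (rt_mutex_owner(lock) != owner)
			return 0;	/* ownership changed: retry acquisition */

		if (!owner->se.on_rq)
			return 1;	/* owner not running: sleep */

		if (task_cpu(owner) == task_cpu(current))
			return 1;	/* owner shares our CPU: sleep to avoid deadlock */

		cpu_relax();		/* spin (fully preemptible) and look again */
	}
}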

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   20 +++
 kernel/rtmutex.c  |   30 +++---
 kernel/rtmutex_adaptive.h |  138 +
 3 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e493257..d2432fa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -183,6 +183,26 @@ config RCU_TRACE
  Say Y/M here if you want to enable RCU tracing in-kernel/module.
  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+bool "Adaptive real-time locks"
+default y
+depends on PREEMPT_RT && SMP
+help
+  PREEMPT_RT allows for greater determinism by transparently
+  converting normal spinlock_ts into preemptible rtmutexes which
+  sleep any waiters under contention.  However, in many cases the
+  lock will be released in less time than it takes to context
+  switch.  Therefore, the sleep under contention policy may also
+  degrade throughput performance due to the extra context switches.
+
+  This option alters the rtmutex derived spinlock_t replacement
+  code to use an adaptive spin/sleep algorithm.  It will spin
+  unless it determines it must sleep to avoid deadlock.  This
+  offers a best of both worlds solution since we achieve both
+  high-throughput and low-latency.
+
+  If unsure, say Y.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index bf9e230..3802ef8 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,8 @@
  *  Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner [EMAIL PROTECTED]
  *  Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  *  Copyright (C) 2006 Esben Nielsen
+ *  Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale,
+ *   and Gregory Haskins
  *
  *  See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +19,7 @@
 #include <linux/hardirq.h>
 
 #include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"
 
 #ifdef CONFIG_RTLOCK_LATERAL_STEAL
 int rtmutex_lateral_steal __read_mostly = 1;
@@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+   DECLARE_ADAPTIVE_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(waiter);
waiter.task = NULL;
@@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
 
+   prepare_adaptive_wait(lock, &adaptive);
+
/*
 * Prevent schedule() to drop BKL, while waiting for
 * the lock ! We restore lock_depth when we come back.
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
	debug_rt_mutex_print_deadlock(&waiter);
 
-   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-   /*
-* The xchg() in update_current() is an implicit barrier
-* which we rely upon to ensure current->state is visible
-* before we test waiter.task.
-*/
-   if (waiter.task)
-   schedule_rt_mutex(lock);
-   else
-   update_current(TASK_RUNNING_MUTEX, &saved_state);
+   /* adaptive_wait() returns 1

[(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism

2008-02-25 Thread Gregory Haskins
From: Sven Dietrich [EMAIL PROTECTED]

Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2432fa..ac1cbad 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK
 
   If unsure, say Y.
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default 1
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3802ef8..4a16b13 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -25,6 +25,10 @@
 int rtmutex_lateral_steal __read_mostly = 1;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 862c088..60c6328 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
	put_task_struct(adaptive->owner);
 
return sleep;
@@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
	get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c24c53d..55189ea 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -56,6 +56,8 @@
 #include <asm/stacktrace.h>
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   37 ++
 kernel/rtmutex.c  |   76 +
 kernel/rtmutex_adaptive.h |   32 ++-
 kernel/sysctl.c   |   10 ++
 4 files changed, 119 insertions(+), 36 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ac1cbad..864bf14 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -214,6 +214,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool "Adaptive real-time mutexes"
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int "Default delay (in loops) for adaptive mutexes"
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default 3000
+help
+ This allows you to specify the maximum delay a task will use
+to wait for a rt mutex before going to sleep.  Note that that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a16b13..ea593e0 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1;
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* We can skip the actual (expensive) wakeup if the
+* waiter is already running, but we have to be careful
+* of race conditions because they may be about to sleep.
+*
+* The waiter-side protocol has the following pattern:
+* 1: Set state != RUNNING
+* 2: Conditionally sleep if waiter->task != NULL;
+*
+* And the owner-side has the following:
+* A: Set waiter->task = NULL
+* B: Conditionally wake if the state != RUNNING
+*
+* As long as we ensure 1->2 order, and A->B order, we
+* will never miss a wakeup.
+*
+* Therefore, this barrier ensures that waiter->task = NULL
+* is visible before we test the pendowner->state.  The
+* corresponding barrier is in the sleep logic.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   /*
-* We can skip the actual (expensive) wakeup if the
-* waiter is already running, but we have to be careful
-* of race conditions because they may be about to sleep.
-*
-* The waiter-side protocol has the following pattern:
-* 1: Set state != RUNNING
-* 2: Conditionally sleep if waiter->task != NULL;
-*
-* And the owner-side has the following:
-* A: Set waiter->task = NULL
-* B: Conditionally wake if the state != RUNNING
-*
-* As long as we ensure 1->2 order, and A->B order, we
-* will never miss a wakeup.
-*
-   

[(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup

2008-02-25 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch yields a measurable increase in throughput.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ea593e0..b81bbef 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
	spin_lock(&current->pi_lock);
 
@@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
	spin_lock(&pendowner->pi_lock);
 
	WARN_ON(!pendowner->pi_blocked_on);
@@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
	pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
	spin_unlock(&pendowner->pi_lock);
 }
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[(RT RFC) PATCH v2 9/9] remove the extra call to try_to_take_lock

2008-02-25 Thread Gregory Haskins
From: Peter W. Morreale [EMAIL PROTECTED]

Remove the redundant attempt to get the lock.  While it is true that the
exit path with this patch adds an unnecessary xchg (in the event the
lock is granted without further traversal in the loop), experimentation
shows that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b81bbef..266ae31 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -756,12 +756,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
	spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
 
-   /* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
-   spin_unlock_irqrestore(&lock->wait_lock, flags);
-   return;
-   }
-
BUG_ON(rt_mutex_owner(lock) == current);
 
/*

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  4:54 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
   * saved_state accordingly. If we did not get a real wakeup
   * then we return with the saved state.
   */
 -saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
 +saved_state = current->state;
 +smp_mb();
  
  for (;;) {
  unsigned long saved_flags;
 
 Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed.  I think that is a relic 
from back when I was debugging some other problems.  Let me wrap my head around 
the implications of removing it, and either remove it or document appropriately.

 
 Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so.  We are substituting a write with a read, and word reads are 
always atomic anyway IIUC (or is that only true on certain architectures)?  
Note that we are moving the atomic-write to be done later in the 
update_current() calls.

-Greg



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:03 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 +/*
 + * Adaptive-rtlocks will busywait when possible, and sleep only if
 + * necessary. Note that the busyloop looks racy, and it is, but we do
 + * not care. If we lose any races it simply means that we spin one more
 + * time before seeing that we need to break-out on the next iteration.
 + *
 + * We realize this is a relatively large function to inline, but note that
 + * it is only instantiated 1 or 2 times max, and it makes a measurable
 + * performance difference to avoid the call.
 + *
 + * Returns 1 if we should sleep
 + *
 + */
 +static inline int
 +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 +  struct adaptive_waiter *adaptive)
 +{
 +int sleep = 0;
 +
 +for (;;) {
 +/*
 + * If the task was re-awoken, break out completely so we can
 + * reloop through the lock-acquisition code.
 + */
 +if (!waiter->task)
 +break;
 +
 +/*
 + * We need to break if the owner changed so we can reloop
 + * and safely acquire the owner-pointer again with the
 + * wait_lock held.
 + */
 +if (adaptive->owner != rt_mutex_owner(lock))
 +break;
 +
 +/*
 + * If we got here, presumably the lock ownership is still
 + * current.  We will use it to our advantage to be able to
 + * spin without disabling preemption...
 + */
 +
 +/*
 + * .. sleep if the owner is not running..
 + */
 +if (!adaptive->owner->se.on_rq) {
 +sleep = 1;
 +break;
 +}
 +
 +/*
 + * .. or is running on our own cpu (to prevent deadlock)
 + */
 +if (task_cpu(adaptive->owner) == task_cpu(current)) {
 +sleep = 1;
 +break;
 +}
 +
 +cpu_relax();
 +}
 +
 +put_task_struct(adaptive->owner);
 +
 +return sleep;
 +}
 +
 
 You want to inline this?

Yes.  As the comment indicates, there are 1-2 users tops, and it has a 
significant impact on throughput (>5%) to take the hit with a call.  I don't 
think it's actually much code anyway... it's all comments.

 
 +static inline void
 +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter 
 *adaptive)
 ...
 +#define prepare_adaptive_wait(lock, busy) {}
 
 This is evil. Use empty inline function instead (same for the other
 function, there you can maybe get away with it).

Ok.


   Pavel



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:09 PM, in message
[EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: 
 Hi!
 
 From: Peter W.Morreale [EMAIL PROTECTED]
 
 This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
 a new tunable: rtmutex_timeout, which is the companion to the
 rtlock_timeout tunable.
 
 Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
 
 Not signed off by you?

I wasn't sure if this was appropriate for me to do.  This is the first time I 
was acting as upstream to someone.  If that is what I am expected to do, 
consider this an ack for your remaining comments related to this.

 
 diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
 index ac1cbad..864bf14 100644
 --- a/kernel/Kconfig.preempt
 +++ b/kernel/Kconfig.preempt
 @@ -214,6 +214,43 @@ config RTLOCK_DELAY
   tunable at runtime via a sysctl.  A setting of 0 (zero) disables
   the adaptive algorithm entirely.
  
 +config ADAPTIVE_RTMUTEX
  +bool "Adaptive real-time mutexes"
 +default y
 +depends on ADAPTIVE_RTLOCK
 +help
 + This option adds the adaptive rtlock spin/sleep algorithm to
 + rtmutexes.  In rtlocks, a significant gain in throughput
 + can be seen by allowing rtlocks to spin for a distinct
 + amount of time prior to going to sleep for deadlock avoidence.
 + 
 + Typically, mutexes are used when a critical section may need to
 + sleep due to a blocking operation.  In the event the critical 
 + section does not need to sleep, an additional gain in throughput 
 + can be seen by avoiding the extra overhead of sleeping.
 
 Watch the whitespace. ... and do we need yet another config options?
 
 +config RTMUTEX_DELAY
  +int "Default delay (in loops) for adaptive mutexes"
 +range 0 1000
 +depends on ADAPTIVE_RTMUTEX
 +default 3000
 +help
 + This allows you to specify the maximum delay a task will use
 + to wait for a rt mutex before going to sleep.  Note that that
 + although the delay is implemented as a preemptable loop, tasks
 + of like priority cannot preempt each other and this setting can
 + result in increased latencies.
 + 
 + The value is tunable at runtime via a sysctl.  A setting of 0
 + (zero) disables the adaptive algorithm entirely.
 
 Ouch.

?  Is this a reference to the whitespace damage, or does the content need addressing?

 
 +#ifdef CONFIG_ADAPTIVE_RTMUTEX
 +
 +#define mutex_adaptive_wait adaptive_wait
 +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
 +
 +extern int rtmutex_timeout;
 +
 +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
 + struct adaptive_waiter name = { .owner = NULL,   \
 + .timeout = rtmutex_timeout, }
 +
 +#else
 +
 +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
 +
 +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
 +#define mutex_prepare_adaptive_wait(lock, busy) {}
 
 More evil macros. Macro does not behave like a function, make it
 inline function if you are replacing a function.

Ok


   Pavel



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing

2008-02-25 Thread Gregory Haskins
 On Mon, Feb 25, 2008 at  5:57 PM, in message
[EMAIL PROTECTED], Sven-Thorsten Dietrich
[EMAIL PROTECTED] wrote: 

 But Greg may need to enforce it on his git tree that he mails these from
 - are you referring to anything specific in this patch?
 

That's what I don't get.  I *did* checkpatch all of these before sending them 
out (and I have for every release).

I am aware of two tabs vs spaces warnings, but the rest checked clean.  Why 
do some people still see errors when I don't?  Is there a set of switches I 
should supply to checkpatch to make it more aggressive or something?

-Greg

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-24 Thread Gregory Haskins

Bill Huey (hui) wrote:


The might_sleep is an annotation as well as a conditional preemption
point for the regular kernel. You might want to do a schedule check
there, but it's the wrong function if memory serves me correctly. It's
reserved for things that actually are designed to sleep.


Note that might_sleep() already does a cond_resched() on the 
configurations that need it, so I am not sure what you are getting at 
here.  Is that not enough?
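
(For reference, the definition in include/linux/kernel.h for kernels of
roughly this vintage is, as best I recall it -- double-check against
your tree:)

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
void __might_sleep(char *file, int line);
# define might_sleep() \
	do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
#else
# define might_sleep() do { might_resched(); } while (0)
#endif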




The rt_spin*()
function are really a method of preserving BKL semantics across real
schedule() calls. You'd have to use something else instead for that
purpose like cond_reschedule() instead.


I don't quite understand this part either.  From my perspective, 
rt_spin*() functions are locking constructs that might sleep (or might 
spin with the new patches), and they happen to be BKL and wakeup 
transparent.  To me, either the might_sleep() is correct for all paths 
that don't fit the in_atomic-printk exception, or none of them are.


Are you saying that the modified logic that I introduced is broken?  Or 
that the original use of the might_sleep() annotation inside this 
function is broken?


-Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-22 Thread Gregory Haskins

Pavel Machek wrote:

Hi!


Decorate the printk path with an "unlikely()"

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (!current->in_printk)
-   might_sleep();
-   else if (in_atomic() || irqs_disabled())
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 
+	might_sleep();


I think you changed the code here... you call might_sleep() in
different cases afaict.


Agreed, but it's still correct afaict.  I added an extra might_sleep() 
to a path that really might sleep.  I should have mentioned that in the 
header.


In any case, its moot.  Andi indicated this patch is probably a no-op so 
I was considering dropping it on the v2 pass.


Regards,
-Greg



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Gregory Haskins

Paul E. McKenney wrote:

Governing the timeout by context-switch overhead sounds even better to me.
Really easy to calibrate, and short critical sections are of much shorter
duration than are a context-switch pair.


Yeah, fully agree.  This is on my research "todo" list.  My theory is 
that the ultimate adaptive-timeout algorithm here would essentially be 
the following:


*) compute the context-switch pair time average for the system.  This is 
your time threshold (CSt).


*) For each lock, maintain an average hold-time (AHt) statistic (I am 
assuming this can be done cheaply...perhaps not).


The adaptive code would work as follows:

if (AHt > CSt) /* dont even bother if the average is greater than CSt */
   timeout = 0;
else
   timeout = AHt;

if (adaptive_wait(timeout))
   sleep();

Anyone have some good ideas on how to compute CSt?  I was thinking you 
could create two kthreads that message one another (measuring round-trip 
time) for some number of round trips (say 100) to get an average.  You 
could probably 
just approximate it with flushing workqueue jobs.
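
Something along these lines is what I had in mind (completely untested
sketch, names made up; strictly speaking it measures a wakeup round
trip, which only equals a context-switch pair when both threads share a
CPU):

#include <linux/kthread.h>
#include <linux/completion.h>
#include <linux/ktime.h>
#include <linux/err.h>

struct cst_ctx {
	struct completion ping;
	struct completion pong;
	int iters;
};

static int cst_pong_thread(void *arg)
{
	struct cst_ctx *ctx = arg;
	int i;

	for (i = 0; i < ctx->iters; i++) {
		wait_for_completion(&ctx->ping);
		complete(&ctx->pong);
	}
	return 0;	/* exits on its own after iters round trips */
}

/* returns an estimate of one context-switch pair, in ns (0 on error) */
static u64 estimate_cst_ns(int iters)
{
	struct cst_ctx ctx = { .iters = iters };
	struct task_struct *t;
	u64 total = 0;
	int i;

	init_completion(&ctx.ping);
	init_completion(&ctx.pong);

	t = kthread_run(cst_pong_thread, &ctx, "cst-pong");
	if (IS_ERR(t))
		return 0;

	for (i = 0; i < iters; i++) {
		ktime_t start = ktime_get();

		complete(&ctx.ping);		/* wake the partner...            */
		wait_for_completion(&ctx.pong);	/* ...and sleep until it wakes us */
		total += ktime_to_ns(ktime_sub(ktime_get(), start));
	}

	return total / iters;
}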


-Greg



Thanx, Paul


Sven


Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep

2008-02-22 Thread Gregory Haskins

Gregory Haskins wrote:

@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		schedule_rt_mutex(lock);

+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);


I have a question for everyone out there about this particular part of 
the code. Patch 6/14 adds an optimization that is predicated on the 
order in which we modify the state==TASK_UNINTERRUPTIBLE vs reading the 
waiter.task below.


My assumption is that the xchg() (inside update_current()) acts as an 
effective wmb().  If xchg() does not have this property, then this code 
is broken and patch 6/14 should also add a:



+   smp_wmb();



+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
 		spin_lock_irqsave(&lock->wait_lock, flags);
 
current->flags |= saved_flags;
current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;



Does anyone know the answer to this?

Regards,
-Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at  4:42 PM, in message <[EMAIL PROTECTED]>,
Ingo Molnar <[EMAIL PROTECTED]> wrote: 

> * Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> 
>> I came to the original conclusion that it wasn't originally worth it, 
>> but the dbench number published say otherwise. [...]
> 
> dbench is a notoriously unreliable and meaningless workload. It's being 
> frowned upon by the VM and the IO folks.

I agree... it's a pretty weak benchmark.  BUT, it does pound on dcache_lock and 
therefore was a good demonstration of the benefits of reduced contention 
overhead.  Note that we also threw other tests into that PDF if you scroll to 
the subsequent pages.

> If that's the only workload 
> where spin-mutexes help, and if it's only a 3% improvement [of which it 
> is unclear how much of that improvement was due to ticket spinlocks], 
> then adaptive mutexes are probably not worth it.

Note that the "3%" figure being thrown around was from a single patch within 
the series.  We are actually getting a net average gain of 443% in dbench.  And 
note that the number goes *up* when you remove the ticketlocks.  The 
ticketlocks are there to prevent latency spikes, not improve throughput.

Also take a look at the hackbench numbers, which are particularly promising.
We get a net average gain of 493% for RT10-based hackbench runs.  The 
kernel build showed only a small gain, but it was all gain nonetheless.  We see 
similar results for any other workloads we throw at this thing.  I will gladly 
run any test requested to which I have the ability to run, and I would 
encourage third party results as well.


> 
> I'd not exclude them fundamentally though, it's really the numbers that 
> matter. The code is certainly simple enough (albeit the .config and 
> sysctl controls are quite ugly and unacceptable - adaptive mutexes 
> should really be ... adaptive, with no magic constants in .configs or 
> else).

We can clean this up, per your suggestions.

> 
> But ... i'm somewhat sceptic, after having played with spin-a-bit 
> mutexes before.

It's very subtle to get this concept to work.  The first few weeks, we were 
getting 90% regressions ;)  Then we had a breakthrough and started to get this 
thing humming along quite nicely.

Regards,
-Greg




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at  4:24 PM, in message <[EMAIL PROTECTED]>,
Ingo Molnar <[EMAIL PROTECTED]> wrote: 

> hm. Why is the ticket spinlock patch included in this patchset? It just 
> skews your performance results unnecessarily. Ticket spinlocks are 
> independent conceptually, they are already upstream in 2.6.25-rc2 and 
> -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous.  I included them because we found the patch series 
without them can cause spikes due to the newly introduced pressure on the 
(raw_spinlock_t)lock->wait_lock.  You can run the adaptive-spin patches without 
them just fine (in fact, in many cases things run faster without them... dbench 
*thrives* on chaos).  But you may also measure a cyclic-test spike if you do 
so.  So I included them to present a "complete package without spikes".  I 
tried to explain that detail in the prologue, but most people probably fell 
asleep before they got to the end ;)

> 
> and if we take the ticket spinlock patch out of your series, the the 
> size of the patchset shrinks in half and touches only 200-300 lines of 
> code ;-) Considering the total size of the -rt patchset:
> 
>652 files changed, 23830 insertions(+), 4636 deletions(-)
> 
> we can regard it a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

> 
> regarding the concept: adaptive mutexes have been talked about in the 
> past, but their advantage is not at all clear, that's why we havent done 
> them. It's definitely not an unambigiously win-win concept.
> 
> So lets get some real marketing-free benchmarking done, and we are not 
> just interested in the workloads where a bit of polling on contended 
> locks helps, but we are also interested in workloads where the polling 
> hurts ... And lets please do the comparisons without the ticket spinlock 
> patch ...

I'm open to suggestion, and this was just a sample of the testing we have done. 
 We have thrown plenty of workloads at this patch series far beyond the slides 
I prepared in that URL, and they all seem to indicate a net positive 
improvement so far.  Some of those results I cannot share due to NDA, and some 
I didn't share simply because I never formally collected the data like I did for 
these tests.  If there is something you would like to see, please let me know 
and I will arrange for it to be executed if at all possible.

Regards,
-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 11:41 AM, in message <[EMAIL PROTECTED]>,
Andi Kleen <[EMAIL PROTECTED]> wrote: 

>> +config RTLOCK_DELAY
>> +int "Default delay (in loops) for adaptive rtlocks"
>> +range 0 10
>> +depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
>> +default "1"
> 
> Perhaps you can expand how you came up with that default number? 

Actually, the number doesn't seem to matter that much as long as it is 
sufficiently long to make timeouts rare.  Most workloads will present 
some threshold for hold-time.  You generally get the best performance if the 
value is at least as "long" as that threshold.  Anything beyond that and there 
is no gain, but there doesn't appear to be a penalty either.  So we picked 
1 because we found it to fit that criteria quite well for our range of GHz 
class x86 machines.  YMMV, but that is why it's configurable ;)

> It looks suspiciously round and worse the actual spin time depends a lot on 
> the 
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
> differently from a 2Ghz CPU) 

Yeah, fully agree.  We really wanted to use a time-value here but ran into 
various problems that have yet to be resolved.  We have it on the todo list to 
express this in terms of ns so it at least will scale with the architecture.

> Did you experiment with other spin times?

Of course ;)

> Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research "todo".

> And at what point is real
> time behaviour visibly impacted? 

Well, if we did our jobs correctly, RT behavior should *never* be impacted.  
*Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier. There is that threshold that 
affects the probability of timing out.  Values lower than that threshold start 
to degrade throughput.  Values higher than that have no effect on throughput, 
but may drive the cpu utilization higher which can theoretically impact tasks 
of equal or lesser priority by taking that resource away from them.  To date, 
we have not observed any real-world implications of this however.

> 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iteration similar to what
> udelay does. That would be at least constant time.

I agree.  We need to move in the direction of time-basis.  The tradeoff is that 
it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). 
 I think one of the (not-included) patches converts a nanosecond value from the 
sysctl to approximate loop-counts using the bogomips data.  This was a decent 
compromise between the non-scaling loopcounts and the heavy-weight official 
timing APIs.  We dropped it because we support older kernels which were 
conflicting with the patch. We may have to resurrect it, however..
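
Conceptually it boiled down to something like this (illustrative only,
not the actual dropped patch; the function name is made up):

#include <linux/delay.h>	/* loops_per_jiffy */
#include <linux/jiffies.h>	/* HZ */
#include <linux/time.h>		/* NSEC_PER_SEC */
#include <asm/div64.h>		/* do_div() */

/*
 * Convert a nanosecond spin budget into an approximate loop count via
 * the delay-loop calibration, so the default scales with CPU speed.
 * Each adaptive_wait() iteration does more work than a calibration
 * loop, so a budget computed this way tends to spin longer than the
 * requested ns; it is only a heuristic bound.
 */
static unsigned long adaptive_ns_to_loops(unsigned long ns)
{
	u64 loops = (u64)loops_per_jiffy * HZ;	/* calibrated loops per second */

	loops *= ns;
	do_div(loops, NSEC_PER_SEC);

	return (unsigned long)loops;
}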

-Greg



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 11:36 AM, in message <[EMAIL PROTECTED]>,
Andi Kleen <[EMAIL PROTECTED]> wrote: 
> On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote:
> 
>> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
>>  void fastcall (*slowfn)(struct rt_mutex *lock))
>>  {
>>  /* Temporary HACK! */
>> -if (!current->in_printk)
>> -might_sleep();
>> -else if (in_atomic() || irqs_disabled())
>> +if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
> 
> I have my doubts that gcc will honor unlikelies that don't affect
> the complete condition of an if.
> 
> Also conditions guarding returns are by default predicted unlikely
> anyways AFAIK. 
> 
> The patch is likely a nop.
> 

Yeah, you are probably right.  We have found that the system is *extremely* 
touchy about how much overhead we have in the lock-acquisition path.  For 
instance, using a non-inline version of adaptive_wait() can cost 5-10% in 
disk-io throughput.  So we were trying to find places to shave anywhere we 
could.  That being said, I didn't record any difference from this patch, so you 
are probably exactly right.  It just seemed like "the right thing to do" so I 
left it in.

-Greg



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 10:26 AM, in message
<[EMAIL PROTECTED]>, Gregory Haskins
<[EMAIL PROTECTED]> wrote: 

> We have put together some data from different types of benchmarks for
> this patch series, which you can find here:
> 
> ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also places a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 13/14] allow rt-mutex lock-stealing to include lateral priority

2008-02-21 Thread Gregory Haskins
The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain signficant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority.  The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake-up a different task.

This does add a degree of unfairness, yes.  But also note that the users
of these locks under non -rt environments have already been using unfair
raw spinlocks anyway so the tradeoff is probably worth it.

The way I like to think of this is that higher priority tasks should
clearly preempt, and lower priority tasks should clearly block.  However,
if tasks have an identical priority value, then we can think of the
scheduler decisions as the tie-breaking parameter. (e.g. tasks that the
scheduler picked to run first have a logically higher priority among tasks
of the same prio).  This helps to keep the system "primed" with tasks doing
useful work, and the end result is higher throughput.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt |   10 ++
 kernel/rtmutex.c   |   31 +++
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2b0daa..343b93c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -273,3 +273,13 @@ config SPINLOCK_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config RTLOCK_LATERAL_STEAL
+bool "Allow equal-priority rtlock stealing"
+   default y
+   depends on PREEMPT_RT
+   help
+This option alters the rtlock lock-stealing logic to allow
+equal priority tasks to preempt a pending owner in addition
+to higher priority tasks.  This allows for a significant
+boost in throughput under certain circumstances at the expense
+of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 95c3644..da077e5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -323,12 +323,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct 
*task,
return ret;
 }
 
+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+   if (current->prio >= pendowner->prio)
+#else
+   if (current->prio > pendowner->prio)
+   return 0;
+
+   if (!unfair && (current->prio == pendowner->prio))
+#endif
+   return 0;
+
+   return 1;
+}
+
 /*
  * Optimization: check if we can steal the lock from the
  * assigned pending owner [which might not have taken the
  * lock yet]:
  */
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 {
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -340,7 +355,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;
 
spin_lock(&pendowner->pi_lock);
-   if (current->prio >= pendowner->prio) {
+   if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
@@ -393,7 +408,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
  *
  * Must be called with lock->wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
 {
/*
 * We have to be careful here if the atomic speedups are
@@ -416,7 +431,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
 */
mark_rt_mutex_waiters(lock);
 
-   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
return 0;
 
/* We got the lock. */
@@ -737,7 +752,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current->lock_depth;
 
/* Try to acquire the lock */
-   if (try_to_take_rt_mutex(lock))
+   if (try_to_take_rt_mutex(lock, 1))
break;
/*
 * waiter.task is NULL the first time we come here and
@@ -985,7 +1000,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);
 
/* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
+   if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -1006,7 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_fla

[PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing

2008-02-21 Thread Gregory Haskins
From: Sven-Thorsten Dietrich <[EMAIL PROTECTED]>

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |8 ++--
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index da077e5..62e7af5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #ifdef CONFIG_ADAPTIVE_RTMUTEX
 int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
 
 /*
  * lock->owner state tracking:
@@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+  !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
@@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, 
int unfair)
return 1;
 
spin_lock(&pendowner->pi_lock);
-   if (!lock_is_stealable(pendowner, unfair)) {
+   if (!lock_is_stealable(pendowner, (unfair & rtmutex_lateral_steal))) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3465af2..c1a1c6d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
.data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
.proc_handler   = &proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-21 Thread Gregory Haskins
Decorate the printk path with an "unlikely()"

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (!current->in_printk)
-   might_sleep();
-   else if (in_atomic() || irqs_disabled())
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 
+   might_sleep();
+
if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
rt_mutex_deadlock_account_lock(lock, current);
else
@@ -677,7 +677,7 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (current->in_printk && (in_atomic() || irqs_disabled()))
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 12/14] remove the extra call to try_to_take_lock

2008-02-21 Thread Gregory Haskins
From: Peter W. Morreale <[EMAIL PROTECTED]>

Remove the redundant attempt to get the lock.  While it is true that the
exit path with this patch adds an un-necessary xchg (in the event the
lock is granted without further traversal in the loop) experimentation
shows that we almost never encounter this situation. 

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ebdaa17..95c3644 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -718,12 +718,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
 
-   /* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
-   spin_unlock_irqrestore(&lock->wait_lock, flags);
-   return;
-   }
-
BUG_ON(rt_mutex_owner(lock) == current);
 
/*

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 09/14] adaptive mutexes

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale <[EMAIL PROTECTED]>

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   37 +
 kernel/rtmutex.c  |   44 ++--
 kernel/rtmutex_adaptive.h |   32 ++--
 kernel/sysctl.c   |   10 ++
 4 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index eebec19..d2b0daa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -223,6 +223,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool "Adaptive real-time mutexes"
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int "Default delay (in loops) for adaptive mutexes"
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default "3000"
+help
+ This allows you to specify the maximum delay a task will use
+to wait for a rt mutex before going to sleep.  Note that that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a7423f..a7ed7b2 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -24,6 +24,10 @@
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* This may appear to be a race, but the barriers close the
+* window.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   smp_mb();
-   /*
-* This may appear to be a race, but the barriers close the
-* window.
-*/
-   if ((pendowner->state != TASK_RUNNING)
-   && (pendowner->state != TASK_RUNNING_MUTEX))
+   smp_mb();
+   if ((pendowner->state != TASK_RUNNING)
+   && (pendowner->state != TASK_RUNNING_MUTEX)) {
+   if (!savestate)
+   wake_up_process(pendowner);
+   else
wake_up_process_mutex(pendowner);
}
 
@@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(&waiter);
 
/* adaptive_wait() returns 1 if we need to sleep */
-   if (adaptive_wait(lock, &waiter, &adaptive)) {
+   if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
update_current(TASK_UNINTERRUPTIBLE, &saved_state);
if (waiter.task)
schedule_rt_mutex(lock);
@@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
+   DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(current->lock_depth >= 0))
saved_lock_depth = 

[PATCH [RT] 10/14] adjust pi_lock usage in wakeup

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale <[EMAIL PROTECTED]>

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a7ed7b2..122f143 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
spin_lock(&current->pi_lock);
 
@@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
spin_lock(&pendowner->pi_lock);
 
WARN_ON(!pendowner->pi_blocked_on);
@@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
spin_unlock(&pendowner->pi_lock);
 }
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 07/14] adaptive real-time lock support

2008-02-21 Thread Gregory Haskins
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks.  However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention.  Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent.  Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.
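
The algorithm itself lives in the new kernel/rtmutex_adaptive.h, which is cut
off in this archive copy.  Conceptually the wait path boils down to the sketch
below; this is an illustration only, where spinning_still_useful() is a
hypothetical stand-in for the real checks against the owner snapshotted by
prepare_adaptive_wait(), and patch 8/14 additionally bounds the loop with a
tunable timeout:

/* Illustrative sketch, not the actual rtmutex_adaptive.h code. */
static inline int
adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
              struct adaptive_waiter *adaptive)
{
    int sleep = 0;

    for (;;) {
        /* We were granted the lock (or otherwise woken) while spinning:
         * break out and reloop through the lock-acquisition code. */
        if (!waiter->task)
            break;

        /* Hypothetical placeholder: once spinning can no longer make
         * progress, fall back to the normal sleeping path. */
        if (!spinning_still_useful(lock, adaptive)) {
            sleep = 1;
            break;
        }

        cpu_relax();
    }

    put_task_struct(adaptive->owner);
    return sleep;    /* 1 tells the slow path it must really sleep */
}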

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Peter Morreale <[EMAIL PROTECTED]>
Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   20 +++
 kernel/rtmutex.c  |   19 +-
 kernel/rtmutex_adaptive.h |  134 +
 3 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 5b45213..6568519 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,6 +192,26 @@ config RCU_TRACE
  Say Y/M here if you want to enable RCU tracing in-kernel/module.
  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+bool "Adaptive real-time locks"
+   default y
+   depends on PREEMPT_RT && SMP
+   help
+PREEMPT_RT allows for greater determinism by transparently
+converting normal spinlock_ts into preemptible rtmutexes which
+sleep any waiters under contention.  However, in many cases the
+lock will be released in less time than it takes to context
+switch.  Therefore, the "sleep under contention" policy may also
+degrade throughput performance due to the extra context switches.
+
+This option alters the rtmutex derived spinlock_t replacement
+code to use an adaptive spin/sleep algorithm.  It will spin
+unless it determines it must sleep to avoid deadlock.  This
+offers a best of both worlds solution since we achieve both
+high-throughput and low-latency.
+
+If unsure, say Y
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cb27b08..feb938f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,7 @@
  *  Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  *  Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  *  Copyright (C) 2006 Esben Nielsen
+ *  Copyright (C) 2008 Novell, Inc.
  *
  *  See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +18,7 @@
 #include <linux/hardirq.h>
 
 #include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"
 
 /*
  * lock->owner state tracking:
@@ -697,6 +699,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+   DECLARE_ADAPTIVE_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(&waiter);
waiter.task = NULL;
@@ -743,6 +746,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
 
+   prepare_adaptive_wait(lock, &adaptive);
+
/*
 * Prevent schedule() to drop BKL, while waiting for
 * the lock ! We restore lock_depth when we come back.
@@ -754,11 +759,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(&waiter);
 
-   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-   if (waiter.task)
-   schedule_rt_mutex(lock);
-   else
-   update_current(TASK_RUNNING_MUTEX, &saved_state);
+   /* adaptive_wait() returns 1 if we need to sleep */
+   if (adaptive_wait(lock, &waiter, &adaptive)) {
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+  

[PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
From: Sven Dietrich <[EMAIL PROTECTED]>

Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK
 
 If unsure, say Y
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default "1"
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include 
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
.data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
.proc_handler   = &proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 06/14] optimize rt lock wakeup

2008-02-21 Thread Gregory Haskins
It is redundant to wake the grantee task if it is already running

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Peter Morreale <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 15fc6e6..cb27b08 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -511,6 +511,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
pendowner = waiter->task;
waiter->task = NULL;
 
+   /*
+* Do the wakeup before the ownership change to give any spinning
+* waiter grantees a headstart over the other threads that will
+* trigger once owner changes.
+*/
+   if (!savestate)
+   wake_up_process(pendowner);
+   else {
+   smp_mb();
+   /*
+* This may appear to be a race, but the barriers close the
+* window.
+*/
+   if ((pendowner->state != TASK_RUNNING)
+   && (pendowner->state != TASK_RUNNING_MUTEX))
+   wake_up_process_mutex(pendowner);
+   }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
spin_unlock(&current->pi_lock);
@@ -537,11 +555,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
}
spin_unlock(&pendowner->pi_lock);
-
-   if (savestate)
-   wake_up_process_mutex(pendowner);
-   else
-   wake_up_process(pendowner);
 }
 
 /*

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 05/14] rearrange rt_spin_lock sleep

2008-02-21 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current->state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..15fc6e6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -661,6 +661,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
unsigned long state = xchg(&current->state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -700,7 +708,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+   saved_state = current->state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(&waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, &saved_state);
 
spin_lock_irqsave(&lock->wait_lock, flags);
current->flags |= saved_flags;
current->lock_depth = saved_lock_depth;
-   state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
}
 
state = xchg(&current->state, saved_state);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 03/14] x86: FIFO ticket spinlocks

2008-02-21 Thread Gregory Haskins
From: Nick Piggin <[EMAIL PROTECTED]>

Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable,
with a userspace test having a difference of up to 2x runtime per thread, and
some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled
and cannot reenable interrupts when spinning on an irq safe lock, because
at that point we have already taken a ticket and the would deadlock if
the same CPU tries to take the lock again.  These are questionable anyway:
if the lock happens to be called under a preempt or interrupt disabled section,
then it will just have the same latency problems. The real fix is to keep
critical sections short, and ensure locks are reasonably fair (which this
patch does).
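
For anyone who wants the protocol without the x86 assembly: the cover letter
mentions a generic C variant of this scheme exists; purely as an illustration
(not that code, and not the code in this patch), the idea in plain C with the
GCC __sync builtins is roughly:

struct ticket_lock {
    volatile unsigned short head;   /* ticket currently being served */
    volatile unsigned short tail;   /* next ticket to hand out */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
    /* Take a ticket: atomically note the current tail and advance it. */
    unsigned short me = __sync_fetch_and_add(&lock->tail, 1);

    /* FIFO: spin until the head catches up to our ticket. */
    while (lock->head != me)
        __asm__ __volatile__("rep; nop" ::: "memory"); /* cpu_relax() */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
    /* Only the lock holder writes head, so a plain increment after a
     * full barrier is enough to hand the lock to the next waiter. */
    __sync_synchronize();
    lock->head++;
}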

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---

 include/asm-x86/spinlock.h   |  225 ++
 include/asm-x86/spinlock_32.h|  221 -
 include/asm-x86/spinlock_64.h|  167 
 include/asm-x86/spinlock_types.h |2 
 4 files changed, 224 insertions(+), 391 deletions(-)

diff --git a/include/asm-x86/spinlock.h b/include/asm-x86/spinlock.h
index d74d85e..72fe445 100644
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -1,5 +1,226 @@
+#ifndef _X86_SPINLOCK_H_
+#define _X86_SPINLOCK_H_
+
+#include <asm/atomic.h>
+#include <asm/rwlock.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+#include <linux/compiler.h>
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations.  There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
 #ifdef CONFIG_X86_32
-# include "spinlock_32.h"
+typedef char _slock_t;
+# define LOCK_INS_DEC "decb"
+# define LOCK_INS_XCH "xchgb"
+# define LOCK_INS_MOV "movb"
+# define LOCK_INS_CMP "cmpb"
+# define LOCK_PTR_REG "a"
 #else
-# include "spinlock_64.h"
+typedef int _slock_t;
+# define LOCK_INS_DEC "decl"
+# define LOCK_INS_XCH "xchgl"
+# define LOCK_INS_MOV "movl"
+# define LOCK_INS_CMP "cmpl"
+# define LOCK_PTR_REG "D"
+#endif
+
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) != (tmp & 0xff));
+}
+
+static inline int __raw_spin_is_contended(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
+}
+
+static inline void __raw_spin_lock(__raw_spinlock_t *lock)
+{
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX "xaddw %w0, %1\n"
+   "1:\t"
+   "cmpb %h0, %b0\n\t"
+   "je 2f\n\t"
+   "rep ; nop\n\t"
+   "movb %1, %b0\n\t"
+   /* don't need lfence here, because loads are in-order */
+   "jmp 1b\n"
+   "2:"
+   :"+Q" (inc), "+m" (lock->slock)
+   :
+   :"memory", "cc");
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int 

[PATCH [RT] 04/14] disable PREEMPT_SPINLOCK_WAITERS when x86 ticket/fifo spins are in use

2008-02-21 Thread Gregory Haskins
Preemptible spinlock waiters effectively bypasses the benefits of a fifo
spinlock.  Since we now have fifo spinlocks for x86 enabled, disable the
preemption feature on x86.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>
---

 arch/x86/Kconfig |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8d15667..d5b9a67 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -20,6 +20,7 @@ config X86
bool
default y
select HAVE_MCOUNT
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
 
 config GENERIC_TIME
bool

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option

2008-02-21 Thread Gregory Haskins
We introduce a configuration variable for the feature to make it easier for
various architectures and/or configs to enable or disable it based on their
requirements.  

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt |9 +
 kernel/spinlock.c  |7 +++
 lib/Kconfig.debug  |1 +
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..5b45213 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,6 +86,15 @@ config PREEMPT
default y
depends on PREEMPT_DESKTOP || PREEMPT_RT
 
+config DISABLE_PREEMPT_SPINLOCK_WAITERS
+bool
+   default n
+
+config PREEMPT_SPINLOCK_WAITERS
+bool
+   default y
+   depends on PREEMPT && SMP && !DISABLE_PREEMPT_SPINLOCK_WAITERS
+
 config PREEMPT_SOFTIRQS
bool "Thread Softirqs"
default n
diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index b0e7f02..2e6a904 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -116,8 +116,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if !defined(CONFIG_PREEMPT_SPINLOCK_WAITERS)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {
@@ -244,7 +243,7 @@ void __lockfunc __write_lock(raw_rwlock_t *lock)
 
 EXPORT_SYMBOL(__write_lock);
 
-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 /*
  * This could be a long-held lock. We both prepare to spin for a long
@@ -334,7 +333,7 @@ BUILD_LOCK_OPS(spin, raw_spinlock);
 BUILD_LOCK_OPS(read, raw_rwlock);
 BUILD_LOCK_OPS(write, raw_rwlock);
 
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9208791..f2889b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -233,6 +233,7 @@ config DEBUG_LOCK_ALLOC
bool "Lock debugging: detect incorrect freeing of live locks"
depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT 
&& LOCKDEP_SUPPORT
select DEBUG_SPINLOCK
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
select DEBUG_MUTEXES
select LOCKDEP
help

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
The Real Time patches to the Linux kernel converts the architecture
specific SMP-synchronization primitives commonly referred to as
"spinlocks" to an "RT mutex" implementation that support a priority
inheritance protocol, and priority-ordered wait queues.  The RT mutex
implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains both
the priority inheritance protocol as well as the preemptive nature of
spinlocks and mutexes and adds a 300+% throughput increase to the Linux
Real time kernel.  It applies to 2.6.24-rt1.  

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall the I/O throughput seems to approach the
throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is
shipped by most distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are typically held for very short time spans.  Too often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context switch overhead by up to 90%, and therefore eliminate CPU
cycles as well as massive hot-spots in the scheduler / other bottlenecks
in the Kernel - even though we busy-wait (using CPU cycles) to poll the
lock.
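
Structurally, the modified rtlock slow path (patches 5 and 7-9) ends up with
roughly the following shape; this is a simplified sketch of the loop rather
than the literal code, which is in the individual patches:

/* Simplified shape of rt_spin_lock_slowlock() after this series. */
for (;;) {
    if (try_to_take_rt_mutex(lock))
        break;                          /* got the lock */

    /* ... (re)queue ourselves as a waiter if needed ... */

    prepare_adaptive_wait(lock, &adaptive); /* snapshot the current owner */

    /* Spin preemptibly while a quick release still looks likely;
     * adaptive_wait() returns 1 once we really must sleep. */
    if (adaptive_wait(lock, &waiter, &adaptive)) {
        update_current(TASK_UNINTERRUPTIBLE, &saved_state);
        if (waiter.task)
            schedule_rt_mutex(lock);    /* the old unconditional behavior */
        else
            update_current(TASK_RUNNING_MUTEX, &saved_state);
    }
}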

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2Ghz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. 

Some tests show a marked improvement (for instance, dbench and hackbench),
whereas some others (make -j 128) the results were not as profound but
they were still net-positive. In all cases we have also verified that
deterministic latency is not impacted by using cyclic-test. 

This patch series also includes some re-work on the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock->wait_locks could cause rare but
serious latency spikes that are fixed by a fifo raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
fifo-lock support as well, which we will gladly post upon request.

Special thanks go to many people who were instrumental to this project,
including:
  *) the -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at Suse, Montavista, and Bill Huey for their
 time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled

2008-02-21 Thread Gregory Haskins
The logic is currently broken so that PREEMPT_RT disables preemptible
spinlock waiters, which is counter intuitive. 

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/spinlock.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index c9bcf1b..b0e7f02 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
 #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT)
+   defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option

2008-02-21 Thread Gregory Haskins
We introduce a configuration variable for the feature to make it easier for
various architectures and/or configs to enable or disable it based on their
requirements.  

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |9 +
 kernel/spinlock.c  |7 +++
 lib/Kconfig.debug  |1 +
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 41a0d88..5b45213 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -86,6 +86,15 @@ config PREEMPT
default y
depends on PREEMPT_DESKTOP || PREEMPT_RT
 
+config DISABLE_PREEMPT_SPINLOCK_WAITERS
+bool
+   default n
+
+config PREEMPT_SPINLOCK_WAITERS
+bool
+   default y
+   depends on PREEMPT  SMP  !DISABLE_PREEMPT_SPINLOCK_WAITERS
+
 config PREEMPT_SOFTIRQS
bool Thread Softirqs
default n
diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index b0e7f02..2e6a904 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -116,8 +116,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if !defined(CONFIG_PREEMPT_SPINLOCK_WAITERS)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {
@@ -244,7 +243,7 @@ void __lockfunc __write_lock(raw_rwlock_t *lock)
 
 EXPORT_SYMBOL(__write_lock);
 
-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 /*
  * This could be a long-held lock. We both prepare to spin for a long
@@ -334,7 +333,7 @@ BUILD_LOCK_OPS(spin, raw_spinlock);
 BUILD_LOCK_OPS(read, raw_rwlock);
 BUILD_LOCK_OPS(write, raw_rwlock);
 
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPT_SPINLOCK_WAITERS */
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 9208791..f2889b2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -233,6 +233,7 @@ config DEBUG_LOCK_ALLOC
bool Lock debugging: detect incorrect freeing of live locks
depends on DEBUG_KERNEL  TRACE_IRQFLAGS_SUPPORT  STACKTRACE_SUPPORT 
 LOCKDEP_SUPPORT
select DEBUG_SPINLOCK
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
select DEBUG_MUTEXES
select LOCKDEP
help

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 03/14] x86: FIFO ticket spinlocks

2008-02-21 Thread Gregory Haskins
From: Nick Piggin [EMAIL PROTECTED]

Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable,
with a userspace test having a difference of up to 2x runtime per thread, and
some threads are starved or unfairly granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled
and cannot reenable interrupts when spinning on an irq safe lock, because
at that point we have already taken a ticket and the would deadlock if
the same CPU tries to take the lock again.  These are questionable anyway:
if the lock happens to be called under a preempt or interrupt disabled section,
then it will just have the same latency problems. The real fix is to keep
critical sections short, and ensure locks are reasonably fair (which this
patch does).

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---

 include/asm-x86/spinlock.h   |  225 ++
 include/asm-x86/spinlock_32.h|  221 -
 include/asm-x86/spinlock_64.h|  167 
 include/asm-x86/spinlock_types.h |2 
 4 files changed, 224 insertions(+), 391 deletions(-)

diff --git a/include/asm-x86/spinlock.h b/include/asm-x86/spinlock.h
index d74d85e..72fe445 100644
--- a/include/asm-x86/spinlock.h
+++ b/include/asm-x86/spinlock.h
@@ -1,5 +1,226 @@
+#ifndef _X86_SPINLOCK_H_
+#define _X86_SPINLOCK_H_
+
+#include asm/atomic.h
+#include asm/rwlock.h
+#include asm/page.h
+#include asm/processor.h
+#include linux/compiler.h
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations.  There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
 #ifdef CONFIG_X86_32
-# include spinlock_32.h
+typedef char _slock_t;
+# define LOCK_INS_DEC decb
+# define LOCK_INS_XCH xchgb
+# define LOCK_INS_MOV movb
+# define LOCK_INS_CMP cmpb
+# define LOCK_PTR_REG a
 #else
-# include spinlock_64.h
+typedef int _slock_t;
+# define LOCK_INS_DEC decl
+# define LOCK_INS_XCH xchgl
+# define LOCK_INS_MOV movl
+# define LOCK_INS_CMP cmpl
+# define LOCK_PTR_REG D
+#endif
+
+#if (NR_CPUS  256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)((lock)-slock);
+
+   return (((tmp  8)  0xff) != (tmp  0xff));
+}
+
+static inline int __raw_spin_is_contended(__raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)((lock)-slock);
+
+   return (((tmp  8)  0xff) - (tmp  0xff))  1;
+}
+
+static inline void __raw_spin_lock(__raw_spinlock_t *lock)
+{
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX xaddw %w0, %1\n
+   1:\t
+   cmpb %h0, %b0\n\t
+   je 2f\n\t
+   rep ; nop\n\t
+   movb %1, %b0\n\t
+   /* don't need lfence here, because loads are in-order */
+   jmp 1b\n
+   2:
+   :+Q (inc), +m (lock-slock)
+   :
+   :memory, cc);
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int 

[PATCH [RT] 04/14] disable PREEMPT_SPINLOCK_WAITERS when x86 ticket/fifo spins are in use

2008-02-21 Thread Gregory Haskins
Preemptible spinlock waiters effectively bypasses the benefits of a fifo
spinlock.  Since we now have fifo spinlocks for x86 enabled, disable the
preemption feature on x86.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
CC: Nick Piggin [EMAIL PROTECTED]
---

 arch/x86/Kconfig |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8d15667..d5b9a67 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -20,6 +20,7 @@ config X86
bool
default y
select HAVE_MCOUNT
+   select DISABLE_PREEMPT_SPINLOCK_WAITERS
 
 config GENERIC_TIME
bool

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 05/14] rearrange rt_spin_lock sleep

2008-02-21 Thread Gregory Haskins
The current logic makes rather coarse adjustments to current-state since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a2b00cc..15fc6e6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -661,6 +661,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+   unsigned long state = xchg(current-state, new_state);
+   if (unlikely(state == TASK_RUNNING))
+   *saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -700,7 +708,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 * saved_state accordingly. If we did not get a real wakeup
 * then we return with the saved state.
 */
-   saved_state = xchg(current-state, TASK_UNINTERRUPTIBLE);
+   saved_state = current-state;
+   smp_mb();
 
for (;;) {
unsigned long saved_flags;
@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(waiter);
 
-   schedule_rt_mutex(lock);
+   update_current(TASK_UNINTERRUPTIBLE, saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX, saved_state);
 
spin_lock_irqsave(lock-wait_lock, flags);
current-flags |= saved_flags;
current-lock_depth = saved_lock_depth;
-   state = xchg(current-state, TASK_UNINTERRUPTIBLE);
-   if (unlikely(state == TASK_RUNNING))
-   saved_state = TASK_RUNNING;
}
 
state = xchg(current-state, saved_state);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 06/14] optimize rt lock wakeup

2008-02-21 Thread Gregory Haskins
It is redundant to wake the grantee task if it is already running

Credit goes to Peter for the general idea.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 15fc6e6..cb27b08 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -511,6 +511,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
pendowner = waiter-task;
waiter-task = NULL;
 
+   /*
+* Do the wakeup before the ownership change to give any spinning
+* waiter grantees a headstart over the other threads that will
+* trigger once owner changes.
+*/
+   if (!savestate)
+   wake_up_process(pendowner);
+   else {
+   smp_mb();
+   /*
+* This may appear to be a race, but the barriers close the
+* window.
+*/
+   if ((pendowner-state != TASK_RUNNING)
+(pendowner-state != TASK_RUNNING_MUTEX))
+   wake_up_process_mutex(pendowner);
+   }
+
rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
spin_unlock(current-pi_lock);
@@ -537,11 +555,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
plist_add(next-pi_list_entry, pendowner-pi_waiters);
}
spin_unlock(pendowner-pi_lock);
-
-   if (savestate)
-   wake_up_process_mutex(pendowner);
-   else
-   wake_up_process(pendowner);
 }
 
 /*

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 07/14] adaptive real-time lock support

2008-02-21 Thread Gregory Haskins
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great
detail on either one, we note that spinlocks have the advantage of
lower overhead for short hold locks.  However, they also have a
con in that they create indeterminate latencies since preemption
must traditionally be disabled while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt. Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock
may also decrease performance since the locks will now sleep under
contention.  Since the fundamental lock used to be a spinlock, it is
highly likely that it was used in a short-hold path and that release
is imminent.  Therefore sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They
spin when possible, and sleep when necessary (to avoid deadlock, etc).
This significantly improves many areas of the performance of the -rt
kernel.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
Signed-off-by: Peter Morreale [EMAIL PROTECTED]
Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   20 +++
 kernel/rtmutex.c  |   19 +-
 kernel/rtmutex_adaptive.h |  134 +
 3 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 5b45213..6568519 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,6 +192,26 @@ config RCU_TRACE
  Say Y/M here if you want to enable RCU tracing in-kernel/module.
  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+bool Adaptive real-time locks
+   default y
+   depends on PREEMPT_RT  SMP
+   help
+PREEMPT_RT allows for greater determinism by transparently
+converting normal spinlock_ts into preemptible rtmutexes which
+sleep any waiters under contention.  However, in many cases the
+lock will be released in less time than it takes to context
+switch.  Therefore, the sleep under contention policy may also
+degrade throughput performance due to the extra context switches.
+
+This option alters the rtmutex derived spinlock_t replacement
+code to use an adaptive spin/sleep algorithm.  It will spin
+unless it determines it must sleep to avoid deadlock.  This
+offers a best of both worlds solution since we achieve both
+high-throughput and low-latency.
+
+If unsure, say Y
+
 config SPINLOCK_BKL
bool Old-Style Big Kernel Lock
depends on (PREEMPT || SMP)  !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cb27b08..feb938f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,7 @@
  *  Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner [EMAIL PROTECTED]
  *  Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  *  Copyright (C) 2006 Esben Nielsen
+ *  Copyright (C) 2008 Novell, Inc.
  *
  *  See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +18,7 @@
 #include linux/hardirq.h
 
 #include rtmutex_common.h
+#include rtmutex_adaptive.h
 
 /*
  * lock-owner state tracking:
@@ -697,6 +699,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
struct rt_mutex_waiter waiter;
unsigned long saved_state, state, flags;
+   DECLARE_ADAPTIVE_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(waiter);
waiter.task = NULL;
@@ -743,6 +746,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
continue;
}
 
+   prepare_adaptive_wait(lock, adaptive);
+
/*
 * Prevent schedule() to drop BKL, while waiting for
 * the lock ! We restore lock_depth when we come back.
@@ -754,11 +759,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
debug_rt_mutex_print_deadlock(waiter);
 
-   update_current(TASK_UNINTERRUPTIBLE, saved_state);
-   if (waiter.task)
-   schedule_rt_mutex(lock);
-   else
-   update_current(TASK_RUNNING_MUTEX, saved_state);
+   /* adaptive_wait() returns 1 if we need to sleep */
+   if (adaptive_wait(lock, waiter, adaptive)) {
+   update_current(TASK_UNINTERRUPTIBLE, saved_state);
+   if (waiter.task)
+   schedule_rt_mutex(lock);
+   else
+   update_current(TASK_RUNNING_MUTEX

[PATCH [RT] 09/14] adaptive mutexes

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
a new tunable: rtmutex_timeout, which is the companion to the
rtlock_timeout tunable.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   37 +
 kernel/rtmutex.c  |   44 ++--
 kernel/rtmutex_adaptive.h |   32 ++--
 kernel/sysctl.c   |   10 ++
 4 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index eebec19..d2b0daa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -223,6 +223,43 @@ config RTLOCK_DELAY
 tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+bool Adaptive real-time mutexes
+default y
+depends on ADAPTIVE_RTLOCK
+help
+ This option adds the adaptive rtlock spin/sleep algorithm to
+ rtmutexes.  In rtlocks, a significant gain in throughput
+ can be seen by allowing rtlocks to spin for a distinct
+ amount of time prior to going to sleep for deadlock avoidence.
+ 
+ Typically, mutexes are used when a critical section may need to
+ sleep due to a blocking operation.  In the event the critical 
+section does not need to sleep, an additional gain in throughput 
+can be seen by avoiding the extra overhead of sleeping.
+ 
+ This option alters the rtmutex code to use an adaptive
+ spin/sleep algorithm.  It will spin unless it determines it must
+ sleep to avoid deadlock.  This offers a best of both worlds
+ solution since we achieve both high-throughput and low-latency.
+ 
+ If unsure, say Y
+ 
+config RTMUTEX_DELAY
+int Default delay (in loops) for adaptive mutexes
+range 0 1000
+depends on ADAPTIVE_RTMUTEX
+default 3000
+help
+ This allows you to specify the maximum delay a task will use
+to wait for an rt mutex before going to sleep.  Note that
+although the delay is implemented as a preemptable loop, tasks
+of like priority cannot preempt each other and this setting can
+result in increased latencies.
+
+ The value is tunable at runtime via a sysctl.  A setting of 0
+(zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a7423f..a7ed7b2 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -24,6 +24,10 @@
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
@@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * Do the wakeup before the ownership change to give any spinning
 * waiter grantees a headstart over the other threads that will
 * trigger once owner changes.
+*
+* This may appear to be a race, but the barriers close the
+* window.
 */
-   if (!savestate)
-   wake_up_process(pendowner);
-   else {
-   smp_mb();
-   /*
-* This may appear to be a race, but the barriers close the
-* window.
-*/
-   if ((pendowner->state != TASK_RUNNING)
-   && (pendowner->state != TASK_RUNNING_MUTEX))
+   smp_mb();
+   if ((pendowner->state != TASK_RUNNING)
+   && (pendowner->state != TASK_RUNNING_MUTEX)) {
+   if (!savestate)
+   wake_up_process(pendowner);
+   else
wake_up_process_mutex(pendowner);
}
 
@@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
debug_rt_mutex_print_deadlock(waiter);
 
/* adaptive_wait() returns 1 if we need to sleep */
-   if (adaptive_wait(lock, &waiter, &adaptive)) {
+   if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
update_current(TASK_UNINTERRUPTIBLE, &saved_state);
if (waiter.task)
schedule_rt_mutex(lock);
@@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0, saved_lock_depth = -1;
struct rt_mutex_waiter waiter;
unsigned long flags;
+   DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);
 
debug_rt_mutex_init_waiter(waiter);
waiter.task = NULL;
@@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
if (unlikely(current->lock_depth >= 0))

[PATCH [RT] 10/14] adjust pi_lock usage in wakeup

2008-02-21 Thread Gregory Haskins
From: Peter W.Morreale [EMAIL PROTECTED]

In wakeup_next_waiter(), we take the pi_lock, and then find out whether
we have another waiter to add to the pending owner.  We can reduce
contention on the pi_lock for the pending owner if we first obtain the
pointer to the next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a7ed7b2..122f143 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 {
struct rt_mutex_waiter *waiter;
struct task_struct *pendowner;
+   struct rt_mutex_waiter *next;
 
spin_lock(&current->pi_lock);
 
@@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 * waiter with higher priority than pending-owner->normal_prio
 * is blocked on the unboosted (pending) owner.
 */
+
+   if (rt_mutex_has_waiters(lock))
+   next = rt_mutex_top_waiter(lock);
+   else
+   next = NULL;
+
spin_lock(&pendowner->pi_lock);
 
WARN_ON(!pendowner->pi_blocked_on);
@@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int 
savestate)
 
pendowner->pi_blocked_on = NULL;
 
-   if (rt_mutex_has_waiters(lock)) {
-   struct rt_mutex_waiter *next;
-
-   next = rt_mutex_top_waiter(lock);
+   if (next)
plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-   }
+
spin_unlock(&pendowner->pi_lock);
 }
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-21 Thread Gregory Haskins
Decorate the printk path with an unlikely()

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/rtmutex.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (!current->in_printk)
-   might_sleep();
-   else if (in_atomic() || irqs_disabled())
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 
+   might_sleep();
+
if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
rt_mutex_deadlock_account_lock(lock, current);
else
@@ -677,7 +677,7 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
void fastcall (*slowfn)(struct rt_mutex *lock))
 {
/* Temporary HACK! */
-   if (current->in_printk && (in_atomic() || irqs_disabled()))
+   if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
/* don't grab locks for printk in atomic */
return;
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 12/14] remove the extra call to try_to_take_lock

2008-02-21 Thread Gregory Haskins
From: Peter W. Morreale [EMAIL PROTECTED]

Remove the redundant attempt to get the lock.  While it is true that the
exit path with this patch adds an unnecessary xchg (in the event the
lock is granted without further traversal in the loop) experimentation
shows that we almost never encounter this situation. 

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---

 kernel/rtmutex.c |6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ebdaa17..95c3644 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -718,12 +718,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
spin_lock_irqsave(&lock->wait_lock, flags);
init_lists(lock);
 
-   /* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
-   spin_unlock_irqrestore(&lock->wait_lock, flags);
-   return;
-   }
-
BUG_ON(rt_mutex_owner(lock) == current);
 
/*

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
From: Sven Dietrich [EMAIL PROTECTED]

Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK
 
 If unsure, say Y
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default 1
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock-owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter 
*waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct 
adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include asm/stacktrace.h
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 13/14] allow rt-mutex lock-stealing to include lateral priority

2008-02-21 Thread Gregory Haskins
The current logic only allows lock stealing to occur if the current task
is of higher priority than the pending owner. We can gain significant
throughput improvements (200%+) by allowing the lock-stealing code to
include tasks of equal priority.  The theory is that the system will make
faster progress by allowing the task already on the CPU to take the lock
rather than waiting for the system to wake-up a different task.

This does add a degree of unfairness, yes.  But also note that the users
of these locks under non-rt environments have already been using unfair
raw spinlocks anyway so the tradeoff is probably worth it.

The way I like to think of this is that higher priority tasks should
clearly preempt, and lower priority tasks should clearly block.  However,
if tasks have an identical priority value, then we can think of the
scheduler decisions as the tie-breaking parameter. (e.g. tasks that the
scheduler picked to run first have a logically higher priority among tasks
of the same prio).  This helps to keep the system primed with tasks doing
useful work, and the end result is higher throughput.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kconfig.preempt |   10 ++
 kernel/rtmutex.c   |   31 +++
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2b0daa..343b93c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -273,3 +273,13 @@ config SPINLOCK_BKL
  Say Y here if you are building a kernel for a desktop system.
  Say N if you are unsure.
 
+config RTLOCK_LATERAL_STEAL
+bool "Allow equal-priority rtlock stealing"
+   default y
+   depends on PREEMPT_RT
+   help
+This option alters the rtlock lock-stealing logic to allow
+equal priority tasks to preempt a pending owner in addition
+to higher priority tasks.  This allows for a significant
+boost in throughput under certain circumstances at the expense
+of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 95c3644..da077e5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -323,12 +323,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct 
*task,
return ret;
 }
 
+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+   if (current->prio >= pendowner->prio)
+#else
+   if (current->prio > pendowner->prio)
+   return 0;
+
+   if (!unfair && (current->prio == pendowner->prio))
+#endif
+   return 0;
+
+   return 1;
+}
+
 /*
  * Optimization: check if we can steal the lock from the
  * assigned pending owner [which might not have taken the
  * lock yet]:
  */
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 {
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -340,7 +355,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
return 1;
 
spin_lock(&pendowner->pi_lock);
-   if (current->prio >= pendowner->prio) {
+   if (!lock_is_stealable(pendowner, unfair)) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
@@ -393,7 +408,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
  *
  * Must be called with lock-wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
 {
/*
 * We have to be careful here if the atomic speedups are
@@ -416,7 +431,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
 */
mark_rt_mutex_waiters(lock);
 
-   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+   if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
return 0;
 
/* We got the lock. */
@@ -737,7 +752,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
int saved_lock_depth = current-lock_depth;
 
/* Try to acquire the lock */
-   if (try_to_take_rt_mutex(lock))
+   if (try_to_take_rt_mutex(lock, 1))
break;
/*
 * waiter.task is NULL the first time we come here and
@@ -985,7 +1000,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
init_lists(lock);
 
/* Try to acquire the lock again: */
-   if (try_to_take_rt_mutex(lock)) {
+   if (try_to_take_rt_mutex(lock, 0)) {
spin_unlock_irqrestore(&lock->wait_lock, flags);
return 0;
}
@@ -1006,7 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long saved_flags;
 
/* Try to acquire the lock: */
-   if (try_to_take_rt_mutex(lock

[PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing

2008-02-21 Thread Gregory Haskins
From: Sven-Thorsten Dietrich [EMAIL PROTECTED]

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich [EMAIL PROTECTED]
---

 kernel/rtmutex.c |8 ++--
 kernel/sysctl.c  |   14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index da077e5..62e7af5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #ifdef CONFIG_ADAPTIVE_RTMUTEX
 int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
 
 /*
  * lock-owner state tracking:
@@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct 
*pendowner, int unfair)
if (current->prio > pendowner->prio)
return 0;
 
-   if (!unfair && (current->prio == pendowner->prio))
+   if (unlikely(current->prio == pendowner->prio) &&
+  !(unfair && rtmutex_lateral_steal))
 #endif
return 0;
 
@@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, 
int unfair)
return 1;
 
spin_lock(&pendowner->pi_lock);
-   if (!lock_is_stealable(pendowner, unfair)) {
+   if (!lock_is_stealable(pendowner, (unfair && rtmutex_lateral_steal))) {
spin_unlock(&pendowner->pi_lock);
return 0;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3465af2..c1a1c6d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtmutex_lateral_steal",
+   .data   = &rtmutex_lateral_steal,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
The Real Time patches to the Linux kernel convert the architecture
specific SMP-synchronization primitives commonly referred to as
spinlocks to an RT mutex implementation that supports a priority
inheritance protocol, and priority-ordered wait queues.  The RT mutex
implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains both
the priority inheritance protocol and the preemptive nature of spinlocks
and mutexes, while delivering a 300+% throughput increase to the Linux
Real Time kernel.  It applies to 2.6.24-rt1.

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall the I/O throughput seems to approach the
throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is
shipped by most distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are typically held for very short time spans.  Too often,
by the time a task goes to sleep on a mutex, the mutex is already being
released on another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context switch overhead by up to 90%, and therefore eliminate CPU
cycles as well as massive hot-spots in the scheduler / other bottlenecks
in the Kernel - even though we busy-wait (using CPU cycles) to poll the
lock.
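
To make the idea concrete, here is a deliberately simplified sketch of the
spin/sleep decision.  This is illustration only; the real logic is
adaptive_wait() in kernel/rtmutex_adaptive.h, and owner_running() below is a
hypothetical stand-in for the owner tracking the actual patches do:

/*
 * Illustration only: a stripped-down version of the adaptive-wait idea.
 * owner_running() is a hypothetical helper; see adaptive_wait() in
 * kernel/rtmutex_adaptive.h for the real code.
 */
static int adaptive_wait_sketch(struct rt_mutex *lock,
                                struct task_struct *owner, int timeout)
{
        for (; timeout > 0; timeout--) {
                if (rt_mutex_owner(lock) != owner)
                        return 0;       /* ownership changed: retry acquisition */
                if (!owner_running(owner))
                        break;          /* owner went to sleep: spinning cannot help */
                cpu_relax();
        }
        return 1;                       /* budget exhausted or owner asleep: go to sleep */
}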

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2Ghz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. 

Some tests show a marked improvement (for instance, dbench and hackbench),
whereas some others (make -j 128) the results were not as profound but
they were still net-positive. In all cases we have also verified that
deterministic latency is not impacted by using cyclic-test. 

This patch series also includes some re-work on the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock->wait_locks could cause rare but
serious latency spikes that are fixed by a fifo raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
fifo-lock support as well, which we will gladly post upon request.

Special thanks go to many people who were instrumental to this project,
including:
  *) the -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at Suse, Montavista, and Bill Huey for their
 time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled

2008-02-21 Thread Gregory Haskins
The logic is currently broken so that PREEMPT_RT disables preemptible
spinlock waiters, which is counter intuitive. 

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/spinlock.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index c9bcf1b..b0e7f02 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
 #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-   defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT)
+   defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at 10:26 AM, in message
[EMAIL PROTECTED], Gregory Haskins
[EMAIL PROTECTED] wrote: 

 We have put together some data from different types of benchmarks for
 this patch series, which you can find here:
 
 ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also places a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at 11:36 AM, in message [EMAIL PROTECTED],
Andi Kleen [EMAIL PROTECTED] wrote: 
 On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote:
 
 @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
  void fastcall (*slowfn)(struct rt_mutex *lock))
  {
  /* Temporary HACK! */
 -if (!current->in_printk)
 -might_sleep();
 -else if (in_atomic() || irqs_disabled())
 +if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
 
 I have my doubts that gcc will honor unlikelies that don't affect
 the complete condition of an if.
 
 Also conditions guarding returns are by default predicted unlikely
 anyways AFAIK. 
 
 The patch is likely a nop.
 

Yeah, you are probably right.  We have found that the system is *extremely* 
touchy on how much overhead we have in the lock-acquisition path.  For 
instance, using a non-inline version of adaptive_wait() can cost 5-10% in 
disk-io throughput.  So we were trying to find places to shave anywhere we 
could.  That being said, I didn't record any difference from this patch, so you 
are probably exactly right.  It just seemed like the right thing to do so I 
left it in.

-Greg



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at 11:41 AM, in message [EMAIL PROTECTED],
Andi Kleen [EMAIL PROTECTED] wrote: 

 +config RTLOCK_DELAY
 +int "Default delay (in loops) for adaptive rtlocks"
 +range 0 10
 +depends on ADAPTIVE_RTLOCK
 
 I must say I'm not a big fan of putting such subtle configurable numbers
 into Kconfig. Compilation is usually the wrong place to configure
 such a thing. Just having it as a sysctl only should be good enough.
 
 +default 1
 
 Perhaps you can expand how you came up with that default number? 

Actually, the number doesn't seem to matter that much as long as it is 
sufficiently long enough to make timeouts rare.  Most workloads will present 
some threshold for hold-time.  You generally get the best performance if the 
value is at least as long as that threshold.  Anything beyond that and there 
is no gain, but there doesn't appear to be a penalty either.  So we picked 
1 because we found it to fit that criteria quite well for our range of GHz 
class x86 machines.  YMMV, but that is why it's configurable ;)

 It looks suspiciously round and worse the actual spin time depends a lot on 
 the 
 CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
 differently from a 2Ghz CPU) 

Yeah, fully agree.  We really wanted to use a time-value here but ran into 
various problems that have yet to be resolved.  We have it on the todo list to 
express this in terms of ns so it at least will scale with the architecture.

 Did you experiment with other spin times?

Of course ;)

 Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research todo.

 And at what point is real
 time behaviour visibly impacted? 

Well, if we did our jobs correctly, RT behavior should *never* be impacted.  
*Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier. There is that threshold that 
affects the probability of timing out.  Values lower than that threshold start 
to degrade throughput.  Values higher than that have no effect on throughput, 
but may drive the cpu utilization higher which can theoretically impact tasks 
of equal or lesser priority by taking that resource away from them.  To date, 
we have not observed any real-world implications of this however.

 
 Most likely it would be better to switch to something that is more
 absolute time, like checking RDTSC every few iteration similar to what
 udelay does. That would be at least constant time.

I agree.  We need to move in the direction of time-basis.  The tradeoff is that 
it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). 
 I think one of the (not-included) patches converts a nanosecond value from the 
sysctl to approximate loop-counts using the bogomips data.  This was a decent 
compromise between the non-scaling loopcounts and the heavy-weight official 
timing APIs.  We dropped it because we support older kernels which were 
conflicting with the patch. We may have to resurrect it, however..
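
For reference, a rough sketch of what such a conversion could look like (this
is just an illustration of the approach, not the code from the dropped patch):
scale the nanosecond budget by the calibrated loops_per_jiffy so the loop
count roughly tracks CPU speed.

#include <linux/delay.h>        /* loops_per_jiffy */
#include <linux/time.h>         /* NSEC_PER_SEC */
#include <asm/div64.h>          /* do_div() */

/* Rough sketch: approximate a nanosecond budget as a spin-loop count. */
static unsigned long ns_to_spin_loops(unsigned long ns)
{
        u64 loops = (u64)loops_per_jiffy * HZ;  /* ~delay-loops per second */

        loops *= ns;
        do_div(loops, NSEC_PER_SEC);            /* scale down to 'ns' worth of loops */
        return (unsigned long)loops;
}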

-Greg



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at  4:24 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 hm. Why is the ticket spinlock patch included in this patchset? It just 
 skews your performance results unnecessarily. Ticket spinlocks are 
 independent conceptually, they are already upstream in 2.6.25-rc2 and 
 -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous.  I included them because we found the patch series 
without them can cause spikes due to the newly introduced pressure on the 
(raw_spinlock_t)lock->wait_lock. You can run the adaptive-spin patches without 
them just fine (in fact, in many cases things run faster without them ... dbench 
*thrives* on chaos).  But you may also measure a cyclic-test spike if you do 
so.  So I included them to present a complete package without spikes.  I 
tried to explain that detail in the prologue, but most people probably fell 
asleep before they got to the end ;)

 
 and if we take the ticket spinlock patch out of your series, the the 
 size of the patchset shrinks in half and touches only 200-300 lines of 
 code ;-) Considering the total size of the -rt patchset:
 
652 files changed, 23830 insertions(+), 4636 deletions(-)
 
 we can regard it a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

 
 regarding the concept: adaptive mutexes have been talked about in the 
 past, but their advantage is not at all clear, that's why we havent done 
 them. It's definitely not an unambigiously win-win concept.
 
 So lets get some real marketing-free benchmarking done, and we are not 
 just interested in the workloads where a bit of polling on contended 
 locks helps, but we are also interested in workloads where the polling 
 hurts ... And lets please do the comparisons without the ticket spinlock 
 patch ...

I'm open to suggestion, and this was just a sample of the testing we have done. 
 We have thrown plenty of workloads at this patch series far beyond the slides 
I prepared in that URL, and they all seem to indicate a net positive 
improvement so far.  Some of those results I cannot share due to NDA, and some 
I didn't share simply because I never formally collected the data like I did for 
these tests.  If there is something you would like to see, please let me know 
and I will arrange for it to be executed if at all possible.

Regards,
-Greg

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
 On Thu, Feb 21, 2008 at  4:42 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote: 

 * Bill Huey (hui) [EMAIL PROTECTED] wrote:
 
 I came to the original conclusion that it wasn't originally worth it, 
 but the dbench number published say otherwise. [...]
 
 dbench is a notoriously unreliable and meaningless workload. It's being 
 frowned upon by the VM and the IO folks.

I agree...it's a pretty weak benchmark.  BUT, it does pound on dcache_lock and 
therefore was a good demonstration of the benefits of lower-contention 
overhead.  Also note we also threw other tests in that PDF if you scroll to the 
subsequent pages.

 If that's the only workload 
 where spin-mutexes help, and if it's only a 3% improvement [of which it 
 is unclear how much of that improvement was due to ticket spinlocks], 
 then adaptive mutexes are probably not worth it.

Note that the 3% figure being thrown around was from a single patch within 
the series.  We are actually getting a net average gain of 443% in dbench.  And 
note that the number goes *up* when you remove the ticketlocks.  The 
ticketlocks are there to prevent latency spikes, not improve throughput.

Also take a look at the hackbench numbers which are particularly promising.   
We get a net average gain of 493% faster for RT10 based hackbench runs.  The 
kernel build was only a small gain, but it was all gain nonetheless.  We see 
similar results for any other workloads we throw at this thing.  I will gladly 
run any test requested to which I have the ability to run, and I would 
encourage third party results as well.


 
 I'd not exclude them fundamentally though, it's really the numbers that 
 matter. The code is certainly simple enough (albeit the .config and 
 sysctl controls are quite ugly and unacceptable - adaptive mutexes 
 should really be ... adaptive, with no magic constants in .configs or 
 else).

We can clean this up, per your suggestions.

 
 But ... i'm somewhat sceptic, after having played with spin-a-bit 
 mutexes before.

It's very subtle to get this concept to work.  The first few weeks, we were 
getting 90% regressions ;)  Then we had a breakthrough and started to get this 
thing humming along quite nicely.

Regards,
-Greg




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing

2008-02-19 Thread Gregory Haskins

Peter Zijlstra wrote:

On Fri, 2008-02-15 at 11:46 -0500, Gregory Haskins wrote:
  

but perhaps you can convince me that it is not needed? 
(i.e. I am still not understanding how the timer guarantees the stability).



ok, let me try again.

So we take rq->lock, at this point we know rd is valid.
We also know the timer is active.

So when we release it, the last reference can be dropped and we end up
in the hrtimer_cancel(), right before the kfree().

hrtimer_cancel() will wait for the timer to end. therefore delaying the
kfree() until the running timer finished.

  


Ok, I see it now.  I agree that I think it is safe.  Thanks!

-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing

2008-02-15 Thread Gregory Haskins

Peter Zijlstra wrote:

Currently the lb_monitor will walk all the domains/cpus from a single
cpu's timer interrupt. This will cause massive cache-thrashing and cache-line
bouncing on larger machines.

Split the lb_monitor into root_domain (disjoint sched-domains).

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
CC: Gregory Haskins <[EMAIL PROTECTED]>
---
 kernel/sched.c  |  106 
 kernel/sched_fair.c |2 
 2 files changed, 59 insertions(+), 49 deletions(-)


Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -357,8 +357,6 @@ struct lb_monitor {
spinlock_t lock;
 };
 
-static struct lb_monitor lb_monitor;

-
 /*
  * How frequently should we rebalance_shares() across cpus?
  *
@@ -417,6 +415,9 @@ static void lb_monitor_wake(struct lb_mo
if (hrtimer_active(&lb_monitor->timer))
return;
 
+	/*

+* XXX: rd->load_balance && weight(rd->span) > 1
+*/
if (nr_cpu_ids == 1)
return;
 
@@ -444,6 +445,11 @@ static void lb_monitor_init(struct lb_mo
 
spin_lock_init(&lb_monitor->lock);

 }
+
+static int lb_monitor_destroy(struct lb_monitor *lb_monitor)
+{
+   return hrtimer_cancel(&lb_monitor->timer);
+}
 #endif
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares);

@@ -607,6 +613,8 @@ struct root_domain {
 */
cpumask_t rto_mask;
atomic_t rto_count;
+
+   struct lb_monitor lb_monitor;
 };
 
 /*

@@ -6328,6 +6336,7 @@ static void rq_attach_root(struct rq *rq
 {
unsigned long flags;
const struct sched_class *class;
+   int active = 0;
 
spin_lock_irqsave(&rq->lock, flags);
 
@@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq

cpu_clear(rq->cpu, old_rd->span);
cpu_clear(rq->cpu, old_rd->online);
 
-		if (atomic_dec_and_test(&old_rd->refcount))

+   if (atomic_dec_and_test(&old_rd->refcount)) {
+   /*
+* sync with active timers.
+*/
+   active = lb_monitor_destroy(&old_rd->lb_monitor);
+
kfree(old_rd);


Note that this works out to be a bug in my code on -rt since you cannot 
kfree() while the raw rq->lock is held.  This isn't your problem, per 
se, but just a heads up that I might need to patch this area ASAP.



+   }
}
 
atomic_inc(&rd->refcount);

@@ -6358,6 +6373,9 @@ static void rq_attach_root(struct rq *rq
class->join_domain(rq);
}
 
+	if (active)

+   lb_monitor_wake(&rd->lb_monitor);
+
spin_unlock_irqrestore(&rq->lock, flags);
 }
 
@@ -6367,6 +6385,8 @@ static void init_rootdomain(struct root_
 
 	cpus_clear(rd->span);

cpus_clear(rd->online);
+
+   lb_monitor_init(&rd->lb_monitor);
 }
 
 static void init_defrootdomain(void)

@@ -7398,10 +7418,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP

init_defrootdomain();
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-   lb_monitor_init(&lb_monitor);
-#endif
 #endif
init_rt_bandwidth(&def_rt_bandwidth,
global_rt_period(), global_rt_runtime());
@@ -7631,11 +7647,11 @@ void set_curr_task(int cpu, struct task_
  * distribute shares of all task groups among their schedulable entities,
  * to reflect load distribution across cpus.
  */
-static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+static int rebalance_shares(struct root_domain *rd, int this_cpu)
 {
struct cfs_rq *cfs_rq;
struct rq *rq = cpu_rq(this_cpu);
-   cpumask_t sdspan = sd->span;
+   cpumask_t sdspan = rd->span;
int state = shares_idle;
 
 	/* Walk thr' all the task groups that we have */

@@ -7685,50 +7701,12 @@ static int rebalance_shares(struct sched
return state;
 }
 
-static int load_balance_shares(struct lb_monitor *lb_monitor)

+static void set_lb_monitor_timeout(struct lb_monitor *lb_monitor, int state)
 {
-   int i, cpu, state = shares_idle;
u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC;
u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC;
u64 timeout;
 
-	/* Prevent cpus going down or coming up */

-   /* get_online_cpus(); */
-   /* lockout changes to doms_cur[] array */
-   /* lock_doms_cur(); */
-   /*
-* Enter a rcu read-side critical section to safely walk rq->sd
-* chain on various cpus and to walk task group list
-* (rq->leaf_cfs_rq_list) in rebalance_shares().
-*/
-   rcu_read_lock();
-
-   for (i = 0; i < ndoms_cur; i++) {
-   cpumask_t cpumap = doms_cur[i];
-   struct sched_domain *sd = NULL, *sd_prev = NULL;
-
-  

Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing

2008-02-15 Thread Gregory Haskins

Peter Zijlstra wrote:

Currently the lb_monitor will walk all the domains/cpus from a single
cpu's timer interrupt. This will cause massive cache-trashing and cache-line
bouncing on larger machines.

Split the lb_monitor into root_domain (disjoint sched-domains).

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
CC: Gregory Haskins [EMAIL PROTECTED]
---
 kernel/sched.c  |  106 
 kernel/sched_fair.c |2 
 2 files changed, 59 insertions(+), 49 deletions(-)


Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -357,8 +357,6 @@ struct lb_monitor {
spinlock_t lock;
 };
 
-static struct lb_monitor lb_monitor;

-
 /*
  * How frequently should we rebalance_shares() across cpus?
  *
@@ -417,6 +415,9 @@ static void lb_monitor_wake(struct lb_mo
if (hrtimer_active(lb_monitor-timer))
return;
 
+	/*

+* XXX: rd-load_balance  weight(rd-span)  1
+*/
if (nr_cpu_ids == 1)
return;
 
@@ -444,6 +445,11 @@ static void lb_monitor_init(struct lb_mo
 
 	spin_lock_init(lb_monitor-lock);

 }
+
+static int lb_monitor_destroy(struct lb_monitor *lb_monitor)
+{
+   return hrtimer_cancel(lb_monitor-timer);
+}
 #endif
 
 static void set_se_shares(struct sched_entity *se, unsigned long shares);

@@ -607,6 +613,8 @@ struct root_domain {
 */
cpumask_t rto_mask;
atomic_t rto_count;
+
+   struct lb_monitor lb_monitor;
 };
 
 /*

@@ -6328,6 +6336,7 @@ static void rq_attach_root(struct rq *rq
 {
unsigned long flags;
const struct sched_class *class;
+   int active = 0;
 
 	spin_lock_irqsave(rq-lock, flags);
 
@@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq

cpu_clear(rq-cpu, old_rd-span);
cpu_clear(rq-cpu, old_rd-online);
 
-		if (atomic_dec_and_test(old_rd-refcount))

+   if (atomic_dec_and_test(old_rd-refcount)) {
+   /*
+* sync with active timers.
+*/
+   active = lb_monitor_destroy(old_rd-lb_monitor);
+
kfree(old_rd);


Note that this works out to be a bug in my code on -rt since you cannot 
kfree() while the raw rq-lock is held.  This isn't your problem, per 
se, but just a heads up that I might need to patch this area ASAP.



+   }
}
 
 	atomic_inc(rd-refcount);

@@ -6358,6 +6373,9 @@ static void rq_attach_root(struct rq *rq
class-join_domain(rq);
}
 
+	if (active)

+   lb_monitor_wake(rd-lb_monitor);
+
spin_unlock_irqrestore(rq-lock, flags);
 }
 
@@ -6367,6 +6385,8 @@ static void init_rootdomain(struct root_
 
 	cpus_clear(rd-span);

cpus_clear(rd-online);
+
+   lb_monitor_init(rd-lb_monitor);
 }
 
 static void init_defrootdomain(void)

@@ -7398,10 +7418,6 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SMP

init_defrootdomain();
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-   lb_monitor_init(lb_monitor);
-#endif
 #endif
init_rt_bandwidth(def_rt_bandwidth,
global_rt_period(), global_rt_runtime());
@@ -7631,11 +7647,11 @@ void set_curr_task(int cpu, struct task_
  * distribute shares of all task groups among their schedulable entities,
  * to reflect load distribution across cpus.
  */
-static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+static int rebalance_shares(struct root_domain *rd, int this_cpu)
 {
struct cfs_rq *cfs_rq;
struct rq *rq = cpu_rq(this_cpu);
-   cpumask_t sdspan = sd-span;
+   cpumask_t sdspan = rd-span;
int state = shares_idle;
 
 	/* Walk thr' all the task groups that we have */

@@ -7685,50 +7701,12 @@ static int rebalance_shares(struct sched
return state;
 }
 
-static int load_balance_shares(struct lb_monitor *lb_monitor)

+static void set_lb_monitor_timeout(struct lb_monitor *lb_monitor, int state)
 {
-   int i, cpu, state = shares_idle;
u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC;
u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC;
u64 timeout;
 
-	/* Prevent cpus going down or coming up */

-   /* get_online_cpus(); */
-   /* lockout changes to doms_cur[] array */
-   /* lock_doms_cur(); */
-   /*
-* Enter a rcu read-side critical section to safely walk rq-sd
-* chain on various cpus and to walk task group list
-* (rq-leaf_cfs_rq_list) in rebalance_shares().
-*/
-   rcu_read_lock();
-
-   for (i = 0; i  ndoms_cur; i++) {
-   cpumask_t cpumap = doms_cur[i];
-   struct sched_domain *sd = NULL, *sd_prev = NULL;
-
-   cpu = first_cpu(cpumap);
-
-   /* Find the highest domain at which to balance shares

Re: [RFC][PATCH 0/2] reworking load_balance_monitor

2008-02-14 Thread Gregory Haskins
>>> On Thu, Feb 14, 2008 at  1:15 PM, in message
<[EMAIL PROTECTED]>, Paul Jackson <[EMAIL PROTECTED]> wrote: 
> Peter wrote of:
>> the lack of rd->load_balance.
> 
> Could you explain to me a bit what that means?
> 
> Does this mean that the existing code would, by default (default being
> a single sched domain, covering the entire system's CPUs) load balance
> across the entire system, but with your rework, not so load balance
> there?  That seems unlikely.
> 
> In any event, from my rather cpuset-centric perspective, there are only
> two common cases to consider.
> 
>  1. In the default case, build_sched_domains() gets called once,
> at init, with a cpu_map of all non-isolated CPUs, and we should
> forever after load balance across all those non-isolated CPUs.
> 
>  2. In some carefully managed systems using the per-cpuset
> 'sched_load_balance' flags, we tear down that first default
> sched domain, by calling detach_destroy_domains() on it, and we
> then setup some number of sched_domains (typically in the range
> of two to ten, though I suppose we should design to scale to
> hundreds of sched domains, on systems with thousands of CPUs)
> by additional calls to build_sched_domains(), such that their
> CPUs don't overlap (pairwise disjoint) and such that the union
> of all their CPUs may, or may not, include all non-isolated CPUs
> (some CPUs might be left 'out in the cold', intentionally, as
> essentially additional isolated CPUs.)  We would then expect load
> balancing within each of these pair-wise disjoint sched domains,
> but not between one of them and another.


Hi Paul,
  I think it will still work as you describe.  We create a new root-domain 
object for each pair-wise disjoint sched-domain.  In your case (1) above, we 
would only have one instance of a root-domain which contains (of course) a 
single instance of the rd->load_balance object.  This would, in fact operate 
like the global variable that Peter is suggesting it replace (IIUC).  However, 
for case (2), we would instantiate a root-domain object per pairwise-disjoint 
sched-domain, and therefore each one would have its own instance of 
rd->load_balance.
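
To illustrate the idea (note rd->load_balance does not exist in the posted
code yet; this is just a sketch of the proposal): each pairwise-disjoint
partition gets its own root_domain, so the flag naturally becomes
per-partition state rather than a global:

/* Sketch only: a per-partition flag hanging off the shared root_domain. */
struct root_domain {
        atomic_t        refcount;
        cpumask_t       span;
        cpumask_t       online;
        int             load_balance;   /* proposed: balance within this partition? */
        /* ... rto_mask, rto_count, etc. as in kernel/sched.c ... */
};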

HTH
-Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/2] reworking load_balance_monitor

2008-02-14 Thread Gregory Haskins
>>> On Thu, Feb 14, 2008 at 10:57 AM, in message
<[EMAIL PROTECTED]>, Peter Zijlstra <[EMAIL PROTECTED]>
wrote: 
> Hi,
> 
> Here the current patches that rework load_balance_monitor.
> 
> The main reason for doing this is to eliminate the wakeups the thing 
> generates,
> esp. on an idle system. The bonus is that it removes a kernel thread.
> 
> Paul, Gregory - the thing that bothers me most atm is the lack of
> rd->load_balance. Should I introduce that (-rt ought to make use of that as
> well) by way of copying from the top sched_domain when it gets created?

With the caveat that I currently have not digested your patch series, this 
sounds like a reasonable approach.  The root-domain effectively represents the 
top sched-domain anyway (with the additional attribute that it's a shared 
structure with all constituent cpus).

I'll try to take a look at the series later today and get back to you with 
feedback.

-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
>>> On Tue, Feb 12, 2008 at  2:22 PM, in message
<[EMAIL PROTECTED]>, Steven Rostedt
<[EMAIL PROTECTED]> wrote: 

> On Tue, 12 Feb 2008, Gregory Haskins wrote:
> 
>> This patch adds a new critical-section primitive pair:
>>
>> "migration_disable()" and "migration_enable()"
> 
> This is similar to what Mathieu once posted:
> 
> http://lkml.org/lkml/2007/7/11/13
> 
> Not sure the arguments against (no time to read the thread again). But I'd
> recommend that you read it.
> 
> -- Steve

Indeed, thanks for the link!  At quick glance, the concept looks identical, 
though the implementations are radically different.

-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
This patch adds a new critical-section primitive pair:

"migration_disable()" and "migration_enable()"

This allows you to force a task to remain on the current cpu, while
still remaining fully preemptible.  This is a better alternative to
modifying current->cpus_allowed because you don't have to worry about
colliding with another entity also modifying the cpumask_t while in
the critical section.

In fact, modifying the cpumask_t while in the critical section is
fully supported, but note that set_cpus_allowed() now behaves slightly
differently.  In the old code, the mask update
was synchronous: e.g. the task would be on a legal cpu when the call
returned.  The new behavior makes this asynchronous if the task is
currently in a migration-disabled critical section.  The task will
migrate to a legal cpu once the critical section ends.

This concept will be used later in the series.
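
For illustration only (not part of this patch): a hypothetical caller could
pin itself while updating per-cpu data.  "example_counter" below is a
made-up per-cpu variable used purely for the sketch:

#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(int, example_counter);    /* made-up example data */

static void example_percpu_update(void)
{
        migration_disable(current);
        /* still fully preemptible here, but guaranteed to stay on this CPU */
        __get_cpu_var(example_counter)++;
        migration_enable(current);
}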

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 include/linux/init_task.h |1 +
 include/linux/sched.h |8 +
 kernel/fork.c |1 +
 kernel/sched.c|   70 -
 kernel/sched_rt.c |6 +++-
 5 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 316a184..151197b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -137,6 +137,7 @@ extern struct group_info init_groups;
.usage  = ATOMIC_INIT(2),   \
.flags  = 0,\
.lock_depth = -1,   \
+   .migration_disable_depth = 0,   \
.prio   = MAX_PRIO-20,  \
.static_prio= MAX_PRIO-20,  \
.normal_prio= MAX_PRIO-20,  \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c87d46a..ab7768a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1109,6 +1109,7 @@ struct task_struct {
unsigned int ptrace;
 
int lock_depth; /* BKL lock depth */
+   int migration_disable_depth;
 
 #ifdef CONFIG_SMP
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
@@ -2284,10 +2285,17 @@ static inline void inc_syscw(struct task_struct *tsk)
 
 #ifdef CONFIG_SMP
 void migration_init(void);
+int migration_disable(struct task_struct *tsk);
+void migration_enable(struct task_struct *tsk);
 #else
 static inline void migration_init(void)
 {
 }
+static inline int migration_disable(struct task_struct *tsk)
+{
+   return 0;
+}
+#define migration_enable(tsk) do {} while (0)
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c00b55..7745937 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1127,6 +1127,7 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
INIT_LIST_HEAD(&p->cpu_timers[2]);
p->posix_timer_list = NULL;
p->lock_depth = -1; /* -1 = no lock */
+   p->migration_disable_depth = 0;
do_posix_clock_monotonic_gettime(&p->start_time);
p->real_start_time = p->start_time;
monotonic_to_bootbased(&p->real_start_time);
diff --git a/kernel/sched.c b/kernel/sched.c
index e6ad493..cf32000 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1231,6 +1231,8 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
  *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
u64 clock_offset;
 
+   BUG_ON(p->migration_disable_depth);
+
clock_offset = old_rq->clock - new_rq->clock;
 
 #ifdef CONFIG_SCHEDSTATS
@@ -1632,7 +1634,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, 
int sync, int mutex)
if (unlikely(task_running(rq, p)))
goto out_activate;
 
-   cpu = p->sched_class->select_task_rq(p, sync);
+   if (!p->migration_disable_depth)
+   cpu = p->sched_class->select_task_rq(p, sync);
+
if (cpu != orig_cpu) {
set_task_cpu(p, cpu);
task_rq_unlock(rq, &flags);
@@ -5422,11 +5426,12 @@ static inline void sched_init_granularity(void)
  */
 int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 {
-   struct migration_req req;
unsigned long flags;
struct rq *rq;
int ret = 0;
 
+   migration_disable(p);
+
rq = task_rq_lock(p, &flags);
if (!cpus_intersects(new_mask, cpu_online_map)) {
ret = -EINVAL;
@@ -5440,21 +5445,11 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t 
new_mask)
p->nr_cpus_allowed = cpus_weight(new_mask);
}
 
-   /* Can the task run on the task's current CPU? If so, we're done */
-   if (cpu_isset(task_cpu(p), new_mask))
-   goto out;
-
- 

[PATCH 0/2] migration disabled critical sections

2008-02-12 Thread Gregory Haskins
Hi Ingo, Steven,

I had been working on some ideas related to saving context switches in the
bottom-half mechanisms on -rt.  So far, the ideas have been a flop, but a few
peripheral technologies did come out of it.  This series is one such
idea that I thought might have some merit on its own.  The header-comments
describe it in detail, so I wont bother replicating that here.

This series applies to 24-rt1.  Any comments/feedback welcome.

-Greg


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] fix cpus_allowed settings

2008-02-12 Thread Gregory Haskins
Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 kernel/kthread.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index dcfe724..b193b47 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
wait_task_inactive(k);
set_task_cpu(k, cpu);
k->cpus_allowed = cpumask_of_cpu(cpu);
+   k->nr_cpus_allowed = 1;
 }
 EXPORT_SYMBOL(kthread_bind);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
This patch adds a new critical-section primitive pair:

migration_disable() and migration_enable()

This allows you to force a task to remain on the current cpu, while
still remaining fully preemptible.  This is a better alternative to
modifying current->cpus_allowed because you don't have to worry about
colliding with another entity that is also modifying the cpumask_t
while in the critical section.

In fact, modifying the cpumask_t while in the critical section is
fully supported, but note that set_cpus_allowed() now behaves slightly
differently.  In the old code, the mask update was synchronous:
i.e. the task would be on a legal cpu when the call returned.  The new
behavior makes this asynchronous if the task is currently in a
migration-disabled critical section.  The task will migrate to a legal
cpu once the critical section ends.

This concept will be used later in the series.
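
A minimal usage sketch (illustrative only, not part of the diff below;
access_my_percpu_data() is a hypothetical helper):

	void example(void)
	{
		migration_disable(current);

		/*
		 * We may still be preempted here, but the scheduler will
		 * not move us to another cpu until the matching
		 * migration_enable().  If our cpumask_t is changed in the
		 * meantime, the actual migration is deferred until the
		 * critical section ends.
		 */
		access_my_percpu_data();	/* hypothetical helper */

		migration_enable(current);
	}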

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---

 include/linux/init_task.h |1 +
 include/linux/sched.h |8 +
 kernel/fork.c |1 +
 kernel/sched.c|   70 -
 kernel/sched_rt.c |6 +++-
 5 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 316a184..151197b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -137,6 +137,7 @@ extern struct group_info init_groups;
.usage  = ATOMIC_INIT(2),   \
.flags  = 0,\
.lock_depth = -1,   \
+   .migration_disable_depth = 0,   \
.prio   = MAX_PRIO-20,  \
.static_prio= MAX_PRIO-20,  \
.normal_prio= MAX_PRIO-20,  \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c87d46a..ab7768a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1109,6 +1109,7 @@ struct task_struct {
unsigned int ptrace;
 
int lock_depth; /* BKL lock depth */
+   int migration_disable_depth;
 
 #ifdef CONFIG_SMP
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
@@ -2284,10 +2285,17 @@ static inline void inc_syscw(struct task_struct *tsk)
 
 #ifdef CONFIG_SMP
 void migration_init(void);
+int migration_disable(struct task_struct *tsk);
+void migration_enable(struct task_struct *tsk);
 #else
 static inline void migration_init(void)
 {
 }
+static inline int migration_disable(struct task_struct *tsk)
+{
+   return 0;
+}
+#define migration_enable(tsk) do {} while (0)
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c00b55..7745937 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1127,6 +1127,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->cpu_timers[2]);
p->posix_timer_list = NULL;
p->lock_depth = -1; /* -1 = no lock */
+   p->migration_disable_depth = 0;
do_posix_clock_monotonic_gettime(&p->start_time);
p->real_start_time = p->start_time;
monotonic_to_bootbased(&p->real_start_time);
diff --git a/kernel/sched.c b/kernel/sched.c
index e6ad493..cf32000 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1231,6 +1231,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
  *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
u64 clock_offset;
 
+   BUG_ON(p->migration_disable_depth);
+
clock_offset = old_rq->clock - new_rq->clock;
 
 #ifdef CONFIG_SCHEDSTATS
@@ -1632,7 +1634,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex)
if (unlikely(task_running(rq, p)))
goto out_activate;
 
-   cpu = p->sched_class->select_task_rq(p, sync);
+   if (!p->migration_disable_depth)
+           cpu = p->sched_class->select_task_rq(p, sync);
+
if (cpu != orig_cpu) {
set_task_cpu(p, cpu);
task_rq_unlock(rq, &flags);
@@ -5422,11 +5426,12 @@ static inline void sched_init_granularity(void)
  */
 int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 {
-   struct migration_req req;
unsigned long flags;
struct rq *rq;
int ret = 0;
 
+   migration_disable(p);
+
rq = task_rq_lock(p, &flags);
if (!cpus_intersects(new_mask, cpu_online_map)) {
ret = -EINVAL;
@@ -5440,21 +5445,11 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
p->nr_cpus_allowed = cpus_weight(new_mask);
}
 
-   /* Can the task run on the task's current CPU? If so, we're done */
-   if (cpu_isset(task_cpu(p), new_mask))
-   goto out;
-
-   if (migrate_task(p, any_online_cpu(new_mask), &req)) {
-   /* Need help


Re: [PATCH 1/2] add task migration_disable critical section

2008-02-12 Thread Gregory Haskins
>>> On Tue, Feb 12, 2008 at  2:22 PM, in message
<[EMAIL PROTECTED]>, Steven Rostedt
<[EMAIL PROTECTED]> wrote: 

> On Tue, 12 Feb 2008, Gregory Haskins wrote:
>
>> This patch adds a new critical-section primitive pair:
>>
>> migration_disable() and migration_enable()
>
> This is similar to what Mathieu once posted:
>
> http://lkml.org/lkml/2007/7/11/13
>
> Not sure the arguments against (no time to read the thread again). But I'd
> recommend that you read it.
>
> -- Steve

Indeed, thanks for the link!  At quick glance, the concept looks identical, 
though the implementations are radically different.

-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: any cpu hotplug changes in 2.6.24-current-git?

2008-02-06 Thread Gregory Haskins

Pavel Machek wrote:
> Hi!
>
>>> Are there any recent changes in cpu hotplug? I have suspend (random)
>>> problems, nosmp seems to fix it, and last messages in the "it hangs"
>>> case are from cpu hotplug...
>>
>> Can you send along your cpuinfo?
>
> It happened on more than one machine, one cpuinfo is:

Ah, ok.  This one is a C2D, correct?  The only reason I asked is that 
someone was reporting an s2ram problem on P4s on some of that 
root-domain logic I submitted a little while ago (and was merged in 
.25), whereas C2D seemed fine.  That doesn't mean anything here, 
really.  The problem could still be my code, or it might be unrelated.  
I was just wondering if you also had a P4 on the troubled systems.


So is your problem on suspend or resume?  (or both?)  (I know you 
mentioned it was random problems, but I wasn't sure if you could qualify 
that further)


Any info you can provide will be appreciated. 


-Greg



> processor   : 1
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 14
> model name  : Genuine Intel(R) CPU   T2400  @ 1.83GHz
> stepping: 8
> cpu MHz : 1000.000
> cache size  : 2048 KB
> physical id : 0
> siblings: 2
> core id : 1
> cpu cores   : 2
> fdiv_bug: no
> hlt_bug : no
> f00f_bug: no
> coma_bug: no
> fpu : yes
> fpu_exception   : yes
> cpuid level : 10
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
> constant_tsc arch_perfmon bts pni monitor vmx est tm2 xtpr
> bogomips: 3657.58
> clflush size: 64
>
> Pavel







Re: any cpu hotplug changes in 2.6.24-current-git?

2008-02-06 Thread Gregory Haskins

Pavel Machek wrote:
> Hi!
>
> Are there any recent changes in cpu hotplug? I have suspend (random)
> problems, nosmp seems to fix it, and last messages in the "it hangs"
> case are from cpu hotplug...
>
> Pavel


Hi Pavel,
  Can you send along your cpuinfo?

-Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

