Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support
>>> On Tue, Feb 26, 2008 at 1:06 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
> On Tue 2008-02-26 08:03:43, Gregory Haskins wrote:
>> >>> On Mon, Feb 25, 2008 at 5:03 PM, in message
>> <[EMAIL PROTECTED]>, Pavel Machek <[EMAIL PROTECTED]> wrote:
>> >> +static inline void
>> >> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
>> >> *adaptive)
>> > ...
>> >> +#define prepare_adaptive_wait(lock, busy) {}
>> >
>> > This is evil. Use empty inline function instead (same for the other
>> > function, there you can maybe get away with it).
>>
>> I went to implement your suggested change and I remembered why I did it
>> this way: I wanted a macro so that the "struct adaptive_waiter" local
>> variable will fall away without an #ifdef in the main body of code.  So
>> I have left this logic alone for now.
>
> Hmm, but inline function will allow dead code elimination, too, no?

I was getting compile errors.  Might be operator-error ;)

> Anyway non-evil way to do it with macro is
>
> #define prepare_adaptive_wait(lock, busy) do {} while (0)
>
> ...that behaves properly in complex statements.

Ah, I was wondering why people use that.  Will do.  Thanks!

-Greg

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism
>>> On Mon, Feb 25, 2008 at 5:06 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
>
> I believe you have _way_ too many config variables. If this can be set
> at runtime, does it need a config option, too?

Generally speaking, I think that until this algorithm has an
adaptive-timeout in addition to an adaptive-spin/sleep, these .config-based
defaults are a good idea.  Sometimes setting these things at runtime is a
PITA when you are talking about embedded systems that might not have/want a
nice userspace sysctl-config infrastructure.  And changing the defaults in
the code is unattractive for some users.

I don't think it's a big deal either way, so if people hate the config
options, they should go.  But I thought I would throw this use-case out
there to ponder.

Regards,
-Greg
Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support
>>> On Mon, Feb 25, 2008 at 5:03 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
>> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
>
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).

I went to implement your suggested change and I remembered why I did it
this way: I wanted a macro so that the "struct adaptive_waiter" local
variable will fall away without an #ifdef in the main body of code.  So I
have left this logic alone for now.
Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing
>>> On Mon, Feb 25, 2008 at 5:57 PM, in message <[EMAIL PROTECTED]>,
Sven-Thorsten Dietrich <[EMAIL PROTECTED]> wrote:
>
> But Greg may need to enforce it on his git tree that he mails these from
> - are you referring to anything specific in this patch?

That's what I don't get.  I *did* checkpatch all of these before sending
them out (and I have for every release).  I am aware of two "tabs vs
spaces" warnings, but the rest checked clean.

Why do some people still see errors when I don't?  Is there a set of
switches I should supply to checkpatch to make it more aggressive or
something?

-Greg
Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes
>>> On Mon, Feb 25, 2008 at 5:09 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
> Hi!
>
>> From: Peter W. Morreale <[EMAIL PROTECTED]>
>>
>> This patch adds the adaptive spin lock busywait to rtmutexes.  It adds
>> a new tunable: rtmutex_timeout, which is the companion to the
>> rtlock_timeout tunable.
>>
>> Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
>
> Not signed off by you?

I wasn't sure if this was appropriate for me to do.  This is the first
time I was acting as "upstream" to someone.

If that is what I am expected to do, consider this an "ack" for your
remaining comments related to this.

>> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
>> index ac1cbad..864bf14 100644
>> --- a/kernel/Kconfig.preempt
>> +++ b/kernel/Kconfig.preempt
>> @@ -214,6 +214,43 @@ config RTLOCK_DELAY
>>  	  tunable at runtime via a sysctl.  A setting of 0 (zero) disables
>>  	  the adaptive algorithm entirely.
>>
>> +config ADAPTIVE_RTMUTEX
>> +	bool "Adaptive real-time mutexes"
>> +	default y
>> +	depends on ADAPTIVE_RTLOCK
>> +	help
>> +	  This option adds the adaptive rtlock spin/sleep algorithm to
>> +	  rtmutexes.  In rtlocks, a significant gain in throughput
>> +	  can be seen by allowing rtlocks to spin for a distinct
>> +	  amount of time prior to going to sleep for deadlock avoidance.
>> +
>> +	  Typically, mutexes are used when a critical section may need to
>> +	  sleep due to a blocking operation.  In the event the critical
>> +	  section does not need to sleep, an additional gain in throughput
>> +	  can be seen by avoiding the extra overhead of sleeping.
>
> Watch the whitespace. ... and do we need yet another config options?
>
>> +config RTMUTEX_DELAY
>> +	int "Default delay (in loops) for adaptive mutexes"
>> +	range 0 1000
>> +	depends on ADAPTIVE_RTMUTEX
>> +	default "3000"
>> +	help
>> +	  This allows you to specify the maximum delay a task will use
>> +	  to wait for a rt mutex before going to sleep.  Note that
>> +	  although the delay is implemented as a preemptable loop, tasks
>> +	  of like priority cannot preempt each other and this setting can
>> +	  result in increased latencies.
>> +
>> +	  The value is tunable at runtime via a sysctl.  A setting of 0
>> +	  (zero) disables the adaptive algorithm entirely.
>
> Ouch.

?

Is this a reference to the whitespace damage, or does the content need
addressing?

>> +#ifdef CONFIG_ADAPTIVE_RTMUTEX
>> +
>> +#define mutex_adaptive_wait adaptive_wait
>> +#define mutex_prepare_adaptive_wait prepare_adaptive_wait
>> +
>> +extern int rtmutex_timeout;
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
>> +	struct adaptive_waiter name = { .owner = NULL,            \
>> +	                                .timeout = rtmutex_timeout, }
>> +
>> +#else
>> +
>> +#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
>> +
>> +#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
>> +#define mutex_prepare_adaptive_wait(lock, busy) {}
>
> More evil macros. Macro does not behave like a function, make it
> inline function if you are replacing a function.

Ok

> Pavel
Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support
>>> On Mon, Feb 25, 2008 at 5:03 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
> Hi!
>
>> +/*
>> + * Adaptive-rtlocks will busywait when possible, and sleep only if
>> + * necessary.  Note that the busyloop looks racy, and it is, but we do
>> + * not care.  If we lose any races it simply means that we spin one more
>> + * time before seeing that we need to break out on the next iteration.
>> + *
>> + * We realize this is a relatively large function to inline, but note that
>> + * it is only instantiated 1 or 2 times max, and it makes a measurable
>> + * performance difference to avoid the call.
>> + *
>> + * Returns 1 if we should sleep
>> + *
>> + */
>> +static inline int
>> +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
>> +	      struct adaptive_waiter *adaptive)
>> +{
>> +	int sleep = 0;
>> +
>> +	for (;;) {
>> +		/*
>> +		 * If the task was re-awoken, break out completely so we can
>> +		 * reloop through the lock-acquisition code.
>> +		 */
>> +		if (!waiter->task)
>> +			break;
>> +
>> +		/*
>> +		 * We need to break if the owner changed so we can reloop
>> +		 * and safely acquire the owner-pointer again with the
>> +		 * wait_lock held.
>> +		 */
>> +		if (adaptive->owner != rt_mutex_owner(lock))
>> +			break;
>> +
>> +		/*
>> +		 * If we got here, presumably the lock ownership is still
>> +		 * current.  We will use it to our advantage to be able to
>> +		 * spin without disabling preemption...
>> +		 */
>> +
>> +		/*
>> +		 * .. sleep if the owner is not running..
>> +		 */
>> +		if (!adaptive->owner->se.on_rq) {
>> +			sleep = 1;
>> +			break;
>> +		}
>> +
>> +		/*
>> +		 * .. or is running on our own cpu (to prevent deadlock)
>> +		 */
>> +		if (task_cpu(adaptive->owner) == task_cpu(current)) {
>> +			sleep = 1;
>> +			break;
>> +		}
>> +
>> +		cpu_relax();
>> +	}
>> +
>> +	put_task_struct(adaptive->owner);
>> +
>> +	return sleep;
>> +}
>> +
>
> You want to inline this?

Yes.  As the comment indicates, there are 1-2 users tops, and it has a
significant impact on throughput (> 5%) to take the hit with a call.  I
don't think it's actually much code anyway...it's all comments.

>> +static inline void
>> +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter
>> *adaptive)
> ...
>> +#define prepare_adaptive_wait(lock, busy) {}
>
> This is evil. Use empty inline function instead (same for the other
> function, there you can maybe get away with it).

Ok.

> Pavel
Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep
>>> On Mon, Feb 25, 2008 at 4:54 PM, in message <[EMAIL PROTECTED]>,
Pavel Machek <[EMAIL PROTECTED]> wrote:
> Hi!
>
>> @@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
>>  	 * saved_state accordingly. If we did not get a real wakeup
>>  	 * then we return with the saved state.
>>  	 */
>> -	saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
>> +	saved_state = current->state;
>> +	smp_mb();
>>
>>  	for (;;) {
>>  		unsigned long saved_flags;
>
> Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed.  I think that is a
relic from back when I was debugging some other problems.  Let me wrap my
head around the implications of removing it, and either remove it or
document it appropriately.

> Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so.  We are substituting a write with a read, and word reads
are always atomic anyway IIUC (or is that only true on certain
architectures)?  Note that we are moving the atomic write to be done later
in the update_current() calls.

-Greg
[(RT RFC) PATCH v2 9/9] remove the extra call to try_to_take_lock
From: Peter W. Morreale <[EMAIL PROTECTED]>

Remove the redundant attempt to get the lock.  While it is true that the
exit path with this patch adds an unnecessary xchg (in the event the lock
is granted without further traversal in the loop), experimentation shows
that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---
 kernel/rtmutex.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b81bbef..266ae31 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -756,12 +756,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
-	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
-		spin_unlock_irqrestore(&lock->wait_lock, flags);
-		return;
-	}
-
 	BUG_ON(rt_mutex_owner(lock) == current);
 
 	/*
[(RT RFC) PATCH v2 7/9] adaptive mutexes
From: Peter W. Morreale <[EMAIL PROTECTED]>

This patch adds the adaptive spin lock busywait to rtmutexes.  It adds a
new tunable: rtmutex_timeout, which is the companion to the rtlock_timeout
tunable.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---
 kernel/Kconfig.preempt    |   37 ++++++++++++
 kernel/rtmutex.c          |   76 ++++++++++++------------
 kernel/rtmutex_adaptive.h |   32 +++++++++-
 kernel/sysctl.c           |   10 +++
 4 files changed, 119 insertions(+), 36 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index ac1cbad..864bf14 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -214,6 +214,43 @@ config RTLOCK_DELAY
 	  tunable at runtime via a sysctl.  A setting of 0 (zero) disables
 	  the adaptive algorithm entirely.
 
+config ADAPTIVE_RTMUTEX
+	bool "Adaptive real-time mutexes"
+	default y
+	depends on ADAPTIVE_RTLOCK
+	help
+	  This option adds the adaptive rtlock spin/sleep algorithm to
+	  rtmutexes.  In rtlocks, a significant gain in throughput
+	  can be seen by allowing rtlocks to spin for a distinct
+	  amount of time prior to going to sleep for deadlock avoidance.
+
+	  Typically, mutexes are used when a critical section may need to
+	  sleep due to a blocking operation.  In the event the critical
+	  section does not need to sleep, an additional gain in throughput
+	  can be seen by avoiding the extra overhead of sleeping.
+
+	  This option alters the rtmutex code to use an adaptive
+	  spin/sleep algorithm.  It will spin unless it determines it must
+	  sleep to avoid deadlock.  This offers a best of both worlds
+	  solution since we achieve both high-throughput and low-latency.
+
+	  If unsure, say Y
+
+config RTMUTEX_DELAY
+	int "Default delay (in loops) for adaptive mutexes"
+	range 0 1000
+	depends on ADAPTIVE_RTMUTEX
+	default "3000"
+	help
+	  This allows you to specify the maximum delay a task will use
+	  to wait for a rt mutex before going to sleep.  Note that
+	  although the delay is implemented as a preemptable loop, tasks
+	  of like priority cannot preempt each other and this setting can
+	  result in increased latencies.
+
+	  The value is tunable at runtime via a sysctl.  A setting of 0
+	  (zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
 	bool "Old-Style Big Kernel Lock"
 	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a16b13..ea593e0 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1;
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 * Do the wakeup before the ownership change to give any spinning
 	 * waiter grantees a headstart over the other threads that will
 	 * trigger once owner changes.
+	 *
+	 * We can skip the actual (expensive) wakeup if the
+	 * waiter is already running, but we have to be careful
+	 * of race conditions because they may be about to sleep.
+	 *
+	 * The waiter-side protocol has the following pattern:
+	 * 1: Set state != RUNNING
+	 * 2: Conditionally sleep if waiter->task != NULL;
+	 *
+	 * And the owner-side has the following:
+	 * A: Set waiter->task = NULL
+	 * B: Conditionally wake if the state != RUNNING
+	 *
+	 * As long as we ensure 1->2 order, and A->B order, we
+	 * will never miss a wakeup.
+	 *
+	 * Therefore, this barrier ensures that waiter->task = NULL
+	 * is visible before we test the pendowner->state.  The
+	 * corresponding barrier is in the sleep logic.
 	 */
-	if (!savestate)
-		wake_up_process(pendowner);
-	else {
-		/*
-		 * We can skip the actual (expensive) wakeup if the
-		 * waiter is already running, but we have to be careful
-		 * of race conditions because they may be about to sleep.
-		 *
-		 * The waiter-side protocol has the following pattern:
-		 * 1: Set state != RUNNING
-		 * 2: Conditionally sleep if waiter->task != NULL;
-		 *
-		 * And the owner-side has the following:
-		 * A: Set waiter->task = NULL
-		 * B: Conditionally wake if the state != RUNNING
-		 *
-		 * As long as we ensure 1->2 order, and A->B order, we
-		 * will never miss a wakeup.
[(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup
From: Peter W. Morreale <[EMAIL PROTECTED]>

In wakeup_next_waiter(), we take the pi_lock, and then find out whether we
have another waiter to add to the pending owner.  We can reduce contention
on the pi_lock for the pending owner if we first obtain the pointer to the
next waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]>
---
 kernel/rtmutex.c |   14 +++++++++-----
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ea593e0..b81bbef 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 {
 	struct rt_mutex_waiter *waiter;
 	struct task_struct *pendowner;
+	struct rt_mutex_waiter *next;
 
 	spin_lock(&current->pi_lock);
 
@@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 * waiter with higher priority than pending-owner->normal_prio
 	 * is blocked on the unboosted (pending) owner.
 	 */
+
+	if (rt_mutex_has_waiters(lock))
+		next = rt_mutex_top_waiter(lock);
+	else
+		next = NULL;
+
 	spin_lock(&pendowner->pi_lock);
 
 	WARN_ON(!pendowner->pi_blocked_on);
@@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 
 	pendowner->pi_blocked_on = NULL;
 
-	if (rt_mutex_has_waiters(lock)) {
-		struct rt_mutex_waiter *next;
-
-		next = rt_mutex_top_waiter(lock);
+	if (next)
 		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-	}
+
 	spin_unlock(&pendowner->pi_lock);
 }
[(RT RFC) PATCH v2 5/9] adaptive real-time lock support
There are pros and cons when deciding between the two basic forms of
locking primitives (spinning vs sleeping).  Without going into great detail
on either one, we note that spinlocks have the advantage of lower overhead
for short-hold locks.  However, they also have a con in that they create
indeterminate latencies, since preemption must traditionally be disabled
while the lock is held (to prevent deadlock).

We want to avoid non-deterministic critical sections in -rt.  Therefore,
when realtime is enabled, most contexts are converted to threads, and
likewise most spinlock_ts are converted to sleepable rt-mutex derived
locks.  This allows the holder of the lock to remain fully preemptible,
thus reducing a major source of latencies in the kernel.

However, converting what was once a true spinlock into a sleeping lock may
also decrease performance, since the locks will now sleep under contention.
Since the fundamental lock used to be a spinlock, it is highly likely that
it was used in a short-hold path and that release is imminent.  Therefore
sleeping only serves to cause context-thrashing.

Adaptive RT locks use a hybrid approach to solve the problem.  They spin
when possible, and sleep when necessary (to avoid deadlock, etc).  This
significantly improves many areas of the performance of the -rt kernel.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
Signed-off-by: Peter Morreale <[EMAIL PROTECTED]>
Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---
 kernel/Kconfig.preempt    |   20 ++++++
 kernel/rtmutex.c          |   30 +++++---
 kernel/rtmutex_adaptive.h |  138 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e493257..d2432fa 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -183,6 +183,26 @@ config RCU_TRACE
 	  Say Y/M here if you want to enable RCU tracing in-kernel/module.
 	  Say N if you are unsure.
 
+config ADAPTIVE_RTLOCK
+	bool "Adaptive real-time locks"
+	default y
+	depends on PREEMPT_RT && SMP
+	help
+	  PREEMPT_RT allows for greater determinism by transparently
+	  converting normal spinlock_ts into preemptible rtmutexes which
+	  sleep any waiters under contention.  However, in many cases the
+	  lock will be released in less time than it takes to context
+	  switch.  Therefore, the "sleep under contention" policy may also
+	  degrade throughput performance due to the extra context switches.
+
+	  This option alters the rtmutex derived spinlock_t replacement
+	  code to use an adaptive spin/sleep algorithm.  It will spin
+	  unless it determines it must sleep to avoid deadlock.  This
+	  offers a best of both worlds solution since we achieve both
+	  high-throughput and low-latency.
+
+	  If unsure, say Y.
+
 config SPINLOCK_BKL
 	bool "Old-Style Big Kernel Lock"
 	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index bf9e230..3802ef8 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -7,6 +7,8 @@
  * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]>
  * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt
  * Copyright (C) 2006 Esben Nielsen
+ * Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale,
+ *                                  and Gregory Haskins
  *
  * See Documentation/rt-mutex-design.txt for details.
  */
@@ -17,6 +19,7 @@
 #include 
 
 #include "rtmutex_common.h"
+#include "rtmutex_adaptive.h"
 
 #ifdef CONFIG_RTLOCK_LATERAL_STEAL
 int rtmutex_lateral_steal __read_mostly = 1;
@@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter waiter;
 	unsigned long saved_state, state, flags;
+	DECLARE_ADAPTIVE_WAITER(adaptive);
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;
@@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 			continue;
 		}
 
+		prepare_adaptive_wait(lock, &adaptive);
+
 		/*
 		 * Prevent schedule() to drop BKL, while waiting for
 		 * the lock ! We restore lock_depth when we come back.
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-
 		/*
-		 * The xchg() in update_current() is an implicit barrier
-		 * which we rely upon to ensure current->state is visible
-		 * before we test waiter.task.
-		 */
-		if (waiter.task)
-			schedule_rt_mutex(lock);
-		else
-			update_current(
[(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism
From: Sven Dietrich <[EMAIL PROTECTED]>

Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---
 kernel/Kconfig.preempt    |   11 +++++++++++
 kernel/rtmutex.c          |    4 ++++
 kernel/rtmutex_adaptive.h |   11 +++++++++--
 kernel/sysctl.c           |   12 ++++++++++++
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2432fa..ac1cbad 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK
 
 	  If unsure, say Y.
 
+config RTLOCK_DELAY
+	int "Default delay (in loops) for adaptive rtlocks"
+	range 0 10
+	depends on ADAPTIVE_RTLOCK
+	default "1"
+	help
+	  This allows you to specify the maximum attempts a task will spin
+	  attempting to acquire an rtlock before sleeping.  The value is
+	  tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+	  the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
 	bool "Old-Style Big Kernel Lock"
 	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3802ef8..4a16b13 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -25,6 +25,10 @@
 int rtmutex_lateral_steal __read_mostly = 1;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 862c088..60c6328 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
 	struct task_struct *owner;
+	int timeout;
 };
 
 /*
@@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 {
 	int sleep = 0;
 
-	for (;;) {
+	for (; adaptive->timeout > 0; adaptive->timeout--) {
 		/*
 		 * If the task was re-awoken, break out completely so we can
 		 * reloop through the lock-acquisition code.
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 		cpu_relax();
 	}
 
+	if (adaptive->timeout <= 0)
+		sleep = 1;
+
 	put_task_struct(adaptive->owner);
 
 	return sleep;
@@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
 	get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
-	struct adaptive_waiter name = { .owner = NULL, }
+	struct adaptive_waiter name = { .owner = NULL,           \
+	                                .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c24c53d..55189ea 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -56,6 +56,8 @@
 #include 
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "rtlock_timeout",
+		.data		= &rtlock_timeout,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_PROC_FS
 	{
 		.ctl_name	= CTL_UNNUMBERED,
[(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep
The current logic makes rather coarse adjustments to current->state, since
it is planning on sleeping anyway.  We want to eventually move to an
adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the
adjustments to bracket the schedule().  This should yield correct behavior
with or without the adaptive features that are added later in the series.
We add it here as a separate patch for greater review clarity on smaller
changes.

Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]>
---
 kernel/rtmutex.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index cd39c26..ef52db6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -681,6 +681,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
 		slowfn(lock);
 }
 
+static inline void
+update_current(unsigned long new_state, unsigned long *saved_state)
+{
+	unsigned long state = xchg(&current->state, new_state);
+	if (unlikely(state == TASK_RUNNING))
+		*saved_state = TASK_RUNNING;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 * saved_state accordingly. If we did not get a real wakeup
 	 * then we return with the saved state.
 	 */
-	saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+	saved_state = current->state;
+	smp_mb();
 
 	for (;;) {
 		unsigned long saved_flags;
@@ -752,14 +761,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		schedule_rt_mutex(lock);
+		update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+		if (waiter.task)
+			schedule_rt_mutex(lock);
+		else
+			update_current(TASK_RUNNING_MUTEX, &saved_state);
 
 		spin_lock_irqsave(&lock->wait_lock, flags);
 		current->flags |= saved_flags;
 		current->lock_depth = saved_lock_depth;
-		state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-		if (unlikely(state == TASK_RUNNING))
-			saved_state = TASK_RUNNING;
 	}
 
 	state = xchg(&current->state, saved_state);
[(RT RFC) PATCH v2 4/9] optimize rt lock wakeup
It is redundant to wake the grantee task if it is already running. Credit goes to Peter for the general idea. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> Signed-off-by: Peter Morreale <[EMAIL PROTECTED]> --- kernel/rtmutex.c | 45 - 1 files changed, 40 insertions(+), 5 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index ef52db6..bf9e230 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -531,6 +531,41 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) pendowner = waiter->task; waiter->task = NULL; + /* +* Do the wakeup before the ownership change to give any spinning +* waiter grantees a headstart over the other threads that will +* trigger once owner changes. +*/ + if (!savestate) + wake_up_process(pendowner); + else { + /* +* We can skip the actual (expensive) wakeup if the +* waiter is already running, but we have to be careful +* of race conditions because they may be about to sleep. +* +* The waiter-side protocol has the following pattern: +* 1: Set state != RUNNING +* 2: Conditionally sleep if waiter->task != NULL; +* +* And the owner-side has the following: +* A: Set waiter->task = NULL +* B: Conditionally wake if the state != RUNNING +* +* As long as we ensure 1->2 order, and A->B order, we +* will never miss a wakeup. +* +* Therefore, this barrier ensures that waiter->task = NULL +* is visible before we test the pendowner->state. The +* corresponding barrier is in the sleep logic. 
+*/ + smp_mb(); + + if ((pendowner->state != TASK_RUNNING) + && (pendowner->state != TASK_RUNNING_MUTEX)) + wake_up_process_mutex(pendowner); + } + rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING); spin_unlock(&current->pi_lock); @@ -557,11 +592,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) plist_add(&next->pi_list_entry, &pendowner->pi_waiters); } spin_unlock(&pendowner->pi_lock); - - if (savestate) - wake_up_process_mutex(pendowner); - else - wake_up_process(pendowner); } /* @@ -762,6 +792,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) debug_rt_mutex_print_deadlock(&waiter); update_current(TASK_UNINTERRUPTIBLE, &saved_state); + /* +* The xchg() in update_current() is an implicit barrier +* which we rely upon to ensure current->state is visible +* before we test waiter.task. +*/ if (waiter.task) schedule_rt_mutex(lock); else -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
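The owner-side "A then barrier then B" pattern in the comment above can be modeled in a few lines. This is a single-threaded userspace sketch (not the kernel code) showing only the decision logic: clear the waiter's claim first, fence, then issue the expensive wakeup only if the pending owner is not already running. The constants and the `wakeups` counter are illustrative stand-ins for the kernel's task states and wake_up_process_mutex().

```c
#include <assert.h>
#include <stdatomic.h>

#define TASK_RUNNING          0
#define TASK_UNINTERRUPTIBLE  2

static int wakeups;              /* counts (expensive) wakeup calls      */
static _Atomic int pend_state;   /* the pending owner's task state       */
static _Atomic int waiter_present = 1;

/* Model of the owner-side grant path: step A, full barrier, step B. */
static void grant_lock(void)
{
    atomic_store(&waiter_present, 0);             /* A: waiter->task = NULL  */
    atomic_thread_fence(memory_order_seq_cst);    /* A must be visible first */
    if (atomic_load(&pend_state) != TASK_RUNNING) /* B: conditional wakeup   */
        wakeups++;  /* stand-in for wake_up_process_mutex(pendowner) */
}
```

The fence between A and B is what pairs with the waiter's barrier between steps 1 and 2; without it, the owner could observe a stale RUNNING state, skip the wakeup, and leave the waiter asleep.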
[(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing
From: Sven-Thorsten Dietrich <[EMAIL PROTECTED]> Add /proc/sys/kernel/lateral_steal, to allow switching on and off equal-priority mutex stealing between threads. Signed-off-by: Sven-Thorsten Dietrich <[EMAIL PROTECTED]> --- kernel/rtmutex.c |7 ++- kernel/sysctl.c | 14 ++ 2 files changed, 20 insertions(+), 1 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 6624c66..cd39c26 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -18,6 +18,10 @@ #include "rtmutex_common.h" +#ifdef CONFIG_RTLOCK_LATERAL_STEAL +int rtmutex_lateral_steal __read_mostly = 1; +#endif + /* * lock->owner state tracking: * @@ -321,7 +325,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair) if (current->prio > pendowner->prio) return 0; - if (!unfair && (current->prio == pendowner->prio)) + if (unlikely(current->prio == pendowner->prio) && + !(unfair && rtmutex_lateral_steal)) #endif return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c913d48..c24c53d 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -175,6 +175,10 @@ extern struct ctl_table inotify_table[]; int sysctl_legacy_va_layout; #endif +#ifdef CONFIG_RTLOCK_LATERAL_STEAL +extern int rtmutex_lateral_steal; +#endif + extern int prove_locking; extern int lock_stat; @@ -836,6 +840,16 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_RTLOCK_LATERAL_STEAL + { + .ctl_name = CTL_UNNUMBERED, + .procname = "rtmutex_lateral_steal", + .data = &rtmutex_lateral_steal, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif #ifdef CONFIG_PROC_FS { .ctl_name = CTL_UNNUMBERED, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[(RT RFC) PATCH v2 1/9] allow rt-mutex lock-stealing to include lateral priority
The current logic only allows lock stealing to occur if the current task is of higher priority than the pending owner. We can gain significant throughput improvements (200%+) by allowing the lock-stealing code to include tasks of equal priority. The theory is that the system will make faster progress by allowing the task already on the CPU to take the lock rather than waiting for the system to wake up a different task. This does add a degree of unfairness, yes. But also note that the users of these locks under non-rt environments have already been using unfair raw spinlocks anyway so the tradeoff is probably worth it. The way I like to think of this is that higher priority tasks should clearly preempt, and lower priority tasks should clearly block. However, if tasks have an identical priority value, then we can think of the scheduler decisions as the tie-breaking parameter. (e.g. tasks that the scheduler picked to run first have a logically higher priority among tasks of the same prio). This helps to keep the system "primed" with tasks doing useful work, and the end result is higher throughput. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt | 10 ++ kernel/rtmutex.c | 31 +++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 41a0d88..e493257 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -196,3 +196,13 @@ config SPINLOCK_BKL Say Y here if you are building a kernel for a desktop system. Say N if you are unsure. +config RTLOCK_LATERAL_STEAL +bool "Allow equal-priority rtlock stealing" +default y +depends on PREEMPT_RT +help + This option alters the rtlock lock-stealing logic to allow + equal priority tasks to preempt a pending owner in addition + to higher priority tasks. This allows for a significant + boost in throughput under certain circumstances at the expense + of strict FIFO lock access. 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index a2b00cc..6624c66 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -313,12 +313,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, return ret; } +static inline int lock_is_stealable(struct task_struct *pendowner, int unfair) +{ +#ifndef CONFIG_RTLOCK_LATERAL_STEAL + if (current->prio >= pendowner->prio) +#else + if (current->prio > pendowner->prio) + return 0; + + if (!unfair && (current->prio == pendowner->prio)) +#endif + return 0; + + return 1; +} + /* * Optimization: check if we can steal the lock from the * assigned pending owner [which might not have taken the * lock yet]: */ -static inline int try_to_steal_lock(struct rt_mutex *lock) +static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair) { struct task_struct *pendowner = rt_mutex_owner(lock); struct rt_mutex_waiter *next; @@ -330,7 +345,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock) return 1; spin_lock(&pendowner->pi_lock); - if (current->prio >= pendowner->prio) { + if (!lock_is_stealable(pendowner, unfair)) { spin_unlock(&pendowner->pi_lock); return 0; } @@ -383,7 +398,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock) * * Must be called with lock->wait_lock held. */ -static int try_to_take_rt_mutex(struct rt_mutex *lock) +static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair) { /* * We have to be careful here if the atomic speedups are @@ -406,7 +421,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock) */ mark_rt_mutex_waiters(lock); - if (rt_mutex_owner(lock) && !try_to_steal_lock(lock)) + if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair)) return 0; /* We got the lock. 
*/ @@ -707,7 +722,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) int saved_lock_depth = current->lock_depth; /* Try to acquire the lock */ - if (try_to_take_rt_mutex(lock)) + if (try_to_take_rt_mutex(lock, 1)) break; /* * waiter.task is NULL the first time we come here and @@ -947,7 +962,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, init_lists(lock); /* Try to acquire the lock again: */ - if (try_to_take_rt_mutex(lock)) { + if (try_to_take_rt_mutex(lock, 0)) { spin_unlock_irqrestore(&lock->wait_lock, flags); return 0; } @@ -970,7 +985,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, unsigned long saved_flags;
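The steal rule introduced by lock_is_stealable() above can be exercised in isolation. This is a userspace sketch of the same decision table (with CONFIG_RTLOCK_LATERAL_STEAL enabled), taking the priorities as plain parameters instead of reading current and the pending owner; as in the kernel, a lower numeric prio means a higher effective priority:

```c
#include <assert.h>

/* Userspace model of the patch's lock_is_stealable() rule.
 * Returns 1 when the would-be stealer may take the lock from
 * the pending owner, 0 otherwise. */
static int lock_is_stealable(int current_prio, int pendowner_prio, int unfair)
{
    if (current_prio > pendowner_prio)
        return 0;           /* strictly lower priority: never steal */
    if (!unfair && current_prio == pendowner_prio)
        return 0;           /* equal priority: steal only on the unfair
                             * (spin_lock-style) path */
    return 1;
}
```

Note how the two call sites in the diff encode the policy: the rt_spin_lock slow path passes unfair=1 (lateral stealing allowed), while the sleeping-mutex slow path passes unfair=0 (strict priority order preserved).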
[(RT RFC) PATCH v2 0/9] adaptive real-time locks
You can download this series here: ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v2.tar.bz2 Changes since v1: *) Rebased from 24-rt1 to 24.2-rt2 *) Dropped controversial (and likely unnecessary) printk patch *) Dropped (internally) controversial PREEMPT_SPINLOCK_WAITERS config options *) Incorporated review feedback for comment/config cleanup from Pavel/PeterZ *) Moved lateral-steal to front of queue *) Fixed compilation issue with !defined(LATERAL_STEAL) *) Moved spinlock rework into a separate series: ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2 Todo: *) Convert loop based timeouts to use nanoseconds *) Tie into lockstat infrastructure. *) Long-term: research adaptive-timeout algorithms so a fixed/one-size- -fits-all value is not necessary. Adaptive real-time locks The Real Time patches to the Linux kernel convert the architecture specific SMP-synchronization primitives commonly referred to as "spinlocks" to an "RT mutex" implementation that supports a priority inheritance protocol, and priority-ordered wait queues. The RT mutex implementation allows tasks that would otherwise busy-wait for a contended lock to be preempted by higher priority tasks without compromising the integrity of critical sections protected by the lock. The unintended side-effect is that the -rt kernel suffers from significant degradation of IO throughput (disk and net) due to the extra overhead associated with managing pi-lists and context switching. This has been generally accepted as a price to pay for low-latency preemption. Our research indicates that it doesn't necessarily have to be this way. This patch set introduces an adaptive technology that retains both the priority inheritance protocol as well as the preemptive nature of spinlocks and mutexes and adds a 300+% throughput increase to the Linux real-time kernel. It applies to 2.6.24-rt1. These performance increases apply to disk IO as well as netperf UDP benchmarks, without compromising RT preemption latency. 
For more complex applications, overall the I/O throughput seems to approach the throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is shipped by most distros. Essentially, the RT Mutex has been modified to busy-wait under contention for a limited (and configurable) time. This works because most locks are typically held for very short time spans. Too often, by the time a task goes to sleep on a mutex, the mutex is already being released on another CPU. The effect (on SMP) is that by polling a mutex for a limited time we reduce context switch overhead by up to 90%, and therefore eliminate CPU cycles as well as massive hot-spots in the scheduler / other bottlenecks in the Kernel - even though we busy-wait (using CPU cycles) to poll the lock. We have put together some data from different types of benchmarks for this patch series, which you can find here: ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock 2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock (2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core, dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. Some tests show a marked improvement (for instance, ~450% more throughput for dbench, and ~500% faster for hackbench), whereas for others (make -j 128) the results were not as profound, but they were still net-positive. In all cases we have also verified that deterministic latency is not impacted by using cyclictest. This patch series depends on some re-work on the raw_spinlock infrastructure, including Nick Piggin's x86-ticket-locks. We found that the increased pressure on the lock->wait_locks could cause rare but serious latency spikes that are fixed by a fifo raw_spinlock_t implementation. Nick was gracious enough to allow us to re-use his work (which is already accepted in 2.6.25). 
Note that we also have a C version of his protocol available if other architectures need fifo-lock support as well, which we will gladly post upon request. You can find this re-work as a separate series here: ftp://ftp.novell.com/dev/ghaskins/ticket-locks.tar.bz2 Special thanks go to many people who were instrumental to this project, including: *) the -rt team here at Novell for research, development, and testing. *) Nick Piggin for his invaluable consultation/feedback and use of his x86-ticket-locks. *) The reviewers/testers at Suse, Montavista, and Bill Huey for their time and feedback on the early versions of these patches. As always, comments/feedback/bug-fixes are welcome. Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
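The "busy-wait under contention for a limited time" idea described in this cover letter can be illustrated with an ordinary pthread mutex. This is a userspace analogue, not the kernel implementation: poll the lock for a bounded number of iterations on the theory that the holder will release it soon, and fall back to a blocking (context-switching) acquisition only when the spin budget runs out. The ADAPTIVE_SPIN_BUDGET value is an arbitrary placeholder for the tunable the series adds.

```c
#include <assert.h>
#include <pthread.h>

#define ADAPTIVE_SPIN_BUDGET 10000  /* illustrative, would be tunable */

/* Adaptive acquisition: spin via trylock first, sleep as a last resort. */
static void adaptive_lock(pthread_mutex_t *lock)
{
    for (int i = 0; i < ADAPTIVE_SPIN_BUDGET; i++)
        if (pthread_mutex_trylock(lock) == 0)
            return;             /* got it while polling: no context switch */
    pthread_mutex_lock(lock);   /* budget exhausted: block like a mutex */
}
```

Under light contention nearly every acquisition completes in the polling loop, which is where the context-switch savings quoted above come from; under heavy or long-hold contention the behavior degrades gracefully to that of a plain sleeping mutex.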
[(RT RFC) PATCH v2 5/9] adaptive real-time lock support
There are pros and cons when deciding between the two basic forms of locking primitives (spinning vs sleeping). Without going into great detail on either one, we note that spinlocks have the advantage of lower overhead for short hold locks. However, they also have a con in that they create indeterminate latencies since preemption must traditionally be disabled while the lock is held (to prevent deadlock). We want to avoid non-deterministic critical sections in -rt. Therefore, when realtime is enabled, most contexts are converted to threads, and likewise most spinlock_ts are converted to sleepable rt-mutex derived locks. This allows the holder of the lock to remain fully preemptible, thus reducing a major source of latencies in the kernel. However, converting what was once a true spinlock into a sleeping lock may also decrease performance since the locks will now sleep under contention. Since the fundamental lock used to be a spinlock, it is highly likely that it was used in a short-hold path and that release is imminent. Therefore sleeping only serves to cause context-thrashing. Adaptive RT locks use a hybrid approach to solve the problem. They spin when possible, and sleep when necessary (to avoid deadlock, etc). This significantly improves many areas of the performance of the -rt kernel. Signed-off-by: Gregory Haskins [EMAIL PROTECTED] Signed-off-by: Peter Morreale [EMAIL PROTECTED] Signed-off-by: Sven Dietrich [EMAIL PROTECTED] --- kernel/Kconfig.preempt| 20 +++ kernel/rtmutex.c | 30 +++--- kernel/rtmutex_adaptive.h | 138 + 3 files changed, 178 insertions(+), 10 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index e493257..d2432fa 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -183,6 +183,26 @@ config RCU_TRACE Say Y/M here if you want to enable RCU tracing in-kernel/module. Say N if you are unsure. 
+config ADAPTIVE_RTLOCK +bool Adaptive real-time locks +default y +depends on PREEMPT_RT SMP +help + PREEMPT_RT allows for greater determinism by transparently + converting normal spinlock_ts into preemptible rtmutexes which + sleep any waiters under contention. However, in many cases the + lock will be released in less time than it takes to context + switch. Therefore, the sleep under contention policy may also + degrade throughput performance due to the extra context switches. + + This option alters the rtmutex derived spinlock_t replacement + code to use an adaptive spin/sleep algorithm. It will spin + unless it determines it must sleep to avoid deadlock. This + offers a best of both worlds solution since we achieve both + high-throughput and low-latency. + + If unsure, say Y. + config SPINLOCK_BKL bool Old-Style Big Kernel Lock depends on (PREEMPT || SMP) !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index bf9e230..3802ef8 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -7,6 +7,8 @@ * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner [EMAIL PROTECTED] * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen + * Copyright (C) 2008 Novell, Inc., Sven Dietrich, Peter Morreale, + * and Gregory Haskins * * See Documentation/rt-mutex-design.txt for details. */ @@ -17,6 +19,7 @@ #include linux/hardirq.h #include rtmutex_common.h +#include rtmutex_adaptive.h #ifdef CONFIG_RTLOCK_LATERAL_STEAL int rtmutex_lateral_steal __read_mostly = 1; @@ -734,6 +737,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) { struct rt_mutex_waiter waiter; unsigned long saved_state, state, flags; + DECLARE_ADAPTIVE_WAITER(adaptive); debug_rt_mutex_init_waiter(waiter); waiter.task = NULL; @@ -780,6 +784,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) continue; } + prepare_adaptive_wait(lock, adaptive); + /* * Prevent schedule() to drop BKL, while waiting for * the lock ! We restore lock_depth when we come back. 
@@ -791,16 +797,20 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-		/*
-		 * The xchg() in update_current() is an implicit barrier
-		 * which we rely upon to ensure current->state is visible
-		 * before we test waiter.task.
-		 */
-		if (waiter.task)
-			schedule_rt_mutex(lock);
-		else
-			update_current(TASK_RUNNING_MUTEX, &saved_state);
+		/* adaptive_wait() returns 1
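The spin-when-possible/sleep-when-necessary policy described at the top of this changelog boils down to a small per-iteration decision. A user-space sketch of those checks (illustrative names and structure, not the kernel implementation in rtmutex_adaptive.h):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the adaptive waiter's per-iteration decision: keep
 * spinning only while the sampled owner is unchanged and is actively
 * running on some *other* CPU; otherwise either reloop through the
 * lock-acquisition code or fall back to the classic sleeping
 * rt-mutex path.
 */
enum wait_action { WAIT_RELOOP, WAIT_SPIN, WAIT_SLEEP };

static enum wait_action
adaptive_step(bool woken, bool owner_changed, bool owner_running,
	      bool owner_on_our_cpu)
{
	if (woken || owner_changed)
		return WAIT_RELOOP;   /* retry the acquisition loop */
	if (!owner_running)
		return WAIT_SLEEP;    /* owner sleeps: spinning is futile */
	if (owner_on_our_cpu)
		return WAIT_SLEEP;    /* we preempted the owner: deadlock risk */
	return WAIT_SPIN;             /* short hold: release is likely imminent */
}
```

The "owner running on our own CPU" case is the deadlock-avoidance bail-out: spinning there would wait forever for a task we ourselves preempted.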
[(RT RFC) PATCH v2 6/9] add a loop counter based timeout mechanism
From: Sven Dietrich [EMAIL PROTECTED] Signed-off-by: Sven Dietrich [EMAIL PROTECTED] --- kernel/Kconfig.preempt| 11 +++ kernel/rtmutex.c |4 kernel/rtmutex_adaptive.h | 11 +-- kernel/sysctl.c | 12 4 files changed, 36 insertions(+), 2 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index d2432fa..ac1cbad 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -203,6 +203,17 @@ config ADAPTIVE_RTLOCK If unsure, say Y. +config RTLOCK_DELAY + int Default delay (in loops) for adaptive rtlocks + range 0 10 + depends on ADAPTIVE_RTLOCK + default 1 +help + This allows you to specify the maximum attempts a task will spin +attempting to acquire an rtlock before sleeping. The value is +tunable at runtime via a sysctl. A setting of 0 (zero) disables +the adaptive algorithm entirely. + config SPINLOCK_BKL bool Old-Style Big Kernel Lock depends on (PREEMPT || SMP) !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 3802ef8..4a16b13 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -25,6 +25,10 @@ int rtmutex_lateral_steal __read_mostly = 1; #endif +#ifdef CONFIG_ADAPTIVE_RTLOCK +int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; +#endif + /* * lock-owner state tracking: * diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h index 862c088..60c6328 100644 --- a/kernel/rtmutex_adaptive.h +++ b/kernel/rtmutex_adaptive.h @@ -43,6 +43,7 @@ #ifdef CONFIG_ADAPTIVE_RTLOCK struct adaptive_waiter { struct task_struct *owner; + int timeout; }; /* @@ -64,7 +65,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, { int sleep = 0; - for (;;) { + for (; adaptive-timeout 0; adaptive-timeout--) { /* * If the task was re-awoken, break out completely so we can * reloop through the lock-acquisition code. 
@@ -105,6 +106,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, cpu_relax(); } + if (adaptive-timeout = 0) + sleep = 1; + put_task_struct(adaptive-owner); return sleep; @@ -122,8 +126,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive) get_task_struct(adaptive-owner); } +extern int rtlock_timeout; + #define DECLARE_ADAPTIVE_WAITER(name) \ - struct adaptive_waiter name = { .owner = NULL, } + struct adaptive_waiter name = { .owner = NULL, \ + .timeout = rtlock_timeout, } #else diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c24c53d..55189ea 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -56,6 +56,8 @@ #include asm/stacktrace.h #endif +#include rtmutex_adaptive.h + static int deprecated_sysctl_warning(struct __sysctl_args *args); #if defined(CONFIG_SYSCTL) @@ -850,6 +852,16 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif +#ifdef CONFIG_ADAPTIVE_RTLOCK + { + .ctl_name = CTL_UNNUMBERED, + .procname = rtlock_timeout, + .data = rtlock_timeout, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #ifdef CONFIG_PROC_FS { .ctl_name = CTL_UNNUMBERED, -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
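The loop-counter timeout this patch adds can be modeled in user space as a bounded busy-wait that gives up and sleeps when the budget runs out. A sketch (the predicate and helper below are stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the bounded busy-wait: the kernel version decrements
 * adaptive->timeout once per iteration of adaptive_wait() and sleeps
 * when it reaches zero, so a timeout of 0 disables spinning
 * entirely.  The predicate stands in for the "woken up / owner
 * changed / owner no longer running" checks.
 */
static bool
bounded_spin(int timeout, bool (*done)(void *ctx), void *ctx)
{
	int sleep = 0;

	for (; timeout > 0; timeout--) {
		if (done(ctx))
			break;        /* progress: reloop without sleeping */
		/* cpu_relax() would go here in the kernel */
	}
	if (timeout <= 0)
		sleep = 1;            /* budget exhausted: go to sleep */
	return sleep;
}

/* Stand-in predicate: the "lock" frees up after N polls. */
static bool free_after(void *ctx)
{
	int *n = ctx;
	return --(*n) <= 0;
}
```

This mirrors why the sysctl default matters: too small and you sleep as often as the non-adaptive code, too large and a worst-case waiter burns that many loops before blocking.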
[(RT RFC) PATCH v2 7/9] adaptive mutexes
From: Peter W.Morreale [EMAIL PROTECTED] This patch adds the adaptive spin lock busywait to rtmutexes. It adds a new tunable: rtmutex_timeout, which is the companion to the rtlock_timeout tunable. Signed-off-by: Peter W. Morreale [EMAIL PROTECTED] --- kernel/Kconfig.preempt| 37 ++ kernel/rtmutex.c | 76 + kernel/rtmutex_adaptive.h | 32 ++- kernel/sysctl.c | 10 ++ 4 files changed, 119 insertions(+), 36 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index ac1cbad..864bf14 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -214,6 +214,43 @@ config RTLOCK_DELAY tunable at runtime via a sysctl. A setting of 0 (zero) disables the adaptive algorithm entirely. +config ADAPTIVE_RTMUTEX +bool Adaptive real-time mutexes +default y +depends on ADAPTIVE_RTLOCK +help + This option adds the adaptive rtlock spin/sleep algorithm to + rtmutexes. In rtlocks, a significant gain in throughput + can be seen by allowing rtlocks to spin for a distinct + amount of time prior to going to sleep for deadlock avoidence. + + Typically, mutexes are used when a critical section may need to + sleep due to a blocking operation. In the event the critical +section does not need to sleep, an additional gain in throughput +can be seen by avoiding the extra overhead of sleeping. + + This option alters the rtmutex code to use an adaptive + spin/sleep algorithm. It will spin unless it determines it must + sleep to avoid deadlock. This offers a best of both worlds + solution since we achieve both high-throughput and low-latency. + + If unsure, say Y + +config RTMUTEX_DELAY +int Default delay (in loops) for adaptive mutexes +range 0 1000 +depends on ADAPTIVE_RTMUTEX +default 3000 +help + This allows you to specify the maximum delay a task will use +to wait for a rt mutex before going to sleep. 
Note that that +although the delay is implemented as a preemptable loop, tasks +of like priority cannot preempt each other and this setting can +result in increased latencies. + + The value is tunable at runtime via a sysctl. A setting of 0 +(zero) disables the adaptive algorithm entirely. + config SPINLOCK_BKL bool Old-Style Big Kernel Lock depends on (PREEMPT || SMP) !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 4a16b13..ea593e0 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -29,6 +29,10 @@ int rtmutex_lateral_steal __read_mostly = 1; int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; #endif +#ifdef CONFIG_ADAPTIVE_RTMUTEX +int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY; +#endif + /* * lock-owner state tracking: * @@ -542,34 +546,33 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) * Do the wakeup before the ownership change to give any spinning * waiter grantees a headstart over the other threads that will * trigger once owner changes. +* +* We can skip the actual (expensive) wakeup if the +* waiter is already running, but we have to be careful +* of race conditions because they may be about to sleep. +* +* The waiter-side protocol has the following pattern: +* 1: Set state != RUNNING +* 2: Conditionally sleep if waiter-task != NULL; +* +* And the owner-side has the following: +* A: Set waiter-task = NULL +* B: Conditionally wake if the state != RUNNING +* +* As long as we ensure 1-2 order, and A-B order, we +* will never miss a wakeup. +* +* Therefore, this barrier ensures that waiter-task = NULL +* is visible before we test the pendowner-state. The +* corresponding barrier is in the sleep logic. */ - if (!savestate) - wake_up_process(pendowner); - else { - /* -* We can skip the actual (expensive) wakeup if the -* waiter is already running, but we have to be careful -* of race conditions because they may be about to sleep. 
-* -* The waiter-side protocol has the following pattern: -* 1: Set state != RUNNING -* 2: Conditionally sleep if waiter-task != NULL; -* -* And the owner-side has the following: -* A: Set waiter-task = NULL -* B: Conditionally wake if the state != RUNNING -* -* As long as we ensure 1-2 order, and A-B order, we -* will never miss a wakeup. -* -
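The 1-before-2 / A-before-B claim in the comment above can be checked exhaustively in a few lines of user-space C. This is a model of the protocol, not kernel code: it enumerates every interleaving that preserves both program orders and verifies no execution leaves the waiter asleep with no wakeup issued.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Waiter: 1) state = SLEEPING, then 2) sleep iff task != NULL.
 * Owner:  A) task = NULL,      then B) wake  iff state != RUNNING.
 * Enumerate all interleavings preserving 1<2 and A<B and check that
 * a sleeping waiter always has a wakeup issued (no lost wakeup).
 */
struct hs {
	bool running;    /* waiter state == RUNNING      */
	bool task;       /* waiter->task != NULL         */
	bool slept;      /* waiter committed to sleeping */
	bool woken;      /* owner issued a wakeup        */
};

static void exec_step(struct hs *s, char step)
{
	switch (step) {
	case '1': s->running = false; break;
	case '2': if (s->task) s->slept = true; break;
	case 'A': s->task = false; break;
	case 'B': if (!s->running) s->woken = true; break;
	}
}

/* Returns true iff no legal interleaving loses the wakeup. */
static bool handshake_safe(void)
{
	static const char *orders[] = {
		"12AB", "1A2B", "1AB2", "A12B", "A1B2", "AB12",
	};

	for (int i = 0; i < 6; i++) {
		struct hs s = { true, true, false, false };
		for (int j = 0; j < 4; j++)
			exec_step(&s, orders[i][j]);
		if (s.slept && !s.woken)
			return false;  /* asleep, nobody will wake us */
	}
	return true;
}
```

Reversing the waiter's order (checking task before publishing the state change) admits the lost-wakeup interleaving 2-A-B-1, which is exactly why the barrier between those steps matters.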
[(RT RFC) PATCH v2 8/9] adjust pi_lock usage in wakeup
From: Peter W.Morreale [EMAIL PROTECTED] In wakeup_next_waiter(), we take the pi_lock, and then find out whether we have another waiter to add to the pending owner. We can reduce contention on the pi_lock for the pending owner if we first obtain the pointer to the next waiter outside of the pi_lock. This patch adds a measureable increase in throughput. Signed-off-by: Peter W. Morreale [EMAIL PROTECTED] --- kernel/rtmutex.c | 14 +- 1 files changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index ea593e0..b81bbef 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -526,6 +526,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) { struct rt_mutex_waiter *waiter; struct task_struct *pendowner; + struct rt_mutex_waiter *next; spin_lock(current-pi_lock); @@ -587,6 +588,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) * waiter with higher priority than pending-owner-normal_prio * is blocked on the unboosted (pending) owner. */ + + if (rt_mutex_has_waiters(lock)) + next = rt_mutex_top_waiter(lock); + else + next = NULL; + spin_lock(pendowner-pi_lock); WARN_ON(!pendowner-pi_blocked_on); @@ -595,12 +602,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) pendowner-pi_blocked_on = NULL; - if (rt_mutex_has_waiters(lock)) { - struct rt_mutex_waiter *next; - - next = rt_mutex_top_waiter(lock); + if (next) plist_add(next-pi_list_entry, pendowner-pi_waiters); - } + spin_unlock(pendowner-pi_lock); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[(RT RFC) PATCH v2 9/9] remove the extra call to try_to_take_lock
From: Peter W. Morreale [EMAIL PROTECTED]

Remove the redundant attempt to get the lock.  While it is true that the exit path with this patch adds an un-necessary xchg (in the event the lock is granted without further traversal in the loop) experimentation shows that we almost never encounter this situation.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---
 kernel/rtmutex.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b81bbef..266ae31 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -756,12 +756,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
-	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
-		spin_unlock_irqrestore(&lock->wait_lock, flags);
-		return;
-	}
-
 	BUG_ON(rt_mutex_owner(lock) == current);
 
 	/*
Re: [(RT RFC) PATCH v2 3/9] rearrange rt_spin_lock sleep
On Mon, Feb 25, 2008 at 4:54 PM, in message [EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote:

Hi!

@@ -720,7 +728,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 * saved_state accordingly. If we did not get a real wakeup
 	 * then we return with the saved state.
 	 */
-	saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
+	saved_state = current->state;
+	smp_mb();
 
 	for (;;) {
 		unsigned long saved_flags;

Please document what the barrier is good for.

Yeah, I think you are right that this isn't needed.  I think that is a relic from back when I was debugging some other problems.  Let me wrap my head around the implications of removing it, and either remove it or document appropriately.

Plus, you are replacing atomic operation with nonatomic; is that ok?

Yeah, I think so.  We are substituting a write with a read, and word reads are always atomic anyway IIUC (or is that only true on certain architectures)?  Note that we are moving the atomic-write to be done later in the update_current() calls.

-Greg
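For what it's worth, the substitution under discussion here, an atomic xchg() replaced by a plain word read followed by a full barrier, looks like this in C11 user-space terms. This is an analogy to the kernel primitives, not the kernel code itself:

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared "task state" word, as in current->state. */
static atomic_long state;

/* Old form: an atomic read-modify-write samples and updates state. */
static long save_state_xchg(long new_state)
{
	return atomic_exchange(&state, new_state);
}

/*
 * New form: a plain (relaxed) word read plus a full fence, the C11
 * analogue of "saved_state = current->state; smp_mb();".  The write
 * side is deferred to a later update_current() call.
 */
static long save_state_read(void)
{
	long saved = atomic_load_explicit(&state, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() analogue */
	return saved;
}
```

The read form is cheaper on the fast path, but as the thread notes, it only works if an aligned word load is a single access on the architecture and the ordering the xchg() used to provide is supplied by the explicit barrier instead.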
Re: [(RT RFC) PATCH v2 5/9] adaptive real-time lock support
On Mon, Feb 25, 2008 at 5:03 PM, in message [EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: Hi! +/* + * Adaptive-rtlocks will busywait when possible, and sleep only if + * necessary. Note that the busyloop looks racy, and it isbut we do + * not care. If we lose any races it simply means that we spin one more + * time before seeing that we need to break-out on the next iteration. + * + * We realize this is a relatively large function to inline, but note that + * it is only instantiated 1 or 2 times max, and it makes a measurable + * performance different to avoid the call. + * + * Returns 1 if we should sleep + * + */ +static inline int +adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, + struct adaptive_waiter *adaptive) +{ +int sleep = 0; + +for (;;) { +/* + * If the task was re-awoken, break out completely so we can + * reloop through the lock-acquisition code. + */ +if (!waiter-task) +break; + +/* + * We need to break if the owner changed so we can reloop + * and safely acquire the owner-pointer again with the + * wait_lock held. + */ +if (adaptive-owner != rt_mutex_owner(lock)) +break; + +/* + * If we got here, presumably the lock ownership is still + * current. We will use it to our advantage to be able to + * spin without disabling preemption... + */ + +/* + * .. sleep if the owner is not running.. + */ +if (!adaptive-owner-se.on_rq) { +sleep = 1; +break; +} + +/* + * .. or is running on our own cpu (to prevent deadlock) + */ +if (task_cpu(adaptive-owner) == task_cpu(current)) { +sleep = 1; +break; +} + +cpu_relax(); +} + +put_task_struct(adaptive-owner); + +return sleep; +} + You want to inline this? Yes. As the comment indicates, there are 1-2 users tops, and it has a significant impact on throughput ( 5%) to take the hit with a call. I don't think its actually much code anyway...its all comments. +static inline void +prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive) ... 
+#define prepare_adaptive_wait(lock, busy) {}

This is evil.  Use empty inline function instead (same for the other function, there you can maybe get away with it).

Ok.

								Pavel
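For reference, the reason a bare {} macro body is considered evil while the do {} while (0) form (suggested later in this thread) is not: the bare braces plus the caller's semicolon form two statements, which breaks if/else pairing. Illustrative macro names below, not from the patch:

```c
#include <assert.h>

#define NOP_BROKEN()  {}
#define NOP_OK()      do {} while (0)

static int pick(int flag)
{
	int r = 0;

	/*
	 * With NOP_BROKEN() here, `NOP_BROKEN();` expands to {} plus
	 * an empty statement, so the else below could no longer
	 * attach to the if and this file would not compile:
	 *
	 *     if (flag)
	 *             NOP_BROKEN();   // error: 'else' without 'if'
	 *     else
	 *             r = 1;
	 *
	 * The do/while(0) form is a single statement that still
	 * demands the trailing semicolon, so it behaves like a
	 * function call in every context.
	 */
	if (flag)
		NOP_OK();
	else
		r = 1;
	return r;
}
```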
Re: [(RT RFC) PATCH v2 7/9] adaptive mutexes
On Mon, Feb 25, 2008 at 5:09 PM, in message [EMAIL PROTECTED], Pavel Machek [EMAIL PROTECTED] wrote: Hi! From: Peter W.Morreale [EMAIL PROTECTED] This patch adds the adaptive spin lock busywait to rtmutexes. It adds a new tunable: rtmutex_timeout, which is the companion to the rtlock_timeout tunable. Signed-off-by: Peter W. Morreale [EMAIL PROTECTED] Not signed off by you? I wasn't sure if this was appropriate for me to do. This is the first time I was acting as upstream to someone. If that is what I am expected to do, consider this an ack for your remaining comments related to this. diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index ac1cbad..864bf14 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -214,6 +214,43 @@ config RTLOCK_DELAY tunable at runtime via a sysctl. A setting of 0 (zero) disables the adaptive algorithm entirely. +config ADAPTIVE_RTMUTEX +bool Adaptive real-time mutexes +default y +depends on ADAPTIVE_RTLOCK +help + This option adds the adaptive rtlock spin/sleep algorithm to + rtmutexes. In rtlocks, a significant gain in throughput + can be seen by allowing rtlocks to spin for a distinct + amount of time prior to going to sleep for deadlock avoidence. + + Typically, mutexes are used when a critical section may need to + sleep due to a blocking operation. In the event the critical + section does not need to sleep, an additional gain in throughput + can be seen by avoiding the extra overhead of sleeping. Watch the whitespace. ... and do we need yet another config options? +config RTMUTEX_DELAY +int Default delay (in loops) for adaptive mutexes +range 0 1000 +depends on ADAPTIVE_RTMUTEX +default 3000 +help + This allows you to specify the maximum delay a task will use + to wait for a rt mutex before going to sleep. Note that that + although the delay is implemented as a preemptable loop, tasks + of like priority cannot preempt each other and this setting can + result in increased latencies. 
+
+	  The value is tunable at runtime via a sysctl.  A setting of 0
+	  (zero) disables the adaptive algorithm entirely.

Ouch.

? Is this reference to whitespace damage, or does the content need addressing?

+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+
+#define mutex_adaptive_wait         adaptive_wait
+#define mutex_prepare_adaptive_wait prepare_adaptive_wait
+
+extern int rtmutex_timeout;
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name) \
+	struct adaptive_waiter name = { .owner = NULL, \
+					.timeout = rtmutex_timeout, }
+
+#else
+
+#define DECLARE_ADAPTIVE_MUTEX_WAITER(name)
+
+#define mutex_adaptive_wait(lock, intr, waiter, busy) 1
+#define mutex_prepare_adaptive_wait(lock, busy) {}

More evil macros.  Macro does not behave like a function, make it inline function if you are replacing a function.

Ok

								Pavel
Re: [(RT RFC) PATCH v2 2/9] sysctl for runtime-control of lateral mutex stealing
On Mon, Feb 25, 2008 at 5:57 PM, in message [EMAIL PROTECTED], Sven-Thorsten Dietrich [EMAIL PROTECTED] wrote:

But Greg may need to enforce it on his git tree that he mails these from - are you referring to anything specific in this patch?

That's what I don't get.  I *did* checkpatch all of these before sending them out (and I have for every release).  I am aware of two tabs-vs-spaces warnings, but the rest checked clean.

Why do some people still see errors when I don't?  Is there a set of switches I should supply to checkpatch to make it more aggressive or something?

-Greg
Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
Bill Huey (hui) wrote: The might_sleep is annotation and well as a conditional preemption point for the regular kernel. You might want to do a schedule check there, but it's the wrong function if memory serves me correctly. It's reserved for things that actually are design to sleep. Note that might_sleep() already does a cond_resched() on the configurations that need it, so I am not sure what you are getting at here. Is that not enough? The rt_spin*() function are really a method of preserving BKL semantics across real schedule() calls. You'd have to use something else instead for that purpose like cond_reschedule() instead. I dont quite understand this part either. From my perspective, rt_spin*() functions are locking constructs that might sleep (or might spin with the new patches), and they happen to be BKL and wakeup transparent. To me, either the might_sleep() is correct for all paths that don't fit the in_atomic-printk exception, or none of them are. Are you saying that the modified logic that I introduced is broken? Or that the original use of the might_sleep() annotation inside this function is broken? -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
Pavel Machek wrote: Hi! Decorate the printk path with an "unlikely()" Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/rtmutex.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 122f143..ebdaa17 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { /* Temporary HACK! */ - if (!current->in_printk) - might_sleep(); - else if (in_atomic() || irqs_disabled()) + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled())) /* don't grab locks for printk in atomic */ return; + might_sleep(); I think you changed the code here... you call might_sleep() in different cases afaict. Agreed, but it's still correct afaict. I added an extra might_sleep() to a path that really might sleep. I should have mentioned that in the header. In any case, its moot. Andi indicated this patch is probably a no-op so I was considering dropping it on the v2 pass. Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
Paul E. McKenney wrote: Governing the timeout by context-switch overhead sounds even better to me. Really easy to calibrate, and short critical sections are of much shorter duration than are a context-switch pair. Yeah, fully agree. This is on my research "todo" list. My theory is that the ultimate adaptive-timeout algorithm here would essentially be the following: *) compute the context-switch pair time average for the system. This is your time threshold (CSt). *) For each lock, maintain an average hold-time (AHt) statistic (I am assuming this can be done cheaply...perhaps not). The adaptive code would work as follows: if (AHt > CSt) /* dont even bother if the average is greater than CSt */ timeout = 0; else timeout = AHt; if (adaptive_wait(timeout)) sleep(); Anyone have some good ideas on how to compute CSt? I was thinking you could create two kthreads that message one another (measuring round-trip time) for some number (say 100) to get an average. You could probably just approximate it with flushing workqueue jobs. -Greg Thanx, Paul Sven Thanx, Paul - To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 05/14] rearrange rt_spin_lock sleep
Gregory Haskins wrote:

@@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		schedule_rt_mutex(lock);
+		update_current(TASK_UNINTERRUPTIBLE, &saved_state);

I have a question for everyone out there about this particular part of the code.  Patch 6/14 adds an optimization that is predicated on the order in which we modify the state==TASK_UNINTERRUPTIBLE vs reading the waiter.task below.  My assumption is that the xchg() (inside update_current()) acts as an effective wmb().  If xchg() does not have this property, then this code is broken and patch 6/14 should also add a:

+		smp_wmb();

+		if (waiter.task)
+			schedule_rt_mutex(lock);
+		else
+			update_current(TASK_RUNNING_MUTEX, &saved_state);
 
 		spin_lock_irqsave(&lock->wait_lock, flags);
 		current->flags |= saved_flags;
 		current->lock_depth = saved_lock_depth;
-		state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
-		if (unlikely(state == TASK_RUNNING))
-			saved_state = TASK_RUNNING;

Does anyone know the answer to this?

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
>>> On Thu, Feb 21, 2008 at 4:42 PM, in message <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]> wrote: > * Bill Huey (hui) <[EMAIL PROTECTED]> wrote: > >> I came to the original conclusion that it wasn't originally worth it, >> but the dbench number published say otherwise. [...] > > dbench is a notoriously unreliable and meaningless workload. It's being > frowned upon by the VM and the IO folks. I agree...its a pretty weak benchmark. BUT, it does pound on dcache_lock and therefore was a good demonstration of the benefits of lower-contention overhead. Also note we also threw other tests in that PDF if you scroll to the subsequent pages. > If that's the only workload > where spin-mutexes help, and if it's only a 3% improvement [of which it > is unclear how much of that improvement was due to ticket spinlocks], > then adaptive mutexes are probably not worth it. Note that the "3%" figure being thrown around was from a single patch within the series. We are actually getting a net average gain of 443% in dbench. And note that the number goes *up* when you remove the ticketlocks. The ticketlocks are there to prevent latency spikes, not improve throughput. Also take a look at the hackbench numbers which are particularly promising. We get a net average gain of 493% faster for RT10 based hackbench runs. The kernel build was only a small gain, but it was all gain nonetheless. We see similar results for any other workloads we throw at this thing. I will gladly run any test requested to which I have the ability to run, and I would encourage third party results as well. > > I'd not exclude them fundamentally though, it's really the numbers that > matter. The code is certainly simple enough (albeit the .config and > sysctl controls are quite ugly and unacceptable - adaptive mutexes > should really be ... adaptive, with no magic constants in .configs or > else). We can clean this up, per your suggestions. > > But ... 
i'm somewhat sceptic, after having played with spin-a-bit > mutexes before. It's very subtle to get this concept to work. The first few weeks, we were getting 90% regressions ;) Then we had a breakthrough and started to get this thing humming along quite nicely. Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
>>> On Thu, Feb 21, 2008 at 4:24 PM, in message <[EMAIL PROTECTED]>, Ingo Molnar <[EMAIL PROTECTED]> wrote: > hm. Why is the ticket spinlock patch included in this patchset? It just > skews your performance results unnecessarily. Ticket spinlocks are > independent conceptually, they are already upstream in 2.6.25-rc2 and > -rt will have them automatically once we rebase to .25. Sorry if it was ambiguous. I included them because we found the patch series without them can cause spikes due to the newly introduced pressure on the (raw_spinlock_t)lock->wait_lock. You can run the adaptive-spin patches without them just fine (in fact, in many cases things run faster without them; dbench *thrives* on chaos). But you may also measure a cyclic-test spike if you do so. So I included them to present a "complete package without spikes". I tried to explain that detail in the prologue, but most people probably fell asleep before they got to the end ;) > > and if we take the ticket spinlock patch out of your series, the size of > the patchset shrinks in half and touches only 200-300 lines of > code ;-) Considering the total size of the -rt patchset: > >652 files changed, 23830 insertions(+), 4636 deletions(-) > > we can regard it a routine optimization ;-) It's not the size of your LOC, but what you do with it :) > > regarding the concept: adaptive mutexes have been talked about in the > past, but their advantage is not at all clear, that's why we haven't done > them. It's definitely not an unambiguously win-win concept. > > So lets get some real marketing-free benchmarking done, and we are not > just interested in the workloads where a bit of polling on contended > locks helps, but we are also interested in workloads where the polling > hurts ... And lets please do the comparisons without the ticket spinlock > patch ... I'm open to suggestion, and this was just a sample of the testing we have done.
We have thrown plenty of workloads at this patch series far beyond the slides I prepared in that URL, and they all seem to indicate a net positive improvement so far. Some of those results I cannot share due to NDA, and some I didn't share simply because I never formally collected the data like I did for these tests. If there is something you would like to see, please let me know and I will arrange for it to be executed if at all possible. Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
>>> On Thu, Feb 21, 2008 at 11:41 AM, in message <[EMAIL PROTECTED]>, Andi Kleen <[EMAIL PROTECTED]> wrote: >> +config RTLOCK_DELAY >> +int "Default delay (in loops) for adaptive rtlocks" >> +range 0 10 >> +depends on ADAPTIVE_RTLOCK > > I must say I'm not a big fan of putting such subtle configurable numbers > into Kconfig. Compilation is usually the wrong place to configure > such a thing. Just having it as a sysctl only should be good enough. > >> +default "1" > > Perhaps you can expand how you came up with that default number? Actually, the number doesn't seem to matter that much as long as it is sufficiently long to make timeouts rare. Most workloads will present some threshold for hold-time. You generally get the best performance if the value is at least as "long" as that threshold. Anything beyond that and there is no gain, but there doesn't appear to be a penalty either. So we picked 1 because we found it to fit that criteria quite well for our range of GHz class x86 machines. YMMV, but that is why it's configurable ;) > It looks suspiciously round and worse the actual spin time depends a lot on > the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite > differently from a 2Ghz CPU) Yeah, fully agree. We really wanted to use a time-value here but ran into various problems that have yet to be resolved. We have it on the todo list to express this in terms of ns so it at least will scale with the architecture. > Did you experiment with other spin times? Of course ;) > Should it be scaled with number of CPUs? Not to my knowledge, but we can put that as a research "todo". > And at what point is real > time behaviour visibly impacted? Well, if we did our jobs correctly, RT behavior should *never* be impacted. *Throughput* on the other hand... ;) But it comes down to what I mentioned earlier. There is that threshold that affects the probability of timing out. Values lower than that threshold start to degrade throughput.
Values higher than that have no effect on throughput, but may drive the cpu utilization higher, which can theoretically impact tasks of equal or lesser priority by taking that resource away from them. To date, we have not observed any real-world implications of this, however. > > Most likely it would be better to switch to something that is more > absolute time, like checking RDTSC every few iteration similar to what > udelay does. That would be at least constant time. I agree. We need to move in the direction of a time basis. The tradeoff is that it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). I think one of the (not-included) patches converts a nanosecond value from the sysctl to approximate loop-counts using the bogomips data. This was a decent compromise between the non-scaling loopcounts and the heavy-weight official timing APIs. We dropped it because we support older kernels which were conflicting with the patch. We may have to resurrect it, however. -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
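The bogomips-based conversion described above (a nanosecond sysctl value turned into a spin-loop budget) can be sketched in userspace C. This is a hedged illustration only: `ns_to_spin_loops` and `loops_per_us` are hypothetical names, not identifiers from the dropped patch, and in the kernel the calibration factor would come from the delay-loop (bogomips) data rather than being passed in.

```c
#include <assert.h>

/* Hypothetical sketch: convert a nanosecond timeout into a spin-loop
 * count using a calibrated loops-per-microsecond figure.  Rounds up so
 * that a short but non-zero timeout still spins at least once, while a
 * timeout of 0 keeps the "disable spinning" meaning of the sysctl. */
static unsigned long ns_to_spin_loops(unsigned long timeout_ns,
                                      unsigned long loops_per_us)
{
    return (timeout_ns * loops_per_us + 999) / 1000;
}
```

For example, a 10000 ns budget on a machine calibrated at 4 loops/us yields a budget of 40 spin iterations, which is the shape of compromise the email describes: loop counts that scale with the hardware instead of being a raw magic constant.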
Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
>>> On Thu, Feb 21, 2008 at 11:36 AM, in message <[EMAIL PROTECTED]>, Andi Kleen <[EMAIL PROTECTED]> wrote: > On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote: > >> @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock, >> void fastcall (*slowfn)(struct rt_mutex *lock)) >> { >> /* Temporary HACK! */ >> -if (!current->in_printk) >> -might_sleep(); >> -else if (in_atomic() || irqs_disabled()) >> +if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled())) > > I have my doubts that gcc will honor unlikelies that don't affect > the complete condition of an if. > > Also conditions guarding returns are by default predicted unlikely > anyways AFAIK. > > The patch is likely a nop. > Yeah, you are probably right. We have found that the system is *extremely* touchy on how much overhead we have in the lock-acquisition path. For instance, using a non-inline version of adaptive_wait() can cost 5-10% in disk-io throughput. So we were trying to find places to shave anywhere we could. That being said, I didn't record any difference from this patch, so you are probably exactly right. It just seemed like "the right thing to do" so I left it in. -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
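For readers following the exchange above: `unlikely()` boils down to GCC's `__builtin_expect`, which only annotates the branch probability; the truth value of the expression is unchanged, which is why the patch can at worst be a no-op. A minimal userspace sketch of the condition being discussed (`my_unlikely` and `skip_lock_in_atomic` are illustrative names, not kernel code):

```c
#include <assert.h>

/* Stand-in for the kernel's unlikely(): a branch-prediction hint only;
 * the boolean value of the wrapped expression is unchanged. */
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Models the guard in rt_spin_lock_fastlock(): skip lock acquisition
 * only when called from printk *and* in an atomic context. */
static int skip_lock_in_atomic(int in_printk, int in_atomic, int irqs_off)
{
    if (my_unlikely(in_printk) && (in_atomic || irqs_off))
        return 1;   /* don't grab locks for printk in atomic */
    return 0;
}
```

Andi's doubt is precisely that the hint here covers only `in_printk`, not the whole compound condition, so the compiler may or may not propagate it to the final branch layout.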
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
>>> On Thu, Feb 21, 2008 at 10:26 AM, in message <[EMAIL PROTECTED]>, Gregory Haskins <[EMAIL PROTECTED]> wrote: > We have put together some data from different types of benchmarks for > this patch series, which you can find here: > > ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf For convenience, I have also placed a tarball of the entire series here: ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2 Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 13/14] allow rt-mutex lock-stealing to include lateral priority
The current logic only allows lock stealing to occur if the current task is of higher priority than the pending owner. We can gain significant throughput improvements (200%+) by allowing the lock-stealing code to include tasks of equal priority. The theory is that the system will make faster progress by allowing the task already on the CPU to take the lock rather than waiting for the system to wake up a different task. This does add a degree of unfairness, yes. But also note that the users of these locks under non -rt environments have already been using unfair raw spinlocks anyway so the tradeoff is probably worth it. The way I like to think of this is that higher priority tasks should clearly preempt, and lower priority tasks should clearly block. However, if tasks have an identical priority value, then we can think of the scheduler decisions as the tie-breaking parameter. (e.g. tasks that the scheduler picked to run first have a logically higher priority among tasks of the same prio). This helps to keep the system "primed" with tasks doing useful work, and the end result is higher throughput. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt | 10 ++ kernel/rtmutex.c | 31 +++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index d2b0daa..343b93c 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -273,3 +273,13 @@ config SPINLOCK_BKL Say Y here if you are building a kernel for a desktop system. Say N if you are unsure. +config RTLOCK_LATERAL_STEAL +bool "Allow equal-priority rtlock stealing" + default y + depends on PREEMPT_RT + help +This option alters the rtlock lock-stealing logic to allow +equal priority tasks to preempt a pending owner in addition +to higher priority tasks. This allows for a significant +boost in throughput under certain circumstances at the expense +of strict FIFO lock access. 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 95c3644..da077e5 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -323,12 +323,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, return ret; } +static inline int lock_is_stealable(struct task_struct *pendowner, int unfair) +{ +#ifndef CONFIG_RTLOCK_LATERAL_STEAL + if (current->prio >= pendowner->prio) +#else + if (current->prio > pendowner->prio) + return 0; + + if (!unfair && (current->prio == pendowner->prio)) +#endif + return 0; + + return 1; +} + /* * Optimization: check if we can steal the lock from the * assigned pending owner [which might not have taken the * lock yet]: */ -static inline int try_to_steal_lock(struct rt_mutex *lock) +static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair) { struct task_struct *pendowner = rt_mutex_owner(lock); struct rt_mutex_waiter *next; @@ -340,7 +355,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock) return 1; spin_lock(&pendowner->pi_lock); - if (current->prio >= pendowner->prio) { + if (!lock_is_stealable(pendowner, unfair)) { spin_unlock(&pendowner->pi_lock); return 0; } @@ -393,7 +408,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock) * * Must be called with lock->wait_lock held. */ -static int try_to_take_rt_mutex(struct rt_mutex *lock) +static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair) { /* * We have to be careful here if the atomic speedups are @@ -416,7 +431,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock) */ mark_rt_mutex_waiters(lock); - if (rt_mutex_owner(lock) && !try_to_steal_lock(lock)) + if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair)) return 0; /* We got the lock. 
*/ @@ -737,7 +752,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) int saved_lock_depth = current->lock_depth; /* Try to acquire the lock */ - if (try_to_take_rt_mutex(lock)) + if (try_to_take_rt_mutex(lock, 1)) break; /* * waiter.task is NULL the first time we come here and @@ -985,7 +1000,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, init_lists(lock); /* Try to acquire the lock again: */ - if (try_to_take_rt_mutex(lock)) { + if (try_to_take_rt_mutex(lock, 0)) { spin_unlock_irqrestore(&lock->wait_lock, flags); return 0; } @@ -1006,7 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, unsigned long saved_fla
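The steal policy in the patch above reduces to a small pure function, modeled here in userspace C. This is a sketch of the decision logic only (the kernel version also takes pi_lock and manipulates waiter lists); as with kernel RT priorities, a lower numeric `prio` value means higher priority.

```c
#include <assert.h>

/* Model of lock_is_stealable() with lateral stealing compiled in:
 * a strictly higher-priority stealer always wins, a strictly lower
 * one never does, and an equal-priority task steals only when the
 * "unfair" (lateral) mode is enabled. */
static int lock_is_stealable_model(int stealer_prio, int pendowner_prio,
                                   int unfair)
{
    if (stealer_prio > pendowner_prio)  /* strictly lower priority */
        return 0;
    if (!unfair && stealer_prio == pendowner_prio)
        return 0;
    return 1;
}
```

The `unfair` flag corresponds to the spinlock-style slowpath passing 1 and the sleeping rt_mutex slowpath passing 0, so only the spinlock variants get the lateral-steal throughput boost.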
[PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing
From: Sven-Thorsten Dietrich <[EMAIL PROTECTED]> Add /proc/sys/kernel/lateral_steal, to allow switching on and off equal-priority mutex stealing between threads. Signed-off-by: Sven-Thorsten Dietrich <[EMAIL PROTECTED]> --- kernel/rtmutex.c |8 ++-- kernel/sysctl.c | 14 ++ 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index da077e5..62e7af5 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; #ifdef CONFIG_ADAPTIVE_RTMUTEX int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY; #endif +#ifdef CONFIG_RTLOCK_LATERAL_STEAL +int rtmutex_lateral_steal __read_mostly = 1; +#endif /* * lock->owner state tracking: @@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair) if (current->prio > pendowner->prio) return 0; - if (!unfair && (current->prio == pendowner->prio)) + if (unlikely(current->prio == pendowner->prio) && + !(unfair && rtmutex_lateral_steal)) #endif return 0; @@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair) return 1; spin_lock(&pendowner->pi_lock); - if (!lock_is_stealable(pendowner, unfair)) { + if (!lock_is_stealable(pendowner, (unfair & rtmutex_lateral_steal))) { spin_unlock(&pendowner->pi_lock); return 0; } diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 3465af2..c1a1c6d 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[]; int sysctl_legacy_va_layout; #endif +#ifdef CONFIG_RTLOCK_LATERAL_STEAL +extern int rtmutex_lateral_steal; +#endif + extern int prove_locking; extern int lock_stat; @@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_RTLOCK_LATERAL_STEAL + { + .ctl_name = CTL_UNNUMBERED, + .procname = "rtmutex_lateral_steal", + .data = &rtmutex_lateral_steal, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif #ifdef 
CONFIG_PROC_FS { .ctl_name = CTL_UNNUMBERED, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
Decorate the printk path with an "unlikely()" Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/rtmutex.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 122f143..ebdaa17 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { /* Temporary HACK! */ - if (!current->in_printk) - might_sleep(); - else if (in_atomic() || irqs_disabled()) + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled())) /* don't grab locks for printk in atomic */ return; + might_sleep(); + if (likely(rt_mutex_cmpxchg(lock, NULL, current))) rt_mutex_deadlock_account_lock(lock, current); else @@ -677,7 +677,7 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { /* Temporary HACK! */ - if (current->in_printk && (in_atomic() || irqs_disabled())) + if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled())) /* don't grab locks for printk in atomic */ return; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 12/14] remove the extra call to try_to_take_lock
From: Peter W. Morreale <[EMAIL PROTECTED]> Remove the redundant attempt to get the lock. While it is true that the exit path with this patch adds an unnecessary xchg (in the event the lock is granted without further traversal in the loop), experimentation shows that we almost never encounter this situation. Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]> --- kernel/rtmutex.c |6 -- 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index ebdaa17..95c3644 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -718,12 +718,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) spin_lock_irqsave(&lock->wait_lock, flags); init_lists(lock); - /* Try to acquire the lock again: */ - if (try_to_take_rt_mutex(lock)) { - spin_unlock_irqrestore(&lock->wait_lock, flags); - return; - } - BUG_ON(rt_mutex_owner(lock) == current); /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 09/14] adaptive mutexes
From: Peter W. Morreale <[EMAIL PROTECTED]> This patch adds the adaptive spin lock busywait to rtmutexes. It adds a new tunable: rtmutex_timeout, which is the companion to the rtlock_timeout tunable. Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt| 37 + kernel/rtmutex.c | 44 ++-- kernel/rtmutex_adaptive.h | 32 ++-- kernel/sysctl.c | 10 ++ 4 files changed, 103 insertions(+), 20 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index eebec19..d2b0daa 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -223,6 +223,43 @@ config RTLOCK_DELAY tunable at runtime via a sysctl. A setting of 0 (zero) disables the adaptive algorithm entirely. +config ADAPTIVE_RTMUTEX +bool "Adaptive real-time mutexes" +default y +depends on ADAPTIVE_RTLOCK +help + This option adds the adaptive rtlock spin/sleep algorithm to + rtmutexes. In rtlocks, a significant gain in throughput + can be seen by allowing rtlocks to spin for a distinct + amount of time prior to going to sleep for deadlock avoidance. + + Typically, mutexes are used when a critical section may need to + sleep due to a blocking operation. In the event the critical +section does not need to sleep, an additional gain in throughput +can be seen by avoiding the extra overhead of sleeping. + + This option alters the rtmutex code to use an adaptive + spin/sleep algorithm. It will spin unless it determines it must + sleep to avoid deadlock. This offers a best of both worlds + solution since we achieve both high-throughput and low-latency. + + If unsure, say Y + +config RTMUTEX_DELAY +int "Default delay (in loops) for adaptive mutexes" +range 0 1000 +depends on ADAPTIVE_RTMUTEX +default "3000" +help + This allows you to specify the maximum delay a task will use +to wait for a rt mutex before going to sleep. 
Note that +although the delay is implemented as a preemptable loop, tasks +of like priority cannot preempt each other and this setting can +result in increased latencies. + + The value is tunable at runtime via a sysctl. A setting of 0 +(zero) disables the adaptive algorithm entirely. + config SPINLOCK_BKL bool "Old-Style Big Kernel Lock" depends on (PREEMPT || SMP) && !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 4a7423f..a7ed7b2 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -24,6 +24,10 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; #endif +#ifdef CONFIG_ADAPTIVE_RTMUTEX +int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY; +#endif + /* * lock->owner state tracking: * @@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) * Do the wakeup before the ownership change to give any spinning * waiter grantees a headstart over the other threads that will * trigger once owner changes. +* +* This may appear to be a race, but the barriers close the +* window. */ - if (!savestate) - wake_up_process(pendowner); - else { - smp_mb(); - /* -* This may appear to be a race, but the barriers close the -* window. 
-*/ - if ((pendowner->state != TASK_RUNNING) - && (pendowner->state != TASK_RUNNING_MUTEX)) + smp_mb(); + if ((pendowner->state != TASK_RUNNING) + && (pendowner->state != TASK_RUNNING_MUTEX)) { + if (!savestate) + wake_up_process(pendowner); + else + wake_up_process_mutex(pendowner); } @@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) debug_rt_mutex_print_deadlock(&waiter); /* adaptive_wait() returns 1 if we need to sleep */ - if (adaptive_wait(lock, &waiter, &adaptive)) { + if (adaptive_wait(lock, 0, &waiter, &adaptive)) { update_current(TASK_UNINTERRUPTIBLE, &saved_state); if (waiter.task) schedule_rt_mutex(lock); @@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, int ret = 0, saved_lock_depth = -1; struct rt_mutex_waiter waiter; unsigned long flags; + DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive); debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; @@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, if (unlikely(current->lock_depth >= 0)) saved_lock_depth =
[PATCH [RT] 10/14] adjust pi_lock usage in wakeup
From: Peter W. Morreale <[EMAIL PROTECTED]> In wakeup_next_waiter(), we take the pi_lock, and then find out whether we have another waiter to add to the pending owner. We can reduce contention on the pi_lock for the pending owner if we first obtain the pointer to the next waiter outside of the pi_lock. This patch adds a measurable increase in throughput. Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]> --- kernel/rtmutex.c | 14 +- 1 files changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index a7ed7b2..122f143 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) { struct rt_mutex_waiter *waiter; struct task_struct *pendowner; + struct rt_mutex_waiter *next; spin_lock(&current->pi_lock); @@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) * waiter with higher priority than pending-owner->normal_prio * is blocked on the unboosted (pending) owner. */ + + if (rt_mutex_has_waiters(lock)) + next = rt_mutex_top_waiter(lock); + else + next = NULL; + spin_lock(&pendowner->pi_lock); WARN_ON(!pendowner->pi_blocked_on); @@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) pendowner->pi_blocked_on = NULL; - if (rt_mutex_has_waiters(lock)) { - struct rt_mutex_waiter *next; - - next = rt_mutex_top_waiter(lock); + if (next) plist_add(&next->pi_list_entry, &pendowner->pi_waiters); - } + spin_unlock(&pendowner->pi_lock); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 07/14] adaptive real-time lock support
There are pros and cons when deciding between the two basic forms of locking primitives (spinning vs sleeping). Without going into great detail on either one, we note that spinlocks have the advantage of lower overhead for short hold locks. However, they also have a con in that they create indeterminate latencies since preemption must traditionally be disabled while the lock is held (to prevent deadlock). We want to avoid non-deterministic critical sections in -rt. Therefore, when realtime is enabled, most contexts are converted to threads, and likewise most spinlock_ts are converted to sleepable rt-mutex derived locks. This allows the holder of the lock to remain fully preemptible, thus reducing a major source of latencies in the kernel. However, converting what was once a true spinlock into a sleeping lock may also decrease performance since the locks will now sleep under contention. Since the fundamental lock used to be a spinlock, it is highly likely that it was used in a short-hold path and that release is imminent. Therefore sleeping only serves to cause context-thrashing. Adaptive RT locks use a hybrid approach to solve the problem. They spin when possible, and sleep when necessary (to avoid deadlock, etc). This significantly improves many areas of the performance of the -rt kernel. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> Signed-off-by: Peter Morreale <[EMAIL PROTECTED]> Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt| 20 +++ kernel/rtmutex.c | 19 +- kernel/rtmutex_adaptive.h | 134 + 3 files changed, 168 insertions(+), 5 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 5b45213..6568519 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -192,6 +192,26 @@ config RCU_TRACE Say Y/M here if you want to enable RCU tracing in-kernel/module. Say N if you are unsure. 
+config ADAPTIVE_RTLOCK +bool "Adaptive real-time locks" + default y + depends on PREEMPT_RT && SMP + help +PREEMPT_RT allows for greater determinism by transparently +converting normal spinlock_ts into preemptible rtmutexes which +sleep any waiters under contention. However, in many cases the +lock will be released in less time than it takes to context +switch. Therefore, the "sleep under contention" policy may also +degrade throughput performance due to the extra context switches. + +This option alters the rtmutex derived spinlock_t replacement +code to use an adaptive spin/sleep algorithm. It will spin +unless it determines it must sleep to avoid deadlock. This +offers a best of both worlds solution since we achieve both +high-throughput and low-latency. + +If unsure, say Y + config SPINLOCK_BKL bool "Old-Style Big Kernel Lock" depends on (PREEMPT || SMP) && !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index cb27b08..feb938f 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -7,6 +7,7 @@ * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]> * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen + * Copyright (C) 2008 Novell, Inc. * * See Documentation/rt-mutex-design.txt for details. */ @@ -17,6 +18,7 @@ #include #include "rtmutex_common.h" +#include "rtmutex_adaptive.h" /* * lock->owner state tracking: @@ -697,6 +699,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) { struct rt_mutex_waiter waiter; unsigned long saved_state, state, flags; + DECLARE_ADAPTIVE_WAITER(adaptive); debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; @@ -743,6 +746,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) continue; } + prepare_adaptive_wait(lock, &adaptive); + /* * Prevent schedule() to drop BKL, while waiting for * the lock ! We restore lock_depth when we come back. 
@@ -754,11 +759,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) debug_rt_mutex_print_deadlock(&waiter); - update_current(TASK_UNINTERRUPTIBLE, &saved_state); - if (waiter.task) - schedule_rt_mutex(lock); - else - update_current(TASK_RUNNING_MUTEX, &saved_state); + /* adaptive_wait() returns 1 if we need to sleep */ + if (adaptive_wait(lock, &waiter, &adaptive)) { + update_current(TASK_UNINTERRUPTIBLE, &saved_state); + if (waiter.task) + schedule_rt_mutex(lock); + else +
[PATCH [RT] 08/14] add a loop counter based timeout mechanism
From: Sven Dietrich <[EMAIL PROTECTED]> Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt| 11 +++ kernel/rtmutex.c |4 kernel/rtmutex_adaptive.h | 11 +-- kernel/sysctl.c | 12 4 files changed, 36 insertions(+), 2 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 6568519..eebec19 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK If unsure, say Y +config RTLOCK_DELAY + int "Default delay (in loops) for adaptive rtlocks" + range 0 10 + depends on ADAPTIVE_RTLOCK + default "1" +help + This allows you to specify the maximum attempts a task will spin +attempting to acquire an rtlock before sleeping. The value is +tunable at runtime via a sysctl. A setting of 0 (zero) disables +the adaptive algorithm entirely. + config SPINLOCK_BKL bool "Old-Style Big Kernel Lock" depends on (PREEMPT || SMP) && !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index feb938f..4a7423f 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -20,6 +20,10 @@ #include "rtmutex_common.h" #include "rtmutex_adaptive.h" +#ifdef CONFIG_ADAPTIVE_RTLOCK +int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; +#endif + /* * lock->owner state tracking: * diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h index 505fed5..b7e282b 100644 --- a/kernel/rtmutex_adaptive.h +++ b/kernel/rtmutex_adaptive.h @@ -39,6 +39,7 @@ #ifdef CONFIG_ADAPTIVE_RTLOCK struct adaptive_waiter { struct task_struct *owner; + int timeout; }; /* @@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, { int sleep = 0; - for (;;) { + for (; adaptive->timeout > 0; adaptive->timeout--) { /* * If the task was re-awoken, break out completely so we can * reloop through the lock-acquisition code. 
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, cpu_relax(); } + if (adaptive->timeout <= 0) + sleep = 1; + put_task_struct(adaptive->owner); return sleep; @@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive) get_task_struct(adaptive->owner); } +extern int rtlock_timeout; + #define DECLARE_ADAPTIVE_WAITER(name) \ - struct adaptive_waiter name = { .owner = NULL, } + struct adaptive_waiter name = { .owner = NULL, \ + .timeout = rtlock_timeout, } #else diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 541aa9f..36259e4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -58,6 +58,8 @@ #include #endif +#include "rtmutex_adaptive.h" + static int deprecated_sysctl_warning(struct __sysctl_args *args); #if defined(CONFIG_SYSCTL) @@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_ADAPTIVE_RTLOCK + { + .ctl_name = CTL_UNNUMBERED, + .procname = "rtlock_timeout", + .data = &rtlock_timeout, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif #ifdef CONFIG_PROC_FS { .ctl_name = CTL_UNNUMBERED, -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
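The behavior this patch adds boils down to: spin for at most `rtlock_timeout` iterations watching the lock, then tell the caller to sleep. A single-threaded userspace model of that contract follows; the lock/owner check is replaced by a hypothetical callback, since the real loop inspects the waiter and the lock owner's run state.

```c
#include <assert.h>

/* Spin for at most `timeout` iterations; return 0 if the simulated
 * lock became available (caller retries acquisition), 1 if the budget
 * ran out and the caller should fall back to sleeping. */
static int adaptive_wait_model(int timeout, int (*lock_ready)(int iter))
{
    int i;
    for (i = 0; timeout > 0; timeout--, i++) {
        if (lock_ready(i))
            return 0;
        /* cpu_relax() would sit here in the real kernel loop */
    }
    return 1;
}

/* illustrative owner-release scenarios */
static int ready_at_3(int iter) { return iter >= 3; }
static int never_ready(int iter) { (void)iter; return 0; }
```

Note how a budget of 0 means the loop body never runs and the function immediately reports "sleep", which matches the documented meaning of the sysctl value 0 disabling the adaptive algorithm.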
[PATCH [RT] 06/14] optimize rt lock wakeup
It is redundant to wake the grantee task if it is already running. Credit goes to Peter for the general idea. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> Signed-off-by: Peter Morreale <[EMAIL PROTECTED]> --- kernel/rtmutex.c | 23 ++- 1 files changed, 18 insertions(+), 5 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index 15fc6e6..cb27b08 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -511,6 +511,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) pendowner = waiter->task; waiter->task = NULL; + /* +* Do the wakeup before the ownership change to give any spinning +* waiter grantees a headstart over the other threads that will +* trigger once owner changes. +*/ + if (!savestate) + wake_up_process(pendowner); + else { + smp_mb(); + /* +* This may appear to be a race, but the barriers close the +* window. +*/ + if ((pendowner->state != TASK_RUNNING) + && (pendowner->state != TASK_RUNNING_MUTEX)) + wake_up_process_mutex(pendowner); + } + rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING); spin_unlock(&current->pi_lock); @@ -537,11 +555,6 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) plist_add(&next->pi_list_entry, &pendowner->pi_waiters); } spin_unlock(&pendowner->pi_lock); - - if (savestate) - wake_up_process_mutex(pendowner); - else - wake_up_process(pendowner); } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 05/14] rearrange rt_spin_lock sleep
The current logic makes rather coarse adjustments to current->state since it is planning on sleeping anyway. We want to eventually move to an adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the adjustments to bracket the schedule(). This should yield correct behavior with or without the adaptive features that are added later in the series. We add it here as a separate patch for greater review clarity on smaller changes. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/rtmutex.c | 20 +++- 1 files changed, 15 insertions(+), 5 deletions(-) diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index a2b00cc..15fc6e6 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -661,6 +661,14 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock, slowfn(lock); } +static inline void +update_current(unsigned long new_state, unsigned long *saved_state) +{ + unsigned long state = xchg(&current->state, new_state); + if (unlikely(state == TASK_RUNNING)) + *saved_state = TASK_RUNNING; +} + /* * Slow path lock function spin_lock style: this variant is very * careful not to miss any non-lock wakeups. @@ -700,7 +708,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) * saved_state accordingly. If we did not get a real wakeup * then we return with the saved state. 
*/ - saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE); + saved_state = current->state; + smp_mb(); for (;;) { unsigned long saved_flags; @@ -732,14 +741,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) debug_rt_mutex_print_deadlock(&waiter); - schedule_rt_mutex(lock); + update_current(TASK_UNINTERRUPTIBLE, &saved_state); + if (waiter.task) + schedule_rt_mutex(lock); + else + update_current(TASK_RUNNING_MUTEX, &saved_state); spin_lock_irqsave(&lock->wait_lock, flags); current->flags |= saved_flags; current->lock_depth = saved_lock_depth; - state = xchg(&current->state, TASK_UNINTERRUPTIBLE); - if (unlikely(state == TASK_RUNNING)) - saved_state = TASK_RUNNING; } state = xchg(&current->state, saved_state); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
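The state juggling in this patch is subtle: the task may be woken "for real" between the state swap and the restore, and that wakeup must not be lost. As a rough illustration only, here is a minimal userspace model of the update_current() idea, with plain C11 atomics standing in for the kernel's xchg(); the names and enum values are invented for the sketch:

```c
#include <stdatomic.h>

/* Hypothetical userspace model of update_current(): swap in a new task
 * state, but if a concurrent wakeup already set RUNNING, record it in
 * *saved_state so the wakeup is not lost when state is later restored. */
enum { MODEL_RUNNING, MODEL_UNINTERRUPTIBLE };

static void model_update_current(atomic_int *state, int new_state,
                                 int *saved_state)
{
        int old = atomic_exchange(state, new_state); /* like xchg() */

        if (old == MODEL_RUNNING)
                *saved_state = MODEL_RUNNING;        /* preserve the wakeup */
}
```

If a wakeup lands before the swap, the old value read out of the exchange is RUNNING, and the saved state is upgraded so the final restore leaves the task runnable.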
[PATCH [RT] 03/14] x86: FIFO ticket spinlocks
From: Nick Piggin <[EMAIL PROTECTED]> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation is described in the comments. The straight-line lock/unlock instruction sequence is slightly slower than the dec based locks on modern x86 CPUs, however the difference is quite small on Core2 and Opteron when working out of cache, and becomes almost insignificant even on P4 when the lock misses cache. trylock is more significantly slower, but they are relatively rare. On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times. After this patch, all threads appear to finish at exactly the same time. The memory ordering of the lock does conform to x86 standards, and the implementation has been reviewed by Intel and AMD engineers. The algorithm also tells us how many CPUs are contending the lock, so lockbreak becomes trivial and we no longer have to waste 4 bytes per spinlock for it. After this, we can no longer spin on any locks with preempt enabled and cannot reenable interrupts when spinning on an irq safe lock, because at that point we have already taken a ticket and we would deadlock if the same CPU tries to take the lock again. These are questionable anyway: if the lock happens to be called under a preempt or interrupt disabled section, then it will just have the same latency problems. The real fix is to keep critical sections short, and ensure locks are reasonably fair (which this patch does). 
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- include/asm-x86/spinlock.h | 225 ++ include/asm-x86/spinlock_32.h| 221 - include/asm-x86/spinlock_64.h| 167 include/asm-x86/spinlock_types.h |2 4 files changed, 224 insertions(+), 391 deletions(-) diff --git a/include/asm-x86/spinlock.h b/include/asm-x86/spinlock.h index d74d85e..72fe445 100644 --- a/include/asm-x86/spinlock.h +++ b/include/asm-x86/spinlock.h @@ -1,5 +1,226 @@ +#ifndef _X86_SPINLOCK_H_ +#define _X86_SPINLOCK_H_ + +#include <asm/atomic.h> +#include <asm/rwlock.h> +#include <asm/page.h> +#include <asm/processor.h> +#include <linux/compiler.h> + +/* + * Your basic SMP spinlocks, allowing only a single CPU anywhere + * + * Simple spin lock operations. There are two variants, one clears IRQ's + * on the local processor, one does not. + * + * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. + * + * (the type definitions are in asm/spinlock_types.h) + */ + #ifdef CONFIG_X86_32 -# include "spinlock_32.h" +typedef char _slock_t; +# define LOCK_INS_DEC "decb" +# define LOCK_INS_XCH "xchgb" +# define LOCK_INS_MOV "movb" +# define LOCK_INS_CMP "cmpb" +# define LOCK_PTR_REG "a" #else -# include "spinlock_64.h" +typedef int _slock_t; +# define LOCK_INS_DEC "decl" +# define LOCK_INS_XCH "xchgl" +# define LOCK_INS_MOV "movl" +# define LOCK_INS_CMP "cmpl" +# define LOCK_PTR_REG "D" +#endif + +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + +static inline int __raw_spin_is_locked(__raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); +} + +static inline int __raw_spin_is_contended(__raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; +} + +static inline void __raw_spin_lock(__raw_spinlock_t *lock) +{ + short inc = 0x0100; + + /* +* Ticket locks are conceptually two bytes, one indicating the current +* head of the queue, and the other indicating the current tail. 
The +* lock is acquired by atomically noting the tail and incrementing it +* by one (thus adding ourself to the queue and noting our position), +* then waiting until the head becomes equal to the initial value +* of the tail. +* +* This uses a 16-bit xadd to increment the tail and also load the +* position of the head, which takes care of memory ordering issues +* and should be optimal for the uncontended case. Note the tail must +* be in the high byte, otherwise the 16-bit wide increment of the low +* byte would carry up and contaminate the high byte. +*/ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t" + "cmpb %h0, %b0\n\t" + "je 2f\n\t" + "rep ; nop\n\t" + "movb %1, %b0\n\t" + /* don't need lfence here, because loads are in-order */ + "jmp 1b\n" + "2:" + :"+Q" (inc), "+m" (lock->slock) + : + :"memory", "cc"); +} + +#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) + +static inline int
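For readers who prefer not to parse the inline assembly, the ticket protocol above can be modeled in portable C11 atomics. This is a simplified sketch, not the kernel implementation: the real code packs head and tail into one word so a single 16-bit xadd both takes a ticket and reads the head, whereas the model below uses two counters for clarity.

```c
#include <stdatomic.h>

/* Simplified model of a FIFO ticket lock. "next" is the tail (tickets
 * handed out so far), "owner" is the head (ticket currently served). */
struct ticket_lock {
        atomic_ushort next;
        atomic_ushort owner;
};

static void ticket_lock(struct ticket_lock *l)
{
        /* Atomically take a ticket, like the 16-bit xadd in the patch. */
        unsigned short me = atomic_fetch_add(&l->next, 1);

        while (atomic_load(&l->owner) != me)
                ;       /* spin; real code inserts rep;nop (cpu_relax) */
}

static void ticket_unlock(struct ticket_lock *l)
{
        atomic_fetch_add(&l->owner, 1);         /* serve the next ticket */
}

/* "How many CPUs are waiting" falls out of the counters, which is what
 * makes the is_contended/lockbreak query trivial in this scheme. */
static int ticket_is_contended(struct ticket_lock *l)
{
        return (unsigned short)(atomic_load(&l->next) -
                                atomic_load(&l->owner)) > 1;
}
```

FIFO fairness follows directly: each waiter spins on its own ticket number, so the lock is granted strictly in arrival order.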
[PATCH [RT] 04/14] disable PREEMPT_SPINLOCK_WAITERS when x86 ticket/fifo spins are in use
Preemptible spinlock waiters effectively bypasses the benefits of a fifo spinlock. Since we now have fifo spinlocks for x86 enabled, disable the preemption feature on x86. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> CC: Nick Piggin <[EMAIL PROTECTED]> --- arch/x86/Kconfig |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8d15667..d5b9a67 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -20,6 +20,7 @@ config X86 bool default y select HAVE_MCOUNT + select DISABLE_PREEMPT_SPINLOCK_WAITERS config GENERIC_TIME bool -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 02/14] spinlock: make preemptible-waiter feature a specific config option
We introduce a configuration variable for the feature to make it easier for various architectures and/or configs to enable or disable it based on their requirements. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt |9 + kernel/spinlock.c |7 +++ lib/Kconfig.debug |1 + 3 files changed, 13 insertions(+), 4 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 41a0d88..5b45213 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -86,6 +86,15 @@ config PREEMPT default y depends on PREEMPT_DESKTOP || PREEMPT_RT +config DISABLE_PREEMPT_SPINLOCK_WAITERS +bool + default n + +config PREEMPT_SPINLOCK_WAITERS +bool + default y + depends on PREEMPT && SMP && !DISABLE_PREEMPT_SPINLOCK_WAITERS + config PREEMPT_SOFTIRQS bool "Thread Softirqs" default n diff --git a/kernel/spinlock.c b/kernel/spinlock.c index b0e7f02..2e6a904 100644 --- a/kernel/spinlock.c +++ b/kernel/spinlock.c @@ -116,8 +116,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave); * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are * not re-enabled during lock-acquire (which the preempt-spin-ops do): */ -#if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \ - defined(CONFIG_DEBUG_LOCK_ALLOC) +#if !defined(CONFIG_PREEMPT_SPINLOCK_WAITERS) void __lockfunc __read_lock(raw_rwlock_t *lock) { @@ -244,7 +243,7 @@ void __lockfunc __write_lock(raw_rwlock_t *lock) EXPORT_SYMBOL(__write_lock); -#else /* CONFIG_PREEMPT: */ +#else /* CONFIG_PREEMPT_SPINLOCK_WAITERS */ /* * This could be a long-held lock. 
We both prepare to spin for a long @@ -334,7 +333,7 @@ BUILD_LOCK_OPS(spin, raw_spinlock); BUILD_LOCK_OPS(read, raw_rwlock); BUILD_LOCK_OPS(write, raw_rwlock); -#endif /* CONFIG_PREEMPT */ +#endif /* CONFIG_PREEMPT_SPINLOCK_WAITERS */ #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9208791..f2889b2 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -233,6 +233,7 @@ config DEBUG_LOCK_ALLOC bool "Lock debugging: detect incorrect freeing of live locks" depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT select DEBUG_SPINLOCK + select DISABLE_PREEMPT_SPINLOCK_WAITERS select DEBUG_MUTEXES select LOCKDEP help -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 00/14] RFC - adaptive real-time locks
The Real Time patches to the Linux kernel convert the architecture specific SMP-synchronization primitives commonly referred to as "spinlocks" to an "RT mutex" implementation that supports a priority inheritance protocol, and priority-ordered wait queues. The RT mutex implementation allows tasks that would otherwise busy-wait for a contended lock to be preempted by higher priority tasks without compromising the integrity of critical sections protected by the lock. The unintended side-effect is that the -rt kernel suffers from significant degradation of IO throughput (disk and net) due to the extra overhead associated with managing pi-lists and context switching. This has been generally accepted as a price to pay for low-latency preemption. Our research indicates that it doesn't necessarily have to be this way. This patch set introduces an adaptive technology that retains both the priority inheritance protocol as well as the preemptive nature of spinlocks and mutexes and adds a 300+% throughput increase to the Linux Real time kernel. It applies to 2.6.24-rt1. These performance increases apply to disk IO as well as netperf UDP benchmarks, without compromising RT preemption latency. For more complex applications, overall the I/O throughput seems to approach the throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP Kernel, as is shipped by most distros. Essentially, the RT Mutex has been modified to busy-wait under contention for a limited (and configurable) time. This works because most locks are typically held for very short time spans. Too often, by the time a task goes to sleep on a mutex, the mutex is already being released on another CPU. The effect (on SMP) is that by polling a mutex for a limited time we reduce context switch overhead by up to 90%, and therefore eliminate CPU cycles as well as massive hot-spots in the scheduler / other bottlenecks in the Kernel - even though we busy-wait (using CPU cycles) to poll the lock. 
We have put together some data from different types of benchmarks for this patch series, which you can find here: ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock 2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock (2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core, dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. Some tests show a marked improvement (for instance, dbench and hackbench), whereas in some others (make -j 128) the results were not as profound but they were still net-positive. In all cases we have also verified that deterministic latency is not impacted by using cyclic-test. This patch series also includes some re-work on the raw_spinlock infrastructure, including Nick Piggin's x86-ticket-locks. We found that the increased pressure on the lock->wait_locks could cause rare but serious latency spikes that are fixed by a fifo raw_spinlock_t implementation. Nick was gracious enough to allow us to re-use his work (which is already accepted in 2.6.25). Note that we also have a C version of his protocol available if other architectures need fifo-lock support as well, which we will gladly post upon request. Special thanks go to many people who were instrumental to this project, including: *) the -rt team here at Novell for research, development, and testing. *) Nick Piggin for his invaluable consultation/feedback and use of his x86-ticket-locks. *) The reviewers/testers at Suse, Montavista, and Bill Huey for their time and feedback on the early versions of these patches. As always, comments/feedback/bug-fixes are welcome. Regards, -Greg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
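As a rough illustration of the adaptive idea described in this cover letter, here is a toy userspace model: spin for a bounded number of iterations, then fall back to blocking. This is emphatically not the patch's rtmutex code; the names and spin budget are invented, the "sleep" is modeled by nanosleep() retries rather than enqueueing on a wait list and calling schedule(), and a real implementation would also stop spinning early when the lock owner itself sleeps.

```c
#include <stdatomic.h>
#include <time.h>

/* Toy model of an adaptive lock: spin for a bounded number of tries
 * (cf. the rtlock_timeout tunable), then fall back to a sleep path. */
struct adaptive_lock { atomic_int held; };

#define SPIN_BUDGET 1000        /* hypothetical, stands in for the tunable */

/* Returns 1 if the spin path won, 0 if we had to fall back to sleeping. */
static int adaptive_acquire(struct adaptive_lock *l)
{
        int free;

        for (int i = 0; i < SPIN_BUDGET; i++) {
                free = 0;
                if (atomic_compare_exchange_strong(&l->held, &free, 1))
                        return 1;               /* got it while spinning */
        }

        /* Budget exhausted: block. Modeled by sleeping between retries;
         * the kernel instead enqueues the waiter and calls schedule(). */
        for (;;) {
                struct timespec ts = { 0, 1000000 }; /* 1 ms */

                free = 0;
                if (atomic_compare_exchange_strong(&l->held, &free, 1))
                        return 0;
                nanosleep(&ts, NULL);
        }
}

static void adaptive_release(struct adaptive_lock *l)
{
        atomic_store(&l->held, 0);
}
```

The throughput argument in the cover letter corresponds to the first loop: when the hold time is shorter than a context switch, the waiter acquires the lock on the spin path and the sleep/wakeup round trip is avoided entirely.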
[PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled
The logic is currently broken so that PREEMPT_RT disables preemptible spinlock waiters, which is counter intuitive. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/spinlock.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/spinlock.c b/kernel/spinlock.c index c9bcf1b..b0e7f02 100644 --- a/kernel/spinlock.c +++ b/kernel/spinlock.c @@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave); * not re-enabled during lock-acquire (which the preempt-spin-ops do): */ #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \ - defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT) + defined(CONFIG_DEBUG_LOCK_ALLOC) void __lockfunc __read_lock(raw_rwlock_t *lock) { -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH [RT] 07/14] adaptive real-time lock support
There are pros and cons when deciding between the two basic forms of locking primitives (spinning vs sleeping). Without going into great detail on either one, we note that spinlocks have the advantage of lower overhead for short hold locks. However, they also have a con in that they create indeterminate latencies since preemption must traditionally be disabled while the lock is held (to prevent deadlock). We want to avoid non-deterministic critical sections in -rt. Therefore, when realtime is enabled, most contexts are converted to threads, and likewise most spinlock_ts are converted to sleepable rt-mutex derived locks. This allows the holder of the lock to remain fully preemptible, thus reducing a major source of latencies in the kernel. However, converting what was once a true spinlock into a sleeping lock may also decrease performance since the locks will now sleep under contention. Since the fundamental lock used to be a spinlock, it is highly likely that it was used in a short-hold path and that release is imminent. Therefore sleeping only serves to cause context-thrashing. Adaptive RT locks use a hybrid approach to solve the problem. They spin when possible, and sleep when necessary (to avoid deadlock, etc). This significantly improves many areas of the performance of the -rt kernel. Signed-off-by: Gregory Haskins [EMAIL PROTECTED] Signed-off-by: Peter Morreale [EMAIL PROTECTED] Signed-off-by: Sven Dietrich [EMAIL PROTECTED] --- kernel/Kconfig.preempt| 20 +++ kernel/rtmutex.c | 19 +- kernel/rtmutex_adaptive.h | 134 + 3 files changed, 168 insertions(+), 5 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 5b45213..6568519 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -192,6 +192,26 @@ config RCU_TRACE Say Y/M here if you want to enable RCU tracing in-kernel/module. Say N if you are unsure. 
+config ADAPTIVE_RTLOCK +bool "Adaptive real-time locks" + default y + depends on PREEMPT_RT && SMP + help +PREEMPT_RT allows for greater determinism by transparently +converting normal spinlock_ts into preemptible rtmutexes which +sleep any waiters under contention. However, in many cases the +lock will be released in less time than it takes to context +switch. Therefore, the sleep under contention policy may also +degrade throughput performance due to the extra context switches. + +This option alters the rtmutex derived spinlock_t replacement +code to use an adaptive spin/sleep algorithm. It will spin +unless it determines it must sleep to avoid deadlock. This +offers a best of both worlds solution since we achieve both +high-throughput and low-latency. + +If unsure, say Y + config SPINLOCK_BKL bool "Old-Style Big Kernel Lock" depends on (PREEMPT || SMP) && !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index cb27b08..feb938f 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -7,6 +7,7 @@ * Copyright (C) 2005-2006 Timesys Corp., Thomas Gleixner <[EMAIL PROTECTED]> * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen + * Copyright (C) 2008 Novell, Inc. * * See Documentation/rt-mutex-design.txt for details. */ @@ -17,6 +18,7 @@ #include <linux/hardirq.h> #include "rtmutex_common.h" +#include "rtmutex_adaptive.h" /* * lock->owner state tracking: @@ -697,6 +699,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) { struct rt_mutex_waiter waiter; unsigned long saved_state, state, flags; + DECLARE_ADAPTIVE_WAITER(adaptive); debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; @@ -743,6 +746,8 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) continue; } + prepare_adaptive_wait(lock, &adaptive); + /* * Prevent schedule() to drop BKL, while waiting for * the lock ! We restore lock_depth when we come back. 
@@ -754,11 +759,15 @@ rt_spin_lock_slowlock(struct rt_mutex *lock) debug_rt_mutex_print_deadlock(&waiter); - update_current(TASK_UNINTERRUPTIBLE, &saved_state); - if (waiter.task) - schedule_rt_mutex(lock); - else - update_current(TASK_RUNNING_MUTEX, &saved_state); + /* adaptive_wait() returns 1 if we need to sleep */ + if (adaptive_wait(lock, &waiter, &adaptive)) { + update_current(TASK_UNINTERRUPTIBLE, &saved_state); + if (waiter.task) + schedule_rt_mutex(lock); + else + update_current(TASK_RUNNING_MUTEX
[PATCH [RT] 09/14] adaptive mutexes
From: Peter W. Morreale <[EMAIL PROTECTED]> This patch adds the adaptive spin lock busywait to rtmutexes. It adds a new tunable: rtmutex_timeout, which is the companion to the rtlock_timeout tunable. Signed-off-by: Peter W. Morreale <[EMAIL PROTECTED]> --- kernel/Kconfig.preempt| 37 + kernel/rtmutex.c | 44 ++-- kernel/rtmutex_adaptive.h | 32 ++-- kernel/sysctl.c | 10 ++ 4 files changed, 103 insertions(+), 20 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index eebec19..d2b0daa 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -223,6 +223,43 @@ config RTLOCK_DELAY tunable at runtime via a sysctl. A setting of 0 (zero) disables the adaptive algorithm entirely. +config ADAPTIVE_RTMUTEX +bool "Adaptive real-time mutexes" +default y +depends on ADAPTIVE_RTLOCK +help + This option adds the adaptive rtlock spin/sleep algorithm to + rtmutexes. In rtlocks, a significant gain in throughput + can be seen by allowing rtlocks to spin for a distinct + amount of time prior to going to sleep for deadlock avoidance. + + Typically, mutexes are used when a critical section may need to + sleep due to a blocking operation. In the event the critical +section does not need to sleep, an additional gain in throughput +can be seen by avoiding the extra overhead of sleeping. + + This option alters the rtmutex code to use an adaptive + spin/sleep algorithm. It will spin unless it determines it must + sleep to avoid deadlock. This offers a best of both worlds + solution since we achieve both high-throughput and low-latency. + + If unsure, say Y + +config RTMUTEX_DELAY +int "Default delay (in loops) for adaptive mutexes" +range 0 1000 +depends on ADAPTIVE_RTMUTEX +default 3000 +help + This allows you to specify the maximum delay a task will use +to wait for a rt mutex before going to sleep. 
	  Note that although the delay is implemented as a preemptable
+	  loop, tasks of like priority cannot preempt each other and this
+	  setting can result in increased latencies.
+
+	  The value is tunable at runtime via a sysctl. A setting of 0
+	  (zero) disables the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
 	bool "Old-Style Big Kernel Lock"
 	depends on (PREEMPT || SMP) && !PREEMPT_RT

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4a7423f..a7ed7b2 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -24,6 +24,10 @@
 int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #endif
 
+#ifdef CONFIG_ADAPTIVE_RTMUTEX
+int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
@@ -521,17 +525,16 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 * Do the wakeup before the ownership change to give any spinning
 	 * waiter grantees a headstart over the other threads that will
 	 * trigger once owner changes.
+	 *
+	 * This may appear to be a race, but the barriers close the
+	 * window.
 	 */
-	if (!savestate)
-		wake_up_process(pendowner);
-	else {
-		smp_mb();
-		/*
-		 * This may appear to be a race, but the barriers close the
-		 * window.
-		 */
-		if ((pendowner->state != TASK_RUNNING) &&
-		    (pendowner->state != TASK_RUNNING_MUTEX))
+	smp_mb();
+	if ((pendowner->state != TASK_RUNNING) &&
+	    (pendowner->state != TASK_RUNNING_MUTEX)) {
+		if (!savestate)
+			wake_up_process(pendowner);
+		else
 			wake_up_process_mutex(pendowner);
 	}

@@ -764,7 +767,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 		debug_rt_mutex_print_deadlock(&waiter);

 		/* adaptive_wait() returns 1 if we need to sleep */
-		if (adaptive_wait(lock, &waiter, &adaptive)) {
+		if (adaptive_wait(lock, 0, &waiter, &adaptive)) {
 			update_current(TASK_UNINTERRUPTIBLE, &saved_state);
 			if (waiter.task)
 				schedule_rt_mutex(lock);

@@ -975,6 +978,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	int ret = 0, saved_lock_depth = -1;
 	struct rt_mutex_waiter waiter;
 	unsigned long flags;
+	DECLARE_ADAPTIVE_MUTEX_WAITER(adaptive);
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;

@@ -995,8 +999,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	if (unlikely(current->lock_depth >= 0))
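For readers skimming the diffs, the spin-then-sleep decision reduces to one loop: keep polling while the owner is still on a CPU, and give up (sleep) when the owner blocks or the lock stays held too long. A hedged user-space model of that policy follows; `struct sim_lock`, `release_after`, and `adaptive_wait_sim()` are illustrative stand-ins, not the kernel's rtmutex types:

```c
#include <stddef.h>

/*
 * Toy model of the adaptive wait policy: the lock is "released" by its
 * owner after release_after polls, and owner_running says whether the
 * owner is still on a CPU (spinning is only useful while it is).
 */
struct sim_lock {
	int owner;          /* 1 while somebody holds the lock */
	int owner_running;  /* 1 while the owner is on a CPU   */
	int release_after;  /* polls until the owner unlocks   */
};

/* Returns 1 if the waiter should sleep, 0 if spinning paid off. */
static int adaptive_wait_sim(struct sim_lock *l, int timeout)
{
	for (; timeout > 0; timeout--) {
		if (l->release_after-- <= 0)
			l->owner = 0;      /* owner dropped the lock */
		if (l->owner == 0)
			return 0;          /* lock freed while we spun */
		if (!l->owner_running)
			return 1;          /* owner scheduled out: sleep */
	}
	return 1;                          /* spin budget exhausted */
}
```

The real adaptive_wait() additionally breaks out when the waiter is re-awoken (per the comment in the diff above), but the exit conditions sketched here are the heart of it.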
[PATCH [RT] 10/14] adjust pi_lock usage in wakeup
From: Peter W. Morreale [EMAIL PROTECTED]

In wakeup_next_waiter(), we take the pi_lock, and then find out whether we
have another waiter to add to the pending owner. We can reduce contention on
the pi_lock for the pending owner if we first obtain the pointer to the next
waiter outside of the pi_lock.

This patch adds a measurable increase in throughput.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---
 kernel/rtmutex.c | 14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a7ed7b2..122f143 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -505,6 +505,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 {
 	struct rt_mutex_waiter *waiter;
 	struct task_struct *pendowner;
+	struct rt_mutex_waiter *next;
 
 	spin_lock(&current->pi_lock);

@@ -549,6 +550,12 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 * waiter with higher priority than pending-owner->normal_prio
 	 * is blocked on the unboosted (pending) owner.
 	 */
+
+	if (rt_mutex_has_waiters(lock))
+		next = rt_mutex_top_waiter(lock);
+	else
+		next = NULL;
+
 	spin_lock(&pendowner->pi_lock);
 
 	WARN_ON(!pendowner->pi_blocked_on);

@@ -557,12 +564,9 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	pendowner->pi_blocked_on = NULL;
 
-	if (rt_mutex_has_waiters(lock)) {
-		struct rt_mutex_waiter *next;
-
-		next = rt_mutex_top_waiter(lock);
+	if (next)
 		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-	}
+
 	spin_unlock(&pendowner->pi_lock);
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
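The locking pattern in this patch generalizes: do whatever lookups you can under a lock you already hold (here, lock->wait_lock), so the contended lock protects only the final update. A hedged user-space sketch with pthreads; the array scan stands in for rt_mutex_top_waiter() and all names are illustrative:

```c
#include <pthread.h>
#include <stddef.h>

struct waiter { int prio; };      /* lower value = higher priority */

static pthread_mutex_t pendowner_pi_lock = PTHREAD_MUTEX_INITIALIZER;
static struct waiter *pi_top;     /* stand-in for the pi_waiters top entry */

static void hand_off_top_waiter(struct waiter *waiters, size_t n)
{
	struct waiter *next = NULL;
	size_t i;

	/* Lookup done before taking pendowner_pi_lock (the patch's point:
	 * the caller already holds lock->wait_lock, which is enough). */
	for (i = 0; i < n; i++)
		if (!next || waiters[i].prio < next->prio)
			next = &waiters[i];

	/* Only the cheap pointer update runs under the contended lock. */
	pthread_mutex_lock(&pendowner_pi_lock);
	pi_top = next;
	pthread_mutex_unlock(&pendowner_pi_lock);
}
```

The hold time of `pendowner_pi_lock` no longer includes the top-waiter search, which is exactly the contention reduction the changelog claims.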
[PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
Decorate the printk path with an unlikely()

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
 kernel/rtmutex.c | 8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 122f143..ebdaa17 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock,
 		void fastcall (*slowfn)(struct rt_mutex *lock))
 {
 	/* Temporary HACK! */
-	if (!current->in_printk)
-		might_sleep();
-	else if (in_atomic() || irqs_disabled())
+	if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
 		/* don't grab locks for printk in atomic */
 		return;
 
+	might_sleep();
+
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
 		rt_mutex_deadlock_account_lock(lock, current);
 	else

@@ -677,7 +677,7 @@ rt_spin_lock_fastunlock(struct rt_mutex *lock,
 		void fastcall (*slowfn)(struct rt_mutex *lock))
 {
 	/* Temporary HACK! */
-	if (current->in_printk && (in_atomic() || irqs_disabled()))
+	if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))
 		/* don't grab locks for printk in atomic */
 		return;
[PATCH [RT] 12/14] remove the extra call to try_to_take_lock
From: Peter W. Morreale [EMAIL PROTECTED]

Remove the redundant attempt to get the lock. While it is true that the exit
path with this patch adds an unnecessary xchg (in the event the lock is
granted without further traversal in the loop) experimentation shows that we
almost never encounter this situation.

Signed-off-by: Peter W. Morreale [EMAIL PROTECTED]
---
 kernel/rtmutex.c | 6 --
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index ebdaa17..95c3644 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -718,12 +718,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
-	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
-		spin_unlock_irqrestore(&lock->wait_lock, flags);
-		return;
-	}
-
 	BUG_ON(rt_mutex_owner(lock) == current);
[PATCH [RT] 08/14] add a loop counter based timeout mechanism
From: Sven Dietrich [EMAIL PROTECTED]

Signed-off-by: Sven Dietrich [EMAIL PROTECTED]
---
 kernel/Kconfig.preempt    | 11 +++
 kernel/rtmutex.c          |  4 
 kernel/rtmutex_adaptive.h | 11 +--
 kernel/sysctl.c           | 12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK
 	  If unsure, say Y
 
+config RTLOCK_DELAY
+	int "Default delay (in loops) for adaptive rtlocks"
+	range 0 10
+	depends on ADAPTIVE_RTLOCK
+	default 1
+	help
+	  This allows you to specify the maximum attempts a task will spin
+	  attempting to acquire an rtlock before sleeping. The value is
+	  tunable at runtime via a sysctl. A setting of 0 (zero) disables
+	  the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
 	bool "Old-Style Big Kernel Lock"
 	depends on (PREEMPT || SMP) && !PREEMPT_RT

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *

diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
 	struct task_struct *owner;
+	int timeout;
 };
 
 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 {
 	int sleep = 0;
 
-	for (;;) {
+	for (; adaptive->timeout > 0; adaptive->timeout--) {
 		/*
 		 * If the task was re-awoken, break out completely so we can
 		 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 		cpu_relax();
 	}
 
+	if (adaptive->timeout <= 0)
+		sleep = 1;
+
 	put_task_struct(adaptive->owner);
 
 	return sleep;

@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
 	get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
-	struct adaptive_waiter name = { .owner = NULL, }
+	struct adaptive_waiter name = { .owner = NULL, \
+					.timeout = rtlock_timeout, }
 
 #else

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include <asm/stacktrace.h>
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "rtlock_timeout",
+		.data		= &rtlock_timeout,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_PROC_FS
 	{
 		.ctl_name	= CTL_UNNUMBERED,
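Mechanically, the patch just seeds a per-waiter spin budget from a runtime-tunable default and decrements it on every poll. A minimal user-space sketch (spin_with_budget() and the plain int "lock" flag are illustrative; the kernel loop also inspects the owner's state each iteration):

```c
/* Stand-in for the sysctl-backed default added by this patch. */
static int rtlock_timeout = 10000;

struct adaptive_waiter { int timeout; };

/* Each waiter starts with the current default as its budget. */
#define DECLARE_ADAPTIVE_WAITER(name) \
	struct adaptive_waiter name = { .timeout = rtlock_timeout }

/* Poll *lock_free until it goes nonzero or the budget runs out;
 * returns 1 when the caller should fall back to sleeping. */
static int spin_with_budget(const int *lock_free, struct adaptive_waiter *aw)
{
	for (; aw->timeout > 0; aw->timeout--) {
		if (*lock_free)
			return 0;
	}
	return 1;
}
```

Writing a new value through the sysctl then changes the budget for every subsequent waiter without a rebuild, which is the point of making the default runtime-tunable.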
[PATCH [RT] 13/14] allow rt-mutex lock-stealing to include lateral priority
The current logic only allows lock stealing to occur if the current task is of
higher priority than the pending owner. We can gain significant throughput
improvements (200%+) by allowing the lock-stealing code to include tasks of
equal priority. The theory is that the system will make faster progress by
allowing the task already on the CPU to take the lock rather than waiting for
the system to wake up a different task.

This does add a degree of unfairness, yes. But also note that the users of
these locks under non-rt environments have already been using unfair raw
spinlocks anyway so the tradeoff is probably worth it.

The way I like to think of this is that higher priority tasks should clearly
preempt, and lower priority tasks should clearly block. However, if tasks
have an identical priority value, then we can think of the scheduler
decisions as the tie-breaking parameter. (e.g. tasks that the scheduler
picked to run first have a logically higher priority among tasks of the same
prio). This helps to keep the system primed with tasks doing useful work,
and the end result is higher throughput.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
 kernel/Kconfig.preempt | 10 ++
 kernel/rtmutex.c       | 31 +++
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index d2b0daa..343b93c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -273,3 +273,13 @@ config SPINLOCK_BKL
 	  Say Y here if you are building a kernel for a desktop system.
 	  Say N if you are unsure.
 
+config RTLOCK_LATERAL_STEAL
+	bool "Allow equal-priority rtlock stealing"
+	default y
+	depends on PREEMPT_RT
+	help
+	  This option alters the rtlock lock-stealing logic to allow
+	  equal priority tasks to preempt a pending owner in addition
+	  to higher priority tasks. This allows for a significant
+	  boost in throughput under certain circumstances at the expense
+	  of strict FIFO lock access.
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 95c3644..da077e5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -323,12 +323,27 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	return ret;
 }
 
+static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
+{
+#ifndef CONFIG_RTLOCK_LATERAL_STEAL
+	if (current->prio >= pendowner->prio)
+#else
+	if (current->prio > pendowner->prio)
+		return 0;
+
+	if (!unfair && (current->prio == pendowner->prio))
+#endif
+		return 0;
+
+	return 1;
+}
+
 /*
  * Optimization: check if we can steal the lock from the
  * assigned pending owner [which might not have taken the
  * lock yet]:
  */
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 {
 	struct task_struct *pendowner = rt_mutex_owner(lock);
 	struct rt_mutex_waiter *next;
@@ -340,7 +355,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
 		return 1;
 
 	spin_lock(&pendowner->pi_lock);
-	if (current->prio >= pendowner->prio) {
+	if (!lock_is_stealable(pendowner, unfair)) {
 		spin_unlock(&pendowner->pi_lock);
 		return 0;
 	}
@@ -393,7 +408,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
  *
  * Must be called with lock->wait_lock held.
  */
-static int try_to_take_rt_mutex(struct rt_mutex *lock)
+static int try_to_take_rt_mutex(struct rt_mutex *lock, int unfair)
 {
 	/*
 	 * We have to be careful here if the atomic speedups are
@@ -416,7 +431,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
 	 */
 	mark_rt_mutex_waiters(lock);
 
-	if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+	if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, unfair))
 		return 0;
 
 	/* We got the lock.
 */
@@ -737,7 +752,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	int saved_lock_depth = current->lock_depth;
 
 	/* Try to acquire the lock */
-	if (try_to_take_rt_mutex(lock))
+	if (try_to_take_rt_mutex(lock, 1))
 		break;
 	/*
 	 * waiter.task is NULL the first time we come here and
@@ -985,7 +1000,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	init_lists(lock);
 
 	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
+	if (try_to_take_rt_mutex(lock, 0)) {
 		spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return 0;
 	}
@@ -1006,7 +1021,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	unsigned long saved_flags;
 
 	/* Try to acquire the lock: */
-	if (try_to_take_rt_mutex(lock
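The steal decision itself is a pure predicate, which makes it easy to model. In the sketch below, plain ints stand in for task_struct priorities (remember: lower value means higher priority), and the runtime toggle that patch 14/14 later adds is folded in as a global; all names are illustrative, not the kernel's:

```c
static int lateral_steal_enabled = 1;  /* models the later sysctl toggle */

/*
 * May "current" (cur_prio) steal the lock from the pending owner
 * (pend_prio)?  unfair is 1 on the rtlock path, 0 on the rtmutex path.
 */
static int lock_is_stealable(int cur_prio, int pend_prio, int unfair)
{
	if (cur_prio > pend_prio)
		return 0;    /* strictly lower priority: never steal */
	if (cur_prio == pend_prio && !(unfair && lateral_steal_enabled))
		return 0;    /* equal priority: only lateral stealing */
	return 1;            /* strictly higher priority: always steal */
}
```

Higher-priority tasks still always steal and lower-priority tasks never do; only the equal-priority tie is affected, which is the "scheduler as tie-breaker" argument from the changelog.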
[PATCH [RT] 14/14] sysctl for runtime-control of lateral mutex stealing
From: Sven-Thorsten Dietrich [EMAIL PROTECTED]

Add /proc/sys/kernel/lateral_steal, to allow switching on and off
equal-priority mutex stealing between threads.

Signed-off-by: Sven-Thorsten Dietrich [EMAIL PROTECTED]
---
 kernel/rtmutex.c |  8 ++--
 kernel/sysctl.c  | 14 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index da077e5..62e7af5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -27,6 +27,9 @@ int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
 #ifdef CONFIG_ADAPTIVE_RTMUTEX
 int rtmutex_timeout __read_mostly = CONFIG_RTMUTEX_DELAY;
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+int rtmutex_lateral_steal __read_mostly = 1;
+#endif
 
 /*
  * lock->owner state tracking:
@@ -331,7 +334,8 @@ static inline int lock_is_stealable(struct task_struct *pendowner, int unfair)
 	if (current->prio > pendowner->prio)
 		return 0;
 
-	if (!unfair && (current->prio == pendowner->prio))
+	if (unlikely(current->prio == pendowner->prio) &&
+	    !(unfair && rtmutex_lateral_steal))
 #endif
 		return 0;
 
@@ -355,7 +359,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock, int unfair)
 		return 1;
 
 	spin_lock(&pendowner->pi_lock);
-	if (!lock_is_stealable(pendowner, unfair)) {
+	if (!lock_is_stealable(pendowner, (unfair && rtmutex_lateral_steal))) {
 		spin_unlock(&pendowner->pi_lock);
 		return 0;
 	}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3465af2..c1a1c6d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,10 @@ extern struct ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+extern int rtmutex_lateral_steal;
+#endif
+
 extern int prove_locking;
 extern int lock_stat;
 
@@ -986,6 +990,16 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_RTLOCK_LATERAL_STEAL
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "rtmutex_lateral_steal",
+		.data		= &rtmutex_lateral_steal,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_PROC_FS
 	{
 		.ctl_name	= CTL_UNNUMBERED,
[PATCH [RT] 00/14] RFC - adaptive real-time locks
The Real Time patches to the Linux kernel convert the architecture-specific SMP synchronization primitives commonly referred to as "spinlocks" to an RT mutex implementation that supports a priority-inheritance protocol and priority-ordered wait queues. The RT mutex implementation allows tasks that would otherwise busy-wait for a contended lock to be preempted by higher priority tasks without compromising the integrity of critical sections protected by the lock. The unintended side-effect is that the -rt kernel suffers from significant degradation of IO throughput (disk and net) due to the extra overhead associated with managing pi-lists and context switching. This has been generally accepted as a price to pay for low-latency preemption.

Our research indicates that it doesn't necessarily have to be this way. This patch set introduces an adaptive technology that retains both the priority inheritance protocol as well as the preemptive nature of spinlocks and mutexes, and adds a 300+% throughput increase to the Linux Real Time kernel. It applies to 2.6.24-rt1. These performance increases apply to disk IO as well as netperf UDP benchmarks, without compromising RT preemption latency. For more complex applications, overall the I/O throughput seems to approach the throughput on a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP kernel, as is shipped by most distros.

Essentially, the RT mutex has been modified to busy-wait under contention for a limited (and configurable) time. This works because most locks are typically held for very short time spans. Too often, by the time a task goes to sleep on a mutex, the mutex is already being released on another CPU. The effect (on SMP) is that by polling a mutex for a limited time we reduce context-switch overhead by up to 90%, and therefore eliminate CPU cycles as well as massive hot-spots in the scheduler / other bottlenecks in the kernel - even though we busy-wait (using CPU cycles) to poll the lock.
We have put together some data from different types of benchmarks for this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock 2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock (2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core, dual-socket) 2GHz 5130 Xeon (core2duo-woodcrest) Dell Precision 490. Some tests show a marked improvement (for instance, dbench and hackbench), whereas in some others (make -j 128) the results were not as profound but they were still net-positive. In all cases we have also verified that deterministic latency is not impacted by using cyclic-test.

This patch series also includes some re-work on the raw_spinlock infrastructure, including Nick Piggin's x86-ticket-locks. We found that the increased pressure on the lock->wait_lock could cause rare but serious latency spikes that are fixed by a fifo raw_spinlock_t implementation. Nick was gracious enough to allow us to re-use his work (which is already accepted in 2.6.25). Note that we also have a C version of his protocol available if other architectures need fifo-lock support as well, which we will gladly post upon request.

Special thanks go to many people who were instrumental to this project, including:

*) the -rt team here at Novell for research, development, and testing.
*) Nick Piggin for his invaluable consultation/feedback and use of his x86-ticket-locks.
*) The reviewers/testers at Suse, Montavista, and Bill Huey for their time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg
[PATCH [RT] 01/14] spinlocks: fix preemption feature when PREEMPT_RT is enabled
The logic is currently broken so that PREEMPT_RT disables preemptible spinlock waiters, which is counter-intuitive.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---
 kernel/spinlock.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/spinlock.c b/kernel/spinlock.c
index c9bcf1b..b0e7f02 100644
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave);
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
 #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \
-	defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT)
+	defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 void __lockfunc __read_lock(raw_rwlock_t *lock)
 {
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 10:26 AM, in message [EMAIL PROTECTED], Gregory Haskins [EMAIL PROTECTED] wrote:

We have put together some data from different types of benchmarks for this patch series, which you can find here: ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also placed a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg
Re: [PATCH [RT] 11/14] optimize the !printk fastpath through the lock acquisition
On Thu, Feb 21, 2008 at 11:36 AM, in message [EMAIL PROTECTED], Andi Kleen [EMAIL PROTECTED] wrote:

On Thursday 21 February 2008 16:27:22 Gregory Haskins wrote:
@@ -660,12 +660,12 @@ rt_spin_lock_fastlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock))
{
	/* Temporary HACK! */
-	if (!current->in_printk)
-		might_sleep();
-	else if (in_atomic() || irqs_disabled())
+	if (unlikely(current->in_printk) && (in_atomic() || irqs_disabled()))

I have my doubts that gcc will honor unlikelies that don't affect the complete condition of an if. Also conditions guarding returns are by default predicted unlikely anyways AFAIK. The patch is likely a nop.

Yeah, you are probably right. We have found that the system is *extremely* touchy on how much overhead we have in the lock-acquisition path. For instance, using a non-inline version of adaptive_wait() can cost 5-10% in disk-io throughput. So we were trying to find places to shave anywhere we could. That being said, I didn't record any difference from this patch, so you are probably exactly right. It just seemed like the right thing to do so I left it in.

-Greg
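For context, the kernel's likely()/unlikely() annotations are thin wrappers around GCC's __builtin_expect(): they bias basic-block placement but leave the value unchanged, which is also why a hint buried inside one operand of a larger condition (Andi's point) can end up having no effect. A compilable sketch; fastpath() is an illustrative stand-in, not the actual rtmutex code:

```c
/* Roughly what the kernel's macros expand to. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int fastpath(int in_printk, int atomic_ctx)
{
	/* The hint only covers the first operand of the &&, so the
	 * compiler is free to predict the combined branch as it likes. */
	if (unlikely(in_printk) && atomic_ctx)
		return -1;   /* rare slow path: skip taking locks */
	return 0;            /* common fast path */
}
```

Because __builtin_expect() is semantically the identity function, annotating (or mis-annotating) a branch can never change behavior, only code layout, which matches Greg's observation that the patch made no measurable difference.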
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, Feb 21, 2008 at 11:41 AM, in message [EMAIL PROTECTED], Andi Kleen [EMAIL PROTECTED] wrote:

+config RTLOCK_DELAY
+	int "Default delay (in loops) for adaptive rtlocks"
+	range 0 10
+	depends on ADAPTIVE_RTLOCK

I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough.

+	default 1

Perhaps you can expand how you came up with that default number?

Actually, the number doesn't seem to matter that much as long as it is sufficiently long enough to make timeouts rare. Most workloads will present some threshold for hold-time. You generally get the best performance if the value is at least as long as that threshold. Anything beyond that and there is no gain, but there doesn't appear to be a penalty either. So we picked 1 because we found it to fit that criteria quite well for our range of GHz class x86 machines. YMMV, but that is why it's configurable ;)

It looks suspiciously round, and worse, the actual spin time depends a lot on the CPU frequency (so e.g. a 3GHz CPU will likely behave quite differently from a 2GHz CPU)

Yeah, fully agree. We really wanted to use a time-value here but ran into various problems that have yet to be resolved. We have it on the todo list to express this in terms of ns so it at least will scale with the architecture.

Did you experiment with other spin times?

Of course ;)

Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research todo.

And at what point is real time behaviour visibly impacted?

Well, if we did our jobs correctly, RT behavior should *never* be impacted. *Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier. There is that threshold that affects the probability of timing out. Values lower than that threshold start to degrade throughput.
Values higher than that have no effect on throughput, but may drive the cpu utilization higher, which can theoretically impact tasks of equal or lesser priority by taking that resource away from them. To date, we have not observed any real-world implications of this, however.

Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iterations similar to what udelay does. That would be at least constant time.

I agree. We need to move in the direction of a time basis. The tradeoff is that it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). I think one of the (not-included) patches converts a nanosecond value from the sysctl to approximate loop-counts using the bogomips data. This was a decent compromise between the non-scaling loopcounts and the heavy-weight official timing APIs. We dropped it because we support older kernels which were conflicting with the patch. We may have to resurrect it, however.

-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 4:24 PM, in message [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED] wrote:

hm. Why is the ticket spinlock patch included in this patchset? It just skews your performance results unnecessarily. Ticket spinlocks are independent conceptually, they are already upstream in 2.6.25-rc2 and -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous. I included them because we found the patch series without them can cause spikes due to the newly introduced pressure on the (raw_spinlock_t)lock->wait_lock. You can run the adaptive-spin patches without them just fine (in fact, in many cases things run faster without them... dbench *thrives* on chaos). But you may also measure a cyclic-test spike if you do so. So I included them to present a complete package without spikes. I tried to explain that detail in the prologue, but most people probably fell asleep before they got to the end ;)

and if we take the ticket spinlock patch out of your series, the size of the patchset shrinks in half and touches only 200-300 lines of code ;-) Considering the total size of the -rt patchset: 652 files changed, 23830 insertions(+), 4636 deletions(-) we can regard it a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

regarding the concept: adaptive mutexes have been talked about in the past, but their advantage is not at all clear, that's why we haven't done them. It's definitely not an unambiguously win-win concept. So let's get some real marketing-free benchmarking done, and we are not just interested in the workloads where a bit of polling on contended locks helps, but we are also interested in workloads where the polling hurts ... And let's please do the comparisons without the ticket spinlock patch ...

I'm open to suggestion, and this was just a sample of the testing we have done.
We have thrown plenty of workloads at this patch series far beyond the slides I prepared in that URL, and they all seem to indicate a net positive improvement so far. Some of those results I cannot share due to NDA, and some I didn't share simply because I never formally collected the data like I did for these tests. If there is something you would like to see, please let me know and I will arrange for it to be executed if at all possible.

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 4:42 PM, in message [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED] wrote:

* Bill Huey (hui) [EMAIL PROTECTED] wrote: I came to the original conclusion that it wasn't originally worth it, but the dbench numbers published say otherwise. [...]

dbench is a notoriously unreliable and meaningless workload. It's being frowned upon by the VM and the IO folks.

I agree... it's a pretty weak benchmark. BUT, it does pound on dcache_lock and therefore was a good demonstration of the benefits of lower-contention overhead. Also note we also threw other tests in that PDF if you scroll to the subsequent pages.

If that's the only workload where spin-mutexes help, and if it's only a 3% improvement [of which it is unclear how much of that improvement was due to ticket spinlocks], then adaptive mutexes are probably not worth it.

Note that the 3% figure being thrown around was from a single patch within the series. We are actually getting a net average gain of 443% in dbench. And note that the number goes *up* when you remove the ticketlocks. The ticketlocks are there to prevent latency spikes, not improve throughput.

Also take a look at the hackbench numbers, which are particularly promising. We get a net average gain of 493% faster for RT10 based hackbench runs. The kernel build was only a small gain, but it was all gain nonetheless. We see similar results for any other workloads we throw at this thing. I will gladly run any test requested to which I have the ability to run, and I would encourage third party results as well.

I'd not exclude them fundamentally though, it's really the numbers that matter. The code is certainly simple enough (albeit the .config and sysctl controls are quite ugly and unacceptable - adaptive mutexes should really be ... adaptive, with no magic constants in .configs or else).

We can clean this up, per your suggestions.

But ... i'm somewhat sceptic, after having played with spin-a-bit mutexes before.
It's very subtle to get this concept to work. The first few weeks, we were getting 90% regressions ;) Then we had a breakthrough and started to get this thing humming along quite nicely.

Regards,
-Greg
Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
Peter Zijlstra wrote: On Fri, 2008-02-15 at 11:46 -0500, Gregory Haskins wrote: but perhaps you can convince me that it is not needed? (i.e. I am still not understanding how the timer guarantees the stability). ok, let me try again. So we take rq->lock, at this point we know rd is valid. We also know the timer is active. So when we release it, the last reference can be dropped and we end up in the hrtimer_cancel(), right before the kfree(). hrtimer_cancel() will wait for the timer to end. therefore delaying the kfree() until the running timer finished. Ok, I see it now. I agree that I think it is safe. Thanks! -Greg
Re: [RFC][PATCH 2/2] sched: fair-group: per root-domain load balancing
Peter Zijlstra wrote: Currently the lb_monitor will walk all the domains/cpus from a single cpu's timer interrupt. This will cause massive cache-trashing and cache-line bouncing on larger machines. Split the lb_monitor into root_domain (disjoint sched-domains). Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> CC: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/sched.c | 106 kernel/sched_fair.c |2 2 files changed, 59 insertions(+), 49 deletions(-) Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -357,8 +357,6 @@ struct lb_monitor { spinlock_t lock; }; -static struct lb_monitor lb_monitor; - /* * How frequently should we rebalance_shares() across cpus? * @@ -417,6 +415,9 @@ static void lb_monitor_wake(struct lb_mo if (hrtimer_active(_monitor->timer)) return; + /* +* XXX: rd->load_balance && weight(rd->span) > 1 +*/ if (nr_cpu_ids == 1) return; @@ -444,6 +445,11 @@ static void lb_monitor_init(struct lb_mo spin_lock_init(_monitor->lock); } + +static int lb_monitor_destroy(struct lb_monitor *lb_monitor) +{ + return hrtimer_cancel(_monitor->timer); +} #endif static void set_se_shares(struct sched_entity *se, unsigned long shares); @@ -607,6 +613,8 @@ struct root_domain { */ cpumask_t rto_mask; atomic_t rto_count; + + struct lb_monitor lb_monitor; }; /* @@ -6328,6 +6336,7 @@ static void rq_attach_root(struct rq *rq { unsigned long flags; const struct sched_class *class; + int active = 0; spin_lock_irqsave(>lock, flags); @@ -6342,8 +6351,14 @@ static void rq_attach_root(struct rq *rq cpu_clear(rq->cpu, old_rd->span); cpu_clear(rq->cpu, old_rd->online); - if (atomic_dec_and_test(_rd->refcount)) + if (atomic_dec_and_test(_rd->refcount)) { + /* +* sync with active timers. +*/ + active = lb_monitor_destroy(_rd->lb_monitor); + kfree(old_rd); Note that this works out to be a bug in my code on -rt since you cannot kfree() while the raw rq->lock is held. 
This isn't your problem, per se, but just a heads up that I might need to patch this area ASAP. + } } atomic_inc(>refcount); @@ -6358,6 +6373,9 @@ static void rq_attach_root(struct rq *rq class->join_domain(rq); } + if (active) + lb_monitor_wake(>lb_monitor); + spin_unlock_irqrestore(>lock, flags); } @@ -6367,6 +6385,8 @@ static void init_rootdomain(struct root_ cpus_clear(rd->span); cpus_clear(rd->online); + + lb_monitor_init(>lb_monitor); } static void init_defrootdomain(void) @@ -7398,10 +7418,6 @@ void __init sched_init(void) #ifdef CONFIG_SMP init_defrootdomain(); - -#ifdef CONFIG_FAIR_GROUP_SCHED - lb_monitor_init(_monitor); -#endif #endif init_rt_bandwidth(_rt_bandwidth, global_rt_period(), global_rt_runtime()); @@ -7631,11 +7647,11 @@ void set_curr_task(int cpu, struct task_ * distribute shares of all task groups among their schedulable entities, * to reflect load distribution across cpus. */ -static int rebalance_shares(struct sched_domain *sd, int this_cpu) +static int rebalance_shares(struct root_domain *rd, int this_cpu) { struct cfs_rq *cfs_rq; struct rq *rq = cpu_rq(this_cpu); - cpumask_t sdspan = sd->span; + cpumask_t sdspan = rd->span; int state = shares_idle; /* Walk thr' all the task groups that we have */ @@ -7685,50 +7701,12 @@ static int rebalance_shares(struct sched return state; } -static int load_balance_shares(struct lb_monitor *lb_monitor) +static void set_lb_monitor_timeout(struct lb_monitor *lb_monitor, int state) { - int i, cpu, state = shares_idle; u64 max_timeout = (u64)sysctl_sched_max_bal_int_shares * NSEC_PER_MSEC; u64 min_timeout = (u64)sysctl_sched_min_bal_int_shares * NSEC_PER_MSEC; u64 timeout; - /* Prevent cpus going down or coming up */ - /* get_online_cpus(); */ - /* lockout changes to doms_cur[] array */ - /* lock_doms_cur(); */ - /* -* Enter a rcu read-side critical section to safely walk rq->sd -* chain on various cpus and to walk task group list -* (rq->leaf_cfs_rq_list) in rebalance_shares(). 
-*/ - rcu_read_lock(); - - for (i = 0; i < ndoms_cur; i++) { - cpumask_t cpumap = doms_cur[i]; - struct sched_domain *sd = NULL, *sd_prev = NULL; - -
Re: [RFC][PATCH 0/2] reworking load_balance_monitor
>>> On Thu, Feb 14, 2008 at 1:15 PM, in message <[EMAIL PROTECTED]>, Paul Jackson <[EMAIL PROTECTED]> wrote: > Peter wrote of: >> the lack of rd->load_balance. > > Could you explain to me a bit what that means? > > Does this mean that the existing code would, by default (default being > a single sched domain, covering the entire system's CPUs) load balance > across the entire system, but with your rework, not so load balance > there? That seems unlikely. > > In any event, from my rather cpuset-centric perspective, there are only > two common cases to consider. > > 1. In the default case, build_sched_domains() gets called once, > at init, with a cpu_map of all non-isolated CPUs, and we should > forever after load balance across all those non-isolated CPUs. > > 2. In some carefully managed systems using the per-cpuset > 'sched_load_balance' flags, we tear down that first default > sched domain, by calling detach_destroy_domains() on it, and we > then setup some number of sched_domains (typically in the range > of two to ten, though I suppose we should design to scale to > hundreds of sched domains, on systems with thousands of CPUs) > by additional calls to build_sched_domains(), such that their > CPUs don't overlap (pairwise disjoint) and such that the union > of all their CPUs may, or may not, include all non-isolated CPUs > (some CPUs might be left 'out in the cold', intentionally, as > essentially additional isolated CPUs.) We would then expect load > balancing within each of these pair-wise disjoint sched domains, > but not between one of them and another. Hi Paul, I think it will still work as you describe. We create a new root-domain object for each pair-wise disjoint sched-domain. In your case (1) above, we would only have one instance of a root-domain which contains (of course) a single instance of the rd->load_balance object. This would, in fact operate like the global variable that Peter is suggesting it replace (IIUC). 
However, for case (2), we would instantiate a root-domain object per pairwise-disjoint sched-domain, and therefore each one would have its own instance of rd->load_balance. HTH -Greg
Re: [RFC][PATCH 0/2] reworking load_balance_monitor
>>> On Thu, Feb 14, 2008 at 10:57 AM, in message <[EMAIL PROTECTED]>, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > Hi, > > Here the current patches that rework load_balance_monitor. > > The main reason for doing this is to eliminate the wakeups the thing > generates, > esp. on an idle system. The bonus is that it removes a kernel thread. > > Paul, Gregory - the thing that bothers me most atm is the lack of > rd->load_balance. Should I introduce that (-rt ought to make use of that as > well) by way of copying from the top sched_domain when it gets created? With the caveat that I currently have not digested your patch series, this sounds like a reasonable approach. The root-domain effectively represents the top sched-domain anyway (with the additional attribute that it's a shared structure with all constituent cpus). I'll try to take a look at the series later today and get back to you with feedback. -Greg
Re: [PATCH 1/2] add task migration_disable critical section
>>> On Tue, Feb 12, 2008 at 2:22 PM, in message <[EMAIL PROTECTED]>, Steven Rostedt <[EMAIL PROTECTED]> wrote: > On Tue, 12 Feb 2008, Gregory Haskins wrote: > >> This patch adds a new critical-section primitive pair: >> >> "migration_disable()" and "migration_enable()" > > This is similar to what Mathieu once posted: > > http://lkml.org/lkml/2007/7/11/13 > > Not sure the arguments against (no time to read the thread again). But I'd > recommend that you read it. > > -- Steve Indeed, thanks for the link! At quick glance, the concept looks identical, though the implementations are radically different. -Greg
[PATCH 1/2] add task migration_disable critical section
This patch adds a new critical-section primitive pair: "migration_disable()" and "migration_enable()" This allows you to force a task to remain on the current cpu, while still remaining fully preemptible. This is a better alternative to modifying current->cpus_allowed because you dont have to worry about colliding with another entity also modifying the cpumask_t while in the critical section. In fact, modifying the cpumask_t while in the critical section is fully supported, but note that the behavior of set_cpus_allowed() has slightly different behavior. In the old code, the mask update was synchronous: e.g. the task would be on a legal cpu when the call returned. The new behavior makes this asynchronous if the task is currently in a migration-disabled critical section. The task will migrate to a legal cpu once the critical section ends. This concept will be used later in the series. Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- include/linux/init_task.h |1 + include/linux/sched.h |8 + kernel/fork.c |1 + kernel/sched.c| 70 - kernel/sched_rt.c |6 +++- 5 files changed, 70 insertions(+), 16 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 316a184..151197b 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -137,6 +137,7 @@ extern struct group_info init_groups; .usage = ATOMIC_INIT(2), \ .flags = 0,\ .lock_depth = -1, \ + .migration_disable_depth = 0, \ .prio = MAX_PRIO-20, \ .static_prio= MAX_PRIO-20, \ .normal_prio= MAX_PRIO-20, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index c87d46a..ab7768a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1109,6 +1109,7 @@ struct task_struct { unsigned int ptrace; int lock_depth; /* BKL lock depth */ + int migration_disable_depth; #ifdef CONFIG_SMP #ifdef __ARCH_WANT_UNLOCKED_CTXSW @@ -2284,10 +2285,17 @@ static inline void inc_syscw(struct task_struct *tsk) #ifdef CONFIG_SMP void migration_init(void); +int 
migration_disable(struct task_struct *tsk); +void migration_enable(struct task_struct *tsk); #else static inline void migration_init(void) { } +static inline int migration_disable(struct task_struct *tsk) +{ + return 0; +} +#define migration_enable(tsk) do {} while (0) #endif #endif /* __KERNEL__ */ diff --git a/kernel/fork.c b/kernel/fork.c index 8c00b55..7745937 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1127,6 +1127,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, INIT_LIST_HEAD(>cpu_timers[2]); p->posix_timer_list = NULL; p->lock_depth = -1; /* -1 = no lock */ + p->migration_disable_depth = 0; do_posix_clock_monotonic_gettime(>start_time); p->real_start_time = p->start_time; monotonic_to_bootbased(>real_start_time); diff --git a/kernel/sched.c b/kernel/sched.c index e6ad493..cf32000 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1231,6 +1231,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu) *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu); u64 clock_offset; + BUG_ON(p->migration_disable_depth); + clock_offset = old_rq->clock - new_rq->clock; #ifdef CONFIG_SCHEDSTATS @@ -1632,7 +1634,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex) if (unlikely(task_running(rq, p))) goto out_activate; - cpu = p->sched_class->select_task_rq(p, sync); + if (!p->migration_disable_depth) + cpu = p->sched_class->select_task_rq(p, sync); + if (cpu != orig_cpu) { set_task_cpu(p, cpu); task_rq_unlock(rq, ); @@ -5422,11 +5426,12 @@ static inline void sched_init_granularity(void) */ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask) { - struct migration_req req; unsigned long flags; struct rq *rq; int ret = 0; + migration_disable(p); + rq = task_rq_lock(p, ); if (!cpus_intersects(new_mask, cpu_online_map)) { ret = -EINVAL; @@ -5440,21 +5445,11 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask) p->nr_cpus_allowed = cpus_weight(new_mask); } - /* Can the task run on the task's 
current CPU? If so, we're done */ - if (cpu_isset(task_cpu(p), new_mask)) - goto out; - -
[PATCH 0/2] migration disabled critical sections
Hi Ingo, Steven, I had been working on some ideas related to saving context switches in the bottom-half mechanisms on -rt. So far, the ideas have been a flop, but a few peripheral technologies did come out of it. This series is one such idea that I thought might have some merit on its own. The header-comments describe it in detail, so I won't bother replicating that here. This series applies to 24-rt1. Any comments/feedback welcome. -Greg
[PATCH 2/2] fix cpus_allowed settings
Signed-off-by: Gregory Haskins <[EMAIL PROTECTED]> --- kernel/kthread.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/kernel/kthread.c b/kernel/kthread.c index dcfe724..b193b47 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu) wait_task_inactive(k); set_task_cpu(k, cpu); k->cpus_allowed = cpumask_of_cpu(cpu); + k->nr_cpus_allowed = 1; } EXPORT_SYMBOL(kthread_bind);
Re: any cpu hotplug changes in 2.6.24-current-git?
Pavel Machek wrote: Hi! Are there any recent changes in cpu hotplug? I have suspend (random) problems, nosmp seems to fix it, and last messages in the "it hangs" case are from cpu hotplug... Can you send along your cpuinfo? It happened on more than one machine, one cpuinfo is: Ah, ok. This one is a C2D, correct? The only reason I asked is that someone was reporting an s2ram problem on P4s on some of that root-domain logic I submitted a little while ago (and was merged in .25), whereas C2D seemed fine. That doesn't mean anything here, really. The problem could still be my code, or it might be unrelated. I was just wondering if you also had a P4 on the troubled systems. So is your problem on suspend or resume? (or both?) (I know you mentioned it was random problems, but I wasn't sure if you could qualify that further) Any info you can provide will be appreciated. -Greg processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 14 model name : Genuine Intel(R) CPU T2400 @ 1.83GHz stepping: 8 cpu MHz : 1000.000 cache size : 2048 KB physical id : 0 siblings: 2 core id : 1 cpu cores : 2 fdiv_bug: no hlt_bug : no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc arch_perfmon bts pni monitor vmx est tm2 xtpr bogomips: 3657.58 clflush size: 64 Pavel
Re: any cpu hotplug changes in 2.6.24-current-git?
Pavel Machek wrote: Hi! Are there any recent changes in cpu hotplug? I have suspend (random) problems, nosmp seems to fix it, and last messages in the "it hangs" case are from cpu hotplug... Pavel Hi Pavel, Can you send along your cpuinfo? -Greg