Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Sat, 2008-02-23 at 13:31 +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system. This is
> > your time threshold (CSt).
>
> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, and context switch between
> distant NUMA nodes. You could have several orders of magnitude
> between all those.
>
> > *) For each lock, maintain an average hold-time (AHt) statistic (I am
> > assuming this can be done cheaply...perhaps not).
>
> That would assume that the hold times are very uniform. But what happens
> when you e.g. have a workload where 50% of the lock acquisitions are short
> and 30% are long?
>
> I'm a little sceptical of such "too clever" algorithms.
>
> -Andi

Hi Andi,

I agree; this does not make any sense until you get to very large SMP
systems, and only if it can be demonstrated that the overhead associated
with lock statistics, including some notion of the context-switch
overhead, still comes out with a net gain.

There is some notion of task-routing in the RT scheduler already, and
this is quite a clever little algorithm. I see no reason why the
scheduler should not eventually take directly into account (when
balancing) the quantifiable context-switch and CPU overhead of moving a
task to a distant processor. In fact, for a deadline-based scheduler
working on high-frequency tasks, given that the times can vary so
radically, this would be required.

Once context-switch-overhead data is available to the scheduler, there
is no particular reason why adaptive locking could not also utilize
that data.

Sven

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Sat, Feb 23, 2008 at 01:31:00PM +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system. This is
> > your time threshold (CSt).
>
> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, and context switch between
> distant NUMA nodes. You could have several orders of magnitude
> between all those.

Wouldn't the common case for blocking on a lock with a short hold time
be the minimal context-switch pair on the same CPU?

> > *) For each lock, maintain an average hold-time (AHt) statistic (I am
> > assuming this can be done cheaply...perhaps not).
>
> That would assume that the hold times are very uniform. But what happens
> when you e.g. have a workload where 50% of the lock acquisitions are short
> and 30% are long?
>
> I'm a little sceptical of such "too clever" algorithms.

Me too. That said, I cannot resist pointing out that the measurement of
interest would be the fraction of lock-hold times that are shorter than
the context-switch time. ;-)

Thanx, Paul
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
> *) compute the context-switch pair time average for the system. This is
> your time threshold (CSt).

This is not a uniform time. Consider the difference between
context switch on the same hyperthread, context switch between cores
on a die, context switch between sockets, and context switch between
distant NUMA nodes. You could have several orders of magnitude
between all those.

> *) For each lock, maintain an average hold-time (AHt) statistic (I am
> assuming this can be done cheaply...perhaps not).

That would assume that the hold times are very uniform. But what happens
when you e.g. have a workload where 50% of the lock acquisitions are short
and 30% are long?

I'm a little sceptical of such "too clever" algorithms.

-Andi
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 13:36 -0700, Peter W. Morreale wrote:
> On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> >
> > In high-contention, short-hold time situations, it may even make sense
> > to have multiple CPUs with multiple waiters spinning, depending on
> > hold-time vs. time to put a waiter to sleep and wake them up.
> >
> > The wake-up side could also walk ahead on the queue, and bring up
> > spinners from sleeping, so that they are all ready to go when the lock
> > flips green for them.
>
> I did try an attempt at this one time. The logic was merely if the
> pending owner was running, wakeup the next waiter. The results were
> terrible for the benchmarks used, as compared to the current
> implementation.

Yup, but you cut the context where I said: "for very large SMP systems".

Specifically, what I mean is an SMP system where I have enough CPUs to
do this:

Let (t_Tcs) be the time to lock, transition and unlock an un-contended
critical section (i.e. the one that I am the pending waiter for).

Let (t_W) be the time to wake up a sleeping task.

And let (t_W > t_Tcs).

Then, "for very large SMP systems", if S = (t_W / t_Tcs), S designates
the number of tasks that can transition a critical section before the
first sleeper would wake up, and the number of CPUs > S.

The time for an arbitrary number of tasks N > S, all competing for lock
L, to transition a critical section (T_N_cs) approaches:

	T_N_cs = (N * t_W)

if you have only 1 task spinning. But if you can have N tasks spinning,
(T_N_cs) approaches:

	T_N_cs = (N * t_Tcs)

And with the premise that t_W > t_Tcs, you should see a dramatic
throughput improvement when running PREEMPT_RT on VERY LARGE SMP
systems.

I want to disclaim that the math above is very much simplified, but I
hope it is sufficient to demonstrate the concept.

I have to acknowledge Ingo's comments, that this is all suspect until
proven to make a positive difference in "non-marketing" workloads.

I personally *think* we are past that already, and the adaptive concept
can and will be extended and scaled as M-socket and N-core based SMP
proliferates into larger grid-based systems. But there is plenty more
to do to prove it. (Someone send me a 1024-CPU box and a wind-powered
generator.)

Sven

> What this meant was that virtually every unlock performed a wakeup, if
> not for the new pending owner, then for the next-in-line waiter.
>
> My impression at the time was that the contention for the rq lock is
> significant, regardless of whether the task being woken up was already
> running.
>
> I can generate numbers if that helps.
>
> -PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
Paul E. McKenney wrote:
> Governing the timeout by context-switch overhead sounds even better to
> me. Really easy to calibrate, and short critical sections are of much
> shorter duration than a context-switch pair.

Yeah, fully agree. This is on my research "todo" list. My theory is
that the ultimate adaptive-timeout algorithm here would essentially be
the following:

*) Compute the context-switch pair time average for the system. This is
your time threshold (CSt).

*) For each lock, maintain an average hold-time (AHt) statistic (I am
assuming this can be done cheaply...perhaps not).

The adaptive code would work as follows:

	if (AHt > CSt) /* don't even bother if the average exceeds CSt */
		timeout = 0;
	else
		timeout = AHt;

	if (adaptive_wait(timeout))
		sleep();

Anyone have some good ideas on how to compute CSt? I was thinking you
could create two kthreads that message one another (measuring
round-trip time) for some number of iterations (say 100) to get an
average. You could probably just approximate it with flushing
workqueue jobs.

-Greg
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
>
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
>
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.

I did try an attempt at this one time. The logic was merely: if the
pending owner was running, wake up the next waiter. The results were
terrible for the benchmarks used, as compared to the current
implementation.

What this meant was that virtually every unlock performed a wakeup, if
not for the new pending owner, then for the next-in-line waiter.

My impression at the time was that the contention for the rq lock is
significant, regardless of whether the task being woken up was already
running.

I can generate numbers if that helps.

-PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:55:45AM -0800, Sven-Thorsten Dietrich wrote:
> On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> > > > Yeah, I'm not very keen on having a constant there without some
> > > > contention instrumentation to see how long the spins are. It would be
> > > > better to just let it run until either task->on_cpu is off or checking
> > > > if the "current" no longer matches the mutex owner for the runqueue
> > > > in question. At that point, you know the thread isn't running.
> > > > Spinning on something like that is just a waste of time. It's for that
> > > > reason that doing the spin outside of a preempt critical section
> > > > isn't really needed.
> > >
> > > Excuse me, I meant to say "...isn't a problem".
> >
> > The fixed-time spins are very useful in cases where the critical section
> > is almost always very short but can sometimes be very long. In such
> > cases, you would want to spin until either ownership changes or it is
> > apparent that the current critical-section instance will be long.
> >
> > I believe that there are locks in the Linux kernel that have this
> > "mostly short but sometimes long" hold-time property.
>
> In regards to this "mostly short but sometimes long" question, for
> very large SMP systems, running with some profiling enabled might
> allow the system to adapt to varying workloads and therefore shifting
> lock contention / hold-times.
>
> Overall utilization despite the overhead might be lower, but this is
> tbd.
>
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
>
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.
>
> But in more simple cases, there should be a simple, default timeout
> governed by context-switch overhead or as defined by a derived number
> of cache misses, as you suggested.

Governing the timeout by context-switch overhead sounds even better to
me. Really easy to calibrate, and short critical sections are of much
shorter duration than a context-switch pair.

Thanx, Paul
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
>
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long. In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.

This is what the adaptive patch set does. All current spinners exit the
loop on an owner change. In this case each "spinner" will traverse the
outer rt lock loop and re-attempt to acquire the lock. If unsuccessful,
the task will re-enter the adaptive wait loop again, provided that there
are still "timeout" iterations left for that particular task.

I do have some stats on the rt lock/adaptive algorithm from a crude
patch used during development. As Bill has noted, the correct way to
obtain these would be to extend lockstats. It was just more expedient
to create this crude patch at the time.

If it is helpful to look at these results in the meantime, I can easily
make them, as well as the patch, available.

-PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> > > Yeah, I'm not very keen on having a constant there without some
> > > contention instrumentation to see how long the spins are. It would be
> > > better to just let it run until either task->on_cpu is off or checking
> > > if the "current" no longer matches the mutex owner for the runqueue
> > > in question. At that point, you know the thread isn't running.
> > > Spinning on something like that is just a waste of time. It's for that
> > > reason that doing the spin outside of a preempt critical section
> > > isn't really needed.
> >
> > Excuse me, I meant to say "...isn't a problem".
>
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long. In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.
>
> I believe that there are locks in the Linux kernel that have this
> "mostly short but sometimes long" hold-time property.

In regards to this "mostly short but sometimes long" question, for very
large SMP systems, running with some profiling enabled might allow the
system to adapt to varying workloads and therefore shifting lock
contention / hold-times.

Overall utilization despite the overhead might be lower, but this is
tbd.

In high-contention, short-hold time situations, it may even make sense
to have multiple CPUs with multiple waiters spinning, depending on
hold-time vs. time to put a waiter to sleep and wake them up.

The wake-up side could also walk ahead on the queue, and bring up
spinners from sleeping, so that they are all ready to go when the lock
flips green for them.

But in more simple cases, there should be a simple, default timeout
governed by context-switch overhead or as defined by a derived number
of cache misses, as you suggested.

Sven
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> > Yeah, I'm not very keen on having a constant there without some
> > contention instrumentation to see how long the spins are. It would be
> > better to just let it run until either task->on_cpu is off or checking
> > if the "current" no longer matches the mutex owner for the runqueue
> > in question. At that point, you know the thread isn't running.
> > Spinning on something like that is just a waste of time. It's for that
> > reason that doing the spin outside of a preempt critical section
> > isn't really needed.
>
> Excuse me, I meant to say "...isn't a problem".

The fixed-time spins are very useful in cases where the critical section
is almost always very short but can sometimes be very long. In such
cases, you would want to spin until either ownership changes or it is
apparent that the current critical-section instance will be long.

I believe that there are locks in the Linux kernel that have this
"mostly short but sometimes long" hold-time property.

Thanx, Paul
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> Yeah, I'm not very keen on having a constant there without some
> contention instrumentation to see how long the spins are. It would be
> better to just let it run until either task->on_cpu is off or checking
> if the "current" no longer matches the mutex owner for the runqueue
> in question. At that point, you know the thread isn't running.
> Spinning on something like that is just a waste of time. It's for that
> reason that doing the spin outside of a preempt critical section
> isn't really needed.

Excuse me, I meant to say "...isn't a problem".

bill
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:08 AM, Paul E. McKenney <[EMAIL PROTECTED]> wrote:
> One approach would be to set the RTLOCK_DELAY parameter to something like
> -1 for default, and to set it to the number of cycles required for about
> 10 cache misses at boot time. This would automatically scale with CPU
> frequency and memory latency.

Yeah, I'm not very keen on having a constant there without some
contention instrumentation to see how long the spins are. It would be
better to just let it run until either task->on_cpu is off or checking
if the "current" no longer matches the mutex owner for the runqueue
in question. At that point, you know the thread isn't running.
Spinning on something like that is just a waste of time. It's for that
reason that doing the spin outside of a preempt critical section
isn't really needed.

bill
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, Feb 21, 2008 at 05:41:09PM +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +	int "Default delay (in loops) for adaptive rtlocks"
> > +	range 0 10
> > +	depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > +	default "1"
>
> Perhaps you can expand on how you came up with that default number?
> It looks suspiciously round, and worse, the actual spin time depends a
> lot on the CPU frequency (so e.g. a 3GHz CPU will likely behave quite
> differently from a 2GHz CPU). Did you experiment with other spin times?
> Should it be scaled with the number of CPUs? And at what point is real
> time behaviour visibly impacted?
>
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

One approach would be to set the RTLOCK_DELAY parameter to something like
-1 for default, and to set it to the number of cycles required for about
10 cache misses at boot time. This would automatically scale with CPU
frequency and memory latency.

Thanx, Paul
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, Feb 21, 2008 at 05:41:09PM +0100, Andi Kleen wrote: +config RTLOCK_DELAY + int Default delay (in loops) for adaptive rtlocks + range 0 10 + depends on ADAPTIVE_RTLOCK I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough. + default 1 Perhaps you can expand how you came up with that default number? It looks suspiciously round and worse the actual spin time depends a lot on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite differently from a 2Ghz CPU) Did you experiment with other spin times? Should it be scaled with number of CPUs? And at what point is real time behaviour visibly impacted? Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iteration similar to what udelay does. That would be at least constant time. One approach would be to set the RTLOCK_DELAY parameter to something like -1 for default, and to set it to the number of cycles required for about 10 cache misses at boot time. This would automatically scale with CPU frequency and memory latency. Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:08 AM, Paul E. McKenney [EMAIL PROTECTED] wrote: One approach would be to set the RTLOCK_DELAY parameter to something like -1 for default, and to set it to the number of cycles required for about 10 cache misses at boot time. This would automatically scale with CPU frequency and memory latency. Yeah, I'm not very keen on having a constant there without some contention instrumentation to see how long the spins are. It would be better to just let it run until either task-on_cpu is off or checking if the current in no longer matches the mutex owner for the runqueue in question. At that point, you know the thread isn't running. Spinning on something like that is just a waste of time. It's for that reason that doing in the spin outside of a preempt critical section isn't really needed bill -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) [EMAIL PROTECTED] wrote: Yeah, I'm not very keen on having a constant there without some contention instrumentation to see how long the spins are. It would be better to just let it run until either task-on_cpu is off or checking if the current in no longer matches the mutex owner for the runqueue in question. At that point, you know the thread isn't running. Spinning on something like that is just a waste of time. It's for that reason that doing in the spin outside of a preempt critical section isn't really needed Excuse me, I meant to say ...isn't a problem. bill -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote: On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote: On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) [EMAIL PROTECTED] wrote: Yeah, I'm not very keen on having a constant there without some contention instrumentation to see how long the spins are. It would be better to just let it run until either task-on_cpu is off or checking if the current in no longer matches the mutex owner for the runqueue in question. At that point, you know the thread isn't running. Spinning on something like that is just a waste of time. It's for that reason that doing in the spin outside of a preempt critical section isn't really needed Excuse me, I meant to say ...isn't a problem. The fixed-time spins are very useful in cases where the critical section is almost always very short but can sometimes be very long. In such cases, you would want to spin until either ownership changes or it is apparent that the current critical-section instance will be long. I believe that there are locks in the Linux kernel that have this mostly short but sometimes long hold-time property. In regards to this mostly short but sometimes long question, for very large SMP systems, running with some profiling enabled, might allow the system to adapt to varying workloads and therefore shifting lock contention / hold-times. Overall utilization despite the overhead might be lower, but this is tbd. In high-contention, short-hold time situations, it may even make sense to have multiple CPUs with multiple waiters spinning, depending on hold-time vs. time to put a waiter to sleep and wake them up. The wake-up side could also walk ahead on the queue, and bring up spinners from sleeping, so that they are all ready to go when the lock flips green for them. But in more simple cases, there should be a simple, default timeout governed by context switch overhead or as defined by a derived number of cache misses, as you suggested. 
Sven

> Thanx, Paul
> -
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> In high-contention, short-hold-time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. the time to put a waiter to sleep and wake it up.
>
> The wake-up side could also walk ahead on the queue, and bring spinners
> up from sleeping, so that they are all ready to go when the lock flips
> green for them.

I did try an attempt at this one time. The logic was merely: if the pending owner was running, wake up the next waiter. The results were terrible for the benchmarks used, as compared to the current implementation.

What this meant was that virtually every unlock performed a wakeup, if not for the new pending owner, then for the next-in-line waiter.

My impression at the time was that the contention for the rq lock is significant, regardless of whether the task being woken up was already running.

I can generate numbers if that helps.

-PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, Feb 22, 2008 at 11:55:45AM -0800, Sven-Thorsten Dietrich wrote:
> On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> > > > Yeah, I'm not very keen on having a constant there without some
> > > > contention instrumentation to see how long the spins are. It would be
> > > > better to just let it run until either task->on_cpu is off or checking
> > > > if current no longer matches the mutex owner for the runqueue in
> > > > question. At that point, you know the thread isn't running.
> > > > Spinning on something like that is just a waste of time. It's for that
> > > > reason that doing the spin outside of a preempt critical section isn't
> > > > really needed
> > >
> > > Excuse me, I meant to say "...isn't a problem."
> >
> > The fixed-time spins are very useful in cases where the critical section
> > is almost always very short but can sometimes be very long. In such
> > cases, you would want to spin until either ownership changes or it is
> > apparent that the current critical-section instance will be long.
> > I believe that there are locks in the Linux kernel that have this
> > mostly-short-but-sometimes-long hold-time property.
>
> In regard to this mostly-short-but-sometimes-long question: for very
> large SMP systems, running with some profiling enabled might allow the
> system to adapt to varying workloads and therefore shifting lock
> contention / hold-times. Overall utilization despite the overhead might
> be lower, but this is TBD.
>
> In high-contention, short-hold-time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. the time to put a waiter to sleep and wake it up.
>
> The wake-up side could also walk ahead on the queue, and bring spinners
> up from sleeping, so that they are all ready to go when the lock flips
> green for them.
> But in more simple cases, there should be a simple, default timeout
> governed by context-switch overhead, or defined as a derived number of
> cache misses, as you suggested.

Governing the timeout by context-switch overhead sounds even better to me. Really easy to calibrate, and short critical sections are of much shorter duration than a context-switch pair.

						Thanx, Paul
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long. In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.

This is what the adaptive patch set does. All current spinners exit the loop on an owner change. In this case each spinner will traverse the outer rt lock loop and re-attempt to acquire the lock. If unsuccessful, the task will re-enter the adaptive wait loop again, provided that there are still timeout iterations left for that particular task.

I do have some stats on the rt lock/adaptive algorithm from a crude patch used during development. As Bill has noted, the correct way to obtain these would be to extend lockstat. It was just more expedient to create this crude patch at the time. If it is helpful to look at these results in the meantime, I can easily make them, as well as the patch, available.

-PWM
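[Editorial note] The structure Peter describes, an inner adaptive wait loop inside an outer acquisition loop with a per-task spin budget that persists across re-entries, can be modeled in userspace roughly as below. This is a hypothetical sketch: `rt_lock_slowpath` and the struct names are invented, and a pthread mutex stands in for the rtlock.

```c
#include <pthread.h>

struct rtlock {
	pthread_mutex_t m;
};

struct adaptive_waiter {
	int timeout;	/* remaining spin iterations, carried across re-entries */
};

/* Returns 1 if the lock was taken while spinning, 0 if we had to sleep.
 * Each iteration burns one unit of the waiter's budget; once the budget
 * is exhausted, the task falls back to blocking. */
int rt_lock_slowpath(struct rtlock *l, struct adaptive_waiter *w)
{
	while (w->timeout > 0) {		/* adaptive wait loop */
		w->timeout--;
		if (pthread_mutex_trylock(&l->m) == 0)
			return 1;		/* owner change: re-attempt won */
	}
	pthread_mutex_lock(&l->m);		/* budget spent: go to sleep */
	return 0;
}
```

The key property is that `w->timeout` is not reset between attempts, so a task whose lock keeps getting stolen eventually stops spinning, matching the "still timeout iterations left" behavior above.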
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
Paul E. McKenney wrote:
> Governing the timeout by context-switch overhead sounds even better to
> me. Really easy to calibrate, and short critical sections are of much
> shorter duration than a context-switch pair.

Yeah, fully agree. This is on my research todo list. My theory is that the ultimate adaptive-timeout algorithm here would essentially be the following:

*) compute the context-switch pair time average for the system. This is your time threshold (CSt).

*) For each lock, maintain an average hold-time (AHt) statistic (I am assuming this can be done cheaply...perhaps not).

The adaptive code would work as follows:

	if (AHt > CSt) /* don't even bother if the average is greater than CSt */
		timeout = 0;
	else
		timeout = AHt;

	if (adaptive_wait(timeout))
		sleep();

Anyone have some good ideas on how to compute CSt? I was thinking you could create two kthreads that message one another (measuring round-trip time) for some number of iterations (say 100) to get an average. You could probably just approximate it with flushing workqueue jobs.

-Greg
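[Editorial note] Greg's per-lock AHt statistic could plausibly be kept cheap with an exponentially weighted moving average. A hypothetical userspace sketch, where the 1/8 weight and all names are invented for illustration:

```c
/* Per-lock average hold-time statistic, updated on each unlock. */
struct lock_stats {
	long aht_ns;	/* EWMA of observed hold times, in nanoseconds */
};

/* Fold one observed hold time into the average:
 * new = old + (sample - old) / 8.
 * An integer EWMA with weight 1/8 -- cheap enough for an unlock path. */
void record_hold(struct lock_stats *s, long hold_ns)
{
	s->aht_ns += (hold_ns - s->aht_ns) / 8;
}

/* The decision rule quoted above: don't spin at all if the average hold
 * time already exceeds the context-switch-pair threshold CSt; otherwise
 * spin for about one average hold time. */
long pick_timeout(const struct lock_stats *s, long cst_ns)
{
	if (s->aht_ns > cst_ns)
		return 0;
	return s->aht_ns;
}
```

The EWMA answers Andi's later objection about non-uniform hold times only partially: a bimodal 50/30 mix of short and long holds collapses into one misleading average, which is one reason to treat this as a research sketch.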
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Fri, 2008-02-22 at 13:36 -0700, Peter W. Morreale wrote:
> On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> > In high-contention, short-hold-time situations, it may even make sense
> > to have multiple CPUs with multiple waiters spinning, depending on
> > hold-time vs. the time to put a waiter to sleep and wake it up.
> >
> > The wake-up side could also walk ahead on the queue, and bring spinners
> > up from sleeping, so that they are all ready to go when the lock flips
> > green for them.
>
> I did try an attempt at this one time. The logic was merely if the
> pending owner was running, wakeup the next waiter. The results were
> terrible for the benchmarks used, as compared to the current
> implementation.

Yup, but you cut the CONTEXT where I said: "for very large SMP systems".

Specifically, what I mean is an SMP system where I have enough CPUs to do this:

let (t_Tcs) be the time to lock, transition and unlock an un-contended critical section (i.e. the one that I am the pending waiter for).

let (t_W) be the time to wake up a sleeping task.

and let (t_W > t_Tcs)

Then, for very large SMP systems, if S = (t_W / t_Tcs), then S designates the number of tasks that can transition the critical section before the first sleeper would wake up, given that the number of CPUs > S.

The time for an arbitrary number of tasks N > S, which are all competing for lock L, to transition the critical section (T_N_cs) approaches:

	T_N_cs = (N * t_W)

if you have only 1 task spinning. But if you can have N tasks spinning, (T_N_cs) approaches:

	T_N_cs = (N * t_Tcs)

and with the premise that t_W > t_Tcs, you should see a dramatic throughput improvement when running PREEMPT_RT on VERY LARGE SMP systems.

I want to disclaim that the math above is very much simplified, but I hope it is sufficient to demonstrate the concept. I have to acknowledge Ingo's comments, that this is all suspect until proven to make a positive difference in non-marketing workloads.
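[Editorial note] Plugging invented numbers into Sven's simplified model makes the gap concrete. Assume, purely for illustration, t_W = 10 us, t_Tcs = 1 us (so S = 10), and N = 64 competing tasks:

```c
/* Sven's two limiting cases, all times in microseconds.
 * serial_wakeups: only one task spins, so every subsequent waiter must
 * be woken at cost t_W per transition.
 * all_spinning: N tasks spin, so each transition costs only t_Tcs. */
double serial_wakeups(int n, double t_w)  { return n * t_w; }
double all_spinning(int n, double t_tcs)  { return n * t_tcs; }
```

With these numbers, 64 transitions cost 640 us in the wakeup-bound case versus 64 us with everyone spinning; the speedup factor is exactly S = t_W / t_Tcs. Sven's own disclaimer applies: real systems add cache-line bouncing and scheduler overhead that this model ignores.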
I personally *think* we are past that already, and that the adaptive concept can and will be extended and scaled as M-socket and N-core based SMP proliferates into larger grid-based systems. But there is plenty more to do to prove it.

(someone send me a 1024 CPU box and a wind-powered generator)

Sven

> What this meant was that virtually every unlock performed a wakeup, if
> not for the new pending owner, then for the next-in-line waiter.
>
> My impression at the time was that the contention for the rq lock is
> significant, regardless of whether the task being woken up was already
> running.
>
> I can generate numbers if that helps.
>
> -PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +	int "Default delay (in loops) for adaptive rtlocks"
> > +	range 0 10
> > +	depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > +	default "1"
>
> Perhaps you can expand how you came up with that default number?

We did not want to create a hotspot around time sources for HRT and the scheduler clock, and there really hasn't been enough analysis.

The loop should be calibrated using something similar to loops_per_jiffy, and it should be in nanoseconds.

It needs to be tunable, because it depends a lot on the workload. High-frequency periodic tasks would need a lower setting - but it also relates to the number of waiting tasks and the number of CPUs, so there may be heuristics that tie into lockstat.

For big-SMP systems, it may actually be worth the overhead to track these stats per-lock (for the hot locks), if we can correlate it all to performance.

> It looks suspiciously round and worse the actual spin time depends a lot
> on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU) Did you experiment with other spin times?
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted?

The code actually runs preemptibly, so even before the timeout expires, the task can pop off the CPU (at which point another state change cancels the loop).

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

True - I looked at something generic, similar to what RT's latency tracing uses, allowing for other architectures.
Sven

> -Andi
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote: > > +config RTLOCK_DELAY > > + int "Default delay (in loops) for adaptive rtlocks" > > + range 0 10 > > + depends on ADAPTIVE_RTLOCK > > I must say I'm not a big fan of putting such subtle configurable numbers > into Kconfig. Compilation is usually the wrong place to configure > such a thing. Just having it as a sysctl only should be good enough. > > > + default "1" > > Perhaps you can expand how you came up with that default number? > It looks suspiciously round and worse the actual spin time depends a lot on > the > CPU frequency (so e.g. a 3Ghz CPU will likely behave quite > differently from a 2Ghz CPU) Did you experiment with other spin times? > Should it be scaled with number of CPUs? And at what point is real > time behaviour visibly impacted? > We did play with a number of different values at first, however this needs more inspection. With the loads we have played with, this parameter is not that sensitive. At first we only had the adaptive wait added to the rt spin lock code, and the values chosen did not seem to have much of an impact until we dropped below 1000. Higher values had little impact. Adding the adaptive code to rt mutexes seems to make this (and the similar mutex parameter) more sensitive, however this is likely due to the particular benchmarks we were using. Note that the adaptive wait loop breaks on every owner change, so the actual time spent spinning on a CPU (in that loop) is, in general, limited to the time the lock is held. (Note the other cases in that adaptive wait loop as well) It seems to be that the higher values help in the cases where the lock is stolen from the pending owner, in which case the pending owner can and will iterate and again start spinning. We are certainly hoping that the community will test and also provide feedback. 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.
>
> -Andi

We (at least I :-) did try that, to the detriment of throughput. It is seemingly overkill to more precisely account for the "max" time spent spinning.

At one point, I did have a patch that specified the sysctl as "nanoseconds" and converted that to an approximate loop count (via the math in udelay()). The idea was that nanoseconds are more human-friendly, as well as hardware-independent, however on further review the idea was dropped.

Best,
-PWM
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
>>> On Thu, Feb 21, 2008 at 11:41 AM, in message <[EMAIL PROTECTED]>, Andi Kleen <[EMAIL PROTECTED]> wrote:
>> +config RTLOCK_DELAY
>> +	int "Default delay (in loops) for adaptive rtlocks"
>> +	range 0 10
>> +	depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
>> +	default "1"
>
> Perhaps you can expand how you came up with that default number?

Actually, the number doesn't seem to matter that much as long as it is sufficiently long to make timeouts rare. Most workloads will present some threshold for hold-time. You generally get the best performance if the value is at least as "long" as that threshold. Anything beyond that and there is no gain, but there doesn't appear to be a penalty either. So we picked 1 because we found it to fit that criteria quite well for our range of GHz class x86 machines. YMMV, but that is why it's configurable ;)

> It looks suspiciously round and worse the actual spin time depends a lot
> on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite
> differently from a 2Ghz CPU)

Yeah, fully agree. We really wanted to use a time-value here but ran into various problems that have yet to be resolved. We have it on the todo list to express this in terms of ns so it at least will scale with the architecture.

> Did you experiment with other spin times?

Of course ;)

> Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that down as a research "todo".

> And at what point is real time behaviour visibly impacted?

Well, if we did our jobs correctly, RT behavior should *never* be impacted. *Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier. There is that threshold that affects the probability of timing out. Values lower than that threshold start to degrade throughput.
Values higher than that have no effect on throughput, but may drive CPU utilization higher, which can theoretically impact tasks of equal or lesser priority by taking that resource away from them. To date, we have not observed any real-world implications of this, however.

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

I agree. We need to move in the direction of a time-basis. The tradeoff is that it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). I think one of the (not-included) patches converts a nanosecond value from the sysctl to approximate loop-counts using the bogomips data. This was a decent compromise between the non-scaling loop-counts and the heavy-weight official timing APIs. We dropped it because we support older kernels which were conflicting with the patch. We may have to resurrect it, however...

-Greg
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
> +config RTLOCK_DELAY
> +	int "Default delay (in loops) for adaptive rtlocks"
> +	range 0 10
> +	depends on ADAPTIVE_RTLOCK

I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough.

> +	default "1"

Perhaps you can expand how you came up with that default number? It looks suspiciously round, and worse, the actual spin time depends a lot on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite differently from a 2Ghz CPU). Did you experiment with other spin times? Should it be scaled with number of CPUs? And at what point is real time behaviour visibly impacted?

Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iterations similar to what udelay does. That would be at least constant time.

-Andi
[PATCH [RT] 08/14] add a loop counter based timeout mechanism
From: Sven Dietrich <[EMAIL PROTECTED]>

Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---
 kernel/Kconfig.preempt    |   11 +++++++++++
 kernel/rtmutex.c          |    4 ++++
 kernel/rtmutex_adaptive.h |   11 +++++++++--
 kernel/sysctl.c           |   12 ++++++++++++
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK

	 If unsure, say Y

+config RTLOCK_DELAY
+	int "Default delay (in loops) for adaptive rtlocks"
+	range 0 10
+	depends on ADAPTIVE_RTLOCK
+	default "1"
+	help
+	 This allows you to specify the maximum attempts a task will spin
+	 attempting to acquire an rtlock before sleeping. The value is
+	 tunable at runtime via a sysctl. A setting of 0 (zero) disables
+	 the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
	bool "Old-Style Big Kernel Lock"
	depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"

+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
	struct task_struct *owner;
+	int timeout;
 };

 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 {
	int sleep = 0;

-	for (;;) {
+	for (; adaptive->timeout > 0; adaptive->timeout--) {
		/*
		 * If the task was re-awoken, break out completely so we can
		 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
		cpu_relax();
	}

+	if (adaptive->timeout <= 0)
+		sleep = 1;
+
	put_task_struct(adaptive->owner);

	return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
	get_task_struct(adaptive->owner);
 }

+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
-	struct adaptive_waiter name = { .owner = NULL, }
+	struct adaptive_waiter name = { .owner = NULL, \
+					.timeout = rtlock_timeout, }

 #else
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include <asm/stacktrace.h>
 #endif

+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);

 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
		.proc_handler	= &proc_dointvec,
	},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "rtlock_timeout",
+		.data		= &rtlock_timeout,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_PROC_FS
	{
		.ctl_name	= CTL_UNNUMBERED,
[PATCH [RT] 08/14] add a loop counter based timeout mechanism
From: Sven Dietrich [EMAIL PROTECTED] Signed-off-by: Sven Dietrich [EMAIL PROTECTED] --- kernel/Kconfig.preempt| 11 +++ kernel/rtmutex.c |4 kernel/rtmutex_adaptive.h | 11 +-- kernel/sysctl.c | 12 4 files changed, 36 insertions(+), 2 deletions(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 6568519..eebec19 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK If unsure, say Y +config RTLOCK_DELAY + int Default delay (in loops) for adaptive rtlocks + range 0 10 + depends on ADAPTIVE_RTLOCK + default 1 +help + This allows you to specify the maximum attempts a task will spin +attempting to acquire an rtlock before sleeping. The value is +tunable at runtime via a sysctl. A setting of 0 (zero) disables +the adaptive algorithm entirely. + config SPINLOCK_BKL bool Old-Style Big Kernel Lock depends on (PREEMPT || SMP) !PREEMPT_RT diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c index feb938f..4a7423f 100644 --- a/kernel/rtmutex.c +++ b/kernel/rtmutex.c @@ -20,6 +20,10 @@ #include rtmutex_common.h #include rtmutex_adaptive.h +#ifdef CONFIG_ADAPTIVE_RTLOCK +int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY; +#endif + /* * lock-owner state tracking: * diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h index 505fed5..b7e282b 100644 --- a/kernel/rtmutex_adaptive.h +++ b/kernel/rtmutex_adaptive.h @@ -39,6 +39,7 @@ #ifdef CONFIG_ADAPTIVE_RTLOCK struct adaptive_waiter { struct task_struct *owner; + int timeout; }; /* @@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, { int sleep = 0; - for (;;) { + for (; adaptive-timeout 0; adaptive-timeout--) { /* * If the task was re-awoken, break out completely so we can * reloop through the lock-acquisition code. 
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, cpu_relax(); } + if (adaptive-timeout = 0) + sleep = 1; + put_task_struct(adaptive-owner); return sleep; @@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive) get_task_struct(adaptive-owner); } +extern int rtlock_timeout; + #define DECLARE_ADAPTIVE_WAITER(name) \ - struct adaptive_waiter name = { .owner = NULL, } + struct adaptive_waiter name = { .owner = NULL, \ + .timeout = rtlock_timeout, } #else diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 541aa9f..36259e4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -58,6 +58,8 @@ #include asm/stacktrace.h #endif +#include rtmutex_adaptive.h + static int deprecated_sysctl_warning(struct __sysctl_args *args); #if defined(CONFIG_SYSCTL) @@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif +#ifdef CONFIG_ADAPTIVE_RTLOCK + { + .ctl_name = CTL_UNNUMBERED, + .procname = rtlock_timeout, + .data = rtlock_timeout, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #ifdef CONFIG_PROC_FS { .ctl_name = CTL_UNNUMBERED, -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
+config RTLOCK_DELAY + int Default delay (in loops) for adaptive rtlocks + range 0 10 + depends on ADAPTIVE_RTLOCK I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough. + default 1 Perhaps you can expand how you came up with that default number? It looks suspiciously round and worse the actual spin time depends a lot on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite differently from a 2Ghz CPU) Did you experiment with other spin times? Should it be scaled with number of CPUs? And at what point is real time behaviour visibly impacted? Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iteration similar to what udelay does. That would be at least constant time. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, Feb 21, 2008 at 11:41 AM, in message [EMAIL PROTECTED], Andi Kleen [EMAIL PROTECTED] wrote: +config RTLOCK_DELAY +int Default delay (in loops) for adaptive rtlocks +range 0 10 +depends on ADAPTIVE_RTLOCK I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough. +default 1 Perhaps you can expand how you came up with that default number? Actually, the number doesn't seem to matter that much as long as it is sufficiently long enough to make timeouts rare. Most workloads will present some threshold for hold-time. You generally get the best performance if the value is at least as long as that threshold. Anything beyond that and there is no gain, but there doesn't appear to be a penalty either. So we picked 1 because we found it to fit that criteria quite well for our range of GHz class x86 machines. YMMY, but that is why its configurable ;) It looks suspiciously round and worse the actual spin time depends a lot on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite differently from a 2Ghz CPU) Yeah, fully agree. We really wanted to use a time-value here but ran into various problems that have yet to be resolved. We have it on the todo list to express this in terms in ns so it at least will scale with the architecture. Did you experiment with other spin times? Of course ;) Should it be scaled with number of CPUs? Not to my knowledge, but we can put that as a research todo. And at what point is real time behaviour visibly impacted? Well, if we did our jobs correctly, RT behavior should *never* be impacted. *Throughput* on the other hand... ;) But its comes down to what I mentioned earlier. There is that threshold that affects the probability of timing out. Values lower than that threshold start to degrade throughput. 
Values higher than that have no affect on throughput, but may drive the cpu utilization higher which can theoretically impact tasks of equal or lesser priority by taking that resource away from them. To date, we have not observed any real-world implications of this however. Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iteration similar to what udelay does. That would be at least constant time. I agree. We need to move in the direction of time-basis. The tradeoff is that it needs to be portable, and low-impact (e.g. ktime_get() is too heavy-weight). I think one of the (not-included) patches converts a nanosecond value from the sysctl to approximate loop-counts using the bogomips data. This was a decent compromise between the non-scaling loopcounts and the heavy-weight official timing APIs. We dropped it because we support older kernels which were conflicting with the patch. We may have to resurrect it, however.. -Greg -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote: +config RTLOCK_DELAY + int Default delay (in loops) for adaptive rtlocks + range 0 10 + depends on ADAPTIVE_RTLOCK I must say I'm not a big fan of putting such subtle configurable numbers into Kconfig. Compilation is usually the wrong place to configure such a thing. Just having it as a sysctl only should be good enough. + default 1 Perhaps you can expand how you came up with that default number? It looks suspiciously round and worse the actual spin time depends a lot on the CPU frequency (so e.g. a 3Ghz CPU will likely behave quite differently from a 2Ghz CPU) Did you experiment with other spin times? Should it be scaled with number of CPUs? And at what point is real time behaviour visibly impacted? We did play with a number of different values at first, however this needs more inspection. With the loads we have played with, this parameter is not that sensitive. At first we only had the adaptive wait added to the rt spin lock code, and the values chosen did not seem to have much of an impact until we dropped below 1000. Higher values had little impact. Adding the adaptive code to rt mutexes seems to make this (and the similar mutex parameter) more sensitive, however this is likely due to the particular benchmarks we were using. Note that the adaptive wait loop breaks on every owner change, so the actual time spent spinning on a CPU (in that loop) is, in general, limited to the time the lock is held. (Note the other cases in that adaptive wait loop as well) It seems to be that the higher values help in the cases where the lock is stolen from the pending owner, in which case the pending owner can and will iterate and again start spinning. We are certainly hoping that the community will test and also provide feedback. Most likely it would be better to switch to something that is more absolute time, like checking RDTSC every few iteration similar to what udelay does. That would be at least constant time. 
> -Andi

We (at least I :-) did try that, to the detriment of throughput. It is
seemingly overkill to more precisely account for the max time spent
spinning.

At one point I did have a patch that specified the sysctl in nanoseconds
and converted that to an approximate loop count (via the math in
udelay()). The idea was that nanoseconds are more human-friendly, as
well as hardware-independent; however, on further review the idea was
dropped.

Best,
-PWM
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
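[The nanoseconds-to-loops conversion mentioned above can be sketched with
the same style of arithmetic udelay() uses: loops_per_jiffy calibrated
loops take 1/HZ seconds, so ns nanoseconds cost roughly
ns * loops_per_jiffy * HZ / 10^9 loops. MODEL_HZ and ns_to_loops are
made-up names for this sketch, not the patch's actual code.]

```c
#include <stdint.h>

/*
 * Sketch of converting a nanosecond tunable into an approximate loop
 * count, following the udelay()-style arithmetic described above.
 * MODEL_HZ and the function name are illustrative assumptions.
 */
#define MODEL_HZ 1000ULL		/* assume a 1000 Hz tick for the example */

static unsigned long ns_to_loops(uint64_t ns, uint64_t loops_per_jiffy)
{
	/* 64-bit intermediates to avoid overflow for large timeouts */
	return (unsigned long)((ns * loops_per_jiffy * MODEL_HZ) /
			       1000000000ULL);
}
```

[For example, with loops_per_jiffy = 1000000 at a 1000 Hz tick (i.e. 10^9
calibrated loops per second), a 5000 ns timeout maps to about 5000 loops.]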
Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +	int "Default delay (in loops) for adaptive rtlocks"
> > +	range 0 10
> > +	depends on ADAPTIVE_RTLOCK
>
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
>
> > +	default 1
>
> Perhaps you can expand how you came up with that default number?

We did not want to create a hotspot around time sources for HRT and the
scheduler clock, and there really hasn't been enough analysis.

The loop should be calibrated using something similar to
loops_per_jiffy, and it should be in nanoseconds. It needs to be
tunable, because it depends a lot on the workload. High-frequency
periodic tasks would need a lower setting - but it also relates to the
number of waiting tasks and the number of CPUs, so there may be
heuristics that tie into lockstat.

For big-SMP systems, it may actually be worth the overhead to track
these stats per-lock (for the hot locks), if we can correlate it all to
performance.

> It looks suspiciously round, and worse, the actual spin time depends a
> lot on the CPU frequency (so e.g. a 3GHz CPU will likely behave quite
> differently from a 2GHz CPU). Did you experiment with other spin times?
> Should it be scaled with the number of CPUs? And at what point is
> real-time behaviour visibly impacted?

The code actually runs preemptibly, so even before the timeout expires
the task can pop off the CPU (at which point another state change
cancels the loop).

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

True - I looked at something generic, similar to what RT's latency
tracing uses, allowing for other architectures.
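[The "check the cycle counter every few iterations" idea quoted above
might look like the following user-space sketch. The counter callback
stands in for rdtsc()/get_cycles(); the batch size, the fake counter,
and all names are hypothetical illustrations, not kernel code.]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of a spin loop bounded by cycles rather than a raw iteration
 * count: the cycle counter is read only once per batch of iterations to
 * amortize the cost of the read, the way udelay()-style delays stay
 * roughly constant in wall-clock time regardless of CPU speed.
 */
typedef uint64_t (*cycles_fn)(void);

static int spin_until(cycles_fn read_cycles, uint64_t timeout_cycles,
		      int (*done)(void *), void *arg)
{
	const unsigned int batch = 64;	/* counter read once per 64 spins */
	uint64_t deadline = read_cycles() + timeout_cycles;

	for (;;) {
		unsigned int i;

		for (i = 0; i < batch; i++) {
			if (done(arg))
				return 1;	/* condition met in time */
		}
		if (read_cycles() >= deadline)
			return 0;		/* timed out */
	}
}

/* Fake counter and predicates, for demonstration purposes only. */
static uint64_t fake_now;
static uint64_t fake_cycles(void)  { return fake_now += 1000; }
static int never_done(void *arg)   { (void)arg; return 0; }
static int always_done(void *arg)  { (void)arg; return 1; }
```

[A generic version, as suggested above, would hide the counter behind an
arch abstraction rather than using RDTSC directly.]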
Sven

> -Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/