Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-25 Thread Sven-Thorsten Dietrich

On Sat, 2008-02-23 at 13:31 +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system.  This is 
> > your time threshold (CSt).
> 

Hi Andi,

> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, context switch between
> distant numa nodes. You could have several orders of magnitude
> between all those.
> 
> >
> > *) For each lock, maintain an average hold-time (AHt) statistic (I am 
> > assuming this can be done cheaply...perhaps not).
> 
> That would assume that the hold times are very uniform. But what happens
> > when you e.g. have a workload where 50% of the lock acquisitions are short 
> and 30% are long? 
> 
> I'm a little sceptical of such "too clever" algorithms.
> 

I agree, this does not make any sense until you get to very large SMP
systems - and only iff it can be demonstrated that the overhead
associated with lock statistics, including some notion of the context
switch overhead, still comes out with a net gain.

There is some notion of task-routing in the RT scheduler already, and
this is quite a clever little algorithm. 

I see no reason why the scheduler should not eventually take directly
into account (when balancing) the quantifiable context-switch and CPU
overhead of moving a task to a distant processor.

In fact, for a deadline-based scheduler working on high-frequency tasks,
given that the times can vary so radically, this would be required.

Once context-switch-overhead data is available to the scheduler, there
is no particular reason why adaptive locking could not also utilize
that data.

Sven

> -Andi



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-23 Thread Paul E. McKenney
On Sat, Feb 23, 2008 at 01:31:00PM +0100, Andi Kleen wrote:
> > *) compute the context-switch pair time average for the system.  This is 
> > your time threshold (CSt).
> 
> This is not a uniform time. Consider the difference between
> context switch on the same hyperthread, context switch between cores
> on a die, context switch between sockets, context switch between
> distant numa nodes. You could have several orders of magnitude
> between all those.

Wouldn't the common case for blocking on a lock with a short hold
time be the minimal context-switch pair on the same CPU?

> > *) For each lock, maintain an average hold-time (AHt) statistic (I am 
> > assuming this can be done cheaply...perhaps not).
> 
> That would assume that the hold times are very uniform. But what happens
> when you e.g. have a workload where 50% of the lock acquisitions are short 
> and 30% are long? 
> 
> I'm a little sceptical of such "too clever" algorithms.

Me too.  That said, I cannot resist pointing out that the measurement
of interest would be the fraction of lock-hold times that are shorter
than the context-switch time.  ;-)

Thanx, Paul


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-23 Thread Andi Kleen
> *) compute the context-switch pair time average for the system.  This is 
> your time threshold (CSt).

This is not a uniform time. Consider the difference between
context switch on the same hyperthread, context switch between cores
on a die, context switch between sockets, context switch between
distant numa nodes. You could have several orders of magnitude
between all those.

>
> *) For each lock, maintain an average hold-time (AHt) statistic (I am 
> assuming this can be done cheaply...perhaps not).

That would assume that the hold times are very uniform. But what happens
when you e.g. have a workload where 50% of the lock acquisitions are short 
and 30% are long? 

I'm a little sceptical of such "too clever" algorithms.

-Andi


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Sven-Thorsten Dietrich

On Fri, 2008-02-22 at 13:36 -0700, Peter W. Morreale wrote:
> On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> > 
> > In high-contention, short-hold time situations, it may even make sense
> > to have multiple CPUs with multiple waiters spinning, depending on
> > hold-time vs. time to put a waiter to sleep and wake them up.
> > 
> > The wake-up side could also walk ahead on the queue, and bring up
> > spinners from sleeping, so that they are all ready to go when the lock
> > flips green for them.
> > 
> 
> I did attempt this at one time.  The logic was merely: if the
> pending owner was running, wake up the next waiter.  The results were
> terrible for the benchmarks used, as compared to the current
> implementation. 

Yup, but you cut the CONTEXT where I said:

"for very large SMP systems"

Specifically, what I mean is an SMP system where I have enough CPUs to
do this:

let (t_Tcs) be the time to lock, transition and unlock an un-contended
critical section (i.e. the one that I am the pending waiter for).

let (t_W) be the time to wake up a sleeping task.

and let (t_W > t_Tcs)

Then, "for very large SMP systems"

if 

S = (t_W / t_Tcs), 

then S designates the number of tasks to transition a critical section
before the first sleeper would wake up.

and the number of CPUs > S.

The time for an arbitrary number of tasks N > S which are all competing
for lock L, to transition a critical section (T_N_cs), approaches:

T_N_cs = (N * t_W) 

if you have only 1 task spinning.

but if you can have 

N tasks spinning, (T_N_cs) approaches:

T_N_cs = (N * t_Tcs)

and with the premise, that t_W > t_Tcs, you should see a dramatic
throughput improvement when running PREEMPT_RT on VERY LARGE SMP
systems.

I want to disclaim that the math above is very much simplified, but I
hope it's sufficient to demonstrate the concept.
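
To put rough numbers on the comparison, here is a minimal sketch that just
plugs assumed values into the formulas above (t_Tcs = 2 us, t_W = 10 us,
N = 64 waiters - all illustrative assumptions, not measurements):

#include <stdio.h>

int main(void)
{
        double t_Tcs = 2.0;     /* assumed: us to lock, transition, unlock */
        double t_W   = 10.0;    /* assumed: us to wake a sleeping task     */
        double N     = 64.0;    /* competing tasks                         */

        double S = t_W / t_Tcs;            /* sections transitioned per wakeup */
        double one_spinner  = N * t_W;     /* T_N_cs with only 1 task spinning */
        double all_spinning = N * t_Tcs;   /* T_N_cs with N tasks spinning     */

        printf("S = %.1f, one spinner: %.0f us, all spinning: %.0f us\n",
               S, one_spinner, all_spinning);
        return 0;
}

With these assumed numbers the all-spinning case finishes the N transitions
roughly t_W / t_Tcs times faster, which is exactly the point of the premise
t_W > t_Tcs.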

I have to acknowledge Ingo's comments that this is all suspect until
proven to make a positive difference in "non-marketing" workloads.

I personally *think* we are past that already, and the adaptive concept
can and will be extended and scaled as M-socket and N-core based SMP
proliferates into larger grid-based systems. But there is plenty more
to do to prove it.

(someone send me a 1024 CPU box and a wind-powered-generator)

Sven



> 
> What this meant was that virtually every unlock performed a wakeup, if
> not for the new pending owner, then for the next-in-line waiter. 
> 
> My impression at the time was that the contention for the rq lock is
> significant, even if the task being woken up was already
> running.  
> 
> I can generate numbers if that helps.
> 
> -PWM
> 
> 
> 



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Gregory Haskins

Paul E. McKenney wrote:

> Governing the timeout by context-switch overhead sounds even better to me.
> Really easy to calibrate, and short critical sections are of much shorter
> duration than a context-switch pair.


Yeah, fully agree.  This is on my research "todo" list.  My theory is 
that the ultimate adaptive-timeout algorithm here would essentially be 
the following:


*) compute the context-switch pair time average for the system.  This is 
your time threshold (CSt).


*) For each lock, maintain an average hold-time (AHt) statistic (I am 
assuming this can be done cheaply...perhaps not).


The adaptive code would work as follows:

if (AHt > CSt) /* don't even bother if the average is greater than CSt */
   timeout = 0;
else
   timeout = AHt;

if (adaptive_wait(timeout))
   sleep();

Anyone have some good ideas on how to compute CSt?  I was thinking you 
could create two kthreads that message one another (measuring round-trip 
time) for some number of round trips (say 100) to get an average.  You could probably 
just approximate it with flushing workqueue jobs.
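
A minimal userspace approximation of that round-trip idea - two processes
pinned to one CPU and ping-ponging a byte over pipes stand in for the two
kthreads; the CPU choice and iteration count are assumptions:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ROUND_TRIPS 100

static void pin_to_cpu0(void)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
        int ab[2], ba[2];
        char c = 0;
        struct timespec t0, t1;

        if (pipe(ab) || pipe(ba))
                return 1;
        pin_to_cpu0();                  /* affinity is inherited across fork() */

        if (fork() == 0) {              /* child: echo every byte back */
                for (int i = 0; i < ROUND_TRIPS; i++) {
                        read(ab[0], &c, 1);
                        write(ba[1], &c, 1);
                }
                _exit(0);
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUND_TRIPS; i++) {  /* each trip ~ one switch pair */
                write(ab[1], &c, 1);
                read(ba[0], &c, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        wait(NULL);

        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                  (t1.tv_nsec - t0.tv_nsec);
        printf("approx context-switch pair: %ld ns\n", ns / ROUND_TRIPS);
        return 0;
}

The pipe and syscall overhead means this slightly overestimates a bare
context-switch pair.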


-Greg





Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Peter W. Morreale
On Fri, 2008-02-22 at 11:55 -0800, Sven-Thorsten Dietrich wrote:
> 
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
> 
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.
> 

I did attempt this at one time.  The logic was merely: if the
pending owner was running, wake up the next waiter.  The results were
terrible for the benchmarks used, as compared to the current
implementation.

What this meant was that virtually every unlock performed a wakeup, if
not for the new pending owner, then for the next-in-line waiter.

My impression at the time was that the contention for the rq lock is
significant, even if the task being woken up was already
running.

I can generate numbers if that helps.

-PWM





Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Paul E. McKenney
On Fri, Feb 22, 2008 at 11:55:45AM -0800, Sven-Thorsten Dietrich wrote:
> 
> On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> 
> > > wrote:
> > > >  Yeah, I'm not very keen on having a constant there without some
> > > >  contention instrumentation to see how long the spins are. It would be
> > > >  better to just let it run until either task->on_cpu is off or checking
> > > >  if the "current" no longer matches the mutex owner for the runqueue
> > > >  in question. At that point, you know the thread isn't running.
> > > >  Spinning on something like that is just a waste of time. It's for that
> > > >  reason that doing the spin outside of a preempt critical section
> > > >  isn't really needed
> > > 
> > > Excuse me, I meant to say "...isn't a problem".
> > 
> > The fixed-time spins are very useful in cases where the critical section
> > is almost always very short but can sometimes be very long.  In such
> > cases, you would want to spin until either ownership changes or it is
> > apparent that the current critical-section instance will be long.
> > 
> > I believe that there are locks in the Linux kernel that have this
> > "mostly short but sometimes long" hold-time property.
> 
> In regard to this "mostly short but sometimes long" question:
> for very large SMP systems, running with some profiling enabled might
> allow the system to adapt to varying workloads and therefore shifting
> lock contention / hold-times. 
> 
> Overall utilization despite the overhead might be lower, but this is
> tbd.
> 
> In high-contention, short-hold time situations, it may even make sense
> to have multiple CPUs with multiple waiters spinning, depending on
> hold-time vs. time to put a waiter to sleep and wake them up.
> 
> The wake-up side could also walk ahead on the queue, and bring up
> spinners from sleeping, so that they are all ready to go when the lock
> flips green for them.
> 
> But in more simple cases, there should be a simple, default timeout
> governed by context switch overhead or as defined by a derived number of
> cache misses, as you suggested.

Governing the timeout by context-switch overhead sounds even better to me.
Really easy to calibrate, and short critical sections are of much shorter
duration than a context-switch pair.

Thanx, Paul

> Sven
> 
> > Thanx, Paul
> 


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Peter W. Morreale
On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> 
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long.  In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.
> 

This is what the adaptive patch set does.  All current spinners exit the
loop on an owner change.  

In this case each "spinner" will traverse the outer rt lock loop and
re-attempt to acquire the lock.  If unsuccessful, the task will re-enter
the adaptive wait loop again, provided that there are still "timeout"
iterations left for that particular task. 
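
A minimal userspace toy model of that flow - the names, the owner-running
flag, and the loop budget are illustrative stand-ins, not the rtmutex code
from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_lock {
        atomic_long owner;              /* 0 == unowned, otherwise an owner id */
        atomic_bool owner_running;      /* stand-in for "owner is on a CPU"    */
};

/* Spin up to 'timeout' iterations; return true if the caller should sleep. */
static bool toy_adaptive_wait(struct toy_lock *lock, long seen_owner, int timeout)
{
        for (; timeout > 0; timeout--) {
                if (atomic_load(&lock->owner) != seen_owner)
                        return false;   /* ownership changed: retry the outer loop */
                if (!atomic_load(&lock->owner_running))
                        return true;    /* owner was preempted: sleeping is cheaper */
                /* cpu_relax() would go here in the kernel */
        }
        return true;                    /* loop budget exhausted */
}

int main(void)
{
        struct toy_lock lock;

        atomic_store(&lock.owner, 42);
        atomic_store(&lock.owner_running, true);
        printf("budget exhausted, sleep? %d\n", toy_adaptive_wait(&lock, 42, 10000));

        atomic_store(&lock.owner, 0);   /* ownership changed mid-spin */
        printf("owner changed,    sleep? %d\n", toy_adaptive_wait(&lock, 42, 10000));
        return 0;
}

In the actual patch the remaining budget lives in the per-task adaptive_waiter
(seeded from the rtlock_timeout sysctl), so a task that loops back and fails
to re-take the lock resumes with whatever iterations it has left.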

I do have some stats on the rt lock/adaptive algorithm from a crude
patch used during development.  As Bill has noted, the correct way to
obtain these would be to extend lockstats.  It was just more expedient
to create this crude patch at the time.  If it is helpful to look at
these results in the meantime, I can easily make them, as well as the
patch, available.  


-PWM



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Sven-Thorsten Dietrich

On Fri, 2008-02-22 at 11:43 -0800, Paul E. McKenney wrote:
> On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> > On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> > >  Yeah, I'm not very keen on having a constant there without some
> > >  contention instrumentation to see how long the spins are. It would be
> > >  better to just let it run until either task->on_cpu is off or checking
> > >  if the "current" no longer matches the mutex owner for the runqueue
> > >  in question. At that point, you know the thread isn't running.
> > >  Spinning on something like that is just a waste of time. It's for that
> > >  reason that doing the spin outside of a preempt critical section
> > >  isn't really needed
> > 
> > Excuse me, I meant to say "...isn't a problem".
> 
> The fixed-time spins are very useful in cases where the critical section
> is almost always very short but can sometimes be very long.  In such
> cases, you would want to spin until either ownership changes or it is
> apparent that the current critical-section instance will be long.
> 
> I believe that there are locks in the Linux kernel that have this
> "mostly short but sometimes long" hold-time property.

In regard to this "mostly short but sometimes long" question:
for very large SMP systems, running with some profiling enabled might
allow the system to adapt to varying workloads and therefore shifting
lock contention / hold-times. 

Overall utilization despite the overhead might be lower, but this is
tbd.

In high-contention, short-hold time situations, it may even make sense
to have multiple CPUs with multiple waiters spinning, depending on
hold-time vs. time to put a waiter to sleep and wake them up.

The wake-up side could also walk ahead on the queue, and bring up
spinners from sleeping, so that they are all ready to go when the lock
flips green for them.

But in more simple cases, there should be a simple, default timeout
governed by context switch overhead or as defined by a derived number of
cache misses, as you suggested.

Sven

>   Thanx, Paul



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Paul E. McKenney
On Fri, Feb 22, 2008 at 11:21:14AM -0800, Bill Huey (hui) wrote:
> On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
> >  Yeah, I'm not very keen on having a constant there without some
> >  contention instrumentation to see how long the spins are. It would be
> >  better to just let it run until either task->on_cpu is off or checking
> >  if the "current" no longer matches the mutex owner for the runqueue
> >  in question. At that point, you know the thread isn't running.
> >  Spinning on something like that is just a waste of time. It's for that
> >  reason that doing the spin outside of a preempt critical section
> >  isn't really needed
> 
> Excuse me, I meant to say "...isn't a problem".

The fixed-time spins are very useful in cases where the critical section
is almost always very short but can sometimes be very long.  In such
cases, you would want to spin until either ownership changes or it is
apparent that the current critical-section instance will be long.

I believe that there are locks in the Linux kernel that have this
"mostly short but sometimes long" hold-time property.

Thanx, Paul


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Bill Huey (hui)
On Fri, Feb 22, 2008 at 11:19 AM, Bill Huey (hui) <[EMAIL PROTECTED]> wrote:
>  Yeah, I'm not very keen on having a constant there without some
>  contention instrumentation to see how long the spins are. It would be
>  better to just let it run until either task->on_cpu is off or checking
>  if the "current" no longer matches the mutex owner for the runqueue
>  in question. At that point, you know the thread isn't running.
>  Spinning on something like that is just a waste of time. It's for that
>  reason that doing the spin outside of a preempt critical section
>  isn't really needed

Excuse me, I meant to say "...isn't a problem".

bill


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Bill Huey (hui)
On Fri, Feb 22, 2008 at 11:08 AM, Paul E. McKenney
<[EMAIL PROTECTED]> wrote:
>  One approach would be to set the RTLOCK_DELAY parameter to something like
>  -1 for default, and to set it to the number of cycles required for about
>  10 cache misses at boot time.  This would automatically scale with CPU
>  frequency and memory latency.

Yeah, I'm not very keen on having a constant there without some
contention instrumentation to see how long the spins are. It would be
better to just let it run until either task->on_cpu is off or checking
if the "current" no longer matches the mutex owner for the runqueue
in question. At that point, you know the thread isn't running.
Spinning on something like that is just a waste of time. It's for that
reason that doing the spin outside of a preempt critical section
isn't really needed

bill


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-22 Thread Paul E. McKenney
On Thu, Feb 21, 2008 at 05:41:09PM +0100, Andi Kleen wrote:
> 
> > +config RTLOCK_DELAY
> > +   int "Default delay (in loops) for adaptive rtlocks"
> > +   range 0 10
> > +   depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
> > +   default "1"
> 
> Perhaps you can expand how you came up with that default number? 
> It looks suspiciously round and worse the actual spin time depends a lot on 
> the 
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
> differently from a 2Ghz CPU) Did you experiment with other spin times? 
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted? 
> 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

One approach would be to set the RTLOCK_DELAY parameter to something like
-1 for default, and to set it to the number of cycles required for about
10 cache misses at boot time.  This would automatically scale with CPU
frequency and memory latency.
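
A userspace stand-in for that boot-time measurement, assuming a
pointer-chasing buffer much larger than the caches (buffer size, stride and
iteration count are all assumptions, and the kernel would presumably drive
the same idea off the cycle counter at boot instead):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS  (64UL * 1024 * 1024 / sizeof(void *))  /* ~64 MB of pointers */
#define CHASES 1000000

int main(void)
{
        void **buf = malloc(SLOTS * sizeof(void *));
        size_t i, stride = 9973;        /* odd stride, co-prime with SLOTS */
        struct timespec t0, t1;

        if (!buf)
                return 1;

        /* Link the buffer into one long pointer-chasing cycle. */
        for (i = 0; i < SLOTS; i++)
                buf[i] = &buf[(i + stride) % SLOTS];

        void **p = buf;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < CHASES; i++)    /* dependent loads, mostly misses */
                p = *p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                  (t1.tv_nsec - t0.tv_nsec);
        printf("~%ld ns/miss, 10-miss spin budget ~%ld ns (%p)\n",
               ns / CHASES, 10 * ns / CHASES, (void *)p);
        free(buf);
        return 0;
}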

Thanx, Paul


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Sven-Thorsten Dietrich

On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +   int "Default delay (in loops) for adaptive rtlocks"
> > +   range 0 10
> > +   depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
> > +   default "1"
> 
> Perhaps you can expand how you came up with that default number? 

We did not want to create a hotspot around time sources for HRT and the
scheduler clock, and there really hasn't been enough analysis.

The loop should be calibrated using something similar to
loops_per_jiffy, and it should be in nanoseconds.

It needs to be tunable, because it depends a lot on the workload.

High frequency periodic tasks would need a lower setting - but it also
relates to the number of waiting tasks and the number of CPUs, so there
may be heuristics that tie into lockstat.

For big-SMP systems, it may actually be worth the overhead to track
these stats per-lock (for the hot locks), if we can correlate it all to
performance.

> It looks suspiciously round and worse the actual spin time depends a lot on 
> the 
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
> differently from a 2Ghz CPU) Did you experiment with other spin times? 
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted? 
> 

The code actually runs preemptibly, so even before the timeout expires,
the task can pop off the CPU (at which point another state change
cancels the loop).

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.
> 

True - I looked at something generic, similar to what RT's latency
tracing uses, allowing for other architectures.

Sven

> -Andi



Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Peter W. Morreale
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +   int "Default delay (in loops) for adaptive rtlocks"
> > +   range 0 10
> > +   depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
> > +   default "1"
> 
> Perhaps you can expand how you came up with that default number? 
> It looks suspiciously round and worse the actual spin time depends a lot on 
> the 
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
> differently from a 2Ghz CPU) Did you experiment with other spin times? 
> Should it be scaled with number of CPUs? And at what point is real
> time behaviour visibly impacted? 
> 

We did play with a number of different values at first; however, this
needs more inspection.  With the loads we have played with, this
parameter is not that sensitive.

At first we only had the adaptive wait added to the rt spin lock code,
and the values chosen did not seem to have much of an impact until we
dropped below 1000.  Higher values had little impact.  Adding the
adaptive code to rt mutexes seems to make this (and the similar mutex
parameter) more sensitive; however, this is likely due to the particular
benchmarks we were using.

Note that the adaptive wait loop breaks on every owner change, so the
actual time spent spinning on a CPU (in that loop) is, in general,
limited to the time the lock is held.  (Note the other cases in that
adaptive wait loop as well) 

It seems that the higher values help in the cases where the lock
is stolen from the pending owner, in which case the pending owner can
and will iterate and again start spinning. 

We are certainly hoping that the community will test and also provide
feedback. 

 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.
> 
> -Andi

We (at least I  :-) did try that, to the detriment of throughput.   It
is seemingly overkill to more precisely account for the "max" time spent
spinning.  

At one point, I did have a patch that specified the sysctl as
"nanoseconds" and converted that to an approximate loop count (via the
math in udelay()).  The idea was that nanoseconds are more human
friendly, as well as hardware independent; however, on further review,
the idea was dropped.


Best,
-PWM




Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 11:41 AM, in message <[EMAIL PROTECTED]>,
Andi Kleen <[EMAIL PROTECTED]> wrote: 

>> +config RTLOCK_DELAY
>> +int "Default delay (in loops) for adaptive rtlocks"
>> +range 0 10
>> +depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
>> +default "1"
> 
> Perhaps you can expand how you came up with that default number? 

Actually, the number doesn't seem to matter that much as long as it is 
sufficiently long to make timeouts rare.  Most workloads will present 
some threshold for hold-time.  You generally get the best performance if the 
value is at least as "long" as that threshold.  Anything beyond that and there 
is no gain, but there doesn't appear to be a penalty either.  So we picked 
1 because we found it to fit that criterion quite well for our range of GHz 
class x86 machines.  YMMV, but that is why it's configurable ;)

> It looks suspiciously round and worse the actual spin time depends a lot on 
> the 
> CPU frequency (so e.g. a 3Ghz CPU will likely behave quite 
> differently from a 2Ghz CPU) 

Yeah, fully agree.  We really wanted to use a time-value here but ran into 
various problems that have yet to be resolved.  We have it on the todo list to 
express this in terms of ns so it at least will scale with the architecture.

> Did you experiment with other spin times?

Of course ;)

> Should it be scaled with number of CPUs?

Not to my knowledge, but we can put that as a research "todo".

> And at what point is real
> time behaviour visibly impacted? 

Well, if we did our jobs correctly, RT behavior should *never* be impacted.  
*Throughput* on the other hand... ;)

But it comes down to what I mentioned earlier.  There is that threshold that 
affects the probability of timing out.  Values lower than that threshold start 
to degrade throughput.  Values higher than that have no effect on throughput, 
but may drive the CPU utilization higher, which can theoretically impact tasks 
of equal or lesser priority by taking that resource away from them.  To date, 
we have not observed any real-world implications of this, however.

> 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations similar to what
> udelay does. That would be at least constant time.

I agree.  We need to move in the direction of a time basis.  The tradeoff is that 
it needs to be portable and low-impact (e.g. ktime_get() is too heavy-weight). 
I think one of the (not-included) patches converts a nanosecond value from the 
sysctl to approximate loop-counts using the bogomips data.  This was a decent 
compromise between the non-scaling loop counts and the heavy-weight official 
timing APIs.  We dropped it because we support older kernels which were 
conflicting with the patch.  We may have to resurrect it, however.
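
A minimal sketch of that nanoseconds-to-loops conversion, with
loops_per_jiffy and HZ as assumed example values (the real calibration would
come from the boot-time bogomips measurement, not these constants):

#include <stdio.h>

#define HZ           250UL              /* assumed tick rate */
#define NSEC_PER_SEC 1000000000UL

/* loops_per_jiffy loops take 1/HZ seconds, so scale ns accordingly. */
static unsigned long ns_to_spin_loops(unsigned long ns, unsigned long lpj)
{
        return (unsigned long)(((unsigned long long)ns * lpj * HZ) / NSEC_PER_SEC);
}

int main(void)
{
        unsigned long lpj = 4000000UL;  /* assumed calibration value */

        printf("50 us  -> %lu loops\n", ns_to_spin_loops(50000, lpj));
        printf("500 us -> %lu loops\n", ns_to_spin_loops(500000, lpj));
        return 0;
}

With this shape the sysctl could stay in nanoseconds, as suggested elsewhere
in the thread, while the spin loop itself keeps counting plain iterations.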

-Greg





Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Andi Kleen

> +config RTLOCK_DELAY
> + int "Default delay (in loops) for adaptive rtlocks"
> + range 0 10
> + depends on ADAPTIVE_RTLOCK

I must say I'm not a big fan of putting such subtle configurable numbers
into Kconfig. Compilation is usually the wrong place to configure
such a thing. Just having it as a sysctl only should be good enough.

> + default "1"

Perhaps you can expand how you came up with that default number? 
It looks suspiciously round and, worse, the actual spin time depends a lot on the 
CPU frequency (so e.g. a 3GHz CPU will likely behave quite 
differently from a 2GHz CPU).  Did you experiment with other spin times? 
Should it be scaled with the number of CPUs? And at what point is real-time 
behaviour visibly impacted? 

Most likely it would be better to switch to something that is more
absolute time, like checking RDTSC every few iterations similar to what
udelay does. That would be at least constant time.
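
A hedged userspace sketch of that shape - CLOCK_MONOTONIC stands in for
RDTSC, and the 64-iteration poll interval and the exit condition are
assumptions, not anything from the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static volatile int owner_changed;      /* stand-in for the real exit condition */

/* Spin until the condition fires or timeout_ns elapses; poll the clock
 * only every 64 iterations to keep the fast path cheap. */
static bool spin_bounded_ns(uint64_t timeout_ns)
{
        uint64_t deadline = now_ns() + timeout_ns;
        unsigned int n = 0;

        while (!owner_changed) {
                if ((++n & 63) == 0 && now_ns() > deadline)
                        return false;   /* budget spent: go to sleep instead */
        }
        return true;
}

int main(void)
{
        /* 100 us budget; the condition never fires here, so this times out. */
        printf("condition met before timeout? %d\n", spin_bounded_ns(100000));
        return 0;
}

This keeps the bound constant in wall-clock terms regardless of CPU
frequency, which is the property being asked for; the loop-counter default
in the patch trades that away for a cheaper per-iteration check.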

-Andi


[PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Gregory Haskins
From: Sven Dietrich <[EMAIL PROTECTED]>

Signed-off-by: Sven Dietrich <[EMAIL PROTECTED]>
---

 kernel/Kconfig.preempt|   11 +++
 kernel/rtmutex.c  |4 
 kernel/rtmutex_adaptive.h |   11 +--
 kernel/sysctl.c   |   12 
 4 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 6568519..eebec19 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -212,6 +212,17 @@ config ADAPTIVE_RTLOCK
 
 If unsure, say Y
 
+config RTLOCK_DELAY
+   int "Default delay (in loops) for adaptive rtlocks"
+   range 0 10
+   depends on ADAPTIVE_RTLOCK
+   default "1"
+help
+ This allows you to specify the maximum attempts a task will spin
+attempting to acquire an rtlock before sleeping.  The value is
+tunable at runtime via a sysctl.  A setting of 0 (zero) disables
+the adaptive algorithm entirely.
+
 config SPINLOCK_BKL
bool "Old-Style Big Kernel Lock"
depends on (PREEMPT || SMP) && !PREEMPT_RT
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index feb938f..4a7423f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,10 @@
 #include "rtmutex_common.h"
 #include "rtmutex_adaptive.h"
 
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+int rtlock_timeout __read_mostly = CONFIG_RTLOCK_DELAY;
+#endif
+
 /*
  * lock->owner state tracking:
  *
diff --git a/kernel/rtmutex_adaptive.h b/kernel/rtmutex_adaptive.h
index 505fed5..b7e282b 100644
--- a/kernel/rtmutex_adaptive.h
+++ b/kernel/rtmutex_adaptive.h
@@ -39,6 +39,7 @@
 #ifdef CONFIG_ADAPTIVE_RTLOCK
 struct adaptive_waiter {
struct task_struct *owner;
+   int timeout;
 };
 
 /*
@@ -60,7 +61,7 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
 {
int sleep = 0;
 
-   for (;;) {
+   for (; adaptive->timeout > 0; adaptive->timeout--) {
/*
 * If the task was re-awoken, break out completely so we can
 * reloop through the lock-acquisition code.
@@ -101,6 +102,9 @@ adaptive_wait(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
cpu_relax();
}
 
+   if (adaptive->timeout <= 0)
+   sleep = 1;
+
put_task_struct(adaptive->owner);
 
return sleep;
@@ -118,8 +122,11 @@ prepare_adaptive_wait(struct rt_mutex *lock, struct adaptive_waiter *adaptive)
get_task_struct(adaptive->owner);
 }
 
+extern int rtlock_timeout;
+
 #define DECLARE_ADAPTIVE_WAITER(name) \
- struct adaptive_waiter name = { .owner = NULL, }
+ struct adaptive_waiter name = { .owner = NULL,   \
+ .timeout = rtlock_timeout, }
 
 #else
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 541aa9f..36259e4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -58,6 +58,8 @@
 #include <asm/stacktrace.h>
 #endif
 
+#include "rtmutex_adaptive.h"
+
 static int deprecated_sysctl_warning(struct __sysctl_args *args);
 
 #if defined(CONFIG_SYSCTL)
@@ -964,6 +966,16 @@ static struct ctl_table kern_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif
+#ifdef CONFIG_ADAPTIVE_RTLOCK
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "rtlock_timeout",
+   .data   = &rtlock_timeout,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec,
+   },
+#endif
 #ifdef CONFIG_PROC_FS
{
.ctl_name   = CTL_UNNUMBERED,

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Peter W. Morreale
On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +   int "Default delay (in loops) for adaptive rtlocks"
> > +   range 0 10
> > +   depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
> > +   default "1"
> 
> Perhaps you can expand on how you came up with that default number? 
> It looks suspiciously round, and worse, the actual spin time depends a lot on
> the CPU frequency (so e.g. a 3GHz CPU will likely behave quite
> differently from a 2GHz CPU).  Did you experiment with other spin times?
> Should it be scaled with the number of CPUs? And at what point is real-time
> behaviour visibly impacted? 
> 

We did play with a number of different values at first; however, this
needs more inspection.  With the loads we have played with, this
parameter is not that sensitive.

At first we only had the adaptive wait added to the rt spin lock code,
and the values chosen did not seem to have much of an impact until we
dropped below 1000.  Higher values had little impact.  Adding the
adaptive code to rt mutexes seems to make this (and the similar mutex
parameter) more sensitive, however this is likely due to the particular
benchmarks we were using.

Note that the adaptive wait loop breaks on every owner change, so the
actual time spent spinning on a CPU (in that loop) is, in general,
limited to the time the lock is held.  (Note the other cases in that
adaptive wait loop as well) 

It seems that the higher values help in the cases where the lock is
stolen from the pending owner, in which case the pending owner can
and will loop around and start spinning again. 
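
To make that concrete, here is a small stand-alone toy of that property
(illustrative only; the real code is adaptive_wait() in the patch, and the
names here are invented):

#include <stdatomic.h>
#include <stdbool.h>

struct toy_lock {
        _Atomic int owner;      /* 0 means unowned */
};

/* Spin for at most timeout_loops iterations while the same owner holds the
 * lock.  Returns true if the caller should give up and sleep (budget spent),
 * false if the owner changed or released and the caller should retry. */
static bool toy_adaptive_wait(struct toy_lock *lock, int expected_owner,
                              int timeout_loops)
{
        for (; timeout_loops > 0; timeout_loops--) {
                if (atomic_load(&lock->owner) != expected_owner)
                        return false;   /* owner changed: stop spinning */
        }
        return true;                    /* spun the whole budget: block */
}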

We are certainly hoping that the community will test and also provide
feedback. 

 
> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations, similar to what
> udelay does. That would at least be constant time.
> 
> -Andi

We (at least I  :-) did try that, to the detriment of throughput.   It
is seemingly overkill to more precisely account for the max time spent
spinning.  

At one point, I did have a patch that specified the sysctl in
nanoseconds and converted that to an approximate loop count (via the
math in udelay()).  The idea was that nanoseconds are more
human-friendly, as well as hardware-independent; however, on further
review, the idea was dropped.
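
For reference, a rough sketch of what that sort of conversion looks like
(illustrative only, not the dropped patch; the helper name is invented, and it
assumes a calibrated loops_per_jiffy and HZ, the same ingredients udelay()
builds on):

/* Convert a nanosecond budget into an approximate spin-loop count,
 * rounding up so the spin never comes out shorter than requested. */
static unsigned long timeout_ns_to_loops(unsigned long long timeout_ns,
                                         unsigned long loops_per_jiffy,
                                         unsigned int hz)
{
        unsigned long long loops_per_sec =
                (unsigned long long)loops_per_jiffy * hz;

        return (unsigned long)((loops_per_sec * timeout_ns + 999999999ULL) /
                               1000000000ULL);
}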


Best,
-PWM


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH [RT] 08/14] add a loop counter based timeout mechanism

2008-02-21 Thread Sven-Thorsten Dietrich

On Thu, 2008-02-21 at 17:41 +0100, Andi Kleen wrote:
> > +config RTLOCK_DELAY
> > +   int "Default delay (in loops) for adaptive rtlocks"
> > +   range 0 10
> > +   depends on ADAPTIVE_RTLOCK
> 
> I must say I'm not a big fan of putting such subtle configurable numbers
> into Kconfig. Compilation is usually the wrong place to configure
> such a thing. Just having it as a sysctl only should be good enough.
> 
> > +   default "1"
> 
> Perhaps you can expand on how you came up with that default number? 

We did not want to create a hotspot around time sources for HRT and the
scheduler clock, and there really hasn't been enough analysis.

The loop should be calibrated using something similar to
loops_per_jiffy, and it should be in nanoseconds.

It needs to be tunable, because it depends a lot on the workload.

High frequency periodic tasks would need a lower setting - but it also
relates to the number of waiting tasks and the number of CPUs, so there
may be heuristics that tie into lockstat.

For big-SMP systems, it may actually be worth the overhead to track
these stats per-lock (for the hot locks), if we can correlate it all to
performance.
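
Purely as an illustration of the kind of heuristic meant here (none of these
names exist in the patch set, and the scaling rule is invented for the
example), the per-task spin budget could be derated by contention and by the
CPU count:

/* Toy heuristic: shrink the base spin budget as more tasks wait on the same
 * lock, and cut it further once waiters outnumber the CPUs, since the extra
 * spinners cannot all make forward progress in parallel anyway. */
static int scaled_spin_budget(int base_loops, int nr_waiters, int nr_cpus)
{
        int budget = base_loops / (nr_waiters > 0 ? nr_waiters : 1);

        if (nr_cpus > 0 && nr_waiters >= nr_cpus)
                budget /= 2;
        return budget;
}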

> It looks suspiciously round, and worse, the actual spin time depends a lot on
> the CPU frequency (so e.g. a 3GHz CPU will likely behave quite
> differently from a 2GHz CPU).  Did you experiment with other spin times?
> Should it be scaled with the number of CPUs? And at what point is real-time
> behaviour visibly impacted? 
> 

The code actually runs preemptibly, so even before the timeout expires,
the task can pop off the CPU (at which point another state change
cancels the loop).

> Most likely it would be better to switch to something that is more
> absolute time, like checking RDTSC every few iterations, similar to what
> udelay does. That would at least be constant time.
> 

True - I looked at something generic, similar to what RT's latency
tracing uses, allowing for other architectures.

Sven

> -Andi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/