Re: Network slowdown due to CFS

2007-10-07 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> > [...] The timeslices of tasks (i.e. the time they spend on a CPU 
> > without scheduling away) is _not_ maintained directly in CFS as a 
> > per-task variable that can be "cleared", it's not the metric that 
> > drives scheduling. Yes, of course CFS too "slices up CPU time", but 
> > those slices are not the per-task variables of traditional 
> > schedulers and cannot be 'cleared'.
> 
> It's not about this comment alone, but this comment plus "no notion" 
> comment, which appears in sched-design-CFS.txt too.

ok - i've re-read it and it indeed is somewhat confusing without 
additional context. I'll improve the wording. (sched-design-CFS.txt 
needs an update anyway)

Ingo


Re: Network slowdown due to CFS

2007-10-03 Thread Casey Dahlin

Ingo Molnar wrote:

> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
>
> > [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems
> > to return... time.)
>
> wrong again. That is a function, not a variable to be cleared.

It still gives us a target time, so could we not simply have sched_yield
put the thread completely to sleep for the given amount of time? It
wholly redefines the operation, and it's far more expensive (now there's
a whole new timer involved), but it might emulate the expected behavior.
It's hideous, but so is sched_yield in the first place, so why not?
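A purely illustrative userspace sketch of that idea -- sleeping for roughly one slice instead of yielding. The 1 ms figure is invented, since sched_slice() is not visible from userspace; a real implementation would live in the kernel and arm a proper timer:

#include <time.h>

/* Illustrative only: "yield" by sleeping for about one slice.  The 1 ms
 * value is a guess standing in for sched_slice(), which userspace cannot
 * query. */
static void yield_by_sleeping(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1 * 1000 * 1000 };

	nanosleep(&ts, NULL);
}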


--CJD


RE: Network slowdown due to CFS

2007-10-03 Thread Rusty Russell
On Mon, 2007-10-01 at 09:49 -0700, David Schwartz wrote:
> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at all,
> > > even where it's right.
> 
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
> 
> It can occasionally be an optimization. You may have a case where you can do
> something very efficiently if a lock is not held, but you cannot afford to
> wait for the lock to be released. So you check the lock, if it's held, you
> yield and then check again. If that fails, you do it the less optimal way
> (for example, dispatching it to a thread that *can* afford to wait).

This used to be true, and still is if you want to be portable.  But the
point of futexes was precisely to attack this use case: whereas
sched_yield() says "I'm waiting for something, but I won't tell you
what" the futex ops tells the kernel what you're waiting for.

While a futex op is slightly slower than sched_yield(),
futexes win in so many cases that we haven't found a benchmark where
yield wins.  Yield-lose cases include:
1) There are other unrelated processes that yield() ends up queueing
   behind.
2) The process you're waiting for doesn't conveniently sleep as soon as
   it releases the lock, so you wait for longer than intended,
3) You race between the yield and the lock being dropped.

In summary: spin N times & futex seems optimal.  The value of N depends
on the number of CPUs in the machine and other factors, but N=1 has
shown itself pretty reasonable.
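For illustration, a minimal sketch of the "spin N times, then futex" approach. This is not the glibc or kernel code: the futex() wrapper is ad hoc, the unlock always issues a wake even when nobody waits, and there is no error handling.

/* Sketch of "spin N times, then futex".  Assumes Linux + a C11 compiler. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

#define SPIN_TRIES 1                    /* "N=1 has shown itself pretty reasonable" */

static atomic_int lock_word;            /* 0 = free, 1 = locked */

static long futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(void)
{
	for (int i = 0; i < SPIN_TRIES; i++) {
		int expected = 0;
		if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
			return;                 /* got it without sleeping */
	}
	for (;;) {
		int expected = 0;
		if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
			return;
		/* tell the kernel exactly what we wait for: the word leaving 1 */
		futex(&lock_word, FUTEX_WAIT, 1);
	}
}

static void unlock(void)
{
	atomic_store(&lock_word, 0);
	futex(&lock_word, FUTEX_WAKE, 1);       /* wake one waiter, if any */
}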

Hope that helps,
Rusty.



Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On Wed, Oct 03, 2007 at 12:55:34PM +0200, Dmitry Adamushko wrote:
...
> just a quick patch, not tested and I've not evaluated all possible
> implications yet.
> But someone might give it a try with his/(her -- are even more
> welcomed :-) favourite sched_yield() load.

Of course, after some evaluation by yourself and Ingo, the most
interesting will be Martin Michlmayr's testing, so I hope you'll
Cc him too?!

Jarek P.


Re: Network slowdown due to CFS

2007-10-03 Thread Helge Hafting

David Schwartz wrote:

> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at all,
> > > even where it's right.
> >
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
>
> It can occasionally be an optimization. You may have a case where you can do
> something very efficiently if a lock is not held, but you cannot afford to
> wait for the lock to be released. So you check the lock, if it's held, you
> yield and then check again. If that fails, you do it the less optimal way
> (for example, dispatching it to a thread that *can* afford to wait).

How about:
Check the lock. If it is held, sleep for an interval that is shorter
than the acceptable waiting time. If it is still held, sleep for twice as
long. Loop until you get the lock and do the work, or until you reach
the limit for how much you can wait at this point, and then dispatch
to a thread instead.

This approach should be portable, doesn't wake up too often,
and doesn't waste the CPU.  (And it won't go idle either: whoever
holds the lock will be running.)
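A rough sketch of that backoff loop, assuming pthreads; the 50 us starting interval and the cap are arbitrary choices, not taken from any real allocator:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Sleep-and-double backoff as described above. */
static bool lock_with_backoff(pthread_mutex_t *m, long max_wait_ns)
{
	long slept_ns = 0;
	long interval_ns = 50 * 1000;           /* first nap: 50 us */

	while (pthread_mutex_trylock(m) != 0) {
		if (slept_ns >= max_wait_ns)
			return false;           /* give up: dispatch to a worker thread */

		struct timespec ts = { 0, interval_ns };
		nanosleep(&ts, NULL);           /* sleep instead of burning the CPU */

		slept_ns += interval_ns;
		interval_ns *= 2;               /* "sleep for twice as long" */
		if (interval_ns > 999999999L)   /* keep tv_nsec valid */
			interval_ns = 999999999L;
	}
	return true;                            /* got the lock; do the work */
}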


Helge Hafting


Re: Network slowdown due to CFS

2007-10-03 Thread Ingo Molnar

* Dmitry Adamushko <[EMAIL PROTECTED]> wrote:

> +   se->vruntime += delta_exec_weighted;

thanks Dmitry.

Btw., this is quite similar to the yield_granularity patch i did 
originally, just less flexible. It turned out that apps want either zero 
granularity or "infinite" granularity; they don't actually want something 
in between. Those are the two extremes that the current sysctl expresses 
in essence.

Ingo


Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On Wed, Oct 03, 2007 at 12:58:26PM +0200, Dmitry Adamushko wrote:
> On 03/10/2007, Dmitry Adamushko <[EMAIL PROTECTED]> wrote:
> > On 03/10/2007, Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> > > I can't see anything about clearing. I think, this was about charging,
> > > which should change the key enough, to move a task to, maybe, a better
> > > place in a queue (tree) than with current ways.
> >
> > just a quick patch, not tested and I've not evaluated all possible
> > implications yet.
> > But someone might give it a try with his/(her -- are even more
> > welcomed :-) favourite sched_yield() load.
> >
> > (and white space damaged)
> >
> > --- sched_fair-old.c2007-10-03 12:45:17.010306000 +0200
> > +++ sched_fair.c2007-10-03 12:44:46.899851000 +0200
...
> s/curr/se

Thanks very much!

Alas, I'll be able to look at this and try only in the evening.

Best regards,
Jarek P.


Re: Network slowdown due to CFS

2007-10-03 Thread Dmitry Adamushko
On 03/10/2007, Dmitry Adamushko <[EMAIL PROTECTED]> wrote:
> On 03/10/2007, Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> > I can't see anything about clearing. I think, this was about charging,
> > which should change the key enough, to move a task to, maybe, a better
> > place in a queue (tree) than with current ways.
>
> just a quick patch, not tested and I've not evaluated all possible
> implications yet.
> But someone might give it a try with his/(her -- are even more
> welcomed :-) favourite sched_yield() load.
>
> (and white space damaged)
>
> --- sched_fair-old.c2007-10-03 12:45:17.010306000 +0200
> +++ sched_fair.c2007-10-03 12:44:46.899851000 +0200
> @@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
> update_curr(cfs_rq);
>
> return;
> +   } else if (sysctl_sched_compat_yield == 2) {
> +   unsigned long ideal_runtime, delta_exec,
> + delta_exec_weighted;
> +
> +   __update_rq_clock(rq);
> +   /*
> +* Update run-time statistics of the 'current'.
> +*/
> +   update_curr(cfs_rq);
> +
> +   /*
> +* Emulate (speed up) the effect of us being preempted
> +* by scheduler_tick().
> +*/
> +   ideal_runtime = sched_slice(cfs_rq, curr);
> +   delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +
> +   if (ideal_runtime > delta_exec) {
> +   delta_exec_weighted = ideal_runtime - delta_exec;
> +
> +   if (unlikely(curr->load.weight != NICE_0_LOAD)) {
> +   delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
> +   &se->load);
> +   }


s/curr/se


-- 
Best regards,
Dmitry Adamushko


Re: Network slowdown due to CFS

2007-10-03 Thread Dmitry Adamushko
On 03/10/2007, Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> I can't see anything about clearing. I think, this was about charging,
> which should change the key enough, to move a task to, maybe, a better
> place in a queue (tree) than with current ways.

just a quick patch, not tested and I've not evaluated all possible
implications yet.
But someone might give it a try with his/(her -- are even more
welcomed :-) favourite sched_yield() load.

(and white space damaged)

--- sched_fair-old.c2007-10-03 12:45:17.010306000 +0200
+++ sched_fair.c2007-10-03 12:44:46.899851000 +0200
@@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
update_curr(cfs_rq);

return;
+   } else if (sysctl_sched_compat_yield == 2) {
+   unsigned long ideal_runtime, delta_exec,
+ delta_exec_weighted;
+
+   __update_rq_clock(rq);
+   /*
+* Update run-time statistics of the 'current'.
+*/
+   update_curr(cfs_rq);
+
+   /*
+* Emulate (speed up) the effect of us being preempted
+* by scheduler_tick().
+*/
+   ideal_runtime = sched_slice(cfs_rq, curr);
+   delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+   if (ideal_runtime > delta_exec) {
+   delta_exec_weighted = ideal_runtime - delta_exec;
+
+   if (unlikely(curr->load.weight != NICE_0_LOAD)) {
+   delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
+   &se->load);
+   }
+   se->vruntime += delta_exec_weighted;
+   }
+   return;
}
+
/*
 * Find the rightmost entry in the rbtree:
 */


>
> Jarek P.
>

-- 
Best regards,
Dmitry Adamushko
--- sched_fair-old.c	2007-10-03 12:45:17.010306000 +0200
+++ sched_fair.c	2007-10-03 12:44:46.899851000 +0200
@@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
 		update_curr(cfs_rq);
 
 		return;
+	} else if (sysctl_sched_compat_yield == 2) {
+		unsigned long ideal_runtime, delta_exec,
+			  delta_exec_weighted;
+
+		__update_rq_clock(rq);
+		/*
+		 * Update run-time statistics of the 'current'.
+		 */
+		update_curr(cfs_rq);
+
+		/*
+		 * Emulate the effect of us being preempted
+		 * by scheduler_tick().
+		 */
+		ideal_runtime = sched_slice(cfs_rq, curr);
+		delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+		if (ideal_runtime > delta_exec) {
+			delta_exec_weighted = ideal_runtime - delta_exec;
+
+			if (unlikely(curr->load.weight != NICE_0_LOAD)) {
+				delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
+								      &se->load);
+			}
+			se->vruntime += delta_exec_weighted;
+		}
+		return;
 	}
+
 	/*
 	 * Find the rightmost entry in the rbtree:
 	 */
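For anyone wanting to give the patch a try, a trivial sched_yield() load might look like the following; the thread and iteration counts are arbitrary, and the new behaviour above is presumably selected by setting the sched_compat_yield sysctl to 2 (as the added branch tests):

/* A trivial sched_yield() stress load: several CPU-bound threads that
 * yield on every iteration.  Build with: gcc -O2 -pthread yield-load.c */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS   4
#define ITERATIONS (10 * 1000 * 1000)

static void *worker(void *arg)
{
	volatile unsigned long sum = 0;

	(void)arg;
	for (long i = 0; i < ITERATIONS; i++) {
		sum += i;               /* a little CPU work ...        */
		sched_yield();          /* ... then hit yield_task_fair */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(&tid[i], NULL);
	puts("done");
	return 0;
}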


Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On Wed, Oct 03, 2007 at 11:10:58AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
> > > 
> > > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> > > 
> > > > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks 
> > > > > "earn" a right to the CPU, and that "right" is not sliced in the 
> > > > > traditional sense) But we tried a conceptually similar thing [...]
> > > > 
> > > > >From kernel/sched_fair.c:
> > > > 
> > > > "/*
> > > >  * Targeted preemption latency for CPU-bound tasks:
> > > >  * (default: 20ms, units: nanoseconds)
> > > >  *
> > > >  * NOTE: this latency value is not the same as the concept of
> > > >  * 'timeslice length' - timeslices in CFS are of variable length.
> > > >  * (to see the precise effective timeslice length of your workload,
> > > >  *  run vmstat and monitor the context-switches field)
> > > > ..."
> > > > 
> > > > So, no notion of something, which are(!) of variable length, and which 
> > > > precise effective timeslice length can be seen in nanoseconds? (But 
> > > > not timeslice!)
> > > 
> > > You should really read and understand the code you are arguing about :-/
> > 
> > Maybe you could help me with better comments? IMHO, it would be enough 
> > to warn new timeslices have different meaning, or stop to use this 
> > term at all. [...]
> 
> i'm curious, what better do you need than the very detailed comment 
> quoted above? Which bit of "this latency value is not the same as the 
> concept of timeslice length" is difficult to understand? The timeslices 
> of tasks (i.e. the time they spend on a CPU without scheduling away) is 
> _not_ maintained directly in CFS as a per-task variable that can be 
> "cleared", it's not the metric that drives scheduling. Yes, of course 
> CFS too "slices up CPU time", but those slices are not the per-task 
> variables of traditional schedulers and cannot be 'cleared'.

It's not about this comment alone, but this comment plus "no notion"
comment, which appears in sched-design-CFS.txt too.

> 
> > [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems 
> > to return... time.)
> 
> wrong again. That is a function, not a variable to be cleared. (Anyway, 
> the noise/signal ratio is getting increasingly high in this thread with 
> no progress in sight, so i cannot guarantee any further replies - 
> possibly others will pick up the tab and explain/discuss any other 
> questions that might come up. Patches are welcome of course.)

I can't see anything about clearing. I think, this was about charging,
which should change the key enough, to move a task to, maybe, a better
place in a queue (tree) than with current ways.

Jarek P.

PS: Don't you think that a nice argument with some celebrity, like Ingo
Molnar himself, is by far more interesting than those dull patches?


Re: Network slowdown due to CFS

2007-10-03 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
> > 
> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> > 
> > > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks 
> > > > "earn" a right to the CPU, and that "right" is not sliced in the 
> > > > traditional sense) But we tried a conceptually similar thing [...]
> > > 
> > > >From kernel/sched_fair.c:
> > > 
> > > "/*
> > >  * Targeted preemption latency for CPU-bound tasks:
> > >  * (default: 20ms, units: nanoseconds)
> > >  *
> > >  * NOTE: this latency value is not the same as the concept of
> > >  * 'timeslice length' - timeslices in CFS are of variable length.
> > >  * (to see the precise effective timeslice length of your workload,
> > >  *  run vmstat and monitor the context-switches field)
> > > ..."
> > > 
> > > So, no notion of something, which are(!) of variable length, and which 
> > > precise effective timeslice length can be seen in nanoseconds? (But 
> > > not timeslice!)
> > 
> > You should really read and understand the code you are arguing about :-/
> 
> Maybe you could help me with better comments? IMHO, it would be enough 
> to warn that the new timeslices have a different meaning, or to stop using 
> this term at all. [...]

i'm curious, what better do you need than the very detailed comment 
quoted above? Which bit of "this latency value is not the same as the 
concept of timeslice length" is difficult to understand? The timeslices 
of tasks (i.e. the time they spend on a CPU without scheduling away) is 
_not_ maintained directly in CFS as a per-task variable that can be 
"cleared", it's not the metric that drives scheduling. Yes, of course 
CFS too "slices up CPU time", but those slices are not the per-task 
variables of traditional schedulers and cannot be 'cleared'.

> [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems 
> to return... time.)

wrong again. That is a function, not a variable to be cleared. (Anyway, 
the noise/signal ratio is getting increasingly high in this thread with 
no progress in sight, so i cannot guarantee any further replies - 
possibly others will pick up the tab and explain/discuss any other 
questions that might come up. Patches are welcome of course.)

Ingo


Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> 
> > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks 
> > > "earn" a right to the CPU, and that "right" is not sliced in the 
> > > traditional sense) But we tried a conceptually similar thing [...]
> > 
> > >From kernel/sched_fair.c:
> > 
> > "/*
> >  * Targeted preemption latency for CPU-bound tasks:
> >  * (default: 20ms, units: nanoseconds)
> >  *
> >  * NOTE: this latency value is not the same as the concept of
> >  * 'timeslice length' - timeslices in CFS are of variable length.
> >  * (to see the precise effective timeslice length of your workload,
> >  *  run vmstat and monitor the context-switches field)
> > ..."
> > 
> > So, no notion of something, which are(!) of variable length, and which 
> > precise effective timeslice length can be seen in nanoseconds? (But 
> > not timeslice!)
> 
> You should really read and understand the code you are arguing about :-/

Maybe you could help me with better comments? IMHO, it would be enough
to warn that the new timeslices have a different meaning, or to stop using
this term at all. (Btw, in -rc8-mm2 I see a new sched_slice() function which
seems to return... time.)

> 
> In the 2.6.22 scheduler, there was a per-task p->time_slice variable 
> that could be manipulated. (Note that even in 2.6.22, sched_yield() did not 
> manipulate p->time_slice.)
> 
> sysctl_sched_latency on the other hand is not something that is per task 
> (it is global) so there is no pending timeslice to be "cleared" as it 
> has been suggested naively.

But there is this "something", very similar and very misleading, which you
count e.g. in check_preempt_curr_fair to find out if the time is over, and I
think this could be similar enough to what David Schwartz wanted to
use in his idea, yet you didn't care to explain why it's so different?

Jarek P.


Re: Network slowdown due to CFS

2007-10-03 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks 
> > "earn" a right to the CPU, and that "right" is not sliced in the 
> > traditional sense) But we tried a conceptually similar thing [...]
> 
> >From kernel/sched_fair.c:
> 
> "/*
>  * Targeted preemption latency for CPU-bound tasks:
>  * (default: 20ms, units: nanoseconds)
>  *
>  * NOTE: this latency value is not the same as the concept of
>  * 'timeslice length' - timeslices in CFS are of variable length.
>  * (to see the precise effective timeslice length of your workload,
>  *  run vmstat and monitor the context-switches field)
> ..."
> 
> So, no notion of something, which are(!) of variable length, and which 
> precise effective timeslice length can be seen in nanoseconds? (But 
> not timeslice!)

You should really read and understand the code you are arguing about :-/

In the 2.6.22 scheduler, there was a per-task p->time_slice variable 
that could be manipulated. (Note that even in 2.6.22, sched_yield() did not 
manipulate p->time_slice.)

sysctl_sched_latency on the other hand is not something that is per task 
(it is global) so there is no pending timeslice to be "cleared" as it 
has been suggested naively.
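Roughly speaking (a simplified sketch, not the exact -rc8-mm2 code), the sched_slice() function mentioned elsewhere in the thread derives the effective slice on demand from that global latency target and the task's share of the runqueue weight, so there is no stored per-task value that could be cleared:

#include <stdint.h>

/* Simplified illustration of the idea behind sched_slice(): the "slice"
 * is computed when needed from the global latency target and the entity's
 * share of the runqueue weight; nothing per-task is stored. */
static uint64_t slice_sketch(uint64_t sched_latency_ns,
			     unsigned long task_weight,
			     unsigned long rq_total_weight)
{
	return sched_latency_ns * task_weight / rq_total_weight;
}

/* e.g. slice_sketch(20000000ULL, 1024, 4 * 1024) == 5000000, i.e. about
 * 5 ms for one of four equally weighted nice-0 tasks under a 20 ms target. */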

Ingo


Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On 02-10-2007 08:06, Ingo Molnar wrote:
> * David Schwartz <[EMAIL PROTECTED]> wrote:
...
>> I'm not familiar enough with CFS' internals to help much on the 
>> implementation, but there may be some simple compromise yield that 
>> might work well enough. How about simply acting as if the task used up 
>> its timeslice and scheduling the next one? (Possibly with a slight 
>> reduction in penalty or reward for not really using all the time, if 
>> possible?)
> 
> firstly, there's no notion of "timeslices" in CFS. (in CFS tasks "earn" 
> a right to the CPU, and that "right" is not sliced in the traditional 
> sense) But we tried a conceptually similar thing [...]

From kernel/sched_fair.c:

"/*
 * Targeted preemption latency for CPU-bound tasks:
 * (default: 20ms, units: nanoseconds)
 *
 * NOTE: this latency value is not the same as the concept of
 * 'timeslice length' - timeslices in CFS are of variable length.
 * (to see the precise effective timeslice length of your workload,
 *  run vmstat and monitor the context-switches field)
..."

So, no notion of something, which are(!) of variable length, and which
precise effective timeslice length can be seen in nanoseconds? (But
not timeslice!)

Well, I'm starting to think this new scheduler could still be too simple...


> [...] [ and this is driven by compatibility 
> goals - regardless of how broken we consider yield use. The ideal 
> solution is of course to almost never use yield. Fortunately 99%+ of 
> Linux apps follow that ideal solution ;-) ]

Nevertheless, it seems, this 1% is important enough to boast a little:

  "( another detail: due to nanosec accounting and timeline sorting,
 sched_yield() support is very simple under CFS, and in fact under
 CFS sched_yield() behaves much better than under any other
 scheduler i have tested so far. )"
[Documentation/sched-design-CFS.txt]

Cheers,
Jarek P.


Re: Network slowdown due to CFS

2007-10-03 Thread Jarek Poplawski
On 02-10-2007 17:37, David Schwartz wrote:
...
> So now I not only have to come up with an example where sched_yield is the
> best practical choice, I have to come up with one where sched_yield is the
> best conceivable choice? Didn't we start out by agreeing these are very rare
> cases? Why are we designing new APIs for them (Arjan) and why do we care
> about their performance (Ingo)?
> 
> These are *rare* cases. It is a waste of time to optimize them.

Probably we'll start to care after the first comparison tests done by our
rivals. It should be a piece of cake for them to find the "right"
code...

Regards,
Jarek P.


RE: Network slowdown due to CFS

2007-10-02 Thread David Schwartz

This is a combined response to Arjan's:

> that's also what trylock is for... as well as spinaphores...
> (you can argue that futexes should be more intelligent and do
> spinaphore stuff etc... and I can buy that, lets improve them in the
> kernel by any means. But userspace yield() isn't the answer. A
> yield_to() would have been a ton better (which would return immediately
> if the thing you want to yield to is running already somewhere), a
> blind "yield" isn't, since it doesn't say what you want to yield to.

And Ingo's:

> but i'll attempt to weave the chain of argument one step forward (in the
> hope of not distorting your point in any way): _if_ the sched_yield()
> call in that memory allocator is done because it uses a locking
> primitive that is unfair (hence the memory pool lock can be starved),
> then the "guaranteed large latency" is caused by "guaranteed
> unfairness". The solution is not to insert a random latency (via a
> sched_yield() call) that also has a side-effect of fairness to other
> tasks, because this random latency introduces guaranteed unfairness for
> this particular task. The correct solution IMO is to make the locking
> primitive more fair _without_ random delays, and there are a number of
> good techniques for that. (they mostly center around the use of futexes)

So now I not only have to come up with an example where sched_yield is the
best practical choice, I have to come up with one where sched_yield is the
best conceivable choice? Didn't we start out by agreeing these are very rare
cases? Why are we designing new APIs for them (Arjan) and why do we care
about their performance (Ingo)?

These are *rare* cases. It is a waste of time to optimize them.

In this case, nobody cares about fairness to the service thread. It is a
cleanup task that probably runs every few minutes. It could be delayed for
minutes and nobody would care. What they do care about is the impact of the
service thread on the threads doing real work.

You two challenged me to present any legitimate use case for sched_yield. I
see now that was not a legitimate challenge and you two were determined to
shoot down any response no matter how reasonable on the grounds that there
is some way to do it better, no matter how complex, impractical, or
unjustified by the real-world problem.

I think if a pthread_mutex had a 'yield to others blocking on this mutex'
kind of a 'go to the back of the line' option, that would cover the majority
of cases where sched_yield is your best choice currently. Unfortunately,
POSIX gave us yield.
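Purely as an illustration of that idea (no such call exists in POSIX or glibc; the name below is invented):

#include <pthread.h>

/* Hypothetical, non-existent API illustrating "yield to others blocking on
 * this mutex": atomically release m, let already-blocked waiters acquire it,
 * and re-queue the caller behind them. */
int pthread_mutex_yield_np(pthread_mutex_t *m);    /* imaginary */

/* The closest portable approximation today: drop and re-take the lock and
 * hope the mutex implementation hands it to a waiter in between. */
static void go_to_back_of_line(pthread_mutex_t *m)
{
	pthread_mutex_unlock(m);
	/* a blocked waiter may (or may not) get the mutex here */
	pthread_mutex_lock(m);
}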

Note that I think we all agree that any program whose performance relies on
quirks of sched_yield (such as the examples that have been cited as CFS
'regressions') is horribly broken. None of the cases I am suggesting use
sched_yield as anything more than a minor optimization.

DS




Re: Network slowdown due to CFS

2007-10-02 Thread Jarek Poplawski
On Tue, Oct 02, 2007 at 11:03:46AM +0200, Jarek Poplawski wrote:
...
> should suffice. Currently, I wonder if simply charging (with a key
> recalculated) such a task for all the time it could've used isn't one
> of such methods. It seems, it's functionally analogous to going to
> the end of the queue of tasks with the same priority according to the old
> sched.

Only now I've read that I'm repeating the idea of David Schwartz (and probably
not only his) from a nearby thread, sorry. But I'm still trying to find out
what was wrong with it?

Jarek P.


Re: Network slowdown due to CFS

2007-10-02 Thread Jarek Poplawski
On Mon, Oct 01, 2007 at 10:43:56AM +0200, Jarek Poplawski wrote:
...
> etc., if we know (after testing) eg. average expedition time of such

No new theory - it's only my reverse Polish translation. Should be:
"etc., if we know (after testing) eg. average dispatch time of such".

Sorry,
Jarek P.


Re: Network slowdown due to CFS

2007-10-02 Thread Jarek Poplawski
On Mon, Oct 01, 2007 at 06:25:07PM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> 
> > BTW, it looks like risky to criticise sched_yield too much: some 
> > people can misinterpret such discussions and stop using this at all, 
> > even where it's right.
> 
> Really, i have never seen a _single_ mainstream app where the use of 
> sched_yield() was the right choice.
> 
> Fortunately, the sched_yield() API is already one of the most rarely 
> used scheduler functionalities, so it does not really matter. [ In my 
> experience a Linux scheduler is stabilizing pretty well when the 
> discussion shifts to yield behavior, because that shows that everything 
> else is pretty much fine ;-) ]
> 
> But, because you assert that it's risky to "criticise sched_yield() 
> too much", you sure must know at least one real example where it's right 
> to use it (and cite the line and code where it's used, with 
> specificity)?

Very clever move! And I see some people have caught this...

Since sched_yield() is a very general purpose tool, it can be easily
replaced by others, of course, just like probably half of all
system calls. And such things are often done in code during
optimization. But, IMHO, the main value of sched_yield() is that it's
an easy to use and very readable way to mark some place. Sometimes
even to test if such an idea is reasonable at all... Otherwise, many such
possibilities could easily stay forgotten forever.

But you are right, the value of this call shouldn't be exaggerated,
and my proposal was an overkill. Anyway, it seems, something better than
the current ways of doing this could be imagined. They look
like two extremes, and something in between and not too complicated
should suffice. Currently, I wonder if simply charging (with a key
recalculated) such a task for all the time it could've used isn't one
of such methods. It seems, it's functionally analogous to going to
the end of the queue of tasks with the same priority according to the old
sched.

Regards,
Jarek P.


Re: Network slowdown due to CFS

2007-10-02 Thread Andi Kleen
Ingo Molnar <[EMAIL PROTECTED]> writes:

> * David Schwartz <[EMAIL PROTECTED]> wrote:
> 
> > > These are generic statements, but i'm _really_ interested in the 
> > > specifics. Real, specific code that i can look at. The typical Linux 
> > > distro consists of in execess of 500 millions of lines of code, in 
> > > tens of thousands of apps, so there really must be some good, valid 
> > > and "right" use of sched_yield() somewhere in there, in some 
> > > mainstream app, right? (because, as you might have guessed it, in 
> > > the past decade of sched_yield() existence i _have_ seen my share of 
> > > sched_yield() utilizing user-space code, and at the moment i'm not 
> > > really impressed by those examples.)
> > 
> > Maybe, maybe not. Even if so, it would be very difficult to find. 
> > [...]
> 
> google.com/codesearch is your friend. Really, 

http://www.koders.com/ (which has been around for a long time)
actually seems to have more code.

It's also a pity that so much free code is behind passwords
and protected from spiders.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-02 Thread Ingo Molnar

* David Schwartz <[EMAIL PROTECTED]> wrote:

> > at a quick glance this seems broken too - but if you show the 
> > specific code i might be able to point out the breakage in detail. 
> > (One underlying problem here appears to be fairness: a quick 
> > unlock/lock sequence may starve out other threads. yield wont solve 
> > that fundamental problem either, and it will introduce random 
> > latencies into apps using this memory allocator.)
> 
> You are assuming that random latencies are necessarily bad. Random 
> latencies may be significantly better than predictable high latency.

i'm not really assuming anything, i gave a vague first impression of the 
vague example you gave (assuming that the yield was done to combat 
fairness problems). This is a case where the human language shows its 
boundaries: statements that are hard to refute with certainty because 
they are too vague. So i'd really suggest you show me some sample/real 
code - that would move this discussion to a much more productive level.

but i'll attempt to weave the chain of argument one step forward (in the 
hope of not distorting your point in any way): _if_ the sched_yield() 
call in that memory allocator is done because it uses a locking 
primitive that is unfair (hence the memory pool lock can be starved), 
then the "guaranteed large latency" is caused by "guaranteed 
unfairness". The solution is not to insert a random latency (via a 
sched_yield() call) that also has a side-effect of fairness to other 
tasks, because this random latency introduces guaranteed unfairness for 
this particular task. The correct solution IMO is to make the locking 
primitive more fair _without_ random delays, and there are a number of 
good techniques for that. (they mostly center around the use of futexes)

one thing that is often missed is that most of the cost of a yield() is 
in the system call and the context-switch - quite similar to the futex 
slowpath. So there's _no_ reason to not use a futexes on Linux. (yes, 
there might be historic/compatibility or ease-of-porting arguments but 
those do not really impact the fundamental argument of whether something 
is technically right or not.)
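
As a minimal sketch of the futex-based slow path being alluded to here,
loosely following Ulrich Drepper's "Futexes Are Tricky" three-state
design; the fmutex_* names are invented for the sketch, so treat it as
an illustration rather than anyone's production code:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static long sys_futex(uint32_t *uaddr, int op, uint32_t val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* 0 = unlocked, 1 = locked, 2 = locked with waiters */
static void fmutex_lock(uint32_t *f)
{
        uint32_t c = __sync_val_compare_and_swap(f, 0, 1);

        while (c != 0) {
                /* mark the lock contended, then sleep in the kernel until
                   the holder wakes us - no spinning, no sched_yield() */
                if (c == 2 || __sync_val_compare_and_swap(f, 1, 2) != 0)
                        sys_futex(f, FUTEX_WAIT, 2);
                c = __sync_val_compare_and_swap(f, 0, 2);
        }
}

static void fmutex_unlock(uint32_t *f)
{
        /* enter the kernel only if someone may actually be waiting */
        if (__sync_fetch_and_sub(f, 1) != 1) {
                *f = 0;
                sys_futex(f, FUTEX_WAKE, 1);
        }
}

The uncontended paths are a single atomic operation each; the kernel is
entered only under real contention, which is the "futex slowpath" cost
being compared to yield above.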

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-02 Thread Ingo Molnar

* David Schwartz <[EMAIL PROTECTED]> wrote:

> > (user-space spinlocks are broken beyond words for anything but 
> > perhaps SCHED_FIFO tasks.)
> 
> User-space spinlocks are broken so spinlocks can only be implemented 
> in kernel-space? Even if you use the kernel to schedule/unschedule the 
> tasks, you still have to spin in user-space.

user-space spinlocks (in anything but SCHED_FIFO tasks) are pretty 
broken because they waste CPU time. (not as broken as yield, because 
"wasting CPU time" is a more deterministic act, but still broken) Could 
you cite a single example where user-space spinlocks are technically the 
best solution?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-02 Thread Ingo Molnar

* David Schwartz <[EMAIL PROTECTED]> wrote:

> > These are generic statements, but i'm _really_ interested in the 
> > specifics. Real, specific code that i can look at. The typical Linux 
> > distro consists of in excess of 500 million lines of code, in 
> > tens of thousands of apps, so there really must be some good, valid 
> > and "right" use of sched_yield() somewhere in there, in some 
> > mainstream app, right? (because, as you might have guessed it, in 
> > the past decade of sched_yield() existence i _have_ seen my share of 
> > sched_yield() utilizing user-space code, and at the moment i'm not 
> > really impressed by those examples.)
> 
> Maybe, maybe not. Even if so, it would be very difficult to find. 
> [...]

google.com/codesearch is your friend. Really, 

> Note that I'm not saying this is a particularly big deal. And I'm not 
> calling CFS' behavior a regression, since it's not really better or 
> worse than the old behavior, simply different.

yes, and that's the core point.

> I'm not familiar enough with CFS' internals to help much on the 
> implementation, but there may be some simple compromise yield that 
> might work well enough. How about simply acting as if the task used up 
> its timeslice and scheduling the next one? (Possibly with a slight 
> reduction in penalty or reward for not really using all the time, if 
> possible?)

firstly, there's no notion of "timeslices" in CFS. (in CFS tasks "earn" 
a right to the CPU, and that "right" is not sliced in the traditional 
sense) But we tried a conceptually similar thing: to schedule not to the 
end of the tree but into the next position. That too was bad for _some_ 
apps. CFS literally cycled through 5-6 different yield implementations 
in its 22 versions so far. The current flag solution was achieved in 
such an iterative fashion and gives an acceptable solution to all app 
categories that came up so far. [ and this is driven by compatibility 
goals - regardless of how broken we consider yield use. The ideal 
solution is of course to almost never use yield. Fortunately 99%+ of 
Linux apps follow that ideal solution ;-) ]

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-02 Thread Jarek Poplawski
On Tue, Oct 02, 2007 at 11:03:46AM +0200, Jarek Poplawski wrote:
...
> should suffice. Currently, I wonder if simply charging (with a key
> recalculated) such a task for all the time it could've used isn't one
> of such methods. It seems it's functionally analogous to going to
> the end of the queue of tasks with the same priority according to the
> old sched.

Only now do I see that I'm repeating the idea of David Schwartz (and
probably not only him) from a nearby thread, sorry. But I'm still
trying to find out: what was wrong with it?

Jarek P.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-02 Thread David Schwartz

This is a combined response to Arjan's:

> that's also what trylock is for... as well as spinaphores...
> (you can argue that futexes should be more intelligent and do
> spinaphore stuff etc... and I can buy that, let's improve them in the
> kernel by any means. But userspace yield() isn't the answer. A
> yield_to() would have been a ton better (which would return immediately
> if the thing you want to yield to is running already somewhere), a
> blind "yield" isn't, since it doesn't say what you want to yield to.

And Ingo's:

> but i'll attempt to weave the chain of argument one step forward (in the
> hope of not distorting your point in any way): _if_ the sched_yield()
> call in that memory allocator is done because it uses a locking
> primitive that is unfair (hence the memory pool lock can be starved),
> then the "guaranteed large latency" is caused by "guaranteed
> unfairness". The solution is not to insert a random latency (via a
> sched_yield() call) that also has a side-effect of fairness to other
> tasks, because this random latency introduces guaranteed unfairness for
> this particular task. The correct solution IMO is to make the locking
> primitive more fair _without_ random delays, and there are a number of
> good techniques for that. (they mostly center around the use of futexes)

So now I not only have to come up with an example where sched_yield is the
best practical choice, I have to come up with one where sched_yield is the
best conceivable choice? Didn't we start out by agreeing these are very rare
cases? Why are we designing new APIs for them (Arjan) and why do we care
about their performance (Ingo)?

These are *rare* cases. It is a waste of time to optimize them.

In this case, nobody cares about fairness to the service thread. It is a
cleanup task that probably runs every few minutes. It could be delayed for
minutes and nobody would care. What they do care about is the impact of the
service thread on the threads doing real work.

You two challenged me to present any legitimate use case for sched_yield. I
see now that was not a legitimate challenge and you two were determined to
shoot down any response no matter how reasonable on the grounds that there
is some way to do it better, no matter how complex, impractical, or
unjustified by the real-world problem.

I think if a pthread_mutex had a 'yield to others blocking on this mutex'
kind of a 'go to the back of the line' option, that would cover the majority
of cases where sched_yield is your best choice currently. Unfortunately,
POSIX gave us yield.

Note that I think we all agree that any program whose performance relies on
quirks of sched_yield (such as the examples that have been cited as CFS
'regressions') are broken horribly. None of the cases I am suggesting use
sched_yield as anything more than a minor optimization.

DS


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 15:44:09 -0700
"David Schwartz" <[EMAIL PROTECTED]> wrote:

> 
> > yielding IS blocking. Just with indeterminate fuzziness added to
> > it
> 
> Yielding is sort of blocking, but the difference is that yielding
> will not idle the CPU while blocking might. 

not really; SOMEONE will make progress, the one holding the lock.
Granted, he can be on some other cpu, but at that point all yielding
gets you is a bunch of cache bounces.

>Yielding is sometimes
> preferable to blocking in a case where the thread knows it can make
> forward progress even if it doesn't get the resource. (As in the
> examples I explained.)

that's also what trylock is for... as well as spinaphores...
(you can argue that futexes should be more intelligent and do
spinaphore stuff etc... and I can buy that, let's improve them in the
kernel by any means. But userspace yield() isn't the answer. A
yield_to() would have been a ton better (which would return immediately
if the thing you want to yield to is running already somewhere), a
blind "yield" isn't, since it doesn't say what you want to yield to.

Note: The answer to "what to yield to" isn't "everything that might
want to run"; we tried that way back when the 2.6.early scheduler was
designed and that turns out to not be what people calling yield
expected (it made their things even slower than they thought). So
they want "yield to" semantics, without telling the kernel what they
want to yield to, and complain if the kernel second-guesses wrongly.


not a good api.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

> yielding IS blocking. Just with indeterminate fuzziness added to it

Yielding is sort of blocking, but the difference is that yielding will not
idle the CPU while blocking might. Yielding is sometimes preferable to
blocking in a case where the thread knows it can make forward progress even
if it doesn't get the resource. (As in the examples I explained.)

DS


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 15:17:52 -0700
"David Schwartz" <[EMAIL PROTECTED]> wrote:

> 
> Arjan van de Ven wrote:
> 
> > > It can occasionally be an optimization. You may have a case where
> > > you can do something very efficiently if a lock is not held, but
> > > you cannot afford to wait for the lock to be released. So you
> > > check the lock, if it's held, you yield and then check again. If
> > > that fails, you do it the less optimal way (for example,
> > > dispatching it to a thread that *can* afford to wait).
> 
> > at this point it's "use a futex" instead; once you're doing system
> > calls you might as well use the right one for what you're trying to
> > achieve.
> 
> There are two answers to this. One is that you sometimes are writing
> POSIX code and Linux-specific optimizations don't change the fact
> that you still need a portable implementation.
> 
> The other answer is that futexes don't change anything in this case.
> In fact, in the last time I hit this, the lock was a futex on Linux.
> Nevertheless, that doesn't change the basic issue. The lock is
> locked, you cannot afford to wait for it, but not getting the lock is
> expensive. The solution is to yield and check the lock again. If it's
> still held, you dispatch to another thread, but many times, yielding
> can avoid that.

yielding IS blocking. Just with indeterminate fuzziness added to it
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

Arjan van de Ven wrote:

> > It can occasionally be an optimization. You may have a case where you
> > can do something very efficiently if a lock is not held, but you
> > cannot afford to wait for the lock to be released. So you check the
> > lock, if it's held, you yield and then check again. If that fails,
> > you do it the less optimal way (for example, dispatching it to a
> > thread that *can* afford to wait).

> at this point it's "use a futex" instead; once you're doing system
> calls you might as well use the right one for what you're trying to
> achieve.

There are two answers to this. One is that you sometimes are writing POSIX
code and Linux-specific optimizations don't change the fact that you still
need a portable implementation.

The other answer is that futexes don't change anything in this case. In
fact, in the last time I hit this, the lock was a futex on Linux.
Nevertheless, that doesn't change the basic issue. The lock is locked, you
cannot afford to wait for it, but not getting the lock is expensive. The
solution is to yield and check the lock again. If it's still held, you
dispatch to another thread, but many times, yielding can avoid that.

A futex doesn't change the fact that sometimes you can't afford to block on
a lock but nevertheless would save significant effort if you were able to
acquire it. Odds are the thread that holds it is about to release it anyway.

That is, you need something in-between "non-blocking trylock, fail easily"
and "blocking lock, do not fail", but you'd rather make forward progress
without the lock than actually block/sleep.
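
A minimal sketch of that pattern; pool_lock, do_work_locked() and
hand_off_to_worker() are invented stand-ins, not code from any of the
apps discussed here:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_work_locked(void)     { puts("fast path, lock held"); }
static void hand_off_to_worker(void) { puts("slow path, queued for a worker"); }

static void try_fast_path(void)
{
        if (pthread_mutex_trylock(&pool_lock) != 0) {
                sched_yield();          /* give the current holder a chance */
                if (pthread_mutex_trylock(&pool_lock) != 0) {
                        /* still held: hand the work to a thread that
                           *can* afford to block on the lock */
                        hand_off_to_worker();
                        return;
                }
        }
        do_work_locked();
        pthread_mutex_unlock(&pool_lock);
}

int main(void)
{
        try_fast_path();
        return 0;
}

Whether the single yield actually helps depends, as discussed above, on
which scheduler and which yield semantics the code happens to run on.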

DS


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Hubert Tonneau
Ingo Molnar wrote:
>
> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.

Pliant's 'FastSem' semaphore implementation (as opposed to 'Sem') uses 'yield':
http://old.fullpliant.org/

Basically, if the resource you are protecting with the semaphore will be held
for a significant time, then a full semaphore might be better, but if the
resource will be held for just a few cycles, then lightweight acquiring might
bring the best result, because the most significant cost is in
acquiring/releasing.

So the acquiring algorithm for fast semaphores might be:
try to acquire with a hardware atomic read-and-set instruction; if that fails,
call yield, then retry (at least on a single-processor, single-core system).
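
A rough reconstruction of that acquire path from the description above
(a sketch only, not actual Pliant source), using the GCC atomic
test-and-set builtins:

#include <sched.h>

static int fastsem_busy;        /* 0 = free, 1 = held */

static void fastsem_acquire(void)
{
        /* atomic read-and-set: returns the previous value, stores 1 */
        while (__sync_lock_test_and_set(&fastsem_busy, 1) != 0)
                sched_yield();  /* holder should be done in a few cycles;
                                   give it the CPU and retry */
}

static void fastsem_release(void)
{
        __sync_lock_release(&fastsem_busy);     /* stores 0 */
}

On a uniprocessor the yield is what lets the holder run at all; on SMP
this degenerates into the user-space spinning criticised elsewhere in
the thread.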


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 09:49:35 -0700
"David Schwartz" <[EMAIL PROTECTED]> wrote:

> 
> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at
> > > all, even where it's right.
> 
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
> 
> It can occasionally be an optimization. You may have a case where you
> can do something very efficiently if a lock is not held, but you
> cannot afford to wait for the lock to be released. So you check the
> lock, if it's held, you yield and then check again. If that fails,
> you do it the less optimal way (for example, dispatching it to a
> thread that *can* afford to wait).


at this point it's "use a futex" instead; once you're doing system
calls you might as well use the right one for what you're trying to
achieve.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

> These are generic statements, but i'm _really_ interested in the
> specifics. Real, specific code that i can look at. The typical Linux
> distro consists of in excess of 500 million lines of code, in tens
> of thousands of apps, so there really must be some good, valid and
> "right" use of sched_yield() somewhere in there, in some mainstream app,
> right? (because, as you might have guessed it, in the past decade of
> sched_yield() existence i _have_ seen my share of sched_yield()
> utilizing user-space code, and at the moment i'm not really impressed by
> those examples.)

Maybe, maybe not. Even if so, it would be very difficult to find. Simply
grepping for sched_yield is not going to help because determining whether a
given use of sched_yield is smart is not going to be easy.

> (user-space spinlocks are broken beyond words for anything but perhaps
> SCHED_FIFO tasks.)

User-space spinlocks are broken so spinlocks can only be implemented in
kernel-space? Even if you use the kernel to schedule/unschedule the tasks,
you still have to spin in user-space.

> > One example I know of is a defragmenter for a multi-threaded memory
> > allocator, and it has to lock whole pools. When it releases these
> > locks, it calls yield before re-acquiring them to go back to work. The
> > idea is to "go to the back of the line" if any threads are blocking on
> > those mutexes.

> at a quick glance this seems broken too - but if you show the specific
> code i might be able to point out the breakage in detail. (One
> underlying problem here appears to be fairness: a quick unlock/lock
> sequence may starve out other threads. yield wont solve that fundamental
> problem either, and it will introduce random latencies into apps using
> this memory allocator.)

You are assuming that random latencies are necessarily bad. Random latencies
may be significantly better than predictable high latency.


> > Can you explain what the current sched_yield behavior *is* for CFS and
> > what the tunable does to change it?

> sure. (and i described that flag on lkml before) The sched_yield flag
> does two things:

>  - if 0 ("opportunistic mode"), then the task will reschedule to any
>other task that is in "bigger need for CPU time" than the currently
>running task, as indicated by CFS's ->wait_runtime metric. (or as
>indicated by the similar ->vruntime metric in sched-devel.git)
>
>  - if 1 ("aggressive mode"), then the task will be one-time requeued to
>the right end of the CFS rbtree. This means that for one instance,
>all other tasks will run before this task will run again - after that
>this task's natural ordering within the rbtree is restored.

Thank you. Unfortunately, neither of these does what sched_yield is really
supposed to do. Opportunistic mode does too little and aggressive mode does
too much.

> > The desired behavior is for the current thread to not be rescheduled
> > until every thread at the same static priority as this thread has had
> > a chance to be scheduled.

> do you realize that this "desired behavior" you just described is not
> achieved by the old scheduler, and that this random behavior _is_ the
> main problem here? If yield was well-specified then we could implement
> it in a well-specified way - even if the API was poor.

> But fact is that it is _not_ well-specified, and apps grew upon a random
> scheduler implementation details in random ways. (in the lkml discussion
> about this topic, Linus offered a pretty sane theoretical definition for
> yield but it's not simple to implement [and no scheduler implements it
> at the moment] - nor will it map to the old scheduler's yield behavior
> so we'll end up breaking more apps.)

I don't have a problem with failing to emulate the old scheduler's behavior
if we can show that the new behavior has saner semantics. Unfortunately, in
this case, I think CFS' semantics are pretty bad. Neither of these is what
sched_yield is supposed to do.

Note that I'm not saying this is a particularly big deal. And I'm not
calling CFS' behavior a regression, since it's not really better or worse
than the old behavior, simply different.

I'm not familiar enough with CFS' internals to help much on the
implementation, but there may be some simple compromise yield that might
work well enough. How about simply acting as if the task used up its
timeslice and scheduling the next one? (Possibly with a slight reduction in
penalty or reward for not really using all the time, if possible?)

DS


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Chris Friesen

Ingo Molnar wrote:

* Chris Friesen <[EMAIL PROTECTED]> wrote:


However, there are closed-source and/or frozen-source apps where it's 
not practical to rewrite or rebuild the app.  Does it make sense to 
break the behaviour of all of these?



See the background and answers to that in:

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune 
the sysctl flag in one direction or another, depending on which behavior 
the app is expecting.


Yeah, I read those threads.

It seems like the fundamental source of the disconnect is that the tasks 
used to be sorted by priority (thus making it easy to bump a yielding 
task to the end of that priority level) while now they're organized by 
time (making it harder to do anything priority-based).  Do I have that 
right?


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* David Schwartz <[EMAIL PROTECTED]> wrote:

> > > BTW, it looks like risky to criticise sched_yield too much: some 
> > > people can misinterpret such discussions and stop using this at 
> > > all, even where it's right.
> 
> > Really, i have never seen a _single_ mainstream app where the use of 
> > sched_yield() was the right choice.
> 
> It can occasionally be an optimization. You may have a case where you 
> can do something very efficiently if a lock is not held, but you 
> cannot afford to wait for the lock to be released. So you check the 
> lock, if it's held, you yield and then check again. If that fails, you 
> do it the less optimal way (for example, dispatching it to a thread 
> that *can* afford to wait).

These are generic statements, but i'm _really_ interested in the 
specifics. Real, specific code that i can look at. The typical Linux 
distro consists of in excess of 500 million lines of code, in tens 
of thousands of apps, so there really must be some good, valid and 
"right" use of sched_yield() somewhere in there, in some mainstream app, 
right? (because, as you might have guessed it, in the past decade of 
sched_yield() existence i _have_ seen my share of sched_yield() 
utilizing user-space code, and at the moment i'm not really impressed by 
those examples.)
 
Preferably that example should show that the best quality user-space 
lock implementation in a given scenario is best done via sched_yield(). 
Actual code and numbers. (And this isnt _that_ hard. I'm not asking for 
a full RDBMS implementation that must run through SQL99 spec suite. This 
is about a simple locking primitive, or a simple pointer to an existing 
codebase.)

> It is also sometimes used in the implementation of spinlock-type 
> primitives. After spinning fails, yielding is tried.

(user-space spinlocks are broken beyond words for anything but perhaps 
SCHED_FIFO tasks.)

> One example I know of is a defragmenter for a multi-threaded memory 
> allocator, and it has to lock whole pools. When it releases these 
> locks, it calls yield before re-acquiring them to go back to work. The 
> idea is to "go to the back of the line" if any threads are blocking on 
> those mutexes.

at a quick glance this seems broken too - but if you show the specific 
code i might be able to point out the breakage in detail. (One 
underlying problem here appears to be fairness: a quick unlock/lock 
sequence may starve out other threads. yield wont solve that fundamental 
problem either, and it will introduce random latencies into apps using 
this memory allocator.)

> > Fortunately, the sched_yield() API is already one of the most rarely
> > used scheduler functionalities, so it does not really matter. [ In my
> > experience a Linux scheduler is stabilizing pretty well when the
> > discussion shifts to yield behavior, because that shows that everything
> > else is pretty much fine ;-) ]
> 
> Can you explain what the current sched_yield behavior *is* for CFS and 
> what the tunable does to change it?

sure. (and i described that flag on lkml before) The sched_yield flag 
does two things:

 - if 0 ("opportunistic mode"), then the task will reschedule to any
   other task that is in "bigger need for CPU time" than the currently 
   running task, as indicated by CFS's ->wait_runtime metric. (or as 
   indicated by the similar ->vruntime metric in sched-devel.git)

 - if 1 ("aggressive mode"), then the task will be one-time requeued to 
   the right end of the CFS rbtree. This means that for one instance, 
   all other tasks will run before this task will run again - after that 
   this task's natural ordering within the rbtree is restored.

> The desired behavior is for the current thread to not be rescheduled 
> until every thread at the same static priority as this thread has had 
> a chance to be scheduled.

do you realize that this "desired behavior" you just described is not 
achieved by the old scheduler, and that this random behavior _is_ the 
main problem here? If yield was well-specified then we could implement 
it in a well-specified way - even if the API was poor.

But fact is that it is _not_ well-specified, and apps grew upon a random 
scheduler implementation details in random ways. (in the lkml discussion 
about this topic, Linus offered a pretty sane theoretical definition for 
yield but it's not simple to implement [and no scheduler implements it 
at the moment] - nor will it map to the old scheduler's yield behavior 
so we'll end up breaking more apps.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* Chris Friesen <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> 
> >But, because you assert that it's risky to "criticise sched_yield() 
> >too much", you sure must know at least one real example where it's right 
> >to use it (and cite the line and code where it's used, with 
> >specificity)?
> 
> It's fine to criticise sched_yield().  I agree that new apps should 
> generally be written to use proper completion mechanisms or to wait 
> for specific events.

yes.

> However, there are closed-source and/or frozen-source apps where it's 
> not practical to rewrite or rebuild the app.  Does it make sense to 
> break the behaviour of all of these?

See the background and answers to that in:

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune 
the sysctl flag in one direction or another, depending on which behavior 
the app is expecting.
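
For reference, the knob being referred to is presumably the 2.6.23-era
kernel.sched_compat_yield sysctl (the mails only ever call it "the
sysctl flag", so take the exact name and path below as an assumption);
flipping it is trivial:

#include <stdio.h>

int main(void)
{
        /* assumed path of the CFS yield flag:
           0 = opportunistic yield, 1 = aggressive requeue */
        FILE *f = fopen("/proc/sys/kernel/sched_compat_yield", "w");

        if (!f) {
                perror("sched_compat_yield");
                return 1;
        }
        fputs("1\n", f);        /* ask for the old-scheduler-like behaviour */
        fclose(f);
        return 0;
}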

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Chris Friesen

Ingo Molnar wrote:

But, because you assert that it's risky to criticise sched_yield() 
too much", you sure must know at least one real example where it's right 
to use it (and cite the line and code where it's used, with 
specificity)?


It's fine to criticise sched_yield().  I agree that new apps should 
generally be written to use proper completion mechanisms or to wait for 
specific events.


However, there are closed-source and/or frozen-source apps where it's 
not practical to rewrite or rebuild the app.  Does it make sense to 
break the behaviour of all of these?


Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
>
> > BTW, it looks like risky to criticise sched_yield too much: some
> > people can misinterpret such discussions and stop using this at all,
> > even where it's right.

> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.

It can occasionally be an optimization. You may have a case where you can do
something very efficiently if a lock is not held, but you cannot afford to
wait for the lock to be released. So you check the lock, if it's held, you
yield and then check again. If that fails, you do it the less optimal way
(for example, dispatching it to a thread that *can* afford to wait).

It is also sometimes used in the implementation of spinlock-type primitives.
After spinning fails, yielding is tried.

I think it's also sometimes appropriate when a thread may monopolize a
mutex. For example, consider a rarely-run task that cleans up some expensive
structures. It may need to hold locks that are only held during this complex
clean up.

One example I know of is a defragmenter for a multi-threaded memory
allocator, and it has to lock whole pools. When it releases these locks, it
calls yield before re-acquiring them to go back to work. The idea is to "go
to the back of the line" if any threads are blocking on those mutexes.
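
A minimal sketch of that defragmenter pattern (names and structure are
invented for illustration, not the actual allocator being described):

#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

struct pool {
        pthread_mutex_t lock;
        /* ... allocator state ... */
};

/* compact a bounded amount of the pool; returns false when finished */
static bool defrag_one_batch(struct pool *p)
{
        (void)p;
        return false;
}

static void defrag_pool(struct pool *p)
{
        bool more = true;

        while (more) {
                pthread_mutex_lock(&p->lock);
                more = defrag_one_batch(p);
                pthread_mutex_unlock(&p->lock);

                /* "go to the back of the line": if worker threads were
                   blocked on p->lock, give them a chance to take it
                   before this thread re-acquires it */
                sched_yield();
        }
}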

There are certainly other ways to do these things, but I have seen cases
where, IMO, yielding was the best solution. Doing nothing would have been
okay too.

> Fortunately, the sched_yield() API is already one of the most rarely
> used scheduler functionalities, so it does not really matter. [ In my
> experience a Linux scheduler is stabilizing pretty well when the
> discussion shifts to yield behavior, because that shows that everything
> else is pretty much fine ;-) ]

Can you explain what the current sched_yield behavior *is* for CFS and what
the tunable does to change it?

The desired behavior is for the current thread to not be rescheduled until
every thread at the same static priority as this thread has had a chance to
be scheduled.

Of course, it's not clear exactly what a "chance" is.

The semantics with respect to threads at other static priority levels is not
clear. Ditto for SMP issues. It's also not clear whether threads that yield
should be rewarded or punished for doing so.

DS


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> BTW, it looks like risky to criticise sched_yield too much: some 
> people can misinterpret such discussions and stop using this at all, 
> even where it's right.

Really, i have never seen a _single_ mainstream app where the use of 
sched_yield() was the right choice.

Fortunately, the sched_yield() API is already one of the most rarely 
used scheduler functionalities, so it does not really matter. [ In my 
experience a Linux scheduler is stabilizing pretty well when the 
discussion shifts to yield behavior, because that shows that everything 
else is pretty much fine ;-) ]

But, because you assert that it's risky to "criticise sched_yield() 
too much", you sure must know at least one real example where it's right 
to use it (and cite the line and code where it's used, with 
specificity)?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Jarek Poplawski
On Fri, Sep 28, 2007 at 04:10:00PM +1000, Nick Piggin wrote:
> On Friday 28 September 2007 00:42, Jarek Poplawski wrote:
> > On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> > > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> >
> > ...
> >
> > > > OK, but let's forget about fixing iperf. Probably I got this wrong,
> > > > but I've thought this "bad" iperf patch was tested on a few nixes and
> > > > linux was the most different one. The main point is: even if there is
> > > > no standard here, it should be a common interest to try to not differ
> > > > too much at least. So, it's not about exactness, but 50% (63 -> 95)
> > > > change in linux own 'definition' after upgrading seems to be a lot.
> > > > So, IMHO, maybe some 'compatibility' test could be prepared to compare
> > > > a few different ideas on this yield and some average value could be a
> > > > kind of at least linux' own standard, which should be emulated within
> > > > some limits by next kernels?
> > >
> > > you repeat your point of "emulating yield", and i can only repeat my
> > > point that you should please read this:
> > >
> > > http://lkml.org/lkml/2007/9/19/357
> > >
> > > because, once you read that, i think you'll agree with me that what you
> > > say is simply not possible in a sane way at this stage. We went through
> > > a number of yield implementations already and each will change behavior
> > > for _some_ category of apps. So right now we offer two implementations,
> > > and the default was chosen empirically to minimize the amount of
> > > complaints. (but it's not possible to eliminate them altogether, for the
> > > reasons outlined above - hence the switch.)
> >
> > Sorry, but I think you got me wrong: I didn't mean emulation of any
> > implementation, but probably the some thing you write above: emulation
> > of time/performance. In my opinion this should be done experimentally
> > too, but with something more objective and constant than current
> > "complaints counter". And the first thing could be a try to set some
> > kind of linux internal "standard of yield" for the future by averaging
> > a few most popular systems in a test doing things like this iperf or
> > preferably more.
> 
> By definition, yield is essentially undefined as to the behaviour between
> SCHED_OTHER tasks at the same priority level (ie. all of them), because
> SCHED_OTHER scheduling behaviour itself is undefined.
> 
> It's never going to do exactly what everybody wants, except those using
> it for legitimate reasons in realtime applications.
> 

That's why I've used words like: "not differ too much" and "within
some limits" above. So, it's only about being reasonable, compared
to our previous versions, and to others, if possible.

It should not be impossible to additionally control (delay or
accelerate) yielding tasks wrt. current load/weight/number_of_tasks
etc., if we know (after testing) eg. average expedition time of such
tasks with various schedulers. Of course, such tests and controlling
parameters can change for some time until the problem is explored
enough, and still no aim for exactness or to please everybody.
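
A trivial sketch of the kind of measurement meant here: time a
sched_yield() loop while plain CPU hogs are runnable (thread and
iteration counts are arbitrary, and on SMP everything would also have
to be pinned to one CPU to force the contended case):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define HOGS   2
#define YIELDS 100000

static volatile int stop;

static void *hog(void *arg)
{
        (void)arg;
        while (!stop)
                ;               /* burn CPU, like iperf's busy peer */
        return NULL;
}

int main(void)
{
        pthread_t t[HOGS];
        struct timespec a, b;
        int i;

        for (i = 0; i < HOGS; i++)
                pthread_create(&t[i], NULL, hog, NULL);

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < YIELDS; i++)
                sched_yield();
        clock_gettime(CLOCK_MONOTONIC, &b);
        stop = 1;

        for (i = 0; i < HOGS; i++)
                pthread_join(t[i], NULL);

        printf("%d yields took %.3f s\n", YIELDS,
               (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        return 0;
}

Averaging such numbers across schedulers is, roughly, the "standard of
yield" test being proposed.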

BTW, it looks like risky to criticise sched_yield too much: some
people can misinterpret such discussions and stop using this at
all, even where it's right.

Regards,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Jarek Poplawski
On Fri, Sep 28, 2007 at 04:10:00PM +1000, Nick Piggin wrote:
 On Friday 28 September 2007 00:42, Jarek Poplawski wrote:
  On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
   * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  ...
 
OK, but let's forget about fixing iperf. Probably I got this wrong,
but I've thought this bad iperf patch was tested on a few nixes and
linux was the most different one. The main point is: even if there is
no standard here, it should be a common interest to try to not differ
too much at least. So, it's not about exactness, but 50% (63 - 95)
change in linux own 'definition' after upgrading seems to be a lot.
So, IMHO, maybe some 'compatibility' test could be prepared to compare
a few different ideas on this yield and some average value could be a
kind of at least linux' own standard, which should be emulated within
some limits by next kernels?
  
   you repeat your point of emulating yield, and i can only repeat my
   point that you should please read this:
  
   http://lkml.org/lkml/2007/9/19/357
  
   because, once you read that, i think you'll agree with me that what you
   say is simply not possible in a sane way at this stage. We went through
   a number of yield implementations already and each will change behavior
   for _some_ category of apps. So right now we offer two implementations,
   and the default was chosen empirically to minimize the amount of
   complaints. (but it's not possible to eliminate them altogether, for the
   reasons outlined above - hence the switch.)
 
  Sorry, but I think you got me wrong: I didn't mean emulation of any
  implementation, but probably the some thing you write above: emulation
  of time/performance. In my opinion this should be done experimentally
  too, but with something more objective and constant than current
  complaints counter. And the first thing could be a try to set some
  kind of linux internal standard of yeld for the future by averaging
  a few most popular systems in a test doing things like this iperf or
  preferably more.
 
 By definition, yield is essentially undefined as to the behaviour between
 SCHED_OTHER tasks at the same priority level (ie. all of them), because
 SCHED_OTHER scheduling behaviour itself is undefined.
 
 It's never going to do exactly what everybody wants, except those using
 it for legitimate reasons in realtime applications.
 

That's why I've used words like: not differ too much and within
some limits above. So, it's only about being reasonable, compared
to our previous versions, and to others, if possible.

This should be not impossible to additionally control (delay or
accelerate) yielding tasks wrt. current load/weight/number_of_tasks
etc., if we know (after testing) eg. average expedition time of such
tasks with various schedulers. Of course, such tests and controlling
paremeters can change for some time until the problem is explored
enough, and still no aim for exactness or to please everybody.

BTW, it looks like risky to criticise sched_yield too much: some
people can misinterpret such discussions and stop using this at
all, even where it's right.

Regards,
Jarek P.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 BTW, it looks like risky to criticise sched_yield too much: some 
 people can misinterpret such discussions and stop using this at all, 
 even where it's right.

Really, i have never seen a _single_ mainstream app where the use of 
sched_yield() was the right choice.

Fortunately, the sched_yield() API is already one of the most rarely 
used scheduler functionalities, so it does not really matter. [ In my 
experience a Linux scheduler is stabilizing pretty well when the 
discussion shifts to yield behavior, because that shows that everything 
else is pretty much fine ;-) ]

But, because you assert it that it's risky to criticise sched_yield() 
too much, you sure must know at least one real example where it's right 
to use it (and cite the line and code where it's used, with 
specificity)?

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

 * Jarek Poplawski [EMAIL PROTECTED] wrote:

  BTW, it looks like risky to criticise sched_yield too much: some
  people can misinterpret such discussions and stop using this at all,
  even where it's right.

 Really, i have never seen a _single_ mainstream app where the use of
 sched_yield() was the right choice.

It can occasionally be an optimization. You may have a case where you can do
something very efficiently if a lock is not held, but you cannot afford to
wait for the lock to be released. So you check the lock, if it's held, you
yield and then check again. If that fails, you do it the less optimal way
(for example, dispatching it to a thread that *can* afford to wait).

It is also sometimes used in the implementation of spinlock-type primitives.
After spinning fails, yielding is tried.

I think it's also sometimes appropriate when a thread may monopolize a
mutex. For example, consider a rarely-run task that cleans up some expensive
structures. It may need to hold locks that are only held during this complex
clean up.

One example I know of is a defragmenter for a multi-threaded memory
allocator, and it has to lock whole pools. When it releases these locks, it
calls yield before re-acquiring them to go back to work. The idea is to go
to the back of the line if any threads are blocking on those mutexes.

There are certainly other ways to do these things, but I have seen cases
where, IMO, yielding was the best solution. Doing nothing would have been
okay too.

 Fortunately, the sched_yield() API is already one of the most rarely
 used scheduler functionalities, so it does not really matter. [ In my
 experience a Linux scheduler is stabilizing pretty well when the
 discussion shifts to yield behavior, because that shows that everything
 else is pretty much fine ;-) ]

Can you explain what the current sched_yield behavior *is* for CFS and what
the tunable does to change it?

The desired behavior is for the current thread to not be rescheduled until
every thread at the same static priority as this thread has had a chance to
be scheduled.

Of course, it's not clear exactly what a chance is.

The semantics with respect to threads at other static priority levels is not
clear. Ditto for SMP issues. It's also not clear whether threads that yield
should be rewarded or punished for doing so.

DS


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Chris Friesen

Ingo Molnar wrote:

But, because you assert that it's risky to criticise sched_yield() 
too much, you sure must know at least one real example where it's right 
to use it (and cite the line and code where it's used, with 
specificity)?


It's fine to criticise sched_yield().  I agree that new apps should 
generally be written to use proper completion mechanisms or to wait for 
specific events.


However, there are closed-source and/or frozen-source apps where it's 
not practical to rewrite or rebuild the app.  Does it make sense to 
break the behaviour of all of these?


Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* Chris Friesen [EMAIL PROTECTED] wrote:

 Ingo Molnar wrote:
 
 But, because you assert that it's risky to criticise sched_yield() 
 too much, you sure must know at least one real example where it's right 
 to use it (and cite the line and code where it's used, with 
 specificity)?
 
 It's fine to criticise sched_yield().  I agree that new apps should 
 generally be written to use proper completion mechanisms or to wait 
 for specific events.

yes.

 However, there are closed-source and/or frozen-source apps where it's 
 not practical to rewrite or rebuild the app.  Does it make sense to 
 break the behaviour of all of these?

See the background and answers to that in:

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune 
the sysctl flag in one direction or another, depending on which behavior 
the app is expecting.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Ingo Molnar

* David Schwartz [EMAIL PROTECTED] wrote:

   BTW, it looks like risky to criticise sched_yield too much: some 
   people can misinterpret such discussions and stop using this at 
   all, even where it's right.
 
  Really, i have never seen a _single_ mainstream app where the use of 
  sched_yield() was the right choice.
 
 It can occasionally be an optimization. You may have a case where you 
 can do something very efficiently if a lock is not held, but you 
 cannot afford to wait for the lock to be released. So you check the 
 lock, if it's held, you yield and then check again. If that fails, you 
 do it the less optimal way (for example, dispatching it to a thread 
 that *can* afford to wait).

These are generic statements, but i'm _really_ interested in the 
specifics. Real, specific code that i can look at. The typical Linux 
distro consists of in excess of 500 million lines of code, in tens 
of thousands of apps, so there really must be some good, valid and 
right use of sched_yield() somewhere in there, in some mainstream app, 
right? (because, as you might have guessed it, in the past decade of 
sched_yield() existence i _have_ seen my share of sched_yield() 
utilizing user-space code, and at the moment i'm not really impressed by 
those examples.)
 
Preferably that example should show that the best quality user-space 
lock implementation in a given scenario is best done via sched_yield(). 
Actual code and numbers. (And this isn't _that_ hard. I'm not asking for 
a full RDBMS implementation that must run through SQL99 spec suite. This 
is about a simple locking primitive, or a simple pointer to an existing 
codebase.)

 It is also sometimes used in the implementation of spinlock-type 
 primitives. After spinning fails, yielding is tried.

(user-space spinlocks are broken beyond words for anything but perhaps 
SCHED_FIFO tasks.)

 One example I know of is a defragmenter for a multi-threaded memory 
 allocator, and it has to lock whole pools. When it releases these 
 locks, it calls yield before re-acquiring them to go back to work. The 
 idea is to go to the back of the line if any threads are blocking on 
 those mutexes.

at a quick glance this seems broken too - but if you show the specific 
code i might be able to point out the breakage in detail. (One 
underlying problem here appears to be fairness: a quick unlock/lock 
sequence may starve out other threads. yield won't solve that fundamental 
problem either, and it will introduce random latencies into apps using 
this memory allocator.)

  Fortunately, the sched_yield() API is already one of the most rarely
  used scheduler functionalities, so it does not really matter. [ In my
  experience a Linux scheduler is stabilizing pretty well when the
  discussion shifts to yield behavior, because that shows that everything
  else is pretty much fine ;-) ]
 
 Can you explain what the current sched_yield behavior *is* for CFS and 
 what the tunable does to change it?

sure. (and i described that flag on lkml before) The sched_yield flag 
does two things:

 - if 0 (opportunistic mode), then the task will reschedule to any
   other task that is in bigger need for CPU time than the currently 
   running task, as indicated by CFS's ->wait_runtime metric. (or as 
   indicated by the similar ->vruntime metric in sched-devel.git)

 - if 1 (aggressive mode), then the task will be one-time requeued to 
   the right end of the CFS rbtree. This means that for one instance, 
   all other tasks will run before this task will run again - after that 
   this task's natural ordering within the rbtree is restored.
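
For what it's worth, a quick way to observe the difference is to pin two
busy-looping threads to one CPU, have one of them call sched_yield() per
iteration, and compare the loop counts with sched_compat_yield set to 0 and
then to 1. A rough sketch (the 5 second run and the use of CPU 0 are
arbitrary choices):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile int stop;
static volatile unsigned long spin_count, yield_count;

static void pin_to_cpu0(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* pid 0 == calling thread */
}

static void *spinner(void *arg)
{
    pin_to_cpu0();
    while (!stop)
        spin_count++;
    return NULL;
}

static void *yielder(void *arg)
{
    pin_to_cpu0();
    while (!stop) {
        yield_count++;
        sched_yield();
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, spinner, NULL);
    pthread_create(&b, NULL, yielder, NULL);
    sleep(5);
    stop = 1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("spinner: %lu loops, yielder: %lu loops\n",
           (unsigned long)spin_count, (unsigned long)yield_count);
    return 0;
}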

 The desired behavior is for the current thread to not be rescheduled 
 until every thread at the same static priority as this thread has had 
 a chance to be scheduled.

do you realize that this desired behavior you just described is not 
achieved by the old scheduler, and that this random behavior _is_ the 
main problem here? If yield was well-specified then we could implement 
it in a well-specified way - even if the API was poor.

But fact is that it is _not_ well-specified, and apps grew upon random 
scheduler implementation details in random ways. (in the lkml discussion 
about this topic, Linus offered a pretty sane theoretical definition for 
yield but it's not simple to implement [and no scheduler implements it 
at the moment] - nor will it map to the old scheduler's yield behavior 
so we'll end up breaking more apps.)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Chris Friesen

Ingo Molnar wrote:

* Chris Friesen [EMAIL PROTECTED] wrote:


However, there are closed-source and/or frozen-source apps where it's 
not practical to rewrite or rebuild the app.  Does it make sense to 
break the behaviour of all of these?



See the background and answers to that in:

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune 
the sysctl flag in one direction or another, depending on which behavior 
the app is expecting.


Yeah, I read those threads.

It seems like the fundamental source of the disconnect is that the tasks 
used to be sorted by priority (thus making it easy to bump a yielding 
task to the end of that priority level) while now they're organized by 
time (making it harder to do anything priority-based).  Do I have that 
right?


Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

 These are generic statements, but i'm _really_ interested in the
 specifics. Real, specific code that i can look at. The typical Linux
 distro consists of in excess of 500 million lines of code, in tens
 of thousands of apps, so there really must be some good, valid and
 right use of sched_yield() somewhere in there, in some mainstream app,
 right? (because, as you might have guessed it, in the past decade of
 sched_yield() existence i _have_ seen my share of sched_yield()
 utilizing user-space code, and at the moment i'm not really impressed by
 those examples.)

Maybe, maybe not. Even if so, it would be very difficult to find. Simply
grepping for sched_yield is not going to help because determining whether a
given use of sched_yield is smart is not going to be easy.

 (user-space spinlocks are broken beyond words for anything but perhaps
 SCHED_FIFO tasks.)

User-space spinlocks are broken so spinlocks can only be implemented in
kernel-space? Even if you use the kernel to schedule/unschedule the tasks,
you still have to spin in user-space.

  One example I know of is a defragmenter for a multi-threaded memory
  allocator, and it has to lock whole pools. When it releases these
  locks, it calls yield before re-acquiring them to go back to work. The
  idea is to go to the back of the line if any threads are blocking on
  those mutexes.

 at a quick glance this seems broken too - but if you show the specific
 code i might be able to point out the breakage in detail. (One
 underlying problem here appears to be fairness: a quick unlock/lock
 sequence may starve out other threads. yield won't solve that fundamental
 problem either, and it will introduce random latencies into apps using
 this memory allocator.)

You are assuming that random latencies are necessarily bad. Random latencies
may be significantly better than predictable high latency.


  Can you explain what the current sched_yield behavior *is* for CFS and
  what the tunable does to change it?

 sure. (and i described that flag on lkml before) The sched_yield flag
 does two things:

  - if 0 (opportunistic mode), then the task will reschedule to any
other task that is in bigger need for CPU time than the currently
running task, as indicated by CFS's ->wait_runtime metric. (or as
indicated by the similar ->vruntime metric in sched-devel.git)

  - if 1 (aggressive mode), then the task will be one-time requeued to
the right end of the CFS rbtree. This means that for one instance,
all other tasks will run before this task will run again - after that
this task's natural ordering within the rbtree is restored.

Thank you. Unfortunately, neither of these does what sched_yield is really
supposed to do. Opportunistic mode does too little and aggressive mode does
too much.

  The desired behavior is for the current thread to not be rescheduled
  until every thread at the same static priority as this thread has had
  a chance to be scheduled.

 do you realize that this desired behavior you just described is not
 achieved by the old scheduler, and that this random behavior _is_ the
 main problem here? If yield was well-specified then we could implement
 it in a well-specified way - even if the API was poor.

 But fact is that it is _not_ well-specified, and apps grew upon a random
 scheduler implementation details in random ways. (in the lkml discussion
 about this topic, Linus offered a pretty sane theoretical definition for
 yield but it's not simple to implement [and no scheduler implements it
 at the moment] - nor will it map to the old scheduler's yield behavior
 so we'll end up breaking more apps.)

I don't have a problem with failing to emulate the old scheduler's behavior
if we can show that the new behavior has saner semantics. Unfortunately, in
this case, I think CFS' semantics are pretty bad. Neither of these is what
sched_yield is supposed to do.

Note that I'm not saying this is a particularly big deal. And I'm not
calling CFS' behavior a regression, since it's not really better or worse
than the old behavior, simply different.

I'm not familiar enough with CFS' internals to help much on the
implementation, but there may be some simple compromise yield that might
work well enough. How about simply acting as if the task used up its
timeslice and scheduling the next one? (Possibly with a slight reduction in
penalty or reward for not really using all the time, if possible?)

DS


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 09:49:35 -0700
David Schwartz [EMAIL PROTECTED] wrote:

 
  * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
   BTW, it looks like risky to criticise sched_yield too much: some
   people can misinterpret such discussions and stop using this at
   all, even where it's right.
 
  Really, i have never seen a _single_ mainstream app where the use of
  sched_yield() was the right choice.
 
 It can occasionally be an optimization. You may have a case where you
 can do something very efficiently if a lock is not held, but you
 cannot afford to wait for the lock to be released. So you check the
 lock, if it's held, you yield and then check again. If that fails,
 you do it the less optimal way (for example, dispatching it to a
 thread that *can* afford to wait).


at this point it's "use a futex instead"; once you're doing system
calls you might as well use the right one for what you're trying to
achieve.
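
For reference, a bare-bones sketch of that with futex(2) directly -
deliberately simplified: the unlock side always issues a wake, which a real
implementation avoids (see Drepper's "Futexes Are Tricky"):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int lock_word;                       /* 0 = free, 1 = held */

static long sys_futex(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(void)
{
    /* grab the lock if free; otherwise sleep in the kernel until the
       holder wakes us, then try again */
    while (__sync_lock_test_and_set(&lock_word, 1) != 0)
        sys_futex(&lock_word, FUTEX_WAIT, 1);
}

static void unlock(void)
{
    __sync_lock_release(&lock_word);
    sys_futex(&lock_word, FUTEX_WAKE, 1);
}

int main(void)
{
    lock();
    /* ... critical section ... */
    unlock();
    return 0;
}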
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Hubert Tonneau
Ingo Molnar wrote:

 Really, i have never seen a _single_ mainstream app where the use of
 sched_yield() was the right choice.

Pliant 'FastSem' semaphore implementation (as opposed to 'Sem') uses 'yield'
http://old.fullpliant.org/

Basically, if the resource you are protecting with the semaphore will be held
for a significant time, then a full semaphore might be better, but if the
resource will be held just a few cycles, then light acquiring might bring the best
result because the most significant cost is in acquiring/releasing.

So the acquiring algorithm for fast semaphores might be:
try to acquire with a hardware atomic read and set instruction, then if it fails,
call yield then retry (at least on a single-processor, single-core system).
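
A minimal sketch of that acquire/release pair, using GCC's __sync builtins as
the hardware atomic read-and-set (the flag name is made up and error handling
is omitted):

#include <sched.h>

static volatile int fastsem_taken;          /* 0 = free, 1 = held */

static void fastsem_acquire(void)
{
    /* atomic test-and-set: returns the old value, so 0 means we got it */
    while (__sync_lock_test_and_set(&fastsem_taken, 1) != 0)
        sched_yield();                      /* held: yield, then retry */
}

static void fastsem_release(void)
{
    __sync_lock_release(&fastsem_taken);    /* atomically clear the flag */
}

int main(void)
{
    fastsem_acquire();
    /* ... a few cycles of protected work ... */
    fastsem_release();
    return 0;
}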


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

Arjan van de Ven wrote:

  It can occasionally be an optimization. You may have a case where you
  can do something very efficiently if a lock is not held, but you
  cannot afford to wait for the lock to be released. So you check the
  lock, if it's held, you yield and then check again. If that fails,
  you do it the less optimal way (for example, dispatching it to a
  thread that *can* afford to wait).

 at this point it's use a futex instead; once you're doing system
 calls you might as well use the right one for what you're trying to
 achieve.

There are two answers to this. One is that you sometimes are writing POSIX
code and Linux-specific optimizations don't change the fact that you still
need a portable implementation.

The other answer is that futexes don't change anything in this case. In
fact, in the last time I hit this, the lock was a futex on Linux.
Nevertheless, that doesn't change the basic issue. The lock is locked, you
cannot afford to wait for it, but not getting the lock is expensive. The
solution is to yield and check the lock again. If it's still held, you
dispatch to another thread, but many times, yielding can avoid that.

A futex doesn't change the fact that sometimes you can't afford to block on
a lock but nevertheless would save significant effort if you were able to
acquire it. Odds are the thread that holds it is about to release it anyway.

That is, you need something in-between non-blocking "trylock, fail easily"
and blocking "lock, do not fail", but you'd rather make forward progress
without the lock than actually block/sleep.

DS


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 15:17:52 -0700
David Schwartz [EMAIL PROTECTED] wrote:

 
 Arjan van de Ven wrote:
 
   It can occasionally be an optimization. You may have a case where
   you can do something very efficiently if a lock is not held, but
   you cannot afford to wait for the lock to be released. So you
   check the lock, if it's held, you yield and then check again. If
   that fails, you do it the less optimal way (for example,
   dispatching it to a thread that *can* afford to wait).
 
  at this point it's use a futex instead; once you're doing system
  calls you might as well use the right one for what you're trying to
  achieve.
 
 There are two answers to this. One is that you sometimes are writing
 POSIX code and Linux-specific optimizations don't change the fact
 that you still need a portable implementation.
 
 The other answer is that futexes don't change anything in this case.
 In fact, in the last time I hit this, the lock was a futex on Linux.
 Nevertheless, that doesn't change the basic issue. The lock is
 locked, you cannot afford to wait for it, but not getting the lock is
 expensive. The solution is to yield and check the lock again. If it's
 still held, you dispatch to another thread, but many times, yielding
 can avoid that.

yielding IS blocking. Just with indeterminate fuzziness added to it
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Network slowdown due to CFS

2007-10-01 Thread David Schwartz

 yielding IS blocking. Just with indeterminate fuzziness added to it

Yielding is sort of blocking, but the difference is that yielding will not
idle the CPU while blocking might. Yielding is sometimes preferable to
blocking in a case where the thread knows it can make forward progress even
if it doesn't get the resource. (As in the examples I explained.)

DS


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-10-01 Thread Arjan van de Ven
On Mon, 1 Oct 2007 15:44:09 -0700
David Schwartz [EMAIL PROTECTED] wrote:

 
  yielding IS blocking. Just with indeterminate fuzziness added to
  it
 
 Yielding is sort of blocking, but the difference is that yielding
 will not idle the CPU while blocking might. 

not really; SOMEONE will make progress, the one holding the lock.
Granted, he can be on some other cpu, but at that point all yielding
gets you is a bunch of cache bounces.

Yielding is sometimes
 preferable to blocking in a case where the thread knows it can make
 forward progress even if it doesn't get the resource. (As in the
 examples I explained.)

that's also what trylock is for... as well as spinaphores...
(you can argue that futexes should be more intelligent and do
spinaphore stuff etc... and I can buy that, let's improve them in the
kernel by any means. But userspace yield() isn't the answer. A
yield_to() would have been a ton better (which would return immediately
if the thing you want to yield to is running already somewhere), a
blind yield isn't, since it doesn't say what you want to yield to.

Note: The answer to "what to yield to" isn't "everything that might
want to run"; we tried that way back when the 2.6.early scheduler was
designed and that turns out to not be what people calling yield
expected.. (it made their things even slower than they thought). So
they want "yield to" semantics, without telling the kernel what they
want to yield to, and complain if the kernel second-guesses wrongly.


not a good api.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-29 Thread Nick Piggin
On Friday 28 September 2007 00:42, Jarek Poplawski wrote:
> On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> > * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
>
> ...
>
> > > OK, but let's forget about fixing iperf. Probably I got this wrong,
> > > but I've thought this "bad" iperf patch was tested on a few nixes and
> > > linux was the most different one. The main point is: even if there is
> > > no standard here, it should be a common interest to try to not differ
> > > too much at least. So, it's not about exactness, but 50% (63 -> 95)
> > > change in linux own 'definition' after upgrading seems to be a lot.
> > > So, IMHO, maybe some 'compatibility' test could be prepared to compare
> > > a few different ideas on this yield and some average value could be a
> > > kind of at least linux' own standard, which should be emulated within
> > > some limits by next kernels?
> >
> > you repeat your point of "emulating yield", and i can only repeat my
> > point that you should please read this:
> >
> > http://lkml.org/lkml/2007/9/19/357
> >
> > because, once you read that, i think you'll agree with me that what you
> > say is simply not possible in a sane way at this stage. We went through
> > a number of yield implementations already and each will change behavior
> > for _some_ category of apps. So right now we offer two implementations,
> > and the default was chosen empirically to minimize the amount of
> > complaints. (but it's not possible to eliminate them altogether, for the
> > reasons outlined above - hence the switch.)
>
> Sorry, but I think you got me wrong: I didn't mean emulation of any
> implementation, but probably the same thing you write above: emulation
> of time/performance. In my opinion this should be done experimentally
> too, but with something more objective and constant than current
> "complaints counter". And the first thing could be a try to set some
> kind of linux internal "standard of yield" for the future by averaging
> a few most popular systems in a test doing things like this iperf or
> preferably more.

By definition, yield is essentially undefined as to the behaviour between
SCHED_OTHER tasks at the same priority level (ie. all of them), because
SCHED_OTHER scheduling behaviour itself is undefined.

It's never going to do exactly what everybody wants, except those using
it for legitimate reasons in realtime applications.
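
(For completeness, the one case where the semantics are well defined:
equal-priority SCHED_FIFO threads, where yield puts the caller at the tail of
the run list for its priority. A sketch - it needs root or CAP_SYS_NICE, the
priority value 10 is arbitrary, and on SMP the threads would have to be
pinned to one CPU to see strict alternation:)

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int i;

    for (i = 0; i < 3; i++) {
        printf("thread %ld, step %d\n", (long)arg, i);
        sched_yield();      /* well-defined here: go behind the other
                               runnable SCHED_FIFO thread of equal prio */
    }
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };
    pthread_t t1, t2;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    if (pthread_create(&t1, &attr, worker, (void *)1L) != 0 ||
        pthread_create(&t2, &attr, worker, (void *)2L) != 0) {
        fprintf(stderr, "pthread_create failed (SCHED_FIFO needs privileges)\n");
        return 1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}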
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
...
> > OK, but let's forget about fixing iperf. Probably I got this wrong, 
> > but I've thought this "bad" iperf patch was tested on a few nixes and 
> > linux was the most different one. The main point is: even if there is 
> > no standard here, it should be a common interest to try to not differ 
> > too much at least. So, it's not about exactness, but 50% (63 -> 95) 
> > change in linux own 'definition' after upgrading seems to be a lot. 
> > So, IMHO, maybe some 'compatibility' test could be prepared to compare 
> > a few different ideas on this yield and some average value could be a 
> > kind of at least linux' own standard, which should be emulated within 
> > some limits by next kernels?
> 
> you repeat your point of "emulating yield", and i can only repeat my 
> point that you should please read this:
> 
> http://lkml.org/lkml/2007/9/19/357
> 
> because, once you read that, i think you'll agree with me that what you 
> say is simply not possible in a sane way at this stage. We went through 
> a number of yield implementations already and each will change behavior 
> for _some_ category of apps. So right now we offer two implementations, 
> and the default was chosen empirically to minimize the amount of 
> complaints. (but it's not possible to eliminate them altogether, for the 
> reasons outlined above - hence the switch.)

Sorry, but I think you got me wrong: I didn't mean emulation of any
implementation, but probably the same thing you write above: emulation
of time/performance. In my opinion this should be done experimentally
too, but with something more objective and constant than current
"complaints counter". And the first thing could be a try to set some
kind of linux internal "standard of yield" for the future by averaging
a few most popular systems in a test doing things like this iperf or
preferably more.

Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
[...]
> > What you missed is that there is no such thing as "predictable yield 
> > behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
> > keep the behavior). Please read this thread on lkml for a more detailed 
> > background:
> > 
> >CFS: some bad numbers with Java/database threading [FIXED]
> > 
> >http://lkml.org/lkml/2007/9/19/357
> >http://lkml.org/lkml/2007/9/19/328
> > 
> > in short: the yield implementation was tied to the O(1) scheduler, so 
> > the only way to have the exact same behavior would be to have the exact 
> > same core scheduler again. If what you said was true we would not be 
> > able to change the scheduler, ever. For something as vaguely defined of 
> > an API as yield, there's just no way to have a different core scheduler 
> > and still behave the same way.
> > 
> > So _generally_ i'd agree with you that normally we want to be bug for 
> > bug compatible, but in this specific (iperf) case there's just no point 
> > in preserving behavior that papers over this _clearly_ broken user-space 
> > app/thread locking (for which now two fixes exist already, plus a third 
> > fix is the twiddling of that sysctl).
> > 
> 
> OK, but let's forget about fixing iperf. Probably I got this wrong, 
> but I've thought this "bad" iperf patch was tested on a few nixes and 
> linux was the most different one. The main point is: even if there is 
> no standard here, it should be a common interest to try to not differ 
> too much at least. So, it's not about exactness, but 50% (63 -> 95) 
> change in linux own 'definition' after upgrading seems to be a lot. 
> So, IMHO, maybe some 'compatibility' test could be prepared to compare 
> a few different ideas on this yield and some average value could be a 
> kind of at least linux' own standard, which should be emulated within 
> some limits by next kernels?

you repeat your point of "emulating yield", and i can only repeat my 
point that you should please read this:

http://lkml.org/lkml/2007/9/19/357

because, once you read that, i think you'll agree with me that what you 
say is simply not possible in a sane way at this stage. We went through 
a number of yield implementations already and each will change behavior 
for _some_ category of apps. So right now we offer two implementations, 
and the default was chosen empirically to minimize the amount of 
complaints. (but it's not possible to eliminate them altogether, for the 
reasons outlined above - hence the switch.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> 
> > > the (small) patch below fixes the iperf locking bug and removes the 
> > > yield() use. There are numerous immediate benefits of this patch:
> > ...
> > > 
> > > sched_yield() is almost always the symptom of broken locking or other 
> > > bug. In that sense CFS does the right thing by exposing such bugs =B-)
> > 
> > ...Only if it were under some DEBUG option. [...]
> 
> note that i qualified my sentence both via "In that sense" and via a 
> smiley! So i was not suggesting that this is a general rule at all and i 
> was also joking :-)

Actually, I've analyzed this smiley for some time but these scheduler
jokes are really hard, and I definitely need more time...

> 
> > [...] Even if iperf is doing the wrong thing there is no explanation 
> > for such big difference in the behavior between sched_compat_yield 1 
> > vs. 0. It seems common interfaces should work similarly and 
> > predictably on various systems, and here, if I didn't miss something, 
> > linux looks like a different kind?
> 
> What you missed is that there is no such thing as "predictable yield 
> behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
> keep the behavior). Please read this thread on lkml for a more detailed 
> background:
> 
>CFS: some bad numbers with Java/database threading [FIXED]
> 
>http://lkml.org/lkml/2007/9/19/357
>http://lkml.org/lkml/2007/9/19/328
> 
> in short: the yield implementation was tied to the O(1) scheduler, so 
> the only way to have the exact same behavior would be to have the exact 
> same core scheduler again. If what you said was true we would not be 
> able to change the scheduler, ever. For something as vaguely defined of 
> an API as yield, there's just no way to have a different core scheduler 
> and still behave the same way.
> 
> So _generally_ i'd agree with you that normally we want to be bug for 
> bug compatible, but in this specific (iperf) case there's just no point 
> in preserving behavior that papers over this _clearly_ broken user-space 
> app/thread locking (for which now two fixes exist already, plus a third 
> fix is the twiddling of that sysctl).
> 

OK, but let's forget about fixing iperf. Probably I got this wrong,
but I've thought this "bad" iperf patch was tested on a few nixes and
linux was the most different one. The main point is: even if there is
no standard here, it should be a common interest to try to not differ
too much at least. So, it's not about exactness, but 50% (63 -> 95)
change in linux own 'definition' after upgrading seems to be a lot.
So, IMHO, maybe some 'compatibility' test could be prepared to
compare a few different ideas on this yield and some average value
could be a kind of at least linux' own standard, which should be
emulated within some limits by next kernels?

Thanks,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Martin Michlmayr
* Ingo Molnar <[EMAIL PROTECTED]> [2007-09-27 12:56]:
> i'm curious by how much does CPU go down, and what's the output of
> iperf? (does it saturate full 100mbit network bandwidth)

I get about 94-95 Mbits/sec and CPU drops from 99% to about 82% (this
is with a 600 MHz ARM CPU).
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> * Ingo Molnar <[EMAIL PROTECTED]> [2007-09-27 11:49]:
> > Martin, could you check the iperf patch below instead of the yield
> > patch - does it solve the iperf performance problem equally well,
> > and does CPU utilization drop for you too?
> 
> Yes, it works and CPU goes down too.

i'm curious by how much does CPU go down, and what's the output of 
iperf? (does it saturate full 100mbit network bandwidth)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Martin Michlmayr
* Ingo Molnar <[EMAIL PROTECTED]> [2007-09-27 11:49]:
> Martin, could you check the iperf patch below instead of the yield
> patch - does it solve the iperf performance problem equally well,
> and does CPU utilization drop for you too?

Yes, it works and CPU goes down too.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> > I think the real fix would be for iperf to use blocking network IO 
> > though, or maybe to use a POSIX mutex or POSIX semaphores.
> 
> So it's definitely not a bug in the kernel, only in iperf?
> 
> (CCing Stephen Hemminger who wrote the iperf patch.)

Martin, could you check the iperf patch below instead of the yield patch 
- does it solve the iperf performance problem equally well, and does CPU 
utilization drop for you too?

Ingo

-->
Subject: iperf: fix locking
From: Ingo Molnar <[EMAIL PROTECTED]>

fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 compat/Thread.c |3 ---
 src/Reporter.c  |   13 +
 src/main.cpp|2 ++
 3 files changed, 11 insertions(+), 7 deletions(-)

Index: iperf-2.0.2/compat/Thread.c
===
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
 void thread_rest ( void ) {
 #if defined( HAVE_THREAD )
 #if defined( HAVE_POSIX_THREAD )
-// TODO add checks for sched_yield or pthread_yield and call that
-// if available
-usleep( 0 );
 #else // Win32
 SwitchToThread( );
 #endif
Index: iperf-2.0.2/src/Reporter.c
===
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
 char buffer[64]; // Buffer for printing
 ReportHeader *ReportRoot = NULL;
 extern Condition ReportCond;
+extern Condition ReportDoneCond;
 int reporter_process_report ( ReportHeader *report );
 void process_report ( ReportHeader *report );
 int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent, 
 // item
 while ( index == 0 ) {
Condition_Signal( &ReportCond );
-thread_rest();
+Condition_Wait( &ReportDoneCond );
 index = agent->reporterindex;
 }
 agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent, 
 // Need to make sure that reporter is not about to be "lapped"
 while ( index - 1 == agent->agentindex ) {
Condition_Signal( &ReportCond );
-thread_rest();
+Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
 
@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
 }
 Condition_Unlock ( ReportCond );
 
+again:
 if ( ReportRoot != NULL ) {
 ReportHeader *temp = ReportRoot;
 //Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
 // finished with report so free it
 free( temp );
 Condition_Unlock ( ReportCond );
+   Condition_Signal( &ReportDoneCond );
+   if (ReportRoot)
+   goto again;
 }
-// yield control of CPU is another thread is waiting
-thread_rest();
+Condition_Signal( &ReportDoneCond );
+usleep(1);
 } else {
 //Condition_Unlock ( ReportCond );
 }
Index: iperf-2.0.2/src/main.cpp
===
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern "C" {
 // records being accessed in a report and also to
 // serialize modification of the report list
 Condition ReportCond;
+Condition ReportDoneCond;
 }
 
 // global variables only accessed within this file
@@ -141,6 +142,7 @@ int main( int argc, char **argv ) {
 
 // Initialize global mutexes and conditions
Condition_Initialize ( &ReportCond );
+Condition_Initialize ( &ReportDoneCond );
Mutex_Initialize( &groupCond );
Mutex_Initialize( &clients_mutex );
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> > the (small) patch below fixes the iperf locking bug and removes the 
> > yield() use. There are numerous immediate benefits of this patch:
> ...
> > 
> > sched_yield() is almost always the symptom of broken locking or other 
> > bug. In that sense CFS does the right thing by exposing such bugs =B-)
> 
> ...Only if it were under some DEBUG option. [...]

note that i qualified my sentence both via "In that sense" and via a 
smiley! So i was not suggesting that this is a general rule at all and i 
was also joking :-)

> [...] Even if iperf is doing the wrong thing there is no explanation 
> for such big difference in the behavior between sched_compat_yield 1 
> vs. 0. It seems common interfaces should work similarly and 
> predictably on various systems, and here, if I didn't miss something, 
> linux looks like a different kind?

What you missed is that there is no such thing as "predictable yield 
behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
keep the behavior). Please read this thread on lkml for a more detailed 
background:

   CFS: some bad numbers with Java/database threading [FIXED]

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

in short: the yield implementation was tied to the O(1) scheduler, so 
the only way to have the exact same behavior would be to have the exact 
same core scheduler again. If what you said was true we would not be 
able to change the scheduler, ever. For something as vaguely defined of 
an API as yield, there's just no way to have a different core scheduler 
and still behave the same way.

So _generally_ i'd agree with you that normally we want to be bug for 
bug compatible, but in this specific (iperf) case there's just no point 
in preserving behavior that papers over this _clearly_ broken user-space 
app/thread locking (for which now two fixes exist already, plus a third 
fix is the twiddling of that sysctl).

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On 26-09-2007 15:31, Ingo Molnar wrote:
> * David Schwartz <[EMAIL PROTECTED]> wrote:
> 
 I think the real fix would be for iperf to use blocking network IO 
 though, or maybe to use a POSIX mutex or POSIX semaphores.
>>> So it's definitely not a bug in the kernel, only in iperf?
>> Martin:
>>
>> Actually, in this case I think iperf is doing the right thing (though not
>> the best thing) and the kernel is doing the wrong thing. [...]
> 
> it's not doing the right thing at all. I had a quick look at the source 
> code, and the reason for that weird yield usage was that there's a 
> locking bug in iperf's "Reporter thread" abstraction and apparently 
> instead of fixing the bug it was worked around via a horrible yield() 
> based user-space lock.
> 
> the (small) patch below fixes the iperf locking bug and removes the 
> yield() use. There are numerous immediate benefits of this patch:
...
> 
> sched_yield() is almost always the symptom of broken locking or other 
> bug. In that sense CFS does the right thing by exposing such bugs =B-)

...Only if it were under some DEBUG option. Even if iperf is doing
the wrong thing there is no explanation for such big difference in
the behavior between sched_compat_yield 1 vs. 0. It seems common
interfaces should work similarly and predictably on various
systems, and here, if I didn't miss something, linux looks like a
different kind?

Regards,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On 26-09-2007 15:31, Ingo Molnar wrote:
 * David Schwartz [EMAIL PROTECTED] wrote:
 
 I think the real fix would be for iperf to use blocking network IO 
 though, or maybe to use a POSIX mutex or POSIX semaphores.
 So it's definitely not a bug in the kernel, only in iperf?
 Martin:

 Actually, in this case I think iperf is doing the right thing (though not
 the best thing) and the kernel is doing the wrong thing. [...]
 
 it's not doing the right thing at all. I had a quick look at the source 
 code, and the reason for that weird yield usage was that there's a 
 locking bug in iperf's Reporter thread abstraction and apparently 
 instead of fixing the bug it was worked around via a horrible yield() 
 based user-space lock.
 
 the (small) patch below fixes the iperf locking bug and removes the 
 yield() use. There are numerous immediate benefits of this patch:
...
 
 sched_yield() is almost always the symptom of broken locking or other 
 bug. In that sense CFS does the right thing by exposing such bugs =B-)

...Only if it were under some DEBUG option. Even if iperf is doing
the wrong thing there is no explanation for such big difference in
the behavior between sched_compat_yield 1 vs. 0. It seems common
interfaces should work similarly and predictably on various
systems, and here, if I didn't miss something, linux looks like a
different kind?

Regards,
Jarek P.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

  the (small) patch below fixes the iperf locking bug and removes the 
  yield() use. There are numerous immediate benefits of this patch:
 ...
  
  sched_yield() is almost always the symptom of broken locking or other 
  bug. In that sense CFS does the right thing by exposing such bugs =B-)
 
 ...Only if it were under some DEBUG option. [...]

note that i qualified my sentence both via In that sense and via a 
smiley! So i was not suggesting that this is a general rule at all and i 
was also joking :-)

 [...] Even if iperf is doing the wrong thing there is no explanation 
 for such big difference in the behavior between sched_compat_yield 1 
 vs. 0. It seems common interfaces should work similarly and 
 predictably on various systems, and here, if I didn't miss something, 
 linux looks like a different kind?

What you missed is that there is no such thing as predictable yield 
behavior for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
keep the behavior). Please read this thread on lkml for a more detailed 
background:

   CFS: some bad numbers with Java/database threading [FIXED]

   http://lkml.org/lkml/2007/9/19/357
   http://lkml.org/lkml/2007/9/19/328

in short: the yield implementation was tied to the O(1) scheduler, so 
the only way to have the exact same behavior would be to have the exact 
same core scheduler again. If what you said was true we would not be 
able to change the scheduler, ever. For something as vaguely defined of 
an API as yield, there's just no way to have a different core scheduler 
and still behave the same way.

So _generally_ i'd agree with you that normally we want to be bug for 
bug compatible, but in this specific (iperf) case there's just no point 
in preserving behavior that papers over this _clearly_ broken user-space 
app/thread locking (for which now two fixes exist already, plus a third 
fix is the twiddling of that sysctl).

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Martin Michlmayr [EMAIL PROTECTED] wrote:

  I think the real fix would be for iperf to use blocking network IO 
  though, or maybe to use a POSIX mutex or POSIX semaphores.
 
 So it's definitely not a bug in the kernel, only in iperf?
 
 (CCing Stephen Hemminger who wrote the iperf patch.)

Martin, could you check the iperf patch below instead of the yield patch 
- does it solve the iperf performance problem equally well, and does CPU 
utilization drop for you too?

Ingo

--
Subject: iperf: fix locking
From: Ingo Molnar [EMAIL PROTECTED]

fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 compat/Thread.c |3 ---
 src/Reporter.c  |   13 +
 src/main.cpp|2 ++
 3 files changed, 11 insertions(+), 7 deletions(-)

Index: iperf-2.0.2/compat/Thread.c
===
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
 void thread_rest ( void ) {
 #if defined( HAVE_THREAD )
 #if defined( HAVE_POSIX_THREAD )
-// TODO add checks for sched_yield or pthread_yield and call that
-// if available
-usleep( 0 );
 #else // Win32
 SwitchToThread( );
 #endif
Index: iperf-2.0.2/src/Reporter.c
===
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
 char buffer[64]; // Buffer for printing
 ReportHeader *ReportRoot = NULL;
 extern Condition ReportCond;
+extern Condition ReportDoneCond;
 int reporter_process_report ( ReportHeader *report );
 void process_report ( ReportHeader *report );
 int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent, 
 // item
 while ( index == 0 ) {
 Condition_Signal( ReportCond );
-thread_rest();
+Condition_Wait( ReportDoneCond );
 index = agent-reporterindex;
 }
 agent-agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent, 
 // Need to make sure that reporter is not about to be lapped
 while ( index - 1 == agent-agentindex ) {
 Condition_Signal( ReportCond );
-thread_rest();
+Condition_Wait( ReportDoneCond );
 index = agent-reporterindex;
 }
 
@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
 }
 Condition_Unlock ( ReportCond );
 
+again:
 if ( ReportRoot != NULL ) {
 ReportHeader *temp = ReportRoot;
 //Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
 // finished with report so free it
 free( temp );
 Condition_Unlock ( ReportCond );
+   Condition_Signal( ReportDoneCond );
+   if (ReportRoot)
+   goto again;
 }
-// yield control of CPU is another thread is waiting
-thread_rest();
+Condition_Signal( ReportDoneCond );
+usleep(1);
 } else {
 //Condition_Unlock ( ReportCond );
 }
Index: iperf-2.0.2/src/main.cpp
===
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern C {
 // records being accessed in a report and also to
 // serialize modification of the report list
 Condition ReportCond;
+Condition ReportDoneCond;
 }
 
 // global variables only accessed within this file
@@ -141,6 +142,7 @@ int main( int argc, char **argv ) {
 
 // Initialize global mutexes and conditions
 Condition_Initialize ( ReportCond );
+Condition_Initialize ( ReportDoneCond );
 Mutex_Initialize( groupCond );
 Mutex_Initialize( clients_mutex );
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Martin Michlmayr
* Ingo Molnar [EMAIL PROTECTED] [2007-09-27 11:49]:
 Martin, could you check the iperf patch below instead of the yield
 patch - does it solve the iperf performance problem equally well,
 and does CPU utilization drop for you too?

Yes, it works and CPU goes down too.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Martin Michlmayr [EMAIL PROTECTED] wrote:

 * Ingo Molnar [EMAIL PROTECTED] [2007-09-27 11:49]:
  Martin, could you check the iperf patch below instead of the yield
  patch - does it solve the iperf performance problem equally well,
  and does CPU utilization drop for you too?
 
 Yes, it works and CPU goes down too.

i'm curious by how much does CPU go down, and what's the output of 
iperf? (does it saturate full 100mbit network bandwidth)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Martin Michlmayr
* Ingo Molnar [EMAIL PROTECTED] [2007-09-27 12:56]:
 i'm curious by how much does CPU go down, and what's the output of
 iperf? (does it saturate full 100mbit network bandwidth)

I get about 94-95 Mbits/sec and CPU drops from 99% to about 82% (this
is with a 600 MHz ARM CPU).
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
 
 * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
   the (small) patch below fixes the iperf locking bug and removes the 
   yield() use. There are numerous immediate benefits of this patch:
  ...
   
   sched_yield() is almost always the symptom of broken locking or other 
   bug. In that sense CFS does the right thing by exposing such bugs =B-)
  
  ...Only if it were under some DEBUG option. [...]
 
 note that i qualified my sentence both via In that sense and via a 
 smiley! So i was not suggesting that this is a general rule at all and i 
 was also joking :-)

Actually, I've analyzed this smiley for some time but these scheduler
jokes are really hard, and I definitely need more time...

 
  [...] Even if iperf is doing the wrong thing there is no explanation 
  for such big difference in the behavior between sched_compat_yield 1 
  vs. 0. It seems common interfaces should work similarly and 
  predictably on various systems, and here, if I didn't miss something, 
  linux looks like a different kind?
 
 What you missed is that there is no such thing as predictable yield 
 behavior for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
 keep the behavior). Please read this thread on lkml for a more detailed 
 background:
 
CFS: some bad numbers with Java/database threading [FIXED]
 
http://lkml.org/lkml/2007/9/19/357
http://lkml.org/lkml/2007/9/19/328
 
 in short: the yield implementation was tied to the O(1) scheduler, so 
 the only way to have the exact same behavior would be to have the exact 
 same core scheduler again. If what you said was true we would not be 
 able to change the scheduler, ever. For something as vaguely defined of 
 an API as yield, there's just no way to have a different core scheduler 
 and still behave the same way.
 
 So _generally_ i'd agree with you that normally we want to be bug for 
 bug compatible, but in this specific (iperf) case there's just no point 
 in preserving behavior that papers over this _clearly_ broken user-space 
 app/thread locking (for which now two fixes exist already, plus a third 
 fix is the twiddling of that sysctl).
 

OK, but let's forget about fixing iperf. Probably I got this wrong,
but I've thought this bad iperf patch was tested on a few nixes and
linux was the most different one. The main point is: even if there is
no standard here, it should be a common interest to try to not differ
too much at least. So, it's not about exactness, but 50% (63 - 95)
change in linux own 'definition' after upgrading seems to be a lot.
So, IMHO, maybe some 'compatibility' test could be prepared to
compare a few different ideas on this yield and some average value
could be a kind of at least linux' own standard, which should be
emulated within some limits by next kernels?

Thanks,
Jarek P.


Re: Network slowdown due to CFS

2007-09-27 Thread Ingo Molnar

* Jarek Poplawski <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
[...]
> > What you missed is that there is no such thing as predictable yield 
> > behavior for anything but SCHED_FIFO/RR tasks (for which tasks CFS does 
> > keep the behavior). Please read this thread on lkml for a more detailed 
> > background:
> > 
> >    CFS: some bad numbers with Java/database threading [FIXED]
> > 
> >    http://lkml.org/lkml/2007/9/19/357
> >    http://lkml.org/lkml/2007/9/19/328
> > 
> > in short: the yield implementation was tied to the O(1) scheduler, so 
> > the only way to have the exact same behavior would be to have the exact 
> > same core scheduler again. If what you said was true we would not be 
> > able to change the scheduler, ever. For something as vaguely defined of 
> > an API as yield, there's just no way to have a different core scheduler 
> > and still behave the same way.
> > 
> > So _generally_ i'd agree with you that normally we want to be bug for 
> > bug compatible, but in this specific (iperf) case there's just no point 
> > in preserving behavior that papers over this _clearly_ broken user-space 
> > app/thread locking (for which now two fixes exist already, plus a third 
> > fix is the twiddling of that sysctl).
> > 
> 
> OK, but let's forget about fixing iperf. Probably I got this wrong, 
> but I've thought this bad iperf patch was tested on a few nixes and 
> linux was the most different one. The main point is: even if there is 
> no standard here, it should be a common interest to try to not differ 
> too much at least. So, it's not about exactness, but 50% (63 - 95) 
> change in linux own 'definition' after upgrading seems to be a lot. 
> So, IMHO, maybe some 'compatibility' test could be prepared to compare 
> a few different ideas on this yield and some average value could be a 
> kind of at least linux' own standard, which should be emulated within 
> some limits by next kernels?

you repeat your point of emulating yield, and i can only repeat my 
point that you should please read this:

http://lkml.org/lkml/2007/9/19/357

because, once you read that, i think you'll agree with me that what you 
say is simply not possible in a sane way at this stage. We went through 
a number of yield implementations already and each will change behavior 
for _some_ category of apps. So right now we offer two implementations, 
and the default was chosen empirically to minimize the amount of 
complaints. (but it's not possible to eliminate them altogether, for the 
reasons outlined above - hence the switch.)

Ingo


Re: Network slowdown due to CFS

2007-09-27 Thread Jarek Poplawski
On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <[EMAIL PROTECTED]> wrote:
...
> > OK, but let's forget about fixing iperf. Probably I got this wrong, 
> > but I've thought this bad iperf patch was tested on a few nixes and 
> > linux was the most different one. The main point is: even if there is 
> > no standard here, it should be a common interest to try to not differ 
> > too much at least. So, it's not about exactness, but 50% (63 - 95) 
> > change in linux own 'definition' after upgrading seems to be a lot. 
> > So, IMHO, maybe some 'compatibility' test could be prepared to compare 
> > a few different ideas on this yield and some average value could be a 
> > kind of at least linux' own standard, which should be emulated within 
> > some limits by next kernels?
> 
> you repeat your point of emulating yield, and i can only repeat my 
> point that you should please read this:
> 
>    http://lkml.org/lkml/2007/9/19/357
> 
> because, once you read that, i think you'll agree with me that what you 
> say is simply not possible in a sane way at this stage. We went through 
> a number of yield implementations already and each will change behavior 
> for _some_ category of apps. So right now we offer two implementations, 
> and the default was chosen empirically to minimize the amount of 
> complaints. (but it's not possible to eliminate them altogether, for the 
> reasons outlined above - hence the switch.)

Sorry, but I think you got me wrong: I didn't mean emulation of any
implementation, but probably the same thing you write above: emulation
of time/performance. In my opinion this should be done experimentally
too, but with something more objective and constant than the current
complaints counter. And the first step could be an attempt to set some
kind of Linux-internal standard of yield for the future, by averaging
a few of the most popular systems in a test doing things like this
iperf run, or preferably more.

Jarek P.


Re: Network slowdown due to CFS

2007-09-26 Thread Stephen Hemminger
Here is the combined fixes from iperf-users list.

Begin forwarded message:

Date: Thu, 30 Aug 2007 15:55:22 -0400
From: "Andrew Gallatin" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: [PATCH] performance fixes for non-linux


Hi,

I've attached a patch which gives iperf similar performance to netperf
on my FreeBSD, MacOSX and Solaris hosts.  It does not seem to
negatively impact Linux.  I only started looking at the iperf source
yesterday, so I don't really expect this to be integrated as is, but a
patch is worth a 1000 words :)

Background: On both Solaris and FreeBSD, there are 2 things slowing
iperf down: The gettimeofday timestamp around each socket read/write
is terribly expensive, and the sched_yield() or usleep(0) causes iperf
to take 100% of the time (system time on BSD, split user/system time
on Solaris and MacOSX), which slows things down and confuses the
scheduler.
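
For illustration only (hypothetical names, not the actual iperf code), the
pattern described above looks roughly like this on the sender side -- a
gettimeofday() call per packet to decide when to stop, plus a yield to "let
the reporter run":

#include <sched.h>
#include <sys/time.h>
#include <unistd.h>

static void send_loop(int sock, const char *buf, size_t len, double seconds)
{
    struct timeval start, now;

    gettimeofday(&start, NULL);
    for (;;) {
        (void)write(sock, buf, len);        /* one send                   */
        gettimeofday(&now, NULL);           /* timestamp after every send */
        if ((now.tv_sec  - start.tv_sec) +
            (now.tv_usec - start.tv_usec) / 1e6 >= seconds)
            break;
        sched_yield();                      /* "let the reporter run"     */
    }
}

Both per-iteration calls are cheap enough on Linux but show up heavily on
the other platforms mentioned above.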

To address the gettimeofday() issue, I treat TCP differently than UDP,
and TCP tests behave as though only a single (huge) packet was sent.
Rather than ending the test based on polling gettimeofday()
timestamps, an interval timer / sigalarm handler is used.  I had
to increase the packetLen from an int to a max_size_t.
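
A minimal sketch of that alarm-driven approach, assuming a one-shot
ITIMER_REAL and a simplified send loop (hypothetical names, not the patch
itself):

#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t test_done;

static void on_alarm(int sig)
{
    (void)sig;
    test_done = 1;                     /* just flag the send loop to stop */
}

static void send_for(int sock, const char *buf, size_t len, int seconds)
{
    struct sigaction sa;
    struct itimerval it = { { 0, 0 }, { seconds, 0 } };

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigaction(SIGALRM, &sa, NULL);     /* one handler...                  */
    setitimer(ITIMER_REAL, &it, NULL); /* ...one timer for the whole test */

    while (!test_done)                 /* no gettimeofday() per packet    */
        (void)write(sock, buf, len);
}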

To address the sched_yield/usleep issue, I put the reporter thread
to sleep on a condition variable.  For the TCP tests at least, there
is no reason to have it running during the test and it is best
to just get it out of the way rather than burning CPU in a tight
loop.
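
A minimal sketch of that idea with a plain pthread mutex/condition pair
(hypothetical names, not the actual Reporter structures): the reporter
sleeps until the sender hands it a report, instead of spinning on
sched_yield()/usleep(0):

#include <pthread.h>

static pthread_mutex_t report_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  report_ready = PTHREAD_COND_INITIALIZER;
static int pending_reports;

/* sender side: queue a report and wake the reporter */
static void post_report(void)
{
    pthread_mutex_lock(&report_lock);
    pending_reports++;
    pthread_cond_signal(&report_ready);
    pthread_mutex_unlock(&report_lock);
}

/* reporter side: sleep until there is something to print */
static void wait_for_report(void)
{
    pthread_mutex_lock(&report_lock);
    while (pending_reports == 0)
        pthread_cond_wait(&report_ready, &report_lock);
    pending_reports--;
    pthread_mutex_unlock(&report_lock);
}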

I've also incorporated some fixes from the FreeBSD ports collection:

--- include/headers.h
use a 64-bit type for max_size_t

--- compat/Thread.c
oldTID is not declared anywhere.  Make this compile
(seems needed for at least FreeBSD & MacOSX)

--- src/Client.cpp
BSDs can return ENOBUFS during a UDP test when the socket
buffer fills. Don't exit when this happens.
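
For illustration (not part of the patch, hypothetical names), the kind of
handling the Client.cpp note describes -- treat ENOBUFS from a UDP send as
a transient "socket buffer full" condition and retry instead of aborting:

#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

static void udp_send_one(int sock, const void *buf, size_t len)
{
    while (send(sock, buf, len, 0) < 0) {
        if (errno != ENOBUFS)   /* a real error: give up, let the caller decide */
            return;
        usleep(1000);           /* BSD ran out of buffer space: back off, retry */
    }
}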

I've run the resulting iperf on FreeBSD, Solaris, MacOSX and Linux,
and it seems to work for me.  It is nice not to have a 100% CPU
load when running an iperf test across a 100Mb/s network.

Drew


Index: include/Reporter.h
===
--- include/Reporter.h	(revision 11)
+++ include/Reporter.h	(working copy)
@@ -74,7 +74,7 @@
  */
 typedef struct ReportStruct {
 int packetID;
-int packetLen;
+max_size_t packetLen;
 struct timeval packetTime;
 struct timeval sentTime;
 } ReportStruct;
Index: include/headers.h
===
--- include/headers.h	(revision 11)
+++ include/headers.h	(working copy)
@@ -180,7 +180,7 @@
 // from the gnu archive
 
 #include 
-typedef uintmax_t max_size_t;
+typedef uint64_t max_size_t;
 
 /* in case the OS doesn't have these, we provide our own implementations */
 #include "gettimeofday.h"
Index: include/Client.hpp
===
--- include/Client.hpp	(revision 11)
+++ include/Client.hpp	(working copy)
@@ -69,6 +69,9 @@
 // connects and sends data
 void Run( void );
 
+// TCP specific version of above
+void RunTCP( void );
+
 void InitiateServer();
 
 // UDP / TCP
Index: compat/Thread.c
===
--- compat/Thread.c	(revision 11)
+++ compat/Thread.c	(working copy)
@@ -202,7 +202,7 @@
 #if   defined( HAVE_POSIX_THREAD )
 // Cray J90 doesn't have pthread_cancel; Iperf works okay without
 #ifdef HAVE_PTHREAD_CANCEL
-pthread_cancel( oldTID );
+pthread_cancel( thread->mTID );
 #endif
 #else // Win32
 // this is a somewhat dangerous function; it's not
Index: src/Reporter.c
===
--- src/Reporter.c	(revision 11)
+++ src/Reporter.c	(working copy)
@@ -110,6 +110,8 @@
 
 char buffer[64]; // Buffer for printing
 ReportHeader *ReportRoot = NULL;
+int threadWait = 0;
+int threadSleeping = 0;
 extern Condition ReportCond;
 int reporter_process_report ( ReportHeader *report );
 void process_report ( ReportHeader *report );
@@ -349,7 +351,9 @@
 thread_rest();
 index = agent->reporterindex;
 }
-
+	if (threadSleeping)
+   Condition_Signal(  );
+
 // Put the information there
 memcpy( agent->data + agent->agentindex, packet, sizeof(ReportStruct) );
 
@@ -378,6 +382,9 @@
 packet->packetLen = 0;
 ReportPacket( agent, packet );
 packet->packetID = agent->report.cntDatagrams;
+	if (threadSleeping)
+   Condition_Signal(  );
+
 }
 }
 
@@ -389,6 +396,9 @@
 void EndReport( ReportHeader *agent ) {
 if ( agent != NULL ) {
 int index = agent->reporterindex;
+	if (threadSleeping)
+   Condition_Signal(  );
+
 while ( index != -1 ) {
 thread_rest();
 index = agent->reporterindex;
@@ -457,6 +467,10 @@
  * Update the ReportRoot to include this report.
 

Re: Network slowdown due to CFS

2007-09-26 Thread Stephen Hemminger
On Wed, 26 Sep 2007 15:31:38 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * David Schwartz <[EMAIL PROTECTED]> wrote:
> 
> > > > I think the real fix would be for iperf to use blocking network
> > > > IO though, or maybe to use a POSIX mutex or POSIX semaphores.
> > >
> > > So it's definitely not a bug in the kernel, only in iperf?
> > 
> > Martin:
> > 
> > Actually, in this case I think iperf is doing the right thing
> > (though not the best thing) and the kernel is doing the wrong
> > thing. [...]
> 
> it's not doing the right thing at all. I had a quick look at the
> source code, and the reason for that weird yield usage was that
> there's a locking bug in iperf's "Reporter thread" abstraction and
> apparently instead of fixing the bug it was worked around via a
> horrible yield() based user-space lock.
> 
> the (small) patch below fixes the iperf locking bug and removes the 
> yield() use. There are numerous immediate benefits of this patch:
> 
>  - iperf uses _much_ less CPU time. On my Core2Duo test system,
> before the patch it used up 100% CPU time to saturate 1 gigabit of
> network traffic to another box. With the patch applied it now uses 9%
> of CPU time.
> 
>  - sys_sched_yield() is removed altogether
> 
>  - i was able to measure much higher bandwidth over localhost for 
>example. This is the case for over-the-network measurements as
> well.
> 
>  - the results are also more consistent and more deterministic, hence 
>more reliable as a benchmarking tool. (the reason for that is that
>more CPU time is spent on actually delivering packets, instead of
>mindlessly polling on the user-space "lock", so we actually max out
>the CPU, instead of relying on the random proportion the workload
> was able to make progress versus wasting CPU time on polling.)
> 
> sched_yield() is almost always the symptom of broken locking or other 
> bug. In that sense CFS does the right thing by exposing such bugs =B-)
>  
>   Ingo

A similar patch has already been submitted, since BSD wouldn't work
without it.


Re: Network slowdown due to CFS

2007-09-26 Thread Ingo Molnar

* David Schwartz <[EMAIL PROTECTED]> wrote:

> > > I think the real fix would be for iperf to use blocking network IO 
> > > though, or maybe to use a POSIX mutex or POSIX semaphores.
> >
> > So it's definitely not a bug in the kernel, only in iperf?
> 
> Martin:
> 
> Actually, in this case I think iperf is doing the right thing (though not
> the best thing) and the kernel is doing the wrong thing. [...]

it's not doing the right thing at all. I had a quick look at the source 
code, and the reason for that weird yield usage was that there's a 
locking bug in iperf's "Reporter thread" abstraction and apparently 
instead of fixing the bug it was worked around via a horrible yield() 
based user-space lock.

the (small) patch below fixes the iperf locking bug and removes the 
yield() use. There are numerous immediate benefits of this patch:

 - iperf uses _much_ less CPU time. On my Core2Duo test system, before 
   the patch it used up 100% CPU time to saturate 1 gigabit of network 
   traffic to another box. With the patch applied it now uses 9% of 
   CPU time.

 - sys_sched_yield() is removed altogether

 - i was able to measure much higher bandwidth over localhost for 
   example. This is the case for over-the-network measurements as well.

 - the results are also more consistent and more deterministic, hence 
   more reliable as a benchmarking tool. (the reason for that is that
   more CPU time is spent on actually delivering packets, instead of
   mindlessly polling on the user-space "lock", so we actually max out
   the CPU, instead of relying on the random proportion the workload was
   able to make progress versus wasting CPU time on polling.)

sched_yield() is almost always the symptom of broken locking or other 
bug. In that sense CFS does the right thing by exposing such bugs =B-)
 
Ingo

->
Subject: iperf: fix locking
From: Ingo Molnar <[EMAIL PROTECTED]>

fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 compat/Thread.c |3 ---
 src/Reporter.c  |   13 +
 src/main.cpp|2 ++
 3 files changed, 11 insertions(+), 7 deletions(-)

Index: iperf-2.0.2/compat/Thread.c
===
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
 void thread_rest ( void ) {
 #if defined( HAVE_THREAD )
 #if defined( HAVE_POSIX_THREAD )
-// TODO add checks for sched_yield or pthread_yield and call that
-// if available
-usleep( 0 );
 #else // Win32
 SwitchToThread( );
 #endif
Index: iperf-2.0.2/src/Reporter.c
===
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
 char buffer[64]; // Buffer for printing
 ReportHeader *ReportRoot = NULL;
 extern Condition ReportCond;
+extern Condition ReportDoneCond;
 int reporter_process_report ( ReportHeader *report );
 void process_report ( ReportHeader *report );
 int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent, 
 // item
 while ( index == 0 ) {
 Condition_Signal(  );
-thread_rest();
+Condition_Wait(  );
 index = agent->reporterindex;
 }
 agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent, 
 // Need to make sure that reporter is not about to be "lapped"
 while ( index - 1 == agent->agentindex ) {
 Condition_Signal(  );
-thread_rest();
+Condition_Wait(  );
 index = agent->reporterindex;
 }
 
@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
 }
 Condition_Unlock ( ReportCond );
 
+again:
 if ( ReportRoot != NULL ) {
 ReportHeader *temp = ReportRoot;
 //Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
 // finished with report so free it
 free( temp );
 Condition_Unlock ( ReportCond );
+   Condition_Signal(  );
+   if (ReportRoot)
+   goto again;
 }
-// yield control of CPU is another thread is waiting
-thread_rest();
+Condition_Signal(  );
+usleep(1);
 } else {
 //Condition_Unlock ( ReportCond );
 }
Index: iperf-2.0.2/src/main.cpp
===
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern "C" {
 // records being accessed in a report and also to
 // serialize 

RE: Network slowdown due to CFS

2007-09-26 Thread David Schwartz

> > I think the real fix would be for iperf to use blocking network IO
> > though, or maybe to use a POSIX mutex or POSIX semaphores.
>
> So it's definitely not a bug in the kernel, only in iperf?

Martin:

Actually, in this case I think iperf is doing the right thing (though not
the best thing) and the kernel is doing the wrong thing. It's calling
'sched_yield' to ensure that every other thread gets a chance to run before
the current thread runs again. CFS is not doing that, allowing the yielding
thread to hog the CPU to the exclusion of the other threads. (It can allow
the yielding thread to hog the CPU, of course, just not to the exclusion of
other threads.)
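
A tiny standalone test of the semantics in question (not from iperf; run it
pinned to one CPU, e.g. with taskset -c 0, so the two threads actually
compete): if yield really queues the yielder behind every runnable peer of
equal priority, the worker's counter advances at nearly full speed; if the
yielder effectively keeps the CPU, the worker starves.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long counter;
static volatile int stop;

static void *yielder(void *arg)
{
    while (!stop)
        sched_yield();                 /* "give everyone else a turn" */
    return arg;
}

static void *worker(void *arg)
{
    while (!stop)
        counter++;                     /* should make steady progress */
    return arg;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, yielder, NULL);
    pthread_create(&b, NULL, worker, NULL);
    sleep(5);
    stop = 1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("worker advanced counter to %lu\n", counter);
    return 0;
}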

It's still better to use some kind of rational synchronization primitive
(like mutex/semaphore) so that the other threads can tell you when there's
something for you to do. It's still better to use blocking network IO, so
the kernel will let you know exactly when to try I/O and your dynamic
priority can rise.
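
For comparison, a minimal sketch of the blocking-IO style (a hypothetical
helper, not iperf code): recv() simply sleeps in the kernel until data
arrives, so there is nothing to yield around:

#include <sys/socket.h>
#include <unistd.h>

/* Blocking receive loop: the thread sleeps inside recv() until data is
 * available (or the peer closes), so it never needs sched_yield() to wait
 * politely, and its frequent voluntary sleeps help its dynamic priority. */
static long long drain_socket(int sock)
{
    char buf[8192];
    long long total = 0;
    ssize_t n;

    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        total += n;
    return total;   /* loop ends on EOF (0) or error (-1) */
}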

Ingo:

Can you clarify what CFS' current default sched_yield implementation is and
what setting sched_compat_yield to 1 does? Which way do we get the right
semantics (all threads of equal static priority are scheduled, with some
possible SMP fuzziness, before this thread is scheduled again)?

DS




Re: Network slowdown due to CFS

2007-09-26 Thread Martin Michlmayr
* Ingo Molnar <[EMAIL PROTECTED]> [2007-09-26 13:21]:
> > > I noticed on the iperf website a patch which contains sched_yield().
> > > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
> 
> great! Could you try this too:
>echo 1 > /proc/sys/kernel/sched_compat_yield
> 
> does it fix iperf performance too (with the yield patch applied to
> iperf)?

Yes, this gives me good performance too.

> I think the real fix would be for iperf to use blocking network IO
> though, or maybe to use a POSIX mutex or POSIX semaphores.

So it's definitely not a bug in the kernel, only in iperf?

(CCing Stephen Hemminger who wrote the iperf patch.)
-- 
Martin Michlmayr
http://www.cyrius.com/


Re: Network slowdown due to CFS

2007-09-26 Thread Ingo Molnar

* Martin Michlmayr <[EMAIL PROTECTED]> wrote:

> * Mike Galbraith <[EMAIL PROTECTED]> [2007-09-26 12:23]:
> > I noticed on the iperf website a patch which contains sched_yield().
> > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
> > 
> > Do you have that patch applied by any chance?  If so, it might be a
> > worth while to try it without it.
> 
> Yes, this patch was applied.  When I revert it, I get the same (high) 
> performance with both kernels.

great! Could you try this too:

   echo 1 > /proc/sys/kernel/sched_compat_yield

does it fix iperf performance too (with the yield patch applied to 
iperf)?

I think the real fix would be for iperf to use blocking network IO 
though, or maybe to use a POSIX mutex or POSIX semaphores.
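
A minimal sketch of the semaphore option, with hypothetical names (this is
not iperf code): the reporter blocks in sem_wait() until the traffic thread
posts a report, which removes any need for yield()-based polling.

#include <semaphore.h>

static sem_t reports_available;

static void reporting_init(void)
{
    sem_init(&reports_available, 0, 0); /* shared between this process's
                                           threads, initial count 0 */
}

/* traffic thread: after queueing a report for the reporter */
static void report_posted(void)
{
    sem_post(&reports_available);
}

/* reporter thread: block until at least one report is queued */
static void report_wait(void)
{
    sem_wait(&reports_available);
}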

Ingo

