Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-11 Thread Roman Zippel
Hi,

On Tue, 11 Sep 2007, Mike Galbraith wrote:

> I still see the fairtest2 sleeper startup anomaly.  Sometimes it starts
> up normally, other times the sleeper is delayed.  Seems to require idle
> time to trigger the worst-case startup delay.
> 
> 14854 root  20   0  1568  468  384 R   52  0.0   0:07.50 1 fairtest2
> 14855 root  20   0  1568  112   28 R   45  0.0   0:00.46 1 fairtest2
> 
> Everything else still seems fine.  Boot-time warnings:
> 
> [  113.504259] audit(1189488395.753:2): audit_pid=5403 old=0 by 
> auid=4294967295
> [  114.077979] IA-32 Microcode Update Driver: v1.14a <[EMAIL PROTECTED]>
> [  116.281216] 4,f70355a0(5633,b,d1cef00): 
> 32af5bc18d4,31b4921,f70355a0(5633,b,d1cef00),2
> [  116.298004] WARNING: at kernel/sched_norm.c:271 verify_queue()
> [  116.312380]  [] show_trace_log_lvl+0x1a/0x30
> [  116.326193]  [] show_trace+0x12/0x14
> [  116.339270]  [] dump_stack+0x16/0x18
> [  116.352199]  [] verify_queue+0x2f8/0x3fb
> [  116.365406]  [] enqueue_entity+0x186/0x21b
> [  116.378571]  [] enqueue_task_fair+0x2f/0x31
> [  116.391545]  [] enqueue_task+0xd/0x18
> [  116.403833]  [] activate_task+0x20/0x2d
> [  116.416174]  [] __migrate_task+0x9a/0xc4
> [  116.428490]  [] migration_thread+0x175/0x220
> [  116.441159]  [] kthread+0x37/0x59
> [  116.452824]  [] kernel_thread_helper+0x7/0x14
> [  116.465582]  ===

Damn, I forgot that tasks which are reniced or migrate to another CPU 
need some more initialization, so the small incremental patch below does that.
Thanks again for testing.

bye, Roman

Index: linux-2.6/kernel/sched_norm.c
===
--- linux-2.6.orig/kernel/sched_norm.c  2007-09-11 13:15:00.0 +0200
+++ linux-2.6/kernel/sched_norm.c   2007-09-11 13:13:43.0 +0200
@@ -326,11 +326,14 @@ static void update_curr(struct cfs_rq *c
 }
 
 static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
 {
 	verify_queue(cfs_rq, cfs_rq->curr != se, se);
 	cfs_rq->time_avg_min = kclock_max(cfs_rq->time_avg_min, get_time_avg(cfs_rq));
-	se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+	if (likely(wakeup))
+		se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+	else
+		se->time_norm = cfs_rq->time_avg_min;
 
 	cfs_rq->nr_running++;
 	cfs_rq->weight_sum += 1 << se->weight_shift;
@@ -553,7 +556,7 @@ static void enqueue_task_fair(struct rq 
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se);
+		enqueue_entity(cfs_rq, se, wakeup);
 	}
 }
 
@@ -813,7 +816,7 @@ static void task_new_fair(struct rq *rq,
 	rq->curr->se.time_norm -= time;
 	se->time_norm = rq->curr->se.time_norm;
 
-	enqueue_entity(cfs_rq, se);
+	enqueue_entity(cfs_rq, se, 1);
 	p->se.on_rq = 1;
 
 	cfs_rq->next = se;


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-11 Thread Mike Galbraith
On Tue, 2007-09-11 at 01:23 +0200, Roman Zippel wrote:
> Hi,
> 
> On Sat, 8 Sep 2007, Mike Galbraith wrote:
> 
> > > On Sun, 2 Sep 2007, Ingo Molnar wrote:
> > > 
> > > Below is a patch updated against the latest git tree, no major changes.
> > 
> > Interesting, I see major behavioral changes.
> > 
> > I still see an aberration with fairtest2.  On startup, the hog component
> > will consume 100% cpu for a bit, then the sleeper shows up.  This
> > doesn't always happen, but happens quite often.
> 
> I found the problem for this. What basically happened is that a task that 
> hasn't been running for a second is enqueued first on an idle queue, and it 
> keeps that advantage over tasks that have been running more recently 
> until it has caught up. The new version now remembers where the last task 
> left off and uses that for the first task which restarts the queue. As a 
> side effect it also limits the bonus a task gets if multiple tasks are 
> woken at the same time.

I still see the fairtest2 sleeper startup anomaly.  Sometimes it starts
up normally, other times the sleeper is delayed.  Seems to require idle
time to trigger the worst-case startup delay.

14854 root  20   0  1568  468  384 R   52  0.0   0:07.50 1 fairtest2
14855 root  20   0  1568  112   28 R   45  0.0   0:00.46 1 fairtest2

Everything else still seems fine.  Boot-time warnings:

[  113.504259] audit(1189488395.753:2): audit_pid=5403 old=0 by auid=4294967295
[  114.077979] IA-32 Microcode Update Driver: v1.14a <[EMAIL PROTECTED]>
[  116.281216] 4,f70355a0(5633,b,d1cef00): 
32af5bc18d4,31b4921,f70355a0(5633,b,d1cef00),2
[  116.298004] WARNING: at kernel/sched_norm.c:271 verify_queue()
[  116.312380]  [] show_trace_log_lvl+0x1a/0x30
[  116.326193]  [] show_trace+0x12/0x14
[  116.339270]  [] dump_stack+0x16/0x18
[  116.352199]  [] verify_queue+0x2f8/0x3fb
[  116.365406]  [] enqueue_entity+0x186/0x21b
[  116.378571]  [] enqueue_task_fair+0x2f/0x31
[  116.391545]  [] enqueue_task+0xd/0x18
[  116.403833]  [] activate_task+0x20/0x2d
[  116.416174]  [] __migrate_task+0x9a/0xc4
[  116.428490]  [] migration_thread+0x175/0x220
[  116.441159]  [] kthread+0x37/0x59
[  116.452824]  [] kernel_thread_helper+0x7/0x14
[  116.465582]  ===
[  116.476497] 4,f70355a0(5633,b,d1cef00): 
32af5bc18d4,31b496c,f7e1a5a0(4179,3b0,232aaf800),4
[  116.492899] WARNING: at kernel/sched_norm.c:271 verify_queue()
[  116.506459]  [] show_trace_log_lvl+0x1a/0x30
[  116.519302]  [] show_trace+0x12/0x14
[  116.531313]  [] dump_stack+0x16/0x18
[  116.543334]  [] verify_queue+0x2f8/0x3fb
[  116.555710]  [] update_curr+0xe6/0x102
[  116.567654]  [] task_tick_fair+0x3f/0x1f2
[  116.579657]  [] scheduler_tick+0x18d/0x2d5
[  116.591808]  [] update_process_times+0x44/0x63
[  116.604400]  [] tick_sched_timer+0x5c/0xbd
[  116.616741]  [] hrtimer_interrupt+0x12c/0x1a0
[  116.629230]  [] smp_apic_timer_interrupt+0x57/0x88
[  116.642047]  [] apic_timer_interrupt+0x28/0x30
[  116.654692]  [] __wake_up+0x3a/0x42
[  116.666158]  [] sock_def_readable+0x6e/0x70
[  116.678143]  [] unix_stream_sendmsg+0x19d/0x32c
[  116.690520]  [] sock_aio_write+0x104/0x123
[  116.702479]  [] do_sync_readv_writev+0xb4/0xea
[  116.714907]  [] do_readv_writev+0xbb/0x1d4
[  116.727074]  [] vfs_writev+0x3f/0x51
[  116.738800]  [] sys_writev+0x3d/0x64
[  116.750154]  [] sysenter_past_esp+0x5f/0x85
[  116.761932]  ===
[  116.772092] 4,f70355a0(5633,b,d1cef00): 
32af5bc18d4,31b496c,f793bae0(4010,3b0,232aaf800),3
[  116.787811] WARNING: at kernel/sched_norm.c:271 verify_queue()
[  116.800930]  [] show_trace_log_lvl+0x1a/0x30
[  116.813437]  [] show_trace+0x12/0x14
[  116.825164]  [] dump_stack+0x16/0x18
[  116.836828]  [] verify_queue+0x2f8/0x3fb
[  116.848910]  [] dequeue_task_fair+0x4b/0x2d3
[  116.861390]  [] dequeue_task+0xd/0x18
[  116.873193]  [] deactivate_task+0x20/0x3a
[  116.885335]  [] balance_tasks+0x9f/0x140
[  116.897357]  [] load_balance_fair+0x60/0x7d
[  116.909627]  [] move_tasks+0x5b/0x77
[  116.921200]  [] load_balance+0xd3/0x2ab
[  116.932786]  [] run_rebalance_domains+0x89/0x319
[  116.944998]  [] __do_softirq+0x73/0xe0
[  116.956238]  [] do_softirq+0x6e/0xc1
[  116.967316]  ===
[  116.977554] 3,f70355a0(5633,b,d1cef00): 
32af5bc18d4,31b2ded,f793bae0(4010,3b0,232aaf800),1
[  116.993315] WARNING: at kernel/sched_norm.c:271 verify_queue()
[  117.006408]  [] show_trace_log_lvl+0x1a/0x30
[  117.018914]  [] show_trace+0x12/0x14
[  117.030667]  [] dump_stack+0x16/0x18
[  117.042332]  [] verify_queue+0x2f8/0x3fb
[  117.054335]  [] dequeue_task_fair+0x1a6/0x2d3
[  117.066781]  [] dequeue_task+0xd/0x18
[  117.078429]  [] deactivate_task+0x20/0x3a
[  117.090380]  [] balance_tasks+0x9f/0x140
[  117.102203]  [] load_balance_fair+0x60/0x7d
[  117.114283]  [] move_tasks+0x5b/0x77
[  117.125689]  [] load_balance+0xd3/0x2ab
[  117.137320]  [] run_rebalance_domains+0x89/0x319
[  117.149773]  [] __do_softirq+0x73/0xe0
[  117.161343]  


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-10 Thread Roman Zippel
Hi,

On Sat, 8 Sep 2007, Mike Galbraith wrote:

> > On Sun, 2 Sep 2007, Ingo Molnar wrote:
> > 
> > Below is a patch updated against the latest git tree, no major changes.
> 
> Interesting, I see major behavioral changes.
> 
> I still see an aberration with fairtest2.  On startup, the hog component
> will consume 100% cpu for a bit, then the sleeper shows up.  This
> doesn't always happen, but happens quite often.

I found the problem for this. What basically happened is that a task that 
hasn't been running for a second is enqueued first on an idle queue, and it 
keeps that advantage over tasks that have been running more recently 
until it has caught up. The new version now remembers where the last task 
left off and uses that for the first task which restarts the queue. As a 
side effect it also limits the bonus a task gets if multiple tasks are 
woken at the same time.

This version also limits runtime overruns, i.e. when a task exceeds its 
time slice because the timer interrupt was disabled and so the current 
task wasn't preempted; this usually happens during boot, but also when 
using a serial console.
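
To make the "remember where the last task left off" idea above a bit more 
concrete, here is a minimal user-space sketch of the enqueue clamping.  The 
struct and function names are invented for illustration; only the fields 
time_avg_min, time_norm and req_weight_inv are borrowed from the patch 
elsewhere in this thread, and locking, weighting and all other bookkeeping 
are omitted:

	#include <stdio.h>

	typedef long long kclock_t;

	struct toy_rq     { kclock_t time_avg_min; };              /* per-queue floor */
	struct toy_entity { kclock_t time_norm, req_weight_inv; };

	static kclock_t kclock_max(kclock_t a, kclock_t b) { return a > b ? a : b; }

	/* queue_avg is "where the last task left off" (get_time_avg() in the patch) */
	static void toy_enqueue(struct toy_rq *rq, struct toy_entity *se, kclock_t queue_avg)
	{
		/* the floor only moves forward, even across idle periods */
		rq->time_avg_min = kclock_max(rq->time_avg_min, queue_avg);
		/* a sleeper gets at most one request's worth of credit */
		se->time_norm = kclock_max(rq->time_avg_min - se->req_weight_inv, se->time_norm);
	}

	int main(void)
	{
		struct toy_rq rq = { .time_avg_min = 1000 };
		struct toy_entity sleeper = { .time_norm = 0, .req_weight_inv = 10 };

		toy_enqueue(&rq, &sleeper, 1000);
		printf("sleeper restarts the queue at %lld, not at 0\n", sleeper.time_norm);
		return 0;
	}

However long the task slept, it can start at most one request interval 
(req_weight_inv) behind the remembered queue position, which is what bounds 
the sleeper bonus described above.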

bye, Roman

---
 include/linux/sched.h |   26 -
 kernel/sched.c|  173 +
 kernel/sched_debug.c  |   26 -
 kernel/sched_norm.c   |  870 ++
 kernel/sysctl.c   |   30 -
 5 files changed, 977 insertions(+), 148 deletions(-)

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
  *
  * Current field usage histogram:
  *
- * 4 se->block_start
  * 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
  * 6 se->load.weight
- * 7 se->delta_fair
- *15 se->wait_runtime
  */
 struct sched_entity {
-	long			wait_runtime;
-	unsigned long		delta_fair_run;
-	unsigned long		delta_fair_sleep;
-	unsigned long		delta_exec;
-	s64			fair_key;
+	s64			time_key;
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
-	unsigned int		on_rq;
+	unsigned int		on_rq, queued;
+	unsigned int		weight_shift;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
-	u64			prev_sum_exec_runtime;
-	u64			wait_start_fair;
-	u64			sleep_start_fair;
+	s64			time_norm;
+	s64			req_weight_inv;
 
 #ifdef CONFIG_SCHEDSTATS
-   u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;
 
-   u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;
 
-   u64 block_start;
u64 block_max;
u64 exec_max;
 
@@ -1400,12 +1388,10 @@ static inline void idle_task_exit(void) 
 
 extern void sched_idle_next(void);
 
-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
-extern unsigned int sysctl_sched_runtime_limit;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 
Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -183,18 +183,24 @@ struct cfs_rq {
 
s64 fair_clock;
u64 exec_clock;
-   s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;
 
+   u64 prev_update;
+   s64 time_norm_base, time_norm_inc, time_avg_min;
+   u64 run_start, run_end;
+   u64 run_end_next, run_end_curr;
+   s64 time_sum_max, time_sum_off;
+   unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
-   struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
-   struct sched_entity *curr;
+   struct sched_entity *curr, *next;
+   struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */
 
/* leaf cfs_rqs are those that hold tasks (lowest 


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-08 Thread Mike Galbraith
On Sat, 2007-09-08 at 09:56 +0200, Mike Galbraith wrote:

> <repeats dozen times or so>

They weren't all repeats after all, the last few were...

[  120.267389] 2,f73035a0(5624): 1fa7e90b58c,1fb3b46,f73035a0(5624),5
[  120.281110] WARNING: at kernel/sched_norm.c:413 entity_tick()
[  120.294101]  [] show_trace_log_lvl+0x1a/0x30
[  120.306425]  [] show_trace+0x12/0x14
[  120.317857]  [] dump_stack+0x16/0x18
[  120.329124]  [] task_tick_fair+0x15d/0x170
[  120.340988]  [] scheduler_tick+0x18d/0x2d5
[  120.352896]  [] update_process_times+0x44/0x63
[  120.365177]  [] tick_sched_timer+0x5c/0xbd
[  120.377146]  [] hrtimer_interrupt+0x12c/0x1a0
[  120.389392]  [] smp_apic_timer_interrupt+0x57/0x88
[  120.402114]  [] apic_timer_interrupt+0x28/0x30
[  120.414491]  [] wake_up_new_task+0x72/0xad
[  120.426477]  [] do_fork+0x13c/0x205
[  120.437795]  [] sys_clone+0x33/0x39
[  120.449062]  [] sysenter_past_esp+0x5f/0x85
[  120.461047]  ===




Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-08 Thread Mike Galbraith
On Fri, 2007-09-07 at 17:35 +0200, Roman Zippel wrote:
> Hi,
> 
> On Sun, 2 Sep 2007, Ingo Molnar wrote:
> 
> Below is a patch updated against the latest git tree, no major changes.

Interesting, I see major behavioral changes.

I still see an aberration with fairtest2.  On startup, the hog component
will consume 100% cpu for a bit, then the sleeper shows up.  This
doesn't always happen, but happens quite often.  The time it takes for
the sleeper to show up varies, but is roughly as below.  With the
previous kernel, sleeper starvation was permanent once the hog ran for a
bit.  This behavior inverted itself, starvation now goes away after a
bit.

6573 root  20   0  1568  468  384 R   51  0.0   0:08.30 1 fairtest2
6574 root  20   0  1568  112   28 R   47  0.0   0:01.07 1 fairtest2

Once it coughs up its fairtest2 startup-furball, it does well on mixed
load (various interval sleepers + hog) fairness.  The previous kernel
did not do at all well at mixed load, as soon as I added a hog, it was
all over for sleepers, and even without a hog, fairness among the
various interval sleepers was not at all good.

On the interactivity front, your first cut was not really usable here,
but this one seems fine <not heavily tested caveat>.  One test I do here
is Amarok song switch time under hefty kbuild load, and this kernel
passes that test just fine.  I didn't try this test with your first cut,
because it was very ragged even under modest load.  All in all, this one
is behaving quite well, which is radically different than my experience
with first cut.

Debug messages triggered on boot, but haven't triggered since.

[  113.426575] audit(1189232695.670:2): audit_backlog_limit=256 old=64 by 
auid=4294967295 res=1
[  113.953958] IA-32 Microcode Update Driver: v1.14a <[EMAIL PROTECTED]>
[  115.851209] audit(1189232698.095:3): audit_pid=5597 old=0 by auid=4294967295
[  116.707631] 2,f73035a0(5624): 1fa78027f5c,1fb37ff,f7d035a0(5626),1
[  116.723091] WARNING: at kernel/sched_norm.c:243 verify_queue()
[  116.737979]  [] show_trace_log_lvl+0x1a/0x30
[  116.752191]  [] show_trace+0x12/0x14
[  116.765544]  [] dump_stack+0x16/0x18
[  116.778579]  [] verify_queue+0x158/0x388
[  116.791716]  [] enqueue_entity+0x22/0x19a
[  116.804880]  [] task_new_fair+0xa9/0xd2
[  116.817724]  [] wake_up_new_task+0x9e/0xad
[  116.830723]  [] do_fork+0x13c/0x205
[  116.843063]  [] sys_clone+0x33/0x39
[  116.855353]  [] sysenter_past_esp+0x5f/0x85
[  116.868344]  ===
[  116.879674] 2,f7303ae0(5618): 1fbf7fca89d,1fb37ff,f7d035a0(5626),1
[  116.894100] WARNING: at kernel/sched_norm.c:250 verify_queue()
[  116.907863]  [] show_trace_log_lvl+0x1a/0x30
[  116.920912]  [] show_trace+0x12/0x14
[  116.933141]  [] dump_stack+0x16/0x18
[  116.945361]  [] verify_queue+0x288/0x388
[  116.957945]  [] enqueue_entity+0x22/0x19a
[  116.970363]  [] task_new_fair+0xa9/0xd2
[  116.982308]  [] wake_up_new_task+0x9e/0xad
[  116.994597]  [] do_fork+0x13c/0x205
[  117.006331]  [] sys_clone+0x33/0x39
[  117.017848]  [] sysenter_past_esp+0x5f/0x85
[  117.029869]  ===






Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-07 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

Below is a patch updated against the latest git tree, no major changes. 
For a split version I'm still waiting for some more explanation about the 
CFS tuning parameter.
I added another check to the debug version so that any imbalances (as 
e.g. Mike has seen) should be reported. I left some of the proc parameters 
visible even in the non-debug version, so one can play with them without 
getting the debug overhead.

> And if you look at the resulting code size/complexity, it actually 
> increases with Roman's patch (UP, nodebug, x86):
> 
>     text    data     bss     dec     hex filename
>    13420     228    1204   14852    3a04 sched.o.rc5
>    13554     228    1228   15010    3aa2 sched.o.rc5-roman

You'll be happy to know, that with the patch even the SMP code size 
decreases.
BTW if I made the impression that overall code size must decrease, than I 
apologize, overall code size is of course only an indication, IMO more 
important is where that code is, i.e. that regularly executed code is as 
compact as possible.
64bit math simply adds a level of bloatness, which is hard to avoid, but 
with the fixed math it's actually possible to further reduce the code size 
by getting rid of the 64bit math, so that anyone prefering a really 
compact scheduler can have that.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
> > my tree, and RFS was faster to some degree (it varied)..
> 
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive 
> runs fresh after bootup, Core2Duo, UP:
> 
>             -rc5(cfs)       -rc5+rfs
>    ------------------------------------
>    Time: 3.905     Time: 4.259
>    Time: 3.962     Time: 4.190
>    Time: 3.981     Time: 4.241
>    Time: 3.986     Time: 3.937
>    Time: 3.984     Time: 4.120
>    Time: 4.001     Time: 4.013
>    Time: 3.980     Time: 4.248
>    Time: 3.983     Time: 3.961
>    Time: 3.989     Time: 4.345
>    Time: 3.981     Time: 4.294
>    ------------------------------------
>     Avg: 3.975      Avg: 4.160 (+4.6%)
>   Fluct: 0.138    Fluct: 1.671
> 
> so unmodified CFS is 4.6% faster on this box than with Roman's patch and 
> it's also more consistent/stable (10 times lower fluctuations).

hackbench apparently works best if one doesn't wake up the consumer too 
early, so in the patch I don't preempt a just-woken process, which 
produces far better numbers.

bye, Roman


---
 include/linux/sched.h |   26 -
 kernel/sched.c|  173 +-
 kernel/sched_debug.c  |   26 -
 kernel/sched_norm.c   |  841 ++
 kernel/sysctl.c   |   30 -
 5 files changed, 948 insertions(+), 148 deletions(-)

Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
  *
  * Current field usage histogram:
  *
- * 4 se->block_start
  * 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
  * 6 se->load.weight
- * 7 se->delta_fair
- *15 se->wait_runtime
  */
 struct sched_entity {
-	long			wait_runtime;
-	unsigned long		delta_fair_run;
-	unsigned long		delta_fair_sleep;
-	unsigned long		delta_exec;
-	s64			fair_key;
+	s64			time_key;
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
-	unsigned int		on_rq;
+	unsigned int		on_rq, queued;
+	unsigned int		weight_shift;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
-	u64			prev_sum_exec_runtime;
-	u64			wait_start_fair;
-	u64			sleep_start_fair;
+	s64			time_norm;
+	s64			req_weight_inv;
 
 #ifdef CONFIG_SCHEDSTATS
-   u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;
 
-   u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;
 
-   u64 block_start;
u64 block_max;
u64 exec_max;
 
@@ -1400,12 +1388,10 @@ static inline void idle_task_exit(void) 
 
 extern void sched_idle_next(void);
 
-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int 


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-05 Thread Syren Baran
On Monday, 2007-09-03 at 04:58 +0200, Roman Zippel wrote:
> Hi,
> 
> On Sun, 2 Sep 2007, Ingo Molnar wrote:
> 
> > > Did you even try to understand what I wrote? I didn't say that it's a 
> > > "common problem", it's a conceptual problem. The rounding has been 
> > > improved lately, so it's not as easy to trigger with some simple busy 
> > > loops.
> > 
> > As i mentioned it in my first reply to you i really welcome your changes 
> > and your interest in the Linux scheduler, but i'm still somewhat 
> > surprised why you focus on rounding so much (and why you attack CFS's 
> > math implementation so vehemently and IMO so unfairly) - and we had this 
> > discussion before in the "CFS review" thread that you started.
> 
> The rounding is insofar a problem as it makes it very difficult to produce 
> a correct mathematical model of CFS, which can be used to verify the 
> working of code.
> With the recent rounding changes they have indeed little effect in the 
> short term, especially as the error is distributed equally to the task, so 
> every task gets relatively the same time, the error still exists 
> nonetheless and adds up over time (although with the rounding changes it 
> doesn't grow into one direction anymore).
So, it's like I roll a die repeatedly and subtract 3.5 every time? Even
in the long run the result should be close to zero.
>  The problem is now the limiting, 
> which you can't remove as long as the error exists. Part of this problem
> is that CFS doesn't really maintain a balanced sum of the available 
> runtime, i.e. the sum of all updated wait_runtime values of all active 
> tasks should be zero. This means under complex loads it's possible to hit 
> the wait_runtime limits and the overflow/underflow time is lost in the 
> calculation and this value can be easily in the millisecond range.
> To be honest I have no idea how real this problem in the praxis is right 
> now, quite possibly people will only see it as a small glitch and in most 
> cases they won't even notice.
So, given the above example, if the result of the dice rolls ever
exceeds +/- 1,000,000 I'll get some ugly timing glitches? As the number
of dice rolls grows towards infinity the chance of remaining within this
boundary goes steadily towards 0%.
What does this equate to in the real world? Weeks, months, years,
millennia?

>  Previously it had been rather easy to 
> trigger these problems due to the rounding problems. The main problem for 
> me here is now that with any substantial change in CFS I'm running risk of 
> making this worse and noticable again somehow. For example CFS relies on 
> the 64bit math to have enough resolution so that the errors aren't 
> noticable.
> That's why I needed a mathematical model I can work with and with it for 
> example I can easily downscale the math to 32bit and I know exactly the 
> limits within it will work correctly.

If I understand the problem correctly these errors occur due to the fact
that delta-values are used instead of recalculating the "fair" process
time every "dice" roll?
Somehow the dead simple solution would seem to be to "reset" or "reinitialise"
the "dice" roll every million rolls or so.
Or, to put this into context again, figure out how large the accumulated
error would have to be to be noticeable. Then figure out how long it
would take for such an error to occur in the real world and just
reinitialise the damn thing. Shouldn't be too complicated stochastics.
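
For what it's worth, the dice analogy is easy to play with directly.  The 
snippet below is purely illustrative (it has nothing to do with the scheduler 
code itself): it rolls a fair die, subtracts the mean, and counts how many 
rolls it takes for the running sum to leave an arbitrary +/-1000 band.  For a 
walk like this the expected exit time scales as bound^2 divided by the 
per-roll variance, so with a bound of 1000 it typically takes a few hundred 
thousand rolls.  How long that is in wall-clock time depends entirely on how 
big one "roll" and one "error unit" are in the scheduler, which is exactly 
the open question here.

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	int main(void)
	{
		double sum = 0.0;
		const double bound = 1000.0;            /* arbitrary limit, in error units */
		unsigned long rolls = 0;

		srand((unsigned)time(NULL));
		while (sum > -bound && sum < bound) {
			sum += (rand() % 6 + 1) - 3.5;  /* one die roll minus its mean */
			rolls++;
		}
		printf("left the +/-%.0f band after %lu rolls\n", bound, rolls);
		return 0;
	}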

Greetings,
Syren



Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-03 Thread Daniel Walker
On Mon, 2007-09-03 at 20:20 +0200, Roman Zippel wrote:
> Basically that's it and I hope that explains the basic math a bit easier. :-)
> 

It helps a tiny bit .. However, I appreciate that you took the time to
write this .. Thank you.

Daniel



Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-03 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Daniel Walker wrote:

> For instance if there are three tasks in the system. Call them A,B, and
> C.
> 
> then 
> 
> time equals "time of A" + "time of B" + "time of C"

Ok, let's take a simple example. :)

If we have three tasks A, B, C, each with a weight of 1, 2 and 3, the 
weight sum is 6. If we let these three tasks run over a period of 12s, 
each task gets time/weight_sum*weight_{t} seconds, that is the tasks get 
2, 4 and 6 seconds of runtime respectively.
The basic idea of CFS is now that, using this formula, the time is
distributed to the active tasks and accounted against the runtime they
get. Let's assume a time slice of 1s, where each task gets
1s/weight_sum*weight_{t} of eligible runtime every second, so in the 
following table the task columns contain the values (eligible runtime 
- obtained runtime):

time  current  A        B        C        fair_clock
 1    C        1/6-0    2/6-0    3/6-1    1/6
 2    C        2/6-0    4/6-0    6/6-2    2/6
 3    C        3/6-0    6/6-0    9/6-3    3/6
 4    B        4/6-0    8/6-1    12/6-3   4/6
 5    B        5/6-0    10/6-2   15/6-3   5/6
 6    A        6/6-1    12/6-2   18/6-3   6/6
 7    C        7/6-1    14/6-2   21/6-4   7/6
 8    C        8/6-1    16/6-2   24/6-5   8/6
 9    C        9/6-1    18/6-2   27/6-6   9/6
10    B        10/6-1   20/6-3   30/6-6   10/6
11    B        11/6-1   22/6-4   33/6-6   11/6
12    A        12/6-2   24/6-4   36/6-6   12/6

Practically the eligible runtime part isn't updated constantly, but a
delta against the last update is used. If everything is working
correctly, the difference between the eligible and obtained runtime is
zero in the end.
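
To make that bookkeeping concrete, here is a small user-space sketch (purely 
illustrative, not scheduler code) that replays the run order from the first 
table and prints the per-task lags in units of 1/6 s.  It also checks the 
balance property that comes up again elsewhere in this thread: the lags 
(eligible minus obtained runtime) of all active tasks always sum to zero.

	#include <stdio.h>

	int main(void)
	{
		const int weight[3] = { 1, 2, 3 };                      /* tasks A, B, C */
		const char run_order[12] = { 'C','C','C','B','B','A',
		                             'C','C','C','B','B','A' }; /* from the table */
		long eligible[3] = { 0 }, obtained[3] = { 0 };          /* in sixths of a second */

		for (int t = 0; t < 12; t++) {
			long lag_sum = 0;

			obtained[run_order[t] - 'A'] += 6;              /* current task ran 1s */
			for (int i = 0; i < 3; i++) {
				eligible[i] += weight[i];               /* weight_t/weight_sum per second */
				lag_sum += eligible[i] - obtained[i];
			}
			printf("t=%2d  A=%3ld/6  B=%3ld/6  C=%3ld/6  sum=%ld\n", t + 1,
			       eligible[0] - obtained[0],
			       eligible[1] - obtained[1],
			       eligible[2] - obtained[2], lag_sum);
		}
		return 0;
	}

Each printed lag is just the difference of the two numbers in the 
corresponding table cell, and the sum is zero in every row.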

Peter's patches now make an interesting step, basically they change the
position of the weight_{t} variable, so every task now gets
1s/weight_sum per second and weight_{t} is now used to scale the
obtained runtime instead:

time  current  A          B          C          fair_clock
 1    C        1/6-0/1    1/6-0/2    1/6-1/3    1/6
 2    C        2/6-0/1    2/6-0/2    2/6-2/3    2/6
 3    C        3/6-0/1    3/6-0/2    3/6-3/3    3/6
 4    B        4/6-0/1    4/6-1/2    4/6-3/3    4/6
 5    B        5/6-0/1    5/6-2/2    5/6-3/3    5/6
 6    A        6/6-1/1    6/6-2/2    6/6-3/3    6/6
 7    C        7/6-1/1    7/6-2/2    7/6-4/3    7/6
 8    C        8/6-1/1    8/6-2/2    8/6-5/3    8/6
 9    C        9/6-1/1    9/6-2/2    9/6-6/3    9/6
10    B        10/6-1/1   10/6-3/2   10/6-6/3   10/6
11    B        11/6-1/1   11/6-4/2   11/6-6/3   11/6
12    A        12/6-2/1   12/6-4/2   12/6-6/3   12/6

As you can see, here the fair_clock part is the same for every task in 
every row, so it can be removed; this is basically the step I do in 
my patch:

time  current  A     B     C     fair_clock
 1    C        0/1   0/2   1/3   1/6
 2    C        0/1   0/2   2/3   2/6
 3    C        0/1   0/2   3/3   3/6
 4    B        0/1   1/2   3/3   4/6
 5    B        0/1   2/2   3/3   5/6
 6    A        1/1   2/2   3/3   6/6
 7    C        1/1   2/2   4/3   7/6
 8    C        1/1   2/2   5/3   8/6
 9    C        1/1   2/2   6/3   9/6
10    B        1/1   3/2   6/3   10/6
11    B        1/1   4/2   6/3   11/6
12    A        2/1   4/2   6/3   12/6

This means fair_clock isn't used in the runtime calculation anymore and
time isn't relative to fair_clock, but is an absolute value. It looks
relatively simple, but it means all calculations involving fair_clock
deltas and wait_runtime values have to be redone in this context, and
that's the part I need some help with - some of the tuning
features aren't converted yet.
The importance of this step is that it removes a major error source for
the accuracy of the scheduler. All these fractions have to be rounded to
integer values at some point, and the main problem here is that the
denominator in the fair_clock calculation is variable - it changes with
the number of active tasks and it requires quite some resolution, so that
the error doesn't become noticeable.
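
Written out as formulas (my notation, not something from the original mail), 
the equivalence behind the last two tables is:

	lag_i(t) \;=\; \mathrm{fair\_clock}(t) \;-\; \frac{\mathrm{obtained}_i(t)}{w_i},
	\qquad
	\mathrm{fair\_clock}(t) \;=\; \frac{t}{\sum_j w_j}

Since fair_clock(t) is identical for every runnable task at a given instant, 
picking the task with the largest lag_i is the same as picking the one with 
the smallest obtained_i/w_i, so the per-task key can be an absolute value and 
the shared fair_clock - whose denominator changes whenever tasks come and go - 
drops out of the comparison entirely.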

fair_clock isn't used here anymore, but it's still used to initialize the 
time value of new tasks. For 


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> > Did you even try to understand what I wrote? I didn't say that it's a 
> > "common problem", it's a conceptual problem. The rounding has been 
> > improved lately, so it's not as easy to trigger with some simple busy 
> > loops.
> 
> As i mentioned it in my first reply to you i really welcome your changes 
> and your interest in the Linux scheduler, but i'm still somewhat 
> surprised why you focus on rounding so much (and why you attack CFS's 
> math implementation so vehemently and IMO so unfairly) - and we had this 
> discussion before in the "CFS review" thread that you started.

The rounding is insofar a problem as it makes it very difficult to produce 
a correct mathematical model of CFS, which can be used to verify the 
workings of the code.
With the recent rounding changes the errors indeed have little effect in the 
short term, especially as the error is distributed equally to the tasks, so 
every task gets relatively the same time; the error still exists 
nonetheless and adds up over time (although with the rounding changes it 
doesn't grow in one direction anymore). The problem is now the limiting, 
which you can't remove as long as the error exists. Part of this problem
is that CFS doesn't really maintain a balanced sum of the available 
runtime, i.e. the sum of all updated wait_runtime values of all active 
tasks should be zero. This means under complex loads it's possible to hit 
the wait_runtime limits, and the overflow/underflow time is lost in the 
calculation - this value can easily be in the millisecond range.
To be honest I have no idea how real this problem is in practice right 
now, quite possibly people will only see it as a small glitch and in most 
cases they won't even notice. Previously it had been rather easy to 
trigger these problems due to the rounding problems. The main problem for 
me here is now that with any substantial change in CFS I'm running the risk 
of making this worse and noticeable again somehow. For example CFS relies on 
the 64bit math to have enough resolution so that the errors aren't 
noticeable.
That's why I needed a mathematical model I can work with, and with it for 
example I can easily downscale the math to 32bit and I know exactly the 
limits within which it will work correctly.
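
As a toy illustration of the kind of imbalance meant here (this is not CFS's 
actual arithmetic - CFS works at much higher resolution and its rounding has 
been improved, as said above): distribute a slice proportionally to weights 
with integer division, and the undistributed remainder quietly accumulates 
unless it is tracked somewhere.

	#include <stdio.h>

	int main(void)
	{
		const long weight[3] = { 1, 2, 3 }, weight_sum = 6, slice = 1000;
		long handed_out[3] = { 0, 0, 0 }, total = 0;

		for (long n = 0; n < 1000000; n++) {
			for (int i = 0; i < 3; i++)
				handed_out[i] += slice * weight[i] / weight_sum;  /* rounds down */
			total += slice;
		}

		long sum = handed_out[0] + handed_out[1] + handed_out[2];
		printf("distributed %ld of %ld units, %ld lost to rounding\n",
		       sum, total, total - sum);
		return 0;
	}

In this naive version the loss only ever goes one way; better rounding centers 
it, but the per-step error is still there, and the sum of the per-task accounts 
only stays balanced if the remainder is carried somewhere or the values are 
clamped - and clamping is where time can get lost, which is the point above.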

> Your math is fairly simple (and that is _good_, just like CFS's existing 
> math is simple),

I really wouldn't call CFS's existing math simple, the basic ideas are 
indeed simple, but the math of the actual implementation is quite a 
bit more complex.

> it can be summed up in essence as (without complicating 
> it with nice-level weighting, for easy understandability):
> 
> " use the already existing p->sum_exec_runtime 'task runtime' metric 
>   that CFS maintains, and use that as the key into the rb-tree that 
>   selects tasks that should be run next. To handle sleeping tasks: keep 
>   a per-rq sum of all runnable task's ->sum_exec_runtime values and 
>   start newly woken tasks at the average rq->sum/nr_running value. "
> 
> Now your patch does not actually do it that way in a clearly discernible 
> manner because lots of changes are intermixed into one big patch. 

Well, it's part of the math, but I wouldn't call it the "essence" of the 
_math_, as it leaves many questions unanswered; the essence of the math 
mainly involves how the problems created by the weighting are solved.

If you want to describe the basic logical difference, you can put it a 
little more simply: CFS maintains the runtime of a task relative to a 
global clock, while in my model every task has its own clock and an 
approximated average is used to initialize the clock of new tasks.
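
A minimal sketch of that difference, for illustration only - kclock_t, 
time_norm, wait_runtime and fair_clock are names taken from the two 
patches, but the structures and the seeding helper below are invented 
here and heavily simplified:

  typedef long long kclock_t;

  /* CFS-style bookkeeping: one fair clock advances for the whole queue,
   * and each task is keyed by its credit/debt relative to that clock. */
  struct cfs_like_rq {
          kclock_t fair_clock;       /* shared, advances as the queue runs */
  };
  struct cfs_like_se {
          kclock_t wait_runtime;     /* per task: lag behind fair_clock */
  };

  /* RFS-style bookkeeping: every task carries its own normalized clock;
   * the queue only needs an approximated average of those clocks in
   * order to initialize newly enqueued tasks. */
  struct rfs_like_rq {
          kclock_t        time_sum;  /* sum of the per-task clocks */
          unsigned int    nr_running;
  };
  struct rfs_like_se {
          kclock_t time_norm;        /* per task: its own clock */
  };

  static kclock_t rfs_like_seed(const struct rfs_like_rq *rq)
  {
          /* approximated average used to initialize a new task's clock */
          return rq->nr_running ? rq->time_sum / rq->nr_running : 0;
  }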

> ( please correct me if i got your math wrong. Your patch does not add 
>   any comments at all to the new code and this slowed down my review
>   and analysis of your patch quite considerably. Lack of comments makes
>   it harder to see the purpose and makes it harder to notice the
>   benefits/tradeoffs involved in each change. )

I hope you did notice that I explained in the announcement in quite some 
detail what the code does.

> To sum up: the math in your patch _could_ be implemented as a much 
> smaller add-on to the already existing variables maintained by CFS, but 
> you chose to do lots of changes, variable-renames and removals at once 
> and posted them as one big change.

I didn't rename that much and there are only a few variables that could be 
reused; for the most part I indeed removed quite a bit, and the rest really 
has a different meaning.

>  _Please_, separate things 
> out so that they can be reviewed one by one.

That's not really the problem, but _please_ someone explain to me the 
features you want to see preserved. I don't want to be left guessing what 
they're intended to do and blindly implement something which does 
something similar.
Thanks.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Roman Zippel <[EMAIL PROTECTED]> wrote:

> > > > so unmodified CFS is 4.6% faster on this box than with Roman's 
> > > > patch and it's also more consistent/stable (10 times lower 
> > > > fluctuations).
> > > 
> > > Was SCHED_DEBUG enabled or disabled for these runs?
> > 
> > debugging disabled of course. (your patch has a self-validity 
> > checking function [verify_queue()] that is called on SCHED_DEBUG=y, 
> > it would have been unfair to test your patch with that included.)
> 
> I'll look into it next week.

thanks. FYI, there's plenty of time - i'll be at the KS next week so 
i'll be quite unresponsive to emails. Would be nice if you could take a 
quick look at the trivial patch i posted today though. How close is it 
to your algorithm, have i missed any important details? (not counting 
nice levels and rounding, it's just a quick & dirty prototype to show 
the first layer of the core math and nothing more.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> so i thought you must be aware of the problem - at least considering how 
> much you've criticised CFS's "complexity" both in your initial review of 
> CFS (which included object size comparisons) and in this patch 
> submission of yours (which did not include object size comparisons
> though).

Ingo, as long as you freely mix algorithmic and code complexity we won't 
get very far. I did code size comparisons for your _stable_ code, which 
was merged into Linus' tree. I explicitly said that my patch is only a 
prototype, so I haven't done any cleanups and tuning in this direction 
yet, so making any conclusions based on code size comparisons is quite 
ridiculous at this point.
The whole point of this patch was to demonstrate the algorithmic changes, 
not to demonstrate a final and perfectly tuned scheduler.

> > > so unmodified CFS is 4.6% faster on this box than with Roman's patch 
> > > and it's also more consistent/stable (10 times lower fluctuations).
> > 
> > Was SCHED_DEBUG enabled or disabled for these runs?
> 
> debugging disabled of course. (your patch has a self-validity checking 
> function [verify_queue()] that is called on SCHED_DEBUG=y, it would have 
> been unfair to test your patch with that included.)

I'll look into it next week.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sat, 1 Sep 2007, Bill Davidsen wrote:

> > I'd expect Ingo to know better, but the more he refuses to answer my
> > questions, the more I doubt it, at least when it comes to the math part.
> > 
> The "math part" is important if you're doing a thesis defense, but
> demonstrating better behavior under some realistic load would probably be a
> better starting point.

That wasn't my design goal. The initial trigger for this was all the 64bit 
math in CFS and the question of how it can be tuned using arch-specific 
knowledge about the input values. For this I needed a correct and 
verifiable model to analyze, so I know where the limits are and how it 
reacts to different sets of inputs. This is where I got into trouble with 
all the rounding - I couldn't substantially change the math without 
convincing myself that it's still correct for all kinds of input data. 
That's why I completely redid the math from scratch - it's based on the 
same basic ideas, but the solution and thus the math is quite different.

The main reason I didn't concentrate on the behaviour so much is that, 
since both schedulers are conceptually not that far apart, it should be 
possible to apply any tuning done to CFS to my model as well. But this 
requires that someone actually understands what tuning was done and that 
it wasn't done by making random changes and seeing what happens, i.e. 
someone should be able to explain it to me.
BTW in this context it's rather interesting to see that Ingo attacks me 
now for not having done this tuning yet, which is exactly what I 
explicitly asked for help with...

> Maybe Ingo missed something in your math, and maybe he
> just didn't find a benefit.

Maybe he didn't understand it at all and is too proud to admit it?
(At least that's the only explanation which makes sense to me.)

> You dropped this on the world two days before a major US holiday, at the end
> of the summer when those of us not on vacation may be covering for those who
> are, did you expect Ingo to drop his regular work to look at your stuff?

Did I demand anything like that? It would have been fine if he had asked 
for more time or had more questions first, but it took him only a few 
hours to come to his conclusion without any need for further questions.

> And
> do you think there are many people who can look at your math, and look at your
> code, and then have any clue how well it works in practice? I bet there aren't
> ten people in the world who would even claim to do that, and half of them are
> kidding themselves.

I don't think the math is that complex, though it may need some more 
explanation outside the math formulas. But the math is needed to analyze 
what effect changes to the scheduler have; otherwise there is a risk that 
it becomes mere guesswork, with random changes which may help in some 
situations but break others.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Roman Zippel <[EMAIL PROTECTED]> wrote:

> > And if you look at the resulting code size/complexity, it actually 
> > increases with Roman's patch (UP, nodebug, x86):
> > 
> >    text    data     bss     dec     hex filename
> >   13420     228    1204   14852    3a04 sched.o.rc5
> >   13554     228    1228   15010    3aa2 sched.o.rc5-roman
> 
> That's pretty easy to explain due to differences in inlining:
> 
>    text    data     bss     dec     hex filename
>   15092     228    1204   16524    408c kernel/sched.o
>   15444     224    1228   16896    4200 kernel/sched.o.rfs
>   14708     224    1228   16160    3f20 kernel/sched.o.rfs.noinline

no, when generating those numbers i used:

  CONFIG_CC_OPTIMIZE_FOR_SIZE=y
  # CONFIG_FORCED_INLINING is not set

(but i also re-did it for all the other combinations of these build 
flags and similar results can be seen - your patch, despite removing 
lots of source code, produces a larger sched.o.)

> Sorry, but I didn't spend as much time as you on tuning these numbers.

some changes did slip into your patch that have no other purpose but to 
reduce code size:

  +#ifdef CONFIG_SMP
  unsigned long cpu_load[CPU_LOAD_IDX_MAX];
  +#endif
  [...]
  +#ifdef CONFIG_SMP
   /* Used instead of source_load when we know the type == 0 */
   unsigned long weighted_cpuload(const int cpu)
   {
  return cpu_rq(cpu)->ls.load.weight;
   }
  +#endif
  [...]

so i thought you must be aware of the problem - at least considering how 
much you've criticised CFS's "complexity" both in your initial review of 
CFS (which included object size comparisons) and in this patch 
submission of yours (which did not include object size comparisons
though).

> > so unmodified CFS is 4.6% faster on this box than with Roman's patch 
> > and it's also more consistent/stable (10 times lower fluctuations).
> 
> Was SCHED_DEBUG enabled or disabled for these runs?

debugging disabled of course. (your patch has a self-validity checking 
function [verify_queue()] that is called on SCHED_DEBUG=y, it would have 
been unfair to test your patch with that included.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> And if you look at the resulting code size/complexity, it actually 
> increases with Roman's patch (UP, nodebug, x86):
> 
>    text    data     bss     dec     hex filename
>   13420     228    1204   14852    3a04 sched.o.rc5
>   13554     228    1228   15010    3aa2 sched.o.rc5-roman

That's pretty easy to explain due to differences in inlining:

   text    data     bss     dec     hex filename
  15092     228    1204   16524    408c kernel/sched.o
  15444     224    1228   16896    4200 kernel/sched.o.rfs
  14708     224    1228   16160    3f20 kernel/sched.o.rfs.noinline

Sorry, but I didn't spend as much time as you on tuning these numbers.

Index: linux-2.6/kernel/sched_norm.c
===
--- linux-2.6.orig/kernel/sched_norm.c  2007-09-02 16:58:05.0 +0200
+++ linux-2.6/kernel/sched_norm.c   2007-09-02 16:10:58.0 +0200
@@ -145,7 +145,7 @@ static inline struct task_struct *task_o
 /*
  * Enqueue an entity into the rb-tree:
  */
-static inline void
+static void
 __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
@@ -192,7 +192,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq, 
se->queued = 1;
 }
 
-static inline void
+static void
 __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
if (cfs_rq->rb_leftmost == se) {
@@ -240,7 +240,7 @@ static void verify_queue(struct cfs_rq *
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
  */
-static inline void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr(struct cfs_rq *cfs_rq)
 {
struct sched_entity *curr = cfs_rq->curr;
kclock_t now = rq_of(cfs_rq)->clock;

> Although it _should_ have been a net code size win, because if you look 
> at the diff you'll see that other useful things were removed as well: 
> sleeper fairness, CPU time distribution smarts, tunings, scheduler 
> instrumentation code, etc.

Well, these are things I'd like you to explain a little; for example I 
repeatedly asked you about the sleeper fairness and I got no answer.
BTW you seem to have missed that I actually give a bonus to sleepers 
as well.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
> > my tree, and RFS was faster to some degree (it varied)..
> 
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive 
> runs fresh after bootup, Core2Duo, UP:
> 
>        -rc5(cfs)       -rc5+rfs
>   ---------------------------------
>   Time: 3.905     Time: 4.259
>   Time: 3.962     Time: 4.190
>   Time: 3.981     Time: 4.241
>   Time: 3.986     Time: 3.937
>   Time: 3.984     Time: 4.120
>   Time: 4.001     Time: 4.013
>   Time: 3.980     Time: 4.248
>   Time: 3.983     Time: 3.961
>   Time: 3.989     Time: 4.345
>   Time: 3.981     Time: 4.294
>   ---------------------------------
>    Avg: 3.975      Avg: 4.160  (+4.6%)
>  Fluct: 0.138    Fluct: 1.671
> 
> so unmodified CFS is 4.6% faster on this box than with Roman's patch and 
> it's also more consistent/stable (10 times lower fluctuations).

Was SCHED_DEBUG enabled or disabled for these runs?

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Daniel Walker
On Sun, 2007-09-02 at 16:47 +0200, Roman Zippel wrote:

> > > (1)   time = sum_{t}^{T}(time_{t})
> > > (2)   weight_sum = sum_{t}^{T}(weight_{t})
> > 
> > I read your description, but I was distracted by this latex style
> > notation .. Could you walk through in english what these two equation
> > are doing ..
> 
> T is the number of tasks, so the first is the sum of the time every task 
> gets over a period of time and the second is the sum of all weights of 
> each task.

I think I'm more puzzled now than I was before, (scratches head for
minutes) ..

I see .. It's a summation symbol .. 

For instance, if there are three tasks in the system, call them A, B, and
C.

then 

time equals "time of A" + "time of B" + "time of C"

Right? 

Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sat, 1 Sep 2007, Daniel Walker wrote:

> Out of curiosity I was reviewing your patch .. Since you create
> kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see
> what had changed .. So I re-diffed your patch to eliminate
> kernel/sched_norm.c and just make the changes to kernel/sched_fair.c .. 

I mainly did it this way to avoid annoying conflicts. I also started with 
a copy where I threw out everything but a basic skeleton and then just 
copied the stuff I needed.

> > (1) time = sum_{t}^{T}(time_{t})
> > (2) weight_sum = sum_{t}^{T}(weight_{t})
> 
> I read your description, but I was distracted by this latex style
> notation .. Could you walk through in english what these two equation
> are doing ..

T is the number of tasks, so the first is the sum of the time every task 
gets over a period of time and the second is the sum of all weights of 
each task.
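
In conventional notation, equations (1) and (2) above are simply the two 
sums over the T runnable tasks:

  \[
  \mathrm{time} = \sum_{t=1}^{T} \mathrm{time}_t ,
  \qquad
  \mathrm{weight\_sum} = \sum_{t=1}^{T} \mathrm{weight}_t
  \]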

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Satyam Sharma <[EMAIL PROTECTED]> wrote:

> On Sun, 2 Sep 2007, Ingo Molnar wrote:
> > 
> > Although it _should_ have been a net code size win, because if you 
> > look at the diff you'll see that other useful things were removed as 
> > well: sleeper fairness, CPU time distribution smarts, tunings, 
> > scheduler instrumentation code, etc.
> 
> To be fair to Roman, he probably started development off an earlier 
> CFS, most probably 2.6.23-rc3-git1, if I guess correctly from his 
> original posting. So it's likely he missed out on some of the 
> tunings/comments(?) etc code that got merged after that.

actually, here are the rc3->rc5 changes to CFS:

 sched.c       |   89 
 sched_debug.c |    3 -
 sched_fair.c  |  142 +-
 sched_rt.c    |   11 +++-
 4 files changed, 182 insertions(+), 63 deletions(-)

so since -rc3 CFS's size _increased_ (a bit).

and i just checked, the sched.o codesize still increases even when 
comparing rc4 against rc4+patch (his original patch) and there are no 
comments added by Roman's patch at all. (all the comments in 
sched_norm.c were inherited from sched_fair.c and none of the new code 
comes with comments - this can be seen in Daniel's rediffed patch.)

(and it's still not apples to oranges, for the reasons i outlined - so 
this whole comparison is unfair to CFS on several levels.)

also, note that CFS's modularity probably enabled Roman to do a fairly 
stable kernel/sched_norm.c (as most of the post-rc3 CFS changes were not 
to sched.c but to sched_fair.c) with easy porting. So with the CFS 
modular framework you can easily whip up and prototype a new scheduler 
and name it whatever you like. [ i expect the RCFS (Really Completely 
Fair Scheduler) patches to be posted to lkml any minute ;-) ]

> > It would be far more reviewable and objectively judgeable on an item 
> > by item basis if Roman posted the finegrained patches i asked for. 
> > (which patch series should be sorted in order of intrusiveness - 
> > i.e. leaving the harder changes to the end of the series.)
>
> Absolutely. And if there indeed are net improvements (be it for corner 
> cases) over latest CFS-rc5, while maintaining performance for the 
> common cases at the same time, well, that can only be a good thing.

yeah - and as i said to Roman, i like for example the use of a 
ready-queue instead of a run-queue. (but these are independent of the 
math changes, obviously.)

I also think that the core math changes should be split from the 
Bresenham optimizations. I.e. the Bresenham _fract code should be done 
as a "this improves performance and improves rounding, without changing 
behavior" add-on on top of a fairly simple core math change. I think 
Roman will easily be able to do this with a few hours of effort, which 
should present much reduced .text overhead in his next version of the 
patch, to demonstrate the simplicity of his implementation of the CFS 
fairness math - this really isn't hard to do.
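
As background for readers who haven't met the term: the Bresenham idea 
referred to here is to carry the truncated remainder of a division forward 
into the next update, so the rounding error of one step is paid back later 
instead of accumulating. A generic, purely illustrative sketch - this is 
not the patch's _fract code, the names fract_acc/fair_share are invented, 
and the real code also avoids re-doing the division on every update, 
which is not shown here:

  struct fract_acc {
          unsigned long long rem;    /* carried remainder, always < weight_sum */
  };

  /* Give a task of weight 'w' its share of 'slice' nanoseconds out of a
   * total weight 'weight_sum', carrying the truncated remainder forward
   * so that no time is lost to rounding over many updates. */
  static unsigned long long
  fair_share(struct fract_acc *acc, unsigned long long slice,
             unsigned long w, unsigned long weight_sum)
  {
          unsigned long long num = slice * w + acc->rem;

          acc->rem = num % weight_sum;   /* remainder is paid back next time */
          return num / weight_sum;       /* this task's share of the slice */
  }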

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Roman Zippel <[EMAIL PROTECTED]> wrote:

> > with Peter's queue there are no underflows/overflows either anymore 
> > in any synthetic corner-case we could come up with. Peter's queue 
> > works well but it's 2.6.24 material.
> 
> Did you even try to understand what I wrote? I didn't say that it's a 
> "common problem", it's a conceptual problem. The rounding has been 
> improved lately, so it's not as easy to trigger with some simple busy 
> loops.

As i mentioned it in my first reply to you i really welcome your changes 
and your interest in the Linux scheduler, but i'm still somewhat 
surprised why you focus on rounding so much (and why you attack CFS's 
math implementation so vehemently and IMO so unfairly) - and we had this 
discussion before in the "CFS review" thread that you started.

The kind of rounding error you seem to be worried about is very, very 
small. For normal nice-0 tasks it's in the "one part per a hundred 
million" range, or smaller. As a comparison: "top"'s CPU utilization 
statistics are accurate to "one part per thousand" - and that's roughly 
the precision range that humans care about. (Graphical tools are even 
coarser - one part per hundred or worse.)

I suspect if rounding effects are measurable you will post numbers that 
prove it, correct? You did not do that so far. Or if they are not 
measurable for any workload we care about, why should we worry about it 
if it causes no problems in practice? (or if it causes problems, what 
are the actual effects and can they be addressed in a simpler way?)

The main reason i'm interested in changing the fairness math under CFS 
(be that Peter's changes or your changes) is _not_ primarily to address 
any rounding behavior, but to potentially improve performance! Rounding 
errors are at most an academic issue, unless they are actually 
measurable in a workload we care about. (If rounding gets improved as a 
side-effect of a change that's an added, albeit lower-prio bonus.) But 
even more important is quality of scheduling - performance is secondary 
to that. (unless performance is so bad that it becomes a quality issue: 
such as an O(N) algorithm would do.)

Your math is fairly simple (and that is _good_, just like CFS's existing 
math is simple), it can be summed up in essence as (without complicating 
it with nice-level weighting, for easy understandability):

" use the already existing p->sum_exec_runtime 'task runtime' metric 
  that CFS maintains, and use that as the key into the rb-tree that 
  selects tasks that should be run next. To handle sleeping tasks: keep 
  a per-rq sum of all runnable task's ->sum_exec_runtime values and 
  start newly woken tasks at the average rq->sum/nr_running value. "
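
Spelled out as code, that summary amounts to something like the following 
deliberately naive sketch (the struct and field names here are stand-ins, 
not the kernel's real ones):

  typedef unsigned long long u64;

  struct se_sketch {
          u64 sum_exec_runtime;      /* total CPU time this task has received */
  };

  struct rq_sketch {
          u64             runtime_sum;   /* sum over all runnable tasks */
          unsigned int    nr_running;
  };

  /* rb-tree key: the task that has run the least sorts first */
  static u64 task_key(const struct se_sketch *p)
  {
          return p->sum_exec_runtime;
  }

  /* On wakeup, place the task at the queue's average runtime, so sleeping
   * neither earns unbounded credit nor leaves the task hopelessly behind. */
  static void place_woken(struct rq_sketch *rq, struct se_sketch *p)
  {
          if (rq->nr_running)
                  p->sum_exec_runtime = rq->runtime_sum / rq->nr_running;
          rq->runtime_sum += p->sum_exec_runtime;
          rq->nr_running++;
  }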

Now your patch does not actually do it that way in a clearly discernible 
manner because lots of changes are intermixed into one big patch. 

( please correct me if i got your math wrong. Your patch does not add 
  any comments at all to the new code and this slowed down my review
  and analysis of your patch quite considerably. Lack of comments makes
  it harder to see the purpose and makes it harder to notice the
  benefits/tradeoffs involved in each change. )

Much of the remaining complexity in your patch i see as an add-on 
optimization to that concept: you use a quite complex Bresenham 
implementation that hides the real machinery of the code. Yes, rounding 
improves too with Bresenham, but to me that is secondary - i'm mainly 
interested in the performance aspects of Bresenham. There are 
advantages to your approach but it also has disadvantages: you removed 
sleeper fairness for example, which makes it harder to compare the 
scheduling quality of the two implementations.

To sum up: the math in your patch _could_ be implemented as a much 
smaller add-on to the already existing variables maintained by CFS, but 
you chose to do lots of changes, variable-renames and removals at once 
and posted them as one big change.

I really welcome large changes to the scheduler (hey, in the past 2.5 
years alone we added 350+ scheduler commits from over 95 unique 
contributors, so i'm as feature-happy as it gets), but it's far easier 
to review and merge stuff if it's nicely split up. (I'll eventually get 
through your patch too, but it's much harder that way and as you 
probably know every core kernel hacker is away for the Kernel Summit so 
don't expect much activity for a week or so.)

One conceptual reason behind the intrusive policy-modularization done in 
CFS was to _never again_ be forced to do "big" scheduler changes - we 
now can do most things in small, understandable steps.

I posted my very first, raw version of CFS after 3 days of hacking which 
showed the raw math and nothing more, and i posted every iteration since 
then, so you can follow through its history. _Please_, separate things 
out so that they can be reviewed one by one. And _please_ order the 
patches in "level of intrusiveness" order - i.e. leave the more 
intrusive patches as late in the series as possible.

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Satyam Sharma


On Sun, 2 Sep 2007, Ingo Molnar wrote:
> 
> Although it _should_ have been a net code size win, because if you look 
> at the diff you'll see that other useful things were removed as well: 
> sleeper fairness, CPU time distribution smarts, tunings, scheduler 
> instrumentation code, etc.

To be fair to Roman, he probably started development off an earlier CFS,
most probably 2.6.23-rc3-git1, if I guess correctly from his original
posting. So it's likely he missed out on some of the tunings/comments(?)
etc code that got merged after that.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
> > my tree, and RFS was faster to some degree (it varied)..
> 
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive 
> runs fresh after bootup, Core2Duo, UP:

Again, it would be interesting to benchmark against 2.6.23-rc3-git1. And
also probably rediff vs 2.6.23-rc3-git1 and compare how the code actually
changed ... but admittedly, doing so would be utterly pointless, because
much water has flowed down the Ganges since -rc3 (misc CFS improvements,
Peter's patches that you mentioned). So a "look, I told you so" kind of
situation wouldn't really be constructive at all.

> It would be far more reviewable and objectively judgeable on an item by 
> item basis if Roman posted the finegrained patches i asked for. (which 
> patch series should be sorted in order of intrusiveness - i.e. leaving 
> the harder changes to the end of the series.)

Absolutely. And if there indeed are net improvements (be it for corner
cases) over latest CFS-rc5, while maintaining performance for the common
cases at the same time, well, that can only be a good thing.

Just my Rs. 0.02,

Satyam
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Daniel Walker <[EMAIL PROTECTED]> wrote:

> The patch is near the end of this email..  The most notable thing 
> about the rediff is the line count,
> 
>  4 files changed, 323 insertions(+), 729 deletions(-)
> 
> That's impressive (assuming my rediff is correct). [...]

Yeah, at first glance i liked that too, then i looked into the diff and 
noticed that a good chunk of the removal "win" comes from the removal of 
~35 comment blocks while adding new code that has no comments at all 
(!).

And if you look at the resulting code size/complexity, it actually 
increases with Roman's patch (UP, nodebug, x86):

   text    data     bss     dec     hex filename
  13420     228    1204   14852    3a04 sched.o.rc5
  13554     228    1228   15010    3aa2 sched.o.rc5-roman

Although it _should_ have been a net code size win, because if you look 
at the diff you'll see that other useful things were removed as well: 
sleeper fairness, CPU time distribution smarts, tunings, scheduler 
instrumentation code, etc.

> I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
> my tree, and RFS was faster to some degree (it varied)..

here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive 
runs fresh after bootup, Core2Duo, UP:

       -rc5(cfs)       -rc5+rfs
  ---------------------------------
  Time: 3.905     Time: 4.259
  Time: 3.962     Time: 4.190
  Time: 3.981     Time: 4.241
  Time: 3.986     Time: 3.937
  Time: 3.984     Time: 4.120
  Time: 4.001     Time: 4.013
  Time: 3.980     Time: 4.248
  Time: 3.983     Time: 3.961
  Time: 3.989     Time: 4.345
  Time: 3.981     Time: 4.294
  ---------------------------------
   Avg: 3.975      Avg: 4.160  (+4.6%)
 Fluct: 0.138    Fluct: 1.671

so unmodified CFS is 4.6% faster on this box than with Roman's patch and 
it's also more consistent/stable (10 times lower fluctuations).

At lower hackbench levels (hackbench 10) the numbers are closer - that 
could be what you saw.

But, this measurement too is apples to oranges, given the amount of 
useful code the patch removes - fact is that you can always speed up the 
scheduler by removing stuff, just run hackbench as SCHED_FIFO (via "chrt 
-f 90 ./hackbench 50") to see what a minimal scheduler can do.

It would be far more reviewable and objectively judgeable on an item by 
item basis if Roman posted the finegrained patches i asked for. (which 
patch series should be sorted in order of intrusiveness - i.e. leaving 
the harder changes to the end of the series.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Daniel Walker [EMAIL PROTECTED] wrote:

 The the patch is near the end of this email..  The most notable thing 
 about the rediff is the line count,
 
  4 files changed, 323 insertions(+), 729 deletions(-)
 
 That's impressive (assuming my rediff is correct). [...]

Yeah, at first glance i liked that too, then i looked into the diff and 
noticed that a good chunk of the removal win comes from the removal of 
~35 comment blocks while adding new code that has no comments at all 
(!).

And if you look at the resulting code size/complexity, it actually 
increases with Roman's patch (UP, nodebug, x86):

 textdata bss dec hex filename
13420 2281204   148523a04 sched.o.rc5
13554 2281228   150103aa2 sched.o.rc5-roman

Although it _should_ have been a net code size win, because if you look 
at the diff you'll see that other useful things were removed as well: 
sleeper fairness, CPU time distribution smarts, tunings, scheduler 
instrumentation code, etc.

 I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
 my tree, and RFS was faster to some degree (it varied)..

here are some actual numbers for hackbench 50 on -rc5, 10 consecutive 
runs fresh after bootup, Core2Duo, UP:

   -rc5(cfs)   -rc5+rfs
  ---
  Time: 3.905 Time: 4.259
  Time: 3.962 Time: 4.190
  Time: 3.981 Time: 4.241
  Time: 3.986 Time: 3.937
  Time: 3.984 Time: 4.120
  Time: 4.001 Time: 4.013
  Time: 3.980 Time: 4.248
  Time: 3.983 Time: 3.961
  Time: 3.989 Time: 4.345
  Time: 3.981 Time: 4.294
  ---
   Avg: 3.975  Avg: 4.160 (+4.6%)
 Fluct: 0.138Fluct: 1.671

so unmodified CFS is 4.6% faster on this box than with Roman's patch and 
it's also more consistent/stable (10 times lower fluctuations).

At lower hackbench levels (hackbench 10) the numbers are closer - that 
could be what you saw.

But, this measurement too is apples to oranges, given the amount of 
useful code the patch removes - fact is that you can always speed up the 
scheduler by removing stuff, just run hackbench as SCHED_FIFO (via chrt 
-f 90 ./hackbench 50) to see what a minimal scheduler can do.

It would be far more reviewable and objectively judgeable on an item by 
item basis if Roman posted the finegrained patches i asked for. (which 
patch series should be sorted in order of intrusiveness - i.e. leaving 
the harder changes to the end of the series.)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Satyam Sharma


On Sun, 2 Sep 2007, Ingo Molnar wrote:
 
 Although it _should_ have been a net code size win, because if you look 
 at the diff you'll see that other useful things were removed as well: 
 sleeper fairness, CPU time distribution smarts, tunings, scheduler 
 instrumentation code, etc.

To be fair to Roman, he probably started development off an earlier CFS,
most probably 2.6.23-rc3-git1, if I guess correctly from his original
posting. So it's likely he missed out on some of the tunings/comments(?)
etc code that got merged after that.

  I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
  my tree, and RFS was faster to some degree (it varied)..
 
 here are some actual numbers for hackbench 50 on -rc5, 10 consecutive 
 runs fresh after bootup, Core2Duo, UP:

Again, it would be interesting to benchmark against 2.6.23-rc3-git1. And
also probably rediff vs 2.6.23-rc3-git1 and compare how the code actually
changed ... but admittedly, doing so would be utterly pointless, because
much water has flowed down the Ganges since -rc3 (misc CFS improvements,
Peter's patches that you mentioned). So a look, I told you so kind of
situation wouldn't really be constructive at all.

 It would be far more reviewable and objectively judgeable on an item by 
 item basis if Roman posted the finegrained patches i asked for. (which 
 patch series should be sorted in order of intrusiveness - i.e. leaving 
 the harder changes to the end of the series.)

Absolutely. And if there indeed are net improvements (be it for corner
cases) over latest CFS-rc5, while maintaining performance for the common
cases at the same time, well, that can only be a good thing.

Just my Rs. 0.02,

Satyam
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Roman Zippel [EMAIL PROTECTED] wrote:

  with Peter's queue there are no underflows/overflows either anymore 
  in any synthetic corner-case we could come up with. Peter's queue 
  works well but it's 2.6.24 material.
 
 Did you even try to understand what I wrote? I didn't say that it's a 
 common problem, it's a conceptual problem. The rounding has been 
 improved lately, so it's not as easy to trigger with some simple busy 
 loops.

As i mentioned it in my first reply to you i really welcome your changes 
and your interest in the Linux scheduler, but i'm still somewhat 
surprised why you focus on rounding so much (and why you attack CFS's 
math implementation so vehemently and IMO so unfairly) - and we had this 
discussion before in the CFS review thread that you started.

The kind of rounding error you seem to be worried about is very, very 
small. For normal nice-0 tasks it's in the one part per a hundred 
million range, or smaller. As a comparison: top's CPU utilization 
statistics are accurate to one part per thousand - and that's roughly 
the precision range that humans care about. (Graphical tools are even 
coarser - one part per hundred or worse.)

I suspect if rounding effects are measurable you will post numbers that 
prove it, correct? You did not do that so far. Or if they are not 
measurable for any workload we care about, why should we worry about it 
if it causes no problems in practice? (or if it causes problems, what 
are the actual effects and can they be addressed in a simpler way?)

The main reason i'm interested in changing the fairness math under CFS 
(be that Peter's changes or your changes) is _not_ primarily to address 
any rounding behavior, but to potentially improve performance! Rounding 
errors are at most an academic issue, unless they are actually 
measurable in a workload we care about. (If rounding gets improved as a 
side-effect of a change that's an added, albeit lower-prio bonus.) But 
even more important is quality of scheduling - performance is secondary 
to that. (unless performance is so bad that it becomes a quality issue: 
such as an O(N) algorithm would do.)

Your math is fairly simple (and that is _good_, just like CFS's existing 
math is simple), it can be summed up in essence as (without complicating 
it with nice-level weighting, for easy understandability):

 use the already existing p-sum_exec_runtime 'task runtime' metric 
  that CFS maintains, and use that as the key into the rb-tree that 
  selects tasks that should be run next. To handle sleeping tasks: keep 
  a per-rq sum of all runnable task's -sum_exec_runtime values and 
  start newly woken tasks at the average rq-sum/nr_running value. 

Now your patch does not actually do it that way in a clearly discernible 
manner because lots of changes are intermixed into one big patch. 

( please correct me if i got your math wrong. Your patch does not add 
  any comments at all to the new code and this slowed down my review
  and analysis of your patch quite considerably. Lack of comments makes
  it harder to see the purpose and makes it harder to notice the
  benefits/tradeoffs involved in each change. )

Much of the remaining complexity in your patch i see as an add-on 
optimization to that concept: you use a quite complex Bresenham 
implementation that hides the real machinery of the code. Yes, rounding 
improves too with Breshenham, but to me that is secondary - i'm mainly 
interested in the performance aspects of Breshenham. There are 
advantages of your approach but it also has disadavantages: you removed 
sleeper fairness for example, which makes it harder to compare the 
scheduling quality of the two implementations.

To sum up: the math in your patch _could_ be implemented as a much 
smaller add-on to the already existing variables maintained by CFS, but 
you chose to do lots of changes, variable-renames and removals at once 
and posted them as one big change.

I really welcome large changes to the scheduler (hey, in the past 2.5 
years alone we added 350+ scheduler commits from over 95 unique 
contributors, so i'm as feature-happy as it gets), but it's far easier 
to review and merge stuff if it's nicely split up. (I'll eventually get 
through your patch too, but it's much harder that way and as you 
probably know every core kernel hacker is away for the Kernel Summit so 
dont expect much activity for a week or so.)

One conceptual reason behind the intrusive policy-modularization done in 
CFS was to _never again_ be forced to do big scheduler changes - we 
now can do most things in small, understandable steps.

I posted my very first, raw version of CFS after 3 days of hacking which 
showed the raw math and nothing more, and i posted every iteration since 
then, so you can follow through its history. _Please_, separate things 
out so that they can be reviewed one by one. And _please_ order the 
patches in level of intrusiveness order - i.e. leave the more 
intrusive patches as late in the 

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Satyam Sharma [EMAIL PROTECTED] wrote:

 On Sun, 2 Sep 2007, Ingo Molnar wrote:
  
  Although it _should_ have been a net code size win, because if you 
  look at the diff you'll see that other useful things were removed as 
  well: sleeper fairness, CPU time distribution smarts, tunings, 
  scheduler instrumentation code, etc.
 
 To be fair to Roman, he probably started development off an earlier 
 CFS, most probably 2.6.23-rc3-git1, if I guess correctly from his 
 original posting. So it's likely he missed out on some of the 
 tunings/comments(?) etc code that got merged after that.

actually, here are the rc3-rc5 changes to CFS:

 sched.c   |   89 
 sched_debug.c |3 -
 sched_fair.c  |  142 +-
 sched_rt.c|   11 +++-
 4 files changed, 182 insertions(+), 63 deletions(-)

so since -rc3 CFS's size _increased_ (a bit).

and i just checked, the sched.o codesize still increases even when 
comparing rc4 against rc4+patch (his original patch) and there are no 
comments added by Roman's patch at all. (all the comments in 
sched_norm.c were inherited from sched_fair.c and none of the new code 
comes with comments - this can be seen in Daniel's rediffed patch.)

(and it's still not apples to oranges, for the reasons i outlined - so 
this whole comparison is unfair to CFS on several levels.)

also, note that CFS's modularity probably enabled Roman to do a fairly 
stable kernel/sched_norm.c (as most of the post-rc3 CFS changes were not 
to sched.c but to sched_fair.c) with easy porting. So with the CFS 
modular framework you can easily whip up and prototype a new scheduler 
and name it whatever you like. [ i expect the RCFS (Really Completely 
Fair Scheduler) patches to be posted to lkml any minute ;-) ]

  It would be far more reviewable and objectively judgeable on an item 
  by item basis if Roman posted the finegrained patches i asked for. 
  (which patch series should be sorted in order of intrusiveness - 
  i.e. leaving the harder changes to the end of the series.)

 Absolutely. And if there indeed are net improvements (be it for corner 
 cases) over latest CFS-rc5, while maintaining performance for the 
 common cases at the same time, well, that can only be a good thing.

yeah - and as i said to Roman, i like for example the use of a 
ready-queue instead of a run-queue. (but these are independent of the 
math changes, obviously.)

I also think that the core math changes should be split from the 
Breshenham optimizations. I.e. the Breshenham _fract code should be done 
as a this improves performance and improves rounding, without changing 
behavior add-on ontop of a fairly simple core math change. I think 
Roman will easily be able to do this with a few hours of effort which 
should present much reduced .text overhead in his next version of the 
patch, to demonstrate the simplicity of his implementation of the CFS 
fairness math - this really isnt hard to do.

Ingo

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sat, 1 Sep 2007, Daniel Walker wrote:

 Out of curiosity I was reviewing your patch .. Since you create
 kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see
 what had changed .. So I re-diffed your patch to eliminate
 kernel/sched_norm.c and just make the changes to kernel/sched_fair.c .. 

I mainly did it this way to avoid annoying conflicts. I also started with 
a copy where I threw out everything but a basic skeleton and than just 
copied the stuff I needed.

  (1) time = sum_{t}^{T}(time_{t})
  (2) weight_sum = sum_{t}^{T}(weight_{t})
 
 I read your description, but I was distracted by this latex style
 notation .. Could you walk through in english what these two equation
 are doing ..

T is the number of tasks, so the first is the sum of the time every task 
gets over a period of time and the second is the sum of all weights of 
each task.

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Daniel Walker
On Sun, 2007-09-02 at 16:47 +0200, Roman Zippel wrote:

   (1)   time = sum_{t}^{T}(time_{t})
   (2)   weight_sum = sum_{t}^{T}(weight_{t})
  
  I read your description, but I was distracted by this latex style
  notation .. Could you walk through in english what these two equation
  are doing ..
 
 T is the number of tasks, so the first is the sum of the time every task 
 gets over a period of time and the second is the sum of all weights of 
 each task.

I think I'm more puzzled now than I was before, (scratches head for
minutes) ..

I see .. It's a summation symbol .. 

For instance if there are three tasks in the system. Call them A,B, and
C.

then 

time equals time of A + time of B + time of C

Right? 

Daniel

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

 And if you look at the resulting code size/complexity, it actually 
 increases with Roman's patch (UP, nodebug, x86):
 
  textdata bss dec hex filename
 13420 2281204   148523a04 sched.o.rc5
 13554 2281228   150103aa2 sched.o.rc5-roman

That's pretty easy to explain due to differences in inlining:

   textdata bss dec hex filename
  15092 2281204   16524408c kernel/sched.o
  15444 2241228   168964200 kernel/sched.o.rfs
  14708 2241228   161603f20 kernel/sched.o.rfs.noinline

Sorry, but I didn't spend as much time as you on tuning these numbers.

Index: linux-2.6/kernel/sched_norm.c
===
--- linux-2.6.orig/kernel/sched_norm.c  2007-09-02 16:58:05.0 +0200
+++ linux-2.6/kernel/sched_norm.c   2007-09-02 16:10:58.0 +0200
@@ -145,7 +145,7 @@ static inline struct task_struct *task_o
 /*
  * Enqueue an entity into the rb-tree:
  */
-static inline void
+static void
 __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
struct rb_node **link = cfs_rq-tasks_timeline.rb_node;
@@ -192,7 +192,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq, 
se-queued = 1;
 }
 
-static inline void
+static void
 __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
if (cfs_rq-rb_leftmost == se) {
@@ -240,7 +240,7 @@ static void verify_queue(struct cfs_rq *
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
  */
-static inline void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr(struct cfs_rq *cfs_rq)
 {
struct sched_entity *curr = cfs_rq-curr;
kclock_t now = rq_of(cfs_rq)-clock;

 Although it _should_ have been a net code size win, because if you look 
 at the diff you'll see that other useful things were removed as well: 
 sleeper fairness, CPU time distribution smarts, tunings, scheduler 
 instrumentation code, etc.

Well, these are things I'd like you to explain a little, for example I 
repeatedly asked you about the sleeper fairness and I got no answer.
BTW you seemed to haved missed that I actually give a bonus to sleepers 
as well.

  I also ran hackbench (in a haphazard way) a few times on it vs. CFS in 
  my tree, and RFS was faster to some degree (it varied)..
 
 here are some actual numbers for hackbench 50 on -rc5, 10 consecutive 
 runs fresh after bootup, Core2Duo, UP:
 
-rc5(cfs)   -rc5+rfs
   ---
   Time: 3.905 Time: 4.259
   Time: 3.962 Time: 4.190
   Time: 3.981 Time: 4.241
   Time: 3.986 Time: 3.937
   Time: 3.984 Time: 4.120
   Time: 4.001 Time: 4.013
   Time: 3.980 Time: 4.248
   Time: 3.983 Time: 3.961
   Time: 3.989 Time: 4.345
   Time: 3.981 Time: 4.294
   ---
Avg: 3.975  Avg: 4.160 (+4.6%)
  Fluct: 0.138Fluct: 1.671
 
 so unmodified CFS is 4.6% faster on this box than with Roman's patch and 
 it's also more consistent/stable (10 times lower fluctuations).

Was SCHED_DEBUG enabled or disabled for these runs?

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Ingo Molnar

* Roman Zippel [EMAIL PROTECTED] wrote:

  And if you look at the resulting code size/complexity, it actually 
  increases with Roman's patch (UP, nodebug, x86):
  
   textdata bss dec hex filename
  13420 2281204   148523a04 sched.o.rc5
  13554 2281228   150103aa2 sched.o.rc5-roman
 
 That's pretty easy to explain due to differences in inlining:
 
textdata bss dec hex filename
   15092 2281204   16524408c kernel/sched.o
   15444 2241228   168964200 kernel/sched.o.rfs
   14708 2241228   161603f20 kernel/sched.o.rfs.noinline

no, when generating those numbers i used:

  CONFIG_CC_OPTIMIZE_FOR_SIZE=y
  # CONFIG_FORCED_INLINING is not set

(but i also re-did it for all the other combinations of these build 
flags and similar results can be seen - your patch, despite removing 
lots of source code, produces a larger sched.o.)

 Sorry, but I didn't spend as much time as you on tuning these numbers.

some changes did slip into your patch that have no other purpose but to 
reduce code size:

  +#ifdef CONFIG_SMP
  unsigned long cpu_load[CPU_LOAD_IDX_MAX];
  +#endif
  [...]
  +#ifdef CONFIG_SMP
   /* Used instead of source_load when we know the type == 0 */
   unsigned long weighted_cpuload(const int cpu)
   {
  return cpu_rq(cpu)-ls.load.weight;
   }
  +#endif
  [...]

so i thought you must be aware of the problem - at least considering how 
much you've criticised CFS's complexity both in your initial review of 
CFS (which included object size comparisons) and in this patch 
submission of yours (which did not include object size comparisons
though).

  so unmodified CFS is 4.6% faster on this box than with Roman's patch 
  and it's also more consistent/stable (10 times lower fluctuations).
 
 Was SCHED_DEBUG enabled or disabled for these runs?

debugging disabled of course. (your patch has a self-validity checking 
function [verify_queue()] that is called on SCHED_DEBUG=y, it would have 
been unfair to test your patch with that included.)

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sat, 1 Sep 2007, Bill Davidsen wrote:

  I'd expect Ingo to know better, but the more he refuses to answer my
  questions, the more I doubt it, at least than it comes to the math part.
  
 The math part is important if you're doing a thesis defense, but
 demonstrating better behavior under some realistic load would probably be a
 better starting point.

That wasn't my design goal. The initial trigger for this was all the 64bit 
math in CFS and the question how can this be tuned using arch specific 
knowledge about the input values. For this I needed a correct and 
verifyable model to analyze, so I know where the limits are and how it 
reacts to different sets of inputs. This is where I got into trouble with 
all the rounding - I couldn't substantially change the math without 
convincing myself that it's still correct for all kinds of input data. 
That's why I completely redid the math from scratch - it's based on the 
same basic ideas, but the solution and thus the math is quite different.

The main reason I didn't concentrate on the behaviour so much is that 
since both scheduler are conceptually not that far apart, it should be 
possible to apply any tuning done to CFS also to my model. But this 
requires someone actually understands what tuning was done and it wasn't 
done by random changes and seeing what happens, i.e. someone should be 
able to explain it to me.
BTW in this context it's rather interesting to see that Ingo attacks me 
now for not having done this tuning yet and for which I explicitely asked 
for help...

 Maybe Ingo missed something in your math, and maybe he
 just didn't find a benefit.

Maybe he didn't understand it at all and is too proud to admit it?
(At least that's the only explanation which makes sense to me.)

 You dropped this on the world two days before a major US holiday, at the end
 of the summer when those of us not on vacation may be covering for those who
 are, did you expect Ingo to drop his regular work to look at your stuff?

Did I demand anything like that? It had been fine if he had been asking 
for more time or if he had more questions first, but it took him only a 
few hours to come to his conclusion without any need for further 
questions.

 And
 do you think there are many people who can look at your math, and look at your
 code, and then have any clue how well it works in practice? I bet there aren't
 ten people in the world who would even claim to do that, and half of them are
 kidding themselves.

I don't think the math is that complex, it may need some more explanations 
outside the math formulas though. But the math is needed to analyze what 
effect changes to scheduler have, otherwise there is a risk it becomes 
only guesswork with random changes which may help in some situations but 
break others.

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

 so i thought you must be aware of the problem - at least considering how 
 much you've criticised CFS's complexity both in your initial review of 
 CFS (which included object size comparisons) and in this patch 
 submission of yours (which did not include object size comparisons
 though).

Ingo as long as you freely mix algorithmic and code complexity we won't 
get very far. I did code size comparisons for your _stable_ code, which 
was merged into Linus' tree. I explicitly said that my patch is only a 
prototype, so I haven't done any cleanups and tuning in this direction 
yet, so making any conclusions based on code size comparisions is quite 
ridiculous at this point.
The whole point of this patch was to demonstrate the algorithmic changes, 
not to demonstrate a final and perfectly tuned scheduler.

   so unmodified CFS is 4.6% faster on this box than with Roman's patch 
   and it's also more consistent/stable (10 times lower fluctuations).
  
  Was SCHED_DEBUG enabled or disabled for these runs?
 
 debugging disabled of course. (your patch has a self-validity checking 
 function [verify_queue()] that is called on SCHED_DEBUG=y, it would have 
 been unfair to test your patch with that included.)

I'll look into it next week.

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-02 Thread Roman Zippel
Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

  Did you even try to understand what I wrote? I didn't say that it's a 
  common problem, it's a conceptual problem. The rounding has been 
  improved lately, so it's not as easy to trigger with some simple busy 
  loops.
 
 As i mentioned it in my first reply to you i really welcome your changes 
 and your interest in the Linux scheduler, but i'm still somewhat 
 surprised why you focus on rounding so much (and why you attack CFS's 
 math implementation so vehemently and IMO so unfairly) - and we had this 
 discussion before in the CFS review thread that you started.

The rounding is insofar a problem as it makes it very difficult to produce 
a correct mathematical model of CFS, which can be used to verify the 
working of code.
With the recent rounding changes they have indeed little effect in the 
short term, especially as the error is distributed equally to the task, so 
every task gets relatively the same time, the error still exists 
nonetheless and adds up over time (although with the rounding changes it 
doesn't grow into one direction anymore). The problem is now the limiting, 
which you can't remove as long as the error exists. Part of this problem
is that CFS doesn't really maintain a balanced sum of the available 
runtime, i.e. the sum of all updated wait_runtime values of all active 
tasks should be zero. This means under complex loads it's possible to hit 
the wait_runtime limits and the overflow/underflow time is lost in the 
calculation and this value can be easily in the millisecond range.
To be honest I have no idea how real this problem in the praxis is right 
now, quite possibly people will only see it as a small glitch and in most 
cases they won't even notice. Previously it had been rather easy to 
trigger these problems due to the rounding problems. The main problem for 
me here is now that with any substantial change in CFS I'm running risk of 
making this worse and noticable again somehow. For example CFS relies on 
the 64bit math to have enough resolution so that the errors aren't 
noticable.
That's why I needed a mathematical model I can work with and with it for 
example I can easily downscale the math to 32bit and I know exactly the 
limits within it will work correctly.

> Your math is fairly simple (and that is _good_, just like CFS's existing 
> math is simple),

I really wouldn't call CFS's existing math simple; the basic ideas are 
indeed simple, but the math of the actual implementation is quite a bit 
more complex.

> it can be summed up in essence as (without complicating 
> it with nice-level weighting, for easy understandability):
> 
>   use the already existing p->sum_exec_runtime 'task runtime' metric 
>   that CFS maintains, and use that as the key into the rb-tree that 
>   selects tasks that should be run next. To handle sleeping tasks: keep 
>   a per-rq sum of all runnable task's ->sum_exec_runtime values and 
>   start newly woken tasks at the average rq->sum/nr_running value. 
> 
> Now your patch does not actually do it that way in a clearly discernible 
> manner because lots of changes are intermixed into one big patch. 

Well, it's part of the math, but I wouldn't call it the essence of the 
_math_, as it leaves many questions unanswered; the essence of the math 
is mainly how the problems created by the weighting are solved.

If you want to describe the basic logical differences, you can do it a 
little simpler: CFS maintains the runtime of a task relative to a global 
clock, while in my model every task has its own clock and an approximated 
average is used to initialize the clock of new tasks.
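
To illustrate the shape of that difference - this is only a rough sketch 
of the idea with made-up names, not the patch code, which keeps its 
entities in an rbtree and uses an approximated average rather than 
rescanning the queue - a per-task clock scheme boils down to something 
like:

/* toy sketch of a per-task normalized clock; all names are invented */
struct toy_entity {
	unsigned long long clock_norm;	/* this task's own clock */
	unsigned long long weight_inv;	/* precomputed 1/weight scale */
};

/* the task that just ran advances only its own clock, scaled by weight */
static void toy_account(struct toy_entity *se, unsigned long long delta)
{
	se->clock_norm += delta * se->weight_inv;
}

/* a task entering the queue starts at an approximated average clock */
static void toy_enqueue(struct toy_entity *se,
			struct toy_entity **queue, int nr)
{
	unsigned long long avg = 0;
	int i;

	for (i = 0; i < nr; i++)
		avg += queue[i]->clock_norm;
	if (nr)
		avg /= nr;
	se->clock_norm = avg;
}

/* the task with the smallest normalized clock runs next */
static struct toy_entity *toy_pick_next(struct toy_entity **queue, int nr)
{
	struct toy_entity *best = queue[0];
	int i;

	for (i = 1; i < nr; i++)
		if (queue[i]->clock_norm < best->clock_norm)
			best = queue[i];
	return best;
}

Contrast this with keying every task off one shared fair clock: here 
there is no global sum that has to be kept consistent, which is why the 
rounding of such a sum stops mattering.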

> ( please correct me if i got your math wrong. Your patch does not add 
>   any comments at all to the new code and this slowed down my review
>   and analysis of your patch quite considerably. Lack of comments makes
>   it harder to see the purpose and makes it harder to notice the
>   benefits/tradeoffs involved in each change. )

I hope you did notice that I explained in the announcement in quite some 
detail what the code does.

> To sum up: the math in your patch _could_ be implemented as a much 
> smaller add-on to the already existing variables maintained by CFS, but 
> you chose to do lots of changes, variable-renames and removals at once 
> and posted them as one big change.

I didn't rename that much, and there are only a few variables that could 
be reused; for the most part I indeed removed quite a bit, and the rest 
really has a different meaning.

>  _Please_, separate things 
> out so that they can be reviewed one by one.

That's not really the problem, but _please_, someone explain to me the 
features you want to see preserved. I don't want to be left guessing 
what they're intended to do and blindly implement something which is 
supposed to do something similar.
Thanks.

bye, Roman
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-01 Thread Bill Davidsen

Roman Zippel wrote:

Hi,

On Fri, 31 Aug 2007, Ingo Molnar wrote:

> Maybe I should explain for everyone else (especially after seeing some of 
> the comments on kerneltrap), why I reacted somewhat irritated on what 
> looks like such an innocent mail.
> The problem is without the necessary background one can't know how wrong 
> such statements as this are, the level of confidence is amazing though:
> 
> > Peter's work fully solves the rounding corner-cases you described as:
> 
> I'd expect Ingo to know better, but the more he refuses to answer my 
> questions, the more I doubt it, at least when it comes to the math part.


The "math part" is important if you're doing a thesis defense, but 
demonstrating better behavior under some realistic load would probably 
be a better starting point. Maybe Ingo missed something in your math, 
and maybe he just didn't find a benefit.


The ck and sd schedulers were developed over a long time of public use and 
redefinition of goals. CFS was developed faster, but still had months of 
testing and the benefit of rt experience. Your scheduler is more of a 
virgin birth: no slow public discussion beforehand, no time yet for people 
to run it under real loads - you're seeing first impressions.


You dropped this on the world two days before a major US holiday, at the 
end of the summer when those of us not on vacation may be covering for 
those who are - did you expect Ingo to drop his regular work to look at 
your stuff? And do you think there are many people who can look at your 
math, and look at your code, and then have any clue how well it works in 
practice? I bet there aren't ten people in the world who would even 
claim to do that, and half of them are kidding themselves.


So give people a few weeks to see if the rounding errors you eliminated 
mean anything in practice. Faster context switching is nice, but it's not 
something most people count as their major problem. I did a *lot* of cfs 
vs. sd testing, not just all the glitch1 tests I posted here, lots of 
stuff I just sent to Ingo, and lots of tests like IPCbench and NNTP 
server loading showed nothing. (That's good, it means cfs isn't worse 
than the old scheduler for those loads.) I played with the tuning, I 
even diddled the code a bit, without finding meaningful improvement, so 
your possibly better math may not change the overall behavior much.


I will wait until I post actual test results before I say any more.

--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-01 Thread Daniel Walker
On Fri, 2007-08-31 at 04:05 +0200, Roman Zippel wrote:
> Hi,
> 
> I'm glad to announce a working prototype of the basic algorithm I
> already suggested last time.
> As I already tried to explain previously CFS has a considerable
> algorithmic and computational complexity. This patch should now make it
> clearer, why I could so easily skip over Ingo's long explanation of all
> the tricks CFS uses to keep the computational overhead low - I simply
> don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
> kernel, the first 3 runs are without patch, the last 3 runs are with the
> patch:

Out of curiosity I was reviewing your patch .. Since you create
kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see
what had changed .. So I re-diffed your patch to eliminate
kernel/sched_norm.c and just make the changes to kernel/sched_fair.c .. 

The patch is near the end of this email..  The most notable thing
about the rediff is the line count,

 4 files changed, 323 insertions(+), 729 deletions(-)

That's impressive (assuming my rediff is correct). Although I don't know
for certain how that line reduction is achieved..

I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
my tree, and RFS was faster to some degree (it varied).. 

> (1)   time = sum_{t}^{T}(time_{t})
> (2)   weight_sum = sum_{t}^{T}(weight_{t})

I read your description, but I was distracted by this LaTeX-style
notation .. Could you walk through in English what these two equations
are doing ..

The patch below is on top of my tree (2.6.23-rc5-dw1) ,
ftp://source.mvista.com/pub/dwalker/rt/

Daniel

---
 include/linux/sched.h |   21 -
 kernel/sched.c|  186 ---
 kernel/sched_debug.c  |   26 -
 kernel/sched_fair.c   |  819 +-
 4 files changed, 323 insertions(+), 729 deletions(-)

Index: linux-2.6.22/include/linux/sched.h
===
--- linux-2.6.22.orig/include/linux/sched.h
+++ linux-2.6.22/include/linux/sched.h
@@ -1046,40 +1046,29 @@ struct load_weight {
  *
  * Current field usage histogram:
  *
- * 4 se->block_start
  * 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
  * 6 se->load.weight
- * 7 se->delta_fair
- *15 se->wait_runtime
  */
 struct sched_entity {
-   long    wait_runtime;
-   unsigned long   delta_fair_run;
-   unsigned long   delta_fair_sleep;
-   unsigned long   delta_exec;
-   s64 fair_key;
+   s64 time_key;
struct load_weight  load;   /* for load-balancing */
struct rb_node  run_node;
-   unsigned int    on_rq;
+   unsigned int    on_rq, queued;;
+   unsigned int    weight_shift;
 
u64 exec_start;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
-   u64 wait_start_fair;
-   u64 sleep_start_fair;
+   s64 time_norm;
+   s64 req_weight_inv;
 
 #ifdef CONFIG_SCHEDSTATS
-   u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;
 
-   u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;
 
-   u64 block_start;
u64 block_max;
u64 exec_max;
 
Index: linux-2.6.22/kernel/sched.c
===
--- linux-2.6.22.orig/kernel/sched.c
+++ linux-2.6.22/kernel/sched.c
@@ -230,18 +230,24 @@ struct cfs_rq {
 
s64 fair_clock;
u64 exec_clock;
-   s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;
 
+   u64 prev_update;
+   s64 time_norm_base, time_norm_inc;
+   u64 run_start, run_end;
+   u64 run_end_next, run_end_curr;
+   s64 time_sum_max, time_sum_off;
+   unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
-   struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
-   struct sched_entity *curr;
+   struct sched_entity *curr, *next;
+   struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */
 
/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -278,12 +284,16 @@ struct rq {
 */
unsigned long nr_running;

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-09-01 Thread Roman Zippel
Hi,

On Fri, 31 Aug 2007, Ingo Molnar wrote:

Maybe I should explain for everyone else (especially after seeing some of 
the comments on kerneltrap), why I reacted somewhat irritated on what 
looks like such an innocent mail.
The problem is without the necessary background one can't know how wrong 
such statements as this are, the level of confidence is amazing though:

> Peter's work fully solves the rounding corner-cases you described as:

I'd expect Ingo to know better, but the more he refuses to answer my 
questions, the more I doubt it, at least when it comes to the math part.

While Peter's patches are interesting, they are only a small step to what 
I'm trying to achieve. With these patches the basic CFS math pretty much 
looks like this:

sum_{t}^{T}(round(time_{t} * round(WMULT / weight_{t}) * WEIGHT0 / WMULT))
  = sum_{r}^{R}(round(time_{r} * round(WMULT / weight_sum) * WEIGHT0 / WMULT))

It's based on this equation:

sum_{t}^{T}(time_{t} / weight_{t}) = sum_{r}^{R}(time_{r} / weight_sum)

This is the time a single task gets relative to the runtime of all 
tasks. These sums are incrementally added to/subtracted from wait_runtime, 
which should stay around zero.
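
A quick worked example (numbers invented): two runnable tasks with 
weights 2 and 1, so weight_sum = 3. Over 3ms of total runtime the 
scheduler gives the first task 2ms and the second 1ms. For the first 
task the left-hand side is 2ms/2 = 1ms, for the second it is 1ms/1 = 1ms, 
and the right-hand side is 3ms/3 = 1ms in both cases - each task's 
weighted runtime tracks the weighted total runtime. The wait_runtime 
bookkeeping adds one side and subtracts the other, so with exact 
arithmetic the two cancel; the rounding terms above are what keep them 
from cancelling exactly.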

All Peter's wait_runtime-scale patch does is to move the weight_{t} from 
one side to the other and that's it, it changes _nothing_ about the 
rounding above - "the rounding corner-cases" are still there.

In my announcement I now describe in quite some detail how I get rid of 
this rounding effect. The only rounding from the above equation which is 
still left is "round(WMULT / weight_{t})", but this is luckily a constant 
and so is the time one task gets relative to another (unless reniced).

Anyone who has actually read and understood what I wrote will hopefully 
realize what complete nonsense a statement like this is:

> So the most intrusive (math) aspects of your patch have been implemented
> already for CFS (almost a month ago), in a finegrained way.

I'm not repeating the whole math again here; if anyone has questions about 
it, I'll try my best to answer them. Instead, here are some of the 
intrusive aspects which supposedly have been implemented already.

One key difference is that I don't maintain the global sum (fair_clock) 
directly anymore; I can calculate it if needed, but it's no longer used to 
update wait_runtime. This has the advantage that the whole rounding 
involved in it no longer has any influence on how much time a task gets. 
Without this value the whole math around how to schedule a task is quite 
different as well.

Another key difference is that I got rid of (WEIGHT0 / WMULT); this has 
the advantage that it completely removes the problematic rounding, and 
the scheduler can now finally be as precise as the hardware allows. 
OTOH this has consequences for the range of values, as they can and are 
expected to overflow after some time, which the math has to take into 
account.
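
For readers wondering how values that are expected to overflow can still 
be ordered: the usual trick is to compare two keys only through the sign 
of their difference, the same idea the kernel's time_after()/time_before() 
macros use for jiffies. The sketch below is a generic illustration of 
that technique (names invented, not taken from the patch); it stays 
correct as long as two runnable tasks never drift more than half the 
value range apart:

/* toy wrap-safe comparison; the type width and names are made up */
typedef unsigned long long toy_key_t;

/* nonzero if a is "before" b, even if one of them has wrapped around */
static inline int toy_key_before(toy_key_t a, toy_key_t b)
{
	return (long long)(a - b) < 0;
}

static inline toy_key_t toy_key_max(toy_key_t a, toy_key_t b)
{
	return toy_key_before(a, b) ? b : a;
}

That bound on the spread is exactly the kind of limit the math has to 
guarantee, as mentioned above.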

One positive side effect of these overflows is that I can reduce the 
resolution the scheduler is working with and thus I can get rid of pretty 
much all of the 64bit math, if the reduced resolution is sufficient, e.g. 
for all archs which have a jiffies based scheduler clock, but even 
embedded archs might like it.

> So if you'd like to work with us on this and get items that make sense 
> merged (which we'd very much like to see happen), could you please 
> re-shape the rest of your changes and ideas (such as whether to use 
> ready-queueing or a runqueue concept, which does look interesting) ontop 
> of Peter's queue, and please do it as a finegrained, reviewable, 
> mergable series of patches, like Peter did. Thanks!

The funny thing is it should be not that hard to split the patch, but that 
wasn't the point. The point is to discuss the differences - how the 
different approach affects the scheduling decisions, as the new scheduler 
maintains somewhat different values. If I'm now told "it's just the same 
thing mathematically", which is provably nonsense, I'm a little stunned, 
and the point that aggravates me is that most people are simply going to 
believe Ingo, because they don't understand the issue (and they don't 
really have to). I'm still amazed how easily Ingo can just ignore the main 
part of my work and still get away with it. The last thing I want is yet 
another flame war; all I want is for this work to be taken seriously, but 
it took Ingo less than 9 hours to arrive at a thorough judgement...

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Mike Galbraith
On Fri, 2007-08-31 at 15:22 +0200, Roman Zippel wrote:

> Were there some kernel messages while running it?

Nope.

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Mike Galbraith
On Fri, 2007-08-31 at 15:22 +0200, Roman Zippel wrote:
> Hi,
> 
> On Fri, 31 Aug 2007, Mike Galbraith wrote:
> 
> > I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
> > loads, and hit some starvation.  If I got it in right (think so) there's
> > a bug lurking somewhere.  taskset -c 1 fairtest2 resulted in the below.
> > It starts up running both tasks at about 60/40 for hog/sleeper, then
> > after a short while goes nuts.  The hog component eats 100% cpu and
> > starves the sleeper (and events, forever).
> 
> Thanks for testing, although your test program does nothing unusual here.
> Can you please send me your .config?

Attached.

> Were there some kernel messages while running it?

I didn't look actually, was in rather a hurry.  I'll try it again
tomorrow.

-Mike
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23-rc3
# Fri Aug 31 09:04:00 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-smp-r"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_LSF=y
# CONFIG_BLK_DEV_BSG is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MCORE2 is not set
CONFIG_MPENTIUM4=y
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_XADD=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=2
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# 

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Roman Zippel
Hi,

On Fri, 31 Aug 2007, Mike Galbraith wrote:

> I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
> loads, and hit some starvation.  If I got it in right (think so) there's
> a bug lurking somewhere.  taskset -c 1 fairtest2 resulted in the below.
> It starts up running both tasks at about 60/40 for hog/sleeper, then
> after a short while goes nuts.  The hog component eats 100% cpu and
> starves the sleeper (and events, forever).

Thanks for testing, although your test program does nothing unusual here.
Can you please send me your .config?
Were there some kernel messages while running it?

bye, Roman

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Roman Zippel
Hi,

On Fri, 31 Aug 2007, Ingo Molnar wrote:

> So the most intrusive (math) aspects of your patch have been implemented 
> already for CFS (almost a month ago), in a finegrained way.

Interesting claim, please substantiate.

> Peter's patches change the CFS calculations gradually over from
> 'normalized' to 'non-normalized' wait-runtime, to avoid the
> normalizing/denormalizing overhead and rounding error.

Actually it changes wait-runtime to a normalized value and it changes 
nothing about the rounding error I was talking about.
It addresses the conversion error between the different units I was 
mentioning in an earlier mail, but the value is still rounded.

> > This model is far more accurate than CFS is and doesn't add an error 
> > over time, thus there are no more underflow/overflow anymore within 
> > the described limits.
> 
> ( your characterisation errs in that it makes it appear to be a common
>   problem, while in practice it's only a corner-case limited to extreme 
>   negative nice levels and even there it needs a very high rate of 
>   scheduling and an artificially constructed workload: several hundreds 
>   of thousand of context switches per second with a yield-ing loop to be 
>   even measurable with unmodified CFS. So this is not a 2.6.23 issue at 
>   all - unless there's some testcase that proves the opposite. )
> 
> with Peter's queue there are no underflows/overflows either anymore in 
> any synthetic corner-case we could come up with. Peter's queue works 
> well but it's 2.6.24 material.

Did you even try to understand what I wrote?
I didn't say that it's a "common problem", it's a conceptual problem. The 
rounding has been improved lately, so it's not as easy to trigger with 
some simple busy loops.
Peter's patches don't remove limit_wait_runtime() and AFAICT they can't, 
so I'm really amazed how you can make such claims.

> All in all, we don't disagree, this is an incremental improvement we are 
> thinking about for 2.6.24. We do disagree with this being positioned as 
> something fundamentally different though - it's just the same thing 
> mathematically, expressed without a "/weight" divisor, resulting in no 
> change in scheduling behavior. (except for a small shift of CPU 
> utilization for a synthetic corner-case)

Every time I'm amazed how quickly you get to your judgements... :-(
Especially interesting is that you didn't need to ask a single question 
for that, which would imply you actually understood what I wrote; OTOH 
your wild claims tell me something completely different.

BTW who is "we" and how is it possible that this meta mind can come to 
such quick judgements?

The basic concept is quite different; one can e.g. see that I have
to calculate some of the key CFS variables just for the debug output.
The concepts are related, but they are definitely not "the same thing 
mathematically"; the method of resolution is quite different, and if you 
think otherwise then please _prove_ it.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Ingo Molnar

* Roman Zippel <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I'm glad to announce a working prototype of the basic algorithm I 
> already suggested last time. As I already tried to explain previously 
> CFS has a considerable algorithmic and computational complexity. [...]

hey, thanks for working on this, it's appreciated! In terms of merging 
your stuff, your patch looks a bit large and because you did not tell us 
that you were coding in this area, you probably missed Peter Zijlstra's 
excellent CFS work:

  http://programming.kicks-ass.net/kernel-patches/sched-cfs/

The following portion of Peter's series does much of the core math 
changes of what your patch provides (and which makes up for most of the 
conceptual delta in your patch), on a step by step basis:

  sched-update_weight_inv.patch
  sched-se_load.patch
  sched-se_load-2.patch
  sched-64-wait_runtime.patch
  sched-scaled-limit.patch
  sched-wait_runtime-scale.patch
  sched-calc_delta.patch

So the most intrusive (math) aspects of your patch have been implemented 
already for CFS (almost a month ago), in a finegrained way.

Peter's patches change the CFS calculations gradually over from 
'normalized' to 'non-normalized' wait-runtime, to avoid the 
normalizing/denormalizing overhead and rounding error. Turn off sleeper 
fairness, remove the limit code and we should arrive at something quite 
close to the core math in your patch, with similar rounding properties 
and similar overhead/complexity. (there are some other small details in 
the math but this is the biggest item by far.) I find Peter's series 
very understandable and he outlined the math arguments in his replies to 
your review mails. (would be glad to re-open those discussions though if 
you still think there are disagreements.)

Peter's work fully solves the rounding corner-cases you described as:

> This model is far more accurate than CFS is and doesn't add an error 
> over time, thus there are no more underflow/overflow anymore within 
> the described limits.

( your characterisation errs in that it makes it appear to be a common
  problem, while in practice it's only a corner-case limited to extreme 
  negative nice levels and even there it needs a very high rate of 
  scheduling and an artificially constructed workload: several hundreds 
  of thousand of context switches per second with a yield-ing loop to be 
  even measurable with unmodified CFS. So this is not a 2.6.23 issue at 
  all - unless there's some testcase that proves the opposite. )

with Peter's queue there are no underflows/overflows either anymore in 
any synthetic corner-case we could come up with. Peter's queue works 
well but it's 2.6.24 material.

Non-normalized wait-runtime is simply a different unit (resulting in 
slightly higher context-switch performance); the principle and the end 
result do not change.

All in all, we don't disagree, this is an incremental improvement we are 
thinking about for 2.6.24. We do disagree with this being positioned as 
something fundamentally different though - it's just the same thing 
mathematically, expressed without a "/weight" divisor, resulting in no 
change in scheduling behavior. (except for a small shift of CPU 
utilization for a synthetic corner-case)

And if we handled that fundamental aspect via Peter's queue, all the 
remaining changes you did can be done (and considered and merged) 
evolutionarily instead of revolutionarily, ontop of CFS - this should 
cut down on the size and the impact of your changes significantly!

So if you'd like to work with us on this and get items that make sense 
merged (which we'd very much like to see happen), could you please 
re-shape the rest of your changes and ideas (such as whether to use 
ready-queueing or a runqueue concept, which does look interesting) ontop 
of Peter's queue, and please do it as a finegrained, reviewable, 
mergable series of patches, like Peter did. Thanks!

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Mike Galbraith
On Fri, 2007-08-31 at 04:05 +0200, Roman Zippel wrote:
> Hi,

Greetings,

> I'm glad to announce a working prototype of the basic algorithm I
> already suggested last time.

(finding it difficult to resist the urge to go shopping, I
fast-forwarded to test drive... grep shopping arch/i386/kernel/tcs.c if
you're curious;)

I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
loads, and hit some starvation.  If I got it in right (think so) there's
a bug lurking somewhere.  taskset -c 1 fairtest2 resulted in the below.
It starts up running both tasks at about 60/40 for hog/sleeper, then
after a short while goes nuts.  The hog component eats 100% cpu and
starves the sleeper (and events, forever).

runnable tasks:
 task          PID        tree-key           delta         waiting  switches  prio  sum-exec  sum-wait  sum-sleep  wait-overrun  wait-underrun
------------------------------------------------------------------------------------------------------------------------------
 events/1         8  13979193020350  -3984516524180    541606276813      2014   115         0         0          0             0              0
 R fairtest2   7984  10027571241955  -7942765479096   5707836335486       294   120         0         0          0             0              0
   fairtest2   7989  13539381091732  -4424328443109   8147144513458       286   120         0         0          0             0              0

taskset -c 1 massive_intr 3  produces much saner looking numbers,
and is fair...

runnable tasks:
 task            PID        tree-key       delta  waiting  switches  prio  sum-exec  sum-wait  sum-sleep  wait-overrun  wait-underrun
------------------------------------------------------------------------------------------------------------------------------
   massive_intr  12808  29762662234003   21699  -506538      4650   120         0         0          0             0              0
 R massive_intr  12809  29762662225939    -687    53406      4633   120         0         0          0             0              0
   massive_intr  12810  29762662220183  7879453097           4619   120         0         0          0             0              0

// gcc -O2 -o fiftyp fiftyp.c -lrt
// code from interbench.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/resource.h>
int forks=1;
int runus,sleepus=7000;
unsigned long loops_per_ms;
void terminal_error(const char *name)
{
	fprintf(stderr, "\n");
	perror(name);
	exit (1);
}

unsigned long long get_nsecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		terminal_error("clock_gettime");
	return (myts->tv_sec * 1000000000 + myts->tv_nsec);
}

void burn_loops(unsigned long loops)
{
	unsigned long i;

	/*
	 * We need some magic here to prevent the compiler from optimising
	 * this loop away. Otherwise trying to emulate a fixed cpu load
	 * with this loop will not work.
	 */
	for (i = 0 ; i < loops ; i++)
	 asm volatile("" : : : "memory");
}

/* Use this many usecs of cpu time */
void burn_usecs(unsigned long usecs)
{
	unsigned long ms_loops;

	ms_loops = loops_per_ms / 1000 * usecs;
	burn_loops(ms_loops);
}

void microsleep(unsigned long long usecs)
{
	struct timespec req, rem;

	rem.tv_sec = rem.tv_nsec = 0;

	req.tv_sec = usecs / 1000000;
	req.tv_nsec = (usecs - (req.tv_sec * 1000000)) * 1000;
continue_sleep:
	if ((nanosleep(&req, &rem)) == -1) {
		if (errno == EINTR) {
			if (rem.tv_sec || rem.tv_nsec) {
req.tv_sec = rem.tv_sec;
req.tv_nsec = rem.tv_nsec;
goto continue_sleep;
			}
			goto out;
		}
		terminal_error("nanosleep");
	}
out:
	return;
}

/*
 * In an unoptimised loop we try to benchmark how many meaningless loops
 * per second we can perform on this hardware to fairly accurately
 * reproduce certain percentage cpu usage
 */
void calibrate_loop(void)
{
	unsigned long long start_time, loops_per_msec, run_time = 0;
	unsigned long loops;
	struct timespec myts;

	loops_per_msec = 1000000;
redo:
	/* Calibrate to within 1% accuracy */
	while (run_time > 1010000 || run_time < 990000) {
		loops = loops_per_msec;
		start_time = get_nsecs(&myts);
		burn_loops(loops);
		run_time = get_nsecs(&myts) - start_time;
		loops_per_msec = (1000000 * loops_per_msec / run_time ? :
			loops_per_msec);
	}

	/* Rechecking after a pause increases reproducibility */
	sleep(1);
	loops = loops_per_msec;
	start_time = get_nsecs(&myts);
	burn_loops(loops);
	run_time = get_nsecs(&myts) - start_time;

	/* Tolerate 5% difference on checking */
	if (run_time > 1050000 || run_time < 950000)
		goto redo;
 loops_per_ms=loops_per_msec;
 sleep(1);
 start_time=get_nsecs(&myts);
 microsleep(sleepus);
 run_time=get_nsecs(&myts)-start_time;
 runus=run_time/1000;
}

/*
 * chew.c: original idea 
