Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. Can you resend it, please? I'm confused... TIA, David
Re: Scheduler improvements, take 1001, Patch 2/5
On 2012/10/15 16:18, David Coppa wrote: On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. Can you resend it, please? I'm confused... TIA, David Best to send the diff, with the accompanying text, in the same email, and make sure they still all apply. I tried testing some of these but didn't manage to get them to apply (either some conflicting change, or they were in the wrong order and stacked up on top of each other or something).
Re: Scheduler improvements, take 1001, Patch 2/5
On Monday 15 October 2012, Stuart Henderson wrote: Best to send the diff, with the accompanying text, in the same email, and make sure they still all apply. I tried testing some of these but didn't manage to get them to apply (either some conflicting change, or they were in the wrong order and stacked up on top of each other or something). As you mentioned using git in another mail, I would recommend using git's format-patch to create the mails, maybe with the --no-prefix option, because that seems to be the preferred patch format here. Also, interactive rebase (git rebase -i) is nice for adjusting patches and commit messages (in case you didn't know that one). Cheers, Stefan
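Stefan's suggested workflow can be sketched with a throwaway repository (the repo, file names, and commit messages below are made up for illustration):

```shell
# Build a disposable two-commit repo, then emit one mail-ready patch per commit.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"

echo a > sched_bsd.c
git add . && git commit -qm "reduce time slice length by 50%"
echo b >> sched_bsd.c
git commit -qam "replace run queues with an RB tree"

# --no-prefix drops the a/ b/ path prefixes so the diffs apply with plain patch(1)
git format-patch --no-prefix -2
ls 000*.patch
```

Each generated file carries a Subject: line taken from the commit message, which addresses Stuart's complaint about matching subjects to diffs.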
Re: Scheduler improvements, take 1001, Patch 2/5
On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing?
Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 2/5
Gregor, you would perhaps get better feedback if it were easier to discern where your patches are and what each one is doing. If you can't be bothered to keep the subjects matching the diffs and send stuff out with subjects like "scheduler improvement diff X" instead of something like "reduce time slice length by 50%", it makes it a lot more effort for someone to read and test. On Oct 14, 2012 9:01 AM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
On Tue, Oct 9, 2012 at 9:27 AM, Gregor Best g...@ring0.de wrote: This patch moves struct schedstate_percpu to kernel land, which I think is cleaner than exposing structures for scheduler state to userland, especially since grepping for 'schedstate' in /usr/src yielded no results outside of /usr/src/sys. I think this is a misunderstanding about the intent of _KERNEL. There are, IMO, two reasons to condition parts of header files on _KERNEL: 1) the header's contents have been specified by a de facto or de jure standard, but for its own reasons the kernel wants to place additional info there. 2) the header defines kernel data structures that tools may need (for crash dump debugging, for example) as well as external symbols like functions and variables, declarations that can't be used by userspace. sys/sched.h isn't a standardized header and isn't pulled in by a standardized header, so (1) isn't an issue. So, only a non-standard program is going to be pulling in sys/sched.h anyway. Unless there's a reason otherwise, all types and #defines should be exposed to it. (Yeah, the use of CP_* and CPUSTATES by the KERN_CPTIME* sysctl()s is less than ideal, but this isn't the right way to fix that, IMO.) Philip Guenther
Scheduler improvements, take 1001
(By popular request as a new thread). Hi people, I've tried splitting my scheduler patch into smaller fragments, and here's the result. I changed a few things people mentioned over the last few days, such as the following: 1) sys/proc.h now includes sys/tree.h, which should make libc builds work again. 2) deadline generation now takes process priorities into account, as suggested by ratchov@. The way it's done now, processes can use their sleep priority as a way to lower their nice value for short periods of time. I didn't notice any real changes, but I'd love to hear from people with more demanding applications. 3) schedstate_percpu is private to the kernel now, as I couldn't find a single occurrence of `struct schedstate_percpu` outside of /usr/src/sys and it seemed cleaner not to expose kernel data to userland in such a broad way. The patches will follow as single emails. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/sched_bsd.c b/kern/sched_bsd.c
index 172bb8f..c7121dc 100644
--- a/kern/sched_bsd.c
+++ b/kern/sched_bsd.c
@@ -77,12 +77,12 @@ scheduler_start(void)

 	timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);

-	rrticks_init = hz / 10;
+	rrticks_init = hz / 20;
 	schedcpu(&schedcpu_to);
 }

 /*
- * Force switch among equal priority processes every 100ms.
+ * Force switch among equal priority processes every 50ms.
  */
 void
 roundrobin(struct cpu_info *ci)
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/kern_clock.c b/kern/kern_clock.c
index 843965b..f598afc 100644
--- a/kern/kern_clock.c
+++ b/kern/kern_clock.c
@@ -233,7 +233,7 @@ hardclock(struct clockframe *frame)
 	if (stathz == 0)
 		statclock(frame);

-	if (--ci->ci_schedstate.spc_rrticks <= 0)
+	if (p && (--(p->p_rrticks) <= 0))
 		roundrobin(ci);

 	/*
diff --git a/kern/kern_proc.c b/kern/kern_proc.c
index ad861c8..e0d5536 100644
--- a/kern/kern_proc.c
+++ b/kern/kern_proc.c
@@ -398,8 +398,6 @@ proc_printit(struct proc *p, const char *modif, int (*pr)(const char *, ...))
 	    p->p_comm, p->p_pid, pst, p->p_flag, P_BITS);
 	(*pr)("pri=%u, usrpri=%u, nice=%d\n",
 	    p->p_priority, p->p_usrpri, p->p_p->ps_nice);
-	(*pr)("forw=%p, list=%p,%p\n",
-	    TAILQ_NEXT(p, p_runq), p->p_list.le_next, p->p_list.le_prev);
 	(*pr)("process=%p user=%p, vmspace=%p\n",
 	    p->p_p, p->p_addr, p->p_vmspace);
 	(*pr)("estcpu=%u, cpticks=%d, pctcpu=%u.%u, swtime=%u\n",
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 253226a..79eb28c 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -24,11 +24,22 @@
 #include <sys/resourcevar.h>
 #include <sys/signalvar.h>
 #include <sys/mutex.h>
+#include <sys/tree.h>

 #include <uvm/uvm_extern.h>

 #include <sys/malloc.h>

+static int
+sched_cmp_proc(struct proc *a, struct proc *b) {
+	if (a == b)
+		return 0;
+	if (timercmp(&(a->p_deadline), &(b->p_deadline), <))
+		return -1;
+	return 1;
+}
+
+RB_GENERATE_STATIC(prochead, proc, p_runq, sched_cmp_proc);

 void sched_kthreads_create(void *);
@@ -79,10 +90,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
 	struct schedstate_percpu *spc = &ci->ci_schedstate;
-	int i;

-	for (i = 0; i < SCHED_NQS; i++)
-		TAILQ_INIT(&spc->spc_qs[i]);
+	RB_INIT(&spc->spc_runq);

 	spc->spc_idleproc = NULL;
@@ -158,18 +167,19 @@ sched_idle(void *v)
 		cpuset_add(&sched_idle_cpus, ci);
 		cpu_idle_enter();
-		while (spc->spc_whichqs == 0) {
+
+		while (curcpu_is_idle()) {
 			if (spc->spc_schedflags & SPCF_SHOULDHALT &&
-			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
+			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
 				cpuset_del(&sched_idle_cpus, ci);
 				SCHED_LOCK(s);
-				atomic_setbits_int(&spc->spc_schedflags,
-				    spc->spc_whichqs ? 0 : SPCF_HALTED);
+				atomic_setbits_int(&spc->spc_schedflags,
+				    SPCF_HALTED);
 				SCHED_UNLOCK(s);
 				wakeup(spc);
 			}
 			cpu_idle_cycle();
 		}
+
 		cpu_idle_leave();
 		cpuset_del(&sched_idle_cpus, ci);
 	}
@@ -222,14 +232,13 @@ void
 setrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun++;

-	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
-	spc->spc_whichqs |= (1 << queue);
+	KASSERT(!RB_FIND(prochead, &spc->spc_runq, p));
+	RB_INSERT(prochead, &spc->spc_runq, p);

 	cpuset_add(&sched_queued_cpus, p->p_cpu);
 	if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -240,38 +249,29 @@ void
 remrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun--;

-	TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
-	if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
-		spc->spc_whichqs &= ~(1 << queue);
-		if (spc->spc_whichqs == 0)
-			cpuset_del(&sched_queued_cpus, p->p_cpu);
-	}
+	KASSERT(RB_REMOVE(prochead, &spc->spc_runq, p));
+	if (RB_EMPTY(&spc->spc_runq))
+		cpuset_del(&sched_queued_cpus, p->p_cpu);
 }

 struct proc *
 sched_chooseproc(void)
 {
 	struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
-	struct proc *p;
-	int queue;
+	struct proc *p, *p_tmp = NULL;

 	SCHED_ASSERT_LOCKED();

 	if (spc->spc_schedflags & SPCF_SHOULDHALT) {
-		if (spc->spc_whichqs) {
-			for (queue = 0; queue < SCHED_NQS; queue++) {
-				TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
-					remrunqueue(p);
-					p->p_cpu = sched_choosecpu(p);
-					setrunqueue(p);
-				}
-			}
+		RB_FOREACH_SAFE(p, prochead, &spc->spc_runq, p_tmp) {
+			remrunqueue(p);
+
Re: Scheduler improvements, take 1001, Patch 5/5
diff --git a/sys/sched.h b/sys/sched.h
index fb01f21..1784ee2 100644
--- a/sys/sched.h
+++ b/sys/sched.h
@@ -69,8 +69,10 @@
 #ifndef	_SYS_SCHED_H_
 #define	_SYS_SCHED_H_

+#ifdef _KERNEL
 #include <sys/queue.h>
 #include <sys/tree.h>
+#endif

 /*
  * Posix defines a sched.h which may want to include sys/sched.h
@@ -88,11 +90,9 @@
 #define CP_IDLE		4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
+#ifdef _KERNEL
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
 	struct timeval spc_runtime;	/* time curproc started running */
@@ -107,15 +107,13 @@ struct schedstate_percpu {
 	u_int spc_nrun;			/* procs on the run queues */
 	fixpt_t spc_ldavg;		/* shortest load avg. for this cpu */

-	RB_HEAD(prochead, proc) spc_runq;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
 	LIST_HEAD(,proc) spc_deadproc;
-};

-#ifdef _KERNEL
+	RB_HEAD(prochead, proc) spc_runq;
+};

 /* spc_flags */
 #define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 4/5
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 12e48d6..99501a1 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,9 +102,11 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;

+#define ARCH_HAVE_CPU_TOPOLOGY
 	u_int32_t	ci_smt_id;
 	u_int32_t	ci_core_id;
 	u_int32_t	ci_pkg_id;
+
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 79eb28c..072ef38 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -496,6 +496,10 @@ int sched_cost_load = 1;
 int sched_cost_priority = 1;
 int sched_cost_runnable = 3;
 int sched_cost_resident = 1;
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+int sched_cost_diffcore = 2;	/* cost for moving to a different core */
+int sched_cost_diffpkg = 3;	/* cost for moving to a different package */
+#endif

 int
 sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
@@ -536,6 +540,13 @@ sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
 		cost -= l2resident * sched_cost_resident;
 	}

+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+	if (p->p_cpu->ci_pkg_id != ci->ci_pkg_id)
+		cost *= sched_cost_diffpkg;
+	else if (p->p_cpu->ci_core_id != ci->ci_core_id)
+		cost *= sched_cost_diffcore;
+#endif
+
 	return (cost);
 }
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 3/5
diff --git a/arch/amd64/amd64/identcpu.c b/arch/amd64/amd64/identcpu.c
index c597bb0..982c2bb 100644
--- a/arch/amd64/amd64/identcpu.c
+++ b/arch/amd64/amd64/identcpu.c
@@ -210,6 +210,8 @@ void (*setperf_setup)(struct cpu_info *);

 void via_nano_setup(struct cpu_info *ci);

+void cpu_topology(struct cpu_info *ci);
+
 void
 via_nano_setup(struct cpu_info *ci)
 {
@@ -479,4 +481,123 @@ identifycpu(struct cpu_info *ci)
 		sensordev_install(&ci->ci_sensordev);
 #endif
 	}
+
+	cpu_topology(ci);
+}
+
+/*
+ * Base 2 logarithm of an int. returns 0 for 0 (yeye, I know).
+ */
+static int
+log2(unsigned int i)
+{
+	int ret = 0;
+
+	while (i >>= 1)
+		ret++;
+
+	return (ret);
+}
+
+static int
+mask_width(u_int x)
+{
+	int bit;
+	int mask;
+	int powerof2;
+
+	powerof2 = ((x - 1) & x) == 0;
+	mask = (x << (1 - powerof2)) - 1;
+
+	/* fls */
+	if (mask == 0)
+		return (0);
+	for (bit = 1; mask != 1; bit++)
+		mask = (unsigned int)mask >> 1;
+
+	return (bit);
+}
+
+/*
+ * Build up cpu topology for given cpu, must run on the core itself.
+ */
+void
+cpu_topology(struct cpu_info *ci)
+{
+	u_int32_t eax, ebx, ecx, edx;
+	u_int32_t apicid, max_apicid, max_coreid;
+	u_int32_t smt_bits, core_bits, pkg_bits;
+	u_int32_t smt_mask, core_mask, pkg_mask;

+	/* We need at least apicid at CPUID 1 */
+	CPUID(0, eax, ebx, ecx, edx);
+	if (eax < 1)
+		goto no_topology;

+	/* Initial apicid */
+	CPUID(1, eax, ebx, ecx, edx);
+	apicid = (ebx >> 24) & 0xff;

+	if (strcmp(cpu_vendor, "AuthenticAMD") == 0) {
+		/* We need at least apicid at CPUID 0x80000008 */
+		CPUID(0x80000000, eax, ebx, ecx, edx);
+		if (eax < 0x80000008)
+			goto no_topology;

+		CPUID(0x80000008, eax, ebx, ecx, edx);
+		core_bits = (ecx >> 12) & 0xf;
+		if (core_bits == 0)
+			goto no_topology;
+		/* So coreidsize 2 gives 3, 3 gives 7... */
+		core_mask = (1 << core_bits) - 1;
+		/* Core id is the least significant considering mask */
+		ci->ci_core_id = apicid & core_mask;
+		/* Pkg id is the upper remaining bits */
+		ci->ci_pkg_id = apicid & ~core_mask;
+		ci->ci_pkg_id >>= core_bits;
+	} else if (strcmp(cpu_vendor, "GenuineIntel") == 0) {
+		/* We only support leaf 1/4 detection */
+		CPUID(0, eax, ebx, ecx, edx);
+		if (eax < 4)
+			goto no_topology;
+		/* Get max_apicid */
+		CPUID(1, eax, ebx, ecx, edx);
+		max_apicid = (ebx >> 16) & 0xff;
+		/* Get max_coreid */
+		CPUID2(4, 0, eax, ebx, ecx, edx);
+		max_coreid = ((eax >> 26) & 0x3f) + 1;
+		/* SMT */
+		smt_bits = mask_width(max_apicid / max_coreid);
+		smt_mask = (1 << smt_bits) - 1;
+		/* Core */
+		core_bits = log2(max_coreid);
+		core_mask = (1 << (core_bits + smt_bits)) - 1;
+		core_mask ^= smt_mask;
+		/* Pkg */
+		pkg_bits = core_bits + smt_bits;
+		pkg_mask = -1 << core_bits;

+		ci->ci_smt_id = apicid & smt_mask;
+		ci->ci_core_id = (apicid & core_mask) >> smt_bits;
+		ci->ci_pkg_id = (apicid & pkg_mask) >> pkg_bits;
+	} else
+		goto no_topology;
+#ifdef DEBUG
+	printf("cpu%d: smt %u, core %u, pkg %u "
+	    "(apicid 0x%x, max_apicid 0x%x, max_coreid 0x%x, smt_bits 0x%x, "
+	    "smt_mask 0x%x, core_bits 0x%x, core_mask 0x%x, pkg_bits 0x%x, "
+	    "pkg_mask 0x%x)\n", ci->ci_cpuid, ci->ci_smt_id, ci->ci_core_id,
+	    ci->ci_pkg_id, apicid, max_apicid, max_coreid, smt_bits, smt_mask,
+	    core_bits, core_mask, pkg_bits, pkg_mask);
+#else
+	printf("cpu%d: smt %u, core %u, package %u\n", ci->ci_cpuid,
+	    ci->ci_smt_id, ci->ci_core_id, ci->ci_pkg_id);
+#endif
+	return;
+	/* We can't map, so consider ci_core_id as ci_cpuid */
+no_topology:
+	ci->ci_smt_id = 0;
+	ci->ci_core_id = ci->ci_cpuid;
+	ci->ci_pkg_id = 0;
+}
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 9ce437a..12e48d6 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,6 +102,9 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;

+	u_int32_t	ci_smt_id;
+	u_int32_t	ci_core_id;
+	u_int32_t	ci_pkg_id;
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/arch/amd64/include/specialreg.h b/arch/amd64/include/specialreg.h
index 142fbbc..cab0985 100644
---
Re: Scheduler improvements, take 1001, Patch 2/5
This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches do not depend on it, but I figured I'd throw it in anyway. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 3/5
This patch simply imports Christiano's code for detecting CPU topology, as posted on tech@ a while (more than two months) ago. I took it verbatim and didn't change anything yet. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 4/5
This patch uses the previous one to take CPU topology into account when calculating the cost of moving a process between CPUs. This is only done on amd64 at the moment, and the cost factors are guesses right now, but it's a start. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
This patch moves struct schedstate_percpu to kernel land, which I think is cleaner than exposing structures for scheduler state to userland, especially since grepping for 'schedstate' in /usr/src yielded no results outside of /usr/src/sys. I have not seen any negative impact from this, but I haven't yet run a full userland build (it's running at the moment, but the machine I'm building on is a bit slower than my laptop). -- Gregor Best
Re: Scheduler improvements
Hi, Seems to be a bit snappier, though more thorough testing needed. Many web pages became more usable after the patch (chrome as browser) and some gains in desktop too. System is amd64 @ Intel(R) Core(TM)2 Duo CPU T5870 @ 2.00GHz and 4G RAM. -- cheers, Ville
Re: Scheduler improvements
log2 should probably be scrapped, since you're reinventing ffs.
Re: Scheduler improvements
On 8 October 2012 11:24, Marc Espie es...@nerim.net wrote: log2 should probably be scrapped, since you're reinventing ffs. That is actually my code; back then I failed to find an ffs. I'll try to have a look at the code this week.
Re: Scheduler improvements
On Sat, Oct 06, 2012 at 10:38:34PM +0200, Gregor Best wrote: Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (e.g. an audio interrupt). When the event occurs, the process is put on the runnable queue calculated from that priority. This way, interactive processes (i.e. those waiting for external events) are prioritized during event processing. This mechanism ensures audio and midi are usable without having to manually tweak priorities. Furthermore, priorities change depending on what the process is doing. The most notable exception to the above rule is the poll(2) implementation, which still ignores priorities; it may need to be fixed one day, but this is not the most urgent problem hurting interactive processes. -- Alexandre
Re: Scheduler improvements
On Oct 8, 2012 6:33 PM, Alexandre Ratchov a...@caoua.org wrote: On Sat, Oct 06, 2012 at 10:38:34PM +0200, Gregor Best wrote: Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (e.g. an audio interrupt). When the event occurs, the process is put on the runnable queue calculated from that priority. This way, interactive processes (i.e. those waiting for external events) are prioritized during event processing. This mechanism ensures audio and midi are usable without having to manually tweak priorities. Furthermore, priorities change depending on what the process is doing. The most notable exception to the above rule is the poll(2) implementation, which still ignores priorities; it may need to be fixed one day, but this is not the most urgent problem hurting interactive processes. -- Alexandre Ratchov is correct: when a process sleeps, you specify the priority it should have when waking up. I think Gregor was referring to the old version of his code. The 4.4BSD scheduler relies heavily on these priority boosts. I still didn't have a look at Gregor's code, but an equivalent mechanism is usually needed. In CFS scheduling, for example, the process forces a new delta calculation and gets a fraction proportional to all other runnable processes.
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 06:28:53PM +0200, Alexandre Ratchov wrote: [...] AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (eg an audio interrupt). When the event occurs, the process is put in the runnable queue calculated from the priority. This way, interactive processes (i.e. waiting for external events) are prioritized during the event processing. [...] I haven't noticed any degradation in performance with media-stuff (quite the contrary, actually), but you have a point. I'll change the deadline-calculation to also take the sleep priority into account. Thanks for the explanation :) -- Gregor Best
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 18:43, Christiano F. Haesbaert wrote: Ratchov is correct, when a process sleeps you specify the priority you should have when waking up. I think george was referring to the old version of his code. The 4.4 bsd scheduler relies heavily on these priority boosts, i still didnt have a look at George's code, but an equivalent mechanism is usually needed. On cfs scheduling for example the processes forces a new delta calculation and gets a fraction proportional to all other runnables processes. I've wondered how effective (or more importantly, accurate) these priorities are. Maintenance has been somewhat haphazard, e.g., I have no idea when the last use of PVFS was removed, but there aren't any left except for the definition in param.h.
Re: Scheduler improvements
On Sun, Oct 07, 2012 at 04:04:16PM +0200, Gregor Best wrote: [...] I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave(). [...] Fair point. I'll change that :) Your diff doesn't pass userland compiles. You're adding a dependency on tree.h in proc.h. In kernel land on amd64, that's okay, since machine/param.h will pull machine/cpu.h, which pulls sys/sched.h, which now pulls sys/tree.h, but it definitely doesn't fly in userland... my build crashed and burned in libc... assuming that's okay (not usually a kernel guy), you can go include sys/tree.h directly in proc.h... Please, please, please, don't vanish into the woods for the next 6 months. Try splitting up your diff and working them into pieces kernel hackers can work with...
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 07:42:13PM +0200, Marc Espie wrote: [...] Your diff doesn't pass userland compiles. You're adding a dependency on tree.h in proc.h. In kernel land on amd64, that's okay, since machine/param.h will pull machine/cpu.h, which pulls sys/sched.h, which now pulls sys/tree.h, but it definitely doesn't fly in userland... my build crashed and burned in libc... assuming that's okay (not usually a kernel guy), you can go include sys/tree.h directly in proc.h... [...] I think wrapping all the things that need sys/tree.h in #ifdef _KERNEL should do the trick, since they don't really make sense in userland anyway. Please, please, please, don't vanish into the woods for the next 6 months. [...] Since I now have way more spare time on my hands than before, that should not be an issue anymore, don't worry :) Try splitting up your diff and working them into pieces kernel hackers can work with... [...] I will. Actually, the diff is a bunch of separate git commits on my machine, so that shouldn't be too hard to do. -- Gregor Best
Re: Scheduler improvements
[...] I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave(). [...] Fair point. I'll change that :) -- Gregor Best
Re: Scheduler improvements
Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. [...] this change is unrelated to the rest, isn't it? [...] Yes, it is. I changed it because 100ms time slices felt a bit too large, but the rest of the patch will of course work without this bit. Thanks for taking the time to look at this :) -- Gregor Best
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. I know my report is not a benchmark of any kind, but I do see a slight improvement when running a full GNOME 3 installation. IPIs never go past 10K, where they are regularly around 50K without this patch (confirmed with 2 different amd64 boxes). I can't really see any interactive enhancements, but I don't think that's really the point. That said, GNOME 3 is a nice interactive test for that kind of thing: it's heavily threaded, and currently the SP kernel by far outperforms the MP one. Please count me as a guinea pig for any scheduler patch, because my desktop machines have become barely usable with an MP kernel since we switched to rthreads, and I would very much like to see some improvements in that area. I'll keep running with this (and any other mods) for the upcoming weeks, and if I have time to do some real and useful benchmarking or see any regression, I'll let you know. Thanks. -- Antoine
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: Hi people, after a (very) long time of silence on this, here's another go at it. This time, I basically started from scratch and used a bit of code by Christiano Haesbaert which had been posted to tech@ a while ago to detect CPU topology on amd64 and take that into account when moving processes between CPUs. This version has one single queue per CPU, getting rid of a) the one single system-wide runqueue and b) the queue for expired processes. This simplifies things a bit and performs just as well as my previous versions (the only difference is the order in which expired procs get selected for running on a CPU). One advantage is that process selection is in O(log n) of the number of processes on the CPU and depends neither on the total number of processes nor the number of expired processes in the runqueue. The factors for the cost of moving a process between hardware threads, cpu dies and cpu packages are guesses now; I think those will have to be tuned further. Sadly, I haven't had access to a multiprocessor machine with a more diverse architecture than a bunch of cores on the same die. I tested this on some more machines than before: a Core i5, an i7 and my Core 2 Duo, and on all machines (perceived) interactivity was improved. The simplest benchmark I used was playing back a 1080p version of Big Buck Bunny with mplayer. All machines I tested on had Intel graphics cards, one GM965 (on the Core2Duo) and the others were Sandy Bridge devices. On all of them, playback was smoother, most visibly on the i7. With the default scheduler, watching the movie was a big pain due to heavy frame-dropping; with my patch, the movie was watchable (with frame dropping only (barely) noticeable in scenes with much movement). As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like.
With your patch, html5 videos on firefox are smoother; playback is even better with firefox + dpb running 2 jobs than with only firefox without your patch installed. I use i3wm, so I can't comment on the performance of a full desktop environment. My hardware: AMD A4-3400, 8GB, ATI Radeon HD 4350. Please keep working on this area :) -- Juan Francisco Cantero Hurtado http://juanfra.info
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote:

@@ -222,14 +230,13 @@
 setrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun++;

-	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
-	spc->spc_whichqs |= (1 << queue);
+	KASSERT(!RB_FIND(prochead, &spc->spc_runq, p));
+	RB_INSERT(prochead, &spc->spc_runq, p);

 	cpuset_add(&sched_queued_cpus, p->p_cpu);
 	if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -240,38 +247,29 @@
 remrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun--;

-	TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
-	if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
-		spc->spc_whichqs &= ~(1 << queue);
-		if (spc->spc_whichqs == 0)
-			cpuset_del(&sched_queued_cpus, p->p_cpu);
-	}
+	KASSERT(RB_REMOVE(prochead, &spc->spc_runq, p));
+	if (RB_EMPTY(&spc->spc_runq))
+		cpuset_del(&sched_queued_cpus, p->p_cpu);
 }

This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. haven't tested this though

Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.30
diff -u -r1.30 sched_bsd.c
--- kern/sched_bsd.c	9 Jul 2012 17:27:32 -0000	1.30
+++ kern/sched_bsd.c	4 Oct 2012 21:27:58 -0000
@@ -77,20 +77,18 @@

 	timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);

-	rrticks_init = hz / 10;
+	rrticks_init = hz / 20;

this change is unrelated to the rest, isn't it?

-- Alexandre
Re: Scheduler improvements
It appears to have sped up porn. Movies on the machine seem a bit better. I will try this in a few other places. Just not at the mother-in-law's or in public places, no matter how impressed you are at the difference? -- ___ 'Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface' (Doug McIlroy) ___
Re: Scheduler improvements
On Fri Oct 5 2012 14:24, Antoine Jacoutot wrote: On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. I know my report is not a benchmark of any kind but I do see a slight improvement when running a full GNOME 3 installation. IPIs never go past 10K where they are regularly around 50K without this patch (confirmed with 2 different amd64 boxes). I can confirm this, too. GNOME 3 generated around 80,000 to over 100,000 IPIs, simply by moving the mouse cursor over the menu entries for a few seconds! The desktop felt sluggish. With this patch, the same procedure feels less sluggish, and the IPI count dropped to 9,000 to 15,000. Further, the time it takes to compile the kernel more than halved, from ~10 min to 4.27 min. And, best of all, for the first time I'm able to watch HTML5 embedded videos smoothly, even in full screen mode.
Re: Scheduler improvements
Date: Thu, 4 Oct 2012 23:42:45 +0200 From: Gregor Best g...@ring0.de

Index: kern/kern_sched.c
===
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.27
diff -u -r1.27 kern_sched.c
--- kern/kern_sched.c	10 Jul 2012 18:20:37 -	1.27
+++ kern/kern_sched.c	4 Oct 2012 21:27:58 -
@@ -158,18 +167,17 @@
 		cpuset_add(&sched_idle_cpus, ci);
 		cpu_idle_enter();
-		while (spc->spc_whichqs == 0) {
-			if (spc->spc_schedflags & SPCF_SHOULDHALT &&
-			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
-				cpuset_del(&sched_idle_cpus, ci);
-				SCHED_LOCK(s);
-				atomic_setbits_int(&spc->spc_schedflags,
-				    spc->spc_whichqs ? 0 : SPCF_HALTED);
-				SCHED_UNLOCK(s);
-				wakeup(spc);
-			}
-			cpu_idle_cycle();
+
+		if (spc->spc_schedflags & SPCF_SHOULDHALT &&
+		    (spc->spc_schedflags & SPCF_HALTED) == 0) {
+			cpuset_del(&sched_idle_cpus, ci);
+			SCHED_LOCK(s);
+			atomic_setbits_int(&spc->spc_schedflags, SPCF_HALTED);
+			SCHED_UNLOCK(s);
+			wakeup(spc);
 		}
+		cpu_idle_cycle();
+
 		cpu_idle_leave();
 		cpuset_del(&sched_idle_cpus, ci);
 	}
@@ -222,14 +230,13 @@

I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave().
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 23:42, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. A short one is building a kernel. Try before and after with make, make -j 2, and make -j 4. (or ncpu and ncpu * 2).
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 06:28:11PM -0400, Ted Unangst wrote: On Thu, Oct 04, 2012 at 23:42, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. A short one is building a kernel. Try before and after with make, make -j 2, and make -j 4. (or ncpu and ncpu * 2). A more realistic and useful one is rebuilding the full package tree with dpb, which is rather simple to try these days (just requires about 30GB of disk space) kernel will take a few minutes, dpb some hours to a few days...
Re: Scheduler improvements
On Fri, Jan 20, 2012 at 06:06:08PM +0100, Ariane van der Steldt wrote:

[...] * Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100. I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz? [...]

Hmm, maybe my writing is not as clear as I thought it to be. The 20ms vs 10ms was the second version of my patch vs. the first one. Mainline has timeslices of 100ms. They are already derived from hz in kern/sched_bsd.c:scheduler_start(). Originally, the relevant line read

	rrticks_init = hz / 10;

whereas now it is

	rrticks_init = hz / 50;

The first version of the patch had

	rrticks_init = hz / 100;

which led to rrticks_init being zero on machines with hz < 100. From the comments on the first version I gathered that one can rely on 50 <= hz <= 1024. I'm not too sure whether changing timeslices has any real benefit though. I noticed a bit less A/V lag in mplayer with smaller timeslices so I gave it a shot.

[...] You'll want to test this on a multi-socket machine as well (preferably a numa machine). You're currently migrating processes using your on-die L3 cache, which is considerably faster than memory. [...]

Yes, right. Christiano also mentioned that in an off-list mail. Maybe I should abuse one of the machines at my university for that...

[...] The above comparison will fail if: a != b, but a->p_deadline_set == b->p_deadline_set. Your code hides this from view, because of the a==b test before it, turning it into a rare occurrence. However when it hits during insert or remove, you may find your tree getting corrupted and your invariants getting violated. [...]

Right. Thanks for spotting that. I'll change that in the next iteration of the patch.

[...] Note: the above breaks RB_FIND and RB_NFIND, since a search clause can never compare equal. [...]
Since neither RB_FIND nor RB_NFIND is used (as far as I can see), I'd call that a good tradeoff.

[...]
+
+RB_GENERATE_STATIC(runqueue, proc, p_runq, sched_cmp_proc);
+RB_GENERATE_STATIC(expqueue, proc, p_expq, sched_cmp_proc_exp);
That's gonna add a lot of code. [...]

Hmm, is there any better way to do it? I couldn't think of any way to have two queues with a disjoint set of processes in each and their only difference being the comparison function.

[...] Btw, even though the implementation works fine without it, I'd really like if you could put a matching RB_PROTOTYPE_STATIC before the generator macro. Using GENERATE without PROTOTYPE is abuse of an implementation detail. [...]

Will be fixed.

[...]
+	if (!(RB_REMOVE(runqueue, &sched_runqueue, p)))
Is it ok to call this function for a process not on the run queue? Your code implies it is. If you don't intend that to happen, put a KASSERT in.
+		RB_REMOVE(expqueue, &sched_expqueue, p);
Please check the return value. I understand that if a process is on the runqueue, it must also be on the expqueue? [...]

No, runqueue and expqueue are disjoint. A process is either in the runqueue, if its deadline has not yet been reached, or in the expqueue, if its deadline is expired. Maybe I should think of a better name than runqueue, because essentially both RB trees together are the runqueue.

[...] tv_usec is measured in microseconds, msec suggests milliseconds. I'd change that to: .tv_usec = offset_msec % 1000 * 1000 [...]

D'oh! I'll fix that. Must've been my brain being a bit sleepy :)

[...] Thanks a lot for your time and suggestions :)

-- Gregor Best
Re: Scheduler improvements
On Wed, Jan 18, 2012 at 04:41:35PM +0100, Gregor Best wrote:

after a long time of silence here's a second iteration of the patch. I've addressed a few concerns voiced here:

* Process lookup and friends now have O(log n) runtime. I achieved that by abusing RB-trees as priority queues, since they have runtime O(log n) in all relevant algorithms.

I'm wondering if a heap tree isn't more suitable. But since they're currently not in the kernel, I guess the RB tree is the best solution.

* The algorithm for calculating a new deadline for a given process has been simplified and is now documented a bit better. It also derives the deadline offset from the value of hz (via rrticks_init) as suggested by Miod (?).
* CPU rlimits are now honored again. The relevant code did not change, the new patch just doesn't remove rlimit enforcement anymore.
* Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100.

I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz?

With recent improvements in the mainline scheduler and especially rthreads, the performance of the patched scheduler and mainline is now roughly similar, at least as far as throughput is concerned. I have the feeling that the system behaves snappier with my patch, but that might be some sort of placebo effect. I haven't yet come up with a reliable method to benchmark interactivity except for actually using the machine and doing stuff on it. It's interesting to note, however, that the patched scheduler achieves performance similar to the default one without all the fancy methods for calculating how expensive it is to move a process from one CPU to another, or related methods for preserving cache locality.

You'll want to test this on a multi-socket machine as well (preferably a numa machine).
You're currently migrating processes using your on-die L3 cache, which is considerably faster than memory.

I use the patched scheduler exclusively on my Core2Duo machine with an MP build. The amount of lines removed versus added by this patch shifted towards more added lines but is still at 173 lines less than the default. Once again, comments, rants, insults, everything is welcome :)

Comments on the code inline. This is just me glancing over it, I'll try and get more in-depth over the weekend.

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/*
Re: Scheduler improvements
On Fri, Jan 20, 2012 at 06:06:08PM +0100, Ariane van der Steldt wrote: * Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100. I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz? 20ms slices are a big problem. You only get 50 per second. This will yield all kinds of stroboscopic effects in video applications... (granted, that's probably not an issue on machines with hz > 100...)
Re: Scheduler improvements
Hi people, after a long time of silence here's a second iteration of the patch. I've addressed a few concerns voiced here:

* Process lookup and friends now have O(log n) runtime. I achieved that by abusing RB-trees as priority queues, since they have runtime O(log n) in all relevant algorithms.
* The algorithm for calculating a new deadline for a given process has been simplified and is now documented a bit better. It also derives the deadline offset from the value of hz (via rrticks_init) as suggested by Miod (?).
* CPU rlimits are now honored again. The relevant code did not change, the new patch just doesn't remove rlimit enforcement anymore.
* Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100.

With recent improvements in the mainline scheduler and especially rthreads, the performance of the patched scheduler and mainline is now roughly similar, at least as far as throughput is concerned. I have the feeling that the system behaves snappier with my patch, but that might be some sort of placebo effect. I haven't yet come up with a reliable method to benchmark interactivity except for actually using the machine and doing stuff on it. It's interesting to note, however, that the patched scheduler achieves performance similar to the default one without all the fancy methods for calculating how expensive it is to move a process from one CPU to another, or related methods for preserving cache locality. I use the patched scheduler exclusively on my Core2Duo machine with an MP build. The amount of lines removed versus added by this patch shifted towards more added lines but is still at 173 lines less than the default.
Once again, comments, rants, insults, everything is welcome :)

-- Gregor Best

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/* idle proc for this cpu */
@@ -107,9 +104,6 @@
 	u_int spc_nrun;	/* procs on the run queues */
 	fixpt_t spc_ldavg;	/* shortest load avg. for this cpu */

-	TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-	volatile uint32_t spc_whichqs;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
@@ -119,18 +113,16 @@

 #ifdef _KERNEL
 /* spc_flags */
-#define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
-#define SPCF_SHOULDYIELD	0x0002	/* process should yield the CPU */
-#define SPCF_SWITCHCLEAR	(SPCF_SEENRR|SPCF_SHOULDYIELD)
-#define SPCF_SHOULDHALT	0x0004	/* CPU
Re: Scheduler improvements
And it didn't take long for me to find a small bug... attached is a fixed version of the patch. Such things happen if one decides to regenerate a patch just in case and forgets to revert to a working version before doing that :D

-- Gregor Best

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/* idle proc for this cpu */
@@ -107,9 +104,6 @@
 	u_int spc_nrun;	/* procs on the run queues */
 	fixpt_t spc_ldavg;	/* shortest load avg. for this cpu */

-	TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-	volatile uint32_t spc_whichqs;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
@@ -119,18 +113,16 @@

 #ifdef _KERNEL
 /* spc_flags */
-#define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
-#define SPCF_SHOULDYIELD	0x0002	/* process should yield the CPU */
-#define SPCF_SWITCHCLEAR	(SPCF_SEENRR|SPCF_SHOULDYIELD)
-#define SPCF_SHOULDHALT	0x0004	/* CPU should be vacated */
-#define SPCF_HALTED	0x0008	/* CPU has been halted */
+#define SPCF_SHOULDYIELD	0x0001	/* process should yield the CPU */
+#define SPCF_SHOULDHALT	0x0002	/* CPU should be vacated */
+#define SPCF_HALTED	0x0004	/* CPU has been halted */

-#define	SCHED_PPQ	(128 / SCHED_NQS)	/* priorities per queue */
 #define NICE_WEIGHT	2	/* priorities per nice level */
-#define	ESTCPULIM(e)	min((e), NICE_WEIGHT * PRIO_MAX - SCHED_PPQ)
+#define	ESTCPULIM(e)	min((e), NICE_WEIGHT * PRIO_MAX)

 extern int schedhz;	/* ideally: 16 */
 extern int rrticks_init;	/* ticks per roundrobin() */
+extern struct cpuset sched_idle_cpus;

 struct proc;
 void schedclock(struct proc *);
@@ -147,18 +139,20 @@
 void cpu_switchto(struct proc *, struct proc *);
 struct proc *sched_chooseproc(void);
 struct cpu_info *sched_choosecpu(struct proc *);
-struct cpu_info *sched_choosecpu_fork(struct proc *parent, int);
+struct cpu_info *sched_choosecpu_fork(struct proc *parent);
 void cpu_idle_enter(void);
 void cpu_idle_cycle(void);
 void cpu_idle_leave(void);
 void sched_peg_curproc(struct cpu_info *ci);
+void generate_deadline(struct proc *, char);
+
 #ifdef MULTIPROCESSOR
 void sched_start_secondary_cpus(void);
 void sched_stop_secondary_cpus(void);
 #endif

-#define curcpu_is_idle()	(curcpu()->ci_schedstate.spc_whichqs == 0)
+#define curcpu_is_idle()	(cpuset_isset(&sched_idle_cpus, curcpu()))

 void
Re: Scheduler improvements
On Wed, Oct 12, 2011 at 02:55:39PM +0200, Gregor Best wrote:

Hi people, attached is a patch that basically rewrites the CPU scheduler. It replaces the multilevel feedback queue currently employed with an earliest effective deadline first approach, similar to the BrainFuck Scheduler for Linux. My main motivation for rewriting the scheduler was mplayer getting killed all the time because it went over its share of CPU resources, something which cannot happen with the patch I wrote. The algorithm employed simply assigns a (soft) deadline to each process, calculated solely from the nice value, and selects the next process to execute based on the earliest deadline in the runqueue. If more than one deadline is missed, the first one encountered in FIFO order (FIFO with respect to insertion into the runqueue) with a deadline smaller than the current time is selected. This prevents processes from yielding shortly before their timeslice expires to fool the scheduler into leaving them in their respective priority queue, thus preventing multiple processes from teaming up to starve others of CPU runtime. I did a few benchmarks to test the difference between the original and my patched scheduler to see whether there were any improvements. The tool I used was sysbench (with both CPU and Threads benchmarks). There are a few graphs of the tests at http://gurkenfunk.de/projects/openbsd-eevdf, but it looks as if my patches improve latency on workloads with high CPU usage, little I/O and a relatively large number of threads. I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

All this is very interesting, and I've spent some time, in 2.7 days, on trying to understand/improve the behaviour of interactive processes (aka processes like synths that wait on i/o and have a deadline for processing their input).
I still believe that the current scheduler, in the UP case, is very friendly to audio, since interactive processes almost always get the CPU in time. AFAICS, changing the scheduler won't improve correct audio code that already works. Another scheduler might improve the wrong code. Since we're talking about mplayer and the like, currently most of the stuttering is caused by mistakes in audio code and cpu-intensive code running in kernel mode. The scheduler doesn't hurt interactive processes, does it? Oh, also, with the patch, the scheduler code is 225 lines shorter and IMHO a bit easier to understand. I love simplification diffs, especially ones with more -'s than +'s. You've removed some of the MP affinity bits and the rlimit bits; I guess this is temporary, for clarity only, isn't it? I've read the Kolivas design paper you suggest, but failed to understand how this scheduling algorithm would improve interactive processes like mplayer. Do you have further references? -- Alexandre
Scheduler improvements
Hi people, attached is a patch that basically rewrites the CPU scheduler. It replaces the multilevel feedback queue currently employed with an earliest effective deadline first approach, similar to the BrainFuck Scheduler for Linux. My main motivation for rewriting the scheduler was mplayer getting killed all the time because it went over its share of CPU resources, something which cannot happen with the patch I wrote. The algorithm employed simply assigns a (soft) deadline to each process, calculated solely from the nice value, and selects the next process to execute based on the earliest deadline in the runqueue. If more than one deadline is missed, the first one encountered in FIFO order (FIFO with respect to insertion into the runqueue) with a deadline smaller than the current time is selected. This prevents processes from yielding shortly before their timeslice expires to fool the scheduler into leaving them in their respective priority queue, thus preventing multiple processes from teaming up to starve others of CPU runtime. I did a few benchmarks to test the difference between the original and my patched scheduler to see whether there were any improvements. The tool I used was sysbench (with both CPU and Threads benchmarks). There are a few graphs of the tests at http://gurkenfunk.de/projects/openbsd-eevdf, but it looks as if my patches improve latency on workloads with high CPU usage, little I/O and a relatively large number of threads. I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further. Oh, also, with the patch, the scheduler code is 225 lines shorter and IMHO a bit easier to understand. -- Gregor Best
Re: Scheduler improvements
Hi people, [...] I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

This is interesting, there might be a good outcome of this diff eventually. However, as of now:

- you have removed the bunch of code which tries to make sure processes do not hop between cpus unless there is a gain in doing so, on MP kernels. Did you try your diff on an MP kernel?
- non-tabbed indent is evil. And makes the diff harder to read.
- I see many locks removed in there, and moving information to the proc structure does not explain all of them. This gives a bad gut feeling about the behaviour of this on MP kernels. A really bad gut feeling.
- you have changed the roundrobin duration from (hz / 10) to (hz / 100). However, this computation yields zero on platforms where hz < 100, which is not a good idea. The only assumption you can make about hz is that 50 <= hz <= 1024.
- you removed the enforcement of RLIMIT_CPU. No way. Over my dead body.
- I also don't really like switching from NQS run queues to a single one. I understand you need this to be able to compare the deadlines with only one queue walk, but this can become expensive with many processes in the runqueue...
- your priority computations are unexplained magic. What does 20 mean? Shouldn't at least one of them be derived from `hz'? Why 1 versus 5? Are you sure the values you compute are always within the bounds you are expecting? Why isn't there any comment about these computations?

Miod
Re: Scheduler improvements
On Wed, Oct 12, 2011 at 05:51:33PM +, Miod Vallat wrote:

Hi people, [...] I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

This is interesting, there might be a good outcome of this diff eventually. However, as of now:

- you have removed the bunch of code which tries to make sure processes do not hop between cpus unless there is a gain in doing so, on MP kernels. Did you try your diff on an MP kernel?

[...] Yes. Almost all builds I have done so far were MP kernels, which I run almost all the time (I switch to un-patched kernels for benchmarks only). I'll do a benchmark with an SP build though to check whether there are any negative impacts.

- non-tabbed indent is evil. And makes the diff harder to read.

[...] I will clean that up.

- I see many locks removed in there, and moving information to the proc structure does not explain all of them. This gives a bad gut feeling about the behaviour of this on MP kernels. A really bad gut feeling.

[...] Works fine with two cores, but I can see where that feeling comes from. I'll add some more explanations to my changes.

- you have changed the roundrobin duration from (hz / 10) to (hz / 100). However, this computation yields zero on platforms where hz < 100, which is not a good idea. The only assumption you can make about hz is that 50 <= hz <= 1024.

[...] Good point. I'll change that.

- you removed the enforcement of RLIMIT_CPU. No way. Over my dead body.
- I also don't really like switching from NQS run queues to a single one. I understand you need this to be able to compare the deadlines with only one queue walk, but this can become expensive with many processes in the runqueue...

[...] Right. I'm planning to move this to a real priority queue via some sort of heap, so lookups and insertions can be done in O(log n) instead of O(n).

- your priority computations are unexplained magic. What does 20 mean? Shouldn't at least one of them be derived from `hz'? Why 1 versus 5? Are you sure the values you compute are always within the bounds you are expecting? Why isn't there any comment about these computations?

[...] Deriving them from hz didn't cross my mind but you are right of course. I chose the 5 because that makes for a relatively linear increase in the deadlines while not overflowing the value of .tv_usec. I'll add some explanations though.

Miod

Thanks a lot for your time and suggestions.

-- Gregor Best
Re: Scheduler improvements
I like what I see. It could be the start for the means of managing a single socket's queue of processes. You do mention that this won't really scale beyond roughly 8 cores. I would love to see you extend this with a 2D array of weights that can be populated by various means (run-time testing, or ACPI tables, etc) with the relative costs of moving a process from one core to another (or one scheduling queue to another, if queue's are assigned to a set of cores, say a socket full of cores). In this way, we may be able to have cores shut down by modifying these weights based on demand/load. While I agree with your comments that OpenBSD was not originally made for lots of cpu/cores, I don't really wish to entertain huge steps backwards on that front. The rthreads work (amoung other work) has been done with an eye towards more efficiently using MP systems. Again, I like this, looks simpler. Let's see you try and solve more of the MP parts as well. :) -Toby.