Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-15 Thread David Coppa
On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote:
 On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
 On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote:
  This patch simply halves the timeslice processes get until they are
  preempted. This patch is standalone and the rest of the patches does not
  depend on it, but I figured I'd throw it in anyway.
 
  --
  Gregor Best
 

 Didn't get this patch. Is it me or the patch is missing?


 The patch was the message my mail was in reply to. There's no Patch 2/5
 subject line because I forgot to change that after sending out Patch
 1/5.

Can you resend it, please?
I'm confused...

TIA,
David



Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-15 Thread Stuart Henderson
On 2012/10/15 16:18, David Coppa wrote:
 On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote:
  On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
  On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote:
   This patch simply halves the timeslice processes get until they are
   preempted. This patch is standalone and the rest of the patches does not
   depend on it, but I figured I'd throw it in anyway.
  
   --
   Gregor Best
  
 
  Didn't get this patch. Is it me or the patch is missing?
 
 
  The patch was the message my mail was in reply to. There's no Patch 2/5
  subject line because I forgot to change that after sending out Patch
  1/5.
 
 Can you resend it, please?
 I'm confused...
 
 TIA,
 David
 

Best to send the diff, with the accompanying text, in the same email,
and make sure they still all apply, I tried testing some of these but
didn't manage to get them to apply (either some conflicting change,
or they were in the wrong order and stacked up on top of each other
or something).



Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-15 Thread Stefan Fritsch
On Monday 15 October 2012, Stuart Henderson wrote:
 Best to send the diff, with the accompanying text, in the same
 email, and make sure they still all apply, I tried testing some of
 these but didn't manage to get them to apply (either some
 conflicting change, or they were in the wrong order and stacked up
 on top of each other or something).

As you mentioned using git in another mail, I would recommend to use 
git's format-patch to create the mails. Maybe with the --no-prefix 
option,  because that seems to be the preferred patch format here. 
Also, interactive rebase (git rebase -i) is nice for adjusting patches 
and commit messages (in case you didn't know that one).

Cheers,
Stefan



Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-14 Thread David Coppa
On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote:
 This patch simply halves the timeslice processes get until they are
 preempted. This patch is standalone and the rest of the patches does not
 depend on it, but I figured I'd throw it in anyway.

 --
 Gregor Best


Didn't get this patch. Is it me or the patch is missing?



Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-14 Thread Gregor Best
On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
 On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote:
  This patch simply halves the timeslice processes get until they are
  preempted. This patch is standalone and the rest of the patches does not
  depend on it, but I figured I'd throw it in anyway.
 
  --
  Gregor Best
 
 
 Didn't get this patch. Is it me or the patch is missing?
 

The patch was the message my mail was in reply to. There's no Patch 2/5
subject line because I forgot to change that after sending out Patch
1/5.

-- 
Gregor Best



Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-14 Thread Bob Beck
Gregor you would perhaps get better feedback if it were easier to discern
where your patches are and what each one is doing.  If you can't be
inclined to keep the subjects matching the diffs and are sending stuff out
with subjects like scheduler improvement diff X instead of something like
reduce time slice length by 50% it makes it a lot more effort for someone
to read and test.
On Oct 14, 2012 9:01 AM, Gregor Best g...@ring0.de wrote:

 On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
  On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote:
   This patch simply halves the timeslice processes get until they are
   preempted. This patch is standalone and the rest of the patches does
 not
   depend on it, but I figured I'd throw it in anyway.
  
   --
   Gregor Best
  
 
  Didn't get this patch. Is it me or the patch is missing?
 

 The patch was the message my mail was in reply to. There's no Patch 2/5
 subject line because I forgot to change that after sending out Patch
 1/5.

 --
 Gregor Best



Re: Scheduler improvements, take 1001, Patch 5/5

2012-10-13 Thread Philip Guenther
On Tue, Oct 9, 2012 at 9:27 AM, Gregor Best g...@ring0.de wrote:
 This patch moves struct schedstate_percpu to kernel land, which I think
 is cleaner than exposing structures for scheduler state to userland,
 especially since grepping for 'schedstate' in /usr/src yielded no
 results outside of /usr/src/sys.

I think this is a misunderstanding about the intent of _KERNEL.  There
are, IMO, two reasons to condition parts of header files on _KERNEL:

1) the header's contents have been specified by a de facto or de jure
standard, but for its own reasons the kernel wants to place additional
info there.

2) the header defines kernel data structures that tools may need (for
crash dump debugging, for example) as well as external symbols like
functions and variables, declarations that can't be used by userspace.

sys/sched.h isn't a standardized header and isn't pulled in by a
standardized header, so (1) isn't an issue.  So, only a non-standard
program is going to be pulling in sys/sched.h anyway.  Unless
there's a reason otherwise, all types and #defines be exposed to it.


(Yeah, the use of CP_* and CPUSTATES by the KERN_CPTIME* sysctl()s is
less than ideal, but this isn't the right way to fix that, IMO.)


Philip Guenther



Scheduler improvements, take 1001

2012-10-09 Thread Gregor Best
(By popular request as a new thread).

Hi people,

I've tried splitting my scheduler patch into smaller fragments, and
here's the result.

I changed a few things people mentioned over the last few days, such as
the following:

1) sys/proc.h now includes sys/tree.h, which should make libc builds
work again.
2) deadline generation now takes process priorities into account, as
suggested by ratchov@. The way it's done now, processes can use their
sleep priority as a way to lower their nice value for short periods of
time. I didn't notice any real changes, but I'd love to hear from people
with more demanding applications.
3) schedstate_percpu is private to the kernel now, as I couldn't find a
single occurrence of `struct schedstate_percpu` outside of /usr/src/sys
and it seemed cleaner not to expose kernel data to userland in such a
broad way.

The patches will follow as single emails.

-- 
Gregor Best



Re: Scheduler improvements, take 1001, Patch 1/5

2012-10-09 Thread Gregor Best
diff --git a/kern/sched_bsd.c b/kern/sched_bsd.c
index 172bb8f..c7121dc 100644
--- a/kern/sched_bsd.c
+++ b/kern/sched_bsd.c
@@ -77,12 +77,12 @@ scheduler_start(void)
 
timeout_set(schedcpu_to, schedcpu, schedcpu_to);
 
-   rrticks_init = hz / 10;
+   rrticks_init = hz / 20;
schedcpu(schedcpu_to);
 }
 
 /*
- * Force switch among equal priority processes every 100ms.
+ * Force switch among equal priority processes every 50ms.
  */
 void
 roundrobin(struct cpu_info *ci)
-- 
1.7.6



Re: Scheduler improvements, take 1001, Patch 1/5

2012-10-09 Thread Gregor Best
diff --git a/kern/kern_clock.c b/kern/kern_clock.c
index 843965b..f598afc 100644
--- a/kern/kern_clock.c
+++ b/kern/kern_clock.c
@@ -233,7 +233,7 @@ hardclock(struct clockframe *frame)
if (stathz == 0)
statclock(frame);
 
-   if (--ci-ci_schedstate.spc_rrticks = 0)
+   if (p  (--(p-p_rrticks) = 0))
roundrobin(ci);
 
/*
diff --git a/kern/kern_proc.c b/kern/kern_proc.c
index ad861c8..e0d5536 100644
--- a/kern/kern_proc.c
+++ b/kern/kern_proc.c
@@ -398,8 +398,6 @@ proc_printit(struct proc *p, const char *modif, int 
(*pr)(const char *, ...))
p-p_comm, p-p_pid, pst, p-p_flag, P_BITS);
(*pr)(pri=%u, usrpri=%u, nice=%d\n,
p-p_priority, p-p_usrpri, p-p_p-ps_nice);
-   (*pr)(forw=%p, list=%p,%p\n,
-   TAILQ_NEXT(p, p_runq), p-p_list.le_next, p-p_list.le_prev);
(*pr)(process=%p user=%p, vmspace=%p\n,
p-p_p, p-p_addr, p-p_vmspace);
(*pr)(estcpu=%u, cpticks=%d, pctcpu=%u.%u, swtime=%u\n,
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 253226a..79eb28c 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -24,11 +24,22 @@
 #include sys/resourcevar.h
 #include sys/signalvar.h
 #include sys/mutex.h
+#include sys/tree.h
 
 #include uvm/uvm_extern.h
 
 #include sys/malloc.h
 
+static int
+sched_cmp_proc(struct proc *a, struct proc *b) {
+   if (a == b)
+   return 0;
+   if (timercmp((a-p_deadline), (b-p_deadline), ))
+   return -1;
+   return 1;
+}
+
+RB_GENERATE_STATIC(prochead, proc, p_runq, sched_cmp_proc);
 
 void sched_kthreads_create(void *);
 
@@ -79,10 +90,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
struct schedstate_percpu *spc = ci-ci_schedstate;
-   int i;
 
-   for (i = 0; i  SCHED_NQS; i++)
-   TAILQ_INIT(spc-spc_qs[i]);
+   RB_INIT(spc-spc_runq);
 
spc-spc_idleproc = NULL;
 
@@ -158,18 +167,19 @@ sched_idle(void *v)
 
cpuset_add(sched_idle_cpus, ci);
cpu_idle_enter();
-   while (spc-spc_whichqs == 0) {
+
+   while (curcpu_is_idle()) {
if (spc-spc_schedflags  SPCF_SHOULDHALT 
-   (spc-spc_schedflags  SPCF_HALTED) == 0) {
+(spc-spc_schedflags  SPCF_HALTED) == 0) {
cpuset_del(sched_idle_cpus, ci);
SCHED_LOCK(s);
-   atomic_setbits_int(spc-spc_schedflags,
-   spc-spc_whichqs ? 0 : SPCF_HALTED);
+   atomic_setbits_int(spc-spc_schedflags, 
SPCF_HALTED);
SCHED_UNLOCK(s);
wakeup(spc);
}
cpu_idle_cycle();
}
+
cpu_idle_leave();
cpuset_del(sched_idle_cpus, ci);
}
@@ -222,14 +232,13 @@ void
 setrunqueue(struct proc *p)
 {
struct schedstate_percpu *spc;
-   int queue = p-p_priority  2;
 
SCHED_ASSERT_LOCKED();
spc = p-p_cpu-ci_schedstate;
spc-spc_nrun++;
 
-   TAILQ_INSERT_TAIL(spc-spc_qs[queue], p, p_runq);
-   spc-spc_whichqs |= (1  queue);
+   KASSERT(!RB_FIND(prochead, spc-spc_runq, p));
+   RB_INSERT(prochead, spc-spc_runq, p);
cpuset_add(sched_queued_cpus, p-p_cpu);
 
if (cpuset_isset(sched_idle_cpus, p-p_cpu))
@@ -240,38 +249,29 @@ void
 remrunqueue(struct proc *p)
 {
struct schedstate_percpu *spc;
-   int queue = p-p_priority  2;
 
SCHED_ASSERT_LOCKED();
spc = p-p_cpu-ci_schedstate;
spc-spc_nrun--;
 
-   TAILQ_REMOVE(spc-spc_qs[queue], p, p_runq);
-   if (TAILQ_EMPTY(spc-spc_qs[queue])) {
-   spc-spc_whichqs = ~(1  queue);
-   if (spc-spc_whichqs == 0)
-   cpuset_del(sched_queued_cpus, p-p_cpu);
-   }
+   KASSERT(RB_REMOVE(prochead, spc-spc_runq, p));
+   if (RB_EMPTY(spc-spc_runq))
+   cpuset_del(sched_queued_cpus, p-p_cpu);
 }
 
 struct proc *
 sched_chooseproc(void)
 {
struct schedstate_percpu *spc = curcpu()-ci_schedstate;
-   struct proc *p;
-   int queue;
+   struct proc *p, *p_tmp = NULL;
 
SCHED_ASSERT_LOCKED();
 
if (spc-spc_schedflags  SPCF_SHOULDHALT) {
-   if (spc-spc_whichqs) {
-   for (queue = 0; queue  SCHED_NQS; queue++) {
-   TAILQ_FOREACH(p, spc-spc_qs[queue], p_runq) {
-   remrunqueue(p);
-   p-p_cpu = sched_choosecpu(p);
-   setrunqueue(p);
-   }
-   }
+   RB_FOREACH_SAFE(p, prochead, spc-spc_runq, p_tmp) {
+   remrunqueue(p);
+   

Re: Scheduler improvements, take 1001, Patch 5/5

2012-10-09 Thread Gregor Best
diff --git a/sys/sched.h b/sys/sched.h
index fb01f21..1784ee2 100644
--- a/sys/sched.h
+++ b/sys/sched.h
@@ -69,8 +69,10 @@
 #ifndef_SYS_SCHED_H_
 #define_SYS_SCHED_H_
 
+#ifdef _KERNEL
 #include sys/queue.h
 #include sys/tree.h
+#endif
 
 /*
  * Posix defines a sched.h which may want to include sys/sched.h
@@ -88,11 +90,9 @@
 #define CP_IDLE4
 #define CPUSTATES  5
 
-#defineSCHED_NQS   32  /* 32 run queues. */
-
+#ifdef _KERNEL
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
struct timeval spc_runtime; /* time curproc started running */
@@ -107,15 +107,13 @@ struct schedstate_percpu {
u_int spc_nrun; /* procs on the run queues */
fixpt_t spc_ldavg;  /* shortest load avg. for this cpu */
 
-   RB_HEAD(prochead, proc) spc_runq;
-
 #ifdef notyet
struct proc *spc_reaper;/* dead proc reaper */
 #endif
LIST_HEAD(,proc) spc_deadproc;
-};
 
-#ifdef _KERNEL
+   RB_HEAD(prochead, proc) spc_runq;
+};
 
 /* spc_flags */
 #define SPCF_SEENRR 0x0001  /* process has seen roundrobin() */
-- 
1.7.6



Re: Scheduler improvements, take 1001, Patch 4/5

2012-10-09 Thread Gregor Best
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 12e48d6..99501a1 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,9 +102,11 @@ struct cpu_info {
u_int32_t   ci_cflushsz;
u_int64_t   ci_tsc_freq;
 
+#define ARCH_HAVE_CPU_TOPOLOGY
u_int32_t   ci_smt_id;
u_int32_t   ci_core_id;
u_int32_t   ci_pkg_id;
+
struct cpu_functions *ci_func;
void (*cpu_setup)(struct cpu_info *);
void (*ci_info)(struct cpu_info *);
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 79eb28c..072ef38 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -496,6 +496,10 @@ int sched_cost_load = 1;
 int sched_cost_priority = 1;
 int sched_cost_runnable = 3;
 int sched_cost_resident = 1;
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+int sched_cost_diffcore = 2; /* cost for moving to a different core */
+int sched_cost_diffpkg = 3; /* cost for moving to a different package */
+#endif
 
 int
 sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
@@ -536,6 +540,13 @@ sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
cost -= l2resident * sched_cost_resident;
}
 
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+   if (p-p_cpu-ci_pkg_id != ci-ci_pkg_id)
+   cost *= sched_cost_diffpkg;
+   else if (p-p_cpu-ci_core_id != ci-ci_core_id)
+   cost *= sched_cost_diffcore;
+#endif
+
return (cost);
 }
 
-- 
1.7.6



Re: Scheduler improvements, take 1001, Patch 3/5

2012-10-09 Thread Gregor Best
diff --git a/arch/amd64/amd64/identcpu.c b/arch/amd64/amd64/identcpu.c
index c597bb0..982c2bb 100644
--- a/arch/amd64/amd64/identcpu.c
+++ b/arch/amd64/amd64/identcpu.c
@@ -210,6 +210,8 @@ void (*setperf_setup)(struct cpu_info *);
 
 void via_nano_setup(struct cpu_info *ci);
 
+void cpu_topology(struct cpu_info *ci);
+
 void
 via_nano_setup(struct cpu_info *ci)
 {
@@ -479,4 +481,123 @@ identifycpu(struct cpu_info *ci)
sensordev_install(ci-ci_sensordev);
 #endif
}
+
+   cpu_topology(ci);
+}
+
+/*
+ * Base 2 logarithm of an int. returns 0 for 0 (yeye, I know).
+ */
+static int
+log2(unsigned int i)
+{
+   int ret = 0;
+
+   while (i = 1)
+   ret++;
+
+   return (ret);
+}
+
+static int
+mask_width(u_int x)
+{
+   int bit;
+   int mask;
+   int powerof2;
+
+   powerof2 = ((x - 1)  x) == 0;
+   mask = (x  (1 - powerof2)) - 1;
+
+   /* fls */
+   if (mask == 0)
+   return (0);
+   for (bit = 1; mask != 1; bit++)
+   mask = (unsigned int)mask  1;
+
+   return (bit);
+}
+
+/*
+ * Build up cpu topology for given cpu, must run on the core itself.
+ */
+void
+cpu_topology(struct cpu_info *ci)
+{
+   u_int32_t eax, ebx, ecx, edx;
+   u_int32_t apicid, max_apicid, max_coreid;
+   u_int32_t smt_bits, core_bits, pkg_bits;
+   u_int32_t smt_mask, core_mask, pkg_mask;
+
+   /* We need at least apicid at CPUID 1 */
+   CPUID(0, eax, ebx, ecx, edx);
+   if (eax  1)
+   goto no_topology;
+
+   /* Initial apicid */
+   CPUID(1, eax, ebx, ecx, edx);
+   apicid = (ebx  24)  0xff;
+
+   if (strcmp(cpu_vendor, AuthenticAMD) == 0) {
+   /* We need at least apicid at CPUID 0x8008 */
+   CPUID(0x8000, eax, ebx, ecx, edx);
+   if (eax  0x8008)
+   goto no_topology;
+
+   CPUID(0x8008, eax, ebx, ecx, edx);
+   core_bits = (ecx  12)  0xf;
+   if (core_bits == 0)
+   goto no_topology;
+   /* So coreidsize 2 gives 3, 3 gives 7... */
+   core_mask = (1  core_bits) - 1;
+   /* Core id is the least significant considering mask */
+   ci-ci_core_id = apicid  core_mask;
+   /* Pkg id is the upper remaining bits */
+   ci-ci_pkg_id = apicid  ~core_mask;
+   ci-ci_pkg_id = core_bits;
+   } else if (strcmp(cpu_vendor, GenuineIntel) == 0) {
+   /* We only support leaf 1/4 detection */
+   CPUID(0, eax, ebx, ecx, edx);
+   if (eax  4)
+   goto no_topology;
+   /* Get max_apicid */
+   CPUID(1, eax, ebx, ecx, edx);
+   max_apicid = (ebx  16)  0xff;
+   /* Get max_coreid */
+   CPUID2(4, 0, eax, ebx, ecx, edx);
+   max_coreid = ((eax  26)  0x3f) + 1;
+   /* SMT */
+   smt_bits = mask_width(max_apicid / max_coreid);
+   smt_mask = (1  smt_bits) - 1;
+   /* Core */
+   core_bits = log2(max_coreid);
+   core_mask = (1  (core_bits + smt_bits)) - 1;
+   core_mask ^= smt_mask;
+   /* Pkg */
+   pkg_bits = core_bits + smt_bits;
+   pkg_mask = -1  core_bits;
+
+   ci-ci_smt_id = apicid  smt_mask;
+   ci-ci_core_id = (apicid  core_mask)  smt_bits;
+   ci-ci_pkg_id = (apicid  pkg_mask)  pkg_bits;
+   } else
+   goto no_topology;
+#ifdef DEBUG
+   printf(cpu%d: smt %u, core %u, pkg %u 
+   (apicid 0x%x, max_apicid 0x%x, max_coreid 0x%x, smt_bits 0x%x, 
smt_mask 0x%x, 
+   core_bits 0x%x, core_mask 0x%x, pkg_bits 0x%x, pkg_mask 
0x%x)\n,
+   ci-ci_cpuid, ci-ci_smt_id, ci-ci_core_id, ci-ci_pkg_id,
+   apicid, max_apicid, max_coreid, smt_bits, smt_mask, core_bits,
+   core_mask, pkg_bits, pkg_mask);
+#else
+   printf(cpu%d: smt %u, core %u, package %u\n, ci-ci_cpuid,
+   ci-ci_smt_id, ci-ci_core_id, ci-ci_pkg_id);
+
+#endif
+   return;
+   /* We can't map, so consider ci_core_id as ci_cpuid */
+no_topology:
+   ci-ci_smt_id  = 0;
+   ci-ci_core_id = ci-ci_cpuid;
+   ci-ci_pkg_id  = 0;
 }
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 9ce437a..12e48d6 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,6 +102,9 @@ struct cpu_info {
u_int32_t   ci_cflushsz;
u_int64_t   ci_tsc_freq;
 
+   u_int32_t   ci_smt_id;
+   u_int32_t   ci_core_id;
+   u_int32_t   ci_pkg_id;
struct cpu_functions *ci_func;
void (*cpu_setup)(struct cpu_info *);
void (*ci_info)(struct cpu_info *);
diff --git a/arch/amd64/include/specialreg.h b/arch/amd64/include/specialreg.h
index 142fbbc..cab0985 100644
--- 

Re: Scheduler improvements, take 1001, Patch 2/5

2012-10-09 Thread Gregor Best
This patch simply halves the timeslice processes get until they are
preempted. This patch is standalone and the rest of the patches does not
depend on it, but I figured I'd throw it in anyway.

-- 
Gregor Best



Re: Scheduler improvements, take 1001, Patch 3/5

2012-10-09 Thread Gregor Best
This patch simply imports Christiano's code for detecting CPU topology,
as posted on tech@ a while (more than two months) ago. I took it
verbatim and didn't change anything yet.

-- 
Gregor Best



Re: Scheduler improvements, take 1001, Patch 4/5

2012-10-09 Thread Gregor Best
This patch uses the previous one to take CPU topology into account when
calculating the cost of moving a process between CPUs. This is only done
on amd64 at the moment, and the cost factors are guesses right now, but
it's a start.

-- 
Gregor Best



Re: Scheduler improvements, take 1001, Patch 5/5

2012-10-09 Thread Gregor Best
This patch moves struct schedstate_percpu to kernel land, which I think
is cleaner than exposing structures for scheduler state to userland,
especially since grepping for 'schedstate' in /usr/src yielded no
results outside of /usr/src/sys.

I have not seen negative impact from this, but I haven't yet run a full
userland build (it's running at the moment but the machine I'm building
on is a bit slower than my laptop).

-- 
Gregor Best