Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best <g...@ring0.de> wrote:
> On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
>> On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best <g...@ring0.de> wrote:
>>> This patch simply halves the timeslice processes get until they are
>>> preempted. This patch is standalone and the rest of the patches do
>>> not depend on it, but I figured I'd throw it in anyway.
>>>
>>> --
>>> Gregor Best
>>
>> Didn't get this patch. Is it me, or is the patch missing?
>
> The patch was the message my mail was in reply to. There's no "Patch
> 2/5" subject line because I forgot to change that after sending out
> Patch 1/5.

Can you resend it, please? I'm confused...

TIA,
David
Re: Scheduler improvements, take 1001, Patch 2/5
On 2012/10/15 16:18, David Coppa wrote:
> On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best <g...@ring0.de> wrote:
>> On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
>>> On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best <g...@ring0.de> wrote:
>>>> This patch simply halves the timeslice processes get until they are
>>>> preempted. This patch is standalone and the rest of the patches do
>>>> not depend on it, but I figured I'd throw it in anyway.
>>>
>>> Didn't get this patch. Is it me, or is the patch missing?
>>
>> The patch was the message my mail was in reply to. There's no "Patch
>> 2/5" subject line because I forgot to change that after sending out
>> Patch 1/5.
>
> Can you resend it, please? I'm confused...
>
> TIA,
> David

Best to send the diff, with the accompanying text, in the same email,
and make sure they all still apply. I tried testing some of these but
didn't manage to get them to apply (either some conflicting change, or
they were in the wrong order and stacked up on top of each other or
something).
Re: Scheduler improvements, take 1001, Patch 2/5
On Monday 15 October 2012, Stuart Henderson wrote:
> Best to send the diff, with the accompanying text, in the same email,
> and make sure they all still apply. I tried testing some of these but
> didn't manage to get them to apply (either some conflicting change, or
> they were in the wrong order and stacked up on top of each other or
> something).

As you mentioned using git in another mail, I would recommend using
git's format-patch to create the mails, maybe with the --no-prefix
option, because that seems to be the preferred patch format here. Also,
interactive rebase (git rebase -i) is nice for adjusting patches and
commit messages (in case you didn't know that one).

Cheers,
Stefan
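[A minimal sketch of the workflow Stefan suggests. The repository, file
names, and commit messages below are made up purely for the demo; in
real use you would run format-patch inside your existing checkout.]

```shell
# Scratch repository just for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

echo one > file.c && git add file.c && git commit -qm "first change"
echo two >> file.c && git commit -qam "second change"

# One mail-ready patch file per commit; --no-prefix drops the a/ and b/
# path prefixes, so the diffs apply with plain patch(1) from the tree
# root, which is the format preferred on the list.
git format-patch --no-prefix -2 HEAD

ls 0001-* 0002-*
```

To reorder, squash, or reword the same commits before resending, `git
rebase -i HEAD~2` opens an editor listing them; format-patch can then be
rerun on the result.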
Re: Scheduler improvements, take 1001, Patch 2/5
On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best <g...@ring0.de> wrote:
> This patch simply halves the timeslice processes get until they are
> preempted. This patch is standalone and the rest of the patches do
> not depend on it, but I figured I'd throw it in anyway.
>
> --
> Gregor Best

Didn't get this patch. Is it me, or is the patch missing?
Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
> On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best <g...@ring0.de> wrote:
>> This patch simply halves the timeslice processes get until they are
>> preempted. This patch is standalone and the rest of the patches do
>> not depend on it, but I figured I'd throw it in anyway.
>>
>> --
>> Gregor Best
>
> Didn't get this patch. Is it me, or is the patch missing?

The patch was the message my mail was in reply to. There's no "Patch
2/5" subject line because I forgot to change that after sending out
Patch 1/5.

--
Gregor Best
Re: Scheduler improvements, take 1001, Patch 2/5
Gregor, you would perhaps get better feedback if it were easier to
discern where your patches are and what each one is doing. If you
aren't inclined to keep the subjects matching the diffs and are sending
stuff out with subjects like "scheduler improvement diff X" instead of
something like "reduce time slice length by 50%", it takes a lot more
effort for someone to read and test them.

On Oct 14, 2012 9:01 AM, Gregor Best <g...@ring0.de> wrote:
> On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote:
>> On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best <g...@ring0.de> wrote:
>>> This patch simply halves the timeslice processes get until they are
>>> preempted. This patch is standalone and the rest of the patches do
>>> not depend on it, but I figured I'd throw it in anyway.
>>>
>>> --
>>> Gregor Best
>>
>> Didn't get this patch. Is it me, or is the patch missing?
>
> The patch was the message my mail was in reply to. There's no "Patch
> 2/5" subject line because I forgot to change that after sending out
> Patch 1/5.
>
> --
> Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
On Tue, Oct 9, 2012 at 9:27 AM, Gregor Best <g...@ring0.de> wrote:
> This patch moves struct schedstate_percpu to kernel land, which I
> think is cleaner than exposing structures for scheduler state to
> userland, especially since grepping for 'schedstate' in /usr/src
> yielded no results outside of /usr/src/sys.

I think this is a misunderstanding about the intent of _KERNEL. There
are, IMO, two reasons to condition parts of header files on _KERNEL:

1) the header's contents have been specified by a de facto or de jure
standard, but for its own reasons the kernel wants to place additional
info there.

2) the header defines kernel data structures that tools may need (for
crash dump debugging, for example) as well as external symbols like
functions and variables, declarations that can't be used by userspace.

sys/sched.h isn't a standardized header and isn't pulled in by a
standardized header, so (1) isn't an issue. So, only a non-standard
program is going to be pulling in sys/sched.h anyway. Unless there's a
reason otherwise, all types and #defines should be exposed to it.

(Yeah, the use of CP_* and CPUSTATES by the KERN_CPTIME* sysctl()s is
less than ideal, but this isn't the right way to fix that, IMO.)

Philip Guenther
Scheduler improvements, take 1001
(By popular request, as a new thread.)

Hi people,

I've tried splitting my scheduler patch into smaller fragments, and
here's the result. I changed a few things people mentioned over the
last few days, such as the following:

1) sys/proc.h now includes sys/tree.h, which should make libc builds
work again.

2) Deadline generation now takes process priorities into account, as
suggested by ratchov@. The way it's done now, processes can use their
sleep priority as a way to lower their nice value for short periods of
time. I didn't notice any real changes, but I'd love to hear from
people with more demanding applications.

3) schedstate_percpu is private to the kernel now, as I couldn't find a
single occurrence of `struct schedstate_percpu` outside of /usr/src/sys
and it seemed cleaner not to expose kernel data to userland in such a
broad way.

The patches will follow as single emails.

--
Gregor Best
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/sched_bsd.c b/kern/sched_bsd.c
index 172bb8f..c7121dc 100644
--- a/kern/sched_bsd.c
+++ b/kern/sched_bsd.c
@@ -77,12 +77,12 @@ scheduler_start(void)
 
 	timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);
 
-	rrticks_init = hz / 10;
+	rrticks_init = hz / 20;
 	schedcpu(&schedcpu_to);
 }
 
 /*
- * Force switch among equal priority processes every 100ms.
+ * Force switch among equal priority processes every 50ms.
  */
 void
 roundrobin(struct cpu_info *ci)
-- 
1.7.6
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/kern_clock.c b/kern/kern_clock.c
index 843965b..f598afc 100644
--- a/kern/kern_clock.c
+++ b/kern/kern_clock.c
@@ -233,7 +233,7 @@ hardclock(struct clockframe *frame)
 	if (stathz == 0)
 		statclock(frame);
 
-	if (--ci->ci_schedstate.spc_rrticks <= 0)
+	if (p && (--(p->p_rrticks) <= 0))
 		roundrobin(ci);
 
 	/*
diff --git a/kern/kern_proc.c b/kern/kern_proc.c
index ad861c8..e0d5536 100644
--- a/kern/kern_proc.c
+++ b/kern/kern_proc.c
@@ -398,8 +398,6 @@ proc_printit(struct proc *p, const char *modif, int (*pr)(const char *, ...))
 	    p->p_comm, p->p_pid, pst, p->p_flag, P_BITS);
 	(*pr)("pri=%u, usrpri=%u, nice=%d\n",
 	    p->p_priority, p->p_usrpri, p->p_p->ps_nice);
-	(*pr)("forw=%p, list=%p,%p\n",
-	    TAILQ_NEXT(p, p_runq), p->p_list.le_next, p->p_list.le_prev);
 	(*pr)("process=%p user=%p, vmspace=%p\n",
 	    p->p_p, p->p_addr, p->p_vmspace);
 	(*pr)("estcpu=%u, cpticks=%d, pctcpu=%u.%u, swtime=%u\n",
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 253226a..79eb28c 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -24,11 +24,22 @@
 #include <sys/resourcevar.h>
 #include <sys/signalvar.h>
 #include <sys/mutex.h>
+#include <sys/tree.h>
 
 #include <uvm/uvm_extern.h>
 
 #include <sys/malloc.h>
 
+static int
+sched_cmp_proc(struct proc *a, struct proc *b) {
+	if (a == b)
+		return 0;
+	if (timercmp(&(a->p_deadline), &(b->p_deadline), <))
+		return -1;
+	return 1;
+}
+
+RB_GENERATE_STATIC(prochead, proc, p_runq, sched_cmp_proc);
 
 void sched_kthreads_create(void *);
 
@@ -79,10 +90,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
 	struct schedstate_percpu *spc = &ci->ci_schedstate;
-	int i;
 
-	for (i = 0; i < SCHED_NQS; i++)
-		TAILQ_INIT(&spc->spc_qs[i]);
+	RB_INIT(&spc->spc_runq);
 
 	spc->spc_idleproc = NULL;
 
@@ -158,18 +167,19 @@ sched_idle(void *v)
 		cpuset_add(&sched_idle_cpus, ci);
 		cpu_idle_enter();
-		while (spc->spc_whichqs == 0) {
+
+		while (curcpu_is_idle()) {
 			if (spc->spc_schedflags & SPCF_SHOULDHALT &&
-			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
+				(spc->spc_schedflags & SPCF_HALTED) == 0) {
 				cpuset_del(&sched_idle_cpus, ci);
 				SCHED_LOCK(s);
-				atomic_setbits_int(&spc->spc_schedflags,
-				    spc->spc_whichqs ? 0 : SPCF_HALTED);
+				atomic_setbits_int(&spc->spc_schedflags, SPCF_HALTED);
 				SCHED_UNLOCK(s);
 				wakeup(spc);
 			}
 			cpu_idle_cycle();
 		}
+
 		cpu_idle_leave();
 		cpuset_del(&sched_idle_cpus, ci);
 	}
@@ -222,14 +232,13 @@ void
 setrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;
 
 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun++;
 
-	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
-	spc->spc_whichqs |= (1 << queue);
+	KASSERT(!RB_FIND(prochead, &spc->spc_runq, p));
+	RB_INSERT(prochead, &spc->spc_runq, p);
 	cpuset_add(&sched_queued_cpus, p->p_cpu);
 
 	if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -240,38 +249,29 @@ void
 remrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;
 
 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun--;
 
-	TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
-	if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
-		spc->spc_whichqs &= ~(1 << queue);
-		if (spc->spc_whichqs == 0)
-			cpuset_del(&sched_queued_cpus, p->p_cpu);
-	}
+	KASSERT(RB_REMOVE(prochead, &spc->spc_runq, p));
+	if (RB_EMPTY(&spc->spc_runq))
+		cpuset_del(&sched_queued_cpus, p->p_cpu);
 }
 
 struct proc *
 sched_chooseproc(void)
 {
 	struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
-	struct proc *p;
-	int queue;
+	struct proc *p, *p_tmp = NULL;
 
 	SCHED_ASSERT_LOCKED();
 
 	if (spc->spc_schedflags & SPCF_SHOULDHALT) {
-		if (spc->spc_whichqs) {
-			for (queue = 0; queue < SCHED_NQS; queue++) {
-				TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
-					remrunqueue(p);
-					p->p_cpu = sched_choosecpu(p);
-					setrunqueue(p);
-				}
-			}
+		RB_FOREACH_SAFE(p, prochead, &spc->spc_runq, p_tmp) {
+			remrunqueue(p);
+
Re: Scheduler improvements, take 1001, Patch 5/5
diff --git a/sys/sched.h b/sys/sched.h
index fb01f21..1784ee2 100644
--- a/sys/sched.h
+++ b/sys/sched.h
@@ -69,8 +69,10 @@
 #ifndef	_SYS_SCHED_H_
 #define	_SYS_SCHED_H_
 
+#ifdef _KERNEL
 #include <sys/queue.h>
 #include <sys/tree.h>
+#endif
 
 /*
  * Posix defines a <sched.h> which may want to include <sys/sched.h>
@@ -88,11 +90,9 @@
 #define CP_IDLE		4
 #define CPUSTATES	5
 
-#define	SCHED_NQS	32	/* 32 run queues. */
-
+#ifdef _KERNEL
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
 	struct timeval spc_runtime;	/* time curproc started running */
@@ -107,15 +107,13 @@ struct schedstate_percpu {
 	u_int spc_nrun;			/* procs on the run queues */
 	fixpt_t spc_ldavg;		/* shortest load avg. for this cpu */
 
-	RB_HEAD(prochead, proc) spc_runq;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
 	LIST_HEAD(,proc) spc_deadproc;
-};
 
-#ifdef _KERNEL
+	RB_HEAD(prochead, proc) spc_runq;
+};
 
 /* spc_flags */
 #define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
-- 
1.7.6
Re: Scheduler improvements, take 1001, Patch 4/5
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 12e48d6..99501a1 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,9 +102,11 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;
 
+#define ARCH_HAVE_CPU_TOPOLOGY
 	u_int32_t	ci_smt_id;
 	u_int32_t	ci_core_id;
 	u_int32_t	ci_pkg_id;
+
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 79eb28c..072ef38 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -496,6 +496,10 @@ int sched_cost_load = 1;
 int sched_cost_priority = 1;
 int sched_cost_runnable = 3;
 int sched_cost_resident = 1;
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+int sched_cost_diffcore = 2;	/* cost for moving to a different core */
+int sched_cost_diffpkg = 3;	/* cost for moving to a different package */
+#endif
 
 int
 sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
@@ -536,6 +540,13 @@ sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
 		cost -= l2resident * sched_cost_resident;
 	}
 
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+	if (p->p_cpu->ci_pkg_id != ci->ci_pkg_id)
+		cost *= sched_cost_diffpkg;
+	else if (p->p_cpu->ci_core_id != ci->ci_core_id)
+		cost *= sched_cost_diffcore;
+#endif
+
 	return (cost);
 }
-- 
1.7.6
Re: Scheduler improvements, take 1001, Patch 3/5
diff --git a/arch/amd64/amd64/identcpu.c b/arch/amd64/amd64/identcpu.c
index c597bb0..982c2bb 100644
--- a/arch/amd64/amd64/identcpu.c
+++ b/arch/amd64/amd64/identcpu.c
@@ -210,6 +210,8 @@
 void (*setperf_setup)(struct cpu_info *);
 
 void via_nano_setup(struct cpu_info *ci);
 
+void cpu_topology(struct cpu_info *ci);
+
 void
 via_nano_setup(struct cpu_info *ci)
 {
@@ -479,4 +481,123 @@ identifycpu(struct cpu_info *ci)
 		sensordev_install(&ci->ci_sensordev);
 #endif
 	}
+
+	cpu_topology(ci);
+}
+
+/*
+ * Base 2 logarithm of an int. returns 0 for 0 (yeye, I know).
+ */
+static int
+log2(unsigned int i)
+{
+	int ret = 0;
+
+	while (i >>= 1)
+		ret++;
+
+	return (ret);
+}
+
+static int
+mask_width(u_int x)
+{
+	int bit;
+	int mask;
+	int powerof2;
+
+	powerof2 = ((x - 1) & x) == 0;
+	mask = (x << (1 - powerof2)) - 1;
+
+	/* fls */
+	if (mask == 0)
+		return (0);
+	for (bit = 1; mask != 1; bit++)
+		mask = (unsigned int)mask >> 1;
+
+	return (bit);
+}
+
+/*
+ * Build up cpu topology for given cpu, must run on the core itself.
+ */
+void
+cpu_topology(struct cpu_info *ci)
+{
+	u_int32_t eax, ebx, ecx, edx;
+	u_int32_t apicid, max_apicid, max_coreid;
+	u_int32_t smt_bits, core_bits, pkg_bits;
+	u_int32_t smt_mask, core_mask, pkg_mask;
+
+	/* We need at least apicid at CPUID 1 */
+	CPUID(0, eax, ebx, ecx, edx);
+	if (eax < 1)
+		goto no_topology;
+
+	/* Initial apicid */
+	CPUID(1, eax, ebx, ecx, edx);
+	apicid = (ebx >> 24) & 0xff;
+
+	if (strcmp(cpu_vendor, "AuthenticAMD") == 0) {
+		/* We need at least apicid at CPUID 0x80000008 */
+		CPUID(0x80000000, eax, ebx, ecx, edx);
+		if (eax < 0x80000008)
+			goto no_topology;
+
+		CPUID(0x80000008, eax, ebx, ecx, edx);
+		core_bits = (ecx >> 12) & 0xf;
+		if (core_bits == 0)
+			goto no_topology;
+		/* So coreidsize 2 gives 3, 3 gives 7... */
+		core_mask = (1 << core_bits) - 1;
+		/* Core id is the least significant considering mask */
+		ci->ci_core_id = apicid & core_mask;
+		/* Pkg id is the upper remaining bits */
+		ci->ci_pkg_id = apicid & ~core_mask;
+		ci->ci_pkg_id >>= core_bits;
+	} else if (strcmp(cpu_vendor, "GenuineIntel") == 0) {
+		/* We only support leaf 1/4 detection */
+		CPUID(0, eax, ebx, ecx, edx);
+		if (eax < 4)
+			goto no_topology;
+		/* Get max_apicid */
+		CPUID(1, eax, ebx, ecx, edx);
+		max_apicid = (ebx >> 16) & 0xff;
+		/* Get max_coreid */
+		CPUID2(4, 0, eax, ebx, ecx, edx);
+		max_coreid = ((eax >> 26) & 0x3f) + 1;
+		/* SMT */
+		smt_bits = mask_width(max_apicid / max_coreid);
+		smt_mask = (1 << smt_bits) - 1;
+		/* Core */
+		core_bits = log2(max_coreid);
+		core_mask = (1 << (core_bits + smt_bits)) - 1;
+		core_mask ^= smt_mask;
+		/* Pkg */
+		pkg_bits = core_bits + smt_bits;
+		pkg_mask = -1 << core_bits;
+
+		ci->ci_smt_id = apicid & smt_mask;
+		ci->ci_core_id = (apicid & core_mask) >> smt_bits;
+		ci->ci_pkg_id = (apicid & pkg_mask) >> pkg_bits;
+	} else
+		goto no_topology;
+#ifdef DEBUG
+	printf("cpu%d: smt %u, core %u, pkg %u "
+	    "(apicid 0x%x, max_apicid 0x%x, max_coreid 0x%x, smt_bits 0x%x, smt_mask 0x%x, "
+	    "core_bits 0x%x, core_mask 0x%x, pkg_bits 0x%x, pkg_mask 0x%x)\n",
+	    ci->ci_cpuid, ci->ci_smt_id, ci->ci_core_id, ci->ci_pkg_id,
+	    apicid, max_apicid, max_coreid, smt_bits, smt_mask, core_bits,
+	    core_mask, pkg_bits, pkg_mask);
+#else
+	printf("cpu%d: smt %u, core %u, package %u\n", ci->ci_cpuid,
+	    ci->ci_smt_id, ci->ci_core_id, ci->ci_pkg_id);
+
+#endif
+	return;
+	/* We can't map, so consider ci_core_id as ci_cpuid */
+no_topology:
+	ci->ci_smt_id = 0;
+	ci->ci_core_id = ci->ci_cpuid;
+	ci->ci_pkg_id = 0;
+}
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 9ce437a..12e48d6 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,6 +102,9 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;
 
+	u_int32_t	ci_smt_id;
+	u_int32_t	ci_core_id;
+	u_int32_t	ci_pkg_id;
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/arch/amd64/include/specialreg.h b/arch/amd64/include/specialreg.h
index 142fbbc..cab0985 100644
---
Re: Scheduler improvements, take 1001, Patch 2/5
This patch simply halves the timeslice processes get until they are
preempted. This patch is standalone and the rest of the patches do not
depend on it, but I figured I'd throw it in anyway.

--
Gregor Best
Re: Scheduler improvements, take 1001, Patch 3/5
This patch simply imports Christiano's code for detecting CPU topology,
as posted on tech@ a while (more than two months) ago. I took it
verbatim and didn't change anything yet.

--
Gregor Best
Re: Scheduler improvements, take 1001, Patch 4/5
This patch uses the previous one to take CPU topology into account when
calculating the cost of moving a process between CPUs. This is only
done on amd64 at the moment, and the cost factors are guesses right
now, but it's a start.

--
Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
This patch moves struct schedstate_percpu to kernel land, which I think
is cleaner than exposing structures for scheduler state to userland,
especially since grepping for 'schedstate' in /usr/src yielded no
results outside of /usr/src/sys. I have not seen any negative impact
from this, but I haven't yet run a full userland build (it's running at
the moment, but the machine I'm building on is a bit slower than my
laptop).

--
Gregor Best