Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. Can you resend it, please? I'm confused... TIA, David
Re: Scheduler improvements, take 1001, Patch 2/5
On 2012/10/15 16:18, David Coppa wrote: On Sun, Oct 14, 2012 at 4:53 PM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. Can you resend it, please? I'm confused... TIA, David Best to send the diff, with the accompanying text, in the same email, and make sure they still all apply. I tried testing some of these but didn't manage to get them to apply (either some conflicting change, or they were in the wrong order and stacked up on top of each other or something).
Re: Scheduler improvements, take 1001, Patch 2/5
On Monday 15 October 2012, Stuart Henderson wrote: Best to send the diff, with the accompanying text, in the same email, and make sure they still all apply. I tried testing some of these but didn't manage to get them to apply (either some conflicting change, or they were in the wrong order and stacked up on top of each other or something). As you mentioned using git in another mail, I would recommend using git's format-patch to create the mails, maybe with the --no-prefix option, because that seems to be the preferred patch format here. Also, interactive rebase (git rebase -i) is nice for adjusting patches and commit messages (in case you didn't know that one). Cheers, Stefan
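Stefan's suggested workflow can be sketched with a throwaway repository (the repo, file names, and commit messages below are made up for illustration):

```shell
# Build a disposable two-commit repo, then emit one mail-ready patch per commit.
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name "You"

echo a > sched_bsd.c
git add . && git commit -qm "reduce time slice length by 50%"
echo b >> sched_bsd.c
git commit -qam "replace run queues with an RB tree"

# --no-prefix drops the a/ b/ path prefixes so the diffs apply with plain patch(1)
git format-patch --no-prefix -2
ls 000*.patch
```

Each generated file carries a Subject: line taken from the commit message, which addresses Stuart's complaint about matching subjects to diffs.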
Re: Scheduler improvements, take 1001, Patch 2/5
On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing?
Re: Scheduler improvements, take 1001, Patch 2/5
On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 2/5
Gregor, you would perhaps get better feedback if it were easier to discern where your patches are and what each one is doing. If you can't be bothered to keep the subjects matching the diffs and send stuff out with subjects like "scheduler improvement diff X" instead of something like "reduce time slice length by 50%", it makes it a lot more effort for someone to read and test. On Oct 14, 2012 9:01 AM, Gregor Best g...@ring0.de wrote: On Sun, Oct 14, 2012 at 11:05:36AM +0200, David Coppa wrote: On Tue, Oct 9, 2012 at 6:21 PM, Gregor Best g...@ring0.de wrote: This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches does not depend on it, but I figured I'd throw it in anyway. -- Gregor Best Didn't get this patch. Is it me or the patch is missing? The patch was the message my mail was in reply to. There's no Patch 2/5 subject line because I forgot to change that after sending out Patch 1/5. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
On Tue, Oct 9, 2012 at 9:27 AM, Gregor Best g...@ring0.de wrote: This patch moves struct schedstate_percpu to kernel land, which I think is cleaner than exposing structures for scheduler state to userland, especially since grepping for 'schedstate' in /usr/src yielded no results outside of /usr/src/sys. I think this is a misunderstanding about the intent of _KERNEL. There are, IMO, two reasons to condition parts of header files on _KERNEL: 1) the header's contents have been specified by a de facto or de jure standard, but for its own reasons the kernel wants to place additional info there. 2) the header defines kernel data structures that tools may need (for crash dump debugging, for example) as well as external symbols like functions and variables, declarations that can't be used by userspace. sys/sched.h isn't a standardized header and isn't pulled in by a standardized header, so (1) isn't an issue. So, only a non-standard program is going to be pulling in sys/sched.h anyway. Unless there's a reason otherwise, all types and #defines should be exposed to it. (Yeah, the use of CP_* and CPUSTATES by the KERN_CPTIME* sysctl()s is less than ideal, but this isn't the right way to fix that, IMO.) Philip Guenther
Scheduler improvements, take 1001
(By popular request as a new thread). Hi people, I've tried splitting my scheduler patch into smaller fragments, and here's the result. I changed a few things people mentioned over the last few days, such as the following: 1) sys/proc.h now includes sys/tree.h, which should make libc builds work again. 2) deadline generation now takes process priorities into account, as suggested by ratchov@. The way it's done now, processes can use their sleep priority as a way to lower their nice value for short periods of time. I didn't notice any real changes, but I'd love to hear from people with more demanding applications. 3) schedstate_percpu is private to the kernel now, as I couldn't find a single occurrence of `struct schedstate_percpu` outside of /usr/src/sys and it seemed cleaner not to expose kernel data to userland in such a broad way. The patches will follow as single emails. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/sched_bsd.c b/kern/sched_bsd.c
index 172bb8f..c7121dc 100644
--- a/kern/sched_bsd.c
+++ b/kern/sched_bsd.c
@@ -77,12 +77,12 @@ scheduler_start(void)

 	timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);

-	rrticks_init = hz / 10;
+	rrticks_init = hz / 20;
 	schedcpu(&schedcpu_to);
 }

 /*
- * Force switch among equal priority processes every 100ms.
+ * Force switch among equal priority processes every 50ms.
  */
 void
 roundrobin(struct cpu_info *ci)
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 1/5
diff --git a/kern/kern_clock.c b/kern/kern_clock.c
index 843965b..f598afc 100644
--- a/kern/kern_clock.c
+++ b/kern/kern_clock.c
@@ -233,7 +233,7 @@ hardclock(struct clockframe *frame)
 	if (stathz == 0)
 		statclock(frame);

-	if (--ci->ci_schedstate.spc_rrticks <= 0)
+	if (p && (--(p->p_rrticks) <= 0))
 		roundrobin(ci);

 	/*
diff --git a/kern/kern_proc.c b/kern/kern_proc.c
index ad861c8..e0d5536 100644
--- a/kern/kern_proc.c
+++ b/kern/kern_proc.c
@@ -398,8 +398,6 @@ proc_printit(struct proc *p, const char *modif, int (*pr)(const char *, ...))
 	    p->p_comm, p->p_pid, pst, p->p_flag, P_BITS);
 	(*pr)("pri=%u, usrpri=%u, nice=%d\n",
 	    p->p_priority, p->p_usrpri, p->p_p->ps_nice);
-	(*pr)("forw=%p, list=%p,%p\n",
-	    TAILQ_NEXT(p, p_runq), p->p_list.le_next, p->p_list.le_prev);
 	(*pr)("process=%p user=%p, vmspace=%p\n",
 	    p->p_p, p->p_addr, p->p_vmspace);
 	(*pr)("estcpu=%u, cpticks=%d, pctcpu=%u.%u, swtime=%u\n",
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 253226a..79eb28c 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -24,11 +24,22 @@
 #include <sys/resourcevar.h>
 #include <sys/signalvar.h>
 #include <sys/mutex.h>
+#include <sys/tree.h>

 #include <uvm/uvm_extern.h>

 #include <sys/malloc.h>

+static int
+sched_cmp_proc(struct proc *a, struct proc *b) {
+	if (a == b)
+		return 0;
+	if (timercmp(&(a->p_deadline), &(b->p_deadline), <))
+		return -1;
+	return 1;
+}
+
+RB_GENERATE_STATIC(prochead, proc, p_runq, sched_cmp_proc);

 void sched_kthreads_create(void *);
@@ -79,10 +90,8 @@ void
 sched_init_cpu(struct cpu_info *ci)
 {
 	struct schedstate_percpu *spc = &ci->ci_schedstate;
-	int i;

-	for (i = 0; i < SCHED_NQS; i++)
-		TAILQ_INIT(&spc->spc_qs[i]);
+	RB_INIT(&spc->spc_runq);

 	spc->spc_idleproc = NULL;
@@ -158,18 +167,19 @@ sched_idle(void *v)
 		cpuset_add(&sched_idle_cpus, ci);
 		cpu_idle_enter();
-		while (spc->spc_whichqs == 0) {
+
+		while (curcpu_is_idle()) {
 			if (spc->spc_schedflags & SPCF_SHOULDHALT &&
-			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
+			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
 				cpuset_del(&sched_idle_cpus, ci);
 				SCHED_LOCK(s);
-				atomic_setbits_int(&spc->spc_schedflags,
-				    spc->spc_whichqs ? 0 : SPCF_HALTED);
+				atomic_setbits_int(&spc->spc_schedflags,
+				    SPCF_HALTED);
 				SCHED_UNLOCK(s);
 				wakeup(spc);
 			}
 			cpu_idle_cycle();
 		}
+
 		cpu_idle_leave();
 		cpuset_del(&sched_idle_cpus, ci);
 	}
@@ -222,14 +232,13 @@ void
 setrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun++;

-	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
-	spc->spc_whichqs |= (1 << queue);
+	KASSERT(!RB_FIND(prochead, &spc->spc_runq, p));
+	RB_INSERT(prochead, &spc->spc_runq, p);

 	cpuset_add(&sched_queued_cpus, p->p_cpu);
 	if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -240,38 +249,29 @@ void
 remrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun--;

-	TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
-	if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
-		spc->spc_whichqs &= ~(1 << queue);
-		if (spc->spc_whichqs == 0)
-			cpuset_del(&sched_queued_cpus, p->p_cpu);
-	}
+	KASSERT(RB_REMOVE(prochead, &spc->spc_runq, p));
+	if (RB_EMPTY(&spc->spc_runq))
+		cpuset_del(&sched_queued_cpus, p->p_cpu);
 }

 struct proc *
 sched_chooseproc(void)
 {
 	struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
-	struct proc *p;
-	int queue;
+	struct proc *p, *p_tmp = NULL;

 	SCHED_ASSERT_LOCKED();

 	if (spc->spc_schedflags & SPCF_SHOULDHALT) {
-		if (spc->spc_whichqs) {
-			for (queue = 0; queue < SCHED_NQS; queue++) {
-				TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
-					remrunqueue(p);
-					p->p_cpu = sched_choosecpu(p);
-					setrunqueue(p);
-				}
-			}
+		RB_FOREACH_SAFE(p, prochead, &spc->spc_runq, p_tmp) {
+			remrunqueue(p);
+
Re: Scheduler improvements, take 1001, Patch 5/5
diff --git a/sys/sched.h b/sys/sched.h
index fb01f21..1784ee2 100644
--- a/sys/sched.h
+++ b/sys/sched.h
@@ -69,8 +69,10 @@
 #ifndef	_SYS_SCHED_H_
 #define	_SYS_SCHED_H_

+#ifdef _KERNEL
 #include <sys/queue.h>
 #include <sys/tree.h>
+#endif

 /*
  * Posix defines a sched.h which may want to include sys/sched.h
@@ -88,11 +90,9 @@
 #define CP_IDLE		4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
+#ifdef _KERNEL
 /*
  * Per-CPU scheduler state.
- * XXX - expose to userland for now.
  */
 struct schedstate_percpu {
 	struct timeval spc_runtime;	/* time curproc started running */
@@ -107,15 +107,13 @@ struct schedstate_percpu {
 	u_int spc_nrun;			/* procs on the run queues */
 	fixpt_t spc_ldavg;		/* shortest load avg. for this cpu */

-	RB_HEAD(prochead, proc) spc_runq;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
 	LIST_HEAD(,proc) spc_deadproc;
-};

-#ifdef _KERNEL
+	RB_HEAD(prochead, proc) spc_runq;
+};

 /* spc_flags */
 #define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 4/5
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 12e48d6..99501a1 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,9 +102,11 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;

+#define ARCH_HAVE_CPU_TOPOLOGY
 	u_int32_t	ci_smt_id;
 	u_int32_t	ci_core_id;
 	u_int32_t	ci_pkg_id;
+
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 79eb28c..072ef38 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -496,6 +496,10 @@ int sched_cost_load = 1;
 int sched_cost_priority = 1;
 int sched_cost_runnable = 3;
 int sched_cost_resident = 1;
+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+int sched_cost_diffcore = 2;	/* cost for moving to a different core */
+int sched_cost_diffpkg = 3;	/* cost for moving to a different package */
+#endif

 int
 sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
@@ -536,6 +540,13 @@ sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
 		cost -= l2resident * sched_cost_resident;
 	}

+#ifdef ARCH_HAVE_CPU_TOPOLOGY
+	if (p->p_cpu->ci_pkg_id != ci->ci_pkg_id)
+		cost *= sched_cost_diffpkg;
+	else if (p->p_cpu->ci_core_id != ci->ci_core_id)
+		cost *= sched_cost_diffcore;
+#endif
+
 	return (cost);
 }
--
1.7.6
Re: Scheduler improvements, take 1001, Patch 3/5
diff --git a/arch/amd64/amd64/identcpu.c b/arch/amd64/amd64/identcpu.c
index c597bb0..982c2bb 100644
--- a/arch/amd64/amd64/identcpu.c
+++ b/arch/amd64/amd64/identcpu.c
@@ -210,6 +210,8 @@ void (*setperf_setup)(struct cpu_info *);

 void via_nano_setup(struct cpu_info *ci);

+void cpu_topology(struct cpu_info *ci);
+
 void
 via_nano_setup(struct cpu_info *ci)
 {
@@ -479,4 +481,123 @@ identifycpu(struct cpu_info *ci)
 		sensordev_install(&ci->ci_sensordev);
 #endif
 	}
+
+	cpu_topology(ci);
+}
+
+/*
+ * Base 2 logarithm of an int. returns 0 for 0 (yeye, I know).
+ */
+static int
+log2(unsigned int i)
+{
+	int ret = 0;
+
+	while (i >>= 1)
+		ret++;
+
+	return (ret);
+}
+
+static int
+mask_width(u_int x)
+{
+	int bit;
+	int mask;
+	int powerof2;
+
+	powerof2 = ((x - 1) & x) == 0;
+	mask = (x << (1 - powerof2)) - 1;
+
+	/* fls */
+	if (mask == 0)
+		return (0);
+	for (bit = 1; mask != 1; bit++)
+		mask = (unsigned int)mask >> 1;
+
+	return (bit);
+}
+
+/*
+ * Build up cpu topology for given cpu, must run on the core itself.
+ */
+void
+cpu_topology(struct cpu_info *ci)
+{
+	u_int32_t eax, ebx, ecx, edx;
+	u_int32_t apicid, max_apicid, max_coreid;
+	u_int32_t smt_bits, core_bits, pkg_bits;
+	u_int32_t smt_mask, core_mask, pkg_mask;

+	/* We need at least apicid at CPUID 1 */
+	CPUID(0, eax, ebx, ecx, edx);
+	if (eax < 1)
+		goto no_topology;

+	/* Initial apicid */
+	CPUID(1, eax, ebx, ecx, edx);
+	apicid = (ebx >> 24) & 0xff;

+	if (strcmp(cpu_vendor, "AuthenticAMD") == 0) {
+		/* We need at least apicid at CPUID 0x80000008 */
+		CPUID(0x80000000, eax, ebx, ecx, edx);
+		if (eax < 0x80000008)
+			goto no_topology;

+		CPUID(0x80000008, eax, ebx, ecx, edx);
+		core_bits = (ecx >> 12) & 0xf;
+		if (core_bits == 0)
+			goto no_topology;
+		/* So coreidsize 2 gives 3, 3 gives 7... */
+		core_mask = (1 << core_bits) - 1;
+		/* Core id is the least significant considering mask */
+		ci->ci_core_id = apicid & core_mask;
+		/* Pkg id is the upper remaining bits */
+		ci->ci_pkg_id = apicid & ~core_mask;
+		ci->ci_pkg_id >>= core_bits;
+	} else if (strcmp(cpu_vendor, "GenuineIntel") == 0) {
+		/* We only support leaf 1/4 detection */
+		CPUID(0, eax, ebx, ecx, edx);
+		if (eax < 4)
+			goto no_topology;
+		/* Get max_apicid */
+		CPUID(1, eax, ebx, ecx, edx);
+		max_apicid = (ebx >> 16) & 0xff;
+		/* Get max_coreid */
+		CPUID2(4, 0, eax, ebx, ecx, edx);
+		max_coreid = ((eax >> 26) & 0x3f) + 1;
+		/* SMT */
+		smt_bits = mask_width(max_apicid / max_coreid);
+		smt_mask = (1 << smt_bits) - 1;
+		/* Core */
+		core_bits = log2(max_coreid);
+		core_mask = (1 << (core_bits + smt_bits)) - 1;
+		core_mask ^= smt_mask;
+		/* Pkg */
+		pkg_bits = core_bits + smt_bits;
+		pkg_mask = -1 << core_bits;

+		ci->ci_smt_id = apicid & smt_mask;
+		ci->ci_core_id = (apicid & core_mask) >> smt_bits;
+		ci->ci_pkg_id = (apicid & pkg_mask) >> pkg_bits;
+	} else
+		goto no_topology;
+#ifdef DEBUG
+	printf("cpu%d: smt %u, core %u, pkg %u "
+	    "(apicid 0x%x, max_apicid 0x%x, max_coreid 0x%x, smt_bits 0x%x, "
+	    "smt_mask 0x%x, core_bits 0x%x, core_mask 0x%x, pkg_bits 0x%x, "
+	    "pkg_mask 0x%x)\n", ci->ci_cpuid, ci->ci_smt_id, ci->ci_core_id,
+	    ci->ci_pkg_id, apicid, max_apicid, max_coreid, smt_bits, smt_mask,
+	    core_bits, core_mask, pkg_bits, pkg_mask);
+#else
+	printf("cpu%d: smt %u, core %u, package %u\n", ci->ci_cpuid,
+	    ci->ci_smt_id, ci->ci_core_id, ci->ci_pkg_id);
+#endif
+	return;
+	/* We can't map, so consider ci_core_id as ci_cpuid */
+no_topology:
+	ci->ci_smt_id = 0;
+	ci->ci_core_id = ci->ci_cpuid;
+	ci->ci_pkg_id = 0;
+}
diff --git a/arch/amd64/include/cpu.h b/arch/amd64/include/cpu.h
index 9ce437a..12e48d6 100644
--- a/arch/amd64/include/cpu.h
+++ b/arch/amd64/include/cpu.h
@@ -102,6 +102,9 @@ struct cpu_info {
 	u_int32_t	ci_cflushsz;
 	u_int64_t	ci_tsc_freq;

+	u_int32_t	ci_smt_id;
+	u_int32_t	ci_core_id;
+	u_int32_t	ci_pkg_id;
 	struct cpu_functions *ci_func;
 	void (*cpu_setup)(struct cpu_info *);
 	void (*ci_info)(struct cpu_info *);
diff --git a/arch/amd64/include/specialreg.h b/arch/amd64/include/specialreg.h
index 142fbbc..cab0985 100644
---
Re: Scheduler improvements, take 1001, Patch 2/5
This patch simply halves the timeslice processes get until they are preempted. This patch is standalone and the rest of the patches do not depend on it, but I figured I'd throw it in anyway. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 3/5
This patch simply imports Christiano's code for detecting CPU topology, as posted on tech@ a while (more than two months) ago. I took it verbatim and didn't change anything yet. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 4/5
This patch uses the previous one to take CPU topology into account when calculating the cost of moving a process between CPUs. This is only done on amd64 at the moment, and the cost factors are guesses right now, but it's a start. -- Gregor Best
Re: Scheduler improvements, take 1001, Patch 5/5
This patch moves struct schedstate_percpu to kernel land, which I think is cleaner than exposing structures for scheduler state to userland, especially since grepping for 'schedstate' in /usr/src yielded no results outside of /usr/src/sys. I have not seen any negative impact from this, but I haven't yet run a full userland build (it's running at the moment, but the machine I'm building on is a bit slower than my laptop). -- Gregor Best
Re: Scheduler improvements
Hi, Seems to be a bit snappier, though more thorough testing needed. Many web pages became more usable after the patch (chrome as browser) and some gains in desktop too. System is amd64 @ Intel(R) Core(TM)2 Duo CPU T5870 @ 2.00GHz and 4G RAM. -- cheers, Ville
Re: Scheduler improvements
log2 should probably be scrapped, since you're reinventing ffs.
Re: Scheduler improvements
On 8 October 2012 11:24, Marc Espie es...@nerim.net wrote: log2 should probably be scrapped, since you're reinventing ffs. That is actually my code; back then I failed to find an ffs. I'll try to have a look at the code this week.
Re: Scheduler improvements
On Sat, Oct 06, 2012 at 10:38:34PM +0200, Gregor Best wrote: Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (e.g. an audio interrupt). When the event occurs, the process is put on the runnable queue calculated from that priority. This way, interactive processes (i.e. those waiting for external events) are prioritized during event processing. This mechanism ensures audio and midi are usable without having to manually tweak priorities. Furthermore, priorities change depending on what the process is doing. The most notable exception to the above rule is the poll(2) implementation, which still ignores priorities; it may need to be fixed one day, but this is not the most urgent problem hurting interactive processes. -- Alexandre
Re: Scheduler improvements
On Oct 8, 2012 6:33 PM, Alexandre Ratchov a...@caoua.org wrote: On Sat, Oct 06, 2012 at 10:38:34PM +0200, Gregor Best wrote: Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (e.g. an audio interrupt). When the event occurs, the process is put on the runnable queue calculated from that priority. This way, interactive processes (i.e. those waiting for external events) are prioritized during event processing. This mechanism ensures audio and midi are usable without having to manually tweak priorities. Furthermore, priorities change depending on what the process is doing. The most notable exception to the above rule is the poll(2) implementation, which still ignores priorities; it may need to be fixed one day, but this is not the most urgent problem hurting interactive processes. -- Alexandre Ratchov is correct: when a process sleeps, you specify the priority it should have when waking up. I think Gregor was referring to the old version of his code. The 4.4BSD scheduler relies heavily on these priority boosts. I still didn't have a look at Gregor's code, but an equivalent mechanism is usually needed. In CFS scheduling, for example, the process forces a new delta calculation and gets a fraction proportional to all other runnable processes.
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 06:28:53PM +0200, Alexandre Ratchov wrote: [...] AFAIU, when a process waits for an event (with tsleep(9)), it provides the priority of the event it's waiting for (eg an audio interrupt). When the event occurs, the process is put in the runnable queue calculated from the priority. This way, interactive processes (i.e. waiting for external events) are prioritized during the event processing. [...] I haven't noticed any degradation in performance with media-stuff (quite the contrary, actually), but you have a point. I'll change the deadline-calculation to also take the sleep priority into account. Thanks for the explanation :) -- Gregor Best
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 18:43, Christiano F. Haesbaert wrote: Ratchov is correct, when a process sleeps you specify the priority you should have when waking up. I think george was referring to the old version of his code. The 4.4 bsd scheduler relies heavily on these priority boosts, i still didnt have a look at George's code, but an equivalent mechanism is usually needed. On cfs scheduling for example the processes forces a new delta calculation and gets a fraction proportional to all other runnables processes. I've wondered how effective (or more importantly, accurate) these priorities are. Maintenance has been somewhat haphazard, e.g., I have no idea when the last use of PVFS was removed, but there aren't any left except for the definition in param.h.
Re: Scheduler improvements
On Sun, Oct 07, 2012 at 04:04:16PM +0200, Gregor Best wrote: [...] I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave(). [...] Fair point. I'll change that :) Your diff doesn't pass userland compiles. You're adding a dependency on tree.h in proc.h. In kernel land on amd64, that's okay, since machine/param.h will pull machine/cpu.h, which pulls sys/sched.h, which now pulls sys/tree.h, but it definitely doesn't fly in userland... my build crashed and burned in libc... assuming that's okay (not usually a kernel guy), you can go include sys/tree.h directly in proc.h... Please, please, please, don't vanish into the woods for the next 6 months. Try splitting up your diff and working them into pieces kernel hackers can work with...
Re: Scheduler improvements
On Mon, Oct 08, 2012 at 07:42:13PM +0200, Marc Espie wrote: [...] Your diff doesn't pass userland compiles. You're adding a dependency on tree.h in proc.h. In kernel land on amd64, that's okay, since machine/param.h will pull machine/cpu.h, which pulls sys/sched.h, which now pulls sys/tree.h, but it definitely doesn't fly in userland... my build crashed and burned in libc... assuming that's okay (not usually a kernel guy), you can go include sys/tree.h directly in proc.h... [...] I think wrapping all the things that need sys/tree.h in #ifdef _KERNEL should do the trick, since they don't really make sense in userland anyway. Please, please, please, don't vanish into the woods for the next 6 months. [...] Since I now have way more spare time on my hands than before, that should not be an issue anymore, don't worry :) Try splitting up your diff and working them into pieces kernel hackers can work with... [...] I will. Actually, the diff is a bunch of separate git commits on my machine, so that shouldn't be too hard to do. -- Gregor Best
Re: Scheduler improvements
[...] I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave(). [...] Fair point. I'll change that :) -- Gregor Best
Re: Scheduler improvements
Hi Alexandre, [...] This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. [...] Where do you see that? The code I removed/changed simply calculated the queue from which to remove `p` and removed it from there (similar for insertion). That needed to be changed to modify the RB tree used as a new runqueue. [...] For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. [...] They weren't prioritized with the old version of this code either. [...] this change is unrelated to the rest, isn't it? [...] Yes, it is. I changed it because 100ms time slices felt a bit too large, but the rest of the patch will of course work without this bit. Thanks for taking the time to look at this :) -- Gregor Best
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. I know my report is not a benchmark of any kind, but I do see a slight improvement when running a full GNOME 3 installation. IPIs never go past 10K, where they are regularly around 50K without this patch (confirmed with 2 different amd64 boxes). I can't really see any interactive enhancements, but I don't think that's really the point. That said, GNOME 3 is a nice interactive test for that kind of thing: it's heavily threaded, and currently the SP kernel by far outperforms the MP one. Please count me as a guinea pig for any scheduler patch, because my desktop machines have become barely usable with an MP kernel since we switched to rthreads, and I would very much like to see some improvements in that area. I'll keep running with this (and any other mods) for the upcoming weeks, and if I have time to do some real and useful benchmarking or see any regression, I'll let you know. Thanks. -- Antoine
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: Hi people, after a (very) long time of silence on this, here's another go at it. This time, I basically started from scratch and used a bit of code by Christiano Haesbaert which had been posted to tech@ a while ago to detect CPU topology on amd64 and take that into account when moving processes between CPUs. This version has one single queue per CPU, getting rid of a) the one single system-wide runqueue and b) the queue for expired processes. This simplifies things a bit and performs just as well as my previous versions (the only difference is the order in which expired procs get selected for running on a CPU). One advantage is that process selection is in O(log n) of the number of processes on the CPU and depends neither on the total number of processes nor the number of expired processes in the runqueue. The factors for the cost of moving a process between hardware threads, cpu dies and cpu packages are guesses now; I think those will have to be tuned further. Sadly, I haven't had access to a multiprocessor machine with a more diverse architecture than a bunch of cores on the same die. I tested this on some more machines than before: a Core i5, an i7 and my Core 2 Duo, and on all machines (perceived) interactivity was improved. The simplest benchmark I used was playing back a 1080p version of Big Buck Bunny with mplayer. All machines I tested on had Intel graphics cards, one GM965 (on the Core2Duo) and the others were Sandy Bridge devices. On all of them, playback was smoother, most visibly on the i7. With the default scheduler, watching the movie was a big pain due to heavy frame-dropping; with my patch, the movie was watchable (with frame dropping only (barely) noticeable in scenes with much movement). As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like.
With your patch, html5 videos on firefox are smoother; playback is even better with firefox + dpb running 2 jobs than with only firefox without your patch installed. I use i3wm, so I can't comment on the performance of a full desktop environment. My hardware: AMD A4-3400, 8GB, ATI Radeon HD 4350. Please keep working on this area :) -- Juan Francisco Cantero Hurtado http://juanfra.info
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote:

@@ -222,14 +230,13 @@
 setrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun++;

-	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
-	spc->spc_whichqs |= (1 << queue);
+	KASSERT(!RB_FIND(prochead, &spc->spc_runq, p));
+	RB_INSERT(prochead, &spc->spc_runq, p);

 	cpuset_add(&sched_queued_cpus, p->p_cpu);
 	if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
@@ -240,38 +247,29 @@
 remrunqueue(struct proc *p)
 {
 	struct schedstate_percpu *spc;
-	int queue = p->p_priority >> 2;

 	SCHED_ASSERT_LOCKED();
 	spc = &p->p_cpu->ci_schedstate;
 	spc->spc_nrun--;

-	TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
-	if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
-		spc->spc_whichqs &= ~(1 << queue);
-		if (spc->spc_whichqs == 0)
-			cpuset_del(&sched_queued_cpus, p->p_cpu);
-	}
+	KASSERT(RB_REMOVE(prochead, &spc->spc_runq, p));
+	if (RB_EMPTY(&spc->spc_runq))
+		cpuset_del(&sched_queued_cpus, p->p_cpu);
 }

This change is unclear for me; AFAIU, it removes the mechanism which makes processes wake up with a priority depending on what they are blocked on. For instance, processes waking up from poll(2) or audio read/write won't be prioritized any longer. If so, this would hurt audio and other interactive processes but would improve cpu-intensive bloatware. haven't tested this though

Index: kern/sched_bsd.c
===================================================================
RCS file: /cvs/src/sys/kern/sched_bsd.c,v
retrieving revision 1.30
diff -u -r1.30 sched_bsd.c
--- kern/sched_bsd.c	9 Jul 2012 17:27:32 -0000	1.30
+++ kern/sched_bsd.c	4 Oct 2012 21:27:58 -0000
@@ -77,20 +77,18 @@

 	timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);

-	rrticks_init = hz / 10;
+	rrticks_init = hz / 20;

this change is unrelated to the rest, isn't it?

-- Alexandre
Re: Scheduler improvements
It appears to have sped up porn. Movies on the machine seem a bit better. I will try this in a few other places. Just not at the mother-in-law's or in public places, no matter how impressed you are at the difference? -- ___ 'Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface' (Doug McIlroy) ___
Re: Scheduler improvements
On Fri Oct 5 2012 14:24, Antoine Jacoutot wrote: On Thu, Oct 04, 2012 at 11:42:45PM +0200, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. I know my report is not a benchmark of any kind but I do see a slight improvement when running a full GNOME 3 installation. IPIs never go past 10K where they are regularly around 50K without this patch (confirmed with 2 different amd64 boxes). I can confirm this, too. GNOME 3 generated around 80,000 to over 100,000 IPIs, simply by moving the mouse cursor over the menu entries for a few seconds! The desktop felt sluggish. With this patch, the same procedure feels less sluggish, and the IPI count dropped to 9,000 to 15,000. Further, the time it takes to compile the kernel more than halved, from ~10 min to 4.27 min. And, best of all, for the first time I'm able to watch HTML5 embedded videos smoothly, even in full screen mode.
Re: Scheduler improvements
Date: Thu, 4 Oct 2012 23:42:45 +0200 From: Gregor Best g...@ring0.de

Index: kern/kern_sched.c
===
RCS file: /cvs/src/sys/kern/kern_sched.c,v
retrieving revision 1.27
diff -u -r1.27 kern_sched.c
--- kern/kern_sched.c	10 Jul 2012 18:20:37 -	1.27
+++ kern/kern_sched.c	4 Oct 2012 21:27:58 -
@@ -158,18 +167,17 @@
 		cpuset_add(&sched_idle_cpus, ci);
 		cpu_idle_enter();
-		while (spc->spc_whichqs == 0) {
-			if (spc->spc_schedflags & SPCF_SHOULDHALT &&
-			    (spc->spc_schedflags & SPCF_HALTED) == 0) {
-				cpuset_del(&sched_idle_cpus, ci);
-				SCHED_LOCK(s);
-				atomic_setbits_int(&spc->spc_schedflags,
-				    spc->spc_whichqs ? 0 : SPCF_HALTED);
-				SCHED_UNLOCK(s);
-				wakeup(spc);
-			}
-			cpu_idle_cycle();
+
+		if (spc->spc_schedflags & SPCF_SHOULDHALT &&
+		    (spc->spc_schedflags & SPCF_HALTED) == 0) {
+			cpuset_del(&sched_idle_cpus, ci);
+			SCHED_LOCK(s);
+			atomic_setbits_int(&spc->spc_schedflags, SPCF_HALTED);
+			SCHED_UNLOCK(s);
+			wakeup(spc);
 		}
+		cpu_idle_cycle();
+
 		cpu_idle_leave();
 		cpuset_del(&sched_idle_cpus, ci);
 	}
@@ -222,14 +230,13 @@

I don't think changing the idle loop like this is ok. You want to continue checking whether the runqueue is empty in between cpu_idle_enter() and cpu_idle_leave().
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 23:42, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. A short one is building a kernel. Try before and after with make, make -j 2, and make -j 4. (or ncpu and ncpu * 2).
Re: Scheduler improvements
On Thu, Oct 04, 2012 at 06:28:11PM -0400, Ted Unangst wrote: On Thu, Oct 04, 2012 at 23:42, Gregor Best wrote: As before, I'm looking forward to anything you have to comment, especially cool benchmark ideas or the like. A short one is building a kernel. Try before and after with make, make -j 2, and make -j 4. (or ncpu and ncpu * 2). A more realistic and useful one is rebuilding the full package tree with dpb, which is rather simple to try these days (just requires about 30GB of disk space) kernel will take a few minutes, dpb some hours to a few days...
Re: Scheduler improvements
On Fri, Jan 20, 2012 at 06:06:08PM +0100, Ariane van der Steldt wrote:

[...] * Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100. I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz? [...]

Hmm, maybe my writing is not as clear as I thought it to be. The 20ms vs 10ms was the second version of my patch vs. the first one. Mainline has timeslices of 100ms. They are already derived from hz in kern/sched_bsd.c:scheduler_start(). Originally, the relevant line read

	rrticks_init = hz / 10;

whereas now it is

	rrticks_init = hz / 50;

The first version of the patch had

	rrticks_init = hz / 100;

which led to rrticks_init being zero on machines with hz < 100. From the comments on the first version I gathered that one can rely on 50 <= hz <= 1024. I'm not too sure whether changing timeslices has any real benefit though. I noticed a bit less A/V lag in mplayer with smaller timeslices so I gave it a shot.

[...] You'll want to test this on a multi-socket machine as well (preferably a numa machine). You're currently migrating processes using your on-die L3 cache, which is considerably faster than memory. [...]

Yes, right. Christiano also mentioned that in an off-list mail. Maybe I should abuse one of the machines at my university for that...

[...] The above comparison will fail if: a != b, but a->p_deadline_set == b->p_deadline_set. Your code hides this from view, because of the a==b test before it, turning it into a rare occurrence. However when it hits during insert or remove, you may find your tree getting corrupted and your invariants getting violated. [...]

Right. Thanks for spotting that. I'll change that in the next iteration of the patch.

[...] Note: the above breaks RB_FIND and RB_NFIND, since a search clause can never compare equal. [...]
Since neither RB_FIND nor RB_NFIND is used (as far as I can see), I'd call that a good tradeoff.

[...]
+
+RB_GENERATE_STATIC(runqueue, proc, p_runq, sched_cmp_proc);
+RB_GENERATE_STATIC(expqueue, proc, p_expq, sched_cmp_proc_exp);
That's gonna add a lot of code. [...]

Hmm, is there any better way to do it? I couldn't think of any way to have two queues with a disjoint set of processes in each and their only difference being the comparison function.

[...] Btw, even though the implementation works fine without it, I'd really like if you could put a matching RB_PROTOTYPE_STATIC before the generator macro. Using GENERATE without PROTOTYPE is abuse of an implementation detail. [...]

Will be fixed.

[...]
+	if (!(RB_REMOVE(runqueue, &sched_runqueue, p)))
Is it ok to call this function for a process not on the run queue? Your code implies it is. If you don't intend that to happen, put a KASSERT in.
+		RB_REMOVE(expqueue, &sched_expqueue, p);
Please check the return value. I understand that if a process is on the runqueue, it must also be on the expqueue? [...]

No, runqueue and expqueue are disjoint. A process is either in the runqueue, if its deadline has not yet been reached, or in the expqueue, if its deadline is expired. Maybe I should think of a better name than runqueue, because essentially both RB trees together are the runqueue.

[...] tv_usec is measured in microseconds, msec suggests milliseconds. I'd change that to: .tv_usec = offset_msec % 1000 * 1000 [...]

D'oh! I'll fix that. Must've been my brain being a bit sleepy :)

[...] Thanks a lot for your time and suggestions :)

-- Gregor Best
Re: Scheduler improvements
On Wed, Jan 18, 2012 at 04:41:35PM +0100, Gregor Best wrote:

after a long time of silence here's a second iteration of the patch. I've addressed a few concerns voiced here:

* Process lookup and friends now have O(log n) runtime. I achieved that by abusing RB-trees as priority queues, since they have runtime O(log n) in all relevant algorithms.

I'm wondering if a heap tree isn't more suitable. But since they're currently not in the kernel, I guess the RB tree is the best solution.

* The algorithm for calculating a new deadline for a given process has been simplified and is now documented a bit better. It also derives the deadline offset from the value of hz (via rrticks_init) as suggested by Miod (?).
* CPU rlimits are now honored again. The relevant code did not change, the new patch just doesn't remove rlimit enforcement anymore.
* Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100.

I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz?

With recent improvements in the mainline scheduler and especially rthreads, the performance of the patched scheduler and mainline is now roughly similar, at least as far as throughput is concerned. I have the feeling that the system behaves snappier with my patch, but that might be some sort of placebo effect. I haven't yet come up with a reliable method to benchmark interactivity except for actually using the machine and doing stuff on it. It's interesting to note, however, that the patched scheduler achieves performance similar to the default one without all the fancy methods for calculating how expensive it is to move a process from one CPU to another, or related methods for preserving cache locality.

You'll want to test this on a multi-socket machine as well (preferably a numa machine).
You're currently migrating processes using your on-die L3 cache, which is considerably faster than memory.

I use the patched scheduler exclusively on my Core2Duo machine with an MP build. The amount of lines removed versus added by this patch shifted towards more added lines but is still at 173 lines less than the default. Once again, comments, rants, insults, everything is welcome :)

Comments on the code inline. This is just me glancing over it, I'll try and get more in-depth over the weekend.

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/*
Re: Scheduler improvements
On Fri, Jan 20, 2012 at 06:06:08PM +0100, Ariane van der Steldt wrote: * Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100. I'm not really happy having timeslices get larger. They're considerably large already. Can you think of a way to derive a sane value based on the hz? 20ms slices are a big problem. You only get 50 per second. This will yield all kinds of stroboscopic effects in video applications... (granted, that's probably not an issue on machines with hz > 100...)
Re: Scheduler improvements
Hi people, after a long time of silence here's a second iteration of the patch. I've addressed a few concerns voiced here:

* Process lookup and friends now have O(log n) runtime. I achieved that by abusing RB-trees as priority queues, since they have runtime O(log n) in all relevant algorithms.
* The algorithm for calculating a new deadline for a given process has been simplified and is now documented a bit better. It also derives the deadline offset from the value of hz (via rrticks_init) as suggested by Miod (?).
* CPU rlimits are now honored again. The relevant code did not change, the new patch just doesn't remove rlimit enforcement anymore.
* Timeslices are 20ms long instead of 10ms. This solves the issue of 0ms long timeslices on machines with hz < 100.

With recent improvements in the mainline scheduler and especially rthreads, the performance of the patched scheduler and mainline is now roughly similar, at least as far as throughput is concerned. I have the feeling that the system behaves snappier with my patch, but that might be some sort of placebo effect. I haven't yet come up with a reliable method to benchmark interactivity except for actually using the machine and doing stuff on it. It's interesting to note, however, that the patched scheduler achieves performance similar to the default one without all the fancy methods for calculating how expensive it is to move a process from one CPU to another, or related methods for preserving cache locality. I use the patched scheduler exclusively on my Core2Duo machine with an MP build. The amount of lines removed versus added by this patch shifted towards more added lines but is still at 173 lines less than the default.
Once again, comments, rants, insults, everything is welcome :)

-- Gregor Best

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/* idle proc for this cpu */
@@ -107,9 +104,6 @@
 	u_int spc_nrun;	/* procs on the run queues */
 	fixpt_t spc_ldavg;	/* shortest load avg. for this cpu */

-	TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-	volatile uint32_t spc_whichqs;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
@@ -119,18 +113,16 @@

 #ifdef _KERNEL
 /* spc_flags */
-#define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
-#define SPCF_SHOULDYIELD	0x0002	/* process should yield the CPU */
-#define SPCF_SWITCHCLEAR	(SPCF_SEENRR|SPCF_SHOULDYIELD)
-#define SPCF_SHOULDHALT	0x0004	/* CPU
Re: Scheduler improvements
And it didn't take long for me to find a small bug... attached is a fixed version of the patch. Such things happen if one decides to regenerate a patch just in case and forgets to revert to a working version before doing that :D

-- Gregor Best

Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.149
diff -u -r1.149 proc.h
--- sys/proc.h	7 Jan 2012 05:38:12 -	1.149
+++ sys/proc.h	17 Jan 2012 16:10:09 -
@@ -43,6 +43,7 @@
 #include <machine/proc.h>	/* Machine-dependent proc substruct. */
 #include <sys/selinfo.h>	/* For struct selinfo */
 #include <sys/queue.h>
+#include <sys/tree.h>
 #include <sys/timeout.h>	/* For struct timeout */
 #include <sys/event.h>	/* For struct klist */
 #include <sys/mutex.h>	/* For struct mutex */
@@ -210,7 +211,9 @@
 #define	PS_SINGLEUNWIND	_P_SINGLEUNWIND

 struct proc {
-	TAILQ_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_runq;
+	RB_ENTRY(proc) p_expq;
+	TAILQ_ENTRY(proc) p_slpq;
 	LIST_ENTRY(proc) p_list;	/* List of all processes. */

 	struct process *p_p;	/* The process of this thread. */
@@ -251,6 +254,9 @@
 #endif

 	/* scheduling */
+	struct timeval p_deadline;	/* virtual deadline used for scheduling */
+	struct timeval p_deadline_set;	/* time at which the deadline was set */
+	int p_rrticks;	/* number of ticks this process is allowed to stay on a processor */
 	u_int p_estcpu;	/* Time averaged value of p_cpticks. */
 	int p_cpticks;	/* Ticks of cpu time. */
 	fixpt_t p_pctcpu;	/* %cpu for this process during p_swtime */

Index: sys/sched.h
===
RCS file: /cvs/src/sys/sys/sched.h,v
retrieving revision 1.30
diff -u -r1.30 sched.h
--- sys/sched.h	16 Nov 2011 20:50:19 -	1.30
+++ sys/sched.h	17 Jan 2012 16:10:09 -
@@ -87,8 +87,6 @@
 #define CP_IDLE	4
 #define CPUSTATES	5

-#define	SCHED_NQS	32	/* 32 run queues. */
-
 /*
  * Per-CPU scheduler state.
  * XXX - expose to userland for now.
@@ -99,7 +97,6 @@
 	u_int spc_schedticks;	/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES];	/* CPU state statistics */
 	u_char spc_curpriority;	/* usrpri of curproc */
-	int spc_rrticks;	/* ticks until roundrobin() */
 	int spc_pscnt;	/* prof/stat counter */
 	int spc_psdiv;	/* prof/stat divisor */
 	struct proc *spc_idleproc;	/* idle proc for this cpu */
@@ -107,9 +104,6 @@
 	u_int spc_nrun;	/* procs on the run queues */
 	fixpt_t spc_ldavg;	/* shortest load avg. for this cpu */

-	TAILQ_HEAD(prochead, proc) spc_qs[SCHED_NQS];
-	volatile uint32_t spc_whichqs;
-
 #ifdef notyet
 	struct proc *spc_reaper;	/* dead proc reaper */
 #endif
@@ -119,18 +113,16 @@

 #ifdef _KERNEL
 /* spc_flags */
-#define SPCF_SEENRR	0x0001	/* process has seen roundrobin() */
-#define SPCF_SHOULDYIELD	0x0002	/* process should yield the CPU */
-#define SPCF_SWITCHCLEAR	(SPCF_SEENRR|SPCF_SHOULDYIELD)
-#define SPCF_SHOULDHALT	0x0004	/* CPU should be vacated */
-#define SPCF_HALTED	0x0008	/* CPU has been halted */
+#define SPCF_SHOULDYIELD	0x0001	/* process should yield the CPU */
+#define SPCF_SHOULDHALT	0x0002	/* CPU should be vacated */
+#define SPCF_HALTED	0x0004	/* CPU has been halted */

-#define	SCHED_PPQ	(128 / SCHED_NQS)	/* priorities per queue */
 #define NICE_WEIGHT	2	/* priorities per nice level */
-#define	ESTCPULIM(e)	min((e), NICE_WEIGHT * PRIO_MAX - SCHED_PPQ)
+#define	ESTCPULIM(e)	min((e), NICE_WEIGHT * PRIO_MAX)

 extern int schedhz;	/* ideally: 16 */
 extern int rrticks_init;	/* ticks per roundrobin() */
+extern struct cpuset sched_idle_cpus;

 struct proc;
 void schedclock(struct proc *);
@@ -147,18 +139,20 @@
 void cpu_switchto(struct proc *, struct proc *);
 struct proc *sched_chooseproc(void);
 struct cpu_info *sched_choosecpu(struct proc *);
-struct cpu_info *sched_choosecpu_fork(struct proc *parent, int);
+struct cpu_info *sched_choosecpu_fork(struct proc *parent);
 void cpu_idle_enter(void);
 void cpu_idle_cycle(void);
 void cpu_idle_leave(void);
 void sched_peg_curproc(struct cpu_info *ci);
+void generate_deadline(struct proc *, char);
+
 #ifdef MULTIPROCESSOR
 void sched_start_secondary_cpus(void);
 void sched_stop_secondary_cpus(void);
 #endif

-#define curcpu_is_idle()	(curcpu()->ci_schedstate.spc_whichqs == 0)
+#define curcpu_is_idle()	(cpuset_isset(&sched_idle_cpus, curcpu()))

 void
Re: Scheduler improvements
On Wed, Oct 12, 2011 at 02:55:39PM +0200, Gregor Best wrote:

Hi people, attached is a patch that basically rewrites the CPU scheduler. It replaces the multilevel feedback queue currently employed with an earliest effective deadline first approach, similar to the BrainFuck Scheduler for Linux. My main motivation for rewriting the scheduler was mplayer getting killed all the time because it went over its share of CPU resources, something which cannot happen with the patch I wrote. The algorithm employed simply assigns a (soft) deadline to each process, calculated solely from the nice value, and selects the next process to execute based on the earliest deadline in the runqueue. If more than one deadline is missed, the first one encountered in FIFO order (FIFO with respect to insertion into the runqueue) with a deadline smaller than the current time is selected. This prevents processes from yielding shortly before their timeslice expires to fool the scheduler into leaving them in their respective priority queue, thus preventing multiple processes from teaming up to starve others of CPU runtime. I did a few benchmarks to test the difference between the original and my patched scheduler to see whether there were any improvements. The tool I used was sysbench (with both CPU and Threads benchmarks). There are a few graphs of the tests at http://gurkenfunk.de/projects/openbsd-eevdf, but it looks as if my patches improve latency on workloads with high CPU usage, little I/O and a relatively large number of threads. I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

All this is very interesting, and I've spent some time, in 2.7 days, on trying to understand/improve the behaviour of interactive processes (aka processes like synths that wait on i/o and have a deadline for processing their input).
I still believe that the current scheduler, in the UP case, is very friendly to audio, since interactive processes almost always get the CPU in time. AFAICS, changing the scheduler won't improve correct audio code that already works. Another scheduler might improve the wrong code. Since we're talking about mplayer and the like, currently most of the stuttering is caused by mistakes in audio code and cpu-intensive code running in kernel mode. The scheduler doesn't hurt interactive processes, does it? Oh, also, with the patch, the scheduler code is 225 lines shorter and IMHO a bit easier to understand. I love simplification diffs, especially ones with more -'s than +'s. You've removed some of the MP affinity bits and the rlimit bits; I guess this is temporary, for clarity only, isn't it? I've read the Kolivas design paper you suggest, but failed to understand how this scheduling algorithm would improve interactive processes like mplayer. Do you have further references? -- Alexandre
Scheduler improvements
Hi people, attached is a patch that basically rewrites the CPU scheduler. It replaces the multilevel feedback queue currently employed with an earliest effective deadline first approach, similar to the BrainFuck Scheduler for Linux. My main motivation for rewriting the scheduler was mplayer getting killed all the time because it went over its share of CPU resources, something which cannot happen with the patch I wrote. The algorithm employed simply assigns a (soft) deadline to each process, calculated solely from the nice value, and selects the next process to execute based on the earliest deadline in the runqueue. If more than one deadline is missed, the first one encountered in FIFO order (FIFO with respect to insertion into the runqueue) with a deadline smaller than the current time is selected. This prevents processes from yielding shortly before their timeslice expires to fool the scheduler into leaving them in their respective priority queue, thus preventing multiple processes from teaming up to starve others of CPU runtime. I did a few benchmarks to test the difference between the original and my patched scheduler to see whether there were any improvements. The tool I used was sysbench (with both CPU and Threads benchmarks). There are a few graphs of the tests at http://gurkenfunk.de/projects/openbsd-eevdf, but it looks as if my patches improve latency on workloads with high CPU usage, little I/O and a relatively large number of threads. I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further. Oh, also, with the patch, the scheduler code is 225 lines shorter and IMHO a bit easier to understand. -- Gregor Best
Re: Scheduler improvements
Hi people, [...] I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

This is interesting, there might be a good outcome of this diff eventually. However, as of now:

- you have removed the bunch of code which tries to make sure processes do not hop between cpus unless there is a gain in doing so, on MP kernels. Did you try your diff on an MP kernel?
- non-tabbed indent is evil. And makes the diff harder to read.
- I see many locks removed in there, and moving information to the proc structure does not explain all of them. This gives a bad gut feeling about the behaviour of this on MP kernels. A really bad gut feeling.
- you have changed the roundrobin duration from (hz / 10) to (hz / 100). However, this computation yields zero on platforms where hz < 100, which is not a good idea. The only assumption you can make about hz is that 50 <= hz <= 1024.
- you removed the enforcement of RLIMIT_CPU. No way. Over my dead body.
- I also don't really like switching from NQS run queues to a single one. I understand you need this to be able to compare the deadlines with only one queue walk, but this can become expensive with many processes in the runqueue...
- your priority computations are unexplained magic. What does 20 mean? Shouldn't at least one of them be derived from `hz'? Why 1 versus 5? Are you sure the values you compute are always within the bounds you are expecting? Why isn't there any comment about these computations?

Miod
Re: Scheduler improvements
On Wed, Oct 12, 2011 at 05:51:33PM +, Miod Vallat wrote:

Hi people, [...] I'm not aiming for a "yeah, nice, we'll merge it" on this, but rather for suggestions whether it's worth anyone's time to pursue this further.

This is interesting, there might be a good outcome of this diff eventually. However, as of now:

- you have removed the bunch of code which tries to make sure processes do not hop between cpus unless there is a gain in doing so, on MP kernels. Did you try your diff on an MP kernel?

[...] Yes. Almost all builds I have done so far were MP kernels, which I run almost all the time (I switch to un-patched kernels for benchmarks only). I'll do a benchmark with an SP build though to check whether there are any negative impacts.

- non-tabbed indent is evil. And makes the diff harder to read.

[...] I will clean that up.

- I see many locks removed in there, and moving information to the proc structure does not explain all of them. This gives a bad gut feeling about the behaviour of this on MP kernels. A really bad gut feeling.

[...] Works fine with two cores, but I can see where that feeling comes from. I'll add some more explanations to my changes.

- you have changed the roundrobin duration from (hz / 10) to (hz / 100). However, this computation yields zero on platforms where hz < 100, which is not a good idea. The only assumption you can make about hz is that 50 <= hz <= 1024.

[...] Good point. I'll change that.

- you removed the enforcement of RLIMIT_CPU. No way. Over my dead body.
- I also don't really like switching from NQS run queues to a single one. I understand you need this to be able to compare the deadlines with only one queue walk, but this can become expensive with many processes in the runqueue...

[...] Right. I'm planning to move this to a real priority queue via some sort of heap, so lookups and insertions can be done in O(log n) instead of O(n).

- your priority computations are unexplained magic. What does 20 mean? Shouldn't at least one of them be derived from `hz'? Why 1 versus 5? Are you sure the values you compute are always within the bounds you are expecting? Why isn't there any comment about these computations?

[...] Deriving them from hz didn't cross my mind but you are right of course. I chose the 5 because that makes for a relatively linear increase in the deadlines while not overflowing the value of .tv_usec. I'll add some explanations though.

Miod

Thanks a lot for your time and suggestions.

-- Gregor Best
Re: Scheduler improvements
I like what I see. It could be the start for the means of managing a single socket's queue of processes. You do mention that this won't really scale beyond roughly 8 cores. I would love to see you extend this with a 2D array of weights that can be populated by various means (run-time testing, or ACPI tables, etc) with the relative costs of moving a process from one core to another (or one scheduling queue to another, if queue's are assigned to a set of cores, say a socket full of cores). In this way, we may be able to have cores shut down by modifying these weights based on demand/load. While I agree with your comments that OpenBSD was not originally made for lots of cpu/cores, I don't really wish to entertain huge steps backwards on that front. The rthreads work (amoung other work) has been done with an eye towards more efficiently using MP systems. Again, I like this, looks simpler. Let's see you try and solve more of the MP parts as well. :) -Toby.