more data: SCHED_ULE+PREEMPTION is the problem
On 04/07/18 10:18, Peter wrote: > Hi all, > [...] Thanks for all the investigation! > 3. kern.sched.preempt_thresh > > I could make the problem disappear by changing kern.sched.preempt_thresh > from the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This > seems to correspond to the disk interrupt threads, which run at intr:12 > (i5-3570T) or intr:8 (p3). > [...] More data. With SCHED_4BSD at FreeBSD 10.4-RELEASE-p8 #0 r331984: kern.sched.runq_fuzz: 1 kern.sched.ipiwakeup.useloop: 0 kern.sched.ipiwakeup.usemask: 1 kern.sched.ipiwakeup.delivered: 376139898 kern.sched.ipiwakeup.requested: 376137875 kern.sched.ipiwakeup.enabled: 1 kern.sched.slice: 12 kern.sched.quantum: 94488 kern.sched.name: 4BSD kern.sched.preemption: 1 kern.sched.cpusetsize: 8 With dnetc running on a 6-core AMD CPU from a few years back, "time make buildworld" yields: 6640.224u 828.874s 2:14:37.73 92.4% 28525+494k 31633+431554io 33192pf+0w I shifted to a GENERIC kernel, FreeBSD 10.4-RELEASE-p8 #0 r332560: kern.sched.topology_spec: 0, 1, 2, 3, 4, 5 0, 1, 2, 3, 4, 5 kern.sched.steal_thresh: 2 kern.sched.steal_idle: 1 kern.sched.balance_interval: 127 kern.sched.balance: 1 kern.sched.affinity: 1 kern.sched.idlespinthresh: 157 kern.sched.idlespins: 1 kern.sched.static_boost: 152 kern.sched.preempt_thresh: 80 kern.sched.interact: 30 kern.sched.slice: 12 kern.sched.quantum: 94488 kern.sched.name: ULE kern.sched.preemption: 1 kern.sched.cpusetsize: 8 I stupidly typed "make buildworld" without the "time" command, but the build log started at Mon Apr 16 13:49:12 EDT 2018 and completed at Tue Apr 17 00:22:23 EDT 2018. You read that right: 2+ hours vs 10 1/2! So I set "sysctl kern.sched.preempt_thresh=5" (a wild guess on my part) and started another "time make buildworld". It's still going now, but subjectively it's still running like molasses. I'll post more results later after trying sysctl kern.sched.preempt_thresh=0. By the way, over the years that this discussion has been going on, I've *never* had a response to my question: "What is the workload for which SCHED_ULE outperforms SCHED_4BSD?"-- George signature.asc Description: OpenPGP digital signature
Re: more data: SCHED_ULE+PREEMPTION is the problem
Hi Stefan, I'm glad to see You're thinking along similar paths as I did. But let me fist answer Your question straight away, and sort out the remainder afterwards. > I'd be interested in your results with preempt_thresh set to a value > of e.g.190. There is no difference. Any value above 7 shows the problem identically. I think this value (or preemtion as a whole) is not the actual cause for the problem; it just changes some conditions that make the problem visible. So, trying to adjust preempt_thresh in order to fix the problem seems to be a dead end. Stefan Esser wrote: The critical use of preempt_thresh is marked above. If it is 0, no preemption will occur. On a single processor system, this should allow the CPU bound thread to run for as long its quantum lasts. I would like to contradict here. From what I understand, preemption is *not* the base of task switching. AFAIK preemption is an additional feature that allows to switch threads while they execute in kernel mode. While executing in user mode, a thread can be interrupted and switched at any time, and that is how the traditional time-sharing systems did it. Traditionally a thread would execute in kernel mode only during interrupts and syscalls, and those last no longer than a few ms, and for long that was not an issue. Only when we got the fast interfaces (10Gbps etc.) and got big monsters executing in kernel space (traffic-shaper, ZFS, etc.), that scheme became problematic and preemption was invented. According to McKusicks book, the scheduler is two-fold: an outer logic runs few times per second and calculates priorities. And an inner logic runs very often (at every interrupt?) and chooses the next runnable thread simply by priority. The meaning of the quantum is then: when it is used up, the thread is moved to the end of it's queue, so that it may take a while until it runs again. This is for implementing round-robin behaviour within a single queue (= a single priority). It should not prevent task-switching as such. Lets have a look. sched_choose() seems to be that low-level scheduler function that decides which thread to run next. Lets create a log of its decisions.[1] With preempt_thresh >= 12 (kernel threads left out): PIDCOMMAND TIMESTAMP 18196 bash 1192.549 18196 bash 1192.554 18196 bash 1192.559 66683 lz4 1192.560 18196 bash 1192.560 18196 bash 1192.562 18196 bash 1192.563 18196 bash 1192.564 79496 ntpd 1192.569 18196 bash 1192.569 18196 bash 1192.574 18196 bash 1192.579 18196 bash 1192.584 18196 bash 1192.588 18196 bash 1192.589 18196 bash 1192.594 18196 bash 1192.599 18196 bash 1192.604 18196 bash 1192.609 18196 bash 1192.613 18196 bash 1192.614 18196 bash 1192.619 18196 bash 1192.624 18196 bash 1192.629 18196 bash 1192.634 18196 bash 1192.638 18196 bash 1192.639 18196 bash 1192.644 18196 bash 1192.649 18196 bash 1192.654 66683 lz4 1192.654 18196 bash 1192.655 18196 bash 1192.655 18196 bash 1192.659 The worker is indeed called only after 95ms. And with preempt_thresh < 8: PIDCOMMAND TIMESTAMP 18196 bash 1268.955 66683 lz4 1268.956 18196 bash 1268.956 66683 lz4 1268.956 18196 bash 1268.957 66683 lz4 1268.957 18196 bash 1268.957 66683 lz4 1268.958 18196 bash 1268.958 66683 lz4 1268.959 18196 bash 1268.959 66683 lz4 1268.959 18196 bash 1268.960 66683 lz4 1268.960 18196 bash 1268.961 66683 lz4 1268.961 18196 bash 1268.961 66683 lz4 1268.962 18196 bash 1268.962 Here we have 3 Csw per millisecond. (The fact that the decisions are over-all more frequent is easily explained: when lz4 gets to run, it will do disk I/O, which quickly returns and triggers new decisions.) In the second record, things are clear: while lz4 does disk I/O, the scheduler MUST run bash, because nothing else is there. But when data arrives, it runs again lz4. But in the first record - why does the scheduler choose bash, although lz4 has already much higher prio (52 versus 97, usually)? A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound thread to preempt any other thread with
Appendices - more data: SCHED_ULE+PREEMPTION is the problem
I forgot to attach the commands used to create the logs - they are ugly anyway: [1] dtrace -q -n '::sched_choose:return { @[((struct thread *)arg1)->td_proc->p_pid, stringof(((struct thread *)arg1)->td_proc->p_comm), timestamp] = count(); } tick-1s { exit(0); }' | sort -nk 3 | awk '$1 > 27 {$3 = ($3/100)*1.0/1000; printf "%6d %20s %3.3f\n", $1, $2, $3 }' [2] dtrace -q -n '::runq_choose_from:entry /arg1 == 0||arg1 == 32/ { @[arg1, timestamp] = count(); }' | sort -nk2 ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: more data: SCHED_ULE+PREEMPTION is the problem
Am 07.04.18 um 16:18 schrieb Peter: > 3. kern.sched.preempt_thresh > > I could make the problem disappear by changing kern.sched.preempt_thresh from > the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to > correspond to the disk interrupt threads, which run at intr:12 (i5-3570T) or > intr:8 (p3). [CC added to include Jeff as the author of the ULE scheduler ...] Since I had somewhat similar problems on my systems (with 4 Quad-Core with SMT enabled, i.e. 8 threads of execution) with compute bound processes keeping I/O intensive processes from running (load average of 12 with 8 "CPUs"), and these problems where affected by preempt_thresh, I checked how this variable is used in the scheduler. The code is in /sys/kern/sched_ule.c. It controls, whether a thread that has become runnable (e.g., after waiting for disk I/O to complete) will preempt the thread currently running on "this" CPU (i.e. the one executing this test in the kernel). IMHO, sched_preempt should default to a much higher number than 80 (e.g. 190), but maybe I misunderstand some of the details ... static inline int sched_shouldpreempt(int pri, int cpri, int remote) { The parameters are: pri: the priority if the now runnable thread cpri: the priority of the thread that currently runs on "this" CPU remote: whether to consider preempting a thread on another CPU The priority values are those displayed by top or ps -l as "PRI", but with an offset of 100 applied (i.e. pri=120 is displayed as PRI=20 in top). If this thread has less priority than the currently executing one (cpri), the currently running thread will not be preempted: /* * If the new priority is not better than the current priority there is * nothing to do. */ if (pri >= cpri) return (0); If the current thread is the idle thread, it will always be preempted by the now runnable thread: /* * Always preempt idle. */ if (cpri >= PRI_MIN_IDLE) return (1); A value of preempt_thresh=0 (e.g. if "options PREEMPTION" is missing in the kernel config) lets the previously running thread continue (except if was the idle thread, which has been dealt with above). The compute bound thread may continue until its quantum has expired. /* * If preemption is disabled don't preempt others. */ if (preempt_thresh == 0) return (0); For any other value of preempt_thresh, the new priority of the thread that just has become runnable will be compared to preempt_thresh and if this new priority is higher (lower numeric value) or equal to preempt_thresh, the thread for which (e.g.) disk I/O finished will preempt the current thread: /* * Preempt if we exceed the threshold. */ if (pri <= preempt_thresh) return (1); ===> This is the only condition that depends on preempt_thresh > 0 <=== The flag "remote" controls whether this thread will be scheduled to run, if its priority is higher or equal to PRI_MAX_INTERACT (less than or equal to 151) and if the opposite is true for the currently running thread (cpri). The value of remote will always be 0 on kernels built without "options SMP". /* * If we're interactive or better and there is non-interactive * or worse running preempt only remote processors. */ if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT) return (1); The critical use of preempt_thresh is marked above. If it is 0, no preemption will occur. On a single processor system, this should allow the CPU bound thread to run for as long its quantum lasts. A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound thread to preempt any other thread with lower priority (cpri > pri). But in case of a high priority kernel thread being active during this test (with a low numeric cpri value), the I/O bound process will not preempt that higher priority thread (i.e. some high priority kernel thread). Whether the I/O bound thread will run (instead of the compute bound) after the higher priority thread has given up the CPU, will depend on the scheduler decision which thread to select. And for "timeshare" threads, this will often not be the higher priority (I/O bound) thread, but the compute bound thread, which then may execute until next being interrupted by the I/O bound thread (which will not happen, if no new I/O has been requested). This might explain, why setting preempt_thresh to a very low value (in the range of real-time kernel threads) enforces preemption of the CPU bound thread, while any higher (numeric) value of preempt_thresh prevents this and makes tdq_choose() often select the low priority CPU bound over the higher priority I/O bound thread. BUT the first test in sched_shouldpreempt() should prevent any user process from ever preempting a real-time thread "if (pri >= cpri) return 0;". For preemption to occur, pri must be numerically lower than cpri, and pri
Re: more data: SCHED_ULE+PREEMPTION is the problem
On 04/07/18 10:18, Peter wrote: > [...] > B. Findings: > > 1. Filesystem > > I could never reproduce this when reading from plain UFS. Only when > reading from ZFS (direct or via l2arc). > [...] My consistent way of reproducing the problem was to run ports/misc/dnetc while trying to buildworld/buildkernel. But I was always building on a UFS partition. I don't know why your results would be different. -- George signature.asc Description: OpenPGP digital signature
more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler
Hi all, in the meantime I did some tests and found the following: A. The Problem: --- On a single CPU, there are -exactly- two processes runnable: One is doing mostly compute without I/O - this can be a compressing job or similar; in the tests I used simply an endless-loop. Lets call this the "piglet". The other is doing frequent file reads, but also some compute interim - this can be a backup job traversing the FS, or a postgres VACUUM, or some fast compressor like lz4. Lets call this the "worker". It then happens that the piglet gets 99% CPU, while the worker gets only 0.5% CPU and makes nearly no progress at all. Investigations shows that the worker makes precisely one I/O per timeslice (timeslice as defined in kern.sched.quantum) - or two I/O on a mirrored ZFS. B. Findings: 1. Filesystem I could never reproduce this when reading from plain UFS. Only when reading from ZFS (direct or via l2arc). 2. Machine The problem originally appeared on a pentium3@1GHz. I was able to reproduce it on an i5-3570T, given the following measures: * config in BIOS to use only one CPU * reduce speed: "dev.cpu.0.freq=200" I did see the problem also when running full speed (which means it happens there also), but could not reproduce it well. 3. kern.sched.preempt_thresh I could make the problem disappear by changing kern.sched.preempt_thresh from the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to correspond to the disk interrupt threads, which run at intr:12 (i5-3570T) or intr:8 (p3). 4. dynamic behaviour Here the piglet is already running as PID=2119. Then we can watch the dynamic behaviour as follows (on i5-3570T@200MHz): a. with kern.sched.preempt_thresh=80 $ lz4 DATABASE_TEST_FILE /dev/null & while true; do ps -o pid,pri,"%cpu",command -p 2119,$! sleep 3 done [1] 6073 PID PRI %CPU COMMAND 6073 20 0.0 lz4 DATABASE_TEST_FILE /dev/null 2119 100 91.0 -bash (bash) PID PRI %CPU COMMAND 6073 76 15.0 lz4 DATABASE_TEST_FILE /dev/null 2119 95 74.5 -bash (bash) PID PRI %CPU COMMAND 6073 52 19.0 lz4 DATABASE_TEST_FILE /dev/null 2119 94 71.5 -bash (bash) PID PRI %CPU COMMAND 6073 52 16.0 lz4 DATABASE_TEST_FILE /dev/null 2119 95 76.5 -bash (bash) PID PRI %CPU COMMAND 6073 52 14.0 lz4 DATABASE_TEST_FILE /dev/null 2119 96 80.0 -bash (bash) PID PRI %CPU COMMAND 6073 52 12.5 lz4 DATABASE_TEST_FILE /dev/null 2119 96 82.5 -bash (bash) PID PRI %CPU COMMAND 6073 74 10.0 lz4 DATABASE_TEST_FILE /dev/null 2119 98 86.5 -bash (bash) PID PRI %CPU COMMAND 6073 52 8.0 lz4 DATABASE_TEST_FILE /dev/null 2119 98 89.0 -bash (bash) PID PRI %CPU COMMAND 6073 52 7.0 lz4 DATABASE_TEST_FILE /dev/null 2119 98 90.5 -bash (bash) PID PRI %CPU COMMAND 6073 52 6.5 lz4 DATABASE_TEST_FILE /dev/null 2119 99 91.5 -bash (bash) b. with kern.sched.preempt_thresh=11 PID PRI %CPU COMMAND 4920 21 0.0 lz4 DATABASE_TEST_FILE /dev/null 2119 101 93.5 -bash (bash) PID PRI %CPU COMMAND 4920 78 20.0 lz4 DATABASE_TEST_FILE /dev/null 2119 94 70.5 -bash (bash) PID PRI %CPU COMMAND 4920 82 34.5 lz4 DATABASE_TEST_FILE /dev/null 2119 88 54.0 -bash (bash) PID PRI %CPU COMMAND 4920 85 42.5 lz4 DATABASE_TEST_FILE /dev/null 2119 86 45.0 -bash (bash) PID PRI %CPU COMMAND 4920 85 43.5 lz4 DATABASE_TEST_FILE /dev/null 2119 86 44.5 -bash (bash) PID PRI %CPU COMMAND 4920 85 43.0 lz4 DATABASE_TEST_FILE /dev/null 2119 85 45.0 -bash (bash) PID PRI %CPU COMMAND 4920 85 43.0 lz4 DATABASE_TEST_FILE /dev/null 2119 85 45.5 -bash (bash) From this we can see that in case b. both processes balance out nicely and meet at equal CPU shares. Whereas in case a., after about 10 Seconds (the first 3 records) they move to opposite ends of the scale and stay there. From this I might suppose that here is some kind of mis-calculation or mis-adjustment of the task priorities happening. P. ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"