Hi Stefan,
I'm glad to see you're thinking along similar paths as I did. But let
me first answer your question straight away, and sort out the remainder
afterwards.
> I'd be interested in your results with preempt_thresh set to a value
> of e.g.190.
There is no difference. Any value above 7 shows the problem identically.
I think this value (or preemption as a whole) is not the actual cause of
the problem; it just changes some conditions that make the problem
visible. So, trying to adjust preempt_thresh in order to fix the
problem seems to be a dead end.
Stefan Esser wrote:
> The critical use of preempt_thresh is marked above. If it is 0, no preemption
> will occur. On a single processor system, this should allow the CPU bound
> thread to run for as long as its quantum lasts.
I would like to disagree here.
From what I understand, preemption is *not* the basis of task switching.
AFAIK preemption is an additional feature that allows switching threads
while they execute in kernel mode. While executing in user mode, a
thread can be interrupted and switched at any time, and that is how
traditional time-sharing systems did it. Traditionally a thread
would execute in kernel mode only during interrupts and syscalls, and
those lasted no longer than a few ms, so for a long time that was not
an issue. Only when we got fast interfaces (10 Gbps etc.) and big
monsters executing in kernel space (traffic shapers, ZFS, etc.) did
that scheme become problematic, and kernel preemption was invented.
According to McKusick's book, the scheduler is two-fold: an outer logic
runs a few times per second and calculates priorities, and an inner
logic runs very often (at every interrupt?) and chooses the next
runnable thread simply by priority.
The meaning of the quantum is then: when it is used up, the thread is
moved to the end of its queue, so that it may take a while until it
runs again. This implements round-robin behaviour within a single
queue (= a single priority). It should not prevent task switching
as such.
Let's have a look. sched_choose() seems to be the low-level scheduler
function that decides which thread to run next. Let's create a log of
its decisions. [1]
With preempt_thresh >= 12 (kernel threads left out):
PID   COMMAND  TIMESTAMP
18196 bash 1192.549
18196 bash 1192.554
18196 bash 1192.559
66683 lz4 1192.560
18196 bash 1192.560
18196 bash 1192.562
18196 bash 1192.563
18196 bash 1192.564
79496 ntpd 1192.569
18196 bash 1192.569
18196 bash 1192.574
18196 bash 1192.579
18196 bash 1192.584
18196 bash 1192.588
18196 bash 1192.589
18196 bash 1192.594
18196 bash 1192.599
18196 bash 1192.604
18196 bash 1192.609
18196 bash 1192.613
18196 bash 1192.614
18196 bash 1192.619
18196 bash 1192.624
18196 bash 1192.629
18196 bash 1192.634
18196 bash 1192.638
18196 bash 1192.639
18196 bash 1192.644
18196 bash 1192.649
18196 bash 1192.654
66683 lz4 1192.654
18196 bash 1192.655
18196 bash 1192.655
18196 bash 1192.659
The worker is indeed scheduled only after 95 ms.
And with preempt_thresh < 8:
PID   COMMAND  TIMESTAMP
18196 bash 1268.955
66683 lz4 1268.956
18196 bash 1268.956
66683 lz4 1268.956
18196 bash 1268.957
66683 lz4 1268.957
18196 bash 1268.957
66683 lz4 1268.958
18196 bash 1268.958
66683 lz4 1268.959
18196 bash 1268.959
66683 lz4 1268.959
18196 bash 1268.960
66683 lz4 1268.960
18196 bash 1268.961
66683 lz4 1268.961
18196 bash 1268.961
66683 lz4 1268.962
18196 bash 1268.962
Here we have about 3 Csw (context switches) per millisecond. (The fact
that the decisions are overall more frequent is easily explained: when
lz4 gets to run, it will do disk I/O, which quickly returns and triggers
new decisions.)
In the second record, things are clear: while lz4 waits on disk I/O, the
scheduler MUST run bash, because nothing else is there. But when the
data arrives, it runs lz4 again.
But in the first record - why does the scheduler choose bash, even
though lz4 already has a much better priority (52 versus 97, usually;
lower values mean higher priority)?
A value of 120 (corresponding to PRI=20 in top) will allow the I/O-bound
thread to preempt any other thread with