Re: kern.maxswzone causing serious problems

2018-04-10 Thread Curtis Villamizar
Replying to myself... again.

Bug report https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227436
has been submitted.  Hopefully this can get picked up in 11.2 and
maybe patched into 11.1.  If not, it should at least go into 12.

Curtis


In message <387a65b7-d221-0a10-b801-1dd573054...@orleans.occnc.com>
Curtis Villamizar writes:
> 
[...]
>  
> Will check back later after regression testing.  Apparently the best
> way to get this some attention is to file a bug, so once the changes
> are fully verified in regression testing I'll do that, just in case
> there is some further interaction with using more swap than
> recommended by the current code.
>  
> Curtis
>  
[...]


Found the issue! - SCHED_ULE+PREEMPTION is the problem

2018-04-10 Thread Peter

Results:


1. The tdq_ridx pointer

The perceived slow advance (of the tdq_ridx pointer into the circular 
array) is correct behaviour. McKusick writes:



The pointer is advanced once per system tick, although it may not
advance on a tick until the currently selected queue is empty. Since
each thread is given a maximum time slice and no threads may be added
to the current position, the queue will drain in a bounded amount of
time.


Therefore, it is also normal that the process (the piglet in this case)
runs until its time slice (aka quantum) is used up.
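
To make this concrete, here is a minimal sketch of the bounded drain
(names and types are my stand-ins, not the verbatim sched_ule.c source):

/* Sketch only: tdq_ridx advances at most one slot per system tick,
 * and never past a slot that still holds threads.  Since new arrivals
 * are never inserted at the current slot, each slot drains in a
 * bounded amount of time. */
#define RQ_NQS 64                       /* slots in the circular array */

struct slot { int nthreads; };          /* stand-in for a real run queue */
static int tdq_ridx;                    /* current drain position */

static void
tick_sketch(struct slot slots[RQ_NQS])
{
        /* called once per system tick */
        if (slots[tdq_ridx].nthreads == 0)
                tdq_ridx = (tdq_ridx + 1) % RQ_NQS;
}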



2. The influence of preempt_thresh

This can be found in tdq_runq_add(). A simplified description of the 
logic there is as follows:


if td_priority < 152:         add to realtime queue
else if td_priority <= 223:   add to timeshare queue
    if preempted:
        circular_index = tdq_ridx
    else:
        circular_index = tdq_idx + td_priority
else:                         add to idle queue

If the thread has been preempted, it is reinserted at the current
working position of the circular array; otherwise the position is
calculated from the thread's priority.
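
Rendered as C, the same decision could be sketched like this (an
illustrative simplification with plain types; the constants are the
ones from the text above, not checked against every FreeBSD version):

/* Sketch of the queue selection; lower numbers mean stronger priority. */
enum queue { REALTIME, TIMESHARE, IDLE };

static enum queue
pick_queue_sketch(int td_priority, int preempted,
    int tdq_ridx, int tdq_idx, int *slot)
{
        if (td_priority < 152)
                return (REALTIME);      /* strict priority order */
        if (td_priority <= 223) {
                /* 64-slot circular calendar array */
                *slot = preempted ?
                    tdq_ridx :                      /* runs again soon */
                    (tdq_idx + td_priority) % 64;   /* pushed well back */
                return (TIMESHARE);
        }
        return (IDLE);
}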



3. The quantum

Most of the task switches come from device interrupts, which run at
priority intr:8 or intr:12. So as soon as preempt_thresh is 12 or
greater, the piglet is almost always reinserted into the run queue due
to preemption, and in that case, as we have seen, no real scheduling
takes place: we get a simple resume!


A real scheduling decision happens only after the quantum is exhausted.
Therefore, reducing the quantum helps.
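
The threshold test itself can be sketched as follows (loosely modeled
on sched_shouldpreempt() in sched_ule.c, heavily simplified; the
default value shown is an assumption for kernels built with PREEMPTION):

/* Sketch: with preempt_thresh >= 12, every intr:8/intr:12 interrupt
 * thread preempts the running compute thread, which then re-enters
 * the timeshare queue at tdq_ridx and simply resumes later. */
static int preempt_thresh = 80;   /* assumed default with PREEMPTION */

static int
should_preempt_sketch(int pri, int cpri)
{
        /* pri: incoming thread, cpri: currently running thread;
         * lower numbers mean stronger priority */
        if (pri >= cpri)
                return (0);       /* not more urgent: never preempt */
        if (preempt_thresh == 0)
                return (0);       /* preemption disabled */
        return (pri <= preempt_thresh);
}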


4. History

This behaviour was deliberately introduced in r171713.

In r220198 it was fixed, with a focus on CPU hogs and single-CPU systems.

In r239157 the fix was undone for performance reasons, with the focus
on rescheduling only at the end of the time slice.



5. Conclusion

The current defaults do not seem well suited for certain CPU-intensive
tasks. Possible solutions are one of:

 * do not use SCHED_ULE
 * do not use preemption (see the sketch below)
 * set kern.sched.quantum to its minimal value.
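
For the middle option, a minimal userland sketch (my illustration, not
from the original mail; the shell equivalent is simply
`sysctl kern.sched.preempt_thresh=0`, run as root):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
        int thresh = 0;         /* 0 disables preemption entirely */

        if (sysctlbyname("kern.sched.preempt_thresh", NULL, NULL,
            &thresh, sizeof(thresh)) != 0)
                err(1, "kern.sched.preempt_thresh");
        printf("preempt_thresh set to %d\n", thresh);
        return (0);
}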

P.


Re: more data: SCHED_ULE+PREEMPTION is the problem

2018-04-10 Thread Peter


Hi Stefan,

I'm glad to see you're thinking along similar paths as I did. But let
me first answer your question straight away, and sort out the remainder
afterwards.

> I'd be interested in your results with preempt_thresh set to a value
> of e.g. 190.

There is no difference. Any value above 7 shows the problem identically.

I think this value (or preemption as a whole) is not the actual cause
of the problem; it just changes some conditions that make the problem
visible. So trying to adjust preempt_thresh in order to fix the problem
seems to be a dead end.

Stefan Esser wrote:


> The critical use of preempt_thresh is marked above. If it is 0, no
> preemption will occur. On a single processor system, this should allow
> the CPU bound thread to run for as long as its quantum lasts.


I would like to disagree here.

From what I understand, preemption is *not* the basis of task
switching. AFAIK preemption is an additional feature that allows
threads to be switched while they execute in kernel mode. While
executing in user mode, a thread can be interrupted and switched at any
time, and that is how traditional time-sharing systems did it.
Traditionally a thread would execute in kernel mode only during
interrupts and syscalls, which last no longer than a few ms, and for a
long time that was not an issue. Only when we got fast interfaces
(10Gbps etc.) and big monsters executing in kernel space
(traffic shapers, ZFS, etc.) did that scheme become problematic, and
preemption was invented.


According to McKusick's book, the scheduler is two-fold: an outer logic
runs a few times per second and calculates priorities, and an inner
logic runs very often (at every interrupt?) and chooses the next
runnable thread simply by priority.
The meaning of the quantum is then: when it is used up, the thread is
moved to the end of its queue, so that it may take a while until it
runs again. This implements round-robin behaviour within a single queue
(= a single priority); it should not prevent task switching as such, as
the sketch below illustrates.
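
A minimal sketch of that two-level idea (stand-in code of mine, not the
kernel's; each queue needs TAILQ_INIT() before use):

#include <sys/queue.h>
#include <stddef.h>

#define NQUEUES 64                       /* one queue per priority level */

struct thr { TAILQ_ENTRY(thr) link; };
TAILQ_HEAD(rq, thr);
static struct rq queues[NQUEUES];        /* index 0 = strongest priority */

/* Inner logic: always pick the head of the best non-empty queue. */
static struct thr *
pick_next(void)
{
        for (int q = 0; q < NQUEUES; q++)
                if (!TAILQ_EMPTY(&queues[q]))
                        return (TAILQ_FIRST(&queues[q]));
        return (NULL);
}

/* Quantum expiry: round-robin within one priority queue only. */
static void
quantum_expired(int q, struct thr *t)
{
        TAILQ_REMOVE(&queues[q], t, link);      /* slice used up... */
        TAILQ_INSERT_TAIL(&queues[q], t, link); /* ...go to the back */
}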


Let's have a look. sched_choose() seems to be the low-level scheduler
function that decides which thread to run next. Let's create a log of
its decisions.[1]


With preempt_thresh >= 12 (kernel threads left out):

   PID   COMMAND   TIMESTAMP
 18196 bash 1192.549
 18196 bash 1192.554
 18196 bash 1192.559
 66683  lz4 1192.560
 18196 bash 1192.560
 18196 bash 1192.562
 18196 bash 1192.563
 18196 bash 1192.564
 79496 ntpd 1192.569
 18196 bash 1192.569
 18196 bash 1192.574
 18196 bash 1192.579
 18196 bash 1192.584
 18196 bash 1192.588
 18196 bash 1192.589
 18196 bash 1192.594
 18196 bash 1192.599
 18196 bash 1192.604
 18196 bash 1192.609
 18196 bash 1192.613
 18196 bash 1192.614
 18196 bash 1192.619
 18196 bash 1192.624
 18196 bash 1192.629
 18196 bash 1192.634
 18196 bash 1192.638
 18196 bash 1192.639
 18196 bash 1192.644
 18196 bash 1192.649
 18196 bash 1192.654
 66683  lz4 1192.654
 18196 bash 1192.655
 18196 bash 1192.655
 18196 bash 1192.659

The worker is indeed scheduled only about every 95 ms.

And with preempt_thresh < 8:

   PID   COMMAND   TIMESTAMP

 18196 bash 1268.955
 66683  lz4 1268.956
 18196 bash 1268.956
 66683  lz4 1268.956
 18196 bash 1268.957
 66683  lz4 1268.957
 18196 bash 1268.957
 66683  lz4 1268.958
 18196 bash 1268.958
 66683  lz4 1268.959
 18196 bash 1268.959
 66683  lz4 1268.959
 18196 bash 1268.960
 66683  lz4 1268.960
 18196 bash 1268.961
 66683  lz4 1268.961
 18196 bash 1268.961
 66683  lz4 1268.962
 18196 bash 1268.962

Here we have three context switches per millisecond. (That the
decisions are overall more frequent is easily explained: when lz4 gets
to run, it does disk I/O, which quickly returns and triggers new
decisions.)


In the second record, things are clear: while lz4 does disk I/O, the
scheduler MUST run bash, because nothing else is there. But when data
arrives, it runs lz4 again.

But in the first record: why does the scheduler choose bash, although
lz4 already has a much higher priority (52 versus 97, usually)?


A value of 120 (corresponding to PRI=20 in top) will allow the I/O
bound thread to preempt any other thread with a lower priority.

Appendices - more data: SCHED_ULE+PREEMPTION is the problem

2018-04-10 Thread Peter
I forgot to attach the commands used to create the logs - they are ugly 
anyway:


[1]
dtrace -q -n '::sched_choose:return {
        @[((struct thread *)arg1)->td_proc->p_pid,
          stringof(((struct thread *)arg1)->td_proc->p_comm),
          timestamp] = count(); }
    tick-1s { exit(0); }' \
    | sort -nk 3 \
    | awk '$1 > 27 { $3 = ($3/1000000)*1.0/1000;
        printf "%6d %20s %3.3f\n", $1, $2, $3 }'


[2]
dtrace -q -n '::runq_choose_from:entry /arg1 == 0 || arg1 == 32/
    { @[arg1, timestamp] = count(); }' \
    | sort -nk2
