Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-27 Thread Timo Myyrä
Amit Kulkarni  writes:

>> root on sd2a (88532b67c09ce3ee.a) swap on sd2b dump on sd2b
>> TSC skew=-6129185140 drift=170
>> TSC skew=-6129184900 drift=-10
>> TSC skew=-6129184890 drift=-20
>> TSC skew=-6129184910 drift=30
>> TSC skew=-6129184910 drift=10
>> TSC skew=-6129184900 drift=20
>> TSC skew=-6129184910 drift=30
>> iwm0: hw rev 0x230, fw ver 22.361476.0, address 68:ec:c5:ad:9a:cb
>> initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x17AA:0x506F 0xC4).
>> amdgpu0: 1920x1080, 32bpp
>> wsdisplay0 at amdgpu0 mux 1: console (std, vt100 emulation), using wskbd0
>> wsdisplay0: screen 1-5 added (std, vt100 emulation)
>>
>
> It seems that you have Paul's TSC patch also applied. Please apply
> just one patch and test separately, and then report back!
>
> Thanks

Ah, I also tested without the TSC patch and it didn't make any difference.
The only other tweak is the amdgpu driver enabled in GENERIC.

timo



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-27 Thread Amit Kulkarni
> root on sd2a (88532b67c09ce3ee.a) swap on sd2b dump on sd2b
> TSC skew=-6129185140 drift=170
> TSC skew=-6129184900 drift=-10
> TSC skew=-6129184890 drift=-20
> TSC skew=-6129184910 drift=30
> TSC skew=-6129184910 drift=10
> TSC skew=-6129184900 drift=20
> TSC skew=-6129184910 drift=30
> iwm0: hw rev 0x230, fw ver 22.361476.0, address 68:ec:c5:ad:9a:cb
> initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x17AA:0x506F 0xC4).
> amdgpu0: 1920x1080, 32bpp
> wsdisplay0 at amdgpu0 mux 1: console (std, vt100 emulation), using wskbd0
> wsdisplay0: screen 1-5 added (std, vt100 emulation)
>

It seems that you have Paul's TSC patch also applied. Please apply
just one patch and test separately, and then report back!

Thanks



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-27 Thread Timo Myyrä
Martin Pieuchot  writes:

> [full quote of the original posting trimmed; see the 2019-06-06 message
> below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-24 Thread Bryan Linton
On 2019-06-21 17:11:26, Martin Pieuchot  wrote:
> [full quote of the original posting trimmed; see the 2019-06-06 message
> below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-23 Thread Amit Kulkarni
On Fri, 21 Jun 2019 21:54:18 -0700
Mike Larkin  wrote:

> [full quote of the original posting trimmed; see the 2019-06-06 message
> below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-21 Thread Mike Larkin
On Fri, Jun 21, 2019 at 05:11:26PM -0300, Martin Pieuchot wrote:
> [full quote of the original posting trimmed; see the 2019-06-06 message
> below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-21 Thread Martin Pieuchot
On 06/06/19(Thu) 15:16, Martin Pieuchot wrote:
> [full quote of the original posting trimmed; see the 2019-06-06 message
> below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-09 Thread Amit Kulkarni
On Sun, Jun 9, 2019 at 10:39 PM Theo de Raadt  wrote:
>
> Amit Kulkarni  wrote:
>
> > [sysctl.h diff and PZERO request trimmed; see the two messages below]
>
> This is getting out of hand.
>
> Have you reviewed the *entire universe* of software to ensure that
> top is the only program which looks at this?
>
> No.  You have not.  So please stop proposing changes where you aren't
> willing to invest in studying the history.

Got it. Sorry, my fault here.



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-09 Thread Theo de Raadt
Amit Kulkarni  wrote:

> > [sysctl.h diff trimmed; see the message below]
> 
> 
> Hi,
> 
> A request: remove the +PZERO here above and the -PZERO in
> /usr/src/usr.bin/top/machine.c. Why do an unnecessary calculation twice,
> once to set it and another time to unset it?
 
This is getting out of hand.

Have you reviewed the *entire universe* of software to ensure that
top is the only program which looks at this?

No.  You have not.  So please stop proposing changes where you aren't
willing to invest in studying the history.



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-09 Thread Amit Kulkarni
> Index: sys/sysctl.h
> ===
> RCS file: /cvs/src/sys/sys/sysctl.h,v
> retrieving revision 1.188
> diff -u -p -r1.188 sysctl.h
> --- sys/sysctl.h  1 Jun 2019 14:11:18 -0000  1.188
> +++ sys/sysctl.h  1 Jun 2019 16:36:13 -0000
> @@ -629,7 +629,7 @@ do {  
> \
>   (kp)->p_stat = (p)->p_stat; \
>   (kp)->p_slptime = (p)->p_slptime;   \
>   (kp)->p_holdcnt = 1;\
> - (kp)->p_priority = (p)->p_priority; \
> + (kp)->p_priority = (p)->p_usrpri + PZERO;   \
>   (kp)->p_usrpri = (p)->p_usrpri; \
>   if ((p)->p_wchan && (p)->p_wmesg)   \
>   copy_str((kp)->p_wmesg, (p)->p_wmesg,   \
> 


Hi,

A request: remove the +PZERO here above and the -PZERO in
/usr/src/usr.bin/top/machine.c. Why do an unnecessary calculation twice,
once to set it and another time to unset it?

Thanks
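
The round trip being questioned looks roughly like this.  Below is a
compilable user-space sketch, not the real sysctl/top code; `kinfo_stub'
and the helper names are made up, and PZERO is 22 in sys/param.h:

    #include <stdio.h>

    #define PZERO 22                /* sys/param.h */

    struct kinfo_stub {
            int p_priority;         /* priority exported to userland */
    };

    /* kernel side (the sysctl.h macro above): export p_usrpri + PZERO */
    static void
    fill_kinfo(struct kinfo_stub *kp, int p_usrpri)
    {
            kp->p_priority = p_usrpri + PZERO;
    }

    /* top(1) side (usr.bin/top/machine.c): subtract PZERO before display */
    static int
    display_pri(const struct kinfo_stub *kp)
    {
            return kp->p_priority - PZERO;
    }

    int
    main(void)
    {
            struct kinfo_stub kp;

            fill_kinfo(&kp, 50);
            printf("exported %d, displayed %d\n", kp.p_priority,
                display_pri(&kp));
            return 0;
    }

Dropping the offset on both sides would only be safe if top(1) were the
sole consumer of the exported value, which is exactly the objection
raised in the reply above.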



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-06 Thread Martin Pieuchot
On 02/06/19(Sun) 16:41, Martin Pieuchot wrote:
> On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
> > Diff below exists mainly for documentation and test purposes.  If
> > you're not interested in how to break the scheduler internals in
> > pieces, don't read further and go straight to testing!
> > 
> > - First change is to stop calling tsleep(9) at PUSER.  That makes
> >   it clear that all "sleeping priorities" are smaller than PUSER.
> >   That's important to understand for the diff below.  `p_priority'
> >   is currently a placeholder for the "sleeping priority" and the
> >   "runnqueue priority".  Both fields are separated by this diff.
> > 
> > - When a thread goes to sleep, the priority argument of tsleep(9) is
> >   now recorded in `p_slpprio'.  This argument can be considered as part
> >   of the sleep queue.  Its purpose is to place the thread into a higher
> >   runqueue when awoken.
> > 
> > - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
> >   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
> >   in the preferred queue without having to use `p_priority'.  Note that
> >   `p_usrpri' is still recalculated *after* having called setrunqueue().
> >   This is currently fine because setrunnable() is called with SCHED_LOCK(),
> >   but it will be racy when we split it.
> > 
> > - A new field, `p_runprio' has been introduced.  It should be considered
> >   as part of the per-CPU runqueues.  It indicates where a current thread
> >   is placed.
> > 
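
To make the `p_runprio' idea concrete, here is a compilable sketch of a
per-CPU runqueue in the style of kern_sched.c, with 32 queues indexed by
priority >> 2; the `_stub' names are invented and the real code differs
in its details:

    #include <sys/queue.h>

    #define SCHED_NQS 32

    struct proc_stub {
            TAILQ_ENTRY(proc_stub) p_runq;
            unsigned char p_runprio;        /* queue the thread sits in */
    };

    TAILQ_HEAD(prochead, proc_stub);

    struct spc_stub {
            struct prochead spc_qs[SCHED_NQS];
            unsigned int spc_whichqs;       /* bitmask of non-empty queues */
    };

    /* insert at `prio': p_slpprio when awoken, p_usrpri otherwise */
    static void
    runqueue_insert(struct spc_stub *spc, struct proc_stub *p,
        unsigned char prio)
    {
            unsigned int queue = prio >> 2;

            p->p_runprio = prio;            /* recorded as runqueue state */
            TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
            spc->spc_whichqs |= 1U << queue;
    }

    int
    main(void)
    {
            struct spc_stub spc;
            struct proc_stub p;
            int i;

            for (i = 0; i < SCHED_NQS; i++)
                    TAILQ_INIT(&spc.spc_qs[i]);
            spc.spc_whichqs = 0;
            runqueue_insert(&spc, &p, 50);  /* a thread at p_usrpri 50 */
            return 0;
    }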
> > - `spc_curpriority' is now updated at every context-switch.  That means
> >need_resched() won't be called after comparing an out-of-date value.
> >At the same time, `p_usrpri' is initialized to the highest possible
> >value for idle threads.
> > 
> > - resched_proc() was calling need_resched() in the following conditions:
> >- If the SONPROC thread has a higher priority than the current
> >  running thread (itself).
> >- Twice in setrunnable() when we know that p_priority <= p_usrpri.
> >- If schedcpu() considered that a thread, after updating its prio,
> >  should preempt the one running on the CPU pointed by `p_cpu'. 
> > 
> >   The diff below simplifies all of that by calling need_resched() when:
> >- A thread is inserted in a CPU runqueue at a higher priority than
> >  the one SONPROC.
> >- schedcpu() decides that a thread in SRUN state should preempt the
> >  one SONPROC.
> > 
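
A sketch of the simplified rule, where a lower numeric value means a
better priority, as in the kernel; everything except the names
spc_curpriority and need_resched is illustrative:

    struct cpu_stub {
            /* priority of the SONPROC thread, refreshed at every switch */
            unsigned char spc_curpriority;
            int want_resched;
    };

    static void
    need_resched_stub(struct cpu_stub *ci)
    {
            ci->want_resched = 1;           /* stand-in for need_resched() */
    }

    /* after inserting at `prio' in ci's runqueue, preempt only if it wins */
    static void
    maybe_preempt(struct cpu_stub *ci, unsigned char prio)
    {
            if (prio < ci->spc_curpriority)
                    need_resched_stub(ci);
    }

    int
    main(void)
    {
            struct cpu_stub ci = { 127, 0 };        /* running idle prio */

            maybe_preempt(&ci, 50);                 /* 50 beats 127 */
            return ci.want_resched ? 0 : 1;
    }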
> > - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
> >   of a thread, are now updated while holding a per-thread mutex.  As a
> >   result schedclock() and donice() no longer take the SCHED_LOCK(),
> >   and schedcpu() almost never takes it.
> > 
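
The locking change can be pictured as follows; this compilable sketch
uses pthreads in place of the kernel mutex(9) API, and the update
formula is invented, not the real schedclock() math:

    #include <pthread.h>

    struct proc_stub {
            pthread_mutex_t p_mtx;          /* protects the fields below */
            unsigned int p_estcpu;
            unsigned int p_slptime;
            unsigned char p_usrpri;
    };

    /* schedclock()-like path: per-thread mutex, no SCHED_LOCK() */
    static void
    bump_estcpu(struct proc_stub *p)
    {
            pthread_mutex_lock(&p->p_mtx);
            p->p_estcpu++;
            p->p_usrpri = (unsigned char)(50 + p->p_estcpu / 4); /* invented */
            pthread_mutex_unlock(&p->p_mtx);
    }

    int
    main(void)
    {
            struct proc_stub p = { PTHREAD_MUTEX_INITIALIZER, 0, 0, 50 };

            bump_estcpu(&p);
            return p.p_usrpri;
    }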
> > - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
> >   when displaying priorities.  This is helpful to understand what's
> >   happening:
> > 
> > load averages:  0.99,  0.56,  0.25        two.lab.grenadille.net 23:42:10
> > 70 threads: 68 idle, 2 on processor                              up  0:09
> > CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> > CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> > Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
> > 
> >   PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> > 81000   145101  72    0    0K 1664K sleep/1   bored     1:15 36.96% softnet
> > 47133   244097  73    0 2984K 4408K sleep/1   netio     1:06 35.06% cvs
> > 64749   522184  66    0  176K  148K onproc/1  -         0:55 28.81% nfsd
> > 21615   602473 127    0    0K 1664K sleep/0   -         7:22  0.00% idle0
> > 12413   606242 127    0    0K 1664K sleep/1   -         7:08  0.00% idle1
> > 85778   338258  50    0 4936K 7308K idle      select    0:10  0.00% ssh
> > 22771   575513  50    0  176K  148K sleep/0   nfsd      0:02  0.00% nfsd
> > 
> > 
> > 
> > - The removal of `p_priority' and the change that makes mi_switch()
> >   always update `spc_curpriority' might introduce some changes in
> >   behavior, especially with kernel threads that were not going through
> >   tsleep(9).  We currently have some situations where the priority of
> >   the running thread isn't correctly reflected.  This diff changes that
> >   which means we should be able to better understand where the problems
> >   are.
> > 
> > I'd be interested in comments/tests/reviews before continuing in this
> > direction.  Note that at least parts of this diff are required to split
> > the accounting apart from the SCHED_LOCK() as well.
> > 
> > I'll also work on exporting scheduler statistics unless somebody wants
> > to beat me :)
> 
> Updated diff to use IPL_SCHED and rebased to apply on top of -current :) 

Updated diff that fixes a pagefault reported by sthen@.

Index: arch/amd64/amd64/genassym.cf
===
RCS file: 

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-06 Thread Mike Larkin
On Thu, Jun 06, 2019 at 02:55:35PM +0100, Stuart Henderson wrote:
> I'm testing the "pump my sched" and read/write unlock diffs and ran into
> the panic below. Seems more likely that it would be connected with the
> sched diff rather than anything else. I'll build a WITNESS kernel and
> see if I can get more details if it happens again.
> 

Funny thing, I ran into the same panic last night after running this diff
without issue for several days. Mine didn't include a stack trace though.

> [show panic, trace, and ps listing trimmed; identical to the original
> report below]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-06 Thread Stuart Henderson
On 2019/06/06 14:55, Stuart Henderson wrote:
> I'm testing the "pump my sched" and read/write unlock diffs and ran into
> the panic below. Seems more likely that it would be connected with the
> sched diff rather than anything else. I'll build a WITNESS kernel and
see if I can get more details if it happens again.

Easily repeatable: Run xscreensaver. Let it lock. Try to unlock.

login: panic: kernel diagnostic assertion "__mp_lock_held(&kernel_lock,
curcpu()) == 0" failed: file "/src/cvs-openbsd/sys/kern/kern_lock.c", line 63
Stopped at  db_enter+0x10:  popq%rbp
    TID    PID    UID  PRFLAGS  PFLAGS  CPU  COMMAND
   6043  83667   1000      0x2       0    1  firefox
 320664  83667   1000      0x2   0x480    2  firefox
 523640  70812   1000      0x2   0x400    3  firefox
*307581  55620   1000     0x32       0    0K xscreensaver
db_enter() at db_enter+0x10
panic() at panic+0x128
__assert(81af6d92,81afc585,3f,81b1a406) at __assert+0x2e
__mp_lock_held(811362e0,800033cfe5e0) at __mp_lock_held
pageflttrap() at pageflttrap+0x78
kerntrap(800033cfe6b0) at kerntrap+0x91
alltraps_kern(6,300,fffb,2,800022009ff0,c0) at alltraps_kern+0x7b
setrunqueue(0,8000342572a8,c0) at setrunqueue+0xbc
setrunnable(8000342572a8,c0) at setrunnable+0xa5
ptsignal(8000342572a8,13,0) at ptsignal+0x3c4
sys_kill(800033c532a8,800033cfe8d0,800033cfe930) at sys_kill+0x1c5
syscall(800033cfe9a0) at syscall+0x399
Xsyscall(6,7a,0,7a,13,7f7e8cc8) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7e8b60, count: 2
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}> sh witness
Sleep locks:
[lock-order listing trimmed; the lock names were mangled in the archive]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-06 Thread Stuart Henderson
I'm testing the "pump my sched" and read/write unlock diffs and ran into
the panic below. Seems more likely that it would be connected with the
sched diff rather than anything else. I'll build a WITNESS kernel and
see if I can get more details if it happens again.

ddb{1}> show panic
kernel diagnostic assertion "__mp_lock_held(&kernel_lock, curcpu()) == 0"
failed: file "/sys/kern/kern_lock.c", line 63
ddb{1}> tr
db_enter() at db_enter+0x10
panic() at panic+0x128
__assert(81aebffd,81ac7a49,3f,81b0e315) at __assert+0x2e
_kernel_lock() at _kernel_lock+0xf6
pageflttrap() at pageflttrap+0x75
kerntrap(800033ee0ca0) at kerntrap+0x91
alltraps_kern(6,380,fffb,2,80002201aff0,e0) at alltraps_kern+0x7b
setrunqueue(0,8000342b2ce8,e0) at setrunqueue+0xbc
setrunnable(8000342b2ce8,e0) at setrunnable+0xa5
ptsignal(8000342b2ce8,13,0) at ptsignal+0x3c4
sys_kill(800033c2fcc0,800033ee0ec0,800033ee0f20) at sys_kill+0x1c5
syscall(800033ee0f90) at syscall+0x389
Xsyscall(6,7a,0,7a,13,7f7e3e18) at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7e3cb0, count: -13
ddb{1}> ps
   PID     TID   PPID    UID  S      FLAGS  WAIT    COMMAND
 89165  411452   1971    667  3   0x1000b2  poll    ping
  1971   11232  86869    667  3       0x82  piperd  check_ping
 66051  380473  92153    667  3   0x1000b2  poll    ping
 92153  480496  86869    667  3       0x82  piperd  check_ping
  7608  110605   8332    667  3   0x1000b2  poll    ping
  8332  284560  86869    667  3       0x82  piperd  check_ping
 40097  336365  35317    667  3   0x1000b2  poll    ping
 35317  385677  86869    667  3       0x82  piperd  check_ping
 33714  102547  21689    667  3   0x1000b2  poll    ping
 21689  210571  86869    667  3       0x82  piperd  check_ping
 72142  347023  62694    667  3   0x1000b2  poll    ping
 62694  439660  86869    667  3       0x82  piperd  check_ping
 87347  447624  76730   1000  4  0x8000430          xscreensaver
 37569  319522  76730   1000  2   0x800402          piecewise
 11841  481452  18848    667  3   0x1000b2  poll    ping
 18848  111230  86869    667  3       0x82  piperd  check_ping
 69292  189646  44902    667  3   0x1000b2  poll    ping
 44902   21064  86869    667  3       0x82  piperd  check_ping
 67905  363250   1761    507  3       0x92  kqread  pickup
 86709  146945  91951   1000  3       0x92  kqread  imap
 36596  227858  91951    666  3       0x92  kqread  imap-login
 90723  180281  91951   1000  3       0x92  kqread  imap
 42867  129499  91951   1000  3       0x92  kqread  imap
 52963  203919  91951   1000  3       0x92  kqread  imap
 61599  167915  91951    666  3       0x92  kqread  imap-login
 54450  154426  91951    666  3       0x92  kqread  imap-login
 54061  103553  91951    666  3       0x92  kqread  imap-login
 39422  362611  85872    715  3       0x90  netcon  perl
 49809  260623  85872    715  3       0x90  lockf   perl
  3824   92473  85872    715  3       0x90  lockf   perl
 30990  398152  85872    715  3       0x90  lockf   perl
  3327  264795  85872    715  3       0x90  lockf   perl
 24406  217143  85872    715  3       0x90  lockf   perl
 52871  404585  85872    715  3       0x90  lockf   perl
 58220  118881  85872    715  3       0x90  lockf   perl
 32898  307331  85872    715  3       0x90  lockf   perl
 50203  396987  85872    715  3       0x90  lockf   perl
 93024   24785      1   1000  3   0x100080  select  ssh
 22261  482889  52840   1000  3   0x100083  poll    ssh
 52840  404967  61470   1000  3   0x10008b  pause   ksh
 35796  132597  61470   1000  3       0xb0  netio   urxvt
 61470  485028  72107   1000  3       0xb2  select  urxvt
 72107  196192      1   1000  3   0x10008a  pause   sh
 70238  396933      1   1000  3   0x100080  select  ssh
 41840  302658  10362   1000  3   0x100083  poll    ssh
 89638  411799  91951   1000  3       0x92  kqread  imap
 49250  507579  81821   1000  3   0x100083  ttyin   ksh
 37608  523403  91951    666  3       0x92  kqread  imap-login
 75552   67606  33096   1000  3   0x100083  poll    mutt
 33096  492099  81821   1000  3   0x10008b  pause   ksh
 81821  270355      1   1000  3   0x100080  kqread  tmux
 24522  341281  91861   1000  3   0x100083  kqread  tmux
 23218  441369  78842   1000  3       0x82  poll    firefox
 23218  430421  78842   1000  3      0x482  kqread  firefox
 23218  460884  78842   1000  3      0x482  fsleep  firefox
 23218   17619  78842   1000  3      0x482  fsleep  firefox
 23218  138796  78842   1000  3      0x482  fsleep  firefox
 23218  341276  78842   1000  3      0x482  

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-05 Thread Martin Pieuchot
On 02/06/19(Sun) 22:03, Mark Kettenis wrote:
> > Date: Sat, 1 Jun 2019 18:55:20 -0300
> > From: Martin Pieuchot 
> [...] 
> > - First change is to stop calling tsleep(9) at PUSER.  That makes
> >   it clear that all "sleeping priorities" are smaller than PUSER.
> >   That's important to understand for the diff below.  `p_priority'
> >   is currently a placeholder for the "sleeping priority" and the
> >   "runnqueue priority".  Both fields are separated by this diff.
> 
> Separating out the fields is a good idea.  The current way priorities
> are recorded is just confusing.  The use of PUSER vs. PWAIT seems to
> be fairly arbitrary, so that is probably not a big issue.  Except
> maybe for the single-threaded signal stuff.  Would be good to get
> guenther@'s thoughts on this bit.
> 
> The PUSER -> PWAIT change isn't really necessary, is it?  It just makes
> it easier for you to understand what's going on when looking at the
> queues.

The problem becomes easier to understand with this change.  There are
currently two places where `p_priority' is updated iff it was
previously >= PUSER.  These two places are schedclock() and schedcpu();
they both "duplicate" the same logic to update `p_priority' just after
having recalculated `p_usrpri'.  So with it we can say that this code
does not apply to the new `p_slpprio' because it is always < PUSER.
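
In code, the duplicated pattern reads roughly like this; a compilable
sketch in which the struct stands in for struct proc, PUSER is 50 in
sys/param.h, and resetpriority() only appears in the comment:

    #define PUSER 50                        /* sys/param.h */

    struct proc_stub {
            unsigned char p_priority;       /* sleep or runqueue priority */
            unsigned char p_usrpri;         /* freshly recomputed user prio */
    };

    /* tail of both schedcpu() and schedclock(), before the diff */
    static void
    refresh_priority(struct proc_stub *p)
    {
            /* resetpriority() has just recomputed p_usrpri */
            if (p->p_priority >= PUSER)     /* not a kernel sleep priority */
                    p->p_priority = p->p_usrpri;
    }

    int
    main(void)
    {
            struct proc_stub p = { 60, 55 };

            refresh_priority(&p);
            return p.p_priority;            /* 55, the fresh p_usrpri */
    }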

Now the question is why does `p_priority' exist and why does it
sometimes reflect `p_usrpri'?  The only places where mi_switch() is
called without updating `p_priority' nor putting the caller on a queue
are related to signals.  To exit the SSTOP state, setrunnable() is
called on a thread.  Since setrunnable() is shared between SSLEEP and
SSTOP threads it is not obvious that `p_priority' is the "sleeping
priority" in the first case and `p_usrpri' in the second case.
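
Sketched with illustrative stubs, the untangled setrunnable() picks the
queue priority per state instead of reading `p_priority'; the real
function does this under SCHED_LOCK() and performs the queue insertion
itself:

    enum pstate { SSLEEP_STUB, SSTOP_STUB };

    struct proc_stub {
            enum pstate p_stat;
            unsigned char p_slpprio;        /* tsleep(9) argument, < PUSER */
            unsigned char p_usrpri;         /* user priority */
    };

    /* priority deciding the runqueue when the thread becomes runnable */
    static unsigned char
    setrunnable_prio(const struct proc_stub *p)
    {
            if (p->p_stat == SSTOP_STUB)
                    return p->p_usrpri;     /* stopped: user priority */
            return p->p_slpprio;            /* sleeping: sleep priority */
    }

    int
    main(void)
    {
            struct proc_stub p = { SSTOP_STUB, 16, 55 };

            return setrunnable_prio(&p);    /* 55 for a stopped thread */
    }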

The only exception to this logic are stopped threads with a nice(1)
value that have a `p_priority' < PUSER.  In this specific case my diff,
which passes `p_usrpri' as an argument to setrunnable(), introduces a
change in behavior.  However I doubt it matters since such threads are
generally exceptions and the only ones with a priority < PUSER.

This can be observed by stopping sndiod(8).

> > [...]
> > - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
> >   of a thread, are now updated while holding a per-thread mutex.  As a
> >   result schedclock() and donice() no longer take the SCHED_LOCK(),
> >   and schedcpu() almost never takes it.
> 
> Need to look closer at how this works.

`p_slptime' could be removed if we want to simplify the logic.  I don't
see the point of deferring the calculation of sleeping/stopped threads.  

> 
> > - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
> >   when displaying priorities.  This is helpful to understand what's
> >   happening:
> 
> Do you intend to remove that bit before committing this?

I don't know.  Which "priority" should we export then?  The sleep/run
priority?  This would be the most backward-compatible change.  However
the priority that really matters *is* `p_usrpri'.  With this diff it
becomes so much easier to understand what's happening...



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-04 Thread Mike Larkin
On Mon, Jun 03, 2019 at 11:50:14AM +0200, Solene Rapenne wrote:
> On Sat, Jun 01, 2019 at 06:55:20PM -0300, Martin Pieuchot wrote:
> > Diff below exists mainly for documentation and test purposes.  If
> > you're not interested about how to break the scheduler internals in
> > pieces, don't read further and go straight to testing!
> 
> I'm running it since a few hours.
> 
> - games/gzdoom feels smoother with this patch (stuttering was certainly
>   related to audio)
> - mpd playback doesn't seem interrupted under heavy load as it
>   occasionally did
> 
> this may be coincidence or a placebo effect.
> 

On one of my machines, I'm running this diff and the unlock more syscalls diff
(the one from mpi@ that claudio@ recently reposted). It does indeed seem more
responsive (unclear which diff is doing this, though). The combination is also
stable without anything out of the ordinary.

-ml



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-04 Thread Hrvoje Popovski
On 2.6.2019. 21:41, Martin Pieuchot wrote:
> On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
>> Diff below exists mainly for documentation and test purposes.  If
>> you're not interested about how to break the scheduler internals in
>> pieces, don't read further and go straight to testing!

> Updated diff to use IPL_SCHED and rebased to apply on top of -current :) 

I'm running this diff together with the proctreelk diff on an OpenBSD
desktop with GNOME and a Samba server, and everything seems fine.






Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-03 Thread Solene Rapenne
On Sat, Jun 01, 2019 at 06:55:20PM -0300, Martin Pieuchot wrote:
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested about how to break the scheduler internals in
> pieces, don't read further and go straight to testing!

I'm running it since a few hours.

- games/gzdoom feels smoother with this patch (stuttering was certainly
  related to audio)
- mpd playback doesn't seem interrupted under heavy load as it
  occasionally did

this may be coincidence or a placebo effect.



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-03 Thread Amit Kulkarni
Hi,

This is a pretty cool diff, splitting the sleep prio and the run prio!

In a few places a documentation comment could be changed from process to
proc; I tried to find and mark them below.  It leaves the reader confused for
a moment.

thanks

> > - `spc_curpriority' is now updated at every context-switch.  That means
> >need_resched() won't be called after comparing an out-of-date value.
> >At the same time, `p_usrpri' is initialized to the highest possible
> >value for idle threads.
> > - resched_proc() was calling need_resched() in the following conditions:
> >- If the SONPROC thread has a higher priority than the current
> >  running thread (itself).
> >- Twice in setrunnable() when we know that p_priority <= p_usrpri.
> >- If schedcpu() considered that a thread, after updating its prio,
> >  should preempt the one running on the CPU pointed by `p_cpu'. 
> > 
> >   The diff below simplifies all of that by calling need_resched() when:
> >- A thread is inserted in a CPU runqueue at a higher priority than
> >  the one SONPROC.
> >- schedcpu() decides that a thread in SRUN state should preempt the
> >  one SONPROC.

Just FYI, this should fix a serious bug: the resched_proc() call compared a
stale priority when deciding what to schedule, and it consistently made a
pretty bad decision!

> > - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
> >   of a thread, are now updated while holding a per-thread mutex.  As a
> >   result schedclock() and donice() no longer take the SCHED_LOCK(),
> >   and schedcpu() almost never takes it.

You forgot to add resetpriority(), which is also moved out from under
SCHED_LOCK!

> > 
> > - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
> >   when displaying priorities.  This is helpful to understand what's
> >   happening:
> > 
> > load averages:  0.99,  0.56,  0.25        two.lab.grenadille.net 23:42:10
> > 70 threads: 68 idle, 2 on processor                              up  0:09
> > CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> > CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> > Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
> > 
> >   PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> > 81000   145101  72    0    0K 1664K sleep/1   bored     1:15 36.96% softnet
> > 47133   244097  73    0 2984K 4408K sleep/1   netio     1:06 35.06% cvs
> > 64749   522184  66    0  176K  148K onproc/1  -         0:55 28.81% nfsd
> > 21615   602473 127    0    0K 1664K sleep/0   -         7:22  0.00% idle0
> > 12413   606242 127    0    0K 1664K sleep/1   -         7:08  0.00% idle1
> > 85778   338258  50    0 4936K 7308K idle      select    0:10  0.00% ssh
> > 22771   575513  50    0  176K  148K sleep/0   nfsd      0:02  0.00% nfsd
> > 
> > 
> > 
> > - The removal of `p_priority' and the change that makes mi_switch()
> >   always update `spc_curpriority' might introduce some changes in
> >   behavior, especially with kernel threads that were not going through
> >   tsleep(9).  We currently have some situations where the priority of
> >   the running thread isn't correctly reflected.  This diff changes that
> >   which means we should be able to better understand where the problems
> >   are.
> > 
> > I'd be interested in comments/tests/reviews before continuing in this
> > direction.  Note that at least part of this diff are required to split
> > the accounting apart from the SCHED_LOCK() as well.
> > 
> > I'll also work on exporting scheduler statistics unless somebody wants
> > to beat me :)
> 
> Updated diff to use IPL_SCHED and rebased to apply on top of -current :) 
> 
> Index: arch/amd64/amd64/genassym.cf
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/genassym.cf,v
> retrieving revision 1.40
> diff -u -p -r1.40 genassym.cf
> --- arch/amd64/amd64/genassym.cf  17 May 2019 19:07:15 -0000  1.40
> +++ arch/amd64/amd64/genassym.cf  1 Jun 2019 16:27:46 -0000
> @@ -32,7 +32,6 @@ export  VM_MIN_KERNEL_ADDRESS
>  
>  struct   proc
>  member   p_addr
> -member   p_priority
>  member   p_stat
>  member   p_wchan
>  member   P_MD_REGS   p_md.md_regs
> Index: arch/hppa/hppa/genassym.cf
> ===
> RCS file: /cvs/src/sys/arch/hppa/hppa/genassym.cf,v
> retrieving revision 1.47
> diff -u -p -r1.47 genassym.cf
> --- arch/hppa/hppa/genassym.cf9 Feb 2015 08:20:13 -   1.47
> +++ arch/hppa/hppa/genassym.cf1 Jun 2019 17:21:44 -
> @@ -130,7 +130,6 @@ membertf_cr30
>  # proc fields and values
>  struct   proc
>  member   p_addr
> -member   p_priority
>  member   p_stat
>  member   p_wchan
>  member   p_md
> Index: arch/i386/i386/esm.c
> 

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-02 Thread Mark Kettenis
> Date: Sat, 1 Jun 2019 18:55:20 -0300
> From: Martin Pieuchot 
> 
> Diff below exists mainly for documentation and test purposes.  If
> you're not interested in how to break the scheduler internals in
> pieces, don't read further and go straight to testing!

Still digesting this, but a few comments inline.

> - First change is to stop calling tsleep(9) at PUSER.  That makes
>   it clear that all "sleeping priorities" are smaller than PUSER.
>   That's important to understand for the diff below.  `p_priority'
>   is currently a placeholder for the "sleeping priority" and the
>   "runnqueue priority".  Both fields are separated by this diff.

Separating out the fields is a good idea.  The current way priorities
are recorded is just confusing.  The use of PUSER vs. PWAIT seems to
be fairly arbitrary, so that is probably not a big issue, except
maybe for the single-threaded signal stuff.  Would be good to get
guenther@'s thoughts on this bit.

The PUSER -> PWAIT change isn't really necessary, is it?  It just makes
it easier for you to understand what's going on when looking at the
queues.

> - When a thread goes to sleep, the priority argument of tsleep(9) is
>   now recorded in `p_slpprio'.  This argument can be considered as part
>   of the sleep queue.  Its purpose is to place the thread into a higher
>   runqueue when awoken.

Great!

> - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
>   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
>   in the preferred queue without having to use `p_priority'.  Note that
>   `p_usrpri' is still recalculated *after* having called setrunqueue().
>   This is currently fine because setrunnable() is called with SCHED_LOCK()
>   but it will be racy when we split it.
> 
> - A new field, `p_runprio', has been introduced.  It should be considered
>   as part of the per-CPU runqueues.  It indicates in which runqueue the
>   thread is currently placed.

You made this a uint8_t, whereas the other priority fields are all
u_char.  Different names for the same thing, but it probably is a good
idea to keep this consistent for now.

> - `spc_curpriority' is now updated at every context-switch.  That means
>need_resched() won't be called after comparing an out-of-date value.
>At the same time, `p_usrpri' is initialized to the highest possible
>value for idle threads.
> 
> - resched_proc() was calling need_resched() in the following conditions:
>- If the SONPROC thread has a higher priority than the current
>  running thread (itself).
>- Twice in setrunnable() when we know that p_priority <= p_usrpri.
>- If schedcpu() considered that a thread, after updating its prio,
>  should preempt the one running on the CPU pointed by `p_cpu'. 
> 
>   The diff below simplifies all of that by calling need_resched() when:
>- A thread is inserted in a CPU runqueue at a higher priority than
>  the one SONPROC.
>- schedcpu() decides that a thread in SRUN state should preempt the
>  one SONPROC.
> 
> - `p_estcpu' `p_usrpri' and `p_slptime' which represent the "priority"
>   of a thread are now updated while holding a per-thread mutex.  As a
>   result schedclock() and donice() no longer take the SCHED_LOCK(),
>   and schedcpu() almost never takes it.

Need to look closer at how this works.

> - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
>   when displaying priorities.  This is helpful to understand what's
>   happening:

Do you intend to remove that bit before committing this?

> [...]

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-02 Thread Martin Pieuchot
On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
> [...]

Updated diff to use IPL_SCHED and rebased to apply on top of -current :) 

Index: arch/amd64/amd64/genassym.cf
===
RCS file: /cvs/src/sys/arch/amd64/amd64/genassym.cf,v
retrieving revision 1.40
diff -u -p -r1.40 genassym.cf
--- arch/amd64/amd64/genassym.cf	17 May 2019 19:07:15 -  1.40
+++ arch/amd64/amd64/genassym.cf	1 Jun 2019 16:27:46 -
@@ -32,7 +32,6 @@ export

Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-01 Thread Martin Pieuchot
Diff below exists mainly for documentation and test purposes.  If
you're not interested in how to break the scheduler internals in
pieces, don't read further and go straight to testing!

- First change is to stop calling tsleep(9) at PUSER.  That makes
  it clear that all "sleeping priorities" are smaller than PUSER.
  That's important to understand for the diff below.  `p_priority'
  is currently a placeholder for the "sleeping priority" and the
  "runnqueue priority".  Both fields are separated by this diff.

- When a thread goes to sleep, the priority argument of tsleep(9) is
  now recorded in `p_slpprio'.  This argument can be considered as part
  of the sleep queue.  Its purpose is to place the thread into a higher
  runqueue when awoken.

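To make that concrete, here is a minimal userland C sketch of the idea.
All names (struct proc_sk, sk_tsleep(), the 32-queue layout) are invented
for illustration; this is not the code from the diff:

#include <assert.h>
#include <stdio.h>

#define MAXPRI	127	/* assumed priority range, lower is better */

struct proc_sk {
	int p_slpprio;	/* priority handed to tsleep(9) */
	int p_usrpri;	/* userland priority, always >= kernel prios */
};

/* Going to sleep records the tsleep(9) priority argument. */
static void
sk_tsleep(struct proc_sk *p, int prio)
{
	assert(prio >= 0 && prio <= MAXPRI);
	p->p_slpprio = prio;
}

/*
 * Waking up picks the runqueue from p_slpprio, not from p_usrpri,
 * so a thread that slept at a (better) kernel priority runs first.
 */
static int
sk_wakeup_queue(const struct proc_sk *p)
{
	return (p->p_slpprio >> 2);	/* 128 prios mapped on 32 queues */
}

int
main(void)
{
	struct proc_sk p = { .p_usrpri = 90 };

	sk_tsleep(&p, 16);	/* some low (good) kernel priority */
	printf("woken into queue %d, not queue %d\n",
	    sk_wakeup_queue(&p), p.p_usrpri >> 2);
	return (0);
}

The point is only that the wake-up queue comes from the priority recorded
at sleep time, so it can no longer be confused with the user priority.
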
- Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
  So setrunnable() has been untangled to place SSTOP and SSLEEP threads
  in the preferred queue without having to use `p_priority'.  Note that
  `p_usrpri' is still recalculated *after* having called setrunqueue().
  This is currently fine because setrunnable() is called with SCHED_LOCK()
  but it will be racy when we split it.

- A new field, `p_runprio', has been introduced.  It should be considered
  as part of the per-CPU runqueues.  It indicates in which runqueue the
  thread is currently placed.

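As another hedged userland sketch (struct thr, sk_setrunqueue() and the
queue layout below are assumptions, not the diff's code), the insertion
path records the priority it actually used:

#include <sys/queue.h>	/* TAILQ macros, same family the kernel uses */
#include <stdio.h>

#define SK_NQS	32	/* assumed number of runqueues per CPU */

struct thr {
	int			 t_runprio;	/* stands in for p_runprio */
	TAILQ_ENTRY(thr)	 t_link;
};
TAILQ_HEAD(runq, thr);

struct cpu_sk {
	struct runq	sk_q[SK_NQS];	/* per-CPU runqueues */
};

/*
 * Insertion records the priority the thread was queued at, so later
 * removal or preemption checks need not recompute a stale value.
 */
static void
sk_setrunqueue(struct cpu_sk *ci, struct thr *t, int prio)
{
	t->t_runprio = prio;
	TAILQ_INSERT_TAIL(&ci->sk_q[prio >> 2], t, t_link);
}

int
main(void)
{
	struct cpu_sk ci;
	struct thr t;
	int i;

	for (i = 0; i < SK_NQS; i++)
		TAILQ_INIT(&ci.sk_q[i]);
	sk_setrunqueue(&ci, &t, 50);
	printf("thread queued at prio %d (queue %d)\n",
	    t.t_runprio, t.t_runprio >> 2);
	return (0);
}
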
- `spc_curpriority' is now updated at every context-switch.  That means
   need_resched() won't be called after comparing an out-of-date value.
   At the same time, `p_usrpri' is initialized to the highest possible
   value for idle threads.

- resched_proc() was calling need_resched() in the following conditions:
   - If the SONPROC thread has a higher priority than the current
     running thread (itself).
   - Twice in setrunnable() when we know that p_priority <= p_usrpri.
   - If schedcpu() considered that a thread, after updating its prio,
     should preempt the one running on the CPU pointed by `p_cpu'.

  The diff below simplifies all of that by calling need_resched() when:
   - A thread is inserted in a CPU runqueue at a higher priority than
     the one SONPROC.
   - schedcpu() decides that a thread in SRUN state should preempt the
     one SONPROC.

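Both conditions reduce to one comparison against an up-to-date
`spc_curpriority'.  A small sketch of that shape, with invented names
(sk_mi_switch(), sk_resched) standing in for mi_switch() and the flag
that need_resched() would set:

#include <stdbool.h>
#include <stdio.h>

struct cpu_sk {
	int	sk_curprio;	/* stands in for spc_curpriority */
	bool	sk_resched;	/* stands in for the resched flag */
};

/* mi_switch()-like step: publish the priority now running on this CPU. */
static void
sk_mi_switch(struct cpu_sk *ci, int newprio)
{
	ci->sk_curprio = newprio;
	ci->sk_resched = false;
}

/*
 * Enqueue-side check: lower value means better priority, so ask for a
 * reschedule only when the new thread beats what is on the CPU now.
 */
static void
sk_enqueue(struct cpu_sk *ci, int prio)
{
	if (prio < ci->sk_curprio)
		ci->sk_resched = true;	/* would call need_resched() */
}

int
main(void)
{
	struct cpu_sk ci;

	sk_mi_switch(&ci, 70);
	sk_enqueue(&ci, 90);	/* worse prio: no preemption */
	sk_enqueue(&ci, 20);	/* better prio: preempt */
	printf("resched wanted: %d\n", ci.sk_resched);
	return (0);
}
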
- `p_estcpu' `p_usrpri' and `p_slptime' which represent the "priority"
  of a thread are now updated while holding a per-thread mutex.  As a
  result schedclock() and donice() no longer take the SCHED_LOCK(),
  and schedcpu() almost never takes it.

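As a rough userland analogy (pthreads stand in for the kernel's
mtx_enter(9), and the priority formula below is made up), each thread's
fields travel with their own lock instead of the one global lock:

#include <pthread.h>
#include <stdio.h>

struct thr_sk {
	pthread_mutex_t	t_mtx;		/* per-thread lock */
	unsigned int	t_estcpu;
	unsigned int	t_usrpri;
};

/*
 * A donice()/schedclock()-style update only touches this thread's own
 * fields, so this thread's own mutex suffices; no global lock is taken.
 */
static void
sk_resetpriority(struct thr_sk *t, unsigned int estcpu)
{
	pthread_mutex_lock(&t->t_mtx);
	t->t_estcpu = estcpu;
	t->t_usrpri = 50 + estcpu / 4;	/* made-up decay formula */
	pthread_mutex_unlock(&t->t_mtx);
}

int
main(void)
{
	struct thr_sk t = { .t_mtx = PTHREAD_MUTEX_INITIALIZER };

	sk_resetpriority(&t, 40);
	printf("usrpri %u\n", t.t_usrpri);
	return (0);
}
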
- With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
  when displaying priorities.  This is helpful to understand what's
  happening:

load averages:  0.99,  0.56,  0.25   two.lab.grenadille.net 23:42:10
70 threads: 68 idle, 2 on processor                          up  0:09
CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M

  PID     TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
81000  145101  72    0    0K 1664K sleep/1   bored     1:15 36.96% softnet
47133  244097  73    0 2984K 4408K sleep/1   netio     1:06 35.06% cvs
64749  522184  66    0  176K  148K onproc/1  -         0:55 28.81% nfsd
21615  602473 127    0    0K 1664K sleep/0   -         7:22  0.00% idle0
12413  606242 127    0    0K 1664K sleep/1   -         7:08  0.00% idle1
85778  338258  50    0 4936K 7308K idle      select    0:10  0.00% ssh
22771  575513  50    0  176K  148K sleep/0   nfsd      0:02  0.00% nfsd



- The removal of `p_priority' and the change that makes mi_switch()
  always update `spc_curpriority' might introduce some changes in
  behavior, especially with kernel threads that were not going through
  tsleep(9).  We currently have some situations where the priority of
  the running thread isn't correctly reflected.  This diff changes that,
  which means we should be able to better understand where the problems
  are.

I'd be interested in comments/tests/reviews before continuing in this
direction.  Note that at least parts of this diff are required to split
the accounting apart from the SCHED_LOCK() as well.

I'll also work on exporting scheduler statistics unless somebody wants
to beat me :)

This has been tested on amd64 and sparc64 and includes the mtx_enter_try(9)
diff I just sent.

Index: arch/amd64/amd64/genassym.cf
===
RCS file: /cvs/src/sys/arch/amd64/amd64/genassym.cf,v
retrieving revision 1.40
diff -u -p -r1.40 genassym.cf
--- arch/amd64/amd64/genassym.cf	17 May 2019 19:07:15 -  1.40
+++ arch/amd64/amd64/genassym.cf	1 Jun 2019 16:27:46 -
@@ -32,7 +32,6 @@ export	VM_MIN_KERNEL_ADDRESS
 
 struct proc
 member p_addr
-member p_priority
 member p_stat
 member p_wchan
 member P_MD_REGS   p_md.md_regs
Index: arch/hppa/hppa/genassym.cf