Re: scheduler small changes

2019-06-01 Thread Amit Kulkarni
> > > The only reason I added quantum, was that I stumbled on the round robin 
> > > interval buglet. Initially added a fixed 100 ms per proc, and then 
> > > decided how much I could explore this quantum idea while still trying to 
> > > keep the code understandable.
> > 
> > Which buglet?  Should we fix it?
> 
> A minimal diff for the round robin interval buglet is attached at the end of 
> this email; it doesn't use the SCHED_LOCK(). Note, I removed all the different 
> quantums and just kept the default 100 ms, to try to convey the issue and 
> determine whether it is my misunderstanding and there is no buglet. I still 
> haven't looked deeply at implementing the sysctl for a separate systat view. 
> Also, your diffs to minimize SCHED_LOCK() contention make me think it is not 
> the right time to introduce variable rr interval scheduling without first 
> trying to reduce the simple cases of SCHED_LOCK() contention.


And I forgot to give an explanation of this diff in the email. Currently, 
because of the 100 ms round-robin, we don't know when a proc actually started 
running: it might have started 10 ms or 20 ms ago, and it can suddenly be moved 
off the CPU when rrticks expires every 100 ms, because that rr interval clock 
and the proc's start time are never guaranteed to be in sync. The diff tries to 
give a proc its full quantum of 100 ms; once it completes its 100 ms quantum, 
it yields and lets other procs run. The other kernel mechanisms are not touched 
at all. A running proc can still be preempted via sched_pause(), and every time 
a proc starts running, its quantum is reset back to 10 rrticks_init. In the 
best case, each proc gets 100 ms of running time, which is an improvement over 
the current behavior. In the worst case, a proc is preempted by or yields to a 
higher priority proc, but that worst case behavior is the same before and 
after this diff.
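
As a rough illustration of the idea (not the attached diff itself; the names 
below, such as p_rrticks, quantum_reload() and proc_sketch, are made up for 
this sketch), the per-proc quantum countdown restarts whenever a proc goes on 
CPU, so the rr clock and the proc start time can no longer drift apart:

/*
 * Sketch only: give each proc its own tick budget, reloaded when it is
 * switched in, so the 100 ms quantum is measured from the moment the
 * proc starts running rather than from a free-running per-CPU
 * round-robin clock.  rrticks_init would be 10 ticks at hz = 100.
 */
int rrticks_init = 10;

struct proc_sketch {
	int p_rrticks;			/* remaining ticks in this quantum */
};

/* called when the proc is put on a CPU */
void
quantum_reload(struct proc_sketch *p)
{
	p->p_rrticks = rrticks_init;
}

/*
 * called once per clock tick for the running proc; returns 1 when the
 * full quantum has been consumed and the proc should yield
 */
int
quantum_tick(struct proc_sketch *p)
{
	return (--p->p_rrticks <= 0);
}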

Thanks



Re: scheduler small changes

2019-05-31 Thread Amit Kulkarni
Hi,

Sorry for slacking off earlier; I was trying to recharge with some time off, 
away from kernel code, and to come back with a renewed focus.

> > > Regarding the choice of deriving quantum from the priority, are you sure
> > > the priorities are correct?  Should we keep priorities?  Or if userland
> > > needs priorities shouldn't we convert quantum into priority and not the
> > > other way around?
> > 
> > I am not entirely sure of the p_priority/usrpri/estcpu/load_avg 
> > calculations, as I am still trying to make sense of the code. But once we 
> > make sure all the p_priority calculations are consistent, I think the 
> > priorities are the way to go.
> 
> Why?  What's your goal?  In other words what kind of scheduling policy you
> want?  When should another thread be selected?  What about preemption? 

We should continue polishing the current scheduling policy, and maybe simplify 
it a bit. I can't answer these questions right now, as I am still trying to 
understand the current design and the issues it has.

> > If we go by deadlines, we will not have a way to understand how a proc is 
> > behaving in real time
> 
> What do you mean by behaving?  How much time did it spend running?  On
> which CPU?  Can't we monitor that?

To clarify what I mean by a deadline: a deadline is a monotonically increasing 
counter, where a single increment indicates that a 10 ms slice has been 
consumed. I was not sure how that would tell us about the recent past, i.e. 
whether the proc used any CPU within the last 1 sec. I am not sure how we can 
monitor a p_deadline unless we add some other proc variables into the 
calculation, like the way we use p_estcpu or p_cpticks to derive p_priority.

I recently read up on Linux CFS on this page: 
https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.taskscheduler.html
One way to do the scheduling with a p_deadline could follow Section 13.3.1, 
"How CFS Works":

"When CFS schedules a task it accumulates “virtual runtime” or vruntime. The 
next task picked to run is always the task with the minimum accumulated 
vruntime so far. By balancing the red-black tree when tasks are inserted into 
the run queue (a planned time line of processes to be executed next), the task 
with the minimum vruntime is always the first entry in the red-black tree.

The amount of vruntime a task accrues is related to its priority. High priority 
tasks gain vruntime at a slower rate than low priority tasks, which results in 
high priority tasks being picked to run on the processor more often."

But again, this vruntime (the deadline as used by Gregor/Michal) is still 
weighted by priority. I think we cannot escape priority, as we support nice'ing 
a proc. Maybe there is a way to make scheduling decisions without priority, but 
at this time it is beyond my understanding.
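
To make the weighting concrete, here is a tiny sketch (proc_sketch, p_weight 
and account_slice() are invented for illustration; this is not CFS code or any 
of the posted diffs):

#include <stdint.h>

/*
 * Priority-weighted vruntime accrual in the spirit of the CFS
 * description quoted above.
 */
struct proc_sketch {
	uint64_t	p_vruntime;	/* accumulated virtual runtime */
	unsigned int	p_weight;	/* larger = higher prio (lower nice), >= 1 */
};

static void
account_slice(struct proc_sketch *p, uint64_t ran_ns)
{
	/*
	 * A high-priority proc accrues vruntime more slowly, so it keeps
	 * the minimum vruntime longer and is picked to run more often.
	 */
	p->p_vruntime += ran_ns / p->p_weight;
}

The scheduler would then always pick the runnable proc with the smallest 
p_vruntime, e.g. the leftmost node of the red-black tree mentioned above.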

> > as a p_deadline is a history variable, how much a proc used its time 
> > slices. But if we stay with priorities, it is way simpler, and ranges 
> > strictly from 0 -> 127, and will work with current code. So the code 
> > changes will be minimal and easy to understand. Also, UNIX has historically 
> > had a concept of priority/nice levels, so I think we should stick with it.
> 
> So it's because it is simpler?  But if we follow this road, not changing
> anything is also simpler.  So why do you want to change the current
> code? :)

You got me there! :) We are all trying to improve, hopefully we keep it simple!

> > IMHO, a p_deadline field is a better substitute for p_slptime.
> 
> Why?

p_slptime is really only used for:
1) starting the swapper after a 10 sec boot delay,
2) choosing the parent process to swap out, i.e. the one whose proc has been 
sleeping the longest,
3) setting the p_priority when waking up from sleep in sched_bsd.

For #1, we keep a 10 sec boot delay separately. For #2, we can substitute a 
p_deadline, incremented right where p_cpticks is incremented, and choose the 
proc with the least p_deadline in uvm_glue.c (a rough sketch of this follows 
below). For #3, in kern_synch we can set p_priority at the correct place(s) 
when a proc is coming back from SSLEEP, and in schedcpu() of sched_bsd.c we can 
update p_priority only for procs in SONPROC or SRUN state (reduced SCHED_LOCK() 
contention?), so p_slptime and updatepri() can go away completely. Note, this 
is theory only. Is this thinking valid?
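
A rough sketch of #2 only (theory; p_deadline and everything around it here is 
hypothetical and simplified, not the actual statclock()/uvm_glue.c code):

#include <limits.h>
#include <stddef.h>

/* Hypothetical per-proc fields for this sketch only. */
struct proc_sketch {
	unsigned long		p_cpticks;
	unsigned long		p_deadline;	/* ticks of CPU consumed */
	struct proc_sketch	*p_next;
};

/* #2: increment p_deadline right where p_cpticks is bumped */
static void
statclock_sketch(struct proc_sketch *p)
{
	p->p_cpticks++;
	p->p_deadline++;
}

/*
 * Swapout candidate: the proc that has consumed the least CPU,
 * replacing the current "slept the longest" (p_slptime) rule.
 */
static struct proc_sketch *
pick_swapout_victim(struct proc_sketch *head)
{
	struct proc_sketch *p, *victim = NULL;
	unsigned long least = ULONG_MAX;

	for (p = head; p != NULL; p = p->p_next) {
		if (p->p_deadline < least) {
			least = p->p_deadline;
			victim = p;
		}
	}
	return victim;
}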

> > The only reason I added quantum, was that I stumbled on the round robin 
> > interval buglet. Initially added a fixed 100 ms per proc, and then decided 
> > how much I could explore this quantum idea while still trying to keep the 
> > code understandable.
> 
> Which buglet?  Should we fix it?

A minimal diff for the round robin interval buglet is attached at the end of 
this email; it doesn't use the SCHED_LOCK(). Note, I removed all the different 
quantums and just kept the default 100 ms, to try to convey the issue and 
determine whether it is my misunderstanding and there is no buglet. I still 
haven't looked deeply at implementing the sysctl for a separate systat view. 
Also, your diffs to minimize SCHED_LOCK() contention make me think it is not 
the right time to introduce variable rr interval scheduling without first 
trying to reduce the simple cases of SCHED_LOCK() contention.

Re: scheduler small changes

2019-05-17 Thread Alexandre Ratchov
On Wed, May 15, 2019 at 09:05:32AM -0500, Amit Kulkarni wrote:
> 
> This diff survived multiple tens of kernel builds, a bsd.sp build,
> bsd.rd build, a bsd.mp without these changes, userland/xenocara, a
> make regress a few days ago all on 3 desktops on amd64. Ran under
> all possible scenarios listed in previous sentence. No major
> invasive changes attempted, all tools should work as is, if its
> broken, please let me know. This is faster than current, but not
> sure by how much, placebo effect in play.

CPU-bound processes (builds) run at the same priority, so both the
current and the proposed algorithms should split them into equal time
slices, resulting in the same behavior.

> 
> Not sure if the reduction from 32 queues to a single TAILQ would be
> accepted, but tried it anyway, it is definitely faster. This code
> tries to reduce the complexity in deciding which queue to place the
> proc on. 

The intent of using multiple queues is to allow waking up processes
and making them runnable with the desired priority. To achieve this
with a single queue, you would have to insert the process to wake up
in the "middle" of the TAILQ. That is not a problem in itself, as there
are only a few runnable processes on a typical system.
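
A rough sketch of such an ordered insert with a single TAILQ (the names here 
are illustrative, not the actual runqueue code):

#include <sys/queue.h>

struct rq_proc {
	int			prq_prio;	/* lower value = runs first */
	TAILQ_ENTRY(rq_proc)	prq_entry;
};
TAILQ_HEAD(rq_head, rq_proc);

static void
runq_insert(struct rq_head *rq, struct rq_proc *np)
{
	struct rq_proc *p;

	/*
	 * Walk the queue and insert in front of the first proc with a
	 * worse (numerically larger) priority.
	 */
	TAILQ_FOREACH(p, rq, prq_entry) {
		if (np->prq_prio < p->prq_prio) {
			TAILQ_INSERT_BEFORE(p, np, prq_entry);
			return;
		}
	}
	TAILQ_INSERT_TAIL(rq, np, prq_entry);	/* worst priority so far */
}

With this, a higher-priority proc that wakes up lands ahead of the 
lower-priority procs already queued, as in the example below.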

Example, on a single CPU machine, suppose a CPU-bound process is
running (ex. gzip compressing a file), we want any I/O-bound processes
(ex. audio player) that wakes up to be inserted on the run queue
before the CPU-bound process.



Re: scheduler small changes

2019-05-17 Thread Amit Kulkarni
On Thu, 16 May 2019 15:15:24 -0300
Martin Pieuchot  wrote:

> On 16/05/19(Thu) 00:08, Amit Kulkarni wrote:
> > [...] 
> > > Regarding the choice of deriving quantum from the priority, are you sure
> > > the priorities are correct?  Should we keep priorities?  Or if userland
> > > needs priorities shouldn't we convert quantum into priority and not the
> > > other way around?
> > 
> > I am not entirely sure of the p_priority/usrpri/estcpu/load_avg 
> > calculations, as I am still trying to make sense of the code. But once we 
> > make sure all the p_priority calculations are consistent, I think the 
> > priorities are the way to go.
> 
> Why?  What's your goal?  In other words what kind of scheduling policy you
> want?  When should another thread be selected?  What about preemption? 
> 
> > If we go by deadlines, we will not have a way to understand how a proc is 
> > behaving in real time
> 
> What do you mean by behaving?  How much time did it spend running?  On
> which CPU?  Can't we monitor that?
> 
> > as a p_deadline is a history variable, how much a proc used its time 
> > slices. But if we stay with priorities, it is way simpler, and ranges 
> > strictly from 0 -> 127, and will work with current code. So the code 
> > changes will be minimal and easy to understand. Also, UNIX has historically 
> > had a concept of priority/nice levels, so I think we should stick with it.
> 
> So it's because it is simpler?  But if we follow this road, not changing
> anything is also simpler.  So why do you want to change the current
> code? :)
> 
> 
> > IMHO, a p_deadline field is a better substitute for p_slptime.
> 
> Why?
> 
> > The only reason I added quantum, was that I stumbled on the round robin 
> > interval buglet. Initially added a fixed 100 ms per proc, and then decided 
> > how much I could explore this quantum idea while still trying to keep the 
> > code understandable.
> 
> Which buglet?  Should we fix it?
> 
> > I would guess a lot of code in userland is based on priorities, the 
> > GNOME/KDE equivalent of top/classical top/htop etc... I would think of 
> > p_priority as a real time tracking field of how hot the proc is in using 
> > the CPU. In order to change the quantum, how would we decide to increment 
> > or decrement it, unless by a real time tracking field? There's code which 
> > already does this tracking for p_priority, it might be flawed or complex, 
> > so why not fix it and use it?
> 
> What do you mean by 'real time tracking field'?  I'd suggest you look at
> how priorities are set when a thread goes to sleep and how they are used
> when it is awoken.


Hi Martin,
I feel that you are trying to point out something by giving me some hints. 
Please give me some time to think through all possible scenarios, and I will 
send a detailed reply to all the points you raised above in this email, with a 
reduced+functional diff, over this weekend.

@Solene, thanks for testing, please ignore the previous diff for now.

Thanks



Re: scheduler small changes

2019-05-16 Thread Martin Pieuchot
On 16/05/19(Thu) 00:08, Amit Kulkarni wrote:
> [...] 
> > Regarding the choice of deriving quantum from the priority, are you sure
> > the priorities are correct?  Should we keep priorities?  Or if userland
> > needs priorities shouldn't we convert quantum into priority and not the
> > other way around?
> 
> I am not entirely sure of the p_priority/usrpri/estcpu/load_avg calculations, 
> as I am still trying to make sense of the code. But once we make sure all the 
> p_priority calculations are consistent, I think the priorities are the way to 
> go.

Why?  What's your goal?  In other words what kind of scheduling policy you
want?  When should another thread be selected?  What about preemption? 

> If we go by deadlines, we will not have a way to understand how a proc is 
> behaving in real time

What do you mean by behaving?  How much time did it spend running?  On
which CPU?  Can't we monitor that?

> as a p_deadline is a history variable, how much a proc used its time slices. 
> But if we stay with priorities, it is way simpler, and ranges strictly from 0 
> -> 127, and will work with current code. So the code changes will be minimal 
> and easy to understand. Also, UNIX has historically had a concept of 
> priority/nice levels, so I think we should stick with it.

So it's because it is simpler?  But if we follow this road, not changing
anything is also simpler.  So why do you want to change the current
code? :)


> IMHO, a p_deadline field is a better substitute for p_slptime.

Why?

> The only reason I added quantum, was that I stumbled on the round robin 
> interval buglet. Initially added a fixed 100 ms per proc, and then decided 
> how much I could explore this quantum idea while still trying to keep the 
> code understandable.

Which buglet?  Should we fix it?

> I would guess a lot of code in userland is based on priorities, the GNOME/KDE 
> equivalent of top/classical top/htop etc... I would think of p_priority as a 
> real time tracking field of how hot the proc is in using the CPU. In order to 
> change the quantum, how would we decide to increment or decrement it, unless 
> by a real time tracking field? There's code which already does this tracking 
> for p_priority, it might be flawed or complex, so why not fix it and use it?

What do you mean by 'real time tracking field'?  I'd suggest you look at
how priorities are set when a thread goes to sleep and how they are used
when it is awoken.



Re: scheduler small changes

2019-05-16 Thread Solene Rapenne
On Wed, May 15, 2019 at 09:05:32AM -0500, Amit Kulkarni wrote:
> Hi,
> 
> This effort is heavily based on top of Gregor's and Michal's diffs. Tried to 
> incorporate feedback given by different people to them in 2011/2016. Split 
> the new code in a ifdef, so people can do a straight comparison, tried very 
> hard not to delete existing code, just shifted it around. Main motivation was 
> to find if it is possible to do incremental improvements in scheduler. After 
> sometime, have still not understood completely the load avg part, 
> p_priority/p_usrpri calculations, the cpuset code. Looked heavily at 
> Dragonfly (simpler than FreeBSD) and FreeBSD code. As a newcomer, OpenBSD 
> code is way easier to read. Thanks for the detailed explanations given to 
> Michal and also in later diffs posted to tech@, they helped a lot when trying 
> to understand the decisions made by other devs, the commit messages help a 
> little but the explanations help a lot more. This invites discussions and end 
> users learn a lot more in the process.
> 
> This diff survived multiple tens of kernel builds, a bsd.sp build, bsd.rd 
> build, a bsd.mp without these changes, userland/xenocara, a make regress a 
> few days ago all on 3 desktops on amd64. Ran under all possible scenarios 
> listed in previous sentence. No major invasive changes attempted, all tools 
> should work as is, if its broken, please let me know. This is faster than 
> current, but not sure by how much, placebo effect in play.
> 
> I think  there is a bug in resched_proc() which is fixed in mi_switch(), a 
> quick enhancement in cpu_switchto(), p_slptime, and precise round robin 
> interval for each proc unless preempted or it yields().
> 
> Tried to fiddle with different time slices other than hz/10, not sure if it 
> will work on other arches, but tried to stay within MI code, so it should 
> work. Other than counting no idea to verify a proc is actually switched away 
> after its rr interval is over. Just went by what is there in the existing 
> code. Tried giving higher slice like 200 ms, but didn't want to exceed the 
> rucheck() in kern_resource 1 sec limit.
> 
> Tried to do rudimentary work on affinity without having a affinity field in 
> struct proc or struct schedstate_percpu (like Dragonfly/FreeBSD does). 
> Observed the unbalance in runqs. Affinity works most of the time under light 
> load. There's a problem when I try to comment out sched_steal_proc(), in 
> kern_sched, that is the problem with this iteration of the diff.
> 
> Not sure if the reduction from 32 queues to a single TAILQ would be accepted, 
> but tried it anyway, it is definitely faster. This code tries to reduce the 
> complexity in deciding which queue to place the proc on. There is no current 
> support for a real-time queue or other types of scheduling classes, so IMHO 
> this is a simplification.
> 
> Tried to give detailed explanation of thinking. Sent the complete diff, but 
> will split diff, if parts of it are found to be valid.
> 
> In any case, a request to please accept a small change in top, to display 
> p_priority directly.
> 
> Thanks
> 
> 

Hi,

I tried your diff. It makes the game games/gzdoom unplayable because of too
much stuttering. I did not notice any change apart from this case on my
Intel i7 8550-U.



Re: scheduler small changes

2019-05-15 Thread Amit Kulkarni
> Why did you decide to change the data structure of the runqueue?  What
> problem are you trying to solve?

Thanks for your feedback. It forced me to do some introspection.

I was trying to explore whether we can tweak the current code and make it 
faster, while still trying to keep it as simple as it is now.

The explanation I gave in the diff for the removal of the 32 TAILQs is wrong. 
That flawed analysis got set in my mind weeks ago, when I was just starting to 
read the scheduler code. I understand now that the current code is quite a bit 
better: a proc with a low p_priority value is placed at a lower array index and 
is returned first, while a proc with a higher p_priority value is placed at a 
higher array index. So procs are effectively kept in sorted order of 
p_priority. I will revert this data structure change, as a single TAILQ will 
not return procs in sorted p_priority order.
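
Roughly, the mechanism described above works like the following sketch 
(simplified, with made-up names, not the verbatim kern_sched.c code): the queue 
index is derived from p_priority, and the dispatcher always serves the 
lowest-numbered non-empty queue.

#include <sys/queue.h>
#include <strings.h>			/* ffs() */
#include <stddef.h>

#define SCHED_NQS	32		/* 128 priorities, 4 per queue */

struct proc_sketch {
	int p_priority;			/* 0..127, lower runs first */
	TAILQ_ENTRY(proc_sketch) p_runq;
};
TAILQ_HEAD(prochead, proc_sketch);

struct runq_sketch {
	struct prochead	qs[SCHED_NQS];
	unsigned int	whichqs;	/* bit i set: qs[i] is non-empty */
};

static void
runq_enqueue(struct runq_sketch *rq, struct proc_sketch *p)
{
	int queue = p->p_priority >> 2;	/* low p_priority -> low index */

	TAILQ_INSERT_TAIL(&rq->qs[queue], p, p_runq);
	rq->whichqs |= 1U << queue;
}

static struct proc_sketch *
runq_choose(struct runq_sketch *rq)
{
	struct proc_sketch *p;
	int queue;

	if (rq->whichqs == 0)
		return NULL;		/* nothing runnable */
	queue = ffs(rq->whichqs) - 1;	/* lowest non-empty queue */
	p = TAILQ_FIRST(&rq->qs[queue]);
	TAILQ_REMOVE(&rq->qs[queue], p, p_runq);
	if (TAILQ_EMPTY(&rq->qs[queue]))
		rq->whichqs &= ~(1U << queue);
	return p;
}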

> Regarding your work, if you want to continue in the scheduler area, may
> I suggest you start by making the global counters per-spc and exporting
> them to userland via a sysctl?  Add a new view to systat(1) to see what's
> happening.  Without more visibility it's hard to confirm any theory.

This is an excellent idea. It will take me some time to understand the 
sysctl/systat part and implement it, but I will try to address it within a week 
or two.

> 
> Regarding the choice of deriving quantum from the priority, are you sure
> the priorities are correct?  Should we keep priorities?  Or if userland
> needs priorities shouldn't we convert quantum into priority and not the
> other way around?

I am not entirely sure of the p_priority/usrpri/estcpu/load_avg calculations, 
as I am still trying to make sense of the code. But once we make sure all the 
p_priority calculations are consistent, I think the priorities are the way to 
go. If we go by deadlines, we will not have a way to understand how a proc is 
behaving in real time, as a p_deadline is a history variable, how much a proc 
used its time slices. But if we stay with priorities, it is way simpler, and 
ranges strictly from 0 -> 127, and will work with current code. So the code 
changes will be minimal and easy to understand. Also, UNIX has historically had 
a concept of priority/nice levels, so I think we should stick with it. IMHO, a 
p_deadline field is a better substitute for p_slptime.

The only reason I added quantum, was that I stumbled on the round robin 
interval buglet. Initially added a fixed 100 ms per proc, and then decided how 
much I could explore this quantum idea while still trying to keep the code 
understandable.

I would guess a lot of code in userland is based on priorities, the GNOME/KDE 
equivalent of top/classical top/htop etc... I would think of p_priority as a 
real time tracking field of how hot the proc is in using the CPU. In order to 
change the quantum, how would we decide to increment or decrement it, unless by 
a real time tracking field? There's code which already does this tracking for 
p_priority, it might be flawed or complex, so why not fix it and use it?

> Regarding your diff, if you find a thread in RUN state inside the sleep
> queue (wakeup_n()), something is really wrong there.

I added that SRUN assert weeks ago when I was trying to add some code and 
figure things out, and broke it immediately on boot; the assert made the 
problem go away, and until you noticed it I had forgotten about it. I removed 
it today and compiled kernel/userland while browsing on one desktop and running 
a kernel build on another desktop. It worked out fine.

Thanks



Re: scheduler small changes

2019-05-15 Thread Martin Pieuchot
Hello Amit,

On 15/05/19(Wed) 09:05, Amit Kulkarni wrote:
> Hi,
> 
> This effort is heavily based on top of Gregor's and Michal's diffs. Tried to 
> incorporate feedback given by different people to them in 2011/2016. Split 
> the new code in a ifdef, so people can do a straight comparison, tried very 
> hard not to delete existing code, just shifted it around. Main motivation was 
> to find if it is possible to do incremental improvements in scheduler. After 
> sometime, have still not understood completely the load avg part, 
> p_priority/p_usrpri calculations, the cpuset code. Looked heavily at 
> Dragonfly (simpler than FreeBSD) and FreeBSD code. As a newcomer, OpenBSD 
> code is way easier to read. Thanks for the detailed explanations given to 
> Michal and also in later diffs posted to tech@, they helped a lot when trying 
> to understand the decisions made by other devs, the commit messages help a 
> little but the explanations help a lot more. This invites discussions and end 
> users learn a lot more in the process.

Why did you decide to change the data structure of the runqueue?  What
problem are you trying to solve?

Regarding your work, if you want to continue in the scheduler area, may
I suggest you start by making the global counters per-spc and exporting
them to userland via a sysctl?  Add a new view to systat(1) to see what's
happening.  Without more visibility it's hard to confirm any theory.

Regarding the choice of deriving quantum from the priority, are you sure
the priorities are correct?  Should we keep priorities?  Or if userland
needs priorities shouldn't we convert quantum into priority and not the
other way around?

Regarding your diff, if you find a thread in RUN state inside the sleep
queue (wakeup_n()), something is really wrong there.