Hi, sorry for slacking off earlier; I was trying to recharge with some time
off away from kernel code, and come back with a renewed focus.
> > > Regarding the choice of deriving quantum from the priority, are you sure
> > > the priorities are correct? Should we keep priorities? Or if userland
> > > needs priorities shouldn't we convert quantum into priority and not the
> > > other way around?
> >
> > I am not entirely sure of the p_priority/usrpri/estcpu/load_avg
> > calculations, as I am still trying to make sense of the code. But once we
> > make sure all the p_priority calculations are consistent, I think the
> > priorities are the way to go.
>
> Why? What's your goal? In other words what kind of scheduling policy you
> want? When should another thread be selected? What about preemption?

We should continue polishing the current scheduling policy, and maybe
simplify it a bit. I can't answer this question right now, as I am still
trying to understand the current design and the issues it has.

> > If we go by deadlines, we will not have a way to understand how a proc is
> > behaving in real time
>
> What do you mean by behaving? How much time did it spend running? On
> which CPU? Can't we monitor that?

To clarify what a deadline is: it is a monotonically increasing counter,
where a single increment indicates that a 10 ms slice has been consumed. I
was not sure how that would tell us about the recent past, i.e. whether the
proc had used any CPU within the last 1 sec. I am not sure how we can
monitor a p_deadline unless we add some other proc variables into the
calculation, like how we use p_estcpu or p_cpticks to derive p_priority.

I recently read up on Linux CFS on this page:
https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.taskscheduler.html

One way to do the scheduling with a p_deadline could come from Section
13.3.1, "How CFS Works":

"When CFS schedules a task it accumulates “virtual runtime” or vruntime.
The next task picked to run is always the task with the minimum
accumulated vruntime so far.
By balancing the red-black tree when tasks are inserted into the run queue
(a planned time line of processes to be executed next), the task with the
minimum vruntime is always the first entry in the red-black tree. The
amount of vruntime a task accrues is related to its priority. High
priority tasks gain vruntime at a slower rate than low priority tasks,
which results in high priority tasks being picked to run on the processor
more often."

But again, this vruntime (the deadline as used by Gregor/Michal) is used
together with priority. I think we cannot escape priority, as we support
nice'ing a proc. Maybe there is a way to make scheduling decisions without
priority? At this time, it is beyond my understanding.

> > as a p_deadline is a history variable, how much a proc used its time
> > slices. But if we stay with priorities, it is way simpler, and ranges
> > strictly from 0 -> 127, and will work with current code. So the code
> > changes will be minimal and easy to understand. Also, UNIX has
> > historically had a concept of priority/nice levels, so I think we
> > should stick with it.
>
> So it's because it is simpler? But if we follow this road, not changing
> anything is also simpler. So why do you want to change the current
> code? :)

You got me there! :) We are all trying to improve; hopefully we keep it
simple!

> > IMHO, a p_deadline field is a better substitute for p_slptime.
>
> Why?

p_slptime is really only used for:

1) starting the swapper after a 10 sec boot delay,
2) choosing the parent process to swap out, i.e. the one with a proc that
   has been sleeping the longest,
3) setting the p_priority when waking up from sleep in sched_bsd.

For #1, we can keep a 10 sec boot delay separately. For #2, we can
substitute a p_deadline, which is incremented right where p_cpticks is
incremented, and choose the proc with the least amount of p_deadline in
uvm_glue.c.
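As a userland sketch of that #2 substitution (the names fakeproc and
pick_swapout_victim() are hypothetical stand-ins, not kernel symbols, and
this is theory only, not the actual uvm_glue.c change), the selection
criterion would look like:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct proc: p_deadline counts consumed
 * 10 ms slices and is bumped right where p_cpticks is incremented.
 */
struct fakeproc {
	unsigned int p_deadline;	/* slices consumed so far */
	struct fakeproc *p_next;	/* next proc in the list */
};

/*
 * Pick the proc that has consumed the least CPU, i.e. the one with
 * the smallest accumulated p_deadline, as the swap-out candidate.
 */
static struct fakeproc *
pick_swapout_victim(struct fakeproc *head)
{
	struct fakeproc *p, *victim = head;

	for (p = head; p != NULL; p = p->p_next)
		if (p->p_deadline < victim->p_deadline)
			victim = p;
	return victim;
}
```

The real change would walk the kernel's process list; this only shows the
selection criterion replacing the p_slptime comparison.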
For #3, in kern_synch, we can set the p_priority at the correct place(s)
when a proc is coming back from SSLEEP, and in schedcpu() of sched_bsd.c
we can update p_priority only for procs that are either SONPROC or SRUN
(reduced SCHED_LOCK() contention?), so p_slptime and updatepri() can go
away completely. Note, this is theory only. Is this thinking valid?

> > The only reason I added quantum, was that I stumbled on the round robin
> > interval buglet. Initially added a fixed 100 ms per proc, and then
> > decided how much I could explore this quantum idea while still trying
> > to keep the code understandable.
>
> Which buglet? Should we fix it?

A minimal diff for the round robin interval buglet is attached at the end
of this email; it doesn't use the SCHED_LOCK(). Note, I removed all the
different quantums and just kept the default 100 ms, to make it easier to
determine whether it is my misunderstanding and there is no buglet. I
still haven't looked deeply at implementing the sysctl for a separate
systat view.

Also, your diffs to minimize SCHED_LOCK() contention make me think it is
not the right time to introduce variable rr interval scheduling, without
first trying to reduce the simple cases of SCHED_LOCK() contention.

I think there are still some buglets, but I will try to send diffs one at
a time:

1) spc_curpriority is being compared in resched_proc(), but is not being
   set correctly?
2) depending on clarification on eliminating p_slptime:
   a) there's a problem in the p_slptime calculation.
   b) the p_slptime usage in sched_proc_to_cpu_cost() can be replaced by
      a more correct determination that the proc is on this cpu.

Finally, an enhancement: in cpu_switchto(), can we keep a proc on the same
cpu in the lightly loaded case, so the old proc and the new proc are the
same? It worked in my experiments earlier.

> > I would guess a lot of code in userland is based on priorities, the
> > GNOME/KDE equivalent of top/classical top/htop etc...
> > I would think of p_priority as a real time tracking field of how hot
> > the proc is in using the CPU. In order to change the quantum, how
> > would we decide to increment or decrement it, unless by a real time
> > tracking field? There's code which already does this tracking for
> > p_priority; it might be flawed or complex, so why not fix it and use
> > it?
>
> What do you mean by 'real time tracking field'? I'd suggest you look at
> how priorities are set when a thread goes to sleep and how they are used
> when it is awoken.

I think you were trying to hint at p_slptime? If not, will you please let
me know?

Thanks

diff --git a/kern/kern_clock.c b/kern/kern_clock.c
index ee701945966..d6c8f730dab 100644
--- a/kern/kern_clock.c
+++ b/kern/kern_clock.c
@@ -171,8 +171,8 @@ hardclock(struct clockframe *frame)
 	if (stathz == 0)
 		statclock(frame);
 
-	if (--ci->ci_schedstate.spc_rrticks <= 0)
-		roundrobin(ci);
+	if (p && --p->p_cnsd_slice <= 0)
+		roundrobin(ci, p);
 
 	/*
 	 * If we are not the primary CPU, we're not allowed to do
diff --git a/kern/kern_sched.c b/kern/kern_sched.c
index 346c719d13b..d3e5dfbb79c 100644
--- a/kern/kern_sched.c
+++ b/kern/kern_sched.c
@@ -257,6 +257,7 @@ setrunqueue(struct proc *p)
 	spc = &p->p_cpu->ci_schedstate;
 
 	spc->spc_nrun++;
+	p->p_cnsd_slice = rrticks_init;
 	TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
 	spc->spc_whichqs |= (1 << queue);
 	cpuset_add(&sched_queued_cpus, p->p_cpu);
diff --git a/kern/sched_bsd.c b/kern/sched_bsd.c
index 1cb5c91ce38..97a23efa756 100644
--- a/kern/sched_bsd.c
+++ b/kern/sched_bsd.c
@@ -85,29 +85,22 @@ scheduler_start(void)
  * Force switch among equal priority processes every 100ms.
  */
 void
-roundrobin(struct cpu_info *ci)
+roundrobin(struct cpu_info *ci, struct proc *p)
 {
 	struct schedstate_percpu *spc = &ci->ci_schedstate;
 
+	spc->rrticks_ctr++;
-	spc->spc_rrticks = rrticks_init;
-
-	if (ci->ci_curproc != NULL) {
-		if (spc->spc_schedflags & SPCF_SEENRR) {
-			/*
-			 * The process has already been through a roundrobin
-			 * without switching and may be hogging the CPU.
-			 * Indicate that the process should yield.
-			 */
-			atomic_setbits_int(&spc->spc_schedflags,
-			    SPCF_SHOULDYIELD);
-		} else {
-			atomic_setbits_int(&spc->spc_schedflags,
-			    SPCF_SEENRR);
-		}
-	}
+	/* Reset p_cnsd_slice immediately.  This call below may be unnecessary,
+	 * as we reset this, hopefully immediately, in setrunqueue(). */
+	p->p_cnsd_slice = rrticks_init;
 
-	if (spc->spc_nrun)
-		need_resched(ci);
+	/*
+	 * The proc has already been through a roundrobin
+	 * without switching and may be hogging the CPU.
+	 * Indicate that the proc should yield.
+	 */
+	atomic_setbits_int(&spc->spc_schedflags, SPCF_SHOULDYIELD);
+	need_resched(ci);
 }
 
 /*
@@ -389,7 +382,7 @@ mi_switch(void)
 	 * Process is about to yield the CPU; clear the appropriate
 	 * scheduling flags.
 	 */
-	atomic_clearbits_int(&spc->spc_schedflags, SPCF_SWITCHCLEAR);
+	atomic_clearbits_int(&spc->spc_schedflags, SPCF_SHOULDYIELD);
 
 	nextproc = sched_chooseproc();
 
diff --git a/sys/proc.h b/sys/proc.h
index dd2ea7c45cc..3d326b940cb 100644
--- a/sys/proc.h
+++ b/sys/proc.h
@@ -359,6 +359,7 @@ struct proc {
 	int	p_siglist;	/* Signals arrived but not delivered. */
+	u_int	p_cnsd_slice;	/* approx time using a CPU, N * 10 ms. */
 
 /* End area that is zeroed on creation. */
 #define	p_endzero	p_startcopy
diff --git a/sys/sched.h b/sys/sched.h
index d1ef56b7a9e..31e92f04230 100644
--- a/sys/sched.h
+++ b/sys/sched.h
@@ -104,7 +104,7 @@ struct schedstate_percpu {
 	u_int spc_schedticks;		/* ticks for schedclock() */
 	u_int64_t spc_cp_time[CPUSTATES]; /* CPU state statistics */
 	u_char spc_curpriority;		/* usrpri of curproc */
-	int spc_rrticks;		/* ticks until roundrobin() */
+	int rrticks_ctr;		/* rrticks_init of rr completed */
 	int spc_pscnt;			/* prof/stat counter */
 	int spc_psdiv;			/* prof/stat divisor */
 
@@ -131,9 +131,7 @@ struct cpustats {
 
 #ifdef _KERNEL
 /* spc_flags */
-#define SPCF_SEENRR		0x0001	/* process has seen roundrobin() */
 #define SPCF_SHOULDYIELD	0x0002	/* process should yield the CPU */
-#define SPCF_SWITCHCLEAR	(SPCF_SEENRR|SPCF_SHOULDYIELD)
 #define SPCF_SHOULDHALT		0x0004	/* CPU should be vacated */
 #define SPCF_HALTED		0x0008	/* CPU has been halted */
 
@@ -147,7 +145,7 @@ extern int rrticks_init;	/* ticks per roundrobin() */
 
 struct proc;
 void schedclock(struct proc *);
 struct cpu_info;
-void roundrobin(struct cpu_info *);
+void roundrobin(struct cpu_info *, struct proc *);
 void scheduler_start(void);
 void userret(struct proc *p);
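P.S. For what it's worth, the slice accounting in the diff can be
exercised in userland. The following is a hypothetical simulation (the
fakespc/fakeproc structs and fake_* functions are stand-ins I made up,
not the kernel code itself): fake_hardclock() burns one tick of the
running proc's slice, and when the slice is used up, fake_roundrobin()
counts the completed interval, refills the slice, and marks the proc as
SHOULDYIELD.

```c
#include <assert.h>

#define SPCF_SHOULDYIELD	0x0002	/* proc should yield the CPU */
#define RRTICKS_INIT		10	/* 10 ticks of 10 ms == 100 ms slice */

/* Userland stand-ins for the kernel structures the diff touches. */
struct fakespc {
	int rrticks_ctr;	/* completed round-robin intervals */
	int spc_schedflags;	/* SPCF_* flags */
};

struct fakeproc {
	unsigned int p_cnsd_slice;	/* ticks left in the current slice */
};

/* Mirrors the diff's roundrobin(): count the completed interval,
 * refill the per-proc slice, and ask the proc to yield. */
static void
fake_roundrobin(struct fakespc *spc, struct fakeproc *p)
{
	spc->rrticks_ctr++;
	p->p_cnsd_slice = RRTICKS_INIT;
	spc->spc_schedflags |= SPCF_SHOULDYIELD;
}

/* Mirrors the diff's hardclock() hook: burn one tick of the running
 * proc's slice and fire roundrobin when the slice is used up. */
static void
fake_hardclock(struct fakespc *spc, struct fakeproc *p)
{
	if (--p->p_cnsd_slice == 0)
		fake_roundrobin(spc, p);
}
```

Ten hardclock ticks consume the 100 ms slice, at which point the proc is
marked to yield and its slice is refilled for its next run.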
