> > When a cpu is idle, and wants to steal, it should steal from worst loaded
> > cpu, i.e. with the highest cost, not the least cost.
>
> What you say might be a valid choice. However I'm not sure to
> understand how does it relate to the behavior of the code?
>
> When stealing sched_proc_to_cpu_cost() is always called with for the
> same CPU, `ci' is always the same, which means `cost' is equivalent to
> `p_usrpri', right?
>
> Now the priority space is defined in sys/param.h and the higher priority
> correspond to a small number. So are we sure we want to steal a thread
> with the highest `cost', which would mean with the lowest priority? Or
> did I miss something?
When a cpu is idle, it is looking for a proc to steal. Now these two loops
(outer queued procs cpu loop + inner runqueue per cpu loop) basically loops
over all procs in the first readily runnable runqueue per each cpu. We should
ideally steal from the most loaded CPU, denoted by cost. Cost takes into
account: priority, runqueue length, primary cpu check, load average, and
finally memory footprint. Here, the most dominant variable should be the
runqueue length, since it is a multiplication with sched_cost_runnable.
'ci' changes after we proceed to another cpu, and cost will be different for
each cpu+proc combination. Actually, in the steal case, if there is a large
difference in spc_curpriority and p_usrpri between running proc which is
SONPROC vs the current proc, then the cost will be skewed towards an outlier,
but this will happen for all procs for that ci as you analyzed.
To summarize: currently, if a cpu is running 1-2 procs with large differences
in priority, there is a high possibility that this function will go steal a
proc from it for the just idled cpu. Whereas at the same time, another cpu is
running 5-7 procs with exact same priorities, that will be ignored in the above
calculation! We should have stolen from the cpu with 5-7 procs.
I hope this is a correct interpretation? Is there something wrong in this
analysis?
Another small optimization is to issue a break right after the if(cost >
bestcost) check, from the TAILQ_FOREACH, so we short circuit the potential
exponential lookup (i * j). Currently, this is O(N * N )! With that
optimization in place, we now looked at this cpu, and maybe we consider keeping
this unpegged proc in mind, for our stealing calculation. Let's now go look at
just 1 proc in other cpu's. Why iterate all procs at the head of runqueue of
each cpu (worst case), one proc is enough per cpu in this stealing calculation,
is it not? Once we decide which is the highest cost, we send it to the idle
cpu. Basically, we try to shorten this O(N * N) lookup. My guess, for machines
with large ncpu, this will be really noticeable slowness, in sched_steal_proc().
Thanks
>
> > Index: kern/kern_sched.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_sched.c,v
> > retrieving revision 1.64
> > diff -u -p -u -p -r1.64 kern_sched.c
> > --- kern/kern_sched.c 30 Jan 2020 08:51:27 -0000 1.64
> > +++ kern/kern_sched.c 4 Feb 2020 14:25:59 -0000
> > @@ -487,7 +487,7 @@ sched_steal_proc(struct cpu_info *self)
> > struct proc *best = NULL;
> > #ifdef MULTIPROCESSOR
> > struct schedstate_percpu *spc;
> > - int bestcost = INT_MAX;
> > + int bestcost = 0;
> > struct cpu_info *ci;
> > struct cpuset set;
> >
> > @@ -515,7 +515,7 @@ sched_steal_proc(struct cpu_info *self)
> >
> > cost = sched_proc_to_cpu_cost(self, p);
> >
> > - if (best == NULL || cost < bestcost) {
> > + if (cost > bestcost) {
> > best = p;
> > bestcost = cost;
> > }
a break statement right here.
> > @@ -524,7 +524,6 @@ sched_steal_proc(struct cpu_info *self)
> > if (best == NULL)
> > return (NULL);
> >
> > - spc = &best->p_cpu->ci_schedstate;
> > remrunqueue(best);
> > best->p_cpu = self;
> >
> >
--
Amit Kulkarni <[email protected]>