On Fri, 2016-09-16 at 10:49 +0800, Wei Yang wrote:
> On Wed, Sep 14, 2016 at 06:18:48PM +0200, Dario Faggioli wrote:
> > On Wed, 2016-09-14 at 18:44 +0800, Wei Yang wrote:
> > If the system is not overbooked, it's a bit strange that the
> > scheduler
> > is the bottleneck.
> I looked at the original data again. I don't see detailed data to
> describe the
> dom0 configuration.
I see. No collection of output of top and xentop in dom0 either then,
I'm guessing? :-/

> The exact user model is not accessible from our client. We guess
> their model
> looks like this.
>      +--------+     +-----------+         +-------------+
>      |Timer   | --->|Coordinator| ---+--->|Worker       |
>      +--------+     +-----------+    |    +-------------+
>                                      |
>                                      |    +-------------+
>                                      +--->|Worker       |
>                                      |    +-------------+
>                                      |
>                                      |    +-------------+
>                                      +--->|Worker       |
>                                           +-------------+
> One Coordinator would driver several workers based on a high
> resolution timer.
> Periodically, workers would be waked up by the coordinator. So at one
> moment,
> many workers would be waked up and would triggers the vcpu_wake() in
> xen.
It's not clear to me whether 'Coordinator' and 'Worker's are VMs, or if
the graph describes the workload run inside the (and if yes, which
ones) VMs... but that is not terribly important, after all.

> Not sure this would be a possible reason for the burst vcpu_wake()?
Well, there would be, at least potentially, a sensible number of vcpus
waking up, which indeed can make the runqueue locks of the various
pcpus contended.

But then again, if the system is not oversubscribed, I'd tend to think
it to be tolerable, and I'd expect the biggest problem to be the work-
stealing logic (considering the high number of pcpus), rather than the
duration of the critical sections within vcpu_wake().

> >  - pcpu 6 is eithe idle or it is running someone already (say d3v4)
> >    + if pcpu 6 is idle, we tickle (i.e., we raise SCHEDULE_SOFTIRQ)
> >      pcpu 6 itself, and we're done
> Ok, it looks the behavior differs from 4.1 and current upstream. The
> upstream
> just raise the softirq to pcpu6 when it is idle, while 4.1 would
> raise softirq
> to both pcpu6 and other idlers even pcpu6 is idle.
> I think current upstream is more cleaver.
I also think current upstream is a bit better, especially because it's
mostly me that made it look the way it does. :-D

But I was actually describing how 4.1 works. In fact, in 4.1, if pcpu 6
is idle (se the '//xxx xxx xxx' comments I'm adding to the code

    if ( new->pri > cur->pri )  //is true, so we put pcpu 6 in mask
        cpu_set(cpu, mask);
    if ( cur->pri > CSCHED_PRI_IDLE )  //is false!!
    if ( !cpus_empty(mask) ) //the mask contains pcpu 6 only
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

On the other hand, if pcpu 6 is not idle (and, sticking to the example
started in the last email, is running d3v4):

    if ( new->pri > cur->pri )  //depends from d2v2's prio and d3v4 prio's
        cpu_set(cpu, mask);
    if ( cur->pri > CSCHED_PRI_IDLE ) //is true, so let's see...
        if ( cpus_empty(prv->idlers) )  //is true *only* if there are no idle 
pcpu 6. Let's assume there are (i.e., let's assume this is false)
            cpumask_t idle_mask;

            cpus_and(idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpus_empty(idle_mask) )  //is true if there are idlers 
suitable for new (let's assume there are)
                if ( opt_tickle_one_idle ) //chosen on boot, default is true
                    this_cpu(last_tickle_cpu) = 
                        cycle_cpu(this_cpu(last_tickle_cpu), idle_mask);
                    cpu_set(this_cpu(last_tickle_cpu), mask);
            cpus_and(mask, mask, new->vcpu->cpu_affinity);
    if ( !cpus_empty(mask) ) //mask contains pcpu 6 if d2v2 prio's was higher, 
and also contains one idle pcpu
        cpumask_raise_softirq(mask, SCHEDULE_SOFTIRQ);

So, as I was saying, if pcpu 6 is idle, only pcpu 6 is tickled. If it's
not, at least one idler (if it exists) is tickled, and pcpu 6 is
tickled or not, depending or priorities.

> > And in fact, in more recent version of Xen, I made the code do
> > something very close to what you're suggesting. Here's the comment
> > that
> > you can find where all this logic is implemented (it's way more
> > complicated than this, and than the code in 4.1, because it takes
> > soft-
> > affinity into account).
> > 
> >     /*
> >      * If the pcpu is idle, or there are no idlers and the new
> >      * vcpu is a higher priority than the old vcpu, run it here.
> >      *
> >      * If there are idle cpus, first try to find one suitable to
> > run
> >      * new, so we can avoid preempting cur.  If we cannot find a
> >      * suitable idler on which to run new, run it here, but try to
> >      * find a suitable idler on which to run cur instead.
> >      */
> > 
> I like this idea.
Then update... Xen 4.2 or 4.3 should already contains it. :-P

> We found the schedule lock be a consuming point in our scenario, so
> the direct
> thought is try to avoid to grab it. Hmm... while our idea is not that
> sparkling.
Again, I can't say how sparkling it will reveal once implemented.
Architecturally, it doesn't sound much different from the status quo to
me, and so I'd say it's unlikely that it will solve your problem, but
this is something always very hard to tell.

And again, I'd personally spend some more time --even by temporarily
and hackishly instrumenting the code-- to understand better where the
bottleneck is.

> > Yes, but pcpu 4 releases pcpu's 6 lock right after having raised
> > the
> > softirq for pcpus 6 and 9. Most likely, when 6 and 9 will try to
> > grab
> > their own locks for running csched_schedule(), after having reacted
> > to
> > the IPI, etc., pcpu 4 will have released pcpu's 6 lock already.
> > 
> > BTW, you say that pcpu 9 is idle, but then that "it can't do his
> > job"
> > because pcpu 4 holds the lock of pcpu 6... I don't think I
> > understand
> > what you mean with this.
> After pcpu9 get the softirq from pcup4, it try to schedule and do
> load
> balance, in which would take pcpu6 schedule lock.
> As I thought previously, pcpu6 schedule lock is hold by pcpu4, so
> pcpu9 should
> wait until pcpu4 release it. This is what I mean "it can't do his job
> immediately".
Yeah, but...

> While as you mentioned, compared with pcpu9, pcpu4 would release
> pcpu6
> schedule lock earlier. :-)
... indeed that's what I think it happens 99% of the time. And this is
also why I'd tempted to think that the issue may be laying somewhere

E.g., something that has proven effective for others (and for which
I've got an unfinished and never submitted patch to upstream), is
keeping track of what pcpus actually have at least 1 vcpu in their
runqueues (i.e., at least 1 vcpus that is runnable and not running)
and, inside balance_load(), only consider those when looking for work
to steal.

<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

Xen-devel mailing list

Reply via email to