On Sun, 2016-09-18 at 12:03 +0800, Wei Yang wrote:
> Dario,
> Took a look into your code, some questions as below:
Haha, here you are! :-D

> 1. vcpu_wake is split into two cases, the first case is "not in irq"
> and "irq
> enabled". Some reason for this classification? Maybe some background
> knowledge
> I need to have.
Mmm... yes. Well, probably that distinction should not be there, as it
is a (potential, benchmark will tell if that's the case) optimization.
And optimizations should be added later, with their own patch and patch
description, etc... But as I said in the other email, the code is not
even RFC. :-/

So, point is, the whole point of the patch series is to leave
interrupts enabled inside scheduling functions, including vcpu_wake()
(which is becoming _vcpu_wake()). In order to do that, we taking the
scheduler lock, we'll start using spin_lock() and vcpu_spin_lock(),
instead of vcpu_spinlock_irq[save](). But this also means that IRQs
can't be disabled when we do that, or:
 - we violate Xen's lock consistency rule (there's ASSERT()s checking
   that in debug builds)
 - we risk re-enabling interrupts at the wrong time.

So, if IRQs are disabled when vcpu_wake() is called, or we are in IRQ
context and not disabling them expose us to races, we want to go the
deferred way the patch series is implementing.

However, if vcpu_wake() is called with IRQ enabled and not from IRQ
context, none of the above risks stand, and we can avoid doing all the
dance required for deferred wakeup, and just call _vcpu_wake().

I'm saying this is a (potential) optimization because deferring the
wakeup has a cost, and this 'if' is trying to avoid paying such when
not strictly necessary.

But really, in a proper version of this series I should just always
defer the wakeup in this patch, and have another patch, at the end of
the series, where this attempt to avoid deferral is implemented, and
run benchmarks with and without said patch applied.

> 2. in vcpu_wake(), I see you raise def_vcpu_wake softirq when the
> list is
> empty. and then add the vcpu to the list. any reason for this
> sequence? BTW,
> at this moment vcpu_wake() holds the spin lock and
> vcpu_wake_deferred() should
> take the spin lock before continuing.
If the list is empty it means there is no one going through it and
waking up vcpus from there. This means I need to raise the softirq that
will cause someone to go check the list, so it will find the vcpu that
I'm inserting there right now.

OTOH, if the list is not empty, it means there is already someone going
through the list itself. This means I don't need to raise anything, as
the vcpu that I'm inserting in the list right now will be found by such
waker, and woken up.

This is for avoiding raising too many softirq and poking too many cpus
to run their deferred wakers, which would introduce contention issue on
the deferred list lock.

The reason why the check is before the insertion is pure cosmetics. I
can well insert first and check for list_is_singleton(). In the code
you see there, that does not matter, because the deferred list has it's
spinlock. But this will probably change if I get rid of that, making
the list lockless.

> In the last patch, I saw you introduce a flag wakeup_triggered, and
> then
> change the sequence a little. Why at this place, you want to change
> the
> sequence here?
It was another attempt of optimization. But again, this is all changing
with the list becoming lockless.

> 3. The main purpose of this series is to enable irq for "scheduler
> handlers".
> I am not sure the original initiative for this, while I guess this
> could
> increase the parallelism, right? And I think we have to make sure
> those
> functions are not re-entrant, otherwise they will face dead lock when
> locking
> on the same spin-lock.
Hopefully, I explained the purpose in my previous email.

I'm not sure what you mean here with 're-entrant'. The only reason why
we want IRQs to be disabled when taking a lock, is because we take that
lock from both inside and outside of IRQ context.

So, if we're sure that a lock is never taken from IRQ context, we can
avoid disabling IRQs. And this is what this is all about: the scheduler
lock is taken once from IRQ context, and this series is trying to get
rid of that one case.

> 4. This looks similar with what we try to achieve. While we try to
> give the
> vcpu to someone else, who we think may pick it, to wake it up, and in
> your
> implementation, you at this vcpu to its own list and wake it up some
> time
> later. 
It does looks similar (this is the reason why I mentioned it to you :-
)), although:
 - both the goal and the expectations are different,
 - I'm not really 'giving the vcpu to someone else', I'm _deferring_ 
   it, i.e., it is still going to happen on the same cpu where the 
   event happens, only a little later in time (when no longer in IRQ

> I guess next step you want to do is to make the wakeup_defer.list a
> lock-less list.
Yes, that would be the plan (and refreshing the series, of course).

> 5. Well, this may not relevant to your code directly. I found some
> difference
> between raise_softirq() and cpumask_raise_softirq(). The first one
> just set
> pending bit, while the second one will send IPI.

>  __runq_tickle() use the
> second one to send SCHEDULE softirq, this means other pcpu would
> react
> immediately?
I don't think there's much difference in observable behavior. Perhaps,
in cpumask_raise_softirq(), if the current cpu is in the mast, it would
be enough to set a pensing bit for it.

However, this is something very different again, between 4.1 and
upstream. In fact, check commits (not present in Xen 4.1):

 c47c316d99b3b570d0b "x86/HVM: batch vCPU wakeups"
 9a727a813e9b25003e4 "x86: suppress event check IPI to MWAITing CPUs"

So I can't really be sure or comment much on this.

<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

Xen-devel mailing list

Reply via email to