On Fri, 16 Apr 2021, Julien Grall wrote:
> Hi Stefano,
> 
> On 13/04/2021 23:43, Stefano Stabellini wrote:
> > On Sat, 20 Mar 2021, Julien Grall wrote:
> > > On 20/03/2021 00:01, Stefano Stabellini wrote:
> > > > On Sat, 27 Feb 2021, Julien Grall wrote:
> > > > > (+ Dario and George)
> > > > > 
> > > > > Hi Stefano,
> > > > > 
> > > > > I have added Dario and George to get some inputs from the scheduling
> > > > > part.
> > > > > 
> > > > > On 27/02/2021 01:58, Stefano Stabellini wrote:
> > > > > > On Fri, 26 Feb 2021, Julien Grall wrote:
> > > > > > > From: Julien Grall <jgr...@amazon.com>
> > > > > > > 
> > > > > > > A vCPU can get scheduled as soon as _VPF_down is cleared. As there
> > > > > > > is
> > > > > > > currently not ordering guarantee in arch_set_info_guest(), it may
> > > > > > > be
> > > > > > > possible that flag can be observed cleared before the new values
> > > > > > > of
> > > > > > > vCPU
> > > > > > > registers are observed.
> > > > > > > 
> > > > > > > Add an smp_mb() before the flag is cleared to prevent re-ordering.
> > > > > > > 
> > > > > > > Signed-off-by: Julien Grall <jgr...@amazon.com>
> > > > > > > 
> > > > > > > ---
> > > > > > > 
> > > > > > > Barriers should work in pair. However, I am not entirely sure
> > > > > > > whether
> > > > > > > to
> > > > > > > put the other half. Maybe at the beginning of context_switch_to()?
> > > > > > 
> > > > > > It should be right after VGCF_online is set or cleared, right?
> > > > > 
> > > > > vcpu_guest_context_t is variable allocated on the heap just for the
> > > > > purpose of
> > > > > this call. So an ordering with VGFC_online is not going to do
> > > > > anything.
> > > > > 
> > > > > > So it
> > > > > > would be:
> > > > > > 
> > > > > > xen/arch/arm/domctl.c:arch_get_info_guest
> > > > > > xen/arch/arm/vpsci.c:do_common_cpu_on
> > > > > > 
> > > > > > But I think it is impossible that either of them get called at the
> > > > > > same
> > > > > > time as arch_set_info_guest, which makes me wonder if we actually
> > > > > > need
> > > > > > the barrier...
> > > > > arch_get_info_guest() is called without the domain lock held and I
> > > > > can't
> > > > > see
> > > > > any other lock that could prevent it to be called in // of
> > > > > arch_set_info_guest().
> > > > > 
> > > > > So you could technically get corrupted information from
> > > > > XEN_DOMCTL_getvcpucontext. For this case, we would want a smp_wmb()
> > > > > before
> > > > > writing to v->is_initialised. The corresponding read barrier would be
> > > > > in
> > > > > vcpu_pause() -> vcpu_sleep_sync() -> sync_vcpu_execstate().
> > > > > 
> > > > > But this is not the issue I was originally trying to solve. Currently,
> > > > > do_common_cpu_on() will roughly do:
> > > > > 
> > > > >    1) domain_lock(d)
> > > > > 
> > > > >    2) v->arch.sctlr = ...
> > > > >       v->arch.ttbr0 = ...
> > > > > 
> > > > >    3) clear_bit(_VFP_down, &v->pause_flags);
> > > > > 
> > > > >    4) domain_unlock(d)
> > > > > 
> > > > >    5) vcpu_wake(v);
> > > > > 
> > > > > If we had only one pCPU on the system, then we would only wake the
> > > > > vCPU in
> > > > > step 5. We would be fine in this situation. But that's not the
> > > > > interesting
> > > > > case.
> > > > > 
> > > > > If you add a second pCPU in the story, it may be possible to have
> > > > > vcpu_wake()
> > > > > happening in // (see more below). As there is no memory barrier, step
> > > > > 3
> > > > > may be
> > > > > observed before 2. So, assuming the vcpu is runnable, we could start
> > > > > to
> > > > > schedule a vCPU before any update to the registers (step 2) are
> > > > > observed.
> > > > > 
> > > > > This means that when context_switch_to() is called, we may end up to
> > > > > restore
> > > > > some old values.
> > > > > 
> > > > > Now the question is can vcpu_wake() be called in // from another pCPU?
> > > > > AFAICT,
> > > > > it would be only called if a given flag in v->pause_flags is cleared
> > > > > (e.g.
> > > > > _VFP_blocked). But can we rely on that?
> > > > > 
> > > > > Even if we can rely on it, v->pause_flags has other flags in it. I
> > > > > couldn't
> > > > > rule out that _VPF_down cannot be set at the same time as the other
> > > > > _VPF_*.
> > > > > 
> > > > > Therefore, I think a barrier is necessary to ensure the ordering.
> > > > > 
> > > > > Do you agree with this analysis?
> > > >    Yes, I think this makes sense. The corresponding barrier in the
> > > > scheduling code would have to be after reading _VPF_down and before
> > > > reading v->arch.sctlr, etc.
> > > > 
> > > > 
> > > > > > > The issues described here is also quite theoritical because there
> > > > > > > are
> > > > > > > hundreds of instructions executed between the time a vCPU is seen
> > > > > > > runnable and scheduled. But better be safe than sorry :).
> > > > > > > ---
> > > > > > >     xen/arch/arm/domain.c | 7 +++++++
> > > > > > >     1 file changed, 7 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> > > > > > > index bdd3d3e5b5d5..2b705e66be81 100644
> > > > > > > --- a/xen/arch/arm/domain.c
> > > > > > > +++ b/xen/arch/arm/domain.c
> > > > > > > @@ -914,7 +914,14 @@ int arch_set_info_guest(
> > > > > > >         v->is_initialised = 1;
> > > > > > >           if ( ctxt->flags & VGCF_online )
> > > > > > > +    {
> > > > > > > +        /*
> > > > > > > +         * The vCPU can be scheduled as soon as _VPF_down is
> > > > > > > cleared.
> > > > > > > +         * So clear the bit *after* the context was loaded.
> > > > > > > +         */
> > > > > > > +        smp_mb();
> > > > > 
> > > > >   From the discussion above, I would move this barrier before
> > > > > v->is_initialised.
> > > > > So we also take care of the issue with arch_get_info_guest().
> > > > > 
> > > > > This barrier also can be reduced to a smp_wmb() as we only expect an
> > > > > ordering
> > > > > between writes.
> > > > > 
> > > > > The barrier would be paired with the barrier in:
> > > > >      - sync_vcpu_execstate() in the case of
> > > > > arch_get_vcpu_info_guest().
> > > > >      - context_switch_to() in the case of scheduling (The exact
> > > > > barrier is
> > > > > TDB).
> > > > 
> > > > OK, this makes sense, but why before:
> > > > 
> > > >     v->is_initialised = 1;
> > > > 
> > > > instead of right after it? It is just v->pause_flags we care about,
> > > > right?
> > > 
> > > The issue I originally tried to address was a race with v->pause_flags.
> > > But I
> > > also discovered one with v->initialised while answering to your previous
> > > e-mail. This was only briefly mentioned so let me expand it.
> > > 
> > > A toolstack can take a snapshot of the vCPU context using
> > > XEN_DOMCTL_get_vcpucontext. The helper will bail out if v->is_initialized
> > > is
> > > 0.
> > > 
> > > If v->is_initialized is 1, it will temporarily pause the vCPU and then
> > > call
> > > arch_get_info_guest().
> > > 
> > > AFAICT, arch_get_info_guest() and arch_set_info_guest() (called from PSCI
> > > CPU
> > > on) can run concurrently.
> > > 
> > > If you put the barrier right after v->is_initialised, then a
> > > processor/compiler is allowed to re-order the write with what comes
> > > before.
> > > Therefore, the new value of v->is_initialised may be observed before
> > > v->arch.{sctlr, ttbr0,...}.
> > > 
> > > Hence, we need a barrier before setting v->is_initialized so the new value
> > > is
> > > observed *after* the changes to v->arch.{sctlr, ttbr0, ...) have been
> > > observed.
> > > 
> > > A single smp_wmb() barrier before v->is_initialized should be sufficient
> > > to
> > > cover the two problems discussed as I don't think we need to observe
> > > v->is_initialized *before* v->pause_flags.
> > 
> > I think your explanation is correct. However, don't we need a smp_rmb()
> > barrier after reading v->is_initialised in xen/common/domctl.c:do_domctl
> > ? That would be the barrier that pairs with smp_wmb in regards to
> > v->is_initialised.
> 
> There is already a smp_mb() in sync_vcpu_exec_state() which is called from
> vcpu_pause() -> vcpu_sleep_sync().

But that's too late, isn't? The v->is_initialized check is done before
the vcpu_pause() call. We might end up taking the wrong code path:

https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/domctl.c#L587
https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/domctl.c#L598



> I don't think we can ever remove the memory barrier in sync_vcpu_exec_state()
> because the vCPU paused may have run (or initialized) on a different pCPU. So
> I would like to rely on the barrier rather than adding an extra one (even
> thought it is not a fast path).
> 
> I am thinking to add a comment on top of vcpu_pause() to clarify that after
> the call, the vCPU context will be observed without extra synchronization
> required.

Given that an if.. goto is involved, even if both code paths lead to
good results, I think it would be best to have the smp_rmb() barrier
above after the first v->is_initialised read to make sure we take the
right path. Instead of having to make sure that even the "wrong" path
behaves correctly.

Reply via email to