> On 30 Jul 2020, at 22:50, Julien Grall <jul...@xen.org> wrote:
>
> Hi Bertrand,
>
> To avoid extra work on your side, I would recommend waiting a bit before
> sending a new version. It would be good to at least settle the conversation
> in v2 regarding the approach taken.
Thanks for the advice.
I thought all discussions were done the first time we went through them on
the last thread, but there are apparently more points to go through.
>
> On 30/07/2020 11:24, Bertrand Marquis wrote:
>> At the moment on Arm, a Linux guest running with KPTI enabled will
>> cause the following error when a context switch happens in user mode:
>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>> The error is caused by the fact that, with KPTI enabled, the virtual
>> address of the runstate area registered by the guest is only accessible
>> while the guest is running in kernel space.
>> To solve this issue, this patch translates the virtual address to a
>> physical address during the hypercall and maps the required pages using
>> vmap. This removes the virtual-to-physical conversion during the context
>> switch, which solves the problem with KPTI.
>
> To echo what Jan said on the previous version, this is a change in a stable
> ABI and therefore may break existing guests. FAOD, I agree in principle with
> the idea. However, we want to explain why breaking the ABI is the *only*
> viable solution.
>
> From my understanding, it is not possible to fix this without an ABI breakage,
> because the hypervisor doesn't know when the guest will switch back from
> userspace to kernel space. The risk is that the information provided by the
> runstate area wouldn't be accurate, which could affect how the guest
> handles stolen time.
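To make the impact on stolen time concrete: a guest typically derives it from
the area with a lockless read loop along these lines (a minimal sketch based
only on the public vcpu_runstate_info layout and the XEN_RUNSTATE_UPDATE
protocol; the helper name and barriers are illustrative, not any particular
guest's code), so inaccurate contents feed straight into the guest's
scheduler accounting:

    /* Sketch of a guest-side stolen-time read; assumes the public
     * vcpu_runstate_info layout and the XEN_RUNSTATE_UPDATE protocol. */
    static uint64_t read_stolen_time(volatile struct vcpu_runstate_info *r)
    {
        uint64_t entry_time, stolen;

        do {
            entry_time = r->state_entry_time;
            smp_rmb();
            /* Time spent runnable-but-not-running or offline is stolen. */
            stolen = r->time[RUNSTATE_runnable] + r->time[RUNSTATE_offline];
            smp_rmb();
            /* Retry if Xen was updating the area, or updated it meanwhile. */
        } while ( (entry_time & XEN_RUNSTATE_UPDATE) ||
                  entry_time != r->state_entry_time );

        return stolen;
    }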
>
> Additionally there are a few issues with the current interface:
> 1) It assumes the virtual address cannot be re-used by the userspace.
> Thankfully Linux has a split address space, but this may change with KPTI
> in place.
> 2) When updating the page-tables, the guest has to go through an invalid
> mapping, so the translation may fail at any point.
>
> IOW, the existing interface can lead to random memory corruption and
> inaccuracy of the stolen time.
I agree, but I am not sure what you want me to do here.
Should I add more details in the commit message?
>
>> This is done only on the Arm architecture; the behaviour on x86 is not
>> modified by this patch and the address conversion is still done during
>> each context switch.
>> This introduces several limitations compared to the previous behaviour
>> (on Arm only):
>> - if the guest remaps the area at a different physical address, Xen
>> will continue to update the area at the previous physical address. As
>> the area is in kernel space and usually defined as a global variable,
>> this is believed not to happen. If a guest requires this, it will have
>> to call the hypercall again with the new area (even if it is at the
>> same virtual address).
>> - the area needs to be mapped at the time of the hypercall, for the
>> same reasons as the previous case, even if the area is registered for
>> a different vcpu. Registering an area whose virtual address is
>> unmapped is believed not to be done.
>
> It is not clear whether the virtual address refers to the current vCPU or
> the vCPU you register the runstate for. From the past discussion, I think you
> refer to the former. It would be good to clarify.
OK, I will try to clarify.
>
> Additionally, all the new restrictions should be documented in the public
> interface, so an OS developer can find the differences between the
> architectures.
>
> To answer Jan's concern, we certainly don't know all the existing guest OSes;
> however, we also need to balance that against the benefit for the large
> majority of users.
>
> From previous discussion, the current approach was deemed acceptable on
> Arm and, AFAICT, also on x86 (see [1]).
>
> TBH, I would rather see the approach be common. For that, we would need
> an agreement from Andrew and Jan on the approach here. Meanwhile, I think
> this is the best approach to address the concern from Arm users.
From this I get that you want me to document the specific behaviour on Arm in
the public header describing the hypercall, right?
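If so, would something along these lines in xen/include/public/vcpu.h match
what you have in mind (only a sketch of the wording, derived from the
limitations listed in the commit message)?

    /*
     * Arm-specific behaviour of VCPUOP_register_runstate_memory_area
     * (sketch, wording to be discussed):
     *  - The area is translated and mapped once, during the hypercall,
     *    using the page-tables of the vCPU making the call; the virtual
     *    address must therefore be mapped at that point.
     *  - The address must be 64-bit aligned (the alignment of
     *    struct vcpu_runstate_info).
     *  - Xen keeps updating the physical pages that backed the address at
     *    registration time; a guest remapping the virtual address to other
     *    physical pages must re-register the area.
     */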
Bertrand
>
>> Inline functions in headers could not be used, as the architecture
>> domain.h is included before the common domain.h, making it impossible
>> to use struct vcpu inside the architecture header.
>> This should not have any performance impact, as the hypercall is
>> usually only called once per vcpu.
>> Signed-off-by: Bertrand Marquis <bertrand.marq...@arm.com>
>> ---
>> Changes in v2
>> - use vmap to map the pages during the hypercall.
>> - reintroduce initial copy during hypercall.
>> Changes in v3
>> - Fix Coding style
>> - Fix vaddr printing on arm32
>> - use write_atomic to modify state_entry_time update bit (only
>> in guest structure as the bit is not used inside Xen copy)
>> ---
>> xen/arch/arm/domain.c | 161 ++++++++++++++++++++++++++++++-----
>> xen/arch/x86/domain.c | 29 ++++++-
>> xen/arch/x86/x86_64/domain.c | 4 +-
>> xen/common/domain.c | 19 ++---
>> xen/include/asm-arm/domain.h | 9 ++
>> xen/include/asm-x86/domain.h | 16 ++++
>> xen/include/xen/domain.h | 5 ++
>> xen/include/xen/sched.h | 16 +---
>> 8 files changed, 206 insertions(+), 53 deletions(-)
>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>> index 31169326b2..8b36946017 100644
>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -19,6 +19,7 @@
>> #include <xen/sched.h>
>> #include <xen/softirq.h>
>> #include <xen/wait.h>
>> +#include <xen/vmap.h>
>> #include <asm/alternative.h>
>> #include <asm/cpuerrata.h>
>> @@ -275,36 +276,156 @@ static void ctxt_switch_to(struct vcpu *n)
>> virt_timer_restore(n);
>> }
>> -/* Update per-VCPU guest runstate shared memory area (if registered). */
>> -static void update_runstate_area(struct vcpu *v)
>> +static void cleanup_runstate_vcpu_locked(struct vcpu *v)
>> {
>> - void __user *guest_handle = NULL;
>> + if ( v->arch.runstate_guest )
>> + {
>> + vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK));
>> +
>> + put_page(v->arch.runstate_guest_page[0]);
>> +
>> + if ( v->arch.runstate_guest_page[1] )
>> + put_page(v->arch.runstate_guest_page[1]);
>> +
>> + v->arch.runstate_guest = NULL;
>> + }
>> +}
>> +
>> +void arch_vcpu_cleanup_runstate(struct vcpu *v)
>> +{
>> + spin_lock(&v->arch.runstate_guest_lock);
>> +
>> + cleanup_runstate_vcpu_locked(v);
>> +
>> + spin_unlock(&v->arch.runstate_guest_lock);
>> +}
>> +
>> +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
>> +{
>> + unsigned int offset;
>> + mfn_t mfn[2];
>> + struct page_info *page;
>> + unsigned int numpages;
>> struct vcpu_runstate_info runstate;
>> + void *p;
>> - if ( guest_handle_is_null(runstate_guest(v)) )
>> - return;
>> + /* user can pass a NULL address to unregister a previous area */
>> + if ( vaddr == 0 )
>> + return 0;
>> +
>> + offset = vaddr & ~PAGE_MASK;
>> +
>> + /* provided address must be aligned to a 64bit */
>> + if ( offset % alignof(struct vcpu_runstate_info) )
>
> This new restriction wants to be explained in the commit message and public
> header.
>
>> + {
>> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>> + ": Invalid offset\n", vaddr);
>
> We usually enforce 80 characters per line, except for format strings, so
> that it is easier to grep them in the code.
>
>> + return -EINVAL;
>> + }
>> +
>> + page = get_page_from_gva(v, vaddr, GV2M_WRITE);
>> + if ( !page )
>> + {
>> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>> + ": Page is not mapped\n", vaddr);
>> + return -EINVAL;
>> + }
>> +
>> + mfn[0] = page_to_mfn(page);
>> + v->arch.runstate_guest_page[0] = page;
>> +
>> + if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) )
>> + {
>> + /* guest area is crossing pages */
>> + page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE);
>> + if ( !page )
>> + {
>> + put_page(v->arch.runstate_guest_page[0]);
>> + gprintk(XENLOG_WARNING,
>> + "Cannot map runstate pointer at 0x%"PRIvaddr
>> + ": 2nd Page is not mapped\n", vaddr);
>> + return -EINVAL;
>> + }
>> + mfn[1] = page_to_mfn(page);
>> + v->arch.runstate_guest_page[1] = page;
>> + numpages = 2;
>> + }
>> + else
>> + {
>> + v->arch.runstate_guest_page[1] = NULL;
>> + numpages = 1;
>> + }
>> - memcpy(&runstate, &v->runstate, sizeof(runstate));
>> + p = vmap(mfn, numpages);
>> + if ( !p )
>> + {
>> + put_page(v->arch.runstate_guest_page[0]);
>> + if ( numpages == 2 )
>> + put_page(v->arch.runstate_guest_page[1]);
>> - if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>> + ": vmap error\n", vaddr);
>> + return -EINVAL;
>> + }
>> +
>> + v->arch.runstate_guest = p + offset;
>> +
>> + if (v == current)
>> + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
>> + else
>> {
>> - guest_handle = &v->runstate_guest.p->state_entry_time + 1;
>> - guest_handle--;
>> - runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>> - __raw_copy_to_guest(guest_handle,
>> - (void *)(&runstate.state_entry_time + 1) - 1, 1);
>> - smp_wmb();
>> + vcpu_runstate_get(v, &runstate);
>> + memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate));
>> }
>> - __copy_to_guest(runstate_guest(v), &runstate, 1);
>> + return 0;
>> +}
>> +
>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>> + struct vcpu_register_runstate_memory_area area)
>> +{
>> + int rc;
>> +
>> + spin_lock(&v->arch.runstate_guest_lock);
>> +
>> + /* cleanup if we are recalled */
>> + cleanup_runstate_vcpu_locked(v);
>> +
>> + rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v);
>> +
>> + spin_unlock(&v->arch.runstate_guest_lock);
>> - if ( guest_handle )
>> + return rc;
>> +}
>> +
>> +
>> +/* Update per-VCPU guest runstate shared memory area (if registered). */
>> +static void update_runstate_area(struct vcpu *v)
>> +{
>> + spin_lock(&v->arch.runstate_guest_lock);
>> +
>> + if ( v->arch.runstate_guest )
>> {
>> - runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>> - smp_wmb();
>> - __raw_copy_to_guest(guest_handle,
>> - (void *)(&runstate.state_entry_time + 1) - 1, 1);
>> + if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> + {
>> + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>> + write_atomic(&(v->arch.runstate_guest->state_entry_time),
>> + v->runstate.state_entry_time);
>
> NIT: You want to indent v-> at the same level as the argument on the first
> line.
>
> Also, I think you are missing a smp_wmb() here.
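Right, I will add it. To be sure we agree, the intended sequence would be
roughly the following (a sketch only, using the names from this patch):

    /* Sketch of the intended update sequence when the runstate_update_flag
     * VM assist is enabled. */
    v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
    write_atomic(&v->arch.runstate_guest->state_entry_time,
                 v->runstate.state_entry_time);
    /* The guest must observe the UPDATE bit before any of the new values. */
    smp_wmb();

    memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));

    /* The copy must be visible before the UPDATE bit is cleared. */
    smp_wmb();
    v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
    write_atomic(&v->arch.runstate_guest->state_entry_time,
                 v->runstate.state_entry_time);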
>
>> + }
>> +
>> + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
>> +
>> + if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> + {
>> + /* copy must be done before switching the bit */
>> + smp_wmb();
>> + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>> + write_atomic(&(v->arch.runstate_guest->state_entry_time),
>> + v->runstate.state_entry_time);
>
> Same remark for the indentation.
>
>> + }
>> }
>> +
>> + spin_unlock(&v->arch.runstate_guest_lock);
>> }
>> static void schedule_tail(struct vcpu *prev)
>> @@ -560,6 +681,8 @@ int arch_vcpu_create(struct vcpu *v)
>> v->arch.saved_context.sp = (register_t)v->arch.cpu_info;
>> v->arch.saved_context.pc = (register_t)continue_new_vcpu;
>> + spin_lock_init(&v->arch.runstate_guest_lock);
>> +
>> /* Idle VCPUs don't need the rest of this setup */
>> if ( is_idle_vcpu(v) )
>> return rc;
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index fee6c3931a..b9b81e94e5 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1642,6 +1642,29 @@ void paravirt_ctxt_switch_to(struct vcpu *v)
>> wrmsr_tsc_aux(v->arch.msrs->tsc_aux);
>> }
>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>> + struct vcpu_register_runstate_memory_area area)
>> +{
>> + struct vcpu_runstate_info runstate;
>> +
>> + runstate_guest(v) = area.addr.h;
>> +
>> + if ( v == current )
>> + __copy_to_guest(runstate_guest(v), &v->runstate, 1);
>> + else
>> + {
>> + vcpu_runstate_get(v, &runstate);
>> + __copy_to_guest(runstate_guest(v), &runstate, 1);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +void arch_vcpu_cleanup_runstate(struct vcpu *v)
>> +{
>> + set_xen_guest_handle(runstate_guest(v), NULL);
>> +}
>> +
>> /* Update per-VCPU guest runstate shared memory area (if registered). */
>> bool update_runstate_area(struct vcpu *v)
>> {
>> @@ -1660,8 +1683,8 @@ bool update_runstate_area(struct vcpu *v)
>> if ( VM_ASSIST(v->domain, runstate_update_flag) )
>> {
>> guest_handle = has_32bit_shinfo(v->domain)
>> - ? &v->runstate_guest.compat.p->state_entry_time + 1
>> - : &v->runstate_guest.native.p->state_entry_time + 1;
>> + ? &v->arch.runstate_guest.compat.p->state_entry_time + 1
>> + : &v->arch.runstate_guest.native.p->state_entry_time + 1;
>> guest_handle--;
>> runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>> __raw_copy_to_guest(guest_handle,
>> @@ -1674,7 +1697,7 @@ bool update_runstate_area(struct vcpu *v)
>> struct compat_vcpu_runstate_info info;
>> XLAT_vcpu_runstate_info(&info, &runstate);
>> - __copy_to_guest(v->runstate_guest.compat, &info, 1);
>> + __copy_to_guest(v->arch.runstate_guest.compat, &info, 1);
>> rc = true;
>> }
>> else
>> diff --git a/xen/arch/x86/x86_64/domain.c b/xen/arch/x86/x86_64/domain.c
>> index c46dccc25a..b879e8dd2c 100644
>> --- a/xen/arch/x86/x86_64/domain.c
>> +++ b/xen/arch/x86/x86_64/domain.c
>> @@ -36,7 +36,7 @@ arch_compat_vcpu_op(
>> break;
>> rc = 0;
>> - guest_from_compat_handle(v->runstate_guest.compat, area.addr.h);
>> + guest_from_compat_handle(v->arch.runstate_guest.compat, area.addr.h);
>> if ( v == current )
>> {
>> @@ -49,7 +49,7 @@ arch_compat_vcpu_op(
>> vcpu_runstate_get(v, &runstate);
>> XLAT_vcpu_runstate_info(&info, &runstate);
>> }
>> - __copy_to_guest(v->runstate_guest.compat, &info, 1);
>> + __copy_to_guest(v->arch.runstate_guest.compat, &info, 1);
>> break;
>> }
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index f0f9c62feb..739c6b7b62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -727,7 +727,10 @@ int domain_kill(struct domain *d)
>> if ( cpupool_move_domain(d, cpupool0) )
>> return -ERESTART;
>> for_each_vcpu ( d, v )
>> + {
>> + arch_vcpu_cleanup_runstate(v);
>> unmap_vcpu_info(v);
>> + }
>> d->is_dying = DOMDYING_dead;
>> /* Mem event cleanup has to go here because the rings
>> * have to be put before we call put_domain. */
>> @@ -1167,7 +1170,7 @@ int domain_soft_reset(struct domain *d)
>> for_each_vcpu ( d, v )
>> {
>> - set_xen_guest_handle(runstate_guest(v), NULL);
>> + arch_vcpu_cleanup_runstate(v);
>> unmap_vcpu_info(v);
>> }
>> @@ -1494,7 +1497,6 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>> case VCPUOP_register_runstate_memory_area:
>> {
>> struct vcpu_register_runstate_memory_area area;
>> - struct vcpu_runstate_info runstate;
>> rc = -EFAULT;
>> if ( copy_from_guest(&area, arg, 1) )
>> @@ -1503,18 +1505,7 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>> if ( !guest_handle_okay(area.addr.h, 1) )
>> break;
>> - rc = 0;
>> - runstate_guest(v) = area.addr.h;
>> -
>> - if ( v == current )
>> - {
>> - __copy_to_guest(runstate_guest(v), &v->runstate, 1);
>> - }
>> - else
>> - {
>> - vcpu_runstate_get(v, &runstate);
>> - __copy_to_guest(runstate_guest(v), &runstate, 1);
>> - }
>> + rc = arch_vcpu_setup_runstate(v, area);
>> break;
>> }
>> diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
>> index 6819a3bf38..2f62c3e8f5 100644
>> --- a/xen/include/asm-arm/domain.h
>> +++ b/xen/include/asm-arm/domain.h
>> @@ -204,6 +204,15 @@ struct arch_vcpu
>> */
>> bool need_flush_to_ram;
>> + /* runstate guest lock */
>> + spinlock_t runstate_guest_lock;
>> +
>> + /* runstate guest info */
>> + struct vcpu_runstate_info *runstate_guest;
>> +
>> + /* runstate pages mapped for runstate_guest */
>> + struct page_info *runstate_guest_page[2];
>> +
>> } __cacheline_aligned;
>> void vcpu_show_execution_state(struct vcpu *);
>> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
>> index 635335634d..007ccfbf9f 100644
>> --- a/xen/include/asm-x86/domain.h
>> +++ b/xen/include/asm-x86/domain.h
>> @@ -11,6 +11,11 @@
>> #include <asm/x86_emulate.h>
>> #include <public/vcpu.h>
>> #include <public/hvm/hvm_info_table.h>
>> +#ifdef CONFIG_COMPAT
>> +#include <compat/vcpu.h>
>> +DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t);
>> +#endif
>> +
>> #define has_32bit_shinfo(d) ((d)->arch.has_32bit_shinfo)
>> @@ -638,6 +643,17 @@ struct arch_vcpu
>> struct {
>> bool next_interrupt_enabled;
>> } monitor;
>> +
>> +#ifndef CONFIG_COMPAT
>> +# define runstate_guest(v) ((v)->arch.runstate_guest)
>> + XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */
>> +#else
>> +# define runstate_guest(v) ((v)->arch.runstate_guest.native)
>> + union {
>> + XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
>> + XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
>> + } runstate_guest;
>> +#endif
>> };
>> struct guest_memory_policy
>> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
>> index 7e51d361de..5e8cbba31d 100644
>> --- a/xen/include/xen/domain.h
>> +++ b/xen/include/xen/domain.h
>> @@ -5,6 +5,7 @@
>> #include <xen/types.h>
>> #include <public/xen.h>
>> +#include <public/vcpu.h>
>> #include <asm/domain.h>
>> #include <asm/numa.h>
>> @@ -63,6 +64,10 @@ void arch_vcpu_destroy(struct vcpu *v);
>> int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset);
>> void unmap_vcpu_info(struct vcpu *v);
>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>> + struct vcpu_register_runstate_memory_area area);
>> +void arch_vcpu_cleanup_runstate(struct vcpu *v);
>> +
>> int arch_domain_create(struct domain *d,
>> struct xen_domctl_createdomain *config);
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index ac53519d7f..fac030fb83 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -29,11 +29,6 @@
>> #include <public/vcpu.h>
>> #include <public/event_channel.h>
>> -#ifdef CONFIG_COMPAT
>> -#include <compat/vcpu.h>
>> -DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t);
>> -#endif
>> -
>> /*
>> * Stats
>> *
>> @@ -166,16 +161,7 @@ struct vcpu
>> struct sched_unit *sched_unit;
>> struct vcpu_runstate_info runstate;
>> -#ifndef CONFIG_COMPAT
>> -# define runstate_guest(v) ((v)->runstate_guest)
>> - XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */
>> -#else
>> -# define runstate_guest(v) ((v)->runstate_guest.native)
>> - union {
>> - XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
>> - XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
>> - } runstate_guest; /* guest address */
>> -#endif
>> +
>> unsigned int new_state;
>> /* Has the FPU been initialised? */
>
> [1] <b6511a29-35a4-a1d0-dd29-7de4103ec...@citrix.com>
>
>
> --
> Julien Grall