RE: [PATCH] kvm/vmx: EPTP switching test

2015-11-17 Thread Wang, Wei W
On 17/11/2015 18:18,  Paolo Bonzini wrote:
> On 17/11/2015 02:45, Zhang, Yang Z wrote:
> > We have a different version in hand which is using separate EPTP.
> 
> Can you say in advance what you are using EPTP switching for?  Offlist if
> necessary.

Hi Paolo,

We are using EPTP switching for a protected inter-VM communication design, as 
shown in the slides (#23) here: 
http://events.linuxfoundation.org/sites/events/files/slides/Jun_Nakajima_NFV_KVM%202015_final.pdf


Best,
Wei

Re: [PATCH 1/2] arm64: KVM: Fix AArch32 to AArch64 register mapping

2015-11-17 Thread Robin Murphy

Hi Marc,

On 16/11/15 10:28, Marc Zyngier wrote:

When running a 32bit guest under a 64bit hypervisor, the ARMv8
architecture defines a mapping of the 32bit registers in the 64bit
space. This includes banked registers that are being demultiplexed
over the 64bit ones.

On exception caused by an operation involving a 32bit register, the
HW exposes the register number in the ESR_EL2 register. It was so
far understood that SW had to compute which AArch64 register was
used (based on the current AArch32 mode and register number).

It turns out that I misinterpreted the ARM ARM, and the clue is in
D1.20.1: "For some exceptions, the exception syndrome given in the
ESR_ELx identifies one or more register numbers from the issued
instruction that generated the exception. Where the exception is
taken from an Exception level using AArch32 these register numbers
give the AArch64 view of the register."

Which means that the HW is already giving us the translated version,
and that we shouldn't try to interpret it at all (for example, doing
an MMIO operation from the IRQ mode using the LR register leads to
very unexpected behaviours).

The fix is thus not to perform a call to vcpu_reg32() at all from
vcpu_reg(), and use whatever register number is supplied directly.
The only case we need to find out about the mapping is when we
actively generate a register access, which only occurs when injecting
a fault in a guest.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/include/asm/kvm_emulate.h | 8 +---
  arch/arm64/kvm/inject_fault.c| 2 +-
  2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 17e92f0..3ca894e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -99,11 +99,13 @@ static inline void vcpu_set_thumb(struct kvm_vcpu *vcpu)
*vcpu_cpsr(vcpu) |= COMPAT_PSR_T_BIT;
  }

+/*
+ * vcpu_reg should always be passed a register number coming from a
+ * read of ESR_EL2. Otherwise, it may give the wrong result on AArch32
+ * with banked registers.
+ */
  static inline unsigned long *vcpu_reg(const struct kvm_vcpu *vcpu, u8 reg_num)
  {
-   if (vcpu_mode_is_32bit(vcpu))
-   return vcpu_reg32(vcpu, reg_num);
-
return (unsigned long *)&vcpu_gp_regs(vcpu)->regs.regs[reg_num];
  }

diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 85c5715..648112e 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -48,7 +48,7 @@ static void prepare_fault32(struct kvm_vcpu *vcpu, u32 mode, 
u32 vect_offset)

/* Note: These now point to the banked copies */
*vcpu_spsr(vcpu) = new_spsr_value;
-   *vcpu_reg(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;
+   *vcpu_reg32(vcpu, 14) = *vcpu_pc(vcpu) + return_offset;


To the best of my knowledge after picking through all the uses of 
vcpu_reg, particularly in the shared 32-bit code, this does seem to be 
the only one which involves a potentially-banked register number that 
didn't originally come from an ESR read, and thus needs translation.


Reviewed-by: Robin Murphy 

(unfortunately I don't have an actual test-case as it was already a 
third-hand report when I started trying to look into it).


Thanks for picking this up,
Robin.






--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/3] target-i386: reorganize TSC rate setting code

2015-11-17 Thread Eduardo Habkost
On Tue, Nov 17, 2015 at 01:20:38PM +0800, Haozhong Zhang wrote:
> Following two changes are made to the TSC rate setting code in
> kvm_arch_init_vcpu():
>  * The code is moved to a new function kvm_arch_set_tsc_khz().
>  * If setting user-specified TSC rate fails and the host TSC rate is
>inconsistent with the user-specified one, print a warning message.
> 
> Signed-off-by: Haozhong Zhang 

This matches what I was expecting, and now I see that we don't
even need the user_tsc_khz field.

> ---
>  target-i386/kvm.c | 45 ++---
>  1 file changed, 38 insertions(+), 7 deletions(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 9e4d27f..6a1acb4 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -524,6 +524,41 @@ static bool hyperv_enabled(X86CPU *cpu)
>  cpu->hyperv_runtime);
>  }
>  
> +/**
> + * Ask KVM to set vcpu's TSC rate to X86_CPU(cs)->env.tsc_khz.
> + *
> + * Returns: 0 if successful;
> + *  -ENOTSUP if KVM_CAP_TSC_CONTROL is unavailable;
> + *  -EIO if KVM_SET_TSC_KHZ fails.

If KVM_SET_TSC_KHZ fails, the error code will be useful to
understand what went wrong. It's better to return the error code
returned by KVM instead of -EIO.

> + */
> +static int kvm_arch_set_tsc_khz(CPUState *cs)
> +{
> +X86CPU *cpu = X86_CPU(cs);
> +CPUX86State *env = &cpu->env;
> +int has_tsc_control, r = 0;
> +
> +if (!env->tsc_khz) {
> +return 0;
> +}
> +
> +has_tsc_control = kvm_check_extension(cs->kvm_state, 
> KVM_CAP_TSC_CONTROL);
> +if (has_tsc_control) {
> +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> +}
> +
> +if (!has_tsc_control || r < 0) {
> +r = kvm_check_extension(cs->kvm_state, KVM_CAP_GET_TSC_KHZ) ?
> +kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) : -ENOTSUP;
> +if (r <= 0 || r != env->tsc_khz) {
> +error_report("warning: TSC frequency mismatch between "
> + "VM and host, and TSC scaling unavailable");
> +return has_tsc_control ? -EIO : -ENOTSUP;
> +}
> +}

What about:

r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL) ?
    kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz) :
    -ENOTSUP;
if (r < 0) {
    /* If KVM_SET_TSC_KHZ fails, it is an error only if the
     * current TSC frequency doesn't match the one we want.
     */
    int cur_freq = kvm_check_extension(cs->kvm_state, KVM_CAP_GET_TSC_KHZ) ?
                   kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) :
                   -ENOTSUP;
    if (cur_freq <= 0 || cur_freq != env->tsc_khz) {
        error_report("warning: TSC frequency mismatch between "
                     "VM and host, and TSC scaling unavailable");
        return r;
    }
}

return 0;

> +
> +return 0;
> +}
> +
>  static Error *invtsc_mig_blocker;
>  
>  #define KVM_MAX_CPUID_ENTRIES  100
> @@ -823,13 +858,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  return r;
>  }
>  
> -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> -if (r && env->tsc_khz) {
> -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> -if (r < 0) {
> -fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> -return r;
> -}
> +if (kvm_arch_set_tsc_khz(cs) == -EIO) {
> +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");

Now kvm_arch_set_tsc_khz() prints an error message, we can remove
this one.

> +return -EIO;

To keep the previous behavior without losing the error code
returned by KVM, this could be written as:

r = kvm_arch_set_tsc_khz(cs);
if (r < 0 && r != -ENOTSUP) {
return r;
}

But I believe the previous behavior was wrong. If tsc-freq was
explicitly set by the user, we should abort if
KVM_CAP_TSC_CONTROL is unavailable.

So I suggest we simply do this:

r = kvm_arch_set_tsc_khz(cs);
if (r < 0) {
return r;
}

(In case you implement this behavior change in the same patch,
please mention that in the commit message.)

-- 
Eduardo


Re: [PATCH v5 1/3] target-i386: fallback vcpu's TSC rate to value returned by KVM

2015-11-17 Thread Haozhong Zhang
On 11/17/15 11:14, Eduardo Habkost wrote:
> On Tue, Nov 17, 2015 at 01:20:37PM +0800, Haozhong Zhang wrote:
> > If no user-specified TSC rate is present, we will try to set
> > env->tsc_khz to the value returned by KVM_GET_TSC_KHZ.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  target-i386/kvm.c | 12 
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 2a9953b..9e4d27f 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -832,6 +832,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >  }
> >  }
> >  
> > +/*
> > + * If no user-specified TSC frequency is present, we will try to
> > + * set env->tsc_khz to the value used by KVM.
> > + */
> 
> If you send a new version of this series, please describe
> "why", not "what". We can see in the code that we are setting
> env->tsc_khz to the value used by KVM, but the comment needs to
> explain why.
>

I'll update comments in the next version.

Haozhong

> > +if (!env->tsc_khz) {
> > +r = kvm_check_extension(cs->kvm_state, KVM_CAP_GET_TSC_KHZ) ?
> > +kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) : -ENOTSUP;
> > +if (r > 0) {
> > +env->tsc_khz = r;
> > +}
> > +}
> > +
> >  if (has_xsave) {
> >  env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
> >  }
> > -- 
> > 2.4.8
> > 
> 
> -- 
> Eduardo


Re: [PATCH] kvm/vmx: EPTP switching test

2015-11-17 Thread Paolo Bonzini


On 17/11/2015 11:44, Wang, Wei W wrote:
> On 17/11/2015 18:18,  Paolo Bonzini wrote:
>> On 17/11/2015 02:45, Zhang, Yang Z wrote:
>>> We have a different version in hand which is using separate
>>> EPTP.
>> 
>> Can you say in advance what you are using EPTP switching for?
>> Offlist if necessary.
> 
> Hi Paolo,
> 
> We are using EPTP switching for a protected inter-VM communication
> design, as shown in the slides (#23) here:
> http://events.linuxfoundation.org/sites/events/files/slides/Jun_Nakajima_NFV_KVM%202015_final.pdf

[offlist, adding virt-intel-l...@redhat.com]

If the EPTP switch is only adding extra data pages (e.g. mapping another
guest's memory inside a PCI BAR), this can work.

However, slides 24 and 25 suggest that the executable pages change
between the two EPTP views, similar to Jun's KVM Forum 2014
presentation.  Michael and I explained in Seattle that this only works
if the guest is trusted.  I am a bit disappointed that Intel continued
developing this feature without contacting us or without urging us to
present our issues more formally.

I think I should make this very clear: I am not going to accept in KVM a
feature that requires the guest to be trusted.  A trusted guest kernel
may make sense for other applications of VMFUNC (e.g. McAfee memory
scan) but not for virtualization; if the guest is trusted, you don't
have virtualization anymore.

Michael and I are going to present our findings to Intel soon.  This
will hopefully clarify why the guest has to be trusted.  We will also
present possible extensions to VMFUNC that enable its usage with
untrusted guests, albeit only at CPL=0.

Asit Mallick is going to contact Jun about this so we can organize the
meeting.  Unfortunately it is going to be hard for everyone to attend
since we have people in Europe, US and China, but we will share the slides.

Thanks,

Paolo


Re: [PATCH 12/21] arm64: KVM: Implement fpsimd save/restore

2015-11-17 Thread Marc Zyngier
On 17/11/15 11:13, Steve Capper wrote:
> On 16 November 2015 at 13:11, Marc Zyngier  wrote:
>> Implement the fpsimd save restore, keeping the lazy part in
>> assembler (as returning to C would be overkill).
>>
>> Signed-off-by: Marc Zyngier 
>> ---
>>  arch/arm64/kvm/hyp/Makefile |  1 +
>>  arch/arm64/kvm/hyp/entry.S  | 32 +++-
>>  arch/arm64/kvm/hyp/fpsimd.S | 33 +
>>  arch/arm64/kvm/hyp/hyp.h|  3 +++
>>  arch/arm64/kvm/hyp/switch.c |  8 
>>  5 files changed, 76 insertions(+), 1 deletion(-)
>>  create mode 100644 arch/arm64/kvm/hyp/fpsimd.S
>>
>> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
>> index 9c11b0f..56238d0 100644
>> --- a/arch/arm64/kvm/hyp/Makefile
>> +++ b/arch/arm64/kvm/hyp/Makefile
>> @@ -9,3 +9,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
>>  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
>>  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
>>  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
>> +obj-$(CONFIG_KVM_ARM_HOST) += fpsimd.o
>> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
>> index 2c4449a..7552922 100644
>> --- a/arch/arm64/kvm/hyp/entry.S
>> +++ b/arch/arm64/kvm/hyp/entry.S
>> @@ -27,6 +27,7 @@
>>
>>  #define CPU_GP_REG_OFFSET(x)   (CPU_GP_REGS + x)
>>  #define CPU_XREG_OFFSET(x) CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
>> +#define CPU_SYSREG_OFFSET(x)   (CPU_SYSREGS + 8*x)
>>
>> .text
>> .pushsection .hyp.text, "ax"
>> @@ -152,4 +153,33 @@ ENTRY(__guest_exit)
>> ret
>>  ENDPROC(__guest_exit)
>>
>> -   /* Insert fault handling here */
>> +ENTRY(__fpsimd_guest_restore)
>> +   push x4, lr
>> +
>> +   mrs x2, cptr_el2
>> +   bic x2, x2, #CPTR_EL2_TFP
>> +   msr cptr_el2, x2
>> +   isb
>> +
>> +   mrs x3, tpidr_el2
>> +
>> +   ldr x0, [x3, #VCPU_HOST_CONTEXT]
>> +   kern_hyp_va x0
>> +   add x0, x0, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
>> +   bl  __fpsimd_save_state
>> +
>> +   add x2, x3, #VCPU_CONTEXT
>> +   add x0, x2, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
>> +   bl  __fpsimd_restore_state
>> +
>> +   mrs x1, hcr_el2
>> +   tbnz x1, #HCR_RW_SHIFT, 1f
>> +   ldr x4, [x2, #CPU_SYSREG_OFFSET(FPEXC32_EL2)]
>> +   msr fpexc32_el2, x4
>> +1:
>> +   pop x4, lr
>> +   pop x2, x3
>> +   pop x0, x1
>> +
>> +   eret
>> +ENDPROC(__fpsimd_guest_restore)
>> diff --git a/arch/arm64/kvm/hyp/fpsimd.S b/arch/arm64/kvm/hyp/fpsimd.S
>> new file mode 100644
>> index 000..da3f22c
>> --- /dev/null
>> +++ b/arch/arm64/kvm/hyp/fpsimd.S
>> @@ -0,0 +1,33 @@
>> +/*
>> + * Copyright (C) 2015 - ARM Ltd
>> + * Author: Marc Zyngier 
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see .
>> + */
>> +
>> +#include 
>> +
>> +#include 
>> +
>> +   .text
>> +   .pushsection .hyp.text, "ax"
>> +
>> +ENTRY(__fpsimd_save_state)
>> +   fpsimd_save x0, 1
>> +   ret
>> +ENDPROC(__fpsimd_save_state)
>> +
>> +ENTRY(__fpsimd_restore_state)
>> +   fpsimd_restore  x0, 1
>> +   ret
>> +ENDPROC(__fpsimd_restore_state)
>> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
>> index bf13238..240fb79 100644
>> --- a/arch/arm64/kvm/hyp/hyp.h
>> +++ b/arch/arm64/kvm/hyp/hyp.h
>> @@ -70,6 +70,9 @@ void __debug_clear_restore_state(struct kvm_vcpu *vcpu,
>>  struct kvm_guest_debug_arch *dbg,
>>  struct kvm_cpu_context *ctxt);
>>
>> +void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
>> +void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
>> +
>>  u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
>>
>>  #endif /* __ARM64_KVM_HYP_H__ */
>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index a3af81a..06d3e20 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -88,6 +88,7 @@ int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
>>  {
>> struct kvm_cpu_context *host_ctxt;
>> struct kvm_cpu_context *guest_ctxt;
>> +   bool fp_enabled;
>> u64 exit_code;
>>
>> vcpu = kern_hyp_va(vcpu);
>> @@ -117,6 +118,8 @@ int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
>> exit_code = 

Re: [PATCH 12/21] arm64: KVM: Implement fpsimd save/restore

2015-11-17 Thread Marc Zyngier
On 17/11/15 11:49, Steve Capper wrote:
> On 17 November 2015 at 11:25, Marc Zyngier  wrote:
>> On 17/11/15 11:13, Steve Capper wrote:
>>> On 16 November 2015 at 13:11, Marc Zyngier  wrote:
 Implement the fpsimd save restore, keeping the lazy part in
 assembler (as returning to C would be overkill).

 Signed-off-by: Marc Zyngier 
 ---
  arch/arm64/kvm/hyp/Makefile |  1 +
  arch/arm64/kvm/hyp/entry.S  | 32 +++-
  arch/arm64/kvm/hyp/fpsimd.S | 33 +
  arch/arm64/kvm/hyp/hyp.h|  3 +++
  arch/arm64/kvm/hyp/switch.c |  8 
  5 files changed, 76 insertions(+), 1 deletion(-)
  create mode 100644 arch/arm64/kvm/hyp/fpsimd.S

 diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
 index 9c11b0f..56238d0 100644
 --- a/arch/arm64/kvm/hyp/Makefile
 +++ b/arch/arm64/kvm/hyp/Makefile
 @@ -9,3 +9,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
 +obj-$(CONFIG_KVM_ARM_HOST) += fpsimd.o
 diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
 index 2c4449a..7552922 100644
 --- a/arch/arm64/kvm/hyp/entry.S
 +++ b/arch/arm64/kvm/hyp/entry.S
 @@ -27,6 +27,7 @@

  #define CPU_GP_REG_OFFSET(x)   (CPU_GP_REGS + x)
  #define CPU_XREG_OFFSET(x) CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
 +#define CPU_SYSREG_OFFSET(x)   (CPU_SYSREGS + 8*x)

 .text
 .pushsection .hyp.text, "ax"
 @@ -152,4 +153,33 @@ ENTRY(__guest_exit)
 ret
  ENDPROC(__guest_exit)

 -   /* Insert fault handling here */
 +ENTRY(__fpsimd_guest_restore)
 +   push x4, lr
 +
 +   mrs x2, cptr_el2
 +   bic x2, x2, #CPTR_EL2_TFP
 +   msr cptr_el2, x2
 +   isb
 +
 +   mrs x3, tpidr_el2
 +
 +   ldr x0, [x3, #VCPU_HOST_CONTEXT]
 +   kern_hyp_va x0
 +   add x0, x0, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
 +   bl  __fpsimd_save_state
 +
 +   add x2, x3, #VCPU_CONTEXT
 +   add x0, x2, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
 +   bl  __fpsimd_restore_state
 +
 +   mrs x1, hcr_el2
 +   tbnz x1, #HCR_RW_SHIFT, 1f
 +   ldr x4, [x2, #CPU_SYSREG_OFFSET(FPEXC32_EL2)]
 +   msr fpexc32_el2, x4
 +1:
 +   pop x4, lr
 +   pop x2, x3
 +   pop x0, x1
 +
 +   eret
 +ENDPROC(__fpsimd_guest_restore)
 diff --git a/arch/arm64/kvm/hyp/fpsimd.S b/arch/arm64/kvm/hyp/fpsimd.S
 new file mode 100644
 index 000..da3f22c
 --- /dev/null
 +++ b/arch/arm64/kvm/hyp/fpsimd.S
 @@ -0,0 +1,33 @@
 +/*
 + * Copyright (C) 2015 - ARM Ltd
 + * Author: Marc Zyngier 
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License version 2 as
 + * published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program.  If not, see .
 + */
 +
 +#include 
 +
 +#include 
 +
 +   .text
 +   .pushsection .hyp.text, "ax"
 +
 +ENTRY(__fpsimd_save_state)
 +   fpsimd_save x0, 1
 +   ret
 +ENDPROC(__fpsimd_save_state)
 +
 +ENTRY(__fpsimd_restore_state)
 +   fpsimd_restore  x0, 1
 +   ret
 +ENDPROC(__fpsimd_restore_state)
 diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
 index bf13238..240fb79 100644
 --- a/arch/arm64/kvm/hyp/hyp.h
 +++ b/arch/arm64/kvm/hyp/hyp.h
 @@ -70,6 +70,9 @@ void __debug_clear_restore_state(struct kvm_vcpu *vcpu,
  struct kvm_guest_debug_arch *dbg,
  struct kvm_cpu_context *ctxt);

 +void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
 +void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
 +
  u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context 
 *host_ctxt);

  #endif /* __ARM64_KVM_HYP_H__ */
 diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
 index a3af81a..06d3e20 100644
 --- a/arch/arm64/kvm/hyp/switch.c
 +++ 

Re: [PATCH v5 1/3] target-i386: fallback vcpu's TSC rate to value returned by KVM

2015-11-17 Thread Eduardo Habkost
On Tue, Nov 17, 2015 at 01:20:37PM +0800, Haozhong Zhang wrote:
> If no user-specified TSC rate is present, we will try to set
> env->tsc_khz to the value returned by KVM_GET_TSC_KHZ.
> 
> Signed-off-by: Haozhong Zhang 
> ---
>  target-i386/kvm.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 2a9953b..9e4d27f 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -832,6 +832,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  }
>  }
>  
> +/*
> + * If no user-specified TSC frequency is present, we will try to
> + * set env->tsc_khz to the value used by KVM.
> + */

If you send a new version of this series, please describe
"why", not "what". We can see in the code that we are setting
env->tsc_khz to the value used by KVM, but the comment needs to
explain why.

> +if (!env->tsc_khz) {
> +r = kvm_check_extension(cs->kvm_state, KVM_CAP_GET_TSC_KHZ) ?
> +kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) : -ENOTSUP;
> +if (r > 0) {
> +env->tsc_khz = r;
> +}
> +}
> +
>  if (has_xsave) {
>  env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>  }
> -- 
> 2.4.8
> 

-- 
Eduardo


Re: [kvm-unit-tests] x86: pkeys: add test for PKEYS

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 08:53, Huaitong Han wrote:
> Signed-off-by: Huaitong Han 
> 
> diff --git a/config/config-x86-common.mak b/config/config-x86-common.mak
> index c2f9908..2ef98cc 100644
> --- a/config/config-x86-common.mak
> +++ b/config/config-x86-common.mak
> @@ -36,7 +36,8 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat 
> \
> $(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
> $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
> $(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
> -   $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat
> +   $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
> +   $(TEST_DIR)/pku.flat

This needs to be in config-x86_64.mak, because the 32-bit version will
have EFER.LMA=0.  Otherwise looks good (and I've tested it with a
quick-and-dirty implementation of pkeys in QEMU).
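[Editor's note] A minimal sketch of the move Paolo is asking for, assuming the config layout kvm-unit-tests used at the time (variable names here are from memory and may differ):

```make
# config/config-x86-common.mak: remove pku.flat from tests-common

# config/config-x86_64.mak: list it with the other 64-bit-only tests
tests += $(TEST_DIR)/pku.flat
```

This keeps the test out of the 32-bit build, where EFER.LMA=0 and protection keys cannot be exercised.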

Paolo

>  ifdef API
>  tests-common += api/api-sample
> @@ -104,6 +105,8 @@ $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
>  
>  $(TEST_DIR)/smap.elf: $(cstart.o) $(TEST_DIR)/smap.o
>  
> +$(TEST_DIR)/pku.elf: $(cstart.o) $(TEST_DIR)/pku.o
> +
>  $(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o $(TEST_DIR)/vmx_tests.o
>  
>  $(TEST_DIR)/debug.elf: $(cstart.o) $(TEST_DIR)/debug.o
> diff --git a/lib/x86/processor.h b/lib/x86/processor.h
> index 7973879..f7aa5ec 100644
> --- a/lib/x86/processor.h
> +++ b/lib/x86/processor.h
> @@ -26,6 +26,7 @@
>  #define X86_CR4_PAE    0x00000020
>  #define X86_CR4_PCIDE  0x00020000
>  #define X86_CR4_SMAP   0x00200000
> +#define X86_CR4_PKE    0x00400000
>  
>  #define X86_IA32_EFER  0xc0000080
>  #define X86_EFER_LMA   (1UL << 8)
> diff --git a/x86/pku.c b/x86/pku.c
> new file mode 100644
> index 000..0e00b99
> --- /dev/null
> +++ b/x86/pku.c
> @@ -0,0 +1,161 @@
> +#include "libcflat.h"
> +#include "x86/desc.h"
> +#include "x86/processor.h"
> +#include "x86/vm.h"
> +#include "x86/msr.h"
> +
> +#define X86_FEATURE_PKU  3
> +#define CR0_WP_MASK  (1UL << 16)
> +#define PTE_PKEY_BIT 59
> +#define USER_BASE(1 << 24)
> +#define USER_VAR(v)  (*((__typeof__(&(v))) (((unsigned long)&(v)) + USER_BASE)))
> +
> +volatile int pf_count = 0;
> +volatile unsigned save;
> +volatile unsigned test;
> +
> +void set_cr0_wp(int wp)
> +{
> +unsigned long cr0 = read_cr0();
> +
> +cr0 &= ~CR0_WP_MASK;
> +if (wp)
> +cr0 |= CR0_WP_MASK;
> +write_cr0(cr0);
> +}
> +
> +static inline u32 read_pkru(void)
> +{
> +unsigned int eax, edx;
> +unsigned int ecx = 0;
> +unsigned int pkru;
> +
> +asm volatile(".byte 0x0f,0x01,0xee\n\t"
> + : "=a" (eax), "=d" (edx)
> + : "c" (ecx));
> +pkru = eax;
> +return pkru;
> +}
> +
> +static void write_pkru(u32 pkru)
> +{
> +unsigned int eax = pkru;
> +unsigned int ecx = 0;
> +unsigned int edx = 0;
> +
> +asm volatile(".byte 0x0f,0x01,0xef\n\t"
> +: : "a" (eax), "c" (ecx), "d" (edx));
> +}
> +
> +void do_pf_tss(unsigned long error_code)
> +{
> +pf_count++;
> +save = test;
> +write_pkru(0);
> +}
> +
> +extern void pf_tss(void);
> +
> +asm ("pf_tss: \n\t"
> +#ifdef __x86_64__
> +// no task on x86_64, save/restore caller-save regs
> +"push %rax; push %rcx; push %rdx; push %rsi; push %rdi\n"
> +"push %r8; push %r9; push %r10; push %r11\n"
> +#endif
> +"call do_pf_tss \n\t"
> +#ifdef __x86_64__
> +"pop %r11; pop %r10; pop %r9; pop %r8\n"
> +"pop %rdi; pop %rsi; pop %rdx; pop %rcx; pop %rax\n"
> +#endif
> +"add $"S", %"R "sp\n\t" // discard error code
> +"iret"W" \n\t"
> +"jmp pf_tss\n\t"
> +);
> +
> +static void init_test()
> +{
> +pf_count = 0;
> +
> +invlpg(&test);
> +invlpg(&USER_VAR(test));
> +write_pkru(0);
> +set_cr0_wp(0);
> +}
> +
> +int main(int ac, char **av)
> +{
> +unsigned long i;
> +unsigned int pkey = 0x2;
> +unsigned int pkru_ad = 0x10;
> +unsigned int pkru_wd = 0x20;
> +
> +if (!(cpuid_indexed(7, 0).c & (1 << X86_FEATURE_PKU))) {
> +printf("PKU not enabled, exiting\n");
> +exit(1);
> +}
> +
> +setup_vm();
> +setup_alt_stack();
> +set_intr_alt_stack(14, pf_tss);
> +wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_LMA);
> +
> +for (i = 0; i < USER_BASE; i += PAGE_SIZE) {
> +*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= ~PTE_USER;
> +*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) |= ((unsigned 
> long)pkey << PTE_PKEY_BIT);
> +invlpg((void *)i);
> +}
> +
> +for (i = USER_BASE; i < 2 * USER_BASE; i += PAGE_SIZE) {
> +*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= ~USER_BASE;
> +*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) |= ((unsigned 
> long)pkey << PTE_PKEY_BIT);
> +invlpg((void *)i);
> +}
> +
> +write_cr4(read_cr4() | X86_CR4_PKE);
> +write_cr3(read_cr3());
> +
> +init_test();
> +

Re: [kvm-unit-tests] x86: smap: add smap check to unittests.cfg

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 08:53, Huaitong Han wrote:
> Signed-off-by: Huaitong Han 
> 
> diff --git a/x86/unittests.cfg b/x86/unittests.cfg
> index 14e36a4..6d3dc89 100644
> --- a/x86/unittests.cfg
> +++ b/x86/unittests.cfg
> @@ -72,6 +72,10 @@ groups = vmexit
>  file = access.flat
>  arch = x86_64
>  
> +[smap]
> +file = smap.flat
> +extra_params = -cpu host
> +
>  [pku]
>  file = pku.flat
>  arch = x86_64
> 

Applied, thanks.

Paolo


Re: [PATCH v5 3/3] target-i386: add support to migrate vcpu's TSC rate

2015-11-17 Thread Eduardo Habkost
Hi,

On Tue, Nov 17, 2015 at 01:20:39PM +0800, Haozhong Zhang wrote:
> This patch enables migrating vcpu's TSC rate. If KVM on the destination
> machine supports TSC scaling, guest programs will observe a consistent
> TSC rate across the migration.
> 
> If TSC scaling is not supported on the destination machine, the
> migration will not be aborted and QEMU on the destination will not set
> vcpu's TSC rate to the migrated value.
> 
> If vcpu's TSC rate specified by CPU option 'tsc-freq' on the destination
> machine is inconsistent with the migrated TSC rate, the migration will
> be aborted.
> 
> For backwards compatibility, the migration of vcpu's TSC rate is
> disabled on pc-*-2.4 and older machine types.
> 
> Signed-off-by: Haozhong Zhang 

Now the logic in this patch (and the rest of the series) looks
good to me. All my suggestions are only related to code comments
and error handling:

[...]
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 6a1acb4..6856899 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -2384,6 +2384,10 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>  }
>  }
>  
> +if (level == KVM_PUT_FULL_STATE) {
> +kvm_arch_set_tsc_khz(cpu);

Please add a comment here indicating that errors are being
ignored, and explaining why.

> +}
> +
>  ret = kvm_getput_regs(x86_cpu, 1);
>  if (ret < 0) {
>  return ret;
> diff --git a/target-i386/machine.c b/target-i386/machine.c
> index a18e16e..3c5d24b 100644
> --- a/target-i386/machine.c
> +++ b/target-i386/machine.c
> @@ -331,6 +331,13 @@ static int cpu_post_load(void *opaque, int version_id)
>  CPUX86State *env = >env;
>  int i;
>  
> +if (env->tsc_khz && env->user_tsc_khz &&
> +env->tsc_khz != env->user_tsc_khz) {
> +fprintf(stderr, "Mismatch between user-specified TSC frequency and "
> +"migrated TSC frequency\n");

Please use error_report() instead of fprintf().

> +return -1;

Please return a valid -errno value. Other post_load functions
that implement sanity checks use -EINVAL (e.g.
global_state_post_load(), configuration_post_load()), so that's
probably what we should do here.

> +}
> +
>  /*
>   * Real mode guest segments register DPL should be zero.
>   * Older KVM version were setting it wrongly.
> @@ -775,6 +782,26 @@ static const VMStateDescription vmstate_xss = {
>  }
>  };
>  
> +static bool tsc_khz_needed(void *opaque)
> +{
> +X86CPU *cpu = opaque;
> +CPUX86State *env = >env;
> +MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> +PCMachineClass *pcmc = PC_MACHINE_CLASS(mc);
> +return env->tsc_khz && pcmc->save_tsc_khz;
> +}
> +
> +static const VMStateDescription vmstate_tsc_khz = {
> +.name = "cpu/tsc_khz",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.needed = tsc_khz_needed,
> +.fields = (VMStateField[]) {
> +VMSTATE_INT64(env.tsc_khz, X86CPU),
> +VMSTATE_END_OF_LIST()
> +}
> +};
> +
>  VMStateDescription vmstate_x86_cpu = {
>  .name = "cpu",
>  .version_id = 12,
> @@ -895,6 +922,7 @@ VMStateDescription vmstate_x86_cpu = {
>  _msr_hyperv_runtime,
>  _avx512,
>  _xss,
> +_tsc_khz,
>  NULL
>  }
>  };
> -- 
> 2.4.8
> 

-- 
Eduardo


Re: [PATCH v5 2/3] target-i386: reorganize TSC rate setting code

2015-11-17 Thread Haozhong Zhang
On 11/17/15 11:32, Eduardo Habkost wrote:
> On Tue, Nov 17, 2015 at 01:20:38PM +0800, Haozhong Zhang wrote:
> > Following two changes are made to the TSC rate setting code in
> > kvm_arch_init_vcpu():
> >  * The code is moved to a new function kvm_arch_set_tsc_khz().
> >  * If setting user-specified TSC rate fails and the host TSC rate is
> >inconsistent with the user-specified one, print a warning message.
> > 
> > Signed-off-by: Haozhong Zhang 
> 
> This matches what I was expecting, and now I see that we don't
> even need the user_tsc_khz field.
>

I guess you mean the user_tsc_khz field is not needed when setting TSC
rate. It's still needed in patch 3 to check if the migrated TSC rate
is consistent with the user-specified TSC rate (and of course it's not
in kvm_arch_set_tsc_khz()).

> > ---
> >  target-i386/kvm.c | 45 ++---
> >  1 file changed, 38 insertions(+), 7 deletions(-)
> > 
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 9e4d27f..6a1acb4 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -524,6 +524,41 @@ static bool hyperv_enabled(X86CPU *cpu)
> >  cpu->hyperv_runtime);
> >  }
> >  
> > +/**
> > + * Ask KVM to set vcpu's TSC rate to X86_CPU(cs)->env.tsc_khz.
> > + *
> > + * Returns: 0 if successful;
> > + *  -ENOTSUP if KVM_CAP_TSC_CONTROL is unavailable;
> > + *  -EIO if KVM_SET_TSC_KHZ fails.
> 
> If KVM_SET_TSC_KHZ fails, the error code will be useful to
> understand what went wrong. It's better to return the error code
> returned by KVM instead of -EIO.
>

Yes, I'll change in the next version.

> > + */
> > +static int kvm_arch_set_tsc_khz(CPUState *cs)
> > +{
> > +X86CPU *cpu = X86_CPU(cs);
> > +CPUX86State *env = &cpu->env;
> > +int has_tsc_control, r = 0;
> > +
> > +if (!env->tsc_khz) {
> > +return 0;
> > +}
> > +
> > +has_tsc_control = kvm_check_extension(cs->kvm_state, 
> > KVM_CAP_TSC_CONTROL);
> > +if (has_tsc_control) {
> > +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > +}
> > +
> > +if (!has_tsc_control || r < 0) {
> > +r = kvm_check_extension(cs->kvm_state, KVM_CAP_GET_TSC_KHZ) ?
> > +kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) : -ENOTSUP;
> > +if (r <= 0 || r != env->tsc_khz) {
> > +error_report("warning: TSC frequency mismatch between "
> > + "VM and host, and TSC scaling unavailable");
> > +return has_tsc_control ? -EIO : -ENOTSUP;
> > +}
> > +}
> 
> What about:
> 
> r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL) ?
> kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz) :
> -ENOTSUP;
> if (r < 0) {
> /* If KVM_SET_TSC_KHZ fails, it is an error only if the
>  * current TSC frequency doesn't match the one we want.
>  */
> int cur_freq = kvm_check_extension(cs->kvm_state, 
> KVM_CAP_GET_TSC_KHZ) ?
>kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ) :
>-ENOTSUP;
>if (cur_freq <= 0 || cur_freq != env->tsc_khz) {
>error_report("warning: TSC frequency mismatch between "
> "VM and host, and TSC scaling unavailable");
>return r;
>}
> }
> 
> return 0;
>

Yes, your suggestion is better.

> > +
> > +return 0;
> > +}
> > +
> >  static Error *invtsc_mig_blocker;
> >  
> >  #define KVM_MAX_CPUID_ENTRIES  100
> > @@ -823,13 +858,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >  return r;
> >  }
> >  
> > -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> > -if (r && env->tsc_khz) {
> > -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > -if (r < 0) {
> > -fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> > -return r;
> > -}
> > +if (kvm_arch_set_tsc_khz(cs) == -EIO) {
> > +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> 
> Now kvm_arch_set_tsc_khz() prints an error message, we can remove
> this one.

will remove in the next version.

> 
> > +return -EIO;
> 
> To keep the previous behavior without losing the error code
> returned by KVM, this could be written as:
> 
> r = kvm_arch_set_tsc_khz(cs);
> if (r < 0 && r != -ENOTSUP) {
> return r;
> }
> 
> But I believe the previous behavior was wrong. If tsc-freq was
> explicitly set by the user, we should abort if
> KVM_CAP_TSC_CONTROL is unavailable.
> 
> So I suggest we simply do this:
> 
> r = kvm_arch_set_tsc_khz(cs);
> if (r < 0) {
> return r;
> }
>

Yes, I also prefer this one. I'll change and update the commit message
in the next version.

Thanks,
Haozhong

> (In case you implement this behavior change in the same patch,
> please mention that in the commit message.)
> 
> -- 
> Eduardo
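The error-handling flow agreed on in this exchange reduces to a small decision function. The following standalone sketch models it with plain integers standing in for the KVM calls (set_ret models the KVM_SET_TSC_KHZ result, cur_khz the KVM_GET_TSC_KHZ result; the names and stub shape are illustrative, not QEMU code):

```c
#include <errno.h>

/*
 * Standalone model of the flow discussed above:
 *  - want_khz: the user-requested TSC rate (0 = none requested)
 *  - set_ret:  result of trying to set the rate (0 on success,
 *              -ENOTSUP when KVM_CAP_TSC_CONTROL is absent)
 *  - cur_khz:  the host's current rate (<= 0 when unavailable)
 */
static int model_set_tsc_khz(int want_khz, int set_ret, int cur_khz)
{
    if (!want_khz) {
        return 0;                 /* no rate requested: always fine */
    }
    if (set_ret < 0) {
        /* Setting failed: only an error if the host rate differs. */
        if (cur_khz <= 0 || cur_khz != want_khz) {
            return set_ret;
        }
    }
    return 0;
}
```

The key property, as discussed, is that a failed set is tolerated when the host TSC frequency already matches the requested one.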

Re: [PATCH v5 3/3] target-i386: add support to migrate vcpu's TSC rate

2015-11-17 Thread Haozhong Zhang
On 11/17/15 11:40, Eduardo Habkost wrote:
> Hi,
> 
> On Tue, Nov 17, 2015 at 01:20:39PM +0800, Haozhong Zhang wrote:
> > This patch enables migrating vcpu's TSC rate. If KVM on the destination
> > machine supports TSC scaling, guest programs will observe a consistent
> > TSC rate across the migration.
> > 
> > If TSC scaling is not supported on the destination machine, the
> > migration will not be aborted and QEMU on the destination will not set
> > vcpu's TSC rate to the migrated value.
> > 
> > If vcpu's TSC rate specified by CPU option 'tsc-freq' on the destination
> > machine is inconsistent with the migrated TSC rate, the migration will
> > be aborted.
> > 
> > For backwards compatibility, the migration of vcpu's TSC rate is
> > disabled on pc-*-2.4 and older machine types.
> > 
> > Signed-off-by: Haozhong Zhang 
> 
> Now the logic in this patch (and the rest of the series) looks
> good to me. All my suggestions are only related to code comments
> and error handling:
> 
> [...]
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 6a1acb4..6856899 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -2384,6 +2384,10 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
> >  }
> >  }
> >  
> > +if (level == KVM_PUT_FULL_STATE) {
> > +kvm_arch_set_tsc_khz(cpu);
> 
> Please add a comment here indicating that errors are being
> ignored, and explaining why.
>

will add

> > +}
> > +
> >  ret = kvm_getput_regs(x86_cpu, 1);
> >  if (ret < 0) {
> >  return ret;
> > diff --git a/target-i386/machine.c b/target-i386/machine.c
> > index a18e16e..3c5d24b 100644
> > --- a/target-i386/machine.c
> > +++ b/target-i386/machine.c
> > @@ -331,6 +331,13 @@ static int cpu_post_load(void *opaque, int version_id)
> >  CPUX86State *env = &cpu->env;
> >  int i;
> >  
> > +if (env->tsc_khz && env->user_tsc_khz &&
> > +env->tsc_khz != env->user_tsc_khz) {
> > +fprintf(stderr, "Mismatch between user-specified TSC frequency and 
> > "
> > +"migrated TSC frequency\n");
> 
> Please use error_report() instead of fprintf().
>

will change

> > +return -1;
> 
> Please return a valid -errno value. Other post_load functions
> that implement sanity checks use -EINVAL (e.g.
> global_state_post_load(), configuration_post_load()), so that's
> probably what we should do here.
>

will change

Thanks,
Haozhong
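With both review suggestions applied (error_report() and a -EINVAL return), the sanity check boils down to the following standalone model (the two integers stand in for env->tsc_khz and env->user_tsc_khz; this is an illustration, not the QEMU function):

```c
#include <errno.h>

/*
 * Model of the cpu_post_load() migration sanity check discussed
 * above: reject the incoming state when a user-specified TSC rate
 * exists and disagrees with the migrated one.
 */
static int model_tsc_post_load(long migrated_khz, long user_khz)
{
    if (migrated_khz && user_khz && migrated_khz != user_khz) {
        /* QEMU would call error_report() here */
        return -EINVAL;
    }
    return 0;
}
```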

> > +}
> > +
> >  /*
> >   * Real mode guest segments register DPL should be zero.
> >   * Older KVM version were setting it wrongly.
> > @@ -775,6 +782,26 @@ static const VMStateDescription vmstate_xss = {
> >  }
> >  };
> >  
> > +static bool tsc_khz_needed(void *opaque)
> > +{
> > +X86CPU *cpu = opaque;
> > +CPUX86State *env = &cpu->env;
> > +MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> > +PCMachineClass *pcmc = PC_MACHINE_CLASS(mc);
> > +return env->tsc_khz && pcmc->save_tsc_khz;
> > +}
> > +
> > +static const VMStateDescription vmstate_tsc_khz = {
> > +.name = "cpu/tsc_khz",
> > +.version_id = 1,
> > +.minimum_version_id = 1,
> > +.needed = tsc_khz_needed,
> > +.fields = (VMStateField[]) {
> > +VMSTATE_INT64(env.tsc_khz, X86CPU),
> > +VMSTATE_END_OF_LIST()
> > +}
> > +};
> > +
> >  VMStateDescription vmstate_x86_cpu = {
> >  .name = "cpu",
> >  .version_id = 12,
> > @@ -895,6 +922,7 @@ VMStateDescription vmstate_x86_cpu = {
> >  &vmstate_msr_hyperv_runtime,
> >  &vmstate_avx512,
> >  &vmstate_xss,
> > +&vmstate_tsc_khz,
> >  NULL
> >  }
> >  };
> > -- 
> > 2.4.8
> > 
> 
> -- 
> Eduardo


Re: [PATCH v5 2/3] target-i386: reorganize TSC rate setting code

2015-11-17 Thread Eduardo Habkost
On Tue, Nov 17, 2015 at 10:07:53PM +0800, Haozhong Zhang wrote:
> On 11/17/15 11:32, Eduardo Habkost wrote:
> > On Tue, Nov 17, 2015 at 01:20:38PM +0800, Haozhong Zhang wrote:
> > > The following two changes are made to the TSC rate setting code in
> > > kvm_arch_init_vcpu():
> > >  * The code is moved to a new function kvm_arch_set_tsc_khz().
> > >  * If setting user-specified TSC rate fails and the host TSC rate is
> > >inconsistent with the user-specified one, print a warning message.
> > > 
> > > Signed-off-by: Haozhong Zhang 
> > 
> > This matches what I was expecting, and now I see that we don't
> > even need the user_tsc_khz field.
> >
> 
> I guess you mean the user_tsc_khz field is not needed when setting TSC
> rate. It's still needed in patch 3 to check if the migrated TSC rate
> is consistent with the user-specified TSC rate (and of course it's not
> in kvm_arch_set_tsc_khz()).

Yes, I was looking only at the error-checking logic in
kvm_arch_init_vcpu(). Then I noticed user_tsc_khz was added for
the migration sanity check (which makes sense).

-- 
Eduardo


[PATCH v3 3/4] KVM: X86: Implementation of DEBUGCTLMSR is moved

2015-11-17 Thread Jian Zhou
Move the old implementation of DEBUGCTLMSR from x86.c to vmx.c

Signed-off-by: Jian Zhou 
Signed-off-by: Stephen He 
---
 arch/x86/kvm/x86.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a9a198..9133c48 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -185,6 +185,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu 
*vcpu)
vcpu->arch.apf.gfns[i] = ~0;
 }

+void kvm_lbr_init(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.lbr_status = 0;
+   vcpu->arch.lbr_used = 0;
+   vcpu->arch.lbr_msr.nr = 0;
+}
+
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
unsigned slot;
@@ -1947,18 +1954,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 1;
}
break;
-   case MSR_IA32_DEBUGCTLMSR:
-   if (!data) {
-   /* We support the non-activated case already */
-   break;
-   } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-   /* Values other than LBR and BTF are vendor-specific,
-  thus reserved and should throw a #GP */
-   return 1;
-   }
-   vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-   __func__, data);
-   break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -2180,11 +2175,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
-   case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
-   case MSR_IA32_LASTINTFROMIP:
-   case MSR_IA32_LASTINTTOIP:
case MSR_K8_SYSCFG:
case MSR_K8_TSEG_ADDR:
case MSR_K8_TSEG_MASK:
@@ -7376,6 +7368,8 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);

+   kvm_lbr_init(vcpu);
+
return 0;

 fail_free_mce_banks:
--
1.7.12.4




[PATCH v3 4/4] KVM: VMX: Implementation of DEBUGCTLMSR and LBRV

2015-11-17 Thread Jian Zhou
The new implementation of VMX DEBUGCTLMSR is moved here.
The MSR intercept bitmap and the arrays (to save/restore LBR MSRs)
in the kvm_vcpu_arch struct are used to support LBR virtualization,
and a kvm_intel module parameter is added to permanently disable LBRV.
Userspace can get/set the contents of the LBR MSRs, so migration is
supported.

Signed-off-by: Jian Zhou 
Signed-off-by: Stephen He 
---
 arch/x86/kvm/vmx.c | 335 +
 1 file changed, 335 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a8bc64..2d4e90b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -90,6 +90,9 @@ module_param(fasteoi, bool, S_IRUGO);
 static bool __read_mostly enable_apicv = 1;
 module_param(enable_apicv, bool, S_IRUGO);

+static bool __read_mostly lbrv = 1;
+module_param(lbrv, bool, S_IRUGO);
+
 static bool __read_mostly enable_shadow_vmcs = 1;
 module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO);
 /*
@@ -2617,6 +2620,256 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
return 0;
 }

+struct lbr_info {
+   u32 base;
+   u8 count;
+} pentium4_lbr[] = {
+   { MSR_PENTIUM4_LER_FROM_LIP,   1 },
+   { MSR_PENTIUM4_LER_TO_LIP, 1 },
+   { MSR_PENTIUM4_LBR_TOS,1 },
+   { MSR_LBR_PENTIUM4_FROM,   SIZE_PENTIUM4_LBR_STACK },
+   { MSR_LBR_PENTIUM4_TO, SIZE_PENTIUM4_LBR_STACK },
+   { 0, 0 }
+}, core2_lbr[] = {
+   { MSR_IA32_LASTINTFROMIP,   1 },
+   { MSR_IA32_LASTINTTOIP, 1 },
+   { MSR_LBR_TOS,  1 },
+   { MSR_LBR_CORE2_FROM,   SIZE_CORE2_LBR_STACK },
+   { MSR_LBR_CORE2_TO, SIZE_CORE2_LBR_STACK },
+   { 0, 0 }
+}, atom_lbr[] = {
+   { MSR_IA32_LASTINTFROMIP,   1 },
+   { MSR_IA32_LASTINTTOIP, 1 },
+   { MSR_LBR_TOS,  1 },
+   { MSR_LBR_ATOM_FROM,SIZE_ATOM_LBR_STACK },
+   { MSR_LBR_ATOM_TO,  SIZE_ATOM_LBR_STACK },
+   { 0, 0 }
+}, nehalem_lbr[] = {
+   { MSR_LBR_SELECT,   1 },
+   { MSR_IA32_LASTINTFROMIP,   1 },
+   { MSR_IA32_LASTINTTOIP, 1 },
+   { MSR_LBR_TOS,  1 },
+   { MSR_LBR_NHM_FROM, SIZE_NHM_LBR_STACK },
+   { MSR_LBR_NHM_TO,   SIZE_NHM_LBR_STACK },
+   { 0, 0 }
+}, skylake_lbr[] = {
+   { MSR_LBR_SELECT,   1 },
+   { MSR_IA32_LASTINTFROMIP,   1 },
+   { MSR_IA32_LASTINTTOIP, 1 },
+   { MSR_LBR_TOS,  1 },
+   { MSR_LBR_SKYLAKE_FROM, SIZE_SKYLAKE_LBR_STACK },
+   { MSR_LBR_SKYLAKE_TO,   SIZE_SKYLAKE_LBR_STACK },
+   { 0, 0}
+};
+
+static const struct lbr_info *last_branch_msr_get(u8 family, u8 model)
+{
+   if (family == 6)
+   {
+   switch (model)
+   {
+   case 15: /* 65nm Core2 "Merom" */
+   case 22: /* 65nm Core2 "Merom-L" */
+   case 23: /* 45nm Core2 "Penryn" */
   case 29: /* 45nm Core2 "Dunnington" (MP) */
+   return core2_lbr;
+   break;
+   case 28: /* 45nm Atom "Pineview" */
+   case 38: /* 45nm Atom "Lincroft" */
+   case 39: /* 32nm Atom "Penwell" */
+   case 53: /* 32nm Atom "Cloverview" */
+   case 54: /* 32nm Atom "Cedarview" */
+   case 55: /* 22nm Atom "Silvermont" */
+   case 76: /* 14nm Atom "Airmont" */
+   case 77: /* 22nm Atom "Silvermont Avoton/Rangely" */
+   return atom_lbr;
+   break;
+   case 30: /* 45nm Nehalem */
+   case 26: /* 45nm Nehalem-EP */
+   case 46: /* 45nm Nehalem-EX */
+   case 37: /* 32nm Westmere */
+   case 44: /* 32nm Westmere-EP */
+   case 47: /* 32nm Westmere-EX */
+   case 42: /* 32nm SandyBridge */
+   case 45: /* 32nm SandyBridge-E/EN/EP */
+   case 58: /* 22nm IvyBridge */
+   case 62: /* 22nm IvyBridge-EP/EX */
+   case 60: /* 22nm Haswell Core */
+   case 63: /* 22nm Haswell Server */
+   case 69: /* 22nm Haswell ULT */
+   case 70: /* 22nm Haswell + GT3e */
+   case 61: /* 14nm Broadwell Core-M */
+   case 86: /* 14nm Broadwell Xeon D */
+   case 71: /* 14nm Broadwell + GT3e */
+   case 79: /* 14nm Broadwell Server */
+   return nehalem_lbr;
+

[PATCH v3 1/4] KVM: X86: Names and addresses of LBR MSRs

2015-11-17 Thread Jian Zhou
Signed-off-by: Jian Zhou 
Signed-off-by: Stephen He 
---
 arch/x86/include/asm/msr-index.h | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b98b471..942af1f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -66,12 +66,34 @@
 #define MSR_TURBO_RATIO_LIMIT1 0x01ae
 #define MSR_TURBO_RATIO_LIMIT2 0x01af

+#define MSR_LBR_STATUS 0x00d6
 #define MSR_LBR_SELECT 0x01c8
 #define MSR_LBR_TOS0x01c9
+#define MSR_LBR_CORE_FROM  0x0040
+#define MSR_LBR_CORE_TO0x0060
+/* Pentium4/Xeon (based on NetBurst) LBR */
+#define MSR_PENTIUM4_LER_FROM_LIP  0x01d7
+#define MSR_PENTIUM4_LER_TO_LIP0x01d8
+#define MSR_PENTIUM4_LBR_TOS   0x01da
+#define MSR_LBR_PENTIUM4_FROM  0x0680
+#define MSR_LBR_PENTIUM4_TO0x06c0
+#define SIZE_PENTIUM4_LBR_STACK16
+/* Core2 LBR */
+#define MSR_LBR_CORE2_FROM MSR_LBR_CORE_FROM
+#define MSR_LBR_CORE2_TO   MSR_LBR_CORE_TO
+#define SIZE_CORE2_LBR_STACK   4
+/* Atom LBR */
+#define MSR_LBR_ATOM_FROM  MSR_LBR_CORE_FROM
+#define MSR_LBR_ATOM_TOMSR_LBR_CORE_TO
+#define SIZE_ATOM_LBR_STACK8
+/* Nehalem LBR */
 #define MSR_LBR_NHM_FROM   0x0680
 #define MSR_LBR_NHM_TO 0x06c0
-#define MSR_LBR_CORE_FROM  0x0040
-#define MSR_LBR_CORE_TO0x0060
+#define SIZE_NHM_LBR_STACK 16
+/* Skylake LBR */
+#define MSR_LBR_SKYLAKE_FROM   MSR_LBR_NHM_FROM
+#define MSR_LBR_SKYLAKE_TO MSR_LBR_NHM_TO
+#define SIZE_SKYLAKE_LBR_STACK 32

 #define MSR_LBR_INFO_0 0x0dc0 /* ... 0xddf for _31 */
 #define LBR_INFO_MISPRED   BIT_ULL(63)
--
1.7.12.4




[PATCH v3 0/4] KVM: VMX: enable LBR virtualization

2015-11-17 Thread Jian Zhou
With host CPU model SandyBridge and guest CPU model core2duo
(the addresses of the MSRs recording last-branch information
differ between these two CPUs), reading the contents of the
LBR MSRs in the guest gives incorrect results, because these
MSRs do not exist on the physical CPU.
The prerequisite to enable LBRV has changed in v3.
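The v3 prerequisite can be sketched as a standalone model: LBRV is permitted only when the host and guest CPU models resolve to the same LBR MSR layout. The two-entry table below is a stand-in for the much larger table in patch 4/4 (MSR addresses and stack sizes follow patch 1/4); the function names are illustrative, not the patch's API:

```c
#include <stddef.h>

struct lbr_layout {
    unsigned from_msr, to_msr, stack_size;
};

/* Core2: MSR_LBR_CORE_FROM/TO; Nehalem+: MSR_LBR_NHM_FROM/TO */
static const struct lbr_layout core2_layout   = { 0x040, 0x060, 4 };
static const struct lbr_layout nehalem_layout = { 0x680, 0x6c0, 16 };

static const struct lbr_layout *lookup_lbr(unsigned family, unsigned model)
{
    if (family != 6)
        return NULL;
    switch (model) {
    case 15: case 22: case 23: case 29:   /* Core2 */
        return &core2_layout;
    case 42: case 45: case 58: case 62:   /* SandyBridge / IvyBridge */
        return &nehalem_layout;
    default:
        return NULL;
    }
}

/* LBRV is allowed only when both lookups succeed and agree. */
static int lbrv_allowed(unsigned host_family, unsigned host_model,
                        unsigned guest_family, unsigned guest_model)
{
    const struct lbr_layout *h = lookup_lbr(host_family, host_model);
    const struct lbr_layout *g = lookup_lbr(guest_family, guest_model);
    return h != NULL && h == g;
}
```

With this check, the SandyBridge-host/core2duo-guest combination described above is rejected, since the two models map to different MSR layouts.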

Changelog in v3:
1) The new implementation of VMX DEBUGCTLMSR is moved
   from x86.c to vmx.c
2) LBRV depends on two prerequisites:
 - guest CPU model
 - numbers and addresses of MSRs about LBR MUST be the same
   between the host and the guest
3) Tweak the CPU table

Changelog in v2:
1) Move the implementation into vmx.c
2) Migraton is supported
3) Add arrays in kvm_vcpu_arch struct to save/restore
   LBR MSRs at vm exit/entry time.
4) Add a parameter of kvm_intel module to permanently
   disable LBRV
5) Table of supported CPUs is reorganized; LBRV
   can be enabled or not according to the guest CPUID

Jian Zhou (4):
  KVM: X86: Names and addresses of LBR MSRs
  KVM: X86: Add arrays to save/restore LBR MSRs
  KVM: X86: Implementation of DEBUGCTLMSR is moved
  KVM: VMX: Implementation of DEBUGCTLMSR and LBRV

 arch/x86/include/asm/kvm_host.h  |  22 ++-
 arch/x86/include/asm/msr-index.h |  26 ++-
 arch/x86/kvm/vmx.c   | 335 +++
 arch/x86/kvm/x86.c   |  24 ++-
 4 files changed, 384 insertions(+), 23 deletions(-)

--
1.7.12.4




[PATCH v3 2/4] KVM: X86: Add arrays to save/restore LBR MSRs

2015-11-17 Thread Jian Zhou
Add arrays in kvm_vcpu_arch struct to save/restore
host/guest LBR MSRs at vm exit/entry time.

Signed-off-by: Jian Zhou 
Signed-off-by: Stephen He 
---
 arch/x86/include/asm/kvm_host.h | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a36ee7..ac0c745 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -91,6 +91,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, 
int level)
 #define KVM_MAX_CPUID_ENTRIES 80
 #define KVM_NR_FIXED_MTRR_REGION 88
 #define KVM_NR_VAR_MTRR 8
+#define KVM_MAX_LBR_MSRS 128

 #define ASYNC_PF_PER_VCPU 64

@@ -376,6 +377,12 @@ struct kvm_vcpu_hv {
u64 hv_vapic;
 };

+struct msr_data {
+   bool host_initiated;
+   u32 index;
+   u64 data;
+};
+
 struct kvm_vcpu_arch {
/*
 * rip and regs accesses must go through
@@ -516,6 +523,15 @@ struct kvm_vcpu_arch {
unsigned long eff_db[KVM_NR_DB_REGS];
unsigned long guest_debug_dr7;

+   int lbr_status;
+   int lbr_used;
+
+   struct lbr_msr {
+   unsigned nr;
+   struct msr_data guest[KVM_MAX_LBR_MSRS];
+   struct msr_data host[KVM_MAX_LBR_MSRS];
+   } lbr_msr;
+
u64 mcg_cap;
u64 mcg_status;
u64 mcg_ctl;
@@ -728,12 +744,6 @@ struct kvm_vcpu_stat {

 struct x86_instruction_info;

-struct msr_data {
-   bool host_initiated;
-   u32 index;
-   u64 data;
-};
-
 struct kvm_lapic_irq {
u32 vector;
u16 delivery_mode;
--
1.7.12.4




Re: [PATCH v2 3/4] KVM: x86: request interrupt window when IRQ chip is split

2015-11-17 Thread Paolo Bonzini


On 17/11/2015 00:26, Matt Gingell wrote:
> Before this patch, we incorrectly enter the guest without requesting an
> interrupt window if the IRQ chip is split between user space and the
> kernel.
> 
> Because lapic_in_kernel no longer implies the PIC is in the kernel, this
> patch tests pic_in_kernel to determine whether an interrupt window
> should be requested when entering the guest.
> 
> If the APIC is in the kernel and we request an interrupt window the
> guest will return immediately. If the APIC is masked the guest will not
> not make forward progress and unmask it, leading to a loop when KVM
> reenters and requests again. This patch adds a check to ensure the APIC
> is ready to accept an interrupt before requesting a window.
> 
> Reviewed-by: Steve Rutherford 
> Signed-off-by: Matt Gingell 
> ---
>  arch/x86/kvm/x86.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c370eef..d57bdd9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6259,8 +6259,11 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm 
> *kvm,
>  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  {
>   int r;
> - bool req_int_win = !lapic_in_kernel(vcpu) &&
> - vcpu->run->request_interrupt_window;
> + bool req_int_win =
> + vcpu->run->request_interrupt_window &&
> + likely(!pic_in_kernel(vcpu->kvm)) &&
> + (!lapic_in_kernel(vcpu) || kvm_apic_accept_pic_intr(vcpu));
> +

This can be

bool req_int_win =
dm_request_for_irq_injection(vcpu) &&
(!lapic_in_kernel(vcpu) || kvm_apic_accept_pic_intr(vcpu));

I'll apply the patches and send them to Linus for 4.4-rc2, thanks!
These cleanups can go on top:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e16068bba51..0bca1ec199df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2661,12 +2661,24 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu 
*vcpu,
return 0;
 }
 
-static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu) {
+static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
+{
+   return (!lapic_in_kernel(vcpu) ||
+   kvm_apic_accept_pic_intr(vcpu));
+}
+
+/*
+ * if userspace requested an interrupt window, check that the
+ * interrupt window is open.
+ *
+ * No need to exit to userspace if we already have an interrupt queued.
+ */
+static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
+{
return kvm_arch_interrupt_allowed(vcpu) &&
!kvm_cpu_has_interrupt(vcpu) &&
!kvm_event_needs_reinjection(vcpu) &&
-   (!lapic_in_kernel(vcpu) ||
-kvm_apic_accept_pic_intr(vcpu));
+   kvm_cpu_accept_dm_intr(vcpu);
 }
 
 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
@@ -5817,12 +5829,6 @@ static int emulator_fix_hypercall(struct 
x86_emulate_ctxt *ctxt)
return emulator_write_emulated(ctxt, rip, instruction, 3, NULL);
 }
 
-/*
- * Check if userspace requested an interrupt window, and that the
- * interrupt window is open.
- *
- * No need to exit to userspace if we already have an interrupt queued.
- */
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
 {
return vcpu->run->request_interrupt_window &&
@@ -6253,9 +6259,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
int r;
bool req_int_win =
-   vcpu->run->request_interrupt_window &&
-   likely(!pic_in_kernel(vcpu->kvm)) &&
-   (!lapic_in_kernel(vcpu) || kvm_apic_accept_pic_intr(vcpu));
+   dm_request_for_irq_injection(vcpu) &&
+   kvm_cpu_accept_dm_intr(vcpu);
 
bool req_immediate_exit = false;
 

A couple questions, that can be fixed by separate patches:

- should kvm_cpu_accept_dm_intr check that pending_external_vector == -1?

- should kvm_vcpu_ioctl_interrupt then use kvm_cpu_accept_dm_intr
instead of checking pending_external_vector == -1 directly?

Paolo
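The predicate after this cleanup can be summarized as a pure truth-table model (the booleans stand in for the vcpu/run state read by dm_request_for_irq_injection() and kvm_cpu_accept_dm_intr(); this is an illustration, not kernel code):

```c
#include <stdbool.h>

/*
 * Model of req_int_win after the cleanup: userspace asked for an
 * interrupt window, the PIC is not emulated in the kernel, and the
 * (possibly in-kernel) APIC would accept a PIC interrupt.
 */
static bool model_req_int_win(bool userspace_requested, bool pic_in_kernel,
                              bool lapic_in_kernel, bool apic_accepts_pic)
{
    bool dm_request = userspace_requested && !pic_in_kernel;
    bool accept_dm_intr = !lapic_in_kernel || apic_accepts_pic;
    return dm_request && accept_dm_intr;
}
```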


Re: [PATCH V2 1/3] target-i386: add pkeys support for cpuid handling

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 08:52, Huaitong Han wrote:
> This patch adds pkeys support for cpuid handling.
> 
> Signed-off-by: Huaitong Han 
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 4d1b085..2ff73ee 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -264,6 +264,17 @@ static const char *cpuid_7_0_ebx_feature_name[] = {
>  NULL, NULL, "avx512pf", "avx512er", "avx512cd", NULL, NULL, NULL,
>  };
>  
> +static const char *cpuid_7_0_ecx_feature_name[] = {
> +NULL, NULL, "pku", "ospke",

These strings are at index 2 and 3, while PKU and OSPKE are respectively
bit 3 and 4 in CPUID[EAX=7,ECX=0].ECX.
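A table with the names at the bit positions cited here would look like the following sketch (designated initializers are used only for brevity; the real QEMU table spells out every NULL entry):

```c
#include <stddef.h>
#include <string.h>

/* "pku" at bit 3 and "ospke" at bit 4 of CPUID[EAX=7,ECX=0].ECX;
 * all other entries are implicitly NULL. */
const char *cpuid_7_0_ecx_feature_name_fixed[32] = {
    [3] = "pku",
    [4] = "ospke",
};
```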

Otherwise okay.  The other two patches are fine as well.

Paolo

> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +NULL, NULL, NULL, NULL,
> +};
> +
>  static const char *cpuid_apm_edx_feature_name[] = {
>  NULL, NULL, NULL, NULL,
>  NULL, NULL, NULL, NULL,
> @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = {
>CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
>CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
>CPUID_7_0_EBX_RDSEED */
> +#define TCG_7_0_ECX_FEATURES 0
>  #define TCG_APM_FEATURES 0
>  #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
>  
> @@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] 
> = {
>  .cpuid_reg = R_EBX,
>  .tcg_features = TCG_7_0_EBX_FEATURES,
>  },
> +[FEAT_7_0_ECX] = {
> +.feat_names = cpuid_7_0_ecx_feature_name,
> +.cpuid_eax = 7,
> +.cpuid_needs_ecx = true, .cpuid_ecx = 0,
> +.cpuid_reg = R_ECX,
> +.tcg_features = TCG_7_0_ECX_FEATURES,
> +},
>  [FEAT_8000_0007_EDX] = {
>  .feat_names = cpuid_apm_edx_feature_name,
>  .cpuid_eax = 0x8007,
> @@ -2401,7 +2420,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  if (count == 0) {
>  *eax = 0; /* Maximum ECX value for sub-leaves */
>  *ebx = env->features[FEAT_7_0_EBX]; /* Feature flags */
> -*ecx = 0; /* Reserved */
> +*ecx = env->features[FEAT_7_0_ECX]; /* Feature flags */
>  *edx = 0; /* Reserved */
>  } else {
>  *eax = 0;
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index ead2832..c2e7501 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -408,6 +408,7 @@ typedef enum FeatureWord {
>  FEAT_1_EDX, /* CPUID[1].EDX */
>  FEAT_1_ECX, /* CPUID[1].ECX */
>  FEAT_7_0_EBX,   /* CPUID[EAX=7,ECX=0].EBX */
> +FEAT_7_0_ECX,   /* CPUID[EAX=7,ECX=0].ECX */
>  FEAT_8000_0001_EDX, /* CPUID[8000_0001].EDX */
>  FEAT_8000_0001_ECX, /* CPUID[8000_0001].ECX */
>  FEAT_8000_0007_EDX, /* CPUID[8000_0007].EDX */
> @@ -576,6 +577,9 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
>  #define CPUID_7_0_EBX_AVX512ER (1U << 27) /* AVX-512 Exponential and 
> Reciprocal */
>  #define CPUID_7_0_EBX_AVX512CD (1U << 28) /* AVX-512 Conflict Detection */
>  
> +#define CPUID_7_0_ECX_PKU  (1U << 3)
> +#define CPUID_7_0_ECX_OSPKE(1U << 4)
> +
>  #define CPUID_XSAVE_XSAVEOPT   (1U << 0)
>  #define CPUID_XSAVE_XSAVEC (1U << 1)
>  #define CPUID_XSAVE_XGETBV1(1U << 2)
> 


Re: [PATCH 2/2] arm64: KVM: Add workaround for Cortex-A57 erratum 834220

2015-11-17 Thread Will Deacon
Hi Marc,

On Mon, Nov 16, 2015 at 10:28:18AM +, Marc Zyngier wrote:
> Cortex-A57 parts up to r1p2 can misreport Stage 2 translation faults
> when a Stage 1 permission fault or device alignment fault should
> have been reported.
> 
> This patch implements the workaround (which is to validate that the
> Stage-1 translation actually succeeds) by using code patching.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/Kconfig  | 21 +
>  arch/arm64/include/asm/cpufeature.h |  3 ++-
>  arch/arm64/kernel/cpu_errata.c  |  9 +
>  arch/arm64/kvm/hyp.S|  6 ++
>  4 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 9ac16a4..746d985 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -316,6 +316,27 @@ config ARM64_ERRATUM_832075
>  
> If unsure, say Y.
>  
> +config ARM64_ERRATUM_834220
> + bool "Cortex-A57: 834220: Stage 2 translation fault might be 
> incorrectly reported in presence of a Stage 1 fault"
> + depends on KVM
> + default y
> + help
> +   This option adds an alternative code sequence to work around ARM
> +   erratum 834220 on Cortex-A57 parts up to r1p2.
> +
> +   Affected Cortex-A57 parts might report a Stage 2 translation
> +   fault as a the result of a Stage 1 fault for load crossing a

s/as a the/as the/
s/for load/for a load/

> +   page boundary when there is a permission or device memory
> +   alignment fault at Stage 1 and a translation fault at Stage 2.
> +
> +   The workaround is to verify that the Stage-1 translation

Consistency between "Stage 1" and "Stage-1".

> +   doesn't generate a fault before handling the Stage-2 fault.

Same here.

> +   Please note that this does not necessarily enable the workaround,
> +   as it depends on the alternative framework, which will only patch
> +   the kernel if an affected CPU is detected.
> +
> +   If unsure, say Y.
> +
>  config ARM64_ERRATUM_845719
>   bool "Cortex-A53: 845719: a load might read incorrect data"
>   depends on COMPAT
> diff --git a/arch/arm64/include/asm/cpufeature.h 
> b/arch/arm64/include/asm/cpufeature.h
> index 11d5bb0f..52722ee 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -29,8 +29,9 @@
>  #define ARM64_HAS_PAN4
>  #define ARM64_HAS_LSE_ATOMICS5
>  #define ARM64_WORKAROUND_CAVIUM_231546
> +#define ARM64_WORKAROUND_834220  7
>  
> -#define ARM64_NCAPS  7
> +#define ARM64_NCAPS  8
>  
>  #ifndef __ASSEMBLY__
>  
> diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
> index 24926f2..feb6b4e 100644
> --- a/arch/arm64/kernel/cpu_errata.c
> +++ b/arch/arm64/kernel/cpu_errata.c
> @@ -75,6 +75,15 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
>  (1 << MIDR_VARIANT_SHIFT) | 2),
>   },
>  #endif
> +#ifdef CONFIG_ARM64_ERRATUM_834220
> + {
> + /* Cortex-A57 r0p0 - r1p2 */
> + .desc = "ARM erratum 834220",
> + .capability = ARM64_WORKAROUND_834220,
> + MIDR_RANGE(MIDR_CORTEX_A57, 0x00,
> +(1 << MIDR_VARIANT_SHIFT) | 2),
> + },
> +#endif
>  #ifdef CONFIG_ARM64_ERRATUM_845719
>   {
>   /* Cortex-A53 r0p[01234] */
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index 1599701..ff2e038 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -1015,9 +1015,15 @@ el1_trap:
>   b.ne1f  // Not an abort we care about
>  
>   /* This is an abort. Check for permission fault */
> +alternative_if_not ARM64_WORKAROUND_834220
>   and x2, x1, #ESR_ELx_FSC_TYPE
>   cmp x2, #FSC_PERM
>   b.ne1f  // Not a permission fault
> +alternative_else
> + nop // Use the permission fault path to
> + nop // check for a valid S1 translation,
> + nop // regardless of the ESR value.
> +alternative_endif

With the cosmetic changes:

  Reviewed-by: Will Deacon 

Can you cc stable as well, please?

Will


Re: [PATCH 12/21] arm64: KVM: Implement fpsimd save/restore

2015-11-17 Thread Steve Capper
On 16 November 2015 at 13:11, Marc Zyngier  wrote:
> Implement the fpsimd save/restore, keeping the lazy part in
> assembler (as returning to C would be overkill).
>
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/hyp/Makefile |  1 +
>  arch/arm64/kvm/hyp/entry.S  | 32 +++-
>  arch/arm64/kvm/hyp/fpsimd.S | 33 +
>  arch/arm64/kvm/hyp/hyp.h|  3 +++
>  arch/arm64/kvm/hyp/switch.c |  8 
>  5 files changed, 76 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/hyp/fpsimd.S
>
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 9c11b0f..56238d0 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += sysreg-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += debug-sr.o
>  obj-$(CONFIG_KVM_ARM_HOST) += entry.o
>  obj-$(CONFIG_KVM_ARM_HOST) += switch.o
> +obj-$(CONFIG_KVM_ARM_HOST) += fpsimd.o
> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> index 2c4449a..7552922 100644
> --- a/arch/arm64/kvm/hyp/entry.S
> +++ b/arch/arm64/kvm/hyp/entry.S
> @@ -27,6 +27,7 @@
>
>  #define CPU_GP_REG_OFFSET(x)   (CPU_GP_REGS + x)
>  #define CPU_XREG_OFFSET(x) CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> +#define CPU_SYSREG_OFFSET(x)   (CPU_SYSREGS + 8*x)
>
> .text
> .pushsection.hyp.text, "ax"
> @@ -152,4 +153,33 @@ ENTRY(__guest_exit)
> ret
>  ENDPROC(__guest_exit)
>
> -   /* Insert fault handling here */
> +ENTRY(__fpsimd_guest_restore)
> +   pushx4, lr
> +
> +   mrs x2, cptr_el2
> +   bic x2, x2, #CPTR_EL2_TFP
> +   msr cptr_el2, x2
> +   isb
> +
> +   mrs x3, tpidr_el2
> +
> +   ldr x0, [x3, #VCPU_HOST_CONTEXT]
> +   kern_hyp_va x0
> +   add x0, x0, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
> +   bl  __fpsimd_save_state
> +
> +   add x2, x3, #VCPU_CONTEXT
> +   add x0, x2, #CPU_GP_REG_OFFSET(CPU_FP_REGS)
> +   bl  __fpsimd_restore_state
> +
> +   mrs x1, hcr_el2
> +   tbnzx1, #HCR_RW_SHIFT, 1f
> +   ldr x4, [x2, #CPU_SYSREG_OFFSET(FPEXC32_EL2)]
> +   msr fpexc32_el2, x4
> +1:
> +   pop x4, lr
> +   pop x2, x3
> +   pop x0, x1
> +
> +   eret
> +ENDPROC(__fpsimd_guest_restore)
> diff --git a/arch/arm64/kvm/hyp/fpsimd.S b/arch/arm64/kvm/hyp/fpsimd.S
> new file mode 100644
> index 000..da3f22c
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/fpsimd.S
> @@ -0,0 +1,33 @@
> +/*
> + * Copyright (C) 2015 - ARM Ltd
> + * Author: Marc Zyngier 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +
> +#include 
> +
> +   .text
> +   .pushsection.hyp.text, "ax"
> +
> +ENTRY(__fpsimd_save_state)
> +   fpsimd_save x0, 1
> +   ret
> +ENDPROC(__fpsimd_save_state)
> +
> +ENTRY(__fpsimd_restore_state)
> +   fpsimd_restore  x0, 1
> +   ret
> +ENDPROC(__fpsimd_restore_state)
> diff --git a/arch/arm64/kvm/hyp/hyp.h b/arch/arm64/kvm/hyp/hyp.h
> index bf13238..240fb79 100644
> --- a/arch/arm64/kvm/hyp/hyp.h
> +++ b/arch/arm64/kvm/hyp/hyp.h
> @@ -70,6 +70,9 @@ void __debug_clear_restore_state(struct kvm_vcpu *vcpu,
>  struct kvm_guest_debug_arch *dbg,
>  struct kvm_cpu_context *ctxt);
>
> +void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
> +void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
> +
>  u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
>
>  #endif /* __ARM64_KVM_HYP_H__ */
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index a3af81a..06d3e20 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -88,6 +88,7 @@ int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
>  {
> struct kvm_cpu_context *host_ctxt;
> struct kvm_cpu_context *guest_ctxt;
> +   bool fp_enabled;
> u64 exit_code;
>
> vcpu = kern_hyp_va(vcpu);
> @@ -117,6 +118,8 @@ int __hyp_text __guest_run(struct kvm_vcpu *vcpu)
> exit_code = __guest_enter(vcpu, host_ctxt);
> /* And we're baaack! */
>
> +   fp_enabled = !!(read_sysreg(cptr_el2) & CPTR_EL2_TFP);

Should this not be a single logical not?
If 

Re: [PATCH 12/21] arm64: KVM: Implement fpsimd save/restore

2015-11-17 Thread Steve Capper
On 17 November 2015 at 11:25, Marc Zyngier  wrote:
> On 17/11/15 11:13, Steve Capper wrote:
>> On 16 November 2015 at 13:11, Marc Zyngier  wrote:
>>> Implement the fpsimd save restore, keeping the lazy part in
>>> assembler (as returning to C would be overkill).
>>>
>>> Signed-off-by: Marc Zyngier 

Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 03:51, Takuya Yoshikawa wrote:
> What kvm_mmu_mark_parents_unsync() does is:
> 
>   for each p_i in sp->parent_ptes rmap chain
> mark_unsync(p_i);
> 
> Then, mark_unsync() finds the parent sp including that p_i to
> set ->unsync_child_bitmap and increment ->unsync_children if
> necessary.  It may also call kvm_mmu_mark_parents_unsync()
> recursively.

I agree.  sp->parent_ptes goes up one level from sp;
kvm_mmu_mark_parents_unsync(sp) visits the level above sp, while
mark_unsync(sp) visits sp and all the levels above it.

Calling mark_unsync(parent_pte) is enough to complete the visit, after
kvm_mmu_mark_parents_unsync has already processed the level above sp.
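The combined walk described above can be sketched with a toy model (illustrative names and structures only, not KVM's actual kvm_mmu_page or rmap code):

```c
#define MAX_PARENTS 4

/* Toy shadow page: tracks its parents one level up, plus which of its
 * children it has marked unsync (standing in for unsync_child_bitmap
 * and the unsync_children counter). */
struct sp {
    struct sp *parents[MAX_PARENTS];
    int nparents;
    const struct sp *unsync_child[8];
    int unsync_children;
};

static int is_marked(const struct sp *p, const struct sp *child)
{
    for (int i = 0; i < p->unsync_children; i++)
        if (p->unsync_child[i] == child)
            return 1;
    return 0;
}

/* Combined effect of kvm_mmu_mark_parents_unsync() + mark_unsync():
 * mark sp unsync in every parent one level up, then keep walking
 * upward, terminating wherever the mark is already set. */
static void mark_parents_unsync(struct sp *sp)
{
    for (int i = 0; i < sp->nparents; i++) {
        struct sp *p = sp->parents[i];

        if (is_marked(p, sp))
            continue;       /* already visited: recursion stops here */
        p->unsync_child[p->unsync_children++] = sp;
        mark_parents_unsync(p);
    }
}
```

The early-out on an already-set mark is what makes repeated calls cheap: a second pass over the same chain touches nothing.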

Paolo


Re: [PATCH v8 0/5] implement vNVDIMM

2015-11-17 Thread Xiao Guangrong


Ping...

Do you have any comments on this patchset? Could it be applied somewhere
if it looks okay to you?

Thanks!

On 11/16/2015 06:50 PM, Xiao Guangrong wrote:

This patchset can be found at:
   https://github.com/xiaogr/qemu.git nvdimm-v8

It is based on pci branch on Michael's tree and the top commit is:
commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts).

Changelog in v8:
We have split the long patch series into smaller parts; as you can see, this
is the first part, which enables NVDIMM without label data support.

The command line has changed because some simplifying patches are not
included in this series; you should specify the file size exactly, using
parameters as follows:
-object memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
-device nvdimm,memdev=mem1,id=nv1

Changelog in v7:
- changes from Vladimir Sementsov-Ogievskiy's comments:
   1) let gethugepagesize() report fstat failure instead of falling back
  to the normal page size
   2) rename open_file_path to open_ram_file_path
   3) log the error message better by using error_setg_errno
   4) update the commit log to explain hugepage detection on
  Windows

- changes from Eduardo Habkost's comments:
   1) use 'Error**' to collect error message for qemu_file_get_page_size()
   2) move gethugepagesize() replacement to the same patch to make it
  better for review
   3) introduce qemu_get_file_size to unify the code with raw_getlength()

- changes from Stefan's comments:
   1) check the memory region is large enough to contain DSM output
  buffer

- changes from Eric Blake's comments:
   1) update the shell command in the commit log to generate the patch
  which drops 'pc-dimm' prefix

- others:
   pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
   Eric Blake.

Changelog in v6:
- changes from Stefan's comments:
   1) fix code style of struct naming by CamelCase way
   2) fix offset + length overflow when read/write label data
   3) compile hw/acpi/nvdimm.c per target so that TARGET_PAGE_SIZE can
  be used to replace getpagesize()

Changelog in v5:
- changes from Michael's comments:
   1) prefix nvdimm_ to everything in NVDIMM source files
   2) make parsing _DSM Arg3 more clear
   3) comment style fix
   5) drop single-use definition
   6) fix dirty dsm buffer loss due to memory writes happening on the host
   7) check that the dsm buffer is big enough to contain the input data
   8) use build_append_int_noprefix to store single value to GArray

- changes from Michael's and Igor's comments:
   1) introduce 'nvdimm-support' parameter to control nvdimm
  enablement; it is disabled for 2.4 and earlier versions
  to keep live migration compatible
   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
  virtualization

- changes from Stefan's comments:
   1) do endian adjustment for the buffer length

- changes from Bharata B Rao's comments:
   1) fix compile on ppc

- others:
   1) the buffer length is obtained directly from the IO read rather than
  from dsm memory
   2) fix dirty label data loss due to memory writes happening on the host

Changelog in v4:
- changes from Michael's comments:
   1) show the message, "Memory is not allocated from HugeTlbfs", if file
  based memory is not allocated from hugetlbfs.
   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
  from Machine.
   3) statically define UUID and make its operation more clear
   4) use GArray to build device structures to avoid potential buffer
  overflow
   4) improve comments in the code
   5) improve code style

- changes from Igor's comments:
   1) add NVDIMM ACPI spec document
   2) use serialized method to avoid Mutex
   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
   4) introduce a common ASL method used by _DSM for all devices to reduce
  ACPI size
   5) handle UUID in ACPI AML code. BTW, I'd keep handling the revision in
  QEMU; it's better to upgrade QEMU to support Rev2 in the future

- changes from Stefan's comments:
   1) copy input data from DSM memory to local buffer to avoid potential
  issues as DSM memory is visible to guest. Output data is handled
  in a similar way

- changes from Dan's comments:
   1) drop static namespace as Linux has already supported label-less
  nvdimm devices

- changes from Vladimir's comments:
   1) print a better message, "failed to get file size for %s, can't create
  backend on it", if any file operation failed to obtain the file size

- others:
   create a git repo on github.com for better review/test

Also, thanks to Eric Blake for his review on the QAPI side.

Thanks to all of you for reviewing this patchset.

Changelog in v3:
There are huge changes in this version; thanks to Igor, Stefan, Paolo, Eduardo,
and Michael for their valuable comments, the patchset is finally getting into
better shape.
- changes from Igor's comments:
   1) abstract dimm device type from pc-dimm and create nvdimm device based on
  

[PATCH v3 1/3] target-i386: add pkeys support for cpuid handling

2015-11-17 Thread Huaitong Han
This patch adds pkeys support for cpuid handling.

Signed-off-by: Huaitong Han 

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 4d1b085..3c11e02 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -264,6 +264,17 @@ static const char *cpuid_7_0_ebx_feature_name[] = {
 NULL, NULL, "avx512pf", "avx512er", "avx512cd", NULL, NULL, NULL,
 };
 
+static const char *cpuid_7_0_ecx_feature_name[] = {
+NULL, NULL, NULL, "pku",
+"ospke", NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 static const char *cpuid_apm_edx_feature_name[] = {
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
@@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = {
   CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
+#define TCG_7_0_ECX_FEATURES 0
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
@@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EBX,
 .tcg_features = TCG_7_0_EBX_FEATURES,
 },
+[FEAT_7_0_ECX] = {
+.feat_names = cpuid_7_0_ecx_feature_name,
+.cpuid_eax = 7,
+.cpuid_needs_ecx = true, .cpuid_ecx = 0,
+.cpuid_reg = R_ECX,
+.tcg_features = TCG_7_0_ECX_FEATURES,
+},
 [FEAT_8000_0007_EDX] = {
 .feat_names = cpuid_apm_edx_feature_name,
 .cpuid_eax = 0x80000007,
@@ -2401,7 +2420,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 if (count == 0) {
 *eax = 0; /* Maximum ECX value for sub-leaves */
 *ebx = env->features[FEAT_7_0_EBX]; /* Feature flags */
-*ecx = 0; /* Reserved */
+*ecx = env->features[FEAT_7_0_ECX]; /* Feature flags */
 *edx = 0; /* Reserved */
 } else {
 *eax = 0;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index ead2832..c2e7501 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -408,6 +408,7 @@ typedef enum FeatureWord {
 FEAT_1_EDX, /* CPUID[1].EDX */
 FEAT_1_ECX, /* CPUID[1].ECX */
 FEAT_7_0_EBX,   /* CPUID[EAX=7,ECX=0].EBX */
+FEAT_7_0_ECX,   /* CPUID[EAX=7,ECX=0].ECX */
 FEAT_8000_0001_EDX, /* CPUID[8000_0001].EDX */
 FEAT_8000_0001_ECX, /* CPUID[8000_0001].ECX */
 FEAT_8000_0007_EDX, /* CPUID[8000_0007].EDX */
@@ -576,6 +577,9 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 #define CPUID_7_0_EBX_AVX512ER (1U << 27) /* AVX-512 Exponential and Reciprocal */
 #define CPUID_7_0_EBX_AVX512CD (1U << 28) /* AVX-512 Conflict Detection */
 
+#define CPUID_7_0_ECX_PKU  (1U << 3)
+#define CPUID_7_0_ECX_OSPKE(1U << 4)
+
 #define CPUID_XSAVE_XSAVEOPT   (1U << 0)
 #define CPUID_XSAVE_XSAVEC (1U << 1)
 #define CPUID_XSAVE_XGETBV1(1U << 2)
-- 
2.4.3



[PATCH v3 3/3] target-i386: add pkeys support for vm migration

2015-11-17 Thread Huaitong Han
This patch adds pkeys support for vm migration.

Signed-off-by: Huaitong Han 

diff --git a/target-i386/machine.c b/target-i386/machine.c
index a0df64b..1b190c7 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -725,6 +725,26 @@ static const VMStateDescription vmstate_xss = {
 VMSTATE_END_OF_LIST()
 }
 };
+#ifdef TARGET_X86_64
+static bool pkru_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+
+return env->pkru != 0;
+}
+
+static const VMStateDescription vmstate_pkru = {
+.name = "cpu/pkru",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = pkru_needed,
+.fields = (VMStateField[]){
+VMSTATE_UINT32(env.pkru, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+#endif
 
 VMStateDescription vmstate_x86_cpu = {
 .name = "cpu",
@@ -844,6 +864,9 @@ VMStateDescription vmstate_x86_cpu = {
 _msr_hyperv_time,
 _avx512,
 _xss,
+#ifdef TARGET_X86_64
+_pkru,
+#endif
 NULL
 }
 };
-- 
2.4.3



[PATCH v3 0/3] target-i386: add memory protection-key support

2015-11-17 Thread Huaitong Han
Changes in v3:
*Fix cpuid_7_0_ecx_feature_name error.

Changes in v2:
*Fix memcpy error for xsave state.
*Fix TCG_7_0_ECX_FEATURES to 0.
*Make subjects more readable.

The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

Hardware support for protection keys for user pages is enumerated with CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE
with the setting of CR4.PKE(bit 22).

The PKRU register is XSAVE-managed state CPUID.D.0.EAX[9], the size of XSAVE
state component for PKRU is 8 bytes, the offset is 0xa80.

The specification of Protection Keys can be found at SDM (4.6.2, volume 3)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.
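As a rough sketch of the PKRU semantics described above (illustrative only, not a quote of the kernel's fault logic): each protection key gets an access-disable (AD) bit and a write-disable (WD) bit in PKRU, at positions 2*key and 2*key+1.

```c
/* Does PKRU permit a data access with the given protection key?
 * AD blocks all data accesses; WD blocks user-mode writes (and
 * kernel writes when CR0.WP is set). */
static int pkru_allows(unsigned int pkru, int pkey, int is_write)
{
    unsigned int ad = (pkru >> (2 * pkey)) & 1;
    unsigned int wd = (pkru >> (2 * pkey + 1)) & 1;

    if (ad)
        return 0;               /* every access with this key faults */
    if (is_write && wd)
        return 0;               /* writes with this key fault */
    return 1;
}
```

For example, the values 0x10 and 0x20 set AD and WD respectively for key 2, which is exactly what the kvm-unit-tests pku test exercises.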

Huaitong Han (3):
  target-i386: add pkeys support for cpuid handling
  target-i386: add pkeys support for xsave state handling
  target-i386: add pkeys support for vm migration

 target-i386/cpu.c | 23 ++-
 target-i386/cpu.h |  7 +++
 target-i386/kvm.c |  3 +++
 target-i386/machine.c | 23 +++
 4 files changed, 55 insertions(+), 1 deletion(-)

-- 
2.4.3



[PATCH v3 2/3] target-i386: add pkeys support for xsave state handling

2015-11-17 Thread Huaitong Han
This patch adds pkeys support for xsave state handling.

Signed-off-by: Huaitong Han 

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 3c11e02..456cb3b 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -487,6 +487,8 @@ static const ExtSaveArea ext_save_areas[] = {
 .offset = 0x480, .size = 0x200 },
 [7] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
 .offset = 0x680, .size = 0x400 },
+[9] = { .feature = FEAT_7_0_ECX, .bits = CPUID_7_0_ECX_PKU,
+.offset = 0xA80, .size = 0x8 },
 };
 
 const char *get_register_name_32(unsigned int reg)
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index c2e7501..2230b3e 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -401,6 +401,7 @@
 #define XSTATE_OPMASK   (1ULL << 5)
 #define XSTATE_ZMM_Hi256(1ULL << 6)
 #define XSTATE_Hi16_ZMM (1ULL << 7)
+#define XSTATE_PKRU (1ULL << 9)
 
 
 /* CPUID feature words */
@@ -984,6 +985,8 @@ typedef struct CPUX86State {
 uint64_t xcr0;
 uint64_t xss;
 
+uint32_t pkru;
+
 TPRAccess tpr_access_type;
 } CPUX86State;
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 066d03d..16a8eff 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1092,6 +1092,7 @@ static int kvm_put_fpu(X86CPU *cpu)
 #define XSAVE_OPMASK  272
 #define XSAVE_ZMM_Hi256   288
 #define XSAVE_Hi16_ZMM416
+#define XSAVE_PKRU672
 
 static int kvm_put_xsave(X86CPU *cpu)
 {
@@ -1145,6 +1146,7 @@ static int kvm_put_xsave(X86CPU *cpu)
 #ifdef TARGET_X86_64
+memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16],
 16 * sizeof env->xmm_regs[16]);
+memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru);
 #endif
 r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave);
 return r;
@@ -1516,6 +1518,7 @@ static int kvm_get_xsave(X86CPU *cpu)
 #ifdef TARGET_X86_64
 memcpy(&env->xmm_regs[16], &xsave->region[XSAVE_Hi16_ZMM],
16 * sizeof env->xmm_regs[16]);
+memcpy(&env->pkru, &xsave->region[XSAVE_PKRU], sizeof env->pkru);
 #endif
 return 0;
 }
-- 
2.4.3



[kvm-unit-tests v2] x86: pkeys: add test for memory protection-key

2015-11-17 Thread Huaitong Han
Changes in v2:
* Move pku.flat from config-x86-common.mak to config-x86_64.mak.


The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

The specification of Protection Keys can be found at SDM (4.6.2, volume 3)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

Signed-off-by: Huaitong Han 

diff --git a/config/config-x86-common.mak b/config/config-x86-common.mak
index c2f9908..a72055c 100644
--- a/config/config-x86-common.mak
+++ b/config/config-x86-common.mak
@@ -104,6 +104,8 @@ $(TEST_DIR)/pcid.elf: $(cstart.o) $(TEST_DIR)/pcid.o
 
 $(TEST_DIR)/smap.elf: $(cstart.o) $(TEST_DIR)/smap.o
 
+$(TEST_DIR)/pku.elf: $(cstart.o) $(TEST_DIR)/pku.o
+
 $(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o $(TEST_DIR)/vmx_tests.o
 
 $(TEST_DIR)/debug.elf: $(cstart.o) $(TEST_DIR)/debug.o
diff --git a/config/config-x86_64.mak b/config/config-x86_64.mak
index 7d4eb34..fcf4486 100644
--- a/config/config-x86_64.mak
+++ b/config/config-x86_64.mak
@@ -11,5 +11,6 @@ tests = $(TEST_DIR)/access.flat $(TEST_DIR)/apic.flat \
 tests += $(TEST_DIR)/svm.flat
 tests += $(TEST_DIR)/vmx.flat
 tests += $(TEST_DIR)/tscdeadline_latency.flat
+tests += $(TEST_DIR)/pku.flat
 
 include config/config-x86-common.mak
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index f6eb187..1816807 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -28,6 +28,7 @@
 #define X86_CR4_PAE0x0020
 #define X86_CR4_PCIDE  0x0002
 #define X86_CR4_SMAP   0x0020
+#define X86_CR4_PKE0x0040
 
 #define X86_EFLAGS_CF  0x0001
 #define X86_EFLAGS_ZF  0x0040
diff --git a/x86/pku.c b/x86/pku.c
new file mode 100644
index 000..4def60d
--- /dev/null
+++ b/x86/pku.c
@@ -0,0 +1,162 @@
+#include "libcflat.h"
+#include "x86/desc.h"
+#include "x86/processor.h"
+#include "x86/vm.h"
+#include "x86/msr.h"
+
+#define X86_FEATURE_PKU  3
+#define CR0_WP_MASK  (1UL << 16)
+#define PTE_PKEY_BIT 59
+#define USER_BASE(1 << 24)
+#define USER_VAR(v)  (*((__typeof__(&(v))) (((unsigned long)&(v)) + USER_BASE)))
+
+volatile int pf_count = 0;
+volatile unsigned save;
+volatile unsigned test;
+
+void set_cr0_wp(int wp)
+{
+unsigned long cr0 = read_cr0();
+
+cr0 &= ~CR0_WP_MASK;
+if (wp)
+cr0 |= CR0_WP_MASK;
+write_cr0(cr0);
+}
+
+static inline u32 read_pkru(void)
+{
+unsigned int eax, edx;
+unsigned int ecx = 0;
+unsigned int pkru;
+
+asm volatile(".byte 0x0f,0x01,0xee\n\t"
+ : "=a" (eax), "=d" (edx)
+ : "c" (ecx));
+pkru = eax;
+return pkru;
+}
+
+static void write_pkru(u32 pkru)
+{
+unsigned int eax = pkru;
+unsigned int ecx = 0;
+unsigned int edx = 0;
+
+asm volatile(".byte 0x0f,0x01,0xef\n\t"
+: : "a" (eax), "c" (ecx), "d" (edx));
+}
+
+void do_pf_tss(unsigned long error_code)
+{
+pf_count++;
+save = test;
+write_pkru(0);
+}
+
+extern void pf_tss(void);
+
+asm ("pf_tss: \n\t"
+#ifdef __x86_64__
+// no task on x86_64, save/restore caller-save regs
+"push %rax; push %rcx; push %rdx; push %rsi; push %rdi\n"
+"push %r8; push %r9; push %r10; push %r11\n"
+#endif
+"call do_pf_tss \n\t"
+#ifdef __x86_64__
+"pop %r11; pop %r10; pop %r9; pop %r8\n"
+"pop %rdi; pop %rsi; pop %rdx; pop %rcx; pop %rax\n"
+#endif
+"add $"S", %"R "sp\n\t" // discard error code
+"iret"W" \n\t"
+"jmp pf_tss\n\t"
+);
+
+static void init_test()
+{
+pf_count = 0;
+
+invlpg(&test);
+invlpg(&USER_VAR(test));
+write_pkru(0);
+set_cr0_wp(0);
+}
+
+int main(int ac, char **av)
+{
+unsigned long i;
+unsigned int pkey = 0x2;
+unsigned int pkru_ad = 0x10;
+unsigned int pkru_wd = 0x20;
+
+if (!(cpuid_indexed(7, 0).c & (1 << X86_FEATURE_PKU))) {
+printf("PKU not enabled, exiting\n");
+exit(1);
+}
+
+setup_vm();
+setup_alt_stack();
+set_intr_alt_stack(14, pf_tss);
+wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_LMA);
+
+for (i = 0; i < USER_BASE; i += PAGE_SIZE) {
+*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= ~PTE_USER;
+*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) |= ((unsigned long)pkey << PTE_PKEY_BIT);
+invlpg((void *)i);
+}
+
+for (i = USER_BASE; i < 2 * USER_BASE; i += PAGE_SIZE) {
+*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) &= ~USER_BASE;
+*get_pte(phys_to_virt(read_cr3()), phys_to_virt(i)) |= ((unsigned long)pkey << PTE_PKEY_BIT);
+invlpg((void *)i);
+}
+
+write_cr4(read_cr4() | X86_CR4_PKE);
+write_cr3(read_cr3());
+
+init_test();
+set_cr0_wp(1);
+write_pkru(pkru_ad);
+test = 21;
+report("write to supervisor page when pkru is ad and wp == 1", pf_count == 0 && test == 21);
+
+init_test();
+set_cr0_wp(0);
+write_pkru(pkru_ad);
+test = 22;
+

Re: [PATCH 02/10] KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap

2015-11-17 Thread Xiao Guangrong



On 11/12/2015 07:50 PM, Takuya Yoshikawa wrote:

+   if (!ret) {
+   clear_unsync_child_bit(sp, i);
+   continue;
+   } else if (ret > 0) {
nr_unsync_leaf += ret;


Just a single line here, braces are unnecessary.


-   else
+   } else
return ret;
} else if (child->unsync) {
nr_unsync_leaf++;
if (mmu_pages_add(pvec, child, i))
return -ENOSPC;
} else
-goto clear_child_bitmap;
-
-   continue;
-
-clear_child_bitmap:
-   __clear_bit(i, sp->unsync_child_bitmap);
-   sp->unsync_children--;
-   WARN_ON((int)sp->unsync_children < 0);
+   clear_unsync_child_bit(sp, i);
}



[RFC] kvmtool: add support for modern virtio-pci

2015-11-17 Thread Sasha Levin
This is a first go at adding support for the modern (based on the 1.0 virtio
spec) virtio-pci implementation.

kvmtool's layering makes it simple to add additional transports such as this
one, so we are able to add it as a third transport layer (after legacy
virtio-pci and virtio-mmio) and still allow users to choose either the
legacy or the modern implementation (with the modern one as the
default).

The changes to the virtio devices are mostly the result of needing to support
>32bit features, and the different initialization method for VQs.

It's worth noting that supporting v1.0 implies any_layout, but some of our
devices made assumptions about the layout, which I've fixed; keep in mind
that some assumptions probably went unnoticed.
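To illustrate the any_layout point: with VIRTIO_F_ANY_LAYOUT the driver may split a request header across descriptors, so a device cannot cast the first buffer to a header struct and must gather the bytes instead. A minimal sketch, with a hypothetical scatter-list type standing in for kvmtool's iovecs:

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical scatter-list entry (not kvmtool's actual type). */
struct iov {
    const void *base;
    size_t len;
};

/* Gather `len` header bytes from a scattered descriptor chain,
 * however the driver chose to split them.  Returns bytes copied. */
static size_t iov_copy_out(void *dst, const struct iov *iov, int cnt, size_t len)
{
    size_t done = 0;

    for (int i = 0; i < cnt && done < len; i++) {
        size_t n = iov[i].len < len - done ? iov[i].len : len - done;

        memcpy((char *)dst + done, iov[i].base, n);
        done += n;
    }
    return done;
}
```

A legacy-only device that reads the header from iov[0] alone works by accident; a device written this way works for any split the driver picks.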

To sum it up: this is a lightly tested version for feedback about the design
and to weed out major bugs people notice. Feedback is very welcome!

Signed-off-by: Sasha Levin 
---
 Makefile  |   1 +
 builtin-run.c |   4 +
 include/kvm/kvm-config.h  |   1 +
 include/kvm/pci.h |   8 +-
 include/kvm/virtio-9p.h   |   2 +-
 include/kvm/{virtio-pci.h => virtio-pci-modern.h} |  23 +-
 include/kvm/virtio-pci.h  |   6 +-
 include/kvm/virtio.h  |  25 +-
 include/linux/virtio_pci.h| 199 +++
 net/uip/core.c|   7 +-
 virtio/9p.c   |  35 +-
 virtio/balloon.c  |  37 +-
 virtio/blk.c  |  50 +-
 virtio/console.c  |  42 +-
 virtio/core.c |  16 +
 virtio/mmio.c |  13 +-
 virtio/net.c  |  59 ++-
 virtio/pci.c  |   4 +-
 virtio/pci_modern.c   | 599 ++
 virtio/rng.c  |  29 +-
 virtio/scsi.c |  36 +-
 x86/include/kvm/kvm-arch.h|   2 +-
 22 files changed, 1109 insertions(+), 89 deletions(-)
 copy include/kvm/{virtio-pci.h => virtio-pci-modern.h} (69%)
 create mode 100644 include/linux/virtio_pci.h
 create mode 100644 virtio/pci_modern.c

diff --git a/Makefile b/Makefile
index 59622c3..13a12f8 100644
--- a/Makefile
+++ b/Makefile
@@ -67,6 +67,7 @@ OBJS  += virtio/net.o
 OBJS   += virtio/rng.o
 OBJS+= virtio/balloon.o
 OBJS   += virtio/pci.o
+OBJS   += virtio/pci_modern.o
 OBJS   += disk/blk.o
 OBJS   += disk/qcow.o
 OBJS   += disk/raw.o
diff --git a/builtin-run.c b/builtin-run.c
index edcaf3e..e133b10 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -128,6 +128,8 @@ void kvm_run_set_wrapper_sandbox(void)
" rootfs"), \
OPT_STRING('\0', "hugetlbfs", &(cfg)->hugetlbfs_path, "path",   \
"Hugetlbfs path"),  \
+   OPT_BOOLEAN('\0', "virtio-legacy", &(cfg)->old_virtio, "Use"\
+   " legacy virtio-pci devices"),  \
\
OPT_GROUP("Kernel options:"),   \
OPT_STRING('k', "kernel", &(cfg)->kernel_filename, "kernel",\
@@ -517,6 +519,8 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
kvm->cfg.vmlinux_filename = find_vmlinux();
kvm->vmlinux = kvm->cfg.vmlinux_filename;
 
+   default_transport = kvm->cfg.old_virtio ? VIRTIO_PCI : VIRTIO_PCI_MODERN;
+
if (kvm->cfg.nrcpus == 0)
kvm->cfg.nrcpus = nr_online_cpus;
 
diff --git a/include/kvm/kvm-config.h b/include/kvm/kvm-config.h
index 386fa8c..b1512a1 100644
--- a/include/kvm/kvm-config.h
+++ b/include/kvm/kvm-config.h
@@ -57,6 +57,7 @@ struct kvm_config {
bool no_dhcp;
bool ioport_debug;
bool mmio_debug;
+   bool old_virtio;
 };
 
 #endif
diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index b0c28a1..19ec56a 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "kvm/devices.h"
@@ -81,7 +82,12 @@ struct pci_device_header {
u8  min_gnt;
u8  max_lat;
struct msix_cap msix;
-   u8  empty[136]; /* Rest of PCI config space */
+   struct virtio_pci_cap common_cap;
+   struct virtio_pci_notify_cap notify_cap;
+   struct virtio_pci_cap isr_cap;
+   struct virtio_pci_cap device_cap;
+   struct virtio_pci_cfg_cap pci_cap;
+   u8  empty[48]; /* Rest of PCI config space */
u32   

[PATCH v3 1/7] KVM, pkeys: expose CPUID/CR4 to guest

2015-11-17 Thread Huaitong Han
This patch exposes CPUID/CR4 to guest.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).
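The OSPKE update this patch performs in kvm_update_cpuid() can be sketched as follows (illustrative C mirroring the patch's logic, not a quote of it):

```c
#define F_PKU   (1u << 3)   /* CPUID.7.0.ECX[3] */
#define F_OSPKE (1u << 4)   /* CPUID.7.0.ECX[4] */
#define CR4_PKE (1ul << 22)

/* Recompute the OSPKE bit of the guest's CPUID.7.0.ECX from the
 * guest's CR4.PKE, provided the host CPU has PKU at all. */
static unsigned int update_ospke(unsigned int ecx, unsigned long cr4,
                                 int host_has_pku)
{
    if (!host_has_pku)
        return ecx;          /* nothing to reflect without PKU */
    ecx &= ~F_OSPKE;         /* recompute from scratch */
    if (cr4 & CR4_PKE)
        ecx |= F_OSPKE;      /* OSPKE mirrors guest CR4.PKE */
    return ecx;
}
```

This is why OSPKE is masked out of the supported-features word: it is derived state, not a host feature to pass through.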

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c12e845..3bbc1cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -55,7 +55,8 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
+ | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 156441b..ece687b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
apic->lapic_timer.timer_mode_mask = 1 << 17;
}
 
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   if (!best)
+   return 0;
+
+   /* Update OSPKE bit */
+   if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) {
+   best->ecx &= ~F(OSPKE);
+   if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
+   best->ecx |= F(OSPKE);
+   }
+
best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (!best) {
vcpu->arch.guest_supported_xcr0 = 0;
@@ -354,6 +365,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
const u32 kvm_supported_word10_x86_features =
F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves;
 
+   /* cpuid 7.0.ecx*/
+   const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
+
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
 
@@ -431,10 +445,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
cpuid_mask(&entry->ebx, 9);
// TSC_ADJUST is emulated
entry->ebx |= F(TSC_ADJUST);
-   } else
+   entry->ecx &= kvm_supported_word11_x86_features;
+   cpuid_mask(&entry->ecx, 13);
+   } else {
entry->ebx = 0;
+   entry->ecx = 0;
+   }
entry->eax = 0;
-   entry->ecx = 0;
entry->edx = 0;
break;
}
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index dd05b9c..7775158 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -70,6 +70,14 @@ static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
 }
 
+static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ecx & bit(X86_FEATURE_PKU));
+}
+
 static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d4e54d..5181834 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -709,7 +709,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
unsigned long old_cr4 = kvm_read_cr4(vcpu);
unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-  X86_CR4_SMEP | X86_CR4_SMAP;
+  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
 
if (cr4 & CR4_RESERVED_BITS)
return 1;
@@ -726,6 +726,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (!guest_cpuid_has_fsgsbase(vcpu) && (cr4 & X86_CR4_FSGSBASE))
return 1;
 
+   if (!guest_cpuid_has_pku(vcpu) && (cr4 & X86_CR4_PKE))
+   return 1;
+
if (is_long_mode(vcpu)) {
if (!(cr4 & X86_CR4_PAE))
return 1;
@@ -751,7 +754,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
kvm_mmu_reset_context(vcpu);
 
-   if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
+   if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
kvm_update_cpuid(vcpu);
 
return 0;
@@ -6838,7 +6841,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 
mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
-   if (sregs->cr4 & 

[PATCH v3 0/7] KVM, pkeys: add memory protection-key support

2015-11-17 Thread Huaitong Han
Changes in v3:
*Add comments for the patch that disables the PKU feature without ept.

Changes in v2:
*Add pku.c for kvm-unit-tests.
*Optimize permission_fault code for patch4.
*Delete is_long_mode and PK for patch5.
*Squash cpuid and cr4 patches.

The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

Hardware support for protection keys for user pages is enumerated with the CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is enumerated with
CPUID.7.0.ECX[4]:OSPKE, which reflects the setting of CR4.PKE (bit 22).

When CR4.PKE = 1, every linear address is associated with the 4-bit protection
key located in bits 62:59 of the paging-structure entry that mapped the page
containing the linear address. The PKRU register determines, for each
protection key, whether user-mode addresses with that protection key may be
read or written.

The PKRU register (protection key rights for user pages) is a 32-bit register
with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
bit for protection key i (WDi).

Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and
write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus
be read and written by instructions in the XSAVE feature set.

PFEC.PK (bit 5) is defined as protection key violations.

The specification of Protection Keys can be found at SDM (4.6.2, volume 3)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

The native kernel patchset has not yet been merged upstream; it can be found
at git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v007.

Huaitong Han (7):
  KVM, pkeys: expose CPUID/CR4 to guest
  KVM, pkeys: disable pkeys for guests in non-paging mode
  KVM, pkeys: update memory permission bitmask for pkeys
  KVM, pkeys: add pkeys support for permission_fault logic
  KVM, pkeys: Add pkeys support for gva_to_gpa functions
  KVM, pkeys: add pkeys support for xsave state
  KVM, pkeys: disable PKU feature without ept

 arch/x86/include/asm/kvm_host.h | 11 +---
 arch/x86/kvm/cpuid.c| 24 --
 arch/x86/kvm/cpuid.h|  8 ++
 arch/x86/kvm/mmu.c  | 32 +--
 arch/x86/kvm/mmu.h  | 56 +
 arch/x86/kvm/paging_tmpl.h  | 18 ++---
 arch/x86/kvm/vmx.c  | 10 
 arch/x86/kvm/x86.c  | 27 ++--
 arch/x86/kvm/x86.h  |  3 ++-
 9 files changed, 161 insertions(+), 28 deletions(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 7/7] KVM, pkeys: disable PKU feature without ept

2015-11-17 Thread Huaitong Han
This patch disables CPUID:PKU without ept, because pkeys is not supported
with softmmu.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ece687b..e422f0a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -447,6 +447,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
entry->ebx |= F(TSC_ADJUST);
entry->ecx &= kvm_supported_word11_x86_features;
cpuid_mask(&entry->ecx, 13);
+   if (!tdp_enabled)
+   /* PKU cannot work with softmmu */
+   entry->ecx &= ~F(PKU);
} else {
entry->ebx = 0;
entry->ecx = 0;
-- 
2.4.3



[PATCH v3 6/7] KVM, pkeys: add pkeys support for xsave state

2015-11-17 Thread Huaitong Han
This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f2afa5f..0f71d5d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -182,7 +182,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
-   | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512)
+   | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
+   | XFEATURE_MASK_PKRU)
 extern u64 host_xcr0;
 
 extern u64 kvm_supported_xcr0(void);
-- 
2.4.3



[PATCH v3 3/7] KVM, pkeys: update memory permission bitmask for pkeys

2015-11-17 Thread Huaitong Han
Pkeys define a new status bit in the PFEC: PFEC.PK (bit 5). If certain
conditions are true, the fault is considered a PKU violation.

This patch updates the memory permission bitmask for pkeys.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3bbc1cb..8852b9f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -159,12 +159,14 @@ enum {
 #define PFERR_USER_BIT 2
 #define PFERR_RSVD_BIT 3
 #define PFERR_FETCH_BIT 4
+#define PFERR_PK_BIT 5
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
 #define PFERR_USER_MASK (1U << PFERR_USER_BIT)
 #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
 #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
+#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC   0
@@ -288,10 +290,12 @@ struct kvm_mmu {
 
/*
 * Bitmap; bit set = permission fault
-* Byte index: page fault error code [4:1]
+* Byte index: page fault error code [5:1]
 * Bit index: pte permissions in ACC_* format
+*
+* Add PFEC.PK (bit 5) for protection-key violations
 */
-   u8 permissions[16];
+   u8 permissions[32];
 
u64 *pae_root;
u64 *lm_root;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 69088a1..0568635 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3793,16 +3793,22 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
+   bool fault, x, w, u, smap = 0, pku = 0;
+   bool wf, uf, ff, smapf, rsvdf, pkuf;
+   bool cr4_smap, cr4_smep, cr4_pku;
 
cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
+   cr4_pku = kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   rsvdf = pfec & PFERR_RSVD_MASK;
+   pkuf = pfec & PFERR_PK_MASK;
/*
 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
 * subject to SMAP restrictions, and cleared otherwise. The
@@ -3841,12 +3847,34 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
 *   clearer.
 */
smap = cr4_smap && u && !uf && !ff;
+
+   /*
+   * PKU: additional mechanism by which the paging
+   * controls access to user-mode addresses based
+   * on the value in the PKRU register. A fault is
+   * considered as a PKU violation if all of the
+   * following conditions are true:
+   * 1.CR4_PKE=1.
+   * 2.EFER_LMA=1.
+   * 3.page is present with no reserved bit
+   *   violations.
+   * 4.the access is not an instruction fetch.
+   * 5.the access is to a user page.
+   * 6.PKRU.AD=1
+   *   or The access is a data write and
+   *  PKRU.WD=1 and either CR0.WP=1
+   *  or it is a user access.
+   *
+   * The 2nd and 6th conditions are computed
+   * dynamically in permission_fault.
+   */
+   pku = cr4_pku && !rsvdf && !ff && u;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
fault = (ff && !x) || (uf && !u) || (wf && !w) ||
-   (smapf && smap);
+   (smapf && smap) || (pkuf && pku);
map |= fault << bit;
}
mmu->permissions[byte] = map;
-- 
2.4.3



[PATCH v3 4/7] KVM, pkeys: add pkeys support for permission_fault logic

2015-11-17 Thread Huaitong Han
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables. The PKEY is an index into the
PKRU register (16 domains); every domain has 2 bits (a write-disable bit
and an access-disable bit).

The static logic is handled in update_permission_bitmask; the dynamic logic
needs to read the pkey from the page table entries, read the PKRU value,
and deduce the correct result.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e4202e4..c76e744 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -3,6 +3,7 @@
 
 #include 
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #define PT64_PT_BITS 9
 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -24,6 +25,7 @@
 #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
+
 #define PT64_NX_SHIFT 63
 #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
 
@@ -45,6 +47,10 @@
 #define PT_PAGE_TABLE_LEVEL 1
 #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
 
+#define PKRU_READ   0
+#define PKRU_WRITE  1
+#define PKRU_ATTRS  2
+
 static inline u64 rsvd_bits(int s, int e)
 {
return ((1ULL << (e - s + 1)) - 1) << s;
@@ -145,10 +151,50 @@ static inline bool is_write_protection(struct kvm_vcpu 
*vcpu)
  * fault with the given access (in ACC_* format)?
  */
 static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-   unsigned pte_access, unsigned pfec)
+   unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
 {
-   int cpl = kvm_x86_ops->get_cpl(vcpu);
-   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+   unsigned long smap, rflags;
+   u32 pkru, pkru_bits;
+   int cpl, index;
+   bool wf, uf;
+
+   cpl = kvm_x86_ops->get_cpl(vcpu);
+   rflags = kvm_x86_ops->get_rflags(vcpu);
+
+   /*
+   * PKU is computed dynamically in permission_fault.
+   * 2nd and 6th conditions:
+   * 2.EFER_LMA=1
+   * 6.PKRU.AD=1
+   *   or The access is a data write and PKRU.WD=1 and
+   *  either CR0.WP=1 or it is a user mode access
+   */
+   pkru = is_long_mode(vcpu) ? read_pkru() : 0;
+   if (unlikely(pkru) && (pfec & PFERR_PK_MASK))
+   {
+   /*
+   * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
+   * domain in pkru, pkey is the index to a defined domain, so the value
+   * of pkey * PKRU_ATTRS is offset of a defined domain.
+   */
+   pkru_bits = (pkru >> (pte_pkeys * PKRU_ATTRS)) & 3;
+
+   wf = pfec & PFERR_WRITE_MASK;
+   uf = pfec & PFERR_USER_MASK;
+
+   /*
+   * Ignore PKRU.WD if not relevant to this access (a read,
+   * or a supervisor mode access if CR0.WP=0).
+   * So 6th conditions is equivalent to "pkru_bits != 0"
+   */
+   if (!wf || (!uf && !is_write_protection(vcpu)))
+   pkru_bits &= ~(1 << PKRU_WRITE);
+
+   /* Flip pfec on PK bit if pkru_bits is zero */
+   pfec ^= pkru_bits ? 0 : PFERR_PK_MASK;
+   }
+   else
+   pfec &= ~PFERR_PK_MASK;
 
/*
 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -163,8 +209,8 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 * but it will be one in index if SMAP checks are being overridden.
 * It is important to keep this branchless.
 */
-   unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
-   int index = (pfec >> 1) +
+   smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
+   index = (pfec >> 1) +
(smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
 
WARN_ON(pfec & PFERR_RSVD_MASK);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab..02daa97 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -253,6 +253,15 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
}
return 0;
 }
+static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+   unsigned pkeys = 0;
+#if PTTYPE == 64
+   pte_t pte = {.pte = gpte};
+   pkeys = pte_pkey(pte);
+#endif
+   return pkeys;
+}
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -265,12 +274,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
pt_element_t pte;
pt_element_t __user *uninitialized_var(ptep_user);
gfn_t table_gfn;
-   unsigned index, pt_access, pte_access, accessed_dirty;
+   unsigned index, pt_access, pte_access, accessed_dirty, pte_pkeys;
gpa_t pte_gpa;
int offset;
const int write_fault = access & PFERR_WRITE_MASK;
const int user_fault  = access & 

[PATCH v3 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode

2015-11-17 Thread Huaitong Han
Pkeys is disabled if the CPU is in non-paging mode in hardware. However,
KVM always uses paging mode to emulate guest non-paging mode with TDP. To
emulate this behavior, pkeys needs to be manually disabled when the guest
switches to non-paging mode.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d019868..9b12c80 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3645,14 +3645,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP/SMAP is disabled if CPU is in non-paging mode
-* in hardware. However KVM always uses paging mode to
-* emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP/SMAP needs to be
+* SMEP/SMAP/PKU is disabled if CPU is in non-paging
+* mode in hardware. However KVM always uses paging
+* mode to emulate guest non-paging mode with TDP.
+* To emulate this behavior, SMEP/SMAP/PKU needs to be
 * manually disabled when guest switches to non-paging
 * mode.
 */
-   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
2.4.3



[PATCH v3 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions

2015-11-17 Thread Huaitong Han
This patch adds pkeys support for gva_to_gpa functions.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a84b83..bd942f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3960,6 +3960,7 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, 
gva_t gva,
  struct x86_exception *exception)
 {
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+   access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -3976,6 +3977,7 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, 
gva_t gva,
 {
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
access |= PFERR_WRITE_MASK;
+   access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -4026,10 +4028,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt 
*ctxt,
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
unsigned offset;
int ret;
+   gpa_t gpa;
+
+   access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 
/* Inline kvm_read_guest_virt_helper for speed.  */
-   gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, 
access|PFERR_FETCH_MASK,
-   exception);
+   gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
+   access | PFERR_FETCH_MASK, exception);
if (unlikely(gpa == UNMAPPED_GVA))
return X86EMUL_PROPAGATE_FAULT;
 
@@ -4050,6 +4055,7 @@ int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
 {
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+   access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
  exception);
@@ -4073,9 +4079,13 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt 
*ctxt,
void *data = val;
int r = X86EMUL_CONTINUE;
 
+   u32 access = PFERR_WRITE_MASK;
+
+   access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
+
while (bytes) {
gpa_t gpa =  vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
-PFERR_WRITE_MASK,
+access,
 exception);
unsigned offset = addr & (PAGE_SIZE-1);
unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
-- 
2.4.3



Re: [PATCH 05/10] KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()

2015-11-17 Thread Xiao Guangrong



On 11/12/2015 07:52 PM, Takuya Yoshikawa wrote:

kvm_mmu_mark_parents_unsync() alone uses pte_list_walk(), which does
nearly the same as the for_each_rmap_spte macro.  The only difference
is that is_shadow_present_pte() checks cannot be placed there because
kvm_mmu_mark_parents_unsync() can be called with a new parent pointer
whose entry is not set yet.

By calling mark_unsync() separately for the parent and adding the parent
pointer to the parent_ptes chain later in kvm_mmu_get_page(), the macro
works with no problem.

Signed-off-by: Takuya Yoshikawa 
---
  arch/x86/kvm/mmu.c | 36 +---
  1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e8cfdc4..1691171 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1007,26 +1007,6 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
}
  }

-typedef void (*pte_list_walk_fn) (u64 *spte);
-static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
-{
-   struct pte_list_desc *desc;
-   int i;
-
-   if (!*pte_list)
-   return;
-
-   if (!(*pte_list & 1))
-   return fn((u64 *)*pte_list);
-
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
-   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
-   fn(desc->sptes[i]);
-   desc = desc->more;
-   }
-}
-
  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
  {
@@ -1741,7 +1721,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
  static void mark_unsync(u64 *spte);
  static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
  {
-   pte_list_walk(&sp->parent_ptes, mark_unsync);
+   u64 *sptep;
+   struct rmap_iterator iter;
+
+   for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
+   mark_unsync(sptep);
+   }
  }

  static void mark_unsync(u64 *spte)
@@ -2111,12 +2096,17 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
break;

-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
if (sp->unsync_children) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_mmu_mark_parents_unsync(sp);


After your change above, the @sp's parents have not been changed, so there is
no need to call it now.


-   } else if (sp->unsync)
+   if (parent_pte)
+   mark_unsync(parent_pte);
+   } else if (sp->unsync) {
kvm_mmu_mark_parents_unsync(sp);


Ditto.


+   if (parent_pte)
+   mark_unsync(parent_pte);
+   }
+   mmu_page_add_parent_pte(vcpu, sp, parent_pte);

__clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);




Re: [PATCH 09/10 RFC] KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()

2015-11-17 Thread Xiao Guangrong



On 11/12/2015 07:56 PM, Takuya Yoshikawa wrote:

Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
argument, link_shadow_page() follows that to set the parent entry so
that the new mapping will point to the returned page table.

Moving parent_pte handling there allows to clean up the code because
parent_pte is passed to kvm_mmu_get_page() just for mark_unsync() and
mmu_page_add_parent_pte().

Signed-off-by: Takuya Yoshikawa 
---
  arch/x86/kvm/mmu.c | 21 -
  arch/x86/kvm/paging_tmpl.h |  6 ++
  2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9273cd4..33fe720 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2108,14 +2108,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
if (sp->unsync_children) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_mmu_mark_parents_unsync(sp);
-   if (parent_pte)
-   mark_unsync(parent_pte);
} else if (sp->unsync) {
kvm_mmu_mark_parents_unsync(sp);
-   if (parent_pte)
-   mark_unsync(parent_pte);
}
-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);

__clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
@@ -2127,7 +2122,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
sp = kvm_mmu_alloc_page(vcpu, direct);

sp->parent_ptes.val = 0;
-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);

sp->gfn = gfn;
sp->role = role;
@@ -2196,7 +2190,8 @@ static void shadow_walk_next(struct 
kvm_shadow_walk_iterator *iterator)
return __shadow_walk_next(iterator, *iterator->sptep);
  }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool 
accessed)
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+struct kvm_mmu_page *sp, bool accessed)
  {
u64 spte;

@@ -2210,6 +2205,11 @@ static void link_shadow_page(u64 *sptep, struct 
kvm_mmu_page *sp, bool accessed)
spte |= shadow_accessed_mask;

mmu_spte_set(sptep, spte);
+
+   if (sp->unsync_children || sp->unsync)
+   mark_unsync(sptep);


Why are these needed?



Re: [PATCH] virtio_ring: Shadow available ring flags & index

2015-11-17 Thread Venkatesh Srinivas
On Tue, Nov 17, 2015 at 08:08:18PM -0800, Venkatesh Srinivas wrote:
> On Mon, Nov 16, 2015 at 7:46 PM, Xie, Huawei  wrote:
> 
> > On 11/14/2015 7:41 AM, Venkatesh Srinivas wrote:
> > > On Wed, Nov 11, 2015 at 02:34:33PM +0200, Michael S. Tsirkin wrote:
> > >> On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
> > >>> Improves cacheline transfer flow of available ring header.
> > >>>
> > >>> Virtqueues are implemented as a pair of rings, one producer->consumer
> > >>> avail ring and one consumer->producer used ring; preceding the
> > >>> avail ring in memory are two contiguous u16 fields -- avail->flags
> > >>> and avail->idx. A producer posts work by writing to avail->idx and
> > >>> a consumer reads avail->idx.
> > >>>
> > >>> The flags and idx fields only need to be written by a producer CPU
> > >>> and only read by a consumer CPU; when the producer and consumer are
> > >>> running on different CPUs and the virtio_ring code is structured to
> > >>> only have source writes/sink reads, we can continuously transfer the
> > >>> avail header cacheline between 'M' states between cores. This flow
> > >>> optimizes core -> core bandwidth on certain CPUs.
> > >>>
> > >>> (see: "Software Optimization Guide for AMD Family 15h Processors",
> > >>> Section 11.6; similar language appears in the 10h guide and should
> > >>> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
> > >>>
> > >>> Unfortunately the existing virtio_ring code issued reads to the
> > >>> avail->idx and read-modify-writes to avail->flags on the producer.
> > >>>
> > >>> This change shadows the flags and index fields in producer memory;
> > >>> the vring code now reads from the shadows and only ever writes to
> > >>> avail->flags and avail->idx, allowing the cacheline to transfer
> > >>> core -> core optimally.
> > >> Sounds logical, I'll apply this after a  bit of testing
> > >> of my own, thanks!
> > > Thanks!
> >
> 
> > Venkatesh:
> > Is it that your patch only applies to CPUs w/ exclusive caches?
> 
> No --- it applies when the inter-cache coherence flow is optimized by
> 'M' -> 'M' transfers and when producer reads might interfere w/
> consumer prefetchw/reads. The AMD Optimization guides have specific
> language on this subject, but other platforms may benefit.
> (see Intel #'s below)
> 
> > Do you have perf data on Intel CPUs?
> 
> Good idea -- I ran some tests on a couple of Intel platforms:
> 
> (these are perf data from sample runs; for each I ran many runs, the
>  numbers were pretty stable except for Haswell-EP cross-socket)
> 
> One-socket Intel Xeon W3690 ("Westmere"), 3.46 GHz; core turbo disabled
> ===
> (note -- w/ core turbo disabled, performance is _very_ stable; variance of
>  < 0.5% run-to-run; figure of merit is "seconds elapsed" here)
> 
> * Producer / consumer bound to Hyperthread pairs:
> 
>  Performance counter stats for './vring_bench_noshadow 10':
> 
>  343,425,166,916 L1-dcache-loads
>   21,393,148 L1-dcache-load-misses #0.01% of all L1-dcache hits
>   61,709,640,363 L1-dcache-stores
>5,745,690 L1-dcache-store-misses
>   10,186,932,553 L1-dcache-prefetches
>1,491 L1-dcache-prefetch-misses
>121.335699344 seconds time elapsed
> 
>  Performance counter stats for './vring_bench_shadow 10':
> 
>  334,766,413,861 L1-dcache-loads
>   15,787,778 L1-dcache-load-misses #0.00% of all L1-dcache hits
>   62,735,792,799 L1-dcache-stores
>3,252,113 L1-dcache-store-misses
>9,018,273,596 L1-dcache-prefetches
>  819 L1-dcache-prefetch-misses
>121.206339656 seconds time elapsed
> 
> Effectively Performance-neutral.
> 
> * Producer / consumer bound to separate cores, same socket:
> 
>  Performance counter stats for './vring_bench_noshadow 10':
> 
>399,943,384,509 L1-dcache-loads
>  8,868,334,693 L1-dcache-load-misses #2.22% of all L1-dcache hits
> 62,721,376,685 L1-dcache-stores
>  2,786,806,982 L1-dcache-store-misses
> 10,915,046,967 L1-dcache-prefetches
>328,508 L1-dcache-prefetch-misses
>  146.585969976 seconds time elapsed
> 
>  Performance counter stats for './vring_bench_shadow 10':
> 
>425,123,067,750 L1-dcache-loads 
>  6,689,318,709 L1-dcache-load-misses #1.57% of all L1-dcache hits
> 62,747,525,005 L1-dcache-stores 
>  2,496,274,505 L1-dcache-store-misses
>  8,627,873,397 L1-dcache-prefetches
>146,729 L1-dcache-prefetch-misses
>  142.657327765 seconds time elapsed
> 
> 2.6% reduction in runtime; note that L1-dcache-load-misses reduced
> dramatically, 2 Billion(!) L1d misses saved.
> 
> Two-socket Intel Sandy Bridge(-EP) Xeon, 2.6 GHz; core turbo disabled
> =
> 
> * Producer / consumer bound to Hyperthread pairs:
> 
>  Performance counter stats for 

Re: [PATCH 07/10] KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct

2015-11-17 Thread Xiao Guangrong



On 11/12/2015 07:55 PM, Takuya Yoshikawa wrote:

@@ -1720,7 +1724,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
 * this feature. See the comments in kvm_zap_obsolete_pages().
 */
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-   sp->parent_ptes = 0;
+   sp->parent_ptes.val = 0;


The sp is allocated from kmem_cache_zalloc() so explicitly initialize it to 
zero is
not needed.


Re: [PATCH] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 20:03, Radim Krčmář wrote:
> 2015-11-09 10:46+0800, Feng Wu:
>> Use vector-hashing to handle lowest-priority interrupts for
>> posted-interrupts. As an example, modern Intel CPUs use this
>> method to handle lowest-priority interrupts.
> 
> (I don't think it's a good idea that the algorithm differs from non-PI
>  lowest priority delivery.  I'd make them both vector-hashing, which
>  would be "fun" to explain to people expecting round robin ...)

Yup, I would make it a module option.  Thanks very much Radim for
helping with the review.

Paolo

>> Signed-off-by: Feng Wu 
>> ---
>> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
>> +/*
>> + * This routine handles lowest-priority interrupts using vector-hashing
>> + * mechanism. As an example, modern Intel CPUs use this method to handle
>> + * lowest-priority interrupts.
>> + *
>> + * Here are the details of the vector-hashing mechanism:
>> + * 1. For lowest-priority interrupts, store all the possible destination
>> + *vCPUs in an array.
>> + * 2. Use "guest vector % max number of destination vCPUs" to find the right
>> + *destination vCPU in the array for the lowest-priority interrupt.
>> + */
> 
> (Is Skylake i7-6700 a modern Intel CPU?
>  I didn't manage to get hashing ... all interrupts always went to the
>  lowest APIC ID in the set :/
>  Is there a simple way to verify the algorithm?)
> 
>> +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
>> +  struct kvm_lapic_irq *irq)
>> +
>> +{
>> +unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>> +unsigned int dest_vcpus = 0;
>> +struct kvm_vcpu *vcpu;
>> +unsigned int i, mod, idx = 0;
>> +
>> +vcpu = kvm_intr_vector_hashing_dest_fast(kvm, irq);
>> +if (vcpu)
>> +return vcpu;
> 
> I think the rest of this function shouldn't be implemented:
>  - Shorthands are only for IPIs and hence don't need to be handled,
>  - Lowest priority physical broadcast is not supported,
>  - Lowest priority cluster logical broadcast is not supported,
>  - No point in optimizing mixed xAPIC and x2APIC mode,
>  - The rest is handled by kvm_intr_vector_hashing_dest_fast().
>(Even lowest priority flat logical "broadcast".)
>  - We do the work twice when vcpu == NULL means that there is no
>matching destination.
> 
> Is there a valid case that can be resolved by going through all vcpus?
> 
>> +
>> +memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
>> +
>> +kvm_for_each_vcpu(i, vcpu, kvm) {
>> +if (!kvm_apic_present(vcpu))
>> +continue;
>> +
>> +if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
>> +irq->dest_id, irq->dest_mode))
>> +continue;
>> +
>> +__set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
>> +dest_vcpus++;
>> +}
>> +
>> +if (dest_vcpus == 0)
>> +return NULL;
>> +
>> +mod = irq->vector % dest_vcpus;
>> +
>> +for (i = 0; i <= mod; i++) {
>> +idx = find_next_bit(dest_vcpu_bitmap, KVM_MAX_VCPUS, idx) + 1;
>> +BUG_ON(idx >= KVM_MAX_VCPUS);
>> +}
>> +
>> +return kvm_get_vcpu(kvm, idx - 1);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
>> +
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> @@ -816,6 +816,63 @@ out:
>> +struct kvm_vcpu *kvm_intr_vector_hashing_dest_fast(struct kvm *kvm,
>> +   struct kvm_lapic_irq *irq)
> 
> We now have three very similar functions :(
> 
>   kvm_irq_delivery_to_apic_fast
>   kvm_intr_is_single_vcpu_fast
>   kvm_intr_vector_hashing_dest_fast
> 
> By utilizing the gcc optimizer, they can be merged without introducing
> many instructions to the hot path, kvm_irq_delivery_to_apic_fast.
> (I would eventually do it, so you can save time by ignoring this.)
> 
> Thanks.
> 


Re: [PATCH] kvm/vmx: EPTP switching test

2015-11-17 Thread Paolo Bonzini


On 17/11/2015 02:45, Zhang, Yang Z wrote:
> We have a different version in hand which is using separate EPTP.

Can you say in advance what you are using EPTP switching for?  Offlist
if necessary.

Paolo


Re: [PATCH] kvm/vmx: EPTP switching test

2015-11-17 Thread Paolo Bonzini


On 16/11/2015 19:18, Radim Krčmář wrote:
>> > No idea how would I even test it, so I'm not interested in #VE at this
>> > point.  If you are - go ahead and post a patch for that on top though,
>> > why not.
> I thought that it's going to be simpler to provide functionality (that
> utilizes eptp switching) to the guest through #VE, which probably isn't
> true as I think more about it.  (Not interested in implementing it :])

#VE and EPTP switching are distinct features, one does not imply the other.

Unfortunately, EPTP switching is designed for a very specific use case
where the hypervisor is effectively part of the kernel, and the kernel
is trusted to some extent.  Examples include antivirus software and
virtual machines.  Antiviruses do use VMFUNC; as far as I know, that's
the only current use case of it
(https://embedded.communities.intel.com/community/en/applications/blog/2013/06/13/roving-reporter-enhancing-retail-security-and-manageability-with-4th-generation-intel-core-processors).

So I'm against this patch, but only because I'm not sure why KVM would
ever use EPTP switching in its current incarnation.  The guest kernel is
absolutely not trusted by KVM.

Paolo