Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-05-22 Thread Gleb Natapov
On Tue, May 21, 2013 at 03:26:18PM -0700, Nakajima, Jun wrote:
 On Tue, May 21, 2013 at 4:05 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
  On 05/21/2013 05:01 PM, Gleb Natapov wrote:
  On Tue, May 21, 2013 at 04:30:13PM +0800, Xiao Guangrong wrote:
  @@ -772,6 +810,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct 
  kvm_vcpu *vcpu, gva_t vaddr,
 
 return gpa;
   }
  +#endif
 
  Strange!
 
  Why does nested ept not need these functions? How to emulate the 
  instruction faulted on L2?
 
  Sorry, i misunderstood it. Have found the reason out.
 
  You can write it down here for future reviewers :)
 
  Okay.
 
  The functions used to translate L2's gva to L1's gpa are
  paging32_gva_to_gpa_nested and paging64_gva_to_gpa_nested, which are
  created by PTTYPE == 32 and PTTYPE == 64.
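
For reference: paging_tmpl.h is included several times from arch/x86/kvm/mmu.c
with different PTTYPE values, and its FNAME() macro stamps out one set of
functions per inclusion. A simplified sketch of the pattern (not the verbatim
source):

/* mmu.c */
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE

#define PTTYPE 32
#include "paging_tmpl.h"
#undef PTTYPE

/* paging_tmpl.h */
#if PTTYPE == 64
 #define FNAME(name) paging64_##name
#elif PTTYPE == 32
 #define FNAME(name) paging32_##name
#endif

so FNAME(gva_to_gpa_nested) expands to paging64_gva_to_gpa_nested or
paging32_gva_to_gpa_nested depending on which inclusion generated it.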
 
 
 
 Back to your comments on PT_MAX_FULL_LEVELS:
  + #ifdef CONFIG_X86_64
  + #define PT_MAX_FULL_LEVELS 4
  + #define CMPXCHG cmpxchg
  + #else
  + #define CMPXCHG cmpxchg64
  +#define PT_MAX_FULL_LEVELS 2
 I don't think we need to support nEPT on 32-bit hosts.  So, I plan to
 remove such code. What do you think?
 
Why shouldn't we support nEPT on 32-bit hosts?

--
Gleb.


Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Gleb Natapov
On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
 On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
   Any pages with stale information will be zapped by kvm_mmu_zap_all().
   When that happens, page faults will take place which will automatically 
   use the new generation number.
   
   So still not clear why is this necessary.
   
  This is not, strictly speaking, necessary, but it is the sane thing to do.
  You cannot update page's generation number to prevent it from been
  destroyed since after kvm_mmu_zap_all() completes stale ptes in the
  shadow page may point to now deleted memslot. So why build shadow page
  table with a page that is in a process of been destroyed?
 
 OK, can this be introduced separately, in a later patch, with separate
 justification, then?
 
 Xiao please have the first patches of the patchset focus on the problem
 at hand: fix long mmu_lock hold times.
 
  Not sure what you mean again. We flush TLB once before entering this 
  function.
  kvm_reload_remote_mmus() does this for us, no?
 
 kvm_reload_remote_mmus() is used as an optimization, its separate from the
 problem solution.
 
   
   What was suggested was... go to phrase which starts with The only purpose
   of the generation number should be to.
   
   The comment quoted here does not match that description.
   
  The comment describes what code does and in this it is correct.
  
  You propose to not reload roots right away and do it only when root sp
  is encountered, right? So my question is what's the point? There are,
  obviously, root sps with invalid generation number at this point, so
  reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
  do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
  invalid and obsolete sps as I proposed in one of my email?
 
 Sure. But Xiao please introduce that TLB collapsing optimization as a
 later patch, so we can reason about it in a more organized fashion.

So, if I understand correctly, you are asking to move is_obsolete_sp()
check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
become nop. But I question the need to zap all shadow pages tables there
in the first place, why kvm_flush_remote_tlbs() is not enough?
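
For readers following along, the is_obsolete_sp() check being discussed is
essentially a generation-number comparison; roughly (simplified from the
patchset, not a verbatim quote):

static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* A shadow page is obsolete if it was created under an older
	 * mmu_valid_gen than the VM's current one. */
	return unlikely(sp->mmu_valid_gen != kvm->mmu_valid_gen);
}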

--
Gleb.


Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-22 Thread Vadim Rozenfeld


- Original Message -
From: Marcelo Tosatti mtosa...@redhat.com
To: Vadim Rozenfeld vroze...@redhat.com
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Wednesday, May 22, 2013 10:50:46 AM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Sun, May 19, 2013 at 05:06:37PM +1000, Vadim Rozenfeld wrote:
 The following patch allows to activate a partition reference 
 time enlightenment that is based on the host platform's support 
 for an Invariant Time Stamp Counter (iTSC).
 NOTE: This code will survive migration due to lack of VM stop/resume
 handlers, when offset, scale and sequence should be
 readjusted. 
 
 ---
  arch/x86/kvm/x86.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 9645dab..b423fe4 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1838,7 +1838,6 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 
 msr, u64 data)
   u64 gfn;
   unsigned long addr;
   HV_REFERENCE_TSC_PAGE tsc_ref;
 - tsc_ref.TscSequence = 0;
   if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
   kvm->arch.hv_tsc_page = data;
   break;
 @@ -1848,6 +1847,11 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
 u32 msr, u64 data)
   HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
   if (kvm_is_error_hva(addr))
   return 1;
 + tsc_ref.TscSequence =
 + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;

1) You want NONSTOP_TSC (see 40fb1715 commit) which matches INVARIANT TSC.
[VR]
Thank you for reviewing. Will fix it.
2) TscSequence should increase? 
This field serves as a sequence number that is incremented whenever...
[VR]
Yes, on every VM resume, including migration. After migration we also need
to recalculate scale and adjust offset. 
3) 0x is the value for invalid source of reference time?
[VR] Yes, on boot-up. In this case the guest will go with PMTimer (not sure
about HPET, but I can check). But if we set the sequence to 0x after
migration - it probably will not work.

 + tsc_ref.TscScale =
 + ((1LL << 32) / vcpu->arch.virtual_tsc_khz) << 32;
 + tsc_ref.TscOffset = 0;
   if (__copy_to_user((void __user *)addr, &tsc_ref,
 sizeof(tsc_ref)))
   return 1;
   mark_page_dirty(kvm, gfn);
 -- 
 1.8.1.2
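
For context, the TscSequence/TscScale/TscOffset fields discussed above are
consumed by the guest; a sketch of the guest-side computation based on the
Hyper-V spec (not part of this patch; read_guest_tsc() is a hypothetical
rdtsc helper, and the 128-bit multiply uses a GCC extension for brevity):

/* Partition reference time in 100ns units, valid while TscSequence is
 * unchanged and not the invalid marker. */
static u64 read_reference_time(volatile HV_REFERENCE_TSC_PAGE *p)
{
	u32 seq;
	u64 tsc, scale;
	s64 offset;

	do {
		seq    = p->TscSequence; /* invalid => fall back to HV_X64_MSR_TIME_REF_COUNT */
		tsc    = read_guest_tsc();
		scale  = p->TscScale;
		offset = p->TscOffset;
	} while (p->TscSequence != seq);

	return (u64)(((unsigned __int128)tsc * scale) >> 64) + offset;
}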


Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter

2013-05-22 Thread Vadim Rozenfeld


- Original Message -
From: Marcelo Tosatti mtosa...@redhat.com
To: Vadim Rozenfeld vroze...@redhat.com
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Wednesday, May 22, 2013 10:46:14 AM
Subject: Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter

On Sun, May 19, 2013 at 05:06:36PM +1000, Vadim Rozenfeld wrote:
 Signed-off: Peter Lieven p...@dlh.net
 Signed-off: Gleb Natapov g...@redhat.com
 Signed-off: Vadim Rozenfeld vroze...@redhat.com
 
 v1 -> v2
 1. mark TSC page dirty as suggested by 
 Eric Northup digitale...@google.com and Gleb
 2. disable local irq when calling get_kernel_ns, 
 as it was done by Peter Lieven p...@dlhnet.de
 3. move check for TSC page enable from second patch
 to this one.
  
 ---
  arch/x86/include/asm/kvm_host.h|  2 ++
  arch/x86/include/uapi/asm/hyperv.h | 14 ++
  arch/x86/kvm/x86.c | 39 
 +-
  include/uapi/linux/kvm.h   |  1 +
  4 files changed, 55 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 3741c65..f0fee35 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -575,6 +575,8 @@ struct kvm_arch {
   /* fields used by HYPER-V emulation */
   u64 hv_guest_os_id;
   u64 hv_hypercall;
 + u64 hv_ref_count;
 + u64 hv_tsc_page;
  
   #ifdef CONFIG_KVM_MMU_AUDIT
   int audit_point;
 diff --git a/arch/x86/include/uapi/asm/hyperv.h 
 b/arch/x86/include/uapi/asm/hyperv.h
 index b80420b..890dfc3 100644
 --- a/arch/x86/include/uapi/asm/hyperv.h
 +++ b/arch/x86/include/uapi/asm/hyperv.h
 @@ -136,6 +136,9 @@
  /* MSR used to read the per-partition time reference counter */
  #define HV_X64_MSR_TIME_REF_COUNT  0x40000020
  
 +/* A partition's reference time stamp counter (TSC) page */
 +#define HV_X64_MSR_REFERENCE_TSC 0x40000021
 +
  /* Define the virtual APIC registers */
  #define HV_X64_MSR_EOI   0x40000070
  #define HV_X64_MSR_ICR   0x40000071
 @@ -179,6 +182,9 @@
  #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK \
  (~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
  
 +#define HV_X64_MSR_TSC_REFERENCE_ENABLE  0x00000001
 +#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT   12
 +
  #define HV_PROCESSOR_POWER_STATE_C0  0
  #define HV_PROCESSOR_POWER_STATE_C1  1
  #define HV_PROCESSOR_POWER_STATE_C2  2
 @@ -191,4 +197,12 @@
  #define HV_STATUS_INVALID_ALIGNMENT  4
  #define HV_STATUS_INSUFFICIENT_BUFFERS   19
  
 +typedef struct _HV_REFERENCE_TSC_PAGE {
 + __u32 TscSequence;
 + __u32 Rserved1;
 + __u64 TscScale;
 + __s64  TscOffset;
 +} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
 +
 +
  #endif
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 8d28810..9645dab 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -843,7 +843,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  static u32 msrs_to_save[] = {
   MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
   MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, not needed.
 - HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 + HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_TIME_REF_COUNT,
   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
   MSR_KVM_PV_EOI_EN,
   MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 @@ -1788,6 +1788,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
   switch (msr) {
   case HV_X64_MSR_GUEST_OS_ID:
   case HV_X64_MSR_HYPERCALL:
 + case HV_X64_MSR_REFERENCE_TSC:
 + case HV_X64_MSR_TIME_REF_COUNT:
   r = true;
   break;
   }
 @@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
 u32 msr, u64 data)
   if (__copy_to_user((void __user *)addr, instructions, 4))
   return 1;
   kvm->arch.hv_hypercall = data;
 + local_irq_disable();
 + kvm->arch.hv_ref_count = get_kernel_ns();
 + local_irq_enable();
 + break;

local_irq_disable/local_irq_enable not needed.


What is the reasoning behind reading this time value at msr write time?
[VR] Windows writes this MSR only once, during HAL initialization.
So, I decided to treat this call as a partition create event.

 + }
 + case HV_X64_MSR_REFERENCE_TSC: {
 + u64 gfn;
 + unsigned long addr;
 + HV_REFERENCE_TSC_PAGE tsc_ref;
 + tsc_ref.TscSequence = 0;
 + if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
 + kvm->arch.hv_tsc_page = data;
 + break;
 + }
 + gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
 + addr = gfn_to_hva(kvm, data 
 + 
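
For context, the hv_ref_count value captured above at MSR-write time is
presumably consumed by the read side of HV_X64_MSR_TIME_REF_COUNT elsewhere
in the patch (not quoted here); an illustrative sketch of what that read
path would look like, not the exact patch text:

	case HV_X64_MSR_TIME_REF_COUNT:
		/* The Hyper-V reference counter ticks at 10 MHz (100ns). */
		data = div_u64(get_kernel_ns() - kvm->arch.hv_ref_count, 100);
		break;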

Re: [PATCH v4 21/32] arm64: KVM: Build system integration

2013-05-22 Thread Catalin Marinas
On Tue, May 21, 2013 at 05:09:47PM +0100, Paolo Bonzini wrote:
 
 
 - Messaggio originale -
  Da: Catalin Marinas catalin.mari...@arm.com
  A: Marc Zyngier marc.zyng...@arm.com
  Cc: linux-arm-ker...@lists.infradead.org, kvm...@lists.cs.columbia.edu, 
  kvm@vger.kernel.org, Will Deacon
  will.dea...@arm.com, pbonz...@redhat.com, g...@redhat.com, Christopher 
  Covington c...@codeaurora.org
  Inviato: Martedì, 21 maggio 2013 17:42:27
  Oggetto: Re: [PATCH v4 21/32] arm64: KVM: Build system integration
  
  On Tue, May 14, 2013 at 03:13:49PM +0100, Marc Zyngier wrote:
   --- /dev/null
   +++ b/arch/arm64/kvm/Kconfig
  ...
   +config KVM_ARM_VGIC
   +bool
 + depends on KVM_ARM_HOST && OF
   + select HAVE_KVM_IRQCHIP
   + ---help---
   +   Adds support for a hardware assisted, in-kernel GIC emulation.
   +
   +config KVM_ARM_TIMER
   +bool
   + depends on KVM_ARM_VGIC
   + select HAVE_KVM_IRQCHIP
   + ---help---
   +   Adds support for the Architected Timers in virtual machines
   +
   +source drivers/virtio/Kconfig
  
  Shouldn't the above configs go to drivers/virtio/Kconfig as well (or
  drivers/virtio/arm/Kconfig)?
 
 virtio doesn't mean virtual versions of devices that also exist in
 hardware; those are indeed in arch/ARCH/kvm.  virtio is a specific bus
 for paravirtualized devices.  It is not KVM-specific and the code runs
 in the guest (whereas arch/arm64/kvm/Kconfig is host code).

You are right. What I meant was kvm/virt/arm/Kconfig (the place where
the vgic and timer driver goes).

-- 
Catalin


Re: [PATCH v4 21/32] arm64: KVM: Build system integration

2013-05-22 Thread Paolo Bonzini
Il 22/05/2013 10:42, Catalin Marinas ha scritto:
 On Tue, May 21, 2013 at 05:09:47PM +0100, Paolo Bonzini wrote:


 - Messaggio originale -
 Da: Catalin Marinas catalin.mari...@arm.com
 A: Marc Zyngier marc.zyng...@arm.com
 Cc: linux-arm-ker...@lists.infradead.org, kvm...@lists.cs.columbia.edu, 
 kvm@vger.kernel.org, Will Deacon
 will.dea...@arm.com, pbonz...@redhat.com, g...@redhat.com, Christopher 
 Covington c...@codeaurora.org
 Inviato: Martedì, 21 maggio 2013 17:42:27
 Oggetto: Re: [PATCH v4 21/32] arm64: KVM: Build system integration

 On Tue, May 14, 2013 at 03:13:49PM +0100, Marc Zyngier wrote:
 --- /dev/null
 +++ b/arch/arm64/kvm/Kconfig
 ...
 +config KVM_ARM_VGIC
 +bool
 +  depends on KVM_ARM_HOST && OF
 +  select HAVE_KVM_IRQCHIP
 +  ---help---
 +Adds support for a hardware assisted, in-kernel GIC emulation.
 +
 +config KVM_ARM_TIMER
 +bool
 +  depends on KVM_ARM_VGIC
 +  select HAVE_KVM_IRQCHIP
 +  ---help---
 +Adds support for the Architected Timers in virtual machines
 +
 +source drivers/virtio/Kconfig

 Shouldn't the above configs go to drivers/virtio/Kconfig as well (or
 drivers/virtio/arm/Kconfig)?

 virtio doesn't mean virtual versions of devices that also exist in
 hardware; those are indeed in arch/ARCH/kvm.  virtio is a specific bus
 for paravirtualized devices.  It is not KVM-specific and the code runs
 in the guest (whereas arch/arm64/kvm/Kconfig is host code).
 
 You are right. What I meant was kvm/virt/arm/Kconfig (the place where
 the vgic and timer driver goes).

Ah yeah, that could make sense.

Paolo



Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Xiao Guangrong
On 05/22/2013 02:34 PM, Gleb Natapov wrote:
 On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
 On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
 Any pages with stale information will be zapped by kvm_mmu_zap_all().
 When that happens, page faults will take place which will automatically 
 use the new generation number.

 So still not clear why is this necessary.

 This is not, strictly speaking, necessary, but it is the sane thing to do.
 You cannot update page's generation number to prevent it from been
 destroyed since after kvm_mmu_zap_all() completes stale ptes in the
 shadow page may point to now deleted memslot. So why build shadow page
 table with a page that is in a process of been destroyed?

 OK, can this be introduced separately, in a later patch, with separate
 justification, then?

 Xiao please have the first patches of the patchset focus on the problem
 at hand: fix long mmu_lock hold times.

 Not sure what you mean again. We flush TLB once before entering this 
 function.
 kvm_reload_remote_mmus() does this for us, no?

 kvm_reload_remote_mmus() is used as an optimization, its separate from the
 problem solution.


 What was suggested was... go to phrase which starts with The only purpose
 of the generation number should be to.

 The comment quoted here does not match that description.

 The comment describes what code does and in this it is correct.

 You propose to not reload roots right away and do it only when root sp
 is encountered, right? So my question is what's the point? There are,
 obviously, root sps with invalid generation number at this point, so
 reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
 do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
 invalid and obsolete sps as I proposed in one of my email?

 Sure. But Xiao please introduce that TLB collapsing optimization as a
 later patch, so we can reason about it in a more organized fashion.
 
 So, if I understand correctly, you are asking to move is_obsolete_sp()
 check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
 kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
 we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
 call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
 become nop. But I question the need to zap all shadow pages tables there
 in the first place, why kvm_flush_remote_tlbs() is not enough?

I do not know too... I even do no know why kvm_flush_remote_tlbs
is needed. :(





Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Gleb Natapov
On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 02:34 PM, Gleb Natapov wrote:
  On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
  On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
  Any pages with stale information will be zapped by kvm_mmu_zap_all().
  When that happens, page faults will take place which will automatically 
  use the new generation number.
 
  So still not clear why is this necessary.
 
  This is not, strictly speaking, necessary, but it is the sane thing to do.
  You cannot update page's generation number to prevent it from been
  destroyed since after kvm_mmu_zap_all() completes stale ptes in the
  shadow page may point to now deleted memslot. So why build shadow page
  table with a page that is in a process of been destroyed?
 
  OK, can this be introduced separately, in a later patch, with separate
  justification, then?
 
  Xiao please have the first patches of the patchset focus on the problem
  at hand: fix long mmu_lock hold times.
 
  Not sure what you mean again. We flush TLB once before entering this 
  function.
  kvm_reload_remote_mmus() does this for us, no?
 
  kvm_reload_remote_mmus() is used as an optimization, its separate from the
  problem solution.
 
 
  What was suggested was... go to phrase which starts with The only 
  purpose
  of the generation number should be to.
 
  The comment quoted here does not match that description.
 
  The comment describes what code does and in this it is correct.
 
  You propose to not reload roots right away and do it only when root sp
  is encountered, right? So my question is what's the point? There are,
  obviously, root sps with invalid generation number at this point, so
  reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
  do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
  invalid and obsolete sps as I proposed in one of my email?
 
  Sure. But Xiao please introduce that TLB collapsing optimization as a
  later patch, so we can reason about it in a more organized fashion.
  
  So, if I understand correctly, you are asking to move is_obsolete_sp()
  check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
  kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
  we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
  call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
  become nop. But I question the need to zap all shadow pages tables there
  in the first place, why kvm_flush_remote_tlbs() is not enough?
 
 I do not know too... I even do no know why kvm_flush_remote_tlbs
 is needed. :(
We changed the content of an executable page, we need to flush instruction
cache of all vcpus to not use stale data, so my suggestion to call
kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
not instruction cache, but why kvm_reload_remote_mmus() would flush
instruction cache?

--
Gleb.


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Arnd Bergmann
On Thursday 16 May 2013, Michael S. Tsirkin wrote:
 This improves the might_fault annotations used
 by uaccess routines:
 
 1. The only reason uaccess routines might sleep
is if they fault. Make this explicit for
all architectures.
 2. Accesses (e.g through socket ops) to kernel memory
with KERNEL_DS like net/sunrpc does will never sleep.
   Remove an unconditional might_sleep in the inline
might_fault in kernel.h
(used when PROVE_LOCKING is not set).
 3. Accesses with pagefault_disable return EFAULT
but won't cause caller to sleep.
Check for that and avoid might_sleep when
PROVE_LOCKING is set.
 
 I'd like these changes to go in for the benefit of
 the vhost driver where we want to call socket ops
 under a spinlock, and fall back on slower thread handler
 on error.

Hi Michael,

I have recently stumbled over a related topic, which is the highly
inconsistent placement of might_fault() or might_sleep() in certain
classes of uaccess functions. Your patches seem completely reasonable,
but it would be good to also fix the other problem, at least on
the architectures we most care about.

Given the most commonly used functions and a couple of architectures
I'm familiar with, these are the ones that currently call might_fault()

                    x86-32  x86-64  arm   arm64  powerpc  s390  generic
copy_to_user          -       x      -      -       -       x      x
copy_from_user        -       x      -      -       -       x      x
put_user              x       x      x      x       x       x      x
get_user              x       x      x      x       x       x      x
__copy_to_user        x       x      -      -       x       -      -
__copy_from_user      x       x      -      -       x       -      -
__put_user            -       -      x      -       x       -      -
__get_user            -       -      x      -       x       -      -

WTF?

Calling might_fault() for every __get_user/__put_user is rather expensive
because it turns what should be a single instruction (plus fixup) into an
external function call.

My feeling is that we should do might_fault() only in access_ok() to get
the right balance.
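
A rough sketch of what that would look like on x86 (illustrative only, based
on the existing x86 access_ok()/__range_not_ok() definitions of that era,
with just the annotation added; this is not a posted patch):

/* arch/x86/include/asm/uaccess.h, hypothetical variant */
#define access_ok(type, addr, size)				\
({								\
	might_fault();						\
	likely(!__range_not_ok(addr, size, user_addr_max()));	\
})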

Arnd


Re: [PATCH] kvm: add detail error message when fail to add ioeventfd

2013-05-22 Thread Stefan Hajnoczi
On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:
 I try to hotplug 28 * 8 multiple-function devices to guest with
 old host kernel, ioeventfds in host kernel will be exhausted, then
 qemu fails to allocate ioeventfds for blk/nic devices.
 
 It's better to add detail error here.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  kvm-all.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)

It would be nice to make kvm bus scalable so that the hardcoded
in-kernel I/O device limit can be lifted.

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com


Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Xiao Guangrong
On 05/22/2013 04:54 PM, Gleb Natapov wrote:
 On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 02:34 PM, Gleb Natapov wrote:
 On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
 On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
 Any pages with stale information will be zapped by kvm_mmu_zap_all().
 When that happens, page faults will take place which will automatically 
 use the new generation number.

 So still not clear why is this necessary.

 This is not, strictly speaking, necessary, but it is the sane thing to do.
 You cannot update page's generation number to prevent it from been
 destroyed since after kvm_mmu_zap_all() completes stale ptes in the
 shadow page may point to now deleted memslot. So why build shadow page
 table with a page that is in a process of been destroyed?

 OK, can this be introduced separately, in a later patch, with separate
 justification, then?

 Xiao please have the first patches of the patchset focus on the problem
 at hand: fix long mmu_lock hold times.

 Not sure what you mean again. We flush TLB once before entering this 
 function.
 kvm_reload_remote_mmus() does this for us, no?

 kvm_reload_remote_mmus() is used as an optimization, its separate from the
 problem solution.


 What was suggested was... go to phrase which starts with The only 
 purpose
 of the generation number should be to.

 The comment quoted here does not match that description.

 The comment describes what code does and in this it is correct.

 You propose to not reload roots right away and do it only when root sp
 is encountered, right? So my question is what's the point? There are,
 obviously, root sps with invalid generation number at this point, so
 reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
 do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
 invalid and obsolete sps as I proposed in one of my email?

 Sure. But Xiao please introduce that TLB collapsing optimization as a
 later patch, so we can reason about it in a more organized fashion.

 So, if I understand correctly, you are asking to move is_obsolete_sp()
 check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
 kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
 we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
 call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
 become nop. But I question the need to zap all shadow pages tables there
 in the first place, why kvm_flush_remote_tlbs() is not enough?

 I do not know too... I even do no know why kvm_flush_remote_tlbs
 is needed. :(
 We changed the content of an executable page, we need to flush instruction
 cache of all vcpus to not use stale data, so my suggestion to call

I thought the reason is about icache too but icache is automatically
flushed on x86, we only need to invalidate the prefetched instructions by
executing a serializing operation.

See the SDM in the chapter of
8.1.3 Handling Self- and Cross-Modifying Code

 kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
 not instruction cache, but why kvm_reload_remote_mmus() would flush
 instruction cache?

kvm_reload_remote_mmus do not have any help i think.

I find that this change is introduced by commit: 7aa81cc0
and I have added Anthony in the CC.

I also find some discussions related to calling
kvm_reload_remote_mmus():


 But if the instruction is architecture dependent, and you run on the
 wrong architecture, now you have to patch many locations at fault time,
 introducing some nasty runtime code / data cache overlap performance
 problems.  Granted, they go away eventually.


We're addressing that by blowing away the shadow cache and holding the
big kvm lock to ensure SMP safety.  Not a great thing to do from a
performance perspective but the whole point of patching is that the cost
is amortized.

(http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)

But i can not understand...




Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep

2013-05-22 Thread Michael S. Tsirkin
On Tue, May 21, 2013 at 01:57:34PM +0200, Peter Zijlstra wrote:
 On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-   might_sleep();
   
   This removes potential resched points for PREEMPT_VOLUNTARY -- was that
   intentional?
  
  No it's a bug. Thanks for pointing this out.
  OK so I guess it should be might_sleep_if(!in_atomic())
  and this means might_fault would have to move from linux/kernel.h to
  linux/uaccess.h, since in_atomic() is in linux/hardirq.h
  
  Makes sense?
 
 So the only difference between PROVE_LOCKING and not should be the
 might_lock_read() thing; so how about something like this?
 
 ---
  include/linux/kernel.h  |  7 ++-
  include/linux/uaccess.h | 26 ++
  mm/memory.c | 14 ++
  3 files changed, 30 insertions(+), 17 deletions(-)
 
 diff --git a/include/linux/kernel.h b/include/linux/kernel.h
 index e96329c..70812f4 100644
 --- a/include/linux/kernel.h
 +++ b/include/linux/kernel.h
 @@ -194,12 +194,9 @@ extern int _cond_resched(void);
   })
  
  #ifdef CONFIG_PROVE_LOCKING
 -void might_fault(void);
 +void might_fault_lockdep(void);
  #else
 -static inline void might_fault(void)
 -{
 - might_sleep();
 -}
 +static inline void might_fault_lockdep(void) { }
  #endif
  
  extern struct atomic_notifier_head panic_notifier_list;
 diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
 index 5ca0951..50a2cc9 100644
 --- a/include/linux/uaccess.h
 +++ b/include/linux/uaccess.h
 @@ -38,6 +38,32 @@ static inline void pagefault_enable(void)
   preempt_check_resched();
  }
  
 +static inline bool __can_fault(void)
 +{
 + /*
 +  * Some code (nfs/sunrpc) uses socket ops on kernel memory while
 +  * holding the mmap_sem, this is safe because kernel memory doesn't
 +  * get paged out, therefore we'll never actually fault, and the
 +  * below annotations will generate false positives.
 +  */
 + if (segment_eq(get_fs(), KERNEL_DS))
 + return false;
 +
 + if (in_atomic() /* || pagefault_disabled() */)

One question here: I'm guessing you put this comment here
for illustrative purposes, implying code that will
be enabled in -rt?
We don't want it upstream I think, right?


 + return false;
 +
 + return true;
 +}
 +
 +static inline void might_fault(void)
 +{
 + if (!__can_fault())
 + return;
 +
 + might_sleep();
 + might_fault_lockdep();
 +}
 +
  #ifndef ARCH_HAS_NOCACHE_UACCESS
  
  static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 diff --git a/mm/memory.c b/mm/memory.c
 index 6dc1882..266610c 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -4211,19 +4211,9 @@ void print_vma_addr(char *prefix, unsigned long ip)
  }
  
  #ifdef CONFIG_PROVE_LOCKING
 -void might_fault(void)
 +void might_fault_lockdep(void)
  {
   /*
 -  * Some code (nfs/sunrpc) uses socket ops on kernel memory while
 -  * holding the mmap_sem, this is safe because kernel memory doesn't
 -  * get paged out, therefore we'll never actually fault, and the
 -  * below annotations will generate false positives.
 -  */
 - if (segment_eq(get_fs(), KERNEL_DS))
 - return;
 -
 - might_sleep();
 - /*
* it would be nicer only to annotate paths which are not under
* pagefault_disable, however that requires a larger audit and
* providing helpers like get_user_atomic.
 @@ -4231,7 +4221,7 @@ void might_fault(void)
   if (!in_atomic() && current->mm)
   might_lock_read(&current->mm->mmap_sem);
  }
 -EXPORT_SYMBOL(might_fault);
 +EXPORT_SYMBOL(might_fault_lockdep);
  #endif
  
  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
 On Thursday 16 May 2013, Michael S. Tsirkin wrote:
  This improves the might_fault annotations used
  by uaccess routines:
  
  1. The only reason uaccess routines might sleep
 is if they fault. Make this explicit for
 all architectures.
  2. Accesses (e.g through socket ops) to kernel memory
 with KERNEL_DS like net/sunrpc does will never sleep.
  Remove an unconditional might_sleep in the inline
 might_fault in kernel.h
 (used when PROVE_LOCKING is not set).
  3. Accesses with pagefault_disable return EFAULT
 but won't cause caller to sleep.
 Check for that and avoid might_sleep when
 PROVE_LOCKING is set.
  
  I'd like these changes to go in for the benefit of
  the vhost driver where we want to call socket ops
  under a spinlock, and fall back on slower thread handler
  on error.
 
 Hi Michael,
 
 I have recently stumbled over a related topic, which is the highly
 inconsistent placement of might_fault() or might_sleep() in certain
 classes of uaccess functions. Your patches seem completely reasonable,
 but it would be good to also fix the other problem, at least on
 the architectures we most care about.
 
 Given the most commonly used functions and a couple of architectures
 I'm familiar with, these are the ones that currently call might_fault()
 
                      x86-32  x86-64  arm   arm64  powerpc  s390  generic
  copy_to_user          -       x      -      -       -       x      x
  copy_from_user        -       x      -      -       -       x      x
  put_user              x       x      x      x       x       x      x
  get_user              x       x      x      x       x       x      x
  __copy_to_user        x       x      -      -       x       -      -
  __copy_from_user      x       x      -      -       x       -      -
  __put_user            -       -      x      -       x       -      -
  __get_user            -       -      x      -       x       -      -
 
 WTF?

Yea.

 Calling might_fault() for every __get_user/__put_user is rather expensive
 because it turns what should be a single instruction (plus fixup) into an
 external function call.

You mean _cond_resched with CONFIG_PREEMPT_VOLUNTARY? Or do you
mean when we build with PROVE_LOCKING?

 My feeling is that we should do might_fault() only in access_ok() to get
 the right balance.
 
   Arnd

Well access_ok is currently non-blocking I think - we'd have to audit
all callers. There are some 200 of these in drivers and some
1000 total so ... a bit risky.

-- 
MST


Re: provide vhost thread per virtqueue for forwarding scenario

2013-05-22 Thread Zang Hongyong

On 2013/5/20 15:43, Michael S. Tsirkin wrote:

On Mon, May 20, 2013 at 02:11:19AM +, Qinchuanyu wrote:

Vhost thread provide both tx and rx ability for virtio-net.
In the forwarding scenarios, tx and rx share the vhost thread, and throughput 
is limited by single thread.

So I did a patch for provide vhost thread per virtqueue, not per vhost_net.

Of course, multi-queue virtio-net is final solution, but it require new version 
of virtio-net working in guest.
If you have to work with suse10,11, redhat 5.x as guest, and want to improve 
the forward throughput,
using vhost thread per queue seems to be the only solution.

Why is it? If multi-queue works well for you, just update the drivers in
the guests that you care about. Guest driver backport is not so hard.

In my testing, performance of thread per vq varies: some workloads might
gain throughput but you get more IPIs and more scheduling overhead, so
you waste more host CPU per byte. As you create more VMs, this stops
being a win.


I did the test with kernel 3.0.27 and qemu-1.4.0, guest is suse11-sp2, and then 
two vhost thread provide
double tx/rx forwarding performance than a single vhost thread.
The virtqueue of vhost_blk is 1, so it still use one vhost thread without 
change.

Is there something wrong in this solution? If not, I would list patch later.

Best regards
King

Yes, I don't think we want to create threads even more aggressively
in all cases. I'm worried about scalability as it is.
I think we should explore a flexible approach, use a thread pool
(for example, a wq) to share threads between virtqueues,
switch to a separate thread only if there's free CPU and existing
threads are busy. Hopefully share threads between vhost instances too.
On Xen platform, network backend pv driver model has evolved to this 
way. Netbacks from all DomUs share a thread pool,

and thread number equal to cpu core number.
Is there any plan for kvm platform?







Re: provide vhost thread per virtqueue for forwarding scenario

2013-05-22 Thread Jason Wang
On 05/22/2013 05:59 PM, Zang Hongyong wrote:
 On 2013/5/20 15:43, Michael S. Tsirkin wrote:
 On Mon, May 20, 2013 at 02:11:19AM +, Qinchuanyu wrote:
 Vhost thread provide both tx and rx ability for virtio-net.
 In the forwarding scenarios, tx and rx share the vhost thread, and
 throughput is limited by single thread.

 So I did a patch for provide vhost thread per virtqueue, not per
 vhost_net.

 Of course, multi-queue virtio-net is final solution, but it require
 new version of virtio-net working in guest.
 If you have to work with suse10,11, redhat 5.x as guest, and want to
 improve the forward throughput,
 using vhost thread per queue seems to be the only solution.
 Why is it? If multi-queue works well for you, just update the drivers in
 the guests that you care about. Guest driver backport is not so hard.

 In my testing, performance of thread per vq varies: some workloads might
 gain throughput but you get more IPIs and more scheduling overhead, so
 you waste more host CPU per byte. As you create more VMs, this stops
 being a win.

 I did the test with kernel 3.0.27 and qemu-1.4.0, guest is
 suse11-sp2, and then two vhost thread provide
  double tx/rx forwarding performance than a single vhost thread.
 The virtqueue of vhost_blk is 1, so it still use one vhost thread
 without change.

 Is there something wrong in this solution? If not, I would list
 patch later.

 Best regards
 King
 Yes, I don't think we want to create threads even more aggressively
 in all cases. I'm worried about scalability as it is.
 I think we should explore a flexible approach, use a thread pool
 (for example, a wq) to share threads between virtqueues,
 switch to a separate thread only if there's free CPU and existing
 threads are busy. Hopefully share threads between vhost instances too.
 On Xen platform, network backend pv driver model has evolved to this
 way. Netbacks from all DomUs share a thread pool,
  and thread number equal to cpu core number.
  Is there any plan for kvm platform?

There used to be two related RFCs for this: one is the multiple vhost
workers patch from Anthony, another is the per-cpu vhost thread patch from
Shirley. You can search the archives on netdev or kvm for the patches.





Re: provide vhost thread per virtqueue for forwarding scenario

2013-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 05:59:03PM +0800, Zang Hongyong wrote:
 On 2013/5/20 15:43, Michael S. Tsirkin wrote:
 On Mon, May 20, 2013 at 02:11:19AM +, Qinchuanyu wrote:
 Vhost thread provide both tx and rx ability for virtio-net.
 In the forwarding scenarios, tx and rx share the vhost thread, and 
 throughput is limited by single thread.
 
 So I did a patch for provide vhost thread per virtqueue, not per vhost_net.
 
 Of course, multi-queue virtio-net is final solution, but it require new 
 version of virtio-net working in guest.
 If you have to work with suse10,11, redhat 5.x as guest, and want to 
 improve the forward throughput,
 using vhost thread per queue seems to be the only solution.
 Why is it? If multi-queue works well for you, just update the drivers in
 the guests that you care about. Guest driver backport is not so hard.
 
 In my testing, performance of thread per vq varies: some workloads might
 gain throughput but you get more IPIs and more scheduling overhead, so
 you waste more host CPU per byte. As you create more VMs, this stops
 being a win.
 
 I did the test with kernel 3.0.27 and qemu-1.4.0, guest is suse11-sp2, and 
 then two vhost thread provide
 double tx/rx forwarding performance than a single vhost thread.
 The virtqueue of vhost_blk is 1, so it still use one vhost thread without 
 change.
 
 Is there something wrong in this solution? If not, I would list patch later.
 
 Best regards
 King
 Yes, I don't think we want to create threads even more aggressively
 in all cases. I'm worried about scalability as it is.
 I think we should explore a flexible approach, use a thread pool
 (for example, a wq) to share threads between virtqueues,
 switch to a separate thread only if there's free CPU and existing
 threads are busy. Hopefully share threads between vhost instances too.
 On Xen platform, network backend pv driver model has evolved to this
 way. Netbacks from all DomUs share a thread pool,
  and thread number equal to cpu core number.
  Is there any plan for kvm platform?

Shirley Ma had a patchset like this. Look it up:
'NUMA aware scheduling per vhost thread patch'

Unfortunately I don't think we can fix the thread number: if a thread
gets blocked because it's accessing swapped-out memory for guest 1, we
must allow guest 2 to make progress.

But it shouldn't be too hard to fix: detect that
a thread is blocked and spawn a new one,
or pre-create a per-guest thread and bounce the work there.

 
 


[nVMX ] A couple more tests

2013-05-22 Thread Kashyap Chamarthy
Heya,

So, Gleb suggested (thank you) to try kvm unit tests in L1, here's some
(incomprehensive) detail. I also ran a defconfig in a 20 loop iteration
(in L2) where subsequent runs after the first one are from page cache.


1. A couple of unit tests w/ 64 bit compiled:

$ ./configure

$ time make
[...]
gcc -g -D__x86_64__ -I../include/x86 -m64 -O1 -MMD -MF x86/.init.d -g
-fomit-frame-pointer -Wall  -fno-stack-protector   -I. -std=gnu99
-ffreestanding -I lib -I lib/x86   -c -o x86/init.o x86/init.c
x86/init.c: In function ‘main’:
x86/init.c:110:1: warning: control reaches end of non-void function
[-Wreturn-type]
 }
 ^
gcc -g -D__x86_64__ -I../include/x86 -m64 -O1 -MMD -MF x86/.ini

real    0m14.358s
user    0m6.990s
sys     0m6.639s


= MSR test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
  -nographic -no-user-config -nodefaults -device \
  isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/msr.flat \

enabling apic
MSR_IA32_APICBASE: PASS
MSR_IA32_APICBASE: PASS
IA32_SYSENTER_CS: PASS
MSR_IA32_SYSENTER_ESP: PASS
IA32_SYSENTER_EIP: PASS
MSR_IA32_MISC_ENABLE: PASS
MSR_IA32_CR_PAT: PASS
MSR_FS_BASE: PASS
MSR_GS_BASE: PASS
MSR_KERNEL_GS_BASE: PASS
MSR_EFER: PASS
MSR_LSTAR: PASS
MSR_CSTAR: PASS
MSR_SYSCALL_MASK: PASS
MSR_*STAR eager loading: PASS
15 tests, 0 failures

real    0m0.525s
user    0m0.147s
sys     0m0.121s


= eventinj test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
  -nographic -no-user-config -nodefaults -device \
  isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/eventinj.flat
enabling apic
paging enabled
cr0 = 80010011
cr3 = 7fff000
cr4 = 20
Try to divide by 0
DE isr running divider is 0
Result is 150
DE exception: PASS
Try int 3
BP isr running
After int 3
BP exception: PASS
Try send vec 33 to itself
irq1 running
After vec 33 to itself
vec 33: PASS
Try int $33
irq1 running
After int $33
int $33: PASS
Try send vec 32 and 33 to itself
irq1 running
irq0 running
After vec 32 and 33 to itself
vec 32/33: PASS
Try send vec 32 and int $33
irq1 running
irq0 running
After vec 32 and int $33
vec 32/int $33: PASS
Try send vec 33 and 62 and mask one with TPR
irq1 running
After 33/62 TPR test
TPR: PASS
irq0 running
Try send NMI to itself
After NMI to itself
NMI: FAIL
Try int 33 with shadowed stack
irq1 running
After int 33 with shadowed stack
int 33 with shadowed stack: PASS

summary: 9 tests, 1 failures

real    0m0.586s
user    0m0.159s
sys     0m0.163s



2. A couple of kvm unit tests w/ 32-bit compiled:

$ yum install '*/stubs-32.h'
[...]

$ make ARCH=i386 clean all
[...]
gcc -g -D__i386__ -I
/usr/src/kernels/3.10.0-0.rc1.git5.2.fc20.x86_64/include
-I../include/x86 -m32 -O1 -MMD -MF x86/.eventinj.d -g
-fomit-frame-pointer -Wall  -fno-stack-protector   -I. -nostdlib -o
x86/eventinj.elf -Wl,-T,flat.lds x86/eventinj.o x86/cstart.o lib/libcflat.a
/usr/lib/gcc/x86_64-redhat-linux/4.8.0/32/libgcc.a
objcopy -O elf32-i386 x86/eventinj.elf x86/eventinj.flat
[...]


= eventinj test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
    -nographic -no-user-config -nodefaults -device \
    isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/eventinj.flat

[...]
After NMI to itself
NMI: FAIL
Try to divide by 0
PF running
DE isr running divider is 0
Result is 150
DE PF exceptions: PASS
Before NP test
PF running
NP isr running 400b5f err=18
irq1 running
After int33
NP PF exceptions: PASS
Try int 33 with shadowed stack
irq1 running
After int 33 with shadowed stack
int 33 with shadowed stack: PASS

summary: 14 tests, 1 failures

real    0m0.589s
user    0m0.188s
sys     0m0.127s



= MSR test =

$ time qemu-system-x86_64 -enable-kvm -device pc-testdev -serial stdio \
    -nographic -no-user-config -nodefaults -device \
    isa-debug-exit,iobase=0xf4,iosize=0x4 -kernel ./x86/msr.flat
enabling apic
MSR_IA32_APICBASE: PASS
MSR_IA32_APICBASE: PASS
IA32_SYSENTER_CS: PASS
MSR_IA32_SYSENTER_ESP: PASS
IA32_SYSENTER_EIP: PASS
MSR_IA32_MISC_ENABLE: PASS
MSR_IA32_CR_PAT: PASS
MSR_FS_BASE: PASS
MSR_GS_BASE: PASS
MSR_KERNEL_GS_BASE: PASS
MSR_LSTAR: PASS
MSR_CSTAR: PASS
MSR_SYSCALL_MASK: PASS
13 tests, 0 failures

real    0m0.499s
user    0m0.136s
sys     0m0.117s


3. 20 iterations of kernel compile time:

   
http://kashyapc.fedorapeople.org/virt/Haswell-VMCS-Shadowing-tests/kernel-build-times-20-iterations.txt

Average real (elapsed) time - 9m44.495s.

The result of of the above file is obtained after:

$ for i in {1..20}; do make clean; \
  time make -j 3; done 2>&1 | tee \
  defconfig-make-output-20-iterations-l2.txt

in L2 (nested guest).

And, then 

Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep

2013-05-22 Thread Peter Zijlstra
On Wed, May 22, 2013 at 12:47:09PM +0300, Michael S. Tsirkin wrote:
   
  +static inline bool __can_fault(void)
  +{
  +   /*
  +* Some code (nfs/sunrpc) uses socket ops on kernel memory while
  +* holding the mmap_sem, this is safe because kernel memory doesn't
  +* get paged out, therefore we'll never actually fault, and the
  +* below annotations will generate false positives.
  +*/
  +   if (segment_eq(get_fs(), KERNEL_DS))
  +   return false;
  +
  +   if (in_atomic() /* || pagefault_disabled() */)
 
 One question here: I'm guessing you put this comment here
 for illustrative purposes, implying code that will
 be enabled in -rt?
 We don't want it upstream I think, right?

Right, and as a reminder that when we do this we need to add a patch to
-rt. But yeah, we should have a look and see if its worth pulling those
patches from -rt into mainline in some way shape or form. They're big
but trivial IIRC.

I'm fine with you leaving that comment out though.. 


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Peter Zijlstra
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
 Calling might_fault() for every __get_user/__put_user is rather expensive
 because it turns what should be a single instruction (plus fixup) into an
 external function call.

We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP just like
might_sleep() is. I'm not sure there's a point to might_fault() when
might_sleep() is a NOP.
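
Concretely, hiding it behind CONFIG_DEBUG_ATOMIC_SLEEP would look something
like this (sketch only, mirroring how might_sleep() is gated; __might_fault()
here is a hypothetical out-of-line helper, not an existing function):

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define might_fault()	__might_fault(__FILE__, __LINE__)
#else
# define might_fault()	do { } while (0)
#endif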


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 12:19:16PM +0200, Peter Zijlstra wrote:
 On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
  Calling might_fault() for every __get_user/__put_user is rather expensive
  because it turns what should be a single instruction (plus fixup) into an
  external function call.
 
 We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP just like
 might_sleep() is. I'm not sure there's a point to might_fault() when
 might_sleep() is a NOP.

The patch that you posted gets pretty close.
E.g. I'm testing this now:
+#define might_fault() do { \
+   if (_might_fault()) \
+   __might_sleep(__FILE__, __LINE__, 0); \
+   might_resched(); \
+} while(0)

So if might_sleep is a NOP, __might_sleep and might_resched are NOPs
so compiler will optimize this all out.

However, in a related thread, you pointed out that might_sleep is not a NOP if
CONFIG_PREEMPT_VOLUNTARY is set, even without CONFIG_DEBUG_ATOMIC_SLEEP.

Do you think we should drop the preemption point in might_fault?
Only copy_XX_user?
Only __copy_XXX_user ?

-- 
MST


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Peter Zijlstra
On Wed, May 22, 2013 at 02:07:29PM +0300, Michael S. Tsirkin wrote:
 On Wed, May 22, 2013 at 12:19:16PM +0200, Peter Zijlstra wrote:
  On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
   Calling might_fault() for every __get_user/__put_user is rather expensive
   because it turns what should be a single instruction (plus fixup) into an
   external function call.
  
  We could hide it all behind CONFIG_DEBUG_ATOMIC_SLEEP just like
  might_sleep() is. I'm not sure there's a point to might_fault() when
  might_sleep() is a NOP.
 
 The patch that you posted gets pretty close.
 E.g. I'm testing this now:
 +#define might_fault() do { \
 +   if (_might_fault()) \
 +   __might_sleep(__FILE__, __LINE__, 0); \
 +   might_resched(); \
 +} while(0)
 
 So if might_sleep is a NOP, __might_sleep and might_resched are NOPs
 so compiler will optimize this all out.
 
 However, in a related thread, you pointed out that might_sleep is not a NOP if
 CONFIG_PREEMPT_VOLUNTARY is set, even without CONFIG_DEBUG_ATOMIC_SLEEP.

Oh crud yeah.. and you actually need that _might_fault() stuff for that
too. Bugger.

Yeah, I wouldn't know what the effects of dropping it (for the copy
functions) would be, VOLUNTARY isn't a preemption mode I ever use (even
though it seems most distros default to it).




[PATCH 1/1] KVM: add kvm_para_available to asm-generic/kvm_para.h

2013-05-22 Thread James Hogan
According to include/uapi/linux/kvm_para.h architectures should define
kvm_para_available, so add an implementation to asm-generic/kvm_para.h
which just returns false.

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Cc: Gleb Natapov g...@redhat.com
Cc: Arnd Bergmann a...@arndb.de
---
 include/asm-generic/kvm_para.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/asm-generic/kvm_para.h b/include/asm-generic/kvm_para.h
index 9d96605..fa25bec 100644
--- a/include/asm-generic/kvm_para.h
+++ b/include/asm-generic/kvm_para.h
@@ -18,4 +18,9 @@ static inline unsigned int kvm_arch_para_features(void)
return 0;
 }
 
+static inline bool kvm_para_available(void)
+{
+   return false;
+}
+
 #endif
-- 
1.8.1.2
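
A typical (purely illustrative) caller that this default is meant to satisfy
is guest-side setup code that bails out when no paravirt interface is
detected; the function and feature names below are hypothetical:

static void __init guest_paravirt_setup(void)
{
	if (!kvm_para_available())	/* the new stub makes this safely false */
		return;

	if (kvm_para_has_feature(KVM_FEATURE_EXAMPLE))	/* hypothetical feature bit */
		pr_info("paravirt feature available\n");
}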




Re: [PATCH v4 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.

2013-05-22 Thread Gleb Natapov
On Tue, May 21, 2013 at 01:54:49PM -0700, David Daney wrote:
 From: David Daney david.da...@cavium.com
 
 The initial patch set implementing MIPS KVM does not handle 64-bit
 guests or use of the FPU.  This patch set corrects these ABI issues,
 and does some very minor clean up.
 
Sanjay, does this look good to you?

What userspace is MIPS using for machine emulation? Are there corresponding
patches for userspace?

 Changes from v3: Use KVM_SET_ONE_REG instead of KVM_SET_MSRS.  Added
  ENOIOCTLCMD patch.
 
 Changes from v2: Split into five parts, no code change.
 
 David Daney (6):
   mips/kvm: Fix ABI for use of FPU.
   mips/kvm: Fix ABI for use of 64-bit registers.
   mips/kvm: Fix name of gpr field in struct kvm_regs.
   mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in
 kvm_arch_vcpu_ioctl_{s,g}et_regs
   mips/kvm: Fix ABI by moving manipulation of CP0 registers to
 KVM_{G,S}ET_ONE_REG
   mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.
 
  arch/mips/include/asm/kvm.h  | 137 
  arch/mips/include/asm/kvm_host.h |   4 -
  arch/mips/kvm/kvm_mips.c | 118 +++---
  arch/mips/kvm/kvm_trap_emul.c| 338 
 ++-
  4 files changed, 430 insertions(+), 167 deletions(-)
 
 -- 
 1.7.11.7
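
For reference, the KVM_SET_ONE_REG interface mentioned in the changelog is
driven from userspace roughly like this (generic KVM API; the register id is
supplied by the caller and is arch-specific, so none is shown here):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int set_one_reg(int vcpu_fd, __u64 id, __u64 value)
{
	struct kvm_one_reg reg = {
		.id   = id,                           /* arch-specific register id */
		.addr = (__u64)(unsigned long)&value, /* userspace pointer to the value */
	};

	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}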

--
Gleb.


Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Gleb Natapov
On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 04:54 PM, Gleb Natapov wrote:
  On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
  On 05/22/2013 02:34 PM, Gleb Natapov wrote:
  On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
  On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
  Any pages with stale information will be zapped by kvm_mmu_zap_all().
  When that happens, page faults will take place which will 
  automatically 
  use the new generation number.
 
  So still not clear why is this necessary.
 
  This is not, strictly speaking, necessary, but it is the sane thing to 
  do.
  You cannot update page's generation number to prevent it from been
  destroyed since after kvm_mmu_zap_all() completes stale ptes in the
  shadow page may point to now deleted memslot. So why build shadow page
  table with a page that is in a process of been destroyed?
 
  OK, can this be introduced separately, in a later patch, with separate
  justification, then?
 
  Xiao please have the first patches of the patchset focus on the problem
  at hand: fix long mmu_lock hold times.
 
  Not sure what you mean again. We flush TLB once before entering this 
  function.
  kvm_reload_remote_mmus() does this for us, no?
 
  kvm_reload_remote_mmus() is used as an optimization, its separate from 
  the
  problem solution.
 
 
  What was suggested was... go to phrase which starts with The only 
  purpose
  of the generation number should be to.
 
  The comment quoted here does not match that description.
 
  The comment describes what code does and in this it is correct.
 
  You propose to not reload roots right away and do it only when root sp
  is encountered, right? So my question is what's the point? There are,
  obviously, root sps with invalid generation number at this point, so
  reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
  do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
  invalid and obsolete sps as I proposed in one of my email?
 
  Sure. But Xiao please introduce that TLB collapsing optimization as a
  later patch, so we can reason about it in a more organized fashion.
 
  So, if I understand correctly, you are asking to move is_obsolete_sp()
  check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
  kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
  we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
  call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
  become nop. But I question the need to zap all shadow pages tables there
  in the first place, why kvm_flush_remote_tlbs() is not enough?
 
  I do not know too... I even do no know why kvm_flush_remote_tlbs
  is needed. :(
  We changed the content of an executable page, we need to flush instruction
  cache of all vcpus to not use stale data, so my suggestion to call
 
 I thought the reason is about icache too but icache is automatically
 flushed on x86, we only need to invalidate the prefetched instructions by
 executing a serializing operation.
 
 See the SDM in the chapter of
 8.1.3 Handling Self- and Cross-Modifying Code
 
Right, so we do cross-modifying code here and we need to make sure no
vcpu is running in guest mode while this happens, but
kvm_mmu_zap_all() does not provide this guarantee, since vcpus will
continue running after reloading their roots!
 
  kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
  not instruction cache, but why kvm_reload_remote_mmus() would flush
  instruction cache?
 
 kvm_reload_remote_mmus do not have any help i think.
 
 I find that this change is introduced by commit: 7aa81cc0
 and I have added Anthony in the CC.
 
 I also find some discussions related to calling
 kvm_reload_remote_mmus():
 
 
  But if the instruction is architecture dependent, and you run on the
  wrong architecture, now you have to patch many locations at fault time,
  introducing some nasty runtime code / data cache overlap performance
  problems.  Granted, they go away eventually.
 
 
 We're addressing that by blowing away the shadow cache and holding the
 big kvm lock to ensure SMP safety.  Not a great thing to do from a
 performance perspective but the whole point of patching is that the cost
 is amortized.
 
 (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)
 
 But i can not understand...
Back then kvm->lock protected memslot access, so code like:

mutex_lock(&vcpu->kvm->lock);
kvm_mmu_zap_all(vcpu->kvm);
mutex_unlock(&vcpu->kvm->lock);

which is what 7aa81cc0 does, was enough to guarantee that no vcpu would
run while code is patched. This is no longer the case:
mutex_lock(&vcpu->kvm->lock) is gone from that code path a long time ago,
so now kvm_mmu_zap_all() there is useless and the code is incorrect.

Lets drop kvm_mmu_zap_all() there (in separate patch) and fix the
patching properly later.

--
Gleb.

Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Russell King - ARM Linux
On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
 Given the most commonly used functions and a couple of architectures
 I'm familiar with, these are the ones that currently call might_fault()
 
   x86-32  x86-64  arm arm64   powerpc s390generic
 copy_to_user  -   x   -   -   -   x   x
 copy_from_user-   x   -   -   -   x   
 x
 put_user  x   x   x   x   x   x   x
 get_user  x   x   x   x   x   x   x
 __copy_to_userx   x   -   -   x   -   
 -
 __copy_from_user  x   x   -   -   x   -   -
 __put_user-   -   x   -   x   -   -
 __get_user-   -   x   -   x   -   -
 
 WTF?

I think your table is rather screwed - especially on ARM.  Tell me -
how can __copy_to_user() use might_fault() but copy_to_user() not when
copy_to_user() is implemented using __copy_to_user() ?  Same for
copy_from_user() but the reverse argument - there's nothing special
in our copy_from_user() which would make it do might_fault() when
__copy_from_user() wouldn't.

The correct position for ARM is: our (__)?(pu|ge)t_user all use
might_fault(), but (__)?copy_(to|from)_user do not.  Neither does
(__)?clear_user.  We might want to fix those to use might_fault().


Re: [PATCH] kvm: add detail error message when fail to add ioeventfd

2013-05-22 Thread Amos Kong
On Wed, May 22, 2013 at 11:32:27AM +0200, Stefan Hajnoczi wrote:
 On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:
  I try to hotplug 28 * 8 multiple-function devices to guest with
  old host kernel, ioeventfds in host kernel will be exhausted, then
  qemu fails to allocate ioeventfds for blk/nic devices.
  
  It's better to add detail error here.
  
  Signed-off-by: Amos Kong ak...@redhat.com
  ---
   kvm-all.c |4 
   1 files changed, 4 insertions(+), 0 deletions(-)
 
 It would be nice to make kvm bus scalable so that the hardcoded
 in-kernel I/O device limit can be lifted.

I increased the kernel's NR_IOBUS_DEVS to 1000 (a limit is still needed for
security) last March, and made resizing of the kvm_io_range array dynamic.
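
Roughly, that resizing works by reallocating the bus with room for one more
range on each registration while still enforcing the hard cap. An illustrative
sketch follows; the helper names and allocation details are approximations of
the kvm code, not an exact quote of it:

int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
			    gpa_t addr, int len, struct kvm_io_device *dev)
{
	struct kvm_io_bus *bus = kvm->buses[bus_idx], *new_bus;

	if (bus->dev_count >= NR_IOBUS_DEVS)	/* hard cap still applies */
		return -ENOSPC;

	/* allocate a copy with room for one more range, then swap it in */
	new_bus = kzalloc(sizeof(*bus) +
			  (bus->dev_count + 1) * sizeof(struct kvm_io_range),
			  GFP_KERNEL);
	if (!new_bus)
		return -ENOMEM;

	memcpy(new_bus, bus, sizeof(*bus) +
	       bus->dev_count * sizeof(struct kvm_io_range));
	kvm_io_bus_insert_dev(new_bus, dev, addr, len);	/* assumed helper */
	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
	synchronize_srcu_expedited(&kvm->srcu);
	kfree(bus);
	return 0;
}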

 Reviewed-by: Stefan Hajnoczi stefa...@redhat.com

-- 
Amos.


Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/

2013-05-22 Thread Arnd Bergmann
On Thursday 16 May 2013, Michael S. Tsirkin wrote:
 @@ -178,7 +178,7 @@ do {  
   \
 long __pu_err;  \
 __typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
 if (!is_kernel_addr((unsigned long)__pu_addr))  \
 -   might_sleep();  \
 +   might_fault();  \
 __chk_user_ptr(ptr);\
 __put_user_size((x), __pu_addr, (size), __pu_err);  \
 __pu_err;   \
 

Another observation:

if (!is_kernel_addr((unsigned long)__pu_addr))
might_sleep();

is almost the same as

might_fault();

except that it does not call might_lock_read().

The version above may have been put there intentionally and correctly, but
if you want to replace it with might_fault(), you should remove the
if () condition.
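
A sketch of the resulting powerpc macro if the if () condition is simply
dropped (the macro name is assumed from the quoted hunk; illustrative only,
not a posted patch):

#define __put_user_nocheck(x, ptr, size)			\
({								\
	long __pu_err;						\
	__typeof__(*(ptr)) __user *__pu_addr = (ptr);		\
	might_fault();	/* unconditional, no is_kernel_addr() */\
	__chk_user_ptr(ptr);					\
	__put_user_size((x), __pu_addr, (size), __pu_err);	\
	__pu_err;						\
})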

Arnd


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Arnd Bergmann
On Wednesday 22 May 2013, Russell King - ARM Linux wrote:
 On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
  Given the most commonly used functions and a couple of architectures
  I'm familiar with, these are the ones that currently call might_fault()
  
                        x86-32  x86-64  arm  arm64  powerpc  s390  generic
   copy_to_user         -       x       -    -      -        x     x
   copy_from_user       -       x       -    -      -        x     x
   put_user             x       x       x    x      x        x     x
   get_user             x       x       x    x      x        x     x
   __copy_to_user       x       x       -    -      x        -     -
   __copy_from_user     x       x       -    -      x        -     -
   __put_user           -       -       x    -      x        -     -
   __get_user           -       -       x    -      x        -     -
  
  WTF?
 
 I think your table is rather screwed - especially on ARM.  Tell me -
 how can __copy_to_user() use might_fault() but copy_to_user() not when
 copy_to_user() is implemented using __copy_to_user() ?  Same for
 copy_from_user() but the reverse argument - there's nothing special
 in our copy_from_user() which would make it do might_fault() when
 __copy_from_user() wouldn't.

I think something went wrong with the formatting of the tabstops in
the table. I've tried to correct it above to the same version I
see on the mailing list.

 The correct position for ARM is: our (__)?(pu|ge)t_user all use
 might_fault(), but (__)?copy_(to|from)_user do not.  Neither does
 (__)?clear_user.  We might want to fix those to use might_fault().

Yes, that sounds like a good idea, especially since they are all
implemented out-of-line.

For __get_user()/__put_user(), I would probably do the reverse and make
them not call might_fault() though, like we do on most other architectures:

Look at the object code produced for setup_sigframe, for instance: it calls
might_fault() around 25 times where one should really be enough. Using
__put_user() instead of put_user() is normally an indication that the
author of that function has considered performance and moved the
(trivial) access_ok() call out, but now we add a more expensive
call instead.

Arnd


Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/

2013-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 03:59:01PM +0200, Arnd Bergmann wrote:
 On Thursday 16 May 2013, Michael S. Tsirkin wrote:
  @@ -178,7 +178,7 @@ do {
  \
  long __pu_err;  \
  __typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
  if (!is_kernel_addr((unsigned long)__pu_addr))  \
  -   might_sleep();  \
  +   might_fault();  \
  __chk_user_ptr(ptr);\
  __put_user_size((x), __pu_addr, (size), __pu_err);  \
  __pu_err;   \
  
 
 Another observation:
 
   if (!is_kernel_addr((unsigned long)__pu_addr))
   might_sleep();
 
 is almost the same as
 
   might_fault();
 
 except that it does not call might_lock_read().
 
 The version above may have been put there intentionally and correctly, but
 if you want to replace it with might_fault(), you should remove the
 if () condition.
 
   Arnd

Good point, thanks.
I think I'll do it in a separate patch on top,
just to make sure bisect does not result in a revision that
produces false positive warnings.

-- 
MST


Re: [PATCH v2 00/10] uaccess: better might_sleep/might_fault behavior

2013-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 04:04:48PM +0200, Arnd Bergmann wrote:
 On Wednesday 22 May 2013, Russell King - ARM Linux wrote:
  On Wed, May 22, 2013 at 11:25:36AM +0200, Arnd Bergmann wrote:
   Given the most commonly used functions and a couple of architectures
   I'm familiar with, these are the ones that currently call might_fault()
   
                         x86-32  x86-64  arm  arm64  powerpc  s390  generic
    copy_to_user         -       x       -    -      -        x     x
    copy_from_user       -       x       -    -      -        x     x
    put_user             x       x       x    x      x        x     x
    get_user             x       x       x    x      x        x     x
    __copy_to_user       x       x       -    -      x        -     -
    __copy_from_user     x       x       -    -      x        -     -
    __put_user           -       -       x    -      x        -     -
    __get_user           -       -       x    -      x        -     -
   
   WTF?
  
  I think your table is rather screwed - especially on ARM.  Tell me -
  how can __copy_to_user() use might_fault() but copy_to_user() not when
  copy_to_user() is implemented using __copy_to_user() ?  Same for
  copy_from_user() but the reverse argument - there's nothing special
  in our copy_from_user() which would make it do might_fault() when
  __copy_from_user() wouldn't.
 
 I think something went wrong with formatting of the tabstobs in
 the table. I've tried to correct it above to the same version I
 see on the mailing list.
 
  The correct position for ARM is: our (__)?(pu|ge)t_user all use
  might_fault(), but (__)?copy_(to|from)_user do not.  Neither does
  (__)?clear_user.  We might want to fix those to use might_fault().
 
 Yes, that sounds like a good idea, especially since they are all
 implemented out-of-line.
 
 For __get_user()/__put_user(), I would probably do the reverse and make
 them not call might_fault() though, like we do on most other architectures:
 
 Look at the object code produced for setup_sigframe for instance, it calls
 might_fault() around 25 times where one should really be enough.

Well it depends on what config options you set.
But with VOLUNTARY you are right.
Also, look at memcpy_fromiovec and weep.

 Using
 __put_user() instead of put_user() is normally an indication that the
 author of that function has made performance considerations and move the
 (trivial) access_ok() call out, but now we add a more expensive
 call instead.
 
   Arnd

I think exactly the same rules should apply to __XXX_user and
__copy_XXX_user - otherwise it's really confusing.

Maybe a preempt point in might_fault should go away?
Basically

#define might_fault() __might_sleep(__FILE__, __LINE__, 0)

Possibly adding the in_atomic() etc checks that Peter suggested.
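
As a rough sketch (illustrative only, not the actual kernel code), such a
might_fault() without the preemption point might look like the following;
the in_atomic()/current->mm guards are the checks mentioned above and are
assumptions here:

static inline void might_fault(void)
{
	__might_sleep(__FILE__, __LINE__, 0);	/* annotation only, no resched */
	if (!in_atomic() && current->mm)
		might_lock_read(&current->mm->mmap_sem);
}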

Ingo, what do you think? And what testing would be appropriate
for such a change?


Thanks,

-- 
MST


Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter

2013-05-22 Thread Marcelo Tosatti
On Wed, May 22, 2013 at 06:38:47AM +0300, Gleb Natapov wrote:
 On Wed, May 22, 2013 at 12:32:57AM -0300, Marcelo Tosatti wrote:
  On Wed, May 22, 2013 at 06:28:47AM +0300, Gleb Natapov wrote
 + case HV_X64_MSR_TIME_REF_COUNT:
   r = true;
   break;
   }
 @@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu 
 *vcpu, u32 msr, u64 data)
   if (__copy_to_user((void __user *)addr, instructions, 
 4))
   return 1;
   kvm-arch.hv_hypercall = data;
 + local_irq_disable();
 + kvm-arch.hv_ref_count = get_kernel_ns();
 + local_irq_enable();
 + break;

local_irq_disable/local_irq_enable not needed.
   
   get_kernel_ns has WARN_ON(preemptible()) so at least preempt_disable()
   is needed, but all callers of the function disable interrupts.
  
  OK. Its not necessary to disable interrupts (this is a left over of
  kvm_guest_update_time).
 So preempt_disable() then? Should we change other callers?

Yes. Should only disable interrupts if necessary (the case in
kvm_write_guest_time is necessary).
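
A rough sketch of what the suggested change to the quoted hunk would look
like (illustrative only, not a posted patch): preemption, not interrupts,
is what get_kernel_ns() actually requires to be disabled here.

		kvm->arch.hv_hypercall = data;
		/* get_kernel_ns() has WARN_ON(preemptible()); disabling
		 * preemption is enough, interrupts can stay enabled. */
		preempt_disable();
		kvm->arch.hv_ref_count = get_kernel_ns();
		preempt_enable();
		break;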




Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Marcelo Tosatti
On Wed, May 22, 2013 at 09:34:13AM +0300, Gleb Natapov wrote:
 On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
  On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
Any pages with stale information will be zapped by kvm_mmu_zap_all().
When that happens, page faults will take place which will automatically 
use the new generation number.

So still not clear why is this necessary.

   This is not, strictly speaking, necessary, but it is the sane thing to do.
   You cannot update page's generation number to prevent it from been
   destroyed since after kvm_mmu_zap_all() completes stale ptes in the
   shadow page may point to now deleted memslot. So why build shadow page
   table with a page that is in a process of been destroyed?
  
  OK, can this be introduced separately, in a later patch, with separate
  justification, then?
  
  Xiao please have the first patches of the patchset focus on the problem
  at hand: fix long mmu_lock hold times.
  
   Not sure what you mean again. We flush TLB once before entering this 
   function.
   kvm_reload_remote_mmus() does this for us, no?
  
  kvm_reload_remote_mmus() is used as an optimization, its separate from the
  problem solution.
  

What was suggested was... go to phrase which starts with The only 
purpose
of the generation number should be to.

The comment quoted here does not match that description.

   The comment describes what code does and in this it is correct.
   
   You propose to not reload roots right away and do it only when root sp
   is encountered, right? So my question is what's the point? There are,
   obviously, root sps with invalid generation number at this point, so
   reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
   do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
   invalid and obsolete sps as I proposed in one of my email?
  
  Sure. But Xiao please introduce that TLB collapsing optimization as a
  later patch, so we can reason about it in a more organized fashion.
 
 So, if I understand correctly, you are asking to move is_obsolete_sp()
 check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
 kvm_mmu_invalidate_all_pages() to a separate patch.

- Move the is_obsolete_sp() check from kvm_mmu_get_page() to a separate
optimization patch.
- Move kvm_reload_remote_mmus() optimization to a separate optimization
patch.
- Leave the is_obsolete_sp() check in kvm_mmu_invalidate_all_pages(), as
it is necessary to guarantee forward progress, in the initial patches.

 Fine by me, but if
 we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
 call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
 become nop. But I question the need to zap all shadow pages tables there
 in the first place, why kvm_flush_remote_tlbs() is not enough?

Good question.



Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Xiao Guangrong
On 05/22/2013 09:17 PM, Gleb Natapov wrote:
 On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 04:54 PM, Gleb Natapov wrote:
 On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 02:34 PM, Gleb Natapov wrote:
 On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
 On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
 Any pages with stale information will be zapped by kvm_mmu_zap_all().
 When that happens, page faults will take place which will 
 automatically 
 use the new generation number.

 So still not clear why is this necessary.

 This is not, strictly speaking, necessary, but it is the sane thing to 
 do.
 You cannot update page's generation number to prevent it from been
 destroyed since after kvm_mmu_zap_all() completes stale ptes in the
 shadow page may point to now deleted memslot. So why build shadow page
 table with a page that is in a process of been destroyed?

 OK, can this be introduced separately, in a later patch, with separate
 justification, then?

 Xiao please have the first patches of the patchset focus on the problem
 at hand: fix long mmu_lock hold times.

 Not sure what you mean again. We flush TLB once before entering this 
 function.
 kvm_reload_remote_mmus() does this for us, no?

 kvm_reload_remote_mmus() is used as an optimization, its separate from 
 the
 problem solution.


 What was suggested was... go to phrase which starts with The only 
 purpose
 of the generation number should be to.

 The comment quoted here does not match that description.

 The comment describes what code does and in this it is correct.

 You propose to not reload roots right away and do it only when root sp
 is encountered, right? So my question is what's the point? There are,
 obviously, root sps with invalid generation number at this point, so
 reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
 do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
 invalid and obsolete sps as I proposed in one of my email?

 Sure. But Xiao please introduce that TLB collapsing optimization as a
 later patch, so we can reason about it in a more organized fashion.

 So, if I understand correctly, you are asking to move is_obsolete_sp()
 check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
 kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
 we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
 call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
 become nop. But I question the need to zap all shadow pages tables there
 in the first place, why kvm_flush_remote_tlbs() is not enough?

 I do not know too... I even do no know why kvm_flush_remote_tlbs
 is needed. :(
 We changed the content of an executable page, we need to flush instruction
 cache of all vcpus to not use stale data, so my suggestion to call

 I thought the reason is about icache too but icache is automatically
 flushed on x86, we only need to invalidate the prefetched instructions by
 executing a serializing operation.

 See the SDM in the chapter of
 8.1.3 Handling Self- and Cross-Modifying Code

 Right, so we do cross-modifying code here and we need to make sure no
 vcpu is running in a guest mode while this happens, but
 kvm_mmu_zap_all() does not provide this guaranty since vcpus will
 continue running after reloading roots!

Maybe we can introduce a function to atomically write the gpa; then the guest
either 1) sees the old value, in which case it can be intercepted, or
2) sees the new value, in which case it can continue to execute.

 
 kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
 not instruction cache, but why kvm_reload_remote_mmus() would flush
 instruction cache?

 kvm_reload_remote_mmus do not have any help i think.

 I find that this change is introduced by commit: 7aa81cc0
 and I have added Anthony in the CC.

 I also find some discussions related to calling
 kvm_reload_remote_mmus():


 But if the instruction is architecture dependent, and you run on the
 wrong architecture, now you have to patch many locations at fault time,
 introducing some nasty runtime code / data cache overlap performance
 problems.  Granted, they go away eventually.


 We're addressing that by blowing away the shadow cache and holding the
 big kvm lock to ensure SMP safety.  Not a great thing to do from a
 performance perspective but the whole point of patching is that the cost
 is amortized.

 (http://kerneltrap.org/mailarchive/linux-kernel/2007/9/14/260288)

 But i can not understand...
 Back then kvm-lock protected memslot access so code like:
 
 mutex_lock(vcpu-kvm-lock);
 kvm_mmu_zap_all(vcpu-kvm);
 mutex_unlock(vcpu-kvm-lock);
 
 which is what 7aa81cc0 does was enough to guaranty that no vcpu will
 run while code is patched. 

So, at that time, kvm->lock was also held when a #PF was being fixed?

 This is no longer the case and
 mutex_lock(vcpu-kvm-lock); is gone 

Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Gleb Natapov
On Wed, May 22, 2013 at 11:25:13PM +0800, Xiao Guangrong wrote:
 On 05/22/2013 09:17 PM, Gleb Natapov wrote:
  On Wed, May 22, 2013 at 05:41:10PM +0800, Xiao Guangrong wrote:
  On 05/22/2013 04:54 PM, Gleb Natapov wrote:
  On Wed, May 22, 2013 at 04:46:04PM +0800, Xiao Guangrong wrote:
  On 05/22/2013 02:34 PM, Gleb Natapov wrote:
  On Tue, May 21, 2013 at 10:33:30PM -0300, Marcelo Tosatti wrote:
  On Tue, May 21, 2013 at 11:39:03AM +0300, Gleb Natapov wrote:
  Any pages with stale information will be zapped by kvm_mmu_zap_all().
  When that happens, page faults will take place which will 
  automatically 
  use the new generation number.
 
  So still not clear why is this necessary.
 
  This is not, strictly speaking, necessary, but it is the sane thing 
  to do.
  You cannot update page's generation number to prevent it from been
  destroyed since after kvm_mmu_zap_all() completes stale ptes in the
  shadow page may point to now deleted memslot. So why build shadow page
  table with a page that is in a process of been destroyed?
 
  OK, can this be introduced separately, in a later patch, with separate
  justification, then?
 
  Xiao please have the first patches of the patchset focus on the problem
  at hand: fix long mmu_lock hold times.
 
  Not sure what you mean again. We flush TLB once before entering this 
  function.
  kvm_reload_remote_mmus() does this for us, no?
 
  kvm_reload_remote_mmus() is used as an optimization, its separate from 
  the
  problem solution.
 
 
  What was suggested was... go to phrase which starts with The only 
  purpose
  of the generation number should be to.
 
  The comment quoted here does not match that description.
 
  The comment describes what code does and in this it is correct.
 
  You propose to not reload roots right away and do it only when root sp
  is encountered, right? So my question is what's the point? There are,
  obviously, root sps with invalid generation number at this point, so
  reload will happen regardless in kvm_mmu_prepare_zap_page(). So why 
  not
  do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
  invalid and obsolete sps as I proposed in one of my email?
 
  Sure. But Xiao please introduce that TLB collapsing optimization as a
  later patch, so we can reason about it in a more organized fashion.
 
  So, if I understand correctly, you are asking to move is_obsolete_sp()
  check from kvm_mmu_get_page() and kvm_reload_remote_mmus() from
  kvm_mmu_invalidate_all_pages() to a separate patch. Fine by me, but if
  we drop kvm_reload_remote_mmus() from kvm_mmu_invalidate_all_pages() the
  call to kvm_mmu_invalidate_all_pages() in emulator_fix_hypercall() will
  become nop. But I question the need to zap all shadow pages tables there
  in the first place, why kvm_flush_remote_tlbs() is not enough?
 
  I do not know too... I even do no know why kvm_flush_remote_tlbs
  is needed. :(
  We changed the content of an executable page, we need to flush instruction
  cache of all vcpus to not use stale data, so my suggestion to call
 
  I thought the reason is about icache too but icache is automatically
  flushed on x86, we only need to invalidate the prefetched instructions by
  executing a serializing operation.
 
  See the SDM in the chapter of
  8.1.3 Handling Self- and Cross-Modifying Code
 
  Right, so we do cross-modifying code here and we need to make sure no
  vcpu is running in a guest mode while this happens, but
  kvm_mmu_zap_all() does not provide this guaranty since vcpus will
  continue running after reloading roots!
 
 May be we can introduce a function to atomic write gpa, then the guest
 either 1) see the old value, in that case, it can be intercepted or
 2) see the the new value in that case, it can continue to execute.
 
The SDM says an atomic write is not enough. All vcpus must be guaranteed
not to execute code in the vicinity of the modified code. This is easy to
achieve though:

vcpu0:
lock(x);
make_all_cpus_request(EXIT);
unlock(x);

vcpuX:
if (kvm_check_request(EXIT)) { 
lock(x);
unlock(x);
}
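
As a rough illustration of that rendezvous (the request name and helpers
below are hypothetical, not existing KVM APIs): the patching CPU holds a lock
across a "kick everyone out of guest mode" request, and each vcpu that sees
the request briefly takes the same lock before re-entering guest mode.

static DEFINE_MUTEX(patch_lock);

/* patching side (vcpu0): no vcpu can be in guest mode while we write */
static void patch_guest_code(struct kvm *kvm, gpa_t gpa,
			     const void *insn, int len)
{
	mutex_lock(&patch_lock);
	/* KVM_REQ_EXIT_GUEST is a hypothetical request that kicks all
	 * vcpus out of guest mode and makes them run the check below. */
	make_all_cpus_request(kvm, KVM_REQ_EXIT_GUEST);
	kvm_write_guest(kvm, gpa, insn, len);
	mutex_unlock(&patch_lock);
}

/* vcpu entry path (vcpuX): serialize against a patch in progress */
static void sync_with_patching(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_EXIT_GUEST, vcpu)) {
		mutex_lock(&patch_lock);
		mutex_unlock(&patch_lock);
	}
}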

  kvm_flush_remote_tlbs() is obviously incorrect since this flushes tlb,
  not instruction cache, but why kvm_reload_remote_mmus() would flush
  instruction cache?
 
  kvm_reload_remote_mmus do not have any help i think.
 
  I find that this change is introduced by commit: 7aa81cc0
  and I have added Anthony in the CC.
 
  I also find some discussions related to calling
  kvm_reload_remote_mmus():
 
 
  But if the instruction is architecture dependent, and you run on the
  wrong architecture, now you have to patch many locations at fault time,
  introducing some nasty runtime code / data cache overlap performance
  problems.  Granted, they go away eventually.
 
 
  We're addressing that by blowing away the shadow cache and holding the
  big kvm lock to ensure SMP safety.  Not a great thing to do from a
  performance perspective but the whole point of patching is that the cost
  is amortized.
 
  

Re: [PATCH v4 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.

2013-05-22 Thread Sanjay Lal

On May 22, 2013, at 5:54 AM, Gleb Natapov wrote:

 On Tue, May 21, 2013 at 01:54:49PM -0700, David Daney wrote:
 From: David Daney david.da...@cavium.com
 
 The initial patch set implementing MIPS KVM does not handle 64-bit
 guests or use of the FPU.  This patch set corrects these ABI issues,
 and does some very minor clean up.
 
 Sanjay, is this looks good to you. 
 
 What userspace MIPS is using for machine emulation? Is there corresponding
 patches to the userspace?

Gleb, I'll post some comments on the patches later in the day.  We use QEMU for 
the machine emulation. I am in the process of integrating with the new ABI, and 
will post the QEMU patches shortly.

Regards
Sanjay



Re: [PATCH v4 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG

2013-05-22 Thread Sanjay Lal

On May 21, 2013, at 1:54 PM, David Daney wrote:

 From: David Daney david.da...@cavium.com
 
 Because not all 256 CP0 registers are ever implemented, we need a
 different method of manipulating them.  Use the
 KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism.
 
 Code related to implementing KVM_SET_ONE_REG/KVM_GET_ONE_REG is
 consolidated in to kvm_trap_emul.c, now unused code and definitions
 are removed.
 
 Signed-off-by: David Daney david.da...@cavium.com
 ---
 arch/mips/include/asm/kvm.h  |  91 +--
 arch/mips/include/asm/kvm_host.h |   4 -
 arch/mips/kvm/kvm_mips.c |  90 +--
 arch/mips/kvm/kvm_trap_emul.c| 338 ++-
 4 files changed, 383 insertions(+), 140 deletions(-)
 
 diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
 index d145ead..3f424f5 100644
 --- a/arch/mips/include/asm/kvm.h
 +++ b/arch/mips/include/asm/kvm.h
 @@ -13,10 +13,11 @@
 
 #include linux/types.h
 
 -#define __KVM_MIPS
 -
 -#define N_MIPS_COPROC_REGS  32
 -#define N_MIPS_COPROC_SEL8
 +/*
 + * KVM MIPS specific structures and definitions.
 + *
 + * Some parts derived from the x86 version of this file.
 + */
 
 /*
  * for KVM_GET_REGS and KVM_SET_REGS
 @@ -31,12 +32,6 @@ struct kvm_regs {
   __u64 hi;
   __u64 lo;
   __u64 pc;
 -
 - __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL];
 -};
 -
 -/* for KVM_GET_SREGS and KVM_SET_SREGS */
 -struct kvm_sregs {
 };
 
 /*
 @@ -55,21 +50,89 @@ struct kvm_fpu {
   __u32 pad;
 };
 
 +
 +/*
 + * For MIPS, we use KVM_SET_ONE_REG and KVM_GET_ONE_REG to access CP0
 + * registers.  The id field is broken down as follows:
 + *
 + *  bits[2..0]   - Register 'sel' index.
 + *  bits[7..3]   - Register 'rd'  index.
 + *  bits[15..8]  - Must be zero.
 + *  bits[63..16] - 1 - CP0 registers.
 + *
 + * Other sets registers may be added in the future.  Each set would
 + * have its own identifier in bits[63..16].
 + *
 + * The addr field of struct kvm_one_reg must point to an aligned
 + * 64-bit wide location.  For registers that are narrower than
 + * 64-bits, the value is stored in the low order bits of the location,
 + * and sign extended to 64-bits.
 + *
 + * The registers defined in struct kvm_regs are also accessible, the
 + * id values for these are below.
 + */
 +
 +#define KVM_REG_MIPS_R0 0
 +#define KVM_REG_MIPS_R1 1
 +#define KVM_REG_MIPS_R2 2
 +#define KVM_REG_MIPS_R3 3
 +#define KVM_REG_MIPS_R4 4
 +#define KVM_REG_MIPS_R5 5
 +#define KVM_REG_MIPS_R6 6
 +#define KVM_REG_MIPS_R7 7
 +#define KVM_REG_MIPS_R8 8
 +#define KVM_REG_MIPS_R9 9
 +#define KVM_REG_MIPS_R10 10
 +#define KVM_REG_MIPS_R11 11
 +#define KVM_REG_MIPS_R12 12
 +#define KVM_REG_MIPS_R13 13
 +#define KVM_REG_MIPS_R14 14
 +#define KVM_REG_MIPS_R15 15
 +#define KVM_REG_MIPS_R16 16
 +#define KVM_REG_MIPS_R17 17
 +#define KVM_REG_MIPS_R18 18
 +#define KVM_REG_MIPS_R19 19
 +#define KVM_REG_MIPS_R20 20
 +#define KVM_REG_MIPS_R21 21
 +#define KVM_REG_MIPS_R22 22
 +#define KVM_REG_MIPS_R23 23
 +#define KVM_REG_MIPS_R24 24
 +#define KVM_REG_MIPS_R25 25
 +#define KVM_REG_MIPS_R26 26
 +#define KVM_REG_MIPS_R27 27
 +#define KVM_REG_MIPS_R28 28
 +#define KVM_REG_MIPS_R29 29
 +#define KVM_REG_MIPS_R30 30
 +#define KVM_REG_MIPS_R31 31
 +
 +#define KVM_REG_MIPS_HI 32
 +#define KVM_REG_MIPS_LO 33
 +#define KVM_REG_MIPS_PC 34
 +
 +/*
 + * KVM MIPS specific structures and definitions
 + *
 + */
 struct kvm_debug_exit_arch {
 + __u64 epc;
 };
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
 };
 
 +/* definition of registers in kvm_run */
 +struct kvm_sync_regs {
 +};
 +
 +/* dummy definition */
 +struct kvm_sregs {
 +};
 +
 struct kvm_mips_interrupt {
   /* in */
   __u32 cpu;
   __u32 irq;
 };
 
 -/* definition of registers in kvm_run */
 -struct kvm_sync_regs {
 -};
 -
 #endif /* __LINUX_KVM_MIPS_H */
 diff --git a/arch/mips/include/asm/kvm_host.h 
 b/arch/mips/include/asm/kvm_host.h
 index 143875c..4d6fa0b 100644
 --- a/arch/mips/include/asm/kvm_host.h
 +++ b/arch/mips/include/asm/kvm_host.h
 @@ -496,10 +496,6 @@ struct kvm_mips_callbacks {
   uint32_t cause);
   int (*irq_clear) (struct kvm_vcpu *vcpu, unsigned int priority,
 uint32_t cause);
 - int (*vcpu_ioctl_get_regs) (struct kvm_vcpu *vcpu,
 - struct kvm_regs *regs);
 - int (*vcpu_ioctl_set_regs) (struct kvm_vcpu *vcpu,
 - struct kvm_regs *regs);
 };
 extern struct kvm_mips_callbacks *kvm_mips_callbacks;
 int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks);
 diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
 index 71a1fc1..bc879bd 100644
 --- a/arch/mips/kvm/kvm_mips.c
 +++ b/arch/mips/kvm/kvm_mips.c
 @@ -51,16 +51,6 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
   {NULL}
 };
 
 -static int kvm_mips_reset_vcpu(struct kvm_vcpu *vcpu)
 -{
 - int i;
 - for_each_possible_cpu(i) {
 

Re: [PATCH v4 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG

2013-05-22 Thread David Daney

On 05/22/2013 10:44 AM, Sanjay Lal wrote:


On May 21, 2013, at 1:54 PM, David Daney wrote:


From: David Daney david.da...@cavium.com

Because not all 256 CP0 registers are ever implemented, we need a
different method of manipulating them.  Use the
KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism.

Code related to implementing KVM_SET_ONE_REG/KVM_GET_ONE_REG is
consolidated in to kvm_trap_emul.c, now unused code and definitions
are removed.

Signed-off-by: David Daney david.da...@cavium.com
---
arch/mips/include/asm/kvm.h  |  91 +--
arch/mips/include/asm/kvm_host.h |   4 -
arch/mips/kvm/kvm_mips.c |  90 +--
arch/mips/kvm/kvm_trap_emul.c| 338 ++-
4 files changed, 383 insertions(+), 140 deletions(-)

[...]




Most of the functions that have been relocated to kvm_trap_emul.c should stay 
in kvm_mips.c. They are/will be shared between the trap-and-emulate and VZ modes.  
They include kvm_mips_reset_vcpu(), kvm_vcpu_ioctl_interrupt(), 
kvm_arch_vcpu_ioctl().

kvm_mips_get_reg() and kvm_mips_set_reg() should be in kvm_mips.c as they will 
be shared by the trap and emulate and VZ code.



OK, I will revise the patch set to rearrange things in a manner that 
leaves these in kvm_mips.c.  However, this is of secondary importance to 
the question of the suitability of the ABI.




If you plan on defining specific versions of these functions for Cavium's 
implementation of KVM, please make them callbacks.



There will soon be follow on patches that do exactly that.





Re: [PATCH -next] ia64, metag: Do not export min_low_pfn in arch-specific code

2013-05-22 Thread Ralf Baechle
On Sun, Apr 28, 2013 at 11:24:35AM +0100, James Hogan wrote:

 On 27 April 2013 19:20, Geert Uytterhoeven ge...@linux-m68k.org wrote:
  As of commit 787dcbe6984b3638e94f60d807dcb51bb8a07211 (MIPS: Export
  symbols used by KVM/MIPS module), min_low_pfn is already exported by
  the generic mm/bootmem.c, causing:
 
  WARNING: vmlinux: 'min_low_pfn' exported twice. Previous export was in 
  vmlinux
 
  Signed-off-by: Geert Uytterhoeven ge...@linux-m68k.org
 
 This has been pointed out several times and needs fixing in the mips
 tree where the warning was introduced.

I've changed the MIPS code to no longer require the export of min_low_pfn
and dropped the export of that particular file.  I still see that IA-64
and metag export min_low_pfn so maybe it should be exported from mm/
after all?

  Ralf


[PATCH v5 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

The initial patch set implementing MIPS KVM does not handle 64-bit
guests or use of the FPU.  This patch set corrects these ABI issues,
and does some very minor clean up.

Changes from v4: No code change, just keep more of the code in
 kvm_mips.c rather than kvm_trap_emul.c

Changes from v3: Use KVM_SET_ONE_REG instead of KVM_SET_MSRS.  Added
 ENOIOCTLCMD patch.

Changes from v2: Split into five parts, no code change.

David Daney (6):
  mips/kvm: Fix ABI for use of FPU.
  mips/kvm: Fix ABI for use of 64-bit registers.
  mips/kvm: Fix name of gpr field in struct kvm_regs.
  mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in
kvm_arch_vcpu_ioctl_{s,g}et_regs
  mips/kvm: Fix ABI by moving manipulation of CP0 registers to
KVM_{G,S}ET_ONE_REG
  mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.

 arch/mips/include/asm/kvm.h  | 137 +++
 arch/mips/include/asm/kvm_host.h |   4 -
 arch/mips/kvm/kvm_mips.c | 278 ---
 arch/mips/kvm/kvm_trap_emul.c|  50 ---
 4 files changed, 367 insertions(+), 102 deletions(-)

-- 
1.7.11.7



[PATCH v5 2/6] mips/kvm: Fix ABI for use of 64-bit registers.

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

All registers are 64-bits wide, 32-bit guests use the least
significant portion of the register storage fields.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/include/asm/kvm.h | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 0e8f565..86812fb 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -18,12 +18,18 @@
 #define N_MIPS_COPROC_REGS  32
 #define N_MIPS_COPROC_SEL  8
 
-/* for KVM_GET_REGS and KVM_SET_REGS */
+/*
+ * for KVM_GET_REGS and KVM_SET_REGS
+ *
+ * If Config[AT] is zero (32-bit CPU), the register contents are
+ * stored in the lower 32-bits of the struct kvm_regs fields and sign
+ * extended to 64-bits.
+ */
 struct kvm_regs {
-   __u32 gprs[32];
-   __u32 hi;
-   __u32 lo;
-   __u32 pc;
+   __u64 gprs[32];
+   __u64 hi;
+   __u64 lo;
+   __u64 pc;
 
__u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL];
 };
-- 
1.7.11.7



[PATCH v5 1/6] mips/kvm: Fix ABI for use of FPU.

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

Define a non-empty struct kvm_fpu.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/include/asm/kvm.h | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 85789ea..0e8f565 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -1,11 +1,12 @@
 /*
-* This file is subject to the terms and conditions of the GNU General Public
-* License.  See the file COPYING in the main directory of this archive
-* for more details.
-*
-* Copyright (C) 2012  MIPS Technologies, Inc.  All rights reserved.
-* Authors: Sanjay Lal sanj...@kymasys.com
-*/
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2012  MIPS Technologies, Inc.  All rights reserved.
+ * Copyright (C) 2013 Cavium, Inc.
+ * Authors: Sanjay Lal sanj...@kymasys.com
+ */
 
 #ifndef __LINUX_KVM_MIPS_H
 #define __LINUX_KVM_MIPS_H
@@ -31,8 +32,20 @@ struct kvm_regs {
 struct kvm_sregs {
 };
 
-/* for KVM_GET_FPU and KVM_SET_FPU */
+/*
+ * for KVM_GET_FPU and KVM_SET_FPU
+ *
+ * If Status[FR] is zero (32-bit FPU), the upper 32-bits of the FPRs
+ * are zero filled.
+ */
 struct kvm_fpu {
+   __u64 fpr[32];
+   __u32 fir;
+   __u32 fccr;
+   __u32 fexr;
+   __u32 fenr;
+   __u32 fcsr;
+   __u32 pad;
 };
 
 struct kvm_debug_exit_arch {
-- 
1.7.11.7



[PATCH v5 6/6] mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

The Linux Way is to return -ENOIOCTLCMD to the vfs when an
unimplemented ioctl is requested.  Do this in kvm_mips instead of a
random mixture of -ENOTSUPP and -EINVAL.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/kvm/kvm_mips.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 5aec4f4..13e9438 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -195,7 +195,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 long
 kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
-   return -EINVAL;
+   return -ENOIOCTLCMD;
 }
 
 void kvm_arch_free_memslot(struct kvm_memory_slot *free,
@@ -401,7 +401,7 @@ int
 kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg)
 {
-   return -EINVAL;
+   return -ENOIOCTLCMD;
 }
 
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
@@ -475,14 +475,14 @@ int
 kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
 {
-   return -EINVAL;
+   return -ENOIOCTLCMD;
 }
 
 int
 kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
 {
-   return -EINVAL;
+   return -ENOIOCTLCMD;
 }
 
 #define KVM_REG_MIPS_CP0_INDEX (0x10000 + 8 * 0 + 0)
@@ -799,7 +799,7 @@ long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 
switch (ioctl) {
default:
-   r = -EINVAL;
+   r = -ENOIOCTLCMD;
}
 
return r;
@@ -827,13 +827,13 @@ void kvm_arch_exit(void)
 int
 kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 {
-   return -ENOTSUPP;
+   return -ENOIOCTLCMD;
 }
 
 int
 kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 {
-   return -ENOTSUPP;
+   return -ENOIOCTLCMD;
 }
 
 int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
@@ -843,12 +843,12 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -ENOIOCTLCMD;
 }
 
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -ENOIOCTLCMD;
 }
 
 int kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
-- 
1.7.11.7



[PATCH v5 4/6] mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in kvm_arch_vcpu_ioctl_{s,g}et_regs

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

Also, we cannot set the special zero register, so force it to zero.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/kvm/kvm_mips.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 93da750..71a1fc1 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -677,9 +677,9 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 1; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		vcpu->arch.gprs[i] = regs->gpr[i];
-
+	vcpu->arch.gprs[0] = 0; /* zero is special, and cannot be set. */
 	vcpu->arch.hi = regs->hi;
 	vcpu->arch.lo = regs->lo;
 	vcpu->arch.pc = regs->pc;
@@ -691,7 +691,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
 	int i;
 
-	for (i = 0; i < 32; i++)
+	for (i = 0; i < ARRAY_SIZE(vcpu->arch.gprs); i++)
 		regs->gpr[i] = vcpu->arch.gprs[i];
 
 	regs->hi = vcpu->arch.hi;
-- 
1.7.11.7



[PATCH v5 5/6] mips/kvm: Fix ABI by moving manipulation of CP0 registers to KVM_{G,S}ET_ONE_REG

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

Because not all 256 CP0 registers are ever implemented, we need a
different method of manipulating them.  Use the
KVM_SET_ONE_REG/KVM_GET_ONE_REG mechanism.

Now unused code and definitions are removed.

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/include/asm/kvm.h  |  91 +++---
 arch/mips/include/asm/kvm_host.h |   4 -
 arch/mips/kvm/kvm_mips.c | 250 +--
 arch/mips/kvm/kvm_trap_emul.c|  50 
 4 files changed, 320 insertions(+), 75 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index d145ead..3f424f5 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -13,10 +13,11 @@
 
 #include linux/types.h
 
-#define __KVM_MIPS
-
-#define N_MIPS_COPROC_REGS  32
-#define N_MIPS_COPROC_SEL  8
+/*
+ * KVM MIPS specific structures and definitions.
+ *
+ * Some parts derived from the x86 version of this file.
+ */
 
 /*
  * for KVM_GET_REGS and KVM_SET_REGS
@@ -31,12 +32,6 @@ struct kvm_regs {
__u64 hi;
__u64 lo;
__u64 pc;
-
-   __u32 cp0reg[N_MIPS_COPROC_REGS][N_MIPS_COPROC_SEL];
-};
-
-/* for KVM_GET_SREGS and KVM_SET_SREGS */
-struct kvm_sregs {
 };
 
 /*
@@ -55,21 +50,89 @@ struct kvm_fpu {
__u32 pad;
 };
 
+
+/*
+ * For MIPS, we use KVM_SET_ONE_REG and KVM_GET_ONE_REG to access CP0
+ * registers.  The id field is broken down as follows:
+ *
+ *  bits[2..0]   - Register 'sel' index.
+ *  bits[7..3]   - Register 'rd'  index.
+ *  bits[15..8]  - Must be zero.
+ *  bits[63..16] - 1 -> CP0 registers.
+ *
+ * Other sets registers may be added in the future.  Each set would
+ * have its own identifier in bits[63..16].
+ *
+ * The addr field of struct kvm_one_reg must point to an aligned
+ * 64-bit wide location.  For registers that are narrower than
+ * 64-bits, the value is stored in the low order bits of the location,
+ * and sign extended to 64-bits.
+ *
+ * The registers defined in struct kvm_regs are also accessible, the
+ * id values for these are below.
+ */
+
+#define KVM_REG_MIPS_R0 0
+#define KVM_REG_MIPS_R1 1
+#define KVM_REG_MIPS_R2 2
+#define KVM_REG_MIPS_R3 3
+#define KVM_REG_MIPS_R4 4
+#define KVM_REG_MIPS_R5 5
+#define KVM_REG_MIPS_R6 6
+#define KVM_REG_MIPS_R7 7
+#define KVM_REG_MIPS_R8 8
+#define KVM_REG_MIPS_R9 9
+#define KVM_REG_MIPS_R10 10
+#define KVM_REG_MIPS_R11 11
+#define KVM_REG_MIPS_R12 12
+#define KVM_REG_MIPS_R13 13
+#define KVM_REG_MIPS_R14 14
+#define KVM_REG_MIPS_R15 15
+#define KVM_REG_MIPS_R16 16
+#define KVM_REG_MIPS_R17 17
+#define KVM_REG_MIPS_R18 18
+#define KVM_REG_MIPS_R19 19
+#define KVM_REG_MIPS_R20 20
+#define KVM_REG_MIPS_R21 21
+#define KVM_REG_MIPS_R22 22
+#define KVM_REG_MIPS_R23 23
+#define KVM_REG_MIPS_R24 24
+#define KVM_REG_MIPS_R25 25
+#define KVM_REG_MIPS_R26 26
+#define KVM_REG_MIPS_R27 27
+#define KVM_REG_MIPS_R28 28
+#define KVM_REG_MIPS_R29 29
+#define KVM_REG_MIPS_R30 30
+#define KVM_REG_MIPS_R31 31
+
+#define KVM_REG_MIPS_HI 32
+#define KVM_REG_MIPS_LO 33
+#define KVM_REG_MIPS_PC 34
+
+/*
+ * KVM MIPS specific structures and definitions
+ *
+ */
 struct kvm_debug_exit_arch {
+   __u64 epc;
 };
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
 };
 
+/* definition of registers in kvm_run */
+struct kvm_sync_regs {
+};
+
+/* dummy definition */
+struct kvm_sregs {
+};
+
 struct kvm_mips_interrupt {
/* in */
__u32 cpu;
__u32 irq;
 };
 
-/* definition of registers in kvm_run */
-struct kvm_sync_regs {
-};
-
 #endif /* __LINUX_KVM_MIPS_H */
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 143875c..4d6fa0b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -496,10 +496,6 @@ struct kvm_mips_callbacks {
uint32_t cause);
int (*irq_clear) (struct kvm_vcpu *vcpu, unsigned int priority,
  uint32_t cause);
-   int (*vcpu_ioctl_get_regs) (struct kvm_vcpu *vcpu,
-   struct kvm_regs *regs);
-   int (*vcpu_ioctl_set_regs) (struct kvm_vcpu *vcpu,
-   struct kvm_regs *regs);
 };
 extern struct kvm_mips_callbacks *kvm_mips_callbacks;
 int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks);
diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 71a1fc1..5aec4f4 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -485,15 +485,251 @@ kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
return -EINVAL;
 }
 
+#define KVM_REG_MIPS_CP0_INDEX (0x10000 + 8 * 0 + 0)
+#define KVM_REG_MIPS_CP0_ENTRYLO0 (0x10000 + 8 * 2 + 0)
+#define KVM_REG_MIPS_CP0_ENTRYLO1 (0x10000 + 8 * 3 + 0)
+#define KVM_REG_MIPS_CP0_CONTEXT (0x10000 + 8 * 4 + 0)
+#define KVM_REG_MIPS_CP0_USERLOCAL (0x10000 + 8 * 4 + 2)
+#define KVM_REG_MIPS_CP0_PAGEMASK (0x10000 + 8 * 5 + 0)
+#define 
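
For readers: given the encoding described in the kvm.h comment above, each
CP0 register id is simply 0x10000 + 8 * rd + sel. A hypothetical helper (not
part of the patch) makes this explicit:

/* bits[63..16] = 1 (the CP0 set), bits[7..3] = rd, bits[2..0] = sel */
#define KVM_REG_MIPS_CP0(rd, sel)	(0x10000 + 8 * (rd) + (sel))

/* e.g. CP0 Status is rd=12, sel=0 -> 0x10000 + 8 * 12 + 0 == 0x10060 */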

[PATCH v5 3/6] mips/kvm: Fix name of gpr field in struct kvm_regs.

2013-05-22 Thread David Daney
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
---
 arch/mips/include/asm/kvm.h | 3 ++-
 arch/mips/kvm/kvm_mips.c| 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/mips/include/asm/kvm.h b/arch/mips/include/asm/kvm.h
index 86812fb..d145ead 100644
--- a/arch/mips/include/asm/kvm.h
+++ b/arch/mips/include/asm/kvm.h
@@ -26,7 +26,8 @@
  * extended to 64-bits.
  */
 struct kvm_regs {
-   __u64 gprs[32];
+   /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+   __u64 gpr[32];
__u64 hi;
__u64 lo;
__u64 pc;
diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index e0dad02..93da750 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -678,7 +678,7 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	int i;
 
 	for (i = 0; i < 32; i++)
-		vcpu->arch.gprs[i] = regs->gprs[i];
+		vcpu->arch.gprs[i] = regs->gpr[i];
 
 	vcpu->arch.hi = regs->hi;
 	vcpu->arch.lo = regs->lo;
@@ -692,7 +692,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	int i;
 
 	for (i = 0; i < 32; i++)
-		regs->gprs[i] = vcpu->arch.gprs[i];
+		regs->gpr[i] = vcpu->arch.gprs[i];
 
 	regs->hi = vcpu->arch.hi;
 	regs->lo = vcpu->arch.lo;
-- 
1.7.11.7



Re: [PATCH v5 0/6] mips/kvm: Fix ABI for compatibility with 64-bit guests.

2013-05-22 Thread Sanjay Lal

On May 22, 2013, at 11:43 AM, David Daney wrote:

 From: David Daney david.da...@cavium.com
 
 The initial patch set implementing MIPS KVM does not handle 64-bit
 guests or use of the FPU.  This patch set corrects these ABI issues,
 and does some very minor clean up.
 
 Changes from v4: No code change, just keep more of the code in
 kvm_mips.c rather than kvm_trap_emul.c
 
 Changes from v3: Use KVM_SET_ONE_REG instead of KVM_SET_MSRS.  Added
 ENOIOCTLCMD patch.
 
 Changes from v2: Split into five parts, no code change.
 
 David Daney (6):
  mips/kvm: Fix ABI for use of FPU.
  mips/kvm: Fix ABI for use of 64-bit registers.
  mips/kvm: Fix name of gpr field in struct kvm_regs.
  mips/kvm: Use ARRAY_SIZE() instead of hardcoded constants in
kvm_arch_vcpu_ioctl_{s,g}et_regs
  mips/kvm: Fix ABI by moving manipulation of CP0 registers to
KVM_{G,S}ET_ONE_REG
  mips/kvm: Use ENOIOCTLCMD to indicate unimplemented ioctls.
 
 arch/mips/include/asm/kvm.h  | 137 +++
 arch/mips/include/asm/kvm_host.h |   4 -
 arch/mips/kvm/kvm_mips.c | 278 ---
 arch/mips/kvm/kvm_trap_emul.c|  50 ---
 4 files changed, 367 insertions(+), 102 deletions(-)
 
 -- 
 1.7.11.7

Acked-by: Sanjay Lal sanj...@kymasys.com

Regards
Sanjay



[PATCH v7 04/11] KVM: MMU: zap pages in batch

2013-05-22 Thread Xiao Guangrong
Zap at least 10 pages before releasing mmu-lock to reduce the overhead
caused by re-acquiring the lock.

After the patch, kvm_zap_obsolete_pages can make forward progress anyway,
so update the comments.

[ It improves kernel building 0.6% ~ 1% ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   35 +++
 1 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f302540..688e755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4203,14 +4203,18 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+#define BATCH_ZAP_PAGES   10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
struct kvm_mmu_page *sp, *node;
LIST_HEAD(invalid_list);
+   int batch = 0;
 
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
+   int ret;
+
/*
 * No obsolete page exists before new created page since
 * active_mmu_pages is the FIFO list.
@@ -4219,28 +4223,6 @@ restart:
break;
 
/*
-* Do not repeatedly zap a root page to avoid unnecessary
-* KVM_REQ_MMU_RELOAD, otherwise we may not be able to
-* progress:
-*vcpu 0vcpu 1
-* call vcpu_enter_guest():
-*1): handle KVM_REQ_MMU_RELOAD
-*and require mmu-lock to
-*load mmu
-* repeat:
-*1): zap root page and
-*send KVM_REQ_MMU_RELOAD
-*
-*2): if (cond_resched_lock(mmu-lock))
-*
-*2): hold mmu-lock and load mmu
-*
-*3): see KVM_REQ_MMU_RELOAD bit
-*on vcpu-requests is set
-*then return 1 to call
-*vcpu_enter_guest() again.
-*goto repeat;
-*
 * Since we are reversely walking the list and the invalid
 * list will be moved to the head, skip the invalid page
 * can help us to avoid the infinity list walking.
@@ -4248,13 +4230,18 @@ restart:
 		if (sp->role.invalid)
continue;
 
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		if (batch >= BATCH_ZAP_PAGES &&
+		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+			batch = 0;
 			kvm_mmu_commit_zap_page(kvm, &invalid_list);
 			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
+		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		batch += ret;
+
+		if (ret)
 			goto restart;
}
 
-- 
1.7.7.6



[PATCH v7 05/11] KVM: x86: use the fast way to invalidate all pages

2013-05-22 Thread Xiao Guangrong
Replace kvm_mmu_zap_all by kvm_mmu_invalidate_zap_all_pages

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   15 ---
 arch/x86/kvm/x86.c |4 ++--
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 688e755..c010ace 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4188,21 +4188,6 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 	spin_unlock(&kvm->mmu_lock);
 }
 
-void kvm_mmu_zap_all(struct kvm *kvm)
-{
-   struct kvm_mmu_page *sp, *node;
-   LIST_HEAD(invalid_list);
-
-	spin_lock(&kvm->mmu_lock);
-restart:
-	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
-			goto restart;
-
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	spin_unlock(&kvm->mmu_lock);
-}
-
 #define BATCH_ZAP_PAGES   10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3758ff9..15e10f7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7066,13 +7066,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-   kvm_mmu_zap_all(kvm);
+   kvm_mmu_invalidate_zap_all_pages(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
   struct kvm_memory_slot *slot)
 {
-   kvm_arch_flush_shadow_all(kvm);
+   kvm_mmu_invalidate_zap_all_pages(kvm);
 }
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
-- 
1.7.7.6



[PATCH v7 07/11] KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages

2013-05-22 Thread Xiao Guangrong
It is useful for debugging and development

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c  |1 +
 arch/x86/kvm/mmutrace.h |   20 
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c010ace..3a3e6c5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4245,6 +4245,7 @@ restart:
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
 {
 	spin_lock(&kvm->mmu_lock);
+	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
 	kvm->arch.mmu_valid_gen++;
 
kvm_zap_obsolete_pages(kvm);
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 697f466..eb444dd 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -276,6 +276,26 @@ TRACE_EVENT(
  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
)
 );
+
+TRACE_EVENT(
+   kvm_mmu_invalidate_zap_all_pages,
+   TP_PROTO(struct kvm *kvm),
+   TP_ARGS(kvm),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, mmu_valid_gen)
+   __field(unsigned int, mmu_used_pages)
+   ),
+
+   TP_fast_assign(
+		__entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+		__entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+	),
+
+	TP_printk("kvm-mmu-valid-gen %lx used_pages %x",
+		  __entry->mmu_valid_gen, __entry->mmu_used_pages
+   )
+);
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
1.7.7.6



[PATCH v7 11/11] KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped

2013-05-22 Thread Xiao Guangrong
From: Gleb Natapov g...@redhat.com

Quote Gleb's mail:
| why don't we check for sp->role.invalid in
| kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()?

and

| Actually we can add check for is_obsolete_sp() there too since
| kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
| after incrementing mmu_valid_gen.

[ Xiao: add some comments and the check of is_obsolete_sp() ]

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e34056..055d675 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2097,7 +2097,13 @@ __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct 
kvm_mmu_page *sp,
kvm_mod_used_mmu_pages(kvm, -1);
} else {
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
-   kvm_reload_remote_mmus(kvm);
+
+   /*
+* The obsolete pages can not be used on any vcpus.
+* See the comments in kvm_mmu_invalidate_zap_all_pages().
+*/
+		if (!sp->role.invalid && !is_obsolete_sp(kvm, sp))
+   kvm_reload_remote_mmus(kvm);
}
 
 	sp->role.invalid = 1;
-- 
1.7.7.6



[PATCH v7 10/11] KVM: MMU: collapse TLB flushes when zap all pages

2013-05-22 Thread Xiao Guangrong
kvm_zap_obsolete_pages uses the lock-break technique to zap pages,
and it currently flushes the TLB every time it breaks the lock.

We can instead reload the MMU on all vcpus after updating the
generation number, so that the obsolete pages are not used on any
vcpu; after that we no longer need to flush the TLB while the
obsolete pages are being zapped.

Note: kvm_mmu_commit_zap_page is still needed before freeing the
pages, since other vcpus may be doing lockless shadow page walks.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   32 ++--
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e676356..5e34056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 restart:
list_for_each_entry_safe_reverse(sp, node,
 				  &kvm->arch.active_mmu_pages, link) {
-   int ret;
-
/*
 * No obsolete page exists before new created page since
 * active_mmu_pages is the FIFO list.
@@ -4254,21 +4252,24 @@ restart:
if (sp-role.invalid)
continue;
 
+   /*
+* Need not flush tlb since we only zap the sp with invalid
+* generation number.
+*/
 		if (batch >= BATCH_ZAP_PAGES &&
-		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+		      cond_resched_lock(&kvm->mmu_lock)) {
 			batch = 0;
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
goto restart;
}
 
-		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-   batch += ret;
-
-   if (ret)
-   goto restart;
+		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
+							   &invalid_list);
}
 
+   /*
+* Should flush tlb before free page tables since lockless-walking
+* may use the pages.
+*/
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
 
@@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
trace_kvm_mmu_invalidate_zap_all_pages(kvm);
 	kvm->arch.mmu_valid_gen++;
 
+   /*
+* Notify all vcpus to reload its shadow page table
+* and flush TLB. Then all vcpus will switch to new
+* shadow page table with the new mmu_valid_gen.
+*
+* Note: we should do this under the protection of
+* mmu-lock, otherwise, vcpu would purge shadow page
+* but miss tlb flush.
+*/
+   kvm_reload_remote_mmus(kvm);
+
kvm_zap_obsolete_pages(kvm);
 	spin_unlock(&kvm->mmu_lock);
 }
-- 
1.7.7.6



[PATCH v7 08/11] KVM: MMU: do not reuse the obsolete page

2013-05-22 Thread Xiao Guangrong
The obsolete page will be zapped soon, so do not reuse it: reusing
it would only lead to extra page faults once it is zapped

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a3e6c5..9b57faa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1869,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
role.quadrant = quadrant;
}
 	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+		if (is_obsolete_sp(vcpu->kvm, sp))
+			continue;
+
 		if (!need_sync && sp->unsync)
need_sync = true;
 
-- 
1.7.7.6



[PATCH v7 00/11] KVM: MMU: fast zap all shadow pages

2013-05-22 Thread Xiao Guangrong
Changlog:
V7:
  1): separate some optimization into two patches which do not reuse
  the obsolete pages and collapse tlb flushes, suggested by Marcelo.

  2): make the patch based on Gleb's diff change which reduce
  KVM_REQ_MMU_RELOAD when root page is being zapped.

  3): remove calling kvm_mmu_zap_page when patching hypercall, investigated
  by Gleb.

  4): drop the patch which deleted page from hash list at the prepare
  time since it can break the walk based on hash list.

  5): rename kvm_mmu_invalidate_all_pages to kvm_mmu_invalidate_zap_all_pages.

  6): introduce kvm_mmu_prepare_zap_obsolete_page which is used to zap obsolete
  page to collapse tlb flushes.

V6:
  1): reversely walk active_list to skip the new created pages based
  on the comments from Gleb and Paolo.

  2): completely replace kvm_mmu_zap_all by kvm_mmu_invalidate_all_pages
  based on Gleb's comments.

  3): improve the parameters of kvm_mmu_invalidate_all_pages based on
  Gleb's comments.
 
  4): rename kvm_mmu_invalidate_memslot_pages to kvm_mmu_invalidate_all_pages
  5): rename zap_invalid_pages to kvm_zap_obsolete_pages

V5:
  1): rename is_valid_sp to is_obsolete_sp
  2): use lock-break technique to zap all old pages instead of only pages
  linked on invalid slot's rmap suggested by Marcelo.
  3): trace invalid pages and kvm_mmu_invalidate_memslot_pages()
  4): rename kvm_mmu_invalid_memslot_pages to kvm_mmu_invalidate_memslot_pages
  according to Takuya's comments.

V4:
  1): drop unmapping invalid rmap out of mmu-lock and use lock-break technique
  instead. Thanks to Gleb's comments.

  2): needn't handle invalid-gen pages specially due to page table always
  switched by KVM_REQ_MMU_RELOAD. Thanks to Marcelo's comments.

V3:
  completely redesign the algorithm, please see below.

V2:
  - do not reset n_requested_mmu_pages and n_max_mmu_pages
  - batch free root shadow pages to reduce vcpu notification and mmu-lock
contention
  - remove the first patch that introduce kvm-arch.mmu_cache since we only
'memset zero' on hashtable rather than all mmu cache members in this
version
  - remove unnecessary kvm_reload_remote_mmus after kvm_mmu_zap_all

* Issue
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
walking and zapping all shadow pages one by one, and it also needs to zap
every guest page's rmap and every shadow page's parent spte list. Things
become particularly bad if the guest uses a lot of memory or many vcpus.
It is not good for scalability.

* Idea
KVM maintains a global mmu invalid generation-number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow page sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all
vcpus. Each vcpu will create a new shadow page table according to kvm's
current generation-number. This ensures the old pages are not used any
more.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are then zapped using the lock-break technique.
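
Condensed into pseudo-code, the flow described above looks roughly like
this (a sketch that follows the naming used in the patches below, not a
literal copy of them):

	/* invalidation side: cheap apart from the background zapping */
	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;	/* all existing pages become obsolete */
	kvm_reload_remote_mmus(kvm);	/* vcpus drop their roots and rebuild
					 * them with the new generation number */
	kvm_zap_obsolete_pages(kvm);	/* lock-break walk that frees old pages */
	spin_unlock(&kvm->mmu_lock);

	/* a shadow page is obsolete iff it was created under an older generation */
	static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
	{
		return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
	}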

Gleb Natapov (1):
  KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped

Xiao Guangrong (10):
  KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall
  KVM: MMU: drop unnecessary kvm_reload_remote_mmus
  KVM: MMU: fast invalidate all pages
  KVM: MMU: zap pages in batch
  KVM: x86: use the fast way to invalidate all pages
  KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
  KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
  KVM: MMU: do not reuse the obsolete page
  KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page
  KVM: MMU: collapse TLB flushes when zap all pages

 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |  134 ---
 arch/x86/kvm/mmu.h  |1 +
 arch/x86/kvm/mmutrace.h |   42 +---
 arch/x86/kvm/x86.c  |   16 +
 5 files changed, 162 insertions(+), 33 deletions(-)

-- 
1.7.7.6



[PATCH v7 01/11] KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall

2013-05-22 Thread Xiao Guangrong
Quote Gleb's mail:

| Back then kvm-lock protected memslot access so code like:
|
| mutex_lock(&vcpu->kvm->lock);
| kvm_mmu_zap_all(vcpu->kvm);
| mutex_unlock(&vcpu->kvm->lock);
|
| which is what 7aa81cc0 does was enough to guaranty that no vcpu will
| run while code is patched. This is no longer the case and
| mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
| so now kvm_mmu_zap_all() there is useless and the code is incorrect.

So we drop it and it will be fixed later

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d28810..6739b1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5523,13 +5523,6 @@ static int emulator_fix_hypercall(struct 
x86_emulate_ctxt *ctxt)
char instruction[3];
unsigned long rip = kvm_rip_read(vcpu);
 
-   /*
-* Blow out the MMU to ensure that no other VCPU has an active mapping
-* to ensure that the updated hypercall appears atomically across all
-* VCPUs.
-*/
-	kvm_mmu_zap_all(vcpu->kvm);
-
 	kvm_x86_ops->patch_hypercall(vcpu, instruction);
 
return emulator_write_emulated(ctxt, rip, instruction, 3, NULL);
-- 
1.7.7.6



[PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

2013-05-22 Thread Xiao Guangrong
It is only used to zap obsolete pages. Since an obsolete page
will not be used again, we need not spend time finding its unsync
children. Also, we delete the page from the shadow page cache so that
the page is completely isolated after calling this function.

A later patch will use it to collapse TLB flushes

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   46 +-
 1 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b57faa..e676356 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
-	hlist_del(&sp->hash_link);
+	hlist_del_init(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
@@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
 }
 
-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-   struct list_head *invalid_list)
+static int
+__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+  bool zap_unsync_children,
+  struct list_head *invalid_list)
 {
-   int ret;
+   int ret = 0;
 
trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
-   ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
+   if (likely(zap_unsync_children))
+   ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);
 
@@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
return ret;
 }
 
+/*
+ * The obsolete page will not be used, we need not spend time to find
+ * its unsync children out. Also, we delete the page from shadow page
+ * cache so that the page is completely isolated after call this
+ * function.
+ *
+ * Note: if we use this function in for_each_gfn_xxx macros, we should
+ * re-walk the list when it successfully zaps one page.
+ */
+static int
+kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+   int ret;
+
+   WARN_ON(!is_obsolete_sp(kvm, sp));
+
+   ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
+   if (ret)
+		hlist_del_init(&sp->hash_link);
+
+	WARN_ON(ret > 1);
+   return ret;
+}
+
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   struct list_head *invalid_list)
+{
+   return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
 {
-- 
1.7.7.6



[PATCH v7 06/11] KVM: MMU: show mmu_valid_gen in shadow page related tracepoints

2013-05-22 Thread Xiao Guangrong
Show sp-mmu_valid_gen

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmutrace.h |   22 --
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index b8f6172..697f466 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -7,16 +7,18 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvmmmu
 
-#define KVM_MMU_PAGE_FIELDS \
-   __field(__u64, gfn) \
-   __field(__u32, role) \
-   __field(__u32, root_count) \
+#define KVM_MMU_PAGE_FIELDS\
+   __field(unsigned long, mmu_valid_gen)   \
+   __field(__u64, gfn) \
+   __field(__u32, role)\
+   __field(__u32, root_count)  \
__field(bool, unsync)
 
-#define KVM_MMU_PAGE_ASSIGN(sp) \
-	__entry->gfn = sp->gfn;			\
-	__entry->role = sp->role.word;		\
-	__entry->root_count = sp->root_count;	\
+#define KVM_MMU_PAGE_ASSIGN(sp)				\
+	__entry->mmu_valid_gen = sp->mmu_valid_gen;	\
+	__entry->gfn = sp->gfn;				\
+	__entry->role = sp->role.word;			\
+	__entry->root_count = sp->root_count;		\
 	__entry->unsync = sp->unsync;
 
 #define KVM_MMU_PAGE_PRINTK() ({   \
@@ -28,8 +30,8 @@
\
 	role.word = __entry->role;					\
 									\
-	trace_seq_printf(p, "sp gfn %llx %u%s q%u%s %s%s"		\
-			 " %snxe root %u %s%c",				\
+	trace_seq_printf(p, "sp gen %lx gfn %llx %u%s q%u%s %s%s"	\
+			 " %snxe root %u %s%c", __entry->mmu_valid_gen,	\
 			 __entry->gfn, role.level,			\
 			 role.cr4_pae ? " pae" : "",			\
 role.quadrant, \
-- 
1.7.7.6



[PATCH v7 03/11] KVM: MMU: fast invalidate all pages

2013-05-22 Thread Xiao Guangrong
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
walking and zapping all shadow pages one by one, and it also needs to zap
every guest page's rmap and every shadow page's parent spte list. Things
become particularly bad if the guest uses a lot of memory or many vcpus.
It is not good for scalability

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number, stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created

When KVM needs to zap all shadow page sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all
vcpus. Each vcpu will create a new shadow page table according to kvm's
current generation-number. This ensures the old pages are not used any
more. The obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are then zapped using the lock-break technique

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +
 arch/x86/kvm/mmu.c  |   84 +++
 arch/x86/kvm/mmu.h  |1 +
 3 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..bff7d46 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,6 +222,7 @@ struct kvm_mmu_page {
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
+   unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -529,6 +530,7 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+   unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f8ca2f3..f302540 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
 }
 
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
@@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
 
 		account_shadowed(vcpu->kvm, gfn);
 	}
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
init_shadow_page_table(sp);
trace_kvm_mmu_get_page(sp, true);
return sp;
@@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
kvm_mmu_page_unlink_children(kvm, sp);
kvm_mmu_unlink_parents(kvm, sp);
+
 	if (!sp->role.invalid && !sp->role.direct)
 		unaccount_shadowed(kvm, sp->gfn);
+
 	if (sp->unsync)
 		kvm_unlink_unsync_page(kvm, sp);
 	if (!sp->root_count) {
@@ -4195,6 +4203,82 @@ restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+   struct kvm_mmu_page *sp, *node;
+   LIST_HEAD(invalid_list);
+
+restart:
+   list_for_each_entry_safe_reverse(sp, node,
+				  &kvm->arch.active_mmu_pages, link) {
+   /*
+* No obsolete page exists before new created page since
+* active_mmu_pages is the FIFO list.
+*/
+   if (!is_obsolete_sp(kvm, sp))
+   break;
+
+		/*
+		 * Do not repeatedly zap a root page to avoid unnecessary
+		 * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
+		 * progress:
+		 *    vcpu 0                        vcpu 1
+		 *                         call vcpu_enter_guest():
+		 *                            1): handle KVM_REQ_MMU_RELOAD
+		 *                                and require mmu-lock to
+		 *                                load mmu
+		 * repeat:
+		 *    1): zap root page and
+		 *        send KVM_REQ_MMU_RELOAD
+		 *
+		 *    2): if (cond_resched_lock(mmu-lock))
+		 *
+		 *                            2): hold mmu-lock and load mmu
+		 *
+		 *                            3): see KVM_REQ_MMU_RELOAD bit
+		 *                                on vcpu->requests is set
+  

[PATCH v7 02/11] KVM: MMU: drop unnecessary kvm_reload_remote_mmus

2013-05-22 Thread Xiao Guangrong
It is the responsibility of kvm_mmu_zap_all that keeps the
consistent of mmu and tlbs. And it is also unnecessary after
zap all mmio sptes since no mmio spte exists on root shadow
page and it can not be cached into tlb

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |5 +
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6739b1d..3758ff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7060,16 +7060,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 * If memory slot is created, or moved, we need to clear all
 * mmio sptes.
 */
-   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE))
kvm_mmu_zap_mmio_sptes(kvm);
-   kvm_reload_remote_mmus(kvm);
-   }
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
kvm_mmu_zap_all(kvm);
-   kvm_reload_remote_mmus(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
-- 
1.7.7.6



Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep

2013-05-22 Thread Michael S. Tsirkin
On Thu, May 16, 2013 at 08:40:41PM +0200, Peter Zijlstra wrote:
 On Thu, May 16, 2013 at 02:16:10PM +0300, Michael S. Tsirkin wrote:
  There are several ways to make sure might_fault
  calling function does not sleep.
  One is to use it on kernel or otherwise locked memory - apparently
  nfs/sunrpc does this. As noted by Ingo, this is handled by the
   might_fault() implementation in mm/memory.c but not the one in
   linux/kernel.h so in the current code might_fault() schedules
   differently depending on CONFIG_PROVE_LOCKING, which is an undesired
   semantic side effect.
  
  Another is to call pagefault_disable: in this case the page fault
  handler will go to fixups processing and we get an error instead of
  sleeping, so the might_sleep annotation is a false positive.
  vhost driver wants to do this now in order to reuse socket ops
  under a spinlock (and fall back on slower thread handler
  on error).
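
To make the second case concrete, the pattern vhost has in mind looks
roughly like this (a sketch of the idea only, not the actual vhost code;
the fallback helper name is made up):

	pagefault_disable();
	left = __copy_from_user(&hdr, uaddr, sizeof(hdr));
	pagefault_enable();
	if (left) {
		/* the access would have faulted: punt the work to the
		 * handler thread, which is allowed to sleep and can
		 * take the fault there */
		vhost_defer_to_worker();	/* made-up name for the fallback */
	}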
 
 Are you using the assumption that spin_lock() implies preempt_disable() 
 implies
 pagefault_disable()? Note that this assumption isn't valid for -rt where the
 spinlock becomes preemptible but we'll not disable pagefaults.
 
  Address both issues by:
  - dropping the unconditional call to might_sleep
from the fast might_fault code in linux/kernel.h
  - checking for pagefault_disable() in the
CONFIG_PROVE_LOCKING implementation
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
   include/linux/kernel.h |  1 -
   mm/memory.c| 14 +-
   2 files changed, 9 insertions(+), 6 deletions(-)
  
  diff --git a/include/linux/kernel.h b/include/linux/kernel.h
  index e96329c..322b065 100644
  --- a/include/linux/kernel.h
  +++ b/include/linux/kernel.h
  @@ -198,7 +198,6 @@ void might_fault(void);
   #else
   static inline void might_fault(void)
   {
  -   might_sleep();
 
 This removes potential resched points for PREEMPT_VOLUNTARY -- was that
 intentional?

OK so I'm thinking of going back to this idea:
it has the advantage of being very simple,
and just might make some workloads faster
if they do lots of copy_XX_user in a loop.

Will have to be tested of course - anyone
has objections?

   }
   #endif
   
  diff --git a/mm/memory.c b/mm/memory.c
  index 6dc1882..1b8327b 100644
  --- a/mm/memory.c
  +++ b/mm/memory.c
  @@ -4222,13 +4222,17 @@ void might_fault(void)
  if (segment_eq(get_fs(), KERNEL_DS))
  return;
   
  -   might_sleep();
  /*
  -* it would be nicer only to annotate paths which are not under
  -* pagefault_disable, however that requires a larger audit and
  -* providing helpers like get_user_atomic.
  +* It would be nicer to annotate paths which are under preempt_disable
  +* but not under pagefault_disable, however that requires a new flag
  +* for differentiating between the two.
 
 -rt has this, pagefault_disable() doesn't change the preempt count but pokes
 at task_struct::pagefault_disable.
 
   */
  -	if (!in_atomic() && current->mm)
  +   if (in_atomic())
  +   return;
  +
  +   might_sleep();
  +
  +	if (current->mm)
   		might_lock_read(&current->mm->mmap_sem);
   }
   EXPORT_SYMBOL(might_fault);
  -- 
  MST


Re: [PATCH v2 10/10] kernel: might_fault does not imply might_sleep

2013-05-22 Thread Michael S. Tsirkin
On Tue, May 21, 2013 at 01:57:34PM +0200, Peter Zijlstra wrote:
 On Sun, May 19, 2013 at 12:35:26PM +0300, Michael S. Tsirkin wrote:
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -198,7 +198,6 @@ void might_fault(void);
 #else
 static inline void might_fault(void)
 {
-   might_sleep();
   
   This removes potential resched points for PREEMPT_VOLUNTARY -- was that
   intentional?
  
  No it's a bug. Thanks for pointing this out.
  OK so I guess it should be might_sleep_if(!in_atomic())
  and this means might_fault would have to move from linux/kernel.h to
  linux/uaccess.h, since in_atomic() is in linux/hardirq.h
  
  Makes sense?
 
 So the only difference between PROVE_LOCKING and not should be the
 might_lock_read() thing; so how about something like this?

So the problem with the below is that might_fault is needed
in asm/uaccess.h.
I'm still trying various approaches but the dependencies there
are very complex.

 ---
  include/linux/kernel.h  |  7 ++-
  include/linux/uaccess.h | 26 ++
  mm/memory.c | 14 ++
  3 files changed, 30 insertions(+), 17 deletions(-)
 
 diff --git a/include/linux/kernel.h b/include/linux/kernel.h
 index e96329c..70812f4 100644
 --- a/include/linux/kernel.h
 +++ b/include/linux/kernel.h
 @@ -194,12 +194,9 @@ extern int _cond_resched(void);
   })
  
  #ifdef CONFIG_PROVE_LOCKING
 -void might_fault(void);
 +void might_fault_lockdep(void);
  #else
 -static inline void might_fault(void)
 -{
 - might_sleep();
 -}
 +static inline void might_fault_lockdep(void) { }
  #endif
  
  extern struct atomic_notifier_head panic_notifier_list;
 diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
 index 5ca0951..50a2cc9 100644
 --- a/include/linux/uaccess.h
 +++ b/include/linux/uaccess.h
 @@ -38,6 +38,32 @@ static inline void pagefault_enable(void)
   preempt_check_resched();
  }
  
 +static inline bool __can_fault(void)
 +{
 + /*
 +  * Some code (nfs/sunrpc) uses socket ops on kernel memory while
 +  * holding the mmap_sem, this is safe because kernel memory doesn't
 +  * get paged out, therefore we'll never actually fault, and the
 +  * below annotations will generate false positives.
 +  */
 + if (segment_eq(get_fs(), KERNEL_DS))
 + return false;
 +
 + if (in_atomic() /* || pagefault_disabled() */)
 + return false;
 +
 + return true;
 +}
 +
 +static inline void might_fault(void)
 +{
 + if (!__can_fault())
 + return;
 +
 + might_sleep();
 + might_fault_lockdep();
 +}
 +
  #ifndef ARCH_HAS_NOCACHE_UACCESS
  
  static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 diff --git a/mm/memory.c b/mm/memory.c
 index 6dc1882..266610c 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -4211,19 +4211,9 @@ void print_vma_addr(char *prefix, unsigned long ip)
  }
  
  #ifdef CONFIG_PROVE_LOCKING
 -void might_fault(void)
 +void might_fault_lockdep(void)
  {
   /*
 -  * Some code (nfs/sunrpc) uses socket ops on kernel memory while
 -  * holding the mmap_sem, this is safe because kernel memory doesn't
 -  * get paged out, therefore we'll never actually fault, and the
 -  * below annotations will generate false positives.
 -  */
 - if (segment_eq(get_fs(), KERNEL_DS))
 - return;
 -
 - might_sleep();
 - /*
* it would be nicer only to annotate paths which are not under
* pagefault_disable, however that requires a larger audit and
* providing helpers like get_user_atomic.
 @@ -4231,7 +4221,7 @@ void might_fault(void)
  	if (!in_atomic() && current->mm)
  		might_lock_read(&current->mm->mmap_sem);
  }
 -EXPORT_SYMBOL(might_fault);
 +EXPORT_SYMBOL(might_fault_lockdep);
  #endif
  
  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)


[PATCH 3/5] x86: Annotate _ASM_EXTABLE users to distinguish uaccess from everything else

2013-05-22 Thread Andy Lutomirski
The idea is that the kernel can be much more careful fixing up
uaccess exceptions -- page faults on user addresses are the only
legitimate reason for a uaccess instruction to fault.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---

I'm not 100% sure what's happening in the KVM code.  Can someone familiar
with it take a look?
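
To illustrate the direction (a hypothetical sketch, not part of this
patch; ex_class()/ex_fixup_addr() are assumed helpers that would decode
the class bits defined below):

	int fixup_by_class(struct pt_regs *regs, unsigned long address)
	{
		const struct exception_table_entry *e;

		e = search_exception_tables(regs->ip);
		if (!e)
			return 0;	/* no fixup: oops as usual */

		/* uaccess fixups are only legitimate for user addresses */
		if (ex_class(e) == _EXTABLE_CLASS_UACCESS &&
		    address >= TASK_SIZE_MAX)
			return 0;

		regs->ip = ex_fixup_addr(e);
		return 1;
	}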

 arch/x86/ia32/ia32entry.S |   4 +-
 arch/x86/include/asm/asm.h|  13 ++-
 arch/x86/include/asm/fpu-internal.h   |   6 +-
 arch/x86/include/asm/futex.h  |   8 +-
 arch/x86/include/asm/kvm_host.h   |   2 +-
 arch/x86/include/asm/msr.h|   4 +-
 arch/x86/include/asm/segment.h|   2 +-
 arch/x86/include/asm/special_insns.h  |   2 +-
 arch/x86/include/asm/uaccess.h|   8 +-
 arch/x86/include/asm/word-at-a-time.h |   2 +-
 arch/x86/include/asm/xsave.h  |   6 +-
 arch/x86/kernel/entry_32.S|  26 ++---
 arch/x86/kernel/entry_64.S|   6 +-
 arch/x86/kernel/ftrace.c  |   4 +-
 arch/x86/kernel/test_nx.c |   2 +-
 arch/x86/kernel/test_rodata.c |   2 +-
 arch/x86/kvm/emulate.c|   4 +-
 arch/x86/lib/checksum_32.S|   4 +-
 arch/x86/lib/copy_user_64.S   |  50 
 arch/x86/lib/copy_user_nocache_64.S   |  44 +++
 arch/x86/lib/csum-copy_64.S   |   6 +-
 arch/x86/lib/getuser.S|  12 +-
 arch/x86/lib/mmx_32.c |  12 +-
 arch/x86/lib/msr-reg.S|   4 +-
 arch/x86/lib/putuser.S|  10 +-
 arch/x86/lib/usercopy_32.c| 212 +-
 arch/x86/lib/usercopy_64.c|   4 +-
 arch/x86/mm/init_32.c |   2 +-
 arch/x86/um/checksum_32.S |   4 +-
 arch/x86/xen/xen-asm_32.S |   2 +-
 30 files changed, 236 insertions(+), 231 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 474dc1b..8d3b5c2 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -149,7 +149,7 @@ ENTRY(ia32_sysenter_target)
   32bit zero extended */ 
ASM_STAC
 1: movl(%rbp),%ebp
-   _ASM_EXTABLE(1b,ia32_badarg)
+   _ASM_EXTABLE_UACCESS(1b,ia32_badarg)
ASM_CLAC
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
testl   
$_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
@@ -306,7 +306,7 @@ ENTRY(ia32_cstar_target)
/* hardware stack frame is complete now */  
ASM_STAC
 1: movl(%r8),%r9d
-   _ASM_EXTABLE(1b,ia32_badarg)
+   _ASM_EXTABLE_UACCESS(1b,ia32_badarg)
ASM_CLAC
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
testl   
$_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fa47fd4..f48a850 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -57,14 +57,16 @@
  */
 
 /* There are two bits of extable entry class, added to a signed offset. */
-#define _EXTABLE_CLASS_DEFAULT 0   /* standard uaccess fixup */
+#define _EXTABLE_CLASS_UACCESS 0   /* standard uaccess fixup */
+#define _EXTABLE_CLASS_ANY 0x4000  /* catch any exception */
 #define _EXTABLE_CLASS_EX  0x8000  /* uaccess + set uaccess_err */
 
 /*
  * The biases are the class constants + 0x2000, as signed integers.
  * This can't use ordinary arithmetic -- the assembler isn't that smart.
  */
-#define _EXTABLE_BIAS_DEFAULT  0x2000
+#define _EXTABLE_BIAS_UACCESS  0x2000
+#define _EXTABLE_BIAS_ANY  0x2000 + 0x4000
 #define _EXTABLE_BIAS_EX   0x2000 - 0x8000
 
 #ifdef __ASSEMBLY__
@@ -85,8 +87,11 @@
 .popsection\n
 #endif
 
-#define _ASM_EXTABLE(from,to)  \
-   _ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_DEFAULT)
+#define _ASM_EXTABLE_UACCESS(from,to)  \
+   _ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_UACCESS)
+
+#define _ASM_EXTABLE_ANY(from,to)  \
+   _ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_ANY)
 
 #define _ASM_EXTABLE_EX(from,to)   \
_ASM_EXTABLE_CLASS(from, to, _EXTABLE_BIAS_EX)
diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index e25cc33..7f86031 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -133,7 +133,7 @@ static inline void sanitize_i387_state(struct task_struct 
*tsk)
 3:  movl $-1,%[err]\n\
 jmp  2b\n\
 .previous\n  \
-_ASM_EXTABLE(1b, 3b)   \
+_ASM_EXTABLE_UACCESS(1b, 3b)

Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-22 Thread Scott Wood

On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #endif
 	case KVM_CAP_SPAPR_MULTITCE:
+	case KVM_CAP_SPAPR_TCE_IOMMU:
 		r = 1;
 		break;
 	default:


Don't advertise SPAPR capabilities if it's not book3s -- and probably  
there's some additional limitation that would be appropriate.
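
E.g. something along these lines (just a sketch of what I mean; pick
whatever condition actually matches the code this depends on):

	case KVM_CAP_SPAPR_MULTITCE:
		r = 1;
		break;
#ifdef CONFIG_PPC_BOOK3S_64
	case KVM_CAP_SPAPR_TCE_IOMMU:
		r = 1;
		break;
#endif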



@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
+				&create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */

 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90)


Hmm...


@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct kvm_create_spapr_tce_iommu)


Shouldn't this go under the vm ioctl section?

-Scott


Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-22 Thread Marcelo Tosatti
On Wed, May 22, 2013 at 03:22:55AM -0400, Vadim Rozenfeld wrote:
 
 
 - Original Message -
 From: Marcelo Tosatti mtosa...@redhat.com
 To: Vadim Rozenfeld vroze...@redhat.com
 Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
 Sent: Wednesday, May 22, 2013 10:50:46 AM
 Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC
 
 On Sun, May 19, 2013 at 05:06:37PM +1000, Vadim Rozenfeld wrote:
  The following patch allows to activate a partition reference 
  time enlightenment that is based on the host platform's support 
  for an Invariant Time Stamp Counter (iTSC).
  NOTE: This code will survive migration due to lack of VM stop/resume
  handlers, when offset, scale and sequence should be
  readjusted. 
  
  ---
   arch/x86/kvm/x86.c | 6 +-
   1 file changed, 5 insertions(+), 1 deletion(-)
  
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 9645dab..b423fe4 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -1838,7 +1838,6 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
  u32 msr, u64 data)
  u64 gfn;
  unsigned long addr;
  HV_REFERENCE_TSC_PAGE tsc_ref;
  -   tsc_ref.TscSequence = 0;
   	if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
   		kvm->arch.hv_tsc_page = data;
  break;
  @@ -1848,6 +1847,11 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
  u32 msr, u64 data)
  HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
  if (kvm_is_error_hva(addr))
  return 1;
  +   tsc_ref.TscSequence =
  +   boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;
 
 1) You want NONSTOP_TSC (see 40fb1715 commit) which matches INVARIANT TSC.
 [VR]
 Thank you for reviewing. Will fix it.
 2) TscSequence should increase? 
 This field serves as a sequence number that is incremented whenever...
 [VR]
 Yes, on every VM resume, including migration. After migration we also need
 to recalculate scale and adjust offset. 
 3) 0x is the value for invalid source of reference time?
 [VR] Yes, on boot-up. In this case the guest will go with the PM timer (not
 sure about HPET, but I can check). But if we set the sequence to 0x after
 migration - it probably will not work.

Reference TSC during Save and Restore and Migration

To address migration scenarios to physical platforms that do not support
iTSC, the TscSequence field is used. In the event that a guest partition
is  migrated from an iTSC capable host to a non-iTSC capable host, the
hypervisor sets TscSequence to the special value of 0x, which
directs the guest operating system to fall back to a different clock
source (for example, the virtual PM timer).

Why it would not/does not work after migration?
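
For reference, the read side the guest is expected to implement is
roughly the following (a sketch based on the quoted text, assuming the
usual TscSequence/TscScale/TscOffset layout of HV_REFERENCE_TSC_PAGE;
the invalid-sequence constant and rdtsc()/fallback helpers are
placeholders):

	u64 read_reference_time(volatile HV_REFERENCE_TSC_PAGE *p)
	{
		u32 seq;
		u64 scale, offset, tsc;

		do {
			seq = p->TscSequence;
			if (seq == INVALID_TSC_SEQUENCE)	/* the special value above */
				return read_fallback_clock();	/* e.g. virtual PM timer */
			scale  = p->TscScale;
			offset = p->TscOffset;
			tsc    = rdtsc();			/* raw guest TSC */
		} while (p->TscSequence != seq);

		/* 100ns units: (tsc * scale) >> 64, plus the offset */
		return (u64)(((unsigned __int128)tsc * scale) >> 64) + offset;
	}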




Re: [RFC PATCH v2 1/2] add support for Hyper-V reference time counter

2013-05-22 Thread Paolo Bonzini
Il 22/05/2013 09:32, Vadim Rozenfeld ha scritto:
  @@ -1827,6 +1829,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
  u32 msr, u64 data)
 if (__copy_to_user((void __user *)addr, instructions, 4))
 return 1;
 	kvm->arch.hv_hypercall = data;
  +	local_irq_disable();
  +	kvm->arch.hv_ref_count = get_kernel_ns();
  +  local_irq_enable();
  +  break;
 local_irq_disable/local_irq_enable not needed.
 
 
 What is the reasoning behind reading this time value at msr write time?
 [VR] Windows writes this MSR only once, during HAL initialization.
 So, I decided to treat this call as a partition create event.
 

But is it expected by Windows that the reference count starts counting
up from 0 at partition creation time?  If you could just use
(get_kernel_ns() + kvm->arch.kvmclock_offset) / 100, it would also be
simpler for migration purposes.
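
I.e. when the guest reads the reference counter the host would just
return something like (sketch):

	/* 100ns units since guest boot; the idea being that
	 * kvmclock_offset is already part of the migrated state */
	data = div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);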

Paolo


[PATCH] qemu-kvm: fix unmatched RAM alloction/free

2013-05-22 Thread Xudong Hao
mmap is used in the qemu_vmalloc function instead of qemu_memalign (commit
7dda5dc8), so the matching qemu_vfree should be changed to munmap to fix
the mismatched allocation/free.

This issue shows up when a PCI device is being assigned to a KVM guest:
a failure to read the PCI ROM file frees the RAM block, and the mismatched
qemu_vfree call then causes a segmentation fault.
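
The mismatch being fixed is, in short (simplified sketch, not the exact
qemu code paths):

    /* allocation side (qemu_vmalloc since commit 7dda5dc8) */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    ...
    /* so the release side must be munmap(), not qemu_vfree()/free() */
    munmap(ptr, size);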

Signed-off-by: Xudong Hao xudong@intel.com
---
 exec.c |6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/exec.c b/exec.c
index fa1e0c3..d40d237 100644
--- a/exec.c
+++ b/exec.c
@@ -1152,15 +1152,11 @@ void qemu_ram_free(ram_addr_t addr)
 abort();
 #endif
 } else {
-#if defined(TARGET_S390X) && defined(CONFIG_KVM)
-            munmap(block->host, block->length);
-#else
             if (xen_enabled()) {
                 xen_invalidate_map_cache_entry(block->host);
             } else {
-                qemu_vfree(block->host);
+                munmap(block->host, block->length);
 }
-#endif
 }
 g_free(block);
 break;
-- 
1.5.6



Re: provide vhost thread per virtqueue for forwarding scenario

2013-05-22 Thread Rusty Russell
Jason Wang jasow...@redhat.com writes:
 On 05/22/2013 05:59 PM, Zang Hongyong wrote:
 On 2013/5/20 15:43, Michael S. Tsirkin wrote:
 On Mon, May 20, 2013 at 02:11:19AM +, Qinchuanyu wrote:
 Yes, I don't think we want to create threads even more aggressively
 in all cases. I'm worried about scalability as it is.
 I think we should explore a flexible approach, use a thread pool
 (for example, a wq) to share threads between virtqueues,
 switch to a separate thread only if there's free CPU and existing
 threads are busy. Hopefully share threads between vhost instances too.
  On the Xen platform, the network backend PV driver model has evolved in
  this direction. Netbacks from all DomUs share a thread pool,
  and the number of threads equals the number of CPU cores.
  Is there any plan for the kvm platform?

 There used to be two related RFCs for this, one is the multiple vhost
 workers from Anthony another is percpu vhost thread from Shirley. You
 can search the archives on netdev or kvm for the patches.

As I've said to MST before, I think our entire model is wrong.
Userspace should create the threads and call in.  If you're doing kernel
acceleration, two extra threads per NIC is a tiny overhead.

Of course, such radical changes to vhost don't help existing users, as
Qinchuanyu asked...
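
For concreteness, the wq-based sharing suggested up-thread would look
roughly like this (illustrative only; the virtqueue type and per-queue
handler are assumed, this is not the real driver):

	struct vq_work {
		struct work_struct work;
		struct vhost_virtqueue *vq;
	};

	static void vq_work_fn(struct work_struct *w)
	{
		struct vq_work *vw = container_of(w, struct vq_work, work);
		handle_vq(vw->vq);		/* assumed per-virtqueue handler */
	}

	/* kick path: one work item per virtqueue, all served by the
	 * shared pool instead of a kthread per device */
	queue_work(system_unbound_wq, &vw->work);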

Cheers,
Rusty,


Re: [PATCH v7 09/11] KVM: MMU: introduce kvm_mmu_prepare_zap_obsolete_page

2013-05-22 Thread Gleb Natapov
On Thu, May 23, 2013 at 03:55:58AM +0800, Xiao Guangrong wrote:
 It is only used to zap the obsolete page. Since the obsolete page
 will not be used, we need not spend time to find its unsync children
 out. Also, we delete the page from shadow page cache so that the page
 is completely isolated after call this function.
 
 The later patch will use it to collapse tlb flushes
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   46 +-
  1 files changed, 41 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 9b57faa..e676356 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1466,7 +1466,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
 *kvm, int nr)
  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
  {
  	ASSERT(is_empty_shadow_page(sp->spt));
  -	hlist_del(&sp->hash_link);
  +	hlist_del_init(&sp->hash_link);
Why do you need hlist_del_init() here? Why not move it into
kvm_mmu_prepare_zap_page() like we discussed it here:
https://patchwork.kernel.org/patch/2580351/ instead of doing
it differently for obsolete and non obsolete pages?

  	list_del(&sp->link);
  	free_page((unsigned long)sp->spt);
  	if (!sp->role.direct)
 @@ -2069,14 +2069,19 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
   return zapped;
  }
  
 -static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 - struct list_head *invalid_list)
 +static int
 +__kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 +bool zap_unsync_children,
 +struct list_head *invalid_list)
  {
 - int ret;
 + int ret = 0;
  
   trace_kvm_mmu_prepare_zap_page(sp);
  	++kvm->stat.mmu_shadow_zapped;
 - ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 +
 + if (likely(zap_unsync_children))
 + ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 +
   kvm_mmu_page_unlink_children(kvm, sp);
   kvm_mmu_unlink_parents(kvm, sp);
  
 @@ -2099,6 +2104,37 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
 struct kvm_mmu_page *sp,
   return ret;
  }
  
 +/*
 + * The obsolete page will not be used, we need not spend time to find
 + * its unsync children out. Also, we delete the page from shadow page
 + * cache so that the page is completely isolated after call this
 + * function.
 + *
 + * Note: if we use this function in for_each_gfn_xxx macros, we should
 + * re-walk the list when it successfully zaps one page.
 + */
 +static int
 +kvm_mmu_prepare_zap_obsolete_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 +   struct list_head *invalid_list)
 +{
 + int ret;
 +
 + WARN_ON(!is_obsolete_sp(kvm, sp));
 +
 + ret = __kvm_mmu_prepare_zap_page(kvm, sp, false, invalid_list);
 + if (ret)
  +		hlist_del_init(&sp->hash_link);
Why hlist_del() is not enough?

 +
  +	WARN_ON(ret > 1);
 + return ret;
 +}
 +
 +static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 + struct list_head *invalid_list)
 +{
 + return __kvm_mmu_prepare_zap_page(kvm, sp, true, invalid_list);
 +}
 +
  static void kvm_mmu_commit_zap_page(struct kvm *kvm,
   struct list_head *invalid_list)
  {
 -- 
 1.7.7.6

--
Gleb.

