Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature, and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

--
MST
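For reference, the event-index test both sides refer to is vring_need_event() from include/uapi/linux/virtio_ring.h; an urgent-aware vhost_notify() would keep applying it rather than short-circuiting on vq->urgent alone. The helper is reproduced below (the surrounding driver plumbing in the RFC patch is not shown):

/* From include/uapi/linux/virtio_ring.h: the guest publishes used_event
 * as the used index at which it wants an interrupt. Signal only if the
 * new used index has moved past it since the last signal; the
 * subtractions are wrap-safe on __u16. */
static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
{
	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}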
Re: [Qemu-devel] [RFC patch 5/6] s390: implement pci instruction
On Fri, Sep 19, 2014 at 05:12:15PM +0200, Thomas Huth wrote:

Hi Frank,

On Fri, 19 Sep 2014 13:54:34 +0200 frank.blasc...@de.ibm.com wrote:

From: Frank Blaschka frank.blasc...@de.ibm.com

This patch implements the s390 pci instructions in qemu. This allows to attach qemu pci devices including vfio. This does not mean the devices are functional but at least detection and config/memory space access is working.

Signed-off-by: Frank Blaschka frank.blasc...@de.ibm.com
---
 target-s390x/Makefile.objs |    2
 target-s390x/kvm.c         |   52 +++
 target-s390x/pci_ic.c      |  621 +
 target-s390x/pci_ic.h      |  425 ++
 4 files changed, 1099 insertions(+), 1 deletion(-)

--- a/target-s390x/Makefile.objs
+++ b/target-s390x/Makefile.objs
@@ -2,4 +2,4 @@ obj-y += translate.o helper.o cpu.o inte
 obj-y += int_helper.o fpu_helper.o cc_helper.o mem_helper.o misc_helper.o
 obj-y += gdbstub.o
 obj-$(CONFIG_SOFTMMU) += ioinst.o arch_dump.o
-obj-$(CONFIG_KVM) += kvm.o
+obj-$(CONFIG_KVM) += kvm.o pci_ic.o
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -40,6 +40,7 @@
 #include "exec/gdbstub.h"
 #include "trace.h"
 #include "qapi-event.h"
+#include "pci_ic.h"

 /* #define DEBUG_KVM */

@@ -56,6 +57,7 @@
 #define IPA0_B2                         0xb200
 #define IPA0_B9                         0xb900
 #define IPA0_EB                         0xeb00
+#define IPA0_E3                         0xe300

 #define PRIV_B2_SCLP_CALL               0x20
 #define PRIV_B2_CSCH                    0x30
@@ -76,8 +78,17 @@
 #define PRIV_B2_XSCH                    0x76

 #define PRIV_EB_SQBS                    0x8a
+#define PRIV_EB_PCISTB                  0xd0
+#define PRIV_EB_SIC                     0xd1

 #define PRIV_B9_EQBS                    0x9c
+#define PRIV_B9_CLP                     0xa0
+#define PRIV_B9_PCISTG                  0xd0
+#define PRIV_B9_PCILG                   0xd2
+#define PRIV_B9_RPCIT                   0xd3
+
+#define PRIV_E3_MPCIFC                  0xd0
+#define PRIV_E3_STPCIFC                 0xd4

 #define DIAG_IPL                        0x308
 #define DIAG_KVM_HYPERCALL              0x500
@@ -813,6 +824,18 @@ static int handle_b9(S390CPU *cpu, struc
     int r = 0;

     switch (ipa1) {
+    case PRIV_B9_CLP:
+        r = kvm_clp_service_call(cpu, run);
+        break;
+    case PRIV_B9_PCISTG:
+        r = kvm_pcistg_service_call(cpu, run);
+        break;
+    case PRIV_B9_PCILG:
+        r = kvm_pcilg_service_call(cpu, run);
+        break;
+    case PRIV_B9_RPCIT:
+        r = kvm_rpcit_service_call(cpu, run);
+        break;
     case PRIV_B9_EQBS:
         /* just inject exception */
         r = -1;
@@ -831,6 +854,12 @@ static int handle_eb(S390CPU *cpu, struc
     int r = 0;

     switch (ipa1) {
+    case PRIV_EB_PCISTB:
+        r = kvm_pcistb_service_call(cpu, run);
+        break;
+    case PRIV_EB_SIC:
+        r = kvm_sic_service_call(cpu, run);
+        break;
     case PRIV_EB_SQBS:
         /* just inject exception */
         r = -1;

I'm not sure, but I think the handler for the eb instructions is wrong: The second byte of the opcode is encoded in the lowest byte of the ipb field, not the lowest byte of the ipa field (just like with the e3 handler). Did you verify that your handlers get called correctly?

Hi Thomas,

you are absolutely right. I already have a patch available for this issue but did not append it to this RFC post (since it is basically a bug fix). To the next posting I will add this patch as well. Will also fix the remaining issues. Thanks for your review.

@@ -844,6 +873,26 @@ static int handle_eb(S390CPU *cpu, struc
     return r;
 }

+static int handle_e3(S390CPU *cpu, struct kvm_run *run, uint8_t ipa1)
+{
+    int r = 0;
+
+    switch (ipa1) {
+    case PRIV_E3_MPCIFC:
+        r = kvm_mpcifc_service_call(cpu, run);
+        break;
+    case PRIV_E3_STPCIFC:
+        r = kvm_stpcifc_service_call(cpu, run);
+        break;
+    default:
+        r = -1;
+        DPRINTF("KVM: unhandled PRIV: 0xe3%x\n", ipa1);
+        break;
+    }
+
+    return r;
+}

Could you please replace ipa1 with ipb1 to avoid confusion here?

 static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
 {
     CPUS390XState *env = &cpu->env;
@@ -1038,6 +1087,9 @@ static int handle_instruction(S390CPU *c
     case IPA0_EB:
         r = handle_eb(cpu, run, ipa1);
         break;
+    case IPA0_E3:
+        r = handle_e3(cpu, run, run->s390_sieic.ipb & 0xff);
+        break;
     case IPA0_DIAG:
         r = handle_diag(cpu, run, run->s390_sieic.ipb);
         break;
--- /dev/null
+++ b/target-s390x/pci_ic.c
@@ -0,0 +1,621 @@
[...]
+
+int
Re: [RFC] KVM-S390: Provide guest TOD Clock Get/Set Controls
On 09/19/2014 10:38 PM, Alexander Graf wrote:

On 19.09.14 20:51, Christian Borntraeger wrote:

On 09/19/2014 04:19 PM, Jason J. Herne wrote:

From: Jason J. Herne jjhe...@us.ibm.com

Enable the KVM_SET_CLOCK and KVM_GET_CLOCK ioctls on S390 for managing the guest TOD clock value.

Just some education: on s390 the guest-visible TOD clock is the host TOD clock + a hypervisor-programmable offset in the control block. There is only one TOD per system, so the offset must be the same for every CPU.

Can that offset be negative?

The offset is a u64, but the usual sum rules apply. The carry is ignored, so by using a large value you can get an effectively negative offset.

we add the KVM_CLOCK_FORWARD_ONLY flag to indicate to KVM_SET_CLOCK that the given clock value should only be set if it is >= the current guest TOD clock

host TOD, (right?)

The alternative scheme would be to simply get/set the guest TOD time. This works perfectly for migration, but for managedsave the guest time is in the past. Your approach has the advantage that after managedsave the guest will (most of the time) have the host time of the target system, avoiding that the guest has a time that is in the past (e.g. after 1 week of managedsave the guest would live in the past).

But that's what users will expect, no? When you save an image in the past, it should resume at that very point in time.

Actually, I would expect something different (more or less something like standby/resume). In fact Jason's code that we have internally in testing is doing the simple approach:

1. source reads guest time at migration end
2. target sets guest time from source

So we have the guarantee that the time will never move backwards. It also works quite well for migration. As a bonus, we could really reuse the existing ioctl. I asked Jason to explore alternatives, though: I think it is somehow wrong if you save a guest into an image file, open it one month later, and the guest will always be 1 month behind unless it uses some kind of ntp.

If everybody agrees that this is fine, I will queue up Jason's initial patch (simple get/set). The only question is then: shall we use an s390-specific ioctl (e.g. via VM attribute) or just use the existing KVM_SET_CLOCK? But maybe let's answer the first question before we decide on this. Also, I personally don't care whether the interface is "delta to now" or "this is the time".

In general, "delta to now" is safer because you can't possibly run back in time. But you also definitely want to check out the way PPC does it - it also accommodates for the time we spend inside the migration path itself.

Alex

Question for Paolo (maybe others) is: does it make sense to reuse/extend the existing ioctl? (I think so, but defining a new one could also be ok.)

Christian

value. This guarantees a monotonically increasing time.

NOTE: In the event that the KVM_CLOCK_FORWARD_ONLY flag is set and the given time would cause the guest time to jump backward, then we set the guest TOD clock equal to the host TOD clock. Does this behavior make sense, or is it too weird? I could believe that other architectures might not want this exact behavior. Instead they might prefer to implement the function such that an error code is returned instead of syncing the guest time to host time? In that case S390 would need another bit KVM_CLOCK_SET_TO_HOST which we could call to sync host time when the preferred guest time value would otherwise violate the monotonic property of the KVM_CLOCK_FORWARD_ONLY flag.

Signed-off-by: Jason J. Herne jjhe...@us.ibm.com
---
 Documentation/virtual/kvm/api.txt |  5 +++
 arch/s390/kvm/kvm-s390.c          | 80 +++
 include/uapi/linux/kvm.h          |  3 ++
 3 files changed, 88 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index beae3fd..615c2e4 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -779,6 +779,11 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };

+S390: KVM_CLOCK_FORWARD_ONLY is used by KVM_SET_CLOCK to indicate that the guest
+TOD clock should not be allowed to jump back in time. This flag guarantees a
+monotonically increasing guest clock. If the clock value specified would cause
+the guest to jump back in time then the guest TOD clock is set to the host
+TOD clock value.

 4.31 KVM_GET_VCPU_EVENTS

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..2450db3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -31,6 +31,7 @@
 #include <asm/switch_to.h>
 #include <asm/facility.h>
 #include <asm/sclp.h>
+#include <asm/timex.h>
 #include "kvm-s390.h"
 #include "gaccess.h"

@@ -169,6 +170,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_IRQCHIP:
 	case
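Christian's point that a u64 offset with a discarded carry gives an effectively negative offset is plain modular arithmetic; a standalone illustration (my example, not code from the patch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t host_tod = 1000;
	uint64_t behind = 300;                 /* want the guest 300 units in the past */
	uint64_t epoch = 0 - behind;           /* wraps to 2^64 - 300 */
	uint64_t guest_tod = host_tod + epoch; /* carry discarded: 700 */

	printf("guest TOD = %llu\n", (unsigned long long)guest_tod);
	return 0;
}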
Re: [PATCH] kvm: Make init_rmode_tss() return 0 on success.
Il 20/09/2014 01:44, Radim Krčmář ha scritto:

This patch removes the redundant variable by making init_rmode_tss() return 0 on success, -errno on failure.

Which is going to propagate all the way to userspace through the ioctl ... is this change of A[PB]I acceptable? Otherwise, -EFAULT seems to match unlikely problems better than -ENOMEM, so if it is acceptable:

Reviewed-by: Radim Krčmář rkrc...@redhat.com

Yes, I think it is a bugfix.

Paolo
Re: [PATCH v7 6/9] kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static.
Il 20/09/2014 12:47, Tang Chen ha scritto:

Since different architectures need different handling, we will add some arch-specific code later. The code may need to make cpu requests outside kvm_main.c, so make it non-static and rename it to kvm_make_all_cpus_request().

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c23236a..73de13c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -580,6 +580,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 void kvm_make_scan_ioapic_request(struct kvm *kvm);
+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..0f8b6f6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -152,7 +152,7 @@ static void ack_flush(void *_completed)
 {
 }

-static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
 	int i, cpu, me;
 	cpumask_var_t cpus;
@@ -189,7 +189,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 	long dirty_count = kvm->tlbs_dirty;

 	smp_mb();
-	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
+	if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
 }
@@ -197,17 +197,17 @@ EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);

 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }

 void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }

 void kvm_make_scan_ioapic_request(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
 }

 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Re: [PATCH v7 9/9] kvm, mem-hotplug: Unpin and remove kvm_arch-apic_access_page.
Il 20/09/2014 12:47, Tang Chen ha scritto:

@@ -4534,8 +4539,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	}

 	if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
-		vmcs_write64(APIC_ACCESS_ADDR,
-			     page_to_phys(vmx->vcpu.kvm->arch.apic_access_page));
+		kvm_vcpu_reload_apic_access_page(vcpu);

 	if (vmx_vm_has_apicv(vcpu->kvm))
 		memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));

Please make the call unconditional in kvm_vcpu_reset.

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9c8ae32..99378d7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3624,6 +3624,11 @@ static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
 	return false;
 }

+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	return;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;

I will ask you to modify the vmx_has_secondary_apic_access callback in such a way that svm_set_apic_access_page_addr is not needed, so please remove it from v8.

Paolo
Re: [PATCH v7 7/9] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We are handling the situation where L1 and L2 share one apic access page when migrating the apic access page. We should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

This patch handles 3). In the L0->L2 entry, L2's vmcs will be updated in prepare_vmcs02() called by nested_vm_run(), so we need to do nothing. In the L2->L1 exit, this patch requests an apic access page reload in the L2->L1 vmexit.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx.c              | 6 ++
 arch/x86/kvm/x86.c              | 3 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56156eb..1a8317e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1047,6 +1047,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);

 void kvm_define_shared_msr(unsigned index, u32 msr);
 void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c8e90ec..baac78a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8803,6 +8803,12 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	}

 	/*
+	 * We are now running in L2, mmu_notifier will force to reload the
+	 * page's hpa for L2 vmcs. Need to reload it for L1 before entering L1.
+	 */
+	kvm_vcpu_reload_apic_access_page(vcpu);
+
+	/*
 	 * Exiting from L2 to L1, we're now back to L1 which thinks it just
 	 * finished a VMLAUNCH or VMRESUME instruction, so we need to set the
 	 * success or failure flag accordingly.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc54fa6..2ae2dc7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5989,7 +5989,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_apic_update_tmr(vcpu, tmr);
 }

-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * Only APIC access page shared by L1 and L2 vm is handled. The APIC
@@ -6009,6 +6009,7 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 			page_to_phys(vcpu->kvm->arch.apic_access_page));
 	}
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);

 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Re: [PATCH v7 8/9] kvm, mem-hotplug: Add arch specific mmu notifier to handle apic access migration.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We are handling the situation where L1 and L2 share one apic access page when migrating the apic access page. We should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

This patch forces a L1->L0 exit or L2->L0 exit when the shared apic access page is migrated, using the mmu notifier. Since the apic access page is only used on intel x86, this is arch-specific code.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c       | 11 +++
 include/linux/kvm_host.h | 14 +-
 virt/kvm/kvm_main.c      |  3 +++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ae2dc7..7dd4179 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6011,6 +6011,17 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);

+void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+					   unsigned long address)
+{
+	/*
+	 * The physical address of apic access page is stored in VMCS.
+	 * Update it when it becomes invalid.
+	 */
+	if (address == gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT))
+		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without
  * exiting to the userspace. Otherwise, the value will be returned to the
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 73de13c..b6e4d38 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -917,7 +917,19 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
 		return 1;
 	return 0;
 }
-#endif
+
+#ifdef _ASM_X86_KVM_HOST_H
+void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+					   unsigned long address);
+#else /* _ASM_X86_KVM_HOST_H */
+inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+						  unsigned long address)

This needs to be static. However, do not add it here. Instead, touch all arch/*/include/asm/kvm_host.h files, adding either the prototype or the inline function. Apart from this, the patch looks good.

Paolo

+{
+	return;
+}
+#endif /* _ASM_X86_KVM_HOST_H */
+
+#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f8b6f6..5427973d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -295,6 +295,9 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	kvm_flush_remote_tlbs(kvm);

 	spin_unlock(&kvm->mmu_lock);
+
+	kvm_arch_mmu_notifier_invalidate_page(kvm, address);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 }
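What Paolo is asking for amounts to the usual per-arch hook pattern; a sketch under the assumption that every architecture gets its own copy (the exact file list is up to the patch author):

/* In each non-x86 arch/<arch>/include/asm/kvm_host.h: a static inline
 * no-op, so common code can call the hook unconditionally. */
static inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
							 unsigned long address)
{
}

/* In arch/x86/include/asm/kvm_host.h: only the prototype; the real
 * implementation stays in arch/x86/kvm/x86.c. */
void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
					   unsigned long address);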
Re: [PATCH v7 5/9] kvm, mem-hotplug: Reload L1's apic access page in vcpu_enter_guest().
Il 20/09/2014 12:47, Tang Chen ha scritto:

@@ -3624,6 +3624,11 @@ static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
 	return false;
 }

+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	return;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;
@@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
 	.has_secondary_apic_access = svm_has_secondary_apic_access,
+	.set_apic_access_page_addr = svm_set_apic_access_page_addr,
 	.vm_has_apicv = svm_vm_has_apicv,
 	.load_eoi_exitmap = svm_load_eoi_exitmap,
 	.hwapic_isr_update = svm_hwapic_isr_update,

Something's wrong in the way you're generating the patches, because you're adding these hunks twice.

Paolo
Re: [PATCH v7 5/9] kvm, mem-hotplug: Reload L1's apic access page in vcpu_enter_guest().
Il 22/09/2014 11:33, Paolo Bonzini ha scritto:

Something's wrong in the way you're generating the patches, because you're adding these hunks twice.

Nevermind, that was my mistake.

Paolo
Re: [PATCH v7 4/9] kvm: Add interface to check if secondary exec virtualized apic accesses is enabled.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We want to migrate the apic access page pinned by the guest (L1 and L2) to make memory hotplug available. There are two situations that need to be handled for the apic access page used by an L2 vm:

1. L1 prepares a separate apic access page for L2. L2 pins a lot of pages in memory. Even if we can migrate the apic access page, memory hotplug is not available when L2 is running. So do not handle this now. Migrate the apic access page only.

2. L1 and L2 share one apic access page. Since we will migrate L1's apic access page, we should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

Since we don't handle the situation where L1 and L2 have separate apic access pages, when we update the vmcs we need to check if we are in L2 and if L2's secondary exec "virtualized apic accesses" is enabled. This patch adds an interface to check if L2's secondary exec virtualized apic accesses is enabled, because vmx cannot be accessed outside vmx.c.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/svm.c              | 6 ++
 arch/x86/kvm/vmx.c              | 9 +
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35171c7..69fe032 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -739,6 +739,7 @@ struct kvm_x86_ops {
 	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+	bool (*has_secondary_apic_access)(struct kvm_vcpu *vcpu);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1d941ad..9c8ae32 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	return;
 }

+static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;
@@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
+	.has_secondary_apic_access = svm_has_secondary_apic_access,
 	.vm_has_apicv = svm_vm_has_apicv,
 	.load_eoi_exitmap = svm_load_eoi_exitmap,
 	.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72a0470..0b541d9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7090,6 +7090,14 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	vmx_set_msr_bitmap(vcpu);
 }

+static bool vmx_has_secondary_apic_access(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	return (vmx->nested.current_vmcs12->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
 {
 	u16 status;
@@ -8909,6 +8917,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+	.has_secondary_apic_access = vmx_has_secondary_apic_access,
 	.vm_has_apicv = vmx_vm_has_apicv,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 	.hwapic_irr_update = vmx_hwapic_irr_update,

I don't think you need two hooks. You can just do the gfn_to_page unconditionally in kvm_vcpu_reload_apic_access_page, like

	vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
				APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	if (vcpu->kvm->arch.apic_access_page)
		kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
			page_to_phys(vcpu->kvm->arch.apic_access_page));
	/* The last patch adds put_page here. */

and check is_guest_mode etc. in vmx.c. The extra gfn_to_page/put_page is rare enough that you can just do it unconditionally.

Another small change: you shouldn't run this code unless the APIC page is in use. So you can add

	if (!kvm_x86_ops->set_apic_access_page_addr)
Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On 09/22/2014 02:55 PM, Michael S. Tsirkin wrote:

On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

Ah, right.

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

I see, so the urgent descriptor needs to be disabled in this case. But the vhost part needs a few more changes: we cannot just check VRING_AVAIL_F_NO_INTERRUPT when vq->urgent is true. If event index is enabled, we still need to check the used event to make sure the current tx delayed interrupt works. But just re-using VRING_AVAIL_F_NO_INTERRUPT for the urgent descriptor may not work in some cases. I see code in virtqueue_get_buf() that may break this:

	/* If we expect an interrupt for the next entry, tell host
	 * by writing event index and flush out the write before
	 * the read in the next get_buf call. */
	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
		vring_used_event(&vq->vring) = vq->last_used_idx;
		virtio_mb(vq->weak_barriers);
	}

Consider if only the urgent descriptor is enabled: this will publish the used event, which in fact enables lots of unnecessary interrupts.

In fact I don't quite understand how the above lines are used. Virtio-net stops the queue before enabling the tx interrupt in start_xmit(), so the above lines will not run at all.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

Ok.
Re: [PATCH] x86:kvm: fix two typos in comment
Il 22/09/2014 04:31, Tiejun Chen ha scritto:

s/drity/dirty and s/vmsc01/vmcs01

Signed-off-by: Tiejun Chen tiejun.c...@intel.com
---
 arch/x86/kvm/mmu.c | 2 +-
 arch/x86/kvm/vmx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 76398fe..f76bc19 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1174,7 +1174,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
  *
- * Note: write protection is difference between drity logging and spte
+ * Note: write protection is difference between dirty logging and spte
  * protection:
  *   - for dirty logging, the spte can be set to writable at anytime if
  *     its dirty bitmap is properly set.
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6ffd643..305e667 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8008,7 +8008,7 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu)
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
- * with L0's requirements for its guest (a.k.a. vmsc01), so we can run the L2
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
 * guest in a way that will both be appropriate to L1's requests, and our
  * needs. In addition to modifying the active vmcs (which is vmcs02), this
  * function also has additional necessary side-effects, like setting various

Thanks, applying.

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 20/09/2014 01:03, David Matlack ha scritto:

vcpu ioctls can hang the calling thread if issued while a vcpu is running. If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. This patch does not change functionality, it just makes invalid ioctls fail faster.

Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 4
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ec622..f9234e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@

 #include <asm/processor.h>
 #include <asm/io.h>
+#include <asm/ioctl.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>

@@ -1975,6 +1976,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;

+	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+		return -EINVAL;
+
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,

Thanks, applying this patch.

Paolo
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Il 11/09/2014 19:03, Chris Webb ha scritto:

Paolo Bonzini pbonz...@redhat.com wrote:

This is a hypercall that should have kicked VCPU 3 (see rcx). Can you please apply this patch and gather a trace of the host (using trace-cmd -e kvm qemu-kvm arguments)?

Sure, no problem. I've built the trace-cmd tool against udis86 (I hope) and have put the resulting trace.dat at http://cdw.me.uk/tmp/trace.dat

This is actually for a -smp 2 qemu (failing to kick VCPU 1?) as I was having trouble persuading the -smp 4 qemu to crash as reliably under tracing. (Something timing related?) Otherwise the qemu-system-x86 command line is exactly as before.

Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Paolo

The guest kernel crash message which corresponds to this trace was:

divide error: [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 618 Comm: mkdir Not tainted 3.16.2-guest #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
task: 88007c997080 ti: 88007c614000 task.ti: 88007c614000
RIP: 0010:[81037fe2]  [81037fe2] kvm_unlock_kick+0x72/0x80
RSP: 0018:88007c617d40  EFLAGS: 00010046
RAX: 0005 RBX:  RCX: 0001
RDX: 0001 RSI: 88007fd11c40 RDI:
RBP: 88007fd11c40 R08: 81b98940 R09: 0001
R10:  R11: 0007 R12: 00f6
R13: 0001 R14: 0001 R15: 00011c40
FS: 7f43eb1ed700() GS:88007fc0() knlGS:
CS: 0010 DS:  ES:  CR0: 8005003b
CR2: 7f43eace0a30 CR3: 01a12000 CR4: 000406f0
Stack:
 88007c994380 88007c9949aa 0046 81689715
 810f3174 0001 ea0001f16320 ea0001f17860
 88007c99e1e8 88007c997080  0001
Call Trace:
 [81689715] ? _raw_spin_unlock+0x45/0x70
 [810f3174] ? try_to_wake_up+0x2a4/0x330
 [81101e2c] ? __wake_up_common+0x4c/0x80
 [81102418] ? __wake_up_sync_key+0x38/0x60
 [810d873a] ? do_notify_parent+0x19a/0x280
 [810f4d56] ? sched_move_task+0xb6/0x190
 [810cb4fc] ? do_exit+0xa1c/0xab0
 [810cc344] ? do_group_exit+0x34/0xb0
 [810cc3cb] ? SyS_exit_group+0xb/0x10
 [8168a16d] ? system_call_fastpath+0x1a/0x1f
Code: c0 ca a7 81 48 8d 04 0b 48 8b 30 48 39 ee 75 c9 0f b6 40 08 44 38 e0 75 c0 48 c7 c0 22 b0 00 00 31 db 0f b7 0c 08 b8 05 00 00 00 0f 01 c1 0f 1f 00 5b 5d 41 5c c3 0f 1f 00 48 c7 c0 10 cf 00 00
RIP [81037fe2] kvm_unlock_kick+0x72/0x80
 RSP 88007c617d40
---[ end trace bf5a4445f9decdbb ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: mkdir/618/0x0006
Modules linked in:
CPU: 0 PID: 618 Comm: mkdir Tainted: G D 3.16.2-guest #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 c022d302 81684029 810ee956 81686266
 00011c40 88007c617fd8 00011c40 88007c997080
 0006 0046
Call Trace:
 [81684029] ? dump_stack+0x49/0x6a
 [810ee956] ? __schedule_bug+0x46/0x60
 [81686266] ? __schedule+0x5a6/0x7c0
 [816828cd] ? printk+0x59/0x75
 [810cb33b] ? do_exit+0x85b/0xab0
 [816828cd] ? printk+0x59/0x75
 [8100614a] ? oops_end+0x7a/0x100
 [810033e5] ? do_error_trap+0x85/0x110
 [81037fe2] ? kvm_unlock_kick+0x72/0x80
 [8114a358] ? __alloc_pages_nodemask+0x108/0xa60
 [8168b57e] ? divide_error+0x1e/0x30
 [81037fe2] ? kvm_unlock_kick+0x72/0x80
 [81689715] ? _raw_spin_unlock+0x45/0x70
 [810f3174] ? try_to_wake_up+0x2a4/0x330
 [81101e2c] ? __wake_up_common+0x4c/0x80
 [81102418] ? __wake_up_sync_key+0x38/0x60
 [810d873a] ? do_notify_parent+0x19a/0x280
 [810f4d56] ? sched_move_task+0xb6/0x190
 [810cb4fc] ? do_exit+0xa1c/0xab0
 [810cc344] ? do_group_exit+0x34/0xb0
 [810cc3cb] ? SyS_exit_group+0xb/0x10
 [8168a16d] ? system_call_fastpath+0x1a/0x1f

Best wishes,

Chris.
[PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL instructions to VMMCALL as required on AMD processors.

The failure mode is currently a divide-by-zero exception, which obviously is a KVM bug that has to be fixed. However, picking the right instruction between VMCALL and VMMCALL will be faster and will help if you cannot upgrade the hypervisor.

Reported-by: Chris Webb ch...@arachsys.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: Borislav Petkov b...@suse.de
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_para.h   | 10 --
 arch/x86/kernel/cpu/amd.c         |  7 +++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index bb9b258d60e7..2075e6c34c78 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -202,6 +202,7 @@
 #define X86_FEATURE_DECODEASSISTS ( 8*32+12) /* AMD Decode Assists support */
 #define X86_FEATURE_PAUSEFILTER ( 8*32+13) /* AMD filtered pause intercept */
 #define X86_FEATURE_PFTHRESHOLD ( 8*32+14) /* AMD pause filter threshold */
+#define X86_FEATURE_VMMCALL     ( 8*32+15) /* Prefer vmmcall to vmcall */

 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ebx), word 9 */
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c7678e43465b..e62cf897f781 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -2,6 +2,7 @@
 #define _ASM_X86_KVM_PARA_H

 #include <asm/processor.h>
+#include <asm/alternative.h>
 #include <uapi/asm/kvm_para.h>

 extern void kvmclock_init(void);
@@ -16,10 +17,15 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 }
 #endif /* CONFIG_KVM_GUEST */

-/* This instruction is vmcall.  On non-VT architectures, it will generate a
- * trap that we will then rewrite to the appropriate instruction.
+#ifdef CONFIG_DEBUG_RODATA
+#define KVM_HYPERCALL \
+        ALTERNATIVE(".byte 0x0f,0x01,0xc1", ".byte 0x0f,0x01,0xd9", X86_FEATURE_VMMCALL)
+#else
+/* On AMD processors, vmcall will generate a trap that we will
+ * then rewrite to the appropriate instruction.
  */
 #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1"
+#endif

 /* For KVM hypercalls, a three-byte sequence of either the vmcall or the vmmcall
  * instruction. The hypervisor may replace it with something else but only the
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 60e5497681f5..813d29d00a17 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -525,6 +525,13 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 	}
 #endif

+	/*
+	 * This is only needed to tell the kernel whether to use VMCALL
+	 * and VMMCALL.  VMMCALL is never executed except under virt, so
+	 * we can set it unconditionally.
+	 */
+	set_cpu_cap(c, X86_FEATURE_VMMCALL);
+
 	/* F16h erratum 793, CVE-2013-6885 */
 	if (c->x86 == 0x16 && c->x86_model <= 0xf)
 		msr_set_bit(MSR_AMD64_LS_CFG, 15);
--
1.8.3.1
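For context on why a boot-time choice works here: KVM_HYPERCALL only ever appears inside inline asm, via the kvm_hypercall*() helpers in the same header, so the alternatives machinery can pick the instruction bytes once at boot instead of KVM patching them at fault time. The zero-argument helper from arch/x86/include/asm/kvm_para.h of this era looks like this:

static inline long kvm_hypercall0(unsigned int nr)
{
	long ret;

	/* KVM_HYPERCALL expands to vmcall, or - with the patch above -
	 * to whichever of vmcall/vmmcall matches the CPU at boot. */
	asm volatile(KVM_HYPERCALL
		     : "=a"(ret)
		     : "a"(nr)
		     : "memory");
	return ret;
}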
Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On Mon, Sep 22, 2014 at 05:55:23PM +0800, Jason Wang wrote:

On 09/22/2014 02:55 PM, Michael S. Tsirkin wrote:

On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

Ah, right.

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

I see, so the urgent descriptor needs to be disabled in this case. But the vhost part needs a few more changes: we cannot just check VRING_AVAIL_F_NO_INTERRUPT when vq->urgent is true. If event index is enabled, we still need to check the used event to make sure the current tx delayed interrupt works. But just re-using VRING_AVAIL_F_NO_INTERRUPT for the urgent descriptor may not work in some cases. I see code in virtqueue_get_buf() that may break this:

	/* If we expect an interrupt for the next entry, tell host
	 * by writing event index and flush out the write before
	 * the read in the next get_buf call. */
	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
		vring_used_event(&vq->vring) = vq->last_used_idx;
		virtio_mb(vq->weak_barriers);
	}

Right, we need to change this path to use a private flag.

Consider if only the urgent descriptor is enabled: this will publish the used event, which in fact enables lots of unnecessary interrupts.

In fact I don't quite understand how the above lines are used. Virtio-net stops the queue before enabling the tx interrupt in start_xmit(), so the above lines will not run at all.

True - this code is for drivers which process the VQ without disabling interrupts first. No rule says this is not allowed.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

Ok.
[GIT PULL] Final KVM changes for 3.17
Linus,

The following changes since commit 02a68d0503fa470abff8852e10b1890df5730a08:

  powerpc/kvm/cma: Fix panic introduces by signed shift operation (2014-09-03 10:34:07 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 27fbe64bfa63cfb9da025975b59d96568caa2d53:

  KVM: correct null pid check in kvm_vcpu_yield_to() (2014-09-22 13:21:29 +0200)

Two very simple bugfixes, affecting all supported architectures.

Ard Biesheuvel (2):
      mm: export symbol dependencies of is_zero_pfn()
      KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()

Sam Bobroff (1):
      KVM: correct null pid check in kvm_vcpu_yield_to()

 arch/mips/mm/init.c | 1 +
 arch/s390/mm/init.c | 1 +
 mm/memory.c         | 2 ++
 virt/kvm/kvm_main.c | 4 ++--
 4 files changed, 6 insertions(+), 2 deletions(-)
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
Hi, all

I start a VM with virtio-serial (default ports number: 31), and found that virtio-blk performance degradation happened, about 25%. This problem can be reproduced 100%.

without virtio-serial: 4k-read-random 1186 IOPS
with virtio-serial:    4k-read-random  871 IOPS

But if I use the max_ports=2 option to limit the max number of virtio-serial ports, then the IO performance degradation is not so serious, about 5%. And ide performance degradation does not happen with virtio-serial.

Pretty sure it's related to MSI vectors in use. It's possible that the virtio-serial device takes up all the available vectors in the guests, leaving old-style irqs for the virtio-blk device.

I don't think so. I use iometer to test the 64k-read(or write)-sequence case: if I disable the virtio-serial dynamically via device manager -> virtio-serial => disable, then the performance improves by about 25% immediately; then I re-enable the virtio-serial via device manager -> virtio-serial => enable, and the performance drops back again. Very obvious.

To add a comment: although the virtio-serial is enabled, I don't use it at all; the degradation still happens.

Using the vectors= option as mentioned below, you can restrict the number of MSI vectors the virtio-serial device gets. You can then confirm whether it's MSI that's related to these issues.

Amit,

It's related to the big number of ioeventfds used in virtio-serial-pci. With virtio-serial-pci's ioeventfd=off, the performance is not affected no matter if the guest initializes it or not.

In my test, there are 12 fds to poll in qemu_poll_ns before loading the guest virtio_console.ko, whereas 76 once I modprobe virtio_console. Looks like ppoll takes more time to poll more fds.

Some trace data with systemtap:

12 fds:
  time  rel_time  symbol
  15    (+1)      qemu_poll_ns  [enter]
  18    (+3)      qemu_poll_ns  [return]

76 fds:
  12    (+2)      qemu_poll_ns  [enter]
  18    (+6)      qemu_poll_ns  [return]

I haven't looked at the virtio-serial code; I'm not sure if we should reduce the number of ioeventfds in virtio-serial-pci or focus on lower-level efficiency.

Does ioeventfd=off hamper the performance of virtio-serial? In my opinion, virtio-serial's use scenario is not such a high throughput rate, so ioeventfd=off has only a slight impact on its performance.

Thanks,
Zhang Haoyu

Haven't compared with g_poll but I think the underlying syscall should be the same. Any ideas?

Fam

So, I think it has nothing to do with legacy interrupt mode, right?

I am going to observe the difference of perf top data on qemu and perf kvm stat data when disabling/enabling virtio-serial in the guest, and the difference of perf top data on the guest when disabling/enabling virtio-serial in the guest. Any ideas?

Thanks,
Zhang Haoyu

If you restrict the number of vectors the virtio-serial device gets (using the -device virtio-serial-pci,vectors= param), does that make things better for you?

Amit
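The fd-count effect is easy to reproduce outside QEMU. A standalone toy (my construction, not from the thread) that times poll() over a varying number of eventfds, roughly mimicking the qemu_poll_ns numbers quoted above:

#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <time.h>

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 76;   /* e.g. compare 12 vs 76 */
	struct pollfd *fds = calloc(n, sizeof(*fds));
	struct timespec t0, t1;
	int i, iters = 100000;
	long ns;

	for (i = 0; i < n; i++) {
		fds[i].fd = eventfd(0, 0);        /* idle fds, never signalled */
		fds[i].events = POLLIN;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		poll(fds, n, 0);                  /* timeout 0: measure scan cost only */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
	printf("%d fds: %ld ns per poll()\n", n, ns / iters);
	free(fds);
	return 0;
}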
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/19/2014 05:46 PM, H. Peter Anvin wrote:

On 09/19/2014 01:46 PM, Andy Lutomirski wrote:

However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND were executed while it is not supported) - so this scheme wouldn’t introduce additional overhead over RDMSR.

Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance.

Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source.

A guest kernel couldn't make it look like RDRAND is not present to guest userspace?

Christopher

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation.
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/19/2014 02:42 PM, Andy Lutomirski wrote:

On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington c...@codeaurora.org wrote:

On 09/17/2014 10:50 PM, Andy Lutomirski wrote:

Hi all-

I would like to standardize on a very simple protocol by which a guest OS can obtain an RNG seed early in boot.

The main design requirements are:

- The interface should be very easy to use. Linux, at least, will want to use it extremely early in boot as part of kernel ASLR. This means that PCI and ACPI will not work.

How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand? Could hypervisors and simulators simply make sure these work?

If RDRAND is available, then Linux, at least, will use it. The rest are too complicated for early use. Linux on x86 plays some vaguely clever games with rdtsc and poking at the i8254 port.

I just wanted to check that it couldn't be as simple as giving one or both of the timers random initial values.

Christopher

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation.
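The "vaguely clever games" are of roughly this flavor. A hedged, userspace-style sketch (my illustration, not the kernel's actual early-boot code, and not cryptographically strong): fold a handful of TSC samples together, relying on inter-sample jitter for the entropy.

#include <stdint.h>

/* Read the time-stamp counter; on x86 this is available long before
 * any device or firmware services come up. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Mix several samples with a multiply-and-rotate step; the absolute
 * count is predictable, the low-order jitter between samples less so. */
static uint64_t weak_early_seed(void)
{
	uint64_t seed = 0;
	int i;

	for (i = 0; i < 16; i++) {
		seed = seed * 0x9E3779B97F4A7C15ULL + rdtsc();
		seed = (seed << 13) | (seed >> 51);
	}
	return seed;
}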
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22/2014 12:50 PM, Paolo Bonzini wrote:

Il 20/09/2014 01:03, David Matlack ha scritto:

vcpu ioctls can hang the calling thread if issued while a vcpu is running. If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. This patch does not change functionality, it just makes invalid ioctls fail faster.

Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 4
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ec622..f9234e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@

 #include <asm/processor.h>
 #include <asm/io.h>
+#include <asm/ioctl.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>

@@ -1975,6 +1976,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;

+	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+		return -EINVAL;
+
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,

Thanks, applying this patch.

Isn't that the wrong trade-off? We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

Christian
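For readers unfamiliar with the check being debated: every KVM ioctl number carries the KVMIO magic in its type byte, and _IOC_TYPE() is a shift-and-mask, so the early -EINVAL costs only a couple of instructions. The values below are from include/uapi/linux/kvm.h and asm-generic/ioctl.h:

#include <linux/ioctl.h>

#define KVMIO 0xAE

/* KVM_RUN is defined as _IO(KVMIO, 0x80), so its type byte is 0xAE. */
#define KVM_RUN _IO(KVMIO, 0x80)

/* _IOC_TYPE(nr) expands to ((nr >> _IOC_TYPESHIFT) & _IOC_TYPEMASK),
 * i.e. bits 8..15 of the ioctl number. Any request whose type byte is
 * not 0xAE can be rejected before taking the vcpu mutex. */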
Re: [PATCH] blk-mq: Avoid race condition with uninitialized requests
On 09/18/2014 11:04 AM, David Hildenbrand wrote:

This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request.

Also move the reset of cmd_flags from the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check().

Cc: sta...@vger.kernel.org
Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com

Acked-by: Christian Borntraeger borntrae...@de.ibm.com

Can you please add

Reported-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com
Tested-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com

as Paulo did the testing work? We think this patch is fine and should go upstream.

Christian

---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 383ea0c..eed6340 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -203,7 +203,6 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)

 	if (tag != BLK_MQ_TAG_FAIL) {
 		rq = data->hctx->tags->rqs[tag];

-		rq->cmd_flags = 0;
 		if (blk_mq_tag_busy(data->hctx)) {
 			rq->cmd_flags |= REQ_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
@@ -258,6 +257,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,

 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+	rq->cmd_flags = 0;

 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
@@ -1404,6 +1404,8 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		left -= to_do * rq_size;
 		for (j = 0; j < to_do; j++) {
 			tags->rqs[i] = p;
+			tags->rqs[i]->atomic_flags = 0;
+			tags->rqs[i]->cmd_flags = 0;
 			if (set->ops->init_request) {
 				if (set->ops->init_request(set->driver_data,
 						tags->rqs[i], hctx_idx, i,
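As I read the commit message, the window being closed looks like this (a reconstructed timeline, not text from the patch):

/*
 * allocation path                     timeout path (blk_mq_timeout_check)
 * ---------------                     ------------------------------------
 * tag = blk_mq_get_tag(...);
 *                                     rq = tags->rqs[tag];
 *                                     // atomic_flags/cmd_flags still hold
 *                                     // whatever the previous user left:
 *                                     // REQ_ATOM_STARTED may look set and
 *                                     // flush flags may point at stale data
 * rq = data->hctx->tags->rqs[tag];
 * rq->cmd_flags = 0;                  // old code cleared only here,
 *                                     // after the window above
 *
 * Zeroing both fields when the tag map is built, and clearing cmd_flags
 * at free time instead, leaves nothing stale for the timeout checker to
 * dereference.
 */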
Re: [PATCH] blk-mq: Avoid race condition with uninitialized requests
On 2014-09-22 08:15, Christian Borntraeger wrote:

On 09/18/2014 11:04 AM, David Hildenbrand wrote:

This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request.

Also move the reset of cmd_flags from the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check().

Cc: sta...@vger.kernel.org
Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com

Acked-by: Christian Borntraeger borntrae...@de.ibm.com

Can you please add

Reported-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com
Tested-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com

as Paulo did the testing work? We think this patch is fine and should go upstream.

I might have to pick'n rebase the series, in which case I'll add it. But I already queued it up last week, so if I don't, then I can't easily add it. I wish the git notes weren't such a horrible and unusable hack, so we had a chance to annotate commits without having to rewrite history...

--
Jens Axboe
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/22/2014 06:31 AM, Christopher Covington wrote: On 09/19/2014 05:46 PM, H. Peter Anvin wrote: On 09/19/2014 01:46 PM, Andy Lutomirski wrote: However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and the emulate the instruction when guest executes it. KVM already traps guest #UD (which would occur if RDRAND executed while it is not supported) - so this scheme wouldn’t introduce additional overhead over RDMSR. Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance. Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source. A guest kernel couldn't make it look like RDRAND is not present to guest userspace? It could, but how would you enumerate that? A new RDRAND-CPL-0 CPUID bit pretty much would be required. -hpa -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
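[Editorial note: for readers following the enumeration argument - userspace decides whether to use RDRAND by checking CPUID leaf 1, ECX bit 30, which is why the guest kernel cannot easily hide the instruction once the hypervisor advertises it. A minimal user-space probe, as a sketch (not from the thread):]

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1: RDRAND support is advertised in ECX bit 30 */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("RDRAND %savailable\n", (ecx & (1u << 30)) ? "" : "not ");
	return 0;
}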
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/22/2014 07:17 AM, H. Peter Anvin wrote:
It could, but how would you enumerate that? A new RDRAND-CPL-0 CPUID bit pretty much would be required.

Note that there are two things that differ: the CPL 0-ness and the performance/exhaustibility attributes.

-hpa
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial, depending on what the other VCPU is doing.

Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
Re: [PATCH v6] arm64: fix VTTBR_BADDR_MASK
-#define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
-#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)

Actually, after some more thinking, why don't we just make the upper limit of this mask 48-bit always, or even 64-bit? That's a physical mask for checking whether the pgd pointer in vttbr is aligned as per the architecture requirements. Given that the pointer is allocated from the platform memory, it's clear that it is within the PA range. So basically you just need a mask to check the bottom alignment based on VTCR_EL2.T0SZ (which should be independent from the PA range). I guess it should be enough as:

This sounds fine to me. I would say that there is no harm in re-checking the upper bits, but I agree it is unnecessary.

#define VTTBR_BADDR_MASK (~0ULL << VTTBR_BADDR_SHIFT)

without any other changes to T0SZ. The TCR_EL2.PS setting should be done based on ID_AA64MMFR0_EL1, but you can do this in __do_hyp_init (it looks like this function handles VTCR_EL2.PS already; not sure why it does it for TCR_EL2 as well). So IMO you only need about a few lines of patch.

My original patch to fix this problem was one line, so I'm perfectly happy with simplification. But it would be nice if the other reviewers could agree with this approach. With six versions that each addressed all the comments from reviewers, I'd like it if a v7 that throws away most of that feedback didn't result in a v8 that puts it all back again.
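[Editorial note: for context, the mask is consumed when the stage-2 pgd is installed in VTTBR - the pgd's physical address must have no bits set below VTTBR_BADDR_SHIFT. A sketch of the check, modeled on the arm/arm64 KVM stage-2 allocation path; the exact call site and the vttbr composition are assumptions here:]

	/* sketch: verify stage-2 pgd alignment before installing it in VTTBR */
	pgd_phys = virt_to_phys(kvm->arch.pgd);
	if (WARN_ON(pgd_phys & ~VTTBR_BADDR_MASK))
		return -EINVAL;	/* pgd not aligned as the architecture requires */
	kvm->arch.vttbr = pgd_phys | ((u64)kvm->arch.vmid << VTTBR_VMID_SHIFT);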
Re: [PATCH v6] arm64: fix VTTBR_BADDR_MASK
On Mon, Sep 22, 2014 at 04:56:58PM +0100, Joel Schopp wrote:
The TCR_EL2.PS setting should be done based on ID_AA64MMFR0_EL1, but you can do this in __do_hyp_init (it looks like this function handles VTCR_EL2.PS already; not sure why it does it for TCR_EL2 as well). So IMO you only need about a few lines of patch.

My original patch to fix this problem was one line, so I'm perfectly happy with simplification. But it would be nice if the other reviewers could agree with this approach. With six versions that each addressed all the comments from reviewers, I'd like it if a v7 that throws away most of that feedback didn't result in a v8 that puts it all back again.

I'm having some discussion with Christoffer around this. He will come up with some patches on top of yours, but I don't think the problem is that simple. Basically the IPA size is restricted by the PARange but also affected by the number of page table levels used by the host (both having implications on VTCR_EL2.SL0).

-- Catalin
Re: [patch 4/4] KVM: MMU: pinned sps are not candidates for deletion.
On Tue, Sep 09, 2014 at 12:41:27PM -0300, Marcelo Tosatti wrote:
On Tue, Jul 22, 2014 at 05:59:42AM +0800, Xiao Guangrong wrote:
On Jul 10, 2014, at 3:12 AM, mtosa...@redhat.com wrote:

Skip pinned shadow pages when selecting pages to zap.

It seems there is no way to prevent changing a pinned spte on the zap-all path?

Xiao,

The way would be to reload remote mmus, forcing the vcpu to exit, zapping a page; then the vcpu will pagefault any necessary page via kvm_mmu_pin_pages. kvm_mmu_invalidate_zap_all_pages does:

- spin_lock(mmu_lock)
- kvm_reload_remote_mmus
...
- spin_unlock(mmu_lock)

So it's OK to change a pinned spte on the zap-all path.

I am thinking whether we could move pinned sptes to another list (e.g. pinned_shadow_pages) instead of the active list, so that they cannot be touched by any other free paths. What's your idea?

As mentioned above, it is OK to zap pinned sptes as long as a reload-remote-mmus request is performed. That said, do you still consider a separate list necessary?

Xiao, ping?
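[Editorial note: the ordering Marcelo describes, as a sketch - an assumption-laden simplification of kvm_mmu_invalidate_zap_all_pages, not the actual function:]

/* Force every vcpu out of guest mode before pinned sptes go away;
 * the fault path re-pins what is needed on re-entry. */
static void zap_all_including_pinned(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm_reload_remote_mmus(kvm);	/* queue KVM_REQ_MMU_RELOAD, kick vcpus */
	/* ... zap shadow pages, pinned ones included ... */
	spin_unlock(&kvm->mmu_lock);
	/* on re-entry each vcpu reloads its MMU and refaults pinned pages
	 * via kvm_mmu_pin_pages() */
}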
Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path
On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote:
On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote:
On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:

Reload remote vcpus' MMU from the GET_DIRTY_LOG codepath, before deleting a pinned spte.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/mmu.c | 29 +++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 6 deletions(-)

Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-07-09 11:23:59.290744490 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-07-09 11:24:58.449632435 -0300
@@ -1208,7 +1208,8 @@
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect,
+			       bool skip_pinned)
 {
 	u64 spte = *sptep;
 
@@ -1218,6 +1219,22 @@
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
+	if (is_pinned_spte(spte)) {
+		/* keep pinned spte intact, mark page dirty again */
+		if (skip_pinned) {
+			struct kvm_mmu_page *sp;
+			gfn_t gfn;
+
+			sp = page_header(__pa(sptep));
+			gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+
+			mark_page_dirty(kvm, gfn);
+			return false;

Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() while populating dirty_bitmap_buffer?

The pinned gfns are per-vcpu. Marking them there would require looping over all vcpus (not scalable).

+		} else
+			mmu_reload_pinned_vcpus(kvm);

Can you explain why you need this?

Because if skip_pinned = false, we want vcpus to exit (this is called from the enable-dirty-logging codepath).

Gleb, any further opinions on this? Can you ack the patch? TIA
[PATCH] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 +--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 +--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 50 ++++++++++++++++++++++-------------
 drivers/iommu/amd_iommu_v2.c        |  6 +++--
 include/linux/mmu_notifier.h        | 24 ++++++++++-------
 include/trace/events/kvm.h          | 10 +++----
 mm/mmu_notifier.c                   |  5 ++--
 mm/rmap.c                           |  6 +++--
 virt/kvm/kvm_main.c                 |  5 ++--
 16 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
+			  unsigned long end);
 extern int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git
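[Editorial note: the generic consumer side of the widened interface, as a sketch reconstructed from the prototypes above (the posting is truncated before this hunk, so treat details as an assumption):]

static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int young, idx;

	idx = srcu_read_lock(&kvm->srcu);
	spin_lock(&kvm->mmu_lock);

	/* age every spte backing the hva range, not just a single page */
	young = kvm_age_hva(kvm, start, end);
	if (young)
		kvm_flush_remote_tlbs(kvm);

	spin_unlock(&kvm->mmu_lock);
	srcu_read_unlock(&kvm->srcu, idx);

	return young;
}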
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial, depending on what the other VCPU is doing. Perhaps something like this would be better?

I'm happy with either approach.

(Untested, but Tested-by/Reviewed-bys are welcome).

There were a few build bugs in your diff. Here's a working version that I tested. Feel free to add my Tested-by and Reviewed-by if you go with this.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c71931f..fbdcdc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -133,12 +133,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -151,6 +149,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -2197,10 +2203,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex)) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
+	__vcpu_load(vcpu);
 
-	r = vcpu_load(vcpu);
-	if (r)
-		return r;
 	switch (ioctl) {
 	case KVM_RUN:
 		r = -EINVAL;
Re: KVM: nested VMX: disable perf cpuid reporting
On Thu, Sep 18, 2014 at 06:24:57PM -0300, Marcelo Tosatti wrote:
Initialization of an L2 guest with -cpu host, on an L1 guest with -cpu host, triggers:

(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported

Nested VMX MSR load/store support is not sufficient to allow perf for the L2 guest. Until properly fixed, trap CPUID and disable function 0xA.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Ping, Paolo?
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Paolo Bonzini pbonz...@redhat.com wrote:
Il 11/09/2014 19:03, Chris Webb ha scritto:
Paolo Bonzini pbonz...@redhat.com wrote:
This is a hypercall that should have kicked VCPU 3 (see rcx). Can you please apply this patch and gather a trace of the host (using trace-cmd -e kvm qemu-kvm arguments)?

Sure, no problem. I've built the trace-cmd tool against udis86 (I hope) and have put the resulting trace.dat at http://cdw.me.uk/tmp/trace.dat

This is actually for a -smp 2 qemu (failing to kick VCPU 1?) as I was having trouble persuading the -smp 4 qemu to crash as reliably under tracing. (Something timing related?) Otherwise the qemu-system-x86 command line is exactly as before.

Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Absolutely right: my host and guest kernels do have CONFIG_DEBUG_RODATA set! Your patch to use alternatives for VMCALL vs VMMCALL definitely fixed the divide-by-zero crashes I saw.

Given that I can easily use either (or both) of these solutions, would it be more efficient to turn off CONFIG_DEBUG_RODATA in the guest kernel so KVM can fix up the instructions in place, or is using alternatives for VMCALL/VMMCALL as implemented by your patch just as good?

Best wishes,

Chris.
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Il 22/09/2014 21:08, Chris Webb ha scritto:
Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Absolutely right: my host and guest kernels do have CONFIG_DEBUG_RODATA set! Your patch to use alternatives for VMCALL vs VMMCALL definitely fixed the divide-by-zero crashes I saw. Given that I can easily use either (or both) of these solutions, would it be more efficient to turn off CONFIG_DEBUG_RODATA in the guest kernel so KVM can fix up the instructions in place, or is using alternatives for VMCALL/VMMCALL as implemented by your patch just as good?

I posted a patch to use alternatives if CONFIG_DEBUG_RODATA is enabled, but the bug is in KVM, which explicitly documents that you can use either VMCALL or VMMCALL. I'll also see to fixing KVM, but the patch is still useful because a) KVM would not patch a read-only text segment, so there is a small performance benefit; b) you cannot control already-deployed hypervisors.

However, since there is a workaround, I won't push it into 3.17 so late in the cycle. Also, there's a chance that it is NACKed since it touches non-KVM files.

Paolo
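[Editorial note: a sketch of the alternatives-based hypercall Paolo's patch introduces. The feature-flag name follows the posted patch; treat the exact macro shape as illustrative:]

/* Patch VMCALL -> VMMCALL once at boot on AMD via the alternatives
 * framework, so the kernel text can stay read-only afterwards and the
 * hypervisor never has to rewrite it at run time. */
#define KVM_HYPERCALL \
	ALTERNATIVE("vmcall", "vmmcall", X86_FEATURE_VMMCALL)

static inline long kvm_hypercall0(unsigned int nr)
{
	long ret;
	asm volatile(KVM_HYPERCALL
		     : "=a"(ret)
		     : "a"(nr)
		     : "memory");
	return ret;
}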
Re: KVM: nested VMX: disable perf cpuid reporting
Il 22/09/2014 21:01, Marcelo Tosatti ha scritto:
On Thu, Sep 18, 2014 at 06:24:57PM -0300, Marcelo Tosatti wrote:
Initialization of an L2 guest with -cpu host, on an L1 guest with -cpu host, triggers:
(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported
Nested VMX MSR load/store support is not sufficient to allow perf for the L2 guest. Until properly fixed, trap CPUID and disable function 0xA.
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Ping, Paolo?

Sorry, I didn't push to kvm/queue but I've already applied the patch locally. I will do it first thing tomorrow (my workstation is off right now, so I cannot send you a SHA).

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22/2014 04:31 PM, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial,

I know, but only for seriously misbehaving userspace, no? See, my comment is really a minor one - let's say 2 more cycles for something that exited to userspace - nobody would even notice. I am just disturbed by the fact that we care about something that is not slow-path but broken beyond repair. (Why does userspace call a non-KVM ioctl on the fd of a vcpu from a different thread? Otherwise the mutex would be free.) Please, can we have an explanation, e.g. something like "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

depending on what the other VCPU is doing. Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Given that it makes sense to improve on misbehaving userspace, I prefer David's variant, as the patch is smaller and easier to get right. No need to revert, but a better explanation of the use case would be appreciated.

Christian

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 21:20, Christian Borntraeger ha scritto:
"while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

Okay. David, can you explain how you found it so that I can make up my mind? Gleb and Marcelo, a fourth and fifth opinion? :)

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Christian Borntraeger wrote:
On 09/22/2014 04:31 PM, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial,

I know, but only for seriously misbehaving userspace, no? See, my comment is really a minor one - let's say 2 more cycles for something that exited to userspace - nobody would even notice. I am just disturbed by the fact that we care about something that is not slow-path but broken beyond repair. (Why does userspace call a non-KVM ioctl on the fd of a vcpu from a different thread? Otherwise the mutex would be free.) Please, can we have an explanation, e.g. something like "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

We noticed this while using code that inspects the open file descriptors of a process. One of the things it did was check whether each file descriptor was a tty (isatty() calls ioctl(TCGETS)).

depending on what the other VCPU is doing. Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Given that it makes sense to improve on misbehaving userspace, I prefer David's variant, as the patch is smaller and easier to get right. No need to revert, but a better explanation of the use case would be appreciated.

Christian

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
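[Editorial note: to make the failure mode concrete - a process inspector walking /proc/<pid>/fd and probing each descriptor would effectively do the following against a vcpu fd. The inspector itself is not part of the thread; only the isatty()/TCGETS behavior is standard Linux:]

#include <termios.h>
#include <sys/ioctl.h>

/* isatty(fd) is implemented via ioctl(fd, TCGETS, ...) on Linux.
 * On a KVM vcpu fd this is an invalid ioctl, but before the fix it
 * still had to take the vcpu mutex - blocking for as long as the
 * vcpu thread sits inside ioctl(vcpu_fd, KVM_RUN, 0). */
int probe_fd_is_tty(int fd)
{
	struct termios t;

	return ioctl(fd, TCGETS, &t) == 0;	/* what isatty() does */
}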
Re: [PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On Mon, Sep 22, 2014 at 01:17:48PM +0200, Paolo Bonzini wrote:
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.

Hmm, that depends on DEBUG_KERNEL. I think you're actually talking about distro kernels which enable CONFIG_DEBUG_RODATA, right?

-- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine.
[PATCH v2] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
+			  unsigned long end);
 extern int
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Fri, Sep 19, 2014 at 04:03:25PM -0700, David Matlack wrote:
vcpu ioctls can hang the calling thread if issued while a vcpu is running.

There is a mutex per vcpu, so that's expected, OK...

If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex.

Consider a valid ioctl that takes the vcpu mutex. If you need immediate access for that valid ioctl, it is necessary to interrupt the thread which executes the KVM_RUN ioctl. So knowledge of whether KVM_RUN is being executed is expected in userspace (either that, or ask the KVM_RUN thread to run the ioctl for you, as QEMU does). I can't see why having different behaviour for valid and invalid ioctls is a good thing.

This patch does not change functionality, it just makes invalid ioctls fail faster.

You should not be executing vcpu ioctls without interrupting KVM_RUN in the first place.
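[Editorial note: the interruption Marcelo refers to usually takes the form of a directed signal at the vcpu thread; a pending signal makes KVM_RUN return -EINTR, after which the vcpu mutex is free. A sketch of the pattern (the SIG_IPI name follows QEMU's convention and is illustrative):]

#include <pthread.h>
#include <signal.h>

/* Kick the vcpu thread out of ioctl(vcpu_fd, KVM_RUN, 0) so another
 * thread can take the vcpu mutex and issue its own vcpu ioctl. */
static void kick_vcpu(pthread_t vcpu_thread)
{
	pthread_kill(vcpu_thread, SIGUSR1);	/* QEMU calls this SIG_IPI */
}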
Re: [PATCH] kvm: Fix page ageing bugs
On 09/22/2014 03:57 PM, Andres Lagar-Cavilla wrote:
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

Acked-by: Rik van Riel r...@redhat.com

-- All rights reversed
[PATCH v3] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
v2 -> v3:
 * Added Acked-by
 * Fix compilation in PowerPC
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
Acked-by: Rik van Riel r...@redhat.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote:
This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform. I hope Alex can give me some guidance how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done.

kernel patches apply to linux-kvm-next
KVM: s390: Enable PCI instructions
iommu: add iommu for s390 platform
vfio: make vfio build on s390

qemu patches apply to qemu-master
s390: Add PCI bus support
s390: implement pci instruction
vfio: make vfio run on s390 platform

Thx for feedback and review comments

Sending patches as attachments makes it difficult to comment inline.

2/6
 - careful of the namespace as you're changing functions from static and exporting them
 - doesn't seem like the functions need to be exported, just non-static to call from s390-iommu.c

6/6
 - We shouldn't need to globally disable mmap; each VFIO region reports whether it supports mmap, and vfio-pci on s390 should indicate mmap is not supported on the platform.
 - INTx should be done the same way: the interrupt index for INTx should report 0 count. The current code likely doesn't handle this, but it should be easy to fix.
 - s390_msix_notify() vs msix_notify() should be abstracted somewhere else. How would an emulated PCI device with MSI-X support work?
 - same for add_msi_route
 - We can probably come up with a better way to determine which address space to connect to the memory listener.

Looks like a reasonable first pass, good re-use of vfio code. Thanks,

Alex
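[Editorial note: the per-region mmap capability Alex refers to is discoverable from userspace with the standard vfio uAPI; a sketch of the probe, nothing s390-specific:]

#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Ask the kernel whether region 'index' of this device may be mmap'd.
 * On s390, vfio-pci would simply not set VFIO_REGION_INFO_FLAG_MMAP,
 * and userspace falls back to read()/write() access on the region. */
int region_supports_mmap(int device_fd, int index)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return 0;
	return !!(info.flags & VFIO_REGION_INFO_FLAG_MMAP);
}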
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Thu, Sep 18, 2014 at 11:08 PM, Paolo Bonzini pbonz...@redhat.com wrote:
Il 19/09/2014 05:58, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?

No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough? Thanks much

Andres
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 22:08, Marcelo Tosatti ha scritto:
This patch does not change functionality, it just makes invalid ioctls fail faster. You should not be executing vcpu ioctls without interrupting KVM_RUN in the first place.

This is not entirely true; there are a couple of asynchronous ioctls, though they are not used on x86 (commit 2122ff5eab8f, "KVM: move vcpu locking to dispatcher for generic vcpu ioctls", 2010-05-13).

Paolo
[PATCH] KVM: EVENTFD: Only conditionally remove inclusion of irq.h
Commit c77dcac ("KVM: Move more code under CONFIG_HAVE_KVM_IRQFD") added functionality that depends on definitions in ioapic.h when __KVM_HAVE_IOAPIC is defined. At the same time, 0ba0951 ("KVM: EVENTFD: remove inclusion of irq.h") removed the inclusion of irq.h unconditionally, which happened to include ioapic.h. Instead, include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.

Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
 virt/kvm/eventfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0c712a7..b0fb390 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,9 @@
 #include <linux/seqlock.h>
 #include <trace/events/kvm.h>
 
+#ifdef __KVM_HAVE_IOAPIC
+#include "ioapic.h"
+#endif
 #include "iodev.h"
 
 #ifdef CONFIG_HAVE_KVM_IRQFD
-- 2.0.0
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
Il 22/09/2014 22:49, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?

No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

It's waiting for an Acked-by on the mm/ changes.

Paolo
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini pbonz...@redhat.com wrote:
Il 22/09/2014 22:49, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?
No, I'll add them myself.
Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?
It's waiting for an Acked-by on the mm/ changes.

The MM changes look good to me.
Re: [PATCH v3] kvm: Fix page ageing bugs
Il 22/09/2014 22:26, Andres Lagar-Cavilla ha scritto:
+	__entry->gfn	= gfn;
+	__entry->hva	= ((gfn - slot->base_gfn) >>

This must be <<.

+			PAGE_SHIFT) + slot->userspace_addr;

+	/*
+	 * No need for _notify because we're called within an
+	 * mmu_notifier_invalidate_range_ {start|end} scope.
+	 */

Why "called within"? It is try_to_unmap_cluster itself that calls mmu_notifier_invalidate_range_*, so "we're within an mmu_notifier_invalidate_range_start/end scope" sounds better, and it's also what you use in the commit message.

Paolo
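[Editorial note: the address computation under discussion, written out assuming the usual KVM memslot layout:]

	/* host virtual address backing a gfn within a memslot */
	hva = slot->userspace_addr + ((gfn - slot->base_gfn) << PAGE_SHIFT);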
[PATCH v4] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
v2 -> v3:
 * Added Acked-by
 * Fix compilation in PowerPC
v3 -> v4:
 * Fix page shift direction in trace routine
 * Clarified comment in rmap.c
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
Acked-by: Rik van Riel r...@redhat.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
Re: [PATCH v3] kvm: Fix page ageing bugs
On Mon, Sep 22, 2014 at 2:48 PM, Paolo Bonzini pbonz...@redhat.com wrote:
Il 22/09/2014 22:26, Andres Lagar-Cavilla ha scritto:
+	__entry->gfn	= gfn;
+	__entry->hva	= ((gfn - slot->base_gfn) >>

This must be <<.

Correct, thanks.

+			PAGE_SHIFT) + slot->userspace_addr;

+	/*
+	 * No need for _notify because we're called within an
+	 * mmu_notifier_invalidate_range_ {start|end} scope.
+	 */

Why "called within"? It is try_to_unmap_cluster itself that calls mmu_notifier_invalidate_range_*, so "we're within an mmu_notifier_invalidate_range_start/end scope" sounds better, and it's also what you use in the commit message.

Also correct. V4...

Andres

Paolo

-- Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On 22.09.14 22:47, Alex Williamson wrote: On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote: This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform I hope Alex & Alex can give me some guidance on how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done. kernel patches apply to linux-kvm-next KVM: s390: Enable PCI instructions iommu: add iommu for s390 platform vfio: make vfio build on s390 qemu patches apply to qemu-master s390: Add PCI bus support s390: implement pci instruction vfio: make vfio run on s390 platform Thx for feedback and review comments Sending patches as attachments makes it difficult to comment inline. 2/6 - careful of the namespace as you're changing functions from static and exporting them - doesn't seem like functions need to be exported, just non-static to call from s390-iommu.c 6/6 - We shouldn't need to globally disable mmap, each VFIO region reports whether it supports mmap and vfio-pci on s390 should indicate mmap is not supported on the platform. Can we emulate MMIO on mmap'ed regions by routing every memory access via the kernel? It'd be slow, but at least make existing VFIO code compatible. - INTx should be done the same way, the interrupt index for INTx should report 0 count. The current code likely doesn't handle this, but it should be easy to fix. - s390_msix_notify() vs msix_notify() should be abstracted somewhere else. How would an emulated PCI device with MSI-X support work? - same for add_msi_route Yes, please implement emulated PCI device support first, then do VFIO. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On Tue, 2014-09-23 at 00:08 +0200, Alexander Graf wrote: On 22.09.14 22:47, Alex Williamson wrote: On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote: This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform I hope Alex & Alex can give me some guidance on how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done. kernel patches apply to linux-kvm-next KVM: s390: Enable PCI instructions iommu: add iommu for s390 platform vfio: make vfio build on s390 qemu patches apply to qemu-master s390: Add PCI bus support s390: implement pci instruction vfio: make vfio run on s390 platform Thx for feedback and review comments Sending patches as attachments makes it difficult to comment inline. 2/6 - careful of the namespace as you're changing functions from static and exporting them - doesn't seem like functions need to be exported, just non-static to call from s390-iommu.c 6/6 - We shouldn't need to globally disable mmap, each VFIO region reports whether it supports mmap and vfio-pci on s390 should indicate mmap is not supported on the platform. Can we emulate MMIO on mmap'ed regions by routing every memory access via the kernel? It'd be slow, but at least make existing VFIO code compatible. Isn't that effectively what we do when we use memory_region_init_io() vs memory_region_init_ram_ptr(), or are you suggesting something that can handle the MMIO without bouncing out to QEMU? VFIO is already compatible with regions that cannot be mmap'd, the kernel just needs to report it as such. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
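The per-region mmap reporting Alex refers to is visible to userspace through VFIO_DEVICE_GET_REGION_INFO. A sketch of how a client checks it (device_fd is assumed to be an already-opened VFIO device file descriptor):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int region_supports_mmap(int device_fd, unsigned int index)
    {
        struct vfio_region_info info = {
            .argsz = sizeof(info),
            .index = index,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
            return 0;
        /* On s390 the kernel would simply leave this flag clear. */
        return !!(info.flags & VFIO_REGION_INFO_FLAG_MMAP);
    }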
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Marcelo Tosatti wrote: On Fri, Sep 19, 2014 at 04:03:25PM -0700, David Matlack wrote: vcpu ioctls can hang the calling thread if issued while a vcpu is running. There is a mutex per-vcpu, so that's expected, OK... If we know an ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. Consider a valid ioctl that takes the vcpu mutex. If you need immediate access for that valid ioctl, it is necessary to interrupt the thread executing the KVM_RUN ioctl. So knowledge of whether KVM_RUN is being executed is expected in userspace (either that, or ask the KVM_RUN thread to run the ioctl for you, as qemu does). Can't see why having different behaviour for valid/invalid ioctls is a good thing. This patch does not change functionality, it just makes invalid ioctls fail faster. Should not be executing vcpu ioctls without interrupting KVM_RUN in the first place. This patch is trying to be nice to code that isn't aware it's probing kvm file descriptors. We saw long hangs with some generic process inspection code that was probing all open file descriptors. There's no reason non-kvm ioctls should have to wait for the vcpu mutex to become available just to fail. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
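The check being discussed is cheap because validity can be decided from the ioctl number alone. A sketch of the early test, mirroring the _IOC_TYPE comparison the patch proposes (shown standalone for illustration):

    #include <linux/kvm.h>		/* KVMIO, _IOC_TYPE() */

    static int obviously_invalid_vcpu_ioctl(unsigned int ioctl)
    {
        /* An ioctl whose type byte is not KVMIO can never be a valid
         * KVM ioctl, so it can fail before vcpu->mutex is taken and
         * before blocking behind a vcpu sitting in KVM_RUN. */
        return _IOC_TYPE(ioctl) != KVMIO;
    }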
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 11:29:16PM +0200, Paolo Bonzini wrote: Il 22/09/2014 22:08, Marcelo Tosatti ha scritto: This patch does not change functionality, it just makes invalid ioctls fail faster. Should not be executing vcpu ioctls without interrupt KVM_RUN in the first place. This is not entirely true, there are a couple of asynchronous ioctls though they are not used on x86 (commit 2122ff5eab8f, KVM: move vcpu locking to dispatcher for generic vcpu ioctls, 2010-05-13). Paolo Alright. s/Should not be executing vcpu ioctls/Should not be executing vcpu ioctls which take vcpu mutex/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Standardizing an MSR or other hypercall to get an RNG seed?
Not really, no. Sent from my tablet, pardon any formatting problems. On Sep 22, 2014, at 06:31, Christopher Covington c...@codeaurora.org wrote: On 09/19/2014 05:46 PM, H. Peter Anvin wrote: On 09/19/2014 01:46 PM, Andy Lutomirski wrote: However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND were executed while it is not supported) - so this scheme wouldn't introduce additional overhead over RDMSR. Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance. Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source. A guest kernel couldn't make it look like RDRAND is not present to guest userspace? Christopher -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
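The reason a guest kernel cannot cleanly hide RDRAND from its own userspace: applications detect the instruction themselves via CPUID, with no kernel mediation. A sketch of that standard detection (CPUID leaf 1, ECX bit 30):

    #include <cpuid.h>

    static int cpu_has_rdrand(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return !!(ecx & (1u << 30));	/* CPUID.01H:ECX.RDRAND */
    }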
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 03:58:16PM -0700, David Matlack wrote: Should not be executing vcpu ioctls without interrupt KVM_RUN in the first place. This patch is trying to be nice to code that isn't aware it's probing kvm file descriptors. We saw long hangs with some generic process inspection code that was probing all open file descriptors. There's no reason non-kvm ioctls should have to wait for the vcpu mutex to become available just to fail. OK then, please add the usecase to the changelog. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 0/6] arm: dirty page logging support for ARMv7
This patch adds support for ARMv7 dirty page logging. Some functions of dirty page logging have been split into generic and arch specific implementations, details below. Dirty page logging is one of several features required for live migration; live migration has been tested for ARMv7. Testing: - QEMU machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling. - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers, did a basic bringup and memory test. - x86_64 QEMU basic migration on same platform. See https://github.com/mjsmar/arm-migration-howto for details of testing and setup (see README). The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. Patch series has been compiled for affected architectures: - x86_64 - defconfig - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons. - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig - s390 - s390x-linux-gcc4.6.3 - defconfig - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig ARMv7 dirty page logging support overview: - initially write protects VM RAM memory regions - 2nd stage page tables - adds support to read the dirty page log and write protect the dirty pages again - in the second stage page table - for the next pass. - second stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. - In the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time. - Future work: o Completed ARMv8 additions - need to validate on Juno hardware o ARMv8 validation test - with no migration support yet, implement replication of a memory DB using dirty page logging in an HA environment. Changes since v10: - addressed wanghaibin comments - addressed Christoffer's comments Changes since v9: - Split patches into generic and architecture specific variants for TLB flushing and dirty log read (patches 1,2 3,4,5,6) - rebased to 3.16.0-rc1 - Applied Christoffer's comments. Mario Smarduch (6): KVM: Add architecture-specific TLB flush implementations KVM: Add generic implementation of kvm_vm_ioctl_get_dirty_log arm: KVM: Add ARMv7 API to flush TLBs arm: KVM: Add initial dirty page locking infrastructure arm: KVM: dirty log read write protect support arm: KVM: ARMv7 dirty page logging 2nd stage page fault arch/arm/include/asm/kvm_asm.h|1 + arch/arm/include/asm/kvm_host.h | 14 +++ arch/arm/include/asm/kvm_mmu.h| 20 arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/Kconfig |1 + arch/arm/kvm/arm.c| 13 ++- arch/arm/kvm/interrupts.S | 12 ++ arch/arm/kvm/mmu.c| 196 - arch/arm64/kvm/Kconfig|1 + arch/ia64/kvm/Kconfig |1 + arch/ia64/kvm/kvm-ia64.c |2 +- arch/mips/kvm/Kconfig |1 + arch/mips/kvm/mips.c |2 +- arch/powerpc/kvm/Kconfig |1 + arch/powerpc/kvm/book3s.c |2 +- arch/powerpc/kvm/booke.c |2 +- arch/s390/kvm/Kconfig |1 + arch/s390/kvm/kvm-s390.c |2 +- arch/x86/kvm/x86.c| 86 --- include/linux/kvm_host.h |7 ++ virt/kvm/Kconfig |6 + virt/kvm/kvm_main.c | 95 22 files changed, 369 insertions(+), 98 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
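For context, the consumer of all this is the standard dirty-log ioctl. A sketch of how a migration loop reads it (vm_fd/slot/bitmap handling is illustrative; the ioctl and structure are the stock KVM API):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int fetch_dirty_log(int vm_fd, __u32 slot, void *bitmap)
    {
        struct kvm_dirty_log log;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;	/* one bit per page in the slot */

        /* Returns the pages dirtied since the previous call; the kernel
         * side write protects them again for the next iteration. */
        return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
    }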
[PATCH v11 1/6] KVM: Add architecture-specific TLB flush implementations
Add support to declare an architecture specific TLB flush function; for now, ARMv7. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- include/linux/kvm_host.h |1 + virt/kvm/Kconfig |3 +++ virt/kvm/kvm_main.c |4 3 files changed, 8 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ec4e3bd..a49a6df 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -592,6 +592,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); void kvm_flush_remote_tlbs(struct kvm *kvm); +void kvm_arch_flush_remote_tlbs(struct kvm *kvm); void kvm_reload_remote_mmus(struct kvm *kvm); void kvm_make_mclock_inprogress_request(struct kvm *kvm); void kvm_make_scan_ioapic_request(struct kvm *kvm); diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 13f2d19..f1efaa5 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -34,3 +34,6 @@ config HAVE_KVM_CPU_RELAX_INTERCEPT config KVM_VFIO bool + +config HAVE_KVM_ARCH_TLB_FLUSH_ALL + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4b6c01b..d0a24f5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -186,12 +186,16 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) void kvm_flush_remote_tlbs(struct kvm *kvm) { +#ifdef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL + kvm_arch_flush_remote_tlbs(kvm); +#else long dirty_count = kvm->tlbs_dirty; smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.remote_tlb_flush; cmpxchg(&kvm->tlbs_dirty, dirty_count, 0); +#endif } EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 2/6] KVM: Add generic implementation of kvm_vm_ioctl_get_dirty_log
Add support for generic implementation of dirty log read function. For now x86_64 and ARMv7 share generic dirty log read. Other architectures call their architecture specific functions. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/Kconfig |1 + arch/arm/kvm/arm.c|2 +- arch/arm64/kvm/Kconfig|1 + arch/ia64/kvm/Kconfig |1 + arch/ia64/kvm/kvm-ia64.c |2 +- arch/mips/kvm/Kconfig |1 + arch/mips/kvm/mips.c |2 +- arch/powerpc/kvm/Kconfig |1 + arch/powerpc/kvm/book3s.c |2 +- arch/powerpc/kvm/booke.c |2 +- arch/s390/kvm/Kconfig |1 + arch/s390/kvm/kvm-s390.c |2 +- arch/x86/kvm/x86.c| 86 -- include/linux/kvm_host.h |6 +++ virt/kvm/Kconfig |3 ++ virt/kvm/kvm_main.c | 91 + 16 files changed, 112 insertions(+), 92 deletions(-) diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index 4be5bb1..cd9bb1c 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -23,6 +23,7 @@ config KVM select HAVE_KVM_CPU_RELAX_INTERCEPT select KVM_MMIO select KVM_ARM_HOST + select HAVE_KVM_ARCH_DIRTY_LOG depends on ARM_VIRT_EXT ARM_LPAE !CPU_BIG_ENDIAN ---help--- Support hosting virtualized guest machines. You will also diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 3c82b37..c52b2bd 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -774,7 +774,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return -EINVAL; } diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 8ba85e9..40a8d19 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -26,6 +26,7 @@ config KVM select KVM_ARM_HOST select KVM_ARM_VGIC select KVM_ARM_TIMER + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support hosting virtualized guest machines. diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig index 990b864..217f10a 100644 --- a/arch/ia64/kvm/Kconfig +++ b/arch/ia64/kvm/Kconfig @@ -28,6 +28,7 @@ config KVM select HAVE_KVM_IRQ_ROUTING select KVM_APIC_ARCHITECTURE select KVM_MMIO + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 6a4309b..3166df5 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1812,7 +1812,7 @@ static void kvm_ia64_sync_dirty_log(struct kvm *kvm, spin_unlock(kvm-arch.dirty_log_lock); } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { int r; diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig index 30e334e..b57f49e 100644 --- a/arch/mips/kvm/Kconfig +++ b/arch/mips/kvm/Kconfig @@ -20,6 +20,7 @@ config KVM select PREEMPT_NOTIFIERS select ANON_INODES select KVM_MMIO + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support for hosting Guest kernels. Currently supported on MIPS32 processors. diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index d687c6e..885fdfe 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -791,7 +791,7 @@ out: } /* Get (and clear) the dirty memory log for a memory slot. 
*/ -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { struct kvm_memory_slot *memslot; unsigned long ga, ga_end; diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index d6a53b9..4f28a82 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM select PREEMPT_NOTIFIERS select ANON_INODES select HAVE_KVM_EVENTFD + select HAVE_KVM_ARCH_DIRTY_LOG config KVM_BOOK3S_HANDLER bool diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index c254c27..304faa1 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -815,7 +815,7 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu) return vcpu-kvm-arch.kvm_ops-check_requests(vcpu); } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return kvm-arch.kvm_ops-get_dirty_log(kvm, log); } diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index ab62109..50dd33d 100644 ---
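The shape of the generic function this patch introduces in virt/kvm/kvm_main.c, heavily simplified from the x86-derived code (error paths trimmed; treat as a sketch of the flow, not a verbatim hunk):

    int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
    {
        struct kvm_memory_slot *memslot;
        unsigned long *dirty, *buffer, mask, i, n;
        bool flush = false;

        memslot = id_to_memslot(kvm->memslots, log->slot);
        dirty = memslot->dirty_bitmap;
        n = kvm_dirty_bitmap_bytes(memslot);
        buffer = dirty + n / sizeof(long);	/* scratch copy for userspace */
        memset(buffer, 0, n);

        spin_lock(&kvm->mmu_lock);
        for (i = 0; i < n / sizeof(long); i++) {
            if (!dirty[i])
                continue;
            flush = true;
            buffer[i] = mask = xchg(&dirty[i], 0);
            /* arch hook: write protect exactly these pages again */
            kvm_mmu_write_protect_pt_masked(kvm, memslot,
                                            i * BITS_PER_LONG, mask);
        }
        spin_unlock(&kvm->mmu_lock);

        if (flush)
            kvm_flush_remote_tlbs(kvm);
        return copy_to_user(log->dirty_bitmap, buffer, n) ? -EFAULT : 0;
    }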
[PATCH v11 3/6] arm: KVM: Add ARMv7 API to flush TLBs
This patch adds ARMv7 architecture TLB Flush function. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/include/asm/kvm_asm.h |1 + arch/arm/include/asm/kvm_host.h | 12 arch/arm/kvm/Kconfig|1 + arch/arm/kvm/interrupts.S | 12 4 files changed, 26 insertions(+) diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h index 53b3c4a..21bc519 100644 --- a/arch/arm/include/asm/kvm_asm.h +++ b/arch/arm/include/asm/kvm_asm.h @@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[]; extern void __kvm_flush_vm_context(void); extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa); +extern void __kvm_tlb_flush_vmid(struct kvm *kvm); extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu); #endif diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 193ceaf..311486f 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -220,6 +220,18 @@ static inline void __cpu_init_hyp_mode(phys_addr_t boot_pgd_ptr, kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr); } +/** + * kvm_arch_flush_remote_tlbs() - flush all VM TLB entries + * @kvm: pointer to kvm structure. + * + * Interface to HYP function to flush all VM TLB entries without address + * parameter. + */ +static inline void kvm_arch_flush_remote_tlbs(struct kvm *kvm) +{ + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm); +} + static inline int kvm_arch_dev_ioctl_check_extension(long ext) { return 0; diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index cd9bb1c..eba8b00 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -24,6 +24,7 @@ config KVM select KVM_MMIO select KVM_ARM_HOST select HAVE_KVM_ARCH_DIRTY_LOG + select HAVE_KVM_ARCH_TLB_FLUSH_ALL depends on ARM_VIRT_EXT ARM_LPAE !CPU_BIG_ENDIAN ---help--- Support hosting virtualized guest machines. You will also diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S index 0d68d40..1258d46 100644 --- a/arch/arm/kvm/interrupts.S +++ b/arch/arm/kvm/interrupts.S @@ -66,6 +66,18 @@ ENTRY(__kvm_tlb_flush_vmid_ipa) bx lr ENDPROC(__kvm_tlb_flush_vmid_ipa) +/** + * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs + * + * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address + * parameter + */ + +ENTRY(__kvm_tlb_flush_vmid) + b __kvm_tlb_flush_vmid_ipa +ENDPROC(__kvm_tlb_flush_vmid) + + / * Flush TLBs and instruction caches of all CPUs inside the inner-shareable * domain, for all VMIDs -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 4/6] arm: KVM: Add initial dirty page locking infrastructure
Patch adds support for initial write protection of VM memlsot. This patch series assumes that huge PUDs will not be used in 2nd stage tables, which is awlays valid on ARMv7. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/include/asm/kvm_host.h |2 + arch/arm/include/asm/kvm_mmu.h| 20 + arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/arm.c|9 +++ arch/arm/kvm/mmu.c| 129 + 5 files changed, 161 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 311486f..12311a5 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -243,4 +243,6 @@ int kvm_perf_teardown(void); u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); +void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 5cc0b0f..08ab5e8 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -114,6 +114,26 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd) pmd_val(*pmd) |= L_PMD_S2_RDWR; } +static inline void kvm_set_s2pte_readonly(pte_t *pte) +{ + pte_val(*pte) = (pte_val(*pte) ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY; +} + +static inline bool kvm_s2pte_readonly(pte_t *pte) +{ + return (pte_val(*pte) L_PTE_S2_RDWR) == L_PTE_S2_RDONLY; +} + +static inline void kvm_set_s2pmd_readonly(pmd_t *pmd) +{ + pmd_val(*pmd) = (pmd_val(*pmd) ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY; +} + +static inline bool kvm_s2pmd_readonly(pmd_t *pmd) +{ + return (pmd_val(*pmd) L_PMD_S2_RDWR) == L_PMD_S2_RDONLY; +} + /* Open coded p*d_addr_end that can deal with 64bit addresses */ #define kvm_pgd_addr_end(addr, end)\ ({ u64 __boundary = ((addr) + PGDIR_SIZE) PGDIR_MASK;\ diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 85c60ad..8a3266e 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -129,6 +129,7 @@ #define L_PTE_S2_RDONLY(_AT(pteval_t, 1) 6) /* HAP[1] */ #define L_PTE_S2_RDWR (_AT(pteval_t, 3) 6) /* HAP[2:1] */ +#define L_PMD_S2_RDONLY(_AT(pmdval_t, 1) 6) /* HAP[1] */ #define L_PMD_S2_RDWR (_AT(pmdval_t, 3) 6) /* HAP[2:1] */ /* diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index c52b2bd..e1be6c7 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -242,6 +242,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, const struct kvm_memory_slot *old, enum kvm_mr_change change) { +#ifdef CONFIG_ARM + /* +* At this point memslot has been committed and there is an +* allocated dirty_bitmap[], dirty pages will be be tracked while the +* memory slot is write protected. 
+*/ + if ((change != KVM_MR_DELETE) (mem-flags KVM_MEM_LOG_DIRTY_PAGES)) + kvm_mmu_wp_memory_region(kvm, mem-slot); +#endif } void kvm_arch_flush_shadow_all(struct kvm *kvm) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 2336061..ba00899 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -45,6 +45,7 @@ static phys_addr_t hyp_idmap_vector; #define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) +#define kvm_pud_huge(_x) pud_huge(_x) static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) { @@ -746,6 +747,134 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap) return false; } +#ifdef CONFIG_ARM +/** + * stage2_wp_ptes - write protect PMD range + * @pmd: pointer to pmd entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end) +{ + pte_t *pte; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_none(*pte)) { + if (!kvm_s2pte_readonly(pte)) + kvm_set_s2pte_readonly(pte); + } + } while (pte++, addr += PAGE_SIZE, addr != end); +} + +/** + * stage2_wp_pmds - write protect PUD range + * @pud: pointer to pud entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end) +{ + pmd_t *pmd; + phys_addr_t next; + + pmd = pmd_offset(pud, addr); + + do { + next = kvm_pmd_addr_end(addr, end); + if (!pmd_none(*pmd)) { + if (kvm_pmd_huge(*pmd)) { +
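The hunk above is cut off mid-walk; for readability, here is a sketch of the top-level range walk those helpers feed into, following the same structure as the stage2_wp_ptes()/stage2_wp_pmds() shown (simplified; the real patch also drops mmu_lock periodically for large ranges):

    static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
    {
        pgd_t *pgd;
        phys_addr_t next;

        pgd = kvm->arch.pgd + pgd_index(addr);
        do {
            next = kvm_pgd_addr_end(addr, end);
            if (pgd_present(*pgd))
                /* descends pud -> pmd -> pte, marking stage-2 RDONLY */
                stage2_wp_puds(pgd, addr, next);
        } while (pgd++, addr = next, addr != end);
    }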
[PATCH v11 5/6] arm: KVM: dirty log read write protect support
This patch adds support to track VM dirty pages between dirty log reads. Pages that have been dirtied since the last log read are write protected again, in preparation for the next dirty log read. In addition, the ARMv7 dirty log read function is pushed up to the generic layer. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/Kconfig |1 - arch/arm/kvm/arm.c |2 ++ arch/arm/kvm/mmu.c | 22 ++ 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index eba8b00..dddbb3d 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -23,7 +23,6 @@ config KVM select HAVE_KVM_CPU_RELAX_INTERCEPT select KVM_MMIO select KVM_ARM_HOST - select HAVE_KVM_ARCH_DIRTY_LOG select HAVE_KVM_ARCH_TLB_FLUSH_ALL depends on ARM_VIRT_EXT && ARM_LPAE && !CPU_BIG_ENDIAN ---help--- diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index e1be6c7..0546fa3 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -783,10 +783,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } } +#ifdef CONFIG_ARM64 int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return -EINVAL; } +#endif static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, struct kvm_arm_device_addr *dev_addr) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index ba00899..5f52c8a 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -873,6 +873,28 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot) spin_unlock(&kvm->mmu_lock); kvm_flush_remote_tlbs(kvm); } + +/** + * kvm_mmu_write_protect_pt_masked() - write protect dirty pages set in mask + * @kvm: The KVM pointer + * @slot: The memory slot associated with mask + * @gfn_offset: The gfn offset in memory slot + * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory + * slot to be write protected + * + * Walks bits set in mask and write protects the associated pte's. Caller must + * acquire kvm_mmu_lock. + */ +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) +{ + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; + phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; + + stage2_wp_range(kvm, start, end); +} #endif static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
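A worked example of the mask arithmetic in kvm_mmu_write_protect_pt_masked(), with purely illustrative numbers:

    /* slot->base_gfn = 0x1000, gfn_offset = 0x40, mask = 0b11010
     * base_gfn = 0x1040
     * __ffs(mask) = 1 (lowest set bit), __fls(mask) = 4 (highest set bit)
     * start = (0x1040 + 1) << PAGE_SHIFT      -> IPA of gfn 0x1041
     * end   = (0x1040 + 4 + 1) << PAGE_SHIFT  -> IPA just past gfn 0x1044
     * One stage2_wp_range() call covers gfns 0x1041..0x1044; clean pages
     * inside the span are re-write-protected too, which is harmless. */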
[PATCH v11 6/6] arm: KVM: ARMv7 dirty page logging 2nd stage page fault
This patch adds support for handling 2nd stage page faults during migration, it disables faulting in huge pages, and dissolves huge pages to page tables. In case migration is canceled huge pages may be used again. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/mmu.c | 45 +++-- 1 file changed, 39 insertions(+), 6 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 5f52c8a..df1a5a3 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -47,6 +47,15 @@ static phys_addr_t hyp_idmap_vector; #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) #define kvm_pud_huge(_x) pud_huge(_x) +static bool kvm_get_logging_state(struct kvm_memory_slot *memslot) +{ +#ifdef CONFIG_ARM + return !!memslot-dirty_bitmap; +#else + return false; +#endif +} + static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) { /* @@ -626,7 +635,8 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache } static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, - phys_addr_t addr, const pte_t *new_pte, bool iomap) + phys_addr_t addr, const pte_t *new_pte, bool iomap, + bool logging_active) { pmd_t *pmd; pte_t *pte, old_pte; @@ -641,6 +651,18 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, return 0; } + /* +* While dirty memory logging, clear PMD entry for huge page and split +* into smaller pages, to track dirty memory at page granularity. +*/ + if (logging_active kvm_pmd_huge(*pmd)) { + phys_addr_t ipa = pmd_pfn(*pmd) PAGE_SHIFT; + + pmd_clear(pmd); + kvm_tlb_flush_vmid_ipa(kvm, ipa); + put_page(virt_to_page(pmd)); + } + /* Create stage-2 page mappings - Level 2 */ if (pmd_none(*pmd)) { if (!cache) @@ -693,7 +715,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, if (ret) goto out; spin_lock(kvm-mmu_lock); - ret = stage2_set_pte(kvm, cache, addr, pte, true); + ret = stage2_set_pte(kvm, cache, addr, pte, true, false); spin_unlock(kvm-mmu_lock); if (ret) goto out; @@ -910,6 +932,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache; struct vm_area_struct *vma; pfn_t pfn; + bool logging_active = kvm_get_logging_state(memslot); write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); if (fault_status == FSC_PERM !write_fault) { @@ -920,7 +943,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, /* Let's check if we will get back a huge page backed by hugetlbfs */ down_read(current-mm-mmap_sem); vma = find_vma_intersection(current-mm, hva, hva + 1); - if (is_vm_hugetlb_page(vma)) { + if (is_vm_hugetlb_page(vma) !logging_active) { hugetlb = true; gfn = (fault_ipa PMD_MASK) PAGE_SHIFT; } else { @@ -963,7 +986,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, spin_lock(kvm-mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; - if (!hugetlb !force_pte) + if (!hugetlb !force_pte !logging_active) hugetlb = transparent_hugepage_adjust(pfn, fault_ipa); if (hugetlb) { @@ -982,9 +1005,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, kvm_set_pfn_dirty(pfn); } coherent_cache_guest_page(vcpu, hva, PAGE_SIZE); - ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false); + ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false, + logging_active); } + if (write_fault) + mark_page_dirty(kvm, gfn); out_unlock: spin_unlock(kvm-mmu_lock); @@ -1135,7 +1161,14 @@ static void 
kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data) { pte_t *pte = (pte_t *)data; - stage2_set_pte(kvm, NULL, gpa, pte, false); + /* +* We can always call stage2_set_pte with logging_active == false, +* because MMU notifiers will have unmapped a huge PMD before calling +* -change_pte() (which in turn calls kvm_set_spte_hva()) and therefore +* stage2_set_pte() never needs to clear out a huge PMD through this +* calling path. +*/ + stage2_set_pte(kvm, NULL, gpa, pte, false, false); } -- 1.7.9.5 -- To unsubscribe from this
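What drives logging_active from userspace is the memslot flag: setting KVM_MEM_LOG_DIRTY_PAGES allocates memslot->dirty_bitmap, clearing it frees the bitmap. A sketch of the toggle a migration manager performs (an illustrative wrapper around the stock ioctl):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int set_dirty_logging(int vm_fd,
                                 struct kvm_userspace_memory_region *r,
                                 int enable)
    {
        if (enable)
            r->flags |= KVM_MEM_LOG_DIRTY_PAGES;
        else
            r->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
    }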
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
On Mon, 09/22 21:23, Zhang Haoyu wrote: Amit, It's related to the big number of ioeventfds used in virtio-serial-pci. With virtio-serial-pci's ioeventfd=off, the performance is not affected no matter if guest initializes it or not. In my test, there are 12 fds to poll in qemu_poll_ns before loading guest virtio_console.ko, whereas 76 once modprobe virtio_console. Looks like the ppoll takes more time to poll more fds. Some trace data with systemtap: 12 fds: time rel_time symbol 15(+1) qemu_poll_ns [enter] 18(+3) qemu_poll_ns [return] 76 fd: 12(+2) qemu_poll_ns [enter] 18(+6) qemu_poll_ns [return] I haven't looked at virtio-serial code, I'm not sure if we should reduce the number of ioeventfds in virtio-serial-pci or focus on lower level efficiency. Does ioeventfd=off hamper the performance of virtio-serial? In theory it has an impact, but I have no data about this. If you have a performance demand, it's best to try it against your use case to answer this question. Fam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
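If the extra ioeventfds are the suspect, the comparison Fam suggests can be made directly with the device property mentioned in the thread; a sketch of the relevant QEMU option (the rest of the command line is elided):

    qemu-system-x86_64 ... -device virtio-serial-pci,ioeventfd=off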
[PATCH] x86, kvm: use macros to compute bank MSRs
Avoid open coded calculations for bank MSRs by using well-defined macros that hide the index of higher bank MSRs. No semantic changes. Signed-off-by: Chen Yucong sla...@gmail.com --- arch/x86/kvm/x86.c |8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 31e55ae..e8c1e3b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1825,7 +1825,7 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) break; default: if (msr >= MSR_IA32_MC0_CTL && - msr < MSR_IA32_MC0_CTL + 4 * bank_num) { + msr < MSR_IA32_MCx_CTL(bank_num)) { u32 offset = msr - MSR_IA32_MC0_CTL; /* only 0 or all 1s can be written to IA32_MCi_CTL * some Linux kernels though clear bit 10 in bank 4 to @@ -2184,7 +2184,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: - case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1: + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: return set_msr_mce(vcpu, msr, data); /* Performance counters are not protected by a CPUID bit, @@ -2350,7 +2350,7 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) break; default: if (msr >= MSR_IA32_MC0_CTL && - msr < MSR_IA32_MC0_CTL + 4 * bank_num) { + msr < MSR_IA32_MCx_CTL(bank_num)) { u32 offset = msr - MSR_IA32_MC0_CTL; data = vcpu->arch.mce_banks[offset]; break; @@ -2531,7 +2531,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) case MSR_IA32_MCG_CAP: case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: - case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1: + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: return get_msr_mce(vcpu, msr, pdata); case MSR_K7_CLK_CTL: /* -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
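For reference, the macro hiding the arithmetic: each MCE bank owns four consecutive MSRs (CTL, STATUS, ADDR, MISC), so bank x's CTL register sits at a fixed stride from bank 0. The definition, as in msr-index.h:

    #define MSR_IA32_MCx_CTL(x)	(MSR_IA32_MC0_CTL + 4*(x))

    /* so the old open-coded bound and the new one are identical:
     *   MSR_IA32_MC0_CTL + 4 * bank_num == MSR_IA32_MCx_CTL(bank_num) */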
[PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables PAPR defined feature called Dynamic DMA windows (DDW). Each Partitionable Endpoint (IOMMU group) has a separate DMA window on a PCI bus where devices are allows to perform DMA. By default there is 1 or 2GB window allocated at the host boot time and these windows are used when an IOMMU group is passed to the userspace (guest). These windows are mapped at zero offset on a PCI bus. Hi-speed devices may suffer from limited size of this window. On the host side a TCE bypass mode is enabled on POWER8 CPU which implements direct mapping of the host memory to a PCI bus at 159. For the guest, PAPR defines a DDW RTAS API which allows the pseries guest to query the hypervisor if it supports DDW and what are the parameters of possible windows. Currently POWER8 supports 2 DMA windows per PE - already mentioned and used small 32bit window and 64bit window which can only start from 159 and can support various page sizes. This patchset reworks PPC IOMMU code and adds necessary structures to extend it to support big windows. When the guest detectes the feature and the PE is capable of 64bit DMA, it does: 1. query to hypervisor about number of available windows and page masks; 2. creates a window with the biggest possible page size (current guests can do 64K or 16MB TCEs); 3. maps the entire guest RAM via H_PUT_TCE* hypercalls 4. switches dma_ops to direct_dma_ops on the selected PE. Once this is done, H_PUT_TCE is not called anymore and the guest gets maximum performance. Please comment. Thanks! Changes: v2: * added missing __pa() in powerpc/powernv: Release replaced TCE * reposted to make some noise :) Alexey Kardashevskiy (13): powerpc/iommu: Check that TCE page size is equal to it_page_size powerpc/powernv: Make invalidate() a callback powerpc/spapr: vfio: Implement spapr_tce_iommu_ops powerpc/powernv: Convert/move set_bypass() callback to take_ownership() powerpc/iommu: Fix IOMMU ownership control functions powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table powerpc/powernv: Do not set read flag if direction==DMA_NONE powerpc/powernv: Release replaced TCE powerpc/pseries/lpar: Enable VFIO powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA vfio: powerpc/spapr: Move locked_vm accounting to helpers vfio: powerpc/spapr: Use it_page_size vfio: powerpc/spapr: Enable Dynamic DMA windows arch/powerpc/include/asm/iommu.h| 35 ++- arch/powerpc/include/asm/machdep.h | 25 -- arch/powerpc/include/asm/tce.h | 37 +++ arch/powerpc/kernel/iommu.c | 213 +-- arch/powerpc/kernel/vio.c | 5 +- arch/powerpc/platforms/cell/iommu.c | 9 +- arch/powerpc/platforms/pasemi/iommu.c | 8 +- arch/powerpc/platforms/powernv/pci-ioda.c | 233 +++-- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 4 +- arch/powerpc/platforms/powernv/pci.c| 113 +--- arch/powerpc/platforms/powernv/pci.h| 15 +- arch/powerpc/platforms/pseries/iommu.c | 77 -- arch/powerpc/sysdev/dart_iommu.c| 13 +- drivers/vfio/vfio_iommu_spapr_tce.c | 384 +++- include/uapi/linux/vfio.h | 25 +- 15 files changed, 925 insertions(+), 271 deletions(-) -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. Since the set_bypass() is not really an iommu_table function but PE's function, and we have an ops struct per IOMMU owner, let's move set_bypass() to the spapr_tce_iommu_ops struct. As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and has very little to do with PEs, this moves take_ownership() calls to the VFIO SPAPR TCE driver. This renames set_bypass() to take_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it a generic name. The bool parameter is inverted. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/include/asm/iommu.h | 1 - arch/powerpc/include/asm/tce.h| 2 ++ arch/powerpc/kernel/iommu.c | 12 arch/powerpc/platforms/powernv/pci-ioda.c | 20 drivers/vfio/vfio_iommu_spapr_tce.c | 16 5 files changed, 30 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 84ee339..2b0b01d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -77,7 +77,6 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h index 9f159eb..e6355f9 100644 --- a/arch/powerpc/include/asm/tce.h +++ b/arch/powerpc/include/asm/tce.h @@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops { struct iommu_table *(*get_table)( struct spapr_tce_iommu_group *data, int num); + void (*take_ownership)(struct spapr_tce_iommu_group *data, + bool enable); }; struct spapr_tce_iommu_group { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 1c5dae7..c2c8d9d 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl-it_map, 0xff, sz); iommu_clear_tces_and_put_pages(tbl, tbl-it_offset, tbl-it_size); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl-set_bypass) - tbl-set_bypass(tbl, false); - return 0; } EXPORT_SYMBOL_GPL(iommu_take_ownership); @@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl) /* Restore bit#0 set by iommu_init_table() */ if (tbl-it_offset == 0) set_bit(0, tbl-it_map); - - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl-set_bypass) - tbl-set_bypass(tbl, true); } EXPORT_SYMBOL_GPL(iommu_release_ownership); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 2d32a1c..8cb2f31 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, - tce32.table); uint16_t window_id = (pe-pe_number 1 ) 
+ 1; int64_t rc; @@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. */ if (pe-pdev) - set_iommu_table_base(pe-pdev-dev, tbl); + set_iommu_table_base(pe-pdev-dev, pe-tce32.table); else pnv_ioda_setup_bus_dma(pe, pe-pbus, false); } @@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe-tce_bypass_base = 1ull 59; - /* Install set_bypass callback for VFIO */ - pe-tce32.table.set_bypass = pnv_pci_ioda2_set_bypass; - /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(pe-tce32.table, true); + pnv_pci_ioda2_set_bypass(pe, true); +} + +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data, +bool
[PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c9fac97..0dccbc4 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl) if (!current || !current-mm) return -ESRCH; /* process exited */ - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + npages = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; down_write(current-mm-mmap_sem); locked = current-mm-locked_vm + npages; @@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl) if (!current || !current-mm) return; /* process exited */ - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + npages = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; down_write(current-mm-mmap_sem); if (npages current-mm-locked_vm) @@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data, if (info.argsz minsz) return -EINVAL; - info.dma32_window_start = tbl-it_offset IOMMU_PAGE_SHIFT_4K; - info.dma32_window_size = tbl-it_size IOMMU_PAGE_SHIFT_4K; + info.dma32_window_start = tbl-it_offset tbl-it_page_shift; + info.dma32_window_size = tbl-it_size tbl-it_page_shift; info.flags = 0; if (copy_to_user((void __user *)arg, info, minsz)) @@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data, VFIO_DMA_MAP_FLAG_WRITE)) return -EINVAL; - if ((param.size ~IOMMU_PAGE_MASK_4K) || - (param.vaddr ~IOMMU_PAGE_MASK_4K)) - return -EINVAL; - /* iova is checked by the IOMMU API */ tce = param.vaddr; if (param.flags VFIO_DMA_MAP_FLAG_READ) @@ -334,21 +330,25 @@ static long tce_iommu_ioctl(void *iommu_data, return -ENXIO; BUG_ON(!tbl-it_group); + if ((param.size ~IOMMU_PAGE_MASK(tbl)) || + (param.vaddr ~IOMMU_PAGE_MASK(tbl))) + return -EINVAL; + ret = iommu_tce_put_param_check(tbl, param.iova, tce); if (ret) return ret; - for (i = 0; i (param.size IOMMU_PAGE_SHIFT_4K); ++i) { + for (i = 0; i (param.size tbl-it_page_shift); ++i) { ret = iommu_put_tce_user_mode(tbl, - (param.iova IOMMU_PAGE_SHIFT_4K) + i, + (param.iova tbl-it_page_shift) + i, tce); if (ret) break; - tce += IOMMU_PAGE_SIZE_4K; + tce += IOMMU_PAGE_SIZE(tbl); } if (ret) iommu_clear_tces_and_put_pages(tbl, - param.iova IOMMU_PAGE_SHIFT_4K, i); + param.iova tbl-it_page_shift, i); iommu_flush_tce(tbl); @@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data, if (param.flags) return -EINVAL; - if (param.size ~IOMMU_PAGE_MASK_4K) - return -EINVAL; - tbl = spapr_tce_find_table(container, data, param.iova); if (!tbl) return -ENXIO; + if (param.size ~IOMMU_PAGE_MASK(tbl)) + return -EINVAL; + BUG_ON(!tbl-it_group); ret = iommu_tce_clear_param_check(tbl, param.iova, 0, - param.size IOMMU_PAGE_SHIFT_4K); + param.size tbl-it_page_shift); if (ret) return ret; ret = iommu_clear_tces_and_put_pages(tbl, - param.iova IOMMU_PAGE_SHIFT_4K, - param.size IOMMU_PAGE_SHIFT_4K); + param.iova tbl-it_page_shift, + param.size tbl-it_page_shift); iommu_flush_tce(tbl); return ret; -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body
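The macros this patch switches to make the page size a per-table property. Their shape, reproduced here from memory of the asm/iommu.h of that era, so treat as approximate:

    #define IOMMU_PAGE_SHIFT(tbl)	((tbl)->it_page_shift)
    #define IOMMU_PAGE_SIZE(tbl)	(1UL << IOMMU_PAGE_SHIFT(tbl))
    #define IOMMU_PAGE_MASK(tbl)	(~((1UL << IOMMU_PAGE_SHIFT(tbl)) - 1))

    /* e.g. the locked-page accounting above, expressed in host pages:
     *   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; */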
[PATCH v2 08/13] powerpc/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE(s) so the caller can release the pages afterwards. This makes iommu_tce_build() put pages returned by exchange(). This replaces iommu_clear_tce() with iommu_tce_build which now can call exchange() with TCE==NULL (i.e. clear). This preserves permission bits in TCE in iommu_put_tce_user_mode(). This removes use of pool locks for external IOMMU uses. This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not implement exchange() callback. Therefore the powernv platform is the only supported one after this patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v2: * added missing __pa() for TCE which was read from the table --- arch/powerpc/include/asm/iommu.h | 8 +++-- arch/powerpc/kernel/iommu.c | 62 arch/powerpc/platforms/powernv/pci.c | 40 +++ 3 files changed, 67 insertions(+), 43 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c725e4a..8e0537d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -49,6 +49,12 @@ struct iommu_table_ops { unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); + int (*exchange)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + unsigned long *old_tces, + enum dma_data_direction direction, + struct dma_attrs *attrs); void (*clear)(struct iommu_table *tbl, long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); @@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int iommu_put_tce_user_mode(struct iommu_table *tbl, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 678fee8..39ccce7 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock((pool-lock)); - - oldtce = tbl-it_ops-get(tbl, entry); - if (oldtce (TCE_PCI_WRITE | TCE_PCI_READ)) - tbl-it_ops-clear(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock((pool-lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, 
entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } + iommu_tce_build(tbl, entry, 0, DMA_NONE); } return 0; @@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction) { - int ret = -EBUSY; + int ret; unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - spin_lock((pool-lock)); + ret = tbl-it_ops-exchange(tbl, entry, 1, hwaddr, oldtce, + direction, NULL); -
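The caller-side contract that exchange() introduces, sketched from the iommu_tce_build() description in this patch: the setter hands back the old TCE and the caller, not the IOMMU layer, releases the page (the helper name is illustrative):

    static void tce_put_old_page(unsigned long oldtce)
    {
        struct page *page;

        if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
            return;		/* the entry was empty */

        page = pfn_to_page(oldtce >> PAGE_SHIFT);
        if (oldtce & TCE_PCI_WRITE)
            SetPageDirty(page);	/* the device may have written to it */
        put_page(page);
    }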
[PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++-- 1 file changed, 53 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 1c1a9c4..c9fac97 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,46 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(struct iommu_table *tbl) +{ + long ret = 0, locked, lock_limit, npages; + + if (!current || !current-mm) + return -ESRCH; /* process exited */ + + npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + + down_write(current-mm-mmap_sem); + locked = current-mm-locked_vm + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; + if (locked lock_limit !capable(CAP_IPC_LOCK)) { + pr_warn(RLIMIT_MEMLOCK (%ld) exceeded\n, + rlimit(RLIMIT_MEMLOCK)); + ret = -ENOMEM; + } else { + current-mm-locked_vm += npages; + } + up_write(current-mm-mmap_sem); + + return ret; +} + +static void decrement_locked_vm(struct iommu_table *tbl) +{ + long npages; + + if (!current || !current-mm) + return; /* process exited */ + + npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + + down_write(current-mm-mmap_sem); + if (npages current-mm-locked_vm) + npages = current-mm-locked_vm; + current-mm-locked_vm -= npages; + up_write(current-mm-mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data, static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; struct iommu_table *tbl; struct spapr_tce_iommu_group *data; @@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups +* each with 2GB DMA window, 8GB will be counted here. The reason for +* this is that we cannot tell here the amount of RAM used by the guest +* as this information is only available from KVM and VFIO is +* KVM agnostic. 
*/ tbl = data-ops-get_table(data, 0); if (!tbl) return -ENXIO; - down_write(current-mm-mmap_sem); - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; - locked = current-mm-locked_vm + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; - if (locked lock_limit !capable(CAP_IPC_LOCK)) { - pr_warn(RLIMIT_MEMLOCK (%ld) exceeded\n, - rlimit(RLIMIT_MEMLOCK)); - ret = -ENOMEM; - } else { - current-mm-locked_vm += npages; - container-enabled = true; - } - up_write(current-mm-mmap_sem); + ret = try_increment_locked_vm(tbl); + if (ret) + return ret; + + container-enabled = true; return ret; } @@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container *container) if (!tbl) return; - down_write(current-mm-mmap_sem); - current-mm-locked_vm -= (tbl-it_size - IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; - up_write(current-mm-mmap_sem); + decrement_locked_vm(tbl); } static void *tce_iommu_open(unsigned long arg) -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 07/13] powerpc/powernv: Do not set read flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track which TCE entries are in use. Since we are going to use the iommu_table without its locks and do xchg() instead, it becomes essential not to set bits which are not implied by the direction flag. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index deddcad..ab79e2d 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages, __be64 *tcep, *tces; u64 rpn; - proto_tce = TCE_PCI_READ; // Read allowed - - if (direction != DMA_TO_DEVICE) - proto_tce |= TCE_PCI_WRITE; + switch (direction) { + case DMA_BIDIRECTIONAL: + case DMA_FROM_DEVICE: + proto_tce = TCE_PCI_READ | TCE_PCI_WRITE; + break; + case DMA_TO_DEVICE: + proto_tce = TCE_PCI_READ; + break; + default: + proto_tce = 0; + break; + } tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset; rpn = __pa(uaddr) >> tbl->it_page_shift; -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO
The previous patch introduced iommu_table_ops::exchange() callback which effectively disabled VFIO on pseries. This implements exchange() for pseries/lpar so VFIO can work in nested guests. Since exchaange() callback returns an old TCE, it has to call H_GET_TCE for every TCE being put to the table so VFIO performance in guests running under PR KVM is expected to be slower than in guests running under HV KVM or bare metal hosts. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/pseries/iommu.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 9a7364f..ae15b5a 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long); static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, + unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { u64 rc = 0; u64 proto_tce, tce; u64 rpn; - int ret = 0; + int ret = 0, i = 0; long tcenum_start = tcenum, npages_start = npages; rpn = __pa(uaddr) TCE_SHIFT; @@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, while (npages--) { tce = proto_tce | (rpn TCE_RPN_MASK) TCE_RPN_SHIFT; + if (old_tces) + plpar_tce_get((u64)tbl-it_index, (u64)tcenum 12, + old_tces[i++]); rc = plpar_tce_put((u64)tbl-it_index, (u64)tcenum 12, tce); if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) { @@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, static DEFINE_PER_CPU(__be64 *, tce_page); -static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, +unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { @@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) { return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } @@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if (!tcep) { local_irq_restore(flags); return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } __get_cpu_var(tce_page) = tcep; @@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, for (l = 0; l limit; l++) { tcep[l] = cpu_to_be64(proto_tce | (rpn TCE_RPN_MASK) TCE_RPN_SHIFT); rpn++; + if (old_tces) + plpar_tce_get((u64)tbl-it_index, + (u64)(tcenum + l) 12, + old_tces[tcenum + l]); } rc = plpar_tce_put_indirect((u64)tbl-it_index, @@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, return ret; } +static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +long npages, unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs) +{ + return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL, + direction, attrs); +} + static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages) { u64 rc; @@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus) struct iommu_table_ops iommu_table_lpar_multi_ops = { .set = tce_buildmulti_pSeriesLP, + .exchange = tce_xchg_pSeriesLP, .clear = tce_freemulti_pSeriesLP, .get = 
 };
-- 
2.0.0
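For readers skimming the series: the net effect of exchange() is a "put the new TCE, learn the old one" primitive, which is what VFIO needs to unpin replaced pages. Below is a minimal sketch of a caller, assuming exchange() shares tce_xchg_pSeriesLP()'s signature (the callback itself is declared in an earlier patch of the series) and assuming a hypothetical put_old_page() helper for unpinning:

/*
 * Sketch only, not from the patch: replace one TCE and release whatever
 * page was mapped there before. put_old_page() is an assumed helper.
 */
static long replace_tce(struct iommu_table *tbl, long entry,
			unsigned long new_uaddr,
			enum dma_data_direction dir)
{
	unsigned long old_tce = 0;
	long ret;

	/* exchange() stores the previous TCE so the caller can unpin it */
	ret = tbl->it_ops->exchange(tbl, entry, 1, new_uaddr, &old_tce,
				    dir, NULL);
	if (ret)
		return ret;

	if (old_tce & (TCE_PCI_READ | TCE_PCI_WRITE))
		put_old_page(old_tce);

	return 0;
}

This also makes the commit message's performance caveat visible: every exchange() on pseries/lpar costs an extra H_GET_TCE hypercall per entry.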
[PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
SPAPR defines an interface to create additional DMA windows dynamically. "Dynamically" here means that the window is not allocated before the guest even starts; the guest can request it later. In practice, existing Linux guests check for the capability and, if it is present, create and map a DMA window as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns the number/size of windows
   which can be created (one, any page size);
2. create - ibm,create-pe-dma-window - creates a window;
3. remove - ibm,remove-pe-dma-window - removes a window; removing the
   default 32bit window is not allowed by this patch, this will be added
   later if needed;
4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
   to the default state; as the default window cannot be removed, it only
   removes the additional window if one was created.

The next patch will add the corresponding ioctls to the VFIO SPAPR TCE driver to provide the necessary support to userspace.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/tce.h            |  22 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 159 +++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.h      |   1 +
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index e6355f9..23b0362 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops {
			int num);
	void (*take_ownership)(struct spapr_tce_iommu_group *data,
			bool enable);
+
+	/* Dynamic DMA window */
+	/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+	long (*query)(struct spapr_tce_iommu_group *data,
+			__u32 *current_windows,
+			__u32 *windows_available,
+			__u32 *page_size_mask);
+	long (*create)(struct spapr_tce_iommu_group *data,
+			__u32 page_shift,
+			__u32 window_shift,
+			struct iommu_table **ptbl);
+	long (*remove)(struct spapr_tce_iommu_group *data,
+			struct iommu_table *tbl);
+	long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 296f49b..a6318cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		int num)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (pe->tce32.table.it_size)
+			return &pe->tce32.table;
+		return NULL;
+	case 1:
+		if (pe->tce64.table.it_size)
+			return &pe->tce64.table;
+		return NULL;
+	default:
+		return NULL;
+	}
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
				     bool enable)
 {
@@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+		__u32 *current_windows,
+		__u32 *windows_available, __u32 *page_size_mask)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	*windows_available = 2;
+	*current_windows = 0;
+	if (pe->tce32.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	if (pe->tce64.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	*page_size_mask =
+		DDW_PGSIZE_4K |
+		DDW_PGSIZE_64K |
+		DDW_PGSIZE_16M;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table **ptbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	struct page *tce_mem = NULL;
+	void *addr;
+	long ret;
+	unsigned long tce_table_size =
+		(1ULL << (window_shift - page_shift)) * 8;
+	unsigned
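The create() hunk above is cut off in this archive, but the query()/create() split already shows the intended flow: callers query capabilities first, then choose parameters. Below is a sketch of that selection, assuming the caller wants the largest supported IOMMU page size and a window covering all of guest RAM; pick_ddw_params() is a hypothetical helper, not part of the patch (the tce_table_size computation above confirms window_shift is log2 of the window size in bytes):

#include <linux/log2.h>

/* Sketch: derive create() arguments from query() output. */
static int pick_ddw_params(__u32 page_size_mask, unsigned long ram_bytes,
			   __u32 *page_shift, __u32 *window_shift)
{
	/* prefer the biggest page size the PHB advertises */
	if (page_size_mask & DDW_PGSIZE_16M)
		*page_shift = 24;
	else if (page_size_mask & DDW_PGSIZE_64K)
		*page_shift = 16;
	else if (page_size_mask & DDW_PGSIZE_4K)
		*page_shift = 12;
	else
		return -EINVAL;

	/* the window must cover RAM; window sizes are powers of two */
	*window_shift = order_base_2(ram_bytes);
	if (*window_shift < *page_shift)
		*window_shift = *page_shift;

	return 0;
}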
[PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds an iommu_table_ops struct and puts a pointer to it into the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong.

This adds an extra @ops parameter to iommu_init_table() to make sure that we do not leave any IOMMU table without iommu_table_ops. @it_ops is initialized at the very beginning, as iommu_init_table() calls iommu_table_clear() and the latter already uses the callbacks.

This does s/tce_build/set/, s/tce_free/clear/ and removes the tce_ prefixes for better readability.

This removes the tce_xxx_rm handlers from ppc_md as well, but does not add them to iommu_table_ops; this will be done later if we decide to support TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as callbacks for pseries. This changes the multi callbacks to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is that we still have to support the multitce=off boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace the multi callbacks with single ones.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 20 +++++++--
 arch/powerpc/include/asm/machdep.h          | 25 -----------
 arch/powerpc/kernel/iommu.c                 | 50 ++++++++++-----------
 arch/powerpc/kernel/vio.c                   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c         |  9 ++--
 arch/powerpc/platforms/pasemi/iommu.c       |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c        | 24 ++++++-----
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 42 ++++++++---------
 arch/powerpc/sysdev/dart_iommu.c            | 13 +++---
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@ extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-					    int nid);
+					    int nid,
+					    struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index b125cea..1fc824d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
	 * destroyed as well */
	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				   long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction
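To make the new convention concrete, here is a minimal sketch for a made-up platform (the my_platform_* names are illustrative; real examples are the pseries ops exported above):

/* Sketch: how a platform wires its callbacks after this patch. */
static struct iommu_table_ops my_platform_iommu_ops = {
	.set   = my_platform_tce_set,   /* assumed platform helpers */
	.clear = my_platform_tce_clear,
	.get   = my_platform_tce_get,
	/* .flush is optional */
};

static void my_platform_setup_table(struct iommu_table *tbl, int nid)
{
	/*
	 * it_ops must be valid on entry: iommu_init_table() calls
	 * iommu_table_clear(), which already goes through the callbacks.
	 */
	iommu_init_table(tbl, nid, &my_platform_iommu_ops);
}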
[PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears the TCE contents if there is no page marked busy in it_map. Clearing must be done outside of the table locks, as iommu_clear_tce(), called from iommu_clear_tces_and_put_pages(), takes these locks itself.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c2c8d9d..cd80867 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0, bit0 = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+		bit0 = test_and_clear_bit(0, tbl->it_map);
 
	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (bit0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-	return 0;
+	if (!ret)
+		iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+				tbl->it_size);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
+
	memset(tbl->it_map, 0, sz);
 
	/* Restore bit#0 set by iommu_init_table() */
	if (tbl->it_offset == 0)
		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
-- 
2.0.0
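A sketch of the intended caller pairing, e.g. a VFIO-style attach/detach path (the two wrapper functions are hypothetical):

/* Sketch: ownership handover around userspace use of the table. */
static int attach_table_to_user(struct iommu_table *tbl)
{
	/* -EBUSY here means the kernel still has live mappings */
	return iommu_take_ownership(tbl);
}

static void detach_table_from_user(struct iommu_table *tbl)
{
	/* clears user TCEs first, then resets it_map under the locks */
	iommu_release_ownership(tbl);
}

With the pool locks held across the it_map check and memset, the handover can no longer race with a concurrent iommu_alloc() marking bits in the same bitmap.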
[PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
This defines and implements a VFIO IOMMU API which lets userspace create and remove DMA windows. This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of available windows and the page mask. This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE to allow userspace to create and remove window(s). The VFIO IOMMU driver does basic sanity checks and calls the corresponding SPAPR TCE functions. At the moment only IODA2 (the POWER8 PCI host bridge) implements them. This advertises the VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via VFIO_IOMMU_SPAPR_TCE_GET_INFO. This calls the platform DDW reset() callback when the IOMMU is being disabled, to reset the DMA configuration to its original state.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h           |  25 ++++++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0dccbc4..b518891 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
 
	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!container->grp)
		return;
 
	data = iommu_group_get_iommudata(container->grp);
	if (!data || !data->iommu_owner || !data->ops->get_table)
		return;
 
-	tbl = data->ops->get_table(data, 0);
-	if (!tbl)
-		return;
+	if (current->mm) {
+		tbl = data->ops->get_table(data, 0);
+		if (tbl)
+			decrement_locked_vm(tbl);
 
-	decrement_locked_vm(tbl);
+		tbl = data->ops->get_table(data, 1);
+		if (tbl)
+			decrement_locked_vm(tbl);
+	}
+
+	if (data->ops->reset)
+		data->ops->reset(data);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
				 unsigned int cmd, unsigned long arg)
 {
	struct tce_container *container = iommu_data;
-	unsigned long minsz;
+	unsigned long minsz, ddwsz;
	long ret;
 
	switch (cmd) {
@@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
		info.flags = 0;
 
+		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				page_size_mask);
+
+		if (info.argsz == ddwsz) {
+			if (data->ops->query && data->ops->create &&
+					data->ops->remove) {
+				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
+
+				ret = data->ops->query(data,
+						&info.current_windows,
+						&info.windows_available,
+						&info.page_size_mask);
+				if (ret)
+					return ret;
+			} else {
+				info.current_windows = 0;
+				info.windows_available = 0;
+				info.page_size_mask = 0;
+			}
+			minsz = ddwsz;
+		}
+
		if (copy_to_user((void __user *)arg, &info, minsz))
			return -EFAULT;
 
@@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
		tce_iommu_disable(container);
		mutex_unlock(&container->lock);
		return 0;
+
	case VFIO_EEH_PE_OP:
		if (!container->grp)
			return -ENODEV;
 
		return vfio_spapr_iommu_eeh_ioctl(container->grp,
						  cmd, arg);
+
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (create.flags)
+			return -EINVAL;
+
+		if (!data->ops->create || !data->iommu_owner)
+			return -ENOSYS;
+
+		BUG_ON(!data || !data->ops || !data->ops->remove);
+
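From userspace, the argsz protocol above means existing binaries keep working while DDW-aware ones opt in by passing a larger struct. Below is a hedged sketch of a window-creation call: it assumes the patched uapi headers, and the page_shift/window_shift field names are assumptions since the vfio.h hunk is truncated here (only argsz, flags and start_addr are visible above):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>	/* patched headers assumed */

static int spapr_create_window(int container_fd, uint32_t page_shift,
			       uint32_t window_shift, uint64_t *start_addr)
{
	struct vfio_iommu_spapr_tce_create create;

	memset(&create, 0, sizeof(create));
	create.argsz = sizeof(create);		/* must reach start_addr */
	create.flags = 0;			/* any flag set is rejected */
	create.page_shift = page_shift;		/* assumed field name */
	create.window_shift = window_shift;	/* assumed field name */

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create))
		return -1;

	*start_addr = create.start_addr;	/* bus address of the window */
	return 0;
}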
[PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback
At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via container_of(tbl). Since we are going to add Dynamic DMA windows, which means having 2 IOMMU tables per PE, this is not going to work. This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback. This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer to the PE there. PNV's ppc_md.tce_build() call uses this to find the PE and do the invalidation. This will be used later for Dynamic DMA windows too. This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* changed commit log to explain why this change is needed
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ++++++++++---------
 arch/powerpc/platforms/powernv/pci.c      | 31 ++++++++---------
 arch/powerpc/platforms/powernv/pci.h      | 13 +++++++-
 3 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index df241b1..136e765 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
	pe = &phb->ioda.pe_array[pdn->pe_number];
	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
	} else {
		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
	}
	*pdev->dev.dma_mask = dma_mask;
	return 0;
@@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
	list_for_each_entry(dev, &bus->devices, bus_list) {
		if (add_to_iommu_group)
			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->tce32.table);
		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->tce32.table);
 
		if (dev->subordinate)
			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
-{
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
				      struct pnv_ioda_pe *pe, unsigned int base,
				      unsigned int segs)
@@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
	}
 
	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate;
 
	/* OPAL variant of P7IOC SW invalidated TCEs */
	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+					      tce32.table);
	uint16_t window_id = (pe->pe_number << 1) + 1;
	int64_t rc;
 
@@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
	pe->tce_bypass_base = 1ull << 59;
 
	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void
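The wrapper this patch introduces can be pictured as below. The struct itself lives in pci.h, whose hunk is truncated above, so the layout here is a sketch inferred from the pe->tce32.pe and pe->tce32.invalidate_fn assignments; pnv_tce_invalidate() is a hypothetical caller showing the dispatch that replaces the removed container_of() function:

/* Sketch of the wrapper; the real definition is in pci.h. */
struct pnv_iommu_table {
	struct iommu_table	table;
	struct pnv_ioda_pe	*pe;
	void (*invalidate_fn)(struct pnv_ioda_pe *pe,
			      struct iommu_table *tbl,
			      __be64 *startp, __be64 *endp, bool rm);
};

static void pnv_tce_invalidate(struct iommu_table *tbl,
			       __be64 *startp, __be64 *endp, bool rm)
{
	struct pnv_iommu_table *ptbl =
		container_of(tbl, struct pnv_iommu_table, table);

	/* per-PE callback set at setup time; no PHB type check needed */
	if (ptbl->invalidate_fn)
		ptbl->invalidate_fn(ptbl->pe, tbl, startp, endp, rm);
}

The gain: a PE with two tables (tce32 and a future tce64) yields the right PE pointer from either table, which container_of() against pnv_ioda_pe could not provide.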
[PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
Modern IBM POWERPC systems support multiple IOMMU tables per PE, so we need a more reliable way (compared to container_of()) to get a PE pointer from the iommu_table struct pointer used in IOMMU functions. At the moment the IOMMU group data points to an iommu_table struct. This introduces a spapr_tce_iommu_group struct which keeps an iommu_owner and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to the pnv_ioda_pe struct; for others it is still a pointer to the iommu_table struct. The ops structs correspond to the type which iommu_owner points to.

This defines a get_table() callback which returns an iommu_table by its number. As the IOMMU group data pointer now points to a variable type instead of iommu_table, the VFIO SPAPR TCE driver is updated to use the new type. This changes the tce_container struct to store an iommu_group instead of an iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
  - iommu_table;
  - pnv_ioda_pe.

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2 for now; IODA2 will get its own pnv_ioda2_iommu_get_table soon, and pnv_ioda1_iommu_get_table will then only be used for IODA1.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   6 ++
 arch/powerpc/include/asm/tce.h              |  13 +++
 arch/powerpc/kernel/iommu.c                 |  35 ++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +++++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/pseries/iommu.c      |  10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 148 ++++++++++++++++++++++++----
 8 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
					    int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+				 void *iommu_owner,
+				 struct spapr_tce_iommu_ops *ops,
				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+					void *iommu_owner,
+					struct spapr_tce_iommu_ops *ops,
					int pci_domain_number,
					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..9f159eb 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,18 @@
 #define TCE_PCI_READ		0x1	/* read from PCI allowed */
 #define TCE_VB_WRITE		0x1	/* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+struct spapr_tce_iommu_ops {
+	struct iommu_table *(*get_table)(
+			struct spapr_tce_iommu_group *data,
+			int num);
+};
+
+struct spapr_tce_iommu_group {
+	void *iommu_owner;
+	struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b378f78..1c5dae7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_default_get_table(
+		struct spapr_tce_iommu_group *data, int num)
+{
+	struct iommu_table *tbl = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (tbl->it_size)
+			return tbl;
+		/* fallthru */
+	default:
+		return NULL;
+	}
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+	.get_table = spapr_tce_default_get_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
		int pci_domain_number, unsigned long
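On the VFIO side, the practical consequence is one extra indirection when resolving a table from a group. Below is a sketch of the lookup using only the structures this patch defines (tce_find_table() itself is an illustrative helper, not from the diff):

static struct iommu_table *tce_find_table(struct iommu_group *grp, int num)
{
	struct spapr_tce_iommu_group *data = iommu_group_get_iommudata(grp);

	if (!data || !data->ops || !data->ops->get_table)
		return NULL;

	/* num 0 is the default 32bit window */
	return data->ops->get_table(data, num);
}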
[PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size
This checks that the TCE table page size is not bigger than the size of a page we have just pinned and are going to put the physical address of into the table. Otherwise the hardware gets unwanted access to the physical memory between the end of the actual page and the end of the aligned-up TCE page.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/hugetlb.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
			tce, entry << tbl->it_page_shift, ret); */
		return -EFAULT;
	}
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	if (PageHuge(page)) {
+		struct page *head = compound_head(page);
+		long shift = PAGE_SHIFT + compound_order(head);
+
+		if (shift < tbl->it_page_shift) {
+			ret = -EINVAL;
+			goto put_page_exit;
+		}
+
+	}
+
	hwaddr = (unsigned long) page_address(page) + offset;
 
	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
	if (ret)
-		put_page(page);
+		goto put_page_exit;
 
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+	return 0;
+
+put_page_exit:
+	pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
			__func__, entry << tbl->it_page_shift, tce, ret);
+
+	put_page(page);
+
	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-- 
2.0.0
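A worked example of the check: with 4K kernel pages, a 16MB huge page has shift = PAGE_SHIFT + compound_order(head) = 12 + 12 = 24, so it may safely back a 64K (it_page_shift = 16) TCE entry; the reverse pairing must fail, or the device could reach the 16M - 64K of memory behind the pinned 64K page that was never granted. Restated as a predicate (a sketch, not code from the patch):

/* Sketch: a pinned page may back a TCE entry only if it is not smaller. */
static bool tce_page_size_ok(long page_shift, long it_page_shift)
{
	return page_shift >= it_page_shift;
}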
Re: [patch 4/4] KVM: MMU: pinned sps are not candidates for deletion.
Hi Marcelo,

Sorry for the delay.

On Sep 9, 2014, at 11:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

On Tue, Jul 22, 2014 at 05:59:42AM +0800, Xiao Guangrong wrote:

On Jul 10, 2014, at 3:12 AM, mtosa...@redhat.com wrote:

Skip pinned shadow pages when selecting pages to zap.

It seems there is no way to prevent changing a pinned spte on the zap-all path?

Xiao,

The way would be to reload remote mmus, forcing the vcpu to exit and the page to be zapped; the vcpu will then page-fault any necessary page back in via kvm_mmu_pin_pages. kvm_mmu_invalidate_zap_all_pages does:

- spin_lock(mmu_lock)
- kvm_reload_remote_mmus
...
- spin_unlock(mmu_lock)

So it is OK to change a pinned spte on the zap-all path.

Got it, thanks for your explanation. I am thinking whether we could move pinned sptes to a separate list (e.g. pinned_shadow_pages) instead of the active list, so that they cannot be touched by any other free paths. What is your opinion?

As mentioned above, it is OK to zap pinned sptes as long as a reload-remote-mmus request is performed. That said, do you still consider a separate list necessary?

I need to think more about this idea... let's address it with your patch first. :)
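For readers following the thread, Xiao's alternative can be sketched as follows; this is entirely hypothetical (struct and function names invented for illustration) and is not part of Marcelo's patch:

/*
 * Sketch of the separate-list idea: pinned shadow pages move off the
 * active list so generic zap paths never select them; zap-all would
 * still reload remote mmus and let vcpus re-pin via kvm_mmu_pin_pages().
 */
struct kvm_mmu_page_sketch {
	struct list_head link;
	/* ... rest of the shadow page ... */
};

static void sp_mark_pinned(struct list_head *pinned_shadow_pages,
			   struct kvm_mmu_page_sketch *sp)
{
	list_move(&sp->link, pinned_shadow_pages);
}

The trade-off the thread settles on: the extra list is unnecessary as long as every zap of a pinned spte is paired with a reload-remote-mmus request, which kvm_mmu_invalidate_zap_all_pages already performs under mmu_lock.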