Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature, and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

--
MST
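For reference, the event-index test both sides refer to is vring_need_event() from include/uapi/linux/virtio_ring.h; an urgent-aware vhost_notify() would keep applying it rather than short-circuiting on vq->urgent alone. The helper is reproduced below (the surrounding driver plumbing in the RFC patch is not shown):

/* From include/uapi/linux/virtio_ring.h: the guest publishes used_event
 * as the used index at which it wants an interrupt. Signal only if the
 * new used index has moved past it since the last signal; the
 * subtractions are wrap-safe on __u16. */
static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
{
	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}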
Re: [Qemu-devel] [RFC patch 5/6] s390: implement pci instruction
On Fri, Sep 19, 2014 at 05:12:15PM +0200, Thomas Huth wrote:

Hi Frank,

On Fri, 19 Sep 2014 13:54:34 +0200 frank.blasc...@de.ibm.com wrote:

From: Frank Blaschka frank.blasc...@de.ibm.com

This patch implements the s390 pci instructions in qemu. This allows to attach qemu pci devices including vfio. This does not mean the devices are functional but at least detection and config/memory space access is working.

Signed-off-by: Frank Blaschka frank.blasc...@de.ibm.com
---
 target-s390x/Makefile.objs |    2
 target-s390x/kvm.c         |   52 +++
 target-s390x/pci_ic.c      |  621 +
 target-s390x/pci_ic.h      |  425 ++
 4 files changed, 1099 insertions(+), 1 deletion(-)

--- a/target-s390x/Makefile.objs
+++ b/target-s390x/Makefile.objs
@@ -2,4 +2,4 @@ obj-y += translate.o helper.o cpu.o inte
 obj-y += int_helper.o fpu_helper.o cc_helper.o mem_helper.o misc_helper.o
 obj-y += gdbstub.o
 obj-$(CONFIG_SOFTMMU) += ioinst.o arch_dump.o
-obj-$(CONFIG_KVM) += kvm.o
+obj-$(CONFIG_KVM) += kvm.o pci_ic.o
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -40,6 +40,7 @@
 #include "exec/gdbstub.h"
 #include "trace.h"
 #include "qapi-event.h"
+#include "pci_ic.h"

 /* #define DEBUG_KVM */

@@ -56,6 +57,7 @@
 #define IPA0_B2                         0xb200
 #define IPA0_B9                         0xb900
 #define IPA0_EB                         0xeb00
+#define IPA0_E3                         0xe300

 #define PRIV_B2_SCLP_CALL               0x20
 #define PRIV_B2_CSCH                    0x30
@@ -76,8 +78,17 @@
 #define PRIV_B2_XSCH                    0x76

 #define PRIV_EB_SQBS                    0x8a
+#define PRIV_EB_PCISTB                  0xd0
+#define PRIV_EB_SIC                     0xd1

 #define PRIV_B9_EQBS                    0x9c
+#define PRIV_B9_CLP                     0xa0
+#define PRIV_B9_PCISTG                  0xd0
+#define PRIV_B9_PCILG                   0xd2
+#define PRIV_B9_RPCIT                   0xd3
+
+#define PRIV_E3_MPCIFC                  0xd0
+#define PRIV_E3_STPCIFC                 0xd4

 #define DIAG_IPL                        0x308
 #define DIAG_KVM_HYPERCALL              0x500
@@ -813,6 +824,18 @@ static int handle_b9(S390CPU *cpu, struc
     int r = 0;

     switch (ipa1) {
+    case PRIV_B9_CLP:
+        r = kvm_clp_service_call(cpu, run);
+        break;
+    case PRIV_B9_PCISTG:
+        r = kvm_pcistg_service_call(cpu, run);
+        break;
+    case PRIV_B9_PCILG:
+        r = kvm_pcilg_service_call(cpu, run);
+        break;
+    case PRIV_B9_RPCIT:
+        r = kvm_rpcit_service_call(cpu, run);
+        break;
     case PRIV_B9_EQBS:
         /* just inject exception */
         r = -1;
@@ -831,6 +854,12 @@ static int handle_eb(S390CPU *cpu, struc
     int r = 0;

     switch (ipa1) {
+    case PRIV_EB_PCISTB:
+        r = kvm_pcistb_service_call(cpu, run);
+        break;
+    case PRIV_EB_SIC:
+        r = kvm_sic_service_call(cpu, run);
+        break;
     case PRIV_EB_SQBS:
         /* just inject exception */
         r = -1;

I'm not sure, but I think the handler for the eb instructions is wrong: The second byte of the opcode is encoded in the lowest byte of the ipb field, not the lowest byte of the ipa field (just like with the e3 handler). Did you verify that your handlers get called correctly?

Hi Thomas,

you are absolutely right. I already have a patch available for this issue but did not append it to this RFC post (since it is basically a bug fix). To the next posting I will add this patch as well. Will also fix the remaining issues. Thanks for your review.

@@ -844,6 +873,26 @@ static int handle_eb(S390CPU *cpu, struc
     return r;
 }

+static int handle_e3(S390CPU *cpu, struct kvm_run *run, uint8_t ipa1)
+{
+    int r = 0;
+
+    switch (ipa1) {
+    case PRIV_E3_MPCIFC:
+        r = kvm_mpcifc_service_call(cpu, run);
+        break;
+    case PRIV_E3_STPCIFC:
+        r = kvm_stpcifc_service_call(cpu, run);
+        break;
+    default:
+        r = -1;
+        DPRINTF("KVM: unhandled PRIV: 0xe3%x\n", ipa1);
+        break;
+    }
+
+    return r;
+}

Could you please replace ipa1 with ipb1 to avoid confusion here?

 static int handle_hypercall(S390CPU *cpu, struct kvm_run *run)
 {
     CPUS390XState *env = &cpu->env;
@@ -1038,6 +1087,9 @@ static int handle_instruction(S390CPU *c
     case IPA0_EB:
         r = handle_eb(cpu, run, ipa1);
         break;
+    case IPA0_E3:
+        r = handle_e3(cpu, run, run->s390_sieic.ipb & 0xff);
+        break;
     case IPA0_DIAG:
         r = handle_diag(cpu, run, run->s390_sieic.ipb);
         break;
--- /dev/null
+++ b/target-s390x/pci_ic.c
@@ -0,0 +1,621 @@
[...]
+
+int
Re: [RFC] KVM-S390: Provide guest TOD Clock Get/Set Controls
On 09/19/2014 10:38 PM, Alexander Graf wrote:

On 19.09.14 20:51, Christian Borntraeger wrote:

On 09/19/2014 04:19 PM, Jason J. Herne wrote:

From: Jason J. Herne jjhe...@us.ibm.com

Enable the KVM_SET_CLOCK and KVM_GET_CLOCK ioctls on S390 for managing the guest TOD clock value.

Just some education: on s390 the guest-visible TOD clock is the host TOD clock + a hypervisor-programmable offset in the control block. There is only one TOD per system, so the offset must be the same for every CPU.

Can that offset be negative?

The offset is a u64, but the usual sum rules apply. The carry is ignored, so by using a large value you can get an effectively negative offset.

we add the KVM_CLOCK_FORWARD_ONLY flag to indicate to KVM_SET_CLOCK that the given clock value should only be set if it is >= the current guest TOD clock

host TOD, (right?)

The alternative scheme would be to simply get/set the guest TOD time. This works perfectly for migration, but for managedsave the guest time is in the past. Your approach has the advantage that after managedsave the guest will (most of the time) have the host time of the target system, avoiding that the guest has a time that is in the past (e.g. after 1 week of managedsave the guest would live in the past).

But that's what users will expect, no? When you save an image in the past, it should resume at that very point in time.

Actually, I would expect something different (more or less something like standby/resume). In fact Jason's code that we have internally in testing is doing the simple approach:

1. source reads guest time at migration end
2. target sets guest time from source

So we have the guarantee that the time will never move backwards. It also works quite well for migration. As a bonus, we could really reuse the existing ioctl. I asked Jason to explore alternatives, though: I think it is somehow wrong if you save a guest into an image file, open it one month later, and the guest will always be 1 month behind unless it uses some kind of ntp.

If everybody agrees that this is fine, I will queue up Jason's initial patch (simple get/set). The only question is then: shall we use an s390-specific ioctl (e.g. via VM attribute) or just use the existing KVM_SET_CLOCK? But maybe let's answer the first question before we decide on this. Also, I personally don't care whether the interface is "delta to now" or "this is the time".

In general, "delta to now" is safer because you can't possibly run back in time. But you also definitely want to check out the way PPC does it - it also accommodates for the time we spend inside the migration path itself.

Alex

Question for Paolo (maybe others) is: does it make sense to reuse/extend the existing ioctl? (I think so, but defining a new one could also be ok.)

Christian

value. This guarantees a monotonically increasing time.

NOTE: In the event that the KVM_CLOCK_FORWARD_ONLY flag is set and the given time would cause the guest time to jump backward, then we set the guest TOD clock equal to the host TOD clock. Does this behavior make sense, or is it too weird? I could believe that other architectures might not want this exact behavior. Instead they might prefer to implement the function such that an error code is returned instead of syncing the guest time to host time? In that case S390 would need another bit KVM_CLOCK_SET_TO_HOST which we could call to sync host time when the preferred guest time value would otherwise violate the monotonic property of the KVM_CLOCK_FORWARD_ONLY flag.

Signed-off-by: Jason J. Herne jjhe...@us.ibm.com
---
 Documentation/virtual/kvm/api.txt |  5 +++
 arch/s390/kvm/kvm-s390.c          | 80 +++
 include/uapi/linux/kvm.h          |  3 ++
 3 files changed, 88 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index beae3fd..615c2e4 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -779,6 +779,11 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };

+S390: KVM_CLOCK_FORWARD_ONLY is used by KVM_SET_CLOCK to indicate that the guest
+TOD clock should not be allowed to jump back in time. This flag guarantees a
+monotonically increasing guest clock. If the clock value specified would cause
+the guest to jump back in time then the guest TOD clock is set to the host
+TOD clock value.

 4.31 KVM_GET_VCPU_EVENTS

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..2450db3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -31,6 +31,7 @@
 #include <asm/switch_to.h>
 #include <asm/facility.h>
 #include <asm/sclp.h>
+#include <asm/timex.h>
 #include "kvm-s390.h"
 #include "gaccess.h"

@@ -169,6 +170,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_IRQCHIP:
 	case
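Christian's point that a u64 offset with a discarded carry gives an effectively negative offset is plain modular arithmetic; a standalone illustration (my example, not code from the patch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t host_tod = 1000;
	uint64_t behind = 300;                 /* want the guest 300 units in the past */
	uint64_t epoch = 0 - behind;           /* wraps to 2^64 - 300 */
	uint64_t guest_tod = host_tod + epoch; /* carry discarded: 700 */

	printf("guest TOD = %llu\n", (unsigned long long)guest_tod);
	return 0;
}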
Re: [PATCH] kvm: Make init_rmode_tss() return 0 on success.
Il 20/09/2014 01:44, Radim Krčmář ha scritto:

This patch removes the redundant variable by making init_rmode_tss() return 0 on success, -errno on failure.

Which is going to propagate all the way to userspace through the ioctl ... is this change of A[PB]I acceptable? Otherwise, -EFAULT seems to match unlikely problems better than -ENOMEM, so if it is acceptable:

Reviewed-by: Radim Krčmář rkrc...@redhat.com

Yes, I think it is a bugfix.

Paolo
Re: [PATCH v7 6/9] kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static.
Il 20/09/2014 12:47, Tang Chen ha scritto:

Since different architectures need different handling, we will add some arch-specific code later. The code may need to make cpu requests outside kvm_main.c, so make it non-static and rename it to kvm_make_all_cpus_request().

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c23236a..73de13c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -580,6 +580,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 void kvm_make_scan_ioapic_request(struct kvm *kvm);
+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..0f8b6f6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -152,7 +152,7 @@ static void ack_flush(void *_completed)
 {
 }

-static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
+bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
 	int i, cpu, me;
 	cpumask_var_t cpus;
@@ -189,7 +189,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 	long dirty_count = kvm->tlbs_dirty;

 	smp_mb();
-	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
+	if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
 }
@@ -197,17 +197,17 @@ EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);

 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }

 void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }

 void kvm_make_scan_ioapic_request(struct kvm *kvm)
 {
-	make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
 }

 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Re: [PATCH v7 9/9] kvm, mem-hotplug: Unpin and remove kvm_arch-apic_access_page.
Il 20/09/2014 12:47, Tang Chen ha scritto:

@@ -4534,8 +4539,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	}

 	if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
-		vmcs_write64(APIC_ACCESS_ADDR,
-			     page_to_phys(vmx->vcpu.kvm->arch.apic_access_page));
+		kvm_vcpu_reload_apic_access_page(vcpu);

 	if (vmx_vm_has_apicv(vcpu->kvm))
 		memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));

Please make the call unconditional in kvm_vcpu_reset.

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9c8ae32..99378d7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3624,6 +3624,11 @@ static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
 	return false;
 }

+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	return;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;

I will ask you to modify the vmx_has_secondary_apic_access callback in such a way that svm_set_apic_access_page_addr is not needed, so please remove it from v8.

Paolo
Re: [PATCH v7 7/9] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We are handling the situation where L1 and L2 share one apic access page when migrating the apic access page. We should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

This patch handles 3). In the L0->L2 entry, L2's vmcs will be updated in prepare_vmcs02() called by nested_vm_run(), so we need to do nothing. In the L2->L1 exit, this patch requests an apic access page reload in the L2->L1 vmexit.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx.c              | 6 ++
 arch/x86/kvm/x86.c              | 3 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 56156eb..1a8317e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1047,6 +1047,7 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu);
+void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);

 void kvm_define_shared_msr(unsigned index, u32 msr);
 void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c8e90ec..baac78a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8803,6 +8803,12 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	}

 	/*
+	 * We are now running in L2, mmu_notifier will force to reload the
+	 * page's hpa for L2 vmcs. Need to reload it for L1 before entering L1.
+	 */
+	kvm_vcpu_reload_apic_access_page(vcpu);
+
+	/*
 	 * Exiting from L2 to L1, we're now back to L1 which thinks it just
 	 * finished a VMLAUNCH or VMRESUME instruction, so we need to set the
 	 * success or failure flag accordingly.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc54fa6..2ae2dc7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5989,7 +5989,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_apic_update_tmr(vcpu, tmr);
 }

-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * Only APIC access page shared by L1 and L2 vm is handled. The APIC
@@ -6009,6 +6009,7 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 			page_to_phys(vcpu->kvm->arch.apic_access_page));
 	}
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);

 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Re: [PATCH v7 8/9] kvm, mem-hotplug: Add arch specific mmu notifier to handle apic access migration.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We are handling the situation where L1 and L2 share one apic access page when migrating the apic access page. We should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

This patch forces a L1->L0 exit or L2->L0 exit when the shared apic access page is migrated, using the mmu notifier. Since the apic access page is only used on intel x86, this is arch-specific code.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c       | 11 +++
 include/linux/kvm_host.h | 14 +-
 virt/kvm/kvm_main.c      |  3 +++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ae2dc7..7dd4179 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6011,6 +6011,17 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);

+void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+					   unsigned long address)
+{
+	/*
+	 * The physical address of apic access page is stored in VMCS.
+	 * Update it when it becomes invalid.
+	 */
+	if (address == gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT))
+		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without
  * exiting to the userspace. Otherwise, the value will be returned to the
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 73de13c..b6e4d38 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -917,7 +917,19 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
 		return 1;
 	return 0;
 }
-#endif
+
+#ifdef _ASM_X86_KVM_HOST_H
+void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+					   unsigned long address);
+#else /* _ASM_X86_KVM_HOST_H */
+inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
+						  unsigned long address)

This needs to be static. However, do not add it here. Instead, touch all arch/*/include/asm/kvm_host.h files, adding either the prototype or the inline function. Apart from this, the patch looks good.

Paolo

+{
+	return;
+}
+#endif /* _ASM_X86_KVM_HOST_H */
+
+#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f8b6f6..5427973d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -295,6 +295,9 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	kvm_flush_remote_tlbs(kvm);

 	spin_unlock(&kvm->mmu_lock);
+
+	kvm_arch_mmu_notifier_invalidate_page(kvm, address);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 }
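What Paolo is asking for amounts to the usual per-arch hook pattern; a sketch under the assumption that every architecture gets its own copy (the exact file list is up to the patch author):

/* In each non-x86 arch/<arch>/include/asm/kvm_host.h: a static inline
 * no-op, so common code can call the hook unconditionally. */
static inline void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
							 unsigned long address)
{
}

/* In arch/x86/include/asm/kvm_host.h: only the prototype; the real
 * implementation stays in arch/x86/kvm/x86.c. */
void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
					   unsigned long address);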
Re: [PATCH v7 5/9] kvm, mem-hotplug: Reload L1's apic access page in vcpu_enter_guest().
Il 20/09/2014 12:47, Tang Chen ha scritto:

@@ -3624,6 +3624,11 @@ static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
 	return false;
 }

+static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	return;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;
@@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
 	.has_secondary_apic_access = svm_has_secondary_apic_access,
+	.set_apic_access_page_addr = svm_set_apic_access_page_addr,
 	.vm_has_apicv = svm_vm_has_apicv,
 	.load_eoi_exitmap = svm_load_eoi_exitmap,
 	.hwapic_isr_update = svm_hwapic_isr_update,

Something's wrong in the way you're generating the patches, because you're adding these hunks twice.

Paolo
Re: [PATCH v7 5/9] kvm, mem-hotplug: Reload L1's apic access page in vcpu_enter_guest().
Il 22/09/2014 11:33, Paolo Bonzini ha scritto:

Something's wrong in the way you're generating the patches, because you're adding these hunks twice.

Nevermind, that was my mistake.

Paolo
Re: [PATCH v7 4/9] kvm: Add interface to check if secondary exec virtualized apic accesses is enabled.
Il 20/09/2014 12:47, Tang Chen ha scritto:

We want to migrate the apic access page pinned by the guest (L1 and L2) to make memory hotplug available. There are two situations that need to be handled for the apic access page used by an L2 vm:

1. L1 prepares a separate apic access page for L2. L2 pins a lot of pages in memory. Even if we can migrate the apic access page, memory hotplug is not available when L2 is running. So do not handle this now. Migrate the apic access page only.

2. L1 and L2 share one apic access page. Since we will migrate L1's apic access page, we should do some handling when migration happens in the following situations:

1) when L0 is running: Update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

2) when L1 is running: Force a L1->L0 exit, update L1's vmcs in the next L0->L1 entry and L2's vmcs in the next L1->L2 entry.

3) when L2 is running: Force a L2->L0 exit, update L2's vmcs in the next L0->L2 entry and L1's vmcs in the next L2->L1 exit.

Since we don't handle the situation where L1 and L2 have separate apic access pages, when we update the vmcs we need to check if we are in L2 and if L2's secondary exec "virtualized apic accesses" is enabled. This patch adds an interface to check if L2's secondary exec virtualized apic accesses is enabled, because vmx cannot be accessed outside vmx.c.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/svm.c              | 6 ++
 arch/x86/kvm/vmx.c              | 9 +
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35171c7..69fe032 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -739,6 +739,7 @@ struct kvm_x86_ops {
 	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+	bool (*has_secondary_apic_access)(struct kvm_vcpu *vcpu);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1d941ad..9c8ae32 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	return;
 }

+static bool svm_has_secondary_apic_access(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
+
 static int svm_vm_has_apicv(struct kvm *kvm)
 {
 	return 0;
@@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
+	.has_secondary_apic_access = svm_has_secondary_apic_access,
 	.vm_has_apicv = svm_vm_has_apicv,
 	.load_eoi_exitmap = svm_load_eoi_exitmap,
 	.hwapic_isr_update = svm_hwapic_isr_update,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72a0470..0b541d9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7090,6 +7090,14 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	vmx_set_msr_bitmap(vcpu);
 }

+static bool vmx_has_secondary_apic_access(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	return (vmx->nested.current_vmcs12->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
 {
 	u16 status;
@@ -8909,6 +8917,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+	.has_secondary_apic_access = vmx_has_secondary_apic_access,
 	.vm_has_apicv = vmx_vm_has_apicv,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 	.hwapic_irr_update = vmx_hwapic_irr_update,

I don't think you need two hooks. You can just do the gfn_to_page unconditionally in kvm_vcpu_reload_apic_access_page, like

	vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm,
				APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	if (vcpu->kvm->arch.apic_access_page)
		kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm,
			page_to_phys(vcpu->kvm->arch.apic_access_page));
	/* The last patch adds put_page here. */

and check is_guest_mode etc. in vmx.c. The extra gfn_to_page/put_page is rare enough that you can just do it unconditionally.

Another small change: you shouldn't run this code unless the APIC page is in use. So you can add

	if (!kvm_x86_ops->set_apic_access_page_addr)
Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On 09/22/2014 02:55 PM, Michael S. Tsirkin wrote:

On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

Ah, right.

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

I see, so the urgent descriptor needs to be disabled in this case. But the vhost part needs a few more changes: we cannot just check VRING_AVAIL_F_NO_INTERRUPT when vq->urgent is true. If event index is enabled, we still need to check the used event to make sure the current tx delayed interrupt works. But just re-using VRING_AVAIL_F_NO_INTERRUPT for the urgent descriptor may not work in some cases. I see code in virtqueue_get_buf() that may break this:

	/* If we expect an interrupt for the next entry, tell host
	 * by writing event index and flush out the write before
	 * the read in the next get_buf call. */
	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
		vring_used_event(&vq->vring) = vq->last_used_idx;
		virtio_mb(vq->weak_barriers);
	}

Consider if only the urgent descriptor is enabled: this will publish the used event, which in fact enables lots of unnecessary interrupts.

In fact I don't quite understand how the above lines are used. Virtio-net stops the queue before enabling the tx interrupt in start_xmit(), so the above lines will not run at all.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

Ok.
Re: [PATCH] x86:kvm: fix two typos in comment
Il 22/09/2014 04:31, Tiejun Chen ha scritto:

s/drity/dirty and s/vmsc01/vmcs01

Signed-off-by: Tiejun Chen tiejun.c...@intel.com
---
 arch/x86/kvm/mmu.c | 2 +-
 arch/x86/kvm/vmx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 76398fe..f76bc19 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1174,7 +1174,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * Write-protect on the specified @sptep, @pt_protect indicates whether
  * spte write-protection is caused by protecting shadow page table.
  *
- * Note: write protection is difference between drity logging and spte
+ * Note: write protection is difference between dirty logging and spte
  * protection:
  *   - for dirty logging, the spte can be set to writable at anytime if
  *     its dirty bitmap is properly set.
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6ffd643..305e667 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8008,7 +8008,7 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu)
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
- * with L0's requirements for its guest (a.k.a. vmsc01), so we can run the L2
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
 * guest in a way that will both be appropriate to L1's requests, and our
  * needs. In addition to modifying the active vmcs (which is vmcs02), this
  * function also has additional necessary side-effects, like setting various

Thanks, applying.

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 20/09/2014 01:03, David Matlack ha scritto:

vcpu ioctls can hang the calling thread if issued while a vcpu is running. If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. This patch does not change functionality, it just makes invalid ioctls fail faster.

Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 4
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ec622..f9234e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@

 #include <asm/processor.h>
 #include <asm/io.h>
+#include <asm/ioctl.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>

@@ -1975,6 +1976,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;

+	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+		return -EINVAL;
+
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,

Thanks, applying this patch.

Paolo
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Il 11/09/2014 19:03, Chris Webb ha scritto:

Paolo Bonzini pbonz...@redhat.com wrote:

This is a hypercall that should have kicked VCPU 3 (see rcx). Can you please apply this patch and gather a trace of the host (using trace-cmd -e kvm qemu-kvm arguments)?

Sure, no problem. I've built the trace-cmd tool against udis86 (I hope) and have put the resulting trace.dat at http://cdw.me.uk/tmp/trace.dat

This is actually for a -smp 2 qemu (failing to kick VCPU 1?) as I was having trouble persuading the -smp 4 qemu to crash as reliably under tracing. (Something timing related?) Otherwise the qemu-system-x86 command line is exactly as before.

Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Paolo

The guest kernel crash message which corresponds to this trace was:

divide error: [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 618 Comm: mkdir Not tainted 3.16.2-guest #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
task: 88007c997080 ti: 88007c614000 task.ti: 88007c614000
RIP: 0010:[81037fe2]  [81037fe2] kvm_unlock_kick+0x72/0x80
RSP: 0018:88007c617d40  EFLAGS: 00010046
RAX: 0005 RBX:  RCX: 0001
RDX: 0001 RSI: 88007fd11c40 RDI:
RBP: 88007fd11c40 R08: 81b98940 R09: 0001
R10:  R11: 0007 R12: 00f6
R13: 0001 R14: 0001 R15: 00011c40
FS: 7f43eb1ed700() GS:88007fc0() knlGS:
CS: 0010 DS:  ES:  CR0: 8005003b
CR2: 7f43eace0a30 CR3: 01a12000 CR4: 000406f0
Stack:
 88007c994380 88007c9949aa 0046 81689715
 810f3174 0001 ea0001f16320 ea0001f17860
 88007c99e1e8 88007c997080  0001
Call Trace:
 [81689715] ? _raw_spin_unlock+0x45/0x70
 [810f3174] ? try_to_wake_up+0x2a4/0x330
 [81101e2c] ? __wake_up_common+0x4c/0x80
 [81102418] ? __wake_up_sync_key+0x38/0x60
 [810d873a] ? do_notify_parent+0x19a/0x280
 [810f4d56] ? sched_move_task+0xb6/0x190
 [810cb4fc] ? do_exit+0xa1c/0xab0
 [810cc344] ? do_group_exit+0x34/0xb0
 [810cc3cb] ? SyS_exit_group+0xb/0x10
 [8168a16d] ? system_call_fastpath+0x1a/0x1f
Code: c0 ca a7 81 48 8d 04 0b 48 8b 30 48 39 ee 75 c9 0f b6 40 08 44 38 e0 75 c0 48 c7 c0 22 b0 00 00 31 db 0f b7 0c 08 b8 05 00 00 00 0f 01 c1 0f 1f 00 5b 5d 41 5c c3 0f 1f 00 48 c7 c0 10 cf 00 00
RIP [81037fe2] kvm_unlock_kick+0x72/0x80
 RSP 88007c617d40
---[ end trace bf5a4445f9decdbb ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: mkdir/618/0x0006
Modules linked in:
CPU: 0 PID: 618 Comm: mkdir Tainted: G D 3.16.2-guest #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 c022d302 81684029 810ee956 81686266
 00011c40 88007c617fd8 00011c40 88007c997080
 0006 0046
Call Trace:
 [81684029] ? dump_stack+0x49/0x6a
 [810ee956] ? __schedule_bug+0x46/0x60
 [81686266] ? __schedule+0x5a6/0x7c0
 [816828cd] ? printk+0x59/0x75
 [810cb33b] ? do_exit+0x85b/0xab0
 [816828cd] ? printk+0x59/0x75
 [8100614a] ? oops_end+0x7a/0x100
 [810033e5] ? do_error_trap+0x85/0x110
 [81037fe2] ? kvm_unlock_kick+0x72/0x80
 [8114a358] ? __alloc_pages_nodemask+0x108/0xa60
 [8168b57e] ? divide_error+0x1e/0x30
 [81037fe2] ? kvm_unlock_kick+0x72/0x80
 [81689715] ? _raw_spin_unlock+0x45/0x70
 [810f3174] ? try_to_wake_up+0x2a4/0x330
 [81101e2c] ? __wake_up_common+0x4c/0x80
 [81102418] ? __wake_up_sync_key+0x38/0x60
 [810d873a] ? do_notify_parent+0x19a/0x280
 [810f4d56] ? sched_move_task+0xb6/0x190
 [810cb4fc] ? do_exit+0xa1c/0xab0
 [810cc344] ? do_group_exit+0x34/0xb0
 [810cc3cb] ? SyS_exit_group+0xb/0x10
 [8168a16d] ? system_call_fastpath+0x1a/0x1f

Best wishes,

Chris.
[PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA. In that case, KVM will fail to patch VMCALL instructions to VMMCALL as required on AMD processors.

The failure mode is currently a divide-by-zero exception, which obviously is a KVM bug that has to be fixed. However, picking the right instruction between VMCALL and VMMCALL will be faster and will help if you cannot upgrade the hypervisor.

Reported-by: Chris Webb ch...@arachsys.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: x...@kernel.org
Cc: Borislav Petkov b...@suse.de
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/include/asm/kvm_para.h   | 10 --
 arch/x86/kernel/cpu/amd.c         |  7 +++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index bb9b258d60e7..2075e6c34c78 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -202,6 +202,7 @@
 #define X86_FEATURE_DECODEASSISTS ( 8*32+12) /* AMD Decode Assists support */
 #define X86_FEATURE_PAUSEFILTER ( 8*32+13) /* AMD filtered pause intercept */
 #define X86_FEATURE_PFTHRESHOLD ( 8*32+14) /* AMD pause filter threshold */
+#define X86_FEATURE_VMMCALL     ( 8*32+15) /* Prefer vmmcall to vmcall */

 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ebx), word 9 */
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c7678e43465b..e62cf897f781 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -2,6 +2,7 @@
 #define _ASM_X86_KVM_PARA_H

 #include <asm/processor.h>
+#include <asm/alternative.h>
 #include <uapi/asm/kvm_para.h>

 extern void kvmclock_init(void);
@@ -16,10 +17,15 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 }
 #endif /* CONFIG_KVM_GUEST */

-/* This instruction is vmcall.  On non-VT architectures, it will generate a
- * trap that we will then rewrite to the appropriate instruction.
+#ifdef CONFIG_DEBUG_RODATA
+#define KVM_HYPERCALL \
+        ALTERNATIVE(".byte 0x0f,0x01,0xc1", ".byte 0x0f,0x01,0xd9", X86_FEATURE_VMMCALL)
+#else
+/* On AMD processors, vmcall will generate a trap that we will
+ * then rewrite to the appropriate instruction.
  */
 #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1"
+#endif

 /* For KVM hypercalls, a three-byte sequence of either the vmcall or the vmmcall
  * instruction. The hypervisor may replace it with something else but only the
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 60e5497681f5..813d29d00a17 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -525,6 +525,13 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 	}
 #endif

+	/*
+	 * This is only needed to tell the kernel whether to use VMCALL
+	 * and VMMCALL.  VMMCALL is never executed except under virt, so
+	 * we can set it unconditionally.
+	 */
+	set_cpu_cap(c, X86_FEATURE_VMMCALL);
+
 	/* F16h erratum 793, CVE-2013-6885 */
 	if (c->x86 == 0x16 && c->x86_model <= 0xf)
 		msr_set_bit(MSR_AMD64_LS_CFG, 15);
--
1.8.3.1
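For context on why a boot-time choice works here: KVM_HYPERCALL only ever appears inside inline asm, via the kvm_hypercall*() helpers in the same header, so the alternatives machinery can pick the instruction bytes once at boot instead of KVM patching them at fault time. The zero-argument helper from arch/x86/include/asm/kvm_para.h of this era looks like this:

static inline long kvm_hypercall0(unsigned int nr)
{
	long ret;

	/* KVM_HYPERCALL expands to vmcall, or - with the patch above -
	 * to whichever of vmcall/vmmcall matches the CPU at boot. */
	asm volatile(KVM_HYPERCALL
		     : "=a"(ret)
		     : "a"(nr)
		     : "memory");
	return ret;
}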
Re: [PATCH RFC 2/2] vhost: support urgent descriptors
On Mon, Sep 22, 2014 at 05:55:23PM +0800, Jason Wang wrote:

On 09/22/2014 02:55 PM, Michael S. Tsirkin wrote:

On Mon, Sep 22, 2014 at 11:30:23AM +0800, Jason Wang wrote:

On 09/20/2014 06:00 PM, Paolo Bonzini wrote:

Il 19/09/2014 09:10, Jason Wang ha scritto:

-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
+	if (vq->urgent || !vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {

So the urgent descriptor only works when event index was not enabled? This seems suboptimal: we may still want to benefit from event index even if urgent descriptors are used. Looks like we need to return true here when vq->urgent is true?

It's ||, not &&. Without event index, all descriptors are treated as urgent.

Paolo

The problem is that if vq->urgent is true, the patch checks the VRING_AVAIL_F_NO_INTERRUPT bit. This bit was set unconditionally in virtqueue_enable_cb() regardless of the event index feature and cleared unconditionally in virtqueue_disable_cb().

The reverse, actually, right?

Ah, right.

So virtqueue_enable_cb() was used to not only publish a new event index but also enable the urgent descriptor. And virtqueue_disable_cb() disabled all interrupts, including the urgent descriptor. The guest won't get urgent interrupts just by adding virtqueue_add_outbuf_urgent(), since what it needs is to enable and disable the interrupt for !urgent descriptors.

Right, we want a new API that advances the event index but does not set VRING_AVAIL_F_NO_INTERRUPT. IMO we still want to set VRING_AVAIL_F_NO_INTERRUPT when handling tx interrupts, to avoid interrupt storms.

I see, so the urgent descriptor needs to be disabled in this case. But the vhost part needs a few more changes: we cannot just check VRING_AVAIL_F_NO_INTERRUPT when vq->urgent is true. If event index is enabled, we still need to check the used event to make sure the current tx delayed interrupt works. But just re-using VRING_AVAIL_F_NO_INTERRUPT for the urgent descriptor may not work in some cases. I see code in virtqueue_get_buf() that may break this:

	/* If we expect an interrupt for the next entry, tell host
	 * by writing event index and flush out the write before
	 * the read in the next get_buf call. */
	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
		vring_used_event(&vq->vring) = vq->last_used_idx;
		virtio_mb(vq->weak_barriers);
	}

Right, we need to change this path to use a private flag.

Consider if only the urgent descriptor is enabled: this will publish the used event, which in fact enables lots of unnecessary interrupts.

In fact I don't quite understand how the above lines are used. Virtio-net stops the queue before enabling the tx interrupt in start_xmit(), so the above lines will not run at all.

True - this code is for drivers which process the VQ without disabling interrupts first. No rule says this is not allowed.

Btw, I'm not sure "urgent" is a suitable name, since interrupts are often slow in a kvm guest. And in fact virtio-net will probably use urgent descriptors for those packets (e.g. stream packets that can be delayed a little bit to batch more bytes from userspace) which are not urgent compared to other packets.

Yes, but we are asking for an interrupt before the event index is reached because something is waiting for the packet to be transmitted. I couldn't come up with a better name.

Ok.
[GIT PULL] Final KVM changes for 3.17
Linus,

The following changes since commit 02a68d0503fa470abff8852e10b1890df5730a08:

  powerpc/kvm/cma: Fix panic introduces by signed shift operation (2014-09-03 10:34:07 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 27fbe64bfa63cfb9da025975b59d96568caa2d53:

  KVM: correct null pid check in kvm_vcpu_yield_to() (2014-09-22 13:21:29 +0200)

Two very simple bugfixes, affecting all supported architectures.

Ard Biesheuvel (2):
      mm: export symbol dependencies of is_zero_pfn()
      KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn()

Sam Bobroff (1):
      KVM: correct null pid check in kvm_vcpu_yield_to()

 arch/mips/mm/init.c | 1 +
 arch/s390/mm/init.c | 1 +
 mm/memory.c         | 2 ++
 virt/kvm/kvm_main.c | 4 ++--
 4 files changed, 6 insertions(+), 2 deletions(-)
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
Hi, all

I start a VM with virtio-serial (default ports number: 31), and found that virtio-blk performance degradation happened, about 25%. This problem can be reproduced 100%.

without virtio-serial: 4k-read-random 1186 IOPS
with virtio-serial:    4k-read-random  871 IOPS

But if I use the max_ports=2 option to limit the max number of virtio-serial ports, then the IO performance degradation is not so serious, about 5%. And ide performance degradation does not happen with virtio-serial.

Pretty sure it's related to MSI vectors in use. It's possible that the virtio-serial device takes up all the available vectors in the guests, leaving old-style irqs for the virtio-blk device.

I don't think so. I use iometer to test the 64k-read(or write)-sequence case: if I disable the virtio-serial dynamically via device manager -> virtio-serial => disable, then the performance improves by about 25% immediately; then I re-enable the virtio-serial via device manager -> virtio-serial => enable, and the performance drops back again. Very obvious.

To add a comment: although the virtio-serial is enabled, I don't use it at all; the degradation still happens.

Using the vectors= option as mentioned below, you can restrict the number of MSI vectors the virtio-serial device gets. You can then confirm whether it's MSI that's related to these issues.

Amit,

It's related to the big number of ioeventfds used in virtio-serial-pci. With virtio-serial-pci's ioeventfd=off, the performance is not affected no matter if the guest initializes it or not.

In my test, there are 12 fds to poll in qemu_poll_ns before loading the guest virtio_console.ko, whereas 76 once I modprobe virtio_console. Looks like ppoll takes more time to poll more fds.

Some trace data with systemtap:

12 fds:
  time  rel_time  symbol
  15    (+1)      qemu_poll_ns  [enter]
  18    (+3)      qemu_poll_ns  [return]

76 fds:
  12    (+2)      qemu_poll_ns  [enter]
  18    (+6)      qemu_poll_ns  [return]

I haven't looked at the virtio-serial code; I'm not sure if we should reduce the number of ioeventfds in virtio-serial-pci or focus on lower-level efficiency.

Does ioeventfd=off hamper the performance of virtio-serial? In my opinion, virtio-serial's use scenario is not such a high throughput rate, so ioeventfd=off has only a slight impact on its performance.

Thanks,
Zhang Haoyu

Haven't compared with g_poll but I think the underlying syscall should be the same. Any ideas?

Fam

So, I think it has nothing to do with legacy interrupt mode, right?

I am going to observe the difference of perf top data on qemu and perf kvm stat data when disabling/enabling virtio-serial in the guest, and the difference of perf top data on the guest when disabling/enabling virtio-serial in the guest. Any ideas?

Thanks,
Zhang Haoyu

If you restrict the number of vectors the virtio-serial device gets (using the -device virtio-serial-pci,vectors= param), does that make things better for you?

Amit
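The fd-count effect is easy to reproduce outside QEMU. A standalone toy (my construction, not from the thread) that times poll() over a varying number of eventfds, roughly mimicking the qemu_poll_ns numbers quoted above:

#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <time.h>

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 76;   /* e.g. compare 12 vs 76 */
	struct pollfd *fds = calloc(n, sizeof(*fds));
	struct timespec t0, t1;
	int i, iters = 100000;
	long ns;

	for (i = 0; i < n; i++) {
		fds[i].fd = eventfd(0, 0);        /* idle fds, never signalled */
		fds[i].events = POLLIN;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		poll(fds, n, 0);                  /* timeout 0: measure scan cost only */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
	printf("%d fds: %ld ns per poll()\n", n, ns / iters);
	free(fds);
	return 0;
}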
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/19/2014 05:46 PM, H. Peter Anvin wrote:

On 09/19/2014 01:46 PM, Andy Lutomirski wrote:

However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND were executed while it is not supported) - so this scheme wouldn’t introduce additional overhead over RDMSR.

Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance.

Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source.

A guest kernel couldn't make it look like RDRAND is not present to guest userspace?

Christopher

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation.
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/19/2014 02:42 PM, Andy Lutomirski wrote:

On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington c...@codeaurora.org wrote:

On 09/17/2014 10:50 PM, Andy Lutomirski wrote:

Hi all-

I would like to standardize on a very simple protocol by which a guest OS can obtain an RNG seed early in boot.

The main design requirements are:

- The interface should be very easy to use. Linux, at least, will want to use it extremely early in boot as part of kernel ASLR. This means that PCI and ACPI will not work.

How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand? Could hypervisors and simulators simply make sure these work?

If RDRAND is available, then Linux, at least, will use it. The rest are too complicated for early use. Linux on x86 plays some vaguely clever games with rdtsc and poking at the i8254 port.

I just wanted to check that it couldn't be as simple as giving one or both of the timers random initial values.

Christopher

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation.
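The "vaguely clever games" are of roughly this flavor. A hedged, userspace-style sketch (my illustration, not the kernel's actual early-boot code, and not cryptographically strong): fold a handful of TSC samples together, relying on inter-sample jitter for the entropy.

#include <stdint.h>

/* Read the time-stamp counter; on x86 this is available long before
 * any device or firmware services come up. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Mix several samples with a multiply-and-rotate step; the absolute
 * count is predictable, the low-order jitter between samples less so. */
static uint64_t weak_early_seed(void)
{
	uint64_t seed = 0;
	int i;

	for (i = 0; i < 16; i++) {
		seed = seed * 0x9E3779B97F4A7C15ULL + rdtsc();
		seed = (seed << 13) | (seed >> 51);
	}
	return seed;
}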
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22/2014 12:50 PM, Paolo Bonzini wrote:

Il 20/09/2014 01:03, David Matlack ha scritto:

vcpu ioctls can hang the calling thread if issued while a vcpu is running. If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. This patch does not change functionality, it just makes invalid ioctls fail faster.

Signed-off-by: David Matlack dmatl...@google.com
---
 virt/kvm/kvm_main.c | 4
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96ec622..f9234e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@

 #include <asm/processor.h>
 #include <asm/io.h>
+#include <asm/ioctl.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>

@@ -1975,6 +1976,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;

+	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+		return -EINVAL;
+
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,

Thanks, applying this patch.

Isn't that the wrong trade-off? We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

Christian
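For readers unfamiliar with the check being debated: every KVM ioctl number carries the KVMIO magic in its type byte, and _IOC_TYPE() is a shift-and-mask, so the early -EINVAL costs only a couple of instructions. The values below are from include/uapi/linux/kvm.h and asm-generic/ioctl.h:

#include <linux/ioctl.h>

#define KVMIO 0xAE

/* KVM_RUN is defined as _IO(KVMIO, 0x80), so its type byte is 0xAE. */
#define KVM_RUN _IO(KVMIO, 0x80)

/* _IOC_TYPE(nr) expands to ((nr >> _IOC_TYPESHIFT) & _IOC_TYPEMASK),
 * i.e. bits 8..15 of the ioctl number. Any request whose type byte is
 * not 0xAE can be rejected before taking the vcpu mutex. */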
Re: [PATCH] blk-mq: Avoid race condition with uninitialized requests
On 09/18/2014 11:04 AM, David Hildenbrand wrote:

This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request.

Also move the reset of cmd_flags from the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check().

Cc: sta...@vger.kernel.org
Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com

Acked-by: Christian Borntraeger borntrae...@de.ibm.com

Can you please add

Reported-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com
Tested-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com

as Paulo did the testing work? We think this patch is fine and should go upstream.

Christian

---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 383ea0c..eed6340 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -203,7 +203,6 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)

 	if (tag != BLK_MQ_TAG_FAIL) {
 		rq = data->hctx->tags->rqs[tag];

-		rq->cmd_flags = 0;
 		if (blk_mq_tag_busy(data->hctx)) {
 			rq->cmd_flags |= REQ_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
@@ -258,6 +257,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,

 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+	rq->cmd_flags = 0;

 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
@@ -1404,6 +1404,8 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		left -= to_do * rq_size;
 		for (j = 0; j < to_do; j++) {
 			tags->rqs[i] = p;
+			tags->rqs[i]->atomic_flags = 0;
+			tags->rqs[i]->cmd_flags = 0;
 			if (set->ops->init_request) {
 				if (set->ops->init_request(set->driver_data,
 						tags->rqs[i], hctx_idx, i,
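As I read the commit message, the window being closed looks like this (a reconstructed timeline, not text from the patch):

/*
 * allocation path                     timeout path (blk_mq_timeout_check)
 * ---------------                     ------------------------------------
 * tag = blk_mq_get_tag(...);
 *                                     rq = tags->rqs[tag];
 *                                     // atomic_flags/cmd_flags still hold
 *                                     // whatever the previous user left:
 *                                     // REQ_ATOM_STARTED may look set and
 *                                     // flush flags may point at stale data
 * rq = data->hctx->tags->rqs[tag];
 * rq->cmd_flags = 0;                  // old code cleared only here,
 *                                     // after the window above
 *
 * Zeroing both fields when the tag map is built, and clearing cmd_flags
 * at free time instead, leaves nothing stale for the timeout checker to
 * dereference.
 */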
Re: [PATCH] blk-mq: Avoid race condition with uninitialized requests
On 2014-09-22 08:15, Christian Borntraeger wrote:

On 09/18/2014 11:04 AM, David Hildenbrand wrote:

This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request.

Also move the reset of cmd_flags from the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check().

Cc: sta...@vger.kernel.org
Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com

Acked-by: Christian Borntraeger borntrae...@de.ibm.com

Can you please add

Reported-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com
Tested-by: Paulo De Rezende Pinatti ppina...@linux.vnet.ibm.com

as Paulo did the testing work? We think this patch is fine and should go upstream.

I might have to pick'n rebase the series, in which case I'll add it. But I already queued it up last week, so if I don't, then I can't easily add it. I wish the git notes weren't such a horrible and unusable hack, so we had a chance to annotate commits without having to rewrite history...

--
Jens Axboe
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/22/2014 06:31 AM, Christopher Covington wrote: On 09/19/2014 05:46 PM, H. Peter Anvin wrote: On 09/19/2014 01:46 PM, Andy Lutomirski wrote: However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and the emulate the instruction when guest executes it. KVM already traps guest #UD (which would occur if RDRAND executed while it is not supported) - so this scheme wouldn’t introduce additional overhead over RDMSR. Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance. Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source. A guest kernel couldn't make it look like RDRAND is not present to guest userspace? It could, but how would you enumerate that? A new RDRAND-CPL-0 CPUID bit pretty much would be required. -hpa -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
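[Editorial note: for readers following the enumeration argument - userspace decides whether to use RDRAND by checking CPUID leaf 1, ECX bit 30, which is why the guest kernel cannot easily hide the instruction once the hypervisor advertises it. A minimal user-space probe, as a sketch (not from the thread):]

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1: RDRAND support is advertised in ECX bit 30 */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("RDRAND %savailable\n", (ecx & (1u << 30)) ? "" : "not ");
	return 0;
}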
Re: Standardizing an MSR or other hypercall to get an RNG seed?
On 09/22/2014 07:17 AM, H. Peter Anvin wrote:
It could, but how would you enumerate that? A new RDRAND-CPL-0 CPUID bit pretty much would be required.

Note that there are two things that differ: the CPL 0-ness and the performance/exhaustibility attributes.

-hpa
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial, depending on what the other VCPU is doing.

Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
Re: [PATCH v6] arm64: fix VTTBR_BADDR_MASK
-#define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
-#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)

Actually, after some more thinking, why don't we just make the upper limit of this mask 48-bit always, or even 64-bit? That's a physical mask for checking whether the pgd pointer in vttbr is aligned as per the architecture requirements. Given that the pointer is allocated from the platform memory, it's clear that it is within the PA range. So basically you just need a mask to check the bottom alignment based on VTCR_EL2.T0SZ (which should be independent from the PA range). I guess it should be enough as:

This sounds fine to me. I would say that there is no harm in re-checking the upper bits, but I agree it is unnecessary.

#define VTTBR_BADDR_MASK (~0ULL << VTTBR_BADDR_SHIFT)

without any other changes to T0SZ. The TCR_EL2.PS setting should be done based on ID_AA64MMFR0_EL1, but you can do this in __do_hyp_init (it looks like this function handles VTCR_EL2.PS already; not sure why it does it for TCR_EL2 as well). So IMO you only need about a few lines of patch.

My original patch to fix this problem was one line, so I'm perfectly happy with simplification. But it would be nice if the other reviewers could agree with this approach. With six versions that each addressed all the comments from reviewers, I'd like it if a v7 that throws away most of that feedback didn't result in a v8 that puts it all back again.
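[Editorial note: for context, the mask is consumed when the stage-2 pgd is installed in VTTBR - the pgd's physical address must have no bits set below VTTBR_BADDR_SHIFT. A sketch of the check, modeled on the arm/arm64 KVM stage-2 allocation path; the exact call site and the vttbr composition are assumptions here:]

	/* sketch: verify stage-2 pgd alignment before installing it in VTTBR */
	pgd_phys = virt_to_phys(kvm->arch.pgd);
	if (WARN_ON(pgd_phys & ~VTTBR_BADDR_MASK))
		return -EINVAL;	/* pgd not aligned as the architecture requires */
	kvm->arch.vttbr = pgd_phys | ((u64)kvm->arch.vmid << VTTBR_VMID_SHIFT);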
Re: [PATCH v6] arm64: fix VTTBR_BADDR_MASK
On Mon, Sep 22, 2014 at 04:56:58PM +0100, Joel Schopp wrote:
The TCR_EL2.PS setting should be done based on ID_AA64MMFR0_EL1, but you can do this in __do_hyp_init (it looks like this function handles VTCR_EL2.PS already; not sure why it does it for TCR_EL2 as well). So IMO you only need about a few lines of patch.

My original patch to fix this problem was one line, so I'm perfectly happy with simplification. But it would be nice if the other reviewers could agree with this approach. With six versions that each addressed all the comments from reviewers, I'd like it if a v7 that throws away most of that feedback didn't result in a v8 that puts it all back again.

I'm having some discussion with Christoffer around this. He will come up with some patches on top of yours, but I don't think the problem is that simple. Basically the IPA size is restricted by the PARange but also affected by the number of page table levels used by the host (both having implications on VTCR_EL2.SL0).

-- Catalin
Re: [patch 4/4] KVM: MMU: pinned sps are not candidates for deletion.
On Tue, Sep 09, 2014 at 12:41:27PM -0300, Marcelo Tosatti wrote:
On Tue, Jul 22, 2014 at 05:59:42AM +0800, Xiao Guangrong wrote:
On Jul 10, 2014, at 3:12 AM, mtosa...@redhat.com wrote:

Skip pinned shadow pages when selecting pages to zap.

It seems there is no way to prevent changing a pinned spte on the zap-all path?

Xiao,

The way would be to reload remote mmus, forcing the vcpu to exit, zapping a page; then the vcpu will pagefault any necessary page via kvm_mmu_pin_pages. kvm_mmu_invalidate_zap_all_pages does:

- spin_lock(mmu_lock)
- kvm_reload_remote_mmus
...
- spin_unlock(mmu_lock)

So it's OK to change a pinned spte on the zap-all path.

I am thinking whether we could move pinned sptes to another list (e.g. pinned_shadow_pages) instead of the active list, so that they cannot be touched by any other free paths. What's your idea?

As mentioned above, it is OK to zap pinned sptes as long as a reload-remote-mmus request is performed. That said, do you still consider a separate list necessary?

Xiao, ping?
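[Editorial note: the ordering Marcelo describes, as a sketch - an assumption-laden simplification of kvm_mmu_invalidate_zap_all_pages, not the actual function:]

/* Force every vcpu out of guest mode before pinned sptes go away;
 * the fault path re-pins what is needed on re-entry. */
static void zap_all_including_pinned(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm_reload_remote_mmus(kvm);	/* queue KVM_REQ_MMU_RELOAD, kick vcpus */
	/* ... zap shadow pages, pinned ones included ... */
	spin_unlock(&kvm->mmu_lock);
	/* on re-entry each vcpu reloads its MMU and refaults pinned pages
	 * via kvm_mmu_pin_pages() */
}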
Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path
On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote:
On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote:
On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote:

Reload remote vcpus' MMU from the GET_DIRTY_LOG codepath, before deleting a pinned spte.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
 arch/x86/kvm/mmu.c | 29 +++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 6 deletions(-)

Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
===================================================================
--- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-07-09 11:23:59.290744490 -0300
+++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-07-09 11:24:58.449632435 -0300
@@ -1208,7 +1208,8 @@
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect,
+			       bool skip_pinned)
 {
 	u64 spte = *sptep;
 
@@ -1218,6 +1219,22 @@
 
 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
+	if (is_pinned_spte(spte)) {
+		/* keep pinned spte intact, mark page dirty again */
+		if (skip_pinned) {
+			struct kvm_mmu_page *sp;
+			gfn_t gfn;
+
+			sp = page_header(__pa(sptep));
+			gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+
+			mark_page_dirty(kvm, gfn);
+			return false;

Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() while populating dirty_bitmap_buffer?

The pinned gfns are per-vcpu. Marking them there would require looping over all vcpus (not scalable).

+		} else
+			mmu_reload_pinned_vcpus(kvm);

Can you explain why you need this?

Because if skip_pinned = false, we want vcpus to exit (this is called from the enable-dirty-logging codepath).

Gleb, any further opinions on this? Can you ack the patch? TIA
[PATCH] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 +--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 +--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 50 ++++++++++++++++++++++-------------
 drivers/iommu/amd_iommu_v2.c        |  6 +++--
 include/linux/mmu_notifier.h        | 24 ++++++++++-------
 include/trace/events/kvm.h          | 10 +++----
 mm/mmu_notifier.c                   |  5 ++--
 mm/rmap.c                           |  6 +++--
 virt/kvm/kvm_main.c                 |  5 ++--
 16 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
+			  unsigned long end);
 extern int kvm_test_age_hva_hv(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git
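[Editorial note: the generic consumer side of the widened interface, as a sketch reconstructed from the prototypes above (the posting is truncated before this hunk, so treat details as an assumption):]

static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int young, idx;

	idx = srcu_read_lock(&kvm->srcu);
	spin_lock(&kvm->mmu_lock);

	/* age every spte backing the hva range, not just a single page */
	young = kvm_age_hva(kvm, start, end);
	if (young)
		kvm_flush_remote_tlbs(kvm);

	spin_unlock(&kvm->mmu_lock);
	srcu_read_unlock(&kvm->srcu, idx);

	return young;
}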
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial, depending on what the other VCPU is doing. Perhaps something like this would be better?

I'm happy with either approach.

(Untested, but Tested-by/Reviewed-bys are welcome).

There were a few build bugs in your diff. Here's a working version that I tested. Feel free to add my Tested-by and Reviewed-by if you go with this.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c71931f..fbdcdc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -133,12 +133,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -151,6 +149,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -2197,10 +2203,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex)) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
+	__vcpu_load(vcpu);
 
-	r = vcpu_load(vcpu);
-	if (r)
-		return r;
 	switch (ioctl) {
 	case KVM_RUN:
 		r = -EINVAL;
Re: KVM: nested VMX: disable perf cpuid reporting
On Thu, Sep 18, 2014 at 06:24:57PM -0300, Marcelo Tosatti wrote:
Initialization of an L2 guest with -cpu host, on an L1 guest with -cpu host, triggers:

(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported

Nested VMX MSR load/store support is not sufficient to allow perf for the L2 guest. Until properly fixed, trap CPUID and disable function 0xA.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Ping, Paolo?
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Paolo Bonzini pbonz...@redhat.com wrote:
Il 11/09/2014 19:03, Chris Webb ha scritto:
Paolo Bonzini pbonz...@redhat.com wrote:
This is a hypercall that should have kicked VCPU 3 (see rcx). Can you please apply this patch and gather a trace of the host (using trace-cmd -e kvm qemu-kvm arguments)?

Sure, no problem. I've built the trace-cmd tool against udis86 (I hope) and have put the resulting trace.dat at http://cdw.me.uk/tmp/trace.dat

This is actually for a -smp 2 qemu (failing to kick VCPU 1?) as I was having trouble persuading the -smp 4 qemu to crash as reliably under tracing. (Something timing related?) Otherwise the qemu-system-x86 command line is exactly as before.

Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Absolutely right: my host and guest kernels do have CONFIG_DEBUG_RODATA set! Your patch to use alternatives for VMCALL vs VMMCALL definitely fixed the divide-by-zero crashes I saw.

Given that I can easily use either (or both) of these solutions, would it be more efficient to turn off CONFIG_DEBUG_RODATA in the guest kernel so KVM can fix up the instructions in place, or is using alternatives for VMCALL/VMMCALL as implemented by your patch just as good?

Best wishes,

Chris.
Re: [BUG] Guest kernel divide error in kvm_unlock_kick
Il 22/09/2014 21:08, Chris Webb ha scritto:
Do you by chance have CONFIG_DEBUG_RODATA set? In that case, the fix is simply not to set it.

Absolutely right: my host and guest kernels do have CONFIG_DEBUG_RODATA set! Your patch to use alternatives for VMCALL vs VMMCALL definitely fixed the divide-by-zero crashes I saw. Given that I can easily use either (or both) of these solutions, would it be more efficient to turn off CONFIG_DEBUG_RODATA in the guest kernel so KVM can fix up the instructions in place, or is using alternatives for VMCALL/VMMCALL as implemented by your patch just as good?

I posted a patch to use alternatives if CONFIG_DEBUG_RODATA is enabled, but the bug is in KVM, which explicitly documents that you can use either VMCALL or VMMCALL. I'll also see to fixing KVM, but the patch is still useful because a) KVM would not patch a read-only text segment, so there is a small performance benefit; b) you cannot control already-deployed hypervisors.

However, since there is a workaround, I won't push it into 3.17 so late in the cycle. Also, there's a chance that it is NACKed since it touches non-KVM files.

Paolo
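[Editorial note: a sketch of the alternatives-based hypercall Paolo's patch introduces. The feature-flag name follows the posted patch; treat the exact macro shape as illustrative:]

/* Patch VMCALL -> VMMCALL once at boot on AMD via the alternatives
 * framework, so the kernel text can stay read-only afterwards and the
 * hypervisor never has to rewrite it at run time. */
#define KVM_HYPERCALL \
	ALTERNATIVE("vmcall", "vmmcall", X86_FEATURE_VMMCALL)

static inline long kvm_hypercall0(unsigned int nr)
{
	long ret;
	asm volatile(KVM_HYPERCALL
		     : "=a"(ret)
		     : "a"(nr)
		     : "memory");
	return ret;
}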
Re: KVM: nested VMX: disable perf cpuid reporting
Il 22/09/2014 21:01, Marcelo Tosatti ha scritto:
On Thu, Sep 18, 2014 at 06:24:57PM -0300, Marcelo Tosatti wrote:
Initialization of an L2 guest with -cpu host, on an L1 guest with -cpu host, triggers:
(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported
Nested VMX MSR load/store support is not sufficient to allow perf for the L2 guest. Until properly fixed, trap CPUID and disable function 0xA.
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Ping, Paolo?

Sorry, I didn't push to kvm/queue but I've already applied the patch locally. I will do it first thing tomorrow (my workstation is off right now, so I cannot send you a SHA).

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22/2014 04:31 PM, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial,

I know, but only for seriously misbehaving userspace, no? See, my comment is really a minor one - let's say 2 more cycles for something that exited to userspace - nobody would even notice. I am just disturbed by the fact that we care about something that is not slow-path but broken beyond repair. (Why does userspace call a non-KVM ioctl on the fd of a vcpu from a different thread? Otherwise the mutex would be free.) Please, can we have an explanation, e.g. something like "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

depending on what the other VCPU is doing. Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Given that it makes sense to improve on misbehaving userspace, I prefer David's variant, as the patch is smaller and easier to get right. No need to revert, but a better explanation of the use case would be appreciated.

Christian

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 21:20, Christian Borntraeger ha scritto:
"while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

Okay. David, can you explain how you found it so that I can make up my mind? Gleb and Marcelo, a fourth and fifth opinion? :)

Paolo
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Christian Borntraeger wrote:
On 09/22/2014 04:31 PM, Paolo Bonzini wrote:
Il 22/09/2014 15:45, Christian Borntraeger ha scritto:
We now have an extra condition check for every valid ioctl, to make an error case go faster. I know, the extra check is just 1 or 2 cycles if branch prediction is right, but still.

I applied the patch because the delay could be substantial,

I know, but only for seriously misbehaving userspace, no? See, my comment is really a minor one - let's say 2 more cycles for something that exited to userspace - nobody would even notice. I am just disturbed by the fact that we care about something that is not slow-path but broken beyond repair. (Why does userspace call a non-KVM ioctl on the fd of a vcpu from a different thread? Otherwise the mutex would be free.) Please, can we have an explanation, e.g. something like "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. Let's bail out early on invalid ioctls." or similar?

We noticed this while using code that inspects the open file descriptors of a process. One of the things it did was check whether each file descriptor was a tty (isatty() calls ioctl(TCGETS)).

depending on what the other VCPU is doing. Perhaps something like this would be better? (Untested, but Tested-by/Reviewed-bys are welcome).

Given that it makes sense to improve on misbehaving userspace, I prefer David's variant, as the patch is smaller and easier to get right. No need to revert, but a better explanation of the use case would be appreciated.

Christian

Paolo

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 84e24b210273..ed31760d79fe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,12 +117,10 @@ bool kvm_is_mmio_pfn(pfn_t pfn)
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
-int vcpu_load(struct kvm_vcpu *vcpu)
+static void __vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
-	if (mutex_lock_killable(&vcpu->mutex))
-		return -EINTR;
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
 		struct pid *oldpid = vcpu->pid;
@@ -136,6 +134,14 @@ int vcpu_load(struct kvm_vcpu *vcpu)
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
+}
+
+int vcpu_load(struct kvm_vcpu *vcpu)
+{
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	__vcpu_load(vcpu);
 	return 0;
 }
 
@@ -1989,9 +1995,6 @@ static long kvm_vcpu_ioctl(struct file *filp,
 	if (vcpu->kvm->mm != current->mm)
 		return -EIO;
 
-	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
-		return -EINVAL;
-
 #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
 	/*
 	 * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
@@ -2001,8 +2004,21 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		return kvm_arch_vcpu_ioctl(filp, ioctl, arg);
 #endif
 
+	if (!mutex_trylock(&vcpu->mutex))) {
+		/*
+		 * Before a potentially long sleep, check if we'd exit anyway.
+		 * The common case is for the mutex not to be contended, when
+		 * this does not add overhead.
+		 */
+		if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
+			return -EINVAL;
+
+		if (mutex_lock_killable(&vcpu->mutex))
+			return -EINTR;
+	}
+
-	r = vcpu_load(vcpu);
+	r = __vcpu_load(vcpu);
 	if (r)
 		return r;
 	switch (ioctl) {
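[Editorial note: to make the failure mode concrete - a process inspector walking /proc/<pid>/fd and probing each descriptor would effectively do the following against a vcpu fd. The inspector itself is not part of the thread; only the isatty()/TCGETS behavior is standard Linux:]

#include <termios.h>
#include <sys/ioctl.h>

/* isatty(fd) is implemented via ioctl(fd, TCGETS, ...) on Linux.
 * On a KVM vcpu fd this is an invalid ioctl, but before the fix it
 * still had to take the vcpu mutex - blocking for as long as the
 * vcpu thread sits inside ioctl(vcpu_fd, KVM_RUN, 0). */
int probe_fd_is_tty(int fd)
{
	struct termios t;

	return ioctl(fd, TCGETS, &t) == 0;	/* what isatty() does */
}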
Re: [PATCH] x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
On Mon, Sep 22, 2014 at 01:17:48PM +0200, Paolo Bonzini wrote:
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.

Hmm, that depends on DEBUG_KERNEL. I think you're actually talking about distro kernels which enable CONFIG_DEBUG_RODATA, right?

-- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine.
[PATCH v2] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
				  unsigned long end);
-extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva_hv(struct kvm *kvm, unsigned long start,
+			  unsigned long end);
 extern int
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Fri, Sep 19, 2014 at 04:03:25PM -0700, David Matlack wrote:
vcpu ioctls can hang the calling thread if issued while a vcpu is running.

There is a mutex per vcpu, so that's expected, OK...

If we know the ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex.

Consider a valid ioctl that takes the vcpu mutex. If you need immediate access for that valid ioctl, it is necessary to interrupt the thread which executes the KVM_RUN ioctl. So knowledge of whether KVM_RUN is being executed is expected in userspace (either that, or ask the KVM_RUN thread to run the ioctl for you, as QEMU does). I can't see why having different behaviour for valid and invalid ioctls is a good thing.

This patch does not change functionality, it just makes invalid ioctls fail faster.

You should not be executing vcpu ioctls without interrupting KVM_RUN in the first place.
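[Editorial note: the interruption Marcelo refers to usually takes the form of a directed signal at the vcpu thread; a pending signal makes KVM_RUN return -EINTR, after which the vcpu mutex is free. A sketch of the pattern (the SIG_IPI name follows QEMU's convention and is illustrative):]

#include <pthread.h>
#include <signal.h>

/* Kick the vcpu thread out of ioctl(vcpu_fd, KVM_RUN, 0) so another
 * thread can take the vcpu mutex and issue its own vcpu ioctl. */
static void kick_vcpu(pthread_t vcpu_thread)
{
	pthread_kill(vcpu_thread, SIGUSR1);	/* QEMU calls this SIG_IPI */
}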
Re: [PATCH] kvm: Fix page ageing bugs
On 09/22/2014 03:57 PM, Andres Lagar-Cavilla wrote:
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

Acked-by: Rik van Riel r...@redhat.com

-- All rights reversed
[PATCH v3] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
v2 -> v3:
 * Added Acked-by
 * Fix compilation in PowerPC
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
Acked-by: Rik van Riel r...@redhat.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote:
This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform. I hope Alex can give me some guidance how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done.

kernel patches apply to linux-kvm-next
KVM: s390: Enable PCI instructions
iommu: add iommu for s390 platform
vfio: make vfio build on s390

qemu patches apply to qemu-master
s390: Add PCI bus support
s390: implement pci instruction
vfio: make vfio run on s390 platform

Thx for feedback and review comments

Sending patches as attachments makes it difficult to comment inline.

2/6
 - careful of the namespace as you're changing functions from static and exporting them
 - doesn't seem like the functions need to be exported, just non-static to call from s390-iommu.c

6/6
 - We shouldn't need to globally disable mmap; each VFIO region reports whether it supports mmap, and vfio-pci on s390 should indicate mmap is not supported on the platform.
 - INTx should be done the same way: the interrupt index for INTx should report 0 count. The current code likely doesn't handle this, but it should be easy to fix.
 - s390_msix_notify() vs msix_notify() should be abstracted somewhere else. How would an emulated PCI device with MSI-X support work?
 - same for add_msi_route
 - We can probably come up with a better way to determine which address space to connect to the memory listener.

Looks like a reasonable first pass, good re-use of vfio code. Thanks,

Alex
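[Editorial note: the per-region mmap capability Alex refers to is discoverable from userspace with the standard vfio uAPI; a sketch of the probe, nothing s390-specific:]

#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Ask the kernel whether region 'index' of this device may be mmap'd.
 * On s390, vfio-pci would simply not set VFIO_REGION_INFO_FLAG_MMAP,
 * and userspace falls back to read()/write() access on the region. */
int region_supports_mmap(int device_fd, int index)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return 0;
	return !!(info.flags & VFIO_REGION_INFO_FLAG_MMAP);
}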
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Thu, Sep 18, 2014 at 11:08 PM, Paolo Bonzini pbonz...@redhat.com wrote:
Il 19/09/2014 05:58, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?

No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough? Thanks much

Andres
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
Il 22/09/2014 22:08, Marcelo Tosatti ha scritto:
This patch does not change functionality, it just makes invalid ioctls fail faster. You should not be executing vcpu ioctls without interrupting KVM_RUN in the first place.

This is not entirely true; there are a couple of asynchronous ioctls, though they are not used on x86 (commit 2122ff5eab8f, "KVM: move vcpu locking to dispatcher for generic vcpu ioctls", 2010-05-13).

Paolo
[PATCH] KVM: EVENTFD: Only conditionally remove inclusion of irq.h
Commit c77dcac ("KVM: Move more code under CONFIG_HAVE_KVM_IRQFD") added functionality that depends on definitions in ioapic.h when __KVM_HAVE_IOAPIC is defined. At the same time, 0ba0951 ("KVM: EVENTFD: remove inclusion of irq.h") removed the inclusion of irq.h unconditionally, which happened to include ioapic.h. Instead, include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.

Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
---
 virt/kvm/eventfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0c712a7..b0fb390 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,9 @@
 #include <linux/seqlock.h>
 #include <trace/events/kvm.h>
 
+#ifdef __KVM_HAVE_IOAPIC
+#include "ioapic.h"
+#endif
 #include "iodev.h"
 
 #ifdef CONFIG_HAVE_KVM_IRQFD
-- 2.0.0
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
Il 22/09/2014 22:49, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?

No, I'll add them myself.

Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?

It's waiting for an Acked-by on the mm/ changes.

Paolo
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini pbonz...@redhat.com wrote:
Il 22/09/2014 22:49, Andres Lagar-Cavilla ha scritto:
Paolo, should I recut including the recent Reviewed-by's?
No, I'll add them myself.
Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?
It's waiting for an Acked-by on the mm/ changes.

The MM changes look good to me.
Re: [PATCH v3] kvm: Fix page ageing bugs
Il 22/09/2014 22:26, Andres Lagar-Cavilla ha scritto:
+	__entry->gfn	= gfn;
+	__entry->hva	= ((gfn - slot->base_gfn) >>

This must be <<.

+			PAGE_SHIFT) + slot->userspace_addr;

+	/*
+	 * No need for _notify because we're called within an
+	 * mmu_notifier_invalidate_range_ {start|end} scope.
+	 */

Why "called within"? It is try_to_unmap_cluster itself that calls mmu_notifier_invalidate_range_*, so "we're within an mmu_notifier_invalidate_range_start/end scope" sounds better, and it's also what you use in the commit message.

Paolo
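[Editorial note: the address computation under discussion, written out assuming the usual KVM memslot layout:]

	/* host virtual address backing a gfn within a memslot */
	hva = slot->userspace_addr + ((gfn - slot->base_gfn) << PAGE_SHIFT);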
[PATCH v4] kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

---
v1 -> v2:
 * Small cosmetic changes
 * Remove unnecessary handling of mmu_notifier_seq in mmu_lock protected region
 * Func prototypes fixed
v2 -> v3:
 * Added Acked-by
 * Fix compilation in PowerPC
v3 -> v4:
 * Fix page shift direction in trace routine
 * Clarified comment in rmap.c
---

Signed-off-by: Andres Lagar-Cavilla andre...@gooogle.com
Acked-by: Rik van Riel r...@redhat.com
---
 arch/arm/include/asm/kvm_host.h     |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s.c           |  4 ++--
 arch/powerpc/kvm/book3s.h           |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_pr.c        |  3 ++-
 arch/powerpc/kvm/e500_mmu_host.c    |  2 +-
 arch/x86/include/asm/kvm_host.h     |  2 +-
 arch/x86/kvm/mmu.c                  | 46 ++++++++++++++++++++++------------
 drivers/iommu/amd_iommu_v2.c        |  6 ++++--
 include/linux/mmu_notifier.h        | 24 ++++++++++++--------
 include/trace/events/kvm.h          | 10 +++++-----
 mm/mmu_notifier.c                   |  5 +++--
 mm/rmap.c                           |  6 +++++-
 virt/kvm/kvm_main.c                 |  5 +++--
 16 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 032a853..8c3f7eb 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -170,7 +170,8 @@ unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index be9970a..a3c671b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -180,7 +180,8 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
 /* We do not have shadow page tables, hence the empty hooks */
-static inline int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+static inline int kvm_age_hva(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
 {
 	return 0;
 }
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6040008..d329bc5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -56,7 +56,7 @@ extern int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range(struct kvm *kvm,
			       unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index dd03f6b..c16cfbf 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -851,9 +851,9 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
 	return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm->arch.kvm_ops->age_hva(kvm, hva);
+	return kvm->arch.kvm_ops->age_hva(kvm, start, end);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 4bf956c..d2b3ec0 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -17,7 +17,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
 extern int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva);
 extern int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
Re: [PATCH v3] kvm: Fix page ageing bugs
On Mon, Sep 22, 2014 at 2:48 PM, Paolo Bonzini pbonz...@redhat.com wrote:
Il 22/09/2014 22:26, Andres Lagar-Cavilla ha scritto:
+	__entry->gfn	= gfn;
+	__entry->hva	= ((gfn - slot->base_gfn) >>

This must be <<.

Correct, thanks.

+			PAGE_SHIFT) + slot->userspace_addr;

+	/*
+	 * No need for _notify because we're called within an
+	 * mmu_notifier_invalidate_range_ {start|end} scope.
+	 */

Why "called within"? It is try_to_unmap_cluster itself that calls mmu_notifier_invalidate_range_*, so "we're within an mmu_notifier_invalidate_range_start/end scope" sounds better, and it's also what you use in the commit message.

Also correct. V4...

Andres

Paolo

-- Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On 22.09.14 22:47, Alex Williamson wrote: On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote: This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform I hope Alex & Alex can give me some guidance on how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done. kernel patches apply to linux-kvm-next KVM: s390: Enable PCI instructions iommu: add iommu for s390 platform vfio: make vfio build on s390 qemu patches apply to qemu-master s390: Add PCI bus support s390: implement pci instruction vfio: make vfio run on s390 platform Thx for feedback and review comments Sending patches as attachments makes it difficult to comment inline. 2/6 - careful of the namespace as you're changing functions from static and exporting them - doesn't seem like functions need to be exported, just non-static to call from s390-iommu.c 6/6 - We shouldn't need to globally disable mmap, each VFIO region reports whether it supports mmap and vfio-pci on s390 should indicate mmap is not supported on the platform. Can we emulate MMIO on mmap'ed regions by routing every memory access via the kernel? It'd be slow, but at least make existing VFIO code compatible. - INTx should be done the same way, the interrupt index for INTx should report 0 count. The current code likely doesn't handle this, but it should be easy to fix. - s390_msix_notify() vs msix_notify() should be abstracted somewhere else. How would an emulated PCI device with MSI-X support work? - same for add_msi_route Yes, please implement emulated PCI device support first, then do VFIO. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC patch 0/6] vfio based pci pass-through for qemu/KVM on s390
On Tue, 2014-09-23 at 00:08 +0200, Alexander Graf wrote: On 22.09.14 22:47, Alex Williamson wrote: On Fri, 2014-09-19 at 13:54 +0200, frank.blasc...@de.ibm.com wrote: This set of patches implements a vfio based solution for pci pass-through on the s390 platform. The kernel stuff is pretty much straightforward, but qemu needs more work. Most interesting patch is: vfio: make vfio run on s390 platform I hope Alex & Alex can give me some guidance on how to do the changes in an appropriate way. After creating a separate iommu address space for each attached PCI device I can successfully run the vfio type1 iommu. So if we could extend type1 to not register all guest memory (see patch) I think we do not need a special vfio iommu for s390 for the moment. The patches implement the base pass-through support. s390 specific virtualization functions are currently not included. This would be a second step after the base support is done. kernel patches apply to linux-kvm-next KVM: s390: Enable PCI instructions iommu: add iommu for s390 platform vfio: make vfio build on s390 qemu patches apply to qemu-master s390: Add PCI bus support s390: implement pci instruction vfio: make vfio run on s390 platform Thx for feedback and review comments Sending patches as attachments makes it difficult to comment inline. 2/6 - careful of the namespace as you're changing functions from static and exporting them - doesn't seem like functions need to be exported, just non-static to call from s390-iommu.c 6/6 - We shouldn't need to globally disable mmap, each VFIO region reports whether it supports mmap and vfio-pci on s390 should indicate mmap is not supported on the platform. Can we emulate MMIO on mmap'ed regions by routing every memory access via the kernel? It'd be slow, but at least make existing VFIO code compatible. Isn't that effectively what we do when we use memory_region_init_io() vs memory_region_init_ram_ptr(), or are you suggesting something that can handle the MMIO without bouncing out to QEMU? VFIO is already compatible with regions that cannot be mmap'd, the kernel just needs to report it as such. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
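The per-region mmap reporting Alex refers to is visible to userspace through VFIO_DEVICE_GET_REGION_INFO. A sketch of how a client checks it (device_fd is assumed to be an already-opened VFIO device file descriptor):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int region_supports_mmap(int device_fd, unsigned int index)
    {
        struct vfio_region_info info = {
            .argsz = sizeof(info),
            .index = index,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
            return 0;
        /* On s390 the kernel would simply leave this flag clear. */
        return !!(info.flags & VFIO_REGION_INFO_FLAG_MMAP);
    }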
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On 09/22, Marcelo Tosatti wrote: On Fri, Sep 19, 2014 at 04:03:25PM -0700, David Matlack wrote: vcpu ioctls can hang the calling thread if issued while a vcpu is running. There is a mutex per-vcpu, so that's expected, OK... If we know an ioctl is going to be rejected as invalid anyway, we can fail before trying to take the vcpu mutex. Consider a valid ioctl that takes the vcpu mutex. If you need immediate access for that valid ioctl, it is necessary to interrupt the thread executing the KVM_RUN ioctl. So knowledge of whether KVM_RUN is being executed is expected in userspace (either that, or ask the KVM_RUN thread to run the ioctl for you, as qemu does). Can't see why having different behaviour for valid/invalid ioctls is a good thing. This patch does not change functionality, it just makes invalid ioctls fail faster. Should not be executing vcpu ioctls without interrupting KVM_RUN in the first place. This patch is trying to be nice to code that isn't aware it's probing kvm file descriptors. We saw long hangs with some generic process inspection code that was probing all open file descriptors. There's no reason non-kvm ioctls should have to wait for the vcpu mutex to become available just to fail. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
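The check being discussed is cheap because validity can be decided from the ioctl number alone. A sketch of the early test, mirroring the _IOC_TYPE comparison the patch proposes (shown standalone for illustration):

    #include <linux/kvm.h>		/* KVMIO, _IOC_TYPE() */

    static int obviously_invalid_vcpu_ioctl(unsigned int ioctl)
    {
        /* An ioctl whose type byte is not KVMIO can never be a valid
         * KVM ioctl, so it can fail before vcpu->mutex is taken and
         * before blocking behind a vcpu sitting in KVM_RUN. */
        return _IOC_TYPE(ioctl) != KVMIO;
    }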
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 11:29:16PM +0200, Paolo Bonzini wrote: Il 22/09/2014 22:08, Marcelo Tosatti ha scritto: This patch does not change functionality, it just makes invalid ioctls fail faster. Should not be executing vcpu ioctls without interrupt KVM_RUN in the first place. This is not entirely true, there are a couple of asynchronous ioctls though they are not used on x86 (commit 2122ff5eab8f, KVM: move vcpu locking to dispatcher for generic vcpu ioctls, 2010-05-13). Paolo Alright. s/Should not be executing vcpu ioctls/Should not be executing vcpu ioctls which take vcpu mutex/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Standardizing an MSR or other hypercall to get an RNG seed?
Not really, no. Sent from my tablet, pardon any formatting problems. On Sep 22, 2014, at 06:31, Christopher Covington c...@codeaurora.org wrote: On 09/19/2014 05:46 PM, H. Peter Anvin wrote: On 09/19/2014 01:46 PM, Andy Lutomirski wrote: However, it sounds to me that at least for KVM, it is very easy just to emulate the RDRAND instruction. The hypervisor would report to the guest that RDRAND is supported in CPUID and then emulate the instruction when the guest executes it. KVM already traps guest #UD (which would occur if RDRAND were executed while it is not supported) - so this scheme wouldn't introduce additional overhead over RDMSR. Because then guest user code will think that rdrand is there and will try to use it, resulting in abysmal performance. Yes, the presence of RDRAND implies a cheap and inexhaustible entropy source. A guest kernel couldn't make it look like RDRAND is not present to guest userspace? Christopher -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
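The reason a guest kernel cannot cleanly hide RDRAND from its own userspace: applications detect the instruction themselves via CPUID, with no kernel mediation. A sketch of that standard detection (CPUID leaf 1, ECX bit 30):

    #include <cpuid.h>

    static int cpu_has_rdrand(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return !!(ecx & (1u << 30));	/* CPUID.01H:ECX.RDRAND */
    }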
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 03:58:16PM -0700, David Matlack wrote: Should not be executing vcpu ioctls without interrupt KVM_RUN in the first place. This patch is trying to be nice to code that isn't aware it's probing kvm file descriptors. We saw long hangs with some generic process inspection code that was probing all open file descriptors. There's no reason non-kvm ioctls should have to wait for the vcpu mutex to become available just to fail. OK then, please add the usecase to the changelog. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 0/6] arm: dirty page logging support for ARMv7
This patch adds support for ARMv7 dirty page logging. Some functions of dirty page logging have been split into generic and arch specific implementations, details below. Dirty page logging is one of several features required for live migration; live migration has been tested for ARMv7. Testing: - QEMU machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling. - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers, did a basic bringup and memory test. - x86_64 QEMU basic migration on same platform. See https://github.com/mjsmar/arm-migration-howto for details of testing and setup (see README). The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. Patch series has been compiled for affected architectures: - x86_64 - defconfig - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons. - mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig - s390 - s390x-linux-gcc4.6.3 - defconfig - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig ARMv7 dirty page logging support overview: - initially write protects VM RAM memory regions - 2nd stage page tables - adds support to read the dirty page log and write protect the dirty pages again - in the second stage page table - for the next pass. - second stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. - In the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time. - Future work: o Completed ARMv8 additions - need to validate on Juno hardware o ARMv8 validation test - with no migration support yet, implement replication of a memory DB using dirty page logging in an HA environment. Changes since v10: - addressed wanghaibin comments - addressed Christoffer's comments Changes since v9: - Split patches into generic and architecture specific variants for TLB flushing and dirty log read (patches 1,2 3,4,5,6) - rebased to 3.16.0-rc1 - Applied Christoffer's comments. Mario Smarduch (6): KVM: Add architecture-specific TLB flush implementations KVM: Add generic implementation of kvm_vm_ioctl_get_dirty_log arm: KVM: Add ARMv7 API to flush TLBs arm: KVM: Add initial dirty page locking infrastructure arm: KVM: dirty log read write protect support arm: KVM: ARMv7 dirty page logging 2nd stage page fault arch/arm/include/asm/kvm_asm.h|1 + arch/arm/include/asm/kvm_host.h | 14 +++ arch/arm/include/asm/kvm_mmu.h| 20 arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/Kconfig |1 + arch/arm/kvm/arm.c| 13 ++- arch/arm/kvm/interrupts.S | 12 ++ arch/arm/kvm/mmu.c| 196 - arch/arm64/kvm/Kconfig|1 + arch/ia64/kvm/Kconfig |1 + arch/ia64/kvm/kvm-ia64.c |2 +- arch/mips/kvm/Kconfig |1 + arch/mips/kvm/mips.c |2 +- arch/powerpc/kvm/Kconfig |1 + arch/powerpc/kvm/book3s.c |2 +- arch/powerpc/kvm/booke.c |2 +- arch/s390/kvm/Kconfig |1 + arch/s390/kvm/kvm-s390.c |2 +- arch/x86/kvm/x86.c| 86 --- include/linux/kvm_host.h |7 ++ virt/kvm/Kconfig |6 + virt/kvm/kvm_main.c | 95 22 files changed, 369 insertions(+), 98 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
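For context, the consumer of all this is the standard dirty-log ioctl. A sketch of how a migration loop reads it (vm_fd/slot/bitmap handling is illustrative; the ioctl and structure are the stock KVM API):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int fetch_dirty_log(int vm_fd, __u32 slot, void *bitmap)
    {
        struct kvm_dirty_log log;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;	/* one bit per page in the slot */

        /* Returns the pages dirtied since the previous call; the kernel
         * side write protects them again for the next iteration. */
        return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
    }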
[PATCH v11 1/6] KVM: Add architecture-specific TLB flush implementations
Add support to declare an architecture specific TLB flush function; for now, ARMv7. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- include/linux/kvm_host.h |1 + virt/kvm/Kconfig |3 +++ virt/kvm/kvm_main.c |4 3 files changed, 8 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ec4e3bd..a49a6df 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -592,6 +592,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); void kvm_flush_remote_tlbs(struct kvm *kvm); +void kvm_arch_flush_remote_tlbs(struct kvm *kvm); void kvm_reload_remote_mmus(struct kvm *kvm); void kvm_make_mclock_inprogress_request(struct kvm *kvm); void kvm_make_scan_ioapic_request(struct kvm *kvm); diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 13f2d19..f1efaa5 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -34,3 +34,6 @@ config HAVE_KVM_CPU_RELAX_INTERCEPT config KVM_VFIO bool + +config HAVE_KVM_ARCH_TLB_FLUSH_ALL + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4b6c01b..d0a24f5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -186,12 +186,16 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) void kvm_flush_remote_tlbs(struct kvm *kvm) { +#ifdef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL + kvm_arch_flush_remote_tlbs(kvm); +#else long dirty_count = kvm->tlbs_dirty; smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.remote_tlb_flush; cmpxchg(&kvm->tlbs_dirty, dirty_count, 0); +#endif } EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 2/6] KVM: Add generic implementation of kvm_vm_ioctl_get_dirty_log
Add support for generic implementation of dirty log read function. For now x86_64 and ARMv7 share generic dirty log read. Other architectures call their architecture specific functions. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/Kconfig |1 + arch/arm/kvm/arm.c|2 +- arch/arm64/kvm/Kconfig|1 + arch/ia64/kvm/Kconfig |1 + arch/ia64/kvm/kvm-ia64.c |2 +- arch/mips/kvm/Kconfig |1 + arch/mips/kvm/mips.c |2 +- arch/powerpc/kvm/Kconfig |1 + arch/powerpc/kvm/book3s.c |2 +- arch/powerpc/kvm/booke.c |2 +- arch/s390/kvm/Kconfig |1 + arch/s390/kvm/kvm-s390.c |2 +- arch/x86/kvm/x86.c| 86 -- include/linux/kvm_host.h |6 +++ virt/kvm/Kconfig |3 ++ virt/kvm/kvm_main.c | 91 + 16 files changed, 112 insertions(+), 92 deletions(-) diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index 4be5bb1..cd9bb1c 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -23,6 +23,7 @@ config KVM select HAVE_KVM_CPU_RELAX_INTERCEPT select KVM_MMIO select KVM_ARM_HOST + select HAVE_KVM_ARCH_DIRTY_LOG depends on ARM_VIRT_EXT ARM_LPAE !CPU_BIG_ENDIAN ---help--- Support hosting virtualized guest machines. You will also diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 3c82b37..c52b2bd 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -774,7 +774,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return -EINVAL; } diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig index 8ba85e9..40a8d19 100644 --- a/arch/arm64/kvm/Kconfig +++ b/arch/arm64/kvm/Kconfig @@ -26,6 +26,7 @@ config KVM select KVM_ARM_HOST select KVM_ARM_VGIC select KVM_ARM_TIMER + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support hosting virtualized guest machines. diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig index 990b864..217f10a 100644 --- a/arch/ia64/kvm/Kconfig +++ b/arch/ia64/kvm/Kconfig @@ -28,6 +28,7 @@ config KVM select HAVE_KVM_IRQ_ROUTING select KVM_APIC_ARCHITECTURE select KVM_MMIO + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 6a4309b..3166df5 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1812,7 +1812,7 @@ static void kvm_ia64_sync_dirty_log(struct kvm *kvm, spin_unlock(kvm-arch.dirty_log_lock); } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { int r; diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig index 30e334e..b57f49e 100644 --- a/arch/mips/kvm/Kconfig +++ b/arch/mips/kvm/Kconfig @@ -20,6 +20,7 @@ config KVM select PREEMPT_NOTIFIERS select ANON_INODES select KVM_MMIO + select HAVE_KVM_ARCH_DIRTY_LOG ---help--- Support for hosting Guest kernels. Currently supported on MIPS32 processors. diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index d687c6e..885fdfe 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -791,7 +791,7 @@ out: } /* Get (and clear) the dirty memory log for a memory slot. 
*/ -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { struct kvm_memory_slot *memslot; unsigned long ga, ga_end; diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index d6a53b9..4f28a82 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM select PREEMPT_NOTIFIERS select ANON_INODES select HAVE_KVM_EVENTFD + select HAVE_KVM_ARCH_DIRTY_LOG config KVM_BOOK3S_HANDLER bool diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index c254c27..304faa1 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -815,7 +815,7 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu) return vcpu-kvm-arch.kvm_ops-check_requests(vcpu); } -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) +int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return kvm-arch.kvm_ops-get_dirty_log(kvm, log); } diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index ab62109..50dd33d 100644 ---
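The shape of the generic function this patch introduces in virt/kvm/kvm_main.c, heavily simplified from the x86-derived code (error paths trimmed; treat as a sketch of the flow, not a verbatim hunk):

    int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
    {
        struct kvm_memory_slot *memslot;
        unsigned long *dirty, *buffer, mask, i, n;
        bool flush = false;

        memslot = id_to_memslot(kvm->memslots, log->slot);
        dirty = memslot->dirty_bitmap;
        n = kvm_dirty_bitmap_bytes(memslot);
        buffer = dirty + n / sizeof(long);	/* scratch copy for userspace */
        memset(buffer, 0, n);

        spin_lock(&kvm->mmu_lock);
        for (i = 0; i < n / sizeof(long); i++) {
            if (!dirty[i])
                continue;
            flush = true;
            buffer[i] = mask = xchg(&dirty[i], 0);
            /* arch hook: write protect exactly these pages again */
            kvm_mmu_write_protect_pt_masked(kvm, memslot,
                                            i * BITS_PER_LONG, mask);
        }
        spin_unlock(&kvm->mmu_lock);

        if (flush)
            kvm_flush_remote_tlbs(kvm);
        return copy_to_user(log->dirty_bitmap, buffer, n) ? -EFAULT : 0;
    }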
[PATCH v11 3/6] arm: KVM: Add ARMv7 API to flush TLBs
This patch adds ARMv7 architecture TLB Flush function. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/include/asm/kvm_asm.h |1 + arch/arm/include/asm/kvm_host.h | 12 arch/arm/kvm/Kconfig|1 + arch/arm/kvm/interrupts.S | 12 4 files changed, 26 insertions(+) diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h index 53b3c4a..21bc519 100644 --- a/arch/arm/include/asm/kvm_asm.h +++ b/arch/arm/include/asm/kvm_asm.h @@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[]; extern void __kvm_flush_vm_context(void); extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa); +extern void __kvm_tlb_flush_vmid(struct kvm *kvm); extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu); #endif diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 193ceaf..311486f 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -220,6 +220,18 @@ static inline void __cpu_init_hyp_mode(phys_addr_t boot_pgd_ptr, kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr); } +/** + * kvm_arch_flush_remote_tlbs() - flush all VM TLB entries + * @kvm: pointer to kvm structure. + * + * Interface to HYP function to flush all VM TLB entries without address + * parameter. + */ +static inline void kvm_arch_flush_remote_tlbs(struct kvm *kvm) +{ + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm); +} + static inline int kvm_arch_dev_ioctl_check_extension(long ext) { return 0; diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index cd9bb1c..eba8b00 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -24,6 +24,7 @@ config KVM select KVM_MMIO select KVM_ARM_HOST select HAVE_KVM_ARCH_DIRTY_LOG + select HAVE_KVM_ARCH_TLB_FLUSH_ALL depends on ARM_VIRT_EXT ARM_LPAE !CPU_BIG_ENDIAN ---help--- Support hosting virtualized guest machines. You will also diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S index 0d68d40..1258d46 100644 --- a/arch/arm/kvm/interrupts.S +++ b/arch/arm/kvm/interrupts.S @@ -66,6 +66,18 @@ ENTRY(__kvm_tlb_flush_vmid_ipa) bx lr ENDPROC(__kvm_tlb_flush_vmid_ipa) +/** + * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs + * + * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address + * parameter + */ + +ENTRY(__kvm_tlb_flush_vmid) + b __kvm_tlb_flush_vmid_ipa +ENDPROC(__kvm_tlb_flush_vmid) + + / * Flush TLBs and instruction caches of all CPUs inside the inner-shareable * domain, for all VMIDs -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 4/6] arm: KVM: Add initial dirty page locking infrastructure
Patch adds support for initial write protection of VM memlsot. This patch series assumes that huge PUDs will not be used in 2nd stage tables, which is awlays valid on ARMv7. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/include/asm/kvm_host.h |2 + arch/arm/include/asm/kvm_mmu.h| 20 + arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/arm.c|9 +++ arch/arm/kvm/mmu.c| 129 + 5 files changed, 161 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 311486f..12311a5 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -243,4 +243,6 @@ int kvm_perf_teardown(void); u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); +void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 5cc0b0f..08ab5e8 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -114,6 +114,26 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd) pmd_val(*pmd) |= L_PMD_S2_RDWR; } +static inline void kvm_set_s2pte_readonly(pte_t *pte) +{ + pte_val(*pte) = (pte_val(*pte) ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY; +} + +static inline bool kvm_s2pte_readonly(pte_t *pte) +{ + return (pte_val(*pte) L_PTE_S2_RDWR) == L_PTE_S2_RDONLY; +} + +static inline void kvm_set_s2pmd_readonly(pmd_t *pmd) +{ + pmd_val(*pmd) = (pmd_val(*pmd) ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY; +} + +static inline bool kvm_s2pmd_readonly(pmd_t *pmd) +{ + return (pmd_val(*pmd) L_PMD_S2_RDWR) == L_PMD_S2_RDONLY; +} + /* Open coded p*d_addr_end that can deal with 64bit addresses */ #define kvm_pgd_addr_end(addr, end)\ ({ u64 __boundary = ((addr) + PGDIR_SIZE) PGDIR_MASK;\ diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 85c60ad..8a3266e 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -129,6 +129,7 @@ #define L_PTE_S2_RDONLY(_AT(pteval_t, 1) 6) /* HAP[1] */ #define L_PTE_S2_RDWR (_AT(pteval_t, 3) 6) /* HAP[2:1] */ +#define L_PMD_S2_RDONLY(_AT(pmdval_t, 1) 6) /* HAP[1] */ #define L_PMD_S2_RDWR (_AT(pmdval_t, 3) 6) /* HAP[2:1] */ /* diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index c52b2bd..e1be6c7 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -242,6 +242,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, const struct kvm_memory_slot *old, enum kvm_mr_change change) { +#ifdef CONFIG_ARM + /* +* At this point memslot has been committed and there is an +* allocated dirty_bitmap[], dirty pages will be be tracked while the +* memory slot is write protected. 
+*/ + if ((change != KVM_MR_DELETE) (mem-flags KVM_MEM_LOG_DIRTY_PAGES)) + kvm_mmu_wp_memory_region(kvm, mem-slot); +#endif } void kvm_arch_flush_shadow_all(struct kvm *kvm) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 2336061..ba00899 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -45,6 +45,7 @@ static phys_addr_t hyp_idmap_vector; #define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) +#define kvm_pud_huge(_x) pud_huge(_x) static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) { @@ -746,6 +747,134 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap) return false; } +#ifdef CONFIG_ARM +/** + * stage2_wp_ptes - write protect PMD range + * @pmd: pointer to pmd entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end) +{ + pte_t *pte; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_none(*pte)) { + if (!kvm_s2pte_readonly(pte)) + kvm_set_s2pte_readonly(pte); + } + } while (pte++, addr += PAGE_SIZE, addr != end); +} + +/** + * stage2_wp_pmds - write protect PUD range + * @pud: pointer to pud entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end) +{ + pmd_t *pmd; + phys_addr_t next; + + pmd = pmd_offset(pud, addr); + + do { + next = kvm_pmd_addr_end(addr, end); + if (!pmd_none(*pmd)) { + if (kvm_pmd_huge(*pmd)) { +
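The hunk above is cut off mid-walk; for readability, here is a sketch of the top-level range walk those helpers feed into, following the same structure as the stage2_wp_ptes()/stage2_wp_pmds() shown (simplified; the real patch also drops mmu_lock periodically for large ranges):

    static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
    {
        pgd_t *pgd;
        phys_addr_t next;

        pgd = kvm->arch.pgd + pgd_index(addr);
        do {
            next = kvm_pgd_addr_end(addr, end);
            if (pgd_present(*pgd))
                /* descends pud -> pmd -> pte, marking stage-2 RDONLY */
                stage2_wp_puds(pgd, addr, next);
        } while (pgd++, addr = next, addr != end);
    }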
[PATCH v11 5/6] arm: KVM: dirty log read write protect support
This patch adds support to track VM dirty pages between dirty log reads. Pages that have been dirtied since the last log read are write protected again, in preparation for the next dirty log read. In addition, the ARMv7 dirty log read function is pushed up to the generic layer. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/Kconfig |1 - arch/arm/kvm/arm.c |2 ++ arch/arm/kvm/mmu.c | 22 ++ 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig index eba8b00..dddbb3d 100644 --- a/arch/arm/kvm/Kconfig +++ b/arch/arm/kvm/Kconfig @@ -23,7 +23,6 @@ config KVM select HAVE_KVM_CPU_RELAX_INTERCEPT select KVM_MMIO select KVM_ARM_HOST - select HAVE_KVM_ARCH_DIRTY_LOG select HAVE_KVM_ARCH_TLB_FLUSH_ALL depends on ARM_VIRT_EXT && ARM_LPAE && !CPU_BIG_ENDIAN ---help--- diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index e1be6c7..0546fa3 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -783,10 +783,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } } +#ifdef CONFIG_ARM64 int kvm_arch_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return -EINVAL; } +#endif static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, struct kvm_arm_device_addr *dev_addr) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index ba00899..5f52c8a 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -873,6 +873,28 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot) spin_unlock(&kvm->mmu_lock); kvm_flush_remote_tlbs(kvm); } + +/** + * kvm_mmu_write_protect_pt_masked() - write protect dirty pages set in mask + * @kvm: The KVM pointer + * @slot: The memory slot associated with mask + * @gfn_offset: The gfn offset in memory slot + * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory + * slot to be write protected + * + * Walks bits set in mask and write protects the associated pte's. Caller must + * acquire kvm_mmu_lock. + */ +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) +{ + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; + phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; + + stage2_wp_range(kvm, start, end); +} #endif static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
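A worked example of the mask arithmetic in kvm_mmu_write_protect_pt_masked(), with purely illustrative numbers:

    /* slot->base_gfn = 0x1000, gfn_offset = 0x40, mask = 0b11010
     * base_gfn = 0x1040
     * __ffs(mask) = 1 (lowest set bit), __fls(mask) = 4 (highest set bit)
     * start = (0x1040 + 1) << PAGE_SHIFT      -> IPA of gfn 0x1041
     * end   = (0x1040 + 4 + 1) << PAGE_SHIFT  -> IPA just past gfn 0x1044
     * One stage2_wp_range() call covers gfns 0x1041..0x1044; clean pages
     * inside the span are re-write-protected too, which is harmless. */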
[PATCH v11 6/6] arm: KVM: ARMv7 dirty page logging 2nd stage page fault
This patch adds support for handling 2nd stage page faults during migration, it disables faulting in huge pages, and dissolves huge pages to page tables. In case migration is canceled huge pages may be used again. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/arm/kvm/mmu.c | 45 +++-- 1 file changed, 39 insertions(+), 6 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 5f52c8a..df1a5a3 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -47,6 +47,15 @@ static phys_addr_t hyp_idmap_vector; #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) #define kvm_pud_huge(_x) pud_huge(_x) +static bool kvm_get_logging_state(struct kvm_memory_slot *memslot) +{ +#ifdef CONFIG_ARM + return !!memslot-dirty_bitmap; +#else + return false; +#endif +} + static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) { /* @@ -626,7 +635,8 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache } static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, - phys_addr_t addr, const pte_t *new_pte, bool iomap) + phys_addr_t addr, const pte_t *new_pte, bool iomap, + bool logging_active) { pmd_t *pmd; pte_t *pte, old_pte; @@ -641,6 +651,18 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, return 0; } + /* +* While dirty memory logging, clear PMD entry for huge page and split +* into smaller pages, to track dirty memory at page granularity. +*/ + if (logging_active kvm_pmd_huge(*pmd)) { + phys_addr_t ipa = pmd_pfn(*pmd) PAGE_SHIFT; + + pmd_clear(pmd); + kvm_tlb_flush_vmid_ipa(kvm, ipa); + put_page(virt_to_page(pmd)); + } + /* Create stage-2 page mappings - Level 2 */ if (pmd_none(*pmd)) { if (!cache) @@ -693,7 +715,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, if (ret) goto out; spin_lock(kvm-mmu_lock); - ret = stage2_set_pte(kvm, cache, addr, pte, true); + ret = stage2_set_pte(kvm, cache, addr, pte, true, false); spin_unlock(kvm-mmu_lock); if (ret) goto out; @@ -910,6 +932,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache; struct vm_area_struct *vma; pfn_t pfn; + bool logging_active = kvm_get_logging_state(memslot); write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); if (fault_status == FSC_PERM !write_fault) { @@ -920,7 +943,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, /* Let's check if we will get back a huge page backed by hugetlbfs */ down_read(current-mm-mmap_sem); vma = find_vma_intersection(current-mm, hva, hva + 1); - if (is_vm_hugetlb_page(vma)) { + if (is_vm_hugetlb_page(vma) !logging_active) { hugetlb = true; gfn = (fault_ipa PMD_MASK) PAGE_SHIFT; } else { @@ -963,7 +986,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, spin_lock(kvm-mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; - if (!hugetlb !force_pte) + if (!hugetlb !force_pte !logging_active) hugetlb = transparent_hugepage_adjust(pfn, fault_ipa); if (hugetlb) { @@ -982,9 +1005,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, kvm_set_pfn_dirty(pfn); } coherent_cache_guest_page(vcpu, hva, PAGE_SIZE); - ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false); + ret = stage2_set_pte(kvm, memcache, fault_ipa, new_pte, false, + logging_active); } + if (write_fault) + mark_page_dirty(kvm, gfn); out_unlock: spin_unlock(kvm-mmu_lock); @@ -1135,7 +1161,14 @@ static void 
kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data) { pte_t *pte = (pte_t *)data; - stage2_set_pte(kvm, NULL, gpa, pte, false); + /* +* We can always call stage2_set_pte with logging_active == false, +* because MMU notifiers will have unmapped a huge PMD before calling +* -change_pte() (which in turn calls kvm_set_spte_hva()) and therefore +* stage2_set_pte() never needs to clear out a huge PMD through this +* calling path. +*/ + stage2_set_pte(kvm, NULL, gpa, pte, false, false); } -- 1.7.9.5 -- To unsubscribe from this
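What drives logging_active from userspace is the memslot flag: setting KVM_MEM_LOG_DIRTY_PAGES allocates memslot->dirty_bitmap, clearing it frees the bitmap. A sketch of the toggle a migration manager performs (an illustrative wrapper around the stock ioctl):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int set_dirty_logging(int vm_fd,
                                 struct kvm_userspace_memory_region *r,
                                 int enable)
    {
        if (enable)
            r->flags |= KVM_MEM_LOG_DIRTY_PAGES;
        else
            r->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
    }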
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
On Mon, 09/22 21:23, Zhang Haoyu wrote: Amit, It's related to the big number of ioeventfds used in virtio-serial-pci. With virtio-serial-pci's ioeventfd=off, the performance is not affected no matter if guest initializes it or not. In my test, there are 12 fds to poll in qemu_poll_ns before loading guest virtio_console.ko, whereas 76 once modprobe virtio_console. Looks like the ppoll takes more time to poll more fds. Some trace data with systemtap: 12 fds: time rel_time symbol 15(+1) qemu_poll_ns [enter] 18(+3) qemu_poll_ns [return] 76 fd: 12(+2) qemu_poll_ns [enter] 18(+6) qemu_poll_ns [return] I haven't looked at virtio-serial code, I'm not sure if we should reduce the number of ioeventfds in virtio-serial-pci or focus on lower level efficiency. Does ioeventfd=off hamper the performance of virtio-serial? In theory it has an impact, but I have no data about this. If you have a performance demand, it's best to try it against your use case to answer this question. Fam -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
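If the extra ioeventfds are the suspect, the comparison Fam suggests can be made directly with the device property mentioned in the thread; a sketch of the relevant QEMU option (the rest of the command line is elided):

    qemu-system-x86_64 ... -device virtio-serial-pci,ioeventfd=off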
[PATCH] x86, kvm: use macros to compute bank MSRs
Avoid open coded calculations for bank MSRs by using well-defined macros that hide the index of higher bank MSRs. No semantic changes. Signed-off-by: Chen Yucong sla...@gmail.com --- arch/x86/kvm/x86.c |8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 31e55ae..e8c1e3b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1825,7 +1825,7 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 data) break; default: if (msr >= MSR_IA32_MC0_CTL && - msr < MSR_IA32_MC0_CTL + 4 * bank_num) { + msr < MSR_IA32_MCx_CTL(bank_num)) { u32 offset = msr - MSR_IA32_MC0_CTL; /* only 0 or all 1s can be written to IA32_MCi_CTL * some Linux kernels though clear bit 10 in bank 4 to @@ -2184,7 +2184,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: - case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1: + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: return set_msr_mce(vcpu, msr, data); /* Performance counters are not protected by a CPUID bit, @@ -2350,7 +2350,7 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) break; default: if (msr >= MSR_IA32_MC0_CTL && - msr < MSR_IA32_MC0_CTL + 4 * bank_num) { + msr < MSR_IA32_MCx_CTL(bank_num)) { u32 offset = msr - MSR_IA32_MC0_CTL; data = vcpu->arch.mce_banks[offset]; break; @@ -2531,7 +2531,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) case MSR_IA32_MCG_CAP: case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: - case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1: + case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1: return get_msr_mce(vcpu, msr, pdata); case MSR_K7_CLK_CTL: /* -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
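For reference, the macro hiding the arithmetic: each MCE bank owns four consecutive MSRs (CTL, STATUS, ADDR, MISC), so bank x's CTL register sits at a fixed stride from bank 0. The definition, as in msr-index.h:

    #define MSR_IA32_MCx_CTL(x)	(MSR_IA32_MC0_CTL + 4*(x))

    /* so the old open-coded bound and the new one are identical:
     *   MSR_IA32_MC0_CTL + 4 * bank_num == MSR_IA32_MCx_CTL(bank_num) */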
[PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables PAPR defined feature called Dynamic DMA windows (DDW). Each Partitionable Endpoint (IOMMU group) has a separate DMA window on a PCI bus where devices are allows to perform DMA. By default there is 1 or 2GB window allocated at the host boot time and these windows are used when an IOMMU group is passed to the userspace (guest). These windows are mapped at zero offset on a PCI bus. Hi-speed devices may suffer from limited size of this window. On the host side a TCE bypass mode is enabled on POWER8 CPU which implements direct mapping of the host memory to a PCI bus at 159. For the guest, PAPR defines a DDW RTAS API which allows the pseries guest to query the hypervisor if it supports DDW and what are the parameters of possible windows. Currently POWER8 supports 2 DMA windows per PE - already mentioned and used small 32bit window and 64bit window which can only start from 159 and can support various page sizes. This patchset reworks PPC IOMMU code and adds necessary structures to extend it to support big windows. When the guest detectes the feature and the PE is capable of 64bit DMA, it does: 1. query to hypervisor about number of available windows and page masks; 2. creates a window with the biggest possible page size (current guests can do 64K or 16MB TCEs); 3. maps the entire guest RAM via H_PUT_TCE* hypercalls 4. switches dma_ops to direct_dma_ops on the selected PE. Once this is done, H_PUT_TCE is not called anymore and the guest gets maximum performance. Please comment. Thanks! Changes: v2: * added missing __pa() in powerpc/powernv: Release replaced TCE * reposted to make some noise :) Alexey Kardashevskiy (13): powerpc/iommu: Check that TCE page size is equal to it_page_size powerpc/powernv: Make invalidate() a callback powerpc/spapr: vfio: Implement spapr_tce_iommu_ops powerpc/powernv: Convert/move set_bypass() callback to take_ownership() powerpc/iommu: Fix IOMMU ownership control functions powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table powerpc/powernv: Do not set read flag if direction==DMA_NONE powerpc/powernv: Release replaced TCE powerpc/pseries/lpar: Enable VFIO powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA vfio: powerpc/spapr: Move locked_vm accounting to helpers vfio: powerpc/spapr: Use it_page_size vfio: powerpc/spapr: Enable Dynamic DMA windows arch/powerpc/include/asm/iommu.h| 35 ++- arch/powerpc/include/asm/machdep.h | 25 -- arch/powerpc/include/asm/tce.h | 37 +++ arch/powerpc/kernel/iommu.c | 213 +-- arch/powerpc/kernel/vio.c | 5 +- arch/powerpc/platforms/cell/iommu.c | 9 +- arch/powerpc/platforms/pasemi/iommu.c | 8 +- arch/powerpc/platforms/powernv/pci-ioda.c | 233 +++-- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 4 +- arch/powerpc/platforms/powernv/pci.c| 113 +--- arch/powerpc/platforms/powernv/pci.h| 15 +- arch/powerpc/platforms/pseries/iommu.c | 77 -- arch/powerpc/sysdev/dart_iommu.c| 13 +- drivers/vfio/vfio_iommu_spapr_tce.c | 384 +++- include/uapi/linux/vfio.h | 25 +- 15 files changed, 925 insertions(+), 271 deletions(-) -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. Since the set_bypass() is not really an iommu_table function but PE's function, and we have an ops struct per IOMMU owner, let's move set_bypass() to the spapr_tce_iommu_ops struct. As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and has very little to do with PEs, this moves take_ownership() calls to the VFIO SPAPR TCE driver. This renames set_bypass() to take_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it a generic name. The bool parameter is inverted. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- arch/powerpc/include/asm/iommu.h | 1 - arch/powerpc/include/asm/tce.h| 2 ++ arch/powerpc/kernel/iommu.c | 12 arch/powerpc/platforms/powernv/pci-ioda.c | 20 drivers/vfio/vfio_iommu_spapr_tce.c | 16 5 files changed, 30 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 84ee339..2b0b01d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -77,7 +77,6 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h index 9f159eb..e6355f9 100644 --- a/arch/powerpc/include/asm/tce.h +++ b/arch/powerpc/include/asm/tce.h @@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops { struct iommu_table *(*get_table)( struct spapr_tce_iommu_group *data, int num); + void (*take_ownership)(struct spapr_tce_iommu_group *data, + bool enable); }; struct spapr_tce_iommu_group { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 1c5dae7..c2c8d9d 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl-it_map, 0xff, sz); iommu_clear_tces_and_put_pages(tbl, tbl-it_offset, tbl-it_size); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl-set_bypass) - tbl-set_bypass(tbl, false); - return 0; } EXPORT_SYMBOL_GPL(iommu_take_ownership); @@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl) /* Restore bit#0 set by iommu_init_table() */ if (tbl-it_offset == 0) set_bit(0, tbl-it_map); - - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl-set_bypass) - tbl-set_bypass(tbl, true); } EXPORT_SYMBOL_GPL(iommu_release_ownership); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 2d32a1c..8cb2f31 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, - tce32.table); uint16_t window_id = (pe-pe_number 1 ) 
+ 1; int64_t rc; @@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. */ if (pe-pdev) - set_iommu_table_base(pe-pdev-dev, tbl); + set_iommu_table_base(pe-pdev-dev, pe-tce32.table); else pnv_ioda_setup_bus_dma(pe, pe-pbus, false); } @@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe-tce_bypass_base = 1ull 59; - /* Install set_bypass callback for VFIO */ - pe-tce32.table.set_bypass = pnv_pci_ioda2_set_bypass; - /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(pe-tce32.table, true); + pnv_pci_ioda2_set_bypass(pe, true); +} + +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data, +bool
[PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c9fac97..0dccbc4 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl) if (!current || !current-mm) return -ESRCH; /* process exited */ - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + npages = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; down_write(current-mm-mmap_sem); locked = current-mm-locked_vm + npages; @@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl) if (!current || !current-mm) return; /* process exited */ - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + npages = (tbl-it_size tbl-it_page_shift) PAGE_SHIFT; down_write(current-mm-mmap_sem); if (npages current-mm-locked_vm) @@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data, if (info.argsz minsz) return -EINVAL; - info.dma32_window_start = tbl-it_offset IOMMU_PAGE_SHIFT_4K; - info.dma32_window_size = tbl-it_size IOMMU_PAGE_SHIFT_4K; + info.dma32_window_start = tbl-it_offset tbl-it_page_shift; + info.dma32_window_size = tbl-it_size tbl-it_page_shift; info.flags = 0; if (copy_to_user((void __user *)arg, info, minsz)) @@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data, VFIO_DMA_MAP_FLAG_WRITE)) return -EINVAL; - if ((param.size ~IOMMU_PAGE_MASK_4K) || - (param.vaddr ~IOMMU_PAGE_MASK_4K)) - return -EINVAL; - /* iova is checked by the IOMMU API */ tce = param.vaddr; if (param.flags VFIO_DMA_MAP_FLAG_READ) @@ -334,21 +330,25 @@ static long tce_iommu_ioctl(void *iommu_data, return -ENXIO; BUG_ON(!tbl-it_group); + if ((param.size ~IOMMU_PAGE_MASK(tbl)) || + (param.vaddr ~IOMMU_PAGE_MASK(tbl))) + return -EINVAL; + ret = iommu_tce_put_param_check(tbl, param.iova, tce); if (ret) return ret; - for (i = 0; i (param.size IOMMU_PAGE_SHIFT_4K); ++i) { + for (i = 0; i (param.size tbl-it_page_shift); ++i) { ret = iommu_put_tce_user_mode(tbl, - (param.iova IOMMU_PAGE_SHIFT_4K) + i, + (param.iova tbl-it_page_shift) + i, tce); if (ret) break; - tce += IOMMU_PAGE_SIZE_4K; + tce += IOMMU_PAGE_SIZE(tbl); } if (ret) iommu_clear_tces_and_put_pages(tbl, - param.iova IOMMU_PAGE_SHIFT_4K, i); + param.iova tbl-it_page_shift, i); iommu_flush_tce(tbl); @@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data, if (param.flags) return -EINVAL; - if (param.size ~IOMMU_PAGE_MASK_4K) - return -EINVAL; - tbl = spapr_tce_find_table(container, data, param.iova); if (!tbl) return -ENXIO; + if (param.size ~IOMMU_PAGE_MASK(tbl)) + return -EINVAL; + BUG_ON(!tbl-it_group); ret = iommu_tce_clear_param_check(tbl, param.iova, 0, - param.size IOMMU_PAGE_SHIFT_4K); + param.size tbl-it_page_shift); if (ret) return ret; ret = iommu_clear_tces_and_put_pages(tbl, - param.iova IOMMU_PAGE_SHIFT_4K, - param.size IOMMU_PAGE_SHIFT_4K); + param.iova tbl-it_page_shift, + param.size tbl-it_page_shift); iommu_flush_tce(tbl); return ret; -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body
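The macros this patch switches to make the page size a per-table property. Their shape, reproduced here from memory of the asm/iommu.h of that era, so treat as approximate:

    #define IOMMU_PAGE_SHIFT(tbl)	((tbl)->it_page_shift)
    #define IOMMU_PAGE_SIZE(tbl)	(1UL << IOMMU_PAGE_SHIFT(tbl))
    #define IOMMU_PAGE_MASK(tbl)	(~((1UL << IOMMU_PAGE_SHIFT(tbl)) - 1))

    /* e.g. the locked-page accounting above, expressed in host pages:
     *   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; */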
[PATCH v2 08/13] powerpc/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE(s) so the caller can release the pages afterwards. This makes iommu_tce_build() put pages returned by exchange(). This replaces iommu_clear_tce() with iommu_tce_build which now can call exchange() with TCE==NULL (i.e. clear). This preserves permission bits in TCE in iommu_put_tce_user_mode(). This removes use of pool locks for external IOMMU uses. This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not implement exchange() callback. Therefore the powernv platform is the only supported one after this patch. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v2: * added missing __pa() for TCE which was read from the table --- arch/powerpc/include/asm/iommu.h | 8 +++-- arch/powerpc/kernel/iommu.c | 62 arch/powerpc/platforms/powernv/pci.c | 40 +++ 3 files changed, 67 insertions(+), 43 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c725e4a..8e0537d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -49,6 +49,12 @@ struct iommu_table_ops { unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); + int (*exchange)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + unsigned long *old_tces, + enum dma_data_direction direction, + struct dma_attrs *attrs); void (*clear)(struct iommu_table *tbl, long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); @@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int iommu_put_tce_user_mode(struct iommu_table *tbl, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 678fee8..39ccce7 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock((pool-lock)); - - oldtce = tbl-it_ops-get(tbl, entry); - if (oldtce (TCE_PCI_WRITE | TCE_PCI_READ)) - tbl-it_ops-clear(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock((pool-lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, 
entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } + iommu_tce_build(tbl, entry, 0, DMA_NONE); } return 0; @@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction) { - int ret = -EBUSY; + int ret; unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - spin_lock((pool-lock)); + ret = tbl-it_ops-exchange(tbl, entry, 1, hwaddr, oldtce, + direction, NULL); -
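The caller-side contract that exchange() introduces, sketched from the iommu_tce_build() description in this patch: the setter hands back the old TCE and the caller, not the IOMMU layer, releases the page (the helper name is illustrative):

    static void tce_put_old_page(unsigned long oldtce)
    {
        struct page *page;

        if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
            return;		/* the entry was empty */

        page = pfn_to_page(oldtce >> PAGE_SHIFT);
        if (oldtce & TCE_PCI_WRITE)
            SetPageDirty(page);	/* the device may have written to it */
        put_page(page);
    }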
[PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++-- 1 file changed, 53 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 1c1a9c4..c9fac97 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,46 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(struct iommu_table *tbl) +{ + long ret = 0, locked, lock_limit, npages; + + if (!current || !current-mm) + return -ESRCH; /* process exited */ + + npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + + down_write(current-mm-mmap_sem); + locked = current-mm-locked_vm + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; + if (locked lock_limit !capable(CAP_IPC_LOCK)) { + pr_warn(RLIMIT_MEMLOCK (%ld) exceeded\n, + rlimit(RLIMIT_MEMLOCK)); + ret = -ENOMEM; + } else { + current-mm-locked_vm += npages; + } + up_write(current-mm-mmap_sem); + + return ret; +} + +static void decrement_locked_vm(struct iommu_table *tbl) +{ + long npages; + + if (!current || !current-mm) + return; /* process exited */ + + npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; + + down_write(current-mm-mmap_sem); + if (npages current-mm-locked_vm) + npages = current-mm-locked_vm; + current-mm-locked_vm -= npages; + up_write(current-mm-mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data, static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; struct iommu_table *tbl; struct spapr_tce_iommu_group *data; @@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups +* each with 2GB DMA window, 8GB will be counted here. The reason for +* this is that we cannot tell here the amount of RAM used by the guest +* as this information is only available from KVM and VFIO is +* KVM agnostic. 
*/ tbl = data-ops-get_table(data, 0); if (!tbl) return -ENXIO; - down_write(current-mm-mmap_sem); - npages = (tbl-it_size IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; - locked = current-mm-locked_vm + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) PAGE_SHIFT; - if (locked lock_limit !capable(CAP_IPC_LOCK)) { - pr_warn(RLIMIT_MEMLOCK (%ld) exceeded\n, - rlimit(RLIMIT_MEMLOCK)); - ret = -ENOMEM; - } else { - current-mm-locked_vm += npages; - container-enabled = true; - } - up_write(current-mm-mmap_sem); + ret = try_increment_locked_vm(tbl); + if (ret) + return ret; + + container-enabled = true; return ret; } @@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container *container) if (!tbl) return; - down_write(current-mm-mmap_sem); - current-mm-locked_vm -= (tbl-it_size - IOMMU_PAGE_SHIFT_4K) PAGE_SHIFT; - up_write(current-mm-mmap_sem); + decrement_locked_vm(tbl); } static void *tce_iommu_open(unsigned long arg) -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 07/13] powerpc/powernv: Do not set read flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track which TCE entries are in use. Since we are going to use the iommu_table without its locks and do xchg() instead, it becomes essential not to set bits which are not implied by the direction flag. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/powernv/pci.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index deddcad..ab79e2d 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages, __be64 *tcep, *tces; u64 rpn; - proto_tce = TCE_PCI_READ; // Read allowed - - if (direction != DMA_TO_DEVICE) - proto_tce |= TCE_PCI_WRITE; + switch (direction) { + case DMA_BIDIRECTIONAL: + case DMA_FROM_DEVICE: + proto_tce = TCE_PCI_READ | TCE_PCI_WRITE; + break; + case DMA_TO_DEVICE: + proto_tce = TCE_PCI_READ; + break; + default: + proto_tce = 0; + break; + } tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset; rpn = __pa(uaddr) >> tbl->it_page_shift; -- 2.0.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO
The previous patch introduced iommu_table_ops::exchange() callback which effectively disabled VFIO on pseries. This implements exchange() for pseries/lpar so VFIO can work in nested guests. Since exchaange() callback returns an old TCE, it has to call H_GET_TCE for every TCE being put to the table so VFIO performance in guests running under PR KVM is expected to be slower than in guests running under HV KVM or bare metal hosts. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- arch/powerpc/platforms/pseries/iommu.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 9a7364f..ae15b5a 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long); static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, + unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { u64 rc = 0; u64 proto_tce, tce; u64 rpn; - int ret = 0; + int ret = 0, i = 0; long tcenum_start = tcenum, npages_start = npages; rpn = __pa(uaddr) TCE_SHIFT; @@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, while (npages--) { tce = proto_tce | (rpn TCE_RPN_MASK) TCE_RPN_SHIFT; + if (old_tces) + plpar_tce_get((u64)tbl-it_index, (u64)tcenum 12, + old_tces[i++]); rc = plpar_tce_put((u64)tbl-it_index, (u64)tcenum 12, tce); if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) { @@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, static DEFINE_PER_CPU(__be64 *, tce_page); -static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, +unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { @@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) { return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } @@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if (!tcep) { local_irq_restore(flags); return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } __get_cpu_var(tce_page) = tcep; @@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, for (l = 0; l limit; l++) { tcep[l] = cpu_to_be64(proto_tce | (rpn TCE_RPN_MASK) TCE_RPN_SHIFT); rpn++; + if (old_tces) + plpar_tce_get((u64)tbl-it_index, + (u64)(tcenum + l) 12, + old_tces[tcenum + l]); } rc = plpar_tce_put_indirect((u64)tbl-it_index, @@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, return ret; } +static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +long npages, unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs) +{ + return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL, + direction, attrs); +} + static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages) { u64 rc; @@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus) struct iommu_table_ops iommu_table_lpar_multi_ops = { .set = tce_buildmulti_pSeriesLP, + .exchange = tce_xchg_pSeriesLP, .clear = tce_freemulti_pSeriesLP, .get = 
 };
-- 
2.0.0
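For readers skimming the series: the net effect of exchange() is a "put the new TCE, learn the old one" primitive, which is what VFIO needs to unpin replaced pages. Below is a minimal sketch of a caller, assuming exchange() shares tce_xchg_pSeriesLP()'s signature (the callback itself is declared in an earlier patch of the series) and assuming a hypothetical put_old_page() helper for unpinning:

/*
 * Sketch only, not from the patch: replace one TCE and release whatever
 * page was mapped there before. put_old_page() is an assumed helper.
 */
static long replace_tce(struct iommu_table *tbl, long entry,
			unsigned long new_uaddr,
			enum dma_data_direction dir)
{
	unsigned long old_tce = 0;
	long ret;

	/* exchange() stores the previous TCE so the caller can unpin it */
	ret = tbl->it_ops->exchange(tbl, entry, 1, new_uaddr, &old_tce,
				    dir, NULL);
	if (ret)
		return ret;

	if (old_tce & (TCE_PCI_READ | TCE_PCI_WRITE))
		put_old_page(old_tce);

	return 0;
}

This also makes the commit message's performance caveat visible: every exchange() on pseries/lpar costs an extra H_GET_TCE hypercall per entry.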
[PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
SPAPR defines an interface to create additional DMA windows dynamically. "Dynamically" here means that the window is not allocated before the guest even starts; the guest can request it later. In practice, existing Linux guests check for the capability and, if it is present, create and map a DMA window as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns the number/size of windows
   which can be created (one, any page size);
2. create - ibm,create-pe-dma-window - creates a window;
3. remove - ibm,remove-pe-dma-window - removes a window; removing the
   default 32bit window is not allowed by this patch, this will be added
   later if needed;
4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
   to the default state; as the default window cannot be removed, it only
   removes the additional window if one was created.

The next patch will add the corresponding ioctls to the VFIO SPAPR TCE driver to provide the necessary support to userspace.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/tce.h            |  22 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 159 +++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.h      |   1 +
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index e6355f9..23b0362 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops {
			int num);
	void (*take_ownership)(struct spapr_tce_iommu_group *data,
			bool enable);
+
+	/* Dynamic DMA window */
+	/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+	long (*query)(struct spapr_tce_iommu_group *data,
+			__u32 *current_windows,
+			__u32 *windows_available,
+			__u32 *page_size_mask);
+	long (*create)(struct spapr_tce_iommu_group *data,
+			__u32 page_shift,
+			__u32 window_shift,
+			struct iommu_table **ptbl);
+	long (*remove)(struct spapr_tce_iommu_group *data,
+			struct iommu_table *tbl);
+	long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 296f49b..a6318cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		int num)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (pe->tce32.table.it_size)
+			return &pe->tce32.table;
+		return NULL;
+	case 1:
+		if (pe->tce64.table.it_size)
+			return &pe->tce64.table;
+		return NULL;
+	default:
+		return NULL;
+	}
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
				     bool enable)
 {
@@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+		__u32 *current_windows,
+		__u32 *windows_available, __u32 *page_size_mask)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	*windows_available = 2;
+	*current_windows = 0;
+	if (pe->tce32.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	if (pe->tce64.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	*page_size_mask =
+		DDW_PGSIZE_4K |
+		DDW_PGSIZE_64K |
+		DDW_PGSIZE_16M;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table **ptbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	struct page *tce_mem = NULL;
+	void *addr;
+	long ret;
+	unsigned long tce_table_size =
+		(1ULL << (window_shift - page_shift)) * 8;
+	unsigned
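The create() hunk above is cut off in this archive, but the query()/create() split already shows the intended flow: callers query capabilities first, then choose parameters. Below is a sketch of that selection, assuming the caller wants the largest supported IOMMU page size and a window covering all of guest RAM; pick_ddw_params() is a hypothetical helper, not part of the patch (the tce_table_size computation above confirms window_shift is log2 of the window size in bytes):

#include <linux/log2.h>

/* Sketch: derive create() arguments from query() output. */
static int pick_ddw_params(__u32 page_size_mask, unsigned long ram_bytes,
			   __u32 *page_shift, __u32 *window_shift)
{
	/* prefer the biggest page size the PHB advertises */
	if (page_size_mask & DDW_PGSIZE_16M)
		*page_shift = 24;
	else if (page_size_mask & DDW_PGSIZE_64K)
		*page_shift = 16;
	else if (page_size_mask & DDW_PGSIZE_4K)
		*page_shift = 12;
	else
		return -EINVAL;

	/* the window must cover RAM; window sizes are powers of two */
	*window_shift = order_base_2(ram_bytes);
	if (*window_shift < *page_shift)
		*window_shift = *page_shift;

	return 0;
}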
[PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds an iommu_table_ops struct and puts a pointer to it into the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong.

This adds an extra @ops parameter to iommu_init_table() to make sure that we do not leave any IOMMU table without iommu_table_ops. @it_ops is initialized at the very beginning, as iommu_init_table() calls iommu_table_clear() and the latter already uses the callbacks.

This does s/tce_build/set/, s/tce_free/clear/ and removes the tce_ prefixes for better readability.

This removes the tce_xxx_rm handlers from ppc_md as well, but does not add them to iommu_table_ops; this will be done later if we decide to support TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as callbacks for pseries. This changes the multi callbacks to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is that we still have to support the multitce=off boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace the multi callbacks with single ones.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 20 +++++++--
 arch/powerpc/include/asm/machdep.h          | 25 -----------
 arch/powerpc/kernel/iommu.c                 | 50 ++++++++++-----------
 arch/powerpc/kernel/vio.c                   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c         |  9 ++--
 arch/powerpc/platforms/pasemi/iommu.c       |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c        | 24 ++++++-----
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 42 ++++++++---------
 arch/powerpc/sysdev/dart_iommu.c            | 13 +++---
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@ extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-					    int nid);
+					    int nid,
+					    struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index b125cea..1fc824d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
	 * destroyed as well */
	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				   long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction
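To make the new convention concrete, here is a minimal sketch for a made-up platform (the my_platform_* names are illustrative; real examples are the pseries ops exported above):

/* Sketch: how a platform wires its callbacks after this patch. */
static struct iommu_table_ops my_platform_iommu_ops = {
	.set   = my_platform_tce_set,   /* assumed platform helpers */
	.clear = my_platform_tce_clear,
	.get   = my_platform_tce_get,
	/* .flush is optional */
};

static void my_platform_setup_table(struct iommu_table *tbl, int nid)
{
	/*
	 * it_ops must be valid on entry: iommu_init_table() calls
	 * iommu_table_clear(), which already goes through the callbacks.
	 */
	iommu_init_table(tbl, nid, &my_platform_iommu_ops);
}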
[PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears the TCE contents if there is no page marked busy in it_map. Clearing must be done outside of the table locks, as iommu_clear_tce(), called from iommu_clear_tces_and_put_pages(), takes these locks itself.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c2c8d9d..cd80867 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0, bit0 = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+		bit0 = test_and_clear_bit(0, tbl->it_map);
 
	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (bit0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-	return 0;
+	if (!ret)
+		iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+				tbl->it_size);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
+
	memset(tbl->it_map, 0, sz);
 
	/* Restore bit#0 set by iommu_init_table() */
	if (tbl->it_offset == 0)
		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
-- 
2.0.0
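A sketch of the intended caller pairing, e.g. a VFIO-style attach/detach path (the two wrapper functions are hypothetical):

/* Sketch: ownership handover around userspace use of the table. */
static int attach_table_to_user(struct iommu_table *tbl)
{
	/* -EBUSY here means the kernel still has live mappings */
	return iommu_take_ownership(tbl);
}

static void detach_table_from_user(struct iommu_table *tbl)
{
	/* clears user TCEs first, then resets it_map under the locks */
	iommu_release_ownership(tbl);
}

With the pool locks held across the it_map check and memset, the handover can no longer race with a concurrent iommu_alloc() marking bits in the same bitmap.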
[PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
This defines and implements a VFIO IOMMU API which lets userspace create and remove DMA windows. This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of available windows and the page mask. This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE to allow userspace to create and remove window(s). The VFIO IOMMU driver does basic sanity checks and calls the corresponding SPAPR TCE functions. At the moment only IODA2 (the POWER8 PCI host bridge) implements them. This advertises the VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via VFIO_IOMMU_SPAPR_TCE_GET_INFO. This calls the platform DDW reset() callback when the IOMMU is being disabled, to reset the DMA configuration to its original state.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h           |  25 ++++++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0dccbc4..b518891 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
 
	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!container->grp)
		return;
 
	data = iommu_group_get_iommudata(container->grp);
	if (!data || !data->iommu_owner || !data->ops->get_table)
		return;
 
-	tbl = data->ops->get_table(data, 0);
-	if (!tbl)
-		return;
+	if (current->mm) {
+		tbl = data->ops->get_table(data, 0);
+		if (tbl)
+			decrement_locked_vm(tbl);
 
-	decrement_locked_vm(tbl);
+		tbl = data->ops->get_table(data, 1);
+		if (tbl)
+			decrement_locked_vm(tbl);
+	}
+
+	if (data->ops->reset)
+		data->ops->reset(data);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
				 unsigned int cmd, unsigned long arg)
 {
	struct tce_container *container = iommu_data;
-	unsigned long minsz;
+	unsigned long minsz, ddwsz;
	long ret;
 
	switch (cmd) {
@@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
		info.flags = 0;
 
+		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				page_size_mask);
+
+		if (info.argsz == ddwsz) {
+			if (data->ops->query && data->ops->create &&
+					data->ops->remove) {
+				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
+
+				ret = data->ops->query(data,
+						&info.current_windows,
+						&info.windows_available,
+						&info.page_size_mask);
+				if (ret)
+					return ret;
+			} else {
+				info.current_windows = 0;
+				info.windows_available = 0;
+				info.page_size_mask = 0;
+			}
+			minsz = ddwsz;
+		}
+
		if (copy_to_user((void __user *)arg, &info, minsz))
			return -EFAULT;
 
@@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
		tce_iommu_disable(container);
		mutex_unlock(&container->lock);
		return 0;
+
	case VFIO_EEH_PE_OP:
		if (!container->grp)
			return -ENODEV;
 
		return vfio_spapr_iommu_eeh_ioctl(container->grp,
						  cmd, arg);
+
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (create.flags)
+			return -EINVAL;
+
+		if (!data->ops->create || !data->iommu_owner)
+			return -ENOSYS;
+
+		BUG_ON(!data || !data->ops || !data->ops->remove);
+
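From userspace, the argsz protocol above means existing binaries keep working while DDW-aware ones opt in by passing a larger struct. Below is a hedged sketch of a window-creation call: it assumes the patched uapi headers, and the page_shift/window_shift field names are assumptions since the vfio.h hunk is truncated here (only argsz, flags and start_addr are visible above):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>	/* patched headers assumed */

static int spapr_create_window(int container_fd, uint32_t page_shift,
			       uint32_t window_shift, uint64_t *start_addr)
{
	struct vfio_iommu_spapr_tce_create create;

	memset(&create, 0, sizeof(create));
	create.argsz = sizeof(create);		/* must reach start_addr */
	create.flags = 0;			/* any flag set is rejected */
	create.page_shift = page_shift;		/* assumed field name */
	create.window_shift = window_shift;	/* assumed field name */

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create))
		return -1;

	*start_addr = create.start_addr;	/* bus address of the window */
	return 0;
}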
[PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback
At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via container_of(tbl). Since we are going to add Dynamic DMA windows, which means having 2 IOMMU tables per PE, this is not going to work. This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback. This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer to the PE there. PNV's ppc_md.tce_build() call uses this to find the PE and do the invalidation. This will be used later for Dynamic DMA windows too. This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changes:
v4:
* changed commit log to explain why this change is needed
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ++++++++++---------
 arch/powerpc/platforms/powernv/pci.c      | 31 ++++++++---------
 arch/powerpc/platforms/powernv/pci.h      | 13 +++++++-
 3 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index df241b1..136e765 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
	pe = &phb->ioda.pe_array[pdn->pe_number];
	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
	} else {
		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
	}
	*pdev->dev.dma_mask = dma_mask;
	return 0;
@@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
	list_for_each_entry(dev, &bus->devices, bus_list) {
		if (add_to_iommu_group)
			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->tce32.table);
		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->tce32.table);
 
		if (dev->subordinate)
			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
-{
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
				      struct pnv_ioda_pe *pe, unsigned int base,
				      unsigned int segs)
@@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
	}
 
	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate;
 
	/* OPAL variant of P7IOC SW invalidated TCEs */
	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+					      tce32.table);
	uint16_t window_id = (pe->pe_number << 1) + 1;
	int64_t rc;
 
@@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
	pe->tce_bypass_base = 1ull << 59;
 
	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void
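The wrapper this patch introduces can be pictured as below. The struct itself lives in pci.h, whose hunk is truncated above, so the layout here is a sketch inferred from the pe->tce32.pe and pe->tce32.invalidate_fn assignments; pnv_tce_invalidate() is a hypothetical caller showing the dispatch that replaces the removed container_of() function:

/* Sketch of the wrapper; the real definition is in pci.h. */
struct pnv_iommu_table {
	struct iommu_table	table;
	struct pnv_ioda_pe	*pe;
	void (*invalidate_fn)(struct pnv_ioda_pe *pe,
			      struct iommu_table *tbl,
			      __be64 *startp, __be64 *endp, bool rm);
};

static void pnv_tce_invalidate(struct iommu_table *tbl,
			       __be64 *startp, __be64 *endp, bool rm)
{
	struct pnv_iommu_table *ptbl =
		container_of(tbl, struct pnv_iommu_table, table);

	/* per-PE callback set at setup time; no PHB type check needed */
	if (ptbl->invalidate_fn)
		ptbl->invalidate_fn(ptbl->pe, tbl, startp, endp, rm);
}

The gain: a PE with two tables (tce32 and a future tce64) yields the right PE pointer from either table, which container_of() against pnv_ioda_pe could not provide.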
[PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
Modern IBM POWERPC systems support multiple IOMMU tables per PE, so we need a more reliable way (compared to container_of()) to get a PE pointer from the iommu_table struct pointer used in IOMMU functions. At the moment the IOMMU group data points to an iommu_table struct. This introduces a spapr_tce_iommu_group struct which keeps an iommu_owner and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to the pnv_ioda_pe struct; for others it is still a pointer to the iommu_table struct. The ops structs correspond to the type which iommu_owner points to.

This defines a get_table() callback which returns an iommu_table by its number. As the IOMMU group data pointer now points to a variable type instead of iommu_table, the VFIO SPAPR TCE driver is updated to use the new type. This changes the tce_container struct to store an iommu_group instead of an iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
  - iommu_table;
  - pnv_ioda_pe.

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2 for now; IODA2 will get its own pnv_ioda2_iommu_get_table soon, and pnv_ioda1_iommu_get_table will then only be used for IODA1.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   6 ++
 arch/powerpc/include/asm/tce.h              |  13 +++
 arch/powerpc/kernel/iommu.c                 |  35 ++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +++++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/pseries/iommu.c      |  10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 148 ++++++++++++++++++++++++----
 8 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
					    int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+				 void *iommu_owner,
+				 struct spapr_tce_iommu_ops *ops,
				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+					void *iommu_owner,
+					struct spapr_tce_iommu_ops *ops,
					int pci_domain_number,
					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..9f159eb 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,18 @@
 #define TCE_PCI_READ		0x1	/* read from PCI allowed */
 #define TCE_VB_WRITE		0x1	/* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+struct spapr_tce_iommu_ops {
+	struct iommu_table *(*get_table)(
+			struct spapr_tce_iommu_group *data,
+			int num);
+};
+
+struct spapr_tce_iommu_group {
+	void *iommu_owner;
+	struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b378f78..1c5dae7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_default_get_table(
+		struct spapr_tce_iommu_group *data, int num)
+{
+	struct iommu_table *tbl = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (tbl->it_size)
+			return tbl;
+		/* fallthru */
+	default:
+		return NULL;
+	}
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+	.get_table = spapr_tce_default_get_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
		int pci_domain_number, unsigned long
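On the VFIO side, the practical consequence is one extra indirection when resolving a table from a group. Below is a sketch of the lookup using only the structures this patch defines (tce_find_table() itself is an illustrative helper, not from the diff):

static struct iommu_table *tce_find_table(struct iommu_group *grp, int num)
{
	struct spapr_tce_iommu_group *data = iommu_group_get_iommudata(grp);

	if (!data || !data->ops || !data->ops->get_table)
		return NULL;

	/* num 0 is the default 32bit window */
	return data->ops->get_table(data, num);
}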
[PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size
This checks that the TCE table page size is not bigger than the size of a page we have just pinned and are going to put the physical address of into the table. Otherwise the hardware gets unwanted access to the physical memory between the end of the actual page and the end of the aligned-up TCE page.

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/hugetlb.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
			tce, entry << tbl->it_page_shift, ret); */
		return -EFAULT;
	}
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	if (PageHuge(page)) {
+		struct page *head = compound_head(page);
+		long shift = PAGE_SHIFT + compound_order(head);
+
+		if (shift < tbl->it_page_shift) {
+			ret = -EINVAL;
+			goto put_page_exit;
+		}
+
+	}
+
	hwaddr = (unsigned long) page_address(page) + offset;
 
	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
	if (ret)
-		put_page(page);
+		goto put_page_exit;
 
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+	return 0;
+
+put_page_exit:
+	pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
			__func__, entry << tbl->it_page_shift, tce, ret);
+
+	put_page(page);
+
	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-- 
2.0.0
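A worked example of the check: with 4K kernel pages, a 16MB huge page has shift = PAGE_SHIFT + compound_order(head) = 12 + 12 = 24, so it may safely back a 64K (it_page_shift = 16) TCE entry; the reverse pairing must fail, or the device could reach the 16M - 64K of memory behind the pinned 64K page that was never granted. Restated as a predicate (a sketch, not code from the patch):

/* Sketch: a pinned page may back a TCE entry only if it is not smaller. */
static bool tce_page_size_ok(long page_shift, long it_page_shift)
{
	return page_shift >= it_page_shift;
}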
Re: [patch 4/4] KVM: MMU: pinned sps are not candidates for deletion.
Hi Marcelo,

Sorry for the delay.

On Sep 9, 2014, at 11:41 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

On Tue, Jul 22, 2014 at 05:59:42AM +0800, Xiao Guangrong wrote:

On Jul 10, 2014, at 3:12 AM, mtosa...@redhat.com wrote:

Skip pinned shadow pages when selecting pages to zap.

It seems there is no way to prevent changing a pinned spte on the zap-all path?

Xiao,

The way would be to reload remote mmus, forcing the vcpu to exit and the page to be zapped; the vcpu will then page-fault any necessary page back in via kvm_mmu_pin_pages. kvm_mmu_invalidate_zap_all_pages does:

- spin_lock(mmu_lock)
- kvm_reload_remote_mmus
...
- spin_unlock(mmu_lock)

So it is OK to change a pinned spte on the zap-all path.

Got it, thanks for your explanation. I am thinking whether we could move pinned sptes to a separate list (e.g. pinned_shadow_pages) instead of the active list, so that they cannot be touched by any other free paths. What is your opinion?

As mentioned above, it is OK to zap pinned sptes as long as a reload-remote-mmus request is performed. That said, do you still consider a separate list necessary?

I need to think more about this idea... let's address it with your patch first. :)
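For readers following the thread, Xiao's alternative can be sketched as follows; this is entirely hypothetical (struct and function names invented for illustration) and is not part of Marcelo's patch:

/*
 * Sketch of the separate-list idea: pinned shadow pages move off the
 * active list so generic zap paths never select them; zap-all would
 * still reload remote mmus and let vcpus re-pin via kvm_mmu_pin_pages().
 */
struct kvm_mmu_page_sketch {
	struct list_head link;
	/* ... rest of the shadow page ... */
};

static void sp_mark_pinned(struct list_head *pinned_shadow_pages,
			   struct kvm_mmu_page_sketch *sp)
{
	list_move(&sp->link, pinned_shadow_pages);
}

The trade-off the thread settles on: the extra list is unnecessary as long as every zap of a pinned spte is paired with a reload-remote-mmus request, which kvm_mmu_invalidate_zap_all_pages already performs under mmu_lock.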