Re: [PATCH kernel v11 20/34] powerpc/powernv/ioda2: Move TCE kill register address to PE
On Fri, May 29, 2015 at 06:44:44PM +1000, Alexey Kardashevskiy wrote:
> At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property which contains the TCE kill register address. Writing to this register invalidates the TCE cache on an IODA/IODA2 hub.
>
> This moves the register address from iommu_table to pnv_phb, as this register belongs to the PHB and invalidates the TCE cache for all tables of all attached PEs.
>
> This moves the property reading/remapping code to a helper which is called when DMA is being configured for a PE, and which does DMA setup for both IODA1 and IODA2.
>
> This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which invalidates the cache for the entire table. It should be called after every call to opal_pci_map_pe_dma_window(). It was not required before because there was just a single TCE table and 64-bit DMA was handled via the bypass window (which has no table, so no cache was used), but this is going to change with Dynamic DMA windows (DDW).
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 17/34] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
On Fri, May 29, 2015 at 06:44:41PM +1000, Alexey Kardashevskiy wrote:
> Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds an iommu_table_group container for TCE tables. Right now just one table is supported.
>
> For IODA, instead of embedding iommu_table, the new iommu_table_group keeps pointers to them. The iommu_table structs are allocated dynamically now by a pnv_pci_table_alloc() helper, as PCI hotplug code (for EEH recovery) and SRIOV are supported there.
>
> For P5IOC2, both iommu_table_group and iommu_table are embedded into the PE struct. As there is no EEH and SRIOV support for P5IOC2, iommu_free_table() should not be called on iommu_table struct pointers, so we can keep it embedded in pnv_phb::p5ioc2.
>
> For pSeries, this replaces multiple calls of kzalloc_node() with a new iommu_pseries_group_alloc() helper and stores the table group struct pointer in the pci_dn struct. For release, an iommu_table_group_free() helper is added.
>
> This moves iommu_table struct allocation from SR-IOV code to the generic DMA initialization code in pnv_pci_ioda2_setup_dma_pe.
>
> This replaces a single pointer to iommu_group with a list of iommu_table_group structs. For now it is just a single iommu_table_group in this list, but later, with TCE table sharing enabled, the list will keep all the IOMMU groups which use the particular table. The list uses iommu_table_group_link structs rather than iommu_table_group::next, as a VFIO container may have two IOMMU tables; each keeps its own list head pointer, mainly for the TCE invalidation code, which must walk through all attached groups and invalidate the TCE cache, so the table has to keep the list head pointer. The other option would be storing the list head in a VFIO container, but that would not work, as the platform code (which does TCE table update and invalidation) has no idea about VFIO.
>
> This should cause no behavioural change.
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson alex.william...@redhat.com
> Reviewed-by: David Gibson da...@gibson.dropbear.id.au
> Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

It looks like this commit message doesn't match the code - it seems like an older or newer version of the message from the previous patch. This patch seems instead to be about changing the table_group - table relationship from 1:1 to many:many.

> ---
> Changes:
> v10:
> * iommu_table is not embedded into iommu_table_group but allocated dynamically
> * iommu_table allocation is moved to a single place for IODA2's pnv_pci_ioda_setup_dma_pe where it belongs to
> * added list of groups into iommu_table; most of the code just looks at the first item to keep the patch simpler
>
> v9:
> * s/it_group/it_table_group/
> * added and used iommu_table_group_free(), from now iommu_free_table() is only used for VIO
> * added iommu_pseries_group_alloc()
> * squashed "powerpc/iommu: Introduce iommu_table_alloc() helper" into this
> ---
>  arch/powerpc/include/asm/iommu.h            |   8 +-
>  arch/powerpc/kernel/iommu.c                 |   9 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c   |  45 ++
>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |   3 +
>  arch/powerpc/platforms/powernv/pci.c        |  76 +
>  arch/powerpc/platforms/powernv/pci.h        |   7 ++
>  arch/powerpc/platforms/pseries/iommu.c      |  33 +++-
>  drivers/vfio/vfio_iommu_spapr_tce.c         | 122
>  8 files changed, 242 insertions(+), 61 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 5a7267f..44a20cc 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -91,7 +91,7 @@ struct iommu_table {
>  	struct iommu_pool pools[IOMMU_NR_POOLS];
>  	unsigned long *it_map;		/* A simple allocation bitmap for now */
>  	unsigned long  it_page_shift;	/* table iommu page size */
> -	struct iommu_table_group *it_table_group;
> +	struct list_head it_group_list;	/* List of iommu_table_group_link */
>  	struct iommu_table_ops *it_ops;
>  	void (*set_bypass)(struct iommu_table *tbl, bool enable);
>  };
> @@ -126,6 +126,12 @@ extern struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid);
>
>  #define IOMMU_TABLE_GROUP_MAX_TABLES	1
>
> +struct iommu_table_group_link {
> +	struct list_head next;
> +	struct rcu_head rcu;
> +	struct iommu_table_group *table_group;
> +};
> +
>  struct iommu_table_group {
>  	struct iommu_group *group;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 719f048..e305a8f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1078,6 +1078,7 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  int iommu_add_device(struct device
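The core of the patch is that an iommu_table no longer points back at a single iommu_table_group; instead it keeps a list of iommu_table_group_link entries, and TCE invalidation walks that list to hit every attached PE. A minimal userspace sketch of that pattern (mock types and names, not the kernel's; the kernel version would use list_add_rcu()/list_for_each_entry_rcu() and per-PHB register writes):

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the many:many table<->group relationship from the patch. */
struct table_group {
	int invalidated;	/* stands in for a write to the PHB's TCE kill register */
};

struct table_group_link {
	struct table_group *group;
	struct table_group_link *next;
};

struct iommu_table_mock {
	struct table_group_link *group_list;	/* head of the link list */
};

/* Attach a group to the table (kernel code would use list_add_rcu()). */
static void table_attach_group(struct iommu_table_mock *tbl,
			       struct table_group_link *link,
			       struct table_group *grp)
{
	link->group = grp;
	link->next = tbl->group_list;
	tbl->group_list = link;
}

/* Invalidate the TCE cache of every PE the table is attached to. */
static int table_invalidate_all(struct iommu_table_mock *tbl)
{
	int n = 0;
	for (struct table_group_link *l = tbl->group_list; l; l = l->next) {
		l->group->invalidated = 1;
		n++;
	}
	return n;
}

/* Demo: two groups sharing one table; returns flushes + both flags. */
static int demo_shared_table(void)
{
	struct iommu_table_mock tbl = { 0 };
	struct table_group g1 = { 0 }, g2 = { 0 };
	struct table_group_link l1, l2;

	table_attach_group(&tbl, &l1, &g1);
	table_attach_group(&tbl, &l2, &g2);
	return table_invalidate_all(&tbl) + g1.invalidated + g2.invalidated;
}
```

This is why the list head lives in the table rather than in a VFIO container: the platform invalidation code only has the table in hand.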
Re: [PATCH kernel v11 18/34] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
On Fri, May 29, 2015 at 06:44:42PM +1000, Alexey Kardashevskiy wrote:
> This adds tce_iommu_take_ownership() and tce_iommu_release_ownership() which call, in a loop, iommu_take_ownership()/iommu_release_ownership() for every table in the group. As there is just one now, no change in behaviour is expected.
>
> At the moment the iommu_table struct has a set_bypass() which enables/disables DMA bypass on the IODA2 PHB. This is exposed to POWERPC IOMMU code, which calls this callback when external IOMMU users such as VFIO are about to take over a PHB.
>
> The set_bypass() callback is not really an iommu_table function but an IOMMU/PE function. This introduces an iommu_table_group_ops struct and adds take_ownership()/release_ownership() callbacks to it which are called when an external user takes/releases control over the IOMMU.
>
> This replaces set_bypass() with ownership callbacks, as it is not necessarily just bypass enabling - it can be something else/more - so let's give it a more generic name. The callback is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. The following patches will replace iommu_take_ownership/iommu_release_ownership calls in IODA2 with full IOMMU table release/create.
>
> While we are here touching bypass control, this removes pnv_pci_ioda2_setup_bypass_pe(), as it does not do much more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base initialization to pnv_pci_ioda2_setup_dma_pe.
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson alex.william...@redhat.com
> Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
Re: [PATCH kernel v11 21/34] powerpc/powernv/ioda2: Add TCE invalidation for all attached groups
On Fri, May 29, 2015 at 06:44:45PM +1000, Alexey Kardashevskiy wrote:
> The iommu_table struct keeps a list of IOMMU groups it is used for. At the moment there is just a single group attached, but further patches will add TCE table sharing. When sharing is enabled, the TCE cache in each PE needs to be invalidated, and that is what this patch does.
>
> This does not change pnv_pci_ioda1_tce_invalidate(), as there is no plan to enable TCE table sharing on PHBs older than IODA2.
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
Re: [PATCH kernel v11 23/34] powerpc/iommu/powernv: Release replaced TCE
On Fri, May 29, 2015 at 06:44:47PM +1000, Alexey Kardashevskiy wrote:
> At the moment writing a new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However, the PAPR specification allows the guest to write a new TCE value without clearing it first.
>
> Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are there to protect the DMA page allocator rather than the entries, and since the host kernel does not control what pages are in use, there is no point in pool locks; exchange()+put_page(oldtce) is sufficient to avoid possible races.
>
> This adds an exchange() callback to iommu_table_ops which does the same thing as set(), plus it returns the replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address, unlike set() which receives a linear mapping address, and returns a physical address as clear() does.
>
> This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO.
>
> This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg().
>
> This makes sure that TCE permission bits are not set in the TCE passed to the IOMMU API, as those are to be calculated by platform code from the DMA direction.
>
> This moves SetPageDirty() to the IOMMU code to make it work for both the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes available later).
>
> Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
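The exchange() idea above - swap the new TCE in atomically and hand the old value back so the caller can release the old page, instead of failing with EBUSY - can be sketched in userspace C. The table layout and the TCE_VALID bit here are made up for illustration; the real encoding is platform-defined:

```c
#include <assert.h>
#include <stdint.h>

#define TCE_VALID	0x1ull	/* assumed "entry is valid" bit, for illustration */

/* One atomic op replaces the old check-then-fail-with-EBUSY logic;
 * no pool lock is needed just to protect the entry itself. */
static uint64_t tce_xchg(uint64_t *table, unsigned long idx, uint64_t new_tce)
{
	return __atomic_exchange_n(&table[idx], new_tce, __ATOMIC_SEQ_CST);
}

static int demo_tce_xchg(void)
{
	uint64_t table[4] = { 0 };

	uint64_t old = tce_xchg(table, 2, 0x2000 | TCE_VALID);
	if (old != 0)
		return -1;	/* slot was empty: nothing to release */

	/* Overwriting a valid entry succeeds and returns the old TCE,
	 * so the caller can put_page() the replaced page. */
	old = tce_xchg(table, 2, 0x3000 | TCE_VALID);
	return old == (0x2000 | TCE_VALID);
}
```

The returned old value is exactly what the caller needs for put_page(oldtce) in the race-free scheme the commit message describes.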
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
On 01/06/2015 13:42, Christian Borntraeger wrote:
> Am 01.06.2015 um 13:34 schrieb Paolo Bonzini:
>> On 01/06/2015 09:47, Christian Borntraeger wrote:
>>> 1: disable, guest, disable again and save, restore to disable, enable
>>> and now it is
>>> 2: disable, guest, enable
>>> and with your patch it is
>>> 3: disable, guest, enable, disable, enable
>>> I assume that 3 and 1 are similar in their costs, so this is probably ok.
>>
>> At least on x86, 3 and 2 are similar, but 3 is much more expensive than 1!
>> See https://lkml.org/lkml/2015/5/5/835:
>
> That does not make sense. If 3 and 2 are similar, then 2 must be much more expensive than 1 as well. As 2 is a strict subset of 1 it must be cheaper, no?

Yes, it must. I meant 3 is much cheaper than 1.

Paolo

>> Cost of: CLI insn, same-IF                 :  0 cycles
>> Cost of: CLI insn, flip-IF                 :  0 cycles
>> Cost of: STI insn, same-IF                 :  0 cycles
>> Cost of: STI insn, flip-IF                 :  0 cycles
>> Cost of: PUSHF insn                        :  0 cycles
>> Cost of: POPF insn, same-IF                : 20 cycles
>> Cost of: POPF insn, flip-IF                : 28 cycles
>> Cost of: local_irq_save() fn               : 20 cycles
>> Cost of: local_irq_restore() fn, same-IF   : 24 cycles
>> Cost of: local_irq_restore() fn, flip-IF   : 28 cycles
>> Cost of: irq_save()+restore() fn, same-IF  : 48 cycles
>> Cost of: irq_save()+restore() fn, flip-IF  : 48 cycles
>
> Yes, it's similar on s390. local_irq_save/restore is noticeable in guest exit hot loops (that's what inspired my patch), but a simple irq disable is just a single cycle, pipelined. Given the design of aggressive out-of-order cores with all the architectural ordering, this makes sense.
>
> Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] treewide: Fix typo compatability - compatibility
On 27.05.15 14:05, Laurent Pinchart wrote:
> Even though 'compatability' has a dedicated entry in the Wiktionary, it's listed as 'Misspelling of compatibility'. Fix it.
>
> Signed-off-by: Laurent Pinchart laurent.pinch...@ideasonboard.com
> ---
>  arch/metag/include/asm/elf.h | 2 +-
>  arch/powerpc/kvm/book3s.c    | 2 +-

Acked-by: Alexander Graf ag...@suse.de for the PPC KVM bit.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
Am 01.06.2015 um 13:34 schrieb Paolo Bonzini:
> On 01/06/2015 09:47, Christian Borntraeger wrote:
>> 1: disable, guest, disable again and save, restore to disable, enable
>> and now it is
>> 2: disable, guest, enable
>> and with your patch it is
>> 3: disable, guest, enable, disable, enable
>> I assume that 3 and 1 are similar in their costs, so this is probably ok.
>
> At least on x86, 3 and 2 are similar, but 3 is much more expensive than 1!
> See https://lkml.org/lkml/2015/5/5/835:

That does not make sense. If 3 and 2 are similar, then 2 must be much more expensive than 1 as well. As 2 is a strict subset of 1 it must be cheaper, no?

> Cost of: CLI insn, same-IF                 :  0 cycles
> Cost of: CLI insn, flip-IF                 :  0 cycles
> Cost of: STI insn, same-IF                 :  0 cycles
> Cost of: STI insn, flip-IF                 :  0 cycles
> Cost of: PUSHF insn                        :  0 cycles
> Cost of: POPF insn, same-IF                : 20 cycles
> Cost of: POPF insn, flip-IF                : 28 cycles
> Cost of: local_irq_save() fn               : 20 cycles
> Cost of: local_irq_restore() fn, same-IF   : 24 cycles
> Cost of: local_irq_restore() fn, flip-IF   : 28 cycles
> Cost of: irq_save()+restore() fn, same-IF  : 48 cycles
> Cost of: irq_save()+restore() fn, flip-IF  : 48 cycles

Yes, it's similar on s390. local_irq_save/restore is noticeable in guest exit hot loops (that's what inspired my patch), but a simple irq disable is just a single cycle, pipelined. Given the design of aggressive out-of-order cores with all the architectural ordering, this makes sense.

Christian
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
On 01/06/2015 09:47, Christian Borntraeger wrote:
> There it was
> 1: disable, guest, disable again and save, restore to disable, enable
> and now it is
> 2: disable, guest, enable
> and with your patch it is
> 3: disable, guest, enable, disable, enable
> I assume that 3 and 1 are similar in their costs, so this is probably ok.

At least on x86, 3 and 2 are similar, but 3 is much more expensive than 1! See https://lkml.org/lkml/2015/5/5/835:

Cost of: CLI insn, same-IF                 :  0 cycles
Cost of: CLI insn, flip-IF                 :  0 cycles
Cost of: STI insn, same-IF                 :  0 cycles
Cost of: STI insn, flip-IF                 :  0 cycles
Cost of: PUSHF insn                        :  0 cycles
Cost of: POPF insn, same-IF                : 20 cycles
Cost of: POPF insn, flip-IF                : 28 cycles
Cost of: local_irq_save() fn               : 20 cycles
Cost of: local_irq_restore() fn, same-IF   : 24 cycles
Cost of: local_irq_restore() fn, flip-IF   : 28 cycles
Cost of: irq_save()+restore() fn, same-IF  : 48 cycles
Cost of: irq_save()+restore() fn, flip-IF  : 48 cycles

Paolo
Re: [PATCH v5 10/12] KVM: arm64: guest debug, HW assisted debug support
Will Deacon will.dea...@arm.com writes:

> Hi Alex,
>
> On Fri, May 29, 2015 at 10:30:26AM +0100, Alex Bennée wrote:
>> This adds support for userspace to control the HW debug registers for
>> guest debug. In the debug ioctl we copy the IMPDEF defined number of
>> registers into a new register set called host_debug_state. There is
>> now a new vcpu parameter called debug_ptr which selects which register
>> set is to be copied into the real registers when world switch occurs.
>>
>> I've moved some helper functions into the hw_breakpoint.h header for
>> re-use.
>>
>> As with single step we need to tweak the guest registers to enable the
>> exceptions so we need to save and restore those bits.
>>
>> Two new capabilities have been added to the KVM_EXTENSION ioctl to
>> allow userspace to query the number of hardware break and watch points
>> available on the host hardware.
>>
>> Signed-off-by: Alex Bennée alex.ben...@linaro.org
>
> [...]
>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 33c8143..ada57df 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2668,7 +2668,7 @@ The top 16 bits of the control field are architecture specific control
>>  flags which can include the following:
>>
>>    - KVM_GUESTDBG_USE_SW_BP:     using software breakpoints [x86, arm64]
>> -  - KVM_GUESTDBG_USE_HW_BP:     using hardware breakpoints [x86, s390]
>> +  - KVM_GUESTDBG_USE_HW_BP:     using hardware breakpoints [x86, s390, arm64]
>>    - KVM_GUESTDBG_INJECT_DB:     inject DB type exception [x86]
>>    - KVM_GUESTDBG_INJECT_BP:     inject BP type exception [x86]
>>    - KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
>> @@ -2683,6 +2683,11 @@ updated to the correct (supplied) values.
>>  The second part of the structure is architecture specific and
>>  typically contains a set of debug registers.
>>
>> +For arm64 the number of debug registers is implementation defined and
>> +can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
>> +KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
>> +indicating the number of supported registers.
>> +
>>  When debug events exit the main run loop with the reason
>>  KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
>>  structure containing architecture specific debug information.
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 0d17c7b..6df47c1 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -307,6 +307,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>
>>  #define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE |    \
>>  			    KVM_GUESTDBG_USE_SW_BP | \
>> +			    KVM_GUESTDBG_USE_HW_BP | \
>>  			    KVM_GUESTDBG_SINGLESTEP)
>>
>>  /**
>> @@ -327,6 +328,12 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>>  	if (dbg->control & KVM_GUESTDBG_ENABLE) {
>>  		vcpu->guest_debug = dbg->control;
>> +
>> +		/* Hardware assisted Break and Watch points */
>> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
>> +			vcpu->arch.external_debug_state = dbg->arch;
>> +		}
>> +
>>  	} else {
>>  		/* If not enabled clear all flags */
>>  		vcpu->guest_debug = 0;
>> diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
>> index 52b484b..c450552 100644
>> --- a/arch/arm64/include/asm/hw_breakpoint.h
>> +++ b/arch/arm64/include/asm/hw_breakpoint.h
>> @@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct task_struct *task)
>>  }
>>  #endif
>>
>> +/* Determine number of BRP registers available. */
>> +static inline int get_num_brps(void)
>> +{
>> +	return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
>> +}
>> +
>> +/* Determine number of WRP registers available. */
>> +static inline int get_num_wrps(void)
>> +{
>> +	return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
>> +}
>
> I'm fine with moving these into the header file, but we should probably
> revisit this at some point so that we use a shadow value to indicate the
> minimum number of registers across the system for e.g. big.LITTLE
> platforms that don't have uniform debug resources.

There is work going forward in QEMU to specify this limitation when creating vcpus. At the moment the kernel sanity checks these are all the same on boot up, but I guess we will have to decide where the smarts will end up. Do we:

- report the real number per real-cpu (QEMU figures out vcpu config), or
- report the lowest common denominator?

>>  extern struct pmu perf_ops_bp;
>>
>>  #endif	/* __KERNEL__ */
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index e5040b6..498d4f7 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -113,12 +113,13 @@ struct kvm_vcpu_arch {
>>  	/*
>>  	 * For debugging the guest we need to keep a set of debug
>> -	 * registers which can override the guests own debug state
>> +	 * registers which can
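The get_num_brps()/get_num_wrps() helpers quoted above just extract "(count - 1)" bitfields from ID_AA64DFR0_EL1: BRPs in bits [15:12], WRPs in bits [23:20]. A userspace re-creation of that extraction, with a made-up sample register value (the real helpers read the CPU register via read_cpuid()):

```c
#include <assert.h>
#include <stdint.h>

/* Number of hardware breakpoint registers: ID_AA64DFR0_EL1.BRPs + 1. */
static int num_brps(uint64_t id_aa64dfr0)
{
	return (int)((id_aa64dfr0 >> 12) & 0xf) + 1;	/* bits [15:12] */
}

/* Number of hardware watchpoint registers: ID_AA64DFR0_EL1.WRPs + 1. */
static int num_wrps(uint64_t id_aa64dfr0)
{
	return (int)((id_aa64dfr0 >> 20) & 0xf) + 1;	/* bits [23:20] */
}
```

The "+ 1" is why the KVM capabilities discussed above always report a positive number: a field value of 0 already means one register.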
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
Am 28.05.2015 um 20:49 schrieb Christoffer Dall:
> Until now we have been calling kvm_guest_exit after re-enabling interrupts when we come back from the guest, but this has the unfortunate effect that CPU time accounting done in the context of timer interrupts occurring while the guest is running doesn't properly notice that the time since the last tick was spent in the guest.

Can you verify that a CPU bound guest has almost zero guest time? Assuming that your answer is yes, your patch makes sense, as host timer interrupts should be the only reason for guest exits then.

> Inspired by the comment in the x86 code, move the kvm_guest_exit() call below the local_irq_enable() call and change __kvm_guest_exit() to kvm_guest_exit(), because we are now calling this function with interrupts enabled. We have to now explicitly disable preemption and not enable preemption before we've called kvm_guest_exit(), since otherwise we could be preempted and everything happening before we eventually get scheduled again would be accounted for as guest time.
>
> At the same time, move the trace_kvm_exit() call outside of the atomic section, since there is no reason for us to do that with interrupts disabled.
>
> Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
> ---
> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit rework recently posted by Christian Borntraeger.
>
> I hope I got the logic of this right, there were 2 slightly worrying facts about this:
>
> First, we now enable and disable and enable interrupts on each exit path, but I couldn't see any performance overhead on hackbench - yes the only benchmark we care about.

This should be somewhat similar to the situation before my patch. There it was
1: disable, guest, disable again and save, restore to disable, enable
and now it is
2: disable, guest, enable
and with your patch it is
3: disable, guest, enable, disable, enable
I assume that 3 and 1 are similar in their costs, so this is probably ok.

> Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures.

Not an expert here, but I assume mips has the same logic as arm, so if your patch is right for arm it's probably also right for mips. powerpc looks similar to what s390 does (not using the tick; instead it uses a hw-timer), so this should be fine.
Re: [PATCH v2 09/11] KVM: arm: disable debug mode if we don't actually need it.
On Sun, May 31, 2015 at 05:27:10AM +0100, Zhichao Huang wrote:
> Until now we enable debug mode all the time even if we don't actually need it.
>
> Inspired by the implementation in arm64, disable debug mode if we don't need it. And then we are able to reduce unnecessary registers saving/restoring when the debug mode is disabled.

I'm terrified about this patch. Enabling monitor mode has proven to be *extremely* fragile in practice on 32-bit ARM SoCs, so trying to do this more often makes me very nervous.

> Signed-off-by: Zhichao Huang zhichao.hu...@linaro.org
> ---
>  arch/arm/kernel/hw_breakpoint.c | 55 ++---
>  1 file changed, 46 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm/kernel/hw_breakpoint.c b/arch/arm/kernel/hw_breakpoint.c
> index dc7d0a9..1d27563 100644
> --- a/arch/arm/kernel/hw_breakpoint.c
> +++ b/arch/arm/kernel/hw_breakpoint.c
> @@ -266,8 +266,7 @@ static int enable_monitor_mode(void)
>  	}
>
>  	/* Check that the write made it through. */
> -	ARM_DBG_READ(c0, c1, 0, dscr);
> -	if (!(dscr & ARM_DSCR_MDBGEN)) {
> +	if (!monitor_mode_enabled()) {
>  		pr_warn_once("Failed to enable monitor mode on CPU %d.\n",
>  				smp_processor_id());
>  		return -EPERM;

Ok, this hunk is harmless :)

> @@ -277,6 +276,43 @@ out:
>  	return 0;
>  }
>
> +static int disable_monitor_mode(void)
> +{
> +	u32 dscr;
> +
> +	ARM_DBG_READ(c0, c1, 0, dscr);
> +
> +	/* If monitor mode is already disabled, just return. */
> +	if (!(dscr & ARM_DSCR_MDBGEN))
> +		goto out;
> +
> +	/* Write to the corresponding DSCR. */
> +	switch (get_debug_arch()) {
> +	case ARM_DEBUG_ARCH_V6:
> +	case ARM_DEBUG_ARCH_V6_1:
> +		ARM_DBG_WRITE(c0, c1, 0, (dscr & ~ARM_DSCR_MDBGEN));
> +		break;
> +	case ARM_DEBUG_ARCH_V7_ECP14:
> +	case ARM_DEBUG_ARCH_V7_1:
> +	case ARM_DEBUG_ARCH_V8:
> +		ARM_DBG_WRITE(c0, c2, 2, (dscr & ~ARM_DSCR_MDBGEN));
> +		isb();
> +		break;
> +	default:
> +		return -ENODEV;
> +	}
> +
> +	/* Check that the write made it through. */
> +	if (monitor_mode_enabled()) {
> +		pr_warn_once("Failed to disable monitor mode on CPU %d.\n",
> +				smp_processor_id());
> +		return -EPERM;
> +	}
> +
> +out:
> +	return 0;
> +}

I'm not comfortable with this. enable_monitor_mode has precisely one caller [reset_ctrl_regs] which goes to some lengths to get the system into a well-defined state. On top of that, the whole thing is run with an undef hook registered because there isn't an architected way to discover whether or not DBGSWENABLE is driven low.

Why exactly do you need this? Can you not trap guest debug accesses some other way?

>  int hw_breakpoint_slots(int type)
>  {
>  	if (!debug_arch_supported())
> @@ -338,6 +374,8 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
>  	int i, max_slots, ctrl_base, val_base;
>  	u32 addr, ctrl;
>
> +	enable_monitor_mode();
> +
>  	addr = info->address;
>  	ctrl = encode_ctrl_reg(info->ctrl) | 0x1;
> @@ -430,6 +468,8 @@ void arch_uninstall_hw_breakpoint(struct perf_event *bp)
>
>  	/* Reset the control register. */
>  	write_wb_reg(base + i, 0);
> +
> +	disable_monitor_mode();

My previous concerns notwithstanding, shouldn't this be refcounted?

Will
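Will's refcounting point is that with the patch as posted, uninstalling any one breakpoint disables monitor mode even if other breakpoints are still installed. A minimal sketch of the fix he is hinting at - only the first user enables, only the last user disables (mock functions standing in for enable_monitor_mode()/disable_monitor_mode(); in the kernel the counter would be per-CPU and updated under appropriate locking):

```c
#include <assert.h>

static int monitor_enabled;	/* mock hardware state (DSCR.MDBGEN) */
static int monitor_users;	/* refcount of installed break/watchpoints */

static void monitor_mode_get(void)
{
	if (monitor_users++ == 0)
		monitor_enabled = 1;	/* would call enable_monitor_mode() */
}

static void monitor_mode_put(void)
{
	assert(monitor_users > 0);	/* unbalanced put is a bug */
	if (--monitor_users == 0)
		monitor_enabled = 0;	/* would call disable_monitor_mode() */
}
```

With this shape, installing two breakpoints and removing one leaves monitor mode on; removing the second turns it off.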
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
On Mon, Jun 01, 2015 at 09:47:46AM +0200, Christian Borntraeger wrote:
> Am 28.05.2015 um 20:49 schrieb Christoffer Dall:
>> Until now we have been calling kvm_guest_exit after re-enabling interrupts when we come back from the guest, but this has the unfortunate effect that CPU time accounting done in the context of timer interrupts occurring while the guest is running doesn't properly notice that the time since the last tick was spent in the guest.
>
> Can you verify that a CPU bound guest has almost zero guest time? Assuming that your answer is yes, your patch makes sense, as host timer interrupts should be the only reason for guest exits then.

Yes, 'cat /dev/urandom > /dev/null' in the guest shows up as 100% sys in mpstat on the host, 0% guest.

>> Inspired by the comment in the x86 code, move the kvm_guest_exit() call below the local_irq_enable() call and change __kvm_guest_exit() to kvm_guest_exit(), because we are now calling this function with interrupts enabled. We have to now explicitly disable preemption and not enable preemption before we've called kvm_guest_exit(), since otherwise we could be preempted and everything happening before we eventually get scheduled again would be accounted for as guest time.
>>
>> At the same time, move the trace_kvm_exit() call outside of the atomic section, since there is no reason for us to do that with interrupts disabled.
>>
>> Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
>> ---
>> This patch is based on kvm/queue, because it has the kvm_guest_enter/exit rework recently posted by Christian Borntraeger.
>>
>> I hope I got the logic of this right, there were 2 slightly worrying facts about this:
>>
>> First, we now enable and disable and enable interrupts on each exit path, but I couldn't see any performance overhead on hackbench - yes the only benchmark we care about.
>
> This should be somewhat similar to the situation before my patch. There it was
> 1: disable, guest, disable again and save, restore to disable, enable
> and now it is
> 2: disable, guest, enable
> and with your patch it is
> 3: disable, guest, enable, disable, enable
> I assume that 3 and 1 are similar in their costs, so this is probably ok.
>
>> Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures.
>
> Not an expert here, but I assume mips has the same logic as arm, so if your patch is right for arm it's probably also right for mips. powerpc looks similar to what s390 does (not using the tick; instead it uses a hw-timer), so this should be fine.

I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get this for free, which would avoid the need for this patch?

Thanks,
-Christoffer
Re: [PATCH 07/15] KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
On 30/05/2015 12:59, Xiao Guangrong wrote: - kvm_mtrr_get_guest_memory_type() only checks one page in MTRRs so that it's unnecessary to check to see if the range is partially covered in MTRR - optimize the check of overlap memory type and add some comments to explain the precedence Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com --- arch/x86/kvm/mtrr.c | 89 ++--- 1 file changed, 44 insertions(+), 45 deletions(-) diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c index bc9c6da..d3c06d2 100644 --- a/arch/x86/kvm/mtrr.c +++ b/arch/x86/kvm/mtrr.c @@ -231,24 +231,16 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) return 0; } -/* - * The function is based on mtrr_type_lookup() in - * arch/x86/kernel/cpu/mtrr/generic.c - */ -static int get_mtrr_type(struct kvm_mtrr *mtrr_state, - u64 start, u64 end) +u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn) { - u64 base, mask; - u8 prev_match, curr_match; - int i, num_var_ranges = KVM_NR_VAR_MTRR; + struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; + u64 base, mask, start = gfn_to_gpa(gfn); + int i, num_var_ranges = KVM_NR_VAR_MTRR, type_mask, type = -1; Do not mix initialized and uninitialized variables on the same line (preexisting, I know, but let's fix it instead of making it worse :)). Please put each initialized variable on a separate line. Also please initialize type_mask here (more on this below). /* MTRR is completely disabled, use UC for all of physical memory. */ if (!mtrr_state->mtrr_enabled) return MTRR_TYPE_UNCACHABLE; - /* Make end inclusive end, instead of exclusive */ - end--; - /* Look in fixed ranges.
Just return the type as per start */ if (mtrr_state->fixed_mtrr_enabled && (start < 0x100000)) { int idx; @@ -273,9 +265,9 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state, * Look of multiple ranges matching this address and pick type * as per MTRR precedence */ - prev_match = 0xFF; + type_mask = (1 << MTRR_TYPE_WRBACK) | (1 << MTRR_TYPE_WRTHROUGH); for (i = 0; i < num_var_ranges; ++i) { - unsigned short start_state, end_state; + int curr_type; if (!(mtrr_state->var_ranges[i].mask & (1 << 11))) continue; @@ -283,50 +275,57 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state, base = mtrr_state->var_ranges[i].base & PAGE_MASK; mask = mtrr_state->var_ranges[i].mask & PAGE_MASK; - start_state = ((start & mask) == (base & mask)); - end_state = ((end & mask) == (base & mask)); - if (start_state != end_state) - return 0xFE; - if ((start & mask) != (base & mask)) continue; - curr_match = mtrr_state->var_ranges[i].base & 0xff; - if (prev_match == 0xFF) { - prev_match = curr_match; + /* + * Please refer to Intel SDM Volume 3: 11.11.4.1 MTRR + * Precedences. + */ + + curr_type = mtrr_state->var_ranges[i].base & 0xff; + if (type == -1) { + type = curr_type; continue; } - if (prev_match == MTRR_TYPE_UNCACHABLE || - curr_match == MTRR_TYPE_UNCACHABLE) + /* + * If two or more variable memory ranges match and the + * memory types are identical, then that memory type is + * used. + */ + if (type == curr_type) + continue; + + /* + * If two or more variable memory ranges match and one of + * the memory types is UC, the UC memory type used. + */ + if (curr_type == MTRR_TYPE_UNCACHABLE) return MTRR_TYPE_UNCACHABLE; - if ((prev_match == MTRR_TYPE_WRBACK && - curr_match == MTRR_TYPE_WRTHROUGH) || - (prev_match == MTRR_TYPE_WRTHROUGH && - curr_match == MTRR_TYPE_WRBACK)) { - prev_match = MTRR_TYPE_WRTHROUGH; - curr_match = MTRR_TYPE_WRTHROUGH; + /* + * If two or more variable memory ranges match and the + * memory types are WT and WB, the WT memory type is used.
+ */ + if (((1 << type) & type_mask) && + ((1 << curr_type) & type_mask)) { Please inline the definition of type_mask in the if, or rename it to wt_wb_mask and make it const. Or another suggestion below... + type = MTRR_TYPE_WRTHROUGH; + continue; } -
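The precedence rules spelled out in the comments above (identical types merge, UC always wins, WT beats WB) can be folded into one small pure function. This is an illustrative sketch only - `resolve_mtrr_types()` is a hypothetical name, not the kernel helper under review - but the type constants and the rules follow the Intel SDM section the patch cites.

```c
#include <assert.h>

enum {
    MTRR_TYPE_UNCACHABLE = 0,
    MTRR_TYPE_WRCOMB     = 1,
    MTRR_TYPE_WRTHROUGH  = 4,
    MTRR_TYPE_WRPROT     = 5,
    MTRR_TYPE_WRBACK     = 6,
};

/* Fold the type of one more matching variable range into the running
 * result, per Intel SDM vol. 3, 11.11.4.1 MTRR Precedences:
 *  - first match: take its type
 *  - identical types: keep the type
 *  - any UC: the result is UC
 *  - WT + WB in either order: the result is WT
 *  - any other combination: undefined (-1 here)
 */
static int resolve_mtrr_types(int type, int curr_type)
{
    const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK) |
                           (1 << MTRR_TYPE_WRTHROUGH);

    if (type == -1)
        return curr_type;
    if (type == curr_type)
        return type;
    if (type == MTRR_TYPE_UNCACHABLE || curr_type == MTRR_TYPE_UNCACHABLE)
        return MTRR_TYPE_UNCACHABLE;
    if (((1 << type) & wt_wb_mask) && ((1 << curr_type) & wt_wb_mask))
        return MTRR_TYPE_WRTHROUGH;
    return -1; /* undefined per the SDM */
}
```

Naming the WT/WB mask `wt_wb_mask` and making it `const`, as suggested in the review, keeps the final `if` self-explanatory.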
Re: [PATCH 14/15] KVM: MTRR: do not map huge page for non-consistent range
On 01/06/2015 11:36, Paolo Bonzini wrote: Does this have a performance impact on shadow? Perhaps we could cache in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs? Nevermind, patch 15 answers my question. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
Am 01.06.2015 um 11:08 schrieb Christoffer Dall: Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures. Not an expert here, but I assume mips has the same logic as arm, so if your patch is right for arm it's probably also right for mips. powerpc looks similar to what s390 does (not using the tick; instead it uses a hw-timer), so this should be fine. I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get this for free, which would avoid the need for this patch? Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to HAVE_VIRT_CPU_ACCOUNTING on s390/power in respect to not relying on ticks - yes, it might work out. Can you give it a try? Christian
Re: [PATCH 14/15] KVM: MTRR: do not map huge page for non-consistent range
On 30/05/2015 12:59, Xiao Guangrong wrote: Currently guest MTRR is completely prohibited if cache snoop is supported on IOMMU (!noncoherent_dma) and host does the emulation based on the knowledge from host side; however, the host side is not a good vantage point to know what the guest's purpose is. A good example is that a passed-through VGA frame buffer is not always UC as the host expects Can you explain how? The original idea was that such a framebuffer would be kvm_is_reserved_pfn and thus be unconditionally UC. +bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, + int page_num) +{ + struct mtrr_looker looker; + struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; + u64 start = gfn_to_gpa(gfn), end = gfn_to_gpa(gfn + page_num); + int type = -1; + + mtrr_for_each_mem_type(&looker, mtrr_state, start, end) { + if (type == -1) { + type = looker.mem_type; + continue; + } + + if (type != looker.mem_type) + return false; + } + + if ((type != -1) && looker.partial_map && + (mtrr_state->def_type != type)) + return false; + No Pascal-like parentheses. Does this have a performance impact on shadow? Perhaps we could cache in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs? Paolo
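The check above refuses a huge mapping when the gfn range spans more than one memory type. Stripped of the MTRR iterator, the core test reduces to the sketch below, where a flat array of per-page types stands in for what `mtrr_for_each_mem_type()` would yield; `range_is_consistent()` is a hypothetical name for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A huge mapping is allowed only if every small page in the range
 * resolves to the same memory type.  The kernel walks MTRR segments
 * instead of an array, but the fold is the same.
 */
static bool range_is_consistent(const int *page_types, size_t page_num)
{
    int type = -1;

    for (size_t i = 0; i < page_num; i++) {
        if (type == -1)
            type = page_types[i];      /* first page sets the type */
        else if (type != page_types[i])
            return false;              /* mixed types: no huge page */
    }
    return true;
}
```

The real helper additionally has to handle a partially mapped range, where the remainder falls back to the default type - the `looker.partial_map` clause Paolo comments on.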
Re: [PATCH] KVM: arm: vgic: Drop useless Group0 warning
On Fri, May 29, 2015 at 05:30:15PM +0100, Marc Zyngier wrote: If a GICv3-enabled guest tries to configure Group0, we print a warning on the console (because we don't support Group0 interrupts). This is fairly pointless, and would allow a guest to spam the console. Let's just drop the warning. Signed-off-by: Marc Zyngier marc.zyng...@arm.com Acked-by: Christoffer Dall christoffer.d...@linaro.org
Re: [PATCH v2 01/11] KVM: arm: plug guest debug exploit
Hi Zhichao, On 31/05/15 05:27, Zhichao Huang wrote: Hardware debugging in guests is not intercepted currently, it means that a malicious guest can bring down the entire machine by writing to the debug registers. This patch enable trapping of all debug registers, preventing the guests to mess with the host state. However, it is a precursor for later patches which will need to do more to world switch debug states while necessary. Cc: sta...@vger.kernel.org Signed-off-by: Zhichao Huang zhichao.hu...@linaro.org --- arch/arm/include/asm/kvm_coproc.h | 3 +- arch/arm/kvm/coproc.c | 60 +++ arch/arm/kvm/handle_exit.c| 4 +-- arch/arm/kvm/interrupts_head.S| 2 +- 4 files changed, 59 insertions(+), 10 deletions(-) diff --git a/arch/arm/include/asm/kvm_coproc.h b/arch/arm/include/asm/kvm_coproc.h index 4917c2f..e74ab0f 100644 --- a/arch/arm/include/asm/kvm_coproc.h +++ b/arch/arm/include/asm/kvm_coproc.h @@ -31,7 +31,8 @@ void kvm_register_target_coproc_table(struct kvm_coproc_target_table *table); int kvm_handle_cp10_id(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp_0_13_access(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run); -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run); int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run); diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c index f3d88dc..2e12760 100644 --- a/arch/arm/kvm/coproc.c +++ b/arch/arm/kvm/coproc.c @@ -91,12 +91,6 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run) return 1; } -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run) -{ - kvm_inject_undefined(vcpu); - return 1; -} - static void reset_mpidr(struct kvm_vcpu *vcpu, const struct 
coproc_reg *r) { /* @@ -519,6 +513,60 @@ int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run) return emulate_cp15(vcpu, &params); } +/** + * kvm_handle_cp14_64 -- handles a mrrc/mcrr trap on a guest CP14 access + * @vcpu: The VCPU pointer + * @run: The kvm_run struct + */ +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run) +{ + struct coproc_params params; + + params.CRn = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf; + params.Rt1 = (kvm_vcpu_get_hsr(vcpu) >> 5) & 0xf; + params.is_write = ((kvm_vcpu_get_hsr(vcpu) & 1) == 0); + params.is_64bit = true; + + params.Op1 = (kvm_vcpu_get_hsr(vcpu) >> 16) & 0xf; + params.Op2 = 0; + params.Rt2 = (kvm_vcpu_get_hsr(vcpu) >> 10) & 0xf; + params.CRm = 0; + + /* raz_wi */ + (void)pm_fake(vcpu, &params, NULL); + + /* handled */ + kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); + return 1; +} + +/** + * kvm_handle_cp14_32 -- handles a mrc/mcr trap on a guest CP14 access + * @vcpu: The VCPU pointer + * @run: The kvm_run struct + */ +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run) +{ + struct coproc_params params; + + params.CRm = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf; + params.Rt1 = (kvm_vcpu_get_hsr(vcpu) >> 5) & 0xf; + params.is_write = ((kvm_vcpu_get_hsr(vcpu) & 1) == 0); + params.is_64bit = false; + + params.CRn = (kvm_vcpu_get_hsr(vcpu) >> 10) & 0xf; + params.Op1 = (kvm_vcpu_get_hsr(vcpu) >> 14) & 0x7; + params.Op2 = (kvm_vcpu_get_hsr(vcpu) >> 17) & 0x7; + params.Rt2 = 0; + + /* raz_wi */ + (void)pm_fake(vcpu, &params, NULL); + + /* handled */ + kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu)); + return 1; +} + /** * Userspace API */ diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c index 95f12b2..357ad1b 100644 --- a/arch/arm/kvm/handle_exit.c +++ b/arch/arm/kvm/handle_exit.c @@ -104,9 +104,9 @@ static exit_handle_fn arm_exit_handlers[] = { [HSR_EC_WFI]= kvm_handle_wfx, [HSR_EC_CP15_32]= kvm_handle_cp15_32, [HSR_EC_CP15_64]= kvm_handle_cp15_64, - [HSR_EC_CP14_MR]= kvm_handle_cp14_access, + [HSR_EC_CP14_MR]= 
kvm_handle_cp14_32, [HSR_EC_CP14_LS]= kvm_handle_cp14_load_store, - [HSR_EC_CP14_64]= kvm_handle_cp14_access, + [HSR_EC_CP14_64]= kvm_handle_cp14_64, [HSR_EC_CP_0_13]= kvm_handle_cp_0_13_access, [HSR_EC_CP10_ID]= kvm_handle_cp10_id, [HSR_EC_SVC_HYP]= handle_svc_hyp, diff --git
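The handlers in the patch decode the trap syndrome with shift-and-mask extractions. The standalone sketch below mirrors the 32-bit MCR/MRC decoding (`CRm` at bit 1, `Rt` at 5, `CRn` at 10, `Op1` at 14, `Op2` at 17, direction in bit 0) on a plain `uint32_t`; `decode_cp_32()` is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct coproc_params_sketch {
    unsigned CRn, CRm, Op1, Op2, Rt1;
    bool is_write;
};

/* Decode an MCR/MRC (32-bit) coprocessor trap from a Hyp Syndrome
 * Register value, using the same bit positions as kvm_handle_cp14_32()
 * in the patch above.
 */
static struct coproc_params_sketch decode_cp_32(uint32_t hsr)
{
    struct coproc_params_sketch p;

    p.CRm      = (hsr >> 1)  & 0xf;
    p.Rt1      = (hsr >> 5)  & 0xf;
    p.CRn      = (hsr >> 10) & 0xf;
    p.Op1      = (hsr >> 14) & 0x7;
    p.Op2      = (hsr >> 17) & 0x7;
    p.is_write = (hsr & 1) == 0;   /* bit 0 set means a read (MRC) */
    return p;
}
```

The 64-bit MCRR/MRRC form differs only in which fields are defined: `Op1` sits at bit 16, the second transfer register at bit 10, and `CRn`/`Op2` are unused.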
Re: [PATCH kernel v11 25/34] powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages
On Fri, May 29, 2015 at 06:44:49PM +1000, Alexey Kardashevskiy wrote: This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves the code which allocates the actual TCE tables to helpers: pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages(). These do not allocate/free the iommu_table struct. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH 05/15] KVM: MTRR: clean up mtrr default type
On 30/05/2015 12:59, Xiao Guangrong wrote: Use union definition to avoid the decode/code workload and drop all the hard code Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com --- arch/x86/include/asm/kvm_host.h | 12 ++-- arch/x86/kvm/mtrr.c | 19 --- 2 files changed, 18 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 2c3c52d..95ce2ff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -347,8 +347,16 @@ enum { struct kvm_mtrr { struct mtrr_var_range var_ranges[KVM_NR_VAR_MTRR]; mtrr_type fixed_ranges[KVM_NR_FIXED_MTRR_REGION]; - unsigned char enabled; - mtrr_type def_type; + + union { + u64 deftype; + struct { + unsigned def_type:8; + unsigned reserved:2; + unsigned fixed_mtrr_enabled:1; + unsigned mtrr_enabled:1; + }; + }; }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c index 562341b..6de49dd 100644 --- a/arch/x86/kvm/mtrr.c +++ b/arch/x86/kvm/mtrr.c @@ -105,7 +105,6 @@ EXPORT_SYMBOL_GPL(kvm_mtrr_valid); static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) { struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; - unsigned char mtrr_enabled = mtrr_state->enabled; gfn_t start, end, mask; int index; bool is_fixed = true; @@ -114,7 +113,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) !kvm_arch_has_noncoherent_dma(vcpu->kvm)) return; - if (!(mtrr_enabled & 0x2) && msr != MSR_MTRRdefType) + if (!mtrr_state->mtrr_enabled && msr != MSR_MTRRdefType) I know Linus doesn't like bitfields too much. Can you change these to inline functions, and only leave an u64 deftype in the struct?
Paolo return; switch (msr) { @@ -153,7 +152,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) end = ((start & mask) | ~mask) + 1; } - if (is_fixed && !(mtrr_enabled & 0x1)) + if (is_fixed && !mtrr_state->fixed_mtrr_enabled) return; kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end)); @@ -166,10 +165,9 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (!kvm_mtrr_valid(vcpu, msr, data)) return 1; - if (msr == MSR_MTRRdefType) { - vcpu->arch.mtrr_state.def_type = data; - vcpu->arch.mtrr_state.enabled = (data & 0xc00) >> 10; - } else if (msr == MSR_MTRRfix64K_00000) + if (msr == MSR_MTRRdefType) + vcpu->arch.mtrr_state.deftype = data; + else if (msr == MSR_MTRRfix64K_00000) p[0] = data; else if (msr == MSR_MTRRfix16K_80000 || msr == MSR_MTRRfix16K_A0000) p[1 + msr - MSR_MTRRfix16K_80000] = data; @@ -216,8 +214,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) return 1; if (msr == MSR_MTRRdefType) - *pdata = vcpu->arch.mtrr_state.def_type + - (vcpu->arch.mtrr_state.enabled << 10); + *pdata = vcpu->arch.mtrr_state.deftype; else if (msr == MSR_MTRRfix64K_00000) *pdata = p[0]; else if (msr == MSR_MTRRfix16K_80000 || msr == MSR_MTRRfix16K_A0000) @@ -256,14 +253,14 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state, int i, num_var_ranges = KVM_NR_VAR_MTRR; /* MTRR is completely disabled, use UC for all of physical memory. */ - if (!(mtrr_state->enabled & 0x2)) + if (!mtrr_state->mtrr_enabled) return MTRR_TYPE_UNCACHABLE; /* Make end inclusive end, instead of exclusive */ end--; /* Look in fixed ranges. Just return the type as per start */ - if ((mtrr_state->enabled & 0x1) && (start < 0x100000)) { + if (mtrr_state->fixed_mtrr_enabled && (start < 0x100000)) { int idx; if (start < 0x80000) {
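Paolo's suggestion - keep a plain `u64 deftype` and replace the bitfields with inline accessors - might look like the sketch below. The function names are hypothetical; the MSR layout itself (default type in bits 7:0, fixed-range enable in bit 10, MTRR enable in bit 11) is architectural and matches the `(data & 0xc00) >> 10` decoding being removed by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MTRRdefType layout: [7:0] default memory type, [10] fixed-range
 * enable (FE), [11] MTRR enable (E).  Plain accessors over a raw u64
 * replace the anonymous-union bitfields from the patch.
 */
struct kvm_mtrr_sketch { uint64_t deftype; };

static inline uint8_t mtrr_default_type(const struct kvm_mtrr_sketch *m)
{
    return m->deftype & 0xff;
}

static inline bool fixed_mtrr_is_enabled(const struct kvm_mtrr_sketch *m)
{
    return m->deftype & (1ULL << 10);
}

static inline bool mtrr_is_enabled(const struct kvm_mtrr_sketch *m)
{
    return m->deftype & (1ULL << 11);
}
```

With this shape, set/get of MSR_MTRRdefType becomes a straight store/load of `deftype`, and no decode/encode step is needed anywhere.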
Re: [PATCH 10/15] KVM: MTRR: sort variable MTRRs
On 30/05/2015 12:59, Xiao Guangrong wrote: Sort all valid variable MTRRs based on their base address; it will help us to check whether a range is fully contained in variable MTRRs Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com --- arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/mtrr.c | 39 +++ arch/x86/kvm/x86.c | 2 +- arch/x86/kvm/x86.h | 1 + 4 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f3fc152..5be8f2e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -347,6 +347,7 @@ enum { struct kvm_mtrr_range { u64 base; u64 mask; + struct list_head node; }; struct kvm_mtrr { @@ -362,6 +363,8 @@ struct kvm_mtrr { unsigned mtrr_enabled:1; }; }; + + struct list_head head; }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c index aeb9767..8e3b81a 100644 --- a/arch/x86/kvm/mtrr.c +++ b/arch/x86/kvm/mtrr.c @@ -262,6 +262,38 @@ do_zap: kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end)); } +static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range) +{ + u64 start, end; + + if (!(range->mask & (1 << 11))) + return false; + + var_mtrr_range(range, &start, &end); + return end > start; +} + +static void set_var_mtrr_start(struct kvm_mtrr *mtrr_state, int index) +{ + /* remove the entry if it's in the list. */ + if (var_mtrr_range_is_valid(&mtrr_state->var_ranges[index])) + list_del(&mtrr_state->var_ranges[index].node); +} + +static void set_var_mtrr_end(struct kvm_mtrr *mtrr_state, int index) +{ + struct kvm_mtrr_range *tmp, *cur = &mtrr_state->var_ranges[index]; + + /* add it to the list if it's valid.
*/ + if (var_mtrr_range_is_valid(&mtrr_state->var_ranges[index])) { + list_for_each_entry(tmp, &mtrr_state->head, node) + if (cur->base < tmp->base) + list_add_tail(&cur->node, &tmp->node); + + list_add_tail(&cur->node, &mtrr_state->head); + } +} + int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data) { int index; @@ -281,10 +313,12 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data) index = (msr - 0x200) / 2; is_mtrr_mask = msr - 0x200 - 2 * index; + set_var_mtrr_start(&vcpu->arch.mtrr_state, index); if (!is_mtrr_mask) vcpu->arch.mtrr_state.var_ranges[index].base = data; else vcpu->arch.mtrr_state.var_ranges[index].mask = data; + set_var_mtrr_end(&vcpu->arch.mtrr_state, index); Just move the whole code to a new kvm_set_var_mtrr function. } update_mtrr(vcpu, msr); @@ -331,6 +365,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) return 0; } +void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu) +{ + INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head); +} + u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn) { struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 64c2891..8084ac3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6991,7 +6991,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu) kvm_vcpu_reset(vcpu, false); kvm_mmu_setup(vcpu); vcpu_put(vcpu); - + kvm_vcpu_mtrr_init(vcpu); Please move this call much earlier.
Paolo return r; } diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 956558d..a3dae49 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -162,6 +162,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt, gva_t addr, void *val, unsigned int bytes, struct x86_exception *exception); +void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu); bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data); int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data); int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
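The ordering invariant the patch wants - variable ranges kept sorted by ascending base so that containment checks can walk them in order - boils down to a sorted insert. Here is a minimal sketch using a plain singly linked list in place of the kernel's `struct list_head`; names and the pointer-to-pointer walk are illustrative, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

struct range_node {
    uint64_t base;
    struct range_node *next;
};

/* Insert 'cur' into a list kept sorted by ascending base.  Walking a
 * pointer-to-pointer avoids the special case for inserting at the
 * head, which the kernel's list_add_tail()-based version handles via
 * the circular list head instead.
 */
static void sorted_insert(struct range_node **head, struct range_node *cur)
{
    while (*head && (*head)->base < cur->base)
        head = &(*head)->next;
    cur->next = *head;
    *head = cur;
}

/* Build a small list out of order and verify the invariant holds. */
static int demo_sorted_ok(void)
{
    struct range_node a = {0x80000, 0}, b = {0x10000, 0}, c = {0x40000, 0};
    struct range_node *head = 0, *it;

    sorted_insert(&head, &a);
    sorted_insert(&head, &b);
    sorted_insert(&head, &c);

    for (it = head; it && it->next; it = it->next)
        if (it->base > it->next->base)
            return 0;
    return 1;
}
```

Folding the delete-reinsert pair into one `kvm_set_var_mtrr()` helper, as Paolo asks, also keeps the list update and the MSR write in a single place.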
Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type
On 30/05/2015 12:59, Xiao Guangrong wrote: +struct mtrr_looker { + /* input fields. */ + struct kvm_mtrr *mtrr_state; + u64 start; + u64 end; s/looker/iter/ or s/looker/lookup/ +static void mtrr_lookup_start(struct mtrr_looker *looker) +{ + looker->mem_type = -1; + + if (!looker->mtrr_state->mtrr_enabled) { + looker->partial_map = true; + return; + } + + if (!mtrr_lookup_fixed_start(looker)) + mtrr_lookup_var_start(looker); +} + Separating mtrr_lookup_start and mtrr_lookup_init is weird. There are common parts of mtrr_lookup_*_start and mtrr_lookup_*_next. For example this: + looker->mem_type = looker->mtrr_state->fixed_ranges[index]; + looker->start = fixed_mtrr_range_end_addr(seg, index); + return true; in mtrr_lookup_fixed_start is the same as this: + end = fixed_mtrr_range_end_addr(looker->seg, looker->index); + + /* switch to next segment. */ + if (end >= eseg->end) { + looker->seg++; + looker->index = 0; + + /* have looked up for all fixed MTRRs. */ + if (looker->seg >= ARRAY_SIZE(fixed_seg_table)) + return mtrr_lookup_var_start(looker); + + end = fixed_mtrr_range_end_addr(looker->seg, looker->index); + } + + looker->mem_type = mtrr_state->fixed_ranges[looker->index]; + looker->start = end; in mtrr_lookup_fixed_next. Can you try to make them more common? Basically you should have +#define mtrr_for_each_mem_type(_looker_, _mtrr_, _gpa_start_, _gpa_end_) \ + for (mtrr_lookup_init(_looker_, _mtrr_, _gpa_start_, _gpa_end_); \ + !mtrr_lookup_end(_looker_); mtrr_lookup_next(_looker_)) Paolo
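The init/end/next skeleton Paolo asks for composes naturally into a `for`-macro. The sketch below demonstrates the same pattern on a toy array instead of MTRR segments - `toy_*` names are purely illustrative - to show how the macro drives the three primitives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The init/end/next iterator shape: init positions the cursor, end
 * tests for exhaustion, next advances and refreshes mem_type.  The
 * kernel's mtrr_lookup_* helpers would follow the same skeleton,
 * walking fixed segments and then variable ranges.
 */
struct toy_iter {
    const int *vals;
    size_t n, pos;
    int mem_type;
};

static void toy_lookup_init(struct toy_iter *it, const int *vals, size_t n)
{
    it->vals = vals;
    it->n = n;
    it->pos = 0;
    it->mem_type = n ? vals[0] : -1;
}

static bool toy_lookup_end(const struct toy_iter *it)
{
    return it->pos >= it->n;
}

static void toy_lookup_next(struct toy_iter *it)
{
    it->pos++;
    if (it->pos < it->n)
        it->mem_type = it->vals[it->pos];
}

#define toy_for_each_mem_type(_it_, _vals_, _n_) \
    for (toy_lookup_init(_it_, _vals_, _n_); \
         !toy_lookup_end(_it_); toy_lookup_next(_it_))

/* Sum the types seen by one full iteration. */
static int demo_sum(void)
{
    static const int vals[] = {6, 6, 0};
    struct toy_iter it;
    int sum = 0;

    toy_for_each_mem_type(&it, vals, 3)
        sum += it.mem_type;
    return sum;
}
```

Because init establishes the same invariants that next maintains, the duplicated `*_start`/`*_next` bodies Paolo quotes collapse into one shared helper.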
Re: [PATCH v6 2/8] tun: add tun_is_little_endian() helper
On Fri, Apr 24, 2015 at 02:24:38PM +0200, Greg Kurz wrote: Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com Dave, could you please ack merging this through the virtio tree? --- drivers/net/tun.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 857dca4..3c3d6c0 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -206,14 +206,19 @@ struct tun_struct { u32 flow_count; }; +static inline bool tun_is_little_endian(struct tun_struct *tun) +{ + return tun->flags & TUN_VNET_LE; +} + static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val) { - return __virtio16_to_cpu(tun->flags & TUN_VNET_LE, val); + return __virtio16_to_cpu(tun_is_little_endian(tun), val); } static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val) { - return __cpu_to_virtio16(tun->flags & TUN_VNET_LE, val); + return __cpu_to_virtio16(tun_is_little_endian(tun), val); } static inline u32 tun_hashfn(u32 rxhash) ___ Virtualization mailing list virtualizat...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 10/12] KVM: arm64: guest debug, HW assisted debug support
Hi Alex, On Fri, May 29, 2015 at 10:30:26AM +0100, Alex Bennée wrote: This adds support for userspace to control the HW debug registers for guest debug. In the debug ioctl we copy the IMPDEF defined number of registers into a new register set called host_debug_state. There is now a new vcpu parameter called debug_ptr which selects which register set is to copied into the real registers when world switch occurs. I've moved some helper functions into the hw_breakpoint.h header for re-use. As with single step we need to tweak the guest registers to enable the exceptions so we need to save and restore those bits. Two new capabilities have been added to the KVM_EXTENSION ioctl to allow userspace to query the number of hardware break and watch points available on the host hardware. Signed-off-by: Alex Bennée alex.ben...@linaro.org [...] diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 33c8143..ada57df 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2668,7 +2668,7 @@ The top 16 bits of the control field are architecture specific control flags which can include the following: - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64] - - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390] + - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64] - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86] - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86] - KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390] @@ -2683,6 +2683,11 @@ updated to the correct (supplied) values. The second part of the structure is architecture specific and typically contains a set of debug registers. +For arm64 the number of debug registers is implementation defined and +can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and +KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number +indicating the number of supported registers. 
+ When debug events exit the main run loop with the reason KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run structure containing architecture specific debug information. diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 0d17c7b..6df47c1 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -307,6 +307,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) #define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE |\ KVM_GUESTDBG_USE_SW_BP | \ + KVM_GUESTDBG_USE_HW_BP | \ KVM_GUESTDBG_SINGLESTEP) /** @@ -327,6 +328,12 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, if (dbg->control & KVM_GUESTDBG_ENABLE) { vcpu->guest_debug = dbg->control; + + /* Hardware assisted Break and Watch points */ + if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) { + vcpu->arch.external_debug_state = dbg->arch; + } + } else { /* If not enabled clear all flags */ vcpu->guest_debug = 0; diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h index 52b484b..c450552 100644 --- a/arch/arm64/include/asm/hw_breakpoint.h +++ b/arch/arm64/include/asm/hw_breakpoint.h @@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct task_struct *task) } #endif +/* Determine number of BRP registers available. */ +static inline int get_num_brps(void) +{ + return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1; +} + +/* Determine number of WRP registers available. */ +static inline int get_num_wrps(void) +{ + return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1; +} I'm fine with moving these into the header file, but we should probably revisit this at some point so that we use a shadow value to indicate the minimum number of registers across the system for e.g. big.LITTLE platforms that don't have uniform debug resources.
extern struct pmu perf_ops_bp; #endif /* __KERNEL__ */ diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index e5040b6..498d4f7 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -113,12 +113,13 @@ struct kvm_vcpu_arch { /* * For debugging the guest we need to keep a set of debug - * registers which can override the guests own debug state + * registers which can override the guest's own debug state * while being used. These are set via the KVM_SET_GUEST_DEBUG * ioctl. */ struct kvm_guest_debug_arch *debug_ptr; struct kvm_guest_debug_arch vcpu_debug_state; + struct kvm_guest_debug_arch external_debug_state; /* Pointer to host CPU context */ kvm_cpu_context_t *host_cpu_context; diff
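The two helpers from the patch extract a 4-bit "count minus one" field from ID_AA64DFR0_EL1 - BRPs at bits [15:12], WRPs at bits [23:20] - and add one. The sketch below mirrors that on a plain value instead of the `read_cpuid()` system-register access; the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ID_AA64DFR0_EL1 encodes the number of implemented hardware
 * breakpoints minus one in BRPs [15:12] and watchpoints minus one
 * in WRPs [23:20]; both helpers therefore add 1 after masking.
 */
static int num_brps(uint64_t id_aa64dfr0)
{
    return (int)((id_aa64dfr0 >> 12) & 0xf) + 1;
}

static int num_wrps(uint64_t id_aa64dfr0)
{
    return (int)((id_aa64dfr0 >> 20) & 0xf) + 1;
}
```

These are the values a KVM_CAP_GUEST_DEBUG_HW_BPS/WPS query would report; as Will notes, on asymmetric big.LITTLE systems the reported number should really be the minimum across all CPUs, not whatever CPU the query happens to run on.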
Re: [PATCH 08/15] KVM: MTRR: introduce fixed_mtrr_segment table
On 30/05/2015 12:59, Xiao Guangrong wrote: This table summarizes the information of fixed MTRRs and introduces some APIs to abstract its operation, which helps us to clean up the code and will be used in later patches Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com --- arch/x86/kvm/mtrr.c | 191 ++-- 1 file changed, 142 insertions(+), 49 deletions(-) diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c index d3c06d2..888441e 100644 --- a/arch/x86/kvm/mtrr.c +++ b/arch/x86/kvm/mtrr.c @@ -102,12 +102,126 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data) } EXPORT_SYMBOL_GPL(kvm_mtrr_valid); +struct fixed_mtrr_segment { + u64 start; + u64 end; + + /* + * unit corresponds to the MSR entry in this segment, the size + * of unit is covered in one MSR. + */ + u64 unit_size; + + /* a range is covered in one memory cache type. */ + u64 range_size; + + /* the start position in kvm_mtrr.fixed_ranges[]. */ + int range_start; +}; + +static struct fixed_mtrr_segment fixed_seg_table[] = { + /* MSR_MTRRfix64K_00000, 1 unit. 64K fixed mtrr. */ + { + .start = 0x0, + .end = 0x80000, + .unit_size = 0x80000, Unit size is always range size * 8. + .range_size = 1ULL << 16, Perhaps 64 * 1024 (and so on below) is clearer, because it matches the name of the MSR? + .range_start = 0, + }, + + /* + * MSR_MTRRfix16K_80000 ... MSR_MTRRfix16K_A0000, 2 units, + * 16K fixed mtrr. + */ + { + .start = 0x80000, + .end = 0xc0000, + .unit_size = (0xc0000 - 0x80000) / 2, + .range_size = 1ULL << 14, + .range_start = 8, + }, + + /* + * MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000, 8 units, + * 4K fixed mtrr. + */ + { + .start = 0xc0000, + .end = 0x100000, + .unit_size = (0x100000 - 0xc0000) / 8, + .range_size = 1ULL << 12, + .range_start = 24, + } }; +static int fixed_msr_to_range_index(u32 msr) +{ + int seg, unit; + + if (!fixed_msr_to_seg_unit(msr, &seg, &unit)) + return -1; + + fixed_mtrr_seg_unit_range_index(seg, unit); This looks wrong.
+ return 0; +} + static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) { struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state; gfn_t start, end, mask; int index; - bool is_fixed = true; if (msr == MSR_IA32_CR_PAT || !tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm)) @@ -116,32 +230,19 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) if (!mtrr_state->mtrr_enabled && msr != MSR_MTRRdefType) return; + if (fixed_msr_to_range(msr, &start, &end)) { + if (!mtrr_state->fixed_mtrr_enabled) + return; + goto do_zap; + } + switch (msr) { Please move defType handling in an if above if (fixed_msr_to_range(msr, &start, &end)). Then you don't need the goto. Paolo - case MSR_MTRRfix64K_00000: - start = 0x0; - end = 0x80000; - break; - case MSR_MTRRfix16K_80000: - start = 0x80000; - end = 0xa0000; - break; - case MSR_MTRRfix16K_A0000: - start = 0xa0000; - end = 0xc0000; - break; - case MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000: - index = msr - MSR_MTRRfix4K_C0000; - start = 0xc0000 + index * (32 << 10); - end = start + (32 << 10); - break; case MSR_MTRRdefType: - is_fixed = false; start = 0x0; end = ~0ULL; break; default: /* variable range MTRRs. */ - is_fixed = false; index = (msr - 0x200) / 2; start = mtrr_state->var_ranges[index].base & PAGE_MASK; mask = mtrr_state->var_ranges[index].mask & PAGE_MASK; @@ -150,38 +251,33 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr) end = ((start & mask) | ~mask) + 1; } - if (is_fixed && !mtrr_state->fixed_mtrr_enabled) - return; - +do_zap: kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end)); } int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data) { - u64 *p = (u64 *)vcpu->arch.mtrr_state.fixed_ranges; + int index; if (!kvm_mtrr_valid(vcpu, msr, data)) return 1; - if (msr == MSR_MTRRdefType + index = fixed_msr_to_range_index(msr); + if (index >= 0) + *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index] = data; + else if (msr == MSR_MTRRdefType)
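The segment table partitions the fixed-range space: one 64K MSR, two 16K MSRs, and eight 4K MSRs, each MSR packing eight one-byte range types, so a segment owning N MSRs covers N * 8 slots of `fixed_ranges[]` starting at `range_start` (0, 8, 24 - totalling 88 entries). The sketch below shows the (segment, unit) to range-index mapping that `fixed_msr_to_range_index()` should return instead of 0, which is what the "This looks wrong" comment flags; the structure and helper names here are illustrative, not the patch's exact code.

```c
#include <assert.h>
#include <stdint.h>

struct fixed_seg_sketch {
    uint64_t start, end;  /* physical address span of the segment  */
    uint64_t range_size;  /* bytes covered by one fixed range      */
    int      range_start; /* first index owned in fixed_ranges[]   */
};

/* The three fixed-MTRR segments, mirroring the table in the patch. */
static const struct fixed_seg_sketch seg_table[] = {
    { 0x00000, 0x80000,  64 * 1024,  0 },  /* MSR_MTRRfix64K_00000          */
    { 0x80000, 0xc0000,  16 * 1024,  8 },  /* fix16K_80000 .. fix16K_A0000  */
    { 0xc0000, 0x100000,  4 * 1024, 24 },  /* fix4K_C0000  .. fix4K_F8000   */
};

/* Each MSR (one "unit") holds 8 range types, so the MSR's first slot
 * in fixed_ranges[] is range_start + unit * 8.
 */
static int seg_unit_range_index(int seg, int unit)
{
    return seg_table[seg].range_start + unit * 8;
}
```

With that in place, the MSR write path becomes `fixed_ranges[index] = data` for eight consecutive entries, with no per-MSR special casing.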
Re: [PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper
On Fri, Apr 24, 2015 at 02:24:48PM +0200, Greg Kurz wrote: Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com Dave, could you pls ack merging this through the virtio tree? --- drivers/net/macvtap.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c index 27ecc5c..a2f2958 100644 --- a/drivers/net/macvtap.c +++ b/drivers/net/macvtap.c @@ -49,14 +49,19 @@ struct macvtap_queue { #define MACVTAP_VNET_LE 0x80000000 +static inline bool macvtap_is_little_endian(struct macvtap_queue *q) +{ + return q->flags & MACVTAP_VNET_LE; +} + static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val) { - return __virtio16_to_cpu(q->flags & MACVTAP_VNET_LE, val); + return __virtio16_to_cpu(macvtap_is_little_endian(q), val); } static inline __virtio16 cpu_to_macvtap16(struct macvtap_queue *q, u16 val) { - return __cpu_to_virtio16(q->flags & MACVTAP_VNET_LE, val); + return __cpu_to_virtio16(macvtap_is_little_endian(q), val); } static struct proto macvtap_proto = {
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
On 05/30/2015 11:59 PM, Christoffer Dall wrote: Hi Mario, On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote: On 05/28/2015 11:49 AM, Christoffer Dall wrote: Until now we have been calling kvm_guest_exit after re-enabling interrupts when we come back from the guest, but this has the unfortunate effect that CPU time accounting done in the context of timer interrupts occurring while the guest is running doesn't properly notice that the time since the last tick was spent in the guest. Inspired by the comment in the x86 code, move the kvm_guest_exit() call below the local_irq_enable() call and change __kvm_guest_exit() to kvm_guest_exit(), because we are now calling this function with interrupts enabled. We have to now explicitly disable preemption and not enable preemption before we've called kvm_guest_exit(), since otherwise we could be preempted and everything happening before we eventually get scheduled again would be accounted for as guest time. At the same time, move the trace_kvm_exit() call outside of the atomic section, since there is no reason for us to do that with interrupts disabled. Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- This patch is based on kvm/queue, because it has the kvm_guest_enter/exit rework recently posted by Christian Borntraeger. I hope I got the logic of this right, there were 2 slightly worrying facts about this: First, we now enable and disable and enable interrupts on each exit path, but I couldn't see any performance overhead on hackbench - yes the only benchmark we care about. Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures. Changes since v1: - Tweak comment and commit text based on Marc's feedback. - Explicitly disable preemption and enable it only after kvm_guest_exit(). 
arch/arm/kvm/arm.c | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index e41cb11..fe8028d 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) kvm_vgic_flush_hwstate(vcpu); kvm_timer_flush_hwstate(vcpu); + preempt_disable(); local_irq_disable(); /* @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) { local_irq_enable(); + preempt_enable(); kvm_timer_sync_hwstate(vcpu); kvm_vgic_sync_hwstate(vcpu); continue; @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) ret = kvm_call_hyp(__kvm_vcpu_run, vcpu); vcpu->mode = OUTSIDE_GUEST_MODE; - __kvm_guest_exit(); - trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu)); + /* + * Back from guest + */ + /* * We may have taken a host interrupt in HYP mode (ie * while executing the guest). This interrupt is still @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) local_irq_enable(); /* - * Back from guest - */ + * We do local_irq_enable() before calling kvm_guest_exit() so + * that if a timer interrupt hits while running the guest we + * account that tick as being spent in the guest. We enable + * preemption after calling kvm_guest_exit() so that if we get + * preempted we make sure ticks after that is not counted as + * guest time. + */ + kvm_guest_exit(); + trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu)); + preempt_enable(); + kvm_timer_sync_hwstate(vcpu); kvm_vgic_sync_hwstate(vcpu); Hi Christoffer, so currently we take a snapshot when we enter the guest (tsk->vtime_snap) and upon exit add the time we spent in the guest and update accrued time, which appears correct. not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN. Or am I missing something obvious here? 
I see what you mean: we can't use cycle-based accounting to accrue guest time. With this patch it appears that interrupts running in host mode are accrued to guest time, and additional preemption latency is added. It is true that interrupt processing in host mode (if they hit on a CPU when it is running a guest) is accrued to guest time, but without this
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote: Am 01.06.2015 um 11:08 schrieb Christoffer Dall: Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures. Not an expert here, but I assume mips has the same logic as arm so if your patch is right for arm its probably also for mips. powerpc looks similar to what s390 does (not using the tick, instead it uses a hw-timer) so this should be fine. I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get this for free which would avoid the need for this patch? Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similar to HAVE_VIRT_CPU_ACCOUNTING on s390/power in respect to not rely on ticks - yes it might work out. Can you give it a try? Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but has no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly feels like a fix since it would be a shame to force users to use this config option to report CPU usage correctly. I'm not entirely sure what the history and meaning behind these configs are, so maybe there is an entirely different rework needed here. It seems logical that you could simply sample the counter at entry/exit of the guest, but if there is nowhere to store this data without NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN then I guess that would be why? Thanks, -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time
Am 01.06.2015 um 15:35 schrieb Christoffer Dall: On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote: Am 01.06.2015 um 11:08 schrieb Christoffer Dall: Second, looking at the ppc and mips code, they seem to also call kvm_guest_exit() before enabling interrupts, so I don't understand how guest CPU time accounting works on those architectures. Not an expert here, but I assume mips has the same logic as arm so if your patch is right for arm its probably also for mips. powerpc looks similar to what s390 does (not using the tick, instead it uses a hw-timer) so this should be fine. I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get this for free which would avoid the need for this patch? Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similar to HAVE_VIRT_CPU_ACCOUNTING on s390/power in respect to not rely on ticks - yes it might work out. Can you give it a try? Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but has no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly feels like a fix since it would be a shame to force users to use this config option to report CPU usage correctly. I'm not entirely sure what the history and meaning behind these configs are, so maybe there is an entirely different rework needed here. It seems logical that you could simply sample the counter at entry/exit of the guest, but if there is nowhere to store this data without NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN then I guess that would be why? Given Paolos response that irq_disable/enable is faster than save/restore at least on x86 your v2 patch might actually be the right thing to do. Christian
Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type
On 01/06/2015 11:33, Paolo Bonzini wrote: +looker->mem_type = looker->mtrr_state->fixed_ranges[index]; + looker->start = fixed_mtrr_range_end_addr(seg, index); + return true; in mtrr_lookup_fixed_start is the same as this: + end = fixed_mtrr_range_end_addr(looker->seg, looker->index); + + /* switch to next segment. */ + if (end >= eseg->end) { + looker->seg++; + looker->index = 0; + + /* have looked up for all fixed MTRRs. */ + if (looker->seg >= ARRAY_SIZE(fixed_seg_table)) + return mtrr_lookup_var_start(looker); + + end = fixed_mtrr_range_end_addr(looker->seg, looker->index); + } + + looker->mem_type = mtrr_state->fixed_ranges[looker->index]; + looker->start = end; in mtrr_lookup_fixed_next. Can you try to make them more common? Basically you should have +#define mtrr_for_each_mem_type(_looker_, _mtrr_, _gpa_start_, _gpa_end_) \ + for (mtrr_lookup_init(_looker_, _mtrr_, _gpa_start_, _gpa_end_); \ + !mtrr_lookup_end(_looker_); mtrr_lookup_next(_looker_)) ... where the above code I quoted would be part of mtrr_lookup_end. Paolo
[PULL] vhost: cleanups and fixes
The following changes since commit c65b99f046843d2455aa231747b5a07a999a9f3d: Linux 4.1-rc6 (2015-05-31 19:01:07 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to 8a7b19d8b542b87bccc3eaaf81dcc90a5ca48aea: include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h (2015-06-01 15:46:54 +0200) virtio: last-minute fix for 4.1 This tweaks an exported user-space header to fix build breakage for userspace using it. Signed-off-by: Michael S. Tsirkin m...@redhat.com Mikko Rapeli (1): include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h include/uapi/linux/virtio_balloon.h | 1 + 1 file changed, 1 insertion(+)
Re: [PULL] vhost: cleanups and fixes
The mail subject is wrong of course - the one in the pull request is correct. Sorry about that - this shouldn't interfere with merging it, I'll correct it for future requests. -- MST
Re: [PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper
From: Michael S. Tsirkin m...@redhat.com Date: Mon, 1 Jun 2015 12:30:16 +0200 On Fri, Apr 24, 2015 at 02:24:48PM +0200, Greg Kurz wrote: Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com Dave, could you pls ack merging this through the virtio tree? Acked-by: David S. Miller da...@davemloft.net
Re: [PATCH v6 2/8] tun: add tun_is_little_endian() helper
From: Michael S. Tsirkin m...@redhat.com Date: Mon, 1 Jun 2015 12:29:43 +0200 On Fri, Apr 24, 2015 at 02:24:38PM +0200, Greg Kurz wrote: Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com Dave, could you please ack merging this through the virtio tree? Acked-by: David S. Miller da...@davemloft.net
Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables
On Fri, May 29, 2015 at 06:44:51PM +1000, Alexey Kardashevskiy wrote: TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate a contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels, which split the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages() helpers. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- Changes: v10: * fixed multiple comments received for v9 v9: * moved from ioda2 to common powernv pci code * fixed cleanup if allocation fails in a middle * removed check for the size - all boundary checks happen in the calling code anyway --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 98 --- arch/powerpc/platforms/powernv/pci.c | 13 3 files changed, 104 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 4636734..706cfc0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -96,6 +96,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fda01c1..68ffc7a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -49,6 +49,9 @@ /* 256M DMA window, 4K TCE pages, 8 bytes TCE */ #define TCE32_TABLE_SIZE ((0x10000000 / 0x1000) * 8) +#define
POWERNV_IOMMU_DEFAULT_LEVELS 1 +#define POWERNV_IOMMU_MAX_LEVELS 5 + static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl); static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, @@ -1975,6 +1978,8 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, table_group); struct pnv_phb *phb = pe->phb; int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? + tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; @@ -1989,9 +1994,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, pe->pe_number << 1, - 1, + tbl->it_indirect_levels + 1, __pa(tbl->it_base), - tbl->it_size << 3, + size << 3, IOMMU_PAGE_SIZE(tbl)); if (rc) { pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); @@ -2071,11 +2076,19 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb) phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8); } -static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift) +static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift, + unsigned levels, unsigned long limit, + unsigned long *tce_table_allocated) { struct page *tce_mem = NULL; - __be64 *addr; + __be64 *addr, *tmp; unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT); + unsigned entries = 1UL << (shift - 3); + long i; + + if (*tce_table_allocated >= limit) + return NULL; I'm not quite clear what case this limit logic is trying to catch.
tce_mem = alloc_pages_node(nid, GFP_KERNEL, order); if (!tce_mem) { @@ -2083,31 +2096,69 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift) return NULL; } addr = page_address(tce_mem); - memset(addr, 0, 1UL << (order + PAGE_SHIFT)); + memset(addr, 0, local_allocated); + + --levels; + if (!levels) { + *tce_table_allocated += local_allocated; + return addr; + } + + for (i = 0; i < entries; ++i) { + tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift, + levels, limit, tce_table_allocated); + if (!tmp) + break; +
Re: [PATCH kernel v11 26/34] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
On Fri, May 29, 2015 at 06:44:50PM +1000, Alexey Kardashevskiy wrote: This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is I'm assuming that's what's now called pnv_pci_ioda2_invalidate_entire()? a good thing to do. It does not have immediate effect now as the table is never recreated after reboot but it will in the following patches. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: David Gibson da...@gibson.dropbear.id.au Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com --- Changes: v11: * replaced some 1<<it_page_shift with IOMMU_PAGE_SIZE() macro v9: * initialize pe->table_group.tables[0] at the very end when tbl is fully initialized * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch --- arch/powerpc/platforms/powernv/pci-ioda.c | 47 +-- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 3d29fe3..fda01c1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1968,6 +1968,43 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } } +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, + int num, struct iommu_table *tbl) +{ + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); + struct pnv_phb *phb = pe->phb; + int64_t rc; + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; + const __u64 win_size = tbl->it_size << tbl->it_page_shift; + + pe_info(pe, "Setting up window %llx..%llx pg=%x\n", + start_addr, start_addr + win_size - 1, + IOMMU_PAGE_SIZE(tbl)); + + /* + * Map TCE table through TVT.
The TVE index is the PE number + * shifted by 1 bit for 32-bits DMA space. + */ + rc = opal_pci_map_pe_dma_window(phb->opal_id, + pe->pe_number, + pe->pe_number << 1, + 1, + __pa(tbl->it_base), + tbl->it_size << 3, + IOMMU_PAGE_SIZE(tbl)); + if (rc) { + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); + return rc; + } + + pnv_pci_link_table_and_group(phb->hose->node, num, + tbl, &pe->table_group); + pnv_pci_ioda2_tce_invalidate_entire(pe); + + return 0; +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe->pe_number << 1 ) + 1; @@ -2123,21 +2160,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe->table_group.ops = &pnv_pci_ioda2_ops; #endif - /* - * Map TCE table through TVT. The TVE index is the PE number - * shifted by 1 bit for 32-bits DMA space. - */ - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, - pe->pe_number << 1, 1, __pa(tbl->it_base), - tbl->it_size << 3, 1ULL << tbl->it_page_shift); + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl); if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", rc); goto fail; } - pnv_pci_ioda2_tce_invalidate_entire(pe); - /* OPAL variant of PHB3 invalidated TCEs */ if (phb->ioda.tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote: The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires: 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.
In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com --- Changes: v11: * mm_iommu_put() does not return a code so this does not check it * moved v2 in tce_container to pack the struct v10: * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO specific thing * squashed powerpc/iommu: Add userspace view of TCE table into this as it is a part of IOMMU v2 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/ * fixed some function names to have tce_iommu_ in the beginning rather just tce_ * as mm_iommu_mapped_inc() can now fail, check for the return code v9: * s/tce_get_hva_cached/tce_iommu_use_page_v2/ v7: * now memory is registered per mm (i.e. 
process) * moved memory registration code to powerpc/mmu * merged vfio: powerpc/spapr: Define v2 IOMMU into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioctls return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 31 ++- arch/powerpc/include/asm/iommu.h| 6 + drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++-- include/uapi/linux/vfio.h | 27 ++ 4 files changed, 487 insertions(+), 89 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 2) The hardware supports so called DMA windows -
Re: [PATCH kernel v11 31/34] vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
On Fri, May 29, 2015 at 06:44:55PM +1000, Alexey Kardashevskiy wrote: Before the IOMMU user (VFIO) would take control over the IOMMU table belonging to a specific IOMMU group. This approach did not allow sharing tables between IOMMU groups attached to the same container. This introduces a new IOMMU ownership flavour when the user can not just control the existing IOMMU table but remove/create tables on demand. If an IOMMU implements take/release_ownership() callbacks, this lets the user have full control over the IOMMU group. When the ownership is taken, the platform code removes all the windows so the caller must create them. Before returning the ownership back to the platform code, VFIO unprograms and removes all the tables it created. This changes IODA2's ownership handler to remove the existing table rather than manipulating with the existing one. From now on, iommu_take_ownership() and iommu_release_ownership() are only called from the vfio_iommu_spapr_tce driver. Old-style ownership is still supported allowing VFIO to run on older P5IOC2 and IODA IO controllers. No change in userspace-visible behaviour is expected. Since it recreates TCE tables on each ownership change, related kernel traces will appear more often. This adds a pnv_pci_ioda2_setup_default_config() which is called when PE is being configured at boot time and when the ownership is passed from VFIO to the platform code. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 29/34] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release
On Fri, May 29, 2015 at 06:44:53PM +1000, Alexey Kardashevskiy wrote: The existing code programmed TVT#0 with some address and then immediately released that memory. This makes use of pnv_pci_ioda2_unset_window() and pnv_pci_ioda2_set_bypass() which do correct resource release and TVT update. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 32/34] powerpc/mmu: Add userspace-to-physical addresses translation cache
On Fri, May 29, 2015 at 06:44:56PM +1000, Alexey Kardashevskiy wrote: We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowed; - @used: incremented on every register ioctl; decremented on unregister; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array.
In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru There's a bunch of ways this could be cleaned up, but those can be done later. So, Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 34/34] vfio: powerpc/spapr: Support Dynamic DMA windows
On Fri, May 29, 2015 at 06:44:58PM +1000, Alexey Kardashevskiy wrote: This adds create/remove window ioctls to create and remove DMA windows. sPAPR defines a Dynamic DMA windows capability which allows para-virtualized guests to create additional DMA windows on a PCI bus. The existing linux kernels use this new window to map the entire guest memory and switch to the direct DMA operations saving time on map/unmap requests which would normally happen in big amounts. This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows. Up to 2 windows are supported now by the hardware and by this driver. This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional information such as a number of supported windows and maximum number of levels of TCE tables. DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature as we still want to support v2 on platforms which cannot do DDW for the sake of TCE acceleration in KVM (coming soon). Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru [aw: for the vfio related changes] Acked-by: Alex Williamson alex.william...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 28/34] vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
On Fri, May 29, 2015 at 06:44:52PM +1000, Alexey Kardashevskiy wrote: This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA windows management. create_table() creates a TCE table with specific parameters. It receives iommu_table_group to know nodeid in order to allocate TCE table memory closer to the PHB. The exact format of allocated multi-level table might be also specific to the PHB model (not the case now though). This callback calculates the DMA window offset on a PCI bus from @num and stores it in a just created table. set_window() sets the window at specified TVT index + @num on PHB. unset_window() unsets the window from specified TVT. This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table. create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others. This makes use of new values in vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not advertise pagemasks to the userspace. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Acked-by: Alex Williamson alex.william...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH kernel v11 30/34] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On Fri, May 29, 2015 at 06:44:54PM +1000, Alexey Kardashevskiy wrote: This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_ioda2_get_table_size() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru Reviewed-by: David Gibson da...@gibson.dropbear.id.au -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson