Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-05-24 Thread David Gibson
On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >index 8465c2a..da6bf61 100644
> >--- a/arch/powerpc/kvm/powerpc.c
> >+++ b/arch/powerpc/kvm/powerpc.c
> >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > break;
> > #endif
> > case KVM_CAP_SPAPR_MULTITCE:
> >+case KVM_CAP_SPAPR_TCE_IOMMU:
> > r = 1;
> > break;
> > default:
> 
> Don't advertise SPAPR capabilities if it's not book3s -- and
> probably there's some additional limitation that would be
> appropriate.

So, in the case of MULTITCE, that's not quite right.  PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well.  We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.

CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
host side hardware (i.e. a PAPR style IOMMU), though.
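
Roughly, I'd expect the check_extension code to end up looking
something like the sketch below (kvmppc_spapr_tce_iommu_possible() is
a made-up name for whatever probe of PAPR style host IOMMU support we
settle on):

	case KVM_CAP_SPAPR_MULTITCE:
		/* Fine even for PR KVM on BookE, it is purely guest visible */
		r = 1;
		break;
	case KVM_CAP_SPAPR_TCE_IOMMU:
		/* Only if the host has a PAPR style IOMMU to back it */
		r = kvmppc_spapr_tce_iommu_possible();
		break;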

> 
> >@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> > goto out;
> > }
> >+case KVM_CREATE_SPAPR_TCE_IOMMU: {
> >+struct kvm_create_spapr_tce_iommu create_tce_iommu;
> >+struct kvm *kvm = filp->private_data;
> >+
> >+r = -EFAULT;
> >+if (copy_from_user(&create_tce_iommu, argp,
> >+sizeof(create_tce_iommu)))
> >+goto out;
> >+r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
> >+&create_tce_iommu);
> >+goto out;
> >+}
> > #endif /* CONFIG_PPC_BOOK3S_64 */
> >
> > #ifdef CONFIG_KVM_BOOK3S_64_HV
> >diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >index 5a2afda..450c82a 100644
> >--- a/include/uapi/linux/kvm.h
> >+++ b/include/uapi/linux/kvm.h
> >@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> > #define KVM_CAP_PPC_RTAS 91
> > #define KVM_CAP_IRQ_XICS 92
> > #define KVM_CAP_SPAPR_MULTITCE (0x11 + 89)
> >+#define KVM_CAP_SPAPR_TCE_IOMMU (0x11 + 90)
> 
> Hmm...

Ah, yeah, that needs to be fixed.  Those were interim numbers so that
we didn't have to keep changing our internal trees as new upstream
ioctls got added to the list.  We need to get a proper number for the
merge, though.
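
Purely for illustration (the real values depend on what is upstream at
merge time), the final header would just continue the existing
sequence, e.g.:

	#define KVM_CAP_IRQ_XICS 92
	#define KVM_CAP_SPAPR_MULTITCE 93
	#define KVM_CAP_SPAPR_TCE_IOMMU 94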

> >@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> > #define KVM_GET_DEVICE_ATTR   _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> > #define KVM_HAS_DEVICE_ATTR   _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> >
> >+/* ioctl for SPAPR TCE IOMMU */
> >+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xe4, struct kvm_create_spapr_tce_iommu)
> 
> Shouldn't this go under the vm ioctl section?
> 
> -Scott
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




[PATCH] kvm: exclude ioeventfd from counting kvm_io_range limit

2013-05-24 Thread Amos Kong
We can easily reach the 1000 limit by starting a VM with a couple
hundred I/O devices (multifunction=on). The hardcoded limit has
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

In userspace, we already have the maximum file descriptor limit to
bound the ioeventfd count. But kvm_io_bus devices are also used for
the pit, pic, ioapic and coalesced_mmio; those cannot be limited by
the maximum file descriptor.

Currently only ioeventfds take up a large number of kvm_io_bus
devices, so just exclude them when counting against the kvm_io_range
limit.

Also fix one indentation issue in kvm_host.h

Signed-off-by: Amos Kong 
---
 include/linux/kvm_host.h | 3 ++-
 virt/kvm/eventfd.c   | 2 ++
 virt/kvm/kvm_main.c  | 3 ++-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f0eea07..ef261ab 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -144,7 +144,8 @@ struct kvm_io_range {
 #define NR_IOBUS_DEVS 1000
 
 struct kvm_io_bus {
-   int   dev_count;
+   int dev_count;
+   int ioeventfd_count;
struct kvm_io_range range[];
 };
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 64ee720..1550637 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -753,6 +753,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
if (ret < 0)
goto unlock_fail;
 
+   kvm->buses[bus_idx]->ioeventfd_count++;
list_add_tail(&p->list, &kvm->ioeventfds);
 
mutex_unlock(&kvm->slots_lock);
@@ -798,6 +799,7 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
continue;
 
kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
+   kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);
ret = 0;
break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 302681c..c6d9baf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2926,7 +2926,8 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm->buses[bus_idx];
-   if (bus->dev_count > NR_IOBUS_DEVS - 1)
+   /* exclude ioeventfd which is limited by maximum fd */
+   if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
return -ENOSPC;
 
new_bus = kzalloc(sizeof(*bus) + ((bus->dev_count + 1) *
-- 
1.8.1.4



Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

2013-05-24 Thread Marcelo Tosatti
On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> Zap at least 10 pages before releasing mmu-lock to reduce the overload
> caused by re-acquiring the lock
> 
> After the patch, kvm_zap_obsolete_pages can make forward progress anyway,
> so update the comments
> 
> [ It improves kernel building by 0.6% ~ 1% ]

Can you please describe the overload in more detail? Under what scenario
is kernel building improved?



Re: [PATCH v7 03/11] KVM: MMU: fast invalidate all pages

2013-05-24 Thread Marcelo Tosatti
Hi Xiao,

On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
> walking and zapping all shadow pages one by one, and it also needs to zap
> all guest pages' rmaps and all shadow pages' parent spte lists. Things
> get particularly bad as the guest uses more memory or vcpus; it does not
> scale.
> 
> In this patch, we introduce a faster way to invalidate all shadow pages.
> KVM maintains a global mmu invalid generation-number which is stored in
> kvm->arch.mmu_valid_gen, and every shadow page stores the current global
> generation-number into sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow page sptes, it simply increases the
> global generation-number and then reloads the root shadow pages on all
> vcpus. Each vcpu will create a new shadow page table according to the
> current generation-number, which ensures the old pages are not used any
> more. The obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are then zapped using a lock-break technique.
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_host.h |2 +
>  arch/x86/kvm/mmu.c  |   84 +++
>  arch/x86/kvm/mmu.h  |1 +
>  3 files changed, 87 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..bff7d46 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -222,6 +222,7 @@ struct kvm_mmu_page {
>   int root_count;  /* Currently serving as active root */
>   unsigned int unsync_children;
>   unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
> + unsigned long mmu_valid_gen;
>   DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  #ifdef CONFIG_X86_32
> @@ -529,6 +530,7 @@ struct kvm_arch {
>   unsigned int n_requested_mmu_pages;
>   unsigned int n_max_mmu_pages;
>   unsigned int indirect_shadow_pages;
> + unsigned long mmu_valid_gen;
>   struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>   /*
>* Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f8ca2f3..f302540 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>   __clear_sp_write_flooding_count(sp);
>  }
>  
> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>gfn_t gfn,
>gva_t gaddr,
> @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  
>   account_shadowed(vcpu->kvm, gfn);
>   }
> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>   init_shadow_page_table(sp);
>   trace_kvm_mmu_get_page(sp, true);
>   return sp;
> @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>   ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>   kvm_mmu_page_unlink_children(kvm, sp);
>   kvm_mmu_unlink_parents(kvm, sp);
> +
>   if (!sp->role.invalid && !sp->role.direct)
>   unaccount_shadowed(kvm, sp->gfn);
> +
>   if (sp->unsync)
>   kvm_unlink_unsync_page(kvm, sp);
>   if (!sp->root_count) {
> @@ -4195,6 +4203,82 @@ restart:
>   spin_unlock(&kvm->mmu_lock);
>  }
>  
> +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> +{
> + struct kvm_mmu_page *sp, *node;
> + LIST_HEAD(invalid_list);
> +
> +restart:
> + list_for_each_entry_safe_reverse(sp, node,
> +   &kvm->arch.active_mmu_pages, link) {
> + /*
> +  * No obsolete page exists before new created page since
> +  * active_mmu_pages is the FIFO list.
> +  */
> + if (!is_obsolete_sp(kvm, sp))
> + break;

Can you add a comment to list_add(x, active_mmu_pages) callsites
mentioning this case?

Because it'll break silently if people do list_add_tail().
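
Something along these lines at each callsite would do (just a sketch):

	/*
	 * kvm_zap_obsolete_pages() relies on active_mmu_pages being a
	 * FIFO list: new pages always go on the head, so the reverse
	 * walk can stop at the first non-obsolete page.  Do not change
	 * this to list_add_tail().
	 */
	list_add(&sp->link, &kvm->arch.active_mmu_pages);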

> + /*
> +  * Do not repeatedly zap a root page to avoid unnecessary
> +  * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> +  * progress:
> +  *vcpu 0vcpu 1
> +  * call vcpu_enter_guest():
> +  *1): handle KVM_REQ_MMU_RELOAD
> +  *and require mmu-lock to
> +  *load mmu
> +  * repeat:
> +  *1): zap root page and
> +  *send KVM_REQ_MMU_RELOAD
> +  *
> +  

Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-24 Thread Marcelo Tosatti
On Fri, May 24, 2013 at 06:11:16AM -0400, Vadim Rozenfeld wrote:
> > Is there a better option?
> > 
> If setting TscSequence to zero makes Windows fall back to the MSR this is a
> better option.
> 
> +1 
> This is why MS has two different mechanisms:
> iTSC as a primary, reference counters as a fall-back.

Ok, is it documented that transition

iTSC valid (Sequence != 0 and != 0x) -> 
iTSC not valid but ref MSR valid (Sequence = 0), 

is a valid transition?

It was not obvious for me. Can you point to documentation?




[PATCH 0/2] vfio: type1 iommu hugepage support

2013-05-24 Thread Alex Williamson
This series lets the vfio type1 iommu backend take advantage of iommu
large page support.  See patch 2/2 for the details.  This has been
tested on both amd_iommu and intel_iommu, but only my AMD system has
large page support.  I'd appreciate any testing and feedback on other
systems, particularly vt-d systems supporting large pages.  Mapping
efficiency should be improved a bit without iommu hugepages, but I
hope that it's much more noticeable with huge pages, especially for
very large QEMU guests.

This change includes a clarification to the mapping expectations for
users of the type1 iommu, but is compatible with known users and works
with existing QEMU userspace supporting vfio.  Thanks,

Alex

---

Alex Williamson (2):
  vfio: Convert type1 iommu to use rbtree
  vfio: hugepage support for vfio_iommu_type1


 drivers/vfio/vfio_iommu_type1.c |  607 ---
 include/uapi/linux/vfio.h   |8 -
 2 files changed, 387 insertions(+), 228 deletions(-)


[PATCH 1/2] vfio: Convert type1 iommu to use rbtree

2013-05-24 Thread Alex Williamson
We need to keep track of all the DMA mappings of an iommu container so
that it can be automatically unmapped when the user releases the file
descriptor.  We currently do this using a simple list, where we merge
entries with contiguous iovas and virtual addresses.  Using a tree for
this is a bit more efficient and allows us to use common code instead
of inventing our own.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |  190 ---
 1 file changed, 96 insertions(+), 94 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6f3fbc4..0e863b3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include  /* pci_bus_type */
+#include <linux/rbtree.h>
 #include 
 #include 
 #include 
@@ -50,13 +51,13 @@ MODULE_PARM_DESC(allow_unsafe_interrupts,
 struct vfio_iommu {
struct iommu_domain *domain;
struct mutexlock;
-   struct list_headdma_list;
+   struct rb_root  dma_list;
struct list_headgroup_list;
boolcache;
 };
 
 struct vfio_dma {
-   struct list_headnext;
+   struct rb_node  node;
dma_addr_t  iova;   /* Device address */
unsigned long   vaddr;  /* Process virtual addr */
longnpage;  /* Number of pages */
@@ -75,6 +76,49 @@ struct vfio_group {
 
 #define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
 
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+ dma_addr_t start, size_t size)
+{
+   struct rb_node *node = iommu->dma_list.rb_node;
+
+   while (node) {
+   struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
+
+   if (start + size <= dma->iova)
+   node = node->rb_left;
+   else if (start >= dma->iova + NPAGE_TO_SIZE(dma->npage))
+   node = node->rb_right;
+   else
+   return dma;
+   }
+
+   return NULL;
+}
+
+static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
+{
+   struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+   struct vfio_dma *dma;
+
+   while (*link) {
+   parent = *link;
+   dma = rb_entry(parent, struct vfio_dma, node);
+
+   if (new->iova + NPAGE_TO_SIZE(new->npage) <= dma->iova)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
+{
+   rb_erase(&old->node, &iommu->dma_list);
+}
+
 struct vwork {
struct mm_struct*mm;
longnpage;
@@ -289,31 +333,8 @@ static int __vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
return 0;
 }
 
-static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
- dma_addr_t start2, size_t size2)
-{
-   if (start1 < start2)
-   return (start2 - start1 < size1);
-   else if (start2 < start1)
-   return (start1 - start2 < size2);
-   return (size1 > 0 && size2 > 0);
-}
-
-static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
-   dma_addr_t start, size_t size)
-{
-   struct vfio_dma *dma;
-
-   list_for_each_entry(dma, &iommu->dma_list, next) {
-   if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
-  start, size))
-   return dma;
-   }
-   return NULL;
-}
-
-static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
-   size_t size, struct vfio_dma *dma)
+static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+  size_t size, struct vfio_dma *dma)
 {
struct vfio_dma *split;
long npage_lo, npage_hi;
@@ -322,10 +343,9 @@ static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
if (start <= dma->iova &&
start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
-   list_del(&dma->next);
-   npage_lo = dma->npage;
+   vfio_remove_dma(iommu, dma);
kfree(dma);
-   return npage_lo;
+   return 0;
}
 
/* Overlap low address of existing range */
@@ -339,7 +359,7 @@ static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
dma->iova += ov

[PATCH 2/2] vfio: hugepage support for vfio_iommu_type1

2013-05-24 Thread Alex Williamson
We currently send all mappings to the iommu in PAGE_SIZE chunks,
which prevents the iommu from enabling support for larger page sizes.
We still need to pin pages, which means we step through them in
PAGE_SIZE chunks, but we can batch up contiguous physical memory
chunks to allow the iommu the opportunity to use larger pages.  The
approach here is a bit different from the one currently used for
legacy KVM device assignment.  Rather than looking at the vma page
size and using that as the maximum size to pass to the iommu, we
instead simply look at whether the next page is physically
contiguous.  This means we might ask the iommu to map a 4MB region,
while legacy KVM might limit itself to a maximum of 2MB.

Splitting our mapping path also allows us to be smarter about locked
memory because we can more easily unwind if the user attempts to
exceed the limit.  Therefore, rather than assuming that a mapping
will result in locked memory, we test each page as it is pinned to
determine whether it locks RAM vs an mmap'd MMIO region.  This should
result in better locking granularity and less locked page fudge
factors in userspace.
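
The test itself is cheap; a sketch of the idea (the helper name is
made up, not what the patch necessarily calls it):

	/*
	 * Only pfns backed by a struct page and not marked reserved are
	 * real RAM and count against the locked memory limit; pfns from
	 * mmap'd MMIO regions fail this test and are not accounted.
	 */
	static bool pfn_is_accountable_ram(unsigned long pfn)
	{
		return pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn));
	}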

The unmap path uses the same algorithm as legacy KVM.  We don't want
to track the pfn for each mapping ourselves, but we need the pfn in
order to unpin pages.  We therefore ask the iommu for the iova to
physical address translation, ask it to unpin a page, and see how many
pages were actually unpinned.  iommus supporting large pages will
often return something bigger than a page here, which we know will be
physically contiguous and we can unpin a batch of pfns.  iommus not
supporting large mappings won't see an improvement in batching here as
they only unmap a page at a time.

With this change, we also make a clarification to the API for mapping
and unmapping DMA.  We can only guarantee unmaps at the same
granularity as used for the original mapping.  In other words,
unmapping a subregion of a previous mapping is not guaranteed and may
result in a larger or smaller unmapping than requested.  The size
field in the unmapping structure is updated to reflect this.
Previously this was unmodified on mapping, always returning the
requested unmap size.  This is now updated to return the actual unmap
size on success, allowing userspace to appropriately track mappings.
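
From the userspace side that looks roughly like the following sketch
(error handling omitted; container_fd is the VFIO container file
descriptor and track_unmapped() stands in for whatever bookkeeping the
application keeps):

	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = iova,
		.size  = size,
	};

	if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap) == 0)
		/* .size now reports how much was actually unmapped */
		track_unmapped(unmap.iova, unmap.size);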

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio_iommu_type1.c |  523 +--
 include/uapi/linux/vfio.h   |8 -
 2 files changed, 344 insertions(+), 187 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0e863b3..6654a7e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -60,7 +60,7 @@ struct vfio_dma {
struct rb_node  node;
dma_addr_t  iova;   /* Device address */
unsigned long   vaddr;  /* Process virtual addr */
-   longnpage;  /* Number of pages */
+   size_t  size;   /* Map size (bytes) */
int prot;   /* IOMMU_READ/WRITE */
 };
 
@@ -74,8 +74,6 @@ struct vfio_group {
  * into DMA'ble space using the IOMMU
  */
 
-#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
-
 static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
  dma_addr_t start, size_t size)
 {
@@ -86,7 +84,7 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 
if (start + size <= dma->iova)
node = node->rb_left;
-   else if (start >= dma->iova + NPAGE_TO_SIZE(dma->npage))
+   else if (start >= dma->iova + dma->size)
node = node->rb_right;
else
return dma;
@@ -104,7 +102,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
parent = *link;
dma = rb_entry(parent, struct vfio_dma, node);
 
-   if (new->iova + NPAGE_TO_SIZE(new->npage) <= dma->iova)
+   if (new->iova + new->size <= dma->iova)
link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
@@ -144,8 +142,8 @@ static void vfio_lock_acct(long npage)
struct vwork *vwork;
struct mm_struct *mm;
 
-   if (!current->mm)
-   return; /* process exited */
+   if (!current->mm || !npage)
+   return; /* process exited or nothing to do */
 
if (down_write_trylock(¤t->mm->mmap_sem)) {
current->mm->locked_vm += npage;
@@ -217,33 +215,6 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
-/* Unmap DMA region */
-static long __vfio_dma_do_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
-long npage, int prot)
-{
-   long i, unlocked = 0;
-
-   for (i = 0; i < npage; 

[Bug 58771] New: VM performance degradation after KVM QEMU migration or save/restore with Intel EPT enabled

2013-05-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=58771

   Summary: VM performance degradation after KVM QEMU migration or
save/restore with Intel EPT enabled
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.0+
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: high
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: ccorm...@gmail.com
Regression: Yes


Overview:
Once a VM has been migrated to another hypervisor, or the VM has been saved
and restored, the performance of the VM is immediately impacted; as a result,
disk access is slower and has increased latency.

This only affects KVM QEMU hypervisors with Intel EPT-capable CPUs, on
kernels 3.0 and higher with EPT enabled in the kvm_intel kernel module.

Test Setup:
Hypervisor with an Intel CPU that has EPT feature is required.
(reproduced with Fedora and Ubuntu Distros)

Hypervisor
-Ubuntu 12.04 (with any Kernel 3.0-3.9)
Guest
-Ubuntu 12.04

Steps to Reproduce:
Save/restore procedure on a single hypervisor:
-using virsh to manage VMs
-create a running VM
-save VMs running state ("virsh save  savefile")
-restore VMs running state ("virsh restore savefile")

Alternative reproduction to above is using the virsh livemigration or migration
option will also reproduce this bug.

Actual Results:
Guest VM IO intensive applications perform slower.

Expected Results:
Guest VM IO performance consistent before and after save/restore.

Build Date & Platform:
-Hypervisor - Ubuntu 12.04 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16
UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
-Guest - Ubuntu 12.04 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux

Additional Builds and Platforms:
Ubuntu 12.04, 13.04 with kernels 3.0-3.9
Fedora 18 (stock kernel)
Doesn't occur with tested Kernels 2.8.32, 2.8.39 squeezed into Ubuntu 12.04
(Fedora not tested with 2.8 kernels)

Additional Information:
Performance can be measured using various tools and benchmarks.
-lmbench will show latencies
-some timed compilation benchmarks
-some disk benchmarks

Here are some examples of my before and after benchmarks.
LMBENCH :
Before:
Simple read: 0.1356 microseconds
Simple write: 0.1086 microseconds
Simple open/close: 1.0265 microseconds
After:
Simple read: 0.2125 microseconds
Simple write: 0.1913 microseconds
Simple open/close: 1.4482 microseconds

PostMark:
Before: 2808
After: 1893



Re: [PATCH 2/3] powerpc/vfio: Implement IOMMU driver for VFIO

2013-05-24 Thread Alex Williamson
On Tue, 2013-05-21 at 13:33 +1000, Alexey Kardashevskiy wrote:
> VFIO implements platform independent stuff such as
> a PCI driver, BAR access (via read/write on a file descriptor
> or direct mapping when possible) and IRQ signaling.
> 
> The platform dependent part includes IOMMU initialization
> and handling.  This implements an IOMMU driver for VFIO
> which does mapping/unmapping pages for the guest IO and
> provides information about DMA window (required by a POWER
> guest).
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> Signed-off-by: Paul Mackerras 

Acked-by: Alex Williamson 

> ---
>  Documentation/vfio.txt  |   63 ++
>  drivers/vfio/Kconfig|6 +
>  drivers/vfio/Makefile   |1 +
>  drivers/vfio/vfio.c |1 +
> >  drivers/vfio/vfio_iommu_spapr_tce.c |  377 +++
>  include/uapi/linux/vfio.h   |   34 
>  6 files changed, 482 insertions(+)
>  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 8eda363..c55533c 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -283,6 +283,69 @@ a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
>  interfaces implement the device region access defined by the device's
>  own VFIO_DEVICE_GET_REGION_INFO ioctl.
>  
> +
> +PPC64 sPAPR implementation note
> +---
> +
> +This implementation has some specifics:
> +
> +1) Only one IOMMU group per container is supported as an IOMMU group
> +represents the minimal entity which isolation can be guaranteed for and
> +groups are allocated statically, one per Partitionable Endpoint (PE)
> +(PE is often a PCI domain but not always).
> +
> +2) The hardware supports so called DMA windows - the PCI address range
> +within which DMA transfer is allowed, any attempt to access address space
> +out of the window leads to the whole PE isolation.
> +
> +3) PPC64 guests are paravirtualized but not fully emulated. There is an API
> +to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
> +currently there is no way to reduce the number of calls. In order to make
> +things faster, the map/unmap handling has been implemented in real mode,
> +which provides excellent performance but has limitations such as the
> +inability to do locked pages accounting in real time.
> +
> +So 3 additional ioctls have been added:
> +
> + VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
> + of the DMA window on the PCI bus.
> +
> + VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
> + is done at this point. This lets the user first find out what
> + the DMA window is and adjust the rlimit before doing any real job.
> +
> + VFIO_IOMMU_DISABLE - disables the container.
> +
> +
> +The code flow from the example above should be slightly changed:
> +
> + .
> + /* Add the group to the container */
> + ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
> +
> + /* Enable the IOMMU model we want */
> + ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
> +
> + /* Get additional sPAPR IOMMU info */
> + vfio_iommu_spapr_tce_info spapr_iommu_info;
> + ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
> +
> + if (ioctl(container, VFIO_IOMMU_ENABLE))
> + /* Cannot enable container, may be low rlimit */
> +
> + /* Allocate some space and setup a DMA mapping */
> + dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> +  MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +
> + dma_map.size = 1024 * 1024;
> + dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> + /* Check here if .iova/.size are within the DMA window from spapr_iommu_info */
> +
> + ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
> + .
> +
>  
> ---
>  
>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>   depends on VFIO
>   default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> + tristate
> + depends on VFIO && SPAPR_TCE_IOMMU
> + default n
> +
>  menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --

Re: [PATCH 3/3] powerpc/vfio: Enable on pSeries platform

2013-05-24 Thread Alex Williamson
On Tue, 2013-05-21 at 13:33 +1000, Alexey Kardashevskiy wrote:
> This enables VFIO on the pSeries platform, enabling user space
> programs to access PCI devices directly.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Cc: David Gibson 
> Signed-off-by: Paul Mackerras 

Acked-by: Alex Williamson 

> ---
>  arch/powerpc/platforms/pseries/iommu.c |4 
>  drivers/iommu/Kconfig  |2 +-
>  drivers/vfio/Kconfig   |2 +-
>  3 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 86ae364..23fc1dc 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -614,6 +614,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
>  
>   iommu_table_setparms(pci->phb, dn, tbl);
>   pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> + iommu_register_group(tbl, pci_domain_nr(bus), 0);
>  
>   /* Divide the rest (1.75GB) among the children */
>   pci->phb->dma_window_size = 0x8000ul;
> @@ -658,6 +659,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
> *bus)
>  ppci->phb->node);
>   iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
>   ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
> + iommu_register_group(tbl, pci_domain_nr(bus), 0);
>   pr_debug("  created table: %p\n", ppci->iommu_table);
>   }
>  }
> @@ -684,6 +686,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
>  phb->node);
>   iommu_table_setparms(phb, dn, tbl);
>   PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> + iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
>   set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
>   return;
>   }
> @@ -1184,6 +1187,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev 
> *dev)
>  pci->phb->node);
>   iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
>   pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> + iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
>   pr_debug("  created table: %p\n", pci->iommu_table);
>   } else {
>   pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 3f3abde..01730b2 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -263,7 +263,7 @@ config SHMOBILE_IOMMU_L1SIZE
>  
>  config SPAPR_TCE_IOMMU
>   bool "sPAPR TCE IOMMU Support"
> - depends on PPC_POWERNV
> + depends on PPC_POWERNV || PPC_PSERIES
>   select IOMMU_API
>   help
> Enables bits of IOMMU API required by VFIO. The iommu_ops
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index b464687..26b3d9d 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -12,7 +12,7 @@ menuconfig VFIO
>   tristate "VFIO Non-Privileged userspace driver framework"
>   depends on IOMMU_API
>   select VFIO_IOMMU_TYPE1 if X86
> - select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
> + select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
>   help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.





Re: updated: kvm networking todo wiki

2013-05-24 Thread Michael S. Tsirkin
On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
> "Michael S. Tsirkin"  writes:
> 
> > On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote:
> >> On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
> >> > Hey guys,
> >> > I've updated the kvm networking todo wiki with current projects.
> >> > Will try to keep it up to date more often.
> >> > Original announcement below.
> >> 
> >> Thanks a lot. I've added the tasks I'm currently working on to the wiki.
> >> 
> >> btw. I notice the virtio-net data plane were missed in the wiki. Is the
> >> project still being considered?
> >
> > It might have been interesting several years ago, but now that linux has
> > vhost-net in kernel, the only point seems to be to
> > speed up networking on non-linux hosts.
> 
> Data plane just means having a dedicated thread for virtqueue processing
> that doesn't hold qemu_mutex.
> 
> Of course we're going to do this in QEMU.  It's a no brainer.  But not
> as a separate device, just as an improvement to the existing userspace
> virtio-net.
> 
> > Since non-linux does not have kvm, I doubt virtio is a bottleneck.
> 
> FWIW, I think what's more interesting is using vhost-net as a networking
> backend with virtio-net in QEMU being what's guest facing.
> 
> In theory, this gives you the best of both worlds: QEMU acts as a first
> line of defense against a malicious guest while still getting the
> performance advantages of vhost-net (zero-copy).

Great idea, that sounds very interesting.

I'll add it to the wiki.

In fact a bit of complexity in vhost was put there in the vague hope to
support something like this: virtio rings are not translated through
regular memory tables; instead, vhost gets a pointer to the ring address.

This allows qemu to act as a man in the middle,
verifying the descriptors but not touching the

Anyone interested in working on such a project?

> > IMO yet another networking backend is a distraction,
> > and confusing to users.
> > In any case, I'd like to see virtio-blk dataplane replace
> > non dataplane first. We don't want two copies of
> > virtio-net in qemu.
> 
> 100% agreed.
> 
> Regards,
> 
> Anthony Liguori
> 
> >
> >> > 
> >> >
> >> > I've put up a wiki page with a kvm networking todo list,
> >> > mainly to avoid effort duplication, but also in the hope
> >> > to draw attention to what I think we should try addressing
> >> > in KVM:
> >> >
> >> > http://www.linux-kvm.org/page/NetworkingTodo
> >> >
> >> > This page could cover all networking related activity in KVM,
> >> > currently most info is related to virtio-net.
> >> >
> >> > Note: if there's no developer listed for an item,
> >> > this just means I don't know of anyone actively working
> >> > on an issue at the moment, not that no one intends to.
> >> >
> >> > I would appreciate it if others working on one of the items on this list
> >> > would add their names so we can communicate better.  If others like this
> >> > wiki page, please go ahead and add stuff you are working on if any.
> >> >
> >> > It would be especially nice to add autotest projects:
> >> > there is just a short test matrix and a catch-all
> >> > 'Cover test matrix with autotest', currently.
> >> >
> >> > Currently there are some links to Red Hat bugzilla entries,
> >> > feel free to add links to other bugzillas.
> >> >
> >> > Thanks!
> >> >


Re: updated: kvm networking todo wiki

2013-05-24 Thread Anthony Liguori
"Michael S. Tsirkin"  writes:

> On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote:
>> On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
>> > Hey guys,
>> > I've updated the kvm networking todo wiki with current projects.
>> > Will try to keep it up to date more often.
>> > Original announcement below.
>> 
>> Thanks a lot. I've added the tasks I'm currently working on to the wiki.
>> 
>> btw. I notice the virtio-net data plane were missed in the wiki. Is the
>> project still being considered?
>
> It might have been interesting several years ago, but now that linux has
> vhost-net in kernel, the only point seems to be to
> speed up networking on non-linux hosts.

Data plane just means having a dedicated thread for virtqueue processing
that doesn't hold qemu_mutex.

Of course we're going to do this in QEMU.  It's a no brainer.  But not
as a separate device, just as an improvement to the existing userspace
virtio-net.

> Since non-linux does not have kvm, I doubt virtio is a bottleneck.

FWIW, I think what's more interesting is using vhost-net as a networking
backend with virtio-net in QEMU being what's guest facing.

In theory, this gives you the best of both worlds: QEMU acts as a first
line of defense against a malicious guest while still getting the
performance advantages of vhost-net (zero-copy).

> IMO yet another networking backend is a distraction,
> and confusing to users.
> In any case, I'd like to see virtio-blk dataplane replace
> non dataplane first. We don't want two copies of
> virtio-net in qemu.

100% agreed.

Regards,

Anthony Liguori

>
>> > 
>> >
>> > I've put up a wiki page with a kvm networking todo list,
>> > mainly to avoid effort duplication, but also in the hope
>> > to draw attention to what I think we should try addressing
>> > in KVM:
>> >
>> > http://www.linux-kvm.org/page/NetworkingTodo
>> >
>> > This page could cover all networking related activity in KVM,
>> > currently most info is related to virtio-net.
>> >
>> > Note: if there's no developer listed for an item,
>> > this just means I don't know of anyone actively working
>> > on an issue at the moment, not that no one intends to.
>> >
>> > I would appreciate it if others working on one of the items on this list
>> > would add their names so we can communicate better.  If others like this
>> > wiki page, please go ahead and add stuff you are working on if any.
>> >
>> > It would be especially nice to add autotest projects:
>> > there is just a short test matrix and a catch-all
>> > 'Cover test matrix with autotest', currently.
>> >
>> > Currently there are some links to Red Hat bugzilla entries,
>> > feel free to add links to other bugzillas.
>> >
>> > Thanks!
>> >


Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/

2013-05-24 Thread Arnd Bergmann
On Friday 24 May 2013, Michael S. Tsirkin wrote:
> So this won't work, unless we add the is_kernel_addr check
> to might_fault. That will become possible on top of this patchset
> but let's consider this carefully, and make this a separate
> patchset, OK?

Yes, makes sense.

Arnd


Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/

2013-05-24 Thread Michael S. Tsirkin
On Fri, May 24, 2013 at 04:00:32PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 22, 2013 at 03:59:01PM +0200, Arnd Bergmann wrote:
> > On Thursday 16 May 2013, Michael S. Tsirkin wrote:
> > > @@ -178,7 +178,7 @@ do {  
> > >   \
> > > long __pu_err;  \
> > > __typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
> > > if (!is_kernel_addr((unsigned long)__pu_addr))  \
> > > -   might_sleep();  \
> > > +   might_fault();  \
> > > __chk_user_ptr(ptr);\
> > > __put_user_size((x), __pu_addr, (size), __pu_err);  \
> > > __pu_err;   \
> > > 
> > 
> > Another observation:
> > 
> > if (!is_kernel_addr((unsigned long)__pu_addr))
> > might_sleep();
> > 
> > is almost the same as
> > 
> > might_fault();
> > 
> > except that it does not call might_lock_read().
> > 
> > The version above may have been put there intentionally and correctly, but
> > if you want to replace it with might_fault(), you should remove the
> > "if ()" condition.
> > 
> > Arnd
> 
> Well not exactly. The non-inline might_fault checks the
> current segment, not the address.
> I'm guessing this is trying to do the same just without
> pulling in segment_eq, but I'd like a confirmation
> from more PPC maintainers.
> 
> Guys would you ack
> 
> - if (!is_kernel_addr((unsigned long)__pu_addr))
> - might_fault();
> + might_fault();
> 
> on top of this patch?

OK I spoke too fast: I found this:

powerpc: Fix incorrect might_sleep in __get_user/__put_user on kernel addresses

We have a case where __get_user and __put_user can validly be used
on kernel addresses in interrupt context - namely, the alignment
exception handler, as our get/put_unaligned just do a single access
and rely on the alignment exception handler to fix things up in the
rare cases where the cpu can't handle it in hardware.  Thus we can
get alignment exceptions in the network stack at interrupt level.
The alignment exception handler does a __get_user to read the
instruction and blows up in might_sleep().

Since a __get_user on a kernel address won't actually ever sleep,
this makes the might_sleep conditional on the address being less
than PAGE_OFFSET.

Signed-off-by: Paul Mackerras 

So this won't work, unless we add the is_kernel_addr check
to might_fault. That will become possible on top of this patchset
but let's consider this carefully, and make this a separate
patchset, OK?

> Also, any volunteer to test this (not just test-build)?
> 
> -- 
> MST


Re: [Qemu-devel] [PATCH] qemu-kvm: fix unmatched RAM alloction/free

2013-05-24 Thread Eric Blake
On 05/23/2013 07:21 PM, Hao, Xudong wrote:
>> Just "git pull". :)  This is very similar to commit e7a09b9 (osdep: introduce
>> qemu_anon_ram_free to free qemu_anon_ram_alloc-ed memory, 2013-05-13)
>>
> 
> OK, this commit does the same thing as my patch; I did not notice the qemu
> upstream tree and only looked at the qemu-kvm tree. But I think this commit
> should be backported to the qemu-kvm tree, because many users use qemu-kvm for KVM.

That argues that the qemu-kvm tree needs one final commit that wipes
everything and replaces it with a readme file that tells users to
upgrade to the qemu upstream tree, now that the qemu-kvm tree has been
merged upstream and is no longer actively maintained.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [PATCH v2 07/10] powerpc: uaccess s/might_sleep/might_fault/

2013-05-24 Thread Michael S. Tsirkin
On Wed, May 22, 2013 at 03:59:01PM +0200, Arnd Bergmann wrote:
> On Thursday 16 May 2013, Michael S. Tsirkin wrote:
> > @@ -178,7 +178,7 @@ do {
> > \
> > long __pu_err;  \
> > __typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
> > if (!is_kernel_addr((unsigned long)__pu_addr))  \
> > -   might_sleep();  \
> > +   might_fault();  \
> > __chk_user_ptr(ptr);\
> > __put_user_size((x), __pu_addr, (size), __pu_err);  \
> > __pu_err;   \
> > 
> 
> Another observation:
> 
>   if (!is_kernel_addr((unsigned long)__pu_addr))
>   might_sleep();
> 
> is almost the same as
> 
>   might_fault();
> 
> except that it does not call might_lock_read().
> 
> The version above may have been put there intentionally and correctly, but
> if you want to replace it with might_fault(), you should remove the
> "if ()" condition.
> 
>   Arnd

Well not exactly. The non-inline might_fault checks the
current segment, not the address.
I'm guessing this is trying to do the same just without
pulling in segment_eq, but I'd like a confirmation
from more PPC maintainers.

Guys would you ack

-   if (!is_kernel_addr((unsigned long)__pu_addr))
-   might_fault();
+   might_fault();

on top of this patch?

Also, any volunteer to test this (not just test-build)?

-- 
MST


Re: updated: kvm networking todo wiki

2013-05-24 Thread Michael S. Tsirkin
On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote:
> On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
> > Hey guys,
> > I've updated the kvm networking todo wiki with current projects.
> > Will try to keep it up to date more often.
> > Original announcement below.
> 
> Thanks a lot. I've added the tasks I'm currently working on to the wiki.
> 
> btw. I notice the virtio-net data plane were missed in the wiki. Is the
> project still being considered?

It might have been interesting several years ago, but now that linux has
vhost-net in kernel, the only point seems to be to
speed up networking on non-linux hosts. Since non-linux
does not have kvm, I doubt virtio is a bottleneck.
IMO yet another networking backend is a distraction,
and confusing to users.
In any case, I'd like to see virtio-blk dataplane replace
non dataplane first. We don't want two copies of
virtio-net in qemu.

> > 
> >
> > I've put up a wiki page with a kvm networking todo list,
> > mainly to avoid effort duplication, but also in the hope
> > to draw attention to what I think we should try addressing
> > in KVM:
> >
> > http://www.linux-kvm.org/page/NetworkingTodo
> >
> > This page could cover all networking related activity in KVM,
> > currently most info is related to virtio-net.
> >
> > Note: if there's no developer listed for an item,
> > this just means I don't know of anyone actively working
> > on an issue at the moment, not that no one intends to.
> >
> > I would appreciate it if others working on one of the items on this list
> > would add their names so we can communicate better.  If others like this
> > wiki page, please go ahead and add stuff you are working on if any.
> >
> > It would be especially nice to add autotest projects:
> > there is just a short test matrix and a catch-all
> > 'Cover test matrix with autotest', currently.
> >
> > Currently there are some links to Red Hat bugzilla entries,
> > feel free to add links to other bugzillas.
> >
> > Thanks!
> >


Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-24 Thread Vadim Rozenfeld


- Original Message -
From: "Paolo Bonzini" 
To: "Vadim Rozenfeld" 
Cc: kvm@vger.kernel.org, g...@redhat.com, mtosa...@redhat.com, p...@dlh.net
Sent: Friday, May 24, 2013 2:44:50 AM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

Il 19/05/2013 09:06, Vadim Rozenfeld ha scritto:
> The following patch allows to activate a partition reference 
> time enlightenment that is based on the host platform's support 
> for an Invariant Time Stamp Counter (iTSC).
> NOTE: This code will survive migration due to lack of VM stop/resume
> handlers, when offset, scale and sequence should be
> readjusted. 
> 
> ---
>  arch/x86/kvm/x86.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9645dab..b423fe4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1838,7 +1838,6 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   u64 gfn;
>   unsigned long addr;
>   HV_REFERENCE_TSC_PAGE tsc_ref;
> - tsc_ref.TscSequence = 0;
>   if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
>   kvm->arch.hv_tsc_page = data;
>   break;
> @@ -1848,6 +1847,11 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
>   if (kvm_is_error_hva(addr))
>   return 1;
> + tsc_ref.TscSequence =
> + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;

Thinking more of migration, could we increment whatever sequence value
we found (or better, do (x|3)+2 to skip over 0 and 0x), instead
of forcing it to 1?

[VR]
Yes, it should work.
We need to keep the sequence between 1 and 0x and increment it every time
the VM is migrated or paused/resumed.
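
i.e. something like this on every resume/migration (just a sketch):

	/*
	 * Bump the sequence so the guest notices the update; (x | 3) + 2
	 * can never land on 0 or the all-ones "invalid" marker.
	 */
	tsc_ref.TscSequence = (tsc_ref.TscSequence | 3) + 2;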

Add HV_X64_MSR_REFERENCE_TSC to msrs_to_save, and migration should just
work.

Paolo

> + tsc_ref.TscScale =
> + ((1LL << 32) / vcpu->arch.virtual_tsc_khz) << 32;
> + tsc_ref.TscOffset = 0;
> + if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))
>   return 1;
>   mark_page_dirty(kvm, gfn);
> 



Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-24 Thread Vadim Rozenfeld


- Original Message -
From: "Gleb Natapov" 
To: "Marcelo Tosatti" 
Cc: "Vadim Rozenfeld" , kvm@vger.kernel.org, p...@dlh.net
Sent: Friday, May 24, 2013 1:31:10 AM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Thu, May 23, 2013 at 10:53:38AM -0300, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 12:12:29PM +0300, Gleb Natapov wrote:
> > > To address migration scenarios to physical platforms that do not support
> > > iTSC, the TscSequence field is used. In the event that a guest partition
> > > is  migrated from an iTSC capable host to a non-iTSC capable host, the
> > > hypervisor sets TscSequence to the special value of 0x, which
> > > directs the guest operating system to fall back to a different clock
> > > source (for example, the virtual PM timer)."
> > > 
> > > Why it would not/does not work after migration?
> > > 
> > Please read the whole discussion, we talked about it already. We
> > definitely do not want to fall back to PM timer either, we want to use
> > reference counter instead.
> 
> Case 1) On migration of TSC page enabled Windows guest, from invariant TSC 
> host,
> to non-invariant TSC host, Windows guests fallback to PMTimer
> and not to reference timer via MSR. 
> 
> This is suboptimal because pmtimer emulation is excessively slow.
> 
> Is there a better option?
> 
If setting TscSequence to zero makes Windows fall back to the MSR this is a
better option.

+1 
This is why MS has two different mechanisms:
iTSC as a primary, reference counters as a fall-back.

  
> Case 2)
> Reference timer (via MSR) support is interesting for the case of non 
> invariant TSC
> host.

--
Gleb.


Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-24 Thread Vadim Rozenfeld


- Original Message -
From: "Marcelo Tosatti" 
To: "Vadim Rozenfeld" 
Cc: kvm@vger.kernel.org, g...@redhat.com, p...@dlh.net
Sent: Thursday, May 23, 2013 11:47:46 PM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Thu, May 23, 2013 at 08:21:29AM -0400, Vadim Rozenfeld wrote:
> > > @@ -1848,6 +1847,11 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > >   HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
> > >   if (kvm_is_error_hva(addr))
> > >   return 1;
> > > + tsc_ref.TscSequence =
> > > + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;
> > 
> > 1) You want NONSTOP_TSC (see 40fb1715 commit) which matches INVARIANT TSC.
> > [VR]
> > Thank you for reviewing. Will fix it.
> > 2) TscSequence should increase? 
> > "This field serves as a sequence number that is incremented whenever..."
> > [VR]
> > Yes, on every VM resume, including migration. After migration we also need
> > to recalculate scale and adjust offset. 
> > 3) 0x is the value for invalid source of reference time?
> > [VR] Yes, on boot-up. In this case guest will go with PMTimer (not sure 
> > about HPET
> > but I can check). But if we set sequence to 0x after migration - 
> > it's probably will not work.
> 
> "Reference TSC during Save and Restore and Migration
> 
> To address migration scenarios to physical platforms that do not support
> iTSC, the TscSequence field is used. In the event that a guest partition
> is  migrated from an iTSC capable host to a non-iTSC capable host, the
> hypervisor sets TscSequence to the special value of 0x, which
> directs the guest operating system to fall back to a different clock
> source (for example, the virtual PM timer)."
> 
> Why it would not/does not work after migration?
> 
> [VR]
> Because of different frequencies, I think. 
> Hyper-V reference counters and iTSC report
> performance frequency equal to 10MHz,
which is obviously not true for PM and HPET timers.

Windows has to convert from the native hardware clock frequency to
internal system frequency, so i don't believe this is a problem.

> Windows calibrates timers on boot-up and you probably 
> have no chance to do it after or during resume.  

It is documented as such, it has been designed to fallback
to other hardware clock devices. Is there evidence for any 
problem on fallback?

Earlier you said:

"> What if you put 0x as a sequence? Or is this another case where
> the spec is wrong.
>
it will use PMTimer (maybe HPET if you have it) if you specify it on
VM's start up. But I'm not sure if it will work if you migrate from TSC
or reference counter to 0x
"
On startup, not after migration, when you migrate to host w/o iTSC and/or 
reference counters support.




Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

2013-05-24 Thread Vadim Rozenfeld


- Original Message -
From: "Marcelo Tosatti" 
To: "Gleb Natapov" 
Cc: "Peter Lieven" , "Vadim Rozenfeld" , 
kvm@vger.kernel.org, p...@dlh.net
Sent: Thursday, May 23, 2013 11:35:59 PM
Subject: Re: [RFC PATCH v2 2/2] add support for Hyper-V invariant TSC

On Thu, May 23, 2013 at 12:13:16PM +0300, Gleb Natapov wrote:
> > >
> > >"Reference TSC during Save and Restore and Migration
> > >
> > >To address migration scenarios to physical platforms that do not support
> > >iTSC, the TscSequence field is used. In the event that a guest partition
> > >is  migrated from an iTSC capable host to a non-iTSC capable host, the
> > >hypervisor sets TscSequence to the special value of 0x, which
> > >directs the guest operating system to fall back to a different clock
> > >source (for example, the virtual PM timer)."
> > >
> > >Why it would not/does not work after migration?
> > >
> > >
> > 
> > what exactly do we need the reference TSC for? the reference counter alone
> > works great and it seems that there is a lot of trouble and crash
> > possibilities involved with the reference tsc.
> > 
> Reference TSC is even faster. There should be no crashes with a proper
> implementation.
> 
> --
>   Gleb.

Lack of invariant TSC support in the host.

If there is no iTSC in the host -> set the sequence to 0 and go with the
reference counter. That is why they are both scaled to 10 MHz, and why the
reference counter is the fall-back for iTSC.



Re: updated: kvm networking todo wiki

2013-05-24 Thread Jason Wang
On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
> Hey guys,
> I've updated the kvm networking todo wiki with current projects.
> Will try to keep it up to date more often.
> Original announcement below.

Thanks a lot. I've added the tasks I'm currently working on to the wiki.

btw. I notice the virtio-net data plane was missed in the wiki. Is the
project still being considered?
> 
>
> I've put up a wiki page with a kvm networking todo list,
> mainly to avoid effort duplication, but also in the hope
> to draw attention to what I think we should try addressing
> in KVM:
>
> http://www.linux-kvm.org/page/NetworkingTodo
>
> This page could cover all networking related activity in KVM,
> currently most info is related to virtio-net.
>
> Note: if there's no developer listed for an item,
> this just means I don't know of anyone actively working
> on an issue at the moment, not that no one intends to.
>
> I would appreciate it if others working on one of the items on this list
> would add their names so we can communicate better.  If others like this
> wiki page, please go ahead and add stuff you are working on if any.
>
> It would be especially nice to add autotest projects:
> there is just a short test matrix and a catch-all
> 'Cover test matrix with autotest', currently.
>
> Currently there are some links to Red Hat bugzilla entries,
> feel free to add links to other bugzillas.
>
> Thanks!
>



Re: [PATCH] kvm: add detail error message when fail to add ioeventfd

2013-05-24 Thread Amos Kong
On Thu, May 23, 2013 at 09:46:07AM +0200, Stefan Hajnoczi wrote:
> On Wed, May 22, 2013 at 09:48:21PM +0800, Amos Kong wrote:
> > On Wed, May 22, 2013 at 11:32:27AM +0200, Stefan Hajnoczi wrote:
> > > On Wed, May 22, 2013 at 12:57:35PM +0800, Amos Kong wrote:
> > > > I try to hotplug 28 * 8 multiple-function devices to guest with
> > > > old host kernel, ioeventfds in host kernel will be exhausted, then
> > > > qemu fails to allocate ioeventfds for blk/nic devices.
> > > > 
> > > > It's better to add detail error here.
> > > > 
> > > > Signed-off-by: Amos Kong 
> > > > ---
> > > >  kvm-all.c |4 
> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> > > 
> > > It would be nice to make kvm bus scalable so that the hardcoded
> > > in-kernel I/O device limit can be lifted.
> > 
> > I had increased the kernel NR_IOBUS_DEVS limit to 1000 (a limit is needed for
> > security) last March, and made resizing of the kvm_io_range array dynamic.
> 
> The maximum should not be hardcoded.  File descriptor, maximum memory,
> etc are all controlled by rlimits.  And since ioeventfds are file
> descriptors they are already limited by the maximum number of file
> descriptors.

To implement dynamic resizing of the kvm_io_range array,
I re-allocate a new array (with the new size) and free the old array
when the array grows or shrinks. The array is only resized when
ioeventfds are added or removed, so it does not affect performance.
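
In other words, registration does roughly the following (a sketch of
the allocation dance, close to what kvm_io_bus_register_dev already
does apart from the fixed limit check):

	/* allocate a bus one slot larger, copy, publish, free the old one */
	new_bus = kzalloc(sizeof(*bus) + (bus->dev_count + 1) *
			  sizeof(struct kvm_io_range), GFP_KERNEL);
	if (!new_bus)
		return -ENOMEM;
	memcpy(new_bus, bus, sizeof(*bus) +
	       bus->dev_count * sizeof(struct kvm_io_range));
	kvm_io_bus_insert_dev(new_bus, dev, addr, len);
	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
	synchronize_srcu_expedited(&kvm->srcu);
	kfree(bus);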
 
> Why is there a need to impose a hardcoded limit?

I will send a patch to fix it.
 
> Stefan

-- 
Amos.


USB Dongle problems using passthrough.

2013-05-24 Thread Caspar Smit
Hi all,

I have a KVM system running debian squeeze with the backports qemu-kvm packages:

# dpkg --list |grep kvm
ii  qemu-kvm   1.1.2+dfsg-5~bpo60+1   Full virtualization on x86 hardware

I'm running 2 VMs with Windows 7 installed.
The software running on the VMs requires a hardware USB dongle (for licensing).

I managed to get them to work using USB2.0 passthrough using the usb2
controller config file
(ich9-ehci-uhci.cfg).
I've added the following command line switches to the kvm command:

-readconfig /kvm/ich9-ehci-uhci.cfg -device
usb-host,hostbus=2,hostaddr=3,bus=ehci.0

offcourse for the other vm it uses a different hostbus,hostaddr.

So far so good and everything works.

Now when I reboot one of the VMs, the USB dongle doesn't work anymore
and comes up in the Windows Device Manager with a yellow exclamation mark
and the status:

"This device cannot start. (Code 10)"

To get it to work again, I have to shut down the entire kvm host system
and boot it up again, forcing me to shut down all other VMs too.

It seems the USB Dongle expects some kind of power cycle between
reboots and obviously this power cycle never happens on a virtualized
system.

I tested with a simple USB Flash Drive which didn't have any issues
after a reboot of the VM, so i guess i need to look at the USB Dongle
which causes this behaviour.

Now my question:

Is there a way to force a power cycle or reset of a single USB port
while rebooting a VM or is there another way I can fix this?

Kind regards and thanks in advance,
Caspar