[RFC v2 PATCH] kvm tools, qcow: Add support for copy-on-write clusters

2011-11-18 Thread Lan Tianyu
When a write request targets a cluster whose copied flag is not set,
allocate a new cluster and write the original data, with the requested
modification, to the new cluster. This also adds support for writing to
qcow2 compressed images. After these changes, the image file passes
qemu-img check.

Signed-off-by: Lan Tianyu <tianyu@intel.com>
---
 tools/kvm/disk/qcow.c        |  366 +-
 tools/kvm/include/kvm/qcow.h |    2 +
 2 files changed, 255 insertions(+), 113 deletions(-)
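
The copy-on-write path the commit message describes can be summarized by
the following sketch (illustrative only, using plain POSIX I/O and
hypothetical parameters; QCOW_OFLAG_COPIED is the qcow2 marker meaning
"refcount == 1, safe to write in place"):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define QCOW_OFLAG_COPIED	(1ULL << 63)

/* 'scratch' is a caller-provided buffer of clust_size bytes and
 * 'new_clust' a freshly allocated cluster; the caller must drop a
 * reference on the old cluster when a copy is made.  Error handling
 * is omitted for brevity. */
static uint64_t cow_write(int fd, uint64_t l2_entry, uint64_t clust_size,
			  uint64_t off, const void *buf, size_t len,
			  void *scratch, uint64_t new_clust)
{
	uint64_t old = l2_entry & ~QCOW_OFLAG_COPIED;

	if (l2_entry & QCOW_OFLAG_COPIED) {
		/* Unshared cluster: modify it in place. */
		pwrite(fd, buf, len, old + off);
		return l2_entry;
	}

	/* Shared cluster: read the original data, apply the modification,
	 * and write the result to the freshly allocated cluster. */
	pread(fd, scratch, clust_size, old);
	memcpy((char *)scratch + off, buf, len);
	pwrite(fd, scratch, clust_size, new_clust);

	/* Repoint the L2 entry at the new, now-unshared cluster. */
	return new_clust | QCOW_OFLAG_COPIED;
}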

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..4d9125d 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -618,9 +615,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +700,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +727,121 @@ error_free_rfb:
return NULL;
 }
 
-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use uses qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metdata
- * operation fails. The two metadat operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	write_refcount_block(q, rfb);
+
+	/* update free_clust_idx since refcount becomes zero */
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u64 start, end, offset;
+
+	start = clust_start & ~(q->cluster_size - 1);
+	end = (clust_start + size - 1) & ~(q->cluster_size - 1);
+	for (offset = start; offset <= end; offset += q->cluster_size)
+		update_cluster_refcount(q, offset >> header->cluster_bits, -1);
+}
+
+/*
+ * Allocate clusters according to the size. Find a position that
+ * can satisfy the size. free_clust_idx is initialized to zero and
+ * records the last position.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header;
+
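
The allocation strategy the comment describes (scan refcounts upward from
free_clust_idx until a run of free clusters large enough for the request
is found) can be sketched as follows; this is assumed logic built on the
helpers above, not the patch's actual function body:

static u64 qcow_alloc_clusters_sketch(struct qcow *q, u64 size)
{
	struct qcow_header *header = q->header;
	u64 nr_clusters, clust_idx, i;
	u16 refcount;

	/* Round the request up to a whole number of clusters. */
	nr_clusters = (size + q->cluster_size - 1) >> header->cluster_bits;

	/* Scan from the last known free position for a free run. */
	clust_idx = q->free_clust_idx;
	for (i = 0; i < nr_clusters; ) {
		refcount = qcow_get_refcount(q, clust_idx + i);
		if (refcount == (u16)-1)
			return 0;	/* refcount lookup failed */
		if (refcount) {		/* in use: restart the run after it */
			clust_idx += i + 1;
			i = 0;
		} else {
			i++;
		}
	}

	/* Claim the clusters and remember where the scan ended. */
	for (i = 0; i < nr_clusters; i++)
		update_cluster_refcount(q, clust_idx + i, 1);
	q->free_clust_idx = clust_idx + nr_clusters;

	return clust_idx << header->cluster_bits;
}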

[PATCH v2 0/6] KVM: optimize memslots searching

2011-11-18 Thread Xiao Guangrong
This is further work based on my v1 patchset, which was posted some months
ago and can be found at:
https://lkml.org/lkml/2011/2/22/68

Change log:
- sort memslots by size and do a linear search instead of a binary search
  based on gfn; the idea is from Avi.
- in order to reduce the cache footprint, memslots are sorted in the
  kvm->memslots->memslots[] array, and a table is introduced to map a slot
  id to its index in the array

Here are the performance results:

autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
ept=1:  before: 449.5   after: 447.8
ept=0:  before: 532.7   after: 529.8

kernbench(average):
ept=1:  before: 127.94  after: 126.98
ept=0:  before: 196.85  after: 189.66



[PATCH v2 1/6] KVM: introduce KVM_MEM_SLOTS_NUM macro

2011-11-18 Thread Xiao Guangrong
Introduce the KVM_MEM_SLOTS_NUM macro to replace
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +++-
 arch/x86/kvm/mmu.c              |    2 +-
 include/linux/kvm_host.h        |    7 +--
 virt/kvm/kvm_main.c             |    2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..b80a310 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,6 +31,8 @@
 #define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+
 #define KVM_MMIO_SIZE 16

 #define KVM_PIO_PAGE_OFFSET 1
@@ -228,7 +230,7 @@ struct kvm_mmu_page {
 * One bit set per slot which has memory
 * in this shadow page.
 */
-   DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+   DECLARE_BITMAP(slot_bitmap, KVM_MEM_SLOTS_NUM);
bool unsync;
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a9b3a32..e6c2755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1337,7 +1337,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 					       PAGE_SIZE);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+	bitmap_zero(sp->slot_bitmap, KVM_MEM_SLOTS_NUM);
 	sp->parent_ptes = 0;
 	mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6a2ec9..a42d50b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -226,11 +226,14 @@ struct kvm_irq_routing_table {};

 #endif

+#ifndef KVM_MEM_SLOTS_NUM
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#endif
+
 struct kvm_memslots {
int nmemslots;
u64 generation;
-   struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
-   KVM_PRIVATE_MEM_SLOTS];
+   struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
 };

 struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..eb55a11 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -662,7 +662,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			(void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size)))
 		goto out;
-	if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+	if (mem->slot >= KVM_MEM_SLOTS_NUM)
 		goto out;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
 		goto out;
-- 
1.7.7.1



[PATCH v2 2/6] KVM: introduce update_memslots function

2011-11-18 Thread Xiao Guangrong
Introduce update_memslots() to update the slot that is to be installed
into kvm->memslots.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/x86/kvm/x86.c       |    2 +-
 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |   22 +++---
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..56d1a39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3504,7 +3504,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		goto out;
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 	slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
-	slots->generation++;
+	update_memslots(slots, NULL);
 
 	old_slots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a42d50b..bb8728e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -319,6 +319,7 @@ void kvm_exit(void);

 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
+void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new);

 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eb55a11..0fb3aeb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -629,6 +629,19 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 }
 #endif /* !CONFIG_S390 */
 
+void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
+{
+	if (new) {
+		int id = new->id;
+
+		slots->memslots[id] = *new;
+		if (id >= slots->nmemslots)
+			slots->nmemslots = id + 1;
+	}
+
+	slots->generation++;
+}
+
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -779,10 +792,8 @@ skip_lpage:
 	if (!slots)
 		goto out_free;
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-	if (mem->slot >= slots->nmemslots)
-		slots->nmemslots = mem->slot + 1;
-	slots->generation++;
 	slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
+	update_memslots(slots, NULL);
 
 	old_memslots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
@@ -814,9 +825,6 @@ skip_lpage:
 	if (!slots)
 		goto out_free;
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-	if (mem->slot >= slots->nmemslots)
-		slots->nmemslots = mem->slot + 1;
-	slots->generation++;
 
 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
@@ -826,7 +834,7 @@ skip_lpage:
 			new.lpage_info[i] = NULL;
 	}
 
-	slots->memslots[mem->slot] = new;
+	update_memslots(slots, &new);
 	old_memslots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
 	synchronize_srcu_expedited(&kvm->srcu);
-- 
1.7.7.1



[PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-18 Thread Xiao Guangrong
Introduce kvm_for_each_memslot to walk all valid memslots.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/ia64/kvm/kvm-ia64.c |    4 +---
 arch/x86/kvm/mmu.c       |    9 +
 include/linux/kvm_host.h |    4 
 virt/kvm/iommu.c         |   13 +++--
 virt/kvm/kvm_main.c      |    5 +++--
 5 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 43f4c92..7d5bc81 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1370,10 +1370,8 @@ static void kvm_release_vm_pages(struct kvm *kvm)
 	unsigned long base_gfn;
 
 	slots = kvm_memslots(kvm);
-	for (i = 0; i < slots->nmemslots; i++) {
-		memslot = &slots->memslots[i];
+	kvm_for_each_memslot(slots, memslot, i) {
 		base_gfn = memslot->base_gfn;
-
 		for (j = 0; j < memslot->npages; j++) {
 			if (memslot->rmap[j])
 				put_page((struct page *)memslot->rmap[j]);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e6c2755..53dbae0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1120,11 +1120,11 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 	int ret;
 	int retval = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
 
 	slots = kvm_memslots(kvm);
 
-	for (i = 0; i < slots->nmemslots; i++) {
-		struct kvm_memory_slot *memslot = &slots->memslots[i];
+	kvm_for_each_memslot(slots, memslot, i) {
 		unsigned long start = memslot->userspace_addr;
 		unsigned long end;
 

@@ -3977,11 +3977,12 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
 	unsigned int nr_mmu_pages;
 	unsigned int  nr_pages = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
 
 	slots = kvm_memslots(kvm);
 
-	for (i = 0; i < slots->nmemslots; i++)
-		nr_pages += slots->memslots[i].npages;
+	kvm_for_each_memslot(slots, memslot, i)
+		nr_pages += memslot->npages;
 
 	nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
 	nr_mmu_pages = max(nr_mmu_pages,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bb8728e..10524c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 	     idx++)
 
+#define kvm_for_each_memslot(slots, memslot, i)	\
+	for (i = 0; i < (slots)->nmemslots &&	\
+	      ({ memslot = &(slots)->memslots[i]; 1; }); i++)
+
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index d5f3b8d..02df243 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -134,12 +134,13 @@ static int kvm_iommu_map_memslots(struct kvm *kvm)
 {
 	int i, idx, r = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);
 
-	for (i = 0; i < slots->nmemslots; i++) {
-		r = kvm_iommu_map_pages(kvm, &slots->memslots[i]);
+	kvm_for_each_memslot(slots, memslot, i) {
+		r = kvm_iommu_map_pages(kvm, memslot);
 		if (r)
 			break;
 	}
@@ -311,14 +312,14 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm)
 {
 	int i, idx;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);
 
-	for (i = 0; i < slots->nmemslots; i++) {
-		kvm_iommu_put_pages(kvm, slots->memslots[i].base_gfn,
-				    slots->memslots[i].npages);
-	}
+	kvm_for_each_memslot(slots, memslot, i)
+		kvm_iommu_put_pages(kvm, memslot->base_gfn, memslot->npages);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0fb3aeb..ec3b03b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -549,9 +549,10 @@ void kvm_free_physmem(struct kvm *kvm)
 {
 	int i;
 	struct kvm_memslots *slots = kvm->memslots;
+	struct kvm_memory_slot *memslot;
 
-	for (i = 0; i < slots->nmemslots; ++i)
-		kvm_free_physmem_slot(&slots->memslots[i], NULL);
+	kvm_for_each_memslot(slots, memslot, i)
+		kvm_free_physmem_slot(memslot, NULL);
 
 	kfree(kvm->memslots);
 }
 }
-- 
1.7.7.1



[PATCH v2 4/6] KVM: introduce id_to_memslot function

2011-11-18 Thread Xiao Guangrong
Introduce id_to_memslot() to get a memslot by its slot id.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/ia64/kvm/kvm-ia64.c  |    2 +-
 arch/powerpc/kvm/book3s.c |    2 +-
 arch/x86/kvm/vmx.c        |    6 --
 arch/x86/kvm/x86.c        |    8 +---
 include/linux/kvm_host.h  |    6 ++
 virt/kvm/kvm_main.c       |   12 
 6 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7d5bc81..773a984 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1818,7 +1818,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	if (log->slot >= KVM_MEMORY_SLOTS)
 		goto out;
 
-	memslot = &kvm->memslots->memslots[log->slot];
+	memslot = id_to_memslot(kvm->memslots, log->slot);
 	r = -ENOENT;
 	if (!memslot->dirty_bitmap)
 		goto out;
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index f68a34d..c1128d8 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -497,7 +497,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 
 	/* If nothing is dirty, don't bother messing with page tables. */
 	if (is_dirty) {
-		memslot = &kvm->memslots->memslots[log->slot];
+		memslot = id_to_memslot(kvm->memslots, log->slot);
 
 		ga = memslot->base_gfn << PAGE_SHIFT;
 		ga_end = ga + (memslot->npages << PAGE_SHIFT);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6e28d58..4a92073 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2716,11 +2716,13 @@ static gva_t rmode_tss_base(struct kvm *kvm)
 {
 	if (!kvm->arch.tss_addr) {
 		struct kvm_memslots *slots;
+		struct kvm_memory_slot *slot;
 		gfn_t base_gfn;
 
 		slots = kvm_memslots(kvm);
-		base_gfn = slots->memslots[0].base_gfn +
-			 kvm->memslots->memslots[0].npages - 3;
+		slot = id_to_memslot(slots, 0);
+		base_gfn = slot->base_gfn + slot->npages - 3;
+
 		return base_gfn << PAGE_SHIFT;
 	}
 	return kvm->arch.tss_addr;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 56d1a39..315c2ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3478,7 +3478,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	if (log->slot >= KVM_MEMORY_SLOTS)
 		goto out;
 
-	memslot = &kvm->memslots->memslots[log->slot];
+	memslot = id_to_memslot(kvm->memslots, log->slot);
 	r = -ENOENT;
 	if (!memslot->dirty_bitmap)
 		goto out;
@@ -3491,6 +3491,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	/* If nothing is dirty, don't bother messing with page tables. */
 	if (is_dirty) {
 		struct kvm_memslots *slots, *old_slots;
+		struct kvm_memory_slot *slot;
 		unsigned long *dirty_bitmap;
 
 		dirty_bitmap = memslot->dirty_bitmap_head;
@@ -3503,13 +3504,14 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		if (!slots)
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slot = id_to_memslot(slots, log->slot);
+		slot->dirty_bitmap = dirty_bitmap;
 		update_memslots(slots, NULL);
 
 		old_slots = kvm->memslots;
 		rcu_assign_pointer(kvm->memslots, slots);
 		synchronize_srcu_expedited(&kvm->srcu);
-		dirty_bitmap = old_slots->memslots[log->slot].dirty_bitmap;
+		dirty_bitmap = memslot->dirty_bitmap;
 		kfree(old_slots);
 
 		spin_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 10524c0..a0e4d63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,12 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 			|| lockdep_is_held(&kvm->slots_lock));
 }
 
+static inline struct kvm_memory_slot *
+id_to_memslot(struct kvm_memslots *slots, int id)
+{
+	return &slots->memslots[id];
+}
+
 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
 #define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
 static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ec3b03b..09bb794 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -634,8 +634,9 @@ void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
 {
 	if (new) {
 		int id = new->id;
+		struct kvm_memory_slot *old = id_to_memslot(slots, id);
 
-		slots->memslots[id] = *new;
+		*old = *new;
 		if (id >= slots->nmemslots)
 			slots->nmemslots = id + 1;
 	}
@@ -681,7 +682,7 @@ int 

[PATCH v2 5/6] KVM: sort memslots by its size and use line search

2011-11-18 Thread Xiao Guangrong
Sort the memslots by size and use a linear search to find them, so that
the search hits the larger (and typically more frequently accessed)
memslots first and they get a better fit.

The idea is from Avi.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 include/linux/kvm_host.h |   22 +---
 virt/kvm/kvm_main.c      |   82 -
 2 files changed, 75 insertions(+), 29 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a0e4d63..83b396a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -230,8 +230,12 @@ struct kvm_irq_routing_table {};
 #define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 #endif

+/*
+ * Note:
+ * memslots are not sorted by id anymore, please use id_to_memslot()
+ * to get the memslot by its id.
+ */
 struct kvm_memslots {
-   int nmemslots;
u64 generation;
struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
 };
@@ -307,9 +311,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 	     idx++)
 
-#define kvm_for_each_memslot(slots, memslot, i)	\
-	for (i = 0; i < (slots)->nmemslots &&	\
-	      ({ memslot = &(slots)->memslots[i]; 1; }); i++)
+#define kvm_for_each_memslot(slots, memslot, i)	\
+	for (i = 0; i < KVM_MEM_SLOTS_NUM &&		\
+	      ({ memslot = &(slots)->memslots[i]; 1; }) &&	\
+	      memslot->npages != 0; i++)

 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
@@ -335,7 +340,14 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 static inline struct kvm_memory_slot *
 id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	return &slots->memslots[id];
+	int i;
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		if (slots->memslots[i].id == id)
+			return &slots->memslots[i];
+
+	WARN_ON(1);
+	return NULL;
 }

 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09bb794..9425e3a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -440,6 +440,15 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+static void kvm_init_memslots_id(struct kvm *kvm)
+{
+	int i;
+	struct kvm_memslots *slots = kvm->memslots;
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		slots->memslots[i].id = i;
+}
+
 static struct kvm *kvm_create_vm(void)
 {
 	int r, i;
@@ -465,6 +474,7 @@ static struct kvm *kvm_create_vm(void)
 	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!kvm->memslots)
 		goto out_err_nosrcu;
+	kvm_init_memslots_id(kvm);
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_nosrcu;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -630,15 +640,55 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 }
 #endif /* !CONFIG_S390 */
 
+static struct kvm_memory_slot *
+search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+{
+	int i;
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(slots, memslot, i)
+		if (gfn >= memslot->base_gfn &&
+		    gfn < memslot->base_gfn + memslot->npages)
+			return memslot;
+
+	return NULL;
+}
+
+static int cmp_memslot(const void *slot1, const void *slot2)
+{
+	struct kvm_memory_slot *s1, *s2;
+
+	s1 = (struct kvm_memory_slot *)slot1;
+	s2 = (struct kvm_memory_slot *)slot2;
+
+	if (s1->npages < s2->npages)
+		return 1;
+	if (s1->npages > s2->npages)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Sort the memslots by size, so that the larger slots
+ * will get a better fit.
+ */
+static void sort_memslots(struct kvm_memslots *slots)
+{
+	sort(slots->memslots, KVM_MEM_SLOTS_NUM,
+	      sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+}
+
 void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
 {
 	if (new) {
 		int id = new->id;
 		struct kvm_memory_slot *old = id_to_memslot(slots, id);
+		unsigned long npages = old->npages;
 
 		*old = *new;
-		if (id >= slots->nmemslots)
-			slots->nmemslots = id + 1;
+		if (new->npages != npages)
+			sort_memslots(slots);
 	}
 
 	slots->generation++;
@@ -979,16 +1029,7 @@ EXPORT_SYMBOL_GPL(kvm_is_error_hva);
 static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
gfn_t gfn)
 {
-	int i;
-
-	for (i = 0; i < slots->nmemslots; ++i) {
-		struct kvm_memory_slot *memslot = &slots->memslots[i];
-
-		if (gfn >= memslot->base_gfn
-		    && gfn < 

[PATCH] vhost-net: Acquire device lock when releasing device

2011-11-18 Thread Sasha Levin
Device lock should be held when releasing a device, and specifically
when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:

[ 2025.642835] ===============================
[ 2025.643838] [ INFO: suspicious RCU usage. ]
[ 2025.645182] -------------------------------
[ 2025.645927] drivers/vhost/vhost.c:475 suspicious rcu_dereference_protected() usage!
[ 2025.647329]
[ 2025.647330] other info that might help us debug this:
[ 2025.647331]
[ 2025.649042]
[ 2025.649043] rcu_scheduler_active = 1, debug_locks = 1
[ 2025.650235] no locks held by trinity/21042.
[ 2025.650971]
[ 2025.650972] stack backtrace:
[ 2025.651789] Pid: 21042, comm: trinity Not tainted 3.2.0-rc2-sasha-00057-ga9098b3 #5
[ 2025.653342] Call Trace:
[ 2025.653792]  [<ffffffff810b4a6a>] lockdep_rcu_suspicious+0xaf/0xb9
[ 2025.654916]  [<ffffffff818d4c2c>] vhost_dev_cleanup+0x342/0x3ac
[ 2025.655985]  [<ffffffff818d4f18>] vhost_net_release+0x48/0x7f
[ 2025.657247]  [<ffffffff811416e3>] fput+0x11e/0x1dc
[ 2025.658091]  [<ffffffff8113f1ec>] filp_close+0x6e/0x79
[ 2025.658964]  [<ffffffff81089ed7>] put_files_struct+0xcc/0x196
[ 2025.659971]  [<ffffffff8108a034>] exit_files+0x46/0x4f
[ 2025.660865]  [<ffffffff8108a716>] do_exit+0x264/0x75c
[ 2025.662201]  [<ffffffff8113f490>] ? fsnotify_modify+0x60/0x68
[ 2025.663260]  [<ffffffff81bbdbca>] ? sysret_check+0x2e/0x69
[ 2025.664269]  [<ffffffff8108acc1>] do_group_exit+0x83/0xb1
[ 2025.665448]  [<ffffffff8108ad01>] sys_exit_group+0x12/0x16
[ 2025.666396]  [<ffffffff81bbdb92>] system_call_fastpath+0x16/0x1b
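
The warning comes from vhost_dev_cleanup() tearing down device state via
rcu_dereference_protected() with a lockdep assertion on the device mutex,
roughly this pattern (a sketch of the drivers/vhost/vhost.c code, not an
exact quote):

	/* Only valid when dev->mutex is held; lockdep verifies that. */
	kfree(rcu_dereference_protected(dev->memory,
					lockdep_is_held(&dev->mutex)));

Since vhost_net_release() did not take n->dev.mutex, lockdep_is_held()
fails and the splat above is printed.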

Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 drivers/vhost/net.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 882a51f..c9be601 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -586,6 +586,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	struct socket *tx_sock;
 	struct socket *rx_sock;
 
+	mutex_lock(&n->dev.mutex);
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_cleanup(&n->dev);
@@ -596,6 +597,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	mutex_unlock(&n->dev.mutex);
 	kfree(n);
 	return 0;
 }
-- 
1.7.8.rc1



[PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array

2011-11-18 Thread Xiao Guangrong
Getting the dirty log is a frequent operation when framebuffer-based
displays are used (for example, X window), so we introduce a mapping
table to speed up id_to_memslot().

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 include/linux/kvm_host.h |   13 +++--
 virt/kvm/kvm_main.c      |    7 ++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83b396a..fbf66eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,6 +238,8 @@ struct kvm_irq_routing_table {};
 struct kvm_memslots {
u64 generation;
struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
+   /* The mapping table from slot id to the index in memslots[]. */
+   int id_to_index[KVM_MEM_SLOTS_NUM];
 };

 struct kvm {
@@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 static inline struct kvm_memory_slot *
 id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	int i;
+	int index = slots->id_to_index[id];
+	struct kvm_memory_slot *slot;
 
-	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
-		if (slots->memslots[i].id == id)
-			return &slots->memslots[i];
+	slot = &slots->memslots[index];
 
-	WARN_ON(1);
-	return NULL;
+	WARN_ON(slot->id != id);
+	return slot;
 }

 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9425e3a..0dad825 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -446,7 +446,7 @@ static void kvm_init_memslots_id(struct kvm *kvm)
 	struct kvm_memslots *slots = kvm->memslots;
 
 	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
-		slots->memslots[i].id = i;
+		slots->id_to_index[i] = slots->memslots[i].id = i;
 }
 
 static struct kvm *kvm_create_vm(void)
@@ -675,8 +675,13 @@ static int cmp_memslot(const void *slot1, const void *slot2)
  */
 static void sort_memslots(struct kvm_memslots *slots)
 {
+	int i;
+
 	sort(slots->memslots, KVM_MEM_SLOTS_NUM,
 	      sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		slots->id_to_index[slots->memslots[i].id] = i;
 }

 void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
-- 
1.7.7.1



Re: [PATCH v2 0/6] KVM: optimize memslots searching

2011-11-18 Thread Sasha Levin
On Fri, 2011-11-18 at 17:16 +0800, Xiao Guangrong wrote:
 This is further work based on my v1 patchset, which was posted some months
 ago and can be found at:
 https://lkml.org/lkml/2011/2/22/68
 
 Change log:
 - sort memslots by size and do a linear search instead of a binary search
   based on gfn; the idea is from Avi.
 - in order to reduce the cache footprint, memslots are sorted in the
   kvm->memslots->memslots[] array, and a table is introduced to map a slot
   id to its index in the array
 
 Here are the performance results:
 
 autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
 ept=1:before: 449.5   after: 447.8
 ept=0:before: 532.7   after: 529.8
 
 kernbench(average):
 ept=1:before: 127.94  after: 126.98
 ept=0:before: 196.85  after: 189.66

Hi,

Do these patches depend on any other patches? Looks like they don't
apply on kvm-next.

-- 

Sasha.



Re: [PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array

2011-11-18 Thread Sasha Levin
On Fri, 2011-11-18 at 17:20 +0800, Xiao Guangrong wrote:
 Getting the dirty log is a frequent operation when framebuffer-based
 displays are used (for example, X window), so we introduce a mapping
 table to speed up id_to_memslot().
 
 Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
 ---
  include/linux/kvm_host.h |   13 +++--
  virt/kvm/kvm_main.c  |7 ++-
  2 files changed, 13 insertions(+), 7 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 83b396a..fbf66eb 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -238,6 +238,8 @@ struct kvm_irq_routing_table {};
  struct kvm_memslots {
   u64 generation;
   struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
 + /* The mapping table from slot id to the index in memslots[]. */
 + int id_to_index[KVM_MEM_SLOTS_NUM];
  };
 
  struct kvm {
 @@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
  static inline struct kvm_memory_slot *
  id_to_memslot(struct kvm_memslots *slots, int id)
  {
 -	int i;
 +	int index = slots->id_to_index[id];
 +	struct kvm_memory_slot *slot;
 
 -	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
 -		if (slots->memslots[i].id == id)
 -			return &slots->memslots[i];
 +	slot = &slots->memslots[index];
 
 -	WARN_ON(1);
 -	return NULL;
 +	WARN_ON(slot->id != id);
 +	return slot;
  }

If we didn't find the right memslot we shouldn't be returning a wrong
one, we should be failing.

-- 

Sasha.



Re: [PATCH v2 0/6] KVM: optimize memslots searching

2011-11-18 Thread Xiao Guangrong
On 11/18/2011 05:41 PM, Sasha Levin wrote:


 autotest for RHEL.6.1 setup/boot/reboot/shutdown(average):
 ept=1:   before: 449.5   after: 447.8
 ept=0:   before: 532.7   after: 529.8

 kernbench(average):
 ept=1:   before: 127.94  after: 126.98
 ept=0:   before: 196.85  after: 189.66
 
 Hi,
 
 Do these patches depend on any other patches? Looks like they don't
 apply on kvm-next.
 


Ah, I forgot that I was using Avi's kvm tree; this patchset is based on
that. To avoid noise, I'll rebase it after it is reviewed.

Thank you, Sasha!



Re: [PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array

2011-11-18 Thread Xiao Guangrong
On 11/18/2011 05:45 PM, Sasha Levin wrote:


  struct kvm {
 @@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
  static inline struct kvm_memory_slot *
  id_to_memslot(struct kvm_memslots *slots, int id)
  {
 -	int i;
 +	int index = slots->id_to_index[id];
 +	struct kvm_memory_slot *slot;
 
 -	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
 -		if (slots->memslots[i].id == id)
 -			return &slots->memslots[i];
 +	slot = &slots->memslots[index];
 
 -	WARN_ON(1);
 -	return NULL;
 +	WARN_ON(slot->id != id);
 +	return slot;
  }
 
 If we didn't find the right memslot we shouldn't be returning a wrong
 one, we should be failing.
 


Um, in the current code, the id has already been checked by the caller;
that means id should be < KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS, so
if we cannot find the memslot for this slot, it must be a bug.



Re: [RFC v2 PATCH] kvm tools, qcow: Add support for copy-on-write clusters

2011-11-18 Thread Kevin Wolf
On 18.11.2011 09:47, Lan Tianyu wrote:
 When a write request targets a cluster whose copied flag is not set,
 allocate a new cluster and write the original data, with the requested
 modification, to the new cluster. This also adds support for writing to
 qcow2 compressed images. After these changes, the image file passes
 qemu-img check.
 
 Signed-off-by: Lan Tianyu <tianyu@intel.com>
 ---
 [...]
 +	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
 +	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
 +	rfb->dirty = 1;
 +
 +	/* write refcount block */
 +	write_refcount_block(q, rfb);

Missing error handling.


Re: Memory sync algorithm during migration

2011-11-18 Thread Oliver Hookins
On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote:
 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote:
  Adding qemu-devel ML to CC.
 
  Your question should have been sent to qemu-devel ML because the logic
  is implemented in QEMU, not KVM.
 
  (2011/11/11 1:35), Oliver Hookins wrote:
  Hi,
 
  I am performing some benchmarks on KVM migration on two different types of 
  VM.
  One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes 
  about 20
  seconds to migrate on our hardware while the 32GB VM takes about a minute.
 
  With a reasonable amount of memory activity going on (in the hundreds of 
  MB per
  second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never
  completes. Intuitively this tells me there is some watermarking of dirty 
  pages
  going on that is not particularly efficient when the dirty pages ratio is 
  high
  compared to total memory, but I may be completely incorrect.
 
  You can change the ratio IIRC.
  Hopefully, someone who knows QEMU well will tell you better ways.
 
  Takuya
 
 
  Could anybody fill me in on what might be going on here? We're using 
  libvirt
  0.8.2 and kvm-83-224.el5.centos.1
 
 This is a pretty old qemu/kvm code base.
 In principle, it makes no sense that migration finishes with 32GB RAM
 but cannot complete with 4GB RAM (intuitively it should be, if anything,
 the other way around).
 
 Do you have an easy test that makes the problem easily reproducible?
 Have you tried upstream qemu.git? (There have been some improvements in
 that department.)

On the surface this seems to be a bit beyond me. Is it reasonably
straightforward porting the patches from qemu-kvm to qemu? At the moment I'm
going to attempt just building the latest qemu-kvm (0.15 as opposed to 0.9.1)
- should this be enough?


Re: [PATCH 0/6] KVM: PPC: mostly booke: race and idle fixes, paravirt

2011-11-18 Thread Alexander Graf

On 09.11.2011, at 01:22, Scott Wood wrote:

 The first couple patches are some minor reorganization of the checking
 done prior to entering the guest, which should avoid some races relative
 to exception queueing, and give us a more clearly-named place (which later
 patches in the series use) to put other checks that we want to always run
 before entering the guest.
 
 Then there's a fix to booke idle handling to not block inside
 KVM_SET_REGS, and some paravirt fixes and new functionality.
 
 Apply after "KVM: PPC: booke: check for signals in kvmppc_vcpu_run".
 (I didn't think there'd end up being a dependency when I sent that...)
 
 This supersedes the first three patches of "KVM: PPC: booke: paravirt and
 timer" from 8/26.
 
 Scott Wood (6):
  KVM: PPC: Rename deliver_interrupts to prepare_to_enter
  KVM: PPC: Move prepare_to_enter call site into subarch code
  KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
  KVM: PPC: booke: Fix int_pending calculation for MSR[EE] paravirt
  KVM: PPC: booke: Paravirtualize wrtee
  KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn

Thanks, applied all to kvm-ppc-next.


Alex



Re: [PATCH v2] KVM: PPC: booke: Improve timer register emulation

2011-11-18 Thread Alexander Graf

On 17.11.2011, at 23:39, Scott Wood wrote:

 Decrementers are now properly driven by TCR/TSR, and the guest
 has full read/write access to these registers.
 
 The decrementer keeps ticking (and setting the TSR bit) regardless of
 whether the interrupts are enabled with TCR.
 
 The decrementer stops at zero, rather than going negative.
 
 Decrementers (and FITs, once implemented) are delivered as
 level-triggered interrupts -- dequeued when the TSR bit is cleared, not
 on delivery.
 
 Signed-off-by: Liu Yu <yu@freescale.com>
 [scottw...@freescale.com: significant changes]
 Signed-off-by: Scott Wood <scottw...@freescale.com>

Thanks, applied to kvm-ppc-next.

Alex



Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling

2011-11-18 Thread Alexander Graf

On 16.11.2011, at 23:50, Paul Mackerras wrote:

 This series of patches updates the Book3S-HV KVM code that manages the
 guest hashed page table (HPT) to enable several things:
 
 * MMIO emulation and MMIO pass-through
 
 * Use of small pages (4kB or 64kB, depending on config) to back the
  guest memory
 
 * Pageable guest memory - i.e. backing pages can be removed from the
  guest and reinstated on demand, using the MMU notifier mechanism.
 
 On PPC970 we have no way to get DSIs and ISIs to come to the
 hypervisor, so we can't do MMIO emulation or pageable guest memory.
 On POWER7 we set the VPM1 bit in the LPCR to make all DSIs and ISIs
 come to the hypervisor (host) as HDSIs or HISIs.
 
 This series is RFC for the moment, although the first 5 or so patches
 are pretty solid and could go in.  I am going to rework the later
 patches to use HPTEs with V=0 for the absent pages rather than key=31,
 which will require handling the HPTE-not-present HDSIs we will get and
 differentiating the case where the guest has created a HPTE but the
 underlying page is not resident from the case where the guest has
 created no HPTE for the address.

This touches areas that I'm sure non-PPC people would want to see as well. 
Could you please CC kvm@vger too next time?

Avi, Marcelo, mind to review some of the bits in this patch set? :)


Alex



Guest floppy regression hits qemu-kvm, qemu is not affected

2011-11-18 Thread Lucas Meneghel Rodrigues

Hi guys,

Today, during the latest 'sanity' qemu-kvm testing, we noticed a
recurring problem: the guest OS does not see the floppy, making the
Windows installs time out. This problem has been extensively discussed
on the qemu list, and qemu is fine here; the problem is specific to
qemu-kvm.


http://comments.gmane.org/gmane.comp.emulators.qemu/123928

So, this is something to go on your radars.

Cheers,

Lucas


Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-18 Thread Ben Hutchings
On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
  On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
   Changes for multiqueue virtio_net driver.
  [...]
   @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
    {
   	struct virtnet_info *vi = netdev_priv(dev);
   	int cpu;
   -	unsigned int start;
   
   	for_each_possible_cpu(cpu) {
   -		struct virtnet_stats __percpu *stats
   -			= per_cpu_ptr(vi->stats, cpu);
   -		u64 tpackets, tbytes, rpackets, rbytes;
   -
   -		do {
   -			start = u64_stats_fetch_begin(&stats->syncp);
   -			tpackets = stats->tx_packets;
   -			tbytes   = stats->tx_bytes;
   -			rpackets = stats->rx_packets;
   -			rbytes   = stats->rx_bytes;
   -		} while (u64_stats_fetch_retry(&stats->syncp, start));
   -
   -		tot->rx_packets += rpackets;
   -		tot->tx_packets += tpackets;
   -		tot->rx_bytes   += rbytes;
   -		tot->tx_bytes   += tbytes;
   +		int qpair;
   +
   +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
   +			struct virtnet_send_stats __percpu *tx_stat;
   +			struct virtnet_recv_stats __percpu *rx_stat;
  
  While you're at it, you can drop the per-CPU stats and make them only
  per-queue.  There is unlikely to be any benefit in maintaining them
  per-CPU while receive and transmit processing is serialised per-queue.
 
 It allows you to update stats without a lock.

But you'll already be holding a lock related to the queue.

 What's the benefit of having them per queue?

It should save some memory (and a little time when summing stats, though
that's unlikely to matter much).

The important thing is that splitting up stats per-CPU *and* per-queue
is a waste.
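
For illustration, a minimal sketch of per-queue-only stats (hypothetical
names; the u64_stats_sync pattern is the same one the quoted code already
uses):

#include <linux/u64_stats_sync.h>

/* Per-queue counters, updated only under the queue's NAPI/xmit
 * serialisation, so no per-CPU copies are needed. */
struct virtnet_queue_stats {
	struct u64_stats_sync syncp;
	u64 packets;
	u64 bytes;
};

static void queue_stats_add(struct virtnet_queue_stats *stats,
			    unsigned int len)
{
	u64_stats_update_begin(&stats->syncp);
	stats->packets++;
	stats->bytes += len;
	u64_stats_update_end(&stats->syncp);
}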

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-18 Thread Sasha Levin
On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote:
 On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
  On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
   On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
Changes for multiqueue virtio_net driver.
   [...]
   
   While you're at it, you can drop the per-CPU stats and make them only
   per-queue.  There is unlikely to be any benefit in maintaining them
   per-CPU while receive and transmit processing is serialised per-queue.
  
  It allows you to update stats without a lock.
 
 But you'll already be holding a lock related to the queue.

Right, but now you're holding a queue lock only when manipulating the
queue; we don't hold it while processing the data, which is when we
usually need to update the stats.

  What's the benefit of having them per queue?
 
 It should save some memory (and a little time when summing stats, though
 that's unlikely to matter much).
 
 The important thing is that splitting up stats per-CPU *and* per-queue
 is a waste.
 
 Ben.
 


-- 

Sasha.



KVM for the enterprise

2011-11-18 Thread Michael Waite

 Hi,
The Open Virtualization Alliance is holding a webinar on December 8th
intended to help promote KVM as an enterprise-class hypervisor. I see so
much great engineering work going on to make KVM a really robust
technology, and we want to help tell this story to the worldwide
audience.


We will have a couple of companies presenting how they deployed KVM in
their infrastructure and the benefits they realized.
There will also be a speaker from IDC discussing virtualization-specific
performance-enhancing features, like SR-IOV, hugepages, asynchronous I/O
and more.
If anyone would like to attend this webinar (or help promote it), I have
posted a link on the OVA site here:

http://www.openvirtualizationalliance.org/news/index.html

Thanks.

Mike



Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-18 Thread Greg Rose


On 11/17/2011 4:44 PM, Ben Hutchings wrote:
 On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
  On 11/17/2011 4:15 PM, Ben Hutchings wrote:
   Sorry to come to this rather late.
  
   On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
   [...]
    v2 -> v3
    - Moved set and get filter ops from rtnl_link_ops to netdev_ops
    - Support for SRIOV VFs.
      [Note: The get filters msg (in the way current get rtnetlink handles
      it) might get too big for SRIOV vfs. This patch follows existing
      sriov vf get code and tries to accommodate filters for all VF's in a
      PF. And for the SRIOV case I have only tested the fact that the VF
      arguments are getting delivered to rtnetlink correctly. The code
      follows existing sriov vf handling code so rest of it should work
      fine]
   [...]
  
   This is already broken for large numbers of VFs, and increasing the
   amount of information per VF is going to make the situation worse.  I am
   no netlink expert but I think that the current approach of bundling all
   information about an interface in a single message may not be
   sustainable.
  
   Also, I'm unclear on why this interface is to be used to set filtering
   for the (PF) net device as well as for related VFs.  Doesn't that
   duplicate the functionality of ndo_set_rx_mode and
   ndo_vlan_rx_{add,kill}_vid?
 
  Functionally yes but contextually no.  This allows the PF driver to know
  that it is setting these filters in the context of the existence of VFs,
  allowing it to take appropriate action.  The other two functions may be
  called without the presence of SR-IOV enablement and the existence of VFs.
 
  Anyway, that's why I asked Roopa to add that capability.
 
 I don't follow.  The PF driver already knows whether it has enabled VFs.
 
 How do filters set this way interact with filters set through the
 existing operations?  Should they override promiscuous mode?  None of
 this has been specified.

Promiscuous mode is exactly the issue this feature is intended for.  I'm
not familiar with the solarflare device, but Intel HW promiscuous mode is
only promiscuous on the physical port, not on the VEB.  So a packet sent
from a VF will not be captured by the PF across the VEB unless the MAC
and VLAN filters have been programmed into the HW.  So you may not need
the feature for your devices, but it is required for Intel devices.  And
it's a fairly simple request: just allow -1 to indicate that the target
of the filter requests is the PF itself.  Using the already existing
set_rx_mode function won't work because the PF driver will look at it
and figure it's in promiscuous mode anyway, so it won't set the filters
into the HW.  At least that is how it is in the case of our HW and
driver.  Again, the behavior of your HW and driver is unknown to me and
thus you may not require this feature.


- Greg


Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-18 Thread Ben Hutchings
On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote:
  On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
   On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
 Changes for multiqueue virtio_net driver.
[...]

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.
   
   It allows you to update stats without a lock.
  
  But you'll already be holding a lock related to the queue.
 
  Right, but now you're holding a queue lock only when manipulating the
  queue; we don't hold it while processing the data, which is when we
  usually need to update the stats.
[...]

The *stack* is holding the appropriate lock when calling the NAPI poll
function or ndo_start_xmit function.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-18 Thread Ben Hutchings
On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote:
 On 11/17/2011 4:44 PM, Ben Hutchings wrote:
  On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
  On 11/17/2011 4:15 PM, Ben Hutchings wrote:
  Sorry to come to this rather late.
 
  On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
  [...]
   v2 -> v3
   - Moved set and get filter ops from rtnl_link_ops to netdev_ops
   - Support for SRIOV VFs.
     [Note: The get filters msg (in the way current get rtnetlink handles
     it) might get too big for SRIOV vfs. This patch follows existing
     sriov vf get code and tries to accommodate filters for all VF's in a
     PF. And for the SRIOV case I have only tested the fact that the VF
     arguments are getting delivered to rtnetlink correctly. The code
     follows existing sriov vf handling code so rest of it should work
     fine]
  [...]
 
  This is already broken for large numbers of VFs, and increasing the
  amount of information per VF is going to make the situation worse.  I am
  no netlink expert but I think that the current approach of bundling all
  information about an interface in a single message may not be
  sustainable.
 
  Also, I'm unclear on why this interface is to be used to set filtering
  for the (PF) net device as well as for related VFs.  Doesn't that
  duplicate the functionality of ndo_set_rx_mode and
  ndo_vlan_rx_{add,kill}_vid?
 
  Functionally yes but contextually no.  This allows the PF driver to know
  that it is setting these filters in the context of the existence of VFs,
  allowing it to take appropriate action.  The other two functions may be
  called without the presence of SR-IOV enablement and the existence of VFs.
 
  Anyway, that's why I asked Roopa to add that capability.
 
  I don't follow.  The PF driver already knows whether it has enabled VFs.
 
  How do filters set this way interact with filters set through the
  existing operations?  Should they override promiscuous mode?  None of
  this has been specified.
 
 Promiscuous mode is exactly the issue this feature is intended for.  I'm 
 not familiar with the Solarflare device but Intel HW promiscuous mode is 
 only promiscuous on the physical port, not on the VEB.  So a packet sent 
 from a VF will not be captured by the PF across the VEB unless the MAC 
 and VLAN filters have been programmed into the HW.

Yes, I get it.  The hardware bridge needs to know more about the address
configuration on the host than the driver is getting at the moment.

 So you may not need 
 the feature for your devices but it is required for Intel devices.

Well we don't have the hardware bridge but that means each VF driver
needs to know whether to fall back to the software bridge.  The net
driver needs much the same additional information.

 And 
 it's a fairly simple request, just allow -1 to indicate that the target 
 of the filter requests is for the PF itself.  Using the already existing 
 set_rx_mode function won't work because the PF driver will look at it 
 and figure it's in promiscuous mode anyway, so it won't set the filters 
 into the HW.  At least that is how it is in the case of our HW and 
 driver.  Again, the behavior of your HW and driver is unknown to me and 
 thus you may not require this feature.

What concerns me is that this seems to be a workaround rather than a fix
for over-use of promiscuous mode, and it changes the semantics of
filtering modes in ways that haven't been well-specified.

What if there's a software bridge between two net devices corresponding
to separate physical ports, so that they really need to be promiscuous?
What if the administrator runs tcpdump and really wants the (PF) net
device to be promiscuous?

These cases shouldn't break because of VF acceleration.  Or at least we
should make a conscious and documented decision that 'promiscuous'
doesn't mean that if you enable it on your network adapter.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
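
For what it's worth, the "-1 targets the PF" convention Greg describes
would reduce to something like the sketch below; the op name, arguments
and helpers are all hypothetical, since the real interface is whatever
Roopa's patch defines.

	/* Hypothetical filter op treating vf == -1 as "the PF itself";
	 * none of these names come from the actual patch. */
	static int example_set_rx_filter(struct net_device *dev, int vf,
					 const u8 *mac, u16 vlan)
	{
		struct example_adapter *adapter = netdev_priv(dev);

		if (vf == -1)
			/* Program the PF's own VEB filters even though
			 * the PF may be promiscuous on the physical
			 * port. */
			return example_write_filter(adapter,
						    adapter->pf_pool,
						    mac, vlan);

		if (vf < 0 || vf >= adapter->num_vfs)
			return -EINVAL;

		return example_write_filter(adapter, vf, mac, vlan);
	}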



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-18 Thread Alex Williamson
On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
 On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
  On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
   On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
<snip>
+Groups, Devices, IOMMUs, oh my
+---
+
+A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
+can't always distinguish transactions from each individual device in
+the system.  Sometimes this is because of the IOMMU design, such as with
+PEs, other times it's caused by the I/O topology, for instance a
+PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
+devices created by these restrictions IOMMU groups (or just groups for
+this document).
+
+The IOMMU cannot distinguish transactions between the individual devices
+within the group, therefore the group is the basic unit of ownership for
+a userspace process.  Because of this, groups are also the primary
+interface to both devices and IOMMU domains in VFIO.
+
+The VFIO representation of groups is created as devices are added into
+the framework by a VFIO bus driver.  The vfio-pci module is an example
+of a bus driver.  This module registers devices along with a set of bus
+specific callbacks with the VFIO core.  These callbacks provide the
+interfaces later used for device access.  As each new group is created,
+as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
+character device.
   
   Ok.. so, the fact that it's called vfio-pci suggests that the VFIO
   bus driver is per bus type, not per bus instance.   But grouping
   constraints could be per bus instance, if you have a couple of
   different models of PCI host bridge with IOMMUs of different
   capabilities built in, for example.
  
  Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
  instance.
 
 Ok, how can that work.  vfio-pci is responsible for generating the
 groupings, yes?  For which it needs to know the iommu/host bridge's
 isolation capabilities, which vary depending on the type of host
 bridge.

No, grouping is done at the iommu driver level.  vfio gets groupings via
iommu_device_group(), which uses the iommu_ops for the bus_type of the
requested device.  I'll attempt to clarify where groups come from in the
documentation.

   IOMMUs also register drivers per bus type, not per bus
  instance.  The IOMMU driver is free to impose any constraints it wants.
  
+In addition to the device enumeration and callbacks, the VFIO bus driver
+also provides a traditional device driver and is able to bind to devices
+on its bus.  When a device is bound to the bus driver it's available to
+VFIO.  When all the devices within a group are bound to their bus drivers,
+the group becomes viable and a user with sufficient access to the VFIO
+group chardev can obtain exclusive access to the set of group devices.
+
+As documented in linux/vfio.h, several ioctls are provided on the
+group chardev:
+
+#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
+ #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
+ #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
+#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
+#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
+#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
+#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
+
+The last two ioctls return new file descriptors for accessing
+individual devices within the group and programming the IOMMU.  Each of
+these new file descriptors provides its own set of file interfaces.
+These ioctls will fail if any of the devices within the group are not
+bound to their VFIO bus driver.  Additionally, when either of these
+interfaces is used, the group is then bound to the struct_mm of the
+caller.  The GET_FLAGS ioctl can be used to view the state of the group.
+
+When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
+new IOMMU domain is created and all of the devices in the group are
+attached to it.  This is the only way to ensure full IOMMU isolation
+of the group, but potentially wastes resources and cycles if the user
+intends to manage multiple groups with the same set of IOMMU mappings.
+VFIO therefore provides a group MERGE and UNMERGE interface, which
+allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
+arbitrary groups to be merged, so the user should assume merging is
+opportunistic.
   
   I do not think opportunistic means what you think it means..
   
 A new group, with no open device or IOMMU file
+descriptors, can be merged into an existing, in-use, group using the
+MERGE 
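
Pulling the quoted documentation together, a minimal user of the group
chardev would go through roughly the following steps.  This is a sketch
only: the ioctl numbers are the ones quoted above, and the group number
and device name are made up.

	#include <fcntl.h>
	#include <linux/ioctl.h>
	#include <linux/types.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	#define VFIO_GROUP_GET_FLAGS     _IOR(';', 100, __u64)
	#define VFIO_GROUP_FLAGS_VIABLE  (1 << 0)
	#define VFIO_GROUP_GET_IOMMU_FD  _IO(';', 103)
	#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)

	int main(void)
	{
		__u64 flags = 0;
		int group, iommu, device, ret = 1;

		/* $GROUP comes from iommu_device_group(); 26 is made up. */
		group = open("/dev/vfio/26", O_RDWR);
		if (group < 0)
			return 1;

		/* Viability requires every device in the group to be
		 * bound to its VFIO bus driver. */
		if (ioctl(group, VFIO_GROUP_GET_FLAGS, &flags) < 0 ||
		    !(flags & VFIO_GROUP_FLAGS_VIABLE))
			goto out;

		/* Either call binds the group to this process's mm and
		 * creates an IOMMU domain with all group devices
		 * attached. */
		iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
		device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
		if (iommu < 0 || device < 0)
			goto out;

		/* ... map DMA via the iommu fd, access the device ... */
		ret = 0;
	out:
		close(group);
		return ret;
	}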

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-18 Thread Scott Wood
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
 Hmm, that might be cleaner than eliminating the size with just using
 _IO().  So we might have something like:
 
 #define VFIO_IOMMU_MAP_DMA     _IOWR(';', 106, struct vfio_dma_map)
 #define VFIO_IOMMU_MAP_DMA_V2  _IOWR(';', 106, struct vfio_dma_map_v2)
 
 For which the driver might do:
 
 case VFIO_IOMMU_MAP_DMA:
 case VFIO_IOMMU_MAP_DMA_V2:
 {
 	struct vfio_dma_map map;
 
 	/* We don't care about the extra v2 bits */
 	if (copy_from_user(&map, (void __user *)arg, sizeof map))
 		return -EFAULT;

That won't work if you have an old kernel that doesn't know about v2, and
a new user that uses v2.  To make this work you'd have to strip out the
size from the ioctl number before switching (but still use it when
considering whether to access the v2 fields).  Simpler to just leave it
out of the ioctl number and put it in the struct field as currently
planned.
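
Concretely, the "strip the size before switching" variant would look
something like this sketch; vfio_dma_map_v2 is carried over from the
quoted snippet, and handle_v2()/do_map_dma() are hypothetical helpers.

	/* Both commands share number 106, so masking the size out of
	 * the ioctl lets one case handle old and new userspace. */
	switch (_IOC_NR(cmd)) {
	case _IOC_NR(VFIO_IOMMU_MAP_DMA): {
		struct vfio_dma_map_v2 map;
		u32 size = min_t(u32, _IOC_SIZE(cmd), sizeof(map));

		/* Copy only what the caller's struct actually holds. */
		memset(&map, 0, sizeof(map));
		if (copy_from_user(&map, (void __user *)arg, size))
			return -EFAULT;

		/* Consult the size before touching v2-only fields. */
		if (size >= sizeof(struct vfio_dma_map_v2))
			handle_v2(&map);

		return do_map_dma(&map);
	}
	}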

   I think all our structure sizes are independent of host width.  If I'm
   missing something, let me know.
  
  Ah, for structures, that might be true.  I was seeing the bunch of
  ioctl()s that take ints.
 
 Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
 mode.

Does Linux support ILP64?  There are int ioctls all over the place, and
I don't think we do compat wrappers for them.  In fact, some of the
ioctls in linux/fs.h use int for the compatible version of ioctls
originally defined as long.

It's cleaner to always use the fixed types, though.

 Darn it, guess we need to make everything 64bit, including file
 descriptors.

What's wrong with __u32/__s32 (or uint32_t/int32_t)?

I really do not see Linux supporting an ABI that has no 32-bit type at
all, especially in a situation where userspace compatibility is needed. 
If that does happen, the ABI breakage will go well beyond VFIO.

 The point of the group is to provide a unit of ownership.  We can't let
 $userA open $groupid and fetch a device, then have $userB do the same,
 grabbing a different device.  The mappings will step on each other and
 the devices have no isolation.  We can't restrict that purely by file
 permissions or we'll have the same problem with sudo.

What is the problem with sudo?  If you're running processes as the same
user, regardless of how, they're going to be able to mess with each
other.

Is it possible to expose access to only specific groups via an
individually-permissionable /dev/device, so only the entity handing out
access to devices needs access to everything?

 At one point we discussed a single open instance, but that
 unnecessarily limits the user, so we settled on the mm.  Thanks,

It would be nice if this limitation weren't excessively integrated into
the design -- in the embedded space we've got unusual partitioning
setups, including failover arrangements where partitions share devices. 
The device may be configured with the IOMMU pointing only at regions that
are shared by both mms, or the non-shared regions may be reconfigured as
active ownership of the device gets handed around.

It would be up to userspace code to make sure that the mappings don't
step on each other.  The mapping could be done with whichever mm issued
the map call for a given region.

For this use case, there is unlikely to be an issue with ownership
because there will not be separate privilege domains creating partitions
-- other use cases could refrain from enabling multiple-mm support unless
ownership issues are resolved.

This doesn't need to be supported initially, but we should try to avoid
letting the assumption permeate the code.

-Scott



Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling

2011-11-18 Thread Paul Mackerras
On Fri, Nov 18, 2011 at 02:57:11PM +0100, Alexander Graf wrote:
 
 This touches areas that I'm sure non-PPC people would want to see as
 well. Could you please CC kvm@vger too next time?
 
 Avi, Marcelo, mind to review some of the bits in this patch set? :)

I did cc the last patch (the one that adds barriers in the MMU
notifier sequence/count logic) to kvm@vger.  Do you mean I should cc
the whole series?  The only other thing touching generic code is the
addition of the KVM_MEMSLOT_IO flag in the first patch.

I'm hoping the extra barriers will be OK since they are no-ops on
x86.  In fact I now think that the smp_wmbs I added to
kvm_mmu_notifier_invalidate_page and kvm_mmu_notifier_change_pte
aren't in fact necessary, since it's only necessary to ensure that the
sequence number increase is visible before the point where
kvm_unmap_hva or kvm_set_spte_hva unlock the bitlock on the first rmap
chain they look at, which will be ensured anyway by the barrier before
the unlock.

Paul.
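
For readers outside PPC, the sequence/count logic in question is the
usual MMU-notifier race check, roughly the pattern below; hva_to_pfn()
and install_spte() stand in for the real fault path.

	/* Fault side: sample the sequence, do the (sleepable)
	 * translation, then recheck under the lock before installing
	 * anything. */
	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = hva_to_pfn(kvm, hva);	/* may race with invalidation */

	spin_lock(&kvm->mmu_lock);
	if (mmu_seq != kvm->mmu_notifier_seq)
		goto retry;		/* a notifier ran; start over */
	install_spte(vcpu, pfn);
	spin_unlock(&kvm->mmu_lock);

	/* Notifier side: the increment only needs to be visible before
	 * the rmap locks taken by kvm_unmap_hva() are released; the
	 * release barrier in the unlock provides that, which is why the
	 * extra smp_wmb()s look unnecessary. */
	kvm->mmu_notifier_seq++;
	kvm_unmap_hva(kvm, hva);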


RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-18 Thread Lan, Tianyu
Hi Kevin:
Thanks for your review. The following means that there should be an
fsync after updating metadata (refcount block, L1 table and L2 table).

Thanks 
Tianyu Lan

-----Original Message-----
 +	/* write l2 table */
 +	l2t->dirty = 1;
 +	if (qcow_l2_cache_write(q, l2t) < 0)
 		goto free_cache;

You need to make sure that the refcount update is written first (e.g.
with fsync), otherwise you risk corruption when the host crashes in the middle.

  
 -	if (cache_table(q, l2t) < 0) {
 -		if (ftruncate(q->fd, f_sz) < 0)
 -			goto free_cache;
 +	/* Update the l1 table */
 +	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
 +			| QCOW2_OFLAG_COPIED);
 
 -		goto free_cache;
 -	}
 +	if (pwrite_in_full(q->fd, l1t->l1_table,
 +			l1t->table_size * sizeof(u64),
 +			header->l1_table_offset) < 0)
 +		goto error;

Likewise, the L1 table write must be ordered against the L2 write.

"goto error" is using the wrong label.

  
 -	/* Update the in-core entry */
 -	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
 +	/* cache l2 table */
 +	cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong here...
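
In short, each piece of metadata has to be synced to disk before the
structure that points to it is updated.  Assuming a helper that pairs
pwrite_in_full() with fdatasync() (the name and field layout below are
approximations, not the patch's actual code), the required order is:

	/* 1. The refcount update reaches the disk first ... */
	if (qcow_pwrite_sync(q->fd, rfb->entries,
			     rfb->size * sizeof(u16), rfb->offset) < 0)
		goto error;

	/* 2. ... then the newly allocated L2 table ... */
	if (qcow_pwrite_sync(q->fd, l2t->table,
			     l2t->size * sizeof(u64), l2t->offset) < 0)
		goto error;

	/* 3. ... and only then the L1 entry that makes the new
	 *    cluster reachable. */
	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset | QCOW2_OFLAG_COPIED);
	if (qcow_pwrite_sync(q->fd, l1t->l1_table,
			     l1t->table_size * sizeof(u64),
			     header->l1_table_offset) < 0)
		goto error;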


Re: [PATCH 0/6] KVM: PPC: mostly booke: race and idle fixes, paravirt

2011-11-18 Thread Alexander Graf

On 09.11.2011, at 01:22, Scott Wood wrote:

 The first couple patches are some minor reorganization of the checking
 done prior to entering the guest, which should avoid some races relative
 to exception queueing, and give us a more clearly-named place (which later
 patches in the series use) to put other checks that we want to always run
 before entering the guest.
 
 Then there's a fix to booke idle handling to not block inside
 KVM_SET_REGS, and some paravirt fixes and new functionality.
 
 Apply after "KVM: PPC: booke: check for signals in kvmppc_vcpu_run"
 (I didn't think there'd end up being a dependency when I sent that...)
 
 This supersedes the first three patches of "KVM: PPC: booke: paravirt and
 timer" from 8/26.
 
 Scott Wood (6):
  KVM: PPC: Rename deliver_interrupts to prepare_to_enter
  KVM: PPC: Move prepare_to_enter call site into subarch code
  KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
  KVM: PPC: booke: Fix int_pending calculation for MSR[EE] paravirt
  KVM: PPC: booke: Paravirtualize wrtee
  KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn

Thanks, applied all to kvm-ppc-next.


Alex



Re: [PATCH v2] KVM: PPC: booke: Improve timer register emulation

2011-11-18 Thread Alexander Graf

On 17.11.2011, at 23:39, Scott Wood wrote:

 Decrementers are now properly driven by TCR/TSR, and the guest
 has full read/write access to these registers.
 
 The decrementer keeps ticking (and setting the TSR bit) regardless of
 whether the interrupts are enabled with TCR.
 
 The decrementer stops at zero, rather than going negative.
 
 Decrementers (and FITs, once implemented) are delivered as
 level-triggered interrupts -- dequeued when the TSR bit is cleared, not
 on delivery.
 
 Signed-off-by: Liu Yu yu@freescale.com
 [scottw...@freescale.com: significant changes]
 Signed-off-by: Scott Wood scottw...@freescale.com

Thanks, applied to kvm-ppc-next.

Alex



Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling

2011-11-18 Thread Alexander Graf

On 16.11.2011, at 23:50, Paul Mackerras wrote:

 This series of patches updates the Book3S-HV KVM code that manages the
 guest hashed page table (HPT) to enable several things:
 
 * MMIO emulation and MMIO pass-through
 
 * Use of small pages (4kB or 64kB, depending on config) to back the
  guest memory
 
 * Pageable guest memory - i.e. backing pages can be removed from the
  guest and reinstated on demand, using the MMU notifier mechanism.
 
 On PPC970 we have no way to get DSIs and ISIs to come to the
 hypervisor, so we can't do MMIO emulation or pageable guest memory.
 On POWER7 we set the VPM1 bit in the LPCR to make all DSIs and ISIs
 come to the hypervisor (host) as HDSIs or HISIs.
 
 This series is RFC for the moment, although the first 5 or so patches
 are pretty solid and could go in.  I am going to rework the later
 patches to use HPTEs with V=0 for the absent pages rather than key=31,
 which will require handling the HPTE-not-present HDSIs we will get and
 differentiating the case where the guest has created a HPTE but the
 underlying page is not resident from the case where the guest has
 created no HPTE for the address.

This touches areas that I'm sure non-PPC people would want to see as well. 
Could you please CC kvm@vger too next time?

Avi, Marcelo, mind to review some of the bits in this patch set? :)


Alex
