[RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
When meeting a request to write a cluster without the copied flag, allocate a new cluster and write the original data with the modification to the new cluster. This also adds support for writing to qcow2 compressed images. After these changes, the image file passes `qemu-img check`.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c        | 366 +-
 tools/kvm/include/kvm/qcow.h |   2 +
 2 files changed, 255 insertions(+), 113 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..4d9125d 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 	 */
 	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);

-	if (qcow_l2_cache_write(q, lru) < 0)
-		goto error;
-
 	/* Remove the node from the cache */
 	rb_erase(&lru->node, r);
 	list_del_init(&lru->list);
@@ -618,9 +615,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);

-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +700,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);

+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +727,121 @@ error_free_rfb:
 	return NULL;
 }

-/*
- * QCOW file might grow during a write operation. Not only data but metadata is
- * also written at the end of the file. Therefore it is necessary to ensure
- * every write is committed to disk. Hence we use qcow_pwrite_sync() to
- * synchronize the in-core state of QCOW image to disk.
- *
- * We also try to restore the image to a consistent state if the metadata
- * operation fails. The two metadata operations are: level 1 and level 2 table
- * update. If either of them fails the image is truncated to a consistent state.
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL << (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL << (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+	rfb->entries[rfb_idx] = cpu_to_be16(refcount);
+	rfb->dirty = 1;
+
+	/* write refcount block */
+	write_refcount_block(q, rfb);
+
+	/* update free_clust_idx since refcount becomes zero */
+	if (!refcount && clust_idx < q->free_clust_idx)
+		q->free_clust_idx = clust_idx;
+
+	return 0;
+}
+
+static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size)
+{
+	struct qcow_header *header = q->header;
+	u64 start, end, offset;
+
+	start = clust_start & ~(q->cluster_size - 1);
+	end = (clust_start + size - 1) & ~(q->cluster_size - 1);
+	for (offset = start; offset <= end; offset += q->cluster_size)
+		update_cluster_refcount(q, offset >> header->cluster_bits, -1);
+}
+
+/*
+ * Allocate clusters according to the size. Find a position that
+ * can satisfy the size. free_clust_idx is initialized to zero and
+ * records the last position.
+ */
+static u64 qcow_alloc_clusters(struct qcow *q, u64 size)
+{
+	struct qcow_header *header = q->header;
+
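The index arithmetic in qcow_get_refcount()/update_cluster_refcount() above can be sanity-checked in isolation. The sketch below is a minimal standalone model (not the kvm tools code): it splits a cluster index into a refcount-table index and an in-block index, assuming 16-bit refcount entries as in qcow2, so the per-block entry count is cluster_size / sizeof(u16). The constants here are illustrative, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 64 KiB clusters, 16-bit refcount entries, so one
 * refcount block holds 2^(16-1) = 32768 entries. */
#define CLUSTER_BITS         16
#define REFCOUNT_BLOCK_SHIFT  1   /* log2(sizeof(uint16_t)) */

/* Index of the entry inside one refcount block (the patch's rfb_idx). */
static uint64_t refcount_block_index(uint64_t clust_idx)
{
    return clust_idx & ((1ULL << (CLUSTER_BITS - REFCOUNT_BLOCK_SHIFT)) - 1);
}

/* Index of the refcount block inside the refcount table (the rft_idx). */
static uint64_t refcount_table_index(uint64_t clust_idx)
{
    return clust_idx >> (CLUSTER_BITS - REFCOUNT_BLOCK_SHIFT);
}
```

With these constants, cluster 32768 is the first entry of the second refcount block, which is exactly the split the two masks above encode.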
[PATCH v2 0/6] KVM: optimize memslots searching
This is more work based on my v1 patchset, which was posted some months ago and can be found at: https://lkml.org/lkml/2011/2/22/68

Change log:
- sort memslots based on their size and do a linear search instead of a binary search based on gfn; the idea is from Avi.
- in order to reduce the cache footprint, memslots are sorted in the kvm->memslots->memslots[] array, and a table is introduced to map a slot id to its index in the array.

Here are the performance results:

autotest for RHEL 6.1 setup/boot/reboot/shutdown (average):
ept=1: before: 449.5   after: 447.8
ept=0: before: 532.7   after: 529.8

kernbench (average):
ept=1: before: 127.94  after: 126.98
ept=0: before: 196.85  after: 189.66

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/6] KVM: introduce KVM_MEM_SLOTS_NUM macro
Introduce the KVM_MEM_SLOTS_NUM macro to replace KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |  4 +++-
 arch/x86/kvm/mmu.c              |  2 +-
 include/linux/kvm_host.h        |  7 +--
 virt/kvm/kvm_main.c             |  2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..b80a310 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -31,6 +31,8 @@
 #define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+
 #define KVM_MMIO_SIZE 16

 #define KVM_PIO_PAGE_OFFSET 1
@@ -228,7 +230,7 @@ struct kvm_mmu_page {
 	 * One bit set per slot which has memory
 	 * in this shadow page.
 	 */
-	DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+	DECLARE_BITMAP(slot_bitmap, KVM_MEM_SLOTS_NUM);
 	bool unsync;
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a9b3a32..e6c2755 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1337,7 +1337,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 					       PAGE_SIZE);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+	bitmap_zero(sp->slot_bitmap, KVM_MEM_SLOTS_NUM);
 	sp->parent_ptes = 0;
 	mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6a2ec9..a42d50b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -226,11 +226,14 @@ struct kvm_irq_routing_table {};
 #endif

+#ifndef KVM_MEM_SLOTS_NUM
+#define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+#endif
+
 struct kvm_memslots {
 	int nmemslots;
 	u64 generation;
-	struct kvm_memory_slot memslots[KVM_MEMORY_SLOTS +
-					KVM_PRIVATE_MEM_SLOTS];
+	struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
 };

 struct kvm {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..eb55a11 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -662,7 +662,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			(void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size)))
 		goto out;
-	if (mem->slot >= KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
+	if (mem->slot >= KVM_MEM_SLOTS_NUM)
 		goto out;
 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
 		goto out;
-- 
1.7.7.1
[PATCH v2 2/6] KVM: introduce update_memslots function
Introduce update_memslots() to update the slot which will be committed to kvm->memslots

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c       |  2 +-
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 22 +++---
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..56d1a39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3504,7 +3504,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
-		slots->generation++;
+		update_memslots(slots, NULL);

 		old_slots = kvm->memslots;
 		rcu_assign_pointer(kvm->memslots, slots);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a42d50b..bb8728e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -319,6 +319,7 @@ void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);

+void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new);
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eb55a11..0fb3aeb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -629,6 +629,19 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 }
 #endif /* !CONFIG_S390 */

+void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
+{
+	if (new) {
+		int id = new->id;
+
+		slots->memslots[id] = *new;
+		if (id >= slots->nmemslots)
+			slots->nmemslots = id + 1;
+	}
+
+	slots->generation++;
+}
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -779,10 +792,8 @@ skip_lpage:
 	if (!slots)
 		goto out_free;
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-	if (mem->slot >= slots->nmemslots)
-		slots->nmemslots = mem->slot + 1;
-	slots->generation++;
 	slots->memslots[mem->slot].flags |= KVM_MEMSLOT_INVALID;
+	update_memslots(slots, NULL);

 	old_memslots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
@@ -814,9 +825,6 @@ skip_lpage:
 	if (!slots)
 		goto out_free;
 	memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-	if (mem->slot >= slots->nmemslots)
-		slots->nmemslots = mem->slot + 1;
-	slots->generation++;

 	/* actual memory is freed via old in kvm_free_physmem_slot below */
 	if (!npages) {
 			new.lpage_info[i] = NULL;
 	}

-	slots->memslots[mem->slot] = new;
+	update_memslots(slots, &new);

 	old_memslots = kvm->memslots;
 	rcu_assign_pointer(kvm->memslots, slots);
 	synchronize_srcu_expedited(&kvm->srcu);
-- 
1.7.7.1
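The two behaviors folded into update_memslots() — install a slot (possibly growing nmemslots) when one is passed, and always bump the generation so readers can detect a change — can be modeled outside the kernel. The structures below are simplified stand-ins, not the kernel's kvm_memslots:

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_NUM 36

/* Simplified stand-ins for the kernel structures (illustration only). */
struct memory_slot { int id; unsigned long npages; };

struct memslots {
    int nmemslots;          /* highest used slot id + 1 */
    uint64_t generation;
    struct memory_slot memslots[SLOTS_NUM];
};

/* Mirrors the patch's logic: install `new` if given, and always bump
 * the generation counter so stale lookups can be detected. */
static void update_memslots(struct memslots *slots, struct memory_slot *new)
{
    if (new) {
        int id = new->id;

        slots->memslots[id] = *new;
        if (id >= slots->nmemslots)
            slots->nmemslots = id + 1;
    }

    slots->generation++;
}
```

Passing NULL covers the "something else in the copied slots array changed" case (e.g. a dirty bitmap swap), which is why the generation bump sits outside the `if`.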
[PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro
Introduce kvm_for_each_memslot to walk all valid memslots

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/ia64/kvm/kvm-ia64.c |  4 +---
 arch/x86/kvm/mmu.c       |  9 +
 include/linux/kvm_host.h |  4
 virt/kvm/iommu.c         | 13 +++--
 virt/kvm/kvm_main.c      |  5 +++--
 5 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 43f4c92..7d5bc81 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1370,10 +1370,8 @@ static void kvm_release_vm_pages(struct kvm *kvm)
 	unsigned long base_gfn;

 	slots = kvm_memslots(kvm);
-	for (i = 0; i < slots->nmemslots; i++) {
-		memslot = &slots->memslots[i];
+	kvm_for_each_memslot(slots, memslot, i) {
 		base_gfn = memslot->base_gfn;
-
 		for (j = 0; j < memslot->npages; j++) {
 			if (memslot->rmap[j])
 				put_page((struct page *)memslot->rmap[j]);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e6c2755..53dbae0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1120,11 +1120,11 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 	int ret;
 	int retval = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;

 	slots = kvm_memslots(kvm);

-	for (i = 0; i < slots->nmemslots; i++) {
-		struct kvm_memory_slot *memslot = &slots->memslots[i];
+	kvm_for_each_memslot(slots, memslot, i) {
 		unsigned long start = memslot->userspace_addr;
 		unsigned long end;

@@ -3977,11 +3977,12 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
 	unsigned int nr_mmu_pages;
 	unsigned int nr_pages = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;

 	slots = kvm_memslots(kvm);

-	for (i = 0; i < slots->nmemslots; i++)
-		nr_pages += slots->memslots[i].npages;
+	kvm_for_each_memslot(slots, memslot, i)
+		nr_pages += memslot->npages;

 	nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
 	nr_mmu_pages = max(nr_mmu_pages,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bb8728e..10524c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 	     idx++)

+#define kvm_for_each_memslot(slots, memslot, i)			\
+	for (i = 0; i < (slots)->nmemslots &&			\
+	     ({ memslot = &(slots)->memslots[i]; 1; }); i++)
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index d5f3b8d..02df243 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -134,12 +134,13 @@ static int kvm_iommu_map_memslots(struct kvm *kvm)
 {
 	int i, idx, r = 0;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;

 	idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);

-	for (i = 0; i < slots->nmemslots; i++) {
-		r = kvm_iommu_map_pages(kvm, &slots->memslots[i]);
+	kvm_for_each_memslot(slots, memslot, i) {
+		r = kvm_iommu_map_pages(kvm, memslot);
 		if (r)
 			break;
 	}
@@ -311,14 +312,14 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm)
 {
 	int i, idx;
 	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;

 	idx = srcu_read_lock(&kvm->srcu);
 	slots = kvm_memslots(kvm);

-	for (i = 0; i < slots->nmemslots; i++) {
-		kvm_iommu_put_pages(kvm, slots->memslots[i].base_gfn,
-				    slots->memslots[i].npages);
-	}
+	kvm_for_each_memslot(slots, memslot, i)
+		kvm_iommu_put_pages(kvm, memslot->base_gfn, memslot->npages);
+
 	srcu_read_unlock(&kvm->srcu, idx);

 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0fb3aeb..ec3b03b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -549,9 +549,10 @@ void kvm_free_physmem(struct kvm *kvm)
 {
 	int i;
 	struct kvm_memslots *slots = kvm->memslots;
+	struct kvm_memory_slot *memslot;

-	for (i = 0; i < slots->nmemslots; ++i)
-		kvm_free_physmem_slot(&slots->memslots[i], NULL);
+	kvm_for_each_memslot(slots, memslot, i)
+		kvm_free_physmem_slot(memslot, NULL);

 	kfree(kvm->memslots);
 }
-- 
1.7.7.1
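The iterator macro above leans on a GCC/Clang statement expression — `({ ...; 1; })` — placed in the loop condition so the cursor pointer is assigned on every iteration while always evaluating true. A minimal standalone model of the same pattern (simplified structures, not the kernel's; requires GCC or Clang):

```c
#include <assert.h>

#define SLOTS_NUM 4

struct memory_slot { unsigned long npages; };

struct memslots {
    int nmemslots;
    struct memory_slot memslots[SLOTS_NUM];
};

/* Same shape as kvm_for_each_memslot: the statement expression
 * ({ ...; 1; }) assigns the cursor and always evaluates true, so the
 * loop terminates only on the index bound. */
#define for_each_memslot(slots, memslot, i)                     \
    for (i = 0; i < (slots)->nmemslots &&                       \
         ({ memslot = &(slots)->memslots[i]; 1; }); i++)

static unsigned long total_pages(struct memslots *slots)
{
    struct memory_slot *memslot;
    unsigned long sum = 0;
    int i;

    for_each_memslot(slots, memslot, i)
        sum += memslot->npages;
    return sum;
}
```

The payoff is that every call site loses two lines of boilerplate (the bound check and the cursor assignment) and cannot get them subtly wrong.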
[PATCH v2 4/6] KVM: introduce id_to_memslot function
Introduce id_to_memslot() to get a memslot by its slot id

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/ia64/kvm/kvm-ia64.c  |  2 +-
 arch/powerpc/kvm/book3s.c |  2 +-
 arch/x86/kvm/vmx.c        |  6 -
 arch/x86/kvm/x86.c        |  8 +---
 include/linux/kvm_host.h  |  6 ++
 virt/kvm/kvm_main.c       | 12
 6 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 7d5bc81..773a984 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1818,7 +1818,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	if (log->slot >= KVM_MEMORY_SLOTS)
 		goto out;

-	memslot = &kvm->memslots->memslots[log->slot];
+	memslot = id_to_memslot(kvm->memslots, log->slot);
 	r = -ENOENT;
 	if (!memslot->dirty_bitmap)
 		goto out;
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index f68a34d..c1128d8 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -497,7 +497,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	/* If nothing is dirty, don't bother messing with page tables. */
 	if (is_dirty) {
-		memslot = &kvm->memslots->memslots[log->slot];
+		memslot = id_to_memslot(kvm->memslots, log->slot);

 		ga = memslot->base_gfn << PAGE_SHIFT;
 		ga_end = ga + (memslot->npages << PAGE_SHIFT);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6e28d58..4a92073 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2716,11 +2716,13 @@ static gva_t rmode_tss_base(struct kvm *kvm)
 {
 	if (!kvm->arch.tss_addr) {
 		struct kvm_memslots *slots;
+		struct kvm_memory_slot *slot;
 		gfn_t base_gfn;

 		slots = kvm_memslots(kvm);
-		base_gfn = slots->memslots[0].base_gfn +
-			   kvm->memslots->memslots[0].npages - 3;
+		slot = id_to_memslot(slots, 0);
+		base_gfn = slot->base_gfn + slot->npages - 3;
+
 		return base_gfn << PAGE_SHIFT;
 	}
 	return kvm->arch.tss_addr;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 56d1a39..315c2ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3478,7 +3478,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	if (log->slot >= KVM_MEMORY_SLOTS)
 		goto out;

-	memslot = &kvm->memslots->memslots[log->slot];
+	memslot = id_to_memslot(kvm->memslots, log->slot);
 	r = -ENOENT;
 	if (!memslot->dirty_bitmap)
 		goto out;
@@ -3491,6 +3491,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	/* If nothing is dirty, don't bother messing with page tables. */
 	if (is_dirty) {
 		struct kvm_memslots *slots, *old_slots;
+		struct kvm_memory_slot *slot;
 		unsigned long *dirty_bitmap;

 		dirty_bitmap = memslot->dirty_bitmap_head;
@@ -3503,13 +3504,14 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 		if (!slots)
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
-		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slot = id_to_memslot(slots, log->slot);
+		slot->dirty_bitmap = dirty_bitmap;
 		update_memslots(slots, NULL);

 		old_slots = kvm->memslots;
 		rcu_assign_pointer(kvm->memslots, slots);
 		synchronize_srcu_expedited(&kvm->srcu);
-		dirty_bitmap = old_slots->memslots[log->slot].dirty_bitmap;
+		dirty_bitmap = memslot->dirty_bitmap;
 		kfree(old_slots);

 		spin_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 10524c0..a0e4d63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,12 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 			|| lockdep_is_held(&kvm->slots_lock));
 }

+static inline struct kvm_memory_slot *
+id_to_memslot(struct kvm_memslots *slots, int id)
+{
+	return &slots->memslots[id];
+}
+
 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
 #define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
 static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ec3b03b..09bb794 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -634,8 +634,9 @@ void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
 {
 	if (new) {
 		int id = new->id;
+		struct kvm_memory_slot *old = id_to_memslot(slots, id);

-		slots->memslots[id] = *new;
+		*old = *new;
 		if (id >= slots->nmemslots)
 			slots->nmemslots = id + 1;
 	}
@@ -681,7 +682,7 @@ int
[PATCH v2 5/6] KVM: sort memslots by its size and use line search
Sort memslots based on their size and use a linear search to find a slot, so the larger memslots get a better fit. The idea is from Avi.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 include/linux/kvm_host.h | 22 +---
 virt/kvm/kvm_main.c      | 82 -
 2 files changed, 75 insertions(+), 29 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a0e4d63..83b396a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -230,8 +230,12 @@ struct kvm_irq_routing_table {};
 #define KVM_MEM_SLOTS_NUM (KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
 #endif

+/*
+ * Note:
+ * memslots are not sorted by id anymore, please use id_to_memslot()
+ * to get the memslot by its id.
+ */
 struct kvm_memslots {
-	int nmemslots;
 	u64 generation;
 	struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
 };
@@ -307,9 +311,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 	     idx++)

-#define kvm_for_each_memslot(slots, memslot, i)			\
-	for (i = 0; i < (slots)->nmemslots &&			\
-	     ({ memslot = &(slots)->memslots[i]; 1; }); i++)
+#define kvm_for_each_memslot(slots, memslot, i)			\
+	for (i = 0; i < KVM_MEM_SLOTS_NUM &&			\
+	     ({ memslot = &(slots)->memslots[i]; 1; }) &&	\
+	     memslot->npages != 0; i++)

 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
@@ -335,7 +340,14 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 static inline struct kvm_memory_slot *
 id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	return &slots->memslots[id];
+	int i;
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		if (slots->memslots[i].id == id)
+			return &slots->memslots[i];
+
+	WARN_ON(1);
+	return NULL;
 }

 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09bb794..9425e3a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -440,6 +440,15 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+static void kvm_init_memslots_id(struct kvm *kvm)
+{
+	int i;
+	struct kvm_memslots *slots = kvm->memslots;
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		slots->memslots[i].id = i;
+}
+
 static struct kvm *kvm_create_vm(void)
 {
 	int r, i;
@@ -465,6 +474,7 @@ static struct kvm *kvm_create_vm(void)
 	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!kvm->memslots)
 		goto out_err_nosrcu;
+	kvm_init_memslots_id(kvm);
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_nosrcu;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
@@ -630,15 +640,55 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 }
 #endif /* !CONFIG_S390 */

+static struct kvm_memory_slot *
+search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+{
+	int i;
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(slots, memslot, i)
+		if (gfn >= memslot->base_gfn &&
+		    gfn < memslot->base_gfn + memslot->npages)
+			return memslot;
+
+	return NULL;
+}
+
+static int cmp_memslot(const void *slot1, const void *slot2)
+{
+	struct kvm_memory_slot *s1, *s2;
+
+	s1 = (struct kvm_memory_slot *)slot1;
+	s2 = (struct kvm_memory_slot *)slot2;
+
+	if (s1->npages < s2->npages)
+		return 1;
+	if (s1->npages > s2->npages)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Sort the memslots based on their size, so that the larger slots
+ * will get a better fit.
+ */
+static void sort_memslots(struct kvm_memslots *slots)
+{
+	sort(slots->memslots, KVM_MEM_SLOTS_NUM,
+	     sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+}
+
 void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
 {
 	if (new) {
 		int id = new->id;
 		struct kvm_memory_slot *old = id_to_memslot(slots, id);
+		unsigned long npages = old->npages;

 		*old = *new;
-		if (id >= slots->nmemslots)
-			slots->nmemslots = id + 1;
+		if (new->npages != npages)
+			sort_memslots(slots);
 	}

 	slots->generation++;
@@ -979,16 +1029,7 @@ EXPORT_SYMBOL_GPL(kvm_is_error_hva);
 static struct kvm_memory_slot *__gfn_to_memslot(struct kvm_memslots *slots,
 						gfn_t gfn)
 {
-	int i;
-
-	for (i = 0; i < slots->nmemslots; ++i) {
-		struct kvm_memory_slot *memslot = &slots->memslots[i];
-
-		if (gfn >= memslot->base_gfn
-		    && gfn
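The sort-then-linear-search scheme above is easy to model in userspace. The sketch below is an illustration with simplified structures (not the kernel's), using libc qsort in place of the kernel's sort(): empty slots have npages == 0, the comparator orders descending by npages so the largest (and statistically hottest) slots come first, and the search stops at the first empty slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS_NUM 8

/* Simplified slot: empty slots have npages == 0 (illustration only). */
struct memory_slot {
    int id;
    unsigned long base_gfn;
    unsigned long npages;
};

/* Descending by npages: largest slots first, empty slots sink to the
 * end, so the linear search can stop at the first empty entry. */
static int cmp_slot(const void *a, const void *b)
{
    const struct memory_slot *s1 = a, *s2 = b;

    if (s1->npages < s2->npages)
        return 1;
    if (s1->npages > s2->npages)
        return -1;
    return 0;
}

static struct memory_slot *search(struct memory_slot *slots, unsigned long gfn)
{
    for (int i = 0; i < SLOTS_NUM && slots[i].npages; i++)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;
}
```

A guest usually has one or two huge RAM slots and a handful of tiny ones, so with this ordering almost every lookup succeeds on the first compare, which is the trade-off the patch makes against a gfn-sorted binary search.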
[PATCH] vhost-net: Acquire device lock when releasing device
The device lock should be held when releasing a device, and specifically when calling vhost_dev_cleanup(). Otherwise, RCU complains about it:

[ 2025.642835] ===============================
[ 2025.643838] [ INFO: suspicious RCU usage. ]
[ 2025.645182] -------------------------------
[ 2025.645927] drivers/vhost/vhost.c:475 suspicious rcu_dereference_protected() usage!
[ 2025.647329]
[ 2025.647330] other info that might help us debug this:
[ 2025.647331]
[ 2025.649042]
[ 2025.649043] rcu_scheduler_active = 1, debug_locks = 1
[ 2025.650235] no locks held by trinity/21042.
[ 2025.650971]
[ 2025.650972] stack backtrace:
[ 2025.651789] Pid: 21042, comm: trinity Not tainted 3.2.0-rc2-sasha-00057-ga9098b3 #5
[ 2025.653342] Call Trace:
[ 2025.653792] [810b4a6a] lockdep_rcu_suspicious+0xaf/0xb9
[ 2025.654916] [818d4c2c] vhost_dev_cleanup+0x342/0x3ac
[ 2025.655985] [818d4f18] vhost_net_release+0x48/0x7f
[ 2025.657247] [811416e3] fput+0x11e/0x1dc
[ 2025.658091] [8113f1ec] filp_close+0x6e/0x79
[ 2025.658964] [81089ed7] put_files_struct+0xcc/0x196
[ 2025.659971] [8108a034] exit_files+0x46/0x4f
[ 2025.660865] [8108a716] do_exit+0x264/0x75c
[ 2025.662201] [8113f490] ? fsnotify_modify+0x60/0x68
[ 2025.663260] [81bbdbca] ? sysret_check+0x2e/0x69
[ 2025.664269] [8108acc1] do_group_exit+0x83/0xb1
[ 2025.665448] [8108ad01] sys_exit_group+0x12/0x16
[ 2025.666396] [81bbdb92] system_call_fastpath+0x16/0x1b

Cc: Michael S. Tsirkin m...@redhat.com
Cc: kvm@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 drivers/vhost/net.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 882a51f..c9be601 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -586,6 +586,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	struct socket *tx_sock;
 	struct socket *rx_sock;

+	mutex_lock(&n->dev.mutex);
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_cleanup(&n->dev);
@@ -596,6 +597,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	mutex_unlock(&n->dev.mutex);
 	kfree(n);
 	return 0;
 }
-- 
1.7.8.rc1
[PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array
Getting the dirty log is a frequent operation when framebuffer-based displays are used (for example, X Window), so we introduce a mapping table to speed up id_to_memslot().

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 include/linux/kvm_host.h | 13 +++--
 virt/kvm/kvm_main.c      |  7 ++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83b396a..fbf66eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,6 +238,8 @@ struct kvm_irq_routing_table {};
 struct kvm_memslots {
 	u64 generation;
 	struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
+	/* The mapping table from slot id to the index in memslots[]. */
+	int id_to_index[KVM_MEM_SLOTS_NUM];
 };

 struct kvm {
@@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 static inline struct kvm_memory_slot *
 id_to_memslot(struct kvm_memslots *slots, int id)
 {
-	int i;
+	int index = slots->id_to_index[id];
+	struct kvm_memory_slot *slot;

-	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
-		if (slots->memslots[i].id == id)
-			return &slots->memslots[i];
+	slot = &slots->memslots[index];

-	WARN_ON(1);
-	return NULL;
+	WARN_ON(slot->id != id);
+	return slot;
 }

 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9425e3a..0dad825 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -446,7 +446,7 @@ static void kvm_init_memslots_id(struct kvm *kvm)
 	struct kvm_memslots *slots = kvm->memslots;

 	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
-		slots->memslots[i].id = i;
+		slots->id_to_index[i] = slots->memslots[i].id = i;
 }

 static struct kvm *kvm_create_vm(void)
@@ -675,8 +675,13 @@ static int cmp_memslot(const void *slot1, const void *slot2)
  */
 static void sort_memslots(struct kvm_memslots *slots)
 {
+	int i;
+
 	sort(slots->memslots, KVM_MEM_SLOTS_NUM,
 	     sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+
+	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
+		slots->id_to_index[slots->memslots[i].id] = i;
 }

 void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new)
-- 
1.7.7.1
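The id_to_index table trades an O(n) scan on every lookup for an O(n) rebuild after every sort, which pays off because lookups (e.g. dirty-log fetches) vastly outnumber slot-layout changes. A standalone model with simplified structures (illustrative names, not the kernel's):

```c
#include <assert.h>

#define SLOTS_NUM 4

struct memory_slot { int id; unsigned long npages; };

struct memslots {
    struct memory_slot memslots[SLOTS_NUM];
    /* Maps slot id -> current index in memslots[] (rebuilt after sort). */
    int id_to_index[SLOTS_NUM];
};

/* Rebuild the mapping table once after the array is reordered. */
static void rebuild_id_to_index(struct memslots *slots)
{
    for (int i = 0; i < SLOTS_NUM; i++)
        slots->id_to_index[slots->memslots[i].id] = i;
}

/* Constant-time lookup; the assert stands in for the patch's WARN_ON. */
static struct memory_slot *id_to_memslot(struct memslots *slots, int id)
{
    struct memory_slot *slot = &slots->memslots[slots->id_to_index[id]];

    assert(slot->id == id);
    return slot;
}
```

Because the table is stored inside the memslots copy itself, the copy-update-rcu_assign_pointer scheme the series already uses keeps the table and the array consistent for readers.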
Re: [PATCH v2 0/6] KVM: optimize memslots searching
On Fri, 2011-11-18 at 17:16 +0800, Xiao Guangrong wrote:
> This is more work based on my v1 patchset, which was posted some months
> ago and can be found at: https://lkml.org/lkml/2011/2/22/68
>
> Change log:
> - sort memslots based on their size and do a linear search instead of a
>   binary search based on gfn; the idea is from Avi.
> - in order to reduce the cache footprint, memslots are sorted in the
>   kvm->memslots->memslots[] array, and a table is introduced to map a
>   slot id to its index in the array.
>
> Here are the performance results:
>
> autotest for RHEL 6.1 setup/boot/reboot/shutdown (average):
> ept=1: before: 449.5   after: 447.8
> ept=0: before: 532.7   after: 529.8
>
> kernbench (average):
> ept=1: before: 127.94  after: 126.98
> ept=0: before: 196.85  after: 189.66

Hi,

Do these patches depend on any other patches? Looks like they don't
apply on kvm-next.

-- 
Sasha.
Re: [PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array
On Fri, 2011-11-18 at 17:20 +0800, Xiao Guangrong wrote:
> Getting the dirty log is a frequent operation when framebuffer-based
> displays are used (for example, X Window), so we introduce a mapping
> table to speed up id_to_memslot().
>
> Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
> ---
>  include/linux/kvm_host.h | 13 +++--
>  virt/kvm/kvm_main.c      |  7 ++-
>  2 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 83b396a..fbf66eb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,6 +238,8 @@ struct kvm_irq_routing_table {};
>  struct kvm_memslots {
>  	u64 generation;
>  	struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
> +	/* The mapping table from slot id to the index in memslots[]. */
> +	int id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>
>  struct kvm {
> @@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
>  static inline struct kvm_memory_slot *
>  id_to_memslot(struct kvm_memslots *slots, int id)
>  {
> -	int i;
> +	int index = slots->id_to_index[id];
> +	struct kvm_memory_slot *slot;
>
> -	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
> -		if (slots->memslots[i].id == id)
> -			return &slots->memslots[i];
> +	slot = &slots->memslots[index];
>
> -	WARN_ON(1);
> -	return NULL;
> +	WARN_ON(slot->id != id);
> +	return slot;
>  }

If we didn't find the right memslot, we shouldn't be returning a wrong
one; we should be failing.

-- 
Sasha.
Re: [PATCH v2 0/6] KVM: optimize memslots searching
On 11/18/2011 05:41 PM, Sasha Levin wrote:
>> autotest for RHEL 6.1 setup/boot/reboot/shutdown (average):
>> ept=1: before: 449.5   after: 447.8
>> ept=0: before: 532.7   after: 529.8
>>
>> kernbench (average):
>> ept=1: before: 127.94  after: 126.98
>> ept=0: before: 196.85  after: 189.66
>
> Hi,
>
> Do these patches depend on any other patches? Looks like they don't
> apply on kvm-next.

Ah, I forgot that I was using Avi's kvm tree; this patchset is based on
that. To avoid noise, I'll rebase it after it is reviewed.

Thank you, Sasha!
Re: [PATCH v2 6/6] KVM: introduce a table to map slot id to index in memslots array
On 11/18/2011 05:45 PM, Sasha Levin wrote:
>>  struct kvm {
>> @@ -340,14 +342,13 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
>>  static inline struct kvm_memory_slot *
>>  id_to_memslot(struct kvm_memslots *slots, int id)
>>  {
>> -	int i;
>> +	int index = slots->id_to_index[id];
>> +	struct kvm_memory_slot *slot;
>>
>> -	for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
>> -		if (slots->memslots[i].id == id)
>> -			return &slots->memslots[i];
>> +	slot = &slots->memslots[index];
>>
>> -	WARN_ON(1);
>> -	return NULL;
>> +	WARN_ON(slot->id != id);
>> +	return slot;
>>  }
>
> If we didn't find the right memslot we shouldn't be returning a wrong
> one, we should be failing.

Um, in the current code, the id has already been checked by the caller;
that means id should be < KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS. So,
if we cannot find the memslot for this slot, it must be a bug.
Re: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
On 18.11.2011 09:47, Lan Tianyu wrote: When meeting a request to write a cluster without the copied flag, allocate a new cluster and write the original data with the modification to the new cluster. This also adds support for the write operation on qcow2 compressed images. After testing, the image file can pass qemu-img check. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c| 366 +- tools/kvm/include/kvm/qcow.h |2 + 2 files changed, 255 insertions(+), 113 deletions(-) diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c index 680b37d..4d9125d 100644 --- a/tools/kvm/disk/qcow.c +++ b/tools/kvm/disk/qcow.c @@ -122,9 +122,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c) */ lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list); - if (qcow_l2_cache_write(q, lru) < 0) - goto error; - /* Remove the node from the cache */ rb_erase(&lru->node, r); list_del_init(&lru->list); @@ -618,9 +615,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c) if (rft->nr_cached == MAX_CACHE_NODES) { lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list); - if (write_refcount_block(q, lru) < 0) - goto error; - rb_erase(&lru->node, r); list_del_init(&lru->list); rft->nr_cached--; @@ -706,6 +700,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64 rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]); + if (!rfb_offset) { + pr_warning("Don't support to grow refcount table"); + return NULL; + } + rfb = refcount_block_search(q, rfb_offset); if (rfb) return rfb; @@ -728,35 +727,121 @@ error_free_rfb: return NULL; } -/* - * QCOW file might grow during a write operation. Not only data but metadata is - * also written at the end of the file. Therefore it is necessary to ensure - * every write is committed to disk. Hence we use qcow_pwrite_sync() to - * synchronize the in-core state of QCOW image to disk. 
- * - * We also try to restore the image to a consistent state if the metadata - * operation fails. The two metadata operations are: level 1 and level 2 table - * update. If either of them fails the image is truncated to a consistent state. +static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q->header; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning("Error while reading refcount table"); + return -1; + } + + rfb_idx = clust_idx & ((1ULL << (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1); + + if (rfb_idx >= rfb->size) { + pr_warning("L1: refcount block index out of bounds"); + return -1; + } + + return be16_to_cpu(rfb->entries[rfb_idx]); +} + +static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append) +{ + struct qcow_refcount_block *rfb = NULL; + struct qcow_header *header = q->header; + u16 refcount; + u64 rfb_idx; + + rfb = qcow_read_refcount_block(q, clust_idx); + if (!rfb) { + pr_warning("error while reading refcount table"); + return -1; + } + + rfb_idx = clust_idx & ((1ULL << (header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1); + if (rfb_idx >= rfb->size) { + pr_warning("refcount block index out of bounds"); + return -1; + } + + refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append; + rfb->entries[rfb_idx] = cpu_to_be16(refcount); + rfb->dirty = 1; + + /* write refcount block */ + write_refcount_block(q, rfb); Missing error handling. 
+ + /* update free_clust_idx since refcount becomes zero */ + if (!refcount && clust_idx < q->free_clust_idx) + q->free_clust_idx = clust_idx; + + return 0; +} + +static void qcow_free_clusters(struct qcow *q, u64 clust_start, u64 size) +{ + struct qcow_header *header = q->header; + u64 start, end, offset; + + start = clust_start & ~(q->cluster_size - 1); + end = (clust_start + size - 1) & ~(q->cluster_size - 1); + for (offset = start; offset <= end; offset += q->cluster_size) + update_cluster_refcount(q, offset >> header->cluster_bits, -1); +} + +/* Allocate clusters according to the size. Find a position that + * can satisfy the size. free_clust_idx is initialized to zero and + * records the last position. +*/ +static u64 qcow_alloc_clusters(struct
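The index arithmetic used above can be checked in isolation. QCOW_REFCOUNT_BLOCK_SHIFT is assumed here (from context) to reflect the 2-byte refcount entries; cluster_bits of 16 in the test means 64 KiB clusters:

```c
#include <stdint.h>

#define QCOW_REFCOUNT_BLOCK_SHIFT 1	/* refcount entries are u16, i.e. 1 << 1 bytes */

/* A refcount block is one cluster of u16 entries, so it holds
 * 1 << (cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT) of them; the low bits
 * of the cluster index select the entry within the block. */
static uint64_t refcount_block_index(uint64_t clust_idx, unsigned cluster_bits)
{
	return clust_idx & ((1ULL << (cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1);
}

/* qcow_free_clusters() converts byte offsets to cluster indices this way */
static uint64_t cluster_index(uint64_t offset, unsigned cluster_bits)
{
	return offset >> cluster_bits;
}
```

With 64 KiB clusters a refcount block holds 32768 entries, so cluster index 32768 wraps back to entry 0 of the next block.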
Re: Memory sync algorithm during migration
On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote: Adding qemu-devel ML to CC. Your question should have been sent to qemu-devel ML because the logic is implemented in QEMU, not KVM. (2011/11/11 1:35), Oliver Hookins wrote: Hi, I am performing some benchmarks on KVM migration on two different types of VM. One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes about 20 seconds to migrate on our hardware while the 32GB VM takes about a minute. With a reasonable amount of memory activity going on (in the hundreds of MB per second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never completes. Intuitively this tells me there is some watermarking of dirty pages going on that is not particularly efficient when the dirty pages ratio is high compared to total memory, but I may be completely incorrect. You can change the ratio IIRC. Hopefully, someone who knows well about QEMU will tell you better ways. Takuya Could anybody fill me in on what might be going on here? We're using libvirt 0.8.2 and kvm-83-224.el5.centos.1 This is a pretty old qemu/kvm code base. In principle, it makes no sense that with 32GB RAM migration finishes, and with 4GB RAM it is unable to (intuitively it should be, if ever, the other way around). Do you have an easy test that makes the problem easily reproducible? Have you tried upstream qemu.git? (some improvements in that department). On the surface this seems to be a bit beyond me. Is it reasonably straightforward porting the patches from qemu-kvm to qemu? At the moment I'm going to attempt just building the latest qemu-kvm (0.15 as opposed to 0.9.1) - should this be enough?
Re: [PATCH 0/6] KVM: PPC: mostly booke: race and idle fixes, paravirt
On 09.11.2011, at 01:22, Scott Wood wrote: The first couple patches are some minor reorganization of the checking done prior to entering the guest, which should avoid some races relative to exception queueing, and give us a more clearly-named place (which later patches in the series use) to put other checks that we want to always run before entering the guest. Then there's a fix to booke idle handling to not block inside KVM_SET_REGS, and some paravirt fixes and new functionality. Apply after "KVM: PPC: booke: check for signals in kvmppc_vcpu_run" (I didn't think there'd end up being a dependency when I sent that...) This supersedes the first three patches of "KVM: PPC: booke: paravirt and timer" from 8/26. Scott Wood (6): KVM: PPC: Rename deliver_interrupts to prepare_to_enter KVM: PPC: Move prepare_to_enter call site into subarch code KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter KVM: PPC: booke: Fix int_pending calculation for MSR[EE] paravirt KVM: PPC: booke: Paravirtualize wrtee KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn Thanks, applied all to kvm-ppc-next. Alex
Re: [PATCH v2] KVM: PPC: booke: Improve timer register emulation
On 17.11.2011, at 23:39, Scott Wood wrote: Decrementers are now properly driven by TCR/TSR, and the guest has full read/write access to these registers. The decrementer keeps ticking (and setting the TSR bit) regardless of whether the interrupts are enabled with TCR. The decrementer stops at zero, rather than going negative. Decrementers (and FITs, once implemented) are delivered as level-triggered interrupts -- dequeued when the TSR bit is cleared, not on delivery. Signed-off-by: Liu Yu yu@freescale.com [scottw...@freescale.com: significant changes] Signed-off-by: Scott Wood scottw...@freescale.com Thanks, applied to kvm-ppc-next. Alex
Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling
On 16.11.2011, at 23:50, Paul Mackerras wrote: This series of patches updates the Book3S-HV KVM code that manages the guest hashed page table (HPT) to enable several things: * MMIO emulation and MMIO pass-through * Use of small pages (4kB or 64kB, depending on config) to back the guest memory * Pageable guest memory - i.e. backing pages can be removed from the guest and reinstated on demand, using the MMU notifier mechanism. On PPC970 we have no way to get DSIs and ISIs to come to the hypervisor, so we can't do MMIO emulation or pageable guest memory. On POWER7 we set the VPM1 bit in the LPCR to make all DSIs and ISIs come to the hypervisor (host) as HDSIs or HISIs. This series is RFC for the moment, although the first 5 or so patches are pretty solid and could go in. I am going to rework the later patches to use HPTEs with V=0 for the absent pages rather than key=31, which will require handling the HPTE-not-present HDSIs we will get and differentiating the case where the guest has created a HPTE but the underlying page is not resident from the case where the guest has created no HPTE for the address. This touches areas that I'm sure non-PPC people would want to see as well. Could you please CC kvm@vger too next time? Avi, Marcelo, mind to review some of the bits in this patch set? :) Alex
Guest floppy regression hits qemu-kvm, qemu is not affected
Hi guys, Today during the last 'sanity' qemu-kvm testing, we've noticed a recurring problem: the guest OS does not see the floppy, making the Windows installs time out. This problem has been extensively discussed on qemu and qemu is fine here; the problem is specific to qemu-kvm. http://comments.gmane.org/gmane.comp.emulators.qemu/123928 So, this is something to go on your radars. Cheers, Lucas
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote: On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi->stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(&stats->syncp); - tpackets = stats->tx_packets; - tbytes = stats->tx_bytes; - rpackets = stats->rx_packets; - rbytes = stats->rx_bytes; - } while (u64_stats_fetch_retry(&stats->syncp, start)); - - tot->rx_packets += rpackets; - tot->tx_packets += tpackets; - tot->rx_bytes += rbytes; - tot->tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. It allows you to update stats without a lock. But you'll already be holding a lock related to the queue. What's the benefit of having them per queue? It should save some memory (and a little time when summing stats, though that's unlikely to matter much). The important thing is that splitting up stats per-CPU *and* per-queue is a waste. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked.
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote: On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote: On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi->stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(&stats->syncp); - tpackets = stats->tx_packets; - tbytes = stats->tx_bytes; - rpackets = stats->rx_packets; - rbytes = stats->rx_bytes; - } while (u64_stats_fetch_retry(&stats->syncp, start)); - - tot->rx_packets += rpackets; - tot->tx_packets += tpackets; - tot->rx_bytes += rbytes; - tot->tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. It allows you to update stats without a lock. But you'll already be holding a lock related to the queue. Right, but now you're holding a queue lock just when playing with the queue, we don't hold it when we process the data - which is when we usually need to update stats. What's the benefit of having them per queue? It should save some memory (and a little time when summing stats, though that's unlikely to matter much). The important thing is that splitting up stats per-CPU *and* per-queue is a waste. Ben. -- Sasha.
KVM for the enterprise
Hi, The Open Virtualization Alliance will be holding a webinar on December 8th which is intended to help promote KVM as an enterprise-class hypervisor. I see so much great engineering work going on to make KVM a really robust technology and we want to help tell this story to the world-wide audience. We will have a couple of companies presenting how they deployed KVM in their infrastructure and the benefits they realized. There will also be a speaker from IDC discussing virtualization-specific performance-enhancing features, like SR-IOV, hugepages, asynchronous I/O and more. If anyone would like to attend this webinar (or help promote it) I have posted a link on the OVA site here: http://www.openvirtualizationalliance.org/news/index.html Thanks. Mike
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
On 11/17/2011 4:44 PM, Ben Hutchings wrote: On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote: On 11/17/2011 4:15 PM, Ben Hutchings wrote: Sorry to come to this rather late. On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote: [...] v2 -> v3: - Moved set and get filter ops from rtnl_link_ops to netdev_ops - Support for SRIOV VFs. [Note: The get filters msg (in the way current get rtnetlink handles it) might get too big for SRIOV vfs. This patch follows existing sriov vf get code and tries to accommodate filters for all VF's in a PF. And for the SRIOV case I have only tested the fact that the VF arguments are getting delivered to rtnetlink correctly. The code follows existing sriov vf handling code so rest of it should work fine] [...] This is already broken for large numbers of VFs, and increasing the amount of information per VF is going to make the situation worse. I am no netlink expert but I think that the current approach of bundling all information about an interface in a single message may not be sustainable. Also, I'm unclear on why this interface is to be used to set filtering for the (PF) net device as well as for related VFs. Doesn't that duplicate the functionality of ndo_set_rx_mode and ndo_vlan_rx_{add,kill}_vid? Functionally yes but contextually no. This allows the PF driver to know that it is setting these filters in the context of the existence of VFs, allowing it to take appropriate action. The other two functions may be called without the presence of SR-IOV enablement and the existence of VFs. Anyway, that's why I asked Roopa to add that capability. I don't follow. The PF driver already knows whether it has enabled VFs. How do filters set this way interact with filters set through the existing operations? Should they override promiscuous mode? None of this has been specified. Promiscuous mode is exactly the issue this feature is intended for. 
I'm not familiar with the Solarflare device but Intel HW promiscuous mode is only promiscuous on the physical port, not on the VEB. So a packet sent from a VF will not be captured by the PF across the VEB unless the MAC and VLAN filters have been programmed into the HW. So you may not need the feature for your devices but it is required for Intel devices. And it's a fairly simple request, just allow -1 to indicate that the target of the filter requests is for the PF itself. Using the already existing set_rx_mode function won't work because the PF driver will look at it and figure it's in promiscuous mode anyway, so it won't set the filters into the HW. At least that is how it is in the case of our HW and driver. Again, the behavior of your HW and driver is unknown to me and thus you may not require this feature. - Greg
Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote: On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote: On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote: On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote: Changes for multiqueue virtio_net driver. [...] @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet { struct virtnet_info *vi = netdev_priv(dev); int cpu; - unsigned int start; for_each_possible_cpu(cpu) { - struct virtnet_stats __percpu *stats - = per_cpu_ptr(vi->stats, cpu); - u64 tpackets, tbytes, rpackets, rbytes; - - do { - start = u64_stats_fetch_begin(&stats->syncp); - tpackets = stats->tx_packets; - tbytes = stats->tx_bytes; - rpackets = stats->rx_packets; - rbytes = stats->rx_bytes; - } while (u64_stats_fetch_retry(&stats->syncp, start)); - - tot->rx_packets += rpackets; - tot->tx_packets += tpackets; - tot->rx_bytes += rbytes; - tot->tx_bytes += tbytes; + int qpair; + + for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) { + struct virtnet_send_stats __percpu *tx_stat; + struct virtnet_recv_stats __percpu *rx_stat; While you're at it, you can drop the per-CPU stats and make them only per-queue. There is unlikely to be any benefit in maintaining them per-CPU while receive and transmit processing is serialised per-queue. It allows you to update stats without a lock. But you'll already be holding a lock related to the queue. Right, but now you're holding a queue lock just when playing with the queue, we don't hold it when we process the data - which is when we usually need to update stats. [...] The *stack* is holding the appropriate lock when calling the NAPI poll function or ndo_start_xmit function. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. 
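Ben's point — with per-queue serialisation the counters only need to exist per queue, and the summing loop loses the per-CPU dimension entirely — might look like the following sketch. The structure names are illustrative, not the actual virtio_net types:

```c
#include <stdint.h>

/* One counter set per queue pair. NAPI poll and ndo_start_xmit are already
 * serialised per queue by the stack, so updates need no extra locking and
 * no per-CPU copies. */
struct queue_stats {
	uint64_t tx_packets, tx_bytes, rx_packets, rx_bytes;
};

struct link_stats {
	uint64_t rx_packets, tx_packets, rx_bytes, tx_bytes;
};

/* Single loop over queues replaces the per-CPU + per-queue double sum */
static void sum_queue_stats(const struct queue_stats *qs, int num_queue_pairs,
			    struct link_stats *tot)
{
	int qpair;

	for (qpair = 0; qpair < num_queue_pairs; qpair++) {
		tot->tx_packets += qs[qpair].tx_packets;
		tot->tx_bytes   += qs[qpair].tx_bytes;
		tot->rx_packets += qs[qpair].rx_packets;
		tot->rx_bytes   += qs[qpair].rx_bytes;
	}
}
```

The u64_stats_fetch_begin/retry seqcount dance in the quoted diff is only needed because 64-bit per-CPU counters can be torn on 32-bit hosts; a reader summing per-queue counters would still need an equivalent guard there.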
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote: On 11/17/2011 4:44 PM, Ben Hutchings wrote: On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote: On 11/17/2011 4:15 PM, Ben Hutchings wrote: Sorry to come to this rather late. On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote: [...] v2 -> v3: - Moved set and get filter ops from rtnl_link_ops to netdev_ops - Support for SRIOV VFs. [Note: The get filters msg (in the way current get rtnetlink handles it) might get too big for SRIOV vfs. This patch follows existing sriov vf get code and tries to accommodate filters for all VF's in a PF. And for the SRIOV case I have only tested the fact that the VF arguments are getting delivered to rtnetlink correctly. The code follows existing sriov vf handling code so rest of it should work fine] [...] This is already broken for large numbers of VFs, and increasing the amount of information per VF is going to make the situation worse. I am no netlink expert but I think that the current approach of bundling all information about an interface in a single message may not be sustainable. Also, I'm unclear on why this interface is to be used to set filtering for the (PF) net device as well as for related VFs. Doesn't that duplicate the functionality of ndo_set_rx_mode and ndo_vlan_rx_{add,kill}_vid? Functionally yes but contextually no. This allows the PF driver to know that it is setting these filters in the context of the existence of VFs, allowing it to take appropriate action. The other two functions may be called without the presence of SR-IOV enablement and the existence of VFs. Anyway, that's why I asked Roopa to add that capability. I don't follow. The PF driver already knows whether it has enabled VFs. How do filters set this way interact with filters set through the existing operations? Should they override promiscuous mode? None of this has been specified. Promiscuous mode is exactly the issue this feature is intended for. 
I'm not familiar with the solarflare device but Intel HW promiscuous mode is only promiscuous on the physical port, not on the VEB. So a packet sent from a VF will not be captured by the PF across the VEB unless the MAC and VLAN filters have been programmed into the HW. Yes, I get it. The hardware bridge needs to know more about the address configuration on the host than the driver is getting at the moment. So you may not need the feature for your devices but it is required for Intel devices. Well we don't have the hardware bridge but that means each VF driver needs to know whether to fall back to the software bridge. The net driver needs much the same additional information. And it's a fairly simple request, just allow -1 to indicate that the target of the filter requests is for the PF itself. Using the already existing set_rx_mode function wont' work because the PF driver will look at it and figure it's in promiscuous mode anyway, so it won't set the filters into the HW. At least that is how it is in the case of our HW and driver. Again, the behavior of your HW and driver is unknown to me and thus you may not require this feature. What concerns me is that this seems to be a workaround rather than a fix for over-use of promiscuous mode, and it changes the semantics of filtering modes in ways that haven't been well-specified. What if there's a software bridge between two net devices corresponding to separate physical ports, so that they really need to be promiscuous? What if the administrator runs tcpdump and really wants the (PF) net device to be promiscuous? These cases shouldn't break because of VF acceleration. Or at least we should make a conscious and documented decision that 'promiscuous' doesn't mean that if you enable it on your network adapter. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. 
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote: On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote: On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote: On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: snip +Groups, Devices, IOMMUs, oh my +--- + +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs +can't always distinguish transactions from each individual device in +the system. Sometimes this is because of the IOMMU design, such as with +PEs, other times it's caused by the I/O topology, for instance a +PCIe-to-PCI bridge masking all devices behind it. We call the sets of +devices created by these restrictions IOMMU groups (or just groups for +this document). + +The IOMMU cannot distinguish transactions between the individual devices +within the group, therefore the group is the basic unit of ownership for +a userspace process. Because of this, groups are also the primary +interface to both devices and IOMMU domains in VFIO. + +The VFIO representation of groups is created as devices are added into +the framework by a VFIO bus driver. The vfio-pci module is an example +of a bus driver. This module registers devices along with a set of bus +specific callbacks with the VFIO core. These callbacks provide the +interfaces later used for device access. As each new group is created, +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP +character device. Ok.. so, the fact that it's called vfio-pci suggests that the VFIO bus driver is per bus type, not per bus instance. But grouping constraints could be per bus instance, if you have a couple of different models of PCI host bridge with IOMMUs of different capabilities built in, for example. Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus instance. Ok, how can that work? vfio-pci is responsible for generating the groupings, yes? 
For which it needs to know the iommu/host bridge's isolation capabilities, which vary depending on the type of host bridge. No, grouping is done at the iommu driver level. vfio gets groupings via iommu_device_group(), which uses the iommu_ops for the bus_type of the requested device. I'll attempt to clarify where groups come from in the documentation. IOMMUs also register drivers per bus type, not per bus instance. The IOMMU driver is free to impose any constraints it wants. +In addition to the device enumeration and callbacks, the VFIO bus driver +also provides a traditional device driver and is able to bind to devices +on its bus. When a device is bound to the bus driver it's available to +VFIO. When all the devices within a group are bound to their bus drivers, +the group becomes viable and a user with sufficient access to the VFIO +group chardev can obtain exclusive access to the set of group devices. + +As documented in linux/vfio.h, several ioctls are provided on the +group chardev: + +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64) + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0) + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1) +#define VFIO_GROUP_MERGE _IOW(';', 101, int) +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int) +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103) +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *) + +The last two ioctls return new file descriptors for accessing +individual devices within the group and programming the IOMMU. Each of +these new file descriptors provide their own set of file interfaces. +These ioctls will fail if any of the devices within the group are not +bound to their VFIO bus driver. Additionally, when either of these +interfaces are used, the group is then bound to the struct_mm of the +caller. The GET_FLAGS ioctl can be used to view the state of the group. + +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a +new IOMMU domain is created and all of the devices in the group are +attached to it. 
This is the only way to ensure full IOMMU isolation +of the group, but potentially wastes resources and cycles if the user +intends to manage multiple groups with the same set of IOMMU mappings. +VFIO therefore provides a group MERGE and UNMERGE interface, which +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow +arbitrary groups to be merged, so the user should assume merging is +opportunistic. I do not think opportunistic means what you think it means.. A new group, with no open device or IOMMU file +descriptors, can be merged into an existing, in-use, group using the +MERGE
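As a small illustration of the group state described above, the VIABLE bit gates whether device/IOMMU fds can be handed out, and MM_LOCKED records that the group has been bound to a caller's mm. This is a sketch of the RFC's proposed constants, not a final ABI:

```c
#include <stdint.h>

/* Flag bits returned by the proposed VFIO_GROUP_GET_FLAGS ioctl */
#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
#define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)

/* A group is viable — and GET_IOMMU_FD / GET_DEVICE_FD may succeed — only
 * once every device in the group is bound to its VFIO bus driver. */
static int group_is_viable(uint64_t flags)
{
	return (flags & VFIO_GROUP_FLAGS_VIABLE) != 0;
}

/* MM_LOCKED is set once either fd type has been fetched and the group has
 * been bound to the caller's struct mm. */
static int group_mm_locked(uint64_t flags)
{
	return (flags & VFIO_GROUP_FLAGS_MM_LOCKED) != 0;
}
```

A userspace consumer would open /dev/vfio/$GROUP, issue GET_FLAGS, and refuse to proceed unless the viable bit is set.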
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote: Hmm, that might be cleaner than eliminating the size with just using _IO(). So we might have something like: #define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map) #define VFIO_IOMMU_MAP_DMA_V2 _IOWR(';', 106, struct vfio_dma_map_v2) For which the driver might do: case VFIO_IOMMU_MAP_DMA: case VFIO_IOMMU_MAP_DMA_V2: { struct vfio_dma_map map; /* We don't care about the extra v2 bits */ if (copy_from_user(&map, (void __user *)arg, sizeof map)) return -EFAULT; That won't work if you have an old kernel that doesn't know about v2, and a new user that uses v2. To make this work you'd have to strip out the size from the ioctl number before switching (but still use it when considering whether to access the v2 fields). Simpler to just leave it out of the ioctl number and put it in the struct field as currently planned. I think all our structure sizes are independent of host width. If I'm missing something, let me know. Ah, for structures, that might be true. I was seeing the bunch of ioctl()s that take ints. Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat mode. Does Linux support ILP64? There are int ioctls all over the place, and I don't think we do compat wrappers for them. In fact, some of the ioctls in linux/fs.h use int for the compatible version of ioctls originally defined as long. It's cleaner to always use the fixed types, though. Darn it, guess we need to make everything 64bit, including file descriptors. What's wrong with __u32/__s32 (or uint32_t/int32_t)? I really do not see Linux supporting an ABI that has no 32-bit type at all, especially in a situation where userspace compatibility is needed. If that does happen, the ABI breakage will go well beyond VFIO. The point of the group is to provide a unit of ownership. We can't let $userA open $groupid and fetch a device, then have $userB do the same, grabbing a different device. 
The mappings will step on each other and the devices have no isolation. We can't restrict that purely by file permissions or we'll have the same problem with sudo. What is the problem with sudo? If you're running processes as the same user, regardless of how, they're going to be able to mess with each other. Is it possible to expose access to only specific groups via an individually-permissionable /dev/device, so only the entity handing out access to devices needs access to everything? At one point we discussed a single open instance, but that unnecessarily limits the user, so we settled on the mm. Thanks, It would be nice if this limitation weren't excessively integrated into the design -- in the embedded space we've got unusual partitioning setups, including failover arrangements where partitions share devices. The device may be configured with the IOMMU pointing only at regions that are shared by both mms, or the non-shared regions may be reconfigured as active ownership of the device gets handed around. It would be up to userspace code to make sure that the mappings don't step on each other. The mapping could be done with whichever mm issued the map call for a given region. For this use case, there is unlikely to be an issue with ownership because there will not be separate privilege domains creating partitions -- other use cases could refrain from enabling multiple-mm support unless ownership issues are resolved. This doesn't need to be supported initially, but we should try to avoid letting the assumption permeate the code. -Scott
Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling
On Fri, Nov 18, 2011 at 02:57:11PM +0100, Alexander Graf wrote:
> This touches areas that I'm sure non-PPC people would want to see as
> well.  Could you please CC kvm@vger too next time?  Avi, Marcelo, mind
> to review some of the bits in this patch set? :)

I did cc the last patch (the one that adds barriers in the MMU notifier
sequence/count logic) to kvm@vger.  Do you mean I should cc the whole
series?  The only other thing touching generic code is the addition of
the KVM_MEMSLOT_IO flag in the first patch.

I'm hoping the extra barriers will be OK since they are no-ops on x86.
In fact I now think that the smp_wmb()s I added to
kvm_mmu_notifier_invalidate_page and kvm_mmu_notifier_change_pte aren't
in fact necessary, since it's only necessary to ensure that the sequence
number increase is visible before the point where kvm_unmap_hva or
kvm_set_spte_hva unlock the bitlock on the first rmap chain they look
at, which will be ensured anyway by the barrier before the unlock.

Paul.
RE: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Hi Kevin:

Thanks for your review.  The following means that there should be an
fsync after updating metadata (refcount block, L1 table and L2 table).

Thanks
Tianyu Lan

-----Original Message-----
> +	/* write l2 table */
> +	l2t->dirty = 1;
> +	if (qcow_l2_cache_write(q, l2t) < 0)
> +		goto free_cache;

You need to make sure that the refcount update is written first (e.g.
with fsync), otherwise you risk corruption when the host crashes in the
middle.

> -	if (cache_table(q, l2t) < 0) {
> -		if (ftruncate(q->fd, f_sz) < 0)
> -			goto free_cache;
> +	/* Update the l1 table */
> +	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_new_offset
> +			| QCOW2_OFLAG_COPIED);
> -
> -		goto free_cache;
> -	}
> +	if (pwrite_in_full(q->fd, l1t->l1_table,
> +			l1t->table_size * sizeof(u64),
> +			header->l1_table_offset) < 0)
> +		goto error;

Likewise, the L1 table write must be ordered against the L2 write.
"goto error" is using the wrong label.

> -	/* Update the in-core entry */
> -	l1t->l1_table[l1t_idx] = cpu_to_be64(l2t_offset);
> +	/* cache l2 table */
> +	cache_table(q, l2t);

After so many explicit comments, you can probably guess what's wrong
here...
Re: [PATCH 0/6] KVM: PPC: mostly booke: race and idle fixes, paravirt
On 09.11.2011, at 01:22, Scott Wood wrote:
> The first couple patches are some minor reorganization of the checking
> done prior to entering the guest, which should avoid some races
> relative to exception queueing, and give us a more clearly-named place
> (which later patches in the series use) to put other checks that we
> want to always run before entering the guest.  Then there's a fix to
> booke idle handling to not block inside KVM_SET_REGS, and some
> paravirt fixes and new functionality.
>
> Apply after "KVM: PPC: booke: check for signals in kvmppc_vcpu_run"
> (I didn't think there'd end up being a dependency when I sent that...)
> This supersedes the first three patches of "KVM: PPC: booke: paravirt
> and timer" from 8/26.
>
> Scott Wood (6):
>   KVM: PPC: Rename deliver_interrupts to prepare_to_enter
>   KVM: PPC: Move prepare_to_enter call site into subarch code
>   KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
>   KVM: PPC: booke: Fix int_pending calculation for MSR[EE] paravirt
>   KVM: PPC: booke: Paravirtualize wrtee
>   KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn

Thanks, applied all to kvm-ppc-next.

Alex
Re: [PATCH v2] KVM: PPC: booke: Improve timer register emulation
On 17.11.2011, at 23:39, Scott Wood wrote:
> Decrementers are now properly driven by TCR/TSR, and the guest has
> full read/write access to these registers.
>
> The decrementer keeps ticking (and setting the TSR bit) regardless of
> whether the interrupts are enabled with TCR.  The decrementer stops at
> zero, rather than going negative.
>
> Decrementers (and FITs, once implemented) are delivered as
> level-triggered interrupts -- dequeued when the TSR bit is cleared,
> not on delivery.
>
> Signed-off-by: Liu Yu <yu@freescale.com>
> [scottw...@freescale.com: significant changes]
> Signed-off-by: Scott Wood <scottw...@freescale.com>

Thanks, applied to kvm-ppc-next.

Alex
Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling
On 16.11.2011, at 23:50, Paul Mackerras wrote:
> This series of patches updates the Book3S-HV KVM code that manages the
> guest hashed page table (HPT) to enable several things:
>
> * MMIO emulation and MMIO pass-through
> * Use of small pages (4kB or 64kB, depending on config) to back the
>   guest memory
> * Pageable guest memory - i.e. backing pages can be removed from the
>   guest and reinstated on demand, using the MMU notifier mechanism.
>
> On PPC970 we have no way to get DSIs and ISIs to come to the
> hypervisor, so we can't do MMIO emulation or pageable guest memory.
> On POWER7 we set the VPM1 bit in the LPCR to make all DSIs and ISIs
> come to the hypervisor (host) as HDSIs or HISIs.
>
> This series is RFC for the moment, although the first 5 or so patches
> are pretty solid and could go in.  I am going to rework the later
> patches to use HPTEs with V=0 for the absent pages rather than key=31,
> which will require handling the HPTE-not-present HDSIs we will get and
> differentiating the case where the guest has created a HPTE but the
> underlying page is not resident from the case where the guest has
> created no HPTE for the address.

This touches areas that I'm sure non-PPC people would want to see as
well.  Could you please CC kvm@vger too next time?  Avi, Marcelo, mind
to review some of the bits in this patch set? :)

Alex