Re: [PATCHv2 0/8 RFC] boot order specification
On Sun, Oct 31, 2010 at 06:25:53PM -0400, Kevin O'Connor wrote: On Sun, Oct 31, 2010 at 01:40:01PM +0200, Gleb Natapov wrote: This is current sate of the patch series for people to comment on. I tried to use open firmware naming scheme to specify device path names. The patch series produce names like these: for pci machine: /p...@i0cf8/pci-isa-bri...@1/f...@03f1/flo...@0 /p...@i0cf8/pci-isa-bri...@1/f...@03f1/flo...@1 /p...@i0cf8/a...@1,1/ata-d...@1:0 /p...@i0cf8/a...@1,1/ata-d...@1:1 /p...@i0cf8/virtio-...@3/virtio-d...@0 /p...@i0cf8/ether...@4/ethernet-...@0 /p...@i0cf8/ether...@5/ethernet-...@0 for isa machine: adding '/isa/f...@03f1/flo...@0' at index 2 adding '/isa/f...@03f1/flo...@1' at index 1 adding '/isa/a...@0170/ata-d...@0:0' at index 0 adding '/isa/a...@0170/ata-d...@0:1' at index 3 Hi Gleb, How will USB drives be identified? USB bus has Open Firmware binding. I haven't look at the spec yet, but it should be easy. I'm not sure how SeaBIOS will be able to line up something like /p...@i0cf8/ether...@4/ethernet-...@0 to an optionrom BEV. Also, if there is an optionrom with BCVs (eg, a scsi card), I'm not sure how that would that would be identified. The way to parse /p...@i0cf8/ether...@4/ethernet-...@0 is this: each element (between /.../) consist of node-n...@unit-address. node-name describes device/bus. unit-address is a device address on preceding node. So p...@i0cf8 tells us that this is pci bus accessible through io register 0x0cf8, ether...@4 tells us that this is ethernet device in pci slot 4 function 0, (a...@1,1 means ata device in slot 1 function 1). ethernet-...@0 means first phy on this ethernet device (usually there is only one anyway). So if the pci card in slot 4 device 0 has optionrom with BCV Seabios can associate bootindex with it easily given the device path above. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
On Tue, Oct 26, 2010 at 03:19:38PM -0700, Sridhar Samudrala wrote: With the current default macvtap mode, a KVM guest using virtio with macvtap backend has the following limitations. - cannot change/add a mac address on the guest virtio-net - cannot create a vlan device on the guest virtio-net - cannot enable promiscuous mode on guest virtio-net This patch introduces a new mode called 'passthru' when creating a macvlan device which allows takeover of the underlying device and passing it to a guest using virtio with macvtap backend. Only one macvlan device is allowed in passthru mode and it inherits the mac address from the underlying device and sets it in promiscuous mode to receive and forward all the packets. Thanks Sridhar One concern with promisc mode is that for the common case, which is a single mac and no vlans, we will be getting extra packets that will get dropped in userspace/guest as compared to the case where same mac is programmed in hardware and by guest. We could let userspace supply a list of mac/vlan addresses through an ioctl on macvtap, and then 1. for a single mac, program it in hardware 2. for other configurations, set promisc mode tun already has TUNSETTXFILTER which might come in handy here. We don't pass in vlans with the filter now but maybe we should. How does this sound? diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 0ef0eb0..bca3cb7 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -38,6 +38,7 @@ struct macvlan_port { struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; struct list_headvlans; struct rcu_head rcu; + boolpassthru; }; #define macvlan_port_get_rcu(dev) \ @@ -169,6 +170,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE | MACVLAN_MODE_VEPA| + MACVLAN_MODE_PASSTHRU| MACVLAN_MODE_BRIDGE); else if (src-mode == MACVLAN_MODE_VEPA) /* flood to everyone except source */ @@ -185,7 +187,10 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) return skb; } - vlan = macvlan_hash_lookup(port, eth-h_dest); + if (port-passthru) + vlan = list_first_entry(port-vlans, struct macvlan_dev, list); + else + vlan = macvlan_hash_lookup(port, eth-h_dest); if (vlan == NULL) return skb; @@ -284,6 +289,11 @@ static int macvlan_open(struct net_device *dev) struct net_device *lowerdev = vlan-lowerdev; int err; + if (vlan-port-passthru) { + dev_set_promiscuity(lowerdev, 1); + goto hash_add; + } + err = -EBUSY; if (macvlan_addr_busy(vlan-port, dev-dev_addr)) goto out; @@ -296,6 +306,8 @@ static int macvlan_open(struct net_device *dev) if (err 0) goto del_unicast; } + +hash_add: macvlan_hash_add(vlan); return 0; @@ -310,12 +322,18 @@ static int macvlan_stop(struct net_device *dev) struct macvlan_dev *vlan = netdev_priv(dev); struct net_device *lowerdev = vlan-lowerdev; + if (vlan-port-passthru) { + dev_set_promiscuity(lowerdev, -1); + goto hash_del; + } + dev_mc_unsync(lowerdev, dev); if (dev-flags IFF_ALLMULTI) dev_set_allmulti(lowerdev, -1); dev_uc_del(lowerdev, dev-dev_addr); +hash_del: macvlan_hash_del(vlan); return 0; } @@ -549,6 +567,7 @@ static int macvlan_port_create(struct net_device *dev) if (port == NULL) return -ENOMEM; + port-passthru = false; port-dev = dev; INIT_LIST_HEAD(port-vlans); for (i = 0; i MACVLAN_HASH_SIZE; i++) @@ -593,6 +612,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) case MACVLAN_MODE_PRIVATE: case MACVLAN_MODE_VEPA: case MACVLAN_MODE_BRIDGE: + case MACVLAN_MODE_PASSTHRU: break; default: return -EINVAL; @@ -661,6 +681,10 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev, } port = macvlan_port_get(lowerdev); + /* Only 1 macvlan device can be created in passthru mode */ + if (port-passthru) + return -EINVAL; + vlan-lowerdev = lowerdev; vlan-dev = dev; vlan-port = port; @@ -671,6 +695,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev, if (data data[IFLA_MACVLAN_MODE]) vlan-mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
[PATCH v2 1/7] KVM: fix tracing kvm_try_async_get_page
Tracing 'async' and *pfn is useless, since 'async' is always true, and '*pfn' is always fault_pfn' We can trace 'gva' and 'gfn' instead, it can help us to see the life-cycle of an async_pf Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c |2 +- include/trace/events/kvm.h | 12 +++- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 39cc0c6..5275c50 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2627,7 +2627,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, put_page(pfn_to_page(*pfn)); if (!no_apf can_do_async_pf(vcpu)) { - trace_kvm_try_async_get_page(async, *pfn); + trace_kvm_try_async_get_page(gva, gfn); if (kvm_find_async_pf_gfn(vcpu, gfn)) { trace_kvm_async_pf_doublefault(gva, gfn); kvm_make_request(KVM_REQ_APF_HALT, vcpu); diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 9c2cc6a..30063c6 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -188,18 +188,20 @@ TRACE_EVENT(kvm_age_page, #ifdef CONFIG_KVM_ASYNC_PF TRACE_EVENT( kvm_try_async_get_page, - TP_PROTO(bool async, u64 pfn), - TP_ARGS(async, pfn), + TP_PROTO(u64 gva, u64 gfn), + TP_ARGS(gva, gfn), TP_STRUCT__entry( - __field(__u64, pfn) + __field(u64, gva) + __field(u64, gfn) ), TP_fast_assign( - __entry-pfn = (!async) ? pfn : (u64)-1; + __entry-gva = gva; + __entry-gfn = gfn; ), - TP_printk(pfn %#llx, __entry-pfn) + TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn) ); TRACE_EVENT( -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/7] KVM: cleanup aysnc_pf tracepoints
Use 'DECLARE_EVENT_CLASS' to cleanup async_pf tracepoints Acked-by: Gleb Natapov g...@redhat.com Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- include/trace/events/kvm.h | 76 --- 1 files changed, 35 insertions(+), 41 deletions(-) diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 30063c6..7bec396 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -186,59 +186,71 @@ TRACE_EVENT(kvm_age_page, ); #ifdef CONFIG_KVM_ASYNC_PF -TRACE_EVENT( - kvm_try_async_get_page, +DECLARE_EVENT_CLASS(kvm_async_get_page_class, + TP_PROTO(u64 gva, u64 gfn), + TP_ARGS(gva, gfn), TP_STRUCT__entry( - __field(u64, gva) + __field(__u64, gva) __field(u64, gfn) - ), + ), TP_fast_assign( __entry-gva = gva; __entry-gfn = gfn; - ), + ), TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn) ); -TRACE_EVENT( - kvm_async_pf_not_present, +DEFINE_EVENT(kvm_async_get_page_class, kvm_try_async_get_page, + + TP_PROTO(u64 gva, u64 gfn), + + TP_ARGS(gva, gfn) +); + +DEFINE_EVENT(kvm_async_get_page_class, kvm_async_pf_doublefault, + + TP_PROTO(u64 gva, u64 gfn), + + TP_ARGS(gva, gfn) +); + +DECLARE_EVENT_CLASS(kvm_async_pf_nopresent_ready, + TP_PROTO(u64 token, u64 gva), + TP_ARGS(token, gva), TP_STRUCT__entry( __field(__u64, token) __field(__u64, gva) - ), + ), TP_fast_assign( __entry-token = token; __entry-gva = gva; - ), + ), + + TP_printk(token %#llx gva %#llx, __entry-token, __entry-gva) - TP_printk(token %#llx gva %#llx not present, __entry-token, - __entry-gva) ); -TRACE_EVENT( - kvm_async_pf_ready, +DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_not_present, + TP_PROTO(u64 token, u64 gva), - TP_ARGS(token, gva), - TP_STRUCT__entry( - __field(__u64, token) - __field(__u64, gva) - ), + TP_ARGS(token, gva) +); - TP_fast_assign( - __entry-token = token; - __entry-gva = gva; - ), +DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_ready, + + TP_PROTO(u64 token, u64 gva), - TP_printk(token %#llx gva %#llx ready, __entry-token, __entry-gva) + TP_ARGS(token, gva) ); TRACE_EVENT( @@ -262,24 +274,6 @@ TRACE_EVENT( __entry-address, __entry-pfn) ); -TRACE_EVENT( - kvm_async_pf_doublefault, - TP_PROTO(u64 gva, u64 gfn), - TP_ARGS(gva, gfn), - - TP_STRUCT__entry( - __field(u64, gva) - __field(u64, gfn) - ), - - TP_fast_assign( - __entry-gva = gva; - __entry-gfn = gfn; - ), - - TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn) -); - #endif #endif /* _TRACE_KVM_MAIN_H */ -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/7] KVM: fix searching async gfn in kvm_async_pf_gfn_slot
Don't search later slots if the slot is empty Acked-by: Gleb Natapov g...@redhat.com Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2cfdf2d..9b543f4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6206,8 +6206,8 @@ static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn) u32 key = kvm_async_pf_hash_fn(gfn); for (i = 0; i roundup_pow_of_two(ASYNC_PF_PER_VCPU) -(vcpu-arch.apf.gfns[key] != gfn || - vcpu-arch.apf.gfns[key] == ~0); i++) +(vcpu-arch.apf.gfns[key] != gfn + vcpu-arch.apf.gfns[key] != ~0); i++) key = kvm_async_pf_next_probe(key); return key; -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/7] KVM: avoid unnecessary wait for a async pf
In current code, it checks async pf completion out of the wait context, like this: if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE !vcpu-arch.apf.halted) r = vcpu_enter_guest(vcpu); else { .. kvm_vcpu_block(vcpu) ^- waiting until 'async_pf.done' is not empty } kvm_check_async_pf_completion(vcpu) ^- delete list from async_pf.done So, if we check aysnc pf completion first, it can be blocked at kvm_vcpu_block Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion() path Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9b543f4..4da8485 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6280,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); } + vcpu-arch.apf.halted = false; } bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/7] KVM: handle more completed apfs if possible
If it's no need to inject async #PF to PV guest we can handle more completed apfs at one time, so we can retry guest #PF as early as possible Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |3 ++- arch/x86/kvm/x86.c |8 ++-- virt/kvm/async_pf.c | 28 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1be0058..c95b3ff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -818,7 +818,8 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip); void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, +/* return true if we can handle more completed apfs, false otherwise */ +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4da8485..189664a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6265,7 +6265,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, } } -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) { trace_kvm_async_pf_ready(work-arch.token, work-gva); @@ -6274,13 +6274,17 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, else kvm_del_async_pf_gfn(vcpu, work-arch.gfn); + vcpu-arch.apf.halted = false; + if ((vcpu-arch.apf.msr_val KVM_ASYNC_PF_ENABLED) !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) { vcpu-arch.fault.error_code = 0; vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); + return false; } - vcpu-arch.apf.halted = false; + + return true; } bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index 60df9e0..d57ec92 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -123,25 +123,29 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu) { struct kvm_async_pf *work; + bool ret; if (list_empty_careful(vcpu-async_pf.done) || !kvm_arch_can_inject_async_page_present(vcpu)) return; - spin_lock(vcpu-async_pf.lock); - work = list_first_entry(vcpu-async_pf.done, typeof(*work), link); - list_del(work-link); - spin_unlock(vcpu-async_pf.lock); + do { + spin_lock(vcpu-async_pf.lock); + work = list_first_entry(vcpu-async_pf.done, typeof(*work), + link); + list_del(work-link); + spin_unlock(vcpu-async_pf.lock); - if (work-page) - kvm_arch_async_page_ready(vcpu, work); - kvm_arch_async_page_present(vcpu, work); + if (work-page) + kvm_arch_async_page_ready(vcpu, work); + ret = kvm_arch_async_page_present(vcpu, work); - list_del(work-queue); - vcpu-async_pf.queued--; - if (work-page) - put_page(work-page); - kmem_cache_free(async_pf_cache, work); + list_del(work-queue); + vcpu-async_pf.queued--; + if (work-page) + put_page(work-page); + kmem_cache_free(async_pf_cache, work); + } while (ret !list_empty_careful(vcpu-async_pf.done)); } int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn, -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH v2 6/7] KVM: fix the race while wakeup all pv guest
In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu-async_pf.done without holding vcpu-async_pf.lock, it will break if we are handling apfs at this time. Also use 'list_empty_careful()' instead of 'list_empty()' Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- virt/kvm/async_pf.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index d57ec92..6ef3373 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -200,7 +200,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu) { struct kvm_async_pf *work; - if (!list_empty(vcpu-async_pf.done)) + if (!list_empty_careful(vcpu-async_pf.done)) return 0; work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC); @@ -211,7 +211,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu) get_page(bad_page); INIT_LIST_HEAD(work-queue); /* for list_del to work */ + spin_lock(vcpu-async_pf.lock); list_add_tail(work-link, vcpu-async_pf.done); + spin_unlock(vcpu-async_pf.lock); + vcpu-async_pf.queued++; return 0; } -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs
Don't make a KVM_REQ_UNHALT request after async pf is completed since it can break guest's 'HLT' instruction. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 13 ++--- include/linux/kvm_host.h |6 ++ virt/kvm/kvm_main.c |9 - 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 189664a..c57fb38 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6105,13 +6105,20 @@ void kvm_arch_flush_shadow(struct kvm *kvm) int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) { - return (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE + if ((vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE !vcpu-arch.apf.halted) - || !list_empty_careful(vcpu-async_pf.done) || vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED || vcpu-arch.nmi_pending || (kvm_arch_interrupt_allowed(vcpu) -kvm_cpu_has_interrupt(vcpu)); +kvm_cpu_has_interrupt(vcpu))) + return 1; + + if (!list_empty_careful(vcpu-async_pf.done)) { + vcpu-arch.apf.halted = false; + return 2; + } + + return 0; } void kvm_vcpu_kick(struct kvm_vcpu *vcpu) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 462b982..8def043 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -436,6 +436,12 @@ void kvm_arch_hardware_disable(void *garbage); int kvm_arch_hardware_setup(void); void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); + +/* + * return value: 0 if the vcpu is runnable, 0 if not. + * Especially, if the return value == 1, we should make a + * 'KVM_REQ_UNHALT' request. + */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); void kvm_free_physmem(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index dbe1d6a..9ccf105 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1375,14 +1375,21 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn) void kvm_vcpu_block(struct kvm_vcpu *vcpu) { DEFINE_WAIT(wait); + int ret; for (;;) { prepare_to_wait(vcpu-wq, wait, TASK_INTERRUPTIBLE); - if (kvm_arch_vcpu_runnable(vcpu)) { + ret = kvm_arch_vcpu_runnable(vcpu); + + if (ret == 1) { kvm_make_request(KVM_REQ_UNHALT, vcpu); break; } + + if (ret 1) + break; + if (kvm_cpu_has_pending_timer(vcpu)) break; if (signal_pending(current)) -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/7] KVM: handle more completed apfs if possible
On Mon, Nov 01, 2010 at 05:02:35PM +0800, Xiao Guangrong wrote: If it's no need to inject async #PF to PV guest we can handle more completed apfs at one time, so we can retry guest #PF as early as possible Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |3 ++- arch/x86/kvm/x86.c |8 ++-- virt/kvm/async_pf.c | 28 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1be0058..c95b3ff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -818,7 +818,8 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip); void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, +/* return true if we can handle more completed apfs, false otherwise */ +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4da8485..189664a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6265,7 +6265,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, } } -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) { trace_kvm_async_pf_ready(work-arch.token, work-gva); @@ -6274,13 +6274,17 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, else kvm_del_async_pf_gfn(vcpu, work-arch.gfn); + vcpu-arch.apf.halted = false; + if ((vcpu-arch.apf.msr_val KVM_ASYNC_PF_ENABLED) !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) { vcpu-arch.fault.error_code = 0; vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); + return false; } - vcpu-arch.apf.halted = false; + + return true; } bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index 60df9e0..d57ec92 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -123,25 +123,29 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu) { struct kvm_async_pf *work; + bool ret; if (list_empty_careful(vcpu-async_pf.done) || !kvm_arch_can_inject_async_page_present(vcpu)) return; - spin_lock(vcpu-async_pf.lock); - work = list_first_entry(vcpu-async_pf.done, typeof(*work), link); - list_del(work-link); - spin_unlock(vcpu-async_pf.lock); + do { + spin_lock(vcpu-async_pf.lock); + work = list_first_entry(vcpu-async_pf.done, typeof(*work), + link); + list_del(work-link); + spin_unlock(vcpu-async_pf.lock); - if (work-page) - kvm_arch_async_page_ready(vcpu, work); - kvm_arch_async_page_present(vcpu, work); + if (work-page) + kvm_arch_async_page_ready(vcpu, work); + ret = kvm_arch_async_page_present(vcpu, work); - list_del(work-queue); - vcpu-async_pf.queued--; - if (work-page) - put_page(work-page); - kmem_cache_free(async_pf_cache, work); + list_del(work-queue); + vcpu-async_pf.queued--; + if (work-page) + put_page(work-page); + kmem_cache_free(async_pf_cache, work); + } while (ret !list_empty_careful(vcpu-async_pf.done)); } No need to change kvm_arch_async_page_present() to return anything. You can do while loop like this: while (!list_empty_careful(vcpu-async_pf.done) kvm_arch_can_inject_async_page_present(vcpu)) { } If kvm_arch_async_page_present() call injects exception kvm_arch_can_inject_async_page_present() will return false on next iteration. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/7] KVM: avoid unnecessary wait for a async pf
On Mon, Nov 01, 2010 at 05:01:28PM +0800, Xiao Guangrong wrote: In current code, it checks async pf completion out of the wait context, like this: if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE !vcpu-arch.apf.halted) r = vcpu_enter_guest(vcpu); else { .. kvm_vcpu_block(vcpu) ^- waiting until 'async_pf.done' is not empty } kvm_check_async_pf_completion(vcpu) ^- delete list from async_pf.done So, if we check aysnc pf completion first, it can be blocked at kvm_vcpu_block Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion() path Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Acked-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/x86.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9b543f4..4da8485 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6280,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); } + vcpu-arch.apf.halted = false; } bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) -- 1.7.0.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/7] KVM: handle more completed apfs if possible
On 11/01/2010 05:24 PM, Gleb Natapov wrote: -put_page(work-page); -kmem_cache_free(async_pf_cache, work); +list_del(work-queue); +vcpu-async_pf.queued--; +if (work-page) +put_page(work-page); +kmem_cache_free(async_pf_cache, work); +} while (ret !list_empty_careful(vcpu-async_pf.done)); } No need to change kvm_arch_async_page_present() to return anything. You can do while loop like this: while (!list_empty_careful(vcpu-async_pf.done) kvm_arch_can_inject_async_page_present(vcpu)) { } If kvm_arch_async_page_present() call injects exception kvm_arch_can_inject_async_page_present() will return false on next iteration. Yeah, it's a better way, i'll fix it, thanks Gleb! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TODO item: guest programmable mac/vlan filtering with macvtap
1. add a secondary mac (or third, etc) address to the guest virtio-net interface. Maybe I misunderstood this. Is it just setting another mac on the guest virtio-net interface? 4. the above stuff must be controllable by host admin - Well, for this there are a few options: admin switch that allows the guest user to add macs preconfig allowed MAC's in mactap (or qemu config) for the guest user allow/disallow command for user in qemu (although this doesn't seem to be supported) Well, on a second thought, qemu capabilities should be just fine, right? -- Dragos -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Device assignment, shared IRQs, uio_pci_generic
Am 31.10.2010 16:05, Jan Kiszka wrote: Still not working here are ehci, ath9k and e1000e when passed through. They receive IRQs, but somehow the Linux guest drivers are unhappy about the device states (the e1000e detects Hardware Unit Hang e.g.). This is independent of my patches. Anyone any ideas? Improper configurations of the BIOS? I stumbled over this in the kernel log: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU for graphics Argh, no one told me that there is CONFIG_DMAR_DEFAULT_ON and that you have to pass intel_iommu=on if that config switch is off like here. Moreover, I somehow assumed iommu=0 on the command line means use IOMMU with index 0 (don't ask me why). Usability deserves some patches - later. So it's starting to work, already accessed some devices (ath9k did a successful scan), but now it's crashing in intel_iommu_attach_device. Looks like use after release. Sigh. Jan signature.asc Description: OpenPGP digital signature
Re: TODO item: guest programmable mac/vlan filtering with macvtap
On Mon, Nov 01, 2010 at 11:48:23AM +0100, Dragos Tatulea wrote: 1. add a secondary mac (or third, etc) address to the guest virtio-net interface. Maybe I misunderstood this. Is it just setting another mac on the guest virtio-net interface? Well, yes, that's also not possible at the moment. Or e.g. set more than one mac per virtio-net device using macvlan. 4. the above stuff must be controllable by host admin - Well, for this there are a few options: admin switch that allows the guest user to add macs preconfig allowed MAC's in mactap (or qemu config) for the guest user allow/disallow command for user in qemu (although this doesn't seem to be supported) Well, on a second thought, qemu capabilities should be just fine, right? -- Dragos At some level, although I think we also want a way to disable access that qemu can't override unless it has net admin capability. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Crash in intel_iommu_assign_device
Hi Sheng, I'm not claiming to understand the details, but this looks like use (dereference of pte via dma_pte_addr) after release (free_pgtable_page of dmar_domain-pgd aka pte) to me: static int intel_iommu_attach_device(struct iommu_domain *domain, struct device *dev) { [...] pte = dmar_domain-pgd; if (dma_pte_present(pte)) { free_pgtable_page(dmar_domain-pgd); dmar_domain-pgd = (struct dma_pte *) phys_to_virt(dma_pte_addr(pte)); } At least it crashes here right on pte-val access. Swap both lines? Jan signature.asc Description: OpenPGP digital signature
Cannot boot 2.6.35 SMP guest in 2.6.35 host
I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x guests. The boot proceeds halfway but does not reach a shell or desktop. Oddly the console echoes characters. I am using a AMD SVM system and found this: https://patchwork.kernel.org/patch/226981/ Has this been resolved? Fedora 14 x86_64 Host: Guests: 0. Fedora 14 Install DVD - boots halfway and gets stuck 1. Ubuntu 10.04 (2.6.32.x) - boots SMP to desktop 2. Ubuntu 10.10 (2.6.35.x) -boots UP to shell (server version); cannot get it to boot to desktop UP or SMP 3. Windows XP guest ok -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Cannot boot 2.6.35 SMP guest in 2.6.35 host
Richard Chan wrote: I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x guests. The boot proceeds halfway but does not reach a shell or desktop. Oddly the console echoes characters. I am using a AMD SVM system and found this: https://patchwork.kernel.org/patch/226981/ Has this been resolved? 2.6.36.8 kernel which were released a few days ago contains 2 simple patches that addresses this issue. The same patches are needed for 2.6.32 kernel too, but they weren't applied to 2.6.32.25. I can't say anything about fedora kernels. This is all about host kernel, not guest one. /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Cannot boot 2.6.35 SMP guest in 2.6.35 host
Michael Tokarev wrote: Richard Chan wrote: I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x guests. The boot proceeds halfway but does not reach a shell or desktop. Oddly the console echoes characters. I am using a AMD SVM system and found this: https://patchwork.kernel.org/patch/226981/ Has this been resolved? 2.6.36.8 kernel which were released a few days ago I mean 2.6.35.8 ofcourse. /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Crash on kvm_iommu_map_pages
Hi again, OK, I swapped those two lines in intel_iommu_attach_device [1], fixed another warning in the wbinvd emulation, but now I'm about to give up. This is freaky MMU stuff: general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/pci:00/:00:1a.0/device CPU 1 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif dm_snapshot dm_mod fan processor ahci libahci ata_generi c liba ta scsi_mod thermal thermal_sys hwmon Nov 1 13:19:11 mchn199C kernel: Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700 RIP: 0010:[8121de8c] [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP: 0018:8800bd4bdb68 EFLAGS: 00010202 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009 RDX: 0180 RSI: 88012a940938 RDI: 0202 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004 R10: 0001 R11: R12: 880128dfee00 R13: 0002 R14: 000f R15: 0009 FS: 7f4990d33710() GS:8800be68() knlGS: CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: 01408000 CR3: bd7db000 CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 88012a94) Stack: 8800bd4bdb88 8800ac378000 00d2 000f 0 8800bd4bdce8 8800ac2fc000 0003 000f0001 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573 Call Trace: [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a [812a9573] iommu_iova_to_phys+0x13/0x15 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm] [8111404d] ? __vmalloc_node+0x86/0x9b [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm] [81081ee8] ? mark_held_locks+0x50/0x72 [8137c983] ? mutex_lock_nested+0x325/0x34d [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm] [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm] [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm] [811355aa] ? fget_light+0x17b/0x31f [81143bd7] do_vfs_ioctl+0x4c6/0x507 [81135732] ? fget_light+0x303/0x31f [811355aa] ? fget_light+0x17b/0x31f [8137ebb9] ? retint_swapgs+0x13/0x1b [81143c6e] sys_ioctl+0x56/0x7c [81002df2] system_call_fastpath+0x16/0x1b Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 RIP [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP 8800bd4bdb68 Can anyone parse this? Is it Intel-specific or a generic issue? The kernel is current kvm.git + unrelated patch + [1]. Jan [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923 signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs
On Mon, Nov 01, 2010 at 05:05:00PM +0800, Xiao Guangrong wrote: Don't make a KVM_REQ_UNHALT request after async pf is completed since it can break guest's 'HLT' instruction. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 13 ++--- include/linux/kvm_host.h |6 ++ virt/kvm/kvm_main.c |9 - 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 189664a..c57fb38 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6105,13 +6105,20 @@ void kvm_arch_flush_shadow(struct kvm *kvm) int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) { - return (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE + if ((vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE !vcpu-arch.apf.halted) - || !list_empty_careful(vcpu-async_pf.done) || vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED || vcpu-arch.nmi_pending || (kvm_arch_interrupt_allowed(vcpu) - kvm_cpu_has_interrupt(vcpu)); + kvm_cpu_has_interrupt(vcpu))) + return 1; + + if (!list_empty_careful(vcpu-async_pf.done)) { + vcpu-arch.apf.halted = false; + return 2; + } kvm_arch_vcpu_runnable() shouldn't change vcpu state. I don't like the way it propagates internal x86 state to kvm_vcpu_block() too. May be what you are looking for is the patch below? (not tested). diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2cfdf2d..f7aed95 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5295,8 +5295,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) { switch(vcpu-arch.mp_state) { case KVM_MP_STATE_HALTED: - vcpu-arch.mp_state = - KVM_MP_STATE_RUNNABLE; + if (list_empty_careful(vcpu-async_pf.done)) + vcpu-arch.mp_state = + KVM_MP_STATE_RUNNABLE; case KVM_MP_STATE_RUNNABLE: vcpu-arch.apf.halted = false; break; @@ -6279,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, vcpu-arch.fault.error_code = 0; vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); + vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; } } -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash on kvm_iommu_map_pages
[ Forgot to CC LKML - maybe it's not KVM-specific. BTW, is anyone actually using current KVM device assigment on Intel? I'm starting to believe that can only very few lucky people... ] Am 01.11.2010 13:51, Jan Kiszka wrote: Hi again, OK, I swapped those two lines in intel_iommu_attach_device [1], fixed another warning in the wbinvd emulation, but now I'm about to give up. This is freaky MMU stuff: general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/pci:00/:00:1a.0/device CPU 1 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif dm_snapshot dm_mod fan processor ahci libahci ata_gene ri c liba ta scsi_mod thermal thermal_sys hwmon Nov 1 13:19:11 mchn199C kernel: Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700 RIP: 0010:[8121de8c] [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP: 0018:8800bd4bdb68 EFLAGS: 00010202 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009 RDX: 0180 RSI: 88012a940938 RDI: 0202 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004 R10: 0001 R11: R12: 880128dfee00 R13: 0002 R14: 000f R15: 0009 FS: 7f4990d33710() GS:8800be68() knlGS: CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: 01408000 CR3: bd7db000 CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 88012a94) Stack: 8800bd4bdb88 8800ac378000 00d2 000f 0 8800bd4bdce8 8800ac2fc000 0003 000f0001 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573 Call Trace: [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a [812a9573] iommu_iova_to_phys+0x13/0x15 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm] [8111404d] ? __vmalloc_node+0x86/0x9b [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm] [81081ee8] ? mark_held_locks+0x50/0x72 [8137c983] ? mutex_lock_nested+0x325/0x34d [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm] [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm] [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm] [811355aa] ? fget_light+0x17b/0x31f [81143bd7] do_vfs_ioctl+0x4c6/0x507 [81135732] ? fget_light+0x303/0x31f [811355aa] ? fget_light+0x17b/0x31f [8137ebb9] ? retint_swapgs+0x13/0x1b [81143c6e] sys_ioctl+0x56/0x7c [81002df2] system_call_fastpath+0x16/0x1b Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 RIP [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP 8800bd4bdb68 Can anyone parse this? Is it Intel-specific or a generic issue? The kernel is current kvm.git + unrelated patch + [1]. Jan [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923 signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH v2 6/7] KVM: fix the race while wakeup all pv guest
On Mon, Nov 01, 2010 at 05:03:44PM +0800, Xiao Guangrong wrote: In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu-async_pf.done without holding vcpu-async_pf.lock, it will break if we are handling apfs at this time. This should never happen to well behaved guest, but malicious guest can do it on purpose. Also use 'list_empty_careful()' instead of 'list_empty()' Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Acked-by: Gleb Natapov g...@redhat.com --- virt/kvm/async_pf.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index d57ec92..6ef3373 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -200,7 +200,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu) { struct kvm_async_pf *work; - if (!list_empty(vcpu-async_pf.done)) + if (!list_empty_careful(vcpu-async_pf.done)) return 0; work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC); @@ -211,7 +211,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu) get_page(bad_page); INIT_LIST_HEAD(work-queue); /* for list_del to work */ + spin_lock(vcpu-async_pf.lock); list_add_tail(work-link, vcpu-async_pf.done); + spin_unlock(vcpu-async_pf.lock); + vcpu-async_pf.queued++; return 0; } -- 1.7.0.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: x86: Issue smp_call_function_many with preemption disabled
From: Jan Kiszka jan.kis...@siemens.com smp_call_function_many is specified to be called only with preemption disabled. Fulfill this requirement. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/x86.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2cfdf2d..0ed6dad 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4012,8 +4012,10 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu) return X86EMUL_CONTINUE; if (kvm_x86_ops-has_wbinvd_exit()) { + preempt_disable(); smp_call_function_many(vcpu-arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1); + preempt_enable(); cpumask_clear(vcpu-arch.wbinvd_dirty_mask); } wbinvd(); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: x86: Avoid issuing wbinvd twice
From: Jan Kiszka jan.kis...@siemens.com Micro optimization to avoid calling wbinvd twice on the CPU that has to emulate it. As we might be preempted between smp_call_function_many and the local wbinvd, the cache might be filled again so that real work could be done uselessly. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- I can fold the first patch into this one if it's preferred. arch/x86/kvm/x86.c | 10 ++ 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0ed6dad..58e8232 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4012,13 +4012,15 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu) return X86EMUL_CONTINUE; if (kvm_x86_ops-has_wbinvd_exit()) { - preempt_disable(); + int cpu = get_cpu(); + + cpumask_set_cpu(cpu, vcpu-arch.wbinvd_dirty_mask); smp_call_function_many(vcpu-arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1); - preempt_enable(); + put_cpu(); cpumask_clear(vcpu-arch.wbinvd_dirty_mask); - } - wbinvd(); + } else + wbinvd(); return X86EMUL_CONTINUE; } EXPORT_SYMBOL_GPL(kvm_emulate_wbinvd); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/7] KVM: fix tracing kvm_try_async_get_page
On Mon, Nov 01, 2010 at 04:58:43PM +0800, Xiao Guangrong wrote: Tracing 'async' and *pfn is useless, since 'async' is always true, and '*pfn' is always fault_pfn' We can trace 'gva' and 'gfn' instead, it can help us to see the life-cycle of an async_pf Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Acked-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/mmu.c |2 +- include/trace/events/kvm.h | 12 +++- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 39cc0c6..5275c50 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2627,7 +2627,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, put_page(pfn_to_page(*pfn)); if (!no_apf can_do_async_pf(vcpu)) { - trace_kvm_try_async_get_page(async, *pfn); + trace_kvm_try_async_get_page(gva, gfn); if (kvm_find_async_pf_gfn(vcpu, gfn)) { trace_kvm_async_pf_doublefault(gva, gfn); kvm_make_request(KVM_REQ_APF_HALT, vcpu); diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 9c2cc6a..30063c6 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -188,18 +188,20 @@ TRACE_EVENT(kvm_age_page, #ifdef CONFIG_KVM_ASYNC_PF TRACE_EVENT( kvm_try_async_get_page, - TP_PROTO(bool async, u64 pfn), - TP_ARGS(async, pfn), + TP_PROTO(u64 gva, u64 gfn), + TP_ARGS(gva, gfn), TP_STRUCT__entry( - __field(__u64, pfn) + __field(u64, gva) + __field(u64, gfn) ), TP_fast_assign( - __entry-pfn = (!async) ? pfn : (u64)-1; + __entry-gva = gva; + __entry-gfn = gfn; ), - TP_printk(pfn %#llx, __entry-pfn) + TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn) ); TRACE_EVENT( -- 1.7.0.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash on kvm_iommu_map_pages
The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). Joerg On Mon, Nov 01, 2010 at 08:57:42AM -0400, Jan Kiszka wrote: [ Forgot to CC LKML - maybe it's not KVM-specific. BTW, is anyone actually using current KVM device assigment on Intel? I'm starting to believe that can only very few lucky people... ] Am 01.11.2010 13:51, Jan Kiszka wrote: Hi again, OK, I swapped those two lines in intel_iommu_attach_device [1], fixed another warning in the wbinvd emulation, but now I'm about to give up. This is freaky MMU stuff: general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/pci:00/:00:1a.0/device CPU 1 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif dm_snapshot dm_mod fan processor ahci libahci ata_gene ri c liba ta scsi_mod thermal thermal_sys hwmon Nov 1 13:19:11 mchn199C kernel: Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700 RIP: 0010:[8121de8c] [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP: 0018:8800bd4bdb68 EFLAGS: 00010202 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009 RDX: 0180 RSI: 88012a940938 RDI: 0202 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004 R10: 0001 R11: R12: 880128dfee00 R13: 0002 R14: 000f R15: 0009 FS: 7f4990d33710() GS:8800be68() knlGS: CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: 01408000 CR3: bd7db000 CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 88012a94) Stack: 8800bd4bdb88 8800ac378000 00d2 000f 0 8800bd4bdce8 8800ac2fc000 0003 000f0001 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573 Call Trace: [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a [812a9573] iommu_iova_to_phys+0x13/0x15 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm] [8111404d] ? __vmalloc_node+0x86/0x9b [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm] [81081ee8] ? mark_held_locks+0x50/0x72 [8137c983] ? mutex_lock_nested+0x325/0x34d [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm] [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm] [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm] [811355aa] ? fget_light+0x17b/0x31f [81143bd7] do_vfs_ioctl+0x4c6/0x507 [81135732] ? fget_light+0x303/0x31f [811355aa] ? fget_light+0x17b/0x31f [8137ebb9] ? retint_swapgs+0x13/0x1b [81143c6e] sys_ioctl+0x56/0x7c [81002df2] system_call_fastpath+0x16/0x1b Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 RIP [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP 8800bd4bdb68 Can anyone parse this? Is it Intel-specific or a generic issue? The kernel is current kvm.git + unrelated patch + [1]. Jan [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923 -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message
Re: Crash on kvm_iommu_map_pages
Am 01.11.2010 14:21, Roedel, Joerg wrote: The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). In pfn_to_dma_pte, line 710: if (!dma_pte_present(pte)) { 8121de8c: f6 03 03testb $0x3,(%rbx) 8121de8f: 0f 85 d8 00 00 00 jne8121df6d pfn_to_dma_pte+0x154 The first instruction raises the fault. Jan Joerg On Mon, Nov 01, 2010 at 08:57:42AM -0400, Jan Kiszka wrote: [ Forgot to CC LKML - maybe it's not KVM-specific. BTW, is anyone actually using current KVM device assigment on Intel? I'm starting to believe that can only very few lucky people... ] Am 01.11.2010 13:51, Jan Kiszka wrote: Hi again, OK, I swapped those two lines in intel_iommu_attach_device [1], fixed another warning in the wbinvd emulation, but now I'm about to give up. This is freaky MMU stuff: general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/pci:00/:00:1a.0/device CPU 1 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif dm_snapshot dm_mod fan processor ahci libahci ata_ge ne ri c liba ta scsi_mod thermal thermal_sys hwmon Nov 1 13:19:11 mchn199C kernel: Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700 RIP: 0010:[8121de8c] [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP: 0018:8800bd4bdb68 EFLAGS: 00010202 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009 RDX: 0180 RSI: 88012a940938 RDI: 0202 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004 R10: 0001 R11: R12: 880128dfee00 R13: 0002 R14: 000f R15: 0009 FS: 7f4990d33710() GS:8800be68() knlGS: CS: 0010 DS: 002b ES: 002b CR0: 80050033 CR2: 01408000 CR3: bd7db000 CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 88012a94) Stack: 8800bd4bdb88 8800ac378000 00d2 000f 0 8800bd4bdce8 8800ac2fc000 0003 000f0001 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573 Call Trace: [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a [812a9573] iommu_iova_to_phys+0x13/0x15 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm] [8111404d] ? __vmalloc_node+0x86/0x9b [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm] [81081ee8] ? mark_held_locks+0x50/0x72 [8137c983] ? mutex_lock_nested+0x325/0x34d [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm] [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm] [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm] [811355aa] ? fget_light+0x17b/0x31f [81143bd7] do_vfs_ioctl+0x4c6/0x507 [81135732] ? fget_light+0x303/0x31f [811355aa] ? fget_light+0x17b/0x31f [8137ebb9] ? retint_swapgs+0x13/0x1b [81143c6e] sys_ioctl+0x56/0x7c [81002df2] system_call_fastpath+0x16/0x1b Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 RIP [8121de8c] pfn_to_dma_pte+0x73/0x190 RSP 8800bd4bdb68 Can anyone parse this? Is it Intel-specific or a generic issue? The kernel is current kvm.git + unrelated patch + [1]. Jan [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923 signature.asc Description: OpenPGP digital signature
[PATCHv2] KVM: handle exit due to INVD in VMX
Currently the exit is unhandled, so guest halts with error if it tries to execute INVD instruction. Call into emulator when INVD instruction is executed by a guest instead. This instruction is not needed by ordinary guests, but firmware (like OpenBIOS) use it and fail. Signed-off-by: Gleb Natapov g...@redhat.com --- ChangeLog: v1-v2 - New and improved patch description diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 9f0cbd9..42d9590 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -239,6 +239,7 @@ enum vmcs_field { #define EXIT_REASON_TASK_SWITCH 9 #define EXIT_REASON_CPUID 10 #define EXIT_REASON_HLT 12 +#define EXIT_REASON_INVD13 #define EXIT_REASON_INVLPG 14 #define EXIT_REASON_RDPMC 15 #define EXIT_REASON_RDTSC 16 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 993e332..200533e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3350,6 +3350,11 @@ static int handle_vmx_insn(struct kvm_vcpu *vcpu) return 1; } +static int handle_invd(struct kvm_vcpu *vcpu) +{ + return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; +} + static int handle_invlpg(struct kvm_vcpu *vcpu) { unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); @@ -3650,6 +3655,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { [EXIT_REASON_MSR_WRITE] = handle_wrmsr, [EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window, [EXIT_REASON_HLT] = handle_halt, + [EXIT_REASON_INVD]= handle_invd, [EXIT_REASON_INVLPG] = handle_invlpg, [EXIT_REASON_VMCALL] = handle_vmcall, [EXIT_REASON_VMCLEAR] = handle_vmx_insn, -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash on kvm_iommu_map_pages
On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote: Am 01.11.2010 14:21, Roedel, Joerg wrote: The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). In pfn_to_dma_pte, line 710: if (!dma_pte_present(pte)) { 8121de8c: f6 03 03testb $0x3,(%rbx) 8121de8f: 0f 85 d8 00 00 00 jne8121df6d pfn_to_dma_pte+0x154 The first instruction raises the fault. Ok, so it seems that my understanding of the Code: field in the crash-message was wrong :) Anyway, the testb uses rbx as an address which has a non-canonical value. This means the the address of 'pte' is invalid. Since rax also contains a wrong address the 'parent' variable probably already contains the wrong address. Does the attached patch help? diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h index 5619f85..ca46f24 100644 --- a/include/linux/dma_remapping.h +++ b/include/linux/dma_remapping.h @@ -6,7 +6,7 @@ */ #define VTD_PAGE_SHIFT (12) #define VTD_PAGE_SIZE (1UL VTD_PAGE_SHIFT) -#define VTD_PAGE_MASK (((u64)-1) VTD_PAGE_SHIFT) +#define VTD_PAGE_MASK u64)-1) VTD_PAGE_SHIFT) ((1ULL 52) - 1)) #define VTD_PAGE_ALIGN(addr) (((addr) + VTD_PAGE_SIZE - 1) VTD_PAGE_MASK) #define DMA_PTE_READ (1) -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] KVM: Clear assigned guest IRQ on release
From: Jan Kiszka jan.kis...@siemens.com When we deassign a guest IRQ, clear the potentially asserted guest line. There might be no chance for the guest to do this, specifically if we switch from INTx to MSI mode. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- virt/kvm/assigned-dev.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 5c1b56a..d3ddfea 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -129,6 +129,9 @@ static void deassign_guest_irq(struct kvm *kvm, kvm_unregister_irq_ack_notifier(kvm, assigned_dev-ack_notifier); assigned_dev-ack_notifier.gsi = -1; + kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id, + assigned_dev-guest_irq, 0); + if (assigned_dev-irq_source_id != -1) kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id); assigned_dev-irq_source_id = -1; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] KVM: Improve IRQ assignment for device passthrough
Three patches to improve classic device assigment /wrt IRQs. Highlight is the last one that resolves the host IRQ sharing issue for all PCI 2.3 devices. Quite essential when passing non-MSI-ready devices like many USB host controllers. Jan Kiszka (3): KVM: Fold assigned interrupt work into IRQ handler KVM: Clear assigned guest IRQ on release KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices include/linux/kvm_host.h |2 +- virt/kvm/assigned-dev.c | 215 ++ 2 files changed, 160 insertions(+), 57 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; + bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ + u16 orig, new; + bool supported = false; + + pci_block_user_cfg_access(pdev); + pci_read_config_word(pdev, PCI_COMMAND, orig); + pci_write_config_word(pdev, PCI_COMMAND, + orig ^ PCI_COMMAND_INTX_DISABLE); + pci_read_config_word(pdev, PCI_COMMAND, new); + + /* +* There's no way to protect against +* hardware bugs or detect them reliably, but as long as we know +* what the value should be, let's go ahead and check it. +*/ + if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { + dev_err(pdev-dev, Command changed from 0x%x to 0x%x: + driver or HW bug?\n, orig, new); + goto out; + } + if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { + dev_warn(pdev-dev, Device does not support +disabling interrupts: unable to bind.\n); + goto out; + } + supported = true; + + /* Now restore the original value. */ + pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: + pci_unblock_user_cfg_access(pdev); + return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ + u32 cmd_status_dword; + u16 origcmd, newcmd; + + /* +* We do a single dword read to retrieve both command and status. +* Document assumptions that make this possible. +*/ + BUILD_BUG_ON(PCI_COMMAND % 4); + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + + pci_block_user_cfg_access(dev); + + /* +* Read both command and status registers in a single 32-bit operation. +* Note: we could cache the value for command and move the status read +* out of the lock if there was a way to get notified of user changes +* to command register through sysfs. Should be good for shared irqs. +*/ + pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); + origcmd = cmd_status_dword; + + if (irq_status) { + /* + * Check interrupt status register to see whether our device triggered + * the interrupt. + */ + *irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; + if (*irq_status == 0) + goto done; + } + + newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; + if (mask) + newcmd |= PCI_COMMAND_INTX_DISABLE; + if (newcmd != origcmd) + pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: + pci_unblock_user_cfg_access(dev); +} + static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *) dev_id; + int ret = IRQ_HANDLED; unsigned long flags; spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) guest_entries[i].vector, 1); } } else { - kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id, - assigned_dev-guest_irq, 1); - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { - disable_irq_nosync(irq); + if (assigned_dev-pci_2_3) { + unsigned int irq_status; + +
[PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler
From: Jan Kiszka jan.kis...@siemens.com The complete work handler runs with assigned_dev_lock acquired and interrupts disabled, so there is nothing to gain pushing this work out of the actually IRQ handler. Fold them together. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 - virt/kvm/assigned-dev.c | 71 +++--- 2 files changed, 23 insertions(+), 49 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 462b982..df5497f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -465,7 +465,6 @@ struct kvm_guest_msix_entry { struct kvm_assigned_dev_kernel { struct kvm_irq_ack_notifier ack_notifier; - struct work_struct interrupt_work; struct list_head list; int assigned_dev_id; int host_segnr; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 7c98928..5c1b56a 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } -static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) +static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { - struct kvm_assigned_dev_kernel *assigned_dev; - int i; - - assigned_dev = container_of(work, struct kvm_assigned_dev_kernel, - interrupt_work); + struct kvm_assigned_dev_kernel *assigned_dev = + (struct kvm_assigned_dev_kernel *) dev_id; + unsigned long flags; - spin_lock_irq(assigned_dev-assigned_dev_lock); + spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); if (assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX) { struct kvm_guest_msix_entry *guest_entries = assigned_dev-guest_msix_entries; + int index = find_index_from_host_irq(assigned_dev, irq); + int i; + + if (index 0) + goto out; + + guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING; + for (i = 0; i assigned_dev-entries_nr; i++) { if (!(guest_entries[i].flags KVM_ASSIGNED_MSIX_PENDING)) @@ -76,33 +82,15 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) assigned_dev-irq_source_id, guest_entries[i].vector, 1); } - } else + } else { kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id, assigned_dev-guest_irq, 1); - spin_unlock_irq(assigned_dev-assigned_dev_lock); -} - -static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) -{ - unsigned long flags; - struct kvm_assigned_dev_kernel *assigned_dev = - (struct kvm_assigned_dev_kernel *) dev_id; - - spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX) { - int index = find_index_from_host_irq(assigned_dev, irq); - if (index 0) - goto out; - assigned_dev-guest_msix_entries[index].flags |= - KVM_ASSIGNED_MSIX_PENDING; - } - - schedule_work(assigned_dev-interrupt_work); - - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { - disable_irq_nosync(irq); - assigned_dev-host_irq_disabled = true; + if (assigned_dev-irq_requested_type + KVM_DEV_IRQ_GUEST_INTX) { + disable_irq_nosync(irq); + assigned_dev-host_irq_disabled = true; + } } out: @@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm, assigned_dev-irq_requested_type = ~(KVM_DEV_IRQ_GUEST_MASK); } -/* The function implicit hold kvm-lock mutex due to cancel_work_sync() */ static void deassign_host_irq(struct kvm *kvm, struct kvm_assigned_dev_kernel *assigned_dev) { /* -* In kvm_free_device_irq, cancel_work_sync return true if: -* 1. work is scheduled, and then cancelled. -* 2. work callback is executed. -* -* The first one ensured that the irq is disabled and no more events -* would happen. But for the second one, the irq may be enabled (e.g. -* for MSI). So we disable irq here to prevent further events. +* We disable irq here to prevent further events. * * Notice this maybe result in nested disable if the interrupt type is * INTx, but it's OK for we are going to free it. * * If this function is a part of VM destroy, please ensure that till * now, the
Re: [PATCH 0/4] VFIO V5: Non-privileged user level PCI drivers
Is this the complete patch set? Or maybe it assumes a pre-patch kernel tree. If so where can I get it? thanks, -Etienne -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash on kvm_iommu_map_pages
Am 01.11.2010 14:53, Roedel, Joerg wrote: On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote: Am 01.11.2010 14:21, Roedel, Joerg wrote: The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). In pfn_to_dma_pte, line 710: if (!dma_pte_present(pte)) { 8121de8c: f6 03 03testb $0x3,(%rbx) 8121de8f: 0f 85 d8 00 00 00 jne8121df6d pfn_to_dma_pte+0x154 The first instruction raises the fault. Ok, so it seems that my understanding of the Code: field in the crash-message was wrong :) Anyway, the testb uses rbx as an address which has a non-canonical value. This means the the address of 'pte' is invalid. Since rax also contains a wrong address the 'parent' variable probably already contains the wrong address. Does the attached patch help? diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h index 5619f85..ca46f24 100644 --- a/include/linux/dma_remapping.h +++ b/include/linux/dma_remapping.h @@ -6,7 +6,7 @@ */ #define VTD_PAGE_SHIFT (12) #define VTD_PAGE_SIZE(1UL VTD_PAGE_SHIFT) -#define VTD_PAGE_MASK(((u64)-1) VTD_PAGE_SHIFT) +#define VTD_PAGE_MASKu64)-1) VTD_PAGE_SHIFT) ((1ULL 52) - 1)) #define VTD_PAGE_ALIGN(addr) (((addr) + VTD_PAGE_SIZE - 1) VTD_PAGE_MASK) #define DMA_PTE_READ (1) Crashes during early boot while initializing dmar. If you need the trace, I could set up some debug console. Jan signature.asc Description: OpenPGP digital signature
Re: Crash on kvm_iommu_map_pages
On Mon, Nov 01, 2010 at 03:22:15PM +0100, Jan Kiszka wrote: Am 01.11.2010 14:53, Roedel, Joerg wrote: On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote: Am 01.11.2010 14:21, Roedel, Joerg wrote: The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). In pfn_to_dma_pte, line 710: if (!dma_pte_present(pte)) { 8121de8c: f6 03 03testb $0x3,(%rbx) 8121de8f: 0f 85 d8 00 00 00 jne8121df6d pfn_to_dma_pte+0x154 The first instruction raises the fault. Ok, so it seems that my understanding of the Code: field in the crash-message was wrong :) Anyway, the testb uses rbx as an address which has a non-canonical value. This means the the address of 'pte' is invalid. Since rax also contains a wrong address the 'parent' variable probably already contains the wrong address. Does the attached patch help? diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h index 5619f85..ca46f24 100644 --- a/include/linux/dma_remapping.h +++ b/include/linux/dma_remapping.h @@ -6,7 +6,7 @@ */ #define VTD_PAGE_SHIFT (12) #define VTD_PAGE_SIZE (1UL VTD_PAGE_SHIFT) -#define VTD_PAGE_MASK (((u64)-1) VTD_PAGE_SHIFT) +#define VTD_PAGE_MASK u64)-1) VTD_PAGE_SHIFT) ((1ULL 52) - 1)) #define VTD_PAGE_ALIGN(addr) (((addr) + VTD_PAGE_SIZE - 1) VTD_PAGE_MASK) #define DMA_PTE_READ (1) Crashes during early boot while initializing dmar. If you need the trace, I could set up some debug console. Hmm, no. This was only a guess. The VTD_PAGE_MASK does not mask out the bits 52-63 of the pte. According to the VT-d spec it is allowed to set these bits, some are marked as AVL and some have special meanings. If a pte has one of these bits set the phys_addr calculated will be wrong and the virt_addr calculated from it too (probably non-canonical, leading to the GPF). Probably masking out these bits in dma_pte_addr helps. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] Minimal RAM API support
v2: - Move to Makefile.objs - Move structures to memory.c and create a callback function - Fix memory leak I haven't moved to the state parameter because there should only be a single instance of this per VM. The state parameter seems like it would add complications in setup and function calling, but maybe point me to an example if I'm off base. Thanks, Alex v1: For VFIO based device assignment, we need to know what guest memory areas are actual RAM. RAMBlocks have long since become a grab bag of misc allocations, so aren't effective for this. Anthony has had a RAM API in mind for a while now that addresses this problem. This implements just enough of it so that we have an interface to get actual guest memory physical addresses to setup the host IOMMU. We can continue building a full RAM API on top of this stub. Anthony, feel free to add copyright to memory.c as it's based on your initial implementation. I had to add something since the file in your branch just copies a header with Frabrice's copywrite. Thanks, Alex --- Alex Williamson (2): RAM API: Make use of it for x86 PC Minimal RAM API support Makefile.objs |1 + cpu-common.h |2 + hw/pc.c | 12 +++--- memory.c | 109 + memory.h | 18 + 5 files changed, 136 insertions(+), 6 deletions(-) create mode 100644 memory.c create mode 100644 memory.h -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] Minimal RAM API support
This adds a minimum chunk of Anthony's RAM API support so that we can identify actual VM RAM versus all the other things that make use of qemu_ram_alloc. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Makefile.objs |1 + cpu-common.h |2 + memory.c | 109 + memory.h | 18 + 4 files changed, 130 insertions(+), 0 deletions(-) create mode 100644 memory.c create mode 100644 memory.h diff --git a/Makefile.objs b/Makefile.objs index f07fb01..33fae0b 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -154,6 +154,7 @@ hw-obj-y += vl.o loader.o hw-obj-y += virtio.o virtio-console.o hw-obj-y += fw_cfg.o pci.o pci_host.o pcie_host.o hw-obj-y += watchdog.o +hw-obj-y += memory.o hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o hw-obj-$(CONFIG_ECC) += ecc.o hw-obj-$(CONFIG_NAND) += nand.o diff --git a/cpu-common.h b/cpu-common.h index a543b5d..6aa2738 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -23,6 +23,8 @@ /* address in the RAM (different from a physical address) */ typedef unsigned long ram_addr_t; +#include memory.h + /* memory API */ typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value); diff --git a/memory.c b/memory.c new file mode 100644 index 000..2895082 --- /dev/null +++ b/memory.c @@ -0,0 +1,109 @@ +/* + * RAM API + * + * Copyright Red Hat, Inc. 2010 + * + * Authors: + * Alex Williamson alex.william...@redhat.com + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, see http://www.gnu.org/licenses/. + */ +#include memory.h +#include range.h + +typedef struct QemuRamSlot { +target_phys_addr_t start_addr; +ram_addr_t size; +ram_addr_t offset; +QLIST_ENTRY(QemuRamSlot) next; +} QemuRamSlot; + +typedef struct QemuRamSlots { +QLIST_HEAD(slots, QemuRamSlot) slots; +} QemuRamSlots; + +static QemuRamSlots ram_slots = { .slots = QLIST_HEAD_INITIALIZER(ram_slots) }; + +static QemuRamSlot *qemu_ram_find_slot(target_phys_addr_t start_addr, + ram_addr_t size) +{ +QemuRamSlot *slot; + +QLIST_FOREACH(slot, ram_slots.slots, next) { +if (slot-start_addr == start_addr slot-size == size) { +return slot; +} + +if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) { +abort(); +} +} + +return NULL; +} + +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset) +{ +QemuRamSlot *slot; + +if (!size) { +return -EINVAL; +} + +assert(!qemu_ram_find_slot(start_addr, size)); + +slot = qemu_mallocz(sizeof(QemuRamSlot)); + +slot-start_addr = start_addr; +slot-size = size; +slot-offset = phys_offset; + +QLIST_INSERT_HEAD(ram_slots.slots, slot, next); + +cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset); + +return 0; +} + +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size) +{ +QemuRamSlot *slot; + +if (!size) { +return; +} + +slot = qemu_ram_find_slot(start_addr, size); +assert(slot != NULL); + +QLIST_REMOVE(slot, next); +qemu_free(slot); +cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED); + +return; +} + +int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn) +{ +QemuRamSlot *slot; + +QLIST_FOREACH(slot, ram_slots.slots, next) { +int ret = fn(opaque, slot-start_addr, slot-size, slot-offset); +if (ret) { +return ret; +} +} +return 0; +} diff --git a/memory.h b/memory.h new file mode 100644 index 000..0c17ff9 --- /dev/null +++ b/memory.h @@ -0,0 +1,18 @@ +#ifndef QEMU_MEMORY_H +#define QEMU_MEMORY_H + +#include qemu-common.h +#include cpu-common.h + +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset); + +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size); + +typedef int (*qemu_ram_for_each_slot_fn)(void *opaque, + target_phys_addr_t start_addr, + ram_addr_t size, + ram_addr_t phys_offset); + +int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn); +#endif -- To
[PATCH v2 2/2] RAM API: Make use of it for x86 PC
Register the actual VM RAM using the new API Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pc.c | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 69b13bf..0ea6d10 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -912,14 +912,14 @@ void pc_memory_init(ram_addr_t ram_size, /* allocate RAM */ ram_addr = qemu_ram_alloc(NULL, pc.ram, below_4g_mem_size + above_4g_mem_size); -cpu_register_physical_memory(0, 0xa, ram_addr); -cpu_register_physical_memory(0x10, - below_4g_mem_size - 0x10, - ram_addr + 0x10); + +qemu_ram_register(0, 0xa, ram_addr); +qemu_ram_register(0x10, below_4g_mem_size - 0x10, + ram_addr + 0x10); #if TARGET_PHYS_ADDR_BITS 32 if (above_4g_mem_size 0) { -cpu_register_physical_memory(0x1ULL, above_4g_mem_size, - ram_addr + below_4g_mem_size); +qemu_ram_register(0x1ULL, above_4g_mem_size, + ram_addr + below_4g_mem_size); } #endif -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; + bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ + u16 orig, new; + bool supported = false; + + pci_block_user_cfg_access(pdev); + pci_read_config_word(pdev, PCI_COMMAND, orig); + pci_write_config_word(pdev, PCI_COMMAND, + orig ^ PCI_COMMAND_INTX_DISABLE); + pci_read_config_word(pdev, PCI_COMMAND, new); + + /* + * There's no way to protect against + * hardware bugs or detect them reliably, but as long as we know + * what the value should be, let's go ahead and check it. + */ + if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { + dev_err(pdev-dev, Command changed from 0x%x to 0x%x: + driver or HW bug?\n, orig, new); + goto out; + } + if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { + dev_warn(pdev-dev, Device does not support + disabling interrupts: unable to bind.\n); + goto out; + } + supported = true; + + /* Now restore the original value. */ + pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: + pci_unblock_user_cfg_access(pdev); + return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ + u32 cmd_status_dword; + u16 origcmd, newcmd; + + /* + * We do a single dword read to retrieve both command and status. + * Document assumptions that make this possible. + */ + BUILD_BUG_ON(PCI_COMMAND % 4); + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + + pci_block_user_cfg_access(dev); + + /* + * Read both command and status registers in a single 32-bit operation. + * Note: we could cache the value for command and move the status read + * out of the lock if there was a way to get notified of user changes + * to command register through sysfs. Should be good for shared irqs. + */ + pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); + origcmd = cmd_status_dword; + + if (irq_status) { + /* + * Check interrupt status register to see whether our device triggered + * the interrupt. + */ + *irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; + if (*irq_status == 0) + goto done; + } + + newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; + if (mask) + newcmd |= PCI_COMMAND_INTX_DISABLE; + if (newcmd != origcmd) + pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: + pci_unblock_user_cfg_access(dev); +} + Let's return irq_status instead of returning through a pointer? Will save a branch and generally make the code a bit cleaner. static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *) dev_id; + int ret = IRQ_HANDLED; unsigned long flags; spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) guest_entries[i].vector, 1); } } else { - kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id, - assigned_dev-guest_irq, 1); - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { -
Re: Crash on kvm_iommu_map_pages
Am 01.11.2010 15:35, Joerg Roedel wrote: On Mon, Nov 01, 2010 at 03:22:15PM +0100, Jan Kiszka wrote: Am 01.11.2010 14:53, Roedel, Joerg wrote: On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote: Am 01.11.2010 14:21, Roedel, Joerg wrote: The registers rax and rbx contain non-canonical addresses (if interpreted as pointers). The instruction where this happens is a mov so I guess that the #GP is because of an non-canonical address. Can you find out the code-line where this happens and the exact assembler instruction? (haven't managed to decode the registers used). In pfn_to_dma_pte, line 710: if (!dma_pte_present(pte)) { 8121de8c: f6 03 03testb $0x3,(%rbx) 8121de8f: 0f 85 d8 00 00 00 jne8121df6d pfn_to_dma_pte+0x154 The first instruction raises the fault. Ok, so it seems that my understanding of the Code: field in the crash-message was wrong :) Anyway, the testb uses rbx as an address which has a non-canonical value. This means the the address of 'pte' is invalid. Since rax also contains a wrong address the 'parent' variable probably already contains the wrong address. Does the attached patch help? diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h index 5619f85..ca46f24 100644 --- a/include/linux/dma_remapping.h +++ b/include/linux/dma_remapping.h @@ -6,7 +6,7 @@ */ #define VTD_PAGE_SHIFT (12) #define VTD_PAGE_SIZE (1UL VTD_PAGE_SHIFT) -#define VTD_PAGE_MASK (((u64)-1) VTD_PAGE_SHIFT) +#define VTD_PAGE_MASK u64)-1) VTD_PAGE_SHIFT) ((1ULL 52) - 1)) #define VTD_PAGE_ALIGN(addr) (((addr) + VTD_PAGE_SIZE - 1) VTD_PAGE_MASK) #define DMA_PTE_READ (1) Crashes during early boot while initializing dmar. If you need the trace, I could set up some debug console. Hmm, no. This was only a guess. The VTD_PAGE_MASK does not mask out the bits 52-63 of the pte. According to the VT-d spec it is allowed to set these bits, some are marked as AVL and some have special meanings. If a pte has one of these bits set the phys_addr calculated will be wrong and the virt_addr calculated from it too (probably non-canonical, leading to the GPF). Probably masking out these bits in dma_pte_addr helps. Nope. But I just noticed a fatal thinko in my fix to intel_iommu_attach_device - probably that was the key. Need to boot the test kernel... Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
Am 01.11.2010 16:24, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; +bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ +u16 orig, new; +bool supported = false; + +pci_block_user_cfg_access(pdev); +pci_read_config_word(pdev, PCI_COMMAND, orig); +pci_write_config_word(pdev, PCI_COMMAND, + orig ^ PCI_COMMAND_INTX_DISABLE); +pci_read_config_word(pdev, PCI_COMMAND, new); + +/* + * There's no way to protect against + * hardware bugs or detect them reliably, but as long as we know + * what the value should be, let's go ahead and check it. + */ +if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { +dev_err(pdev-dev, Command changed from 0x%x to 0x%x: +driver or HW bug?\n, orig, new); +goto out; +} +if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { +dev_warn(pdev-dev, Device does not support + disabling interrupts: unable to bind.\n); +goto out; +} +supported = true; + +/* Now restore the original value. */ +pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: +pci_unblock_user_cfg_access(pdev); +return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ +u32 cmd_status_dword; +u16 origcmd, newcmd; + +/* + * We do a single dword read to retrieve both command and status. + * Document assumptions that make this possible. + */ +BUILD_BUG_ON(PCI_COMMAND % 4); +BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + +pci_block_user_cfg_access(dev); + +/* + * Read both command and status registers in a single 32-bit operation. + * Note: we could cache the value for command and move the status read + * out of the lock if there was a way to get notified of user changes + * to command register through sysfs. Should be good for shared irqs. + */ +pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); +origcmd = cmd_status_dword; + +if (irq_status) { +/* +* Check interrupt status register to see whether our device triggered +* the interrupt. +*/ +*irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; +if (*irq_status == 0) +goto done; +} + +newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; +if (mask) +newcmd |= PCI_COMMAND_INTX_DISABLE; +if (newcmd != origcmd) +pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: +pci_unblock_user_cfg_access(dev); +} + Let's return irq_status instead of returning through a pointer? Will save a branch and generally make the code a bit cleaner. I'm open for a better API suggestion, but the current one goes like this: if irq_status is non-null, only mask the IRQ if the status bit indicates an interrupt. But we also have a user that wants to mask unconditionally. static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *) dev_id; +int ret = IRQ_HANDLED; unsigned long flags; spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) guest_entries[i].vector, 1); } } else { -kvm_set_irq(assigned_dev-kvm,
Re: [PATCH 0/3] Launch other test during migration
On 09/25/2010 11:36 AM, Jason Wang wrote: We could give a further test of migration by launch test during migartion. So the following series implements: - A simple class to run a specified test in the background which could be used to launch other test during migartion. Its design is rather simple and its usage is a little tricky, but it work well. - Two sample tests which take advantages of the background class: Test reboot during guest migration and test file_transfer during guest migration. In the future, we could even lauch autotest client test during guest migation. --- Jason Wang (3): KVM Test: Introduce a helper class to run a test in the background KVM test: Test reboot during migration KVM test: Test the file transfer during migartion client/tests/kvm/kvm_test_utils.py | 44 +++ .../kvm/tests/migration_with_file_transfer.py | 59 client/tests/kvm/tests/migration_with_reboot.py| 45 +++ client/tests/kvm/tests_base.cfg.sample | 12 4 files changed, 159 insertions(+), 1 deletions(-) create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py create mode 100644 client/tests/kvm/tests/migration_with_reboot.py It seems to me that this method is only applicable to tests/functions that don't require a VM object (i.e. that require only a shell session object). kvm_test_utils.reboot() operates on a VM object, and the same VM is destroyed by migrate() which runs in the background, so eventually reboot() tries logging into a destroyed VM, which fails because vm.get_address() fails. Any monitor operation will also fail. If the autotest wrapper requires a VM object (currently it does) then it can't be used either. An alternative (somewhat ugly) way to migrate in the background is to pass a boolean 'migrate' flag to various functions/tests, such as reboot() and run_autotest(). If 'migrate' is True, these functions will do something like vm = kvm_test_utils.migrate(vm, ...) in their waiting loops, where wait_for() is normally used. This guarantees that 'vm' is always a valid VM object. For example: # Log in after reboot while time.time() end_time: if migrate_in_bg: vm = kvm_test_utils.migrate(vm, ...) session = vm.remote_login() if session: break time.sleep(2) This is much longer than the usual wait_for(...) but it does the job. What do you think? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3] Add support for async page fault to qemu
On Sun, Oct 24, 2010 at 02:27:55PM +0200, Gleb Natapov wrote: Add save/restore of MSR for migration and cpuid bit. Signed-off-by: Gleb Natapov g...@redhat.com -- v1-v2 - use vmstate subsection to migrate new msr. v2-v3 - rebase onto uq/master - protect use of MSR_KVM_ASYNC_PF_EN with ifdef KVM_CAP_ASYNC_PF Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote: Am 01.11.2010 16:24, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; + bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ + u16 orig, new; + bool supported = false; + + pci_block_user_cfg_access(pdev); + pci_read_config_word(pdev, PCI_COMMAND, orig); + pci_write_config_word(pdev, PCI_COMMAND, +orig ^ PCI_COMMAND_INTX_DISABLE); + pci_read_config_word(pdev, PCI_COMMAND, new); + + /* + * There's no way to protect against + * hardware bugs or detect them reliably, but as long as we know + * what the value should be, let's go ahead and check it. + */ + if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { + dev_err(pdev-dev, Command changed from 0x%x to 0x%x: + driver or HW bug?\n, orig, new); + goto out; + } + if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { + dev_warn(pdev-dev, Device does not support + disabling interrupts: unable to bind.\n); + goto out; + } + supported = true; + + /* Now restore the original value. */ + pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: + pci_unblock_user_cfg_access(pdev); + return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ + u32 cmd_status_dword; + u16 origcmd, newcmd; + + /* + * We do a single dword read to retrieve both command and status. + * Document assumptions that make this possible. + */ + BUILD_BUG_ON(PCI_COMMAND % 4); + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + + pci_block_user_cfg_access(dev); + + /* + * Read both command and status registers in a single 32-bit operation. + * Note: we could cache the value for command and move the status read + * out of the lock if there was a way to get notified of user changes + * to command register through sysfs. Should be good for shared irqs. + */ + pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); + origcmd = cmd_status_dword; + + if (irq_status) { + /* + * Check interrupt status register to see whether our device triggered + * the interrupt. + */ + *irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; + if (*irq_status == 0) + goto done; + } + + newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; + if (mask) + newcmd |= PCI_COMMAND_INTX_DISABLE; + if (newcmd != origcmd) + pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: + pci_unblock_user_cfg_access(dev); +} + Let's return irq_status instead of returning through a pointer? Will save a branch and generally make the code a bit cleaner. I'm open for a better API suggestion, Maybe use separate functions for this. pci_2_3_mask_irq pci_2_3_unmask_irq pci_2_3_disable_irq Common code can go into subfunctions. but the current one goes like this: if irq_status is non-null, only mask the IRQ if the status bit indicates an interrupt. But we also have a user that wants to mask unconditionally. Why do you ever want to do that? static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *) dev_id; + int ret = IRQ_HANDLED; unsigned long flags; spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); @@ -83,19
Re: [PATCH 1/2] msix: Allow msix_init on a device with existing MSI-X capability
On Thu, 2010-10-28 at 17:00 +0200, Avi Kivity wrote: On 10/23/2010 06:55 PM, Alex Williamson wrote: On Sat, 2010-10-23 at 18:18 +0200, Michael S. Tsirkin wrote: On Fri, Oct 22, 2010 at 02:40:31PM -0600, Alex Williamson wrote: To enable common msix support to be used with pass through devices, don't attempt to change the BAR if the device already has an MSI-X capability. This also means we want to pay closer attention to the size when we map the msix table page, as it isn't necessarily covering the entire end of the BAR. Signed-off-by: Alex Williamsonalex.william...@redhat.com --- hw/msix.c | 67 +++-- 1 files changed, 38 insertions(+), 29 deletions(-) diff --git a/hw/msix.c b/hw/msix.c index 43efbd2..4122395 100644 --- a/hw/msix.c +++ b/hw/msix.c @@ -167,35 +167,43 @@ static int msix_add_config(struct PCIDevice *pdev, unsigned short nentries, { int config_offset; uint8_t *config; -uint32_t new_size; -if (nentries 1 || nentries PCI_MSIX_FLAGS_QSIZE + 1) -return -EINVAL; -if (bar_size 0x8000) -return -ENOSPC; - -/* Add space for MSI-X structures */ -if (!bar_size) { -new_size = MSIX_PAGE_SIZE; -} else if (bar_size MSIX_PAGE_SIZE) { -bar_size = MSIX_PAGE_SIZE; -new_size = MSIX_PAGE_SIZE * 2; -} else { -new_size = bar_size * 2; -} - -pdev-msix_bar_size = new_size; -config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, MSIX_CAP_LENGTH); -if (config_offset 0) -return config_offset; -config = pdev-config + config_offset; - -pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1); -/* Table on top of BAR */ -pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr); -/* Pending bits on top of that */ -pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) | - bar_nr); +pdev-msix_bar_size = bar_size; + +config_offset = pci_find_capability(pdev, PCI_CAP_ID_MSIX); + +if (!config_offset) { +uint32_t new_size; + +if (nentries 1 || nentries PCI_MSIX_FLAGS_QSIZE + 1) +return -EINVAL; +if (bar_size 0x8000) +return -ENOSPC; + +/* Add space for MSI-X structures */ +if (!bar_size) { +new_size = MSIX_PAGE_SIZE; +} else if (bar_size MSIX_PAGE_SIZE) { +bar_size = MSIX_PAGE_SIZE; +new_size = MSIX_PAGE_SIZE * 2; +} else { +new_size = bar_size * 2; +} + +pdev-msix_bar_size = new_size; +config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, + MSIX_CAP_LENGTH); +if (config_offset 0) +return config_offset; +config = pdev-config + config_offset; + +pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1); +/* Table on top of BAR */ +pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr); +/* Pending bits on top of that */ +pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + MSIX_PAGE_PENDING) | + bar_nr); +} pdev-msix_cap = config_offset; /* Make flags bit writeable. */ pdev-wmask[config_offset + MSIX_CONTROL_OFFSET] |= MSIX_ENABLE_MASK | @@ -337,7 +345,8 @@ void msix_mmio_map(PCIDevice *d, int region_num, return; if (size= offset) return; -cpu_register_physical_memory(addr + offset, size - offset, +cpu_register_physical_memory(addr + offset, + MIN(size - offset, MSIX_PAGE_SIZE), This is wrong I think, the table might not fit in a single page. You would need to read table size out of from device config. That's true, but I was hoping to save that for later since we don't seem to be running into that problem yet. Current device assignment code assumes a single page, and I haven't heard of anyone with a vector table that exceeds that yet. Thanks, Ok; applied. Please add some warning if the condition happens so the breakage is at least not silent. Looks like this is already handled before we even get here. msix_init() checks nentries against the max supported in a single page and returns -EINVAL if over. So I think we're safely covered until we add support for more pages. Thanks, Alex -- To unsubscribe
Re: [PATCH 3/3] KVM test: Test the file transfer during migartion
On 09/25/2010 11:36 AM, Jason Wang wrote: This test just do the file transfer from host to guest during migartion in order to check whether the nic/block state could be saved and loaded correctly. Signed-off-by: Jason Wang jasow...@redhat.com --- .../kvm/tests/migration_with_file_transfer.py | 59 client/tests/kvm/tests_base.cfg.sample |7 ++ 2 files changed, 66 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py diff --git a/client/tests/kvm/tests/migration_with_file_transfer.py b/client/tests/kvm/tests/migration_with_file_transfer.py new file mode 100644 index 000..8a316bf --- /dev/null +++ b/client/tests/kvm/tests/migration_with_file_transfer.py @@ -0,0 +1,59 @@ +import logging, time +from autotest_lib.client.common_lib import utils, error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_migration_with_file_transfer(test, params, env): + +KVM migration test: +1) Get a live VM and clone it. +2) Verify that the source VM supports migration. If it does, proceed with +the test. +3) Reboot the VM +4) Send a migration command to the source VM and wait until it's finished. +5) Kill off the source VM. +6) Log into the destination VM after the migration is finished. + +@param test: kvm test object. +@param params: Dictionary with test parameters. +@param env: Dictionary with the test environment. + + +def transfer_test(vm, host_path, guest_path, timeout=120): + +vm.copy_files_to does not raise exception, so we need a wrapper +in order to make it to be used by BackgroundTest. + +if not vm.copy_files_to(host_path, guest_path, timeout=timeout): +raise error.TestError(Fail to do the file transfer!) + +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +timeout = int(params.get(login_timeout, 360)) +session = kvm_test_utils.wait_for_login(vm, timeout=timeout) + +mig_timeout = float(params.get(mig_timeout, 3600)) +mig_protocol = params.get(migration_protocol, tcp) + +guest_path = params.get(guest_path, /tmp) +file_size = params.get(file_size, 1000) +transfer_timeout = int(params.get(transfer_timeout, 240)) +bg = None + +try: +utils.run(dd if=/dev/zero of=/tmp/file bs=1M count=%s % file_size) Wouldn't it be useful to read from /dev/urandom instead of /dev/zero, and compare the md5 hash of the file generated on the host with the md5 hash of the file that arrives in the guest? + +# Transfer file from host to guest +bg = kvm_test_utils.BackgroundTest(transfer_test, + (vm, /tmp/file, guest_path, +transfer_timeout)) +bg.start() + +while bg.is_alive(): +logging.info(File transfer is not ended, start a round of migration ...) +# Migrate the VM +dest_vm = kvm_test_utils.migrate(vm, env, mig_timeout, mig_protocol, False) +vm = dest_vm +finally: +if bg: bg.join() +session.close() +utils.run(rm -rf /tmp/zero) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
Am 01.11.2010 16:52, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote: Am 01.11.2010 16:24, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; + bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ + u16 orig, new; + bool supported = false; + + pci_block_user_cfg_access(pdev); + pci_read_config_word(pdev, PCI_COMMAND, orig); + pci_write_config_word(pdev, PCI_COMMAND, +orig ^ PCI_COMMAND_INTX_DISABLE); + pci_read_config_word(pdev, PCI_COMMAND, new); + + /* + * There's no way to protect against + * hardware bugs or detect them reliably, but as long as we know + * what the value should be, let's go ahead and check it. + */ + if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { + dev_err(pdev-dev, Command changed from 0x%x to 0x%x: + driver or HW bug?\n, orig, new); + goto out; + } + if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { + dev_warn(pdev-dev, Device does not support + disabling interrupts: unable to bind.\n); + goto out; + } + supported = true; + + /* Now restore the original value. */ + pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: + pci_unblock_user_cfg_access(pdev); + return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ + u32 cmd_status_dword; + u16 origcmd, newcmd; + + /* + * We do a single dword read to retrieve both command and status. + * Document assumptions that make this possible. + */ + BUILD_BUG_ON(PCI_COMMAND % 4); + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + + pci_block_user_cfg_access(dev); + + /* + * Read both command and status registers in a single 32-bit operation. + * Note: we could cache the value for command and move the status read + * out of the lock if there was a way to get notified of user changes + * to command register through sysfs. Should be good for shared irqs. + */ + pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); + origcmd = cmd_status_dword; + + if (irq_status) { + /* + * Check interrupt status register to see whether our device triggered + * the interrupt. + */ + *irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; + if (*irq_status == 0) + goto done; + } + + newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; + if (mask) + newcmd |= PCI_COMMAND_INTX_DISABLE; + if (newcmd != origcmd) + pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: + pci_unblock_user_cfg_access(dev); +} + Let's return irq_status instead of returning through a pointer? Will save a branch and generally make the code a bit cleaner. I'm open for a better API suggestion, Maybe use separate functions for this. pci_2_3_mask_irq pci_2_3_unmask_irq pci_2_3_disable_irq Common code can go into subfunctions. but the current one goes like this: if irq_status is non-null, only mask the IRQ if the status bit indicates an interrupt. But we also have a user that wants to mask unconditionally. Why do you ever want to do that? During device shutdown (was disable_irq in the unshared case so far). The alternative would be to reset the device first, clearing any potentially pending events. If we can reorder the reset, that is likely better. static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *)
Re: [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler
Am 01.11.2010 15:08, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com The complete work handler runs with assigned_dev_lock acquired and interrupts disabled, so there is nothing to gain pushing this work out of the actually IRQ handler. Fold them together. Err, forget it. kvm_set_irq pulls in the famous pic lock, and that one is not prepared to be called from IRQ context (lockdep just complained). I will try to clean this up via a threaded IRQ handler, maybe using the slow-path only for INTx-type IRQs and pushing MSIs directly from the hard handler. Jan Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 - virt/kvm/assigned-dev.c | 71 +++--- 2 files changed, 23 insertions(+), 49 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 462b982..df5497f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -465,7 +465,6 @@ struct kvm_guest_msix_entry { struct kvm_assigned_dev_kernel { struct kvm_irq_ack_notifier ack_notifier; - struct work_struct interrupt_work; struct list_head list; int assigned_dev_id; int host_segnr; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 7c98928..5c1b56a 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } -static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) +static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { - struct kvm_assigned_dev_kernel *assigned_dev; - int i; - - assigned_dev = container_of(work, struct kvm_assigned_dev_kernel, - interrupt_work); + struct kvm_assigned_dev_kernel *assigned_dev = + (struct kvm_assigned_dev_kernel *) dev_id; + unsigned long flags; - spin_lock_irq(assigned_dev-assigned_dev_lock); + spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); if (assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX) { struct kvm_guest_msix_entry *guest_entries = assigned_dev-guest_msix_entries; + int index = find_index_from_host_irq(assigned_dev, irq); + int i; + + if (index 0) + goto out; + + guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING; + for (i = 0; i assigned_dev-entries_nr; i++) { if (!(guest_entries[i].flags KVM_ASSIGNED_MSIX_PENDING)) @@ -76,33 +82,15 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) assigned_dev-irq_source_id, guest_entries[i].vector, 1); } - } else + } else { kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id, assigned_dev-guest_irq, 1); - spin_unlock_irq(assigned_dev-assigned_dev_lock); -} - -static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) -{ - unsigned long flags; - struct kvm_assigned_dev_kernel *assigned_dev = - (struct kvm_assigned_dev_kernel *) dev_id; - - spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags); - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX) { - int index = find_index_from_host_irq(assigned_dev, irq); - if (index 0) - goto out; - assigned_dev-guest_msix_entries[index].flags |= - KVM_ASSIGNED_MSIX_PENDING; - } - - schedule_work(assigned_dev-interrupt_work); - - if (assigned_dev-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { - disable_irq_nosync(irq); - assigned_dev-host_irq_disabled = true; + if (assigned_dev-irq_requested_type + KVM_DEV_IRQ_GUEST_INTX) { + disable_irq_nosync(irq); + assigned_dev-host_irq_disabled = true; + } } out: @@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm, assigned_dev-irq_requested_type = ~(KVM_DEV_IRQ_GUEST_MASK); } -/* The function implicit hold kvm-lock mutex due to cancel_work_sync() */ static void deassign_host_irq(struct kvm *kvm, struct kvm_assigned_dev_kernel *assigned_dev) { /* - * In kvm_free_device_irq, cancel_work_sync return true if: - * 1. work is scheduled, and then cancelled. - * 2. work callback is executed. - * - * The first one ensured that the irq is disabled and no more events - * would happen. But for the second one, the irq may be enabled (e.g. - * for MSI). So we disable irq here to
Re: Crash on kvm_iommu_map_pages
Am 01.11.2010 16:29, Jan Kiszka wrote: Nope. But I just noticed a fatal thinko in my fix to intel_iommu_attach_device - probably that was the key. Need to boot the test kernel... That was indeed the reason for this GPF: I blindly swapped the problematic lines, releasing the wrong page. Sorry, false alarm this time, will send out the corrected intel_iommu_attach_device fix later. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
On Mon, Nov 01, 2010 at 05:30:20PM +0100, Jan Kiszka wrote: Am 01.11.2010 16:52, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote: Am 01.11.2010 16:24, Michael S. Tsirkin wrote: On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices between on the host side when passing them to a guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 153 + 2 files changed, 140 insertions(+), 14 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index df5497f..fcdc849 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel { unsigned int entries_nr; int host_irq; bool host_irq_disabled; +bool pci_2_3; struct msix_entry *host_msix_entries; int guest_irq; struct kvm_guest_msix_entry *guest_msix_entries; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index d3ddfea..411643c 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel return index; } +/* + * Verify that the device supports Interrupt Disable bit in command register, + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly + * in PCI 2.2. + */ +static bool pci_2_3_supported(struct pci_dev *pdev) +{ +u16 orig, new; +bool supported = false; + +pci_block_user_cfg_access(pdev); +pci_read_config_word(pdev, PCI_COMMAND, orig); +pci_write_config_word(pdev, PCI_COMMAND, + orig ^ PCI_COMMAND_INTX_DISABLE); +pci_read_config_word(pdev, PCI_COMMAND, new); + +/* + * There's no way to protect against + * hardware bugs or detect them reliably, but as long as we know + * what the value should be, let's go ahead and check it. + */ +if ((new ^ orig) ~PCI_COMMAND_INTX_DISABLE) { +dev_err(pdev-dev, Command changed from 0x%x to 0x%x: +driver or HW bug?\n, orig, new); +goto out; +} +if (!((new ^ orig) PCI_COMMAND_INTX_DISABLE)) { +dev_warn(pdev-dev, Device does not support + disabling interrupts: unable to bind.\n); +goto out; +} +supported = true; + +/* Now restore the original value. */ +pci_write_config_word(pdev, PCI_COMMAND, orig); + +out: +pci_unblock_user_cfg_access(pdev); +return supported; +} + +static void +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status) +{ +u32 cmd_status_dword; +u16 origcmd, newcmd; + +/* + * We do a single dword read to retrieve both command and status. + * Document assumptions that make this possible. + */ +BUILD_BUG_ON(PCI_COMMAND % 4); +BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS); + +pci_block_user_cfg_access(dev); + +/* + * Read both command and status registers in a single 32-bit operation. + * Note: we could cache the value for command and move the status read + * out of the lock if there was a way to get notified of user changes + * to command register through sysfs. Should be good for shared irqs. + */ +pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword); +origcmd = cmd_status_dword; + +if (irq_status) { +/* +* Check interrupt status register to see whether our device triggered +* the interrupt. +*/ +*irq_status = (cmd_status_dword 16) PCI_STATUS_INTERRUPT; +if (*irq_status == 0) +goto done; +} + +newcmd = origcmd ~PCI_COMMAND_INTX_DISABLE; +if (mask) +newcmd |= PCI_COMMAND_INTX_DISABLE; +if (newcmd != origcmd) +pci_write_config_word(dev, PCI_COMMAND, newcmd); + +done: +pci_unblock_user_cfg_access(dev); +} + Let's return irq_status instead of returning through a pointer? Will save a branch and generally make the code a bit cleaner. I'm open for a better API suggestion, Maybe use separate functions for this. pci_2_3_mask_irq pci_2_3_unmask_irq pci_2_3_disable_irq Common code can go
Re: TODO item: guest programmable mac/vlan filtering with macvtap
I have created a wiki page for this [1], also added to the networking todo list [2]. No meaty information yet. But it's enough to start working on it. [1] - http://www.linux-kvm.org/page/GuestProgrammableMacVlanFiltering [2] - http://www.linux-kvm.org/page/NetworkingTodo -- Dragos -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] KVM: dirty logging optimization - double buffering
On Wed, Oct 27, 2010 at 06:21:02PM +0900, Takuya Yoshikawa wrote: This patch series just change the way we allocate dirty bitmaps but don't change timing related issues. - Changelog I have not changed anything about patch 1 and 2 since I got looks good comment from Marcelo. Just rebased. Patch 3 has become simpler by rebasing on top of this series. Thanks, Takuya Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: Mask KVM_GET_SUPPORTED_CPUID data with Linux cpuid info
On Sun, Oct 24, 2010 at 03:38:46PM +0200, Avi Kivity wrote: This allows Linux to mask cpuid bits if, for example, nx is enabled on only some cpus. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 48ce015..54fda7e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2229,6 +2229,11 @@ out: return r; } +static void cpuid_mask(u32 *word, int wordnum) +{ + *word = boot_cpu_data.x86_capability[wordnum]; +} + static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function, u32 index) { @@ -2303,7 +2308,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, break; case 1: entry-edx = kvm_supported_word0_x86_features; + cpuid_mask(entry-edx, 0); entry-ecx = kvm_supported_word4_x86_features; + cpuid_mask(entry-edx, 4); ecx? /* we support x2apic emulation even if host does not support * it since we emulate x2apic in software */ entry-ecx |= F(X2APIC); @@ -2394,7 +2401,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, break; case 0x8001: entry-edx = kvm_supported_word1_x86_features; + cpuid_mask(entry-edx, 1); entry-ecx = kvm_supported_word6_x86_features; + cpuid_mask(entry-edx, 6); ecx? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] trace exit to userspace event
On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote: Add tracepoint for userspace exit. Signed-off-by: Gleb Natapov g...@redhat.com --- ChangeLog: v1-v2 log error case too. Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: add cast within kvm_clear_guest_page to fix warning
On Wed, Oct 27, 2010 at 05:21:21PM +0200, Heiko Carstens wrote: From: Heiko Carstens heiko.carst...@de.ibm.com Fixes this: CC arch/s390/kvm/../../../virt/kvm/kvm_main.o arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_clear_guest_page': arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 3 of 'kvm_write_guest_page' makes pointer from integer without a cast arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void *' but argument is of type 'long unsigned int' Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com --- virt/kvm/kvm_main.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Applied both, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc: kvm: fix information leak to userland
On Sat, Oct 30, 2010 at 01:04:24PM +0400, Vasiliy Kulikov wrote: Structure kvm_ppc_pvinfo is copied to userland with flags and pad fields unitialized. It leads to leaking of contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov sego...@gmail.com --- I cannot compile this driver, so it is not tested at all. arch/powerpc/kvm/powerpc.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch v2] x86: kvm: x86: fix information leak to userland
On Sat, Oct 30, 2010 at 10:54:47PM +0400, Vasiliy Kulikov wrote: Structures kvm_vcpu_events, kvm_debugregs, kvm_pit_state2 and kvm_clock_data are copied to userland with some padding and reserved fields unitialized. It leads to leaking of contents of kernel stack memory. We have to initialize them to zero. In patch v1 Jan Kiszka suggested to fill reserved fields with zeros instead of memset'ting the whole struct. It makes sense as these fields are explicitly marked as padding. No more fields need zeroing. Signed-off-by: Vasiliy Kulikov sego...@gmail.com --- Compile tesed only. arch/x86/kvm/x86.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice
On Mon, 1 Nov 2010, Takuya Yoshikawa wrote: (2010/10/31 3:28), Jesper Juhl wrote: Hi, We can improve kvm_vm_ioctl_get_dirty_log() slightly by using vzalloc() rather than first allocating and then manually zero the memory with memset(). Also, while I was looking at this I noticed that we assign I personally prefer this new vzalloc() to vmalloc() + memset(). Just from my interest, is there real performance difference not just the cleanup effect? If so, we'd better do this for other places too. There's definately a positive size impact for the generated object code and we save having to do the call to memset() and the cost of a vzalloc() call looks more or less the same as a call to vmalloc() to me. snip This patch is not based on kvm.git, I guess. Nope. It was generated against mainline git as of a couple of days ago. I can generate a version against kvm.git if you prefer. r = -ENOMEM; - dirty_bitmap = vmalloc(n); + dirty_bitmap = vzalloc(n); if (!dirty_bitmap) goto out; - memset(dirty_bitmap, 0, n); - r = -ENOMEM; This one is here because this belongs to a different code block from the previous one. This keeps it easy to insert another codes in between these two blocks. The optimization will be done at compile time. IIRC, I did like this based on Avi's advise. Fair enough, I've dropped that bit from the patch. Updated version below (against current mainline git). Signed-off-by: Jesper Juhl j...@chaosbits.net --- x86.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2288ad8..624d4da 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3174,10 +3174,9 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, spin_unlock(kvm-mmu_lock); r = -ENOMEM; - dirty_bitmap = vmalloc(n); + dirty_bitmap = vzalloc(n); if (!dirty_bitmap) goto out; - memset(dirty_bitmap, 0, n); r = -ENOMEM; slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); -- Jesper Juhl j...@chaosbits.net http://www.chaosbits.net/ Plain text mails only, please http://www.expita.com/nomime.html Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check
hi, the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. wbr, jirka Signed-off-by: Jiri Olsa jo...@redhat.com --- include/linux/cred.h |1 - include/linux/fdtable.h |1 - include/linux/kvm_host.h |1 - kernel/exit.c|1 - kernel/pid.c |1 - kernel/rcutorture.c |2 -- security/keys/keyring.c |1 - 7 files changed, 0 insertions(+), 8 deletions(-) diff --git a/include/linux/cred.h b/include/linux/cred.h index 4aaeab3..a6b9afc 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -283,7 +283,6 @@ static inline void put_cred(const struct cred *_cred) ({ \ const struct task_struct *__t = (task); \ rcu_dereference_check(__t-real_cred, \ - rcu_read_lock_held() || \ task_is_dead(__t)); \ }) diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 133c0ba..df7e3cf 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -60,7 +60,6 @@ struct files_struct { #define rcu_dereference_check_fdtable(files, fdtfd) \ (rcu_dereference_check((fdtfd), \ - rcu_read_lock_held() || \ lockdep_is_held((files)-file_lock) || \ atomic_read((files)-count) == 1 || \ rcu_my_thread_group_empty())) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a055742..a90a7e3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -256,7 +256,6 @@ void kvm_put_kvm(struct kvm *kvm); static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm) { return rcu_dereference_check(kvm-memslots, - srcu_read_lock_held(kvm-srcu) || lockdep_is_held(kvm-slots_lock)); } diff --git a/kernel/exit.c b/kernel/exit.c index b194feb..f753342 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -85,7 +85,6 @@ static void __exit_signal(struct task_struct *tsk) struct tty_struct *uninitialized_var(tty); sighand = rcu_dereference_check(tsk-sighand, - rcu_read_lock_held() || lockdep_tasklist_lock_is_held()); spin_lock(sighand-siglock); diff --git a/kernel/pid.c b/kernel/pid.c index 39b65b6..c02adda 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -402,7 +402,6 @@ struct task_struct *pid_task(struct pid *pid, enum pid_type type) if (pid) { struct hlist_node *first; first = rcu_dereference_check(hlist_first_rcu(pid-tasks[type]), - rcu_read_lock_held() || lockdep_tasklist_lock_is_held()); if (first) result = hlist_entry(first, struct task_struct, pids[(type)].node); diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 9d8e8fb..0956a73 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -807,7 +807,6 @@ static void rcu_torture_timer(unsigned long unused) idx = cur_ops-readlock(); completed = cur_ops-completed(); p = rcu_dereference_check(rcu_torture_current, - rcu_read_lock_held() || rcu_read_lock_bh_held() || rcu_read_lock_sched_held() || srcu_read_lock_held(srcu_ctl)); @@ -868,7 +867,6 @@ rcu_torture_reader(void *arg) idx = cur_ops-readlock(); completed = cur_ops-completed(); p = rcu_dereference_check(rcu_torture_current, - rcu_read_lock_held() || rcu_read_lock_bh_held() || rcu_read_lock_sched_held() || srcu_read_lock_held(srcu_ctl)); diff --git a/security/keys/keyring.c b/security/keys/keyring.c index d37f713..73c23f2 100644 --- a/security/keys/keyring.c +++ b/security/keys/keyring.c @@ -157,7 +157,6 @@ static void keyring_destroy(struct key *keyring) } klist = rcu_dereference_check(keyring-payload.subscriptions, - rcu_read_lock_held() || atomic_read(keyring-usage) == 0); if (klist) { for (loop = klist-nkeys - 1; loop = 0; loop--) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More
[PATCH 3/3] net - removing superfluous rcu_read_lock_held check
hi, the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. wbr, jirka Signed-off-by: Jiri Olsa jo...@redhat.com --- include/linux/rtnetlink.h |3 +-- include/net/sock.h |3 +-- net/mac80211/sta_info.c|4 net/netlabel/netlabel_domainhash.c |3 +-- net/netlabel/netlabel_unlabeled.c |3 +-- 5 files changed, 4 insertions(+), 12 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index d42f274..dfe5ba1 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -758,8 +758,7 @@ extern int lockdep_rtnl_is_held(void); * or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference() */ #define rcu_dereference_rtnl(p)\ - rcu_dereference_check(p, rcu_read_lock_held() ||\ -lockdep_rtnl_is_held()) + rcu_dereference_check(p, lockdep_rtnl_is_held()) /** * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL diff --git a/include/net/sock.h b/include/net/sock.h index c7a7362..bee3e9c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1274,8 +1274,7 @@ extern unsigned long sock_i_ino(struct sock *sk); static inline struct dst_entry * __sk_dst_get(struct sock *sk) { - return rcu_dereference_check(sk-sk_dst_cache, rcu_read_lock_held() || - sock_owned_by_user(sk) || + return rcu_dereference_check(sk-sk_dst_cache, sock_owned_by_user(sk) || lockdep_is_held(sk-sk_lock.slock)); } diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 6d8f897..c879217 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -94,7 +94,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata, struct sta_info *sta; sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)], - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); while (sta) { @@ -102,7 +101,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata, memcmp(sta-sta.addr, addr, ETH_ALEN) == 0) break; sta = rcu_dereference_check(sta-hnext, - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); } @@ -120,7 +118,6 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata, struct sta_info *sta; sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)], - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); while (sta) { @@ -129,7 +126,6 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata, memcmp(sta-sta.addr, addr, ETH_ALEN) == 0) break; sta = rcu_dereference_check(sta-hnext, - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); } diff --git a/net/netlabel/netlabel_domainhash.c b/net/netlabel/netlabel_domainhash.c index d37b7f8..82795a4 100644 --- a/net/netlabel/netlabel_domainhash.c +++ b/net/netlabel/netlabel_domainhash.c @@ -55,8 +55,7 @@ struct netlbl_domhsh_tbl { * should be okay */ static DEFINE_SPINLOCK(netlbl_domhsh_lock); #define netlbl_domhsh_rcu_deref(p) \ - rcu_dereference_check(p, rcu_read_lock_held() || \ -lockdep_is_held(netlbl_domhsh_lock)) + rcu_dereference_check(p, lockdep_is_held(netlbl_domhsh_lock)) static struct netlbl_domhsh_tbl *netlbl_domhsh = NULL; static struct netlbl_dom_map *netlbl_domhsh_def = NULL; diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c index e2b0a68..d2f982f 100644 --- a/net/netlabel/netlabel_unlabeled.c +++ b/net/netlabel/netlabel_unlabeled.c @@ -116,8 +116,7 @@ struct netlbl_unlhsh_walk_arg { * hash table should be okay */ static DEFINE_SPINLOCK(netlbl_unlhsh_lock); #define netlbl_unlhsh_rcu_deref(p) \ - rcu_dereference_check(p, rcu_read_lock_held() || \ -lockdep_is_held(netlbl_unlhsh_lock)) + rcu_dereference_check(p, lockdep_is_held(netlbl_unlhsh_lock)) static struct
[PATCH 1/3] cgroup - removing superfluous rcu_read_lock_held check
hi, the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. wbr, jirka Signed-off-by: Jiri Olsa jo...@redhat.com --- include/linux/cgroup.h |1 - kernel/cgroup.c|6 ++ 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ed4ba11..caed568 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -536,7 +536,6 @@ static inline struct cgroup_subsys_state *cgroup_subsys_state( */ #define task_subsys_state_check(task, subsys_id, __c) \ rcu_dereference_check(task-cgroups-subsys[subsys_id], \ - rcu_read_lock_held() || \ lockdep_is_held(task-alloc_lock) || \ cgroup_lock_is_held() || (__c)) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 66a416b..1f329a2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1687,7 +1687,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen) { char *start; struct dentry *dentry = rcu_dereference_check(cgrp-dentry, - rcu_read_lock_held() || cgroup_lock_is_held()); if (!dentry || cgrp == dummytop) { @@ -1713,7 +1712,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen) break; dentry = rcu_dereference_check(cgrp-dentry, - rcu_read_lock_held() || cgroup_lock_is_held()); if (!cgrp-parent) continue; @@ -4544,7 +4542,7 @@ unsigned short css_id(struct cgroup_subsys_state *css) * it's unchanged until freed. */ cssid = rcu_dereference_check(css-id, - rcu_read_lock_held() || atomic_read(css-refcnt)); + atomic_read(css-refcnt)); if (cssid) return cssid-id; @@ -4557,7 +4555,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css) struct css_id *cssid; cssid = rcu_dereference_check(css-id, - rcu_read_lock_held() || atomic_read(css-refcnt)); + atomic_read(css-refcnt)); if (cssid) return cssid-depth; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] rcu - removing superfluous rcu_read_lock_held check
hi, the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. Several places in kernel are spcifying rcu_read_lock_held as rcu_dereference_check condition parameter. I separated them into 3 patches: 1/3 - cgroup - removing superfluous rcu_read_lock_held check 2/3 - kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check 3/3 - net - removing superfluous rcu_read_lock_held check sry if I might missed or added somebody wrongly to recipients wbr, jirka Signed-off-by: Jiri Olsa jo...@redhat.com --- include/linux/cgroup.h |1 - include/linux/cred.h |1 - include/linux/fdtable.h|1 - include/linux/kvm_host.h |1 - include/linux/rtnetlink.h |3 +-- include/net/sock.h |3 +-- kernel/cgroup.c|6 ++ kernel/exit.c |1 - kernel/pid.c |1 - kernel/rcutorture.c|2 -- net/mac80211/sta_info.c|4 net/netlabel/netlabel_domainhash.c |3 +-- net/netlabel/netlabel_unlabeled.c |3 +-- security/keys/keyring.c|1 - 14 files changed, 6 insertions(+), 25 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On Mon, 2010-11-01 at 11:49 -0700, Anthony Liguori wrote: On 11/01/2010 11:09 AM, Marcelo Tosatti wrote: On Tue, Oct 26, 2010 at 10:39:48AM +0800, Huang Ying wrote: Author: Max Asbockmasb...@linux.vnet.ibm.com Add command pfa2hva to translate guest physical address to host virtual address. The pfa2hva command provides one step in a chain of translations from guest virtual to guest physical to host virtual to host physical. Host physical is then used to inject a machine check error. As a consequence the HWPOISON code on the host and the MCE injection code in qemu-kvm are exercised. v2: - Add QMP support Looks good to me. Anthony? The translation is not stable so this is really a bad interface. It definitely shouldn't be a QMP command and I don't think it's generally useful enough that it should be a monitor command. It's impossible to use correctly as a general interface. I gave this feedback when it was initially submitted. Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address
On 11/01/2010 02:20 PM, Huang Ying wrote: Yes. As general interface, it may not work so well, but as test interface, it works quite well and useful. Do we have any mechanism to add a test only interface? I'd like to see what Luiz/Markus think but definitely only a human monitor interface and probably prefix the command with a 'x-' prefix to indicate that it's not a supported interface. The documentation should be very clear about the limitations of the interface and the intended use-case. Regards, Anthony Liguori Best Regards, Huang Ying -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] trace exit to userspace event
On 11/01/2010 12:06 PM, Marcelo Tosatti wrote: On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote: Add tracepoint for userspace exit. Signed-off-by: Gleb Natapovg...@redhat.com --- ChangeLog: v1-v2 log error case too. Applied, thanks. errno sign is still flipped, no? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
unhandled wrmsr
I built from qemu-kvm-0.13.0.tar.gz on a Debian system with kernel linux-image-2.6.32-5-amd642.6.32-26 (but otherwise basically the stable/lenny version) and now see Oct 26 16:57:38 markov kernel: [ 5757.672426] kvm: 23063: cpu0 unhandled wrmsr: 0x198 data 0 Oct 26 16:57:38 markov kernel: [ 5757.672454] kvm: 23063: cpu1 unhandled wrmsr: 0x198 data 0 Oct 26 16:57:38 markov kernel: [ 5757.672476] kvm: 23063: cpu2 unhandled wrmsr: 0x198 data 0 Oct 26 16:57:38 markov kernel: [ 5757.672497] kvm: 23063: cpu3 unhandled wrmsr: 0x198 data 0 on startup. I have 4 CPUs with 8 cores: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz stepping 06. Note this means the kernel side is from the Debian package, and so might be older. google shows this message has popped up several times in the past and been fixed several times. Most of the messages indicate it is harmless, but https://bugs.launchpad.net/ubuntu/+source/linux/+bug/325851 on 2009-07-01 says it's not benign. There also seems to be a current bug open about it, http://sourceforge.net/tracker/index.php?func=detailaid=3056363group_id=180599atid=893831, which is also marked as fixed. Can anyone tell me anything more about these messages? Do they indicate a problem I need to fix? If so, any advice on how to fix it? Thanks. Ross P.S. Please cc me on reply. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] trace exit to userspace event
On Mon, Nov 01, 2010 at 03:33:20PM -0400, Avi Kivity wrote: On 11/01/2010 12:06 PM, Marcelo Tosatti wrote: On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote: Add tracepoint for userspace exit. Signed-off-by: Gleb Natapovg...@redhat.com --- ChangeLog: v1-v2 log error case too. Applied, thanks. errno sign is still flipped, no? No, kvm_arch_vcpu_ioctl_run return -errno on error. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
On Sat, 2010-10-30 at 22:06 +0200, Michael S. Tsirkin wrote: On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote: On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote: Hmm. I don't yet understand. We are still doing copies into the per-vq buffer, and the data copied is really small. Is it about cache line bounces? Could you try figuring it out? per-vq buffer is much less expensive than 3 put_copy() call. I will collect the profiling data to show that. What about __put_user? Maybe the access checks are the ones that add the cost here? I attach patches to strip access checks: they are not needed as we do them on setup time already, anyway. Can you try them out and see if performance is improved for you please? On top of this, we will need to add some scheme to accumulate signals, but that is a separate issue. Yes, moving from put_user/get_user to __put_user/__get_user does improve the performance by removing the checking. My concern here is whether checking only in set up would be sufficient for security? Would be there is a case guest could corrupt the ring later? If not, that's OK. 2. How about flushing out queued stuff before we exit the handle_tx loop? That would address most of the spec issue. The performance is almost as same as the previous patch. I will resubmit the modified one, adding vhost_add_used_and_signal_n after handle_tx loop for processing pending queue. This patch was a part of modified macvtap zero copy which I haven't submitted yet. I found this helped vhost TX in general. This pending queue will be used by DMA done later, so I put it in vq instead of a local variable in handle_tx. Thanks Shirley BTW why do we need another array? Isn't heads field exactly what we need here? head field is only for up to 32, the more used buffers add and signal accumulated the better performance is from test results. I think we should separate the used update and signalling. Interrupts are expensive so I can believe accumulating even up to 100 of them helps. But used head copies are already prety cheap. If we cut the overhead by x32, that should make them almost free? I can separate the used update and signaling to see the best performance. That's was one of the reason I didn't use heads. The other reason was I used these buffer for pending dma done in mavctap zero copy patch. It could be up to vq-num in worse case. We can always increase that, not an issue. Good, I will change heads up to vq-num and use it. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] net - removing superfluous rcu_read_lock_held check
From: Jiri Olsa jo...@redhat.com Date: Mon, 1 Nov 2010 20:15:09 +0100 the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. Signed-off-by: Jiri Olsa jo...@redhat.com Acked-by: David S. Miller da...@davemloft.net -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 0/2] Test nmi-after-sti
Add a unit test for NMI immediately after an IF-enabling STI. Avi Kivity (2): apic: use boot idt instead of a locally allocated idt apic: test nmi-after-sti x86/apic.c | 83 +--- 1 files changed, 68 insertions(+), 15 deletions(-) -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 2/2] apic: test nmi-after-sti
While not required by the spec, some guests (Linux) rely on nmi being blocked by an IF-enabling sti. Add a unit test for this condition. Signed-off-by: Avi Kivity a...@redhat.com --- x86/apic.c | 67 1 files changed, 67 insertions(+), 0 deletions(-) diff --git a/x86/apic.c b/x86/apic.c index 165f820..2207040 100644 --- a/x86/apic.c +++ b/x86/apic.c @@ -1,6 +1,7 @@ #include libcflat.h #include apic.h #include vm.h +#include smp.h typedef struct { unsigned short offset0; @@ -274,9 +275,74 @@ static void test_ioapic_simultaneous(void) g_66 g_78 g_66_after_78 g_66_rip == g_78_rip); } +volatile int nmi_counter_private, nmi_counter, nmi_hlt_counter, sti_loop_active; + +void sti_nop(char *p) +{ +asm volatile ( + .globl post_sti \n\t + sti \n + /* + * vmx won't exit on external interrupt if blocked-by-sti, + * so give it a reason to exit by accessing an unmapped page. + */ + post_sti: testb $0, %0 \n\t + nop \n\t + cli + : : m(*p) + ); +nmi_counter = nmi_counter_private; +} + +static void sti_loop(void *ignore) +{ +unsigned k = 0; + +while (sti_loop_active) { + sti_nop((char *)(ulong)((k++ * 4096) % (128 * 1024 * 1024))); +} +} + +static void nmi_handler(isr_regs_t *regs) +{ +extern void post_sti(void); +++nmi_counter_private; +nmi_hlt_counter += regs-rip == (ulong)post_sti; +} + +static void update_cr3(void *cr3) +{ +write_cr3((ulong)cr3); +} + +static void test_sti_nmi(void) +{ +unsigned old_counter; + +if (cpu_count() 2) { + return; +} + +set_idt_entry(2, nmi_handler); +on_cpu(1, update_cr3, (void *)read_cr3()); + +sti_loop_active = 1; +on_cpu_async(1, sti_loop, 0); +while (nmi_counter 3) { + old_counter = nmi_counter; + apic_icr_write(APIC_DEST_PHYSICAL | APIC_DM_NMI | APIC_INT_ASSERT, 1); + while (nmi_counter == old_counter) { + ; + } +} +sti_loop_active = 0; +report(nmi-after-sti, nmi_hlt_counter == 0); +} + int main() { setup_vm(); +smp_init(); test_lapic_existence(); @@ -288,6 +354,7 @@ int main() test_ioapic_intr(); test_ioapic_simultaneous(); +test_sti_nmi(); printf(\nsummary: %d tests, %d failures\n, g_tests, g_fail); -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 1/2] apic: use boot idt instead of a locally allocated idt
This allows the smp support, which uses the boot idt, to work. Signed-off-by: Avi Kivity a...@redhat.com --- x86/apic.c | 16 +--- 1 files changed, 1 insertions(+), 15 deletions(-) diff --git a/x86/apic.c b/x86/apic.c index 48fa0f7..165f820 100644 --- a/x86/apic.c +++ b/x86/apic.c @@ -89,7 +89,7 @@ asm ( #endif ); -static idt_entry_t idt[256]; +static idt_entry_t *idt = 0; static int g_fail; static int g_tests; @@ -127,19 +127,6 @@ void test_enable_x2apic(void) } } -static void init_idt(void) -{ -struct { -u16 limit; -ulong idt; -} __attribute__((packed)) idt_ptr = { -sizeof(idt_entry_t) * 256 - 1, -(ulong)idt, -}; - -asm volatile(lidt %0 : : m(idt_ptr)); -} - static void set_idt_entry(unsigned vec, void (*func)(isr_regs_t *regs)) { u8 *thunk = vmalloc(50); @@ -296,7 +283,6 @@ int main() mask_pic_interrupts(); enable_apic(); test_enable_x2apic(); -init_idt(); test_self_ipi(); -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] trace exit to userspace event
On 11/01/2010 03:43 PM, Marcelo Tosatti wrote: On Mon, Nov 01, 2010 at 03:33:20PM -0400, Avi Kivity wrote: On 11/01/2010 12:06 PM, Marcelo Tosatti wrote: On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote: Add tracepoint for userspace exit. Signed-off-by: Gleb Natapovg...@redhat.com --- ChangeLog: v1-v2 log error case too. Applied, thanks. errno sign is still flipped, no? No, kvm_arch_vcpu_ioctl_run return -errno on error. It's true that this is how the kernel ABI is defined, but for a user errnos are positive. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: VMX: Disallow NMI while blocked by STI
While not mandated by the spec, Linux relies on NMI being blocked by an IF-enabling STI. VMX also refuses to enter a guest in this state, at least on some implementations. Disallow NMI while blocked by STI by checking for the condition, and requesting an interrupt window exit if it occurs. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/vmx.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 993e332..09321d0 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2788,6 +2788,10 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu) return; } + if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) GUEST_INTR_STATE_STI) { + enable_irq_window(vcpu); + return; + } cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_NMI_PENDING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control); @@ -2850,7 +2854,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu) return 0; return !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) - (GUEST_INTR_STATE_MOV_SS | GUEST_INTR_STATE_NMI)); + (GUEST_INTR_STATE_MOV_SS | GUEST_INTR_STATE_STI + | GUEST_INTR_STATE_NMI)); } static bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu) -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Fixes, non-PCI-2.3 support, EOI enhancements
I've applied all your patches. Thanks! On Saturday, October 30, 2010 09:58:55 am Alex Williamson wrote: Hi Tom, I've updated some patches I've been working on to v5 and wanted to see what you think. I also found a couple minor bugs, fixed in this series. The main idea is that since the VFIO interface defines that the INTx interrupt is disabled when it fires, we should provide an interface to re-enable it. We currently have to do a read-modify-write to PCI config space, requring two ioctls. I introduce an ioctl to do this for us. An important aspect of this is that now we can support non-PCI-2.3 devices since re-enabling the interrupt is abstracted (with the same caveat as current KVM device assignment, that they must get an exclusive interrupt). The real trick though is that we can then also create an irqfd-like mechanism to trigger re-enabling interrupts from and eventfd. This allows qemu to do all the setup for connecting interrupts from VFIO into KVM and EOIs from KVM to VFIO, then lets userspace get completely out of the way. Hope you like it. Thanks, Alex --- Alex Williamson (5): vfio: Add a new ioctl to support EOI via eventfd vfio: Add support for non-PCI 2.3 compliant devices vfio: Add ioctl to re-enable interrupts vfio: Fix requested regions vfio: Fix the ROM mask drivers/vfio/vfio_intrs.c | 254 +--- drivers/vfio/vfio_main.c | 37 +- drivers/vfio/vfio_pci_config.c |2 drivers/vfio/vfio_rdwr.c |4 - include/linux/vfio.h | 14 ++ 5 files changed, 279 insertions(+), 32 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)
Hi, Are there any lingering issues or concerns with the latest rbd patch, or other roadblocks that would prevent this from being merged? Thanks- sage On Fri, 15 Oct 2010, Christian Brunner wrote: Hi, once again, Yehuda committed fixes for all the suggestions made on the list (and more). Here is the next update for the ceph/rbd block driver. Please let us know if there are any pending issues. For those who didn't follow the previous postings: This is an block driver for the distributed file system Ceph (http://ceph.newdream.net/). This driver uses librados (which is part of the Ceph server) for direct access to the Ceph object store and is running entirely in userspace (Yehuda also wrote a driver for the linux kernel, that can be used to access rbd volumes as a block device). Kind Regards, Christian Signed-off-by: Christian Brunner c...@muc.de Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net --- Makefile.objs |1 + block/rbd.c | 1059 + block/rbd_types.h | 71 configure | 31 ++ 4 files changed, 1162 insertions(+), 0 deletions(-) create mode 100644 block/rbd.c create mode 100644 block/rbd_types.h diff --git a/Makefile.objs b/Makefile.objs index 6ee077c..56a13c1 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -19,6 +19,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o +block-nested-$(CONFIG_RBD) += rbd.o block-obj-y += $(addprefix block/, $(block-nested-y)) diff --git a/block/rbd.c b/block/rbd.c new file mode 100644 index 000..fbfb93e --- /dev/null +++ b/block/rbd.c @@ -0,0 +1,1059 @@ +/* + * QEMU Block driver for RADOS (Ceph) + * + * Copyright (C) 2010 Christian Brunner c...@muc.de + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include qemu-error.h + +#include rbd_types.h +#include block_int.h + +#include rados/librados.h + + + +/* + * When specifying the image filename use: + * + * rbd:poolname/devicename + * + * poolname must be the name of an existing rados pool + * + * devicename is the basename for all objects used to + * emulate the raw device. + * + * Metadata information (image size, ...) is stored in an + * object with the name devicename.rbd. + * + * The raw device is split into 4MB sized objects by default. + * The sequencenumber is encoded in a 12 byte long hex-string, + * and is attached to the devicename, separated by a dot. + * e.g. devicename.1234567890ab + * + */ + +#define OBJ_MAX_SIZE (1UL OBJ_DEFAULT_OBJ_ORDER) + +typedef struct RBDAIOCB { +BlockDriverAIOCB common; +QEMUBH *bh; +int ret; +QEMUIOVector *qiov; +char *bounce; +int write; +int64_t sector_num; +int aiocnt; +int error; +struct BDRVRBDState *s; +int cancelled; +} RBDAIOCB; + +typedef struct RADOSCB { +int rcbid; +RBDAIOCB *acb; +struct BDRVRBDState *s; +int done; +int64_t segsize; +char *buf; +int ret; +} RADOSCB; + +#define RBD_FD_READ 0 +#define RBD_FD_WRITE 1 + +typedef struct BDRVRBDState { +int fds[2]; +rados_pool_t pool; +rados_pool_t header_pool; +char name[RBD_MAX_OBJ_NAME_SIZE]; +char block_name[RBD_MAX_BLOCK_NAME_SIZE]; +uint64_t size; +uint64_t objsize; +int qemu_aio_count; +int read_only; +int event_reader_pos; +RADOSCB *event_rcb; +} BDRVRBDState; + +typedef struct rbd_obj_header_ondisk RbdHeader1; + +static int rbd_next_tok(char *dst, int dst_len, +char *src, char delim, +const char *name, +char **p) +{ +int l; +char *end; + +*p = NULL; + +if (delim != '\0') { +end = strchr(src, delim); +if (end) { +*p = end + 1; +*end = '\0'; +} +} +l = strlen(src); +if (l = dst_len) { +error_report(%s too long, name); +return -EINVAL; +} else if (l == 0) { +error_report(%s too short, name); +return -EINVAL; +} + +pstrcpy(dst, dst_len, src); + +return 0; +} + +static int rbd_parsename(const char *filename, + char *pool, int pool_len, + char *snap, int snap_len, + char *name, int name_len) +{ +const char *start; +char *p, *buf; +int ret; + +if (!strstart(filename, rbd:, start)) { +return -EINVAL; +} + +buf = qemu_strdup(start); +p = buf; + +ret = rbd_next_tok(pool, pool_len, p, '/', pool name, p); +if (ret 0 || !p) { +ret = -EINVAL; +goto done; +} +
Re: [PATCH 3/3] net - removing superfluous rcu_read_lock_held check
On Mon, 2010-11-01 at 20:15 +0100, Jiri Olsa wrote: hi, the rcu_dereference_check is defined as #define rcu_dereference_check(p, c) \ __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) so the caller does not need to specify rcu_read_lock_held() condition. wbr, jirka Signed-off-by: Jiri Olsa jo...@redhat.com The NetLabel changes look fine to me; thanks for the cleanup. --- include/linux/rtnetlink.h |3 +-- include/net/sock.h |3 +-- net/mac80211/sta_info.c|4 net/netlabel/netlabel_domainhash.c |3 +-- net/netlabel/netlabel_unlabeled.c |3 +-- 5 files changed, 4 insertions(+), 12 deletions(-) diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index d42f274..dfe5ba1 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -758,8 +758,7 @@ extern int lockdep_rtnl_is_held(void); * or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference() */ #define rcu_dereference_rtnl(p) \ - rcu_dereference_check(p, rcu_read_lock_held() ||\ - lockdep_rtnl_is_held()) + rcu_dereference_check(p, lockdep_rtnl_is_held()) /** * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL diff --git a/include/net/sock.h b/include/net/sock.h index c7a7362..bee3e9c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1274,8 +1274,7 @@ extern unsigned long sock_i_ino(struct sock *sk); static inline struct dst_entry * __sk_dst_get(struct sock *sk) { - return rcu_dereference_check(sk-sk_dst_cache, rcu_read_lock_held() || -sock_owned_by_user(sk) || + return rcu_dereference_check(sk-sk_dst_cache, sock_owned_by_user(sk) || lockdep_is_held(sk-sk_lock.slock)); } diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 6d8f897..c879217 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -94,7 +94,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata, struct sta_info *sta; sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)], - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); while (sta) { @@ -102,7 +101,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data *sdata, memcmp(sta-sta.addr, addr, ETH_ALEN) == 0) break; sta = rcu_dereference_check(sta-hnext, - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); } @@ -120,7 +118,6 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata, struct sta_info *sta; sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)], - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); while (sta) { @@ -129,7 +126,6 @@ struct sta_info *sta_info_get_bss(struct ieee80211_sub_if_data *sdata, memcmp(sta-sta.addr, addr, ETH_ALEN) == 0) break; sta = rcu_dereference_check(sta-hnext, - rcu_read_lock_held() || lockdep_is_held(local-sta_lock) || lockdep_is_held(local-sta_mtx)); } diff --git a/net/netlabel/netlabel_domainhash.c b/net/netlabel/netlabel_domainhash.c index d37b7f8..82795a4 100644 --- a/net/netlabel/netlabel_domainhash.c +++ b/net/netlabel/netlabel_domainhash.c @@ -55,8 +55,7 @@ struct netlbl_domhsh_tbl { * should be okay */ static DEFINE_SPINLOCK(netlbl_domhsh_lock); #define netlbl_domhsh_rcu_deref(p) \ - rcu_dereference_check(p, rcu_read_lock_held() || \ - lockdep_is_held(netlbl_domhsh_lock)) + rcu_dereference_check(p, lockdep_is_held(netlbl_domhsh_lock)) static struct netlbl_domhsh_tbl *netlbl_domhsh = NULL; static struct netlbl_dom_map *netlbl_domhsh_def = NULL; diff --git a/net/netlabel/netlabel_unlabeled.c b/net/netlabel/netlabel_unlabeled.c index e2b0a68..d2f982f 100644 --- a/net/netlabel/netlabel_unlabeled.c +++ b/net/netlabel/netlabel_unlabeled.c @@ -116,8 +116,7 @@ struct netlbl_unlhsh_walk_arg { * hash table should be okay */ static DEFINE_SPINLOCK(netlbl_unlhsh_lock); #define netlbl_unlhsh_rcu_deref(p) \ - rcu_dereference_check(p, rcu_read_lock_held() || \ -
Re: [PATCH 2/3] kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check
On 11/01/2010 08:15 PM, Jiri Olsa wrote: diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a055742..a90a7e3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -256,7 +256,6 @@ void kvm_put_kvm(struct kvm *kvm); static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm) { return rcu_dereference_check(kvm-memslots, - srcu_read_lock_held(kvm-srcu) || lockdep_is_held(kvm-slots_lock)); } This is an srcu_read_lock_held, which you don't touch here: diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 9d8e8fb..0956a73 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -807,7 +807,6 @@ static void rcu_torture_timer(unsigned long unused) idx = cur_ops-readlock(); completed = cur_ops-completed(); p = rcu_dereference_check(rcu_torture_current, - rcu_read_lock_held() || rcu_read_lock_bh_held() || rcu_read_lock_sched_held() || srcu_read_lock_held(srcu_ctl)); I guess the kvm hunk is the incorrect one? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice
Hi Jesper, (dropped some addresses from Cc) Jesper Juhl wrote: There's definately a positive size impact for the generated object code and we save having to do the call to memset() and the cost of a vzalloc() call looks more or less the same as a call to vmalloc() to me. This patch is not based on kvm.git, I guess. Nope. It was generated against mainline git as of a couple of days ago. I can generate a version against kvm.git if you prefer. Sorry, I should have said more clearly. kvm_vm_ioctl_get_dirty_log() has been changed since latest mainline kernel was released. Furthermore, vmalloc() in it is to be removed soon, currently in kvm's next. I have checked all vmalloc() + memset() in kvm, and there are three remaining. See the patch below. - I have tested on x86. - I can test ppc later if needed, but this is so trivial that Alex will see no problem about this, probably. So please write your Signed-off-by and ask Avi and Marcelo to apply. Thanks, Takuya === Subject: [PATCH] KVM: replace vmalloc and memset with vzalloc Let's use newly introduced vzalloc(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Cc: Jesper Juhl j...@chaosbits.net --- arch/powerpc/kvm/book3s.c |4 +--- virt/kvm/kvm_main.c |9 ++--- 2 files changed, 3 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index e316847..badc983 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -1307,12 +1307,10 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) int err = -ENOMEM; unsigned long p; - vcpu_book3s = vmalloc(sizeof(struct kvmppc_vcpu_book3s)); + vcpu_book3s = vzalloc(sizeof(struct kvmppc_vcpu_book3s)); if (!vcpu_book3s) goto out; - memset(vcpu_book3s, 0, sizeof(struct kvmppc_vcpu_book3s)); - vcpu_book3s-shadow_vcpu = (struct kvmppc_book3s_shadow_vcpu *) kzalloc(sizeof(*vcpu_book3s-shadow_vcpu), GFP_KERNEL); if (!vcpu_book3s-shadow_vcpu) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index dbe1d6a..bd44a81 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -606,13 +606,11 @@ int __kvm_set_memory_region(struct kvm *kvm, /* Allocate if a slot is being created */ #ifndef CONFIG_S390 if (npages !new.rmap) { - new.rmap = vmalloc(npages * sizeof(*new.rmap)); + new.rmap = vzalloc(npages * sizeof(*new.rmap)); if (!new.rmap) goto out_free; - memset(new.rmap, 0, npages * sizeof(*new.rmap)); - new.user_alloc = user_alloc; new.userspace_addr = mem-userspace_addr; } @@ -635,14 +633,11 @@ int __kvm_set_memory_region(struct kvm *kvm, KVM_HPAGE_GFN_SHIFT(level)); lpages -= base_gfn KVM_HPAGE_GFN_SHIFT(level); - new.lpage_info[i] = vmalloc(lpages * sizeof(*new.lpage_info[i])); + new.lpage_info[i] = vzalloc(lpages * sizeof(*new.lpage_info[i])); if (!new.lpage_info[i]) goto out_free; - memset(new.lpage_info[i], 0, - lpages * sizeof(*new.lpage_info[i])); - if (base_gfn (KVM_PAGES_PER_HPAGE(level) - 1)) new.lpage_info[i][0].write_count = 1; if ((base_gfn+npages) (KVM_PAGES_PER_HPAGE(level) - 1)) -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on default_i386_debian_5_0
The Buildbot has detected a new failure of default_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_debian_5_0/builds/628 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: The Nightly scheduler named 'nightly_default' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on default_i386_out_of_tree
The Buildbot has detected a new failure of default_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_out_of_tree/builds/565 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: The Nightly scheduler named 'nightly_default' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs
On 11/01/2010 08:55 PM, Gleb Natapov wrote: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2cfdf2d..f7aed95 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5295,8 +5295,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) { switch(vcpu-arch.mp_state) { case KVM_MP_STATE_HALTED: - vcpu-arch.mp_state = - KVM_MP_STATE_RUNNABLE; + if (list_empty_careful(vcpu-async_pf.done)) + vcpu-arch.mp_state = + KVM_MP_STATE_RUNNABLE; if nmi/interrupt and apfs completed event occur at the same time, we will miss to exit halt sate. Maybe we can check the pending event here, see below patch please. case KVM_MP_STATE_RUNNABLE: vcpu-arch.apf.halted = false; break; @@ -6279,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu, vcpu-arch.fault.error_code = 0; vcpu-arch.fault.address = work-arch.token; kvm_inject_page_fault(vcpu); + vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; } } Have a stupid question, why we make the vcpu runnable here? Sorry i don't know kvm pv guest to much. :-( diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2044302..27a712b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5267,6 +5267,12 @@ out: return r; } +static bool kvm_vcpu_leave_halt_state(struct kvm_vcpu *vcpu) +{ + return vcpu-arch.nmi_pending || + (kvm_arch_interrupt_allowed(vcpu) +kvm_cpu_has_interrupt(vcpu)); +} static int __vcpu_run(struct kvm_vcpu *vcpu) { @@ -5299,8 +5305,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu) { switch(vcpu-arch.mp_state) { case KVM_MP_STATE_HALTED: - vcpu-arch.mp_state = - KVM_MP_STATE_RUNNABLE; + if (kvm_vcpu_leave_halt_state(vcpu)) + vcpu-arch.mp_state = + KVM_MP_STATE_RUNNABLE; case KVM_MP_STATE_RUNNABLE: vcpu-arch.apf.halted = false; break; @@ -6113,9 +6120,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) !vcpu-arch.apf.halted) || !list_empty_careful(vcpu-async_pf.done) || vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED - || vcpu-arch.nmi_pending || - (kvm_arch_interrupt_allowed(vcpu) -kvm_cpu_has_interrupt(vcpu)); + || kvm_vcpu_leave_halt_state(vcpu); } void kvm_vcpu_kick(struct kvm_vcpu *vcpu) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice
On Tue, 2 Nov 2010, Takuya Yoshikawa wrote: Hi Jesper, (dropped some addresses from Cc) Jesper Juhl wrote: There's definately a positive size impact for the generated object code and we save having to do the call to memset() and the cost of a vzalloc() call looks more or less the same as a call to vmalloc() to me. This patch is not based on kvm.git, I guess. Nope. It was generated against mainline git as of a couple of days ago. I can generate a version against kvm.git if you prefer. Sorry, I should have said more clearly. kvm_vm_ioctl_get_dirty_log() has been changed since latest mainline kernel was released. Furthermore, vmalloc() in it is to be removed soon, currently in kvm's next. I have checked all vmalloc() + memset() in kvm, and there are three remaining. See the patch below. - I have tested on x86. - I can test ppc later if needed, but this is so trivial that Alex will see no problem about this, probably. So please write your Signed-off-by and ask Avi and Marcelo to apply. Thanks, Takuya === Subject: [PATCH] KVM: replace vmalloc and memset with vzalloc Let's use newly introduced vzalloc(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Signed-off-by: Jesper-juhl j...@chaosbits.net -- Jesper Juhl j...@chaosbits.net http://www.chaosbits.net/ Plain text mails only, please http://www.expita.com/nomime.html Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vfio: Extended capability fixes
- Virtual channel position gets truncated as a u8 - Print the ecap that's unknown, not the last cap we saw - Print actual config offset, which provides enough info to make some sense of the error. Signed-off-by: Alex Williamson alex.william...@redhat.com --- drivers/vfio/vfio_pci_config.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c index 8af995d..8304316 100644 --- a/drivers/vfio/vfio_pci_config.c +++ b/drivers/vfio/vfio_pci_config.c @@ -410,7 +410,7 @@ static int vfio_msi_cap_len(struct vfio_dev *vdev, u8 pos) * Determine extended capability length for VC (2 9) and * MFVC capabilities */ -static int vfio_vc_cap_len(struct vfio_dev *vdev, u8 pos) +static int vfio_vc_cap_len(struct vfio_dev *vdev, u16 pos) { struct pci_dev *pdev = vdev-pdev; u32 dw; @@ -580,7 +580,7 @@ int vfio_build_config_map(struct vfio_dev *vdev) printk(KERN_WARNING %s: pci config conflict at %x, caps %x %x\n, - __func__, i, map[pos+i], cap); + __func__, pos+i, map[pos+i], cap); map[pos+i] = cap; } ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, pos); @@ -683,7 +683,7 @@ int vfio_build_config_map(struct vfio_dev *vdev) if (len == 0 || len == 0xFF) { printk(KERN_WARNING %s: unknown length for pci ext cap %x\n, - __func__, cap); + __func__, ecap); len = PCI_CAP_SIZEOF; } for (i = 0; i len; i++) { @@ -691,7 +691,7 @@ int vfio_build_config_map(struct vfio_dev *vdev) printk(KERN_WARNING %s: pci config conflict at %x, caps %x %x\n, - __func__, i, map[epos+i], ecap); + __func__, epos+i, map[epos+i], ecap); map[epos+i] = ecap; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM test: Test the file transfer during migartion
Michael Goldish writes: On 09/25/2010 11:36 AM, Jason Wang wrote: This test just do the file transfer from host to guest during migartion in order to check whether the nic/block state could be saved and loaded correctly. Signed-off-by: Jason Wang jasow...@redhat.com --- .../kvm/tests/migration_with_file_transfer.py | 59 client/tests/kvm/tests_base.cfg.sample |7 ++ 2 files changed, 66 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py diff --git a/client/tests/kvm/tests/migration_with_file_transfer.py b/client/tests/kvm/tests/migration_with_file_transfer.py new file mode 100644 index 000..8a316bf --- /dev/null +++ b/client/tests/kvm/tests/migration_with_file_transfer.py @@ -0,0 +1,59 @@ +import logging, time +from autotest_lib.client.common_lib import utils, error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_migration_with_file_transfer(test, params, env): + +KVM migration test: +1) Get a live VM and clone it. +2) Verify that the source VM supports migration. If it does, proceed with +the test. +3) Reboot the VM +4) Send a migration command to the source VM and wait until it's finished. +5) Kill off the source VM. +6) Log into the destination VM after the migration is finished. + +@param test: kvm test object. +@param params: Dictionary with test parameters. +@param env: Dictionary with the test environment. + + +def transfer_test(vm, host_path, guest_path, timeout=120): + +vm.copy_files_to does not raise exception, so we need a wrapper +in order to make it to be used by BackgroundTest. + +if not vm.copy_files_to(host_path, guest_path, timeout=timeout): +raise error.TestError(Fail to do the file transfer!) + +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +timeout = int(params.get(login_timeout, 360)) +session = kvm_test_utils.wait_for_login(vm, timeout=timeout) + +mig_timeout = float(params.get(mig_timeout, 3600)) +mig_protocol = params.get(migration_protocol, tcp) + +guest_path = params.get(guest_path, /tmp) +file_size = params.get(file_size, 1000) +transfer_timeout = int(params.get(transfer_timeout, 240)) +bg = None + +try: +utils.run(dd if=/dev/zero of=/tmp/file bs=1M count=%s % file_size) Wouldn't it be useful to read from /dev/urandom instead of /dev/zero, and compare the md5 hash of the file generated on the host with the md5 hash of the file that arrives in the guest? Yes, as the file transfer test have been applied, I would rebase this case to use it. + +# Transfer file from host to guest +bg = kvm_test_utils.BackgroundTest(transfer_test, + (vm, /tmp/file, guest_path, +transfer_timeout)) +bg.start() + +while bg.is_alive(): +logging.info(File transfer is not ended, start a round of migration ...) +# Migrate the VM +dest_vm = kvm_test_utils.migrate(vm, env, mig_timeout, mig_protocol, False) +vm = dest_vm +finally: +if bg: bg.join() +session.close() +utils.run(rm -rf /tmp/zero) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Launch other test during migration
Michael Goldish writes: On 09/25/2010 11:36 AM, Jason Wang wrote: We could give a further test of migration by launch test during migartion. So the following series implements: - A simple class to run a specified test in the background which could be used to launch other test during migartion. Its design is rather simple and its usage is a little tricky, but it work well. - Two sample tests which take advantages of the background class: Test reboot during guest migration and test file_transfer during guest migration. In the future, we could even lauch autotest client test during guest migation. --- Jason Wang (3): KVM Test: Introduce a helper class to run a test in the background KVM test: Test reboot during migration KVM test: Test the file transfer during migartion client/tests/kvm/kvm_test_utils.py | 44 +++ .../kvm/tests/migration_with_file_transfer.py | 59 client/tests/kvm/tests/migration_with_reboot.py| 45 +++ client/tests/kvm/tests_base.cfg.sample | 12 4 files changed, 159 insertions(+), 1 deletions(-) create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py create mode 100644 client/tests/kvm/tests/migration_with_reboot.py It seems to me that this method is only applicable to tests/functions that don't require a VM object (i.e. that require only a shell session object). kvm_test_utils.reboot() operates on a VM object, and the same VM is destroyed by migrate() which runs in the background, so eventually reboot() tries logging into a destroyed VM, which fails because vm.get_address() fails. Any monitor operation will also fail. If the autotest wrapper requires a VM object (currently it does) then it can't be used either. You are right and that's an issue when running test in parallel with migration, but reboot through shell should work. The aim of this kind of test is just to add some stress ( such as run_autotest) during migartion, so most (probably all) of the tests only needs a session. So I think it's not a very big issue in this situation. An alternative (somewhat ugly) way to migrate in the background is to pass a boolean 'migrate' flag to various functions/tests, such as reboot() and run_autotest(). If 'migrate' is True, these functions will do something like vm = kvm_test_utils.migrate(vm, ...) in their waiting loops, where wait_for() is normally used. This guarantees that 'vm' is always a valid VM object. For example: # Log in after reboot while time.time() end_time: if migrate_in_bg: vm = kvm_test_utils.migrate(vm, ...) session = vm.remote_login() if session: break time.sleep(2) This is much longer than the usual wait_for(...) but it does the job. What do you think? This makes sense but it would let testcases need to care about the migration and it's also hard to put all related codes into a wrapper which would complicate the codes. Despite the issue of vm object, all tests that only depends on the shell session should works well with my method and it's more easy to be extended. Maybe we could just warn the user of its usage and adapt my method? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] powerpc: kvm: fix information leak to userland
On Sat, Oct 30, 2010 at 01:04:24PM +0400, Vasiliy Kulikov wrote: Structure kvm_ppc_pvinfo is copied to userland with flags and pad fields unitialized. It leads to leaking of contents of kernel stack memory. Signed-off-by: Vasiliy Kulikov sego...@gmail.com --- I cannot compile this driver, so it is not tested at all. arch/powerpc/kvm/powerpc.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html