Re: [PATCH 1/1] virtio_console: perf-test fix [FIX] read-out all data after perf-test [FIX] code clean-up
On (Thu) Sep 23 2010 [14:11:52], Lukas Doktor wrote:

> @@ -829,6 +832,11 @@ def run_virtio_console(test, params, env):
>              exit_event.set()
>              thread.join()
> +            # Let the guest read-out all the remaining data
> +            while not _on_guest("virt.poll('%s', %s)"
> +                                % (port[1], select.POLLIN), vm, 2)[0]:
> +                time.sleep(1)

This is just polling the guest, not reading out any data?

(BTW, POLLIN will be set if the host is disconnected and there's no data
to read, so ensure you don't enter an infinite loop here.)

		Amit
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
pci passthrough with KVM
We are using pci passthrough with an SCSI adapter card. The system is:

- O.S.: Ubuntu 10.04.1 LTS
- KVM packages:
    kvm             1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9.2
    kvm-pxe         5.4.4-1ubuntu1
    qemu-kvm        0.12.3+noroms-0ubuntu9.2
    libvirt-bin     0.7.5-5ubuntu27.2
    python-libvirt  0.7.5-5ubuntu27.2
    libvirt0        0.7.5-5ubuntu27.2
- Kernel 2.6.32.15+drm33.5.iommu, recompiled with the following options:
    CONFIG_DMAR=y
    CONFIG_INTR_REMAP=y
- AppArmor is stopped

When we start the virtual machine we get the following error:

    char device redirected to /dev/pts/3
    device: 04:04.0: driver=pci-assign host=04:04.0
    Failed to assign irq for 04:04.0: Operation not permitted
    Perhaps you are assigning a device that shares an IRQ with another device?

Thanks in advance,
Inigo
[PATCH] KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
Eliminate:

arch/x86/kvm/emulate.c:801: warning: ‘sv’ may be used uninitialized in this function

on gcc 4.1.2.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 arch/x86/kvm/emulate.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index aead72e..d0df25d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -798,7 +798,7 @@ done:
 
 static void fetch_bit_operand(struct decode_cache *c)
 {
-	long sv, mask;
+	long sv = 0, mask;
 
 	if (c->dst.type == OP_MEM && c->src.type == OP_REG) {
 		mask = ~(c->dst.bytes * 8 - 1);
-- 
1.7.0.1
[PATCH] qemu-kvm: device assignment: Reflect assigned device's MSI data/address modification to KVM
The device driver can modify the MSI data/address while MSI is enabled. This patch reflects the modification to KVM.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 hw/device-assignment.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 87f7418..1e83495 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1283,7 +1283,8 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad
 #ifdef KVM_CAP_DEVICE_MSI
     if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
         ctrl_pos = pos + PCI_MSI_FLAGS;
-        if (address <= ctrl_pos && address + len > ctrl_pos)
+        /* Cover control, MSI data/addr fields */
+        if (pos <= address && address + len <= pos + PCI_MSI_DATA_32 + 2)
             assigned_dev_update_msi(pci_dev, ctrl_pos);
         pos += PCI_CAPABILITY_CONFIG_MSI_LENGTH;
     }
-- 
1.7.0.1
Re: [PATCH] x86, nmi: workaround sti; hlt race vs nmi; intr
On 09/27/2010 11:55 PM, H. Peter Anvin wrote:
> On 09/27/2010 02:27 AM, Avi Kivity wrote:
>> I don't have write permissions for the spec. If you can verify that
>> all existing and future hardware will mask NMI after STI and convince
>> the spec owners to update the specifications, I'm all for it; it's
>> certainly a cleaner solution.
>
> I'm trying to see if this is doable. As you can well imagine, though,
> it takes time.

Joerg, can you look at this from the AMD side?

First, do AMD processors block NMI after an IF-enabling STI? The
documentation explicitly states that they don't:

  Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby
  allowing external interrupts received on the INTR input. Interrupts
  received on the non-maskable interrupt (NMI) input are not affected
  by this instruction.

though that may refer to the general effect of EFLAGS.IF, not STI.

Second, if NMIs are blocked, is it possible to retro-document this?

Personally I think the safer route is to take the patch. There are other
processors besides Intel and AMD and we can't test all of them, not to
mention various emulators and virtual machine monitors out there.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 2/7] KVM: MMU: cleanup for error mask set while walk guest page table
On 09/27/2010 10:30 PM, Avi Kivity wrote:
>> 	r = FNAME(walk_addr)(&walker, vcpu, vaddr,
>> -			     !!(access & PFERR_WRITE_MASK),
>> -			     !!(access & PFERR_USER_MASK),
>> -			     !!(access & PFERR_FETCH_MASK));
>> +			     access & PFERR_WRITE_MASK,
>> +			     access & PFERR_USER_MASK,
>> +			     access & PFERR_FETCH_MASK);
>>
>> 	if (r) {
>> 		gpa = gfn_to_gpa(walker.gfn);
>
> Interesting. Maybe a next step is to pass the page-fault error code
> instead of the various bits?

Yeah, it's a good idea, I'll post a patch to do it.

> Not sure how that interacts with nested ept (which has a different
> permission model).

Umm, we just move the error code parsing from the caller site to the
FNAME(walk_addr) function; I don't think it will cause trouble for
implementing nested EPT.
Re: [PATCH] fix kvmclock bug
Am 27.09.2010 21:00, Zachary Amsden wrote:
> On 09/25/2010 11:54 PM, Jan Kiszka wrote:
>> That only leaves us with the likely wrong unstable declaration of the
>> TSC after resume. And that raises the question for me if KVM is
>> actually that much smarter than the Linux kernel in detecting TSC
>> jumps. If something is missing, can't we improve the kernel's
>> detection mechanism which already has suspend/resume support?
>
> Linux must make the conservative choice about the TSC being declared
> unstable; if it is possible that it has become unstable, it is
> unstable. Unfortunately, this bodes not well for us, as most of the
> finer points of accuracy depend on having a stable TSC.
>
> There's a bunch of places that declare the TSC unstable, and where in
> the suspend / resume cycle that happens would depend on your actual
> hardware.

It's absolutely clear where this happens: kvm_arch_vcpu_load. And it
seems to happen as the TSC is reset due to suspend-to-RAM. Again: Linux
recovers from this and continues to use the TSC. KVM is more picky, so
my question is if this is really required.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
[PATCH] KVM: MMU: move access code parsing to FNAME(walk_addr) function
Move the access code parsing from the caller site to the FNAME(walk_addr)
function.

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
---
 arch/x86/kvm/paging_tmpl.h |   40
 1 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a83ff37..9a5f7bb 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -116,16 +116,18 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
  */
 static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				    gva_t addr, int write_fault,
-				    int user_fault, int fetch_fault)
+				    gva_t addr, u32 access)
 {
 	pt_element_t pte;
 	gfn_t table_gfn;
 	unsigned index, pt_access, uninitialized_var(pte_access);
 	gpa_t pte_gpa;
 	bool eperm, present, rsvd_fault;
-	int offset;
-	u32 access = 0;
+	int offset, write_fault, user_fault, fetch_fault;
+
+	write_fault = access & PFERR_WRITE_MASK;
+	user_fault = access & PFERR_USER_MASK;
+	fetch_fault = access & PFERR_FETCH_MASK;
 
 	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
 				     fetch_fault);
@@ -215,6 +217,7 @@ walk:
 		int lvl = walker->level;
 		gpa_t real_gpa;
 		gfn_t gfn;
+		u32 ac;
 
 		gfn = gpte_to_gfn_lvl(pte, lvl);
 		gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
@@ -224,10 +227,10 @@ walk:
 		    is_cpuid_PSE36())
 			gfn += pse36_gfn_delta(pte);
 
-		access |= write_fault | fetch_fault | user_fault;
+		ac = write_fault | fetch_fault | user_fault;
 
 		real_gpa = mmu->translate_gpa(vcpu, gfn_to_gpa(gfn),
-					      access);
+					      ac);
 		if (real_gpa == UNMAPPED_GVA)
 			return 0;
 
@@ -282,21 +285,18 @@ error:
 }
 
 static int FNAME(walk_addr)(struct guest_walker *walker,
-			    struct kvm_vcpu *vcpu, gva_t addr,
-			    int write_fault, int user_fault, int fetch_fault)
+			    struct kvm_vcpu *vcpu, gva_t addr, u32 access)
 {
 	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.mmu, addr,
-					write_fault, user_fault, fetch_fault);
+					access);
 }
 
 static int FNAME(walk_addr_nested)(struct guest_walker *walker,
 				   struct kvm_vcpu *vcpu, gva_t addr,
-				   int write_fault, int user_fault,
-				   int fetch_fault)
+				   u32 access)
 {
 	return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
-					addr, write_fault, user_fault,
-					fetch_fault);
+					addr, access);
 }
 
 static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -532,7 +532,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
-	int fetch_fault = error_code & PFERR_FETCH_MASK;
 	struct guest_walker walker;
 	u64 *sptep;
 	int write_pt = 0;
@@ -550,8 +549,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	/*
 	 * Look up the guest pte for the faulting address.
 	 */
-	r = FNAME(walk_addr)(&walker, vcpu, addr, write_fault, user_fault,
-			     fetch_fault);
+	r = FNAME(walk_addr)(&walker, vcpu, addr, error_code);
 
 	/*
 	 * The page is not mapped by the guest. Let the guest handle it.
@@ -669,10 +667,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
 	gpa_t gpa = UNMAPPED_GVA;
 	int r;
 
-	r = FNAME(walk_addr)(&walker, vcpu, vaddr,
-			     access & PFERR_WRITE_MASK,
-			     access & PFERR_USER_MASK,
-			     access & PFERR_FETCH_MASK);
+	r = FNAME(walk_addr)(&walker, vcpu, vaddr, access);
 
 	if (r) {
 		gpa = gfn_to_gpa(walker.gfn);
@@ -690,10 +685,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 	gpa_t gpa = UNMAPPED_GVA;
 	int r;
 
-	r = FNAME(walk_addr_nested)(&walker, vcpu, vaddr,
-				    access
Re: [PATCH 2/7] KVM: MMU: cleanup for error mask set while walk guest page table
On 09/28/2010 10:58 AM, Xiao Guangrong wrote:
> Yeah, it's a good idea, I'll post a patch to do it.
>
>> Not sure how that interacts with nested ept (which has a different
>> permission model).
>
> Umm, we just move the error code parsing from the caller site to the
> FNAME(walk_addr) function; I don't think it will cause trouble for
> implementing nested EPT.

It means the nept code will have to take the ept error code and convert
it to the standard error code. Not too difficult.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 1/7] KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
On 09/28/2010 04:43 AM, Xiao Guangrong wrote:
> On 09/27/2010 10:39 PM, Avi Kivity wrote:
>> On 09/27/2010 12:14 PM, Xiao Guangrong wrote:
>>> On 09/27/2010 06:02 PM, Xiao Guangrong wrote:
>>>> The value of 'vcpu->arch.mmu.pae_root' is not modified, so we can
>>>> update 'root_hpa' out of the loop.
>>>>
>>>> Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
>>>
>>> Hi Avi, Marcelo,
>>>
>>> I'm so sorry that I missed the Signed-off-by: string in all patches,
>>> I'll repost this patchset after your review. :-(
>>
>> I fixed up the signoffs and applied.
>
> Thanks Avi. I noticed you applied "KVM: MMU: rename 'sp->root_count'
> to 'sp->active_count'" (commit 55d80c448e7e3417) instead of this
> patch. :-)

Thanks for noticing! Fixed. No idea how it happened.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH] x86, nmi: workaround sti; hlt race vs nmi; intr
On Tue, Sep 28, 2010 at 04:50:24AM -0400, Avi Kivity wrote:
> On 09/27/2010 11:55 PM, H. Peter Anvin wrote:
>> On 09/27/2010 02:27 AM, Avi Kivity wrote:
>>> I don't have write permissions for the spec. If you can verify that
>>> all existing and future hardware will mask NMI after STI and
>>> convince the spec owners to update the specifications, I'm all for
>>> it; it's certainly a cleaner solution.
>>
>> I'm trying to see if this is doable. As you can well imagine, though,
>> it takes time.
>
> Joerg, can you look at this from the AMD side?

Yes, I will discuss this with the hardware people. Will take some time
here too.

	Joerg

-- 
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
Re: [PATCH] KVM: ia64: define kvm_lapic_enabled() to fix a compile error
On 09/27/2010 12:01 PM, Zhang, Xiantao wrote:
>> Maybe we should make ia64 kvm depend on CONFIG_BROKEN. It has been
>> experimental for quite a while.
>
> I don't think kvm/ia64 is broken in the upstream Linux, and it should
> work according to our last try.

When was that?

> The big issue is about userspace support. The latest qemu may not work
> with the latest kernel, but if you choose an old qemu, it should work!

There's always the chance that changes in the host kernel or
architecture-independent kvm code will cause breakage. We need regular
use or testing.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] Add svm cpuid features
On Mon, Sep 27, 2010 at 12:22:13PM -0400, Avi Kivity wrote:
> On 09/27/2010 05:40 PM, Roedel, Joerg wrote:
>> On Mon, Sep 27, 2010 at 11:02:23AM -0400, Avi Kivity wrote:
>>> On 09/27/2010 04:58 PM, Avi Kivity wrote:
>>>> On 09/27/2010 03:16 PM, Joerg Roedel wrote:
>>>>> This patch adds the svm cpuid feature flags to the qemu
>>>>> initialization path. It also adds the svm features available on
>>>>> phenom to its cpu-definition and extends the host cpu type to
>>>>> support all svm features KVM can provide.
>>>>
>>>> This should really be all svm features since we'll later mask them
>>>> with kvm capabilities. This way, if we later add more capabilities,
>>>> we automatically gain support in userspace.
>>
>> Yes, that is what -cpu host does with these patches applied. The
>> svm_features variable is set to -1 in this case and masked out later
>> with the KVM capabilities.
>
> Well, running current uq/master I get:
>
>   has_svm: can't execute cpuid 0x800a
>   kvm: no hardware support
>
> ?

Weird, it worked here as I tested it. I had it on qemu/master and with
all three patches. But patch 1 should not make the difference. I'll
take a look; have you pushed the failing uq/master? What was your
command line?

	Joerg

-- 
AMD Operating System Research Center
Re: [PATCH 3/3] Add svm cpuid features
On 09/28/2010 11:28 AM, Roedel, Joerg wrote:
> On Mon, Sep 27, 2010 at 12:22:13PM -0400, Avi Kivity wrote:
>> On 09/27/2010 05:40 PM, Roedel, Joerg wrote:
>>> On Mon, Sep 27, 2010 at 11:02:23AM -0400, Avi Kivity wrote:
>>>> On 09/27/2010 04:58 PM, Avi Kivity wrote:
>>>>> On 09/27/2010 03:16 PM, Joerg Roedel wrote:
>>>>>> This patch adds the svm cpuid feature flags to the qemu
>>>>>> initialization path. It also adds the svm features available on
>>>>>> phenom to its cpu-definition and extends the host cpu type to
>>>>>> support all svm features KVM can provide.
>>>>>
>>>>> This should really be all svm features since we'll later mask them
>>>>> with kvm capabilities. This way, if we later add more
>>>>> capabilities, we automatically gain support in userspace.
>>>
>>> Yes, that is what -cpu host does with these patches applied. The
>>> svm_features variable is set to -1 in this case and masked out later
>>> with the KVM capabilities.
>>
>> Well, running current uq/master I get:
>>
>>   has_svm: can't execute cpuid 0x800a
>>   kvm: no hardware support
>>
>> ?
>
> Weird, it worked here as I tested it. I had it on qemu/master and with
> all three patches. But patch 1 should not make the difference. I'll
> take a look; have you pushed the failing uq/master?

Yes, 8fe6a21c76.

> What was your command line?

qemu-system-x86_64 -m 2G -cpu kvm64,+svm,+npt -enable-kvm ...

Note this is qemu.git, so -enable-kvm is needed.

-- 
error compiling committee.c: too many arguments to function
[PATCH 0/3] Emulate MSI-X mask bits for assigned devices
Hi Avi & Marcelo,

This patchset adds emulation of the MSI-X mask bit for assigned devices
in QEmu.

BTW: we also propose an acceleration of the MSI-X mask bit for KVM - to
get it done in the kernel. That's because sometimes the MSI-X mask bit
is accessed very frequently by the guest, and emulation in QEmu costs a
lot (tens to hundreds of percent CPU utilization sometimes), e.g. with
a guest using Linux 2.6.18 or under some heavy workloads. The method has
been proved efficient in Xen, but it would require the VMM to handle the
MSI-X table. Is that OK with you?

-- 
regards
Yang, Sheng
[PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices
This patch enables per-vector masking for assigned devices using MSI-X.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 arch/x86/kvm/x86.c       |    1 +
 include/linux/kvm.h      |    9 -
 include/linux/kvm_host.h |    1 +
 virt/kvm/assigned-dev.c  |   39 +++
 4 files changed, 49 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8412c91..e6933e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_DEVICE_MSIX_MASK:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..f2b7cdc 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
+#ifdef __KVM_HAVE_MSIX
+#define KVM_CAP_DEVICE_MSIX_MASK 59
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr {
 };
 
 #define KVM_MAX_MSIX_PER_DEV	256
+
+#define KVM_MSIX_FLAG_MASK	1
+
 struct kvm_assigned_msix_entry {
 	__u32 assigned_dev_id;
 	__u32 gsi;
 	__u16 entry; /* The index of entry in the MSI-X table */
-	__u16 padding[3];
+	__u16 flags;
+	__u16 padding[2];
 };
 
 #endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..a43405c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier {
 };
 
 #define KVM_ASSIGNED_MSIX_PENDING	0x1
+#define KVM_ASSIGNED_MSIX_MASK		0x2
 struct kvm_guest_msix_entry {
 	u32 vector;
 	u16 entry;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..15b8c32 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -17,6 +17,8 @@
 #include <linux/pci.h>
 #include <linux/interrupt.h>
 #include <linux/slab.h>
+#include <linux/irqnr.h>
+
 #include "irq.h"
 
 static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
@@ -666,6 +668,30 @@ msix_nr_out:
 	return r;
 }
 
+static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev,
+			     int index)
+{
+	int irq;
+
+	if (!assigned_dev->dev->msix_enabled ||
+	    !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+		return;
+
+	irq = assigned_dev->host_msix_entries[index].vector;
+
+	ASSERT(irq != 0);
+
+	if (assigned_dev->guest_msix_entries[index].flags &
+			KVM_ASSIGNED_MSIX_MASK)
+		disable_irq(irq);
+	else {
+		enable_irq(irq);
+		if (assigned_dev->guest_msix_entries[index].flags &
+				KVM_ASSIGNED_MSIX_PENDING)
+			schedule_work(&assigned_dev->interrupt_work);
+	}
+}
+
 static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
 				       struct kvm_assigned_msix_entry *entry)
 {
@@ -688,6 +714,19 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
 			adev->guest_msix_entries[i].entry = entry->entry;
 			adev->guest_msix_entries[i].vector = entry->gsi;
 			adev->host_msix_entries[i].entry = entry->entry;
+			if ((entry->flags & KVM_MSIX_FLAG_MASK) &&
+			    !(adev->guest_msix_entries[i].flags &
+					KVM_ASSIGNED_MSIX_MASK)) {
+				adev->guest_msix_entries[i].flags |=
+					KVM_ASSIGNED_MSIX_MASK;
+				update_msix_mask(adev, i);
+			} else if (!(entry->flags & KVM_MSIX_FLAG_MASK) &&
+				   (adev->guest_msix_entries[i].flags &
+					KVM_ASSIGNED_MSIX_MASK)) {
+				adev->guest_msix_entries[i].flags &=
+					~KVM_ASSIGNED_MSIX_MASK;
+				update_msix_mask(adev, i);
+			}
 			break;
 		}
 	if (i == adev->entries_nr) {
-- 
1.7.0.1
[PATCH 2/3] qemu-kvm: device assignment: Some clean up for MSI-X code
Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 hw/device-assignment.c |   75 ---
 1 files changed, 51 insertions(+), 24 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 87f7418..4edae52 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1129,15 +1129,10 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos)
 #endif
 
 #ifdef KVM_CAP_DEVICE_MSIX
-static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
+static int get_msix_entries_max_nr(AssignedDevice *adev)
 {
-    AssignedDevice *adev = container_of(pci_dev, AssignedDevice, dev);
-    uint16_t entries_nr = 0, entries_max_nr;
-    int pos = 0, i, r = 0;
-    uint32_t msg_addr, msg_upper_addr, msg_data, msg_ctrl;
-    struct kvm_assigned_msix_nr msix_nr;
-    struct kvm_assigned_msix_entry msix_entry;
-    void *va = adev->msix_table_page;
+    int pos, entries_max_nr;
+    PCIDevice *pci_dev = &adev->dev;
 
     if (adev->cap.available & ASSIGNED_DEVICE_CAP_MSI)
         pos = pci_dev->cap.start + PCI_CAPABILITY_CONFIG_MSI_LENGTH;
@@ -1148,6 +1143,17 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
     entries_max_nr &= PCI_MSIX_TABSIZE;
     entries_max_nr += 1;
 
+    return entries_max_nr;
+}
+
+static int get_msix_valid_entries_nr(AssignedDevice *adev,
+                                     uint16_t entries_max_nr)
+{
+    void *va = adev->msix_table_page;
+    uint32_t msg_data, msg_ctrl;
+    uint16_t entries_nr = 0;
+    int i;
+
     /* Get the usable entry number for allocating */
     for (i = 0; i < entries_max_nr; i++) {
         memcpy(&msg_ctrl, va + i * 16 + 12, 4);
@@ -1157,11 +1163,20 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
             continue;
         entries_nr ++;
     }
+    return entries_nr;
+}
+
+static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev,
+                                         uint16_t entries_nr,
+                                         uint16_t entries_max_nr)
+{
+    AssignedDevice *adev = container_of(pci_dev, AssignedDevice, dev);
+    int i, r = 0;
+    uint32_t msg_addr, msg_upper_addr, msg_data, msg_ctrl;
+    struct kvm_assigned_msix_nr msix_nr;
+    struct kvm_assigned_msix_entry msix_entry;
+    void *va = adev->msix_table_page;
 
-    if (entries_nr == 0) {
-        fprintf(stderr, "MSI-X entry number is zero!\n");
-        return -EINVAL;
-    }
     msix_nr.assigned_dev_id = calc_assigned_dev_id(adev->h_segnr, adev->h_busnr,
                                           (uint8_t)adev->h_devfn);
     msix_nr.entry_nr = entries_nr;
@@ -1203,8 +1218,8 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
         adev->entry[entries_nr].u.msi.address_lo = msg_addr;
         adev->entry[entries_nr].u.msi.address_hi = msg_upper_addr;
         adev->entry[entries_nr].u.msi.data = msg_data;
-        DEBUG("MSI-X data 0x%x, MSI-X addr_lo 0x%x\n!", msg_data, msg_addr);
-	kvm_add_routing_entry(kvm_context, &adev->entry[entries_nr]);
+        DEBUG("MSI-X data 0x%x, MSI-X addr_lo 0x%x\n", msg_data, msg_addr);
+        kvm_add_routing_entry(kvm_context, &adev->entry[entries_nr]);
 
         msix_entry.gsi = adev->entry[entries_nr].gsi;
         msix_entry.entry = i;
@@ -1226,12 +1241,12 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
     return r;
 }
 
-static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
+static void assigned_dev_update_msix(PCIDevice *pci_dev, int enable_msix)
 {
     struct kvm_assigned_irq assigned_irq_data;
     AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev);
-    uint16_t *ctrl_word = (uint16_t *)(pci_dev->config + ctrl_pos);
     int r;
+    uint16_t entries_nr, entries_max_nr;
 
     memset(&assigned_irq_data, 0, sizeof assigned_irq_data);
     assigned_irq_data.assigned_dev_id =
@@ -1242,8 +1257,7 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
      * try to catch this by only deassigning irqs if the guest is using
      * MSIX or intends to start. */
     if ((assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_MSIX) ||
-        (*ctrl_word & PCI_MSIX_ENABLE)) {
-
+        enable_msix) {
         assigned_irq_data.flags = assigned_dev->irq_requested_type;
         free_dev_irq_entries(assigned_dev);
         r = kvm_deassign_irq(kvm_context, &assigned_irq_data);
@@ -1254,14 +1268,25 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
         assigned_dev->irq_requested_type = 0;
     }
 
-    if (*ctrl_word & PCI_MSIX_ENABLE) {
-        assigned_irq_data.flags = KVM_DEV_IRQ_HOST_MSIX |
-                                  KVM_DEV_IRQ_GUEST_MSIX;
-
-        if (assigned_dev_update_msix_mmio(pci_dev) < 0) {
+    entries_max_nr = get_msix_entries_max_nr(assigned_dev);
+    if (entries_max_nr == 0) {
+        fprintf(stderr, "assigned_dev_update_msix: MSI-X entries_max_nr == 0");
+
[PATCH 3/3] qemu-kvm: device assignment: emulate MSI-X mask bits
This patch emulates the MSI-X per-vector mask bit on assigned devices.

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 hw/device-assignment.c |  127 +++-
 1 files changed, 126 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 4edae52..564c5ab 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1146,6 +1146,7 @@ static int get_msix_entries_max_nr(AssignedDevice *adev)
     return entries_max_nr;
 }
 
+#define MSIX_VECTOR_MASK 0x1
 static int get_msix_valid_entries_nr(AssignedDevice *adev,
                                      uint16_t entries_max_nr)
 {
@@ -1159,7 +1160,11 @@ static int get_msix_valid_entries_nr(AssignedDevice *adev,
         memcpy(&msg_ctrl, va + i * 16 + 12, 4);
         memcpy(&msg_data, va + i * 16 + 8, 4);
         /* Ignore unused entry even it's unmasked */
+#ifdef KVM_CAP_DEVICE_MSIX_MASK
+        if (msg_data == 0 || (msg_ctrl & MSIX_VECTOR_MASK))
+#else
         if (msg_data == 0)
+#endif
             continue;
         entries_nr ++;
     }
@@ -1188,6 +1193,8 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev,
     }
     free_dev_irq_entries(adev);
+    memset(pci_dev->msix_entry_used, 0, KVM_MAX_MSIX_PER_DEV *
+           sizeof(*pci_dev->msix_entry_used));
     adev->irq_entries_nr = entries_nr;
     adev->entry = calloc(entries_nr, sizeof(struct kvm_irq_routing_entry));
     if (!adev->entry) {
@@ -1223,6 +1230,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev,
         msix_entry.gsi = adev->entry[entries_nr].gsi;
         msix_entry.entry = i;
+        pci_dev->msix_entry_used[i] = 1;
         r = kvm_assign_set_msix_entry(kvm_context, &msix_entry);
         if (r) {
             fprintf(stderr, "fail to set MSI-X entry! %s\n", strerror(-r));
@@ -1266,6 +1274,8 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, int enable_msix)
             perror("assigned_dev_update_msix: deassign irq");
 
         assigned_dev->irq_requested_type = 0;
+        memset(pci_dev->msix_entry_used, 0, KVM_MAX_MSIX_PER_DEV *
+               sizeof(*pci_dev->msix_entry_used));
     }
 
     entries_max_nr = get_msix_entries_max_nr(assigned_dev);
@@ -1273,10 +1283,16 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, int enable_msix)
         fprintf(stderr, "assigned_dev_update_msix: MSI-X entries_max_nr == 0");
         return;
     }
+    /*
+     * Guest may try to enable MSI-X before setting MSI-X entry done, so
+     * let's wait until guest unmask the entries.
+     */
     entries_nr = get_msix_valid_entries_nr(assigned_dev, entries_max_nr);
     if (entries_nr == 0) {
+#ifndef KVM_CAP_DEVICE_MSIX_MASK
         if (enable_msix)
             fprintf(stderr, "MSI-X entry number is zero!\n");
+#endif
         return;
     }
     if (enable_msix) {
@@ -1320,7 +1336,8 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad
         if (address <= ctrl_pos && address + len > ctrl_pos) {
             ctrl_pos--; /* control is word long */
             ctrl_word = (uint16_t *)(pci_dev->config + ctrl_pos);
-            assigned_dev_update_msix(pci_dev, (*ctrl_word & PCI_MSIX_ENABLE));
+            assigned_dev_update_msix(pci_dev,
+                (*ctrl_word & PCI_MSIX_ENABLE) && !(*ctrl_word & PCI_MSIX_MASK));
         }
         pos += PCI_CAPABILITY_CONFIG_MSIX_LENGTH;
     }
@@ -1412,6 +1429,104 @@ static uint32_t msix_mmio_readw(void *opaque, target_phys_addr_t addr)
            (8 * (addr & 3))) & 0xffff;
 }
 
+#ifdef KVM_CAP_DEVICE_MSIX_MASK
+static void msix_mmio_access_mask_bit(AssignedDevice *adev, int index)
+{
+    void *page = adev->msix_table_page;
+    uint32_t msg_ctrl, msg_data, msg_upper_addr, msg_addr;
+    struct kvm_assigned_msix_entry msix_entry;
+    int r = 0, pos, ctrl_word, entry_idx, entries_max_nr;
+    struct kvm_irq_routing_entry old_entry = {};
+
+    memcpy(&msg_addr, (char *)page + index * 16, 4);
+    memcpy(&msg_upper_addr, (char *)page + index * 16 + 4, 4);
+    memcpy(&msg_data, (char *)page + index * 16 + 8, 4);
+    memcpy(&msg_ctrl, (char *)page + index * 16 + 12, 4);
+    DEBUG("Access mask bit: MSI-X entries index %d: "
+          "msg_addr 0x%x, msg_upper_addr 0x%x, msg_data 0x%x, vec_ctl 0x%x\n",
+          index, msg_addr, msg_upper_addr, msg_data, msg_ctrl);
+
+    if (adev->cap.available & ASSIGNED_DEVICE_CAP_MSI)
+        pos = adev->dev.cap.start + PCI_CAPABILITY_CONFIG_MSI_LENGTH;
+    else
+        pos = adev->dev.cap.start;
+
+    ctrl_word = *(uint16_t *)(adev->dev.config + pos + 2);
+
+    if (!((ctrl_word & PCI_MSIX_ENABLE) && !(ctrl_word & PCI_MSIX_MASK)))
+        return;
+
+    if (!adev->dev.msix_entry_used[index] && (msg_ctrl & MSIX_VECTOR_MASK) == 0) {
+        DEBUG("Try to modify unenabled MSI-X entry %d's mask. "
+              "Reenable MSI-X.\n",
+              index);
+        assigned_dev_update_msix(&adev->dev, 1);
+
Re: [PATCH v11 17/17] add two new ioctls for mp device.
On Sat, Sep 25, 2010 at 12:27:35PM +0800, xiaohui@intel.com wrote:
> From: Xin Xiaohui <xiaohui@intel.com>
>
> The patch adds two ioctls for the mp device. One is for userspace to
> query how much memory must be locked to make the mp device run
> smoothly. Another one is for userspace to set how much memory locked
> it really wants.
>
> ---
>  drivers/vhost/mpassthru.c |  103 +++++++++++++++++++++---------------------
>  include/linux/mpassthru.h |    2 +
>  2 files changed, 54 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> index d86d94c..e3a0199 100644
> --- a/drivers/vhost/mpassthru.c
> +++ b/drivers/vhost/mpassthru.c
> @@ -67,6 +67,8 @@ static int debug;
>  #define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
>  #define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
>  
> +#define DEFAULT_NEED   ((8192*2*2)*4096)
> +
>  struct frag {
>  	u16 offset;
>  	u16 size;
> @@ -110,7 +112,8 @@ struct page_ctor {
>  	int rq_len;
>  	spinlock_t read_lock;

try documenting what fields are protected by which lock, btw.

>  	/* record the locked pages */
> -	int lock_pages;
> +	int locked_pages;
> +	int cur_pages;
>  	struct rlimit o_rlim;

unused now?

>  	struct net_device *dev;
>  	struct mpassthru_port port;

This structure name should start with mp_ to avoid namespace pollution.
Also, ctor implies a constructor function: see pgtable_page_ctor - it is
not a very good name for a structure.

> @@ -122,6 +125,7 @@ struct mp_struct {
>  	struct net_device	*dev;
>  	struct page_ctor	*ctor;
>  	struct socket		socket;
> +	struct task_struct	*user;
>  
>  #ifdef MPASSTHRU_DEBUG
>  	int debug;
> @@ -231,7 +235,8 @@ static int page_ctor_attach(struct mp_struct *mp)
>  	ctor->port.ctor = page_ctor;
>  	ctor->port.sock = &mp->socket;
>  	ctor->port.hash = mp_lookup;
> -	ctor->lock_pages = 0;
> +	ctor->locked_pages = 0;
> +	ctor->cur_pages = 0;
>  
>  	/* locked by mp_mutex */
>  	dev->mp_port = &ctor->port;
> @@ -264,37 +269,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
>  	return info;
>  }
>  
> -static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
> -			      unsigned long cur, unsigned long max)
> -{
> -	struct rlimit new_rlim, *old_rlim;
> -	int retval;
> -
> -	if (resource != RLIMIT_MEMLOCK)
> -		return -EINVAL;
> -	new_rlim.rlim_cur = cur;
> -	new_rlim.rlim_max = max;
> -
> -	old_rlim = current->signal->rlim + resource;
> -
> -	/* remember the old rlimit value when backend enabled */
> -	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
> -	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
> -
> -	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
> -	    !capable(CAP_SYS_RESOURCE))
> -		return -EPERM;
> -
> -	retval = security_task_setrlimit(resource, &new_rlim);
> -	if (retval)
> -		return retval;
> -
> -	task_lock(current->group_leader);
> -	*old_rlim = new_rlim;
> -	task_unlock(current->group_leader);
> -	return 0;
> -}
> -
>  static void relinquish_resource(struct page_ctor *ctor)
>  {
>  	if (!(ctor->dev->flags & IFF_UP) &&
> @@ -323,7 +297,7 @@ static void mp_ki_dtor(struct kiocb *iocb)
>  	} else
>  		info->ctor->wq_len--;
>  	/* Decrement the number of locked pages */
> -	info->ctor->lock_pages -= info->pnum;
> +	info->ctor->cur_pages -= info->pnum;
>  	kmem_cache_free(ext_page_info_cache, info);
>  	relinquish_resource(info->ctor);
> @@ -357,6 +331,7 @@ static int page_ctor_detach(struct mp_struct *mp)
>  {
>  	struct page_ctor *ctor;
>  	struct page_info *info;
> +	struct task_struct *tsk = mp->user;
>  	int i;
>  
>  	/* locked by mp_mutex */
> @@ -375,9 +350,9 @@ static int page_ctor_detach(struct mp_struct *mp)
>  
>  	relinquish_resource(ctor);
>  
> -	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
> -			   ctor->o_rlim.rlim_cur,
> -			   ctor->o_rlim.rlim_max);
> +	down_write(&tsk->mm->mmap_sem);
> +	tsk->mm->locked_vm -= ctor->locked_pages;
> +	up_write(&tsk->mm->mmap_sem);
>  
>  	/* locked by mp_mutex */
>  	ctor->dev->mp_port = NULL;
> @@ -514,7 +489,6 @@ static struct page_info *mp_hash_delete(struct page_ctor *ctor,
>  {
>  	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
>  	struct page_info *tmp = NULL;
> -	int i;
>  
>  	tmp = ctor->hash_table[key];
>  	while (tmp) {
> @@ -565,14 +539,11 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
>  	int rc;
>  	int i, j, n = 0;
>  	int len;
> -	unsigned long base, lock_limit;
> +	unsigned long base;
>  	struct page_info *info = NULL;
>  
> -	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> -	lock_limit >>= PAGE_SHIFT;
> -
> -	if (ctor->lock_pages + count > lock_limit && npages) {
> -
Re: [PATCH 3/3] Add svm cpuid features
On Tue, Sep 28, 2010 at 05:37:58AM -0400, Avi Kivity wrote: On 09/28/2010 11:28 AM, Roedel, Joerg wrote: Weird, it worked here as I tested it. I had it on qemu/master and with all three patches. But patch 1 should not make the difference. I take a look, have you pushed the failing uq/master? Yes, 8fe6a21c76. What was your command line? qemu-system-x86_64 -m 2G -cpu kvm64,+svm,+npt -enable-kvm ... Note this is qemu.git, so -enable-kvm is needed. Ok, I apparently forgot to force the CPUID xlevel to be 0x800A when SVM is enabled, probably because I only tested CPUID models where xlevel already defaults to 0x800A. Attached is a fix, thanks for catching this. Joerg From d18d973d037f1af1869e114aea1f076ea3a1ee7a Mon Sep 17 00:00:00 2001 From: Joerg Roedel joerg.roe...@amd.com Date: Tue, 28 Sep 2010 11:58:49 +0200 Subject: [PATCH 1/1] qemu: Force xlevel to at least 0x800A if SVM is enabled This patch forces the extended CPUID level to be at least 0x800A if SVM is enabled in the CPU model. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- target-i386/cpuid.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index 0e0bf60..0630fe1 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -714,6 +714,11 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) x86_cpu_def-ext3_features = ~minus_ext3_features; x86_cpu_def-kvm_features = ~minus_kvm_features; x86_cpu_def-svm_features = ~minus_svm_features; +if ((x86_cpu_def-ext3_features CPUID_EXT3_SVM) +(x86_cpu_def-xlevel 0x800A)) { +/* Force xlevel to at least 0x800A if SVM enabled */ +x86_cpu_def-xlevel = 0x800A; +} if (check_cpuid) { if (check_features_against_host(x86_cpu_def) enforce_cpuid) goto error; -- 1.7.0.4 -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 
43632
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Mon, Sep 27, 2010 at 02:46:32PM -0600, Alex Williamson wrote: On Wed, 2010-09-22 at 14:18 -0700, Tom Lyon wrote: +ssize_t vfio_mem_readwrite( + int write, + struct vfio_dev *vdev, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct pci_dev *pdev = vdev-pdev; + resource_size_t end; + void __iomem *io; + loff_t pos; + int pci_space; + + pci_space = vfio_offset_to_pci_space(*ppos); + pos = vfio_offset_to_pci_offset(*ppos); + + if (!pci_resource_start(pdev, pci_space)) + return -EINVAL; + end = pci_resource_len(pdev, pci_space); + if (vdev-barmap[pci_space] == NULL) + vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + io = vdev-barmap[pci_space]; + So we do a pci_iomap, but never do corresponding pci_iounmap. This also only works for the first 6 BARs since the ROM BAR needs pci_map_rom. An issue with ROM is that I think it can not be enabled together with BARs. This is why pci_read_rom/pci_write_rom do what they do. I wonder if we should be doing all the BAR mapping at open and unmap at close so that we can fail if the device can't get basic resources. I belive we should do this on ioctl so that e.g. hotunplug can reset the device clean it up. Unused device should also not consume resources. I believe we should also be calling pci_request_regions in here somewhere. 
Perhaps something like this: diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index a18e39a..d3886d9 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -85,6 +85,53 @@ static inline int overlap(int a1, int b1, int a2, int b2) return !(b2 = a1 || b1 = a2); } +static int vfio_setup_pci(struct vfio_dev *vdev) +{ + int ret, bar; + + ret = pci_enable_device(vdev-pdev); + if (ret) + return ret; + + ret = pci_request_regions(vdev-pdev, VFIO); + if (ret) { + pci_disable_device(vdev-pdev); + return ret; + } + + for (bar = PCI_STD_RESOURCES; bar = PCI_ROM_RESOURCE; bar++) { + if (!pci_resource_len(vdev-pdev, bar)) + continue; + if (bar != PCI_ROM_RESOURCE) { + if (!pci_resource_start(vdev-pdev, bar)) + continue; + vdev-barmap[bar] = pci_iomap(vdev-pdev, bar, 0); + } else { + size_t size; + vdev-barmap[bar] = pci_map_rom(vdev-pdev, size); + } + } + return ret; +} + +static void vfio_disable_pci(struct vfio_dev *vdev) +{ + int bar; + + for (bar = PCI_STD_RESOURCES; bar = PCI_ROM_RESOURCE; bar++) { + if (!vdev-barmap[bar]) + continue; + if (bar != PCI_ROM_RESOURCE) + pci_iounmap(vdev-pdev, vdev-barmap[bar]); + else + pci_unmap_rom(vdev-pdev, vdev-barmap[bar]); + vdev-barmap[bar] = NULL; + } + + pci_release_regions(vdev-pdev); + pci_disable_device(vdev-pdev); +} + static int vfio_open(struct inode *inode, struct file *filep) { struct vfio_dev *vdev; @@ -110,7 +157,7 @@ static int vfio_open(struct inode *inode, struct file *filep) INIT_LIST_HEAD(listener-dm_list); filep-private_data = listener; if (vdev-listeners == 0) - ret = pci_enable_device(vdev-pdev); + ret = vfio_setup_pci(vdev); if (ret == 0) vdev-listeners++; mutex_unlock(vdev-lgate); @@ -151,7 +198,7 @@ static int vfio_release(struct inode *inode, struct file *filep) vdev-vconfig = NULL; kfree(vdev-pci_config_map); vdev-pci_config_map = NULL; - pci_disable_device(vdev-pdev); + vfio_disable_pci(vdev); vfio_domain_unset(vdev); wake_up(vdev-dev_idle_q); } diff --git 
a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c index 1fd50a6..7705b45 100644 --- a/drivers/vfio/vfio_rdwr.c +++ b/drivers/vfio/vfio_rdwr.c @@ -64,7 +64,7 @@ ssize_t vfio_io_readwrite( if (pos + count end) return -EINVAL; if (vdev-barmap[pci_space] == NULL) - vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + return -EINVAL; io = vdev-barmap[pci_space]; while (count 0) { @@ -137,7 +137,12 @@ ssize_t vfio_mem_readwrite( return -EINVAL; end = pci_resource_len(pdev, pci_space); if (vdev-barmap[pci_space] == NULL) - vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + return -EINVAL; + if (pci_space == PCI_ROM_RESOURCE) { + u32 rom = *(u32 *)(vdev-vconfig + PCI_ROM_ADDRESS); + if (!(rom PCI_ROM_ADDRESS_ENABLE))
Re: [PATCH 3/3] Add svm cpuid features
On 09/28/2010 12:05 PM, Roedel, Joerg wrote: On Tue, Sep 28, 2010 at 05:37:58AM -0400, Avi Kivity wrote: On 09/28/2010 11:28 AM, Roedel, Joerg wrote: Weird, it worked here as I tested it. I had it on qemu/master and with all three patches. But patch 1 should not make the difference. I take a look, have you pushed the failing uq/master? Yes, 8fe6a21c76. What was your command line? qemu-system-x86_64 -m 2G -cpu kvm64,+svm,+npt -enable-kvm ... Note this is qemu.git, so -enable-kvm is needed. Ok, I apparently forgot to force the CPUID xlevel to be 0x800A when SVM is enabled, probably because I only tested CPUID models where xlevel already defaults to 0x800A. Attached is a fix, thanks for catching this. Joerg From d18d973d037f1af1869e114aea1f076ea3a1ee7a Mon Sep 17 00:00:00 2001 From: Joerg Roedeljoerg.roe...@amd.com Date: Tue, 28 Sep 2010 11:58:49 +0200 Subject: [PATCH 1/1] qemu: Force xlevel to at least 0x800A if SVM is enabled This patch forces the extended CPUID level to be at least 0x800A if SVM is enabled in the CPU model. Signed-off-by: Joerg Roedeljoerg.roe...@amd.com --- target-i386/cpuid.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index 0e0bf60..0630fe1 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -714,6 +714,11 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) x86_cpu_def-ext3_features= ~minus_ext3_features; x86_cpu_def-kvm_features= ~minus_kvm_features; x86_cpu_def-svm_features= ~minus_svm_features; +if ((x86_cpu_def-ext3_features CPUID_EXT3_SVM) +(x86_cpu_def-xlevel 0x800A)) { +/* Force xlevel to at least 0x800A if SVM enabled */ +x86_cpu_def-xlevel = 0x800A; +} if (check_cpuid) { if (check_features_against_host(x86_cpu_def) enforce_cpuid) goto error; I can't say I like the interdependency, but saying something like -cpu kvm64,+svm,xlevel=0x800a is much worse. If no one has a better idea, I'll apply the patch. 
-- error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Mon, Sep 27, 2010 at 02:57:49PM -0600, Alex Williamson wrote: On Mon, 2010-09-27 at 13:54 +0200, Michael S. Tsirkin wrote: On Wed, Sep 22, 2010 at 02:18:24PM -0700, Tom Lyon wrote: Signed-off-by: Tom Lyon p...@cisco.com Some comments on the pci bits: After going over them for the Nth time - something needs to be done with the rvirt/write tables. I doubt anyone besides me and you has gone over them: /me bites tongue... Ouch I figured if anyone looked at them he'd have complained :) -- MST
Re: 2.6.35-rc1 regression with pvclock and smp guests
Arjan Koers 0h61vkll2ly8 at xutrox.com writes: [] I've attached the printk patches for 2.6.34.1 and 2.6.35, in case anyone needs them... Move a printk that's using the clock before it's ready Fix a hang during SMP kernel boot on KVM that showed up after commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 (2.6.35) and 59aab522154a2f17b25335b63c1cf68a51fb6ae0 (2.6.34.1). The problem only occurs when CONFIG_PRINTK_TIME is set. Signed-off-by: Arjan Koers 0h61vkll2ly8 at xutrox.com diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index feaeb0d..71bf2df 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -125,12 +125,15 @@ static struct clocksource kvm_clock = { static int kvm_register_clock(char *txt) { int cpu = smp_processor_id(); - int low, high; + int low, high, ret; + low = (int)__pa(per_cpu(hv_clock, cpu)) | 1; high = ((u64)__pa(per_cpu(hv_clock, cpu)) 32); + ret = native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high); printk(KERN_INFO kvm-clock: cpu %d, msr %x:%x, %s\n, cpu, high, low, txt); - return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high); + + return ret; } #ifdef CONFIG_X86_LOCAL_APIC Folks, should this be sent to -stable kernel? It is not in any upstream kernel as far as I can see (not in linus tree too), but this is quite an issue and is hitting people The discussion were stalled quite a while ago too -- this email has Date: Mon, 02 Aug 2010 23:35:28 +0200. Thanks! /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Mon, Sep 27, 2010 at 10:23:33PM +0100, Ben Hutchings wrote: +/* The main function to transform the guest user space address + * to host kernel address via get_user_pages(). Thus the hardware + * can do DMA directly to the external buffer address. + */ +static struct page_info *alloc_page_info(struct page_ctor *ctor, + struct kiocb *iocb, struct iovec *iov, + int count, struct frag *frags, + int npages, int total) +{ + int rc; + int i, j, n = 0; + int len; + unsigned long base, lock_limit; + struct page_info *info = NULL; + + lock_limit = current-signal-rlim[RLIMIT_MEMLOCK].rlim_cur; + lock_limit = PAGE_SHIFT; + + if (ctor-lock_pages + count lock_limit npages) { + printk(KERN_INFO exceed the locked memory rlimit.); + return NULL; + } What if the process is locking pages with mlock() as well? Doesn't this allow it to lock twice as many pages as it should be able to? No, since locked_vm is incremented by both mp and mlock. Or at least, that's the idea :) In any case, twice as many would not be a big deal: admin can control which processes can do this by controlling access to the device. + info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL); + + if (!info) + return NULL; + info-skb = NULL; + info-next = info-prev = NULL; + + for (i = j = 0; i count; i++) { + base = (unsigned long)iov[i].iov_base; + len = iov[i].iov_len; + + if (!len) + continue; + n = ((base ~PAGE_MASK) + len + ~PAGE_MASK) PAGE_SHIFT; + + rc = get_user_pages_fast(base, n, npages ? 1 : 0, + info-pages[j]); + if (rc != n) + goto failed; + + while (n--) { + frags[j].offset = base ~PAGE_MASK; + frags[j].size = min_t(int, len, + PAGE_SIZE - frags[j].offset); + len -= frags[j].size; + base += frags[j].size; + j++; + } + } + +#ifdef CONFIG_HIGHMEM + if (npages !(dev-features NETIF_F_HIGHDMA)) { + for (i = 0; i j; i++) { + if (PageHighMem(info-pages[i])) + goto failed; + } + } +#endif Shouldn't you try to allocate lowmem pages explicitly, rather than failing at this point? 
We don't allocate pages, we lock given pages. Once this is integrated in macvtap presumably we'll fall back on data copy for such devices. ... + skb_reserve(skb, NET_IP_ALIGN); + skb_put(skb, len); + + if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) { + kfree_skb(skb); + return -EAGAIN; + } + + skb-protocol = eth_type_trans(skb, mp-dev); Why are you calling eth_type_trans() on transmit? So that GSO can work. BTW macvtap does: skb_set_network_header(skb, ETH_HLEN); skb_reset_mac_header(skb); skb-protocol = eth_hdr(skb)-h_proto; and I think this is broken for vlans. Arnd? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v11 17/17]add two new ioctls for mp device.
On Mon, Sep 27, 2010 at 10:36:15PM +0100, Ben Hutchings wrote: On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com The patch add two ioctls for mp device. One is for userspace to query how much memory locked to make mp device run smoothly. Another one is for userspace to set how much meory locked it really wants. [...] diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h index ba8f320..083e9f7 100644 --- a/include/linux/mpassthru.h +++ b/include/linux/mpassthru.h @@ -7,6 +7,8 @@ /* ioctl defines */ #define MPASSTHRU_BINDDEV _IOW('M', 213, int) #define MPASSTHRU_UNBINDDEV_IO('M', 214) +#define MPASSTHRU_SET_MEM_LOCKED _IOW('M', 215, unsigned long) +#define MPASSTHRU_GET_MEM_LOCKED_NEED _IOR('M', 216, unsigned long) [...] These will need compat handling. You can avoid that by defining them to use a parameter type of u32 or u64. Ben. No need to introduce compat mess unless we have to. I guess int is sufficient: locking = 2G just for guest networking is insane. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] virtio_balloon: disable oom killer when fill balloon
Balloon could cause guest memory oom killing and panic. If we disable the oom killer it will be better at least avoid guest panic. If alloc failed we can just adjust the balloon target to be equal to current number by call vdev-config-set But during test I found the config-set num_pages does not change the config actually, Should I do hacks in userspace as well? If so where should I start to hack? Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #include linux/freezer.h #include linux/delay.h #include linux/slab.h +#include linux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (t times) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO, vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru, vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ + __le32 num_pages = cpu_to_le32(vb-num_pages); + vb-vdev-config-set(vb-vdev, + offsetof(struct virtio_balloon_config, num_pages), + num_pages, sizeof(num_pages)); +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -267,9 +293,14 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); - if (diff 0) - fill_balloon(vb, diff); - else if (diff 0) + if (diff 0) { + int oom; + oom_killer_disable(); + oom = fill_balloon(vb, diff); + oom_killer_enable(); + if (oom) + update_balloon_target(vb); + } else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/18] KVM test: Add basic file transfer test
On Mon, Sep 27, 2010 at 06:43:54PM -0400, Lucas Meneghel Rodrigues wrote: From: Amos Kong ak...@redhat.com This test is the basic test of transfering file between host and guest. Try to transfer a large file from host to guest, and transfer it back to host, then compare the files by calculate their md5 hash. The default file size is 4000M, scp timeout is 1000s. It means if the average speed is less than 4M/s, this test will be fail. We can extend this test by using another disk later, then we can transfer larger files without the limit of first disk size. Changes from v1: - Use md5 to verify the integrity of files - Try to use autotest API, such as, utils.system() Signed-off-by: Amos Kong ak...@redhat.com Why scp_timeout? Not transfer_timeout? Is this really only scp file transfer to/from linux guest? Need to either name it so or generalize. Other things that need testing are NFS for linux guest, scp from windows, samba for linux and windows guests. --- client/tests/kvm/tests/file_transfer.py | 58 +++ client/tests/kvm/tests_base.cfg.sample |7 +++- 2 files changed, 64 insertions(+), 1 deletions(-) create mode 100644 client/tests/kvm/tests/file_transfer.py diff --git a/client/tests/kvm/tests/file_transfer.py b/client/tests/kvm/tests/file_transfer.py new file mode 100644 index 000..a0c6cff --- /dev/null +++ b/client/tests/kvm/tests/file_transfer.py @@ -0,0 +1,58 @@ +import logging, commands, re +from autotest_lib.client.common_lib import error +from autotest_lib.client.bin import utils +import kvm_utils, kvm_test_utils + +def run_file_transfer(test, params, env): + +Test ethrnet device function by ethtool + +1) Boot up a VM. +2) Create a large file by dd on host. +3) Copy this file from host to guest. +4) Copy this file from guest to host. +5) Check if file transfers ended good. + +@param test: KVM test object. +@param params: Dictionary with the test parameters. +@param env: Dictionary with test environment. 
+ +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +timeout=int(params.get(login_timeout, 360)) + +logging.info(Trying to log into guest '%s' by serial, vm.name) +session = kvm_utils.wait_for(lambda: vm.serial_login(), + timeout, 0, step=2) +if not session: +raise error.TestFail(Could not log into guest '%s' % vm.name) + +dir = test.tmpdir +scp_timeout = int(params.get(scp_timeout)) +cmd = dd if=/dev/urandom of=%s/a.out bs=1M count=%d % (dir, int( + params.get(filesize, 4000))) + +try: +logging.info(Create file by dd command on host, cmd: %s % cmd) +utils.run(cmd) + +logging.info(Transfer file from host to guest) +if not vm.copy_files_to(%s/a.out % dir, /tmp/b.out, +timeout=scp_timeout): /tmp/b.out won't work on windows guest, will it? maybe let guest select dest dir? +raise error.TestFail(Fail to transfer file from host to guest) + +logging.info(Transfer file from guest to host) +if not vm.copy_files_from(/tmp/b.out, %s/c.out % dir, + timeout=scp_timeout): +raise error.TestFail(Fail to transfer file from guest to host) + +logging.debug(commands.getoutput(ls -l %s/[ac].out % dir)) +md5_orig = utils.hash_file(%s/a.out % dir, method=md5) +md5_new = utils.hash_file(%s/c.out % dir, method=md5) + +if md5_orig != md5_new: +raise error.TestFail(File changed after transfer) + +finally: +session.get_command_status(rm -f /tmp/b.out) +utils.run(rm -f %s/[ac].out % dir) +session.close() diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index cf1deaf..bc014c8 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -476,6 +476,11 @@ variants: - jumbo: install setup unattended_install.cdrom type = jumbo +- file_transfer: install setup unattended_install.cdrom +type = file_transfer +filesize = 4000 +scp_timeout = 1000 + - physical_resources_check: install setup unattended_install.cdrom type = physical_resources_check catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}' @@ -1250,7 
+1255,7 @@ variants: # Windows section - @Windows: -no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo +no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo file_transfer shutdown_command = shutdown /s /f /t 0 reboot_command = shutdown /r /f /t 0 status_test_command = echo %errorlevel% -- 1.7.1 -- To unsubscribe from this
Re: [RFC] virtio_balloon: disable oom killer when fill balloon
On 09/28/2010 08:19 AM, Dave Young wrote: Balloon could cause guest memory oom killing and panic. If we disable the oom killer it will be better at least avoid guest panic. If alloc failed we can just adjust the balloon target to be equal to current number by call vdev-config-set But during test I found the config-set num_pages does not change the config actually, Should I do hacks in userspace as well? If so where should I start to hack? The guest is not supposed to set the target field in it's config. This is a host read/write, guest read-only field. The problem with your approach generally speaking is that it's unclear whether this is the right policy. For instance, what if another application held a very large allocation which caused the fill request to stop but then shortly afterwards, the aforementioned application released that allocation. If instead of just stopping, we paused and tried again later, we could potentially succeed. I think a better approach might be a graceful back off. For instance, when you hit this condition, deflate the balloon by 10% based on the assumption that you probably already gone too far. Before you attempt to allocate to the target again, set a timeout that increases in duration exponentially until you reach some maximum (say 10s). This isn't the only approach, but hopefully it conveys the idea of gracefully backing off without really giving up. 
Regards, Anthony Liguori Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #includelinux/freezer.h #includelinux/delay.h #includelinux/slab.h +#includelinux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (t times) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO,vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru,vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ + __le32 num_pages = cpu_to_le32(vb-num_pages); + vb-vdev-config-set(vb-vdev, + offsetof(struct virtio_balloon_config, num_pages), + num_pages, sizeof(num_pages)); +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -267,9 +293,14 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); - if (diff 0) - fill_balloon(vb, diff); - else if (diff 0) + if (diff 0) { + int oom; + oom_killer_disable(); + oom = fill_balloon(vb, diff); + oom_killer_enable(); + if (oom) + update_balloon_target(vb); + } else if (diff 0) leak_balloon(vb, -diff); update_balloon_size(vb); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To
Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices
On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote: This patch enable per-vector mask for assigned devices using MSI-X. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/x86.c |1 + include/linux/kvm.h |9 - include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 39 +++ 4 files changed, 49 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8412c91..e6933e6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: + case KVM_CAP_DEVICE_MSIX_MASK: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 919ae53..f2b7cdc 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo { #endif #define KVM_CAP_PPC_GET_PVINFO 57 #define KVM_CAP_PPC_IRQ_LEVEL 58 +#ifdef __KVM_HAVE_MSIX +#define KVM_CAP_DEVICE_MSIX_MASK 59 +#endif #ifdef KVM_CAP_IRQ_ROUTING @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr { }; #define KVM_MAX_MSIX_PER_DEV 256 + +#define KVM_MSIX_FLAG_MASK 1 + struct kvm_assigned_msix_entry { __u32 assigned_dev_id; __u32 gsi; __u16 entry; /* The index of entry in the MSI-X table */ - __u16 padding[3]; + __u16 flags; + __u16 padding[2]; }; #endif /* __LINUX_KVM_H */ diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0b89d00..a43405c 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier { }; #define KVM_ASSIGNED_MSIX_PENDING0x1 +#define KVM_ASSIGNED_MSIX_MASK 0x2 struct kvm_guest_msix_entry { u32 vector; u16 entry; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 7c98928..15b8c32 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -17,6 +17,8 @@ #include linux/pci.h #include linux/interrupt.h #include linux/slab.h +#include linux/irqnr.h + #include irq.h 
static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head, @@ -666,6 +668,30 @@ msix_nr_out: return r; } +static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev, + int index) +{ + int irq; + + if (!assigned_dev-dev-msix_enabled || + !(assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX)) + return; + + irq = assigned_dev-host_msix_entries[index].vector; + + ASSERT(irq != 0); + + if (assigned_dev-guest_msix_entries[index].flags + KVM_ASSIGNED_MSIX_MASK) + disable_irq(irq); + else { + enable_irq(irq); + if (assigned_dev-guest_msix_entries[index].flags + KVM_ASSIGNED_MSIX_PENDING) + schedule_work(assigned_dev-interrupt_work); + } Hmm, won't this lose interrupts which were sent while bit was pending? It is also pretty heavy if as you say guests touch the mask a lot. I think we must keep the interrupt disabled, just set a bit and delay interrupt injection until vector is unmasked or deleted. The interface to do this will need more thought: e.g. how can userspace clear this bit then? +} + static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm, struct kvm_assigned_msix_entry *entry) { @@ -688,6 +714,19 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm, adev-guest_msix_entries[i].entry = entry-entry; adev-guest_msix_entries[i].vector = entry-gsi; adev-host_msix_entries[i].entry = entry-entry; + if ((entry-flags KVM_MSIX_FLAG_MASK) + !(adev-guest_msix_entries[i].flags + KVM_ASSIGNED_MSIX_MASK)) { + adev-guest_msix_entries[i].flags |= + KVM_ASSIGNED_MSIX_MASK; + update_msix_mask(adev, i); + } else if (!(entry-flags KVM_MSIX_FLAG_MASK) + (adev-guest_msix_entries[i].flags + KVM_ASSIGNED_MSIX_MASK)) { + adev-guest_msix_entries[i].flags = + ~KVM_ASSIGNED_MSIX_MASK; + update_msix_mask(adev, i); + } break; } if (i == adev-entries_nr) { -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in
Re: [RFC] virtio_balloon: disable oom killer when fill balloon
On Tue, Sep 28, 2010 at 9:34 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 09/28/2010 08:19 AM, Dave Young wrote: Balloon could cause guest memory oom killing and panic. If we disable the oom killer it will be better at least avoid guest panic. If alloc failed we can just adjust the balloon target to be equal to current number by call vdev-config-set But during test I found the config-set num_pages does not change the config actually, Should I do hacks in userspace as well? If so where should I start to hack? Hi, Thanks your comments. The guest is not supposed to set the target field in it's config. This is a host read/write, guest read-only field. Could you tell where to set it? If so, IMHO set config api should fail, isn't it? The problem with your approach generally speaking is that it's unclear whether this is the right policy. For instance, what if another application held a very large allocation which caused the fill request to stop but then shortly afterwards, the aforementioned application released that allocation. If instead of just stopping, we paused and tried again later, we could potentially succeed. Yes, it is possible. But maybe better to do balloon from qemu monitor later? I think a better approach might be a graceful back off. For instance, when you hit this condition, deflate the balloon by 10% based on the assumption that you probably already gone too far. Before you attempt to allocate to the target again, set a timeout that increases in duration exponentially until you reach some maximum (say 10s). I'm afraid most times it will keep doing inflate/deflate circularly. This isn't the only approach, but hopefully it conveys the idea of gracefully backing off without really giving up. 
Regards, Anthony Liguori Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #includelinux/freezer.h #includelinux/delay.h #includelinux/slab.h +#includelinux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (t times) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO,vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru,vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ + __le32 num_pages = cpu_to_le32(vb-num_pages); + vb-vdev-config-set(vb-vdev, + offsetof(struct virtio_balloon_config, num_pages), + num_pages, sizeof(num_pages)); +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -267,9 +293,14 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); - if (diff 0) - fill_balloon(vb, diff); - else if (diff 0) + if (diff 0) { + int oom; + oom_killer_disable(); + oom = fill_balloon(vb, diff); + oom_killer_enable(); + if
Re: [RFC] virtio_balloon: disable oom killer when fill balloon
On Tue, Sep 28, 2010 at 9:49 PM, Dave Young hidave.darks...@gmail.com wrote: On Tue, Sep 28, 2010 at 9:34 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 09/28/2010 08:19 AM, Dave Young wrote: Balloon could cause guest memory oom killing and panic. If we disable the oom killer it will be better at least avoid guest panic. If alloc failed we can just adjust the balloon target to be equal to current number by call vdev-config-set But during test I found the config-set num_pages does not change the config actually, Should I do hacks in userspace as well? If so where should I start to hack? Hi, Thanks your comments. The guest is not supposed to set the target field in it's config. This is a host read/write, guest read-only field. Could you tell where to set it? If so, IMHO set config api should fail, isn't it? Get it, in hw/virtio-balloon.c function virtio_balloon_set_config The problem with your approach generally speaking is that it's unclear whether this is the right policy. For instance, what if another application held a very large allocation which caused the fill request to stop but then shortly afterwards, the aforementioned application released that allocation. If instead of just stopping, we paused and tried again later, we could potentially succeed. Yes, it is possible. But maybe better to do balloon from qemu monitor later? I think a better approach might be a graceful back off. For instance, when you hit this condition, deflate the balloon by 10% based on the assumption that you probably already gone too far. Before you attempt to allocate to the target again, set a timeout that increases in duration exponentially until you reach some maximum (say 10s). I'm afraid most times it will keep doing inflate/deflate circularly. This isn't the only approach, but hopefully it conveys the idea of gracefully backing off without really giving up. 
Regards, Anthony Liguori Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #includelinux/freezer.h #includelinux/delay.h #includelinux/slab.h +#includelinux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (t times) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO,vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru,vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ + __le32 num_pages = cpu_to_le32(vb-num_pages); + vb-vdev-config-set(vb-vdev, + offsetof(struct virtio_balloon_config, num_pages), + num_pages, sizeof(num_pages)); +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -267,9 +293,14 @@ static int balloon(void *_vballoon) || freezing(current)); if (vb-need_stats_update) stats_handle_request(vb); - if (diff 0) - fill_balloon(vb, diff); - else if (diff 0) + if (diff 0) { + int oom; +
Re: KVM call agenda for Sept 28
On 09/28/2010 12:14 AM, Chris Wright wrote: Please send in any agenda items you are interested in covering. no agenda - no call. -- error compiling committee.c: too many arguments to function
Re: [RFC] virtio_balloon: disable oom killer when fill balloon
On 09/28/2010 08:49 AM, Dave Young wrote: On Tue, Sep 28, 2010 at 9:34 PM, Anthony Liguorianth...@codemonkey.ws wrote: On 09/28/2010 08:19 AM, Dave Young wrote: Balloon could cause guest memory oom killing and panic. If we disable the oom killer it will be better at least avoid guest panic. If alloc failed we can just adjust the balloon target to be equal to current number by call vdev-config-set But during test I found the config-set num_pages does not change the config actually, Should I do hacks in userspace as well? If so where should I start to hack? Hi, Thanks your comments. The guest is not supposed to set the target field in it's config. This is a host read/write, guest read-only field. Could you tell where to set it? If so, IMHO set config api should fail, isn't it? The problem with your approach generally speaking is that it's unclear whether this is the right policy. For instance, what if another application held a very large allocation which caused the fill request to stop but then shortly afterwards, the aforementioned application released that allocation. If instead of just stopping, we paused and tried again later, we could potentially succeed. Yes, it is possible. But maybe better to do balloon from qemu monitor later? It's part of the specification, not something that's enforced or even visible within the APIs. I think a better approach might be a graceful back off. For instance, when you hit this condition, deflate the balloon by 10% based on the assumption that you probably already gone too far. Before you attempt to allocate to the target again, set a timeout that increases in duration exponentially until you reach some maximum (say 10s). I'm afraid most times it will keep doing inflate/deflate circularly. With sufficiently large timeouts, does it matter? The other side of the argument is that the host should be more careful about doing balloon requests to the guest. 
Alternatively, you can argue that the guest should be able to balloon itself and that's where the logic should be. But I think split policy across the guest and host would prove to be prohibitively complex to deal with. Regards, Anthony Liguori This isn't the only approach, but hopefully it conveys the idea of gracefully backing off without really giving up. Regards, Anthony Liguori Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #includelinux/freezer.h #includelinux/delay.h #includelinux/slab.h +#includelinux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (ttimes) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO,vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru,vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ + __le32 num_pages = cpu_to_le32(vb-num_pages); + vb-vdev-config-set(vb-vdev, + offsetof(struct virtio_balloon_config, num_pages), +num_pages, sizeof(num_pages)); +} + static int balloon(void *_vballoon) { struct virtio_balloon *vb = _vballoon; @@ -267,9 +293,14 @@ static int
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Tue, 2010-09-28 at 12:09 +0200, Michael S. Tsirkin wrote: On Mon, Sep 27, 2010 at 02:46:32PM -0600, Alex Williamson wrote: On Wed, 2010-09-22 at 14:18 -0700, Tom Lyon wrote: +ssize_t vfio_mem_readwrite( + int write, + struct vfio_dev *vdev, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct pci_dev *pdev = vdev-pdev; + resource_size_t end; + void __iomem *io; + loff_t pos; + int pci_space; + + pci_space = vfio_offset_to_pci_space(*ppos); + pos = vfio_offset_to_pci_offset(*ppos); + + if (!pci_resource_start(pdev, pci_space)) + return -EINVAL; + end = pci_resource_len(pdev, pci_space); + if (vdev-barmap[pci_space] == NULL) + vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + io = vdev-barmap[pci_space]; + So we do a pci_iomap, but never do corresponding pci_iounmap. This also only works for the first 6 BARs since the ROM BAR needs pci_map_rom. An issue with ROM is that I think it can not be enabled together with BARs. This is why pci_read_rom/pci_write_rom do what they do. I don't think that's an issue. pci_read/write_rom doesn't do anything about disabling BARs when the ROM is read. The reason that it maps and unmaps around the read is to be resource friendly. For vfio, since we've assigned a device to vfio and it's open, I think it's valid to assign all the resources and sit on them. Otherwise I don't think we can guarantee that a read that worked last time will still get resources and work this time. I wonder if we should be doing all the BAR mapping at open and unmap at close so that we can fail if the device can't get basic resources. I belive we should do this on ioctl so that e.g. hotunplug can reset the device clean it up. Unused device should also not consume resources. If it's assigned to vfio and opened, then it's not unused. A hotunplug would involve a close of the vfio device, which could then reset the device. 
I think we probably are still missing a device reset, but I'm not sure if it needs an ioctl or if a reset on open would be sufficient. Alex I believe we should also be calling pci_request_regions in here somewhere. Perhaps something like this: diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index a18e39a..d3886d9 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -85,6 +85,53 @@ static inline int overlap(int a1, int b1, int a2, int b2) return !(b2 = a1 || b1 = a2); } +static int vfio_setup_pci(struct vfio_dev *vdev) +{ + int ret, bar; + + ret = pci_enable_device(vdev-pdev); + if (ret) + return ret; + + ret = pci_request_regions(vdev-pdev, VFIO); + if (ret) { + pci_disable_device(vdev-pdev); + return ret; + } + + for (bar = PCI_STD_RESOURCES; bar = PCI_ROM_RESOURCE; bar++) { + if (!pci_resource_len(vdev-pdev, bar)) + continue; + if (bar != PCI_ROM_RESOURCE) { + if (!pci_resource_start(vdev-pdev, bar)) + continue; + vdev-barmap[bar] = pci_iomap(vdev-pdev, bar, 0); + } else { + size_t size; + vdev-barmap[bar] = pci_map_rom(vdev-pdev, size); + } + } + return ret; +} + +static void vfio_disable_pci(struct vfio_dev *vdev) +{ + int bar; + + for (bar = PCI_STD_RESOURCES; bar = PCI_ROM_RESOURCE; bar++) { + if (!vdev-barmap[bar]) + continue; + if (bar != PCI_ROM_RESOURCE) + pci_iounmap(vdev-pdev, vdev-barmap[bar]); + else + pci_unmap_rom(vdev-pdev, vdev-barmap[bar]); + vdev-barmap[bar] = NULL; + } + + pci_release_regions(vdev-pdev); + pci_disable_device(vdev-pdev); +} + static int vfio_open(struct inode *inode, struct file *filep) { struct vfio_dev *vdev; @@ -110,7 +157,7 @@ static int vfio_open(struct inode *inode, struct file *filep) INIT_LIST_HEAD(listener-dm_list); filep-private_data = listener; if (vdev-listeners == 0) - ret = pci_enable_device(vdev-pdev); + ret = vfio_setup_pci(vdev); if (ret == 0) vdev-listeners++; mutex_unlock(vdev-lgate); @@ -151,7 +198,7 @@ static int vfio_release(struct inode *inode, struct file *filep) 
vdev-vconfig = NULL; kfree(vdev-pci_config_map); vdev-pci_config_map = NULL; - pci_disable_device(vdev-pdev); + vfio_disable_pci(vdev); vfio_domain_unset(vdev); wake_up(vdev-dev_idle_q); } diff --git a/drivers/vfio/vfio_rdwr.c b/drivers/vfio/vfio_rdwr.c index 1fd50a6..7705b45 100644 --- a/drivers/vfio/vfio_rdwr.c +++
KVM: VMX: Add AX to list of registers clobbered by guest switch
By chance this caused no harm so far. We overwrite AX during switch to/from guest context, so we must declare this. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/vmx.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1a5ecfd..e315930 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4013,7 +4013,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu) #endif [cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2)) : "cc", "memory" - , R"bx", R"di", R"si" + , R"ax", R"bx", R"di", R"si" #ifdef CONFIG_X86_64 , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15" #endif -- 1.7.1
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote: + skb_reserve(skb, NET_IP_ALIGN); + skb_put(skb, len); + + if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) { + kfree_skb(skb); + return -EAGAIN; + } + + skb->protocol = eth_type_trans(skb, mp->dev); Why are you calling eth_type_trans() on transmit? So that GSO can work. BTW macvtap does: skb_set_network_header(skb, ETH_HLEN); skb_reset_mac_header(skb); skb->protocol = eth_hdr(skb)->h_proto; and I think this is broken for vlans. Arnd? Hmm, that code (besides set_network_header) was added by Sridhar for GSO support. I believe I originally did eth_type_trans but had to change it before that time because it broke something. Unfortunately, my memory on that is not very good any more. Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? Arnd
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Tue, Sep 28, 2010 at 08:26:21AM -0600, Alex Williamson wrote: On Tue, 2010-09-28 at 12:09 +0200, Michael S. Tsirkin wrote: On Mon, Sep 27, 2010 at 02:46:32PM -0600, Alex Williamson wrote: On Wed, 2010-09-22 at 14:18 -0700, Tom Lyon wrote: +ssize_t vfio_mem_readwrite( + int write, + struct vfio_dev *vdev, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct pci_dev *pdev = vdev-pdev; + resource_size_t end; + void __iomem *io; + loff_t pos; + int pci_space; + + pci_space = vfio_offset_to_pci_space(*ppos); + pos = vfio_offset_to_pci_offset(*ppos); + + if (!pci_resource_start(pdev, pci_space)) + return -EINVAL; + end = pci_resource_len(pdev, pci_space); + if (vdev-barmap[pci_space] == NULL) + vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + io = vdev-barmap[pci_space]; + So we do a pci_iomap, but never do corresponding pci_iounmap. This also only works for the first 6 BARs since the ROM BAR needs pci_map_rom. An issue with ROM is that I think it can not be enabled together with BARs. This is why pci_read_rom/pci_write_rom do what they do. I don't think that's an issue. pci_read/write_rom doesn't do anything about disabling BARs when the ROM is read. Oh yes it does: it calls pci_map rom which calls pci_enable_rom. And this enables ROM space decoder, which, PCI spec says, can disable BAR decoder in device. See it now? The reason that it maps and unmaps around the read is to be resource friendly. Not only that. you can not use both ROM and BARs at the same time: if you enable ROM you must not access BARs. For vfio, since we've assigned a device to vfio and it's open, I think it's valid to assign all the resources and sit on them. Otherwise I don't think we can guarantee that a read that worked last time will still get resources and work this time. The issue is that BARs are not accessible while ROM is enabled. 
It might be a security hole to try to do that from an app, and I do know there is hardware where it will not work, and PCI spec says as much: 6.2.5.2: In order to minimize the number of address decoders needed, a device may share a decoder between the Expansion ROM Base Address register and other Base Address registers.47 When expansion ROM decode is enabled, the decoder is used for accesses to the expansion ROM and device independent software must not access the device through any other Base Address registers. Thus I think applications must shadow the ROM and kernel must prevent ROM access after mapping BARs. We could shadow in kernel as well but this would be expensive in memory as ROMs are sometimes quite large. I wonder if we should be doing all the BAR mapping at open and unmap at close so that we can fail if the device can't get basic resources. I belive we should do this on ioctl so that e.g. hotunplug can reset the device clean it up. Unused device should also not consume resources. If it's assigned to vfio and opened, then it's not unused. A hotunplug would involve a close of the vfio device, which could then reset the device. I think we probably are still missing a device reset, but I'm not sure if it needs an ioctl or if a reset on open would be sufficient. Alex You are only looking at qemu, but look at the whole stack: First, I think we want reset on close, so the device is in sane state when it's returned to the OS. Second, we typically will want libvirt to open the device and pass it to qemu, qemu might start using it much later, and it might keep it around, unused, for as long as it wants. So we do want a way to reset without close. Finally, we will want a way for libvirt to verify that qemu has closed the device, so it's safe to pass it to another guest. I believe we should also be calling pci_request_regions in here somewhere. 
Perhaps something like this: diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index a18e39a..d3886d9 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -85,6 +85,53 @@ static inline int overlap(int a1, int b1, int a2, int b2) return !(b2 = a1 || b1 = a2); } +static int vfio_setup_pci(struct vfio_dev *vdev) +{ + int ret, bar; + + ret = pci_enable_device(vdev-pdev); + if (ret) + return ret; + + ret = pci_request_regions(vdev-pdev, VFIO); + if (ret) { + pci_disable_device(vdev-pdev); + return ret; + } + + for (bar = PCI_STD_RESOURCES; bar = PCI_ROM_RESOURCE; bar++) { + if (!pci_resource_len(vdev-pdev, bar)) + continue; + if (bar != PCI_ROM_RESOURCE) { +
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: On Tuesday 28 September 2010, Michael S. Tsirkin wrote: + skb_reserve(skb, NET_IP_ALIGN); + skb_put(skb, len); + + if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) { + kfree_skb(skb); + return -EAGAIN; + } + + skb->protocol = eth_type_trans(skb, mp->dev); Why are you calling eth_type_trans() on transmit? So that GSO can work. BTW macvtap does: skb_set_network_header(skb, ETH_HLEN); skb_reset_mac_header(skb); skb->protocol = eth_hdr(skb)->h_proto; and I think this is broken for vlans. Arnd? Hmm, that code (besides set_network_header) was added by Sridhar for GSO support. I believe I originally did eth_type_trans but had to change it before that time because it broke something. Unfortunately, my memory on that is not very good any more. Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? Arnd I expect the protocol value to be wrong when guest sends vlan tagged frames as 802.1q frames have a different format. -- MST
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On Tuesday 28 September 2010, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: Can you be more specific what the problem is? Do you think it breaks when a guest sends VLAN tagged frames or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? I expect the protocol value to be wrong when guest sends vlan tagged frames as 802.1q frames have a different format. Ok, I see. Would that be fixed by using eth_type_trans()? I don't see any code in there that tries to deal with the VLAN tag, so do we have the same problem in the tun/tap driver? Also, I wonder how we handle the case where both the guest and the host do VLAN tagging. Does the host transparently override the guest tag, or does it add a nested tag? More importantly, what should it do? Arnd
Re: [PATCH] x86, nmi: workaround sti; hlt race vs nmi; intr
On 09/28/2010 01:50 AM, Avi Kivity wrote: Personally I think the safer route is to take the patch. There are other processors besides Intel and AMD and we can't test all of them, not to mention various emulators and virtual machine monitors out there. Speaking for the smoldering crater that used to be *Transmeta*, I'm (from memory) quite certain they blocked NMI and that this was intentional behavior. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [PATCH] x86, nmi: workaround sti; hlt race vs nmi; intr
On 09/28/2010 05:34 PM, H. Peter Anvin wrote: On 09/28/2010 01:50 AM, Avi Kivity wrote: Personally I think the safer route is to take the patch. There are other processors besides Intel and AMD and we can't test all of them, not to mention various emulators and virtual machine monitors out there. Speaking for the smoldering crater that used to be *Transmeta*, I'm (from memory) quite certain they blocked NMI and that this was intentional behavior. We'll need: - Intel - AMD - Geode - Via - kvm (currently no, but plan to) - qemu - vmware - others? It should be relatively simple to write a small test case to test this. -- error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Tue, 2010-09-28 at 16:40 +0200, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 08:26:21AM -0600, Alex Williamson wrote: On Tue, 2010-09-28 at 12:09 +0200, Michael S. Tsirkin wrote: On Mon, Sep 27, 2010 at 02:46:32PM -0600, Alex Williamson wrote: On Wed, 2010-09-22 at 14:18 -0700, Tom Lyon wrote: +ssize_t vfio_mem_readwrite( + int write, + struct vfio_dev *vdev, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct pci_dev *pdev = vdev-pdev; + resource_size_t end; + void __iomem *io; + loff_t pos; + int pci_space; + + pci_space = vfio_offset_to_pci_space(*ppos); + pos = vfio_offset_to_pci_offset(*ppos); + + if (!pci_resource_start(pdev, pci_space)) + return -EINVAL; + end = pci_resource_len(pdev, pci_space); + if (vdev-barmap[pci_space] == NULL) + vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + io = vdev-barmap[pci_space]; + So we do a pci_iomap, but never do corresponding pci_iounmap. This also only works for the first 6 BARs since the ROM BAR needs pci_map_rom. An issue with ROM is that I think it can not be enabled together with BARs. This is why pci_read_rom/pci_write_rom do what they do. I don't think that's an issue. pci_read/write_rom doesn't do anything about disabling BARs when the ROM is read. Oh yes it does: it calls pci_map rom which calls pci_enable_rom. And this enables ROM space decoder, which, PCI spec says, can disable BAR decoder in device. See it now? Ok, I see the comment there now, thanks. The reason that it maps and unmaps around the read is to be resource friendly. Not only that. you can not use both ROM and BARs at the same time: if you enable ROM you must not access BARs. So we either need to shadow the ROM if we want to support an mmap to it or map/unmap around reads. For vfio, since we've assigned a device to vfio and it's open, I think it's valid to assign all the resources and sit on them. Otherwise I don't think we can guarantee that a read that worked last time will still get resources and work this time. 
The issue is that BARs are not accessible while ROM is enabled. It might be a security hole to try to do that from an app, and I do know there is hardware where it will not work, and PCI spec says as much: 6.2.5.2: In order to minimize the number of address decoders needed, a device may share a decoder between the Expansion ROM Base Address register and other Base Address registers.47 When expansion ROM decode is enabled, the decoder is used for accesses to the expansion ROM and device independent software must not access the device through any other Base Address registers. Thus I think applications must shadow the ROM and kernel must prevent ROM access after mapping BARs. We could shadow in kernel as well but this would be expensive in memory as ROMs are sometimes quite large. Shadowing in the VFIO kernel driver seems too resource intensive, probably best to disable mmap for the ROM and only allow read write, letting the app shadow if it wants. Preventing BAR access is troublesome since we support mmaps. I don't like the idea of overly restrictive ordering, but should we shutdown ROM access after the domain is set(?) I wonder if we should be doing all the BAR mapping at open and unmap at close so that we can fail if the device can't get basic resources. I belive we should do this on ioctl so that e.g. hotunplug can reset the device clean it up. Unused device should also not consume resources. If it's assigned to vfio and opened, then it's not unused. A hotunplug would involve a close of the vfio device, which could then reset the device. I think we probably are still missing a device reset, but I'm not sure if it needs an ioctl or if a reset on open would be sufficient. Alex You are only looking at qemu, but look at the whole stack: First, I think we want reset on close, so the device is in sane state when it's returned to the OS. Yep, I'd agree with that. 
Second, we typically will want libvirt to open the device and pass it to qemu, qemu might start using it much later, and it might keep it around, unused, for as long as it wants. So we do want a way to reset without close. I'm fine with that, but we also need a reset on open to prevent leaking host state to guest/userspace. Finally, we will want a way for libvirt to verify that qemu has closed the device, so it's safe to pass it to another guest. This gets tricky since we'd really like a revoke to insure there isn't a mmap still lingering. What do you propose? I believe we should also be calling pci_request_regions in here somewhere. Perhaps
Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Tue, Sep 28, 2010 at 11:10:42AM -0600, Alex Williamson wrote: On Tue, 2010-09-28 at 16:40 +0200, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 08:26:21AM -0600, Alex Williamson wrote: On Tue, 2010-09-28 at 12:09 +0200, Michael S. Tsirkin wrote: On Mon, Sep 27, 2010 at 02:46:32PM -0600, Alex Williamson wrote: On Wed, 2010-09-22 at 14:18 -0700, Tom Lyon wrote: +ssize_t vfio_mem_readwrite( + int write, + struct vfio_dev *vdev, + char __user *buf, + size_t count, + loff_t *ppos) +{ + struct pci_dev *pdev = vdev-pdev; + resource_size_t end; + void __iomem *io; + loff_t pos; + int pci_space; + + pci_space = vfio_offset_to_pci_space(*ppos); + pos = vfio_offset_to_pci_offset(*ppos); + + if (!pci_resource_start(pdev, pci_space)) + return -EINVAL; + end = pci_resource_len(pdev, pci_space); + if (vdev-barmap[pci_space] == NULL) + vdev-barmap[pci_space] = pci_iomap(pdev, pci_space, 0); + io = vdev-barmap[pci_space]; + So we do a pci_iomap, but never do corresponding pci_iounmap. This also only works for the first 6 BARs since the ROM BAR needs pci_map_rom. An issue with ROM is that I think it can not be enabled together with BARs. This is why pci_read_rom/pci_write_rom do what they do. I don't think that's an issue. pci_read/write_rom doesn't do anything about disabling BARs when the ROM is read. Oh yes it does: it calls pci_map rom which calls pci_enable_rom. And this enables ROM space decoder, which, PCI spec says, can disable BAR decoder in device. See it now? Ok, I see the comment there now, thanks. The reason that it maps and unmaps around the read is to be resource friendly. Not only that. you can not use both ROM and BARs at the same time: if you enable ROM you must not access BARs. So we either need to shadow the ROM if we want to support an mmap to it or map/unmap around reads. For vfio, since we've assigned a device to vfio and it's open, I think it's valid to assign all the resources and sit on them. 
Otherwise I don't think we can guarantee that a read that worked last time will still get resources and work this time. The issue is that BARs are not accessible while ROM is enabled. It might be a security hole to try to do that from an app, and I do know there is hardware where it will not work, and PCI spec says as much: 6.2.5.2: In order to minimize the number of address decoders needed, a device may share a decoder between the Expansion ROM Base Address register and other Base Address registers.47 When expansion ROM decode is enabled, the decoder is used for accesses to the expansion ROM and device independent software must not access the device through any other Base Address registers. Thus I think applications must shadow the ROM and kernel must prevent ROM access after mapping BARs. We could shadow in kernel as well but this would be expensive in memory as ROMs are sometimes quite large. Shadowing in the VFIO kernel driver seems too resource intensive, probably best to disable mmap for the ROM and only allow read write, letting the app shadow if it wants. Preventing BAR access is troublesome since we support mmaps. I don't like the idea of overly restrictive ordering, but should we shutdown ROM access after the domain is set(?) We probably must disable it on BAR mmap. Further existing sysfs lacks locking so I think it's racy wrt multiple readers, etc. I wonder if we should be doing all the BAR mapping at open and unmap at close so that we can fail if the device can't get basic resources. I belive we should do this on ioctl so that e.g. hotunplug can reset the device clean it up. Unused device should also not consume resources. If it's assigned to vfio and opened, then it's not unused. A hotunplug would involve a close of the vfio device, which could then reset the device. I think we probably are still missing a device reset, but I'm not sure if it needs an ioctl or if a reset on open would be sufficient. 
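A toy model of the §6.2.5.2 constraint quoted above — one address decoder shared between the Expansion ROM BAR and the other BARs — and of the shadow-then-disable ordering the application would have to follow. Purely illustrative Python with invented names; this is not VFIO code:

```python
# Toy model of PCI 6.2.5.2: a device may share one decoder between the
# Expansion ROM BAR and the other BARs, so BAR accesses are undefined
# while expansion ROM decode is enabled. Names are invented for this sketch.

class SharedDecoderDevice:
    def __init__(self, rom, bar0):
        self._rom = bytes(rom)
        self._bar0 = bytes(bar0)
        self.rom_decode_enabled = False

    def read_rom(self):
        if not self.rom_decode_enabled:
            raise RuntimeError("ROM decode disabled")
        return self._rom

    def read_bar0(self):
        if self.rom_decode_enabled:
            # The decoder is routed to the ROM: BAR reads are undefined.
            raise RuntimeError("BAR access while ROM decode is enabled")
        return self._bar0


def shadow_rom_then_use_bars(dev):
    """Safe ordering: shadow the ROM once, disable ROM decode, then use BARs."""
    dev.rom_decode_enabled = True
    shadow = dev.read_rom()          # application-side shadow copy
    dev.rom_decode_enabled = False   # give the decoder back to the BARs
    return shadow, dev.read_bar0()
```

This is the ordering argument in miniature: once BARs are mmap'ed, nothing stops a concurrent ROM read from stealing the decoder, which is why the thread leans toward read/write-only ROM access with the shadow kept in the application.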
Alex You are only looking at qemu, but look at the whole stack: First, I think we want reset on close, so the device is in sane state when it's returned to the OS. Yep, I'd agree with that. Second, we typically will want libvirt to open the device and pass it to qemu, qemu might start using it much later, and it might keep it around, unused, for as long as it wants. So we do want a way to reset without close. I'm fine with that, but we also need a reset on open to prevent leaking host state to guest/userspace. Can't hurt ... Finally, we will want a way for libvirt to verify that
qemu has closed the device, so it's safe to pass it to another guest. This gets tricky since we'd really like a revoke to insure there isn't a mmap still lingering. What do you propose? I believe we should also be calling pci_request_regions in here somewhere.
Support for nested vmx
Hi, I work at a university computer lab and would like to create Linux virtual machines using KVM on a big Intel VT Linux server we have. Each virtual machine would be assigned to a group of students, who will use those VMs to study and to create further Linux virtual machines using KVM inside them. Searching the list archive, I saw that nested KVM VMs are possible on AMD processors but not yet on Intel ones. Is there some place where I could get patches/instructions to test nested KVM VMs on Intel VT processors? Thanks. __ Vandeir Eduardo Laboratório de Computação e Informática (LCI) - Campus III Fundacao Universidade Regional de Blumenau (FURB) Blumenau, SC, Brasil.
Re: pci passthrough with KVM
On Tue, Sep 28, 2010 at 2:27 AM, Inigo Losada ilos...@ibex.es wrote: We are using pci passthrough with an SCSI Adapter card. The system is: - O.S: Ubuntu 10.04.1 LTS - KVM Packages: kvm 1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9.2 kvm-pxe 5.4.4-1ubuntu1 qemu-kvm 0.12.3+noroms-0ubuntu9.2 libvirt-bin 0.7.5-5ubuntu27.2 python-libvirt 0.7.5-5ubuntu27.2 libvirt0 0.7.5-5ubuntu27.2 - Kernel 2.6.32.15+drm33.5.iommu recompiled with following options : CONFIG_DMAR=y CONFIG_INTR_REMAP=y - Apparmor is stopped When we started the virtual machine we obtain the following error: char device redirected to /dev/pts/3 device: 04:04.0: driver=pci-assign host=04:04.0 Failed to assign irq for 04:04.0: Operation not permitted Perhaps you are assigning a device that shares an IRQ with another device? Does the device share an IRQ with another device? If not, does it work if you launch qemu manually with sudo? It might be that the kernel and tool set aren't new enough to fully support running a de-privileged guest with device assignment. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
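One way to check Alex's question on the host is to look for IRQ lines in /proc/interrupts that list more than one handler name (comma-separated) — those IRQs are shared, and legacy INTx assignment will refuse them. A hedged Python sketch, run here against sample text rather than the live file; the real file's column count varies with the number of CPUs and kernel version, so treat the parsing as illustrative:

```python
def shared_irqs(text):
    """Return {irq: [handler names]} for IRQ lines claimed by >1 device.
    Assumes the single-CPU layout of the sample below (one counter column,
    one interrupt-chip column, then the action name list)."""
    shared = {}
    for line in text.splitlines():
        left, _, rest = line.partition(":")
        if not left.strip().isdigit():
            continue  # skip the CPU header and NMI/LOC/... rows
        fields = rest.split()
        # After the counter and chip columns, shared lines join the
        # handler names with ", ".
        names = " ".join(fields[2:]).split(", ")
        if len(names) > 1:
            shared[int(left.strip())] = names
    return shared

sample = """\
           CPU0
  0:     123456   IO-APIC-edge      timer
 16:       4242   IO-APIC-fasteoi   uhci_hcd:usb3, mpt2sas
"""
```

If the SCSI adapter shows up on a line like IRQ 16 above, moving the card to another slot (or detaching the sharing driver) is the usual fix.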
Re: how to debug unhandled vm exit: 0x11?
On Tue, Jul 27, 2010 at 3:04 AM, Avi Kivity a...@redhat.com wrote: On 07/26/2010 08:58 PM, ewheeler wrote: On 07/26/2010 07:01 PM, Neo Jia wrote: hi, I am seeing an unhandled vm exit: 0x11 on Win7 with the KVM-88 release and wondering if I am still able to dump the code from the guest OS when this happens. But it looks like all instructions are 0s after adding one more print statement after dumping the guest registers. It is very likely that this problem is fixed in the latest qemu code base, but I would still like to know how to debug and investigate this kind of problem. BTW, I am using 32-bit qemu + a 64-bit KVM kernel module. unhandled vm exit: 0x11 Avi, I found the instruction that caused this problem: emulation failed (failure) rip 71f14651 66 0f 7f 07 And according to Intel, this is a MOVDQA. So, do we already have this instruction emulated, as I am using a pretty old version of KVM (release 88)? If yes, could you point me to the file I need to look at for that specific patch? Currently, I am trying to use coalesced_mmio as you suggested in another thread: http://www.mail-archive.com/kvm@vger.kernel.org/msg25695.html Thanks, Neo I happened to be in the Intel docs today: 0x11 (17) RSM. Guest software attempted to execute RSM in SMM. As it happens, the 0x11 should be looked up as a KVM_EXIT_REASON (kvm.h), not in the manuals as I said. It's a kvm internal error. What are RSM and SMM? These are documented in the manuals. -- error compiling committee.c: too many arguments to function -- I would remember that if researchers were not ambitious probably today we haven't the technology we are using!
Re: how to debug unhandled vm exit: 0x11?
On Tue, Sep 28, 2010 at 11:40 AM, Neo Jia neo...@gmail.com wrote: [...] Currently, I am trying to use coalesced_mmio as you suggested in another thread: http://www.mail-archive.com/kvm@vger.kernel.org/msg25695.html Just found out that coalesced_mmio doesn't help in my case. Thanks, Neo -- I would remember that if researchers were not ambitious probably today we haven't the technology we are using!
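Neo's byte sequence can be checked mechanically. An illustrative Python sketch that handles only this one encoding pattern (prefix 66 + two-byte opcode + a simple ModRM, 32-bit addressing to match the 32-bit qemu mentioned) — not a general x86 decoder:

```python
# Decode 66 0f 7f 07 by hand: an MMIO store of an XMM register, which is
# what the emulation-failure log above shows the old emulator choking on.

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_movdqa_store(insn):
    assert insn[0] == 0x66, "operand-size prefix selects the SSE2 form"
    assert insn[1:3] == b"\x0f\x7f", "0F 7F = MOVDQA xmm/m128, xmm (store)"
    modrm = insn[3]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # Only the mod=00 register-indirect forms; no SIB (rm=4) or disp32 (rm=5).
    assert mod == 0 and rm not in (4, 5), "only simple [reg] addressing here"
    return "movdqa [%s], xmm%d" % (REG32[rm], reg)
```

For 66 0f 7f 07 the ModRM byte 0x07 gives mod=00, reg=000, rm=111, i.e. a 16-byte store of xmm0 to the address in EDI — explaining why coalesced MMIO alone cannot help if the emulator lacks the instruction.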
Re: [PATCH v11 13/17] Add mp(mediate passthru) device.
On 9/28/2010 8:18 AM, Arnd Bergmann wrote: On Tuesday 28 September 2010, Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote: Can you be more specific about what the problem is? Do you think it breaks when a guest sends VLAN-tagged frames, or when macvtap is connected to a VLAN interface that adds another tag (or only the combination)? I expect the protocol value to be wrong when the guest sends VLAN-tagged frames, as 802.1Q frames have a different format. Ok, I see. Would that be fixed by using eth_type_trans()? I don't see any code in there that tries to deal with the VLAN tag, so do we have the same problem in the tun/tap driver? tun_get_user() does call eth_type_trans(). Not sure why I didn't use it in the macvtap code. Need to test it with guest VLAN tagging to make sure it works. Also, I wonder how we handle the case where both the guest and the host do VLAN tagging. Does the host transparently override the guest tag, or does it add a nested tag? More importantly, what should it do? I would think that if both guest and host do VLAN tagging, the tags will be nested.  Thanks Sridhar
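On the nested-tag question: an 802.1Q tag is four bytes (TPID 0x8100 followed by the TCI) inserted before the EtherType, so stacking a host tag on top of a guest tag nests them, and any parser of the kind being discussed has to walk past every tag to find the real protocol. A hedged sketch of the frame layout only (priority/CFI bits left zero; this is not the kernel's actual parsing path):

```python
import struct

ETH_P_8021Q = 0x8100

def build_frame(dst, src, payload_proto, payload, vlans=()):
    """Build an Ethernet frame, optionally with nested 802.1Q tags
    (outermost first). Each vid is used as the whole TCI, i.e. with
    zero priority bits, for simplicity."""
    hdr = dst + src
    for vid in vlans:
        hdr += struct.pack("!HH", ETH_P_8021Q, vid)
    hdr += struct.pack("!H", payload_proto)
    return hdr + payload

def parse_protocol(frame):
    """Walk past any 802.1Q tags to find the real protocol, collecting
    the VIDs (low 12 bits of each TCI) along the way."""
    off = 12  # skip dst + src MAC addresses
    (proto,) = struct.unpack_from("!H", frame, off)
    vids = []
    while proto == ETH_P_8021Q:
        (tci,) = struct.unpack_from("!H", frame, off + 2)
        vids.append(tci & 0x0FFF)
        off += 4
        (proto,) = struct.unpack_from("!H", frame, off)
    return proto, vids
```

With vlans=(100, 200) the frame carries two stacked tags, and the parser still recovers the inner EtherType — the behaviour the macvtap code would need if both guest and host tag.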
disk image snapshot functionality
Hi, I'm using CentOS 5.5 on the host, and the KVM that's available in the repos. I'm using Linux VMs too. My disk images are qcow2 files. Here's what I want: to be able to, on the host, create a snapshot of the guest's disk image without shutting down the guest, so that I can then restore the guest back to a point in time. I thought I could do this with the qcow2 images. I've used: qemu-img snapshot -c snapname disk_image.qcow2 to create the snapshot. It doesn't work. The snapshots claim to be created, but if I shut down the guest and apply the snapshot ( qemu-img snapshot -a snapname disk_image.qcow2 ) the guest either: a.) no longer boots (No bootable disk found) b.) boots, but is just how it was when I shut it down (it hasn't reverted back to what it was like when the snapshot was made) It makes no sense. I can sometimes apply the first snapshot, and it has worked... but subsequent snapshots are a no-go. One thing that is suspicious is that the VM SIZE and CLOCK are zero:
# qemu-img snapshot -l test1.qcow2
Snapshot list:
ID   TAG         VM SIZE   DATE                  VM CLOCK
1    with100mb   0         2010-09-28 11:48:23   00:00:00.000
2    with200mb   0         2010-09-28 11:50:53   00:00:00.000
3    with300mb   0         2010-09-28 11:52:49   00:00:00.000
4    whenoff     0         2010-09-28 11:56:41   00:00:00.000
# /usr/libexec/qemu-kvm --help
QEMU PC emulator version 0.9.1 (kvm-83-maint-snapshot-20090205), Copyright (c) 2003-2008 Fabrice Bellard
I can't find much info about using qcow2 images when I search. Any help would be appreciated. Thanks. Peter
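One clue is in Peter's own listing: `qemu-img snapshot -c` records disk state only, so a zero VM SIZE is expected there (only the monitor's `savevm` also stores VM state and shows a nonzero size). More importantly, modifying a qcow2 image with `qemu-img` while a live guest still has it open is unsafe, which would explain the corruption. A small, illustrative parser for the listing format shown (assumed format, checked only against this sample):

```python
def disk_only_snapshots(listing):
    """Flag snapshots whose VM SIZE column is 0: created by
    `qemu-img snapshot -c` (disk state only), not by the monitor's
    `savevm` (disk + RAM + device state)."""
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].isdigit():
            tag, vm_size = fields[1], fields[2]
            if vm_size == "0":
                flagged.append(tag)
    return flagged

listing = """\
Snapshot list:
ID   TAG         VM SIZE   DATE                  VM CLOCK
1    with100mb   0         2010-09-28 11:48:23   00:00:00.000
4    whenoff     0         2010-09-28 11:56:41   00:00:00.000
"""
```

All four of Peter's snapshots would be flagged, consistent with disk-only snapshots taken while the guest was (except for "whenoff") still running.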
Re: Support for nested vmx
On Tue, Sep 28, 2010, Vandeir Eduardo wrote about Support for nested vmx: Is there some place where I could get patches/instructions to test nested KVM VMs on Intel VT processors? Hi, The latest patches that I submitted on this mailing list around two months ago apply to the then-current trunk of KVM, but do not apply cleanly to today's trunk. So I started today the effort to rebase them to the current trunk. Unfortunately (but not too unexpectedly) the rebase did not go smoothly, and the resulting code doesn't work (L2 hangs). But I'll debug it, and expect that I'll be able to release a new version of the patches within a few days. Please note that while nested VMX should work with these patches, it still has a few limitations. Currently it has only been tested with KVM as the nested (L1) hypervisor - we got VMware Server working, but the necessary patches are not yet included in this patch set. Also, only Linux is sure to work as an L2 nested guest - Windows *did* work at some point, but it didn't seem to work in this patch set last time I tried - and some debugging work is still needed to figure out why. Nadav. -- Nadav Har'El | Tuesday, Sep 28 2010, 21 Tishri 5771 n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191 | You do not need a parachute to skydive. http://nadav.harel.org.il | You only need one to skydive twice.
Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices
On Tuesday 28 September 2010 21:36:00 Michael S. Tsirkin wrote: On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote: This patch enable per-vector mask for assigned devices using MSI-X. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/kvm/x86.c |1 + include/linux/kvm.h |9 - include/linux/kvm_host.h |1 + virt/kvm/assigned-dev.c | 39 +++ 4 files changed, 49 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8412c91..e6933e6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: + case KVM_CAP_DEVICE_MSIX_MASK: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 919ae53..f2b7cdc 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo { #endif #define KVM_CAP_PPC_GET_PVINFO 57 #define KVM_CAP_PPC_IRQ_LEVEL 58 +#ifdef __KVM_HAVE_MSIX +#define KVM_CAP_DEVICE_MSIX_MASK 59 +#endif #ifdef KVM_CAP_IRQ_ROUTING @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr { }; #define KVM_MAX_MSIX_PER_DEV 256 + +#define KVM_MSIX_FLAG_MASK 1 + struct kvm_assigned_msix_entry { __u32 assigned_dev_id; __u32 gsi; __u16 entry; /* The index of entry in the MSI-X table */ - __u16 padding[3]; + __u16 flags; + __u16 padding[2]; }; #endif /* __LINUX_KVM_H */ diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0b89d00..a43405c 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier { }; #define KVM_ASSIGNED_MSIX_PENDING 0x1 +#define KVM_ASSIGNED_MSIX_MASK 0x2 struct kvm_guest_msix_entry { u32 vector; u16 entry; diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 7c98928..15b8c32 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -17,6 +17,8 @@ #include linux/pci.h #include linux/interrupt.h 
#include linux/slab.h +#include linux/irqnr.h + #include irq.h static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head, @@ -666,6 +668,30 @@ msix_nr_out: return r; } +static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev, + int index) +{ + int irq; + + if (!assigned_dev-dev-msix_enabled || + !(assigned_dev-irq_requested_type KVM_DEV_IRQ_HOST_MSIX)) + return; + + irq = assigned_dev-host_msix_entries[index].vector; + + ASSERT(irq != 0); + + if (assigned_dev-guest_msix_entries[index].flags + KVM_ASSIGNED_MSIX_MASK) + disable_irq(irq); + else { + enable_irq(irq); + if (assigned_dev-guest_msix_entries[index].flags + KVM_ASSIGNED_MSIX_PENDING) + schedule_work(assigned_dev-interrupt_work); + } Hmm, won't this lose interrupts which were sent while bit was pending? It is also pretty heavy if as you say guests touch the mask a lot. I think we must keep the interrupt disabled, just set a bit and delay interrupt injection until vector is unmasked or deleted. The interface to do this will need more thought: e.g. how can userspace clear this bit then? I think it's fine. Because we didn't modify pending bit here, and the interrupt handler would schedule an work to check it regardless of if the IRQ is disable. By this meaning, no interrupt would be lose. But the checking for pending bit on unmask action seems unnecessary? Because if the bit is set, then definitely one work has been scheduled to deal with it. -- regards Yang, Sheng +} + static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm, struct kvm_assigned_msix_entry *entry) { @@ -688,6 +714,19 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm, adev-guest_msix_entries[i].entry = entry-entry; adev-guest_msix_entries[i].vector = entry-gsi; adev-host_msix_entries[i].entry = entry-entry; + if ((entry-flags KVM_MSIX_FLAG_MASK) + !(adev-guest_msix_entries[i].flags + KVM_ASSIGNED_MSIX_MASK)) { + adev-guest_msix_entries[i].flags |= + KVM_ASSIGNED_MSIX_MASK; +
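For reference, the PCI MSI-X semantics the patch has to preserve: a message that arrives while the vector is masked only sets the per-vector pending bit, and delivery happens on unmask — so masking cannot lose the interrupt, though repeated interrupts while masked collapse into one delivery. A toy Python model with invented names (not the KVM code, which must additionally juggle host IRQ state as debated above):

```python
class MsixVector:
    """Toy model of MSI-X per-vector masking: an interrupt arriving while
    the vector is masked sets Pending; it is delivered (and Pending is
    cleared) when the vector is unmasked."""
    def __init__(self, deliver):
        self.masked = False
        self.pending = False
        self.deliver = deliver      # injection callback into the "guest"

    def interrupt(self):
        if self.masked:
            self.pending = True     # remember, but do not inject
        else:
            self.deliver()

    def set_mask(self, masked):
        self.masked = masked
        if not masked and self.pending:
            # Unmask with Pending set: inject the deferred interrupt now.
            self.pending = False
            self.deliver()
```

Note how two interrupts raised while masked produce a single delivery on unmask — the message-signalled equivalent of edge collapsing, and the behaviour Sheng's scheduled-work check is trying to reproduce.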
Re: [RFC] virtio_balloon: disable oom killer when fill balloon
On Tue, Sep 28, 2010 at 10:03 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 09/28/2010 08:49 AM, Dave Young wrote: On Tue, Sep 28, 2010 at 9:34 PM, Anthony Liguori anth...@codemonkey.ws wrote: On 09/28/2010 08:19 AM, Dave Young wrote: The balloon could cause guest-memory OOM killing and a panic. If we disable the OOM killer it will be better; at least we avoid a guest panic. If the allocation failed we could just adjust the balloon target to be equal to the current number by calling vdev->config->set. But during testing I found that the config->set of num_pages does not actually change the config. Should I do hacks in userspace as well? If so, where should I start to hack? Hi, Thanks for your comments. The guest is not supposed to set the target field in its config. This is a host read/write, guest read-only field. Could you tell me where it is set? If so, IMHO the set-config API should fail, shouldn't it? The problem with your approach, generally speaking, is that it's unclear whether this is the right policy. For instance, what if another application held a very large allocation which caused the fill request to stop, but then shortly afterwards released that allocation? If instead of just stopping we paused and tried again later, we could potentially succeed. Yes, it is possible. But maybe it is better to do the ballooning from the qemu monitor later? It's part of the specification, not something that's enforced or even visible within the APIs. I think a better approach might be a graceful back-off. For instance, when you hit this condition, deflate the balloon by 10% based on the assumption that you have probably already gone too far. Before you attempt to allocate to the target again, set a timeout that increases in duration exponentially until you reach some maximum (say 10s). I'm afraid that most of the time it will keep cycling between inflate and deflate. With sufficiently large timeouts, does it matter? The other side of the argument is that the host should be more careful about making balloon requests to the guest. 
Alternatively, you can argue that the guest should be able to balloon itself and that's where the logic should be. But I think split policy across the guest and host would prove to be prohibitively complex to deal with. Thanks. What do you think about add an option like: -balloon virtio,nofail With nofail option we stop balloon if oom The default behavior without nofail will be as your said, ie. retry every 5 minutes Regards, Anthony Liguori This isn't the only approach, but hopefully it conveys the idea of gracefully backing off without really giving up. Regards, Anthony Liguori Thanks --- linux-2.6.orig/drivers/virtio/virtio_balloon.c 2010-09-25 20:58:14.19001 +0800 +++ linux-2.6/drivers/virtio/virtio_balloon.c 2010-09-28 21:05:42.20675 +0800 @@ -25,6 +25,7 @@ #includelinux/freezer.h #includelinux/delay.h #includelinux/slab.h +#includelinux/oom.h struct virtio_balloon { @@ -97,8 +98,22 @@ static void tell_host(struct virtio_ball wait_for_completion(vb-acked); } -static void fill_balloon(struct virtio_balloon *vb, size_t num) +static int cblimit(int times) { + static int t; + + if (t times) + t++; + else + t = 0; + + return !t; +} + +static int fill_balloon(struct virtio_balloon *vb, size_t num) +{ + int ret = 0; + /* We can only do one array worth at a time. */ num = min(num, ARRAY_SIZE(vb-pfns)); @@ -106,10 +121,13 @@ static void fill_balloon(struct virtio_b struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (!page) { - if (printk_ratelimit()) + if (cblimit(5)) { dev_printk(KERN_INFO,vb-vdev-dev, Out of puff! Can't get %zu pages\n, num); + ret = -ENOMEM; + goto out; + } /* Sleep for at least 1/5 of a second before retry. */ msleep(200); break; @@ -120,11 +138,11 @@ static void fill_balloon(struct virtio_b list_add(page-lru,vb-pages); } - /* Didn't get any? Oh well. 
*/ - if (vb-num_pfns == 0) - return; +out: + if (vb-num_pfns) + tell_host(vb, vb-inflate_vq); - tell_host(vb, vb-inflate_vq); + return ret; } static void release_pages_by_pfn(const u32 pfns[], unsigned int num) @@ -251,6 +269,14 @@ static void update_balloon_size(struct v actual, sizeof(actual)); } +static void update_balloon_target(struct virtio_balloon *vb) +{ +
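Anthony's graceful back-off can be sketched as follows. The 10% deflate and the 10 s cap are from his description; the 200 ms starting point matches the driver's existing msleep(200). Everything else is an invented illustration of the policy — the real driver is C, and these helper names are hypothetical:

```python
def backoff_delays(initial=0.2, factor=2.0, cap=10.0):
    """Retry delays for refilling the balloon after an allocation
    failure: exponential growth from the driver's current 200 ms sleep
    up to a 10 s ceiling."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)

def on_fill_failure(target_pages):
    """On allocation failure, back the target off by ~10% on the
    assumption the balloon has probably already gone too far
    (illustrative policy, not the driver's actual behaviour)."""
    return target_pages - target_pages // 10
```

With sufficiently long capped delays, the inflate/deflate cycling Dave worries about degenerates into a slow probe rather than a tight loop, which is the point of the back-off.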
IOMMU in guest OS
Hi, Can someone tell me if my understanding is correct or not. Currently, KVM only uses the IOMMU to allow a PCI/PCI Express device to be assigned as a PCI device inside the guest OS. So right now the guest OS does not see an IOMMU as a device, but this may change when IOMMU emulation (I saw some patches on the list) is merged into the development branch. -- Thawan Kooburat Graduate Student Department of Computer Science UW-Madison
Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
Hello Michael, On Wed, 2010-09-15 at 07:52 -0700, Shirley Ma wrote: Don't you think once I address the vhost_add_used_and_signal update issue, it is a simple and complete patch for macvtap TX zero copy? Thanks Shirley I like the fact that the patch is simple. Unfortunately I suspect it'll stop being simple by the time it's complete :) I can make a try. :) I compared several approaches for addressing the issue raised here about how/when to update vhost_add_used_and_signal. The simplest approach I have found is: 1. Add a completion field in struct virtqueue; 2. When it is a zero-copy packet, make the vhost thread wait on the completion before updating vhost_add_used_and_signal; 3. Pass the vq from vhost to macvtap as the skb destruct_arg; 4. When the skb is freed on the last reference, signal vq completion. The test results show the same performance as the original patch. What do you think? If it sounds good to you, I will resubmit a revised patch. The patch stays as simple as it was before. :) Thanks Shirley
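Shirley's scheme in miniature: the "used" update waits on a completion that the buffer's destructor signals when the last reference to the zero-copy data drops. A toy Python sketch with invented names — in the real patch this would hang off the skb's destruct_arg as described, not a Python Event:

```python
import threading

class ZeroCopyBuffer:
    """Toy model of the proposed flow: vhost must not tell the guest a
    buffer is 'used' until the network stack has really finished with it,
    signalled here from a destructor callback (cf. skb destruct_arg)."""
    def __init__(self):
        self.completion = threading.Event()

    def destructor(self):
        # Called when the last reference to the zero-copy data drops.
        self.completion.set()

def add_used_and_signal(buf, used_ring, desc_id, timeout=5.0):
    """Wait for the completion, then publish the descriptor to the guest."""
    if not buf.completion.wait(timeout):
        raise TimeoutError("buffer never completed")
    used_ring.append(desc_id)
```

The open design question in the thread maps onto the timeout here: if the stack holds the skb indefinitely, the vhost thread blocks, which is why the completion-wait must not sit on the fast path for copied packets.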
Re: IOMMU in guest OS
On Tue, Sep 28, 2010 at 08:56:57PM -0500, Thawan Kooburat wrote: Can someone tell me if my understanding is correct or not. Currently, KVM only uses the IOMMU to allow a PCI/PCI Express device to be assigned as a PCI device inside the guest OS. So right now the guest OS does not see an IOMMU as a device, but this may change when IOMMU emulation (I saw some patches on the list) is merged into the development branch. Correct. Cheers, Muli