Virtualization DevRoom at FOSDEM 2013
Following on the heels of a successful KVM Forum and oVirt Workshop, FOSDEM will be hosting a Virtualization DevRoom in February. If you've been to FOSDEM before, you know this is about developers and code, not products. Presentation proposals are due by December 16th 2012. The full details are here: http://osvc.v2.cs.unibo.it/index.php/Main_Page With the relevant topics being: Topics covered will include, but not limited to: - machine virtualization (e.g. KVM, Xen, VirtualBox,...) - network virtualization (e.g. openvstack, vale, vde, Open vSwitch,...) - process level virtualization, flexible kernels (e.g. rump anykernel, view-os, ...) - virt management (e.g. ganeti, libvirt, ovirt, XCP, ...) thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] QEMU was not selected for Google Summer of Code this year
* Natalia Portillo (clau...@claunia.com) wrote: QEMU hosted on Haiku would be interesting. The fun of Haiku especially when it is hosting QEMU -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC] Next gen kvm api
* Anthony Liguori (anth...@codemonkey.ws) wrote: On 02/07/2012 07:18 AM, Avi Kivity wrote: On 02/07/2012 02:51 PM, Anthony Liguori wrote: On 02/07/2012 06:40 AM, Avi Kivity wrote: On 02/07/2012 02:28 PM, Anthony Liguori wrote: It's a potential source of exploits (from bugs in KVM or in hardware). I can see people wanting to be selective with access because of that. As is true of the rest of the kernel. If you want finer grain access control, that's exactly why we have things like LSM and SELinux. You can add the appropriate LSM hooks into the KVM infrastructure and setup default SELinux policies appropriately. LSMs protect objects, not syscalls. There isn't an object to protect here (except the fake /dev/kvm object). A VM can be an object. Not really, it's not accessible in a namespace. How would you label it? A VM, vcpu, etc are all objects. The labelling can be implicit based on the security context of the process creating the object. You could create simplistic rules such as a process may have the ability KVM__VM_CREATE (this is roughly analogous to the PROC__EXECMEM policy control that allows some processes to create executable writable memory mappings, or SHM__CREATE for a process that can create a shared memory segment). Adding some label mgmt to the object (add -security and some callbacks to do -alloc/init/free), and then checks on the object itself would allow for finer grained protection. If there was any VM lookup (although the original example explicitly ties a process to a vm and a thread to a vcpu) the finer grained check would certainly be useful to verify that the process can access the VM. Labels can originate from userspace, IIUC, so I think it's possible for QEMU (or whatever the userspace is) to set the label for the VM while it's creating it. I think this is how most of the labeling for X and things of that nature works. For X, the policy enforcement is done in the X server. There is assistance from the kernel for doing policy server queries (can foo do bar?), but it's up to the X server to actually care enough to ask and then fail a request that doesn't comply. I'm not sure that's the model here. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] intel-iommu: Add device info into list before doing context mapping
* Hao, Xudong (xudong@intel.com) wrote: Yes, Chris, thanks your comments. How about this one? Yes, it gets the locking right. Also makes host device and guest assigned device go through the same order: alloc_devinfo and init lock; place info on lists; unlock domain_context_mapping() The patch itself is whitespace damaged and does not apply. Please fix and feel free to add my: Acked-by: Chris Wright chr...@sous-sol.org -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] intel-iommu: Add device info into list before doing context mapping
* Chris Wright (chr...@sous-sol.org) wrote: * Hao, Xudong (xudong@intel.com) wrote: Yes, Chris, thanks your comments. How about this one? Yes, it gets the locking right. Sorry, I missed one other problem on the error path. You need to also update pdev-dev.archdata.iommu to NULL (otherwise it is left pointing to freed memory). Also makes host device and guest assigned device go through the same order: alloc_devinfo and init lock; place info on lists; unlock domain_context_mapping() The patch itself is whitespace damaged and does not apply. Please fix and feel free to add my: Acked-by: Chris Wright chr...@sous-sol.org -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] intel-iommu: Add device info into list before doing context mapping
* Hao, Xudong (xudong@intel.com) wrote: @@ -2282,6 +2276,14 @@ static int domain_add_dev_info(struct dmar_domain *domain, pdev-dev.archdata.iommu = info; spin_unlock_irqrestore(device_domain_lock, flags); + ret = domain_context_mapping(domain, pdev, translation); + if (ret) { + list_del(info-link); + list_del(info-global); At the very least, this is not correct locking. + free_devinfo_mem(info); + return ret; + } + -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
* Peter Zijlstra (a.p.zijls...@chello.nl) wrote: On Wed, 2011-11-30 at 21:52 +0530, Dipankar Sarma wrote: Also, if at all topology changes due to migration or host kernel decisions, we can make use of something like VPHN (virtual processor home node) capability on Power systems to have guest kernel update its topology knowledge. You can refer to that in arch/powerpc/mm/numa.c. I think that fail^Wfeature of PPC is terminally broken. You simply cannot change the topology after the fact. Agreed, there's too many things that consult topology once and never look back. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
* Ben Hutchings (bhutchi...@solarflare.com) wrote: On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote: On 11/29/2011 9:19 AM, Ben Hutchings wrote: On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote: Maybe I missed something! [...] If not, please explain what the new model *is*. The new model is to incorporate a VEB into the NIC. The current model doesn't address any of the requirements of a VEB in the NIC and this proposed set of patches allow us to set MAC filters for the *ports* on the internal NIC VEB. Consider the PF and each of the VFs as just a port on the VEB. We need the ability to set L2 filters (MAC, MC and VLAN) for each of the ports on that VEB. There is no currently supported method for doing this. So yes, this is a new model although it's a fairly simple one. Explain precisely how the VEB changes the existing model. Explain how the existing MAC filter and VF filter APIs interact with port filters on the VEB. Refer to any relevant standards. I agree that it's confusing. Couldn't you simplify your ascii art (hopefully removing hw assumptions about receive processing, and completely ignoring vlans for the moment) to something like: |RX v ++-+ | +--++| | | RX MAC filter || | |and port select|| | +---+| |/|\ | | / | \ match 2| | / v \ | | /match \| |/ 1 |\ | | / | \ | |match / | \ | | 0 / | \| |v|v | |||| | ++++---+ ||| PF VF 1 VF 2 And there's an unclear number of ways to update RX MAC filter and port select table. 1) PF ndo_set_mac_addr I expect that to be implicit to match 0. 2) PF ndo_set_rx_mode Less clear, but I'd still expect these to implicitly match 0 3) PF ndo_set_vf_mac I expect these to be an explicit match to VF N (given the interface specifices which VF's MAC is being programmed). 4) VF ndo_set_mac_addr This one may or may not be allowed (setting MAC+port if the VF is owned by a guest is likely not allowed), but would expect an implicit VF N. 5) VF ndo_set_rx_mode Same as 4) above. 6) PF or VF? ndo_set_rx_filter_addr The new proposal, which has an explicit VF, although when it's VF_SELF I'm not clear if this is just the same as 5) above? Have I missed anything? thanks, chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
* Ben Hutchings (bhutchi...@solarflare.com) wrote: On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote: I agree that it's confusing. Couldn't you simplify your ascii art (hopefully removing hw assumptions about receive processing, and completely ignoring vlans for the moment) to something like: |RX v ++-+ | +--++| | | RX MAC filter || | |and port select|| | +---+| |/|\ | | / | \ match 2| | / v \ | | /match \| |/ 1 |\ | | / | \ | |match / | \ | | 0 / | \| |v|v | |||| | ++++---+ ||| PF VF 1 VF 2 And there's an unclear number of ways to update RX MAC filter and port select table. 1) PF ndo_set_mac_addr I expect that to be implicit to match 0. 2) PF ndo_set_rx_mode Less clear, but I'd still expect these to implicitly match 0 3) PF ndo_set_vf_mac I expect these to be an explicit match to VF N (given the interface specifices which VF's MAC is being programmed). I'm not sure whether this is supposed to implicitly add to the MAC filter or whether that has to be changed too. That's the main difference between my models (a) and (b). I see now. I wasn't entirely clear on the difference before. It's also going to be hw specific. I think (Intel folks can verify) that the Intel SR-IOV devices have a single global unicast exact match table, for example. There's also PF ndo_set_vf_vlan. Right, although I had mentioned I was trying to limit just to MAC filtering to simplify. 4) VF ndo_set_mac_addr This one may or may not be allowed (setting MAC+port if the VF is owned by a guest is likely not allowed), but would expect an implicit VF N. 5) VF ndo_set_rx_mode Same as 4) above. So this is where we are today. Cool, good that we agree there. 6) PF or VF? ndo_set_rx_filter_addr The new proposal, which has an explicit VF, although when it's VF_SELF I'm not clear if this is just the same as 5) above? Have I missed anything? Any physical port can be bridged to a mixture of guests with and without their own VFs. Packets sent from a guest with a VF to the address of a guest without a VF need to be forwarded to the PF rather than the physical port, but none of the drivers currently get to know about those addresses. To clarify, do you mean something like this? physical port | +++ | +-+ | | | VEB | | | +-+ | |/ | \| | /|\ | | / | \ | +-+--+--+-+ | | | PFVF 1VF 2 / | | +---+---+ VM4 +---+---+ | sw | |macvtap| | switch| +---+---+ +-+-+-+-+ | / | \VM5 / | \ VM1 VM2 VM3 This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv switching), VM4 directly owning VF1 (pci device assignement), and VM5 indirectly owning VF2 (macvtap passthrough, that started this whole thing). So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1 goes in to VEB, out PF, and into linux bridging code, rigth? At which point the PF is in promiscuous mode (btw, same does not work if bridge is attached to VF, at least for some VFs, due to lack of promiscuous mode). Packets sent from a guest with a VF to the address of another guest with a VF need to be forwarded similarly, but the driver should be able to infer that from (3). Right, and that works currently for the case where both guests are like VM4, they directly own the VF via PCI device assignement. But for VM4 to talk to VM5, VF3 is not in promiscuous mode and has a different MAC address than VM5's vNIC. If the embedded bridge does not learn, and nobody programmed it to fwd frames for VM5 via VF3... I believe this is what Roopa's patch will allow. The question now is whether there's a better way to handle this? In my mind, we'd model the NIC's embedded bridge as, well, a bridge. And set anti-spoofing, port mirroring, port mac/vlan filtering, etc via that bridge. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
* Sridhar Samudrala (s...@us.ibm.com) wrote: On 11/30/2011 3:00 PM, Chris Wright wrote: physical port | +++ | +-+ | | | VEB | | | +-+ | |/ | \| | /|\ | | / | \ | +-+--+--+-+ | | | PFVF 1VF 2 / | | +---+---+ VM4 +---+---+ | sw | |macvtap| | switch| +---+---+ +-+-+-+-+ | / | \VM5 / | \ VM1 VM2 VM3 This has VMs 1-3 hanging of the PF via a linux bridge (traditional hv switching), VM4 directly owning VF1 (pci device assignement), and VM5 indirectly owning VF2 (macvtap passthrough, that started this whole thing). So, I'm understanding you saying that VM4 or VM4 sending a packet to VM1 goes in to VEB, out PF, and into linux bridging code, rigth? At which point the PF is in promiscuous mode (btw, same does not work if bridge is attached to VF, at least for some VFs, due to lack of promiscuous mode). Packets sent from a guest with a VF to the address of another guest with a VF need to be forwarded similarly, but the driver should be able to infer that from (3). Right, and that works currently for the case where both guests are like VM4, they directly own the VF via PCI device assignement. But for VM4 to talk to VM5, VF3 is not in promiscuous mode and has a different MAC address than VM5's vNIC. If the embedded bridge does not learn, and nobody programmed it to fwd frames for VM5 via VF3... I think you are referring to VF2. There is no VF3 in your picture. *sigh* (also meant 'VM4 or VM5' up above, not 'VM4 or VM4')... In macvtap passthru mode, VF2 will be set to the same mac address as VM5's MAC. So VM4 should be be able to talk to VM5. yes (i think macvtap in bridging or vepa mode w/ single VM has that issue, not passthru) I believe this is what Roopa's patch will allow. The question now is whether there's a better way to handle this? My understanding is that Roopa's patch will allow setting additional mac addresses to VM5 without the need to put VF5 in promiscous mode. Thanks for your corrections Sridar. cheers, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
* Peter Zijlstra (a.p.zijls...@chello.nl) wrote: On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address-GPA) and their corresponding host host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor). The idea was to use this GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to existing mbind() to allow binding memory of a process(QEMU) from a different process(libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA binding should all be done from outside of QEMU, it is needed, otherwise what you have should be sufficient. That's just retarded, and no you won't get such extentions. Poking at another process's virtual address space is just daft. Esp. if there's no actual reason for it. Need to separate the binding vs the policy mgmt. The policy mgmt could still be done outside, whereas the binding could still be done from w/in QEMU. A simple monitor interface to rebalance vcpu memory allcoations to different nodes could very well schedule vcpu thread work in QEMU. So, I agree, even if there is some external policy mgmt, it could still easily work w/ QEMU to use Peter's proposed interface. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM device assignment and user privileges
* Avi Kivity (a...@redhat.com) wrote: On 11/20/2011 04:58 PM, Sasha Levin wrote: Hi all, I've been working on adding device assignment to KVM tools, and started with the basics of just getting a device assigned using the KVM_ASSIGN_PCI_DEVICE ioctl. What I've figured is that unprivileged users can request any PCI device to be assigned to him, including devices which he shouldn't be touching. In my case, it happened with the VGA card, where an unprivileged user simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA card and caused the display on the host to go apeshit. Was it supposed to work this way? No, of course not. Indeed. A device is typically owned by a host OS driver which precludes device assignment from working. If it's not, the unprivilged guest will not have access to the device's config space or resource bars as they are only rw for a privileged user. And similarly, /dev/kvm was typically left as 0644. As you can see, it's fragile. I couldn't find any security checks in the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can invoke it with any parameters he'd want - enabling him to kill the host. Alex, Chris? The security checks were removed some time back. The expectation was that there was nothing an unprivleged user could usefully do w/ the assign device ioctl, and the assign irq ioctl only works after assign device. It's built on an overly fragile set of assumptions, however. Avi, the simplest short term thing to do now might be simply revert: 48bb09e KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq While it's a regression for existing unprivileged users it's better than a hole. And in the meantime, we can come up w/ something better to replace with. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
* Alexander Graf (ag...@suse.de) wrote: On 29.10.2011, at 20:45, Bharata B Rao wrote: As guests become NUMA aware, it becomes important for the guests to have correct NUMA policies when they run on NUMA aware hosts. Currently limited support for NUMA binding is available via libvirt where it is possible to apply a NUMA policy to the guest as a whole. However multinode guests would benefit if guest memory belonging to different guest nodes are mapped appropriately to different host NUMA nodes. To achieve this we would need QEMU to expose information about guest RAM ranges (Guest Physical Address - GPA) and their host virtual address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external tool like libvirt would be able to divide the guest RAM as per the guest NUMA node geometry and bind guest memory nodes to corresponding host memory nodes using HVA. This needs both QEMU (and libvirt) changes as well as changes in the kernel. Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that know how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. I think that both Peter and Andrea are looking at this. Before we commit an API to QEMU that has a different semantic than a possible new kernel interface (that perhaps QEMU could use directly to inform kernel of the binding/relationship between vcpu thread and it's memory at VM startuup) it would be useful to see what these guys are working on... thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] KVM: remove host and guest pv mmu support
This feature hasn't been in use for some years now. The host side bits are deprecated for almost a year. The guest side would only get used on old hosts, and it's slower than shadow or hw assisted paging. Time to remove it. Chris Wright (2): KVM Guest: remove KVM guest pv mmu support KVM: remove KVM host pv mmu support Documentation/feature-removal-schedule.txt |9 -- arch/x86/include/asm/kvm_host.h| 13 -- arch/x86/kernel/kvm.c | 181 arch/x86/kvm/mmu.c | 135 - arch/x86/kvm/x86.c | 12 -- 5 files changed, 0 insertions(+), 350 deletions(-) Changes since RFC: - v2 rebase to b796a09c thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] KVM Guest: remove KVM guest pv mmu support
This has not been used for some years now. It's time to remove it. Signed-off-by: Chris Wright chr...@redhat.com --- - v2 rebase to b796a09c arch/x86/kernel/kvm.c | 181 - 1 files changed, 0 insertions(+), 181 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index a9c2116..f0c6fd6 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -39,8 +39,6 @@ #include asm/desc.h #include asm/tlbflush.h -#define MMU_QUEUE_SIZE 1024 - static int kvmapf = 1; static int parse_no_kvmapf(char *arg) @@ -60,21 +58,10 @@ static int parse_no_stealacc(char *arg) early_param(no-steal-acc, parse_no_stealacc); -struct kvm_para_state { - u8 mmu_queue[MMU_QUEUE_SIZE]; - int mmu_queue_len; -}; - -static DEFINE_PER_CPU(struct kvm_para_state, para_state); static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); static int has_steal_clock = 0; -static struct kvm_para_state *kvm_para_state(void) -{ - return per_cpu(para_state, raw_smp_processor_id()); -} - /* * No need for any IO delay on KVM */ @@ -271,151 +258,6 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code) } } -static void kvm_mmu_op(void *buffer, unsigned len) -{ - int r; - unsigned long a1, a2; - - do { - a1 = __pa(buffer); - a2 = 0; /* on i386 __pa() always returns 4G */ - r = kvm_hypercall3(KVM_HC_MMU_OP, len, a1, a2); - buffer += r; - len -= r; - } while (len); -} - -static void mmu_queue_flush(struct kvm_para_state *state) -{ - if (state-mmu_queue_len) { - kvm_mmu_op(state-mmu_queue, state-mmu_queue_len); - state-mmu_queue_len = 0; - } -} - -static void kvm_deferred_mmu_op(void *buffer, int len) -{ - struct kvm_para_state *state = kvm_para_state(); - - if (paravirt_get_lazy_mode() != PARAVIRT_LAZY_MMU) { - kvm_mmu_op(buffer, len); - return; - } - if (state-mmu_queue_len + len sizeof state-mmu_queue) - mmu_queue_flush(state); - memcpy(state-mmu_queue + state-mmu_queue_len, buffer, len); - state-mmu_queue_len += len; -} - -static void kvm_mmu_write(void *dest, u64 val) -{ - __u64 pte_phys; - struct kvm_mmu_op_write_pte wpte; - -#ifdef CONFIG_HIGHPTE - struct page *page; - unsigned long dst = (unsigned long) dest; - - page = kmap_atomic_to_page(dest); - pte_phys = page_to_pfn(page); - pte_phys = PAGE_SHIFT; - pte_phys += (dst ~(PAGE_MASK)); -#else - pte_phys = (unsigned long)__pa(dest); -#endif - wpte.header.op = KVM_MMU_OP_WRITE_PTE; - wpte.pte_val = val; - wpte.pte_phys = pte_phys; - - kvm_deferred_mmu_op(wpte, sizeof wpte); -} - -/* - * We only need to hook operations that are MMU writes. We hook these so that - * we can use lazy MMU mode to batch these operations. We could probably - * improve the performance of the host code if we used some of the information - * here to simplify processing of batched writes. - */ -static void kvm_set_pte(pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_set_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_set_pmd(pmd_t *pmdp, pmd_t pmd) -{ - kvm_mmu_write(pmdp, pmd_val(pmd)); -} - -#if PAGETABLE_LEVELS = 3 -#ifdef CONFIG_X86_PAE -static void kvm_set_pte_atomic(pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_pte_clear(struct mm_struct *mm, - unsigned long addr, pte_t *ptep) -{ - kvm_mmu_write(ptep, 0); -} - -static void kvm_pmd_clear(pmd_t *pmdp) -{ - kvm_mmu_write(pmdp, 0); -} -#endif - -static void kvm_set_pud(pud_t *pudp, pud_t pud) -{ - kvm_mmu_write(pudp, pud_val(pud)); -} - -#if PAGETABLE_LEVELS == 4 -static void kvm_set_pgd(pgd_t *pgdp, pgd_t pgd) -{ - kvm_mmu_write(pgdp, pgd_val(pgd)); -} -#endif -#endif /* PAGETABLE_LEVELS = 3 */ - -static void kvm_flush_tlb(void) -{ - struct kvm_mmu_op_flush_tlb ftlb = { - .header.op = KVM_MMU_OP_FLUSH_TLB, - }; - - kvm_deferred_mmu_op(ftlb, sizeof ftlb); -} - -static void kvm_release_pt(unsigned long pfn) -{ - struct kvm_mmu_op_release_pt rpt = { - .header.op = KVM_MMU_OP_RELEASE_PT, - .pt_phys = (u64)pfn PAGE_SHIFT, - }; - - kvm_mmu_op(rpt, sizeof rpt); -} - -static void kvm_enter_lazy_mmu(void) -{ - paravirt_enter_lazy_mmu(); -} - -static void kvm_leave_lazy_mmu(void) -{ - struct kvm_para_state *state = kvm_para_state(); - - mmu_queue_flush(state); - paravirt_leave_lazy_mmu(); -} - static void
[PATCH v2 2/2] KVM: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in January 2011. It's not in use, is slower than shadow or hardware assisted paging, and a maintenance burden. It's November 2011, time to remove it. Signed-off-by: Chris Wright chr...@redhat.com --- - v2 rebase to b796a09c Documentation/feature-removal-schedule.txt |9 -- arch/x86/include/asm/kvm_host.h| 13 --- arch/x86/kvm/mmu.c | 135 arch/x86/kvm/x86.c | 12 --- 4 files changed, 0 insertions(+), 169 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index d5ac362..877f897 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -397,15 +397,6 @@ Who: anybody or Florian Mickler flor...@mickler.org -What: KVM paravirt mmu host support -When: January 2011 -Why: The paravirt mmu host support is slower than non-paravirt mmu, both - on newer and older hardware. It is already not exposed to the guest, - and kept only for live migration purposes. -Who: Avi Kivity a...@redhat.com - - - What: iwlwifi 50XX module parameters When: 3.0 Why: The ..50 modules parameters were used to configure 5000 series and diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c1f19de..6d83264 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -244,13 +244,6 @@ struct kvm_mmu_page { struct rcu_head rcu; }; -struct kvm_pv_mmu_op_buffer { - void *ptr; - unsigned len; - unsigned processed; - char buf[512] __aligned(sizeof(long)); -}; - struct kvm_pio_request { unsigned long count; int in; @@ -347,10 +340,6 @@ struct kvm_vcpu_arch { */ struct kvm_mmu *walk_mmu; - /* only needed in kvm_pv_mmu_op() path, but it's hot so -* put it here to avoid allocation */ - struct kvm_pv_mmu_op_buffer mmu_op_buffer; - struct kvm_mmu_memory_cache mmu_pte_list_desc_cache; struct kvm_mmu_memory_cache mmu_page_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; @@ -667,8 +656,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3); int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, const void *val, int bytes); -int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes, - gpa_t addr, unsigned long *ret); u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn); extern bool tdp_enabled; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e9534ce..a9b3a32 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2028,20 +2028,6 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) } EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page); -static void mmu_unshadow(struct kvm *kvm, gfn_t gfn) -{ - struct kvm_mmu_page *sp; - struct hlist_node *node; - LIST_HEAD(invalid_list); - - for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) { - pgprintk(%s: zap %llx %x\n, -__func__, gfn, sp-role.word); - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); - } - kvm_mmu_commit_zap_page(kvm, invalid_list); -} - static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn) { int slot = memslot_id(kvm, gfn); @@ -4004,127 +3990,6 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm) return nr_mmu_pages; } -static void *pv_mmu_peek_buffer(struct kvm_pv_mmu_op_buffer *buffer, - unsigned len) -{ - if (len buffer-len) - return NULL; - return buffer-ptr; -} - -static void *pv_mmu_read_buffer(struct kvm_pv_mmu_op_buffer *buffer, - unsigned len) -{ - void *ret; - - ret = pv_mmu_peek_buffer(buffer, len); - if (!ret) - return ret; - buffer-ptr += len; - buffer-len -= len; - buffer-processed += len; - return ret; -} - -static int kvm_pv_mmu_write(struct kvm_vcpu *vcpu, -gpa_t addr, gpa_t value) -{ - int bytes = 8; - int r; - - if (!is_long_mode(vcpu) !is_pae(vcpu)) - bytes = 4; - - r = mmu_topup_memory_caches(vcpu); - if (r) - return r; - - if (!emulator_write_phys(vcpu, addr, value, bytes)) - return -EFAULT; - - return 1; -} - -static int kvm_pv_mmu_flush_tlb(struct kvm_vcpu *vcpu) -{ - (void)kvm_set_cr3(vcpu, kvm_read_cr3(vcpu)); - return 1; -} - -static int kvm_pv_mmu_release_pt(struct kvm_vcpu *vcpu, gpa_t addr) -{ - spin_lock(vcpu-kvm-mmu_lock); - mmu_unshadow(vcpu-kvm, addr PAGE_SHIFT); - spin_unlock(vcpu-kvm-mmu_lock
[PATCH RFC 0/2] KVM: remove host and guest pv mmu support
This feature hasn't been in use for some years now. The host side bits are deprecated for almost a year. The guest side would only get used on old hosts, and it's slower than shadow or hw assisted paging. Time to remove it. Documentation/feature-removal-schedule.txt |9 -- arch/x86/include/asm/kvm_host.h| 13 -- arch/x86/kernel/kvm.c | 181 arch/x86/kvm/mmu.c | 135 - arch/x86/kvm/x86.c | 12 -- 5 files changed, 0 insertions(+), 350 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 1/2] KVM Guest: remove KVM guest pv mmu support
This has not been used for some years now. It's time to remove it. Will also make some pv patching improvements easier. Signed-off-by: Chris Wright chr...@redhat.com --- arch/x86/kernel/kvm.c | 181 - 1 files changed, 0 insertions(+), 181 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index a9c2116..f0c6fd6 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -39,8 +39,6 @@ #include asm/desc.h #include asm/tlbflush.h -#define MMU_QUEUE_SIZE 1024 - static int kvmapf = 1; static int parse_no_kvmapf(char *arg) @@ -60,21 +58,10 @@ static int parse_no_stealacc(char *arg) early_param(no-steal-acc, parse_no_stealacc); -struct kvm_para_state { - u8 mmu_queue[MMU_QUEUE_SIZE]; - int mmu_queue_len; -}; - -static DEFINE_PER_CPU(struct kvm_para_state, para_state); static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); static int has_steal_clock = 0; -static struct kvm_para_state *kvm_para_state(void) -{ - return per_cpu(para_state, raw_smp_processor_id()); -} - /* * No need for any IO delay on KVM */ @@ -271,151 +258,6 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code) } } -static void kvm_mmu_op(void *buffer, unsigned len) -{ - int r; - unsigned long a1, a2; - - do { - a1 = __pa(buffer); - a2 = 0; /* on i386 __pa() always returns 4G */ - r = kvm_hypercall3(KVM_HC_MMU_OP, len, a1, a2); - buffer += r; - len -= r; - } while (len); -} - -static void mmu_queue_flush(struct kvm_para_state *state) -{ - if (state-mmu_queue_len) { - kvm_mmu_op(state-mmu_queue, state-mmu_queue_len); - state-mmu_queue_len = 0; - } -} - -static void kvm_deferred_mmu_op(void *buffer, int len) -{ - struct kvm_para_state *state = kvm_para_state(); - - if (paravirt_get_lazy_mode() != PARAVIRT_LAZY_MMU) { - kvm_mmu_op(buffer, len); - return; - } - if (state-mmu_queue_len + len sizeof state-mmu_queue) - mmu_queue_flush(state); - memcpy(state-mmu_queue + state-mmu_queue_len, buffer, len); - state-mmu_queue_len += len; -} - -static void kvm_mmu_write(void *dest, u64 val) -{ - __u64 pte_phys; - struct kvm_mmu_op_write_pte wpte; - -#ifdef CONFIG_HIGHPTE - struct page *page; - unsigned long dst = (unsigned long) dest; - - page = kmap_atomic_to_page(dest); - pte_phys = page_to_pfn(page); - pte_phys = PAGE_SHIFT; - pte_phys += (dst ~(PAGE_MASK)); -#else - pte_phys = (unsigned long)__pa(dest); -#endif - wpte.header.op = KVM_MMU_OP_WRITE_PTE; - wpte.pte_val = val; - wpte.pte_phys = pte_phys; - - kvm_deferred_mmu_op(wpte, sizeof wpte); -} - -/* - * We only need to hook operations that are MMU writes. We hook these so that - * we can use lazy MMU mode to batch these operations. We could probably - * improve the performance of the host code if we used some of the information - * here to simplify processing of batched writes. - */ -static void kvm_set_pte(pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_set_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_set_pmd(pmd_t *pmdp, pmd_t pmd) -{ - kvm_mmu_write(pmdp, pmd_val(pmd)); -} - -#if PAGETABLE_LEVELS = 3 -#ifdef CONFIG_X86_PAE -static void kvm_set_pte_atomic(pte_t *ptep, pte_t pte) -{ - kvm_mmu_write(ptep, pte_val(pte)); -} - -static void kvm_pte_clear(struct mm_struct *mm, - unsigned long addr, pte_t *ptep) -{ - kvm_mmu_write(ptep, 0); -} - -static void kvm_pmd_clear(pmd_t *pmdp) -{ - kvm_mmu_write(pmdp, 0); -} -#endif - -static void kvm_set_pud(pud_t *pudp, pud_t pud) -{ - kvm_mmu_write(pudp, pud_val(pud)); -} - -#if PAGETABLE_LEVELS == 4 -static void kvm_set_pgd(pgd_t *pgdp, pgd_t pgd) -{ - kvm_mmu_write(pgdp, pgd_val(pgd)); -} -#endif -#endif /* PAGETABLE_LEVELS = 3 */ - -static void kvm_flush_tlb(void) -{ - struct kvm_mmu_op_flush_tlb ftlb = { - .header.op = KVM_MMU_OP_FLUSH_TLB, - }; - - kvm_deferred_mmu_op(ftlb, sizeof ftlb); -} - -static void kvm_release_pt(unsigned long pfn) -{ - struct kvm_mmu_op_release_pt rpt = { - .header.op = KVM_MMU_OP_RELEASE_PT, - .pt_phys = (u64)pfn PAGE_SHIFT, - }; - - kvm_mmu_op(rpt, sizeof rpt); -} - -static void kvm_enter_lazy_mmu(void) -{ - paravirt_enter_lazy_mmu(); -} - -static void kvm_leave_lazy_mmu(void) -{ - struct kvm_para_state *state = kvm_para_state(); - - mmu_queue_flush(state
[PATCH RFC 2/2] KVM: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in January 2011. It's not in use, is slower than shadow or hardware assisted paging, and a maintenance burden. It's October 2011, time to remove it. Signed-off-by: Chris Wright chr...@redhat.com --- Documentation/feature-removal-schedule.txt |9 -- arch/x86/include/asm/kvm_host.h| 13 --- arch/x86/kvm/mmu.c | 135 arch/x86/kvm/x86.c | 12 --- 4 files changed, 0 insertions(+), 169 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 4dc4654..75f88a5 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -397,15 +397,6 @@ Who: anybody or Florian Mickler flor...@mickler.org -What: KVM paravirt mmu host support -When: January 2011 -Why: The paravirt mmu host support is slower than non-paravirt mmu, both - on newer and older hardware. It is already not exposed to the guest, - and kept only for live migration purposes. -Who: Avi Kivity a...@redhat.com - - - What: iwlwifi 50XX module parameters When: 3.0 Why: The ..50 modules parameters were used to configure 5000 series and diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dd51c83..8c9ce69 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -241,13 +241,6 @@ struct kvm_mmu_page { struct rcu_head rcu; }; -struct kvm_pv_mmu_op_buffer { - void *ptr; - unsigned len; - unsigned processed; - char buf[512] __aligned(sizeof(long)); -}; - struct kvm_pio_request { unsigned long count; int in; @@ -343,10 +336,6 @@ struct kvm_vcpu_arch { */ struct kvm_mmu *walk_mmu; - /* only needed in kvm_pv_mmu_op() path, but it's hot so -* put it here to avoid allocation */ - struct kvm_pv_mmu_op_buffer mmu_op_buffer; - struct kvm_mmu_memory_cache mmu_pte_list_desc_cache; struct kvm_mmu_memory_cache mmu_page_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; @@ -666,8 +655,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3); int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa, const void *val, int bytes); -int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes, - gpa_t addr, unsigned long *ret); u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn); extern bool tdp_enabled; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 8e8da79..0a45bc1 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2005,20 +2005,6 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) return r; } -static void mmu_unshadow(struct kvm *kvm, gfn_t gfn) -{ - struct kvm_mmu_page *sp; - struct hlist_node *node; - LIST_HEAD(invalid_list); - - for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) { - pgprintk(%s: zap %llx %x\n, -__func__, gfn, sp-role.word); - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); - } - kvm_mmu_commit_zap_page(kvm, invalid_list); -} - static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn) { int slot = memslot_id(kvm, gfn); @@ -3958,127 +3944,6 @@ unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm) return nr_mmu_pages; } -static void *pv_mmu_peek_buffer(struct kvm_pv_mmu_op_buffer *buffer, - unsigned len) -{ - if (len buffer-len) - return NULL; - return buffer-ptr; -} - -static void *pv_mmu_read_buffer(struct kvm_pv_mmu_op_buffer *buffer, - unsigned len) -{ - void *ret; - - ret = pv_mmu_peek_buffer(buffer, len); - if (!ret) - return ret; - buffer-ptr += len; - buffer-len -= len; - buffer-processed += len; - return ret; -} - -static int kvm_pv_mmu_write(struct kvm_vcpu *vcpu, -gpa_t addr, gpa_t value) -{ - int bytes = 8; - int r; - - if (!is_long_mode(vcpu) !is_pae(vcpu)) - bytes = 4; - - r = mmu_topup_memory_caches(vcpu); - if (r) - return r; - - if (!emulator_write_phys(vcpu, addr, value, bytes)) - return -EFAULT; - - return 1; -} - -static int kvm_pv_mmu_flush_tlb(struct kvm_vcpu *vcpu) -{ - (void)kvm_set_cr3(vcpu, kvm_read_cr3(vcpu)); - return 1; -} - -static int kvm_pv_mmu_release_pt(struct kvm_vcpu *vcpu, gpa_t addr) -{ - spin_lock(vcpu-kvm-mmu_lock); - mmu_unshadow(vcpu-kvm, addr PAGE_SHIFT); - spin_unlock(vcpu-kvm-mmu_lock); - return 1; -} - -static int
Re: [libvirt] Qemu/KVM is 3x slower under libvirt
* Reeted (ree...@shiftmail.org) wrote: On 09/29/11 02:39, Chris Wright wrote: Can you help narrow down what is happening during the additional 12 seconds in the guest? For example, does a quick simple boot to single user mode happen at the same boot speed w/ and w/out vhost_net? Not tried (would probably be too short to measure effectively) but I'd guess it would be the same as for multiuser, see also the FC6 sub-thread I'm guessing (hoping) that it's the network bring-up that is slow. Are you using dhcp to get an IP address? Does static IP have the same slow down? It's all static IP. And please see my previous post, 1 hour before yours, regarding Fedora Core 6: the bring-up of eth0 in Fedora Core 6 is not particularly faster or slower than the rest. This is an overall system slowdown (I'd say either CPU or disk I/O) not related to the network (apart from being triggered by vhost_net). OK, I re-read it (pretty sure FC6 had the old dhclient, which is why I wondered). That is odd. No ideas are springing to mind. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk
* Donald Dutile (ddut...@redhat.com) wrote: commit f9c29774d2174df6ffc20becec20928948198914 changed the PCIe Capability structure version check from if 2 fail, to if ==1, size=x, if ==2, size=y, else fail. Turns out the 82599's VF has an errata where it's PCIe Cap struct version is 0, which now fails device assignment due to the else fallout, where before, it would blissfully work. Add a quirk if version=0, intel-82599, set size to version 2 struct. Signed-off-by: Donald_Dutile ddut...@redhat.com (not pretty, but neither is the hw errata...) Acked-by: Chris Wright chr...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [libvirt] Qemu/KVM is 3x slower under libvirt
* Reeted (ree...@shiftmail.org) wrote: On 09/28/11 11:28, Daniel P. Berrange wrote: On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote: On 09/28/11 09:51, Daniel P. Berrange wrote: You could have equivalently used -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3 It's this! It's this!! (thanks for the line) It raises boot time by 10-13 seconds Ok, that is truely bizarre and I don't really have any explanation for why that is. I guess you could try 'vhost=off' too and see if that makes the difference. YES! It's the vhost. With vhost=on it takes about 12 seconds more time to boot. Can you help narrow down what is happening during the additional 12 seconds in the guest? For example, does a quick simple boot to single user mode happen at the same boot speed w/ and w/out vhost_net? I'm guessing (hoping) that it's the network bring-up that is slow. Are you using dhcp to get an IP address? Does static IP have the same slow down? If it's just dhcp, can you recompile qemu with this patch and see if it causes the same slowdown you saw w/ vhost? diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 0b03b57..0c864f7 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -496,7 +496,7 @@ static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt, if (n-has_vnet_hdr) { memcpy(hdr, buf, sizeof(*hdr)); offset = sizeof(*hdr); -work_around_broken_dhclient(hdr, buf + offset, size - offset); +//work_around_broken_dhclient(hdr, buf + offset, size - offset); } /* We only ever receive a struct virtio_net_hdr from the tapfd, -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to assign a pci device to guest [with qemu.git upstream]?
* Ren, Yongjie (yongjie@intel.com) wrote: I'm using kvm and qemu upstream on https://github.com/avikivity The following command line was right for me about three weeks ago, but now I meet some error. # qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=0e:00.0 -hda /root/rhel6u1.img output error is like following. qemu-system-x86_64: -device pci-assign,host=0d:00.0: Parameter 'driver' expects a driver name Try with argument '?' for a list. Looks like you don't have device assignment support compiled in. Start with the basics (assuming tree has hw/device-assignment.c): did your ./configure output show: KVM device assig. yes and does your binary agree? qemu-system-x86_64 -device ? 21 | grep pci-assign thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to assign a pci device to guest [with qemu.git upstream]?
* Ren, Yongjie (yongjie@intel.com) wrote: Chris, Thanks very much for you kind help. I can't find hw/device-assignment.c in the qemu.git tree. Avi, I clone qemu from git://github.com/avikivity/qemu.git So device assignment is not available. But qemu-kvm.git has device-assignment code before kernel.org is down. Any update for this issue? Are you using the master branch? I noticed the github web defaults to the memory/queue branch. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to assign a pci device to guest [with qemu.git upstream]?
* Chris Wright (chr...@sous-sol.org) wrote: * Ren, Yongjie (yongjie@intel.com) wrote: Chris, Thanks very much for you kind help. I can't find hw/device-assignment.c in the qemu.git tree. Avi, I clone qemu from git://github.com/avikivity/qemu.git So device assignment is not available. But qemu-kvm.git has device-assignment code before kernel.org is down. Any update for this issue? Are you using the master branch? I noticed the github web defaults to the memory/queue branch. BTW, if you hadn't used branches much before, something like this will get you what you want: $ git checkout -b master origin/master Now you'll be on the master branch (and it should track upstream master properly). thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: inter VM / PF-VF communication
* Sagar Borikar (sagar.bori...@gmail.com) wrote: Sorry if I am not keeping up on the subject but wanted to know whether there is any effort going on for inter VM communication / PF-VF communication (in case of SR-IOV) I see that most of SR-IOV capable NIC supports mailboxes for that purpose to avoid the security hole. Xen has virtual device implementation for the same. Should I presume that such kind of effort is not on the radar and HW needs to own the responsibility of filling the loop holes in security threats imposed by VF? We do not support this, and had no plans to. Most cards have managed to do this in hw. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI-passthrough - issues/questions/ideas
* Patrick Ringl (patri...@freenet.de) wrote: Hi, I just wanted to introduce a problem I currently face including some questions regarding my temporary fix. Anyway, I have a PCI-device that I want to passthrough to a hvm guest. Now there are several problems that add up: a) the PCI-device is bound to a PCI-to-PCI bridge (which in turn is directly attached to the rootbus) (mainboard has a AMD 970/SB950 chipset). [since pciIsParent shows that the secondary bus equals the device's bus]: bridge: lspci -s00:14.4 -vvv | grep Bus: Bus: primary=00, secondary=07, subordinate=07, sec-latency=64 PCI-device 07:06.0 Multimedia controller: Philips Semiconductors SAA7146 (rev 01) b) neither the bridge nor the PCI device itself have the currently implemented reset functionality that you trigger in pciResetDevice c) the PCI-device is mapped through the PCI bridge (IOMMU-wise): ACPI IOMMU dump: [1.121239] AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00 [1.121274] AMD-Vi: DEV_ALIAS_RANGE devid: 07:00.0 flags: 00 devid_to: 00:14.4 [1.121311] AMD-Vi: DEV_RANGE_END devid: 07:1f.7 What I did to get (temporarily and in a rather hackish (maybe even wrong) manner) rid of the problem, is to ignore the error thrown in pciResetDevice when no reset had been possible at all. if (ret 0) { /* -- I know what you did last summer! virErrorPtr err = virGetLastError(); pciReportError(VIR_ERR_INTERNAL_ERROR, _(Unable to reset PCI device %s: %s), dev-name, err ? err-message : _(no FLR, PM reset or bus reset available)); */ ret = 0; } Concludingly I'd ask the following questions: a) Why is a secondary_bus_reset a bad idea if the device in question's _primary_ bus is the root bus? That's not the issue the code is guarding against. It is guarding against issuing a secondary bus reset on the root bus. Your dev-bus should be 7, not 0. However, the part that should be failing is pciTrySecondaryBusReset(). And this will fail if there are other devices on that bus (07) that are not assigned to your guest, because a secondary bus resest will reset _all_ devices on the secondary bus. b) Why would it be a bad idea adding some sort of 'override attribute' to the guest's config, so libvirt may intentionally skip the reset? What are the possible consequences if no reset takes place at all? The problem with skipping the reset is primarily a security concern. Device state will leak between users of the device. An override is possible, you can discuss that with libvirt developers to see if they'd support an insecure flag like that. c) What options do I have besides implementing c) or just going the dirty way and ignore the case when no reset is possible (like I described above)? You can try assiging all devices on bus 7 to the guest. This should allow a sbus reset to be issued. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Memory API code review
* Avi Kivity (a...@redhat.com) wrote: I would like to carry out an online code review of the memory API so that more people are familiar with the internals, and perhaps even to catch some bugs or deficiency. I'd like to use the next kvm conference call slot for this (Tuesday 1400 UTC) since many people already have it reserved in the schedule. It would be great if people from the wider qemu community be present, rather than the usual x86 is everything crowd (+Jan) that usually participates in the kvm weekly call. Juan, Chris, can we dedicate next week's call to this? Yup, sounds like a good idea. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment VFIO ramblings
* Aaron Fabbri (aafab...@cisco.com) wrote: On 8/26/11 7:07 AM, Alexander Graf ag...@suse.de wrote: Forget the KVM case for a moment and think of a user space device driver. I as a user am not root. But I as a user when having access to /dev/vfioX want to be able to access the device and manage it - and only it. The admin of that box needs to set it up properly for me to be able to access it. So having two steps is really the correct way to go: * create VFIO group * use VFIO group because the two are done by completely different users. This is not the case for my userspace drivers using VFIO today. Each process will open vfio devices on the fly, and they need to be able to share IOMMU resources. How do you share IOMMU resources w/ multiple processes, are the processes sharing memory? So I need the ability to dynamically bring up devices and assign them to a group. The number of actual devices and how they map to iommu domains is not known ahead of time. We have a single piece of silicon that can expose hundreds of pci devices. This does not seem fundamentally different from the KVM use case. We have 2 kinds of groupings. 1) low-level system or topoolgy grouping Some may have multiple devices in a single group * the PCIe-PCI bridge example * the POWER partitionable endpoint Many will not * singleton group, e.g. typical x86 PCIe function (majority of assigned devices) Not sure it makes sense to have these administratively defined as opposed to system defined. 2) logical grouping * multiple low-level groups (singleton or otherwise) attached to same process, allowing things like single set of io page tables where applicable. These are nominally adminstratively defined. In the KVM case, there is likely a privileged task (i.e. libvirtd) involved w/ making the device available to the guest and can do things like group merging. In your userspace case, perhaps it should be directly exposed. In my case, the only administrative task would be to give my processes/users access to the vfio groups (which are initially singletons), and the application actually opens them and needs the ability to merge groups together to conserve IOMMU resources (assuming we're not going to expose uiommu). I agree, we definitely need to expose _some_ way to do this. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment VFIO ramblings
* Aaron Fabbri (aafab...@cisco.com) wrote: On 8/26/11 12:35 PM, Chris Wright chr...@sous-sol.org wrote: * Aaron Fabbri (aafab...@cisco.com) wrote: Each process will open vfio devices on the fly, and they need to be able to share IOMMU resources. How do you share IOMMU resources w/ multiple processes, are the processes sharing memory? Sorry, bad wording. I share IOMMU domains *within* each process. Ah, got it. Thanks. E.g. If one process has 3 devices and another has 10, I can get by with two iommu domains (and can share buffers among devices within each process). If I ever need to share devices across processes, the shared memory case might be interesting. So I need the ability to dynamically bring up devices and assign them to a group. The number of actual devices and how they map to iommu domains is not known ahead of time. We have a single piece of silicon that can expose hundreds of pci devices. This does not seem fundamentally different from the KVM use case. We have 2 kinds of groupings. 1) low-level system or topoolgy grouping Some may have multiple devices in a single group * the PCIe-PCI bridge example * the POWER partitionable endpoint Many will not * singleton group, e.g. typical x86 PCIe function (majority of assigned devices) Not sure it makes sense to have these administratively defined as opposed to system defined. 2) logical grouping * multiple low-level groups (singleton or otherwise) attached to same process, allowing things like single set of io page tables where applicable. These are nominally adminstratively defined. In the KVM case, there is likely a privileged task (i.e. libvirtd) involved w/ making the device available to the guest and can do things like group merging. In your userspace case, perhaps it should be directly exposed. Yes. In essence, I'd rather not have to run any other admin processes. Doing things programmatically, on the fly, from each process, is the cleanest model right now. I don't see an issue w/ this. As long it can not add devices to the system defined groups, it's not a privileged operation. So we still need the iommu domain concept exposed in some form to logically put groups into a single iommu domain (if desired). In fact, I believe Alex covered this in his most recent recap: ...The group fd will provide interfaces for enumerating the devices in the group, returning a file descriptor for each device in the group (the device fd), binding groups together, and returning a file descriptor for iommu operations (the iommu fd). thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gfx card passthrough broken with latest head
* André Weidemann (andre.weidem...@web.de) wrote: snip git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git snip ./configure --audio-drv-list=alsa --target-list=x86_64-softmmu --enable-kvm-device-assignment ERROR: unknown option --enable-kvm-device-assignment snip How come so many revision do not support device assignment? Is there a trick to enable it? Bisection qemu-kvm userspace is tricky. The upstream qemu repo (git://git.qemu.org/qemu.git) does not have PCI device assignment support. The qemu-kvm repo does regular merges w/ the upstream qemu repo. As you bisect through the qemu-kvm repo history, you are likely to land on a commit that is from upstream (meaning a tree w/out downstream qemu-kvm additions, like device assignment). Depending on where you suspect the issue is coming from, you can be careful to bisect only through the qemu-kvm tree (by skipping back to a merge point), or you can remerge the qemu-kvm tree to the qemu tree when you bisect into the qemu tree. Note...gfx assignment has many issues associated w/ it and often does not work. You can check out Allen Kay's presentation at the recent KVM Forum for some examples: http://goo.gl/Hyk13 thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints
* Don Dutile (ddut...@redhat.com) wrote: On 07/24/2011 06:58 AM, Michael S. Tsirkin wrote: On Sun, Jul 24, 2011 at 11:41:10AM +0300, Michael S. Tsirkin wrote: On Sun, Jul 24, 2011 at 11:12:44AM +0300, Michael S. Tsirkin wrote: On Fri, Jul 22, 2011 at 02:35:47PM -0700, Chris Wright wrote: * Alex Williamson (alex.william...@redhat.com) wrote: On Fri, 2011-07-22 at 14:24 -0700, Chris Wright wrote: * Donald Dutile (ddut...@redhat.com) wrote: +} else if (version == 2) { +/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */ +size = 0x34; That doesn't look correct to me. The size is fixed, just that some registers are Reserved Zero when they do not apply (e.g. endpoint only). Apparently it can be interpreted differently. In this case, we've seen a tg3 device expose a v2 PCI express capability at offset 0xcc. Using 0x3c bytes, we extend 8 bytes past the legacy config space area :( Wow, that device sounds broken to me. The spec is pretty clear. Yes, I agree it's broken. Looks like something that happens when a device is designed in parallel with the spec. What bothers me is this patch seems to make devices that do behave correctly out of spec (registers will be writeable by default) - correct? How about we check for overflow and only do the hacks if it happens? Also, the code to initialize slot and root control registers is still there: it would seem that running it will corrupt memmory beyond the config array? I take this last bit back: registers we touch are at offset 0x34. Sorry about the noise. But the question about read-only registers still stands. Also, where does the magic 0x34 come from? I'm guessing this is simply what's left till the end of the config space. So let's be conservative specific as possible with this hack: I believe the spec leaves room for interpretation, and thus the resulting 'broken' device. As I read the spec, the size of the struct can be: Yeah, I can see how it might be misinterpreted, however, it's made really clear in the config space test spec. This strucuture is meant to be full size. Perhaps something like Michael suggested (and if really paranoid + pci vendor/device id to quirk it). I haven't come across many devices have this wrong. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints
* Donald Dutile (ddut...@redhat.com) wrote: diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 36ad6b0..34db52e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1419,16 +1419,18 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) } if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) { -uint8_t version; +uint8_t version, size; uint16_t type, devctl, lnkcap, lnksta; uint32_t devcap; -int size = 0x3c; /* version 2 size */ version = pci_get_byte(pci_dev-config + pos + PCI_EXP_FLAGS); version = PCI_EXP_FLAGS_VERS; if (version == 1) { size = 0x14; -} else if (version 2) { +} else if (version == 2) { +/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */ +size = 0x34; That doesn't look correct to me. The size is fixed, just that some registers are Reserved Zero when they do not apply (e.g. endpoint only). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints
* Alex Williamson (alex.william...@redhat.com) wrote: On Fri, 2011-07-22 at 14:24 -0700, Chris Wright wrote: * Donald Dutile (ddut...@redhat.com) wrote: diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 36ad6b0..34db52e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1419,16 +1419,18 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) } if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) { -uint8_t version; +uint8_t version, size; uint16_t type, devctl, lnkcap, lnksta; uint32_t devcap; -int size = 0x3c; /* version 2 size */ version = pci_get_byte(pci_dev-config + pos + PCI_EXP_FLAGS); version = PCI_EXP_FLAGS_VERS; if (version == 1) { size = 0x14; -} else if (version 2) { +} else if (version == 2) { +/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */ +size = 0x34; That doesn't look correct to me. The size is fixed, just that some registers are Reserved Zero when they do not apply (e.g. endpoint only). Apparently it can be interpreted differently. In this case, we've seen a tg3 device expose a v2 PCI express capability at offset 0xcc. Using 0x3c bytes, we extend 8 bytes past the legacy config space area :( Wow, that device sounds broken to me. The spec is pretty clear. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] Introduce iommu_commit() function
* David Woodhouse (dw...@infradead.org) wrote: I'd much rather KVM just gave us a list of the pages to map, in a single call. This makes most sense to me. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
* Izik Eidus (izik.ei...@ravellosystems.com) wrote: On 6/22/2011 3:21 AM, Chris Wright wrote: * Nai Xia (nai@gmail.com) wrote: + if (!shadow_dirty_mask) { + WARN(1, KVM: do NOT try to test dirty bit in EPT\n); + goto out; + } This should never fire with the dirty_update() notifier test, right? And that means that this whole optimization is for the shadow mmu case, arguably the legacy case. Hi Chris, AMD npt does track the dirty bit in the nested page tables, so the shadow_dirty_mask should not be 0 in that case... Yeah, momentary lapse... ;) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for June 21
concerns about backwards compat - https://bugzilla.redhat.com/show_bug.cgi?id=689672 - f13 host can no longer run f14 guest after qemu update - this particular bug is older f13 which includes patched qemu... - could be useful to fingerprint the guest (lspci, etc) - sounds simple enough, need someone who's inclined to do it state of image streaming/block copy - live block copy and image streaming overlap - attempting to unify - some confusion over next steps - need to clarify differing requirements (shared storage vs. generic storage) - stefan to summarize solution proposal on list/wiki guest agent api current verbs and future roadmap? - pretty happy w/ current verbs, future intention to keep it simple, high-level - should be working on windows guests -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mmu_notifier, kvm: Introduce dirty bit tracking in spte and mmu notifier to help KSM dirty bit tracking
* Nai Xia (nai@gmail.com) wrote: Introduced kvm_mmu_notifier_test_and_clear_dirty(), kvm_mmu_notifier_dirty_update() and their mmu_notifier interfaces to support KSM dirty bit tracking, which brings significant performance gain in volatile pages scanning in KSM. Currently, kvm_mmu_notifier_dirty_update() returns 0 if and only if intel EPT is enabled to indicate that the dirty bits of underlying sptes are not updated by hardware. Did you test with each of EPT, NPT and shadow? Signed-off-by: Nai Xia nai@gmail.com Acked-by: Izik Eidus izik.ei...@ravellosystems.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c | 36 + arch/x86/kvm/mmu.h |3 +- arch/x86/kvm/vmx.c |1 + include/linux/kvm_host.h|2 +- include/linux/mmu_notifier.h| 48 +++ mm/mmu_notifier.c | 33 ++ virt/kvm/kvm_main.c | 27 ++ 8 files changed, 149 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d2ac8e2..f0d7aa0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -848,6 +848,7 @@ extern bool kvm_rebooting; int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_age_hva(struct kvm *kvm, unsigned long hva); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva); void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int cpuid_maxphyaddr(struct kvm_vcpu *vcpu); int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index aee3862..a5a0c51 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -979,6 +979,37 @@ out: return young; } +/* + * Caller is supposed to SetPageDirty(), it's not done inside this. + */ +static +int kvm_test_and_clear_dirty_rmapp(struct kvm *kvm, unsigned long *rmapp, +unsigned long data) +{ + u64 *spte; + int dirty = 0; + + if (!shadow_dirty_mask) { + WARN(1, KVM: do NOT try to test dirty bit in EPT\n); + goto out; + } This should never fire with the dirty_update() notifier test, right? And that means that this whole optimization is for the shadow mmu case, arguably the legacy case. + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + int _dirty; + u64 _spte = *spte; + BUG_ON(!(_spte PT_PRESENT_MASK)); + _dirty = _spte PT_DIRTY_MASK; + if (_dirty) { + dirty = 1; + clear_bit(PT_DIRTY_SHIFT, (unsigned long *)spte); Is this sufficient (not losing dirty state ever)? + } + spte = rmap_next(kvm, rmapp, spte); + } +out: + return dirty; +} + #define RMAP_RECYCLE_THRESHOLD 1000 static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) @@ -1004,6 +1035,11 @@ int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp); +int kvm_test_and_clear_dirty_hva(struct kvm *kvm, unsigned long hva) +{ + return kvm_handle_hva(kvm, hva, 0, kvm_test_and_clear_dirty_rmapp); +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 7086ca8..b8d01c3 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -18,7 +18,8 @@ #define PT_PCD_MASK (1ULL 4) #define PT_ACCESSED_SHIFT 5 #define PT_ACCESSED_MASK (1ULL PT_ACCESSED_SHIFT) -#define PT_DIRTY_MASK (1ULL 6) +#define PT_DIRTY_SHIFT 6 +#define PT_DIRTY_MASK (1ULL PT_DIRTY_SHIFT) #define PT_PAGE_SIZE_MASK (1ULL 7) #define PT_PAT_MASK (1ULL 7) #define PT_GLOBAL_MASK (1ULL 8) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index d48ec60..b407a69 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4674,6 +4674,7 @@ static int __init vmx_init(void) kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull, VMX_EPT_EXECUTABLE_MASK); kvm_enable_tdp(); + kvm_dirty_update = 0; Doesn't the above shadow_dirty_mask==0ull tell us this same info? } else kvm_disable_tdp(); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31ebb59..2036bae 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -53,7 +53,7 @@ struct kvm; struct kvm_vcpu; extern struct kmem_cache *kvm_vcpu_cache; - +extern int kvm_dirty_update; /* * It would be nice to use something smarter than a linear search, TBD... * Thankfully we dont expect many devices to register (famous last words :), diff --git
Re: Seeing DMAR errors after multiple load/unload with SR-IOV
* padmanabh ratnakar (pratnaka...@gmail.com) wrote: On Tue, Jun 7, 2011 at 4:04 AM, Chris Wright chr...@sous-sol.org wrote: * Alex Williamson (alex.william...@redhat.com) wrote: On Mon, 2011-06-06 at 14:39 +0530, padmanabh ratnakar wrote: Hi,     I am using linux kernel 2.6.39. I have a IBM x3650 M3 system. I have used following boot options - intel_iommu=on iommu=pt I was loading/unloading my NIC driver(be2net) with num_vfs=7. After some iterations I get following DMAR errors - Jun  4 03:50:20 rhel6 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0. Jun  4 03:50:20 rhel6 kernel: Do you have a strange power saving mode enabled? Jun  4 03:50:20 rhel6 kernel: Dazed and confused, but trying to continue Jun  4 03:50:20 rhel6 kernel: DRHD: handling fault status reg 2 Jun  4 03:50:20 rhel6 kernel: DMAR:[DMA Read] Request device [1a:00.2] fault addr 78077000 Jun  4 03:50:20 rhel6 kernel: DMAR:[fault reason 02] Present bit in context entry is clear I was trying to debug this. I dont understand iommu code much. The physical address belongs the printed PCI function and there should not have been an error. I am unable to see pci_dev(pdev) of VFs getting removed from si_domain-devices list(intel-iommu.c) when driver gets unloaded calling pci_disable_sriov() freeing VF pdevs. Looks like issue happens when when freed pdev is allocated again and as it is already in list, required initializations dont happen. I dont know if my understanding is correct. Can anyone point me to what the issue may be? Yes, that's correct.  The (now replaced) check identity_mapping() will succeed when the pci_dev is recycled (it's freed, but never removed from the list, this is an issue with passtrhough mode and device creation/desctruction).  This false match happens w/ a brand new pci_dev which still has default 32bit DMA mask, so it is removed from pt domain. During removal domain_remove_one_dev_info() test that matches only on bus/devfn (now also segment) will match despite the fact that the info-pdev != pdev-dev.archdata.iommu.  Then...Oops Typically devices are removed from the domain via drivers/pci/intel-iommu.c:device_notifier(), which is called as the device is unbound from the driver.  However, this seems to get skipped when running in passthrough mode, so I'm not sure where that's supposed to occur.  Does it happen w/o passthrough? I had tried without passthrough on RHEL 6.1 GA kernel. Was seeing hangs and panics. Will check if non passthrough mode works on latest kernel. If you blacklist the driver then a create/delete may do similar (haven't tested that idea). Also note that some intel-iommu fixes have rolled into 3.0.0-rc2, you might want to update and see if anything is better there.  Thanks, The change in identity_mapping() means we won't demote to 32-bit DMA (drop out of pt domain), so I don't think we'll see the same issue. For testing I had made a hack in 2.6.39 kernel which will prevent demoting to 32bit DMA mask and thereby prevent calling of domain_remove_one_dev_info() for the specific VF device I was using and it had worked. So as you said I may not hit the issue in latest kernel. Will try that. I think we still leak the list entry though. Bottom line is that we need to handle hotplug ADD_DEVICE and DEL_DEVICE notifications. We happen to pick up ADD_DEVICE by accident, but it's all pretty sloppy. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Seeing DMAR errors after multiple load/unload with SR-IOV
* David Woodhouse (dw...@infradead.org) wrote: On Tue, 2011-06-07 at 06:38 -0700, Chris Wright wrote: I think we still leak the list entry though. Bottom line is that we need to handle hotplug ADD_DEVICE and DEL_DEVICE notifications. We happen to pick up ADD_DEVICE by accident, but it's all pretty sloppy. Yeah, keeping a list of possible stale 'pci_dev' pointers is stupid. We should figure out the matching DMAR unit directly from the ACPI table at ADD_DEVICE time, and store it in pdev-archdata.iommu. I saw patches which were going in that direction... Cool, where are they? I'm working on something similar, and missed them. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Seeing DMAR errors after multiple load/unload with SR-IOV
* David Woodhouse (dw...@infradead.org) wrote: On Tue, 2011-06-07 at 08:10 -0700, Chris Wright wrote: * David Woodhouse (dw...@infradead.org) wrote: On Tue, 2011-06-07 at 06:38 -0700, Chris Wright wrote: I think we still leak the list entry though. Bottom line is that we need to handle hotplug ADD_DEVICE and DEL_DEVICE notifications. We happen to pick up ADD_DEVICE by accident, but it's all pretty sloppy. Yeah, keeping a list of possible stale 'pci_dev' pointers is stupid. We should figure out the matching DMAR unit directly from the ACPI table at ADD_DEVICE time, and store it in pdev-archdata.iommu. I saw patches which were going in that direction... Cool, where are they? I'm working on something similar, and missed them. [PATCH] pci, dmar: Update dmar units devices list during hotplug Oh yeah, thanks for the reminder. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for June 7
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Seeing DMAR errors after multiple load/unload with SR-IOV
* Alex Williamson (alex.william...@redhat.com) wrote: On Mon, 2011-06-06 at 14:39 +0530, padmanabh ratnakar wrote: Hi, I am using linux kernel 2.6.39. I have a IBM x3650 M3 system. I have used following boot options - intel_iommu=on iommu=pt I was loading/unloading my NIC driver(be2net) with num_vfs=7. After some iterations I get following DMAR errors - Jun 4 03:50:20 rhel6 kernel: Uhhuh. NMI received for unknown reason 2d on CPU 0. Jun 4 03:50:20 rhel6 kernel: Do you have a strange power saving mode enabled? Jun 4 03:50:20 rhel6 kernel: Dazed and confused, but trying to continue Jun 4 03:50:20 rhel6 kernel: DRHD: handling fault status reg 2 Jun 4 03:50:20 rhel6 kernel: DMAR:[DMA Read] Request device [1a:00.2] fault addr 78077000 Jun 4 03:50:20 rhel6 kernel: DMAR:[fault reason 02] Present bit in context entry is clear I was trying to debug this. I dont understand iommu code much. The physical address belongs the printed PCI function and there should not have been an error. I am unable to see pci_dev(pdev) of VFs getting removed from si_domain-devices list(intel-iommu.c) when driver gets unloaded calling pci_disable_sriov() freeing VF pdevs. Looks like issue happens when when freed pdev is allocated again and as it is already in list, required initializations dont happen. I dont know if my understanding is correct. Can anyone point me to what the issue may be? Yes, that's correct. The (now replaced) check identity_mapping() will succeed when the pci_dev is recycled (it's freed, but never removed from the list, this is an issue with passtrhough mode and device creation/desctruction). This false match happens w/ a brand new pci_dev which still has default 32bit DMA mask, so it is removed from pt domain. During removal domain_remove_one_dev_info() test that matches only on bus/devfn (now also segment) will match despite the fact that the info-pdev != pdev-dev.archdata.iommu. Then...Oops Typically devices are removed from the domain via drivers/pci/intel-iommu.c:device_notifier(), which is called as the device is unbound from the driver. However, this seems to get skipped when running in passthrough mode, so I'm not sure where that's supposed to occur. Does it happen w/o passthrough? If you blacklist the driver then a create/delete may do similar (haven't tested that idea). Also note that some intel-iommu fixes have rolled into 3.0.0-rc2, you might want to update and see if anything is better there. Thanks, The change in identity_mapping() means we won't demote to 32-bit DMA (drop out of pt domain), so I don't think we'll see the same issue. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Apr 26
Tools for resource accounting the virtual machines. - Luis Castro was not on the call Status of glib tree - next steps? - full conversion done in tree - still targeting 0.15 status of QCFG - code generator rewritten to be more generic and useful - merge core infrastructure first - to not block other work waiting on full conversion - still need to complete full conversion qemu-kvm merge - status - review and merge/feedback pending from Avi on current outstanding patches - still have some 60 patches - break them into a few smaller series - next steps, specifically: - upstreaming in-kernel irqchip support - MSI/MSI-X (cleanup and make mergable) - this is a decent amount of work, Jan is solo...anyone want to help? - need to be careful of regressions - add tests to avi's autotest run (e.g., cpu hotplug) - cpu hotplug test initiated from host side - online needs some cooperation in linux - still unclear on what's supported, windows apparently only supports online autotest - had autotest test day, feedback coming on list - some issues with getting set up - having basic common config could be useful KVM Forum reminder - send in your proposals -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] intel-iommu: Fix use after release during device attach
* Jan Kiszka (jan.kis...@siemens.com) wrote: On 2011-01-04 11:42, Jan Kiszka wrote: Am 10.12.2010 19:44, Chris Wright wrote: * Jan Kiszka (jan.kis...@siemens.com) wrote: --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -3627,9 +3627,9 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, pte = dmar_domain-pgd; if (dma_pte_present(pte)) { - free_pgtable_page(dmar_domain-pgd); dmar_domain-pgd = (struct dma_pte *) phys_to_virt(dma_pte_addr(pte)); While here, might as well remove the unnecessary cast. + free_pgtable_page(pte); } dmar_domain-agaw--; } Reviewed-by: Sheng Yang sh...@linux.intel.com Acked-by: Chris Wright chr...@sous-sol.org CC iommu mailing list and David. Ping... I think this fix also qualifies for stable (.35 and .36). Still not merged? David, do you plan to pick this one up? thanks, -chris Hmm, still no reaction. Trying David's Intel address now... Jan Walking through my old queues, I came across this one again. Given the still lacking reaction from the official maintainer, I'm a bit confused about the state of intel-iommu. Is it unmaintained? Should this bug fix better be routed through the KVM tree as its only in-tree user? Please enlighten me. Note that the patch became stable material for 35..38 in the meantime, and it should go into 39 before release as well. Thanks, Jan ---8 Obtain the new pgd pointer before releasing the page containing this value. Remove unneeded cast at this chance as well. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Acked-by: Chris Wright chr...@sous-sol.org --- drivers/pci/intel-iommu.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) v1-v2: Clean up cast as suggested by Chris. diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index 505c1c7..b3e5c43 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -3607,9 +3607,8 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, pte = dmar_domain-pgd; if (dma_pte_present(pte)) { - free_pgtable_page(dmar_domain-pgd); - dmar_domain-pgd = (struct dma_pte *) - phys_to_virt(dma_pte_addr(pte)); + dmar_domain-pgd = phys_to_virt(dma_pte_addr(pte)); + free_pgtable_page(pte); } dmar_domain-agaw--; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Apr 5
KVM Forum - save the date is out, cfp will follow later this week - abstracts due in 6wks, 2wk review period, notifications by end of May Improving process to scale project - Trivial patch bot - Sub-maintainership Trivial patch monkeys^Wteam - small/simple patches posted can fall through the cracks (esp. for areas that aren't well maintained) - patches should be simple, easy to review ( - aiming to gather a team, so that the position can rotate - patch submitter can rest assured - Stefan and possibly Mike Roth are volunteering to get this started - Cc: qemu-triv...@nongnu.org to send patches to the Trivial patch monkey - details here: http://wiki.qemu.org/Contribute/TrivialPatches Sub-maintainership - have MAINTAINERS file - need to add git tree URLs - needs another pass to make sure there are no missing subsystems - make it clearer how maintained the subsystems are - adding a wiki page to show how to become a subsystem maintainer - one valuable step...write testing around the subsystem - means you've had to learn the subsystem (builds expertise) - allows for regression testing the subsystem (esp. validating new patches) - sub-maintainers sometimes disappear - can add another maintainer - actively poke the maintainer when patches are languishing - if you're going to be away, be sure to let list or backup know - systematic patch tracking would help, patchwork doesn't quite cut it - who receives pull request - list + blue swirl/aurelien for tcg, anthony picking up plenty of other bits - infrastructure subsystems (qdev, migration, etc..) - big invasive changes done externally, effective flag day for full merge - subsystem localized change (e.g. vmstate fix for a specific device) maintainers can work it out, be sure to have both - facilitating patch review and hopefully improving subsystem over time kvm-autotest - roadmap...refactor to centralize testing (handle the xen-autotest split off) - internally at RH, lmr and cleber maintain autotest server to test branches (testing qemu.git daily) - have good automation for installs and testing - seems more QA focused than developers - plenty of benefit for developers, so lack of developer use partly cultural/visibility... - kvm-autotest team always looking for feedback to improve for developer use case - kvm-autotest day to have folks use it, write test, give feedback? - startup cost is/was steep, the day might be too much handholding - install-fest? (to get it installed and up and running) - buildbot or autotest for testing patches to verify building and working - one goal is to reduce mailing list load (patch resubmission because they haven't handled basic cases that buildbot or autotest would have caught) - fedora-virt test day coming up on April 14th. lucas will be on hand and we can piggy back on that to include kvm-autotest install and virt testing - kvm autotest run before qemu pull request and post merge to track regressions, more frequent testing helps developers see breakage quickly - qemu.git daily testing already, only the sanity test subset - run more comprehensive stable set of tests on weekends - one issue is the large number of known failures, need to make these easier to identify (and fix the failures one way or another) - create database and verify (regressions) against that - red/yellow/green (yellow shows area was already broken) - autotest can be run against server, not just on laptop - how to do remote client display testing (e.g. spice client) - dogtail and LDTP - graphics could be tested w/ screenshot compares - WHQL testing automated as well -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] device-assignment: Reset device on system reset
* Alex Williamson (alex.william...@redhat.com) wrote: static void reset_assigned_device(DeviceState *dev) { -PCIDevice *d = DO_UPCAST(PCIDevice, qdev, dev); +PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, dev); +AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev); +char reset_file[64]; +const char reset[] = 1; +int fd, ret; + +snprintf(reset_file, sizeof(reset_file), + /sys/bus/pci/devices/:%02x:%02x.%01x/reset, + adev-host.bus, adev-host.dev, adev-host.func); need to consider segment: %04x:..., adev-host.seg, ... +/* + * Issue a device reset via pci-sysfs. Note that we use write(2) here + * and ignore the return value because some kernels have a bug that + * returns 0 rather than bytes written on success, sending us into an + * infinite retry loop using other write mechanisms. + */ +fd = open(reset_file, O_WRONLY); +if (fd != -1) { +ret = write(fd, reset, strlen(reset)); +close(fd); +} This will probably fail when it's managed by libvirt. I expect it will need some file ownership and security label mgmt added to device assignement path I expect. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] device-assignment: Reset device on system reset
* Alex Williamson (alex.william...@redhat.com) wrote: On system reset, we currently try to quiesce DMA by clearing the command register. This assumes that nothing re-enables bus master support without first de-programming the device. Use a bigger hammer to help the guest not shoot itself by issuing a function reset via sysfs on each system reset. Signed-off-by: Alex Williamson alex.william...@redhat.com Looks good. Acked-by: Chris Wright chr...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] device-assignment: Reset device on system reset
* Alex Williamson (alex.william...@redhat.com) wrote: On Thu, 2011-03-17 at 14:12 -0700, Chris Wright wrote: * Alex Williamson (alex.william...@redhat.com) wrote: +fd = open(reset_file, O_WRONLY); +if (fd != -1) { +ret = write(fd, reset, strlen(reset)); +close(fd); +} This will probably fail when it's managed by libvirt. I expect it will need some file ownership and security label mgmt added to device assignement path I expect. Already posted a patch for adding file rights, seems to be sufficient: https://www.redhat.com/archives/libvir-list/2011-March/msg00823.html Awesome, I missed that path, thanks Alex! thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Mar 15
QAPI -- http://wiki.qemu.org/Features/QAPI - please review! - Anthony would like to see feedback and plans to commit in a week (assuming agreement and no major issues in review) - some concern about the maintainability of code generation - but still nothing concrete on the list, need to review and discuss on the list - some concern that implementation details may change the wire protocol - introduces a new mechanism for new signals (mask by default and enabled explicitly) - disagreement over when/how to introduce new extensions - libvirt feedback? - no protocol level changes - old and new versions are testable with test suite and proves this - c library implementation is critical to have unit tests and test driven development - thread safe? - no shared state, no statics. - threading model requires lock for the qmp session - licensiing? - LGPL - forwards/backwards compat? - designed with that in mind see wiki: http://wiki.qemu.org/Features/QAPI QCFG -- http://wiki.qemu.org/Features/QCFG - command line args translation to objects is complex and buggy - schema + code generator to formalize this - formally describe each command line option and generate code to build and validate objects - provides systematic way to document command line options - automatically - device_add does multiple conversions to go from qmp to qemuopts to objects - move to basic c structures, and autogenerated marshalling code - no plan to do this work soon, late in 0.15 cycle - same as qapi, fork a tree, do mass conversion and merge for 0.16 cycle - qmp server mode to take all configuation commands before actually starting the guest - can provide a config file - qdev... - could just bridge to setting and getting qdev properties - OR get to point where device objects go directly to qdev device init - why not move command line to qmp instead of new schema? - single schema - considerations for -M (didn't capture all of these) - for all the details: http://wiki.qemu.org/Features/QCFG Merging big changes - in the past, evolving in tree has not worked well, leaving partial conversions - QAPI/QCFG method of doing changes in external tree hopes to set new precedent - preserve patch/review on list - do full conversion - provide strong testing to show it works Kemari merge plans - just needs some ACKs - Juan, Anthony, anybody else who is familiar with migration to review? switch from gpxe to ipxe - possible 0.15 release w/ ipxe (Alex looking into it) - Michael Brown been helpful in fixing bugs, so compat - Alex will send out mail soon on the details - ipxe releases? not yet, there are plans for it, should be coming RSN - Stefan volunteers to help test -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes for Mar 15
* Anthony Liguori (anth...@codemonkey.ws) wrote: On 03/15/2011 09:53 AM, Chris Wright wrote: QAPI snip - c library implementation is critical to have unit tests and test driven development - thread safe? - no shared state, no statics. - threading model requires lock for the qmp session - licensiing? - LGPL - forwards/backwards compat? - designed with that in mind see wiki: http://wiki.qemu.org/Features/QAPI One neat feature of libqmp is that once libvirt has a better QMP passthrough interface, we can create a QmpSession that uses libvirt. It would look something like: QmpSession *libqmp_session_new_libvirt(virDomainPtr dom); Looks like you mean this? - request QmpSession - client libvirt - return QmpSession - client - QmpSession - QMP - QEMU So bypassing libvirt completely to actually use the session? Currently, it's more like: client - QemuMonitorCommand - libvirt - QMP - QEMU The QmpSession returned by this call can then be used with all of the libqmp interfaces. This means we can still exercise our test suite with a guest launched through libvirt. It also should make the libvirt pass through interface a bit easier to consume by third parties. This sounds like it's something libvirt folks should be involved with. At the very least, this mode is there now and considered basically unstable/experimental/developer use: Qemu monitor command '%s' executed; libvirt results may be unpredictable! So likely some concern about making it easier to use, esp. assuming that third parties above are mgmt apps, not just developers. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Mar 8
QAPI merge plans - should be 100% back compat - qmp moved over - hmp moved over - 1st pass, core infrastructure (includes test framework) - 2nd pass, command conversion - 3rd pass, more controversial bits - adds dependencies: glib and python - some testing based on kvm-unit-test micro-os instance (e.g. added a balloon and run commands against it to test) - add more functionality here? (kvm autotest is slow, above is quick) - will hit some point where full functionality is needed - have a mini linux to do this (lags where driver updates are part of test) - generated code can obfuscate the debugging process - code generator has some ugly corners (python writing C...) - but generated code should be debuggable, readable, etc. - some grumbling regarding glib dependency - reducing NIH and relying on external functionality is solid way to grow qemu as a project Read wiki here and review closely: http://wiki.qemu.org/Features/QAPI virt-agent - json string converted to command (and vice versa) - add to qmp schema - allows generated marshalling code to sanity check in/out - problem with qmp not being bi-directional (rpc - in, events - out) - posted events allow migration to save and send unposted events - any issues with guest agent interface extensibility - will add command to return schema - can add (optional) parameters to commands - make libqmp a shared object for 0.16 (too much going on for 0.15) - can terminate in qemu (e.g. vnc server internally qmp client to interact with guest cut 'n paste) or externally proxying to/from endpoint - possibly revisit dynamic schema in future glib, main loop, events - (context was setfd changes from amit) - iothread work is more critical to do first and get merged - glib work starting just in qapi iothread merge? - progressing slowly, marcelo working on it - have found regressions (signal handling code) (ifdef'd away for now) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: when use sriov, guest os could not access the vf device assigned
* lidong chen (chen.lidong.ker...@gmail.com) wrote: guest os could not access the vf assigned ,and print this error message . PCI: device :00:06.0 has unknown header type 7f, ignoring. PCI: device :00:07.0 has unknown header type 7f, ignoring. PCI: device :00:08.0 has unknown header type 7f, ignoring. the reason is the config file /sys/bus/pci/devices/xx/config of pci device could not access correctly after guestos start, the content qemu-kvm read from /sys/bus/pci/devices/xx/config is all FF. This is mostly likely a combination of two bugs, both have since been fixed (starting in v0.8.3). What version of libvirt are you using? One is the 82599 VF has an erratum that it does not show that it supports Function Level Reset (FLR -- SR-IOV VFs are required to support FLR). The second is libvirt had buggy handling of device reset for devices that don't support FLR. IIRC, what you are seeing is the result of a secondary bus reset resetting all devices on that bus (including the PF). Try upgrading libvirt. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: HOLY CRAP IT WORKS 8@ Hey, great! ;) ...almost... OK, clear_emulator_capabilities=0 solved the IRQ problem (which was, as it turns out, the rawio problem) My VM came up, both the tuners were there and after the firmware install I was able to tune and watch the slowest TV in the world over VNC. Thank god for that, i was really starting to believe that slashing out a lot of cash on my 890FX board and the fancy DDR3 ram it needed was a collosal waste of money. Sigh of relief Well, thank you all so much for helping me to get to this point! And yes, I did say almost works Looks like I've run straight into Chris' ref counting problem when shutting the guest down. Some sort of critical error barf was on the servers' screen when I shut down the guest, appeared to be very similar to Chris' example, in amd_iommu.c I'd post it but the server locks up after it's been shown and needed resetting. No idea how I would post that bit of dmesg as it gets reset after each boot. Is there a solution for this at the moment or will I have to wait for it to be patched? No solution at the moment. Will keep you posted. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: On Fri, Feb 25, 2011 at 12:06 AM, Chris Wright chr...@sous-sol.org wrote: * James Neave (robo...@gmail.com) wrote: OK, here's my latest dmesg with amd_iommu_dump and debug with no quiet http://pastebin.com/JxEwvqRA Yeah, that's what I expected: [ Â Â 0.724403] AMD-Vi: Â DEV_ALIAS_RANGE Â Â Â Â Â Â Â Â devid: 08:00.0 flags: 00 devid_to: 00:14.4 [ Â Â 0.724439] AMD-Vi: Â DEV_RANGE_END Â Â Â Â Â devid: 08:1f.7 That basically says 08:00.0 - 08:1f.7 will show up as 00:14.4 (and should all go into same iommu domain). I've just figured out a sequence of echo DEV PATH commands to call for 14.4 gets me past the claimed by pci-stub error and gets me to the failed to assign IRQ error. I'm going to narrow down the required sequence and then post it. Kind of afraid to ask, but does it include: (assuming 1002 4384 is the pci to pci bridge) echo 1002 4384 /sys/bus/pci/drivers/pci-stub/new_id echo :00:14.4 /sys/bus/pci/drivers/pci-stub/unbind (this has the side effect of detaching the bridge from its domain) Exact sequence is: echo 1002 4384 /sys/bus/pci/drivers/pci-stub/new_id echo :00:14.4 /sys/bus/pci/devices/:00:14.4/driver/unbind OK, same, since driver is a symlink to pci-stub. I take it this is a bad thing then? It just means the amd iommu driver might be susceptible to a refcounting issue. Indeed, here's what I do that assigning a device below the PCI-PCI bridge, then shutdown the guest: [ 406.535873] [ cut here ] [ 406.536864] kernel BUG at arch/x86/kernel/amd_iommu.c:2460! [ 406.536864] invalid opcode: [#1] SMP [ 406.536864] last sysfs file: /sys/devices/pci:00/:00:14.4/:03:06.0/device [ 406.536864] CPU 0 [ 406.536864] Modules linked in: kvm_amd kvm e1000e bnx2 [ 406.536864] [ 406.536864] Pid: 4265, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ #61 Toonie/Toonie [ 406.536864] RIP: 0010:[81025e53] [81025e53] amd_iommu_domain_destroy+0x75/0x9d [ 406.536864] RSP: 0018:88013507fb78 EFLAGS: 00010202 [ 406.536864] RAX: 8801346ebeb8 RBX: 8801346ebeb8 RCX: 00014f67 [ 406.536864] RDX: 0202 RSI: 0202 RDI: 81a118a0 [ 406.536864] RBP: 88013507fba8 R08: R09: 88007900f8e8 [ 406.536864] R10: 88013507f8d8 R11: 0006 R12: 8801346ebea8 [ 406.536864] R13: 8800783b73a8 R14: 0202 R15: 880135089570 [ 406.536864] FS: 7fe794db76e0() GS:88007fc0() knlGS: [ 406.536864] CS: 0010 DS: ES: CR0: 8005003b [ 406.536864] CR2: CR3: 7c6fb000 CR4: 06f0 [ 406.536864] DR0: DR1: DR2: [ 406.536864] DR3: DR6: 0ff0 DR7: 0400 [ 406.536864] Process qemu-system-x86 (pid: 4265, threadinfo 88013507e000, task 88013496b090) [ 406.536864] Stack: [ 406.536864] 0009 880135089570 88007c734ca0 0001 [ 406.536864] 88007c74e3c8 0002 88013507fbc8 813013b7 [ 406.536864] 0001 880135089570 88013507fbe8 a003f d81 [ 406.536864] Call Trace: [ 406.536864] [813013b7] iommu_domain_free+0x16/0x22 [ 406.536864] [a003fd81] kvm_iommu_unmap_guest+0x22/0x28 [kvm] [ 406.536864] [a00440fd] kvm_arch_destroy_vm+0x15/0x119 [kvm] [ 406.536864] [a003af59] kvm_put_kvm+0xde/0x103 [kvm] [ 406.536864] [a003b64e] kvm_vcpu_release+0x13/0x17 [kvm] [ 406.536864] [810e893a] fput+0x11b/0x1bc [ 406.536864] [810e5db9] filp_close+0x67/0x72 [ 406.536864] [81040505] put_files_struct+0x70/0xc3 [ 406.536864] [8104058c] exit_files+0x34/0x39 [ 406.536864] [810418ec] do_exit+0x267/0x72e [ 406.536864] [8104994c] ? lock_timer_base+0x26/0x4a [ 406.536864] [8104be04] ? freezing+0xe/0x10 [ 406.536864] [81041e47] sys_exit_group+0x0/0x16 [ 406.536864] [8104dfce] get_signal_to_deliver+0x31c/0x33b [ 406.536864] [81001fc6] do_notify_resume+0x8b/0x6c3 [ 406.536864] [8104be41] ? set_tsk_thread_flag+0xd/0xf [ 406.536864] [8104e6b5] ? sys_rt_sigtimedwait+0x18e/0x208 [ 406.536864] [810efe00] ? path_put+0x1d/0x22 [ 406.536864] [81002c58] int_signal+0x12/0x17 [ 406.536864] Code: 00 00 00 4c 89 eb 4d 8b 6d 00 49 8d 44 24 10 48 39 c3 75 df 4c 89 f6 48 c7 c7 a0 18 a1 81 e8 fa b5 56 00 41 83 7c 24 64 00 74 04 0f 0b eb fe 4c 89 e7 e8 2c f5 ff ff 4c 89 e7 e8 9e e2 ff ff 49 [ 406.536864] RIP [81025e53] amd_iommu_domain_destroy+0x75/0x9d [ 406.536864] RSP 88013507fb78 [ 406.854138] ---[ end trace 13c9f9241c8b376b ]--- [ 406.859182] Fixing recursive fault but reboot is needed! I assume this means that 00:14.4 is still left
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: On Fri, Feb 25, 2011 at 11:02 PM, James Neave robo...@gmail.com wrote: On Fri, Feb 25, 2011 at 10:47 PM, James Neave robo...@gmail.com wrote: On Fri, Feb 25, 2011 at 12:06 AM, Chris Wright chr...@sous-sol.org wrote: * James Neave (robo...@gmail.com) wrote: OK, here's my latest dmesg with amd_iommu_dump and debug with no quiet http://pastebin.com/JxEwvqRA Yeah, that's what I expected: [ Â Â 0.724403] AMD-Vi: Â DEV_ALIAS_RANGE Â Â Â Â Â Â Â Â devid: 08:00.0 flags: 00 devid_to: 00:14.4 [ Â Â 0.724439] AMD-Vi: Â DEV_RANGE_END Â Â Â Â Â devid: 08:1f.7 That basically says 08:00.0 - 08:1f.7 will show up as 00:14.4 (and should all go into same iommu domain). I've just figured out a sequence of echo DEV PATH commands to call for 14.4 gets me past the claimed by pci-stub error and gets me to the failed to assign IRQ error. I'm going to narrow down the required sequence and then post it. Kind of afraid to ask, but does it include: (assuming 1002 4384 is the pci to pci bridge) echo 1002 4384 /sys/bus/pci/drivers/pci-stub/new_id echo :00:14.4 /sys/bus/pci/drivers/pci-stub/unbind (this has the side effect of detaching the bridge from its domain) thanks, -chris Exact sequence is: echo 1002 4384 /sys/bus/pci/drivers/pci-stub/new_id echo :00:14.4 /sys/bus/pci/devices/:00:14.4/driver/unbind I take it this is a bad thing then? I assume this means that 00:14.4 is still left claimed by pci-stub? Yes How are you determining this? Â The lspci paste above has pci-stub for all of them. Â The easiest thing might be to start with manually disabling host driver and reassigning pci-stub to: 00:14.4, 08: 06.2,3 and 0e.0 Then giving the guest only 08:06.1. I determined it by being half asleep and not reading it properly... . You're right, all 5 devices were using pci-stub libvirtError: this function is not supported by the connection driver: Unable to reset PCI device :00:14.4: no FLR, PM reset or bus reset available Right, libvirt is more restrictive than qemu-kvm (forgot you were using libvirt here). What does that libvirt error mean? I can't find a definition. Am I limiting myself by using libvirt? Would not using it help and how would I go about not using it? Trouble now is that with shared IRQ we don't have a good way to handle that right now. Game over then? I've tried assigning the USB devices before, I couldn't do it because qemu doesn't support USB2 devices. I don't really understand where this IRQ conflict is, the firewire and the USB2 device share IRQ22 but I'm assigning them both to the VM? Is that still a problem? I don't suppose there's any way to change which IRQ they use in the BIOS or with a command is there? I don't know if it means anything but this page: http://linuxtv.org/wiki/index.php/Hauppauge_WinTV-HVR-2200 Has the lspci output for the HVR-2200 which mentions MSI and IRQ255. My knowledge it very limited on this subject so I don't know if that's meaningless looking at the output from another person's lspci. Anything left to try? Regardless, many thanks for your help, James. On the off chance I tried disabling the firewire in the BIOS, which leaves only my tuner card using IRQ 20, 21 and 22. No difference, still complains about IRQs: Using raw in/out ioport access (sysfs - Input/output error) Failed to assign irq for hostdev0: Operation not permitted Perhaps you are assigning a device that shares an IRQ with another device? It does say Operation not permitted and that only perhaps I am assigning a device that shares an IRQ. Perhaps IRQ conflict it not the problem? They really are sitting on their own. Another permissions problem perhaps? Regards, James. I'm reading something about this error message being related to libvirt and CAP_SYS_RAWIO? Depending on how new your libvirt is, you can force it to stop dropping capabilities. Look for the config item clear_emulator_capabilities in /etc/libvirt/qemu.conf. Setting this to 0 would verify that's the problem (and not a real shared irq...i thought i saw sharing on /proc/interrupts though). http://www.mail-archive.com/kvm@vger.kernel.org/msg34338.html http://www.google.co.uk/#hl=enxhr=tq=libvirt+CAP_SYS_RAWIOcp=21pf=psclient=psyaq=faqi=aql=oq=libvirt+CAP_SYS_RAWIOpbx=1fp=2d8e3f69fec095f4 When I patch libvirt to not drop the capabilities, everything works as expected. Well, that's a good point. We fixed that a while ago, but I'm not sure your kernel has that fix. 2.6.35.10-dmar (btw, random nitpick, dmar == intel dma remapping engine, aka vt-d not amd iommu ;) This was fixed in 2.6.36, commit: 48bb09e KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq The last 2.6.35 stable release is 2.6.35.9 and does not have that fix. So unless your .10-dmar has it, you could
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: libvirtError: this function is not supported by the connection driver: Unable to reset PCI device :00:14.4: no FLR, PM reset or bus reset available Right, libvirt is more restrictive than qemu-kvm (forgot you were using libvirt here). There is nothing written to test.log when you try to start the VM with 00:14.4 attached. At this point libvirt goes screwy and I have to restart it before I can remove 00:14.4 from the VM. I assume this means that 00:14.4 is still left claimed by pci-stub? Failed to assign irq for hostdev0: Operation not permitted Perhaps you are assigning a device that shares an IRQ with another device? kvm: -device pci-assign,host=08:06.0,id=hostdev0,configfd=58,bus=pci.0,addr=0x6: Believe it or not this is progress ;) You have passed the point that it was failing before (the iommu domain issue). Trouble now is that with shared IRQ we don't have a good way to handle that right now. Device 'pci-assign' could not be initialized 2011-02-23 19:21:13.958: shutting down dmesg: http://pastebin.com/70D26xp4 This bit is different: [ 201.625221] uhci_hcd :08:06.0: remove, state 4 [ 201.625237] usb usb4: USB disconnect, address 1 [ 201.625514] uhci_hcd :08:06.0: USB bus 4 deregistered [ 201.625595] uhci_hcd :08:06.0: PCI INT A disabled [ 201.626028] pci-stub :08:06.0: claimed by stub [ 201.631922] uhci_hcd :08:06.1: remove, state 4 [ 201.631937] usb usb9: USB disconnect, address 1 [ 201.632195] uhci_hcd :08:06.1: USB bus 9 deregistered [ 201.632274] uhci_hcd :08:06.1: PCI INT B disabled [ 201.632419] pci-stub :08:06.1: claimed by stub [ 201.638160] ehci_hcd :08:06.2: remove, state 1 [ 201.638172] usb usb10: USB disconnect, address 1 [ 201.638178] usb 10-1: USB disconnect, address 2 [ 201.721626] dvb-usb: Hauppauge Nova-T 500 Dual DVB-T successfully deinitialized and disconnected. [ 201.721990] ehci_hcd :08:06.2: USB bus 10 deregistered [ 201.722126] ehci_hcd :08:06.2: PCI INT C disabled [ 201.725042] pci-stub :08:06.2: claimed by stub [ 201.731830] firewire_ohci :08:0e.0: PCI INT A disabled [ 201.731838] firewire_ohci: Removed fw-ohci device. [ 201.732536] pci-stub :08:0e.0: claimed by stub [ 202.303880] device vnet0 entered promiscuous mode [ 202.305184] virbr0: topology change detected, propagating [ 202.305193] virbr0: port 1(vnet0) entering forwarding state [ 202.305199] virbr0: port 1(vnet0) entering forwarding state [ 202.433007] pci-stub :08:06.0: PCI INT A - GSI 20 (level, low) - IRQ 20 [ 202.470076] pci-stub :08:06.0: restoring config space at offset 0x1 (was 0x210, writing 0x211) [ 202.697270] assign device 0:8:6.0 [ 202.697325] deassign device 0:8:6.0 [ 202.730080] pci-stub :08:06.0: restoring config space at offset 0x1 (was 0x210, writing 0x211) [ 202.730107] pci-stub :08:06.0: PCI INT A disabled This time the pci-stub claimed lines are not all bunched up and there is only one per device, rather than three per device. Also for the first time it says assign device 0:8:6.0 rather than assign device 0:8:6.0 failed It them immediately deassigns the device and stops. test.log shows: Failed to assign irq for hostdev0: Operation not permitted Perhaps you are assigning a device that shares an IRQ with another device? lspsci -vv for the relevant devices shows: http://pastebin.com/EUtUMj8x 00:14.4 now appears to be using pci-stub as it's driver, as well as 08:06.1, 2, 3 but not 0e.0 How are you determining this? The lspci paste above has pci-stub for all of them. The easiest thing might be to start with manually disabling host driver and reassigning pci-stub to: 00:14.4, 08: 06.2,3 and 0e.0 Then giving the guest only 08:06.1. Anyway, that's all for now. Thanks for testing. I think I'll try 'amd_iommu_dump' next, does it write to dmesg? Yes it does. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: OK, here's my latest dmesg with amd_iommu_dump and debug with no quiet http://pastebin.com/JxEwvqRA Yeah, that's what I expected: [0.724403] AMD-Vi: DEV_ALIAS_RANGE devid: 08:00.0 flags: 00 devid_to: 00:14.4 [0.724439] AMD-Vi: DEV_RANGE_END devid: 08:1f.7 That basically says 08:00.0 - 08:1f.7 will show up as 00:14.4 (and should all go into same iommu domain). I've just figured out a sequence of echo DEV PATH commands to call for 14.4 gets me past the claimed by pci-stub error and gets me to the failed to assign IRQ error. I'm going to narrow down the required sequence and then post it. Kind of afraid to ask, but does it include: (assuming 1002 4384 is the pci to pci bridge) echo 1002 4384 /sys/bus/pci/drivers/pci-stub/new_id echo :00:14.4 /sys/bus/pci/drivers/pci-stub/unbind (this has the side effect of detaching the bridge from its domain) thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: Just out of interest, what kind of mileage would I expect out of buying a shiny new PCIe tuner? Hard to say. One advantage would be if it's using MSI or MSI-X interrupts. Can I pass through PCIe? Often, yes (still some caveats w.r.t. extended config space I believe). Would it work better because it wouldn't be behind a bridge? WOULD it not be behind a bridge? You should have a PCIe slot that does not sit behind a PCI-PCI bridge. As much as I'd hate to solve a problem with the application of money... :( If you just want _one_ tuner to go to the guest, you should be able to do that by unbinding the other devices and giving the guest just the one usb controller (assuming just assigning the usb device itself is hitting usb/qemu stack limitations). The trick is to be sure to unbind any host devices that are sharing interrupts with the one device you want the guest to have. With USB controllers you just have to be sure you know which ports they map to so you don't kill a keyboard, mouse, external disk, etc... (OT question, on mailing lists should I use Reply All or just reply and change the To address to kvm.vger.kernel.org?) Reply all is best. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is PCI pass-through possible with host+kvm on latest linux but guest on an older linux?
* Chigurupati, Chaks (ch...@wichorus.com) wrote: If my hardware is VT-d capable and the host is latest linux+kvm with all the needed VT-d support but the guest is an older linux (say 2.6.27), will I be able to use PCI pass-through to hot-plug a PCI device from one guest to another guest? Any comments/thoughts are appreciated. The basic requirement in the guest to do what you describe is that it has hotplug capability and has the driver for the device you want to assign to it. All the rest of the requirements are on the host sw and hw (linux+kvm capable of device assignment, which latest should be, and hw VT-d or AMD IOMMU). thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Feb 22
0.14 recap - keeping schedule on wiki was helpful - changelog was helpful - testing (could even more emphasis could be improved) - -rc cycles - -rc2 and final release just hours 0.15 - tentative date July 1st - qapi - qed features - virtagent? - depends on whether to terminate in qemu vs external - terminating w/in qemu is close to feature complete - using QMP (kinda, QObject - JSON marshalling, still use HTTP) - QMP is not bi-directional XMLRPC, one way with event posting - XMLRPC + server logic add to the basic QEMU side attack surface - splitting out to external process - state associated with guest in external process complicates live migration - e.g. handling in-process command in server - guest client reconnects during migration - can virtagent features be stateless - Avi's favorite Lua based extension language coming RSN ;) - let's use copy and paste as a concrete example - usecase to help define the requirements and expose architectural - Jes will do this, make concrete counter proposal to hosting virtagent server in qemu - splitting QEMU into more modular components is a large architectural step, but better step Block format acceptance - qcow3 wiki starting GSoC projects - only 3 so far, mentoring organization applications Feb 28th - can update app - please add your thoughts here so that we can have a successful - Luiz will send out a note as more explicit reminder gpxe vs ipxe - gpxe still stagnate - ipxe accepting patches (e.g. igbvf) - perhaps switch in 0.15 (Alex take a look) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: On Tue, Feb 22, 2011 at 1:51 AM, Chris Wright chr...@sous-sol.org wrote: * James Neave (robo...@gmail.com) wrote: Does anybody know the debug kernel switches for iommu? Two helpful kernel commandline options are: amd_iommu_dump debug (and drop quiet) The problem is when you attach the device (function) you're getting stuck up in conflicts with the existing domain for that function. My guess is that all the functions are behind a PCI to PCI bridge, so the alias lookup is finding a conflict. Yes, it's behind a PCI-PCI bridge I think, here's the blurb from an earlier email: Sorry, I missed that in your original mail, thanks for reposting. cat /proc/interruts http://pastebin.com/LQdB3hms lspci -vvv http://pastebin.com/GJDkC8B4 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40) lspci -t -v http://pastebin.com/Ftx8Hfjt Yup, that's what I expected: +-14.4-[08]--+-06.0 VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller |+-06.1 VIA Technologies, Inc. VT82x UHCI USB 1.1 Controller |+-06.2 VIA Technologies, Inc. USB 2.0 |\-0e.0 Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller I'd now expect to see (if you boot with amd_iommu_dump) some IVRS details showing an alias range entry basically showing 08:* pointing back to 00:14.4. This means that from the point of view of the IOMMU the devices 08:06.0, 08:06.1, 08:06.2, 08:0e.0 will all show up as if they are 00:14.4. When you assign a device to a guest, the guest VM gets an IOMMU domain (a context to manage IOMMU page table mappings) and the device is put into that guest's IOMMU domain. However, if the device is behind a PCI-PCI bridge it will appear as an alias for the bridge itself. The bridge is a PCI device with an IOMMU domain. When trying to assign a device to a guest there's some sanity checking to verify that the device (or its alias) aren't already under some IOMMU domain other than the guest VM's IOMMU domain. I suspect this is what you are hitting. You could test this theory by adding 2 more devices to your guest -- the firewire device (08:0e.0) and the PCI-PCI bridge itself (00:14.4). thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* Alex Williamson (alex.william...@redhat.com) wrote: I don't know why you're getting -EBUSY for this device, but maybe we can start from a clean slate and see if it helps. Here's what I would suggest: I bet this is an AMD IOMMU box. Can we get full dmesg? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI Passthrough, error: The driver 'pci-stub' is occupying your device 0000:08:06.2
* James Neave (robo...@gmail.com) wrote: Finally, here is the very latest dmesg: http://pastebin.com/9HE61K62 OK, this is an AMD IOMMU box. [0.00] ACPI: IVRS cfcf9830 000E0 (v01 AMD RD890S 00202031 AMD ) It's discovered and enalbed properly: [0.698992] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40 [0.710287] AMD-Vi: Lazy IO/TLB flushing enabled Does anybody know the debug kernel switches for iommu? Two helpful kernel commandline options are: amd_iommu_dump debug (and drop quiet) The problem is when you attach the device (function) you're getting stuck up in conflicts with the existing domain for that function. My guess is that all the functions are behind a PCI to PCI bridge, so the alias lookup is finding a conflict. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM Test report, kernel a685b38... qemu 671d89d...
* Alex Williamson (alex.william...@redhat.com) wrote: On Wed, 2011-02-16 at 11:10 +0200, Avi Kivity wrote: On 02/16/2011 11:05 AM, Hao, Xudong wrote: Hi, all, This is KVM test result against kvm.git a685b38e272587e644fedd37269ddb82df21c052, and qemu-kvm.git 671d89d6411655bb4f8058ce6eb86bb0bb8ec978. Currently qemu-kvm can build successfully on RHEL5, and Qcow image create failure issue also got fixed, our nightly testing resumed. One VT-d device assignment issue opened on latest KVM. New issue: 1. [VT-d] VT-d device passthrough fail to guest https://bugzilla.kernel.org/show_bug.cgi?id=29232 Extremely reproducible. Looks like it's a result of this kernel change: commit 47970b1b2aa64464bc0a9543e86361a622ae7c03 Author: Chris Wright chr...@sous-sol.org Date: Thu Feb 10 15:58:56 2011 -0800 pci: use security_capable() when checking capablities during config space re Eric Paris noted that commit de139a3 (pci: check caps from sysfs file open to read device dependent config space) caused the capability check to bypass security modules and potentially auditing. Rectify this by calling security_capable() when checking the open file's capabilities for config space reads. Reported-by: Eric Paris epa...@redhat.com Signed-off-by: Chris Wright chr...@sous-sol.org Signed-off-by: James Morris jmor...@namei.org Chris, why isn't this working for us? Thanks, It's a broken patch, the fix is floating about. Linus reverted it and I supplied this patch after the revert: From 683034fca7b8c322f87b8b4f664f1ae0b5fc Mon Sep 17 00:00:00 2001 From: Chris Wright chr...@sous-sol.org Date: Mon, 14 Feb 2011 19:12:00 -0500 Subject: [PATCH] pci: use security_capable() when checking capablities during config space read This reintroduces commit 47970b1b which was subsequently reverted as f00eaeea. The original change was broken and caused X startup failures and generally made privileged processes incapable of reading device dependent config space. The normal capable() interface returns true on success, but the LSM interface returns 0 on success. This thinko is now fixed in this patch, and has been confirmed to work properly. So, once again...Eric Paris noted that commit de139a3 (pci: check caps from sysfs file open to read device dependent config space) caused the capability check to bypass security modules and potentially auditing. Rectify this by calling security_capable() when checking the open file's capabilities for config space reads. Reported-by: Eric Paris epa...@redhat.com Tested-by: Dave Young hidave.darks...@gmail.com Acked-by: James Morris jmor...@namei.org Cc: Dave Airlie airl...@gmail.com Cc: Alex Riesen raa.l...@gmail.com Cc: Sedat Dilek sedat.di...@googlemail.com Cc: Linus Torvalds torva...@linux-foundation.org Signed-off-by: Chris Wright chr...@sous-sol.org --- drivers/pci/pci-sysfs.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index 8ecaac9..ea25e5b 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -23,6 +23,7 @@ #include linux/mm.h #include linux/fs.h #include linux/capability.h +#include linux/security.h #include linux/pci-aspm.h #include linux/slab.h #include pci.h @@ -368,7 +369,7 @@ pci_read_config(struct file *filp, struct kobject *kobj, u8 *data = (u8*) buf; /* Several chips lock up trying to read undefined config space */ - if (cap_raised(filp-f_cred-cap_effective, CAP_SYS_ADMIN)) { + if (security_capable(filp-f_cred, CAP_SYS_ADMIN) == 0) { size = dev-cfg_size; } else if (dev-hdr_type == PCI_HEADER_TYPE_CARDBUS) { size = 128; -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Feb 15
QAPI and QMP - Anthony adding a new wiki page to describe all of this - specified in formal schema using JSON - includes documenation in javadoc-like syntax - can generate api (possibly protocol) docs - documenting each command and expected errors - creates marshalling functions and C interfaces - can generate C library - facilitates unit tests/regression tests - new and old code both exist in Anthony's tree - allows unit tests to run on both to verify - will remove old and force a flag day on merging in for 0.15 - still need to convert human monitor commands - goal to convert all of human monitor to QMP - events? - still not consumable from internal use - model signals and slots - similar to notifier lists, but can pass arbitrary data - client connects to signal via QMP - how to extend? - optional parameters (ABI bump) - no way to know if client is aware of and consuming the optional parameters - add new events - client required to register for new events when the know about them, server can generate different logic based on clients capability - first release may not include shared library (lack of libconf/autotool) - could - QMP session in default well-known location - allows iteration of all running QMP sessions - per-user directory to handle user-level isolation qdev future - have an object model, but can't do polymorphism (i.e. bus level) - could use more oop style, use GObject, use C++...no great ideas - no major qdev plans for 0.15 - would be useful to have the ability to do device level unit testing - cleaner device model, better encapsulation - this is both the device side interfaces, but also interfaces back to qemu - ability to do something like a virtual PCI bus to be a test harness to interact with a device - back to the GObject, oop, C++ questions? - IDL based code generation to generate VMState in effort to make migration more verifiable - VMState - need to focus on serialized guest visible state - start with all state and remove obviously internal only state - start with only guest visible state (structure separation) - verfiable - need a qdev tree maintainer? - some disagreement on exactly how much - qdev autodoc patches? (posted and ack'd multiple times) bad patches committed that are not on list - please inform of specifics incidents, this should not be happening SeaBIOS update? - w/out we will have features that can't be used - need a release.. - 0.15 will need good planning and dates and communication with Kevin 0.14-rc2 tagged please review for any missing patches, 0.14.0 likely tagged late today revisit new - old migration - Amit offers virtio-serial patches and some legwork - tabled discussion to list, possibly next week's call -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Feb 15
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Feb 8
Automated builds and testing - found broken 32-bit - luiz suggested running against maintainer trees - daniel gollub offered to take on maintenance - integration with kvm-autotest? - lucas, daniel, stefan... - testing each git commit is probably overkill and too expensive - current autotest run (each 48-hours to batch it up) - stefan currently running once a day, autotest run is 3 hours, so daily should work - need an integration tree to run build test on? - probably still too early QEMU testing - kvm unit tests - small standalone kernel that exercises paths that have shown bugs http://git.kernel.org/?p=virt/kvm/kvm-unit-tests.git;a=summary - Michael Roth recent sent RFC for qtest (http://www.mail-archive.com/qemu-devel@nongnu.org/msg54191.html) - test module (-init(), -run()) which runs in place of vcpu threads to set up a test framework to do targetted testing, for example, of devices - normal C code, access to qemu internal functions - not just functional device testing, but can also to fuzz testing - looking feedback/users/test developers/etc - PPC (just kernel + initrd to boot, and verify boots are identical) - full install in many cases is too long, and can trigger other issues (alex had examples of emulation being slow enough that login screen times out) - tcg basic testing to verify qemu-kvm patch isn't breaking tcg Cross version migration (new-old version migration thread) - downstreams want this, support this upstream? - versions vs. subsections (subsections should allow this to work) - (as usual) more vmstate conversion needed - qdev/vmstate both examples of partially completed work that need more attention -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Feb 8
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Feb 1
KVM upstream merge: status, plans, coordination - Jan has a git tree, consolidating - qemu-kvm io threading is still an issue - Anthony wants to just merge - concerns with non-x86 arch and merge - concerns with big-bang patch merge and following stability - post 0.14 conversion to glib mainloop, non-upstreamed qemu-kvm will be a problem if it's not there by then - testing and nuances are still an issue (e.g. stefan berger's mmio read issue) - qemu-kvm still evolving, needs to get sync'd or it will keep diverging - 2 implementations of main init, cpu init, Jan has merged them into one - qemu-kvm-x86.c file that's only a few hundred lines - review as one patch to see the fundamental difference QMP support status for 0.14 - declare QMP fully supported - caveats: specific errors aren't guaranteed yet (primarily documentation) - human monitor passthrough command is best effort - device tree structure is not reliable, use name not path - will send out patch to update qmp-commands.hx to document this (and Cc libvirt) - schema file (json subset which is python) and code generator to generate code with C structures, also generates client library for test cases (can test against new and old qmp server to verify hasn't changed) - HMP implemented in terms of QMP only - at the end should have a test framework to test all commands - glib/gtest framework 0.14 stable fork today already posted 0.14 patches? - will pick up all those patches before forking, fork at the end of the day - will grab latest SeaBIOS and vgabios SeaBIOS update for 0.14 (AHCI boot capable version) - need to check if (and why) AHCI is disabled by default - assuming no fundamental issues, could be enabled and become an experimental new 0.14 feature Summer of code 2011 - http://wiki.qemu.org/Google_Summer_of_Code_2011 - update wiki page with project ideas (let Anthony or Luiz know if you want to be a mentor) - application is due at end of the month - mentors...be prepared that projects may take longer than just the summer of code to complete - join #qemu-gsoc on OFTC for gsoc discussions Going to FOSDEM? agraf will be there... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Jan 25
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for Jan 18
* Chris Wright (chr...@redhat.com) wrote: Please send in any agenda items you are interested in covering. No agenda, this week's call is cancelled. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Jan 18
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Jan 11
KVM Forum 2011 - expand the scope? yes, continue up the stack - how long? 2 days (maybe 2 1/2 - 3 space permitting) - where? Vancouver with LinuxCon Spice guest agent: - virt agent, matahari, spice agent...what is in spice agent? - spice char device - mouse, copy 'n paste, screen resolution change - could be generic (at least input and copy/paste) - send protocol details of what is being sent - need to look at how difficult it is to split it out from spice (how to split out in qemu vs. libspice) - goal to converge on common framework - more discussion on char device vs. protocol - eg. mouse_set breaks if mouse channel is part pv and part spice specific - Alon will send link to protocol and try to propose new interfaces migration and block devices: - need to invalidate data after first read on target, because it can be stale - close + reopen is what was done for NFS - iscsi: can issue ioctl(BLKFLSBUF) to flush, but it's CAP_SYS_ADMIN only - O_DIRECT to avoid cache (concerns that it's not guaranteed) - agree change the default (cache=none for qemu patch queue is long: - slow to return from break - patience and more patch review will help make sure things are applied and don't fall through cracks -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] device-assignment: chmod the rom file before opening read/write
* Alex Williamson (alex.william...@redhat.com) wrote: The PCI sysfs rom file is exposed read-only by default, but we need to write to it to enable and disable the ROM around the read. When running as root, the code works fine as is, but when running de-privileged via libvirt, the fopen(r+) will fail if the file doesn't have owner write permissions. libvirt already gives us ownership of the file, so we can toggle this around the short usage window ourselves. Signed-off-by: Alex Williamson alex.william...@redhat.com Acked-by: Chris Wright chr...@redhat.com --- hw/device-assignment.c | 17 +++-- 1 files changed, 11 insertions(+), 6 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 8446cd4..da0a4d7 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1866,16 +1866,18 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev) return; } -if (access(rom_file, F_OK)) { -fprintf(stderr, pci-assign: Insufficient privileges for %s\n, -rom_file); +/* The ROM file is typically mode 0400, ensure that it's at least 0600 + * for the following fopen to succeed when qemu is de-privileged. */ +if (chmod(rom_file, (st.st_mode ALLPERMS) | S_IRUSR | S_IWUSR)) { +fprintf(stderr, pci-assign: Insufficient privileges for %s (%s)\n, +rom_file, strerror(errno)); return; } /* Write 1 to the ROM file to enable it */ fp = fopen(rom_file, r+); if (fp == NULL) { -return; +goto restore_rom; } val = 1; if (fwrite(val, 1, 1, fp) != 1) { @@ -1895,17 +1897,20 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev) or load from file with romfile=\n, rom_file); qemu_ram_free(dev-dev.rom_offset); dev-dev.rom_offset = 0; -goto close_rom; +goto disable_rom; } pci_register_bar(dev-dev, PCI_ROM_SLOT, st.st_size, 0, pci_map_option_rom); -close_rom: +disable_rom: /* Write 0 to disable ROM */ fseek(fp, 0, SEEK_SET); val = 0; if (!fwrite(val, 1, 1, fp)) { Nitpick...could you unify this? (!= 1, like the enabling write check) DEBUG(%s\n, Failed to disable pci-sysfs rom file); } +close_rom: fclose(fp); +restore_rom: +chmod(rom_file, st.st_mode ALLPERMS); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: I have few (may be stupid) questions on this From: Chris Wright [chr...@sous-sol.org] That's the issue. The IOMMU has a set of page tables for each DeviceID. For most devices, the DeviceID is the same as the Bus:Dev.Func (the PCI address) of the device. But this does not always work. One example is when a device is behind a PCI-to-PCI Bridge. In that case, the device memory read/write requests (attempts to DMA) will appear as if they came from the bridge. Oh I see, I can understand this part. 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge Bus: primary=00, secondary=01, subordinate=01, sec-latency=64 That's the bridge that sits between your e100 and the IOMMU. Can you please explain how did you make out the device 01:05:0 is behind the bridge? 01:05.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 0c) A PCI bridge has config space that states what busses are behind it. The bridge at 00:14.4 is a bridge between bus 0 and bus 1, you can tell from this line: Bus: primary=00, secondary=01, subordinate=01, sec-latency=64 There are no other devices behind that bridge (so theoretically you could safely use it for device assignment). If you can explain this, I will try to find if the other network card also sits behind the bridge or not. The other network interface card you have (03:00.0) is a PCIe device, it's upstream is the PCIe port. It should not have the aliasing issue, and should work. 00:06.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port F) Bus: primary=00, secondary=03, subordinate=03, sec-latency=0 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller I would like to know the same thing for the PCIe GPU card connected to my machine. If GPU card is also sitting behind the bridge then the hardware may be useless for the project. :( The GPU is also in a PCIe port, here: 00:02.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port B) Bus: primary=00, secondary=06, subordinate=06, sec-latency=0 06:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] Please explain how to find out this information. Using lspci -t you can see the topology pretty easily. Otherwise you can sift through lspci output to find the topology. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: Is the answer All PCI buses located behind a PCI-PCI bridge must reside between the seondary bus number and the subordinate bus number (inclusive). 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge Bus: primary=00, secondary=01, subordinate=01, sec-latency=64 So all the PCI devices between secondary (01) and subordinate (01) (in this case same) are behind the PCI Bridge. Correct me if I am wrong. That's correct. You'll find secondary subordinate when there's another bridge downstream. 01:05.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 0c) As Bus ID is 01 this ethernet controller is behind the PCI Bridge Yup. 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06) As Bus: 03, I can assume this is not behind the PCI Bridge But if subordinate would have been, say 03 or 04, then even this ethernet card (03:00:0) would be behind the PCI Bridge. Am I correct? That's right. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: From: Chris Wright [chr...@sous-sol.org] I would like to know the same thing for the PCIe GPU card connected to my machine. If GPU card is also sitting behind the bridge then the hardware may be useless for the project. :( The GPU is also in a PCIe port, here: 00:02.0 PCI bridge: ATI Technologies Inc RD890 PCI to PCI bridge (PCI express gpp port B) Bus: primary=00, secondary=06, subordinate=06, sec-latency=0 06:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] As the secondary and subordinate are 06, it means GPU pass through won't work. No, it just means there are nor more bridges behind 00:02.0. The GPU is a PCIe device in a PCIe port (which happens to look a lot like a bridge). So, while GPU assignment has some tricky issues, I don't think you'll be stopped by the IOMMU. Please explain how to find out this information. Using lspci -t you can see the topology pretty easily. Otherwise you can sift through lspci output to find the topology. Thanks a lot Chris for explaining everything. You're welcome. Good luck with your project. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for Dec 21
* Chris Wright (chr...@redhat.com) wrote: Please send in any agenda items you are interested in covering. No agenda, today's call is cancelled. Also, given people's holiday and vacation schedules, next week's call is cancelled. Talk again after the New Year. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: I am facing a problem with enabling the IOMMU. Dec 21 15:50:57 prasad-kvm kernel: [0.00] Aperture pointing to e820 RAM. Ignoring. Dec 21 15:50:57 prasad-kvm kernel: [0.00] Your BIOS doesn't leave a aperture memory hole Dec 21 15:50:57 prasad-kvm kernel: [0.00] Please enable the IOMMU option in the BIOS setup Dec 21 15:50:57 prasad-kvm kernel: [2.790913] pci :01:05.0: Firmware left e100 interrupts enabled; disabling Dec 21 15:50:57 prasad-kvm kernel: [2.791941] pci :00:00.2: PCI INT A - GSI 55 (level, low) - IRQ 55 Dec 21 15:50:57 prasad-kvm kernel: [2.792775] AMD-Vi: Enabling IOMMU at :00:00.2 cap 0x40 Dec 21 15:50:57 prasad-kvm kernel: [2.800989] AMD-Vi: Lazy IO/TLB flushing enabled I have enabled IOMMU in the BIOS, but I am not sure why it is still asking to enabled IOMMU in BIOS. Do I need to worry about this? It's unfortunate wording. It's telling you that the GART is missing, which is fine because you have an IOMMU. Besides I don't see the DMAR message similar to the one mentioned on the link http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM That wiki page is specific to Intel VT-d. You have an AMD box with IOMMU, so all looks fine. Are you interested in using the IOMMU to do direct PCI device assignment to a guest? thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: From: Chris Wright [chr...@sous-sol.org] I have enabled IOMMU in the BIOS, but I am not sure why it is still asking to enabled IOMMU in BIOS. Do I need to worry about this? It's unfortunate wording. It's telling you that the GART is missing, which is fine because you have an IOMMU. Besides I don't see the DMAR message similar to the one mentioned on the link http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM That wiki page is specific to Intel VT-d. You have an AMD box with IOMMU, so all looks fine. Yes I am using AMD processor and ASUS motherboard. Both of them have the IOMMU support, atleast it is mentioned on the Xen VT-d Looks like we need some additional info in the wiki. Care to create an account and add the info? Are you interested in using the IOMMU to do direct PCI device assignment to a guest? Thanks a lot for your reply. Yes I am interested in working on GPU pass-through to Virtual Machine. But for now I am trying to pass-through a network card to VM. r...@prasad-kvm:~/VMDisks# qemu-system-x86_64 -hda Ubuntu-10.10-amd64.img -m 1024M -device pci-assign,host=01:05.0 Failed to assign device (null) : Device or resource busy *** The driver 'pci-stub' is occupying your device :01:05.0. *** *** You can try the following commands to free it: *** *** $ echo 8086 1229 /sys/bus/pci/drivers/pci-stub/new_id *** $ echo :01:05.0 /sys/bus/pci/drivers/pci-stub/unbind *** $ echo :01:05.0 /sys/bus/pci/drivers/pci-stub/bind *** $ echo 8086 1229 /sys/bus/pci/drivers/pci-stub/remove_id *** Heh, this error is a little odd. It's telling you the pci-stub driver already has this device. Then it's telling you to unbind it from pci-stub, and bind it to pci-stub. That error message is meant to tell you that the real host driver (in your case e100) has the device, unbind from it, and bind to pci-stub. qemu-system-x86_64: -device pci-assign,host=01:05.0: Device 'pci-assign' could not be initialized r...@prasad-kvm:~/VMDisks# echo 8086 1229 /sys/bus/pci/drivers/pci-stub/new_id r...@prasad-kvm:~/VMDisks# echo :01:05.0 /sys/bus/pci/drivers/pci-stub/unbind r...@prasad-kvm:~/VMDisks# echo :01:05.0 /sys/bus/pci/drivers/pci-stub/bind r...@prasad-kvm:~/VMDisks# echo 8086 1229 /sys/bus/pci/drivers/pci-stub/remove_id r...@prasad-kvm:~/VMDisks# qemu-system-x86_64 -hda Ubuntu-10.10-amd64.img -m 1024M -device pci-assign,host=01:05.0 Failed to assign device (null) : Device or resource busy *** The driver 'pci-stub' is occupying your device :01:05.0. [ 605.015852] e100 :01:05.0: BAR 0: can't reserve [mem 0xf9cff000-0xf9cf] [ 605.015855] kvm_vm_ioctl_assign_device: Could not get access to device regions This is what is returning -EBUSY and triggering the error message. [ 667.410228] e100 :01:05.0: PCI INT A disabled [ 700.500278] pci-stub: invalid id string [ 707.730636] pci-stub :01:05.0: claimed by stub [ 734.755491] pci-stub :01:05.0: PCI INT A - GSI 20 (level, low) - IRQ 20 [ 734.790077] pci-stub :01:05.0: restoring config space at offset 0xf (was 0x38080100, writing 0x3808010b) [ 734.790095] pci-stub :01:05.0: restoring config space at offset 0xc (was 0x0, writing 0xf9ce) [ 734.790113] pci-stub :01:05.0: restoring config space at offset 0x6 (was 0x0, writing 0xf9cc) [ 734.790123] pci-stub :01:05.0: restoring config space at offset 0x5 (was 0x1, writing 0xac01) [ 734.790132] pci-stub :01:05.0: restoring config space at offset 0x4 (was 0x0, writing 0xf9cff000) [ 734.790142] pci-stub :01:05.0: restoring config space at offset 0x3 (was 0x0, writing 0x4010) [ 734.790153] pci-stub :01:05.0: restoring config space at offset 0x1 (was 0x290, writing 0x2900113) [ 735.173647] assign device 0:1:5.0 failed [ 735.173688] pci-stub :01:05.0: PCI INT A disabled [ 768.850519] pci-stub :01:05.0: claimed by stub [ 775.855376] pci-stub :01:05.0: PCI INT A - GSI 20 (level, low) - IRQ 20 [ 775.890080] pci-stub :01:05.0: restoring config space at offset 0xf (was 0x38080100, writing 0x3808010b) [ 775.890097] pci-stub :01:05.0: restoring config space at offset 0xc (was 0x0, writing 0xf9ce) [ 775.890115] pci-stub :01:05.0: restoring config space at offset 0x6 (was 0x0, writing 0xf9cc) [ 775.890126] pci-stub :01:05.0: restoring config space at offset 0x5 (was 0x1, writing 0xac01) [ 775.890135] pci-stub :01:05.0: restoring config space at offset 0x4 (was 0x0, writing 0xf9cff000) [ 775.890144] pci-stub :01:05.0: restoring config space at offset 0x3 (was 0x0, writing 0x4010) [ 775.890155] pci-stub :01:05.0: restoring config space at offset 0x1 (was 0x290, writing 0x2900113) [ 776.275188] assign device 0:1:5.0 failed [ 776.275230] pci-stub :01:05.0: PCI INT
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: From: kvm-ow...@vger.kernel.org [kvm-ow...@vger.kernel.org] on behalf of Chris Wright [chr...@sous-sol.org] Yes I am using AMD processor and ASUS motherboard. Both of them have the IOMMU support, atleast it is mentioned on the Xen VT-d Looks like we need some additional info in the wiki. Care to create an account and add the info? Sure I would love to. Thanks, you can use the VT-d portion as an example. The useful dmesg info will be AMD-Vi: messages, the important line is this one: AMD-Vi: Enabling IOMMU at ... (and if you boot with amd_iommu_dump you'll get extra debugging info) Thanks a lot for your reply. Yes I am interested in working on GPU pass-through to Virtual Machine. But for now I am trying to pass-through a network card to VM. Great, GPU assignment has plenty of issues ;) snip It still fails with the same error, here is the screen shot. r...@prasad-kvm:/sys# uptime 17:29:11 up 2 min, 3 users, load average: 0.93, 0.52, 0.20 r...@prasad-kvm:/sys# ls -l /sys/bus/pci/devices/:01:05.0/driver lrwxrwxrwx 1 root root 0 2010-12-21 17:26 /sys/bus/pci/devices/:01:05.0/driver - ../../../../bus/pci/drivers/e100 r...@prasad-kvm:/sys# lsmod | grep pci_stub r...@prasad-kvm:/sys# modprobe pci_stub r...@prasad-kvm:/sys# lsmod | grep pci_stub pci_stub1590 0 r...@prasad-kvm:/sys# echo 8086 1229 /sys/bus/pci/drivers/pci-stub/new_id r...@prasad-kvm:/sys# echo :01:05.0 /sys/bus/pci/drivers/e100/unbind r...@prasad-kvm:/sys# echo :01:05.0 /sys/bus/pci/drivers/pci-stub/bind r...@prasad-kvm:/sys# echo 8086 1229 /sys/bus/pci/drivers/pci-stub/remove_id r...@prasad-kvm:/sys# ls -l /sys/bus/pci/devices/:01:05.0/driver lrwxrwxrwx 1 root root 0 2010-12-21 17:31 /sys/bus/pci/devices/:01:05.0/driver - ../../../../bus/pci/drivers/pci-stub r...@prasad-kvm:~/VMDisks# modprobe kvm_amd r...@prasad-kvm:~/VMDisks# lsmod | grep -i kvm kvm_amd56416 0 kvm 348987 1 kvm_amd r...@prasad-kvm:~/VMDisks# qemu-system-x86_64 -hda Ubuntu-10.10-amd64.img -m 1024M -device pci-assign,host=01:05.0 Failed to assign device (null) : Device or resource busy *** The driver 'pci-stub' is occupying your device :01:05.0. *** *** You can try the following commands to free it: *** *** $ echo 8086 1229 /sys/bus/pci/drivers/pci-stub/new_id *** $ echo :01:05.0 /sys/bus/pci/drivers/pci-stub/unbind *** $ echo :01:05.0 /sys/bus/pci/drivers/pci-stub/bind *** $ echo 8086 1229 /sys/bus/pci/drivers/pci-stub/remove_id *** qemu-system-x86_64: -device pci-assign,host=01:05.0: Device 'pci-assign' could not be initialized r...@prasad-kvm:~/VMDisks# echo $? 1 r...@prasad-kvm:~/VMDisks# The VM does not boot. Are you still seeing the same errors in dmesg? Your first dmesg showed that the e100 driver couldn't allocate BAR0: e100 :01:05.0: BAR 0: can't reserve [mem 0xf9cff000-0xf9cf] If the host driver can't, then kvm_vm_ioctl_assign_device() will fail as well. Seems as if there's a resource conflict on your machine. Can you include a full dmesg, /proc/iomem, and lspci -vvv -? thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: Besides when I insert the pci_stub module, it emits a messages [ 49.197112] pci-stub: invalid id string I don't know why? It's just broken error message. The commit b439b1d (PCI: pci-stub: add pci_stub.ids parameter) created that. I looked at it very briefly a few weeks ago and didn't see the issue. It's cosmetic, and not related to the failure you are seeing. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: From: Chris Wright [chr...@sous-sol.org] Sent: 21 December 2010 19:29 To: Prasad Joshi Cc: Chris Wright; kvm@vger.kernel.org; Tejun Heo Subject: Re: Query on IOMMU * Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: Besides when I insert the pci_stub module, it emits a messages [ 49.197112] pci-stub: invalid id string I don't know why? It's just broken error message. The commit b439b1d (PCI: pci-stub: add pci_stub.ids parameter) created that. I looked at it very briefly a few weeks ago and didn't see the issue. It's cosmetic, and not related to the failure you are seeing. Is it okay to add a following line in section 4. unbind device from host kernel driver (example PCI device 01:00.0) * If the PCI Stub Driver is compiled as module, then load the module using modprobe pci_stub. When I compiled the kernel I selected it as a kernel module. As the driver was not loaded, I could not see the entries in /sys file system. I could figure that out after reading few things. It will good to add a note to mention this fact. Let me know if I should add it or not. Yes, that sounds fine. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Query on IOMMU
* Prasad Joshi (p.g.jo...@student.reading.ac.uk) wrote: The following condition from __attach_device() returns the error. static int __attach_device(struct device *dev, struct protection_domain *domain) { ... if (alias_data-domain != NULL alias_data-domain != domain) goto out_unlock; ... } That's the issue. The IOMMU has a set of page tables for each DeviceID. For most devices, the DeviceID is the same as the Bus:Dev.Func (the PCI address) of the device. But this does not always work. One example is when a device is behind a PCI-to-PCI Bridge. In that case, the device memory read/write requests (attempts to DMA) will appear as if they came from the bridge. 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge Bus: primary=00, secondary=01, subordinate=01, sec-latency=64 That's the bridge that sits between your e100 and the IOMMU. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Dec 21
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM call agenda for Dec 14
* Chris Wright (chr...@redhat.com) wrote: Please send in any agenda items you are interested in covering. No agenda, today's call is cancelled. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for Dec 14
* Jes Sorensen (jes.soren...@redhat.com) wrote: Any chance you could fix your cronjob to send out the CFA a day earlier? 15 hrs before is a bit short notice. Sure. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Dec 14
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] intel-iommu: Fix use after release during device attach
* Jan Kiszka (jan.kis...@siemens.com) wrote: --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -3627,9 +3627,9 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, pte = dmar_domain-pgd; if (dma_pte_present(pte)) { - free_pgtable_page(dmar_domain-pgd); dmar_domain-pgd = (struct dma_pte *) phys_to_virt(dma_pte_addr(pte)); While here, might as well remove the unnecessary cast. + free_pgtable_page(pte); } dmar_domain-agaw--; } Reviewed-by: Sheng Yang sh...@linux.intel.com Acked-by: Chris Wright chr...@sous-sol.org CC iommu mailing list and David. Ping... I think this fix also qualifies for stable (.35 and .36). Still not merged? David, do you plan to pick this one up? thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for Dec 7
* Jes Sorensen (jes.soren...@redhat.com) wrote: On 12/07/10 00:51, Chris Wright wrote: Please send in any agenda items you are interested in covering. thanks, -chris No agenda, no replies Call canceled I presume? Indeed, next week, then pick up next year... -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for Dec 7
Please send in any agenda items you are interested in covering. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
* Rik van Riel (r...@redhat.com) wrote: On 12/02/2010 08:18 PM, Chris Wright wrote: * Rik van Riel (r...@redhat.com) wrote: Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted. Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single run of the vcpu. So shouldn't it confine to KVM_RUN? The other vcpu_load callers aren't always a vcpu in a useful runnable state. Yeah, probably. If you want I can move the setting of vcpu-task to kvm_vcpu_ioctl. Or maybe setting in sched_out and unsetting in sched_in. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote: Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can consume the extra work that CFS wants to give away. That's not sufficient. Lets we have 3 guests A, B, C that need to be rate limited to 25% on a single cpu system. We create this idle guest D that is 100% cpu hog as per above definition. Now when one of the guest is idle, what ensures that the idle cycles of A is given only to D and not partly to B/C? Yeah, I pictured priorties handling this. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Fri, Dec 03, 2010 at 05:27:52PM +0530, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote: Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can consume the extra work that CFS wants to give away. That's not sufficient. Lets we have 3 guests A, B, C that need to be rate limited to 25% on a single cpu system. We create this idle guest D that is 100% cpu hog as per above definition. Now when one of the guest is idle, what ensures that the idle cycles of A is given only to D and not partly to B/C? To tackle this problem, I was thinking of having a fill-thread associated with each vcpu (i.e both belong to same cgroup). Fill-thread consumes idle cycles left by vcpu, but otherwise doesn't compete with it for cycles. That's what Marcelo's suggestion does w/out a fill thread. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Fri, Dec 03, 2010 at 09:28:25AM -0800, Chris Wright wrote: * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote: Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can consume the extra work that CFS wants to give away. That's not sufficient. Lets we have 3 guests A, B, C that need to be rate limited to 25% on a single cpu system. We create this idle guest D that is 100% cpu hog as per above definition. Now when one of the guest is idle, what ensures that the idle cycles of A is given only to D and not partly to B/C? Yeah, I pictured priorties handling this. All guest are of equal priorty in this case (that's how we are able to divide time into 25% chunks), so unless we dynamically boost D's priority based on how idle other VMs are, its not going to be easy! Right, I think there has to be an external mgmt entity. Because num vcpus is not static. So priorities have to be rebalanaced at vcpu create/destroy time. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote: That's what Marcelo's suggestion does w/out a fill thread. There's one complication though even with that. How do we compute the real utilization of VM (given that it will appear to be burning 100% cycles)? We need to have scheduler discount the cycles burnt post halt-exit, so more stuff is needed than those simple 3-4 lines! Heh, was just about to say the same thing ;) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote: On 12/03/2010 11:58 AM, Chris Wright wrote: * Srivatsa Vaddagiri (va...@linux.vnet.ibm.com) wrote: On Fri, Dec 03, 2010 at 09:29:06AM -0800, Chris Wright wrote: That's what Marcelo's suggestion does w/out a fill thread. There's one complication though even with that. How do we compute the real utilization of VM (given that it will appear to be burning 100% cycles)? We need to have scheduler discount the cycles burnt post halt-exit, so more stuff is needed than those simple 3-4 lines! Heh, was just about to say the same thing ;) My first reaction is that it's not terribly important to account the non-idle time in the guest because of the use-case for this model. Depends on the chargeback model. This would put guest vcpu runtime vs host running guest vcpu time really out of skew. ('course w/out steal and that time it's already out of skew). But I think most models are more uptime based rather then actual runtime now. Eventually, it might be nice to have idle time accounting but I don't see it as a critical feature here. Non-idle time simply isn't as meaningful here as it normally would be. If you have 10 VMs in a normal environment and saw that you had only 50% CPU utilization, you might be inclined to add more VMs. Who is you? cloud user, or cloud service provider's scheduler? On the user side, 50% cpu utilization wouldn't trigger me to add new VMs. On the host side, 50% cpu utilization would have to be measure solely in terms of guest vcpu count. But if you're offering deterministic execution, it doesn't matter if you only have 50% utilization. If you add another VM, the guests will get exactly the same impact as if they were using 100% utilization. Sorry, didn't follow here? thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] sched: add yield_to function
* Rik van Riel (r...@redhat.com) wrote: On 12/02/2010 07:50 PM, Chris Wright wrote: +/* + * Yield the CPU, giving the remainder of our time slice to task p. + * Typically used to hand CPU time to another thread inside the same + * process, eg. when p holds a resource other threads are waiting for. + * Giving priority to p may help get that resource released sooner. + */ +void yield_to(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se =p-se; + struct rq *rq; + struct cfs_rq *cfs_rq; + u64 remain = slice_remain(current); + + rq = task_rq_lock(p,flags); + if (task_running(rq, p) || task_has_rt_policy(p)) + goto out; + cfs_rq = cfs_rq_of(se); + se-vruntime -= remain; + if (se-vruntime cfs_rq-min_vruntime) + se-vruntime = cfs_rq-min_vruntime; Should these details all be in sched_fair? Seems like the wrong layer here. And would that condition go the other way? If new vruntime is smaller than min, then it becomes new cfs_rq-min_vruntime? That would be nice. Unfortunately, EXPORT_SYMBOL() does not seem to work right from sched_fair.c, which is included from sched.c instead of being built from the makefile! add a -yield_to() to properly isolate (only relevant then in sched_fair)? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] pci: MSI-X capability is 12 bytes, not 16, MSI is 10 bytes
* Alex Williamson (alex.william...@redhat.com) wrote: Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/pci.h |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/pci.h b/hw/pci.h index 34955d8..7c52637 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -124,8 +124,8 @@ enum { #define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60 #define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40 -#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10 -#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10 +#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0xa This is variable length. +#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x0c typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector, int masked); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html