Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Wed, Apr 02, 2008 at 02:16:41PM +0300, Avi Kivity wrote: Ugh, there's still mark_page_accessed() and SetPageDirty(). btw, like PG_dirty is only set if the spte is writeable, mark_page_accessed should only run if the accessed bit is set in the spte. It doesn't matter now as nobody could possibly clear it, No one will clear it now, but it can start out cleared. This is done on speculative mmu_set_spte(): when the guest writes into its page tables, we update the spte speculatively, but the guest may not actually access that location (for example, due to page fault clustering). So the change makes sense even now. It still skips an atomic op. Your plan still sounds just fine despite the above, in fact it sounds too perfect: the awk hack to re-add the refcounting when building the external module if CONFIG_MMU_NOTIFIER isn't defined is going to be messy, a plain CONFIG_MMU_NOTIFIER in kvm.git would be simpler and more robust IMHO even if less perfect :). Worst case, we stick a get_user_pages() inside the memslot setup function. That makes things not swappable for pre-mmu notifiers, but completely safe. I'd rather avoid special casing the core code, whenever possible. -- error compiling committee.c: too many arguments to function - Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote: It ought to work. gfn_to_hfn() (old gfn_to_page) will still need to take a refcount if possible. This reminds me that with mmu notifiers we could implement gfn_to_hfn with only follow_page and skip the refcounting on the struct page. I'm not suggesting that though, the refcount makes the code more robust IMHO, and notably it allows to run on kernels without mmu notifiers. So the trick is to do something like if (pfn_valid(pfn)) put_page(pfn_to_page(pfn)); That's the trick I was suggesting to take care of mmio space. If the reserved ram is used instead of VT-d to allow DMA, that will have to change to:

if (pfn_valid(pfn)) {
	page = pfn_to_page(pfn);
	if (page_count(page))
		put_page(page);
}

that will take care of both the reserved ram and the mmio pfns. In general I made it for the reserved-ram with only the page_count != 0 check, because 'struct page' existed. I'm unsure if it's good to add struct pages for non-ram, I find it slightly confusing and not the right thing, it takes memory for stuff that can't happen (refcounting only makes sense if the page finally goes in the freelist when count reaches 0, and PG_dirty/referenced bits and the like don't make sense either for non-ram or reserved-ram). So I'm not sure the vmemmap trick is the best. This will increase the potential for type errors, so maybe we need to make gfn_t and hfn_t distinct structures, and use accessors to get the actual values. That's also a possibility. If we go for this then hfn_t is probably a better name as it's less likely to collide with core kernel VM code. Otherwise perhaps pfn can be used instead of hfn, it's up to you ;).
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote: It ought to work. gfn_to_hfn() (old gfn_to_page) will still need to take a refcount if possible. This reminds me that with mmu notifiers we could implement gfn_to_hfn with only follow_page and skip the refcounting on the struct page. I'm not suggesting that though, the refcount makes the code more robust IMHO, and notably it allows to run on kernels without mmu notifiers. Isn't it faster though? We don't need to pull in the cacheline containing the struct page anymore. We could hack something to make pre mmu notifier kernels work. I'm unsure if it's good to add struct pages for non-ram, I find it slightly confusing and not the right thing, it takes memory for stuff that can't happen (refcounting only makes sense if the page finally goes in the freelist when count reaches 0, and PG_dirty/referenced bits and the like don't make sense either for non-ram or reserved-ram). So I'm not sure the vmemmap trick is the best. I thought we'd meet with resistance to that idea :) though I'd like to point out that struct pages do exist for mmio on some machines (those with >= 4GB). This will increase the potential for type errors, so maybe we need to make gfn_t and hfn_t distinct structures, and use accessors to get the actual values. That's also a possibility. If we go for this then hfn_t is probably a better name as it's less likely to collide with core kernel VM code. Otherwise perhaps pfn can be used instead of hfn, it's up to you ;). I guess we can move to pfn, they're unambiguous enough.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote: Isn't it faster though? We don't need to pull in the cacheline containing the struct page anymore. Exactly, not only that, get_user_pages is likely a bit slower than what we need for just a kvm pte lookup. GRU uses follow_page directly because it runs in the tlb miss handler, for us instead the tlb miss handler doesn't invoke a page fault unless the spte is non present. I expect for us that optimization will be mostly lost in the noise but it is likely measurable in heavy VM workloads and in general it is surely worth it in the long run (if nothing else as a microoptimization for whatever spte fault benchmark). We could hack something to make pre mmu notifier kernels work. Actually I already removed the refcounting for the reserved ram, so it's going to be very easy to do with a few CONFIG_MMU_NOTIFIER. But because it's a low prio I would defer it for later, first patch can do the refcounting unconditionally to better test it. (I'm assuming we're going to deal with kernels without mmu notifiers [or without EMM ;) ] for a long while) I thought we'd meet with resistance to that idea :) though I'd like to point out that struct pages do exist for mmio on some machines (those with >= 4GB). Sure, those should have page_count 0 like the reserved ram. I'm uncertain about the PageReserved semantics, I thought Nick nuked it long ago but perhaps it was something else. Now copy_page_range bails out on PFNMAP vmas, in the old days it would treat PG_reserved pages mapped with remap_page_range like a VM_SHARED mapping by not wrprotecting and not refcounting even if this was a private mapping. In general using page_t for anything but ram is something that should be avoided and that also makes PG_reserved semantics mostly irrelevant. The users of remap_pfn_range (like /dev/mem) should never use pages regardless of whether it's ram or non-ram or whether page_t exists or not. 
The fact page_t sometimes exists for non-ram is strictly a performance optimization to make the pfn_to_page resolution quicker at runtime (the speedup is significant and the memory loss is negligible). I guess we can move to pfn, they're unambiguous enough. Ok! ;)
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote: Isn't it faster though? We don't need to pull in the cacheline containing the struct page anymore. Exactly, not only that, get_user_pages is likely a bit slower than what we need for just a kvm pte lookup. GRU uses follow_page directly because it runs in the tlb miss handler, for us instead the tlb miss handler doesn't invoke a page fault unless the spte is non present. How about this plan?

0. Merge mmu notifiers.
1. gfn_to_page() -> gfn_to_pfn(). Still keeping the refcount. Change bad_page to kvm_bad_hfn.
2. Drop the refcounting from gfn_to_pfn() and from kvm_release_page_*(). Still using get_user_pages() (and dropping the refcount immediately). Simultaneously, change hack_module.awk to add the refcount back.
3. Export follow_page() or something based on fast_gup(), and use it.

btw, if we change the method we use to read the Linux pte, I'd like to get the writable bit out of it. This way, when we create an spte for a gpte that is writable and dirty, we can set the spte writable iff the Linux pte is writable. This avoids breaking COW unnecessarily.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 02:16:41PM +0300, Avi Kivity wrote: Ugh, there's still mark_page_accessed() and SetPageDirty(). btw, like PG_dirty is only set if the spte is writeable, mark_page_accessed should only run if the accessed bit is set in the spte. It doesn't matter now as nobody could possibly clear it, but with mmu notifiers the ->clear_young will clear that accessed bit for the first time, and if the guest didn't use the spte in the meantime, we don't need to mark the page accessed and we can let the VM swap out the page at the next round of the lru. But that's ok, we'll simply run mark_page_accessed and SetPageDirty if pfn_valid is true. We're under mmu_lock so the mmu notifiers will prevent the page from going away from under us. The detail I'm uncertain about is why my reserved-ram patch had pfn_valid == 1 but page_count == 0 and PG_reserved == 0 for all reserved ram. Perhaps I triggered a bug in some memmap initialization, as the common code should keep all pages that aren't supposed to go in the freelist when freed as PG_reserved and page_count == 1. Anyway in practice it worked fine and there's zero risk as I'm now using reserved-ram with the page_count == 0 check ;). The rmap_remove should look like this (this won't work with the reserved ram, but that needs fixing somehow as reserved-ram must be like an mmio region but with a pfn_valid and in turn with a page_t, and then the reserved-ram patch will be able to cope with the reserved ram not having an allocated memmap by luck of how page_alloc.c works).

if (pfn_valid(pfn)) {
	page = pfn_to_page(pfn);
	if (!PageReserved(page)) {
		BUG_ON(page_count(page) != 1);
		if (is_writeable_pte(*spte))
			SetPageDirty(page);
		if (*spte & PT_ACCESSED_MASK)
			mark_page_accessed(page);
	}
}

It still skips an atomic op. 
Your plan still sounds just fine despite the above, in fact it sounds too perfect: the awk hack to re-add the refcounting when building the external module if CONFIG_MMU_NOTIFIER isn't defined is going to be messy, a plain CONFIG_MMU_NOTIFIER in kvm.git would be simpler and more robust IMHO even if less perfect :).
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Wed, Apr 02, 2008 at 01:50:19PM +0200, Andrea Arcangeli wrote:

if (pfn_valid(pfn)) {
	page = pfn_to_page(pfn);
	if (!PageReserved(page)) {
		BUG_ON(page_count(page) != 1);
		if (is_writeable_pte(*spte))
			SetPageDirty(page);
		if (*spte & PT_ACCESSED_MASK)
			mark_page_accessed(page);
	}
}

warning: the indenting of the above in text-mode was wrong... read it like a C compiler would, sorry ;)
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Avi Kivity wrote: Andrea Arcangeli wrote: On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote: Isn't it faster though? We don't need to pull in the cacheline containing the struct page anymore. Exactly, not only that, get_user_pages is likely a bit slower than what we need for just a kvm pte lookup. GRU uses follow_page directly because it runs in the tlb miss handler, for us instead the tlb miss handler doesn't invoke a page fault unless the spte is non present. How about this plan? 0. Merge mmu notifiers Are mmu-notifiers likely to get pushed when the 2.6.26 window opens up? 1. gfn_to_page() -> gfn_to_pfn() Still keeping the refcount. Change bad_page to kvm_bad_hfn. kvm_bad_pfn. 2. Drop the refcounting from gfn_to_pfn() and from kvm_release_page_*() Still using get_user_pages() (and dropping the refcount immediately) Simultaneously, change hack_module.awk to add the refcount back. This has the dependency on mmu notifiers so step 1 can conceivably be merged in the absence of mmu notifiers. Regards, Anthony Liguori 3. Export follow_page() or something based on fast_gup(), and use it btw, if we change the method we use to read the Linux pte, I'd like to get the writable bit out of it. This way, when we create an spte for a gpte that is writable and dirty, we can set the spte writable iff the Linux pte is writable. This avoids breaking COW unnecessarily.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
[EMAIL PROTECTED] wrote: From: Ben-Ami Yassour [EMAIL PROTECTED] Enable a guest to access a device's memory mapped I/O regions directly. Userspace sends the mmio regions that the guest can access. On the first page fault for an access to an mmio address the host translates the gva to hpa, and updates the sptes. Can you explain why you're not using the regular memory slot mechanism? i.e. have userspace mmap(/dev/mem) and create a memslot containing that at the appropriate guest physical address? There are some issues with refcounting, but Andrea has some tricks to deal with that. -- Any sufficiently difficult bug is indistinguishable from a feature.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Avi Kivity wrote: [EMAIL PROTECTED] wrote: From: Ben-Ami Yassour [EMAIL PROTECTED] Enable a guest to access a device's memory mapped I/O regions directly. Userspace sends the mmio regions that the guest can access. On the first page fault for an access to an mmio address the host translates the gva to hpa, and updates the sptes. Can you explain why you're not using the regular memory slot mechanism? i.e. have userspace mmap(/dev/mem) and create a memslot containing that at the appropriate guest physical address? /dev/mem is often restricted in what memory can be mapped. However, we can't add something like this to KVM that allows arbitrary HPA's to be mapped into a guest from userspace. This is just as bad as /dev/mem and is going to upset a lot of people. Regardless of whether we can use /dev/mem, I think we should introduce a new char device anyway. We only need to mmap() MMIO regions which are mapped by the PCI bus, presumably, the kernel should know about these mappings. The driver should only allow mappings that are valid for a particular PCI device such that it cannot be abused to map arbitrary regions of memory into a guest. Bonus points if it can validate that there isn't a valid Linux driver loaded for the given PCI device. Regards, Anthony Liguori There are some issues with refcounting, but Andrea has some tricks to deal with that.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Anthony Liguori wrote: Regardless of whether we can use /dev/mem, I think we should introduce a new char device anyway. We only need to mmap() MMIO regions which are mapped by the PCI bus, presumably, the kernel should know about these mappings. The driver should only allow mappings that are valid for a particular PCI device such that it cannot be abused to map arbitrary regions of memory into a guest. Bonus points if it can validate that there isn't a valid Linux driver loaded for the given PCI device. Which is apparently entirely unnecessary as we already have /sys/bus/pci/.../region. It's just a matter of checking if a vma is VM_IO and then dealing with the subsequent reference counting issues as Avi points out. From idea to working implementation in 38 minutes. Congrats.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Tue, Apr 01, 2008 at 08:03:14PM +0300, Avi Kivity wrote: Anthony Liguori wrote: Avi Kivity wrote: [EMAIL PROTECTED] wrote: From: Ben-Ami Yassour [EMAIL PROTECTED] Enable a guest to access a device's memory mapped I/O regions directly. Userspace sends the mmio regions that the guest can access. On the first page fault for an access to an mmio address the host translates the gva to hpa, and updates the sptes. Can you explain why you're not using the regular memory slot mechanism? i.e. have userspace mmap(/dev/mem) and create a memslot containing that at the appropriate guest physical address? /dev/mem is often restricted in what memory can be mapped. Please elaborate. The /dev/mem and /dev/kmem devices have a special SELinux context, memory_device_t, and very few application domains are allowed to access them. The KVM/QEMU policy will not allow this for example. Basically only the X server, HAL and dmidecode have access in current policy. It would be undesirable to give all KVM guests full access to /dev/mem, so a more fine grained access method would have benefits here. Dan. -- |: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Tue, Apr 01, 2008 at 10:20:49AM -0500, Anthony Liguori wrote: Which is apparently entirely unnecessary as we already have /sys/bus/pci/.../region. It's just a matter of checking if a vma is VM_IO and then dealing with the subsequent reference counting issues as Avi points out. Do you need to map it in userland too, isn't it enough to map it in the sptes? For the ram I had to map it in userland too with /dev/mem, and then I used the pte_pfn to fill the spte, so the emulated qemu drivers can input/output. But for the mmio space I doubt the userland side is needed. If you add a direct memslot (new bitflag type) I will use it too instead of catching get_user_pages failures and walking ptes on the RAM pieces overwritten by /dev/mem. There's a certain amount of elegance in mapping to userspace and not introducing a direct memslot. It helps simplify the security model since an application isn't able to do anything more than it could without KVM. The difficulty with a direct memslot is how you introduce policies to limit what guests can access what memslots directly. You would have to teach KVM to interact with the PCI subsystem to determine what memory was within a particular PCI IO region. Not impossible, just ugly. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Tue, Apr 01, 2008 at 08:10:31PM +0200, Andrea Arcangeli wrote: On Tue, Apr 01, 2008 at 06:18:07PM +0100, Daniel P. Berrange wrote: and very few application domains are allowed to access them. The KVM/QEMU policy will not allow this for example. Basically only the X server, HAL and dmidecode have access in current policy. It would be undesirable to give all KVM guests full access to /dev/mem, so a more fine grained access method would have benefits here. But pci-passthrough can give root on the host even to the ring0 guest, just like /dev/mem without VT-d, so there's not much difference with using /dev/mem as far as security is concerned. Only on CPUs including VT-d is it possible to retain a mostly equivalent security level despite pci-passthrough. Clearly it is a losing battle without VT-d. That doesn't mean we should design it to lose in general. So we should design so that when we do have VT-d it will have the maximum security possible. VT-d will only become more widespread over time. Dan.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Tue, Apr 01, 2008 at 06:18:07PM +0100, Daniel P. Berrange wrote: and very few application domains are allowed to access them. The KVM/QEMU policy will not allow this for example. Basically only the X server, HAL and dmidecode have access in current policy. It would be undesirable to give all KVM guests full access to /dev/mem, so a more fine grained access method would have benefits here. But pci-passthrough can give root on the host even to the ring0 guest, just like /dev/mem without VT-d, so there's not much difference with using /dev/mem as far as security is concerned. Only on CPUs including VT-d is it possible to retain a mostly equivalent security level despite pci-passthrough. Eventually, most machines will have an IOMMU so that's the assumption to design to. It is true that PCI pass-through w/o VT-d is always going to be equivalent to /dev/mem access but it's really a special case. Unless you have an IOMMU, you would not do PCI pass-through if you cared at all about security/reliability. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Tue, Apr 01, 2008 at 10:20:49AM -0500, Anthony Liguori wrote: Which is apparently entirely unnecessary as we already have /sys/bus/pci/.../region. It's just a matter of checking if a vma is VM_IO and then dealing with the subsequent reference counting issues as Avi points out. Do you need to map it in userland too, isn't it enough to map it in the sptes? For the ram I had to map it in userland too with /dev/mem, and then I used the pte_pfn to fill the spte, so the emulated qemu drivers can input/output. But for the mmio space I doubt the userland side is needed. If you add a direct memslot (new bitflag type) I will use it too instead of catching get_user_pages failures and walking ptes on the RAM pieces overwritten by /dev/mem.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Anthony Liguori wrote: I looked at Andrea's patches and I didn't see any special handling for non-RAM pages. Something Muli mentioned that kept them from doing /sys/devices/pci/.../region to begin with was the fact that IO pages do not have a struct page backing them so get_user_pages() will fail in gfn_to_page(). It's just something we discussed, not code.
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Avi Kivity [EMAIL PROTECTED] wrote on 01/04/2008 16:30:00: [EMAIL PROTECTED] wrote: From: Ben-Ami Yassour [EMAIL PROTECTED] Enable a guest to access a device's memory mapped I/O regions directly. Userspace sends the mmio regions that the guest can access. On the first page fault for an access to an mmio address the host translates the gva to hpa, and updates the sptes. Can you explain why you're not using the regular memory slot mechanism? i.e. have userspace mmap(/dev/mem) and create a memslot containing that at the appropriate guest physical address? There are some issues with refcounting, but Andrea has some tricks to deal with that. Our initial approach was to mmap /sys/bus/pci/devices/.../resource# and create a memory slot for it. However eventually we realized that for mmio we don't need hva mapped to the mmio region, we can map the gpa directly to hpa. As far as I understand, the memory slots mechanism is used to map gpa to hva. Then gfn_to_page uses get_user_pages to map hva to hpa. However, get_user_pages does not work for mmio, and in addition we know the hpa to begin with, so there is no real need to map an hva for the mmio region. In addition there is an assumption in the code that there is a page struct for the frame, which is not the case for mmio. So it was easier to simply add a list of mmio gpa->hpa mappings. I guess we can use the memory slots array to hold the gpa to hpa mapping but it is not necessarily natural, and would probably require more hooks in the code to handle an mmio memory slot. BTW, note that for a guest that has multiple passthrough devices, each with a set of mmio regions, it might become a long list, so there might be value in keeping it separate in that respect as well. 
With regards to the potential security issue Anthony pointed out (ioctls taking hpa's) we can change the interface so that they will take (device, BAR) instead and the kernel will translate to hpa. What do you think? Given that the shadow page table code has to be modified anyway (due to the struct page issue), is it worthwhile to experiment with mmap(...region) or is the current approach sufficient? Thanks, Ben
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Ben-Ami Yassour1 wrote: Can you explain why you're not using the regular memory slot mechanism? i.e. have userspace mmap(/dev/mem) and create a memslot containing that at the appropriate guest physical address? Our initial approach was to mmap /sys/bus/pci/devices/.../resource# and create a memory slot for it. However eventually we realized that for mmio we don't need hva mapped to the mmio region, we can map the gpa directly to hpa. As far as I understand, the memory slots mechanism is used to map gpa to hva. Then gfn_to_page uses get_user_pages to map hva to hpa. However, get_user_pages does not work for mmio, and in addition we know the hpa to begin with, so there is no real need to map an hva for the mmio region. In addition there is an assumption in the code that there is a page struct for the frame, which is not the case for mmio. So it was easier to simply add a list of mmio gpa->hpa mappings. I guess we can use the memory slots array to hold the gpa to hpa mapping but it is not necessarily natural, and would probably require more hooks in the code to handle an mmio memory slot. BTW, note that for a guest that has multiple passthrough devices, each with a set of mmio regions, it might become a long list, so there might be value in keeping it separate in that respect as well. With regards to the potential security issue Anthony pointed out (ioctls taking hpa's) we can change the interface so that they will take (device, BAR) instead and the kernel will translate to hpa. Not enough. How do you know if this calling process has permissions to access that pci device (I retract my previous "pci passthrough is as rootish as you get" remark). What do you think? Given that the shadow page table code has to be modified anyway (due to the struct page issue), is it worthwhile to experiment with mmap(...region) or is the current approach sufficient? As Anthony points out, the advantage to mmap() is that whatever security is needed can be applied at that level. 
Passing host physical addresses from userspace requires a whole new security model. The issue with gfn_to_page() is real. We can either try to work around it somehow, or we can try to back mmio regions with struct pages. Now that it looks like mem_map is becoming virtually mapped, the cost is minimal, but I expect this approach will meet some resistance. -- Any sufficiently difficult bug is indistinguishable from a feature.
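The separate gpa-hpa list the patch proposes can be sketched as a small userspace model. This is only an illustration of the data structure under discussion, not the patch's actual code; the names (struct mmio_map, mmio_gpa_to_hpa) are made up for the example. It also makes the scaling concern concrete: a linear scan over one entry per BAR per passthrough device.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the proposed gpa->hpa list for passthrough mmio:
 * each entry maps a guest-physical range straight to a host-physical
 * base, bypassing the hva-based memslot lookup entirely. */
struct mmio_map {
    uint64_t gpa;   /* guest-physical start of the BAR */
    uint64_t hpa;   /* host-physical start of the BAR  */
    uint64_t size;  /* length of the mmio region       */
};

/* Linear scan; with many devices, each with several BARs, this list
 * gets long, which is the concern raised in the mail. */
static uint64_t mmio_gpa_to_hpa(const struct mmio_map *map, size_t n,
                                uint64_t gpa)
{
    for (size_t i = 0; i < n; i++)
        if (gpa >= map[i].gpa && gpa < map[i].gpa + map[i].size)
            return map[i].hpa + (gpa - map[i].gpa);
    return 0; /* not an mmio gpa: fall back to the memslot path */
}
```

A real implementation would likely keep the entries sorted (or in an interval tree) once the list grows past a handful of devices.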
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Avi Kivity wrote: Ben-Ami Yassour1 wrote: Not enough. How do you know if the calling process has permission to access that pci device? (I retract my previous "pci passthrough is as rootish as you get" remark.) What do you think? Given that the shadow page table code has to be modified anyway (due to the struct page issue), is it worthwhile to experiment with mmap(...region), or is the current approach sufficient? As Anthony points out, the advantage of mmap() is that whatever security is needed can be applied at that level. Passing host physical addresses from userspace requires a whole new security model. The issue with gfn_to_page() is real. We can either try to work around it somehow, or we can try to back mmio regions with struct pages. Now that it looks like mem_map is becoming virtually mapped, the cost is minimal, but I expect this approach will meet some resistance. What about switching the KVM MMU code to use hfn_t instead of struct page? The initial conversion is pretty straightforward, as in the places where you actually need a struct page you can always get it from pfn_to_page() (like in kvm_release_page_dirty). We can then teach the MMU to deal with hfn_t's that don't have a corresponding page. IIUC, the implementation of something like kvm_release_page_dirty is a nop for pfn's that don't have a corresponding page. We just have to be able to detect a pfn_to_page() failure and then assume we're dealing with IO memory. Regards, Anthony Liguori
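The "nop for pfn's without a struct page" idea can be sketched as a userspace model. This is not kernel code: pfn_valid() and the mem_map are stubbed with toy equivalents (all fake_* names are hypothetical) purely to show the control flow Anthony describes, where the release/dirty helper bails out early for IO memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel's mem_map and pfn_valid(). */
struct fake_page { int count; bool dirty; };

#define FAKE_NR_PAGES 4
static struct fake_page fake_mem_map[FAKE_NR_PAGES];

static bool fake_pfn_valid(uint64_t pfn)
{
    return pfn < FAKE_NR_PAGES;
}

/* Model of a pfn-based kvm_release_page_dirty: for a pfn with no
 * struct page behind it (mmio), there is nothing to mark dirty or
 * unpin, so the helper is simply a nop. */
static void fake_release_pfn_dirty(uint64_t pfn)
{
    if (!fake_pfn_valid(pfn))
        return;                      /* IO memory: nop              */
    fake_mem_map[pfn].dirty = true;  /* stands in for SetPageDirty() */
    fake_mem_map[pfn].count--;       /* stands in for put_page()     */
}
```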
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Andrea Arcangeli wrote: On Tue, Apr 01, 2008 at 01:21:37PM -0500, Anthony Liguori wrote: return a page, not a HPA. I haven't looked too deeply yet, but my suspicion is that properly supporting the mapping of VM_IO pages will require some general refactoring, since we always assume that a struct page exists for any HPA. Yes, that was potentially a problem for reserved _ram_ pages too, as it isn't guaranteed that the memmap_t (old days nomenclature) will exist for physical addresses not defined as ram in the e820 map (to make it work without VT-d I have to reserve the ram in the host at e820 map parsing time). If the memmap does not exist for the reserved ram physical range, pfn_valid() will fail at runtime in kvm and the bad_page will generate a graceful emulation failure, so it's very safe. But once we handle direct memslots for mmio regions, the reserved ram had better stop depending on the memmap too. Direct memslots are quite a sledgehammer for dealing with IO pages, though. You could get away with supporting reserved RAM another way. If we refactored the MMU to use hfn_t instead of struct page, you would then need a mechanism to mmap() reserved ram into userspace (similar to ioremap, I guess). In fact, you may be able to just lie and claim reserved ram is IO ram. Then we could treat the reserved ram in the same way as IO ram and not have to introduce a new interface in KVM for mapping arbitrary regions of memory. I would be interested in this sort of mechanism not for reserving low RAM, but for reserving high RAM. Basically, I'd like to boot with mem= something and then still be able to map the higher memory in QEMU userspace and create KVM guests with it. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
On Tue, Apr 01, 2008 at 10:22:51PM +0300, Avi Kivity wrote: It's just something we discussed, not code. Yes, the pfn_valid check should skip all refcounting for mmio regions without a struct page. But gfn_to_page can't work without a struct page, so some change will be needed there. With the reserved ram patch I didn't need to touch the gfn_to_page interface, because the memmap happens to exist even though the ram is reserved in the e820 map, so I used !page_count() instead of the previously discussed !pfn_valid() to skip all refcounting (reserved ram must skip the refcounting too, because those ram pages can't be freed: they're not ram as far as Linux is concerned).
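The release path that covers all three cases — mmio with no memmap, reserved ram with a memmap but a zero page count, and normal ram — can be sketched as a userspace model. The toy_* names are hypothetical and the kernel structures are stubbed; the point is only the order of the two checks (pfn_valid first, then page_count) discussed in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy memmap: reserved ram pages keep count == 0 forever. */
struct toy_page { int count; };

#define TOY_NR_PAGES 4
static struct toy_page toy_mem_map[TOY_NR_PAGES];

static bool toy_pfn_valid(uint64_t pfn)
{
    return pfn < TOY_NR_PAGES;
}

/* Returns true only if a reference was actually dropped. */
static bool toy_release_pfn(uint64_t pfn)
{
    if (!toy_pfn_valid(pfn))
        return false;            /* mmio: no memmap entry at all       */
    if (!toy_mem_map[pfn].count)
        return false;            /* reserved ram: memmap exists, but
                                    the page is never refcounted       */
    toy_mem_map[pfn].count--;    /* normal ram: stands in for put_page */
    return true;
}
```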
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Anthony Liguori wrote: You could get away with supporting reserved RAM another way though. If we refactored the MMU to use hfn_t instead of struct page, you would then need a mechanism to mmap() reserved ram into userspace (similar to ioremap I guess). In fact, you may be able to just lie and claim reserved ram is IO ram. We'd need it to return a (hfn_t, page_type_t) pair so the caller would know whether to use page refcounting or not. The caller might be able to ask Linux whether the page has a struct page or not, but it gets hairy. That's why I'm interested in the possibility of adding struct pages at runtime. vmemmap is needed for many other reasons (NUMA, memory hotplug, sparse SGI memory) so it's reasonable that it will be enabled on most distros.
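The (hfn_t, page_type_t) pair Avi suggests could look something like the sketch below. Everything here is invented for illustration (the enum and struct names do not come from any patch in the thread); it just shows how the caller would decide whether to refcount without ever consulting Linux directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page types: only real ram pages take get/put_page. */
enum hfn_type { HFN_RAM, HFN_RESERVED_RAM, HFN_MMIO };

/* Hypothetical return value of a gfn_to_hfn()-style translation:
 * the frame number plus what kind of memory backs it. */
struct hfn_result {
    uint64_t hfn;
    enum hfn_type type;
};

static bool hfn_needs_refcount(struct hfn_result r)
{
    return r.type == HFN_RAM;
}
```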
Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part
Anthony Liguori wrote: What about switching the KVM MMU code to use hfn_t instead of struct page? The initial conversion is pretty straightforward, as in the places where you actually need a struct page you can always get it from pfn_to_page() (like in kvm_release_page_dirty). We can then teach the MMU to deal with hfn_t's that don't have a corresponding page. IIUC, the implementation of something like kvm_release_page_dirty is a nop for pfn's that don't have a corresponding page. We just have to be able to detect a pfn_to_page() failure and then assume we're dealing with IO memory. It ought to work. gfn_to_hfn() (old gfn_to_page) will still need to take a refcount if possible. This will increase the potential for type errors, so maybe we need to make gfn_t and hfn_t distinct structures, and use accessors to get the actual values.
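The distinct-structure idea is the same trick the kernel uses for atomic_t: wrap each frame-number kind in a single-member struct so that passing an hfn where a gfn is expected becomes a compile error instead of a silent bug. A minimal sketch, with all helper names (make_gfn, gfn_val, gfn_to_gpa, ...) chosen for illustration:

```c
#include <stdint.h>

/* Distinct single-member wrappers: structurally identical, but the
 * compiler treats them as incompatible types. */
typedef struct { uint64_t val; } gfn_t;
typedef struct { uint64_t val; } hfn_t;

static inline gfn_t make_gfn(uint64_t v) { gfn_t g = { v }; return g; }
static inline hfn_t make_hfn(uint64_t v) { hfn_t h = { v }; return h; }
static inline uint64_t gfn_val(gfn_t g)  { return g.val; }
static inline uint64_t hfn_val(hfn_t h)  { return h.val; }

/* A helper taking gfn_t now rejects a raw integer or an hfn_t at
 * compile time; with plain u64 typedefs both would slip through. */
static inline uint64_t gfn_to_gpa(gfn_t gfn)
{
    return gfn_val(gfn) << 12; /* 4K pages */
}
```

The cost is the accessor boilerplate; the benefit is that every gfn/hfn mix-up in the MMU conversion is caught by the compiler rather than by a guest crash.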