Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Mon, Jun 07, 2010 at 12:01:04PM -0700, Tom Lyon wrote: On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote: On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote: OK, in the interest of making progress, I am about to embark on the following: 1. Create a user-iommu-domain driver - opening it will give a new empty domain. Ultimately this can also populate sysfs with the state of its world, which would also be a good addition to the base iommu stuff. If someone closes the fd while in use, the domain stays valid anyway until users drop off. 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that a domain be set before using the VFIO_DMA_MAP_IOVA ioctl Require domain to be set before you allow any access to the device: mmap, write, read. IMO this is the only safe way to make sure userspace does not corrupt memory, and this removes the need to special-case MSI memory, play with bus master enable and hope it can be cleared without reset, etc. Michael - the light bulb finally lit for me and I now understand what you've been saying the past few weeks. Of course you're right - we need iommu set before any register access. I had thought that was done by default but now I see that the dma_map_sg routine only attaches to the iommu on demand. So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl with the vfio fd so that the user can still do everything with one fd. I'm thinking the fd opens and iommu bindings could be done in a program before spinning out the program with the user driver. This would kind of break Avi's idea that mappings are programmed at the domain and shared by multiple devices, won't it? (this is the one that KVM wants). Not sure I understand. I think that MAP should be done on the domain, not the device, this handles pinning pages correctly and this way you don't need any special checks. However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one which uses the dma_sg interface which has no expicit control of domains. I intend to keep it the way it is, but expect only non-hypervisor programs would want to use it. If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't non-hypervisors just pick an address? 3. Clean up the docs and other nits that folks have found. Comments? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote: On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote: OK, in the interest of making progress, I am about to embark on the following: 1. Create a user-iommu-domain driver - opening it will give a new empty domain. Ultimately this can also populate sysfs with the state of its world, which would also be a good addition to the base iommu stuff. If someone closes the fd while in use, the domain stays valid anyway until users drop off. 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that a domain be set before using the VFIO_DMA_MAP_IOVA ioctl Require domain to be set before you allow any access to the device: mmap, write, read. IMO this is the only safe way to make sure userspace does not corrupt memory, and this removes the need to special-case MSI memory, play with bus master enable and hope it can be cleared without reset, etc. Michael - the light bulb finally lit for me and I now understand what you've been saying the past few weeks. Of course you're right - we need iommu set before any register access. I had thought that was done by default but now I see that the dma_map_sg routine only attaches to the iommu on demand. So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl with the vfio fd so that the user can still do everything with one fd. I'm thinking the fd opens and iommu bindings could be done in a program before spinning out the program with the user driver. (this is the one that KVM wants). Not sure I understand. I think that MAP should be done on the domain, not the device, this handles pinning pages correctly and this way you don't need any special checks. However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one which uses the dma_sg interface which has no expicit control of domains. I intend to keep it the way it is, but expect only non-hypervisor programs would want to use it. If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't non-hypervisors just pick an address? 3. Clean up the docs and other nits that folks have found. Comments? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote: OK, in the interest of making progress, I am about to embark on the following: 1. Create a user-iommu-domain driver - opening it will give a new empty domain. Ultimately this can also populate sysfs with the state of its world, which would also be a good addition to the base iommu stuff. If someone closes the fd while in use, the domain stays valid anyway until users drop off. 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that a domain be set before using the VFIO_DMA_MAP_IOVA ioctl Require domain to be set before you allow any access to the device: mmap, write, read. IMO this is the only safe way to make sure userspace does not corrupt memory, and this removes the need to special-case MSI memory, play with bus master enable and hope it can be cleared without reset, etc. (this is the one that KVM wants). Not sure I understand. I think that MAP should be done on the domain, not the device, this handles pinning pages correctly and this way you don't need any special checks. However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one which uses the dma_sg interface which has no expicit control of domains. I intend to keep it the way it is, but expect only non-hypervisor programs would want to use it. If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't non-hypervisors just pick an address? 3. Clean up the docs and other nits that folks have found. Comments? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 07:53 PM, Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: The interface would only work for clients which support it: kvm, vhost, and iommu/devices with restartable dma. BTW, there is no such thing as restartable dma. There is a provision in new specs (read: no real hardware) that allows a device to request pages before using them. So it's akin to demand paging, but the demand is an explicit request rather than a page fault. Thanks for the correction. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
OK, in the interest of making progress, I am about to embark on the following: 1. Create a user-iommu-domain driver - opening it will give a new empty domain. Ultimately this can also populate sysfs with the state of its world, which would also be a good addition to the base iommu stuff. If someone closes the fd while in use, the domain stays valid anyway until users drop off. 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver. Require that a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one that KVM wants). However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one which uses the dma_sg interface which has no expicit control of domains. I intend to keep it the way it is, but expect only non-hypervisor programs would want to use it. 3. Clean up the docs and other nits that folks have found. Comments? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote: There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. The natural way to do this is to have an iommu driver for programming iommu. IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. We can solve the problem of multiple devices-per-domain with an ioctl which allows binding one uio-device to the address-space on another. Thats much simpler. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote: On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote: Main difference is that vhost works fine with unlocked memory, paging it in on demand. iommu needs to unmap memory when it is swapped out or relocated. So you'd just take the memory map and not pin anything. This way you can reuse the memory map. But no, it doesn't handle the dirty bitmap, so no go. IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 12:45 PM, Joerg Roedel wrote: On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote: On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote: Main difference is that vhost works fine with unlocked memory, paging it in on demand. iommu needs to unmap memory when it is swapped out or relocated. So you'd just take the memory map and not pin anything. This way you can reuse the memory map. But no, it doesn't handle the dirty bitmap, so no go. IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. vhost doesn't pin memory. What I proposed is to describe the memory map using an object (fd), and pass it around to clients that use it: kvm, vhost, vfio. That way you maintain the memory map in a central location and broadcast changes to clients. Only a vfio client would result in memory being pinned. It can still work, but the interface needs to be extended to include dirty bitmap logging. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 12:42 PM, Joerg Roedel wrote: On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote: There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. The natural way to do this is to have an iommu driver for programming iommu. IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. We can solve the problem of multiple devices-per-domain with an ioctl which allows binding one uio-device to the address-space on another. Thats much simpler. This is non trivial with hotplug. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, Jun 01, 2010 at 09:59:40PM -0700, Tom Lyon wrote: This is just what I was thinking. But rather than a get/set, just use two fds. ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2); This may fail if there are really 2 different IOMMUs, so user code must be prepared for failure, In addition, this is strictlyupwards compatible with what is there now, so maybe we can add it later. How can this fail with multiple IOMMUs? This should be handled transparently by the IOMMU driver. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote: On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote: There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. The natural way to do this is to have an iommu driver for programming iommu. IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. One advantage would be that we can reuse the uio framework for the devices themselves. So an existing app can just program an iommu for DMA and keep using uio for interrupts and access. We can solve the problem of multiple devices-per-domain with an ioctl which allows binding one uio-device to the address-space on another. This would imply switching an iommu domain for a device while it could potentially be doing DMA. No idea whether this can be done in a safe manner. Forcing iommu assignment to be done as a first step seems much saner. Thats much simpler. Joerg So instead of dev = open(); ioctl(dev, ASSIGN, iommu) mmap and if we for ioctl mmap will fail we have dev = open(); if (ndevices 0) ioctl(devices[0], ASSIGN, dev) mmap And if we forget ioctl we get errors from device. Seems more complicated to me. There will also always exist the confusion: address space for which device are we modifying? With a separate driver for iommu, we can safely check that binding is done correctly. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote: On 06/02/2010 12:45 PM, Joerg Roedel wrote: IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. vhost doesn't pin memory. What I proposed is to describe the memory map using an object (fd), and pass it around to clients that use it: kvm, vhost, vfio. That way you maintain the memory map in a central location and broadcast changes to clients. Only a vfio client would result in memory being pinned. Ah ok, so its only about the database which keeps the mapping information. It can still work, but the interface needs to be extended to include dirty bitmap logging. Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no dirty-bits in the page-table. And without demand-paging we can't really tell what pages a device has written to. The only choice is to mark all IOMMU-mapped pages dirty as long as they are mapped. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:04:04PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote: On 06/02/2010 12:45 PM, Joerg Roedel wrote: IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. vhost doesn't pin memory. What I proposed is to describe the memory map using an object (fd), and pass it around to clients that use it: kvm, vhost, vfio. That way you maintain the memory map in a central location and broadcast changes to clients. Only a vfio client would result in memory being pinned. Ah ok, so its only about the database which keeps the mapping information. It can still work, but the interface needs to be extended to include dirty bitmap logging. Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no dirty-bits in the page-table. And without demand-paging we can't really tell what pages a device has written to. The only choice is to mark all IOMMU-mapped pages dirty as long as they are mapped. Joerg Or mark them dirty when they are unmapped. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote: IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. One advantage would be that we can reuse the uio framework for the devices themselves. So an existing app can just program an iommu for DMA and keep using uio for interrupts and access. The driver is called UIO and not U-INTR-MMIO ;-) So I think handling IOMMU mappings belongs there. We can solve the problem of multiple devices-per-domain with an ioctl which allows binding one uio-device to the address-space on another. This would imply switching an iommu domain for a device while it could potentially be doing DMA. No idea whether this can be done in a safe manner. It can. The worst thing that can happen is an io-page-fault. Forcing iommu assignment to be done as a first step seems much saner. If we force it, there is no reason why not doing it implicitly. We can do something like this then: dev1 = open(); ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to it*/ dev2 = open(); ioctl(dev2, IOMMU_MAP, ...); /* Now dev1 and dev2 are in seperate domains */ ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and assigns it to the same domain as dev1. Domain has a refcount of two now */ close(dev1); /* domain refcount goes down to one */ close(dev2); /* domain refcount is zero and domain gets destroyed */ Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 11:45:27AM +0200, Joerg Roedel wrote: On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote: On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote: Main difference is that vhost works fine with unlocked memory, paging it in on demand. iommu needs to unmap memory when it is swapped out or relocated. So you'd just take the memory map and not pin anything. This way you can reuse the memory map. But no, it doesn't handle the dirty bitmap, so no go. IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. Joerg One of the issues I see with the current patch is that it uses the mlock rlimit to do this pinning. So this wastes the rlimit for an app that did mlockall already, and also consumes this resource transparently, so an app might call mlock on a small buffer and be surprised that it fails. Using mmu notifiers might help? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote: IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. One advantage would be that we can reuse the uio framework for the devices themselves. So an existing app can just program an iommu for DMA and keep using uio for interrupts and access. The driver is called UIO and not U-INTR-MMIO ;-) So I think handling IOMMU mappings belongs there. We can solve the problem of multiple devices-per-domain with an ioctl which allows binding one uio-device to the address-space on another. This would imply switching an iommu domain for a device while it could potentially be doing DMA. No idea whether this can be done in a safe manner. It can. The worst thing that can happen is an io-page-fault. devices might not be able to recover from this. Forcing iommu assignment to be done as a first step seems much saner. If we force it, there is no reason why not doing it implicitly. What you describe below does 3 ioctls for what can be done with 1. We can do something like this then: dev1 = open(); ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to it*/ dev2 = open(); ioctl(dev2, IOMMU_MAP, ...); /* Now dev1 and dev2 are in seperate domains */ ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and assigns it to the same domain as dev1. Domain has a refcount of two now */ Or maybe it destroys mapping for dev1? How do you remember? close(dev1); /* domain refcount goes down to one */ close(dev2); /* domain refcount is zero and domain gets destroyed */ Joerg Also, no way to unshare? That seems limiting. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 01:15:34PM +0300, Michael S. Tsirkin wrote: One of the issues I see with the current patch is that it uses the mlock rlimit to do this pinning. So this wastes the rlimit for an app that did mlockall already, and also consumes this resource transparently, so an app might call mlock on a small buffer and be surprised that it fails. Using mmu notifiers might help? MMU notifiers are problematic because they are designed for situations where we can do demand paging. The invalidate_range_start and invalidate_range_end functions are not only called on munmap, they also run when mprotect is called (in which case we don't want to tear down iommu mappings). So what may happen with mmu notifiers is that we accidentially tear down iommu mappings. With demand-paging this is no problem because the io-ptes could be re-faulted. But that does not work here. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote: It can. The worst thing that can happen is an io-page-fault. devices might not be able to recover from this. With the userspace interface a process can create io-page-faults anyway if it wants. We can't protect us from this. And the process is also responsible to not tear down iommu-mappings that are currently in use. What you describe below does 3 ioctls for what can be done with 1. The second IOMMU_MAP ioctl is just to show that existing mappings would be destroyed if the device is assigned to another address space. Not strictly necessary. So we have two ioctls but save one call to create the iommu-domain. ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and assigns it to the same domain as dev1. Domain has a refcount of two now */ Or maybe it destroys mapping for dev1? How do you remember? Because we express here that dev2 shares the iommu mappings of dev1. Thats easy to remember. Also, no way to unshare? That seems limiting. Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE because that would require a second parameter) ioctl is certainly also required. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote: It can. The worst thing that can happen is an io-page-fault. devices might not be able to recover from this. With the userspace interface a process can create io-page-faults anyway if it wants. We can't protect us from this. We could fail all operations until an iommu is bound. This will help catch bugs with access before setup. We can not do this if a domain is bound by default. And the process is also responsible to not tear down iommu-mappings that are currently in use. What you describe below does 3 ioctls for what can be done with 1. The second IOMMU_MAP ioctl is just to show that existing mappings would be destroyed if the device is assigned to another address space. Not strictly necessary. So we have two ioctls but save one call to create the iommu-domain. With 10 devices you have 10 extra ioctls. ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and assigns it to the same domain as dev1. Domain has a refcount of two now */ Or maybe it destroys mapping for dev1? How do you remember? Because we express here that dev2 shares the iommu mappings of dev1. Thats easy to remember. they both share the mappings. which one gets the iommu destroyed (breaking the device if it is now doing DMA)? Also, no way to unshare? That seems limiting. Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE because that would require a second parameter) ioctl is certainly also required. Joerg -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote: IMO a seperate iommu-userspace driver is a nightmare for a userspace interface. It is just too complicated to use. One advantage would be that we can reuse the uio framework for the devices themselves. So an existing app can just program an iommu for DMA and keep using uio for interrupts and access. The driver is called UIO and not U-INTR-MMIO ;-) So I think handling IOMMU mappings belongs there. Maybe it could be put there but the patch posted did not use uio. And one of the reasons is that uio framework provides for device access and interrupts but not for programming memory mappings. Solutions (besides giving up on uio completely) could include extending the framework in some way (which was tried, but the result was not pretty) or adding a separate driver for iommu and binding to that. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote: With the userspace interface a process can create io-page-faults anyway if it wants. We can't protect us from this. We could fail all operations until an iommu is bound. This will help catch bugs with access before setup. We can not do this if a domain is bound by default. Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. The second IOMMU_MAP ioctl is just to show that existing mappings would be destroyed if the device is assigned to another address space. Not strictly necessary. So we have two ioctls but save one call to create the iommu-domain. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Remember that we still need to be able to provide seperate mappings for each device to support IOMMU emulation for the guest. I think my proposal does not have any extra costs. Because we express here that dev2 shares the iommu mappings of dev1. Thats easy to remember. they both share the mappings. which one gets the iommu destroyed (breaking the device if it is now doing DMA)? As I wrote the domain has a reference count and is destroyed only when it goes down to zero. This does not happen as long as a device is bound to it. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 01:04 PM, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote: On 06/02/2010 12:45 PM, Joerg Roedel wrote: IOMMU mapped memory can not be swapped out because we can't do demand paging on io-page-faults with current devices. We have to pin _all_ userspace memory that is mapped into an IOMMU domain. vhost doesn't pin memory. What I proposed is to describe the memory map using an object (fd), and pass it around to clients that use it: kvm, vhost, vfio. That way you maintain the memory map in a central location and broadcast changes to clients. Only a vfio client would result in memory being pinned. Ah ok, so its only about the database which keeps the mapping information. Yes. It can still work, but the interface needs to be extended to include dirty bitmap logging. Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no dirty-bits in the page-table. And without demand-paging we can't really tell what pages a device has written to. The only choice is to mark all IOMMU-mapped pages dirty as long as they are mapped. The interface would only work for clients which support it: kvm, vhost, and iommu/devices with restartable dma. Note dirty logging is not very interesting for vfio anyway, since you can't live migrate with assigned devices. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote: With the userspace interface a process can create io-page-faults anyway if it wants. We can't protect us from this. We could fail all operations until an iommu is bound. This will help catch bugs with access before setup. We can not do this if a domain is bound by default. Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. It can always corrupt its own memory directly as well :) But that is not a reason not to detect errors if we can, and not to make APIs hard to misuse. The second IOMMU_MAP ioctl is just to show that existing mappings would be destroyed if the device is assigned to another address space. Not strictly necessary. So we have two ioctls but save one call to create the iommu-domain. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. Remember that we still need to be able to provide seperate mappings for each device to support IOMMU emulation for the guest. Generally not true. E.g. guest can enable iommu passthrough or have domain per a group of devices. I think my proposal does not have any extra costs. with my proposal we have 1 ioctl per device + 1 per domain. with yours we have 2 ioctls per device is iommu is shared and 1 if it is not shared. as current apps share iommu it seems to make sense to optimize for that. Because we express here that dev2 shares the iommu mappings of dev1. Thats easy to remember. they both share the mappings. which one gets the iommu destroyed (breaking the device if it is now doing DMA)? As I wrote the domain has a reference count and is destroyed only when it goes down to zero. This does not happen as long as a device is bound to it. Joerg We were talking about UNSHARE ioctl: ioctl(dev1, UNSHARE, dev2) Does it change the domain for dev1 or dev2? If you make a mistake you get a hard to debug bug. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote: Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. It can always corrupt its own memory directly as well :) But that is not a reason not to detect errors if we can, and not to make APIs hard to misuse. Changing the domain of a device while dma can happen is the same type of bug as unmapping potential dma target addresses. We can't catch this kind of misuse. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. Remember that we still need to be able to provide seperate mappings for each device to support IOMMU emulation for the guest. Generally not true. E.g. guest can enable iommu passthrough or have domain per a group of devices. What I meant was that there may me multiple io-addresses spaces necessary for one process. I didn't want to say that every device _needs_ to have its own address space. As I wrote the domain has a reference count and is destroyed only when it goes down to zero. This does not happen as long as a device is bound to it. Joerg We were talking about UNSHARE ioctl: ioctl(dev1, UNSHARE, dev2) Does it change the domain for dev1 or dev2? If you make a mistake you get a hard to debug bug. As I already wrote we would have an UNBIND ioctl which just removes a device from its current domain. UNBIND is better than UNSHARE for exactly the reason you pointed out above. I thought I stated that already. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 03:19 PM, Joerg Roedel wrote: Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. The problem with this is that it is assymetric, dev1 is treated differently from dev[234]. It's an unintuitive API. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote: Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. It can always corrupt its own memory directly as well :) But that is not a reason not to detect errors if we can, and not to make APIs hard to misuse. Changing the domain of a device while dma can happen is the same type of bug as unmapping potential dma target addresses. We can't catch this kind of misuse. you normally need device mapped to start DMA. SHARE makes this bug more likely as you allow switching domains: mmap could be done before switching. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. I thought we had a BIND ioctl? Remember that we still need to be able to provide seperate mappings for each device to support IOMMU emulation for the guest. Generally not true. E.g. guest can enable iommu passthrough or have domain per a group of devices. What I meant was that there may me multiple io-addresses spaces necessary for one process. I didn't want to say that every device _needs_ to have its own address space. As I wrote the domain has a reference count and is destroyed only when it goes down to zero. This does not happen as long as a device is bound to it. Joerg We were talking about UNSHARE ioctl: ioctl(dev1, UNSHARE, dev2) Does it change the domain for dev1 or dev2? If you make a mistake you get a hard to debug bug. As I already wrote we would have an UNBIND ioctl which just removes a device from its current domain. UNBIND is better than UNSHARE for exactly the reason you pointed out above. I thought I stated that already. Joerg You undo SHARE with UNBIND? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote: On 06/02/2010 03:19 PM, Joerg Roedel wrote: Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. The problem with this is that it is assymetric, dev1 is treated differently from dev[234]. It's an unintuitive API. Its by far more unintuitive that a process needs to explicitly bind a device to an iommu domain before it can do anything with it. If its required anyway the binding can happen implicitly. We could allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry. Note that this way of handling userspace iommu mappings is also a lot simpler for most use-cases outside of KVM. If a developer wants to write a userspace driver all it needs to do is: dev = open(); ioctl(dev, MAP, ...); /* use device with mappings */ close(dev); Which is much easier than the need to create a domain explicitly. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 03:34:17PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote: you normally need device mapped to start DMA. SHARE makes this bug more likely as you allow switching domains: mmap could be done before switching. We need to support domain switching anyway for iommu emulation in a guest. So if you consider this to be a problem (I don't) it will not go away with your proposal. dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. I thought we had a BIND ioctl? I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but thats bad naming as you pointed out below. See my statement on this below too. You undo SHARE with UNBIND? Thats bad naming, agreed. Lets keep UNSHARE. Point is, we only need one parameter to do this which removes any ambiguity: ioctl(dev1, UNSHARE); Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 03:50 PM, Joerg Roedel wrote: The problem with this is that it is assymetric, dev1 is treated differently from dev[234]. It's an unintuitive API. Its by far more unintuitive that a process needs to explicitly bind a device to an iommu domain before it can do anything with it. I don't really care about the iommu domain. It's a side effect. The kernel takes care of it. I'm only worried about the API. We have a memory map that is (often) the same for a set of devices. If you were coding a non-kernel interface, how would you code it? struct memory_map; void memory_map_init(struct memory_map *mm, ...); struct device; void device_set_memory_map(struct device *device, struct memory_map *mm); or struct device; void device_init_memory_map(struct device *dev, ...); void device_clone_memory_map(struct device *dev, struct device *other); I wouldn't even think of the second one personally. If its required anyway the binding can happen implicitly. We could allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry. It's still special. You define the memory map only for the first device. You have to make sure dev1 doesn't go away while sharing it. Note that this way of handling userspace iommu mappings is also a lot simpler for most use-cases outside of KVM. If a developer wants to write a userspace driver all it needs to do is: dev = open(); ioctl(dev, MAP, ...); /* use device with mappings */ close(dev); Which is much easier than the need to create a domain explicitly. mm = open() ioctl(mm, MAP, ...) dev = open(); ioctl(dev, BIND, mm); ... close(mm); close(dev); so yes, more work, but once you have multiple devices which come and go dynamically things become simpler. The map object has global lifetime (you can even construct it if you don't assign any devices), the devices attach to it, memory hotplug updates the memory map but doesn't touch devices. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote: On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote: On 06/02/2010 03:19 PM, Joerg Roedel wrote: Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. The problem with this is that it is assymetric, dev1 is treated differently from dev[234]. It's an unintuitive API. Its by far more unintuitive that a process needs to explicitly bind a device to an iommu domain before it can do anything with it. The reason it is more intuitive is because it is harder to get it wrong. If you swap iommu and device in the call, you get BADF so you know you made a mistake. We can even make it work both ways if we wanted to. With ioctl(dev1, BIND, dev2) it breaks silently. If its required anyway the binding can happen implicitly. We could allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry. And then when we assign meaning to it we find that half the apps are broken because they did not call this ioctl. Note that this way of handling userspace iommu mappings is also a lot simpler for most use-cases outside of KVM. If a developer wants to write a userspace driver all it needs to do is: dev = open(); ioctl(dev, MAP, ...); /* use device with mappings */ close(dev); Which is much easier than the need to create a domain explicitly. Joerg This simple scenario ignores all the real-life corner cases. For example, with an explicit iommu open and bind application can naturally detect that: - we have run out of iommu domains - iommu is unsupported - iommu is in use by another, incompatible device - device is in bad state because each is a separate operation, so it is easy to produce meaningful errors. Another interesting thing that a separate iommu device supports is when application A controls the iommu and application B controls the device. This might be good to e.g. improve security (B is run by root, A is unpriveledged and passes commands to/from B over a pipe). This is not possible when same fd is used for iommu and device. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 04:06:21PM +0300, Avi Kivity wrote: On 06/02/2010 03:50 PM, Joerg Roedel wrote: Its by far more unintuitive that a process needs to explicitly bind a device to an iommu domain before it can do anything with it. I don't really care about the iommu domain. It's a side effect. The kernel takes care of it. I'm only worried about the API. The proposed memory-map object is nothing else than a userspace abstraction of an iommu-domain. We have a memory map that is (often) the same for a set of devices. If you were coding a non-kernel interface, how would you code it? struct memory_map; void memory_map_init(struct memory_map *mm, ...); struct device; void device_set_memory_map(struct device *device, struct memory_map *mm); or struct device; void device_init_memory_map(struct device *dev, ...); void device_clone_memory_map(struct device *dev, struct device *other); I wouldn't even think of the second one personally. Right, a kernel-interface would be designed the first way. The IOMMU-API is actually designed in this manner. But I still think we should keep it simpler for userspace. If its required anyway the binding can happen implicitly. We could allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry. It's still special. You define the memory map only for the first device. You have to make sure dev1 doesn't go away while sharing it. Must be a misunderstanding. In my proposal the domain is not owned by one device. It is owned by all devices that share it and will only vanish if all devices that use it are unbound (which happens when the file descriptor is closed, for example). so yes, more work, but once you have multiple devices which come and go dynamically things become simpler. The map object has global lifetime (you can even construct it if you don't assign any devices), the devices attach to it, memory hotplug updates the memory map but doesn't touch devices. I still think a userspace interface should be as simple as possible. But since both ways will work I am not really opposed to Michael's proposal. I just think its overkill for the common non-kvm usecase (a userspace device driver). Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
* Avi Kivity (a...@redhat.com) wrote: The interface would only work for clients which support it: kvm, vhost, and iommu/devices with restartable dma. BTW, there is no such thing as restartable dma. There is a provision in new specs (read: no real hardware) that allows a device to request pages before using them. So it's akin to demand paging, but the demand is an explicit request rather than a page fault. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
* Joerg Roedel (j...@8bytes.org) wrote: On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote: Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. It can always corrupt its own memory directly as well :) But that is not a reason not to detect errors if we can, and not to make APIs hard to misuse. Changing the domain of a device while dma can happen is the same type of bug as unmapping potential dma target addresses. We can't catch this kind of misuse. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. This is not any hot path, so saving an ioctl shouldn't be a consideration. Only important consideration is a good API. I may have lost context here, but the SHARE API is limited to the vfio fd. The BIND API expects a new iommu object. Are there other uses for this object? Tom's current vfio driver exposes a dma mapping interface, would the iommu object expose one as well? Current interface is device specific DMA interface for host device drivers typically mapping in-flight dma buffers, and IOMMU specific interface for assigned devices typically mapping entire virtual address space. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote: * Joerg Roedel (j...@8bytes.org) wrote: On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote: On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote: Even if it is bound to a domain the userspace driver could program the device to do dma to unmapped regions causing io-page-faults. The kernel can't do anything about it. It can always corrupt its own memory directly as well :) But that is not a reason not to detect errors if we can, and not to make APIs hard to misuse. Changing the domain of a device while dma can happen is the same type of bug as unmapping potential dma target addresses. We can't catch this kind of misuse. With 10 devices you have 10 extra ioctls. And this works implicitly with your proposal? Yes. so you do: iommu = open ioctl(dev1, BIND, iommu) ioctl(dev2, BIND, iommu) ioctl(dev3, BIND, iommu) ioctl(dev4, BIND, iommu) No need to add a SHARE ioctl. In my proposal this looks like: dev1 = open(); ioctl(dev2, SHARE, dev1); ioctl(dev3, SHARE, dev1); ioctl(dev4, SHARE, dev1); So we actually save an ioctl. This is not any hot path, so saving an ioctl shouldn't be a consideration. Only important consideration is a good API. I may have lost context here, but the SHARE API is limited to the vfio fd. The BIND API expects a new iommu object. Are there other uses for this object? Tom's current vfio driver exposes a dma mapping interface, would the iommu object expose one as well? Current interface is device specific DMA interface for host device drivers typically mapping in-flight dma buffers, and IOMMU specific interface for assigned devices typically mapping entire virtual address space. Actually, it a domain object - which may be usable among iommus (Joerg?). However, you can't really do the dma mapping with just the domain because every device supports a different size address space as a master, i.e., the dma_mask. And I don't know how kvm would deal with devices with varying dma mask support, or why they'd be in the same domain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Wed, Jun 02, 2010 at 11:09:17AM -0700, Tom Lyon wrote: On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote: This is not any hot path, so saving an ioctl shouldn't be a consideration. Only important consideration is a good API. I may have lost context here, but the SHARE API is limited to the vfio fd. The BIND API expects a new iommu object. Are there other uses for this object? Tom's current vfio driver exposes a dma mapping interface, would the iommu object expose one as well? Current interface is device specific DMA interface for host device drivers typically mapping in-flight dma buffers, and IOMMU specific interface for assigned devices typically mapping entire virtual address space. Actually, it a domain object - which may be usable among iommus (Joerg?). Yes, this 'iommu' thing is would be a domain object. But sharing among iommus is not necessary because the fact that there are multiple iommus in the system is hidden by the iommu drivers. This fact is not even exposed by the iommu-api. This makes protection domains system global. However, you can't really do the dma mapping with just the domain because every device supports a different size address space as a master, i.e., the dma_mask. The dma_mask has to be handled by the device driver. With the iommu-mapping interface the driver can specify the target io-address and has to consider the dma_mask for that too. And I don't know how kvm would deal with devices with varying dma mask support, or why they'd be in the same domain. KVM does not care about these masks. This is the business of the guest device drivers. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote: On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote: On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote: So what I suggested is failing any kind of access until iommu is assigned. So, the kernel driver must be aware of the iommu. In which case it may as well program it. It's a kernel driver anyway. Point is that the *device* driver is better off not programming iommu, this way we do not need to reprogram it for each device. The device driver is in userspace. It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, Jun 01, 2010 at 11:10:45AM +0300, Avi Kivity wrote: On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote: On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote: On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote: So what I suggested is failing any kind of access until iommu is assigned. So, the kernel driver must be aware of the iommu. In which case it may as well program it. It's a kernel driver anyway. Point is that the *device* driver is better off not programming iommu, this way we do not need to reprogram it for each device. The device driver is in userspace. I mean the kernel driver that grants userspace the access. It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. The natural way to do this is to have an iommu driver for programming iommu. This likely means we will have to pass the domain to 'vfio' or uio or whatever the driver that gives userspace the access to device is called, but this is only for security, there's no need to support programming iommu there. And using this design means the uio framework changes required would be minor, so we won't have to duplicate code. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote: It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. iommufd = open(/dev/iommu); ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...) ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) ? If so, I agree. The natural way to do this is to have an iommu driver for programming iommu. This likely means we will have to pass the domain to 'vfio' or uio or whatever the driver that gives userspace the access to device is called, but this is only for security, there's no need to support programming iommu there. And using this design means the uio framework changes required would be minor, so we won't have to duplicate code. Since vfio would be the only driver, there would be no duplication. But a separate object for the iommu mapping is a good thing. Perhaps we can even share it with vhost (without actually using the mmu, since vhost is software only). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote: On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote: It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. iommufd = open(/dev/iommu); ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...) ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) ? Yes. If so, I agree. Good. The natural way to do this is to have an iommu driver for programming iommu. This likely means we will have to pass the domain to 'vfio' or uio or whatever the driver that gives userspace the access to device is called, but this is only for security, there's no need to support programming iommu there. And using this design means the uio framework changes required would be minor, so we won't have to duplicate code. Since vfio would be the only driver, there would be no duplication. But a separate object for the iommu mapping is a good thing. Perhaps we can even share it with vhost (without actually using the mmu, since vhost is software only). Main difference is that vhost works fine with unlocked memory, paging it in on demand. iommu needs to unmap memory when it is swapped out or relocated. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote: Since vfio would be the only driver, there would be no duplication. But a separate object for the iommu mapping is a good thing. Perhaps we can even share it with vhost (without actually using the mmu, since vhost is software only). Main difference is that vhost works fine with unlocked memory, paging it in on demand. iommu needs to unmap memory when it is swapped out or relocated. So you'd just take the memory map and not pin anything. This way you can reuse the memory map. But no, it doesn't handle the dirty bitmap, so no go. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote: On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote: On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote: It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. iommufd = open(/dev/iommu); ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...) ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) ? Yes. If so, I agree. Good. I'm not really opposed to multiple devices per domain, but let me point out how I ended up here. First, the driver has two ways of mapping pages, one based on the iommu api and one based on the dma_map_sg api. With the latter, the system already allocates a domain per device and there's no way to control it. This was presumably done to help isolation between drivers. If there are multiple drivers in the user level, do we not want the same isoation to apply to them? Also, domains are not a very scarce resource - my little core i5 has 256, and the intel architecture goes to 64K. And then there's the fact that it is possible to have multiple disjoint iommus on a system, so it may not even be possible to bring 2 devices under one domain. Given all that, I am inclined to leave it alone until someone has a real problem. Note that not sharing iommu domains doesn't mean you can't share device memory, just that you have to do multiple mappings -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Monday 31 May 2010 10:17:35 am Alan Cox wrote: Does look like it needs a locking audit, some memory and error checks reviewing and some further review of the ioctl security and overflows/trusted values. Yes. Thanks for the detailed look. Rather a nice way of attacking the user space PCI problem. And thanks for that! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 12:26 AM, Tom Lyon wrote: I'm not really opposed to multiple devices per domain, but let me point out how I ended up here. First, the driver has two ways of mapping pages, one based on the iommu api and one based on the dma_map_sg api. With the latter, the system already allocates a domain per device and there's no way to control it. This was presumably done to help isolation between drivers. If there are multiple drivers in the user level, do we not want the same isoation to apply to them? In the case of kvm, we don't want isolation between devices, because that doesn't happen on real hardware. So if the guest programs devices to dma to each other, we want that to succeed. Also, domains are not a very scarce resource - my little core i5 has 256, and the intel architecture goes to 64K. But there is a 0.2% of mapped memory per domain cost for the page tables. For the kvm use case, that could be significant since a guest may have large amounts of memory and large numbers of assigned devices. And then there's the fact that it is possible to have multiple disjoint iommus on a system, so it may not even be possible to bring 2 devices under one domain. That's indeed a deficiency. Given all that, I am inclined to leave it alone until someone has a real problem. Note that not sharing iommu domains doesn't mean you can't share device memory, just that you have to do multiple mappings I think we do have a real problem (though a mild one). The only issue I see with deferring the solution is that the API becomes gnarly; both the kernel and userspace will have to support both APIs forever. Perhaps we can implement the new API but defer the actual sharing until later, don't know how much work this saves. Or Alex/Chris can pitch in and help. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote: On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote: It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. iommufd = open(/dev/iommu); ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...) ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) It seems part of the annoyance of the current KVM device assignment is that we have multiple files open, we mmap here, read there, write over there, maybe, if it's not emulated. I quite like Tom's approach that we have one stop shopping with /dev/vfion, including config space emulation so each driver doesn't have to try to write their own. So continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU ioctl to vfio so that after we setup one device we can bind the next to the same domain? Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote: On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote: On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote: It can't program the iommu. What the patch proposes is that userspace tells vfio about the needed mappings, and vfio programs the iommu. There seems to be some misunderstanding. The userspace interface proposed forces a separate domain per device and forces userspace to repeat iommu programming for each device. We are better off sharing a domain between devices and programming the iommu once. iommufd = open(/dev/iommu); ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...) ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) It seems part of the annoyance of the current KVM device assignment is that we have multiple files open, we mmap here, read there, write over there, maybe, if it's not emulated. I quite like Tom's approach that we have one stop shopping with /dev/vfion, including config space emulation so each driver doesn't have to try to write their own. So continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU ioctl to vfio so that after we setup one device we can bind the next to the same domain? This is just what I was thinking. But rather than a get/set, just use two fds. ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2); This may fail if there are really 2 different IOMMUs, so user code must be prepared for failure, In addition, this is strictlyupwards compatible with what is there now, so maybe we can add it later. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 07:59 AM, Tom Lyon wrote: This is just what I was thinking. But rather than a get/set, just use two fds. ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2); This may fail if there are really 2 different IOMMUs, so user code must be prepared for failure, In addition, this is strictlyupwards compatible with what is there now, so maybe we can add it later. What happens if one of the fds is later closed? I don't like this conceptually. There is a 1:n relationship between the memory map and the devices. Ignoring it will cause the API to have warts. It's more straightforward to have an object to represent the memory mapping (and talk to the iommus), and have devices bind to this object. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
* Avi Kivity (a...@redhat.com) wrote: On 06/02/2010 12:26 AM, Tom Lyon wrote: I'm not really opposed to multiple devices per domain, but let me point out how I ended up here. First, the driver has two ways of mapping pages, one based on the iommu api and one based on the dma_map_sg api. With the latter, the system already allocates a domain per device and there's no way to control it. This was presumably done to help isolation between drivers. If there are multiple drivers in the user level, do we not want the same isoation to apply to them? In the case of kvm, we don't want isolation between devices, because that doesn't happen on real hardware. Sure it does. That's exactly what happens when there's an iommu involved with bare metal. So if the guest programs devices to dma to each other, we want that to succeed. And it will as long as ATS is enabled (this is a basic requirement for PCIe peer-to-peer traffic to succeed with an iommu involved on bare metal). That's how things currently are, i.e. we put all devices belonging to a single guest in the same domain. However, it can be useful to put each device belonging to a guest in a unique domain. Especially as qemu grows support for iommu emulation, and guest OSes begin to understand how to use a hw iommu. Also, domains are not a very scarce resource - my little core i5 has 256, and the intel architecture goes to 64K. But there is a 0.2% of mapped memory per domain cost for the page tables. For the kvm use case, that could be significant since a guest may have large amounts of memory and large numbers of assigned devices. And then there's the fact that it is possible to have multiple disjoint iommus on a system, so it may not even be possible to bring 2 devices under one domain. That's indeed a deficiency. Not sure it's a deficiency. Typically to share page table mappings across multiple iommu's you just have to do update/invalidate to each hw iommu that is sharing the mapping. Alternatively, you can use more memory and build/maintain identical mappings (as Tom alludes to below). Given all that, I am inclined to leave it alone until someone has a real problem. Note that not sharing iommu domains doesn't mean you can't share device memory, just that you have to do multiple mappings I think we do have a real problem (though a mild one). The only issue I see with deferring the solution is that the API becomes gnarly; both the kernel and userspace will have to support both APIs forever. Perhaps we can implement the new API but defer the actual sharing until later, don't know how much work this saves. Or Alex/Chris can pitch in and help. It really shouldn't be that complicated to create the API to allow for flexible device - domain mappings, so I agree, makes sense to do it right up front. thanks, -chris -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 06/02/2010 08:29 AM, Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: On 06/02/2010 12:26 AM, Tom Lyon wrote: I'm not really opposed to multiple devices per domain, but let me point out how I ended up here. First, the driver has two ways of mapping pages, one based on the iommu api and one based on the dma_map_sg api. With the latter, the system already allocates a domain per device and there's no way to control it. This was presumably done to help isolation between drivers. If there are multiple drivers in the user level, do we not want the same isoation to apply to them? In the case of kvm, we don't want isolation between devices, because that doesn't happen on real hardware. Sure it does. That's exactly what happens when there's an iommu involved with bare metal. But we are emulating a machine without an iommu. When we emulate a machine with an iommu, then yes, we'll want to use as many domains as the guest does. So if the guest programs devices to dma to each other, we want that to succeed. And it will as long as ATS is enabled (this is a basic requirement for PCIe peer-to-peer traffic to succeed with an iommu involved on bare metal). That's how things currently are, i.e. we put all devices belonging to a single guest in the same domain. However, it can be useful to put each device belonging to a guest in a unique domain. Especially as qemu grows support for iommu emulation, and guest OSes begin to understand how to use a hw iommu. Right, we need to keep flexibility. And then there's the fact that it is possible to have multiple disjoint iommus on a system, so it may not even be possible to bring 2 devices under one domain. That's indeed a deficiency. Not sure it's a deficiency. Typically to share page table mappings across multiple iommu's you just have to do update/invalidate to each hw iommu that is sharing the mapping. Alternatively, you can use more memory and build/maintain identical mappings (as Tom alludes to below). Sharing the page tables is just an optimization, I was worried about devices in separate domains not talking to each other. if ATS fixes that, great. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote: So what I suggested is failing any kind of access until iommu is assigned. So, the kernel driver must be aware of the iommu. In which case it may as well program it. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
+/* + * Map usr buffer at specific IO virtual address + */ +static int vfio_dma_map_iova( + mlp = kzalloc(sizeof *mlp, GFP_KERNEL); Not good at that point. I think you need to allocate it first, error if it can't be allocated and then do the work and free it on error ? + mlp = kzalloc(sizeof *mlp, GFP_KERNEL); + mlp-pages = pages; Ditto +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg) +{ + struct pci_dev *pdev = vdev-pdev; + struct eventfd_ctx *ctx; + int ret = 0; + int i; + int fd; + + vdev-msix = kzalloc(nvec * sizeof(struct msix_entry), + GFP_KERNEL); + vdev-ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *), + GFP_KERNEL); These don't seem to get freed on the error path - or indeed protected against being allocated twice (eg two parallel ioctls ?) + case VFIO_DMA_MAP_ANYWHERE: + case VFIO_DMA_MAP_IOVA: + if (copy_from_user(dm, uarg, sizeof dm)) + return -EFAULT; + ret = vfio_dma_map_common(listener, cmd, dm); + if (!ret copy_to_user(uarg, dm, sizeof dm)) So the vfio_dma_map is untrusted. That seems to be checked ok later but the dma_map_common code then plays in current-mm- without apparently holding any locks to stop the values getting corrupted by a parallel mlock ? Actually no I take that back dmp-size is 64bit So npage can end up with values like 0x and cause 32bit boxes to go kerblam + + case VFIO_EVENTFD_IRQ: + if (copy_from_user(fd, uarg, sizeof fd)) + return -EFAULT; + if (vdev-ev_irq) + eventfd_ctx_put(vdev-ev_irq); These paths need locking - suppose two EVENTFD irq ioctls occur at once (in general these paths seem not to be covered) + case VFIO_BAR_LEN: + if (copy_from_user(bar, uarg, sizeof bar)) + return -EFAULT; + if (bar 0 || bar PCI_ROM_RESOURCE) + return -EINVAL; + bar = pci_resource_len(pdev, bar); + if (copy_to_user(uarg, bar, sizeof bar)) + return -EFAULT; How does this all work out if the device is a bridge ? + pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, line); + if (line == 0) + goto out; That may produce some interestingly wrong answers. Firstly the platform has interrupt abstraction so dev-irq may not match PCI_INTERRUPT_LINE, secondly you have devices that report their IRQ via other paths as per spec (notably IDE class devices in non-native mode) So that would also want extra checks. + pci_read_config_word(pdev, PCI_COMMAND, orig); + ret = orig PCI_COMMAND_MASTER; + if (!ret) { + new = orig | PCI_COMMAND_MASTER; + pci_write_config_word(pdev, PCI_COMMAND, new); + pci_read_config_word(pdev, PCI_COMMAND, new); + ret = new PCI_COMMAND_MASTER; + pci_write_config_word(pdev, PCI_COMMAND, orig); The master bit on some devices can be turned on but not off. Not sure it matters here. + vdev-pdev = pdev; Probably best to take/drop a reference. Not needed if you can prove your last use is before the end of the remove path though. Does look like it needs a locking audit, some memory and error checks reviewing and some further review of the ioctl security and overflows/trusted values. Rather a nice way of attacking the user space PCI problem. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote: On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote: So what I suggested is failing any kind of access until iommu is assigned. So, the kernel driver must be aware of the iommu. In which case it may as well program it. It's a kernel driver anyway. Point is that the *device* driver is better off not programming iommu, this way we do not need to reprogram it for each device. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote: On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote: The VFIO driver is used to allow privileged AND non-privileged processes to implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe devices. Signed-off-by: Tom Lyonp...@cisco.com --- This patch is the evolution of code which was first proposed as a patch to uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely out of the uio framework, and things seem much cleaner. Of course, there is a lot of functional overlap with uio, but the previous version just seemed like a giant mode switch in the uio code that did not lead to clarity for either the new or old code. IMO this was because this driver does two things: programming iommu and handling interrupts. uio does interrupt handling. We could have moved iommu / DMA programming to a separate driver, and have uio work with it. This would solve limitation of the current driver that is needs an iommu domain per device. How do we enforce security then? We need to ensure that unprivileged users can only use the device with an iommu. [a pony for avi...] The major new functionality in this version is the ability to deal with PCI config space accesses (through read write calls) - but includes table driven code to determine whats safe to write and what is not. I don't really see why this is helpful: a driver written corrrectly will not access these addresses, and we need an iommu anyway to protect us against a drivers. Haven't reviewed the code (yet) but things like the BARs, MSI, and interrupt disable need to be protected from the guest regardless of the iommu. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote: On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote: On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote: The VFIO driver is used to allow privileged AND non-privileged processes to implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe devices. Signed-off-by: Tom Lyonp...@cisco.com --- This patch is the evolution of code which was first proposed as a patch to uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely out of the uio framework, and things seem much cleaner. Of course, there is a lot of functional overlap with uio, but the previous version just seemed like a giant mode switch in the uio code that did not lead to clarity for either the new or old code. IMO this was because this driver does two things: programming iommu and handling interrupts. uio does interrupt handling. We could have moved iommu / DMA programming to a separate driver, and have uio work with it. This would solve limitation of the current driver that is needs an iommu domain per device. How do we enforce security then? We need to ensure that unprivileged users can only use the device with an iommu. Force assigning to iommu before we allow any other operation? [a pony for avi...] The major new functionality in this version is the ability to deal with PCI config space accesses (through read write calls) - but includes table driven code to determine whats safe to write and what is not. I don't really see why this is helpful: a driver written corrrectly will not access these addresses, and we need an iommu anyway to protect us against a drivers. Haven't reviewed the code (yet) but things like the BARs, MSI, and interrupt disable need to be protected from the guest regardless of the iommu. Yes but userspace can do this. As long as userspace can not crash the kernel, no reason to put this policy into kernel. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 05/29/2010 02:07 AM, Tom Lyon wrote: The VFIO driver is used to allow privileged AND non-privileged processes to implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe devices. + +Why is this interesting? Some applications, especially in the high performance +computing field, need access to hardware functions with as little overhead as +possible. Examples are in network adapters (typically non tcp/ip based) and +in compute accelerators - i.e., array processors, FPGA processors, etc. +Previous to the VFIO drivers these apps would need either a kernel-level +driver (with corrsponding overheads), or else root permissions to directly +access the hardware. The VFIO driver allows generic access to the hardware +from non-privileged apps IF the hardware is well-behaved enough for this +to be safe. + +Any SR-IOV virtual function meets the VFIO definition of well-behaved, but +there are many other non-IOV PCI devices which also meet the defintion. +Elements of this definition are: +- The size of any memory BARs to be mmap'ed into the user process space must be + a multiple of the system page size. You can relax this. - smaller than page size can be mapped if the rest of the page unused and if the platform tolerates writes to unused areas - if the rest of the page is used, we can relocate the BAR - otherwise, we can prevent mmap() but still allow mediated access via a syscall +- If MSI-X interrupts are used, the device driver must not attempt to mmap or + write the MSI-X vector area. We can allow mediated access (that's what qemu-kvm does). I guess the ioctls for setting up msi interrupts are equivalent to this mediated access. (later I see you do provide mediated access via pwrite - please confirm) +- The device must not use the PCI configuration space in any non-standard way, + i.e., the user level driver will be permitted only to read and write standard + fields of the PCI config space, and only if those fields cannot cause harm to + the system. In addition, some fields are virtualized, so that the user + driver can read/write them like a kernel driver, but they do not affect the + real device. What's wrong with nonstandard fields? + +Even with these restrictions, there are bound to be devices which are unsafe +for user level use - it is still up to the system admin to decide whether to +grant access to the device. When the vfio module is loaded, it will have +access to no devices until the desired PCI devices are bound to the driver. +First, make sure the devices are not bound to another kernel driver. You can +unload that driver if you wish to unbind all its devices, or else enter the +driver's sysfs directory, and unbind a specific device: + cd /sys/bus/pci/drivers/drivername + echo :06:02.00 unbind +(The :06:02.00 is a fully qualified PCI device name - different for each +device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and +write the PCI device type of the target device to the new_id file: + echo 8086 10ca new_id +(8086 10ca are the vendor and device type for the Intel 82576 virtual function +devices). A /dev/vfioN entry will be created for each device bound. The final +step is to grant users permission by changing the mode and/or owner of the /dev +entry - chmod 666 /dev/vfio0. What if I have several such devices? Isn't it better to bind by topoloy (device address)? + +Reads Writes: + +The user driver will typically use mmap to access the memory BAR(s) of a +device; the I/O BARs and the PCI config space may be accessed through normal +read and write system calls. Only 1 file descriptor is needed for all driver +functions -- the desired BAR for I/O, memory, or config space is indicated via +high-order bits of the file offset. My preference would be one fd per BAR, but that's a matter of personal taste. For instance, the following implements a +write to the PCI config space: + + #includelinux/vfio.h + void pci_write_config_word(int pci_fd, u16 off, u16 wd) + { + off_t cfg_off = VFIO_PCI_CONFIG_OFF + off; + + if (pwrite(pci_fd,wd, 2, cfg_off) != 2) + perror(pwrite config_dword); + } + Nice, has the benefit of avoiding endianness issues in the interface. +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided +in vfio.h to convert bar numbers to file offsets and vice-versa. + +Interrupts: + +Device interrupts are translated by the vfio driver into input events on event +notification file descriptors created by the eventfd system call. The user +program must one or more event descriptors and pass them to the vfio driver +via ioctls to arrange for the interrupt mapping: +1. + efd = eventfd(0, 0); + ioctl(vfio_fd, VFIO_EVENTFD_IRQ,efd); + This provides an eventfd for traditional IRQ interrupts. + IRQs will be
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Sun, May 30, 2010 at 04:01:53PM +0300, Avi Kivity wrote: On 05/30/2010 03:49 PM, Michael S. Tsirkin wrote: On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote: On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote: On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote: The VFIO driver is used to allow privileged AND non-privileged processes to implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe devices. Signed-off-by: Tom Lyonp...@cisco.com --- This patch is the evolution of code which was first proposed as a patch to uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely out of the uio framework, and things seem much cleaner. Of course, there is a lot of functional overlap with uio, but the previous version just seemed like a giant mode switch in the uio code that did not lead to clarity for either the new or old code. IMO this was because this driver does two things: programming iommu and handling interrupts. uio does interrupt handling. We could have moved iommu / DMA programming to a separate driver, and have uio work with it. This would solve limitation of the current driver that is needs an iommu domain per device. How do we enforce security then? We need to ensure that unprivileged users can only use the device with an iommu. Force assigning to iommu before we allow any other operation? That means the driver must be aware of the iommu. The userspace driver? Yes. And It is a good thing to be explicit there anyway, since this lets userspace map a non-contigious virtual address list into a contiguous bus address range. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Saturday 29 May 2010, Tom Lyon wrote: +/* + * Structure for DMA mapping of user buffers + * vaddr, dmaaddr, and size must all be page aligned + * buffer may only be larger than 1 page if (a) there is + * an iommu in the system, or (b) buffer is part of a huge page + */ +struct vfio_dma_map { + __u64 vaddr; /* process virtual addr */ + __u64 dmaaddr;/* desired and/or returned dma address */ + __u64 size; /* size in bytes */ + int rdwr; /* bool: 0 for r/o; 1 for r/w */ +}; Please add a 32 bit padding word at the end of this, otherwise the size of the data structure is incompatible between 32 x86 applications and 64 bit kernels. Arnd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On 05/29/2010 02:55 PM, Arnd Bergmann wrote: On Saturday 29 May 2010, Tom Lyon wrote: +/* + * Structure for DMA mapping of user buffers + * vaddr, dmaaddr, and size must all be page aligned + * buffer may only be larger than 1 page if (a) there is + * an iommu in the system, or (b) buffer is part of a huge page + */ +struct vfio_dma_map { + __u64 vaddr; /* process virtual addr */ + __u64 dmaaddr;/* desired and/or returned dma address */ + __u64 size; /* size in bytes */ + int rdwr; /* bool: 0 for r/o; 1 for r/w */ +}; Please add a 32 bit padding word at the end of this, otherwise the size of the data structure is incompatible between 32 x86 applications and 64 bit kernels. Might as well call it 'flags' and reserve a bit more space (keeping 64-bit aligned size) for future expansion. rdwr can be folded into it. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
Hi, On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote: Missing diffstat -p1 -w 70: Documentation/vfio.txt | 176 MAINTAINERS|7 drivers/Kconfig|2 drivers/Makefile |1 drivers/vfio/Kconfig |9 drivers/vfio/Makefile |5 drivers/vfio/vfio_dma.c| 372 ++ drivers/vfio/vfio_intrs.c | 189 + drivers/vfio/vfio_main.c | 627 +++ drivers/vfio/vfio_pci_config.c | 554 +++ drivers/vfio/vfio_rdwr.c | 147 +++ drivers/vfio/vfio_sysfs.c | 153 +++ include/linux/vfio.h | 193 + 13 files changed, 2435 insertions(+) which shows that the patch is missing an update to Documentation/ioctl/ioctl-number.txt for ioctl code ';'. Please add that. diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig --- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.0 -0800 +++ vfio-linux-2.6.34/drivers/vfio/Kconfig2010-05-27 17:07:25.0 -0700 @@ -0,0 +1,9 @@ +menuconfig VFIO + tristate Non-Priv User Space PCI drivers Non-privileged + depends on PCI + help + Driver to allow advanced user space drivers for PCI, PCI-X, + and PCIe devices. Requires IOMMU to allow non-privilged non-privileged + processes to directly control the PCI devices. + + If you don't know what to do here, say N. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote: diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt --- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.0 -0800 +++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-05-28 14:03:05.0 -0700 @@ -0,0 +1,176 @@ +--- +The VFIO driver is used to allow privileged AND non-privileged processes to +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe +devices. + +Why is this interesting? Some applications, especially in the high performance +computing field, need access to hardware functions with as little overhead as +possible. Examples are in network adapters (typically non tcp/ip based) and non-TCP/IP-based) +in compute accelerators - i.e., array processors, FPGA processors, etc. +Previous to the VFIO drivers these apps would need either a kernel-level +driver (with corrsponding overheads), or else root permissions to directly corresponding +access the hardware. The VFIO driver allows generic access to the hardware +from non-privileged apps IF the hardware is well-behaved enough for this +to be safe. + +While there have long been ways to implement user-level drivers using specific +corresponding drivers in the kernel, it was not until the introduction of the +UIO driver framework, and the uio_pci_generic driver that one could have a +generic kernel component supporting many types of user level drivers. However, +even with the uio_pci_generic driver, processes implementing the user level +drivers had to be trusted - they could do dangerous manipulation of DMA +addreses and were required to be root to write PCI configuration space +registers. + +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide +new hardware capabilities which the VFIO solution exploits to allow non-root +user level drivers. The main role of the IOMMU is to ensure that DMA accesses +from devices go only to the appropriate memory locations, this allows VFIO to locations; +ensure that user level drivers do not corrupt inappropriate memory. PCI I/O +virtualization (SR-IOV) was defined to allow pass-through of virtual devices +to guest virtual machines. VFIO in essence implements pass-through of devices +to user processes, not virtual machines. SR-IOV devices implement a +traditional PCI device (the physical function) and a dynamic number of special +PCI devices (virtual functions) whose feature set is somewhat restricted - in +order to allow the operating system or virtual machine monitor to ensure the +safe operation of the system. + +Any SR-IOV virtual function meets the VFIO definition of well-behaved, but +there are many other non-IOV PCI devices which also meet the defintion. +Elements of this definition are: +- The size of any memory BARs to be mmap'ed into the user process space must be + a multiple of the system page size. +- If MSI-X interrupts are used, the device driver must not attempt to mmap or + write the MSI-X vector area. +- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI + revision 2.3 to allow its interrupts to be masked in a generic way. +- The device must not use the PCI configuration space in any non-standard way, + i.e., the user level driver will be permitted only to read and write standard + fields of the PCI config space, and only if those fields cannot cause harm to + the system. In addition, some fields are virtualized, so that the user + driver can read/write them like a kernel driver, but they do not affect the + real device. +- For now, there is no support for user access to the PCIe and PCI-X extended + capabilities configuration space. + +Even with these restrictions, there are bound to be devices which are unsafe +for user level use - it is still up to the system admin to decide whether to +grant access to the device. When the vfio module is loaded, it will have +access to no devices until the desired PCI devices are bound to the driver. +First, make sure the devices are not bound to another kernel driver. You can +unload that driver if you wish to unbind all its devices, or else enter the +driver's sysfs directory, and unbind a specific device: + cd /sys/bus/pci/drivers/drivername + echo :06:02.00 unbind +(The :06:02.00 is a fully qualified PCI device name - different for each +device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and +write the PCI device type of the target device to the new_id file: + echo 8086 10ca new_id +(8086 10ca are the vendor and device type for the Intel 82576 virtual function +devices). A /dev/vfioN entry will be created for each device bound.