Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse wrote:
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few useful things. We generally did a parallel 200 thread TCP_RR test, as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't need to do an IOTLB flush. It's a single 64-bit write of the PTE. All else is software overhead.

Agreed!

> (I'm deliberately ignoring the stupid chipsets where DMA page tables aren't cache coherent and we need a clflush too. They make me too sad.)

How much does Linux need to care about such chipsets? Can we argue that they get very sad performance and so be it?

> > This study preceded the recent patch to break the locks into pools ("Break up monolithic iommu table/lock into finer graularity pools and lock"). There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf) has some promising approaches. The magazine avoids the RB tree issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the allocator in lib/iommu-common.c (and thus getting the benefit of the finer granularity pools).

Sounds promising! Is 4 parallel arenas enough? We'll try to play with that here. I think lazy_flush leaves dangling references.

> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free page table cleanup algorithm could do well.
> When you say 'dynamic 1:1 mapping', is that the same thing that's been suggested elsewhere — avoiding the IOVA allocator completely by using a virtual address which *matches* the physical address, if that virtual address is available?

Yes. We munge in 2 upper address bits in the IOVA to encode read and write permissions as well.

> Simply cmpxchg on the PTE itself, and if it was already set *then* we fall back to the allocator, obviously configured to allocate from a range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in the PTE for a use count so a given page could be in more than one mapping before we call back to the IOVA allocator.

Aren't bits 63:52 available at all levels?

> > There are correctness fixes and optimizations left in the invalidation path: I want strict-ish semantics (a page doesn't go back into the freelist until the last IOTLB/IOMMU TLB entry is invalidated) with good performance, and that seems to imply that an additional page reference should be gotten at dma_map time and put back at the completion of the IOMMU flush routine. (This is worthy of much discussion.)
>
> We already do something like this for page table pages which are freed by an unmap, FWIW.

As I understood the code when I last looked, this was true only if a single unmap operation covered an entire table's worth (2MByte, or 1GByte, etc.) of mappings. The caffeine hasn't hit yet, though, so I can't even begin to dive into the new calls into mm.c.

> > Additionally, we can find ways to optimize the flush routine by realizing that if we have frequent maps and unmaps, it may be because the device creates and destroys buffers a lot; these kinds of workloads use an IOVA for one event and then never come back. Maybe TLBs don't do much good and we could just flush the entire IOMMU TLB [and ATS cache] for that BDF.
> That would be a very interesting thing to look at. Although it would be nice if we had a better way to measure the performance impact of IOTLB misses — currently we don't have a lot of visibility at all.

All benchmarks are lies. But we intend to run internal workloads as well as well-agreed loads and see how things go.

> > 1: We verified that the IOMMU costs are almost entirely software overheads by forcing software 1:1 mode, where we create page tables for all physical addresses. We tested using leaf nodes of size 4KB, of 2MB, and of 1GB. In all cases, there is zero runtime maintenance of the page tables, and no IOMMU invalidations. We did piles of DMA maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM addresses. At 4KB page size, we could see some bandwidth slowdown, but at 2MB and 1GB, there was < 1% performance loss as compared with IOMMU off.
>
> Was this with ATS on or off? With ATS, the cost of the page walk can be amortised in some cases — you can look up the physical address *before* you are ready to actually start the DMA to it, and don't take that latency at the time you're actually moving the data.
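Ben's remark above about munging two upper address bits into the IOVA to encode read and write permissions can be illustrated with a small user-space sketch. The bit positions and helper names here are hypothetical, chosen only to show the idea, not taken from any real driver:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: fold DMA permission flags into two otherwise
 * unused upper bits of the IOVA, so the same physical page can appear
 * as distinct read-only / write-only / read-write mappings without
 * consulting an allocator. Bit positions are assumptions. */
#define IOVA_PERM_SHIFT 46ULL
#define IOVA_PERM_READ  (1ULL << (IOVA_PERM_SHIFT + 0))
#define IOVA_PERM_WRITE (1ULL << (IOVA_PERM_SHIFT + 1))
#define IOVA_PERM_MASK  (IOVA_PERM_READ | IOVA_PERM_WRITE)

/* Build the device-visible address: physical address plus permission bits. */
static uint64_t iova_encode(uint64_t phys, int read, int write)
{
    uint64_t iova = phys;

    if (read)
        iova |= IOVA_PERM_READ;
    if (write)
        iova |= IOVA_PERM_WRITE;
    return iova;
}

/* Strip the permission bits to recover the physical address. */
static uint64_t iova_to_phys(uint64_t iova)
{
    return iova & ~IOVA_PERM_MASK;
}
```

A consequence of this encoding is that a read-only and a read-write mapping of the same page land on different IOVAs, so they occupy different PTE slots.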
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few useful things. We generally did a parallel 200 thread TCP_RR test, as this churns through mappings quickly and uses all available cores.
>
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't need to do an IOTLB flush. It's a single 64-bit write of the PTE. All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables aren't cache coherent and we need a clflush too. They make me too sad.)

> This study preceded the recent patch to break the locks into pools ("Break up monolithic iommu table/lock into finer graularity pools and lock"). There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
>
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf) has some promising approaches. The magazine avoids the RB tree issue.

I'm thinking of ditching the RB tree altogether and switching to the allocator in lib/iommu-common.c (and thus getting the benefit of the finer granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been suggested elsewhere — avoiding the IOVA allocator completely by using a virtual address which *matches* the physical address, if that virtual address is available? Simply cmpxchg on the PTE itself, and if it was already set *then* we fall back to the allocator, obviously configured to allocate from a range *higher* than the available physical memory.
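The "cmpxchg on the PTE itself" scheme can be sketched in user-space C. The flat toy page table, `pte_for_iova()`, and the bump allocator for the above-RAM range are stand-ins for illustration, not the intel-iommu driver's real structures:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy page table: one flat array of PTEs over 4 KiB pages. A real
 * driver would walk the multi-level IOMMU page table instead. */
#define NR_PTES 1024
static _Atomic uint64_t ptes[NR_PTES];

/* Assumed cursor for the fallback range "above physical memory";
 * a real implementation would use a proper (locked or pooled) allocator. */
static uint64_t next_high = 512;

static _Atomic uint64_t *pte_for_iova(uint64_t iova)
{
    return &ptes[(iova >> 12) % NR_PTES];
}

static uint64_t map_dynamic_identity(uint64_t phys, uint64_t pte_val)
{
    uint64_t expected = 0;

    /* One cmpxchg claims the identity slot; no allocator lock taken.
     * Success means the device sees IOVA == physical address. */
    if (atomic_compare_exchange_strong(pte_for_iova(phys), &expected, pte_val))
        return phys;

    /* Collision: the slot is in use, so fall back to allocating an
     * IOVA from the range above all physical memory. */
    uint64_t iova = (next_high++) << 12;
    atomic_store(pte_for_iova(iova), pte_val);
    return iova;
}
```

The fast path touches only the PTE itself, which is what makes the approach mostly lock-free; all contention is pushed onto the (presumably rare) fallback path.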
Jörg has been looking at this too, and was even trying to find space in the PTE for a use count so a given page could be in more than one mapping before we call back to the IOVA allocator.

> There are correctness fixes and optimizations left in the invalidation path: I want strict-ish semantics (a page doesn't go back into the freelist until the last IOTLB/IOMMU TLB entry is invalidated) with good performance, and that seems to imply that an additional page reference should be gotten at dma_map time and put back at the completion of the IOMMU flush routine. (This is worthy of much discussion.)

We already do something like this for page table pages which are freed by an unmap, FWIW.

> Additionally, we can find ways to optimize the flush routine by realizing that if we have frequent maps and unmaps, it may be because the device creates and destroys buffers a lot; these kinds of workloads use an IOVA for one event and then never come back. Maybe TLBs don't do much good and we could just flush the entire IOMMU TLB [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be nice if we had a better way to measure the performance impact of IOTLB misses — currently we don't have a lot of visibility at all.

> 1: We verified that the IOMMU costs are almost entirely software overheads by forcing software 1:1 mode, where we create page tables for all physical addresses. We tested using leaf nodes of size 4KB, of 2MB, and of 1GB. In all cases, there is zero runtime maintenance of the page tables, and no IOMMU invalidations. We did piles of DMA maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM addresses. At 4KB page size, we could see some bandwidth slowdown, but at 2MB and 1GB, there was < 1% performance loss as compared with IOMMU off.

Was this with ATS on or off?
With ATS, the cost of the page walk can be amortised in some cases — you can look up the physical address *before* you are ready to actually start the DMA to it, and don't take that latency at the time you're actually moving the data.

--
David Woodhouse
Open Source Technology Centre
david.woodho...@intel.com
Intel Corporation
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
We looked into Intel IOMMU performance a while ago and learned a few useful things. We generally did a parallel 200 thread TCP_RR test, as this churns through mappings quickly and uses all available cores.

First, the main bottleneck was software performance[1]. This study preceded the recent patch to break the locks into pools ("Break up monolithic iommu table/lock into finer graularity pools and lock"). There were several points of lock contention:

- The RB tree is per device (and in the network test, there's one device). Every dma_map and unmap holds the lock.

- The RB tree lock is held during invalidations as well. There's a 250-entry queue for invalidations that doesn't do any batching intelligence (for example, promote to larger-range invalidations, flush entire device, etc). RB tree locks may be held while waiting for invalidation drains. Invalidations have even worse behavior with ATS enabled for a given device.

- The RB tree has one entry per dma_map call (that entry is deleted by the corresponding dma_unmap). If we had merged all adjacent entries when we could, we would not have lost any information that's actually used by the code today. (There could be a check that a dma_unmap actually covers the entire region that was mapped, but there isn't.) At boot (without network traffic), two common NICs' drivers show tens of thousands of static dma_maps that never go away; this means the RB tree is ~14-16 levels deep. An rbtree walk (holding that big lock) has a 14-16 level pointer chase through mostly cache-cold entries.

I wrote a modification to the RB tree handling that merges nodes that represent abutting IOVA ranges (and unmerges them on dma_unmap), and the same drivers created around 7 unique entries. Steady state grew to a few hundred and maybe a thousand, but the fragmentation didn't get worse than that. This optimization got about a third of the performance back.
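The abutting-range merge described above might look roughly like the sketch below; `struct iova_node` and `try_merge()` are stand-ins for the driver's rbtree entry and insertion path, not actual kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the per-dma_map rbtree entry: an inclusive range of
 * page frame numbers [pfn_lo, pfn_hi]. */
struct iova_node {
    uint64_t pfn_lo;
    uint64_t pfn_hi;
};

/* If the new range abuts an existing node's range on either side,
 * extend that node in place instead of inserting a fresh entry.
 * Returns false when no merge is possible, in which case the caller
 * would insert a new node (and a real version would also record
 * enough state to unmerge on dma_unmap). */
static bool try_merge(struct iova_node *n, uint64_t pfn_lo, uint64_t pfn_hi)
{
    if (pfn_lo == n->pfn_hi + 1) {      /* abuts on the right */
        n->pfn_hi = pfn_hi;
        return true;
    }
    if (pfn_hi + 1 == n->pfn_lo) {      /* abuts on the left */
        n->pfn_lo = pfn_lo;
        return true;
    }
    return false;
}
```

With tens of thousands of static boot-time mappings collapsing into a handful of merged nodes, the pointer chase through a 14-16 level tree shrinks to a walk of a nearly flat structure, which is consistent with the reported third of the performance coming back.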
Omer's paper (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf) has some promising approaches. The magazine avoids the RB tree issue.

I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free page table cleanup algorithm could do well.

There are correctness fixes and optimizations left in the invalidation path: I want strict-ish semantics (a page doesn't go back into the freelist until the last IOTLB/IOMMU TLB entry is invalidated) with good performance, and that seems to imply that an additional page reference should be gotten at dma_map time and put back at the completion of the IOMMU flush routine. (This is worthy of much discussion.)

Additionally, we can find ways to optimize the flush routine by realizing that if we have frequent maps and unmaps, it may be because the device creates and destroys buffers a lot; these kinds of workloads use an IOVA for one event and then never come back. Maybe TLBs don't do much good and we could just flush the entire IOMMU TLB [and ATS cache] for that BDF.

We'll try to get free time to do some of these things soon.

Ben

1: We verified that the IOMMU costs are almost entirely software overheads by forcing software 1:1 mode, where we create page tables for all physical addresses. We tested using leaf nodes of size 4KB, of 2MB, and of 1GB. In all cases, there is zero runtime maintenance of the page tables, and no IOMMU invalidations. We did piles of DMA maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM addresses. At 4KB page size, we could see some bandwidth slowdown, but at 2MB and 1GB, there was < 1% performance loss as compared with IOMMU off.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, Nov 05, 2015 at 04:11:21PM -0500, David Miller wrote:
> And for the record Sowmini fixed a lot of the lock contention:
>
> commit ff7d37a502022149655c18035b99a53391be0383
> Author: Sowmini Varadhan
> Date: Thu Apr 9 15:33:30 2015 -0400
>
>     Break up monolithic iommu table/lock into finer graularity pools and lock

The poor rds-stress results without IOMMU bypass I sent in an earlier post were taken from a kernel that has the above patch, and that has all the needed changes in arch/sparc to use this new feature. It seems that it worked well for the 10G ETH IOMMU lock contention, but it still does not solve the rds-stress issue. The difference can come from:

1. Lock contention still left with this enhancement <-- zero in bypass
2. Overhead to set up the IOMMU mapping <-- almost zero in bypass (requires 1 HV call)
3. Overhead to use the IOMMU mapping <-- not sure how to measure this
4. Overhead to tear down the IOMMU mapping <-- zero in bypass
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
From: Joerg Roedel
Date: Thu, 5 Nov 2015 14:42:06 +0100

> Contended IOMMU locks are not only a problem on SPARC, but on x86 and various other IOMMU drivers too. But I have some ideas on how to improve the situation there.

And for the record Sowmini fixed a lot of the lock contention:

commit ff7d37a502022149655c18035b99a53391be0383
Author: Sowmini Varadhan
Date: Thu Apr 9 15:33:30 2015 -0400

    Break up monolithic iommu table/lock into finer graularity pools and lock

    Investigation of multithreaded iperf experiments on an ethernet
    interface show the iommu->lock as the hottest lock identified by
    lockstat, with something of the order of 21M contentions out of
    27M acquisitions, and an average wait time of 26 us for the lock.
    This is not efficient.

    A more scalable design is to follow the ppc model, where the
    iommu_map_table has multiple pools, each stretching over a segment
    of the map, and with a separate lock for each pool. This model
    allows for better parallelization of the iommu map search.

    This patch adds the iommu range alloc/free function infrastructure.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: David S. Miller
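The pooled design from the commit message above can be sketched as follows; the pool count, pool size, next-fit cursor, and the CPU-to-pool hash are illustrative only, not the kernel's actual lib/iommu-common.c code:

```c
#include <pthread.h>
#include <stdint.h>

/* The IOVA map is split into NPOOLS segments, each with its own lock,
 * so concurrent dma_map calls from different CPUs rarely contend on
 * the same lock. Sizes here are arbitrary for illustration. */
#define NPOOLS 4
#define POOL_SIZE 4096 /* map entries per pool */

struct iommu_pool {
    pthread_mutex_t lock; /* per-pool lock, replacing one global lock */
    uint64_t start;       /* first map entry owned by this pool */
    uint64_t hint;        /* next-fit allocation cursor */
};

static struct iommu_pool pools[NPOOLS] = {
    { PTHREAD_MUTEX_INITIALIZER, 0 * POOL_SIZE, 0 * POOL_SIZE },
    { PTHREAD_MUTEX_INITIALIZER, 1 * POOL_SIZE, 1 * POOL_SIZE },
    { PTHREAD_MUTEX_INITIALIZER, 2 * POOL_SIZE, 2 * POOL_SIZE },
    { PTHREAD_MUTEX_INITIALIZER, 3 * POOL_SIZE, 3 * POOL_SIZE },
};

/* Allocate npages contiguous entries; only the chosen pool is locked.
 * (No wraparound or cross-pool spill handling in this sketch.) */
static uint64_t iommu_range_alloc(unsigned int cpu, uint64_t npages)
{
    struct iommu_pool *p = &pools[cpu % NPOOLS]; /* spread CPUs over pools */
    uint64_t entry;

    pthread_mutex_lock(&p->lock);
    entry = p->hint;
    p->hint += npages;
    pthread_mutex_unlock(&p->lock);
    return entry;
}
```

The question raised earlier in the thread, whether 4 parallel arenas are enough, is exactly the NPOOLS knob above: it bounds how many CPUs can allocate without contending.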
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
From: David Woodhouse
Date: Thu, 29 Oct 2015 22:35:25 +

> For the receive side, it shouldn't be beyond the wit of man to introduce an API which allocates *and* DMA-maps a skb. Pass it to netif_rx() still mapped, with a destructor that just shoves it back in a pool for re-use.

For forwarding, the SKB is going to another device to be DMA'd, perhaps via another IOMMU. For local connections, it's going to sit for an unpredictable and unbounded amount of time in the socket queue. We've been through this thought process before, believe me :)
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 02, 2015 at 07:32:19PM +0200, Shamir Rabinovitch wrote:
> Correct. This issue is one of the concerns here in the previous replies. I will take a different approach which will not require the IOMMU bypass per mapping. Will try to shift to the x86 'iommu=pt' approach.

Yeah, it doesn't really make sense to have an extra remappable area when the device can access all physical memory anyway.

> We had a bunch of issues around the SPARC IOMMU. Not all of them relate to performance. The first issue was that on SPARC, currently, we only have a limited address space for the IOMMU, so we had issues doing large DMA mappings for Infiniband. The second issue was that we identified high contention on the IOMMU locks even in the ETH driver.

Contended IOMMU locks are not only a problem on SPARC, but on x86 and various other IOMMU drivers too. But I have some ideas on how to improve the situation there.

> I do not want to put too much information here but you can see some results:
>
> rds-stress test from sparc t5-2 -> x86:
>
> with iommu bypass:
> ------------------
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks  tx/s   rx/s   tx+rx K/s   mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
>    3  141278    0  1165565.81      0.00     0.00     8.93  376.60  -1.00  (average)
>
> without iommu bypass:
> ---------------------
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks  tx/s   rx/s   tx+rx K/s   mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
>    3   78558    0   648101.41      0.00     0.00    15.05  876.72  -1.00  (average)
>
> + RDMA tests are totally not working (might be due to failure to DMA map all the memory).
>
> So IOMMU bypass gives an ~80% performance boost.

Interesting. Have you looked more closely at what causes the performance degradation? Is it the lock contention or something else?

        Joerg
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Tue, 2015-11-03 at 14:11 +0100, Christoph Hellwig wrote:
> > xHCI for example, vs. something like 10G ethernet... but yes I agree it sucks. I don't like that sort of policy anywhere in drivers. On the other hand the platform doesn't have much information to make that sort of decision either.
>
> Maybe because it should simply use what's optimal? E.g. passthrough whenever possible, where the arguments against possible are: dma_mask, vfio requirements, kernel command line option.

Right, this is what I do today on powerpc, with the exception of the command line option.

> This is what a lot of architectures already do; I remember the SGI Origin / Altix code has the same behavior as well. Those IOMMUs already had the 64-bit passthrough and 32-bit sliding window in addition to the real IOMMU 10 years ago.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Tue, Nov 03, 2015 at 10:08:13AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > > Then I would argue for naming this differently. Make it an optional hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.
> >
> > With a name like that, who wouldn't pass that flag? ;-)
>
> xHCI for example, vs. something like 10G ethernet... but yes I agree it sucks. I don't like that sort of policy anywhere in drivers. On the other hand the platform doesn't have much information to make that sort of decision either.

Maybe because it should simply use what's optimal? E.g. passthrough whenever possible, where the arguments against possible are: dma_mask, vfio requirements, kernel command line option. This is what a lot of architectures already do; I remember the SGI Origin / Altix code has the same behavior as well. Those IOMMUs already had the 64-bit passthrough and 32-bit sliding window in addition to the real IOMMU 10 years ago.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.
>
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes I agree it sucks. I don't like that sort of policy anywhere in drivers. On the other hand the platform doesn't have much information to make that sort of decision either.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 22:48 +, David Woodhouse wrote:
> In the Intel case, the mapping setup is entirely per-device (except for crap devices and devices behind a PCIe-PCI bridge, etc.).
>
> So you can happily have a passthrough mapping for *one* device, without making that same mapping available to another device. You can make that trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device ..."; the problem remains that it doesn't buy you much to do it *per-mapping*, which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough for *all* devices (except those whose dma_mask can't reach all of physical memory). But I could live with something like Shamir is proposing, which lets us do the bypass only for performance-sensitive devices which actually *ask* for it (and of course we'd have a system-wide mode which declines that request and does normal mappings anyway).
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 08:10 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> > Not sure this use case is possible for Infiniband where the application holds the data buffers and there is no way to force the application to re-use the buffer as suggested.
> >
> > This is why I think there will be no easy way to bypass the DMA mapping cost for all use cases unless we allow applications to request bypass/pass-through DMA mapping (implicitly or explicitly).
>
> But but but ...
>
> What I don't understand is how that brings you any safety.
>
> Basically, either your bridge has a bypass window, or it doesn't. (Or it has one and it's enabled or not, same thing.)
>
> If you request the bypass on a per-mapping basis, you basically have to keep the window always enabled, unless you do some nasty refcounting of how many people have a bypass mapping in flight, but that doesn't seem useful.
>
> Thus you have already lost all protection from the device, since your entire memory is accessible via the bypass mapping.
>
> Which means what is the point of then having non-bypass mappings for other things? Just to make things slower?
>
> I really don't see what this whole "bypass on a per-mapping basis" buys you.

In the Intel case, the mapping setup is entirely per-device (except for crap devices and devices behind a PCIe-PCI bridge, etc.).

So you can happily have a passthrough mapping for *one* device, without making that same mapping available to another device. You can make that trade-off of speed vs. protection for each device in the system.

Currently we have the 'iommu=pt' option that makes us use passthrough for *all* devices (except those whose dma_mask can't reach all of physical memory).
But I could live with something like Shamir is proposing, which lets us do the bypass only for performance-sensitive devices which actually *ask* for it (and of course we'd have a system-wide mode which declines that request and does normal mappings anyway).

--
dwmw2
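The per-device trade-off David describes could be modelled as a simple policy function. Every name here is hypothetical, intended only to show the decision order being discussed (system-wide override first, then the dma_mask constraint, then the driver's request):

```c
#include <assert.h>
#include <stdbool.h>

/* A device is attached to either an identity (bypass) domain or a
 * translated DMA domain. These names are illustrative, not the
 * intel-iommu driver's. */
enum domain_type { DOMAIN_DMA, DOMAIN_IDENTITY };

struct device_policy {
    bool driver_requests_bypass; /* e.g. via a DMA_ATTR_IOMMU_BYPASS-style hint */
    bool dma_mask_covers_ram;    /* device can address all physical memory */
};

static enum domain_type pick_domain(const struct device_policy *dev,
                                    bool system_forbids_bypass)
{
    /* The admin's system-wide mode declines bypass requests outright. */
    if (system_forbids_bypass)
        return DOMAIN_DMA;

    /* A bypass mapping is useless if the device can't reach all RAM. */
    if (!dev->dma_mask_covers_ram)
        return DOMAIN_DMA;

    /* Otherwise honor the driver's (or user's) request. */
    return dev->driver_requests_bypass ? DOMAIN_IDENTITY : DOMAIN_DMA;
}
```

Because the choice is made once per device rather than per mapping, protection for the remaining translated devices is preserved, which is the key difference from the per-mapping bypass Ben objects to earlier in the thread.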
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Tue, Nov 03, 2015 at 07:13:28AM +1100, Benjamin Herrenschmidt wrote:
> Then I would argue for naming this differently. Make it an optional hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.

Correct. This comment also came from internal review :-) Although currently there is strong opposition to such a new attribute.

> It will partially only but it's just an example of another way the backend could provide some improved performances without a bypass.

Just curious: in the Infiniband case, where a user application requests the driver to DMA-map some random pages it allocated - will your suggestion still function? I mean, can you use this limited 1:1 mapping window to map any page the user application chooses? How?

With the bypass we just use the physical address of the page (almost as is). We use the fact that the bypass address space covers the whole memory, and so we have a mapping for any address. In the IOMMU case we use the IOMMU translation tables and so can map a 64-bit address to some 32-bit address (or less). In your case it is not clear to me what we can do with such a limited 1:1 mapping.

> Cheers,
> Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Tuesday 03 November 2015 07:13:28 Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> > On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> > >
> > > Choosing on a per-mapping basis *in the back end* might still make some
> >
> > In my case, choosing the mapping based on the hardware that will use it makes more sense. Most hardware is not as performance sensitive as the Infiniband hardware.
> ...
> > The driver knows for what hardware it is mapping the memory, so it knows if the memory will be used by performance-sensitive hardware or not.
>
> Then I would argue for naming this differently. Make it an optional hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.

With a name like that, who wouldn't pass that flag? ;-)

        Arnd
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> >
> > Choosing on a per-mapping basis *in the back end* might still make some
>
> In my case, choosing the mapping based on the hardware that will use it makes more sense. Most hardware is not as performance sensitive as the Infiniband hardware.
...
> The driver knows for what hardware it is mapping the memory, so it knows if the memory will be used by performance-sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is achieved via using a bypass or other means in the backend is not the business of the driver.

> In your case, what will give the better performance - 1:1 mapping or IOMMU mapping? When you say 'relaxing the protection', do you refer to 1:1 mapping? Also, how does this 1:1 window address the security concerns raised by others here?

It will only partially, but it's just an example of another way the backend could provide some improved performance without a bypass.

Cheers,
Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 02, 2015 at 03:44:27PM +0100, Joerg Roedel wrote:
> How do you envision this per-mapping by-pass to work? For the DMA-API mappings you have only one device address space. For by-pass to work, you need to map all physical memory of the machine into this space, which leaves the question where you want to put the window for remapping.
>
> You surely have to put it after the physical mappings, but any protection will be gone, as the device can access all physical memory.

Correct. This issue is one of the concerns here in the previous replies. I will take a different approach which will not require the IOMMU bypass per mapping. Will try to shift to the x86 'iommu=pt' approach.

> So instead of working around the shortcomings of DMA-API implementations, can you present us some numbers and analysis of how bad the performance impact with an IOMMU is and what causes it?

We had a bunch of issues around the SPARC IOMMU. Not all of them relate to performance. The first issue was that on SPARC, currently, we only have a limited address space for the IOMMU, so we had issues doing large DMA mappings for Infiniband. The second issue was that we identified high contention on the IOMMU locks even in the ETH driver.

> I know that we have lock-contention issues in our IOMMU drivers, which can be fixed. Maybe the performance problems you are seeing can be fixed too, when you give us more details about them.
I do not want to put too much information here but you can see some results:

rds-stress test from sparc t5-2 -> x86:

with iommu bypass:
------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks  tx/s   rx/s   tx+rx K/s   mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
   3  141278    0  1165565.81      0.00     0.00     8.93  376.60  -1.00  (average)

without iommu bypass:
---------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks  tx/s   rx/s   tx+rx K/s   mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
   3   78558    0   648101.41      0.00     0.00    15.05  876.72  -1.00  (average)

+ RDMA tests are totally not working (might be due to failure to DMA map all the memory).

So IOMMU bypass gives an ~80% performance boost.

> Joerg
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Fri, Oct 30, 2015 at 11:32:06AM +0100, Arnd Bergmann wrote:
> I wonder if the 'iommu=force' attribute is too coarse-grained though, and if we should perhaps allow a per-device setting on architectures that allow this.

Yeah, definitely. Currently we only have iommu=pt to enable pass-through mode for _all_ devices. I think it makes sense to introduce a per-device opt-in for pass-through, but have it configured by the user and not by the device driver. If the user enables the IOMMU in his system, he expects to be secure against DMA attacks. If drivers could opt out, every protection would be voided.

        Joerg
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, Oct 29, 2015 at 09:32:32AM +0200, Shamir Rabinovitch wrote: > For the IB case, setting up and tearing down DMA mappings for the driver's data buffers > is expensive. But we could for example consider mapping all the HW control objects > that validate the HW access to the driver's data buffers as IOMMU-protected and so > have IOMMU protection on those critical objects while having fast set-up/tear-down > of driver data buffers. The HW control objects have a stable mapping for the lifetime > of the system. So the case of using both types of DMA mappings is still valid. How do you envision this per-mapping by-pass to work? For the DMA-API mappings you have only one device address space. For by-pass to work, you need to map all physical memory of the machine into this space, which leaves the question where you want to put the window for remapping. You surely have to put it after the physical mappings, but any protection will be gone, as the device can access all physical memory. So instead of working around the shortcomings of DMA-API implementations, can you present us some numbers and analysis of how bad the performance impact with an IOMMU is and what causes it? I know that we have lock-contention issues in our IOMMU drivers, which can be fixed. Maybe the performance problems you are seeing can be fixed too, when you give us more details about them. Joerg
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote: > > Choosing on a per-mapping basis *in the back end* might still make some In my case, choosing the mapping based on the hardware that will use these mappings makes more sense. Most hardware is not as performance-sensitive as Infiniband hardware. > amount of sense. What I don't completely grasp is what it gives > you to expose that choice to the *driver* all the way up the chain. Why > does the driver know better whether something should use the bypass or > not ? The driver knows for what hardware it is mapping the memory, so it knows whether the memory will be used by performance-sensitive hardware or not. > > I can imagine some in-between setups, for example, on POWER (and > probably x86), I could set up a window that is TCE-mapped (TCEs are our > iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always > maps to the same physical page. I could then use map/unmap to adjust the > protection, the idea being that only "relaxing" the protection requires > flushing the IO TLB, ie, we could delay most flushes. In your case, which will give the better performance - the 1:1 mapping or the IOMMU mapping? When you say 'relaxing the protection', do you refer to the 1:1 mapping? Also, how does this 1:1 window address the security concerns raised by others here? > > But that sort of optimization only makes sense in the back-end. > > So what was your original idea where you thought the driver was the > right one to decide whether to use the bypass or not for a given map > operation ? That's what I don't grasp... you might have a valid case > that I just fail to see. Please see above. > > Cheers, > Ben. > Given that per-allocation IOMMU bypass raises some concerns, I think the only path for SPARC is this: 1. Support 'iommu=pt' as on x86, as an intermediate step. Systems that use Infiniband will be set to pass-through. 2. Add support for DVMA, which allows less contention on the IOMMU resources during map/unmap, a bigger address range, and full protection. It is still not clear what the performance cost of using DVMA will be.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote: > To summarize - > > 1. The whole point of the IOMMU pass-through was to get a bigger address space > and faster map/unmap operations for performance-critical hardware > 2. SPARC IOMMU in particular has the ability to do DVMA, which addresses all the > protection concerns raised above. Not sure what will be the > performance > impact though. This still needs a lot of work before we could test > this. > 3. On x86 we use the IOMMU in pass-through mode so all the above concerns are > valid > > The questions are - > > 1. Does partial use of the IOMMU while the pass-through window is enabled add some > protection? > 2. Do we prefer the x86 way of doing this, which is to enable / disable IOMMU > translations at the kernel level? > > I think that I can live with option (2) till I have DVMA if there is strong > disagreement on the need for per-allocation IOMMU bypass. Choosing on a per-mapping basis *in the back end* might still make some amount of sense. What I don't completely grasp is what it gives you to expose that choice to the *driver* all the way up the chain. Why does the driver know better whether something should use the bypass or not ? I can imagine some in-between setups, for example, on POWER (and probably x86), I could set up a window that is TCE-mapped (TCEs are our iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always maps to the same physical page. I could then use map/unmap to adjust the protection, the idea being that only "relaxing" the protection requires flushing the IO TLB, ie, we could delay most flushes. But that sort of optimization only makes sense in the back-end. So what was your original idea where you thought the driver was the right one to decide whether to use the bypass or not for a given map operation ? That's what I don't grasp... you might have a valid case that I just fail to see. Cheers, Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Mon, Nov 02, 2015 at 08:10:49AM +1100, Benjamin Herrenschmidt wrote: > But but but ... > > What I don't understand is how that brings you any safety. Limited safety, maybe? If some device DMA mappings go through the IOMMU, and they fall in an address range that is far from the bypass / pass-through range, then small addressing errors can still be caught for the mappings that do not bypass / pass through the IOMMU - right? A device can of course use the bypass address and just reach the memory w/o IOMMU protection. See some comments about that below. > > Basically, either your bridge has a bypass window, or it doesn't. (Or > it has one and it's enabled or not, same thing). Agree. > > If you request the bypass on a per-mapping basis, you basically have to > keep the window always enabled, unless you do some nasty refcounting of > how many people have a bypass mapping in flight, but that doesn't seem > useful. > > Thus you have already lost all protection from the device, since your > entire memory is accessible via the bypass mapping. Correct, see my above comment. > > Which means what is the point of then having non-bypass mappings for > other things ? Just to make things slower ? > > I really don't see what this whole "bypass on a per-mapping basis" buys > you. > > Note that we implicitly already do that on powerpc, but not for those > reasons, we do it based on the DMA mask, so that if your coherent mask > is 32-bit but your dma mask is 64-bit (which is not an uncommon > combination), we service the "coherent" requests (basically the long-lived dma allocs) from the remapped region and the "stream" requests > (ie, map_page/map_sg) from the bypass. To summarize - 1. The whole point of the IOMMU pass-through was to get a bigger address space and faster map/unmap operations for performance-critical hardware 2. SPARC IOMMU in particular has the ability to do DVMA, which addresses all the protection concerns raised above. Not sure what will be the performance impact though. This still needs a lot of work before we could test this. 3. On x86 we use the IOMMU in pass-through mode so all the above concerns are valid The questions are - 1. Does partial use of the IOMMU while the pass-through window is enabled add some protection? 2. Do we prefer the x86 way of doing this, which is to enable / disable IOMMU translations at the kernel level? I think that I can live with option (2) till I have DVMA if there is strong disagreement on the need for per-allocation IOMMU bypass. > > Cheers, > Ben. >
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote: > Not sure this use case is possible for Infiniband where applications hold > the data buffers and there is no way to force applications to reuse the > buffer as suggested. > > This is why I think there will be no easy way to bypass the DMA mapping cost > for all use cases unless we allow applications to request bypass/pass-through > DMA mapping (implicitly or explicitly). But but but ... What I don't understand is how that brings you any safety. Basically, either your bridge has a bypass window, or it doesn't. (Or it has one and it's enabled or not, same thing). If you request the bypass on a per-mapping basis, you basically have to keep the window always enabled, unless you do some nasty refcounting of how many people have a bypass mapping in flight, but that doesn't seem useful. Thus you have already lost all protection from the device, since your entire memory is accessible via the bypass mapping. Which means what is the point of then having non-bypass mappings for other things ? Just to make things slower ? I really don't see what this whole "bypass on a per-mapping basis" buys you. Note that we implicitly already do that on powerpc, but not for those reasons, we do it based on the DMA mask, so that if your coherent mask is 32-bit but your dma mask is 64-bit (which is not an uncommon combination), we service the "coherent" requests (basically the long-lived dma allocs) from the remapped region and the "stream" requests (ie, map_page/map_sg) from the bypass. Cheers, Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, Oct 29, 2015 at 10:35:25PM +, David Woodhouse wrote: > > > > Could this be mitigated using pools? I don't know if the net code > > would play along easily. > > For the receive side, it shouldn't be beyond the wit of man to > introduce an API which allocates *and* DMA-maps a skb. Pass it to > netif_rx() still mapped, with a destructor that just shoves it back in > a pool for re-use. > > Doing it for transmit might be a little more complex, but perhaps still > possible. Not sure this use case is possible for Infiniband, where applications hold the data buffers and there is no way to force applications to reuse the buffer as suggested. This is why I think there will be no easy way to bypass the DMA mapping cost for all use cases unless we allow applications to request bypass/pass-through DMA mapping (implicitly or explicitly). > > -- > dwmw2 >
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Saturday 31 October 2015 10:17:22 Benjamin Herrenschmidt wrote: > On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote: > > On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote: > > > > > > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes > > > > across > > > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have > > > > a > > > > sane meaning in the paranoid mode (and perhaps we'd want an ultra > > > > -paranoid mode where it's not honoured). > > > > > > Possibly, though ideally that would be a user policy but of course > > > by > > > the time you get to userspace it's generally too late. > > > > IIRC, we have an 'iommu=force' command line switch for this, to > > ensure > > that no device can use a linear mapping and everything goes through > > the page tables. This is often useful for both debugging and as a > > security measure when dealing with unprivileged DMA access (virtual > > machines, vfio, ...). > > That was used to force-enable the iommu on platforms like G5s where we > would otherwise only do so if the memory was larger than 32-bit but we > never implemented using it to prevent the bypass region. Ah, I see. Thanks for the clarification. > > If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly > > document > > which of the two we expect to take > > priority in cases where we have a > > choice. > > > > I wonder if the 'iommu=force' attribute is too coarse-grained though, > > and if we should perhaps allow a per-device setting on architectures > > that allow this. > > The interesting thing, if we can make it work, is the bypass attribute > being per mapping... I would say we want both: for the device driver it can make sense to choose per mapping what it can do, but for the iommu driver, it can also make sense to ensure we never provide a linear mapping, because otherwise the additional security aspect is moot. In particular for the unprivileged VM guest or vfio access, the code that gives access to the device to something else should have a way to tell the IOMMU that the linear mapping can no longer be used. Arnd
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote: > On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote: > > > > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes > > > across > > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have > > > a > > > sane meaning in the paranoid mode (and perhaps we'd want an ultra > > > -paranoid mode where it's not honoured). > > > > Possibly, though ideally that would be a user policy but of course > > by > > the time you get to userspace it's generally too late. > > IIRC, we have an 'iommu=force' command line switch for this, to > ensure > that no device can use a linear mapping and everything goes through > the page tables. This is often useful for both debugging and as a > security measure when dealing with unprivileged DMA access (virtual > machines, vfio, ...). That was used to force-enable the iommu on platforms like G5s where we would otherwise only do so if the memory was larger than 32-bit but we never implemented using it to prevent the bypass region. > If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly > document > which of the two we expect to take > priority in cases where we have a > choice. > > I wonder if the 'iommu=force' attribute is too coarse-grained though, > and if we should perhaps allow a per-device setting on architectures > that allow this. The interesting thing, if we can make it work, is the bypass attribute being per mapping... Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote: > > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a > > sane meaning in the paranoid mode (and perhaps we'd want an ultra > > -paranoid mode where it's not honoured). > > Possibly, though ideally that would be a user policy but of course by > the time you get to userspace it's generally too late. IIRC, we have an 'iommu=force' command line switch for this, to ensure that no device can use a linear mapping and everything goes through the page tables. This is often useful for both debugging and as a security measure when dealing with unprivileged DMA access (virtual machines, vfio, ...). If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly document which of the two we expect to take priority in cases where we have a choice. I wonder if the 'iommu=force' attribute is too coarse-grained though, and if we should perhaps allow a per-device setting on architectures that allow this. Arnd
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote: > On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt" > wrote: > > > > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote: > > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > > > > > > > On Power, I generally have 2 IOMMU windows for a device, one at > > > > the > > > > bottom is remapped, and is generally used for 32-bit devices > > > > and the > > > > one at the top is set up as a bypass > > > > > > So in the normal case of decent 64-bit devices (and not in a VM), > > > they'll *already* be using the bypass region and have full access > > > to > > > all of memory, all of the time? And you have no protection > > > against > > > driver and firmware bugs causing stray DMA? > > > > Correct, we chose to do that for performance reasons. > > Could this be mitigated using pools? I don't know if the net code > would play along easily. Possibly, the pools we have already limit the lock contention, but we still have the map/unmap overhead, which under a hypervisor can be quite high. I'm not necessarily against changing the way we do things but it would have to be backed up with numbers. Cheers, Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote: > On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt" > wrote: > > > > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote: > > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > > > > > > > On Power, I generally have 2 IOMMU windows for a device, one at > > > > the > > > > bottom is remapped, and is generally used for 32-bit devices > > > > and the > > > > one at the top is set up as a bypass > > > > > > So in the normal case of decent 64-bit devices (and not in a VM), > > > they'll *already* be using the bypass region and have full access > > > to > > > all of memory, all of the time? And you have no protection > > > against > > > driver and firmware bugs causing stray DMA? > > > > Correct, we chose to do that for performance reasons. > > Could this be mitigated using pools? I don't know if the net code > would play along easily. For the receive side, it shouldn't be beyond the wit of man to introduce an API which allocates *and* DMA-maps a skb. Pass it to netif_rx() still mapped, with a destructor that just shoves it back in a pool for re-use. Doing it for transmit might be a little more complex, but perhaps still possible. -- dwmw2
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt" wrote: > > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote: > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > > > > > On Power, I generally have 2 IOMMU windows for a device, one at the > > > bottom is remapped, and is generally used for 32-bit devices and the > > > one at the top is set up as a bypass > > > > So in the normal case of decent 64-bit devices (and not in a VM), > > they'll *already* be using the bypass region and have full access to > > all of memory, all of the time? And you have no protection against > > driver and firmware bugs causing stray DMA? > > Correct, we chose to do that for performance reasons. Could this be mitigated using pools? I don't know if the net code would play along easily. --Andy
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, Oct 29, 2015 at 09:42:12AM +0900, David Woodhouse wrote: > Aside from the lack of security, the other disadvantage of that is that > you have to pin *all* pages of a guest in case DMA happens; you don't > get to pin *only* those pages which are referenced by that guest's > IOMMU page tables... We do bypass the IOMMU but not the DMA API, and since we pin the page before calling the DMA API, we do not need to pin all the pages - just the ones we use for DMA. For me this flag looks orthogonal to the page-pinning issue you brought up. It is just a hint to the DMA API that we want to use a simple & fast mapping while knowing we lose IOMMU protection for this memory. For the IB case, setting up and tearing down DMA mappings for the driver's data buffers is expensive. But we could, for example, consider mapping all the HW control objects that validate the HW access to the driver's data buffers as IOMMU-protected, and so have IOMMU protection on those critical objects while having fast set-up/tear-down of driver data buffers. The HW control objects have a stable mapping for the lifetime of the system. So the case of using both types of DMA mappings is still valid. > > -- > dwmw2 >
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote: > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > > > On Power, I generally have 2 IOMMU windows for a device, one at the > > bottom is remapped, and is generally used for 32-bit devices and the > > one at the top is set up as a bypass > > So in the normal case of decent 64-bit devices (and not in a VM), > they'll *already* be using the bypass region and have full access to > all of memory, all of the time? And you have no protection against > driver and firmware bugs causing stray DMA? Correct, we chose to do that for performance reasons. > > I don't see how that attribute would work for us. > > Because you're already doing it anyway without being asked :) > > If SPARC and POWER are both doing that, perhaps we should change the > default for Intel too? > > Aside from the lack of security, the other disadvantage of that is that > you have to pin *all* pages of a guest in case DMA happens; you don't > get to pin *only* those pages which are referenced by that guest's > IOMMU page tables... Correct, the problem is that the cost of doing map/unmap from a guest is really a huge hit on things like network devices. Another problem is that the failure mode isn't great if you don't pin. IE. You have to pin pages as they get mapped into the iommu by the guest, but you don't know in advance how much and you may hit the process ulimit on pinned pages half way through. We tried to address that in various ways but it always ended up horrid. > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a > sane meaning in the paranoid mode (and perhaps we'd want an ultra > -paranoid mode where it's not honoured). Possibly, though ideally that would be a user policy but of course by the time you get to userspace it's generally too late. Cheers, Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote: > On Power, I generally have 2 IOMMU windows for a device, one at the > bottom is remapped, and is generally used for 32-bit devices and the > one at the top is set up as a bypass So in the normal case of decent 64-bit devices (and not in a VM), they'll *already* be using the bypass region and have full access to all of memory, all of the time? And you have no protection against driver and firmware bugs causing stray DMA? > I don't see how that attribute would work for us. Because you're already doing it anyway without being asked :) If SPARC and POWER are both doing that, perhaps we should change the default for Intel too? Aside from the lack of security, the other disadvantage of that is that you have to pin *all* pages of a guest in case DMA happens; you don't get to pin *only* those pages which are referenced by that guest's IOMMU page tables... Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a sane meaning in the paranoid mode (and perhaps we'd want an ultra-paranoid mode where it's not honoured). -- dwmw2
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Wed, 2015-10-28 at 22:31 +0900, David Woodhouse wrote: > We have an option in the Intel IOMMU for pass-through mode too, which > basically *is* a total bypass. In practice, what's the difference > between that and a "simple translation that does not require any > [translation]"? We set up a full 1:1 mapping of all memory, and then > the map/unmap methods become no-ops. > > Currently we have no way to request that mode on a per-device basis; > we > only have 'iommu=pt' on the command line to set it for *all* devices. > But performance-sensitive devices might want it, while we keep doing > proper translation for others. On Power, I generally have 2 IOMMU windows for a device, one at the bottom is remapped, and is generally used for 32-bit devices, and the one at the top is set up as a bypass (or in the case of KVM, as a linear mapping of guest memory, which looks the same as a bypass to the guest). The DMA ops will automatically hit the appropriate one based on the DMA mask. I don't see how that attribute would work for us. Cheers, Ben.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
From: David Woodhouse Date: Wed, 28 Oct 2015 22:57:12 +0900 > On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote: >> In the sparc64 case, the 64-bit DMA address space is divided into >> IOMMU translated and non-IOMMU translated. >> >> You just set the high bits differently depending upon what you want. > > Wait, does that mean a (rogue) device could *always* get full access to > physical memory just by setting the high bits appropriately? That > mapping is *always* available? The IOMMU supports several modes, one of which disallows passthrough and this is what you would use in a virtual guest.
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote: > In the sparc64 case, the 64-bit DMA address space is divided into > IOMMU translated and non-IOMMU translated. > > You just set the high bits differently depending upon what you want. Wait, does that mean a (rogue) device could *always* get full access to physical memory just by setting the high bits appropriately? That mapping is *always* available? > So a device could use both IOMMU translated and bypass accesses at > the same time. While seemingly interesting, I do not recommend we > provide this kind of flexibility in our DMA interfaces. Now I could understand this if the answer to my question above was 'no'. We absolutely want the *security* all the time, and we don't want the device to be able to do stupid stuff. But if the answer was 'yes' then we take the map/unmap performance hit for... *what* benefit? On Intel we have the passthrough as an *option* and I have the same initial reaction — "Hell no, we want the security". But I concede the performance motivation for it, and I'm not *dead* set against permitting it. If I tolerate a per-device request for passthrough mode, that might prevent people from disabling the IOMMU or putting it into passthrough mode *entirely*. So actually, I'm *improving* security... I think it makes sense to allow performance sensitive device drivers to *request* a passthrough mode. The platform can reserve the right to refuse, if either the IOMMU hardware doesn't support that, or we're in a paranoid mode (with iommu=always or something on the command line). -- dwmw2
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
From: David Woodhouse Date: Wed, 28 Oct 2015 22:31:50 +0900 > On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote: >> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote: >> > > > +For systems with IOMMU it is assumed all DMA translations use the >> > > > IOMMU. >> > >> > Not entirely true. We have per-device dma_ops on a most architectures >> > already, and we were just talking about the need to add them to >> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to >> > map virtio devices too. >> >> SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops). >> >> Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH). >> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range. >> On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple >> translation that does not require any complex translations tables. > > We have an option in the Intel IOMMU for pass-through mode too, which > basically *is* a total bypass. In practice, what's the difference > between that and a "simple translation that does not require any > [translation]"? We set up a full 1:1 mapping of all memory, and then > the map/unmap methods become no-ops. > > Currently we have no way to request that mode on a per-device basis; we > only have 'iommu=pt' on the command line to set it for *all* devices. > But performance-sensitive devices might want it, while we keep doing > proper translation for others. In the sparc64 case, the 64-bit DMA address space is divided into IOMMU translated and non-IOMMU translated. You just set the high bits differently depending upon what you want. So a device could use both IOMMU translated and bypass accesses at the same time. While seemingly interesting, I do not recommend we provide this kind of flexibility in our DMA interfaces. 
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > > +For systems with IOMMU it is assumed all DMA translations use the
> > > > IOMMU.
> >
> > Not entirely true. We have per-device dma_ops on most architectures
> > already, and we were just talking about the need to add them to
> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> > map virtio devices too.
>
> SPARC has its implementation under arch/sparc for dma_ops (sun4v_dma_ops).
>
> Some drivers use the IOMMU under SPARC, for example ixgbe (Intel 10G ETH).
> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
> On SPARC IOMMU bypass is not a total bypass of the IOMMU but rather a much
> simpler translation that does not require any complex translation tables.

We have an option in the Intel IOMMU for pass-through mode too, which
basically *is* a total bypass. In practice, what's the difference
between that and a "simple translation that does not require any
[translation]"? We set up a full 1:1 mapping of all memory, and then
the map/unmap methods become no-ops.

Currently we have no way to request that mode on a per-device basis; we
only have 'iommu=pt' on the command line to set it for *all* devices.
But performance-sensitive devices might want it, while we keep doing
proper translation for others.

> > As we look at that (and make the per-device dma_ops a generic thing
> > rather than per-arch), we should probably look at folding your case in
> > too.
>
> Whether to use IOMMU or not for DMA is up to the driver. The above
> example shows a real situation where one driver can use the IOMMU and
> the other can't. It seems that the device cannot know when to use
> IOMMU bypass and when not. Even in the driver we can decide to DMA map
> some buffers using IOMMU translation and some as IOMMU bypass.

In practice there are multiple possibilities — there are cases where you
*must* use an IOMMU and do full translation, and there is no option of a
bypass. There are cases where there just isn't an IOMMU (and sometimes
that's a per-device fact, like with virtio). And there are cases where
you *can* use the IOMMU, but if you ask nicely you can get away without
it.

My point in linking up these two threads is that we should contemplate
all of those use cases and come up with something that addresses them
all.

> Do you agree that we need this attribute in the generic DMA API?

Yeah, I think it can be useful.

-- 
David Woodhouse
Open Source Technology Centre
david.woodho...@intel.com
Intel Corporation
On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
>
> Not entirely true. We have per-device dma_ops on most architectures
> already, and we were just talking about the need to add them to
> POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> map virtio devices too.

SPARC has its implementation under arch/sparc for dma_ops (sun4v_dma_ops).

Some drivers use the IOMMU under SPARC, for example ixgbe (Intel 10G ETH).
Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
On SPARC IOMMU bypass is not a total bypass of the IOMMU but rather a much
simpler translation that does not require any complex translation tables.

> As we look at that (and make the per-device dma_ops a generic thing
> rather than per-arch), we should probably look at folding your case in
> too.

Whether to use IOMMU or not for DMA is up to the driver. The above example
shows a real situation where one driver can use the IOMMU and the other
can't. It seems that the device cannot know when to use IOMMU bypass and
when not. Even in the driver we can decide to DMA map some buffers using
IOMMU translation and some as IOMMU bypass.

Do you agree that we need this attribute in the generic DMA API?

> -- 
> dwmw2
On Sun, 2015-10-25 at 09:07 -0700, Shamir Rabinovitch wrote:
>
> +
> +DMA_ATTR_IOMMU_BYPASS
> +---------------------
> +
> +For systems with IOMMU it is assumed all DMA translations use the IOMMU.

Not entirely true. We have per-device dma_ops on most architectures
already, and we were just talking about the need to add them to
POWER/SPARC too, because we need to avoid trying to use the IOMMU to
map virtio devices too.

As we look at that (and make the per-device dma_ops a generic thing
rather than per-arch), we should probably look at folding your case in
too.

-- 
dwmw2