Re: DMA mappings and crossing boundaries
On Wed, 2018-07-11 at 14:51 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> >
> > > Could it ? I wouldn't think dma_map_page is allowed to cross page
> > > boundaries ... what about single() ? The main worry is people using
> > > these things on kmalloc'ed memory
> >
> > Oh, absolutely - the underlying operation is just "prepare for DMA
> > to/from this physically-contiguous region"; the only real difference
> > between map_page and map_single is for the sake of the usual "might be
> > highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits
> > on the size and offset parameters (in fact, if anyone asks I would say
> > the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few
> > months back is valid for dma_map_page too, however silly it may seem).
>
> I think this is a very bad idea though. We should strive to avoid
> coalescing before the iommu layer and we should avoid creating sglists
> that cross natural alignment boundaries.

Ping ? Jens, Christoph ?

I really really don't like how sg_alloc_table_from_pages() coalesces
the sglist before it gets mapped. This will potentially create huge
constraints and fragmentation in the IOMMU allocator by passing large
chunks to it.

> I had a look at sg_alloc_table_from_pages() and it basically freaks me
> out.
>
> Back in the old days, we used to have the block layer aggressively
> coalesce requests iirc, but that was changed to let the iommu layer do
> it.
>
> If you pass to dma_map_sg() something already heavily coalesced, such
> as what sg_alloc_table_from_pages() can do since it seems to have
> absolutely no limits there, you will create a significant fragmentation
> problem in the iommu allocator.
>
> Instead, we should probably avoid such coalescing at that level and
> instead let the iommu opportunistically coalesce at the point of
> mapping which it does just fine.
>
> We've been racking our brains here and we can't find a way to implement
> something we want that is both very performance efficient (no global
> locks etc...) and works with boundary crossing mappings.
>
> We could always fall back to classic small page mappings but that is
> quite suboptimal.
>
> I also notice that sg_alloc_table_from_pages() doesn't try to prevent
> crossing the 4G boundary. I remember pretty clearly that it was
> explicitly forbidden to create such a crossing.
>
> > Of course, given that the allocators tend to give out size/order-aligned
> > chunks, I think you'd have to be pretty tricksy to get two allocations
> > to line up either side of a large power-of-two boundary *and* go out of
> > your way to then make a single request spanning both, but it's certainly
> > not illegal. Realistically, the kind of "scrape together a large buffer
> > from smaller pieces" code which is liable to hit a boundary-crossing
> > case by sheer chance is almost certainly going to be taking the
> > sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather
> > than implementing its own merging and piecemeal mapping.
>
> Yes and I think what sg_alloc_table_from_pages() does is quite wrong.
>
> > > > Conceptually it looks pretty easy to extend the allocation constraints
> > > > to cope with that - even the pathological worst case would have an
> > > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > > way to make that practical :(
> > >
> > > No we are talking about 40-ish-bits of address space, so there's a bit
> > > of leeway. Of course no scheme will work if the user app tries to map
> > > more than the GPU can possibly access.
> > >
> > > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > > I think we have some margin, it's just that they can't reach our
> > > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > > teach Linux about every single way our HW can scatter memory across
> > > nodes, so an "on demand" mechanism is by far the most flexible way to
> > > deal with all configurations.
> > >
> > > > Maybe the best compromise would be some sort of hybrid scheme which
> > > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > > buffer, and invokes software bouncing for the awkward cases.
> > >
> > > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > > cases. It would be nice if we could have some kind of guarantee that a
> > > single mapping or sglist entry never crossed a specific boundary
> > > though... We more or less have that for 4G already (well, we are supposed
> > > to at least). Who are the main potential problematic subsystems here ?
> > > I'm thinking network skb allocation pools ... and page cache if it
Re: DMA mappings and crossing boundaries
On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
>
> > Could it ? I wouldn't think dma_map_page is allowed to cross page
> > boundaries ... what about single() ? The main worry is people using
> > these things on kmalloc'ed memory
>
> Oh, absolutely - the underlying operation is just "prepare for DMA
> to/from this physically-contiguous region"; the only real difference
> between map_page and map_single is for the sake of the usual "might be
> highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits
> on the size and offset parameters (in fact, if anyone asks I would say
> the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few
> months back is valid for dma_map_page too, however silly it may seem).

I think this is a very bad idea though. We should strive to avoid
coalescing before the iommu layer and we should avoid creating sglists
that cross natural alignment boundaries.

I had a look at sg_alloc_table_from_pages() and it basically freaks me
out.

Back in the old days, we used to have the block layer aggressively
coalesce requests iirc, but that was changed to let the iommu layer do
it.

If you pass to dma_map_sg() something already heavily coalesced, such
as what sg_alloc_table_from_pages() can do since it seems to have
absolutely no limits there, you will create a significant fragmentation
problem in the iommu allocator.

Instead, we should probably avoid such coalescing at that level and
instead let the iommu opportunistically coalesce at the point of
mapping which it does just fine.

We've been racking our brains here and we can't find a way to implement
something we want that is both very performance efficient (no global
locks etc...) and works with boundary crossing mappings.

We could always fall back to classic small page mappings but that is
quite suboptimal.

I also notice that sg_alloc_table_from_pages() doesn't try to prevent
crossing the 4G boundary. I remember pretty clearly that it was
explicitly forbidden to create such a crossing.

> Of course, given that the allocators tend to give out size/order-aligned
> chunks, I think you'd have to be pretty tricksy to get two allocations
> to line up either side of a large power-of-two boundary *and* go out of
> your way to then make a single request spanning both, but it's certainly
> not illegal. Realistically, the kind of "scrape together a large buffer
> from smaller pieces" code which is liable to hit a boundary-crossing
> case by sheer chance is almost certainly going to be taking the
> sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather
> than implementing its own merging and piecemeal mapping.

Yes and I think what sg_alloc_table_from_pages() does is quite wrong.

> > > Conceptually it looks pretty easy to extend the allocation constraints
> > > to cope with that - even the pathological worst case would have an
> > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > way to make that practical :(
> >
> > No we are talking about 40-ish-bits of address space, so there's a bit
> > of leeway. Of course no scheme will work if the user app tries to map
> > more than the GPU can possibly access.
> >
> > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > I think we have some margin, it's just that they can't reach our
> > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > teach Linux about every single way our HW can scatter memory across
> > nodes, so an "on demand" mechanism is by far the most flexible way to
> > deal with all configurations.
> >
> > > Maybe the best compromise would be some sort of hybrid scheme which
> > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > buffer, and invokes software bouncing for the awkward cases.
> >
> > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > cases. It would be nice if we could have some kind of guarantee that a
> > single mapping or sglist entry never crossed a specific boundary
> > though... We more or less have that for 4G already (well, we are supposed
> > to at least). Who are the main potential problematic subsystems here ?
> > I'm thinking network skb allocation pools ... and page cache if it
> > tries to coalesce entries before issuing the map request, does it ?
>
> I don't know of anything definite off-hand, but my hunch is to be most
> wary of anything wanting to do zero-copy access to large buffers in
> userspace pages. In particular, sg_alloc_table_from_pages() lacks any
> kind of boundary enforcement (and most all users don't even use the
> segment-length-limiting variant either), so I'd say any caller of that
> currently has a very small, but nonzero,
Re: DMA mappings and crossing boundaries
On 02/07/18 14:37, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:
>
> .../...
>
> Thanks Robin, I was starting to despair anybody would reply ;-)
>
> > > AFAIK, dma_alloc_coherent() is defined
> > > (Documentation/DMA-API-HOWTO.txt) as always allocating to the next
> > > power-of-2 order, so we should never have the problem unless we
> > > allocate a single chunk larger than the IOMMU page size.
> >
> > (and even then it's not *that* much of a problem, since it comes down to
> > just finding n > 1 consecutive unused IOMMU entries for exclusive use by
> > that new chunk)
>
> Yes, this case is not my biggest worry.
>
> > > For dma_map_sg() however, if a request has a single "entry"
> > > spanning such a boundary, we need to ensure that the resulting mapping
> > > is 2 contiguous "large" iommu pages as well.
> > >
> > > However, that doesn't fit well with us re-using existing mappings since
> > > they may already exist and either not be contiguous, or partially exist
> > > with no free hole around them.
> > >
> > > Now, we *could* possibly contrive a way to solve this by detecting this
> > > case and just allocating another "pair" (or set if we cross even more
> > > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > > scheme.
> > >
> > > But while doable, this introduces some serious complexity in the
> > > implementation, which I would very much like to avoid.
> > >
> > > So I was wondering if you guys thought that was ever likely to happen ?
> > > Do you see reasonable cases where dma_map_sg() would be called with a
> > > list in which a single entry crosses a 256M or 1G boundary ?
> >
> > For streaming mappings of buffers cobbled together out of any old CPU
> > pages (e.g. user memory), you may well happen to get two
> > physically-adjacent pages falling either side of an IOMMU boundary,
> > which comprise all or part of a single request - note that whilst it's
> > probably less likely than the scatterlist case, this could technically
> > happen for dma_map_{page, single}() calls too.
>
> Could it ? I wouldn't think dma_map_page is allowed to cross page
> boundaries ... what about single() ? The main worry is people using
> these things on kmalloc'ed memory

Oh, absolutely - the underlying operation is just "prepare for DMA
to/from this physically-contiguous region"; the only real difference
between map_page and map_single is for the sake of the usual "might be
highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits
on the size and offset parameters (in fact, if anyone asks I would say
the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few
months back is valid for dma_map_page too, however silly it may seem).

Of course, given that the allocators tend to give out size/order-aligned
chunks, I think you'd have to be pretty tricksy to get two allocations
to line up either side of a large power-of-two boundary *and* go out of
your way to then make a single request spanning both, but it's certainly
not illegal. Realistically, the kind of "scrape together a large buffer
from smaller pieces" code which is liable to hit a boundary-crossing
case by sheer chance is almost certainly going to be taking the
sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather
than implementing its own merging and piecemeal mapping.

> > Conceptually it looks pretty easy to extend the allocation constraints
> > to cope with that - even the pathological worst case would have an
> > absolute upper bound of 3 IOMMU entries for any one physical region -
> > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > way to make that practical :(
>
> No we are talking about 40-ish-bits of address space, so there's a bit
> of leeway. Of course no scheme will work if the user app tries to map
> more than the GPU can possibly access.
>
> But with newer AMD adding a few more bits and nVidia being at 47-bits,
> I think we have some margin, it's just that they can't reach our
> discontiguous memory with a normal 'bypass' mapping and I'd rather not
> teach Linux about every single way our HW can scatter memory across
> nodes, so an "on demand" mechanism is by far the most flexible way to
> deal with all configurations.
>
> > Maybe the best compromise would be some sort of hybrid scheme which
> > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > buffer, and invokes software bouncing for the awkward cases.
>
> Hrm... not too sure about that. I'm happy to limit that scheme to well
> known GPU vendor/device IDs, and SW bouncing is pointless in these
> cases. It would be nice if we could have some kind of guarantee that a
> single mapping or sglist entry never crossed a specific boundary
> though... We more or less have that for 4G already (well, we are supposed
> to at least). Who are the main potential problematic subsystems here ?
> I'm thinking network skb allocation pools ... and page cache if it
> tries to coalesce entries before issuing the map request, does it ?

I don't know of anything definite off-hand, but my hunch is to be most
wary of anything
Re: DMA mappings and crossing boundaries
On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

.../...

Thanks Robin, I was starting to despair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined
> > (Documentation/DMA-API-HOWTO.txt) as always allocating to the next
> > power-of-2 order, so we should never have the problem unless we
> > allocate a single chunk larger than the IOMMU page size.
>
> (and even then it's not *that* much of a problem, since it comes down to
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request has a single "entry"
> > spanning such a boundary, we need to ensure that the resulting mapping
> > is 2 contiguous "large" iommu pages as well.
> >
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> >
> > Now, we *could* possibly contrive a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> >
> > But while doable, this introduces some serious complexity in the
> > implementation, which I would very much like to avoid.
> >
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
>
> For streaming mappings of buffers cobbled together out of any old CPU
> pages (e.g. user memory), you may well happen to get two
> physically-adjacent pages falling either side of an IOMMU boundary,
> which comprise all or part of a single request - note that whilst it's
> probably less likely than the scatterlist case, this could technically
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allowed to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints
> to cope with that - even the pathological worst case would have an
> absolute upper bound of 3 IOMMU entries for any one physical region -
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> DMA addresses having only 4 1GB slots to play with, I can't really see a
> way to make that practical :(

No we are talking about 40-ish-bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory across
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which
> makes sure that one of the IOMMU entries always covers the SWIOTLB
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more or less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
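The guarantee asked for in the message above (no single mapping or sglist entry crossing a given boundary) can be illustrated with a small user-space sketch. This is not kernel code and not anything proposed in the thread: `struct seg` and `split_at_boundary()` are hypothetical names, and the boundary is assumed to be a power of two such as 256M, 1G or 4G. It only shows the arithmetic of cutting one physically contiguous region into pieces that each stay inside one aligned window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatterlist entry: a physical range. */
struct seg {
    uint64_t phys;
    uint64_t len;
};

/*
 * Split [phys, phys + len) so that no piece crosses a multiple of
 * `boundary`, which must be a power of two. Up to `max` pieces are
 * written to `out`; the return value is the number of pieces needed,
 * which may be larger than `max`.
 */
static size_t split_at_boundary(uint64_t phys, uint64_t len, uint64_t boundary,
                                struct seg *out, size_t max)
{
    size_t n = 0;

    assert(boundary && (boundary & (boundary - 1)) == 0);
    while (len) {
        /* Bytes left before the next multiple of `boundary`. */
        uint64_t room = boundary - (phys & (boundary - 1));
        uint64_t chunk = len < room ? len : room;

        if (n < max) {
            out[n].phys = phys;
            out[n].len = chunk;
        }
        n++;
        phys += chunk;
        len -= chunk;
    }
    return n;
}
```

With a 1G boundary, two adjacent 4K pages sitting either side of 0x40000000 come out as two pieces of 4K each, while a region that lies entirely inside one window comes back unchanged as a single piece.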
Re: DMA mappings and crossing boundaries
Hi Ben,

On 24/06/18 08:32, Benjamin Herrenschmidt wrote:
> Hi Folks !
>
> So, to work around issues with devices having too strict limitations in
> DMA address bits (GPUs ugh) on POWER, we've been playing with a
> mechanism that does dynamic mapping in the IOMMU but using a very large
> IOMMU page size (256M on POWER8 and 1G on POWER9) for performance.
>
> Now, with such page size, we can't just pop out new entries for every
> DMA map, we need to try to re-use entries for mappings in the same
> "area". We've prototyped something using refcounts on the entries. It
> does imply some locking which is potentially problematic, and we'll be
> looking at options there long run, but it works... so far.
>
> My worry is that it will fail if we ever get a mapping request (or
> coherent allocation request) that spans one of those giant page
> boundaries. At least with our current implementation.
>
> AFAIK, dma_alloc_coherent() is defined
> (Documentation/DMA-API-HOWTO.txt) as always allocating to the next
> power-of-2 order, so we should never have the problem unless we
> allocate a single chunk larger than the IOMMU page size.

(and even then it's not *that* much of a problem, since it comes down to
just finding n > 1 consecutive unused IOMMU entries for exclusive use by
that new chunk)

> For dma_map_sg() however, if a request has a single "entry"
> spanning such a boundary, we need to ensure that the resulting mapping
> is 2 contiguous "large" iommu pages as well.
>
> However, that doesn't fit well with us re-using existing mappings since
> they may already exist and either not be contiguous, or partially exist
> with no free hole around them.
>
> Now, we *could* possibly contrive a way to solve this by detecting this
> case and just allocating another "pair" (or set if we cross even more
> pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> scheme.
>
> But while doable, this introduces some serious complexity in the
> implementation, which I would very much like to avoid.
>
> So I was wondering if you guys thought that was ever likely to happen ?
> Do you see reasonable cases where dma_map_sg() would be called with a
> list in which a single entry crosses a 256M or 1G boundary ?

For streaming mappings of buffers cobbled together out of any old CPU
pages (e.g. user memory), you may well happen to get two
physically-adjacent pages falling either side of an IOMMU boundary,
which comprise all or part of a single request - note that whilst it's
probably less likely than the scatterlist case, this could technically
happen for dma_map_{page, single}() calls too.

Conceptually it looks pretty easy to extend the allocation constraints
to cope with that - even the pathological worst case would have an
absolute upper bound of 3 IOMMU entries for any one physical region -
but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
DMA addresses having only 4 1GB slots to play with, I can't really see a
way to make that practical :(

Maybe the best compromise would be some sort of hybrid scheme which
makes sure that one of the IOMMU entries always covers the SWIOTLB
buffer, and invokes software bouncing for the awkward cases.

Robin.
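The case described above, two physically adjacent pages falling either side of an IOMMU boundary, comes down to simple address arithmetic. The following is a user-space sketch only; `iommu_pages_spanned()` and `crosses_iommu_boundary()` are hypothetical helper names, and `shift` is assumed to be 28 for the 256M page size or 30 for the 1G page size mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of IOMMU pages of size (1 << shift) touched by the physical
 * range [phys, phys + len). Overflow of phys + len is not handled.
 */
static uint64_t iommu_pages_spanned(uint64_t phys, uint64_t len,
                                    unsigned int shift)
{
    assert(len > 0 && shift < 64);
    return ((phys + len - 1) >> shift) - (phys >> shift) + 1;
}

/* True if the range needs more than one IOMMU entry. */
static bool crosses_iommu_boundary(uint64_t phys, uint64_t len,
                                   unsigned int shift)
{
    return iommu_pages_spanned(phys, len, shift) > 1;
}
```

For example, an 8K request starting at 0x3ffff000 touches two 1G IOMMU pages even though it is only two 4K CPU pages long, which is exactly the kind of request a single re-used large entry cannot cover.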
DMA mappings and crossing boundaries
Hi Folks !

So, to work around issues with devices having too strict limitations in
DMA address bits (GPUs ugh) on POWER, we've been playing with a
mechanism that does dynamic mapping in the IOMMU but using a very large
IOMMU page size (256M on POWER8 and 1G on POWER9) for performance.

Now, with such page size, we can't just pop out new entries for every
DMA map, we need to try to re-use entries for mappings in the same
"area". We've prototyped something using refcounts on the entries. It
does imply some locking which is potentially problematic, and we'll be
looking at options there long run, but it works... so far.

My worry is that it will fail if we ever get a mapping request (or
coherent allocation request) that spans one of those giant page
boundaries. At least with our current implementation.

AFAIK, dma_alloc_coherent() is defined
(Documentation/DMA-API-HOWTO.txt) as always allocating to the next
power-of-2 order, so we should never have the problem unless we
allocate a single chunk larger than the IOMMU page size.

For dma_map_sg() however, if a request has a single "entry"
spanning such a boundary, we need to ensure that the resulting mapping
is 2 contiguous "large" iommu pages as well.

However, that doesn't fit well with us re-using existing mappings since
they may already exist and either not be contiguous, or partially exist
with no free hole around them.

Now, we *could* possibly contrive a way to solve this by detecting this
case and just allocating another "pair" (or set if we cross even more
pages) of IOMMU pages elsewhere, thus partially breaking our re-use
scheme.

But while doable, this introduces some serious complexity in the
implementation, which I would very much like to avoid.

So I was wondering if you guys thought that was ever likely to happen ?
Do you see reasonable cases where dma_map_sg() would be called with a
list in which a single entry crosses a 256M or 1G boundary ?

Cheers,
Ben.
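The re-use scheme described in this message can be pictured with a toy user-space model. This is a guess at the general shape, not the actual prototype: `struct big_entry`, `big_map()`, `big_unmap()`, the eight-slot table and the 256M shift are all assumptions made for illustration, and the locking the message worries about is left out entirely. It shows why a request that stays inside one large page can share an entry, and why one that crosses a large-page boundary does not fit the simple model.

```c
#include <assert.h>
#include <stdint.h>

#define IOMMU_PAGE_SHIFT 28u /* 256M, the POWER8 size mentioned above */
#define NR_ENTRIES 8u        /* hypothetical, tiny table for illustration */

struct big_entry {
    uint64_t phys_pfn; /* physical address >> IOMMU_PAGE_SHIFT */
    unsigned int refs; /* 0 means the slot is free */
};

static struct big_entry table[NR_ENTRIES];

/*
 * Return the slot covering the large page that contains the range,
 * re-using a live slot when one already maps it. Returns -1 if the
 * table is full or the range crosses a large-page boundary, which is
 * the case this simple model cannot handle.
 */
static int big_map(uint64_t phys, uint64_t len)
{
    uint64_t pfn = phys >> IOMMU_PAGE_SHIFT;
    int free_slot = -1;

    assert(len > 0);
    if (((phys + len - 1) >> IOMMU_PAGE_SHIFT) != pfn)
        return -1; /* spans two or more large pages */

    for (unsigned int i = 0; i < NR_ENTRIES; i++) {
        if (table[i].refs && table[i].phys_pfn == pfn) {
            table[i].refs++; /* share the existing entry */
            return (int)i;
        }
        if (!table[i].refs && free_slot < 0)
            free_slot = (int)i;
    }
    if (free_slot < 0)
        return -1;
    table[free_slot].phys_pfn = pfn;
    table[free_slot].refs = 1;
    return free_slot;
}

/* Drop one reference; the slot becomes free when the count hits 0. */
static void big_unmap(int slot)
{
    assert(slot >= 0 && (unsigned int)slot < NR_ENTRIES);
    assert(table[slot].refs > 0);
    table[slot].refs--;
}
```

Two 4K mappings at 0x10001000 and 0x1fff0000 land in the same 256M page and share one slot with a refcount of 2, whereas an 8K mapping at 0x1ffff000 straddles the 0x20000000 boundary and is rejected.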