Re: DMA mappings and crossing boundaries

2018-07-19 Thread Benjamin Herrenschmidt
On Wed, 2018-07-11 at 14:51 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> > 
> > > Could it ? I wouldn't think dma_map_page is allowed to cross page
> > > boundaries ... what about single() ? The main worry is people using
> > > these things on kmalloc'ed memory
> > 
> > Oh, absolutely - the underlying operation is just "prepare for DMA 
> > to/from this physically-contiguous region"; the only real difference 
> > between map_page and map_single is for the sake of the usual "might be 
> > highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
> > on the size and offset parameters (in fact, if anyone asks I would say 
> > the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
> > months back is valid for dma_map_page too, however silly it may seem).
> 
> I think this is a very bad idea though. We should strive to avoid
> coalescing before the iommu layer and we should avoid creating sglists
> that cross natural alignment boundaries.

Ping ? Jens, Christoph ?

I really really don't like how sg_alloc_table_from_pages() coalesces
the sglist before it gets mapped. This will potentially create huge
constraints and fragmentation in the IOMMU allocator by passing large
chunks to it.
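
For illustration, a caller can at least bound the damage today with the
segment-length-limiting variant Robin mentions later in the thread. A
minimal sketch, assuming the __sg_alloc_table_from_pages() signature of
this era, with SZ_256M standing in for the POWER8 large IOMMU page size
so that no coalesced segment can span more than two large pages:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/sizes.h>

/*
 * Sketch only: cap each coalesced segment at the large IOMMU page size
 * so dma_map_sg() is never handed an arbitrarily large chunk.
 */
static int map_pages_capped(struct device *dev, struct sg_table *sgt,
			    struct page **pages, unsigned int n_pages)
{
	int ret, nents;

	ret = __sg_alloc_table_from_pages(sgt, pages, n_pages, 0,
					  (unsigned long)n_pages * PAGE_SIZE,
					  SZ_256M, GFP_KERNEL);
	if (ret)
		return ret;

	nents = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	return nents ? 0 : -ENOMEM;
}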

> I had a look at sg_alloc_table_from_pages() and it basically freaks me
> out.
> 
> Back in the old days, we used to have the block layer aggressively
> coalesce requests iirc, but that was changed to let the iommu layer do
> it.
> 
> If you pass to dma_map_sg() something already heavily coalesced, such
> as what sg_alloc_table_from_pages() can do since it seems to have
> absolutely no limits there, you will create a significant fragmentation
> problem in the iommu allocator.
> 
> Instead, we should probably avoid such coalescing at that level and
> instead let the iommu opportunistically coalesce at the point of
> mapping which it does just fine.
> 
> We've been racking our brains here and we can't find a way to implement
> what we want that is both performance-efficient (no global locks
> etc...) and works with boundary-crossing mappings.
> 
> We could always fall back to classic small page mappings but that is
> quite suboptimal.
> 
> I also notice that sg_alloc_table_from_pages() doesn't try to prevent
> crossing the 4G boundary. I remember pretty clearly that it was
> explicitly forbidden to create such a crossing.
> 
> > Of course, given that the allocators tend to give out size/order-aligned 
> > chunks, I think you'd have to be pretty tricksy to get two allocations 
> > to line up either side of a large power-of-two boundary *and* go out of 
> > your way to then make a single request spanning both, but it's certainly 
> > not illegal. Realistically, the kind of "scrape together a large buffer 
> > from smaller pieces" code which is liable to hit a boundary-crossing 
> > case by sheer chance is almost certainly going to be taking the 
> > sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
> > than implementing its own merging and piecemeal mapping.
> 
> Yes, and I think what sg_alloc_table_from_pages() does is quite wrong.
> 
> > > > Conceptually it looks pretty easy to extend the allocation constraints
> > > > to cope with that - even the pathological worst case would have an
> > > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > > way to make that practical :(
> > > 
> > > No, we are talking about 40-ish bits of address space, so there's a bit
> > > of leeway. Of course no scheme will work if the user app tries to map
> > > more than the GPU can possibly access.
> > > 
> > > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > > I think we have some margin, it's just that they can't reach our
> > > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > > teach Linux about every single way our HW can scatter memory across
> > > nodes, so an "on demand" mechanism is by far the most flexible way to
> > > deal with all configurations.
> > > 
> > > > Maybe the best compromise would be some sort of hybrid scheme which
> > > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > > buffer, and invokes software bouncing for the awkward cases.
> > > 
> > > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > > cases. It would be nice if we could have some kind of guarantee that a
> > > single mapping or sglist entry never crossed a specific boundary
> > > though... We more or less have that for 4G already (well, we are supposed
> > > to at least). Who are the main potential problematic subsystems here ?
> > > I'm thinking network skb allocation pools ... and page cache if it
> > tries to coalesce entries before issuing the map request, does it ?

Re: DMA mappings and crossing boundaries

2018-07-10 Thread Benjamin Herrenschmidt
On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> 
> > Could it ? I wouldn't think dma_map_page is allowed to cross page
> > boundaries ... what about single() ? The main worry is people using
> > these things on kmalloc'ed memory
> 
> Oh, absolutely - the underlying operation is just "prepare for DMA 
> to/from this physically-contiguous region"; the only real difference 
> between map_page and map_single is for the sake of the usual "might be 
> highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
> on the size and offset parameters (in fact, if anyone asks I would say 
> the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
> months back is valid for dma_map_page too, however silly it may seem).

I think this is a very bad idea though. We should strive to avoid
coalescing before the iommu layer and we should avoid creating sglists
that cross natural alignment boundaries.

I had a look at sg_alloc_table_from_pages() and it basically freaks me
out.

Back in the old days, we used to have the block layer aggressively
coalesce requests iirc, but that was changed to let the iommu layer do
it.

If you pass to dma_map_sg() something already heavily coalesced, such
as what sg_alloc_table_from_pages() can do since it seems to have
absolutely no limits there, you will create a significant fragmentation
problem in the iommu allocator.

Instead, we should probably avoid such coalescing at that level and
instead let the iommu opportunistically coalesce at the point of
mapping which it does just fine.
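
As a sketch of the shape such map-time coalescing takes (names here are
hypothetical; iommu_program_run() stands in for whatever programs the
hardware): walk the sglist, extend the current run while entries stay
physically contiguous within one large IOMMU page, and only then touch
the IOMMU:

#include <linux/scatterlist.h>
#include <linux/types.h>

#define LARGE_SHIFT	28	/* 256M large IOMMU pages (POWER8) */

/* hypothetical: map one physically-contiguous run into the IOMMU */
static void iommu_program_run(phys_addr_t start, size_t len)
{
	/* insert/refcount the large TCE(s) covering [start, start + len) */
}

/*
 * Coalesce at map time: extend the current run while the next entry is
 * physically contiguous and the run stays within one large IOMMU page,
 * otherwise flush the run and start a new one.
 */
static void map_sg_coalescing(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	phys_addr_t start = 0, end = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t p = sg_phys(sg);

		if (end && p == end &&
		    (start >> LARGE_SHIFT) ==
		    ((end + sg->length - 1) >> LARGE_SHIFT)) {
			end += sg->length;
			continue;
		}
		if (end)
			iommu_program_run(start, end - start);
		start = p;
		end = p + sg->length;
	}
	if (end)
		iommu_program_run(start, end - start);
}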

We've been racking our brains here and we can't find a way to implement
what we want that is both performance-efficient (no global locks
etc...) and works with boundary-crossing mappings.

We could always fall back to classic small page mappings but that is
quite suboptimal.

I also notice that sg_alloc_table_from_pages() doesn't try to prevent
crossing the 4G boundary. I remember pretty clearly that it was
explicitly forbidden to create such a crossing.
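
For reference, the DMA API does already carry a per-device segment
boundary mask (via dma_parms) which the block layer honours when
merging; sg_alloc_table_from_pages() simply never consults it. A small
sketch of declaring and testing the classic 4G rule:

#include <linux/dma-mapping.h>

/* declare that no DMA segment may straddle a 4G-aligned boundary */
static int declare_4g_boundary(struct device *dev)
{
	return dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
}

/* true if [addr, addr + len) crosses the device's segment boundary */
static bool seg_crosses_boundary(struct device *dev, dma_addr_t addr,
				 size_t len)
{
	u64 mask = dma_get_seg_boundary(dev);

	return (addr & ~mask) != ((addr + len - 1) & ~mask);
}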

> Of course, given that the allocators tend to give out size/order-aligned 
> chunks, I think you'd have to be pretty tricksy to get two allocations 
> to line up either side of a large power-of-two boundary *and* go out of 
> your way to then make a single request spanning both, but it's certainly 
> not illegal. Realistically, the kind of "scrape together a large buffer 
> from smaller pieces" code which is liable to hit a boundary-crossing 
> case by sheer chance is almost certainly going to be taking the 
> sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
> than implementing its own merging and piecemeal mapping.

Yes, and I think what sg_alloc_table_from_pages() does is quite wrong.

> > > Conceptually it looks pretty easy to extend the allocation constraints
> > > to cope with that - even the pathological worst case would have an
> > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > way to make that practical :(
> > 
> > No, we are talking about 40-ish bits of address space, so there's a bit
> > of leeway. Of course no scheme will work if the user app tries to map
> > more than the GPU can possibly access.
> > 
> > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > I think we have some margin, it's just that they can't reach our
> > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > teach Linux about every single way our HW can scatter memory across
> > nodes, so an "on demand" mechanism is by far the most flexible way to
> > deal with all configurations.
> > 
> > > Maybe the best compromise would be some sort of hybrid scheme which
> > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > buffer, and invokes software bouncing for the awkward cases.
> > 
> > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > cases. It would be nice if we could have some kind of guarantee that a
> > single mapping or sglist entry never crossed a specific boundary
> > though... We more or less have that for 4G already (well, we are supposed
> > to at least). Who are the main potential problematic subsystems here ?
> > I'm thinking network skb allocation pools ... and page cache if it
> > tries to coalesce entries before issuing the map request, does it ?
> 
> I don't know of anything definite off-hand, but my hunch is to be most 
> wary of anything wanting to do zero-copy access to large buffers in 
> userspace pages. In particular, sg_alloc_table_from_pages() lacks any 
> kind of boundary enforcement (and almost all users don't even use the
> segment-length-limiting variant either), so I'd say any caller of that 
> currently has a very small, but nonzero, 

Re: DMA mappings and crossing boundaries

2018-07-04 Thread Robin Murphy

On 02/07/18 14:37, Benjamin Herrenschmidt wrote:

> On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:
> 
>   .../...
> 
> Thanks Robin, I was starting to despair anybody would reply ;-)
> 
> > > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > > should never have the problem unless we allocate a single chunk larger
> > > than the IOMMU page size.
> > 
> > (and even then it's not *that* much of a problem, since it comes down to
> > just finding n > 1 consecutive unused IOMMU entries for exclusive use by
> > that new chunk)
> 
> Yes, this case is not my biggest worry.
> 
> > > For dma_map_sg() however, if a request has a single "entry"
> > > spanning such a boundary, we need to ensure that the resulting mapping
> > > is 2 contiguous "large" iommu pages as well.
> > > 
> > > However, that doesn't fit well with us re-using existing mappings since
> > > they may already exist and either not be contiguous, or partially exist
> > > with no free hole around them.
> > > 
> > > Now, we *could* possibly contrive a way to solve this by detecting this
> > > case and just allocating another "pair" (or set if we cross even more
> > > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > > scheme.
> > > 
> > > But while doable, this introduces some serious complexity in the
> > > implementation, which I would very much like to avoid.
> > > 
> > > So I was wondering if you guys thought that was ever likely to happen ?
> > > Do you see reasonable cases where dma_map_sg() would be called with a
> > > list in which a single entry crosses a 256M or 1G boundary ?
> > 
> > For streaming mappings of buffers cobbled together out of any old CPU
> > pages (e.g. user memory), you may well happen to get two
> > physically-adjacent pages falling either side of an IOMMU boundary,
> > which comprise all or part of a single request - note that whilst it's
> > probably less likely than the scatterlist case, this could technically
> > happen for dma_map_{page, single}() calls too.
> 
> Could it ? I wouldn't think dma_map_page is allowed to cross page
> boundaries ... what about single() ? The main worry is people using
> these things on kmalloc'ed memory


Oh, absolutely - the underlying operation is just "prepare for DMA 
to/from this physically-contiguous region"; the only real difference 
between map_page and map_single is for the sake of the usual "might be 
highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
on the size and offset parameters (in fact, if anyone asks I would say 
the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
months back is valid for dma_map_page too, however silly it may seem).


Of course, given that the allocators tend to give out size/order-aligned 
chunks, I think you'd have to be pretty tricksy to get two allocations 
to line up either side of a large power-of-two boundary *and* go out of 
your way to then make a single request spanning both, but it's certainly 
not illegal. Realistically, the kind of "scrape together a large buffer 
from smaller pieces" code which is liable to hit a boundary-crossing 
case by sheer chance is almost certainly going to be taking the 
sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
than implementing its own merging and piecemeal mapping.



> > Conceptually it looks pretty easy to extend the allocation constraints
> > to cope with that - even the pathological worst case would have an
> > absolute upper bound of 3 IOMMU entries for any one physical region -
> > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > way to make that practical :(
> 
> No, we are talking about 40-ish bits of address space, so there's a bit
> of leeway. Of course no scheme will work if the user app tries to map
> more than the GPU can possibly access.
> 
> But with newer AMD adding a few more bits and nVidia being at 47-bits,
> I think we have some margin, it's just that they can't reach our
> discontiguous memory with a normal 'bypass' mapping and I'd rather not
> teach Linux about every single way our HW can scatter memory across
> nodes, so an "on demand" mechanism is by far the most flexible way to
> deal with all configurations.
> 
> > Maybe the best compromise would be some sort of hybrid scheme which
> > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > buffer, and invokes software bouncing for the awkward cases.
> 
> Hrm... not too sure about that. I'm happy to limit that scheme to well
> known GPU vendor/device IDs, and SW bouncing is pointless in these
> cases. It would be nice if we could have some kind of guarantee that a
> single mapping or sglist entry never crossed a specific boundary
> though... We more or less have that for 4G already (well, we are supposed
> to at least). Who are the main potential problematic subsystems here ?
> I'm thinking network skb allocation pools ... and page cache if it
> tries to coalesce entries before issuing the map request, does it ?


I don't know of anything definite off-hand, but my hunch is to be most
wary of anything wanting to do zero-copy access to large buffers in
userspace pages. In particular, sg_alloc_table_from_pages() lacks any
kind of boundary enforcement (and almost all users don't even use the
segment-length-limiting variant either), so I'd say any caller of that
currently has a very small, but nonzero,

Re: DMA mappings and crossing boundaries

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

 .../...

Thanks Robin, I was starting to despair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > should never have the problem unless we allocate a single chunk larger
> > than the IOMMU page size.
> 
> (and even then it's not *that* much of a problem, since it comes down to 
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by 
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request has a single "entry"
> > spanning such a boundary, we need to ensure that the resulting mapping
> > is 2 contiguous "large" iommu pages as well.
> > 
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> > 
> > Now, we *could* possibly contrive a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> > 
> > But while doable, this introduces some serious complexity in the
> > implementation, which I would very much like to avoid.
> > 
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
> 
> For streaming mappings of buffers cobbled together out of any old CPU 
> pages (e.g. user memory), you may well happen to get two 
> physically-adjacent pages falling either side of an IOMMU boundary, 
> which comprise all or part of a single request - note that whilst it's 
> probably less likely than the scatterlist case, this could technically 
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allowed to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints 
> to cope with that - even the pathological worst case would have an 
> absolute upper bound of 3 IOMMU entries for any one physical region - 
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
> DMA addresses having only 4 1GB slots to play with, I can't really see a 
> way to make that practical :(

No, we are talking about 40-ish bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory across
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which 
> makes sure that one of the IOMMU entries always covers the SWIOTLB 
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more or less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.



Re: DMA mappings and crossing boundaries

2018-07-02 Thread Robin Murphy

Hi Ben,

On 24/06/18 08:32, Benjamin Herrenschmidt wrote:
> Hi Folks !
> 
> So due to working around issues with devices having too-strict
> limitations in DMA address bits (GPUs, ugh) on POWER, we've been
> playing with a mechanism that does dynamic mapping in the IOMMU but
> using a very large IOMMU page size (256M on POWER8 and 1G on POWER9)
> for performance.
> 
> Now, with such page size, we can't just pop out new entries for every
> DMA map, we need to try to re-use entries for mappings in the same
> "area".
> 
> We've prototyped something using refcounts on the entries. It does
> imply some locking which is potentially problematic, and we'll be
> looking at options there long run, but it works... so far.
> 
> My worry is that it will fail if we ever get a mapping request (or
> coherent allocation request) that spans one of those giant page
> boundaries. At least with our current implementation.
> 
> AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> HOWTO.txt) as always allocating to the next power-of-2 order, so we
> should never have the problem unless we allocate a single chunk larger
> than the IOMMU page size.

(and even then it's not *that* much of a problem, since it comes down to
just finding n > 1 consecutive unused IOMMU entries for exclusive use by
that new chunk)
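
A minimal sketch of that lookup, assuming a hypothetical per-window
allocation bitmap with one bit per large IOMMU page;
bitmap_find_next_zero_area() is the stock kernel helper, and the align
mask keeps a power-of-2-sized run naturally aligned, matching the
size-aligned chunks dma_alloc_coherent() hands back:

#include <linux/bitmap.h>
#include <linux/errno.h>

/*
 * Sketch: find n consecutive unused large-page entries for an oversized
 * coherent chunk. 'used' is a hypothetical bitmap, one bit per large
 * IOMMU page in the device's DMA window.
 */
static long alloc_consecutive_entries(unsigned long *used,
				      unsigned long nr_entries,
				      unsigned int n)
{
	unsigned long idx;

	idx = bitmap_find_next_zero_area(used, nr_entries, 0, n, n - 1);
	if (idx >= nr_entries)
		return -ENOSPC;
	bitmap_set(used, idx, n);
	return idx;
}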



> For dma_map_sg() however, if a request has a single "entry"
> spanning such a boundary, we need to ensure that the resulting mapping
> is 2 contiguous "large" iommu pages as well.
> 
> However, that doesn't fit well with us re-using existing mappings since
> they may already exist and either not be contiguous, or partially exist
> with no free hole around them.
> 
> Now, we *could* possibly contrive a way to solve this by detecting this
> case and just allocating another "pair" (or set if we cross even more
> pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> scheme.
> 
> But while doable, this introduces some serious complexity in the
> implementation, which I would very much like to avoid.
> 
> So I was wondering if you guys thought that was ever likely to happen ?
> Do you see reasonable cases where dma_map_sg() would be called with a
> list in which a single entry crosses a 256M or 1G boundary ?


For streaming mappings of buffers cobbled together out of any old CPU 
pages (e.g. user memory), you may well happen to get two 
physically-adjacent pages falling either side of an IOMMU boundary, 
which comprise all or part of a single request - note that whilst it's 
probably less likely than the scatterlist case, this could technically 
happen for dma_map_{page, single}() calls too.


Conceptually it looks pretty easy to extend the allocation constraints 
to cope with that - even the pathological worst case would have an 
absolute upper bound of 3 IOMMU entries for any one physical region - 
but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
DMA addresses having only 4 1GB slots to play with, I can't really see a 
way to make that practical :(


Maybe the best compromise would be some sort of hybrid scheme which 
makes sure that one of the IOMMU entries always covers the SWIOTLB 
buffer, and invokes software bouncing for the awkward cases.


Robin.


DMA mappings and crossing boundaries

2018-06-24 Thread Benjamin Herrenschmidt
Hi Folks !

So due to working around issues with devices having too-strict
limitations in DMA address bits (GPUs, ugh) on POWER, we've been playing
with a mechanism that does dynamic mapping in the IOMMU but using a very
large IOMMU page size (256M on POWER8 and 1G on POWER9) for performance.

Now, with such page size, we can't just pop out new entries for every
DMA map, we need to try to re-use entries for mappings in the same
"area". 

We've prototyped something using refcounts on the entries. It does
imply some locking which is potentially problematic, and we'll be
looking at options there long run, but it works... so far.
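
A hypothetical sketch of such a refcounted scheme, with illustrative
names only; the spinlock here is exactly the kind of global
serialisation the locking concern is about:

#include <linux/refcount.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define LARGE_SHIFT	28	/* 256M on POWER8 */

struct large_entry {
	refcount_t users;	/* live mappings inside this large page */
};

static DEFINE_SPINLOCK(window_lock);

/* take a reference on every large page that [phys, phys + size) touches */
static void get_large_entries(struct large_entry *tbl, phys_addr_t phys,
			      size_t size)
{
	unsigned long i, first = phys >> LARGE_SHIFT;
	unsigned long last = (phys + size - 1) >> LARGE_SHIFT;

	spin_lock(&window_lock);
	for (i = first; i <= last; i++) {
		if (refcount_read(&tbl[i].users) == 0) {
			/* first user: program the large TCE for page i */
			refcount_set(&tbl[i].users, 1);
		} else {
			refcount_inc(&tbl[i].users);
		}
	}
	spin_unlock(&window_lock);
}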

My worry is that it will fail if we ever get a mapping request (or
coherent allocation request) that spans one of those giant page
boundaries. At least with our current implementation.

AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
HOWTO.txt) as always allocating to the next power-of-2 order, so we
should never have the problem unless we allocate a single chunk larger
than the IOMMU page size.

For dma_map_sg() however, if a request has a single "entry"
spanning such a boundary, we need to ensure that the resulting mapping
is 2 contiguous "large" iommu pages as well.

However, that doesn't fit well with us re-using existing mappings since
they may already exist and either not be contiguous, or partially exist
with no free hole around them.

Now, we *could* possibly contrive a way to solve this by detecting this
case and just allocating another "pair" (or set if we cross even more
pages) of IOMMU pages elsewhere, thus partially breaking our re-use
scheme.

But while doable, this introduces some serious complexity in the
implementation, which I would very much like to avoid.

So I was wondering if you guys thought that was ever likely to happen ?
Do you see reasonable cases where dma_map_sg() would be called with a
list in which a single entry crosses a 256M or 1G boundary ?
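
To quantify the question: a physically-contiguous entry needs one IOMMU
entry per large page it touches, and it touches more than one exactly
when it crosses a boundary. A sketch (an entry smaller than the large
page size can touch at most two):

#include <linux/types.h>

/* number of 2^shift-sized IOMMU pages that [phys, phys + size) touches */
static unsigned long large_pages_touched(phys_addr_t phys, size_t size,
					 unsigned int shift)
{
	return ((phys + size - 1) >> shift) - (phys >> shift) + 1;
}

/* e.g. a 64K entry ending just past a 256M boundary (shift = 28)
 * touches 2 large pages and so needs 2 contiguous IOMMU entries */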

Cheers,
Ben.
