Re: Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Jason Gunthorpe
On Wed, Nov 04, 2020 at 03:09:04PM +, Bernard Metzler wrote:

> lkey of zero to pass a physical buffer, only allowed for
> kernel applications? Very nice idea I think.

It already exists, it is called the local_dma_lkey, just set
IB_DEVICE_LOCAL_DMA_LKEY and provide the value you want to use
in device->local_dma_lkey

> btw.
> It would even get the vain blessing of the old IETF RDMA
> verbs draft ;)
> 
> https://tools.ietf.org/html/draft-hilland-rddp-verbs-00#page-90

IBTA standadized this, they just didn't require HW to use 0 as the
lkey.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Jason Gunthorpe
On Wed, Nov 04, 2020 at 05:31:35PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 04, 2020 at 11:52:55AM -0400, Jason Gunthorpe wrote:
> > It could work, I think a resonable ULP API would be to have some
> > 
> >  rdma_fill_ib_sge_from_sgl()
> >  rdma_map_sge_single()
> >  etc etc
> > 
> > ie instead of wrappering the DMA API as-is we have a new API that
> > directly builds the ib_sge. It always fills the local_dma_lkey from
> > the pd, so it knows it is doing DMA from local kernel memory.
> 
> Yeah.
> 
> > Logically SW devices then have a local_dma_lkey MR that has an IOVA of
> > the CPU physical address space, not the DMA address space as HW
> > devices have. The ib_sge builders can know this detail and fill in
> > addr from either a cpu phyical or a dma map.
> 
> I don't think the builders are the right place to do it - it really
> should to be in the low-level drivers for a bunch of reasons:

At this point we have little choice, the ULP is responsible for
map/unmap because the ULP owns the CQ and (batch) unmap is triggered
by some CQE.

Reworking all drivers to somehow keep track of unmaps a CQEs triggers
feels so hard at this point as to be impossible. It is why the
rdma_rw_ctx basically exists.

So we have to keep the current arrangment, when the ib_sge is built
the dma map must be conditionally done.

>  1) this avoids doing the dma_map when no DMA is performed, e.g. for
> mlx5 when send data is in the extended WQE

At least in the kernel, the ULP has to be involved today in
IB_SEND_INLINE. Userspace does an auto-copy, but that is because it
doesn't have dma map/unmap. 

Without unmap tracking as above the caller must supply a specially
formed ib_sge when using IB_SEND_INLINE that specifies the CPU vaddr
so memcpy works.

But this all looks like dead code, no ULP sets IB_SEND_INLINE ??

>  2) to deal with the fact that dma mapping reduces the number of SGEs.
> When the system uses a modern IOMMU we'll always end up with a
> single IOVA range no matter how many pages were mapped originally.
> This means any MR process can actually be consolidated to use
> a single SGE with the local lkey.

Any place like rdma_rw_ctx_init() that decides dynamically between SGE
and MR becomes a mess. It would be manageable if rdma_rw_ctx_init()
was the only place doing this..

I haven't looked lately, but I wonder if it is feasible that all the
MR users would use this API?

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Christoph Hellwig
On Wed, Nov 04, 2020 at 11:52:55AM -0400, Jason Gunthorpe wrote:
> It could work, I think a resonable ULP API would be to have some
> 
>  rdma_fill_ib_sge_from_sgl()
>  rdma_map_sge_single()
>  etc etc
> 
> ie instead of wrappering the DMA API as-is we have a new API that
> directly builds the ib_sge. It always fills the local_dma_lkey from
> the pd, so it knows it is doing DMA from local kernel memory.

Yeah.

> Logically SW devices then have a local_dma_lkey MR that has an IOVA of
> the CPU physical address space, not the DMA address space as HW
> devices have. The ib_sge builders can know this detail and fill in
> addr from either a cpu phyical or a dma map.

I don't think the builders are the right place to do it - it really
should to be in the low-level drivers for a bunch of reasons:

 1) this avoids doing the dma_map when no DMA is performed, e.g. for
mlx5 when send data is in the extended WQE
 2) to deal with the fact that dma mapping reduces the number of SGEs.
When the system uses a modern IOMMU we'll always end up with a
single IOVA range no matter how many pages were mapped originally.
This means any MR process can actually be consolidated to use
a single SGE with the local lkey.

Note that 2 implies a somewhat more complicated API, where the ULP
attempts to create a MR, but the core/driver will tell it that it didn't
need a MR at all.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Jason Gunthorpe
On Wed, Nov 04, 2020 at 03:01:08PM +0100, Christoph Hellwig wrote:

> > Sigh. I think the proper fix is to replace addr/length with a
> > scatterlist pointer in the struct ib_sge, then have SW drivers
> > directly use the page pointer properly.
> 
> The proper fix is to move the DMA mapping into the RDMA core, yes.
> And as you said it will be hard.  But I don't think scatterlists
> are the right interface.  IMHO we can keep re-use the existing
> struct ib_sge:
> 
> struct ib_ge {
>   u64 addr;
>   u32 length;
>   u32 lkey;
> };

Gah, right, this is all about local_dma_lkey..
 
> with the difference that if lkey is not a MR, addr is the physical
> address of the memory, not a dma_addr_t or virtual address.

It could work, I think a resonable ULP API would be to have some

 rdma_fill_ib_sge_from_sgl()
 rdma_map_sge_single()
 etc etc

ie instead of wrappering the DMA API as-is we have a new API that
directly builds the ib_sge. It always fills the local_dma_lkey from
the pd, so it knows it is doing DMA from local kernel memory.

Logically SW devices then have a local_dma_lkey MR that has an IOVA of
the CPU physical address space, not the DMA address space as HW
devices have. The ib_sge builders can know this detail and fill in
addr from either a cpu phyical or a dma map.

The SW device has to translate the addr/length in CPU space to
something else. It actually makes reasonable sense architecturally.

This is actually much less horrible than I thought..

Convert all ULPs to one of these new APIs, searching for
local_dma_lkey will find all places. This will replace a whole lot of
calls to ib DMA API wrapper functions. Searching for local_dma_lkey
will find all users. Drivers already work with sge.addr == CPU
address, so no driver change

Then to kill the dma_ops wrappers the remaining users should all be
connected to map_mr_sg. In this case we want a no-op dma map and fix
the three map_mr_sg's to use the page side of the sgl, not the DMA
side

Not as horrible as I imagined at first, actually..

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Bernard Metzler
-"Christoph Hellwig"  wrote: -

>To: "Jason Gunthorpe" 
>From: "Christoph Hellwig" 
>Date: 11/04/2020 03:02PM
>Cc: "Christoph Hellwig" , "Bjorn Helgaas"
>, "Logan Gunthorpe" ,
>linux-r...@vger.kernel.org, linux-...@vger.kernel.org,
>iommu@lists.linux-foundation.org
>Subject: [EXTERNAL] Re: [PATCH 2/5] RDMA/core: remove use of
>dma_virt_ops
>
>On Wed, Nov 04, 2020 at 09:42:41AM -0400, Jason Gunthorpe wrote:
>> On Wed, Nov 04, 2020 at 10:50:49AM +0100, Christoph Hellwig wrote:
>> 
>> > +int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist
>*sg, int nents)
>> > +{
>> > +  struct scatterlist *s;
>> > +  int i;
>> > +
>> > +  for_each_sg(sg, s, nents, i) {
>> > +  sg_dma_address(s) = (uintptr_t)sg_virt(s);
>> > +  sg_dma_len(s) = s->length;
>> 
>> Hmm.. There is nothing ensuring the page is mapped here for this
>> sg_virt(). Before maybe some of the kconfig stuff prevented highmem
>> systems indirectly, I wonder if we should add something more direct
>to
>> exclude highmem for these drivers?
>
>I had actually noticed this earlier as well and then completely
>forgot
>about it..
>
>rdmavt depends on X86_64, so it can't be used with highmem, but for
>rxe and siw there weren't any such dependencies so I think we were
>just
>lucky.  Let me send a fix to add explicit depencies and then respin
>this
>series on top of that..
>
>> Sigh. I think the proper fix is to replace addr/length with a
>> scatterlist pointer in the struct ib_sge, then have SW drivers
>> directly use the page pointer properly.
>
>The proper fix is to move the DMA mapping into the RDMA core, yes.
>And as you said it will be hard.  But I don't think scatterlists
>are the right interface.  IMHO we can keep re-use the existing
>struct ib_sge:
>
>struct ib_ge {
>   u64 addr;
>   u32 length;
>   u32 lkey;
>};
>
>with the difference that if lkey is not a MR, addr is the physical
>address of the memory, not a dma_addr_t or virtual address.
>

lkey of zero to pass a physical buffer, only allowed for
kernel applications? Very nice idea I think.

btw.
It would even get the vain blessing of the old IETF RDMA
verbs draft ;)

https://tools.ietf.org/html/draft-hilland-rddp-verbs-00#page-90


(section '7.2.1 STag of zero' - read lkey for STag)


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Christoph Hellwig
On Wed, Nov 04, 2020 at 09:42:41AM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 04, 2020 at 10:50:49AM +0100, Christoph Hellwig wrote:
> 
> > +int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int 
> > nents)
> > +{
> > +   struct scatterlist *s;
> > +   int i;
> > +
> > +   for_each_sg(sg, s, nents, i) {
> > +   sg_dma_address(s) = (uintptr_t)sg_virt(s);
> > +   sg_dma_len(s) = s->length;
> 
> Hmm.. There is nothing ensuring the page is mapped here for this
> sg_virt(). Before maybe some of the kconfig stuff prevented highmem
> systems indirectly, I wonder if we should add something more direct to
> exclude highmem for these drivers?

I had actually noticed this earlier as well and then completely forgot
about it..

rdmavt depends on X86_64, so it can't be used with highmem, but for
rxe and siw there weren't any such dependencies so I think we were just
lucky.  Let me send a fix to add explicit depencies and then respin this
series on top of that..

> Sigh. I think the proper fix is to replace addr/length with a
> scatterlist pointer in the struct ib_sge, then have SW drivers
> directly use the page pointer properly.

The proper fix is to move the DMA mapping into the RDMA core, yes.
And as you said it will be hard.  But I don't think scatterlists
are the right interface.  IMHO we can keep re-use the existing
struct ib_sge:

struct ib_ge {
u64 addr;
u32 length;
u32 lkey;
};

with the difference that if lkey is not a MR, addr is the physical
address of the memory, not a dma_addr_t or virtual address.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] RDMA/core: remove use of dma_virt_ops

2020-11-04 Thread Jason Gunthorpe
On Wed, Nov 04, 2020 at 10:50:49AM +0100, Christoph Hellwig wrote:

> +int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int 
> nents)
> +{
> + struct scatterlist *s;
> + int i;
> +
> + for_each_sg(sg, s, nents, i) {
> + sg_dma_address(s) = (uintptr_t)sg_virt(s);
> + sg_dma_len(s) = s->length;

Hmm.. There is nothing ensuring the page is mapped here for this
sg_virt(). Before maybe some of the kconfig stuff prevented highmem
systems indirectly, I wonder if we should add something more direct to
exclude highmem for these drivers?

Sigh. I think the proper fix is to replace addr/length with a
scatterlist pointer in the struct ib_sge, then have SW drivers
directly use the page pointer properly.

Then just delete this stuff, all drivers need is a noop dmaops

Looks a lot hard though, so we should probably go ahead with this.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu