Re: [PATCH v2 00/26] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-05-22 Thread Jason Gunthorpe
On Fri, May 18, 2018 at 03:03:47PM +0200, Roman Pen wrote:
> Hi all,
> 
> This is v2 of series, which introduces IBNBD/IBTRS modules.
> 
> This cover letter is split into three parts:
> 
> 1. Introduction, which almost repeats everything from previous cover
>letters.
> 2. Changelog.
> 3. Performance measurements on linux-4.17.0-rc2 and on two different
>Mellanox cards: ConnectX-2 and ConnectX-3 and CPUs: Intel and AMD.
> 
> 
>  Introduction
> 
> IBTRS (InfiniBand Transport) is a reliable high speed transport library
> which allows for establishing connections between client and server
> machines via RDMA. It is optimized to transfer (read/write) IO blocks
> in the sense that it follows the BIO semantics of providing the
> possibility to either write data from a scatter-gather list to the
> remote side or to request ("read") data transfer from the remote side
> into a given set of buffers.
> 
> IBTRS is multipath capable and provides I/O fail-over and load-balancing
> functionality, i.e. in IBTRS terminology an IBTRS path is a set of RDMA
> CMs, and a particular path is selected according to the load-balancing policy.
> 
> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
> (client and server) that allow for remote access of a block device on
> the server over the IBTRS protocol. After being mapped, the remote block
> devices can be accessed on the client side as local block devices.
> Internally IBNBD uses IBTRS as an RDMA transport library.
> 
> Why?
> 
>- IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>  thus the internal protocol is simple.
>- IBTRS was developed as an independent RDMA transport library, which
>  supports fail-over and load-balancing policies using multipath, thus
>  it can be used for other IO needs, not only for block
>  devices.
>- IBNBD/IBTRS is faster than NVME over RDMA.
>  Old comparison results:
>  https://www.spinics.net/lists/linux-rdma/msg48799.html
>  New comparison results: see performance measurements section below.
> 
> Key features of IBTRS transport library and IBNBD block device:
> 
> o High throughput and low latency due to:
>- Only two RDMA messages per IO.
>- IMM InfiniBand messages on responses to reduce round trip latency.
>- Simplified memory management: memory allocation happens once on
>  server side when IBTRS session is established.
> 
> o IO fail-over and load-balancing by using multipath.  According to
>   our test loads an additional path brings ~20% more bandwidth.
> 
> o Simple configuration of IBNBD:
>- Server side is completely passive: volumes do not need to be
>  explicitly exported.
>- Only the IB port GID and device path are needed on the client
>  side to map a block device.
>- A device is remapped automatically, e.g. after a storage reboot.
> 
> Commits for kernel can be found here:
>https://github.com/profitbricks/ibnbd/commits/linux-4.17-rc2
> 
> The out-of-tree modules are here:
>https://github.com/profitbricks/ibnbd/
> 
> Vault 2017 presentation:
>
> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

I think from the RDMA side, before we accept something like this, I'd
like to hear from Christoph, Chuck or Sagi that the dataplane
implementation of this is correct, eg it uses the MRs properly and
invalidates at the right time, sequences with dma_ops as required,
etc.

They all have done this work on their ULPs and it was tricky, I don't
want to see another ULP implement this wrong..

I'm already skeptical here due to the performance numbers - they are
not really what I'd expect, and we may find that invalidate changes
will bring the performance down further.

Jason


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <log...@deltatee.com> wrote:
> >>
> >>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> >>>> Regarding the switch business, it is amazing how much trouble you went
> >>>> into to limit this functionality to very specific hardware.
> >>>>
> >>>> I thought that we had reached an agreement that the code would not
> >>>> impose any limits on what the user wants.
> >>>>
> >>>> What happened to all the emails we exchanged?  
> >>>
> >>> It turns out that root ports that support P2P are far less common than 
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> > 
> > I don't think it is a hardware problem.
> 
> The latest and greatest Power9 CPUs still explicitly do not support
> this.

I think this is another case where the HW can do it but the SW support is
missing. IOMMU configuration and maybe firmware too, for instance.

If I recall I saw a presentation that Coral was expected to use P2P
between the network and GPU.

> And, if I recall correctly, the ARM64 device we played with did
> not either -- but I suspect that will differ depending on vendor.

Wouldn't surprise me at all to see broken implementations in
ARM64.. But even there it needs IOMMU enablement to work at all if I
recall.

Basically, this is probably not a HW problem that needs a HW bit, but
an OS/firmware problem to do all the enablement..

Jason


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe  wrote:
> 
> > On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > > Regarding the switch business, it is amazing how much trouble you went
> > > into to limit this functionality to very specific hardware.
> > > 
> > > I thought that we had reached an agreement that the code would not
> > > impose any limits on what the user wants.
> > > 
> > > What happened to all the emails we exchanged?  
> > 
> > It turns out that root ports that support P2P are far less common than 
> > anyone thought. So it will likely have to be a white list.
> 
> This came as a bit of a surprise to our PCIe architect.

I don't think it is a hardware problem.

I know Mellanox and Nvidia have been doing p2p on Intel root complexes
for something like 5-6 years now.. I don't have the details, but it
does seem to work.

I have heard some chips give poor performance..

Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche
feature in Intel/AMD land.

I think the main issue here is that there is some BIOS involvement to
set things up properly. Eg in GPU land motherboards certify for
'crossfire' support.

> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered.  This might
> long term avoid the need to maintain the white list for new devices.

If it is primarily a BIOS issue then it should be an ACPI thing, right?

Jason


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Jason Gunthorpe
On Mon, Mar 05, 2018 at 01:42:12PM -0700, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> > So when reading the above mlx code, we see the first wmb() being used
> > to ensure that CPU stores to cachable memory are visible to the DMA
> > triggered by the doorbell ring.
> 
> IIUC, we don't need a similar barrier for NVMe to ensure memory is
> visible to DMA since the SQE memory is allocated DMA coherent when the
> SQ is not within a CMB.

You still need it.

DMA coherent just means you don't need to call the DMA API after
writing, it says nothing about CPU ordering.

Eg on x86 DMA coherent is just normal system memory, and you do need
the SFENCE between the system memory stores and the DMA-triggering MMIO,
apparently.
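
A minimal sketch of the shape that is required, assuming a hypothetical
driver with a DMA coherent submission queue and a UC doorbell register
(sq->sqes, sq->db and cmd are made-up names here, this is not the actual
nvme-pci code):

	/* CPU store into the DMA coherent SQE memory */
	memcpy(&sq->sqes[tail], cmd, sizeof(*cmd));

	/*
	 * Order the cachable store above before the MMIO write that
	 * triggers the DMA - this is the SFENCE on x86, a dsb on ARM.
	 */
	wmb();

	/* Doorbell write, the device may fetch the SQE after this */
	writel(tail, sq->db);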

Jason


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Jason Gunthorpe
On Mon, Mar 05, 2018 at 09:57:27PM +0200, Sagi Grimberg wrote:

> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB update
> ordering always guaranteed?
> 
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> --
> qp->sq.head += nreq;
> 
> /*
>  * Make sure that descriptors are written before
>  * doorbell record.
>  */
> wmb();
> 
> writel(qp->doorbell_qpn,
>to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
> 
> /*
>  * Make sure doorbells don't leak out of SQ spinlock
>  * and reach the HCA out of order.
>  */
> mmiowb();
> --
> 
> That seems to explicitly make sure to place a barrier before updating
> the doorbell. So as I see it, either ordering is guaranteed and the
> above code is redundant, or nvme needs to do the same.

A wmb() is always required before operations that can trigger DMA.

The reason ARM has a barrier in writel() is not to make it ordered
with respect to CPU stores to cachable memory, but to make it ordered
with respect to *other* writels.

Eg Linux defines this:

writel(A, mem);
writel(B, mem);

To always produce two TLPs on PCI-E when mem is UC BAR memory.

And ARM cannot guarantee that without the extra barrier.

So now we see stuff like this:

writel_relaxed(A, mem);
writel_relaxed(B, mem+4);

Which says the TLPs to A and B can be issued in any order..

So when reading the above mlx code, we see the first wmb() being used
to ensure that CPU stores to cachable memory are visible to the DMA
triggered by the doorbell ring.

The mmiowb() is used to ensure that DB writes are not combined and not
issued in any order other than implied by the lock that encloses the
whole thing. This is needed because uar_map is WC memory.

We don't have ordering with respect to two writel's here, so if ARM
performance was a concern the writel could be switched to
writel_relaxed().
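
Ie something like this sketch, which is just the quoted mlx4 sequence
with the relaxed accessor substituted (not a proposed patch):

	qp->sq.head += nreq;

	/* Order the descriptor stores before the doorbell DMA trigger */
	wmb();

	/* No ordering against other writel()s is needed, so relaxed is fine */
	writel_relaxed(qp->doorbell_qpn,
		       to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);

	/* Keep the WC doorbell from leaking out of the SQ spinlock */
	mmiowb();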

Presumably nvme has similar requirements, although I guess the DB
register is mapped UC not WC?

Jason


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-02 Thread Jason Gunthorpe
On Thu, Mar 01, 2018 at 11:57:43PM +, Stephen  Bates wrote:
> > We don't want to lump these all together without knowing which region 
> > you're allocating from, right?
> 
> In all seriousness I do agree with you on these Keith in the long
> term. We would consider adding property flags for the memory as it
> is added to the p2p core and then the allocator could evolve to
> intelligently dish it out. Attributes like endurance, latency and
> special write commit requirements could all become attributes in
> time. Perhaps one more reason for a central entity for p2p memory
> allocation so this code does not end up having to go into many
> different drivers?

I fear we will find that every kind of P2P memory has special
allocator requirements and dumping it all into one giant pool is not
going to be helpful.

This allocator already seems not useful for the P2P target memory
on a Mellanox NIC due to the way it has a special allocation flow
(windowing) and special usage requirements..

Nor can it be useful for the doorbell memory in the NIC.

Both of these are existing use cases for P2P with out of tree patches..

The allocator seems to only be able to solve the CMB problem, and I
think due to that it is better to put this allocator in the NVMe
driver and not the core code.. At least until we find a 2nd user that
needs the same allocation scheme...

Jason


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Jason Gunthorpe
On Thu, Mar 01, 2018 at 11:00:51PM +, Stephen  Bates wrote:

> > Seems like a very subtle and hard to debug performance trap to leave
> > for the users, and pretty much the only reason to use P2P is
> > performance... So why have such a dangerous interface?
> 
> P2P is about offloading the memory and PCI subsystem of the host CPU
> and this is achieved no matter which p2p_dev is used.

No, locality matters. If you have a bunch of NICs and a bunch of drives
and the allocator chooses to put all P2P memory on a single drive your
performance will suck horribly even if all the traffic is offloaded.

Performance will suck if you have speed differences between the PCI-E
devices. Eg a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
will not reach full performance unless the p2p buffers are properly
balanced between all cards.

This is why I think putting things in one big pool and pretending like
any P2P mem is interchangeable with any other makes zero sense to me;
it is not how the system actually works. Proper locality matters a
lot.

Jason


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Jason Gunthorpe
On Thu, Mar 01, 2018 at 12:27:03PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> >On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> >This is also why I don't entirely understand why this series has a
> >generic allocator for p2p mem, it makes little sense to me.
> 
> >Why wouldn't the nvme driver just claim the entire CMB of its local
> >device for its own use? Why involve p2p core code in this?
> 
> We'd prefer to have a generic way to get p2pmem instead of restricting
> ourselves to only using CMBs. We did work in the past where the P2P memory
> was part of an IB adapter and not the NVMe card. So this won't work if it's
> an NVMe only interface.

It just seems like it is making it too complicated.

In 99% of cases locality is important and you don't want to accidentally
use a buffer in a 3rd device that has no relation to the two doing the
transfer just because it happened to be returned by the generic
allocator.

I can appreciate you might have some special use case for that, but it
absolutely should require special configuration and not just magically
happen.

You bring up IB adaptor memory - if we put that into the allocator
then what is to stop the NVMe driver just using it instead of the CMB
buffer? That would be totally wrong in almost all cases.

Seems like a very subtle and hard to debug performance trap to leave
for the users, and pretty much the only reason to use P2P is
performance... So why have such a dangerous interface?

Jason


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jason Gunthorpe
On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> on ppc64 (maybe via an arch hook as it might depend on the processor
> family). Server powerpc cannot do cachable accesses on IO memory
> (unless it's special OpenCAPI or nVlink, but not on PCIe).

I think you are right on this - even on x86 we must not create
cachable mappings of PCI BARs - there is no way that works the way
anyone would expect.

I think this series doesn't have a problem here only because it never
touches the BAR pages with the CPU.

BAR memory should be mapped into the CPU as WC at best on all arches..
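
Ie if a driver does want a CPU mapping of such a BAR it should be a WC
(or UC) ioremap, never a cachable one - a rough sketch, the BAR index is
just an example:

	void __iomem *p2p_mem;

	/* WC mapping of the BAR, never MEMREMAP_WB for PCI BAR space */
	p2p_mem = ioremap_wc(pci_resource_start(pdev, 4),
			     pci_resource_len(pdev, 4));
	if (!p2p_mem)
		return -ENOMEM;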

Jason


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Jason Gunthorpe
On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> 
> >On 01/03/18 04:03 AM, Sagi Grimberg wrote:
> >>Can you describe what would be the plan to have it when these devices
> >>do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> >>and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> >>ns can prefer to use its own cmb instead of locating a p2p_dev device?
> >
> >The patchset already supports CMB drives. That's essentially what patch 7
> >is for. We change the nvme-pci driver to use the p2pmem code to register
> >and manage the CMB memory. After that, it is simply available to the nvmet
> >code. We have already been using this with a couple prototype CMB drives.
> 
> The comment was to your statement:
> "Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, cards with CMB buffers
> don't seem to be available."
> 
> Maybe it's a left-over which confused me...
> 
> Anyways, my question still holds. If I rack several of these
> nvme drives, ideally we would use _their_ cmbs for I/O that is
> directed to these namespaces. This is why I was suggesting that
> p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
> p2p_dev used by all namespaces.

I agree, I don't think this series should target anything other than
using p2p memory located in one of the devices expected to participate
in the p2p transaction for a first pass..

locality is super important for p2p, so I don't think things should
start out in a way that makes specifying the desired locality hard.

This is also why I don't entirely understand why this series has a
generic allocator for p2p mem, it makes little sense to me.

Why wouldn't the nvme driver just claim the entire CMB of its local
device for its own use? Why involve p2p core code in this?

Jason


Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-06 Thread Jason Gunthorpe
On Tue, Feb 06, 2018 at 01:01:23PM +0100, Roman Penyaev wrote:

> >> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
> >> *dev)
> >> +{
> >> +   int err;
> >> +
> >> +   d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
> >> +   if (IS_ERR(d->pd))
> >> +   return PTR_ERR(d->pd);
> >> +   d->dev = dev;
> >> +   d->lkey = d->pd->local_dma_lkey;
> >> +   d->rkey = d->pd->unsafe_global_rkey;
> >> +
> >> +   err = ibtrs_query_device(d);
> >> +   if (unlikely(err))
> >> +   ib_dealloc_pd(d->pd);
> >> +
> >> +   return err;
> >> +}
> >
> >
> > I must say that this makes me frustrated.. We stopped doing these
> > sorts of things a long time ago. No way we can even consider accepting
> > the unsafe use of the global rkey exposing the entire memory space for
> > remote access permissions.
> >
> > Sorry for being blunt, but this protocol design which makes a conscious
> > decision to expose unconditionally is broken by definition.
> 
> I suppose we can also afford the same trick which nvme does: provide
> a register_always module argument, can't we?  It would also be interesting
> to measure the performance difference.

I can be firmer than Sagi - new code that has IB_PD_UNSAFE_GLOBAL_RKEY
in it cannot be accepted upstream.

Once you get that fixed, you should go and read my past comments on
how to properly order dma mapping and completions and fix that
too. Then redo your performance numbers..
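
For reference, the PD setup without the unsafe flag is simply the
following sketch - the real work is then doing proper per-IO memory
registration (e.g. FRWR) and invalidation instead of handing out a
global rkey:

	d->pd = ib_alloc_pd(dev, 0);
	if (IS_ERR(d->pd))
		return PTR_ERR(d->pd);
	d->dev = dev;
	/* local_dma_lkey is fine for local SGEs; remote access must go
	 * through properly registered and invalidated MRs, not an rkey
	 * that exposes all of memory */
	d->lkey = d->pd->local_dma_lkey;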

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-09 Thread Jason Gunthorpe
On Tue, Jan 09, 2018 at 05:46:40PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> > Pretty sure P2P capable IOMMU hardware exists.
> > 
> > With SOCs we also have the scenario where a DMA originating from an
> > on-die device targets an off-die PCI BAR (through the IOMMU);
> > that definitely exists today, and people care about it :)
> 
> Then people will have to help and contribute support for it.

Sure. But my point was all this will have to migrate under the dma_ops
for that to work, so why the resistance to using dma_ops right away?

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-08 Thread Jason Gunthorpe
On Mon, Jan 8, 2018 at 11:57 AM, Christoph Hellwig  wrote:

>> (c) setup/manage any security permissions on mappings
>> Which P2P may at some point be concerned with.
>
> Maybe once root complexes with iommus actually support P2P.  But until
> then we have a lot more important problems to solve.

Pretty sure P2P capable IOMMU hardware exists.

With SOCs we also have the scenario where a DMA originating from an
on-die device targets an off-die PCI BAR (through the IOMMU);
that definitely exists today, and people care about it :)

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-08 Thread Jason Gunthorpe
On Mon, Jan 08, 2018 at 07:34:34PM +0100, Christoph Hellwig wrote:
> > > > And on that topic, does this scheme work with HFI?
> > > 
> > > No, and I guess we need an opt-out.  HFI generally seems to be
> > > extremely weird.
> > 
> > This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> > broken, and it shouldn't be 'fixed' at the RDMA level.
> 
> I don't think rxe is a problem as it won't show up as a pci device.

Right, today's restrictions save us..

> HFI and QIB do show as PCI devices, and could be used for P2P transfers
> from the PCI point of view.  It's just that they have a layer of
> software indirection between their hardware and what is exposed at
> the RDMA layer.
> 
> So I very much disagree about where to place that workaround - the
> RDMA code is exactly the right place.

But why? RDMA is using core code to do this. It uses dma_ops in struct
device and it uses normal dma_map SG. How is it RDMA's problem that
some PCI drivers provide strange DMA ops?

Admittedly they are RDMA drivers, but it is a core mechanism they
(ab)use these days..

> > It could, if we had a DMA op for p2p then the drivers that provide
> > their own ops can implement it appropriately or not at all.
> > 
> > Eg the correct implementation for rxe to support p2p memory is
> > probably somewhat straightforward.
> 
> But P2P is _not_ a factor of the dma_ops implementation at all,
> it is something that happens behind the dma_map implementation.

Only as long as the !ACS and switch limitations are present.

Those limitations are fine to get things started, but there is going
to be a push to improve the system to remove them.

> > Very long term the IOMMUs under the ops will need to care about this,
> > so the wrapper is not an optimal place to put it - but I wouldn't
> > object if it gets it out of RDMA :)
> 
> Unless you have an IOMMU on your PCIe switch and not before/inside
> the root complex that is not correct.

I understand the proposed patches restrict things to require a switch
and not transit the IOMMU.

But *very long term* P2P will need to work with paths that transit the
system IOMMU and root complex.

This already exists as out-of-tree functionality that has been deployed
in production for years and years that does P2P through the root
complex with the IOMMU turned off.

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-08 Thread Jason Gunthorpe
On Mon, Jan 08, 2018 at 11:17:38AM -0700, Logan Gunthorpe wrote:
 
> >>If at all it should be in the dma_map* wrappers, but for that we'd need
> >>a good identifier.  And it still would not solve the whole fake dma
> >>ops issue.
> >
> >Very long term the IOMMUs under the ops will need to care about this,
> >so the wrapper is not an optimal place to put it - but I wouldn't
> >object if it gets it out of RDMA :)
> 
> Well, creating the extra op doesn't really change anything to the RDMA patch
> in this series.

Not fundamentally, but it lets us solve the bugs the patch introduces
with hfi/etc

> The point is, for the time being, we need to track whether we are
> doing a P2P or normal mapping using a side channel as we don't have
> a good way of tracking it in the SGL at this time.

Well, that is disappointing for now, but I'm OK with the flag in the
RW interface - as long as we all sort of agree it is not desirable and
the SG should self-identify in an ideal someday future world..

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-08 Thread Jason Gunthorpe
On Mon, Jan 08, 2018 at 03:59:01PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> > Well that argument applies equally to the RDMA RW API wrappers around
> > the DMA API. I think it is fine if sgl are defined to only have P2P or
> > not, and that debugging support seemed reasonable to me..
> > 
> > > It's also very difficult to add similar functionality to dma_map_page 
> > > seeing
> > > dma_unmap_page won't have any way to know what it's dealing with. It just
> > > seems confusing to support P2P in the SG version and not the page version.
> > 
> > Well, this proposal is to support P2P in only some RDMA APIs and not
> > others, so it seems about as confusing to me..
> 
> As usual we implement what actually has a consumer.  On top of that the
> R/W API is the only core RDMA API that actually does DMA mapping for the
> ULP at the moment.

Well again the same can be said for dma_map_page vs dma_map_sg...

> For SENDs and everything else dma maps are done by the ULP (I'd like
> to eventually change that, though - e.g. sends through that are
> inline to the workqueue don't need a dma map to start with).


> That's because the initial design was to let the ULPs do the DMA
> mappings, which fundamentally is wrong.  I've fixed it for the R/W
> API when adding it, but no one has started work on SENDs and atomics.

Well, you know why it is like this, and it is very complicated to
unwind - the HW driver does not have enough information during CQ
processing to properly do any unmaps, let alone serious error tear
down unmaps, so we'd need a bunch of new APIs developed first, like RW
did. :\

> > And on that topic, does this scheme work with HFI?
> 
> No, and I guess we need an opt-out.  HFI generally seems to be
> extremely weird.

This series needs some kind of fix so HFI, QIB, rxe, etc don't get
broken, and it shouldn't be 'fixed' at the RDMA level.

> > This is why P2P must fit into the common DMA framework somehow; we
> > rely on these abstractions to work properly and fully in RDMA.
> 
> Moving P2P up to common RDMA code isn't going to fix this.  For that
> we need to stop pretending that something that isn't DMA can abuse the
> dma mapping framework, and until then opt them out of behavior that
> assumes actual DMA like P2P.

It could, if we had a DMA op for p2p then the drivers that provide
their own ops can implement it appropriately or not at all.

Eg the correct implementation for rxe to support p2p memory is
probably somewhat straightforward.

> > I think you should consider pushing this directly into the dma_ops
> > implementations. Add a p2p_supported flag to struct dma_map_ops, and
> > only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
> > Only map_sg would be supported for P2P. Upgraded implementations can
> > call the helper function.
> 
> If at all it should be in the dma_map* wrappers, but for that we'd need
> a good identifier.  And it still would not solve the whole fake dma
> ops issue.

Very long term the IOMMUs under the ops will need to care about this,
so the wrapper is not an optimal place to put it - but I wouldn't
object if it gets it out of RDMA :)

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
> 
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Whereas allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.

Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if sgl are defined to only have P2P or
not, and that debugging support seemed reasonable to me..

> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..

> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
> 
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.

It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI?

On first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.

However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses custom DMA
ops that return CPU addresses in the dma_addr_t, which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??

This is why P2P must fit into the common DMA framework somehow; we
rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.
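
Roughly this kind of thing - an upgraded ->map_sg implementation just
dispatches to the helper for P2P SGLs, and callers check p2p_supported
before handing one over. A sketch only, with made-up names (none of
this exists today):

static int foo_dma_map_sg(struct device *dev, struct scatterlist *sg,
			  int nents, enum dma_data_direction dir,
			  unsigned long attrs)
{
	/* SGLs are homogeneous, so checking the first entry is enough */
	if (is_pci_p2p_page(sg_page(sg)))
		return pci_p2pmem_map_sg(sg, nents);

	return foo_dma_map_normal_sg(dev, sg, nents, dir, attrs);
}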

Jason


Re: [PATCH 02/12] pci-p2p: Add sysfs group to display p2pmem stats

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 03:50:40PM -0600, Bjorn Helgaas wrote:
> > This is similar to /sys/bus/pci/drivers_autoprobe, but
> > affects only the VFs associated with a specific PF.
> > +
> > +What:  /sys/bus/pci/devices/.../p2pmem/available
> 
> I wonder if "p2pdma" would be a more suggestive term?  It's not really
> the *memory* that is peer-to-peer; the peer-to-peer part is referring
> to *access* to the memory.

There have been out of tree patches using P2P DMA for a long time, and
some of the use cases have nothing to do with 'memory' - eg DMA to
'registers'

I notice that this series particularly focuses on treating the target
BAR as 'memory' - ie it puts genalloc on top of the BAR, and I guess
treats all addresses as equal and interchangeable.

If this series gets accepted I would expect proposals to extend this
infrastructure to allow for P2P for registers in some way as well.

So I think the 'p2pmem' name is a good choice only when it is in
places that talk about the genalloc part of this design.

We should reserve p2pdma to talk about the generic infrastructure
unrelated to the genalloc pool.

Since these sysfs's seem to report the genalloc pool status, p2pmem
seems like a good choice to me.

> > @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
> > if (error)
> > goto out_pool_destroy;
> >  
> > +   if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
> > +   dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
> 
> Not sure the warning (by itself) is worthwhile.  If we were going to
> disable the feature if sysfs_create_group() failed, that's one thing,
> but we aren't doing anything except generating a warning, which the
> user can't really do anything with.  If the user is looking for the
> sysfs file, its absence will be obvious even without the message.

Don't most of the failure paths inside sysfs_create_group cause prints
anyhow?

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >I'd be much happier if dma_map_sg can tell the memory is P2P or not
> >from the scatterlist or dir arguments and not require the callers to
> >have this.
> 
> We tried things like this in an earlier iteration[1] which assumed the SG
> was homogenous (all P2P or all regular memory). This required serious
> ugliness to try and ensure SGs were in fact homogenous[2].

I'm confused, these patches already assume the sg is homogenous,
right? Sure looks that way. So [2] is just debugging??

What I meant is to rely on the fact it is already homogenous and
create a function like this:

static inline bool sg_is_table_p2p(struct scatterlist *sg)
{
	return is_pci_p2p_page(sg_page(sg));
}

And then insert

	if (sg_is_table_p2p(sg))
		return pci_p2pmem_map_sg(...);

At the top of dma_map_sg_attrs() (and similar for unmap)

Then we don't need to patch RDMA because RDMA is not special when it
comes to P2P. P2P should work with everything.

This has been my objection to every P2P iteration so far, for
years. RDMA is not special, P2P should not be patching RDMA's DMA
path. This is an issue between the DMA and PCI subsystems, not for RDMA.

> >The signature of pci_p2pmem_map_sg also doesn't look very good, it
> >should mirror the usual dma_map_sg, otherwise it needs more revision
> >to extend this to do P2P through a root complex.
> 
> Generally, the feedback that I get is to not make changes that try to
> anticipate the future. It would be easy enough to add those arguments when
> the code for going through the RC is made (which I expect will be very
> involved anyway).

Yes, but DMA mapping fundamentally requires the arguments to
dma_map_sg, so it is hard to accept an API called dma_map without
them.. Maybe just a naming issue.

[2] 
https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 12:01:31PM -0700, Logan Gunthorpe wrote:
>  
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>   * @remote_addr:remote address to read/write (relative to @rkey)
>   * @rkey:remote key to operate on
>   * @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags:  any of the RDMA_RW_CTX_FLAG_* flags
>   *
>   * Returns the number of WQEs that will be needed on the workqueue if
>   * successful, or a negative error code.
>   */
>  int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>   struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> - u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> + u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> + unsigned int flags)
>  {
>   struct ib_device *dev = qp->pd->device;
>   int ret;
>  
> - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> + if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
> + ret = pci_p2pmem_map_sg(sg, sg_cnt);
> + else
> + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

This seems really clunky since we are going to want to do this same
logic all over the place.

I'd be much happier if dma_map_sg can tell the memory is P2P or not
from the scatterlist or dir arguments and not require the callers to
have this.

For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or
adding another bit to the page flags in scatterlist.
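
Eg something along these lines (purely illustrative, these values do
not exist in enum dma_data_direction today):

enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
	/* hypothetical additions so the mapping code can see P2P itself */
	DMA_TO_P2P_DEVICE = 4,
	DMA_FROM_P2P_DEVICE = 5,
};

Then the existing ib_dma_map_sg(dev, sg, sg_cnt, dir) call carries the
P2P-ness in dir and the extra flag argument goes away.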

The signature of pci_p2pmem_map_sg also doesn't look very good, it
should mirror the usual dma_map_sg, otherwise it needs more revision
to extend this to do P2P through a root complex.

Jason