Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 17:25:24 -0400
Don Dutile  wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas  wrote:
> >   
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:  
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...  
> >>  
> >>> Logan Gunthorpe (14):
> >>>PCI/P2PDMA: Support peer-to-peer memory
> >>>PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>>PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>>PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>>docs-rst: Add a new directory for PCI documentation
> >>>PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>>block: Introduce PCI P2P flags for request and request queue
> >>>IB/core: Ensure we map P2P memory correctly in
> >>>  rdma_rw_ctx_[init|destroy]()
> >>>nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>>nvme-pci: Add support for P2P memory in requests
> >>>nvme-pci: Add a quirk for a pseudo CMB
> >>>nvmet: Introduce helper functions to allocate and free request SGLs
> >>>nvmet-rdma: Use new SGL alloc/free helper for requests
> >>>nvmet: Optionally use PCI P2P memory
> >>>
> >>>   Documentation/ABI/testing/sysfs-bus-pci|  25 +
> >>>   Documentation/PCI/index.rst|  14 +
> >>>   Documentation/driver-api/index.rst |   2 +-
> >>>   Documentation/driver-api/pci/index.rst |  20 +
> >>>   Documentation/driver-api/pci/p2pdma.rst| 166 ++
> >>>   Documentation/driver-api/{ => pci}/pci.rst |   0
> >>>   Documentation/index.rst|   3 +-
> >>>   block/blk-core.c   |   3 +
> >>>   drivers/infiniband/core/rw.c   |  13 +-
> >>>   drivers/nvme/host/core.c   |   4 +
> >>>   drivers/nvme/host/nvme.h   |   8 +
> >>>   drivers/nvme/host/pci.c| 118 +++--
> >>>   drivers/nvme/target/configfs.c |  67 +++
> >>>   drivers/nvme/target/core.c | 143 -
> >>>   drivers/nvme/target/io-cmd.c   |   3 +
> >>>   drivers/nvme/target/nvmet.h|  15 +
> >>>   drivers/nvme/target/rdma.c |  22 +-
> >>>   drivers/pci/Kconfig|  26 +
> >>>   drivers/pci/Makefile   |   1 +
> >>>   drivers/pci/p2pdma.c   | 814 +
> >>>   drivers/pci/pci.c  |   6 +
> >>>   include/linux/blk_types.h  |  18 +-
> >>>   include/linux/blkdev.h |   3 +
> >>>   include/linux/memremap.h   |  19 +
> >>>   include/linux/pci-p2pdma.h | 118 +
> >>>   include/linux/pci.h|   4 +
> >>>   26 files changed, 1579 insertions(+), 56 deletions(-)
> >>>   create mode 100644 Documentation/PCI/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>>   create mode 100644 drivers/pci/p2pdma.c
> >>>   create mode 100644 include/linux/pci-p2pdma.h  
> >>
> >> How do you envision merging this?  There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?  
> > 
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option.  I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups, nor is there any sort of compromise: configure
> > the kernel for p2p or device assignment, not both.  Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features?  Thanks,
> > 
> > Alex  
> At least half the cases presented to me by existing customers want it in a
> tunable kernel, and tunable between two points, if the hw allows it to be
> 'contained' in that manner, which a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_ it, and another
> sysfs (configfs? ... I'm not enough of a configfs aficionado to say which
> is best) method to make two points p2p DMA capable.

That's not what's done ...

Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Don Dutile

On 05/08/2018 12:57 PM, Alex Williamson wrote:

On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas  wrote:


On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
...





How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?


AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise: configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex

At least half the cases presented to me by existing customers want it in a
tunable kernel, and tunable between two points, if the hw allows it to be
'contained' in that manner, which a (layer of) switch(ing) provides.
To me, that means a kernel cmdline parameter to _enable_ it, and another
sysfs (configfs? ... I'm not enough of a configfs aficionado to say which
is best) method to make two points p2p DMA capable.

Worst case, the whole system is one large IOMMU group (the current mindset of
this static or run-time config option); best case (over time, with more hw), a
secure set for the primary system with p2p-enabled sections that are deemed
'safe' or 'self-inflicted-insecure' -- the latter being the case of today's VM
with an assigned device: it can scribble all over that VM, but not over any
other VM and not over the host/HV.
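For illustration, a minimal sketch of the kind of boot-time opt-in being asked
for here. The parameter name and the flag are hypothetical; nothing like this
exists in the posted series:

    /* Hypothetical boot-time opt-in; parameter name and flag are made up. */
    #include <linux/init.h>
    #include <linux/types.h>

    static bool p2pdma_enabled;

    static int __init p2pdma_enable_setup(char *str)
    {
            p2pdma_enabled = true;
            return 1;               /* option consumed */
    }
    __setup("pci_p2pdma_enable", p2pdma_enable_setup);

    /* ACS changes and p2pmem publishing would then be gated on this flag. */
    bool pci_p2pdma_boot_enabled(void)
    {
            return p2pdma_enabled;
    }

The second half of the suggestion -- a sysfs or configfs knob naming the two
endpoints -- would sit on top of such a flag rather than replace it.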




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 10:57 AM, Alex Williamson wrote:
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise: configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,

I think it is. But it sounds like the majority want this to be a command
line option. So we will look at doing that for v5.

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Alex Williamson
On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas  wrote:

> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> > Hi Everyone,
> > 
> > Here's v4 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.17-rc2. A git repo
> > is here:
> > 
> > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> > ...  
> 
> 
> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise: configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex
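The effect Alex describes is easy to see from userspace: anything that shares
an IOMMU group with the device you want to assign has to be assigned along
with it. A small sketch (standard sysfs layout; the default BDF is just a
placeholder) that lists a device's group members:

    /* List every device in the IOMMU group of the given PCI device. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            const char *bdf = argc > 1 ? argv[1] : "0000:01:00.0"; /* placeholder */
            char path[256];
            struct dirent *de;
            DIR *dir;

            snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);

            dir = opendir(path);
            if (!dir) {
                    perror(path);
                    return 1;
            }
            while ((de = readdir(dir)) != NULL)
                    if (de->d_name[0] != '.')       /* skip "." and ".." */
                            printf("%s\n", de->d_name);
            closedir(dir);
            return 0;
    }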


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Logan Gunthorpe

> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
balance is NVMe stuff so we could look at merging it through that tree.
The block patch and IB patch are pretty small.

Thanks,

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...


How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-04 Thread Logan Gunthorpe


On 04/05/18 08:27 AM, Christian König wrote:
> Are you sure that this is more convenient? At least on first glance it 
> feels overly complicated.
> 
> I mean what's the difference between the two approaches?
> 
>      sum = pci_p2pdma_distance(target, [A, B, C, target]);
> 
> and
> 
>      sum = pci_p2pdma_distance(target, A);
>      sum += pci_p2pdma_distance(target, B);
>      sum += pci_p2pdma_distance(target, C);

Well, it's more for consistency with pci_p2pdma_find(), which has to
take a list of devices to find a resource that matches all of them.
(You can't use multiple calls in that case because all the devices in
the list might not have the same set of compatible providers.) That way
we can use the same list to check the distance (when the user specifies
a device) as we do to find a compatible device (when the user wants to
automatically find one).
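From a caller's point of view, that consistency looks roughly like the sketch
below. The prototypes are assumptions made for the example; the actual
signatures in the series may differ:

    #include <linux/pci.h>

    /* Assumed prototypes, for illustration only. */
    int pci_p2pdma_distance(struct pci_dev *provider, struct device **clients);
    struct pci_dev *pci_p2pmem_find(struct device **clients);

    static struct pci_dev *choose_provider(struct pci_dev *wanted,
                                           struct device *rdma,
                                           struct device *nvme)
    {
            /* one NULL-terminated client list serves both questions */
            struct device *clients[] = { rdma, nvme, NULL };

            /* user named a provider: check it can reach every client */
            if (wanted)
                    return pci_p2pdma_distance(wanted, clients) < 0 ? NULL : wanted;

            /* otherwise let the core pick one that all clients can reach */
            return pci_p2pmem_find(clients);
    }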

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-04 Thread Christian König

On 03.05.2018 at 20:43, Logan Gunthorpe wrote:


On 03/05/18 11:29 AM, Christian König wrote:

Ok, that is the point where I'm stuck. Why do we need that in one
function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that
information here, because the PCI subsystem might not be aware of all the
interconnections.

For example it isn't uncommon to put multiple GPUs on one board. To the
PCI subsystem that looks like separate devices, but in reality all GPUs
are interconnected and can access each other's memory directly without
going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather
the driver. That's why it feels like a mistake to me to push all that
into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])


Ok, I expected that something like that would do it.

So just to confirm: when I have a bunch of GPUs which could be the
initiator, I only need to do "pci_p2pdma_distance(target, [first GPU,
target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU,
third GPU, fourth GPU, target])"?



Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).


Are you sure that this is more convenient? At least on first glance it 
feels overly complicated.


I mean what's the difference between the two approaches?

    sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

    sum = pci_p2pdma_distance(target, A);
    sum += pci_p2pdma_distance(target, B);
    sum += pci_p2pdma_distance(target, C);


Yeah, same for me. If Bjorn is OK with those specialized NVMe functions
then I'm fine with that as well.

I think it would just be more convenient if we can come up with
functions which can handle all use cases, because there still seem to be
a lot of similarities.

The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.


Well at the end of the day you only need to convince Bjorn of the 
interface, so I'm perfectly fine with it as long as it serves my use 
case as well :)


But I still would like to understand your intention, because that really
helps to avoid accidentally breaking something in the long term.


Now when I take a look at the pure PCI hardware level, what I have is a
transaction between an initiator and a target, and not multiple devices
in one operation.


I mean you must have a very good reason for wanting to deal with
multiple devices in the software layer, but that reason does not become
obvious to me from either the code or your explanation.


Thanks,
Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Logan Gunthorpe


On 03/05/18 11:29 AM, Christian König wrote:
> Ok, that is the point where I'm stuck. Why do we need that in one 
> function call in the PCIe subsystem?
> 
> The problem at least with GPUs is that we seriously don't have that 
> information here, because the PCI subsystem might not be aware of all the 
> interconnections.
> 
> For example it isn't uncommon to put multiple GPUs on one board. To the 
> PCI subsystem that looks like separate devices, but in reality all GPUs 
> are interconnected and can access each other's memory directly without 
> going over the PCIe bus.
> 
> I seriously don't want to model that in the PCI subsystem, but rather 
> the driver. That's why it feels like a mistake to me to push all that 
> into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])

> Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).

> Yeah, same for me. If Bjorn is OK with those specialized NVMe functions 
> that I'm fine with that as well. Then
> 
> I think it would just be more convenient if we can come up with 
> functions which can handle all use cases, because there still seem to be 
> a lot of similarities.

The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Christian König

On 03.05.2018 at 17:59, Logan Gunthorpe wrote:

On 03/05/18 03:05 AM, Christian König wrote:

Second question is how do you want to handle things when devices are not
behind the same root port (which is perfectly possible in the cases I
deal with)?

I think we need to implement a whitelist. If both root ports are in the
white list and are on the same bus then we return a larger distance
instead of -1.


Sounds good.


Third question: why multiple clients? That feels a bit like you are
pushing something special to your use case into the common PCI
subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.


Ok, that is the point where I'm stuck. Why do we need that in one 
function call in the PCIe subsystem?


The problem at least with GPUs is that we seriously don't have that 
information here, because the PCI subsystem might not be aware of all the 
interconnections.


For example it isn't uncommon to put multiple GPUs on one board. To the 
PCI subsystem that looks like separate devices, but in reality all GPUs 
are interconnected and can access each other's memory directly without 
going over the PCIe bus.


I seriously don't want to model that in the PCI subsystem, but rather 
the driver. That's why it feels like a mistake to me to push all that 
into the PCI function.



In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.

I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.


Why can't we model that as two separate transactions?

E.g. one from the RDMA NIC to the GPU memory. And another one from the 
GPU memory to the NVMe device.


That would also match how I get this information from userspace.


As far as I can see we need a function which returns the distance between
an initiator and a target device. This function then returns -1 if the
transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm
not against it.


Yeah, same for me. If Bjorn is OK with those specialized NVMe functions
then I'm fine with that as well.


I think it would just be more convenient if we can come up with
functions which can handle all use cases, because there still seem to be
a lot of similarities.





We also need to give the direction of the transaction and have a
whitelist of root complex PCI-IDs which can handle P2P transactions from
different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.


Ok, I agree that can be added later on. For simplicity let's assume for 
now we always do bidirectional transfers.


Thanks for the explanation,
Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Logan Gunthorpe


On 03/05/18 03:05 AM, Christian König wrote:
> Ok, I'm still missing the big picture here. First question is what is 
> the P2PDMA provider?

Well there's some pretty good documentation in the patchset for this,
but in short, a provider is a device that provides some kind of P2P
resource (ie. BAR memory, or perhaps a doorbell register -- only memory
is supported at this time).

> Second question is how do you want to handle things when devices are not 
> behind the same root port (which is perfectly possible in the cases I 
> deal with)?

I think we need to implement a whitelist. If both root ports are in the
white list and are on the same bus then we return a larger distance
instead of -1.
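A rough sketch of how such a check could look; the ID below is a placeholder,
not a real known-good root complex:

    #include <linux/pci.h>

    /* Placeholder entry only -- not a real list of known-good parts. */
    static const struct pci_device_id p2p_root_whitelist[] = {
            { PCI_DEVICE(0x1234, 0x5678) },
            { }
    };

    static bool p2p_root_ports_whitelisted(struct pci_dev *a, struct pci_dev *b)
    {
            struct pci_dev *ra = pcie_find_root_port(a);
            struct pci_dev *rb = pcie_find_root_port(b);

            if (!ra || !rb)
                    return false;

            /* both root ports known good, and on the same bus */
            return pci_match_id(p2p_root_whitelist, ra) &&
                   pci_match_id(p2p_root_whitelist, rb) &&
                   ra->bus == rb->bus;
    }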

> Third question: why multiple clients? That feels a bit like you are 
> pushing something special to your use case into the common PCI 
> subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.

In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.

I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.

> As far as I can see we need a function which returns the distance between 
> an initiator and a target device. This function then returns -1 if the 
> transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm
not against it.

> We also need to give the direction of the transaction and have a 
> whitelist of root complex PCI-IDs which can handle P2P transactions from 
> different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Christian König

On 02.05.2018 at 17:56, Logan Gunthorpe wrote:

Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:
it would be rather nice if you could separate out the 
functions to detect whether peer2peer is possible between two devices.


This would essentially be pci_p2pdma_distance() in the existing 
patchset. It returns the sum of the distance between a list of clients 
and a P2PDMA provider. It returns -1 if peer2peer is not possible 
between the devices (presently this means they are not behind the same 
root port).


Ok, I'm still missing the big picture here. First question is what is 
the P2PDMA provider?


Second question is how do you want to handle things when devices are not 
behind the same root port (which is perfectly possible in the cases I 
deal with)?


Third question: why multiple clients? That feels a bit like you are 
pushing something special to your use case into the common PCI 
subsystem. Something which usually isn't a good idea.




As far as I can see we need a function which returns the distance between 
an initiator and a target device. This function then returns -1 if the 
transaction can't be made and a positive value otherwise.


We also need to give the direction of the transaction and have a 
whitelist of root complex PCI-IDs which can handle P2P transactions from 
different ports for a certain DMA direction.



Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-02 Thread Logan Gunthorpe

Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:
it would be rather nice if you could separate out the functions 
to detect whether peer2peer is possible between two devices.


This would essentially be pci_p2pdma_distance() in the existing 
patchset. It returns the sum of the distance between a list of clients 
and a P2PDMA provider. It returns -1 if peer2peer is not possible 
between the devices (presently this means they are not behind the same 
root port).
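In caller terms the negative return is simply the cue to fall back to ordinary
system memory. A tiny sketch, with the prototype assumed from the description
above:

    #include <linux/pci.h>

    /* Assumed prototype, for illustration only. */
    int pci_p2pdma_distance(struct pci_dev *provider, struct device **clients);

    static bool p2p_usable(struct pci_dev *provider, struct device **clients)
    {
            /* < 0 means no usable P2P path (e.g. not behind the same root
             * port), so the caller should use regular system memory instead */
            return pci_p2pdma_distance(provider, clients) >= 0;
    }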


Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-02 Thread Christian König

Hi Logan,

it would be rather nice if you could separate out the functions 
to detect whether peer2peer is possible between two devices.


That would allow me to reuse the same logic for GPU peer2peer where I 
don't really have ZONE_DEVICE.


Regards,
Christian.

On 24.04.2018 at 01:30, Logan Gunthorpe wrote:

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4

Thanks,

Logan

Changes in v4:

* Change the original upstream_bridges_match() function to
   upstream_bridge_distance() which calculates the distance between two
   devices as long as they are behind the same root port. This should
   address Bjorn's concerns that the code was too focused on
   being behind a single switch.

* The disable ACS function now disables ACS for all bridge ports instead
   of switch ports (ie. those that had two upstream_bridge ports).

* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
   API to be more like sgl_alloc() in that the alloc function returns
   the allocated scatterlist and nents is not required by the free
   function. (A short usage sketch follows this list.)

* Moved the new documentation into the driver-api tree as requested
   by Jonathan

* Add SGL alloc and free helpers in the nvmet code so that the
   individual drivers can share the code that allocates P2P memory.
   As requested by Christoph.

* Cleanup the nvmet_p2pmem_store() function as Christoph
   thought my first attempt was ugly.

* Numerous commit message and comment fix-ups
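As noted above, a short usage sketch for the sgl_alloc()-style helpers. The
prototypes are inferred from the changelog entry and may not match the series
exactly:

    #include <linux/pci.h>
    #include <linux/scatterlist.h>

    /* Assumed prototypes, inferred from the changelog entry above. */
    struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
                                             unsigned int *nents, u32 length);
    void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);

    static struct scatterlist *get_p2p_sgl(struct pci_dev *provider,
                                           unsigned int *nents, u32 len)
    {
            /* the allocation returns the scatterlist directly, like sgl_alloc() */
            return pci_p2pmem_alloc_sgl(provider, nents, len);
    }

    static void put_p2p_sgl(struct pci_dev *provider, struct scatterlist *sgl)
    {
            /* ...and the free side no longer needs nents */
            pci_p2pmem_free_sgl(provider, sgl);
    }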

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
   and Kconfig doc. Also, the code that disables the ACS bits is surrounded
   explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
   is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
   are closest to (or the same as) the clients using them. In cases
   of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
   may be explicitly specified, bypassing the logic in pci_p2pmem_find().
   (Note: it's still enforced that the provider must be behind the
same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
   as a bunch of cleanup and spelling fixes he pointed out in the last
   series.

* To address Alex's ACS concerns, we change to a simpler method of
   just disabling ACS behind switches for any kernel that has
   CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.
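The "same switch hierarchy" restriction boils down to a walk up the PCI tree
looking for a common upstream bridge. A much-simplified illustration of the
idea (not the actual upstream_bridge_distance() code from the series):

    #include <linux/pci.h>

    /* Return a rough hop count if 'a' and 'b' share an upstream bridge,
     * or -1 if they do not (e.g. they sit under different root ports). */
    static int simple_upstream_distance(struct pci_dev *a, struct pci_dev *b)
    {
            struct pci_dev *up_a, *up_b;
            int dist_a, dist_b;

            for (up_a = pci_upstream_bridge(a), dist_a = 0; up_a;
                 up_a = pci_upstream_bridge(up_a), dist_a++) {
                    for (up_b = pci_upstream_bridge(b), dist_b = 0; up_b;
                         up_b = pci_upstream_bridge(up_b), dist_b++) {
                            if (up_a == up_b)       /* first common ancestor */
                                    return dist_a + dist_b;
                    }
            }

            return -1;
    }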

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
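A simplified illustration of what that stored offset is for: when the P2P
pages are mapped for DMA, the peer must be handed a PCI bus address, which can
differ from the CPU physical address of the BAR. This is only a sketch of the
idea, not the mapping code from the series:

    #include <linux/scatterlist.h>
    #include <linux/types.h>

    /* Apply a provider-specific bus offset while mapping a P2P scatterlist.
     * bus_offset is assumed to be (CPU physical address - PCI bus address)
     * for the provider's BAR. */
    static void p2p_map_sg_with_offset(struct scatterlist *sgl, int nents,
                                       u64 bus_offset)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(sgl, sg, nents, i) {
                    sg->dma_address = sg_phys(sg) - bus_offset;
                    sg_dma_len(sg) = sg->length;
            }
    }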

When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bit must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be