Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci|  25 +
>  Documentation/PCI/index.rst|  14 +
>  Documentation/driver-api/index.rst |   2 +-
>  Documentation/driver-api/pci/index.rst |  20 +
>  Documentation/driver-api/pci/p2pdma.rst| 166 ++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst|   3 +-
>  block/blk-core.c   |   3 +
>  drivers/infiniband/core/rw.c   |  13 +-
>  drivers/nvme/host/core.c   |   4 +
>  drivers/nvme/host/nvme.h   |   8 +
>  drivers/nvme/host/pci.c| 118 +++--
>  drivers/nvme/target/configfs.c |  67 +++
>  drivers/nvme/target/core.c | 143 -
>  drivers/nvme/target/io-cmd.c   |   3 +
>  drivers/nvme/target/nvmet.h|  15 +
>  drivers/nvme/target/rdma.c |  22 +-
>  drivers/pci/Kconfig|  26 +
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 814 
> +
>  drivers/pci/pci.c  |   6 +
>  include/linux/blk_types.h  |  18 +-
>  include/linux/blkdev.h |   3 +
>  include/linux/memremap.h   |  19 +
>  include/linux/pci-p2pdma.h | 118 +
>  include/linux/pci.h|   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jonathan Corbet 
> ---
>  Documentation/PCI/index.rst |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 
> 
>  Documentation/index.rst |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index ..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==
> +Linux PCI Driver Developer's Guide
> +==
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   ===
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst 
> b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
> :maxdepth: 2
>  
> pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst 
> b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index ..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +
> +PCI Peer-to-Peer DMA Support
> +
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
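
As a concrete illustration of the registration step the document describes, a
driver might expose a BAR as p2pmem roughly like this.  This is only a sketch:
the BAR index is a made-up example, and the calls are based on the
pci_p2pdma_add_resource() and pci_p2pmem_publish() prototypes quoted elsewhere
in this thread.

  /* Hypothetical probe fragment, not code from the patch set. */
  static int example_setup_p2pmem(struct pci_dev *pdev)
  {
	int rc;

	/* Register BAR 4 as ZONE_DEVICE-backed p2pmem (whole BAR, no offset). */
	rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
	if (rc)
		return rc;

	/* Make the memory visible to other in-kernel users. */
	pci_p2pmem_publish(pdev, true);
	return 0;
  }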


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-07 Thread Bjorn Helgaas
[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration time; are there
other places that would need to care?

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  9 +
>  drivers/pci/p2pdma.c   | 45 ++---
>  drivers/pci/pci.c  |  6 ++
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transations must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>  
> +   Enabling this option will also disable ACS on all ports behind
> +   any PCIe switch. This effectively puts all devices behind any
> +   switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +   individual devices behind any switch will not be able to be
> +   assigned to separate VMs because there is no isolation between
> +   them. Additionally, any malicious PCIe devices will be able to
> +   DMA to memory exposed by other EPs in the same domain as TLPs
> +   will not be checked by the IOMMU.
> +
> If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct 
> device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> - struct pci_dev *up1, *up2;
> + int pos;
> + u16 ctrl;
>  
> - if (!pdev)
> - return NULL;
> + if (!pci_is_bridge(pdev))
> + return 0;
>  
> - up1 = pci_dev_get(pci_upstream_bridge(pdev));
> - if (!up1)
> - return NULL;
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> - 

Re: [PATCH v4 03/14] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

2018-05-07 Thread Bjorn Helgaas
s/dma/DMA/ (in subject)

On Mon, Apr 23, 2018 at 05:30:35PM -0600, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> P2P memory or no P2P memory.
> 
> Signed-off-by: Logan Gunthorpe 


Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
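To make the flow above concrete, a consuming driver might use these interfaces
roughly as follows (purely illustrative: the argument lists, the example client
devices, and transfer_len are assumptions, not the prototypes from the patch
set):

  LIST_HEAD(clients);
  struct pci_dev *p2p_dev;
  void *p2p_buf;

  /* 1. Register every device that will touch the buffer. */
  pci_p2pdma_add_client(&clients, &nvme_dev->dev);
  pci_p2pdma_add_client(&clients, &rdma_dev->dev);

  /* 2. Ask for a p2pmem provider compatible with all registered clients. */
  p2p_dev = pci_p2pmem_find(&clients);
  if (!p2p_dev)
	  goto fallback_to_system_memory;

  /* 3. Carve a buffer out of the provider's BAR for the transfer. */
  p2p_buf = pci_alloc_p2pmem(p2p_dev, transfer_len);
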
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes

s/PCI/PCIe/

> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through

s/root port/PCI bridge/

> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  17 ++
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 694 
> +
>  include/linux/memremap.h   |  18 ++
>  include/linux/pci-p2pdma.h | 100 +++
>  include/linux/pci.h|   4 +
>  6 files changed, 834 insertions(+)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>  
> If unsure, say N.
>  
> +config PCI_P2PDMA
> + bool "PCI peer-to-peer transfer support"
> + depends on PCI && ZONE_DEVICE && EXPERT
> + select GENERIC_ALLOCATOR
> + help
> +   Enables drivers to do PCI peer-to-peer transactions to and from
> +   BARs that are exposed in other devices that are the part of
> +   the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +   specification to work (ie. anything below a single PCI bridge).
> +
> +   Many PCIe root complexes do not support P2P transactions and
> +   it's hard to tell which support it at all, so at this time, DMA
> +   transations must be between devices behind the same root port.

s/DMA transactions/PCIe DMA transactions/

(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)

> +   (Typically behind a network of PCIe switches).

Not sure this last sentence adds useful information.

> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *

Nit: unnecessary blank line.

> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)

This function doesn't seem to be used anymore.  Thanks for all your hard
work to get rid of it!

> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + 

Re: [PATCH 09/12] PCI: remove CONFIG_PCI_BUS_ADDR_T_64BIT

2018-04-20 Thread Bjorn Helgaas
On Sun, Apr 15, 2018 at 04:59:44PM +0200, Christoph Hellwig wrote:
> This symbol is now always identical to CONFIG_ARCH_DMA_ADDR_T_64BIT, so
> remove it.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

Please merge this along with the rest of the series; let me know if you
need anything more from me.

> ---
>  drivers/pci/Kconfig | 4 
>  drivers/pci/bus.c   | 4 ++--
>  include/linux/pci.h | 2 +-
>  3 files changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..29a487f31dae 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -5,10 +5,6 @@
>  
>  source "drivers/pci/pcie/Kconfig"
>  
> -config PCI_BUS_ADDR_T_64BIT
> - def_bool y if (ARCH_DMA_ADDR_T_64BIT || 64BIT)
> - depends on PCI
> -
>  config PCI_MSI
>   bool "Message Signaled Interrupts (MSI and MSI-X)"
>   depends on PCI
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index bc2ded4c451f..35b7fc87eac5 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -120,7 +120,7 @@ int devm_request_pci_bus_resources(struct device *dev,
>  EXPORT_SYMBOL_GPL(devm_request_pci_bus_resources);
>  
>  static struct pci_bus_region pci_32_bit = {0, 0xffffffffULL};
> -#ifdef CONFIG_PCI_BUS_ADDR_T_64BIT
> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>  static struct pci_bus_region pci_64_bit = {0,
>   (pci_bus_addr_t) 0xffffffffffffffffULL};
>  static struct pci_bus_region pci_high = {(pci_bus_addr_t) 0x100000000ULL,
> @@ -230,7 +230,7 @@ int pci_bus_alloc_resource(struct pci_bus *bus, struct 
> resource *res,
> resource_size_t),
>   void *alignf_data)
>  {
> -#ifdef CONFIG_PCI_BUS_ADDR_T_64BIT
> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>   int rc;
>  
>   if (res->flags & IORESOURCE_MEM_64) {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 73178a2fcee0..55371cb827ad 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -670,7 +670,7 @@ int raw_pci_read(unsigned int domain, unsigned int bus, 
> unsigned int devfn,
>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 val);
>  
> -#ifdef CONFIG_PCI_BUS_ADDR_T_64BIT
> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>  typedef u64 pci_bus_addr_t;
>  #else
>  typedef u32 pci_bus_addr_t;
> -- 
> 2.17.0
> 


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-26 Thread Bjorn Helgaas
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe  wrote:
> > It turns out that root ports that support P2P are far less common than 
> > anyone thought. So it will likely have to be a white list.
> 
> This came as a bit of a surprise to our PCIe architect.
> 
> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered.  This might
> long term avoid the need to maintain the white list for new devices.
> 
> So is it worth having a long term solution for making this discoverable?

It was surprising to me that there's no architected way to discover
this.  It seems like such an obvious thing that I guess I assumed the
omission was intentional, i.e., maybe there's something that makes it
impractical, but it would be worth at least asking somebody in the
SIG.  It seems like for root ports in the same root complex, at least,
there could be a bit somewhere in the root port or the RCRB (which
Linux doesn't support yet).


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-23 Thread Bjorn Helgaas
On Fri, Mar 23, 2018 at 03:59:14PM -0600, Logan Gunthorpe wrote:
> On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> > Popping way up the stack, my original point was that I'm trying to
> > remove restrictions on what devices can participate in
> > peer-to-peer DMA.  I think it's fairly clear that in conventional
> > PCI, any devices in the same PCI hierarchy, i.e., below the same
> > host-to-PCI bridge, should be able to DMA to each other.
> 
> Yup, we are working on this.
> 
> > The routing behavior of PCIe is supposed to be compatible with
> > conventional PCI, and I would argue that this effectively requires
> > multi-function PCIe devices to have the internal routing required
> > to avoid the route-to-self issue.
> 
> That would be very nice but many devices do not support the internal
> route. We've had to work around this in the past and as I mentioned
> earlier that NVMe devices have a flag indicating support. However,
> if a device wants to be involved in P2P it must support it and we
> can exclude devices that don't support it by simply not enabling
> their drivers.

Do you think these devices that don't support internal DMA between
functions are within spec, or should we handle them as exceptions,
e.g., via quirks?

If NVMe defines a flag indicating peer-to-peer support, that would
suggest to me that these devices are within spec.

I looked up the CMBSZ register you mentioned (NVMe 1.3a, sec 3.1.12).
You must be referring to the WDS, RDS, LISTS, CQS, and SQS bits.  If
WDS is set, the controller supports having Write-related data and
metadata in the Controller Memory Buffer.  That would mean the driver
could put certain queues in controller memory instead of in host
memory.  The controller could then read the queue from its own
internal memory rather than issuing a PCIe transaction to read it from
host memory.

That makes sense to me, but I don't see the connection to
peer-to-peer.  There's no multi-function device in this picture, so
it's not about internal DMA between functions.

WDS, etc., tell us about capabilities of the controller.  If WDS is
set, the CPU (or a peer PCIe device) can write things to controller
memory.  If it is clear, neither the CPU nor a peer device can put
things there.  So it doesn't seem to tell us anything about
peer-to-peer specifically.  It looks like information needed by the
NVMe driver, but not by the PCI core.

Bjorn


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-23 Thread Bjorn Helgaas
On Thu, Mar 22, 2018 at 10:57:32PM +, Stephen  Bates wrote:
> >  I've seen the response that peers directly below a Root Port could not
> > DMA to each other through the Root Port because of the "route to self"
> > issue, and I'm not disputing that.  
> 
> Bjorn 
> 
> You asked me for a reference to RTS in the PCIe specification. As
> luck would have it I ended up in an Irish bar with Peter Onufryk
> this week at OCP Summit. We discussed the topic. It is not
> explicitly referred to as "Route to Self" and it's certainly not
> explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification
> discusses error conditions for virtual PCI bridges. One of these
> conditions (given in the very first bullet in that section) applies
> to a request that is destined for the same port it came in on. When
> this occurs the request must be terminated as a UR.

Thanks for that reference!

I suspect figure 10-3 in sec 10.1.1 might also be relevant, although
it's buried in the ATS section.  It shows internal routing between
functions of a multifunction device.  That suggests that the functions
should be able to DMA to each other without those transactions ever
appearing on the link.

Popping way up the stack, my original point was that I'm trying to
remove restrictions on what devices can participate in peer-to-peer
DMA.  I think it's fairly clear that in conventional PCI, any devices
in the same PCI hierarchy, i.e., below the same host-to-PCI bridge,
should be able to DMA to each other.

The routing behavior of PCIe is supposed to be compatible with
conventional PCI, and I would argue that this effectively requires
multi-function PCIe devices to have the internal routing required to
avoid the route-to-self issue.

Bjorn


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Bjorn Helgaas
On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote:
> On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> > I agree that peers need to have a common upstream bridge.  I think
> > you're saying peers need to have *two* common upstream bridges.  If I
> > understand correctly, requiring two common bridges is a way to ensure
> > that peers directly below Root Ports don't try to DMA to each other.
> 
> No, I don't get where you think we need to have two common upstream
> bridges. I'm not sure when such a case would ever happen. But you seem
> to understand based on what you wrote below.

Sorry, I phrased that wrong.  You don't require two common upstream
bridges; you require two upstream bridges, with the upper one being
common, i.e.,

  static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
  {
struct pci_dev *up1, *up2;

up1 = pci_dev_get(pci_upstream_bridge(pdev));
up2 = pci_dev_get(pci_upstream_bridge(up1));
return up2;
  }

So if you're starting with pdev, up1 is the immediately upstream
bridge and up2 is the second upstream bridge.  If this is PCIe, up1
may be a Root Port and there is no up2, or up1 and up2 are in a
switch.

This is more restrictive than the spec requires.  As long as there is
a single common upstream bridge, peer-to-peer DMA should work.  In
fact, in conventional PCI, I think the upstream bridge could even be
the host bridge (not a PCI-to-PCI bridge).

You are focused on PCIe systems, and in those systems, most topologies
do have an upstream switch, which means two upstream bridges.  I'm
trying to remove that assumption because I don't think there's a
requirement for it in the spec.  Enforcing this assumption complicates
the code and makes it harder to understand because the reader says
"huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
so why do we need these two bridges?"

[*] For conventional PCI, this means anything below the same host
bridge.  Two devices on a conventional PCI root bus should be able to
DMA to each other, even though there's no PCI-to-PCI bridge above
them.  For PCIe, it means a "hierarchy domain" as used in PCIe r4.0,
sec 1.3.1, i.e., anything below the same Root Port.

> > So I guess the first order of business is to nail down whether peers
> > below a Root Port are prohibited from DMAing to each other.  My
> > assumption, based on 6.12.1.2 and the fact that I haven't yet found
> > a prohibition, is that they can.
> 
> If you have a multifunction device designed to DMA to itself below a
> root port, it can. But determining this is on a device by device basis,
> just as determining whether a root complex can do peer to peer is on a
> per device basis. So I'd say we don't want to allow it by default and
> let someone who has such a device figure out what's necessary if and
> when one comes along.

It's not the job of this infrastructure to answer the device-dependent
question of whether DMA initiators or targets support peer-to-peer
DMA.

All we want to do here is figure out whether the PCI topology supports
it, using the mechanisms guaranteed by the spec.  We can derive that
from the basic rules about how PCI bridges work, i.e., from the
PCI-to-PCI Bridge spec r1.2, sec 4.3:

  A bridge forwards PCI memory transactions from its primary interface
  to its secondary interface (downstream) if a memory address is in
  the range defined by the Memory Base and Memory Limit registers
  (when the base is less than or equal to the limit) as illustrated in
  Figure 4-3. Conversely, a memory transaction on the secondary
  interface that is within this address range will not be forwarded
  upstream to the primary interface. Any memory transactions on the
  secondary interface that are outside this address range will be
  forwarded upstream to the primary interface (provided they are not
  in the address range defined by the prefetchable memory address
  range registers).

This works for either PCI or PCIe.  The only wrinkle PCIe adds is that
the very top of the hierarchy is a Root Port, and we can't rely on it
to route traffic to other Root Ports.  I also doubt Root Complex
Integrated Endpoints can participate in peer-to-peer DMA.

Thanks for your patience in working through all this.  I know it
sometimes feels like being bounced around in all directions.  It's
just a normal consequence of trying to add complex functionality to an
already complex system, with interest and expertise spread unevenly
across a crowd of people.

Bjorn


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-13 Thread Bjorn Helgaas
On Tue, Mar 13, 2018 at 05:21:20PM -0600, Logan Gunthorpe wrote:
> On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> > On Tue, Mar 13, 2018 at 10:31:55PM +, Stephen  Bates wrote:
> > If it *is* necessary because Root Ports and devices below them behave
> > differently than in conventional PCI, I think you should include a
> > reference to the relevant section of the spec and check directly for a
> > Root Port.  I would prefer that over trying to exclude Root Ports by
> > looking for two upstream bridges.
> 
> Well we've established that we don't want to allow root ports.

I assume you want to exclude Root Ports because of multi-function
devices and the "route to self" error.  I was hoping for a reference
to that so I could learn more about it.

While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
Functions in SR-IOV Capable and Multi-Function Devices", which seems
relevant.  It talks about "peer-to-peer Requests (between Functions of
the device)".  Thay says to me that multi-function devices can DMA
between themselves.

> But we need to, at a minimum, do two pci_upstream_bridge() calls...
> 
> Remember, we need to check that a number of devices are behind the same
> switch. So we need to find a _common_ upstream port for several devices.

I agree that peers need to have a common upstream bridge.  I think
you're saying peers need to have *two* common upstream bridges.  If I
understand correctly, requiring two common bridges is a way to ensure
that peers directly below Root Ports don't try to DMA to each other.

So I guess the first order of business is to nail down whether peers
below a Root Port are prohibited from DMAing to each other.  My
assumption, based on 6.12.1.2 and the fact that I haven't yet found
a prohibition, is that they can.

> Excluding the multifunction device case (which I don't think is
> applicable for reasons we've discussed before), this will *never* be the
> first upstream port for a given device.

If you're looking for a common upstream bridge, you don't need to make
assumptions about whether the hierarchy is conventional PCI or PCIe or
how many levels are in the hierarchy.

You already have upstream_bridges_match(), which takes two pci_devs.
I think it should walk up the PCI hierarchy from the first device,
checking whether the bridge at each level is also a parent of the
second device.
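
Sketched out, that walk could look something like this (an illustration of the
suggestion above, not code from the series; pci_dev_get()/pci_dev_put()
reference counting is omitted for brevity):

  static struct pci_dev *find_common_upstream_bridge(struct pci_dev *a,
						     struct pci_dev *b)
  {
	struct pci_dev *up_a, *up_b;

	/* Walk upward from @a; for each bridge, check if it is also above @b. */
	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a)) {
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b)) {
			if (up_a == up_b)
				return up_a;	/* lowest shared bridge */
		}
	}

	return NULL;	/* no common bridge below the host bridge */
  }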

Bjorn


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-13 Thread Bjorn Helgaas
On Tue, Mar 13, 2018 at 10:31:55PM +, Stephen  Bates wrote:
> >> It sounds like you have very tight hardware expectations for this to work
> >> at this moment. You also don't want to generalize this code for others and
> >> address the shortcomings.
> >  No, that's the way the community has pushed this work
> 
> Hi Sinan
> 
> Thanks for all the input. As Logan has pointed out the switch
> requirement is something that has evolved over time based on input
> from the community. You are more than welcome to have an opinion on
> this (and you have made that opinion clear ;-)). Over time the
> patchset may evolve from its current requirements but right now we
> are aligned with the feedback from the community.

This part of the community hasn't been convinced of the need to have
two bridges, e.g., both an Upstream Port and a Downstream Port, or two
conventional PCI bridges, above the peers.

Every PCI-to-PCI bridge is required to support routing transactions
between devices on its secondary side.  Therefore, I think it is
sufficient to verify that the potential peers share a single common
upstream bridge.  This could be a conventional PCI bridge, a Switch
Downstream Port, or a Root Port.

I've seen the response that peers directly below a Root Port could not
DMA to each other through the Root Port because of the "route to self"
issue, and I'm not disputing that.  But enforcing a requirement for
two upstream bridges introduces a weird restriction on conventional
PCI topologies, makes the code hard to read, and I don't think it's
necessary.

If it *is* necessary because Root Ports and devices below them behave
differently than in conventional PCI, I think you should include a
reference to the relevant section of the spec and check directly for a
Root Port.  I would prefer that over trying to exclude Root Ports by
looking for two upstream bridges.

Bjorn


Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-05 Thread Bjorn Helgaas
On Thu, Mar 01, 2018 at 12:13:10PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 11:02 AM, Bjorn Helgaas wrote:
> > >   void pci_enable_acs(struct pci_dev *dev)
> > >   {
> > > + if (pci_p2pdma_disable_acs(dev))
> > > + return;
> > 
> > This doesn't read naturally to me.  I do see that when
> > CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
> > and returns 0, so we'll go ahead and try to enable ACS as before.
> > 
> > But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
> > right here so it's more obvious that we only disable ACS when it's
> > selected.
> 
> I could do this... however, I wrote it this way because I've read Linus
> dislikes using #ifdef's inside function bodies and I personally agree with
> that sentiment.

I try to avoid #ifdefs too, but in this case the plain reading of the
code makes no sense (we're in a function called "enable_acs()", and
the first thing we do is call a function to "disable_acs()").

Disabling ACS is scary for all the security reasons mentioned
elsewhere, so a reader who knows what ACS does and cares about
virtualization and security *has* to go look up the definition of
pci_p2pdma_disable_acs().

If you put the #ifdef right here, then it's easier to read because we
can see that "oh, this is a special and uncommon case that I can
probably ignore".
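
Concretely, the suggestion amounts to something like the sketch below (based on
the pci_enable_acs() hunk quoted in the patch; the comment and exact placement
are illustrative, not code from the series):

  void pci_enable_acs(struct pci_dev *dev)
  {
  #ifdef CONFIG_PCI_P2PDMA
	/*
	 * P2PDMA needs the ACS P2P Request/Completion Redirect bits left
	 * clear, so if they were just disabled, don't re-enable ACS here.
	 */
	if (pci_p2pdma_disable_acs(dev))
		return;
  #endif
	if (!pci_acs_enable)
		return;

	/* ... existing ACS enable logic continues unchanged ... */
  }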

Bjorn


Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory

2018-03-01 Thread Bjorn Helgaas
On Thu, Mar 01, 2018 at 11:14:46PM +, Stephen  Bates wrote:
> > I'm pretty sure the spec disallows routing-to-self so doing a P2P 
> > transaction in that sense isn't going to work unless the device 
> > specifically supports it and intercepts the traffic before it gets to 
> > the port.
> 
> This is correct. Unless the device intercepts the TLP before it hits
> the root-port then this would be considered a "route to self"
> violation and an error event would occur. The same holds for the
> downstream port on a PCI switch (unless route-to-self violations are
> disabled which violates the spec but which I have seen done in
> certain applications).

I agree that a function doing DMA to a sibling within the same
multi-function device would probably not generate a TLP for it (I
would be curious to read about this in the spec if you have a
pointer).

More fundamentally, is there some multi-function-specific restriction
on peer-to-peer DMA?  In conventional PCI, single-function devices on
the same bus can DMA to each other.  The transactions will appear on
the bus, but the upstream bridge will ignore them because the address
is inside the bridge's memory window.  As far as I know, the same
should happen on PCIe.

I don't know what happens with functions of a multi-function device,
either in conventional PCI or PCIe.  I don't remember a restriction on
whether they can DMA to each other, but maybe there is.

Bjorn


Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Bjorn Helgaas
On Thu, Mar 01, 2018 at 06:54:01PM +, Stephen  Bates wrote:
> Thanks for the detailed review Bjorn!
> 
> >> +Enabling this option will also disable ACS on all ports behind
> >> +any PCIe switch. This effictively puts all devices behind any
> >> +switch into the same IOMMU group.
> 
> >  Does this really mean "all devices behind the same Root Port"?
> 
> Not necessarily. You might have a cascade of switches (i.e switches
> below a switch) to achieve a very large fan-out (in an NVMe SSD
> array for example) and we will only disable ACS on the ports below
> the relevant switch.

The question is what the relevant switch is.  We call pci_enable_acs()
on every PCI device, including Root Ports.  It looks like this relies
on get_upstream_bridge_port() to filter out some things.  I don't
think get_upstream_bridge_port() is doing the right thing, so we need
to sort that out first, I guess.

> > What does this mean in terms of device security?  I assume it means,
> > at least, that individual devices can't be assigned to separate VMs.
> 
> This was discussed during v1 [1]. Disabling ACS on all downstream
> ports of the switch means that all the EPs below it have to part of
> the same IOMMU grouping. However it was also agreed that as long as
> the ACS disable occurred at boot time (which is does in v2) then the
> virtualization layer will be aware of it and will perform the IOMMU
> group formation correctly.
> 
> > I don't mind admitting that this patch makes me pretty nervous, and I
> > don't have a clear idea of what the implications of this are, or how
> > to communicate those to end users.  "The same IOMMU group" is a pretty
> > abstract idea.
> 
> Alex gave a good overview of the implications in [1].
> 
> [1] https://marc.info/?l=linux-pci=151512320031739=2

This might be a good start, but whatever the implications are, they
need to be distilled and made clear directly in the changelog and the
Kconfig help text.

Bjorn


Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory

2018-03-01 Thread Bjorn Helgaas
On Thu, Mar 01, 2018 at 11:55:51AM -0700, Logan Gunthorpe wrote:
> Hi Bjorn,
> 
> Thanks for the review. I'll correct all the nits for the next version.
> 
> On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> > > Some PCI devices may have memory mapped in a BAR space that's
> > > intended for use in Peer-to-Peer transactions. In order to enable
> > > such transactions the memory must be registered with ZONE_DEVICE pages
> > > so it can be used by DMA interfaces in existing drivers.
> 
> > Is there anything about this memory that makes it specifically
> > intended for peer-to-peer transactions?  I assume the device can't
> > really tell whether a transaction is from a CPU or a peer.
> 
> No there's nothing special about the memory and it can still be accessed by
> the CPU. This is just the intended purpose. You could use this PCI memory as
> regular DMA buffers or regular memory but I'm not sure why you would. It
> would probably be pretty bad performance-wise.
> 
> 
> > BTW, maybe there could be some kind of guide for device driver writers
> > in Documentation/PCI/?
> Makes sense we can look at writing something for the next iteration.
> 
> > I think it would be clearer and sufficient to simply say that we have
> > no way to know whether peer-to-peer routing between PCIe Root Ports is
> > supported (PCIe r4.0, sec 1.3.1).
> 
> Fair enough.
> 
> > The fact that you use the PCIe term "switch" suggests that a PCIe
> > Switch is required, but isn't it sufficient for the peers to be below
> > the same "PCI bridge", which would include PCIe Root Ports, PCIe
> > Switch Downstream Ports, and conventional PCI bridges?
> > The comments at get_upstream_bridge_port() suggest that this isn't
> > enough, and the peers actually do have to be below the same PCIe
> > Switch, but I don't know why.
> 
> I do mean Switch as we do need to keep the traffic off the root complex.
> Seeing, as stated above, we don't know if it actually support it. (While we
> can be certain any PCI switch does). So we specifically want to exclude PCIe
> Root ports and I'm not sure about the support of PCI bridges but I can't
> imagine anyone wanting to do P2P around them so I'd rather be safe than
> sorry and exclude them.

I don't think this is correct.  A Root Port defines a hierarchy domain
(I'm looking at PCIe r4.0, sec 1.3.1).  The capability to route
peer-to-peer transactions *between* hierarchy domains is optional.  I
think this means a Root Complex is not required to route transactions
from one Root Port to another Root Port.

This doesn't say anything about routing between two different devices
below a Root Port.  Those would be in the same hierarchy domain and
should follow the conventional PCI routing rules.  Of course, since a
Root Port has one link that leads to one device, they would probably
be different functions of a single multi-function device, so I don't
know how practical it would be to test this.

> > This whole thing is confusing to me.  Why do you want to reject peers
> > directly connected to the same root port?  Why do you require the same
> > Switch Upstream Port?  You don't exclude conventional PCI, but it
> > looks like you would require peers to share *two* upstream PCI-to-PCI
> > bridges?  I would think a single shared upstream bridge (conventional,
> > PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
> 
> Hmm, yes, this may just be laziness on my part. Finding the shared upstream
> bridge is a bit more tricky than just showing that they are on the same
> switch. So as coded, a fabric of switches with peers on different legs of
> the fabric are not supported. But yes, maybe they just need to be two
> devices with a single shared upstream bridge that is not the root port.
> Again, we need to reject the root port because we can't know if the root
> complex can support P2P traffic.

This sounds like the same issue as above, so we just need to resolve
that.

> > Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> > sort of weird that callers supply a non-PCI device and then we look up
> > a PCI device here.  I assume you have some reason for this; if you
> > added a writeup in Documentation/PCI, that would be a good place to
> > elaborate on that, maybe with a one-line clue here.
> 
> Well yes, but this is much more convenient for callers which don't need to
> care if the device they are attempting to add (which in the NVMe target
> case, could be a random block device) is a pci device or not. Especially
> seeing find_parent_pci_dev() is non-trivial.

OK.  I accept that it 

Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Bjorn Helgaas
On Wed, Feb 28, 2018 at 04:40:00PM -0700, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the the groups are
> assigned.

s/the the/the/

> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any switch will be in the same IOMMU group.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  4 
>  drivers/pci/p2pdma.c   | 44 
>  drivers/pci/pci.c  |  4 
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 57 insertions(+)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 840831418cbd..a430672f0ad4 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -138,6 +138,10 @@ config PCI_P2PDMA
> it's hard to tell which support it with good performance, so
> at this time you will need a PCIe switch.
>  
> +   Enabling this option will also disable ACS on all ports behind
> +   any PCIe switch. This effictively puts all devices behind any
> +   switch into the same IOMMU group.

s/effictively/effectively/

Does this really mean "all devices behind the same Root Port"?

What does this mean in terms of device security?  I assume it means,
at least, that individual devices can't be assigned to separate VMs.

I don't mind admitting that this patch makes me pretty nervous, and I
don't have a clear idea of what the implications of this are, or how
to communicate those to end users.  "The same IOMMU group" is a pretty
abstract idea.

> If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4e1c81f64b29..61af07acd21a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -255,6 +255,50 @@ static struct pci_dev *get_upstream_bridge_port(struct 
> pci_dev *pdev)
>   return up2;
>  }
>  
> +/*
> + * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
> + *   bridges/switches
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any downstream port in any switch in order for
> + * the TLPs to not be forwarded up to the RC which is not what we want
> + * for P2P.
> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any switch to be in the same IOMMU
> + * group. At this time there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
> + */
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> + struct pci_dev *up;
> + int pos;
> + u16 ctrl;
> +
> + up = get_upstream_bridge_port(pdev);
> + if (!up)
> + return 0;
> + pci_dev_put(up);
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + dev_info(&pdev->dev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
> +
> + pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
> +
> + return 1;
> +}
> +
>  static bool __upstream_bridges_match(struct pci_dev *upstream,
>struct pci_dev *client)
>  {
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd10d9b0..95ad3cf288c8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2826,6 +2827,9 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> + if (pci_p2pdma_disable_acs(dev))
> + return;

This doesn't read naturally to me.  I do see that when
CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
and returns 0, so we'll go ahead and try to enable ACS as before.

But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
right here so it's more obvious that we only disable ACS when it's
selected.

>   if (!pci_acs_enable)
>   return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 126eca697ab3..f537f521f60c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -22,6 +22,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>   u64 offset);
>  int 

Re: [PATCH v2 03/10] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

2018-03-01 Thread Bjorn Helgaas
On Wed, Feb 28, 2018 at 04:39:59PM -0700, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> p2p memory or no p2p memory.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/p2pdma.c   | 51 
> ++
>  include/linux/memremap.h   |  1 +
>  include/linux/pci-p2pdma.h | 13 
>  3 files changed, 65 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index a57df78f6a32..4e1c81f64b29 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -188,6 +188,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int 
> bar, size_t size,
>   pgmap->res.flags = pci_resource_flags(pdev, bar);
> + pgmap->ref = &pdev->p2pdma->devmap_ref;
>   pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> + pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
> + pci_resource_start(pdev, bar);
>  
>   addr = devm_memremap_pages(&pdev->dev, pgmap);
>   if (IS_ERR(addr))
> @@ -616,3 +618,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool 
> publish)
>   pdev->p2pdma->published = publish;
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
> +
> +/*
> + * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
> + * @dev:device doing the DMA request
> + * @sg:  scatter list to map
> + * @nents:   elements in the scatterlist
> + * @dir:DMA direction

Can you fix these so the descriptions all have a single space after
the "@dev:", which seems to be the existing convention in this file?
The indentation looks pretty random now.
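
For example, with consistent spacing the block above would read:

  /**
   * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
   * @dev: device doing the DMA request
   * @sg: scatter list to map
   * @nents: elements in the scatterlist
   * @dir: DMA direction
   *
   * Returns the number of SG entries mapped
   */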

> + *
> + * Returns the number of SG entries mapped
> + */
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> +   enum dma_data_direction dir)

Same question as before about why the mixture of "pci_*" interfaces
that take "struct device *" parameters.

> +{
> + struct dev_pagemap *pgmap;
> + struct scatterlist *s;
> + phys_addr_t paddr;
> + int i;
> +
> + /*
> +  * p2pdma mappings are not compatible with devices that use
> +  * dma_virt_ops.
> +  */
> + if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
> + return 0;
> +
> + for_each_sg(sg, s, nents, i) {
> + pgmap = sg_page(s)->pgmap;
> + paddr = sg_phys(s);
> +
> + s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
> + sg_dma_len(s) = s->length;
> + }
> +
> + return nents;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
> +
> +/**
> + * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
> + * @dev:device doing the DMA request
> + * @sg:  scatter list to map
> + * @nents:   elements in the scatterlist
> + * @dir:DMA direction

Same whitespace comment as above.

> + */
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int 
> nents,
> +  enum dma_data_direction dir)
> +{
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9e907c338a44..1660f64ce96f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -125,6 +125,7 @@ struct dev_pagemap {
>   struct device *dev;
>   void *data;
>   enum memory_type type;
> + u64 pci_p2pdma_bus_offset;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index c0dde3d3aac4..126eca697ab3 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct 
> scatterlist **sgl,
>  void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
>   unsigned int nents);
>  void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> +   enum dma_data_direction dir);
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int 
> nents,
> +  enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>   size_t size, u64 offset)
> @@ -83,5 +87,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev 
> *pdev,
>  static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  {
>  }
> +static inline int pci_p2pdma_map_sg(struct device *dev,
> + struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> + return 0;
> +}
> +static inline void pci_p2pdma_unmap_sg(struct device *dev,
> + struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> +}
>  #endif /* 

Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats

2018-03-01 Thread Bjorn Helgaas
On Wed, Feb 28, 2018 at 04:39:58PM -0700, Logan Gunthorpe wrote:
> Attributes display the total amount of P2P memory, the amount available
> and whether it is published or not.

Can you add enough text here to make the body of the changelog
complete in itself?  That might mean just repeating the subject, which
is fine.

> Signed-off-by: Logan Gunthorpe 
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 25 +
>  drivers/pci/p2pdma.c| 50 
> +
>  2 files changed, 75 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
> b/Documentation/ABI/testing/sysfs-bus-pci
> index 44d4b2be92fd..f5656dae21be 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -323,3 +323,28 @@ Description:
>  
>   This is similar to /sys/bus/pci/drivers_autoprobe, but
>   affects only the VFs associated with a specific PF.
> +
> +What:/sys/bus/pci/devices/.../p2pmem/available
> +Date:November 2017
> +Contact: Logan Gunthorpe 
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains the amount of memory that has not been
> + allocated (in decimal).
> +
> +What:/sys/bus/pci/devices/.../p2pmem/size
> +Date:November 2017
> +Contact: Logan Gunthorpe 
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains the total ammount of memory that the device

s/ammount/amount/

> + provides (in decimal).
> +
> +What:/sys/bus/pci/devices/.../p2pmem/published
> +Date:November 2017
> +Contact: Logan Gunthorpe 
> +Description:
> + If the device has any Peer-to-Peer memory registered, this
> + file contains a '1' if the memory has been published for
> + use inside the kernel or a '0' if it is only intended
> + for use within the driver that published it.
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ec0a6cb9e500..a57df78f6a32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -29,6 +29,53 @@ struct pci_p2pdma {
>   bool published;
>  };
>  
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> +  char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + size_t size = 0;
> +
> + if (pdev->p2pdma->pool)
> + size = gen_pool_size(pdev->p2pdma->pool);
> +
> + return snprintf(buf, PAGE_SIZE, "%zd\n", size);
> +}
> +static DEVICE_ATTR_RO(size);
> +
> +static ssize_t available_show(struct device *dev, struct device_attribute 
> *attr,
> +   char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + size_t avail = 0;
> +
> + if (pdev->p2pdma->pool)
> + avail = gen_pool_avail(pdev->p2pdma->pool);
> +
> + return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
> +}
> +static DEVICE_ATTR_RO(available);
> +
> +static ssize_t published_show(struct device *dev, struct device_attribute 
> *attr,
> +   char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2pdma->published);
> +}
> +static DEVICE_ATTR_RO(published);
> +
> +static struct attribute *p2pmem_attrs[] = {
> + &dev_attr_size.attr,
> + &dev_attr_available.attr,
> + &dev_attr_published.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group p2pmem_group = {
> + .attrs = p2pmem_attrs,
> + .name = "p2pmem",
> +};
> +
>  static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
>  {
>   struct pci_p2pdma *p2p =
> @@ -55,6 +102,7 @@ static void pci_p2pdma_release(void *data)
>   percpu_ref_exit(&pdev->p2pdma->devmap_ref);
>  
>   gen_pool_destroy(pdev->p2pdma->pool);
> + sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
>   pdev->p2pdma = NULL;
>  }
>  
> @@ -83,6 +131,8 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>   if (error)
>   goto out_pool_destroy;
>  
> + error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
> +
>   pdev->p2pdma = p2p;

I think these two statements are out of order, since the attributes
dereference pdev->p2pdma.  And it looks like you set "error"
unnecessarily, since you return 0 immediately without ever looking at it.
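
Purely as an illustration (just a sketch, not the actual patch), I'd
expect the tail of this function to read roughly:

  pdev->p2pdma = p2p;

  error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
  if (error) {
  	pdev->p2pdma = NULL;	/* undo the assignment on failure */
  	goto out_pool_destroy;
  }

  return 0;

(out_pool_destroy being the error label this function already has.)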

>   return 0;
> -- 
> 2.11.0
> 


Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory

2018-03-01 Thread Bjorn Helgaas
s/peer to peer/peer-to-peer/ to match text below and in spec.

On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.

s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.

Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions?  I assume the device can't
really tell whether a transaction is from a CPU or a peer.

> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.

"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.

Maybe something like:

  Add interfaces for other subsystems to find and allocate ...:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

  This may reduce bandwidth of the transfer but significantly reduce
  ...

BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?
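
Such a guide could include a minimal usage sketch of the flow the
changelog describes below, e.g. (names from the list above; exact
signatures assumed, error handling and device names invented here):

  LIST_HEAD(clients);
  struct pci_dev *p2p_dev;
  void *buf;

  /* collect every device that will touch the buffer */
  pci_p2pdma_add_client(&clients, &nvme_dev->dev);
  pci_p2pdma_add_client(&clients, &rdma_dev->dev);

  /* pick a provider that is compatible with all the clients */
  p2p_dev = pci_p2pmem_find(&clients);
  if (p2p_dev)
  	buf = pci_alloc_p2pmem(p2p_dev, length);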

> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.

I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).

The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?

The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.
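
Purely to illustrate the distinction (not code from this series), a
"same immediate upstream bridge" test would be as simple as:

  static bool same_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
  {
  	struct pci_dev *up = pci_upstream_bridge(a);

  	/* true only if both devices sit directly below the same bridge */
  	return up && up == pci_upstream_bridge(b);
  }

whereas requiring a PCIe Switch additionally means verifying that the
common bridge is a Switch Downstream Port.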

> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>  
> If unsure, say N.
>  
> +config PCI_P2PDMA
> + bool "PCI Peer to Peer transfer support"
> + depends on ZONE_DEVICE
> + select GENERIC_ALLOCATOR
> + help
> +   Enables drivers to do PCI peer to peer transactions to and from

s/peer to peer/peer-to-peer/ (in bool and help text)

> +   BARs that are exposed in other devices that are the part of
> +   the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +   specification to work (ie. anything below a single PCI bridge).
> +
> +   Many PCIe root complexes do not support P2P transactions and
> +   it's hard to tell which support it with good performance, so
> +   at this time you will need a PCIe switch.

Until we have a way to figure out which of them support P2P,
performance is a don't-care.

> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index ..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.

s/Peer 2 Peer/peer-to-peer/

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.


Re: [PATCH 01/12] pci-p2p: Support peer to peer memory

2018-01-04 Thread Bjorn Helgaas
On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> ...

> + * pci_p2pmem_alloc_sgl - allocate p2p memory in an sgl
> + * @pdev: the device to allocate memory from
> + * @sgl: the allocated sgl
> + * @nents:  the number of sgs in the list
> + * @length: number of bytes to allocate

Your later patches use "SGL" in English text.  If that's the conventional
capitalization, please use it here, too.


Re: [PATCH 04/12] pci-p2p: Clear ACS P2P flags for all client devices

2018-01-04 Thread Bjorn Helgaas
[+cc Alex]

On Thu, Jan 04, 2018 at 12:01:29PM -0700, Logan Gunthorpe wrote:
> When the ACS P2P flags are set in the downstream port of the switch,
> any P2P TLPs will be sent back to the root complex. The whole point of
> the P2P work is to have TLPs avoid the RC seeing it may not support
> P2P transactions or, at best, it will perform poorly. So we clear these

"It will perform poorly" seems like an empirical statement about the
hardware you've seen, not something that's driven by the spec.  So I'm
not sure it adds information that will be useful long-term.

> flags for all devices involved in transactions with the p2pmem.

I'm not sure how this coordinates with other uses of ACS, e.g., VFIO.
I think that should be addressed here somehow.

> A count of the number of requests to disable the flags is maintained.
> When the count transitions from 1 to 0, the old flags are restored.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/p2p.c   | 144 
> ++--
>  include/linux/pci.h |   2 +
>  2 files changed, 143 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/p2p.c b/drivers/pci/p2p.c
> index 87cec87b02e3..617adaa905d2 100644
> --- a/drivers/pci/p2p.c
> +++ b/drivers/pci/p2p.c
> @@ -227,6 +227,89 @@ static struct pci_dev *find_parent_pci_dev(struct device 
> *dev)
>  }
>  
>  /*
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled in the downstream port of each device in order for
> + * the TLPs to not be forwarded up to the RC.
> + */
> +#define PCI_P2P_ACS_FLAGS (PCI_ACS_RR | PCI_ACS_CR)
> +
> +static int pci_p2pmem_disable_acs(struct pci_dev *pdev)
> +{
> + int pos;
> + u16 ctrl;
> + struct pci_dev *downstream;
> +
> + downstream = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!downstream) {
> + dev_err(&pdev->dev, "could not find downstream port\n");
> + return -ENODEV;
> + }
> +
> + device_lock(&downstream->dev);
> + if (downstream->p2p_acs_requests++)
> + goto unlock_and_return;
> +
> + pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + goto unlock_and_return;
> +
> + pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
> +
> + downstream->p2p_old_acs_flags = ctrl & PCI_P2P_ACS_FLAGS;
> +
> + if (downstream->p2p_old_acs_flags)
> + dev_info(>dev, "disabling p2p acs flags: %x\n", ctrl);
> +
> + ctrl &= ~PCI_P2P_ACS_FLAGS;
> +
> + pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
> +
> +unlock_and_return:
> + device_unlock(&downstream->dev);
> + pci_dev_put(downstream);
> + return 0;
> +}
> +
> +static int pci_p2pmem_reset_acs(struct pci_dev *pdev)
> +{
> + int pos;
> + u16 ctrl;
> + struct pci_dev *downstream;
> +
> + downstream = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!downstream)
> + return -ENODEV;
> +
> + device_lock(&downstream->dev);
> +
> + /* Only actually reset the flags on a 1->0 transition */
> + if (!downstream->p2p_acs_requests)
> + goto unlock_and_return;
> +
> + if (--downstream->p2p_acs_requests)
> + goto unlock_and_return;
> +
> + pos = pci_find_ext_capability(downstream, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + goto unlock_and_return;
> +
> + pci_read_config_word(downstream, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~PCI_P2P_ACS_FLAGS;
> + ctrl |= downstream->p2p_old_acs_flags;
> +
> + if (downstream->p2p_old_acs_flags)
> + dev_info(>dev, "resetting p2p acs flags: %x\n", ctrl);
> +
> + pci_write_config_word(downstream, pos + PCI_ACS_CTRL, ctrl);
> +
> +unlock_and_return:
> + device_unlock(&downstream->dev);
> + pci_dev_put(downstream);
> + return 0;
> +}
> +
> +/*
>   * If a device is behind a switch, we try to find the upstream bridge
>   * port of the switch. This requires two calls to pci_upstream_bridge:
>   * one for the upstream port on the switch, one on the upstream port
> @@ -340,6 +423,10 @@ int pci_p2pmem_add_client(struct list_head *head, struct 
> device *dev)
>   ret = -EXDEV;
>   goto put_client;
>   }
> +
> + ret = pci_p2pmem_disable_acs(client);
> + if (!ret)
> + goto put_client;
>   }
>  
>   new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> @@ -363,6 +450,9 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_add_client);
>  
>  static void pci_p2pmem_client_free(struct pci_p2pmem_client *item)
>  {
> + if (item->p2pmem)
> + pci_p2pmem_reset_acs(item->client);
> +
>   list_del(&item->list);
>   pci_dev_put(item->client);
>   pci_dev_put(item->p2pmem);
> @@ -383,6 +473,7 @@ void pci_p2pmem_remove_client(struct list_head *head, 
> struct device *dev)
>  {
>   struct pci_p2pmem_client *pos, *tmp;
>   struct pci_dev *pdev;
> + struct pci_dev *p2pmem = NULL;
>  
> 

Re: [PATCH 02/12] pci-p2p: Add sysfs group to display p2pmem stats

2018-01-04 Thread Bjorn Helgaas
On Thu, Jan 04, 2018 at 12:01:27PM -0700, Logan Gunthorpe wrote:
> Attributes display the total amount of P2P memory, the ammount available
> and whether it is published or not.

s/ammount/amount/ (also below)

> Signed-off-by: Logan Gunthorpe 
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 25 
>  drivers/pci/p2p.c   | 51 
> +
>  2 files changed, 76 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
> b/Documentation/ABI/testing/sysfs-bus-pci
> index 44d4b2be92fd..7b80ea77faca 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -323,3 +323,28 @@ Description:
>  
>   This is similar to /sys/bus/pci/drivers_autoprobe, but
>   affects only the VFs associated with a specific PF.
> +
> +What:		/sys/bus/pci/devices/.../p2pmem/available

I wonder if "p2pdma" would be a more suggestive term?  It's not really
the *memory* that is peer-to-peer; the peer-to-peer part is referring
to *access* to the memory.

> @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
>   if (error)
>   goto out_pool_destroy;
>  
> + if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
> + dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");

Not sure the warning (by itself) is worthwhile.  If we were going to
disable the feature if sysfs_create_group() failed, that's one thing,
but we aren't doing anything except generating a warning, which the
user can't really do anything with.  If the user is looking for the
sysfs file, its absence will be obvious even without the message.

>   pdev->p2p = p2p;
>  
>   return 0;
> -- 
> 2.11.0
> 


Re: [PATCH 01/12] pci-p2p: Support peer to peer memory

2018-01-04 Thread Bjorn Helgaas
Run "git log --oneline drivers/pci" and follow the convention.  I
think it would make sense to add a new tag like "PCI/P2P", although
"P2P" has historically also been used in the "PCI-to-PCI bridge"
context, so maybe there's something less ambiguous.  "P2PDMA"?

When you add new files, I guess we're looking for the new SPDX
copyright stuff?

On Thu, Jan 04, 2018 at 12:01:26PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.
> 
> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. 

It's more than "non-trivial" or "with good performance" (from Kconfig
help), isn't it?  AFAIK, there's no standard way at all to discover
whether P2P DMA is supported between root ports or RCs.

> +config PCI_P2P
> + bool "PCI Peer to Peer transfer support"
> + depends on ZONE_DEVICE
> + select GENERIC_ALLOCATOR
> + help
> +   Enables drivers to do PCI peer to peer transactions to and from
> +   bars that are exposed to other devices in the same domain.

s/bars/BARs/ (and similarly below, except in C code)

Similarly, s/dma/DMA/ and s/pci/PCI/ below.

And probably also s/p2p/peer-to-peer DMA/ in messages.

Maybe clarify this domain bit.  Using "domain" suggests the common PCI
segment/domain usage, but I think you really mean something like the
part of the hierarchy where peer-to-peer DMA is guaranteed by the PCI
spec to work, i.e., anything below a single PCI bridge.

> +
> +   Many PCIe root complexes do not support P2P transactions and
> +   it's hard to tell which support it with good performance, so
> +   at this time you will need a PCIe switch.
> +
> +   If unsure, say N.

> + * pci_p2pmem_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to
> + * @bar: PCI bar to add
> + * @size: size of the memory to add, may be zero to use the whole bar
> + * @offset: offset into the PCI bar
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.
> + */
> +int pci_p2pmem_add_resource(struct pci_dev *pdev, int bar, size_t size,
> + u64 offset)
> +{
> + struct dev_pagemap *pgmap;
> + void *addr;
> + int error;

Seems like there should be

  if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return -EINVAL;

or similar here?

> + if (WARN_ON(offset >= pci_resource_len(pdev, bar)))
> + return -EINVAL;

Are these WARN_ONs for debugging purposes, or do you think we need
them in production?  Granted, hitting it would probably be a kernel
driver bug, but still, not sure if the PCI core needs to coddle the
driver author that much.
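
If not, a plain check would do the same job without the stack trace:

  if (offset >= pci_resource_len(pdev, bar))
  	return -EINVAL;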

> + if (!size)
> + size = pci_resource_len(pdev, bar) - offset;
> +
> + if (WARN_ON(size + offset > pci_resource_len(pdev, bar)))
> + return -EINVAL;
> +
> + if (!pdev->p2p) {
> + error = pci_p2pmem_setup(pdev);
> + if (error)
> + return error;
> + }
> +
> + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> + pgmap->res.end = pgmap->res.start + size - 1;

I'm guessing Christoph's dev_pagemap revamp repo must change
pgmap->res from a pointer to a structure, but I don't see the actual
link in your cover letter.

I think you should set pgmap->res.flags here, too.
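
Presumably something like this (sketch only):

  pgmap->res.flags = pci_resource_flags(pdev, bar);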

> + pgmap->ref = &pdev->p2p->devmap_ref;
> + pgmap->type = 

Re: [PATCH 9/9] kernel-api.rst: fix a series of errors when parsing C files

2017-03-30 Thread Bjorn Helgaas
On Thu, Mar 30, 2017 at 05:11:36PM -0300, Mauro Carvalho Chehab wrote:
> ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
> ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference 
> start-string without end-string.
> ./mm/filemap.c:1283: ERROR: Unexpected indentation.
> ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference 
> start-string without end-string.
> ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
> ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
> ./ipc/util.c:676: ERROR: Unexpected indentation.
> ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; 
> unexpected unindent.
> ./security/security.c:109: ERROR: Unexpected indentation.
> ./security/security.c:110: WARNING: Definition list ends without a blank 
> line; unexpected unindent.
> ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
> ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
> ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without 
> end-string.
> ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without 
> end-string.
> ./ipc/util.c:477: ERROR: Unknown target name: "s".
> 
> Signed-off-by: Mauro Carvalho Chehab <mche...@s-opensource.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>   # for drivers/pci/irq.c

> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
> index 6684f153ab57..f9f2a0324ecc 100644
> --- a/drivers/pci/irq.c
> +++ b/drivers/pci/irq.c
> @@ -31,7 +31,7 @@ static void pci_note_irq_problem(struct pci_dev *pdev, 
> const char *reason)
>   * driver).
>   *
>   * Returns:
> - *  a suggestion for fixing it (although the driver is not required to
> + * a suggestion for fixing it (although the driver is not required to
>   * act on this).
>   */
>  enum pci_lost_interrupt_reason pci_lost_interrupt(struct pci_dev *pdev)


Re: [PATCH V2 2/3] PCI: add an API to get node from vector

2017-02-24 Thread Bjorn Helgaas
On Wed, Feb 01, 2017 at 09:53:15AM -0800, Shaohua Li wrote:
> Next patch will use the API to get the node from vector for nvme device
> 
> Signed-off-by: Shaohua Li <s...@fb.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

Sorry I missed this; I normally work from the linux-pci patchwork, and
this didn't show up there because it wasn't cc'd to linux-pci.  But I
should have noticed anyway.

> ---
>  drivers/pci/msi.c   | 16 
>  include/linux/pci.h |  6 ++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 50c5003..ab7aee7 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1313,6 +1313,22 @@ const struct cpumask *pci_irq_get_affinity(struct 
> pci_dev *dev, int nr)
>  }
>  EXPORT_SYMBOL(pci_irq_get_affinity);
>  
> +/**
> + * pci_irq_get_node - return the numa node of a particular msi vector
> + * @pdev: PCI device to operate on
> + * @vec: device-relative interrupt vector index (0-based).
> + */
> +int pci_irq_get_node(struct pci_dev *pdev, int vec)
> +{
> + const struct cpumask *mask;
> +
> + mask = pci_irq_get_affinity(pdev, vec);
> + if (mask)
> + return local_memory_node(cpu_to_node(cpumask_first(mask)));
> + return dev_to_node(&pdev->dev);
> +}
> +EXPORT_SYMBOL(pci_irq_get_node);
> +
>  struct pci_dev *msi_desc_to_pci_dev(struct msi_desc *desc)
>  {
>   return to_pci_dev(desc->dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e2d1a12..df2c649 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1334,6 +1334,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, 
> unsigned int min_vecs,
>  void pci_free_irq_vectors(struct pci_dev *dev);
>  int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
>  const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
> +int pci_irq_get_node(struct pci_dev *pdev, int vec);
>  
>  #else
>  static inline int pci_msi_vec_count(struct pci_dev *dev) { return -ENOSYS; }
> @@ -1384,6 +1385,11 @@ static inline const struct cpumask 
> *pci_irq_get_affinity(struct pci_dev *pdev,
>  {
>   return cpu_possible_mask;
>  }
> +
> +static inline int pci_irq_get_node(struct pci_dev *pdev, int vec)
> +{
> + return first_online_node;
> +}
>  #endif
>  
>  static inline int
> -- 
> 2.9.3
> 
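
For reference, a caller would use this helper to place per-queue
allocations on the node that services the corresponding vector; a
hypothetical sketch (names invented here, not part of the patch):

  int node = pci_irq_get_node(pdev, qid);	/* qid: 0-based vector index */
  struct my_queue *q;

  q = kzalloc_node(sizeof(*q), GFP_KERNEL, node);
  if (!q)
  	return -ENOMEM;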


Re: [PATCH 1/7] genirq/affinity: Introduce struct irq_affinity

2016-11-08 Thread Bjorn Helgaas
On Mon, Nov 07, 2016 at 10:47:36AM -0800, Christoph Hellwig wrote:
> From: Christogh Hellwig 
> 
> Some drivers (various network and RDMA adapter for example) have a MSI-X
> vector layout where most of the vectors are used for I/O queues and should
> have CPU affinity assigned to them, but some (usually 1 but sometimes more)
> at the beginning or end are used for low-performance admin or configuration
> work and should not have any explicit affinity assigned to them.
> 
> This adds a new irq_affinity structure, which will be passed through a
> variant of pci_irq_alloc_vectors that allows to specify these
> requirements (and is extensible to any future quirks in that area) so that
> the core IRQ affinity algorithm can take this quirks into account.
> 
> Signed-off-by: Christogh Hellwig 

s/Christogh/Christoph/ (also above)

What tree would you prefer?  I vote for the IRQ tree since that seems
to be where the interesting parts are, and I think I acked all the PCI
bits.

> + * struct irq_affinity - Description for auto irq affinity assignements
> + * @pre_vectors: Reserved vectors at the beginning of the MSIX
> + *   vector space
> + * @post_vectors:Reserved vectors at the end of the MSIX
> + *   vector space

Maybe include something more informative than just "reserved", e.g.,
"Don't apply affinity to @pre_vectors at beginning of MSI-X vector
space" or "Vectors at beginning of MSI-X vector space that are exempt
from affinity"?

> + */
> +struct irq_affinity {
> + int pre_vectors;
> + int post_vectors;
> +};
> +
>  #if defined(CONFIG_SMP)
>  
>  extern cpumask_var_t irq_default_affinity;
> -- 
> 2.1.4
> 


Re: [PATCH 3/7] genirq/affinity: Handle pre/post vectors in irq_create_affinity_masks()

2016-11-08 Thread Bjorn Helgaas
On Mon, Nov 07, 2016 at 10:47:38AM -0800, Christoph Hellwig wrote:
> From: Christogh Hellwig <h...@lst.de>
> 
> Only calculate the affinity for the main I/O vectors, and skip the
> pre or post vectors specified by struct irq_affinity.
> 
> Also remove the irq_affinity cpumask argument that has never been used.
> If we ever need it in the future we can pass it through struct
> irq_affinity.
> 
> Signed-off-by: Christogh Hellwig <h...@lst.de>

s/Christogh/Christoph/ (also above, and maybe other patches too?)

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
>  drivers/pci/msi.c |  4 ++--
>  include/linux/interrupt.h |  4 ++--
>  kernel/irq/affinity.c | 46 +-
>  3 files changed, 29 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index c58d3c2..1761b8a 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -558,7 +558,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, bool 
> affinity)
>   u16 control;
>  
>   if (affinity) {
> - masks = irq_create_affinity_masks(dev->irq_affinity, nvec);
> + masks = irq_create_affinity_masks(nvec, NULL);
>   if (!masks)
>   pr_err("Unable to allocate affinity masks, ignoring\n");
>   }
> @@ -697,7 +697,7 @@ static int msix_setup_entries(struct pci_dev *dev, void 
> __iomem *base,
>   int ret, i;
>  
>   if (affinity) {
> - masks = irq_create_affinity_masks(dev->irq_affinity, nvec);
> + masks = irq_create_affinity_masks(nvec, NULL);
>   if (!masks)
>   pr_err("Unable to allocate affinity masks, ignoring\n");

Not caused by this patch, but can we use dev_err() here and above?
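
Something like this, for instance (sketch only):

  masks = irq_create_affinity_masks(nvec, NULL);
  if (!masks)
  	dev_err(&dev->dev,
  		"Unable to allocate affinity masks, ignoring\n");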


Re: [PATCH 4/7] pci/msi: Propagate irq affinity description through the MSI code

2016-11-08 Thread Bjorn Helgaas
s|pci/msi|PCI/MSI| (subject)
s/irq/IRQ/ (subject)

On Mon, Nov 07, 2016 at 10:47:39AM -0800, Christoph Hellwig wrote:
> From: Christogh Hellwig <h...@lst.de>
> 
> No API change yet, just pass it down all the way from
> pci_alloc_irq_vectors to the core MSI code.

pci_alloc_irq_vectors()

> Signed-off-by: Christogh Hellwig <h...@lst.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
>  drivers/pci/msi.c | 62 
> +--
>  1 file changed, 33 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 1761b8a..512f388 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -551,14 +551,14 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
>  }
>  
>  static struct msi_desc *
> -msi_setup_entry(struct pci_dev *dev, int nvec, bool affinity)
> +msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity 
> *affd)
>  {
>   struct cpumask *masks = NULL;
>   struct msi_desc *entry;
>   u16 control;
>  
> - if (affinity) {
> - masks = irq_create_affinity_masks(nvec, NULL);
> + if (affd) {
> + masks = irq_create_affinity_masks(nvec, affd);
>   if (!masks)
>   pr_err("Unable to allocate affinity masks, ignoring\n");
>   }
> @@ -618,7 +618,8 @@ static int msi_verify_entries(struct pci_dev *dev)
>   * an error, and a positive return value indicates the number of interrupts
>   * which could have been allocated.
>   */
> -static int msi_capability_init(struct pci_dev *dev, int nvec, bool affinity)
> +static int msi_capability_init(struct pci_dev *dev, int nvec,
> +const struct irq_affinity *affd)
>  {
>   struct msi_desc *entry;
>   int ret;
> @@ -626,7 +627,7 @@ static int msi_capability_init(struct pci_dev *dev, int 
> nvec, bool affinity)
>  
>   pci_msi_set_enable(dev, 0); /* Disable MSI during set up */
>  
> - entry = msi_setup_entry(dev, nvec, affinity);
> + entry = msi_setup_entry(dev, nvec, affd);
>   if (!entry)
>   return -ENOMEM;
>  
> @@ -690,14 +691,14 @@ static void __iomem *msix_map_region(struct pci_dev 
> *dev, unsigned nr_entries)
>  
>  static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
> struct msix_entry *entries, int nvec,
> -   bool affinity)
> +   const struct irq_affinity *affd)
>  {
>   struct cpumask *curmsk, *masks = NULL;
>   struct msi_desc *entry;
>   int ret, i;
>  
> - if (affinity) {
> - masks = irq_create_affinity_masks(nvec, NULL);
> + if (affd) {
> + masks = irq_create_affinity_masks(nvec, affd);
>   if (!masks)
>   pr_err("Unable to allocate affinity masks, ignoring\n");
>   }
> @@ -753,14 +754,14 @@ static void msix_program_entries(struct pci_dev *dev,
>   * @dev: pointer to the pci_dev data structure of MSI-X device function
>   * @entries: pointer to an array of struct msix_entry entries
>   * @nvec: number of @entries
> - * @affinity: flag to indicate cpu irq affinity mask should be set
> + * @affd: Optional pointer to enable automatic affinity assignement
>   *
>   * Setup the MSI-X capability structure of device function with a
>   * single MSI-X irq. A return of zero indicates the successful setup of
>   * requested MSI-X entries with allocated irqs or non-zero for otherwise.
>   **/
>  static int msix_capability_init(struct pci_dev *dev, struct msix_entry 
> *entries,
> - int nvec, bool affinity)
> + int nvec, const struct irq_affinity *affd)
>  {
>   int ret;
>   u16 control;
> @@ -775,7 +776,7 @@ static int msix_capability_init(struct pci_dev *dev, 
> struct msix_entry *entries,
>   if (!base)
>   return -ENOMEM;
>  
> - ret = msix_setup_entries(dev, base, entries, nvec, affinity);
> + ret = msix_setup_entries(dev, base, entries, nvec, affd);
>   if (ret)
>   return ret;
>  
> @@ -956,7 +957,7 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  EXPORT_SYMBOL(pci_msix_vec_count);
>  
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> -  int nvec, bool affinity)
> +  int nvec, const struct irq_affinity *affd)
>  {
>   int nr_entries;
>   int i, j;
> @@ -988,7 +989,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct 
> msix_entry *entries,
>   dev_info(>dev, "ca

Re: [PATCH 5/7] pci/msi: Provide pci_alloc_irq_vectors_affinity()

2016-11-08 Thread Bjorn Helgaas
On Mon, Nov 07, 2016 at 10:47:40AM -0800, Christoph Hellwig wrote:
> From: Christogh Hellwig <h...@lst.de>

s/Christogh/Christoph/ (also below)

> This is a variant of pci_alloc_irq_vectors() that allows passing a
> struct irq_affinity to provide fine-grainded IRQ affinity control.

s/grainded/grained/

> For now this means being able to exclude vectors at the beginning or
> end of the MSI vector space, but it could also be used for any other
> quirks needed in the future (e.g. more vectors than CPUs, or exluding

s/exluding/excluding/

> CPUs from the spreading).
> 
> Signed-off-by: Christogh Hellwig <h...@lst.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> +int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int 
> min_vecs,
> +unsigned int max_vecs, unsigned int flags,
> +const struct irq_affinity *affd)

> +int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int 
> min_vecs,
> +unsigned int max_vecs, unsigned int flags,
> +const struct irq_affinity *affd);

> +static inline int
> +pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> +unsigned int max_vecs, unsigned int flags,
> +const struct irq_affinity *aff_desc)

Maybe use the same formal parameter name as in the definition and
declaration above?
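
As a usage sketch (driver-side names invented here, not from this
series), a driver that wants to exempt a single admin vector from the
affinity spreading would do something like:

  static const struct irq_affinity affd = {
  	.pre_vectors = 1,	/* vector 0: admin interrupt, no affinity */
  };
  int nr_vecs;

  nr_vecs = pci_alloc_irq_vectors_affinity(pdev, 2, max_io_queues + 1,
  			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
  if (nr_vecs < 0)
  	return nr_vecs;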


Re: [PATCH 6/7] pci: Remove the irq_affinity mask from struct pci_dev

2016-11-08 Thread Bjorn Helgaas
s/pci/PCI/ (in subject)

On Mon, Nov 07, 2016 at 10:47:41AM -0800, Christoph Hellwig wrote:
> This has never been used, and now is totally unreferenced.  Nuke it.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

> ---
>  include/linux/pci.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 7090f5f..f2ba6ac 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -333,7 +333,6 @@ struct pci_dev {
>* directly, use the values stored here. They might be different!
>*/
>   unsigned intirq;
> - struct cpumask  *irq_affinity;
>   struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory 
> regions + expansion ROMs */
>  
>   bool match_driver;  /* Skip attaching driver */
> -- 
> 2.1.4
> 