Re: [PATCH v5 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-08-31 Thread Christian König

On 31.08.2018 at 17:51, Logan Gunthorpe wrote:


On 31/08/18 02:08 AM, Christian König wrote:

+One of the biggest issues is that PCI doesn't require forwarding
+transactions between hierarchy domains, and in PCIe, each Root Port
+defines a separate hierarchy domain. To make things worse, there is no
+simple way to determine if a given Root Complex supports this or not.
+(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
+only supports doing P2P when the endpoints involved are all behind the
+same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
+hierarchy will be routable, but it does not require routing
+between hierarchies.

Can we add a kernel command line switch and a whitelist to enable P2P
between separate hierarchies?

In future work, yes. But not for this patchset. This is definitely the
way I see things going, but we've chosen to start with what we've presented.


Sounds like a plan to me.

If you can separate out adding the detection I can take a look at adding 
this with my DMA-buf P2P efforts.


Christian.



Logan




Re: [PATCH v5 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-08-31 Thread Christian König

On 30.08.2018 at 20:53, Logan Gunthorpe wrote:

[SNIP]
+
+PCI Peer-to-Peer DMA Support
+
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two devices on the bus. This type of transaction is henceforth
+called Peer-to-Peer (or P2P). However, there are a number of issues that
+make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI doesn't require forwarding
+transactions between hierarchy domains, and in PCIe, each Root Port
+defines a separate hierarchy domain. To make things worse, there is no
+simple way to determine if a given Root Complex supports this or not.
+(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
+only supports doing P2P when the endpoints involved are all behind the
+same PCI bridge, as such devices are all in the same PCI hierarchy
+domain, and the spec guarantees that all transactions within the
+hierarchy will be routable, but it does not require routing
+between hierarchies.


Can we add a kernel command line switch and a whitelist to enable P2P 
between separate hierarchies?


At least all newer AMD chipsets support this, and I'm pretty sure that 
Intel has a list of PCI IDs of the root hubs that support it as well.
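
A minimal sketch of what such a whitelist could look like, assuming the 
standard pci_device_id matching helpers; the device IDs below are 
placeholders, not real root complex IDs:

	/* Hypothetical whitelist of root complexes known to route P2P TLPs
	 * between their root ports; the device IDs are placeholders. */
	static const struct pci_device_id p2p_rc_whitelist[] = {
		{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x1234) },	/* placeholder */
		{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x5678) },	/* placeholder */
		{ }
	};

	static bool rc_whitelisted_for_p2p(struct pci_dev *rc)
	{
		return pci_match_id(p2p_rc_whitelist, rc) != NULL;
	}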


Regards,
Christian.


Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-31 Thread Christian König

On 30.08.2018 at 20:53, Logan Gunthorpe wrote:

Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.


We want to use that feature without ZONE_DEVICE pages for DMA-buf as well.

How hard would it be to separate enabling P2P detection (e.g. distance 
between two devices) from this?


Regards,
Christian.



Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
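
A hedged sketch of how a calling driver might use this flow; the exact
signatures are abbreviated above, so the ones assumed here (a list head,
struct device clients, a size for the allocation) may differ from the
patch, and rdma_dev/nvme_dev are hypothetical clients:

	/* Sketch only: signatures assumed from the description above. */
	static void *alloc_p2p_buffer(struct pci_dev *rdma_dev,
				      struct pci_dev *nvme_dev)
	{
		LIST_HEAD(p2p_clients);
		struct pci_dev *p2pmem_dev;

		/* Collect every device that will touch the buffer. */
		if (pci_p2pdma_add_client(&p2p_clients, &rdma_dev->dev) ||
		    pci_p2pdma_add_client(&p2p_clients, &nvme_dev->dev))
			return NULL;

		/* Find a provider compatible with all clients... */
		p2pmem_dev = pci_p2pmem_find(&p2p_clients);
		if (!p2pmem_dev)
			return NULL;

		/* ...and allocate P2P memory from its BAR. */
		return pci_alloc_p2pmem(p2pmem_dev, SZ_4K);
	}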

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC are limited to only reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/Kconfig|  17 +
  drivers/pci/Makefile   |   1 +
  drivers/pci/p2pdma.c   | 761 +
  include/linux/memremap.h   |   5 +
  include/linux/mm.h |  18 ++
  include/linux/pci-p2pdma.h | 102 ++
  include/linux/pci.h|   4 +
  7 files changed, 908 insertions(+)
  create mode 100644 drivers/pci/p2pdma.c
  create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 56ff8f6d31fc..deb68be4fdac 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -132,6 +132,23 @@ config PCI_PASID
  
  	  If unsure, say N.
  
+config PCI_P2PDMA

+   bool "PCI peer-to-peer transfer support"
+   depends on PCI && ZONE_DEVICE
+   select GENERIC_ALLOCATOR
+   help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it at all, so at this time,
+ P2P DMA transactions must be between devices behind the same root
+ port.
+
+ If unsure, say N.
+
  config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 1b2cfe51e8d7..85f4a703b2be 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
  obj-$(CONFIG_PCI_STUB)+= pci-stub.o
  obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
  obj-$(CONFIG_PCI_ECAM)+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)   += p2pdma.o
  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
  
  # Endpoint library must be initialized before its users

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..88aaec5351cd
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct pci_p2pdma {
+   struct percpu_ref devmap_ref;	/* usage count on the device's P2P pages */
+   struct completion devmap_ref_done;	/* fires once devmap_ref is drained */
+   struct gen_pool *pool;		/* allocator over the exposed BAR space */
+   bool p2pmem_published;		/* true once memory is published to clients */
+};
+
+static void pci_p2pdma_percpu_r

Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Christian König

On 10.05.2018 at 19:15, Logan Gunthorpe wrote:


On 10/05/18 11:11 AM, Stephen Bates wrote:

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address.

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
initiator with an IOVA but with the PCI bus address...


By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).


This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.
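
A minimal sketch of the kind of check this implies, using the generic
resource helpers; it assumes the BAR addresses in pdev->resource[] match
the bus addresses the bridge would route on, which holds on typical x86
systems but is itself an assumption:

	/* Return true if a candidate IOVA range collides with any memory
	 * BAR of the given peer device. */
	static bool iova_overlaps_peer_bars(struct pci_dev *peer, u64 iova,
					    u64 size)
	{
		struct resource range = {
			.start = iova,
			.end   = iova + size - 1,
		};
		int i;

		for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
			struct resource *bar = &peer->resource[i];

			if (resource_type(bar) == IORESOURCE_MEM &&
			    resource_overlaps(&range, bar))
				return true;
		}
		return false;
	}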





So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).

Agreed.


If we really want to enable P2P without ATS and IOMMU enabled I think we 
should probably approach it like this:


a) Make double sure that IOVAs in an IOMMU group never overlap with PCI 
BARs in that group.


b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.


c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.



I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
the IOMMU is enabled.



BTW: I can't say anything about other implementations, but at least for 
the AMD-IOMMU the transaction won't fail when it is sent to the root 
complex.


Instead the root complex would send it to the correct device. I already 
tested that on an AMD Ryzen with IOMMU enabled and P2P between two GPUs 
(but it could be that this only works because of ATS).


Regards,
Christian.


For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.
 
Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.
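
For reference, a hedged sketch of what setting that bit looks like with
the standard ACS config-space helpers (error handling trimmed; not code
from this patchset):

	/* Enable ACS Direct Translated P2P on one downstream port so that
	 * pre-translated (ATS) TLPs can be routed straight to a peer while
	 * Request/Completion Redirect stay enabled for untranslated ones. */
	static int acs_enable_direct_translated(struct pci_dev *pdev)
	{
		int pos;
		u16 ctrl;

		pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
		if (!pos)
			return -ENODEV;

		pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
		ctrl |= PCI_ACS_DT;	/* ACS Direct Translated P2P */
		pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
		return 0;
	}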

Logan




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Christian König

On 10.05.2018 at 16:20, Stephen Bates wrote:

Hi Jerome


As it is tied to PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPU land the existing
user is the AMD GPU driver, see:
 
Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.


Just FYI: There is also another effort ongoing to give the AMD, 
Intel, and ARM IOMMUs a common interface so that drivers can use 
whatever the platform offers for SVM support.



One thing I could not grok from the code is how the GPU driver indicates which 
DMA events require ATS translations and which do not. I am assuming the driver 
implements some way of indicating that and it's not just a global ON or OFF for 
all DMAs? The reason I ask is that I am looking at what would need to be added 
in the NVMe spec, above and beyond what we have in PCI ATS, if NVMe were to 
support efficient use of ATS (for example, would we need a flag in the 
submission queue entries to indicate that a particular IO's SGL/PRP should 
undergo ATS).


Oh, well that is complicated at best.

On very old hardware it wasn't a window, but instead you had to use 
special commands in your shader which indicated that you wanted to use an 
ATS transaction instead of a normal PCIe transaction for your 
read/write/atomic.


As Jerome explained, on most hardware we have a window inside the 
internal GPU address space which, when accessed, issues an ATS transaction 
with a configurable PASID.


But on newer hardware that window became a bit in the GPUVM page 
tables, so in theory we can now control it at 4K granularity for 
the internal 48-bit GPU address space.


Christian.



Cheers

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Christian König

On 09.05.2018 at 18:45, Logan Gunthorpe wrote:


On 09/05/18 07:40 AM, Christian König wrote:

The key takeaway is that when any device has ATS enabled you can't
disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...
The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state
of the ACS Req/Comp redirect bits.


Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.


If I'm not completely mistaken when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.


Christian.



Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Christian König

On 09.05.2018 at 15:12, Stephen Bates wrote:

Jerome and Christian
 

I think there is confusion here; Alex properly explained the scheme.
A PCIe device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an 
ATS-translated TLP is still impacted by ACS, though it has a separate control 
knob for translated-address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if 
your device supports ATS, a P2P DMA will still be routed to the associated RP of 
the domain and down again unless we disable ACS DT P2P on all bridges between 
the two devices involved in the P2P DMA.

So we still don't get fine-grained control with ATS, and I guess we still have 
security issues because a rogue or malfunctioning EP could just as easily issue 
TLPs with the TA bit set vs not set.


Still need to double check the specification (had a busy morning today), 
but that sounds about correct.


The key takeaway is that when any device has ATS enabled you can't 
disable ACS without breaking it (even if you unplug and replug it).



Also ATS is meaningless without something like PASID as far as I know.
 
ATS is still somewhat valuable without PASID in the sense that you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.


Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can 
implement P2P between root ports should advertise "ACS Direct Translated P2P 
(T)" capability. This ties into the discussion around P2P between root ports we had 
a few weeks ago...


Interesting point, give me a moment to check that. That finally makes 
all the hardware I have standing around here valuable :)


Christian.



Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

On 08.05.2018 at 18:27, Logan Gunthorpe wrote:


On 08/05/18 01:17 AM, Christian König wrote:

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
CPU when the IOMMU is enabled, or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.


Ok, that is at least a step in the right direction. But I think we 
seriously need to test that for side effects.





And what exactly is the problem here? I'm currently testing P2P with
GPUs in different IOMMU domains and at least with AMD IOMMUs that works
perfectly fine.

In addition to Stephen's comments, seeing we've established a general
need to avoid the root complex (until we have a whitelist at least) we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and if there is no support it
will fail.


Well I'm not an expert on this, but if I'm not completely mistaken that 
is not correct.


E.g. transactions are initially sent to the root complex for 
translation, that's for sure. But at least for AMD GPUs the root complex 
answers with the translated address, which is then cached in the device.


So further transactions for the same address range then go directly to 
the destination.


What you don't want is device isolation, because in this case the root 
complex handles the transactions itself. IIRC there were also 
"force_isolation" and "nobypass" parameters for the IOMMU 
to control that behavior.


It's already late here, but I'm going to dig up the documentation for that 
tomorrow and/or contact a hardware engineer involved in the ACS spec.


Regards,
Christian.



If the consensus is we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option and
this is really annoying for users of this functionality to understand
how to do that correctly.

Logan




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

On 08.05.2018 at 16:25, Stephen Bates wrote:
 
Hi Christian



AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
CPU when the IOMMU is enabled, or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all 
DMAs will now get routed up to the IOMMU before being passed down to the 
destination PCIe EP?


Well I'm not an expert on this, but I think that is an incorrect 
assumption you guys are making here.


At least in the default configuration, even with the IOMMU enabled, P2P 
transactions do NOT necessarily travel up to the root complex for 
translation.


It's already late here, but if nobody beats me to it I'm going to dig up the 
necessary documentation tomorrow.


Regards,
Christian.




Similar problems arise when you do this for dedicated GPU, but we
haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA 
will be routed to the IOMMU which removes a lot of the benefit.
 

So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a 
reasonably lengthy discussion on the mailing lists. Alex, are you still 
comfortable with this approach?
 

And what exactly is the problem here?
  
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change.

Since there is no way to currently handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series.

Note you are still allowed to have ACS functioning on other PCI domains, so if you do want a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).



I'm currently testing P2P with  GPUs in different IOMMU domains and at least 
with AMD IOMMUs that works perfectly fine.

Yup that should work, though again I have to ask: are you disabling ACS on the 
ports between the two peer devices to get the p2p benefit? If not, you are not 
getting all the performance benefit (due to IOMMU routing); if you are, then 
there are obviously security implications between those IOMMU domains if they 
are assigned to different VMs. And now the issue is that if new devices are added 
and the p2p topology needed to change, there would be no way to inform the VMs 
of any IOMMU group change.

Cheers

Stephen
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:

[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
I don't know if there are other places we would care?


thanks for pointing this out, I totally missed this hack.

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the 
CPU when the IOMMU is enabled, or otherwise you will break SVM.


Similar problems arise when you do this for dedicated GPU, but we 
haven't upstreamed the support for this yet.


So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.


Regards,
Christian.



On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:

For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch heirarchy will be in the same IOMMU
group. Which implies that individual devices behind any switch
heirarchy will not be able to be assigned to separate VMs because
there is no isolation between them. Additionally, any malicious PCIe
devices will be able to DMA to memory exposed by other EPs in the same
domain as TLPs will not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with
custom hardware designed for purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have designed the system in such a way that
devices that require isolation will be separate from those using P2P
transactions.

Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/Kconfig|  9 +
  drivers/pci/p2pdma.c   | 45 ++---
  drivers/pci/pci.c  |  6 ++
  include/linux/pci-p2pdma.h |  5 +
  4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@ config PCI_P2PDMA
  transactions must be between devices behind the same root port.
  (Typically behind a network of PCIe switches).
  
+	  Enabling this option will also disable ACS on all ports behind

+ any PCIe switch. This effectively puts all devices behind any
+ switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)


+ individual devices behind any switch will not be able to be
+ assigned to separate VMs because there is no isolation between
+ them. Additionally, any malicious PCIe devices will be able to
+ DMA to memory exposed by other EPs in the same domain as TLPs
+ will not be checked by the IOMMU.
+
  If unsure, say N.
  
  config PCI_LABEL

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device 
*dev)
  }
  
  /*

- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for the TLPs to not be forwarded
+ * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)


+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge being in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
   */
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
  {
-   struct pci_dev *up1, *up2;
+   int pos;
+   u16 ctrl;
  
-	if (!pdev)

-   return NULL;
+   if (!pci_is_brid
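
The hunk is cut off by the archive here; a hedged sketch of how clearing
the two redirect bits typically looks with the standard config-space
accessors (not necessarily the exact body of this patch):

	static int pci_p2pdma_disable_acs_sketch(struct pci_dev *pdev)
	{
		int pos;
		u16 ctrl;

		if (!pci_is_bridge(pdev))
			return 0;

		pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
		if (!pos)
			return 0;

		dev_info(&pdev->dev, "disabling ACS redirect for P2P DMA\n");

		pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
		ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR); /* Req/Comp Redirect off */
		pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
		return 1;
	}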

Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-04 Thread Christian König

On 03.05.2018 at 20:43, Logan Gunthorpe wrote:


On 03/05/18 11:29 AM, Christian König wrote:

Ok, that is the point where I'm stuck. Why do we need that in one
function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that
information here, because the PCI subsystem might not be aware of all the
interconnections.

For example it isn't uncommon to put multiple GPUs on one board. To the
PCI subsystem that looks like separate devices, but in reality all GPUs
are interconnected and can access each other's memory directly without
going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather
the driver. That's why it feels like a mistake to me to push all that
into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])


Ok, I expected that something like that would do it.

So just to confirm: When I have a bunch of GPUs which could be the 
initiator I only need to do "pci_p2pdma_distance(target, [first GPU, 
target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, 
third GPU, fourth, target])" ?



Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).


Are you sure that this is more convenient? At least at first glance it 
feels overly complicated.


I mean what's the difference between the two approaches?

    sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

    sum = pci_p2pdma_distance(target, A);
    sum += pci_p2pdma_distance(target, B);
    sum += pci_p2pdma_distance(target, C);


Yeah, same for me. If Bjorn is ok with those specialized NVM functions
then I'm fine with that as well.

I think it would just be more convenient if we can come up with
functions which can handle all use cases, because there still seem to be
a lot of similarities.

The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.


Well at the end of the day you only need to convince Bjorn of the 
interface, so I'm perfectly fine with it as long as it serves my use 
case as well :)


But I still would like to understand your intention, because that really 
helps avoid accidentally breaking something in the long term.


Now when I take a look at the pure PCI hardware level, what I have is a 
transaction between an initiator and a target, and not multiple devices 
in one operation.


I mean you must have a very good reason that you now want to deal with 
multiple devices in the software layer, but that reason is not obvious 
to me from either the code or your explanation.


Thanks,
Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Christian König

On 03.05.2018 at 17:59, Logan Gunthorpe wrote:

On 03/05/18 03:05 AM, Christian König wrote:

Second question is how do you want to handle things when devices are not
behind the same root port (which is perfectly possible in the cases I
deal with)?

I think we need to implement a whitelist. If both root ports are in the
whitelist and are on the same bus then we return a larger distance
instead of -1.


Sounds good.


Third question: why multiple clients? That feels a bit like you are
pushing something specific to your use case into the common PCI
subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.


Ok, that is the point where I'm stuck. Why do we need that in one 
function call in the PCIe subsystem?


The problem at least with GPUs is that we seriously don't have that 
information here, because the PCI subsystem might not be aware of all the 
interconnections.


For example it isn't uncommon to put multiple GPUs on one board. To the 
PCI subsystem that looks like separate devices, but in reality all GPUs 
are interconnected and can access each other's memory directly without 
going over the PCIe bus.


I seriously don't want to model that in the PCI subsystem, but rather 
the driver. That's why it feels like a mistake to me to push all that 
into the PCI function.



In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.

I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.


Why can't we model that as two separate transactions?

E.g. one from the RDMA NIC to the GPU memory. And another one from the 
GPU memory to the NVMe device.


That would also match how I get this information from userspace.


As far as I can see we need a function which returns the distance between
an initiator and a target device. This function then returns -1 if the
transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm
not against it.


Yeah, same for me. If Bjorn is ok with those specialized NVM functions 
then I'm fine with that as well.


I think it would just be more convenient if we can come up with 
functions which can handle all use cases, because there still seem to be 
a lot of similarities.





We also need to give the direction of the transaction and have a
whitelist of root complex PCI IDs which can handle P2P transactions from
different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.


Ok, I agree that can be added later on. For simplicity let's assume for 
now we always do bidirectional transfers.


Thanks for the explanation,
Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-03 Thread Christian König

On 02.05.2018 at 17:56, Logan Gunthorpe wrote:

Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:
it would be rather nice if you could separate out the 
functions to detect if peer2peer is possible between two devices.


This would essentially be pci_p2pdma_distance() in the existing 
patchset. It returns the sum of the distance between a list of clients 
and a P2PDMA provider. It returns -1 if peer2peer is not possible 
between the devices (presently this means they are not behind the same 
root port).


Ok, I'm still missing the big picture here. First question is what is 
the P2PDMA provider?


Second question is how do you want to handle things when devices are not 
behind the same root port (which is perfectly possible in the cases I 
deal with)?


Third question: why multiple clients? That feels a bit like you are 
pushing something specific to your use case into the common PCI 
subsystem. Something which usually isn't a good idea.




As far as I can see we need a function which returns the distance between 
an initiator and a target device. This function then returns -1 if the 
transaction can't be made and a positive value otherwise.


We also need to give the direction of the transaction and have a 
whitelist of root complex PCI IDs which can handle P2P transactions from 
different ports for a certain DMA direction.
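
A hedged sketch of such a distance function, walking pci_upstream_bridge()
from both devices to a common upstream port and summing the hops; this
mirrors the upstream_bridge_distance() idea from the patchset but is not
its exact code:

	/* Count bridge hops between initiator and target, or return -1
	 * when they share no upstream port (no common hierarchy). */
	static int p2p_distance_sketch(struct pci_dev *initiator,
				       struct pci_dev *target)
	{
		struct pci_dev *a, *b;
		int dist_a = 0, dist_b;

		for (a = initiator; a; a = pci_upstream_bridge(a), dist_a++) {
			dist_b = 0;
			for (b = target; b; b = pci_upstream_bridge(b), dist_b++)
				if (a == b)
					return dist_a + dist_b;
		}
		return -1;	/* no common upstream bridge: refuse P2P */
	}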



Christian.



Logan




Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-02 Thread Christian König

Hi Logan,

it would be rather nice if you could separate out the functions 
to detect if peer2peer is possible between two devices.


That would allow me to reuse the same logic for GPU peer2peer where I 
don't really have ZONE_DEVICE.


Regards,
Christian.

On 24.04.2018 at 01:30, Logan Gunthorpe wrote:

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4

Thanks,

Logan

Changes in v4:

* Change the original upstream_bridges_match() function to
   upstream_bridge_distance() which calculates the distance between two
   devices as long as they are behind the same root port. This should
   address Bjorn's concerns that the code was too focused on
   being behind a single switch.

* The disable ACS function now disables ACS for all bridge ports instead
   of switch ports (ie. those that had two upstream_bridge ports).

* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
   API to be more like sgl_alloc() in that the alloc function returns
   the allocated scatterlist and nents is not required by the free
   function.

* Moved the new documentation into the driver-api tree as requested
   by Jonathan

* Add SGL alloc and free helpers in the nvmet code so that the
   individual drivers can share the code that allocates P2P memory.
   As requested by Christoph.

* Cleanup the nvmet_p2pmem_store() function as Christoph
   thought my first attempt was ugly.

* Numerous commit message and comment fix-ups

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
   and Kconfig doc. Also, the code that disables the ACS bits is surrounded
   explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
   is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
   are closest to (or the same as) the clients using them. In cases
   of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
   may be explicitly specified, bypassing the logic in pci_p2pmem_find().
   (Note: it's still enforced that the provider must be behind the
same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
   as a bunch of cleanup and spelling fixes he pointed out in the last
   series.

* To address Alex's ACS concerns, we change to a simpler method of
   just disabling ACS behind switches for any kernel that has
   CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.

When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bit must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be addresse

Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-26 Thread Christian König

On 25.04.2017 at 20:20, Logan Gunthorpe wrote:

This patch introduces functions which kmap the pages inside an sgl.
These functions replace a common pattern of kmap(sg_page(sg)) that is
used in more than 50 places within the kernel.

The motivation for this work is to eventually safely support sgls that
contain io memory. In order for that to work, any access to the contents
of an iomem SGL will need to be done with iomemcpy or hit some warning.
(The exact details of how this will work have yet to be worked out.)
Having all the kmaps in one place is just a first step in that
direction. Additionally, seeing this helps cut down the users of sg_page,
it should make any effort to go to struct-page-less DMAs a little
easier (should that idea ever swing back into favour again).

A flags option is added to select between a regular or atomic mapping so
these functions can replace kmap(sg_page(sg)) or kmap_atomic(sg_page(sg)).
Future work may expand this to have flags for using page_address or
vmap. We include a flag to require the function not to fail to
support legacy code that has no easy error path. Much further in the
future, there may be a flag to allocate memory and copy the data
from/to iomem.

We also add the semantic that sg_map can fail to create a mapping,
despite the fact that the current code this is replacing is assumed to
never fail and the current version of these functions cannot fail. This
is to support iomem which may either have to fail to create the mapping or
allocate memory as a bounce buffer which itself can fail.

Also, in terms of cleanup, a few of the existing kmap(sg_page) users
play things a bit loose in terms of whether they apply sg->offset
so using these helper functions should help avoid such issues.

Signed-off-by: Logan Gunthorpe 
---


Good to know that somebody is working on this. Those problems troubled 
us as well.


Patch is Acked-by: Christian König .

Regards,
Christian.


  include/linux/scatterlist.h | 85 +
  1 file changed, 85 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..fad170b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  struct scatterlist {

@@ -126,6 +127,90 @@ static inline struct page *sg_page(struct scatterlist *sg)
return (struct page *)((sg)->page_link & ~0x3);
  }
  
+#define SG_KMAP		 (1 << 0)	/* create a mapping with kmap */
+#define SG_KMAP_ATOMIC	 (1 << 1)	/* create a mapping with kmap_atomic */
+#define SG_MAP_MUST_NOT_FAIL	 (1 << 2)	/* indicate sg_map should not fail */
+
+/**
+ * sg_map - kmap a page inside an sgl
+ * @sg:		SG entry
+ * @offset:	Offset into entry
+ * @flags:	Flags for creating the mapping
+ *
+ * Description:
+ *   Use this function to map a page in the scatterlist at the specified
+ *   offset. sg->offset is already added for you. Note: the semantics of
+ *   this function are that it may fail. Thus, its output should be checked
+ *   with IS_ERR and PTR_ERR. Otherwise, a pointer to the specified offset
+ *   in the mapped page is returned.
+ *
+ *   Flags can be any of:
+ * * SG_KMAP   - Use kmap to create the mapping
+ * * SG_KMAP_ATOMIC- Use kmap_atomic to map the page atomically.
+ *   Thus, the rules of that function apply: the
+ *   cpu may not sleep until it is unmapped.
+ * * SG_MAP_MUST_NOT_FAIL  - Indicate that sg_map must not fail.
+ *   If it does, it will issue a BUG_ON instead.
+ *   This is intended for legacy code only, it
+ *   is not to be used in new code.
+ *
+ *   Also, consider carefully whether this function is appropriate. It is
+ *   largely not recommended for new code and if the sgl came from another
+ *   subsystem and you don't know what kind of memory might be in the list
+ *   then you definitely should not call it. Non-mappable memory may be in
+ *   the sgl and thus this function may fail unexpectedly. Consider using
+ *   sg_copy_to_buffer instead.
+ **/
+static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
+{
+   struct page *pg;
+   unsigned int pg_off;
+   void *ret;
+
+   offset += sg->offset;
+   pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+   pg_off = offset_in_page(offset);
+
+   if (flags & SG_KMAP_ATOMIC)
+   ret = kmap_atomic(pg) + pg_off;
+   else if (flags & SG_KMAP)
+   ret = kmap(pg) + pg_off;
+   else
+   ret = ERR_PTR(-EINVAL);
+
+   /*
+* In theory, this can't happen yet. Once we start adding
+* unmapable memory, it also shouldn't happen unless developers
+* start putting unmappa
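
A hedged usage sketch for the helper above; sg_unmap() is assumed to be
the matching helper from the same patchset and its exact signature may
differ:

	/* Map one sg entry, copy some bytes out, unmap again. */
	void *vaddr = sg_map(sg, 0, SG_KMAP);

	if (IS_ERR(vaddr))
		return PTR_ERR(vaddr);	/* e.g. unmappable iomem in the sgl */

	memcpy(buf, vaddr, len);		/* buf/len provided by the caller */
	sg_unmap(sg, vaddr, 0, SG_KMAP);	/* assumed counterpart helper */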

Re: Enabling peer to peer device transactions for PCIe devices

2017-01-13 Thread Christian König

On 12.01.2017 at 16:11, Jerome Glisse wrote:

On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:

On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:


On 06/01/17 11:26 AM, Jason Gunthorpe wrote:



Make a generic API for all of this and you'd have my vote..


IMHO, you must support basic pinning semantics - that is necessary to
support generic short lived DMA (eg filesystem, etc). That hardware can
clearly do that if it can support ODP.

I agree completely.


What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
(ie. at least those backed with ZONE_DEVICE memory). Then
GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
(using whatever interface is most appropriate) and userspace can do what
it pleases with them. This makes _so_ much sense and actually largely
already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something that is upstream that enables P2P even if it is only the
minimum viable useful functionality to begin with. I think aiming for the moon
(which is what HMM and things like it are) is simply going to take more
time, if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in-kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.


iopmem is not applicable to GPUs. What I propose is to split the issue in two
so that everyone can reuse the part that needs to be common, namely the DMA
API part, where you have to create an IOMMU mapping for one device to point
to the other device's memory.

We can have a DMA API that is agnostic to how the device memory is managed
(so it does not matter whether the device memory has struct pages or not). This
is what I have been arguing in this thread. To make progress on this issue we
need to stop conflating different use cases.

So I say let's solve the IOMMU issue first and let everyone use it in their
own way with their device. I do not think we can share much more than
that.


Yeah, exactly what I said from the very beginning as well. Just hacking 
together quick solutions doesn't really solve the problem in the long term.


What we need is a proper adjustment of the DMA API towards handling P2P, 
and then to build solutions for the different use cases on top of that.


We should also avoid falling into the trap of trying to just handle the 
existing get_user_pages and co interfaces so that the existing code 
doesn't need to change. P2P needs to be validated for each use case 
individually and not implemented in workarounds with fingers crossed, 
hoping for the best.
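
As a pointer to what adjusting the DMA API could look like: around this
time the DMA API grew dma_map_resource(), which maps a raw physical
address (such as a peer's BAR) through the IOMMU without needing struct
pages. A minimal sketch, assuming the caller already knows the peer
BAR's physical address:

	/* Create an IOMMU mapping so 'initiator' can DMA to a peer BAR. */
	static dma_addr_t map_peer_bar(struct device *initiator,
				       phys_addr_t bar_phys, size_t size)
	{
		dma_addr_t addr;

		addr = dma_map_resource(initiator, bar_phys, size,
					DMA_BIDIRECTIONAL, 0);
		if (dma_mapping_error(initiator, addr))
			return 0;	/* treated as failure in this sketch */

		return addr;	/* bus address the initiator can DMA to */
	}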


Regards,
Christian.



Cheers,
Jérôme





Re: Enabling peer to peer device transactions for PCIe devices

2016-11-27 Thread Christian König

On 27.11.2016 at 15:02, Haggai Eran wrote:

On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:

On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:


Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

Well a problem which wasn't mentioned so far is that while GPUs do have a
page table to mirror the CPU page table, they usually can't recover from
page faults.
So what we do is make sure that all memory accessed by the GPU jobs stays
in place while those jobs run (pretty much the same pinning you do for the
DMA).

Yes, it is DMA, so this is a valid approach.

But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).

Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.

I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.

Really good point.

I think this means that the bare minimum, if we don't have recoverable page 
faults, is to have preemption support like Felix described in his answer 
as well.


Going to keep that in mind,
Christian.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 25.11.2016 at 20:32, Jason Gunthorpe wrote:

On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:


Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

Well a problem which wasn't mentioned so far is that while GPUs do have a
page table to mirror the CPU page table, they usually can't recover from
page faults.
So what we do is make sure that all memory accessed by the GPU jobs stays
in place while those jobs run (pretty much the same pinning you do for the
DMA).

Yes, it is DMA, so this is a valid approach.

But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).

Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.


Yeah, we have a function to "import" anonymous pages from a CPU pointer 
which works exactly that way as well.


We call this "userptr" and it's just a combination of get_user_pages() 
on command submission and making sure the returned list of pages stays 
valid using an MMU notifier.
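
A schematic sketch of that pattern; mmu_notifier callback signatures have
changed across kernel versions, so this is illustrative only:

	struct userptr_job {
		struct mmu_notifier mn;
		struct page **pages;		/* from get_user_pages() */
		unsigned long start, npages;
	};

	static void job_invalidate_range_start(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end)
	{
		struct userptr_job *job =
			container_of(mn, struct userptr_job, mn);

		/* If a running GPU job touches [start, end), block here
		 * until the job flushes, then drop the stale pages. */
	}

At command submission the driver calls get_user_pages() (whose signature
also varies by kernel version) and registers the notifier with
mmu_notifier_register().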


The "big" problem with this approach is that it is horribly slow. I mean 
so seriously slow that we actually can't use it for some of the 
purposes we wanted to use it for.



The code moving the page will move it and the next GPU command that
needs it will refault it in the usual way, just like the CPU would.


And here comes the problem. CPUs do this on a page-by-page basis, so they 
fault in only what is needed and everything else gets filled in on demand. 
The result is that faulting a page is a relatively lightweight operation.


But for GPU command submission we don't know which pages might be 
accessed beforehand, so what we do is walk all possible pages and 
make sure all of them are present.


Now as far as I understand it, the I/O subsystem for example assumes that 
it can easily change the CPU page tables without much overhead. So for 
example when a page can't be modified it is temporarily marked as read-only, 
AFAIK (you are probably way deeper into this than me, so please confirm).


That absolutely kills any performance for GPU command submissions. We 
have use cases where we practically ended up playing ping-pong between 
the GPU driver trying to grab the page with get_user_pages() and somebody 
else in the kernel marking it read-only.



This might be much more efficient since it optimizes for the common
case of unchanging translation tables.


Yeah, completely agree. It works perfectly fine as long as you don't 
have two drivers trying to mess with the same page.



This assumes the commands are fairly short lived of course, the
expectation of the mmu notifiers is that a flush is reasonably prompt


Correct, this is another problem. GFX command submissions usually don't 
take longer than a few milliseconds, but compute command submissions can 
easily take multiple hours.


I can easily imagine what would happen when kswapd is blocked by a GPU 
command submission for an hour or so while the system is under memory 
pressure :)


I've been thinking about this problem for about a year now and have been 
going in circles for quite a while. So if you have ideas on this, even if 
they sound totally crazy, feel free to bring them up.


Cheers,
Christian.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 24.11.2016 at 17:42, Jason Gunthorpe wrote:

On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:


On 23/11/16 02:55 PM, Jason Gunthorpe wrote:

Only ODP hardware allows changing the DMA address on the fly, and it
works at the page table level. We do not need special handling for
RDMA.

I am aware of ODP but, noted by others, it doesn't provide a general
solution to the points above.

How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB
hardware that doesn't support ODP or other hardware  that doesn't do
similar things (like an NVMe drive).

There are three cases to worry about:
  - Coherent long lived page table mirroring (RDMA ODP MR)
  - Non-coherent long lived page table mirroring (RDMA MR)
  - Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.


Well a problem which wasn't mentioned so far is that while GPUs do have 
a page table to mirror the CPU page table, they usually can't recover 
from page faults.


So what we do is make sure that all memory accessed by the GPU jobs 
stays in place while those jobs run (pretty much the same pinning you do 
for the DMA).


But since this can lock down huge amounts of memory the whole command 
submission to GPUs is bound to the memory management. So when too much 
memory would get blocked by the GPU we block further command submissions 
until the situation resolves.



any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard
requirement.


Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe bars to userspace to be used as MRs and
such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the
objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?


Depends on the policy in use, but currently it's the other way 
around most of the time.


E.g. we allocate memory in VRAM, the CPU writes to it write-combined (WC) 
and avoids reading it back because that is slow; the GPU in turn can 
access it at full speed.
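
The CPU side of such an allocation is essentially a write-combined 
mapping of the BAR; a minimal sketch:

#include <linux/io.h>
#include <linux/pci.h>

/*
 * WC batches CPU writes nicely, but every CPU read goes uncached
 * across the bus, which is why the driver avoids reading back.
 */
static void __iomem *map_vram_wc(struct pci_dev *pdev, int bar)
{
	return ioremap_wc(pci_resource_start(pdev, bar),
			  pci_resource_len(pdev, bar));
}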


When we run out of VRAM we move those allocations to system memory and 
update both the CPU and the GPU page tables.


So that move is transparent for both userspace as well as shaders 
running on the GPU.



One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??


The whole purpose of this effort is that we can do I/O on VRAM directly 
without migrating everything back to system memory.


Allowing this, but then doing the migration on the first CPU touch, is 
clearly not a good idea.


Regards,
Christian.



Jason





Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 24.11.2016 at 18:55, Logan Gunthorpe wrote:

Hey,

On 24/11/16 02:45 AM, Christian König wrote:

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
Now PCI device B (a SATA device) can directly read/write to it because
it is on the same bus segment, but PCI device C (a network card for
example) can't because it is on a different bus segment and the bridge
can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)


Well, Serguei sent me a couple of documents about QPI when we started to 
discuss this internally as well, and that's exactly one of the cases I 
had in mind when writing this.


If I understood it correctly, for such systems P2P is technically possible, 
but not necessarily a good idea. Usually it is faster to just use a 
bounce buffer when the peers are a bit "farther" apart.


That this problem is solved on newer hardware is good, but it doesn't help 
us at all if we want to support systems from at least the last five 
years or so.



It may just be older hardware that has this issue. I expect that as long
as a failed transfer can be handled gracefully by the initiator I don't
see a need to predetermine whether a device can see another device's memory.


I don't want to predetermine whether a device can see another device's 
memory at get_user_pages() time.


My thinking was going more in the direction of a whitelist, to figure 
out at dma_map_single()/dma_map_sg() time whether we should use a 
bounce buffer or not.
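
Roughly like this hypothetical sketch; p2p_root_whitelisted() is a 
made-up helper standing in for a table of root-complex PCI IDs that 
are known to route P2P TLPs:

#include <linux/pci.h>

static bool p2p_root_whitelisted(struct pci_dev *root); /* hypothetical */

static struct pci_dev *topmost_bridge(struct pci_dev *dev)
{
	while (pci_upstream_bridge(dev))
		dev = pci_upstream_bridge(dev);
	return dev;
}

/* Decide at map time: direct P2P address or bounce buffer? */
static bool use_direct_p2p(struct pci_dev *a, struct pci_dev *b)
{
	if (pci_domain_nr(a->bus) != pci_domain_nr(b->bus))
		return false;	/* different domains: bounce */

	return p2p_root_whitelisted(topmost_bridge(a)) &&
	       p2p_root_whitelisted(topmost_bridge(b));
}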


Christian.




Logan





Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Christian König

On 24.11.2016 at 00:25, Jason Gunthorpe wrote:

There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.

Well, that is clearly not so simple. When your ZONE_DEVICE pages describe 
a PCI BAR and another PCI device initiates a DMA to this address, the DMA 
subsystem must be able to check if the interconnection really works.


E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE. 
Now PCI device B (a SATA device) can directly read/write to it because 
it is on the same bus segment, but PCI device C (a network card for 
example) can't because it is on a different bus segment and the bridge 
can't handle P2P transactions.


We need to be able to handle such cases and fall back to bounce 
buffers, but I don't see that in the DMA subsystem right now.
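
The kind of check the mapping code would need is something like the 
following sketch: walk both devices' upstream chains and look for a 
shared bridge, since the spec only guarantees routing below a common 
bridge:

#include <linux/pci.h>

static bool shares_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;	/* routable per spec */

	return false;	/* crosses hierarchies: use a bounce buffer */
}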


Regards,
Christian.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Christian König

On 23.11.2016 at 08:49, Daniel Vetter wrote:

On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:

On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:

On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:

On 2016-11-22 03:10 PM, Daniel Vetter wrote:

On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:

I personally like "device-DAX" idea but my concerns are:

-  How well it will co-exists with the  DRM infrastructure /
implementations
 in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.


-  How well will we be able to handle the case when we need to
"move"/"evict" memory/data to a new location, so the CPU pointer should
point to the new physical location/address (which may not be in PCI
device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic,

It is possible that it is the other way around: memory is requested to be
allocated and should be kept in VRAM for performance reasons, but due
to a possible overcommit case we need to at least temporarily "move" such
an allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior.  The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
to reuse all the existing allocation policies directly, those won't work.
So at boot-up your default numa policy would exclude any vram nodes.

But I think (as an -mm layman) that numa gives us a lot of the tools and
policy interface that we need to implement what we want for gpus.


Agree completely. From a ten-mile-high view our GPUs are just command 
processors with local memory as well.


Basically this is also the whole idea of what AMD has been pushing with 
HSA for a while.


It's just that a lot of problems start to pop up when you look at all 
the nasty details. For example, only part of the GPU memory is usually 
accessible by the CPU.


So even if NUMA nodes provide a good foundation for this, I think there 
is still a lot of code to write.
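
To illustrate the user-space side: assuming, hypothetically, that the 
CPU-visible part of VRAM showed up as node 1, explicit placement with 
libnuma would look like this sketch (compile with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* 1 MiB placed explicitly on the "VRAM" node... */
	buf = numa_alloc_onnode(1 << 20, 1);
	if (!buf)
		return 1;

	memset(buf, 0, 1 << 20);	/* ...touched by the CPU as usual */
	numa_free(buf, 1 << 20);
	return 0;
}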


BTW: I should probably start to read into the NUMA code of the kernel. 
Any good pointers for that?


Regards,
Christian.


Wrt isolation: There's a sliding scale of what different users expect,
from full auto everything, including migrating pages around if needed to
full isolation all seems to be on the table. As long as we keep vram nodes
out of any default allocation numasets, full isolation should be possible.
-Daniel


