Re: [PATCH v5 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation
On 31.08.2018 at 17:51, Logan Gunthorpe wrote:
On 31/08/18 02:08 AM, Christian König wrote:

+One of the biggest issues is that PCI doesn't require forwarding
+transactions between hierarchy domains, and in PCIe, each Root Port
+defines a separate hierarchy domain. To make things worse, there is no
+simple way to determine if a given Root Complex supports this or not.
+(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
+only supports doing P2P when the endpoints involved are all behind the
+same PCI bridge, as such devices are all in the same PCI hierarchy
+domain, and the spec guarantees that all transactions within the
+hierarchy will be routable, but it does not require routing
+between hierarchies.

Can we add a kernel command line switch and a whitelist to enable P2P between separate hierarchies?

In future work, yes. But not for this patchset. This is definitely the way I see things going, but we've chosen to start with what we've presented.

Sounds like a plan to me. If you can separate out the detection, I can take a look at adding this with my DMA-buf P2P efforts.

Christian.

Logan
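For illustration, a minimal sketch of what such a command line switch plus root complex whitelist could look like. Nothing like this exists in the posted series; the option name, helper names, and the PCI ID below are all placeholders:

```c
/* Purely illustrative: gate cross-hierarchy P2P behind a hypothetical
 * boot option and a whitelist of root complexes known to route P2P
 * TLPs between their Root Ports. */
#include <linux/pci.h>

static bool p2pdma_cross_root;	/* would be set by a boot option */

static const struct pci_device_id p2p_root_whitelist[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x1450) },	/* placeholder ID */
	{ 0 }
};

static bool host_bridge_allows_p2p(struct pci_dev *root_port)
{
	return p2pdma_cross_root &&
	       pci_match_id(p2p_root_whitelist, root_port);
}
```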
Re: [PATCH v5 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation
On 30.08.2018 at 20:53, Logan Gunthorpe wrote:
[SNIP]
+
+PCI Peer-to-Peer DMA Support
+
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two devices on the bus. This type of transaction is henceforth
+called Peer-to-Peer (or P2P). However, there are a number of issues that
+make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI doesn't require forwarding
+transactions between hierarchy domains, and in PCIe, each Root Port
+defines a separate hierarchy domain. To make things worse, there is no
+simple way to determine if a given Root Complex supports this or not.
+(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
+only supports doing P2P when the endpoints involved are all behind the
+same PCI bridge, as such devices are all in the same PCI hierarchy
+domain, and the spec guarantees that all transactions within the
+hierarchy will be routable, but it does not require routing
+between hierarchies.

Can we add a kernel command line switch and a whitelist to enable P2P between separate hierarchies? At least all newer AMD chipsets support this, and I'm pretty sure that Intel has a list with PCI IDs of the root hubs for this as well.

Regards,
Christian.
Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory
On 30.08.2018 at 20:53, Logan Gunthorpe wrote:

Some PCI devices may have memory mapped in a BAR space that's intended for use in peer-to-peer transactions. In order to enable such transactions the memory must be registered with ZONE_DEVICE pages so it can be used by DMA interfaces in existing drivers.

We want to use that feature without ZONE_DEVICE pages for DMA-buf as well.

How hard would it be to separate enabling P2P detection (e.g. distance between two devices) from this?

Regards,
Christian.

Add an interface for other subsystems to find and allocate chunks of P2P memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices involved in the transaction with the pci_p2pmem_add_client*() functions, then call pci_p2pmem_find() to obtain any suitable P2P memory. Once this is done the list is bound to the memory and the calling driver is free to add and remove clients as necessary (adding incompatible clients will fail). With a suitable p2pmem device, memory can then be allocated with pci_alloc_p2pmem() for use in DMA transactions.

Depending on hardware, using peer-to-peer memory may reduce the bandwidth of the transfer but can significantly reduce pressure on system memory. This may be desirable in many cases: for example a system could be designed with a small CPU connected to a PCIe switch by a small number of lanes which would maximize the number of lanes available to connect to NVMe devices.

The code is designed to only utilize the p2pmem device if all the devices involved in a transfer are behind the same PCI bridge. This is because we have no way of knowing whether peer-to-peer routing between PCIe Root Ports is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that go through the RC are limited to only reducing DRAM usage and, in some cases, coding convenience. The PCI-SIG may be exploring adding a new capability bit to advertise whether this is possible for future hardware.

This commit includes significant rework and feedback from Christoph Hellwig.

Signed-off-by: Christoph Hellwig
Signed-off-by: Logan Gunthorpe
---
 drivers/pci/Kconfig        |  17 +
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 761 +
 include/linux/memremap.h   |   5 +
 include/linux/mm.h         |  18 ++
 include/linux/pci-p2pdma.h | 102 ++
 include/linux/pci.h        |   4 +
 7 files changed, 908 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 56ff8f6d31fc..deb68be4fdac 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -132,6 +132,23 @@ config PCI_PASID
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on PCI && ZONE_DEVICE
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it at all, so at this time,
+	  P2P DMA transactions must be between devices behind the same root
+	  port.
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
 
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 1b2cfe51e8d7..85f4a703b2be 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL)	+= syscall.o
 obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_PF_STUB)	+= pci-pf-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 # Endpoint library must be initialized before its users

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..88aaec5351cd
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+#include <linux/seq_buf.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_r
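For orientation, a hedged sketch of how a driver might consume the interface named in the commit message above. The argument lists and the cleanup helper are assumptions based on the series description, not verbatim from the patch:

```c
/* Hedged sketch: allocate P2P memory for a transfer between two PCI
 * peers.  Signatures assumed from the series description. */
#include <linux/pci-p2pdma.h>

static void *setup_p2p_buffer(struct device *nic, struct device *nvme,
			      struct pci_dev **provider_out)
{
	LIST_HEAD(clients);
	struct pci_dev *provider;

	/* Collect every device that will touch the buffer. */
	if (pci_p2pdma_add_client(&clients, nic))
		return NULL;
	if (pci_p2pdma_add_client(&clients, nvme))
		goto err;

	/* Pick a P2P memory provider reachable by all clients. */
	provider = pci_p2pmem_find(&clients);
	if (!provider)
		goto err;

	*provider_out = provider;
	return pci_alloc_p2pmem(provider, 4096);
err:
	pci_p2pdma_client_list_free(&clients);	/* assumed cleanup helper */
	return NULL;
}
```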
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 10.05.2018 at 19:15, Logan Gunthorpe wrote:
On 10/05/18 11:11 AM, Stephen Bates wrote:

Not to me. In the p2pdma code we specifically program DMA engines with the PCI bus address.

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their address routing from IOVA addresses (which are to be resolved by the root complex) back to PCI bus addresses (which are resolved locally in the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. I'm not sure how the IOVA allocation works, but I don't think we guarantee that on Linux.

So regardless of whether we are using the IOMMU or not, the packets will be forwarded directly to the peer. If the ACS Redir bits are on they will be forced back to the RC by the switch and the transaction will fail. If we clear the ACS bits, the TLPs will go where we want and everything will work (but we lose the isolation of ACS).

Agreed.

If we really want to enable P2P without ATS and with the IOMMU enabled, I think we should probably approach it like this:

a) Make double sure that IOVAs in an IOMMU group never overlap with PCI BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the same IOMMU group.

I agree that we have a rather special case here, but I still find that approach rather brave and would vote for disabling P2P without ATS when the IOMMU is enabled.

BTW: I can't say anything about other implementations, but at least for the AMD IOMMU the transaction won't fail when it is sent to the root complex. Instead the root complex would send it to the correct device. I already tested that on an AMD Ryzen with the IOMMU enabled and P2P between two GPUs (but it could be that this only works because of ATS).

Regards,
Christian.

For EPs that support ATS, we should (but don't necessarily have to) program them with the IOVA address so they can go through the translation process which will allow P2P without disabling the ACS Redir bits -- provided the ACS Direct Translated bit is set. (And btw, if it is, then we lose the benefit of ACS protecting against malicious EPs.) But, per above, the ATS transaction should involve only the IOVA address, so the ACS bits not being set should not break ATS.

Well, we would still have to clear some ACS bits, but now we can clear them only for translated addresses. We don't have to clear the ACS Redir bits as we did in the first case. We just have to make sure the ACS Direct Translated bit is set.

Logan
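A minimal sketch of the variant Logan describes at the end (keep the Redirect bits for isolation of untranslated TLPs, but set Direct Translated so ATS-translated requests route straight to the peer). This uses the standard ACS config space layout and is not code from the series:

```c
/* Sketch: permit only ATS-translated peer TLPs through a downstream
 * port by setting ACS Direct Translated P2P while leaving the
 * Request/Completion Redirect bits as they are. */
#include <linux/pci.h>

static void acs_allow_translated_p2p(struct pci_dev *bridge)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	/* Untranslated TLPs stay redirected to the RC (isolation kept);
	 * translated requests are routed directly to the peer. */
	ctrl |= PCI_ACS_DT;
	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
}
```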
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 10.05.2018 at 16:20, Stephen Bates wrote:

Hi Jerome

As it is tied to PASID this is done using the IOMMU, so look for callers of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing user is the AMD GPU driver, see:

Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm(), but I see the AMD version used in that GPU driver.

Just FYI: There is also another effort ongoing to give the AMD, Intel, as well as ARM IOMMUs a common interface so that drivers can use whatever the platform offers for SVM support.

One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what, if NVMe were to support ATS, would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).

Oh, well that is complicated at best.

On very old hardware it wasn't a window, but instead you had to use special commands in your shader which indicated that you wanted to use an ATS transaction instead of a normal PCIe transaction for your read/write/atomic.

As Jerome explained, on most hardware we have a window inside the internal GPU address space which, when accessed, issues an ATS transaction with a configurable PASID.

But on newer hardware that window became a bit in the GPUVM page tables, so in theory we can now control it on a 4K-granularity basis for the internal 48-bit GPU address space.

Christian.

Cheers
Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
On 09/05/18 07:40 AM, Christian König wrote:

The key takeaway is that when any device has ATS enabled you can't disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion... The ACS bits we'd be turning off are the ones that force TLPs addressed at a peer to go to the RC. However, ATS translation packets will be addressed to an untranslated address which a switch will not identify as a peer address, so they should be sent upstream regardless of the state of the ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID together with ATS to identify the address space which a transaction should use.

If I'm not completely mistaken, when you disable ACS it is perfectly possible that a bridge identifies a transaction as belonging to a peer address, which isn't what we want here.

Christian.

Once the translation comes back, the ATS endpoint should send the TLP to the peer address with the AT packet type and it will be directed to the peer provided the Direct Translated bit is set (or the redirect bits are unset). I can't see how turning off the Req/Comp redirect bits could break anything except for the isolation they provide.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 09.05.2018 at 15:12, Stephen Bates wrote:

Jerome and Christian

I think there is confusion here. Alex properly explained the scheme: a PCIe device does an ATS request to the IOMMU, which returns a valid translation for a virtual address. The device can then use that address directly without going through the IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS, though it has a separate control knob for translated-address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS, a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA. So we still don't get fine-grained control with ATS, and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set vs not set.

Still need to double-check the specification (had a busy morning today), but that sounds about correct. The key takeaway is that when any device has ATS enabled you can't disable ACS without breaking it (even if you unplug and replug it).

Also, ATS is meaningless without something like PASID as far as I know.

ATS is still somewhat valuable without PASID in the sense that you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.

Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise the "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...

Interesting point, give me a moment to check that. That finally makes all the hardware I have standing around here valuable :)

Christian.

Stephen
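A short sketch of how one could probe for the capability Stephen quotes (ACS Direct Translated P2P). This uses the generic ACS extended capability registers and is illustrative only:

```c
/* Sketch: does this port advertise ACS Direct Translated P2P? */
#include <linux/pci.h>

static bool acs_has_direct_translated(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 cap;

	if (!pos)
		return false;	/* no ACS capability at all */

	pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
	return cap & PCI_ACS_DT;
}
```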
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08.05.2018 at 18:27, Logan Gunthorpe wrote:
On 08/05/18 01:17 AM, Christian König wrote:

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the CPU when the IOMMU is enabled, or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges (previous versions were only on switches) this shouldn't be an issue for integrated devices. We do not disable ACS flags globally.

Ok, that is at least a step in the right direction. But I think we seriously need to test that for side effects.

And what exactly is the problem here? I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

In addition to Stephen's comments, seeing we've established a general need to avoid the root complex (until we have a whitelist at least) we must have ACS disabled along the path between the devices. Otherwise, all TLPs will go through the root complex and if there is no support it will fail.

Well, I'm not an expert on this, but if I'm not completely mistaken that is not correct. E.g. transactions are initially sent to the root complex for translation, that's for sure. But at least for AMD GPUs the root complex answers with the translated address, which is then cached in the device. So further transactions for the same address range then go directly to the destination.

What you don't want is device isolation, because in this case the root complex handles the transactions itself. IIRC there were also "force_isolation" and "nobypass" parameters for the IOMMU to control that behavior.

It's already late here, but I'm going to dig up the documentation for that tomorrow and/or contact a hardware engineer involved in the ACS spec.

Regards,
Christian.

If the consensus is we want a command line option, then so be it. But we'll have to deny pretty much all P2P transactions unless the user correctly disables ACS along the path using the command line option, and it is really annoying for users of this functionality to understand how to do that correctly.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08.05.2018 at 16:25, Stephen Bates wrote:

Hi Christian

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the CPU when the IOMMU is enabled, or otherwise you will break SVM.

OK, but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well, I'm not an expert on this, but I think that is an incorrect assumption you guys use here. At least in the default configuration, even with the IOMMU enabled, P2P transactions do NOT necessarily travel up to the root complex for translation.

It's already late here, but if nobody beats me to it I'm going to dig up the necessary documentation tomorrow.

Regards,
Christian.

Similar problems arise when you do this for dedicated GPUs, but we haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P-enabled DMA will be routed to the IOMMU, which removes a lot of the benefit.

So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?

And what exactly is the problem here?

We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers
Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on CONFIG_PCI_P2PDMA? We only check this at enumeration-time and I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the CPU when the IOMMU is enabled, or otherwise you will break SVM.

Similar problems arise when you do this for dedicated GPUs, but we haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Regards,
Christian.

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:

For peer-to-peer transactions to work the downstream ports in each switch must not have the ACS flags set. At this time there is no way to dynamically change the flags and update the corresponding IOMMU groups so this is done at enumeration time before the groups are assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then all devices behind any PCIe switch heirarchy will be in the same IOMMU group. Which implies that individual devices behind any switch heirarchy will not be able to be assigned to separate VMs because there is no isolation between them. Additionally, any malicious PCIe devices will be able to DMA to memory exposed by other EPs in the same domain as TLPs will not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with custom hardware designed for purpose, we do not expect distributors to ever need to enable this option. Users that want to use P2P must have compiled a custom kernel with this configuration option and understand the implications regarding ACS. They will either not require ACS or will have designed the system in such a way that devices that require isolation will be separate from those using P2P transactions.

Signed-off-by: Logan Gunthorpe
---
 drivers/pci/Kconfig        |  9 +
 drivers/pci/p2pdma.c       | 45 ++---
 drivers/pci/pci.c          |  6 ++
 include/linux/pci-p2pdma.h |  5 +
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@ config PCI_P2PDMA
 	  transactions must be between devices behind the same root
 	  port. (Typically behind a network of PCIe switches).
 
+	  Enabling this option will also disable ACS on all ports behind
+	  any PCIe switch. This effectively puts all devices behind any
+	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

+	  individual devices behind any switch will not be able to be
+	  assigned to separate VMs because there is no isolation between
+	  them. Additionally, any malicious PCIe devices will be able to
+	  DMA to memory exposed by other EPs in the same domain as TLPs
+	  will not be checked by the IOMMU.
+
 	  If unsure, say N.
 
 config PCI_LABEL

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
 }
 
 /*
- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for the TLPs to not be forwarded
+ * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge to be in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
+ */
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
 {
-	struct pci_dev *up1, *up2;
+	int pos;
+	u16 ctrl;
 
-	if (!pdev)
-		return NULL;
+	if (!pci_is_brid
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 03.05.2018 at 20:43, Logan Gunthorpe wrote:
On 03/05/18 11:29 AM, Christian König wrote:

Ok, that is the point where I'm stuck. Why do we need that in one function call in the PCIe subsystem? The problem at least with GPUs is that we seriously don't have that information here, cause the PCI subsystem might not be aware of all the interconnections. For example it isn't uncommon to put multiple GPUs on one board. To the PCI subsystem that looks like separate devices, but in reality all GPUs are interconnected and can access each other's memory directly without going over the PCIe bus. I seriously don't want to model that in the PCI subsystem, but rather the driver. That's why it feels like a mistake to me to push all that into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a list to this API, if you want. If the driver is _sure_ they are all the same, you only have to send one. In your terminology, you'd just have to call the interface with:

pci_p2pdma_distance(target, [initiator, target])

Ok, I expected that something like that would do it. So just to confirm: when I have a bunch of GPUs which could be the initiator, I only need to do "pci_p2pdma_distance(target, [first GPU, target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, third GPU, fourth GPU, target])"?

Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to deal with multiple devices (and manage devices that may be added or removed at any time).

Are you sure that this is more convenient? At least on first glance it feels overly complicated.

I mean, what's the difference between the two approaches?

sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

sum = pci_p2pdma_distance(target, A);
sum += pci_p2pdma_distance(target, B);
sum += pci_p2pdma_distance(target, C);

Yeah, same for me. If Bjorn is ok with those specialized NVM functions, then I'm fine with that as well. I think it would just be more convenient when we can come up with functions which can handle all use cases, cause there still seems to be a lot of similarities.

The way it's implemented is more general and can handle all use cases. You are arguing for a function that can handle your case (albeit with a bit more fuss) but can't handle mine and is therefore less general. Calling my interface specialized is wrong.

Well, at the end of the day you only need to convince Bjorn of the interface, so I'm perfectly fine with it as long as it serves my use case as well :)

But I still would like to understand your intention, cause that really helps not to accidentally break something in the long term.

Now when I take a look at the pure PCI hardware level, what I have is a transaction between an initiator and a target, and not multiple devices in one operation. I mean, you must have a very good reason that you now want to deal with multiple devices in the software layer, but neither from the code nor from your explanation does that reason become obvious to me.

Thanks,
Christian.

Logan
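For readers following along, a hedged sketch of the list-based call as it might look in driver code. The exact prototype of pci_p2pdma_distance() in the series may differ from what is assumed here, and the client list cleanup is elided:

```c
/* Hedged sketch: query whether every client can reach the provider.
 * Prototypes assumed from the discussion, not quoted verbatim. */
#include <linux/pci-p2pdma.h>

static bool peers_reachable(struct pci_dev *provider,
			    struct device *initiator,
			    struct device *target)
{
	LIST_HEAD(clients);

	if (pci_p2pdma_add_client(&clients, initiator))
		return false;
	if (pci_p2pdma_add_client(&clients, target))
		return false;

	/* A negative distance means at least one client has no usable
	 * P2P path to the provider (e.g. different root ports). */
	return pci_p2pdma_distance(provider, &clients) >= 0;
}
```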
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 03.05.2018 at 17:59, Logan Gunthorpe wrote:
On 03/05/18 03:05 AM, Christian König wrote:

Second question: how do you want to handle things when devices are not behind the same root port (which is perfectly possible in the cases I deal with)?

I think we need to implement a whitelist. If both root ports are in the whitelist and are on the same bus then we return a larger distance instead of -1.

Sounds good.

Third question: why multiple clients? That feels a bit like you are pushing something special to your use case into the common PCI subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you are going to have one provider and at least two clients (one which writes the memory and one which reads it). However, one client is likely, but not necessarily, the same as the provider.

Ok, that is the point where I'm stuck. Why do we need that in one function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that information here, cause the PCI subsystem might not be aware of all the interconnections. For example it isn't uncommon to put multiple GPUs on one board. To the PCI subsystem that looks like separate devices, but in reality all GPUs are interconnected and can access each other's memory directly without going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather the driver. That's why it feels like a mistake to me to push all that into the PCI function.

In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block devices. The code doesn't care which device provides the memory as it could be the RDMA device or one/all of the block devices (or, in theory, a completely separate device with P2P-able memory). However, it does require that all devices involved are accessible per pci_p2pdma_distance() or it won't use P2P transactions.

I could also imagine other use cases: ie. an RDMA NIC sends data to a GPU for processing and then sends the data to an NVMe device for storage (or vice-versa). In this case we have 3 clients and one provider.

Why can't we model that as two separate transactions? E.g. one from the RDMA NIC to the GPU memory. And another one from the GPU memory to the NVMe device. That would also match how I get this information from userspace.

As far as I can see we need a function which returns the distance between an initiator and target device. This function then returns -1 if the transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm not against it.

Yeah, same for me. If Bjorn is ok with those specialized NVM functions, then I'm fine with that as well. I think it would just be more convenient when we can come up with functions which can handle all use cases, cause there still seems to be a lot of similarities.

We also need to give the direction of the transaction and have a whitelist of root complex PCI IDs which can handle P2P transactions from different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both directions, so we did not need the DMA direction. But I can see this being useful once we add the whitelist.

Ok, I agree that can be added later on. For simplicity let's assume for now we always do bidirectional transfers.

Thanks for the explanation,
Christian.

Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 02.05.2018 at 17:56, Logan Gunthorpe wrote:

Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:

It would be rather nice if you could separate out the functions to detect whether peer2peer is possible between two devices.

This would essentially be pci_p2pdma_distance() in the existing patchset. It returns the sum of the distance between a list of clients and a P2PDMA provider. It returns -1 if peer2peer is not possible between the devices (presently this means they are not behind the same root port).

Ok, I'm still missing the big picture here. First question: what is the P2PDMA provider?

Second question: how do you want to handle things when devices are not behind the same root port (which is perfectly possible in the cases I deal with)?

Third question: why multiple clients? That feels a bit like you are pushing something special to your use case into the common PCI subsystem. Something which usually isn't a good idea.

As far as I can see we need a function which returns the distance between an initiator and target device. This function then returns -1 if the transaction can't be made and a positive value otherwise.

We also need to give the direction of the transaction and have a whitelist of root complex PCI IDs which can handle P2P transactions from different ports for a certain DMA direction. (A sketch of this proposal follows below.)

Christian.

Logan
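A minimal sketch of the pairwise interface Christian proposes here, with the DMA direction parameter he asks for. None of this exists in the posted series; every name below is illustrative:

```c
/* Illustrative interface sketch only, not part of the posted series. */
#include <linux/pci.h>
#include <linux/dma-direction.h>

/* Hypothetical: returns a positive routing cost, or -1 if the
 * initiator cannot reach the target's BAR for transfers in the given
 * direction (e.g. an unroutable root complex pair). */
int p2p_distance(struct pci_dev *initiator, struct pci_dev *target,
		 enum dma_data_direction dir);

/* Example use: allow a NIC-to-GPU copy only if a path exists. */
static bool can_copy(struct pci_dev *nic, struct pci_dev *gpu)
{
	return p2p_distance(nic, gpu, DMA_TO_DEVICE) >= 0;
}
```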
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Hi Logan,

It would be rather nice if you could separate out the functions to detect whether peer2peer is possible between two devices.

That would allow me to reuse the same logic for GPU peer2peer where I don't really have ZONE_DEVICE.

Regards,
Christian.

On 24.04.2018 at 01:30, Logan Gunthorpe wrote:

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe fabrics. This version has been rebased onto v4.17-rc2. A git repo is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4

Thanks,
Logan

Changes in v4:

* Change the original upstream_bridges_match() function to upstream_bridge_distance() which calculates the distance between two devices as long as they are behind the same root port. This should address Bjorn's concerns that the code was too focused on being behind a single switch.
* The disable ACS function now disables ACS for all bridge ports instead of switch ports (ie. those that had two upstream_bridge ports).
* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() API to be more like sgl_alloc() in that the alloc function returns the allocated scatterlist and nents is not required by the free function.
* Moved the new documentation into the driver-api tree as requested by Jonathan
* Add SGL alloc and free helpers in the nvmet code so that the individual drivers can share the code that allocates P2P memory. As requested by Christoph.
* Cleanup the nvmet_p2pmem_store() function as Christoph thought my first attempt was ugly.
* Numerous commit message and comment fix-ups

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn
* Additional explanation of the ACS change in both the commit message and Kconfig doc. Also, the code that disables the ACS bits is surrounded explicitly by an #ifdef
* Removed the flag we added to rdma_rw_ctx() in favour of using is_pci_p2pdma_page(), as suggested by Sagi.
* Adjust pci_p2pmem_find() so that it prefers P2P providers that are closest to (or the same as) the clients using them. In cases of ties, the provider is randomly chosen.
* Modify the NVMe Target code so that the PCI device name of the provider may be explicitly specified, bypassing the logic in pci_p2pmem_find(). (Note: it's still enforced that the provider must be behind the same switch as the clients).
* As requested by Bjorn, added documentation for driver writers.

Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well as a bunch of cleanup and spelling fixes he pointed out in the last series.
* To address Alex's ACS concerns, we change to a simpler method of just disabling ACS behind switches for any kernel that has CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops' which should fairly simply handle Jason's concerns that this work might break with the HFI, QIB and rxe drivers that use the virtual ops to implement their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI memory in the kernel with initial support for the NVMe fabrics target subsystem. Many thanks go to Christoph Hellwig who provided valuable feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely.
The upside of this is better QoS for applications running on the CPU utilizing memory and lower PCI bandwidth required to the CPU (such that systems could be designed with fewer lanes connected to the CPU).

Due to these trade-offs we've designed the system to only enable using the PCI memory in cases where the NIC, NVMe devices and memory are all behind the same PCI switch hierarchy. This will mean many setups that could likely work well will not be supported so that we can be more confident it will work and not place any responsibility on the user to understand their topology. (We chose to go this route based on feedback we received at the last LSF). Future work may enable these transfers using a white list of known good root complexes. However, at this time, there is no reliable way to ensure that Peer-to-Peer transactions are permitted between PCI Root Ports.

In order to enable this functionality, we introduce a few new PCI functions such that a driver can register P2P memory with the system. Struct pages are created for this memory using devm_memremap_pages() and the PCI bus offset is stored in the corresponding pagemap structure.

When the PCI P2PDMA config option is selected the ACS bits in every bridge port in the system are turned off to allow traffic to pass freely behind the root port. At this time, the bit must be disabled at boot so the IOMMU subsystem can correctly create the groups, though this could be addresse
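To make the registration step concrete, a hedged sketch of how a driver might expose a BAR with the new interface. The function names appear in the series; the exact signatures here are assumptions:

```c
/* Hedged sketch: register a device's BAR as P2P memory.  Signatures
 * assumed from the series description; BAR number is an example. */
#include <linux/pci-p2pdma.h>

static int register_p2p_bar(struct pci_dev *pdev)
{
	int rc;

	/* Expose all of BAR 4 as P2P memory; offset 0 into the BAR.
	 * Struct pages get created via devm_memremap_pages() and the
	 * PCI bus offset is recorded, as the cover letter describes. */
	rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
	if (rc)
		return rc;

	/* Allow pci_p2pmem_find() to hand this device out to clients. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}
```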
Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions
On 25.04.2017 at 20:20, Logan Gunthorpe wrote:

This patch introduces functions which kmap the pages inside an sgl. These functions replace a common pattern of kmap(sg_page(sg)) that is used in more than 50 places within the kernel.

The motivation for this work is to eventually safely support sgls that contain io memory. In order for that to work, any access to the contents of an iomem SGL will need to be done with iomemcpy or hit some warning. (The exact details of how this will work have yet to be worked out.) Having all the kmaps in one place is just a first step in that direction. Additionally, seeing this helps cut down the users of sg_page, it should make any effort to go to struct-page-less DMAs a little easier (should that idea ever swing back into favour again).

A flags option is added to select between a regular or atomic mapping so these functions can replace kmap(sg_page or kmap_atomic(sg_page. Future work may expand this to have flags for using page_address or vmap. We include a flag to require the function not to fail to support legacy code that has no easy error path. Much further in the future, there may be a flag to allocate memory and copy the data from/to iomem.

We also add the semantic that sg_map can fail to create a mapping, despite the fact that the current code this is replacing is assumed to never fail and the current version of these functions cannot fail. This is to support iomem which may either have to fail to create the mapping or allocate memory as a bounce buffer which itself can fail.

Also, in terms of cleanup, a few of the existing kmap(sg_page) users play things a bit loose in terms of whether they apply sg->offset, so using these helper functions should help avoid such issues.

Signed-off-by: Logan Gunthorpe
---

Good to know that somebody is working on this. Those problems troubled us as well.

Patch is Acked-by: Christian König.

Regards,
Christian.

 include/linux/scatterlist.h | 85 +
 1 file changed, 85 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..fad170b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/highmem.h>
 #include <linux/mm.h>
 
 struct scatterlist {
@@ -126,6 +127,90 @@ static inline struct page *sg_page(struct scatterlist *sg)
 	return (struct page *)((sg)->page_link & ~0x3);
 }
 
+#define SG_KMAP		 (1 << 0)	/* create a mapping with kmap */
+#define SG_KMAP_ATOMIC	 (1 << 1)	/* create a mapping with kmap_atomic */
+#define SG_MAP_MUST_NOT_FAIL (1 << 2)	/* indicate sg_map should not fail */
+
+/**
+ * sg_map - kmap a page inside an sgl
+ * @sg:		SG entry
+ * @offset:	Offset into entry
+ * @flags:	Flags for creating the mapping
+ *
+ * Description:
+ *   Use this function to map a page in the scatterlist at the specified
+ *   offset. sg->offset is already added for you. Note: the semantics of
+ *   this function are that it may fail. Thus, its output should be checked
+ *   with IS_ERR and PTR_ERR. Otherwise, a pointer to the specified offset
+ *   in the mapped page is returned.
+ *
+ *   Flags can be any of:
+ *	* SG_KMAP		- Use kmap to create the mapping
+ *	* SG_KMAP_ATOMIC	- Use kmap_atomic to map the page atomically.
+ *				  Thus, the rules of that function apply: the
+ *				  cpu may not sleep until it is unmapped.
+ *	* SG_MAP_MUST_NOT_FAIL	- Indicate that sg_map must not fail.
+ *				  If it does, it will issue a BUG_ON instead.
+ *				  This is intended for legacy code only, it
+ *				  is not to be used in new code.
+ *
+ *   Also, consider carefully whether this function is appropriate. It is
+ *   largely not recommended for new code and if the sgl came from another
+ *   subsystem and you don't know what kind of memory might be in the list
+ *   then you definitely should not call it. Non-mappable memory may be in
+ *   the sgl and thus this function may fail unexpectedly. Consider using
+ *   sg_copy_to_buffer instead.
+ **/
+static inline void *sg_map(struct scatterlist *sg, size_t offset, int flags)
+{
+	struct page *pg;
+	unsigned int pg_off;
+	void *ret;
+
+	offset += sg->offset;
+	pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	pg_off = offset_in_page(offset);
+
+	if (flags & SG_KMAP_ATOMIC)
+		ret = kmap_atomic(pg) + pg_off;
+	else if (flags & SG_KMAP)
+		ret = kmap(pg) + pg_off;
+	else
+		ret = ERR_PTR(-EINVAL);
+
+	/*
+	 * In theory, this can't happen yet. Once we start adding
+	 * unmappable memory, it also shouldn't happen unless developers
+	 * start putting unmappa
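A hedged usage sketch for the helper above. It assumes the full series also adds an sg_unmap() counterpart, which is not shown in this excerpt:

```c
/* Hedged usage sketch: XOR all bytes of an sgl via sg_map().
 * sg_unmap() is an assumed counterpart from the full series. */
#include <linux/scatterlist.h>

static int xor_sgl(struct scatterlist *sgl, int nents, u8 *acc)
{
	struct scatterlist *sg;
	unsigned int j;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		u8 *addr = sg_map(sg, 0, SG_KMAP);

		if (IS_ERR(addr))
			return PTR_ERR(addr);	/* mapping may fail now */

		for (j = 0; j < sg->length; j++)
			*acc ^= addr[j];

		sg_unmap(sg, addr, 0, SG_KMAP);	/* assumed counterpart */
	}

	return 0;
}
```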
Re: Enabling peer to peer device transactions for PCIe devices
On 12.01.2017 at 16:11, Jerome Glisse wrote:
On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
On 06/01/17 11:26 AM, Jason Gunthorpe wrote:

Make a generic API for all of this and you'd have my vote..

IMHO, you must support basic pinning semantics - that is necessary to support generic short lived DMA (eg filesystem, etc). That hardware can clearly do that if it can support ODP.

I agree completely. What we want is for RDMA, O_DIRECT, etc to just work with special VMAs (ie. at least those backed with ZONE_DEVICE memory). Then GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace (using whatever interface is most appropriate) and userspace can do what it pleases with them. This makes _so_ much sense and actually largely already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to see something upstream that enables P2P even if it is only the minimum viable useful functionality to begin with. I think aiming for the moon (which is what HMM and things like it are) is simply going to take more time, if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or BARs on the NIC). I am even seeing users who now want to move data P2P between FPGAs and NVMe SSDs, and the upstream kernel should be able to support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above, and while it is not an in-kernel API it could have been modified to be one reasonably easily. As Logan states, the driver can then choose to pass the VMAs to user-space in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a proposal to discuss this topic, so if you are interested please respond to the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU. Given the known issues around P2P transfers in certain CPU root complexes [2] it might just be a case of only allowing P2P when a PCIe switch connects the two EPs. Another option is just to use CONFIG_EXPERT and make sure people are aware of the pitfalls if they invoke the P2P option.

iopmem is not applicable to GPUs. What I propose is to split the issue in two, so that everyone can reuse the part that needs to be common, namely the DMA API part where you have to create an IOMMU mapping for one device to point to the other device's memory. We can have a DMA API that is agnostic to how the device memory is managed (so it does not matter if the device memory has struct pages or not). This is what I have been arguing in this thread.

To make progress on this issue we need to stop conflating different use cases. So I say let's solve the IOMMU issue first and let everyone use it in their own way with their device. I do not think we can share much more than that.

Yeah, exactly what I said from the very beginning as well. Just hacking together quick solutions doesn't really solve the problem in the long term.

What we need is proper adjusting of the DMA API towards handling of P2P and then build solutions for the different use cases on top of that.

We should also avoid falling into the trap of trying to just handle the existing get_user_pages and co interfaces so that the existing code doesn't need to change.
P2P needs to be validated for each use case individually and not implemented in workarounds with fingers crossed, hoping for the best.

Regards,
Christian.

Cheers,
Jérôme
Re: Enabling peer to peer device transactions for PCIe devices
On 27.11.2016 at 15:02, Haggai Eran wrote:
On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

Like you say below we have to handle short lived in the usual way, and that covers basically every device except IB MRs, including the command queue on a NVMe drive.

Well, a problem which wasn't mentioned so far is that while GPUs do have a page table to mirror the CPU page table, they usually can't recover from page faults. So what we do is making sure that all memory accessed by GPU jobs stays in place while those jobs run (pretty much the same pinning you do for the DMA).

Yes, it is DMA, so this is a valid approach. But, you don't need page faults from the GPU to do proper coherent page table mirroring. Basically when the driver submits the work to the GPU it 'faults' the pages into the CPU and mirror translation table (instead of pinning). Like in ODP, MMU notifiers/HMM are used to monitor for translation changes. If a change comes in, the GPU driver checks if an executing command is touching those pages and blocks the MMU notifier until the command flushes, then unfaults the page (blocking future commands) and unblocks the mmu notifier.

I think blocking mmu notifiers against something that is basically controlled by user-space can be problematic. This can block things like memory reclaim. If you have user-space access to the device's queues, user-space can block the mmu notifier forever.

Really good point.

I think this means the bare minimum, if we don't have recoverable page faults, is to have preemption support like Felix described in his answer as well.

Going to keep that in mind,
Christian.
Re: Enabling peer to peer device transactions for PCIe devices
On 25.11.2016 at 20:32, Jason Gunthorpe wrote:
On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

Like you say below we have to handle short lived in the usual way, and that covers basically every device except IB MRs, including the command queue on a NVMe drive.

Well, a problem which wasn't mentioned so far is that while GPUs do have a page table to mirror the CPU page table, they usually can't recover from page faults. So what we do is making sure that all memory accessed by GPU jobs stays in place while those jobs run (pretty much the same pinning you do for the DMA).

Yes, it is DMA, so this is a valid approach. But, you don't need page faults from the GPU to do proper coherent page table mirroring. Basically when the driver submits the work to the GPU it 'faults' the pages into the CPU and mirror translation table (instead of pinning). Like in ODP, MMU notifiers/HMM are used to monitor for translation changes. If a change comes in, the GPU driver checks if an executing command is touching those pages and blocks the MMU notifier until the command flushes, then unfaults the page (blocking future commands) and unblocks the mmu notifier.

Yeah, we have a function to "import" anonymous pages from a CPU pointer which works exactly that way as well. We call this "userptr" and it's just a combination of get_user_pages() on command submission and making sure the returned list of pages stays valid using an MMU notifier.

The "big" problem with this approach is that it is horribly slow. I mean so seriously horribly slow that we actually can't use it for some of the purposes we wanted to use it for.

The code moving the page will move it and the next GPU command that needs it will refault it in the usual way, just like the CPU would.

And here comes the problem. CPUs do this on a page by page basis, so they fault only what is needed and everything else gets filled in on demand. The result is that faulting a page is a relatively lightweight operation.

But for GPU command submission we don't know which pages might be accessed beforehand, so what we do is walk all possible pages and make sure all of them are present.

Now, as far as I understand it, the I/O subsystem for example assumes that it can easily change the CPU page tables without much overhead. So for example when a page can't be modified it is temporarily marked as readonly AFAIK (you are probably way deeper into this than me, so please confirm).

That absolutely kills any performance for GPU command submissions. We have use cases where we practically ended up playing ping-pong between the GPU driver trying to grab the page with get_user_pages() and somebody else in the kernel marking it readonly.

This might be much more efficient since it optimizes for the common case of unchanging translation tables.

Yeah, completely agree. It works perfectly fine as long as you don't have two drivers trying to mess with the same page.

This assumes the commands are fairly short lived of course, the expectation of the mmu notifiers is that a flush is reasonably prompt

Correct, this is another problem. GFX command submissions usually don't take longer than a few milliseconds, but compute command submissions can easily take multiple hours. I can easily imagine what would happen when kswapd is blocked by a GPU command submission for an hour or so while the system is under memory pressure :)

I'm thinking on this problem for about a year now and going in circles for quite a while. So if you have ideas on this, even if they sound totally crazy, feel free to bring them up.
Cheers,
Christian.
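A rough sketch of the "userptr" pattern described above: pin the pages with get_user_pages() at command submission and keep them valid via an MMU notifier. All names are illustrative; the notifier and GUP signatures match kernels of roughly that era (~4.9):

```c
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct userptr_bo {
	struct mmu_notifier mn;
	unsigned long start, end;	/* user VA range backing the BO */
	struct page **pages;
	long npages;
};

static void userptr_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	struct userptr_bo *bo = container_of(mn, struct userptr_bo, mn);

	if (end <= bo->start || start >= bo->end)
		return;

	/* Wait here for any GPU job using these pages to flush, then
	 * drop the pinned pages so the next submission re-pins them.
	 * This is the blocking Jason and Haggai discuss above. */
}

static const struct mmu_notifier_ops userptr_mn_ops = {
	.invalidate_range_start = userptr_invalidate_range_start,
};

static long userptr_pin(struct userptr_bo *bo)
{
	int ret;

	bo->mn.ops = &userptr_mn_ops;
	ret = mmu_notifier_register(&bo->mn, current->mm);
	if (ret)
		return ret;

	/* Pin every page up front: the GPU cannot fault on demand, so
	 * the whole range must be resident before the job runs.  This
	 * full walk is the cost Christian describes above. */
	return get_user_pages(bo->start, bo->npages, FOLL_WRITE,
			      bo->pages, NULL);
}
```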
Re: Enabling peer to peer device transactions for PCIe devices
On 24.11.2016 at 17:42, Jason Gunthorpe wrote:
On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
On 23/11/16 02:55 PM, Jason Gunthorpe wrote:

Only ODP hardware allows changing the DMA address on the fly, and it works at the page table level. We do not need special handling for RDMA.

I am aware of ODP but, as noted by others, it doesn't provide a general solution to the points above.

How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB hardware that doesn't support ODP or other hardware that doesn't do similar things (like an NVMe drive).

There are three cases to worry about:
- Coherent long lived page table mirroring (RDMA ODP MR)
- Non-coherent long lived page table mirroring (RDMA MR)
- Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and that covers basically every device except IB MRs, including the command queue on a NVMe drive.

Well, a problem which wasn't mentioned so far is that while GPUs do have a page table to mirror the CPU page table, they usually can't recover from page faults. So what we do is making sure that all memory accessed by GPU jobs stays in place while those jobs run (pretty much the same pinning you do for the DMA).

But since this can lock down huge amounts of memory, the whole command submission to GPUs is bound to the memory management. So when too much memory would get blocked by the GPU we block further command submissions until the situation resolves.

any complex allocators (GPU or otherwise) should respect that. And that seems like it should be the default way most of this works -- and I think it wouldn't actually take too much effort to make it all work now as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard requirement.

Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE memory working for some time. I'd say it's a good fit. The main question we've had is how to expose PCIe BARs to userspace to be used as MRs and such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the objection to uncacheable ZONE_DEVICE & DAX made sense, so running DAX over iopmem and still ending up with uncacheable mmaps still seems like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a CPU mapped page and the GPU moves it to VRAM, it becomes non-cacheable - do you still allow the CPU to access it? Or do you swap it back to cacheable memory if the CPU touches it?

Depends on the policy in use, but currently it's the other way around most of the time. E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading because that is slow, and the GPU in turn can access it at full speed. When we run out of VRAM we move those allocations to system memory and update both the CPU as well as the GPU page tables. So that move is transparent for both userspace as well as shaders running on the GPU.

One approach might be to mmap the uncacheable ZONE_DEVICE memory and mark it inaccessible to the CPU - DMA could still translate. If the CPU needs it then the kernel migrates it to system memory so it becomes cacheable. ??

The whole purpose of this effort is that we can do I/O on VRAM directly without migrating everything back to system memory. Allowing this, but then doing the migration on the first touch of the CPU is clearly not a good idea.

Regards,
Christian.
Jason
Re: Enabling peer to peer device transactions for PCIe devices
On 24.11.2016 at 18:55, Logan Gunthorpe wrote:

Hey,

On 24/11/16 02:45 AM, Christian König wrote:

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE. Now PCI device B (a SATA device) can directly read/write to it because it is on the same bus segment, but PCI device C (a network card for example) can't because it is on a different bus segment and the bridge can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see it. We've tested with two separate PCI buses on different CPUs connected through QPI links and it works fine. (It is rather slow but I understand Intel has improved the bottleneck in newer CPUs than the ones we tested.)

Well, Serguei sent me a couple of documents about QPI when we started to discuss this internally as well, and that's exactly one of the cases I had in mind when writing this.

If I understood it correctly, for such systems P2P is technically possible, but not necessarily a good idea. Usually it is faster to just use a bounce buffer when the peers are a bit "farther" apart.

That this problem is solved on newer hardware is good, but it doesn't help us at all if we want to support at least systems from the last five years or so.

It may just be older hardware that has this issue. I expect that as long as a failed transfer can be handled gracefully by the initiator I don't see a need to predetermine whether a device can see another device's memory.

I don't want to predetermine whether a device can see another device's memory at get_user_pages() time.

My thinking was more going in the direction of a whitelist to figure out at dma_map_single()/dma_map_sg() time if we should use a bounce buffer or not.

Christian.

Logan
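An illustrative sketch of the map-time decision Christian describes. None of these helpers exist; they only show where the whitelist check and the bounce fallback would sit:

```c
/* Illustrative only: at map time, consult a whitelist to decide
 * whether the initiator can reach the peer BAR directly or must
 * bounce through system memory. */
#include <linux/dma-mapping.h>
#include <linux/pci.h>

static dma_addr_t p2p_map_page(struct device *initiator,
			       struct pci_dev *provider,
			       struct page *bar_page, size_t size,
			       enum dma_data_direction dir)
{
	/* p2p_path_whitelisted() is hypothetical: true when the
	 * interconnect between the peers is known to route P2P well. */
	if (p2p_path_whitelisted(initiator, provider))
		return dma_map_page(initiator, bar_page, 0, size, dir);

	/* Hypothetical fallback: stage through a system-memory bounce
	 * buffer when the peers are too "far" apart (e.g. across QPI). */
	return p2p_bounce_map(initiator, bar_page, size, dir);
}
```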
Re: Enabling peer to peer device transactions for PCIe devices
On 24.11.2016 at 00:25, Jason Gunthorpe wrote:

There is certainly nothing about the hardware that cares about ZONE_DEVICE vs System memory.

Well, that is clearly not so simple. When your ZONE_DEVICE pages describe a PCI BAR and another PCI device initiates a DMA to this address, the DMA subsystem must be able to check if the interconnection really works.

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE. Now PCI device B (a SATA device) can directly read/write to it because it is on the same bus segment, but PCI device C (a network card for example) can't because it is on a different bus segment and the bridge can't handle P2P transactions.

We need to be able to handle such cases and fall back to bounce buffers, but I don't see that in the DMA subsystem right now.

Regards,
Christian.
Re: Enabling peer to peer device transactions for PCIe devices
On 23.11.2016 at 08:49, Daniel Vetter wrote:
On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
On 2016-11-22 03:10 PM, Daniel Vetter wrote:
On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:

I personally like the "device-DAX" idea but my concerns are:

- How well will it co-exist with the DRM infrastructure / implementations in the part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense that you can perform pfn_to_page() on it and issue I/O, but the vma is not migratable. To be honest I do not know how well that co-exists with drm infrastructure.

- How well will we be able to handle the case when we need to "move"/"evict" memory/data to a new location, so the CPU pointer should point to the new physical location/address (and maybe not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or overcommit. Those cases are left to the core mm or drm. The device-dax interface is for cases where all that is needed is a direct mapping to a statically-allocated physical-address range, be it persistent memory or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can pull off) I think we want all the magic in core mm, i.e. migration and overcommit. At least that seems to be the very strong drive in all general-purpose gpu abstractions and implementations, where memory is allocated with malloc, and then mapped/moved into vram/gpu address space through some magic,

It is possible that it is the other way around: memory is requested to be allocated and should be kept in vram for performance reasons, but due to a possible overcommit case we need at least temporarily to "move" such an allocation to system memory.

With migration I meant migrating both ways of course. And with stuff like numactl we can also influence where exactly the malloc'ed memory is allocated originally, at least if we'd expose the vram range as a very special numa node that happens to be far away and not hold any cpu cores.

I don't think we should be using numa distance to reverse-engineer a certain allocation behavior. The latency data should be truthful, but you're right we'll need a mechanism to keep general-purpose allocations out of that range by default. Btw, strict isolation is another design point of device-dax, but I think in this case we're describing something between the two extremes of full isolation and full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't to reuse all the existing allocation policies directly, those won't work. So at boot-up your default numa policy would exclude any vram nodes. But I think (as an -mm layman) that numa gives us a lot of the tools and policy interface that we need to implement what we want for gpus.

Agree completely. From a ten mile high view our GPUs are just command processors with local memory as well. Basically this is also the whole idea of what AMD has been pushing with HSA for a while.

It's just that a lot of problems start to pop up when you look at all the nasty details. For example only part of the GPU memory is usually accessible by the CPU. So even when numa nodes expose a good foundation for this I think there is still a lot of code to write.

BTW: I should probably start to read into the numa code of the kernel.
Any good pointers for that?

Regards,
Christian.

Wrt isolation: There's a sliding scale of what different users expect, from full auto everything, including migrating pages around if needed, to full isolation; all of it seems to be on the table. As long as we keep vram nodes out of any default allocation numasets, full isolation should be possible.

-Daniel