Re: dmaengine support for PMEM
> Here's where I left it last
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=pmem_blk_dma

Thanks Dave. I'll certainly rebase these on 4.18.x and do some testing!

> I do think we need to do some rework with the dmaengine in order to get better efficiency as well. At some point I would like to see a call in dmaengine that will take a request (similar to mq) and just operate on that and submit the descriptors in a single call. I think that can possibly deprecate all the host of function pointers for dmaengine. I'm hoping to find some time to take a look at some of this work towards the end of the year. But I'd be highly interested if you guys have ideas and thoughts on this topic. And you are welcome to take my patches and run with it.

OK. We were experimenting with a single PMEM driver that makes the decision on DMA vs memcpy based on IO size, rather than forcing the user to choose which driver to use.

Stephen

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
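To illustrate the size-based decision mentioned above, here is a minimal sketch in C. The threshold value and function names are my own assumptions for illustration, not anything from the posted patches; the real crossover point would need to be measured per platform.

```c
/* Hypothetical sketch: choosing CPU memcpy vs. DMA engine offload by
 * I/O size. The threshold is an illustrative assumption: below it,
 * DMA descriptor setup and completion overhead tend to dominate. */
#include <stddef.h>

enum copy_method { USE_MEMCPY, USE_DMA };

#define DMA_THRESHOLD_BYTES (16 * 1024)  /* assumed crossover point */

static inline enum copy_method pick_copy_method(size_t io_size)
{
        return (io_size >= DMA_THRESHOLD_BYTES) ? USE_DMA : USE_MEMCPY;
}
```

The appeal of this approach is that a single driver serves both paths, so the user never has to choose between a DMA-capable and a memcpy-only PMEM driver.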
dmaengine support for PMEM
Hi Dave

I hope you are well. Logan and I were looking at adding DMA support to PMEM and were informed you have proposed some patches to do just that for the ioat DMA engine. The latest version I can see is v7 from August 2017. Is there a more recent version? What happened to that series?

https://lists.01.org/pipermail/linux-nvdimm/2017-August/012208.html

Cheers

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
---
Interrupt ranges are not address translated (0xfee00000 - 0xfeefffff). The same is true for peer to peer transactions. Hence we reserve the address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Cheers

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> I find this hard to believe. There's always the possibility that some part of the system doesn't support ACS so if the PCI bus addresses and IOVA overlap there's a good chance that P2P and ATS won't work at all on some hardware.

I tend to agree but this comes down to how IOVA addresses are generated in the kernel. Alex (or anyone else) can you point to where IOVA addresses are generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI bus address directly, but other IO going to the same PCI EP may flow through the IOMMU and be programmed with IOVA rather than PCI bus addresses.

> I prefer the option to disable the ACS bit on boot and let the existing code put the devices into their own IOMMU group (as it should already do to support hardware that doesn't have ACS support).

+1

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> Hopes this helps understanding the big picture. I over simplify thing and devils is in the details.

This was a great primer, thanks for putting it together. An LWN.net article perhaps ;-)?

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> Note on GPU we do would not rely on ATS for peer to peer. Some part of the GPU (DMA engines) do not necessarily support ATS. Yet those are the part likely to be use in peer to peer.

OK, this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.

> We (ake GPU people aka the good guys ;)) do no want to do peer to peer for performance reasons ie we do not care having our transaction going to the root complex and back down the destination. At least in use case i am working on this is fine.

If the GPU people are the good guys, does that make the NVMe people the bad guys ;-)? If so, what are the RDMA people? Again, good to know.

> Reasons is that GPU are giving up on PCIe (see all specialize link like NVlink that are popping up in GPU space). So for fast GPU inter-connect we have this new links.

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.

> Also the IOMMU isolation do matter a lot to us. Think someone using this peer to peer to gain control of a server in the cloud.

I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible, whilst still delivering the desired performance to the user.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> Not to me. In the p2pdma code we specifically program DMA engines with the PCI bus address.

Ah yes, of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

> So regardless of whether we are using the IOMMU or not, the packets will be forwarded directly to the peer. If the ACS Redir bits are on they will be forced back to the RC by the switch and the transaction will fail. If we clear the ACS bits, the TLPs will go where we want and everything will work (but we lose the isolation of ACS).

Agreed.

> For EPs that support ATS, we should (but don't necessarily have to) program them with the IOVA address so they can go through the translation process which will allow P2P without disabling the ACS Redir bits -- provided the ACS direct translation bit is set. (And btw, if it is, then we lose the benefit of ACS protecting against malicious EPs). But, per above, the ATS transaction should involve only the IOVA address so the ACS bits not being set should not break ATS.

Well, we would still have to clear some ACS bits, but now we can clear them only for translated addresses.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing user is the AMD GPU driver see:

Ah, thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm(), but I see the AMD version used in that GPU driver.

One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs?

The reason I ask is that I am looking at what NVMe would need to add to its specification, above and beyond what we have in PCI ATS, to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).

Cheers

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID together with ATS to identify the address space which a transaction should use.

I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so, regardless of ACS, it is going up to the Root Port. When it gets the response it has the physical address, which it can use with the TA bit set for the p2pdma. In the ATS case we also have more control over ACS, as we can disable it just for TA addresses (as per 7.7.7.2 of the spec).

> If I'm not completely mistaken when you disable ACS it is perfectly possible that a bridge identifies a transaction as belonging to a peer address, which isn't what we want here.

You are right here, and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS, then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option, the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble, as we might MemWr to the wrong place, but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and, ironically, also resolves the IOMMU grouping issues.

So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU.
I know this is problematic for AMD's use case, so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).

Make sense?

Stephen
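The two-mode argument above can be boiled down to a single predicate. This is only a sketch of the policy being proposed in this thread, with an assumed function name; it is not code from any posted series.

```c
/* Sketch of the mode selection argued above: an EP without ATS can
 * only safely initiate P2P DMA when the IOMMU is off (bus addresses
 * are then physical); with the IOMMU on, only ATS-capable EPs can
 * participate, since they can obtain translated addresses. */
#include <stdbool.h>

static bool ep_may_initiate_p2pdma(bool ep_has_ats, bool iommu_enabled)
{
        if (!iommu_enabled)
                return true;    /* no IOVA risk: addresses are physical */
        return ep_has_ats;      /* translated (TA-bit) requests are safe */
}
```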
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> Now inside that page table you can point GPU virtual address to use GPU memory or use system memory. Those system memory entry can also be mark as ATS against a given PASID.

Thanks. This all makes sense.

But do you have examples of this in a kernel driver (if so, can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Christian

> Interesting point, give me a moment to check that. That finally makes all the hardware I have standing around here valuable :)

Yes. At the very least it provides an initial standards-based path for P2P DMAs across RPs, which is something we have discussed on this list in the past as being desirable.

BTW, I am trying to understand how an ATS-capable EP function determines when to perform an ATS Translation Request (ATS TR). Is there an upstream example of the driver for your APU that uses ATS? If so, can you provide a pointer to it? Do you provide some type of entry in the submission queues for commands going to the APU to indicate if the address associated with a specific command should be translated using ATS or not? Or do you simply enable ATS, so that all addresses passed to your APU that miss the local cache result in an ATS TR?

Your feedback would be useful as I initiate discussions within the NVMe community on where we might go with ATS...

Thanks

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Jerome and Christian

> I think there is confusion here, Alex properly explained the scheme PCIE-device do a ATS request to the IOMMU which returns a valid translation for a virtual address. Device can then use that address directly without going through IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS, though ACS has a separate control knob for translated-address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS, a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS Direct Translated P2P on all bridges between the two devices involved in the P2P DMA. So we still don't get fine-grained control with ATS, and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set as not set.

> Also ATS is meaningless without something like PASID as far as i know.

ATS is still somewhat valuable without PASID in the sense that you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.

Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise the "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...

Stephen
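For anyone following along, the ACS control bits being discussed can be sketched as below, using the bit positions as defined in Linux's include/uapi/linux/pci_regs.h. The helper function is illustrative only (it is not an existing kernel function), and it deliberately ignores other bits such as Translation Blocking for brevity.

```c
/* ACS control bits (PCIe 4.0 sec 7.7.7.2 / linux pci_regs.h). */
#include <stdint.h>
#include <stdbool.h>

#define PCI_ACS_SV  0x0001  /* Source Validation */
#define PCI_ACS_TB  0x0002  /* Translation Blocking */
#define PCI_ACS_RR  0x0004  /* P2P Request Redirect */
#define PCI_ACS_CR  0x0008  /* P2P Completion Redirect */
#define PCI_ACS_UF  0x0010  /* Upstream Forwarding */
#define PCI_ACS_EC  0x0020  /* P2P Egress Control */
#define PCI_ACS_DT  0x0040  /* Direct Translated P2P */

/* Illustrative: a translated (TA-bit set) request avoids the detour
 * through the RC only if Direct Translated P2P is enabled or Request
 * Redirect is off; an untranslated request is redirected whenever
 * Request Redirect is on. */
static bool tlp_routed_directly(uint16_t acs_ctrl, bool translated)
{
        if (translated)
                return (acs_ctrl & PCI_ACS_DT) || !(acs_ctrl & PCI_ACS_RR);
        return !(acs_ctrl & PCI_ACS_RR);
}
```

This is why clearing only PCI_ACS_DT-related behavior, rather than all ACS redirection, is attractive for ATS-capable endpoints: untranslated traffic keeps its isolation.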
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Don

> RDMA VFs lend themselves to NVMEoF w/device-assignment need a way to put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping', which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.

Ha, I like your term "DMA Security Domain", which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of a hammer for what we want here, in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter TLPs by address or ID, though PCI-SIG is having some discussions on extending ACS. That's a long-term solution and won't be applicable to us for some time.

NVMe SSDs that support SR-IOV are coming to market, but we can't assume all NVMe SSDs will support SR-IOV. That will probably remain a pretty high-end feature...

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Logan

> Yeah, I'm having a hard time coming up with an easy enough solution for the user. I agree with Dan though, the bus renumbering risk would be fairly low in the custom hardware seeing the switches are likely going to be directly soldered to the same board with the CPU.

I am afraid that soldered-down assumption may not be valid. More and more PCIe cards with PCIe switches on them are becoming available, and people are using these to connect servers to arrays of NVMe SSDs, which may make the topology more dynamic.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Alex and Don

> Correct, the VM has no concept of the host's IOMMU groups, only the hypervisor knows about the groups,

But as I understand it these groups are usually passed through to VMs on a per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings, but it is impacted by them in the sense that if the groupings change, the PCI topology presented to the VM needs to change too.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> Yeah, so based on the discussion I'm leaning toward just having a command line option that takes a list of BDFs and disables ACS for them. (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - before we go do this, can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue.

Thanks

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> I think there is confusion here, Alex properly explained the scheme PCIE-device do a ATS request to the IOMMU which returns a valid translation for a virtual address. Device can then use that address directly without going through IOMMU for translation.

This makes sense, and to be honest I now understand ATS and its interaction with ACS a lot better than I did 24 hours ago ;-).

> ATS is implemented by the IOMMU not by the device (well device implement the client side of it). Also ATS is meaningless without something like PASID as far as i know.

I think it's the client side that is important to us. Not many EPs support ATS today and it's not clear if many will in the future. So, assuming we want to do p2pdma between devices (some of) which do NOT support ATS, how best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me, given this impacts all the PCI domains in the system and not just the domain we wish to do P2P on.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Don

> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices. That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints. I recommend doing so via a sysfs method.

Yes, we looked at something like this in the past, but it hits the IOMMU grouping issue I discussed earlier today, which is not acceptable right now. In the long term, once we get IOMMU grouping change callbacks to VMs, we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.

> So I don't understand the comments why VMs should need to know.

As I understand it, VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact that the OS running in the VM may not support hot-plug of PCI devices.

> Is there a thread I need to read up to explain /clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for the input!

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Dan

> It seems unwieldy that this is a compile time option and not a runtime option. Can't we have a kernel command line option to opt-in to this behavior rather than require a wholly separate kernel image?

I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However, personally I would prefer including the option of a run-time kernel parameter too. In fact, a few months ago I proposed a small patch that did just that [1]. It never really went anywhere, but if people are open to the idea we could look at adding it to the series.

> Why is this text added in a follow on patch and not the patch that introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device function' address of a switch to limit the scope of where ACS is disabled.

By this you mean the address of an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
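For the command-line-option idea being discussed, parsing the BDF list itself is straightforward. Here is a hedged sketch of the address parsing only, with an assumed `domain:bus:device.function` format; the parameter name and struct are illustrative, not from any posted patch.

```c
/* Hypothetical sketch: parsing one entry of a comma-separated BDF
 * list such as "disable_acs=0000:03:00.0,0000:04:00.0". */
#include <stdio.h>
#include <stdbool.h>

struct bdf {
        unsigned int domain, bus, dev, fn;
};

static bool parse_bdf(const char *s, struct bdf *out)
{
        /* Expect hex fields: domain:bus:device.function */
        return sscanf(s, "%x:%x:%x.%x",
                      &out->domain, &out->bus, &out->dev, &out->fn) == 4;
}
```

A real kernel implementation would more likely reuse the existing `pci_dev` lookup helpers at device-enumeration time rather than parse strings itself, but the sketch shows the shape of the user-facing interface.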
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the CPU when IOMMU is enabled or otherwise you will break SVM.

OK, but in this case aren't you losing (many of) the benefits of P2P, since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports, any P2P-enabled DMA will be routed to the IOMMU, which removes a lot of the benefit.

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?

> And what exactly is the problem here?

We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug), we had to be cognizant of the fact that ACS settings could change. Since there is currently no way to handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping.

The plan is to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do want a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting the full performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices were added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> I'll see if I can get our PCI SIG people to follow this through

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track it. I can see this being doable between Root Ports that reside in the same Root Complex, but it might become more challenging to standardize for RPs that reside in different RCs in the same (potentially multi-socket) system. I know in the past we have seen MemWr TLPs cross the QPI bus in Intel systems, but I am sure that is not something that works in all systems, and it must fall outside the remit of PCI-SIG ;-).

I agree such a capability bit would be very useful, but it's going to be quite some time before we can rely on hardware being available that supports it.

Stephen
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> That would be very nice but many devices do not support the internal route.

But Logan, in the NVMe case we are discussing movement within a single function (i.e. from an NVMe namespace to an NVMe CMB on the same function). Bjorn is discussing movement between two functions (PFs or VFs) in the same PCIe EP. In the case of multi-function endpoints I think the standard requires those devices to support internal DMAs for transfers between those functions (but does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs, but in the case of NVMe we do have a protocol-specific way for an NVMe function to indicate it supports them (via the CMB BAR). Other protocols may also have such methods but I am not aware of them at this time.

2. For multi-function endpoints I think it is a requirement that DMAs *between* functions are supported via an internal path, but this can be overridden by ACS when supported in the EP.

3. For multi-function endpoints there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device).

Based on my review of the specification I concur with Bjorn that p2pdma between functions in a MF endpoint should be assured to be supported via the standard. However, if the p2pdma involves only a single function in a MF device then we can only support NVMe CMBs for now. Let's review and see what the options are for supporting this in the next respin.

Stephen
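The three rules above can be captured as a small predicate. This is purely my own encoding of the reading of the spec given in this thread; the function name and parameters are assumptions, and whether a single function advertises an internal path (e.g. an NVMe CMB) is device-specific, so it is passed in.

```c
/* Sketch of the three summary rules:
 * 1/3: within one function, internal DMA is optional and must be
 *      advertised by the protocol (e.g. NVMe CMB).
 * 2:   between functions of a multi-function device, an internal
 *      path is required unless ACS overrides it. */
#include <stdbool.h>

static bool p2p_internal_route_ok(bool same_function,
                                  bool same_mf_device,
                                  bool fn_advertises_internal,
                                  bool acs_blocks_internal)
{
        if (same_function)
                return fn_advertises_internal;  /* rules 1 and 3 */
        if (same_mf_device)
                return !acs_blocks_internal;    /* rule 2 */
        return false;  /* separate devices: not an internal route */
}
```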
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> I've seen the response that peers directly below a Root Port could not DMA to each other through the Root Port because of the "route to self" issue, and I'm not disputing that.

Bjorn

You asked me for a reference to RTS in the PCIe specification. As luck would have it, I ended up in an Irish bar with Peter Onufryk this week at OCP Summit and we discussed the topic. It is not explicitly referred to as "Route to Self" and it's certainly not explicit (or obvious), but r6.2.8.1 of the PCIe 4.0 specification discusses error conditions for virtual PCI bridges. One of these conditions (given in the very first bullet in that section) applies to a request that is destined for the same port it came in on. When this occurs, the request must be terminated as a UR.

Stephen
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> I assume you want to exclude Root Ports because of multi-function devices and the "route to self" error. I was hoping for a reference to that so I could learn more about it.

Apologies Bjorn, this slipped through my net. I will try and get you a reference for RTS in the next couple of days.

> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS Functions in SR-IOV Capable and Multi-Function Devices", which seems relevant. It talks about "peer-to-peer Requests (between Functions of the device)". That says to me that multi-function devices can DMA between themselves.

I will go take a look. Appreciate the link.

Stephen
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with what you sent out?

> I am about to send RFC for nouveau, i am still working out some bugs.

Great, I will keep an eye out for it. An example user of hmm will be very helpful.

> i will fix the MAINTAINERS as part of those.

Awesome, thanks.

Stephen
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> It seems people miss-understand HMM :(

Hi Jerome

Your unhappy face emoticon made me sad, so I went off to (re)read up on HMM. Along the way I came up with a couple of things.

While hmm.txt is really nice to read, it makes no mention of DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication of when one might choose to use one over the other. Would it be possible to update hmm.txt to include some discussion of this? I understand that DEVICE_PUBLIC creates a mapping in the kernel's linear address space for the device memory and DEVICE_PRIVATE does not. However, like I said, I am not sure when you would use either one and the pros and cons of doing so. I actually ended up finding some useful information in memremap.h, but I don't think it is fair to expect people to dig *that* deep to find this information ;-).

A quick grep shows no drivers using the HMM API in the upstream code today. Is this correct? Are there any examples of out-of-tree drivers that use HMM you can point me to? As a driver developer, what resources exist to help me write an HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. You might want to fix that when you have a moment.

Stephen
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip

@Keith - my apologies. @Christoph - thanks for the link.

So my understanding of when the technical content surrounding new NVMe Technical Proposals (TPs) becomes public was wrong. I thought TP content could only be discussed once disclosed in the public standard. I have now learnt that once TPs are ratified they are publicly available!

However, as Logan pointed out, PMRs are not relevant to this series, so let's defer discussion on how to support them to a later date.

Stephen
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> We don't want to lump these all together without knowing which region you're allocating from, right?

In all seriousness, I do agree with you on this, Keith, in the long term. We would consider adding property flags for the memory as it is added to the p2p core, and then the allocator could evolve to intelligently dish it out. Attributes like endurance, latency and special write-commit requirements could all become attributes in time. Perhaps this is one more reason for a central entity for p2p memory allocation, so this code does not end up having to go into many different drivers?

Stephen
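As a rough illustration of the property-flag idea floated above, regions registered with a central p2p core could carry attribute bits that a future allocator matches against. Everything here (names, flags, the `p2pmem_find` helper) is a hypothetical sketch, not a proposed API.

```c
/* Illustrative central registry of p2p memory regions with
 * attribute flags an allocator could match on. */
#include <stddef.h>
#include <stdint.h>

#define P2PMEM_LOW_LATENCY        (1u << 0)
#define P2PMEM_HIGH_ENDURANCE     (1u << 1)
#define P2PMEM_NEEDS_WRITE_COMMIT (1u << 2)  /* e.g. a persistent region */

struct p2pmem_region {
        const char *provider;
        size_t size;
        uint32_t flags;
};

/* Return the first region satisfying all required flags, or NULL. */
static const struct p2pmem_region *
p2pmem_find(const struct p2pmem_region *regions, size_t n, uint32_t required)
{
        for (size_t i = 0; i < n; i++)
                if ((regions[i].flags & required) == required)
                        return &regions[i];
        return NULL;
}
```

Keeping this matching logic in one place is exactly the argument for a central allocator: each driver registers what it has, and none of them reimplement the selection policy.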
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> There's a meaningful difference between writing to an NVMe CMB vs PMR

When the PMR spec becomes public we can discuss how best to integrate it into the P2P framework (if at all) ;-).

Stephen
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> No, locality matters. If you have a bunch of NICs and bunch of drives and the allocator chooses to put all P2P memory on a single drive your performance will suck horribly even if all the traffic is offloaded.

Sagi brought this up earlier in his comments about the _find_ function. We are planning to do something about this in the next version. This might be randomization or a "user pick", plus a rule to prefer the p2p_dev on an EP when that EP is part of the transaction.

Stephen
Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
> I'm pretty sure the spec disallows routing-to-self so doing a P2P > transaction in that sense isn't going to work unless the device > specifically supports it and intercepts the traffic before it gets to > the port. This is correct. Unless the device intercepts the TLP before it hits the root-port then this would be considered a "route to self" violation and an error event would occur. The same holds for the downstream port on a PCI switch (unless the route-to-self check is disabled, which violates the spec but which I have seen done in certain applications). Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
>> We'd prefer to have a generic way to get p2pmem instead of restricting >> ourselves to only using CMBs. We did work in the past where the P2P memory >> was part of an IB adapter and not the NVMe card. So this won't work if it's >> an NVMe only interface. > It just seems like it it making it too complicated. I disagree. Having a common allocator (instead of some separate allocator per driver) makes things simpler. > Seems like a very subtle and hard to debug performance trap to leave > for the users, and pretty much the only reason to use P2P is > performance... So why have such a dangerous interface? P2P is about offloading the memory and PCI subsystem of the host CPU and this is achieved no matter which p2p_dev is used. Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> The intention of HMM is to be useful for all device memory that wish > to have struct page for various reasons. Hi Jerome and thanks for your input! Understood. We have looked at HMM in the past and long term I definitely would like to consider how we can add P2P functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass addressable and non-addressable blocks of data between devices. However that is well beyond the intentions of this series ;-). Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> your kernel provider needs to decide whether they favor device assignment or > p2p Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) are such that we really only expect CONFIG_P2P_DMA to be enabled in specific instances and in those instances the users have made a decision to favor P2P over IOMMU isolation. Or they have setup their PCIe topology in a way that gives them IOMMU isolation where they want it and P2P where they want it. Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> I agree, I don't think this series should target anything other than > using p2p memory located in one of the devices expected to participate > in the p2p trasnaction for a first pass.. I disagree. There is definitely interest in using a NVMe CMB as a bounce buffer and in deploying systems where only some of the NVMe SSDs below a switch have a CMB but use P2P to access all of them. Also there are some devices that only expose memory and whose entire purpose is to act as a p2p device; supporting these devices would be valuable. > locality is super important for p2p, so I don't think things should > start out in a way that makes specifying the desired locality hard. Ensuring that the EPs engaged in p2p are all directly connected to the same PCIe switch ensures locality and (for the switches we have tested) performance. I agree solving the case where the namespace and CMB are on the same PCIe EP is valuable but I don't see it as critical to initial acceptance of the series. Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Thanks for the detailed review Bjorn! >> >> + Enabling this option will also disable ACS on all ports behind >> + any PCIe switch. This effictively puts all devices behind any >> + switch into the same IOMMU group. > > Does this really mean "all devices behind the same Root Port"? Not necessarily. You might have a cascade of switches (i.e. switches below a switch) to achieve a very large fan-out (in an NVMe SSD array for example) and we will only disable ACS on the ports below the relevant switch. > What does this mean in terms of device security? I assume it means, > at least, that individual devices can't be assigned to separate VMs. This was discussed during v1 [1]. Disabling ACS on all downstream ports of the switch means that all the EPs below it have to be part of the same IOMMU grouping. However it was also agreed that as long as the ACS disable occurs at boot time (which it does in v2) then the virtualization layer will be aware of it and will perform the IOMMU group formation correctly. > I don't mind admitting that this patch makes me pretty nervous, and I > don't have a clear idea of what the implications of this are, or how > to communicate those to end users. "The same IOMMU group" is a pretty > abstract idea. Alex gave a good overview of the implications in [1]. Stephen [1] https://marc.info/?l=linux-pci&m=151512320031739&w=2 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
>> So Oliver (CC) was having issues getting any of that to work for us. >> >> The problem is that acccording to him (I didn't double check the latest >> patches) you effectively hotplug the PCIe memory into the system when >> creating struct pages. >> >> This cannot possibly work for us. First we cannot map PCIe memory as >> cachable. (Note that doing so is a bad idea if you are behind a PLX >> switch anyway since you'd ahve to manage cache coherency in SW). > > Note: I think the above means it won't work behind a switch on x86 > either, will it ? Ben We have done extensive testing of this series and its predecessors using PCIe switches from both Broadcom (PLX) and Microsemi. We have also done testing on x86_64, ARM64 and ppc64el based architectures with varying degrees of success. The series as it currently stands only works on x86_64 but modified (hacky) versions have been made to work on ARM64. The x86_64 testing has been done on a range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least three vendors, NVMe SSDs from at least four vendors and P2P devices from four vendors) and PCI switches. I do find it slightly offensive that you would question the series even working. I hope you are not suggesting we would submit this framework multiple times without having done testing on it. Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would > > save an extra PCI transfer as the NVME card could just take the data > > out of it's own memory. However, at this time, cards with CMB buffers > > don't seem to be available. > Can you describe what would be the plan to have it when these devices > do come along? I'd say that p2p_dev needs to become a nvmet_ns reference > and not from nvmet_ctrl. Then, when cmb capable devices come along, the > ns can prefer to use its own cmb instead of locating a p2p_dev device? Hi Sagi Thanks for the review! That commit message is somewhat dated as NVMe controllers with CMBs that support RDS and WDS are now commercially available [1]. However we have not yet tried to do any kind of optimization around this in terms of determining which p2p_dev to use. Your suggestion above looks good and we can look into this kind of optimization in due course. [1] http://www.eideticom.com/uploads/images/NoLoad_Product_Spec.pdf >> +ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients); > This is the first p2p_dev found right? What happens if I have more than > a single p2p device? In theory I'd have more p2p memory I can use. Have > you considered making pci_p2pmem_find return the least used suitable > device? Yes pci_p2pmem_find will always return the first valid p2p_dev found. At the very least we should update this to allocate over all the valid p2p_devs. Since the load on any given p2p_dev will vary over time I think a random allocation of the devices makes sense (at least for now). Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
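For reference, Sagi's "least used suitable device" idea could be modelled along these lines (a hedged sketch with invented struct and function names; real accounting would need to be atomic and tied into the allocator):

```c
/* Toy sketch of "least used" provider selection: track outstanding
 * allocations per provider and pick the lightest-loaded one that can
 * still satisfy the request. All names here are illustrative. */
struct p2p_provider {
	const char *name;
	unsigned long in_use;   /* bytes currently allocated from it */
	unsigned long capacity; /* total exposed p2p memory in bytes */
};

static struct p2p_provider *
p2p_find_least_used(struct p2p_provider *devs, int n, unsigned long need)
{
	struct p2p_provider *best = 0;

	for (int i = 0; i < n; i++) {
		/* Skip providers that cannot satisfy the request at all. */
		if (devs[i].capacity - devs[i].in_use < need)
			continue;
		if (!best || devs[i].in_use < best->in_use)
			best = &devs[i];
	}
	return best;
}
```

The random policy discussed above is a degenerate form of this: it avoids the bookkeeping entirely at the cost of occasionally picking a busy device.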
Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
> Any plans adding the capability to nvme-rdma? Should be > straight-forward... In theory, the use-case would be rdma backend > fabric behind. Shouldn't be hard to test either... Nice idea Sagi. Yes we have been starting to look at that. Though again we would probably want to impose the "attached to the same PCIe switch" rule which might be less common to satisfy in initiator systems. Down the road I would also like to discuss the best way to use this P2P framework to facilitate copies between NVMe namespaces (on both PCIe and fabric attached namespaces) without having to expose the CMB up to user space. Wasn't something like that done in the SCSI world at some point Martin? Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
> My first reflex when reading this thread was to think that this whole domain > lends it self excellently to testing via Qemu. Could it be that doing this in > the opposite direction might be a safer approach in the long run even though > (significant) more work up-front? While the idea of QEMU for this work is attractive it will be a long time before QEMU is in a position to support this development. Another approach is to propose a common development platform for p2pmem work using a platform we know is going to work. This is an extreme version of the whitelisting approach that was discussed on this thread. We can list a very specific set of hardware (motherboard, PCIe end-points and (possibly) PCIe switch enclosure) that has been shown to work that others can copy for their development purposes. p2pmem.io perhaps ;-)? Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
>> Yes, that's why I used 'significant'. One good thing is that given resources >> it can easily be done in parallel with other development, and will give >> additional >> insight of some form. > >Yup, well if someone wants to start working on an emulated RDMA device >that actually simulates proper DMA transfers that would be great! Given that each RDMA vendor’s devices expose a different MMIO interface, I don’t expect this to happen anytime soon. > Yes, the nvme device in qemu has a CMB buffer which is a good choice to > test with but we don't have code to use it for p2p transfers in the >kernel so it is a bit awkward. Note the CMB code is not in upstream QEMU, it’s in Keith’s fork [1]. I will see if I can push this upstream. Stephen [1] git://git.infradead.org/users/kbusch/qemu-nvme.git ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
> Yes, this makes sense I think we really just want to distinguish host > memory or not in terms of the dev_pagemap type. I would like to see mutually exclusive flags for host memory (or not) and persistence (or not). Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
On 2017-04-06, 6:33 AM, "Sagi Grimberg" wrote: > Say it's connected via 2 legs, the bar is accessed from leg A and the > data from the disk comes via leg B. In this case, the data is heading > towards the p2p device via leg B (might be congested), the completion > goes directly to the RC, and then the host issues a read from the > bar via leg A. I don't understand what can guarantee ordering here. > Stephen told me that this still guarantees ordering, but I honestly > can't understand how, perhaps someone can explain to me in a simple > way that I can understand. Sagi As long as legA, legB and the RC are all connected to the same switch then ordering will be preserved (I think many other topologies also work). Here is how it would work for the problem case you are concerned about (which is a read from the NVMe drive). 1. Disk device DMAs out the data to the p2pmem device via a string of PCIe MemWr TLPs. 2. Disk device writes to the completion queue (in system memory) via a MemWr TLP. 3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch due to congestion but if so they are stalled in the egress path of the switch for the p2pmem port. 4. The RC determines the IO is complete when the TLP associated with step 2 updates the memory associated with the CQ. It issues some operation to read the p2pmem. 5. Regardless of whether the MemRd TLP comes from the RC or another device connected to the switch it is queued in the egress queue for the p2pmem port behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can never pass writes). Therefore the MemRd can never get to the p2pmem device until after the last DMA MemWr has. I hope this helps! Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
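The ordering argument in step 5 can be illustrated with a toy model of the egress queue: since a MemRd is never serviced ahead of earlier MemWr TLPs in the same queue, every DMA write queued before the read has drained by the time the read reaches the p2pmem device. This is an illustrative model only, not real TLP processing:

```c
#include <stddef.h>

/* TLPs queued at the switch egress port toward the p2pmem device. */
enum tlp_type { MEMWR, MEMRD };

/*
 * Count the MemWr TLPs that have drained before the first MemRd is
 * serviced. Because PCIe reads never pass writes in the same queue,
 * this equals the total number of writes queued ahead of the read.
 */
static int writes_drained_before_read(const enum tlp_type *q, int n)
{
	int drained = 0;

	for (int i = 0; i < n; i++) {
		if (q[i] == MEMRD)
			return drained;
		if (q[i] == MEMWR)
			drained++;
	}
	return drained;
}
```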
Re: Enabling peer to peer device transactions for PCIe devices
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote: > > > On 06/01/17 11:26 AM, Jason Gunthorpe wrote: > > >> Make a generic API for all of this and you'd have my vote.. >> >> >> IMHO, you must support basic pinning semantics - that is necessary to >> support generic short lived DMA (eg filesystem, etc). That hardware can >> clearly do that if it can support ODP. > > I agree completely. > > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs > (ie. at least those backed with ZONE_DEVICE memory). Then > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace > (using whatever interface is most appropriate) and userspace can do what > it pleases with them. This makes _so_ much sense and actually largely > already works today (as demonstrated by iopmem). +1 for iopmem ;-) I feel like we are going around and around on this topic. I would like to see something that is upstream that enables P2P even if it is only the minimum viable useful functionality to begin with. I think aiming for the moon (which is what HMM and things like it are) is simply going to take more time if they ever get there. There is a use case for in-kernel P2P PCIe transfers between two NVMe devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or BARs on the NIC). I am even seeing users who now want to move data P2P between FPGAs and NVMe SSDs and the upstream kernel should be able to support these users or they will look elsewhere. The iopmem patchset addressed all the use cases above and while it is not an in-kernel API it could have been modified to be one reasonably easily. As Logan states the driver can then choose to pass the VMAs to user-space in a manner that makes sense. Earlier in the thread someone mentioned LSF/MM. There is already a proposal to discuss this topic so if you are interested please respond to the email letting the committee know this topic is of interest to you [1]. 
Also earlier in the thread someone discussed the issues around the IOMMU. Given the known issues around P2P transfers in certain CPU root complexes [2] it might just be a case of only allowing P2P when a PCIe switch connects the two EPs. Another option is just to use CONFIG_EXPERT and make sure people are aware of the pitfalls if they invoke the P2P option. Finally, as Jason noted, we could all just wait until CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces are the remit of the CPU vendors I think it behooves us to solve this problem before then. Also some of the above mentioned protocols are not even switchable and may not be amenable to a P2P topology... Stephen [1] http://marc.info/?l=linux-mm&m=148156541804940&w=2 [2] https://community.mellanox.com/docs/DOC-1119 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
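The "only allow P2P when a PCIe switch connects the two EPs" rule mentioned above can be expressed as a simple upstream-port walk. This is a toy model with invented types; a real check would walk the struct pci_dev upstream bridge chain instead:

```c
#include <stddef.h>

/* Toy PCI topology node: an endpoint, switch port, or root complex. */
struct pci_node {
	const char *name;
	struct pci_node *parent; /* upstream device, NULL at the root */
	int is_switch;
};

/* Return the nearest upstream switch of a device, or NULL if none. */
static struct pci_node *upstream_switch(struct pci_node *dev)
{
	for (struct pci_node *p = dev->parent; p; p = p->parent)
		if (p->is_switch)
			return p;
	return NULL;
}

/* Allow P2P only when both endpoints sit below the same switch. */
static int p2p_allowed(struct pci_node *a, struct pci_node *b)
{
	struct pci_node *sa = upstream_switch(a);

	return sa && sa == upstream_switch(b);
}
```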
Re: Enabling peer to peer device transactions for PCIe devices
>>> I've already recommended that iopmem not be a block device and >>> instead be a device-dax instance. I also don't think it should claim >>> the PCI ID, rather the driver that wants to map one of its bars this >>> way can register the memory region with the device-dax core. >>> >>> I'm not sure there are enough device drivers that want to do this to >>> have it be a generic /sys/.../resource_dmableX capability. It still >>> seems to be an exotic one-off type of configuration. >> >> >> Yes, this is essentially my thinking. Except I think the userspace >> interface should really depend on the device itself. Device dax is a >> good choice for many and I agree the block device approach wouldn't be >> ideal. I tend to agree here. The block device interface has seen quite a bit of resistance and /dev/dax looks like a better approach for most. We can look at doing it that way in v2. >> >> Specifically for NVME CMB: I think it would make a lot of sense to just >> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB >> buffers would be volatile and thus you wouldn't need to keep track of >> where in the BAR the region came from. Thus, the mmap call would just be >> an allocator from BAR memory. If device-dax were used, userspace would >> need to lookup which device-dax instance corresponds to which nvme >> drive. >> > > I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the > device-dax instance under the nvme device, or if you already have the nvme > sysfs path the dax instance(s) will appear under the "dax" sub-directory. > Personally I think mapping the dax resource in the sysfs tree is a nice way to do this and a bit more intuitive than mapping a /dev/nvmeX. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: Enabling peer to peer device transactions for PCIe devices
Hi All This has been a great thread (thanks to Alex for kicking it off) and I wanted to jump in and maybe try and put some summary around the discussion. I also wanted to propose we include this as a topic for LSF/MM because I think we need more discussion on the best way to add this functionality to the kernel. As far as I can tell the people looking for P2P support in the kernel fall into two main camps: 1. Those who simply want to expose static BARs on PCIe devices that can be used as the source/destination for DMAs from another PCIe device. This group has no need for memory invalidation and are happy to use physical/bus addresses and not virtual addresses. 2. Those who want to support devices that suffer from occasional memory pressure and need to invalidate memory regions from time to time. This camp also would like to use virtual addresses rather than physical ones to allow for things like migration. I am wondering if people agree with this assessment? I think something like the iopmem patches Logan and I submitted recently come close to addressing use case 1. There are some issues around routability but based on feedback to date that does not seem to be a show-stopper for an initial inclusion. For use-case 2 it looks like there are several options and some of them (like HMM) have been around for quite some time without gaining acceptance. I think there needs to be more discussion on this usecase and it could be some time before we get something upstreamable. I, for one, would really like to see use case 1 get addressed soon because we have consumers for it coming soon in the form of CMBs for NVMe devices. Long term I think Jason summed it up really well. CPU vendors will put high-speed, open, switchable, coherent buses on their processors and all these problems will vanish. But I ain't holding my breath for that to happen ;-). Cheers Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Tue, October 25, 2016 3:19 pm, Dave Chinner wrote: > On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote: >> >> Dave are you saying that even for local mappings of files on a DAX >> capable system it is possible for the mappings to move on you unless the >> FS supports locking? >> > > Yes. > > >> Does that not mean DAX on such FS is >> inherently broken? > > No. DAX is accessed through a virtual mapping layer that abstracts > the physical location from userspace applications. > > Example: think copy-on-write overwrites. It occurs atomically from > the perspective of userspace and starts by invalidating any current > mappings userspace has of that physical location. The location is changes, > the data copied in, and then when the locks are released userspace can > fault in a new page table mapping on the next access Dave Thanks for the good input and for correcting some of my DAX misconceptions! We will certainly be taking this into account as we consider v1. > >>>> And at least for XFS we have such a mechanism :) E.g. I have a >>>> prototype of a pNFS layout that uses XFS+DAX to allow clients to do >>>> RDMA directly to XFS files, with the same locking mechanism we use >>>> for the current block and scsi layout in xfs_pnfs.c. >> >> Thanks for fixing this issue on XFS Christoph! I assume this problem >> continues to exist on the other DAX capable FS? > > Yes, but it they implement the exportfs API that supplies this > capability, they'll be able to use pNFS, too. > >> One more reason to consider a move to /dev/dax I guess ;-)... >> > > That doesn't get rid of the need for sane access control arbitration > across all machines that are directly accessing the storage. That's the > problem pNFS solves, regardless of whether your direct access target is a > filesystem, a block device or object storage... Fair point. I am still hoping for a bit more discussion on the best choice of user-space interface for this work. 
If/When that happens we will take it into account when we look at spinning the patchset. Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
Hi Dave and Christoph On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote: > On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote: > > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote: > > > You do realise that local filesystems can silently change the > > > location of file data at any point in time, so there is no such > > > thing as a "stable mapping" of file data to block device addresses > > > in userspace? > > > > > > If you want remote access to the blocks owned and controlled by a > > > filesystem, then you need to use a filesystem with a remote locking > > > mechanism to allow co-ordinated, coherent access to the data in > > > those blocks. Anything else is just asking for ongoing, unfixable > > > filesystem corruption or data leakage problems (i.e. security > > > issues). > > Dave are you saying that even for local mappings of files on a DAX capable system it is possible for the mappings to move on you unless the FS supports locking? Does that not mean DAX on such FS is inherently broken? > > And at least for XFS we have such a mechanism :) E.g. I have a > > prototype of a pNFS layout that uses XFS+DAX to allow clients to do > > RDMA directly to XFS files, with the same locking mechanism we use > > for the current block and scsi layout in xfs_pnfs.c. > Thanks for fixing this issue on XFS Christoph! I assume this problem continues to exist on the other DAX capable FS? One more reason to consider a move to /dev/dax I guess ;-)... Stephen > Oh, that's good to know - pNFS over XFS was exactly what I was > thinking of when I wrote my earlier reply. A few months ago someone > else was trying to use file mappings in userspace for direct remote > client access on fabric connected devices. I told them "pNFS on XFS > and write an efficient transport for you hardware" > > Now that I know we've got RDMA support for pNFS on XFS in the > pipeline, I can just tell them "just write an rdma driver for your > hardware" instead. 
:P > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
> >> > >> If you're only using the block-device as a entry-point to create > >> dax-mappings then a device-dax (drivers/dax/) character-device might > >> be a better fit. > >> > > > > We chose a block device because we felt it was intuitive for users to > > carve up a memory region but putting a DAX filesystem on it and creating > > files on that DAX aware FS. It seemed like a convenient way to > > partition up the region and to be easily able to get the DMA address > > for the memory backing the device. > > > > That said I would be very keen to get other peoples thoughts on how > > they would like to see this done. And I know some people have had some > > reservations about using DAX mounted FS to do this in the past. > > I guess it depends on the expected size of these devices BARs, but I > get the sense they may be smaller / more precious such that you > wouldn't want to spend capacity on filesystem metadata? For the target > use case is it assumed that these device BARs are always backed by > non-volatile memory? Otherwise this is a mkfs each boot for a > volatile device. Dan Fair point and this is a concern I share. We are not assuming that all iopmem devices are backed by non-volatile memory so the mkfs recreation comment is valid. All in all I think you are persuading us to take a look at /dev/dax ;-). I will see if anyone else chips in with their thoughts on this. > > >> > >> > 2. Memory Segment Spacing. This patch has the same limitations that > >> > ZONE_DEVICE does in that memory regions must be spaces at least > >> > SECTION_SIZE bytes part. On x86 this is 128MB and there are cases where > >> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not > >> > be usable on neighboring BARs. For our purposes, this is not an issue as > >> > we'd only be looking at enabling a single BAR in a given PCIe device. > >> > More exotic use cases may have problems with this. 
> >> > >> I'm working on patches for 4.10 to allow mixing multiple > >> devm_memremap_pages() allocations within the same physical section. > >> Hopefully this won't be a problem going forward. > >> > > > > Thanks Dan. Your patches will help address the problem of how to > > partition a /dev/dax device but they don't help the case then BARs > > themselves are small, closely spaced and non-segment aligned. However > > I think most people using iopmem will want to use reasonbly large > > BARs so I am not sure item 2 is that big of an issue. > > I think you might have misunderstood what I'm proposing. The patches > I'm working on are separate from a facility to carve up a /dev/dax > device. The effort is to allow devm_memremap_pages() to maintain > several allocations within the same 128MB section. I need this for > persistent memory to handle platforms that mix pmem and system-ram in > the same section. I want to be able to map ZONE_DEVICE pages for a > portion of a section and be able to remove portions of section that > may collide with allocations of a different lifetime. Oh I did misunderstand. This is very cool and would be useful to us. One more reason to consider moving to /dev/dax in the next spin of this patchset ;-). Thanks Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote: > [ adding Ashok and David for potential iommu comments ] > Hi Dan Thanks for adding Ashok and David! > > I agree with the motivation and the need for a solution, but I have > some questions about this implementation. > > > > > Consumers > > - > > > > We provide a PCIe device driver in an accompanying patch that can be > > used to map any PCIe BAR into a DAX capable block device. For > > non-persistent BARs this simply serves as an alternative to using > > system memory bounce buffers. For persistent BARs this can serve as an > > additional storage device in the system. > > Why block devices? I wonder if iopmem was initially designed back > when we were considering enabling DAX for raw block devices. However, > that support has since been ripped out / abandoned. You currently > need a filesystem on top of a block-device to get DAX operation. > Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward > if all you want is a way to map the bar for another PCI-E device in > the topology. > > If you're only using the block-device as a entry-point to create > dax-mappings then a device-dax (drivers/dax/) character-device might > be a better fit. > We chose a block device because we felt it was intuitive for users to carve up a memory region by putting a DAX filesystem on it and creating files on that DAX aware FS. It seemed like a convenient way to partition up the region and to be easily able to get the DMA address for the memory backing the device. That said I would be very keen to get other people's thoughts on how they would like to see this done. And I know some people have had some reservations about using DAX mounted FS to do this in the past. > > > 2. Memory Segment Spacing. This patch has the same limitations that > > ZONE_DEVICE does in that memory regions must be spaces at least > > SECTION_SIZE bytes part. 
On x86 this is 128MB and there are cases where > > BARs can be placed closer together than this. Thus ZONE_DEVICE would not > > be usable on neighboring BARs. For our purposes, this is not an issue as > > we'd only be looking at enabling a single BAR in a given PCIe device. > > More exotic use cases may have problems with this. > > I'm working on patches for 4.10 to allow mixing multiple > devm_memremap_pages() allocations within the same physical section. > Hopefully this won't be a problem going forward. > Thanks Dan. Your patches will help address the problem of how to partition a /dev/dax device but they don't help the case where BARs themselves are small, closely spaced and non-segment aligned. However I think most people using iopmem will want to use reasonably large BARs so I am not sure item 2 is that big of an issue. > I haven't yet grokked the motivation for this, but I'll go comment on > that separately. Thanks Dan! ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.
On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <sba...@raithlin.com> wrote:
> > From: Logan Gunthorpe <log...@deltatee.com>
> >
> > We build on recent work that adds memory regions owned by a device
> > driver (ZONE_DEVICE) [1] and adds struct page support for these new
> > regions of memory [2].
> >
> > 1. Add an extra flags argument into dev_memremap_pages to take in a
> > MEMREMAP_XX argument. We update the existing calls to this function
> > to reflect the change.
> >
> > 2. For completeness, we add MEMREMAP_WT support to memremap; however
> > we have no actual need for this functionality.
> >
> > 3. We add the static functions add_zone_device_pages and
> > remove_zone_device_pages. These are similar to arch_add_memory except
> > they don't create the memory mapping. We don't believe these need to
> > be made arch specific, but are open to other opinions.
> >
> > 4. dev_memremap_pages and devm_memremap_pages_release are updated to
> > treat IO memory slightly differently. For IO memory we use a
> > combination of the appropriate io_remap function and the zone_device
> > pages functions created above. A flags variable and kaddr pointer are
> > added to struct page_map to facilitate this for the release function.
> > We also set up the page attribute tables for the mapped region
> > correctly based on the desired mapping.
>
> This description says "what" is being done, but not "why".

Hi Dan,

We discuss the motivation in the cover letter.

> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> memory", it mentions that the lack of I/O coherency is a known issue
> and users of this functionality need to be cognizant of the pitfalls.
> If that is the case why do we need support for different cpu mapping
> types than the default write-back cache setting? It's up to the
> application to handle cache cpu flushing similar to what we require of
> device-dax users in the persistent memory case.

Some of the iopmem hardware we have tested has certain alignment
restrictions on BAR accesses. At the very least we require write combine
mappings for these. We then felt it appropriate to add the other
mappings for the sake of completeness.

Cheers

Stephen
[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.
From: Logan Gunthorpe <log...@deltatee.com>

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and adds struct page support for these new
regions of memory [2].

1. Add an extra flags argument into dev_memremap_pages to take in a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to memremap; however we
have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch specific, but are open to other opinions.

4. dev_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device pages functions
created above. A flags variable and kaddr pointer are added to struct
page_map to facilitate this for the release function. We also set up the
page attribute tables for the mapped region correctly based on the
desired mapping.
[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 drivers/dax/pmem.c                |  4 +-
 drivers/nvdimm/pmem.c             |  4 +-
 include/linux/memremap.h          |  5 ++-
 kernel/memremap.c                 | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;

-	addr = devm_memremap_pages(dev, , _pmem->ref, altmap);
+	addr = devm_memremap_pages(dev, , _pmem->ref, altmap,
+			ARCH_MEMREMAP_PMEM);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, _res, >q_usage_counter,
-				altmap);
+				altmap, ARCH_MEMREMAP_PMEM);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, >res,
-				>q_usage_counter, NULL);
+				>q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, unsigned long flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without

diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+	PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
 	struct resource res;
 	struct percpu_ref *ref;
 	struct dev_pagemap pgmap;
 	struct vmem_altmap altmap;
+	void *kaddr;
+	int flags;
 };

+stat
[PATCH 3/3] iopmem : Add documentation for iopmem driver
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
 	- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
 	- notes and driver options for the floppy disk driver.
+iopmem.txt
+	- info on the iopmem block driver.
 mflash.txt
 	- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===================
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+------------
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-----
+
+To include the iopmem module in your kernel please set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose for an iopmem block device is expected to be for peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the three following things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+------
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch.)
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4