Re: dmaengine support for PMEM

2018-08-21 Thread Stephen Bates
>Here's where I left it last
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djiang/linux.git/log/?h=pmem_blk_dma

Thanks Dave. I'll certainly rebase these on 4.18.x and do some testing! 

> I do think we need to do some rework with the dmaengine in order to get
>  better efficiency as well. At some point I would like to see a call in
> dmaengine that will take a request (similar to mq) and just operate on
> that and submit the descriptors in a single call. I think that can
> possibly deprecate all the host of function pointers for dmaengine. I'm
> hoping to find some time to take a look at some of this work towards the
> end of the year. But I'd be highly interested if you guys have ideas and
> thoughts on this topic. And you are welcome to take my patches and run
> with it.

OK, we were experimenting with a single PMEM driver and making decisions on DMA 
vs memcpy based on IO size rather than forcing the user to choose which driver 
to use. 
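
To illustrate the idea, here is a minimal sketch of that size-based dispatch; the threshold value, the pmem_device fields and the two helpers are made-up placeholders, not the actual driver code:

/* Illustrative only: use a dmaengine channel for large transfers and
 * fall back to memcpy for small ones, where descriptor setup cost
 * outweighs the copy itself. */
#define PMEM_DMA_THRESHOLD	(32 * 1024)	/* arbitrary cut-off */

static int pmem_do_io(struct pmem_device *pmem, struct page *page,
		      unsigned int len, sector_t sector, bool is_write)
{
	if (pmem->dma_chan && len >= PMEM_DMA_THRESHOLD)
		return pmem_submit_dma(pmem, page, len, sector, is_write);

	return pmem_do_memcpy(pmem, page, len, sector, is_write);
}

The interesting part is picking the threshold; it really wants to be derived from measured descriptor-setup overhead rather than hard-coded.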

Stephen




dmaengine support for PMEM

2018-08-21 Thread Stephen Bates
Hi Dave

I hope you are well. Logan and I were looking at adding DMA support to PMEM and 
then were informed you have proposed some patches to do just that for the ioat 
DMA engine. The latest version of those I can see were the v7 from August 2017. 
Is there a more recent version? What happened to that series?

https://lists.01.org/pipermail/linux-nvdimm/2017-August/012208.html

Cheers
 
Stephen
 



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be 
appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
---
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Cheers

Stephen


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
>I find this hard to believe. There's always the possibility that some 
>part of the system doesn't support ACS so if the PCI bus addresses and 
>IOVA overlap there's a good chance that P2P and ATS won't work at all on 
>some hardware.

I tend to agree but this comes down to how IOVA addresses are generated in the 
kernel. Alex (or anyone else) can you point to where IOVA addresses are 
generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI 
bus address directly but other IO going to the same PCI EP may flow through the 
IOMMU and be programmed with IOVA rather than PCI bus addresses.
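
To make the two address types concrete, here is a hedged sketch (pdev, peer_pdev, page, bar and offset are assumed variables, not code from the series):

/* Normal I/O path: the DMA API returns an IOVA when the IOMMU is
 * enabled (or a host physical address when it is off/passthrough). */
dma_addr_t iova = dma_map_page(&pdev->dev, page, 0, PAGE_SIZE,
			       DMA_TO_DEVICE);

/* p2pdma path in this series: bypass the DMA API and program the EP
 * with the peer BAR address as seen on the PCI bus. */
pci_bus_addr_t peer_addr = pci_bus_address(peer_pdev, bar) + offset;

The point being that the same EP can be handed both kinds of address depending on which path submitted the I/O.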

> I prefer 
>the option to disable the ACS bit on boot and let the existing code put 
>the devices into their own IOMMU group (as it should already do to 
>support hardware that doesn't have ACS support).

+1

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Hopes this helps understanding the big picture. I over simplify thing and
>devils is in the details.

This was a great primer; thanks for putting it together. An LWN.net article 
perhaps ;-)?

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Note on GPU we do would not rely on ATS for peer to peer. Some part
>of the GPU (DMA engines) do not necessarily support ATS. Yet those
>are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU 
components most applicable to p2pdma.

>We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>for performance reasons ie we do not care having our transaction going
>to the root complex and back down the destination. At least in use case
>i am working on this is fine.

If the GPU people are the good guys, does that make the NVMe people the bad guys 
;-)? If so, what are the RDMA people? Again, good to know.

>Reasons is that GPU are giving up on PCIe (see all specialize link like
>NVlink that are popping up in GPU space). So for fast GPU inter-connect
>we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it 
;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.

>Also the IOMMU isolation do matter a lot to us. Think someone using this
>peer to peer to gain control of a server in the cloud.

I agree that IOMMU isolation is very desirable. Hence the desire to keep the 
IOMMU on while doing p2pdma, if at all possible, while still delivering the 
desired performance to the user.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address. 

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
initiator with an IOVA but with the PCI bus address...

> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
>  Redir bits are on they will be forced back to the RC by the switch and
>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>  where we want and everything will work (but we lose the isolation of ACS).

Agreed.

>For EPs that support ATS, we should (but don't necessarily have to)
>program them with the IOVA address so they can go through the
>translation process which will allow P2P without disabling the ACS Redir
>bits -- provided the ACS direct translation bit is set. (And btw, if it
>is, then we lose the benefit of ACS protecting against malicious EPs).
>But, per above, the ATS transaction should involve only the IOVA address
>so the ACS bits not being set should not break ATS.

Well, we would still have to clear some ACS bits, but now we can clear them only 
for translated addresses.

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>  user is the AMD GPU driver see:

Ah thanks. This cleared things up for me. A quick search shows there are still 
no users of intel_svm_bind_mm() but I see the AMD version used in that GPU 
driver.

One thing I could not grok from the code is how the GPU driver indicates which 
DMA events require ATS translations and which do not. I am assuming the driver 
implements some way of indicating that and it's not just a global ON or OFF for 
all DMAs? The reason I ask is that I am looking at what would need to be added 
to the NVMe spec, above and beyond what we already have in PCI ATS, for NVMe to 
make efficient use of ATS (for example, would we need a flag in the submission 
queue entries to indicate that a particular IO's SGL/PRP should undergo ATS 
translation).
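
Purely as a strawman (nothing like this exists in the NVMe specification today), the kind of per-command hint I have in mind is a single flag bit in the submission queue entry:

/* Hypothetical, for discussion only -- not an NVMe spec field. A flag
 * in the SQE marking this command's PRPs/SGLs as addresses that the
 * controller should resolve via ATS before issuing the DMA. */
#define NVME_CMD_HINT_ATS_TRANSLATE	(1 << 0)	/* made-up bit */

Whether such a hint belongs per command, per namespace or per queue is exactly the sort of question I would want to take to the NVMe working groups.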

Cheers

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID 
>together with ATS to identify the address space which a transaction 
>should use.

I think you are conflating two types of TLPs here. If the device supports ATS 
then it will issue a TR TLP to obtain a translated address from the IOMMU. This 
TR TLP will be addressed to the RP and so, regardless of ACS, it is going up to 
the Root Port. When it gets the response it has the physical address and can 
use that, with the TA bit set, for the p2pdma. In the case of ATS support we 
also have more control over ACS as we can disable it just for translated 
addresses (as per 7.7.7.2 of the spec).

> If I'm not completely mistaken when you disable ACS it is perfectly
> possible that a bridge identifies a transaction as belonging to a peer
> address, which isn't what we want here.
   
You are right here and I think this illustrates a problem for using the IOMMU 
at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS then I think 
we have to disable the IOMMU (something Mike suggested earlier). The reason is 
that since ATS is not an option the EP must initiate the DMA using the 
addresses passed down to it. If the IOMMU is on then this is an IOVA that could 
(with some non-zero probability) point to an IO Memory address in the same PCI 
domain. So if we disable ACS we are in trouble as we might MemWr to the wrong 
place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the 
IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping 
issues.

So I think if we want to support performant P2PDMA for devices that don't have 
ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I 
know this is problematic for AMD's use case so perhaps we also need to consider 
a mode for P2PDMA for devices that DO support ATS where we can leave the IOMMU 
enabled (but in this case EPs without ATS cannot participate as P2PDMA initiators).
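
A minimal sketch of how the kernel could gate such a mode, assuming all we check is that every EP in the proposed p2pdma group exposes the PCIe ATS extended capability (the helper name is illustrative):

/* Sketch: only allow an IOMMU-enabled p2pdma mode if every endpoint
 * in the group advertises the ATS extended capability. */
static bool p2pdma_clients_ats_capable(struct pci_dev **clients, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (!pci_find_ext_capability(clients[i], PCI_EXT_CAP_ID_ATS))
			return false;
	return true;
}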

Make sense?

Stephen
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.

Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so, can you point me to 
it) or is this all done via user-space? Based on my grepping of the kernel code 
I see zero EP drivers using in-kernel ATS functionality right now...

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Christian

>Interesting point, give me a moment to check that. That finally makes 
>all the hardware I have standing around here valuable :)

Yes. At the very least it provides an initial standards based path for P2P DMAs 
across RPs which is something we have discussed on this list in the past as 
being desirable.

BTW I am trying to understand how an ATS-capable EP function determines when to 
perform an ATS Translation Request (ATS TR). Is there an upstream example of 
the driver for your APU that uses ATS? If so, can you provide a pointer to it? 
Do you provide some type of entry in the submission queues for commands going 
to the APU to indicate whether the address associated with a specific command 
should be translated using ATS? Or do you simply enable ATS and then all 
addresses passed to your APU that miss the local cache result in an ATS TR?

Your feedback would be useful as I initiate discussions within the NVMe 
community on where we might go with ATS...

Thanks

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Jerome and Christian

> I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCIe spec. It looks like even an 
ATS-translated TLP is still impacted by ACS, though there is a separate control 
knob for translated-address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if 
your device supports ATS, a P2P DMA will still be routed to the associated RP of 
the domain and down again unless we disable ACS Direct Translated P2P on all 
bridges between the two devices involved in the P2P DMA. 

So we still don't get fine grained control with ATS and I guess we still have 
security issues because a rogue or malfunctioning EP could just as easily issue 
TLPs with TA set vs not set.

> Also ATS is meaningless without something like PASID as far as i know.

ATS is still somewhat valuable without PASID in the sense that you can cache 
IOMMU address translations at the EP. This saves hammering on the IOMMU as much 
in certain workloads.

Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS 
AND can implement P2P between Root Ports should advertise the "ACS Direct 
Translated P2P (T)" capability. This ties into the discussion around P2P 
between Root Ports we had a few weeks ago...

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Don

>RDMA VFs lend themselves to NVMEoF w/device-assignment need a way to
>put NVME 'resources' into an assignable/manageable object for 
> 'IOMMU-grouping',
>which is really a 'DMA security domain' and less an 'IOMMU grouping 
> domain'.

Ha, I like your term "DMA Security Domain", which sounds about right for what we 
are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, 
in some ways, too big of a hammer for what we want here, in the sense that it is 
either on or off for the bridge or MF EP we enable/disable it for. ACS can't 
filter TLPs by address or ID, though PCI-SIG are having some discussions on 
extending ACS. That's a long-term solution and won't be applicable to us for 
some time.

NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe 
SSDs will support SR-IOV. That will probably remain a pretty high-end feature...

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Logan

>Yeah, I'm having a hard time coming up with an easy enough solution for
>the user. I agree with Dan though, the bus renumbering risk would be
>fairly low in the custom hardware seeing the switches are likely going
>to be directly soldered to the same board with the CPU.

I am afraid that soldered down assumption may not be valid. More and more PCIe 
cards with PCIe switches on them are becoming available and people are using 
these to connect servers to arrays of NVMe SSDs which may make the topology 
more dynamic.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Alex and Don

>Correct, the VM has no concept of the host's IOMMU groups, only the
>   hypervisor knows about the groups, 

But as I understand it these groups are usually passed through to VMs on a 
per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and 
IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings, 
but it is impacted by them in the sense that if the groupings change, the PCI 
topology presented to the VM needs to change too.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
>Yeah, so based on the discussion I'm leaning toward just having a
>command line option that takes a list of BDFs and disables ACS for them.
>(Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't 
want to re-spin only to find we are still not converging on the ACS issue.

Thanks

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Jerome

>I think there is confusion here, Alex properly explained the scheme
>   PCIE-device do a ATS request to the IOMMU which returns a valid
>translation for a virtual address. Device can then use that address
>directly without going through IOMMU for translation.

This makes sense and to be honest I now understand ATS and its interaction with 
ACS a lot better than I did 24 hours ago ;-).

>ATS is implemented by the IOMMU not by the device (well device implement
>the client side of it). Also ATS is meaningless without something like
>PASID as far as i know.

I think it's the client side that is important to us. Not many EPs support ATS 
today and it's not clear if many will in the future. So assuming we want to do 
p2pdma between devices, some of which do NOT support ATS, how best do we handle 
the ACS issue? Disabling the IOMMU seems a bit strong to me given this impacts 
all the PCI domains in the system and not just the domain we wish to do P2P on.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Don

>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two 
>devices.
>That agent should 'request' to the kernel that ACS be removed/circumvented 
> (p2p enabled) btwn two endpoints.
>I recommend doing so via a sysfs method.

Yes we looked at something like this in the past but it does hit the IOMMU 
grouping issue I discussed earlier today which is not acceptable right now. In 
the long term, once we get IOMMU grouping callbacks to VMs we can look at 
extending p2pdma in this way. But I don't think this is viable for the initial 
series. 


>So I don't understand the comments why VMs should need to know.

As I understand it VMs need to know because VFIO passes IOMMU grouping up into 
the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology 
changes. I think we even have to be cognizant of the fact that the OS running in 
the VM may not even support hot-plug of PCI devices.

> Is there a thread I need to read up to explain /clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for 
the input!

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Dan

>It seems unwieldy that this is a compile time option and not a runtime
>option. Can't we have a kernel command line option to opt-in to this
>behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we 
wanted to make it very clear people were choosing one (p2pdma) or the other 
(IOMMU groupings and isolation). However personally I would prefer including 
the option of a run-time kernel parameter too. In fact a few months ago I 
proposed a small patch that did just that [1]. It never really went anywhere 
but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is 
associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either a RP, DSP, USP or MF EP below which we 
disable ACS? We could do that but I don't think it avoids the issue of changes 
in IOMMU groupings as devices are added/removed. It simply changes the problem 
from affecting an entire PCI domain to a sub-set of the domain. We can already 
handle this by doing p2pdma on one RP and normal IOMMU isolation on the other 
RPs in the system.
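
For completeness, the sort of run-time opt-in being discussed would look roughly like the sketch below; the parameter name and syntax here are invented for illustration (they are not taken from [1]):

/* Hypothetical boot parameter, e.g.:
 *   p2pdma_disable_acs=0000:03:00.0;0000:03:01.0
 * capturing a list of downstream ports below which the ACS P2P
 * redirect bits would be cleared during enumeration. */
static char p2pdma_acs_list[256];

static int __init p2pdma_disable_acs_setup(char *str)
{
	strscpy(p2pdma_acs_list, str, sizeof(p2pdma_acs_list));
	return 1;
}
__setup("p2pdma_disable_acs=", p2pdma_disable_acs_setup);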

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates

Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all 
DMAs will now get routed up to the IOMMU before being passed down to the 
destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA 
will be routed to the IOMMU which removes a lot of the benefit. 

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a 
reasonably lengthy discussion on the mailing lists. Alex, are you still 
comfortable with this approach?

> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous 
revisions. The issue is that currently there is no mechanism in the IOMMU code 
to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change 
its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS 
settings could change. Since there is no way to currently handle changing ACS 
settings and hence IOMMU groupings the consensus was to simply disable ACS on 
all ports in a p2pdma domain. This effectively makes all the devices in the 
p2pdma domain part of the same IOMMU grouping. The plan will be to address this 
in time and add a mechanism for IOMMU grouping changes and notification to VMs 
but that's not part of this series. Note you are still allowed to have ACS 
functioning on other PCI domains, so if you do need a plurality of IOMMU 
groupings you can still achieve that (but you can't do p2pdma across IOMMU 
groupings, which is safe).
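
For context, the mechanics of the ACS disable amount to something like the simplified sketch below (one downstream port at a time); this is an approximation for discussion, not the exact code in the series:

/* Simplified: clear the ACS P2P Request Redirect and Completion
 * Redirect bits on a port so peer-to-peer TLPs are routed directly by
 * the switch instead of being forced up to the root complex. */
static void p2pdma_clear_acs_redir(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}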

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least 
> with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the 
ports between the two peer devices to get the p2p benefit? If not you are not 
getting all the performance benefit (due to IOMMU routing); if you are, then 
there are obviously security implications between those IOMMU domains if they 
are assigned to different VMs. And now the issue is that if new devices are 
added and the p2p topology needs to change, there would be no way to inform the 
VMs of any IOMMU group change. 

Cheers

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-04-13 Thread Stephen Bates
 
>  I'll see if I can get our PCI SIG people to follow this through 

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track 
it. I can see this being doable between Root Ports that reside in the same Root 
Complex but might become more challenging to standardize for RPs that reside in 
different RCs in the same (potentially multi-socket) system. I know in the past 
we have seem MemWr TLPS cross the QPI bus in Intel systems but I am sure that 
is not something that would work in all systems and must fall outside the remit 
of PCI-SIG ;-).

I agree such a capability bit would be very useful but it's going to be quite 
some time before we can rely on hardware being available that supports it.

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-24 Thread Stephen Bates
> That would be very nice but many devices do not support the internal
> route. 

But Logan in the NVMe case we are discussing movement within a single function 
(i.e. from a NVMe namespace to a NVMe CMB on the same function). Bjorn is 
discussing movement between two functions (PFs or VFs) in the same PCIe EP. In 
the case of multi-function endpoints I think the standard requires those 
devices to support internal DMAs for transfers between those functions (but 
does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs but 
in the case of NVMe we do have a protocol specific way for a NVMe function to 
indicate it supports via the CMB BAR. Other protocols may also have such 
methods but I am not aware of them at this time.

2. For multi-function end-points I think it is a requirement that DMAs 
*between* functions are supported via an internal path but this can be 
over-ridden by ACS when supported in the EP.

3. For multi-function end-points there is no requirement to support internal 
DMA within each individual function (i.e. a la point 1 but extended to each 
function in a MF device). 

Based on my review of the specification I concur with Bjorn that p2pdma between 
functions in a MF end-point should be assured to be supported via the standard. 
However if the p2pdma involves only a single function in a MF device then we 
can only support NVMe CMBs for now. Let's review and see what the options are 
for supporting this in the next respin.
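
As a concrete aside on distinguishing points 2 and 3, detecting whether two endpoints are functions of the same multi-function device is straightforward; a hedged helper (ignoring ARI, where devfn is interpreted differently) might look like:

/* True if 'a' and 'b' are different functions of the same physical
 * multi-function device (same bus and slot). Ignores ARI. */
static bool same_mf_device(struct pci_dev *a, struct pci_dev *b)
{
	return a->bus == b->bus &&
	       PCI_SLOT(a->devfn) == PCI_SLOT(b->devfn) &&
	       a->devfn != b->devfn;
}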

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-22 Thread Stephen Bates
>  I've seen the response that peers directly below a Root Port could not
> DMA to each other through the Root Port because of the "route to self"
> issue, and I'm not disputing that.  

Bjorn 

You asked me for a reference to RTS in the PCIe specification. As luck would 
have it I ended up in an Irish bar with Peter Onufryk this week at OCP Summit. 
We discussed the topic. It is not explicitly referred to as "Route to Self" and 
it's certainly not obvious, but section 6.2.8.1 of the PCIe 4.0 specification 
discusses error conditions for virtual PCI bridges. One of these conditions 
(given in the very first bullet in that section) applies to a request that is 
destined for the same port it came in on. When this occurs the request must be 
terminated as a UR.

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Stephen Bates
>I assume you want to exclude Root Ports because of multi-function
>  devices and the "route to self" error.  I was hoping for a reference
>  to that so I could learn more about it.

Apologies Bjorn. This slipped through my net. I will try and get you a 
reference for RTS in the next couple of days.

> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
> Functions in SR-IOV Capable and Multi-Function Devices", which seems
> relevant.  It talks about "peer-to-peer Requests (between Functions of
> the device)".  Thay says to me that multi-function devices can DMA
> between themselves.

I will go take a look. Appreciate the link.

Stephen 



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-05 Thread Stephen Bates
>Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with 
what you sent out?

>  I am about to send RFC for nouveau, i am still working out some bugs.

Great. I will keep an eye out for it. An example user of hmm will be very 
helpful.

> i will fix the MAINTAINERS as part of those.

Awesome, thanks.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Stephen Bates
> It seems people miss-understand HMM :( 

Hi Jerome

Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. 
Along the way I came up with a couple of things.

While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and 
DEVICE_PUBLIC. It also gives no indication when one might choose to use one 
over the other. Would it be possible to update hmm.txt to include some 
discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the 
kernel's linear address space for the device memory and DEVICE_PRIVATE does 
not. However, like I said, I am not sure when you would use either one and the 
pros and cons of doing so. I actually ended up finding some useful information 
in memremap.h but I don't think it is fair to expect people to dig *that* deep 
to find this information ;-).

A quick grep shows no drivers using the HMM API in the upstream code today. Is 
this correct? Are there any examples of out-of-tree drivers that use HMM you 
can point me to? As a driver developer what resources exist to help me write an 
HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. You 
might want to fix that when you have a moment.

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-02 Thread Stephen Bates
>http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip

@Keith - my apologies.

@Christoph - thanks for the link

So my understanding of when the technical content surrounding new NVMe 
Technical Proposals (TPs) can be discussed was wrong. I thought the TP content 
could only be discussed once disclosed in the public standard. I have now 
learnt that once the TPs are ratified they are publicly available!

However, as Logan pointed out, PMRs are not relevant to this series so let's 
defer discussion on how to support them to a later date!

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> We don't want to lump these all together without knowing which region you're 
> allocating from, right?

In all seriousness I do agree with you on these, Keith, in the long term. We 
would consider adding property flags for the memory as it is added to the p2p 
core and then the allocator could evolve to intelligently dish it out. 
Attributes like endurance, latency and special write-commit requirements could 
all become attributes in time. Perhaps one more reason for a central entity for 
p2p memory allocation, so this code does not end up having to go into many 
different drivers?
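
Sketching that idea (the flag names are illustrative only, not a proposed API):

/* Illustrative attribute flags a driver could pass when publishing a
 * BAR region to the p2p core; the allocator could later match these
 * against per-request requirements. */
enum p2pmem_attr {
	P2PMEM_ATTR_PERSISTENT		= (1 << 0), /* survives power loss */
	P2PMEM_ATTR_LOW_LATENCY		= (1 << 1), /* e.g. SRAM-backed CMB */
	P2PMEM_ATTR_HIGH_ENDURANCE	= (1 << 2), /* no wear concerns */
	P2PMEM_ATTR_NEEDS_WRITE_COMMIT	= (1 << 3), /* explicit flush/commit */
};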

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> There's a meaningful difference between writing to an NVMe CMB vs PMR

When the PMR spec becomes public we can discuss how best to integrate it into 
the P2P framework (if at all) ;-).

Stephen





Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
>   No, locality matters. If you have a bunch of NICs and bunch of drives
>   and the allocator chooses to put all P2P memory on a single drive your
>   performance will suck horribly even if all the traffic is offloaded.

Sagi brought this up earlier in his comments about the _find_ function. We are 
planning to do something about this in the next version. This might be a 
randomization or a "user-pick" and include a rule around using the p2p_dev on 
the EP if that EP is part of the transaction.

Stephen






Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory

2018-03-01 Thread Stephen Bates
> I'm pretty sure the spec disallows routing-to-self so doing a P2P 
> transaction in that sense isn't going to work unless the device 
> specifically supports it and intercepts the traffic before it gets to 
> the port.

This is correct. Unless the device intercepts the TLP before it hits the 
root-port then this would be considered a "route to self" violation and an 
error event would occur. The same holds for the downstream port on a PCI switch 
(unless route-to-self violations are disabled which violates the spec but which 
I have seen done in certain applications).

Stephen






Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe only interface.

> It just seems like it it making it too complicated.

I disagree. Having a common allocator (instead of some separate allocator per 
driver) makes things simpler.

> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?

P2P is about offloading the memory and PCI subsystem of the host CPU and this 
is achieved no matter which p2p_dev is used.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
> The intention of HMM is to be useful for all device memory that wish
> to have struct page for various reasons.

Hi Jerome, and thanks for your input! Understood. We have looked at HMM in the 
past and long term I definitely would like to consider how we can add P2P 
functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass 
addressable and non-addressable blocks of data between devices. However that is 
well beyond the intentions of this series ;-).

Stephen




Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Stephen Bates
> your kernel provider needs to decide whether they favor device assignment or 
> p2p

Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) 
are such that we really only expect CONFIG_P2P_DMA to be enabled in specific 
instances and in those instances the users have made a decision to favor P2P 
over IOMMU isolation. Or they have setup their PCIe topology in a way that 
gives them IOMMU isolation where they want it and P2P where they want it.

Stephen
 



Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates

> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p trasnaction for a first pass..

I disagree. There is definitely interest in using an NVMe CMB as a bounce buffer 
and in deploying systems where only some of the NVMe SSDs below a switch have a 
CMB but P2P is used to access all of them. Also there are some devices that only 
expose memory and whose entire purpose is to act as a p2p device; supporting 
these devices would be valuable.

> locality is super important for p2p, so I don't think things should
>  start out in a way that makes specifying the desired locality hard.

Ensuring that the EPs engaged in p2p are all directly connected to the same 
PCIe switch ensures locality and (for the switches we have tested) performance. 
I agree solving the case where the namespace and CMB are on the same PCIe EP is 
valuable but I don't see it as critical to initial acceptance of the series.

Stephen




Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Stephen Bates
Thanks for the detailed review Bjorn!

>>  
>> +  Enabling this option will also disable ACS on all ports behind
>> +  any PCIe switch. This effictively puts all devices behind any
>> +  switch into the same IOMMU group.

>
>  Does this really mean "all devices behind the same Root Port"?

Not necessarily. You might have a cascade of switches (i.e. switches below a 
switch) to achieve a very large fan-out (in an NVMe SSD array for example) and 
we will only disable ACS on the ports below the relevant switch.

> What does this mean in terms of device security?  I assume it means,
> at least, that individual devices can't be assigned to separate VMs.

This was discussed during v1 [1]. Disabling ACS on all downstream ports of the 
switch means that all the EPs below it have to be part of the same IOMMU 
grouping. However it was also agreed that as long as the ACS disable occurs at 
boot time (which it does in v2) then the virtualization layer will be aware of 
it and will perform the IOMMU group formation correctly.

> I don't mind admitting that this patch makes me pretty nervous, and I
> don't have a clear idea of what the implications of this are, or how
> to communicate those to end users.  "The same IOMMU group" is a pretty
> abstract idea.

Alex gave a good overview of the implications in [1].

Stephen 

[1] https://marc.info/?l=linux-pci&m=151512320031739&w=2



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
>> So Oliver (CC) was having issues getting any of that to work for us.
>> 
>> The problem is that acccording to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>> 
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd ahve to manage cache coherency in SW).
>   
>   Note: I think the above means it won't work behind a switch on x86
>   either, will it ?
 
Ben 

We have done extensive testing of this series and its predecessors using PCIe 
switches from both Broadcom (PLX) and Microsemi. We have also done testing on 
x86_64, ARM64 and ppc64el based ARCH with varying degrees of success. The 
series as it currently stands only works on x86_64 but modified (hacky) 
versions have been made to work on ARM64. The x86_64 testing has been done on a 
range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least 
three vendors, NVMe SSDs from at least four vendors and P2P devices from four 
vendors) and PCI switches.

I do find it slightly offensive that you would question the series even 
working. I hope you are not suggesting we would submit this framework multiple 
times without having done testing on it...

Stephen



Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> > save an extra PCI transfer as the NVME card could just take the data
> > out of it's own memory. However, at this time, cards with CMB buffers
> > don't seem to be available.

> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?

Hi Sagi

Thanks for the review! That commit message is somewhat dated as NVMe 
controllers with CMBs that support RDS and WDS are now commercially available 
[1]. However we have not yet tried to do any kind of optimization around this 
in terms of determining which p2p_dev to use. Your suggestion above looks good 
and we can look into this kind of optimization in due course.

[1] http://www.eideticom.com/uploads/images/NoLoad_Product_Spec.pdf

>> +ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);

> This is the first p2p_dev found right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least used suitable
> device?

Yes, pci_p2pmem_find will always return the first valid p2p_dev found. At the 
very least we should update this to allocate across all the valid p2p_devs. 
Since the load on any given p2p_dev will vary over time I think a random 
allocation across the devices makes sense (at least for now). 
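
A minimal sketch of that random selection, assuming the search has already produced an array of equally suitable candidate devices (names are illustrative):

/* Sketch: choose one of 'n' suitable p2p providers at random so that
 * repeated allocations spread across devices rather than always
 * landing on the first match. */
static struct pci_dev *p2pmem_pick_random(struct pci_dev **candidates,
					  unsigned int n)
{
	if (!n)
		return NULL;
	return candidates[prandom_u32() % n];
}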

Stephen



Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests

2018-03-01 Thread Stephen Bates
> Any plans adding the capability to nvme-rdma? Should be
> straight-forward... In theory, the use-case would be rdma backend
> fabric behind. Shouldn't be hard to test either...

Nice idea Sagi. Yes we have been starting to look at that. Though again we 
would probably want to impose the "attached to the same PCIe switch" rule which 
might be less common to satisfy in initiator systems. 

Down the road I would also like to discuss the best way to use this P2P 
framework to facilitate copies between NVMe namespaces (on both PCIe and fabric 
attached namespaces) without having to expose the CMB up to user space. Wasn't 
something like that done in the SCSI world at some point Martin?

Stephen




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-25 Thread Stephen Bates

> My first reflex when reading this thread was to think that this whole domain
> lends it self excellently to testing via Qemu. Could it be that doing this in 
> the opposite direction might be a safer approach in the long run even though 
> (significant) more work up-front?

While the idea of QEMU for this work is attractive it will be a long time 
before QEMU is in a position to support this development. 

Another approach is to propose a common development platform for p2pmem work 
using a platform we know is going to work. This is an extreme version of the 
whitelisting approach that was discussed on this thread. We can list a very 
specific set of hardware (motherboard, PCIe end-points and (possibly) PCIe 
switch enclosure) that has been shown to work and that others can copy for 
their development purposes.

p2pmem.io perhaps ;-)?

Stephen




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-25 Thread Stephen Bates
>> Yes, that's why I used 'significant'. One good thing is that given resources 
>> it can easily be done in parallel with other development, and will give 
>> additional
>> insight of some form.
>
>Yup, well if someone wants to start working on an emulated RDMA device
>that actually simulates proper DMA transfers that would be great!

Given that each RDMA vendor’s devices expose a different MMIO interface, I 
don’t expect this to happen anytime soon.

> Yes, the nvme device in qemu has a CMB buffer which is a good choice to
> test with but we don't have code to use it for p2p transfers in the
>kernel so it is a bit awkward.

Note the CMB code is not in upstream QEMU, it’s in Keith’s fork [1]. I will see 
if I can push this upstream.

Stephen

[1] git://git.infradead.org/users/kbusch/qemu-nvme.git




Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-20 Thread Stephen Bates

> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

I would like to see mutually exclusive flags for host memory (or not) and 
persistence (or not).
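
Something along these lines (flag names invented for illustration), so a caller has to state both properties explicitly and the combinations stay unambiguous:

/* Illustrative only: two orthogonal properties of a dev_pagemap
 * region, each expressed as a mutually exclusive pair of flags. */
#define PGMAP_HOST_MEM		(1 << 0)  /* backed by system RAM */
#define PGMAP_DEVICE_MEM	(1 << 1)  /* backed by a device BAR */
#define PGMAP_PERSISTENT	(1 << 2)  /* contents survive power loss */
#define PGMAP_VOLATILE		(1 << 3)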

Stephen



Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-07 Thread Stephen Bates
On 2017-04-06, 6:33 AM, "Sagi Grimberg"  wrote:

> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

Sagi

As long as legA, legB and the RC are all connected to the same switch then 
ordering will be preserved (I think many other topologies also work). Here is 
how it would work for the problem case you are concerned about (which is a read 
from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe 
MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr 
TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch 
due to congestion but if so they are stalled in the egress path of the switch 
for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 
updates the memory associated with the CQ. It issues some operation to read the 
p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device 
connected to the switch it is queued in the egress queue for the p2pmem FIO 
behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd 
cannot overtake the MemWr (Reads can never pass writes). Therefore the MemRd 
can never get to the p2pmem device until after the last DMA MemWr has.
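
To restate the argument in the steps above as a toy model (purely illustrative userspace C, nothing to do with real PCIe code): the switch's egress path towards the p2pmem port behaves like a FIFO, and because a read request may never pass a posted write, the MemRd queued in step 5 cannot drain before the MemWrs from step 1.

#include <stdio.h>

/* Toy per-port egress FIFO: TLPs leave strictly in queue order, so a
 * MemRd enqueued after the DMA MemWrs reaches the p2pmem device only
 * after those writes have landed. */
enum tlp { MEMWR, MEMRD };

int main(void)
{
	enum tlp fifo[] = { MEMWR, MEMWR, MEMWR, MEMRD };
	unsigned int i, n = sizeof(fifo) / sizeof(fifo[0]);

	for (i = 0; i < n; i++)
		printf("egress slot %u: %s\n", i,
		       fifo[i] == MEMWR ? "MemWr (DMA data)"
					: "MemRd (host read-back)");
	return 0;
}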

I hope this helps!

Stephen




Re: Enabling peer to peer device transactions for PCIe devices

2017-01-11 Thread Stephen Bates
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>
>
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>
>
>> Make a generic API for all of this and you'd have my vote..
>>
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something upstream that enables P2P, even if it is only the minimum
viable useful functionality to begin with. I think aiming for the moon
(which is what HMM and things like it are doing) is simply going to take
more time, if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and, while it is not
an in-kernel API, it could have been modified to be one reasonably easily.
As Logan states, the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.
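
For the switch-only restriction, a hedged sketch of the check (walking each device's upstream bridges and looking for a shared one; a stricter version would also exclude the root port itself):

/* Sketch: true if 'a' and 'b' have a common upstream PCI bridge,
 * i.e. they sit below the same switch (or root port). */
static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;
	return false;
}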

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces
are the remit of the CPU vendors I think it behooves us to solve this
problem before then. Also some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119



Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Stephen Bates
>>> I've already recommended that iopmem not be a block device and
>>> instead be a device-dax instance. I also don't think it should claim
>>> the PCI ID, rather the driver that wants to map one of its bars this
>>> way can register the memory region with the device-dax core.
>>>
>>> I'm not sure there are enough device drivers that want to do this to
>>> have it be a generic /sys/.../resource_dmableX capability. It still
>>> seems to be an exotic one-off type of configuration.
>>
>>
>> Yes, this is essentially my thinking. Except I think the userspace
>> interface should really depend on the device itself. Device dax is a
>> good  choice for many and I agree the block device approach wouldn't be
>> ideal.

I tend to agree here. The block device interface has seen quite a bit of
resistance and /dev/dax looks like a better approach for most. We can look
at doing it that way in v2.

>>
>> Specifically for NVME CMB: I think it would make a lot of sense to just
>> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
>> buffers would be volatile and thus you wouldn't need to keep track of
>> where in the BAR the region came from. Thus, the mmap call would just be
>> an allocator from BAR memory. If device-dax were used, userspace would
>> need to lookup which device-dax instance corresponds to which nvme
>> drive.
>>
>
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> device-dax instance under the nvme device, or if you already have the nvme
> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>

Personally I think mapping the dax resource in the sysfs tree is a nice
way to do this and a bit more intuitive than mapping a /dev/nvmeX.




Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Stephen Bates
Hi All

This has been a great thread (thanks to Alex for kicking it off) and I
wanted to jump in and maybe try and put some summary around the
discussion. I also wanted to propose we include this as a topic for LFS/MM
because I think we need more discussion on the best way to add this
functionality to the kernel.

As far as I can tell the people looking for P2P support in the kernel fall
into two main camps:

1. Those who simply want to expose static BARs on PCIe devices that can be
used as the source/destination for DMAs from another PCIe device. This
group has no need for memory invalidation and are happy to use
physical/bus addresses and not virtual addresses.

2. Those who want to support devices that suffer from occasional memory
pressure and need to invalidate memory regions from time to time. This
camp also would like to use virtual addresses rather than physical ones to
allow for things like migration.

I am wondering if people agree with this assessment?

I think something like the iopmem patches Logan and I submitted recently
come close to addressing use case 1. There are some issues around
routability but based on feedback to date that does not seem to be a
show-stopper for an initial inclusion.

For use-case 2 it looks like there are several options and some of them
(like HMM) have been around for quite some time without gaining
acceptance. I think there needs to be more discussion on this usecase and
it could be some time before we get something upstreamable.

I for one, would really like to see use case 1 get addressed soon because
we have consumers for it coming soon in the form of CMBs for NVMe devices.

Long term I think Jason summed it up really well. CPU vendors will put
high-speed, open, switchable, coherent buses on their processors and all
these problems will vanish. But I ain't holding my breath for that to
happen ;-).

Cheers

Stephen


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-11-06 Thread Stephen Bates
On Tue, October 25, 2016 3:19 pm, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
>>
>> Dave are you saying that even for local mappings of files on a DAX
>> capable system it is possible for the mappings to move on you unless the
>> FS supports locking?
>>
>
> Yes.
>
>
>> Does that not mean DAX on such FS is
>> inherently broken?
>
> No. DAX is accessed through a virtual mapping layer that abstracts
> the physical location from userspace applications.
>
> Example: think copy-on-write overwrites. It occurs atomically from
> the perspective of userspace and starts by invalidating any current
> mappings userspace has of that physical location. The location is changes,
> the data copied in, and then when the locks are released userspace can
> fault in a new page table mapping on the next access

Dave

Thanks for the good input and for correcting some of my DAX
misconceptions! We will certainly be taking this into account as we
consider v1.

>
>>>> And at least for XFS we have such a mechanism :)  E.g. I have a
>>>> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
>>>> RDMA directly to XFS files, with the same locking mechanism we use
>>>> for the current block and scsi layout in xfs_pnfs.c.
>>
>> Thanks for fixing this issue on XFS Christoph! I assume this problem
>> continues to exist on the other DAX capable FS?
>
> Yes, but it they implement the exportfs API that supplies this
> capability, they'll be able to use pNFS, too.
>
>> One more reason to consider a move to /dev/dax I guess ;-)...
>>
>
> That doesn't get rid of the need for sane access control arbitration
> across all machines that are directly accessing the storage. That's the
> problem pNFS solves, regardless of whether your direct access target is a
> filesystem, a block device or object storage...

Fair point. I am still hoping for a bit more discussion on the best choice
of user-space interface for this work. If/When that happens we will take
it into account when we look at spinning the patchset.


Stephen



Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-25 Thread Stephen Bates
Hi Dave and Christoph

On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > > You do realise that local filesystems can silently change the
> > > location of file data at any point in time, so there is no such
> > > thing as a "stable mapping" of file data to block device addresses
> > > in userspace?
> > >
> > > If you want remote access to the blocks owned and controlled by a
> > > filesystem, then you need to use a filesystem with a remote locking
> > > mechanism to allow co-ordinated, coherent access to the data in
> > > those blocks. Anything else is just asking for ongoing, unfixable
> > > filesystem corruption or data leakage problems (i.e.  security
> > > issues).
> >

Dave, are you saying that even for local mappings of files on a DAX
capable system it is possible for the mappings to move on you unless
the FS supports locking? Does that not mean DAX on such an FS is
inherently broken?

> > And at least for XFS we have such a mechanism :)  E.g. I have a
> > prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> > RDMA directly to XFS files, with the same locking mechanism we use
> > for the current block and scsi layout in xfs_pnfs.c.
>

Thanks for fixing this issue on XFS, Christoph! I assume this problem
continues to exist on the other DAX capable filesystems?

One more reason to consider a move to /dev/dax I guess ;-)...

Stephen


> Oh, that's good to know - pNFS over XFS was exactly what I was
> thinking of when I wrote my earlier reply. A few months ago someone
> else was trying to use file mappings in userspace for direct remote
> client access on fabric connected devices. I told them "pNFS on XFS
> and write an efficient transport for your hardware"
>
> Now that I know we've got RDMA support for pNFS on XFS in the
> pipeline, I can just tell them "just write an rdma driver for your
> hardware" instead. :P
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-19 Thread Stephen Bates
> >>
> >> If you're only using the block-device as a entry-point to create
> >> dax-mappings then a device-dax (drivers/dax/) character-device might
> >> be a better fit.
> >>
> >
> > We chose a block device because we felt it was intuitive for users to
> > carve up a memory region by putting a DAX filesystem on it and creating
> > files on that DAX aware FS. It seemed like a convenient way to
> > partition up the region and to be easily able to get the DMA address
> > for the memory backing the device.
> >
> > That said I would be very keen to get other peoples thoughts on how
> > they would like to see this done. And I know some people have had some
> > reservations about using DAX mounted FS to do this in the past.
>
> I guess it depends on the expected size of these devices BARs, but I
> get the sense they may be smaller / more precious such that you
> wouldn't want to spend capacity on filesystem metadata? For the target
> use case is it assumed that these device BARs are always backed by
> non-volatile memory?  Otherwise this is a mkfs each boot for a
> volatile device.

Dan

Fair point, and this is a concern I share. We are not assuming that all
iopmem devices are backed by non-volatile memory, so the mkfs
recreation comment is valid. All in all, I think you are persuading us
to take a look at /dev/dax ;-). I will see if anyone else chips in
with their thoughts on this.
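For what it's worth, here is roughly what the /dev/dax path looks like from
userspace -- a sketch only, assuming the region is exported as /dev/dax0.0
and that a 2MB mapping granularity applies. Note there is no filesystem and
hence nothing to mkfs after a reboot:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2UL << 20;         /* one 2MB-aligned chunk */
          int fd = open("/dev/dax0.0", O_RDWR);
          void *p;

          if (fd < 0)
                  return 1;

          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          /* p points directly at the device memory; there is no page
           * cache and no on-device metadata to recreate at boot. */
          munmap(p, len);
          close(fd);
          return 0;
  }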

>
> >>
> >> > 2. Memory Segment Spacing. This patch has the same limitations that
> >> > ZONE_DEVICE does in that memory regions must be spaced at least
> >> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> >> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> >> > be usable on neighboring BARs. For our purposes, this is not an issue as
> >> > we'd only be looking at enabling a single BAR in a given PCIe device.
> >> > More exotic use cases may have problems with this.
> >>
> >> I'm working on patches for 4.10 to allow mixing multiple
> >> devm_memremap_pages() allocations within the same physical section.
> >> Hopefully this won't be a problem going forward.
> >>
> >
> > Thanks Dan. Your patches will help address the problem of how to
> > partition a /dev/dax device but they don't help the case where BARs
> > themselves are small, closely spaced and non-segment aligned. However
> > I think most people using iopmem will want to use reasonably large
> > BARs so I am not sure item 2 is that big of an issue.
>
> I think you might have misunderstood what I'm proposing.  The patches
> I'm working on are separate from a facility to carve up a /dev/dax
> device.  The effort is to allow devm_memremap_pages() to maintain
> several allocations within the same 128MB section.  I need this for
> persistent memory to handle platforms that mix pmem and system-ram in
> the same section.  I want to be able to map ZONE_DEVICE pages for a
> portion of a section and be able to remove portions of section that
> may collide with allocations of a different lifetime.

Oh I did misunderstand. This is very cool and would be useful to us.
One more reason to consider moving to /dev/dax in the next spin of
this patchset ;-).

Thanks

Stephen


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-19 Thread Stephen Bates
On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> [ adding Ashok and David for potential iommu comments ]
>

Hi Dan

Thanks for adding Ashok and David!

>
> I agree with the motivation and the need for a solution, but I have
> some questions about this implementation.
>
> >
> > Consumers
> > -
> >
> > We provide a PCIe device driver in an accompanying patch that can be
> > used to map any PCIe BAR into a DAX capable block device. For
> > non-persistent BARs this simply serves as an alternative to using
> > system memory bounce buffers. For persistent BARs this can serve as an
> > additional storage device in the system.
>
> Why block devices?  I wonder if iopmem was initially designed back
> when we were considering enabling DAX for raw block devices.  However,
> that support has since been ripped out / abandoned.  You currently
> need a filesystem on top of a block-device to get DAX operation.
> Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
> if all you want is a way to map the bar for another PCI-E device in
> the topology.
>
> If you're only using the block-device as a entry-point to create
> dax-mappings then a device-dax (drivers/dax/) character-device might
> be a better fit.
>

We chose a block device because we felt it was intuitive for users to
carve up a memory region by putting a DAX filesystem on it and creating
files on that DAX-aware FS. It seemed like a convenient way to
partition up the region and to easily get the DMA address
for the memory backing the device.

That said, I would be very keen to get other people's thoughts on how
they would like to see this done. And I know some people have had some
reservations about using a DAX-mounted FS to do this in the past.
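For anyone who has not followed the earlier postings, the workflow we had in
mind is roughly the following -- a sketch only; the device name /dev/iopmem0
and the mount point are made up, and the assumed setup commands appear as
comments:

  /* Assumed setup (shell), under a DAX capable filesystem:
   *   mkfs.xfs /dev/iopmem0
   *   mount -o dax /dev/iopmem0 /mnt/iopmem
   *   fallocate -l 1M /mnt/iopmem/buf
   */
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 1UL << 20;
          int fd = open("/mnt/iopmem/buf", O_RDWR);
          void *p;

          if (fd < 0)
                  return 1;

          /* With a DAX mount this mapping goes straight to the BAR memory
           * backing the file, so a file gives us a handle on a specific
           * region of the device. */
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          /* p (or the pages behind it) is what would ultimately be handed
           * to a peer PCIe device for p2p DMA. */
          munmap(p, len);
          close(fd);
          return 0;
  }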

>
> > 2. Memory Segment Spacing. This patch has the same limitations that
> > ZONE_DEVICE does in that memory regions must be spaced at least
> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> > be usable on neighboring BARs. For our purposes, this is not an issue as
> > we'd only be looking at enabling a single BAR in a given PCIe device.
> > More exotic use cases may have problems with this.
>
> I'm working on patches for 4.10 to allow mixing multiple
> devm_memremap_pages() allocations within the same physical section.
> Hopefully this won't be a problem going forward.
>

Thanks Dan. Your patches will help address the problem of how to
partition a /dev/dax device, but they don't help the case where BARs
themselves are small, closely spaced and non-segment aligned. However,
I think most people using iopmem will want to use reasonably large
BARs, so I am not sure item 2 is that big of an issue.

> I haven't yet grokked the motivation for this, but I'll go comment on
> that separately.

Thanks Dan!


Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-19 Thread Stephen Bates
On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <sba...@raithlin.com> wrote:
> > From: Logan Gunthorpe <log...@deltatee.com>
> >
> > We build on recent work that adds memory regions owned by a device
> > driver (ZONE_DEVICE) [1] and adds struct page support for these new
> > regions of memory [2].
> >
> > 1. Add an extra flags argument into devm_memremap_pages to take in a
> > MEMREMAP_XX argument. We update the existing calls to this function to
> > reflect the change.
> >
> > 2. For completeness, we add MEMREMAP_WT support to the memremap code;
> > however we have no actual need for this functionality.
> >
> > 3. We add the static functions add_zone_device_pages and
> > remove_zone_device_pages. These are similar to arch_add_memory except
> > they don't create the memory mapping. We don't believe these need to be
> > made arch specific, but are open to other opinions.
> >
> > 4. devm_memremap_pages and devm_memremap_pages_release are updated to
> > treat IO memory slightly differently. For IO memory we use a combination
> > of the appropriate io_remap function and the zone_device pages functions
> > created above. A flags variable and kaddr pointer are added to struct
> > page_mem to facilitate this for the release function. We also set up
> > the page attribute tables for the mapped region correctly based on the
> > desired mapping.
> >
>
> This description says "what" is being done, but not "why".

Hi Dan

We discuss the motivation in the cover letter.

>
> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> memory",  it mentions that the lack of I/O coherency is a known issue
> and users of this functionality need to be cognizant of the pitfalls.
> If that is the case why do we need support for different cpu mapping
> types than the default write-back cache setting?  It's up to the
> application to handle cpu cache flushing similar to what we require of
> device-dax users in the persistent memory case.

Some of the iopmem hardware we have tested has certain alignment
restrictions on BAR accesses. At the very least we require write
combine mappings for these. We then felt it appropriate to add the
other mappings for the sake of completeness.
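As a rough illustration (not the actual iopmem code), this is the sort of
call we end up wanting to make with the extended interface -- assuming the
new flags argument accepts the MEMREMAP_* cache-type values, with
MEMREMAP_WC for the hardware mentioned above; the function and parameter
names here are made up:

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/io.h>                /* MEMREMAP_WC */
  #include <linux/ioport.h>
  #include <linux/memremap.h>          /* extended devm_memremap_pages() */
  #include <linux/percpu-refcount.h>

  /* Sketch only: map an IO BAR write-combined and give it struct pages. */
  static int iopmem_map_sketch(struct device *dev, struct resource *bar_res,
                               struct percpu_ref *ref)
  {
          /* NULL altmap; MEMREMAP_WC assumes the extended interface takes
           * the MEMREMAP_* cache-type values. */
          void *kaddr = devm_memremap_pages(dev, bar_res, ref, NULL,
                                            MEMREMAP_WC);

          if (IS_ERR(kaddr))
                  return PTR_ERR(kaddr);

          /* kaddr is a write-combining kernel mapping of the BAR, and the
           * range now has struct pages, so it can back a block device and
           * act as a DMA target. */
          return 0;
  }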

Cheers

Stephen


[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-18 Thread Stephen Bates
From: Logan Gunthorpe <log...@deltatee.com>

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and adds struct page support for these new
regions of memory [2].

1. Add an extra flags argument into devm_memremap_pages to take in a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to the memremap code;
however we have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device pages functions
created above. A flags variable and kaddr pointer are added to struct
page_mem to facilitate this for the release function. We also set up
the page attribute tables for the mapped region correctly based on the
desired mapping.

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 drivers/dax/pmem.c|  4 +-
 drivers/nvdimm/pmem.c |  4 +-
 include/linux/memremap.h  |  5 ++-
 kernel/memremap.c | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;

-   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+   ARCH_MEMREMAP_PMEM);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-   altmap);
+   altmap, ARCH_MEMREMAP_PMEM);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
res->start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
addr = devm_memremap_pages(dev, &nsio->res,
-   &q->q_usage_counter, NULL);
+   &q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
pmem->pfn_flags |= PFN_MAP;
} else
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-   struct percpu_ref *ref, struct vmem_altmap *altmap);
+   struct percpu_ref *ref, struct vmem_altmap *altmap,
+   unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
-   struct vmem_altmap *altmap)
+   struct vmem_altmap *altmap, unsigned long flags)
 {
/*
 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+   PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
struct resource res;
struct percpu_ref *ref;
struct dev_pagemap pgmap;
struct vmem_altmap altmap;
+   void *kaddr;
+   int flags;
 };

+stat

[PATCH 3/3] iopmem : Add documentation for iopmem driver

2016-10-18 Thread Stephen Bates
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates <sba...@raithlin.com>
Signed-off-by: Logan Gunthorpe <log...@deltatee.com>
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
- notes and driver options for the floppy disk driver.
+iopmem.txt
+   - info on the iopmem block driver.
 mflash.txt
- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt 
b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver, although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-
+
+To include the iopmem module in your kernel please set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose for an iopmem block device is expected to be peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the following three things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+--
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4
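To make step 3 of the Usage section above a little more concrete, the sort of
exercise we run is below -- a sketch only: the paths (/mnt/iopmem/buf on a DAX
mount of the iopmem block device, /dev/nvme0n1 as the peer) are examples, and
whether the resulting DMA actually goes peer-to-peer depends on the kernel
support discussed in this series:

  /* Read 1MB from an NVMe namespace into iopmem BAR memory. The intent is
   * that, with O_DIRECT, the NVMe device DMAs straight into the pages
   * backing the DAX-mapped file, i.e. into the PCIe BAR. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 1UL << 20;
          int buf_fd = open("/mnt/iopmem/buf", O_RDWR);
          int nvme_fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
          void *p;

          if (buf_fd < 0 || nvme_fd < 0)
                  return 1;

          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, buf_fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          /* The pointer obtained from the DAX mapping is handed to the
           * peer device via a normal O_DIRECT read. */
          if (read(nvme_fd, p, len) != (ssize_t)len)
                  perror("read");

          munmap(p, len);
          close(nvme_fd);
          close(buf_fd);
          return 0;
  }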