[LSF/MM TOPIC] BPF for Block Devices

2019-02-07 Thread Stephen Bates
Hi All

> A BPF track will join the annual LSF/MM Summit this year! Please read the 
> updated description and CFP information below.

Well if we are adding BPF to LSF/MM I have to submit a request to discuss BPF 
for block devices please!

There has been quite a bit of activity around the concept of Computational 
Storage in the past 12 months. SNIA recently formed a Technical Working Group 
(TWG) and it is expected that this TWG will be making proposals to standards 
like NVM Express to add APIs for computation elements that reside on or near 
block devices.

While some of these Computational Storage accelerators will provide fixed 
functions (e.g. a RAID, encryption or compression), others will be more 
flexible. Some of these flexible accelerators will be capable of running BPF 
code on them (something that certain Linux drivers for SmartNICs support today 
[1]). I would like to discuss what such a framework could look like for the 
storage layer and the file-system layer. I'd like to discuss how devices could 
advertise this capability (a special type of NVMe namespace or SCSI LUN 
perhaps?) and how the BPF engine could be programmed and then used against 
block IO. Ideally I'd like to discuss doing this in a vendor-neutral way and 
develop ideas I can take back to NVMe and the SNIA TWG to help shape how these 
standards evolve.

To provide an example use-case, consider a BPF-capable accelerator being used to 
perform a filtering function: it uses p2pdma to scan data on a number of adjacent 
NVMe SSDs, filters said data, and then provides only the filter-matched LBAs to 
the host. Many other potential applications apply. 
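
To give a flavour of what the offloaded program itself might look like, below is a 
rough sketch of such a filter. To be clear, everything here is hypothetical: the 
section name, the context structure and the return-value convention are invented 
for illustration and are not part of any existing kernel or NVMe interface.

/*
 * Hypothetical sketch only: a BPF program an accelerator could run
 * against 4 KiB blocks, reporting back only blocks that start with a
 * magic value. Context layout and return convention are invented.
 */
#include <linux/types.h>

#define SEC(name) __attribute__((section(name), used))

struct blk_filter_ctx {                 /* invented context: one block */
        __u64 lba;                      /* LBA of this block */
        __u32 len;                      /* payload length in bytes */
        __u8  data[4096];               /* block payload */
};

SEC("block_filter")
int match_magic(struct blk_filter_ctx *ctx)
{
        /* Only filter-matched LBAs would cross the link to the host. */
        if (ctx->len >= 4 && ctx->data[0] == 0xCA && ctx->data[1] == 0xFE &&
            ctx->data[2] == 0xBA && ctx->data[3] == 0xBE)
                return 1;               /* match: report LBA to host */

        return 0;                       /* no match: filtered out */
}

char _license[] SEC("license") = "GPL";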

Also, I am interested in the "The end of the DAX Experiment" topic proposed by 
Dan and the "Zoned Block Devices" topic from Matias and Damien.

Cheers
 
Stephen

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/netronome/nfp/bpf/offload.c?h=v5.0-rc5
 




Re: How to force RC to forward p2p TLPs

2019-01-01 Thread Stephen Bates


Hi Yu

As I understand it you are asking about the case where the two PCIe End Points 
(EPs) are connected below the same PCIe switch, which is connected via its 
Upstream Port (USP) to a Root Port (RP) that is part of a Root Complex (RC). 
Do I understand this correctly?

The ECR Eric mentioned aims to implement a platform-neutral method for 
determining if P2P *between* RPs is viable in a system. However I think that is 
a bit different from your case (since your EPs are connected to the same RP via a 
PCIe switch).

What I *think* should be happening in your case when ACS is enabled is that the 
TLPs should be routed to the RP and from there to some translation agent (often 
an IOMMU). If the IOMMU has an entry for the addresses in the TLPs then it 
should attempt to do a translation and forward on the TLP. If the IOMMU does 
not have a translation (or does not exist) then the behaviour is probably 
ill-defined and very specific to the platform you are working on.

If the IOMMU does exist and does have a translation entry it still may not 
support routing the TLP back to the PCI hierarchy. This again would be platform 
specific.

I guess what I am saying is the same as Logan and Eric. Once TLPs leave the 
PCIe Root Port and enter the system (CPU or SOC) you leave the world of the 
PCI/PCIe specification and the behaviour of those TLPs will be dependent on 
the specifics of your platform.

Do note that if your platform and your EPs support ATS then I think the 
behaviour can be different again because then the P2P TLPs should be able to 
use a translated address and you can set up ACS to allow for translated 
addresses. However we have not had access to any ATS-enabled systems to work on 
this aspect of p2pdma yet. It's on the roadmap.

Cheers

Stephen




From: Eric Wehage 
Sent: December 31, 2018 11:20 PM
To: Bjorn Helgaas; yu
Cc: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; Logan Gunthorpe; 
Stephen Bates; Jonathan Cameron; Alexander Duyck
Subject: RE: How to force RC to forward p2p TLPs
  

There is no method to force an RC to forward requests p2p. RCs are also not 
required to support p2p, and whether they do or not is implementation-specific. 
There is also currently no PCIe-based standard to determine whether an RC 
supports p2p or in what capacity. Also, even if p2p is supported, it might not 
be supported in a system that enables Virtualization (the Virtualization would 
have to set up p2p tables to allow it).

There is currently a PCIe ECR that has been submitted, but is undergoing 
review, that defines a PCIe capability for bridge devices (Root Ports and 
Switch Downstream Ports) that would allow them to advertise what level of p2p 
is supported by the device.

Eric Wehage
Huawei Principal Engineer, PCIe

-Original Message-
From: Bjorn Helgaas [mailto:helg...@kernel.org]
Sent: Friday, December 28, 2018 6:29 PM
To: yu 
Cc: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; Logan Gunthorpe 
; Stephen Bates ; Jonathan Cameron 
; Eric Wehage ; Alexander 
Duyck 
Subject: Re: How to force RC to forward p2p TLPs

[+cc Logan, Stephen, Jonathan, Eric, Alex]

On Sat, Dec 22, 2018 at 12:50:19PM +0800, yu wrote:
> Hi all,
> 
> We have a PCIE card which has a PEX8732 switch on-board,  and there 
> are two endpoint SOCs like graphic decoder behind the switch, and by 
> default the ACS is enabled in 8732.
> 
> We use the p2p DMA to transfer data between these two endpoint SOCs, 
> and if the host server is not enable ACS in BIOS, the p2p works well, 
> but when ACS is enabled in BIOS, the p2p is always failed. With the 
> help of a protocol analyzer, we can see that the TLP is redirected to 
> RC, and RC just discard it.
> 
> I tried to find how to make RC forward redirected TLP to its original 
> target, but nothing found, it seems this is highly related to the RC 
> vendors.
> 
> In the PCIE 4.0 spec, the section of the RC behavior of the p2p 
> request redirect  said that ''implementation-specific logic within the 
> RC that determines whether the request is directed towards its 
> original target, or blocked as an ACS Violation error. the algorithms 
> and specific controls for making this determination are not 
> architected by this spec''.
> 
> 
> So is there some spec or document to describe how to set the RC? Any 
> suggestion is appreciated.

Not that I'm aware of, but the folks I cc'd would know a lot more about this 
area.

Bjorn
  

Re: lib/genalloc

2018-11-01 Thread Stephen Bates
>I use gen_pool_first_fit_align() as pool allocation algorithm allocating
>buffers with requested alignment. But if a chunk base address is not
>aligned to the requested alignment(from some reason), the returned
>address is not aligned too.

Alexey

Can you try using gen_pool_first_fit_order_align()? Will that give you the 
alignment you need?
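
Something along these lines is what I have in mind (untested sketch; I have not 
verified that it resolves the unaligned-chunk-base case you describe, hence the 
question above):

#include <linux/genalloc.h>
#include <linux/sizes.h>

/* Sketch: switch the pool to the order-aligned first-fit algorithm and
 * then allocate a 4 KiB buffer, aiming for 4 KiB alignment of the
 * returned address.
 */
static void *alloc_aligned_buf(struct gen_pool *pool)
{
        gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);

        return (void *)gen_pool_alloc(pool, SZ_4K);
}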

Stephen




Re: [PATCH 5/5] RISC-V: Implement sparsemem

2018-10-11 Thread Stephen Bates
Palmer

> I don't really know anything about this, but you're welcome to add a
>
>Reviewed-by: Palmer Dabbelt 

Thanks. I think it would be good to get someone who's familiar with linux/mm to 
take a look.

> if you think it'll help.  I'm assuming you're targeting a different tree for 
> the patch set, in which case it's probably best to keep this together with 
> the 
> rest of it.

No, I think this series should be pulled by the RISC-V maintainer. The other 
patches in this series just refactor some code and need to be ACK'ed by their 
arch maintainers, but I suspect the series should be pulled into RISC-V. That 
said, since it does touch other arches, should it be pulled via mm? 

BTW note that RISC-V SPARSEMEM support is pretty useful for all manner of 
things and not just the p2pdma discussed in the cover.

> Thanks for porting your stuff to RISC-V!

You bet ;-)

Stephen





Re: Linux RDMA mini-conf at Plumbers 2018

2018-10-01 Thread Stephen Bates
Hi Jason and Leon

> This year we expect to have close to a day set aside for RDMA related
> topics. Including up to half a day for the thorny general kernel issues
> related to get_user_pages(), particularly as exasperated by RDMA.

Looks like a great set of topics.

> RDMA and PCI peer to peer
 
I will happily be a leader/organizer for this section and can provide an
overview of p2pdma to get the ball rolling.

Cheers

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be 
appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
-----------------------
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.
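
For anyone looking for the corresponding code, the reservation described above 
boils down to something like the following (simplified sketch loosely modelled on 
what the Intel IOMMU driver does; IOVA_PFN() is a local helper here):

#include <linux/iova.h>
#include <linux/pci.h>
#include <linux/mm.h>

#define IOAPIC_RANGE_START      0xfee00000
#define IOAPIC_RANGE_END        0xfeefffff
#define IOVA_PFN(addr)          ((addr) >> PAGE_SHIFT)

/* Sketch: carve the interrupt range and a device's MMIO BARs out of the
 * IOVA space so the allocator never hands them out as DMA addresses.
 */
static void reserve_untranslated_ranges(struct iova_domain *iovad,
                                        struct pci_dev *pdev)
{
        int i;

        /* the interrupt range is never translated */
        reserve_iova(iovad, IOVA_PFN(IOAPIC_RANGE_START),
                     IOVA_PFN(IOAPIC_RANGE_END));

        /* PCI MMIO windows of this device */
        for (i = 0; i < PCI_NUM_RESOURCES; i++) {
                struct resource *r = &pdev->resource[i];

                if (!(r->flags & IORESOURCE_MEM) || !r->start)
                        continue;
                reserve_iova(iovad, IOVA_PFN(r->start), IOVA_PFN(r->end));
        }
}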

Cheers

Stephen


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
>I find this hard to believe. There's always the possibility that some 
>part of the system doesn't support ACS so if the PCI bus addresses and 
>IOVA overlap there's a good chance that P2P and ATS won't work at all on 
>some hardware.

I tend to agree but this comes down to how IOVA addresses are generated in the 
kernel. Alex (or anyone else) can you point to where IOVA addresses are 
generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI 
bus address directly, but other IO going to the same PCI EP may flow through the 
IOMMU and be programmed with an IOVA rather than a PCI bus address.

> I prefer 
>the option to disable the ACS bit on boot and let the existing code put 
>the devices into their own IOMMU group (as it should already do to 
>support hardware that doesn't have ACS support).

+1

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Hopes this helps understanding the big picture. I over simplify thing and
>devils is in the details.

This was a great primer thanks for putting it together. An LWN.net article 
perhaps ;-)??

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Note on GPU we do would not rely on ATS for peer to peer. Some part
>of the GPU (DMA engines) do not necessarily support ATS. Yet those
>are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU 
components most applicable to p2pdma.

>We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>for performance reasons ie we do not care having our transaction going
>to the root complex and back down the destination. At least in use case
>i am working on this is fine.

If the GPU people are the good guys, does that make the NVMe people the bad guys 
;-)? If so, what are the RDMA people??? Again, good to know.

>Reasons is that GPU are giving up on PCIe (see all specialize link like
>NVlink that are popping up in GPU space). So for fast GPU inter-connect
>we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it 
;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.

>Also the IOMMU isolation do matter a lot to us. Think someone using this
>peer to peer to gain control of a server in the cloud.

I agree that IOMMU isolation is very desirable. Hence the desire to ensure we 
can keep the IOMMU on while doing p2pdma if at all possible whilst still 
delivering the desired performance to the user.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address. 

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
initiator with an IOVA but with the PCI bus address...
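
i.e. something like this minimal, untested sketch using the allocator helpers 
from the series (names as I recall them from the patches under review):

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/errno.h>

/* Sketch: the address handed to the peer's DMA engine is the PCI bus
 * address of the p2pmem buffer, not an IOVA, so the switch can route
 * the resulting TLPs directly (ACS redirection permitting).
 */
static int p2p_program_peer(struct pci_dev *mem_dev, size_t len,
                            void (*program_dma)(pci_bus_addr_t addr,
                                                size_t len))
{
        void *buf = pci_alloc_p2pmem(mem_dev, len);

        if (!buf)
                return -ENOMEM;

        program_dma(pci_p2pmem_virt_to_bus(mem_dev, buf), len);

        pci_free_p2pmem(mem_dev, buf, len);
        return 0;
}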

> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
>  Redir bits are on they will be forced back to the RC by the switch and
>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>  where we want and everything will work (but we lose the isolation of ACS).

Agreed.

>For EPs that support ATS, we should (but don't necessarily have to)
>program them with the IOVA address so they can go through the
>translation process which will allow P2P without disabling the ACS Redir
>bits -- provided the ACS direct translation bit is set. (And btw, if it
>is, then we lose the benefit of ACS protecting against malicious EPs).
>But, per above, the ATS transaction should involve only the IOVA address
>so the ACS bits not being set should not break ATS.

Well, we would still have to clear some ACS bits, but now we can clear them only 
for translated addresses.

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>  user is the AMD GPU driver see:

Ah thanks. This cleared things up for me. A quick search shows there are still 
no users of intel_svm_bind_mm() but I see the AMD version used in that GPU 
driver.
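
For anyone else following along, the binding path looks roughly like this 
(condensed sketch based on my reading of the amd-iommu API; the error unwinding 
is omitted and the PASID limit of 16 is arbitrary):

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

/* Sketch: expose the current process's page tables to the device for a
 * given PASID so that ATS/PRI-translated DMA can target process memory.
 */
static int bind_process_to_device(struct pci_dev *pdev, int pasid)
{
        int ret;

        /* tell the AMD IOMMU driver how many PASIDs this device will use */
        ret = amd_iommu_init_device(pdev, 16);
        if (ret)
                return ret;

        /* bind this PASID to the current task's address space */
        return amd_iommu_bind_pasid(pdev, pasid, current);
}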

One thing I could not grok from the code is how the GPU driver indicates which DMA 
events require ATS translations and which do not. I am assuming the driver 
implements some way of indicating that and it's not just a global ON or OFF for 
all DMAs? The reason I ask is that I am looking at what would need to be added to 
the NVMe spec, above and beyond what we have in PCI ATS, if NVMe were to support 
efficient use of ATS (for example, would we need a flag in the submission queue 
entries to indicate that a particular IO's SGL/PRP should undergo ATS).

Cheers

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID 
>together with ATS to identify the address space which a transaction 
>should use.

I think you are conflating two types of TLPs here. If the device supports ATS 
then it will issue a TR TLP to obtain a translated address from the IOMMU. This 
TR TLP will be addressed to the RP and so, regardless of ACS, it is going up to 
the Root Port. When the response comes back the device has the physical address 
and can use that with the TA bit set for the p2pdma. In the case of ATS support 
we also have more control over ACS as we can disable it just for TA addresses 
(as per 7.7.7.2 of the spec).

 >   If I'm not completely mistaken when you disable ACS it is perfectly 
 >   possible that a bridge identifies a transaction as belonging to a peer 
 >   address, which isn't what we want here.
   
You are right here and I think this illustrates a problem for using the IOMMU 
at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS then I think 
we have to disable the IOMMU (something Mike suggested earlier). The reason is 
that since ATS is not an option the EP must initiate the DMA using the 
addresses passed down to it. If the IOMMU is on then this is an IOVA that could 
(with some non-zero probability) point to an IO Memory address in the same PCI 
domain. So if we disable ACS we are in trouble as we might MemWr to the wrong 
place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the 
IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping 
issues.

So I think if we want to support performant P2PDMA for devices that don't have 
ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I 
know this is problematic for AMD's use case so perhaps we also need to consider 
a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU 
(but in this case EPs without ATS cannot participate as P2PDMA DMA initiators).
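
A policy like that could be expressed pretty simply; a rough sketch (not from the 
series, just to illustrate the rule "with the IOMMU on, only ATS-capable EPs may 
be p2pdma initiators"):

#include <linux/pci.h>
#include <linux/pci-ats.h>
#include <linux/iommu.h>
#include <linux/mm.h>

/* Sketch: admit a device as a p2pdma DMA initiator only if the IOMMU is
 * off (plain PCI bus addresses) or the device can issue pre-translated
 * requests via ATS.
 */
static bool p2pdma_initiator_allowed(struct pci_dev *pdev)
{
        if (!iommu_present(pdev->dev.bus))
                return true;

        /* PAGE_SHIFT: smallest translation unit the EP may cache */
        return pci_enable_ats(pdev, PAGE_SHIFT) == 0;
}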

Make sense?

Stephen
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.

Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to 
it) or is this all done via user-space? Based on my grepping of the kernel code 
I see zero EP drivers using in-kernel ATS functionality right now...

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Christian

>Interesting point, give me a moment to check that. That finally makes 
>all the hardware I have standing around here valuable :)

Yes. At the very least it provides an initial standards based path for P2P DMAs 
across RPs which is something we have discussed on this list in the past as 
being desirable.

BTW I am trying to understand how an ATS-capable EP function determines when to 
perform an ATS Translation Request (ATS TR). Is there an upstream example of 
the driver for your APU that uses ATS? If so, can you provide a pointer to it? 
Do you provide some type of entry in the submission queues for commands going 
to the APU to indicate if the address associated with a specific command should 
be translated using ATS or not? Or do you simply enable ATS and then all 
addresses passed to your APU that miss the local cache result in an ATS TR?

Your feedback would be useful as I initiate discussions within the NVMe 
community on where we might go with ATS...

Thanks

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Jerome and Christian

> I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an 
ATS-translated TLP is still impacted by ACS, though there is a separate control 
knob for translated-address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if 
your device supports ATS, a P2P DMA will still be routed to the associated RP of 
the domain and down again unless we disable ACS DT P2P on all bridges between 
the two devices involved in the P2P DMA. 

So we still don't get fine-grained control with ATS, and I guess we still have 
security issues because a rogue or malfunctioning EP could just as easily issue 
TLPs with TA set vs not set.
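
For reference, the knob in question is the Direct Translated P2P bit in the ACS 
control register; something like the below (sketch only) would leave redirection 
in place for untranslated requests while letting TA=1 requests go peer-to-peer:

#include <linux/pci.h>
#include <linux/errno.h>

/* Sketch: keep ACS source validation and redirection as-is, but set
 * Direct Translated P2P so requests carrying a translated (TA=1)
 * address bypass the redirect and go straight to the peer.
 */
static int acs_allow_translated_p2p(struct pci_dev *bridge)
{
        int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;

        if (!pos)
                return -ENODEV;

        pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
        ctrl |= PCI_ACS_DT;
        pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);

        return 0;
}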

> Also ATS is meaningless without something like PASID as far as i know.

ATS is still somewhat valuable without PASID in the sense that you can cache IOMMU 
address translations at the EP. This saves hammering on the IOMMU as much in 
certain workloads.

Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS 
AND can implement P2P between Root Ports should advertise the "ACS Direct 
Translated P2P (T)" capability. This ties into the discussion around P2P 
between Root Ports we had a few weeks ago...

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Don

>RDMA VFs lend themselves to NVMEoF w/device-assignment need a way to
>put NVME 'resources' into an assignable/manageable object for 
> 'IOMMU-grouping',
>which is really a 'DMA security domain' and less an 'IOMMU grouping 
> domain'.

Ha, I like your term "DMA Security Domain", which sounds about right for what we 
are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, 
in some ways, too big of a hammer for what we want here in the sense that it is 
either on or off for the bridge or MF EP we enable/disable it for. ACS can't 
filter the TLPs by address or ID, though PCI-SIG are having some discussions on 
extending ACS. That's a long-term solution and won't be applicable to us for 
some time.

NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe 
SSDs will support SR-IOV. That will probably be a pretty high-end feature...

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Logan

>Yeah, I'm having a hard time coming up with an easy enough solution for
>the user. I agree with Dan though, the bus renumbering risk would be
>fairly low in the custom hardware seeing the switches are likely going
>to be directly soldered to the same board with the CPU.

I am afraid that the soldered-down assumption may not be valid. More and more PCIe 
cards with PCIe switches on them are becoming available, and people are using 
these to connect servers to arrays of NVMe SSDs, which may make the topology 
more dynamic.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Stephen Bates
Hi Alex and Don

>Correct, the VM has no concept of the host's IOMMU groups, only the
>   hypervisor knows about the groups, 

But as I understand it these groups are usually passed through to VMs on a 
per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and 
IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings 
but it is impacted by them in the sense that if the groupings change the PCI 
topology presented to the VM needs to change too.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
>Yeah, so based on the discussion I'm leaning toward just having a
>command line option that takes a list of BDFs and disables ACS for them.
>(Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.
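
Just so we are all picturing the same thing, per listed device such an option 
would boil down to something like this (illustrative sketch; the parameter itself 
and its parsing are made up, the ACS bits are the standard ones):

#include <linux/pci.h>
#include <linux/errno.h>

/* Sketch: for one "domain:bus:dev.fn" entry from a hypothetical
 * "disable_acs_redir=" command line list, clear the ACS bits that
 * force peer-to-peer requests up to the root complex.
 */
static int disable_acs_redir_for_bdf(int domain, unsigned int bus,
                                     unsigned int devfn)
{
        struct pci_dev *dev = pci_get_domain_bus_and_slot(domain, bus, devfn);
        u16 ctrl;
        int pos;

        if (!dev)
                return -ENODEV;

        pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
        if (pos) {
                pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
                /* clear request/completion redirect and egress control */
                ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
                pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
        }

        pci_dev_put(dev);
        return pos ? 0 : -ENOTTY;
}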

@Alex - Before we go do this can you provide input on the approach? I don't 
want to re-spin only to find we are still not converging on the ACS issue.

Thanks

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Alex

>But it would be a much easier proposal to disable ACS when the IOMMU is
>not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups for all 
Root Ports and not just the one(s) we wish to do p2pdma on. 

>The IOMMU and P2P are already not exclusive, we can bounce off the
>IOMMU or make use of ATS as we've previously discussed.  We were
>previously talking about a build time config option that you didn't
>expect distros to use, so I don't think intervention for the user to
>disable the IOMMU if it's enabled by default is a serious concern
>either.

ATS definitely makes things more interesting for the cases where the EPs 
support it. However I don't really have a handle on how common ATS support is 
going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA 
NICs mostly). 

> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 

As above the advantage of leaving the IOMMU on is that it allows for both 
p2pdma PCI domains and IOMMU-grouping PCI domains in the same system. It is 
just that these domains will be separate from each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p will 
probably not require a bounce buffer and will support 64-bit DMA addressing.
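
(For completeness, that is just the usual driver-side assertion; sketch only:)

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Sketch: a p2pdma-friendly endpoint driver asserts full 64-bit DMA
 * addressing so no bounce buffering is needed for it.
 */
static int claim_64bit_dma(struct pci_dev *pdev)
{
        return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
}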

> I'm still not seeing why it's terribly undesirable to require devices to 
> support
> ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is the use-case above: allowing IOMMU groupings on 
one domain and p2pdma on another domain.

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Jerome

>I think there is confusion here, Alex properly explained the scheme
>   PCIE-device do a ATS request to the IOMMU which returns a valid
>translation for a virtual address. Device can then use that address
>directly without going through IOMMU for translation.

This makes sense and to be honest I now understand ATS and its interaction with 
ACS a lot better than I did 24 hours ago ;-).

>ATS is implemented by the IOMMU not by the device (well device implement
>the client side of it). Also ATS is meaningless without something like
>PASID as far as i know.

I think it's the client side that is important to us. Not many EPs 
support ATS today and it's not clear if many will in the future. So assuming 
we want to do p2pdma between devices (some of) which do NOT support ATS, how 
best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me 
given this impacts all the PCI domains in the system and not just the domain we 
wish to do P2P on.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Don

>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two 
>devices.
>That agent should 'request' to the kernel that ACS be removed/circumvented 
> (p2p enabled) btwn two endpoints.
>I recommend doing so via a sysfs method.

Yes we looked at something like this in the past but it does hit the IOMMU 
grouping issue I discussed earlier today which is not acceptable right now. In 
the long term, once we get IOMMU grouping callbacks to VMs we can look at 
extending p2pdma in this way. But I don't think this is viable for the initial 
series. 
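
For the record, the kind of sysfs knob being described might look like the below. 
This is entirely hypothetical (attribute name, placement and semantics are 
invented) and is shown only so the IOMMU-grouping implications can be discussed 
concretely:

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/kernel.h>
#include <linux/errno.h>

/* Hypothetical "p2p_enable" attribute on a switch port: writing 1 would
 * ask the kernel to drop ACS redirection below this bridge (and, today,
 * implicitly merge the IOMMU groups underneath it).
 */
static ssize_t p2p_enable_store(struct device *dev,
                                struct device_attribute *attr,
                                const char *buf, size_t count)
{
        struct pci_dev *bridge = to_pci_dev(dev);
        bool enable;

        if (kstrtobool(buf, &enable))
                return -EINVAL;

        /* ... clear or restore ACS P2P redirect bits on 'bridge' ... */
        (void)bridge;
        (void)enable;

        return count;
}
static DEVICE_ATTR_WO(p2p_enable);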


>So I don't understand the comments why VMs should need to know.

As I understand it VMs need to know because VFIO passes IOMMU grouping up into 
the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology 
changes. I think we even have to be cognizant of the fact the OS running on the 
VM may not even support hot-plug of PCI devices.

> Is there a thread I need to read up to explain /clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for 
the input!

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Dan

>It seems unwieldy that this is a compile time option and not a runtime
>option. Can't we have a kernel command line option to opt-in to this
>behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we 
wanted to make it very clear people were choosing one (p2pdma) or the other 
(IOMMU groupings and isolation). However personally I would prefer including 
the option of a run-time kernel parameter too. In fact a few months ago I 
proposed a small patch that did just that [1]. It never really went anywhere 
but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is 
associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address for either a RP, DSP, USP or MF EP below which we 
disable ACS? We could do that but I don't think it avoids the issue of changes 
in IOMMU groupings as devices are added/removed. It simply changes the problem 
from affecting an entire PCI domain to a sub-set of the domain. We can already 
handle this by doing p2pdma on one RP and normal IOMMU isolation on the other 
RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates

Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all 
DMAs will now get routed up to the IOMMU before being passed down to the 
destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA 
will be routed to the IOMMU which removes a lot of the benefit. 

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a 
reasonably lengthy discussion on the mailing lists. Alex, are you still 
comfortable with this approach?

> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous 
revisions. The issue is that currently there is no mechanism in the IOMMU code 
to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change 
its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS 
settings could change. Since there is no way to currently handle changing ACS 
settings and hence IOMMU groupings the consensus was to simply disable ACS on 
all ports in a p2pdma domain. This effectively makes all the devices in the 
p2pdma domain part of the same IOMMU grouping. The plan will be to address this 
in time and add a mechanism for IOMMU grouping changes and notification to VMs 
but that's not part of this series. Note you are still allowed to have ACS 
functioning on other PCI domains, so if you do need a plurality of IOMMU 
groupings you can still achieve it (but you can't do p2pdma across IOMMU 
groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least 
> with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the 
ports between the two peer devices to get the p2p benefit? If not, you are not 
getting all the performance benefit (due to IOMMU routing); if you are, then 
there are obviously security implications between those IOMMU domains if they 
are assigned to different VMs. And now the issue is that if new devices are added 
and the p2p topology needs to change, there would be no way to inform the VMs 
of any IOMMU group change. 

Cheers

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-04-13 Thread Stephen Bates
 
>  I'll see if I can get our PCI SIG people to follow this through 

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track 
it. I can see this being doable between Root Ports that reside in the same Root 
Complex but might become more challenging to standardize for RPs that reside in 
different RCs in the same (potentially multi-socket) system. I know in the past 
we have seen MemWr TLPs cross the QPI bus in Intel systems but I am sure that 
is not something that would work in all systems and must fall outside the remit 
of PCI-SIG ;-).

I agree such a capability bit would be very useful but it's going to be quite 
some time before we can rely on hardware being available that supports it.

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-24 Thread Stephen Bates
> That would be very nice but many devices do not support the internal
> route. 

But Logan in the NVMe case we are discussing movement within a single function 
(i.e. from a NVMe namespace to a NVMe CMB on the same function). Bjorn is 
discussing movement between two functions (PFs or VFs) in the same PCIe EP. In 
the case of multi-function endpoints I think the standard requires those 
devices to support internal DMAs for transfers between those functions (but 
does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs, 
but in the case of NVMe we do have a protocol-specific way for an NVMe 
function to indicate that it supports this via the CMB BAR. Other protocols 
may also have such methods but I am not aware of them at this time.

2. For multi-function end-points I think it is a requirement that DMAs 
*between* functions are supported via an internal path but this can be 
over-ridden by ACS when supported in the EP.

3. For multi-function end-points there is no requirement to support internal 
DMA within each individual function (i.e. a la point 1 but extended to each 
function in a MF device). 

Based on my review of the specification I concur with Bjorn that p2pdma 
between functions in a MF end-point should be guaranteed by the standard. 
However, if the p2pdma involves only a single function in a MF device then we 
can only support NVMe CMBs for now. Let's review and see what the options are 
for supporting this in the next respin.
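
To illustrate the protocol-specific signalling mentioned in point 1, a driver 
can look at the CMBSZ register to see whether the controller's CMB may carry 
data for read and write commands. A rough sketch (register and bit names are 
from include/linux/nvme.h; "bar" is assumed to be the mapped controller 
register space):

#include <linux/io.h>
#include <linux/nvme.h>

/* Sketch: does this controller's CMB advertise Write/Read Data Support? */
static bool cmb_supports_data(void __iomem *bar)
{
        u32 cmbsz = readl(bar + NVME_REG_CMBSZ);

        /* WDS and RDS: the CMB may hold data for write and read commands */
        return (cmbsz & NVME_CMBSZ_WDS) && (cmbsz & NVME_CMBSZ_RDS);
}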

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-22 Thread Stephen Bates
>  I've seen the response that peers directly below a Root Port could not
> DMA to each other through the Root Port because of the "route to self"
> issue, and I'm not disputing that.  

Bjorn 

You asked me for a reference to RTS in the PCIe specification. As luck would 
have it I ended up in an Irish bar with Peter Onufryk this week at OCP Summit. 
We discussed the topic. It is not explicitly referred to as "Route to Self", 
and it is certainly not obvious, but r6.2.8.1 of the PCIe 4.0 specification 
discusses error conditions for virtual PCI bridges. One of these conditions 
(given in the very first bullet in that section) applies to a request that is 
destined for the same port it came in on. When this occurs the request must be 
terminated as a UR.

Stephen




Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Stephen Bates
> P2P over PCI/PCI-X is quite common in devices like raid controllers.

Hi Dan 

Do you mean between PCIe devices below the RAID controller? Isn't it pretty 
novel to be able to support PCIe EPs below a RAID controller (as opposed to 
SCSI based devices)?

> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.
   
Agreed. I think this would be great. Kind of like the XCOPY framework that was 
proposed a while back for SCSI devices [1] but updated to also include NVMe 
devices. That is definitely a use case we would like this framework to support.

Stephen
 
[1] https://lwn.net/Articles/592094/



Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Stephen Bates
>I assume you want to exclude Root Ports because of multi-function
>  devices and the "route to self" error.  I was hoping for a reference
>  to that so I could learn more about it.

Apologies Bjorn. This slipped through my net. I will try and get you a 
reference for RTS in the next couple of days.

> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
> Functions in SR-IOV Capable and Multi-Function Devices", which seems
> relevant.  It talks about "peer-to-peer Requests (between Functions of
> the device)".  That says to me that multi-function devices can DMA 
> between themselves.

I will go take a look. Appreciate the link.

Stephen 



Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-13 Thread Stephen Bates
Hi Sinan

>If hardware doesn't support it, blacklisting should have been the right
>path and I still think that you should remove all switch business from the 
> code.
>I did not hear enough justification for having a switch requirement
>for P2P.
  
We disagree. As does the community based on feedback on previous version of 
this patch set.
  
> You are also saying that root ports have issues not because of functionality 
> but
> because of performance. 

I have seen both. Some systems fail entirely to route TLPs across Root Ports. 
Others do route but with poor performance. Having a requirement for a switch 
is therefore a reasonable and sensible minimum criterion for enabling p2p.
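
For illustration, the "behind the same switch" criterion can be thought of as 
the check sketched below (a simplification that ignores cascaded switches; the 
series itself does more than this):

#include <linux/pci.h>

/*
 * Sketch: are both end points hanging off downstream ports of the same
 * switch upstream port? Ignores cascaded switches for simplicity.
 */
static bool same_switch(struct pci_dev *a, struct pci_dev *b)
{
        struct pci_dev *up_a = pci_upstream_bridge(a);
        struct pci_dev *up_b = pci_upstream_bridge(b);

        if (!up_a || !up_b || !pci_is_pcie(up_a) || !pci_is_pcie(up_b))
                return false;

        /* Both parents must be switch downstream ports ... */
        if (pci_pcie_type(up_a) != PCI_EXP_TYPE_DOWNSTREAM ||
            pci_pcie_type(up_b) != PCI_EXP_TYPE_DOWNSTREAM)
                return false;

        /* ... that share the same switch upstream port. */
        return pci_upstream_bridge(up_a) == pci_upstream_bridge(up_b);
}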

> If you know what is bad, create a list and keep adding it. You can't assume
> universally that all root ports are broken/ have bad performance.

I actually wanted to do a blacklist originally but that idea was not accepted 
by the community. 

>What if I come up with a very cheap/crappy switch (something like used in 
> data
>mining)?
> What guarantees that P2P will work with this device? You are making an
> assumption here that all switches have good performance.

A switch must support P2P or it is in violation of the PCIe specification.

>I have been doing my best to provide feedback. It feels like you are 
> throwing
>them over the wall to be honest.

I think one issue Sinan is that a lot of your ideas have been discussed in 
previous versions of the series. So you are asking us to make changes which, in 
some cases, we know are not acceptable to others in the community. The input is 
welcome but it flies in the face of other input we have received. 

However if other developers feel strongly about the blacklist/whitelist and 
switch criteria please speak up! ;-)

Stephen





Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-13 Thread Stephen Bates
>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
>  No, that's the way the community has pushed this work

Hi Sinan

Thanks for all the input. As Logan has pointed out the switch requirement is 
something that has evolved over time based on input from the community. You are 
more than welcome to have an opinion on this (and you have made that opinion 
clear ;-)). Over time the patchset may evolve from its current requirements but 
right now we are aligned with the feedback from the community.

Cheers

Stephen
 



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-05 Thread Stephen Bates
>Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with 
what you sent out?

>  I am about to send RFC for nouveau, i am still working out some bugs.

Great. I will keep an eye out for it. An example user of hmm will be very 
helpful.

> i will fix the MAINTAINERS as part of those.

Awesome, thanks.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Stephen Bates
> It seems people miss-understand HMM :( 

Hi Jerome

Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. 
Along the way I came up with a couple of things.

While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and 
DEVICE_PUBLIC. It also gives no indication when one might choose to use one 
over the other. Would it be possible to update hmm.txt to include some 
discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the 
kernel's linear address space for the device memory and DEVICE_PRIVATE does 
not. However, like I said, I am not sure when you would use either one and the 
pros and cons of doing so. I actually ended up finding some useful information 
in memremap.h but I don't think it is fair to expect people to dig *that* deep 
to find this information ;-).

A quick grep shows no drivers using the HMM API in the upstream code today. 
Is this correct? Are there any examples of out-of-tree drivers that use HMM 
you can point me to? As a driver developer, what resources exist to help me 
write an HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. 
You might want to fix that when you have a moment.

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-02 Thread Stephen Bates
>http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip

@Keith - my apologies.

@Christoph - thanks for the link

So my understanding of when the technical content surrounding new NVMe 
Technical Proposals (TPs) can be discussed was wrong. I thought the TP content 
could only be discussed once disclosed in the public standard. I have now 
learnt that once the TPs are ratified they are publicly available!

However, as Logan pointed out, PMRs are not relevant to this series so let's 
defer discussion on how to support them to a later date!

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> We don't want to lump these all together without knowing which region you're 
> allocating from, right?

In all seriousness I do agree with you on these, Keith, in the long term. We 
would consider adding property flags for the memory as it is added to the p2p 
core, and then the allocator could evolve to intelligently dish it out. 
Properties like endurance, latency and special write-commit requirements could 
all become attributes in time. Perhaps that is one more reason for a central 
entity for p2p memory allocation, so this code does not end up having to go 
into many different drivers?
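
Purely as a hypothetical illustration (none of these names exist in the series 
or the kernel), such properties might eventually look like flags a donor 
driver passes when it registers its p2p memory with the core:

/* Hypothetical attribute flags for registered p2p memory (illustration only) */
enum p2pmem_attr {
        P2PMEM_ATTR_LOW_LATENCY         = 1 << 0,       /* e.g. on-die SRAM CMB */
        P2PMEM_ATTR_HIGH_ENDURANCE      = 1 << 1,       /* safe for write-heavy use */
        P2PMEM_ATTR_WRITE_COMMIT        = 1 << 2,       /* needs an explicit commit step */
};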

Stephen




Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> There's a meaningful difference between writing to an NVMe CMB vs PMR

When the PMR spec becomes public we can discuss how best to integrate it into 
the P2P framework (if at all) ;-).

Stephen





Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
>   No, locality matters. If you have a bunch of NICs and bunch of drives
>   and the allocator chooses to put all P2P memory on a single drive your
>   performance will suck horribly even if all the traffic is offloaded.

Sagi brought this up earlier in his comments about the _find_ function. We are 
planning to do something about this in the next version. This might be a 
randomization or a "user-pick" and include a rule around using the p2p_dev on 
the EP if that EP is part of the transaction.

Stephen






Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory

2018-03-01 Thread Stephen Bates
> I'm pretty sure the spec disallows routing-to-self so doing a P2P 
> transaction in that sense isn't going to work unless the device 
> specifically supports it and intercepts the traffic before it gets to 
> the port.

This is correct. Unless the device intercepts the TLP before it hits the 
root-port then this would be considered a "route to self" violation and an 
error event would occur. The same holds for the downstream port on a PCI switch 
(unless route-to-self violations are disabled which violates the spec but which 
I have seen done in certain applications).

Stephen






Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe only interface.

> It just seems like it is making it too complicated.

I disagree. Having a common allocator (instead of some separate allocator per 
driver) makes things simpler.

> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?

P2P is about offloading the memory and PCI subsystem of the host CPU and this 
is achieved no matter which p2p_dev is used.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
> The intention of HMM is to be useful for all device memory that wish
> to have struct page for various reasons.

Hi Jerome and thanks for your input! Understood. We have looked at HMM in the 
past and long term I definitely would like to consider how we can add P2P 
functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass 
addressable and non-addressable blocks of data between devices. However that is 
well beyond the intentions of this series ;-).

Stephen




Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Stephen Bates
> your kernel provider needs to decide whether they favor device assignment or 
> p2p

Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) 
are such that we really only expect CONFIG_P2P_DMA to be enabled in specific 
instances and in those instances the users have made a decision to favor P2P 
over IOMMU isolation. Or they have setup their PCIe topology in a way that 
gives them IOMMU isolation where they want it and P2P where they want it.

Stephen
 



Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates

> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p trasnaction for a first pass..

I disagree. There is definitely interest in using an NVMe CMB as a bounce 
buffer and in deploying systems where only some of the NVMe SSDs below a 
switch have a CMB but use P2P to access all of them. Also there are some 
devices that only expose memory and whose entire purpose is to act as a p2p 
device; supporting these devices would be valuable.

> locality is super important for p2p, so I don't think things should
>  start out in a way that makes specifying the desired locality hard.

Ensuring that the EPs engaged in p2p are all directly connected to the same 
PCIe switch ensures locality and (for the switches we have tested) 
performance. I agree solving the case where the namespace and CMB are on the 
same PCIe EP is valuable, but I don't see it as critical to initial acceptance 
of the series.

Stephen




Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Stephen Bates
Thanks for the detailed review Bjorn!

>>  
>> +  Enabling this option will also disable ACS on all ports behind
>> +  any PCIe switch. This effectively puts all devices behind any
>> +  switch into the same IOMMU group.

>
>  Does this really mean "all devices behind the same Root Port"?

Not necessarily. You might have a cascade of switches (i.e. switches below a 
switch) to achieve a very large fan-out (in an NVMe SSD array, for example), 
and we will only disable ACS on the ports below the relevant switch.

> What does this mean in terms of device security?  I assume it means,
> at least, that individual devices can't be assigned to separate VMs.

This was discussed during v1 [1]. Disabling ACS on all downstream ports of 
the switch means that all the EPs below it have to be part of the same IOMMU 
grouping. However it was also agreed that as long as the ACS disable occurs at 
boot time (which it does in v2) then the virtualization layer will be aware of 
it and will perform the IOMMU group formation correctly.

> I don't mind admitting that this patch makes me pretty nervous, and I
> don't have a clear idea of what the implications of this are, or how
> to communicate those to end users.  "The same IOMMU group" is a pretty
> abstract idea.

Alex gave a good overview of the implications in [1].

Stephen 

[1] https://marc.info/?l=linux-pci=151512320031739=2



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
>> So Oliver (CC) was having issues getting any of that to work for us.
>> 
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>> 
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cacheable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>   
>   Note: I think the above means it won't work behind a switch on x86
>   either, will it ?
 
Ben 

We have done extensive testing of this series and its predecessors using PCIe 
switches from both Broadcom (PLX) and Microsemi. We have also done testing on 
x86_64, ARM64 and ppc64el based architectures with varying degrees of success. 
The series as it currently stands only works on x86_64, but modified (hacky) 
versions have been made to work on ARM64. The x86_64 testing has been done on 
a range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least 
three vendors, NVMe SSDs from at least four vendors and P2P devices from four 
vendors) and PCI switches.

I do find it slightly offensive that you would question the series even 
working. I hope you are not suggesting we would submit this framework multiple 
times without having done testing on it.

Stephen



Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Stephen Bates
> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> > save an extra PCI transfer as the NVME card could just take the data
> > out of it's own memory. However, at this time, cards with CMB buffers
> > don't seem to be available.

> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?

Hi Sagi

Thanks for the review! That commit message is somewhat dated, as NVMe 
controllers with CMBs that support RDS and WDS are now commercially available 
[1]. However we have not yet tried to do any kind of optimization in terms of 
determining which p2p_dev to use. Your suggestion above looks good and we can 
look into this kind of optimization in due course.

[1] http://www.eideticom.com/uploads/images/NoLoad_Product_Spec.pdf

>> +ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);

> This is the first p2p_dev found right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least used suitable
> device?

Yes, pci_p2pmem_find will always return the first valid p2p_dev found. At the 
very least we should update this to allocate over all the valid p2p_devs. 
Since the load on any given p2p_dev will vary over time I think a random 
allocation across the devices makes sense (at least for now). 
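
As a rough sketch of that idea (the helper and its calling convention below 
are made up purely for illustration), a random pick over the valid candidates 
could look like:

#include <linux/pci.h>
#include <linux/random.h>

/*
 * Sketch: pick one of several p2p-capable donor devices at random so the
 * load is spread rather than always landing on the first device found.
 */
static struct pci_dev *pick_p2p_dev(struct pci_dev **candidates, int n)
{
        if (n <= 0)
                return NULL;

        return candidates[get_random_u32() % n];
}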

Stephen



Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests

2018-03-01 Thread Stephen Bates
> Any plans adding the capability to nvme-rdma? Should be
> straight-forward... In theory, the use-case would be rdma backend
> fabric behind. Shouldn't be hard to test either...

Nice idea Sagi. Yes we have been starting to look at that. Though again we 
would probably want to impose the "attached to the same PCIe switch" rule which 
might be less common to satisfy in initiator systems. 

Down the road I would also like to discuss the best way to use this P2P 
framework to facilitate copies between NVMe namespaces (on both PCIe and fabric 
attached namespaces) without having to expose the CMB up to user space. Wasn't 
something like that done in the SCSI world at some point Martin?

Stephen




Re: [PATCH] nvme-pci: Fix incorrect use of CMB size to calculate q_depth

2018-02-06 Thread Stephen Bates

> On Feb 6, 2018, at 8:02 AM, Keith Busch  wrote:
> 
>> On Mon, Feb 05, 2018 at 03:32:23PM -0700, sba...@raithlin.com wrote:
>> 
>> -if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
>> +if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> 
> Is this a prep patch for something coming later? dev->cmb is already
> NULL if use_cmb_sqes is false.

Thanks Keith. 

Not sure how I missed this. I was working on OOT patches to enable WDS and RDS 
in the CMB. I *thought* I’d confirmed this patch was applicable to upstream. 
Looks like I did not do that correctly. 

Sorry for the churn. We can leave this patch out for now. Thanks for the prompt 
feedback everyone!

Stephen 

Re: [PATCH] pci: Add a acs_disable option for pci kernel parameter

2017-10-29 Thread Stephen Bates
>> This patch adds a new boot option to the pci kernel parameter called
>> "acs_disable" that will disable ACS. This is useful for PCI peer to
>> peer communication but can cause problems when IOVA isolation is
>> required and an IOMMU is enabled. Use with care.

> Eww.

Thanks for the feedback Christoph. My sentiments exactly ;-).  

> Can we please add smbios quirks for the systems where you've
> observed this? 

I can look at doing this. The issue with this approach is that it will require 
a kernel patch for each new system that is detected.
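
For reference, a minimal sketch of the smbios/DMI quirk shape being suggested 
(the vendor and product strings below are placeholders, not real systems):

#include <linux/dmi.h>

/* Placeholder DMI table: platforms known to want ACS disabled for p2p */
static const struct dmi_system_id p2p_acs_disable_dmi_table[] = {
        {
                .ident = "Example p2p platform",        /* hypothetical entry */
                .matches = {
                        DMI_MATCH(DMI_SYS_VENDOR, "ExampleVendor"),
                        DMI_MATCH(DMI_PRODUCT_NAME, "ExampleBoard"),
                },
        },
        { }
};

static bool platform_wants_acs_disabled(void)
{
        return dmi_check_system(p2p_acs_disable_dmi_table) > 0;
}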

> we probably also want to keep the option just in case).

Agreed. At least then operators have a path to ACS disable prior to a quirk 
being accepted.

Stephen



Re: [PATCH v2] genalloc: Make the avail variable an atomic_long_t

2017-10-29 Thread Stephen Bates

> Do we still need #include <linux/atomic.h>? For me, it compiles without it.

Yes we do. Kbuild reported a failure when I tried omitting it 
(arm-multi_v7_defconfig).

> Reviewed-by: Daniel Mentz danielme...@google.com

Thanks for the review

Andrew, can you look at picking this up, or do you want me to respin with the 
tags from Daniel and Mathieu applied?

Stephen 




Re: [PATCH] genalloc: Make the avail variable an atomic64_t

2017-10-26 Thread Stephen Bates
> We have atomic_long_t for that. Please use it instead. It will be
> 64-bit on 64-bit archs, and 32-bit on 32-bit archs, which seems to
> fit your purpose here.

Thank you Mathieu! Yes, atomic_long_t looks perfect for this and addresses 
Daniel's concerns for 32 bit systems. I'll prepare a v2 with that change.
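
For the record, the shape that v2 change takes is roughly the following (field 
and accessor names as used by lib/genalloc.c; treat this as a sketch rather 
than the exact diff):

#include <linux/atomic.h>

/*
 * In struct gen_pool_chunk the counter becomes an atomic_long_t, e.g.:
 *      atomic_long_t avail;
 * and the accessors change accordingly:
 */
static inline void example_set_avail(atomic_long_t *avail, size_t size)
{
        atomic_long_set(avail, size);   /* 64-bit on 64-bit archs, 32-bit on 32-bit */
}

static inline size_t example_get_avail(atomic_long_t *avail)
{
        return atomic_long_read(avail);
}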

Stephen




Re: [PATCH] genalloc: Make the avail variable an atomic64_t

2017-10-25 Thread Stephen Bates
> I found that genalloc is very slow for large chunk sizes because
> bitmap_find_next_zero_area has to grind through that entire bitmap.
> Hence, I recommend avoiding genalloc for large chunk sizes.

Thanks for the feedback Daniel! We have been doing 16GiB without any noticeable 
issues.

> I'm thinking how this would behave on a 32 bit ARM platform

I don't think people would be doing such big allocations on 32 bit (ARM) 
systems. It would not make sense for them to be doing >4GB allocations anyway.

>> --- a/lib/genalloc.c
>> +++ b/lib/genalloc.c
>> @@ -194,7 +194,7 @@ int gen_pool_add_virt(struct gen_pool *pool, unsigned 
>> long virt, phys_addr_t phy
>> chunk->phys_addr = phys;
>> chunk->start_addr = virt;
>> chunk->end_addr = virt + size - 1;
>> -   atomic_set(&chunk->avail, size);
>> +   atomic64_set(&chunk->avail, size);

> Isn't size defined as a size_t type which is 32 bit wide on ARM? How
> can you ever set chunk->avail to anything larger than 2^32 - 1?

I did consider changing this type but it seems like there would never be a need 
to set this value to more than 4GiB on 32 bit systems.

>> @@ -464,7 +464,7 @@ size_t gen_pool_avail(struct gen_pool *pool)
>>
>> rcu_read_lock();
>> list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk)
>> -   avail += atomic_read(&chunk->avail);
>> +   avail += atomic64_read(&chunk->avail);
>
> avail is defined as size_t (32 bit). Aren't you going to overflow that 
> variable?

Again, I don’t think people on 32 bit systems will be doing >4GB assignments so 
it would not be an issue.

Stephen



