Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Tue, October 25, 2016 3:19 pm, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
>>
>> Dave are you saying that even for local mappings of files on a DAX
>> capable system it is possible for the mappings to move on you unless the
>> FS supports locking?
>>
>
> Yes.
>
>
>> Does that not mean DAX on such FS is
>> inherently broken?
>
> No. DAX is accessed through a virtual mapping layer that abstracts
> the physical location from userspace applications.
>
> Example: think copy-on-write overwrites. It occurs atomically from
> the perspective of userspace and starts by invalidating any current
> mappings userspace has of that physical location. The location is changed,
> the data copied in, and then when the locks are released userspace can
> fault in a new page table mapping on the next access

Dave

Thanks for the good input and for correcting some of my DAX misconceptions! We will certainly be taking this into account as we consider v1.

> And at least for XFS we have such a mechanism :) E.g. I have a
> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> RDMA directly to XFS files, with the same locking mechanism we use
> for the current block and scsi layout in xfs_pnfs.c.
>>
>> Thanks for fixing this issue on XFS Christoph! I assume this problem
>> continues to exist on the other DAX capable FS?
>
> Yes, but if they implement the exportfs API that supplies this
> capability, they'll be able to use pNFS, too.
>
>> One more reason to consider a move to /dev/dax I guess ;-)...
>>
>
> That doesn't get rid of the need for sane access control arbitration
> across all machines that are directly accessing the storage. That's the
> problem pNFS solves, regardless of whether your direct access target is a
> filesystem, a block device or object storage...

Fair point. I am still hoping for a bit more discussion on the best choice of user-space interface for this work. If/when that happens we will take it into account when we look at spinning the patchset.

Stephen
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
>> You do realise that local filesystems can silently change the
>> location of file data at any point in time, so there is no such
>> thing as a "stable mapping" of file data to block device addresses
>> in userspace?
>>
>> If you want remote access to the blocks owned and controlled by a
>> filesystem, then you need to use a filesystem with a remote locking
>> mechanism to allow co-ordinated, coherent access to the data in
>> those blocks. Anything else is just asking for ongoing, unfixable
>> filesystem corruption or data leakage problems (i.e. security
>> issues).
>
> And at least for XFS we have such a mechanism :) E.g. I have a
> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> RDMA directly to XFS files, with the same locking mechanism we use
> for the current block and scsi layout in xfs_pnfs.c.

Christoph, did you manage to leap to the future and solve the RDMA persistency hole? :)

e.g. what happens with O_DSYNC in this model? Or you did a message exchange for commits?
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Thu, Oct 27, 2016 at 01:22:49PM +0300, Sagi Grimberg wrote:
> Christoph, did you manage to leap to the future and solve the
> RDMA persistency hole? :)
>
> e.g. what happens with O_DSYNC in this model? Or you did
> a message exchange for commits?

Yes, pNFS calls this the layoutcommit. That being said, once we get an RDMA commit or flush operation we could easily make the layoutcommit optional for some operations. There already is a precedent for this in the flexfiles layout specification.
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Wed, Oct 26, 2016 at 1:24 AM, Haggai Eran wrote:
[..]
>> I wonder if we could (ab)use a
>> software-defined 'pasid' as the requester id for a peer-to-peer
>> mapping that needs address translation.
> Why would you need that? Isn't it enough to map the peer-to-peer
> addresses correctly in the iommu driver?
>

You're right, we might already have enough... We would just need to audit iommu drivers to undo any assumptions that the page being mapped is always in host memory and apply any bus address translations between source device and target device.
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On 10/19/2016 6:51 AM, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates wrote:
>> 1. Address Translation. Suggestions have been made that in certain
>> architectures and topologies the dma_addr_t passed to the DMA master
>> in a peer-2-peer transfer will not correctly route to the IO memory
>> intended. However in our testing to date we have not seen this to be
>> an issue, even in systems with IOMMUs and PCIe switches. It is our
>> understanding that an IOMMU only maps system memory and would not
>> interfere with device memory regions.

I'm not sure that's the case. I think it works because with ZONE_DEVICE, the iommu driver will simply treat a dma_map_page call as any other PFN, and create a mapping as it does for any memory page.

>> (It certainly has no opportunity
>> to do so if the transfer gets routed through a switch).

It can still go through the IOMMU if you enable ACS upstream forwarding.

> There may still be platforms where peer-to-peer cycles are routed up
> through the root bridge and then back down to target device, but we
> can address that when / if it happens.

I agree.

> I wonder if we could (ab)use a
> software-defined 'pasid' as the requester id for a peer-to-peer
> mapping that needs address translation.

Why would you need that? Isn't it enough to map the peer-to-peer addresses correctly in the iommu driver?

Haggai
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
> Hi Dave and Christoph
>
> On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> > > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > > > You do realise that local filesystems can silently change the
> > > > location of file data at any point in time, so there is no such
> > > > thing as a "stable mapping" of file data to block device addresses
> > > > in userspace?
> > > >
> > > > If you want remote access to the blocks owned and controlled by a
> > > > filesystem, then you need to use a filesystem with a remote locking
> > > > mechanism to allow co-ordinated, coherent access to the data in
> > > > those blocks. Anything else is just asking for ongoing, unfixable
> > > > filesystem corruption or data leakage problems (i.e. security
> > > > issues).
> > >
>
> Dave are you saying that even for local mappings of files on a DAX
> capable system it is possible for the mappings to move on you unless
> the FS supports locking?

Yes.

> Does that not mean DAX on such FS is
> inherently broken?

No. DAX is accessed through a virtual mapping layer that abstracts the physical location from userspace applications.

Example: think copy-on-write overwrites. It occurs atomically from the perspective of userspace and starts by invalidating any current mappings userspace has of that physical location. The location is changed, the data copied in, and then when the locks are released userspace can fault in a new page table mapping on the next access.

> > > And at least for XFS we have such a mechanism :) E.g. I have a
> > > prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> > > RDMA directly to XFS files, with the same locking mechanism we use
> > > for the current block and scsi layout in xfs_pnfs.c.
>
> Thanks for fixing this issue on XFS Christoph! I assume this problem
> continues to exist on the other DAX capable FS?

Yes, but if they implement the exportfs API that supplies this capability, they'll be able to use pNFS, too.

> One more reason to consider a move to /dev/dax I guess ;-)...

That doesn't get rid of the need for sane access control arbitration across all machines that are directly accessing the storage. That's the problem pNFS solves, regardless of whether your direct access target is a filesystem, a block device or object storage...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
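For readers following along, the copy-on-write sequence described above (invalidate the mapping, move the data, re-fault on next access) can be sketched as a toy model. This is only an illustration in Python, not kernel or filesystem code: the class `ToyDaxFile`, its fields and the block numbers are all made up for the example, and real DAX does this with page tables and filesystem locks.

```python
class ToyDaxFile:
    """Toy model of one file block and a userspace mapping of it (hypothetical)."""

    def __init__(self, phys, data):
        self.phys = phys              # physical block currently backing the file
        self.storage = {phys: data}   # physical block number -> contents
        self.user_mapping = None      # translation cached by "userspace", if any

    def fault(self):
        """Userspace access: re-establish the mapping if it was invalidated."""
        if self.user_mapping is None:
            self.user_mapping = self.phys
        return self.storage[self.user_mapping]

    def cow_overwrite(self, new_phys, data):
        """Copy-on-write overwrite, in the order described in the message above."""
        self.user_mapping = None       # 1. invalidate current userspace mappings
        self.storage[new_phys] = data  # 2. copy the data into the new location
        del self.storage[self.phys]    #    the old block is now free for reuse
        self.phys = new_phys           # 3. the file data has moved


f = ToyDaxFile(phys=100, data=b"old")
f.fault()
stale = f.user_mapping                 # an address userspace might be tempted to keep
f.cow_overwrite(new_phys=200, data=b"new")
print(f.fault(), f.user_mapping, stale in f.storage)
```

The point of the sketch is that the access through `fault()` keeps working across the move, while the remembered physical location (`stale`) no longer refers to the file's data, which is why handing such an address to a remote peer without a locking protocol is unsafe.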
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
Hi Dave and Christoph

On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > > You do realise that local filesystems can silently change the
> > > location of file data at any point in time, so there is no such
> > > thing as a "stable mapping" of file data to block device addresses
> > > in userspace?
> > >
> > > If you want remote access to the blocks owned and controlled by a
> > > filesystem, then you need to use a filesystem with a remote locking
> > > mechanism to allow co-ordinated, coherent access to the data in
> > > those blocks. Anything else is just asking for ongoing, unfixable
> > > filesystem corruption or data leakage problems (i.e. security
> > > issues).
> >

Dave are you saying that even for local mappings of files on a DAX capable system it is possible for the mappings to move on you unless the FS supports locking? Does that not mean DAX on such FS is inherently broken?

> > And at least for XFS we have such a mechanism :) E.g. I have a
> > prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> > RDMA directly to XFS files, with the same locking mechanism we use
> > for the current block and scsi layout in xfs_pnfs.c.
>

Thanks for fixing this issue on XFS Christoph! I assume this problem continues to exist on the other DAX capable FS?

One more reason to consider a move to /dev/dax I guess ;-)...

Stephen

> Oh, that's good to know - pNFS over XFS was exactly what I was
> thinking of when I wrote my earlier reply. A few months ago someone
> else was trying to use file mappings in userspace for direct remote
> client access on fabric connected devices. I told them "pNFS on XFS
> and write an efficient transport for your hardware"
>
> Now that I know we've got RDMA support for pNFS on XFS in the
> pipeline, I can just tell them "just write an rdma driver for your
> hardware" instead. :P
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > You do realise that local filesystems can silently change the
> > location of file data at any point in time, so there is no such
> > thing as a "stable mapping" of file data to block device addresses
> > in userspace?
> >
> > If you want remote access to the blocks owned and controlled by a
> > filesystem, then you need to use a filesystem with a remote locking
> > mechanism to allow co-ordinated, coherent access to the data in
> > those blocks. Anything else is just asking for ongoing, unfixable
> > filesystem corruption or data leakage problems (i.e. security
> > issues).
>
> And at least for XFS we have such a mechanism :) E.g. I have a
> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> RDMA directly to XFS files, with the same locking mechanism we use
> for the current block and scsi layout in xfs_pnfs.c.

Oh, that's good to know - pNFS over XFS was exactly what I was thinking of when I wrote my earlier reply. A few months ago someone else was trying to use file mappings in userspace for direct remote client access on fabric connected devices. I told them "pNFS on XFS and write an efficient transport for your hardware"

Now that I know we've got RDMA support for pNFS on XFS in the pipeline, I can just tell them "just write an rdma driver for your hardware" instead. :P

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> You do realise that local filesystems can silently change the
> location of file data at any point in time, so there is no such
> thing as a "stable mapping" of file data to block device addresses
> in userspace?
>
> If you want remote access to the blocks owned and controlled by a
> filesystem, then you need to use a filesystem with a remote locking
> mechanism to allow co-ordinated, coherent access to the data in
> those blocks. Anything else is just asking for ongoing, unfixable
> filesystem corruption or data leakage problems (i.e. security
> issues).

And at least for XFS we have such a mechanism :) E.g. I have a prototype of a pNFS layout that uses XFS+DAX to allow clients to do RDMA directly to XFS files, with the same locking mechanism we use for the current block and scsi layout in xfs_pnfs.c.
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Wed, Oct 19, 2016 at 12:48:14PM -0600, Stephen Bates wrote:
> On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> > [ adding Ashok and David for potential iommu comments ]
> >
>
> Hi Dan
>
> Thanks for adding Ashok and David!
>
> >
> > I agree with the motivation and the need for a solution, but I have
> > some questions about this implementation.
> >
> > >
> > > Consumers
> > > -
> > >
> > > We provide a PCIe device driver in an accompanying patch that can be
> > > used to map any PCIe BAR into a DAX capable block device. For
> > > non-persistent BARs this simply serves as an alternative to using
> > > system memory bounce buffers. For persistent BARs this can serve as an
> > > additional storage device in the system.
> >
> > Why block devices? I wonder if iopmem was initially designed back
> > when we were considering enabling DAX for raw block devices. However,
> > that support has since been ripped out / abandoned. You currently
> > need a filesystem on top of a block-device to get DAX operation.
> > Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
> > if all you want is a way to map the bar for another PCI-E device in
> > the topology.
> >
> > If you're only using the block-device as an entry-point to create
> > dax-mappings then a device-dax (drivers/dax/) character-device might
> > be a better fit.
> >
>
> We chose a block device because we felt it was intuitive for users to
> carve up a memory region by putting a DAX filesystem on it and creating
> files on that DAX aware FS. It seemed like a convenient way to
> partition up the region and to be easily able to get the DMA address
> for the memory backing the device.

You do realise that local filesystems can silently change the location of file data at any point in time, so there is no such thing as a "stable mapping" of file data to block device addresses in userspace?

If you want remote access to the blocks owned and controlled by a filesystem, then you need to use a filesystem with a remote locking mechanism to allow co-ordinated, coherent access to the data in those blocks. Anything else is just asking for ongoing, unfixable filesystem corruption or data leakage problems (i.e. security issues).

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
> >> If you're only using the block-device as an entry-point to create
> >> dax-mappings then a device-dax (drivers/dax/) character-device might
> >> be a better fit.
> >
> > We chose a block device because we felt it was intuitive for users to
> > carve up a memory region by putting a DAX filesystem on it and creating
> > files on that DAX aware FS. It seemed like a convenient way to
> > partition up the region and to be easily able to get the DMA address
> > for the memory backing the device.
> >
> > That said I would be very keen to get other people's thoughts on how
> > they would like to see this done. And I know some people have had some
> > reservations about using DAX mounted FS to do this in the past.
>
> I guess it depends on the expected size of these devices' BARs, but I
> get the sense they may be smaller / more precious such that you
> wouldn't want to spend capacity on filesystem metadata? For the target
> use case is it assumed that these device BARs are always backed by
> non-volatile memory? Otherwise this is a mkfs each boot for a
> volatile device.

Dan

Fair point and this is a concern I share. We are not assuming that all
iopmem devices are backed by non-volatile memory so the mkfs
recreation comment is valid. All in all I think you are persuading us
to take a look at /dev/dax ;-). I will see if anyone else chips in
with their thoughts on this.

> >> > 2. Memory Segment Spacing. This patch has the same limitations that
> >> > ZONE_DEVICE does in that memory regions must be spaced at least
> >> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> >> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> >> > be usable on neighboring BARs. For our purposes, this is not an issue as
> >> > we'd only be looking at enabling a single BAR in a given PCIe device.
> >> > More exotic use cases may have problems with this.
> >>
> >> I'm working on patches for 4.10 to allow mixing multiple
> >> devm_memremap_pages() allocations within the same physical section.
> >> Hopefully this won't be a problem going forward.
> >
> > Thanks Dan. Your patches will help address the problem of how to
> > partition a /dev/dax device but they don't help the case when BARs
> > themselves are small, closely spaced and non-segment aligned. However
> > I think most people using iopmem will want to use reasonably large
> > BARs so I am not sure item 2 is that big of an issue.
>
> I think you might have misunderstood what I'm proposing. The patches
> I'm working on are separate from a facility to carve up a /dev/dax
> device. The effort is to allow devm_memremap_pages() to maintain
> several allocations within the same 128MB section. I need this for
> persistent memory to handle platforms that mix pmem and system-ram in
> the same section. I want to be able to map ZONE_DEVICE pages for a
> portion of a section and be able to remove portions of section that
> may collide with allocations of a different lifetime.

Oh I did misunderstand. This is very cool and would be useful to us.
One more reason to consider moving to /dev/dax in the next spin of
this patchset ;-).

Thanks

Stephen
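
To make the section-spacing limitation above concrete, here is a minimal
userspace sketch of the arithmetic. It is not kernel code: the
example_* names are made up for illustration, and the 128MB section
size is simply the x86 figure quoted in the thread. It checks whether
two physical ranges (for example two BARs) touch a common section,
which is the case the ZONE_DEVICE limitation rules out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 128MB sections (2^27 bytes), the x86 value quoted in the thread. */
#define EXAMPLE_SECTION_SIZE_BITS 27

/* Index of the section that contains a physical address. */
static inline uint64_t example_section_nr(uint64_t phys_addr)
{
    return phys_addr >> EXAMPLE_SECTION_SIZE_BITS;
}

/*
 * True when [a_start, a_start + a_len) and [b_start, b_start + b_len)
 * touch at least one common section.
 */
static inline bool example_ranges_share_section(uint64_t a_start, uint64_t a_len,
                                                uint64_t b_start, uint64_t b_len)
{
    assert(a_len > 0 && b_len > 0);

    uint64_t a_first = example_section_nr(a_start);
    uint64_t a_last  = example_section_nr(a_start + a_len - 1);
    uint64_t b_first = example_section_nr(b_start);
    uint64_t b_last  = example_section_nr(b_start + b_len - 1);

    /* Two closed intervals of section numbers overlap. */
    return a_first <= b_last && b_first <= a_last;
}
```

With these helpers, two 16MB BARs at 0xf0000000 and 0xf1000000 land in
the same section, while BARs at 0xf0000000 and 0xf8000000 do not.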
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Wed, Oct 19, 2016 at 11:48 AM, Stephen Bates wrote:
> On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
>> [ adding Ashok and David for potential iommu comments ]
>>
>
> Hi Dan
>
> Thanks for adding Ashok and David!
>
>>
>> I agree with the motivation and the need for a solution, but I have
>> some questions about this implementation.
>>
>> >
>> > Consumers
>> > ---------
>> >
>> > We provide a PCIe device driver in an accompanying patch that can be
>> > used to map any PCIe BAR into a DAX capable block device. For
>> > non-persistent BARs this simply serves as an alternative to using
>> > system memory bounce buffers. For persistent BARs this can serve as an
>> > additional storage device in the system.
>>
>> Why block devices? I wonder if iopmem was initially designed back
>> when we were considering enabling DAX for raw block devices. However,
>> that support has since been ripped out / abandoned. You currently
>> need a filesystem on top of a block-device to get DAX operation.
>> Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
>> if all you want is a way to map the bar for another PCI-E device in
>> the topology.
>>
>> If you're only using the block-device as an entry-point to create
>> dax-mappings then a device-dax (drivers/dax/) character-device might
>> be a better fit.
>>
>
> We chose a block device because we felt it was intuitive for users to
> carve up a memory region by putting a DAX filesystem on it and creating
> files on that DAX aware FS. It seemed like a convenient way to
> partition up the region and to be easily able to get the DMA address
> for the memory backing the device.
>
> That said I would be very keen to get other people's thoughts on how
> they would like to see this done. And I know some people have had some
> reservations about using DAX mounted FS to do this in the past.

I guess it depends on the expected size of these devices' BARs, but I
get the sense they may be smaller / more precious such that you
wouldn't want to spend capacity on filesystem metadata? For the target
use case is it assumed that these device BARs are always backed by
non-volatile memory? Otherwise this is a mkfs each boot for a
volatile device.

>>
>> > 2. Memory Segment Spacing. This patch has the same limitations that
>> > ZONE_DEVICE does in that memory regions must be spaced at least
>> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
>> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
>> > be usable on neighboring BARs. For our purposes, this is not an issue as
>> > we'd only be looking at enabling a single BAR in a given PCIe device.
>> > More exotic use cases may have problems with this.
>>
>> I'm working on patches for 4.10 to allow mixing multiple
>> devm_memremap_pages() allocations within the same physical section.
>> Hopefully this won't be a problem going forward.
>>
>
> Thanks Dan. Your patches will help address the problem of how to
> partition a /dev/dax device but they don't help the case when BARs
> themselves are small, closely spaced and non-segment aligned. However
> I think most people using iopmem will want to use reasonably large
> BARs so I am not sure item 2 is that big of an issue.

I think you might have misunderstood what I'm proposing. The patches
I'm working on are separate from a facility to carve up a /dev/dax
device. The effort is to allow devm_memremap_pages() to maintain
several allocations within the same 128MB section. I need this for
persistent memory to handle platforms that mix pmem and system-ram in
the same section. I want to be able to map ZONE_DEVICE pages for a
portion of a section and be able to remove portions of section that
may collide with allocations of a different lifetime.
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> [ adding Ashok and David for potential iommu comments ]
>

Hi Dan

Thanks for adding Ashok and David!

>
> I agree with the motivation and the need for a solution, but I have
> some questions about this implementation.
>
> >
> > Consumers
> > ---------
> >
> > We provide a PCIe device driver in an accompanying patch that can be
> > used to map any PCIe BAR into a DAX capable block device. For
> > non-persistent BARs this simply serves as an alternative to using
> > system memory bounce buffers. For persistent BARs this can serve as an
> > additional storage device in the system.
>
> Why block devices? I wonder if iopmem was initially designed back
> when we were considering enabling DAX for raw block devices. However,
> that support has since been ripped out / abandoned. You currently
> need a filesystem on top of a block-device to get DAX operation.
> Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
> if all you want is a way to map the bar for another PCI-E device in
> the topology.
>
> If you're only using the block-device as an entry-point to create
> dax-mappings then a device-dax (drivers/dax/) character-device might
> be a better fit.
>

We chose a block device because we felt it was intuitive for users to
carve up a memory region by putting a DAX filesystem on it and creating
files on that DAX aware FS. It seemed like a convenient way to
partition up the region and to be easily able to get the DMA address
for the memory backing the device.

That said I would be very keen to get other people's thoughts on how
they would like to see this done. And I know some people have had some
reservations about using DAX mounted FS to do this in the past.

>
> > 2. Memory Segment Spacing. This patch has the same limitations that
> > ZONE_DEVICE does in that memory regions must be spaced at least
> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> > be usable on neighboring BARs. For our purposes, this is not an issue as
> > we'd only be looking at enabling a single BAR in a given PCIe device.
> > More exotic use cases may have problems with this.
>
> I'm working on patches for 4.10 to allow mixing multiple
> devm_memremap_pages() allocations within the same physical section.
> Hopefully this won't be a problem going forward.
>

Thanks Dan. Your patches will help address the problem of how to
partition a /dev/dax device but they don't help the case when BARs
themselves are small, closely spaced and non-segment aligned. However
I think most people using iopmem will want to use reasonably large
BARs so I am not sure item 2 is that big of an issue.

> I haven't yet grokked the motivation for this, but I'll go comment on
> that separately.

Thanks Dan!
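
For readers weighing the device-dax suggestion, the following is a rough
userspace sketch of what mapping a device-dax character device looks
like, with no filesystem involved. It is an illustration only and not
part of the iopmem patchset: the function name is made up, and a node
name such as /dev/dax0.0 is an assumed example. As far as I know,
device-dax only accepts shared mappings whose length and offset match
the device's alignment (often 2MB), so the sketch takes the length from
the caller and returns NULL on any failure.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes from the start of a device-dax character device.
 * Returns the mapping, or NULL if the open or the mmap fails.
 * Assumption: the caller passes a `len` that is a multiple of the
 * device's mapping alignment; a misaligned length is expected to fail.
 */
static void *example_map_devdax(const char *path, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED: stores go to the device memory, not to a private copy. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);

    return addr == MAP_FAILED ? NULL : addr;
}
```

A caller would release the region with munmap(addr, len) once done.
Unlike the block-device route, nothing here needs a mkfs, which is the
point Dan raises about volatile BARs.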
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -----------------
> Rebased.
> Included the iopmem driver in the submission.
>
> History
> -------
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> ------------------------
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high performance DMA engine, a
> number of exposed PCIe BARs or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

>
> Consumers
> ---------
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices? I wonder if iopmem was initially designed back
when we were considering enabling DAX for raw block devices. However,
that support has since been ripped out / abandoned. You currently
need a filesystem on top of a block-device to get DAX operation.
Putting xfs or ext4 on top of PCI-E memory mapped range seems awkward
if all you want is a way to map the bar for another PCI-E device in
the topology.

If you're only using the block-device as an entry-point to create
dax-mappings then a device-dax (drivers/dax/) character-device might
be a better fit.

>
> Testing and Performance
> -----------------------
>
> We have done a moderate amount of testing of this patch on a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> ------------
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).
>

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to target device, but we
can address that when / if it happens. I wonder if we could (ab)use a
software-defined 'pasid' as the requester id for a peer-to-peer
mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation with things like RDMA:
[PATCH 0/3] iopmem : A block device for PCIe memory
This patch follows from an RFC we did earlier this year [1]. This
patchset applies cleanly to v4.9-rc1.

Updates since RFC
-----------------
Rebased.
Included the iopmem driver in the submission.

History
-------

There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
style patches [3]. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
letter [3].

Motivation and Use Cases
------------------------

PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.

Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occur between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch the datapath avoids the CPU
entirely.

Consumers
---------

We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.

Testing and Performance
-----------------------

We have done a moderate amount of testing of this patch on a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware. In
addition, we have observed that the CPU DRAM bandwidth is not impacted
when using IOPMEM which is not the case when a traditional path
through system memory is taken.

For more information on the testing and performance results see the
GitHub site [4].

Known Issues
------------

1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-2-peer transfer will not correctly route to the IO memory
intended. However in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).

2. Memory Segment Spacing. This patch has the same limitations that
ZONE_DEVICE does in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.

3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different than the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about using that memory region
themselves.

4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to
other ARCH over time.

References
----------
[1] https://patchwork.kernel.org/patch/8583221/
[2] http://comments.gmane.org/gmane.linux.drivers.rdma/21849
[3] http://www.spinics.net/lists/linux-rdma/msg38748.html
[4] https://github.com/sbates130272/zone-device

Logan Gunthorpe (1):
  memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

Stephen Bates (2):
  iopmem : Add a block device driver for PCIe attached IO memory.
  iopmem : Add documentation for iopmem driver

 Documentation/blockdev/00-INDEX   |   2 +
 Documentation/blockdev/iopmem.txt |  62 +++
 MAINTAINERS                       |   7 +
 drivers/block/Kconfig             |  27
 drivers/block/Makefile            |   1 +
 drivers/block/iopmem.c            | 333 ++
 drivers/dax/pmem.c                |   4 +-
 drivers/nvdimm/pmem.c             |   4 +-
 include/linux/memremap.h          |   5 +-
 kernel/memremap.c                 |  80 -
 tools/testing/nvdimm/test/iomap.c |   3 +-
 11 files changed, 518 insertions(+), 10 deletions(-)
 create mode 100644
[PATCH 0/3] iopmem : A block device for PCIe memory
This patch follows from an RFC we did earlier this year [1]. This
patchset applies cleanly to v4.9-rc1.

Updates since RFC
-----------------
  Rebased.
  Included the iopmem driver in the submission.

History
-------

There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
style patches [3]. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
letter [3].

Motivation and Use Cases
------------------------

PCIe IO devices are getting faster. It is not uncommon now to find
PCIe network and storage devices that can generate and consume several
GB/s. Almost always these devices have either a high performance DMA
engine, a number of exposed PCIe BARs or both.

Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not
compromised when high-throughput transfers occur between PCIe
devices. This means that more system memory bandwidth is available to
the CPU cores for data processing and manipulation. In addition, in
systems where the two PCIe devices reside behind a PCIe switch the
datapath avoids the CPU entirely.

Consumers
---------

We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.

Testing and Performance
-----------------------

We have done a moderate amount of testing of this patch in a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4 GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware.
In addition, we have observed that the CPU DRAM bandwidth is not
impacted when using IOPMEM, which is not the case when a traditional
path through system memory is taken.

For more information on the testing and performance results see the
GitHub site [4].

Known Issues
------------

1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-to-peer transfer will not correctly route to the IO memory
intended. However, in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch.)

2. Memory Segment Spacing. This patch has the same limitation that
ZONE_DEVICE does in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases
where BARs can be placed closer together than this. Thus ZONE_DEVICE
would not be usable on neighboring BARs. For our purposes, this is not
an issue as we'd only be looking at enabling a single BAR in a given
PCIe device. More exotic use cases may have problems with this.

3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur
out of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different from the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about using that memory region
themselves.

4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to
other ARCH over time.
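
To make the Memory Segment Spacing limitation (item 2 above) a little
more concrete, here is a rough user-space sketch in Python. It is not
part of the patchset. It assumes the limitation amounts to two regions
not being able to share a SECTION_SIZE-aligned section, takes
SECTION_SIZE as the 128MB x86 value quoted above, and uses made-up BAR
addresses; the helper names are hypothetical.

```python
# Illustrative sketch only, not kernel code.
# Assumption: SECTION_SIZE is 128 MiB, the x86 value quoted in the text.
SECTION_SIZE = 128 * 1024 * 1024


def section_index(addr):
    """Return the index of the SECTION_SIZE-aligned section holding addr."""
    return addr // SECTION_SIZE


def bars_share_section(bar_a, bar_b):
    """Each BAR is a (start, length) tuple of physical address and size.

    Returns True when the two BARs touch at least one common section,
    which is the case this sketch assumes ZONE_DEVICE cannot handle.
    """
    a_first = section_index(bar_a[0])
    a_last = section_index(bar_a[0] + bar_a[1] - 1)
    b_first = section_index(bar_b[0])
    b_last = section_index(bar_b[0] + bar_b[1] - 1)
    # Two closed index ranges overlap when each starts before the other ends.
    return a_first <= b_last and b_first <= a_last


# Hypothetical BAR placements: two 16 MiB BARs placed back to back,
# and a third one placed a full section further up.
BAR0 = (0xF0000000, 16 * 1024 * 1024)
BAR1 = (0xF1000000, 16 * 1024 * 1024)
BAR2 = (0xF8000000, 16 * 1024 * 1024)
```

With these made-up addresses, bars_share_section(BAR0, BAR1) is true
(the neighboring-BAR case described in item 2), while
bars_share_section(BAR0, BAR2) is false.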
References
----------

[1] https://patchwork.kernel.org/patch/8583221/
[2] http://comments.gmane.org/gmane.linux.drivers.rdma/21849
[3] http://www.spinics.net/lists/linux-rdma/msg38748.html
[4] https://github.com/sbates130272/zone-device

Logan Gunthorpe (1):
  memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

Stephen Bates (2):
  iopmem : Add a block device driver for PCIe attached IO memory.
  iopmem : Add documentation for iopmem driver

 Documentation/blockdev/00-INDEX   |   2 +
 Documentation/blockdev/iopmem.txt |  62 +++
 MAINTAINERS                       |   7 +
 drivers/block/Kconfig             |  27
 drivers/block/Makefile            |   1 +
 drivers/block/iopmem.c            | 333 ++
 drivers/dax/pmem.c                |   4 +-
 drivers/nvdimm/pmem.c             |   4 +-
 include/linux/memremap.h          |   5 +-
 kernel/memremap.c                 |  80 -
 tools/testing/nvdimm/test/iomap.c |   3 +-
 11 files changed, 518 insertions(+), 10 deletions(-)
 create mode 100644