Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jerome Glisse
On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe  wrote:
> >
> >
> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> >> I was thinking only this one would be supported with a core code
> >> helper..
> >
> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> > type flag to the dev_pagemap structure which would be very useful to us.
> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> > Then, potentially, we could add a dma_map callback to the structure
> > (possibly unioned with an hmm field). The dev_ops providers would then
> > just need to do something like this (enclosed in a helper):
> >
> > if (is_zone_device_page(page)) {
> >         pgmap = get_dev_pagemap(page_to_pfn(page));
> >         if (!pgmap || pgmap->type != MEMORY_DEVICE_P2P ||
> >             !pgmap->dma_map)
> >                 return 0;
> >
> >         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
> >         put_dev_pagemap(pgmap);
> >         if (!dma_addr)
> >                 return 0;
> >         ...
> > }
> >
> > The pci_enable_p2p_bar function would then just need to call
> > devm_memremap_pages with the dma_map callback set to a function that
> > does the segment check and the offset calculation.
> >
> > Thoughts?
> >
> > @Jerome: my feedback to you would be that your patch assumes all users
> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> > useful if it was generic. My suggestion would be to have the caller
> > allocate the dev_pagemap structure, populate it and pass it into
> > devm_memremap_pages. Given that pretty much everything in that structure
> > is already an argument to that function, I feel like this makes sense.
> > This should also help to unify hmm_devmem_pages_create and
> > devm_memremap_pages which look very similar to each other.
> 
> I like that change. Also, the types should describe the memory in terms
> of its relationship to struct page, not whether it is persistent or not.
> I would consider volatile and persistent memory that is attached to the
> CPU memory controller and I/O coherent to be the same type of memory.
> DMA-incoherent ranges like P2P and HMM should get their own types.

Dan, you asked me not to use devm_memremap_pages() because you didn't
want to have HMM memory in the pgmap_radix; did you change your opinion
on that? :)

Note I won't make any changes on that front now, but if it makes sense
I am happy to do it as a separate patchset on top of HMM.
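
For illustration, here is a minimal sketch of the caller-allocated
dev_pagemap approach Logan describes above, assuming devm_memremap_pages()
is reworked to take the structure directly and that the MEMORY_DEVICE_P2P
type and per-pgmap ->dma_map() hook proposed in this thread exist. The
helper names, the field layout, and the same-host-bridge "segment check"
are illustrative only, not current in-tree API:

#include <linux/pci.h>
#include <linux/memremap.h>

/*
 * Proposed ->dma_map() hook: refuse the mapping if the initiator sits
 * behind a different host bridge (the "segment check"), otherwise
 * translate the page to a PCI bus address (the "offset calculation").
 * Assumes the initiator is itself a PCI device.
 */
static dma_addr_t p2p_dma_map_page(struct device *initiator,
                                   struct device *provider,
                                   struct page *page)
{
        struct pci_dev *pdev = to_pci_dev(provider);
        phys_addr_t phys = page_to_phys(page);
        int bar;

        /* Segment check: only allow peers under the same host bridge. */
        if (pci_find_host_bridge(to_pci_dev(initiator)->bus) !=
            pci_find_host_bridge(pdev->bus))
                return 0;       /* caller falls back to system memory */

        /* Offset calculation: CPU physical address -> PCI bus address. */
        for (bar = 0; bar < PCI_ROM_RESOURCE; bar++) {
                if (phys < pci_resource_start(pdev, bar) ||
                    phys > pci_resource_end(pdev, bar))
                        continue;
                return pci_bus_address(pdev, bar) +
                       (phys - pci_resource_start(pdev, bar));
        }

        return 0;
}

static int p2p_register_bar(struct pci_dev *pdev, int bar)
{
        struct dev_pagemap *pgmap;
        void *addr;

        /* Caller allocates and populates the pgmap... */
        pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
        if (!pgmap)
                return -ENOMEM;

        pgmap->res = pdev->resource[bar];    /* memory behind the BAR   */
        pgmap->type = MEMORY_DEVICE_P2P;     /* proposed new pgmap type */
        pgmap->dev = &pdev->dev;
        pgmap->dma_map = p2p_dma_map_page;   /* proposed per-pgmap hook */

        /* ... and hands it to devm_memremap_pages() as suggested above. */
        addr = devm_memremap_pages(&pdev->dev, pgmap);
        return PTR_ERR_OR_ZERO(addr);
}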

Also, I don't want p2pmem to be mutually exclusive with HMM; we will want
GPUs to do peer-to-peer DMA, and thus HMM ZONE_DEVICE pages need to
support this too.

I do believe it is easier to special-case ZONE_DEVICE in the existing
dma_ops for each architecture. For x86 I think there are only 3 different
sets of dma_ops to modify; for other architectures I guess there are even
fewer.
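
As a rough sketch of what that special case could look like, here is a
helper an architecture's ->map_page() implementation might call before its
normal path, mirroring the snippet quoted above. MEMORY_DEVICE_P2P and the
per-pgmap ->dma_map() hook are the proposals from this thread, not
existing kernel API:

#include <linux/mm.h>
#include <linux/memremap.h>

static dma_addr_t maybe_map_p2p_page(struct device *dev, struct page *page,
                                     unsigned long offset)
{
        struct dev_pagemap *pgmap;
        dma_addr_t addr = 0;

        if (!is_zone_device_page(page))
                return 0;       /* not ours, take the normal path */

        pgmap = get_dev_pagemap(page_to_pfn(page));
        if (pgmap && pgmap->type == MEMORY_DEVICE_P2P && pgmap->dma_map)
                addr = pgmap->dma_map(dev, pgmap->dev, page) + offset;
        put_dev_pagemap(pgmap);

        return addr;            /* 0 means "fall back to system memory" */
}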

But in any case, I think p2pmem should stay out of the memory management
business. If some set of devices does not have memory management, it is
better to propose helpers for that as part of the subsystem those devices
belong to. Just wanted to reiterate that point.

Cheers,
Jérôme


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jerome Glisse
> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe wrote:
> >
> >
> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> >> Ultimately every dma_ops will need special code to support P2P with
> >> the special hardware that ops is controlling, so it makes some sense
> >> to start by pushing the check down there in the first place. This
> >> advice is partially motivated by how dma_map_sg is just a small
> >> wrapper around the function pointer call...
> >
> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

This would not work well for everyone. For instance, on GPUs we usually
have buffer objects with a mix of device memory and regular system memory,
but we call dma_map_sg() once for the whole list.
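
To make the mixed-buffer case concrete, here is a sketch of what a
p2p-aware dma_map_sg() path would have to do for such a list, reusing the
hypothetical maybe_map_p2p_page() helper sketched earlier in this digest.
None of this is existing API; it is only meant to show why a single pass
over the scatterlist must handle both kinds of pages:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int p2p_aware_map_sg(struct device *dev, struct scatterlist *sgl,
                            int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                struct page *page = sg_page(sg);
                /* Device (ZONE_DEVICE) pages get a peer-to-peer bus
                 * address; everything else goes through the normal
                 * system-memory mapping. */
                dma_addr_t addr = maybe_map_p2p_page(dev, page, sg->offset);

                if (!addr)
                        addr = dma_map_page(dev, page, sg->offset,
                                            sg->length, dir);
                if (dma_mapping_error(dev, addr))
                        return 0;

                sg->dma_address = addr;
                sg_dma_len(sg) = sg->length;
        }

        return nents;
}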

Cheers,
Jérôme


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Jerome Glisse
On Mon, Apr 17, 2017 at 10:52:29AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buys you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

I disagree here. I would rather see peer-to-peer mapping as a set of
helpers so that a device driver can opt in to multiple mechanisms
concurrently, like HMM and p2p.

Also, it seems you have a static vision of p2p. For GPUs and networking,
the use case is that you move some buffer into device memory and then
create a mapping for some network adapter while the buffer is in device
memory. But this is only temporary, and the buffer might later move to a
different device's memory, so the use case is highly dynamic (although a
mapping's lifetime is still probably a few seconds or minutes).

I see no reason to expose a sysfs tree to userspace for all this. It
isn't that dynamic: either two devices can access each other's memory or
they can't, and that can be hidden behind the in-kernel device API. Again,
for GPUs the idea is that it is always doable, in the sense that when it
is not, you fall back to using system memory.
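
A hypothetical in-kernel helper of that kind, which a driver could use to
decide between a peer mapping and the system-memory fallback, might look
like the sketch below. The name and the conservative same-host-bridge test
are illustrative only:

#include <linux/pci.h>

static bool p2p_dma_possible(struct pci_dev *initiator, struct pci_dev *target)
{
        /*
         * Conservative test: many root complexes will not route
         * peer-to-peer transactions between root ports, so only allow
         * p2p when both devices sit under the same host bridge.
         */
        return pci_find_host_bridge(initiator->bus) ==
               pci_find_host_bridge(target->bus);
}

If this returns false, the GPU driver simply keeps (or migrates) the
buffer in system memory before creating the mapping; userspace never needs
to see the topology.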


> >> 2) In order to create the struct pages we use the ZONE_DEVICE
> >> infrastructure which requires a struct device. (See
> >> devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to being PCI only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device. Having a specific class for this makes it very
> clear how this memory would be handled. For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have struct devices they own, to avoid such conflicts.

Yes, this could conflict, and that's why I would rather see this as a set
of helpers, like HMM does, so that a device driver can opt in to HMM and
p2pmem at the same time.
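
For illustration, the opt-in model could look roughly like the following
in a GPU driver's probe routine, registering HMM device memory and a
p2p-capable BAR side by side. hmm_devmem_add() follows the HMM patchset of
this era, while p2p_register_bar() is the hypothetical helper sketched
earlier in this digest; gpu_device_create(), gpu_devmem_ops, and
GPU_P2P_BAR are purely made-up driver names:

#include <linux/pci.h>
#include <linux/hmm.h>

static int gpu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        struct gpu_device *gdev;
        int ret;

        gdev = gpu_device_create(pdev);         /* hypothetical driver setup */
        if (IS_ERR(gdev))
                return PTR_ERR(gdev);

        /* Opt in to HMM for migratable device memory. */
        gdev->devmem = hmm_devmem_add(&gpu_devmem_ops, &pdev->dev,
                                      gdev->vram_size);
        if (IS_ERR(gdev->devmem))
                return PTR_ERR(gdev->devmem);

        /* Opt in to peer-to-peer DMA for the BAR exposed to peer devices. */
        ret = p2p_register_bar(pdev, GPU_P2P_BAR);
        if (ret)
                return ret;

        return 0;
}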


> >>  This amazingly gets us the get_dev_pagemap
> >> architecture which also uses a struct device. So by using a p2pmem
> >> device we can go from struct page to struct device to p2pmem device
> >> quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?
> 
> >> 3) You wouldn't want to use the pci's struct device because it doesn't
> >> really describe what's going on. For example, there may be multiple
> >> devices on the pci device in question: eg. an NVME card and some p2pmem.
> > 
> > What is "some p2pmem" ?
> >> Or it could be a NIC with some p2pmem.
> > 
> > Again what is "some p2pmem" ?
> 
> Some device local memory intended for use as a DMA target from a
> neighbour device or itself. On a PCI device, this would be a BAR, or a
> portion of a BAR with memory behind it.
> 
> Keep in mind device classes tend to carve out common use cases and don't
> have a one to one mapping with a physical pci card.
> 
> > That a device might have some memory-like buffer space is all well and
> > good but does it need to be specifically distinguished at the device
> > level ? It could be inherent to what the device is... for example again
> > take the GPU example, why would you call the FB memory "p2pmem" ? 
> 
> Well if you are using it for p2p transactions why wouldn't you call it
> p2pmem? There's no technical downside here except some vague argument
> over naming. Once registered as p2pmem, that device will handle all the
> dma map stuff for you and have a central obvious place to put code which
> helps decide whether to use it or not based on topology.
> 
> I can certainly see an issue you'd have with the current RFC in that the
> p2pmem device currently also handles memory allocation which a GPU would
>  want to do itself. There are plenty of solutions to this though: we
> could provide hooks for the parent device to override allocation or
> something like that. However, the use cases I'm concerned with don't do
> their own allocation so that is an important feature for them.

This seems to duplicate things that already exist in each individual
driver. If a device has memory, then the device driver already has some
form of memory management and most likely exposes some API to userspace
to allow pro