Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Serguei Sagalovitch

On 2017-01-05 07:30 PM, Jason Gunthorpe wrote:

 but I am opposed to
the idea we need two API paths that the *driver* has to figure out.
That is fundamentally not what I want as a driver developer.

Give me a common API to convert '__user *' to a scatter list and pin
the pages.

Completely agreed. IMHO there is no sense in duplicating the same logic
everywhere, nor in trying to find all the places where it is missing.
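
Just to illustrate the shape of the interface being discussed, here is a
minimal sketch of a hypothetical get_user_dma_sg()-style helper (the name
comes from Jason's earlier suggestion; neither function nor the exact
signature exists in the kernel today, this is purely illustrative):

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical sketch only: a single entry point that takes a user
 * virtual address range and hands back a pinned, DMA-mapped
 * scatterlist, regardless of whether the range is backed by system
 * RAM, ZONE_DEVICE pages or peer PCIe BAR memory.
 */
int get_user_dma_sg(struct device *dma_dev, void __user *uaddr, size_t len,
                    enum dma_data_direction dir, struct sg_table *sgt);

void put_user_dma_sg(struct device *dma_dev, enum dma_data_direction dir,
                     struct sg_table *sgt);

The point being that every driver would call the same helper instead of
open-coding get_user_pages() plus its own special cases for peer memory.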

Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-30 Thread Serguei Sagalovitch

On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:

Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

I don't know... We could use send_sig_info() to send a signal from the
kernel to user space, so theoretically the GPU driver could issue a KILL
signal to some process.
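
For what it's worth, a bare-bones sketch of that idea (assuming we already
hold a valid task_struct reference for the victim process; header locations
differ between kernel versions):

#include <linux/sched/signal.h>

/* Rough sketch only: deliver SIGKILL to a given process from a kernel
 * driver, e.g. when GPU/VRAM memory is exhausted and the driver decides
 * to kill the offending process itself instead of involving the OOM
 * killer. */
static void gpu_oom_kill(struct task_struct *task)
{
        /* SEND_SIG_PRIV marks the signal as originating in the kernel */
        send_sig_info(SIGKILL, SEND_SIG_PRIV, task);
}

Whether the GPU driver should be making that policy decision at all is of
course the open question in this sub-thread.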


On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:

I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.
I do not think that using the DMA-API as it is today is the best solution
(at least in its current form):


- It deals with handles/fds for the whole allocation, but a client could
(and will) use sub-allocation, and it is theoretically possible to "merge"
several allocations into one from the GPU perspective.
- It requires explicit export, but because "sharing" is controlled from
user space, we would have to "export" every allocation by default.
- It deals with fds/handles, but a user application may work with
addresses/pointers.


Also, the current DMA-API forces us to redo all DMA table programming every
time, regardless of whether the location actually changed or not. With a
vma / mmu notifier we are able to intercept changes in location and update
the translation tables only as needed (and we do not need to keep pages
pinned via get_user_pages()).
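
For reference, this is roughly the notifier hook being referred to (a
sketch only; the callback arguments have changed across kernel versions, so
treat it as an illustration of the idea rather than copy-paste code):

#include <linux/mmu_notifier.h>

/* Sketch: get called whenever part of the process address space that
 * backs a device mapping is about to change, so the device page tables
 * can be updated lazily instead of being reprogrammed on every use. */
static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end)
{
        /* invalidate/refresh the device translation entries that
         * cover [start, end) before the CPU mapping changes */
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

/* registration is done per mm, e.g.:
 *     notifier->ops = &my_mn_ops;
 *     mmu_notifier_register(notifier, current->mm);
 */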


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:

On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:

As soon as the PeerDirect mapping is created, the GPU must not "move" such
memory. That is by PeerDirect design. It is similar to how it works with
system memory and RDMA MRs: once "get_user_pages" is called, the memory is
pinned.

We haven't touched this in a long time and perhaps it changed, but there
definitely was a callback in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

I assume you are talking about the invalidate_peer_memory() callback?
I was told that it is the "last resort", because the HCA (and driver) is
not able to handle it in a safe manner, so it basically "aborts"
everything.




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:


On 28/11/16 09:57 AM, Jason Gunthorpe wrote:

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping.

As soon as the PeerDirect mapping is created, the GPU must not "move" such
memory. That is by PeerDirect design. It is similar to how it works with
system memory and RDMA MRs: once "get_user_pages" is called, the memory is
pinned.
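
For reference, the pinning pattern being referred to looks roughly like
this on the system-memory side (a sketch only, error handling omitted; the
exact get_user_pages*() signature differs between kernel versions, and
peer/VRAM memory is precisely the case where this does not work today,
because there are no struct pages behind the BAR):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Sketch: pin the pages behind a user buffer and build a DMA-mappable
 * scatterlist from them.  Once pinned, the pages cannot be migrated
 * until they are released with put_page(). */
static int pin_user_buffer(struct device *dev, unsigned long uaddr,
                           size_t len, struct page **pages,
                           struct sg_table *sgt)
{
        unsigned int npages = DIV_ROUND_UP(offset_in_page(uaddr) + len,
                                           PAGE_SIZE);

        if (get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages) != npages)
                return -EFAULT;         /* partial pins not handled here */

        sg_alloc_table_from_pages(sgt, pages, npages,
                                  offset_in_page(uaddr), len, GFP_KERNEL);

        /* DMA addresses stay valid until dma_unmap_sg() + put_page() */
        return dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
                          DMA_BIDIRECTIONAL) ? 0 : -EIO;
}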



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work, but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure out which application gets
what memory.
About the one-application restriction: is it per memory mapping? I assume
that it should not be a problem for one application to do transfers to
several devices simultaneously? Am I right?

Maybe we should follow the RDMA MR design and register memory for p2p
transfer from user space?

What about the following:

a) A device DAX is created.
b) A "normal" (movable, etc.) allocation is done for PCIe memory, and a
CPU pointer / CPU access is requested.
c) p2p_mr_register() is called and a CPU pointer (mmap() on the DAX
device) is returned. Accordingly, such memory is marked as "unmovable" by
e.g. the graphics driver.
d) When p2p is no longer needed, p2p_mr_unregister() is called.

What do you think? Will it work? (A rough sketch of what such an interface
might look like follows below.)
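
To make the proposal a bit more concrete, a hedged sketch of the
(purely hypothetical) interface: p2p_mr_register()/p2p_mr_unregister() do
not exist anywhere, and the names and arguments below are only
illustrative:

/* Hypothetical API, following the RDMA MR model: memory is explicitly
 * registered for p2p and becomes "unmovable" (e.g. pinned in VRAM by the
 * graphics driver) for the lifetime of the registration. */

struct p2p_mr;          /* opaque handle returned by the kernel */

/* 'addr' is the CPU pointer obtained from mmap() on the device-DAX node
 * (step c); the call pins the backing device memory and returns a handle
 * usable for p2p transfers. */
struct p2p_mr *p2p_mr_register(void *addr, size_t length,
                               unsigned int access_flags);

/* Undo step c: the allocation becomes movable/evictable again. */
int p2p_mr_unregister(struct p2p_mr *mr);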




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan
I agree that it is better to leave it up to user space to check what is
working and what is not. I found that writes practically always work, but
reads very often do not. Also, sometimes a system BIOS update can fix the
issue.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch

On 2016-11-25 08:22 AM, Christian König wrote:



Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?


Depends on the policy in command, but currently it's the other way 
around most of the time.


E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids 
reading because that is slow, the GPU in turn can access it with full 
speed.


When we run out of VRAM we move those allocations to system memory and 
update both the CPU as well as the GPU page tables.


So that move is transparent for both userspace as well as shaders 
running on the GPU.

I would like to add a few more points in relation to CPU access:

a) We could have a CPU-accessible part of VRAM (the part "inside" the PCIe
BAR) and a non-CPU-accessible part. As a result, if the user needs CPU
access, the memory should be located in the CPU-accessible part of VRAM or
in system memory.

The application / user-mode driver could specify location preferences or
hints based on its assumptions or knowledge about access-pattern
requirements, game resolution, the size of VRAM, etc. So if CPU access
performance is critical, such memory should be allocated in system memory
as the first (and maybe only) choice.

b) An allocation may not have a CPU address at all - only a GPU one. Also,
we may not be able to provide CPU addresses/access for all of VRAM, but the
memory may still be migrated in any case, regardless of whether we have a
CPU address or not.

c) Regarding "VRAM, it becomes non-cachable": strictly speaking, VRAM is
configured as WC (write-combined) to provide fast CPU write access. It was
also found that, when CPU access is not performance-critical, it may be
useful to allocate/program system memory as WC too, to avoid the extra
"snooping" needed to synchronize with CPU caches during GPU access. So
potentially system memory could be WC as well.
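
As an illustration of point (c), this is roughly how the CPU-visible part
of VRAM ends up write-combined (a sketch only; real drivers go through TTM
and the PAT helpers rather than open-coding this):

#include <linux/pci.h>
#include <linux/io.h>

/* Sketch: map a chunk of a PCIe BAR (the VRAM aperture) as
 * write-combined.  CPU writes are fast because they get merged in the
 * WC buffers, while CPU reads remain uncached and therefore slow. */
static void __iomem *map_vram_wc(struct pci_dev *pdev, int bar,
                                 resource_size_t offset, size_t size)
{
        return ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
}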




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Serguei Sagalovitch


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:

On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:

Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:

There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.

Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
PCI BAR and another PCI device initiates a DMA to this address the DMA
subsystem must be able to check if the interconnection really works.

I said the hardware doesn't care.. You are right, we still have an
outstanding problem in Linux of how to generically DMA map a P2P
address - which is a different issue from getting the P2P address from
a __user pointer...

Jason
I agree, but the problem is that one issue immediately introduces another
one to solve, and so on (if we do not want to cut corners). I would think
that a lot of them are interconnected, because the way one problem is
solved may impact the solution for another.

BTW, about "DMA map a p2p address": right now, to enable p2p between
devices it is required/recommended to disable IOMMU support (e.g. the intel
iommu driver has special logic for graphics, and the comment "Reserve all
PCI MMIO to avoid peer-to-peer access").


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch



On 2016-11-23 02:32 PM, Jason Gunthorpe wrote:

On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:

On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back.

Please note that in the general case (including the MR one) we could have a
"page fault" from a different PCIe device, so all PCIe devices must be
synchronized.

Standard RDMA MRs require pinned pages, the DMA address cannot change
while the MR exists (there is no hardware support for this at all), so
page faulting from any other device is out of the question while they
exist. This is the same requirement as typical simple driver DMA which
requires pages pinned until the simple device completes DMA.

ODP RDMA MRs do not require that, they just page fault like the CPU or
really anything and the kernel has to make sense of concurrent page
faults from multiple sources.

The upshot is that GPU scenarios that rely on highly dynamic
virtual->physical translation cannot sanely be combined with standard
long-life RDMA MRs.
We do not want "highly" dynamic translation due to the performance cost.
We need to support "overcommit" but would like to minimize the impact.

To support RDMA MRs for GPU/VRAM/PCIe device memory (which is a must), we
either need to globally force pinning for the scope of get_user_pages() /
put_pages(), or have special handling for RDMA MRs and similar cases.
Generally it could be difficult to correctly handle "DMA in progress", due
to the facts that (a) DMA could originate from numerous PCIe devices
simultaneously, including requests to receive network data, and (b) in the
HSA case DMA could originate from user space without the kernel driver's
knowledge. So without corresponding h/w support everywhere, I do not see
how it could be solved effectively.

Certainly, any solution for GPUs must follow the typical page pinning
semantics, changing the DMA address of a page must be blocked while
any DMA is in progress.

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently it
allows "mirroring" of "malloc" memory on the GPU, which is not always what
is needed. Additionally, there is a need to be able to share VRAM
allocations between different processes.

Humm, so it can be removed from Alexander's list then :\
HMM is very useful for some types of scenarios, and it could significantly
simplify (for performance) the implementation of some features, e.g.
OpenCL 2.0 SVM.


As Dan suggested, maybe we need to do both. Some kind of fix for
get_user_pages() for smaller mappings (eg ZONE_DEVICE) and a mandatory
API conversion to get_user_dma_sg() for other cases?

Jason


Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 03:51 AM, Christian König wrote:

Am 23.11.2016 um 08:49 schrieb Daniel Vetter:

On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:

On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <dan...@ffwll.ch> wrote:

On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

On 2016-11-22 03:10 PM, Daniel Vetter wrote:
On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.willi...@intel.com> wrote:

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

I personally like "device-DAX" idea but my concerns are:

-  How well it will co-exists with the  DRM infrastructure /
implementations
 in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.


- How well we will be able to handle the case when we need to
"move"/"evict" memory/data to a new location, so the CPU pointer should
point to the new physical location/address (and maybe not in PCI device
memory at all)?
So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.
For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic,
It is possible that it goes the other way around: memory is requested to be
allocated and should be kept in VRAM for performance reasons, but due to a
possible overcommit case we need to at least temporarily "move" such an
allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior.  The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
to reuse all the existing allocation policies directly, those won't 
work.

So at boot-up your default numa policy would exclude any vram nodes.

But I think (as an -mm layman) that numa gives us a lot of the tools and
policy interface that we need to implement what we want for gpus.


Agree completely. From a ten-mile-high view our GPUs are just command
processors with local memory as well.


Basically this is also the whole idea of what AMD is pushing with HSA 
for a while.


It's just that a lot of problems start to pop up when you look at all 
the nasty details. For example only part of the GPU memory is usually 
accessible by the CPU.


So even when numa nodes expose a good foundation for this I think 
there is still a lot of code to write.


BTW: I should probably start to read into the numa code of the kernel. 
Any good pointers for that?
I would assume that "page" allocation logic itself should be inside of 
graphics driver due to possible different requirements especially from 
graphics: alignment, etc.




Regards,
Christian.


Wrt isolation: There's a sliding scale of what different users expect,
from full auto everything, including migrating pages around if needed to
full isolation all seems to be on the table. As long as we keep vram 
nodes
out of any default allocation numasets, full isolation should be 
possible.

-Daniel





Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:


an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back.

Please note that in the general case (including the MR one) we could have a
"page fault" from a different PCIe device, so all PCIe devices must be
synchronized.

includes the usual fencing mechanism so the CPU can block, flush, and
then evict a page coherently.

This is the general direction the industry is going in: Link PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a
user process and then use the usual CPU machinery to do the page
translation for the DMA. This includes page faults for evicted pages,
and obviously allows eviction and migration..

So, of all the solutions in the original list, I would discard
anything that isn't VMA focused. Emulating what CAPI does in hardware
with software is probably the best choice, or we have to do it all
again when CAPI style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache, movement or
swap of the potentially peer-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from
a VMA: short term pin or long term coherent translation mirror.

So, to my view (looking from RDMA), the main problem with peer-peer is
how do you DMA translate VMA's that point at non struct page memory?

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently it
allows "mirroring" of "malloc" memory on the GPU, which is not always what
is needed. Additionally, there is a need to be able to share VRAM
allocations between different processes.

 From a RDMA perspective we could use something other than
get_user_pages() to pin and DMA translate a VMA if the core community
could decide on an API. eg get_user_dma_sg() would probably be quite
usable.

Jason




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Serguei Sagalovitch



On 2016-11-22 03:10 PM, Daniel Vetter wrote:

On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.willi...@intel.com> wrote:

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

I personally like "device-DAX" idea but my concerns are:

-  How well it will co-exists with the  DRM infrastructure / implementations
in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.


- How well we will be able to handle the case when we need to
"move"/"evict" memory/data to a new location, so the CPU pointer should
point to the new physical location/address (and maybe not in PCI device
memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic,

It is possible that it goes the other way around: memory is requested to be
allocated and should be kept in VRAM for performance reasons, but due to a
possible overcommit case we need to at least temporarily "move" such an
allocation to system memory.

  but still visible on both the cpu and gpu
side in some form. Special device to allocate memory, and not being
able to migrate stuff around sound like misfeatures from that pov.
-Daniel

