Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-02-19 Thread Ralph Campbell



On 1/16/20 12:21 PM, Jason Gunthorpe wrote:

On Thu, Jan 16, 2020 at 12:16:30PM -0800, Ralph Campbell wrote:

Can you point me to the latest ODP code? Seems like my understanding is
quite off.


https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/infiniband/hw/mlx5/odp.c

Look for the word 'implicit'

mlx5_ib_invalidate_range() releases the interval_notifier when there are
no populated shadow PTEs in its leaf

pagefault_implicit_mr() creates an interval_notifier that covers the
level in the page table that needs population. Notice it just uses an
unlocked xa_load to find the page table level.

The locking is pretty tricky as it relies on RCU, but the fault flow
is fairly lightweight.

Jason
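
A minimal sketch of the lookup pattern Jason describes above, for readers
following along: shadow page-table levels live in an xarray, and the fault
path probes it with an unlocked xa_load(). The struct and names here are
illustrative only, not the actual mlx5 odp.c code.

	/* Illustrative only -- not mlx5's real data structure. */
	struct shadow_level {
		struct mmu_interval_notifier notifier;	/* covers this level's VA span */
		unsigned long npages;			/* populated shadow PTEs */
	};

	static struct shadow_level *fault_find_level(struct xarray *levels,
						     unsigned long va,
						     unsigned int level_shift)
	{
		/* xa_load() is safe without a lock; the entry must still be
		 * revalidated under the driver lock before use. */
		return xa_load(levels, va >> level_shift);
	}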


Thanks for the information, Jason.

I'm still interested in finding a way to support range-based hints to device
drivers. madvise() looks like it only sets a bit in vma->vm_flags or acts on
the advice immediately. mbind() and set_mempolicy() only work with CPUs and
memory with a NUMA node number. What I'm looking for is a way for the device
to know whether to migrate pages to device private memory on a fault, whether
to duplicate read-only pages in device private memory, or whether to remote
map/access a page instead of migrating it.
For example, there is a working draft extension to OpenCL,
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc
that could provide a way to specify this sort of advice (a hedged sketch of
what that might look like follows below).
C++ is also looking at extensions for specifying affinity attributes.
In any case, these are probably a long way off from being finalized and
implemented.
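
As a rough illustration of the kind of advice call meant here, using the
draft extension's entry points (the advice token below is hypothetical; the
draft reserves advice values rather than defining any):

	cl_int err;
	void *buf = clSharedMemAllocINTEL(ctx, dev, NULL, 1 << 20, 0, &err);

	/* Hypothetical advice token: migrate this range to device memory
	 * on fault rather than mapping it remotely. */
	err = clEnqueueMemAdviseINTEL(queue, buf, 1 << 20,
				      CL_MEM_ADVICE_MIGRATE_ON_FAULT_INTEL,
				      0, NULL, NULL);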

I also have some changes to support THP migration to device private memory,
but that will require updating nouveau to use 2MB TLB mappings.

In the meantime, I can update the HMM self tests to do something like ODP
without changing mm/mmu_notifier.c, but I don't think I can easily change
nouveau to that model.


Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-16 Thread Jason Gunthorpe
On Thu, Jan 16, 2020 at 12:16:30PM -0800, Ralph Campbell wrote:
> Can you point me to the latest ODP code? Seems like my understanding is
> quite off.

https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/infiniband/hw/mlx5/odp.c

Look for the word 'implicit'

mlx5_ib_invalidate_range() releases the interval_notifier when there are
no populated shadow PTEs in its leaf

pagefault_implicit_mr() creates an interval_notifier that covers the
level in the page table that needs population. Notice it just uses an
unlocked xa_load to find the page table level.

The locking is pretty tricky as it relies on RCU, but the fault flow
is fairly lightweight.

Jason


Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-16 Thread Ralph Campbell



On 1/16/20 8:00 AM, Jason Gunthorpe wrote:

On Wed, Jan 15, 2020 at 02:09:47PM -0800, Ralph Campbell wrote:


I don't understand the lifetime/membership issue. The driver is the only thing
that allocates, inserts, or removes struct mmu_interval_notifier and thus
completely controls the lifetime.


If the returned value is on the deferred list, it could be freed at any
moment. The existing locks do not prevent it.


+   ret = nouveau_svmm_interval_find(svmm, &range);
+   if (ret) {
+   up_read(&mm->mmap_sem);
+   return ret;
+   }
+   range.notifier_seq = mmu_interval_read_begin(range.notifier);
ret = hmm_range_fault(&range, 0);
up_read(&mm->mmap_sem);
if (ret <= 0) {


I'm still not sure this is a better approach than what ODP does. It
looks very expensive on the fault path..

Jason



ODP doesn't have this problem because users have to call ib_reg_mr()
before any I/O can happen to the process address space.


ODP supports a single 'full VA' call at process startup, just like
these cases.


That is when mmu_interval_notifier_insert() /
mmu_interval_notifier_remove() can be called and the driver doesn't
have to worry about the interval changing sizes or being removed
while I/O is happening.


No, for the 'ODP full process VA' (aka implicit ODP) mode it
dynamically maintains a list of intervals. ODP chooses to align the
dynamic intervals to its HW page table levels, and not to SW VMAs.
This is much simpler to manage and faster to fault, at the cost of
capturing more VA for invalidations which have to be probed against
the HW shadow PTEs.
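
For context, the userspace side of implicit ODP ('full process VA') is a
single registration at startup that covers the whole address space; access
flags beyond IBV_ACCESS_ON_DEMAND are illustrative:

	/* One implicit-ODP MR spans all of VA; pages fault in on demand. */
	struct ibv_mr *mr = ibv_reg_mr(pd, NULL, SIZE_MAX,
				       IBV_ACCESS_ON_DEMAND |
				       IBV_ACCESS_LOCAL_WRITE);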


It isn't that expensive: there is an extra driver lock/unlock as
part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC)
for new intervals, plus the deferred interval updates for munmap().
Compared to the cost of updating PTEs in the device and GPU fault
handling, this is minimal overhead.


Well, compared to ODP which does a single xa lookup with no lock to
find its interval, this looks very expensive and not parallel.

I think if there is merit in having ranges cover the vmas and track
changes then there is probably merit in having the core code provide
much of that logic, not the driver.

But it would be interesting to see some kind of analysis on the two
methods to decide if the complexity is worthwhile.

Jason



Can you point me to the latest ODP code? Seems like my understanding is
quite off.


Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-16 Thread Jason Gunthorpe
On Wed, Jan 15, 2020 at 02:09:47PM -0800, Ralph Campbell wrote:

> I don't understand the lifetime/membership issue. The driver is the only thing
> that allocates, inserts, or removes struct mmu_interval_notifier and thus
> completely controls the lifetime.

If the returned value is on the deferred list, it could be freed at any
moment. The existing locks do not prevent it.

> > > + ret = nouveau_svmm_interval_find(svmm, &range);
> > > + if (ret) {
> > > + up_read(&mm->mmap_sem);
> > > + return ret;
> > > + }
> > > + range.notifier_seq = mmu_interval_read_begin(range.notifier);
> > >   ret = hmm_range_fault(&range, 0);
> > >   up_read(&mm->mmap_sem);
> > >   if (ret <= 0) {
> > 
> > I'm still not sure this is a better approach than what ODP does. It
> > looks very expensive on the fault path..
> > 
> > Jason
> > 
> 
> ODP doesn't have this problem because users have to call ib_reg_mr()
> before any I/O can happen to the process address space.

ODP supports a single 'full VA' call at process startup, just like
these cases.

> That is when mmu_interval_notifier_insert() /
> mmu_interval_notifier_remove() can be called and the driver doesn't
> have to worry about the interval changing sizes or being removed
> while I/O is happening.  

No, for the 'ODP full process VA' (aka implicit ODP) mode it
dynamically maintains a list of intervals. ODP chooses to align the
dynamic intervals to its HW page table levels, and not to SW VMAs.
This is much simpler to manage and faster to fault, at the cost of
capturing more VA for invalidations which have to be probed against
the HW shadow PTEs.

> It isn't that expensive: there is an extra driver lock/unlock as
> part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC)
> for new intervals, plus the deferred interval updates for munmap().
> Compared to the cost of updating PTEs in the device and GPU fault
> handling, this is minimal overhead.

Well, compared to ODP which does a single xa lookup with no lock to
find its interval, this looks very expensive and not parallel.

I think if there is merit in having ranges cover the vmas and track
changes then there is probably merit in having the core code provide
much of that logic, not the driver.

But it would be interesting to see some kind of analysis on the two
methods to decide if the complexity is worthwhile.

Jason


Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-15 Thread Ralph Campbell



On 1/14/20 5:00 AM, Jason Gunthorpe wrote:

On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:

  void
  nouveau_svmm_fini(struct nouveau_svmm **psvmm)
  {
struct nouveau_svmm *svmm = *psvmm;
+   struct mmu_interval_notifier *mni;
+
if (svmm) {
mutex_lock(&svmm->mutex);
+   while (true) {
+   mni = mmu_interval_notifier_find(svmm->mm,
+   &nouveau_svm_mni_ops, 0UL, ~0UL);
+   if (!mni)
+   break;
+   mmu_interval_notifier_put(mni);


Oh, now I really don't like the name 'put'. It looks like mni is
refcounted here, and it isn't. put should be called 'remove_deferred'


OK.


And then you also need a way to barrier this scheme on driver unload.


Good point. I can add something like
void mmu_interval_notifier_synchronize(struct mm_struct *mm)
that waits for deferred operations to complete, similar to
mmu_interval_read_begin().
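
A rough sketch of that idea, modeled on the deferred-completion wait in
mmu_interval_notifier_remove() (field and helper names are from the
v5.5-era mm/mmu_notifier.c; a sketch, not a tested patch):

	void mmu_interval_notifier_synchronize(struct mm_struct *mm)
	{
		struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
		unsigned long seq = 0;

		/* Deferred inserts/removes are applied when the last
		 * invalidation ends and invalidate_seq advances. */
		spin_lock(&mmn_mm->lock);
		if (mn_itree_is_invalidating(mmn_mm))
			seq = mmn_mm->invalidate_seq;
		spin_unlock(&mmn_mm->lock);

		if (seq)
			wait_event(mmn_mm->wq,
				   READ_ONCE(mmn_mm->invalidate_seq) != seq);
	}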


+   }
svmm->vmm = NULL;
mutex_unlock(&svmm->mutex);
-   mmu_notifier_put(&svmm->notifier);


While here it was actually a refcount.


+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
+const struct mmu_notifier_range *range)
+{
+   struct svmm_interval *smi =
+   container_of(mni, struct svmm_interval, notifier);
+   struct nouveau_svmm *svmm = smi->svmm;
+   unsigned long start = mmu_interval_notifier_start(mni);
+   unsigned long last = mmu_interval_notifier_last(mni);


This whole algorithm only works if it is protected by the read side of
the interval tree lock. Deserves at least a comment if not an
assertion too.


This is called from the invalidate() callback while holding the
driver page table lock, so the struct mmu_interval_notifier and
the interval tree can't change.
I will add comments for v7.
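
One way the v7 comment and assertion could look, assuming svmm->mutex is
the driver lock held around the invalidate() callback (illustrative, not
the posted patch):

	static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
					  const struct mmu_notifier_range *range)
	{
		struct svmm_interval *smi =
			container_of(mni, struct svmm_interval, notifier);
		struct nouveau_svmm *svmm = smi->svmm;

		/*
		 * Only called from invalidate() with svmm->mutex held, so
		 * mni and the interval tree cannot change under us.
		 */
		lockdep_assert_held(&svmm->mutex);
		/* ... */
	}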


  static int nouveau_range_fault(struct nouveau_svmm *svmm,
   struct nouveau_drm *drm, void *data, u32 size,
-  u64 *pfns, struct svm_notifier *notifier)
+  u64 *pfns, u64 start, u64 end)
  {
unsigned long timeout =
jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
/* Have HMM fault pages within the fault window to the GPU. */
struct hmm_range range = {
-   .notifier = &notifier->notifier,
-   .start = notifier->notifier.interval_tree.start,
-   .end = notifier->notifier.interval_tree.last + 1,
+   .start = start,
+   .end = end,
.pfns = pfns,
.flags = nouveau_svm_pfn_flags,
.values = nouveau_svm_pfn_values,
+   .default_flags = 0,
+   .pfn_flags_mask = ~0UL,
.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT,
};
-   struct mm_struct *mm = notifier->notifier.mm;
+   struct mm_struct *mm = svmm->mm;
long ret;
  
  	while (true) {

if (time_after(jiffies, timeout))
return -EBUSY;
  
-		range.notifier_seq = mmu_interval_read_begin(range.notifier);

-   range.default_flags = 0;
-   range.pfn_flags_mask = -1UL;
down_read(&mm->mmap_sem);


mmap sem doesn't have to be held for the interval search, and again we
have lifetime issues with the membership here.


I agree mmap_sem isn't needed for the interval search; it is needed if
the search doesn't find a registered interval and one needs to be created
to cover the underlying VMA. If an arbitrary-size interval was created
instead, then mmap_sem wouldn't be needed.
I don't understand the lifetime/membership issue. The driver is the only thing
that allocates, inserts, or removes struct mmu_interval_notifier and thus
completely controls the lifetime.


+   ret = nouveau_svmm_interval_find(svmm, &range);
+   if (ret) {
+   up_read(&mm->mmap_sem);
+   return ret;
+   }
+   range.notifier_seq = mmu_interval_read_begin(range.notifier);
ret = hmm_range_fault(&range, 0);
up_read(&mm->mmap_sem);
if (ret <= 0) {


I'm still not sure this is a better approach than what ODP does. It
looks very expensive on the fault path..

Jason



ODP doesn't have this problem because users have to call ib_reg_mr()
before any I/O can happen to the process address space. That is when
mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can
be called and the driver doesn't have to worry about the interval
changing sizes or being removed while I/O is happening.
For GPU-like devices, I'm trying to allow hardware access to any user-level
address without pre-registering it. That means inserting mmu interval
notifiers for the ranges the GPU page faults on and updating the intervals
as munmap() calls 

Re: [Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-14 Thread Jason Gunthorpe
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:
>  void
>  nouveau_svmm_fini(struct nouveau_svmm **psvmm)
>  {
>   struct nouveau_svmm *svmm = *psvmm;
> + struct mmu_interval_notifier *mni;
> +
>   if (svmm) {
>   mutex_lock(&svmm->mutex);
> + while (true) {
> + mni = mmu_interval_notifier_find(svmm->mm,
> + &nouveau_svm_mni_ops, 0UL, ~0UL);
> + if (!mni)
> + break;
> + mmu_interval_notifier_put(mni);

Oh, now I really don't like the name 'put'. It looks like mni is
refcounted here, and it isn't. put should be called 'remove_deferred'

And then you also need a way to barrier this scheme on driver unload.

> + }
>   svmm->vmm = NULL;
>   mutex_unlock(&svmm->mutex);
> - mmu_notifier_put(&svmm->notifier);

While here it was actually a refcount.

> +static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
> +  const struct mmu_notifier_range *range)
> +{
> + struct svmm_interval *smi =
> + container_of(mni, struct svmm_interval, notifier);
> + struct nouveau_svmm *svmm = smi->svmm;
> + unsigned long start = mmu_interval_notifier_start(mni);
> + unsigned long last = mmu_interval_notifier_last(mni);

This whole algorithm only works if it is protected by the read side of
the interval tree lock. Deserves at least a comment if not an
assertion too.

>  static int nouveau_range_fault(struct nouveau_svmm *svmm,
>  struct nouveau_drm *drm, void *data, u32 size,
> -u64 *pfns, struct svm_notifier *notifier)
> +u64 *pfns, u64 start, u64 end)
>  {
>   unsigned long timeout =
>   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>   /* Have HMM fault pages within the fault window to the GPU. */
>   struct hmm_range range = {
> - .notifier = &notifier->notifier,
> - .start = notifier->notifier.interval_tree.start,
> - .end = notifier->notifier.interval_tree.last + 1,
> + .start = start,
> + .end = end,
>   .pfns = pfns,
>   .flags = nouveau_svm_pfn_flags,
>   .values = nouveau_svm_pfn_values,
> + .default_flags = 0,
> + .pfn_flags_mask = ~0UL,
>   .pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT,
>   };
> - struct mm_struct *mm = notifier->notifier.mm;
> + struct mm_struct *mm = svmm->mm;
>   long ret;
>  
>   while (true) {
>   if (time_after(jiffies, timeout))
>   return -EBUSY;
>  
> - range.notifier_seq = mmu_interval_read_begin(range.notifier);
> - range.default_flags = 0;
> - range.pfn_flags_mask = -1UL;
>   down_read(&mm->mmap_sem);

mmap sem doesn't have to be held for the interval search, and again we
have lifetime issues with the membership here.

> + ret = nouveau_svmm_interval_find(svmm, &range);
> + if (ret) {
> + up_read(&mm->mmap_sem);
> + return ret;
> + }
> + range.notifier_seq = mmu_interval_read_begin(range.notifier);
>   ret = hmm_range_fault(&range, 0);
>   up_read(&mm->mmap_sem);
>   if (ret <= 0) {

I'm still not sure this is a better approach than what ODP does. It
looks very expensive on the fault path..

Jason
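
For readers unfamiliar with the API under discussion, the retry pattern
mmu_interval_notifier is built around looks roughly like this (condensed;
driver_lock()/driver_update_ptes() stand in for the driver's own pieces):

	again:
		range.notifier_seq = mmu_interval_read_begin(range.notifier);
		down_read(&mm->mmap_sem);
		ret = hmm_range_fault(&range, 0);
		up_read(&mm->mmap_sem);
		if (ret <= 0)
			return ret ? ret : -EBUSY;

		driver_lock();	/* same lock the invalidate callback takes */
		if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
			driver_unlock();
			goto again;	/* raced with an invalidation */
		}
		driver_update_ptes(&range);	/* program device PTEs */
		driver_unlock();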


[Nouveau] [PATCH v6 5/6] nouveau: use new mmu interval notifiers

2020-01-13 Thread Ralph Campbell
Update nouveau to only use the mmu interval notifiers.

Signed-off-by: Ralph Campbell 
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 313 +-
 1 file changed, 201 insertions(+), 112 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index df9bf1fd1bc0..0343e48d41d7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -88,7 +88,7 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst)
 }
 
 struct nouveau_svmm {
-   struct mmu_notifier notifier;
+   struct mm_struct *mm;
struct nouveau_vmm *vmm;
struct {
unsigned long start;
@@ -98,6 +98,13 @@ struct nouveau_svmm {
struct mutex mutex;
 };
 
+struct svmm_interval {
+   struct mmu_interval_notifier notifier;
+   struct nouveau_svmm *svmm;
+};
+
+static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops;
+
#define SVMM_DBG(s,f,a...) \
	NV_DEBUG((s)->vmm->cli->drm, "svm-%p: "f"\n", (s), ##a)
#define SVMM_ERR(s,f,a...) \
	NV_ERROR((s)->vmm->cli->drm, "svm-%p: "f"\n", (s), ##a)
@@ -236,6 +243,8 @@ nouveau_svmm_join(struct nouveau_svmm *svmm, u64 inst)
 static void
 nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit)
 {
+   SVMM_DBG(svmm, "invalidate %016llx-%016llx", start, limit);
+
if (limit > start) {
bool super = svmm->vmm->vmm.object.client->super;
svmm->vmm->vmm.object.client->super = true;
@@ -248,58 +257,25 @@ nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit)
}
 }
 
-static int
-nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
-   const struct mmu_notifier_range *update)
-{
-   struct nouveau_svmm *svmm =
-   container_of(mn, struct nouveau_svmm, notifier);
-   unsigned long start = update->start;
-   unsigned long limit = update->end;
-
-   if (!mmu_notifier_range_blockable(update))
-   return -EAGAIN;
-
-   SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit);
-
-   mutex_lock(&svmm->mutex);
-   if (unlikely(!svmm->vmm))
-   goto out;
-
-   if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
-   if (start < svmm->unmanaged.start) {
-   nouveau_svmm_invalidate(svmm, start,
-   svmm->unmanaged.limit);
-   }
-   start = svmm->unmanaged.limit;
-   }
-
-   nouveau_svmm_invalidate(svmm, start, limit);
-
-out:
-   mutex_unlock(&svmm->mutex);
-   return 0;
-}
-
-static void nouveau_svmm_free_notifier(struct mmu_notifier *mn)
-{
-   kfree(container_of(mn, struct nouveau_svmm, notifier));
-}
-
-static const struct mmu_notifier_ops nouveau_mn_ops = {
-   .invalidate_range_start = nouveau_svmm_invalidate_range_start,
-   .free_notifier = nouveau_svmm_free_notifier,
-};
-
 void
 nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 {
struct nouveau_svmm *svmm = *psvmm;
+   struct mmu_interval_notifier *mni;
+
if (svmm) {
mutex_lock(&svmm->mutex);
+   while (true) {
+   mni = mmu_interval_notifier_find(svmm->mm,
+   &nouveau_svm_mni_ops, 0UL, ~0UL);
+   if (!mni)
+   break;
+   mmu_interval_notifier_put(mni);
+   }
svmm->vmm = NULL;
mutex_unlock(&svmm->mutex);
-   mmu_notifier_put(&svmm->notifier);
+   mmdrop(svmm->mm);
+   kfree(svmm);
*psvmm = NULL;
}
 }
@@ -343,11 +319,12 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
goto out_free;
 
down_write(&current->mm->mmap_sem);
-   svmm->notifier.ops = &nouveau_mn_ops;
-   ret = __mmu_notifier_register(&svmm->notifier, current->mm);
+   ret = __mmu_notifier_register(NULL, current->mm);
if (ret)
goto out_mm_unlock;
-   /* Note, ownership of svmm transfers to mmu_notifier */
+
+   mmgrab(current->mm);
+   svmm->mm = current->mm;
 
cli->svm.svmm = svmm;
cli->svm.cli = cli;
@@ -482,65 +459,212 @@ nouveau_svm_fault_cache(struct nouveau_svm *svm,
fault->inst, fault->addr, fault->access);
 }
 
-struct svm_notifier {
-   struct mmu_interval_notifier notifier;
-   struct nouveau_svmm *svmm;
-};
+static struct svmm_interval *nouveau_svmm_new_interval(
+   struct nouveau_svmm *svmm,
+   unsigned long start,
+   unsigned long last)
+{
+   struct svmm_interval *smi;
+   int ret;
+
+   smi = kmalloc(sizeof(*smi), GFP_ATOMIC);
+   if (!smi)
+   return NULL;
+
+   smi->svmm = svmm;
+