Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/27 16:41, Christian König wrote:
> Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
>> I'm not following. Why don't we need to do like below (given that
>> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
>> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
>> is doing GFP_KERNEL memory allocation with ->lock held for write?
>
> That's a bug which needs to be fixed separately.
>
> Allocating memory with GFP_KERNEL while holding a lock which is also taken in
> the reclaim code path is illegal no matter what you do.
>
> Patches to fix this are already on the appropriate mailing list and will be
> pushed upstream today.
>
> Regards,
> Christian.

Commit 4a2de54dc1d7668f ("drm/amdgpu: fix holding mn_lock while allocating
memory") seems to be calling amdgpu_mn_unlock() without amdgpu_mn_lock() when
drm_sched_job_init() failed...

Michal, you are asking me to fix all bugs (including out of tree code) and
prevent future bugs just because you want to avoid using a timeout in order to
avoid OOM lockup
( https://marc.info/?i=55a3fb37-3246-73d7-0f45-5835a3f48...@i-love.sakura.ne.jp ).
That is too much to ask; it is impossible even for you. The more we count on
the OOM reaper, the more we complicate the dependencies and the more likely we
are to stumble over unreviewed/untested code...

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
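Christian's rule (never allocate with GFP_KERNEL while holding a lock that the reclaim path can also take) can be sketched in userspace C. The pthread mutex below stands in for the notifier lock; every name is illustrative, not the kernel API. Using trylock in the simulated reclaim path lets the sketch report the would-be deadlock instead of hanging the way the real blocking acquisition would:

```c
#include <errno.h>
#include <pthread.h>

/* Stand-in for the notifier lock: also taken by the (simulated) reclaim path. */
static pthread_mutex_t mn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reclaim needs mn_lock; trylock lets us detect the self-deadlock
 * instead of blocking forever like the real acquisition would. */
static int simulated_reclaim(void)
{
	if (pthread_mutex_trylock(&mn_lock) == EBUSY)
		return -1;		/* would deadlock: lock already held by us */
	pthread_mutex_unlock(&mn_lock);
	return 0;
}

/* GFP_KERNEL-style allocation: may enter reclaim when memory is tight. */
static int alloc_may_reclaim(void)
{
	return simulated_reclaim();
}

/* The buggy pattern: allocate while holding a lock reclaim also takes. */
int buggy_submit(void)
{
	int ret;

	pthread_mutex_lock(&mn_lock);	/* like taking mn_lock around submission */
	ret = alloc_may_reclaim();	/* like doing GFP_KERNEL under that lock */
	pthread_mutex_unlock(&mn_lock);
	return ret;			/* -1 marks where the kernel would deadlock */
}
```

The returned -1 corresponds to the point where a real kernel would hang: reclaim runs on the same thread that already owns the lock, so the blocking variant can never succeed.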
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
> On 2018/08/24 22:52, Michal Hocko wrote:
>> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>   */
>>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>  {
>> -	if (blockable)
>> -		mutex_lock(&amn->read_lock);
>> -	else if (!mutex_trylock(&amn->read_lock))
>> -		return -EAGAIN;
>> -
>> +	/*
>> +	 * We can take the sleepable lock even in !blockable mode because
>> +	 * read_lock is only ever taken from this path and the notifier
>> +	 * lock never really sleeps. In fact the only reason why the
>> +	 * latter is sleepable is because the notifier itself might sleep
>> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
>> +	 * before calling into that path.
>> +	 */
>> +	mutex_lock(&amn->read_lock);
>>  	if (atomic_inc_return(&amn->recursion) == 1)
>>  		down_read_non_owner(&amn->lock);
>>  	mutex_unlock(&amn->read_lock);
>
> I'm not following. Why don't we need to do like below (given that
> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
> is doing GFP_KERNEL memory allocation with ->lock held for write?

That's a bug which needs to be fixed separately.

Allocating memory with GFP_KERNEL while holding a lock which is also taken in
the reclaim code path is illegal no matter what you do.

Patches to fix this are already on the appropriate mailing list and will be
pushed upstream today.

Regards,
Christian.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b..e1cb344 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -64,8 +64,6 @@
>   * @node: hash table node to find structure by adev and mn
>   * @lock: rw semaphore protecting the notifier nodes
>   * @objects: interval tree containing amdgpu_mn_nodes
> - * @read_lock: mutex for recursive locking of @lock
> - * @recursion: depth of recursion
>   *
>   * Data for each amdgpu device and process address space.
>   */
> @@ -85,8 +83,6 @@ struct amdgpu_mn {
>  	/* objects protected by lock */
>  	struct rw_semaphore lock;
>  	struct rb_root_cached objects;
> -	struct mutex read_lock;
> -	atomic_t recursion;
>  };
>
>  /**
> @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
>  	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> +		down_read(&amn->lock);
> +	else if (!down_read_trylock(&amn->lock))
>  		return -EAGAIN;
> -
> -	if (atomic_inc_return(&amn->recursion) == 1)
> -		down_read_non_owner(&amn->lock);
> -	mutex_unlock(&amn->read_lock);
> -
>  	return 0;
>  }
>
> @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   */
>  static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
>  {
> -	if (atomic_dec_return(&amn->recursion) == 0)
> -		up_read_non_owner(&amn->lock);
> +	up_read(&amn->lock);
>  }
>
>  /**
> @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>  	amn->type = type;
>  	amn->mn.ops = &amdgpu_mn_ops[type];
>  	amn->objects = RB_ROOT_CACHED;
> -	mutex_init(&amn->read_lock);
> -	atomic_set(&amn->recursion, 0);
>
>  	r = __mmu_notifier_register(&amn->mn, mm);
>  	if (r)
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 22:52, Michal Hocko wrote:
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   */
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take the sleepable lock even in !blockable mode because
> +	 * read_lock is only ever taken from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * latter is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);
>  	mutex_unlock(&amn->read_lock);

I'm not following. Why don't we need to do like below (given that
nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
is doing GFP_KERNEL memory allocation with ->lock held for write?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b..e1cb344 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -64,8 +64,6 @@
  * @node: hash table node to find structure by adev and mn
  * @lock: rw semaphore protecting the notifier nodes
  * @objects: interval tree containing amdgpu_mn_nodes
 - * @read_lock: mutex for recursive locking of @lock
 - * @recursion: depth of recursion
  *
  * Data for each amdgpu device and process address space.
  */
@@ -85,8 +83,6 @@ struct amdgpu_mn {
 	/* objects protected by lock */
 	struct rw_semaphore lock;
 	struct rb_root_cached objects;
-	struct mutex read_lock;
-	atomic_t recursion;
 };

 /**
@@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
 	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
+		down_read(&amn->lock);
+	else if (!down_read_trylock(&amn->lock))
 		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
-
 	return 0;
 }

@@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }

 /**
@@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	amn->type = type;
 	amn->mn.ops = &amdgpu_mn_ops[type];
 	amn->objects = RB_ROOT_CACHED;
-	mutex_init(&amn->read_lock);
-	atomic_set(&amn->recursion, 0);

 	r = __mmu_notifier_register(&amn->mn, mm);
 	if (r)
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Two more worries for this patch.

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * @amn: our notifier
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	mutex_lock(&amn->read_lock);
> +	if (blockable)
> +		mutex_lock(&amn->read_lock);
> +	else if (!mutex_trylock(&amn->read_lock))
> +		return -EAGAIN;
> +
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false ?
I want a comment explaining why it is safe to use a blocking lock here.

>  	mutex_unlock(&amn->read_lock);
> +
> +	return 0;
>  }
>
>  /**

> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	up_write(&hmm->mirrors_sem);
>  }
>
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>  {
>  	struct hmm *hmm = mm->hmm;
>
>  	VM_BUG_ON(!hmm);
>
>  	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>  }
>
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,

This assumes that hmm_invalidate_range_end() does not have a memory
allocation dependency. But hmm_invalidate_range() from
hmm_invalidate_range_end() involves the

	down_read(&hmm->mirrors_sem);
	list_for_each_entry(mirror, &hmm->mirrors, list)
		mirror->ops->sync_cpu_device_pagetables(mirror, action,
							start, end);
	up_read(&hmm->mirrors_sem);

sequence. What is surprising is that there is no in-tree user who assigns
the sync_cpu_device_pagetables field.

$ grep -Fr sync_cpu_device_pagetables *
Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
include/linux/hmm.h:	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
mm/hmm.c:		mirror->ops->sync_cpu_device_pagetables(mirror, action,

That is, this API seems to be currently used only by out-of-tree users. Since
we can't check that nobody has a memory allocation dependency, I think that
hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
now.
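The mirror walk quoted above can be sketched in userspace C to make the worry concrete: each callback runs while the semaphore is held for read, and nothing in the walk can verify that a (possibly out-of-tree) callback does not allocate or block. The struct and callback names mimic the kernel ones, but everything here is an illustrative analogue, not the real API:

```c
#include <stddef.h>

struct mirror {
	void (*sync_cpu_device_pagetables)(struct mirror *m,
					   unsigned long start,
					   unsigned long end);
	struct mirror *next;
};

static struct mirror *mirrors;	/* list head; the kernel guards it with mirrors_sem */
static int calls;		/* counts callback invocations for the demo */

/* Walk all registered mirrors, as hmm_invalidate_range() does. Whether
 * any of these callbacks sleeps is entirely up to the driver behind it. */
static void invalidate_range(unsigned long start, unsigned long end)
{
	struct mirror *m;

	/* down_read(&hmm->mirrors_sem) would go here */
	for (m = mirrors; m != NULL; m = m->next)
		m->sync_cpu_device_pagetables(m, start, end);
	/* up_read(&hmm->mirrors_sem) */
}

static void count_sync(struct mirror *m, unsigned long s, unsigned long e)
{
	(void)m; (void)s; (void)e;
	calls++;
}

/* Register n dummy mirrors and return how many callbacks one walk makes. */
int demo_walk(int n)
{
	static struct mirror pool[8];
	int i;

	mirrors = NULL;
	calls = 0;
	for (i = 0; i < n && i < 8; i++) {
		pool[i].sync_cpu_device_pagetables = count_sync;
		pool[i].next = mirrors;
		mirrors = &pool[i];
	}
	invalidate_range(0, 4096);
	return calls;
}
```

Since the walk calls every registered mirror unconditionally, one blocking callback is enough to make the whole invalidate non-blockable, which is exactly why returning -EAGAIN for blockable == false is the conservative choice.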
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 22:32, Michal Hocko wrote:
> On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
>> I worry that (currently
>> out-of-tree) users of this API are involving work / recursion.
>
> I do not give a slightest about out-of-tree modules. They will have to
> accomodate to the new API. I have no problems to extend the
> documentation and be explicit about this expectation.

You don't need to care about out-of-tree modules. But you need to hear from
mm/hmm.c authors/maintainers when making changes for mmu-notifiers.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.

Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
mmu-notifiers users.

-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.

Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.

Now we are in a merge window. And we noticed a possibility that out-of-tree
mmu-notifiers users might have trouble with making changes immediately in
order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
immediately. And you are trying to ignore such a possibility by just updating
the expected-behavior description instead of giving out-of-tree users a grace
period to check and update their code.

>> and keeps "all operations protected by hmm->mirrors_sem held for write are
>> atomic". This suggests that "some operations protected by hmm->mirrors_sem
>> held for read will sleep (and in the worst case involve a memory allocation
>> dependency)".
>
> Yes and so what? The clear expectation is that neither of the range
> notifiers do not sleep in !blocking mode. I really fail to see what you
> are trying to say.

I'm saying "Get ACK from Jérôme about mm/hmm.c changes".
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 20:36, Michal Hocko wrote:
>> That is, this API seems to be currently used only by out-of-tree users. Since
>> we can't check that nobody has a memory allocation dependency, I think that
>> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
>> now.
>
> The code expects that the invalidate_range_end doesn't block if
> invalidate_range_start hasn't blocked. That is the reason why the end
> callback doesn't have a blockable parameter. If this doesn't hold then the
> whole scheme is just fragile because those two calls should pair.

That is, the more worrisome part of that patch is that I don't know whether
using trylock if blockable == false at entry is really sufficient. Since
those two calls should pair, I think that we need to decide whether to
return -EAGAIN at the start call by evaluating both calls.

Like mn_invl_range_start() involves schedule_delayed_work(), which could be
blocked on memory allocation under an OOM situation, I worry that (currently
out-of-tree) users of this API are involving work / recursion.

And hmm_release() says that

	/*
	 * Drop mirrors_sem so callback can wait on any pending
	 * work that might itself trigger mmu_notifier callback
	 * and thus would deadlock with us.
	 */

and keeps "all operations protected by hmm->mirrors_sem held for write are
atomic". This suggests that "some operations protected by hmm->mirrors_sem
held for read will sleep (and in the worst case involve a memory allocation
dependency)".
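The pairing contract Michal describes (invalidate_range_end inherits the blocking mode that invalidate_range_start approved, and is only called when start succeeded) can be sketched as plain C. All names and the error value are illustrative stand-ins for the kernel API:

```c
#include <stdbool.h>

#define MY_EAGAIN 11		/* illustrative stand-in for EAGAIN */

static int start_calls, end_calls;

/* If the context cannot block, refuse up front: once start succeeds, the
 * end callback will run in the same context with no blockable parameter. */
static int invalidate_range_start(bool blockable)
{
	if (!blockable)
		return -MY_EAGAIN;	/* callbacks might sleep; bail out */
	start_calls++;
	return 0;
}

static void invalidate_range_end(void)
{
	end_calls++;
}

/* Caller contract: end is only invoked when start returned 0, so the pair
 * always executes (or is skipped) together. */
int invalidate(bool blockable)
{
	int ret = invalidate_range_start(blockable);

	if (ret)
		return ret;
	/* ... unmap pages here ... */
	invalidate_range_end();
	return 0;
}
```

This is why evaluating only the start path is not enough: approving a non-blocking start implicitly promises that everything reachable from the matching end call is non-blocking too.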
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> >
> > Reviewed-by: Jérôme Glisse
>
> So I guess the below is the patch you were talking about?
>
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
>  non-blocking mode
>
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will not block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch i already updated HMM to be like the other existing users, ie
all blocking operations in the start callback. But yes it would be wise to
add such comments.

> Signed-off-by: Michal Hocko
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.
>  	 *
>  	 */
>  	int (*invalidate_range_start)(struct mmu_notifier *mn,
> --
> 2.18.0
>
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
>
> Reviewed-by: Jérôme Glisse

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will not block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
--
2.18.0

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> > > I worry that (currently
> > > out-of-tree) users of this API are involving work / recursion.
> >
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
>
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
>
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
>
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
>
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
>
> Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
> make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
>
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in
> order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
> immediately. And you are trying to ignore such a possibility by just
> updating the expected-behavior description instead of giving out-of-tree
> users a grace period to check and update their code.

This is just ridiculous. I have no idea what you are trying to achieve here
but please read through Documentation/process/stable-api-nonsense.rst before
you try to make strong statements again.

I have changed an in-kernel interface. I have gone through all users and
fixed them up. It is really appreciated to double check after me and I am
willing to fix up any fallouts. But that is just about it. I do not give a
whit about _any_ out of tree drivers when changing the interface. I am
willing to answer any questions regarding this change so developers of those
drivers know how to do their change properly but doing so is completely
their business.

> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem
> >> held for read will sleep (and in the worst case involve a memory allocation
> >> dependency)".
> >
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
>
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

HMM is a library layer for other drivers; until those get merged the same
applies to them as well.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> > > I worry that (currently
> > > out-of-tree) users of this API are involving work / recursion.
> >
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
>
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
>
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
>
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
>
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
>
> Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
> make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
>
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in
> order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
> immediately. And you are trying to ignore such a possibility by just
> updating the expected-behavior description instead of giving out-of-tree
> users a grace period to check and update their code.

The intention is that 99% of HMM users will be upstream; as long as they are
not, people shouldn't worry. We have been working on nouveau to use it for
the last year or so. Many bits were added in 4.16, 4.17, 4.18 and i hope it
will all be there in the 4.20/4.21 timeframe. See my other mail for a list
of other users.

> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem
> >> held for read will sleep (and in the worst case involve a memory allocation
> >> dependency)".
> >
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
>
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal patch, i already said so couple month ago first time
this discussion did pop up, Michal you can add:

Reviewed-by: Jérôme Glisse
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this; there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the
> > > > > > same time. E.g. the lock might be taken recursively and that is
> > > > > > illegal for a rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > >
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > >
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> >
> > I don't have the slightest idea.
> >
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> >
> > I'm neither.
> >
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> >
> > No idea where exactly that case comes from, but I can confirm that it
> > indeed happens.
>
> Thinking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifiers
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up with some more realistic scenario. If not then I would propose to
> simplify the locking here. We have lockdep to catch self deadlocks and it
> is always better to handle a specific issue rather than having code
> without a clear indication of how it can recurse.

Multiple concurrent mmu notifiers, for overlapping ranges or not, are common
(each concurrent thread can trigger some). So you might have multiple
invalidate_range_start() calls in flight for the same mm, and thus they
might complete (invalidate_range_end()) in a different order. IIRC this is
what this lock was trying to protect against.

I can't think of a reason for a recursive mmu notifier call right now. I
will ponder and see if i remember something about it.

Cheers,
Jérôme
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	up_write(&hmm->mirrors_sem);
> >  }
> >
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> >  				       unsigned long start,
> > -				       unsigned long end)
> > +				       unsigned long end,
> > +				       bool blockable)
> >  {
> >  	struct hmm *hmm = mm->hmm;
> >
> >  	VM_BUG_ON(!hmm);
> >
> >  	atomic_inc(&hmm->sequence);
> > +
> > +	return 0;
> >  }
> >
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>
> This assumes that hmm_invalidate_range_end() does not have a memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves the
>
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);
>
> sequence. What is surprising is that there is no in-tree user who assigns
> the sync_cpu_device_pagetables field.
>
> That is, this API seems to be currently used only by out-of-tree users. Since
> we can't check that nobody has a memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
> now.

So you can see updates and users of this there:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on the Mellanox and AMD GPU patchsets. I will post the HMM
changes that adapt to Michal's patch shortly, as they have been sufficiently
tested by now.

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:44:03, Christian König wrote: > Am 24.08.2018 um 15:40 schrieb Michal Hocko: > > On Fri 24-08-18 15:28:33, Christian König wrote: > > > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > > > [...] > > > > > > > > Thiking about it some more, I can imagine that a notifier > > > > > > > > callback which > > > > > > > > performs an allocation might trigger a memory reclaim and that > > > > > > > > in turn > > > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > > > shouldn't really allocate memory. They are called from deep MM > > > > > > > > code > > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome > > > > > > > > can come > > > > > > > > up some more realistic scenario. If not then I would propose to > > > > > > > > simplify > > > > > > > > the locking here. We have lockdep to catch self deadlocks and > > > > > > > > it is > > > > > > > > always better to handle a specific issue rather than having a > > > > > > > > code > > > > > > > > without a clear indication how it can recurse. > > > > > > > Well I agree that we should probably fix that, but I have some > > > > > > > concerns to > > > > > > > remove the existing workaround. > > > > > > > > > > > > > > See we added that to get rid of a real problem in a customer > > > > > > > environment and > > > > > > > I don't want to that to show up again. > > > > > > It would really help to know more about that case and fix it > > > > > > properly > > > > > > rather than workaround it like this. Anyway, let me think how to > > > > > > handle > > > > > > the non-blocking notifier invocation then. I was not able to come up > > > > > > with anything remotely sane yet. 
> > > > > With avoiding allocating memory in the write lock path I don't see an > > > > > issue > > > > > any more with that. > > > > > > > > > > All what the write lock path does now is adding items to a linked > > > > > lists, > > > > > arrays etc > > > > Can we change it to non-sleepable lock then? > > > No, the write side doesn't sleep any more, but the read side does. > > > > > > See amdgpu_mn_invalidate_node() and that is where you actually need to > > > handle the non-blocking flag correctly. > > Ohh, right you are. We already handle that by bailing out before calling > > amdgpu_mn_invalidate_node in !blockable mode. > > Yeah, that is sufficient. > > It could be improved because we have something like 90% chance that > amdgpu_mn_invalidate_node() actually doesn't need to do anything. > > But I can take care of that when the patch set has landed. > > > So does this looks good to you? > > Yeah, that looks perfect to me. Reviewed-by: Christian König > Cool! Thanks for your guidance and patience with me. Here is the full patch. Feel free to take it and route per your preference. From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 24 Aug 2018 15:45:52 +0200 Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is incomplete because it might sleep on the notifier lock. This is true but as it turned out from the discussion with Christian this doesn't really matter. The amd notifier lock doesn't block in the exclusive mode. It only ever sleeps with the read lock inside amdgpu_mn_invalidate_node. That one is not called in !blockable state so while we might sleep on notifier read_lock this will only be for a short while. The same applies on the notifier lock. 
Therefore remove the blockable handling from amdgpu_mn_read_lock and document it properly.

Noticed-by: Tetsuo Handa
Reviewed-by: Christian König
Signed-off-by: Michal Hocko
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:40 schrieb Michal Hocko: On Fri 24-08-18 15:28:33, Christian König wrote: Am 24.08.2018 um 15:24 schrieb Michal Hocko: On Fri 24-08-18 15:10:08, Christian König wrote: Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc Can we change it to non-sleepable lock then? No, the write side doesn't sleep any more, but the read side does. See amdgpu_mn_invalidate_node() and that is where you actually need to handle the non-blocking flag correctly. Ohh, right you are. We already handle that by bailing out before calling amdgpu_mn_invalidate_node in !blockable mode. Yeah, that is sufficient. 
It could be improved because we have something like a 90% chance that amdgpu_mn_invalidate_node() actually doesn't need to do anything. But I can take care of that when the patch set has landed.

So does this look good to you?

Yeah, that looks perfect to me. Reviewed-by: Christian König

Thanks,
Christian.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:28:33, Christian König wrote: > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > [...] > > > > > > Thiking about it some more, I can imagine that a notifier callback > > > > > > which > > > > > > performs an allocation might trigger a memory reclaim and that in > > > > > > turn > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > shouldn't really allocate memory. They are called from deep MM code > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can > > > > > > come > > > > > > up some more realistic scenario. If not then I would propose to > > > > > > simplify > > > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > > > always better to handle a specific issue rather than having a code > > > > > > without a clear indication how it can recurse. > > > > > Well I agree that we should probably fix that, but I have some > > > > > concerns to > > > > > remove the existing workaround. > > > > > > > > > > See we added that to get rid of a real problem in a customer > > > > > environment and > > > > > I don't want to that to show up again. > > > > It would really help to know more about that case and fix it properly > > > > rather than workaround it like this. Anyway, let me think how to handle > > > > the non-blocking notifier invocation then. I was not able to come up > > > > with anything remotely sane yet. > > > With avoiding allocating memory in the write lock path I don't see an > > > issue > > > any more with that. > > > > > > All what the write lock path does now is adding items to a linked lists, > > > arrays etc > > Can we change it to non-sleepable lock then? > > No, the write side doesn't sleep any more, but the read side does. 
> > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > handle the non-blocking flag correctly.

Ohh, right you are. We already handle that by bailing out before calling amdgpu_mn_invalidate_node in !blockable mode.

So does this look good to you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> On 2018/08/24 20:36, Michal Hocko wrote:
> >> That is, this API seems to be currently used by only out-of-tree users. Since
> >> we can't check that nobody has memory allocation dependency, I think that
> >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false
> >> for now.
> >
> > The code expects that the invalidate_range_end doesn't block if
> > invalidate_range_start hasn't blocked. That is the reason why the end
> > callback doesn't have a blockable parameter. If this doesn't hold then the
> > whole scheme is just fragile because those two calls should pair.
>
> That is
>
>   More worrisome part in that patch is that I don't know whether using
>   trylock if blockable == false at entry is really sufficient.
>
> . Since those two calls should pair, I think that we need to determine whether
> we need to return -EAGAIN at the start call by evaluating both calls.

Yes, and I believe I have done that audit. Modulo my misunderstanding of the code.

> Like mn_invl_range_start() involves schedule_delayed_work() which could be
> blocked on memory allocation under OOM situation,

It doesn't because that code path is not invoked for the !blockable case.

> I worry that (currently
> out-of-tree) users of this API are involving work / recursion.

I do not care in the slightest about out-of-tree modules. They will have to accommodate to the new API. I have no problems to extend the documentation and be explicit about this expectation.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,

> And hmm_release() says that
>
>	/*
>	 * Drop mirrors_sem so callback can wait on any pending
>	 * work that might itself trigger mmu_notifier callback
>	 * and thus would deadlock with us.
>	 */
>
> and keeps "all operations protected by hmm->mirrors_sem held for write are
> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> for read will sleep (and in the worst case involve memory allocation
> dependency)".

Yes, and so what? The clear expectation is that neither of the range notifiers sleeps in !blockable mode. I really fail to see what you are trying to say.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:24 schrieb Michal Hocko: On Fri 24-08-18 15:10:08, Christian König wrote: Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc Can we change it to non-sleepable lock then? No, the write side doesn't sleep any more, but the read side does. See amdgpu_mn_invalidate_node() and that is where you actually need to handle the non-blocking flag correctly. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:10:08, Christian König wrote: > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > [...] > > > > Thiking about it some more, I can imagine that a notifier callback which > > > > performs an allocation might trigger a memory reclaim and that in turn > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > shouldn't really allocate memory. They are called from deep MM code > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > > > up some more realistic scenario. If not then I would propose to simplify > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > always better to handle a specific issue rather than having a code > > > > without a clear indication how it can recurse. > > > Well I agree that we should probably fix that, but I have some concerns to > > > remove the existing workaround. > > > > > > See we added that to get rid of a real problem in a customer environment > > > and > > > I don't want to that to show up again. > > It would really help to know more about that case and fix it properly > > rather than workaround it like this. Anyway, let me think how to handle > > the non-blocking notifier invocation then. I was not able to come up > > with anything remotely sane yet. > > With avoiding allocating memory in the write lock path I don't see an issue > any more with that. > > All what the write lock path does now is adding items to a linked lists, > arrays etc Can we change it to non-sleepable lock then? -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc So there is no more blocking involved here and the read lock side should be able to grab the lock immediately. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 14:52:26, Christian König wrote: > Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] > > Thiking about it some more, I can imagine that a notifier callback which > > performs an allocation might trigger a memory reclaim and that in turn > > might trigger a notifier to be invoked and recurse. But notifier > > shouldn't really allocate memory. They are called from deep MM code > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > up some more realistic scenario. If not then I would propose to simplify > > the locking here. We have lockdep to catch self deadlocks and it is > > always better to handle a specific issue rather than having a code > > without a clear indication how it can recurse. > > Well I agree that we should probably fix that, but I have some concerns to > remove the existing workaround. > > See we added that to get rid of a real problem in a customer environment and > I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 14:33 schrieb Michal Hocko: On Fri 24-08-18 14:18:44, Christian König wrote: Am 24.08.2018 um 14:03 schrieb Michal Hocko: On Fri 24-08-18 13:57:52, Christian König wrote: Am 24.08.2018 um 13:52 schrieb Michal Hocko: On Fri 24-08-18 13:43:16, Christian König wrote: [...] That won't work like this there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore. I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path? No, but what can happen is: invalidate_range_start(A,B); invalidate_range_start(C,D); ... invalidate_range_end(C,D); invalidate_range_end(A,B); Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? I don't have the slightest idea. Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. I'm neither. We stumbled over that by pure observation and after discussing the problem with Jerome came up with this solution. No idea where exactly that case comes from, but I can confirm that it indeed happens. Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. 
See we added that to get rid of a real problem in a customer environment and I don't want that to show up again.

In the meantime I've sent out a fix to avoid allocating memory while holding the mn_lock.

Thanks for pointing that out,
Christian.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 14:18:44, Christian König wrote:
> Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > [...]
> > > > > That won't work like this; there might be multiple
> > > > > invalidate_range_start()/invalidate_range_end() pairs open at the
> > > > > same time. E.g. the lock might be taken recursively and that is
> > > > > illegal for a rw_semaphore.
> > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > trigger another one from the same path?
> > > No, but what can happen is:
> > >
> > > invalidate_range_start(A,B);
> > > invalidate_range_start(C,D);
> > > ...
> > > invalidate_range_end(C,D);
> > > invalidate_range_end(A,B);
> > >
> > > Grabbing the read lock twice would be illegal in this case.
> > I am sorry but I still do not follow. What is the context the two are
> > called from?
>
> I don't have the slightest idea.
>
> > Can you give me an example. I simply do not see it in the
> > code, mostly because I am not familiar with it.
>
> I'm neither.
>
> We stumbled over that by pure observation and after discussing the problem
> with Jerome came up with this solution.
>
> No idea where exactly that case comes from, but I can confirm that it indeed
> happens.

Thinking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifiers shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up with a more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having code without a clear indication of how it can recurse.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 14:03 schrieb Michal Hocko: On Fri 24-08-18 13:57:52, Christian König wrote: Am 24.08.2018 um 13:52 schrieb Michal Hocko: On Fri 24-08-18 13:43:16, Christian König wrote: [...] That won't work like this there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore. I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path? No, but what can happen is: invalidate_range_start(A,B); invalidate_range_start(C,D); ... invalidate_range_end(C,D); invalidate_range_end(A,B); Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? I don't have the slightest idea. Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. I'm neither. We stumbled over that by pure observation and after discussing the problem with Jerome came up with this solution. No idea where exactly that case comes from, but I can confirm that it indeed happens. Regards, Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 13:57:52, Christian König wrote: > Am 24.08.2018 um 13:52 schrieb Michal Hocko: > > On Fri 24-08-18 13:43:16, Christian König wrote: [...] > > > That won't work like this there might be multiple > > > invalidate_range_start()/invalidate_range_end() pairs open at the same > > > time. > > > E.g. the lock might be taken recursively and that is illegal for a > > > rw_semaphore. > > I am not sure I follow. Are you saying that one invalidate_range might > > trigger another one from the same path? > > No, but what can happen is: > > invalidate_range_start(A,B); > invalidate_range_start(C,D); > ... > invalidate_range_end(C,D); > invalidate_range_end(A,B); > > Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 13:52 schrieb Michal Hocko:
On Fri 24-08-18 13:43:16, Christian König wrote:
Am 24.08.2018 um 13:32 schrieb Michal Hocko:
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:

Two more worries for this patch.

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * @amn: our notifier
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	mutex_lock(&amn->read_lock);
+	if (blockable)
+		mutex_lock(&amn->read_lock);
+	else if (!mutex_trylock(&amn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false? Want a comment on why it is safe to use a blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing so I might have missed something. Double checking now, it seems that this read_lock is not used anywhere else and it is not _the_ lock we are interested about. It is the amn->lock (amdgpu_mn_lock) which matters as it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL. So you actually don't need to change anything here (even the proposed changes are overkill) since we can't tear down the struct_mm while an IOCTL is still using it.

I am not so sure. We are not in the mm destruction phase yet. This is mostly about the oom context which might fire right during the IOCTL. If any of the paths holding the write lock blocks for an unbound amount of time, or even worse allocates memory, then we are screwed. So we need to back off when blockable = false.

Oh, yeah good point. Haven't thought about that possibility.

Is that correct Christian? If this is correct then we need to update the locking here. I am struggling to grasp the ref counting part.
Why cannot all readers simply take the lock rather than rely on somebody else to take it? 1ed3d2567c800 didn't really help me to understand the locking scheme here so any help would be appreciated.

That won't work like this; there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path?

No, but what can happen is:

invalidate_range_start(A,B);
invalidate_range_start(C,D);
...
invalidate_range_end(C,D);
invalidate_range_end(A,B);

Grabbing the read lock twice would be illegal in this case.

Regards,
Christian.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 13:43:16, Christian König wrote:
> Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > > Two more worries for this patch.
> > >
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > > >  *
> > > >  * @amn: our notifier
> > > >  */
> > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > > > {
> > > > -	mutex_lock(&amn->read_lock);
> > > > +	if (blockable)
> > > > +		mutex_lock(&amn->read_lock);
> > > > +	else if (!mutex_trylock(&amn->read_lock))
> > > > +		return -EAGAIN;
> > > > +
> > > > 	if (atomic_inc_return(&amn->recursion) == 1)
> > > > 		down_read_non_owner(&amn->lock);
> > > Why don't we need to use trylock here if blockable == false ?
> > > Want comment why it is safe to use blocking lock here.
> > Hmm, I am pretty sure I have checked the code but it was quite confusing
> > so I might have missed something. Double checking now, it seems that
> > this read_lock is not used anywhere else and it is not _the_ lock we are
> > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> > it is taken in exclusive mode for expensive operations.
>
> The write side of the lock is only taken in the command submission IOCTL.
>
> So you actually don't need to change anything here (even the proposed
> changes are overkill) since we can't tear down the struct_mm while an IOCTL
> is still using it.

I am not so sure. We are not in the mm destruction phase yet. This is mostly about the oom context which might fire right during the IOCTL. If any of the paths holding the write lock blocks for an unbound amount of time, or even worse allocates memory, then we are screwed. So we need to back off when blockable = false.

> > Is that correct Christian?
> > If this is correct then we need to update the locking here. I am struggling
> > to grasp the ref counting part. Why cannot all readers simply take the lock
> > rather than rely on somebody else to take it? 1ed3d2567c800 didn't really
> > help me to understand the locking scheme here so any help would be
> > appreciated.
>
> That won't work like this; there might be multiple
> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> E.g. the lock might be taken recursively and that is illegal for a
> rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path?

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > Two more worries for this patch.
> >
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > >   *
> > >   * @amn: our notifier
> > >   */
> > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > >  {
> > > -	mutex_lock(&amn->read_lock);
> > > +	if (blockable)
> > > +		mutex_lock(&amn->read_lock);
> > > +	else if (!mutex_trylock(&amn->read_lock))
> > > +		return -EAGAIN;
> > > +
> > >  	if (atomic_inc_return(&amn->recursion) == 1)
> > >  		down_read_non_owner(&amn->lock);
> >
> > Why don't we need to use trylock here if blockable == false ?
> > Want comment why it is safe to use blocking lock here.
>
> Hmm, I am pretty sure I have checked the code but it was quite confusing
> so I might have missed something. Double checking now, it seems that
> this read_lock is not used anywhere else and it is not _the_ lock we are
> interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL.

So you actually don't need to change anything here (even the proposed
changes are overkill) since we can't tear down the struct_mm while an
IOCTL is still using it.

> Is that correct Christian? If this is correct then we need to update the
> locking here. I am struggling to grasp the ref counting part. Why cannot
> all readers simply take the lock rather than rely on somebody else to
> take it? 1ed3d2567c800 didn't really help me to understand the locking
> scheme here so any help would be appreciated.

That won't work like this there might be multiple
invalidate_range_start()/invalidate_range_end() pairs open at the same
time. E.g. the lock might be taken recursively and that is illegal for a
rw_semaphore.

Regards,
Christian.

I am wondering why we cannot do

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
+	if (!down_read_trylock(&amn->lock)) {
+		if (!blockable)
+			return -EAGAIN;
+		down_read(&amn->lock);
+	}
 
 	return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> >  				       unsigned long start,
> > -				       unsigned long end)
> > +				       unsigned long end,
> > +				       bool blockable)
> >  {
> >  	struct hmm *hmm = mm->hmm;
> > 
> >  	VM_BUG_ON(!hmm);
> > 
> >  	atomic_inc(&hmm->sequence);
> > +
> > +	return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.

Yes HMM doesn't have any in tree user yet.

> $ grep -Fr sync_cpu_device_pagetables *
> Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
> include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
> include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
> include/linux/hmm.h:	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
> include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> mm/hmm.c:	mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
> now.

The code expects that the invalidate_range_end doesn't block if
invalidate_range_start hasn't blocked. That is the reason why the end
callback doesn't have a blockable parameter. If this doesn't hold then the
whole scheme is just fragile because those two calls should pair.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> Two more worries for this patch.
>
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * @amn: our notifier
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> >  {
> > -	mutex_lock(&amn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&amn->read_lock);
> > +	else if (!mutex_trylock(&amn->read_lock))
> > +		return -EAGAIN;
> > +
> >  	if (atomic_inc_return(&amn->recursion) == 1)
> >  		down_read_non_owner(&amn->lock);
>
> Why don't we need to use trylock here if blockable == false ?
> Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

I am wondering why we cannot do

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
+	if (!down_read_trylock(&amn->lock)) {
+		if (!blockable)
+			return -EAGAIN;
+		down_read(&amn->lock);
+	}
 
 	return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 24-07-18 12:53:07, Andrew Morton wrote:
[...]
> > On top of that the proposed cleanup looks as follows:
> 
> Looks good to me. Seems a bit strange that we omit the pr_info()
> output if the mm was partially reaped - people would still want to know
> this? Not very important though.

I think that having a single output once we are done is better but I do
not have a strong opinion on this.

Btw. here is the changelog for the cleanup.
"
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably

 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region. So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Suggested-by: Andrew Morton
Signed-off-by: Michal Hocko
"
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 24-07-18 14:07:49, David Rientjes wrote:
[...]
> mm/oom_kill.c: clean up oom_reap_task_mm() fix
> 
> indicate reaping has been partially skipped so we can expect future skips
> or another start before finish.

But we are not skipping. This is essentially the same case as mmap_sem
trylock fail. Maybe we can add a bool parameter to
trace_finish_task_reaping to denote partial success?

> Signed-off-by: David Rientjes
> ---
>  mm/oom_kill.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> 
>  	trace_start_task_reaping(tsk->pid);
> 
> -	/* failed to reap part of the address space. Try again later */
>  	ret = __oom_reap_task_mm(mm);
> -	if (!ret)
> +	if (!ret) {
> +		/* Failed to reap part of the address space. Try again later */
> +		trace_skip_task_reaping(tsk->pid);
>  		goto out_finish;
> +	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  		task_pid_nr(tsk), tsk->comm,
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 24 Jul 2018, Michal Hocko wrote:

> oom_reap_task_mm should return false when __oom_reap_task_mm returns
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> 
> On top of that the proposed cleanup looks as follows:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 88657e018714..4e185a282b3d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	return ret;
>  }
> 
> +/*
> + * Reaps the address space of the given task.
> + *
> + * Returns true on success and false if none or part of the address space
> + * has been reclaimed and the caller should retry later.
> + */
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
> +	bool ret = true;
> +
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		trace_skip_task_reaping(tsk->pid);
>  		return false;
>  	}
> @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * down_write();up_write() cycle in exit_mmap().
>  	 */
>  	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		return true;
> +		goto out_unlock;
>  	}
> 
>  	trace_start_task_reaping(tsk->pid);
> 
>  	/* failed to reap part of the address space. Try again later */
> -	if (!__oom_reap_task_mm(mm)) {
> -		up_read(&mm->mmap_sem);
> -		return false;
> -	}
> +	ret = __oom_reap_task_mm(mm);
> +	if (!ret)
> +		goto out_finish;
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
>  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +out_finish:
> +	trace_finish_task_reaping(tsk->pid);
> +out_unlock:
>  	up_read(&mm->mmap_sem);
> 
> -	trace_finish_task_reaping(tsk->pid);
> -	return true;
> +	return ret;
>  }
> 
>  #define MAX_OOM_REAP_RETRIES 10

I think we still want to trace when reaping was skipped to know that the
oom reaper will retry again later.

mm/oom_kill.c: clean up oom_reap_task_mm() fix

indicate reaping has been partially skipped so we can expect future skips
or another start before finish.

Signed-off-by: David Rientjes
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/* failed to reap part of the address space. Try again later */
 	ret = __oom_reap_task_mm(mm);
-	if (!ret)
+	if (!ret) {
+		/* Failed to reap part of the address space. Try again later */
+		trace_skip_task_reaping(tsk->pid);
 		goto out_finish;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(tsk), tsk->comm,
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko wrote:

> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > - Undocumented return value.
> > 
> > - comment "failed to reap part..." is misleading - sounds like it's
> >   referring to something which happened in the past, is in fact
> >   referring to something which might happen in the future.
> > 
> > - fails to call trace_finish_task_reaping() in one case
> > 
> > - code duplication.
> > 
> > - Increases mmap_sem hold time a little by moving
> >   trace_finish_task_reaping() inside the locked region. So sue me ;)
> > 
> > - Sharing the finish: path means that the trace event won't
> >   distinguish between the two sources of finishing.
> > 
> > Please take a look?
> 
> oom_reap_task_mm should return false when __oom_reap_task_mm returns
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

OK, thanks, I added that.

> On top of that the proposed cleanup looks as follows:

Looks good to me. Seems a bit strange that we omit the pr_info()
output if the mm was partially reaped - people would still want to know
this? Not very important though.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> - Undocumented return value.
> 
> - comment "failed to reap part..." is misleading - sounds like it's
>   referring to something which happened in the past, is in fact
>   referring to something which might happen in the future.
> 
> - fails to call trace_finish_task_reaping() in one case
> 
> - code duplication.
> 
> - Increases mmap_sem hold time a little by moving
>   trace_finish_task_reaping() inside the locked region. So sue me ;)
> 
> - Sharing the finish: path means that the trace event won't
>   distinguish between the two sources of finishing.
> 
> Please take a look?

oom_reap_task_mm should return false when __oom_reap_task_mm returns
false. This is what my patch did but it seems this changed by
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
so that one should be fixed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..88657e018714 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	/* failed to reap part of the address space. Try again later */
 	if (!__oom_reap_task_mm(mm)) {
 		up_read(&mm->mmap_sem);
-		return true;
+		return false;
 	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

On top of that the proposed cleanup looks as follows:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88657e018714..4e185a282b3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	return ret;
 }
 
+/*
+ * Reaps the address space of the given task.
+ *
+ * Returns true on success and false if none or part of the address space
+ * has been reclaimed and the caller should retry later.
+ */
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
+	bool ret = true;
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		trace_skip_task_reaping(tsk->pid);
 		return false;
 	}
@@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
-		return true;
+		goto out_unlock;
 	}
 
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-	if (!__oom_reap_task_mm(mm)) {
-		up_read(&mm->mmap_sem);
-		return false;
-	}
+	ret = __oom_reap_task_mm(mm);
+	if (!ret)
+		goto out_finish;
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_finish:
+	trace_finish_task_reaping(tsk->pid);
+out_unlock:
 	up_read(&mm->mmap_sem);
 
-	trace_finish_task_reaping(tsk->pid);
-	return true;
+	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 16:01:25, Andrew Morton wrote:
> On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko wrote:
> 
> > > Any suggestions regarding how the driver developers can test this code
> > > path? I don't think we presently have a way to fake an oom-killing
> > > event? Perhaps we should add such a thing, given the problems we're
> > > having with that feature.
> > 
> > The simplest way is to wrap userspace code which uses these notifiers
> > into a memcg and set the hard limit to hit the oom. This can be done
> > e.g. after the test faults in all the mmu notifier managed memory and
> > set the hard limit to something really small. Then we are looking for a
> > proper process tear down.
> 
> Chances are, some of the intended audience don't know how to do this
> and will either have to hunt down a lot of documentation or will just
> not test it. But we want them to test it, so a little worked step-by-step
> example would help things along please.

I am willing to give more specific steps. Is anybody interested? From my
experience so far this is not something drivers developers using mmu
notifiers would be unfamiliar with.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 23-07-18 09:11:54, Michal Hocko wrote:
> On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> > On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> > [...]
> > > Please take a look?
> > 
> > Are you OK to have these in a separate patch?
> 
> Btw. I will rebase this patch once oom stuff in linux-next settles. At
> least oom_lock removal from oom_reaper will conflict.

Hmm, I have just checked Andrew's akpm and the patch is already in and
Andrew has resolved the conflict with the oom_lock patch. It just seems
that linux-next (next-20180720) doesn't have the newest mmotm tree.
Anyway, I will go with the incremental cleanup patch per Andrew's
comments as soon as linux-next catches up.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > Please take a look?
> 
> Are you OK to have these in a separate patch?

Btw. I will rebase this patch once oom stuff in linux-next settles. At
least oom_lock removal from oom_reaper will conflict.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?

Are you OK to have these in a separate patch?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote:

> From: Michal Hocko
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> ...
> 
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> 
>  	trace_start_task_reaping(tsk->pid);
> 
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}

This function is starting to look a bit screwy.

: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		up_read(&mm->mmap_sem);
: 		trace_skip_task_reaping(tsk->pid);
: 		return true;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	/* failed to reap part of the address space. Try again later */
: 	if (!__oom_reap_task_mm(mm)) {
: 		up_read(&mm->mmap_sem);
: 		return true;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: 	up_read(&mm->mmap_sem);
: 
: 	trace_finish_task_reaping(tsk->pid);
: 	return true;
: }

- Undocumented return value.

- comment "failed to reap part..." is misleading - sounds like it's
  referring to something which happened in the past, is in fact
  referring to something which might happen in the future.

- fails to call trace_finish_task_reaping() in one case

- code duplication.

I'm thinking it wants to be something like this?

: /*
:  * Return true if we successfully acquired (then released) mmap_sem
:  */
: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		goto out;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	if (!__oom_reap_task_mm(mm)) {
: 		/* Failed to reap part of the address space. Try again later */
: 		goto finish;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: finish:
: 	trace_finish_task_reaping(tsk->pid);
: out:
: 	up_read(&mm->mmap_sem);
: 	return true;
: }

- Increases mmap_sem hold time a little by moving
  trace_finish_task_reaping() inside the locked region. So sue me ;)

- Sharing the finish: path means that the trace event won't
  distinguish between the two sources of finishing.

Please take a look?
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko wrote:

> > Any suggestions regarding how the driver developers can test this code
> > path? I don't think we presently have a way to fake an oom-killing
> > event? Perhaps we should add such a thing, given the problems we're
> > having with that feature.
> 
> The simplest way is to wrap userspace code which uses these notifiers
> into a memcg and set the hard limit to hit the oom. This can be done
> e.g. after the test faults in all the mmu notifier managed memory and
> set the hard limit to something really small. Then we are looking for a
> proper process tear down.

Chances are, some of the intended audience don't know how to do this
and will either have to hunt down a lot of documentation or will just
not test it. But we want them to test it, so a little worked step-by-step
example would help things along please.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Does anybody see any reasons why this should not get into the mmotm tree?
I do not want to rush this in but if the general feeling is to push it for
the upcoming merge window then I will not object.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 16-07-18 16:12:49, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote:
> 
> > From: Michal Hocko
> > 
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> > 
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> > 
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> > 
> > This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
> 
> I assume device driver developers are wondering "what does this mean
> for me". As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes? So it is a very rare thing.

Yes, this is the case right now. Maybe we will grow other users in
future. Those other potential users are the reason why I used blockable
rather than oom parameter name.

> Any suggestions regarding how the driver developers can test this code
> path? I don't think we presently have a way to fake an oom-killing
> event? Perhaps we should add such a thing, given the problems we're
> having with that feature.

The simplest way is to wrap userspace code which uses these notifiers
into a memcg and set the hard limit to hit the oom. This can be done
e.g. after the test faults in all the mmu notifier managed memory and
set the hard limit to something really small. Then we are looking for a
proper process tear down.

> > I think we can improve that even further because there is a common
> > pattern to do a range lookup first and then do something about that.
> > The first part can be done without a sleeping lock in most cases AFAICS.
> > 
> > The oom_reaper end then simply retries if there is at least one notifier
> > which couldn't make any progress in !blockable mode. A retry loop is
> > already implemented to wait for the mmap_sem and this is basically the
> > same thing.
> > 
> > ...
> >
> > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> > +		unsigned long start, unsigned long end)
> > +{
> > +	int ret = 0;
> > +	if (mm_has_notifiers(mm))
> > +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> > +
> > +	return ret;
> > }
> 
> nit,
> 
> {
> 	if (mm_has_notifiers(mm))
> 		return __mmu_notifier_invalidate_range_start(mm, start, end, false);
> 	return 0;
> }
> 
> would suffice.

Sure. Fixed

> > ...
> > 
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	 * reliably test it.
> >  	 */
> >  	mutex_lock(&oom_lock);
> > -	__oom_reap_task_mm(mm);
> > +	(void)__oom_reap_task_mm(mm);
> >  	mutex_unlock(&oom_lock);
> 
> What does this do?

There is no error to be returned here as the comment above explains

	 * Nothing can be holding mm->mmap_sem here and the above call
	 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
	 * __oom_reap_task_mm() will not block.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote: > On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > > > From: Michal Hocko > > > > There are several blockable mmu notifiers which might sleep in > > mmu_notifier_invalidate_range_start and that is a problem for the > > oom_reaper because it needs to guarantee a forward progress so it cannot > > depend on any sleepable locks. > > > > Currently we simply back off and mark an oom victim with blockable mmu > > notifiers as done after a short sleep. That can result in selecting a > > new oom victim prematurely because the previous one still hasn't torn > > its memory down yet. > > > > We can do much better though. Even if mmu notifiers use sleepable locks > > there is no reason to automatically assume those locks are held. > > Moreover majority of notifiers only care about a portion of the address > > space and there is absolutely zero reason to fail when we are unmapping an > > unrelated range. Many notifiers do really block and wait for HW which is > > harder to handle and we have to bail out though. > > > > This patch handles the low hanging fruid. > > __mmu_notifier_invalidate_range_start > > gets a blockable flag and callbacks are not allowed to sleep if the > > flag is set to false. This is achieved by using trylock instead of the > > sleepable lock for most callbacks and continue as long as we do not > > block down the call chain. > > I assume device driver developers are wondering "what does this mean > for me". As I understand it, the only time they will see > blockable==false is when their driver is being called in response to an > out-of-memory condition, yes? So it is a very rare thing. I can't say for everyone, but at least for me (mlx5), it is not rare event. I'm seeing OOM very often while I'm running my tests in low memory VMs. Thanks > > Any suggestions regarding how the driver developers can test this code > path? I don't think we presently have a way to fake an oom-killing > event? 
Perhaps we should add such a thing, given the problems we're > having with that feature. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > From: Michal Hocko > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. > __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. I assume device driver developers are wondering "what does this mean for me". As I understand it, the only time they will see blockable==false is when their driver is being called in response to an out-of-memory condition, yes? So it is a very rare thing. Any suggestions regarding how the driver developers can test this code path? I don't think we presently have a way to fake an oom-killing event? Perhaps we should add such a thing, given the problems we're having with that feature. > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. 
> The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > ... > > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct > mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + int ret = 0; > + if (mm_has_notifiers(mm)) > + ret = __mmu_notifier_invalidate_range_start(mm, start, end, > false); > + > + return ret; > } nit, { if (mm_has_notifiers(mm)) return __mmu_notifier_invalidate_range_start(mm, start, end, false); return 0; } would suffice. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm) > * reliably test it. > */ > mutex_lock(&oom_lock); > - __oom_reap_task_mm(mm); > + (void)__oom_reap_task_mm(mm); > mutex_unlock(&oom_lock); What does this do? > set_bit(MMF_OOM_SKIP, &mm->flags); > > ...
[PATCH] mm, oom: distinguish blockable mode for mmu notifiers
From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover the majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low-hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continuing as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. 
Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Cc: linux...@kvack.org Acked-by: Christian König # AMD notifiers Acked-by: Leon Romanovsky # mlx and umem_odp Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, there were no major objections when I sent this as an RFC the last time [1]. I was hoping for more feedback in the drivers land because I am touching the code I have no way to test. On the other hand the pattern is quite simple and consistent over all users so there shouldn't be any large surprises hopefully. Any further review would be highly appreciate of course. But is this something to put into the mm tree now? 
[1] http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- include/rdma/ib_umem_odp.h | 3 +- mm/hmm.c| 7 ++-- mm/mmap.c | 2 +- mm/mmu_notifier.c | 19 --- mm/oom_kill.c | 29 virt/kvm/kvm_main.c | 15 ++--- 19 files changed, 225 insertions(+), 80 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); } -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > Any other feedback before I post this as non-RFC? > > > > From mlx5 perspective, who is primary user of umem_odp.c your change looks > > ok. > > Can I assume your Acked-by? > > Thanks for your review! For mlx and umem_odp pieces, Acked-by: Leon Romanovsky Thanks
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed, Jul 11, 2018 at 01:13:18PM +0200, Michal Hocko wrote: > On Wed 11-07-18 13:14:47, Leon Romanovsky wrote: > > On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > > > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > > > This is the v2 of RFC based on the feedback I've received so > > > > > > > > far. The > > > > > > > > code even compiles as a bonus ;) I haven't runtime tested it > > > > > > > > yet, mostly > > > > > > > > because I have no idea how. > > > > > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your > > > > > > change looks ok. > > > > > > > > > > Can I assume your Acked-by? > > > > > > > > I didn't have a chance to test it because it applies on our rdma-next, > > > > but > > > > fails to compile. > > > > > > What is the compilation problem? Is it caused by the patch or some other > > > unrelated changed? > > > > Thanks for pushing me to take a look on it. > > Your patch needs the following hunk to properly compile at least on my > > system. > > I suspect you were trying the original version. I've posted an updated > patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz > and all these issues should be fixed there. Including many other fixes. > Ohh, you used --reply-to, IMHO it is best way to make sure that the patch will be lost :) > Could you have a look at that one please? I grabbed it, the results will be overnight only. 
Thanks > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed 11-07-18 13:14:47, Leon Romanovsky wrote: > On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > > The > > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > > mostly > > > > > > > because I have no idea how. > > > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > > > looks ok. > > > > > > > > Can I assume your Acked-by? > > > > > > I didn't have a chance to test it because it applies on our rdma-next, but > > > fails to compile. > > > > What is the compilation problem? Is it caused by the patch or some other > > unrelated changed? > > Thanks for pushing me to take a look on it. > Your patch needs the following hunk to properly compile at least on my system. I suspect you were trying the original version. I've posted an updated patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz and all these issues should be fixed there. Including many other fixes. Could you have a look at that one please? -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > The > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > mostly > > > > > > because I have no idea how. > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > > looks ok. > > > > > > Can I assume your Acked-by? > > > > I didn't have a chance to test it because it applies on our rdma-next, but > > fails to compile. > > What is the compilation problem? Is it caused by the patch or some other > unrelated changed? Thanks for pushing me to take a look on it. Your patch needs the following hunk to properly compile at least on my system. I'll take it to our regression. diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 369867501bed..1f364a157097 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -155,9 +155,9 @@ struct mmu_notifier_ops { * cannot block, mmu_notifier_ops.flags should have * MMU_INVALIDATE_DOES_NOT_BLOCK set. 
*/ - void (*invalidate_range_start)(struct mmu_notifier *mn, + int (*invalidate_range_start)(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, bool blockable); void (*invalidate_range_end)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); @@ -229,7 +229,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end, bool blockable); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, diff --git a/include/linux/oom.h b/include/linux/oom.h index 6adac113e96d..92f70e4c6252 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm) return 0; } -void __oom_reap_task_mm(struct mm_struct *mm); +bool __oom_reap_task_mm(struct mm_struct *mm); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7e0c6e78ae5c..7c7bd6f3298e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1,6 +1,6 @@ /* * linux/mm/oom_kill.c - * + * * Copyright (C) 1998,2000 Rik van Riel * Thanks go out to Claus Fischer for some serious inspiration and * for goading me into coding this file... 
@@ -569,7 +569,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) if (!__oom_reap_task_mm(mm)) { up_read(&mm->mmap_sem); ret = false; - goto out_unlock; + goto unlock_oom; } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", Thanks > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > mostly > > > > > because I have no idea how. > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > looks ok. > > > > Can I assume your Acked-by? > > I didn't have a chance to test it because it applies on our rdma-next, but > fails to compile. What is the compilation problem? Is it caused by the patch or some other unrelated change? -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > Any other feedback before I post this as non-RFC? > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > looks ok. > > Can I assume your Acked-by? I didn't have a chance to test it because it applies on our rdma-next, but fails to compile. Thanks > > Thanks for your review! > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > This is the v2 of RFC based on the feedback I've received so far. The > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > because I have no idea how. > > > > > > Any further feedback is highly appreciated of course. > > > > Any other feedback before I post this as non-RFC? > > From mlx5 perspective, who is primary user of umem_odp.c your change looks ok. Can I assume your Acked-by? Thanks for your review! -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > This is the v2 of RFC based on the feedback I've received so far. The > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > because I have no idea how. > > > > Any further feedback is highly appreciated of course. > > Any other feedback before I post this as non-RFC? From mlx5 perspective, who is primary user of umem_odp.c your change looks ok. Thanks
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed 27-06-18 09:44:21, Michal Hocko wrote: > This is the v2 of RFC based on the feedback I've received so far. The > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > because I have no idea how. > > Any further feedback is highly appreciated of course. Any other feedback before I post this as non-RFC? > --- > From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 20 Jun 2018 15:03:20 +0200 > Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. > __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. 
> > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. > The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > Changes since rfc v1 > - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch > on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure > we bail out when we have an intersecting range for starter > - note that a notifier failed to the log for easier debugging > - back off early in ib_umem_notifier_invalidate_range_start if the > callback is called > - mn_invl_range_start waits for completion down the unmap_grant_pages > path so we have to back off early on overlapping ranges > > Cc: "David (ChunMing) Zhou" > Cc: Paolo Bonzini > Cc: "Radim Krčmář" > Cc: Alex Deucher > Cc: "Christian König" > Cc: David Airlie > Cc: Jani Nikula > Cc: Joonas Lahtinen > Cc: Rodrigo Vivi > Cc: Doug Ledford > Cc: Jason Gunthorpe > Cc: Mike Marciniszyn > Cc: Dennis Dalessandro > Cc: Sudeep Dutt > Cc: Ashutosh Dixit > Cc: Dimitri Sivanich > Cc: Boris Ostrovsky > Cc: Juergen Gross > Cc: "Jérôme Glisse" > Cc: Andrea Arcangeli > Cc: Felix Kuehling > Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) > Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND > 64-BIT)) > Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) > Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) > Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding > Poulsbo, Moorestow...) 
> Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) > Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) > Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) > Reported-by: David Rientjes > Signed-off-by: Michal Hocko > --- > arch/x86/kvm/x86.c | 7 ++-- > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- > drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- > drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- > drivers/infiniband/core/umem_odp.c | 33 +++ > drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- > drivers/infiniband/hw/mlx5/odp.c| 2 +- > drivers/misc/mic/scif/scif_dma.c| 7 ++-- > drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- > drivers/xen/gntdev.c| 44 - >
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:39:50, Christian König wrote: [...] > Not wanting to block something as important as this, so feel free to add an > Acked-by: Christian König to the patch. Thanks a lot! > Let's rather face the next topic: Any idea how to runtime test this? This is a good question indeed. One way to do that would be triggering the OOM killer from the context which uses each of these mmu notifiers (one at a time) and see how that works. You would see the note in the log whenever the notifier would block. The primary thing to test is how often the oom reaper really had to back off completely. > I mean I can rather easily provide a test which crashes an AMD GPU, which in > turn then would mean that the MMU notifier would block forever without this > patch. Well, you do not really have to go that far. It should be sufficient to do the above. The current code would simply back off without releasing any memory. The patch should help to reclaim some memory. > But do you know a way to let the OOM killer kill a specific process? Yes, you can set its oom_score_adj to 1000 which means always select that task. -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 14:35 schrieb Michal Hocko: On Mon 02-07-18 14:24:29, Christian König wrote: Am 02.07.2018 um 14:20 schrieb Michal Hocko: On Mon 02-07-18 14:13:42, Christian König wrote: Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. But we do rely on the proper blockable handling. Yeah, but you could add the handling only after you have all the implementations in place. Don't you? Yeah, but then I would be adding a code with no user. And I really prefer to no do so because then the code is harder to argue about. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. 
It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. Well, I don't insist on this. It's just from my point of view that this patch doesn't needs to be one patch, but could be split up. Well, if there are more people with the same concern I can try to do that. But if your only concern is to focus on your particular part then I guess it would be easier both for you and me to simply apply the patch and use git show $files_for_your_subystem on your end. I have put the patch to attempts/oom-vs-mmu-notifiers branch to my tree at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git Not wanting to block something as important as this, so feel free to add an Acked-by: Christian König to the patch. Let's rather face the next topic: Any idea how to runtime test this? I mean I can rather easily provide a test which crashes an AMD GPU, which in turn then would mean that the MMU notifier would block forever without this patch. But do you know a way to let the OOM killer kill a specific process? Regards, Christian. 
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:24:29, Christian König wrote: > Am 02.07.2018 um 14:20 schrieb Michal Hocko: > > On Mon 02-07-18 14:13:42, Christian König wrote: > > > Am 02.07.2018 um 13:54 schrieb Michal Hocko: > > > > On Mon 02-07-18 11:14:58, Christian König wrote: > > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > The > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > mostly > > > > > > because I have no idea how. > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > That sounds like it should work and at least the amdgpu changes now > > > > > look > > > > > good to me on first glance. > > > > > > > > > > Can you split that up further in the usual way? E.g. adding the > > > > > blockable > > > > > flag in one patch and fixing all implementations of the MMU notifier > > > > > in > > > > > follow up patches. > > > > But such a code would be broken, no? Ignoring the blockable state will > > > > simply lead to lockups until the fixup parts get applied. > > > Well to still be bisect-able you only need to get the interface change in > > > first with fixing the function signature of the implementations. > > That would only work if those functions return -AGAIN unconditionally. > > Otherwise they would pretend to not block while that would be obviously > > incorrect. This doesn't sound correct to me. > > > > > Then add all the new code to the implementations and last start to > > > actually > > > use the new interface. > > > > > > That is a pattern we use regularly and I think it's good practice to do > > > this. > > But we do rely on the proper blockable handling. > > Yeah, but you could add the handling only after you have all the > implementations in place. Don't you? Yeah, but then I would be adding a code with no user. And I really prefer to no do so because then the code is harder to argue about. 
> > > > Is the split up really worth it? I was thinking about that but had hard > > > > times to end up with something that would be bisectable. Well, except > > > > for returning -EBUSY until all notifiers are implemented. Which I found > > > > confusing. > > > It at least makes reviewing changes much easier, cause as driver > > > maintainer > > > I can concentrate on the stuff only related to me. > > > > > > Additional to that when you cause some unrelated side effect in a driver > > > we > > > can much easier pinpoint the actual change later on when the patch is > > > smaller. > > > > > > > > This way I'm pretty sure Felix and I can give an rb on the > > > > > amdgpu/amdkfd > > > > > changes. > > > > If you are worried to give r-b only for those then this can be done even > > > > for larger patches. Just make your Reviewd-by more specific > > > > R-b: name # For BLA BLA > > > Yeah, possible alternative but more work for me when I review it :) > > I definitely do not want to add more work to reviewers and I completely > > see how massive "flag days" like these are not popular but I really > > didn't find a reasonable way around that would be both correct and > > wouldn't add much more churn on the way. So if you really insist then I > > would really appreciate a hint on the way to achive the same without any > > above downsides. > > Well, I don't insist on this. It's just from my point of view that this > patch doesn't needs to be one patch, but could be split up. Well, if there are more people with the same concern I can try to do that. But if your only concern is to focus on your particular part then I guess it would be easier both for you and me to simply apply the patch and use git show $files_for_your_subystem on your end. 
I have put the patch to attempts/oom-vs-mmu-notifiers branch to my tree at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 14:20 schrieb Michal Hocko: On Mon 02-07-18 14:13:42, Christian König wrote: Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. But we do rely on the proper blockable handling. Yeah, but you could add the handling only after you have all the implementations in place. Don't you? Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. 
Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. Well, I don't insist on this. It's just from my point of view that this patch doesn't needs to be one patch, but could be split up. Could be that I just don't know the code or the consequences of adding that well enough to really judge. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:13:42, Christian König wrote: > Am 02.07.2018 um 13:54 schrieb Michal Hocko: > > On Mon 02-07-18 11:14:58, Christian König wrote: > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > That sounds like it should work and at least the amdgpu changes now look > > > good to me on first glance. > > > > > > Can you split that up further in the usual way? E.g. adding the blockable > > > flag in one patch and fixing all implementations of the MMU notifier in > > > follow up patches. > > But such a code would be broken, no? Ignoring the blockable state will > > simply lead to lockups until the fixup parts get applied. > > Well to still be bisect-able you only need to get the interface change in > first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. > Then add all the new code to the implementations and last start to actually > use the new interface. > > That is a pattern we use regularly and I think it's good practice to do > this. But we do rely on the proper blockable handling. > > Is the split up really worth it? I was thinking about that but had hard > > times to end up with something that would be bisectable. Well, except > > for returning -EBUSY until all notifiers are implemented. Which I found > > confusing. > > It at least makes reviewing changes much easier, cause as driver maintainer > I can concentrate on the stuff only related to me. 
> > Additional to that when you cause some unrelated side effect in a driver we > can much easier pinpoint the actual change later on when the patch is > smaller. > > > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd > > > changes. > > If you are worried to give r-b only for those then this can be done even > > for larger patches. Just make your Reviewd-by more specific > > R-b: name # For BLA BLA > > Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) Regards, Christian. 
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 11:14:58, Christian König wrote: > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > This is the v2 of RFC based on the feedback I've received so far. The > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > because I have no idea how. > > > > Any further feedback is highly appreciated of course. > > That sounds like it should work and at least the amdgpu changes now look > good to me on first glance. > > Can you split that up further in the usual way? E.g. adding the blockable > flag in one patch and fixing all implementations of the MMU notifier in > follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd > changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. Thanks, Christian. --- From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 20 Jun 2018 15:03:20 +0200 Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. 
This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) 
Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- incl
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how.

Any further feedback is highly appreciated of course.

---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though.

This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) 
Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- include/rdma/ib_umem_odp.h | 3 +- mm/hmm.c| 7 ++-- mm/mmap.c | 2 +- mm/mmu_notifier.c | 19 --- mm/oom_kill.c | 29 virt/kvm/kvm_main.c | 15 ++--- 19 files changed, 225 insertions(+), 80 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 25-06-18 10:01:03, Michal Hocko wrote: > On Fri 22-06-18 16:09:06, Felix Kuehling wrote: > > On 2018-06-22 11:24 AM, Michal Hocko wrote: > > > On Fri 22-06-18 17:13:02, Christian König wrote: > > >> Hi Michal, > > >> > > >> [Adding Felix as well] > > >> > > >> Well first of all you have a misconception why at least the AMD graphics > > >> driver need to be able to sleep in an MMU notifier: We need to sleep > > >> because > > >> we need to wait for hardware operations to finish and *NOT* because we > > >> need > > >> to wait for locks. > > >> > > >> I'm not sure if your flag now means that you generally can't sleep in MMU > > >> notifiers any more, but if that's the case at least AMD hardware will > > >> break > > >> badly. In our case the approach of waiting for a short time for the > > >> process > > >> to be reaped and then select another victim actually sounds like the > > >> right > > >> thing to do. > > > Well, I do not need to make the notifier code non blocking all the time. > > > All I need is to ensure that it won't sleep if the flag says so and > > > return -EAGAIN instead. > > > > > > So here is what I do for amdgpu: > > > > In the case of KFD we also need to take the DQM lock: > > > > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> > > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch > > > > So we'd need to pass the blockable parameter all the way through that > > call chain. > > Thanks, I have missed that part. So I guess I will start with something > similar to intel-gfx and back off when the current range needs some > treatment. So this on top. Does it look correct? 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > index d138a526feff..e2d422b3eb0b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct > mmu_notifier *mn, > struct amdgpu_mn_node *node; > struct amdgpu_bo *bo; > > + if (!blockable) { > + amdgpu_mn_read_unlock(); > + return -EAGAIN; > + } > + > node = container_of(it, struct amdgpu_mn_node, it); > it = interval_tree_iter_next(it, start, end); Ble, just noticed that half of the change didn't get to git index... This is what I have commit c4701b36ac2802b903db3d05cf77c030fccce3a8 Author: Michal Hocko Date: Mon Jun 25 15:24:03 2018 +0200 fold me - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index d138a526feff..3399a4a927fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, while (it) { struct amdgpu_mn_node *node; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); @@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, struct amdgpu_mn_node *node; struct amdgpu_bo *bo; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 16:09:06, Felix Kuehling wrote: > On 2018-06-22 11:24 AM, Michal Hocko wrote: > > On Fri 22-06-18 17:13:02, Christian König wrote: > >> Hi Michal, > >> > >> [Adding Felix as well] > >> > >> Well first of all you have a misconception why at least the AMD graphics > >> driver need to be able to sleep in an MMU notifier: We need to sleep > >> because > >> we need to wait for hardware operations to finish and *NOT* because we need > >> to wait for locks. > >> > >> I'm not sure if your flag now means that you generally can't sleep in MMU > >> notifiers any more, but if that's the case at least AMD hardware will break > >> badly. In our case the approach of waiting for a short time for the process > >> to be reaped and then select another victim actually sounds like the right > >> thing to do. > > Well, I do not need to make the notifier code non blocking all the time. > > All I need is to ensure that it won't sleep if the flag says so and > > return -EAGAIN instead. > > > > So here is what I do for amdgpu: > > In the case of KFD we also need to take the DQM lock: > > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch > > So we'd need to pass the blockable parameter all the way through that > call chain. Thanks, I have missed that part. So I guess I will start with something similar to intel-gfx and back off when the current range needs some treatment. So this on top. Does it look correct? 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	struct amdgpu_mn_node *node;
 	struct amdgpu_bo *bo;
 
+	if (!blockable) {
+		amdgpu_mn_read_unlock(rmn);
+		return -EAGAIN;
+	}
+
 	node = container_of(it, struct amdgpu_mn_node, it);
 	it = interval_tree_iter_next(it, start, end);
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018-06-22 11:24 AM, Michal Hocko wrote: > On Fri 22-06-18 17:13:02, Christian König wrote: >> Hi Michal, >> >> [Adding Felix as well] >> >> Well first of all you have a misconception why at least the AMD graphics >> driver need to be able to sleep in an MMU notifier: We need to sleep because >> we need to wait for hardware operations to finish and *NOT* because we need >> to wait for locks. >> >> I'm not sure if your flag now means that you generally can't sleep in MMU >> notifiers any more, but if that's the case at least AMD hardware will break >> badly. In our case the approach of waiting for a short time for the process >> to be reaped and then select another victim actually sounds like the right >> thing to do. > Well, I do not need to make the notifier code non blocking all the time. > All I need is to ensure that it won't sleep if the flag says so and > return -EAGAIN instead. > > So here is what I do for amdgpu: In the case of KFD we also need to take the DQM lock: amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch So we'd need to pass the blockable parameter all the way through that call chain. Regards, Felix > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> index 83e344fbb50a..d138a526feff 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) >>>* >>>* Take the rmn read side lock. 
>>>   */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>  {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>  	if (atomic_inc_return(&rmn->recursion) == 1)
>>>  		down_read_non_owner(&rmn->lock);
>>>  	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>  }
>>>  /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>   * We block for all BOs between start and end to be idle and
>>>   * unmap them by move them into system domain again.
>>>   */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  						struct mm_struct *mm,
>>>  						unsigned long start,
>>> -						unsigned long end)
>>> +						unsigned long end,
>>> +						bool blockable)
>>>  {
>>>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>  	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  	/* notification is exclusive, but interval is inclusive */
>>>  	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>  	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  		amdgpu_mn_invalidate_node(node, start, end);
>>>  	}
>>> +
>>> +	return 0;
>>>  }
>>>  /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   * necessitates evicting all user-mode queues of the process. The BOs
>>>   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>   */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>  						struct mm_struct *mm,
>>>  						unsigned long start,
>>> -						unsigned long end)
>>> +						unsigned long end,
>>> +						bool blockable)
>>>  {
>>>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>  	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>  	/* notification is exclusive, but interval is inclusive */
>>>  	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>  	while (it) {
>>> @@ -262,6 +277,8 @@ static void
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote: > [Resnding with the CC list fixed] > > On Fri 22-06-18 18:40:26, Michal Hocko wrote: > > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > > Hi, > > > > > > this is an RFC and not tested at all. I am not very familiar with > > > > > > the > > > > > > mmu notifiers semantics very much so this is a crude attempt to > > > > > > achieve > > > > > > what I need basically. It might be completely wrong but I would like > > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had > > > > > > to trim > > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object > > > > > > *mo) > > > > > > mo->attached = false; > > > > > > } > > > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > > > mmu_notifier *_mn, > > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > > > mmu_notifier *_mn, > > > > > >struct > > > > > > mm_struct *mm, > > > > > >unsigned > > > > > > long start, > > > > > > - unsigned > > > > > > long end) > > > > > > + unsigned > > > > > > long end, > > > > > > + bool > > > > > > blockable) > > > > > > { > > > > > > struct i915_mmu_notifier *mn = > > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > > @@ -124,7 +125,7 @@ static void > > > > > > 
i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > > LIST_HEAD(cancelled); > > > > > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > > > - return; > > > > > > + return 0; > > > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > > access). > > > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > > > Arbitrary amount of time but in desktop use case you can assume that > > > it should never go above 16ms for a 60frame per second rendering of > > > your desktop (in GPU compute case this kind of assumption does not > > > hold). Is the process exit_state already updated by the time this mmu > > > notifier callbacks happen ? > > > > What do you mean? The process is killed (by SIGKILL) at the time but we > > do not know much more than that. The task might be stuck anywhere in the > > kernel before handling that signal. I was wondering if another thread might still be dereferencing any of the structure concurrently with the OOM mmu notifier callback. Saddly yes, it would be simpler if we could make such assumption. > > > > > > > The first pass would be then to not do anything here if > > > > > !blockable. > > > > > > > > something like this? (incremental diff) > > > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > > depends on the range and invalidate internal driver states (like clear > > > buffer object pages array in case of GPU but not GPU page table). While > > > the second callback would do the actual wait on the GPU to be done and > > > update the GPU page table. > > > > What can you do after the first phase? Can I unmap the range? 
No you can't do anything but this force synchronization as any other thread that concurrently trie to do something with those would queue up. So it would serialize thing. Also main motivation on my side is multi-GPU, right now multi-GPU are not that common but this is changing quickly and what we see on high end (4, 8 or 16 GPUs per socket) is spreading into more configurations. Here in mutli GPU case splitting in two would avoid having to fully wait on first GPU before trying to invaidate on second GPU, so on and so forth. > > > > > Now in this scheme in case the task is already in some exit state and > > > that all CPU threads are frozen/kill then we can probably find a way to > > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > >
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
[Resnding with the CC list fixed] On Fri 22-06-18 18:40:26, Michal Hocko wrote: > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > Hi, > > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > > mmu notifiers semantics very much so this is a crude attempt to > > > > > achieve > > > > > what I need basically. It might be completely wrong but I would like > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to > > > > > trim > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object > > > > > *mo) > > > > > mo->attached = false; > > > > > } > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > > mmu_notifier *_mn, > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > > mmu_notifier *_mn, > > > > >struct > > > > > mm_struct *mm, > > > > >unsigned long > > > > > start, > > > > > - unsigned long > > > > > end) > > > > > + unsigned long > > > > > end, > > > > > + bool blockable) > > > > > { > > > > > struct i915_mmu_notifier *mn = > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > @@ -124,7 +125,7 @@ static void > > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > LIST_HEAD(cancelled); > > > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > > - return; > > > > > + return 
0; > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > access). > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > Arbitrary amount of time but in desktop use case you can assume that > > it should never go above 16ms for a 60frame per second rendering of > > your desktop (in GPU compute case this kind of assumption does not > > hold). Is the process exit_state already updated by the time this mmu > > notifier callbacks happen ? > > What do you mean? The process is killed (by SIGKILL) at the time but we > do not know much more than that. The task might be stuck anywhere in the > kernel before handling that signal. > > > > > The first pass would be then to not do anything here if > > > > !blockable. > > > > > > something like this? (incremental diff) > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > depends on the range and invalidate internal driver states (like clear > > buffer object pages array in case of GPU but not GPU page table). While > > the second callback would do the actual wait on the GPU to be done and > > update the GPU page table. > > What can you do after the first phase? Can I unmap the range? > > > Now in this scheme in case the task is already in some exit state and > > that all CPU threads are frozen/kill then we can probably find a way to > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > share userptr bo hence a uptr bo should only ever be access through > > ioctl submited by the process. > > > > The second call can then be delayed and ping from time to time to see > > if GPU jobs are done. 
> > > > > > Note that what you propose might still be useful as in case there is > > no buffer object for a range then OOM can make progress in freeing a > > range of memory. It is very likely that significant virtual address > > range of a process and backing memory can be reclaim that way. This > > assume OOM reclaim vma by vma or in some form of granularity like > > reclaiming 1GB by 1GB. Or we could also update blocking callback to > > return range that are blocking that way OOM can reclaim around. > > Exactly my point. What we have right now is all or nothing which is > obviously too coarse to be useful. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
[Hmm, the cc list got mangled somehow - you have just made many people to work for suse ;) and to kvack.org in the preious one - fixed up hopefully] On Fri 22-06-18 17:07:21, Chris Wilson wrote: > Quoting Michal Hocko (2018-06-22 16:57:16) > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > Hi, > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > > what I need basically. It might be completely wrong but I would like > > > > to discuss what would be a better way if that is the case. > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to > > > > trim > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > index 854bd51b9478..5285df9331fa 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > > mo->attached = false; > > > > } > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > mmu_notifier *_mn, > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > mmu_notifier *_mn, > > > >struct mm_struct > > > > *mm, > > > >unsigned long > > > > start, > > > > - unsigned long > > > > end) > > > > + unsigned long > > > > end, > > > > + bool blockable) > > > > { > > > > struct i915_mmu_notifier *mn = > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > @@ -124,7 +125,7 @@ static void > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > LIST_HEAD(cancelled); > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > - return; > > > > + return 0; > > > > > > The principle wait here is for the HW (even after 
fixing all the locks > > > to be not so coarse, we still have to wait for the HW to finish its > > > access). > > > > Is this wait bound or it can take basically arbitrary amount of time? > > Arbitrary. It waits for the last operation in the queue that needs that > set of backing pages, and that queue is unbounded and not even confined > to the local driver. (Though each operation should be bounded to be > completed within an interval or be cancelled, that interval is on the > order of 10s!) OK, I see. We should rather not wait that long so backoff is just better. The whole point of the oom_reaper is to tear down and free some memory. We do not really need to reclaim all of it. It would be great if we could do something like - kick the tear down of the device memory but have it done in the background. We wouldn't tear the vma down in that case but the whole process would start at least. I am not sure something like that is possible. > > > The first pass would be then to not do anything here if > > > !blockable. > > > > something like this? (incremental diff) > > Yup. Cool, I will start with that because even that is an improvement from the oom_reaper POV. Thanks! -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > Hi, > > > this is an RFC and not tested at all. I am not very familiar with the > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > what I need basically. It might be completely wrong but I would like > > > to discuss what would be a better way if that is the case. > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > it down. If you think I have forgot somebody, please let me know > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > index 854bd51b9478..5285df9331fa 100644 > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > mo->attached = false; > > > } > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > mmu_notifier *_mn, > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > mmu_notifier *_mn, > > >struct mm_struct > > > *mm, > > >unsigned long > > > start, > > > - unsigned long end) > > > + unsigned long end, > > > + bool blockable) > > > { > > > struct i915_mmu_notifier *mn = > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > @@ -124,7 +125,7 @@ static void > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > LIST_HEAD(cancelled); > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > - return; > > > + return 0; > > > > The principle wait here is for the HW (even after fixing all the locks > > to be not so coarse, we still have to wait for the HW to finish its > > access). > > Is this wait bound or it can take basically arbitrary amount of time? 
Arbitrary amount of time but in desktop use case you can assume that it should never go above 16ms for a 60frame per second rendering of your desktop (in GPU compute case this kind of assumption does not hold). Is the process exit_state already updated by the time this mmu notifier callbacks happen ? > > > The first pass would be then to not do anything here if > > !blockable. > > something like this? (incremental diff) What i wanted to do with HMM and mmu notifier is split the invalidation in 2 pass. First pass tell the drivers to stop/cancel pending jobs that depends on the range and invalidate internal driver states (like clear buffer object pages array in case of GPU but not GPU page table). While the second callback would do the actual wait on the GPU to be done and update the GPU page table. Now in this scheme in case the task is already in some exit state and that all CPU threads are frozen/kill then we can probably find a way to do the first path mostly lock less. AFAICR nor AMD nor Intel allow to share userptr bo hence a uptr bo should only ever be access through ioctl submited by the process. The second call can then be delayed and ping from time to time to see if GPU jobs are done. Note that what you propose might still be useful as in case there is no buffer object for a range then OOM can make progress in freeing a range of memory. It is very likely that significant virtual address range of a process and backing memory can be reclaim that way. This assume OOM reclaim vma by vma or in some form of granularity like reclaiming 1GB by 1GB. Or we could also update blocking callback to return range that are blocking that way OOM can reclaim around. Cheers, Jérôme ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 16:36:49, Chris Wilson wrote: > Quoting Michal Hocko (2018-06-22 16:02:42) > > Hi, > > this is an RFC and not tested at all. I am not very familiar with the > > mmu notifiers semantics very much so this is a crude attempt to achieve > > what I need basically. It might be completely wrong but I would like > > to discuss what would be a better way if that is the case. > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > it down. If you think I have forgot somebody, please let me know > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > index 854bd51b9478..5285df9331fa 100644 > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > mo->attached = false; > > } > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > > *_mn, > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > > *_mn, > >struct mm_struct *mm, > >unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + bool blockable) > > { > > struct i915_mmu_notifier *mn = > > container_of(_mn, struct i915_mmu_notifier, mn); > > @@ -124,7 +125,7 @@ static void > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > LIST_HEAD(cancelled); > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > - return; > > + return 0; > > The principle wait here is for the HW (even after fixing all the locks > to be not so coarse, we still have to wait for the HW to finish its > access). Is this wait bound or it can take basically arbitrary amount of time? > The first pass would be then to not do anything here if > !blockable. something like this? 
(incremental diff)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 5285df9331fa..e9ed0d2cfabc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		container_of(_mn, struct i915_mmu_notifier, mn);
 	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
+	int ret = 0;
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
@@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_unlock(&mn->lock);
 
 	/* TODO: can we skip waiting here? */
-	if (!list_empty(&cancelled) && blockable)
+	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
--
Michal Hocko
SUSE Labs
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Quoting Michal Hocko (2018-06-22 16:02:42) > Hi, > this is an RFC and not tested at all. I am not very familiar with the > mmu notifiers semantics very much so this is a crude attempt to achieve > what I need basically. It might be completely wrong but I would like > to discuss what would be a better way if that is the case. > > get_maintainers gave me quite large list of people to CC so I had to trim > it down. If you think I have forgot somebody, please let me know > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > b/drivers/gpu/drm/i915/i915_gem_userptr.c > index 854bd51b9478..5285df9331fa 100644 > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > mo->attached = false; > } > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > *_mn, > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > *_mn, >struct mm_struct *mm, >unsigned long start, > - unsigned long end) > + unsigned long end, > + bool blockable) > { > struct i915_mmu_notifier *mn = > container_of(_mn, struct i915_mmu_notifier, mn); > @@ -124,7 +125,7 @@ static void > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > LIST_HEAD(cancelled); > > if (RB_EMPTY_ROOT(>objects.rb_root)) > - return; > + return 0; The principle wait here is for the HW (even after fixing all the locks to be not so coarse, we still have to wait for the HW to finish its access). The first pass would be then to not do anything here if !blockable. Jerome keeps on shaking his head and telling us we're doing it all wrong, so maybe it'll all fall out of HMM before we have to figure out how to differentiate between objects that can be invalidated immediately and those that need to acquire locks and/or wait. -Chris ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 17:13:02, Christian König wrote:
> Hi Michal,
>
> [Adding Felix as well]
>
> Well first of all you have a misconception why at least the AMD graphics
> driver needs to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
>
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then selecting another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non-blocking all the time. All I
need is to ensure that it won't sleep if the flag says so and return -EAGAIN
instead. So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * Take the rmn read side lock.
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >  {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >  	if (atomic_inc_return(&rmn->recursion) == 1)
> >  		down_read_non_owner(&rmn->lock);
> >  	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >  }
> >
> >  /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >   * We block for all BOs between start and end to be idle and
> >   * unmap them by move them into system domain again.
> >   */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  						 struct mm_struct *mm,
> >  						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >  {
> >  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >  	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  	/* notification is exclusive, but interval is inclusive */
> >  	end -= 1;
> >
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >
> >  	it = interval_tree_iter_first(&rmn->objects, start, end);
> >  	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  		amdgpu_mn_invalidate_node(node, start, end);
> >  	}
> > +
> > +	return 0;
> >  }
> >
> >  /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   * necessitates evicting all user-mode queues of the process. The BOs
> >   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >   */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  						 struct mm_struct *mm,
> >  						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >  {
> >  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >  	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  	/* notification is exclusive, but interval is inclusive */
> >  	end -= 1;
> >
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >
> >  	it = interval_tree_iter_first(&rmn->objects, start, end);
> >  	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  			amdgpu_amdkfd_evict_userptr(mem, mm);
> >  		}
> >  	}
> > +
> > +	return 0;
> >  }
> >
> >  /**
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Hi Michal, [Adding Felix as well] Well first of all you have a misconception why at least the AMD graphics driver need to be able to sleep in an MMU notifier: We need to sleep because we need to wait for hardware operations to finish and *NOT* because we need to wait for locks. I'm not sure if your flag now means that you generally can't sleep in MMU notifiers any more, but if that's the case at least AMD hardware will break badly. In our case the approach of waiting for a short time for the process to be reaped and then select another victim actually sounds like the right thing to do. What we also already try to do is to abort hardware operations with the address space when we detect that the process is dying, but that can certainly be improved. Regards, Christian. Am 22.06.2018 um 17:02 schrieb Michal Hocko: From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover most notifiers only care about a portion of the address space. This patch handles the first part of the problem. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock I presume. 
Anyway, what does the oom_reaper do with all that? We do not have to fail right away. We simply retry if there is at least one notifier which couldn't make any progress. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, this is an RFC and not tested at all. I am not very familiar with the mmu notifiers semantics very much so this is a crude attempt to achieve what I need basically. It might be completely wrong but I would like to discuss what would be a better way if that is the case. get_maintainers gave me quite large list of people to CC so I had to trim it down. If you think I have forgot somebody, please let me know Any feedback is highly appreciated. 
arch/x86/kvm/x86.c | 7 -- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 33 +++-- drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +--- drivers/gpu/drm/radeon/radeon_mn.c | 15 --- drivers/infiniband/core/umem_odp.c | 15 --- drivers/infiniband/hw/hfi1/mmu_rb.c | 7 -- drivers/misc/mic/scif/scif_dma.c| 7 -- drivers/misc/sgi-gru/grutlbpurge.c | 7 -- drivers/xen/gntdev.c| 14 --- include/linux/kvm_host.h| 2 +- include/linux/mmu_notifier.h| 15 +-- mm/hmm.c| 7 -- mm/mmu_notifier.c | 15 --- mm/oom_kill.c | 29 +++--- virt/kvm/kvm_main.c | 12 ++--- 15 files changed, 137 insertions(+), 58 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu,
[RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover most notifiers only care about a portion of the address space. This patch handles the first part of the problem. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock I presume. Anyway, what does the oom_reaper do with all that? We do not have to fail right away. We simply retry if there is at least one notifier which couldn't make any progress. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. 
Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, this is an RFC and not tested at all. I am not very familiar with the mmu notifiers semantics very much so this is a crude attempt to achieve what I need basically. It might be completely wrong but I would like to discuss what would be a better way if that is the case. get_maintainers gave me quite large list of people to CC so I had to trim it down. If you think I have forgot somebody, please let me know Any feedback is highly appreciated. 
arch/x86/kvm/x86.c | 7 -- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 33 +++-- drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +--- drivers/gpu/drm/radeon/radeon_mn.c | 15 --- drivers/infiniband/core/umem_odp.c | 15 --- drivers/infiniband/hw/hfi1/mmu_rb.c | 7 -- drivers/misc/mic/scif/scif_dma.c| 7 -- drivers/misc/sgi-gru/grutlbpurge.c | 7 -- drivers/xen/gntdev.c| 14 --- include/linux/kvm_host.h| 2 +- include/linux/mmu_notifier.h| 15 +-- mm/hmm.c| 7 -- mm/mmu_notifier.c | 15 --- mm/oom_kill.c | 29 +++--- virt/kvm/kvm_main.c | 12 ++--- 15 files changed, 137 insertions(+), 58 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); } -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, + unsigned long start, unsigned long end, + bool blockable) { unsigned long apic_address; @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (start <= apic_address && apic_address < end) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); + + return 0; } void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 83e344fbb50a..d138a526feff 100644 ---