Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/27 16:41, Christian König wrote:
> Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
>> I'm not following. Why don't we need to do like below (given that
>> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
>> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
>> is doing GFP_KERNEL memory allocation with ->lock held for write?
>
> That's a bug which needs to be fixed separately.
>
> Allocating memory with GFP_KERNEL while holding a lock which is also taken in
> the reclaim code path is illegal no matter what you do.
>
> Patches to fix this are already on the appropriate mailing list and will be
> pushed upstream today.
>
> Regards,
> Christian.

Commit 4a2de54dc1d7668f ("drm/amdgpu: fix holding mn_lock while allocating
memory") seems to be calling amdgpu_mn_unlock() without amdgpu_mn_lock() when
drm_sched_job_init() failed...

Michal, you are asking me to fix all bugs (including out of tree code) and
prevent future bugs just because you want to avoid using a timeout in order to
avoid OOM lockup
( https://marc.info/?i=55a3fb37-3246-73d7-0f45-5835a3f48...@i-love.sakura.ne.jp ).
That is too much to ask; it is impossible even for you. The more we count on
the OOM reaper, the more we complicate the dependencies and the more likely we
are to stumble over unreviewed/untested code...

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
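Christian's rule (never allocate with GFP_KERNEL while holding a lock that the reclaim path can also take) can be sketched in userspace C. The pthread mutex below stands in for the notifier lock; every name is illustrative, not the kernel API. Using trylock in the simulated reclaim path lets the sketch report the would-be deadlock instead of hanging the way the real blocking acquisition would:

```c
#include <errno.h>
#include <pthread.h>

/* Stand-in for the notifier lock: also taken by the (simulated) reclaim path. */
static pthread_mutex_t mn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reclaim needs mn_lock; trylock lets us detect the self-deadlock
 * instead of blocking forever like the real acquisition would. */
static int simulated_reclaim(void)
{
	if (pthread_mutex_trylock(&mn_lock) == EBUSY)
		return -1;		/* would deadlock: lock already held by us */
	pthread_mutex_unlock(&mn_lock);
	return 0;
}

/* GFP_KERNEL-style allocation: may enter reclaim when memory is tight. */
static int alloc_may_reclaim(void)
{
	return simulated_reclaim();
}

/* The buggy pattern: allocate while holding a lock reclaim also takes. */
int buggy_submit(void)
{
	int ret;

	pthread_mutex_lock(&mn_lock);	/* like taking mn_lock around submission */
	ret = alloc_may_reclaim();	/* like doing GFP_KERNEL under that lock */
	pthread_mutex_unlock(&mn_lock);
	return ret;			/* -1 marks where the kernel would deadlock */
}
```

The returned -1 corresponds to the point where a real kernel would hang: reclaim runs on the same thread that already owns the lock, so the blocking variant can never succeed.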
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 26.08.2018 um 10:40 schrieb Tetsuo Handa:
> On 2018/08/24 22:52, Michal Hocko wrote:
>> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>   */
>>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>>  {
>> -	if (blockable)
>> -		mutex_lock(&amn->read_lock);
>> -	else if (!mutex_trylock(&amn->read_lock))
>> -		return -EAGAIN;
>> -
>> +	/*
>> +	 * We can take the sleepable lock even in !blockable mode because
>> +	 * read_lock is only ever taken from this path and the notifier
>> +	 * lock never really sleeps. In fact the only reason why the
>> +	 * latter is sleepable is because the notifier itself might sleep
>> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
>> +	 * before calling into that path.
>> +	 */
>> +	mutex_lock(&amn->read_lock);
>>  	if (atomic_inc_return(&amn->recursion) == 1)
>>  		down_read_non_owner(&amn->lock);
>>  	mutex_unlock(&amn->read_lock);
>
> I'm not following. Why don't we need to do like below (given that
> nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
> drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
> is doing GFP_KERNEL memory allocation with ->lock held for write?

That's a bug which needs to be fixed separately.

Allocating memory with GFP_KERNEL while holding a lock which is also taken in
the reclaim code path is illegal no matter what you do.

Patches to fix this are already on the appropriate mailing list and will be
pushed upstream today.

Regards,
Christian.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index e55508b..e1cb344 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -64,8 +64,6 @@
>   * @node: hash table node to find structure by adev and mn
>   * @lock: rw semaphore protecting the notifier nodes
>   * @objects: interval tree containing amdgpu_mn_nodes
> - * @read_lock: mutex for recursive locking of @lock
> - * @recursion: depth of recursion
>   *
>   * Data for each amdgpu device and process address space.
>   */
> @@ -85,8 +83,6 @@ struct amdgpu_mn {
>  	/* objects protected by lock */
>  	struct rw_semaphore lock;
>  	struct rb_root_cached objects;
> -	struct mutex read_lock;
> -	atomic_t recursion;
>  };
>
>  /**
> @@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
>  	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> +		down_read(&amn->lock);
> +	else if (!down_read_trylock(&amn->lock))
>  		return -EAGAIN;
> -
> -	if (atomic_inc_return(&amn->recursion) == 1)
> -		down_read_non_owner(&amn->lock);
> -	mutex_unlock(&amn->read_lock);
> -
>  	return 0;
>  }
>
> @@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>   */
>  static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
>  {
> -	if (atomic_dec_return(&amn->recursion) == 0)
> -		up_read_non_owner(&amn->lock);
> +	up_read(&amn->lock);
>  }
>
>  /**
> @@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>  	amn->type = type;
>  	amn->mn.ops = &amdgpu_mn_ops[type];
>  	amn->objects = RB_ROOT_CACHED;
> -	mutex_init(&amn->read_lock);
> -	atomic_set(&amn->recursion, 0);
>
>  	r = __mmu_notifier_register(&amn->mn, mm);
>  	if (r)
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 22:52, Michal Hocko wrote:
> @@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   */
>  static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	if (blockable)
> -		mutex_lock(&amn->read_lock);
> -	else if (!mutex_trylock(&amn->read_lock))
> -		return -EAGAIN;
> -
> +	/*
> +	 * We can take the sleepable lock even in !blockable mode because
> +	 * read_lock is only ever taken from this path and the notifier
> +	 * lock never really sleeps. In fact the only reason why the
> +	 * latter is sleepable is because the notifier itself might sleep
> +	 * in amdgpu_mn_invalidate_node but blockable mode is handled
> +	 * before calling into that path.
> +	 */
> +	mutex_lock(&amn->read_lock);
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);
>  	mutex_unlock(&amn->read_lock);

I'm not following. Why don't we need to do like below (given that
nobody except amdgpu_mn_read_lock() holds ->read_lock) because e.g.
drm_sched_fence_create() from drm_sched_job_init() from amdgpu_cs_submit()
is doing GFP_KERNEL memory allocation with ->lock held for write?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b..e1cb344 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -64,8 +64,6 @@
  * @node: hash table node to find structure by adev and mn
  * @lock: rw semaphore protecting the notifier nodes
  * @objects: interval tree containing amdgpu_mn_nodes
 - * @read_lock: mutex for recursive locking of @lock
 - * @recursion: depth of recursion
  *
  * Data for each amdgpu device and process address space.
  */
@@ -85,8 +83,6 @@ struct amdgpu_mn {
 	/* objects protected by lock */
 	struct rw_semaphore lock;
 	struct rb_root_cached objects;
-	struct mutex read_lock;
-	atomic_t recursion;
 };

 /**
@@ -181,14 +177,9 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
 	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
+		down_read(&amn->lock);
+	else if (!down_read_trylock(&amn->lock))
 		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
-
 	return 0;
 }

@@ -199,8 +190,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }

 /**
@@ -410,8 +400,6 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	amn->type = type;
 	amn->mn.ops = &amdgpu_mn_ops[type];
 	amn->objects = RB_ROOT_CACHED;
-	mutex_init(&amn->read_lock);
-	atomic_set(&amn->recursion, 0);

 	r = __mmu_notifier_register(&amn->mn, mm);
 	if (r)
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Two more worries for this patch.

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * @amn: our notifier
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
>  {
> -	mutex_lock(&amn->read_lock);
> +	if (blockable)
> +		mutex_lock(&amn->read_lock);
> +	else if (!mutex_trylock(&amn->read_lock))
> +		return -EAGAIN;
> +
>  	if (atomic_inc_return(&amn->recursion) == 1)
>  		down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false ?
I want a comment explaining why it is safe to use a blocking lock here.

>  	mutex_unlock(&amn->read_lock);
> +
> +	return 0;
>  }
>
>  /**

> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	up_write(&hmm->mirrors_sem);
>  }
>
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>  {
>  	struct hmm *hmm = mm->hmm;
>
>  	VM_BUG_ON(!hmm);
>
>  	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>  }
>
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,

This assumes that hmm_invalidate_range_end() does not have a memory
allocation dependency. But hmm_invalidate_range() from
hmm_invalidate_range_end() involves the

	down_read(&hmm->mirrors_sem);
	list_for_each_entry(mirror, &hmm->mirrors, list)
		mirror->ops->sync_cpu_device_pagetables(mirror, action,
							start, end);
	up_read(&hmm->mirrors_sem);

sequence. What is surprising is that there is no in-tree user who assigns
the sync_cpu_device_pagetables field.

$ grep -Fr sync_cpu_device_pagetables *
Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
include/linux/hmm.h:	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
mm/hmm.c:		mirror->ops->sync_cpu_device_pagetables(mirror, action,

That is, this API seems to be currently used only by out-of-tree users. Since
we can't check that nobody has a memory allocation dependency, I think that
hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
now.
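The mirror walk quoted above can be sketched in userspace C to make the worry concrete: each callback runs while the semaphore is held for read, and nothing in the walk can verify that a (possibly out-of-tree) callback does not allocate or block. The struct and callback names mimic the kernel ones, but everything here is an illustrative analogue, not the real API:

```c
#include <stddef.h>

struct mirror {
	void (*sync_cpu_device_pagetables)(struct mirror *m,
					   unsigned long start,
					   unsigned long end);
	struct mirror *next;
};

static struct mirror *mirrors;	/* list head; the kernel guards it with mirrors_sem */
static int calls;		/* counts callback invocations for the demo */

/* Walk all registered mirrors, as hmm_invalidate_range() does. Whether
 * any of these callbacks sleeps is entirely up to the driver behind it. */
static void invalidate_range(unsigned long start, unsigned long end)
{
	struct mirror *m;

	/* down_read(&hmm->mirrors_sem) would go here */
	for (m = mirrors; m != NULL; m = m->next)
		m->sync_cpu_device_pagetables(m, start, end);
	/* up_read(&hmm->mirrors_sem) */
}

static void count_sync(struct mirror *m, unsigned long s, unsigned long e)
{
	(void)m; (void)s; (void)e;
	calls++;
}

/* Register n dummy mirrors and return how many callbacks one walk makes. */
int demo_walk(int n)
{
	static struct mirror pool[8];
	int i;

	mirrors = NULL;
	calls = 0;
	for (i = 0; i < n && i < 8; i++) {
		pool[i].sync_cpu_device_pagetables = count_sync;
		pool[i].next = mirrors;
		mirrors = &pool[i];
	}
	invalidate_range(0, 4096);
	return calls;
}
```

Since the walk calls every registered mirror unconditionally, one blocking callback is enough to make the whole invalidate non-blockable, which is exactly why returning -EAGAIN for blockable == false is the conservative choice.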
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 22:32, Michal Hocko wrote:
> On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
>> I worry that (currently
>> out-of-tree) users of this API are involving work / recursion.
>
> I do not give a slightest about out-of-tree modules. They will have to
> accomodate to the new API. I have no problems to extend the
> documentation and be explicit about this expectation.

You don't need to care about out-of-tree modules. But you need to hear from
mm/hmm.c authors/maintainers when making changes for mmu-notifiers.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.

Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
mmu-notifiers users.

-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.

Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.

Now we are in a merge window. And we noticed a possibility that out-of-tree
mmu-notifiers users might have trouble with making changes immediately in
order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
immediately. And you are trying to ignore such a possibility by just updating
the expected-behavior description instead of giving out-of-tree users a grace
period to check and update their code.

>> and keeps "all operations protected by hmm->mirrors_sem held for write are
>> atomic". This suggests that "some operations protected by hmm->mirrors_sem
>> held for read will sleep (and in the worst case involve a memory allocation
>> dependency)".
>
> Yes and so what? The clear expectation is that neither of the range
> notifiers do not sleep in !blocking mode. I really fail to see what you
> are trying to say.

I'm saying "Get ACK from Jérôme about mm/hmm.c changes".
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018/08/24 20:36, Michal Hocko wrote:
>> That is, this API seems to be currently used only by out-of-tree users. Since
>> we can't check that nobody has a memory allocation dependency, I think that
>> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
>> now.
>
> The code expects that the invalidate_range_end doesn't block if
> invalidate_range_start hasn't blocked. That is the reason why the end
> callback doesn't have a blockable parameter. If this doesn't hold then the
> whole scheme is just fragile because those two calls should pair.

That is, the more worrisome part of that patch is that I don't know whether
using trylock if blockable == false at entry is really sufficient. Since
those two calls should pair, I think that we need to decide whether to
return -EAGAIN at the start call by evaluating both calls.

Like mn_invl_range_start() involves schedule_delayed_work(), which could be
blocked on memory allocation under an OOM situation, I worry that (currently
out-of-tree) users of this API are involving work / recursion.

And hmm_release() says that

	/*
	 * Drop mirrors_sem so callback can wait on any pending
	 * work that might itself trigger mmu_notifier callback
	 * and thus would deadlock with us.
	 */

and keeps "all operations protected by hmm->mirrors_sem held for write are
atomic". This suggests that "some operations protected by hmm->mirrors_sem
held for read will sleep (and in the worst case involve a memory allocation
dependency)".
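The pairing contract Michal describes (invalidate_range_end inherits the blocking mode that invalidate_range_start approved, and is only called when start succeeded) can be sketched as plain C. All names and the error value are illustrative stand-ins for the kernel API:

```c
#include <stdbool.h>

#define MY_EAGAIN 11		/* illustrative stand-in for EAGAIN */

static int start_calls, end_calls;

/* If the context cannot block, refuse up front: once start succeeds, the
 * end callback will run in the same context with no blockable parameter. */
static int invalidate_range_start(bool blockable)
{
	if (!blockable)
		return -MY_EAGAIN;	/* callbacks might sleep; bail out */
	start_calls++;
	return 0;
}

static void invalidate_range_end(void)
{
	end_calls++;
}

/* Caller contract: end is only invoked when start returned 0, so the pair
 * always executes (or is skipped) together. */
int invalidate(bool blockable)
{
	int ret = invalidate_range_start(blockable);

	if (ret)
		return ret;
	/* ... unmap pages here ... */
	invalidate_range_end();
	return 0;
}
```

This is why evaluating only the start path is not enough: approving a non-blocking start implicitly promises that everything reachable from the matching end call is non-blocking too.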
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> >
> > Reviewed-by: Jérôme Glisse
>
> So I guess the below is the patch you were talking about?
>
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
>  non-blocking mode
>
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will not block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch i already updated HMM to be like the other existing users, ie
all blocking operations in the start callback. But yes it would be wise to
add such comments.

> Signed-off-by: Michal Hocko
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>  	 *
>  	 * If blockable argument is set to false then the callback cannot
>  	 * sleep and has to return with -EAGAIN. 0 should be returned
> -	 * otherwise.
> +	 * otherwise. Please note that if invalidate_range_start approves
> +	 * a non-blocking behavior then the same applies to
> +	 * invalidate_range_end.
>  	 *
>  	 */
>  	int (*invalidate_range_start)(struct mmu_notifier *mn,
> --
> 2.18.0
>
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
>
> Reviewed-by: Jérôme Glisse

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will not block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,
--
2.18.0

--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> > > I worry that (currently
> > > out-of-tree) users of this API are involving work / recursion.
> >
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
>
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
>
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
>
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
>
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
>
> Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
> make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
>
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in
> order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
> immediately. And you are trying to ignore such a possibility by just
> updating the expected-behavior description instead of giving out-of-tree
> users a grace period to check and update their code.

This is just ridiculous. I have no idea what you are trying to achieve here
but please read through Documentation/process/stable-api-nonsense.rst before
you try to make strong statements again.

I have changed an in-kernel interface. I have gone through all users and
fixed them up. It is really appreciated to double check after me and I am
willing to fix up any fallouts. But that is just about it. I do not give a
whit about _any_ out of tree drivers when changing the interface. I am
willing to answer any questions regarding this change so developers of those
drivers know how to do their change properly but doing so is completely
their business.

> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem
> >> held for read will sleep (and in the worst case involve a memory allocation
> >> dependency)".
> >
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
>
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

HMM is a library layer for other drivers; until those get merged the same
applies to them as well.
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> > > I worry that (currently
> > > out-of-tree) users of this API are involving work / recursion.
> >
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
>
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
>
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  	 *
> >  	 * If blockable argument is set to false then the callback cannot
> >  	 * sleep and has to return with -EAGAIN. 0 should be returned
> > -	 * otherwise.
> > +	 * otherwise. Please note that if invalidate_range_start approves
> > +	 * a non-blocking behavior then the same applies to
> > +	 * invalidate_range_end.
>
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
>
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
>
> Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
> make changes) immediately, by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
>
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in
> order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
> immediately. And you are trying to ignore such a possibility by just
> updating the expected-behavior description instead of giving out-of-tree
> users a grace period to check and update their code.

The intention is that 99% of HMM users will be upstream; as long as they are
not, people shouldn't worry. We have been working on nouveau to use it for
the last year or so. Many bits were added in 4.16, 4.17, 4.18 and i hope it
will all be there in the 4.20/4.21 timeframe. See my other mail for a list
of other users.

> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem
> >> held for read will sleep (and in the worst case involve a memory allocation
> >> dependency)".
> >
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
>
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal patch, i already said so couple month ago first time
this discussion did pop up, Michal you can add:

Reviewed-by: Jérôme Glisse
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this; there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the
> > > > > > same time. E.g. the lock might be taken recursively and that is
> > > > > > illegal for a rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > >
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > >
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> >
> > I don't have the slightest idea.
> >
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> >
> > I'm neither.
> >
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> >
> > No idea where exactly that case comes from, but I can confirm that it
> > indeed happens.
>
> Thinking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifiers
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up with some more realistic scenario. If not then I would propose to
> simplify the locking here. We have lockdep to catch self deadlocks and it
> is always better to handle a specific issue rather than having code
> without a clear indication of how it can recurse.

Multiple concurrent mmu notifiers, for overlapping ranges or not, are common
(each concurrent thread can trigger some). So you might have multiple
invalidate_range_start() calls in flight for the same mm, and thus they
might complete (invalidate_range_end()) in a different order. IIRC this is
what this lock was trying to protect against.

I can't think of a reason for a recursive mmu notifier call right now. I
will ponder and see if i remember something about it.

Cheers,
Jérôme
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	up_write(&hmm->mirrors_sem);
> >  }
> >
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> >  				       unsigned long start,
> > -				       unsigned long end)
> > +				       unsigned long end,
> > +				       bool blockable)
> >  {
> >  	struct hmm *hmm = mm->hmm;
> >
> >  	VM_BUG_ON(!hmm);
> >
> >  	atomic_inc(&hmm->sequence);
> > +
> > +	return 0;
> >  }
> >
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>
> This assumes that hmm_invalidate_range_end() does not have a memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves the
>
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);
>
> sequence. What is surprising is that there is no in-tree user who assigns
> the sync_cpu_device_pagetables field.
>
> That is, this API seems to be currently used only by out-of-tree users. Since
> we can't check that nobody has a memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
> now.

So you can see updates and users of this there:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on the Mellanox and AMD GPU patchsets. I will post the HMM
changes that adapt to Michal's patch shortly, as they have been sufficiently
tested by now.

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20&id=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:44:03, Christian König wrote: > Am 24.08.2018 um 15:40 schrieb Michal Hocko: > > On Fri 24-08-18 15:28:33, Christian König wrote: > > > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > > > [...] > > > > > > > > Thiking about it some more, I can imagine that a notifier > > > > > > > > callback which > > > > > > > > performs an allocation might trigger a memory reclaim and that > > > > > > > > in turn > > > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > > > shouldn't really allocate memory. They are called from deep MM > > > > > > > > code > > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome > > > > > > > > can come > > > > > > > > up some more realistic scenario. If not then I would propose to > > > > > > > > simplify > > > > > > > > the locking here. We have lockdep to catch self deadlocks and > > > > > > > > it is > > > > > > > > always better to handle a specific issue rather than having a > > > > > > > > code > > > > > > > > without a clear indication how it can recurse. > > > > > > > Well I agree that we should probably fix that, but I have some > > > > > > > concerns to > > > > > > > remove the existing workaround. > > > > > > > > > > > > > > See we added that to get rid of a real problem in a customer > > > > > > > environment and > > > > > > > I don't want to that to show up again. > > > > > > It would really help to know more about that case and fix it > > > > > > properly > > > > > > rather than workaround it like this. Anyway, let me think how to > > > > > > handle > > > > > > the non-blocking notifier invocation then. I was not able to come up > > > > > > with anything remotely sane yet. 
> > > > > With avoiding allocating memory in the write lock path I don't see an > > > > > issue > > > > > any more with that. > > > > > > > > > > All what the write lock path does now is adding items to a linked > > > > > lists, > > > > > arrays etc > > > > Can we change it to non-sleepable lock then? > > > No, the write side doesn't sleep any more, but the read side does. > > > > > > See amdgpu_mn_invalidate_node() and that is where you actually need to > > > handle the non-blocking flag correctly. > > Ohh, right you are. We already handle that by bailing out before calling > > amdgpu_mn_invalidate_node in !blockable mode. > > Yeah, that is sufficient. > > It could be improved because we have something like 90% chance that > amdgpu_mn_invalidate_node() actually doesn't need to do anything. > > But I can take care of that when the patch set has landed. > > > So does this looks good to you? > > Yeah, that looks perfect to me. Reviewed-by: Christian König > Cool! Thanks for your guidance and patience with me. Here is the full patch. Feel free to take it and route per your preference. From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 24 Aug 2018 15:45:52 +0200 Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is incomplete because it might sleep on the notifier lock. This is true but as it turned out from the discussion with Christian this doesn't really matter. The amd notifier lock doesn't block in the exclusive mode. It only ever sleeps with the read lock inside amdgpu_mn_invalidate_node. That one is not called in !blockable state so while we might sleep on notifier read_lock this will only be for a short while. The same applies on the notifier lock. 
Therefore remove the blockable handling from amdgpu_mn_read_lock and document it properly.

Noticed-by: Tetsuo Handa
Reviewed-by: Christian König
Signed-off-by: Michal Hocko
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:40 schrieb Michal Hocko: On Fri 24-08-18 15:28:33, Christian König wrote: Am 24.08.2018 um 15:24 schrieb Michal Hocko: On Fri 24-08-18 15:10:08, Christian König wrote: Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc Can we change it to non-sleepable lock then? No, the write side doesn't sleep any more, but the read side does. See amdgpu_mn_invalidate_node() and that is where you actually need to handle the non-blocking flag correctly. Ohh, right you are. We already handle that by bailing out before calling amdgpu_mn_invalidate_node in !blockable mode. Yeah, that is sufficient. 
It could be improved because we have something like a 90% chance that amdgpu_mn_invalidate_node() actually doesn't need to do anything. But I can take care of that when the patch set has landed.

So does this look good to you?

Yeah, that looks perfect to me. Reviewed-by: Christian König

Thanks,
Christian.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:28:33, Christian König wrote: > Am 24.08.2018 um 15:24 schrieb Michal Hocko: > > On Fri 24-08-18 15:10:08, Christian König wrote: > > > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > > > [...] > > > > > > Thiking about it some more, I can imagine that a notifier callback > > > > > > which > > > > > > performs an allocation might trigger a memory reclaim and that in > > > > > > turn > > > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > > > shouldn't really allocate memory. They are called from deep MM code > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can > > > > > > come > > > > > > up some more realistic scenario. If not then I would propose to > > > > > > simplify > > > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > > > always better to handle a specific issue rather than having a code > > > > > > without a clear indication how it can recurse. > > > > > Well I agree that we should probably fix that, but I have some > > > > > concerns to > > > > > remove the existing workaround. > > > > > > > > > > See we added that to get rid of a real problem in a customer > > > > > environment and > > > > > I don't want to that to show up again. > > > > It would really help to know more about that case and fix it properly > > > > rather than workaround it like this. Anyway, let me think how to handle > > > > the non-blocking notifier invocation then. I was not able to come up > > > > with anything remotely sane yet. > > > With avoiding allocating memory in the write lock path I don't see an > > > issue > > > any more with that. > > > > > > All what the write lock path does now is adding items to a linked lists, > > > arrays etc > > Can we change it to non-sleepable lock then? > > No, the write side doesn't sleep any more, but the read side does. 
> > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > handle the non-blocking flag correctly.

Ohh, right you are. We already handle that by bailing out before calling amdgpu_mn_invalidate_node in !blockable mode.

So does this look good to you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
+	/*
+	 * We can take the sleepable lock even in !blockable mode because
+	 * read_lock is only ever taken from this path and the notifier
+	 * lock never really sleeps. In fact the only reason why the
+	 * latter is sleepable is because the notifier itself might sleep
+	 * in amdgpu_mn_invalidate_node, but blockable mode is handled
+	 * before calling into that path.
+	 */
+	mutex_lock(&amn->read_lock);
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);
 	mutex_unlock(&amn->read_lock);

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> On 2018/08/24 20:36, Michal Hocko wrote:
> >> That is, this API seems to be currently used by only out-of-tree users. Since
> >> we can't check that nobody has memory allocation dependency, I think that
> >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false
> >> for now.
> >
> > The code expects that the invalidate_range_end doesn't block if
> > invalidate_range_start hasn't blocked. That is the reason why the end
> > callback doesn't have a blockable parameter. If this doesn't hold then the
> > whole scheme is just fragile because those two calls should pair.
>
> That is
>
>   More worrisome part in that patch is that I don't know whether using
>   trylock if blockable == false at entry is really sufficient.
>
> . Since those two calls should pair, I think that we need to determine whether
> we need to return -EAGAIN at the start call by evaluating both calls.

Yes, and I believe I have done that audit. Modulo my misunderstanding of the code.

> Like mn_invl_range_start() involves schedule_delayed_work() which could be
> blocked on memory allocation under OOM situation,

It doesn't because that code path is not invoked for the !blockable case.

> I worry that (currently
> out-of-tree) users of this API are involving work / recursion.

I do not care in the slightest about out-of-tree modules. They will have to accommodate to the new API. I have no problems to extend the documentation and be explicit about this expectation.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 	 *
 	 * If blockable argument is set to false then the callback cannot
 	 * sleep and has to return with -EAGAIN. 0 should be returned
-	 * otherwise.
+	 * otherwise. Please note that if invalidate_range_start approves
+	 * a non-blocking behavior then the same applies to
+	 * invalidate_range_end.
 	 *
 	 */
 	int (*invalidate_range_start)(struct mmu_notifier *mn,

> And hmm_release() says that
>
>	/*
>	 * Drop mirrors_sem so callback can wait on any pending
>	 * work that might itself trigger mmu_notifier callback
>	 * and thus would deadlock with us.
>	 */
>
> and keeps "all operations protected by hmm->mirrors_sem held for write are
> atomic". This suggests that "some operations protected by hmm->mirrors_sem held
> for read will sleep (and in the worst case involve memory allocation
> dependency)".

Yes, and so what? The clear expectation is that neither of the range notifiers sleeps in !blockable mode. I really fail to see what you are trying to say.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:24 schrieb Michal Hocko: On Fri 24-08-18 15:10:08, Christian König wrote: Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc Can we change it to non-sleepable lock then? No, the write side doesn't sleep any more, but the read side does. See amdgpu_mn_invalidate_node() and that is where you actually need to handle the non-blocking flag correctly. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 15:10:08, Christian König wrote: > Am 24.08.2018 um 15:01 schrieb Michal Hocko: > > On Fri 24-08-18 14:52:26, Christian König wrote: > > > Am 24.08.2018 um 14:33 schrieb Michal Hocko: > > [...] > > > > Thiking about it some more, I can imagine that a notifier callback which > > > > performs an allocation might trigger a memory reclaim and that in turn > > > > might trigger a notifier to be invoked and recurse. But notifier > > > > shouldn't really allocate memory. They are called from deep MM code > > > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > > > up some more realistic scenario. If not then I would propose to simplify > > > > the locking here. We have lockdep to catch self deadlocks and it is > > > > always better to handle a specific issue rather than having a code > > > > without a clear indication how it can recurse. > > > Well I agree that we should probably fix that, but I have some concerns to > > > remove the existing workaround. > > > > > > See we added that to get rid of a real problem in a customer environment > > > and > > > I don't want to that to show up again. > > It would really help to know more about that case and fix it properly > > rather than workaround it like this. Anyway, let me think how to handle > > the non-blocking notifier invocation then. I was not able to come up > > with anything remotely sane yet. > > With avoiding allocating memory in the write lock path I don't see an issue > any more with that. > > All what the write lock path does now is adding items to a linked lists, > arrays etc Can we change it to non-sleepable lock then? -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 15:01 schrieb Michal Hocko: On Fri 24-08-18 14:52:26, Christian König wrote: Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. See we added that to get rid of a real problem in a customer environment and I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. With avoiding allocating memory in the write lock path I don't see an issue any more with that. All what the write lock path does now is adding items to a linked lists, arrays etc So there is no more blocking involved here and the read lock side should be able to grab the lock immediately. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 14:52:26, Christian König wrote: > Am 24.08.2018 um 14:33 schrieb Michal Hocko: [...] > > Thiking about it some more, I can imagine that a notifier callback which > > performs an allocation might trigger a memory reclaim and that in turn > > might trigger a notifier to be invoked and recurse. But notifier > > shouldn't really allocate memory. They are called from deep MM code > > paths and this would be extremely deadlock prone. Maybe Jerome can come > > up some more realistic scenario. If not then I would propose to simplify > > the locking here. We have lockdep to catch self deadlocks and it is > > always better to handle a specific issue rather than having a code > > without a clear indication how it can recurse. > > Well I agree that we should probably fix that, but I have some concerns to > remove the existing workaround. > > See we added that to get rid of a real problem in a customer environment and > I don't want to that to show up again. It would really help to know more about that case and fix it properly rather than workaround it like this. Anyway, let me think how to handle the non-blocking notifier invocation then. I was not able to come up with anything remotely sane yet. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 14:33 schrieb Michal Hocko: On Fri 24-08-18 14:18:44, Christian König wrote: Am 24.08.2018 um 14:03 schrieb Michal Hocko: On Fri 24-08-18 13:57:52, Christian König wrote: Am 24.08.2018 um 13:52 schrieb Michal Hocko: On Fri 24-08-18 13:43:16, Christian König wrote: [...] That won't work like this there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore. I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path? No, but what can happen is: invalidate_range_start(A,B); invalidate_range_start(C,D); ... invalidate_range_end(C,D); invalidate_range_end(A,B); Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? I don't have the slightest idea. Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. I'm neither. We stumbled over that by pure observation and after discussing the problem with Jerome came up with this solution. No idea where exactly that case comes from, but I can confirm that it indeed happens. Thiking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifier shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up some more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having a code without a clear indication how it can recurse. Well I agree that we should probably fix that, but I have some concerns to remove the existing workaround. 
See we added that to get rid of a real problem in a customer environment and I don't want that to show up again.

In the meantime I've sent out a fix to avoid allocating memory while holding the mn_lock.

Thanks for pointing that out,
Christian.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 14:18:44, Christian König wrote:
> Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > [...]
> > > > > That won't work like this; there might be multiple
> > > > > invalidate_range_start()/invalidate_range_end() pairs open at the
> > > > > same time. E.g. the lock might be taken recursively and that is
> > > > > illegal for a rw_semaphore.
> > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > trigger another one from the same path?
> > > No, but what can happen is:
> > >
> > > invalidate_range_start(A,B);
> > > invalidate_range_start(C,D);
> > > ...
> > > invalidate_range_end(C,D);
> > > invalidate_range_end(A,B);
> > >
> > > Grabbing the read lock twice would be illegal in this case.
> > I am sorry but I still do not follow. What is the context the two are
> > called from?
>
> I don't have the slightest idea.
>
> > Can you give me an example. I simply do not see it in the
> > code, mostly because I am not familiar with it.
>
> I'm neither.
>
> We stumbled over that by pure observation and after discussing the problem
> with Jerome came up with this solution.
>
> No idea where exactly that case comes from, but I can confirm that it indeed
> happens.

Thinking about it some more, I can imagine that a notifier callback which performs an allocation might trigger a memory reclaim and that in turn might trigger a notifier to be invoked and recurse. But notifiers shouldn't really allocate memory. They are called from deep MM code paths and this would be extremely deadlock prone. Maybe Jerome can come up with a more realistic scenario. If not then I would propose to simplify the locking here. We have lockdep to catch self deadlocks and it is always better to handle a specific issue rather than having code without a clear indication of how it can recurse.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 14:03 schrieb Michal Hocko: On Fri 24-08-18 13:57:52, Christian König wrote: Am 24.08.2018 um 13:52 schrieb Michal Hocko: On Fri 24-08-18 13:43:16, Christian König wrote: [...] That won't work like this there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore. I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path? No, but what can happen is: invalidate_range_start(A,B); invalidate_range_start(C,D); ... invalidate_range_end(C,D); invalidate_range_end(A,B); Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? I don't have the slightest idea. Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. I'm neither. We stumbled over that by pure observation and after discussing the problem with Jerome came up with this solution. No idea where exactly that case comes from, but I can confirm that it indeed happens. Regards, Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 13:57:52, Christian König wrote: > Am 24.08.2018 um 13:52 schrieb Michal Hocko: > > On Fri 24-08-18 13:43:16, Christian König wrote: [...] > > > That won't work like this there might be multiple > > > invalidate_range_start()/invalidate_range_end() pairs open at the same > > > time. > > > E.g. the lock might be taken recursively and that is illegal for a > > > rw_semaphore. > > I am not sure I follow. Are you saying that one invalidate_range might > > trigger another one from the same path? > > No, but what can happen is: > > invalidate_range_start(A,B); > invalidate_range_start(C,D); > ... > invalidate_range_end(C,D); > invalidate_range_end(A,B); > > Grabbing the read lock twice would be illegal in this case. I am sorry but I still do not follow. What is the context the two are called from? Can you give me an example. I simply do not see it in the code, mostly because I am not familiar with it. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 13:52 schrieb Michal Hocko:
On Fri 24-08-18 13:43:16, Christian König wrote:
Am 24.08.2018 um 13:32 schrieb Michal Hocko:
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:

Two more worries for this patch.

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * @amn: our notifier
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	mutex_lock(&amn->read_lock);
+	if (blockable)
+		mutex_lock(&amn->read_lock);
+	else if (!mutex_trylock(&amn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&amn->recursion) == 1)
 		down_read_non_owner(&amn->lock);

Why don't we need to use trylock here if blockable == false? Want a comment on why it is safe to use a blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing so I might have missed something. Double checking now, it seems that this read_lock is not used anywhere else and it is not _the_ lock we are interested about. It is the amn->lock (amdgpu_mn_lock) which matters as it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL. So you actually don't need to change anything here (even the proposed changes are overkill) since we can't tear down the struct_mm while an IOCTL is still using it.

I am not so sure. We are not in the mm destruction phase yet. This is mostly about the oom context which might fire right during the IOCTL. If any of the paths holding the write lock blocks for an unbound amount of time, or even worse allocates memory, then we are screwed. So we need to back off when blockable = false.

Oh, yeah good point. Haven't thought about that possibility.

Is that correct Christian? If this is correct then we need to update the locking here. I am struggling to grasp the ref counting part.
Why cannot all readers simply take the lock rather than rely on somebody else to take it? 1ed3d2567c800 didn't really help me to understand the locking scheme here so any help would be appreciated.

That won't work like this; there might be multiple invalidate_range_start()/invalidate_range_end() pairs open at the same time. E.g. the lock might be taken recursively and that is illegal for a rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path?

No, but what can happen is:

invalidate_range_start(A,B);
invalidate_range_start(C,D);
...
invalidate_range_end(C,D);
invalidate_range_end(A,B);

Grabbing the read lock twice would be illegal in this case.

Regards,
Christian.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 13:43:16, Christian König wrote:
> Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > > Two more worries for this patch.
> > >
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > > >  *
> > > >  * @amn: our notifier
> > > >  */
> > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > > > {
> > > > -	mutex_lock(&amn->read_lock);
> > > > +	if (blockable)
> > > > +		mutex_lock(&amn->read_lock);
> > > > +	else if (!mutex_trylock(&amn->read_lock))
> > > > +		return -EAGAIN;
> > > > +
> > > > 	if (atomic_inc_return(&amn->recursion) == 1)
> > > > 		down_read_non_owner(&amn->lock);
> > > Why don't we need to use trylock here if blockable == false ?
> > > Want comment why it is safe to use blocking lock here.
> > Hmm, I am pretty sure I have checked the code but it was quite confusing
> > so I might have missed something. Double checking now, it seems that
> > this read_lock is not used anywhere else and it is not _the_ lock we are
> > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> > it is taken in exclusive mode for expensive operations.
>
> The write side of the lock is only taken in the command submission IOCTL.
>
> So you actually don't need to change anything here (even the proposed
> changes are overkill) since we can't tear down the struct_mm while an IOCTL
> is still using it.

I am not so sure. We are not in the mm destruction phase yet. This is mostly about the oom context which might fire right during the IOCTL. If any of the paths holding the write lock blocks for an unbound amount of time, or even worse allocates memory, then we are screwed. So we need to back off when blockable = false.

> > Is that correct Christian?
> > If this is correct then we need to update the locking here. I am struggling
> > to grasp the ref counting part. Why cannot all readers simply take the lock
> > rather than rely on somebody else to take it? 1ed3d2567c800 didn't really
> > help me to understand the locking scheme here so any help would be
> > appreciated.
>
> That won't work like this; there might be multiple
> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> E.g. the lock might be taken recursively and that is illegal for a
> rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might trigger another one from the same path?

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 24.08.2018 um 13:32 schrieb Michal Hocko:
> On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > Two more worries for this patch.
> >
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > >   *
> > >   * @amn: our notifier
> > >   */
> > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > >  {
> > > -	mutex_lock(&amn->read_lock);
> > > +	if (blockable)
> > > +		mutex_lock(&amn->read_lock);
> > > +	else if (!mutex_trylock(&amn->read_lock))
> > > +		return -EAGAIN;
> > > +
> > >  	if (atomic_inc_return(&amn->recursion) == 1)
> > >  		down_read_non_owner(&amn->lock);
> >
> > Why don't we need to use trylock here if blockable == false ?
> > Want comment why it is safe to use blocking lock here.
>
> Hmm, I am pretty sure I have checked the code but it was quite confusing
> so I might have missed something. Double checking now, it seems that
> this read_lock is not used anywhere else and it is not _the_ lock we are
> interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> it is taken in exclusive mode for expensive operations.

The write side of the lock is only taken in the command submission IOCTL.

So you actually don't need to change anything here (even the proposed
changes are overkill) since we can't tear down the struct_mm while an
IOCTL is still using it.

> Is that correct Christian? If this is correct then we need to update the
> locking here. I am struggling to grasp the ref counting part. Why cannot
> all readers simply take the lock rather than rely on somebody else to
> take it? 1ed3d2567c800 didn't really help me to understand the locking
> scheme here so any help would be appreciated.

That won't work like this there might be multiple
invalidate_range_start()/invalidate_range_end() pairs open at the same
time. E.g. the lock might be taken recursively and that is illegal for a
rw_semaphore.

Regards,
Christian.

I am wondering why we cannot do

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
+	if (!down_read_trylock(&amn->lock)) {
+		if (!blockable)
+			return -EAGAIN;
+		down_read(&amn->lock);
+	}
 
 	return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> >  				       unsigned long start,
> > -				       unsigned long end)
> > +				       unsigned long end,
> > +				       bool blockable)
> >  {
> >  	struct hmm *hmm = mm->hmm;
> > 
> >  	VM_BUG_ON(!hmm);
> > 
> >  	atomic_inc(&hmm->sequence);
> > +
> > +	return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.

Yes HMM doesn't have any in tree user yet.

> $ grep -Fr sync_cpu_device_pagetables *
> Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize page tables
> include/linux/hmm.h: * will get callbacks through sync_cpu_device_pagetables() operation (see
> include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page tables
> include/linux/hmm.h:	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
> include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> mm/hmm.c:	mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for
> now.

The code expects that the invalidate_range_end doesn't block if
invalidate_range_start hasn't blocked. That is the reason why the end
callback doesn't have a blockable parameter. If this doesn't hold then the
whole scheme is just fragile because those two calls should pair.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> Two more worries for this patch.
>
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * @amn: our notifier
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> >  {
> > -	mutex_lock(&amn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&amn->read_lock);
> > +	else if (!mutex_trylock(&amn->read_lock))
> > +		return -EAGAIN;
> > +
> >  	if (atomic_inc_return(&amn->recursion) == 1)
> >  		down_read_non_owner(&amn->lock);
>
> Why don't we need to use trylock here if blockable == false ?
> Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

I am wondering why we cannot do

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-	if (blockable)
-		mutex_lock(&amn->read_lock);
-	else if (!mutex_trylock(&amn->read_lock))
-		return -EAGAIN;
-
-	if (atomic_inc_return(&amn->recursion) == 1)
-		down_read_non_owner(&amn->lock);
-	mutex_unlock(&amn->read_lock);
+	if (!down_read_trylock(&amn->lock)) {
+		if (!blockable)
+			return -EAGAIN;
+		down_read(&amn->lock);
+	}
 
 	return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-	if (atomic_dec_return(&amn->recursion) == 0)
-		up_read_non_owner(&amn->lock);
+	up_read(&amn->lock);
 }
 
 /**
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 24-07-18 12:53:07, Andrew Morton wrote:
[...]
> > On top of that the proposed cleanup looks as follows:
> 
> Looks good to me. Seems a bit strange that we omit the pr_info()
> output if the mm was partially reaped - people would still want to know
> this? Not very important though.

I think that having a single output once we are done is better but I do
not have a strong opinion on this.

Btw. here is the changelog for the cleanup.
"
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably

 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region. So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Suggested-by: Andrew Morton
Signed-off-by: Michal Hocko
"
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 24-07-18 14:07:49, David Rientjes wrote:
[...]
> mm/oom_kill.c: clean up oom_reap_task_mm() fix
> 
> indicate reaping has been partially skipped so we can expect future skips
> or another start before finish.

But we are not skipping. This is essentially the same case as mmap_sem
trylock fail. Maybe we can add a bool parameter to
trace_finish_task_reaping to denote partial success?

> Signed-off-by: David Rientjes
> ---
>  mm/oom_kill.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> 
>  	trace_start_task_reaping(tsk->pid);
> 
> -	/* failed to reap part of the address space. Try again later */
>  	ret = __oom_reap_task_mm(mm);
> -	if (!ret)
> +	if (!ret) {
> +		/* Failed to reap part of the address space. Try again later */
> +		trace_skip_task_reaping(tsk->pid);
>  		goto out_finish;
> +	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  		task_pid_nr(tsk), tsk->comm,
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 24 Jul 2018, Michal Hocko wrote:

> oom_reap_task_mm should return false when __oom_reap_task_mm returns
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> 
> On top of that the proposed cleanup looks as follows:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 88657e018714..4e185a282b3d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	return ret;
>  }
> 
> +/*
> + * Reaps the address space of the given task.
> + *
> + * Returns true on success and false if none or part of the address space
> + * has been reclaimed and the caller should retry later.
> + */
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
> +	bool ret = true;
> +
>  	if (!down_read_trylock(&mm->mmap_sem)) {
>  		trace_skip_task_reaping(tsk->pid);
>  		return false;
>  	}
> @@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 * down_write();up_write() cycle in exit_mmap().
>  	 */
>  	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> -		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		return true;
> +		goto out_unlock;
>  	}
> 
>  	trace_start_task_reaping(tsk->pid);
> 
>  	/* failed to reap part of the address space. Try again later */
> -	if (!__oom_reap_task_mm(mm)) {
> -		up_read(&mm->mmap_sem);
> -		return false;
> -	}
> +	ret = __oom_reap_task_mm(mm);
> +	if (!ret)
> +		goto out_finish;
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
>  			K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +out_finish:
> +	trace_finish_task_reaping(tsk->pid);
> +out_unlock:
>  	up_read(&mm->mmap_sem);
> 
> -	trace_finish_task_reaping(tsk->pid);
> -	return true;
> +	return ret;
>  }
> 
>  #define MAX_OOM_REAP_RETRIES 10

I think we still want to trace when reaping was skipped to know that the
oom reaper will retry again later.

mm/oom_kill.c: clean up oom_reap_task_mm() fix

indicate reaping has been partially skipped so we can expect future skips
or another start before finish.

Signed-off-by: David Rientjes
---
 mm/oom_kill.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/* failed to reap part of the address space. Try again later */
 	ret = __oom_reap_task_mm(mm);
-	if (!ret)
+	if (!ret) {
+		/* Failed to reap part of the address space. Try again later */
+		trace_skip_task_reaping(tsk->pid);
 		goto out_finish;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(tsk), tsk->comm,
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 24 Jul 2018 16:17:47 +0200 Michal Hocko wrote:

> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > - Undocumented return value.
> > 
> > - comment "failed to reap part..." is misleading - sounds like it's
> >   referring to something which happened in the past, is in fact
> >   referring to something which might happen in the future.
> > 
> > - fails to call trace_finish_task_reaping() in one case
> > 
> > - code duplication.
> > 
> > - Increases mmap_sem hold time a little by moving
> >   trace_finish_task_reaping() inside the locked region. So sue me ;)
> > 
> > - Sharing the finish: path means that the trace event won't
> >   distinguish between the two sources of finishing.
> > 
> > Please take a look?
> 
> oom_reap_task_mm should return false when __oom_reap_task_mm returns
> false. This is what my patch did but it seems this changed by
> http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
> so that one should be fixed.
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 104ef4a01a55..88657e018714 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	/* failed to reap part of the address space. Try again later */
>  	if (!__oom_reap_task_mm(mm)) {
>  		up_read(&mm->mmap_sem);
> -		return true;
> +		return false;
>  	}
> 
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

OK, thanks, I added that.

> On top of that the proposed cleanup looks as follows:

Looks good to me. Seems a bit strange that we omit the pr_info()
output if the mm was partially reaped - people would still want to know
this? Not very important though.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> - Undocumented return value.
> 
> - comment "failed to reap part..." is misleading - sounds like it's
>   referring to something which happened in the past, is in fact
>   referring to something which might happen in the future.
> 
> - fails to call trace_finish_task_reaping() in one case
> 
> - code duplication.
> 
> - Increases mmap_sem hold time a little by moving
>   trace_finish_task_reaping() inside the locked region. So sue me ;)
> 
> - Sharing the finish: path means that the trace event won't
>   distinguish between the two sources of finishing.
> 
> Please take a look?

oom_reap_task_mm should return false when __oom_reap_task_mm returns
false. This is what my patch did but it seems this changed by
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
so that one should be fixed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..88657e018714 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	/* failed to reap part of the address space. Try again later */
 	if (!__oom_reap_task_mm(mm)) {
 		up_read(&mm->mmap_sem);
-		return true;
+		return false;
 	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

On top of that the proposed cleanup looks as follows:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88657e018714..4e185a282b3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	return ret;
 }
 
+/*
+ * Reaps the address space of the given task.
+ *
+ * Returns true on success and false if none or part of the address space
+ * has been reclaimed and the caller should retry later.
+ */
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
+	bool ret = true;
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		trace_skip_task_reaping(tsk->pid);
 		return false;
 	}
@@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
-		return true;
+		goto out_unlock;
 	}
 
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-	if (!__oom_reap_task_mm(mm)) {
-		up_read(&mm->mmap_sem);
-		return false;
-	}
+	ret = __oom_reap_task_mm(mm);
+	if (!ret)
+		goto out_finish;
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_finish:
+	trace_finish_task_reaping(tsk->pid);
+out_unlock:
 	up_read(&mm->mmap_sem);
 
-	trace_finish_task_reaping(tsk->pid);
-	return true;
+	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 16:01:25, Andrew Morton wrote:
> On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko wrote:
> 
> > > Any suggestions regarding how the driver developers can test this code
> > > path? I don't think we presently have a way to fake an oom-killing
> > > event? Perhaps we should add such a thing, given the problems we're
> > > having with that feature.
> > 
> > The simplest way is to wrap userspace code which uses these notifiers
> > into a memcg and set the hard limit to hit the oom. This can be done
> > e.g. after the test faults in all the mmu notifier managed memory and
> > set the hard limit to something really small. Then we are looking for a
> > proper process tear down.
> 
> Chances are, some of the intended audience don't know how to do this
> and will either have to hunt down a lot of documentation or will just
> not test it. But we want them to test it, so a little worked step-by-step
> example would help things along please.

I am willing to give more specific steps. Is anybody interested? From my
experience so far this is not something drivers developers using mmu
notifiers would be unfamiliar with.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 23-07-18 09:11:54, Michal Hocko wrote:
> On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> > On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> > [...]
> > > Please take a look?
> > 
> > Are you OK to have these in a separate patch?
> 
> Btw. I will rebase this patch once oom stuff in linux-next settles. At
> least oom_lock removal from oom_reaper will conflict.

Hmm, I have just checked Andrew's akpm and the patch is already in and
Andrew has resolved the conflict with the oom_lock patch. It just seems
that linux-next (next-20180720) doesn't have the newest mmotm tree.
Anyway, I will go with the incremental cleanup patch per Andrew's
comments as soon as linux-next catches up.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > Please take a look?
> 
> Are you OK to have these in a separate patch?

Btw. I will rebase this patch once oom stuff in linux-next settles. At
least oom_lock removal from oom_reaper will conflict.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?

Are you OK to have these in a separate patch?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote:

> From: Michal Hocko
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> ...
> 
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> 
>  	trace_start_task_reaping(tsk->pid);
> 
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}

This function is starting to look a bit screwy.

: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		up_read(&mm->mmap_sem);
: 		trace_skip_task_reaping(tsk->pid);
: 		return true;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	/* failed to reap part of the address space. Try again later */
: 	if (!__oom_reap_task_mm(mm)) {
: 		up_read(&mm->mmap_sem);
: 		return true;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: 	up_read(&mm->mmap_sem);
: 
: 	trace_finish_task_reaping(tsk->pid);
: 	return true;
: }

- Undocumented return value.

- comment "failed to reap part..." is misleading - sounds like it's
  referring to something which happened in the past, is in fact
  referring to something which might happen in the future.

- fails to call trace_finish_task_reaping() in one case

- code duplication.

I'm thinking it wants to be something like this?

: /*
:  * Return true if we successfully acquired (then released) mmap_sem
:  */
: static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
: {
: 	if (!down_read_trylock(&mm->mmap_sem)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		return false;
: 	}
: 
: 	/*
: 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
: 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
: 	 * under mmap_sem for reading because it serializes against the
: 	 * down_write();up_write() cycle in exit_mmap().
: 	 */
: 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
: 		trace_skip_task_reaping(tsk->pid);
: 		goto out;
: 	}
: 
: 	trace_start_task_reaping(tsk->pid);
: 
: 	if (!__oom_reap_task_mm(mm)) {
: 		/* Failed to reap part of the address space. Try again later */
: 		goto finish;
: 	}
: 
: 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
: 			task_pid_nr(tsk), tsk->comm,
: 			K(get_mm_counter(mm, MM_ANONPAGES)),
: 			K(get_mm_counter(mm, MM_FILEPAGES)),
: 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
: finish:
: 	trace_finish_task_reaping(tsk->pid);
: out:
: 	up_read(&mm->mmap_sem);
: 	return true;
: }

- Increases mmap_sem hold time a little by moving
  trace_finish_task_reaping() inside the locked region. So sue me ;)

- Sharing the finish: path means that the trace event won't
  distinguish between the two sources of finishing.

Please take a look?
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko wrote:

> > Any suggestions regarding how the driver developers can test this code
> > path? I don't think we presently have a way to fake an oom-killing
> > event? Perhaps we should add such a thing, given the problems we're
> > having with that feature.
> 
> The simplest way is to wrap userspace code which uses these notifiers
> into a memcg and set the hard limit to hit the oom. This can be done
> e.g. after the test faults in all the mmu notifier managed memory and
> set the hard limit to something really small. Then we are looking for a
> proper process tear down.

Chances are, some of the intended audience don't know how to do this
and will either have to hunt down a lot of documentation or will just
not test it. But we want them to test it, so a little worked step-by-step
example would help things along please.
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Does anybody see any reasons why this should not get into the mmotm tree?
I do not want to rush this in but if the general feeling is to push it for
the upcoming merge window then I will not object.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 16-07-18 16:12:49, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote:
> 
> > From: Michal Hocko
> > 
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> > 
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> > 
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> > 
> > This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
> 
> I assume device driver developers are wondering "what does this mean
> for me". As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes? So it is a very rare thing.

Yes, this is the case right now. Maybe we will grow other users in
future. Those other potential users are the reason why I used blockable
rather than oom parameter name.

> Any suggestions regarding how the driver developers can test this code
> path? I don't think we presently have a way to fake an oom-killing
> event? Perhaps we should add such a thing, given the problems we're
> having with that feature.

The simplest way is to wrap userspace code which uses these notifiers
into a memcg and set the hard limit to hit the oom. This can be done
e.g. after the test faults in all the mmu notifier managed memory and
set the hard limit to something really small. Then we are looking for a
proper process tear down.

> > I think we can improve that even further because there is a common
> > pattern to do a range lookup first and then do something about that.
> > The first part can be done without a sleeping lock in most cases AFAICS.
> > 
> > The oom_reaper end then simply retries if there is at least one notifier
> > which couldn't make any progress in !blockable mode. A retry loop is
> > already implemented to wait for the mmap_sem and this is basically the
> > same thing.
> > 
> > ...
> >
> > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> > +		unsigned long start, unsigned long end)
> > +{
> > +	int ret = 0;
> > +	if (mm_has_notifiers(mm))
> > +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> > +
> > +	return ret;
> > }
> 
> nit,
> 
> {
> 	if (mm_has_notifiers(mm))
> 		return __mmu_notifier_invalidate_range_start(mm, start, end, false);
> 	return 0;
> }
> 
> would suffice.

Sure. Fixed

> > ...
> > 
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	 * reliably test it.
> >  	 */
> >  	mutex_lock(&oom_lock);
> > -	__oom_reap_task_mm(mm);
> > +	(void)__oom_reap_task_mm(mm);
> >  	mutex_unlock(&oom_lock);
> 
> What does this do?

There is no error to be returned here as the comment above explains

	 * Nothing can be holding mm->mmap_sem here and the above call
	 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
	 * __oom_reap_task_mm() will not block.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote: > On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > > > From: Michal Hocko > > > > There are several blockable mmu notifiers which might sleep in > > mmu_notifier_invalidate_range_start and that is a problem for the > > oom_reaper because it needs to guarantee a forward progress so it cannot > > depend on any sleepable locks. > > > > Currently we simply back off and mark an oom victim with blockable mmu > > notifiers as done after a short sleep. That can result in selecting a > > new oom victim prematurely because the previous one still hasn't torn > > its memory down yet. > > > > We can do much better though. Even if mmu notifiers use sleepable locks > > there is no reason to automatically assume those locks are held. > > Moreover majority of notifiers only care about a portion of the address > > space and there is absolutely zero reason to fail when we are unmapping an > > unrelated range. Many notifiers do really block and wait for HW which is > > harder to handle and we have to bail out though. > > > > This patch handles the low hanging fruid. > > __mmu_notifier_invalidate_range_start > > gets a blockable flag and callbacks are not allowed to sleep if the > > flag is set to false. This is achieved by using trylock instead of the > > sleepable lock for most callbacks and continue as long as we do not > > block down the call chain. > > I assume device driver developers are wondering "what does this mean > for me". As I understand it, the only time they will see > blockable==false is when their driver is being called in response to an > out-of-memory condition, yes? So it is a very rare thing. I can't say for everyone, but at least for me (mlx5), it is not rare event. I'm seeing OOM very often while I'm running my tests in low memory VMs. Thanks > > Any suggestions regarding how the driver developers can test this code > path? I don't think we presently have a way to fake an oom-killing > event? 
Perhaps we should add such a thing, given the problems we're > having with that feature. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko wrote: > From: Michal Hocko > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. > __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. I assume device driver developers are wondering "what does this mean for me". As I understand it, the only time they will see blockable==false is when their driver is being called in response to an out-of-memory condition, yes? So it is a very rare thing. Any suggestions regarding how the driver developers can test this code path? I don't think we presently have a way to fake an oom-killing event? Perhaps we should add such a thing, given the problems we're having with that feature. > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. 
> The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > ... > > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct > mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + int ret = 0; > + if (mm_has_notifiers(mm)) > + ret = __mmu_notifier_invalidate_range_start(mm, start, end, > false); > + > + return ret; > } nit, { if (mm_has_notifiers(mm)) return __mmu_notifier_invalidate_range_start(mm, start, end, false); return 0; } would suffice. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm) > * reliably test it. > */ > mutex_lock(&oom_lock); > - __oom_reap_task_mm(mm); > + (void)__oom_reap_task_mm(mm); > mutex_unlock(&oom_lock); What does this do? > set_bit(MMF_OOM_SKIP, &mm->flags); > > ...
[PATCH] mm, oom: distinguish blockable mode for mmu notifiers
From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover the majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low-hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continuing as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. 
Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: amd-...@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Cc: intel-...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Cc: linux...@kvack.org Acked-by: Christian König # AMD notifiers Acked-by: Leon Romanovsky # mlx and umem_odp Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, there were no major objections when I sent this as an RFC the last time [1]. I was hoping for more feedback in the drivers land because I am touching the code I have no way to test. On the other hand the pattern is quite simple and consistent over all users so there shouldn't be any large surprises hopefully. Any further review would be highly appreciate of course. But is this something to put into the mm tree now? 
[1] http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- include/rdma/ib_umem_odp.h | 3 +- mm/hmm.c| 7 ++-- mm/mmap.c | 2 +- mm/mmu_notifier.c | 19 --- mm/oom_kill.c | 29 virt/kvm/kvm_main.c | 15 ++--- 19 files changed, 225 insertions(+), 80 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); } -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > Any other feedback before I post this as non-RFC? > > > > From mlx5 perspective, who is primary user of umem_odp.c your change looks > > ok. > > Can I assume your Acked-by? > > Thanks for your review! For mlx and umem_odp pieces, Acked-by: Leon Romanovsky Thanks
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed, Jul 11, 2018 at 01:13:18PM +0200, Michal Hocko wrote: > On Wed 11-07-18 13:14:47, Leon Romanovsky wrote: > > On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > > > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > > > This is the v2 of RFC based on the feedback I've received so > > > > > > > > far. The > > > > > > > > code even compiles as a bonus ;) I haven't runtime tested it > > > > > > > > yet, mostly > > > > > > > > because I have no idea how. > > > > > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your > > > > > > change looks ok. > > > > > > > > > > Can I assume your Acked-by? > > > > > > > > I didn't have a chance to test it because it applies on our rdma-next, > > > > but > > > > fails to compile. > > > > > > What is the compilation problem? Is it caused by the patch or some other > > > unrelated changed? > > > > Thanks for pushing me to take a look on it. > > Your patch needs the following hunk to properly compile at least on my > > system. > > I suspect you were trying the original version. I've posted an updated > patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz > and all these issues should be fixed there. Including many other fixes. > Ohh, you used --reply-to, IMHO it is best way to make sure that the patch will be lost :) > Could you have a look at that one please? I grabbed it, the results will be overnight only. 
Thanks > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed 11-07-18 13:14:47, Leon Romanovsky wrote: > On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > > The > > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > > mostly > > > > > > > because I have no idea how. > > > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > > > looks ok. > > > > > > > > Can I assume your Acked-by? > > > > > > I didn't have a chance to test it because it applies on our rdma-next, but > > > fails to compile. > > > > What is the compilation problem? Is it caused by the patch or some other > > unrelated changed? > > Thanks for pushing me to take a look on it. > Your patch needs the following hunk to properly compile at least on my system. I suspect you were trying the original version. I've posted an updated patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz and all these issues should be fixed there. Including many other fixes. Could you have a look at that one please? -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote: > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > The > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > mostly > > > > > > because I have no idea how. > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > > looks ok. > > > > > > Can I assume your Acked-by? > > > > I didn't have a chance to test it because it applies on our rdma-next, but > > fails to compile. > > What is the compilation problem? Is it caused by the patch or some other > unrelated changed? Thanks for pushing me to take a look on it. Your patch needs the following hunk to properly compile at least on my system. I'll take it to our regression. diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 369867501bed..1f364a157097 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -155,9 +155,9 @@ struct mmu_notifier_ops { * cannot block, mmu_notifier_ops.flags should have * MMU_INVALIDATE_DOES_NOT_BLOCK set. 
*/ - void (*invalidate_range_start)(struct mmu_notifier *mn, + int (*invalidate_range_start)(struct mmu_notifier *mn, struct mm_struct *mm, - unsigned long start, unsigned long end); + unsigned long start, unsigned long end, bool blockable); void (*invalidate_range_end)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long start, unsigned long end); @@ -229,7 +229,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, unsigned long start, unsigned long end, bool blockable); extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, diff --git a/include/linux/oom.h b/include/linux/oom.h index 6adac113e96d..92f70e4c6252 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm) return 0; } -void __oom_reap_task_mm(struct mm_struct *mm); +bool __oom_reap_task_mm(struct mm_struct *mm); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 7e0c6e78ae5c..7c7bd6f3298e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1,6 +1,6 @@ /* * linux/mm/oom_kill.c - * + * * Copyright (C) 1998,2000 Rik van Riel * Thanks go out to Claus Fischer for some serious inspiration and * for goading me into coding this file... 
@@ -569,7 +569,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) if (!__oom_reap_task_mm(mm)) { up_read(&mm->mmap_sem); ret = false; - goto out_unlock; + goto unlock_oom; } pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", Thanks > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 10-07-18 19:20:20, Leon Romanovsky wrote: > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > mostly > > > > > because I have no idea how. > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > > > Any other feedback before I post this as non-RFC? > > > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > > looks ok. > > > > Can I assume your Acked-by? > > I didn't have a chance to test it because it applies on our rdma-next, but > fails to compile. What is the compilation problem? Is it caused by the patch or some other unrelated change? -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote: > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > > > > Any other feedback before I post this as non-RFC? > > > > From mlx5 perspective, who is primary user of umem_odp.c your change > > looks ok. > > Can I assume your Acked-by? I didn't have a chance to test it because it applies on our rdma-next, but fails to compile. Thanks > > Thanks for your review! > -- > Michal Hocko > SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Tue 10-07-18 16:40:40, Leon Romanovsky wrote: > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > > This is the v2 of RFC based on the feedback I've received so far. The > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > because I have no idea how. > > > > > > Any further feedback is highly appreciated of course. > > > > Any other feedback before I post this as non-RFC? > > From mlx5 perspective, who is primary user of umem_odp.c your change looks ok. Can I assume your Acked-by? Thanks for your review! -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote: > On Wed 27-06-18 09:44:21, Michal Hocko wrote: > > This is the v2 of RFC based on the feedback I've received so far. The > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > because I have no idea how. > > > > Any further feedback is highly appreciated of course. > > Any other feedback before I post this as non-RFC? From mlx5 perspective, who is primary user of umem_odp.c your change looks ok. Thanks
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Wed 27-06-18 09:44:21, Michal Hocko wrote: > This is the v2 of RFC based on the feedback I've received so far. The > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > because I have no idea how. > > Any further feedback is highly appreciated of course. Any other feedback before I post this as non-RFC? > --- > From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Wed, 20 Jun 2018 15:03:20 +0200 > Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > There are several blockable mmu notifiers which might sleep in > mmu_notifier_invalidate_range_start and that is a problem for the > oom_reaper because it needs to guarantee a forward progress so it cannot > depend on any sleepable locks. > > Currently we simply back off and mark an oom victim with blockable mmu > notifiers as done after a short sleep. That can result in selecting a > new oom victim prematurely because the previous one still hasn't torn > its memory down yet. > > We can do much better though. Even if mmu notifiers use sleepable locks > there is no reason to automatically assume those locks are held. > Moreover majority of notifiers only care about a portion of the address > space and there is absolutely zero reason to fail when we are unmapping an > unrelated range. Many notifiers do really block and wait for HW which is > harder to handle and we have to bail out though. > > This patch handles the low hanging fruid. > __mmu_notifier_invalidate_range_start > gets a blockable flag and callbacks are not allowed to sleep if the > flag is set to false. This is achieved by using trylock instead of the > sleepable lock for most callbacks and continue as long as we do not > block down the call chain. 
> > I think we can improve that even further because there is a common > pattern to do a range lookup first and then do something about that. > The first part can be done without a sleeping lock in most cases AFAICS. > > The oom_reaper end then simply retries if there is at least one notifier > which couldn't make any progress in !blockable mode. A retry loop is > already implemented to wait for the mmap_sem and this is basically the > same thing. > > Changes since rfc v1 > - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch > on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure > we bail out when we have an intersecting range for starter > - note that a notifier failed to the log for easier debugging > - back off early in ib_umem_notifier_invalidate_range_start if the > callback is called > - mn_invl_range_start waits for completion down the unmap_grant_pages > path so we have to back off early on overlapping ranges > > Cc: "David (ChunMing) Zhou" > Cc: Paolo Bonzini > Cc: "Radim Krčmář" > Cc: Alex Deucher > Cc: "Christian König" > Cc: David Airlie > Cc: Jani Nikula > Cc: Joonas Lahtinen > Cc: Rodrigo Vivi > Cc: Doug Ledford > Cc: Jason Gunthorpe > Cc: Mike Marciniszyn > Cc: Dennis Dalessandro > Cc: Sudeep Dutt > Cc: Ashutosh Dixit > Cc: Dimitri Sivanich > Cc: Boris Ostrovsky > Cc: Juergen Gross > Cc: "Jérôme Glisse" > Cc: Andrea Arcangeli > Cc: Felix Kuehling > Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) > Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND > 64-BIT)) > Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) > Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) > Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding > Poulsbo, Moorestow...) 
> Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) > Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) > Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) > Reported-by: David Rientjes > Signed-off-by: Michal Hocko > --- > arch/x86/kvm/x86.c | 7 ++-- > drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- > drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- > drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- > drivers/infiniband/core/umem_odp.c | 33 +++ > drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- > drivers/infiniband/hw/mlx5/odp.c| 2 +- > drivers/misc/mic/scif/scif_dma.c| 7 ++-- > drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- > drivers/xen/gntdev.c| 44 - >
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:39:50, Christian König wrote: [...] > Not wanting to block something as important as this, so feel free to add an > Acked-by: Christian König to the patch. Thanks a lot! > Let's rather face the next topic: Any idea how to runtime test this? This is a good question indeed. One way to do that would be triggering the OOM killer from the context which uses each of these mmu notifiers (one at a time) and see how that works. You would see the note in the log whenever the notifier would block. The primary thing to test is how often the oom reaper really had to back off completely. > I mean I can rather easily provide a test which crashes an AMD GPU, which in > turn then would mean that the MMU notifier would block forever without this > patch. Well, you do not really have to go that far. It should be sufficient to do the above. The current code would simply back off without releasing any memory. The patch should help to reclaim some memory. > But do you know a way to let the OOM killer kill a specific process? Yes, you can set its oom_score_adj to 1000 which means always select that task. -- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 14:35 schrieb Michal Hocko: On Mon 02-07-18 14:24:29, Christian König wrote: Am 02.07.2018 um 14:20 schrieb Michal Hocko: On Mon 02-07-18 14:13:42, Christian König wrote: Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. But we do rely on the proper blockable handling. Yeah, but you could add the handling only after you have all the implementations in place. Don't you? Yeah, but then I would be adding a code with no user. And I really prefer to no do so because then the code is harder to argue about. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. 
It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. Well, I don't insist on this. It's just from my point of view that this patch doesn't needs to be one patch, but could be split up. Well, if there are more people with the same concern I can try to do that. But if your only concern is to focus on your particular part then I guess it would be easier both for you and me to simply apply the patch and use git show $files_for_your_subystem on your end. I have put the patch to attempts/oom-vs-mmu-notifiers branch to my tree at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git Not wanting to block something as important as this, so feel free to add an Acked-by: Christian König to the patch. Let's rather face the next topic: Any idea how to runtime test this? I mean I can rather easily provide a test which crashes an AMD GPU, which in turn then would mean that the MMU notifier would block forever without this patch. But do you know a way to let the OOM killer kill a specific process? Regards, Christian. 
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:24:29, Christian König wrote: > Am 02.07.2018 um 14:20 schrieb Michal Hocko: > > On Mon 02-07-18 14:13:42, Christian König wrote: > > > Am 02.07.2018 um 13:54 schrieb Michal Hocko: > > > > On Mon 02-07-18 11:14:58, Christian König wrote: > > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > > > > > This is the v2 of RFC based on the feedback I've received so far. > > > > > > The > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, > > > > > > mostly > > > > > > because I have no idea how. > > > > > > > > > > > > Any further feedback is highly appreciated of course. > > > > > That sounds like it should work and at least the amdgpu changes now > > > > > look > > > > > good to me on first glance. > > > > > > > > > > Can you split that up further in the usual way? E.g. adding the > > > > > blockable > > > > > flag in one patch and fixing all implementations of the MMU notifier > > > > > in > > > > > follow up patches. > > > > But such a code would be broken, no? Ignoring the blockable state will > > > > simply lead to lockups until the fixup parts get applied. > > > Well to still be bisect-able you only need to get the interface change in > > > first with fixing the function signature of the implementations. > > That would only work if those functions return -AGAIN unconditionally. > > Otherwise they would pretend to not block while that would be obviously > > incorrect. This doesn't sound correct to me. > > > > > Then add all the new code to the implementations and last start to > > > actually > > > use the new interface. > > > > > > That is a pattern we use regularly and I think it's good practice to do > > > this. > > But we do rely on the proper blockable handling. > > Yeah, but you could add the handling only after you have all the > implementations in place. Don't you? Yeah, but then I would be adding a code with no user. And I really prefer to no do so because then the code is harder to argue about. 
> > > > Is the split up really worth it? I was thinking about that but had hard > > > > times to end up with something that would be bisectable. Well, except > > > > for returning -EBUSY until all notifiers are implemented. Which I found > > > > confusing. > > > It at least makes reviewing changes much easier, cause as driver > > > maintainer > > > I can concentrate on the stuff only related to me. > > > > > > Additional to that when you cause some unrelated side effect in a driver > > > we > > > can much easier pinpoint the actual change later on when the patch is > > > smaller. > > > > > > > > This way I'm pretty sure Felix and I can give an rb on the > > > > > amdgpu/amdkfd > > > > > changes. > > > > If you are worried to give r-b only for those then this can be done even > > > > for larger patches. Just make your Reviewd-by more specific > > > > R-b: name # For BLA BLA > > > Yeah, possible alternative but more work for me when I review it :) > > I definitely do not want to add more work to reviewers and I completely > > see how massive "flag days" like these are not popular but I really > > didn't find a reasonable way around that would be both correct and > > wouldn't add much more churn on the way. So if you really insist then I > > would really appreciate a hint on the way to achive the same without any > > above downsides. > > Well, I don't insist on this. It's just from my point of view that this > patch doesn't needs to be one patch, but could be split up. Well, if there are more people with the same concern I can try to do that. But if your only concern is to focus on your particular part then I guess it would be easier both for you and me to simply apply the patch and use git show $files_for_your_subystem on your end. 
I have put the patch to attempts/oom-vs-mmu-notifiers branch to my tree at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 14:20 schrieb Michal Hocko: On Mon 02-07-18 14:13:42, Christian König wrote: Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. But we do rely on the proper blockable handling. Yeah, but you could add the handling only after you have all the implementations in place. Don't you? Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. 
Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. Well, I don't insist on this. It's just from my point of view that this patch doesn't needs to be one patch, but could be split up. Could be that I just don't know the code or the consequences of adding that well enough to really judge. Christian. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 14:13:42, Christian König wrote: > Am 02.07.2018 um 13:54 schrieb Michal Hocko: > > On Mon 02-07-18 11:14:58, Christian König wrote: > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > > > This is the v2 of RFC based on the feedback I've received so far. The > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > > > because I have no idea how. > > > > > > > > Any further feedback is highly appreciated of course. > > > That sounds like it should work and at least the amdgpu changes now look > > > good to me on first glance. > > > > > > Can you split that up further in the usual way? E.g. adding the blockable > > > flag in one patch and fixing all implementations of the MMU notifier in > > > follow up patches. > > But such a code would be broken, no? Ignoring the blockable state will > > simply lead to lockups until the fixup parts get applied. > > Well to still be bisect-able you only need to get the interface change in > first with fixing the function signature of the implementations. That would only work if those functions return -AGAIN unconditionally. Otherwise they would pretend to not block while that would be obviously incorrect. This doesn't sound correct to me. > Then add all the new code to the implementations and last start to actually > use the new interface. > > That is a pattern we use regularly and I think it's good practice to do > this. But we do rely on the proper blockable handling. > > Is the split up really worth it? I was thinking about that but had hard > > times to end up with something that would be bisectable. Well, except > > for returning -EBUSY until all notifiers are implemented. Which I found > > confusing. > > It at least makes reviewing changes much easier, cause as driver maintainer > I can concentrate on the stuff only related to me. 
> > Additional to that when you cause some unrelated side effect in a driver we > can much easier pinpoint the actual change later on when the patch is > smaller. > > > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd > > > changes. > > If you are worried to give r-b only for those then this can be done even > > for larger patches. Just make your Reviewd-by more specific > > R-b: name # For BLA BLA > > Yeah, possible alternative but more work for me when I review it :) I definitely do not want to add more work to reviewers and I completely see how massive "flag days" like these are not popular but I really didn't find a reasonable way around that would be both correct and wouldn't add much more churn on the way. So if you really insist then I would really appreciate a hint on the way to achive the same without any above downsides. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 02.07.2018 um 13:54 schrieb Michal Hocko: On Mon 02-07-18 11:14:58, Christian König wrote: Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Well to still be bisect-able you only need to get the interface change in first with fixing the function signature of the implementations. Then add all the new code to the implementations and last start to actually use the new interface. That is a pattern we use regularly and I think it's good practice to do this. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. It at least makes reviewing changes much easier, cause as driver maintainer I can concentrate on the stuff only related to me. Additional to that when you cause some unrelated side effect in a driver we can much easier pinpoint the actual change later on when the patch is smaller. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA Yeah, possible alternative but more work for me when I review it :) Regards, Christian. 
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 02-07-18 11:14:58, Christian König wrote: > Am 27.06.2018 um 09:44 schrieb Michal Hocko: > > This is the v2 of RFC based on the feedback I've received so far. The > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly > > because I have no idea how. > > > > Any further feedback is highly appreciated of course. > > That sounds like it should work and at least the amdgpu changes now look > good to me on first glance. > > Can you split that up further in the usual way? E.g. adding the blockable > flag in one patch and fixing all implementations of the MMU notifier in > follow up patches. But such a code would be broken, no? Ignoring the blockable state will simply lead to lockups until the fixup parts get applied. Is the split up really worth it? I was thinking about that but had hard times to end up with something that would be bisectable. Well, except for returning -EBUSY until all notifiers are implemented. Which I found confusing. > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd > changes. If you are worried to give r-b only for those then this can be done even for larger patches. Just make your Reviewd-by more specific R-b: name # For BLA BLA -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Am 27.06.2018 um 09:44 schrieb Michal Hocko: This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how. Any further feedback is highly appreciated of course. That sounds like it should work and at least the amdgpu changes now look good to me on first glance. Can you split that up further in the usual way? E.g. adding the blockable flag in one patch and fixing all implementations of the MMU notifier in follow up patches. This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd changes. Thanks, Christian. --- From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 20 Jun 2018 15:03:20 +0200 Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. 
This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) 
Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- incl
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
This is the v2 of RFC based on the feedback I've received so far. The code even compiles as a bonus ;) I haven't runtime tested it yet, mostly because I have no idea how.

Any further feedback is highly appreciated of course.

---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though.

This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Changes since rfc v1 - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter - note that a notifier failed to the log for easier debugging - back off early in ib_umem_notifier_invalidate_range_start if the callback is called - mn_invl_range_start waits for completion down the unmap_grant_pages path so we have to back off early on overlapping ranges Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: Felix Kuehling Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) 
Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- arch/x86/kvm/x86.c | 7 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 43 +++- drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++-- drivers/gpu/drm/radeon/radeon_mn.c | 22 +++-- drivers/infiniband/core/umem_odp.c | 33 +++ drivers/infiniband/hw/hfi1/mmu_rb.c | 11 --- drivers/infiniband/hw/mlx5/odp.c| 2 +- drivers/misc/mic/scif/scif_dma.c| 7 ++-- drivers/misc/sgi-gru/grutlbpurge.c | 7 ++-- drivers/xen/gntdev.c| 44 - include/linux/kvm_host.h| 4 +-- include/linux/mmu_notifier.h| 35 +++- include/linux/oom.h | 2 +- include/rdma/ib_umem_odp.h | 3 +- mm/hmm.c| 7 ++-- mm/mmap.c | 2 +- mm/mmu_notifier.c | 19 --- mm/oom_kill.c | 29 virt/kvm/kvm_main.c | 15 ++--- 19 files changed, 225 insertions(+), 80 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Mon 25-06-18 10:01:03, Michal Hocko wrote: > On Fri 22-06-18 16:09:06, Felix Kuehling wrote: > > On 2018-06-22 11:24 AM, Michal Hocko wrote: > > > On Fri 22-06-18 17:13:02, Christian König wrote: > > >> Hi Michal, > > >> > > >> [Adding Felix as well] > > >> > > >> Well first of all you have a misconception why at least the AMD graphics > > >> driver need to be able to sleep in an MMU notifier: We need to sleep > > >> because > > >> we need to wait for hardware operations to finish and *NOT* because we > > >> need > > >> to wait for locks. > > >> > > >> I'm not sure if your flag now means that you generally can't sleep in MMU > > >> notifiers any more, but if that's the case at least AMD hardware will > > >> break > > >> badly. In our case the approach of waiting for a short time for the > > >> process > > >> to be reaped and then select another victim actually sounds like the > > >> right > > >> thing to do. > > > Well, I do not need to make the notifier code non blocking all the time. > > > All I need is to ensure that it won't sleep if the flag says so and > > > return -EAGAIN instead. > > > > > > So here is what I do for amdgpu: > > > > In the case of KFD we also need to take the DQM lock: > > > > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> > > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch > > > > So we'd need to pass the blockable parameter all the way through that > > call chain. > > Thanks, I have missed that part. So I guess I will start with something > similar to intel-gfx and back off when the current range needs some > treatment. So this on top. Does it look correct? 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > index d138a526feff..e2d422b3eb0b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c > @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct > mmu_notifier *mn, > struct amdgpu_mn_node *node; > struct amdgpu_bo *bo; > > + if (!blockable) { > + amdgpu_mn_read_unlock(); > + return -EAGAIN; > + } > + > node = container_of(it, struct amdgpu_mn_node, it); > it = interval_tree_iter_next(it, start, end); Ble, just noticed that half of the change didn't get to git index... This is what I have commit c4701b36ac2802b903db3d05cf77c030fccce3a8 Author: Michal Hocko Date: Mon Jun 25 15:24:03 2018 +0200 fold me - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure we bail out when we have an intersecting range for starter diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index d138a526feff..3399a4a927fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, while (it) { struct amdgpu_mn_node *node; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); @@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, struct amdgpu_mn_node *node; struct amdgpu_bo *bo; + if (!blockable) { + amdgpu_mn_read_unlock(rmn); + return -EAGAIN; + } + node = container_of(it, struct amdgpu_mn_node, it); it = interval_tree_iter_next(it, start, end); -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 16:09:06, Felix Kuehling wrote: > On 2018-06-22 11:24 AM, Michal Hocko wrote: > > On Fri 22-06-18 17:13:02, Christian König wrote: > >> Hi Michal, > >> > >> [Adding Felix as well] > >> > >> Well first of all you have a misconception why at least the AMD graphics > >> driver need to be able to sleep in an MMU notifier: We need to sleep > >> because > >> we need to wait for hardware operations to finish and *NOT* because we need > >> to wait for locks. > >> > >> I'm not sure if your flag now means that you generally can't sleep in MMU > >> notifiers any more, but if that's the case at least AMD hardware will break > >> badly. In our case the approach of waiting for a short time for the process > >> to be reaped and then select another victim actually sounds like the right > >> thing to do. > > Well, I do not need to make the notifier code non blocking all the time. > > All I need is to ensure that it won't sleep if the flag says so and > > return -EAGAIN instead. > > > > So here is what I do for amdgpu: > > In the case of KFD we also need to take the DQM lock: > > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch > > So we'd need to pass the blockable parameter all the way through that > call chain. Thanks, I have missed that part. So I guess I will start with something similar to intel-gfx and back off when the current range needs some treatment. So this on top. Does it look correct? 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	struct amdgpu_mn_node *node;
 	struct amdgpu_bo *bo;
 
+	if (!blockable) {
+		amdgpu_mn_read_unlock(rmn);
+		return -EAGAIN;
+	}
+
 	node = container_of(it, struct amdgpu_mn_node, it);
 	it = interval_tree_iter_next(it, start, end);
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On 2018-06-22 11:24 AM, Michal Hocko wrote: > On Fri 22-06-18 17:13:02, Christian König wrote: >> Hi Michal, >> >> [Adding Felix as well] >> >> Well first of all you have a misconception why at least the AMD graphics >> driver need to be able to sleep in an MMU notifier: We need to sleep because >> we need to wait for hardware operations to finish and *NOT* because we need >> to wait for locks. >> >> I'm not sure if your flag now means that you generally can't sleep in MMU >> notifiers any more, but if that's the case at least AMD hardware will break >> badly. In our case the approach of waiting for a short time for the process >> to be reaped and then select another victim actually sounds like the right >> thing to do. > Well, I do not need to make the notifier code non blocking all the time. > All I need is to ensure that it won't sleep if the flag says so and > return -EAGAIN instead. > > So here is what I do for amdgpu: In the case of KFD we also need to take the DQM lock: amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr -> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch So we'd need to pass the blockable parameter all the way through that call chain. Regards, Felix > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> index 83e344fbb50a..d138a526feff 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c >>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn) >>>* >>>* Take the rmn read side lock. 
>>>   */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>  {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>  	if (atomic_inc_return(&rmn->recursion) == 1)
>>>  		down_read_non_owner(&rmn->lock);
>>>  	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>  }
>>>  /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>   * We block for all BOs between start and end to be idle and
>>>   * unmap them by move them into system domain again.
>>>   */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  						struct mm_struct *mm,
>>>  						unsigned long start,
>>> -						unsigned long end)
>>> +						unsigned long end,
>>> +						bool blockable)
>>>  {
>>>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>  	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  	/* notification is exclusive, but interval is inclusive */
>>>  	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>  	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>  		amdgpu_mn_invalidate_node(node, start, end);
>>>  	}
>>> +
>>> +	return 0;
>>>  }
>>>  /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   * necessitates evicting all user-mode queues of the process. The BOs
>>>   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>   */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>  						struct mm_struct *mm,
>>>  						unsigned long start,
>>> -						unsigned long end)
>>> +						unsigned long end,
>>> +						bool blockable)
>>>  {
>>>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>  	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>  	/* notification is exclusive, but interval is inclusive */
>>>  	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>  	while (it) {
>>> @@ -262,6 +277,8 @@ static void
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote: > [Resnding with the CC list fixed] > > On Fri 22-06-18 18:40:26, Michal Hocko wrote: > > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > > Hi, > > > > > > this is an RFC and not tested at all. I am not very familiar with > > > > > > the > > > > > > mmu notifiers semantics very much so this is a crude attempt to > > > > > > achieve > > > > > > what I need basically. It might be completely wrong but I would like > > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had > > > > > > to trim > > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object > > > > > > *mo) > > > > > > mo->attached = false; > > > > > > } > > > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > > > mmu_notifier *_mn, > > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > > > mmu_notifier *_mn, > > > > > >struct > > > > > > mm_struct *mm, > > > > > >unsigned > > > > > > long start, > > > > > > - unsigned > > > > > > long end) > > > > > > + unsigned > > > > > > long end, > > > > > > + bool > > > > > > blockable) > > > > > > { > > > > > > struct i915_mmu_notifier *mn = > > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > > @@ -124,7 +125,7 @@ static void > > > > > > 
i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > > LIST_HEAD(cancelled); > > > > > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > > > - return; > > > > > > + return 0; > > > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > > access). > > > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > > > Arbitrary amount of time but in desktop use case you can assume that > > > it should never go above 16ms for a 60frame per second rendering of > > > your desktop (in GPU compute case this kind of assumption does not > > > hold). Is the process exit_state already updated by the time this mmu > > > notifier callbacks happen ? > > > > What do you mean? The process is killed (by SIGKILL) at the time but we > > do not know much more than that. The task might be stuck anywhere in the > > kernel before handling that signal. I was wondering if another thread might still be dereferencing any of the structure concurrently with the OOM mmu notifier callback. Saddly yes, it would be simpler if we could make such assumption. > > > > > > > The first pass would be then to not do anything here if > > > > > !blockable. > > > > > > > > something like this? (incremental diff) > > > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > > depends on the range and invalidate internal driver states (like clear > > > buffer object pages array in case of GPU but not GPU page table). While > > > the second callback would do the actual wait on the GPU to be done and > > > update the GPU page table. > > > > What can you do after the first phase? Can I unmap the range? 
No you can't do anything but this force synchronization as any other thread that concurrently trie to do something with those would queue up. So it would serialize thing. Also main motivation on my side is multi-GPU, right now multi-GPU are not that common but this is changing quickly and what we see on high end (4, 8 or 16 GPUs per socket) is spreading into more configurations. Here in mutli GPU case splitting in two would avoid having to fully wait on first GPU before trying to invaidate on second GPU, so on and so forth. > > > > > Now in this scheme in case the task is already in some exit state and > > > that all CPU threads are frozen/kill then we can probably find a way to > > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > >
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
[Resnding with the CC list fixed] On Fri 22-06-18 18:40:26, Michal Hocko wrote: > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > Hi, > > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > > mmu notifiers semantics very much so this is a crude attempt to > > > > > achieve > > > > > what I need basically. It might be completely wrong but I would like > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to > > > > > trim > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object > > > > > *mo) > > > > > mo->attached = false; > > > > > } > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > > mmu_notifier *_mn, > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > > mmu_notifier *_mn, > > > > >struct > > > > > mm_struct *mm, > > > > >unsigned long > > > > > start, > > > > > - unsigned long > > > > > end) > > > > > + unsigned long > > > > > end, > > > > > + bool blockable) > > > > > { > > > > > struct i915_mmu_notifier *mn = > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > @@ -124,7 +125,7 @@ static void > > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > LIST_HEAD(cancelled); > > > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > > - return; > > > > > + return 
0; > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > access). > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > Arbitrary amount of time but in desktop use case you can assume that > > it should never go above 16ms for a 60frame per second rendering of > > your desktop (in GPU compute case this kind of assumption does not > > hold). Is the process exit_state already updated by the time this mmu > > notifier callbacks happen ? > > What do you mean? The process is killed (by SIGKILL) at the time but we > do not know much more than that. The task might be stuck anywhere in the > kernel before handling that signal. > > > > > The first pass would be then to not do anything here if > > > > !blockable. > > > > > > something like this? (incremental diff) > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > depends on the range and invalidate internal driver states (like clear > > buffer object pages array in case of GPU but not GPU page table). While > > the second callback would do the actual wait on the GPU to be done and > > update the GPU page table. > > What can you do after the first phase? Can I unmap the range? > > > Now in this scheme in case the task is already in some exit state and > > that all CPU threads are frozen/kill then we can probably find a way to > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > share userptr bo hence a uptr bo should only ever be access through > > ioctl submited by the process. > > > > The second call can then be delayed and ping from time to time to see > > if GPU jobs are done. 
> > > > > > Note that what you propose might still be useful as in case there is > > no buffer object for a range then OOM can make progress in freeing a > > range of memory. It is very likely that significant virtual address > > range of a process and backing memory can be reclaim that way. This > > assume OOM reclaim vma by vma or in some form of granularity like > > reclaiming 1GB by 1GB. Or we could also update blocking callback to > > return range that are blocking that way OOM can reclaim around. > > Exactly my point. What we have right now is all or nothing which is > obviously too coarse to be useful. -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
[Hmm, the cc list got mangled somehow - you have just made many people to work for suse ;) and to kvack.org in the preious one - fixed up hopefully] On Fri 22-06-18 17:07:21, Chris Wilson wrote: > Quoting Michal Hocko (2018-06-22 16:57:16) > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > Hi, > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > > what I need basically. It might be completely wrong but I would like > > > > to discuss what would be a better way if that is the case. > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to > > > > trim > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > index 854bd51b9478..5285df9331fa 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > > mo->attached = false; > > > > } > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > > mmu_notifier *_mn, > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > > mmu_notifier *_mn, > > > >struct mm_struct > > > > *mm, > > > >unsigned long > > > > start, > > > > - unsigned long > > > > end) > > > > + unsigned long > > > > end, > > > > + bool blockable) > > > > { > > > > struct i915_mmu_notifier *mn = > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > @@ -124,7 +125,7 @@ static void > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > LIST_HEAD(cancelled); > > > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > > - return; > > > > + return 0; > > > > > > The principle wait here is for the HW (even after 
fixing all the locks > > > to be not so coarse, we still have to wait for the HW to finish its > > > access). > > > > Is this wait bound or it can take basically arbitrary amount of time? > > Arbitrary. It waits for the last operation in the queue that needs that > set of backing pages, and that queue is unbounded and not even confined > to the local driver. (Though each operation should be bounded to be > completed within an interval or be cancelled, that interval is on the > order of 10s!) OK, I see. We should rather not wait that long so backoff is just better. The whole point of the oom_reaper is to tear down and free some memory. We do not really need to reclaim all of it. It would be great if we could do something like - kick the tear down of the device memory but have it done in the background. We wouldn't tear the vma down in that case but the whole process would start at least. I am not sure something like that is possible. > > > The first pass would be then to not do anything here if > > > !blockable. > > > > something like this? (incremental diff) > > Yup. Cool, I will start with that because even that is an improvement from the oom_reaper POV. Thanks! -- Michal Hocko SUSE Labs ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > Hi, > > > this is an RFC and not tested at all. I am not very familiar with the > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > what I need basically. It might be completely wrong but I would like > > > to discuss what would be a better way if that is the case. > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > it down. If you think I have forgot somebody, please let me know > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > index 854bd51b9478..5285df9331fa 100644 > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > mo->attached = false; > > > } > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct > > > mmu_notifier *_mn, > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct > > > mmu_notifier *_mn, > > >struct mm_struct > > > *mm, > > >unsigned long > > > start, > > > - unsigned long end) > > > + unsigned long end, > > > + bool blockable) > > > { > > > struct i915_mmu_notifier *mn = > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > @@ -124,7 +125,7 @@ static void > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > LIST_HEAD(cancelled); > > > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > > - return; > > > + return 0; > > > > The principle wait here is for the HW (even after fixing all the locks > > to be not so coarse, we still have to wait for the HW to finish its > > access). > > Is this wait bound or it can take basically arbitrary amount of time? 
Arbitrary amount of time but in desktop use case you can assume that it should never go above 16ms for a 60frame per second rendering of your desktop (in GPU compute case this kind of assumption does not hold). Is the process exit_state already updated by the time this mmu notifier callbacks happen ? > > > The first pass would be then to not do anything here if > > !blockable. > > something like this? (incremental diff) What i wanted to do with HMM and mmu notifier is split the invalidation in 2 pass. First pass tell the drivers to stop/cancel pending jobs that depends on the range and invalidate internal driver states (like clear buffer object pages array in case of GPU but not GPU page table). While the second callback would do the actual wait on the GPU to be done and update the GPU page table. Now in this scheme in case the task is already in some exit state and that all CPU threads are frozen/kill then we can probably find a way to do the first path mostly lock less. AFAICR nor AMD nor Intel allow to share userptr bo hence a uptr bo should only ever be access through ioctl submited by the process. The second call can then be delayed and ping from time to time to see if GPU jobs are done. Note that what you propose might still be useful as in case there is no buffer object for a range then OOM can make progress in freeing a range of memory. It is very likely that significant virtual address range of a process and backing memory can be reclaim that way. This assume OOM reclaim vma by vma or in some form of granularity like reclaiming 1GB by 1GB. Or we could also update blocking callback to return range that are blocking that way OOM can reclaim around. Cheers, Jérôme ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 16:36:49, Chris Wilson wrote: > Quoting Michal Hocko (2018-06-22 16:02:42) > > Hi, > > this is an RFC and not tested at all. I am not very familiar with the > > mmu notifiers semantics very much so this is a crude attempt to achieve > > what I need basically. It might be completely wrong but I would like > > to discuss what would be a better way if that is the case. > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > it down. If you think I have forgot somebody, please let me know > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > > b/drivers/gpu/drm/i915/i915_gem_userptr.c > > index 854bd51b9478..5285df9331fa 100644 > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > mo->attached = false; > > } > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > > *_mn, > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > > *_mn, > >struct mm_struct *mm, > >unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + bool blockable) > > { > > struct i915_mmu_notifier *mn = > > container_of(_mn, struct i915_mmu_notifier, mn); > > @@ -124,7 +125,7 @@ static void > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > LIST_HEAD(cancelled); > > > > if (RB_EMPTY_ROOT(>objects.rb_root)) > > - return; > > + return 0; > > The principle wait here is for the HW (even after fixing all the locks > to be not so coarse, we still have to wait for the HW to finish its > access). Is this wait bound or it can take basically arbitrary amount of time? > The first pass would be then to not do anything here if > !blockable. something like this? 
(incremental diff)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 5285df9331fa..e9ed0d2cfabc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		container_of(_mn, struct i915_mmu_notifier, mn);
 	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
+	int ret = 0;
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
@@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_unlock(&mn->lock);
 
 	/* TODO: can we skip waiting here? */
-	if (!list_empty(&cancelled) && blockable)
+	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
--
Michal Hocko
SUSE Labs
Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Quoting Michal Hocko (2018-06-22 16:02:42) > Hi, > this is an RFC and not tested at all. I am not very familiar with the > mmu notifiers semantics very much so this is a crude attempt to achieve > what I need basically. It might be completely wrong but I would like > to discuss what would be a better way if that is the case. > > get_maintainers gave me quite large list of people to CC so I had to trim > it down. If you think I have forgot somebody, please let me know > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c > b/drivers/gpu/drm/i915/i915_gem_userptr.c > index 854bd51b9478..5285df9331fa 100644 > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > mo->attached = false; > } > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > *_mn, > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier > *_mn, >struct mm_struct *mm, >unsigned long start, > - unsigned long end) > + unsigned long end, > + bool blockable) > { > struct i915_mmu_notifier *mn = > container_of(_mn, struct i915_mmu_notifier, mn); > @@ -124,7 +125,7 @@ static void > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > LIST_HEAD(cancelled); > > if (RB_EMPTY_ROOT(>objects.rb_root)) > - return; > + return 0; The principle wait here is for the HW (even after fixing all the locks to be not so coarse, we still have to wait for the HW to finish its access). The first pass would be then to not do anything here if !blockable. Jerome keeps on shaking his head and telling us we're doing it all wrong, so maybe it'll all fall out of HMM before we have to figure out how to differentiate between objects that can be invalidated immediately and those that need to acquire locks and/or wait. -Chris ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
On Fri 22-06-18 17:13:02, Christian König wrote:
> Hi Michal,
>
> [Adding Felix as well]
>
> Well first of all you have a misconception why at least the AMD graphics
> driver needs to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
>
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then selecting another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non-blocking all the time. All I
need is to ensure that it won't sleep if the flag says so and return -EAGAIN
instead. So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * Take the rmn read side lock.
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >  {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >  	if (atomic_inc_return(&rmn->recursion) == 1)
> >  		down_read_non_owner(&rmn->lock);
> >  	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >  }
> >
> >  /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >   * We block for all BOs between start and end to be idle and
> >   * unmap them by move them into system domain again.
> >   */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  						 struct mm_struct *mm,
> >  						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >  {
> >  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >  	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  	/* notification is exclusive, but interval is inclusive */
> >  	end -= 1;
> >
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >
> >  	it = interval_tree_iter_first(&rmn->objects, start, end);
> >  	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >  		amdgpu_mn_invalidate_node(node, start, end);
> >  	}
> > +
> > +	return 0;
> >  }
> >
> >  /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   * necessitates evicting all user-mode queues of the process. The BOs
> >   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >   */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  						 struct mm_struct *mm,
> >  						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >  {
> >  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >  	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  	/* notification is exclusive, but interval is inclusive */
> >  	end -= 1;
> >
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >
> >  	it = interval_tree_iter_first(&rmn->objects, start, end);
> >  	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >  			amdgpu_amdkfd_evict_userptr(mem, mm);
> >  		}
> >  	}
> > +
> > +	return 0;
> >  }
> >
> >  /**
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
Hi Michal, [Adding Felix as well] Well first of all you have a misconception why at least the AMD graphics driver need to be able to sleep in an MMU notifier: We need to sleep because we need to wait for hardware operations to finish and *NOT* because we need to wait for locks. I'm not sure if your flag now means that you generally can't sleep in MMU notifiers any more, but if that's the case at least AMD hardware will break badly. In our case the approach of waiting for a short time for the process to be reaped and then select another victim actually sounds like the right thing to do. What we also already try to do is to abort hardware operations with the address space when we detect that the process is dying, but that can certainly be improved. Regards, Christian. Am 22.06.2018 um 17:02 schrieb Michal Hocko: From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover most notifiers only care about a portion of the address space. This patch handles the first part of the problem. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock I presume. 
Anyway, what does the oom_reaper do with all that? We do not have to fail right away. We simply retry if there is at least one notifier which couldn't make any progress. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, this is an RFC and not tested at all. I am not very familiar with the mmu notifiers semantics very much so this is a crude attempt to achieve what I need basically. It might be completely wrong but I would like to discuss what would be a better way if that is the case. get_maintainers gave me quite large list of people to CC so I had to trim it down. If you think I have forgot somebody, please let me know Any feedback is highly appreciated. 
arch/x86/kvm/x86.c | 7 -- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 33 +++-- drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +--- drivers/gpu/drm/radeon/radeon_mn.c | 15 --- drivers/infiniband/core/umem_odp.c | 15 --- drivers/infiniband/hw/hfi1/mmu_rb.c | 7 -- drivers/misc/mic/scif/scif_dma.c| 7 -- drivers/misc/sgi-gru/grutlbpurge.c | 7 -- drivers/xen/gntdev.c| 14 --- include/linux/kvm_host.h| 2 +- include/linux/mmu_notifier.h| 15 +-- mm/hmm.c| 7 -- mm/mmu_notifier.c | 15 --- mm/oom_kill.c | 29 +++--- virt/kvm/kvm_main.c | 12 ++--- 15 files changed, 137 insertions(+), 58 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu,
[RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
From: Michal Hocko There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover most notifiers only care about a portion of the address space. This patch handles the first part of the problem. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock I presume. Anyway, what does the oom_reaper do with all that? We do not have to fail right away. We simply retry if there is at least one notifier which couldn't make any progress. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. 
Cc: "David (ChunMing) Zhou" Cc: Paolo Bonzini Cc: "Radim Krčmář" Cc: Alex Deucher Cc: "Christian König" Cc: David Airlie Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Doug Ledford Cc: Jason Gunthorpe Cc: Mike Marciniszyn Cc: Dennis Dalessandro Cc: Sudeep Dutt Cc: Ashutosh Dixit Cc: Dimitri Sivanich Cc: Boris Ostrovsky Cc: Juergen Gross Cc: "Jérôme Glisse" Cc: Andrea Arcangeli Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)) Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)) Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS) Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS) Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...) Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM) Cc: xen-de...@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE) Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management) Reported-by: David Rientjes Signed-off-by: Michal Hocko --- Hi, this is an RFC and not tested at all. I am not very familiar with the mmu notifiers semantics very much so this is a crude attempt to achieve what I need basically. It might be completely wrong but I would like to discuss what would be a better way if that is the case. get_maintainers gave me quite large list of people to CC so I had to trim it down. If you think I have forgot somebody, please let me know Any feedback is highly appreciated. 
arch/x86/kvm/x86.c | 7 -- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 33 +++-- drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +--- drivers/gpu/drm/radeon/radeon_mn.c | 15 --- drivers/infiniband/core/umem_odp.c | 15 --- drivers/infiniband/hw/hfi1/mmu_rb.c | 7 -- drivers/misc/mic/scif/scif_dma.c| 7 -- drivers/misc/sgi-gru/grutlbpurge.c | 7 -- drivers/xen/gntdev.c| 14 --- include/linux/kvm_host.h| 2 +- include/linux/mmu_notifier.h| 15 +-- mm/hmm.c| 7 -- mm/mmu_notifier.c | 15 --- mm/oom_kill.c | 29 +++--- virt/kvm/kvm_main.c | 12 ++--- 15 files changed, 137 insertions(+), 58 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6bcecc325e7e..ac08f5d711be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap); } -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, + unsigned long start, unsigned long end, + bool blockable) { unsigned long apic_address; @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (start <= apic_address && apic_address < end) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); + + return 0; } void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 83e344fbb50a..d138a526feff 100644 ---