Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> On Tue 03-07-18 00:08:05, Greg Thelen wrote:
>> Michal Hocko wrote:
>>
>> > On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> >> Michal Hocko wrote:
>> >>
>> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> >> Michal Hocko wrote:
>> >> > [...]
>> >> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> >> > +		return OOM_SUCCESS;
>> >> >> > +
>> >> >> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
>> >> >> > +		"This looks like a misconfiguration or a kernel bug.");
>> >> >>
>> >> >> I'm not sure the warning should be here or be so strongly worded. It
>> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> >> thus mem_cgroup_out_of_memory() will return false. So there's nothing
>> >> >> alarming in that case.
>> >> >
>> >> > If the task is reaped then its charges should be released as well and
>> >> > that means that we should get below the limit. Sure there is some room
>> >> > for races but this should still be unlikely. Maybe I am just
>> >> > underestimating though.
>> >> >
>> >> > What would you suggest instead?
>> >>
>> >> I suggest checking MMF_OOM_SKIP or deleting the warning.
>> >
>> > So what do you do when you have a MMF_OOM_SKIP task? Do not warn? Checking
>> > all the tasks would be quite expensive and remembering that from the
>> > task selection is not nice either. Why do you think it would help much?
>>
>> I assume we could just check current's MMF_OOM_SKIP - no need to check
>> all tasks.
>
> I still do not follow. If you are after a single task memcg then we
> should be ok. try_charge has a bail-out for oom victims
>
> 	if (unlikely(tsk_is_oom_victim(current) ||
> 		     fatal_signal_pending(current) ||
> 		     current->flags & PF_EXITING))
> 		goto force;
>
> regardless of MMF_OOM_SKIP. So if there is a single process in the
> memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP,
> then we should bail out there. Or do I miss your intention?
For a single task memcg it seems that racing process cgroup migration
could trigger the new warning (I have attempted to reproduce this):

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]                  Process B[M2]
  M1 is oom
  try_charge(M1)
                                 Move A M1=>M2
  mem_cgroup_oom()
    mem_cgroup_out_of_memory()
      out_of_memory()
        select_bad_process()
          sees nothing in M1
        return 0
      return 0
  WARN()

Another variant might be possible, this time with global oom:

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]                  Process B[M2]
  try_charge()
                                 trigger global oom
                                 reaper sets A.MMF_OOM_SKIP
  mem_cgroup_oom()
    mem_cgroup_out_of_memory()
      out_of_memory()
        select_bad_process()
          sees nothing in M1
        return 0
      return 0
  WARN()

These seem unlikely, so I'm fine with taking a wait-and-see approach.
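The first race above can be sketched as a userspace toy model (all the struct and function names here are hypothetical stand-ins for the kernel paths, not real kernel code): once B migrates A out of M1, the victim scan finds no task in M1, out_of_memory() reports no progress, and the new warning would fire even though nothing is misconfigured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the migration race; illustrative stand-ins, not kernel code. */
struct memcg { const char *name; };
struct task { struct memcg *memcg; };

/* select_bad_process() analogue: find any task charged to @oom_memcg. */
static struct task *select_bad_process(struct task **tasks, size_t n,
                                       struct memcg *oom_memcg)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i]->memcg == oom_memcg)
			return tasks[i];
	return NULL;
}

/* out_of_memory() analogue: false when no eligible victim exists. */
static bool out_of_memory(struct task **tasks, size_t n,
                          struct memcg *oom_memcg)
{
	return select_bad_process(tasks, n, oom_memcg) != NULL;
}

/*
 * Charge-path analogue: when the memcg oom kill cannot make progress,
 * the patch under discussion forces the charge and WARN()s.
 */
static bool charge_would_warn(struct task **tasks, size_t n, struct memcg *m)
{
	return !out_of_memory(tasks, n, m);
}
```

With A still in M1 a victim is found and no warning fires; after the simulated migration M1 looks empty and the warning path is taken, matching the diagram.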
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
On Tue 03-07-18 00:08:05, Greg Thelen wrote:
> Michal Hocko wrote:
>
> > On Fri 29-06-18 11:59:04, Greg Thelen wrote:
> >> Michal Hocko wrote:
> >>
> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
> >> >> Michal Hocko wrote:
> >> > [...]
> >> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
> >> >> > +		return OOM_SUCCESS;
> >> >> > +
> >> >> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
> >> >> > +		"This looks like a misconfiguration or a kernel bug.");
> >> >>
> >> >> I'm not sure the warning should be here or be so strongly worded. It
> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
> >> >> thus mem_cgroup_out_of_memory() will return false. So there's nothing
> >> >> alarming in that case.
> >> >
> >> > If the task is reaped then its charges should be released as well and
> >> > that means that we should get below the limit. Sure there is some room
> >> > for races but this should still be unlikely. Maybe I am just
> >> > underestimating though.
> >> >
> >> > What would you suggest instead?
> >>
> >> I suggest checking MMF_OOM_SKIP or deleting the warning.
> >
> > So what do you do when you have a MMF_OOM_SKIP task? Do not warn? Checking
> > all the tasks would be quite expensive and remembering that from the
> > task selection is not nice either. Why do you think it would help much?
>
> I assume we could just check current's MMF_OOM_SKIP - no need to check
> all tasks.

I still do not follow. If you are after a single task memcg then we
should be ok. try_charge has a bail-out for oom victims

	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))
		goto force;

regardless of MMF_OOM_SKIP. So if there is a single process in the
memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP,
then we should bail out there. Or do I miss your intention?

> My only (minor) objection is that the warning text suggests
> misconfiguration or kernel bug, when there may be neither.
>
> > I feel strongly that we have to warn when bypassing the charge limit
> > during the corner case because it can lead to unexpected behavior and
> > users should be aware of this fact. I am open to the wording or some
> > optimizations. I would prefer the latter on top with a clear description
> > of how it helped in a particular case though. I would rather not over
> > optimize now without any story to back it.
>
> I'm fine with the warning. I know enough to look at dmesg logs to make
> an educated guess that the race occurred. We can refine it later
> if/when the reports start rolling in. No change needed.

OK. Thanks!
--
Michal Hocko
SUSE Labs
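The victim bypass Michal describes can be modeled in toy form (the struct and helper names are illustrative assumptions; the authoritative check is the quoted try_charge snippet): a task that is already an oom victim, has a fatal signal pending, or is exiting forces the charge and never reaches the warning path.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the fields try_charge() inspects. */
struct task_model {
	bool oom_victim;           /* tsk_is_oom_victim() analogue */
	bool fatal_signal_pending; /* fatal_signal_pending() analogue */
	bool exiting;              /* PF_EXITING analogue */
};

/*
 * Toy try_charge(): an oom victim (or dying task) bypasses the limit
 * ("goto force") regardless of MMF_OOM_SKIP, so a single-task memcg
 * whose victim was reaped never reaches the WARN() path.
 */
static bool try_charge_forces(const struct task_model *t)
{
	return t->oom_victim || t->fatal_signal_pending || t->exiting;
}
```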
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> Michal Hocko wrote:
>>
>> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> Michal Hocko wrote:
>> > [...]
>> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> > +		return OOM_SUCCESS;
>> >> > +
>> >> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
>> >> > +		"This looks like a misconfiguration or a kernel bug.");
>> >>
>> >> I'm not sure the warning should be here or be so strongly worded. It
>> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> thus mem_cgroup_out_of_memory() will return false. So there's nothing
>> >> alarming in that case.
>> >
>> > If the task is reaped then its charges should be released as well and
>> > that means that we should get below the limit. Sure there is some room
>> > for races but this should still be unlikely. Maybe I am just
>> > underestimating though.
>> >
>> > What would you suggest instead?
>>
>> I suggest checking MMF_OOM_SKIP or deleting the warning.
>
> So what do you do when you have a MMF_OOM_SKIP task? Do not warn? Checking
> all the tasks would be quite expensive and remembering that from the
> task selection is not nice either. Why do you think it would help much?

I assume we could just check current's MMF_OOM_SKIP - no need to check
all tasks.

My only (minor) objection is that the warning text suggests
misconfiguration or kernel bug, when there may be neither.

> I feel strongly that we have to warn when bypassing the charge limit
> during the corner case because it can lead to unexpected behavior and
> users should be aware of this fact. I am open to the wording or some
> optimizations. I would prefer the latter on top with a clear description
> of how it helped in a particular case though. I would rather not over
> optimize now without any story to back it.

I'm fine with the warning. I know enough to look at dmesg logs to make
an educated guess that the race occurred. We can refine it later if/when
the reports start rolling in. No change needed.
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
On Fri 29-06-18 11:59:04, Greg Thelen wrote:
> Michal Hocko wrote:
>
> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
> >> Michal Hocko wrote:
> > [...]
> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
> >> > +		return OOM_SUCCESS;
> >> > +
> >> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
> >> > +		"This looks like a misconfiguration or a kernel bug.");
> >>
> >> I'm not sure the warning should be here or be so strongly worded. It
> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
> >> thus mem_cgroup_out_of_memory() will return false. So there's nothing
> >> alarming in that case.
> >
> > If the task is reaped then its charges should be released as well and
> > that means that we should get below the limit. Sure there is some room
> > for races but this should still be unlikely. Maybe I am just
> > underestimating though.
> >
> > What would you suggest instead?
>
> I suggest checking MMF_OOM_SKIP or deleting the warning.

So what do you do when you have a MMF_OOM_SKIP task? Do not warn? Checking
all the tasks would be quite expensive and remembering that from the
task selection is not nice either. Why do you think it would help much?

I feel strongly that we have to warn when bypassing the charge limit
during the corner case because it can lead to unexpected behavior and
users should be aware of this fact. I am open to the wording or some
optimizations. I would prefer the latter on top with a clear description
of how it helped in a particular case though. I would rather not over
optimize now without any story to back it.
--
Michal Hocko
SUSE Labs
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> Michal Hocko wrote:
> [...]
>> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
>> > +		return OOM_SUCCESS;
>> > +
>> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
>> > +		"This looks like a misconfiguration or a kernel bug.");
>>
>> I'm not sure the warning should be here or be so strongly worded. It
>> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> thus mem_cgroup_out_of_memory() will return false. So there's nothing
>> alarming in that case.
>
> If the task is reaped then its charges should be released as well and
> that means that we should get below the limit. Sure there is some room
> for races but this should still be unlikely. Maybe I am just
> underestimating though.
>
> What would you suggest instead?

I suggest checking MMF_OOM_SKIP or deleting the warning. But I don't
feel strongly.
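The MMF_OOM_SKIP suggestion can be sketched as a toy decision function (the bit number and helpers here are illustrative assumptions; in the kernel the real test would be along the lines of test_bit(MMF_OOM_SKIP, &current->mm->flags)): only warn when the oom killer made no progress *and* current was not already oom-reaped.

```c
#include <assert.h>
#include <stdbool.h>

#define MMF_OOM_SKIP 21 /* illustrative bit number, not the kernel's */

/* Toy mm_struct carrying only the flags word we care about. */
struct mm_model { unsigned long flags; };

static bool test_bit_model(int nr, const unsigned long *flags)
{
	return (*flags >> nr) & 1UL;
}

/*
 * Suggested variant: suppress the warning when current was already
 * oom-reaped, since mem_cgroup_out_of_memory() returning false is then
 * expected rather than alarming.
 */
static bool should_warn(bool oom_kill_made_progress, const struct mm_model *mm)
{
	return !oom_kill_made_progress &&
	       !test_bit_model(MMF_OOM_SKIP, &mm->flags);
}
```

This checks only current, matching Greg's point that no expensive scan of all tasks is needed.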
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
On Thu 28-06-18 16:19:07, Greg Thelen wrote:
> Michal Hocko wrote:
[...]
> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +		return OOM_SUCCESS;
> > +
> > +	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
> > +		"This looks like a misconfiguration or a kernel bug.");
>
> I'm not sure the warning should be here or be so strongly worded. It
> seems like the current task could be oom reaped with MMF_OOM_SKIP and
> thus mem_cgroup_out_of_memory() will return false. So there's nothing
> alarming in that case.

If the task is reaped then its charges should be released as well and
that means that we should get below the limit. Sure there is some room
for races but this should still be unlikely. Maybe I am just
underestimating though.

What would you suggest instead?
--
Michal Hocko
SUSE Labs
Re: [PATCH] memcg, oom: move out_of_memory back to the charge path
Michal Hocko wrote:
> From: Michal Hocko
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is
> inconsistent with the !memcg case where the oom killer is invoked from the
> allocation context and the allocator keeps retrying until it succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to user space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent the same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful than
> useful because rare inconsistencies in the ENOMEM behavior are hard to
> test for and error prone. Basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The latter
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system. E.g. allowing tasks to be migrated away from the
> memcg to a different unlimited memcg with move_charge_at_immigrate
> disabled.
>
> Changes since rfc v1
> - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much more
>   clear what the purpose of the flag is now
> - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g
>   s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes
> - make oom_kill_disable an exceptional case because it should be rare
>   and the normal oom handling the core of the function - per Johannes
>
> Signed-off-by: Michal Hocko

Acked-by: Greg Thelen

Thanks! One comment below.

> ---
>
> Hi,
> I've posted this as an RFC previously [1]. There was no fundamental
> disagreement so I've integrated all the suggested changes and tested it.
> mmap(MAP_POPULATE) hits the oom killer again rather than silently failing
> to populate the mapping on the hard limit excess. On the other hand
> g-u-p and other charge paths keep the ENOMEM semantic when the memcg oom
> killer is disabled. All the forward progress guarantee relies on the oom
> reaper.
>
> Unless there are objections I think this is ready to go to mmotm and
> ready for the next merge window
>
> [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org
>
>  include/linux/memcontrol.h | 16
>  include/linux/sched.h      |  2 +-
>  mm/memcontrol.c            | 75 ++
>  mm/memory.c                |  4 +-
>  4 files changed, 71 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb116e925..5a69bb4026f6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  				struct task_struct *p);
>
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
> -	WARN_ON(current->memcg_may_oom);
> -	current->memcg_may_oom = 1;
> +	WARN_ON(current->in_user_fault);
> +	current->in_user_fault = 1;
>  }
>
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
> -	WARN_ON(!current->memcg_may_oom);
> -	current->memcg_may_oom = 0;
> +	WARN_ON(!current->in_user_fault);
> +	current->in_user_fault = 0;
>  }
>
>  static inline bool task_in_memcg_oom(struct task_struct *p)
> @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
>  }
>
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
>  }
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 87bf02d93a27..34cc95b751cd 100644
> --- a/include/linux/sched.h
> +++
[PATCH] memcg, oom: move out_of_memory back to the charge path
From: Michal Hocko 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") has changed the ENOMEM semantic of memcg charges. Rather than invoking the oom killer from the charging context it delays the oom killer to the page fault path (pagefault_out_of_memory). This in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg hits the hard limit and the memcg is OOM. This behavior is inconsistent with the !memcg case where the oom killer is invoked from the allocation context and the allocator keeps retrying until it succeeds. The difference in the behavior is user visible. mmap(MAP_POPULATE) might result in not fully populated ranges while the mmap return code doesn't tell that to userspace. Random syscalls might fail with ENOMEM etc. The primary motivation of the different memcg oom semantic was deadlock avoidance. Things have changed since then, though. We have an async oom teardown by the oom reaper now and so we do not have to rely on the victim to tear down its memory anymore. Therefore we can return to the original semantic as long as the memcg oom killer is not handed over to user space. There is still one thing to be careful about here though. If the oom killer is not able to make any forward progress - e.g. because there is no eligible task to kill - then we have to bail out of the charge path to prevent the same class of deadlocks. We have basically two options here. Either we fail the charge with ENOMEM or force the charge and allow overcharge. The first option has been considered more harmful than useful because rare inconsistencies in the ENOMEM behavior are hard to test for and error prone. Basically the same reason why the page allocator doesn't fail allocations under such conditions. The latter might allow runaways but those should be really unlikely unless somebody misconfigures the system. E.g. 
allowing to migrate tasks away from the memcg to a different unlimited memcg with move_charge_at_immigrate disabled. Changes since rfc v1 - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is now much clearer what the purpose of the flag is - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes - make oom_kill_disable an exceptional case because it should be rare and the normal oom handling the core of the function - per Johannes Signed-off-by: Michal Hocko --- Hi, I've posted this as an RFC previously [1]. There was no fundamental disagreement so I've integrated all the suggested changes and tested it. mmap(MAP_POPULATE) hits the oom killer again rather than silently failing to populate the mapping on the hard limit excess. On the other hand g-u-p and other charge paths keep the ENOMEM semantic when the memcg oom killer is disabled. All the forward progress guarantee relies on the oom reaper. Unless there are objections I think this is ready to go to mmotm and ready for the next merge window. [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org include/linux/memcontrol.h | 16 include/linux/sched.h | 2 +- mm/memcontrol.c | 75 ++ mm/memory.c | 4 +- 4 files changed, 71 insertions(+), 26 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c6fb116e925..5a69bb4026f6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg); void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); -static inline void mem_cgroup_oom_enable(void) +static inline void mem_cgroup_enter_user_fault(void) { - WARN_ON(current->memcg_may_oom); - current->memcg_may_oom = 1; + WARN_ON(current->in_user_fault); + current->in_user_fault = 1; } -static inline void mem_cgroup_oom_disable(void) +static inline void mem_cgroup_exit_user_fault(void) { - 
WARN_ON(!current->memcg_may_oom); - current->memcg_may_oom = 0; + WARN_ON(!current->in_user_fault); + current->in_user_fault = 0; } static inline bool task_in_memcg_oom(struct task_struct *p) @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void) { } -static inline void mem_cgroup_oom_enable(void) +static inline void mem_cgroup_enter_user_fault(void) { } -static inline void mem_cgroup_oom_disable(void) +static inline void mem_cgroup_exit_user_fault(void) { } diff --git a/include/linux/sched.h b/include/linux/sched.h index 87bf02d93a27..34cc95b751cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -722,7 +722,7 @@ struct task_struct { unsigned restore_sigmask:1; #endif #ifdef CONFIG_MEMCG - unsigned memcg_may_oom:1; + unsigned in_user_fault:1;
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Thu 21-06-18 10:37:51, Johannes Weiner wrote: > On Thu, Jun 21, 2018 at 10:09:27AM +0200, Michal Hocko wrote: > > @@ -496,14 +496,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup > > *memcg, > > > > static inline void mem_cgroup_oom_enable(void) > > { > > - WARN_ON(current->memcg_may_oom); > > - current->memcg_may_oom = 1; > > + WARN_ON(current->in_user_fault); > > + current->in_user_fault = 1; > > } > > > > static inline void mem_cgroup_oom_disable(void) > > { > > - WARN_ON(!current->memcg_may_oom); > > - current->memcg_may_oom = 0; > > + WARN_ON(!current->in_user_fault); > > + current->in_user_fault = 0; > > } > > Would it make more sense to rename these to > mem_cgroup_enter_user_fault(), mem_cgroup_exit_user_fault()? OK, makes sense. It is less explicit about the oom behavior... > Other than that, this looks great to me. Thanks for the review! I will wait few days for other feedback and retest and repost then. -- Michal Hocko SUSE Labs
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Thu, Jun 21, 2018 at 10:09:27AM +0200, Michal Hocko wrote: > @@ -496,14 +496,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, > > static inline void mem_cgroup_oom_enable(void) > { > - WARN_ON(current->memcg_may_oom); > - current->memcg_may_oom = 1; > + WARN_ON(current->in_user_fault); > + current->in_user_fault = 1; > } > > static inline void mem_cgroup_oom_disable(void) > { > - WARN_ON(!current->memcg_may_oom); > - current->memcg_may_oom = 0; > + WARN_ON(!current->in_user_fault); > + current->in_user_fault = 0; > } Would it make more sense to rename these to mem_cgroup_enter_user_fault(), mem_cgroup_exit_user_fault()? Other than that, this looks great to me. Thanks
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
This is an updated version with feedback from Johannes integrated. Still not runtime tested but I am posting it to make further review easier. >From ed2796dc3894f93ddf0fc9ec74b83c58abc2b4ff Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 20 Jun 2018 10:25:10 +0200 Subject: [PATCH] memcg, oom: move out_of_memory back to the charge path 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") has changed the ENOMEM semantic of memcg charges. Rather than invoking the oom killer from the charging context it delays the oom killer to the page fault path (pagefault_out_of_memory). This in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg hits the hard limit and the memcg is OOM. This behavior is inconsistent with the !memcg case where the oom killer is invoked from the allocation context and the allocator keeps retrying until it succeeds. The difference in the behavior is user visible. mmap(MAP_POPULATE) might result in not fully populated ranges while the mmap return code doesn't tell that to userspace. Random syscalls might fail with ENOMEM etc. The primary motivation of the different memcg oom semantic was deadlock avoidance. Things have changed since then, though. We have an async oom teardown by the oom reaper now and so we do not have to rely on the victim to tear down its memory anymore. Therefore we can return to the original semantic as long as the memcg oom killer is not handed over to user space. There is still one thing to be careful about here though. If the oom killer is not able to make any forward progress - e.g. because there is no eligible task to kill - then we have to bail out of the charge path to prevent the same class of deadlocks. We have basically two options here. Either we fail the charge with ENOMEM or force the charge and allow overcharge. 
The first option has been considered more harmful than useful because rare inconsistencies in the ENOMEM behavior are hard to test for and error prone. Basically the same reason why the page allocator doesn't fail allocations under such conditions. The latter might allow runaways but those should be really unlikely unless somebody misconfigures the system. E.g. allowing to migrate tasks away from the memcg to a different unlimited memcg with move_charge_at_immigrate disabled. Changes since rfc v1 - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is now much clearer what the purpose of the flag is - make oom_kill_disable an exceptional case because it should be rare and the normal oom handling the core of the function - per Johannes Signed-off-by: Michal Hocko --- include/linux/memcontrol.h | 8 ++-- include/linux/sched.h | 2 +- mm/memcontrol.c | 75 ++ 3 files changed, 65 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c6fb116e925..8753bc313ef6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -496,14 +496,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, static inline void mem_cgroup_oom_enable(void) { - WARN_ON(current->memcg_may_oom); - current->memcg_may_oom = 1; + WARN_ON(current->in_user_fault); + current->in_user_fault = 1; } static inline void mem_cgroup_oom_disable(void) { - WARN_ON(!current->memcg_may_oom); - current->memcg_may_oom = 0; + WARN_ON(!current->in_user_fault); + current->in_user_fault = 0; } static inline bool task_in_memcg_oom(struct task_struct *p) diff --git a/include/linux/sched.h b/include/linux/sched.h index 87bf02d93a27..34cc95b751cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -722,7 +722,7 @@ struct task_struct { unsigned restore_sigmask:1; #endif #ifdef CONFIG_MEMCG - unsigned memcg_may_oom:1; + unsigned in_user_fault:1; #ifndef CONFIG_SLOB unsigned memcg_kmem_skip_account:1; #endif diff --git a/mm/memcontrol.c 
b/mm/memcontrol.c index e6f0d5ef320a..cff6c75137c1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1483,28 +1483,53 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); } -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) +enum oom_status { + OOM_SUCCESS, + OOM_FAILED, + OOM_ASYNC, + OOM_SKIPPED +}; + +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER) - return; + if (order > PAGE_ALLOC_COSTLY_ORDER) + return OOM_SKIPPED; + /* * We are in the middle of the charge context here, so we * don't want to
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed 20-06-18 15:38:36, Johannes Weiner wrote: > On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote: > > * Please note that mem_cgroup_oom_synchronize might fail to find a > > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > * we would fall back to the global oom killer in > > pagefault_out_of_memory > > I can't quite figure out what this paragraph is trying to > say. "oom_synchronize might fail [...] and we have to rely on > oom_synchronize". Hm? heh, vim autocompletion + a stale comment from the previous implementation which ENOMEM on the fail path. I went with * Please note that mem_cgroup_out_of_memory might fail to find a * victim and then we have to bail out from the charge path. -- Michal Hocko SUSE Labs
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote: >* Please note that mem_cgroup_oom_synchronize might fail to find a >* victim and then we have rely on mem_cgroup_oom_synchronize otherwise >* we would fall back to the global oom killer in > pagefault_out_of_memory I can't quite figure out what this paragraph is trying to say. "oom_synchronize might fail [...] and we have to rely on oom_synchronize". Hm?
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote: > This? > if (order > PAGE_ALLOC_COSTLY_ORDER) > return OOM_SKIPPED; > > /* >* We are in the middle of the charge context here, so we >* don't want to block when potentially sitting on a callstack >* that holds all kinds of filesystem and mm locks. >* >* cgroup1 allows disabling the OOM killer and waiting for outside >* handling until the charge can succeed; remember the context and put >* the task to sleep at the end of the page fault when all locks are >* released. >* >* On the other hand, in-kernel OOM killer allows for an async victim >* memory reclaim (oom_reaper) and that means that we are not solely >* relying on the oom victim to make a forward progress and we can >* invoke the oom killer here. >* >* Please note that mem_cgroup_oom_synchronize might fail to find a >* victim and then we have rely on mem_cgroup_oom_synchronize otherwise >* we would fall back to the global oom killer in > pagefault_out_of_memory >*/ > if (memcg->oom_kill_disable) { > if (!current->memcg_may_oom) > return OOM_SKIPPED; > css_get(&memcg->css); > current->memcg_in_oom = memcg; > current->memcg_oom_gfp_mask = mask; > current->memcg_oom_order = order; > > return OOM_ASYNC; > } > > if (mem_cgroup_out_of_memory(memcg, mask, order)) > return OOM_SUCCESS; > > WARN(!current->memcg_may_oom, > "Memory cgroup charge failed because of no reclaimable > memory! " > "This looks like a misconfiguration or a kernel bug."); > return OOM_FAILED; Yep, this looks good IMO.
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed 20-06-18 17:31:48, Michal Hocko wrote: > On Wed 20-06-18 11:18:12, Johannes Weiner wrote: [...] > > 1) Why warn for kernel allocations, but not userspace ones? This > > should have a comment at least. > > I am not sure I understand. We do warn for all allocations types of > mem_cgroup_out_of_memory fails as long as we are not in a legacy - > oom_disabled case. OK, I can see it now. It wasn't in the quoted context and I just forgot that WARN(!current->memcg_may_oom, ...). Well, I do not remember why I've made it conditional and you are right it doesn't make any sense. Probably a different code flow back then. Updated to warn regardless of memcg_may_oom. -- Michal Hocko SUSE Labs
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed 20-06-18 11:18:12, Johannes Weiner wrote: > On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote: [...] > > -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > +enum oom_status { > > + OOM_SUCCESS, > > + OOM_FAILED, > > + OOM_ASYNC, > > + OOM_SKIPPED > > Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;) sure, I will go with later. > We're not distinguishing ASYNC and SKIPPED anywhere below, but I > cannot think of a good name to communicate them both without this > function making assumptions about the charge function's behavior. > So it's a bit weird, but probably the best way to go. Yeah, that was what I was fighting with. My original proposal which simply ENOMEM in the failure case was a simple bool but once we have those different sates and failure behavior I think it is better to comunicate that and let the caller do whatever it finds reasonable. > > > +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t > > mask, int order) > > { > > - if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER) > > - return; > > + if (order > PAGE_ALLOC_COSTLY_ORDER) > > + return OOM_SKIPPED; > > /* > > * We are in the middle of the charge context here, so we > > * don't want to block when potentially sitting on a callstack > > * that holds all kinds of filesystem and mm locks. > > * > > -* Also, the caller may handle a failed allocation gracefully > > -* (like optional page cache readahead) and so an OOM killer > > -* invocation might not even be necessary. > > +* cgroup1 allows disabling the OOM killer and waiting for outside > > +* handling until the charge can succeed; remember the context and put > > +* the task to sleep at the end of the page fault when all locks are > > +* released. 
> > +* > > +* On the other hand, in-kernel OOM killer allows for an async victim > > +* memory reclaim (oom_reaper) and that means that we are not solely > > +* relying on the oom victim to make a forward progress and we can > > +* invoke the oom killer here. > > * > > -* That's why we don't do anything here except remember the > > -* OOM context and then deal with it at the end of the page > > -* fault when the stack is unwound, the locks are released, > > -* and when we know whether the fault was overall successful. > > +* Please note that mem_cgroup_oom_synchronize might fail to find a > > +* victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > +* we would fall back to the global oom killer in > > pagefault_out_of_memory > > */ > > + if (!memcg->oom_kill_disable) { > > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > > + return OOM_SUCCESS; > > + > > + WARN(!current->memcg_may_oom, > > + "Memory cgroup charge failed because of no > > reclaimable memory! " > > + "This looks like a misconfiguration or a kernel > > bug."); > > + return OOM_FAILED; > > + } > > + > > + if (!current->memcg_may_oom) > > + return OOM_SKIPPED; > > memcg_may_oom was introduced to distinguish between userspace faults > that can OOM and contexts that return -ENOMEM. Now we're using it > slightly differently and it's confusing. > > 1) Why warn for kernel allocations, but not userspace ones? This > should have a comment at least. I am not sure I understand. We do warn for all allocations types of mem_cgroup_out_of_memory fails as long as we are not in a legacy - oom_disabled case. > 2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill > in either case, but only set up the mem_cgroup_oom_synchronize() for > userspace faults. So the code makes sense, but a better name would be > in order -- current->in_user_fault? in_user_fault is definitely better than memcg_may_oom. 
> > css_get(&memcg->css); > > current->memcg_in_oom = memcg; > > current->memcg_oom_gfp_mask = mask; > > current->memcg_oom_order = order; > > + > > + return OOM_ASYNC; > In terms of code flow, it would be much clearer to handle the > memcg->oom_kill_disable case first, as a special case with early > return, and make the OOM invocation the main code of this function, > given its name. This? if (order > PAGE_ALLOC_COSTLY_ORDER) return OOM_SKIPPED; /* * We are in the middle of the charge context here, so we * don't want to block when potentially sitting on a callstack * that holds all kinds of filesystem and mm locks. * * cgroup1 allows disabling the OOM killer and waiting for outside * handling until the charge can succeed; remember the context and put * the task to sleep at the end of the page fault when all locks are * released. * * On the other hand, in-kernel OOM killer allows for
Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote: > From: Michal Hocko > > 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") > has changed the ENOMEM semantic of memcg charges. Rather than invoking > the oom killer from the charging context it delays the oom killer to the > page fault path (pagefault_out_of_memory). This in turn means that many > users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg > hits the hard limit and the memcg is OOM. This behavior is > inconsistent with the !memcg case where the oom killer is invoked from the > allocation context and the allocator keeps retrying until it succeeds. > > The difference in the behavior is user visible. mmap(MAP_POPULATE) might > result in not fully populated ranges while the mmap return code doesn't > tell that to the userspace. Random syscalls might fail with ENOMEM etc. > > The primary motivation of the different memcg oom semantic was the > deadlock avoidance. Things have changed since then, though. We have > an async oom teardown by the oom reaper now and so we do not have to > rely on the victim to tear down its memory anymore. Therefore we can > return to the original semantic as long as the memcg oom killer is not > handed over to user space. > > There is still one thing to be careful about here though. If the oom > killer is not able to make any forward progress - e.g. because there is > no eligible task to kill - then we have to bail out of the charge path > to prevent the same class of deadlocks. We have basically two options > here. Either we fail the charge with ENOMEM or force the charge and > allow overcharge. The first option has been considered more harmful than > useful because rare inconsistencies in the ENOMEM behavior are hard to > test for and error prone. Basically the same reason why the page > allocator doesn't fail allocations under such conditions. 
The latter > might allow runaways but those should be really unlikely unless somebody > misconfigures the system. E.g. allowing to migrate tasks away from the > memcg to a different unlimited memcg with move_charge_at_immigrate > disabled. This is more straight-forward than I thought it would be. I have no objections to this going forward, just a couple of minor notes. > @@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup > *memcg) > __wake_up(&memcg->oom_waitq, TASK_NORMAL, 0, memcg); > } > > -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > +enum oom_status { > + OOM_SUCCESS, > + OOM_FAILED, > + OOM_ASYNC, > + OOM_SKIPPED Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;) We're not distinguishing ASYNC and SKIPPED anywhere below, but I cannot think of a good name to communicate them both without this function making assumptions about the charge function's behavior. So it's a bit weird, but probably the best way to go. > +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, > int order) > { > - if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER) > - return; > + if (order > PAGE_ALLOC_COSTLY_ORDER) > + return OOM_SKIPPED; > /* >* We are in the middle of the charge context here, so we >* don't want to block when potentially sitting on a callstack >* that holds all kinds of filesystem and mm locks. >* > - * Also, the caller may handle a failed allocation gracefully > - * (like optional page cache readahead) and so an OOM killer > - * invocation might not even be necessary. > + * cgroup1 allows disabling the OOM killer and waiting for outside > + * handling until the charge can succeed; remember the context and put > + * the task to sleep at the end of the page fault when all locks are > + * released. 
> + * > > + * On the other hand, in-kernel OOM killer allows for an async victim > + * memory reclaim (oom_reaper) and that means that we are not solely > + * relying on the oom victim to make a forward progress and we can > + * invoke the oom killer here. >* > - * That's why we don't do anything here except remember the > - * OOM context and then deal with it at the end of the page > - * fault when the stack is unwound, the locks are released, > - * and when we know whether the fault was overall successful. > + * Please note that mem_cgroup_out_of_memory might fail to find a > + * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise > + * we would fall back to the global oom killer in > pagefault_out_of_memory >*/ > + if (!memcg->oom_kill_disable) { > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > + return OOM_SUCCESS; > + > + WARN(!current->memcg_may_oom, > + "Memory cgroup charge failed because of no > reclaimable memory! " >
[RFC PATCH] memcg, oom: move out_of_memory back to the charge path
From: Michal Hocko 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") has changed the ENOMEM semantic of memcg charges. Rather than invoking the oom killer from the charging context it delays the oom killer to the page fault path (pagefault_out_of_memory). This in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg hits the hard limit and the memcg is OOM. This behavior is inconsistent with the !memcg case where the oom killer is invoked from the allocation context and the allocator keeps retrying until it succeeds. The difference in the behavior is user visible. mmap(MAP_POPULATE) might result in not fully populated ranges while the mmap return code doesn't tell that to the userspace. Random syscalls might fail with ENOMEM etc. The primary motivation of the different memcg oom semantic was the deadlock avoidance. Things have changed since then, though. We have an async oom teardown by the oom reaper now and so we do not have to rely on the victim to tear down its memory anymore. Therefore we can return to the original semantic as long as the memcg oom killer is not handed over to user space. There is still one thing to be careful about here though. If the oom killer is not able to make any forward progress - e.g. because there is no eligible task to kill - then we have to bail out of the charge path to prevent the same class of deadlocks. We have basically two options here. Either we fail the charge with ENOMEM or force the charge and allow overcharge. The first option has been considered more harmful than useful because rare inconsistencies in the ENOMEM behavior are hard to test for and error prone. Basically the same reason why the page allocator doesn't fail allocations under such conditions. The latter might allow runaways but those should be really unlikely unless somebody misconfigures the system. E.g. 
allowing to migrate tasks away from the memcg to a different unlimited memcg with move_charge_at_immigrate disabled. Signed-off-by: Michal Hocko --- Hi, we have discussed this at LSFMM this year and my recollection is that we have agreed that we should do this. So I am sending the patch as an RFC. Please note I have only forward ported the patch without any testing yet. I would like to see a general agreement before I spend more time on this. Thoughts? Objections? mm/memcontrol.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 56 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e6f0d5ef320a..7fe3ce1fd625 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) __wake_up(&memcg->oom_waitq, TASK_NORMAL, 0, memcg); } -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) +enum oom_status { + OOM_SUCCESS, + OOM_FAILED, + OOM_ASYNC, + OOM_SKIPPED +}; + +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER) - return; + if (order > PAGE_ALLOC_COSTLY_ORDER) + return OOM_SKIPPED; /* * We are in the middle of the charge context here, so we * don't want to block when potentially sitting on a callstack * that holds all kinds of filesystem and mm locks. * -* Also, the caller may handle a failed allocation gracefully -* (like optional page cache readahead) and so an OOM killer -* invocation might not even be necessary. +* cgroup1 allows disabling the OOM killer and waiting for outside +* handling until the charge can succeed; remember the context and put +* the task to sleep at the end of the page fault when all locks are +* released. +* +* On the other hand, in-kernel OOM killer allows for an async victim +* memory reclaim (oom_reaper) and that means that we are not solely +* relying on the oom victim to make a forward progress and we can +* invoke the oom killer here. 
* -* That's why we don't do anything here except remember the -* OOM context and then deal with it at the end of the page -* fault when the stack is unwound, the locks are released, -* and when we know whether the fault was overall successful. +* Please note that mem_cgroup_out_of_memory might fail to find a +* victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise +* we would fall back to the global oom killer in pagefault_out_of_memory */ + if (!memcg->oom_kill_disable) { + if (mem_cgroup_out_of_memory(memcg, mask, order)) + return OOM_SUCCESS; + + WARN(!current->memcg_may_oom, + "Memory cgroup charge failed because of no