Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Wed 13-06-18 22:20:49, Tetsuo Handa wrote:
> On 2018/06/05 17:57, Michal Hocko wrote:
> >> For this reason, we see testing harnesses often oom killed immediately
> >> after running a unittest that stresses reclaim or compaction by inducing a
> >> system-wide oom condition. The harness spawns the unittest which spawns
> >> an antagonist memory hog that is intended to be oom killed. When memory
> >> is mlocked or there are a large number of threads faulting memory for the
> >> antagonist, the unittest and the harness itself get oom killed because the
> >> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
> >> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
> >> doing mmap() so the oom reaper can't grab the sem quickly enough.
> >
> > How come the writer doesn't back off? mmap paths should be taking an
> > exclusive mmap sem in killable sleep, so it should back off. Or is the
> > holder of the lock deep inside the mmap path doing something else and not
> > backing out with the exclusive lock held?
>
> Here is an example where the writer doesn't back off.
>
> http://lkml.kernel.org/r/20180607150546.1c7db21f70221008e14b8...@linux-foundation.org
>
> down_write_killable(&mm->mmap_sem) does nothing but increase the possibility
> of successfully backing off. There is no guarantee that the owner of that
> exclusive mmap sem will not be blocked by other unkillable waits.

But we are talking about the mmap() path here. Sure, there are other paths
which might need to back off while the lock is held, and those should be
addressed where possible, but that is not really related to what David wrote
above, which is what I was trying to understand.
-- 
Michal Hocko
SUSE Labs
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On 2018/06/05 17:57, Michal Hocko wrote:
>> For this reason, we see testing harnesses often oom killed immediately
>> after running a unittest that stresses reclaim or compaction by inducing a
>> system-wide oom condition. The harness spawns the unittest which spawns
>> an antagonist memory hog that is intended to be oom killed. When memory
>> is mlocked or there are a large number of threads faulting memory for the
>> antagonist, the unittest and the harness itself get oom killed because the
>> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
>> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
>> doing mmap() so the oom reaper can't grab the sem quickly enough.
>
> How come the writer doesn't back off? mmap paths should be taking an
> exclusive mmap sem in killable sleep, so it should back off. Or is the
> holder of the lock deep inside the mmap path doing something else and not
> backing out with the exclusive lock held?

Here is an example where the writer doesn't back off.

http://lkml.kernel.org/r/20180607150546.1c7db21f70221008e14b8...@linux-foundation.org

down_write_killable(&mm->mmap_sem) does nothing but increase the possibility
of successfully backing off. There is no guarantee that the owner of that
exclusive mmap sem will not be blocked by other unkillable waits.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Mon 04-06-18 21:25:39, David Rientjes wrote:
> On Fri, 1 Jun 2018, Michal Hocko wrote:
> > > We've discussed the mm
> > > having a single blockable mmu notifier. Regardless of how we arrive at
> > > the point where the oom reaper can't free memory, which could be any of
> > > those three cases, if (1) the original victim is sufficiently large that
> > > follow-up oom kills would become unnecessary and (2) other threads
> > > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> > >
> > > We have examples of cases where oom reaping was successful, but the rss
> > > numbers in the kernel log are very similar to when it was oom killed and
> > > the process is known not to mlock, the reason is because the oom reaper
> > > could free very little memory due to blockable mmu notifiers.
> >
> > Please be more specific. Which notifiers were these? Blockable notifiers
> > are a PITA and we should be addressing them. That requires identifying
> > them first.
>
> The most common offender seems to be ib_umem_notifier, but I have also
> heard of possible occurrences of mv_invl_range_start() for xen, though that
> would need more investigation. The rather new invalidate_range callback
> for hmm mirroring could also be problematic. Any mmu_notifier without
> MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.

Yes, this is unfortunate; it was meant as a stop-gap quick fix with a
long-term vision of fixing it properly. I am pretty sure that we can do
much better here. Teach mmu_notifier_invalidate_range_start to take a
non-block flag and back out on ranges that would block. I am pretty sure
that notifiers can be targeted quite a lot, so we could still process at
least some vmas.

> For this reason, we see testing harnesses often oom killed immediately
> after running a unittest that stresses reclaim or compaction by inducing a
> system-wide oom condition. The harness spawns the unittest which spawns
> an antagonist memory hog that is intended to be oom killed. When memory
> is mlocked or there are a large number of threads faulting memory for the
> antagonist, the unittest and the harness itself get oom killed because the
> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
> doing mmap() so the oom reaper can't grab the sem quickly enough.

How come the writer doesn't back off? mmap paths should be taking an
exclusive mmap sem in killable sleep, so it should back off. Or is the
holder of the lock deep inside the mmap path doing something else and not
backing out with the exclusive lock held?

[...]

> > As I've already said, I will nack any timeout based solution until we
> > address all the particular problems and still see more to come. Here we
> > have a clear goal: address mlocked pages and identify mmu notifier
> > offenders.
>
> I cannot fix all mmu notifiers to not block, I can't fix the configuration
> to allow direct compaction for thp allocations and a large number of
> concurrent faulters, and I cannot fix userspace mlocking a lot of memory.
> It's worthwhile to work in that direction, but it will never be 100%
> possible to avoid. We must have a solution that prevents innocent
> processes from consistently being oom killed completely unnecessarily.

None of the above has been attempted and shown not worth doing. An OOM
event should be a rare thing to happen, so I absolutely do not see any
reason to rush a misdesigned fix in right now.
-- 
Michal Hocko
SUSE Labs
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Fri, 1 Jun 2018, Michal Hocko wrote:
> > We've discussed the mm
> > having a single blockable mmu notifier. Regardless of how we arrive at
> > the point where the oom reaper can't free memory, which could be any of
> > those three cases, if (1) the original victim is sufficiently large that
> > follow-up oom kills would become unnecessary and (2) other threads
> > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> >
> > We have examples of cases where oom reaping was successful, but the rss
> > numbers in the kernel log are very similar to when it was oom killed and
> > the process is known not to mlock, the reason is because the oom reaper
> > could free very little memory due to blockable mmu notifiers.
>
> Please be more specific. Which notifiers were these? Blockable notifiers
> are a PITA and we should be addressing them. That requires identifying
> them first.

The most common offender seems to be ib_umem_notifier, but I have also
heard of possible occurrences of mv_invl_range_start() for xen, though that
would need more investigation. The rather new invalidate_range callback
for hmm mirroring could also be problematic. Any mmu_notifier without
MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.

For this reason, we see testing harnesses often oom killed immediately
after running a unittest that stresses reclaim or compaction by inducing a
system-wide oom condition. The harness spawns the unittest which spawns
an antagonist memory hog that is intended to be oom killed. When memory
is mlocked or there are a large number of threads faulting memory for the
antagonist, the unittest and the harness itself get oom killed because the
oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
The memory hog has mm->mmap_sem readers queued ahead of a writer that is
doing mmap() so the oom reaper can't grab the sem quickly enough.

I agree that blockable mmu notifiers are a pain, but until such time as all
can implicitly be MMU_INVALIDATE_DOES_NOT_BLOCK, the oom reaper can free
all mlocked memory, and the oom reaper waits long enough to grab
mm->mmap_sem for stalled mm->mmap_sem readers, we need a solution that
won't oom kill everything running on the system. I have doubts we'll ever
reach a point where the oom reaper can do the equivalent of exit_mmap(),
but it's possible to help solve the immediate issue of oom kills killing
many innocent processes while working in a direction that makes oom reaping
more successful at freeing memory.

> > The current implementation is a timeout based solution for mmap_sem, it
> > just has the oom reaper spinning trying to grab the sem and eventually
> > gives up. This patch allows it to concurrently work on other mm's and
> > detects the timeout in a different way, with jiffies instead of an
> > iterator.
>
> And I argue that anything timeout based is just broken by design. Trying
> n times will at least give you a consistent behavior.

It's not consistent: we see wildly inconsistent results, especially on
power, because it depends on the number of queued readers of mm->mmap_sem
ahead of a writer until such time that a thread doing mmap() can grab it,
drop it, and allow the oom reaper to grab it for read. It's so inconsistent
that we can see the oom reaper successfully grab the sem for an oom killed
memory hog with 128 faulting threads, and see it fail with 4 faulting
threads.

> Retrying on mmap
> sem makes sense because the lock might be taken for a short time.

It isn't a function of how long mmap_sem is held for write; it's a function
of how many readers are ahead of the queued writer. We don't run with thp
defrag set to "always" under standard configurations, but users of
MADV_HUGEPAGE or configs where defrag is set to "always" can consistently
cause any number of additional processes to be oom killed unnecessarily
because the readers are performing compaction and the writer is queued
behind them.

> > I'd love a solution where we can reliably detect an oom livelock and oom
> > kill additional processes, but only after the original victim has had a
> > chance to do exit_mmap() without a timeout, but I don't see one being
> > offered. Given that Tetsuo has seen issues with this in the past and
> > suggested a similar proposal, we are not the only ones feeling pain from
> > this.
>
> Tetsuo is doing an artificial stress test which doesn't resemble any
> reasonable workload.

Tetsuo's test cases caught the CVE on powerpc which could trivially panic
the system if configured to panic on any oops, and required a security fix
because it made it easy for any user doing a large mlock. His test case
here is trivial to reproduce on powerpc and causes several additional
processes to be oom killed. It's not artificial; I see many test harnesses
killed *nightly* because a memory hog is faulting with a large number of
threads and two or three other threads are
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Thu 31-05-18 14:16:34, David Rientjes wrote:
> On Thu, 31 May 2018, Michal Hocko wrote:
> > > It's not a random timeout, it's sufficiently long such that we don't
> > > oom kill several processes needlessly in the very rare case where oom
> > > livelock would actually prevent the original victim from exiting. The
> > > oom reaper processing an mm, finding everything to be mlocked, and
> > > immediately setting MMF_OOM_SKIP is inappropriate. This is rather
> > > trivial to reproduce for a large memory hogging process that mlocks
> > > all of its memory; we consistently see spurious and unnecessary oom
> > > kills simply because the oom reaper has set MMF_OOM_SKIP very early.
> >
> > It takes quite some additional steps for an admin to allow a large
> > amount of mlocked memory, and such an application should be really
> > careful not to consume too much memory. So how come this is something
> > you see so consistently? Is this some sort of bug or an unfortunate
> > workload side effect? I am asking this because I really want to see how
> > relevant this really is.
>
> The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after
> the victim has been chosen for oom kill and we get follow-up oom kills,
> not that the process is able to mlock a large amount of memory. Mlock
> here is only being discussed as a single example. Tetsuo has brought up
> the example of all shared file-backed memory.

How is such a case even possible? File backed memory is reclaimable and as
such should be gone by the time we hit the OOM killer. If that is not the
case, then I fail to see how waiting slightly longer helps anything.

> We've discussed the mm
> having a single blockable mmu notifier. Regardless of how we arrive at
> the point where the oom reaper can't free memory, which could be any of
> those three cases, if (1) the original victim is sufficiently large that
> follow-up oom kills would become unnecessary and (2) other threads
> allocate/charge before the oom victim reaches exit_mmap(), this occurs.
>
> We have examples of cases where oom reaping was successful, but the rss
> numbers in the kernel log are very similar to when it was oom killed and
> the process is known not to mlock, the reason is because the oom reaper
> could free very little memory due to blockable mmu notifiers.

Please be more specific. Which notifiers were these? Blockable notifiers
are a PITA and we should be addressing them. That requires identifying
them first.

> > But the waiting periods just turn out to be a really poor design. There
> > will be no good timeout that fits everybody. We can do better, and as
> > long as that is the case, a timeout based solution should really be
> > rejected. It is a shortcut that doesn't really solve the underlying
> > problem.
>
> The current implementation is a timeout based solution for mmap_sem, it
> just has the oom reaper spinning trying to grab the sem and eventually
> gives up. This patch allows it to concurrently work on other mm's and
> detects the timeout in a different way, with jiffies instead of an
> iterator.

And I argue that anything timeout based is just broken by design. Trying
n times will at least give you a consistent behavior. Retrying on the mmap
sem makes sense because the lock might be taken for a short time. Retrying
on memory the oom reaper doesn't reclaim is just pointless waiting for
somebody else to do the work. See the difference?

> I'd love a solution where we can reliably detect an oom livelock and oom
> kill additional processes, but only after the original victim has had a
> chance to do exit_mmap() without a timeout, but I don't see one being
> offered. Given that Tetsuo has seen issues with this in the past and
> suggested a similar proposal, we are not the only ones feeling pain from
> this.

Tetsuo is doing an artificial stress test which doesn't resemble any
reasonable workload. This is good for catching different corner cases, but
it is nothing even close to something to base a design on. I will
definitely nack any attempt to add a timeout based solution based on such
non-realistic tests. If we have realistic workloads, then let's try to
address them and resort to timeouts or other hacks only as the last option.

> > > I'm open to hearing any other suggestions that you have other than
> > > waiting some time period before MMF_OOM_SKIP gets set to solve this
> > > problem.
> >
> > I've already offered one. Make mlocked pages reapable.
>
> Making mlocked pages reapable would only solve the most trivial reproducer
> of this. Unless the oom reaper can guarantee that it will never block and
> can free all memory that exit_mmap() can free, we need to ensure that a
> victim has a chance to reach the exit path on its own before killing every
> other process on the system.
>
> I'll fix the issue I identified with doing list_add_tail() rather than
> list_add(), fix up the commit message per Tetsuo to identify the other
> possible ways this
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Thu 31-05-18 14:16:34, David Rientjes wrote: > On Thu, 31 May 2018, Michal Hocko wrote: > > > > It's not a random timeout, it's sufficiently long such that we don't oom > > > kill several processes needlessly in the very rare case where oom > > > livelock > > > would actually prevent the original victim from exiting. The oom reaper > > > processing an mm, finding everything to be mlocked, and immediately > > > MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a > > > large memory hogging process that mlocks all of its memory; we > > > consistently see spurious and unnecessary oom kills simply because the > > > oom > > > reaper has set MMF_OOM_SKIP very early. > > > > It takes quite some additional steps for admin to allow a large amount > > of mlocked memory and such an application should be really careful to > > not consume too much memory. So how come this is something you see that > > consistently? Is this some sort of bug or an unfortunate workload side > > effect? I am asking this because I really want to see how relevant this > > really is. > > > > The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after > the victim has been chosen for oom kill and we get follow-up oom kills, > not that the process is able to mlock a large amount of memory. Mlock > here is only being discussed as a single example. Tetsuo has brought up > the example of all shared file-backed memory. How is such a case even possible? File backed memory is reclaimable and as such should be gone by the time we hit the OOM killer. If that is not the case then I fail how wait slightly longer helps anything. > We've discussed the mm > having a single blockable mmu notifier. 
Regardless of how we arrive at > the point where the oom reaper can't free memory, which could be any of > those three cases, if (1) the original victim is sufficiently large that > follow-up oom kills would become unnecessary and (2) other threads > allocate/charge before the oom victim reaches exit_mmap(), this occurs. > > We have examples of cases where oom reaping was successful, but the rss > numbers in the kernel log are very similar to when it was oom killed and > the process is known not to mlock, the reason is because the oom reaper > could free very little memory due to blockable mmu notifiers. Please be more specific. Which notifiers these were. Blockable notifiers are a PITA and we should be addressing them. That requiers identifying them first. > > But the waiting periods just turn out to be a really poor design. There > > will be no good timeout to fit for everybody. We can do better and as > > long as this is the case the timeout based solution should be really > > rejected. It is a shortcut that doesn't really solve the underlying > > problem. > > > > The current implementation is a timeout based solution for mmap_sem, it > just has the oom reaper spinning trying to grab the sem and eventually > gives up. This patch allows it to currently work on other mm's and > detects the timeout in a different way, with jiffies instead of an > iterator. And I argue that anything timeout based is just broken by design. Trying n times will at least give you a consistent behavior. Retrying on mmap sem makes sense because the lock might be taken for a short time. Retrying on a memory oom reaper doesn't reclaim is just pointless waiting for somebody else doing the work. See the difference? > I'd love a solution where we can reliably detect an oom livelock and oom > kill additional processes but only after the original victim has had a > chance to do exit_mmap() without a timeout, but I don't see one being > offered. 
Given Tetsuo has seen issues with this in the past and suggested > a similar proposal means we are not the only ones feeling pain from this. Tetsuo is doing an artificial stress test which doesn't resemble any reasonable workload. This is good to catch different corner cases but nothing even close to base any design on. I will definitely nack any attempt to add a timeout based solution based on such non-realistic tests. If we have realistic workloads then try to address them and resort to any timeout or other hacks as the last option. > > > I'm open to hearing any other suggestions that you have other than > > > waiting > > > some time period before MMF_OOM_SKIP gets set to solve this problem. > > > > I've already offered one. Make mlocked pages reapable. > > Making mlocked pages reapable would only solve the most trivial reproducer > of this. Unless the oom reaper can guarantee that it will never block and > can free all memory that exit_mmap() can free, we need to ensure that a > victim has a chance to reach the exit path on its own before killing every > other process on the system. > > I'll fix the issue I identified with doing list_add_tail() rather than > list_add(), fix up the commit message per Tetsuo to identify the other > possible ways this can occur other than mlock, remove the rfc tag, and repost.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Thu, 31 May 2018, Michal Hocko wrote: > > It's not a random timeout, it's sufficiently long such that we don't oom > > kill several processes needlessly in the very rare case where oom livelock > > would actually prevent the original victim from exiting. The oom reaper > > processing an mm, finding everything to be mlocked, and immediately > > MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a > > large memory hogging process that mlocks all of its memory; we > > consistently see spurious and unnecessary oom kills simply because the oom > > reaper has set MMF_OOM_SKIP very early. > > It takes quite some additional steps for admin to allow a large amount > of mlocked memory and such an application should be really careful to > not consume too much memory. So how come this is something you see that > consistently? Is this some sort of bug or an unfortunate workload side > effect? I am asking this because I really want to see how relevant this > really is. > The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after the victim has been chosen for oom kill and we get follow-up oom kills, not that the process is able to mlock a large amount of memory. Mlock here is only being discussed as a single example. Tetsuo has brought up the example of all shared file-backed memory. We've discussed the mm having a single blockable mmu notifier. Regardless of how we arrive at the point where the oom reaper can't free memory, which could be any of those three cases, if (1) the original victim is sufficiently large that follow-up oom kills would become unnecessary and (2) other threads allocate/charge before the oom victim reaches exit_mmap(), this occurs. We have examples of cases where oom reaping was successful, but the rss numbers in the kernel log are very similar to when it was oom killed and the process is known not to mlock; the reason is that the oom reaper could free very little memory due to blockable mmu notifiers. 
> But the waiting periods just turn out to be a really poor design. There > will be no good timeout to fit for everybody. We can do better and as > long as this is the case the timeout based solution should be really > rejected. It is a shortcut that doesn't really solve the underlying > problem. > The current implementation is a timeout based solution for mmap_sem; it just has the oom reaper spinning trying to grab the sem and eventually gives up. This patch allows it to concurrently work on other mm's and detects the timeout in a different way, with jiffies instead of an iterator. I'd love a solution where we can reliably detect an oom livelock and oom kill additional processes but only after the original victim has had a chance to do exit_mmap() without a timeout, but I don't see one being offered. That Tetsuo has seen issues with this in the past and suggested a similar proposal means we are not the only ones feeling pain from this. > > I'm open to hearing any other suggestions that you have other than waiting > > some time period before MMF_OOM_SKIP gets set to solve this problem. > > I've already offered one. Make mlocked pages reapable. Making mlocked pages reapable would only solve the most trivial reproducer of this. Unless the oom reaper can guarantee that it will never block and can free all memory that exit_mmap() can free, we need to ensure that a victim has a chance to reach the exit path on its own before killing every other process on the system. I'll fix the issue I identified with doing list_add_tail() rather than list_add(), fix up the commit message per Tetsuo to identify the other possible ways this can occur other than mlock, remove the rfc tag, and repost.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Wed 30-05-18 14:06:51, David Rientjes wrote: > On Mon, 28 May 2018, Michal Hocko wrote: > > > > That's not sufficient since the oom reaper is also not able to oom reap > > > if > > > the mm has blockable mmu notifiers or all memory is shared filebacked > > > memory, so it immediately sets MMF_OOM_SKIP and additional processes are > > > oom killed. > > > > Could you be more specific with a real world example where that is the > > case? I mean the full address space of non-reclaimable file backed > > memory where waiting some more would help? Blockable mmu notifiers are > > a PITA for sure. I wish we could have a better way to deal with them. > > Maybe we can tell them we are in the non-blockable context and have them > > release as much as possible. Still something that a random timeout > > wouldn't help I am afraid. > > > > It's not a random timeout, it's sufficiently long such that we don't oom > kill several processes needlessly in the very rare case where oom livelock > would actually prevent the original victim from exiting. The oom reaper > processing an mm, finding everything to be mlocked, and immediately > MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a > large memory hogging process that mlocks all of its memory; we > consistently see spurious and unnecessary oom kills simply because the oom > reaper has set MMF_OOM_SKIP very early. It takes quite some additional steps for admin to allow a large amount of mlocked memory and such an application should be really careful to not consume too much memory. So how come this is something you see that consistently? Is this some sort of bug or an unfortunate workload side effect? I am asking this because I really want to see how relevant this really is. > This patch introduces a "give up" period such that the oom reaper is still > allowed to do its good work but only gives up in the hope the victim can > make forward progress at some substantial period of time in the future. 
I > would understand the objection if oom livelock where the victim cannot > make forward progress were commonplace, but in the interest of not killing > several processes needlessly every time a large mlocked process is > targeted, I think it compels a waiting period. But the waiting periods just turn out to be a really poor design. There will be no good timeout to fit for everybody. We can do better and as long as this is the case the timeout based solution should be really rejected. It is a shortcut that doesn't really solve the underlying problem. > > Trying to reap a different oom victim when the current one is not making > > progress during the lock contention is certainly something that make > > sense. It has been proposed in the past and we just gave it up because > > it was more complex. Do you have any specific example when this would > > help to justify the additional complexity? > > > > I'm not sure how you're defining complexity, the patch adds ~30 lines of > code and prevents processes from needlessly being oom killed when oom > reaping is largely unsuccessful and before the victim finishes > free_pgtables() and then also allows the oom reaper to operate on multiple > mm's instead of processing one at a time. Obviously if there is a delay > before MMF_OOM_SKIP is set it requires that the oom reaper be able to > process other mm's, otherwise we stall needlessly for 10s. Operating on > multiple mm's in a linked list while waiting for victims to exit during a > timeout period is thus very much needed, it wouldn't make sense without > it. It needs to keep track of the current retry state of the reaped victim and that is an additional complexity, isn't it? And I am asking how often we have to handle that. Please note that the primary objective here is to unclutter a locked up situation. The oom reaper doesn't prevent the victim from going away on its own while we keep retrying. So slow progress on the reaper side is not an issue IMHO. 
> > > But also note that even if oom reaping is possible, in the presence of an > > > antagonist that continues to allocate memory, that it is possible to oom > > > kill additional victims unnecessarily if we aren't able to complete > > > free_pgtables() in exit_mmap() of the original victim. > > > > If there is unbound source of allocations then we are screwed no matter > > what. We just hope that the allocator will get noticed by the oom killer > > and it will be stopped. > > > > It's not unbounded, it's just an allocator that acts as an antagonist. At > the risk of being overly verbose, for system or memcg oom conditions: a > large mlocked process is oom killed, other processes continue to > allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP > without being able to free any memory, and the other important processes > are needlessly oom killed before the original victim can reach > exit_mmap(). This happens a _lot_. > > I'm open to hearing any other suggestions that you have other than waiting > some time period before MMF_OOM_SKIP gets set to solve this problem.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Mon, 28 May 2018, Michal Hocko wrote: > > That's not sufficient since the oom reaper is also not able to oom reap if > > the mm has blockable mmu notifiers or all memory is shared filebacked > > memory, so it immediately sets MMF_OOM_SKIP and additional processes are > > oom killed. > > Could you be more specific with a real world example where that is the > case? I mean the full address space of non-reclaimable file backed > memory where waiting some more would help? Blockable mmu notifiers are > a PITA for sure. I wish we could have a better way to deal with them. > Maybe we can tell them we are in the non-blockable context and have them > release as much as possible. Still something that a random timeout > wouldn't help I am afraid. > It's not a random timeout, it's sufficiently long such that we don't oom kill several processes needlessly in the very rare case where oom livelock would actually prevent the original victim from exiting. The oom reaper processing an mm, finding everything to be mlocked, and immediately MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a large memory hogging process that mlocks all of its memory; we consistently see spurious and unnecessary oom kills simply because the oom reaper has set MMF_OOM_SKIP very early. This patch introduces a "give up" period such that the oom reaper is still allowed to do its good work but only gives up in the hope the victim can make forward progress at some substantial period of time in the future. I would understand the objection if oom livelock where the victim cannot make forward progress were commonplace, but in the interest of not killing several processes needlessly every time a large mlocked process is targeted, I think it compels a waiting period. > Trying to reap a different oom victim when the current one is not making > progress during the lock contention is certainly something that make > sense. 
It has been proposed in the past and we just gave it up because > it was more complex. Do you have any specific example when this would > help to justify the additional complexity? > I'm not sure how you're defining complexity, the patch adds ~30 lines of code and prevents processes from needlessly being oom killed when oom reaping is largely unsuccessful and before the victim finishes free_pgtables() and then also allows the oom reaper to operate on multiple mm's instead of processing one at a time. Obviously if there is a delay before MMF_OOM_SKIP is set it requires that the oom reaper be able to process other mm's, otherwise we stall needlessly for 10s. Operating on multiple mm's in a linked list while waiting for victims to exit during a timeout period is thus very much needed, it wouldn't make sense without it. > > But also note that even if oom reaping is possible, in the presence of an > > antagonist that continues to allocate memory, that it is possible to oom > > kill additional victims unnecessarily if we aren't able to complete > > free_pgtables() in exit_mmap() of the original victim. > > If there is unbound source of allocations then we are screwed no matter > what. We just hope that the allocator will get noticed by the oom killer > and it will be stopped. > It's not unbounded, it's just an allocator that acts as an antagonist. At the risk of being overly verbose, for system or memcg oom conditions: a large mlocked process is oom killed, other processes continue to allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP without being able to free any memory, and the other important processes are needlessly oom killed before the original victim can reach exit_mmap(). This happens a _lot_. I'm open to hearing any other suggestions that you have other than waiting some time period before MMF_OOM_SKIP gets set to solve this problem.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Fri 25-05-18 12:36:08, David Rientjes wrote: > On Fri, 25 May 2018, Michal Hocko wrote: > > > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > > it cannot reap an mm. This can happen for a variety of reasons, > > > including: > > > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > >to stall indefinitely, > > > > > > but we can also add a third when the oom reaper can "reap" an mm but doing > > > so is unlikely to free any amount of memory: > > > > > > - when the mm's memory is fully mlocked. > > > > > > When all memory is mlocked, the oom reaper will not be able to free any > > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > > > unmap and free its memory in exit_mmap() and subsequent oom victims are > > > chosen unnecessarily. This is trivial to reproduce if all eligible > > > processes on the system have mlocked their memory: the oom killer calls > > > panic() even though forward progress can be made. > > > > > > This is the same issue where the exit path sets MMF_OOM_SKIP before > > > unmapping memory and additional processes can be chosen unnecessarily > > > because the oom killer is racing with exit_mmap(). > > > > > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > > a true oom livelock in progress, it never gets set and no additional > > > killing is possible. > > > > > > To fix this, this patch introduces a per-mm reaping timeout, initially set > > > at 10s. It requires that the oom reaper's list becomes a properly linked > > > list so that other mm's may be reaped while waiting for an mm's timeout to > > > expire. > > > > No timeouts please! The proper way to handle this problem is to simply > > teach the oom reaper to handle mlocked areas. 
> > That's not sufficient since the oom reaper is also not able to oom reap if > the mm has blockable mmu notifiers or all memory is shared filebacked > memory, so it immediately sets MMF_OOM_SKIP and additional processes are > oom killed. Could you be more specific with a real world example where that is the case? I mean the full address space of non-reclaimable file backed memory where waiting some more would help? Blockable mmu notifiers are a PITA for sure. I wish we could have a better way to deal with them. Maybe we can tell them we are in the non-blockable context and have them release as much as possible. Still something that a random timeout wouldn't help I am afraid. > The current implementation that relies on MAX_OOM_REAP_RETRIES is acting > as a timeout already for mm->mmap_sem, but it's doing so without > attempting to oom reap other victims that may actually allow it to grab > mm->mmap_sem if the allocator is waiting on a lock. Trying to reap a different oom victim when the current one is not making progress during the lock contention is certainly something that makes sense. It has been proposed in the past and we just gave it up because it was more complex. Do you have any specific example when this would help to justify the additional complexity? > The solution, as proposed, is to allow the oom reaper to iterate over all > victims and try to free memory rather than working on each victim one by > one and giving up. > > But also note that even if oom reaping is possible, in the presence of an > antagonist that continues to allocate memory, that it is possible to oom > kill additional victims unnecessarily if we aren't able to complete > free_pgtables() in exit_mmap() of the original victim. If there is an unbounded source of allocations then we are screwed no matter what. We just hope that the allocator will get noticed by the oom killer and it will be stopped. That being said, I do not object to justified improvements in the oom reaping. 
But I absolutely detest some random timeouts and will nack implementations based on them until it is absolutely clear there is no other way around. -- Michal Hocko SUSE Labs
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Fri 25-05-18 12:36:08, David Rientjes wrote: > On Fri, 25 May 2018, Michal Hocko wrote: > > > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > > it cannot reap an mm. This can happen for a variety of reasons, > > > including: > > > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > >to stall indefinitely, > > > > > > but we can also add a third when the oom reaper can "reap" an mm but doing > > > so is unlikely to free any amount of memory: > > > > > > - when the mm's memory is fully mlocked. > > > > > > When all memory is mlocked, the oom reaper will not be able to free any > > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > > > unmap and free its memory in exit_mmap() and subsequent oom victims are > > > chosen unnecessarily. This is trivial to reproduce if all eligible > > > processes on the system have mlocked their memory: the oom killer calls > > > panic() even though forward progress can be made. > > > > > > This is the same issue where the exit path sets MMF_OOM_SKIP before > > > unmapping memory and additional processes can be chosen unnecessarily > > > because the oom killer is racing with exit_mmap(). > > > > > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > > a true oom livelock in progress, it never gets set and no additional > > > killing is possible. > > > > > > To fix this, this patch introduces a per-mm reaping timeout, initially set > > > at 10s. It requires that the oom reaper's list becomes a properly linked > > > list so that other mm's may be reaped while waiting for an mm's timeout to > > > expire. > > > > No timeouts please! The proper way to handle this problem is to simply > > teach the oom reaper to handle mlocked areas. 
> > That's not sufficient since the oom reaper is also not able to oom reap if > the mm has blockable mmu notifiers or all memory is shared filebacked > memory, so it immediately sets MMF_OOM_SKIP and additional processes are > oom killed. Could you be more specific with a real world example where that is the case? I mean the full address space of non-reclaimable file backed memory where waiting some more would help? Blockable mmu notifiers are a PITA for sure. I wish we could have a better way to deal with them. Maybe we can tell them we are in the non-blockable context and have them release as much as possible. Still, this is something that a random timeout wouldn't help with, I am afraid. > The current implementation that relies on MAX_OOM_REAP_RETRIES is acting > as a timeout already for mm->mmap_sem, but it's doing so without > attempting to oom reap other victims that may actually allow it to grab > mm->mmap_sem if the allocator is waiting on a lock. Trying to reap a different oom victim when the current one is not making progress during the lock contention is certainly something that makes sense. It has been proposed in the past and we just gave it up because it was more complex. Do you have any specific example where this would help to justify the additional complexity? > The solution, as proposed, is to allow the oom reaper to iterate over all > victims and try to free memory rather than working on each victim one by > one and giving up. > > But also note that even if oom reaping is possible, in the presence of an > antagonist that continues to allocate memory, that it is possible to oom > kill additional victims unnecessarily if we aren't able to complete > free_pgtables() in exit_mmap() of the original victim. If there is an unbound source of allocations then we are screwed no matter what. We just hope that the allocator will get noticed by the oom killer and it will be stopped. That being said, I do not object to justified improvements in the oom reaping. 
But I absolutely detest some random timeouts and will nack implementations based on them until it is absolutely clear there is no other way around. -- Michal Hocko SUSE Labs
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Fri, 25 May 2018, Tetsuo Handa wrote: > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > it cannot reap an mm. This can happen for a variety of reasons, > > including: > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > to stall indefinitely, > > > > but we can also add a third when the oom reaper can "reap" an mm but doing > > so is unlikely to free any amount of memory: > > > > - when the mm's memory is fully mlocked. > > > - when the mm's memory is fully mlocked (needs privilege) or > fully shared (does not need privilege) > Good point, that is another way that unnecessary oom killing can occur because the oom reaper sets MMF_OOM_SKIP far too early. I can make the change to the commit message. Also, I noticed in my patch that oom_reap_task() should be doing list_add_tail() rather than list_add() to enqueue the mm for reaping again. > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). > > > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > a true oom livelock in progress, it never gets set and no additional > > killing is possible. > > > > To fix this, this patch introduces a per-mm reaping timeout, initially set > > at 10s. It requires that the oom reaper's list becomes a properly linked > > list so that other mm's may be reaped while waiting for an mm's timeout to > > expire. > > I already proposed more simpler one at > https://patchwork.kernel.org/patch/9877991/ . > It's a similar idea, and I'm glad that we agree that some kind of per-mm delay is required to avoid this problem. 
I think yours is simpler, but consider the other two changes in my patch: - in the normal exit path, absent any timeout for the mm, we only set MMF_OOM_SKIP after free_pgtables() when it is known we will not free any additional memory, which can also cause unnecessary oom killing because the oom killer races with free_pgtables(), and - the oom reaper now operates over all concurrent victims instead of repeatedly trying to take mm->mmap_sem of the first victim, sleeping many times, retrying, giving up, and moving on to the next victim. Allowing the oom reaper to iterate through all victims can allow memory freeing such that an allocator may be able to drop mm->mmap_sem. In fact, with my patch, I don't know of any condition where we kill additional processes unnecessarily *unless* the victim cannot be oom reaped or complete memory freeing in the exit path within 10 seconds. Given how rare oom livelock appears in practice, I think the 10 seconds is justified because right now it is _trivial_ to oom kill many victims completely unnecessarily.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Fri, 25 May 2018, Michal Hocko wrote: > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > it cannot reap an mm. This can happen for a variety of reasons, > > including: > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > to stall indefinitely, > > > > but we can also add a third when the oom reaper can "reap" an mm but doing > > so is unlikely to free any amount of memory: > > > > - when the mm's memory is fully mlocked. > > > > When all memory is mlocked, the oom reaper will not be able to free any > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > > unmap and free its memory in exit_mmap() and subsequent oom victims are > > chosen unnecessarily. This is trivial to reproduce if all eligible > > processes on the system have mlocked their memory: the oom killer calls > > panic() even though forward progress can be made. > > > > This is the same issue where the exit path sets MMF_OOM_SKIP before > > unmapping memory and additional processes can be chosen unnecessarily > > because the oom killer is racing with exit_mmap(). > > > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > a true oom livelock in progress, it never gets set and no additional > > killing is possible. > > > > To fix this, this patch introduces a per-mm reaping timeout, initially set > > at 10s. It requires that the oom reaper's list becomes a properly linked > > list so that other mm's may be reaped while waiting for an mm's timeout to > > expire. > > No timeouts please! The proper way to handle this problem is to simply > teach the oom reaper to handle mlocked areas. That's not sufficient since the oom reaper is also not able to oom reap if the mm has blockable mmu notifiers or all memory is shared filebacked memory, so it immediately sets MMF_OOM_SKIP and additional processes are oom killed. 
The current implementation that relies on MAX_OOM_REAP_RETRIES is acting as a timeout already for mm->mmap_sem, but it's doing so without attempting to oom reap other victims that may actually allow it to grab mm->mmap_sem if the allocator is waiting on a lock. The solution, as proposed, is to allow the oom reaper to iterate over all victims and try to free memory rather than working on each victim one by one and giving up. But also note that even if oom reaping is possible, in the presence of an antagonist that continues to allocate memory, it is possible to oom kill additional victims unnecessarily if we aren't able to complete free_pgtables() in exit_mmap() of the original victim. So this patch is solving all three issues: allowing a process to *fully* exit (including free_pgtables()) before setting MMF_OOM_SKIP, allowing the oom reaper to act on parallel victims that may allow a victim to be reaped, and preventing additional processes from being killed unnecessarily when oom reaping isn't able to free memory (mlock, blockable mmu invalidates, all VM_SHARED file backed, small rss, etc). The vast majority of the time, oom reaping can occur with this change or the process can reach exit_mmap() itself; oom livelock appears to be very rare with this patch even for mem cgroup constrained oom kills with very tight limits, and thus it makes sense to wait for a prolonged period of time before killing additional processes unnecessarily.
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
On Thu 24-05-18 14:22:53, David Rientjes wrote: > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > it cannot reap an mm. This can happen for a variety of reasons, > including: > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > - when the mm has blockable mmu notifiers that could cause the oom reaper > to stall indefinitely, > > but we can also add a third when the oom reaper can "reap" an mm but doing > so is unlikely to free any amount of memory: > > - when the mm's memory is fully mlocked. > > When all memory is mlocked, the oom reaper will not be able to free any > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > unmap and free its memory in exit_mmap() and subsequent oom victims are > chosen unnecessarily. This is trivial to reproduce if all eligible > processes on the system have mlocked their memory: the oom killer calls > panic() even though forward progress can be made. > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > a true oom livelock in progress, it never gets set and no additional > killing is possible. > > To fix this, this patch introduces a per-mm reaping timeout, initially set > at 10s. It requires that the oom reaper's list becomes a properly linked > list so that other mm's may be reaped while waiting for an mm's timeout to > expire. No timeouts please! The proper way to handle this problem is to simply teach the oom reaper to handle mlocked areas. -- Michal Hocko SUSE Labs
Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
David Rientjes wrote: > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > it cannot reap an mm. This can happen for a variety of reasons, > including: > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > - when the mm has blockable mmu notifiers that could cause the oom reaper > to stall indefinitely, > > but we can also add a third when the oom reaper can "reap" an mm but doing > so is unlikely to free any amount of memory: > > - when the mm's memory is fully mlocked. - when the mm's memory is fully mlocked (needs privilege) or fully shared (does not need privilege) > > When all memory is mlocked, the oom reaper will not be able to free any > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > unmap and free its memory in exit_mmap() and subsequent oom victims are > chosen unnecessarily. This is trivial to reproduce if all eligible > processes on the system have mlocked their memory: the oom killer calls > panic() even though forward progress can be made. s/mlocked/mlocked or shared/g > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > a true oom livelock in progress, it never gets set and no additional > killing is possible. > > To fix this, this patch introduces a per-mm reaping timeout, initially set > at 10s. It requires that the oom reaper's list becomes a properly linked > list so that other mm's may be reaped while waiting for an mm's timeout to > expire. I already proposed a simpler one at https://patchwork.kernel.org/patch/9877991/ . > > The exit path will now set MMF_OOM_SKIP only after all memory has been > freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to > determine when it can race with the oom reaper. 
> > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > lapsed because it can no longer guarantee forward progress. > > The reaping timeout is intentionally set for a substantial amount of time > since oom livelock is a very rare occurrence and it's better to optimize > for preventing additional (unnecessary) oom killing than a scenario that > is much more unlikely. But before thinking about your proposal, please think about how to guarantee that the OOM reaper and the exit path can run, as discussed at http://lkml.kernel.org/r/201805122318.hjg81246.mfvflfjooqt...@i-love.sakura.ne.jp .
[rfc patch] mm, oom: fix unnecessary killing of additional processes
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if it cannot reap an mm. This can happen for a variety of reasons, including: - the inability to grab mm->mmap_sem in a sufficient amount of time, - when the mm has blockable mmu notifiers that could cause the oom reaper to stall indefinitely, but we can also add a third when the oom reaper can "reap" an mm but doing so is unlikely to free any amount of memory: - when the mm's memory is fully mlocked. When all memory is mlocked, the oom reaper will not be able to free any substantial amount of memory. It sets MMF_OOM_SKIP before the victim can unmap and free its memory in exit_mmap() and subsequent oom victims are chosen unnecessarily. This is trivial to reproduce if all eligible processes on the system have mlocked their memory: the oom killer calls panic() even though forward progress can be made. This is the same issue where the exit path sets MMF_OOM_SKIP before unmapping memory and additional processes can be chosen unnecessarily because the oom killer is racing with exit_mmap(). We can't simply defer setting MMF_OOM_SKIP, however, because if there is a true oom livelock in progress, it never gets set and no additional killing is possible. To fix this, this patch introduces a per-mm reaping timeout, initially set at 10s. It requires that the oom reaper's list becomes a properly linked list so that other mm's may be reaped while waiting for an mm's timeout to expire. The exit path will now set MMF_OOM_SKIP only after all memory has been freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to determine when it can race with the oom reaper. The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has lapsed because it can no longer guarantee forward progress. 
The reaping timeout is intentionally set for a substantial amount of time since oom livelock is a very rare occurrence and it's better to optimize for preventing additional (unnecessary) oom killing than a scenario that is much more unlikely.

Signed-off-by: David Rientjes
---
 include/linux/mm_types.h |  4 ++
 include/linux/sched.h    |  2 +-
 mm/mmap.c                | 12 +++---
 mm/oom_kill.c            | 85 ++--
 4 files changed, 66 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -462,6 +462,10 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/* When to give up on oom reaping this mm */
+	unsigned long reap_timeout;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,7 +1151,7 @@ struct task_struct {
 #endif
 	int			pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct	*oom_reaper_list;
+	struct list_head	oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct	*stack_vm_area;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
+		 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
 		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm)
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);

-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);

 	/*
 	 * Walk the list again, actually closing and