Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-06-13 Thread Michal Hocko
On Wed 13-06-18 22:20:49, Tetsuo Handa wrote:
> On 2018/06/05 17:57, Michal Hocko wrote:
> >> For this reason, we see testing harnesses often oom killed immediately 
> >> after running a unittest that stresses reclaim or compaction by inducing a 
> >> system-wide oom condition.  The harness spawns the unittest which spawns 
> >> an antagonist memory hog that is intended to be oom killed.  When memory 
> >> is mlocked or there are a large number of threads faulting memory for the 
> >> antagonist, the unittest and the harness itself get oom killed because the 
> >> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.  
> >> The memory hog has mm->mmap_sem readers queued ahead of a writer that is 
> >> doing mmap() so the oom reaper can't grab the sem quickly enough.
> > 
> > How come the writer doesn't back off? mmap paths should be taking the
> > exclusive mmap sem in a killable sleep, so it should back off. Or is the
> > holder of the lock deep inside the mmap path doing something else and not
> > backing out with the exclusive lock held?
> > 
>  
> Here is an example where the writer doesn't back off.
> 
>   
> http://lkml.kernel.org/r/20180607150546.1c7db21f70221008e14b8...@linux-foundation.org
> 
> down_write_killable(&mm->mmap_sem) does nothing but increase the possibility
> of successfully backing off. There is no guarantee that the owner of that
> exclusive mmap sem will not be blocked by other unkillable waits.

But we are talking about the mmap() path here. Sure, there are other paths
which might need to back off while the lock is held, and those should be
addressed if possible, but this is not really related to what David wrote
above and what I was trying to understand.

-- 
Michal Hocko
SUSE Labs


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-06-13 Thread Tetsuo Handa
On 2018/06/05 17:57, Michal Hocko wrote:
>> For this reason, we see testing harnesses often oom killed immediately 
>> after running a unittest that stresses reclaim or compaction by inducing a 
>> system-wide oom condition.  The harness spawns the unittest which spawns 
>> an antagonist memory hog that is intended to be oom killed.  When memory 
>> is mlocked or there are a large number of threads faulting memory for the 
>> antagonist, the unittest and the harness itself get oom killed because the 
>> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.  
>> The memory hog has mm->mmap_sem readers queued ahead of a writer that is 
>> doing mmap() so the oom reaper can't grab the sem quickly enough.
> 
> How come the writer doesn't back off? mmap paths should be taking the
> exclusive mmap sem in a killable sleep, so it should back off. Or is the
> holder of the lock deep inside the mmap path doing something else and not
> backing out with the exclusive lock held?
> 
 
Here is an example where the writer doesn't back off.

  
http://lkml.kernel.org/r/20180607150546.1c7db21f70221008e14b8...@linux-foundation.org

down_write_killable(&mm->mmap_sem) does nothing but increase the possibility of
successfully backing off. There is no guarantee that the owner of that exclusive
mmap sem will not be blocked by other unkillable waits.
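
For reference, below is a minimal sketch of the pattern being discussed
(purely illustrative, not taken from any patch in this thread): a writer that
takes mmap_sem with down_write_killable() can bail out while it is still
sleeping on the lock if a fatal signal such as the OOM kill arrives, but once
it owns the lock, any unkillable wait it enters afterwards still holds up the
OOM reaper.

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/rwsem.h>

/* Hypothetical mmap-style path; only the locking pattern matters here. */
static int example_mmap_path(struct mm_struct *mm)
{
	/*
	 * Backs off if a fatal signal (e.g. SIGKILL from the OOM killer)
	 * arrives while we are still waiting for the write lock.
	 */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	/*
	 * The point above: if the work done while holding the lock blocks
	 * in an unkillable wait (allocation, filesystem, driver callback,
	 * ...), the killable acquisition no longer helps the OOM reaper.
	 */
	/* ... modify the address space ... */

	up_write(&mm->mmap_sem);
	return 0;
}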


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-06-05 Thread Michal Hocko
On Mon 04-06-18 21:25:39, David Rientjes wrote:
> On Fri, 1 Jun 2018, Michal Hocko wrote:
> 
> > > We've discussed the mm 
> > > having a single blockable mmu notifier.  Regardless of how we arrive at 
> > > the point where the oom reaper can't free memory, which could be any of 
> > > those three cases, if (1) the original victim is sufficiently large that 
> > > follow-up oom kills would become unnecessary and (2) other threads 
> > > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> > > 
> > > We have examples of cases where oom reaping was successful, but the rss 
> > > numbers in the kernel log are very similar to when it was oom killed and 
> > > the process is known not to mlock, the reason is because the oom reaper 
> > > could free very little memory due to blockable mmu notifiers.
> > 
> > Please be more specific. Which notifiers were these? Blockable notifiers
> > are a PITA and we should be addressing them. That requires identifying
> > them first.
> > 
> 
> The most common offender seems to be ib_umem_notifier, but I have also 
> heard of possible occurrences for mv_invl_range_start() for xen, but that 
> would need more investigation.  The rather new invalidate_range callback 
> for hmm mirroring could also be problematic.  Any mmu_notifier without 
> MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.  

Yes, this is unfortunate, and it was meant as a stop-gap quick fix with a
long-term vision of being fixed properly. I am pretty sure that we can do
much better here. Teach mmu_notifier_invalidate_range_start to take a
non-block flag and back out on ranges that would block. I am pretty sure
that the notifiers can be targeted much more precisely, so we could still
process at least some vmas.
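
As a minimal sketch of what such an interface could look like (the extra
argument, the -EAGAIN convention, and all of the example_* names below are
hypothetical illustrations of the idea, not the in-tree mmu notifier API at
the time of this thread):

#include <linux/errno.h>
#include <linux/mmu_notifier.h>

/* Hypothetical helpers standing in for a driver's real invalidation work. */
extern bool example_invalidation_would_block(struct mmu_notifier *mn,
					     unsigned long start,
					     unsigned long end);
extern void example_do_invalidate(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end);

static int example_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end,
					  bool nonblock)
{
	/*
	 * If the caller (e.g. the OOM reaper) cannot sleep and this range
	 * would need a blocking invalidation, refuse instead of sleeping;
	 * the caller can then skip just this range/vma and reap the rest.
	 */
	if (nonblock && example_invalidation_would_block(mn, start, end))
		return -EAGAIN;

	example_do_invalidate(mn, mm, start, end);	/* may sleep */
	return 0;
}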

> For this reason, we see testing harnesses often oom killed immediately 
> after running a unittest that stresses reclaim or compaction by inducing a 
> system-wide oom condition.  The harness spawns the unittest which spawns 
> an antagonist memory hog that is intended to be oom killed.  When memory 
> is mlocked or there are a large number of threads faulting memory for the 
> antagonist, the unittest and the harness itself get oom killed because the 
> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.  
> The memory hog has mm->mmap_sem readers queued ahead of a writer that is 
> doing mmap() so the oom reaper can't grab the sem quickly enough.

How come the writer doesn't back off? mmap paths should be taking the
exclusive mmap sem in a killable sleep, so it should back off. Or is the
holder of the lock deep inside the mmap path doing something else and not
backing out with the exclusive lock held?
 
[...]

> > As I've already said, I will nack any timeout based solution until we
> > address all the particular problems, and I still see more to come. Here we
> > have a clear goal: address mlocked pages and identify mmu notifier offenders.
> 
> I cannot fix all mmu notifiers to not block, I can't fix the configuration 
> to allow direct compaction for thp allocations and a large number of 
> concurrent faulters, and I cannot fix userspace mlocking a lot of memory.  
> It's worthwhile to work in that direction, but it will never be 100% 
> possible to avoid.  We must have a solution that prevents innocent 
> processes from consistently being oom killed completely unnecessarily.

None of the above has been attempted and shown not worth doing. The oom
event should be a rare thing to happen, so I absolutely do not see any
reason to rush a misdesigned fix in right now.

-- 
Michal Hocko
SUSE Labs


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-06-04 Thread David Rientjes
On Fri, 1 Jun 2018, Michal Hocko wrote:

> > We've discussed the mm 
> > having a single blockable mmu notifier.  Regardless of how we arrive at 
> > the point where the oom reaper can't free memory, which could be any of 
> > those three cases, if (1) the original victim is sufficiently large that 
> > follow-up oom kills would become unnecessary and (2) other threads 
> > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> > 
> > We have examples of cases where oom reaping was successful, but the rss 
> > numbers in the kernel log are very similar to when it was oom killed and 
> > the process is known not to mlock, the reason is because the oom reaper 
> > could free very little memory due to blockable mmu notifiers.
> 
> Please be more specific. Which notifiers were these? Blockable notifiers
> are a PITA and we should be addressing them. That requires identifying
> them first.
> 

The most common offender seems to be ib_umem_notifier, but I have also 
heard of possible occurrences for mv_invl_range_start() for xen, but that 
would need more investigation.  The rather new invalidate_range callback 
for hmm mirroring could also be problematic.  Any mmu_notifier without 
MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.  
For this reason, we see testing harnesses often oom killed immediately 
after running a unittest that stresses reclaim or compaction by inducing a 
system-wide oom condition.  The harness spawns the unittest which spawns 
an antagonist memory hog that is intended to be oom killed.  When memory 
is mlocked or there are a large number of threads faulting memory for the 
antagonist, the unittest and the harness itself get oom killed because the 
oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.  
The memory hog has mm->mmap_sem readers queued ahead of a writer that is 
doing mmap() so the oom reaper can't grab the sem quickly enough.
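
For context, the reaper behavior described above works roughly like the sketch
below (heavily simplified and paraphrased rather than quoted from the kernel
source; the retry count, sleep length, and names are illustrative): the reaper
only ever try-locks mmap_sem for read, retries a handful of times with a short
sleep, and then gives up by setting MMF_OOM_SKIP, which is what permits the
follow-up kills when page-faulting readers plus a queued mmap() writer keep
the semaphore busy.

#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/sched/coredump.h>	/* MMF_OOM_SKIP */

#define EXAMPLE_OOM_REAP_RETRIES 10

static void example_oom_reap(struct mm_struct *mm)
{
	int attempts = 0;

	while (attempts++ < EXAMPLE_OOM_REAP_RETRIES) {
		if (down_read_trylock(&mm->mmap_sem)) {
			/* ... unmap whatever can be freed without blocking ... */
			up_read(&mm->mmap_sem);
			break;
		}
		/*
		 * With readers queued ahead of an mmap() writer, this
		 * trylock keeps failing and the retries run out quickly.
		 */
		schedule_timeout_idle(HZ / 10);
	}

	/* Give up: the mm is skipped and the OOM killer may pick a new victim. */
	set_bit(MMF_OOM_SKIP, &mm->flags);
}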

I agree that blockable mmu notifiers are a pain, but until such time as 
all can implicitly be MMU_INVALIDATE_DOES_NOT_BLOCK, the oom reaper can 
free all mlocked memory, and the oom reaper waits long enough to grab 
mm->mmap_sem for stalled mm->mmap_sem readers, we need a solution that 
won't oom kill everything running on the system.  I have doubts we'll ever 
reach a point where the oom reaper can do the equivalent of exit_mmap(), 
but it's possible to help solve the immediate issue of all oom kills 
killing many innocent processes while working in a direction to make oom 
reaping more successful at freeing memory.

> > The current implementation is a timeout based solution for mmap_sem; it
> > just has the oom reaper spin trying to grab the sem and eventually give
> > up.  This patch allows it to concurrently work on other mm's and detects
> > the timeout in a different way, with jiffies instead of an iterator.
> 
> And I argue that anything timeout based is just broken by design. Trying
> n times will at least give you a consistent behavior.

It's not consistent; we see wildly inconsistent results, especially on
power, because the outcome depends on how many readers of mm->mmap_sem are
queued ahead of a writer before a thread doing mmap() can grab it, drop it,
and allow the oom reaper to take it for read.  It's so inconsistent that we
can see the oom reaper successfully grab the sem for an oom killed memory
hog with 128 faulting threads, and see it fail with 4 faulting threads.

> Retrying on mmap
> sem makes sense because the lock might be taken for a short time.

It isn't a function of how long mmap_sem is held for write; it's a
function of how many readers are queued ahead of the writer.  We don't run
with thp defrag set to "always" under standard configurations, but users
of MADV_HUGEPAGE or configs where defrag is set to "always" can
consistently cause any number of additional processes to be oom killed
unnecessarily, because the readers are performing compaction and the writer
is queued behind them.

> > I'd love a solution where we can reliably detect an oom livelock and oom
> > kill additional processes, but only after the original victim has had a
> > chance to do exit_mmap() without a timeout; I don't see one being offered.
> > Given that Tetsuo has seen issues with this in the past and suggested a
> > similar proposal, we are clearly not the only ones feeling pain from this.
> 
> Tetsuo is doing an artificial stress test which doesn't resemble any
> reasonable workload.

Tetsuo's test cases caught the CVE on powerpc which could trivially
panic the system (if configured to panic on any oops) and which required a
security fix because it made that easy to trigger for any user doing a large
mlock.  His test case here is trivial to reproduce on powerpc and causes
several additional processes to be oom killed.  It's not artificial; I see
many test harnesses killed *nightly* because a memory hog is faulting with a
large number of threads and two or three other threads are 

Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-06-01 Thread Michal Hocko
On Thu 31-05-18 14:16:34, David Rientjes wrote:
> On Thu, 31 May 2018, Michal Hocko wrote:
> 
> > > It's not a random timeout; it's sufficiently long such that we don't oom
> > > kill several processes needlessly in the very rare case where oom livelock
> > > would actually prevent the original victim from exiting.  The oom reaper
> > > processing an mm, finding everything to be mlocked, and immediately setting
> > > MMF_OOM_SKIP is inappropriate.  This is rather trivial to reproduce for a
> > > large memory hogging process that mlocks all of its memory; we consistently
> > > see spurious and unnecessary oom kills simply because the oom reaper has
> > > set MMF_OOM_SKIP very early.
> > 
> > It takes quite some additional steps for an admin to allow a large amount
> > of mlocked memory, and such an application should be really careful not to
> > consume too much memory. So how come this is something you see so
> > consistently? Is this some sort of bug or an unfortunate workload side
> > effect? I am asking because I really want to see how relevant this
> > really is.
> > 
> 
> The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after 
> the victim has been chosen for oom kill and we get follow-up oom kills, 
> not that the process is able to mlock a large amount of memory.  Mlock 
> here is only being discussed as a single example.  Tetsuo has brought up 
> the example of all shared file-backed memory.

How is such a case even possible? File backed memory is reclaimable and
as such should be gone by the time we hit the OOM killer. If that is not
the case then I fail to see how waiting slightly longer helps anything.

> We've discussed the mm 
> having a single blockable mmu notifier.  Regardless of how we arrive at 
> the point where the oom reaper can't free memory, which could be any of 
> those three cases, if (1) the original victim is sufficiently large that 
> follow-up oom kills would become unnecessary and (2) other threads 
> allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> 
> We have examples of cases where oom reaping was successful, but the rss 
> numbers in the kernel log are very similar to when it was oom killed and 
> the process is known not to mlock, the reason is because the oom reaper 
> could free very little memory due to blockable mmu notifiers.

Please be more specific. Which notifiers were these? Blockable notifiers
are a PITA and we should be addressing them. That requires identifying
them first.

> > But the waiting periods just turn out to be a really poor design. There
> > will be no good timeout to fit for everybody. We can do better and as
> > long as this is the case the timeout based solution should be really
> > rejected. It is a shortcut that doesn't really solve the underlying
> > problem.
> > 
> 
> The current implementation is a timeout based solution for mmap_sem; it
> just has the oom reaper spin trying to grab the sem and eventually give
> up.  This patch allows it to concurrently work on other mm's and detects
> the timeout in a different way, with jiffies instead of an iterator.

And I argue that anything timeout based is just broken by design. Trying
n times will at least give you a consistent behavior. Retrying on mmap
sem makes sense because the lock might be taken for a short time.
Retrying on memory the oom reaper cannot reclaim is just pointless
waiting for somebody else to do the work. See the difference?

> I'd love a solution where we can reliably detect an oom livelock and oom
> kill additional processes, but only after the original victim has had a
> chance to do exit_mmap() without a timeout; I don't see one being offered.
> Given that Tetsuo has seen issues with this in the past and suggested a
> similar proposal, we are clearly not the only ones feeling pain from this.

Tetsuo is doing an artificial stress test which doesn't resemble any
reasonable workload. This is good for catching different corner cases but
nothing even close to something to base any design on. I will definitely
nack any attempt to add a timeout based solution based on such
non-realistic tests. If we have realistic workloads then let's try to
address them and resort to timeouts or other hacks only as the last option.
 
> > > I'm open to hearing any other suggestions that you have other than 
> > > waiting 
> > > some time period before MMF_OOM_SKIP gets set to solve this problem.
> > 
> > I've already offered one. Make mlocked pages reapable.
> 
> Making mlocked pages reapable would only solve the most trivial reproducer 
> of this.  Unless the oom reaper can guarantee that it will never block and 
> can free all memory that exit_mmap() can free, we need to ensure that a 
> victim has a chance to reach the exit path on its own before killing every 
> other process on the system.
> 
> I'll fix the issue I identified with doing list_add_tail() rather than 
> list_add(), fix up the commit message per Tetsuo to identify the other 
> possible ways this 

Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-31 Thread David Rientjes
On Thu, 31 May 2018, Michal Hocko wrote:

> > It's not a random timeout; it's sufficiently long such that we don't oom
> > kill several processes needlessly in the very rare case where oom livelock
> > would actually prevent the original victim from exiting.  The oom reaper
> > processing an mm, finding everything to be mlocked, and immediately setting
> > MMF_OOM_SKIP is inappropriate.  This is rather trivial to reproduce for a
> > large memory hogging process that mlocks all of its memory; we consistently
> > see spurious and unnecessary oom kills simply because the oom reaper has
> > set MMF_OOM_SKIP very early.
> 
> It takes quite some additional steps for an admin to allow a large amount
> of mlocked memory, and such an application should be really careful not to
> consume too much memory. So how come this is something you see so
> consistently? Is this some sort of bug or an unfortunate workload side
> effect? I am asking because I really want to see how relevant this
> really is.
> 

The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after 
the victim has been chosen for oom kill and we get follow-up oom kills, 
not that the process is able to mlock a large amount of memory.  Mlock 
here is only being discussed as a single example.  Tetsuo has brought up 
the example of all shared file-backed memory.  We've discussed the mm 
having a single blockable mmu notifier.  Regardless of how we arrive at 
the point where the oom reaper can't free memory, which could be any of 
those three cases, if (1) the original victim is sufficiently large that 
follow-up oom kills would become unnecessary and (2) other threads 
allocate/charge before the oom victim reaches exit_mmap(), this occurs.

We have examples of cases where oom reaping was successful, but the rss 
numbers in the kernel log are very similar to when it was oom killed and 
the process is known not to mlock, the reason is because the oom reaper 
could free very little memory due to blockable mmu notifiers.

> But the waiting periods just turn out to be a really poor design. There
> will be no good timeout to fit for everybody. We can do better and as
> long as this is the case the timeout based solution should be really
> rejected. It is a shortcut that doesn't really solve the underlying
> problem.
> 

The current implementation is a timeout based solution for mmap_sem; it
just has the oom reaper spin trying to grab the sem and eventually give
up.  This patch allows it to concurrently work on other mm's and detects
the timeout in a different way, with jiffies instead of an iterator.
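
To make the comparison concrete, here is a hedged illustration of the two
"give up" styles being contrasted (both are simplified sketches; the field
and function names are made up, and neither is the literal upstream code or
the literal patch):

#include <linux/jiffies.h>
#include <linux/mm_types.h>
#include <linux/types.h>

/* Hypothetical per-victim state, recorded when the mm is queued for reaping. */
struct example_victim {
	struct mm_struct *mm;
	int attempts;			/* retry-count style */
	unsigned long reap_deadline;	/* jiffies-deadline style */
};

static bool example_give_up_by_attempts(struct example_victim *v)
{
	/* current style: give up after a fixed number of failed attempts */
	return ++v->attempts >= 10;
}

static bool example_give_up_by_deadline(struct example_victim *v)
{
	/* proposed style: give up only once a wall-clock grace period expires */
	return time_after(jiffies, v->reap_deadline);
}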

I'd love a solution where we can reliably detect an oom livelock and oom
kill additional processes, but only after the original victim has had a
chance to do exit_mmap() without a timeout; I don't see one being offered.
Given that Tetsuo has seen issues with this in the past and suggested a
similar proposal, we are clearly not the only ones feeling pain from this.

> > I'm open to hearing any other suggestions that you have other than waiting 
> > some time period before MMF_OOM_SKIP gets set to solve this problem.
> 
> I've already offered one. Make mlocked pages reapable.

Making mlocked pages reapable would only solve the most trivial reproducer 
of this.  Unless the oom reaper can guarantee that it will never block and 
can free all memory that exit_mmap() can free, we need to ensure that a 
victim has a chance to reach the exit path on its own before killing every 
other process on the system.

I'll fix the issue I identified with doing list_add_tail() rather than 
list_add(), fix up the commit message per Tetsuo to identify the other 
possible ways this can occur other than mlock, remove the rfc tag, and 
repost.


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-31 Thread Michal Hocko
On Wed 30-05-18 14:06:51, David Rientjes wrote:
> On Mon, 28 May 2018, Michal Hocko wrote:
> 
> > > That's not sufficient since the oom reaper is also not able to oom reap 
> > > if 
> > > the mm has blockable mmu notifiers or all memory is shared filebacked 
> > > memory, so it immediately sets MMF_OOM_SKIP and additional processes are 
> > > oom killed.
> > 
> > Could you be more specific with a real world example where that is the
> > case? I mean the full address space of non-reclaimable file backed
> > memory where waiting some more would help? Blockable mmu notifiers are
> > a PITA for sure. I wish we could have a better way to deal with them.
> > Maybe we can tell them we are in the non-blockable context and have them
> > release as much as possible. Still something that a random timeout
> > wouldn't help I am afraid.
> > 
> 
> It's not a random timeout; it's sufficiently long such that we don't oom
> kill several processes needlessly in the very rare case where oom livelock
> would actually prevent the original victim from exiting.  The oom reaper
> processing an mm, finding everything to be mlocked, and immediately setting
> MMF_OOM_SKIP is inappropriate.  This is rather trivial to reproduce for a
> large memory hogging process that mlocks all of its memory; we consistently
> see spurious and unnecessary oom kills simply because the oom reaper has
> set MMF_OOM_SKIP very early.

It takes quite some additional steps for an admin to allow a large amount
of mlocked memory, and such an application should be really careful not to
consume too much memory. So how come this is something you see so
consistently? Is this some sort of bug or an unfortunate workload side
effect? I am asking because I really want to see how relevant this
really is.
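
To illustrate the "additional steps" point with a small userspace sketch
(illustrative only): by default, locking a large amount of memory fails once
RLIMIT_MEMLOCK is exceeded, so it takes an admin raising the limit (ulimit -l,
limits.conf) or granting CAP_IPC_LOCK before an application can mlock a large
address space at all.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
		printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);

	/*
	 * Without a raised limit or CAP_IPC_LOCK this typically fails with
	 * ENOMEM (or EPERM), which is the extra configuration step referred
	 * to above.
	 */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		fprintf(stderr, "mlockall: %s\n", strerror(errno));
	else
		printf("mlockall succeeded\n");

	return 0;
}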

> This patch introduces a "give up" period such that the oom reaper is still
> allowed to do its good work, but it only gives up after a substantial
> period of time, in the hope that the victim can make forward progress in
> the meantime.  I would understand the objection if oom livelock, where the
> victim cannot make forward progress, were commonplace, but in the interest
> of not killing several processes needlessly every time a large mlocked
> process is targeted, I think it compels a waiting period.

But the waiting periods just turn out to be a really poor design. There
will be no good timeout to fit for everybody. We can do better and as
long as this is the case the timeout based solution should be really
rejected. It is a shortcut that doesn't really solve the underlying
problem.

> Trying to reap a different oom victim when the current one is not making
> progress during the lock contention is certainly something that makes
> sense. It has been proposed in the past and we just gave it up because
> it was more complex. Do you have any specific example where this would
> help, to justify the additional complexity?
> > 
> 
> I'm not sure how you're defining complexity: the patch adds ~30 lines of
> code, prevents processes from needlessly being oom killed when oom reaping
> is largely unsuccessful and the victim has not yet finished free_pgtables(),
> and also allows the oom reaper to operate on multiple mm's instead of
> processing one at a time.  Obviously, if there is a delay before
> MMF_OOM_SKIP is set, it requires that the oom reaper be able to process
> other mm's; otherwise we stall needlessly for 10s.  Operating on multiple
> mm's in a linked list while waiting for victims to exit during a timeout
> period is thus very much needed; the patch wouldn't make sense without it.

It needs to keep track of the current retry state of the reaped victim,
and that is additional complexity, isn't it? And I am asking how often we
have to handle that. Please note that the primary objective here is to
unclutter a locked up situation. The oom reaper doesn't block the victim
from going away on its own while we keep retrying. So slow progress on the
reaper side is not an issue IMHO.

> > > But also note that even if oom reaping is possible, in the presence of an 
> > > antagonist that continues to allocate memory, that it is possible to oom 
> > > kill additional victims unnecessarily if we aren't able to complete 
> > > free_pgtables() in exit_mmap() of the original victim.
> > 
> > If there is unbound source of allocations then we are screwed no matter
> > what. We just hope that the allocator will get noticed by the oom killer
> > and it will be stopped.
> > 
> 
> It's not unbounded, it's just an allocator that acts as an antagonist.  At 
> the risk of being overly verbose, for system or memcg oom conditions: a 
> large mlocked process is oom killed, other processes continue to 
> allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP 
> without being able to free any memory, and the other important processes 
> are needlessly oom killed before the original victim can reach 
> exit_mmap().  This happens a _lot_.
>
> I'm open to hearing any 

Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-30 Thread David Rientjes
On Mon, 28 May 2018, Michal Hocko wrote:

> > That's not sufficient since the oom reaper is also not able to oom reap if 
> > the mm has blockable mmu notifiers or all memory is shared filebacked 
> > memory, so it immediately sets MMF_OOM_SKIP and additional processes are 
> > oom killed.
> 
> Could you be more specific with a real world example where that is the
> case? I mean the full address space of non-reclaimable file backed
> memory where waiting some more would help? Blockable mmu notifiers are
> a PITA for sure. I wish we could have a better way to deal with them.
> Maybe we can tell them we are in the non-blockable context and have them
> release as much as possible. Still something that a random timeout
> wouldn't help I am afraid.
> 

It's not a random timeout; it's sufficiently long such that we don't oom
kill several processes needlessly in the very rare case where oom livelock
would actually prevent the original victim from exiting.  The oom reaper
processing an mm, finding everything to be mlocked, and immediately setting
MMF_OOM_SKIP is inappropriate.  This is rather trivial to reproduce for a
large memory hogging process that mlocks all of its memory; we consistently
see spurious and unnecessary oom kills simply because the oom reaper has
set MMF_OOM_SKIP very early.

This patch introduces a "give up" period such that the oom reaper is still
allowed to do its good work, but it only gives up after a substantial
period of time, in the hope that the victim can make forward progress in
the meantime.  I would understand the objection if oom livelock, where the
victim cannot make forward progress, were commonplace, but in the interest
of not killing several processes needlessly every time a large mlocked
process is targeted, I think it compels a waiting period.

> Trying to reap a different oom victim when the current one is not making
> progress during the lock contention is certainly something that makes
> sense. It has been proposed in the past and we just gave it up because
> it was more complex. Do you have any specific example where this would
> help, to justify the additional complexity?
> 

I'm not sure how you're defining complexity: the patch adds ~30 lines of
code, prevents processes from needlessly being oom killed when oom reaping
is largely unsuccessful and the victim has not yet finished free_pgtables(),
and also allows the oom reaper to operate on multiple mm's instead of
processing one at a time.  Obviously, if there is a delay before
MMF_OOM_SKIP is set, it requires that the oom reaper be able to process
other mm's; otherwise we stall needlessly for 10s.  Operating on multiple
mm's in a linked list while waiting for victims to exit during a timeout
period is thus very much needed; the patch wouldn't make sense without it.
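
For concreteness, a rough sketch of the structure described above (entirely
illustrative: the names are made up, and locking, refcounting, and entry life
cycle are omitted; this is not the actual patch):

#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/sched/coredump.h>	/* MMF_OOM_SKIP */

struct example_reap_entry {
	struct list_head node;
	struct mm_struct *mm;
	unsigned long expire;	/* jiffies after which we give up on this mm */
};

static LIST_HEAD(example_reap_list);

/* One pass of the reaper over all queued victims. */
static void example_reap_pass(void)
{
	struct example_reap_entry *e, *next;

	list_for_each_entry_safe(e, next, &example_reap_list, node) {
		if (down_read_trylock(&e->mm->mmap_sem)) {
			/* ... reap what can be reaped right now ... */
			up_read(&e->mm->mmap_sem);
		}
		if (time_after(jiffies, e->expire)) {
			/* Grace period over: let the OOM killer move on. */
			set_bit(MMF_OOM_SKIP, &e->mm->flags);
			list_del(&e->node);
			/* freeing the entry and mmdrop() are omitted here */
		}
	}
}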

> > But also note that even if oom reaping is possible, in the presence of an 
> > antagonist that continues to allocate memory, that it is possible to oom 
> > kill additional victims unnecessarily if we aren't able to complete 
> > free_pgtables() in exit_mmap() of the original victim.
> 
> If there is unbound source of allocations then we are screwed no matter
> what. We just hope that the allocator will get noticed by the oom killer
> and it will be stopped.
> 

It's not unbounded, it's just an allocator that acts as an antagonist.  At 
the risk of being overly verbose, for system or memcg oom conditions: a 
large mlocked process is oom killed, other processes continue to 
allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP 
without being able to free any memory, and the other important processes 
are needlessly oom killed before the original victim can reach 
exit_mmap().  This happens a _lot_.

I'm open to hearing any other suggestions that you have other than waiting 
some time period before MMF_OOM_SKIP gets set to solve this problem.


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-28 Thread Michal Hocko
On Fri 25-05-18 12:36:08, David Rientjes wrote:
> On Fri, 25 May 2018, Michal Hocko wrote:
> 
> > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > > it cannot reap an mm.  This can happen for a variety of reasons,
> > > including:
> > > 
> > >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > > 
> > >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> > >to stall indefinitely,
> > > 
> > > but we can also add a third when the oom reaper can "reap" an mm but doing
> > > so is unlikely to free any amount of memory:
> > > 
> > >  - when the mm's memory is fully mlocked.
> > > 
> > > When all memory is mlocked, the oom reaper will not be able to free any
> > > substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> > > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > > chosen unnecessarily.  This is trivial to reproduce if all eligible
> > > processes on the system have mlocked their memory: the oom killer calls
> > > panic() even though forward progress can be made.
> > > 
> > > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > > unmapping memory and additional processes can be chosen unnecessarily
> > > because the oom killer is racing with exit_mmap().
> > > 
> > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > > a true oom livelock in progress, it never gets set and no additional
> > > killing is possible.
> > > 
> > > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > > at 10s.  It requires that the oom reaper's list becomes a properly linked
> > > list so that other mm's may be reaped while waiting for an mm's timeout to
> > > expire.
> > 
> > No timeouts please! The proper way to handle this problem is to simply
> > teach the oom reaper to handle mlocked areas.
> 
> That's not sufficient since the oom reaper is also not able to oom reap if 
> the mm has blockable mmu notifiers or all memory is shared filebacked 
> memory, so it immediately sets MMF_OOM_SKIP and additional processes are 
> oom killed.

Could you be more specific with a real world example where that is the
case? I mean the full address space of non-reclaimable file backed
memory where waiting some more would help? Blockable mmu notifiers are
a PITA for sure. I wish we could have a better way to deal with them.
Maybe we can tell them we are in the non-blockable context and have them
release as much as possible. Still something that a random timeout
wouldn't help I am afraid.

> The current implementation that relies on MAX_OOM_REAP_RETRIES is acting 
> as a timeout already for mm->mmap_sem, but it's doing so without 
> attempting to oom reap other victims that may actually allow it to grab 
> mm->mmap_sem if the allocator is waiting on a lock.

Trying to reap a different oom victim when the current one is not making
progress during the lock contention is certainly something that makes
sense. It has been proposed in the past and we just gave it up because
it was more complex. Do you have any specific example when this would
help to justify the additional complexity?

> The solution, as proposed, is to allow the oom reaper to iterate over all 
> victims and try to free memory rather than working on each victim one by 
> one and giving up.
> 
> But also note that even if oom reaping is possible, in the presence of an 
> antagonist that continues to allocate memory, that it is possible to oom 
> kill additional victims unnecessarily if we aren't able to complete 
> free_pgtables() in exit_mmap() of the original victim.

If there is an unbounded source of allocations then we are screwed no matter
what. We just hope that the allocator will get noticed by the oom killer
and it will be stopped.

That being said, I do not object to justified improvements in the oom
reaping. But I absolutely detest random timeouts and will nack
implementations based on them until it is absolutely clear there is no
other way around it.
-- 
Michal Hocko
SUSE Labs


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-25 Thread David Rientjes
On Fri, 25 May 2018, Tetsuo Handa wrote:

> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.  This can happen for a variety of reasons,
> > including:
> > 
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > 
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >to stall indefinitely,
> > 
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> > 
> >  - when the mm's memory is fully mlocked.
> 
>- when the mm's memory is fully mlocked (needs privilege) or
>  fully shared (does not need privilege)
> 

Good point, that is another way that unnecessary oom killing can occur 
because the oom reaper sets MMF_OOM_SKIP far too early.  I can make the 
change to the commit message.

Also, I noticed in my patch that oom_reap_task() should be doing 
list_add_tail() rather than list_add() to enqueue the mm for reaping 
again.

> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> > 
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> > 
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s.  It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
> 
> I already proposed a simpler one at 
> https://patchwork.kernel.org/patch/9877991/ .
> 

It's a similar idea, and I'm glad that we agree that some kind of per-mm 
delay is required to avoid this problem.  I think yours is simpler, but 
consider the other two changes in my patch:

 - in the normal exit path, absent any timeout for the mm, we only set
   MMF_OOM_SKIP after free_pgtables() when it is known we will not free
   any additional memory, which can also cause unnecessary oom killing
   because the oom killer races with free_pgtables(), and

 - the oom reaper now operates over all concurrent victims instead of
   repeatedly trying to take mm->mmap_sem of the first victim, sleeping
   many times, retrying, giving up, and moving on to the next victim.
   Allowing the oom reaper to iterate through all victims can allow
   memory freeing such that an allocator may be able to drop mm->mmap_sem.
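
Condensed, the first change is just a reordering of exit_mmap(); the full
diff is at the bottom of the thread, but the shape is (locking and mmu
notifier details trimmed):

  /* sketch of the proposed exit_mmap() ordering, not the literal diff */
  if (unlikely(mm_is_oom_victim(mm))) {
          __oom_reap_task_mm(mm);                 /* reap what we can ourselves */
          set_bit(MMF_UNSTABLE, &mm->flags);      /* fence off the oom reaper */
          down_write(&mm->mmap_sem);              /* wait out a racing reaper */
          up_write(&mm->mmap_sem);
  }

  /* later in exit_mmap(), once everything has actually been freed: */
  unmap_vmas(&tlb, vma, 0, -1);
  free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
  tlb_finish_mmu(&tlb, 0, -1);
  set_bit(MMF_OOM_SKIP, &mm->flags);      /* only now is more killing justified */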

In fact, with my patch, I don't know of any condition where we kill 
additional processes unnecessarily *unless* the victim cannot be oom 
reaped or cannot complete memory freeing in the exit path within 10 
seconds.  Given how rarely oom livelock appears in practice, I think the 
10 seconds is justified because right now it is _trivial_ to oom kill many 
victims completely unnecessarily.


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-25 Thread David Rientjes
On Fri, 25 May 2018, Michal Hocko wrote:

> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.  This can happen for a variety of reasons,
> > including:
> > 
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > 
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >to stall indefinitely,
> > 
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> > 
> >  - when the mm's memory is fully mlocked.
> > 
> > When all memory is mlocked, the oom reaper will not be able to free any
> > substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > chosen unnecessarily.  This is trivial to reproduce if all eligible
> > processes on the system have mlocked their memory: the oom killer calls
> > panic() even though forward progress can be made.
> > 
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> > 
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> > 
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s.  It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
> 
> No timeouts please! The proper way to handle this problem is to simply
> teach the oom reaper to handle mlocked areas.

That's not sufficient since the oom reaper is also not able to oom reap if 
the mm has blockable mmu notifiers or all memory is shared filebacked 
memory, so it immediately sets MMF_OOM_SKIP and additional processes are 
oom killed.
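
For what it's worth, teaching the reaper about mlock would presumably
mean something like the sketch below before unmapping (an assumption
about what such a patch could look like, not an existing change), and it
still would not cover the notifier or shared file-backed cases:

  /* Assumption/sketch only: make mlocked pages reclaimable before reaping. */
  static void oom_unlock_vma(struct vm_area_struct *vma)
  {
          if (vma->vm_flags & VM_LOCKED) {
                  /*
                   * Clearing VM_LOCKED safely while only holding mmap_sem
                   * for read is itself part of why this is not as simple
                   * as it sounds.
                   */
                  vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
                  munlock_vma_pages_all(vma);     /* mm/internal.h */
          }
          /* then unmap the range as __oom_reap_task_mm() already does */
  }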

The current implementation that relies on MAX_OOM_REAP_RETRIES is acting 
as a timeout already for mm->mmap_sem, but it's doing so without 
attempting to oom reap other victims that may actually allow it to grab 
mm->mmap_sem if the allocator is waiting on a lock.
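
For reference, that existing "timeout" is roughly the following
(paraphrased and simplified from mm/oom_kill.c of this era, not a
verbatim quote): ten reap attempts a tenth of a second apart, i.e. about
one second per victim, after which MMF_OOM_SKIP is set regardless of
whether anything was freed:

  #define MAX_OOM_REAP_RETRIES 10

  static void oom_reap_task(struct task_struct *tsk)
  {
          int attempts = 0;
          struct mm_struct *mm = tsk->signal->oom_mm;

          /* Retry the mmap_sem trylock and reap a fixed number of times... */
          while (attempts++ < MAX_OOM_REAP_RETRIES &&
                 !oom_reap_task_mm(tsk, mm))
                  schedule_timeout_idle(HZ/10);

          /* ...then give up on this mm so another victim can be chosen. */
          set_bit(MMF_OOM_SKIP, &mm->flags);
  }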

The solution, as proposed, is to allow the oom reaper to iterate over all 
victims and try to free memory rather than working on each victim one by 
one and giving up.

But also note that even if oom reaping is possible, in the presence of an 
antagonist that continues to allocate memory, that it is possible to oom 
kill additional victims unnecessarily if we aren't able to complete 
free_pgtables() in exit_mmap() of the original victim.

So this patch solves all three issues: it allows a process to *fully* 
exit (including free_pgtables()) before setting MMF_OOM_SKIP, allows the 
oom reaper to act on parallel victims that may allow a victim to be 
reaped, and prevents additional processes from being killed unnecessarily 
when oom reaping isn't able to free memory (mlock, blockable mmu 
invalidates, all VM_SHARED file backed, small rss, etc).

The vast majority of the time, oom reaping can occur with this change or 
the process can reach exit_mmap() itself; oom livelock appears to be very 
rare with this patch, even for memcg-constrained oom kills with very 
tight limits, so it makes sense to wait for a prolonged period of time 
before killing additional processes unnecessarily.


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-25 Thread Michal Hocko
On Thu 24-05-18 14:22:53, David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.  This can happen for a variety of reasons,
> including:
> 
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> 
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>to stall indefinitely,
> 
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
> 
>  - when the mm's memory is fully mlocked.
> 
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily.  This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
> 
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
> 
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
> 
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s.  It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.

No timeouts please! The proper way to handle this problem is to simply
teach the oom reaper to handle mlocked areas.
-- 
Michal Hocko
SUSE Labs


Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-24 Thread Tetsuo Handa
David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.  This can happen for a variety of reasons,
> including:
> 
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> 
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>to stall indefinitely,
> 
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
> 
>  - when the mm's memory is fully mlocked.

   - when the mm's memory is fully mlocked (needs privilege) or
 fully shared (does not need privilege)

> 
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily.  This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.

s/mlocked/mlocked or shared/g

> 
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
> 
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
> 
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s.  It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.

I already proposed a simpler one at 
https://patchwork.kernel.org/patch/9877991/ .

> 
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
> 
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
> 
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.

But before thinking about your proposal, please think about how to guarantee
that the OOM reaper and the exit path can run, as discussed at
http://lkml.kernel.org/r/201805122318.hjg81246.mfvflfjooqt...@i-love.sakura.ne.jp
 .


[rfc patch] mm, oom: fix unnecessary killing of additional processes

2018-05-24 Thread David Rientjes
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:

 - when the mm's memory is fully mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily.  This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap().

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping timeout, initially set
at 10s.  It requires that the oom reaper's list becomes a properly linked
list so that other mm's may be reaped while waiting for an mm's timeout to
expire.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.

The reaping timeout is intentionally set for a substantial amount of time
since oom livelock is a very rare occurrence and it's better to optimize
for preventing additional (unnecessary) oom killing than a scenario that
is much more unlikely.

Signed-off-by: David Rientjes 
---
 include/linux/mm_types.h |  4 ++
 include/linux/sched.h    |  2 +-
 mm/mmap.c                | 12 +++---
 mm/oom_kill.c            | 85 ++--
 4 files changed, 66 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -462,6 +462,10 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
         struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+        /* When to give up on oom reaping this mm */
+        unsigned long reap_timeout;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
         pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,7 +1151,7 @@ struct task_struct {
 #endif
         int                             pagefault_disabled;
 #ifdef CONFIG_MMU
-        struct task_struct      *oom_reaper_list;
+        struct list_head        oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
         struct vm_struct        *stack_vm_area;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
         if (unlikely(mm_is_oom_victim(mm))) {
                 /*
                  * Manually reap the mm to free as much memory as possible.
-                 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-                 * this mm from further consideration.  Taking mm->mmap_sem for
-                 * write after setting MMF_OOM_SKIP will guarantee that the oom
-                 * reaper will not run on this mm again after mmap_sem is
-                 * dropped.
+                 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+                 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+                 * guarantee that the oom reaper will not run on this mm again
+                 * after mmap_sem is dropped.
                  *
                  * Nothing can be holding mm->mmap_sem here and the above call
                  * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm)
                 __oom_reap_task_mm(mm);
                 mutex_unlock(&oom_lock);
 
-                set_bit(MMF_OOM_SKIP, &mm->flags);
+                set_bit(MMF_UNSTABLE, &mm->flags);
                 down_write(&mm->mmap_sem);
                 up_write(&mm->mmap_sem);
         }
@@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm)
         unmap_vmas(&tlb, vma, 0, -1);
         free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
         tlb_finish_mmu(&tlb, 0, -1);
+        set_bit(MMF_OOM_SKIP, &mm->flags);
 
         /*
          * Walk the list again, actually closing and freeing it,
diff