Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, May 02, 2017 at 09:44:33AM +0200, Michal Hocko wrote:
> On Mon 01-05-17 21:12:35, Marc MERLIN wrote:
> > Howdy,
> >
> > Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> > really crash, but it goes into an infinite loop with
> > [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
> > More logs: https://pastebin.com/YqE4riw0
>
> I am seeing a lot of traces where tasks are waiting for an IO. I do not
> see any OOM report there. Why do you believe this is an OOM killer
> issue?

Good question. This is a followup to the problem I had in 4.8.8 until I got a patch to fix the issue. Back then, it used to OOM and later pile up I/O tasks like this. Now it doesn't OOM anymore, but tasks still pile up.

I temporarily worked around the issue by doing this:

gargamel:~# echo 0 > /proc/sys/vm/dirty_ratio
gargamel:~# echo 0 > /proc/sys/vm/dirty_background_ratio

Of course my performance is abysmal now, but I can at least run btrfs scrub without piling up enough IO to deadlock the system.

On Tue, May 02, 2017 at 07:44:47PM +0900, Tetsuo Handa wrote:
> > Any idea what I should do next?
>
> Maybe you can try collecting a list of all in-flight allocations with backtraces
> using the kmallocwd patches at
> http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> and
> http://lkml.kernel.org/r/201704272019.jeh26057.shfotmljoov...@i-love.sakura.ne.jp
> which also tracks mempool allocations.
> (Well, the
>
> - cond_resched();
> + //cond_resched();
>
> change in the latter patch would not be preferable.)

Thanks. I can give that a shot as soon as my current scrub is done; it may take another 12 to 24H at this rate.

In the meantime, as explained above, not allowing any dirty VM has worked around the problem (Linus pointed out to me in the original thread that on a lightly loaded 24GB system, even 1 or 2% could still be a lot of memory for requests to pile up in and cause issues in degenerate cases like mine). Now I'm still curious what changed between 4.8.8 + custom patches and 4.11 to cause this.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
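For anyone reproducing this, the dirty/writeback backlog described above can be watched live while a scrub runs, using standard procfs fields (nothing bcache-specific):

```shell
# Snapshot the write backlog: Dirty = pages waiting for writeback,
# Writeback = pages currently in flight to the device.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

With both ratios forced to 0 as above, the Dirty figure should stay close to zero even mid-scrub; with the default 20%/10% ratios it can legitimately climb into the gigabytes on a 24GB box.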
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 2017/05/02 13:12, Marc MERLIN wrote:
> Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> really crash, but it goes into an infinite loop with
> [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!

Wow, two of the workqueues are reaching max active.

[34777.202267] workqueue btrfs-endio-write: flags=0xe
[34777.218313]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=8/8
[34777.236548]     in-flight: 15168:btrfs_endio_write_helper, 13855:btrfs_endio_write_helper, 3360:btrfs_endio_write_helper, 14241:btrfs_endio_write_helper, 27092:btrfs_endio_write_helper, 15194:btrfs_endio_write_helper, 15169:btrfs_endio_write_helper, 27093:btrfs_endio_write_helper
[34777.316225]     delayed: btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper
[34777.450684] workqueue bcache: flags=0x8
[34779.956462]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=256/256
[34779.978283]     in-flight: 15320:cached_dev_read_done [bcache], 23385:cached_dev_read_done [bcache], 23371:cached_dev_read_done [bcache], 15321:cached_dev_read_done [bcache], 15395:cached_dev_read_done [bcache], 11101:cached_dev_read_done [bcache], 15300:cached_dev_read_done [bcache], 23349:cached_dev_read_done [bcache], 23425:cached_dev_read_done [bcache], 23399:cached_dev_read_done [bcache], 15293:cached_dev_read_done [bcache], 20529:cached_dev_read_done [bcache], 15402:cached_dev_read_done [bcache], 23422:cached_dev_read_done [bcache], 23417:cached_dev_read_done [bcache], 23409:cached_dev_read_done [bcache], 20539:cached_dev_read_done [bcache], 23431:cached_dev_read_done [bcache], 20544:cached_dev_read_done [bcache], 15355:cached_dev_read_done [bcache], 11085:cached_dev_read_done [bcache], 6511:cached_dev_read_done [bcache]

Googling for btrfs_endio_write_helper shows a stuck report with 4.8-rc5, but there seems to have been no response ( https://www.spinics.net/lists/linux-btrfs/msg58633.html ).

> Any idea what I should do next?

Maybe you can try collecting a list of all in-flight allocations with backtraces using the kmallocwd patches at
http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
and
http://lkml.kernel.org/r/201704272019.jeh26057.shfotmljoov...@i-love.sakura.ne.jp
which also tracks mempool allocations.
(Well, the

- cond_resched();
+ //cond_resched();

change in the latter patch would not be preferable.)
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon 01-05-17 21:12:35, Marc MERLIN wrote:
> Howdy,
>
> Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> really crash, but it goes into an infinite loop with
> [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
> More logs: https://pastebin.com/YqE4riw0

I am seeing a lot of traces where tasks are waiting for an IO. I do not see any OOM report there. Why do you believe this is an OOM killer issue?
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Howdy,

Well, sadly, the problem is more or less back in 4.11.0. The system doesn't really crash, but it goes into an infinite loop with
[34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
More logs: https://pastebin.com/YqE4riw0
(I upgraded from 4.8 with the custom patches you gave me, and went to 4.11.0.)

gargamel:~# cat /proc/sys/vm/dirty_ratio
2
gargamel:~# cat /proc/sys/vm/dirty_background_ratio
1
gargamel:~# free
             total       used       free     shared    buffers     cached
Mem:      24392600   16362660    8029940          0       8884   13739000
-/+ buffers/cache:    2614776   21777824
Swap:     15616764          0   15616764

And yet, I was doing a btrfs check repair on a busy filesystem, and within 40mn or so it triggered the workqueue lockup.

gargamel:~# grep CONFIG_COMPACTION /boot/config-4.11.0-amd64-preempt-sysrq-20170406
CONFIG_COMPACTION=y

Kernel config file: https://pastebin.com/7Tajse6L

To be fair, I didn't try to run btrfs check on 4.8, and now I'm busy trying to recover a filesystem that apparently got corrupted by a bad SAS driver in 4.8, which caused a lot of I/O errors and corruption. This is just to say that btrfs on top of dmcrypt on top of bcache may have been enough layers to hang on btrfs check on 4.8 too, but I can't really go back to check right now due to the driver corruption issues.

Any idea what I should do next?

Thanks,
Marc

On Tue, Nov 29, 2016 at 03:01:35PM -0800, Marc MERLIN wrote:
> On Tue, Nov 29, 2016 at 09:40:19AM -0800, Marc MERLIN wrote:
> > Thanks for the reply and suggestions.
> >
> > On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote:
> > > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote:
> > > > Now, to be fair, this is not a new problem, it's just varying degrees of
> > > > bad and usually only happens when I do a lot of I/O with btrfs.
> > >
> > > One situation where I've seen something like this happen is
> > >
> > > (a) lots and lots of dirty data queued up
> > > (b) horribly slow storage
> >
> > In my case, it is a 5x 4TB HDD with
> > software raid 5 < bcache < dmcrypt < btrfs
> > bcache is currently half disabled (as in I removed the actual cache), or
> > too many bcache requests pile up and the kernel dies when too many
> > workqueues have piled up.
> > I'm just kind of worried that since I'm going through 4 subsystems
> > before my data can hit disk, that's a lot of memory allocations and
> > places where data can accumulate and cause bottlenecks if the next
> > subsystem isn't as fast.
> >
> > But this shouldn't be "horribly slow", should it? (it does copy a few
> > terabytes per day; not fast, but not horrible, about 30MB/s or so)
> >
> > > Sadly, our defaults for "how much dirty data do we allow" are somewhat
> > > buggered. The global defaults are in "percent of memory", and are
> > > generally _much_ too high for big-memory machines:
> > >
> > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
> > > 20
> > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
> > > 10
> >
> > I can confirm I have the same.
> >
> > > says that it only starts really throttling writes when you hit 20% of
> > > all memory used. You don't say how much memory you have in that
> > > machine, but if it's the same one you talked about earlier, it was
> > > 24GB. So you can have 4GB of dirty data waiting to be flushed out.
> >
> > Correct, 24GB and 4GB.
> >
> > > And we *try* to do this per-device backing-dev congestion thing to
> > > make things work better, but it generally seems to not work very well.
> > > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
> > > does really well, and we want to open up, and then it shuts down).
> > >
> > > One thing you can try is to just make the global limits much lower. As in
> > >
> > >    echo 2 > /proc/sys/vm/dirty_ratio
> > >    echo 1 > /proc/sys/vm/dirty_background_ratio
> >
> > I will give that a shot, thank you.
>
> And, after 5H of copying, not a single hang, or USB disconnect, or anything.
> Obviously this seems to point to other problems in the code, and I have no
> idea which layer is the culprit here, but reducing the buffers absolutely
> helped a lot.
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
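Linus's echo-2/echo-1 suggestion above is percent-based; the kernel also accepts absolute limits via the *_bytes counterparts of these sysctls (setting a *_bytes value zeroes the matching *_ratio). A persistent version of such a workaround might look like this sketch; the 256MB/64MB figures are illustrative, not values from this thread:

```
# /etc/sysctl.d/99-writeback.conf (sketch; illustrative values)
vm.dirty_bytes = 268435456             # hard limit: ~256MB of dirty pages
vm.dirty_background_bytes = 67108864   # background writeback starts at ~64MB
```

Byte limits sidestep the problem Linus describes, where even 1% of a 24GB machine is still ~240MB of dirty data.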
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 12/01/2016 11:37 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe wrote:
>>
>> It's two different kinds of throttling. The vm absolutely should
>> throttle at dirty time, to avoid having insane amounts of memory dirty.
>> On the block layer side, throttling is about avoiding the device queues
>> being too long. It's very similar to the buffer bloating on the
>> networking side. The block layer throttling is not a fix for the vm
>> allowing too much memory to be dirty and causing issues, it's about
>> keeping the device response latencies in check.
>
> Sure. But if we really do just end up blocking in the block layer (in
> situations where we didn't used to), that may be a bad thing. It might
> be better to feed that information back to the VM instead,
> particularly for writes, where the VM layer already tries to ratelimit
> the writes.

It's not a new blocking point, it's the same blocking point that we always end up in if we run out of requests. The problem with bcache and other stacked drivers is that they don't have a request pool, so they never really need to block there.

> And frankly, it's almost purely writes that matter. There just aren't
> a lot of ways to get that many parallel reads in real life.

Exactly, it's almost exclusively a buffered write problem, as I wrote in the initial reply. Most other things tend to throttle nicely on their own.

> I haven't looked at your patches, so maybe you already do this.

It's currently not fed back, but that would be pretty trivial to do. The mechanism we have for that (queue congestion) is a bit of a mess, though, so it would need to be revamped a bit.
-- 
Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe wrote:
>
> It's two different kinds of throttling. The vm absolutely should
> throttle at dirty time, to avoid having insane amounts of memory dirty.
> On the block layer side, throttling is about avoiding the device queues
> being too long. It's very similar to the buffer bloating on the
> networking side. The block layer throttling is not a fix for the vm
> allowing too much memory to be dirty and causing issues, it's about
> keeping the device response latencies in check.

Sure. But if we really do just end up blocking in the block layer (in situations where we didn't used to), that may be a bad thing. It might be better to feed that information back to the VM instead, particularly for writes, where the VM layer already tries to ratelimit the writes.

And frankly, it's almost purely writes that matter. There just aren't a lot of ways to get that many parallel reads in real life.

I haven't looked at your patches, so maybe you already do this.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 12/01/2016 11:16 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet wrote:
>>
>> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
>> that this can really just live within the writeback code, I don't know why it
>> should involve the block layer at all. Plus, if I understand correctly, his code
>> has the effect of blocking in generic_make_request() to throttle, which means
>> due to the way the writeback code is structured we'll be blocking with page
>> locks held.
>
> Yeah, I do *not* believe that throttling at the block layer is at all
> the right thing to do.
>
> I do think that the block layer needs to throttle, but it needs to be
> seen as a "last resort" kind of thing, where the block layer just
> needs to limit how much it will have pending. But it should be seen as
> a failure mode, not as a write balancing issue.
>
> Because the real throttling absolutely needs to happen when things are
> marked dirty, because no block layer throttling will ever fix the
> situation where you just have too much memory dirtied that you cannot
> free because it will take a minute to write out.
>
> So throttling at a VM level is sane. Throttling at a block layer level is not.

It's two different kinds of throttling. The vm absolutely should throttle at dirty time, to avoid having insane amounts of memory dirty. On the block layer side, throttling is about avoiding the device queues being too long. It's very similar to the buffer bloating on the networking side. The block layer throttling is not a fix for the vm allowing too much memory to be dirty and causing issues, it's about keeping the device response latencies in check.
-- 
Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet wrote:
>
> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
> that this can really just live within the writeback code, I don't know why it
> should involve the block layer at all. Plus, if I understand correctly, his code
> has the effect of blocking in generic_make_request() to throttle, which means
> due to the way the writeback code is structured we'll be blocking with page
> locks held.

Yeah, I do *not* believe that throttling at the block layer is at all the right thing to do.

I do think that the block layer needs to throttle, but it needs to be seen as a "last resort" kind of thing, where the block layer just needs to limit how much it will have pending. But it should be seen as a failure mode, not as a write balancing issue.

Because the real throttling absolutely needs to happen when things are marked dirty, because no block layer throttling will ever fix the situation where you just have too much memory dirtied that you cannot free because it will take a minute to write out.

So throttling at a VM level is sane. Throttling at a block layer level is not.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 03:30:11PM -0500, Tejun Heo wrote: > Hello, > > On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > > Tejun/Kent - any way to just limit the workqueue depth for bcache? > > Because that really isn't helping, and things *will* time out and > > cause those problems when you have hundreds of IO's queued on a disk > > that likely as a write iops around ~100.. > > Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in > from bcache_init(). It's currently using 0 as @max_active and it can > set to be any arbitrary number. It'd be a very crude way to control > what looks like a buffer bloat with IOs tho. We can make it a bit > more granular by splitting workqueues per bcache instance / purpose > but for the long term the right solution seems to be hooking into > writeback throttling mechanism that block layer just grew recently. Agreed that the writeback code is the right place to do it. Within bcache we can't really do anything smarter than just throw a hard limit on the number of outstanding IOs and enforce it by blocking in generic_make_request(), and the bcache code is the wrong place to do that - we don't know what the limit should be there, and all the IOs look the same at that point so you'd probably still end up with writeback starving everything else. I could futz with the workqueue stuff, but that'd likely as not break some other workload - I've spent enough time as it is fighting with workqueue concurrency stuff in the past. My preference would be to just try and get Jens's stuff in. That said, I'm not sure how I feel about Jens's exact approach... it seems to me that this can really just live within the writeback code, I don't know why it should involve the block layer at all. plus, if I understand correctly his code has the effect of blocking in generic_make_request() to throttle, which means due to the way the writeback code is structured we'll be blocking with page locks held. 
I did my own thing in bcachefs, same idea but throttling in writepages... it's dumb and simple but it's worked exceedingly well, as far as actual usability and responsiveness: https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/fs-io.c?h=bcache-dev=acf766b2dd33b076fdce66c86363a3e26a9b70cf#n1002 that said - any kind of throttling for writeback will be a million times better than the current situation...
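The throttling Kent describes - a hard cap on the number of outstanding writeback IOs, with the submitter blocking once the cap is reached - can be sketched in userspace C. This is not the actual bcachefs fs-io.c code; the `wb_throttle` structure, the function names, and the `MAX_INFLIGHT` value are all hypothetical stand-ins for the idea:

```c
/*
 * Userspace sketch of writeback throttling by bounding in-flight IOs.
 * Not the real bcachefs code; names and the cap are hypothetical.
 */
#include <pthread.h>

#define MAX_INFLIGHT 64 /* hypothetical cap; a real one would be tunable */

struct wb_throttle {
	pthread_mutex_t lock;
	pthread_cond_t  done;   /* signalled when an IO completes */
	int inflight;
};

static void wb_throttle_init(struct wb_throttle *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->done, NULL);
	t->inflight = 0;
}

/* Called before submitting a writeback IO; blocks while at the cap. */
static void wb_submit(struct wb_throttle *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->inflight >= MAX_INFLIGHT)
		pthread_cond_wait(&t->done, &t->lock);
	t->inflight++;
	pthread_mutex_unlock(&t->lock);
}

/* Called from the IO completion path; wakes one blocked submitter. */
static void wb_complete(struct wb_throttle *t)
{
	pthread_mutex_lock(&t->lock);
	t->inflight--;
	pthread_cond_signal(&t->done);
	pthread_mutex_unlock(&t->lock);
}
```

A submitter that hits the cap simply sleeps until a completion signals `done` - the "block in writepages" shape Kent argues for, as opposed to tweaking workqueue depth after the IOs have already been queued.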
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Hello, On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > Tejun/Kent - any way to just limit the workqueue depth for bcache? > Because that really isn't helping, and things *will* time out and > cause those problems when you have hundreds of IO's queued on a disk > that likely as a write iops around ~100.. Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in from bcache_init(). It's currently using 0 as @max_active and it can set to be any arbitrary number. It'd be a very crude way to control what looks like a buffer bloat with IOs tho. We can make it a bit more granular by splitting workqueues per bcache instance / purpose but for the long term the right solution seems to be hooking into writeback throttling mechanism that block layer just grew recently. Thanks. -- tejun
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/30/2016 11:14 AM, Linus Torvalds wrote: > On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN wrote: >> >> I gave it a thought again, I think it is exactly the nasty situation you >> described. >> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now >> bcache can't handle IO as quickly and has to hang until the SSD has been >> flushed to spinning rust drives. >> This actually is exactly the same as filling up the cache on a USB key >> and now you're waiting for slow writes to flash, is it not? > > It does sound like you might hit exactly the same kind of situation, yes. > > And the fact that you have dmcrypt running too just makes things pile > up more. All those IO's end up slowed down by the scheduling too. > > Anyway, none of this seems new per se. I'm adding Kent and Jens to the > cc (Tejun already was), in the hope that maybe they have some idea how > to control the nasty worst-case behavior wrt workqueue lockup (it's > not really a "lockup", it looks like it's just hundreds of workqueues > all waiting for IO to complete and much too deep IO queues). Honestly, the easiest would be to wire it up to the blk-wbt stuff that is queued up for 4.10, which attempts to limit the queue depths to something reasonable instead of letting them run amok. This is largely (exclusively, almost) a problem with buffered writeback. On devices utilizing the stacked interface, they never get any depth throttling. Obviously it's worse if each IO ends up queueing work, but it's a big problem even if they do not. > I think it's the traditional "throughput is much easier to measure and > improve" situation, where making queues big help some throughput > situation, but ends up causing chaos when things go south. Yes, and the longer queues never buy you anything, but they end up causing tons of problems at the other end of the spectrum. Still makes sense to limit dirty memory for highmem, though. -- Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > Anyway, none of this seems new per se. I'm adding Kent and Jens to the > cc (Tejun already was), in the hope that maybe they have some idea how > to control the nasty worst-case behavior wrt workqueue lockup (it's > not really a "lockup", it looks like it's just hundreds of workqueues > all waiting for IO to complete and much too deep IO queues). I'll take your word for it, all I got in the end was Kernel panic - not syncing: Hard LOCKUP and the system stone dead when I woke up hours later. > And I think your NMI watchdog then turns the "system is no longer > responsive" into an actual kernel panic. Ah, I see. Thanks for the reply, and sorry for bringing in that separate thread from the btrfs mailing list, which effectively was a suggestion similar to what you're saying here too. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN wrote:
>
> I gave it a thought again, I think it is exactly the nasty situation you
> described.
> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now
> bcache can't handle IO as quickly and has to hang until the SSD has been
> flushed to spinning rust drives.
> This actually is exactly the same as filling up the cache on a USB key
> and now you're waiting for slow writes to flash, is it not?

It does sound like you might hit exactly the same kind of situation, yes.

And the fact that you have dmcrypt running too just makes things pile up more. All those IO's end up slowed down by the scheduling too.

Anyway, none of this seems new per se. I'm adding Kent and Jens to the cc (Tejun already was), in the hope that maybe they have some idea how to control the nasty worst-case behavior wrt workqueue lockup (it's not really a "lockup", it looks like it's just hundreds of workqueues all waiting for IO to complete and much too deep IO queues).

I think it's the traditional "throughput is much easier to measure and improve" situation, where making queues big helps some throughput situation, but ends up causing chaos when things go south.

And I think your NMI watchdog then turns the "system is no longer responsive" into an actual kernel panic.

> With your dirty ratio workaround, I was able to re-enable bcache and
> have it not fall over, but only barely. I recorded over a hundred
> workqueues in flight during the copy at some point (just not enough
> to actually kill the kernel this time).
>
> I've started a bcache followup on this here:
> http://marc.info/?l=linux-bcache=148052441423532=2
> http://marc.info/?l=linux-bcache=148052620524162=2
>
> A full traceback showing the pileup of requests is here:
> http://marc.info/?l=linux-bcache=147949497808483=2
>
> and there:
> http://pastebin.com/rJ5RKUVm
> (2 different ones but mostly the same result)

Tejun/Kent - any way to just limit the workqueue depth for bcache?
Because that really isn't helping, and things *will* time out and cause those problems when you have hundreds of IO's queued on a disk that likely has a write IOPS around ~100..

And I really wonder if we should do the "big hammer" approach to the dirty limits on non-HIGHMEM machines too (approximate the "vm_highmem_is_dirtyable" by just limiting global_dirtyable_memory() to 1 GB). That would make the default dirty limits be 100/200MB (for soft/hard throttling), which really is much more reasonable than gigabytes and gigabytes of dirty data.

Of course, no way do we do that during rc7..

Linus

 mm/page-writeback.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..26ecbdecb815 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -352,6 +352,10 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 #endif
 }
 
+/* Limit dirtyable memory to 1GB */
+#define PAGES_IN_GB(x) ((x) << (30 - PAGE_SHIFT))
+#define MAX_DIRTYABLE_LOWMEM_PAGES PAGES_IN_GB(1)
+
 /**
  * global_dirtyable_memory - number of globally dirtyable pages
  *
@@ -373,8 +377,11 @@ static unsigned long global_dirtyable_memory(void)
 	x += global_node_page_state(NR_INACTIVE_FILE);
 	x += global_node_page_state(NR_ACTIVE_FILE);
 
-	if (!vm_highmem_is_dirtyable)
+	if (!vm_highmem_is_dirtyable) {
 		x -= highmem_dirtyable_memory(x);
+		if (x > MAX_DIRTYABLE_LOWMEM_PAGES)
+			x = MAX_DIRTYABLE_LOWMEM_PAGES;
+	}
 
 	return x + 1;	/* Ensure that we never return 0 */
 }
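The arithmetic behind Linus's "100/200MB" figure can be checked with a small userspace model of the clamp. This is a simplified stand-in, not the kernel code: it assumes 4KB pages and the default 10/20 dirty ratios, and `dirtyable_pages()` only models the clamp step of global_dirtyable_memory():

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume 4KB pages */
#define PAGES_IN_GB(x) ((unsigned long)(x) << (30 - PAGE_SHIFT))
#define MAX_DIRTYABLE_LOWMEM_PAGES PAGES_IN_GB(1)

/* Simplified stand-in for global_dirtyable_memory() with the proposed
 * clamp applied: free + file pages, capped at 1GB worth of pages. */
static unsigned long dirtyable_pages(unsigned long free_and_file)
{
	unsigned long x = free_and_file;

	if (x > MAX_DIRTYABLE_LOWMEM_PAGES)
		x = MAX_DIRTYABLE_LOWMEM_PAGES;
	return x + 1; /* never return 0 */
}

/* Dirty limit in MB for a given ratio (percent), mirroring how
 * vm.dirty_ratio / vm.dirty_background_ratio are applied to the
 * dirtyable page count. */
static unsigned long dirty_limit_mb(unsigned long pages, unsigned int ratio)
{
	return ((pages * ratio / 100) << PAGE_SHIFT) >> 20;
}
```

For the 24GB machine in this thread the clamp wins: `dirtyable_pages(PAGES_IN_GB(24))` collapses to 1GB of pages, and the 20%/10% ratios then yield roughly 200MB hard and 100MB soft limits, which is where the "100/200MB" in the message comes from.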
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 10:01:10AM -0800, Linus Torvalds wrote: > On Tue, Nov 29, 2016 at 9:40 AM, Marc MERLIN wrote: > > > > In my case, it is a 5x 4TB HDD with > > software raid 5 < bcache < dmcrypt < btrfs > > It doesn't sound like the nasty situations I have seen (particularly > with large USB flash storage - often high momentary speed for > benchmarks, but slows down to a crawl after you've written a bit to > it, and doesn't have the smart garbage collection that modern "real" > SSDs have). I gave it a thought again, I think it is exactly the nasty situation you described. bcache takes I/O quickly while sending to SSD cache. SSD fills up, now bcache can't handle IO as quickly and has to hang until the SSD has been flushed to spinning rust drives. This actually is exactly the same as filling up the cache on a USB key and now you're waiting for slow writes to flash, is it not? With your dirty ratio workaround, I was able to re-enable bcache and have it not fall over, but only barely. I recorded over a hundred workqueues in flight during the copy at some point (just not enough to actually kill the kernel this time). 
I've started a bcache followup on this here:
http://marc.info/?l=linux-bcache=148052441423532=2
http://marc.info/?l=linux-bcache=148052620524162=2

This message shows the huge pileup of workqueues in bcache just before the kernel dies with

Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
task: 9ee0c2fa4180 task.stack: 9ee0c2fa8000
RIP: 0010:[] [] cpuidle_enter_state+0x119/0x171
RSP: :9ee0c2fabea0 EFLAGS: 0246
RAX: 9ee0de3d90c0 RBX: 0004 RCX: 001f RDX: RSI: 0007 RDI:
RBP: 9ee0c2fabed0 R08: 0f92 R09: 0f42 R10: 9ee0c2fabe50
R11: 071c71c71c71c71c R12: e047bfdcb200 R13: 0af626899577
R14: 0004 R15: 0af6264cc557
FS: () GS:9ee0de3c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0898b000 CR3: 00045cc06000 CR4: 001406e0
Stack:
 0f40 e047bfdcb200 bbccc060 9ee0c2fac000
 9ee0c2fa8000 9ee0c2fac000 9ee0c2fabee0 bb57a1ac
 9ee0c2fabf30 bb09238d 9ee0c2fa8000 00070004
Call Trace:
 [] cpuidle_enter+0x17/0x19
 [] cpu_startup_entry+0x210/0x28b
 [] start_secondary+0x13e/0x140
Code: 00 00 00 48 c7 c7 cd ae b2 bb c6 05 4b 8e 7a 00 01 e8 17 6c ae ff fa 66 0f 1f 44 00 00 31 ff e8 75 60 b4 44 00 00 <4c> 89 e8 b9 e8 03 00 00 4c 29 f8 48 99 48 f7 f9 ba ff ff ff 7f
Kernel panic - not syncing: Hard LOCKUP

A full traceback showing the pileup of requests is here:
http://marc.info/?l=linux-bcache=147949497808483=2

and there:
http://pastebin.com/rJ5RKUVm
(2 different ones but mostly the same result)

We can probably follow up on the bcache thread I Cc'ed you on since I'm not sure if the fault here lies with bcache or the VM subsystem anymore.

Thanks.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 2016/11/30 8:01, Marc MERLIN wrote: > And, after 5H of copying, not a single hang, or USB disconnect, or anything. > Obviously this seems to point to other problems in the code, and I have no > idea which layer is a culprit here, but reducing the buffers absolutely > helped a lot. Maybe you can try commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too long") or http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp for finding the culprit.
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 09:40:19AM -0800, Marc MERLIN wrote: > Thanks for the reply and suggestions. > > On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote: > > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > > > Now, to be fair, this is not a new problem, it's just varying degrees of > > > bad and usually only happens when I do a lot of I/O with btrfs. > > > > One situation where I've seen something like this happen is > > > > (a) lots and lots of dirty data queued up > > (b) horribly slow storage > > In my case, it is a 5x 4TB HDD with > software raid 5 < bcache < dmcrypt < btrfs > bcache is currently half disabled (as in I removed the actual cache) or > too many bcache requests pile up, and the kernel dies when too many > workqueues have piled up. > I'm just kind of worried that since I'm going through 4 subsystems > before my data can hit disk, that's a lot of memory allocations and > places where data can accumulate and cause bottlenecks if the next > subsystem isn't as fast. > > But this shouldn't be "horribly slow", should it? (it does copy a few > terabytes per day, not fast, but not horrible, about 30MB/s or so) > > > Sadly, our defaults for "how much dirty data do we allow" are somewhat > > buggered. The global defaults are in "percent of memory", and are > > generally _much_ too high for big-memory machines: > > > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio > > 20 > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio > > 10 > > I can confirm I have the same. > > > says that it only starts really throttling writes when you hit 20% of > > all memory used. You don't say how much memory you have in that > > machine, but if it's the same one you talked about earlier, it was > > 24GB. So you can have 4GB of dirty data waiting to be flushed out. > > Correct, 24GB and 4GB. > > > And we *try* to do this per-device backing-dev congestion thing to > > make things work better, but it generally seems to not work very well. 
> > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD > > does really well, and we want to open up, and then it shuts down). > > > > One thing you can try is to just make the global limits much lower. As in > > > >echo 2 > /proc/sys/vm/dirty_ratio > >echo 1 > /proc/sys/vm/dirty_background_ratio > > I will give that a shot, thank you. And, after 5H of copying, not a single hang, or USB disconnect, or anything. Obviously this seems to point to other problems in the code, and I have no idea which layer is a culprit here, but reducing the buffers absolutely helped a lot. Thanks much, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Thanks for the reply and suggestions. On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote: > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > > Now, to be fair, this is not a new problem, it's just varying degrees of > > bad and usually only happens when I do a lot of I/O with btrfs. > > One situation where I've seen something like this happen is > > (a) lots and lots of dirty data queued up > (b) horribly slow storage In my case, it is a 5x 4TB HDD with software raid 5 < bcache < dmcrypt < btrfs bcache is currently half disabled (as in I removed the actual cache), otherwise too many bcache requests pile up, and the kernel dies when too many workqueues have piled up. I'm just kind of worried that since I'm going through 4 subsystems before my data can hit disk, that's a lot of memory allocations and places where data can accumulate and cause bottlenecks if the next subsystem isn't as fast. But this shouldn't be "horribly slow", should it? (it does copy a few terabytes per day, not fast, but not horrible, about 30MB/s or so) > Sadly, our defaults for "how much dirty data do we allow" are somewhat > buggered. The global defaults are in "percent of memory", and are > generally _much_ too high for big-memory machines: > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio > 20 > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio > 10 I can confirm I have the same. > says that it only starts really throttling writes when you hit 20% of > all memory used. You don't say how much memory you have in that > machine, but if it's the same one you talked about earlier, it was > 24GB. So you can have 4GB of dirty data waiting to be flushed out. Correct, 24GB and 4GB. > And we *try* to do this per-device backing-dev congestion thing to > make things work better, but it generally seems to not work very well. > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD > does really well, and we want to open up, and then it shuts down).
> > One thing you can try is to just make the global limits much lower. As in > > echo 2 > /proc/sys/vm/dirty_ratio > > echo 1 > /proc/sys/vm/dirty_background_ratio I will give that a shot, thank you. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > Now, to be fair, this is not a new problem, it's just varying degrees of > bad and usually only happens when I do a lot of I/O with btrfs. One situation where I've seen something like this happen is (a) lots and lots of dirty data queued up (b) horribly slow storage (c) filesystem that ends up serializing on writeback under certain circumstances The usual case for (b) in the modern world is big SSD's that have bad worst-case behavior (ie they may do gbps speeds when doing well, and then they come to a screeching halt when their buffers fill up and they have to do rewrites, and their gbps throughput drops to mbps or lower). Generally you only find that kind of really nasty SSD in the USB stick world these days. The usual case for (c) is "fsync" or similar - often on a totally unrelated file - which then ends up waiting for everything else to flush too. Looks like btrfs_start_ordered_extent() does something kind of like that, where it waits for data to be flushed. The usual *fix* for this is to just not get into situation (a). Sadly, our defaults for "how much dirty data do we allow" are somewhat buggered. The global defaults are in "percent of memory", and are generally _much_ too high for big-memory machines: [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio 20 [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio 10 says that it only starts really throttling writes when you hit 20% of all memory used. You don't say how much memory you have in that machine, but if it's the same one you talked about earlier, it was 24GB. So you can have 4GB of dirty data waiting to be flushed out. And we *try* to do this per-device backing-dev congestion thing to make things work better, but it generally seems to not work very well. Possibly because of inconsistent write speeds (ie _sometimes_ the SSD does really well, and we want to open up, and then it shuts down).
One thing you can try is to just make the global limits much lower. As in

  echo 2 > /proc/sys/vm/dirty_ratio
  echo 1 > /proc/sys/vm/dirty_background_ratio

(if you want to go lower than 1%, you'll have to use the "dirty_bytes" and "dirty_background_bytes" byte limits instead of percentage limits). Obviously you'll need to be root for this, and equally obviously it's really a failure of the kernel. I'd *love* to get something like this right automatically, but sadly it depends so much on memory size, load, disk subsystem, etc etc that I despair at it. On x86-32 we "fixed" this long ago by just saying "high memory is not dirtyable", so you were always limited to a maximum of 10/20% of 1GB, rather than the full memory range. It worked better, but it's a sad kind of fix. (See commit dc6e29da9162: "Fix balance_dirty_page() calculations with CONFIG_HIGHMEM") Linus
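For concreteness, the percentage arithmetic above is easy to check. A small sketch (plain Python added for illustration; the 24GB figure and the 20%/10% and 2%/1% knob values come from this thread, and the model is simplified: the kernel computes these limits against *dirtyable* memory, not total RAM, so real thresholds are somewhat lower):

```python
# Model of the vm.dirty_ratio / vm.dirty_background_ratio percentage knobs.
# Simplification: real kernels use "dirtyable" memory rather than total RAM.

def dirty_limits(mem_bytes, dirty_ratio=20, background_ratio=10):
    """Return (throttle_limit, background_limit) in bytes for the given
    percentage knobs (defaults match the values quoted in the thread)."""
    return (mem_bytes * dirty_ratio // 100,
            mem_bytes * background_ratio // 100)

mem = 24 * 1024**3  # the 24GB machine discussed above
hard, background = dirty_limits(mem)
print(f"throttle writers above : {hard / 1024**3:.1f} GiB dirty")   # ~4.8 GiB
print(f"background writeback at: {background / 1024**3:.1f} GiB")   # ~2.4 GiB

# The suggested 2%/1% shrinks the ceilings by an order of magnitude:
hard2, background2 = dirty_limits(mem, 2, 1)
print(f"with 2%/1%: {hard2 // 2**20} MiB / {background2 // 2**20} MiB")
```

This is where the "almost 5GB of dirty pending data" figure earlier in the thread comes from: 20% of 24GB is roughly 4.8GiB before writers are even throttled.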
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 05:25:15PM +0100, Michal Hocko wrote: > On Tue 22-11-16 17:38:01, Greg KH wrote: > > On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote: > > > On 11/22/2016 05:06 PM, Marc MERLIN wrote: > > > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote: > > > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote: > > > 4.9rc5 however seems to be doing better, and is still running after > > > 18 > > > hours. However, I got a few page allocation failures as per below, > > > but the > > > system seems to recover. > > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 > > > days) > > > or is that good enough, and i should go back to 4.8.8 with that > > > patch applied? > > > https://marc.info/?l=linux-mm&m=147423605024993 > > > >>> > > > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying > > > >>> 4.8 with that patch, yeah. > > > >> > > > >> So the good news is that it's been running for almost 5H and so far so > > > >> good. > > > > > > > > And the better news is that the copy is still going strong, 4.4TB and > > > > going. So 4.8.8 is fixed with that one single patch as far as I'm > > > > concerned. > > > > > > > > So thanks for that, looks good to me to merge. > > > > > > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is > > > already EOL AFAICS). > > > > > > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect. > > > - alternatively a simpler (again 4.8-only) patch that just outright > > > prevents OOM for 0 < order < costly, as Michal already suggested. > > > - backport 10+ compaction patches to 4.8 stable > > > - something else? > > > > Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is > > released? :) > > OK, so can we push this through to 4.8 before EOL and make sure there > won't be any additional pre-mature high order OOM reports? The patch > should be simple enough and safe for the stable tree.
> There is no upstream commit because 4.9 is fixed in a different way which would be > way too intrusive for the stable backport. Now queued up, thanks! greg k-h
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 05:07:51PM +0100, Michal Hocko wrote: > On Tue 29-11-16 07:55:37, Marc MERLIN wrote: > > On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote: > > > Marc, could you try this patch please? I think it should be pretty clear > > > it should help you but running it through your use case would be more > > > than welcome before I ask Greg to take this to the 4.8 stable tree. > > > > I ran it overnight and copied 1.4TB with it before it failed because > > there wasn't enough disk space on the other side, so I think it fixes > > the problem too. > > Can I add your Tested-by? Done. Now, probably unrelated, but hard to be sure, doing those big copies causes massive hangs on my system. I hit a few of the 120s hangs, but more generally lots of things hang, including shells, my DNS server, monitoring reading from USB and timing out, and so forth. Examples below. I have a hard time telling what is at fault, but is there a chance it might be memory allocation pressure? I already have a preempt kernel, so I can't make it more preempt than that. Now, to be fair, this is not a new problem, it's just varying degrees of bad and usually only happens when I do a lot of I/O with btrfs. That said, btrfs may very well just be suffering from memory allocation issues and hanging as a result, with everything else on my system also hanging for similar reasons until the memory pressure goes away when the copy or scrub is finished. What do you think? [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 I get other hangs like: [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 17:38:01, Greg KH wrote: > On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote: > > On 11/22/2016 05:06 PM, Marc MERLIN wrote: > > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote: > > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote: > > 4.9rc5 however seems to be doing better, and is still running after 18 > > hours. However, I got a few page allocation failures as per below, but > > the > > system seems to recover. > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 > > days) > > or is that good enough, and i should go back to 4.8.8 with that patch > > applied? > > https://marc.info/?l=linux-mm&m=147423605024993 > > >>> > > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying > > >>> 4.8 with that patch, yeah. > > >> > > >> So the good news is that it's been running for almost 5H and so far so > > >> good. > > > > > > And the better news is that the copy is still going strong, 4.4TB and > > > going. So 4.8.8 is fixed with that one single patch as far as I'm > > > concerned. > > > > > > So thanks for that, looks good to me to merge. > > > > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is > > already EOL AFAICS). > > > > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect. > > - alternatively a simpler (again 4.8-only) patch that just outright > > prevents OOM for 0 < order < costly, as Michal already suggested. > > - backport 10+ compaction patches to 4.8 stable > > - something else? > > Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is > released? :) OK, so can we push this through to 4.8 before EOL and make sure there won't be any additional pre-mature high order OOM reports? The patch should be simple enough and safe for the stable tree. There is no upstream commit because 4.9 is fixed in a different way which would be way too intrusive for the stable backport.
---
>From 02306e8d593fa8a48d620e0c9d63a934ca8366d8 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 23 Nov 2016 07:26:30 +0100
Subject: [PATCH] mm, oom: stop pre-mature high-order OOM killer invocations

31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
killer invocation for high order requests. It seemed to work for most
users just fine but it is far from bullet proof and obviously not
sufficient for Marc who has reported pre-mature OOM killer invocations
with 4.8 based kernels. 4.9 with all the compaction improvements seems
to be behaving much better but that would be too intrusive to backport
to 4.8 stable kernels. Instead this patch simply never declares OOM for
!costly high order requests. We rely on order-0 requests to do that in
case we are really out of memory. Order-0 requests are much more common
and so a risk of a livelock without any way forward is highly unlikely.

Reported-by: Marc MERLIN
Tested-by: Marc MERLIN
Signed-off-by: Michal Hocko
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate a lack of reliable compaction
+	 * operation. We cannot simply go OOM with the current state of the compaction
+	 * code because this can lead to pre mature OOM declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
-- 
2.10.2

-- 
Michal Hocko
SUSE Labs
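To make the patch's effect easier to follow, here is a toy model of the retry decision (hypothetical Python written for this note, not kernel code; `compaction_made_progress` is a made-up stand-in for 4.8's unreliable compaction feedback, not a real kernel function):

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # kernel constant: orders above this are "costly"

def should_retry_compaction(order, patched=True):
    """Simplified model of should_compact_retry(): decide whether a failed
    high-order allocation keeps retrying or may fall through to OOM."""
    if order == 0 or order > PAGE_ALLOC_COSTLY_ORDER:
        # order-0 and costly orders are handled elsewhere; no retry here
        return False
    if patched:
        # the 4.8-stable workaround: never let a !costly high-order request
        # declare OOM; leave that to the far more common order-0 requests
        return True
    # unpatched: trust compaction's progress feedback, which on 4.8 could
    # wrongly report "no way forward" and trigger a premature OOM kill
    return compaction_made_progress(order)

def compaction_made_progress(order):
    return False  # stand-in: models the pessimistic 4.8 feedback

for order in range(5):
    print(order, should_retry_compaction(order))
```

With `patched=True`, orders 1 through 3 always retry; that is exactly the livelock-versus-OOM trade-off the commit message argues is safe, because order-0 allocations will still declare OOM when memory is truly exhausted.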
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote: > Marc, could you try this patch please? I think it should be pretty clear > it should help you but running it through your use case would be more > than welcome before I ask Greg to take this to the 4.8 stable tree. > > Thanks! > > On Wed 23-11-16 07:34:10, Michal Hocko wrote: > [...] > > commit b2ccdcb731b666aa28f86483656c39c5e53828c7 > > Author: Michal Hocko > > Date: Wed Nov 23 07:26:30 2016 +0100 > > > > mm, oom: stop pre-mature high-order OOM killer invocations > > > > 31e49bfda184 ("mm, oom: protect !costly allocations some more for > > !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM > > killer invocation for high order requests. It seemed to work for most > > users just fine but it is far from bullet proof and obviously not > > sufficient for Marc who has reported pre-mature OOM killer invocations > > with 4.8 based kernels. 4.9 with all the compaction improvements seems > > to be behaving much better but that would be too intrusive to backport > > to 4.8 stable kernels. Instead this patch simply never declares OOM for > > !costly high order requests. We rely on order-0 requests to do that in > > case we are really out of memory. Order-0 requests are much more common > > and so a risk of a livelock without any way forward is highly unlikely. > > > > Reported-by: Marc MERLIN > > Signed-off-by: Michal Hocko Tested-by: Marc MERLIN Marc > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index a2214c64ed3c..7401e996009a 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, > > unsigned int order, int alloc_fla > > if (!order || order > PAGE_ALLOC_COSTLY_ORDER) > > return false; > > > > +#ifdef CONFIG_COMPACTION > > + /* > > +* This is a gross workaround to compensate a lack of reliable > > compaction > > +* operation.
We cannot simply go OOM with the current state of the > > compaction > > +* code because this can lead to pre mature OOM declaration. > > +*/ > > + if (order <= PAGE_ALLOC_COSTLY_ORDER) > > + return true; > > +#endif > > + > > /* > > * There are setups with compaction disabled which would prefer to loop > > * inside the allocator rather than hit the oom killer prematurely. > > -- > > Michal Hocko > > SUSE Labs > > -- > Michal Hocko > SUSE Labs > -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 29-11-16 07:55:37, Marc MERLIN wrote:
> On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> > Marc, could you try this patch please? I think it should be pretty clear
> > it should help you but running it through your use case would be more
> > than welcome before I ask Greg to take this to the 4.8 stable tree.
>
> I ran it overnight and copied 1.4TB with it before it failed because
> there wasn't enough disk space on the other side, so I think it fixes
> the problem too.

Can I add your Tested-by?
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

I ran it overnight and copied 1.4TB with it before it failed because
there wasn't enough disk space on the other side, so I think it fixes
the problem too.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

This will take a little while: the whole copy took 5 days to finish and
I'm a bit hesitant about blowing it away and starting over :) Let me see
if I can come up with maybe another disk array for another test.

For now, as a reminder, I'm running the attached patch, and it works
fine. I'll report back as soon as I can.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..9b3b3a79c58a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3347,17 +3347,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		int check_order = order;
+		unsigned long watermark = min_wmark_pages(zone);
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
+		if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER) {
+			check_order = 0;
+			watermark += 1UL << order;
+		}
+
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
 		 * available?
 		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+		if (__zone_watermark_ok(zone, check_order, watermark,
 				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of
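[Editor's note: a minimal sketch of the heuristic in the attached patch, to make the diff easier to follow. For a non-costly high-order request, it checks the order-0 watermark raised by the size of the requested block instead of the high-order watermark, so the retry decision stops depending on free-memory fragmentation. This is not kernel code; the function name is illustrative, and lowmem reserves plus the high-order free-list walk of __zone_watermark_ok() are elided.]

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # same threshold the kernel uses

def should_retry_reclaim(order: int, free_pages: int, min_watermark: int) -> bool:
    """Simplified model of the watermark check in the attached patch."""
    check_order = order
    watermark = min_watermark
    if 0 < order <= PAGE_ALLOC_COSTLY_ORDER:
        # Treat the request as order-0, but demand enough extra headroom
        # to cover the whole 2^order block being asked for.
        check_order = 0
        watermark += 1 << order
    # Stand-in for __zone_watermark_ok(): for an order-0 check this
    # reduces to a plain free-pages comparison.
    return free_pages > watermark
```

So an order-2 request with 100 free pages against a min watermark of 50 retries (100 > 54), while the same request with only 53 free pages gives up and falls through to the OOM path.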
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/22/2016 10:46 PM, Simon Kirby wrote:

On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:

On 11/22/2016 05:06 PM, Marc MERLIN wrote:

On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:

On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:

4.9rc5 however seems to be doing better, and is still running after 18
hours. However, I got a few page allocation failures as per below, but
the system seems to recover. Vlastimil, do you want me to continue the
copy on 4.9 (may take 3-5 days), or is that good enough and I should go
back to 4.8.8 with that patch applied?
https://marc.info/?l=linux-mm&m=147423605024993

Hi, I think it's enough for 4.9 for now and I would appreciate trying
4.8 with that patch, yeah.

So the good news is that it's been running for almost 5H and so far so
good. And the better news is that the copy is still going strong, 4.4TB
and going. So 4.8.8 is fixed with that one single patch as far as I'm
concerned. So thanks for that, looks good to me to merge.

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
already EOL AFAICS.)
- send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
- alternatively a simpler (again 4.8-only) patch that just outright
  prevents OOM for 0 < order < costly, as Michal already suggested.
- backport 10+ compaction patches to 4.8 stable
- something else?
Michal? Linus?
[1] https://marc.info/?l=linux-mm&m=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting
vm/watermark_scale_factor to 500, and threw that in sysctl. This was on
the MythTV box that OOMs everything after about a day on 4.8 otherwise.
I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but
just realized I forgot to remove the watermark_scale_factor workaround.
I've restored that now, so I'll see if it becomes unhappy by tomorrow.

Thanks for the testing. Could you now try Michal's stable candidate [1]
from this thread please?

[1] http://marc.info/?l=linux-mm&m=147988285831283&w=2

I also threw up a few other things you had asked for (vmstat, zoneinfo
before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/
(that was before booting into a rebuild with [1] applied)

Simon-
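[Editor's note: for context on Simon's workaround, vm.watermark_scale_factor is expressed in fractions of 10,000 of a zone's memory (per the kernel's sysctl/vm documentation), so raising it from the default 10 to 500 widens the reserve kswapd keeps between watermarks from roughly 0.1% to 5% of the zone. A rough back-of-the-envelope sketch; the helper name and the example zone size are illustrative, not from the thread.]

```python
# Rough model of how vm.watermark_scale_factor scales the distance
# between a zone's watermarks (unit: fractions of 10,000 of zone pages).
def watermark_gap(zone_managed_pages: int, scale_factor: int) -> int:
    return zone_managed_pages * scale_factor // 10_000

zone_pages = (6 * 1024**3) // 4096            # a ~6 GB zone of 4 KiB pages
default_gap = watermark_gap(zone_pages, 10)   # kernel default: ~0.1% of zone
tuned_gap = watermark_gap(zone_pages, 500)    # Simon's setting: ~5% of zone
```

With ~50x more free memory held in reserve, kswapd starts reclaiming much earlier, which plausibly masks the premature high-order OOM behavior being debugged here.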
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Marc, could you try this patch please? I think it should be pretty clear
it should help you but running it through your use case would be more
than welcome before I ask Greg to take this to the 4.8 stable tree.

Thanks!

On Wed 23-11-16 07:34:10, Michal Hocko wrote:
[...]
> commit b2ccdcb731b666aa28f86483656c39c5e53828c7
> Author: Michal Hocko
> Date:   Wed Nov 23 07:26:30 2016 +0100
>
>     mm, oom: stop pre-mature high-order OOM killer invocations
>
>     31e49bfda184 ("mm, oom: protect !costly allocations some more for
>     !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
>     OOM killer invocation for high-order requests. It seemed to work for
>     most users just fine but it is far from bulletproof and obviously not
>     sufficient for Marc, who has reported premature OOM killer invocations
>     with 4.8-based kernels. 4.9 with all the compaction improvements seems
>     to be behaving much better, but that would be too intrusive to backport
>     to 4.8 stable kernels. Instead this patch simply never declares OOM for
>     !costly high-order requests. We rely on order-0 requests to do that in
>     case we are really out of memory. Order-0 requests are much more common
>     and so the risk of a livelock without any way forward is highly unlikely.
>
>     Reported-by: Marc MERLIN
>     Signed-off-by: Michal Hocko
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a2214c64ed3c..7401e996009a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
>  		return false;
>  
> +#ifdef CONFIG_COMPACTION
> +	/*
> +	 * This is a gross workaround to compensate for the lack of a reliable
> +	 * compaction operation. We cannot simply go OOM with the current state
> +	 * of the compaction code because this can lead to a premature OOM
> +	 * declaration.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +#endif
> +
>  	/*
>  	 * There are setups with compaction disabled which would prefer to loop
>  	 * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/23/2016 07:34 AM, Michal Hocko wrote:

On Tue 22-11-16 11:38:47, Linus Torvalds wrote:

On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
already EOL AFAICS.)
- send the patch [1] as 4.8-only stable.

I think that's the right thing to do. It's pretty small, and the
argument that it changes the oom logic too much is pretty bogus, I
think. The oom logic in 4.8 is simply broken. Let's get it fixed.
Changing it is the point.

The point I've tried to make is that it is not should_reclaim_retry
which is broken. It's an overly optimistic reliance on compaction to do
its work which led to all those issues. My previous fix 31e49bfda184
("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") tried to cope with that by checking the order-0
watermark, which has proven to help most users. Now it didn't cover
everybody, obviously.

Rather than fiddling with fine tuning of these heuristics I think it
would be safer to simply admit that high-order OOM detection doesn't
work in the 4.8 kernel, and so not declare the OOM killer for those
requests at all. The risk of such a change is not big because there
usually are order-0 requests happening all the time, so if we are
really OOM we would trigger the OOM eventually.

So I am proposing this for the 4.8 stable tree instead
---
commit b2ccdcb731b666aa28f86483656c39c5e53828c7
Author: Michal Hocko
Date:   Wed Nov 23 07:26:30 2016 +0100

    mm, oom: stop pre-mature high-order OOM killer invocations

    31e49bfda184 ("mm, oom: protect !costly allocations some more for
    !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
    OOM killer invocation for high-order requests. It seemed to work for
    most users just fine but it is far from bulletproof and obviously not
    sufficient for Marc, who has reported premature OOM killer invocations
    with 4.8-based kernels. 4.9 with all the compaction improvements seems
    to be behaving much better, but that would be too intrusive to backport
    to 4.8 stable kernels. Instead this patch simply never declares OOM for
    !costly high-order requests. We rely on order-0 requests to do that in
    case we are really out of memory. Order-0 requests are much more common
    and so the risk of a livelock without any way forward is highly unlikely.

    Reported-by: Marc MERLIN
    Signed-off-by: Michal Hocko

This should effectively restore the 4.6 logic, so I'm fine with it for
stable, if it passes testing.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate for the lack of a reliable
+	 * compaction operation. We cannot simply go OOM with the current state
+	 * of the compaction code because this can lead to a premature OOM
+	 * declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed 23-11-16 14:53:12, Hillf Danton wrote:
> On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> >  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> >  		return false;
> >  
> > +#ifdef CONFIG_COMPACTION
> > +	/*
> > +	 * This is a gross workaround to compensate for the lack of a reliable
> > +	 * compaction operation. We cannot simply go OOM with the current state
> > +	 * of the compaction code because this can lead to a premature OOM
> > +	 * declaration.
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
>
> No need to check order once more.

Yes, a simple return true would be sufficient, but I wanted the code to
be more obvious.

> Plus can we retry without CONFIG_COMPACTION enabled?

Yes, checking the order-0 watermark was the original implementation of
the high-order retry without compaction enabled. I do not remember any
reports for that, so I didn't want to touch that path.
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
>  		return false;
>  
> +#ifdef CONFIG_COMPACTION
> +	/*
> +	 * This is a gross workaround to compensate for the lack of a reliable
> +	 * compaction operation. We cannot simply go OOM with the current state
> +	 * of the compaction code because this can lead to a premature OOM
> +	 * declaration.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)

No need to check order once more.

Plus, can we retry without CONFIG_COMPACTION enabled?

> +		return true;
> +#endif
> +
>  	/*
>  	 * There are setups with compaction disabled which would prefer to loop
>  	 * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 11:38:47, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:
> >
> > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> > already EOL AFAICS.)
> >
> > - send the patch [1] as 4.8-only stable.
>
> I think that's the right thing to do. It's pretty small, and the
> argument that it changes the oom logic too much is pretty bogus, I
> think. The oom logic in 4.8 is simply broken. Let's get it fixed.
> Changing it is the point.

The point I've tried to make is that it is not should_reclaim_retry
which is broken. It's an overly optimistic reliance on compaction to do
its work which led to all those issues. My previous fix 31e49bfda184
("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") tried to cope with that by checking the order-0
watermark, which has proven to help most users. Now it didn't cover
everybody, obviously.

Rather than fiddling with fine tuning of these heuristics I think it
would be safer to simply admit that high-order OOM detection doesn't
work in the 4.8 kernel, and so not declare the OOM killer for those
requests at all. The risk of such a change is not big because there
usually are order-0 requests happening all the time, so if we are
really OOM we would trigger the OOM eventually.

So I am proposing this for the 4.8 stable tree instead
---
commit b2ccdcb731b666aa28f86483656c39c5e53828c7
Author: Michal Hocko
Date:   Wed Nov 23 07:26:30 2016 +0100

    mm, oom: stop pre-mature high-order OOM killer invocations

    31e49bfda184 ("mm, oom: protect !costly allocations some more for
    !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
    OOM killer invocation for high-order requests. It seemed to work for
    most users just fine but it is far from bulletproof and obviously not
    sufficient for Marc, who has reported premature OOM killer invocations
    with 4.8-based kernels. 4.9 with all the compaction improvements seems
    to be behaving much better, but that would be too intrusive to backport
    to 4.8 stable kernels. Instead this patch simply never declares OOM for
    !costly high-order requests. We rely on order-0 requests to do that in
    case we are really out of memory. Order-0 requests are much more common
    and so the risk of a livelock without any way forward is highly unlikely.

    Reported-by: Marc MERLIN
    Signed-off-by: Michal Hocko

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate for the lack of a reliable
+	 * compaction operation. We cannot simply go OOM with the current state
+	 * of the compaction code because this can lead to a premature OOM
+	 * declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> 4.9rc5 however seems to be doing better, and is still running after 18
> hours. However, I got a few page allocation failures as per below, but the
> system seems to recover.
> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> or is that good enough, and i should go back to 4.8.8 with that patch
> applied?
> https://marc.info/?l=linux-mm=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so
> >> good.
> >
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?
>
> [1] https://marc.info/?l=linux-mm=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting vm/watermark_scale_factor to 500, and threw that in sysctl. This was on the MythTV box that OOMs everything after about a day on 4.8 otherwise.

I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but just realized I forgot to remove the watermark_scale_factor workaround. I've restored that now, so I'll see if it becomes unhappy by tomorrow.

I also threw up a few other things you had asked for (vmstat, zoneinfo before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/ (that was before booting into a rebuild with [1] applied)

Simon-
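Simon's workaround raises the kswapd watermarks so background reclaim kicks in much earlier. A sketch of applying it, assuming root and the standard sysctl paths (the drop-in file name here is made up for illustration):

```shell
# Apply at runtime; the kernel default is 10, Simon used 500.
sysctl -w vm.watermark_scale_factor=500

# Persist across reboots via a sysctl drop-in (hypothetical file name).
echo 'vm.watermark_scale_factor = 500' > /etc/sysctl.d/99-watermark-workaround.conf
sysctl --system    # reload all sysctl configuration files
```

This trades some memory otherwise available for caching against more reclaim headroom, which is presumably why it papered over the premature-OOM behavior on this box.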
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable.

I think that's the right thing to do. It's pretty small, and the argument that it changes the oom logic too much is pretty bogus, I think. The oom logic in 4.8 is simply broken. Let's get it fixed. Changing it is the point.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:25:44PM +0100, Michal Hocko wrote:
> currently AFAIR. I hate that Marc is not falling into that category but
> is it really problem for you to run with 4.9? If we have more users

Don't do anything just on my account. I had a problem, and it's been fixed in 2 different ways: 4.8+patch, or 4.9rc5.

For me this was a 100% regression from 4.6; there was just no way I could copy my data at all with 4.8. It not only failed, but killed all the services on my machine until it randomly killed the shell that was doing the copy.

Personally, I'll stick with 4.8 + this patch, and switch to 4.9 when it's out (I'm a bit wary of RC kernels on a production server, especially when I'm in the middle of trying to get my only good backup to work again).

But at the same time, what I'm doing is probably not common (btrfs on top of dmcrypt, on top of bcache, on top of swraid5, for both source and destination), so I can't comment on whether the fix I just put on my 4.8 kernel causes other regressions or problems for other people.

Either way, I'm personally ok again now, so I thank you all for your help, and will leave the hard decisions to you :)

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?

Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is released? :)

thanks,

greg k-h
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 17:14:02, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?

Dunno. To be honest, I do not like [1] because it seriously tweaks the retry logic, and 10+ compaction patches for 4.8 seems too much for a stable tree and quite risky as well. Considering that 4.9 works just much better, is there any strong reason to do a 4.8-specific fix at all? Most users reporting OOM regressions seemed to be ok with what 4.8 does currently, AFAIR. I hate that Marc is not falling into that category, but is it really a problem for you to run with 4.9?

If we have more users seeing this regression, then I would rather go with a simpler 4.8-only "never trigger OOM for order > 0 && order < costly" patch, because that would at least have deterministic behavior.

> [1] https://marc.info/?l=linux-mm=147423605024993

--
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
>> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
>>>> 4.9rc5 however seems to be doing better, and is still running after 18
>>>> hours. However, I got a few page allocation failures as per below, but the
>>>> system seems to recover.
>>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
>>>> or is that good enough, and i should go back to 4.8.8 with that patch
>>>> applied?
>>>> https://marc.info/?l=linux-mm=147423605024993
>>>
>>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
>>> 4.8 with that patch, yeah.
>>
>> So the good news is that it's been running for almost 5H and so far so good.
>
> And the better news is that the copy is still going strong, 4.4TB and
> going. So 4.8.8 is fixed with that one single patch as far as I'm
> concerned.
>
> So thanks for that, looks good to me to merge.

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is already EOL AFAICS.)

- send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
- alternatively a simpler (again 4.8-only) patch that just outright prevents OOM for 0 < order < costly, as Michal already suggested.
- backport 10+ compaction patches to 4.8 stable
- something else?

Michal? Linus?

[1] https://marc.info/?l=linux-mm=147423605024993

> Marc
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > > 4.9rc5 however seems to be doing better, and is still running after 18
> > > hours. However, I got a few page allocation failures as per below, but the
> > > system seems to recover.
> > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > > or is that good enough, and i should go back to 4.8.8 with that patch
> > > applied?
> > > https://marc.info/?l=linux-mm=147423605024993
> >
> > Hi, I think it's enough for 4.9 for now and I would appreciate trying
> > 4.8 with that patch, yeah.
>
> So the good news is that it's been running for almost 5H and so far so good.

And the better news is that the copy is still going strong, 4.4TB and going. So 4.8.8 is fixed with that one single patch as far as I'm concerned.

So thanks for that, looks good to me to merge.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > 4.9rc5 however seems to be doing better, and is still running after 18
> > hours. However, I got a few page allocation failures as per below, but the
> > system seems to recover.
> > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > or is that good enough, and i should go back to 4.8.8 with that patch
> > applied?
> > https://marc.info/?l=linux-mm=147423605024993
>
> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> 4.8 with that patch, yeah.

So the good news is that it's been running for almost 5H and so far so good.

> The failures below are in a GFP_NOWAIT context, which cannot do any
> reclaim so it's not affected by OOM rewrite. If it's a regression, it
> has to be caused by something else. But it seems the code in
> cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
> reserves, and has a fallback scenario for allocation failure, in which
> case I would argue that it should add __GFP_NOWARN, as these warnings
> can't help anyone. CCing Tejun as author of commit d4aad7ff0.

No, that's not a regression, I get those on occasion. The good news is that they're not fatal. Just got another one with 4.8.8. No idea if they're actual errors I should worry about, or just warnings that spam the console a bit, but things retry, recover and succeed, so I can ignore them. Another one from 4.8.8 below.

I'll report back tomorrow to see if this has run for a day and, if so, I'll call your patch a fix for my problem (but at this point, it's already looking very good).

Thanks,
Marc

cron: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
CPU: 4 PID: 9748 Comm: cron Tainted: G U 4.8.8-amd64-volpreempt-sysrq-20161108vb2 #9
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
 a1e37429f6d0 9a36a0bb a1e37429f768 9a1359d4 022040009f5e8d00 0012 9a140770
Call Trace:
 [] dump_stack+0x61/0x7d
 [] warn_alloc_failed+0x11c/0x132
 [] ? wakeup_kswapd+0x8e/0x153
 [] __alloc_pages_nodemask+0x87b/0xb02
 [] ? __alloc_pages_nodemask+0x87b/0xb02
 [] cache_grow_begin+0xb2/0x30b
 [] fallback_alloc+0x137/0x19f
 [] cache_alloc_node+0xd3/0xde
 [] kmem_cache_alloc_node+0x8e/0x163
 [] cfq_get_queue+0x162/0x29d
 [] ? kmem_cache_alloc+0xd7/0x14b
 [] ? slab_post_alloc_hook+0x5b/0x66
 [] cfq_set_request+0x141/0x2be
 [] ? timekeeping_get_ns+0x1e/0x32
 [] ? ktime_get+0x41/0x52
 [] ? ktime_get_ns+0x9/0xb
 [] ? cfq_init_icq+0x12/0x19
 [] elv_set_request+0x1f/0x24
 [] get_request+0x324/0x5aa
 [] ? wake_up_atomic_t+0x2c/0x2c
 [] blk_queue_bio+0x19f/0x28c
 [] generic_make_request+0xbd/0x160
 [] submit_bio+0x100/0x11d
 [] ? map_swap_page+0x12/0x14
 [] ? get_swap_bio+0x57/0x6c
 [] swap_readpage+0x110/0x118
 [] read_swap_cache_async+0x26/0x2d
 [] swapin_readahead+0x11a/0x16a
 [] do_swap_page+0x9c/0x431
 [] ? do_swap_page+0x9c/0x431
 [] handle_mm_fault+0xa4d/0xb3d
 [] ? vfs_getattr_nosec+0x26/0x37
 [] __do_page_fault+0x267/0x43d
 [] do_page_fault+0x25/0x27
 [] page_fault+0x28/0x30
Mem-Info:
active_anon:532194 inactive_anon:133376 isolated_anon:0
 active_file:4118244 inactive_file:382010 isolated_file:0
 unevictable:1687 dirty:3502 writeback:386111 unstable:0
 slab_reclaimable:41767 slab_unreclaimable:106595
 mapped:512496 shmem:582026 pagetables:5352 bounce:0
 free:92092 free_pcp:176 free_cma:2072
Node 0 active_anon:2128776kB inactive_anon:533504kB active_file:16472976kB inactive_file:1528040kB unevictable:6748kB isolated(anon):0kB isolated(file):0kB mapped:2049984kB dirty:14008kB writeback:154kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2328104kB writeback_tmp:0kB unstable:0kB pages_scanned:1 all_unreclaimable? no
Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3200 23767 23767 23767
Node 0 DMA32 free:117580kB min:35424kB low:44280kB high:53136kB active_anon:3980kB inactive_anon:400kB active_file:2632672kB inactive_file:286956kB unevictable:0kB writepending:288296kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:41632kB slab_unreclaimable:19512kB kernel_stack:880kB pagetables:676kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 20567 20567 20567
Node 0 Normal free:234904kB min:226544kB low:283180kB high:339816kB active_anon:2124796kB inactive_anon:533104kB active_file:13840304kB inactive_file:1241268kB unevictable:6748kB writepending:1270156kB present:21485568kB managed:21080636kB mlocked:6748kB slab_reclaimable:125436kB
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/21/2016 04:43 PM, Marc MERLIN wrote:
> Howdy,
>
> As a followup to https://plus.google.com/u/0/+MarcMERLIN/posts/A3FrLVo3kc6
>
> http://pastebin.com/yJybSHNq and http://pastebin.com/B6xEH4Dw
> show a system with plenty of RAM (24GB) falling over and killing innocent
> user space apps, a few hours after I start a 9TB copy between 2 raid5
> arrays, both hosting bcache, dmcrypt and btrfs (yes, that's 3 layers
> under btrfs).
>
> This kind of stuff worked until 4.6 if I'm not mistaken, and started
> failing with 4.8 (I didn't try 4.7).
>
> I tried applying
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9f7e3387939b036faacf4e7f32de7bb92a6635d6
> to 4.8.8 and it didn't help:
> http://pastebin.com/2LUicF3k
>
> 4.9rc5 however seems to be doing better, and is still running after 18
> hours. However, I got a few page allocation failures as per below, but the
> system seems to recover.
> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> or is that good enough, and I should go back to 4.8.8 with that patch
> applied?
> https://marc.info/?l=linux-mm&m=147423605024993

Hi, I think it's enough for 4.9 for now and I would appreciate trying
4.8 with that patch, yeah.

The failures below are in a GFP_NOWAIT context, which cannot do any
reclaim, so it's not affected by the OOM rewrite. If it's a regression, it
has to be caused by something else. But it seems the code in
cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
reserves, and has a fallback scenario for allocation failure, in which
case I would argue that it should add __GFP_NOWARN, as these warnings
can't help anyone. CCing Tejun as author of commit d4aad7ff0.
>
> Thanks,
> Marc
>
>
> bash: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
> CPU: 4 PID: 16706 Comm: bash Not tainted 4.9.0-rc5-amd64-volpreempt-sysrq-20161108 #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
>  9812088ff680 9a36f697 9aababe8
>  9812088ff710 9a13ae2b 02204012 9aababe8
>  9812088ff6a8 0010 9812088ff720 9812088ff6c0
> Call Trace:
>  [] dump_stack+0x61/0x7d
>  [] warn_alloc+0x107/0x11b
>  [] __alloc_pages_slowpath+0x727/0x8f2
>  [] ? get_page_from_freelist+0x62e/0x66f
>  [] __alloc_pages_nodemask+0x15c/0x220
>  [] cache_grow_begin+0xb2/0x308
>  [] fallback_alloc+0x137/0x19f
>  [] cache_alloc_node+0xd3/0xde
>  [] kmem_cache_alloc_node+0x8e/0x163
>  [] cfq_get_queue+0x162/0x29d
>  [] ? kmem_cache_alloc+0xd7/0x14b
>  [] ? mempool_alloc_slab+0x15/0x17
>  [] ? mempool_alloc+0x69/0x132
>  [] cfq_set_request+0x141/0x2be
>  [] ? timekeeping_get_ns+0x1e/0x32
>  [] ? ktime_get+0x41/0x52
>  [] ? ktime_get_ns+0x9/0xb
>  [] ? cfq_init_icq+0x12/0x19
>  [] elv_set_request+0x1f/0x24
>  [] get_request+0x324/0x5aa
>  [] ? wake_up_atomic_t+0x2c/0x2c
>  [] blk_queue_bio+0x19f/0x28c
>  [] generic_make_request+0xbd/0x160
>  [] submit_bio+0x100/0x11d
>  [] ? map_swap_page+0x12/0x14
>  [] ? get_swap_bio+0x57/0x6c
>  [] swap_readpage+0x106/0x10e
>  [] read_swap_cache_async+0x26/0x2d
>  [] swapin_readahead+0x11a/0x16a
>  [] do_swap_page+0x9c/0x42e
>  [] ? do_swap_page+0x9c/0x42e
>  [] handle_mm_fault+0xa51/0xb71
>  [] ? _raw_spin_lock_irq+0x1c/0x1e
>  [] __do_page_fault+0x29e/0x425
>  [] do_page_fault+0x25/0x27
>  [] page_fault+0x28/0x30
> Mem-Info:
> active_anon:563129 inactive_anon:140630 isolated_anon:0
>  active_file:4036325 inactive_file:448954 isolated_file:288
>  unevictable:1760 dirty:9197 writeback:446395 unstable:0
>  slab_reclaimable:47810 slab_unreclaimable:120834
>  mapped:534180 shmem:627708 pagetables:5647 bounce:0
>  free:90108 free_pcp:218 free_cma:78
> Node 0 active_anon:2252516kB inactive_anon:562520kB active_file:16145300kB inactive_file:1795816kB unevictable:7040kB isolated(anon):0kB isolated(file):1152kB mapped:2136720kB dirty:367 1785580kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2510832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? no
> Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3199 23767 23767 23767
> Node 0 DMA32 free:117656kB min:35424kB low:44280kB high:53136kB active_anon:38004kB inactive_anon:13540kB active_file:2221420kB inactive_file:307236kB unevictable:0kB writepending:311780kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:47992kB slab_unreclaimable:25360kB kernel_stack:512kB pagetables:796kB bounce:0kB free_pcp:96kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 20567 20567 20567
> Node 0 Normal free:226892kB