Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, May 02, 2017 at 09:44:33AM +0200, Michal Hocko wrote:
> On Mon 01-05-17 21:12:35, Marc MERLIN wrote:
> > Howdy,
> >
> > Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> > really crash, but it goes into an infinite loop with
> > [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
> > More logs: https://pastebin.com/YqE4riw0
>
> I am seeing a lot of traces where tasks are waiting for an IO. I do not
> see any OOM report there. Why do you believe this is an OOM killer
> issue?

Good question. This is a followup to the problem I had in 4.8.8 until I got a patch to fix the issue. Back then, it used to OOM and later pile up I/O tasks like this. Now it doesn't OOM anymore, but tasks still pile up.

I temporarily worked around the issue by doing this:

gargamel:~# echo 0 > /proc/sys/vm/dirty_ratio
gargamel:~# echo 0 > /proc/sys/vm/dirty_background_ratio

Of course my performance is abysmal now, but I can at least run btrfs scrub without piling up enough IO to deadlock the system.

On Tue, May 02, 2017 at 07:44:47PM +0900, Tetsuo Handa wrote:
> > Any idea what I should do next?
>
> Maybe you can try collecting a list of all in-flight allocations with backtraces
> using the kmallocwd patches at
> http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
> and
> http://lkml.kernel.org/r/201704272019.jeh26057.shfotmljoov...@i-love.sakura.ne.jp
> which also tracks mempool allocations.
> (Well, the
>
> - cond_resched();
> + //cond_resched();
>
> change in the latter patch would not be preferable.)

Thanks. I can give that a shot as soon as my current scrub is done; it may take another 12 to 24H at this rate.

In the meantime, as explained above, not allowing any dirty VM has worked around the problem (Linus pointed out to me in the original thread that on a lightly loaded 24GB system, even 1 or 2% could still be a lot of memory for requests to pile up in and cause issues in degenerate cases like mine). Now I'm still curious what changed between 4.8.8 + custom patches and 4.11 to cause this.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
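For anyone reproducing this, the dirty/writeback backlog described above can be watched live while a scrub runs, using standard procfs fields (nothing bcache-specific):

```shell
# Snapshot the write backlog: Dirty = pages waiting for writeback,
# Writeback = pages currently in flight to the device.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

With both ratios forced to 0 as above, the Dirty figure should stay close to zero even mid-scrub; with the default 20%/10% ratios it can legitimately climb into the gigabytes on a 24GB box.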
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 2017/05/02 13:12, Marc MERLIN wrote:
> Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> really crash, but it goes into an infinite loop with
> [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!

Wow, two of the workqueues are reaching max active.

[34777.202267] workqueue btrfs-endio-write: flags=0xe
[34777.218313]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=8/8
[34777.236548]     in-flight: 15168:btrfs_endio_write_helper, 13855:btrfs_endio_write_helper, 3360:btrfs_endio_write_helper, 14241:btrfs_endio_write_helper, 27092:btrfs_endio_write_helper, 15194:btrfs_endio_write_helper, 15169:btrfs_endio_write_helper, 27093:btrfs_endio_write_helper
[34777.316225]     delayed: btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper, btrfs_endio_write_helper
[34777.450684] workqueue bcache: flags=0x8
[34779.956462]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=256/256
[34779.978283]     in-flight: 15320:cached_dev_read_done [bcache], 23385:cached_dev_read_done [bcache], 23371:cached_dev_read_done [bcache], 15321:cached_dev_read_done [bcache], 15395:cached_dev_read_done [bcache], 11101:cached_dev_read_done [bcache], 15300:cached_dev_read_done [bcache], 23349:cached_dev_read_done [bcache], 23425:cached_dev_read_done [bcache], 23399:cached_dev_read_done [bcache], 15293:cached_dev_read_done [bcache], 20529:cached_dev_read_done [bcache], 15402:cached_dev_read_done [bcache], 23422:cached_dev_read_done [bcache], 23417:cached_dev_read_done [bcache], 23409:cached_dev_read_done [bcache], 20539:cached_dev_read_done [bcache], 23431:cached_dev_read_done [bcache], 20544:cached_dev_read_done [bcache], 15355:cached_dev_read_done [bcache], 11085:cached_dev_read_done [bcache], 6511:cached_dev_read_done [bcache]

Googling for btrfs_endio_write_helper shows a stuck report with 4.8-rc5, but there seems to have been no response ( https://www.spinics.net/lists/linux-btrfs/msg58633.html ).

> Any idea what I should do next?

Maybe you can try collecting a list of all in-flight allocations with backtraces using the kmallocwd patches at
http://lkml.kernel.org/r/1489578541-81526-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
and
http://lkml.kernel.org/r/201704272019.jeh26057.shfotmljoov...@i-love.sakura.ne.jp
which also tracks mempool allocations.
(Well, the

- cond_resched();
+ //cond_resched();

change in the latter patch would not be preferable.)
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon 01-05-17 21:12:35, Marc MERLIN wrote:
> Howdy,
>
> Well, sadly, the problem is more or less back in 4.11.0. The system doesn't
> really crash, but it goes into an infinite loop with
> [34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
> More logs: https://pastebin.com/YqE4riw0

I am seeing a lot of traces where tasks are waiting for an IO. I do not see any OOM report there. Why do you believe this is an OOM killer issue?
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Howdy,

Well, sadly, the problem is more or less back in 4.11.0. The system doesn't really crash, but it goes into an infinite loop with
[34776.826800] BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 33s!
More logs: https://pastebin.com/YqE4riw0
(I upgraded from 4.8 with the custom patches you gave me, and went to 4.11.0.)

gargamel:~# cat /proc/sys/vm/dirty_ratio
2
gargamel:~# cat /proc/sys/vm/dirty_background_ratio
1
gargamel:~# free
             total       used       free     shared    buffers     cached
Mem:      24392600   16362660    8029940          0       8884   13739000
-/+ buffers/cache:    2614776   21777824
Swap:     15616764          0   15616764

And yet, I was doing a btrfs check repair on a busy filesystem, and within 40mn or so it triggered the workqueue lockup.

gargamel:~# grep CONFIG_COMPACTION /boot/config-4.11.0-amd64-preempt-sysrq-20170406
CONFIG_COMPACTION=y

Kernel config file: https://pastebin.com/7Tajse6L

To be fair, I didn't try to run btrfs check on 4.8, and now I'm busy trying to recover a filesystem that apparently got corrupted by a bad SAS driver in 4.8, which caused a lot of I/O errors and corruption. This is just to say that btrfs on top of dmcrypt on top of bcache may have been enough layers to hang on btrfs check on 4.8 too, but I can't really go back to check right now due to the driver corruption issues.

Any idea what I should do next?

Thanks,
Marc

On Tue, Nov 29, 2016 at 03:01:35PM -0800, Marc MERLIN wrote:
> On Tue, Nov 29, 2016 at 09:40:19AM -0800, Marc MERLIN wrote:
> > Thanks for the reply and suggestions.
> >
> > On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote:
> > > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote:
> > > > Now, to be fair, this is not a new problem, it's just varying degrees of
> > > > bad and usually only happens when I do a lot of I/O with btrfs.
> > >
> > > One situation where I've seen something like this happen is
> > >
> > > (a) lots and lots of dirty data queued up
> > > (b) horribly slow storage
> >
> > In my case, it is a 5x 4TB HDD with
> > software raid 5 < bcache < dmcrypt < btrfs
> > bcache is currently half disabled (as in I removed the actual cache), or
> > too many bcache requests pile up and the kernel dies when too many
> > workqueues have piled up.
> > I'm just kind of worried that since I'm going through 4 subsystems
> > before my data can hit disk, that's a lot of memory allocations and
> > places where data can accumulate and cause bottlenecks if the next
> > subsystem isn't as fast.
> >
> > But this shouldn't be "horribly slow", should it? (it does copy a few
> > terabytes per day; not fast, but not horrible, about 30MB/s or so)
> >
> > > Sadly, our defaults for "how much dirty data do we allow" are somewhat
> > > buggered. The global defaults are in "percent of memory", and are
> > > generally _much_ too high for big-memory machines:
> > >
> > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
> > > 20
> > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
> > > 10
> >
> > I can confirm I have the same.
> >
> > > says that it only starts really throttling writes when you hit 20% of
> > > all memory used. You don't say how much memory you have in that
> > > machine, but if it's the same one you talked about earlier, it was
> > > 24GB. So you can have 4GB of dirty data waiting to be flushed out.
> >
> > Correct, 24GB and 4GB.
> >
> > > And we *try* to do this per-device backing-dev congestion thing to
> > > make things work better, but it generally seems to not work very well.
> > > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
> > > does really well, and we want to open up, and then it shuts down).
> > >
> > > One thing you can try is to just make the global limits much lower. As in
> > >
> > >    echo 2 > /proc/sys/vm/dirty_ratio
> > >    echo 1 > /proc/sys/vm/dirty_background_ratio
> >
> > I will give that a shot, thank you.
>
> And, after 5H of copying, not a single hang, or USB disconnect, or anything.
> Obviously this seems to point to other problems in the code, and I have no
> idea which layer is the culprit here, but reducing the buffers absolutely
> helped a lot.
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
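Linus's echo-2/echo-1 suggestion above is percent-based; the kernel also accepts absolute limits via the *_bytes counterparts of these sysctls (setting a *_bytes value zeroes the matching *_ratio). A persistent version of such a workaround might look like this sketch; the 256MB/64MB figures are illustrative, not values from this thread:

```
# /etc/sysctl.d/99-writeback.conf (sketch; illustrative values)
vm.dirty_bytes = 268435456             # hard limit: ~256MB of dirty pages
vm.dirty_background_bytes = 67108864   # background writeback starts at ~64MB
```

Byte limits sidestep the problem Linus describes, where even 1% of a 24GB machine is still ~240MB of dirty data.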
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 12/01/2016 11:37 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe wrote:
>>
>> It's two different kinds of throttling. The vm absolutely should
>> throttle at dirty time, to avoid having insane amounts of memory dirty.
>> On the block layer side, throttling is about avoiding the device queues
>> being too long. It's very similar to the buffer bloating on the
>> networking side. The block layer throttling is not a fix for the vm
>> allowing too much memory to be dirty and causing issues, it's about
>> keeping the device response latencies in check.
>
> Sure. But if we really do just end up blocking in the block layer (in
> situations where we didn't used to), that may be a bad thing. It might
> be better to feed that information back to the VM instead,
> particularly for writes, where the VM layer already tries to ratelimit
> the writes.

It's not a new blocking point, it's the same blocking point that we always end up in if we run out of requests. The problem with bcache and other stacked drivers is that they don't have a request pool, so they never really need to block there.

> And frankly, it's almost purely writes that matter. There just aren't
> a lot of ways to get that many parallel reads in real life.

Exactly, it's almost exclusively a buffered write problem, as I wrote in the initial reply. Most other things tend to throttle nicely on their own.

> I haven't looked at your patches, so maybe you already do this.

It's currently not fed back, but that would be pretty trivial to do. The mechanism we have for that (queue congestion) is a bit of a mess, though, so it would need to be revamped a bit.
-- 
Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Thu, Dec 1, 2016 at 10:30 AM, Jens Axboe wrote:
>
> It's two different kinds of throttling. The vm absolutely should
> throttle at dirty time, to avoid having insane amounts of memory dirty.
> On the block layer side, throttling is about avoiding the device queues
> being too long. It's very similar to the buffer bloating on the
> networking side. The block layer throttling is not a fix for the vm
> allowing too much memory to be dirty and causing issues, it's about
> keeping the device response latencies in check.

Sure. But if we really do just end up blocking in the block layer (in situations where we didn't used to), that may be a bad thing. It might be better to feed that information back to the VM instead, particularly for writes, where the VM layer already tries to ratelimit the writes.

And frankly, it's almost purely writes that matter. There just aren't a lot of ways to get that many parallel reads in real life.

I haven't looked at your patches, so maybe you already do this.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 12/01/2016 11:16 AM, Linus Torvalds wrote:
> On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet wrote:
>>
>> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
>> that this can really just live within the writeback code, I don't know why it
>> should involve the block layer at all. Plus, if I understand correctly, his code
>> has the effect of blocking in generic_make_request() to throttle, which means
>> due to the way the writeback code is structured we'll be blocking with page
>> locks held.
>
> Yeah, I do *not* believe that throttling at the block layer is at all
> the right thing to do.
>
> I do think that the block layer needs to throttle, but it needs to be
> seen as a "last resort" kind of thing, where the block layer just
> needs to limit how much it will have pending. But it should be seen as
> a failure mode, not as a write balancing issue.
>
> Because the real throttling absolutely needs to happen when things are
> marked dirty, because no block layer throttling will ever fix the
> situation where you just have too much memory dirtied that you cannot
> free because it will take a minute to write out.
>
> So throttling at a VM level is sane. Throttling at a block layer level is not.

It's two different kinds of throttling. The vm absolutely should throttle at dirty time, to avoid having insane amounts of memory dirty. On the block layer side, throttling is about avoiding the device queues being too long. It's very similar to the buffer bloating on the networking side. The block layer throttling is not a fix for the vm allowing too much memory to be dirty and causing issues, it's about keeping the device response latencies in check.
-- 
Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Thu, Dec 1, 2016 at 5:50 AM, Kent Overstreet wrote:
>
> That said, I'm not sure how I feel about Jens's exact approach... it seems to me
> that this can really just live within the writeback code, I don't know why it
> should involve the block layer at all. Plus, if I understand correctly, his code
> has the effect of blocking in generic_make_request() to throttle, which means
> due to the way the writeback code is structured we'll be blocking with page
> locks held.

Yeah, I do *not* believe that throttling at the block layer is at all the right thing to do.

I do think that the block layer needs to throttle, but it needs to be seen as a "last resort" kind of thing, where the block layer just needs to limit how much it will have pending. But it should be seen as a failure mode, not as a write balancing issue.

Because the real throttling absolutely needs to happen when things are marked dirty, because no block layer throttling will ever fix the situation where you just have too much memory dirtied that you cannot free because it will take a minute to write out.

So throttling at a VM level is sane. Throttling at a block layer level is not.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 03:30:11PM -0500, Tejun Heo wrote: > Hello, > > On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > > Tejun/Kent - any way to just limit the workqueue depth for bcache? > > Because that really isn't helping, and things *will* time out and > > cause those problems when you have hundreds of IO's queued on a disk > > that likely as a write iops around ~100.. > > Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in > from bcache_init(). It's currently using 0 as @max_active and it can > set to be any arbitrary number. It'd be a very crude way to control > what looks like a buffer bloat with IOs tho. We can make it a bit > more granular by splitting workqueues per bcache instance / purpose > but for the long term the right solution seems to be hooking into > writeback throttling mechanism that block layer just grew recently. Agreed that the writeback code is the right place to do it. Within bcache we can't really do anything smarter than just throw a hard limit on the number of outstanding IOs and enforce it by blocking in generic_make_request(), and the bcache code is the wrong place to do that - we don't know what the limit should be there, and all the IOs look the same at that point so you'd probably still end up with writeback starving everything else. I could futz with the workqueue stuff, but that'd likely as not break some other workload - I've spent enough time as it is fighting with workqueue concurrency stuff in the past. My preference would be to just try and get Jens's stuff in. That said, I'm not sure how I feel about Jens's exact approach... it seems to me that this can really just live within the writeback code, I don't know why it should involve the block layer at all. plus, if I understand correctly his code has the effect of blocking in generic_make_request() to throttle, which means due to the way the writeback code is structured we'll be blocking with page locks held. 
I did my own thing in bcachefs, same idea but throttling in writepages... it's dumb and simple but it's worked exceedingly well, as far as actual usability and responsiveness: https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/fs-io.c?h=bcache-dev=acf766b2dd33b076fdce66c86363a3e26a9b70cf#n1002 that said - any kind of throttling for writeback will be a million times better than the current situation...
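The throttling Kent describes - a hard cap on the number of outstanding writeback IOs, with the submitter blocking once the cap is reached - can be sketched in userspace C. This is not the actual bcachefs fs-io.c code; the `wb_throttle` structure, the function names, and the `MAX_INFLIGHT` value are all hypothetical stand-ins for the idea:

```c
/*
 * Userspace sketch of writeback throttling by bounding in-flight IOs.
 * Not the real bcachefs code; names and the cap are hypothetical.
 */
#include <pthread.h>

#define MAX_INFLIGHT 64 /* hypothetical cap; a real one would be tunable */

struct wb_throttle {
	pthread_mutex_t lock;
	pthread_cond_t  done;   /* signalled when an IO completes */
	int inflight;
};

static void wb_throttle_init(struct wb_throttle *t)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->done, NULL);
	t->inflight = 0;
}

/* Called before submitting a writeback IO; blocks while at the cap. */
static void wb_submit(struct wb_throttle *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->inflight >= MAX_INFLIGHT)
		pthread_cond_wait(&t->done, &t->lock);
	t->inflight++;
	pthread_mutex_unlock(&t->lock);
}

/* Called from the IO completion path; wakes one blocked submitter. */
static void wb_complete(struct wb_throttle *t)
{
	pthread_mutex_lock(&t->lock);
	t->inflight--;
	pthread_cond_signal(&t->done);
	pthread_mutex_unlock(&t->lock);
}
```

A submitter that hits the cap simply sleeps until a completion signals `done` - the "block in writepages" shape Kent argues for, as opposed to tweaking workqueue depth after the IOs have already been queued.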
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Hello, On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > Tejun/Kent - any way to just limit the workqueue depth for bcache? > Because that really isn't helping, and things *will* time out and > cause those problems when you have hundreds of IO's queued on a disk > that likely as a write iops around ~100.. Yeah, easily. I'm assuming it's gonna be the bcache_wq allocated in from bcache_init(). It's currently using 0 as @max_active and it can set to be any arbitrary number. It'd be a very crude way to control what looks like a buffer bloat with IOs tho. We can make it a bit more granular by splitting workqueues per bcache instance / purpose but for the long term the right solution seems to be hooking into writeback throttling mechanism that block layer just grew recently. Thanks. -- tejun
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/30/2016 11:14 AM, Linus Torvalds wrote: > On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN wrote: >> >> I gave it a thought again, I think it is exactly the nasty situation you >> described. >> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now >> bcache can't handle IO as quickly and has to hang until the SSD has been >> flushed to spinning rust drives. >> This actually is exactly the same as filling up the cache on a USB key >> and now you're waiting for slow writes to flash, is it not? > > It does sound like you might hit exactly the same kind of situation, yes. > > And the fact that you have dmcrypt running too just makes things pile > up more. All those IO's end up slowed down by the scheduling too. > > Anyway, none of this seems new per se. I'm adding Kent and Jens to the > cc (Tejun already was), in the hope that maybe they have some idea how > to control the nasty worst-case behavior wrt workqueue lockup (it's > not really a "lockup", it looks like it's just hundreds of workqueues > all waiting for IO to complete and much too deep IO queues). Honestly, the easiest would be to wire it up to the blk-wbt stuff that is queued up for 4.10, which attempts to limit the queue depths to something reasonable instead of letting them run amok. This is largely (exclusively, almost) a problem with buffered writeback. On devices utilizing the stacked interface, they never get any depth throttling. Obviously it's worse if each IO ends up queueing work, but it's a big problem even if they do not. > I think it's the traditional "throughput is much easier to measure and > improve" situation, where making queues big help some throughput > situation, but ends up causing chaos when things go south. Yes, and the longer queues never buy you anything, but they end up causing tons of problems at the other end of the spectrum. Still makes sense to limit dirty memory for highmem, though. -- Jens Axboe
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 10:14:50AM -0800, Linus Torvalds wrote: > Anyway, none of this seems new per se. I'm adding Kent and Jens to the > cc (Tejun already was), in the hope that maybe they have some idea how > to control the nasty worst-case behavior wrt workqueue lockup (it's > not really a "lockup", it looks like it's just hundreds of workqueues > all waiting for IO to complete and much too deep IO queues). I'll take your word for it, all I got in the end was Kernel panic - not syncing: Hard LOCKUP and the system stone dead when I woke up hours later. > And I think your NMI watchdog then turns the "system is no longer > responsive" into an actual kernel panic. Ah, I see. Thanks for the reply, and sorry for bringing in that separate thread from the btrfs mailing list, which effectively was a suggestion similar to what you're saying here too. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed, Nov 30, 2016 at 9:47 AM, Marc MERLIN wrote:
>
> I gave it a thought again, I think it is exactly the nasty situation you
> described.
> bcache takes I/O quickly while sending to SSD cache. SSD fills up, now
> bcache can't handle IO as quickly and has to hang until the SSD has been
> flushed to spinning rust drives.
> This actually is exactly the same as filling up the cache on a USB key
> and now you're waiting for slow writes to flash, is it not?

It does sound like you might hit exactly the same kind of situation, yes.

And the fact that you have dmcrypt running too just makes things pile up more. All those IO's end up slowed down by the scheduling too.

Anyway, none of this seems new per se. I'm adding Kent and Jens to the cc (Tejun already was), in the hope that maybe they have some idea how to control the nasty worst-case behavior wrt workqueue lockup (it's not really a "lockup", it looks like it's just hundreds of workqueues all waiting for IO to complete and much too deep IO queues).

I think it's the traditional "throughput is much easier to measure and improve" situation, where making queues big helps some throughput situation, but ends up causing chaos when things go south.

And I think your NMI watchdog then turns the "system is no longer responsive" into an actual kernel panic.

> With your dirty ratio workaround, I was able to re-enable bcache and
> have it not fall over, but only barely. I recorded over a hundred
> workqueues in flight during the copy at some point (just not enough
> to actually kill the kernel this time).
>
> I've started a bcache followup on this here:
> http://marc.info/?l=linux-bcache=148052441423532=2
> http://marc.info/?l=linux-bcache=148052620524162=2
>
> A full traceback showing the pileup of requests is here:
> http://marc.info/?l=linux-bcache=147949497808483=2
>
> and there:
> http://pastebin.com/rJ5RKUVm
> (2 different ones but mostly the same result)

Tejun/Kent - any way to just limit the workqueue depth for bcache?
Because that really isn't helping, and things *will* time out and cause those problems when you have hundreds of IO's queued on a disk that likely has a write IOPS around ~100..

And I really wonder if we should do the "big hammer" approach to the dirty limits on non-HIGHMEM machines too (approximate the "vm_highmem_is_dirtyable" by just limiting global_dirtyable_memory() to 1 GB). That would make the default dirty limits be 100/200MB (for soft/hard throttling), which really is much more reasonable than gigabytes and gigabytes of dirty data.

Of course, no way do we do that during rc7..

Linus

 mm/page-writeback.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..26ecbdecb815 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -352,6 +352,10 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 #endif
 }
 
+/* Limit dirtyable memory to 1GB */
+#define PAGES_IN_GB(x) ((x) << (30 - PAGE_SHIFT))
+#define MAX_DIRTYABLE_LOWMEM_PAGES PAGES_IN_GB(1)
+
 /**
  * global_dirtyable_memory - number of globally dirtyable pages
  *
@@ -373,8 +377,11 @@ static unsigned long global_dirtyable_memory(void)
 	x += global_node_page_state(NR_INACTIVE_FILE);
 	x += global_node_page_state(NR_ACTIVE_FILE);
 
-	if (!vm_highmem_is_dirtyable)
+	if (!vm_highmem_is_dirtyable) {
 		x -= highmem_dirtyable_memory(x);
+		if (x > MAX_DIRTYABLE_LOWMEM_PAGES)
+			x = MAX_DIRTYABLE_LOWMEM_PAGES;
+	}
 
 	return x + 1;	/* Ensure that we never return 0 */
 }
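The arithmetic behind Linus's "100/200MB" figure can be checked with a small userspace model of the clamp. This is a simplified stand-in, not the kernel code: it assumes 4KB pages and the default 10/20 dirty ratios, and `dirtyable_pages()` only models the clamp step of global_dirtyable_memory():

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assume 4KB pages */
#define PAGES_IN_GB(x) ((unsigned long)(x) << (30 - PAGE_SHIFT))
#define MAX_DIRTYABLE_LOWMEM_PAGES PAGES_IN_GB(1)

/* Simplified stand-in for global_dirtyable_memory() with the proposed
 * clamp applied: free + file pages, capped at 1GB worth of pages. */
static unsigned long dirtyable_pages(unsigned long free_and_file)
{
	unsigned long x = free_and_file;

	if (x > MAX_DIRTYABLE_LOWMEM_PAGES)
		x = MAX_DIRTYABLE_LOWMEM_PAGES;
	return x + 1; /* never return 0 */
}

/* Dirty limit in MB for a given ratio (percent), mirroring how
 * vm.dirty_ratio / vm.dirty_background_ratio are applied to the
 * dirtyable page count. */
static unsigned long dirty_limit_mb(unsigned long pages, unsigned int ratio)
{
	return ((pages * ratio / 100) << PAGE_SHIFT) >> 20;
}
```

For the 24GB machine in this thread the clamp wins: `dirtyable_pages(PAGES_IN_GB(24))` collapses to 1GB of pages, and the 20%/10% ratios then yield roughly 200MB hard and 100MB soft limits, which is where the "100/200MB" in the message comes from.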
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 10:01:10AM -0800, Linus Torvalds wrote: > On Tue, Nov 29, 2016 at 9:40 AM, Marc MERLIN wrote: > > > > In my case, it is a 5x 4TB HDD with > > software raid 5 < bcache < dmcrypt < btrfs > > It doesn't sound like the nasty situations I have seen (particularly > with large USB flash storage - often high momentary speed for > benchmarks, but slows down to a crawl after you've written a bit to > it, and doesn't have the smart garbage collection that modern "real" > SSDs have). I gave it a thought again, I think it is exactly the nasty situation you described. bcache takes I/O quickly while sending to SSD cache. SSD fills up, now bcache can't handle IO as quickly and has to hang until the SSD has been flushed to spinning rust drives. This actually is exactly the same as filling up the cache on a USB key and now you're waiting for slow writes to flash, is it not? With your dirty ratio workaround, I was able to re-enable bcache and have it not fall over, but only barely. I recorded over a hundred workqueues in flight during the copy at some point (just not enough to actually kill the kernel this time). 
I've started a bcache followup on this here:
http://marc.info/?l=linux-bcache=148052441423532=2
http://marc.info/?l=linux-bcache=148052620524162=2

This message shows the huge pileup of workqueues in bcache just before the kernel dies with

Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
task: 9ee0c2fa4180 task.stack: 9ee0c2fa8000
RIP: 0010:[] [] cpuidle_enter_state+0x119/0x171
RSP: :9ee0c2fabea0 EFLAGS: 0246
RAX: 9ee0de3d90c0 RBX: 0004 RCX: 001f RDX: RSI: 0007 RDI:
RBP: 9ee0c2fabed0 R08: 0f92 R09: 0f42 R10: 9ee0c2fabe50
R11: 071c71c71c71c71c R12: e047bfdcb200 R13: 0af626899577
R14: 0004 R15: 0af6264cc557
FS: () GS:9ee0de3c() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0898b000 CR3: 00045cc06000 CR4: 001406e0
Stack:
 0f40 e047bfdcb200 bbccc060 9ee0c2fac000
 9ee0c2fa8000 9ee0c2fac000 9ee0c2fabee0 bb57a1ac
 9ee0c2fabf30 bb09238d 9ee0c2fa8000 00070004
Call Trace:
 [] cpuidle_enter+0x17/0x19
 [] cpu_startup_entry+0x210/0x28b
 [] start_secondary+0x13e/0x140
Code: 00 00 00 48 c7 c7 cd ae b2 bb c6 05 4b 8e 7a 00 01 e8 17 6c ae ff fa 66 0f 1f 44 00 00 31 ff e8 75 60 b4 44 00 00 <4c> 89 e8 b9 e8 03 00 00 4c 29 f8 48 99 48 f7 f9 ba ff ff ff 7f
Kernel panic - not syncing: Hard LOCKUP

A full traceback showing the pileup of requests is here:
http://marc.info/?l=linux-bcache=147949497808483=2

and there:
http://pastebin.com/rJ5RKUVm
(2 different ones but mostly the same result)

We can probably follow up on the bcache thread I Cc'ed you on since I'm not sure if the fault here lies with bcache or the VM subsystem anymore.

Thanks.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 2016/11/30 8:01, Marc MERLIN wrote: > And, after 5H of copying, not a single hang, or USB disconnect, or anything. > Obviously this seems to point to other problems in the code, and I have no > idea which layer is a culprit here, but reducing the buffers absolutely > helped a lot. Maybe you can try commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too long") or http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp for finding the culprit.
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 09:40:19AM -0800, Marc MERLIN wrote: > Thanks for the reply and suggestions. > > On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote: > > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > > > Now, to be fair, this is not a new problem, it's just varying degrees of > > > bad and usually only happens when I do a lot of I/O with btrfs. > > > > One situation where I've seen something like this happen is > > > > (a) lots and lots of dirty data queued up > > (b) horribly slow storage > > In my case, it is a 5x 4TB HDD with > software raid 5 < bcache < dmcrypt < btrfs > bcache is currently half disabled (as in I removed the actual cache) or > too many bcache requests pile up, and the kernel dies when too many > workqueues have piled up. > I'm just kind of worried that since I'm going through 4 subsystems > before my data can hit disk, that's a lot of memory allocations and > places where data can accumulate and cause bottlenecks if the next > subsystem isn't as fast. > > But this shouldn't be "horribly slow", should it? (it does copy a few > terabytes per day, not fast, but not horrible, about 30MB/s or so) > > > Sadly, our defaults for "how much dirty data do we allow" are somewhat > > buggered. The global defaults are in "percent of memory", and are > > generally _much_ too high for big-memory machines: > > > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio > > 20 > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio > > 10 > > I can confirm I have the same. > > > says that it only starts really throttling writes when you hit 20% of > > all memory used. You don't say how much memory you have in that > > machine, but if it's the same one you talked about earlier, it was > > 24GB. So you can have 4GB of dirty data waiting to be flushed out. > > Correct, 24GB and 4GB. > > > And we *try* to do this per-device backing-dev congestion thing to > > make things work better, but it generally seems to not work very well. 
> > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD > > does really well, and we want to open up, and then it shuts down). > > > > One thing you can try is to just make the global limits much lower. As in > > > >echo 2 > /proc/sys/vm/dirty_ratio > >echo 1 > /proc/sys/vm/dirty_background_ratio > > I will give that a shot, thank you. And, after 5H of copying, not a single hang, or USB disconnect, or anything. Obviously this seems to point to other problems in the code, and I have no idea which layer is a culprit here, but reducing the buffers absolutely helped a lot. Thanks much, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Thanks for the reply and suggestions. On Tue, Nov 29, 2016 at 09:07:03AM -0800, Linus Torvalds wrote: > On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > > Now, to be fair, this is not a new problem, it's just varying degrees of > > bad and usually only happens when I do a lot of I/O with btrfs. > > One situation where I've seen something like this happen is > > (a) lots and lots of dirty data queued up > (b) horribly slow storage In my case, it is a 5x 4TB HDD with software raid 5 < bcache < dmcrypt < btrfs bcache is currently half disabled (as in I removed the actual cache), otherwise too many bcache requests pile up, and the kernel dies when too many workqueues have piled up. I'm just kind of worried that since I'm going through 4 subsystems before my data can hit disk, that's a lot of memory allocations and places where data can accumulate and cause bottlenecks if the next subsystem isn't as fast. But this shouldn't be "horribly slow", should it? (it does copy a few terabytes per day, not fast, but not horrible, about 30MB/s or so) > Sadly, our defaults for "how much dirty data do we allow" are somewhat > buggered. The global defaults are in "percent of memory", and are > generally _much_ too high for big-memory machines: > > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio > 20 > [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio > 10 I can confirm I have the same. > says that it only starts really throttling writes when you hit 20% of > all memory used. You don't say how much memory you have in that > machine, but if it's the same one you talked about earlier, it was > 24GB. So you can have 4GB of dirty data waiting to be flushed out. Correct, 24GB and 4GB. > And we *try* to do this per-device backing-dev congestion thing to > make things work better, but it generally seems to not work very well. > Possibly because of inconsistent write speeds (ie _sometimes_ the SSD > does really well, and we want to open up, and then it shuts down).
> > One thing you can try is to just make the global limits much lower. As in > > echo 2 > /proc/sys/vm/dirty_ratio > > echo 1 > /proc/sys/vm/dirty_background_ratio I will give that a shot, thank you. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN wrote: > Now, to be fair, this is not a new problem, it's just varying degrees of > bad and usually only happens when I do a lot of I/O with btrfs. One situation where I've seen something like this happen is (a) lots and lots of dirty data queued up (b) horribly slow storage (c) filesystem that ends up serializing on writeback under certain circumstances The usual case for (b) in the modern world is big SSD's that have bad worst-case behavior (ie they may do gbps speeds when doing well, and then they come to a screeching halt when their buffers fill up and they have to do rewrites, and their gbps throughput drops to mbps or lower). Generally you only find that kind of really nasty SSD in the USB stick world these days. The usual case for (c) is "fsync" or similar - often on a totally unrelated file - which then ends up waiting for everything else to flush too. Looks like btrfs_start_ordered_extent() does something kind of like that, where it waits for data to be flushed. The usual *fix* for this is to just not get into situation (a). Sadly, our defaults for "how much dirty data do we allow" are somewhat buggered. The global defaults are in "percent of memory", and are generally _much_ too high for big-memory machines: [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio 20 [torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio 10 says that it only starts really throttling writes when you hit 20% of all memory used. You don't say how much memory you have in that machine, but if it's the same one you talked about earlier, it was 24GB. So you can have 4GB of dirty data waiting to be flushed out. And we *try* to do this per-device backing-dev congestion thing to make things work better, but it generally seems to not work very well. Possibly because of inconsistent write speeds (ie _sometimes_ the SSD does really well, and we want to open up, and then it shuts down).
One thing you can try is to just make the global limits much lower. As in

  echo 2 > /proc/sys/vm/dirty_ratio
  echo 1 > /proc/sys/vm/dirty_background_ratio

(if you want to go lower than 1%, you'll have to use the "dirty_bytes" and "dirty_background_bytes" byte limits instead of percentage limits). Obviously you'll need to be root for this, and equally obviously it's really a failure of the kernel. I'd *love* to get something like this right automatically, but sadly it depends so much on memory size, load, disk subsystem, etc etc that I despair at it. On x86-32 we "fixed" this long ago by just saying "high memory is not dirtyable", so you were always limited to a maximum of 10/20% of 1GB, rather than the full memory range. It worked better, but it's a sad kind of fix. (See commit dc6e29da9162: "Fix balance_dirty_page() calculations with CONFIG_HIGHMEM") Linus
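For concreteness, the percentage arithmetic above is easy to check. A small sketch (plain Python added for illustration; the 24GB figure and the 20%/10% and 2%/1% knob values come from this thread, and the model is simplified: the kernel computes these limits against *dirtyable* memory, not total RAM, so real thresholds are somewhat lower):

```python
# Model of the vm.dirty_ratio / vm.dirty_background_ratio percentage knobs.
# Simplification: real kernels use "dirtyable" memory rather than total RAM.

def dirty_limits(mem_bytes, dirty_ratio=20, background_ratio=10):
    """Return (throttle_limit, background_limit) in bytes for the given
    percentage knobs (defaults match the values quoted in the thread)."""
    return (mem_bytes * dirty_ratio // 100,
            mem_bytes * background_ratio // 100)

mem = 24 * 1024**3  # the 24GB machine discussed above
hard, background = dirty_limits(mem)
print(f"throttle writers above : {hard / 1024**3:.1f} GiB dirty")   # ~4.8 GiB
print(f"background writeback at: {background / 1024**3:.1f} GiB")   # ~2.4 GiB

# The suggested 2%/1% shrinks the ceilings by an order of magnitude:
hard2, background2 = dirty_limits(mem, 2, 1)
print(f"with 2%/1%: {hard2 // 2**20} MiB / {background2 // 2**20} MiB")
```

This is where the "almost 5GB of dirty pending data" figure earlier in the thread comes from: 20% of 24GB is roughly 4.8GiB before writers are even throttled.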
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 05:25:15PM +0100, Michal Hocko wrote: > On Tue 22-11-16 17:38:01, Greg KH wrote: > > On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote: > > > On 11/22/2016 05:06 PM, Marc MERLIN wrote: > > > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote: > > > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote: > > > 4.9rc5 however seems to be doing better, and is still running after > > > 18 > > > hours. However, I got a few page allocation failures as per below, > > > but the > > > system seems to recover. > > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 > > > days) > > > or is that good enough, and i should go back to 4.8.8 with that > > > patch applied? > > > https://marc.info/?l=linux-mm&m=147423605024993 > > > >>> > > > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying > > > >>> 4.8 with that patch, yeah. > > > >> > > > >> So the good news is that it's been running for almost 5H and so far so > > > >> good. > > > > > > > > And the better news is that the copy is still going strong, 4.4TB and > > > > going. So 4.8.8 is fixed with that one single patch as far as I'm > > > > concerned. > > > > > > > > So thanks for that, looks good to me to merge. > > > > > > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is > > > already EOL AFAICS). > > > > > > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect. > > > - alternatively a simpler (again 4.8-only) patch that just outright > > > prevents OOM for 0 < order < costly, as Michal already suggested. > > > - backport 10+ compaction patches to 4.8 stable > > > - something else? > > > > Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is > > released? :) > > OK, so can we push this through to 4.8 before EOL and make sure there > won't be any additional pre-mature high order OOM reports? The patch > should be simple enough and safe for the stable tree.
> There is no upstream commit because 4.9 is fixed in a different way which would be > way too intrusive for the stable backport. Now queued up, thanks! greg k-h
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 29, 2016 at 05:07:51PM +0100, Michal Hocko wrote: > On Tue 29-11-16 07:55:37, Marc MERLIN wrote: > > On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote: > > > Marc, could you try this patch please? I think it should be pretty clear > > > it should help you but running it through your use case would be more > > > than welcome before I ask Greg to take this to the 4.8 stable tree. > > > > I ran it overnight and copied 1.4TB with it before it failed because > > there wasn't enough disk space on the other side, so I think it fixes > > the problem too. > > Can I add your Tested-by? Done. Now, probably unrelated, but hard to be sure, doing those big copies causes massive hangs on my system. I hit a few of the 120s hangs, but more generally lots of things hang, including shells, my DNS server, monitoring reading from USB and timing out, and so forth. Examples below. I have a hard time telling what is at fault, but is there a chance it might be memory allocation pressure? I already have a preempt kernel, so I can't make it more preempt than that. Now, to be fair, this is not a new problem, it's just varying degrees of bad and usually only happens when I do a lot of I/O with btrfs. That said, btrfs may very well just be suffering from memory allocation issues and hanging as a result, with everything else on my system also hanging for similar reasons until the memory pressure goes away when the copy or scrub is finished. What do you think? [28034.954435] INFO: task btrfs:5618 blocked for more than 120 seconds. [28034.975471] Tainted: G U 4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12 [28035.000964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28035.025429] btrfs D 91154d33fc70 0 5618 5372 0x0080 [28035.047717] 91154d33fc70 00200246 911842f880c0 9115a4cf01c0 [28035.071020] 91154d33fc58 91154d34 91165493bca0 9115623773f0 [28035.094252] 1000 0001 91154d33fc88 b86cf1a6 [28035.117538] Call Trace: [28035.125791] [] schedule+0x8b/0xa3 [28035.141550] [] btrfs_start_ordered_extent+0xce/0x122 [28035.162457] [] ? wake_up_atomic_t+0x2c/0x2c [28035.180891] [] btrfs_wait_ordered_range+0xa9/0x10d [28035.201723] [] btrfs_truncate+0x40/0x24b [28035.219269] [] btrfs_setattr+0x1da/0x2d7 [28035.237032] [] notify_change+0x252/0x39c [28035.254566] [] do_truncate+0x81/0xb4 [28035.271057] [] vfs_truncate+0xd9/0xf9 [28035.287782] [] do_sys_truncate+0x63/0xa7 I get other hangs like: [10338.968912] perf: interrupt took too long (3927 > 3917), lowering kernel.perf_event_max_sample_rate to 50750 [12971.047705] ftdi_sio ttyUSB15: usb_serial_generic_read_bulk_callback - urb stopped: -32 [17761.122238] usb 4-1.4: USB disconnect, device number 39 [17761.141063] usb 4-1.4: usbfs: USBDEVFS_CONTROL failed cmd hub-ctrl rqt 160 rq 6 len 1024 ret -108 [17761.263252] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd [17761.938575] usb 4-1.4: new SuperSpeed USB device number 40 using xhci_hcd [24130.574425] hpet1: lost 2306 rtc interrupts [24156.034950] hpet1: lost 1628 rtc interrupts [24173.314738] hpet1: lost 1104 rtc interrupts [24180.129950] hpet1: lost 436 rtc interrupts [24257.557955] hpet1: lost 4954 rtc interrupts [24267.522656] hpet1: lost 637 rtc interrupts Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 17:38:01, Greg KH wrote: > On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote: > > On 11/22/2016 05:06 PM, Marc MERLIN wrote: > > > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote: > > >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote: > > 4.9rc5 however seems to be doing better, and is still running after 18 > > hours. However, I got a few page allocation failures as per below, but > > the > > system seems to recover. > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 > > days) > > or is that good enough, and i should go back to 4.8.8 with that patch > > applied? > > https://marc.info/?l=linux-mm&m=147423605024993 > > >>> > > >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying > > >>> 4.8 with that patch, yeah. > > >> > > >> So the good news is that it's been running for almost 5H and so far so > > >> good. > > > > > > And the better news is that the copy is still going strong, 4.4TB and > > > going. So 4.8.8 is fixed with that one single patch as far as I'm > > > concerned. > > > > > > So thanks for that, looks good to me to merge. > > > > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is > > already EOL AFAICS). > > > > - send the patch [1] as 4.8-only stable. Greg won't like that, I expect. > > - alternatively a simpler (again 4.8-only) patch that just outright > > prevents OOM for 0 < order < costly, as Michal already suggested. > > - backport 10+ compaction patches to 4.8 stable > > - something else? > > Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is > released? :) OK, so can we push this through to 4.8 before EOL and make sure there won't be any additional pre-mature high order OOM reports? The patch should be simple enough and safe for the stable tree. There is no upstream commit because 4.9 is fixed in a different way which would be way too intrusive for the stable backport.
---
>From 02306e8d593fa8a48d620e0c9d63a934ca8366d8 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 23 Nov 2016 07:26:30 +0100
Subject: [PATCH] mm, oom: stop pre-mature high-order OOM killer invocations

31e49bfda184 ("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
killer invocation for high order requests. It seemed to work for most
users just fine but it is far from bullet proof and obviously not
sufficient for Marc who has reported pre-mature OOM killer invocations
with 4.8 based kernels. 4.9 with all the compaction improvements seems
to be behaving much better but that would be too intrusive to backport
to 4.8 stable kernels. Instead this patch simply never declares OOM for
!costly high order requests. We rely on order-0 requests to do that in
case we are really out of memory. Order-0 requests are much more common
and so a risk of a livelock without any way forward is highly unlikely.

Reported-by: Marc MERLIN
Tested-by: Marc MERLIN
Signed-off-by: Michal Hocko
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate a lack of reliable compaction
+	 * operation. We cannot simply go OOM with the current state of the compaction
+	 * code because this can lead to pre mature OOM declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
-- 
2.10.2

-- 
Michal Hocko
SUSE Labs
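To make the patch's effect easier to follow, here is a toy model of the retry decision (hypothetical Python written for this note, not kernel code; `compaction_made_progress` is a made-up stand-in for 4.8's unreliable compaction feedback, not a real kernel function):

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # kernel constant: orders above this are "costly"

def should_retry_compaction(order, patched=True):
    """Simplified model of should_compact_retry(): decide whether a failed
    high-order allocation keeps retrying or may fall through to OOM."""
    if order == 0 or order > PAGE_ALLOC_COSTLY_ORDER:
        # order-0 and costly orders are handled elsewhere; no retry here
        return False
    if patched:
        # the 4.8-stable workaround: never let a !costly high-order request
        # declare OOM; leave that to the far more common order-0 requests
        return True
    # unpatched: trust compaction's progress feedback, which on 4.8 could
    # wrongly report "no way forward" and trigger a premature OOM kill
    return compaction_made_progress(order)

def compaction_made_progress(order):
    return False  # stand-in: models the pessimistic 4.8 feedback

for order in range(5):
    print(order, should_retry_compaction(order))
```

With `patched=True`, orders 1 through 3 always retry; that is exactly the livelock-versus-OOM trade-off the commit message argues is safe, because order-0 allocations will still declare OOM when memory is truly exhausted.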
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote: > Marc, could you try this patch please? I think it should be pretty clear > it should help you but running it through your use case would be more > than welcome before I ask Greg to take this to the 4.8 stable tree. > > Thanks! > > On Wed 23-11-16 07:34:10, Michal Hocko wrote: > [...] > > commit b2ccdcb731b666aa28f86483656c39c5e53828c7 > > Author: Michal Hocko > > Date: Wed Nov 23 07:26:30 2016 +0100 > > > > mm, oom: stop pre-mature high-order OOM killer invocations > > > > 31e49bfda184 ("mm, oom: protect !costly allocations some more for > > !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM > > killer invocation for high order requests. It seemed to work for most > > users just fine but it is far from bullet proof and obviously not > > sufficient for Marc who has reported pre-mature OOM killer invocations > > with 4.8 based kernels. 4.9 with all the compaction improvements seems > > to be behaving much better but that would be too intrusive to backport > > to 4.8 stable kernels. Instead this patch simply never declares OOM for > > !costly high order requests. We rely on order-0 requests to do that in > > case we are really out of memory. Order-0 requests are much more common > > and so a risk of a livelock without any way forward is highly unlikely. > > > > Reported-by: Marc MERLIN > > Signed-off-by: Michal Hocko Tested-by: Marc MERLIN Marc > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index a2214c64ed3c..7401e996009a 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, > > unsigned int order, int alloc_fla > > if (!order || order > PAGE_ALLOC_COSTLY_ORDER) > > return false; > > > > +#ifdef CONFIG_COMPACTION > > + /* > > +* This is a gross workaround to compensate a lack of reliable > > compaction > > +* operation.
We cannot simply go OOM with the current state of the > > compaction > > +* code because this can lead to pre mature OOM declaration. > > +*/ > > + if (order <= PAGE_ALLOC_COSTLY_ORDER) > > + return true; > > +#endif > > + > > /* > > * There are setups with compaction disabled which would prefer to loop > > * inside the allocator rather than hit the oom killer prematurely. > > -- > > Michal Hocko > > SUSE Labs > > -- > Michal Hocko > SUSE Labs > -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 29-11-16 07:55:37, Marc MERLIN wrote:
> On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> > Marc, could you try this patch please? I think it should be pretty clear
> > it should help you but running it through your use case would be more
> > than welcome before I ask Greg to take this to the 4.8 stable tree.
>
> I ran it overnight and copied 1.4TB with it before it failed because
> there wasn't enough disk space on the other side, so I think it fixes
> the problem too.

Can I add your Tested-by?
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

I ran it overnight and copied 1.4TB with it before it failed because
there wasn't enough disk space on the other side, so I think it fixes
the problem too.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 28, 2016 at 08:23:15AM +0100, Michal Hocko wrote:
> Marc, could you try this patch please? I think it should be pretty clear
> it should help you but running it through your use case would be more
> than welcome before I ask Greg to take this to the 4.8 stable tree.

This will take a little while: the whole copy took 5 days to finish and
I'm a bit hesitant about blowing it away and starting over :) Let me see
if I can come up with maybe another disk array for another test.

For now, as a reminder, I'm running the attached patch, and it works
fine. I'll report back as soon as I can.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..9b3b3a79c58a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3347,17 +3347,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		int check_order = order;
+		unsigned long watermark = min_wmark_pages(zone);
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
+		if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER) {
+			check_order = 0;
+			watermark += 1UL << order;
+		}
+
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
 		 * available?
 		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+		if (__zone_watermark_ok(zone, check_order, watermark,
 				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of
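[Editor's note: a minimal sketch of the heuristic in the attached patch, to make the diff easier to follow. For a non-costly high-order request, it checks the order-0 watermark raised by the size of the requested block instead of the high-order watermark, so the retry decision stops depending on free-memory fragmentation. This is not kernel code; the function name is illustrative, and lowmem reserves plus the high-order free-list walk of __zone_watermark_ok() are elided.]

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # same threshold the kernel uses

def should_retry_reclaim(order: int, free_pages: int, min_watermark: int) -> bool:
    """Simplified model of the watermark check in the attached patch."""
    check_order = order
    watermark = min_watermark
    if 0 < order <= PAGE_ALLOC_COSTLY_ORDER:
        # Treat the request as order-0, but demand enough extra headroom
        # to cover the whole 2^order block being asked for.
        check_order = 0
        watermark += 1 << order
    # Stand-in for __zone_watermark_ok(): for an order-0 check this
    # reduces to a plain free-pages comparison.
    return free_pages > watermark
```

So an order-2 request with 100 free pages against a min watermark of 50 retries (100 > 54), while the same request with only 53 free pages gives up and falls through to the OOM path.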
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/22/2016 10:46 PM, Simon Kirby wrote:

On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:

On 11/22/2016 05:06 PM, Marc MERLIN wrote:

On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:

On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:

4.9rc5 however seems to be doing better, and is still running after 18
hours. However, I got a few page allocation failures as per below, but
the system seems to recover. Vlastimil, do you want me to continue the
copy on 4.9 (may take 3-5 days), or is that good enough and I should go
back to 4.8.8 with that patch applied?
https://marc.info/?l=linux-mm&m=147423605024993

Hi, I think it's enough for 4.9 for now and I would appreciate trying
4.8 with that patch, yeah.

So the good news is that it's been running for almost 5H and so far so
good. And the better news is that the copy is still going strong, 4.4TB
and going. So 4.8.8 is fixed with that one single patch as far as I'm
concerned. So thanks for that, looks good to me to merge.

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
already EOL AFAICS.)
- send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
- alternatively a simpler (again 4.8-only) patch that just outright
  prevents OOM for 0 < order < costly, as Michal already suggested.
- backport 10+ compaction patches to 4.8 stable
- something else?
Michal? Linus?
[1] https://marc.info/?l=linux-mm&m=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting
vm/watermark_scale_factor to 500, and threw that in sysctl. This was on
the MythTV box that OOMs everything after about a day on 4.8 otherwise.
I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but
just realized I forgot to remove the watermark_scale_factor workaround.
I've restored that now, so I'll see if it becomes unhappy by tomorrow.

Thanks for the testing. Could you now try Michal's stable candidate [1]
from this thread please?

[1] http://marc.info/?l=linux-mm&m=147988285831283&w=2

I also threw up a few other things you had asked for (vmstat, zoneinfo
before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/
(that was before booting into a rebuild with [1] applied)

Simon-
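[Editor's note: for context on Simon's workaround, vm.watermark_scale_factor is expressed in fractions of 10,000 of a zone's memory (per the kernel's sysctl/vm documentation), so raising it from the default 10 to 500 widens the reserve kswapd keeps between watermarks from roughly 0.1% to 5% of the zone. A rough back-of-the-envelope sketch; the helper name and the example zone size are illustrative, not from the thread.]

```python
# Rough model of how vm.watermark_scale_factor scales the distance
# between a zone's watermarks (unit: fractions of 10,000 of zone pages).
def watermark_gap(zone_managed_pages: int, scale_factor: int) -> int:
    return zone_managed_pages * scale_factor // 10_000

zone_pages = (6 * 1024**3) // 4096            # a ~6 GB zone of 4 KiB pages
default_gap = watermark_gap(zone_pages, 10)   # kernel default: ~0.1% of zone
tuned_gap = watermark_gap(zone_pages, 500)    # Simon's setting: ~5% of zone
```

With ~50x more free memory held in reserve, kswapd starts reclaiming much earlier, which plausibly masks the premature high-order OOM behavior being debugged here.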
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
Marc, could you try this patch please? I think it should be pretty clear
it should help you but running it through your use case would be more
than welcome before I ask Greg to take this to the 4.8 stable tree.

Thanks!

On Wed 23-11-16 07:34:10, Michal Hocko wrote:
[...]
> commit b2ccdcb731b666aa28f86483656c39c5e53828c7
> Author: Michal Hocko
> Date:   Wed Nov 23 07:26:30 2016 +0100
>
>     mm, oom: stop pre-mature high-order OOM killer invocations
>
>     31e49bfda184 ("mm, oom: protect !costly allocations some more for
>     !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
>     OOM killer invocation for high-order requests. It seemed to work for
>     most users just fine but it is far from bulletproof and obviously not
>     sufficient for Marc, who has reported premature OOM killer invocations
>     with 4.8-based kernels. 4.9 with all the compaction improvements seems
>     to be behaving much better, but that would be too intrusive to backport
>     to 4.8 stable kernels. Instead this patch simply never declares OOM for
>     !costly high-order requests. We rely on order-0 requests to do that in
>     case we are really out of memory. Order-0 requests are much more common
>     and so the risk of a livelock without any way forward is highly unlikely.
>
>     Reported-by: Marc MERLIN
>     Signed-off-by: Michal Hocko
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a2214c64ed3c..7401e996009a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
>  		return false;
>  
> +#ifdef CONFIG_COMPACTION
> +	/*
> +	 * This is a gross workaround to compensate for the lack of a reliable
> +	 * compaction operation. We cannot simply go OOM with the current state
> +	 * of the compaction code because this can lead to a premature OOM
> +	 * declaration.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +#endif
> +
>  	/*
>  	 * There are setups with compaction disabled which would prefer to loop
>  	 * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/23/2016 07:34 AM, Michal Hocko wrote:

On Tue 22-11-16 11:38:47, Linus Torvalds wrote:

On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
already EOL AFAICS.)
- send the patch [1] as 4.8-only stable.

I think that's the right thing to do. It's pretty small, and the
argument that it changes the oom logic too much is pretty bogus, I
think. The oom logic in 4.8 is simply broken. Let's get it fixed.
Changing it is the point.

The point I've tried to make is that it is not should_reclaim_retry
which is broken. It's an overly optimistic reliance on compaction to do
its work which led to all those issues. My previous fix 31e49bfda184
("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") tried to cope with that by checking the order-0
watermark, which has proven to help most users. Now it didn't cover
everybody, obviously.

Rather than fiddling with fine tuning of these heuristics I think it
would be safer to simply admit that high-order OOM detection doesn't
work in the 4.8 kernel, and so not declare the OOM killer for those
requests at all. The risk of such a change is not big because there
usually are order-0 requests happening all the time, so if we are
really OOM we would trigger the OOM eventually.

So I am proposing this for the 4.8 stable tree instead
---
commit b2ccdcb731b666aa28f86483656c39c5e53828c7
Author: Michal Hocko
Date:   Wed Nov 23 07:26:30 2016 +0100

    mm, oom: stop pre-mature high-order OOM killer invocations

    31e49bfda184 ("mm, oom: protect !costly allocations some more for
    !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
    OOM killer invocation for high-order requests. It seemed to work for
    most users just fine but it is far from bulletproof and obviously not
    sufficient for Marc, who has reported premature OOM killer invocations
    with 4.8-based kernels. 4.9 with all the compaction improvements seems
    to be behaving much better, but that would be too intrusive to backport
    to 4.8 stable kernels. Instead this patch simply never declares OOM for
    !costly high-order requests. We rely on order-0 requests to do that in
    case we are really out of memory. Order-0 requests are much more common
    and so the risk of a livelock without any way forward is highly unlikely.

    Reported-by: Marc MERLIN
    Signed-off-by: Michal Hocko

This should effectively restore the 4.6 logic, so I'm fine with it for
stable, if it passes testing.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate for the lack of a reliable
+	 * compaction operation. We cannot simply go OOM with the current state
+	 * of the compaction code because this can lead to a premature OOM
+	 * declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wed 23-11-16 14:53:12, Hillf Danton wrote:
> On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> > @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> >  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
> >  		return false;
> >  
> > +#ifdef CONFIG_COMPACTION
> > +	/*
> > +	 * This is a gross workaround to compensate for the lack of a reliable
> > +	 * compaction operation. We cannot simply go OOM with the current state
> > +	 * of the compaction code because this can lead to a premature OOM
> > +	 * declaration.
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
>
> No need to check order once more.

Yes, a simple return true would be sufficient, but I wanted the code to
be more obvious.

> Plus can we retry without CONFIG_COMPACTION enabled?

Yes, checking the order-0 watermark was the original implementation of
the high-order retry without compaction enabled. I do not remember any
reports for that, so I didn't want to touch that path.
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Wednesday, November 23, 2016 2:34 PM Michal Hocko wrote:
> @@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
>  		return false;
>  
> +#ifdef CONFIG_COMPACTION
> +	/*
> +	 * This is a gross workaround to compensate for the lack of a reliable
> +	 * compaction operation. We cannot simply go OOM with the current state
> +	 * of the compaction code because this can lead to a premature OOM
> +	 * declaration.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)

No need to check order once more.

Plus, can we retry without CONFIG_COMPACTION enabled?

> +		return true;
> +#endif
> +
>  	/*
>  	 * There are setups with compaction disabled which would prefer to loop
>  	 * inside the allocator rather than hit the oom killer prematurely.
> --
> Michal Hocko
> SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 11:38:47, Linus Torvalds wrote:
> On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:
> >
> > Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> > already EOL AFAICS.)
> >
> > - send the patch [1] as 4.8-only stable.
>
> I think that's the right thing to do. It's pretty small, and the
> argument that it changes the oom logic too much is pretty bogus, I
> think. The oom logic in 4.8 is simply broken. Let's get it fixed.
> Changing it is the point.

The point I've tried to make is that it is not should_reclaim_retry
which is broken. It's an overly optimistic reliance on compaction to do
its work which led to all those issues. My previous fix 31e49bfda184
("mm, oom: protect !costly allocations some more for
!CONFIG_COMPACTION") tried to cope with that by checking the order-0
watermark, which has proven to help most users. Now it didn't cover
everybody, obviously.

Rather than fiddling with fine tuning of these heuristics I think it
would be safer to simply admit that high-order OOM detection doesn't
work in the 4.8 kernel, and so not declare the OOM killer for those
requests at all. The risk of such a change is not big because there
usually are order-0 requests happening all the time, so if we are
really OOM we would trigger the OOM eventually.

So I am proposing this for the 4.8 stable tree instead
---
commit b2ccdcb731b666aa28f86483656c39c5e53828c7
Author: Michal Hocko
Date:   Wed Nov 23 07:26:30 2016 +0100

    mm, oom: stop pre-mature high-order OOM killer invocations

    31e49bfda184 ("mm, oom: protect !costly allocations some more for
    !CONFIG_COMPACTION") was an attempt to reduce the chances of premature
    OOM killer invocation for high-order requests. It seemed to work for
    most users just fine but it is far from bulletproof and obviously not
    sufficient for Marc, who has reported premature OOM killer invocations
    with 4.8-based kernels. 4.9 with all the compaction improvements seems
    to be behaving much better, but that would be too intrusive to backport
    to 4.8 stable kernels. Instead this patch simply never declares OOM for
    !costly high-order requests. We rely on order-0 requests to do that in
    case we are really out of memory. Order-0 requests are much more common
    and so the risk of a livelock without any way forward is highly unlikely.

    Reported-by: Marc MERLIN
    Signed-off-by: Michal Hocko

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..7401e996009a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3161,6 +3161,16 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * This is a gross workaround to compensate for the lack of a reliable
+	 * compaction operation. We cannot simply go OOM with the current state
+	 * of the compaction code because this can lead to a premature OOM
+	 * declaration.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+#endif
+
 	/*
 	 * There are setups with compaction disabled which would prefer to loop
 	 * inside the allocator rather than hit the oom killer prematurely.
-- 
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> >> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> 4.9rc5 however seems to be doing better, and is still running after 18
> hours. However, I got a few page allocation failures as per below, but the
> system seems to recover.
> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> or is that good enough, and i should go back to 4.8.8 with that patch
> applied?
> https://marc.info/?l=linux-mm=147423605024993
> >>>
> >>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> >>> 4.8 with that patch, yeah.
> >>
> >> So the good news is that it's been running for almost 5H and so far so
> >> good.
> >
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?
>
> [1] https://marc.info/?l=linux-mm=147423605024993

Sorry for my molasses rate of feedback. I found a workaround, setting vm/watermark_scale_factor to 500, and threw that in sysctl. This was on the MythTV box that OOMs everything after about a day on 4.8 otherwise.

I've been running [1] for 9 days on it (4.8.4 + [1]) without issue, but just realized I forgot to remove the watermark_scale_factor workaround. I've restored that now, so I'll see if it becomes unhappy by tomorrow.

I also threw up a few other things you had asked for (vmstat, zoneinfo before and after the first OOM on 4.8.4): http://0x.ca/sim/ref/4.8.4/ (that was before booting into a rebuild with [1] applied)

Simon-
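Simon's workaround raises the kswapd watermarks so background reclaim kicks in much earlier. A sketch of applying it, assuming root and the standard sysctl paths (the drop-in file name here is made up for illustration):

```shell
# Apply at runtime; the kernel default is 10, Simon used 500.
sysctl -w vm.watermark_scale_factor=500

# Persist across reboots via a sysctl drop-in (hypothetical file name).
echo 'vm.watermark_scale_factor = 500' > /etc/sysctl.d/99-watermark-workaround.conf
sysctl --system    # reload all sysctl configuration files
```

This trades some memory otherwise available for caching against more reclaim headroom, which is presumably why it papered over the premature-OOM behavior on this box.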
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 8:14 AM, Vlastimil Babka wrote:
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable.

I think that's the right thing to do. It's pretty small, and the argument that it changes the oom logic too much is pretty bogus, I think. The oom logic in 4.8 is simply broken. Let's get it fixed. Changing it is the point.

Linus
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:25:44PM +0100, Michal Hocko wrote:
> currently AFAIR. I hate that Marc is not falling into that category but
> is it really problem for you to run with 4.9? If we have more users

Don't do anything just on my account. I had a problem, and it's been fixed in 2 different ways: 4.8+patch, or 4.9rc5.

For me this was a 100% regression from 4.6; there was just no way I could copy my data at all with 4.8. It not only failed, but killed all the services on my machine until it randomly killed the shell that was doing the copy.

Personally, I'll stick with 4.8 + this patch, and switch to 4.9 when it's out (I'm a bit wary of RC kernels on a production server, especially when I'm in the middle of trying to get my only good backup to work again).

But at the same time, what I'm doing is probably not common (btrfs on top of dmcrypt, on top of bcache, on top of swraid5, for both source and destination), so I can't comment on whether the fix I just put on my 4.8 kernel causes other regressions or problems for other people.

Either way, I'm personally ok again now, so I thank you all for your help, and will leave the hard decisions to you :)

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue, Nov 22, 2016 at 05:14:02PM +0100, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?

Just wait for 4.8-stable to go end-of-life in a few weeks after 4.9 is released? :)

thanks,

greg k-h
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Tue 22-11-16 17:14:02, Vlastimil Babka wrote:
> On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> > And the better news is that the copy is still going strong, 4.4TB and
> > going. So 4.8.8 is fixed with that one single patch as far as I'm
> > concerned.
> >
> > So thanks for that, looks good to me to merge.
>
> Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is
> already EOL AFAICS).
>
> - send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
> - alternatively a simpler (again 4.8-only) patch that just outright
>   prevents OOM for 0 < order < costly, as Michal already suggested.
> - backport 10+ compaction patches to 4.8 stable
> - something else?
>
> Michal? Linus?

Dunno. To be honest, I do not like [1] because it seriously tweaks the retry logic, and 10+ compaction patches for 4.8 seems too much for a stable tree and quite risky as well. Considering that 4.9 works just much better, is there any strong reason to do a 4.8-specific fix at all? Most users reporting OOM regressions seemed to be ok with what 4.8 does currently, AFAIR. I hate that Marc is not falling into that category, but is it really a problem for you to run with 4.9?

If we have more users seeing this regression, then I would rather go with a simpler 4.8-only "never trigger OOM for order > 0 && order < costly" patch, because that would at least have deterministic behavior.

> [1] https://marc.info/?l=linux-mm=147423605024993

--
Michal Hocko
SUSE Labs
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/22/2016 05:06 PM, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
>> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
>>>> 4.9rc5 however seems to be doing better, and is still running after 18
>>>> hours. However, I got a few page allocation failures as per below, but the
>>>> system seems to recover.
>>>> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
>>>> or is that good enough, and i should go back to 4.8.8 with that patch
>>>> applied?
>>>> https://marc.info/?l=linux-mm=147423605024993
>>>
>>> Hi, I think it's enough for 4.9 for now and I would appreciate trying
>>> 4.8 with that patch, yeah.
>>
>> So the good news is that it's been running for almost 5H and so far so good.
>
> And the better news is that the copy is still going strong, 4.4TB and
> going. So 4.8.8 is fixed with that one single patch as far as I'm
> concerned.
>
> So thanks for that, looks good to me to merge.

Thanks a lot for the testing. So what do we do now about 4.8? (4.7 is already EOL AFAICS.)

- send the patch [1] as 4.8-only stable. Greg won't like that, I expect.
- alternatively a simpler (again 4.8-only) patch that just outright prevents OOM for 0 < order < costly, as Michal already suggested.
- backport 10+ compaction patches to 4.8 stable
- something else?

Michal? Linus?

[1] https://marc.info/?l=linux-mm=147423605024993

> Marc
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 21, 2016 at 01:56:39PM -0800, Marc MERLIN wrote:
> On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > > 4.9rc5 however seems to be doing better, and is still running after 18
> > > hours. However, I got a few page allocation failures as per below, but the
> > > system seems to recover.
> > > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > > or is that good enough, and i should go back to 4.8.8 with that patch
> > > applied?
> > > https://marc.info/?l=linux-mm=147423605024993
> >
> > Hi, I think it's enough for 4.9 for now and I would appreciate trying
> > 4.8 with that patch, yeah.
>
> So the good news is that it's been running for almost 5H and so far so good.

And the better news is that the copy is still going strong, 4.4TB and going. So 4.8.8 is fixed with that one single patch as far as I'm concerned.

So thanks for that, looks good to me to merge.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On Mon, Nov 21, 2016 at 10:50:20PM +0100, Vlastimil Babka wrote:
> > 4.9rc5 however seems to be doing better, and is still running after 18
> > hours. However, I got a few page allocation failures as per below, but the
> > system seems to recover.
> > Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> > or is that good enough, and i should go back to 4.8.8 with that patch
> > applied?
> > https://marc.info/?l=linux-mm=147423605024993
>
> Hi, I think it's enough for 4.9 for now and I would appreciate trying
> 4.8 with that patch, yeah.

So the good news is that it's been running for almost 5H and so far so good.

> The failures below are in a GFP_NOWAIT context, which cannot do any
> reclaim so it's not affected by OOM rewrite. If it's a regression, it
> has to be caused by something else. But it seems the code in
> cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
> reserves, and has a fallback scenario for allocation failure, in which
> case I would argue that it should add __GFP_NOWARN, as these warnings
> can't help anyone. CCing Tejun as author of commit d4aad7ff0.

No, that's not a regression, I get those on occasion. The good news is that they're not fatal. Just got another one with 4.8.8. No idea if they're actual errors I should worry about, or just warnings that spam the console a bit, but things retry, recover and succeed, so I can ignore them. Another one from 4.8.8 below.

I'll report back tomorrow to see if this has run for a day and, if so, I'll call your patch a fix for my problem (but at this point, it's already looking very good).

Thanks,
Marc

cron: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
CPU: 4 PID: 9748 Comm: cron Tainted: G U 4.8.8-amd64-volpreempt-sysrq-20161108vb2 #9
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
 a1e37429f6d0 9a36a0bb a1e37429f768 9a1359d4 022040009f5e8d00 0012 9a140770
Call Trace:
 [] dump_stack+0x61/0x7d
 [] warn_alloc_failed+0x11c/0x132
 [] ? wakeup_kswapd+0x8e/0x153
 [] __alloc_pages_nodemask+0x87b/0xb02
 [] ? __alloc_pages_nodemask+0x87b/0xb02
 [] cache_grow_begin+0xb2/0x30b
 [] fallback_alloc+0x137/0x19f
 [] cache_alloc_node+0xd3/0xde
 [] kmem_cache_alloc_node+0x8e/0x163
 [] cfq_get_queue+0x162/0x29d
 [] ? kmem_cache_alloc+0xd7/0x14b
 [] ? slab_post_alloc_hook+0x5b/0x66
 [] cfq_set_request+0x141/0x2be
 [] ? timekeeping_get_ns+0x1e/0x32
 [] ? ktime_get+0x41/0x52
 [] ? ktime_get_ns+0x9/0xb
 [] ? cfq_init_icq+0x12/0x19
 [] elv_set_request+0x1f/0x24
 [] get_request+0x324/0x5aa
 [] ? wake_up_atomic_t+0x2c/0x2c
 [] blk_queue_bio+0x19f/0x28c
 [] generic_make_request+0xbd/0x160
 [] submit_bio+0x100/0x11d
 [] ? map_swap_page+0x12/0x14
 [] ? get_swap_bio+0x57/0x6c
 [] swap_readpage+0x110/0x118
 [] read_swap_cache_async+0x26/0x2d
 [] swapin_readahead+0x11a/0x16a
 [] do_swap_page+0x9c/0x431
 [] ? do_swap_page+0x9c/0x431
 [] handle_mm_fault+0xa4d/0xb3d
 [] ? vfs_getattr_nosec+0x26/0x37
 [] __do_page_fault+0x267/0x43d
 [] do_page_fault+0x25/0x27
 [] page_fault+0x28/0x30
Mem-Info:
active_anon:532194 inactive_anon:133376 isolated_anon:0
 active_file:4118244 inactive_file:382010 isolated_file:0
 unevictable:1687 dirty:3502 writeback:386111 unstable:0
 slab_reclaimable:41767 slab_unreclaimable:106595
 mapped:512496 shmem:582026 pagetables:5352 bounce:0
 free:92092 free_pcp:176 free_cma:2072
Node 0 active_anon:2128776kB inactive_anon:533504kB active_file:16472976kB inactive_file:1528040kB unevictable:6748kB isolated(anon):0kB isolated(file):0kB mapped:2049984kB dirty:14008kB writeback:154kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2328104kB writeback_tmp:0kB unstable:0kB pages_scanned:1 all_unreclaimable? no
Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3200 23767 23767 23767
Node 0 DMA32 free:117580kB min:35424kB low:44280kB high:53136kB active_anon:3980kB inactive_anon:400kB active_file:2632672kB inactive_file:286956kB unevictable:0kB writepending:288296kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:41632kB slab_unreclaimable:19512kB kernel_stack:880kB pagetables:676kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 20567 20567 20567
Node 0 Normal free:234904kB min:226544kB low:283180kB high:339816kB active_anon:2124796kB inactive_anon:533104kB active_file:13840304kB inactive_file:1241268kB unevictable:6748kB writepending:1270156kB present:21485568kB managed:21080636kB mlocked:6748kB slab_reclaimable:125436kB
Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free
On 11/21/2016 04:43 PM, Marc MERLIN wrote:
> Howdy,
>
> As a followup to https://plus.google.com/u/0/+MarcMERLIN/posts/A3FrLVo3kc6
>
> http://pastebin.com/yJybSHNq and http://pastebin.com/B6xEH4Dw
> show a system with plenty of RAM (24GB) falling over and killing innocent
> user space apps, a few hours after I start a 9TB copy between 2 raid5
> arrays, both hosting bcache, dmcrypt and btrfs (yes, that's 3 layers
> under btrfs).
>
> This kind of stuff worked until 4.6 if I'm not mistaken, and started
> failing with 4.8 (I didn't try 4.7).
>
> I tried applying
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9f7e3387939b036faacf4e7f32de7bb92a6635d6
> to 4.8.8 and it didn't help:
> http://pastebin.com/2LUicF3k
>
> 4.9rc5 however seems to be doing better, and is still running after 18
> hours. However, I got a few page allocation failures as per below, but the
> system seems to recover.
> Vlastimil, do you want me to continue the copy on 4.9 (may take 3-5 days)
> or is that good enough, and I should go back to 4.8.8 with that patch
> applied?
> https://marc.info/?l=linux-mm&m=147423605024993

Hi, I think it's enough for 4.9 for now and I would appreciate trying
4.8 with that patch, yeah.

The failures below are in a GFP_NOWAIT context, which cannot do any
reclaim, so it's not affected by the OOM rewrite. If it's a regression, it
has to be caused by something else. But it seems the code in
cfq_get_queue() intentionally doesn't want to reclaim or use any atomic
reserves, and has a fallback scenario for allocation failure, in which
case I would argue that it should add __GFP_NOWARN, as these warnings
can't help anyone. CCing Tejun as author of commit d4aad7ff0.
>
> Thanks,
> Marc
>
>
> bash: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
> CPU: 4 PID: 16706 Comm: bash Not tainted 4.9.0-rc5-amd64-volpreempt-sysrq-20161108 #1
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
>  9812088ff680 9a36f697 9aababe8
>  9812088ff710 9a13ae2b 02204012 9aababe8
>  9812088ff6a8 0010 9812088ff720 9812088ff6c0
> Call Trace:
>  [] dump_stack+0x61/0x7d
>  [] warn_alloc+0x107/0x11b
>  [] __alloc_pages_slowpath+0x727/0x8f2
>  [] ? get_page_from_freelist+0x62e/0x66f
>  [] __alloc_pages_nodemask+0x15c/0x220
>  [] cache_grow_begin+0xb2/0x308
>  [] fallback_alloc+0x137/0x19f
>  [] cache_alloc_node+0xd3/0xde
>  [] kmem_cache_alloc_node+0x8e/0x163
>  [] cfq_get_queue+0x162/0x29d
>  [] ? kmem_cache_alloc+0xd7/0x14b
>  [] ? mempool_alloc_slab+0x15/0x17
>  [] ? mempool_alloc+0x69/0x132
>  [] cfq_set_request+0x141/0x2be
>  [] ? timekeeping_get_ns+0x1e/0x32
>  [] ? ktime_get+0x41/0x52
>  [] ? ktime_get_ns+0x9/0xb
>  [] ? cfq_init_icq+0x12/0x19
>  [] elv_set_request+0x1f/0x24
>  [] get_request+0x324/0x5aa
>  [] ? wake_up_atomic_t+0x2c/0x2c
>  [] blk_queue_bio+0x19f/0x28c
>  [] generic_make_request+0xbd/0x160
>  [] submit_bio+0x100/0x11d
>  [] ? map_swap_page+0x12/0x14
>  [] ? get_swap_bio+0x57/0x6c
>  [] swap_readpage+0x106/0x10e
>  [] read_swap_cache_async+0x26/0x2d
>  [] swapin_readahead+0x11a/0x16a
>  [] do_swap_page+0x9c/0x42e
>  [] ? do_swap_page+0x9c/0x42e
>  [] handle_mm_fault+0xa51/0xb71
>  [] ? _raw_spin_lock_irq+0x1c/0x1e
>  [] __do_page_fault+0x29e/0x425
>  [] do_page_fault+0x25/0x27
>  [] page_fault+0x28/0x30
> Mem-Info:
> active_anon:563129 inactive_anon:140630 isolated_anon:0
>  active_file:4036325 inactive_file:448954 isolated_file:288
>  unevictable:1760 dirty:9197 writeback:446395 unstable:0
>  slab_reclaimable:47810 slab_unreclaimable:120834
>  mapped:534180 shmem:627708 pagetables:5647 bounce:0
>  free:90108 free_pcp:218 free_cma:78
> Node 0 active_anon:2252516kB inactive_anon:562520kB active_file:16145300kB inactive_file:1795816kB unevictable:7040kB isolated(anon):0kB isolated(file):1152kB mapped:2136720kB dirty:367 1785580kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2510832kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? no
> Node 0 DMA free:15884kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 3199 23767 23767 23767
> Node 0 DMA32 free:117656kB min:35424kB low:44280kB high:53136kB active_anon:38004kB inactive_anon:13540kB active_file:2221420kB inactive_file:307236kB unevictable:0kB writepending:311780kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:47992kB slab_unreclaimable:25360kB kernel_stack:512kB pagetables:796kB bounce:0kB free_pcp:96kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 20567 20567 20567
> Node 0 Normal free:226892kB