Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On 03/14/2018 02:08 AM, Michal Hocko wrote:
> On Mon 05-03-18 13:04:24, Laura Abbott wrote:
> > On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > > Hi,
> > > >
> > > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > > in D state. This has been seen as far back as 4.13 but is still
> > > > present on 4.14:
> > > >
> > > [...]
> > > > This looks like everything is blocked on the writeback completing but
> > > > the writeback has been throttled. According to the infra team, this
> > > > problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > > quite match since this seems to be completely stuck. Any suggestions to
> > > > narrow the problem down?
> > >
> > > How much dirtyable memory does the system have? We do allow only lowmem
> > > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > > highmem_is_dirtyable?
> >
> > Setting highmem_is_dirtyable did fix the problem. The infrastructure
> > people seemed satisfied enough with this (and are happy to have the
> > machines back). I'll see if they are willing to run a few more tests
> > to get some more state information.
>
> Please be aware that highmem_is_dirtyable is not for free. There are
> some code paths which can only allocate from lowmem (e.g. block device
> AFAIR) and those could fill up the whole lowmem without any throttling.

Good to note. This particular setup is one basically everyone dislikes,
so I think this is only encouragement to move to something else.

Thanks,
Laura
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On Tue 06-03-18 20:28:59, Tetsuo Handa wrote:
> Laura Abbott wrote:
> > On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > >> Hi,
> > >>
> > >> The Fedora arm-32 build VMs have a somewhat long standing problem
> > >> of hanging when running mkfs.ext4 with a bunch of processes stuck
> > >> in D state. This has been seen as far back as 4.13 but is still
> > >> present on 4.14:
> > >>
> > > [...]
> > >> This looks like everything is blocked on the writeback completing but
> > >> the writeback has been throttled. According to the infra team, this
> > >> problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > >> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > >> quite match since this seems to be completely stuck. Any suggestions to
> > >> narrow the problem down?
> > >
> > > How much dirtyable memory does the system have? We do allow only lowmem
> > > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > > highmem_is_dirtyable?
> >
> > Setting highmem_is_dirtyable did fix the problem. The infrastructure
> > people seemed satisfied enough with this (and are happy to have the
> > machines back).
>
> That's good.
>
> > I'll see if they are willing to run a few more tests
> > to get some more state information.
>
> Well, I'm far from understanding what is happening in your case, but I'm
> interested in other threads which were trying to allocate memory. Therefore,
> I would appreciate it if they could take SysRq-m + SysRq-t rather than
> SysRq-w (as described at http://akari.osdn.jp/capturing-kernel-messages.html ).
>
> Code which assumes that kswapd can make progress can get stuck when kswapd
> is blocked somewhere. And wbt_wait() seems to change behavior based on
> current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
> make progress, I worry that it leads to hangups like your case.

Tetsuo, could you stop this finally, pretty please? This is a well known
limitation of 32b architectures with more than 4G. The lowmem can only
handle 896MB of memory and that can be filled up with other kernel
allocations. Stalled writeback is _usually_ a result of the little
dirtyable memory left in lowmem. We cannot simply allow highmem to be
dirtyable by default for the reasons explained in the other email.

I can imagine that it is hard for you to grasp that not everything is a
"silent hang during OOM" and that there are other things going on in the
VM.
--
Michal Hocko
SUSE Labs
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On Mon 05-03-18 13:04:24, Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > > Hi,
> > >
> > > The Fedora arm-32 build VMs have a somewhat long standing problem
> > > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > > in D state. This has been seen as far back as 4.13 but is still
> > > present on 4.14:
> > >
> > [...]
> > > This looks like everything is blocked on the writeback completing but
> > > the writeback has been throttled. According to the infra team, this
> > > problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > > quite match since this seems to be completely stuck. Any suggestions to
> > > narrow the problem down?
> >
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
>
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back). I'll see if they are willing to run a few more tests
> to get some more state information.

Please be aware that highmem_is_dirtyable is not for free. There are
some code paths which can only allocate from lowmem (e.g. block device
AFAIR) and those could fill up the whole lowmem without any throttling.
--
Michal Hocko
SUSE Labs
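One way to check for the lowmem-exhaustion scenario Michal describes is to
watch how much lowmem is actually left. On CONFIG_HIGHMEM kernels,
/proc/meminfo exposes LowTotal:/LowFree: and HighTotal:/HighFree: fields;
below is a minimal userspace sketch (the field names are standard, the
program itself is only illustrative) that prints them:
--
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen /proc/meminfo");
		return 1;
	}
	/* Print only the low/high memory accounting lines. */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "LowTotal:", 9) ||
		    !strncmp(line, "LowFree:", 8) ||
		    !strncmp(line, "HighTotal:", 10) ||
		    !strncmp(line, "HighFree:", 9))
			fputs(line, stdout);
	fclose(f);
	return 0;
}
--
Running this periodically while the workload fills lowmem would show
LowFree: shrinking even though plenty of highmem remains free.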
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
Laura Abbott wrote:
> On 02/26/2018 06:28 AM, Michal Hocko wrote:
> > On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> >> Hi,
> >>
> >> The Fedora arm-32 build VMs have a somewhat long standing problem
> >> of hanging when running mkfs.ext4 with a bunch of processes stuck
> >> in D state. This has been seen as far back as 4.13 but is still
> >> present on 4.14:
> >>
> > [...]
> >> This looks like everything is blocked on the writeback completing but
> >> the writeback has been throttled. According to the infra team, this problem
> >> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> >> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> >> quite match since this seems to be completely stuck. Any suggestions to
> >> narrow the problem down?
> >
> > How much dirtyable memory does the system have? We do allow only lowmem
> > to be dirtyable by default on 32b highmem systems. Maybe you have the
> > lowmem mostly consumed by the kernel memory. Have you tried to enable
> > highmem_is_dirtyable?
>
> Setting highmem_is_dirtyable did fix the problem. The infrastructure
> people seemed satisfied enough with this (and are happy to have the
> machines back).

That's good.

> I'll see if they are willing to run a few more tests
> to get some more state information.

Well, I'm far from understanding what is happening in your case, but I'm
interested in other threads which were trying to allocate memory. Therefore,
I would appreciate it if they could take SysRq-m + SysRq-t rather than
SysRq-w (as described at http://akari.osdn.jp/capturing-kernel-messages.html ).

Code which assumes that kswapd can make progress can get stuck when kswapd
is blocked somewhere. And wbt_wait() seems to change behavior based on
current_is_kswapd(). If everyone is waiting for kswapd but kswapd cannot
make progress, I worry that it leads to hangups like your case.

Below is a totally different case which I got today, but it is an example
of how SysRq-m + SysRq-t can give us some clues. Running the program below
on CPU 0 (using "taskset -c 0") on 4.16-rc4 against XFS can trigger OOM
lockups (a hangup without being able to invoke the OOM killer).
--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	unsigned long i;

	/* Fork 1024 writers which dirty pagecache as fast as they can. */
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd;

			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			memset(buffer, 0, sizeof(buffer));
			sleep(1);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
			_exit(0);
		}
	}
	/* Grab as much virtual memory as overcommit will allow. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	return 0;
}
--
MM people love to dismiss this kind of problem as "It is a DoS attack", but
only one CPU out of 8 CPUs is occupied by this program, which means that
other threads (including kernel threads doing memory reclaim activities)
are free to use the idle CPUs 1-7 as they need. Also, while CPU 0 was
really busy processing hundreds of threads doing direct reclaim, the idle
CPUs 1-7 should have been able to invoke the OOM killer shortly, because
there should already have been little left to reclaim. Also, the
writepending: counter did not decrease (and no disk I/O was observed)
during the OOM lockup. Thus, I don't know whether this is just an
overloaded system.
[ 660.035957] Node 0 Normal free:17056kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:15136kB inactive_file:13296kB unevictable:0kB writepending:42320kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22448kB pagetables:37304kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 709.498421] Node 0 Normal free:16920kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:19180kB inactive_file:17640kB unevictable:0kB writepending:42740kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22400kB pagetables:37304kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
[ 751.290146] Node 0 Normal free:16920kB min:17320kB low:21648kB high:25976kB active_anon:570132kB inactive_anon:13452kB active_file:14556kB inactive_file:14452kB unevictable:0kB writepending:42740kB present:1048576kB managed:951188kB mlocked:0kB kernel_stack:22400kB pagetables:37304kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
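For reference, the SysRq-m + SysRq-t dumps Tetsuo asks for can also be taken
without console access by writing the command characters to
/proc/sysrq-trigger as root (the output lands in the kernel log, so raise
the console loglevel or read it back with dmesg). A minimal sketch; the
trigger file is the standard interface, the helper function is just
illustrative:
--
#include <stdio.h>

/* Write a single SysRq command character to the trigger file. */
static void sysrq(char cmd)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");

	if (!f) {
		perror("fopen /proc/sysrq-trigger");
		return;
	}
	fputc(cmd, f);
	fclose(f);
}

int main(void)
{
	sysrq('m');	/* dump memory info (SysRq-m) */
	sysrq('t');	/* dump all task states and stacks (SysRq-t) */
	return 0;
}
--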
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On 02/26/2018 06:28 AM, Michal Hocko wrote:
> On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> > Hi,
> >
> > The Fedora arm-32 build VMs have a somewhat long standing problem
> > of hanging when running mkfs.ext4 with a bunch of processes stuck
> > in D state. This has been seen as far back as 4.13 but is still
> > present on 4.14:
> >
> [...]
> > This looks like everything is blocked on the writeback completing but
> > the writeback has been throttled. According to the infra team, this
> > problem is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> > https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> > quite match since this seems to be completely stuck. Any suggestions to
> > narrow the problem down?
>
> How much dirtyable memory does the system have? We do allow only lowmem
> to be dirtyable by default on 32b highmem systems. Maybe you have the
> lowmem mostly consumed by the kernel memory. Have you tried to enable
> highmem_is_dirtyable?

Setting highmem_is_dirtyable did fix the problem. The infrastructure
people seemed satisfied enough with this (and are happy to have the
machines back). I'll see if they are willing to run a few more tests
to get some more state information.

Thanks,
Laura
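The knob in question lives at /proc/sys/vm/highmem_is_dirtyable
(equivalently, `sysctl -w vm.highmem_is_dirtyable=1`). A minimal sketch of
flipping it from a program, with error handling kept deliberately simple:
--
#include <stdio.h>

int main(void)
{
	/* Needs root; the file only exists on CONFIG_HIGHMEM kernels. */
	FILE *f = fopen("/proc/sys/vm/highmem_is_dirtyable", "w");

	if (!f) {
		perror("fopen /proc/sys/vm/highmem_is_dirtyable");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}
--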
Re: Hangs in balance_dirty_pages with arm-32 LPAE + highmem
On Fri 23-02-18 11:51:41, Laura Abbott wrote:
> Hi,
>
> The Fedora arm-32 build VMs have a somewhat long standing problem
> of hanging when running mkfs.ext4 with a bunch of processes stuck
> in D state. This has been seen as far back as 4.13 but is still
> present on 4.14:
>
[...]
> This looks like everything is blocked on the writeback completing but
> the writeback has been throttled. According to the infra team, this problem
> is _not_ seen without LPAE (i.e. only 4G of RAM). I did see
> https://patchwork.kernel.org/patch/10201593/ but that doesn't seem to
> quite match since this seems to be completely stuck. Any suggestions to
> narrow the problem down?

How much dirtyable memory does the system have? We do allow only lowmem
to be dirtyable by default on 32b highmem systems. Maybe you have the
lowmem mostly consumed by the kernel memory. Have you tried to enable
highmem_is_dirtyable?
--
Michal Hocko
SUSE Labs
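To make the failure mode concrete: the dirty threshold is roughly
dirtyable_memory * vm.dirty_ratio / 100, and with highmem_is_dirtyable off,
a 32b highmem box computes the dirtyable total against lowmem alone. The
toy model below is a userspace sketch, not the kernel's actual
global_dirtyable_memory() (which also counts reclaimable file pages), and
all numbers are invented; it only shows how the threshold collapses once
kernel allocations eat lowmem:
--
#include <stdio.h>

/* Simplified dirty threshold in pages, mirroring the idea (not the
 * code) of the calculation in mm/page-writeback.c. */
static unsigned long dirty_thresh(unsigned long lowmem_free,
				  unsigned long highmem_free,
				  int highmem_is_dirtyable,
				  int dirty_ratio)
{
	unsigned long dirtyable = lowmem_free;

	if (highmem_is_dirtyable)
		dirtyable += highmem_free;
	return dirtyable * dirty_ratio / 100;
}

int main(void)
{
	/* Hypothetical LPAE box: ~30GB of highmem free, but kernel
	 * allocations have left only ~32MB of lowmem (4K pages). */
	unsigned long highmem_free = 30UL * 1024 * 1024 / 4;
	unsigned long lowmem_free = 32UL * 1024 / 4;

	printf("thresh, default:                  %lu pages\n",
	       dirty_thresh(lowmem_free, highmem_free, 0, 20));
	printf("thresh, highmem_is_dirtyable=1:   %lu pages\n",
	       dirty_thresh(lowmem_free, highmem_free, 1, 20));
	return 0;
}
--
With the default, writers are throttled against a threshold of only a few
thousand pages, which stalled writeback can easily keep pinned; that is the
situation Laura's D-state backtraces below show.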
Hangs in balance_dirty_pages with arm-32 LPAE + highmem
Hi,

The Fedora arm-32 build VMs have a somewhat long standing problem
of hanging when running mkfs.ext4 with a bunch of processes stuck
in D state. This has been seen as far back as 4.13 but is still
present on 4.14:

sysrq: SysRq : Show Blocked State
  task            PC stack   pid father
auditd          D    0   377      1 0x0020
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] (io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] (balance_dirty_pages.constprop.6+0xac8/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] (balance_dirty_pages_ratelimited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] (generic_perform_write+0x174/0x1a4)
[] (generic_perform_write) from [] (__generic_file_write_iter+0x16c/0x198)
[] (__generic_file_write_iter) from [] (ext4_file_write_iter+0x314/0x414)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (__sys_trace_return+0x0/0x10)
rs:main Q:Reg   D    0   441      1 0x
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] (io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] (balance_dirty_pages.constprop.6+0xac8/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] (balance_dirty_pages_ratelimited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] (generic_perform_write+0x174/0x1a4)
[] (generic_perform_write) from [] (__generic_file_write_iter+0x16c/0x198)
[] (__generic_file_write_iter) from [] (ext4_file_write_iter+0x314/0x414)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
ntpd            D    0  1453      1 0x0001
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] (io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] (balance_dirty_pages.constprop.6+0xac8/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] (balance_dirty_pages_ratelimited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] (generic_perform_write+0x174/0x1a4)
[] (generic_perform_write) from [] (__generic_file_write_iter+0x16c/0x198)
[] (__generic_file_write_iter) from [] (ext4_file_write_iter+0x314/0x414)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
kojid           D    0  4616      1 0x
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (schedule_timeout+0x328/0x3ac)
[] (schedule_timeout) from [] (io_schedule_timeout+0x24/0x38)
[] (io_schedule_timeout) from [] (balance_dirty_pages.constprop.6+0xac8/0xc5c)
[] (balance_dirty_pages.constprop.6) from [] (balance_dirty_pages_ratelimited+0x2b8/0x43c)
[] (balance_dirty_pages_ratelimited) from [] (generic_perform_write+0x174/0x1a4)
[] (generic_perform_write) from [] (__generic_file_write_iter+0x16c/0x198)
[] (__generic_file_write_iter) from [] (ext4_file_write_iter+0x314/0x414)
[] (ext4_file_write_iter) from [] (__vfs_write+0x100/0x128)
[] (__vfs_write) from [] (vfs_write+0xc0/0x194)
[] (vfs_write) from [] (SyS_write+0x44/0x7c)
[] (SyS_write) from [] (ret_fast_syscall+0x0/0x4c)
kworker/u8:0    D    0 28525      2 0x
Workqueue: writeback wb_workfn (flush-7:0)
[] (__schedule) from [] (schedule+0x98/0xbc)
[] (schedule) from [] (io_schedule+0x1c/0x2c)
[] (io_schedule) from [] (wbt_wait+0x21c/0x300)
[] (wbt_wait) from [] (blk_mq_make_request+0xac/0x560)
[] (blk_mq_make_request) from [] (generic_make_request+0xd0/0x214)
[] (generic_make_request) from [] (submit_bio+0x114/0x16c)
[] (submit_bio) from [] (submit_bh_wbc+0x190/0x1a0)
[] (submit_bh_wbc) from [] (__block_write_full_page+0x2e8/0x43c)
[] (__block_write_full_page) from [] (block_write_full_page+0x80/0xec)
[] (block_write_full_page) from [] (__writepage+0x1c/0x4c)
[] (__writepage) from [] (write_cache_pages+0x350/0x3f0)
[] (write_cache_pages) from [] (generic_writepages+0x44/0x60)
[] (generic_writepages) from [] (do_writepages+0x3c/0x74)
[] (do_writepages) from [] (__writeback_single_inode+0xb4/0x404)
[] (__writeback_single_inode) from [] (writeback_sb_inodes+0x258/0x438)
[] (writeback_sb_inodes) from [] (__writeback_inodes_wb+0x6c/0xa8)
[] (__writeback_inodes_wb) from [] (wb_writeback+0x1c4/0x30c)
[] (wb_writeback) from [] (wb_workfn+0x130/0x450)
[] (wb_workfn) from [] (process_one_work+0x254/0x42c)
[] (process_one_wor