Re: kernel hangs during balance
I finally got a full stack trace via sysrq. A lot of stacks seem to end in page fault -- I wonder if its because my page file is a loopback on btrfs. https://pastebin.com/GyWAu1EP $ cat /proc/cmdline BOOT_IMAGE=/vmlinuz-4.14.8-1.el7.elrepo.x86_64 root=UUID=35f0ce3f-0902-47a3-8ad8-86179d1f3e3a ro rootflags=subvol=root00 vconsole.keymap=us crashkernel=128M vconsole.font=latarcyrheb-sun16 consoleblank=600 scsi_mod.use_blk_mq=n LANG=en_US.UTF-8 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
Holger Hoffstätte posted on Wed, 20 Dec 2017 20:58:14 +0100 as excerpted: > On 12/20/17 20:02, Chris Murphy wrote: >> I don't know if it's the sending MUA or the list server, but the line >> wrapping makes this much harder to follow. I suggest putting it in a >> text file and attaching the text file. It's definitely not on the >> receiving side, I see it here also: >> https://www.spinics.net/lists/linux-btrfs/msg72872.html > > You can see enough to suggest that blk-mq is hanging, which is > "unsurprising" (being kind here) with such an old kernel. Rich, build > your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with > scsi_mod.use_blk_mq=n as kernel prarameter. Well, the kernel is 4.4 "el repo". 4.4 /is/ an LTS (and elrepo is AFAIK a red hat backports repo, not sure how official, but useful for people on red hat), but it's now the second-back LTS, with 4.9 and the new 4.14 being newer LTS series. The thing is that this is the btrfs list, and we're development and thus rather forward focused here. As such, we normally want at /least/ the second newest (first back) lts series kernel for best chance at reasonable support. While I understand people who want to stick with LTS being reluctant to go with the /newest/ LTS before even a single current release has passed, making that LTS still a bit new as such things go, certainly the one-back LTS, 4.9, should be reasonable. So yes, tho 4.4 is at least an LTS, for purposes of btrfs and this list, it really is rather old now, and an upgrade, presumably to 4.9 in keeping with the LTS theme, would be recommended. If the issue can be confirmed on the current and LTS 4.14, so much the better, but certainly, 4.9 is a recommended upgrade, and a bug there would still be in the concern for a fix range, while 4.4... really is just out of the focus range for this list, tho various longer focus distros will of course still provide support for it. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
I switched to the LT kernel because of this issue. I was running mainline and thought that LT would get me stability. I can switch back to LT while we RCA. At the risk of changing two things, I could add that (scsi_mod.use_blk_mq=n) to my boot and also switch back to ML. I do notice that disk IO is high at the time of the hang -- I've assumed it is btrfs balance, but if I read the logs right, it may have finished before the hang. Adding more logging. I've also put the kernel logs into a pastebin: https://pastebin.com/80qRXA9c ...wish I could get the watchdog to force a dump/trace. Will look at that, too. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
On 12/20/17 20:02, Chris Murphy wrote: > I don't know if it's the sending MUA or the list server, but the line > wrapping makes this much harder to follow. I suggest putting it in a > text file and attaching the text file. It's definitely not on the > receiving side, I see it here also: > https://www.spinics.net/lists/linux-btrfs/msg72872.html You can see enough to suggest that blk-mq is hanging, which is "unsurprising" (being kind here) with such an old kernel. Rich, build your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with scsi_mod.use_blk_mq=n as kernel prarameter. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
I don't know if it's the sending MUA or the list server, but the line wrapping makes this much harder to follow. I suggest putting it in a text file and attaching the text file. It's definitely not on the receiving side, I see it here also: https://www.spinics.net/lists/linux-btrfs/msg72872.html Also anytime you get a hang, it's best to issue sysrq+w which dumps extra information into kernel messages that the developers sometimes find useful. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
Ok, caught the hung tasks last night. I don't *think* this is related, because I pretty sure this isn't happening on the same filesystem, but I do have a loopback swap on one btrfs drive. The hang might have occurred after the btrfs balance was finished which is confusing. I'm adding timestamps to my next btrfs job. Dec 20 01:14:50 tendo [64680.942928] INFO: task systemd:1 blocked for more than 120 seconds. Dec 20 01:14:50 tendo [64680.945844] Not tainted 4.4.106-1.el7.elrepo.x86_64 #1 Dec 20 01:14:50 tendo [64680.948900] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 20 01:14:50 tendo [64680.951987] 88040d6730d8 Dec 20 01:14:50 tendo 81a81500 Dec 20 01:14:50 tendo 88040d5f8000 Dec 20 01:14:50 tendo 88040d674000 Dec 20 01:14:50 tendo Dec 20 01:14:50 tendo [64680.955107] Dec 20 01:14:50 tendo 7fff Dec 20 01:14:50 tendo 8800bd19a648 Dec 20 01:14:50 tendo e8c03980 Dec 20 01:14:50 tendo Dec 20 01:14:50 tendo [64680.957871] 88040d6730f0 Dec 20 01:14:50 tendo 81700435 Dec 20 01:14:50 tendo 88041fa16cc0 Dec 20 01:14:50 tendo 88040d6731a0 Dec 20 01:14:50 tendo Dec 20 01:14:50 tendo [64680.960618] Call Trace: Dec 20 01:14:50 tendo [64680.963342] [] schedule+0x35/0x80 Dec 20 01:14:50 tendo [64680.966052] [] schedule_timeout+0x237/0x2d0 Dec 20 01:14:50 tendo [64680.968490] [] io_schedule_timeout+0xa6/0x110 Dec 20 01:14:50 tendo [64680.970900] [] bt_get+0x14c/0x1d0 Dec 20 01:14:50 tendo [64680.973211] [] ? prepare_to_wait_event+0xf0/0xf0 Dec 20 01:14:50 tendo [64680.975383] [] blk_mq_get_tag+0xb0/0xe0 Dec 20 01:14:50 tendo [64680.977540] [] __blk_mq_alloc_request+0x1b/0x1f0 Dec 20 01:14:50 tendo [64680.979708] [] blk_mq_map_request+0x19e/0x210 Dec 20 01:14:50 tendo [64680.981871] [] blk_sq_make_request+0xab/0x3d0 Dec 20 01:14:50 tendo [64680.984004] [] ? generic_make_request_checks+0x29f/0x4f0 Dec 20 01:14:50 tendo [64680.986122] [] ? mempool_alloc+0x6e/0x170 Dec 20 01:14:50 tendo [64680.988207] [] generic_make_request+0x113/0x2c0 Dec 20 01:14:50 tendo [64680.990137] [] submit_bio+0x77/0x150 Dec 20 01:14:50 tendo [64680.992041] [] ? __test_set_page_writeback+0x149/0x1c0 Dec 20 01:14:50 tendo [64680.993949] [] __swap_writepage+0x20d/0x250 Dec 20 01:14:50 tendo [64680.995853] [] ? __frontswap_store+0x8c/0x120 Dec 20 01:14:50 tendo [64680.997720] [] swap_writepage+0x39/0x70 Dec 20 01:14:50 tendo [64680.999595] [] pageout.isra.43+0x170/0x290 Dec 20 01:14:50 tendo [64681.001291] [] shrink_page_list+0x353/0x780 Dec 20 01:14:50 tendo [64681.002975] [] shrink_inactive_list+0x217/0x530 Dec 20 01:14:50 tendo [64681.004652] [] shrink_lruvec+0x588/0x740 Dec 20 01:14:50 tendo [64681.006311] [] shrink_zone+0xeb/0x2e0 Dec 20 01:14:50 tendo [64681.007967] [] do_try_to_free_pages+0x164/0x430 Dec 20 01:14:50 tendo [64681.009617] [] ? throttle_direct_reclaim+0x14e/0x240 Dec 20 01:14:50 tendo [64681.011141] [] try_to_free_pages+0xd5/0x190 Dec 20 01:14:51 tendo [64681.012659] [] __alloc_pages_slowpath.constprop.89+0x3a0/0x6e8 Dec 20 01:14:51 tendo [64681.014191] [] ? sched_clock_cpu+0x72/0xa0 Dec 20 01:14:51 tendo [64681.015701] [] __alloc_pages_nodemask+0x253/0x260 Dec 20 01:14:51 tendo [64681.017215] [] ? __kmalloc_node_track_caller+0x249/0x2b0 Dec 20 01:14:51 tendo [64681.018392] [] alloc_pages_vma+0xa5/0x220 Dec 20 01:14:51 tendo [64681.019296] [] handle_pte_fault+0xd21/0x1490 Dec 20 01:14:51 tendo [64681.020195] [] ? legitimize_path.isra.23+0x2e/0x60 Dec 20 01:14:51 tendo [64681.021100] [] ? put_dec+0x1a/0x80 Dec 20 01:14:51 tendo [64681.021992] [] ? number.isra.13+0x2dd/0x310 Dec 20 01:14:51 tendo [64681.022883] [] handle_mm_fault+0x250/0x540 Dec 20 01:14:51 tendo [64681.023757] [] __do_page_fault+0x188/0x3f0 Dec 20 01:14:51 tendo [64681.024626] [] do_page_fault+0x30/0x80 Dec 20 01:14:51 tendo [64681.025486] [] ? kernfs_path_locked+0x37/0x80 Dec 20 01:14:51 tendo [64681.026350] [] page_fault+0x28/0x30 Dec 20 01:14:51 tendo [64681.027201] [] ? copy_user_enhanced_fast_string+0xe/0x20 Dec 20 01:14:51 tendo [64681.028034] [] ? seq_read+0x295/0x390 Dec 20 01:14:51 tendo [64681.028832] [] __vfs_read+0x37/0x100 Dec 20 01:14:51 tendo [64681.029618] [] ? security_file_permission+0xa3/0xc0 Dec 20 01:14:51 tendo [64681.030342] [] ? rw_verify_area+0x52/0xd0 Dec 20 01:14:51 tendo [64681.031055] [] vfs_read+0x7f/0x130 Dec 20 01:14:51 tendo [64681.031761] [] SyS_read+0x55/0xc0 Dec 20 01:14:51 tendo [64681.032441] [] entry_SYSCALL_64_fastpath+0x12/0x71 Dec 20 01:14:51 tendo [64681.033119] INFO: task khugepaged:68 blocked for more than 120 seconds. Dec 20 01:14:51 tendo [64681.033815] Not tainted 4.4.106-1.el7.elrepo.x86_64 #1 Dec 20 01:14:51 tendo [64681.034518] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 20 01:14:51 tendo [64681.035238] 88040cf4f318 Dec 20 01:14:51 tendo 88040d6c2c80 Dec 20 01:14:51 tendo 88040cf5
Re: kernel hangs during balance
On Tue, Dec 19, 2017 at 9:14 AM, Hans van Kranenburgwrote: > Just wild first guess... are you also using btrfs send/receive > functionality where the system having problems is the sending part? No. >>> Every night I'm getting a kernel hang, sometimes caught by my >>> watchdog, sometimes not. Last night's hang was on the balance of DATA >>> on / at 70. >>> >>> I'm not sure how to further trace this down to help you -- the console >>> by the time I notice just has lots of messages on it without the >>> initial ones. > > Capturing more logs is definitely the first thing to do. > > Look if the output of `dmesg` still shows the btrfs errors. Otherwise, But I can't ... I have to hit the reset button (or the watchdog does) before I can get into the system again. > if something is spamming there, turn that off, or if you don't have the > errors in a log file because the log files are on the same btrfs, then > you have to find out another way to capture them. E.g. make the kernel > buffer for messages bigger, use netconsole or just pragmatic things like > ssh from another server and `dmesg -w` and store it on the other machine. Ah... yes, a permanent connection from another might work. I also have a pfsense box -- I think it can take a syslogd daemon. Maybe I can netconsole to it? Will investigate. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
On 12/19/2017 06:08 PM, Rich Rauenzahn wrote: > What's also confusing is I just ran a manual balance on the fs using > defaults (which are aggressive) and it completed with no problems. > It smells more like a race condition than a particular corruption. Just wild first guess... are you also using btrfs send/receive functionality where the system having problems is the sending part? > On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahnwrote: >> I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight. >> >> Every night I'm getting a kernel hang, sometimes caught by my >> watchdog, sometimes not. Last night's hang was on the balance of DATA >> on / at 70. >> >> I'm not sure how to further trace this down to help you -- the console >> by the time I notice just has lots of messages on it without the >> initial ones. Capturing more logs is definitely the first thing to do. Look if the output of `dmesg` still shows the btrfs errors. Otherwise, if something is spamming there, turn that off, or if you don't have the errors in a log file because the log files are on the same btrfs, then you have to find out another way to capture them. E.g. make the kernel buffer for messages bigger, use netconsole or just pragmatic things like ssh from another server and `dmesg -w` and store it on the other machine. >> The last items in /var/log/message aren't helpful, but >> I'm pretty sure it is the nightly balance. >> >> I've run btrfs check on / with no issues recently. -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
What's also confusing is I just ran a manual balance on the fs using defaults (which are aggressive) and it completed with no problems. It smells more like a race condition than a particular corruption. On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahnwrote: > I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight. > > Every night I'm getting a kernel hang, sometimes caught by my > watchdog, sometimes not. Last night's hang was on the balance of DATA > on / at 70. > > I'm not sure how to further trace this down to help you -- the console > by the time I notice just has lots of messages on it without the > initial ones. The last items in /var/log/message aren't helpful, but > I'm pretty sure it is the nightly balance. > > I've run btrfs check on / with no issues recently. > > Rich -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kernel hangs during balance
I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight. Every night I'm getting a kernel hang, sometimes caught by my watchdog, sometimes not. Last night's hang was on the balance of DATA on / at 70. I'm not sure how to further trace this down to help you -- the console by the time I notice just has lots of messages on it without the initial ones. The last items in /var/log/message aren't helpful, but I'm pretty sure it is the nightly balance. I've run btrfs check on / with no issues recently. Rich -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html