Re: kernel hangs during balance

2017-12-23 Thread Rich Rauenzahn
I finally got a full stack trace via sysrq.   A lot of stacks seem to
end in page fault -- I wonder if its because my page file is a
loopback on btrfs.

https://pastebin.com/GyWAu1EP

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.14.8-1.el7.elrepo.x86_64
root=UUID=35f0ce3f-0902-47a3-8ad8-86179d1f3e3a ro
rootflags=subvol=root00 vconsole.keymap=us crashkernel=128M
vconsole.font=latarcyrheb-sun16 consoleblank=600 scsi_mod.use_blk_mq=n
LANG=en_US.UTF-8
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Duncan
Holger Hoffstätte posted on Wed, 20 Dec 2017 20:58:14 +0100 as excerpted:

> On 12/20/17 20:02, Chris Murphy wrote:
>> I don't know if it's the sending MUA or the list server, but the line
>> wrapping makes this much harder to follow. I suggest putting it in a
>> text file and attaching the text file. It's definitely not on the
>> receiving side, I see it here also:
>> https://www.spinics.net/lists/linux-btrfs/msg72872.html
> 
> You can see enough to suggest that blk-mq is hanging, which is
> "unsurprising" (being kind here) with such an old kernel. Rich, build
> your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with
> scsi_mod.use_blk_mq=n as kernel prarameter.

Well, the kernel is 4.4 "el repo".  4.4 /is/ an LTS (and elrepo is AFAIK 
a red hat backports repo, not sure how official, but useful for people on 
red hat), but it's now the second-back LTS, with 4.9 and the new 4.14 
being newer LTS series.

The thing is that this is the btrfs list, and we're development and thus 
rather forward focused here.  As such, we normally want at /least/ the 
second newest (first back) lts series kernel for best chance at 
reasonable support.  While I understand people who want to stick with LTS 
being reluctant to go with the /newest/ LTS before even a single current 
release has passed, making that LTS still a bit new as such things go, 
certainly the one-back LTS, 4.9, should be reasonable.

So yes, tho 4.4 is at least an LTS, for purposes of btrfs and this list, 
it really is rather old now, and an upgrade, presumably to 4.9 in keeping 
with the LTS theme, would be recommended.  If the issue can be confirmed 
on the current and LTS 4.14, so much the better, but certainly, 4.9 is a 
recommended upgrade, and a bug there would still be in the concern for a 
fix range, while 4.4... really is just out of the focus range for this 
list, tho various longer focus distros will of course still provide 
support for it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Rich Rauenzahn
I switched to the LT kernel because of this issue.  I was running
mainline and thought that LT would get me stability.  I can switch
back to LT while we RCA.

At the risk of changing two things, I could add that
(scsi_mod.use_blk_mq=n) to my boot and also switch back to ML.  I do
notice that disk IO is high at the time of the hang -- I've assumed it
is btrfs balance, but if I read the logs right, it may have finished
before the hang.  Adding more logging.

I've also put the kernel logs into a pastebin:  https://pastebin.com/80qRXA9c

...wish I could get the watchdog to force a dump/trace.  Will look at that, too.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Holger Hoffstätte
On 12/20/17 20:02, Chris Murphy wrote:
> I don't know if it's the sending MUA or the list server, but the line
> wrapping makes this much harder to follow. I suggest putting it in a
> text file and attaching the text file. It's definitely not on the
> receiving side, I see it here also:
> https://www.spinics.net/lists/linux-btrfs/msg72872.html

You can see enough to suggest that blk-mq is hanging, which is
"unsurprising" (being kind here) with such an old kernel.
Rich, build your kernel with CONFIG_SCSI_MQ_DEFAULT=n or
boot with scsi_mod.use_blk_mq=n as kernel prarameter.

-h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Chris Murphy
I don't know if it's the sending MUA or the list server, but the line
wrapping makes this much harder to follow. I suggest putting it in a
text file and attaching the text file. It's definitely not on the
receiving side, I see it here also:
https://www.spinics.net/lists/linux-btrfs/msg72872.html

Also anytime you get a hang, it's best to issue sysrq+w which dumps
extra information into kernel messages that the developers sometimes
find useful.


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-20 Thread Rich Rauenzahn
Ok, caught the hung tasks last night.  I don't *think* this is
related, because I pretty sure this isn't happening on the same
filesystem, but I do have a loopback swap on one btrfs drive.

The hang might have occurred after the btrfs balance was finished 
which is confusing.  I'm adding timestamps to my next btrfs job.


Dec 20 01:14:50 tendo [64680.942928] INFO: task systemd:1 blocked for
more than 120 seconds.
Dec 20 01:14:50 tendo [64680.945844]   Not tainted
4.4.106-1.el7.elrepo.x86_64 #1
Dec 20 01:14:50 tendo [64680.948900] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 01:14:50 tendo [64680.951987]  88040d6730d8
Dec 20 01:14:50 tendo 81a81500
Dec 20 01:14:50 tendo 88040d5f8000
Dec 20 01:14:50 tendo 88040d674000
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.955107]  
Dec 20 01:14:50 tendo 7fff
Dec 20 01:14:50 tendo 8800bd19a648
Dec 20 01:14:50 tendo e8c03980
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.957871]  88040d6730f0
Dec 20 01:14:50 tendo 81700435
Dec 20 01:14:50 tendo 88041fa16cc0
Dec 20 01:14:50 tendo 88040d6731a0
Dec 20 01:14:50 tendo
Dec 20 01:14:50 tendo [64680.960618] Call Trace:
Dec 20 01:14:50 tendo [64680.963342]  [] schedule+0x35/0x80
Dec 20 01:14:50 tendo [64680.966052]  []
schedule_timeout+0x237/0x2d0
Dec 20 01:14:50 tendo [64680.968490]  []
io_schedule_timeout+0xa6/0x110
Dec 20 01:14:50 tendo [64680.970900]  [] bt_get+0x14c/0x1d0
Dec 20 01:14:50 tendo [64680.973211]  [] ?
prepare_to_wait_event+0xf0/0xf0
Dec 20 01:14:50 tendo [64680.975383]  []
blk_mq_get_tag+0xb0/0xe0
Dec 20 01:14:50 tendo [64680.977540]  []
__blk_mq_alloc_request+0x1b/0x1f0
Dec 20 01:14:50 tendo [64680.979708]  []
blk_mq_map_request+0x19e/0x210
Dec 20 01:14:50 tendo [64680.981871]  []
blk_sq_make_request+0xab/0x3d0
Dec 20 01:14:50 tendo [64680.984004]  [] ?
generic_make_request_checks+0x29f/0x4f0
Dec 20 01:14:50 tendo [64680.986122]  [] ?
mempool_alloc+0x6e/0x170
Dec 20 01:14:50 tendo [64680.988207]  []
generic_make_request+0x113/0x2c0
Dec 20 01:14:50 tendo [64680.990137]  [] submit_bio+0x77/0x150
Dec 20 01:14:50 tendo [64680.992041]  [] ?
__test_set_page_writeback+0x149/0x1c0
Dec 20 01:14:50 tendo [64680.993949]  []
__swap_writepage+0x20d/0x250
Dec 20 01:14:50 tendo [64680.995853]  [] ?
__frontswap_store+0x8c/0x120
Dec 20 01:14:50 tendo [64680.997720]  []
swap_writepage+0x39/0x70
Dec 20 01:14:50 tendo [64680.999595]  []
pageout.isra.43+0x170/0x290
Dec 20 01:14:50 tendo [64681.001291]  []
shrink_page_list+0x353/0x780
Dec 20 01:14:50 tendo [64681.002975]  []
shrink_inactive_list+0x217/0x530
Dec 20 01:14:50 tendo [64681.004652]  []
shrink_lruvec+0x588/0x740
Dec 20 01:14:50 tendo [64681.006311]  []
shrink_zone+0xeb/0x2e0
Dec 20 01:14:50 tendo [64681.007967]  []
do_try_to_free_pages+0x164/0x430
Dec 20 01:14:50 tendo [64681.009617]  [] ?
throttle_direct_reclaim+0x14e/0x240
Dec 20 01:14:50 tendo [64681.011141]  []
try_to_free_pages+0xd5/0x190
Dec 20 01:14:51 tendo [64681.012659]  []
__alloc_pages_slowpath.constprop.89+0x3a0/0x6e8
Dec 20 01:14:51 tendo [64681.014191]  [] ?
sched_clock_cpu+0x72/0xa0
Dec 20 01:14:51 tendo [64681.015701]  []
__alloc_pages_nodemask+0x253/0x260
Dec 20 01:14:51 tendo [64681.017215]  [] ?
__kmalloc_node_track_caller+0x249/0x2b0
Dec 20 01:14:51 tendo [64681.018392]  []
alloc_pages_vma+0xa5/0x220
Dec 20 01:14:51 tendo [64681.019296]  []
handle_pte_fault+0xd21/0x1490
Dec 20 01:14:51 tendo [64681.020195]  [] ?
legitimize_path.isra.23+0x2e/0x60
Dec 20 01:14:51 tendo [64681.021100]  [] ? put_dec+0x1a/0x80
Dec 20 01:14:51 tendo [64681.021992]  [] ?
number.isra.13+0x2dd/0x310
Dec 20 01:14:51 tendo [64681.022883]  []
handle_mm_fault+0x250/0x540
Dec 20 01:14:51 tendo [64681.023757]  []
__do_page_fault+0x188/0x3f0
Dec 20 01:14:51 tendo [64681.024626]  []
do_page_fault+0x30/0x80
Dec 20 01:14:51 tendo [64681.025486]  [] ?
kernfs_path_locked+0x37/0x80
Dec 20 01:14:51 tendo [64681.026350]  [] page_fault+0x28/0x30
Dec 20 01:14:51 tendo [64681.027201]  [] ?
copy_user_enhanced_fast_string+0xe/0x20
Dec 20 01:14:51 tendo [64681.028034]  [] ?
seq_read+0x295/0x390
Dec 20 01:14:51 tendo [64681.028832]  [] __vfs_read+0x37/0x100
Dec 20 01:14:51 tendo [64681.029618]  [] ?
security_file_permission+0xa3/0xc0
Dec 20 01:14:51 tendo [64681.030342]  [] ?
rw_verify_area+0x52/0xd0
Dec 20 01:14:51 tendo [64681.031055]  [] vfs_read+0x7f/0x130
Dec 20 01:14:51 tendo [64681.031761]  [] SyS_read+0x55/0xc0
Dec 20 01:14:51 tendo [64681.032441]  []
entry_SYSCALL_64_fastpath+0x12/0x71
Dec 20 01:14:51 tendo [64681.033119] INFO: task khugepaged:68 blocked
for more than 120 seconds.
Dec 20 01:14:51 tendo [64681.033815]   Not tainted
4.4.106-1.el7.elrepo.x86_64 #1
Dec 20 01:14:51 tendo [64681.034518] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 01:14:51 tendo [64681.035238]  88040cf4f318
Dec 20 01:14:51 tendo 88040d6c2c80
Dec 20 01:14:51 tendo 88040cf5

Re: kernel hangs during balance

2017-12-19 Thread Rich Rauenzahn
On Tue, Dec 19, 2017 at 9:14 AM, Hans van Kranenburg
 wrote:
> Just wild first guess... are you also using btrfs send/receive
> functionality where the system having problems is the sending part?

No.

>>> Every night I'm getting a kernel hang, sometimes caught by my
>>> watchdog, sometimes not.  Last night's hang was on the balance of DATA
>>> on / at 70.
>>>
>>> I'm not sure how to further trace this down to help you -- the console
>>> by the time I notice just has lots of messages on it without the
>>> initial ones.
>
> Capturing more logs is definitely the first thing to do.
>
> Look if the output of `dmesg` still shows the btrfs errors. Otherwise,

But I can't ...  I have to hit the reset button (or the watchdog does)
before I can get into the system again.

> if something is spamming there, turn that off, or if you don't have the
> errors in a log file because the log files are on the same btrfs, then
> you have to find out another way to capture them. E.g. make the kernel
> buffer for messages bigger, use netconsole or just pragmatic things like
> ssh from another server and `dmesg -w` and store it on the other machine.

Ah... yes, a permanent connection from another might work.

I also have a pfsense box -- I think it can take a syslogd daemon.
Maybe I can netconsole to it?

Will investigate.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-19 Thread Hans van Kranenburg
On 12/19/2017 06:08 PM, Rich Rauenzahn wrote:
> What's also confusing is I just ran a manual balance on the fs using
> defaults (which are aggressive) and it completed with no problems.
> It smells more like a race condition than a particular corruption.

Just wild first guess... are you also using btrfs send/receive
functionality where the system having problems is the sending part?

> On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahn  wrote:
>> I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight.
>>
>> Every night I'm getting a kernel hang, sometimes caught by my
>> watchdog, sometimes not.  Last night's hang was on the balance of DATA
>> on / at 70.
>>
>> I'm not sure how to further trace this down to help you -- the console
>> by the time I notice just has lots of messages on it without the
>> initial ones.

Capturing more logs is definitely the first thing to do.

Look if the output of `dmesg` still shows the btrfs errors. Otherwise,
if something is spamming there, turn that off, or if you don't have the
errors in a log file because the log files are on the same btrfs, then
you have to find out another way to capture them. E.g. make the kernel
buffer for messages bigger, use netconsole or just pragmatic things like
ssh from another server and `dmesg -w` and store it on the other machine.

>>   The last items in /var/log/message aren't helpful, but
>> I'm pretty sure it is the nightly balance.
>>
>> I've run btrfs check on / with no issues recently.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel hangs during balance

2017-12-19 Thread Rich Rauenzahn
What's also confusing is I just ran a manual balance on the fs using
defaults (which are aggressive) and it completed with no problems.
It smells more like a race condition than a particular corruption.

On Tue, Dec 19, 2017 at 8:09 AM, Rich Rauenzahn  wrote:
> I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight.
>
> Every night I'm getting a kernel hang, sometimes caught by my
> watchdog, sometimes not.  Last night's hang was on the balance of DATA
> on / at 70.
>
> I'm not sure how to further trace this down to help you -- the console
> by the time I notice just has lots of messages on it without the
> initial ones.   The last items in /var/log/message aren't helpful, but
> I'm pretty sure it is the nightly balance.
>
> I've run btrfs check on / with no issues recently.
>
> Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kernel hangs during balance

2017-12-19 Thread Rich Rauenzahn
I'm running 4.4.106-1.el7.elrepo.x86_64 and I do a btrfs balance everynight.

Every night I'm getting a kernel hang, sometimes caught by my
watchdog, sometimes not.  Last night's hang was on the balance of DATA
on / at 70.

I'm not sure how to further trace this down to help you -- the console
by the time I notice just has lots of messages on it without the
initial ones.   The last items in /var/log/message aren't helpful, but
I'm pretty sure it is the nightly balance.

I've run btrfs check on / with no issues recently.

Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html