Re: unable to handle kernel paging request - btrfs

2016-10-10 Thread Rich Freeman
Here is another trace, similar to the original issue, but I have a bit
more detail on this one and it is available as text which if nothing
else is more convenient so I'll go ahead and paste this.  I don't
intend to keep pasting these unless I get something that looks
different.

I only posted the initial BUG.

Oct 10 05:11:15 hab nc[1250]: ip_tables ext4 crc16 mbcache jbd2 radeon
nxt200x cx88_dvb cx88_vp3054_i2c videobuf2_dvb dvb_core tuner_simple
tuner_types tuner cx8800 cx8802 videobuf2_dma_sg videobuf2_memops
videobuf2_v4l2 cx88_alsa cx88xx mousedev fbcon videobuf2_core bitblit
dm_region_hash dm_log dm_mod
Oct 10 05:11:15 hab nc[1250]: [81346.935203] CPU: 3 PID: 29648 Comm:
kworker/u16:3 Not tainted 4.4.24 #1
Oct 10 05:11:15 hab nc[1250]: [81346.935317] Hardware name: Gigabyte
Technology Co., Ltd. GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010
Oct 10 05:11:15 hab nc[1250]: [81346.935544] Workqueue: btrfs-endio
btrfs_endio_helper [btrfs]
Oct 10 05:11:15 hab nc[1250]: [81346.935657] task: 880415acae00
ti: 88019a584000 task.ti: 88019a584000
Oct 10 05:11:15 hab nc[1250]: [81346.935783] RIP:
0010:[]  [] __memcpy+0x12/0x20
Oct 10 05:11:15 hab nc[1250]: [81346.935930] RSP:
0018:88019a587c68  EFLAGS: 00010246
Oct 10 05:11:15 hab nc[1250]: [81346.936023] RAX: c90002ecfff8
RBX: 1000 RCX: 01ff
Oct 10 05:11:15 hab nc[1250]: [81346.936142] RDX: 
RSI: 88008c950008 RDI: c90002ed
Oct 10 05:11:15 hab nc[1250]: [81346.936262] RBP: 88019a587d30
R08: 41545345 R09: c90002ece000
Oct 10 05:11:15 hab nc[1250]: [81346.936382] R10: e8cc09e0
R11: 1000 R12: 88008c95
Oct 10 05:11:15 hab nc[1250]: [81346.936502] R13: 4154534d
R14:  R15: 8802b25b2798
Oct 10 05:11:15 hab nc[1250]: [81346.936623] FS:
7fe90a15d780() GS:880427cc()
knlGS:
Oct 10 05:11:15 hab nc[1250]: [81346.936756] CS:  0010 DS:  ES:
 CR0: 8005003b
Oct 10 05:11:16 hab nc[1250]: [81346.937182]  8800102c5720
0004 41545345 4154334d
Oct 10 05:11:16 hab nc[1250]: [81346.937347]  c90002ece000
1000 0002 003a
Oct 10 05:11:16 hab nc[1250]: [81346.937515] Call Trace:
Oct 10 05:11:16 hab nc[1250]: [81346.937621]  [] ?
lzo_decompress_biovec+0x1d1/0x2c0 [btrfs]
Oct 10 05:11:16 hab nc[1250]: [81346.944148]  []
end_compressed_bio_read+0x20c/0x2c0 [btrfs]
Oct 10 05:11:16 hab nc[1250]: [81346.950610]  [] ?
resched_curr+0x60/0xc0
Oct 10 05:11:16 hab nc[1250]: [81346.957055]  []
bio_endio+0x3a/0x70
Oct 10 05:11:16 hab nc[1250]: [81346.963516]  []
end_workqueue_fn+0x37/0x40 [btrfs]
Oct 10 05:11:16 hab nc[1250]: [81346.970009]  []
normal_work_helper+0xae/0x2d0 [btrfs]
Oct 10 05:11:16 hab nc[1250]: [81346.976532]  []
btrfs_endio_helper+0xd/0x10 [btrfs]
Oct 10 05:11:16 hab nc[1250]: [81346.983010]  []
process_one_work+0x148/0x400
Oct 10 05:11:16 hab nc[1250]: [81346.989509]  []
worker_thread+0x46/0x430
Oct 10 05:11:16 hab nc[1250]: [81346.996013]  [] ?
rescuer_thread+0x2d0/0x2d0
Oct 10 05:11:16 hab nc[1250]: [81347.034423] Code: ff ff 48 8b 43 60
48 2b 43 50 88 43 4e 5b 5d f3 c3 90 90 90 90 90 90 90 90 00 48 89 f8
48 89 d1 48 c1 e9 03 83 e2 07  48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00
00 48 89 f8 48 89 d1 f3
Oct 10 05:11:16 hab nc[1250]: [81347.041852] RIP  []
__memcpy+0x12/0x20
Oct 10 05:11:16 hab nc[1250]: [81347.048565]  RSP 
Oct 10 05:11:16 hab nc[1250]: [81347.055218] CR2: c90002ed
Oct 10 05:11:16 hab nc[1250]: [81347.104741] ---[ end trace
9a43c0b6d874fe31 ]---
Oct 10 05:11:16 hab nc[1250]: [81347.104752] BUG: unable to handle
kernel paging request at c90002c4a000
Oct 10 05:11:16 hab nc[1250]: [81347.104761] IP: []
__memcpy+0x12/0x20
Oct 10 05:11:16 hab nc[1250]: [81347.104767] PGD 417427067 PUD
417488067 PMD 410881067 PTE 0
Oct 10 05:11:16 hab nc[1250]: [81347.104771] Oops: 0002 [#2] SMP
Oct 10 05:11:16 hab nc[1250]: [81347.104825] Modules linked in:
netconsole configfs tun ipt_MASQUERADE nf_nat_masquerade_ipv4
xt_conntrack veth iptable_mangle iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter
ip_tables ext4 crc16 mbcache jbd2 radeon nxt200x cx88_dvb
cx88_vp3054_i2c videobuf2_dvb dvb_core tuner_simple tuner_types tuner
cx8800 cx8802 videobuf2_dma_sg videobuf2_memops videobuf2_v4l2
cx88_alsa cx88xx mousedev fbcon videobuf2_core bitblit softcursor
tveeprom font tileblit drm_kms_helper kvm_amd rc_core kvm v4l2_common
cfbfillrect syscopyarea videodev cfbimgblt sysfillrect
snd_hda_codec_realtek snd_hda_codec_generic irqbypass i2c_algo_bit
sysimgblt fb_sys_fops snd_hda_intel k10temp cfbcopyarea ttm
snd_hda_codec snd_hwdep i2c_piix4 snd_hda_core drm hid_logitech_hidpp
snd_pcm r8169[81347.104954] CR2: c90002c4a000 CR3:
cb2fb000 CR4: 06e0
Oct 10 05:11:16 hab nc[1250]: [81347.104955] Stack:
Oct 10 05:11:16 hab nc[1250]: [81347.104960]  a02ef741

Re: unable to handle kernel paging request - btrfs

2016-10-08 Thread Rich Freeman
I'm not sure if this is related to the same issue or not, but I just
started getting a new BUG, followed by a panic.  (I'm also enabled
network console capture so that you won't have to squint at photos.)

Original BUG is:


[14740.444257] [ cut here ]
[14740.444293] kernel BUG at /usr/src/linux-stable/fs/btrfs/volumes.c:5509!
[14740.444323] invalid opcode:  [#1] SMP
[14740.444348] Modules linked in: nfsd auth_rpcgss oid_registry lockd
grace sunrpc it87 hwmon_vid netconsole configfs tun ipt_MASQUERADE
nf_nat_masquerade_ipv4 xt_conntrack veth iptable_mangle iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_n
 at nf_conntrack iptable_filter ip_tables ext4 crc16 mbcache
jbd2 radeon nxt200x cx88_dvb cx88_vp3054_i2c videobuf2_dvb
dvb_coretuner_simple tuner_types tuner fbcon bitblit softcursor font
tileblit drm_kms_helper kvm_amd kvm cfbfillrect syscopyarea cfbimgblt
sysfillrect sysimgblt mousedev fb_sys_fops cfbcopyarea cx88_alsa ttm
cx8802 drm cx8800 videobuf2_dma_sg videobuf2_memops videobuf2_v4l2
cx88xx snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
videobuf2_core snd_hda_codec tveeprom rc_core irqbypass v4l2_common
videodev k10temp i2c_algo_bit
[14740.444799]  snd_hwdep i2c_piix4 snd_hda_core hid_logitech_hidpp
snd_pcm r8169 8250 snd_timer snd mii 8250_base backlight serial_core
soundcore evdev sch_fq_codel hid_logitech_dj hid_generic usbhid btrfs
firewire_ohci atkbd ata_generic pata_acpi firew
   ire_core crc_itu_t xor zlib_deflate ohci_pci pata_atiixp
raid6_pq ehci_pci ohci_hcd ehci_hcd usbcore usb_common dm_mirror
dm_region_hash dm_log dm_mod
[14740.445028] CPU: 1 PID: 3213 Comm: kworker/u16:2 Not tainted 4.4.24 #1
[14740.445056] Hardware name: Gigabyte Technology Co., Ltd.
GA-880GM-UD2H/GA-880GM-UD2H, BIOS F8 10/11/2010
[14740.445116] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[14740.445143] task: 8803ff527300 ti: 8803e3c8c000 task.ti:
8803e3c8c000
[14740.445173] RIP: 0010:[]  []
__btrfs_map_block+0xdfd/0x1140 [btrfs]
[14740.445226] RSP: 0018:8803e3c8faa0  EFLAGS: 00010282
[14740.445248] RAX: cdf2f040 RBX: 0002 RCX: 0002
[14740.445277] RDX:  RSI: 21b27000 RDI: 8800cab4fb40
[14740.445306] RBP: 8803e3c8fb88 R08: 050743c0 R09: cdf2f040
[14740.445334] R10: 0001 R11: 1e4d R12: cdf2f03f
[14740.445363] R13: 9000 R14: 8803e3c8fbd0 R15: 0001
[14740.445391] FS:  7f9e2befc7c0() GS:880427c4()
knlGS:
[14740.445423] CS:  0010 DS:  ES:  CR0: 8005003b
[14740.445446] CR2: 7fc533bf7000 CR3: 0003e29e4000 CR4: 06e0
[14740.445474] Stack:
[14740.445484]  8803e3c8fab0 81084577 8112acf0
02011200
[14740.445526]  880410cacc60 880410cacc90 1e4e
8803ff527300
[14740.445565]   1e4e 880414e68ee8

[14740.445603] Call Trace:
[14740.445618]  [] ? __enqueue_entity+0x67/0x70
[14740.445644]  [] ? mempool_alloc_slab+0x10/0x20
[14740.445680]  [] btrfs_map_bio+0x71/0x320 [btrfs]
[14740.445707]  [] ? kmem_cache_alloc+0x190/0x1f0
[14740.445742]  [] ? btrfs_bio_wq_end_io+0x2e/0x80 [btrfs]
[14740.445780]  []
btrfs_submit_compressed_read+0x451/0x4a0 [btrfs]
[14740.445821]  [] btrfs_submit_bio_hook+0x1a0/0x1b0 [btrfs]
[14740.445860]  [] ? btrfs_io_bio_alloc+0x10/0x30 [btrfs]
[14740.445900]  [] ? btrfs_create_repair_bio+0xc3/0xe0 [btrfs]
[14740.445940]  [] end_bio_extent_readpage+0x44f/0x510 [btrfs]
[14740.445981]  [] ? btrfs_create_repair_bio+0xe0/0xe0 [btrfs]
[14740.446011]  [] bio_endio+0x3a/0x70
[14740.446042]  [] end_workqueue_fn+0x37/0x40 [btrfs]
[14740.446080]  [] normal_work_helper+0xae/0x2d0 [btrfs]
[14740.446118]  [] btrfs_endio_helper+0xd/0x10 [btrfs]
[14740.446145]  [] process_one_work+0x148/0x400
[14740.446170]  [] worker_thread+0x46/0x430
[14740.446193]  [] ? rescuer_thread+0x2d0/0x2d0
[14740.446217]  [] ? rescuer_thread+0x2d0/0x2d0
[14740.446241]  [] kthread+0xc4/0xe0
[14740.446262]  [] ? kthread_park+0x50/0x50
[14740.446286]  [] ret_from_fork+0x3f/0x70
[14740.446309]  [] ? kthread_park+0x50/0x50
[14740.446332] Code: 60 ff ff ff 48 63 d3 48 2b 4d c0 48 0f af c1 48
39 c2 48 0f 46 c2 48 89 45 90 89 d9 c7 85 70 ff ff ff 00 0
  0 00 00 e9 f9 f3 ff ff <0f> 0b bb f4 ff ff ff e9 c7
fa ff ff be 6a 16 00 00 48 c7 c7 18
[14740.446672] RIP  [] __btrfs_map_block+0xdfd/0x1140 [btrfs]
[14740.446714]  RSP 
[14740.456756] ---[ end trace e349a675c6512569 ]---
[14740.456832] BUG: unable to handle kernel paging request at ffd8
[14740.456869] IP: [] kthread_data+0xb/0x20
[14740.456896] PGD 1a0a067 PUD 1a0c067 PMD 0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-10-07 Thread Rich Freeman
On Fri, Sep 30, 2016 at 8:38 PM, Jeff Mahoney  wrote:
> On 9/30/16 5:07 PM, Rich Freeman wrote:
>> On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoney  wrote:
>>> This looks like a use-after-free on one of the pages used for
>>> compression.  Can you post the output of objdump -Dr
>>> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?
>>>
>>
>> Sure:
>> https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk
>>
>> I was impressed by just how large it was.
>>
>> I take it you're going to try to use the offsets in the oops to figure
>> out where it went wrong?  I really need to get kernel core dumping
>> working on this box...
>
> Yep.  What I think is happening is that we have workspace getting freed
> while it's in use.  The faulting address is in vmalloc space and it's
> also the first argument to memcpy, which makes it the destination.  In
> lzo_decompress_biovec, that means it's the workspace->cbuf.  Beyond that
> I'll have to dig a bit more.
>

I'll confess to not being much of a kernel hacker, but could this
error also be caused by a buffer overrun?  If working_bytes or
in_page_bytes_left are larger than the size of the buffer then the
memcpy would overrun the length of the buffer.  I don't know if that
generates a different error than the one reported.

What guarantee do we have that working_bytes is less than the size of
workspace->cbuf?  I'm just throwing stuff out there because as far as
I can tell the code never frees workspace (I'm guessing kunmap at the
very end might take care of it).

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Jeff Mahoney
On 9/30/16 5:07 PM, Rich Freeman wrote:
> On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoney  wrote:
>> This looks like a use-after-free on one of the pages used for
>> compression.  Can you post the output of objdump -Dr
>> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?
>>
> 
> Sure:
> https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk
> 
> I was impressed by just how large it was.
> 
> I take it you're going to try to use the offsets in the oops to figure
> out where it went wrong?  I really need to get kernel core dumping
> working on this box...

Yep.  What I think is happening is that we have workspace getting freed
while it's in use.  The faulting address is in vmalloc space and it's
also the first argument to memcpy, which makes it the destination.  In
lzo_decompress_biovec, that means it's the workspace->cbuf.  Beyond that
I'll have to dig a bit more.

It's the same fault that your first photo showed as a secondary Oops,
but that's not always the case.


-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Rich Freeman
On Fri, Sep 30, 2016 at 4:55 PM, Jeff Mahoney  wrote:
> This looks like a use-after-free on one of the pages used for
> compression.  Can you post the output of objdump -Dr
> /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?
>

Sure:
https://drive.google.com/open?id=0BwUDImviY_gcR3JfT0Z1cUlRVEk

I was impressed by just how large it was.

I take it you're going to try to use the offsets in the oops to figure
out where it went wrong?  I really need to get kernel core dumping
working on this box...

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Jeff Mahoney
On 9/30/16 2:54 PM, Rich Freeman wrote:
> On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoney  wrote:
>> On 9/22/16 8:18 AM, Rich Freeman wrote:
>>> I have been getting panics consistently after doing a btrfs replace
>>> operation on a raid1 and rebooting.  I linked a photo of the panic; I
>>> haven't been able to get a text capture of it.
>>>
>>> https://ibin.co/2vx0HhDeViu3.jpg
>>>
>>> I'm getting this error on the latest 4.4, 4.1, and even on an old
>>> 3.18.26 kernel I had lying around.
>>>
>>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
>>> returns patch on 4.1 and that did not solve my problem either.
>>>
>>> I'm able to boot into single-user mode and if I don't start any
>>> processes the system seems fairly stable.  I am also able to start a
>>> btrfs balance and run that for several hours without issue.  If I
>>> start launching services the system will tend to panic, though how
>>> many processes I can launch will vary.  I don't think that it is a
>>> particular file being accessed that is triggering the issue since the
>>> point where it fails varies.  I suspect it may be load-related.
>>>
>>> Mounting with compress=no doesn't seem to help either.  Granted, I see
>>> lzo_decompress in the backtrace and that is probably a read operation.
>>>
>>> Any suggestions?  Google hasn't been helpful on this one...
>>
>> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops?
>> The trace in your photo is a secondary Oops (tainted D), which means
>> that something else went wrong before that and now the system is
>> tripping over it.  Secondary Oopses don't really help the debugging
>> process because the system was already in a broken, undefined, state.
>>
> 
> Ok, the system has been up for a week without issue, but just paniced
> and rebooted right towards the end of a balance (it literally had
> about 30 of 2500 chunks left).
> 
> After it came up (and waiting for it to fully mount as there were a
> bunch of free space warnings/etc) I managed to capture an initial oops
> when it happened again:
> 
> https://ibin.co/2wt0n2IaCOA3.jpg
> 
> This is on a system without swap, though my understanding is that the
> paging system is used for other things.

It's not paging in the way Microsoft uses the term.  In this context, it
just means that the kernel tried to resolve a virtual address and
failed.  When the address is < PAGE_SIZE, it prints the NULL pointer
dereference message instead.  It's literally the same code otherwise.
Short version: it's the same thing as a segfault in userspace code.

This looks like a use-after-free on one of the pages used for
compression.  Can you post the output of objdump -Dr
/lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko somewhere?

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: unable to handle kernel paging request - btrfs

2016-09-30 Thread Rich Freeman
On Thu, Sep 22, 2016 at 1:41 PM, Jeff Mahoney  wrote:
> On 9/22/16 8:18 AM, Rich Freeman wrote:
>> I have been getting panics consistently after doing a btrfs replace
>> operation on a raid1 and rebooting.  I linked a photo of the panic; I
>> haven't been able to get a text capture of it.
>>
>> https://ibin.co/2vx0HhDeViu3.jpg
>>
>> I'm getting this error on the latest 4.4, 4.1, and even on an old
>> 3.18.26 kernel I had lying around.
>>
>> I tried the remove root_log_ctx from ctx list before btrfs_sync_log
>> returns patch on 4.1 and that did not solve my problem either.
>>
>> I'm able to boot into single-user mode and if I don't start any
>> processes the system seems fairly stable.  I am also able to start a
>> btrfs balance and run that for several hours without issue.  If I
>> start launching services the system will tend to panic, though how
>> many processes I can launch will vary.  I don't think that it is a
>> particular file being accessed that is triggering the issue since the
>> point where it fails varies.  I suspect it may be load-related.
>>
>> Mounting with compress=no doesn't seem to help either.  Granted, I see
>> lzo_decompress in the backtrace and that is probably a read operation.
>>
>> Any suggestions?  Google hasn't been helpful on this one...
>
> Can you boot with panic_on_oops=1, reproduce it, and capture that Oops?
> The trace in your photo is a secondary Oops (tainted D), which means
> that something else went wrong before that and now the system is
> tripping over it.  Secondary Oopses don't really help the debugging
> process because the system was already in a broken, undefined, state.
>

Ok, the system has been up for a week without issue, but just paniced
and rebooted right towards the end of a balance (it literally had
about 30 of 2500 chunks left).

After it came up (and waiting for it to fully mount as there were a
bunch of free space warnings/etc) I managed to capture an initial oops
when it happened again:

https://ibin.co/2wt0n2IaCOA3.jpg

This is on a system without swap, though my understanding is that the
paging system is used for other things.

Note that I've updated my kernel since my last post.  When it paniced
during the balance it was running 4.4.21, and on the oops I actually
captured it was on 4.4.23 (I was actually just waiting for the balance
to finish before rebooting with a new kernel).

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unable to handle kernel paging request - btrfs

2016-09-25 Thread Duncan
Rich Freeman posted on Sun, 25 Sep 2016 09:55:42 -0400 as excerpted:

> On Fri, Sep 23, 2016 at 12:58 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>> Btrfs raid1 you say, and you have existing compressed files it's trying
>> to read in the backtrace?
>>
>> Sounds like the issues I see sometimes and have posted about where
>> after a crash that resulted in one device of my raid1 pair getting
>> behind the other, the kernel will crash if it sees too many
>> csum-errors, even tho it's /supposed/ to check the other copy and read
>> from it if valid (which it is as a btrfs scrub resolves the issue).
>>
>> When booted to rescue/single-user mode, can you run a scrub?
> 
> After a few reboots trying to capture the initial panic message (even
> when I set panic_on_oops=1 I was getting multiple ones with only the
> tainted one staying on screen), the system managed to stay up.  I
> completed a scrub and it found no errors.

Well, so much for that theory.  If it found and fixed errors you'd likely 
be seeing the same problem I see sometimes, but if it didn't find any to 
fix... unlikely.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html