Re: Kernel crash related to LZO compression

2018-10-26 Thread Qu Wenruo


On 2018/10/26 下午10:57, Dmitry Katsubo wrote:
> On 2018-10-25 20:49, Chris Murphy wrote:
>> I would say the first step no matter what if you're using an older
>> kernel, is to boot a current Fedora or Arch live or install media,
>> mount the Btrfs and try to read the problem files and see if the
>> problem still happens. I can't even being to estimate the tens of
>> thousands of line changes since kernel 4.9.
> 
> Good point Chris. Indeed booting a fresh kernel is never a problem.
> Actually I forgot to mention that I've seen the same problem with
> kernel 4.12.13 (attached).
> 
>> What profile are you using for this Btrfs? Is this a raid56? What do
>> you get for 'btrfs fi us ' ?
> 
> It is RAID1 volume for both metadata and data, but unfortunately I
> haven't recorded the actual output before the failure. The configuration
> was like this:
> 
> # btrfs filesystem show /var/log
> Label: none  uuid: 5b45ac8e-fd8c-4759-854a-94e45069959d
> Total devices 2 FS bytes used 11.13GiB
> devid    3 size 50.00GiB used 14.03GiB path /dev/sda3
> devid    4 size 50.00GiB used 14.03GiB path /dev/sdc1
> 
> On 2018-10-25 20:49, Chris Murphy wrote:
>> It should be safe even with that kernel. I'm not sure this is
>> compression related. There is a corruption bug related to inline
>> extents and corruption that had been fairly elusive but I think it's
>> fixed now. I haven't run into it though.
> 
> On 2018-10-26 02:09, Qu Wenruo wrote:
>>> Are there any updates / fixes done in that area? Is lzo option safe
>>> to use?
>>
>> Yes, we have commits to harden lzo decompress code in v4.18:
>>
>> de885e3ee281a88f52283c7e8994e762e3a5f6bd btrfs: lzo: Harden inline lzo
>> compressed extent decompression
>> 314bfa473b6b6d3efe68011899bd718b349f29d7 btrfs: lzo: Add header length
>> check to avoid potential out-of-bounds acc
>>
>> And for the root cause, it's compressed data without csum, then scrub
>> could make it corrupted.
>>
>> It's also fixed in v4.18:
>>
>> 665d4953cde6d9e75c62a07ec8f4f8fd7d396ade btrfs: scrub: Don't use inode
>> page cache in scrub_handle_errored_block()
>> ac0b4145d662a3b9e34085dea460fb06ede9b69b btrfs: scrub: Don't use inode
>> pages for device replace
> 
> Thanks, Qu, for this information. Actually one time I've seen the binary
> crap (not zeros) in text log files (/var/log/*.log) and I was surprised
> that btrfs returned me data which is corrupted instead of signalling I/O
> error. Could it be because of "compressed data without csum" problem?

Yes, pretty much the case, especially for your RAID1 setup.

The root fix should has been backported to stable kernel after 4.0, but
the lzo decompression harden part isn't sent to stable kernel, so you
may still hit such problem.

Thanks,
Qu

> 
> Thanks!
> 



signature.asc
Description: OpenPGP digital signature


Re: Kernel crash related to LZO compression

2018-10-26 Thread Dmitry Katsubo

On 2018-10-25 20:49, Chris Murphy wrote:

I would say the first step no matter what if you're using an older
kernel, is to boot a current Fedora or Arch live or install media,
mount the Btrfs and try to read the problem files and see if the
problem still happens. I can't even being to estimate the tens of
thousands of line changes since kernel 4.9.


Good point Chris. Indeed booting a fresh kernel is never a problem.
Actually I forgot to mention that I've seen the same problem with
kernel 4.12.13 (attached).


What profile are you using for this Btrfs? Is this a raid56? What do
you get for 'btrfs fi us ' ?


It is RAID1 volume for both metadata and data, but unfortunately I
haven't recorded the actual output before the failure. The configuration
was like this:

# btrfs filesystem show /var/log
Label: none  uuid: 5b45ac8e-fd8c-4759-854a-94e45069959d
Total devices 2 FS bytes used 11.13GiB
devid3 size 50.00GiB used 14.03GiB path /dev/sda3
devid4 size 50.00GiB used 14.03GiB path /dev/sdc1

On 2018-10-25 20:49, Chris Murphy wrote:

It should be safe even with that kernel. I'm not sure this is
compression related. There is a corruption bug related to inline
extents and corruption that had been fairly elusive but I think it's
fixed now. I haven't run into it though.


On 2018-10-26 02:09, Qu Wenruo wrote:
Are there any updates / fixes done in that area? Is lzo option safe to 
use?


Yes, we have commits to harden lzo decompress code in v4.18:

de885e3ee281a88f52283c7e8994e762e3a5f6bd btrfs: lzo: Harden inline lzo
compressed extent decompression
314bfa473b6b6d3efe68011899bd718b349f29d7 btrfs: lzo: Add header length
check to avoid potential out-of-bounds acc

And for the root cause, it's compressed data without csum, then scrub
could make it corrupted.

It's also fixed in v4.18:

665d4953cde6d9e75c62a07ec8f4f8fd7d396ade btrfs: scrub: Don't use inode
page cache in scrub_handle_errored_block()
ac0b4145d662a3b9e34085dea460fb06ede9b69b btrfs: scrub: Don't use inode
pages for device replace


Thanks, Qu, for this information. Actually one time I've seen the binary
crap (not zeros) in text log files (/var/log/*.log) and I was surprised
that btrfs returned me data which is corrupted instead of signalling I/O
error. Could it be because of "compressed data without csum" problem?

Thanks!

--
With best regards,
Dmitry
[Sun Dec  3 19:39:55 2017] BUG: unable to handle kernel paging request at 
f80a3000
[Sun Dec  3 19:39:55 2017] IP: memcpy+0x11/0x20
[Sun Dec  3 19:39:55 2017] *pde = 370bb067 
[Sun Dec  3 19:39:55 2017] *pte =  
[Sun Dec  3 19:39:55 2017] Oops: 0002 [#1] SMP
[Sun Dec  3 19:39:55 2017] Modules linked in: bridge stp llc arc4 iTCO_wdt 
iTCO_vendor_support ppdev ath5k evdev ath mac80211 cfg80211 i915 coretemp 
pcspkr rfkill snd_hda_codec_realtek serio_raw snd_hda_codec_generic video 
snd_hda_intel drm_kms_helper snd_hda_codec lpc_ich drm snd_hda_core snd_hwdep 
i2c_algo_bit snd_pcm_oss snd_mixer_oss fb_sys_fops sg snd_pcm syscopyarea 
snd_timer sysfillrect rng_core snd sysimgblt soundcore parport_pc parport 
shpchp button acpi_cpufreq binfmt_misc w83627hf hwmon_vid ip_tables x_tables 
autofs4 ses enclosure scsi_transport_sas xfs libcrc32c hid_generic usbhid hid 
btrfs crc32c_generic xor raid6_pq uas usb_storage sr_mod cdrom sd_mod 
ata_generic ata_piix i2c_i801 libata scsi_mod firewire_ohci firewire_core 
crc_itu_t ehci_pci e1000e ptp pps_core uhci_hcd ehci_hcd usbcore usb_common
[Sun Dec  3 19:39:55 2017] CPU: 1 PID: 100 Comm: kworker/u4:2 Tainted: G
W   4.12.0-2-686 #1 Debian 4.12.13-1
[Sun Dec  3 19:39:55 2017] Hardware name: AOpen i945GMx-IF/i945GMx-IF, BIOS 
i945GMx-IF R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
[Sun Dec  3 19:39:55 2017] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[Sun Dec  3 19:39:55 2017] task: f7337280 task.stack: f695c000
[Sun Dec  3 19:39:55 2017] EIP: memcpy+0x11/0x20
[Sun Dec  3 19:39:55 2017] EFLAGS: 00010206 CPU: 1
[Sun Dec  3 19:39:55 2017] EAX: f80a2ff8 EBX: 1000 ECX: 03fe EDX: 
ff998000
[Sun Dec  3 19:39:55 2017] ESI: ff998008 EDI: f80a3000 EBP:  ESP: 
f695de88
[Sun Dec  3 19:39:55 2017]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[Sun Dec  3 19:39:55 2017] CR0: 80050033 CR2: f9c00140 CR3: 36bc7000 CR4: 
06d0
[Sun Dec  3 19:39:55 2017] Call Trace:
[Sun Dec  3 19:39:55 2017]  ? lzo_decompress_bio+0x19f/0x2b0 [btrfs]
[Sun Dec  3 19:39:55 2017]  ? end_compressed_bio_read+0x28d/0x360 [btrfs]
[Sun Dec  3 19:39:55 2017]  ? btrfs_scrubparity_helper+0xb6/0x2c0 [btrfs]
[Sun Dec  3 19:39:55 2017]  ? process_one_work+0x135/0x2f0
[Sun Dec  3 19:39:55 2017]  ? worker_thread+0x39/0x3a0
[Sun Dec  3 19:39:55 2017]  ? kthread+0xd7/0x110
[Sun Dec  3 19:39:55 2017]  ? process_one_work+0x2f0/0x2f0
[Sun Dec  3 19:39:55 2017]  ? kthread_create_on_node+0x30/0x30
[Sun Dec  3 19:39:55 2017]  ? ret_from_fork+0x19/0x24
[Sun Dec  3 19:39:55 2017] Code: 43 58 2b 43 50 88 43 4e 5b eb ed 90 90 90 90 
90 90 90 90 90 90 90 90 90 90 90 

Re: Kernel crash related to LZO compression

2018-10-25 Thread Qu Wenruo


On 2018/10/25 下午11:56, Dmitry Katsubo wrote:
> Dear btrfs community,
> 
> My excuses for the dumps for rather old kernel (4.9.25), nevertheless I
> wonder
> about your opinion about the below reported kernel crashes.
> 
> As I could understand the situation (correct me if I am wrong), it happened
> that some data block became corrupted which resulted the following
> kernel trace
> during the boot:
> 
> kernel BUG at /build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!
> invalid opcode:  [#1] SMP
> Call Trace:
>  [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
>  [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? process_one_work+0x380/0x380
>  [] ? kthread_park+0x50/0x50
>  [] ? ret_from_fork+0x1b/0x28
> 
> The problematic file turned out to be the one used by systemd-journald
> /var/log/journal/c496cea41ebc4700a0dfaabf64a21be4/system.journal
> which was trying to read it (or append to it) during the boot and that was
> causing the system crash (see attached bootN_dmesg.txt).
> 
> I've rebooted in safe mode and tried to copy the data from this
> partition to
> another location using btrfs-restore, however kernel was crashing as
> well with
> a bit different symphom (see attached copyN_dmesg.txt):
> 
> Call Trace:
>  [] ? lzo_decompress_biovec+0x1b0/0x2b0 [btrfs]
>  [] ? vmalloc+0x38/0x40
>  [] ? end_compressed_bio_read+0x265/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? ret_from_fork+0x1b/0x28
> 
> Just to keep away from the problem, I've removed this file and also removed
> "compress=lzo" mount option.
> 
> Are there any updates / fixes done in that area? Is lzo option safe to use?

Yes, we have commits to harden lzo decompress code in v4.18:

de885e3ee281a88f52283c7e8994e762e3a5f6bd btrfs: lzo: Harden inline lzo
compressed extent decompression
314bfa473b6b6d3efe68011899bd718b349f29d7 btrfs: lzo: Add header length
check to avoid potential out-of-bounds acc

And for the root cause, it's compressed data without csum, then scrub
could make it corrupted.

It's also fixed in v4.18:

665d4953cde6d9e75c62a07ec8f4f8fd7d396ade btrfs: scrub: Don't use inode
page cache in scrub_handle_errored_block()
ac0b4145d662a3b9e34085dea460fb06ede9b69b btrfs: scrub: Don't use inode
pages for device replace

Thanks,
Qu

> 
> P.S. Perhaps relative issue is in "Warnings" section:
> 
> https://wiki.debian.org/Btrfs#Warnings /
> https://www.spinics.net/lists/linux-btrfs/msg56563.html
> 



signature.asc
Description: OpenPGP digital signature


Re: Kernel crash related to LZO compression

2018-10-25 Thread Chris Murphy
On Thu, Oct 25, 2018 at 9:56 AM, Dmitry Katsubo  wrote:
> Dear btrfs community,
>
> My excuses for the dumps for rather old kernel (4.9.25), nevertheless I
> wonder
> about your opinion about the below reported kernel crashes.
>
> As I could understand the situation (correct me if I am wrong), it happened
> that some data block became corrupted which resulted the following kernel
> trace
> during the boot:
>
> kernel BUG at /build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!
> invalid opcode:  [#1] SMP
> Call Trace:
>  [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
>  [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? process_one_work+0x380/0x380
>  [] ? kthread_park+0x50/0x50
>  [] ? ret_from_fork+0x1b/0x28
>
> The problematic file turned out to be the one used by systemd-journald
> /var/log/journal/c496cea41ebc4700a0dfaabf64a21be4/system.journal
> which was trying to read it (or append to it) during the boot and that was
> causing the system crash (see attached bootN_dmesg.txt).
>
> I've rebooted in safe mode and tried to copy the data from this partition to
> another location using btrfs-restore, however kernel was crashing as well
> with
> a bit different symphom (see attached copyN_dmesg.txt):
>
> Call Trace:
>  [] ? lzo_decompress_biovec+0x1b0/0x2b0 [btrfs]
>  [] ? vmalloc+0x38/0x40
>  [] ? end_compressed_bio_read+0x265/0x2d0 [btrfs]
>  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
>  [] ? process_one_work+0x141/0x380
>  [] ? worker_thread+0x41/0x460
>  [] ? kthread+0xb4/0xd0
>  [] ? ret_from_fork+0x1b/0x28
>
> Just to keep away from the problem, I've removed this file and also removed
> "compress=lzo" mount option.
>
> Are there any updates / fixes done in that area? Is lzo option safe to use?


It should be safe even with that kernel. I'm not sure this is
compression related. There is a corruption bug related to inline
extents and corruption that had been fairly elusive but I think it's
fixed now. I haven't run into it though.

I would say the first step no matter what if you're using an older
kernel, is to boot a current Fedora or Arch live or install media,
mount the Btrfs and try to read the problem files and see if the
problem still happens. I can't even being to estimate the tens of
thousands of line changes since kernel 4.9.

What profile are you using for this Btrfs? Is this a raid56? What do
you get for 'btrfs fi us ' ?



-- 
Chris Murphy


Kernel crash related to LZO compression

2018-10-25 Thread Dmitry Katsubo

Dear btrfs community,

My excuses for the dumps for rather old kernel (4.9.25), nevertheless I 
wonder

about your opinion about the below reported kernel crashes.

As I could understand the situation (correct me if I am wrong), it 
happened
that some data block became corrupted which resulted the following 
kernel trace

during the boot:

kernel BUG at 
/build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!

invalid opcode:  [#1] SMP
Call Trace:
 [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
 [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
 [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
 [] ? process_one_work+0x141/0x380
 [] ? worker_thread+0x41/0x460
 [] ? kthread+0xb4/0xd0
 [] ? process_one_work+0x380/0x380
 [] ? kthread_park+0x50/0x50
 [] ? ret_from_fork+0x1b/0x28

The problematic file turned out to be the one used by systemd-journald
/var/log/journal/c496cea41ebc4700a0dfaabf64a21be4/system.journal
which was trying to read it (or append to it) during the boot and that 
was

causing the system crash (see attached bootN_dmesg.txt).

I've rebooted in safe mode and tried to copy the data from this 
partition to
another location using btrfs-restore, however kernel was crashing as 
well with

a bit different symphom (see attached copyN_dmesg.txt):

Call Trace:
 [] ? lzo_decompress_biovec+0x1b0/0x2b0 [btrfs]
 [] ? vmalloc+0x38/0x40
 [] ? end_compressed_bio_read+0x265/0x2d0 [btrfs]
 [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
 [] ? process_one_work+0x141/0x380
 [] ? worker_thread+0x41/0x460
 [] ? kthread+0xb4/0xd0
 [] ? ret_from_fork+0x1b/0x28

Just to keep away from the problem, I've removed this file and also 
removed

"compress=lzo" mount option.

Are there any updates / fixes done in that area? Is lzo option safe to 
use?


P.S. Perhaps relative issue is in "Warnings" section:

https://wiki.debian.org/Btrfs#Warnings / 
https://www.spinics.net/lists/linux-btrfs/msg56563.html


--
With best regards,
Dmitry[   13.100666] BTRFS critical (device sda3): stripe index math went horribly 
wrong, got stripe_index=4294936575, num_stripes=2
[   13.100901] BTRFS critical (device sda3): stripe index math went horribly 
wrong, got stripe_index=4294936575, num_stripes=2
[   13.101096] BTRFS critical (device sda3): stripe index math went horribly 
wrong, got stripe_index=4294936575, num_stripes=2
[   13.101178] [ cut here ]
[   13.101182] kernel BUG at 
/build/linux-fB36Cv/linux-4.9.25/fs/btrfs/extent_io.c:2318!
[   13.101185] invalid opcode:  [#1] SMP
[   13.101257] Modules linked in: binfmt_misc bridge stp llc iTCO_wdt 
iTCO_vendor_support arc4 ppdev coretemp ath5k pcspkr ath sr9700 mac80211 dm9601 
serio_raw usbnet cfg80211 snd_hda_codec_realtek snd_hda_codec_generic mii 
rfkill lpc_ich snd_hda_intel i915 mfd_core snd_hda_codec evdev sg snd_hda_core 
snd_hwdep snd_pcm_oss snd_mixer_oss rng_core snd_pcm snd_timer video snd 
drm_kms_helper soundcore drm parport_pc parport i2c_algo_bit shpchp button 
acpi_cpufreq netconsole configfs w83627hf hwmon_vid ip_tables x_tables autofs4 
xfs libcrc32c btrfs crc32c_generic xor raid6_pq ses enclosure 
scsi_transport_sas uas hid_generic usbhid usb_storage hid sd_mod sr_mod cdrom 
i2c_i801 i2c_smbus firewire_ohci ata_generic firewire_core crc_itu_t ehci_pci 
ata_piix libata uhci_hcd ehci_hcd scsi_mod e1000e ptp pps_core
[   13.101261]  usbcore usb_common
[   13.101267] CPU: 0 PID: 96 Comm: kworker/u4:2 Tainted: GW   
4.9.0-3-686-pae #1 Debian 4.9.25-1
[   13.101269] Hardware name: AOpen i945GMx-IF/i945GMx-IF, BIOS i945GMx-IF 
R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
[   13.101326] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[   13.101328] task: f6d409c0 task.stack: f6d46000
[   13.101332] EIP: 0060:[] EFLAGS: 00010203 CPU: 0
[   13.101373] EIP is at btrfs_check_repairable+0x12c/0x130 [btrfs]
[   13.101375] EAX: 8800 EBX: f292dd80 ECX: 8801 EDX: 0002
[   13.101378] ESI: f69c EDI: f678bc5c EBP: f6d47e50 ESP: f6d47e30
[   13.101381]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   13.101383] CR0: 80050033 CR2: b64c6db0 CR3: 36c115a0 CR4: 06f0
[   13.101386] Stack:
[   13.101395]  1000   f292dd80 d04e93d0  f35885d8 
f35885d8
[   13.101402]  f6d47ed8 f8c63739 0001  f6d47ec4 f8c951eb f3bb4800 
0001
[   13.101412]  0009 f678bc00 f35884b0   0001  

[   13.101413] Call Trace:
[   13.101457]  [] ? end_bio_extent_readpage+0x4e9/0x680 [btrfs]
[   13.101497]  [] ? end_compressed_bio_read+0x3b/0x2d0 [btrfs]
[   13.101538]  [] ? btrfs_scrubparity_helper+0xce/0x2d0 [btrfs]
[   13.101548]  [] ? process_one_work+0x141/0x380
[   13.101553]  [] ? worker_thread+0x41/0x460
[   13.101557]  [] ? kthread+0xb4/0xd0
[   13.101561]  [] ? process_one_work+0x380/0x380
[   13.101566]  [] ? kthread_park+0x50/0x50
[   13.101572]  [] ? ret_from_fork+0x1b/0x28
[   13.104547] Modules linked in: binfmt_misc bridge stp llc iTCO_wdt 
iTCO_vendor_support