Re: remounted ro during operation, unmountable since

2018-04-15 Thread Duncan
Qu Wenruo posted on Sat, 14 Apr 2018 22:41:50 +0800 as excerpted:

>> sectorsize        4096
>> nodesize        4096
> 
> Nodesize is not the default 16K, any reason for this?
> (Maybe performance?)
> 
>>> 3) Extra hardware info about your sda
>>>     Things like SMART and hardware model would also help here.

>> Model Family: Samsung based SSDs Device Model: SAMSUNG SSD 830
>> Series
> 
> At least I haven't heard of many problems with Samsung SSDs, so I don't think
> the hardware is to blame. (Unlike the Intel 600P.)

The 830 model is a few years old, IIRC (I have 850s, and I think I saw 860s 
mentioned in something I read, probably on this list, but am not sure of it).  
I suspect the filesystem was created with an old enough btrfs-tools that 
the default nodesize was still 4K, either due to an older distro, or simply 
due to having used the filesystem that long.
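
(For reference, the nodesize can be read back from the superblock; a 
minimal sketch, reusing this thread's /dev/sda2:

   # btrfs inspect dump-super -f /dev/sda2 | grep -E 'sectorsize|nodesize'

Nodesize is fixed at mkfs time, e.g. "mkfs.btrfs -n 16k", and can't be 
changed on an existing filesystem.)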

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: remounted ro during operation, unmountable since

2018-04-15 Thread Qu Wenruo


On 2018年04月15日 16:39, Timo Nentwig wrote:
> On 04/14/2018 03:45 PM, Timo Nentwig wrote:
>> On 04/14/2018 11:42 AM, Qu Wenruo wrote:
>>> And the work load when the RO happens is also helpful.
>>> (Well, the dmesg of RO happens would be the best though)
>> I had a glance at dmesg but don't remember anything specific (think
>> the usual " [cut here] ---" + dump of registers, but I'm not even
>> sure about that). Sorry. 
> 
> *cough* look, what I just found :)
> 
[snip]
> [Sun Apr  8 09:40:45 2018] CPU: 4 PID: 64370 Comm: kworker/u256:87
> Tainted: G    W  O 4.15.15-1-ARCH #1
> [Sun Apr  8 09:40:45 2018] Hardware name: System manufacturer System
> Product Name/ROG ZENITH EXTREME, BIOS 0902 12/21/2017

ROG, no wonder you're OCing.

> [Sun Apr  8 09:40:45 2018] Workqueue: btrfs-extent-refs
> btrfs_extent_refs_helper [btrfs]
> [Sun Apr  8 09:40:45 2018] RIP: 0010:btrfs_run_delayed_refs+0x167/0x1b0
> [btrfs]
> [Sun Apr  8 09:40:45 2018] RSP: 0018:a085e34d3dd8 EFLAGS: 00010282
> [Sun Apr  8 09:40:45 2018] RAX:  RBX: 96b4f02ab9c8
> RCX: 0001
> [Sun Apr  8 09:40:45 2018] RDX: 8001 RSI: a1e47fcf
> RDI: 
> [Sun Apr  8 09:40:45 2018] RBP: 96b92514 R08: 0001
> R09: 0a02
> [Sun Apr  8 09:40:45 2018] R10: 0005 R11: 
> R12: 96b915a5ec00
> [Sun Apr  8 09:40:45 2018] R13:  R14: 96b4e187c600
> R15: 96b92c411400
> [Sun Apr  8 09:40:45 2018] FS:  ()
> GS:96b92cb0() knlGS:
> [Sun Apr  8 09:40:45 2018] CS:  0010 DS:  ES:  CR0:
> 80050033
> [Sun Apr  8 09:40:45 2018] CR2: 7f2c661a96f0 CR3: 000f48fcc000
> CR4: 003406e0
> [Sun Apr  8 09:40:45 2018] Call Trace:
> [Sun Apr  8 09:40:45 2018]  delayed_ref_async_start+0x8d/0xa0 [btrfs]
> [Sun Apr  8 09:40:45 2018]  normal_work_helper+0x39/0x370 [btrfs]
> [Sun Apr  8 09:40:45 2018]  process_one_work+0x1ce/0x410
> [Sun Apr  8 09:40:45 2018]  worker_thread+0x2b/0x3d0
> [Sun Apr  8 09:40:45 2018]  ? process_one_work+0x410/0x410
> [Sun Apr  8 09:40:45 2018]  kthread+0x113/0x130
> [Sun Apr  8 09:40:45 2018]  ? kthread_create_on_node+0x70/0x70
> [Sun Apr  8 09:40:45 2018]  ret_from_fork+0x22/0x40
> [Sun Apr  8 09:40:45 2018] Code: a0 82 d8 e0 eb 90 48 8b 53 60 f0 0f ba
> aa 50 12 00 00 02 72 1b 83 f8 fb 74 37 89 c6 48 c7 c7 c8 28 53 c0 89 04
> 24 e8 c9 fa be e0 <0f> 0b 8b 04 24 89 c1 ba 04 0c 00 00 48 c7 c6 00 b9
> 52 c0 48 89
> [Sun Apr  8 09:40:45 2018] ---[ end trace 6ad220910a160dd3 ]---
> [Sun Apr  8 09:40:45 2018] BTRFS: error (device sda2) in
> btrfs_run_delayed_refs:3076: errno=-17 Object already exists

According to the code, it's just a transaction abort.
I'd say the extent tree code hit something unexpected, maybe caused by
the offending tree block.

Not much useful info compared to the debug-tree grep.

Thanks,
Qu

> [Sun Apr  8 09:40:45 2018] BTRFS info (device sda2): forced readonly
> [Sun Apr  8 09:40:45 2018] BTRFS error (device sda2): pending csums is
> 331776
> 
> 
> [Tue Apr 10 05:23:22 2018] hid-generic 0003:1E71:170E.0009:
> hiddev0,hidraw0: USB HID v1.10 Device [NZXT.-Inc. NZXT USB Device] on
> usb-:01:00.0-9/input0
> [Tue Apr 10 06:19:31 2018] [ cut here ]
> [Tue Apr 10 06:19:31 2018] BTRFS: Transaction aborted (error -17)
> [Tue Apr 10 06:19:31 2018] WARNING: CPU: 18 PID: 541 at
> fs/btrfs/extent-tree.c:3076 btrfs_run_delayed_refs+0x167/0x1b0 [btrfs]
> [Tue Apr 10 06:19:31 2018] Modules linked in: cmac md4 nls_utf8 cifs ccm
> dns_resolver fscache lz4 lz4_compress zram rfcomm bnep xt_tcpudp
> iptable_filter hwmon_vid msr arc4 ext4 mbcache jbd2 fscrypto btusb
> edac_mce_amd btrtl btbcm btintel ath10k_pci kvm bluetooth ath10k_core
> mousedev ecdh_generic ath irqbypass crc16 input_leds crct10dif_pclmul
> mac80211 snd_hda_codec_realtek crc32_pclmul ghash_clmulni_intel
> snd_hda_codec_generic snd_hda_codec_hdmi pcbc eeepc_wmi wil6210
> aesni_intel igb asus_wmi snd_hda_intel aes_x86_64 sparse_keymap
> led_class crypto_simd snd_hda_codec cfg80211 glue_helper ptp wmi_bmof
> cryptd snd_hda_core pps_core mxm_wmi dca snd_hwdep sp5100_tco ccp rfkill
> snd_pcm pcspkr atlantic i2c_piix4 rng_core k10temp rtc_cmos evdev shpchp
> gpio_amdpt wmi pinctrl_amd mac_hid acpi_cpufreq vboxnetflt(O) vboxnetadp(O)
> [Tue Apr 10 06:19:31 2018]  vboxpci(O) vboxdrv(O) snd_seq_dummy
> snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer snd
> soundcore cuse fuse cpufreq_powersave cpufreq_ondemand crypto_user
> ip_tables x_tables sd_mod hid_apple uas usb_storage hid_saitek
> hid_generic usbhid hid ahci xhci_pci libahci xhci_hcd libata usbcore
> scsi_mod usb_common crc32c_generic crc32c_intel btrfs xor
> zstd_decompress zstd_compress xxhash raid6_pq amdgpu chash i2c_algo_bit
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
> agpgart
> [Tue Apr 10 06:19:31 2018] CPU: 18 PID: 541 Comm: btrfs-

Re: remounted ro during operation, unmountable since

2018-04-15 Thread Timo Nentwig

On 04/14/2018 03:45 PM, Timo Nentwig wrote:

On 04/14/2018 11:42 AM, Qu Wenruo wrote:

And the work load when the RO happens is also helpful.
(Well, the dmesg of RO happens would be the best though)
I had a glance at dmesg but don't remember anything specific (think 
the usual " [cut here] ---" + dump of registers, but I'm not even 
sure about that). Sorry. 


*cough* look, what I just found :)

[Sun Apr  8 09:40:45 2018] [ cut here ]
[Sun Apr  8 09:40:45 2018] BTRFS: Transaction aborted (error -17)
[Sun Apr  8 09:40:45 2018] WARNING: CPU: 4 PID: 64370 at 
fs/btrfs/extent-tree.c:3076 btrfs_run_delayed_refs+0x167/0x1b0 [btrfs]
[Sun Apr  8 09:40:45 2018] Modules linked in: cmac md4 nls_utf8 cifs ccm 
dns_resolver fscache lz4 lz4_compress zram rfcomm bnep xt_tcpudp 
hwmon_vid msr iptable_filter arc4 ext4 mbcache jbd2 fscrypto btusb 
edac_mce_amd
 btrtl snd_hda_codec_hdmi btbcm btintel kvm bluetooth ath10k_pci 
mousedev irqbypass ath10k_core ecdh_generic crc16 input_leds 
crct10dif_pclmul crc32_pclmul ath ghash_clmulni_intel 
snd_hda_codec_realtek pcbc mac80211 snd_hda_codec_generic aesni_intel 
wil6210 aes_x86_64 crypto_simd 
snd_hda_intel igb glue_helper cryptd snd_hda_codec cfg80211 eeepc_wmi 
ptp snd_hda_core asus_wmi pps_core ccp sparse_keymap sp5100_tco wmi_bmof 
snd_hwdep
dca rng_core pcspkr atlantic rfkill snd_pcm i2c_piix4 k10temp shpchp 
rtc_cmos evdev gpio_amdpt pinctrl_amd mac_hid acpi_cpufreq vboxnetflt(O) 
vboxnetadp(O) vboxpci(O) vboxdrv(O)
[Sun Apr  8 09:40:45 2018]  snd_seq_dummy snd_seq_oss snd_seq_midi_event 
snd_seq snd_seq_device snd_timer snd soundcore cuse fuse 
cpufreq_powersave cpufreq_ondemand crypto_user ip_tables x_tables sd_mod 
hid_apple uas
usb_storage hid_saitek hid_generic usbhid hid ahci xhci_pci libahci 
xhci_hcd libata usbcore scsi_mod usb_common crc32c_generic crc32c_intel 
btrfs xor zstd_decompress zstd_compress xxhash raid6_pq nouveau 
led_class mxm_wmi wmi i2c_algo_bit drm_kms_helper syscopyarea sysfillrect 
sysimgblt 
fb_sys_fops ttm drm agpgart
[Sun Apr  8 09:40:45 2018] CPU: 4 PID: 64370 Comm: kworker/u256:87 
Tainted: G    W  O 4.15.15-1-ARCH #1
[Sun Apr  8 09:40:45 2018] Hardware name: System manufacturer System 
Product Name/ROG ZENITH EXTREME, BIOS 0902 12/21/2017
[Sun Apr  8 09:40:45 2018] Workqueue: btrfs-extent-refs 
btrfs_extent_refs_helper [btrfs]
[Sun Apr  8 09:40:45 2018] RIP: 0010:btrfs_run_delayed_refs+0x167/0x1b0 
[btrfs]

[Sun Apr  8 09:40:45 2018] RSP: 0018:a085e34d3dd8 EFLAGS: 00010282
[Sun Apr  8 09:40:45 2018] RAX:  RBX: 96b4f02ab9c8 
RCX: 0001
[Sun Apr  8 09:40:45 2018] RDX: 8001 RSI: a1e47fcf 
RDI: 
[Sun Apr  8 09:40:45 2018] RBP: 96b92514 R08: 0001 
R09: 0a02
[Sun Apr  8 09:40:45 2018] R10: 0005 R11:  
R12: 96b915a5ec00
[Sun Apr  8 09:40:45 2018] R13:  R14: 96b4e187c600 
R15: 96b92c411400
[Sun Apr  8 09:40:45 2018] FS:  () 
GS:96b92cb0() knlGS:

[Sun Apr  8 09:40:45 2018] CS:  0010 DS:  ES:  CR0: 80050033
[Sun Apr  8 09:40:45 2018] CR2: 7f2c661a96f0 CR3: 000f48fcc000 
CR4: 003406e0

[Sun Apr  8 09:40:45 2018] Call Trace:
[Sun Apr  8 09:40:45 2018]  delayed_ref_async_start+0x8d/0xa0 [btrfs]
[Sun Apr  8 09:40:45 2018]  normal_work_helper+0x39/0x370 [btrfs]
[Sun Apr  8 09:40:45 2018]  process_one_work+0x1ce/0x410
[Sun Apr  8 09:40:45 2018]  worker_thread+0x2b/0x3d0
[Sun Apr  8 09:40:45 2018]  ? process_one_work+0x410/0x410
[Sun Apr  8 09:40:45 2018]  kthread+0x113/0x130
[Sun Apr  8 09:40:45 2018]  ? kthread_create_on_node+0x70/0x70
[Sun Apr  8 09:40:45 2018]  ret_from_fork+0x22/0x40
[Sun Apr  8 09:40:45 2018] Code: a0 82 d8 e0 eb 90 48 8b 53 60 f0 0f ba 
aa 50 12 00 00 02 72 1b 83 f8 fb 74 37 89 c6 48 c7 c7 c8 28 53 c0 89 04 
24 e8 c9 fa be e0 <0f> 0b 8b 04 24 89 c1 ba 04 0c 00 00 48 c7 c6 00 b9 
52 c0 48 89

[Sun Apr  8 09:40:45 2018] ---[ end trace 6ad220910a160dd3 ]---
[Sun Apr  8 09:40:45 2018] BTRFS: error (device sda2) in 
btrfs_run_delayed_refs:3076: errno=-17 Object already exists

[Sun Apr  8 09:40:45 2018] BTRFS info (device sda2): forced readonly
[Sun Apr  8 09:40:45 2018] BTRFS error (device sda2): pending csums is 
331776



[Tue Apr 10 05:23:22 2018] hid-generic 0003:1E71:170E.0009: 
hiddev0,hidraw0: USB HID v1.10 Device [NZXT.-Inc. NZXT USB Device] on 
usb-:01:00.0-9/input0

[Tue Apr 10 06:19:31 2018] [ cut here ]
[Tue Apr 10 06:19:31 2018] BTRFS: Transaction aborted (error -17)
[Tue Apr 10 06:19:31 2018] WARNING: CPU: 18 PID: 541 at 
fs/btrfs/extent-tree.c:3076 btrfs_run_delayed_refs+0x167/0x1b0 [btrfs]
[Tue Apr 10 06:19:31 2018] Modules linked in: cmac md4 nls_utf8 cifs ccm 
dns_resolver fscache lz4 lz4_compress zram rfcomm bnep xt_tcpudp 
iptable_filter hwmon_vid msr arc4 ext4 mbcache jbd2 fscrypto btusb 
edac_mce_amd btrtl btbcm bt

Re: remounted ro during operation, unmountable since

2018-04-14 Thread Qu Wenruo


On 2018年04月14日 21:45, Timo Nentwig wrote:
> On 04/14/2018 11:42 AM, Qu Wenruo wrote:
>> And the work load when the RO happens is also helpful.
>> (Well, the dmesg of RO happens would be the best though)
> Surprisingly nothing special AFAIR. It's a private, mostly idle machine.
> Probably "just" browsing with chrome.
> I didn't notice the remount right away as there were no obvious
> failures. And even then I kept it running for a couple more hours/a day
> or so.
> 
> I had a glance at dmesg but don't remember anything specific (think the
> usual " [cut here] ---" + dump of registers, but I'm not even sure
> about that). Sorry.
> 
> Actually the same thing happened just a few days earlier, and after a
> reboot (and maybe fsck) it was back up and good. I was optimistic it would
> go the same way this time as well :) In general I have had to hard-reset
> (+ fsck) a couple of times recently.

So, after each hard reset, fsck was executed and btrfs check exposed no
problem (before the RW mount)?

That's interesting.
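
For illustration, such a pre-RW-mount check could be captured like this
(a minimal sketch, assuming the device is unmounted, e.g. booted from a
live/rescue environment; --readonly is btrfs check's default mode anyway):

   # btrfs check --readonly /dev/sda2 2>&1 | tee /tmp/btrfs-check.log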

> Except for the SSD it's an
> all-new machine that I'm still OC/stress-testing. But not when that
> particular event happened.

A little off-topic, but Linux + OC is not that common in my opinion,
especially since we don't have AMD Ryzen Master or Intel XTU under Linux.

>> Besides the above salvage method, please also consider providing the
>> following data, as your case is pretty special and may help us catch
>> a long-hidden bug.
> If only I had known, I would have saved the dmesg! :)
> Sure, I'd be happy to help. If you need any more information, just let me
> know.
>> 1) Extent tree dump
>>     Need above 2 patches applied first.
>>
>>     # btrfs inspect dump-tree -t extent /dev/sda2 &> \
>>   /tmp/extent_tree_dump
>>     If above dump is too large, "grep -C20 166030671872" of the output is
>>     also good enough.
> 
> I'll send you a link to the full dump directly.

The grepped result is good enough; feel free to delete the full dump.


>     item 16 key (166030671872 EXTENT_ITEM 4096) itemoff 3096 itemsize 51
>         refs 1 gen 1702074 flags TREE_BLOCK
>         tree block key (162793705472 EXTENT_ITEM 4096) level 0
>         tree block backref root 2

So at least btrfs still considers that this tree block should belong to
the extent tree.

>     item 17 key (166030671872 BLOCK_GROUP_ITEM 1073741824) itemoff 3072
> itemsize 24
>         block group used 96915456 chunk_objectid 256 flags METADATA

Your metadata uses the SINGLE profile, the default for SSDs. Nothing special here.

Currently the problem looks like a tree-log tree block got allocated into
the extent tree (the log tree is the only way btrfs allocates tree blocks
without updating the extent tree).
And when the log tree got replayed, your fs got corrupted.
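
For illustration, whether a log tree is pending replay can be read from
the superblock (a minimal sketch, reusing the dump-super command quoted
below; a non-zero log_root means the log would be replayed on the next RW
mount):

   # btrfs inspect dump-super -f /dev/sda2 | grep log_root

In the dump below, log_root is 180056702976, i.e. non-zero.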

Did you have several hard resets before the fs remounted itself RO?

>> 2) super block dump
>>     # btrfs inspect dump-super -f /dev/sda2
> superblock: bytenr=65536, device=/dev/sda2
> -
> csum_type        0 (crc32c)
> csum_size        4
> csum            0xef0068ba [match]
> bytenr            65536
> flags            0x1
>             ( WRITTEN )
> magic            _BHRfS_M [match]
> fsid            22e778f7-2499-4379-99d2-cdd399d1cc6e
> label            830
> generation        1706541

The offending tree block has generation 1705980, which was 561
generations ago.

Although it's hard to tell the real-world time, at least the problem was
not directly caused by your first automatic RO remount.

The problem must have existed for a while already.

> root            167104118784
> sys_array_size        97
> chunk_root_generation    1702072
> root_level        1
> chunk_root        186120536064
> chunk_root_level    1
> log_root        180056702976
> log_root_transid    0

Not sure if this is common; I need to double-check later.

> log_root_level        0
> total_bytes        63879249920
> bytes_used        36929691648
> sectorsize        4096
> nodesize        4096

The nodesize is not the default 16K; any reason for this?
(Maybe performance?)

>> 3) Extra hardware info about your sda
>>     Things like SMART and hardware model would also help here.
> smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.15.15-1-ARCH] (local build)
> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family: Samsung based SSDs
> Device Model: SAMSUNG SSD 830 Series

At least I haven't heard of many problems with Samsung SSDs, so I don't think
the hardware is to blame. (Unlike the Intel 600P.)


>> 4) The mount option of /dev/sda2
> 
> /dev/sda2    /    btrfs compress=zstd,discard,autodefrag,subvol=/  
> 0   0

Discard used to cause some problems, but that should be fixed in recent
releases IIRC.

Regardless, the discard mount option is not recommended IIRC; routine
fstrim is preferred instead.
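
For illustration, the routine-fstrim approach could look like this (a
minimal sketch, assuming util-linux's fstrim.timer is available, as it is
on Arch, and that "discard" is dropped from the fstab line):

   # fstrim -v /
   # systemctl enable --now fstrim.timer

The shipped timer trims mounted filesystems periodically (weekly by
default) instead of issuing discards inline on every delete.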

> 
> And if that matters (AFAIK subvolume mount options have no effect anyway):
> 
> /dev/sda2    /var/lib/postgres   btrfs
> compress=zs

Re: remounted ro during operation, unmountable since

2018-04-14 Thread Timo Nentwig

On 04/14/2018 11:42 AM, Qu Wenruo wrote:

And the work load when the RO happens is also helpful.
(Well, the dmesg of RO happens would be the best though)
Surprisingly nothing special AFAIR. It's a private, mostly idle machine. 
Probably "just" browsing with chrome.
I didn't notice the remount right away as there were no obvious 
failures. And even then I kept it running for a couple more hours/a day 
or so.


I had a glance at dmesg but don't remember anything specific (think the 
usual " [cut here] ---" + dump of registers, but I'm not even sure 
about that). Sorry.


Actually the same thing happened just a few days earlier, and after a 
reboot (and maybe fsck) it was back up and good. I was optimistic it would 
go the same way this time as well :) In general I have had to hard-reset 
(+ fsck) a couple of times recently. Except for the SSD it's an 
all-new machine that I'm still OC/stress-testing, but not when that 
particular event happened.

Besides the above salvage method, please also consider providing the
following data, as your case is pretty special and may help us catch
a long-hidden bug.

If only I had known, I would have saved the dmesg! :)
Sure, I'd be happy to help. If you need any more information, just let me 
know.

1) Extent tree dump
Need above 2 patches applied first.

# btrfs inspect dump-tree -t extent /dev/sda2 &> \
  /tmp/extent_tree_dump
If above dump is too large, "grep -C20 166030671872" of the output is
also good enough.


I'll send you a link to the full dump directly.

    key (162020040704 EXTENT_ITEM 4096) block 130834812928 (31942093) 
gen 1627950
    key (162032852992 EXTENT_ITEM 16384) block 130834829312 (31942097) 
gen 1627950
    key (162058788864 EXTENT_ITEM 126976) block 166896513024 (40746219) 
gen 1627103
    key (162067652608 EXTENT_ITEM 122880) block 181095837696 (44212851) 
gen 1626650
    key (162074021888 EXTENT_ITEM 126976) block 181095841792 (44212852) 
gen 1626650
    key (162080391168 EXTENT_ITEM 126976) block 181095845888 (44212853) 
gen 1626650
    key (162086768640 EXTENT_ITEM 126976) block 140649435136 (34338241) 
gen 1697624
    key (162541010944 EXTENT_ITEM 1052672) block 141310992384 
(34499754) gen 1626990
    key (162557276160 EXTENT_ITEM 118784) block 130641969152 (31895012) 
gen 1627722
    key (162677518336 EXTENT_ITEM 122880) block 140914860032 (34403042) 
gen 1626930
    key (162678145024 EXTENT_ITEM 122880) block 140343918592 (34263652) 
gen 1626743
    key (162682937344 EXTENT_ITEM 122880) block 181095866368 (44212858) 
gen 1626650
    key (162692177920 EXTENT_ITEM 118784) block 130621235200 (31889950) 
gen 1627194
    key (162698711040 EXTENT_ITEM 28672) block 130621198336 (31889941) 
gen 1627194
    key (162699722752 EXTENT_ITEM 4096) block 180581654528 (44087318) 
gen 1675166
    key (162702168064 EXTENT_ITEM 4096) block 131325329408 (32061848) 
gen 1626697
    key (162702237696 EXTENT_ITEM 8192) block 181116600320 (44217920) 
gen 1626654
    key (162707652608 EXTENT_ITEM 49152) block 181095890944 (44212864) 
gen 1626650
    key (162710052864 EXTENT_ITEM 1859584) block 140858179584 
(34389204) gen 1626926
    key (162735849472 EXTENT_ITEM 32768) block 140858195968 (34389208) 
gen 1626926
    key (162793705472 EXTENT_ITEM 4096) block 166030671872 (40534832) 
gen 1702074
    key (162858274816 EXTENT_ITEM 21327872) block 140301692928 
(34253343) gen 1626728
    key (162881642496 EXTENT_ITEM 73728) block 130903351296 (31958826) 
gen 1626665
    key (162884112384 EXTENT_ITEM 53248) block 130903355392 (31958827) 
gen 1626665
    key (162884964352 EXTENT_ITEM 4096) block 130903359488 (31958828) 
gen 1626665
leaf 181087133696 items 51 free space 17 generation 1626650 owner 
EXTENT_TREE

leaf 181087133696 flags 0x1(WRITTEN) backref revision 1
fs uuid 22e778f7-2499-4379-99d2-cdd399d1cc6e
chunk uuid bee8ad15-e128-45f1-a3d7-e2fda17806ce
    item 0 key (161456054272 EXTENT_ITEM 4096) itemoff 3942 itemsize 53
        refs 1 gen 1626650 flags DATA
        extent data backref root 5 objectid 89243202 offset 0 count 1
    item 1 key (161456058368 EXTENT_ITEM 4096) itemoff 3889 itemsize 53
        refs 1 gen 1626650 flags DATA
        extent data backref root 5 objectid 89243208 offset 0 count 1
    item 2 key (161456062464 EXTENT_ITEM 4096) itemoff 3836 itemsize 53
        refs 1 gen 1626650 flags DATA
        extent data backref root 5 objectid 89243192 offset 0 count 1
    item 3 key (161456066560 EXTENT_ITEM 4096) itemoff 3783 itemsize 53
        refs 1 gen 1626650 flags DATA
        extent data backref root 5 objectid 89243198 offset 0 count 1
--
        refs 1 gen 1626665 flags DATA
        extent data backref root 5 objectid 32187101 offset 655360 count 1
    item 35 key (162793402368 EXTENT_ITEM 20480) itemoff 2087 itemsize 53
        refs 1 gen 1626665 flags DATA
        extent data backref root 5 objectid 32187104 offset 0 count 1
    item 36 key (162793422848 EXTENT_ITEM 110592) itemoff 2034 itemsize 53
        refs 1 gen 162666

Re: remounted ro during operation, unmountable since

2018-04-14 Thread Qu Wenruo


On 2018年04月14日 15:31, Timo Nentwig wrote:
> Hi!
> 
> btrfs remounted itself ro during operation (don't have the dmesg) and
> fails to mount after reboot.

The mount options please, and if possible, the hardware model of
sda2.

And the work load when the RO happens is also helpful.
(Well, the dmesg of RO happens would be the best though)
> 
> Any advice?
> 
> 
> 4.15.15-1-ARCH #1 SMP PREEMPT Sat Mar 31 23:59:25 UTC 2018 x86_64 GNU/Linux
> btrfs-progs v4.16
> Label: '830'  uuid: 22e778f7-2499-4379-99d2-cdd399d1cc6e
>     Total devices 1 FS bytes used 34.39GiB
>     devid    1 size 59.49GiB used 58.98GiB path /dev/sda2
> 
> [  867.041397] BTRFS info (device sda2): disk space caching is enabled
> [  867.185357] BTRFS info (device sda2): bdev /dev/sda2 errs: wr 0, rd
> 158, flush 0, corrupt 0, gen 0
> [  868.423427] BTRFS error (device sda2): parent transid verify failed
> on 166030671872 wanted 1702074 found 1705980

Another metadata corruption.

> [  868.425400] BTRFS error (device sda2): failed to read block groups: -5

Extent tree is corrupted.
If your objective is just to salvage data, at least this means your data
should be mostly safe (if there is no other metadata corruption).

> 
> # sudo btrfs check /dev/sda2
> parent transid verify failed on 166030671872 wanted 1702074 found 1705980
> parent transid verify failed on 166030671872 wanted 1702074 found 1705980
> Ignoring transid failure
> leaf parent key incorrect 166030671872
> ERROR: cannot open file system

I could enhance btrfs check to continue checking, but it may take some
time to get rid of all the btrfs_block_group_cache usage for that case.

> # sudo btrfs-debug-tree -b 166030671872 /dev/sda2

This works because, when using the -b option, (current) btrfs check code
will not try to read block groups any more, so it can continue.

> btrfs-progs v4.16
> parent transid verify failed on 166030671872 wanted 1702074 found 1705980
> parent transid verify failed on 166030671872 wanted 1702074 found 1705980
> Ignoring transid failure
> leaf parent key incorrect 166030671872
> leaf 166030671872 items 60 free space 95 generation 1705980 owner TREE_LOG

This output is pretty interesting, and pretty helpful.

The tree block causing the problem belongs to TREE_LOG, while according
to the kernel dmesg it should belong to the extent tree.

So this shows the tree-log code is more or less involved in this situation.

> leaf 166030671872 flags 0x1(WRITTEN) backref revision 1
> fs uuid 22e778f7-2499-4379-99d2-cdd399d1cc6e
> chunk uuid bee8ad15-e128-45f1-a3d7-e2fda17806ce
>     item 0 key (46475 DIR_ITEM 2335543231) itemoff 3955 itemsize 40
>         location key (1973554 INODE_ITEM 0) type FILE
>         transid 1705438 data_len 0 name_len 10
>         name: 002443.sst
> 
>

If your primary objective is to salvage data, please apply these patches
on top of v4.16:
https://patchwork.kernel.org/patch/10334955/
https://patchwork.kernel.org/patch/10334957/

And then try btrfs restore to salvage as much data as possible.
In the best case (no other tree block corruption), it should salvage
all your data.
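
For illustration, a cautious btrfs restore run could look like this (a
minimal sketch, using the patched progs from above; /mnt/salvage is just a
placeholder destination on another device):

   # btrfs restore -D -v /dev/sda2 /mnt/salvage
     (dry run, only lists what would be restored)
   # btrfs restore -v -i /dev/sda2 /mnt/salvage
     (real run; -i continues past errors)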

Besides the above salvage method, please also consider providing the
following data, as your case is pretty special and may help us catch
a long-hidden bug.

1) Extent tree dump
   Need above 2 patches applied first.

   # btrfs inspect dump-tree -t extent /dev/sda2 &> \
 /tmp/extent_tree_dump
   If above dump is too large, "grep -C20 166030671872" of the output is
   also good enough.

   This would help us inspect the backrefs of that tree block to see if we
   can find anything wrong.

2) super block dump
   # btrfs inspect dump-super -f /dev/sda2

3) Extra hardware info about your sda
   Things like SMART and hardware model would also help here.

4) The mount option of /dev/sda2

Thanks,
Qu



