Re: Linux-next regression?
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug. It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that
> time).
> Also, the massive write of btrfs defrag can explain the massive
> trigger of the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains
the trouble you hit. Thanks for the report, I did end up using this as a
datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris
Re: Linux-next regression?
On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
> I think (hope) this is:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>
> Which was just nailed down to a blkmq bug. It triggers when you have
> scsi devices using elevator=none over blkmq.

Thanks a lot Chris. Really.
Good news: I confirm I recompiled and used blkmq and no-op (at that
time).
Also, the massive write of btrfs defrag can explain the massive
trigger of the bug, and next corruption.

Thanks again,
Andrea
Re: Linux-next regression?
On 28 Nov 2018, at 11:05, Andrea Gelmini wrote:

> On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>>
>> But it's less a concerning problem since it doesn't reach latest RC,
>> so if you could reproduce it stably, I'd recommend to do a bisect.
>
> No problem to bisect, usually.
> But right now it's not possible for me, I explain further.
> Anyway, here the rest of the story.
>
> So, in the end I:
> a) booted with 4.20.0-rc4
> b) updated backup
> c) did the btrfs check --readonly
> d) seven steps, everything is perfect
> e) no complaints on screen or in logs (never had)
> f) so, started to compile linux-next 20181128 (on another partition)
> g) without using (reading or writing) on /home, I started
> h) btrfs filesystem defrag -v -r -t 128M /home
> i) it worked without complaint (in screen or logs)
> j) then, reboot with kernel tag 20181128
> k) and no way to mount:

I think (hope) this is:

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Which was just nailed down to a blkmq bug. It triggers when you have
scsi devices using elevator=none over blkmq.

-chris
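A rough way to see whether a machine matches the condition described above (a block device with no I/O scheduler active) is to read /sys/block/<dev>/queue/scheduler, where the kernel marks the active scheduler with square brackets. The Python sketch below only parses that file. The function names are made up for illustration, it does not check whether a device is SCSI or whether blk-mq is in use, and a match does not by itself mean the bug in the bugzilla entry applies.

```python
import re
from pathlib import Path
from typing import List


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked active in a queue/scheduler file.

    The active entry is the bracketed one, e.g. "[none] mq-deadline kyber".
    Assumption: a file holding a single unbracketed name (such as "none")
    is treated as naming the active scheduler.
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match:
        return match.group(1)
    tokens = sysfs_text.split()
    return tokens[0] if len(tokens) == 1 else ""


def devices_without_scheduler(sys_block: str = "/sys/block") -> List[str]:
    """List block devices whose active scheduler reads as "none"."""
    names = []
    for sched_file in sorted(Path(sys_block).glob("*/queue/scheduler")):
        try:
            text = sched_file.read_text()
        except OSError:
            continue  # device went away or file is unreadable; skip it
        if active_scheduler(text) == "none":
            # .../<dev>/queue/scheduler -> <dev>
            names.append(sched_file.parent.parent.name)
    return names


if __name__ == "__main__":
    for name in devices_without_scheduler():
        print(f"{name}: active scheduler is 'none'")
```

Run on the affected machine, this prints one line per device that currently has no scheduler; an empty output means none were found.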
Re: Linux-next regression?
On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>
> But it's less a concerning problem since it doesn't reach latest RC, so
> if you could reproduce it stably, I'd recommend to do a bisect.

No problem to bisect, usually.
But right now it's not possible for me, I explain further.
Anyway, here the rest of the story.

So, in the end I:
a) booted with 4.20.0-rc4
b) updated backup
c) did the btrfs check --readonly
d) seven steps, everything is perfect
e) no complaints on screen or in logs (never had)
f) so, started to compile linux-next 20181128 (on another partition)
g) without using (reading or writing) on /home, I started
h) btrfs filesystem defrag -v -r -t 128M /home
i) it worked without complaint (in screen or logs)
j) then, reboot with kernel tag 20181128
k) and no way to mount:

--
nov 28 15:44:03 glet kernel: BTRFS: device label home devid 1 transid 37360 /dev/mapper/cry-home
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): use lzo compression, level 0
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): turning on discard
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): enabling auto defrag
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2150302023680 have 17816181330383341936
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): open_ctree failed
--

l) get back to 4.20.0-rc4
m) mounted, but after a few minutes, I get this:

--
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2199347265536 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2199347265536, rebuilding it now
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2196126040064 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2196126040064, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 218431488 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 218431488, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 2183241138176 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2183241138176, rebuilding it now
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): block group 2152102625280 has wrong amount of free space
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2152102625280, rebuilding it now
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): block group 2530059747328 has wrong amount of free space
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2530059747328, rebuilding it now
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): block group 2151028883456 has wrong amount of free space
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2151028883456, rebuilding it now
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): block group 2203642232832 has wrong amount of free space
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2203642232832, rebuilding it now
--

n) and then read-only mode:

--
[ 1058.996960] BTRFS error (device dm-3): bad tree block start, want 2150382092288 have 159161645701828393
[ 1058.996967] BTRFS: error (device dm-3) in __btrfs_free_extent:6831: errno=-5 IO failure
[ 1058.996969] BTRFS info (device dm-3): forced readonly
[ 1058.996971] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2978: errno=-5 IO failure
[ 1059.002857] BTRFS error (device dm-3): pending csums is 97832960
--

So, ok, for the moment I'm very sorry I can't help you with bisect,
because I have to revert to ext4. This is the laptop I use to work with.

If I can help you investigating, just tell me.

Thanks for your time,
Gelma
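When collecting reports like the one above, it can help to pull the "bad tree block start" pairs out of a saved kernel log so they can be compared across boots. The Python sketch below is one way to do that. The regular expression is written against the message text exactly as it appears in these logs, and the function name is invented for illustration; other kernel versions may word the message differently.

```python
import re
from typing import List, Tuple

# Matches the message as it appears in the logs quoted in this thread.
BAD_START = re.compile(r"bad tree block start, want (\d+) have (\d+)")


def bad_tree_block_starts(log_text: str) -> List[Tuple[int, int]]:
    """Return (want, have) integer pairs for every matching log line."""
    return [(int(want), int(have)) for want, have in BAD_START.findall(log_text)]


SAMPLE = """\
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2150302023680 have 17816181330383341936
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
[ 1058.996960] BTRFS error (device dm-3): bad tree block start, want 2150382092288 have 159161645701828393
"""

if __name__ == "__main__":
    for want, have in bad_tree_block_starts(SAMPLE):
        print(f"want={want} have={have}")
```

The sample text is copied from the two failing reads reported above; lines that do not contain the message are ignored.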
Re: Linux-next regression?
On 2018/11/27 10:11 PM, Andrea Gelmini wrote:
> On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
>>> One question: I can completely trust the ok return status of scrub? I
>>> know is made for this, but shit happens...
>>
>> No, scrub only checks csum of data and tree blocks, it doesn't ensure
>> the content of tree blocks are OK.
>
> Hi Qu,
> and thanks a lot, really. Your answers are always the best: short,
> detailed and very kind. You rock.
>
> I'm going to send a patch to propose to add your explanation above
> on the relative man page, if you agree.
>
>> For comprehensive check, go "btrfs check --readonly".
>
> I'll do it.
>
> At the moment I just compared the file existence between my laptop and
> latest backup. Everything is fine.
>
>>
>> However I don't think it's something "btrfs check --readonly" would
>> report, but some strange behavior, maybe from LVM or cryptsetup.
>
> Well, I'm using this setup with ext4 and xfs, on same machine, without
> troubles.

Then it indeed looks like something goes wrong in linux-next.
I would recommend to do a bisect if possible.

As you compared all your data with laptop, it ensures your csum/file
trees are OK, thus no corruption in those trees.

But still something doesn't look right for extent tree only.

But it's less a concerning problem since it doesn't reach latest RC, so
if you could reproduce it stably, I'd recommend to do a bisect.

Thanks,
Qu

> I've got files checksummed on the backup machine, so I can be sure about
> comparing integrity.
>
> Anyway, thanks a lot again,
> Andrea
>
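For readers unfamiliar with why a bisect is practical even across a large range of commits, the idea can be sketched as a binary search. The Python below is a simplified model, not how git bisect is implemented: it assumes a linear list of commits ordered oldest to newest, a first entry known to be good, a last entry known to be bad, and a test that reproduces the failure reliably (the "reproduce it stably" condition above). The commit names and the is_bad callback are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Find the first bad commit and count how many commits were tested.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    low, high = 0, len(commits) - 1  # low: known good, high: known bad
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high], tests


if __name__ == "__main__":
    # Hypothetical history of 1000 commits with a regression at c613.
    history = [f"c{i}" for i in range(1000)]
    culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 613)
    print(culprit, tests)
```

Each test halves the remaining range, so the number of kernels to build and boot grows with the logarithm of the number of commits rather than linearly.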
Re: Linux-next regression?
On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>
>
> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> > One question: I can completely trust the ok return status of scrub? I
> > know is made for this, but shit happens...
>
> No, scrub only checks csum of data and tree blocks, it doesn't ensure
> the content of tree blocks are OK.

Hi Qu,
and thanks a lot, really. Your answers are always the best: short,
detailed and very kind. You rock.

I'm going to send a patch to propose to add your explanation above
on the relative man page, if you agree.

> For comprehensive check, go "btrfs check --readonly".

I'll do it.

At the moment I just compared the file existence between my laptop and
latest backup. Everything is fine.

>
> However I don't think it's something "btrfs check --readonly" would
> report, but some strange behavior, maybe from LVM or cryptsetup.

Well, I'm using this setup with ext4 and xfs, on same machine, without
troubles.

I've got files checksummed on the backup machine, so I can be sure about
comparing integrity.

Anyway, thanks a lot again,
Andrea
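Comparing file existence only shows that names match; comparing checksums, as mentioned at the end of the message above, also catches files whose content changed. A minimal Python sketch of such a comparison follows. It is not the procedure actually used in this thread: the function names, the choice of SHA-256, and the directory layout in demo() are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each regular file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def compare_trees(source: Path, backup: Path) -> Dict[str, List[str]]:
    """Report files missing on either side and files whose content differs."""
    src, bak = tree_digests(source), tree_digests(backup)
    common = set(src) & set(bak)
    return {
        "missing_in_backup": sorted(set(src) - set(bak)),
        "missing_in_source": sorted(set(bak) - set(src)),
        "content_differs": sorted(k for k in common if src[k] != bak[k]),
    }


def demo() -> Dict[str, List[str]]:
    """Build two small throwaway trees and compare them."""
    with tempfile.TemporaryDirectory() as tmp:
        source, backup = Path(tmp, "source"), Path(tmp, "backup")
        source.mkdir()
        backup.mkdir()
        (source / "same.txt").write_bytes(b"identical")
        (backup / "same.txt").write_bytes(b"identical")
        (source / "changed.txt").write_bytes(b"new")
        (backup / "changed.txt").write_bytes(b"old")
        (source / "only-here.txt").write_bytes(b"x")
        return compare_trees(source, backup)


if __name__ == "__main__":
    print(demo())
```

On a multi-terabyte volume, hashing both sides takes a long time, so in practice one side's digests would normally be computed once and stored.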
Re: Linux-next regression?
On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> Hi everybody,
> and thanks a lot for your work.
>
> I'm using BTRFS over LVM over cryptsetup, over Samsung SSD 860 EVO (latest
> git of btrfs-progs).
> Usually I run kernel in development, because I know BTRFS is young and
> there are still lots of bugs and corner case to fix.
>
> Anyway, I just want to submit to you a - maybe - useful info.
>
> Yesterday I compiled and booted latest linux-next,¹ and I've got this:
>
> ---
> nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
> nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262

This means we failed to read one extent tree block and caused the
problem.

And if you're using default mkfs profile it should try again to use the
extra copy, but it doesn't look like to be the case.

BTW, does it always happen like this? Or is there any possibility
involved?

> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
> ---
>
> Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4
> (compiled on this machine), the problem disappears.
>
> Now, running scrub a few times, and copying data (all files of the logical
> volume) to external device, gives no complaint

Would you please also try "btrfs check --readonly"?

>
> Here I stop. This is my primary dev laptop, and at the moment I can't
> spend time switching/rebooting/testing. I'm comparing the data with last
> backup (I rsync each hour), but it takes time (it's more than 3TB).
>
> So, that was about to let you know. Well, it's Ubuntu 18.10, and between
> reboots no dist-upgrade or changes in booting related packages or systemd.
>
> One question: I can completely trust the ok return status of scrub? I know
> is made for this, but shit happens...

No, scrub only checks csum of data and tree blocks, it doesn't ensure
the content of tree blocks are OK.

For comprehensive check, go "btrfs check --readonly".

However I don't think it's something "btrfs check --readonly" would
report, but some strange behavior, maybe from LVM or cryptsetup.

Thanks,
Qu

>
> Kisses,
> Gelma
>
> -
> ¹ commit: 8c9733fd9806c71e7f2313a280f98cb3051f93df
>   "Add linux-next specific files for 20181123"
> ² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/
>
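The distinction drawn above, between a block whose checksum matches and a block whose content is actually valid, can be shown with a toy model. The Python below is not the btrfs on-disk format: the block layout (a CRC followed by a key count and a list of keys), the use of zlib's CRC-32 instead of crc32c, and all function names are assumptions chosen to keep the example short. It only shows that a block written with logically wrong content still carries a matching checksum.

```python
import struct
import zlib
from typing import List


def pack_node(keys: List[int]) -> bytes:
    """Toy block: 4-byte CRC-32, then a key count and the keys themselves."""
    body = struct.pack(f"<I{len(keys)}Q", len(keys), *keys)
    return struct.pack("<I", zlib.crc32(body)) + body


def checksum_ok(block: bytes) -> bool:
    """Scrub-like check in this toy model: does the stored CRC match the body?"""
    (stored,) = struct.unpack_from("<I", block, 0)
    return stored == zlib.crc32(block[4:])


def keys_ok(block: bytes) -> bool:
    """Content check in this toy model: are the keys strictly increasing?"""
    (count,) = struct.unpack_from("<I", block, 4)
    keys = struct.unpack_from(f"<{count}Q", block, 8)
    return all(a < b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    good = pack_node([1, 5, 9])
    bad = pack_node([9, 1, 5])  # wrong content, checksummed after the fact
    print(checksum_ok(good), keys_ok(good))
    print(checksum_ok(bad), keys_ok(bad))
```

In this model the second block passes the checksum test and fails the content test, which is the kind of problem a checksum-only pass cannot see.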
Linux-next regression?
Hi everybody,
and thanks a lot for your work.

I'm using BTRFS over LVM over cryptsetup, over Samsung SSD 860 EVO (latest
git of btrfs-progs).
Usually I run kernel in development, because I know BTRFS is young and
there are still lots of bugs and corner case to fix.

Anyway, I just want to submit to you a - maybe - useful info.

Yesterday I compiled and booted latest linux-next,¹ and I've got this:

---
nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
---

Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4
(compiled on this machine), the problem disappears.

Now, running scrub a few times, and copying data (all files of the logical
volume) to external device, gives no complaint.

Here I stop. This is my primary dev laptop, and at the moment I can't
spend time switching/rebooting/testing. I'm comparing the data with last
backup (I rsync each hour), but it takes time (it's more than 3TB).

So, that was about to let you know. Well, it's Ubuntu 18.10, and between
reboots no dist-upgrade or changes in booting related packages or systemd.

One question: I can completely trust the ok return status of scrub? I know
is made for this, but shit happens...

Kisses,
Gelma

-
¹ commit: 8c9733fd9806c71e7f2313a280f98cb3051f93df
  "Add linux-next specific files for 20181123"
² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/