Re: Linux-next regression?

2018-12-05 Thread Chris Mason
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug.  It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that time).
> Also, the heavy writes from btrfs defrag would explain why the bug
> triggered so massively, and the corruption that followed.

Sorry this happened, but glad you were able to confirm that it explains 
the trouble you hit.  Thanks for the report, I did end up using this as 
a datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris


Re: Linux-next regression?

2018-12-05 Thread Andrea Gelmini
On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
> I think (hope) this is:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Which was just nailed down to a blkmq bug.  It triggers when you have 
> scsi devices using elevator=none over blkmq.

Thanks a lot Chris. Really.
Good news: I confirm I recompiled and used blkmq and no-op (at that time).
Also, the heavy writes from btrfs defrag would explain why the bug
triggered so massively, and the corruption that followed.

Thanks again,
Andrea


Re: Linux-next regression?

2018-12-04 Thread Chris Mason
On 28 Nov 2018, at 11:05, Andrea Gelmini wrote:

> On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>>
>> It's less of a concern since the problem hasn't reached the latest RC, so
>> if you can reproduce it reliably, I'd recommend doing a bisect.
>
> Normally I'd have no problem bisecting, but right now it's not possible
> for me, as I explain below.
> Anyway, here is the rest of the story.
>
> So, in the end I:
> a) booted with 4.20.0-rc4
> b) updated backup
> c) did the btrfs check --readonly
> d) seven steps, everything is perfect
> e) no complaints on screen or in logs (never had any)
> f) so, started to compile linux-next 20181128 (on another partition)
> g) without using (reading or writing) /home, I started
> h) btrfs filesystem defrag -v -r -t 128M /home
> i) it worked without complaint (on screen or in logs)
> j) then, rebooted with kernel tag 20181128
> k) and no way to mount:

I think (hope) this is:

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Which was just nailed down to a blkmq bug.  It triggers when you have 
scsi devices using elevator=none over blkmq.

-chris
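
To see whether a machine matches the trigger condition described above, the active I/O scheduler of each block device can be read from sysfs. Below is a minimal Python sketch, assuming the usual /sys/block/<dev>/queue/scheduler layout in which the active scheduler is shown in square brackets. The helper names are made up for illustration.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the scheduler marked with [brackets], or the lone name if none is bracketed."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match:
        return match.group(1)
    return text.strip() or None


def schedulers(sys_block=Path("/sys/block")):
    """Map each block device name to its active scheduler (assumed sysfs layout)."""
    result = {}
    if not sys_block.is_dir():
        return result
    for dev in sorted(sys_block.iterdir()):
        try:
            result[dev.name] = active_scheduler((dev / "queue" / "scheduler").read_text())
        except OSError:
            continue  # device without a scheduler file, or not readable
    return result


if __name__ == "__main__":
    for name, sched in schedulers().items():
        note = "  <- no elevator" if sched == "none" else ""
        print(f"{name}: {sched}{note}")
```

A device reporting "none" is only a hint. Whether a given kernel is actually affected has to be checked against the bugzilla entry above.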


Re: Linux-next regression?

2018-11-28 Thread Andrea Gelmini
On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>
> It's less of a concern since the problem hasn't reached the latest RC, so
> if you can reproduce it reliably, I'd recommend doing a bisect.

Normally I'd have no problem bisecting, but right now it's not possible
for me, as I explain below.
Anyway, here is the rest of the story.

So, in the end I:
a) booted with 4.20.0-rc4
b) updated backup
c) did the btrfs check --readonly
d) seven steps, everything is perfect
e) no complaints on screen or in logs (never had any)
f) so, started to compile linux-next 20181128 (on another partition)
g) without using (reading or writing) /home, I started
h) btrfs filesystem defrag -v -r -t 128M /home
i) it worked without complaint (on screen or in logs)
j) then, rebooted with kernel tag 20181128
k) and no way to mount:

--
nov 28 15:44:03 glet kernel: BTRFS: device label home devid 1 transid 37360 /dev/mapper/cry-home
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): use lzo compression, level 0
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): turning on discard
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): enabling auto defrag
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 28 15:44:04 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2150302023680 have 17816181330383341936
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): open_ctree failed
--
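
The "bad tree block start" error above carries two numbers. Assuming the usual reading of this message, "want" is the logical address the kernel expected and "have" is the address stored in the header of the block it actually read. A "have" value far outside the filesystem's address range suggests the block holds unrelated data. Below is a small Python sketch for pulling these pairs out of a saved log, tolerating lines wrapped by a mail client. The function name is hypothetical.

```python
import re

# Matches the kernel message even when whitespace or a line wrap sits between fields.
BAD_START = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\):\s+bad tree block start,"
    r"\s+want\s+(?P<want>\d+)\s+have\s+(?P<have>\d+)"
)


def parse_bad_tree_blocks(lines):
    """Return (device, want, have) tuples found in an iterable of log lines."""
    # Join with spaces so a message split across two lines is matched as one.
    text = " ".join(line.strip() for line in lines)
    return [
        (m.group("dev"), int(m.group("want")), int(m.group("have")))
        for m in BAD_START.finditer(text)
    ]


if __name__ == "__main__":
    sample = [
        "nov 28 15:44:04 glet kernel: BTRFS error (device dm-3): bad tree block start, ",
        "want 2150302023680 have 17816181330383341936",
    ]
    for dev, want, have in parse_bad_tree_blocks(sample):
        print(f"{dev}: expected {want}, found {have}")
```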

l) get back to 4.20.0-rc4
m) mounted, but after a few minutes, I get this:

--
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2199347265536 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2199347265536, rebuilding it now
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): block group 2196126040064 has wrong amount of free space
nov 28 15:51:23 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2196126040064, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 218431488 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 218431488, rebuilding it now
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): block group 2183241138176 has wrong amount of free space
nov 28 15:52:09 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2183241138176, rebuilding it now
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): block group 2152102625280 has wrong amount of free space
nov 28 15:52:53 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2152102625280, rebuilding it now
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): block group 2530059747328 has wrong amount of free space
nov 28 15:54:13 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2530059747328, rebuilding it now
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): block group 2151028883456 has wrong amount of free space
nov 28 15:55:10 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2151028883456, rebuilding it now
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): block group 2203642232832 has wrong amount of free space
nov 28 15:55:48 glet kernel: BTRFS warning (device dm-3): failed to load free space cache for block group 2203642232832, rebuilding it now
--

n) and then read-only mode:

--
[ 1058.996960] BTRFS error (device dm-3): bad tree block start, want 2150382092288 have 159161645701828393
[ 1058.996967] BTRFS: error (device dm-3) in __btrfs_free_extent:6831: errno=-5 IO failure
[ 1058.996969] BTRFS info (device dm-3): forced readonly
[ 1058.996971] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2978: errno=-5 IO failure
[ 1059.002857] BTRFS error (device dm-3): pending csums is 97832960
--

So, for the moment, I'm very sorry I can't help you with a bisect, because I
have to revert to ext4. This is the laptop I use for work.

If I can help you investigate, just tell me.

Thanks for your time,
Gelma


Re: Linux-next regression?

2018-11-27 Thread Qu Wenruo


On 2018/11/27 10:11 PM, Andrea Gelmini wrote:
> On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
>>>   One question: can I completely trust the OK return status of scrub? I
>>> know it's made for this, but shit happens...
>>
>> No, scrub only verifies the checksums of data and tree blocks; it doesn't
>> ensure that the content of the tree blocks is OK.
> 
> Hi Qu,
>   and thanks a lot, really. Your answers are always the best: short,
>   detailed and very kind. You rock.
> 
>   I'm going to send a patch proposing to add your explanation above
>   to the relevant man page, if you agree.
> 
>> For a comprehensive check, run "btrfs check --readonly".
> 
>   I'll do it.
> 
>   At the moment I have only compared which files exist on my laptop and in
>   the latest backup. Everything is fine.
> 
>>
>> However I don't think it's something "btrfs check --readonly" would
>> report, but some strange behavior, maybe from LVM or cryptsetup.
> 
>   Well, I'm using this setup with ext4 and xfs, on same machine, without
>   troubles.

Then it indeed looks like something goes wrong in linux-next.

I would recommend doing a bisect if possible.

Since you compared all the data on your laptop against the backup, your
csum and file trees are confirmed OK, so there is no corruption in those trees.
Still, something doesn't look right in the extent tree.

It's less of a concern since the problem hasn't reached the latest RC, so
if you can reproduce it reliably, I'd recommend doing a bisect.
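
For readers unfamiliar with bisecting: git bisect repeatedly tests a commit halfway between a known-good and a known-bad revision until the first bad one is isolated. The halving idea can be sketched in Python, assuming a linear history whose first entry is good and whose last entry is bad. Real git bisect also handles merges, and the commit list and the is_bad callback here are made up for illustration.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is known good, commits[-1] is known bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi], tests


if __name__ == "__main__":
    history = list(range(1000))  # stand-in for 1000 commits
    culprit, tests = first_bad(history, lambda c: c >= 613)
    print(f"first bad commit: {culprit}, found after {tests} tests")
```

The number of build-and-boot cycles grows with the logarithm of the commit count, which is why a reliable reproducer matters more than the size of the range.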

Thanks,
Qu

>   I've got files checksummed on the backup machine, so I can be sure about
>   comparing integrity.
> 
> Anyway, thanks a lot again,
> Andrea
> 



Re: Linux-next regression?

2018-11-27 Thread Andrea Gelmini
On Tue, Nov 27, 2018 at 09:13:02AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> >   One question: can I completely trust the OK return status of scrub? I
> > know it's made for this, but shit happens...
> 
> No, scrub only verifies the checksums of data and tree blocks; it doesn't
> ensure that the content of the tree blocks is OK.

Hi Qu,
  and thanks a lot, really. Your answers are always the best: short,
  detailed and very kind. You rock.

  I'm going to send a patch proposing to add your explanation above
  to the relevant man page, if you agree.

> For a comprehensive check, run "btrfs check --readonly".

  I'll do it.

  At the moment I have only compared which files exist on my laptop and in
  the latest backup. Everything is fine.

> 
> However I don't think it's something "btrfs check --readonly" would
> report, but some strange behavior, maybe from LVM or cryptsetup.

  Well, I'm using this setup with ext4 and xfs, on same machine, without
  troubles.
  I've got files checksummed on the backup machine, so I can be sure about
  comparing integrity.
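
A checksum comparison of this kind can be sketched in Python as below. This is an illustration only, not the tooling actually used on the backup machine: SHA-256, the manifest layout, and the function names are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare(source, backup):
    """Return (missing from backup, only in backup, content differs)."""
    missing = sorted(set(source) - set(backup))
    extra = sorted(set(backup) - set(source))
    changed = sorted(k for k in set(source) & set(backup) if source[k] != backup[k])
    return missing, extra, changed
```

Comparing only which files exist corresponds to the first two lists. The third list is what catches silent content corruption.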

Anyway, thanks a lot again,
Andrea


Re: Linux-next regression?

2018-11-26 Thread Qu Wenruo


On 2018/11/26 11:01 PM, Andrea Gelmini wrote:
> Hi everybody,
>    and thanks a lot for your work.
> 
>    I'm using BTRFS over LVM over cryptsetup, on a Samsung SSD 860 EVO (latest
> git of btrfs-progs).
>    I usually run development kernels, because I know BTRFS is young and
> there are still lots of bugs and corner cases to fix.
> 
>    Anyway, I just want to pass along a (maybe) useful piece of info.
> 
>    Yesterday I compiled and booted the latest linux-next,¹ and I got this:
> 
> ---
> nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
> nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
> nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262

This means we failed to read one extent tree block, which caused the problem.

If you're using the default mkfs profile, it should retry using the
extra copy, but that doesn't look to be the case.

BTW, does it always happen like this? Or is it intermittent?

> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
> nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
> ---
> 
>    Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4
> (compiled on this machine), the problem disappears.
> 
>    Now, running scrub a few times, and copying data (all files of the logical
> volume) to an external device, gives no complaints.
Would you please also try "btrfs check --readonly"?

> 
>    Here I stop. This is my primary dev laptop, and at the moment I can't
> spend time switching/rebooting/testing. I'm comparing the data with the last
> backup (I rsync each hour), but it takes time (it's more than 3TB).
> 
>    So, this was just to let you know. Well, it's Ubuntu 18.10, and between
> reboots there was no dist-upgrade or change in boot-related packages or systemd.
> 
>    One question: can I completely trust the OK return status of scrub? I know
> it's made for this, but shit happens...

No, scrub only verifies the checksums of data and tree blocks; it doesn't
ensure that the content of the tree blocks is OK.

For a comprehensive check, run "btrfs check --readonly".
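
The two checks can be put side by side in a short Python wrapper. This is a sketch assuming btrfs-progs is installed. The mount point /home and the device /dev/mapper/cry-home are taken from the logs in this thread, and the function names are hypothetical. Note that scrub runs on a mounted filesystem, while btrfs check is meant to run on an unmounted one.

```python
import shlex
import subprocess


def scrub_command(mountpoint):
    """Checksum verification of data and tree blocks; -B stays in the foreground."""
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def check_command(device):
    """Structural check of the trees without writing to the device."""
    return ["btrfs", "check", "--readonly", device]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print("+", shlex.join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Dry run by default: the commands are only printed.
    run(scrub_command("/home"))
    run(check_command("/dev/mapper/cry-home"))
```

A clean scrub followed by a clean check covers both the checksums and the tree contents. Neither one alone does.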

However I don't think it's something "btrfs check --readonly" would
report, but some strange behavior, maybe from LVM or cryptsetup.

Thanks,
Qu

> 
> Kisses,
> Gelma   
> 
> -
> ¹ commit:  8c9733fd9806c71e7f2313a280f98cb3051f93df
>   "Add linux-next specific files for 20181123"
> ² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/
> 



Linux-next regression?

2018-11-26 Thread Andrea Gelmini
Hi everybody,
   and thanks a lot for your work.

   I'm using BTRFS over LVM over cryptsetup, on a Samsung SSD 860 EVO (latest
git of btrfs-progs).
   I usually run development kernels, because I know BTRFS is young and there
are still lots of bugs and corner cases to fix.

   Anyway, I just want to pass along a (maybe) useful piece of info.

   Yesterday I compiled and booted the latest linux-next,¹ and I got this:

---
nov 26 01:18:22 glet kernel: Btrfs loaded, crc32c=crc32c-intel
nov 26 01:18:22 glet kernel: BTRFS: device label home devid 1 transid 32759 /dev/mapper/cry-home
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): force lzo compression, level 0
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): disk space caching is enabled
nov 26 01:18:23 glet kernel: BTRFS info (device dm-3): has skinny extents
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): bad tree block start, want 2152002191360 have 8829432654847901262
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): failed to read block groups: -5
nov 26 01:18:23 glet kernel: BTRFS error (device dm-3): open_ctree failed
---

   Now, rebooting with 4.19.0-041900 (downloaded from here)², or 4.20-rc4
(compiled on this machine), the problem disappears.

   Now, running scrub a few times, and copying data (all files of the logical
volume) to an external device, gives no complaints.

   Here I stop. This is my primary dev laptop, and at the moment I can't spend
time switching/rebooting/testing. I'm comparing the data with the last backup (I
rsync each hour), but it takes time (it's more than 3TB).

   So, this was just to let you know. Well, it's Ubuntu 18.10, and between
reboots there was no dist-upgrade or change in boot-related packages or systemd.

   One question: can I completely trust the OK return status of scrub? I know
it's made for this, but shit happens...

Kisses,
Gelma   

-
¹ commit:  8c9733fd9806c71e7f2313a280f98cb3051f93df
  "Add linux-next specific files for 20181123"
² http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19/