Re: csum failed root -9

2017-06-14 Thread Kai Krakow
Am Wed, 14 Jun 2017 15:39:50 +0200
schrieb Henk Slager :

> On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager 
> wrote:
> > On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow 
> > wrote:  
> >> Am Mon, 12 Jun 2017 11:00:31 +0200
> >> schrieb Henk Slager :
> >>  
>  [...]  
> >>
> >> There's btrfs-progs v4.11 available...  
> >
> > I started:
> > # btrfs check -p --readonly /dev/mapper/smr
> > but it stopped with printing 'Killed' while checking extents. The
> > board has 8G RAM, no swap (yet), so I just started lowmem mode:
> > # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
> >
> > Now after a 1 day 77 lines like this are printed:
> > ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
> > 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
> >
> > It is still running, hopefully it will finish within 2 days. But
> > lateron I can compile/use latest progs from git. Same for kernel,
> > maybe with some tweaks/patches, but I think I will also plug the
> > disk into a faster machine then ( i7-4770 instead of the J1900 ).
> >  
>  [...]  
> >>
> >> What looks strange to me is that the parameters of the error
> >> reports seem to be rotated by one... See below:
> >>  
>  [...]  
> >>
> >> Why does it say "ino 1"? Does it mean devid 1?  
> >
> > On a 3-disk btrfs raid1 fs I see in the journal also "read error
> > corrected: ino 1" lines for all 3 disks. This was with a 4.10.x
> > kernel, ATM I don't know if this is right or wrong.
> >  
>  [...]  
> >>
> >> And why does it say "root -9"? Shouldn't it be "failed -9 root 257
> >> ino 515567616"? In that case the "off" value would be completely
> >> missing...
> >>
> >> Those "rotations" may mess up with where you try to locate the
> >> error on disk...  
> >
> > I hadn't looked at the numbers like that, but as you indicate, I
> > also think that the 1-block csum fail location is bogus because the
> > kernel calculates that based on some random corruption in critical
> > btrfs structures, also looking at the 77 referencer count
> > mismatches. A negative root ID is already a sort of red flag. When
> > I can mount the fs again after the check is finished, I can
> > hopefully use the output of the check to get clearer how big the
> > 'damage' is.  
> 
> The btrfs lowmem mode check ends with:
> 
> ERROR: root 7331 EXTENT_DATA[928390 3506176] shouldn't be hole
> ERROR: errors found in fs roots
> found 6968612982784 bytes used, error(s) found
> total csum bytes: 6786376404
> total tree bytes: 25656016896
> total fs tree bytes: 14857535488
> total extent tree bytes: 3237216256
> btree space waste bytes: 3072362630
> file data blocks allocated: 38874881994752
>  referenced 36477629964288
> 
> In total 2000+ of those "shouldn't be hole" lines.
> 
> A non-lowmem check, now done with kernel 4.11.4 and progs v4.11 and
> 16G swap added ends with 'noerrors found'

Don't trust lowmem mode too much. The developer of lowmem mode may tell
you more about specific edge cases.

> W.r.t. holes, maybe it is woth to mention the super-flags:
> incompat_flags  0x369
> ( MIXED_BACKREF |
>   COMPRESS_LZO |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA |
>   NO_HOLES )

I think it's not worth to follow up on this holes topic: I guess it was
a false report of lowmem mode, and it was fixed with 4.11 btrfs progs.

> The fs has received snapshots from source fs that had NO_HOLES enabled
> for some time, but after registed this bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=121321
> I put back that NO_HOLES flag to zero on the source fs. It seems I
> forgot to do that on the 8TB target/backup fs. But I don't know if
> there is a relation between this flag flipping and the btrfs check
> error messages.
> 
> I think I leave it as is for the time being, unless there is some news
> how to fix things with low risk (or maybe via a temp overlay snapshot
> with DM). But the lowmem check took 2 days, that's not really fun.
> The goal for the 8TB fs is to have an up to 7 year snapshot history at
> sometime, now the oldest snapshot is from early 2014, so almost
> halfway :)

Btrfs is still much too unstable to trust 7 years worth of backup to
it. You will probably loose it at some point, especially while many
snapshots are still such a huge performance breaker in btrfs. I suggest
trying out also other alternatives like borg backup for such a project.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use btrfs_space_info_used instead of opencoding it

2017-06-14 Thread Nikolay Borisov


On 15.06.2017 04:17, kbuild test robot wrote:
> Hi Nikolay,
> 
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to btrfs/next kdave/for-next next-20170614]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Nikolay-Borisov/btrfs-Use-btrfs_space_info_used-instead-of-opencoding-it/20170615-084940
> config: x86_64-randconfig-x012-201724 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>fs/btrfs/extent-tree.c: In function 'btrfs_calc_reclaim_metadata_size':
>>> fs/btrfs/extent-tree.c:4918:9: error: implicit declaration of function 
>>> 'btrfs_space_info_used' [-Werror=implicit-function-declaration]
>  used = btrfs_space_info_used(space_info, true);
> ^
>cc1: some warnings being treated as errors

This patch builds cleanly on latest linux master and it's based off that
tree. I don't see why it's being applied to 4.9-rc8.


> 
> vim +/btrfs_space_info_used +4918 fs/btrfs/extent-tree.c
> 
>   4912
>   4913to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, 
> SZ_16M);
>   4914if (can_overcommit(root, space_info, to_reclaim,
>   4915   BTRFS_RESERVE_FLUSH_ALL))
>   4916return 0;
>   4917
>> 4918 used = btrfs_space_info_used(space_info, true);
>   4919
>   4920if (can_overcommit(root, space_info, SZ_1M, 
> BTRFS_RESERVE_FLUSH_ALL))
>   4921expected = 
> div_factor_fine(space_info->total_bytes, 95);
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-14 Thread Eryu Guan
On Wed, Jun 14, 2017 at 07:55:17AM -0400, Jeff Layton wrote:
> On Tue, 2017-06-13 at 16:40 +0800, Eryu Guan wrote:
> > On Mon, Jun 12, 2017 at 08:42:13AM -0400, Jeff Layton wrote:
> > > Make a new btrfs/999 test that works the way Chris Mason suggested:
> > > 
> > > Build a filesystem with 2 devices that stripes the data across
> > > both devices, but mirrors metadata across both. Then, make one
> > > of the devices fail and see how fsync is handled.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  tests/btrfs/999   | 93 
> > > +++
> > 
> > Missing btrfs/999.out file
> > 
> > >  tests/btrfs/group |  1 +
> > >  2 files changed, 94 insertions(+)
> > >  create mode 100755 tests/btrfs/999
> > > 
> > > diff --git a/tests/btrfs/999 b/tests/btrfs/999
> > > new file mode 100755
> > > index ..84031cc0d913
> > > --- /dev/null
> > > +++ b/tests/btrfs/999
> > > @@ -0,0 +1,93 @@
> > > +#! /bin/bash
> > > +# FS QA Test No. 999
> > > +#
> > > +# Open a file several times, write to it, fsync on all fds and make sure 
> > > that
> > > +# they all return 0. Change the device to start throwing errors. Write 
> > > again
> > > +# on all fds and fsync on all fds. Ensure that we get errors on all of 
> > > them.
> > > +# Then fsync on all one last time and verify that all return 0.
> > > +#
> > > +#---
> > > +# Copyright (c) 2017, Jeff Layton 
> > > +#
> > > +# This program is free software; you can redistribute it and/or
> > > +# modify it under the terms of the GNU General Public License as
> > > +# published by the Free Software Foundation.
> > > +#
> > > +# This program is distributed in the hope that it would be useful,
> > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > +# GNU General Public License for more details.
> > > +#
> > > +# You should have received a copy of the GNU General Public License
> > > +# along with this program; if not, write the Free Software Foundation,
> > > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > > +#---
> > > +
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +
> > > +here=`pwd`
> > > +tmp=/tmp/$$
> > > +status=1# failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > +cd /
> > > +rm -rf $tmp.* $testdir
> > > +_dmerror_cleanup
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +. ./common/dmerror
> > > +
> > > +# real QA test starts here
> > > +_supported_os Linux
> > > +_require_dm_target error
> > > +_require_test_program fsync-err
> > > +_require_test_program dmerror
> > > +
> > > +# bring up dmerror device
> > > +_scratch_unmount
> > > +_dmerror_init
> > > +
> > > +# Replace first device with error-test device
> > > +old_SCRATCH_DEV=$SCRATCH_DEV
> > > +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
> > > "s#$SCRATCH_DEV#$DMERROR_DEV#"`
> > > +SCRATCH_DEV=$DMERROR_DEV
> > > +
> > > +_require_scratch
> > > +_require_scratch_dev_pool
> > 
> > Need "_require_scratch_dev_pool_equal_size" too, since test creates
> > raid1 profile for metadata.

Sorry, it's not needed here. I got confused with device replace
operation, only "replace" needs this require rule. Thanks for
confirming!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use btrfs_space_info_used instead of opencoding it

2017-06-14 Thread kbuild test robot
Hi Nikolay,

[auto build test ERROR on v4.9-rc8]
[cannot apply to btrfs/next kdave/for-next next-20170614]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Nikolay-Borisov/btrfs-Use-btrfs_space_info_used-instead-of-opencoding-it/20170615-084940
config: x86_64-randconfig-x012-201724 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/extent-tree.c: In function 'btrfs_calc_reclaim_metadata_size':
>> fs/btrfs/extent-tree.c:4918:9: error: implicit declaration of function 
>> 'btrfs_space_info_used' [-Werror=implicit-function-declaration]
 used = btrfs_space_info_used(space_info, true);
^
   cc1: some warnings being treated as errors

vim +/btrfs_space_info_used +4918 fs/btrfs/extent-tree.c

  4912  
  4913  to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
  4914  if (can_overcommit(root, space_info, to_reclaim,
  4915 BTRFS_RESERVE_FLUSH_ALL))
  4916  return 0;
  4917  
> 4918  used = btrfs_space_info_used(space_info, true);
  4919  
  4920  if (can_overcommit(root, space_info, SZ_1M, 
BTRFS_RESERVE_FLUSH_ALL))
  4921  expected = div_factor_fine(space_info->total_bytes, 95);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH] btrfs: test incremental send after renaming a file and a directory

2017-06-14 Thread fdmanana
From: Filipe Manana 

Test that an incremental send works if we rename some directory inode A
and then rename some file inode B to the name inode A had, for the case
where the directory inode A is an ancestor of inode B in the parent
snapshot.

This issue is fixed by the following patch for the linux kernel:

  "Btrfs: incremental send, fix invalid path for unlink commands"

Signed-off-by: Filipe Manana 
---
 tests/btrfs/145 | 130 
 tests/btrfs/145.out |   6 +++
 tests/btrfs/group   |   1 +
 3 files changed, 137 insertions(+)
 create mode 100755 tests/btrfs/145
 create mode 100644 tests/btrfs/145.out

diff --git a/tests/btrfs/145 b/tests/btrfs/145
new file mode 100755
index ..ef03292e
--- /dev/null
+++ b/tests/btrfs/145
@@ -0,0 +1,130 @@
+#! /bin/bash
+# FS QA Test No. btrfs/145
+#
+# Test that an incremental send works if we rename some directory inode A and
+# then rename some file inode B to the name inode A had, for the case where the
+# directory inode A is an ancestor of inode B in the parent snapshot.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_fssum
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+mkdir $SCRATCH_MNT/dir1
+mkdir $SCRATCH_MNT/dir1/dir2
+touch $SCRATCH_MNT/dir1/dir2/file1
+touch $SCRATCH_MNT/dir1/dir2/file2
+touch $SCRATCH_MNT/dir1/dir2/file3
+mkdir $SCRATCH_MNT/dir1/dir3
+mv $SCRATCH_MNT/dir1/dir2/file2 $SCRATCH_MNT/dir1/dir3/file22
+mkdir $SCRATCH_MNT/dir1/dir3/dir4
+
+# Filesystem looks like:
+#
+# .  (ino 256)
+# |
+# |--- dir1/ (ino 257)
+#   |--- dir2/   (ino 258)
+#   | |--- file1 (ino 259)
+#   | |--- file3 (ino 261)
+#   |
+#   |--- dir3/   (ino 262)
+# |--- file22(ino 260)
+# |--- dir4/ (ino 263)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/mysnap1 > /dev/null
+
+$BTRFS_UTIL_PROG send -f $send_files_dir/1.snap \
+   $SCRATCH_MNT/mysnap1 2>&1 1>/dev/null | _filter_scratch
+
+mv $SCRATCH_MNT/dir1/dir2/file3 $SCRATCH_MNT/dir1/dir3/dir4/file33
+mv $SCRATCH_MNT/dir1/dir2/file1 $SCRATCH_MNT/dir1/dir3/dir4/file11
+mv $SCRATCH_MNT/dir1/dir3 $SCRATCH_MNT/dir1/file3
+mv $SCRATCH_MNT/dir1/file3/file22 $SCRATCH_MNT/dir1/dir3
+
+# Filesystem now looks like:
+#
+# .  (ino 256)
+# |
+# |--- dir1/ (ino 257)
+#   |--- dir2/   (ino 258)
+#   |--- dir3(ino 260)
+#   |--- file3/  (ino 262)
+# |--- dir4/ (ino 263)
+#   |--- file11  (ino 269)
+#   |--- file33  (ino 261)
+#
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+$SCRATCH_MNT/mysnap2 > /dev/null
+
+$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 -f $send_files_dir/2.snap \
+$SCRATCH_MNT/mysnap2 2>&1 1>/dev/null | _filter_scratch
+
+$FSSUM_PROG -A -f -w $send_files_dir/1.fssum $SCRATCH_MNT/mysnap1
+$FSSUM_PROG -A -f -w $send_files_dir/2.fssum \
+   -x $SCRATCH_MNT/mysnap2/mysnap1 $SCRATCH_MNT/mysnap2
+
+# Now recrea

[PATCH] Btrfs: incremental send, fix invalid path for unlink commands

2017-06-14 Thread fdmanana
From: Filipe Manana 

An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).

Consider the following example scenario where this issue happens.

Parent snapshot:

 .  (ino 256)
 |
 |--- dir1/ (ino 257)
   |--- dir2/   (ino 258)
   | |--- file1 (ino 259)
   | |--- file3 (ino 261)
   |
   |--- dir3/   (ino 262)
 |--- file22(ino 260)
 |--- dir4/ (ino 263)

Send snapshot:

 .  (ino 256)
 |
 |--- dir1/ (ino 257)
   |--- dir2/   (ino 258)
   |--- dir3(ino 260)
   |--- file3/  (ino 262)
 |--- dir4/ (ino 263)
   |--- file11  (ino 269)
   |--- file33  (ino 261)

When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:

 receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
 utimes
 utimes dir1
 utimes dir1/dir2
 link dir1/dir3/dir4/file11 -> dir1/dir2/file1
 unlink dir1/dir2/file1
 utimes dir1/dir2
 truncate dir1/dir3/dir4/file11 size=0
 utimes dir1/dir3/dir4/file11
 rename dir1/dir3 -> o262-7-0
 link dir1/dir3 -> o262-7-0/file22
 unlink dir1/dir3/file22
 ERROR: unlink dir1/dir3/file22 failed. Not a directory

The following steps happen during the computation of the incremental send
stream the lead to this issue:

1) Before we start processing the new and deleted references for inode
   260, we compute the full path of the deleted reference
   ("dir1/dir3/file22") and cache it in the list of deleted references
   for our inode.

2) We then start processing the new references for inode 260, for which
   there is only one new, located at "dir1/dir3". When processing this
   new reference, we check that inode 262, which was not yet processed,
   collides with the new reference and because of that we orphanize
   inode 262 so its new full path becomes "o262-7-0".

3) After the orphanization of inode 262, we create the new reference for
   inode 260 by issuing a link command with a target path of "dir1/dir3"
   and a source path of "o262-7-0/file22".

4) We then start processing the deleted references for inode 260, for
   which there is only one with the base name of "file22", and issue
   an unlink operation containing the target path computed at step 1,
   which is wrong because that path no longer exists and should be
   replaced with "o262-7-0/file22".

So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
---

Depends on a recent patch named:

 "Btrfs: send, fix invalid path after renaming and linking file"

 fs/btrfs/send.c | 65 +++--
 1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 381ec4d58edf..4d2622331b67 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2778,6 +2778,20 @@ struct recorded_ref {
int name_len;
 };
 
+static void set_ref_path(struct recorded_ref *ref,
+struct fs_path *path)
+{
+   ref->full_path = path;
+   ref->name = (char *)kbasename(ref->full_path->start);
+   ref->name_len = ref->full_path->end - ref->name;
+   ref->dir_path = ref->full_path->start;
+   if (ref->name == ref->full_path->start)
+   ref->dir_path_len = 0;
+   else
+   ref->dir_path_len = ref->full_path->end -
+   ref->full_path->start - 1 - ref->name_len;
+}
+
 /*
  * We need to process new refs before deleted refs, but compare_tree gives us
  * everything mixed. So we first record all refs and later process them.
@@ -2794,17 +2808,7 @@ static int __record_ref(struct list_head *head, u64 dir,
 
ref->dir = dir;
ref->dir_gen = dir_gen;
-   ref->full_path = path;
-
-   ref->name = (char *)kbasename(ref->full_path->start);
-   ref->name_len = ref->full_path->end - ref->name;
-   ref->dir_pat

Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-14 Thread Jeff Layton
On Tue, 2017-06-13 at 23:47 -0700, Christoph Hellwig wrote:
> On Tue, Jun 13, 2017 at 06:24:32AM -0400, Jeff Layton wrote:
> > That's definitely what I want for the endgame here. My plan was to add
> > this flag for now, and then eventually reverse it (or drop it) once all
> > or most filesystems are converted.
> > 
> > We can do it that way from the get-go if you like. It'll mean tossing in
> >  a patch add this flag to all filesystems that have an fsync operation
> > and that use the pagecache, and then gradually remove it from them as we
> > convert them.
> > 
> > Which method do you prefer?
> 
> Please do it from the get-go.  Or in fact figure out if we can get
> away without it entirely.  Moving the error reporting into ->fsync
> should help greatly with that, so what's missing after that?

In this smaller set, it's only really used for DAX. In the larger patch
series I have (which needs updating on top of this), there are other
things that key off of it:

sync_file_range: ->fsync isn't called directly there, and I think we
probably want similar semantics to fsync() for it

JBD2: will try to re-set the error after clearing it with
filemap_fdatawait. That's problematic with the new infrastructure so we
need some way to avoid it.

What I think I'll do for now is add a new FS_DAX_WB_ERRSEQ flag that
will go away by the end of the series. As the need arises for a similar
flag in other areas, I'll add them then.

The overall goal is not to need these flags. It may take a bit of time
to get there though.

Thanks for the review so far!
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: Simplify math in should_alloc chunk

2017-06-14 Thread jeffm
From: Nikolay Borisov 

Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to
signify the total amount of bytes in this space info. However, given
Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation
of num_allocated it becomes a lot more clear to just eliminate num_bytes
altogether and add the bytes_readonly to the amount of used space. That
way we don't change the results of the following statements. In the
process also start using btrfs_space_info_used.

Signed-off-by: Nikolay Borisov 
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d027807..6111f79 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4388,8 +4388,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
*fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
-   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + 
sinfo->bytes_pinned + sinfo->bytes_may_use;
+   u64 bytes_used = btrfs_space_info_used(sinfo, true);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
@@ -4401,7 +4400,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
*fs_info,
 * global_rsv, it doesn't change except when the transaction commits.
 */
if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   num_allocated += calc_global_rsv_need_space(global_rsv);
+   bytes_used += calc_global_rsv_need_space(global_rsv);
 
/*
 * in limited mode, we want to have some free space up to
@@ -4411,11 +4410,11 @@ static int should_alloc_chunk(struct btrfs_fs_info 
*fs_info,
thresh = btrfs_super_total_bytes(fs_info->super_copy);
thresh = max_t(u64, SZ_64M, div_factor_fine(thresh, 1));
 
-   if (num_bytes - num_allocated < thresh)
+   if (sinfo->total_bytes - bytes_used < thresh)
return 1;
}
 
-   if (num_allocated + SZ_2M < div_factor(num_bytes, 8))
+   if (bytes_used + SZ_2M < div_factor(sinfo->total_bytes, 8))
return 0;
return 1;
 }
-- 
1.8.5.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-14 Thread jeffm
From: Jeff Mahoney 

In a heavy write scenario, we can end up with a large number of pinned
bytes.  This can translate into (very) premature ENOSPC because pinned
bytes must be accounted for when allowing a reservation but aren't
accounted for when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb0b924..d027807 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
*fs_info,
 {
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
-   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
+   u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + 
sinfo->bytes_pinned + sinfo->bytes_may_use;
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
-- 
1.8.5.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: test for device flush-able should be after wait code

2017-06-14 Thread Anand Jain



On 06/14/2017 08:26 PM, David Sterba wrote:

On Wed, Jun 14, 2017 at 06:02:38PM +0800, Anand Jain wrote:

write_dev_flush() checks if the device is flush capable, however as
the device mode can change any time, this check should be after the
wait part of the code.


This would apply to code before "btrfs: wait part of the
write_dev_flush() can be separated out", right?


 Yes.


The problematic case is when a flush bio is sent, queue status changes
and waiting is skipped. A bio would leak. Next time the flushing is
enabled, write_dev_flush allocaes a new bio and either it would leak or
waiting will happen as expected.

The bio would leak only if the barriers are switched between write and
wait. Not impossible, but I still think a race hard to win. The
consequences are not absolutely fatal.


 Yeah more of a theoretical problem and hard to reproduce unless there
 is a deliberate attempt.


Also, the cleanups in write_dev_flush fix the bug in another way, so we
don't need this separate patch as a potential stable backport. The patch
mentioned above can be considered a fix but would need some manual
adaptations to apply. Therefore I don't think we need the $subj patch.


 Agreed. Kindly ignore this and the V2.2 of
[PATCH 3/3 v2.2] btrfs: wait part of the write_dev_flush()...
 patch as well.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root -9

2017-06-14 Thread Henk Slager
On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager  wrote:
> On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow  wrote:
>> Am Mon, 12 Jun 2017 11:00:31 +0200
>> schrieb Henk Slager :
>>
>>> Hi all,
>>>
>>> there is 1-block corruption a 8TB filesystem that showed up several
>>> months ago. The fs is almost exclusively a btrfs receive target and
>>> receives monthly sequential snapshots from two hosts but 1 received
>>> uuid. I do not know exactly when the corruption has happened but it
>>> must have been roughly 3 to 6 months ago. with monthly updated
>>> kernel+progs on that host.
>>>
>>> Some more history:
>>> - fs was created in november 2015 on top of luks
>>> - initially bcache between the 2048-sector aligned partition and luks.
>>> Some months ago I removed 'the bcache layer' by making sure that cache
>>> was clean and then zeroing 8K bytes at start of partition in an
>>> isolated situation. Then setting partion offset to 2064 by
>>> delete-recreate in gdisk.
>>> - in december 2016 there were more scrub errors, but related to the
>>> monthly snapshot of december2016. I have removed that snapshot this
>>> year and now only this 1-block csum error is the only issue.
>>> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
>>> includes some SMR related changes in the blocklayer this disk works
>>> fine with btrfs.
>>> - the smartctl values show no error so far but I will run an extended
>>> test this week after another btrfs check which did not show any error
>>> earlier with the csum fail being there
>>> - I have noticed that the board that has the disk attached has been
>>> rebooted due to power-failures many times (unreliable power switch and
>>> power dips from energy company) and the 150W powersupply is broken and
>>> replaced since then. Also due to this, I decided to remove bcache
>>> (which has been in write-through and write-around only).
>>>
>>> Some btrfs inpect-internal exercise shows that the problem is in a
>>> directory in the root that contains most of the data and snapshots.
>>> But an  rsync -c  with an identical other clone snapshot shows no
>>> difference (no writes to an rw snapshot of that clone). So the fs is
>>> still OK as file-level backup, but btrfs replace/balance will fatal
>>> error on just this 1 csum error. It looks like that this is not a
>>> media/disk error but some HW induced error or SW/kernel issue.
>>> Relevant btrfs commands + dmesg info, see below.
>>>
>>> Any comments on how to fix or handle this without incrementally
>>> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
>>> fail)?
>>>
>>>
>>> # uname -r
>>> 4.11.3-1-default
>>> # btrfs --version
>>> btrfs-progs v4.10.2+20170406
>>
>> There's btrfs-progs v4.11 available...
>
> I started:
> # btrfs check -p --readonly /dev/mapper/smr
> but it stopped with printing 'Killed' while checking extents. The
> board has 8G RAM, no swap (yet), so I just started lowmem mode:
> # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
>
> Now after a 1 day 77 lines like this are printed:
> ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
> 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
>
> It is still running, hopefully it will finish within 2 days. But
> lateron I can compile/use latest progs from git. Same for kernel,
> maybe with some tweaks/patches, but I think I will also plug the disk
> into a faster machine then ( i7-4770 instead of the J1900 ).
>
>>> fs profile is dup for system+meta, single for data
>>>
>>> # btrfs scrub start /local/smr
>>
>> What looks strange to me is that the parameters of the error reports
>> seem to be rotated by one... See below:
>>
>>> [27609.626555] BTRFS error (device dm-0): parent transid verify failed
>>> on 6350718500864 wanted 23170 found 23076
>>> [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
>>> [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
>>> [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
>>> [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
>>> [37663.606455] BTRFS error (device dm-0): parent transid verify failed
>>> on 6350453751808 wanted 23170 found 23075
>>> [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
>>> [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
>>> [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 635045376 (dev /dev/mapper/smr sector 11679647024)
>>> [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
>>> off 6350453764096 (dev /dev/mapper/smr sector 11679647032)

Re: [PATCH] btrfs: test for device flush-able should be after wait code

2017-06-14 Thread David Sterba
On Wed, Jun 14, 2017 at 06:02:38PM +0800, Anand Jain wrote:
> write_dev_flush() checks if the device is flush capable, however as
> the device mode can change any time, this check should be after the
> wait part of the code.

This would apply to code before "btrfs: wait part of the
write_dev_flush() can be separated out", right?

The problematic case is when a flush bio is sent, queue status changes
and waiting is skipped. A bio would leak. Next time the flushing is
enabled, write_dev_flush allocaes a new bio and either it would leak or
waiting will happen as expected.

The bio would leak only if the barriers are switched between write and
wait. Not impossible, but I still think a race hard to win. The
consequences are not absolutely fatal.

Also, the cleanups in write_dev_flush fix the bug in another way, so we
don't need this separate patch as a potential stable backport. The patch
mentioned above can be considered a fix but would need some manual
adaptations to apply. Therefore I don't think we need the $subj patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-14 Thread Jeff Layton
On Tue, 2017-06-13 at 16:40 +0800, Eryu Guan wrote:
> On Mon, Jun 12, 2017 at 08:42:13AM -0400, Jeff Layton wrote:
> > Make a new btrfs/999 test that works the way Chris Mason suggested:
> > 
> > Build a filesystem with 2 devices that stripes the data across
> > both devices, but mirrors metadata across both. Then, make one
> > of the devices fail and see how fsync is handled.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  tests/btrfs/999   | 93 
> > +++
> 
> Missing btrfs/999.out file
> 
> >  tests/btrfs/group |  1 +
> >  2 files changed, 94 insertions(+)
> >  create mode 100755 tests/btrfs/999
> > 
> > diff --git a/tests/btrfs/999 b/tests/btrfs/999
> > new file mode 100755
> > index ..84031cc0d913
> > --- /dev/null
> > +++ b/tests/btrfs/999
> > @@ -0,0 +1,93 @@
> > +#! /bin/bash
> > +# FS QA Test No. 999
> > +#
> > +# Open a file several times, write to it, fsync on all fds and make sure 
> > that
> > +# they all return 0. Change the device to start throwing errors. Write 
> > again
> > +# on all fds and fsync on all fds. Ensure that we get errors on all of 
> > them.
> > +# Then fsync on all one last time and verify that all return 0.
> > +#
> > +#---
> > +# Copyright (c) 2017, Jeff Layton 
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +rm -rf $tmp.* $testdir
> > +_dmerror_cleanup
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/dmerror
> > +
> > +# real QA test starts here
> > +_supported_os Linux
> > +_require_dm_target error
> > +_require_test_program fsync-err
> > +_require_test_program dmerror
> > +
> > +# bring up dmerror device
> > +_scratch_unmount
> > +_dmerror_init
> > +
> > +# Replace first device with error-test device
> > +old_SCRATCH_DEV=$SCRATCH_DEV
> > +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
> > "s#$SCRATCH_DEV#$DMERROR_DEV#"`
> > +SCRATCH_DEV=$DMERROR_DEV
> > +
> > +_require_scratch
> > +_require_scratch_dev_pool
> 
> Need "_require_scratch_dev_pool_equal_size" too, since test creates
> raid1 profile for metadata.
> 
> Thanks,
> Eryu
> 

Is this really needed?

I've been running this test on btrfs with devices that are not of equal
size, and it seems to work just fine. The test doesn't write a lot of
data (just a few megs at most), so I don't think we'll run out of space
unless you have some really small devices in there.

> > +
> > +rm -f $seqres.full
> > +
> > +echo "Format and mount"
> > +
> > +_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
> > +_scratch_mount
> > +
> > +# How much do we need to write? We need to hit all of the stripes. btrfs 
> > uses
> > +# a fixed 64k stripesize, so write enough to hit each one
> > +number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
> > +write_kb=$(($number_of_devices * 64))
> > +_require_fs_space $SCRATCH_MNT $write_kb
> > +
> > +testfile=$SCRATCH_MNT/fsync-err-test
> > +
> > +SCRATCH_DEV=$old_SCRATCH_DEV
> > +$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
> > +
> > +# success, all done
> > +_dmerror_load_working_table
> > +
> > +# fs may be corrupt after this -- attempt to repair it
> > +_repair_scratch_fs >> $seqres.full
> > +
> > +# remove dmerror device
> > +_dmerror_cleanup
> > +
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/group b/tests/btrfs/group
> > index 6f19619e877c..8dbdfbfe29fd 100644
> > --- a/tests/btrfs/group
> > +++ b/tests/btrfs/group
> > @@ -145,3 +145,4 @@
> >  141 auto quick
> >  142 auto quick
> >  143 auto quick
> > +999 auto quick
> > -- 
> > 2.13.0
> > 

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3 v2.2] btrfs: wait part of the write_dev_flush() can be separated out

2017-06-14 Thread Anand Jain
Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.

Signed-off-by: Anand Jain 
---
v2.2: Make it conflict free patch on top of
   [PATCH] btrfs: test for device flush-able should be after wait code
  There is no other changes. Sorry for another version. Thxs.
v2.1: Part of commit log got missed out during send. fix it.
v2: . Add the removal of submit_flush_error in check_barrier_error()
from 2/2 before to 1/3 here, as its appropriate to be here.
. Did not split the write and wait part here in this patch.

 fs/btrfs/disk-io.c | 60 --
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 42bb674028f6..9b9375469134 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3488,39 +3488,17 @@ static void btrfs_end_empty_barrier(struct bio *bio)
 }
 
 /*
- * trigger flushes for one the devices.  If you pass wait == 0, the flushes are
- * sent down.  With wait == 1, it waits for the previous flush.
- *
- * any device where the flush fails with eopnotsupp are flagged as not-barrier
- * capable
+ * Submit a flush request to the device if it supports it. Error handling is
+ * done in the waiting counterpart.
  */
-static int write_dev_flush(struct btrfs_device *device, int wait)
+static void write_dev_flush(struct btrfs_device *device)
 {
struct request_queue *devq;
struct bio *bio;
-   int ret = 0;
-
-   if (wait) {
-   bio = device->flush_bio;
-
-   wait_for_completion(&device->flush_wait);
-
-   if (bio->bi_error) {
-   ret = bio->bi_error;
-   btrfs_dev_stat_inc_and_print(device,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
-   }
-
-   /* drop the reference from the wait == 0 run */
-   bio_put(bio);
-   device->flush_bio = NULL;
-
-   return ret;
-   }
 
devq = bdev_get_queue(device->bdev);
if (!test_bit(QUEUE_FLAG_WC, &devq->queue_flags))
-   return 0;
+   return;
 
/*
 * one reference for us, and we leave it for the
@@ -3537,8 +3515,32 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
 
bio_get(bio);
btrfsic_submit_bio(bio);
+}
 
-   return 0;
+/*
+ * If the flush bio has been submitted by write_dev_flush, wait for it.
+ */
+static int wait_dev_flush(struct btrfs_device *device)
+{
+   int ret = 0;
+   struct bio *bio = device->flush_bio;
+
+   if (!bio)
+   return 0;
+
+   wait_for_completion(&device->flush_wait);
+
+   if (bio->bi_error) {
+   ret = bio->bi_error;
+   btrfs_dev_stat_inc_and_print(device,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   }
+
+   /* drop the reference from the wait == 0 run */
+   bio_put(bio);
+   device->flush_bio = NULL;
+
+   return ret;
 }
 
 static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
@@ -3579,7 +3581,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   write_dev_flush(dev, 0);
+   write_dev_flush(dev);
dev->last_flush_error = 0;
}
 
@@ -3594,7 +3596,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 1);
+   ret = wait_dev_flush(dev);
if (ret) {
dev->last_flush_error = ret;
errors_wait++;
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: test for device flush-able should be after wait code

2017-06-14 Thread Anand Jain
write_dev_flush() checks if the device is flush capable, however as
the device mode can change any time, this check should be after the
wait part of the code.

Signed-off-by: Anand Jain 
 Fixes: c2a9c7ab475b btrfs: check if the device is flush capable
---
 fs/btrfs/disk-io.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9f2ffe2c6afb..603a7e32e708 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3496,13 +3496,10 @@ static void btrfs_end_empty_barrier(struct bio *bio)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   struct request_queue *q = bdev_get_queue(device->bdev);
+   struct request_queue *devq;
struct bio *bio;
int ret = 0;
 
-   if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
-   return 0;
-
if (wait) {
bio = device->flush_bio;
if (!bio)
@@ -3527,6 +3524,10 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return ret;
}
 
+   devq = bdev_get_queue(device->bdev);
+   if (!test_bit(QUEUE_FLAG_WC, &devq->queue_flags))
+   return 0;
+
/*
 * one reference for us, and we leave it for the
 * caller
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: regression in 4.12 caused by a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks")

2017-06-14 Thread Filipe Manana
On Wed, Jun 14, 2017 at 8:24 AM, Nikolay Borisov  wrote:
> Hello Filipe,
>
> It seems the above commit causes  various underflows in
> some btrfs internal counters. This is evident during generic/439 running:

Yes, it's evident - I wrote that test recently to fix a bug that
existed for a long time, which that commit exposes more easily. The
fix is at:

https://patchwork.kernel.org/patch/9760695/

If you read the changelog for the fstests commit, you'll see that it
refers to that patch.

>
> The first one is in BTRFS_I(inode)->csum_bytes
>
> [   16.710508] [ cut here ]
> [   16.710524] WARNING: CPU: 3 PID: 1080 at fs/btrfs/inode.c:9449 
> btrfs_destroy_inode+0x275/0x2c0
> [   16.710525] Modules linked in:
> [   16.710530] CPU: 3 PID: 1080 Comm: umount Not tainted 4.12.0-rc5-nbor #317
> [   16.710531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   16.710533] task: 8801309bd4c0 task.stack: 880132ff4000
> [   16.710535] RIP: 0010:btrfs_destroy_inode+0x275/0x2c0
> [   16.710537] RSP: 0018:880132ff7d08 EFLAGS: 00010286
> [   16.710540] RAX:  RBX: 8801369ee000 RCX: 
> 
> [   16.710541] RDX:  RSI:  RDI: 
> 880131f21aa0
> [   16.710542] RBP: 880132ff7d38 R08:  R09: 
> 
> [   16.710543] R10: 0001 R11: 80b489d8 R12: 
> 880131f21aa0
> [   16.710544] R13: 88012ff6 R14: 8801396c57a8 R15: 
> 8801396c57a8
> [   16.710546] FS:  7f4ca857a840() GS:88013fcc() 
> knlGS:
> [   16.710548] CS:  0010 DS:  ES:  CR0: 80050033
> [   16.710550] CR2: 7f4ca7da0170 CR3: 0001347fe000 CR4: 
> 06a0
> [   16.710553] Call Trace:
> [   16.710562]  destroy_inode+0x3b/0x60
> [   16.710565]  evict+0x135/0x1c0
> [   16.710567]  dispose_list+0x4d/0x80
> [   16.710570]  evict_inodes+0x16d/0x180
> [   16.710588]  generic_shutdown_super+0x3f/0x120
> [   16.710590]  kill_anon_super+0x12/0x20
> [   16.710597]  btrfs_kill_super+0x16/0xa0
> [   16.710599]  deactivate_locked_super+0x4c/0x80
> [   16.710602]  deactivate_super+0x45/0x60
> [   16.710604]  cleanup_mnt+0x3f/0x70
> [   16.710605]  __cleanup_mnt+0x12/0x20
> [   16.710609]  task_work_run+0x7e/0xc0
> [   16.710614]  exit_to_usermode_loop+0x6c/0x93
> [   16.710618]  syscall_return_slowpath+0x9f/0xd0
> [   16.710625]  entry_SYSCALL_64_fastpath+0xc0/0xc2
>
> What's peculiar about this one is that it always underflows by a single page 
> size.
> When the warning triggers csum_bytes is -4096. Perhaps some off-by-one which 
> is then
> multiplied by PAGE_SIZE? I've investigated it a bit further and here are the 
> 2 types of
> stack traces that lead to freeing csum_bytes:
>
> ==
> Freeing 4096 bytes for ino 257
>  0x813d0042 : calc_csum_metadata_size+0x62/0x100 [kernel]
>  0x813d94b1 : btrfs_delalloc_release_metadata+0x1a1/0x210 [kernel]
>  0x813f8d66 : btrfs_clear_bit_hook+0xd6/0x410 [kernel]
>  0x814182ce : clear_state_bit+0x6e/0x180 [kernel] clear_state_bit at 
> fs/btrfs/extent_io.c:549
>  0x81418517 : __clear_extent_bit+0x137/0x370 [kernel] 
> fs/btrfs/extent_io.c:726
>  0x81418c2a : clear_extent_bit+0x2a/0x30 [kernel]
>  0x8140b565 : lock_and_cleanup_extent_if_need+0x1e5/0x370 [kernel] 
> fs/btrfs/file.c:1514
>  0x8140c9b4 : __btrfs_buffered_write+0x204/0x6e0 [kernel]
>  0x814104cc : btrfs_file_write_iter+0x19c/0x500 [kernel]
>  0x811d19c5 : __vfs_write+0xc5/0x140 [kernel]
>  0x811d39bc : vfs_write+0xcc/0x1a0 [kernel]
>  0x811d3e65 : sys_pwrite64+0x85/0xa0 [kernel]
>  0x8178ca85 : entry_SYSCALL_64_fastpath+0x23/0xc2 [kernel]
>  0x0 (inexact)
>
> ==
> Freeing 155648 bytes for ino 257
>  UNDERFLOW !!!
>  0x813d0042 : calc_csum_metadata_size+0x62/0x100 [kernel]
>  0x813d94b1 : btrfs_delalloc_release_metadata+0x1a1/0x210 [kernel]
>  0x8140103d : btrfs_finish_ordered_io+0x16d/0x6a0 [kernel] 
> fs/btrfs/inode.c:3038
>  0x81401585 : finish_ordered_fn+0x15/0x20 [kernel]
>  0x8142d083 : normal_work_helper+0x93/0x650 [kernel]
>  0x8142d832 : btrfs_endio_write_helper+0x12/0x20 [kernel]
>  0x8107259b : process_one_work+0x1cb/0x5f0 [kernel]
>  0x81072a4d : worker_thread+0x4d/0x380 [kernel]
>  0x81079ce4 : kthread+0x114/0x150 [kernel]
>  0x8178cd1a : ret_from_fork+0x2a/0x40 [kernel]
>  0x0 (inexact)
>
> The underflow always happens from finish_ordered_io.
>
> The next one is:
>
> [   55.007633] [ cut here ]
> [   55.007652] WARNING: CPU: 1 PID: 1078 at fs/btrfs/extent-tree.c:5709 
> btrfs_free_block_groups+0x365/0x3f0
> [   55.007654] Modules linked in:
> [   55.007660] CPU: 1 PID: 1078 Comm: umount Tainted: GW   
> 4.12.0-rc5-nbor #318

Re: [PATCH 1/3 v2.1] btrfs: write_dev_flush does not return ENOMEM anymore

2017-06-14 Thread Anand Jain



All 3 reviewed and queued, thanks. 


 Thanks indeed.


I have a prototype for the
preallocated flush bio but was waiting until these cleanups are in
before snding it.


 oh. sorry for the delay if any. -ENOMEM was only purpose I had/have
 in mind for peralloc, but looks like there is another purpose which
 I am not aware of.


I've noticed that threre are two more "if (!device->bdev)" checks under
the device lock in write_all_supers, that might be worth removing.



However, a NULL bdev and device->missing are related and I think there
are some dark corners in dev replace where the invariant can be
temporarily broken.


 Ok. Thanks for the hints. Trying to dig.


Given that we probably don't have great testing
coverage of devices with the flush capabilities,


 Any hints on what type of test coverage ?

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Use btrfs_space_info_used instead of opencoding it

2017-06-14 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/extent-tree.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c0d3980fe3f..a08a743a8e09 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4935,9 +4935,8 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
   BTRFS_RESERVE_FLUSH_ALL))
return 0;
 
-   used = space_info->bytes_used + space_info->bytes_reserved +
-  space_info->bytes_pinned + space_info->bytes_readonly +
-  space_info->bytes_may_use;
+   used = btrfs_space_info_used(space_info, true);
+
if (can_overcommit(root, space_info, SZ_1M, BTRFS_RESERVE_FLUSH_ALL))
expected = div_factor_fine(space_info->total_bytes, 95);
else
@@ -5378,9 +5377,7 @@ static void space_info_add_old_bytes(struct btrfs_fs_info 
*fs_info,
 * overcommit, and if we can't then we just need to free up our space
 * and not satisfy any requests.
 */
-   used = space_info->bytes_used + space_info->bytes_reserved +
-   space_info->bytes_pinned + space_info->bytes_readonly +
-   space_info->bytes_may_use;
+   used = btrfs_space_info_used(space_info, true);
if (used - num_bytes >= space_info->total_bytes)
check_overcommit = true;
 again:
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


regression in 4.12 caused by a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks")

2017-06-14 Thread Nikolay Borisov
Hello Filipe, 

It seems the above commit causes  various underflows in 
some btrfs internal counters. This is evident during generic/439 running: 

The first one is in BTRFS_I(inode)->csum_bytes

[   16.710508] [ cut here ]
[   16.710524] WARNING: CPU: 3 PID: 1080 at fs/btrfs/inode.c:9449 
btrfs_destroy_inode+0x275/0x2c0
[   16.710525] Modules linked in:
[   16.710530] CPU: 3 PID: 1080 Comm: umount Not tainted 4.12.0-rc5-nbor #317
[   16.710531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   16.710533] task: 8801309bd4c0 task.stack: 880132ff4000
[   16.710535] RIP: 0010:btrfs_destroy_inode+0x275/0x2c0
[   16.710537] RSP: 0018:880132ff7d08 EFLAGS: 00010286
[   16.710540] RAX:  RBX: 8801369ee000 RCX: 
[   16.710541] RDX:  RSI:  RDI: 880131f21aa0
[   16.710542] RBP: 880132ff7d38 R08:  R09: 
[   16.710543] R10: 0001 R11: 80b489d8 R12: 880131f21aa0
[   16.710544] R13: 88012ff6 R14: 8801396c57a8 R15: 8801396c57a8
[   16.710546] FS:  7f4ca857a840() GS:88013fcc() 
knlGS:
[   16.710548] CS:  0010 DS:  ES:  CR0: 80050033
[   16.710550] CR2: 7f4ca7da0170 CR3: 0001347fe000 CR4: 06a0
[   16.710553] Call Trace:
[   16.710562]  destroy_inode+0x3b/0x60
[   16.710565]  evict+0x135/0x1c0
[   16.710567]  dispose_list+0x4d/0x80
[   16.710570]  evict_inodes+0x16d/0x180
[   16.710588]  generic_shutdown_super+0x3f/0x120
[   16.710590]  kill_anon_super+0x12/0x20
[   16.710597]  btrfs_kill_super+0x16/0xa0
[   16.710599]  deactivate_locked_super+0x4c/0x80
[   16.710602]  deactivate_super+0x45/0x60
[   16.710604]  cleanup_mnt+0x3f/0x70
[   16.710605]  __cleanup_mnt+0x12/0x20
[   16.710609]  task_work_run+0x7e/0xc0
[   16.710614]  exit_to_usermode_loop+0x6c/0x93
[   16.710618]  syscall_return_slowpath+0x9f/0xd0
[   16.710625]  entry_SYSCALL_64_fastpath+0xc0/0xc2

What's peculiar about this one is that it always underflows by a single page 
size. 
When the warning triggers csum_bytes is -4096. Perhaps some off-by-one which is 
then 
multiplied by PAGE_SIZE? I've investigated it a bit further and here are the 2 
types of
stack traces that lead to freeing csum_bytes: 

==
Freeing 4096 bytes for ino 257
 0x813d0042 : calc_csum_metadata_size+0x62/0x100 [kernel]
 0x813d94b1 : btrfs_delalloc_release_metadata+0x1a1/0x210 [kernel]
 0x813f8d66 : btrfs_clear_bit_hook+0xd6/0x410 [kernel]
 0x814182ce : clear_state_bit+0x6e/0x180 [kernel] clear_state_bit at 
fs/btrfs/extent_io.c:549
 0x81418517 : __clear_extent_bit+0x137/0x370 [kernel] 
fs/btrfs/extent_io.c:726
 0x81418c2a : clear_extent_bit+0x2a/0x30 [kernel]
 0x8140b565 : lock_and_cleanup_extent_if_need+0x1e5/0x370 [kernel] 
fs/btrfs/file.c:1514
 0x8140c9b4 : __btrfs_buffered_write+0x204/0x6e0 [kernel]
 0x814104cc : btrfs_file_write_iter+0x19c/0x500 [kernel]
 0x811d19c5 : __vfs_write+0xc5/0x140 [kernel]
 0x811d39bc : vfs_write+0xcc/0x1a0 [kernel]
 0x811d3e65 : sys_pwrite64+0x85/0xa0 [kernel]
 0x8178ca85 : entry_SYSCALL_64_fastpath+0x23/0xc2 [kernel]
 0x0 (inexact)

==
Freeing 155648 bytes for ino 257
 UNDERFLOW !!!
 0x813d0042 : calc_csum_metadata_size+0x62/0x100 [kernel]
 0x813d94b1 : btrfs_delalloc_release_metadata+0x1a1/0x210 [kernel]
 0x8140103d : btrfs_finish_ordered_io+0x16d/0x6a0 [kernel] 
fs/btrfs/inode.c:3038
 0x81401585 : finish_ordered_fn+0x15/0x20 [kernel]
 0x8142d083 : normal_work_helper+0x93/0x650 [kernel]
 0x8142d832 : btrfs_endio_write_helper+0x12/0x20 [kernel]
 0x8107259b : process_one_work+0x1cb/0x5f0 [kernel]
 0x81072a4d : worker_thread+0x4d/0x380 [kernel]
 0x81079ce4 : kthread+0x114/0x150 [kernel]
 0x8178cd1a : ret_from_fork+0x2a/0x40 [kernel]
 0x0 (inexact)

The underflow always happens from finish_ordered_io. 

The next one is: 

[   55.007633] [ cut here ]
[   55.007652] WARNING: CPU: 1 PID: 1078 at fs/btrfs/extent-tree.c:5709 
btrfs_free_block_groups+0x365/0x3f0
[   55.007654] Modules linked in:
[   55.007660] CPU: 1 PID: 1078 Comm: umount Tainted: GW   
4.12.0-rc5-nbor #318
[   55.007662] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   55.007663] task: 8801329eb880 task.stack: 880130c1
[   55.007666] RIP: 0010:btrfs_free_block_groups+0x365/0x3f0
[   55.007668] RSP: 0018:880130c13d68 EFLAGS: 00010286
[   55.007672] RAX: 0001 RBX:  RCX: 880130c13cbc
[   55.007673] RDX: 0002 RSI: 0002 RDI: 880136bf5000
[   55.007675] RBP: 880130c13d98 R08:  R09: