Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Robert Krig


On 04.04.2017 18:55, Chris Murphy wrote:
> On Tue, Apr 4, 2017 at 10:52 AM, Chris Murphy  wrote:
>
>
>> Mounting -o ro,degraded is probably permitted by the file system, but
>> chunks of the file system and certainly your data, will be missing. So
>> it's just a matter of time before copying data off will fail.
> ** Context here is, more than 1 device missing.
>

Thanks you guys for all your help and input.

I've ordered two new drives to backup all my data. I have a cloud backup
in place, but 13TB takes a while to upload :-)
I think I'm gonna abandon btrfs as the main fs for my home server. I'm
just gonna set up a separate LVM volume for storing snapshots and
backups, since I use btrfs on all my single disk machines.
Thanks again everyone.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4 V3] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-04-04 Thread Anand Jain
The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.

By doing this it helps to further develop patches which would tune
the error-actions as needed.

Signed-off-by: Anand Jain 
---
v2: Address Qu review comments viz..
 Add meaningful names, like cp_list (for checkpoint_list head).
 (And actually it does not need a new struct type just to hold
  the head pointer, list node is already named as device_checkpoint).
 Check return value of add_device_checkpoint()
 Check if the device is already added at add_device_checkpoint()
 Rename fini_devices_checkpoint() to rel_devices_checkpoint()

v3:
   (resent with the correct version of the patch).
   Now the flush error return is saved and checked instead of the
   checkpoint of the dev_stat method earlier.

 fs/btrfs/disk-io.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 420753d37e1a..3c476b118440 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3538,6 +3538,23 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }
 
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) {
+   if (!dev->bdev || dev->last_flush_error)
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3575,8 +3592,19 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (write_dev_flush(dev, 1))
dropouts++;
}
-   if (dropouts > info->num_tolerated_disk_barrier_failures)
-   return -EIO;
+
+   /*
+* A slight optimization, we check for dropouts here which avoids
+* a dev list loop when disks are healthy.
+*/
+   if (dropouts) {
+   /*
+* As we need holistic view of the failed disks, so
+* error checking is pushed to a separate loop.
+*/
+   return check_barrier_error(info->fs_devices);
+   }
+
return 0;
 }
 
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: delete unused member nobarriers

2017-04-04 Thread Anand Jain
The last consumer of nobarriers is removed by the commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, now even when write cache is
write through it just return success without actually transpiring such
a request to the block device/lun.

[1]
commit b25de9d6da49b1a8760a89672283128aa8c78345
block: remove BIO_EOPNOTSUPP

And, as the device/lun write cache state may change dynamically saving
such as state won't help either. So deleting the member nobarriers.

Signed-off-by: Anand Jain 
---
 v2: We need this patch. Update the commit log.

 fs/btrfs/disk-io.c | 3 ---
 fs/btrfs/volumes.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c476b118440..b6d047250ce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3517,9 +3517,6 @@ static void btrfs_dev_issue_flush(struct work_struct 
*work)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   if (device->nobarriers)
-   return 0;
-
if (wait) {
int ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0df50bc65578..1168b78c5f1d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,7 +123,6 @@ struct btrfs_device {
struct list_head resized_list;
 
/* for sending down flush barriers */
-   int nobarriers;
struct completion flush_wait;
struct work_struct flush_work;
int last_flush_error;
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: check if the device is flush capable

2017-04-04 Thread Anand Jain
The blkdev_issue_flush() will check if the write cache is enabled
before submitting the flush. This will add a code to fail fast if
its not.

Signed-off-by: Anand Jain 
---
v1:
  This patch will replace
  [PATCH] btrfs: delete unused member nobarriers

v2:
  - This patch will now _not_ replace the below patch and this patch
  should be applied on top of it.
[PATCH] btrfs: delete unused member nobarriers
  That's because, Q wc flag is not static, so we can't save its
  state into nobarries.
  - commit log is updated.

 fs/btrfs/disk-io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b6d047250ce2..bb5816c7f5c4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3517,6 +3517,11 @@ static void btrfs_dev_issue_flush(struct work_struct 
*work)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
+   struct request_queue *q = bdev_get_queue(device->bdev);
+
+   if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+   return 0;
+
if (wait) {
int ret;
 
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-04 Thread Qu Wenruo
As long as we don't modify the on-disk data, fiemap result should always
be constant.

Operation like cycle mount and sleep should not affect fiemap result.
While unfortunately, btrfs doesn't follow that behavior.

Btrfs fiemap sometimes return merged result, while after cycle mount, it
returns split result. Furthermore after a snap, btrfs returns merged
result again.

Signed-off-by: Qu Wenruo 
---
v2:
  Add require_fiemap
  Remove IRIX support
  Fix whilespace issues
  Remove _scratch_mkfs_sized, as the needed space is so small that no sane
  tester would put a so small fs as SCRATCH_DEV.
---
 tests/generic/422 | 127 ++
 tests/generic/422.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 130 insertions(+)
 create mode 100755 tests/generic/422
 create mode 100644 tests/generic/422.out

diff --git a/tests/generic/422 b/tests/generic/422
new file mode 100755
index 000..f0a990f
--- /dev/null
+++ b/tests/generic/422
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test 422
+#
+# Test if a file system returns constant fiemap result after remount and
+# fiemap.
+# Unfortunately, btrfs doesn't follow this behavior.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_fiemap
+
+block_size=$((64 * 1024))
+block_count=32
+dst=$SCRATCH_MNT/file
+sleeptime=3
+
+# It's almost 100% for btrfs to trigger inconstant fiemap result
+# just in case
+loop_counts=$((2 * $LOAD_FACTOR))
+
+# record fiemap as checkpoint, and output the hash of fiemap result
+# to stdout
+fiemap_checkpoint()
+{
+   local number=$1
+   local message=$2
+
+   echo "=== $message ===" >> $seqres.full
+   $XFS_IO_PROG -c "fiemap -v" $dst > ${tmp}.cp$number
+   cat ${tmp}.cp${number} >> $seqres.full
+
+   md5sum ${tmp}.cp${number} | cut -d ' ' -f 1
+}
+
+do_test()
+{
+   local number=$1
+
+   # Use 16 times of file size to ensure we have enough space
+   _scratch_mkfs > /dev/null 2>&1
+   _scratch_mount
+
+   echo "== Loop $number ==" >> $seqres.full
+   touch $dst
+   # Xfsprogs 4.9.0 still has a bug that xfs_io "open" with O_SYNC command
+   # doesn't work well with "pwrite", although it gets fixed in v4.10.0,
+   # use dd here to avoid it, non-xfs developers will be happy with it.
+   dd if=/dev/zero of=$dst bs=$block_size count=$block_count oflag=dsync \
+   status=none 2>&1
+
+   hash1=$(fiemap_checkpoint 1 "Fiemap just after dsync write")
+
+   # Sleep should not modify fiemap result
+   sleep $sleeptime
+
+   hash2=$(fiemap_checkpoint 2 "Fiemap after dsync write and sleep")
+
+   # cycle mount should not modify fiemap result
+   _scratch_cycle_mount
+
+   hash3=$(fiemap_checkpoint 3 "Fiemap after cycle mount")
+
+   # Sleep should not modify fiemap result
+   sleep $sleeptime
+
+   hash4=$(fiemap_checkpoint 4 "Fiemap after cycle mount and sleep")
+
+   _scratch_unmount
+
+   if [ $hash1 != $hash2 -o $hash2 != $hash3 -o $hash3 != $hash4 ]; then
+   echo "Inconstant fiemap result detected"
+   fi
+   echo >> $seqres.full
+}
+
+for i in $(seq 1 $loop_counts); do
+   do_test $i
+done
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/generic/422.out b/tests/generic/422.out
new file mode 100644
index 000..f70693f
--- /dev/null
+++ b/tests/generic/422.out
@@ -0,0 +1,2 @@
+QA output created by 422
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 3c7c5e4..86d2325 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -424,3 +424,4 @@
 419 auto quick encrypt
 420 auto quick pu

Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-04 Thread Qu Wenruo



At 04/05/2017 10:35 AM, Eryu Guan wrote:

On Mon, Apr 03, 2017 at 03:09:23PM +0800, Qu Wenruo wrote:

As long as we don't modify the on-disk data, fiemap result should always
be constant.

Operation like cycle mount and sleep should not affect fiemap result.
While unfortunately, btrfs doesn't follow that behavior.

Btrfs fiemap sometimes return merged result, while after cycle mount, it
returns split result. Furthermore after a snap, btrfs returns merged
result again.

Signed-off-by: Qu Wenruo 


Test fails with ext3/2 when driving with ext4 driver, fiemap changed
after umount/mount cycle, then changed back to original result after
sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)


It's good to know this test case can already detect hidden bug of other fs.



=== Fiemap after dsync write and sleep ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:12288..12415   128 0x1000
   1: [128..255]:  12160..12287   128 0x1000
   2: [256..511]:  16384..16639   256 0x1000
   3: [512..2047]: 16896..18431  1536 0x1000
   4: [2048..4095]:19456..21503  2048 0x1001
=== Fiemap after cycle mount ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..95]: 12288..1238396 0x1000
   1: [96..127]:   12384..1241532 0x1000

[0-127] was split to two extents

   2: [128..255]:  12160..12287   128 0x1000
   3: [256..511]:  16384..16639   256 0x1000
   4: [512..2047]: 16896..18431  1536 0x1000
   5: [2048..4095]:19456..21503  2048 0x1001
=== Fiemap after cycle mount and sleep ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:12288..12415   128 0x1000
   1: [128..255]:  12160..12287   128 0x1000
   2: [256..511]:  16384..16639   256 0x1000
   3: [512..2047]: 16896..18431  1536 0x1000
   4: [2048..4095]:19456..21503  2048 0x1001


---
 tests/generic/422 | 127 ++
 tests/generic/422.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 130 insertions(+)
 create mode 100755 tests/generic/422
 create mode 100644 tests/generic/422.out

diff --git a/tests/generic/422 b/tests/generic/422
new file mode 100755
index 000..4ca4476
--- /dev/null
+++ b/tests/generic/422
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test 422
+#
+# Test if a file system returns constant fiemap result after remount and
+# fiemap.
+# Unfortunately, btrfs doesn't follow this behavior.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os IRIX Linux
+_require_scratch


Need _require_fiemap


+
+block_size=$((64 * 1024))
+block_count=32
+dst=$SCRATCH_MNT/file
+sleeptime=3
+
+# It's almost 100% for btrfs to trigger inconstant fiemap result


Trailing whitespace in above line.


+# just in case
+runtime=$((2 * $LOAD_FACTOR))


I don't think this loop is necessary, as the comments state, it's almost
100% reproducible for btrfs (and for ext2/3 in my testing).


In fact, the uncertainty lies in the time of sleep.

In my test environment, if only sleeps for 1 second, the fiemap result 
won't change for btrfs. (although cycle mount still changes the fiemap 
result)


So it's better to keep the runtime (will be renamed to loop_counts in 
next version), just as stated, in case.





+
+# record fiemap as checkpoint, and output the hash of fiemap result
+# to stdout
+fiemap_checkpoint()
+{
+   local number=$1
+   local message=$2
+
+   echo "=== $message ===" >> $seqres.full
+   $XFS_IO_PROG -c "fiemap -v" $dst > ${tmp}.cp$number
+   cat ${tmp}.cp${number} >> $seqres.full
+
+   md5sum ${tmp}.cp${nu

Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-04 Thread Eryu Guan
On Mon, Apr 03, 2017 at 03:09:23PM +0800, Qu Wenruo wrote:
> As long as we don't modify the on-disk data, fiemap result should always
> be constant.
> 
> Operation like cycle mount and sleep should not affect fiemap result.
> While unfortunately, btrfs doesn't follow that behavior.
> 
> Btrfs fiemap sometimes return merged result, while after cycle mount, it
> returns split result. Furthermore after a snap, btrfs returns merged
> result again.
> 
> Signed-off-by: Qu Wenruo 

Test fails with ext3/2 when driving with ext4 driver, fiemap changed
after umount/mount cycle, then changed back to original result after
sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)

=== Fiemap after dsync write and sleep ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:12288..12415   128 0x1000
   1: [128..255]:  12160..12287   128 0x1000
   2: [256..511]:  16384..16639   256 0x1000
   3: [512..2047]: 16896..18431  1536 0x1000
   4: [2048..4095]:19456..21503  2048 0x1001
=== Fiemap after cycle mount ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..95]: 12288..1238396 0x1000
   1: [96..127]:   12384..1241532 0x1000

[0-127] was split to two extents

   2: [128..255]:  12160..12287   128 0x1000
   3: [256..511]:  16384..16639   256 0x1000
   4: [512..2047]: 16896..18431  1536 0x1000
   5: [2048..4095]:19456..21503  2048 0x1001
=== Fiemap after cycle mount and sleep ===
/vda6/file:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:12288..12415   128 0x1000
   1: [128..255]:  12160..12287   128 0x1000
   2: [256..511]:  16384..16639   256 0x1000
   3: [512..2047]: 16896..18431  1536 0x1000
   4: [2048..4095]:19456..21503  2048 0x1001

> ---
>  tests/generic/422 | 127 
> ++
>  tests/generic/422.out |   2 +
>  tests/generic/group   |   1 +
>  3 files changed, 130 insertions(+)
>  create mode 100755 tests/generic/422
>  create mode 100644 tests/generic/422.out
> 
> diff --git a/tests/generic/422 b/tests/generic/422
> new file mode 100755
> index 000..4ca4476
> --- /dev/null
> +++ b/tests/generic/422
> @@ -0,0 +1,127 @@
> +#! /bin/bash
> +# FS QA Test 422
> +#
> +# Test if a file system returns constant fiemap result after remount and
> +# fiemap.
> +# Unfortunately, btrfs doesn't follow this behavior.
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os IRIX Linux
> +_require_scratch

Need _require_fiemap

> +
> +block_size=$((64 * 1024))
> +block_count=32
> +dst=$SCRATCH_MNT/file
> +sleeptime=3
> +
> +# It's almost 100% for btrfs to trigger inconstant fiemap result 

Trailing whitespace in above line.

> +# just in case
> +runtime=$((2 * $LOAD_FACTOR))

I don't think this loop is necessary, as the comments state, it's almost
100% reproducible for btrfs (and for ext2/3 in my testing).

> +
> +# record fiemap as checkpoint, and output the hash of fiemap result
> +# to stdout
> +fiemap_checkpoint()
> +{
> + local number=$1
> + local message=$2
> +
> + echo "=== $message ===" >> $seqres.full
> + $XFS_IO_PROG -c "fiemap -v" $dst > ${tmp}.cp$number
> + cat ${tmp}.cp${number} >> $seqres.full
> +
> + md5sum ${tmp}.cp${number} | cut -d ' ' -f 1
> +}
> +
> +do_test()
> +{
> + local number=$1
> +
> + # Use 16 times of file size to ensure we have enough space
> + _scratch_mkfs_sized $((16 * $block_size * $block_count)) \

Use _require_fs_space for required space. Not all filesys

[PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-04 Thread Qu Wenruo
Btrfs allows inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify if the size is before compression or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So that if we append existing inline extent, it should be converted
   to regular file extents.

However several users in btrfs mail list have reported invalid inline
file extent, which only meets the first two condition, but with regular
file extents follows.

The bug is here for a long long time, so long that we have modified
kernel and btrfs-progs to accept such case, but it's still not designed
behavior, and must be detected and fixed.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/140 | 120 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 123 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..55f81f2
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,120 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows inline file extent if file is small enough, and any
+# incoming write enlarges the file beyong max_inline should replace inline
+# extents with regular extents.
+# This is a long standing bug, so fsck won't detect it and kernel will allow
+# normal read on it, but should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+src_subv="$SCRATCH_MNT/src"
+dst_subv="$SCRATCH_MNT/dst"
+loop_counts=$(($LOAD_FACTOR * 4)) #The possibility is over 80%, just in case
+
+do_test()
+{
+   local ret
+   _scratch_mkfs > /dev/null 2>&1
+
+   # specify max_inline and compress explicitly to create
+   # inlined compressed extent
+   _scratch_mount "-o max_inline=$inline_size,compress"
+
+   # Subvolume id=257
+   _run_btrfs_util_prog subvolume create $src_subv
+
+   # Page sized data will be compressed to quite small size,
+   # which is about 50 bytes for 4K, 100 bytes for 64K,
+   # far smaller than $inline_size, causing a inlined compressed
+   # file extent
+   $XFS_IO_PROG -f -c "pwrite 0 $page_size" -c "fsync" $src_subv/file \
+   > /dev/null
+
+   # Subvolume id=258
+   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
+
+   # Append another page to file in snapshot, since its offset is not
+   # 0, it will cause regular extent, and should convert inline extent
+   # to regular too.
+   $XFS_IO_PROG -c "pwrite $page_size $page_size" $dst_subv/file > 
/dev/null
+   _scratch_unmount
+
+   # Fsck won't report inline-then-regular as error, as it's so long
+   # standing that we have no choice but to accept.
+   # So here we use dump-tree to catch inline extent of $dst_subv, and
+   # we need piping stdout, call btrfs-progs directly
+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 258 $SCRATCH_DEV \
+   > $tmp.dump_tree
+   if grep -q inline $tmp.dump_tree; then
+   cat $tmp.dump_tree >> $seqres.full
+   echo "Invalid inline file extent detected"
+   return 1
+   fi
+   return 0
+}
+
+for i in $(seq 1 $loop_counts); do
+   do_test
+   if [ $? -ne 0 ]; then
+   corrupted=$(($corrupted + 1))
+   fi
+done
+
+echo "Corruption possibilit

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, J. Bruce Fields wrote:

> On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > Because if above is acceptable we could make reported i_version to be 
>> > > > a sum
>> > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > shutdown.
>> > > > That way after a crash we are guaranteed each inode will report new
>> > > > i_version (the sum would probably have to look like "superblock crash
>> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > Thoughts?
>> > 
>> > How hard is this for filesystems to support?  Do they need an on-disk
>> > format change to keep track of the crash counter?  Maybe not, maybe the
>> > high bits of the i_version counters are all they need.
>> > 
>> 
>> Yeah, I imagine we'd need a on-disk change for this unless there's
>> something already present that we could use in place of a crash counter.
>
> We could consider using the current time instead.  So, put the current
> time (or time of last boot, or this inode's ctime, or something) in the
> high bits of the change attribute, and keep the low bits as a counter.

This is a very different proposal.
I don't think Jan was suggesting that the i_version be split into two
bit fields, one the change-counter and one the crash-counter.
Rather, the crash-counter was multiplied by a large-number and added to
the change-counter with the expectation that while not ever
change-counter landed on disk, at least 1 in every large-number would.
So after each crash we effectively add large-number to the
change-counter, and can be sure that number hasn't been used already.

To store the crash-counter in each inode (which does appeal) you would
need to be able to remove it before adding the new crash counter, and
that requires bit-fields.  Maybe there are enough bits.

If you want to ensure read-only files can remain cached over a crash,
then you would have to mark a file in some way on stable storage
*before* allowing any change.
e.g. you could use the lsb.  Odd i_versions might have been changed
recently and crash-count*large-number needs to be added.
Even i_versions have not been changed recently and nothing need be
added.

If you want to change a file with an even i_version, you subtract
  crash-count*large-number
to the i_version, then set lsb.  This is written to stable storage before
the change.

If a file has not been changed for a while, you can add
  crash-count*large-number
and clear lsb.

The lsb of the i_version would be for internal use only.  It would not
be visible outside the filesystem.

It feels a bit clunky, but I think it would work and is the best
combination of Jan's idea and your requirement.
The biggest cost would be switching to 'odd' before an changes, and the
unknown is when does it make sense to switch to 'even'.

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-04 Thread Qu Wenruo



At 04/05/2017 09:27 AM, Eryu Guan wrote:

On Mon, Apr 03, 2017 at 11:01:39AM +0800, Qu Wenruo wrote:

Btrfs allows inline file extent if and only if
1) It's at offset 0
2) It's smaller than min(max_inline, page_size)
   Although we don't specify if the size is before compression or after
   compression.
   At least according to current behavior, we are only limiting the size
   after compression.
3) It's the only file extent
   So that if we append existing inline extent, it should be converted
   to regular file extents.

However several users in btrfs mail list have reported invalid inline
file extent, which only meets the first two condition, but with regular
file extents follows.

The bug is here for a long long time, so long that we have modified
kernel and btrfs-progs to accept such case, but it's still not designed
behavior, and must be detected and fixed.

Signed-off-by: Qu Wenruo 


Better to have btrfs developers review this too to see if it's a valid
test. Some comments inline from fstests perspective of view.


---
 tests/btrfs/140 | 124 
 tests/btrfs/140.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 127 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..139df76
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,124 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Check if btrfs will create inline-then-regular file layout.
+#
+# Btrfs only allows inline file extent if file is small enough, and any
+# incoming write enlarges the file beyong max_inline should replace inline
+# extents with regular extents.
+# This is a long standing bug, so fsck won't detect it and kernel will allow
+# normal read on it, but should be avoided.
+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+
+inline_size=2048
+page_size=$(get_page_size)
+src_subv="$SCRATCH_MNT/src"
+dst_subv="$SCRATCH_MNT/dst"
+runtime=$(($LOAD_FACTOR * 8))  #The possibility is over 80%, just in case


This confused me a bit, I thought test would run for $runtime seconds
and wanted to suggest s/LOAD_FACTOR/TIME_FACTOR, but runtime is really a
loop counter. Mind renaming it to something more obvious, like
loop_count?


+corrupted=0
+
+do_test()
+{
+   local ret
+   _scratch_mkfs > /dev/null >&1

  meant 2>&1 ?

+
+   # specify max_inline and compress explicitly to create
+   # inlined compressed extent
+   _scratch_mount "-o max_inline=$inline_size,compress"
+   


Whitespace issue in above line.


+   # Subvolume id=257
+   _run_btrfs_util_prog subvolume create $src_subv
+
+   # Page sized data will be compressed to quite small size,
+   # which is about 50 bytes for 4K, 100 bytes for 64K,
+   # far smaller than $inline_size, causing a inlined compressed
+   # file extent
+   xfs_io -f -c "pwrite 0 $page_size" $src_subv/file > /dev/null
+   sync


$XFS_IO_PROG and use '-c fsync' to sync the file.


What about syncfs? As I don't want log tree of btrfs to get involved in 
this case.





+
+   # Subvolume id=258
+   _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
+   


Whitespace issue.


+   # Append another page to file in snapshot, since its offset is not
+   # 0, it will cause regular extent, and should convert inline extent
+   # to regular too.
+   xfs_io -c "pwrite $page_size $page_size" $dst_subv/file > /dev/null


$XFS_IO_PROG


+   _scratch_unmount
+
+   # Fsck won't report inline-

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, Dave Chinner wrote:

> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
>> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
>> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
>> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > > Because if above is acceptable we could make reported i_version to 
>> > > > > be a sum
>> > > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > > shutdown.
>> > > > > That way after a crash we are guaranteed each inode will report new
>> > > > > i_version (the sum would probably have to look like "superblock crash
>> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> > > > > possible
>> > > > > i_version numbers we gave away but did not write to disk but 
>> > > > > still...).
>> > > > > Thoughts?
>> > > 
>> > > How hard is this for filesystems to support?  Do they need an on-disk
>> > > format change to keep track of the crash counter?
>> > 
>> > Yes. We'll need version counter in the superblock, and we'll need to
>> > know what the increment semantics are. 
>> > 
>> > The big question is how do we know there was a crash? The only thing
>> > a journalling filesystem knows at mount time is whether it is clean
>> > or requires recovery. Filesystems can require recovery for many
>> > reasons that don't involve a crash (e.g. root fs is never unmounted
>> > cleanly, so always requires recovery). Further, some filesystems may
>> > not even know there was a crash at mount time because their
>> > architecture always leaves a consistent filesystem on disk (e.g. COW
>> > filesystems)
>> 
>> What filesystems can or cannot easily do obviously differs. Ext4 has a
>> recovery flag set in superblock on RW mount/remount and cleared on
>> umount/RO remount.
>
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro. This uncovered a bug in grub in

Filesystems could use register_reboot_notifier() to get a notification
that even systemd cannot stuff-up.  It could check for dirty data and, if
there is none (which there shouldn't be if a sync happened), it does a
single write to disk to update the superblock (or a single write to each
disk... or something).
md does this, because getting the root device to be marked read-only is
even harder than getting the root filesystem to be remounted read-only.

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] fstests: btrfs: Check if btrfs will create inline-then-regular file extents

2017-04-04 Thread Eryu Guan
On Mon, Apr 03, 2017 at 11:01:39AM +0800, Qu Wenruo wrote:
> Btrfs allows inline file extent if and only if
> 1) It's at offset 0
> 2) It's smaller than min(max_inline, page_size)
>Although we don't specify if the size is before compression or after
>compression.
>At least according to current behavior, we are only limiting the size
>after compression.
> 3) It's the only file extent
>So that if we append existing inline extent, it should be converted
>to regular file extents.
> 
> However several users in btrfs mail list have reported invalid inline
> file extent, which only meets the first two condition, but with regular
> file extents follows.
> 
> The bug is here for a long long time, so long that we have modified
> kernel and btrfs-progs to accept such case, but it's still not designed
> behavior, and must be detected and fixed.
> 
> Signed-off-by: Qu Wenruo 

Better to have btrfs developers review this too to see if it's a valid
test. Some comments inline from fstests perspective of view.

> ---
>  tests/btrfs/140 | 124 
> 
>  tests/btrfs/140.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 127 insertions(+)
>  create mode 100755 tests/btrfs/140
>  create mode 100644 tests/btrfs/140.out
> 
> diff --git a/tests/btrfs/140 b/tests/btrfs/140
> new file mode 100755
> index 000..139df76
> --- /dev/null
> +++ b/tests/btrfs/140
> @@ -0,0 +1,124 @@
> +#! /bin/bash
> +# FS QA Test 140
> +#
> +# Check if btrfs will create inline-then-regular file layout.
> +#
> +# Btrfs only allows inline file extent if file is small enough, and any
> +# incoming write enlarges the file beyong max_inline should replace inline
> +# extents with regular extents.
> +# This is a long standing bug, so fsck won't detect it and kernel will allow
> +# normal read on it, but should be avoided.
> +#
> +#---
> +# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_btrfs_command inspect-internal dump-tree
> +
> +inline_size=2048
> +page_size=$(get_page_size)
> +src_subv="$SCRATCH_MNT/src"
> +dst_subv="$SCRATCH_MNT/dst"
> +runtime=$(($LOAD_FACTOR * 8))#The possibility is over 80%, just in 
> case

This confused me a bit, I thought test would run for $runtime seconds
and wanted to suggest s/LOAD_FACTOR/TIME_FACTOR, but runtime is really a
loop counter. Mind renaming it to something more obvious, like
loop_count?

> +corrupted=0
> +
> +do_test()
> +{
> + local ret
> + _scratch_mkfs > /dev/null >&1
  meant 2>&1 ?
> +
> + # specify max_inline and compress explicitly to create
> + # inlined compressed extent
> + _scratch_mount "-o max_inline=$inline_size,compress"
> + 

Whitespace issue in above line.

> + # Subvolume id=257
> + _run_btrfs_util_prog subvolume create $src_subv
> +
> + # Page sized data will be compressed to quite small size,
> + # which is about 50 bytes for 4K, 100 bytes for 64K,
> + # far smaller than $inline_size, causing a inlined compressed
> + # file extent
> + xfs_io -f -c "pwrite 0 $page_size" $src_subv/file > /dev/null
> + sync

$XFS_IO_PROG and use '-c fsync' to sync the file.

> +
> + # Subvolume id=258
> + _run_btrfs_util_prog subvolume snapshot $src_subv $dst_subv
> + 

Whitespace issue.

> + # Append another page to file in snapshot, since its offset is not
> + # 0, it will cause regular extent, and should convert inline extent
> + # to regular too.
> + xfs_io -c "pwrite $page_size $page_size" $dst_subv/file > /dev/null

$XFS_IO_P

Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-04 Thread Qu Wenruo



At 04/04/2017 12:31 AM, Darrick J. Wong wrote:

On Mon, Apr 03, 2017 at 03:09:23PM +0800, Qu Wenruo wrote:

As long as we don't modify the on-disk data, fiemap result should always
be constant.

Operation like cycle mount and sleep should not affect fiemap result.
While unfortunately, btrfs doesn't follow that behavior.

Btrfs fiemap sometimes return merged result, while after cycle mount, it
returns split result. Furthermore after a snap, btrfs returns merged
result again.

Signed-off-by: Qu Wenruo 
---
 tests/generic/422 | 127 ++
 tests/generic/422.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 130 insertions(+)
 create mode 100755 tests/generic/422
 create mode 100644 tests/generic/422.out

diff --git a/tests/generic/422 b/tests/generic/422
new file mode 100755
index 000..4ca4476
--- /dev/null
+++ b/tests/generic/422
@@ -0,0 +1,127 @@
+#! /bin/bash
+# FS QA Test 422
+#
+# Test if a file system returns constant fiemap result after remount and
+# fiemap.
+# Unfortunately, btrfs doesn't follow this behavior.


Is this test a goalpost for btrfs bugfixes you're going to post?


Yes.




+#
+#---
+# Copyright (c) 2017 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os IRIX Linux


I don't think we need to support IRIX...


Could we just remove IRIX from the template?

Thanks,
Qu



--D


+_require_scratch
+
+block_size=$((64 * 1024))
+block_count=32
+dst=$SCRATCH_MNT/file
+sleeptime=3
+
+# It's almost 100% for btrfs to trigger inconstant fiemap result
+# just in case
+runtime=$((2 * $LOAD_FACTOR))
+
+# record fiemap as checkpoint, and output the hash of fiemap result
+# to stdout
+fiemap_checkpoint()
+{
+   local number=$1
+   local message=$2
+
+   echo "=== $message ===" >> $seqres.full
+   $XFS_IO_PROG -c "fiemap -v" $dst > ${tmp}.cp$number
+   cat ${tmp}.cp${number} >> $seqres.full
+
+   md5sum ${tmp}.cp${number} | cut -d ' ' -f 1
+}
+
+do_test()
+{
+   local number=$1
+
+   # Use 16 times of file size to ensure we have enough space
+   _scratch_mkfs_sized $((16 * $block_size * $block_count)) \
+   > /dev/null 2>&1
+   _scratch_mount
+
+   echo "== Loop $number ==" >> $seqres.full
+   touch $dst
+   # Xfsprogs 4.9.0 still has a bug that xfs_io "open" with O_SYNC command
+   # doesn't work well with "pwrite", although it gets fixed in v4.10.0,
+   # use dd here to avoid it won't hurt for non-xfs developers
+   dd if=/dev/zero of=$dst bs=$block_size count=$block_count oflag=dsync \
+   status=none 2>&1
+
+   hash1=$(fiemap_checkpoint 1 "Fiemap just after dsync write")
+
+   # Sleep should not modify fiemap result
+   sleep $sleeptime
+
+   hash2=$(fiemap_checkpoint 2 "Fiemap after dsync write and sleep")
+
+   # cycle mount should not modify fiemap result
+   _scratch_cycle_mount
+
+   hash3=$(fiemap_checkpoint 3 "Fiemap after cycle mount")
+
+   # Sleep should not modify fiemap result
+   sleep $sleeptime
+
+   hash4=$(fiemap_checkpoint 4 "Fiemap after cycle mount and sleep")
+
+   _scratch_unmount
+
+   if [ $hash1 != $hash2 -o $hash2 != $hash3 -o $hash3 != $hash4 ]; then
+   echo "Inconstant fiemap result detected"
+   fi
+   echo >> $seqres.full
+}
+
+for i in $(seq 1 $runtime); do
+   do_test $i
+done
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/generic/422.out b/tests/generic/422.out
new file mode 100644
index 000..f70693f
--- /dev/null
+++ b/tests/generic/422.out
@@ -0,0 +1,2 @@
+QA output created by 422
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 3c7c5

Re: [PATCH 6/8] nowait aio: ext4

2017-04-04 Thread Goldwyn Rodrigues


On 04/04/2017 03:41 AM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2017 at 09:58:53AM +0200, Jan Kara wrote:
>> FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
>> Can we call it FS_NOWAIT_IO?
> 
> It's way to generic as it's a feature of the particular file_operations
> instance.  But once we switch to using RWF_* we can just the existing
> per-op feature checks for thos and the per-fs flag should just go away.
> 

I am working on incorporating RWF_* flags. However, I am not sure how
RWF_* flags would get rid of FS_NOWAIT/FS_NOWAIT_IO. Since most of
"blocking" information is with the filesystem, it is a per-filesystem
flag to block out (EOPNOTSUPP) the filesystems which do not support it.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 10:41:37AM +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were 
> > > > > > > reboots
> > > > > > >   between the two i_version checks.
> > > > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them?  What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll 
> > > > > > mark
> > > > > > the inode dirty and I think that will end up with the new i_version 
> > > > > > at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > > 
> > > > > So you think the filesystem can provide the atomicity?  In more 
> > > > > detail:
> > > > > 
> > > > 
> > > > Sorry, I hit send too quickly. That should have read:
> > > > 
> > > > "Yeah, there could be atomicity issues there."
> > > > 
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > > 
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk 
> > > blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > > 
> > > 
> > 
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version getattr
> > codepath. It's not pretty, but I don't see a real alternative.
> 
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t. to
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
> 
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*).  IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.

A slightly out-of-date i_version is fine, I think.  It's just the
reverse we want to avoid.  E.g., assuming an i_version-supporting
statux, if somebody could called statx then read, and got the new
i_version followed by the old data, that would cause problems.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a 
> > > > sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem 
> > > > shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> > 
> > How hard is this for filesystems to support?  Do they need an on-disk
> > format change to keep track of the crash counter?  Maybe not, maybe the
> > high bits of the i_version counters are all they need.
> > 
> 
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.

We could consider using the current time instead.  So, put the current
time (or time of last boot, or this inode's ctime, or something) in the
high bits of the change attribute, and keep the low bits as a counter.

Then as long as we trust our clock, we're no longer at risk of reusing
an i_version value.

> > I guess we just want to have some back-of-the-envelope estimates of
> > maximum number of i_version increments possible between crashes and
> > maximum number of crashes possible over lifetime of a filesystem, to
> > decide how to split up the bits.
> > 
> > I wonder if we could get away with using the new crash counter only for
> > *new* values of the i_version?  After a crash, use the on disk i_version
> > as is, and put off using the new crash counter until the next time the
> > file's modified.
> > 
> 
> That sounds difficult to get right. Suppose I have an inode that has not
> been updated in a long time. Someone writes to it and then queries the
> i_version. How do I know whether there were crashes since the last time
> I updated it? Or am I misunderstanding what you're proposing here?

I believe Jan was suggesting that we keep the i_version as-is on disk
but combine with with the crash counter when it's queried.

I was suggesting instead that on write, when we bump the i_version, we
replace the simple increment by something that increments *and* sticks
the current crash counter (or maybe just a time) in the high bits.  And
that combination will be what goes in i_version and is written to disk.

That guarantees that we never reuse an old value when we increment.

It also avoids having to invalidate absolutely every cache, even of
completely static files.

Clients could still see the change attribute go backwards, though, if
they query a file with dirty data and the server crashes before it
writes out the new change attribute.

Also, they could fail to detect data that reverted after boot if they
cache data while the file's dirty on the server side, and the server
crash preserves the i_version update but not the new data.

Presumably both are possible already for ctime (depending on the
filesystem), and I'm not convinced they're a big problem.

In both cases the reading client is caching while somebody's still
writing, and as long as the writer comes back and finishes its job,
readers will thereafter see the right thing.  So I think it's adequate
for close-to-open.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Tue, Apr 04, 2017 at 10:34:14PM +1000, Dave Chinner wrote:
> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> > What filesystems can or cannot easily do obviously differs. Ext4 has a
> > recovery flag set in superblock on RW mount/remount and cleared on
> > umount/RO remount.
> 
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro.

I'd certainly rather not invalidate caches on *every* boot.

On the other hand, if the only cases involve the root filesystem, then
from the point of view of NFS, we probably don't care much.

> > This flag being set on mount would imply incrementing the crash
> > counter. It should be pretty easy for each filesystem to implement
> > such flag and the counter but I agree it requires an on-disk
> > format change.
> 
> Yup, anything we want that is persistent and consistent across
> filesystems will need on-disk format changes. Hence we need a solid
> specification first, not to mention tests to validate correct
> behaviour across all filesystems in xfstests...

For xfstests we'll need a way to get i_version (patch it into statx, I
guess?).  Ideally we'd have a way to test behavior across crashes too,
any advice there?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: cleanup submit_one_bio

2017-04-04 Thread Liu Bo
On Mon, Apr 03, 2017 at 01:45:47PM -0700, Liu Bo wrote:
> @bio_offset is passed into submit_bio_hook and is used at
> btrfs_wq_submit_bio(), but only dio code makes use of @bio_offset, so
> remove other dead code.
>

Please ignore this one.

Thanks,

-liubo
> Cc: David Sterba 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent_io.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7b5139d..79f38e6a 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2746,19 +2746,14 @@ static int __must_check submit_one_bio(struct bio 
> *bio, int mirror_num,
>  unsigned long bio_flags)
>  {
>   int ret = 0;
> - struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
> - struct page *page = bvec->bv_page;
>   struct extent_io_tree *tree = bio->bi_private;
> - u64 start;
> -
> - start = page_offset(page) + bvec->bv_offset;
>  
>   bio->bi_private = NULL;
>   bio_get(bio);
>  
>   if (tree->ops)
>   ret = tree->ops->submit_bio_hook(page->mapping->host, bio,
> -mirror_num, bio_flags, start);
> +mirror_num, bio_flags, 0);
>   else
>   btrfsic_submit_bio(bio);
>  
> -- 
> 2.5.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Do different btrfs volumes compete for CPU?

2017-04-04 Thread Peter Grandi
> [ ... ] I tried to use eSATA and ext4 first, but observed
> silent data corruption and irrecoverable kernel hangs --
> apparently, SATA is not really designed for external use.

SATA works for external use, eSATA works well, but what really
matters is the chipset of the adapter card.

In my experience JMicron is not so good, Marvell a bit better,
best is to use a recent motherboard chipset with a SATA-eSATA
internal cable and bracket.

>> As written that question is meaningless: despite the current
>> mania for "threads"/"threadlets" a filesystem driver is a
>> library, not a set of processes (all those '[btrfs-*]'
>> threadlets are somewhat misguided ways to do background
>> stuff).

> But these threadlets, misguided as the are, do exist, don't
> they?

But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.

> I understand that qgroups is very much work in progress, but
> (correct me if I'm wrong) right now it's the only way to
> estimate real usage of subvolume and its snapshots.

It is a way to do so and not a very good way. There is no
obviously good way to define "real usage" in the presence of
hard-links and reflinking, and qgroups use just one way to
define it. A similar problem happens with processes in the
presence of shared pages, multiple mapped shared libraries etc.

> For instance, if I have dozen 1TB subvolumes each having ~50
> snapshots and suddenly run out of space on a 24TB volume, how
> do I find the culprit without qgroups?

It is not clear what "culprit" means here. The problem is that
both hard-links and ref-linking create really significant
ambiguities as to used space. Plus the same problem would happen
with directories instead of subvolumes and hard-links instead of
reflinked snapshots.

> [ ... ] The chip is ASM1142, not Intel/AMD sadly but quite
> popular nevertheless.

ASMedia USB3 chipsets are fairly reliable at the least the card
ones on the system side. The ones on the disk side I don't know
much about. I have seen some ASMedia one that also seem OK. For
the disks I use a Seagate and a WDC external box from which I
have removed the original disk, as I have noticed that Seagate
and WDC for obvious reasons tend to test and use the more
reliable chipsets. I have also got an external USB3 dock with a
recent ASMedia chipset that also seems good, but I haven't used
it much.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Chris Murphy
On Tue, Apr 4, 2017 at 10:52 AM, Chris Murphy  wrote:


> Mounting -o ro,degraded is probably permitted by the file system, but
> chunks of the file system and certainly your data, will be missing. So
> it's just a matter of time before copying data off will fail.

** Context here is, more than 1 device missing.




-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Chris Murphy
On Mon, Apr 3, 2017 at 10:02 PM, Robert Krig
 wrote:
>
>
> On 03.04.2017 16:25, Robert Krig wrote:
>>
>> I'm gonna run a extensive memory check once I get home, since you
>> mentioned corrupt memory might be an issue here.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> I ran a memtest over a couple of hours with no errors. Ram seems to be
> fine so far.

Inconclusive. A memtest can take days to expose a problem, and even
that's not conclusive. The list archive has some examples of where
memory testers gave RAM a pass, but doing things like compiling the
kernel would fail.


>
> I've looked at the link you provided. Frankly it looks very scary. (At
> least to me it does)
> But I've just thought of something else.
>
> My storage array is BTRFS Raid1 with 4x8TB Drives.
> Wouldn't it be possible to simply disconnect two of those drives, mount
> with -o degraded and still have access (even if read-only) to all my data?

man mkfs.btrfs

Btrfs raid1 supports only one device missing, no matter how many drives.

Mounting -o ro,degraded is probably permitted by the file system, but
chunks of the file system and certainly your data, will be missing. So
it's just a matter of time before copying data off will fail.

I suggest trying -o ro with all drives, not a degraded mount, and
copying data off. Any failures should be logged. Metadata errors are
logged without paths, whereas data corruption included path to the
affected file. This is easier than scraping the file system with btrfs
restore.

If you can't mount ro with all drives, or ro,degraded with just one
device missing, you'll need to use btrfs restore which is more
tolerant of missing metadata.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: add test for btrfs send/receive with sparse files

2017-04-04 Thread Filipe Manana
On Tue, Mar 21, 2017 at 6:50 AM, Qu Wenruo  wrote:
> Hi Filipe,
>
> At 02/15/2017 04:35 AM, fdman...@kernel.org wrote:
>>
>> From: Filipe Manana 
>>
>> Test that both a full and incremental btrfs send operation preserves file
>> holes.
>
>
> I found the test case fails with compress=lzo mount option.

I had seen it too but had no time so far to investigate it.

>
> In fact, it turns out to be 2 bugs:
>
> [Inline-then-regular layout]
> And further more, it seems that it can create inline-then-regular file
> extents layout with compress=lzo.

That's not good.

I'll take a look at it.

Thanks.

>
> It does not always create such inline-then-regular, but the possibility
> seems quite high in my test box, near 80%.
>
> Btrfs dump tree for snap2 (258):
> item 9 key (257 INODE_REF 256) itemoff 15736 itemsize 13
> inode ref index 2 namelen 3 name: foo
> item 10 key (257 EXTENT_DATA 0) itemoff 15663 itemsize 73
> generation 7 type 0 (inline)
> inline extent data size 52 ram_bytes 4096 compression 2
> (lzo)
> item 11 key (257 EXTENT_DATA 4096) itemoff 15610 itemsize 53
> generation 8 type 1 (regular)
> extent data disk byte 0 nr 0
> extent data offset 0 nr 1048576 ram 1052672
> extent compression 0 (none)
> item 12 key (257 EXTENT_DATA 1052672) itemoff 15557 itemsize 53
> generation 8 type 1 (regular)
> extent data disk byte 12582912 nr 4096
> extent data offset 0 nr 4096 ram 4096
> extent compression 0 (none)
>
> [Send-stream without hole]
> This bug is 100% reproducible, the send stream of snap2 contains no hole,
> just the full file contents.
>
> Thanks,
> Qu
>
>
>>
>> This used to fail when the filesystem had the NO_HOLES feature enabled,
>> that is, when the test is run with MKFS_OPTIONS="-O no-holes".
>>
>> This is fixed by the following patch for the linux kernel:
>>
>>   "Btrfs: incremental send, fix unnecessary hole writes for sparse files"
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>  tests/btrfs/137 | 141
>> 
>>  tests/btrfs/137.out |  63 +++
>>  tests/btrfs/group   |   1 +
>>  3 files changed, 205 insertions(+)
>>  create mode 100755 tests/btrfs/137
>>  create mode 100644 tests/btrfs/137.out
>>
>> diff --git a/tests/btrfs/137 b/tests/btrfs/137
>> new file mode 100755
>> index 000..3ff2c6b
>> --- /dev/null
>> +++ b/tests/btrfs/137
>> @@ -0,0 +1,141 @@
>> +#! /bin/bash
>> +# FS QA Test No. btrfs/137
>> +#
>> +# Test that both incremental and full send operations preserve file
>> holes.
>> +#
>> +#---
>> +#
>> +# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
>> +# Author: Filipe Manana 
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> +#---
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +tmp=/tmp/$$
>> +status=1   # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +   cd /
>> +   rm -fr $send_files_dir
>> +   rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +. ./common/punch
>> +
>> +# real QA test starts here
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_test
>> +_require_scratch
>> +_require_xfs_io_command "fiemap"
>> +
>> +send_files_dir=$TEST_DIR/btrfs-test-$seq
>> +
>> +rm -f $seqres.full
>> +rm -fr $send_files_dir
>> +mkdir $send_files_dir
>> +
>> +_scratch_mkfs >>$seqres.full 2>&1
>> +_scratch_mount
>> +
>> +# Create the first test file.
>> +$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 4K" $SCRATCH_MNT/foo |
>> _filter_xfs_io
>> +
>> +# Create a second test file with a 1Mb hole.
>> +$XFS_IO_PROG -f \
>> + -c "pwrite -S 0xaa 0 4K" \
>> + -c "pwrite -S 0xbb 1028K 4K" \
>> + $SCRATCH_MNT/bar | _filter_xfs_io
>> +
>> +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
>> +   $SCRATCH_MNT/snap1 >/dev/null
>> +
>> +# Now add one new extent to our first test file, increasing its size and
>> leaving
>> +# a 1Mb hole between the first 

[PATCH] btrfs: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support

2017-04-04 Thread Chandan Jay Sharma
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support
for btrfs. Extended file attributes are 32 bit values (FS_XFLAGS_SYNC,
FS_XFLAG_IMMUTABLE, etc) which have one-to-one mapping to the flag values
that can be stored in inode->i_flags (i.e. S_SYNC, S_IMMUTABLE, etc).
The flags can be set/unset to enable/disable file attributes.
These attributes are listed/modified by lsattr/chattr.

Signed-off-by: Chandan Jay Sharma 
---
 fs/btrfs/ioctl.c | 148 +++
 1 file changed, 148 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dabfc7a..5d8486b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -132,6 +132,25 @@ static unsigned int btrfs_flags_to_ioctl(unsigned int 
flags)
return iflags;
 }
 
+/* Transfer ioctl flags to btrfs internal flags */
+static unsigned int btrfs_ioctl_to_flags(unsigned int iflags)
+{
+   unsigned int flags = 0;
+
+   if (iflags & FS_SYNC_FL)
+   flags |= BTRFS_INODE_SYNC;
+   if (iflags & FS_IMMUTABLE_FL)
+   flags |= BTRFS_INODE_IMMUTABLE;
+   if (iflags & FS_APPEND_FL)
+   flags |= BTRFS_INODE_APPEND;
+   if (iflags & FS_NODUMP_FL)
+   flags |= BTRFS_INODE_NODUMP;
+   if (iflags & FS_NOATIME_FL)
+   flags |= BTRFS_INODE_NOATIME;
+
+   return flags;
+}
+
 /*
  * Update inode->i_flags based on the btrfs internal flags.
  */
@@ -157,6 +176,75 @@ void btrfs_update_iflags(struct inode *inode)
 }
 
 /*
+ * Propagate flags from i_flags to BTRFS_I(inode)->flags
+ */
+void btrfs_get_inode_flags(struct btrfs_inode *ip)
+{
+   unsigned int vfs_fl;
+   unsigned long old_fl, new_fl;
+
+   do {
+   vfs_fl = ip->vfs_inode.i_flags;
+   old_fl = ip->flags;
+   new_fl = old_fl & ~(BTRFS_INODE_SYNC|BTRFS_INODE_APPEND|
+   BTRFS_INODE_IMMUTABLE|BTRFS_INODE_NOATIME|
+   BTRFS_INODE_DIRSYNC);
+   if (vfs_fl & S_SYNC)
+   new_fl |= BTRFS_INODE_SYNC;
+   if (vfs_fl & S_APPEND)
+   new_fl |= BTRFS_INODE_APPEND;
+   if (vfs_fl & S_IMMUTABLE)
+   new_fl |= BTRFS_INODE_IMMUTABLE;
+   if (vfs_fl & S_NOATIME)
+   new_fl |= BTRFS_INODE_NOATIME;
+   if (vfs_fl & S_DIRSYNC)
+   new_fl |= BTRFS_INODE_DIRSYNC;
+   } while (cmpxchg(&ip->flags, old_fl, new_fl) != old_fl);
+}
+
+/*
+ * Translate btrfs internal flags BTRFS_I(inode)->flags to xflags.
+ */
+static inline unsigned int btrfs_flags_to_xflags(unsigned int flags)
+{
+   unsigned int xflags = 0;
+
+   if (flags & BTRFS_INODE_SYNC)
+   xflags |= FS_XFLAG_SYNC;
+   if (flags & BTRFS_INODE_IMMUTABLE)
+   xflags |= FS_XFLAG_IMMUTABLE;
+   if (flags & BTRFS_INODE_APPEND)
+   xflags |= FS_XFLAG_APPEND;
+   if (flags & BTRFS_INODE_NODUMP)
+   xflags |= FS_XFLAG_NODUMP;
+   if (flags & BTRFS_INODE_NOATIME)
+   xflags |= FS_XFLAG_NOATIME;
+
+   return xflags;
+}
+
+/*
+ * Transfer xflags flags to ioctl flags.
+ */
+static inline unsigned int btrfs_xflags_to_ioctl(unsigned int xflags)
+{
+   unsigned int flags = 0;
+
+   if (xflags & FS_XFLAG_SYNC)
+   flags |= FS_SYNC_FL;
+   if (xflags & FS_XFLAG_IMMUTABLE)
+   flags |= FS_IMMUTABLE_FL;
+   if (xflags & FS_XFLAG_APPEND)
+   flags |= FS_APPEND_FL;
+   if (xflags & FS_XFLAG_NODUMP)
+   flags |= FS_NODUMP_FL;
+   if (xflags & FS_XFLAG_NOATIME)
+   flags |= FS_NOATIME_FL;
+
+   return flags;
+}
+
+/*
  * Inherit flags from the parent inode.
  *
  * Currently only the compression flags and the cow flags are inherited.
@@ -5504,6 +5592,62 @@ static int btrfs_ioctl_set_features(struct file *file, 
void __user *arg)
return ret;
 }
 
+static int btrfs_ioctl_fsgetxattr(struct file *file, void __user *arg)
+{
+   struct fsxattr fa;
+   struct btrfs_inode *ip = BTRFS_I(file_inode(file));
+
+   memset(&fa, 0, sizeof(struct fsxattr));
+   btrfs_get_inode_flags(ip);
+   fa.fsx_xflags = btrfs_flags_to_xflags(ip->flags);
+
+   if (copy_to_user((struct fsxattr __user *)arg,
+   &fa, sizeof(fa)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
+{
+   struct inode *inode = file_inode(file);
+   struct btrfs_inode *ip = BTRFS_I(inode);
+   struct btrfs_root *root = ip->root;
+   struct fsxattr fa;
+   unsigned int flags;
+   int err;
+
+   /* Make sure caller has proper permission */
+   if (!inode_owner_or_capable(inode))
+   return -EPERM;
+
+   if (btrfs_root_readonly(root))
+   return -EROFS;
+
+   memset(&fa, 0, sizeof(st

[PATCH] generic: test for number of bytes used by files after buffered writes

2017-04-04 Thread fdmanana
From: Filipe Manana 

Test that a filesystem's implementation of the stat(2) system call
reports correct values for the number of blocks allocated for a file
when there are delayed allocations.

This test is motivated by a bug in btrfs which is fixed by the following
path for the linux kernel:

 "Btrfs: fix reported number of inode blocks"

Signed-off-by: Filipe Manana 
---
 tests/generic/422 | 126 ++
 tests/generic/422.out |  41 
 tests/generic/group   |   1 +
 3 files changed, 168 insertions(+)
 create mode 100755 tests/generic/422
 create mode 100644 tests/generic/422.out

diff --git a/tests/generic/422 b/tests/generic/422
new file mode 100755
index 000..20cd54a
--- /dev/null
+++ b/tests/generic/422
@@ -0,0 +1,126 @@
+#! /bin/bash
+# FS QA Test No. generic/422
+#
+# Test that a filesystem's implementation of the stat(2) system call reports
+# correct values for the number of blocks allocated for a file when there are
+# delayed allocations.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_scratch
+_require_xfs_io_command "falloc"
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
+$XFS_IO_PROG -f \
+ -c "pwrite -S 0xaa 0 64K" \
+ -c "truncate 128K" \
+ $SCRATCH_MNT/foo2 | _filter_xfs_io
+$XFS_IO_PROG -f \
+ -c "falloc -k 0 128K" \
+ -c "pwrite -S 0xaa 0 64K" \
+ $SCRATCH_MNT/foo3 | _filter_xfs_io
+touch $SCRATCH_MNT/foo4
+
+# Make sure everything done so far is durably persisted.
+sync
+
+# Now overwrite the extent of the first file.
+$XFS_IO_PROG -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/foo1 | _filter_xfs_io
+
+# Write to a hole of the second file.
+$XFS_IO_PROG -c "pwrite -S 0xff 64K 64K" $SCRATCH_MNT/foo2 | _filter_xfs_io
+# Write again to the same location, just to test that the fs will not account
+# the same write twice.
+$XFS_IO_PROG -c "pwrite -S 0x20 64K 64K" $SCRATCH_MNT/foo2 | _filter_xfs_io
+
+# Write beyond eof of the third file into the pre-allocated extent.
+$XFS_IO_PROG -c "pwrite -S 0xff 64K 64K" $SCRATCH_MNT/foo3 | _filter_xfs_io
+
+# Do a buffered write immediately followed by a direct IO write, without a
+# fsync in between, just to test that page invalidation does not lead to an
+# incorrect number of file blocks reported.
+$XFS_IO_PROG -c "pwrite -S 0xab 0 64K" $SCRATCH_MNT/foo4 | _filter_xfs_io
+$XFS_IO_PROG -d -c "pwrite -S 0xef 0 64K" $SCRATCH_MNT/foo4 | _filter_xfs_io
+
+echo
+echo "Before writeback"
+echo
+
+echo "Space used by file foo1:"
+du -h $SCRATCH_MNT/foo1 | _filter_scratch
+
+echo "Space used by file foo2:"
+du -h $SCRATCH_MNT/foo2 | _filter_scratch
+
+echo "Space used by file foo3:"
+du -h $SCRATCH_MNT/foo3 | _filter_scratch
+
+echo "Space used by file foo4:"
+du -h $SCRATCH_MNT/foo4 | _filter_scratch
+
+sync
+
+# We expect the same file sizes reported by 'du' after writeback finishes.
+echo
+echo "After writeback"
+echo
+
+echo "Space used by file foo1:"
+du -h $SCRATCH_MNT/foo1 | _filter_scratch
+
+echo "Space used by file foo2:"
+du -h $SCRATCH_MNT/foo2 | _filter_scratch
+
+echo "Space used by file foo3:"
+du -h $SCRATCH_MNT/foo3 | _filter_scratch
+
+echo "Space used by file foo4:"
+du -h $SCRATCH_MNT/foo4 | _filter_scratch
+
+status=0
+exit
diff --git a/tests/generic/422.out b/tests/generic/422.out
new file mode 100644
index 000..3696088
--- /dev/null
+++ b/tests/generic/422.out
@@ -0,0 +1,41 @@
+QA output created by 422
+wrote 65536/65536 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/6553

[PATCH] Btrfs: fix reported number of inode blocks

2017-04-04 Thread fdmanana
From: Filipe Manana 

Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0. sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0. sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K  /mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K   /mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0. sec (520.833 MiB/sec and 13. ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0. sec (260.417 MiB/sec and 6.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K  /mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K  /mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.

Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana 
---

This applies on top of the following patches:

  btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  btrfs: Handle delalloc error correctly to avoid ordered extent hang
  Btrfs: fix invalid attempt to free reserved space on failure to cow range
  Btrfs: fix incorrect space accounting after failure to insert inline extent

All in my integration-4.12 branch:

https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.12


 fs/btrfs/btrfs_inode.h |  7 
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c| 95 +-
 fs/btrfs/inode.c   | 53 
 4 files changed, 125 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 0c6baab..b8622e4 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -125,6 +125,13 @@ struct btrfs_inode {
u64 delalloc_bytes;
 
/*
+* Total number of bytes pending delalloc that fall within a file
+* range that is either a hole or beyond EOF (and no prealloc extent
+* exists in the range). This is always <= delalloc_bytes.
+*/
+   u64 new_delalloc_bytes;
+
+   /*
 * total number of bytes pending defrag, used by stat to check whether
 * it needs COW.
 */
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 48a30d0..d5ff51b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -21,6 +21,7 @@
 #define EXTENT_NORESERVE   (1U << 15)
 #define EXTENT_QGROUP_RESERVED (1U << 16)
 #define EXTENT_CLEAR_DATA_RESV (1U << 17)
+#define EXTENT_DELALLOC_NEW(1U << 18)
 #define EXTENT_IOBITS  (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_DO_ACCOUNTING(EXTENT_CLEAR_MET

[PATCH] Btrfs: fix extent map leak during fallocate error path

2017-04-04 Thread fdmanana
From: Filipe Manana 

If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 48dfb8e..56304c4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2856,8 +2856,10 @@ static long btrfs_fallocate(struct file *file, int mode,
}
ret = btrfs_qgroup_reserve_data(inode, cur_offset,
last_byte - cur_offset);
-   if (ret < 0)
+   if (ret < 0) {
+   free_extent_map(em);
break;
+   }
} else {
/*
 * Do not need to reserve unwritten extent for this
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Austin S. Hemmelgarn

On 2017-04-04 09:29, Brian B wrote:

On 04/04/2017 12:02 AM, Robert Krig wrote:

My storage array is BTRFS Raid1 with 4x8TB Drives.
Wouldn't it be possible to simply disconnect two of those drives, mount
with -o degraded and still have access (even if read-only) to all my data?

Just jumping on this point: my understanding of BTRFS "RAID1" is that
each file (block?) is randomly assigned to two disks of the array (no
matter how many disks are in the array).  So if you remove two disks,
you will probably have files that were "assigned" to both of those
disks, and will be missing.

In short, you can't remove more than one disk of a BTRFS RAID1 and still
have all of your data.

That understanding is correct.  From a functional perspective, BTRFS 
raid1 is currently a RAID10 implementation with striping happening at a 
very large granularity.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Hugo Mills
On Tue, Apr 04, 2017 at 09:29:11AM -0400, Brian B wrote:
> On 04/04/2017 12:02 AM, Robert Krig wrote:
> > My storage array is BTRFS Raid1 with 4x8TB Drives.
> > Wouldn't it be possible to simply disconnect two of those drives, mount
> > with -o degraded and still have access (even if read-only) to all my data?
> Just jumping on this point: my understanding of BTRFS "RAID1" is that
> each file (block?) is randomly assigned to two disks of the array (no

   Arbitrarily assigned, rather than randomly assigned (there is a
deterministic algorithm for it, but it's wise not to rely on the exact
behaviour of that algorithm, because there are a number of factors
that can alter its behaviour).

> matter how many disks are in the array).  So if you remove two disks,
> you will probably have files that were "assigned" to both of those
> disks, and will be missing.
> 
> In short, you can't remove more than one disk of a BTRFS RAID1 and still
> have all of your data.

   Indeed.

   Hugo.

-- 
Hugo Mills | Some days, it's just not worth gnawing through the
hugo@... carfax.org.uk | straps
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: Need some help: "BTRFS critical (device sda): corrupt leaf, slot offset bad: block"

2017-04-04 Thread Brian B
On 04/04/2017 12:02 AM, Robert Krig wrote:
> My storage array is BTRFS Raid1 with 4x8TB Drives.
> Wouldn't it be possible to simply disconnect two of those drives, mount
> with -o degraded and still have access (even if read-only) to all my data?
Just jumping on this point: my understanding of BTRFS "RAID1" is that
each file (block?) is randomly assigned to two disks of the array (no
matter how many disks are in the array).  So if you remove two disks,
you will probably have files that were "assigned" to both of those
disks, and will be missing.

In short, you can't remove more than one disk of a BTRFS RAID1 and still
have all of your data.



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread Dave Chinner
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > > Because if above is acceptable we could make reported i_version to be 
> > > > > a sum
> > > > > of "superblock crash counter" and "inode i_version". We increment
> > > > > "superblock crash counter" whenever we detect unclean filesystem 
> > > > > shutdown.
> > > > > That way after a crash we are guaranteed each inode will report new
> > > > > i_version (the sum would probably have to look like "superblock crash
> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > > i_version numbers we gave away but did not write to disk but 
> > > > > still...).
> > > > > Thoughts?
> > > 
> > > How hard is this for filesystems to support?  Do they need an on-disk
> > > format change to keep track of the crash counter?
> > 
> > Yes. We'll need version counter in the superblock, and we'll need to
> > know what the increment semantics are. 
> > 
> > The big question is how do we know there was a crash? The only thing
> > a journalling filesystem knows at mount time is whether it is clean
> > or requires recovery. Filesystems can require recovery for many
> > reasons that don't involve a crash (e.g. root fs is never unmounted
> > cleanly, so always requires recovery). Further, some filesystems may
> > not even know there was a crash at mount time because their
> > architecture always leaves a consistent filesystem on disk (e.g. COW
> > filesystems)
> 
> What filesystems can or cannot easily do obviously differs. Ext4 has a
> recovery flag set in superblock on RW mount/remount and cleared on
> umount/RO remount.

Even this doesn't help. A recent bug that was reported to the XFS
list - turns out that systemd can't remount-ro the root
filesystem sucessfully on shutdown because there are open write fds
on the root filesystem when it attempts the remount. So it just
reboots without a remount-ro. This uncovered a bug in grub in
that it (still!) thinks sync(1) is sufficient to get all the
metadata that points to a kernel image onto disk in places it can
read. XFS, like ext4, leaves it in the journal and so the system then fails to
boot because systemd didn't remount-ro the root fs and hence the
journal was never flushed before reboot and so grub can't find the
kernel and so everything fails

> This flag being set on mount would imply incrementing the crash
> counter. It should be pretty easy for each filesystem to implement
> such flag and the counter but I agree it requires an on-disk
> format change.

Yup, anything we want that is persistent and consistent across
filesystems will need on-disk format changes. Hence we need a solid
specification first, not to mention tests to validate correct
behaviour across all filesystems in xfstests...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: cleanup submit_one_bio

2017-04-04 Thread kbuild test robot
Hi Liu,

[auto build test ERROR on next-20170330]
[also build test ERROR on v4.11-rc5]
[cannot apply to btrfs/next v4.9-rc8 v4.9-rc7 v4.9-rc6]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-cleanup-submit_one_bio/20170404-194545
config: x86_64-randconfig-x012-201714 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/extent_io.c: In function 'submit_one_bio':
>> fs/btrfs/extent_io.c:2739:36: error: 'page' undeclared (first use in this 
>> function)
  ret = tree->ops->submit_bio_hook(page->mapping->host, bio,
   ^~~~
   fs/btrfs/extent_io.c:2739:36: note: each undeclared identifier is reported 
only once for each function it appears in

vim +/page +2739 fs/btrfs/extent_io.c

70dec807 Chris Mason 2008-01-29  2733   struct extent_io_tree *tree = 
bio->bi_private;
70dec807 Chris Mason 2008-01-29  2734  
902b22f3 David Woodhouse 2008-08-20  2735   bio->bi_private = NULL;
d1310b2e Chris Mason 2008-01-24  2736   bio_get(bio);
d1310b2e Chris Mason 2008-01-24  2737  
20c9801d David Sterba2017-02-17  2738   if (tree->ops)
81a75f67 Mike Christie   2016-06-05 @2739   ret = 
tree->ops->submit_bio_hook(page->mapping->host, bio,
178218fd Liu Bo  2017-04-03  2740   
   mirror_num, bio_flags, 0);
0b86a832 Chris Mason 2008-03-24  2741   else
4e49ea4a Mike Christie   2016-06-05  2742   btrfsic_submit_bio(bio);

:: The code at line 2739 was first introduced by commit
:: 81a75f6781deb7a3b5274b4c683e327e5cb5b883 btrfs: use bio fields for op 
and flags

:: TO: Mike Christie 
:: CC: Jens Axboe 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] btrfs: delete unused member nobarriers

2017-04-04 Thread Anand Jain



On 04/03/2017 08:06 PM, David Sterba wrote:

Please update the changelog to say why it's ok to remove it, eg. the
commit that removed the last user.

commit b25de9d6da49b1a8760a89672283128aa8c78345
Author: Christoph Hellwig 
Date:   Fri Apr 24 21:41:01 2015 +0200

block: remove BIO_EOPNOTSUPP


 Ah. I should have search the commit log. sorry about that.


IMO, there is a bug in generic_make_request_checks() in which
it should rather return EOPNOTSUPP, instead of EIO if QUEUE_FLAG_WC
is not supported.


1853 static noinline_for_stack bool
1854 generic_make_request_checks(struct bio *bio)
1855 {
::

1892 /*
1893  * Filter flush bio's early so that make_request based
1894  * drivers without flush support don't have to worry
1895  * about them.
1896  */
1897 if ((bio->bi_opf & (REQ_PREFLUSH | REQ_FUA)) &&
1898 !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
1899 bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
1900 if (!nr_sectors) {
1901 err = 0;
1902 goto end_io; <- this should goto  not_supported
1903 }
1904 }

::

1946 not_supported:
1947 err = -EOPNOTSUPP;



 Pls ignore this patch.

 I have submitted
[PATCH] btrfs: check if the device is flush capable

 which will remain unaffected by the above bug/not-a-bug
 in the blk core code.

Thanks, Anand



Otherwise the patch is ok.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: check if the device is flush capable

2017-04-04 Thread Anand Jain
blkdev_issue_flush() or the empty buffer with the flag REQ_PREFLUSH
will never return BIO_EOPNOTSUPP as of now, however it should rather
or it may in future. So for now the BTRFS to have least affected by
this change at the blk layer, we can check if the device is flush
capable.

In this process, I rename the unused nobarrier member as flushable,
which the below patch made it redundant.

Per commit b25de9d6da49b1a8760a89672283128aa8c78345
   block: remove BIO_EOPNOTSUPP

Signed-off-by: Anand Jain 
---
v1:
  This patch will replace
  [PATCH] btrfs: delete unused member nobarriers


 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/volumes.c | 6 ++
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3c476b118440..6c15f5685d25 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3517,7 +3517,7 @@ static void btrfs_dev_issue_flush(struct work_struct 
*work)
  */
 static int write_dev_flush(struct btrfs_device *device, int wait)
 {
-   if (device->nobarriers)
+   if (!device->flushable)
return 0;
 
if (wait) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4de5b2d549bd..3c5142d66260 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1010,6 +1010,8 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
device->can_discard = 1;
if (!blk_queue_nonrot(q))
fs_devices->rotating = 1;
+   if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+   device->flushable = true;
 
device->bdev = bdev;
device->in_fs_metadata = 0;
@@ -2377,6 +2379,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
device->can_discard = 1;
+   if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+   device->flushable = true;
device->writeable = 1;
device->generation = trans->transid;
device->io_width = fs_info->sectorsize;
@@ -2580,6 +2584,8 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info 
*fs_info,
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
device->can_discard = 1;
+   if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+   device->flushable = true;
mutex_lock(&fs_info->fs_devices->device_list_mutex);
device->writeable = 1;
device->generation = 0;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0df50bc65578..bbaf5238b6ca 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -123,7 +123,7 @@ struct btrfs_device {
struct list_head resized_list;
 
/* for sending down flush barriers */
-   int nobarriers;
+   bool flushable;
struct completion flush_wait;
struct work_struct flush_work;
int last_flush_error;
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: use q which is already obtained from bdev_get_queue

2017-04-04 Thread Anand Jain
We have already assigned q from bdev_get_queue() so use it.
And rearrange the code for better view.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1563ae03079b..4de5b2d549bd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1008,14 +1008,13 @@ static int __btrfs_open_devices(struct btrfs_fs_devices 
*fs_devices,
q = bdev_get_queue(bdev);
if (blk_queue_discard(q))
device->can_discard = 1;
+   if (!blk_queue_nonrot(q))
+   fs_devices->rotating = 1;
 
device->bdev = bdev;
device->in_fs_metadata = 0;
device->mode = flags;
 
-   if (!blk_queue_nonrot(bdev_get_queue(bdev)))
-   fs_devices->rotating = 1;
-
fs_devices->open_devices++;
if (device->writeable &&
device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2417,7 +2416,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
fs_info->free_chunk_space += device->total_bytes;
spin_unlock(&fs_info->free_chunk_lock);
 
-   if (!blk_queue_nonrot(bdev_get_queue(bdev)))
+   if (!blk_queue_nonrot(q))
fs_info->fs_devices->rotating = 1;
 
tmp = btrfs_super_total_bytes(fs_info->super_copy);
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] nowait aio: ext4

2017-04-04 Thread Christoph Hellwig
On Tue, Apr 04, 2017 at 09:58:53AM +0200, Jan Kara wrote:
> FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
> Can we call it FS_NOWAIT_IO?

It's way to generic as it's a feature of the particular file_operations
instance.  But once we switch to using RWF_* we can just the existing
per-op feature checks for thos and the per-fs flag should just go away.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] nowait aio: ext4

2017-04-04 Thread Jan Kara
On Mon 03-04-17 13:53:05, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Return EAGAIN if any of the following checks fail for direct I/O:
>  + i_rwsem is lockable
>  + Writing beyond end of file (will trigger allocation)
>  + Blocks are not allocated at the write location

Patches seem to be missing your Signed-off-by tag...

> @@ -235,9 +237,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>  
>   iocb->private = &overwrite;
>   /* Check whether we do a DIO overwrite or not */
> - if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
> - ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
> - overwrite = 1;
> + if (o_direct && !unaligned_aio) {
> + struct ext4_map_blocks map;
> + if (ext4_blocks_mapped(inode, iocb->ki_pos,
> +   iov_iter_count(from), &map)) {
> + /* To exclude unwritten extents, we need to check
> +  * m_flags.
> +  */
> + if (ext4_should_dioread_nolock(inode) &&
> + (map.m_flags & EXT4_MAP_MAPPED))
> + overwrite = 1;
> + } else if (iocb->ki_flags & IOCB_NOWAIT) {
> + ret = -EAGAIN;
> + goto out;
> + }
> + }

Actually, overwriting unwritten extents is relatively complex in ext4 as
well. In particular we need to start a transaction and split out the
written part of the extent. So I don't think we can easily support this
without blocking. As a result I'd keep the condition for IOCB_NOWAIT the
same as for overwrite IO.

> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -117,7 +117,7 @@ static struct file_system_type ext2_fs_type = {
>   .name   = "ext2",
>   .mount  = ext4_mount,
>   .kill_sb= kill_block_super,
> - .fs_flags   = FS_REQUIRES_DEV,
> + .fs_flags   = FS_REQUIRES_DEV | FS_NOWAIT,

FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
Can we call it FS_NOWAIT_IO?

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html