Re: [PATCH 1/2] btrfs: add small program for clone testing

2014-02-06 Thread David Disseldorp
Hi Dave,

On Thu, 6 Feb 2014 10:09:36 +1100, Dave Chinner wrote:

 On Wed, Feb 05, 2014 at 12:16:48PM +0100, David Disseldorp wrote:
  The cloner program is capable of cloning files using the BTRFS_IOC_CLONE
  and BTRFS_IOC_CLONE_RANGE ioctls.
  
  Signed-off-by: David Disseldorp dd...@suse.de
 
 Hi Dave - long time since I've seen your head pop up around here ;)

Indeed, it's been a while. Thanks for the review :)

 
 A few comments below.
 
  +struct btrfs_ioctl_clone_range_args {
  +   int64_t src_fd;
  +   uint64_t src_offset;
  +   uint64_t src_length;
  +   uint64_t dest_offset;
  +};
  +
  +#define BTRFS_IOCTL_MAGIC 0x94
  +#define BTRFS_IOC_CLONE   _IOW(BTRFS_IOCTL_MAGIC, 9, int)
  +#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
  +  struct btrfs_ioctl_clone_range_args)
 
 Is there some published header file that these belong to? i.e.
 somewhere in the include/linux/uapi/ kernel directory? Normally the
 way to handle this sort of thing is by autoconf - if the header file
 exists, then we include it, otherwise we use the manual definitions.
 This just means that if the public api ever changes, we'll pick it
 up automatically in future...

I'd wanted to avoid the addition of another xfsqa prereq, but I guess
it'll work with the fall-back.
I'll add the autoconf logic to the next round, along with changes
addressing your other remarks.

Cheers, David
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2 ] Btrfs: switch to btrfs_previous_extent_item()

2014-02-06 Thread Wang Shilong
From: Wang Shilong wangsl.f...@cn.fujitsu.com

Since we have introduced btrfs_previous_extent_item() to search previous
extent item, just switch into it.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
Changelog:
v1-v2:
set @found_key properly and fix return value(thanks to Filipe)
---
 fs/btrfs/backref.c | 37 ++---
 1 file changed, 6 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index c39891c..4dedb89 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1325,38 +1325,13 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, 
u64 logical,
if (ret  0)
return ret;
 
-   while (1) {
-   u32 nritems;
-   if (path-slots[0] == 0) {
-   btrfs_set_path_blocking(path);
-   ret = btrfs_prev_leaf(fs_info-extent_root, path);
-   if (ret != 0) {
-   if (ret  0) {
-   pr_debug(logical %llu is not within 
-any extent\n, logical);
-   ret = -ENOENT;
-   }
-   return ret;
-   }
-   } else {
-   path-slots[0]--;
-   }
-   nritems = btrfs_header_nritems(path-nodes[0]);
-   if (nritems == 0) {
-   pr_debug(logical %llu is not within any extent\n,
-logical);
-   return -ENOENT;
-   }
-   if (path-slots[0] == nritems)
-   path-slots[0]--;
-
-   btrfs_item_key_to_cpu(path-nodes[0], found_key,
- path-slots[0]);
-   if (found_key-type == BTRFS_EXTENT_ITEM_KEY ||
-   found_key-type == BTRFS_METADATA_ITEM_KEY)
-   break;
+   ret = btrfs_previous_extent_item(fs_info-extent_root, path, 0);
+   if (ret) {
+   if (ret  0)
+   ret = -ENOENT;
+   return ret;
}
-
+   btrfs_item_key_to_cpu(path-nodes[0], found_key, path-slots[0]);
if (found_key-type == BTRFS_METADATA_ITEM_KEY)
size = fs_info-extent_root-leafsize;
else if (found_key-type == BTRFS_EXTENT_ITEM_KEY)
-- 
1.8.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2 ] Btrfs: switch to btrfs_previous_extent_item()

2014-02-06 Thread Filipe David Manana
On Thu, Feb 6, 2014 at 12:02 PM, Wang Shilong wangshilong1...@gmail.com wrote:
 From: Wang Shilong wangsl.f...@cn.fujitsu.com

 Since we have introduced btrfs_previous_extent_item() to search previous
 extent item, just switch into it.

 Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com

Reviewed-by: Filipe Manana fdman...@gmail.com

Thanks Shilong.

 ---
 Changelog:
 v1-v2:
 set @found_key properly and fix return value(thanks to Filipe)
 ---
  fs/btrfs/backref.c | 37 ++---
  1 file changed, 6 insertions(+), 31 deletions(-)

 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index c39891c..4dedb89 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
 @@ -1325,38 +1325,13 @@ int extent_from_logical(struct btrfs_fs_info 
 *fs_info, u64 logical,
 if (ret  0)
 return ret;

 -   while (1) {
 -   u32 nritems;
 -   if (path-slots[0] == 0) {
 -   btrfs_set_path_blocking(path);
 -   ret = btrfs_prev_leaf(fs_info-extent_root, path);
 -   if (ret != 0) {
 -   if (ret  0) {
 -   pr_debug(logical %llu is not within 
 -any extent\n, logical);
 -   ret = -ENOENT;
 -   }
 -   return ret;
 -   }
 -   } else {
 -   path-slots[0]--;
 -   }
 -   nritems = btrfs_header_nritems(path-nodes[0]);
 -   if (nritems == 0) {
 -   pr_debug(logical %llu is not within any extent\n,
 -logical);
 -   return -ENOENT;
 -   }
 -   if (path-slots[0] == nritems)
 -   path-slots[0]--;
 -
 -   btrfs_item_key_to_cpu(path-nodes[0], found_key,
 - path-slots[0]);
 -   if (found_key-type == BTRFS_EXTENT_ITEM_KEY ||
 -   found_key-type == BTRFS_METADATA_ITEM_KEY)
 -   break;
 +   ret = btrfs_previous_extent_item(fs_info-extent_root, path, 0);
 +   if (ret) {
 +   if (ret  0)
 +   ret = -ENOENT;
 +   return ret;
 }
 -
 +   btrfs_item_key_to_cpu(path-nodes[0], found_key, path-slots[0]);
 if (found_key-type == BTRFS_METADATA_ITEM_KEY)
 size = fs_info-extent_root-leafsize;
 else if (found_key-type == BTRFS_EXTENT_ITEM_KEY)
 --
 1.8.4

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Roman Mamedov
On Thu, 06 Feb 2014 09:38:15 +0200
Brendan Hide bren...@swiftspirit.co.za wrote:

 This is a known issue: 
 https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F
 Btrfs is still considered experimental 

It's long overdue to start tackling these snags and 'stop hiding behind the
experimental flag' [1], which is also no longer present as of 3.13.

[1] http://www.spinics.net/lists/linux-btrfs/msg30396.html

 this is just one of those caveats we've learned to adjust to.

Sure, but it's hard to argue this particular behavior is clearly broken from
the user perspective, even if it's broken by design, and there are a number
of very smart and future-proof reasons to keep it broken today.

Personally I tired of trying to keep in mind which partitions are btrfs raid1,
and mentally divide the displayed free space by two for those. That's what
computers are for, to keep track of such things.

 The change could work well for now and I'm sure it has been considered. 
 I guess the biggest end-user issue is that you can, at a whim, change 
 the model for new blocks - raid0/5/6,single etc and your value from 5 
 minutes ago is far out from your new value without having written 
 anything or taken up any space. Not a show-stopper problem, really.

Changing the allocation profile for new blocks is a serious change you don't do
accidentally, it's about the same importance level as e.g. resizing the
filesystem. And no one is really surprised when the 'df' result changes after
an FS resize.

 The biggest dev issue is that future features will break this behaviour, 

That could be years away.

 such as the per-subvolume RAID profiles you mentioned. It is difficult 
 to motivate including code (for which there's a known workaround) where 
 we know it will be obsoleted.

There's not a lot of code to include (as my 3-line patch demonstrates), it
could just as easily be removed when it's obsolete. But I did not have any
high hopes of defeating the broken by design philosophy, that's why I didn't
submit it as a real patch for inclusion but rather just as a helpful hint for
people to add to their own kernels if they want this change to happen.

-- 
With respect,
Roman


signature.asc
Description: PGP signature


Re: [PATCH] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Wang Shilong
Hi Dave,

 On Mon, Feb 03, 2014 at 11:22:36PM +0800, Wang Shilong wrote:
 From: Wang Shilong wangsl.f...@cn.fujitsu.com
 
 Btrfs would fail to send if snapshot run concurrently, this test is to make
 sure we have fixed the bug.
 
 Couple of comments below.
 
 +_scratch_mkfs  /dev/null 21
 +_scratch_mount
 +
 +
 +touch $SCRATCH_MNT/foo
 +
 +# get file with fragments by using backwards writes.
 +for i in `seq 10240 -1 1`; do
 +$XFS_IO_PROG -f -d -c pwrite $(($i * 4096)) 4096 \
 +$SCRATCH_MNT/foo  /dev/null | _filter_xfs_io
 
 Indentation.
 
 +done
 +
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
 +$SCRATCH_MNT/snap_1  $seqres.full 21
 +
 +$BTRFS_UTIL_PROG send -f $SCRATCH_MNT/send_file \
 +$SCRATCH_MNT/snap_1  $seqres.full 21 
 +
 +pid=$!
 +
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \
 +$SCRATCH_MNT/snap_2  $seqres.full 21
 +
 +wait $pid || echo Failed to send, see dmesg
 
 This seems kind of racy. It assumes that the send command
 doesn't complete before the wait $pid call is made. If $pid doesn't
 exist at this time because it has completed, wait will return 127
 and the test will fail….

Sorry for delay reply!

Maybe a better idea for this will be:

Opposite to previous way, we do snapshots background while at the
same time we do sending.

And btrfs-progs should output meaningful information on send failure, i will
make a tiny patch to address this issue. but to make this test more friendly, i
think we can still do something like:

btrfs send ..  || echo Failed to send

What do you think of this way?  Feel free to tell me if there are better 
ideas^_^.

Thanks,
Wang
 Also, why would a failure to send result in meaingful information in
 dmesg? Shouldn't the userspace command output information to tell
 you why there was a failure into $seqres.full?
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/3] __btrfs_drop_extents() BUG_ON reproducer

2014-02-06 Thread David Disseldorp
This patch-set provides a reproducer for hitting the 3.14.0-rc1 BUG_ON()
at:
 692 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
...
 839 /*
 840  *  |  range to drop - |
 841  *  |  extent  |
 842  */
 843 if (start = key.offset  end  extent_end) {
 844 BUG_ON(extent_type == BTRFS_FILE_EXTENT_INLINE);
 845 
 846 memcpy(new_key, key, sizeof(new_key));

The first patch adds a small cloner binary which is used by btrfs/035 to
dispatch BTRFS_IOC_CLONE_RANGE requests.

This workload resembles that of Samba's vfs_btrfs module, when a Windows
client restores a file from a shadow-copy (snapshot) using server-side
copy requests.

Changes since V1:
- Use strtoull instead of atoi
- Print error conditions in cloner
- Check for cloner binary before running test
- Continue test on failure
- Add cloner to .gitignore

Feedback appreciated.

Cheers, David


 .gitignore  |   1 +
 configure.ac|   1 +
 src/Makefile|   2 +-
 src/cloner.c| 192 
+++
 tests/btrfs/035 |  77 
+
 tests/btrfs/035.out |   3 +++
 tests/btrfs/group   |   1 +
 7 files changed, 276 insertions(+), 1 deletion(-)
 create mode 100644 src/cloner.c
 create mode 100755 tests/btrfs/035
 create mode 100644 tests/btrfs/035.out
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/3] src/cloner: use btrfs/ioctl.h header if present

2014-02-06 Thread David Disseldorp
Check for the btrfsprogs-devel ioctl.h header at configure time. Use it
in src/cloner if present, otherwise fall back to using the copied clone
ioctl definitions.

Signed-off-by: David Disseldorp dd...@suse.de
---
 configure.ac | 1 +
 src/cloner.c | 4 
 2 files changed, 5 insertions(+)

diff --git a/configure.ac b/configure.ac
index bd48fd9..6fba3ad 100644
--- a/configure.ac
+++ b/configure.ac
@@ -30,6 +30,7 @@ AC_HEADER_STDC
 AC_CHECK_HEADERS([ sys/fs/xfs_fsops.h  \
sys/fs/xfs_itable.h \
xfs/platform_defs.h \
+   btrfs/ioctl.h   \
 ])
 
 AC_PACKAGE_NEED_UUIDCOMPARE
diff --git a/src/cloner.c b/src/cloner.c
index dfce837..ccc2354 100644
--- a/src/cloner.c
+++ b/src/cloner.c
@@ -30,6 +30,9 @@
 #include stdio.h
 #include string.h
 #include errno.h
+#ifdef HAVE_BTRFS_IOCTL_H
+#include btrfs/ioctl.h
+#else
 
 struct btrfs_ioctl_clone_range_args {
int64_t src_fd;
@@ -42,6 +45,7 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_IOC_CLONE   _IOW(BTRFS_IOCTL_MAGIC, 9, int)
 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
   struct btrfs_ioctl_clone_range_args)
+#endif
 
 static void
 usage(char *name, const char *msg)
-- 
1.8.4.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/3] btrfs: add small program for clone testing

2014-02-06 Thread David Disseldorp
The cloner program is capable of cloning files using the BTRFS_IOC_CLONE
and BTRFS_IOC_CLONE_RANGE ioctls.

Signed-off-by: David Disseldorp dd...@suse.de
---
 .gitignore   |   1 +
 src/Makefile |   2 +-
 src/cloner.c | 188 +++
 3 files changed, 190 insertions(+), 1 deletion(-)
 create mode 100644 src/cloner.c

diff --git a/.gitignore b/.gitignore
index ee4cb41..b6f2463 100644
--- a/.gitignore
+++ b/.gitignore
@@ -104,6 +104,7 @@
 /src/aio-dio-regress/aio-free-ring-with-bogus-nr-pages
 /src/aio-dio-regress/aio-io-setup-with-nonwritable-context-pointer
 /src/aio-dio-regress/aiodio_sparse2
+/src/cloner
 
 # dmapi/ binaries
 /dmapi/src/common/cmd/read_invis
diff --git a/src/Makefile b/src/Makefile
index 84c8297..6509f2d 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -18,7 +18,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
preallo_rw_pattern_reader \
locktest unwritten_mmap bulkstat_unlink_test t_stripealign \
bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
-   seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec
+   seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner
 
 SUBDIRS =
 
diff --git a/src/cloner.c b/src/cloner.c
new file mode 100644
index 000..dfce837
--- /dev/null
+++ b/src/cloner.c
@@ -0,0 +1,188 @@
+/*
+ *  Tiny program to perform file (range) clones using raw Btrfs ioctls.
+ *  It should only be needed until btrfs-progs has an xfs_io equivalent.
+ *
+ *  Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ */
+
+#include sys/types.h
+#include sys/stat.h
+#include sys/ioctl.h
+#include stdint.h
+#include stdbool.h
+#include fcntl.h
+#include unistd.h
+#include stdlib.h
+#include stdio.h
+#include string.h
+#include errno.h
+
+struct btrfs_ioctl_clone_range_args {
+   int64_t src_fd;
+   uint64_t src_offset;
+   uint64_t src_length;
+   uint64_t dest_offset;
+};
+
+#define BTRFS_IOCTL_MAGIC 0x94
+#define BTRFS_IOC_CLONE   _IOW(BTRFS_IOCTL_MAGIC, 9, int)
+#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
+  struct btrfs_ioctl_clone_range_args)
+
+static void
+usage(char *name, const char *msg)
+{
+   printf(Fatal: %s\n
+  Usage:\n
+  %s [options] src_file dest_file\n
+  \tA full file clone (reflink) is performed by default, 
+  unless any of the following are specified:\n
+  \t-s offset:  source file offset (default = 0)\n
+  \t-d offset:  destination file offset (default = 0)\n
+  \t-l length:  length of clone (default = 0)\n,
+  msg, name);
+   _exit(1);
+}
+
+static int
+clone_file(int src_fd, int dst_fd)
+{
+   int ret = ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd);
+   if (ret != 0)
+   ret = errno;
+   return ret;
+}
+
+static int
+clone_file_range(int src_fd, int dst_fd, uint64_t src_off, uint64_t dst_off,
+uint64_t len)
+{
+   struct btrfs_ioctl_clone_range_args cr_args;
+   int ret;
+
+   memset(cr_args, 0, sizeof(cr_args));
+   cr_args.src_fd = src_fd;
+   cr_args.src_offset = src_off;
+   cr_args.src_length = len;
+   cr_args.dest_offset = dst_off;
+   ret = ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, cr_args);
+   if (ret != 0)
+   ret = errno;
+   return ret;
+}
+
+int
+main(int argc, char **argv)
+{
+   bool full_file = true;
+   uint64_t src_off = 0;
+   uint64_t dst_off = 0;
+   uint64_t len = 0;
+   char *src_file;
+   int src_fd;
+   char *dst_file;
+   int dst_fd;
+   int ret;
+   int opt;
+
+   while ((opt = getopt(argc, argv, s:d:l:)) != -1) {
+   char *sval_end;
+   switch (opt) {
+   case 's':
+   errno = 0;
+   src_off = strtoull(optarg, sval_end, 10);
+   if ((errno) || (*sval_end != '\0'))
+   usage(argv[0], invalid source offset);
+   full_file = false;
+   break;
+   

[PATCH v2 3/3] btrfs/035: add new clone overwrite regression test

2014-02-06 Thread David Disseldorp
This test uses the newly added cloner binary to dispatch full file and
range specific clone (reflink) requests.

Signed-off-by: David Disseldorp dd...@suse.de
---
 tests/btrfs/035 | 77 +
 tests/btrfs/035.out |  3 +++
 tests/btrfs/group   |  1 +
 3 files changed, 81 insertions(+)
 create mode 100755 tests/btrfs/035
 create mode 100644 tests/btrfs/035.out

diff --git a/tests/btrfs/035 b/tests/btrfs/035
new file mode 100755
index 000..49d9ece
--- /dev/null
+++ b/tests/btrfs/035
@@ -0,0 +1,77 @@
+#!/bin/bash
+# FS QA Test No. btrfs/035
+#
+# Regression test for overwriting clones
+#
+#---
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+rm -f $tmp.*
+}
+
+trap _cleanup ; exit \$status 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs  /dev/null 21
+_scratch_mount
+
+CLONER_PROG=$here/src/cloner
+[ -x $CLONER_PROG ] || _notrun cloner binary not present at $CLONER_PROG
+
+src_str=aa
+
+echo -n $src_str  $SCRATCH_MNT/src || echo failed to create src
+
+$CLONER_PROG $SCRATCH_MNT/src  $SCRATCH_MNT/src.clone1
+
+src_str=bbcc
+
+echo -n $src_str  $SCRATCH_MNT/src || echo failed to create src
+
+$CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
+
+snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
+echo attempting ioctl (src.clone1 src)
+$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
+   $SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
+
+snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
+echo attempting ioctl (src.clone2 src)
+$CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
+   $SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
+
+status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
new file mode 100644
index 000..f86cadf
--- /dev/null
+++ b/tests/btrfs/035.out
@@ -0,0 +1,3 @@
+QA output created by 035
+attempting ioctl (src.clone1 src)
+attempting ioctl (src.clone2 src)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f9f062f..bee57cb 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -37,3 +37,4 @@
 032 auto quick
 033 auto quick
 034 auto quick
+035 auto quick
-- 
1.8.4.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ioctl: add note regarding CLONE_RANGE(len=0) behaviour

2014-02-06 Thread David Disseldorp
A BTRFS_IOC_CLONE_RANGE request with a src_length value of zero has the
effect of cloning all data from src_offset through to end-of-file.

Document this behaviour in the header file for those who (like me)
incorrectly assume that no data is cloned in such a case.

Signed-off-by: David Disseldorp dd...@suse.de
---
 ioctl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ioctl.h b/ioctl.h
index a589cd7..f7d435d 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -506,6 +506,7 @@ static inline char *btrfs_err_str(enum btrfs_err_code 
err_code)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
 
+/* With a @src_length of zero, the range from @src_offset-EOF is cloned! */
 struct btrfs_ioctl_clone_range_args {
__s64 src_fd;
__u64 src_offset, src_length;
-- 
1.8.4.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Wang Shilong
From: Wang Shilong wangsl.f...@cn.fujitsu.com

Btrfs would fail to send if snapshot run concurrently, this test is to make
sure we have fixed the bug.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
Changelog v1-v2:
Avoid race codes and a code style update(Thanks to Dave Chinner's 
comments)
---
 tests/btrfs/034 | 83 +
 tests/btrfs/034.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 86 insertions(+)
 create mode 100644 tests/btrfs/034
 create mode 100644 tests/btrfs/034.out

diff --git a/tests/btrfs/034 b/tests/btrfs/034
new file mode 100644
index 000..1fcd792
--- /dev/null
+++ b/tests/btrfs/034
@@ -0,0 +1,83 @@
+#!/bin/bash
+# FS QA Test No. btrfs/034
+#
+# Regression test for running snapshots and send concurrently.
+#
+#---
+# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+rm -f $tmp.*
+}
+
+do_snapshots()
+{
+   i=2
+   while [ 1 ]
+   do
+   $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \
+   $SCRATCH_MNT/snap_$i  $seqres.full 21
+   let i=$i+1
+   done
+}
+
+trap _cleanup ; exit \$status 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs  /dev/null 21
+_scratch_mount
+
+
+touch $SCRATCH_MNT/foo
+
+# get file with fragments by using backwards writes.
+for i in `seq 10240 -1 1`; do
+   $XFS_IO_PROG -f -d -c pwrite $(($i * 4096)) 4096 \
+   $SCRATCH_MNT/foo  /dev/null | _filter_xfs_io
+done
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/snap_1  $seqres.full 21
+
+do_snapshots 
+snapshots_pid=$!
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1  /dev/null 21 || _fail btrfs 
send failed
+
+kill -TERM $snapshots_pid 2 /dev/null
+
+echo Silence is golden
+status=0 ; exit
diff --git a/tests/btrfs/034.out b/tests/btrfs/034.out
new file mode 100644
index 000..4c8873c
--- /dev/null
+++ b/tests/btrfs/034.out
@@ -0,0 +1,2 @@
+QA output created by 034
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index b29236c..f9f062f 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -36,3 +36,4 @@
 031 auto quick
 032 auto quick
 033 auto quick
+034 auto quick
-- 
1.8.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: throttle delayed refs better

2014-02-06 Thread Josef Bacik


On 02/05/2014 05:57 PM, Johannes Hirte wrote:

On Wed, 5 Feb 2014 16:46:57 -0500
Josef Bacik jba...@fb.com wrote:


On 02/05/2014 04:42 PM, Johannes Hirte wrote:

On Wed, 5 Feb 2014 14:36:39 -0500
Josef Bacik jba...@fb.com wrote:


On 02/05/2014 02:30 PM, Johannes Hirte wrote:

On Wed, 5 Feb 2014 14:00:57 -0500
Josef Bacik jba...@fb.com wrote:


On 02/05/2014 12:34 PM, Johannes Hirte wrote:

On Wed, 5 Feb 2014 10:49:15 -0500
Josef Bacik jba...@fb.com wrote:


Ok none of those make sense which makes me think it may be the
ktime bits, instead of un-applying the whole patch could you
just comment out the parts

 ktime_t start = ktime_get();

and

 if (actual_count  0) {
 u64 runtime =
ktime_to_ns(ktime_sub(ktime_get(), start)); u64 avg;

 /*
  * We weigh the current average higher than
our current runtime
  * to avoid large swings in the average.
  */
 spin_lock(delayed_refs-lock);
 avg = fs_info-avg_delayed_ref_runtime * 3
+ runtime; avg = div64_u64(avg, 4);
 fs_info-avg_delayed_ref_runtime = avg;
 spin_unlock(delayed_refs-lock);
 }

in __btrfs_run_delayed_refs and see if that makes the problem
stop? If it does will you try chris's for-linus branch to see
if it still reproduces there?  Maybe some patch changed
ktime_get() in -rc1 that is causing issues and we're just now
exposing it. Thanks,

With the ktime bits disabled, I wasn't able to reproduce the
problem anymore. With Chris' for-linus branch it took longer but
still appeared.


Ok can you send your .config, maybe there's some weird time bug
being exposed.  What kind of CPU do you have?  Thanks,

Josef

It's a Core i5-540M, dualcore + hyperthreading

Ok while I'm doing this can you change
btrfs_should_throttle_delayed_refs to _always_ return 1, still with
all the ktime stuff commented out, and see if that causes the
problem to happen?  Thanks,

Yes it does. Same behavior as without ktime stuff commented out.


Ok perfect, can you send me a btrfs fi df of that volume, and do you
have any snapshots or anything?  Thanks,

btrfs fi df /
Data, single: total=220.01GiB, used=210.85GiB
System, DUP: total=8.00MiB, used=32.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=4.00GiB, used=2.93GiB
Metadata, single: total=8.00MiB, used=0.00

No snapshots but several subvolumes. / itself is a seperate subvolume
and subvol 0 only contains the other subvolumes (5 at moment). qgroups
aren't enabled.

mount options are noatime,inode_cache, if this matters

I've managed to reproduce on one of my test boxes at work, I'll get to 
the bottom of this.  Thanks,


Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Wang Shilong
From: Wang Shilong wangsl.f...@cn.fujitsu.com

Btrfs would fail to send if snapshot run concurrently, this test is to make
sure we have fixed the bug.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
Changelog
 v1-v2:
Avoid race codes and a code style update(Thanks to Dave Chinner's 
comments)
 v2-v3:
make sure we kill background snapshots on test failure
---
 tests/btrfs/034 | 83 +
 tests/btrfs/034.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 86 insertions(+)
 create mode 100644 tests/btrfs/034
 create mode 100644 tests/btrfs/034.out

diff --git a/tests/btrfs/034 b/tests/btrfs/034
new file mode 100644
index 000..5152969
--- /dev/null
+++ b/tests/btrfs/034
@@ -0,0 +1,83 @@
+#!/bin/bash
+# FS QA Test No. btrfs/034
+#
+# Regression test for running snapshots and send concurrently.
+#
+#---
+# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+rm -f $tmp.*
+}
+
+do_snapshots()
+{
+   i=2
+   while [ $i -lt 50 ]
+   do
+   $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \
+   $SCRATCH_MNT/snap_$i  $seqres.full 21
+   let i=$i+1
+   done
+}
+
+trap _cleanup ; exit \$status 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs  /dev/null 21
+_scratch_mount
+
+
+touch $SCRATCH_MNT/foo
+
+# get file with fragments by using backwards writes.
+for i in `seq 10240 -1 1`; do
+   $XFS_IO_PROG -f -d -c pwrite $(($i * 4096)) 4096 \
+   $SCRATCH_MNT/foo  /dev/null | _filter_xfs_io
+done
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/snap_1  $seqres.full 21
+
+do_snapshots 
+snapshots_pid=$!
+
+$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1  /dev/null 21 || echo btrfs send 
failed
+
+kill -TERM $snapshots_pid 2 /dev/null
+
+echo Silence is golden
+status=0 ; exit
diff --git a/tests/btrfs/034.out b/tests/btrfs/034.out
new file mode 100644
index 000..4c8873c
--- /dev/null
+++ b/tests/btrfs/034.out
@@ -0,0 +1,2 @@
+QA output created by 034
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index b29236c..f9f062f 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -36,3 +36,4 @@
 031 auto quick
 032 auto quick
 033 auto quick
+034 auto quick
-- 
1.8.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: Convert BUG() to BUG_ON(1)

2014-02-06 Thread Mitch Harder
Convert the instances of BUG() to BUG_ON(1) to provide information
about the location of the abort.

Signed-off-by: Mitch Harder mitch.har...@sabayonlinux.org
---
 btrfs-debug-tree.c |  4 ++--
 ctree.c| 20 ++--
 ctree.h|  2 +-
 disk-io.c  |  4 ++--
 extent-tree.c  |  6 +++---
 extent_io.c|  2 +-
 file-item.c|  4 ++--
 print-tree.c   |  8 
 volumes.c  |  4 ++--
 9 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index f37de9d..0180265 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -68,10 +68,10 @@ static void print_extents(struct btrfs_root *root, struct 
extent_buffer *eb)
 btrfs_node_ptr_generation(eb, i));
if (btrfs_is_leaf(next) 
btrfs_header_level(eb) != 1)
-   BUG();
+   BUG_ON(1);
if (btrfs_header_level(next) !=
btrfs_header_level(eb) - 1)
-   BUG();
+   BUG_ON(1);
print_extents(root, next);
free_extent_buffer(next);
}
diff --git a/ctree.c b/ctree.c
index 9e5b30f..7aab3b1 100644
--- a/ctree.c
+++ b/ctree.c
@@ -822,7 +822,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
check_block(root, path, level);
if (orig_ptr !=
btrfs_node_blockptr(path-nodes[level], path-slots[level]))
-   BUG();
+   BUG_ON(1);
 enospc:
if (right)
free_extent_buffer(right);
@@ -1425,9 +1425,9 @@ static int insert_ptr(struct btrfs_trans_handle *trans, 
struct btrfs_root
lower = path-nodes[level];
nritems = btrfs_header_nritems(lower);
if (slot  nritems)
-   BUG();
+   BUG_ON(1);
if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
-   BUG();
+   BUG_ON(1);
if (slot != nritems) {
memmove_extent_buffer(lower,
  btrfs_node_key_ptr_offset(slot + 1),
@@ -2213,7 +2213,7 @@ split:
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
kfree(buf);
return ret;
@@ -2311,7 +2311,7 @@ int btrfs_truncate_item(struct btrfs_trans_handle *trans,
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
return ret;
 }
@@ -2337,7 +2337,7 @@ int btrfs_extend_item(struct btrfs_trans_handle *trans,
 
if (btrfs_leaf_free_space(root, leaf)  data_size) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
slot = path-slots[0];
old_data = btrfs_item_end_nr(leaf, slot);
@@ -2374,7 +2374,7 @@ int btrfs_extend_item(struct btrfs_trans_handle *trans,
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
return ret;
 }
@@ -2406,7 +2406,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
 
/* create a root if there isn't one */
if (!root-node)
-   BUG();
+   BUG_ON(1);
 
total_size = total_data + nr * sizeof(struct btrfs_item);
ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
@@ -2425,7 +2425,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
btrfs_print_leaf(root, leaf);
printk(not enough freespace need %u have %d\n,
   total_size, btrfs_leaf_free_space(root, leaf));
-   BUG();
+   BUG_ON(1);
}
 
slot = path-slots[0];
@@ -2484,7 +2484,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
 
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
 
 out:
diff --git a/ctree.h b/ctree.h
index a9c67b2..101389b 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1519,7 +1519,7 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
if (type == BTRFS_EXTENT_DATA_REF_KEY)
return sizeof(struct btrfs_extent_data_ref) +
   offsetof(struct btrfs_extent_inline_ref, offset);
-   BUG();
+   BUG_ON(1);
return 0;
 }
 
diff --git a/disk-io.c b/disk-io.c
index e840177..2a6c68f 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -349,10 +349,10 @@ static int write_tree_block(struct btrfs_trans_handle 
*trans,
 struct extent_buffer *eb)
 {
if (check_tree_block(root, eb))
-   BUG();
+   BUG_ON(1);
 
if 

Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS

2014-02-06 Thread David Sterba
On Mon, Jan 27, 2014 at 08:47:09AM +, Hugo Mills wrote:
 On Mon, Jan 27, 2014 at 04:52:50PM +0800, Anand Jain wrote:
  The user land progs needs   a simple way to read
  the raw list of disks and its parameters as
  btrfs kernel understands it. This patch will
  introduce BTRFS_IOC_GET_DEVS which dumps
  every thing under fs_devices.
  
  As of now btrfs-devlist uses this ioctl.
  
  In the long run this ioctl would help to optimize
  some part of btrfs-progs, mainly the current
  btrfs filesystem show
 
Just thinking out loud here, really, but can we export this
 information in /sys instead, rather than adding yet more ioctls?

I tend to agree that this belongs to sysfs, it's more flexible in case
we'll add more per-device stats, ie. one file that holds all the stats
about the device. This is also easier to use from scripts that gather
system information.

With 3.14 the sysfs interface is available, but the devices under
 /sys/fs/btrfs/fs-uuid/devices/...
are symlinks to the sysfs devices, so this btrfs-specific device
information has to be located in a separate directory.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Goffredo Baroncelli
Hi Roman

On 02/06/2014 01:45 PM, Roman Mamedov wrote:
 On Thu, 06 Feb 2014 09:38:15 +0200
[...]
 
 There's not a lot of code to include (as my 3-line patch demonstrates), it
 could just as easily be removed when it's obsolete. But I did not have any
 high hopes of defeating the broken by design philosophy, that's why I didn't
 submit it as a real patch for inclusion but rather just as a helpful hint for
 people to add to their own kernels if they want this change to happen.


I agree with you about the needing of a solution. However your patch to me 
seems even worse than the actual code.

For example you cannot take in account the mix of data/linear and metadata/dup 
(with the pathological case of small files stored in the metadata chunks ), nor 
different profile level like raid5/6 (or the future raidNxM)
And do not forget the compression...

The situation is very complex. I am inclined to use a different approach.

As you know, btrfs allocate space in chunk. Each chunk has an own ration 
between the data occupied on the disk, and the data available to the 
filesystem. For SINGLE the ratio is 1, for DUP/RAID1/RAID10 the ratio is 2, for 
raid 5 the ratio is n/(n-1) (where n is the stripes count), for raid 6 the 
ratio is n/(n-2)

Because a filesystem could have chunks with different ratios, we can compute a 
global ratio as the composition of the each chunk ratio:

for_each_chunk:
all_chunks_size += chunk_size[i]

for_each_chunk:
global_ratio += chunk_ratio[i] * chunk_size[i] / all_chunks_size

If we assume that this ratio is constant during the live of the filesystem, we 
can use it to get an estimation of the space available to the users as:

free_space = (all_disks_size-all_chunks_size)/global_ratio


The code above is a simplification, because we should take in account also the 
space available on each _already_allocated_ chunk.
We could further enhance this estimation, taking in account also the total 
files sizes and their space consumed in the chunks (this could be different due 
to the compression)

Even tough not perfect, it would be a better estimation than the actual one. 


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS

2014-02-06 Thread Goffredo Baroncelli
On 02/06/2014 08:49 PM, David Sterba wrote:
 On Mon, Jan 27, 2014 at 08:47:09AM +, Hugo Mills wrote:
 On Mon, Jan 27, 2014 at 04:52:50PM +0800, Anand Jain wrote:
 The user land progs needs   a simple way to read
 the raw list of disks and its parameters as
 btrfs kernel understands it. This patch will
 introduce BTRFS_IOC_GET_DEVS which dumps
 every thing under fs_devices.

 As of now btrfs-devlist uses this ioctl.

 In the long run this ioctl would help to optimize
 some part of btrfs-progs, mainly the current
 btrfs filesystem show

Just thinking out loud here, really, but can we export this
 information in /sys instead, rather than adding yet more ioctls?
 
 I tend to agree that this belongs to sysfs, it's more flexible in case
 we'll add more per-device stats, ie. one file that holds all the stats
 about the device. This is also easier to use from scripts that gather
 system information.
 
 With 3.14 the sysfs interface is available, but the devices under
  /sys/fs/btrfs/fs-uuid/devices/...
 are symlinks to the sysfs devices, so this btrfs-specific device
 information has to be located in a separate directory.

yes please; it would scale better than the ioctl()s; anyway I suggest 1
file per property, and not ...one file that holds all the stats
 about the device

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Convert BUG() to BUG_ON(1)

2014-02-06 Thread Josef Bacik


On 02/06/2014 01:34 PM, Mitch Harder wrote:

Convert the instances of BUG() to BUG_ON(1) to provide information
about the location of the abort.


If we're going to screw with our BUG()'s in any sort of big way like 
this I'd rather it be to swap us over to ASSERT() in the cases that it 
makes sense and return errors in those cases.



Signed-off-by: Mitch Harder mitch.har...@sabayonlinux.org
---
  btrfs-debug-tree.c |  4 ++--
  ctree.c| 20 ++--
  ctree.h|  2 +-
  disk-io.c  |  4 ++--
  extent-tree.c  |  6 +++---
  extent_io.c|  2 +-
  file-item.c|  4 ++--
  print-tree.c   |  8 
  volumes.c  |  4 ++--
  9 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index f37de9d..0180265 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -68,10 +68,10 @@ static void print_extents(struct btrfs_root *root, struct 
extent_buffer *eb)
 btrfs_node_ptr_generation(eb, i));
if (btrfs_is_leaf(next) 
btrfs_header_level(eb) != 1)
-   BUG();
+   BUG_ON(1);
if (btrfs_header_level(next) !=
btrfs_header_level(eb) - 1)
-   BUG();
+   BUG_ON(1);
print_extents(root, next);
free_extent_buffer(next);
}
diff --git a/ctree.c b/ctree.c
index 9e5b30f..7aab3b1 100644
--- a/ctree.c
+++ b/ctree.c
@@ -822,7 +822,7 @@ static int balance_level(struct btrfs_trans_handle *trans,
check_block(root, path, level);
if (orig_ptr !=
btrfs_node_blockptr(path-nodes[level], path-slots[level]))
-   BUG();
+   BUG_ON(1);
  enospc:
if (right)
free_extent_buffer(right);
@@ -1425,9 +1425,9 @@ static int insert_ptr(struct btrfs_trans_handle *trans, 
struct btrfs_root
lower = path-nodes[level];
nritems = btrfs_header_nritems(lower);
if (slot  nritems)
-   BUG();
+   BUG_ON(1);
if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
-   BUG();
+   BUG_ON(1);


These are perfect cases for ASSERT()


if (slot != nritems) {
memmove_extent_buffer(lower,
  btrfs_node_key_ptr_offset(slot + 1),
@@ -2213,7 +2213,7 @@ split:
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);


For stuff like this we could add a ASSERT_LEAF() or something like that 
to do the btrfs_print_leaf first and then do the normal panic thing.

}
kfree(buf);
return ret;
@@ -2311,7 +2311,7 @@ int btrfs_truncate_item(struct btrfs_trans_handle *trans,
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
return ret;
  }
@@ -2337,7 +2337,7 @@ int btrfs_extend_item(struct btrfs_trans_handle *trans,
  
  	if (btrfs_leaf_free_space(root, leaf)  data_size) {

btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
slot = path-slots[0];
old_data = btrfs_item_end_nr(leaf, slot);
@@ -2374,7 +2374,7 @@ int btrfs_extend_item(struct btrfs_trans_handle *trans,
ret = 0;
if (btrfs_leaf_free_space(root, leaf)  0) {
btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
return ret;
  }
@@ -2406,7 +2406,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
  
  	/* create a root if there isn't one */

if (!root-node)
-   BUG();
+   BUG_ON(1);
  
  	total_size = total_data + nr * sizeof(struct btrfs_item);

ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
@@ -2425,7 +2425,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
btrfs_print_leaf(root, leaf);
printk(not enough freespace need %u have %d\n,
   total_size, btrfs_leaf_free_space(root, leaf));
-   BUG();
+   BUG_ON(1);
}
  
  	slot = path-slots[0];

@@ -2484,7 +2484,7 @@ int btrfs_insert_empty_items(struct btrfs_trans_handle 
*trans,
  
  	if (btrfs_leaf_free_space(root, leaf)  0) {

btrfs_print_leaf(root, leaf);
-   BUG();
+   BUG_ON(1);
}
  
  out:

diff --git a/ctree.h b/ctree.h
index a9c67b2..101389b 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1519,7 +1519,7 @@ static inline u32 btrfs_extent_inline_ref_size(int type)
if (type == BTRFS_EXTENT_DATA_REF_KEY)
return sizeof(struct btrfs_extent_data_ref) +
   offsetof(struct 

Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Josef Bacik


On 02/05/2014 03:15 PM, Roman Mamedov wrote:

Hello,

On a freshly-created RAID1 filesystem of two 1TB disks:

# df -h /mnt/p2/
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2

I cannot write 2TB of user data to that RAID1, so this estimate is clearly
misleading. I got tired of looking at the bogus disk free space on all my
RAID1 btrfs systems, so today I decided to do something about this:

--- fs/btrfs/super.c.orig   2014-02-06 01:28:36.636164982 +0600
+++ fs/btrfs/super.c2014-02-06 01:28:58.304164370 +0600
@@ -1481,6 +1481,11 @@
}
  
  	kfree(devices_info);

+
+   if (type  BTRFS_BLOCK_GROUP_RAID1) {
+   do_div(avail_space, min_stripes);
+   }
+
*free_bytes = avail_space;
return 0;
  }


This needs to be more flexible, and also this causes the problem where 
now you show the actual usable amount of space _but_ you are also 
showing twice the amount of used space.  I'm ok with going in this 
direction, but we need to convert everybody over so it works for raid10 
as well and the used values need to be adjusted.  Thanks,


Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: balance delayed inode updates

2014-02-06 Thread Josef Bacik
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files.  So balance delayed inode
updates everytime we create a new inode.  With this patch we no longer use up
all of our ram with delayed inode updates.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/extent-tree.c | 1 +
 fs/btrfs/inode.c   | 4 
 2 files changed, 5 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c9ecc9..1d9d9ce 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2660,6 +2660,7 @@ int btrfs_should_throttle_delayed_refs(struct 
btrfs_trans_handle *trans,
atomic_read(trans-transaction-delayed_refs.num_entries);
u64 avg_runtime;
 
+   return 1;
smp_mb();
avg_runtime = fs_info-avg_delayed_ref_runtime;
if (num_entries * avg_runtime = NSEC_PER_SEC)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1af34d0..d44ff9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5795,6 +5795,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry 
*dentry,
}
 out_unlock:
btrfs_end_transaction(trans, root);
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
if (drop_inode) {
inode_dec_link_count(inode);
@@ -5868,6 +5869,7 @@ out_unlock:
inode_dec_link_count(inode);
iput(inode);
}
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
return err;
 }
@@ -5926,6 +5928,7 @@ static int btrfs_link(struct dentry *old_dentry, struct 
inode *dir,
}
 
btrfs_end_transaction(trans, root);
+   btrfs_balance_delayed_items(root);
 fail:
if (drop_inode) {
inode_dec_link_count(inode);
@@ -5992,6 +5995,7 @@ out_fail:
btrfs_end_transaction(trans, root);
if (drop_on_err)
iput(inode);
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
return err;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: throttle delayed refs better

2014-02-06 Thread Josef Bacik
On 02/05/2014 05:57 PM, Johannes Hirte wrote:
 On Wed, 5 Feb 2014 16:46:57 -0500
 Josef Bacik jba...@fb.com wrote:
 

 On 02/05/2014 04:42 PM, Johannes Hirte wrote:
 On Wed, 5 Feb 2014 14:36:39 -0500
 Josef Bacik jba...@fb.com wrote:

 On 02/05/2014 02:30 PM, Johannes Hirte wrote:
 On Wed, 5 Feb 2014 14:00:57 -0500
 Josef Bacik jba...@fb.com wrote:

 On 02/05/2014 12:34 PM, Johannes Hirte wrote:
 On Wed, 5 Feb 2014 10:49:15 -0500
 Josef Bacik jba...@fb.com wrote:

 Ok none of those make sense which makes me think it may be the
 ktime bits, instead of un-applying the whole patch could you
 just comment out the parts

  ktime_t start = ktime_get();

 and

  if (actual_count  0) {
  u64 runtime =
 ktime_to_ns(ktime_sub(ktime_get(), start)); u64 avg;

  /*
   * We weigh the current average higher than
 our current runtime
   * to avoid large swings in the average.
   */
  spin_lock(delayed_refs-lock);
  avg = fs_info-avg_delayed_ref_runtime * 3
 + runtime; avg = div64_u64(avg, 4);
  fs_info-avg_delayed_ref_runtime = avg;
  spin_unlock(delayed_refs-lock);
  }

 in __btrfs_run_delayed_refs and see if that makes the problem
 stop? If it does will you try chris's for-linus branch to see
 if it still reproduces there?  Maybe some patch changed
 ktime_get() in -rc1 that is causing issues and we're just now
 exposing it. Thanks,
 With the ktime bits disabled, I wasn't able to reproduce the
 problem anymore. With Chris' for-linus branch it took longer but
 still appeared.

 Ok can you send your .config, maybe there's some weird time bug
 being exposed.  What kind of CPU do you have?  Thanks,

 Josef
 It's a Core i5-540M, dualcore + hyperthreading
 Ok while I'm doing this can you change
 btrfs_should_throttle_delayed_refs to _always_ return 1, still with
 all the ktime stuff commented out, and see if that causes the
 problem to happen?  Thanks,
 Yes it does. Same behavior as without ktime stuff commented out.

 Ok perfect, can you send me a btrfs fi df of that volume, and do you
 have any snapshots or anything?  Thanks,
 
 btrfs fi df /
 Data, single: total=220.01GiB, used=210.85GiB
 System, DUP: total=8.00MiB, used=32.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, DUP: total=4.00GiB, used=2.93GiB
 Metadata, single: total=8.00MiB, used=0.00
 
 No snapshots but several subvolumes. / itself is a seperate subvolume
 and subvol 0 only contains the other subvolumes (5 at moment). qgroups
 aren't enabled.
 
 mount options are noatime,inode_cache, if this matters
 

Ok so I thought I reproduced the problem but I just reproduced a different
problem.  Please undo any changes you've made and apply this patch and reproduce
and then provide me with any debug output that gets spit out.  I'm sending this
via thunderbird with 6 different extensions to make sure it comes out right so
if it doesn't work let me know and I'll just paste it somewhere.  Thanks,

Josef

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index f3bff89..b025a04 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -204,8 +204,12 @@ find_ref_head(struct rb_root *root, u64 bytenr,
struct rb_node *n;
struct btrfs_delayed_ref_head *entry;
int cmp = 0;
+   unsigned long loops = 0;
 
 again:
+   loops++;
+   if (loops  2)
+   printk(KERN_ERR we have fucked up\n);
n = root-rb_node;
entry = NULL;
while (n) {
@@ -232,6 +236,7 @@ again:
n = rb_next(entry-href_node);
if (!n)
n = rb_first(root);
+   BUG_ON(!n);
entry = rb_entry(n, struct btrfs_delayed_ref_head,
 href_node);
bytenr = entry-node.bytenr;
@@ -410,10 +415,14 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
struct btrfs_delayed_ref_head *head;
u64 start;
bool loop = false;
+   unsigned long loops = 0;
 
delayed_refs = trans-transaction-delayed_refs;
 
 again:
+   loops++;
+   if (loops  5)
+   printk(KERN_ERR houston we have a problem\n);
start = delayed_refs-run_delayed_start;
head = find_ref_head(delayed_refs-href_root, start, NULL, 1);
if (!head  !loop) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c9ecc9..91dacf4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2327,9 +2327,16 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
unsigned long count = 0;
unsigned long actual_count = 0;
int must_insert_reserved = 0;
+   unsigned long loops = 0;
+   unsigned long no_selected_ref = 0;
 
delayed_refs = 

Re: [PATCH] btrfs-progs: Convert BUG() to BUG_ON(1)

2014-02-06 Thread David Sterba
On Thu, Feb 06, 2014 at 12:34:08PM -0600, Mitch Harder wrote:
 Convert the instances of BUG() to BUG_ON(1) to provide information
 about the location of the abort.

kerncompat.h:

#define BUG() abort()

#define BUG_ON(c) assert(!(c))

I'd rather fix the definition to do the same thing, that way no
developer would need to know the difference (that actually exists only
in the userspace tools, in kernel the two produce the same outout).


david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Convert BUG() to BUG_ON(1)

2014-02-06 Thread Josef Bacik

On 02/06/2014 01:34 PM, Mitch Harder wrote:

Convert the instances of BUG() to BUG_ON(1) to provide information
about the location of the abort.



This is where I realize this was against btrfs-progs, ignore me.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: balance delayed inode updates

2014-02-06 Thread Josef Bacik



On 02/06/2014 04:07 PM, Josef Bacik wrote:

While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files.  So balance delayed inode
updates everytime we create a new inode.  With this patch we no longer use up
all of our ram with delayed inode updates.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
  fs/btrfs/extent-tree.c | 1 +
  fs/btrfs/inode.c   | 4 
  2 files changed, 5 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c9ecc9..1d9d9ce 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2660,6 +2660,7 @@ int btrfs_should_throttle_delayed_refs(struct 
btrfs_trans_handle *trans,
atomic_read(trans-transaction-delayed_refs.num_entries);
u64 avg_runtime;

+   return 1;
smp_mb();


Do git add for the file I want, do git commit -a anyway.

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: balance delayed inode updates V2

2014-02-06 Thread Josef Bacik
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files.  So balance delayed inode
updates everytime we create a new inode.  With this patch we no longer use up
all of our ram with delayed inode updates.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
V1-V2: this time without the debug part I was using.

 fs/btrfs/inode.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1af34d0..d44ff9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5795,6 +5795,7 @@ static int btrfs_mknod(struct inode *dir, struct dentry 
*dentry,
}
 out_unlock:
btrfs_end_transaction(trans, root);
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
if (drop_inode) {
inode_dec_link_count(inode);
@@ -5868,6 +5869,7 @@ out_unlock:
inode_dec_link_count(inode);
iput(inode);
}
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
return err;
 }
@@ -5926,6 +5928,7 @@ static int btrfs_link(struct dentry *old_dentry, struct 
inode *dir,
}
 
btrfs_end_transaction(trans, root);
+   btrfs_balance_delayed_items(root);
 fail:
if (drop_inode) {
inode_dec_link_count(inode);
@@ -5992,6 +5995,7 @@ out_fail:
btrfs_end_transaction(trans, root);
if (drop_on_err)
iput(inode);
+   btrfs_balance_delayed_items(root);
btrfs_btree_balance_dirty(root);
return err;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: introduce BTRFS_IOC_GET_DEVS

2014-02-06 Thread David Sterba
On Thu, Feb 06, 2014 at 09:09:57PM +0100, Goffredo Baroncelli wrote:
  With 3.14 the sysfs interface is available, but the devices under
   /sys/fs/btrfs/fs-uuid/devices/...
  are symlinks to the sysfs devices, so this btrfs-specific device
  information has to be located in a separate directory.
 
 yes please; it would scale better than the ioctl()s; anyway I suggest 1
 file per property, and not ...one file that holds all the stats
  about the device

Well, we can do both, I don't have a preference and both make sense.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Dave Chinner
On Fri, Feb 07, 2014 at 12:10:08AM +0800, Wang Shilong wrote:
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
 + $SCRATCH_MNT/snap_1  $seqres.full 21
 +
 +do_snapshots 
 +snapshots_pid=$!
 +
 +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1  /dev/null 21 || echo btrfs 
 send failed

Let's stop this anti-pattern before it takes hold.

If there's output from the send command it should be
filtered and captured in the golden image. Hence any deviation
caused by errors is automatically flagged as an error.

That's the whole point of using golden images for capturing errors -
you don't need to capture return values from binaries and it
guarantees that users are informed about failures through error
messages. IOWs:

$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 | _btrfs_send_filter

is what you should be doing here.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] btrfs/035: add new clone overwrite regression test

2014-02-06 Thread Dave Chinner
On Thu, Feb 06, 2014 at 02:59:14PM +0100, David Disseldorp wrote:
 This test uses the newly added cloner binary to dispatch full file and
 range specific clone (reflink) requests.


 +
 +echo -n $src_str  $SCRATCH_MNT/src || echo failed to create src

Not exactly what I intended.

If echo fails, it will output some kind of error message, and that
will cause the golden image mismatch.

Otherwise the test looks good.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Convert BUG() to BUG_ON(1)

2014-02-06 Thread Mitch Harder
On Thu, Feb 6, 2014 at 3:22 PM, David Sterba dste...@suse.cz wrote:
 On Thu, Feb 06, 2014 at 12:34:08PM -0600, Mitch Harder wrote:
 Convert the instances of BUG() to BUG_ON(1) to provide information
 about the location of the abort.

 kerncompat.h:

 #define BUG() abort()

 #define BUG_ON(c) assert(!(c))

 I'd rather fix the definition to do the same thing, that way no
 developer would need to know the difference (that actually exists only
 in the userspace tools, in kernel the two produce the same outout).


 david

Thanks for the feedback.

Changing the definition of BUG() in kerncompat.h will be much more concise.

I'll restructure the patch and resubmit it after I test it.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Dave Chinner
On Thu, Feb 06, 2014 at 09:12:51PM +0800, Wang Shilong wrote:
 Hi Dave,
 
  On Mon, Feb 03, 2014 at 11:22:36PM +0800, Wang Shilong wrote:
  From: Wang Shilong wangsl.f...@cn.fujitsu.com
  
  Btrfs would fail to send if snapshot run concurrently, this test is to make
  sure we have fixed the bug.
  
  Couple of comments below.
  
  +_scratch_mkfs  /dev/null 21
  +_scratch_mount
  +
  +
  +touch $SCRATCH_MNT/foo
  +
  +# get file with fragments by using backwards writes.
  +for i in `seq 10240 -1 1`; do
  +  $XFS_IO_PROG -f -d -c pwrite $(($i * 4096)) 4096 \
  +  $SCRATCH_MNT/foo  /dev/null | _filter_xfs_io
  
  Indentation.
  
  +done
  +
  +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
  +  $SCRATCH_MNT/snap_1  $seqres.full 21
  +
  +$BTRFS_UTIL_PROG send -f $SCRATCH_MNT/send_file \
  +  $SCRATCH_MNT/snap_1  $seqres.full 21 
  +
  +pid=$!
  +
  +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT/snap_1 \
  +  $SCRATCH_MNT/snap_2  $seqres.full 21
  +
  +wait $pid || echo Failed to send, see dmesg
  
  This seems kind of racy. It assumes that the send command
  doesn't complete before the wait $pid call is made. If $pid doesn't
  exist at this time because it has completed, wait will return 127
  and the test will fail….
 
 Sorry for delay reply!
 
 Maybe a better idea for this will be:
 
 Opposite to previous way, we do snapshots background while at the
 same time we do sending.
 
 And btrfs-progs should output meaningful information on send failure, i will
 make a tiny patch to address this issue. but to make this test more friendly, 
 i
 think we can still do something like:
 
 btrfs send ..  || echo Failed to send

If send is emitting error messages on failures, then the
|| echo... is redundant and not necessary to cause a golden image
mismatch. If send is not emitting error messages on failure, then it
needs fixing because users are going to hate you for silently
failing to send the data they asked to be sent.

Either way, the echo command on error in the test is not needed.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device delete missing panic

2014-02-06 Thread Thermionix
~ # uname -r
3.13.1-12.gfc9498b-pae

~ # btrfs --version
Btrfs v3.12+20131125

~ # mount -o degraded,recovery /pool

Killed

Feb 07 09:53:44 store03 kernel: btrfs: device label pool devid 4 transid 55056 
/dev/sde
Feb 07 09:53:44 store03 kernel: btrfs: allowing degraded mounts
Feb 07 09:53:44 store03 kernel: btrfs: enabling auto recovery
Feb 07 09:53:44 store03 kernel: btrfs: disk space caching is enabled
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: btrfs: failed to read tree root on sde
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: BTRFS critical (device sde): unable to find 
logical 1563803254784 len 4096
Feb 07 09:53:44 store03 kernel: btrfs: failed to read tree root on sde
Feb 07 09:53:44 store03 kernel: btrfs: bdev (null) errs: wr 353, rd 1, flush 
17, corrupt 0, gen 0
Feb 07 09:53:44 store03 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 0050
Feb 07 09:53:44 store03 kernel: IP: [f834b7b7] 
btrfs_sysfs_add_one+0x1a7/0x280 [btrfs]
Feb 07 09:53:44 store03 kernel: *pdpt = 34fce001 *pde = 
 
Feb 07 09:53:44 store03 kernel: Oops:  [#1] SMP 
Feb 07 09:53:44 store03 kernel: Modules linked in: bonding hwmon_vid iTCO_wdt 
joydev iTCO_vendor_support hid_generic serio_raw lpc_ich pcspk
r coretemp mfd_core i2c_i801 ata_generic e1000e usbhid ptp pps_core mvsas 
libsas scsi_transport_sas shpchp sg dm_mod autofs4 btrfs raid6_pq 
xor libcrc32c ata_piix uhci_hcd ehci_pci ehci_hcd usbcore usb_common fan 
thermal processor i915 drm_kms_helper drm i2c_algo_bit button video
 thermal_sys scsi_dh_hp_sw scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh
Feb 07 09:53:44 store03 kernel: CPU: 0 PID: 1893 Comm: mount Not tainted 
3.13.1-12.gfc9498b-pae #1
Feb 07 09:53:44 store03 kernel: Hardware name: PhoenixAward 945GM/945GM, BIOS 
6.00 PG 08/13/2008
Feb 07 09:53:44 store03 kernel: task: f4fa80d0 ti: f1614000 task.ti: f1614000
Feb 07 09:53:44 store03 kernel: EIP: 0060:[f834b7b7] EFLAGS: 00010212 CPU: 0
Feb 07 09:53:44 store03 kernel: EIP is at btrfs_sysfs_add_one+0x1a7/0x280 
[btrfs]
Feb 07 09:53:44 store03 kernel: EAX: f4f8eec0 EBX: f6c0e800 ECX: f14063c0 EDX: 

Feb 07 09:53:44 store03 kernel: ESI: f3454374 EDI:  EBP: f154e000 ESP: 
f1615d38
Feb 07 09:53:44 store03 kernel:  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Feb 07 09:53:44 store03 kernel: CR0: 8005003b CR2: 0050 CR3: 3147 CR4: 
07f0
Feb 07 09:53:44 store03 kernel: Stack:
Feb 07 09:53:44 store03 kernel:  f154ea00 f83d10a0  f83aee16 f154e000 
  f154e000
Feb 07 09:53:44 store03 kernel:  f83d3a60 f154ea00 f83aee1a   
 f6dbf800 f154e000
Feb 07 09:53:44 store03 kernel:  f4fe3000 f1615e48 f832f099 1000 d70f 
 0003 
Feb 07 09:53:44 store03 kernel: Call Trace:
Feb 07 09:53:44 store03 kernel:  [f832f099] open_ctree+0x1409/0x1d80 [btrfs]
Feb 07 09:53:44 store03 kernel:  [f830675b] btrfs_mount+0x6ab/0x880 [btrfs]
Feb 07 09:53:44 store03 kernel:  [c035f1ad] mount_fs+0x2d/0x170
Feb 07 09:53:44 store03 kernel:  [c0375af5] vfs_kern_mount+0x45/0xd0
Feb 07 09:53:44 store03 kernel:  [c0377a74] do_mount+0x1f4/0x8e0
Feb 07 09:53:44 store03 kernel:  [c03783fe] SyS_mount+0x7e/0xc0
Feb 07 09:53:44 store03 kernel:  [c070977d] sysenter_do_call+0x12/0x28
Feb 07 09:53:44 store03 kernel:  [b76ed424] 0xb76ed423
Feb 07 09:53:44 store03 kernel: Code: 0c 8d 76 00 89 e8 e8 59 fd ff ff 89 f8 83 
c4 38 5b 5e 5f 5d c3 8d b4 26 00 00 00 00 8b 1b 39 de 74 b1 
8b 85 28 0a 00 00 8b 53 50 8b 52 50 8b 4a 28 83 c2 28 e8 7b fe 06 c8 85 c0 74 
df 89 c7 eb
Feb 07 09:53:44 store03 kernel: EIP: [f834b7b7] 
btrfs_sysfs_add_one+0x1a7/0x280 [btrfs] SS:ESP 0068:f1615d38
Feb 07 09:53:44 store03 kernel: CR2: 0050
Feb 07 09:53:44 store03 kernel: ---[ end trace 4b34a92ed1b54297 ]---

~ # btrfs-zero-log /dev/sde
warning, device 1 is missing
warning devid 1 not found already
checksum verify failed on 19155989299200 found 42FA71B2 wanted A7E1A39B
checksum verify failed on 19155989299200 found 42FA71B2 wanted A7E1A39B
checksum verify failed on 19155989299200 found 42FA71B2 wanted A7E1A39B
checksum verify failed on 19155989299200 found 42FA71B2 wanted A7E1A39B
Csum didn't match
Couldn't read tree root

~ # mount -o degraded,recovery /pool

Feb 07 10:04:55 store03 kernel: btrfs: device label pool devid 2 transid 55056 
/dev/sdc
Feb 07 10:04:55 store03 kernel: btrfs: 

INFO: possible irq lock inversion dependency detected, btrfs

2014-02-06 Thread Chris Murphy

User has reported a Fedora Rawhide bug that includes a lot of Btrfs messages. 
Kernel is 3.14.0-0.rc1.git1.1.fc21.x86_64.

https://bugzilla.redhat.com/show_bug.cgi?id=1062439

A more full dmesg isn't included in the bug report. I haven't hit this 
particular bug running the same kernel, but I am seeing more IRQ complaints on 
baremetal than prior kernels.


dmesg:
…
BTRFS: device label fedora devid 1 transid 14 /dev/vda3
BTRFS: device label fedora devid 1 transid 14 /dev/vda3
BTRFS: device label fedora devid 1 transid 14 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
BTRFS: device label fedora devid 1 transid 15 /dev/vda3
BTRFS: device label fedora devid 1 transid 15 /dev/vda3
BTRFS: device label fedora devid 1 transid 15 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
EXT4-fs (vda1): INFO: recovery required on readonly filesystem
EXT4-fs (vda1): write access will be enabled during recovery
EXT4-fs (vda1): recovery complete
EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
SELinux: initialized (dev vda1, type ext4), uses xattr
BTRFS: device label fedora devid 1 transid 16 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
BTRFS: device label fedora devid 1 transid 16 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
SELinux: initialized (dev dm-1, type ext4), uses xattr
SGI XFS with ACLs, security attributes, large block/inode numbers, no debug 
enabled
BTRFS: device label fedora devid 1 transid 16 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
BTRFS: device label fedora devid 1 transid 18 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
BTRFS: device label fedora devid 1 transid 4 /dev/vda3
BTRFS: device label fedora devid 1 transid 4 /dev/vda3
BTRFS: device label fedora devid 1 transid 4 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
BTRFS: flagging fs with big metadata feature
BTRFS: creating UUID tree
SELinux: initialized (dev vda3, type btrfs), uses xattr
BTRFS: device label fedora devid 1 transid 7 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
Adding 838652k swap on /dev/vda2.  Priority:-1 extents:1 across:838652k FS
BTRFS: device label fedora devid 1 transid 9 /dev/vda3
BTRFS info (device vda3): disk space caching is enabled
SELinux: initialized (dev vda3, type btrfs), uses xattr
EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
SELinux: initialized (dev vda1, type ext4), uses xattr
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
BTRFS: device label fedora devid 1 transid 9 /dev/vda3

=
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-0.rc1.git1.1.fc21.x86_64 #1 Not tainted
-
kswapd0/30 just changed the state of lock:
 (delayed_node-mutex){+.+.-.}, at: [a01e09ad] 
__btrfs_release_delayed_node+0x3d/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (found-groups_sem){+.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

   CPU0CPU1
   
  lock(found-groups_sem);
   local_irq_disable();
   lock(delayed_node-mutex);
   lock(found-groups_sem);
  Interrupt
lock(delayed_node-mutex);

 *** DEADLOCK ***

2 locks held by kswapd0/30:
 #0:  (shrinker_rwsem){..}, at: [811c45af] shrink_slab+0x3f/0x180
 #1:  (type-s_umount_key#49){+.}, at: [81236a74] 
grab_super_passive+0x44/0x90

the shortest dependencies between 2nd lock and 1st lock:
 - (found-groups_sem){+.} ops: 4248 {
HARDIRQ-ON-W at:
  [810f9479] __lock_acquire+0x659/0x1c40
  [810fb252] lock_acquire+0xa2/0x1d0
  [817d6e6e] down_write+0x4e/0xc0
  [a017785b] __link_block_group+0x3b/0xf0 [btrfs]
  [a017973b] btrfs_read_block_groups+0x30b/0x690 
[btrfs]
  [a0188a79] open_ctree+0x17f9/0x21e0 [btrfs]
  [a015bc5e] btrfs_mount+0x63e/0x810 [btrfs]
  [81237449] mount_fs+0x39/0x1b0
  [8125780b] vfs_kern_mount+0x6b/0x150
  

integration-20140206 build failure

2014-02-06 Thread WorMzy Tykashi
Building the latest integration snapshot, I get the following failure:


...
[CC] cmds-filesystem.o
cmds-filesystem.c: In function 'cmd_show':
cmds-filesystem.c:740:12: error: invalid storage class for function 'cmd_sync'
 static int cmd_sync(int argc, char **argv)
^
cmds-filesystem.c:770:12: error: invalid storage class for function
'parse_compress_type'
 static int parse_compress_type(char *s)
^
cmds-filesystem.c:796:12: error: invalid storage class for function 'do_defrag'
 static int do_defrag(int fd, int fancy_ioctl,
^
cmds-filesystem.c:813:12: error: invalid storage class for function
'defrag_callback'
 static int defrag_callback(const char *fpath, const struct stat *sb,
^
cmds-filesystem.c:848:12: error: invalid storage class for function 'cmd_defrag'
 static int cmd_defrag(int argc, char **argv)
^
cmds-filesystem.c:1000:12: error: invalid storage class for function
'cmd_resize'
 static int cmd_resize(int argc, char **argv)
^
cmds-filesystem.c:1047:12: error: invalid storage class for function 'cmd_label'
 static int cmd_label(int argc, char **argv)
^
cmds-filesystem.c:1076:2: error: non-static initialization of a
flexible array member
  }
  ^
cmds-filesystem.c:1076:2: error: (near initialization for
'filesystem_cmd_group')
cmds-filesystem.c:1082:1: error: expected declaration or statement at
end of input
 }
 ^
cmds-filesystem.c: At top level:
cmds-filesystem.c:602:12: warning: 'cmd_show' defined but not used
[-Wunused-function]
 static int cmd_show(int argc, char **argv)
^
Makefile:113: recipe for target 'cmds-filesystem.o' failed
make: *** [cmds-filesystem.o] Error 1


Full output at [1]

Apologies if this has already been reported somewhere else, I couldn't
find anything on the mailing list.

Build environment:

Arch Linux
gcc: 4.8.2
glibc: 2.18
make: 4.0
e2fsprogs: 1.42.9
lzo2: 2.06
CFLAGS: -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector
--param=ssp-buffer-size=4

Cheers,


WorMzy

[1] http://pastie.org/8706689
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Are nocow files snapshot-aware

2014-02-06 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net schrieb:

 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!

The question here is: Does it really make sense to create such snapshots of 
disk images currently online and running a system. They will probably be 
broken anyway after rollback - or at least I'd not fully trust the contents.

VM images should not be part of a subvolume of which snapshots are taken at 
a regular and short interval. The problem will go away if you follow this 
rule.

The same applies to probably any kind of file which you make nocow - e.g. 
database files. Most of those file implement their own way of transaction 
protection or COW system, e.g. look at InnoDB files. Neither they gain 
anything from using IO schedulers (because InnoDB internally does block 
sorting and prioritizing and knows better, doing otherwise even hurts 
performance), nor they gain from file system semantics like COW (because it 
does its own transactions and atomic updates and probably can do better for 
its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or 
btrfs images on btrfs). Snapshots can only do harm here (the only 
protection use case would be to have a backup, but snapshots are no 
backups), and COW will probably hurt performance a lot. The only use case is 
taking _controlled_ snapshots - and doing it all 30 seconds is by all means 
NOT controlled, it's completely undeterministic.

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Are nocow files snapshot-aware

2014-02-06 Thread cwillu
On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
 Duncan 1i5t5.dun...@cox.net schrieb:

 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.

 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!

 The question here is: Does it really make sense to create such snapshots of
 disk images currently online and running a system. They will probably be
 broken anyway after rollback - or at least I'd not fully trust the contents.

 VM images should not be part of a subvolume of which snapshots are taken at
 a regular and short interval. The problem will go away if you follow this
 rule.

 The same applies to probably any kind of file which you make nocow - e.g.
 database files. Most of those file implement their own way of transaction
 protection or COW system, e.g. look at InnoDB files. Neither they gain
 anything from using IO schedulers (because InnoDB internally does block
 sorting and prioritizing and knows better, doing otherwise even hurts
 performance), nor they gain from file system semantics like COW (because it
 does its own transactions and atomic updates and probably can do better for
 its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
 btrfs images on btrfs). Snapshots can only do harm here (the only
 protection use case would be to have a backup, but snapshots are no
 backups), and COW will probably hurt performance a lot. The only use case is
 taking _controlled_ snapshots - and doing it all 30 seconds is by all means
 NOT controlled, it's completely undeterministic.

If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Are nocow files snapshot-aware

2014-02-06 Thread Chris Murphy

On Feb 6, 2014, at 6:01 PM, cwillu cwi...@cwillu.com wrote:

 On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
 Duncan 1i5t5.dun...@cox.net schrieb:
 
 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!
 
 The question here is: Does it really make sense to create such snapshots of
 disk images currently online and running a system. They will probably be
 broken anyway after rollback - or at least I'd not fully trust the contents.
 
 VM images should not be part of a subvolume of which snapshots are taken at
 a regular and short interval. The problem will go away if you follow this
 rule.
 
 The same applies to probably any kind of file which you make nocow - e.g.
 database files. Most of those file implement their own way of transaction
 protection or COW system, e.g. look at InnoDB files. Neither they gain
 anything from using IO schedulers (because InnoDB internally does block
 sorting and prioritizing and knows better, doing otherwise even hurts
 performance), nor they gain from file system semantics like COW (because it
 does its own transactions and atomic updates and probably can do better for
 its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
 btrfs images on btrfs). Snapshots can only do harm here (the only
 protection use case would be to have a backup, but snapshots are no
 backups), and COW will probably hurt performance a lot. The only use case is
 taking _controlled_ snapshots - and doing it all 30 seconds is by all means
 NOT controlled, it's completely undeterministic.
 
 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.

How fast is this state fixed on disk from the time of the snapshot command? 
Loosely speaking. I'm curious if this is  1 second; a few seconds; or possibly 
up to the 30 second default commit interval? And also if it's even related to 
the commit interval time at all?

I'm also curious what happens to files that are presently writing. e.g. I'm 
writing a 1GB file to subvol A and before it completes I snapshot subvol A into 
A.1. If I go find the file I was writing to, in A.1, what's its state? 
Truncated? Or or are in-progress writes permitted to complete if it's a rw 
snapshot? Any difference in behavior if it's an ro snapshot?


Chris Murphy

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Wang Shilong

Hi Dave,

 On Fri, Feb 07, 2014 at 12:10:08AM +0800, Wang Shilong wrote:
 +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
 +$SCRATCH_MNT/snap_1  $seqres.full 21
 +
 +do_snapshots 
 +snapshots_pid=$!
 +
 +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1  /dev/null 21 || echo btrfs 
 send failed
 
 Let's stop this anti-pattern before it takes hold.
 
 If there's output from the send command it should be
 filtered and captured in the golden image. Hence any deviation
 caused by errors is automatically flagged as an error.
 
 That's the whole point of using golden images for capturing errors -
 you don't need to capture return values from binaries and it
 guarantees that users are informed about failures through error
 messages. IOWs:
 
 $BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 | _btrfs_send_filter
 
 is what you should be doing here.

I knew what you mean here, in fact, i did this on purpose.
for this test failure, btrfs-prog did not output failure information from the 
beginning.
So to make older progs can also detect the test failure, i dropped into this 
way.

Anyway, if you don't like this and think  old btrfs-progs needn't consider 
this,  i will update
the patch.^_^

Thanks,
Wang

 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Roman Mamedov
On Thu, 06 Feb 2014 20:54:19 +0100
Goffredo Baroncelli kreij...@libero.it wrote:

 I agree with you about the needing of a solution. However your patch to me 
 seems even worse than the actual code.
 
 For example you cannot take in account the mix of data/linear and 
 metadata/dup (with the pathological case of small files stored in the 
 metadata chunks ), nor different profile level like raid5/6 (or the future 
 raidNxM)
 And do not forget the compression...

Every estimate first and foremost should be measured by how precise it is, or
in this case wrong by how many gigabytes. The actual code returns a result
that is pretty much always wrong by 2x, after the patch it will be close
within gigabytes to the correct value in the most common use case (data raid1,
metadata raid1 and that's it). Of course that PoC is nowhere near the final
solution, what I can't agree with is if another option is somewhat better,
but not ideally perfect, then it's worse than the current one, even
considering the current one is absolutely broken.

 The situation is very complex. I am inclined to use a different approach.
 
 As you know, btrfs allocate space in chunk. Each chunk has an own ration 
 between the data occupied on the disk, and the data available to the 
 filesystem. For SINGLE the ratio is 1, for DUP/RAID1/RAID10 the ratio is 2, 
 for raid 5 the ratio is n/(n-1) (where n is the stripes count), for raid 6 
 the ratio is n/(n-2)
 
 Because a filesystem could have chunks with different ratios, we can compute 
 a global ratio as the composition of the each chunk ratio

 We could further enhance this estimation, taking in account also the total 
 files sizes and their space consumed in the chunks (this could be different 
 due to the compression)

I wonder what would be performance implications of all that. I feel a simpler
approach could work.

-- 
With respect,
Roman


signature.asc
Description: PGP signature


Re: [PATCH v3] Btrfs: add regression test for running snapshot and send concurrently

2014-02-06 Thread Dave Chinner
On Fri, Feb 07, 2014 at 12:18:31PM +0800, Wang Shilong wrote:
 
 Hi Dave,
 
  On Fri, Feb 07, 2014 at 12:10:08AM +0800, Wang Shilong wrote:
  +$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
  +  $SCRATCH_MNT/snap_1  $seqres.full 21
  +
  +do_snapshots 
  +snapshots_pid=$!
  +
  +$BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1  /dev/null 21 || echo btrfs 
  send failed
  
  Let's stop this anti-pattern before it takes hold.
  
  If there's output from the send command it should be
  filtered and captured in the golden image. Hence any deviation
  caused by errors is automatically flagged as an error.
  
  That's the whole point of using golden images for capturing errors -
  you don't need to capture return values from binaries and it
  guarantees that users are informed about failures through error
  messages. IOWs:
  
  $BTRFS_UTIL_PROG send $SCRATCH_MNT/snap_1 | _btrfs_send_filter
  
  is what you should be doing here.
 
 I knew what you mean here, in fact, i did this on purpose.

Ok, then you need to explain why you did it on purpose with a comment.
It's just as important to explain the reason for doing something in
test code as it is in the kernel code. i.e. so when we are looking
at the test in 5 years time we know the reason for it being that
way.

 for this test failure, btrfs-prog did not output failure
 information from the beginning. 

I have nothing good to say about that state of affairs, but...

 So to make older progs can also
 detect the test failure, i dropped into this way.

.. it's going to have to stay like it. Please insert an
appropriately sarcastic comment about the usefulness of a silent
send command here, because if I write it I'm going to offend lots of
people. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


user creation/deletion of snapshots permissions bug

2014-02-06 Thread Russell Coker
$ /sbin/btrfs subvol create /tmp/test
Create subvolume '/tmp/test'
$ /sbin/btrfs subvol delete /tmp/test
Delete subvolume '/tmp/test'
ERROR: cannot delete '/tmp/test' - Operation not permitted

The above is when running Debian kernel 3.12 based on Linux upstream 3.12.8.  
I believe that the BTRFS kernel code should do a capabilities check for 
CAP_SYS_ADMIN (which is used for mount/umount among many other things) before 
creating a snapshot.  Currently it appears that the only access control is 
write access to the parent directory.

There is some possibility of debate about the access control needed for 
creating a subvol.  I want to use capabilities set by SE Linux policy to 
prevent unwanted actions by hostile root processes and I think that such use 
of capabilities (which is used by more than just SE Linux) should be 
supported.  I don't think that there is any downside to such checks.

In any case allowing a subvol to be created but not deleted with the same 
level of access is obviously a bug.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device delete missing panic

2014-02-06 Thread Anand Jain



 ~ # mount -o degraded,recovery /pool 


 Killed


Feb 07 09:53:44 store03 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 0050
Feb 07 09:53:44 store03 kernel: IP: [f834b7b7] 
btrfs_sysfs_add_one+0x1a7/0x280 [btrfs]


Mostly the above Bug is fixed in the patch below..

  [RFC PATCH] btrfs: fix null pointer deference at 
btrfs_sysfs_add_one+0x105


If you are hitting the same panic again, could you test it ?

Thanks, Anand

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Chris Murphy

On Feb 6, 2014, at 9:40 PM, Roman Mamedov r...@romanrm.net wrote:

 On Thu, 06 Feb 2014 20:54:19 +0100
 Goffredo Baroncelli kreij...@libero.it wrote:
 
 I agree with you about the needing of a solution. However your patch to me 
 seems even worse than the actual code.
 
 For example you cannot take in account the mix of data/linear and 
 metadata/dup (with the pathological case of small files stored in the 
 metadata chunks ), nor different profile level like raid5/6 (or the future 
 raidNxM)
 And do not forget the compression...
 
 Every estimate first and foremost should be measured by how precise it is, or
 in this case wrong by how many gigabytes. The actual code returns a result
 that is pretty much always wrong by 2x, after the patch it will be close
 within gigabytes to the correct value in the most common use case (data raid1,
 metadata raid1 and that's it). Of course that PoC is nowhere near the final
 solution, what I can't agree with is if another option is somewhat better,
 but not ideally perfect, then it's worse than the current one, even
 considering the current one is absolutely broken.

Is the glass half empty or is it half full?

From the original post, context is a 2x 1TB raid volume:

Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2

Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But that's 
not exactly true either, is it? It's merely a convention to cut the storage 
available in half, while keeping data file sizes the same as if they were on a 
single device without raid.

On Btrfs the file system Size is reported as the total storage stack size, and 
that's not incorrect. And the amount Avail is likewise not wrong because that 
space is not otherwise occupied which is the definition of available. It's 
linguistically consistent, it's just not a familiar convention.

What I don't care for is the fact that btrfs fi df doesn't report total and 
used for raid1, the user has to mentally double the displayed values. I think 
the doubling should already be computed, that's what total and used mean, 
rather than needing secret decoder ring knowledge to understand the situation.

Anyway, there isn't a terribly good solution for this issue still. But I don't 
find the argument that it's absolutely broken very compelling. And I disagree 
that upending Used+Avail=Size as you suggest is a good alternative. How is that 
going to work, by the way?

Your idea:
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   1.8T  1.1M  912G   1% /mnt/p2

If I copy 500GB to this file system, what do you propose df shows me? Clearly 
Size stays the same, and Avail presumably becomes 412G. But what does Used go 
to? 500G? Or 1T? And when full, will it say Size 1.8T, Used 900G, Avail 11M? So 
why is the Size 1.8T, only 900G used and yet it's empty? That doesn't make 
sense. Nor does Used increasing at twice the rate Avail goes down.

I also don't think it's useful to fix the problem more than once either.



Chris Murphy

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: user creation/deletion of snapshots permissions bug

2014-02-06 Thread Wang Shilong
Hello,

 $ /sbin/btrfs subvol create /tmp/test
 Create subvolume '/tmp/test'
 $ /sbin/btrfs subvol delete /tmp/test
 Delete subvolume '/tmp/test'
 ERROR: cannot delete '/tmp/test' - Operation not permitted

That is a little strange anyway. Looking into codes, i found
For sub-volume creation, we don't check CAP_SYS_ADMIN while
For sub-voule deletion, we do check CAP_SYS_ADMIN, however, we still have
a mount option user_rm_subvol_allowed to delete a sub-volume by regular users!

IMO, this is a little confusing and arguable. i see there are some discussion 
by googling, for example:

http://unix.stackexchange.com/questions/88932/why-cant-a-regular-user-delete-a-btrfs-subvolume

Thanks,
Wang
 
 The above is when running Debian kernel 3.12 based on Linux upstream 3.12.8.  
 I believe that the BTRFS kernel code should do a capabilities check for 
 CAP_SYS_ADMIN (which is used for mount/umount among many other things) before 
 creating a snapshot.  Currently it appears that the only access control is 
 write access to the parent directory.
 
 There is some possibility of debate about the access control needed for 
 creating a subvol.  I want to use capabilities set by SE Linux policy to 
 prevent unwanted actions by hostile root processes and I think that such use 
 of capabilities (which is used by more than just SE Linux) should be 
 supported.  I don't think that there is any downside to such checks.
 
 In any case allowing a subvol to be created but not deleted with the same 
 level of access is obviously a bug.
 
 -- 
 My Main Blog http://etbe.coker.com.au/
 My Documents Bloghttp://doc.coker.com.au/
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Provide a better free space estimate on RAID1

2014-02-06 Thread Roman Mamedov
On Thu, 6 Feb 2014 22:30:46 -0700
Chris Murphy li...@colorremedies.com wrote:

 From the original post, context is a 2x 1TB raid volume:
 
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/sda2   1.8T  1.1M  1.8T   1% /mnt/p2
 
 Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But 
 that's not exactly true either, is it?

Much better, and matching the user expectations of how RAID1 should behave,
without a major gotcha blowing up into their face the first minute they are
trying it out. In fact next step that I planned would be finding how to adjust
also Size and Used on all my machines to show what you just mentioned. I get it
that btrfs is special and its RAID1 is not the usual RAID1 either, but that's
not a good reason to break the 'df' behavior; do whatever you want with in
'btrfs fi df', but if I'm not mistaken the UNIX 'df' always was about user
data, how much of my data I have already stored on this partition and how much
more can I store. If that's not possible to tell, then try to be reasonably
close to the truth, not deliberately off by 2x.

 On Btrfs ...the amount Avail is likewise not wrong because that space is not 
 otherwise occupied which is the definition of available.

That's not the definition of available that's directly useful to anyone, but
rather a filesystem-designer level implementation detail, if anything.

What usually interests me is, I have a 100 GB file, can I fit it on this
filesystem, yes/no? Sure let's find out, just check 'df'. Oh wait, not so fast
let's remember was this btrfs? Is that the one with RAID1 or not?... And what
if I am accessing that partition on a server via a network CIFS/NFS share and
don't even *have a way to find out* any of that.


-- 
With respect,
Roman


signature.asc
Description: PGP signature


[PATCH 2/2] btrfs-progs: Add -p/--print-missing options for btrfs fi show

2014-02-06 Thread Qu Wenruo
Since a mounted btrfs filesystem contains all the devices info even a
device is removed after mount(like btrfs/003 in xfstests),
we can use the info to print the known missing device if possible.

So -p/--print-missing options are added to print possible missing
devices.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-filesystem.c | 26 --
 man/btrfs.8.in|  4 +++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 4c9933d..77b142c 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -360,7 +360,7 @@ static u64 calc_used_bytes(struct btrfs_ioctl_space_args 
*si)
 static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
struct btrfs_ioctl_dev_info_args *dev_info,
struct btrfs_ioctl_space_args *space_info,
-   char *label, char *path)
+   char *label, char *path, int print_missing)
 {
int i;
int fd;
@@ -392,7 +392,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
fd = open((char *)tmp_dev_info-path, O_RDONLY);
if (fd  0) {
missing = 1;
-   continue;
+   if (print_missing)
+   printf(\tdevid %4llu size %s used %s path %s 
(missing)\n,
+  tmp_dev_info-devid,
+  pretty_size(tmp_dev_info-total_bytes),
+  pretty_size(tmp_dev_info-bytes_used),
+  tmp_dev_info-path);
+   else
+   continue;
}
close(fd);
printf(\tdevid %4llu size %s used %s path %s\n,
@@ -440,7 +447,7 @@ static int check_arg_type(char *input)
return BTRFS_ARG_UNKNOWN;
 }
 
-static int btrfs_scan_kernel(void *search)
+static int btrfs_scan_kernel(void *search, int print_missing)
 {
int ret = 0, fd;
FILE *f;
@@ -477,7 +484,8 @@ static int btrfs_scan_kernel(void *search)
fd = open(mnt-mnt_dir, O_RDONLY);
if ((fd != -1)  !get_df(fd, space_info_arg)) {
print_one_fs(fs_info_arg, dev_info_arg,
-   space_info_arg, label, mnt-mnt_dir);
+space_info_arg, label, mnt-mnt_dir,
+print_missing);
kfree(space_info_arg);
memset(label, 0, sizeof(label));
}
@@ -500,6 +508,7 @@ static const char * const cmd_show_usage[] = {
Show the structure of a filesystem,
-d|--all-devices   show only disks under /dev containing btrfs 
filesystem,
-m|--mounted   show only mounted btrfs,
+   -p|--print-missing show known missing device if possible,
If no argument is given, structure of all present filesystems is 
shown.,
NULL
 };
@@ -513,6 +522,7 @@ static int cmd_show(int argc, char **argv)
int ret;
int where = BTRFS_SCAN_LBLKID;
int type = 0;
+   int print_missing = 0;
char mp[BTRFS_PATH_NAME_MAX + 1];
char path[PATH_MAX];
 
@@ -521,9 +531,10 @@ static int cmd_show(int argc, char **argv)
static struct option long_options[] = {
{ all-devices, no_argument, NULL, 'd'},
{ mounted, no_argument, NULL, 'm'},
+   { print-missing, no_argument, NULL, 'p'},
{ NULL, no_argument, NULL, 0 },
};
-   int c = getopt_long(argc, argv, dm, long_options,
+   int c = getopt_long(argc, argv, dmp, long_options,
long_index);
if (c  0)
break;
@@ -534,6 +545,9 @@ static int cmd_show(int argc, char **argv)
case 'm':
where = BTRFS_SCAN_MOUNTED;
break;
+   case 'p':
+   print_missing = 1;
+   break;
default:
usage(cmd_show_usage);
}
@@ -571,7 +585,7 @@ static int cmd_show(int argc, char **argv)
goto devs_only;
 
/* show mounted btrfs */
-   ret = btrfs_scan_kernel(search);
+   ret = btrfs_scan_kernel(search, print_missing);
if (search  !ret)
return 0;
 
diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index 8fea115..db2e355 100644
--- a/man/btrfs.8.in
+++ b/man/btrfs.8.in
@@ -25,7 +25,7 @@ btrfs \- control a btrfs filesystem
 .PP
 \fBbtrfs\fP \fBfilesystem df\fP\fI path\fP
 .PP
-\fBbtrfs\fP \fBfilesystem show\fP 
[\fI--mounted\fP|\fI--all-devices\fP|\fIuuid\fP]\fP
+\fBbtrfs\fP \fBfilesystem show\fP 
[\fI--mounted\fP|\fI--all-devices\fP|\fI--print-missing\fP|\fIuuid\fP]\fP
 

[PATCH 1/2] btrfs-progs: Add missing devices check for mounted btrfs.

2014-02-06 Thread Qu Wenruo
In btrfs/003 of xfstest, it will check whether btrfs fi show can find
missing devices.

But before the patch, btrfs-progs will not check whether device missing
if given a mounted btrfs mountpoint/block device.
This patch fixes the bug and will pass btrfs/003.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
Cc: Anand Jain anand.j...@oracle.com
---
 cmds-filesystem.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 384d1b9..4c9933d 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -363,6 +363,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
char *label, char *path)
 {
int i;
+   int fd;
+   int missing;
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
struct btrfs_ioctl_dev_info_args *tmp_dev_info;
int ret;
@@ -385,6 +387,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
 
for (i = 0; i  fs_info-num_devices; i++) {
tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)dev_info[i];
+
+   /* Add check for missing devices even mounted */
+   fd = open((char *)tmp_dev_info-path, O_RDONLY);
+   if (fd  0) {
+   missing = 1;
+   continue;
+   }
+   close(fd);
printf(\tdevid %4llu size %s used %s path %s\n,
tmp_dev_info-devid,
pretty_size(tmp_dev_info-total_bytes),
@@ -392,6 +402,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
tmp_dev_info-path);
}
 
+   if (missing)
+   printf(\t*** Some devices missing\n);
printf(\n);
return 0;
 }
-- 
1.8.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs-progs: Add missing devices check for mounted btrfs.

2014-02-06 Thread Qu Wenruo
In btrfs/003 of xfstest, it will check whether btrfs fi show can find
missing devices.

But before the patch, btrfs-progs will not check whether device missing
if given a mounted btrfs mountpoint/block device.
This patch fixes the bug and will pass btrfs/003.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
Cc: Anand Jain anand.j...@oracle.com
---
changelog:
v1-v2: fix a uninitialized variant int print_one_fs.
---
 cmds-filesystem.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 384d1b9..c1bdab4 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -363,6 +363,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
char *label, char *path)
 {
int i;
+   int fd;
+   int missing = 0;
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
struct btrfs_ioctl_dev_info_args *tmp_dev_info;
int ret;
@@ -385,6 +387,14 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
 
for (i = 0; i  fs_info-num_devices; i++) {
tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)dev_info[i];
+
+   /* Add check for missing devices even mounted */
+   fd = open((char *)tmp_dev_info-path, O_RDONLY);
+   if (fd  0) {
+   missing = 1;
+   continue;
+   }
+   close(fd);
printf(\tdevid %4llu size %s used %s path %s\n,
tmp_dev_info-devid,
pretty_size(tmp_dev_info-total_bytes),
@@ -392,6 +402,8 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args 
*fs_info,
tmp_dev_info-path);
}
 
+   if (missing)
+   printf(\t*** Some devices missing\n);
printf(\n);
return 0;
 }
-- 
1.8.5.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Are nocow files snapshot-aware

2014-02-06 Thread Duncan
Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted:

 Duncan 1i5t5.dun...@cox.net schrieb:
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and
 the continuing VM writes would nearly guarantee that many of those
 snapshots had unique blocks, so the effect was nearly as bad as if it
 wasn't NOCOW at all!
 
 The question here is: Does it really make sense to create such snapshots
 of disk images currently online and running a system. They will probably
 be broken anyway after rollback - or at least I'd not fully trust the
 contents.
 
 VM images should not be part of a subvolume of which snapshots are taken
 at a regular and short interval. The problem will go away if you follow
 this rule.
 
 The same applies to probably any kind of file which you make nocow -
 e.g. database files. The only use case is taking _controlled_ snapshots
 - and doing it all 30 seconds is by all means NOT controlled, it's
 completely undeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it, 
as at the time I didn't understand the interaction between NOCOW and 
snapshots and couldn't quite understand how a NOCOW file was still 
triggering the snapshot-aware-defrag pathology, which in fact we were 
just beginning to realize based on such reports.

In fact at the time I assumed it was because the NOCOW had been added 
after the file was originally written, such that btrfs couldn't NOCOW it 
properly.  That still might have been the case, but now that I understand 
the interaction between snapshots and NOCOW, I see that such heavy 
snapshotting on an actively written VM could trigger the same issue, even 
if the NOCOW file was created properly and was indeed NOCOW when content 
was actually first written into it.

But definitely agreed.  30 second snapshotting, with a 30 second commit 
deadline, is pretty much off the deep end regardless of the content.  I'd 
even argue that 1 minute snapshotting without snapshots thinned down to 
say 5 or 10 minute snapshots after say an hour, is too extreme to be 
practical.  Even a couple days of that, and how are you going to even 
manage the thousands of snapshots or know which precise snapshot to roll 
back to if you had to?  That's why in the what-I-considered toward the 
extreme end of practical example I posted here some days ago, IIRC I had 
it do 1 minute snapshots but thin them down to 5 or 10 minutes after a 
couple hours and to half hour after a couple days, with something like 90 
day snapshots out to a decade.  Even that I considered extreme altho at 
least reasonably so, but the point was, even with something as extreme as 
1 minute snapshots at first frequency and decade of snapshots, with 
reasonable thinning it was still very manageable, something like 250 
snapshots total, well below the thousands or tens of thousands we're 
sometimes seeing in reports.  That's hardly practical no matter how you 
slice it, as how likely are you to know the exact minute to roll back to, 
even a month out, and even if you do, if you can survive a month before 
detecting it, how important is rolling back to precisely the last minute 
before the problem actually going to be?  At a month out perhaps the 
hour, but the minute?

But some of the snapshotting scripts out there, and the admins running 
them, seem to have the idea that just because it's possible it must be 
done, and they have snapshots taken every minute or more frequently, with 
no automated snapshot thinning at all.  IMO that's pathology run amok 
even if btrfs /was/ stable and mature and /could/ handle it properly.

That's regardless of the content so it's from a different angle than you 
were attacking the problem from...  But if admins aren't able to 
recognize the problem with per-minute snapshots without any thinning at 
all for days, weeks, months on end, I doubt they'll be any better at 
recognizing that VMs, databases, etc, should have a dedicated subvolume.  
Taking the long view, with a bit of luck we'll get to the point were 
database and VM setup scripts and/or documentation recommend setting NOCOW 
on the directory the VMs/DBs/etc will be in, but in practice, even that's 
pushing it, and will take some time (2-5 years) as btrfs stabilizes and 
mainstreams, taking over from ext4 as the assumed Linux default.  Other 
than that, I guess it'll be a case-by-case basis as people report 
problems here.  But with a snapshot-aware-defrag that actually scales, 
hopefully there won't be so many people reporting problems.  True, they 
might not have the best optimized system and may have some minor 
pathologies in their admin practices, but as long as they remain /minor/ 
pathologies because btrfs can deal with them better than it does now thus 
keeping them from