Re: Understanding btrfs and backups

2014-03-13 Thread Chris Samuel
On Sun, 9 Mar 2014 03:30:44 PM Duncan wrote:

 While I realize that was in reference to the "up in flames" comment and 
 presumably if there's a need to worry about that, offsite backup /is/ of 
 some value, for some people, offsite backup really isn't that valuable.

Actually I missed that comment altogether, it was really just an illustration 
of why people should think about it - and then come to a decision about 
whether or not it makes sense for them.

In your case maybe not, but for me (and my wife) it certainly does.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





Re: Ordering of directory operations maintained across system crashes in Btrfs?

2014-03-13 Thread Goswin von Brederlow
On Mon, Mar 03, 2014 at 11:56:49AM -0600, thanumalayan mad wrote:
 Chris,
 
 Great, thanks. Any guesses whether other filesystems (disk-based) do
 things similar to the last two examples you pointed out? Saying "we
 think 3 normal filesystems reorder stuff" seems to motivate
 application developers to fix bugs ...
 
 Also, just for more information, the sequence we observed was,
 
 Thread A:
 
 unlink(foo)
 rename(somefile X, somefile Y)
 fsync(somefile Z)
 
 The source and destination of the renamed file are unrelated to the
 fsync. But the rename happens in the fsync()'s transaction, while
 unlink() is delayed. I guess this has something to do with backrefs
 too.
 
 Thanks,
 Thanu
 
 On Mon, Mar 3, 2014 at 11:43 AM, Chris Mason c...@fb.com wrote:
  On 02/25/2014 09:01 PM, thanumalayan mad wrote:
 
  Hi all,
 
  Slightly complicated question.
 
  Assume I do two directory operations in a Btrfs partition (such as an
  unlink() and a rename()), one after the other, and a crash happens
  after the rename(). Can Btrfs (the current version) send the second
  operation to the disk first, so that after the crash, I observe the
  effects of rename() but not the effects of the unlink()?
 
  I think I am observing Btrfs re-ordering an unlink() and a rename(),
  and I just want to confirm that my observation is true. Also, if Btrfs
  does send directory operations to disk out of order, is there some
  limitation on this? Like, is this restricted to only unlink() and
  rename()?
 
  I am looking at some (buggy) applications that use Btrfs, and this
  behavior seems to affect them.
 
 
  There isn't a single answer for this one.
 
  You might have
 
  Thread A:
 
  unlink(foo);
  rename(somefile, somefile2);
  crash
 
  This should always have the unlink happen before or in the same transaction
  as the rename.
 
  Thread A:
 
  unlink(dirA/foo);
  rename(dirB/somefile, dirB/somefile2);
 
  Here you're at the mercy of what is happening in dirB.  If someone fsyncs
  that directory, it may hit the disk before the unlink.
 
  Thread A:
 
  unlink(foo);
  rename(somefile, somefile2);
  fsync(somefile);
 
  This one is even fuzzier.  Backrefs allow us to do some file fsyncs without
  touching the directory, making it possible the unlink will hit disk after
  the fsync.
 
  -chris

As I understand it POSIX only guarantees that the in-core data is
updated by the syscalls in order. On a crash anything can happen. If the
application needs something to be committed to disk then it needs to
fsync(). Specifically it needs to fsync() the changed files AND
directories.

From man fsync:

   Calling  fsync()  does  not  necessarily  ensure  that the entry in the
   directory containing the file has  also  reached  disk.   For  that  an
   explicit fsync() on a file descriptor for the directory is also needed.

So the fsync(somefile) above doesn't necessarily force the rename to
disk.
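
To make this concrete, here is a minimal userspace sketch of that
pattern (the durable_rename helper name is made up for illustration;
the file names follow Chris's examples above):

    /* cc -Wall -o durable_rename durable_rename.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Rename oldpath to newpath and make the rename itself durable:
     * fsync() the file, do the rename, then fsync() the containing
     * directory, because fsync() on the file alone does not push the
     * directory entry to disk (see the man page excerpt above). */
    static int durable_rename(const char *oldpath, const char *newpath,
                              const char *dirpath)
    {
            int fd, dirfd;

            fd = open(oldpath, O_RDONLY);
            if (fd < 0)
                    return -1;
            if (fsync(fd) < 0) {
                    close(fd);
                    return -1;
            }
            close(fd);

            if (rename(oldpath, newpath) < 0)
                    return -1;

            /* the step applications tend to forget */
            dirfd = open(dirpath, O_RDONLY);
            if (dirfd < 0)
                    return -1;
            if (fsync(dirfd) < 0) {
                    close(dirfd);
                    return -1;
            }
            close(dirfd);
            return 0;
    }

    int main(void)
    {
            return durable_rename("somefile", "somefile2", ".") ? 1 : 0;
    }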


My experience with fuse tells me that at least fuse handles operations
in parallel and only blocks a later operation if it is affected by an
earlier operation. An unlink in one directory can (and will) run in
parallel to a rename in another directory. Then, depending on how
threads get scheduled, the rename can complete before the unlink.

My conclusion is that you need to fsync() the directory to ensure the
metadata update has made it to the disk if you require that. Otherwise
you have to be able to cope with (meta)data loss on crash.


Note: https://code.google.com/p/leveldb/issues/detail?id=189 talks a
lot about journaling and claims that any journaling filesystem should
preserve the order. I think that is rather pointless for two reasons:

1) The journal gets replayed after a crash, so whatever order the
two journal entries are written in doesn't matter. They both make it to
disk. You can't see one without the other. This is assuming you
fsync()ed the dirs to force the metadata change into the journal in
the first place.

2) btrfs afaik doesn't have any journal since COW already guarantees
atomic updates and crash protection.


Overall I also think the fear of fsync() is overrated for this issue.
This would only happen on program start or whenever you open a
database. Not something that happens every second.

MfG
Goswin


Re: [PATCH 2/2] Btrfs-progs: mkfs: make sure we can deal with hard links with -r option

2014-03-13 Thread Wang Shilong

Hi Dave,

On 03/13/2014 12:21 AM, David Sterba wrote:

On Tue, Mar 11, 2014 at 06:29:09PM +0800, Wang Shilong wrote:

@@ -840,6 +833,10 @@ static int traverse_directory(struct btrfs_trans_handle *trans,
  cur_file->d_name, cur_inum,
  parent_inum, dir_index_cnt,
  &cur_inode);
+   if (ret == -EEXIST) {
+   BUG_ON(st.st_nlink <= 1);

As the mkfs operation is restartable, can we handle the error?
This should be a logic error, which means an inode has hard links (but 
st_nlink <= 1). :-)


Adding error handling may be better, I will update it.

Thanks,
Wang


Otherwise, good fix, thanks.


+   continue;
+   }
if (ret) {
fprintf(stderr, "add_inode_items failed\n");
goto fail;
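
For illustration, the error handling could look something like this in
traverse_directory() (a sketch only -- the variable names follow the
quoted hunk, and Wang's updated patch may well differ):

    if (ret == -EEXIST) {
            /* -EEXIST is only expected when the inode has hard
             * links (st_nlink > 1): its items were already added
             * when the first link was traversed.  A single-link
             * inode here is a logic error, so fail gracefully
             * instead of BUG_ON(). */
            if (st.st_nlink <= 1) {
                    fprintf(stderr,
                            "add_inode_items: unexpected -EEXIST\n");
                    goto fail;
            }
            continue;
    }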



[PATCH RESEND] xfstests: add test for btrfs send issuing premature rmdir operations

2014-03-13 Thread Filipe David Borba Manana
Regression test for btrfs incremental send issue where a rmdir instruction
is sent against an orphan directory inode which is not empty yet, causing
btrfs receive to fail when it attempts to remove the directory.

This issue is fixed by the following linux kernel btrfs patch:

Btrfs: fix send attempting to rmdir non-empty directories

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
Reviewed-by: Josef Bacik jba...@fb.com
---

Resending since Dave Chinner asked to do it for any patches he might have
missed in his last merge.

 tests/btrfs/043 |  149 +++
 tests/btrfs/043.out |1 +
 tests/btrfs/group   |1 +
 3 files changed, 151 insertions(+)
 create mode 100644 tests/btrfs/043
 create mode 100644 tests/btrfs/043.out

diff --git a/tests/btrfs/043 b/tests/btrfs/043
new file mode 100644
index 000..b1fef96
--- /dev/null
+++ b/tests/btrfs/043
@@ -0,0 +1,149 @@
+#! /bin/bash
+# FS QA Test No. btrfs/043
+#
+# Regression test for btrfs incremental send issue where a rmdir instruction
+# is sent against an orphan directory inode which is not empty yet, causing
+# btrfs receive to fail when it attempts to remove the directory.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send attempting to rmdir non-empty directories
+#
+#---
+# Copyright (c) 2014 Filipe Manana.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+mkdir -p $SCRATCH_MNT/a/b
+mkdir $SCRATCH_MNT/0
+mkdir $SCRATCH_MNT/1
+mkdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/0 $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/1 $SCRATCH_MNT/a/b/c
+echo 'ola mundo' > $SCRATCH_MNT/a/b/c/foo.txt
+mkdir $SCRATCH_MNT/a/b/c/x
+mkdir $SCRATCH_MNT/a/b/c/x2
+mkdir $SCRATCH_MNT/a/b/y
+mkdir $SCRATCH_MNT/a/b/z
+mkdir -p $SCRATCH_MNT/a/b/d1/d2/d3
+mkdir $SCRATCH_MNT/a/b/d4
+
+# Filesystem looks like:
+#
+# .                            (ino 256)
+# |-- a/                       (ino 257)
+#     |-- b/                   (ino 258)
+#         |-- c/               (ino 261)
+#         |   |-- foo.txt      (ino 262)
+#         |   |-- 0/           (ino 259)
+#         |   |-- 1/           (ino 260)
+#         |   |-- x/           (ino 263)
+#         |   |-- x2/          (ino 264)
+#         |
+#         |-- y/               (ino 265)
+#         |-- z/               (ino 266)
+#         |-- d1/              (ino 267)
+#         |   |-- d2/          (ino 268)
+#         |       |-- d3/      (ino 269)
+#         |
+#         |-- d4/              (ino 270)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+rm -f $SCRATCH_MNT/a/b/c/foo.txt
+mv $SCRATCH_MNT/a/b/y $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/z $SCRATCH_MNT/a
+mv $SCRATCH_MNT/a/b/c/x $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/c/0 $SCRATCH_MNT/a/b/YY/00
+mv $SCRATCH_MNT/a/b/c/x2 $SCRATCH_MNT/a/z/X_2
+mv $SCRATCH_MNT/a/b/c/1 $SCRATCH_MNT/a/z/X_2
+rmdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/a/b/d4 $SCRATCH_MNT/a/d44
+mv $SCRATCH_MNT/a/b/d1/d2 $SCRATCH_MNT/a/d44
+rmdir $SCRATCH_MNT/a/b/d1
+
+# Filesystem now looks like:
+#
+# .                            (ino 256)
+# |-- a/                       (ino 257)
+#     |-- b/                   (ino 258)
+#     |   |-- YY/              (ino 265)
+#     |       |-- x/           (ino 263)
+#     |       |-- 00/          (ino 259)
+#     |
+#     |-- z/                   (ino 266)
+#     |   |-- X_2/             (ino 264)
+#     |       |-- 1/           (ino 260)
+#     |
+#     |-- d44/                 (ino 270)
+#         |-- d2/              (ino 268)
+#             |-- d3/          (ino 269)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
+   

[PATCH RESEND] xfstests: add regression test for btrfs incremental send

2014-03-13 Thread Filipe David Borba Manana
Regression test for a btrfs incremental send issue where invalid paths for
utimes, chown and chmod operations were sent to the send stream, causing
btrfs receive to fail.

If a directory had a move/rename operation delayed, and none of its parent
directories, except for the immediate one, had delayed move/rename operations,
after processing the directory's references, the incremental send code would
issue invalid paths for utimes, chown and chmod operations.

This issue is fixed by the following linux kernel btrfs patch:

Btrfs: fix send issuing outdated paths for utimes, chown and chmod

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
Reviewed-by: Josef Bacik jba...@fb.com
---

Resending since Dave Chinner asked to do it for any patches he might have
missed in his last merge.

Originally submitted with the title:
xfstests: add test btrfs/042 for btrfs incremental send

 tests/btrfs/044 |  129 +++
 tests/btrfs/044.out |1 +
 tests/btrfs/group   |1 +
 3 files changed, 131 insertions(+)
 create mode 100644 tests/btrfs/044
 create mode 100644 tests/btrfs/044.out

diff --git a/tests/btrfs/044 b/tests/btrfs/044
new file mode 100644
index 000..dae189e
--- /dev/null
+++ b/tests/btrfs/044
@@ -0,0 +1,129 @@
+#! /bin/bash
+# FS QA Test No. btrfs/044
+#
+# Regression test for a btrfs incremental send issue where under certain
+# scenarios invalid paths for utimes, chown and chmod operations were sent
+# to the send stream, causing btrfs receive to fail.
+#
+# If a directory had a move/rename operation delayed, and none of its parent
+# directories, except for the immediate one, had delayed move/rename 
operations,
+# after processing the directory's references, the incremental send code would
+# issue invalid paths for utimes, chown and chmod operations.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send issuing outdated paths for utimes, chown and chmod
+#
+#---
+# Copyright (c) 2014 Filipe Manana.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+umask 0
+mkdir -p $SCRATCH_MNT/a/b/c/d/e
+mkdir $SCRATCH_MNT/a/b/c/f
+echo 'ola ' > $SCRATCH_MNT/a/b/c/d/e/file.txt
+chmod 0777 $SCRATCH_MNT/a/b/c/d/e
+
+# Filesystem looks like:
+#
+# .                            (ino 256)
+# |-- a/                       (ino 257)
+#     |-- b/                   (ino 258)
+#         |-- c/               (ino 259)
+#             |-- d/           (ino 260)
+#             |   |-- e/       (ino 261)
+#             |       |-- file.txt  (ino 262)
+#             |
+#             |-- f/           (ino 263)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+echo 'mundo' >> $SCRATCH_MNT/a/b/c/d/e/file.txt
+mv $SCRATCH_MNT/a/b/c/d/e/file.txt $SCRATCH_MNT/a/b/c/d/e/file2.txt
+mv $SCRATCH_MNT/a/b/c/f $SCRATCH_MNT/a/b/f2
+mv $SCRATCH_MNT/a/b/c/d/e $SCRATCH_MNT/a/b/f2/e2
+mv $SCRATCH_MNT/a/b/c $SCRATCH_MNT/a/b/c2
+mv $SCRATCH_MNT/a/b/c2/d $SCRATCH_MNT/a/b/c2/d2
+chmod 0700 $SCRATCH_MNT/a/b/f2/e2
+
+# Filesystem now looks like:
+#
+# .                            (ino 256)
+# |-- a/                       (ino 257)
+#     |-- b/                   (ino 258)
+#         |-- c2/              (ino 259)
+#         |   |-- d2/          (ino 260)
+#         |
+#         |-- f2/              (ino 263)
+#             |-- e2/          (ino 261)
+#                 |-- file2.txt  (ino 262)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
+   

Re: [PATCH] Btrfs: fix joining same transaction handle more than twice

2014-03-13 Thread Josef Bacik
On 03/13/2014 01:19 AM, Wang Shilong wrote:
 We hit something like the following function call flows:
 
 |-run_delalloc_range()
  |-btrfs_join_transaction()
|-cow_file_range()
  |-btrfs_join_transaction()
|-find_free_extent()
  |-btrfs_join_transaction()
 
 Trace information can be seen as:
 
 [ 7411.127040] [ cut here ]
 [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 
 start_transaction+0x561/0x580 [btrfs]()
 [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G   O 
 3.13.0+ #4
 [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
 [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
 [ 7411.127092] Call Trace:
 [ 7411.127097]  [815b87b0] dump_stack+0x45/0x56
 [ 7411.127101]  [81051ffd] warn_slowpath_common+0x7d/0xa0
 [ 7411.127102]  [810520da] warn_slowpath_null+0x1a/0x20
 [ 7411.127109]  [a0444fb1] start_transaction+0x561/0x580 [btrfs]
 [ 7411.127115]  [a0445027] btrfs_join_transaction+0x17/0x20 [btrfs]
 [ 7411.127120]  [a0431c91] find_free_extent+0xa21/0xb50 [btrfs]
 [ 7411.127126]  [a0431f68] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
 [ 7411.127131]  [a04322ce] btrfs_alloc_free_block+0xee/0x440 [btrfs]
 [ 7411.127137]  [a043bd6e] ? btree_set_page_dirty+0xe/0x10 [btrfs]
 [ 7411.127142]  [a041da51] __btrfs_cow_block+0x121/0x530 [btrfs]
 [ 7411.127146]  [a041dfff] btrfs_cow_block+0x11f/0x1c0 [btrfs]
 [ 7411.127151]  [a0421b74] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
 [ 7411.127157]  [a0438567] btrfs_lookup_file_extent+0x37/0x40 
 [btrfs]
 [ 7411.127163]  [a0456bfc] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
 [ 7411.127169]  [a0444ae3] ? start_transaction+0x93/0x580 [btrfs]
 [ 7411.127171]  [811663e2] ? kmem_cache_alloc+0x132/0x140
 [ 7411.127176]  [a041cd9a] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
 [ 7411.127182]  [a044aa61] cow_file_range_inline+0x181/0x2e0 [btrfs]
 [ 7411.127187]  [a044aead] cow_file_range+0x2ed/0x440 [btrfs]
 [ 7411.127194]  [a0464d7f] ? free_extent_buffer+0x4f/0xb0 [btrfs]
 [ 7411.127200]  [a044b38f] run_delalloc_nocow+0x38f/0xa60 [btrfs]
 [ 7411.127207]  [a0461600] ? test_range_bit+0x30/0x180 [btrfs]
 [ 7411.127212]  [a044bd48] run_delalloc_range+0x2e8/0x350 [btrfs]
 [ 7411.127219]  [a04618f9] ? find_lock_delalloc_range+0x1a9/0x1e0 
 [btrfs]
 [ 7411.127222]  [812a1e71] ? blk_queue_bio+0x2c1/0x330
 [ 7411.127228]  [a0462ad4] __extent_writepage+0x2f4/0x760 [btrfs]
 
 Here we fix it by avoiding joining the transaction again if we already
 hold a transaction handle when allocating a chunk in find_free_extent().
 


So I just put that warning there to see if we were ever embedding 3
joins at a time, not because it was an actual problem; I'd say just kill
the warning.  Thanks,

Josef
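
(For context, the warning at fs/btrfs/transaction.c:383 appears to be
the nesting check that fires when the handle stashed in
current->journal_info is joined a third time -- roughly the shape
below, reconstructed from the trace rather than quoted from the tree:)

    /* in start_transaction(), when re-joining an existing handle: */
    if (current->journal_info) {
            h = current->journal_info;
            h->use_count++;
            WARN_ON(h->use_count > 2);  /* fires on the third join */
            /* ... */
    }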



[PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts'o
Previously, the no-op mount -o mount /dev/xxx operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
not documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: linux-fsde...@vger.kernel.org
Cc: Christoph Hellwig h...@infradead.org
Cc: Artem Bityutskiy dedeki...@gmail.com
Cc: Adrian Hunter adrian.hun...@intel.com
Cc: Evgeniy Dushistov dushis...@mail.ru
Cc: Jan Kara j...@suse.cz
Cc: OGAWA Hirofumi hirof...@mail.parknet.co.jp
Cc: Anders Larsen a...@alarsen.net
Cc: Phillip Lougher phil...@squashfs.org.uk
Cc: Kees Cook keesc...@chromium.org
Cc: Mikulas Patocka miku...@artax.karlin.mff.cuni.cz
Cc: Petr Vandrovec p...@vandrovec.name
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-c...@vger.kernel.org
Cc: samba-techni...@lists.samba.org
Cc: codal...@coda.cs.cmu.edu
Cc: linux-e...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: fuse-de...@lists.sourceforge.net
Cc: cluster-de...@redhat.com
Cc: linux-...@lists.infradead.org
Cc: jfs-discuss...@lists.sourceforge.net
Cc: linux-...@vger.kernel.org
Cc: linux-ni...@vger.kernel.org
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: ocfs2-de...@oss.oracle.com
Cc: reiserfs-de...@vger.kernel.org
---
 fs/adfs/super.c  | 1 +
 fs/affs/super.c  | 1 +
 fs/befs/linuxvfs.c   | 1 +
 fs/btrfs/super.c | 1 +
 fs/cifs/cifsfs.c | 1 +
 fs/coda/inode.c  | 1 +
 fs/cramfs/inode.c| 1 +
 fs/debugfs/inode.c   | 1 +
 fs/devpts/inode.c| 1 +
 fs/efs/super.c   | 1 +
 fs/ext2/super.c  | 1 +
 fs/ext3/super.c  | 2 ++
 fs/ext4/super.c  | 2 ++
 fs/f2fs/super.c  | 2 ++
 fs/fat/inode.c   | 2 ++
 fs/freevxfs/vxfs_super.c | 1 +
 fs/fuse/inode.c  | 1 +
 fs/gfs2/super.c  | 2 ++
 fs/hfs/super.c   | 1 +
 fs/hfsplus/super.c   | 1 +
 fs/hpfs/super.c  | 2 ++
 fs/isofs/inode.c | 1 +
 fs/jffs2/super.c | 1 +
 fs/jfs/super.c   | 1 +
 fs/minix/inode.c | 1 +
 fs/ncpfs/inode.c | 1 +
 fs/nfs/super.c   | 2 ++
 fs/nilfs2/super.c| 1 +
 fs/ntfs/super.c  | 2 ++
 fs/ocfs2/super.c | 2 ++
 fs/openpromfs/inode.c| 1 +
 fs/proc/root.c   | 2 ++
 fs/pstore/inode.c| 1 +
 fs/qnx4/inode.c  | 1 +
 fs/qnx6/inode.c  | 1 +
 fs/reiserfs/super.c  | 1 +
 fs/romfs/super.c | 1 +
 fs/squashfs/super.c  | 1 +
 fs/super.c   | 2 --
 fs/sysv/inode.c  | 1 +
 fs/ubifs/super.c | 1 +
 fs/udf/super.c   | 1 +
 fs/ufs/super.c   | 1 +
 fs/xfs/xfs_super.c   | 1 +
 44 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 7b3003c..952aeb0 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char *options)
 
 static int adfs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
return parse_options(sb, data);
 }
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d098731..3074530 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data)
 
	pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
 
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
 
	memcpy(volume, sbi->s_volume, 32);
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 845d2d6..56d70c8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent)
 static int
 befs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
	if (!(*flags & MS_RDONLY))
return -EINVAL;
return 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..00cd0c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1381,6 +1381,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
	unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
 
+   sync_filesystem(sb);
btrfs_remount_prepare(fs_info);
 
ret = btrfs_parse_options(root, data);
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..4942c94 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -541,6 +541,7 @@ 

Re: [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Jan Kara
On Thu 13-03-14 10:20:56, Ted Tso wrote:
 Previously, the no-op mount -o mount /dev/xxx operation when the
  ^^remount

 file system is already mounted read-write causes an implied,
 unconditional syncfs().  This seems pretty stupid, and it's certainly
 not documented or guaranteed to do this, nor is it particularly useful,
 except in the case where the file system was mounted rw and is getting
 remounted read-only.
 
 However, it's possible that there might be some file systems that are
 actually depending on this behavior.  In most file systems, it's
 probably fine to only call sync_filesystem() when transitioning from
 read-write to read-only, and there are some file systems where this is
 not needed at all (for example, for a pseudo-filesystem or something
 like romfs).
   Hum, I'd avoid this exercise at least for filesystems where
sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
efs, freevxfs, romfs, squashfs. I think you can find a couple more which
clearly don't care about sync_filesystem() if you look a bit closer.

Honza
 
 Signed-off-by: Theodore Ts'o ty...@mit.edu
 Cc: linux-fsde...@vger.kernel.org
 Cc: Christoph Hellwig h...@infradead.org
 Cc: Artem Bityutskiy dedeki...@gmail.com
 Cc: Adrian Hunter adrian.hun...@intel.com
 Cc: Evgeniy Dushistov dushis...@mail.ru
 Cc: Jan Kara j...@suse.cz
 Cc: OGAWA Hirofumi hirof...@mail.parknet.co.jp
 Cc: Anders Larsen a...@alarsen.net
 Cc: Phillip Lougher phil...@squashfs.org.uk
 Cc: Kees Cook keesc...@chromium.org
 Cc: Mikulas Patocka miku...@artax.karlin.mff.cuni.cz
 Cc: Petr Vandrovec p...@vandrovec.name
 Cc: x...@oss.sgi.com
 Cc: linux-btrfs@vger.kernel.org
 Cc: linux-c...@vger.kernel.org
 Cc: samba-techni...@lists.samba.org
 Cc: codal...@coda.cs.cmu.edu
 Cc: linux-e...@vger.kernel.org
 Cc: linux-f2fs-de...@lists.sourceforge.net
 Cc: fuse-de...@lists.sourceforge.net
 Cc: cluster-de...@redhat.com
 Cc: linux-...@lists.infradead.org
 Cc: jfs-discuss...@lists.sourceforge.net
 Cc: linux-...@vger.kernel.org
 Cc: linux-ni...@vger.kernel.org
 Cc: linux-ntfs-...@lists.sourceforge.net
 Cc: ocfs2-de...@oss.oracle.com
 Cc: reiserfs-de...@vger.kernel.org
 ---
  fs/adfs/super.c  | 1 +
  fs/affs/super.c  | 1 +
  fs/befs/linuxvfs.c   | 1 +
  fs/btrfs/super.c | 1 +
  fs/cifs/cifsfs.c | 1 +
  fs/coda/inode.c  | 1 +
  fs/cramfs/inode.c| 1 +
  fs/debugfs/inode.c   | 1 +
  fs/devpts/inode.c| 1 +
  fs/efs/super.c   | 1 +
  fs/ext2/super.c  | 1 +
  fs/ext3/super.c  | 2 ++
  fs/ext4/super.c  | 2 ++
  fs/f2fs/super.c  | 2 ++
  fs/fat/inode.c   | 2 ++
  fs/freevxfs/vxfs_super.c | 1 +
  fs/fuse/inode.c  | 1 +
  fs/gfs2/super.c  | 2 ++
  fs/hfs/super.c   | 1 +
  fs/hfsplus/super.c   | 1 +
  fs/hpfs/super.c  | 2 ++
  fs/isofs/inode.c | 1 +
  fs/jffs2/super.c | 1 +
  fs/jfs/super.c   | 1 +
  fs/minix/inode.c | 1 +
  fs/ncpfs/inode.c | 1 +
  fs/nfs/super.c   | 2 ++
  fs/nilfs2/super.c| 1 +
  fs/ntfs/super.c  | 2 ++
  fs/ocfs2/super.c | 2 ++
  fs/openpromfs/inode.c| 1 +
  fs/proc/root.c   | 2 ++
  fs/pstore/inode.c| 1 +
  fs/qnx4/inode.c  | 1 +
  fs/qnx6/inode.c  | 1 +
  fs/reiserfs/super.c  | 1 +
  fs/romfs/super.c | 1 +
  fs/squashfs/super.c  | 1 +
  fs/super.c   | 2 --
  fs/sysv/inode.c  | 1 +
  fs/ubifs/super.c | 1 +
  fs/udf/super.c   | 1 +
  fs/ufs/super.c   | 1 +
  fs/xfs/xfs_super.c   | 1 +
  44 files changed, 53 insertions(+), 2 deletions(-)
 
 diff --git a/fs/adfs/super.c b/fs/adfs/super.c
 index 7b3003c..952aeb0 100644
 --- a/fs/adfs/super.c
 +++ b/fs/adfs/super.c
  @@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char *options)
  
  static int adfs_remount(struct super_block *sb, int *flags, char *data)
  {
 + sync_filesystem(sb);
   *flags |= MS_NODIRATIME;
   return parse_options(sb, data);
  }
 diff --git a/fs/affs/super.c b/fs/affs/super.c
 index d098731..3074530 100644
 --- a/fs/affs/super.c
 +++ b/fs/affs/super.c
  @@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data)
  
   pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
  
 + sync_filesystem(sb);
   *flags |= MS_NODIRATIME;
  
   memcpy(volume, sbi->s_volume, 32);
 diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
 index 845d2d6..56d70c8 100644
 --- a/fs/befs/linuxvfs.c
 +++ b/fs/befs/linuxvfs.c
  @@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent)
  static int
  befs_remount(struct super_block *sb, int *flags, char *data)
  {
 + sync_filesystem(sb);
   if (!(*flags & MS_RDONLY))
   

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Steven Whitehouse
Hi,

On Thu, 2014-03-13 at 17:23 +0100, Jan Kara wrote:
 On Thu 13-03-14 10:20:56, Ted Tso wrote:
  Previously, the no-op mount -o mount /dev/xxx operation when the
   ^^remount
 
  file system is already mounted read-write causes an implied,
  unconditional syncfs().  This seems pretty stupid, and it's certainly
  not documented or guaranteed to do this, nor is it particularly useful,
  except in the case where the file system was mounted rw and is getting
  remounted read-only.
  
  However, it's possible that there might be some file systems that are
  actually depending on this behavior.  In most file systems, it's
  probably fine to only call sync_filesystem() when transitioning from
  read-write to read-only, and there are some file systems where this is
  not needed at all (for example, for a pseudo-filesystem or something
  like romfs).
   Hum, I'd avoid this exercise at least for filesystems where
 sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
 also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
 efs, freevxfs, romfs, squashfs. I think you can find a couple more which
 clearly don't care about sync_filesystem() if you look a bit closer.
 

   Honza

I guess the same is true for other file systems which are mounted ro
too. So maybe a check for MS_RDONLY before doing the sync in those
cases?

Steve.




Re: Understanding btrfs and backups

2014-03-13 Thread Chris Murphy

On Mar 7, 2014, at 7:03 AM, Eric Mesa ericsbinarywo...@gmail.com wrote:
 
 Duncan - thanks for this comprehensive explanation. For a huge portion of
 your reply...I was all wondering why you and others were saying snapshots
 aren't backups. They certainly SEEMED like backups. But now I see that the
 problem is one of precise terminology vs colloquialisms. In other words,
 snapshots are not backups in and of themselves. They are like Mac's Time
 Machine. BUT if you take these snapshots and then put them on another media
 - whether that's local or not - THEN you have backups. Am I right, or am I
 still missing something subtle?

Hmm, yes because snapshots on a mirrored drive are on other media, but that's 
still not considered a backup. I think what makes a backup is a separate device 
and a separate file system. That's because the top vectors for data loss are: 
user induced, device failure, and file system corruption. These are 
substantially mitigated by having backup files located on both a separate file 
system and a separate device.

Also, Time Machine qualifies as a backup because it copies files to a separate 
device with a separate file system. (There is a feature in recent OS X versions 
that store hourly incremental backups on the local drive when the usual target 
device isn't available - these are arguably not backups but rather snapshots 
that are pending backups. Once the target device is available, the snapshots 
are copied over to it.)

If you have data you feel is really important, my suggestion is that you have a 
completely different backup/restore method than what you're talking about. It 
needs to be bulletproof and well tested. And consider all the Btrfs send/receive 
work you're doing as testing/work-in-progress. There are still cases on the 
list where people have had problems with send/receive; both the send and 
receive code have a lot of churn, so I don't know that anyone can definitively 
tell you that a backup based only on btrfs send/receive is going to reliably 
restore in one month, let alone three years. Should it? Yes of course. Will it?


Chris Murphy



Re: Testing BTRFS

2014-03-13 Thread Lists

On 03/10/2014 06:02 PM, Avi Miller wrote:

Oracle Linux 6 with the Unbreakable Enterprise Kernel Release 2 or Release 3 
has production-ready btrfs support. You can even convert your existing CentOS6 
boxes across to Oracle Linux 6 in-place without reinstalling:

http://linux.oracle.com/switch/centos/

Oracle also now provides all errata, including security and bug fixes, for free 
at http://public-yum.oracle.com and our kernel source code can be found 
at https://oss.oracle.com/git/


Is there any issue with BTRFS and 32 bit O/S like with ZFS?

-Ben


Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig

My backup use case is different from what has been recently 
discussed in another thread. I'm trying to guard against hardware 
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to backup 
this filesystem regularly and efficiently to an external disk (same 
model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup and 
rebalancing from the surviving disk to the substitute only applies the 
missing changes.

* when the entire raid fails, I can re-build a new one from the backup.

The filesystem is mounted at its root and has several nested subvolumes 
and snapshots (in a .snapshots subdir on each subvol).

Is it possible to do what I'm looking for?

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: Incremental backup for a raid1

2014-03-13 Thread Hugo Mills
On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
 
 My backup use case is different from what has been recently 
 discussed in another thread. I'm trying to guard against hardware 
 failure and other causes of destruction.
 
 I have a btrfs raid1 filesystem spread over two disks. I want to backup 
 this filesystem regularly and efficiently to an external disk (same 
 model as the ones in the raid) in such a way that
 
 * when one disk in the raid fails, I can substitute the backup and 
 rebalancing from the surviving disk to the substitute only applies the 
 missing changes.
 
 * when the entire raid fails, I can re-build a new one from the backup.
 
 The filesystem is mounted at its root and has several nested subvolumes 
 and snapshots (in a .snapshots subdir on each subvol).
 
 Is it possible to do what I'm looking for?

   For point 2, yes. (Add new disk, balance -oconvert from single to
raid1).

   For point 1, not really. It's a different filesystem, so it'll have
a different UUID. You *might* be able to get away with rsync of one of
the block devices in the array to the backup block device, but you'd
have to unmount the FS (or halt all writes to it) for the period of
the rsync to ensure a consistent image, and the rsync would have to
read all the data in the device being synced to work out what to send.
Probably not what you want.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Do not meddle in the affairs of system administrators,  for ---   
  they are subtle,  and quick to anger.  




[PATCH] Btrfs: remove transaction from send

2014-03-13 Thread Josef Bacik
Let's try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reported-by: Hugo Mills h...@carfax.org.uk
Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/backref.c | 33 +++
 fs/btrfs/ctree.c   | 88 --
 fs/btrfs/ctree.h   |  3 +-
 fs/btrfs/disk-io.c |  3 +-
 fs/btrfs/extent-tree.c | 20 ++--
 fs/btrfs/inode-map.c   | 14 
 fs/btrfs/send.c| 57 ++--
 fs/btrfs/transaction.c | 45 --
 fs/btrfs/transaction.h |  1 +
 9 files changed, 77 insertions(+), 187 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 860f4f2..0be0e94 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -329,7 +329,10 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
goto out;
}
 
-   root_level = btrfs_old_root_level(root, time_seq);
+   if (path->search_commit_root)
+   root_level = btrfs_header_level(root->commit_root);
+   else
+   root_level = btrfs_old_root_level(root, time_seq);
 
if (root_level + 1 == level) {
srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -1092,9 +1095,9 @@ static int btrfs_find_all_leafs(struct btrfs_trans_handle *trans,
  *
  * returns 0 on success, < 0 on error.
  */
-int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info, u64 bytenr,
-   u64 time_seq, struct ulist **roots)
+static int __btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+ struct btrfs_fs_info *fs_info, u64 bytenr,
+ u64 time_seq, struct ulist **roots)
 {
struct ulist *tmp;
struct ulist_node *node = NULL;
@@ -1130,6 +1133,20 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
return 0;
 }
 
+int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr,
+u64 time_seq, struct ulist **roots)
+{
+   int ret;
+
+   if (!trans)
+   down_read(&fs_info->commit_root_sem);
+   ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
+   if (!trans)
+   up_read(&fs_info->commit_root_sem);
+   return ret;
+}
+
 /*
  * this makes the path point to (inum INODE_ITEM ioff)
  */
@@ -1509,6 +1526,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
if (IS_ERR(trans))
return PTR_ERR(trans);
btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
+   } else {
+   down_read(&fs_info->commit_root_sem);
}
 
ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
@@ -1519,8 +1538,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 
ULIST_ITER_INIT(&ref_uiter);
while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
-   ret = btrfs_find_all_roots(trans, fs_info, ref_node->val,
-  tree_mod_seq_elem.seq, &roots);
+   ret = __btrfs_find_all_roots(trans, fs_info, ref_node->val,
+tree_mod_seq_elem.seq, &roots);
if (ret)
break;
ULIST_ITER_INIT(&root_uiter);
@@ -1542,6 +1561,8 @@ out:
if (!search_commit_root) {
btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
btrfs_end_transaction(trans, fs_info->extent_root);
+   } else {
+   up_read(&fs_info->commit_root_sem);
}
 
return ret;
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..9d89c16 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5360,7 +5360,6 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 {
int ret;
int cmp;
-   struct btrfs_trans_handle *trans = NULL;
struct btrfs_path 

Re: Testing BTRFS

2014-03-13 Thread Avi Miller
Hi,

On 14 Mar 2014, at 5:10 am, Lists li...@benjamindsmith.com wrote:

 Is there any issue with BTRFS and 32 bit O/S like with ZFS?

We provide some btrfs support with the 32-bit UEK Release 2 on OL6, but we 
strongly recommend only using the UEK Release 3 which is 64-bit only.

--
Oracle http://www.oracle.com
Avi Miller | Product Management Director | +61 (3) 8616 3496
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia



Re: Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig
On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
 On 2014-Mar-13 14:28, Hugo Mills wrote:
  On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
  My backup use case is different from what has been recently
  discussed in another thread. I'm trying to guard against hardware
  failure and other causes of destruction.
  
  I have a btrfs raid1 filesystem spread over two disks. I want to
  backup this filesystem regularly and efficiently to an external
  disk (same model as the ones in the raid) in such a way that
  
  * when one disk in the raid fails, I can substitute the backup and
  rebalancing from the surviving disk to the substitute only applies
  the missing changes.
  
  * when the entire raid fails, I can re-build a new one from the
  backup.
  
  The filesystem is mounted at its root and has several nested
  subvolumes and snapshots (in a .snapshots subdir on each subvol).
[...]

 I'm new; btrfs noob; completely unqualified to write intelligently on
 this topic, nevertheless:
 I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
 backup device someplace /dev/C
 
 Could you, at the time you wanted to backup the filesystem:
 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B
 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added
 3) balance to effect the backup (i.e. rebuilding the RAID1 onto
 /dev/C) 4) break/reconnect the original devices: remove /dev/C;
 re-add /dev/B to the fs

I've thought of this but don't dare try it without approval from the 
experts. At any rate, to be practical, this approach hinges on an 
ability to rebuild the raid1 incrementally. That is, the rebuild would 
have to start from what already is present on disk B (or C, when it is 
re-added). Starting from an effectively blank disk each time would be 
prohibitive.

Even if this would work, I'd much prefer keeping the original raid1 
intact and to only temporarily add another mirror: lazy mirroring, to 
give the thing a name.

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: Incremental backup for a raid1

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 3:14 PM, Michael Schuerig michael.li...@schuerig.de wrote:

 On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
 On 2014-Mar-13 14:28, Hugo Mills wrote:
 On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
  My backup use case is different from what has been recently
 discussed in another thread. I'm trying to guard against hardware
 failure and other causes of destruction.
 
 I have a btrfs raid1 filesystem spread over two disks. I want to
 backup this filesystem regularly and efficiently to an external
 disk (same model as the ones in the raid) in such a way that
 
 * when one disk in the raid fails, I can substitute the backup and
 rebalancing from the surviving disk to the substitute only applies
 the missing changes.
 
 * when the entire raid fails, I can re-build a new one from the
 backup.
 
 The filesystem is mounted at its root and has several nested
 subvolumes and snapshots (in a .snapshots subdir on each subvol).
 [...]
 
 I'm new; btrfs noob; completely unqualified to write intelligently on
 this topic, nevertheless:
 I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
 backup device someplace /dev/C
 
 Could you, at the time you wanted to backup the filesystem:
 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B
 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added
 3) balance to effect the backup (i.e. rebuilding the RAID1 onto
 /dev/C) 4) break/reconnect the original devices: remove /dev/C;
 re-add /dev/B to the fs
 
 I've thought of this but don't dare try it without approval from the 
  experts. At any rate, to be practical, this approach hinges on 
 ability to rebuild the raid1 incrementally. That is, the rebuild would 
 have to start from what already is present on disk B (or C, when it is 
 re-added). Starting from an effectively blank disk each time would be 
 prohibitive.
 
 Even if this would work, I'd much prefer keeping the original raid1 
 intact and to only temporarily add another mirror: lazy mirroring, to 
 give the thing a name.

At best this seems fragile, but I don't think it works and is an edge case from 
the start. This is what send/receive is for.

In the btrfs replace scenario, the missing device is removed from the volume. 
It's like a divorce. Missing device 2 is replaced by a different physical 
device also called device 2. If you then removed 2b and re-added (formerly 
replaced) device 2a, what happens? I don't know; I'm pretty sure the volume 
knows this is not device 2b as it should be, and won't accept formerly replaced 
device 2a. But it's an edge case to do this because you've said device 
replace. So lexicon-wise, I wouldn't even want this to work; we'd need a 
different command even if not different logic.

In the btrfs device add case, you now have a three disk raid1 which is a whole 
different beast. Since this isn't n-way raid1, each disk is not stand alone. 
You're only assured the data survives a one disk failure, meaning you must have 
two working drives. You've just increased your risk by doing this, not reduced 
it. It further proposes running an (ostensibly) production workflow with an 
always degraded volume, mounted with -o degraded, on an on-going basis. So it's 
three strikes. It's not n-way, you have no uptime if you lose one of two disks 
onsite, you'd have to go get the offsite/onshelf disk to keep working. Plus 
that offsite disk isn't stand alone, so why even have it offsite? This is a 
fail.

So the btrfs replace scenario might work but it seems like a bad idea. And 
overall it's a use case for which send/receive was designed anyway so why not 
just use that?

Chris Murphy



Re: [PATCH] Btrfs: remove transaction from send

2014-03-13 Thread Hugo Mills
On Thu, Mar 13, 2014 at 03:42:13PM -0400, Josef Bacik wrote:
 Let's try this again.  We can deadlock the box if we send on a box and try to
 write onto the same fs with the app that is trying to listen to the send pipe.
 This is because the writer could get stuck waiting for a transaction commit
 which is being blocked by the send.  So fix this by making sure looking at the
 commit roots is always going to be consistent.  We do this by keeping track of
 which roots need to have their commit roots swapped during commit, and then
 taking the commit_root_sem and swapping them all at once.  Then make sure we
 take a read lock on the commit_root_sem in cases where we search the commit 
 root
 to make sure we're always looking at a consistent view of the commit roots.
 Previously we had problems with this because we would swap a fs tree commit 
 root
 and then swap the extent tree commit root independently which would cause the
 backref walking code to screw up sometimes.  With this patch we no longer
 deadlock and pass all the weird send/receive corner cases.  Thanks,

   There's something still going on here. I managed to get about twice
as far through my test as I had before, but I again got an unexpected
EOF in stream, with btrfs send returning 1. As before, I have this in
syslog:

Mar 13 22:09:12 s_src@amelia kernel: BTRFS error (device sda2): did not find 
backref in send_root. inode=1786631, offset=825257984, disk_byte=36504023040 
found extent=36504023040\x0a

   So, on the evidence of one data point (I'll have another one when I
wake up tomorrow morning), this has made the problem harder to trigger
but it's still possible.

   Hugo.

 Reported-by: Hugo Mills h...@carfax.org.uk
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
  fs/btrfs/backref.c | 33 +++
  fs/btrfs/ctree.c   | 88 
 --
  fs/btrfs/ctree.h   |  3 +-
  fs/btrfs/disk-io.c |  3 +-
  fs/btrfs/extent-tree.c | 20 ++--
  fs/btrfs/inode-map.c   | 14 
  fs/btrfs/send.c| 57 ++--
  fs/btrfs/transaction.c | 45 --
  fs/btrfs/transaction.h |  1 +
  9 files changed, 77 insertions(+), 187 deletions(-)
 
 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 860f4f2..0be0e94 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
  @@ -329,7 +329,10 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
   goto out;
   }
  
  - root_level = btrfs_old_root_level(root, time_seq);
  + if (path->search_commit_root)
  + root_level = btrfs_header_level(root->commit_root);
  + else
  + root_level = btrfs_old_root_level(root, time_seq);
  
   if (root_level + 1 == level) {
   srcu_read_unlock(&fs_info->subvol_srcu, index);
  @@ -1092,9 +1095,9 @@ static int btrfs_find_all_leafs(struct btrfs_trans_handle *trans,
   *
   * returns 0 on success, < 0 on error.
   */
 -int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 - struct btrfs_fs_info *fs_info, u64 bytenr,
 - u64 time_seq, struct ulist **roots)
 +static int __btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 +   struct btrfs_fs_info *fs_info, u64 bytenr,
 +   u64 time_seq, struct ulist **roots)
  {
   struct ulist *tmp;
   struct ulist_node *node = NULL;
  @@ -1130,6 +1133,20 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
   return 0;
  }
  
 +int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 +  struct btrfs_fs_info *fs_info, u64 bytenr,
 +  u64 time_seq, struct ulist **roots)
 +{
 + int ret;
 +
 + if (!trans)
  + down_read(&fs_info->commit_root_sem);
  + ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
  + if (!trans)
  + up_read(&fs_info->commit_root_sem);
 + return ret;
 +}
 +
  /*
   * this makes the path point to (inum INODE_ITEM ioff)
   */
 @@ -1509,6 +1526,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
   if (IS_ERR(trans))
   return PTR_ERR(trans);
   btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
  + } else {
  + down_read(&fs_info->commit_root_sem);
   }
  
   ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
 @@ -1519,8 +1538,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
  
   ULIST_ITER_INIT(&ref_uiter);
   while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
  - ret = btrfs_find_all_roots(trans, fs_info, ref_node->val,
  -tree_mod_seq_elem.seq, &roots);
  + ret = __btrfs_find_all_roots(trans, fs_info, ref_node->val,
  +  tree_mod_seq_elem.seq, &roots);
   if (ret)
   

Re: Incremental backup for a raid1

2014-03-13 Thread Michael Schuerig
On Thursday 13 March 2014 16:04:33 Chris Murphy wrote:
 On Mar 13, 2014, at 3:14 PM, Michael Schuerig 
michael.li...@schuerig.de wrote:
  On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:
  On 2014-Mar-13 14:28, Hugo Mills wrote:
  On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:
   My backup use case is different from what has been recently
  discussed in another thread. I'm trying to guard against hardware
  failure and other causes of destruction.
  
  I have a btrfs raid1 filesystem spread over two disks. I want to
  backup this filesystem regularly and efficiently to an external
  disk (same model as the ones in the raid) in such a way that
  
  * when one disk in the raid fails, I can substitute the backup
  and
  rebalancing from the surviving disk to the substitute only
  applies
  the missing changes.
  
  * when the entire raid fails, I can re-build a new one from the
  backup.
  
  The filesystem is mounted at its root and has several nested
  subvolumes and snapshots (in a .snapshots subdir on each subvol).
  
  [...]
  
  I'm new; btrfs noob; completely unqualified to write intelligently
  on
  this topic, nevertheless:
  I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
  backup device someplace /dev/C
  
  Could you, at the time you wanted to backup the filesystem:
  1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B
  2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added
  3) balance to effect the backup (i.e. rebuilding the RAID1 onto
  /dev/C) 4) break/reconnect the original devices: remove /dev/C;
  re-add /dev/B to the fs
  
  I've thought of this but don't dare try it without approval from the
   experts. At any rate, to be practical, this approach hinges on
  an
  ability to rebuild the raid1 incrementally. That is, the rebuild
  would have to start from what already is present on disk B (or C,
  when it is re-added). Starting from an effectively blank disk each
  time would be prohibitive.
  
  Even if this would work, I'd much prefer keeping the original raid1
  intact and to only temporarily add another mirror: lazy mirroring,
  to give the thing a name.

[...]
 In the btrfs device add case, you now have a three disk raid1 which is
 a whole different beast. Since this isn't n-way raid1, each disk is
 not stand alone. You're only assured data survives a one disk failure
 meaning you must have two drives.

Yes, I understand that. Unless someone convinces me that it's a bad 
idea, I keep wishing for a feature that allows intermittently adding a 
third disk to a two disk raid1 and updating that disk so that it could 
replace one of the others.

 So the btrfs replace scenario might work but it seems like a bad idea.
 And overall it's a use case for which send/receive was designed
 anyway so why not just use that?

Because it's not "just". Doing it right doesn't seem trivial. For one 
thing, there are multiple subvolumes; not at the top-level but nested 
inside a root subvolume. Each of them already has snapshots of its own. 
If there already is a send/receive script that can handle such a setup 
I'll happily have a look at it.

Michael

-- 
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/



Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts'o
On Thu, Mar 13, 2014 at 04:28:23PM +, Steven Whitehouse wrote:
 
 I guess the same is true for other file systems which are mounted ro
 too. So maybe a check for MS_RDONLY before doing the sync in those
 cases?

My original patch moved the sync_filesystem into the check for
MS_RDONLY in the core VFS code.  The objection was raised that there
might be some file system out there that might depend on this
behaviour.  I can't imagine why, but I suppose it's at least
theoretically possible.

So the idea is that this particular patch is *guaranteed* not to make
any difference.  That way there can be no question about the patch's
correctness.

I'm going to follow up with a patch for ext4 that does exactly that,
but the idea is to allow each file system maintainer to do that for
their own file system.
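
(Presumably that ext4 follow-up amounts to making the call conditional,
along these lines, assuming the rw -> ro transition shows up as
MS_RDONLY in the new flags:)

    /* in ext4_remount(), instead of an unconditional call: */
    if (*flags & MS_RDONLY)
            sync_filesystem(sb);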

I could do that as well for file systems that are obviously
read-only, but then I'll find out that there's some weird case where
the file system can be used in a read-write fashion.  (Example: UDF is
normally used for DVDs, but at least in theory it can be used
read/write --- I'm told that Windows supports read-write UDF file
systems on USB sticks, and at least in theory it could be used as an
inter-OS exchange format in situations where VFAT and exFAT might not
be appropriate for various reasons.)

Cheers,

- Ted


Re: Incremental backup for a raid1

2014-03-13 Thread Lists

See comments at the bottom:

On 03/13/2014 05:29 PM, George Mitchell wrote:

On 03/13/2014 04:03 PM, Michael Schuerig wrote:

On Thursday 13 March 2014 16:04:33 Chris Murphy wrote:

On Mar 13, 2014, at 3:14 PM, Michael Schuerig

michael.li...@schuerig.de wrote:

On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote:

On 2014-Mar-13 14:28, Hugo Mills wrote:

On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote:

My backup use case is different from what has been recently
discussed in another thread. I'm trying to guard against hardware
failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to
backup this filesystem regularly and efficiently to an external
disk (same model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup, and
  rebalancing from the surviving disk to the substitute only applies
  the missing changes.

* when the entire raid fails, I can re-build a new one from the backup.

The filesystem is mounted at its root and has several nested
subvolumes and snapshots (in a .snapshots subdir on each subvol).

[...]


I'm new; btrfs noob; completely unqualified to write intelligently on
this topic, nevertheless:
I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a
backup device someplace /dev/C

Could you, at the time you wanted to backup the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added
3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C)
4) break/reconnect the original devices: remove /dev/C; re-add /dev/B
   to the fs

I've thought of this but don't dare try it without approval from the
experts. At any rate, to be practical, this approach hinges on the
ability to rebuild the raid1 incrementally. That is, the rebuild
would have to start from what already is present on disk B (or C,
when it is re-added). Starting from an effectively blank disk each
time would be prohibitive.

Even if this would work, I'd much prefer keeping the original raid1
intact and to only temporarily add another mirror: lazy mirroring,
to give the thing a name.

[...]

Michael

I think the closest thing there will ever be to this is n-way 
mirroring.  I currently use rsync to a separate drive to maintain a 
backup copy, but it is not integrated into the array like n-way would 
be, and is definitely not a perfect solution.  But a 3 drive 3-way 
would require the 3rd drive to be in the array the whole time, or it 
would run into the same problem of requiring a complete rebuild rather 
than an incremental one when reintroduced, UNLESS such a feature were 
specifically included in the design.  And even then, in a 3-way 
configuration, you would end up simplex on at least some data until 
the partial rebuild was completed.  Personally, I will be DELIGHTED 
when n-way appears, simply because basic 3-way gets us out of the 
dreaded simplex trap.




I'm coming from ZFS land, am a BTRFS newbie, and I don't understand this 
discussion, at all. I'm assuming that BTRFS send/receive works similar 
to ZFS's similarly named feature. We use snapshots and ZFS send/receive 
to a remote server to do our backups. To do an rsync of our production 
file store takes days because there are so many files, while 
snapshotting and using ZFS send/receive takes tens of minutes at local 
(Gbit) speeds, and a few hours at WAN speeds, nearly all of that time 
being transfer time.


So I just don't get the backup problem. Place btrfs' equivalent of a 
pool on the external drive, and use send/receive of the filesystem or 
snapshot(s). Does BTRFS work so differently in this regard? If so, I'd 
like to know what's different.


My primary interest in BTRFS vs ZFS is two-fold:

1) ZFS has a couple of limitations that I find disappointing, that don't 
appear to be present in BTRFS.
   A) Inability to upgrade a non-redundant ZFS pool/vdev to raidz or 
increase the raidz (redundancy) level after creation. (Yes, you can plan 
around this, but I see no good reason to HAVE to)
   B) Inability to remove a vdev once added to a pool.

2) Licensing: ZFS on Linux is truly great so far in all my testing, can't 
throw enough compliments their way, but I would really like to rely on a 
first class citizen as far as the Linux kernel is concerned.

Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)

2014-03-13 Thread Marc MERLIN
Can anyone comment on this?

Are others seeing some btrfs operations on filesystem/diskA hang/deadlock
other btrfs operations on filesystem/diskB ?

I just spent time fixing near data corruption on one of my systems, due to
a 7h delay between when the timestamp was written and when the actual data
was written, and traced it down to a btrfs hang that should never have
happened on that filesystem.

Surely, it's not a single queue for all filesystems and devices, right?

If not, does anyone know what bugs I've been hitting then?

Is the full report below, which I spent quite a while putting together
for you :), useful in any way to see where the hangs are?

To be honest, I'm looking at moving some important filesystems back to ext4
because I can't afford such long hangs on my root filesystem when I have a
media device that is doing heavy btrfs IO or a send/receive.

Mmmh, is it maybe just btrfs send/receive that is taking a btrfs-wide lock?
Or btrfs scrub maybe?
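
(For anyone who wants to poke at the report below themselves: the blocked
process list was gathered with something along the lines of

  ps -eo pid,etime,wchan:30,args | grep -E 'rsync|cp'

where the wchan column shows the kernel function each process is sleeping
in, wait_current_trans in this case.)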

Thanks,
Marc

On Wed, Mar 12, 2014 at 08:18:08AM -0700, Marc MERLIN wrote:
 I have a file server with 4 cpu cores and 5 btrfs devices:
 Label: btrfs_boot  uuid: e4c1daa8-9c39-4a59-b0a9-86297d397f3b
 Total devices 1 FS bytes used 48.92GiB
 devid    1 size 79.93GiB used 73.04GiB path /dev/mapper/cryptroot
 
 Label: varlocalspace  uuid: 9f46dbe2-1344-44c3-b0fb-af2888c34f18
 Total devices 1 FS bytes used 1.10TiB
 devid    1 size 1.63TiB used 1.50TiB path /dev/mapper/cryptraid0
 
 Label: btrfs_pool1  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
 Total devices 1 FS bytes used 7.16TiB
 devid    1 size 14.55TiB used 7.50TiB path /dev/mapper/dshelf1
 
 Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
 Total devices 1 FS bytes used 3.34TiB
 devid    1 size 7.28TiB used 3.42TiB path /dev/mapper/dshelf2
 
 Label: bigbackup  uuid: 024ba4d0-dacb-438d-9f1b-eeb34083fe49
 Total devices 5 FS bytes used 6.02TiB
 devid    1 size 1.82TiB used 1.43TiB path /dev/dm-9
 devid    2 size 1.82TiB used 1.43TiB path /dev/dm-6
 devid    3 size 1.82TiB used 1.43TiB path /dev/dm-5
 devid    4 size 1.82TiB used 1.43TiB path /dev/dm-7
 devid    5 size 1.82TiB used 1.43TiB path /dev/dm-8
 
 
 I have a very long running btrfs send/receive from btrfs_pool1 to bigbackup
 (long running meaning that it's been slowly copying over 5 days)
 
 The problem is that this is blocking IO to btrfs_pool2 which is using
 totally different drives.
 By blocking IO I mean that IO to pool2 kind of works sometimes, and
 hangs for very long times at other times.
 
 It looks as if one rsync to btrfs_pool2 or one piece of IO hangs on a
 shared lock, and once that happens, all IO to btrfs_pool2 stops for a
 long time.
 It does recover eventually without reboot, but the wait times are
 ridiculous (it could be 1H or more).
 
 As I write this, I have a killall -9 rsync that waited for over 10mn before
 these processes would finally die:
 23555   07:36 wait_current_trans.isra.15 rsync -av -SH --delete (...)
 23556   07:36 exit   [rsync] defunct
 25387  2-04:41:22 wait_current_trans.isra.15 rsync --password-file  (...)
 27481   31:26 wait_current_trans.isra.15 rsync --password-file  (...)
 29268   04:41:34 wait_current_trans.isra.15 rsync --password-file  (...)
 29343   04:41:31 exit   [rsync] defunct
 29492   04:41:27 wait_current_trans.isra.15 rsync --password-file  (...)
 
 14559   07:14:49 wait_current_trans.isra.15 cp -i -al current 20140312-feisty
 
 This is all stuck in btrfs kernel code.
 If someeone wants sysrq-w, there it is.
 http://marc.merlins.org/tmp/btrfs_full.txt
 
 A quick summary:
 SysRq : Show Blocked State
   taskPC stack   pid father
 btrfs-cleaner   D 8802126b0840 0  3332  2 0x
  8800c5dc9d00 0046 8800c5dc9fd8 8800c69f6310
  000141c0 8800c69f6310 88017574c170 880211e671e8
   880211e67000 8801e5936e20 8800c5dc9d10
 Call Trace:
  [8160b0d9] schedule+0x73/0x75
  [8122a3c7] wait_current_trans.isra.15+0x98/0xf4
  [81085062] ? finish_wait+0x65/0x65
  [8122b86c] start_transaction+0x48e/0x4f2
  [8122bc4f] ? __btrfs_end_transaction+0x2a1/0x2c6
  [8122b8eb] btrfs_start_transaction+0x1b/0x1d
  [8121c5cd] btrfs_drop_snapshot+0x443/0x610
  [8160d7b3] ? _raw_spin_unlock+0x17/0x2a
  [81074efb] ? finish_task_switch+0x51/0xdb
  [8160afbf] ? __schedule+0x537/0x5de
  [8122c08d] btrfs_clean_one_deleted_snapshot+0x103/0x10f
  [81224859] cleaner_kthread+0x103/0x136
  [81224756] ? btrfs_alloc_root+0x26/0x26
  [8106bc1b] kthread+0xae/0xb6
  [8106bb6d] ? __kthread_parkme+0x61/0x61
  [816141bc] ret_from_fork+0x7c/0xb0
  [8106bb6d] ? __kthread_parkme+0x61/0x61
 

[PATCH 2/2] btrfs-progs: Fix a memleak in btrfs_scan_lblkid().

2014-03-13 Thread quwen...@cn.fujitsu.com
In btrfs_scan_lblkid(), blkid_get_cache() is called but the cache is never freed.
This patch adds blkid_put_cache() to free it.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/utils.c b/utils.c
index 93cf9ac..b809bc5 100644
--- a/utils.c
+++ b/utils.c
@@ -2067,6 +2067,7 @@ int btrfs_scan_lblkid(int update_kernel)
 		btrfs_register_one_device(path);
 	}
 	blkid_dev_iterate_end(iter);
+	blkid_put_cache(cache);
 	return 0;
 }
 
-- 
1.9.0


[PATCH 1/2] btrfs-progs: Fix a memleak in btrfs_scan_one_device.

2014-03-13 Thread quwen...@cn.fujitsu.com
Valgrind reports a memleak in btrfs_scan_one_device(): btrfs_device
structures are allocated but never reclaimed in btrfs_close_devices().

This is not a serious bug, since btrfs exits right after
btrfs_close_devices() and the memory is reclaimed by the system, but
it's better to fix it anyway.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-filesystem.c |  6 ++++++
 volumes.c         | 13 ++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index f02e871..c9e27fc 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -651,6 +651,12 @@ devs_only:
 	if (search && !found)
 		ret = 1;
 
+	while (!list_empty(all_uuids)) {
+		fs_devices = list_entry(all_uuids->next,
+					struct btrfs_fs_devices, list);
+		list_del(&fs_devices->list);
+		btrfs_close_devices(fs_devices);
+	}
 out:
 	printf("%s\n", BTRFS_BUILD_VERSION);
 	free_seen_fsid();
diff --git a/volumes.c b/volumes.c
index 8c45851..77ffd32 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,11 +160,12 @@ static int device_list_add(const char *path,
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_fs_devices *seed_devices;
-	struct list_head *cur;
 	struct btrfs_device *device;
+
 again:
-	list_for_each(cur, &fs_devices->devices) {
-		device = list_entry(cur, struct btrfs_device, dev_list);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
 		if (device->fd != -1) {
 			fsync(device->fd);
 			if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
@@ -173,6 +174,11 @@ again:
 			device->fd = -1;
 		}
 		device->writeable = 0;
+		list_del(&device->dev_list);
+		/* free the memory */
+		free(device->name);
+		free(device->label);
+		free(device);
 	}
 
 	seed_devices = fs_devices->seed;
@@ -182,6 +188,7 @@ again:
 		goto again;
 	}
 
+	free(fs_devices);
 	return 0;
 }
 
-- 
1.9.0


Re: Incremental backup for a raid1

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 7:14 PM, Lists li...@benjamindsmith.com wrote:
 
 I'm assuming that BTRFS send/receive works similar to ZFS's similarly named 
 feature.

Similar, yes, but not all options are the same between them. e.g. zfs send -R 
replicates all descendent file systems. I don't think zfs requires volumes, 
filesystems, or snapshots to be read-only, whereas btrfs send only works on 
read-only snapshot subvolumes. There has been some suggestion of a recursive 
snapshot creation and recursive send for btrfs.

 So I just don't get the backup problem. Place btrfs' equivalent of a pool 
 on the external drive, and use send/receive of the filesystem or snapshot(s). 
 Does BTRFS work so differently in this regard? If so, I'd like to know what's 
 different.

The topmost thing in zfs is the pool, which on btrfs is the volume. Neither zfs 
send nor btrfs send works at this level to send everything within a pool/volume. 
zfs has the file system and btrfs has the subvolume, either of which can be 
snapshotted. Either (or both) can be used with send. 

zfs also has the volume, a block device that can be snapshotted; there 
isn't yet a btrfs equivalent.

Btrfs and zfs have clones, but the distinction is stronger with zfs: zfs 
snapshots can't be deleted unless their clones are deleted first. Btrfs send 
has a -c clone-src option that I don't really understand, and there is also 
--reflink, which is a clone at the file level.

Anyway there are a lot of similarities but also quite a few differences. Basic 
functionality seems pretty much the same.
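
As a rough side-by-side (hypothetical pool/subvolume names; note that
btrfs needs the snapshot to be read only):

  # zfs: snapshot, full send, then incremental send
  zfs snapshot tank/data@s1
  zfs send tank/data@s1 | ssh backuphost zfs receive backup/data
  zfs snapshot tank/data@s2
  zfs send -i tank/data@s1 tank/data@s2 | ssh backuphost zfs receive backup/data

  # btrfs: read-only snapshot, full send, then incremental send with -p
  btrfs subvolume snapshot -r /mnt/data /mnt/snaps/s1
  btrfs send /mnt/snaps/s1 | btrfs receive /backup
  btrfs subvolume snapshot -r /mnt/data /mnt/snaps/s2
  btrfs send -p /mnt/snaps/s1 /mnt/snaps/s2 | btrfs receive /backup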


 
 My primary interest in BTRFS vs ZFS is two-fold:
 
 1) ZFS has a couple of limitations that I find disappointing, that don't 
 appear to be present in BTRFS.
A) Inability to upgrade a non-redundant ZFS pool/vdev to raidz or increase 
 the raidz (redundancy) level after creation. (Yes, you can plan around this, 
 but I see no good reason to HAVE to)
B) Inability to remove a vdev once added to a pool.
 
 2) Licensing: ZFS on Linux is truly great so far in all my testing, can't 
 throw enough compliments their way, but I would really like to rely on a 
 first class citizen as far as the Linux kernel is concerned.


3. On btrfs you can delete a parent subvolume and the children remain. On zfs, 
you can't destroy a zfs filesystem/volume unless its snapshots are deleted, and 
you can't delete snapshots unless their clones are deleted.
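
For example (made-up paths), this is fine on btrfs, while the zfs
equivalent destroy would be refused until the snapshot is gone:

  btrfs subvolume snapshot -r /mnt/data /mnt/data-snap
  btrfs subvolume delete /mnt/data    # /mnt/data-snap remains usable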


Chris Murphy



Re: discard synchronous on most SSDs?

2014-03-13 Thread Chris Murphy

On Mar 13, 2014, at 8:11 PM, Marc MERLIN m...@merlins.org wrote:

 On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote:
 discard is, except on the very latest hardware, a synchronous command
 (it's a limitation of the SATA standard), and therefore results in
 very very poor performance.
 
 Interesting. How do I know if a given SSD will hang on discard?
 Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :)

smartctl -a or -x will tell you what SATA revision is in place. The queued trim 
support is in SATA Rev 3.1. I'm not certain if this requires only the drive to 
support that revision level, or both controller and drive.
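
On a drive and smartctl version that report it, the line looks like this
(illustrative output, device path made up):

  $ smartctl -i /dev/sda | grep 'SATA Version'
  SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)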

Chris Murphy


Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)

2014-03-13 Thread Duncan
Marc MERLIN posted on Thu, 13 Mar 2014 18:48:13 -0700 as excerpted:

 Are others seeing some btrfs operations on filesystem/diskA
 hang/deadlock other btrfs operations on filesystem/diskB ?

Well, if the filesystem in filesystem/diskA and filesystem/diskB is the 
same (multi-device) filesystem, as the above definitely implies...  Tho 
based on the context I don't believe that's what you actually meant.

Meanwhile, send/receive is in intense bug-finding/fixing mode ATM.  The 
basic concept is there, but to this point it has definitely been more 
development/testing-reliability (as befitted btrfs' overall state, with 
the eat-your-babies kconfig option warning only recently toned down to 
what I'd call semi-stable) than enterprise-reliability.  Hopefully by 
the time they're done with all this bug-stomping it'll be rather closer 
to the latter.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: discard synchronous on most SSDs?

2014-03-13 Thread Marc MERLIN
On Thu, Mar 13, 2014 at 09:39:02PM -0600, Chris Murphy wrote:
 
 On Mar 13, 2014, at 8:11 PM, Marc MERLIN m...@merlins.org wrote:
 
  On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote:
  discard is, except on the very latest hardware, a synchronous command
  (it's a limitation of the SATA standard), and therefore results in
  very very poor performance.
  
  Interesting. How do I know if a given SSD will hang on discard?
  Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :)
 
 smartctl -a or -x will tell you what SATA revision is in place. The queued 
 trim support is in SATA Rev 3.1. I'm not certain if this requires only the 
 drive to support that revision level, or both controller and drive.

I'm not sure I'm seeing this; which field is that?

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 840 EVO 1TB
Serial Number:    S1D9NEAD934600N
LU WWN Device Id: 5 002538 85009a8ff
Firmware Version: EXT0BB0Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4c
Local Time is:    Thu Mar 13 22:15:14 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:  (   0) The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:(15000) seconds.
Offline data collection
capabilities:(0x53) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:(   2) minutes.
Extended self-test routine
recommended polling time:( 250) minutes.
SCT capabilities:  (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010-0
  9 Power_On_Hours  -O--CK   099   099   000-2219
 12 Power_Cycle_Count   -O--CK   099   099   000-659
177 Wear_Leveling_Count PO--C-   099   099   000-3
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010-0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010-0
182 Erase_Fail_Count_Total  -O--CK   100   100   010-0
183 Runtime_Bad_Block   PO--C-   100   100   010-0
187 Reported_Uncorrect  -O--CK   100   100   000-0
190 Airflow_Temperature_Cel -O--CK   054   041   000-46
195 Hardware_ECC_Recovered  -O-RC-   200   200   000-0
199 UDMA_CRC_Error_Count-OSRCK   100   100   000-0
235 Unknown_Attribute   -O--C-   099   099   000-35
241 Total_LBAs_Written  -O--CK   099   099   000-12186944165
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning


-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  