Re: Understanding btrfs and backups
On Sun, 9 Mar 2014 03:30:44 PM Duncan wrote:
> While I realize that was in reference to the up in flames comment and presumably if there's a need to worry about that, offsite backup /is/ of some value, for some people, offsite backup really isn't that valuable.

Actually I missed that comment altogether, it was really just an illustration of why people should think about it - and then come to a decision about whether or not it makes sense for them. In your case maybe not, but for me (and my wife) it certainly does.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: Ordering of directory operations maintained across system crashes in Btrfs?
On Mon, Mar 03, 2014 at 11:56:49AM -0600, thanumalayan mad wrote:

Chris,

Great, thanks. Any guesses whether other filesystems (disk-based) do things similar to the last two examples you pointed out? Saying "we think 3 normal filesystems reorder stuff" seems to motivate application developers to fix bugs ...

Also, just for more information, the sequence we observed was:

Thread A: unlink(foo); rename(somefile X, somefile Y); fsync(somefile Z)

The source and destination of the renamed file are unrelated to the fsync. But the rename happens in the fsync()'s transaction, while the unlink() is delayed. I guess this has something to do with backrefs too.

Thanks,
Thanu

On Mon, Mar 3, 2014 at 11:43 AM, Chris Mason c...@fb.com wrote:
> On 02/25/2014 09:01 PM, thanumalayan mad wrote:
>> Hi all,
>>
>> Slightly complicated question. Assume I do two directory operations in a Btrfs partition (such as an unlink() and a rename()), one after the other, and a crash happens after the rename(). Can Btrfs (the current version) send the second operation to the disk first, so that after the crash, I observe the effects of rename() but not the effects of the unlink()? I think I am observing Btrfs re-ordering an unlink() and a rename(), and I just want to confirm that my observation is true.
>>
>> Also, if Btrfs does send directory operations to disk out of order, is there some limitation on this? Like, is this restricted to only unlink() and rename()? I am looking at some (buggy) applications that use Btrfs, and this behavior seems to affect them.
>
> There isn't a single answer for this one. You might have:
>
> Thread A: unlink(foo); rename(somefile, somefile2); crash
>
> This should always have the rename happen before or in the same transaction as the unlink.
>
> Thread A: unlink(dirA/foo); rename(dirB/somefile, dirB/somefile2);
>
> Here you're at the mercy of what is happening in dirB. If someone fsyncs that directory, it may hit the disk before the unlink.
> Thread A: unlink(foo); rename(somefile, somefile2); fsync(somefile);
>
> This one is even fuzzier. Backrefs allow us to do some file fsyncs without touching the directory, making it possible the unlink will hit disk after the fsync.
>
> -chris

As I understand it, POSIX only guarantees that the in-core data is updated by the syscalls in order. On crash anything can happen. If the application needs something to be committed to disk then it needs to fsync(). Specifically it needs to fsync() the changed files AND directories. From man fsync:

  Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

So the fsync(somefile) above doesn't necessarily force the rename to disk.

My experience with fuse tells me that at least fuse handles operations in parallel and only blocks a later operation if it is affected by an earlier operation. An unlink in one directory can (and will) run in parallel to a rename in another directory. Then, depending on how threads get scheduled, the rename can complete before the unlink.

My conclusion is that you need to fsync() the directory to ensure the metadata update has made it to the disk if you require that. Otherwise you have to be able to cope with (meta)data loss on crash.

Note: https://code.google.com/p/leveldb/issues/detail?id=189 talks a lot about journaling and that any journaling filesystem should preserve the order. I think that is rather pointless for two reasons:

1) The journal gets replayed after a crash, so whatever order the two journal entries are written in doesn't matter. They both make it to disk. You can't see one without the other. This is assuming you fsync()ed the dirs to force the metadata change into the journal in the first place.

2) btrfs afaik doesn't have any journal since COW already guarantees atomic updates and crash protection.
Overall I also think the fear of fsync() is overrated for this issue. This would only happen on program start or whenever you open a database. Not something that happens every second.

Regards,
Goswin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
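The fsync() discussion above can be sketched in user-space C: fsync() on the file alone does not persist the directory entry, so a crash-safe rename also fsyncs the containing directory. The helper name `rename_durable` and the paths in the usage note are made up for illustration; this is a minimal sketch of the idea, not a guarantee about any particular filesystem's crash behavior.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): rename oldpath to newpath,
 * then fsync() the containing directory so the rename itself reaches disk.
 * Per fsync(2), fsyncing the file alone does not cover the directory entry. */
static int rename_durable(const char *oldpath, const char *newpath,
                          const char *dirpath)
{
    if (rename(oldpath, newpath) != 0)
        return -1;

    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int ret = fsync(dfd);   /* persist the directory entry change */
    close(dfd);
    return ret;
}
```

An application that instead only calls fsync() on the renamed file is, as discussed above, at the mercy of how the filesystem orders directory updates across a crash.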
Re: [PATCH 2/2] Btrfs-progs: mkfs: make sure we can deal with hard links with -r option
Hi Dave,

On 03/13/2014 12:21 AM, David Sterba wrote:
> On Tue, Mar 11, 2014 at 06:29:09PM +0800, Wang Shilong wrote:
>> @@ -840,6 +833,10 @@ static int traverse_directory(struct btrfs_trans_handle *trans, cur_file->d_name, cur_inum, parent_inum, dir_index_cnt, &cur_inode); + if (ret == -EEXIST) { + BUG_ON(st.st_nlink <= 1);
>
> As the mkfs operation is restartable, can we handle the error?

This should be a logic error, which means an inode has hard links (but links <= 1). :-) Adding error handling may be better, I will update it.

Thanks,
Wang

> Otherwise, good fix, thanks.
>
>> + continue; + } if (ret) { fprintf(stderr, "add_inode_items failed\n"); goto fail;
[PATCH RESEND] xfstests: add test for btrfs send issuing premature rmdir operations
Regression test for btrfs incremental send issue where a rmdir instruction is sent against an orphan directory inode which is not empty yet, causing btrfs receive to fail when it attempts to remove the directory.

This issue is fixed by the following linux kernel btrfs patch:

  Btrfs: fix send attempting to rmdir non-empty directories

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
Reviewed-by: Josef Bacik jba...@fb.com
---
Resending since Dave Chinner asked to do it for any patches he might have missed in his last merge.

 tests/btrfs/043     | 149 +++
 tests/btrfs/043.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 151 insertions(+)
 create mode 100644 tests/btrfs/043
 create mode 100644 tests/btrfs/043.out

diff --git a/tests/btrfs/043 b/tests/btrfs/043
new file mode 100644
index 000..b1fef96
--- /dev/null
+++ b/tests/btrfs/043
@@ -0,0 +1,149 @@
+#! /bin/bash
+# FS QA Test No. btrfs/043
+#
+# Regression test for btrfs incremental send issue where a rmdir instruction
+# is sent against an orphan directory inode which is not empty yet, causing
+# btrfs receive to fail when it attempts to remove the directory.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send attempting to rmdir non-empty directories
+#
+#---
+# Copyright (c) 2014 Filipe Manana. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+mkdir -p $SCRATCH_MNT/a/b
+mkdir $SCRATCH_MNT/0
+mkdir $SCRATCH_MNT/1
+mkdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/0 $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/1 $SCRATCH_MNT/a/b/c
+echo 'ola mundo' > $SCRATCH_MNT/a/b/c/foo.txt
+mkdir $SCRATCH_MNT/a/b/c/x
+mkdir $SCRATCH_MNT/a/b/c/x2
+mkdir $SCRATCH_MNT/a/b/y
+mkdir $SCRATCH_MNT/a/b/z
+mkdir -p $SCRATCH_MNT/a/b/d1/d2/d3
+mkdir $SCRATCH_MNT/a/b/d4
+
+# Filesystem looks like:
+#
+# .                       (ino 256)
+# |-- a/                  (ino 257)
+#     |-- b/              (ino 258)
+#         |-- c/          (ino 261)
+#         |   |-- foo.txt (ino 262)
+#         |   |-- 0/      (ino 259)
+#         |   |-- 1/      (ino 260)
+#         |   |-- x/      (ino 263)
+#         |   |-- x2/     (ino 264)
+#         |
+#         |-- y/          (ino 265)
+#         |-- z/          (ino 266)
+#         |-- d1/         (ino 267)
+#         |   |-- d2/     (ino 268)
+#         |       |-- d3/ (ino 269)
+#         |
+#         |-- d4/         (ino 270)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+rm -f $SCRATCH_MNT/a/b/c/foo.txt
+mv $SCRATCH_MNT/a/b/y $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/z $SCRATCH_MNT/a
+mv $SCRATCH_MNT/a/b/c/x $SCRATCH_MNT/a/b/YY
+mv $SCRATCH_MNT/a/b/c/0 $SCRATCH_MNT/a/b/YY/00
+mv $SCRATCH_MNT/a/b/c/x2 $SCRATCH_MNT/a/z/X_2
+mv $SCRATCH_MNT/a/b/c/1 $SCRATCH_MNT/a/z/X_2
+rmdir $SCRATCH_MNT/a/b/c
+mv $SCRATCH_MNT/a/b/d4 $SCRATCH_MNT/a/d44
+mv $SCRATCH_MNT/a/b/d1/d2 $SCRATCH_MNT/a/d44
+rmdir $SCRATCH_MNT/a/b/d1
+
+# Filesystem now
looks like:
+#
+# .                  (ino 256)
+# |-- a/             (ino 257)
+#     |-- b/         (ino 258)
+#     |   |-- YY/    (ino 265)
+#     |       |-- x/ (ino 263)
+#     |       |-- 00/ (ino 259)
+#     |
+#     |-- z/         (ino 266)
+#     |   |-- X_2/   (ino 264)
+#     |       |-- 1/ (ino 260)
+#     |
+#     |-- d44/       (ino 270)
+#         |-- d2/    (ino 268)
+#             |-- d3/ (ino 269)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
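The two read-only snapshots created by tests like this one feed a full and then an incremental send; the general pattern being exercised can be sketched with btrfs-progs commands. The mount points and snapshot names below are examples, not taken from the test:

```
# Full send of the first read-only snapshot, then an incremental send
# of the second one, using the first as the parent (-p).
btrfs send /mnt/mysnap1 | btrfs receive /backup
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 | btrfs receive /backup
```

The regression fixed here shows up on the receive side of the second, incremental command, when the stream contains a rmdir for a directory that still has entries.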
[PATCH RESEND] xfstests: add regression test for btrfs incremental send
Regression test for a btrfs incremental send issue where invalid paths for utimes, chown and chmod operations were sent to the send stream, causing btrfs receive to fail.

If a directory had a move/rename operation delayed, and none of its parent directories, except for the immediate one, had delayed move/rename operations, after processing the directory's references, the incremental send code would issue invalid paths for utimes, chown and chmod operations.

This issue is fixed by the following linux kernel btrfs patch:

  Btrfs: fix send issuing outdated paths for utimes, chown and chmod

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
Reviewed-by: Josef Bacik jba...@fb.com
---
Resending since Dave Chinner asked to do it for any patches he might have missed in his last merge. Originally submitted with the title: "xfstests: add test btrfs/042 for btrfs incremental send".

 tests/btrfs/044     | 129 +++
 tests/btrfs/044.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 131 insertions(+)
 create mode 100644 tests/btrfs/044
 create mode 100644 tests/btrfs/044.out

diff --git a/tests/btrfs/044 b/tests/btrfs/044
new file mode 100644
index 000..dae189e
--- /dev/null
+++ b/tests/btrfs/044
@@ -0,0 +1,129 @@
+#! /bin/bash
+# FS QA Test No. btrfs/044
+#
+# Regression test for a btrfs incremental send issue where under certain
+# scenarios invalid paths for utimes, chown and chmod operations were sent
+# to the send stream, causing btrfs receive to fail.
+#
+# If a directory had a move/rename operation delayed, and none of its parent
+# directories, except for the immediate one, had delayed move/rename
+# operations, after processing the directory's references, the incremental
+# send code would issue invalid paths for utimes, chown and chmod operations.
+#
+# This issue is fixed by the following linux kernel btrfs patch:
+#
+#   Btrfs: fix send issuing outdated paths for utimes, chown and chmod
+#
+#---
+# Copyright (c) 2014 Filipe Manana. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=`mktemp -d`
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fssum
+_need_to_be_root
+
+rm -f $seqres.full
+
+_scratch_mkfs >/dev/null 2>&1
+_scratch_mount
+
+umask 0
+mkdir -p $SCRATCH_MNT/a/b/c/d/e
+mkdir $SCRATCH_MNT/a/b/c/f
+echo 'ola ' > $SCRATCH_MNT/a/b/c/d/e/file.txt
+chmod 0777 $SCRATCH_MNT/a/b/c/d/e
+
+# Filesystem looks like:
+#
+# .                  (ino 256)
+# |-- a/             (ino 257)
+#     |-- b/         (ino 258)
+#         |-- c/     (ino 259)
+#             |-- d/ (ino 260)
+#             |   |-- e/           (ino 261)
+#             |       |-- file.txt (ino 262)
+#             |
+#             |-- f/ (ino 263)
+
+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
+
+echo 'mundo' >> $SCRATCH_MNT/a/b/c/d/e/file.txt
+mv $SCRATCH_MNT/a/b/c/d/e/file.txt $SCRATCH_MNT/a/b/c/d/e/file2.txt
+mv $SCRATCH_MNT/a/b/c/f $SCRATCH_MNT/a/b/f2
+mv $SCRATCH_MNT/a/b/c/d/e $SCRATCH_MNT/a/b/f2/e2
+mv $SCRATCH_MNT/a/b/c $SCRATCH_MNT/a/b/c2
+mv $SCRATCH_MNT/a/b/c2/d $SCRATCH_MNT/a/b/c2/d2
+chmod 0700 $SCRATCH_MNT/a/b/f2/e2
+
+# Filesystem now looks like:
+#
+# .
(ino 256)
+# |-- a/              (ino 257)
+#     |-- b/          (ino 258)
+#         |-- c2/     (ino 259)
+#         |   |-- d2/ (ino 260)
+#         |
+#         |-- f2/     (ino 263)
+#             |-- e2/ (ino 261)
+#                 |-- file2.txt (ino 262)

+_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
+
+run_check $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
+run_check $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
Re: [PATCH] Btrfs: fix joining same transaction handle more than twice
On 03/13/2014 01:19 AM, Wang Shilong wrote: We hit something like the following function call flows: |-run_delalloc_range() |-btrfs_join_transaction() |-cow_file_range() |-btrfs_join_transaction() |-find_free_extent() |-btrfs_join_transaction() Trace infomation can be seen as: [ 7411.127040] [ cut here ] [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]() [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G O 3.13.0+ #4 [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013 [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5) [ 7411.127092] Call Trace: [ 7411.127097] [815b87b0] dump_stack+0x45/0x56 [ 7411.127101] [81051ffd] warn_slowpath_common+0x7d/0xa0 [ 7411.127102] [810520da] warn_slowpath_null+0x1a/0x20 [ 7411.127109] [a0444fb1] start_transaction+0x561/0x580 [btrfs] [ 7411.127115] [a0445027] btrfs_join_transaction+0x17/0x20 [btrfs] [ 7411.127120] [a0431c91] find_free_extent+0xa21/0xb50 [btrfs] [ 7411.127126] [a0431f68] btrfs_reserve_extent+0xa8/0x1a0 [btrfs] [ 7411.127131] [a04322ce] btrfs_alloc_free_block+0xee/0x440 [btrfs] [ 7411.127137] [a043bd6e] ? btree_set_page_dirty+0xe/0x10 [btrfs] [ 7411.127142] [a041da51] __btrfs_cow_block+0x121/0x530 [btrfs] [ 7411.127146] [a041dfff] btrfs_cow_block+0x11f/0x1c0 [btrfs] [ 7411.127151] [a0421b74] btrfs_search_slot+0x1d4/0x9c0 [btrfs] [ 7411.127157] [a0438567] btrfs_lookup_file_extent+0x37/0x40 [btrfs] [ 7411.127163] [a0456bfc] __btrfs_drop_extents+0x16c/0xd90 [btrfs] [ 7411.127169] [a0444ae3] ? start_transaction+0x93/0x580 [btrfs] [ 7411.127171] [811663e2] ? kmem_cache_alloc+0x132/0x140 [ 7411.127176] [a041cd9a] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [ 7411.127182] [a044aa61] cow_file_range_inline+0x181/0x2e0 [btrfs] [ 7411.127187] [a044aead] cow_file_range+0x2ed/0x440 [btrfs] [ 7411.127194] [a0464d7f] ? 
free_extent_buffer+0x4f/0xb0 [btrfs] [ 7411.127200] [a044b38f] run_delalloc_nocow+0x38f/0xa60 [btrfs] [ 7411.127207] [a0461600] ? test_range_bit+0x30/0x180 [btrfs] [ 7411.127212] [a044bd48] run_delalloc_range+0x2e8/0x350 [btrfs] [ 7411.127219] [a04618f9] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs] [ 7411.127222] [812a1e71] ? blk_queue_bio+0x2c1/0x330 [ 7411.127228] [a0462ad4] __extent_writepage+0x2f4/0x760 [btrfs]

Here we fix it by avoiding joining the transaction again if we already hold a transaction handle when allocating a chunk in find_free_extent().

So I just put that warning there to see if we were ever embedding 3 joins at a time, not because it was an actual problem. I'd say just kill the warning.

Thanks,
Josef
[PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly not documented or guaranteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only.

However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs).

Signed-off-by: Theodore Ts'o ty...@mit.edu
Cc: linux-fsde...@vger.kernel.org
Cc: Christoph Hellwig h...@infradead.org
Cc: Artem Bityutskiy dedeki...@gmail.com
Cc: Adrian Hunter adrian.hun...@intel.com
Cc: Evgeniy Dushistov dushis...@mail.ru
Cc: Jan Kara j...@suse.cz
Cc: OGAWA Hirofumi hirof...@mail.parknet.co.jp
Cc: Anders Larsen a...@alarsen.net
Cc: Phillip Lougher phil...@squashfs.org.uk
Cc: Kees Cook keesc...@chromium.org
Cc: Mikulas Patocka miku...@artax.karlin.mff.cuni.cz
Cc: Petr Vandrovec p...@vandrovec.name
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-c...@vger.kernel.org
Cc: samba-techni...@lists.samba.org
Cc: codal...@coda.cs.cmu.edu
Cc: linux-e...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: fuse-de...@lists.sourceforge.net
Cc: cluster-de...@redhat.com
Cc: linux-...@lists.infradead.org
Cc: jfs-discuss...@lists.sourceforge.net
Cc: linux-...@vger.kernel.org
Cc: linux-ni...@vger.kernel.org
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: ocfs2-de...@oss.oracle.com
Cc: reiserfs-de...@vger.kernel.org
---
 fs/adfs/super.c          | 1 +
 fs/affs/super.c          | 1 +
 fs/befs/linuxvfs.c       | 1 +
 fs/btrfs/super.c         | 1 +
 fs/cifs/cifsfs.c         | 1 +
 fs/coda/inode.c          | 1 +
 fs/cramfs/inode.c        | 1 +
 fs/debugfs/inode.c       | 1 +
 fs/devpts/inode.c        | 1 +
 fs/efs/super.c           | 1 +
 fs/ext2/super.c          | 1 +
 fs/ext3/super.c          | 2 ++
 fs/ext4/super.c          | 2 ++
 fs/f2fs/super.c          | 2 ++
 fs/fat/inode.c           | 2 ++
 fs/freevxfs/vxfs_super.c | 1 +
 fs/fuse/inode.c          | 1 +
 fs/gfs2/super.c          | 2 ++
 fs/hfs/super.c           | 1 +
 fs/hfsplus/super.c       | 1 +
 fs/hpfs/super.c          | 2 ++
 fs/isofs/inode.c         | 1 +
 fs/jffs2/super.c         | 1 +
 fs/jfs/super.c           | 1 +
 fs/minix/inode.c         | 1 +
 fs/ncpfs/inode.c         | 1 +
 fs/nfs/super.c           | 2 ++
 fs/nilfs2/super.c        | 1 +
 fs/ntfs/super.c          | 2 ++
 fs/ocfs2/super.c         | 2 ++
 fs/openpromfs/inode.c    | 1 +
 fs/proc/root.c           | 2 ++
 fs/pstore/inode.c        | 1 +
 fs/qnx4/inode.c          | 1 +
 fs/qnx6/inode.c          | 1 +
 fs/reiserfs/super.c      | 1 +
 fs/romfs/super.c         | 1 +
 fs/squashfs/super.c      | 1 +
 fs/super.c               | 2 --
 fs/sysv/inode.c          | 1 +
 fs/ubifs/super.c         | 1 +
 fs/udf/super.c           | 1 +
 fs/ufs/super.c           | 1 +
 fs/xfs/xfs_super.c       | 1 +
 44 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 7b3003c..952aeb0 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char *options)
 static int adfs_remount(struct super_block *sb, int *flags, char *data)
 {
+	sync_filesystem(sb);
 	*flags |= MS_NODIRATIME;
 	return parse_options(sb, data);
 }

diff --git a/fs/affs/super.c b/fs/affs/super.c
index d098731..3074530 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data)
 	pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n", *flags, data);
 
+	sync_filesystem(sb);
 	*flags |= MS_NODIRATIME;
 	memcpy(volume, sbi->s_volume, 32);

diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 845d2d6..56d70c8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent)
 static int befs_remount(struct super_block *sb, int *flags, char *data)
 {
+	sync_filesystem(sb);
 	if (!(*flags & MS_RDONLY))
 		return -EINVAL;
 	return 0;

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..00cd0c5 100644
---
a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1381,6 +1381,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 	unsigned int old_metadata_ratio = fs_info->metadata_ratio;
 	int ret;
 
+	sync_filesystem(sb);
 	btrfs_remount_prepare(fs_info);
 	ret = btrfs_parse_options(root, data);

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..4942c94 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -541,6 +541,7 @@
Re: [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
On Thu 13-03-14 10:20:56, Ted Tso wrote:
> Previously, the no-op mount -o mount /dev/xxx operation when the
                             ^^remount
> file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly not documented or guaranteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only.
>
> However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs).

Hum, I'd avoid this exercise at least for filesystems where sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts, also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs, efs, freevxfs, romfs, squashfs. I think you can find a couple more which clearly don't care about sync_filesystem() if you look a bit closer.
								Honza
Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
Hi,

On Thu, 2014-03-13 at 17:23 +0100, Jan Kara wrote:
> Hum, I'd avoid this exercise at least for filesystems where sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts, also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs, efs, freevxfs, romfs, squashfs. I think you can find a couple more which clearly don't care about sync_filesystem() if you look a bit closer.

I guess the same is true for other file systems which are mounted ro too. So maybe a check for MS_RDONLY before doing the sync in those cases?

Steve.
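Steve's suggestion can be sketched in kernel-style C. This is only an illustration of the idea, not code from the patch; `examplefs_remount` and `examplefs_parse_options` are made-up names:

```c
/* Sketch (hypothetical filesystem): skip the sync entirely when the
 * filesystem is already mounted read-only, since there can be nothing
 * dirty to flush; otherwise flush before applying the new options. */
static int examplefs_remount(struct super_block *sb, int *flags, char *data)
{
	if (!(sb->s_flags & MS_RDONLY))
		sync_filesystem(sb);

	return examplefs_parse_options(sb, data);
}
```

Compared to the unconditional call in the patch, this variant makes the read-only case a no-op, which is exactly the class of filesystems Jan lists above.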
Re: Understanding btrfs and backups
On Mar 7, 2014, at 7:03 AM, Eric Mesa ericsbinarywo...@gmail.com wrote:
> Duncan - thanks for this comprehensive explanation. For a huge portion of your reply I was all wondering why you and others were saying snapshots aren't backups. They certainly SEEMED like backups. But now I see that the problem is one of precise terminology vs colloquialisms. In other words, snapshots are not backups in and of themselves. They are like Mac's Time Machine. BUT if you take these snapshots and then put them on another media - whether that's local or not - THEN you have backups. Am I right, or am I still missing something subtle?

Hmm, yes, because snapshots on a mirrored drive are on another media but that's still not considered a backup. I think what makes a backup is a separate device and separate file system. That's because the top vectors for data loss are: user induced, device failure, and file system corruption. These are substantially mitigated by having backup files located both on separate file systems and devices.

Also, Time Machine qualifies as a backup because it copies files to a separate device with a separate file system. (There is a feature in recent OS X versions that stores hourly incremental backups on the local drive when the usual target device isn't available - these are arguably not backups but rather snapshots that are pending backups. Once the target device is available, the snapshots are copied over to it.)

If you have data you feel is really important, my suggestion is that you have a completely different backup/restore method than what you're talking about. It needs to be bullet proof, well tested. And consider all the Btrfs send/receive work you're doing as testing/work-in-progress.
There are still cases on the list where people have had problems with send/receive. Both the send and receive code have a lot of churn, so I don't know that anyone can definitively tell you that a btrfs send/receive only based backup is going to reliably restore in one month, let alone three years. Should it? Yes, of course. Will it?

Chris Murphy
Re: Testing BTRFS
On 03/10/2014 06:02 PM, Avi Miller wrote:
> Oracle Linux 6 with the Unbreakable Enterprise Kernel Release 2 or Release 3 has production-ready btrfs support. You can even convert your existing CentOS6 boxes across to Oracle Linux 6 in-place without reinstalling: http://linux.oracle.com/switch/centos/
>
> Oracle also now provides all errata, including security and bug fixes, for free at http://public-yum.oracle.com and our kernel source code can be found at https://oss.oracle.com/git/

Is there any issue with BTRFS and a 32-bit O/S, like with ZFS?

-Ben
Incremental backup for a raid1
My backup use case is different from what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction.

I have a btrfs raid1 filesystem spread over two disks. I want to back up this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that

* when one disk in the raid fails, I can substitute the backup, and rebalancing from the surviving disk to the substitute only applies the missing changes.
* when the entire raid fails, I can re-build a new one from the backup.

The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). Is it possible to do what I'm looking for?

Michael
--
Michael Schuerig
mailto:mich...@schuerig.de
http://www.schuerig.de/michael/
Re: Incremental backup for a raid1
On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: My backup use case is different from what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction. I have a btrfs raid1 filesystem spread over two disks. I want to backup this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. * when the entire raid fails, I can re-build a new one from the backup. The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). Is it possible to do what I'm looking for? For point 2, yes. (Add new disk, balance -oconvert from single to raid1). For point 1, not really. It's a different filesystem, so it'll have a different UUID. You *might* be able to get away with rsync of one of the block devices in the array to the backup block device, but you'd have to unmount the FS (or halt all writes to it) for the period of the rsync to ensure a consistent image, and the rsync would have to read all the data in the device being synced to work out what to send. Probably not what you want. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Do not meddle in the affairs of system administrators, for --- they are subtle, and quick to anger.
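Hugo's "point 2" recovery path (rebuild a raid1 from the single-copy backup disk) can be sketched as a short command sequence. This is only an illustration, not a tested recipe: the device names and mount point are hypothetical, and current btrfs-progs spell the conversion filters -dconvert/-mconvert (Hugo's -oconvert is shorthand). The sketch just builds and prints the commands; flip dry_run off to actually execute them.

```python
import subprocess

def rebuild_raid1_cmds(backup_dev, new_dev, mnt):
    """Commands to turn a single-profile backup disk back into a
    two-disk raid1: mount the backup, add a fresh disk, then balance
    with conversion filters so data and metadata become mirrored."""
    return [
        ["mount", backup_dev, mnt],
        ["btrfs", "device", "add", new_dev, mnt],
        ["btrfs", "balance", "start",
         "-dconvert=raid1", "-mconvert=raid1", mnt],
    ]

def run(cmds, dry_run=True):
    # Print each command; only execute when dry_run is disabled.
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)

if __name__ == "__main__":
    # Hypothetical devices: /dev/sdc holds the backup, /dev/sdd is new.
    run(rebuild_raid1_cmds("/dev/sdc", "/dev/sdd", "/mnt/restore"))
```

Once the balance completes, every chunk should exist on both devices, which is exactly the "re-build a new raid1 from the backup" case Michael asked about.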
[PATCH] Btrfs: remove transaction from send
Let's try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks,

Reported-by: Hugo Mills h...@carfax.org.uk
Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/backref.c     | 33 +++
 fs/btrfs/ctree.c       | 88 --
 fs/btrfs/ctree.h       |  3 +-
 fs/btrfs/disk-io.c     |  3 +-
 fs/btrfs/extent-tree.c | 20 ++--
 fs/btrfs/inode-map.c   | 14
 fs/btrfs/send.c        | 57 ++--
 fs/btrfs/transaction.c | 45 --
 fs/btrfs/transaction.h |  1 +
 9 files changed, 77 insertions(+), 187 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 860f4f2..0be0e94 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -329,7 +329,10 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	root_level = btrfs_old_root_level(root, time_seq);
+	if (path->search_commit_root)
+		root_level = btrfs_header_level(root->commit_root);
+	else
+		root_level = btrfs_old_root_level(root, time_seq);
 
 	if (root_level + 1 == level) {
 		srcu_read_unlock(&fs_info->subvol_srcu, index);
@@ -1092,9 +1095,9 @@ static int btrfs_find_all_leafs(struct btrfs_trans_handle *trans,
  *
  * returns 0 on success, < 0 on error.
  */
-int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
-			 struct btrfs_fs_info *fs_info, u64 bytenr,
-			 u64 time_seq, struct ulist **roots)
+static int __btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info, u64 bytenr,
+				  u64 time_seq, struct ulist **roots)
 {
 	struct ulist *tmp;
 	struct ulist_node *node = NULL;
@@ -1130,6 +1133,20 @@ int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+int btrfs_find_all_roots(struct btrfs_trans_handle *trans,
+			 struct btrfs_fs_info *fs_info, u64 bytenr,
+			 u64 time_seq, struct ulist **roots)
+{
+	int ret;
+
+	if (!trans)
+		down_read(&fs_info->commit_root_sem);
+	ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
+	if (!trans)
+		up_read(&fs_info->commit_root_sem);
+	return ret;
+}
+
 /*
  * this makes the path point to (inum INODE_ITEM ioff)
  */
@@ -1509,6 +1526,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 		if (IS_ERR(trans))
 			return PTR_ERR(trans);
 		btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
+	} else {
+		down_read(&fs_info->commit_root_sem);
 	}
 
 	ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
@@ -1519,8 +1538,8 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 	ULIST_ITER_INIT(&ref_uiter);
 	while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
-		ret = btrfs_find_all_roots(trans, fs_info, ref_node->val,
-					   tree_mod_seq_elem.seq, &roots);
+		ret = __btrfs_find_all_roots(trans, fs_info, ref_node->val,
+					     tree_mod_seq_elem.seq, &roots);
 		if (ret)
 			break;
 		ULIST_ITER_INIT(&root_uiter);
@@ -1542,6 +1561,8 @@ out:
 	if (!search_commit_root) {
 		btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
 		btrfs_end_transaction(trans, fs_info->extent_root);
+	} else {
+		up_read(&fs_info->commit_root_sem);
 	}
 
 	return ret;
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..9d89c16 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5360,7 +5360,6 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 {
 	int ret;
 	int cmp;
-	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_path
Re: Testing BTRFS
Hi, On 14 Mar 2014, at 5:10 am, Lists li...@benjamindsmith.com wrote: Is there any issue with BTRFS and 32 bit O/S like with ZFS? We provide some btrfs support with the 32-bit UEK Release 2 on OL6, but we strongly recommend only using the UEK Release 3 which is 64-bit only. -- Oracle http://www.oracle.com Avi Miller | Product Management Director | +61 (3) 8616 3496 Oracle Linux and Virtualization 417 St Kilda Road, Melbourne, Victoria 3004 Australia
Re: Incremental backup for a raid1
On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote: On 2014-Mar-13 14:28, Hugo Mills wrote: On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: My backup use case is different from what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction. I have a btrfs raid1 filesystem spread over two disks. I want to backup this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. * when the entire raid fails, I can re-build a new one from the backup. The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). [...] I'm new; btrfs noob; completely unqualified to write intelligently on this topic, nevertheless: I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a backup device someplace /dev/C Could you, at the time you wanted to backup the filesystem: 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added 3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C) 4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to the fs I've thought of this but don't dare try it without approval from the experts. At any rate, to be practical, this approach hinges on an ability to rebuild the raid1 incrementally. That is, the rebuild would have to start from what already is present on disk B (or C, when it is re-added). Starting from an effectively blank disk each time would be prohibitive. Even if this would work, I'd much prefer keeping the original raid1 intact and to only temporarily add another mirror: lazy mirroring, to give the thing a name.
Michael -- Michael Schuerig mailto:mich...@schuerig.de http://www.schuerig.de/michael/
Re: Incremental backup for a raid1
On Mar 13, 2014, at 3:14 PM, Michael Schuerig michael.li...@schuerig.de wrote: On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote: On 2014-Mar-13 14:28, Hugo Mills wrote: On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: My backup use case is different from the what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction. I have a btrfs raid1 filesystem spread over two disks. I want to backup this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. * when the entire raid fails, I can re-build a new one from the backup. The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). [...] I'm new; btrfs noob; completely unqualified to write intelligently on this topic, nevertheless: I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a backup device someplace /dev/C Could you, at the time you wanted to backup the filesystem: 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added 3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C) 4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to the fs I've thought of this but don't dare try it without approval from the experts. At any rate, for being practical, this approach hinges on an ability to rebuild the raid1 incrementally. That is, the rebuild would have to start from what already is present on disk B (or C, when it is re-added). Starting from an effectively blank disk each time would be prohibitive. 
Even if this would work, I'd much prefer keeping the original raid1 intact and to only temporarily add another mirror: lazy mirroring, to give the thing a name. At best this seems fragile, but I don't think it works and is an edge case from the start. This is what send/receive is for. In the btrfs replace scenario, the missing device is removed from the volume. It's like a divorce. Missing device 2 is replaced by a different physical device also called device 2. If you then removed 2b and re-add (formerly replaced) device 2a, what happens? I don't know; I'm pretty sure the volume knows this is not device 2b as it should be, and won't accept formerly replaced device 2a. But it's an edge case to do this because you've said device replace. So lexicon-wise, I wouldn't even want this to work; we'd need a different command even if not different logic. In the btrfs device add case, you now have a three disk raid1 which is a whole different beast. Since this isn't n-way raid1, each disk is not stand alone. You're only assured data survives a one disk failure, meaning you must have two drives. You've just increased your risk by doing this, not reduced it. It further proposes running an (ostensibly) production workflow with an always degraded volume, mounted with -o degraded, on an on-going basis. So it's three strikes. It's not n-way, you have no uptime if you lose one of two disks onsite, you'd have to go get the offsite/onshelf disk to keep working. Plus that offsite disk isn't stand alone, so why even have it offsite? This is a fail. So the btrfs replace scenario might work but it seems like a bad idea. And overall it's a use case for which send/receive was designed anyway, so why not just use that? Chris Murphy
Re: [PATCH] Btrfs: remove transaction from send
On Thu, Mar 13, 2014 at 03:42:13PM -0400, Josef Bacik wrote: Let's try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, There's something still going on here. I managed to get about twice as far through my test as I had before, but I again got an unexpected EOF in stream, with btrfs send returning 1. As before, I have this in syslog: Mar 13 22:09:12 s_src@amelia kernel: BTRFS error (device sda2): did not find backref in send_root. inode=1786631, offset=825257984, disk_byte=36504023040 found extent=36504023040 So, on the evidence of one data point (I'll have another one when I wake up tomorrow morning), this has made the problem harder to trigger but it's still possible. Hugo.
Re: Incremental backup for a raid1
On Thursday 13 March 2014 16:04:33 Chris Murphy wrote: On Mar 13, 2014, at 3:14 PM, Michael Schuerig michael.li...@schuerig.de wrote: On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote: On 2014-Mar-13 14:28, Hugo Mills wrote: On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: My backup use case is different from the what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction. I have a btrfs raid1 filesystem spread over two disks. I want to backup this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. * when the entire raid fails, I can re-build a new one from the backup. The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). [...] I'm new; btrfs noob; completely unqualified to write intelligently on this topic, nevertheless: I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a backup device someplace /dev/C Could you, at the time you wanted to backup the filesystem: 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added 3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C) 4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to the fs I've thought of this but don't dare try it without approval from the experts. At any rate, for being practical, this approach hinges on an ability to rebuild the raid1 incrementally. That is, the rebuild would have to start from what already is present on disk B (or C, when it is re-added). Starting from an effectively blank disk each time would be prohibitive. 
Even if this would work, I'd much prefer keeping the original raid1 intact and to only temporarily add another mirror: lazy mirroring, to give the thing a name. [...] In the btrfs device add case, you now have a three disk raid1 which is a whole different beast. Since this isn't n-way raid1, each disk is not stand alone. You're only assured data survives a one disk failure meaning you must have two drives. Yes, I understand that. Unless someone convinces me that it's a bad idea, I keep wishing for a feature that allows me to intermittently add a third disk to a two disk raid1 and update that disk so that it could replace one of the others. So the btrfs replace scenario might work but it seems like a bad idea. And overall it's a use case for which send/receive was designed anyway so why not just use that? Because it's not "just". Doing it right doesn't seem trivial. For one thing, there are multiple subvolumes; not at the top-level but nested inside a root subvolume. Each of them already has snapshots of its own. If there already is a send/receive script that can handle such a setup I'll happily have a look at it. Michael -- Michael Schuerig mailto:mich...@schuerig.de http://www.schuerig.de/michael/
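For what it's worth, the send/receive bookkeeping Michael is asking about mostly amounts to walking each nested subvolume, taking a read-only snapshot, and sending it incrementally (btrfs send -p) against the previous snapshot. A minimal sketch follows; the pool/backup paths, subvolume names, and the .snapshots layout are assumptions taken from this thread, and the commands are only printed unless dry_run is disabled:

```python
import subprocess

def backup_cmds(pool, backup, subvols, stamp, parents):
    """Build the snapshot + (btrfs send | btrfs receive) pipelines for a
    set of nested subvolumes.  `parents` maps subvolume -> previous
    snapshot name, or None for the first, full send."""
    pipelines = []
    for sv in subvols:
        snap_dir = f"{pool}/{sv}/.snapshots"
        new = f"{snap_dir}/{stamp}"
        snap = ["btrfs", "subvolume", "snapshot", "-r", f"{pool}/{sv}", new]
        send = ["btrfs", "send"]
        if parents.get(sv):  # incremental send against the parent snapshot
            send += ["-p", f"{snap_dir}/{parents[sv]}"]
        send.append(new)
        recv = ["btrfs", "receive", f"{backup}/{sv}/.snapshots"]
        pipelines.append((snap, send, recv))
    return pipelines

def run(pipelines, dry_run=True):
    for snap, send, recv in pipelines:
        print(" ".join(snap))
        print(" ".join(send), "|", " ".join(recv))
        if not dry_run:
            subprocess.check_call(snap)
            p = subprocess.Popen(send, stdout=subprocess.PIPE)
            subprocess.check_call(recv, stdin=p.stdout)
            p.wait()

if __name__ == "__main__":
    run(backup_cmds("/mnt/pool", "/mnt/backup", ["home", "var/data"],
                    "2014-03-14", {"home": "2014-03-13", "var/data": None}))
```

A real script would also need to record which snapshot was last received successfully, since an incremental send is only valid if its parent snapshot still exists on both sides.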
Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
On Thu, Mar 13, 2014 at 04:28:23PM +, Steven Whitehouse wrote: I guess the same is true for other file systems which are mounted ro too. So maybe a check for MS_RDONLY before doing the sync in those cases? My original patch moved the sync_filesystem into the check for MS_RDONLY in the core VFS code. The objection was raised that there might be some file system out there that might depend on this behaviour. I can't imagine why, but I suppose it's at least theoretically possible. So the idea is that this particular patch is *guaranteed* not to make any difference. That way there can be no question about the patch's correctness. I'm going to follow up with a patch for ext4 that does exactly that, but the idea is to allow each file system maintainer to do that for their own file system. I could do that as well for file systems that are obviously read-only, but then I'll find out that there's some weird case where the file system can be used in a read-write fashion. (Example: UDF is normally used for DVDs, but at least in theory it can be used read/write --- I'm told that Windows supports read-write UDF file systems on USB sticks, and at least in theory it could be used as an inter-OS exchange format in situations where VFAT and exFAT might not be appropriate for various reasons.) Cheers, - Ted
Re: Incremental backup for a raid1
See comments at the bottom: On 03/13/2014 05:29 PM, George Mitchell wrote: On 03/13/2014 04:03 PM, Michael Schuerig wrote: On Thursday 13 March 2014 16:04:33 Chris Murphy wrote: On Mar 13, 2014, at 3:14 PM, Michael Schuerig michael.li...@schuerig.de wrote: On Thursday 13 March 2014 14:48:55 Andrew Skretvedt wrote: On 2014-Mar-13 14:28, Hugo Mills wrote: On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: My backup use case is different from the what has been recently discussed in another thread. I'm trying to guard against hardware failure and other causes of destruction. I have a btrfs raid1 filesystem spread over two disks. I want to backup this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. * when the entire raid fails, I can re-build a new one from the backup. The filesystem is mounted at its root and has several nested subvolumes and snapshots (in a .snapshots subdir on each subvol). [...] I'm new; btrfs noob; completely unqualified to write intelligently on this topic, nevertheless: I understand your setup to be btrfs RAID1 with /dev/A /dev/B, and a backup device someplace /dev/C Could you, at the time you wanted to backup the filesystem: 1) in the filesystem, break RAID1: /dev/A /dev/B -- remove /dev/B 2) reestablish RAID1 to the backup device: /dev/A /dev/C -- added 3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C) 4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to the fs I've thought of this but don't dare try it without approval from the experts. At any rate, for being practical, this approach hinges on an ability to rebuild the raid1 incrementally. That is, the rebuild would have to start from what already is present on disk B (or C, when it is re-added). 
Starting from an effectively blank disk each time would be prohibitive. Even if this would work, I'd much prefer keeping the original raid1 intact and to only temporarily add another mirror: lazy mirroring, to give the thing a name. [...] In the btfs device add case, you now have a three disk raid1 which is a whole different beast. Since this isn't n-way raid1, each disk is not stand alone. You're only assured data survives a one disk failure meaning you must have two drives. Yes, I understand that. Unless someone convinces me that it's a bad idea, I keep wishing for a feature that allows to intermittently add a third disk to a two disk raid1 and update that disk so that it could replace one of the others. So the btrfs replace scenario might work but it seems like a bad idea. And overall it's a use case for which send/receive was designed anyway so why not just use that? Because it's not just. Doing it right doesn't seem trivial. For one thing, there are multiple subvolumes; not at the top-level but nested inside a root subvolume. Each of them already has snapshots of its own. If there already is a send/receive script that can handle such a setup I'll happily have a look at it. Michael I think the closest thing there will ever be to this is n-way mirroring. I currently use rsync to a separate drive to maintain a backup copy, but it is not integrated into the array like n-way would be, and is definitely not a perfect solution. But a 3 drive 3-way would require the 3rd drive to be in the array the whole time or it would run into the same problem requiring a complete rebuild rather than an incremental when reintroduced, UNLESS such a feature was specifically included in the design, and even then, in a 3-way configuration, you would end up simplex on at least some data until the partial rebuild was completed. Personally, I will be DELIGHTED when n-way appears simply because basic 3-way gets us out of the dreaded simplex trap. 
I'm coming from ZFS land, am a BTRFS newbie, and I don't understand this discussion, at all. I'm assuming that BTRFS send/receive works similarly to ZFS's similarly named feature. We use snapshots and ZFS send/receive to a remote server to do our backups. To do an rsync of our production file store takes days because there are so many files, while snapshotting and using ZFS send/receive takes tens of minutes at local (Gbit) speeds, and a few hours at WAN speeds, nearly all of that time being transfer time. So I just don't get the backup problem. Place btrfs' equivalent of a pool on the external drive, and use send/receive of the filesystem or snapshot(s). Does BTRFS work so differently in this regard? If so, I'd like to know what's different. My primary interest in BTRFS vs ZFS is two-fold: 1) ZFS has a
Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)
Can anyone comment on this? Are others seeing some btrfs operations on filesystem/diskA hang/deadlock other btrfs operations on filesystem/diskB? I just spent time fixing near data corruption in one of my systems due to a 7h delay between when the timestamp was written and the actual data was written, and traced it down to a btrfs hang that should never have happened on that filesystem. Surely, it's not a single queue for all filesystems and devices, right? If not, does anyone know what bugs I've been hitting then? Is the full report below I spent quite a while getting together for you :) useful in any way to see where the hangs are? To be honest, I'm looking at moving some important filesystems back to ext4 because I can't afford such long hangs on my root filesystem when I have a media device that is doing heavy btrfs IO or a send/receive. Mmmh, is it maybe just btrfs send/receive that is taking a btrfs-wide lock? Or btrfs scrub maybe? Thanks, Marc

On Wed, Mar 12, 2014 at 08:18:08AM -0700, Marc MERLIN wrote: I have a file server with 4 cpu cores and 5 btrfs devices:

Label: btrfs_boot  uuid: e4c1daa8-9c39-4a59-b0a9-86297d397f3b
	Total devices 1  FS bytes used 48.92GiB
	devid 1  size 79.93GiB  used 73.04GiB  path /dev/mapper/cryptroot
Label: varlocalspace  uuid: 9f46dbe2-1344-44c3-b0fb-af2888c34f18
	Total devices 1  FS bytes used 1.10TiB
	devid 1  size 1.63TiB  used 1.50TiB  path /dev/mapper/cryptraid0
Label: btrfs_pool1  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
	Total devices 1  FS bytes used 7.16TiB
	devid 1  size 14.55TiB  used 7.50TiB  path /dev/mapper/dshelf1
Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
	Total devices 1  FS bytes used 3.34TiB
	devid 1  size 7.28TiB  used 3.42TiB  path /dev/mapper/dshelf2
Label: bigbackup  uuid: 024ba4d0-dacb-438d-9f1b-eeb34083fe49
	Total devices 5  FS bytes used 6.02TiB
	devid 1  size 1.82TiB  used 1.43TiB  path /dev/dm-9
	devid 2  size 1.82TiB  used 1.43TiB  path /dev/dm-6
	devid 3  size 1.82TiB  used 1.43TiB  path /dev/dm-5
	devid 4  size 1.82TiB  used 1.43TiB  path /dev/dm-7
	devid 5  size 1.82TiB  used 1.43TiB  path /dev/dm-8

I have a very long running btrfs send/receive from btrfs_pool1 to bigbackup (long running meaning that it's been slowly copying over 5 days). The problem is that this is blocking IO to btrfs_pool2 which is using totally different drives. By blocking IO I mean that IO to pool2 kind of works sometimes, and hangs for very long times at other times. It looks as if one rsync to btrfs_pool2 or one piece of IO hangs on a shared lock and once that happens, all IO to btrfs_pool2 stops for a long time. It does recover eventually without reboot, but the wait times are ridiculous (it could be 1H or more). As I write this, I have a killall -9 rsync that waited for over 10mn before these processes would finally die:

23555  07:36       wait_current_trans.isra.15  rsync -av -SH --delete (...)
23556  07:36       exit                        [rsync] <defunct>
25387  2-04:41:22  wait_current_trans.isra.15  rsync --password-file (...)
27481  31:26       wait_current_trans.isra.15  rsync --password-file (...)
29268  04:41:34    wait_current_trans.isra.15  rsync --password-file (...)
29343  04:41:31    exit                        [rsync] <defunct>
29492  04:41:27    wait_current_trans.isra.15  rsync --password-file (...)
14559  07:14:49    wait_current_trans.isra.15  cp -i -al current 20140312-feisty

This is all stuck in btrfs kernel code. If someone wants sysrq-w, there it is: http://marc.merlins.org/tmp/btrfs_full.txt A quick summary:

SysRq : Show Blocked State
  task            PC stack  pid father
btrfs-cleaner   D 8802126b0840  0  3332  2 0x
 8800c5dc9d00 0046 8800c5dc9fd8 8800c69f6310
 000141c0 8800c69f6310 88017574c170 880211e671e8
 880211e67000 8801e5936e20 8800c5dc9d10
Call Trace:
 [8160b0d9] schedule+0x73/0x75
 [8122a3c7] wait_current_trans.isra.15+0x98/0xf4
 [81085062] ? finish_wait+0x65/0x65
 [8122b86c] start_transaction+0x48e/0x4f2
 [8122bc4f] ? __btrfs_end_transaction+0x2a1/0x2c6
 [8122b8eb] btrfs_start_transaction+0x1b/0x1d
 [8121c5cd] btrfs_drop_snapshot+0x443/0x610
 [8160d7b3] ? _raw_spin_unlock+0x17/0x2a
 [81074efb] ? finish_task_switch+0x51/0xdb
 [8160afbf] ? __schedule+0x537/0x5de
 [8122c08d] btrfs_clean_one_deleted_snapshot+0x103/0x10f
 [81224859] cleaner_kthread+0x103/0x136
 [81224756] ? btrfs_alloc_root+0x26/0x26
 [8106bc1b] kthread+0xae/0xb6
 [8106bb6d] ? __kthread_parkme+0x61/0x61
 [816141bc] ret_from_fork+0x7c/0xb0
 [8106bb6d] ? __kthread_parkme+0x61/0x61
[PATCH 2/2] btrfs-progs: Fix a memleak in btrfs_scan_lblkid().
In btrfs_scan_lblkid(), blkid_get_cache() is called but the cache is not freed. This patch adds blkid_put_cache() to free it.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/utils.c b/utils.c
index 93cf9ac..b809bc5 100644
--- a/utils.c
+++ b/utils.c
@@ -2067,6 +2067,7 @@ int btrfs_scan_lblkid(int update_kernel)
 		btrfs_register_one_device(path);
 	}
 	blkid_dev_iterate_end(iter);
+	blkid_put_cache(cache);
 	return 0;
 }
-- 
1.9.0
[PATCH 1/2] btrfs-progs: Fix a memleak in btrfs_scan_one_device.
Valgrind reports a memleak in btrfs_scan_one_device(): btrfs_device structs are allocated but not reclaimed on btrfs_close_devices(). Although not a bug, since after btrfs_close_devices() btrfs will exit and memory will be reclaimed by the system anyway, it's better to fix it.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-filesystem.c |  6 ++
 volumes.c         | 13 ++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index f02e871..c9e27fc 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -651,6 +651,12 @@ devs_only:
 	if (search && !found)
 		ret = 1;
 
+	while (!list_empty(all_uuids)) {
+		fs_devices = list_entry(all_uuids->next,
+					struct btrfs_fs_devices, list);
+		list_del(&fs_devices->list);
+		btrfs_close_devices(fs_devices);
+	}
 out:
 	printf("%s\n", BTRFS_BUILD_VERSION);
 	free_seen_fsid();
diff --git a/volumes.c b/volumes.c
index 8c45851..77ffd32 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,11 +160,12 @@ static int device_list_add(const char *path,
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_fs_devices *seed_devices;
-	struct list_head *cur;
 	struct btrfs_device *device;
+
 again:
-	list_for_each(cur, &fs_devices->devices) {
-		device = list_entry(cur, struct btrfs_device, dev_list);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
 		if (device->fd != -1) {
 			fsync(device->fd);
 			if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
@@ -173,6 +174,11 @@ again:
 			device->fd = -1;
 		}
 		device->writeable = 0;
+		list_del(&device->dev_list);
+		/* free the memory */
+		free(device->name);
+		free(device->label);
+		free(device);
 	}
 
 	seed_devices = fs_devices->seed;
@@ -182,6 +188,7 @@ again:
 		goto again;
 	}
 
+	free(fs_devices);
 	return 0;
 }
-- 
1.9.0
Re: Incremental backup for a raid1
On Mar 13, 2014, at 7:14 PM, Lists li...@benjamindsmith.com wrote: I'm assuming that BTRFS send/receive works similarly to ZFS's similarly named feature.

Similar, yes, but not all options are the same between them; e.g. zfs send -R replicates all descendant file systems. I don't think zfs requires volumes, filesystems, or snapshots to be read-only, whereas btrfs send only works on read-only snapshot subvolumes. There has been some suggestion of recursive snapshot creation and recursive send for btrfs.

So I just don't get the backup problem. Place btrfs' equivalent of a pool on the external drive, and use send/receive of the filesystem or snapshot(s). Does BTRFS work so differently in this regard? If so, I'd like to know what's different.

The topmost thing in zfs is the pool, which on btrfs is the volume. Neither zfs send nor btrfs send works at this level to send everything within a pool/volume. zfs has the file system and btrfs has the subvolume, either of which can be snapshotted, and either (or both) can be used with send. zfs also has the volume, which is a block device that can be snapshotted; there isn't yet a btrfs equivalent. Btrfs and zfs both have clones, but the distinction is stronger with zfs: for example, zfs snapshots can't be deleted unless their clones are deleted first. Btrfs send has a -c clone-src option that I don't really understand, and there is also --reflink (for cp), which is a clone at the file level. Anyway, there are a lot of similarities but also quite a few differences. Basic functionality seems pretty much the same.

My primary interest in BTRFS vs ZFS is two-fold: 1) ZFS has a couple of limitations that I find disappointing, that don't appear to be present in BTRFS. A) Inability to upgrade a non-redundant ZFS pool/vdev to raidz or increase the raidz (redundancy) level after creation. (Yes, you can plan around this, but I see no good reason to HAVE to.) B) Inability to remove a vdev once added to a pool.
2) Licensing: ZFS on Linux has been truly great so far in all my testing, and I can't throw enough compliments their way, but I would really like to rely on a first-class citizen as far as the Linux kernel is concerned.

3. On btrfs you can delete a parent subvolume and the children remain. On zfs, you can't destroy a zfs filesystem/volume unless its snapshots are deleted, and you can't delete snapshots unless their clones are deleted.

Chris Murphy
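The read-only-snapshot requirement discussed above can be sketched as a shell session. This is a hedged sketch, not a recommendation: /mnt/data and /backup are assumed, illustrative mount points (the receiving drive must itself be btrfs, since receive recreates the snapshot as a subvolume); the flags shown (-r for a read-only snapshot, -p for the incremental parent) are the standard btrfs-progs options:

```shell
# Take a read-only snapshot; btrfs send only accepts read-only snapshots.
btrfs subvolume snapshot -r /mnt/data /mnt/data/snap-1

# Initial full transfer to the backup filesystem.
btrfs send /mnt/data/snap-1 | btrfs receive /backup

# Later cycles: snapshot again, then send only the delta, naming the
# previous snapshot (which exists on both sides) as the parent with -p.
btrfs subvolume snapshot -r /mnt/data /mnt/data/snap-2
btrfs send -p /mnt/data/snap-1 /mnt/data/snap-2 | btrfs receive /backup
```

This is roughly the btrfs counterpart of an incremental zfs send -i: the -p parent plays the role of the zfs "from" snapshot.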
Re: discard synchronous on most SSDs?
On Mar 13, 2014, at 8:11 PM, Marc MERLIN m...@merlins.org wrote: On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote: discard is, except on the very latest hardware, a synchronous command (it's a limitation of the SATA standard), and therefore results in very very poor performance. Interesting. How do I know if a given SSD will hang on discard? Is a Samsung EVO 840 1TB SSD "latest hardware" enough, or not? :)

smartctl -a or -x will tell you what SATA revision is in place. Queued trim support is in SATA rev 3.1. I'm not certain whether this requires only the drive to support that revision level, or both controller and drive.

Chris Murphy
Re: 3.14.0-rc3: btrfs send/receive blocks btrfs IO on other devices (near deadlocks)
Marc MERLIN posted on Thu, 13 Mar 2014 18:48:13 -0700 as excerpted: Are others seeing some btrfs operations on filesystem/diskA hang/deadlock other btrfs operations on filesystem/diskB?

Well, if the filesystem in filesystem/diskA and filesystem/diskB is the same (multi-device) filesystem, as the above definitely implies... Tho based on the context I don't believe that's what you actually meant.

Meanwhile, send/receive is intensely focused on bug-finding and bug-fixing at the moment. The basic concept is there, but to this point it has definitely been more development/testing-grade reliability (as befits btrfs' overall state, with the eat-your-babies kconfig option warning only recently toned down to what I'd call semi-stable) than enterprise-grade reliability. Hopefully by the time they're done with all this bug-stomping it'll be rather closer to the latter.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: discard synchronous on most SSDs?
On Thu, Mar 13, 2014 at 09:39:02PM -0600, Chris Murphy wrote: On Mar 13, 2014, at 8:11 PM, Marc MERLIN m...@merlins.org wrote: On Sun, Mar 09, 2014 at 11:33:50AM +, Hugo Mills wrote: discard is, except on the very latest hardware, a synchronous command (it's a limitation of the SATA standard), and therefore results in very very poor performance. Interesting. How do I know if a given SSD will hang on discard? Is a Samsung EVO 840 1TB SSD latest hardware enough, or not? :) smartctl -a or -x will tell you what SATA revision is in place. The queued trim support is in SATA Rev 3.1. I'm not certain if this requires only the drive to support that revision level, or both controller and drive.

I'm not sure I'm seeing this, which field is that?

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 840 EVO 1TB
Serial Number:    S1D9NEAD934600N
LU WWN Device Id: 5 002538 85009a8ff
Firmware Version: EXT0BB0Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4c
Local Time is:    Thu Mar 13 22:15:14 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline data collection: (15000) seconds.
Offline data collection capabilities:    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:    (   2) minutes.
Extended self-test routine recommended polling time: ( 250) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    2219
 12 Power_Cycle_Count       -O--CK   099   099   000    -    659
177 Wear_Leveling_Count     PO--C-   099   099   000    -    3
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   054   041   000    -    46
195 Hardware_ECC_Recovered  -O-RC-   200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
235 Unknown_Attribute       -O--C-   099   099   000    -    35
241 Total_LBAs_Written      -O--CK   099   099   000    -    12186944165
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
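On the question of which field reports the SATA revision: newer smartmontools releases print a "SATA Version is:" line in the -i/-a information section. The output above shows only the ATA version lines, which may be down to the smartctl version in use or the drive being absent from smartctl's database (both assumptions on my part). A minimal sketch, assuming a drive at the illustrative path /dev/sda:

```shell
# Ask the drive for its negotiated SATA level; SATA 3.1 is the revision
# that introduced the queued (non-blocking) TRIM command.
smartctl -i /dev/sda | grep -i 'sata version'
```

If no such line appears even with current smartmontools, the drive likely predates the revisions that report it, which in itself suggests queued TRIM is unavailable.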