Why subvolume and not just volume?
Hi,

Does anyone know the reason subvolumes are not called just volumes? I mean, the top subvolume is not called a volume, so there is nothing to be "sub" of.

Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why subvolume and not just volume?
Martin wrote on 2015/08/05 09:06 +0200:
> Does anyone know the reason subvolumes are not called just volumes? I mean, the top subvolume is not called a volume, so there is nothing to be sub of.

Because normally a volume refers to a complete filesystem. In that respect a subvolume is not a full filesystem: it still needs a lot of extra data from other trees to build its contents. That's why it carries the "sub" prefix. Although it acts much like a volume, in that it can be mounted like a filesystem, it is still not a full filesystem.

> Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

Yes, a subvolume has its overhead, and with as many subvolumes as there are directories, the overhead would not be small.

The overhead that I can remember is shown below, using an empty tree as an example for its size, with default mkfs options. The '+' after a number means it will increase with snapshots:

1) Empty tree block: 16K
   Of course it takes more as its child files/dirs grow
2) ROOT_ITEM in the tree root: 439 bytes
3) ROOT_BACKREF in the tree root: 22+ bytes
4) Extent backref for the tree block:
   33+ bytes with skinny metadata
   53+ bytes without skinny metadata

Along with other trees, like the log tree: one for each subvolume if fsync is called. Not to mention other run-time overhead.

For example, to search for an inode inside one subvolume, search_slot would be enough to find the INODE_ITEM. But to search for an inode across a subvolume boundary, we need to first find the subvolume boundary and loop until we reach the subvolume containing the inode, then do the above search_slot to locate the INODE_ITEM.

So although the overhead is already small, it is not small enough to make every directory a subvolume.
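To put those numbers in perspective, here is a small back-of-the-envelope calculator (a Python sketch of my own, not part of btrfs; it treats the figures from the list above as fixed minimums, even though several of them grow with snapshots):

```python
def subvolume_overhead(n_subvols, skinny_metadata=True):
    """Lower-bound metadata cost of n_subvols empty subvolumes,
    using the per-subvolume figures quoted in the mail above."""
    tree_block = 16 * 1024          # empty subvolume tree block
    root_item = 439                 # ROOT_ITEM in the tree root
    root_backref = 22               # ROOT_BACKREF in the tree root
    extent_backref = 33 if skinny_metadata else 53
    per_subvol = tree_block + root_item + root_backref + extent_backref
    return n_subvols * per_subvol

# One empty subvolume costs at least ~16.5KB of metadata, so a filesystem
# with a million directories-as-subvolumes would start around ~16GB of
# metadata before storing any file data.
print(subvolume_overhead(1))            # 16878
print(subvolume_overhead(1_000_000))
```

This only models the static on-disk cost; the run-time cost of crossing subvolume boundaries during lookups comes on top.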
Thanks,
Qu
Re: [PATCH v2] fstests: btrfs: Add regression test for reserved space leak.
On Wed, Aug 5, 2015 at 2:08 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

The regression was introduced in v4.2-rc1, with the big btrfs qgroup change.

The problem is that qgroup reserved space is never freed, so even if we increase the limit, we can still hit EDQUOT much faster than we should.

Reported-by: Tsutomu Itoh t-i...@jp.fujitsu.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com

Reviewed-by: Filipe Manana fdman...@suse.com

Thanks!

---
 tests/btrfs/089     | 88 +
 tests/btrfs/089.out |  5 +++
 tests/btrfs/group   |  1 +
 3 files changed, 94 insertions(+)
 create mode 100755 tests/btrfs/089
 create mode 100644 tests/btrfs/089.out

diff --git a/tests/btrfs/089 b/tests/btrfs/089
new file mode 100755
index 000..82db96c
--- /dev/null
+++ b/tests/btrfs/089
@@ -0,0 +1,88 @@
+#! /bin/bash
+# FS QA Test 089
+#
+# Regression test for btrfs qgroup reserved space leak.
+#
+# Due to the qgroup reserved space leak, EDQUOT can be triggered even when
+# it's not over the limit after the previous write.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2015 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+# Use a big blocksize to ensure there is still enough space left
+# for the metadata reserve after hitting EDQUOT
+BLOCKSIZE=$(( 2 * 1024 * 1024 ))
+FILESIZE=$(( 128 * 1024 * 1024 ))	# 128Mbytes
+
+# The last block won't be able to finish its write, as metadata takes
+# $NODESIZE space, causing the last block to trigger EDQUOT
+LENGTH=$(( $FILESIZE - $BLOCKSIZE ))
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+_require_fs_space $SCRATCH_MNT $(($FILESIZE * 2 / 1024))
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog qgroup limit $FILESIZE 5 $SCRATCH_MNT
+
+$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $LENGTH" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# A sync is needed to trigger a commit_transaction.
+# As the reserved space freeing happens at commit_transaction time,
+# without a transaction commit no reserved space needs freeing and
+# the bug won't be triggered.
+sync
+
+# Double the limit to allow further writes
+_run_btrfs_util_prog qgroup limit $(($FILESIZE * 2)) 5 $SCRATCH_MNT
+
+# Test whether a further write can succeed
+$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE $LENGTH $LENGTH" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
new file mode 100644
index 000..396888f
--- /dev/null
+++ b/tests/btrfs/089.out
@@ -0,0 +1,5 @@
+QA output created by 089
+wrote 132120576/132120576 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 132120576/132120576 bytes at offset 132120576
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index ffe18bf..225b532 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -91,6 +91,7 @@
 086 auto quick clone
 087 auto quick send
 088 auto quick metadata
+089 auto quick qgroup
 090 auto quick metadata
 091 auto quick qgroup
 092 auto quick send
--
1.8.3.1

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
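As a quick sanity check of the sizes the test above relies on, the shell `$(( ))` arithmetic can be reproduced directly (my own verification sketch, not part of the patch):

```python
# Reproduce the size arithmetic from tests/btrfs/089 above.
BLOCKSIZE = 2 * 1024 * 1024          # 2MiB write block
FILESIZE = 128 * 1024 * 1024         # 128MiB qgroup limit
LENGTH = FILESIZE - BLOCKSIZE        # first write stops one block short

# The first pwrite covers [0, LENGTH); this matches the byte count
# in the golden output 089.out.
print(LENGTH)    # 132120576

# The second pwrite covers [LENGTH, 2*LENGTH), which must fit under the
# doubled limit; without the reserved-space-leak fix it hits EDQUOT anyway.
print(2 * LENGTH <= 2 * FILESIZE)    # True
```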
[PATCH 1/3] btrfs-progs: fsck: Print correct file hole
If a file has lost all its file extents, fsck is unable to print out the hole. Add an extra check to print the hole range in that case.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 50bb6f3..31ed589 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -616,15 +616,20 @@ static void print_inode_error(struct btrfs_root *root, struct inode_record *rec)
 	if (errors & I_ERR_FILE_EXTENT_DISCOUNT) {
 		struct file_extent_hole *hole;
 		struct rb_node *node;
+		int found = 0;
 
 		node = rb_first(&rec->holes);
 		fprintf(stderr, "Found file extent holes:\n");
 		while (node) {
+			found = 1;
 			hole = rb_entry(node, struct file_extent_hole, node);
-			fprintf(stderr, "\tstart: %llu, len:%llu\n",
+			fprintf(stderr, "\tstart: %llu, len: %llu\n",
 				hole->start, hole->len);
 			node = rb_next(node);
 		}
+		if (!found)
+			fprintf(stderr, "\tstart: 0, len: %llu\n",
+				round_up(rec->isize, root->sectorsize));
 	}
 }
--
2.5.0
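The fallback line relies on the kernel-style round_up() for power-of-two alignments; a minimal model of it (my own sketch, not the btrfs-progs macro) shows what length gets reported for an extent-less inode:

```python
def round_up(x, align):
    """Round x up to a multiple of align; align must be a power of two,
    as a filesystem sectorsize always is."""
    return (x + align - 1) & ~(align - 1)

# An inode with isize 5000 on a 4096-byte sectorsize fs, having no file
# extents at all, gets a single hole reported covering the whole file:
print(round_up(5000, 4096))   # 8192  -> "start: 0, len: 8192"
print(round_up(8192, 4096))   # 8192  (already aligned, unchanged)
```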
[PATCH 0/3] Fix for infinite loop on non-empty inode but with no file extent
A bug reported by Robert Munteanu: btrfsck loops infinitely on an inode with a discount file extent.

This patchset includes a fix for printing the file extent hole, a fix for the infinite loop, and a corresponding test case.

BTW, thanks a lot to Robert Munteanu for his detailed debug report, which made it super fast to reproduce the error.

Qu Wenruo (3):
  btrfs-progs: fsck: Print correct file hole
  btrfs-progs: fsck: Fix an infinite loop on discount file extent repair
  btrfs-progs: fsck-tests: Add test case for an inode that lost all its file extents

 cmds-check.c                                        |  16 +++-
 .../017-missing-all-file-extent/default_case.img.xz | Bin 0 -> 1104 bytes
 2 files changed, 15 insertions(+), 1 deletion(-)
 create mode 100644 tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz
--
2.5.0
[PATCH 2/3] btrfs-progs: fsck: Fix an infinite loop on discount file extent repair
In a special case, the discount file extent repair function will loop forever. The case is: if the file has lost all its file extents, we won't have a hole to fill, so the repair function does nothing, and since I_ERR_FILE_EXTENT_DISCOUNT doesn't disappear, fsck loops infinitely.

For such a case, just punch a hole covering the whole range to fix it.

Reported-by: Robert Munteanu robert.munte...@gmail.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 31ed589..4fa8709 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2665,11 +2665,13 @@ static int repair_inode_discount_extent(struct btrfs_trans_handle *trans,
 {
 	struct rb_node *node;
 	struct file_extent_hole *hole;
+	int found = 0;
 	int ret = 0;
 
 	node = rb_first(&rec->holes);
 	while (node) {
+		found = 1;
 		hole = rb_entry(node, struct file_extent_hole, node);
 		ret = btrfs_punch_hole(trans, root, rec->ino,
 				       hole->start, hole->len);
@@ -2683,6 +2685,13 @@ static int repair_inode_discount_extent(struct btrfs_trans_handle *trans,
 		rec->errors &= ~I_ERR_FILE_EXTENT_DISCOUNT;
 		node = rb_first(&rec->holes);
 	}
+	/* special case for a file losing all its file extent */
+	if (!found) {
+		ret = btrfs_punch_hole(trans, root, rec->ino, 0,
+				       round_up(rec->isize, root->sectorsize));
+		if (ret < 0)
+			goto out;
+	}
 	printf("Fixed discount file extents for inode: %llu in root: %llu\n",
 	       rec->ino, root->objectid);
 out:
--
2.5.0
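The control-flow bug can be seen in miniature with this simulation (a toy model of mine, not the cmds-check.c code): the old repair pass produced no punches for a hole-less inode, so the error flag could never be cleared and fsck re-entered repair forever.

```python
def repair(holes, isize, sectorsize):
    """Model of repair_inode_discount_extent(): returns the list of
    (start, len) punches. A non-empty result means the repair made
    progress and the error flag can be cleared."""
    punched = []
    found = False
    for start, length in holes:
        found = True
        punched.append((start, length))
    if not found:
        # the fix: file lost all its file extents, punch the whole range
        aligned = (isize + sectorsize - 1) // sectorsize * sectorsize
        punched.append((0, aligned))
    return punched

# normal case: an inode with a recorded hole gets exactly that hole punched
print(repair([(0, 4096)], 8192, 4096))   # [(0, 4096)]
# the buggy case: no holes at all -- without the fallback this returned []
# and fsck spun forever; with it, one full-range punch is made
print(repair([], 5000, 4096))            # [(0, 8192)]
```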
[PATCH 3/3] btrfs-progs: fsck-tests: Add test case for an inode that lost all its file extents
Add test case with no file extents, but still non-zero inode size. To test whether fsck will infinite loop. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- .../017-missing-all-file-extent/default_case.img.xz | Bin 0 - 1104 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz diff --git a/tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz b/tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz new file mode 100644 index ..10cd4c786e1223d1d00eaeab81941b142135d97c GIT binary patch literal 1104 zcmV-W1h4!3H+ooF000E$*0e?f03iVu0001VFXf})lm7${TwRyj;C3^v%$$4d1rE0 zjjaER49m*J$Ny#wkkW^{#)(qVu6jW@U${Z!$pUxwR_2M^H7KU2hve~sxWXa?I z%NmWQokysI=scC%OAq$mZ!a?Xa@3%R}h@8vWy-xl~qOuWtRg-R5!%XR;99;~oq zdk{P`Ryx4e(-g_fYI=#ZDobJ@0`2Gi5Z==2t%tsj$0@PK$0}7L`88Br(*1nkpNK z_XxRVtghToWQcwSZMnTkBh=v`!_jtFZnBcHm1wf5wOm9i*QuB{`*iwt$@={jV)* z%}T2!cPnR{zgSI}F-k1(|ItD#YU*lQ5caPRu;$454g#EIMlP+@`=aT*A@LADD7u zk@V7c2kM}eBPO^Qu;ANX4!roj9d3LYnAteNtD3soCJnDoNp54HZ4ZeX@naau}+a^ zy|#?-`{(08Rl!wH-SY4?wetPi2Qb(f`c45RskYyF)v;1`SlFlxU5!p|~;ics;3^ zBX%=|JUR^Ky(ynC=M7)svk~E4@16W$NYHYzCJ3;@@${p?4~ysw=fq^vz(x(PtElF zwf3WM-8M{A?g(sW#s;uPp5)F}4lhgs%K1q1nFGU`)GD9qPGir}B%j3$a3;A?Gq|tS zaDXnqJ{7yrj8iPECu)i;L_ouU|17}`kpGA2x2A0_=e3S$bZ7H^DTsDQRGOZb@F;q z%#Y6-P@=P|nbod(i7Gd3Lvl9vVzd!)uy)eEr2h(pFvfGR9gMp7SL=Ww6;cvzs zSIkg93uy516A@onIsCrvlQqwJlV~$z#z)3I_GDFsRim|mp%_0{a_wov_H$C4TDq zwn7^-kcft_F)4WPt0OV+n3W`nZ4OC`z1agtXhR}#S0IqeJw$Lsxl4BHy`9hHs zBtPA?RQYVy53bCXwhyaxL;{eBnLK1F~EqXTP1;aSvcM5-y;6(p2G02#C0S{EE=g zW1iR3fbSs^zRyjGTuc(qgYeJO3yRzBq+JtviiuggemrKK5?gVyt|AD#jSXzU4VQ zfop5yezs3WSrm8QE|MhpJH!92~QSYE^Eq1p4jh$Idi{1`RB5jNvfrJ7h*)M7 zoMsbcB?RsO4P-LyY62_cY6mGWv~hlA;J%`X3cfL4vfPWRRidP0rp|UFe_14^R2I z{a1+AOJXzAm=7pkt%~NC1$Zx1R?`MOxJIC_dRjwdc8K9d63f6rzoibNqCnp{u$U z+A~#XVza7Y=CSF*ohfxwwN9we2j*qrOWm@Z+eMws1MlfN_ibWsO9Z^OTT^VoVw{S zPt09il30LPI_!MVdj_{j0W`{}1|)2jXtn*D`NB+JQbn9BfB*m@w2urGUN=kt0jmgr 
Wr~{`ruvAn#Ao{g01X)^bF)Hl literal 0 HcmV?d1
--
2.5.0
Lockup in BTRFS_IOC_CLONE/Kernel 4.2.0-rc5
I can reproduce a hard btrfs lockup (the process issuing the ioctl() is in D-state, same goes for the btrfs-transacti process) on kernel 4.2.0-rc5. I had the same issue on 4.1, so it's unlikely to be a regression introduced in 4.2.

## With the following steps, I can reproduce the problem:

1. Create a new clean btrfs volume for /var/lib/machines

   machinectl set-limit 6G

2. Paste this to /tmp/yum.conf

   [main]
   reposdir=/dev/null
   gpgcheck=0
   logfile=/var/log/yum.log
   installroot=/var/lib/machines/centos7.1-base
   assumeyes=1

   [base]
   name=CentOS 7.1.1503 - x86_64
   baseurl=http://mirror.centos.org/centos/7.1.1503/os/x86_64/
   enabled=1

3. Bootstrap a CentOS 7.1 base image

   /usr/bin/yum -c /tmp/yum.conf groupinstall Base

4. Start an ephemeral systemd-nspawn container based on 'centos7.1-base'

   strace -o /tmp/systemd-nspawn.out -s 500 -f systemd-nspawn -xbD /var/lib/machines/centos7.1-base/

`systemd-nspawn` will now just hang forever. I haven't yet come up with a shorter/more low-level way to reproduce this, as I lack quite a bit of btrfs experience.

## Results:

- Last 'strace' lines:

6095  fchown(16, 0, 0) = 0
6095  fchmod(16, 0755) = 0
6095  utimensat(16, NULL, {{1402362275, 0}, {1438761285, 819041906}}, 0) = 0
6095  flistxattr(15, "", 100) = 0
6095  getdents(15, /* 3 entries */, 32768) = 80
6095  newfstatat(15, "coreutils.mo", {st_mode=S_IFREG|0644, st_size=357263, ...}, AT_SYMLINK_NOFOLLOW) = 0
6095  openat(15, "coreutils.mo", O_RDONLY|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC) = 17
6095  openat(16, "coreutils.mo", O_WRONLY|O_CREAT|O_EXCL|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC, 0644) = 18
6095  fstat(18, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
6095  ioctl(18, BTRFS_IOC_CLONE

- Call trace in the kernel journal:

Aug 05 10:10:03 moria kernel: INFO: task btrfs-transacti:4175 blocked for more than 120 seconds.
Aug 05 10:10:03 moria kernel:       Tainted: G O 4.2.0-rc5 #2
Aug 05 10:10:03 moria kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 10:10:03 moria kernel: btrfs-transacti D 8800b13279f8 0 4175 2 0x00080080
Aug 05 10:10:03 moria kernel:  8800b13279f8 88018fd3a380 8800ab4521c0 0246
Aug 05 10:10:03 moria kernel:  8800b1328000 88018d5c8518 88018debdba0 880232d64990
Aug 05 10:10:03 moria kernel:  0197 8800b1327a18 86999201
Aug 05 10:10:03 moria kernel: Call Trace:
Aug 05 10:10:03 moria kernel:  [86999201] schedule+0x74/0x83
Aug 05 10:10:03 moria kernel:  [863ef8f0] btrfs_tree_lock+0xa7/0x1b7
Aug 05 10:10:03 moria kernel:  [86137ed7] ? wait_woken+0x74/0x74
Aug 05 10:10:03 moria kernel:  [8639d30f] push_leaf_right+0x9a/0x19f
Aug 05 10:10:03 moria kernel:  [8639dd9b] split_leaf+0x100/0x63f
Aug 05 10:10:03 moria kernel:  [86398f09] ? leaf_space_used+0xbb/0xea
Aug 05 10:10:03 moria kernel:  [863efa61] ? btrfs_set_lock_blocking_rw+0x52/0x95
Aug 05 10:10:03 moria kernel:  [8639ea46] btrfs_search_slot+0x76c/0x8b3
Aug 05 10:10:03 moria kernel:  [863a0107] btrfs_insert_empty_items+0x58/0xa3
Aug 05 10:10:03 moria kernel:  [8640805a] btrfs_insert_delayed_items+0x7f/0x3bb
Aug 05 10:10:03 moria kernel:  [8640842e] __btrfs_run_delayed_items+0x98/0x1c0
Aug 05 10:10:03 moria kernel:  [86408739] btrfs_run_delayed_items+0xc/0xe
Aug 05 10:10:03 moria kernel:  [863bdc50] btrfs_commit_transaction+0x298/0xb66
Aug 05 10:10:03 moria kernel:  [863be8d0] ? start_transaction+0x3b2/0x535
Aug 05 10:10:03 moria kernel:  [863b9cd9] transaction_kthread+0x100/0x1d6
Aug 05 10:10:03 moria kernel:  [863b9bd9] ? btrfs_cleanup_transaction+0x49f/0x49f
Aug 05 10:10:03 moria kernel:  [8611eca9] kthread+0xcd/0xd5
Aug 05 10:10:03 moria kernel:  [8611ebdc] ? kthread_create_on_node+0x17d/0x17d
Aug 05 10:10:03 moria kernel:  [8699d29f] ret_from_fork+0x3f/0x70
Aug 05 10:10:03 moria kernel:  [8611ebdc] ? kthread_create_on_node+0x17d/0x17d
Aug 05 10:10:03 moria kernel: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
Aug 05 10:10:03 moria kernel:       Tainted: G O 4.2.0-rc5 #2
Aug 05 10:10:03 moria kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 10:10:03 moria kernel: systemd-nspawn D 88019f3e3668 0 6095 6090 0x00080083
Aug 05 10:10:03 moria kernel:  88019f3e3668 86e5d480 88018fd3a380 0246
Aug 05 10:10:03 moria kernel:  88019f3e4000 88018debdc08 88018fd3a380 88019f3e36b8
Aug 05 10:10:03 moria kernel:  88018fd3a380 88019f3e3688 86999201
Aug 05 10:10:03 moria kernel: Call Trace:
Aug 05 10:10:03 moria kernel:  [86999201] schedule+0x74/0x83
Aug 05 10:10:03 moria kernel:  [863ef64c] btrfs_tree_read_lock+0xc0/0xea
Aug 05 10:10:03 moria kernel:  [86137ed7] ?
[PATCH] btrfs-progs: Modify confusing error message in scrub
Scrub outputs the following error message in my test:
  ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success)

It is caused by a broken kernel and fs, but we need to avoid outputting both an error and "Success" in one message as above.

This patch modifies the above message to:
  ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index 7c9318e..2529956 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1457,21 +1457,25 @@ static int scrub_start(int argc, char **argv, int resume)
 			++err;
 			continue;
 		}
-		if (sp[i].ret && sp[i].ioctl_errno == ENODEV) {
-			if (do_print)
-				fprintf(stderr, "WARNING: device %lld not "
-					"present\n", devid);
-			continue;
-		}
-		if (sp[i].ret && sp[i].ioctl_errno == ECANCELED) {
-			++err;
-		} else if (sp[i].ret) {
-			if (do_print)
-				fprintf(stderr, "ERROR: scrubbing %s failed "
-					"for device id %lld (%s)\n", path,
-					devid, strerror(sp[i].ioctl_errno));
-			++err;
-			continue;
+		if (sp[i].ret) {
+			switch (sp[i].ioctl_errno) {
+			case ENODEV:
+				if (do_print)
+					fprintf(stderr, "WARNING: device %lld not present\n",
+						devid);
+				continue;
+			case ECANCELED:
+				++err;
+				break;
+			default:
+				if (do_print)
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)\n",
+						path, devid,
+						sp[i].ret, sp[i].ioctl_errno,
+						strerror(sp[i].ioctl_errno));
+				++err;
+				continue;
+			}
 		}
 		if (sp[i].scrub_args.progress.uncorrectable_errors > 0)
 			e_uncorrectable++;
--
1.8.5.1
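The self-contradictory "(Success)" comes from passing an errno of 0 to strerror() when the ioctl reported failure through its return value instead. This is easy to demonstrate outside the scrub code (a standalone illustration, not part of the patch; the exact string is the glibc one, so it assumes a glibc-based Linux system):

```python
import os

# strerror(0) is what produced "failed ... (Success)": the ioctl returned
# a nonzero ret while ioctl_errno stayed 0, and the old message printed
# only strerror(ioctl_errno).
print(os.strerror(0))    # 'Success' on glibc

# The new message prints ret and errno explicitly, so a zero errno is
# visible rather than masquerading as a success string:
ret, ioctl_errno = 1, 0
print("ERROR: ... failed for device id 5, ret=%d, errno=%d(%s)"
      % (ret, ioctl_errno, os.strerror(ioctl_errno)))
```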
[PATCH v4 2/4] btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
This reduces currently duplicated code that is similar to scrub_blocked_if_needed() but cannot call it because of small differences. The new helpers are also used by my next patch, which has a similar case.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/scrub.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 94db0fa..cbfb8c7 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -332,11 +332,14 @@ static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 	}
 }
 
-static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
+static void scrub_pause_on(struct btrfs_fs_info *fs_info)
 {
 	atomic_inc(&fs_info->scrubs_paused);
 	wake_up(&fs_info->scrub_pause_wait);
+}
 
+static void scrub_pause_off(struct btrfs_fs_info *fs_info)
+{
 	mutex_lock(&fs_info->scrub_lock);
 	__scrub_blocked_if_needed(fs_info);
 	atomic_dec(&fs_info->scrubs_paused);
@@ -345,6 +348,12 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 	wake_up(&fs_info->scrub_pause_wait);
 }
 
+static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
+{
+	scrub_pause_on(fs_info);
+	scrub_pause_off(fs_info);
+}
+
 /*
  * used for workers that require transaction commits (i.e., for the
  * NOCOW case)
--
1.8.5.1
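The refactor is the usual "split a bracketing function into on/off halves" move: the old function's first half and second half become callable separately, so callers can run work in between. A toy model of mine (not kernel code) of why the split helps:

```python
class ScrubState:
    """Stand-in for the paused-scrub bookkeeping in btrfs_fs_info."""
    def __init__(self):
        self.scrubs_paused = 0
        self.log = []

def scrub_pause_on(s):
    s.scrubs_paused += 1            # atomic_inc(&fs_info->scrubs_paused)
    s.log.append("wake waiters")    # wake_up(&fs_info->scrub_pause_wait)

def scrub_pause_off(s):
    s.log.append("wait if blocked") # stands in for __scrub_blocked_if_needed()
    s.scrubs_paused -= 1

def scrub_blocked_if_needed(s):
    # old behaviour is now just the two halves back to back
    scrub_pause_on(s)
    scrub_pause_off(s)

s = ScrubState()
scrub_blocked_if_needed(s)
assert s.scrubs_paused == 0

# callers like scrub_enumerate_chunks() can now run work while paused,
# which is exactly what patch 4/4 needs for btrfs_inc_block_group_ro():
scrub_pause_on(s)
s.log.append("do work while paused")
scrub_pause_off(s)
assert s.scrubs_paused == 0
```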
[PATCH v4 3/4] btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
Using the newly introduced scrub_pause_on/off() makes this code block cleaner and more readable.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/scrub.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cbfb8c7..a882a34 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3492,8 +3492,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		wait_event(sctx->list_wait,
 			   atomic_read(&sctx->bios_in_flight) == 0);
-		atomic_inc(&fs_info->scrubs_paused);
-		wake_up(&fs_info->scrub_pause_wait);
+
+		scrub_pause_on(fs_info);
 
 		/*
 		 * must be called before we decrease @scrub_paused.
@@ -3504,11 +3504,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   atomic_read(&sctx->workers_pending) == 0);
 		atomic_set(&sctx->wr_ctx.flush_all_writes, 0);
 
-		mutex_lock(&fs_info->scrub_lock);
-		__scrub_blocked_if_needed(fs_info);
-		atomic_dec(&fs_info->scrubs_paused);
-		mutex_unlock(&fs_info->scrub_lock);
-		wake_up(&fs_info->scrub_pause_wait);
+		scrub_pause_off(fs_info);
 
 		btrfs_put_block_group(cache);
 		if (ret)
--
1.8.5.1
[PATCH v4 4/4] btrfs: Fix data checksum error caused by replace with io-load
xfstests btrfs/070 sometimes fails. On my test machine, its failure rate is about 30%. In a vm (vmware), its failure rate is about 50%.

Reason:
btrfs/070 runs replace and defrag with fsstress simultaneously; after the above operations, a checksum error is found by scrub. Actually, it has no relationship with the defrag operation: replace with fsstress alone can trigger this bug.

In debugging, we found that new data written to the target device can be rewritten with old data from the source device by the replace code. To avoid this problem, we can set the target block group to read-only during the replace period, so new data requested by other operations will not be written to the same place the replace code is working on.

Before patch (4.1-rc3): 30% failed in 100 xfstests runs.
After patch: 0% failed in 300 xfstests runs.

It also happened in btrfs/071, as it is another scrub-with-IO-load test.

Reported-by: Qu Wenruo quwen...@cn.fujitsu.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
Changelog v3->v4:
Patch v3 caused xfstests/061 to fail in some cases, because btrfs_inc_block_group_ro() includes a btrfs_end_transaction() operation, which will change data in reloc_ctl->data_inode and cause a deadlock in relocation:

  scrub                          relocate
                                 relocate_file_extent_cluster()
                                   prealloc_file_extent_cluster()
                                   ...
  btrfs_inc_block_group_ro()
    btrfs_wait_for_commit()
                                   insert_reserved_file_extent()
                                     btrfs_set_file_extent_disk_num_bytes()
                                     (modifies reloc_ctl->data_inode)
                                   ...
                                   do_relocation()
                                     get_new_location()
                                       returns -EINVAL (because data_inode's extent changed)
                                     __btrfs_cow_block()
                                       returns -EINVAL (without unlocking eb)
                                     btrfs_search_slot()
                                       deadlock (tries to lock eb again)

Changelog v2->v3:
1: Fix a typo (introduced in a rebase) which made xfstests fail in btrfs/073 and btrfs/066.

Changelog v1->v2:
Nothing for this patch.
---
 fs/btrfs/scrub.c   | 34 +++---
 fs/btrfs/volumes.c |  2 ++
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a882a34..e04436f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3396,7 +3396,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	u64 chunk_tree;
 	u64 chunk_objectid;
 	u64 chunk_offset;
-	int ret;
+	int ret = 0;
 	int slot;
 	struct extent_buffer *l;
 	struct btrfs_key key;
@@ -3424,8 +3424,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (path->slots[0] >=
 		    btrfs_header_nritems(path->nodes[0])) {
 			ret = btrfs_next_leaf(root, path);
-			if (ret)
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
 				break;
+			}
+		} else {
+			ret = 0;
 		}
 	}
@@ -3467,6 +3473,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		/*
+		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
+		 * to avoid deadlock caused by:
+		 * btrfs_inc_block_group_ro()
+		 * -> btrfs_wait_for_commit()
+		 * -> btrfs_commit_transaction()
+		 * -> btrfs_scrub_pause()
+		 */
+		scrub_pause_on(fs_info);
+		ret = btrfs_inc_block_group_ro(root, cache);
+		scrub_pause_off(fs_info);
+		if (ret) {
+			btrfs_put_block_group(cache);
+			break;
+		}
+
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;
 		dev_replace->item_needs_writeback = 1;
@@ -3506,6 +3528,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		btrfs_dec_block_group_ro(root, cache);
+
 		btrfs_put_block_group(cache);
 		if (ret)
 			break;
@@ -3528,11 +3552,7 @@ skip:
 
 	btrfs_free_path(path);
 
-	/*
-	 * ret can still be 1 from search_slot or next_leaf,
-	 * that's not an error
-	 */
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9b95503..66f5a15 100644
[PATCH v4 0/4] btrfs: Fix data checksum error caused by replace with io-load
This patchset fixes a data checksum error caused by replace with io-load, which makes xfstests btrfs/070 (and 071) fail randomly. See the description in [PATCH 4/4] for details.

Changelog v3->v4:
1: Fix a regression in xfstests/061.
Patch v3 caused xfstests/061 to fail in some cases, because btrfs_inc_block_group_ro() includes a btrfs_end_transaction() operation, which will change data in reloc_ctl->data_inode and cause a deadlock in relocation:

  scrub                          relocate
                                 relocate_file_extent_cluster()
                                   prealloc_file_extent_cluster()
                                   ...
  btrfs_inc_block_group_ro()
    btrfs_wait_for_commit()
                                   insert_reserved_file_extent()
                                     btrfs_set_file_extent_disk_num_bytes()
                                     (modifies reloc_ctl->data_inode)
                                   ...
                                   do_relocation()
                                     get_new_location()
                                       returns -EINVAL (because data_inode's extent changed)
                                     __btrfs_cow_block()
                                       returns -EINVAL (without unlocking eb)
                                     btrfs_search_slot()
                                       deadlock (tries to lock eb again)

Changelog v2->v3:
1: Fix a typo (introduced in a rebase) which made xfstests fail in btrfs/073 and btrfs/066.
2: Rebase on top of integration-4.2.
3: Do full xfstests (generic and btrfs groups with 10 mount options).

Changelog v1->v2:
1: Update the subject to reflect the problem being fixed.
2: Update the description to say why setting read-only fixes the problem.
3: Use a helper function to avoid a duplicated code block for setting the chunk ro.
All of the above were suggested by: David Sterba dste...@suse.cz

Zhao Lei (4):
  btrfs: Use ref_cnt for set_block_group_ro()
  btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
  btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
  btrfs: Fix data checksum error caused by replace with io-load.
 fs/btrfs/ctree.h       |  6 +++---
 fs/btrfs/extent-tree.c | 42 +++---
 fs/btrfs/relocation.c  | 14 ++---
 fs/btrfs/scrub.c       | 55 --
 fs/btrfs/volumes.c     |  2 ++
 5 files changed, 72 insertions(+), 47 deletions(-)

--
1.8.5.1
[PATCH v4 1/4] btrfs: Use ref_cnt for set_block_group_ro()
More than one place calls set_block_group_ro() and restores rw on failure. The old code used a bool bit to save the block group's ro state, which cannot support the parallel case (confirmed to exist in my debug log).

This patch uses a ref count to store the ro state, and renames set_block_group_ro/set_block_group_rw to inc_block_group_ro/dec_block_group_ro.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/ctree.h       |  6 +++---
 fs/btrfs/extent-tree.c | 42 +-
 fs/btrfs/relocation.c  | 14 ++
 3 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e..f57e6ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1300,7 +1300,7 @@ struct btrfs_block_group_cache {
 	/* for raid56, this is a full stripe, without parity */
 	unsigned long full_stripe_len;
 
-	unsigned int ro:1;
+	unsigned int ro;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
@@ -3495,9 +3495,9 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 void btrfs_block_rsv_release(struct btrfs_root *root,
 			     struct btrfs_block_rsv *block_rsv,
 			     u64 num_bytes);
-int btrfs_set_block_group_ro(struct btrfs_root *root,
+int btrfs_inc_block_group_ro(struct btrfs_root *root,
 			     struct btrfs_block_group_cache *cache);
-void btrfs_set_block_group_rw(struct btrfs_root *root,
+void btrfs_dec_block_group_ro(struct btrfs_root *root,
 			      struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1c2bd17..a436bd5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8692,14 +8692,13 @@ static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
 	return flags;
 }
 
-static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
+static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
 	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
-
 	/*
 	 * We need some metadata space and system metadata space for
 	 * allocating chunks in some corner cases until we force to set
@@ -8716,6 +8715,7 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 
 	spin_lock(&cache->lock);
 
 	if (cache->ro) {
+		cache->ro++;
 		ret = 0;
 		goto out;
 	}
@@ -8727,7 +8727,7 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
 	    min_allocable_bytes <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
-		cache->ro = 1;
+		cache->ro++;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
 		ret = 0;
 	}
@@ -8737,7 +8737,7 @@ out:
 	return ret;
 }
 
-int btrfs_set_block_group_ro(struct btrfs_root *root,
+int btrfs_inc_block_group_ro(struct btrfs_root *root,
 			     struct btrfs_block_group_cache *cache)
 {
@@ -8745,8 +8745,6 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 	u64 alloc_flags;
 	int ret;
 
-	BUG_ON(cache->ro);
-
 again:
 	trans = btrfs_join_transaction(root);
 	if (IS_ERR(trans))
@@ -8789,7 +8787,7 @@ again:
 		goto out;
 	}
 
-	ret = set_block_group_ro(cache, 0);
+	ret = inc_block_group_ro(cache, 0);
 	if (!ret)
 		goto out;
 	alloc_flags = get_alloc_profile(root, cache->space_info->flags);
@@ -8797,7 +8795,7 @@ again:
 				     CHUNK_ALLOC_FORCE);
 	if (ret < 0)
 		goto out;
-	ret = set_block_group_ro(cache, 0);
+	ret = inc_block_group_ro(cache, 0);
 out:
 	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 		alloc_flags = update_block_group_flags(root, cache->flags);
@@ -8860,7 +8858,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 	return free_bytes;
 }
 
-void btrfs_set_block_group_rw(struct btrfs_root *root,
+void btrfs_dec_block_group_ro(struct btrfs_root *root,
 			      struct btrfs_block_group_cache *cache)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
@@ -8870,11 +8868,13 @@ void btrfs_set_block_group_rw(struct btrfs_root *root,
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
-	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
-		    cache->bytes_super - btrfs_block_group_used(&cache->item);
-	sinfo->bytes_readonly -= num_bytes;
-
[RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves
From: Michal Hocko mho...@suse.com

__GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement, and as such it also deserves special treatment when the system is OOM. The primary problem is that the allocation request might have come with some locks held and the OOM victim might be blocked on those same locks. This is basically an OOM deadlock.

This patch reduces the risk of such deadlocks by giving __GFP_NOFAIL allocations access to memory reserves after the OOM killer has been invoked. This should help them make progress and release the resources they are holding. The OOM victim should compensate for the reserves consumption.

Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..ee69c338ca2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
+
+		if (gfp_mask & __GFP_NOFAIL) {
+			page = get_page_from_freelist(gfp_mask, order,
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+			WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+					" Consider increasing min_free_kbytes.\n");
+		}
+	}
out:
 	mutex_unlock(&oom_lock);
 	return page;
--
2.5.0
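As a rough userspace analogy of the reserve-dipping pattern the patch adds (this is not kernel code; the names and the static pool are invented for illustration): try the normal allocation path first, and only when it fails fall back to a small pre-set reserve, warning when even the reserve is exhausted.

```c
#include <stdio.h>
#include <stdlib.h>

/* A small static pool standing in for the kernel's memory reserves
 * (ALLOC_NO_WATERMARKS). Memory handed out from it must not be free()d. */
static unsigned char reserve_pool[4096];
static size_t reserve_used;

/* "Must succeed" allocation: normal path first, reserve as a last resort. */
void *alloc_nofail(size_t size)
{
    void *p = malloc(size);          /* normal, watermark-obeying path */
    if (p)
        return p;

    size = (size + 15) & ~(size_t)15; /* keep reserve chunks aligned */
    if (reserve_used + size <= sizeof(reserve_pool)) {
        p = reserve_pool + reserve_used;
        reserve_used += size;
        return p;
    }

    /* Even the reserve is exhausted: warn, as the patch's WARN_ONCE does. */
    fprintf(stderr, "alloc_nofail: reserve exhausted for %zu bytes\n", size);
    return NULL;
}
```

In the kernel the "reserve" is the pages below the watermarks, and the OOM victim is expected to pay the debt back; the sketch only mirrors the control flow, not that accounting.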
[RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
From: Michal Hocko mho...@suse.com

alloc_btrfs_bio relies on GFP_NOFS to allocate a bio, but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" this allocation is allowed to fail, which can lead to:

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable; the nofail behavior should be explicit if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 			 * and the stripes
 			 */
 			sizeof(u64) * (total_stripes),
-			GFP_NOFS);
+			GFP_NOFS|__GFP_NOFAIL);
 	if (!bbio)
 		return NULL;
--
2.5.0
[RFC 7/8] btrfs: Prevent from early transaction abort
From: Michal Hocko mho...@suse.com

Btrfs relies on GFP_NOFS allocations when committing a transaction, but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" those allocations are allowed to fail, which can lead to a premature transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [8154e6f0] dump_stack+0x4f/0x7b
[ 55.330518] [8108fa28] ? console_unlock+0x334/0x363
[ 55.332738] [8110873e] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [81100752] pagecache_get_page+0x10e/0x20c
[ 55.336844] [a007d916] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [a0059d8c] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [a004f728] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [a003fa34] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [a0040567] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [810682d7] ? get_parent_ip+0xe/0x3e
[ 55.349434] [a0041cb2] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [a004ecfb] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [81186808] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [81186833] vfs_fsync+0x1c/0x1e
[ 55.370654] [81186869] do_fsync+0x34/0x4e
[ 55.372246] [81186ab3] SyS_fsync+0x10/0x14
[ 55.373851] [81554f97] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [8154e6f0] dump_stack+0x4f/0x7b
[ 55.384357] [8107f717] ? down_trylock+0x2d/0x37
[ 55.384359] [81046977] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [a00a1d6b] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [81046a34] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [a00a1d6b] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [a004e5f7] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [a004e600] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [81186808] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [81186833] vfs_fsync+0x1c/0x1e
[ 55.384593] [81186869] do_fsync+0x34/0x4e
[ 55.384594] [81186ab3] SyS_fsync+0x10/0x14
[ 55.384595] [81554f97] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path with an explicit __GFP_NOFAIL.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/extent_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..88fad7051e38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;

-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	if (eb == NULL)
 		return NULL;
 	eb->start = start;
@@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;

 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
--
2.5.0
[RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
From: Michal Hocko mho...@suse.com

A journal transaction might fail prematurely because the frozen_buffer is allocated by a GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal, and GFP_NOFS is too restrictive for the memory allocator, so let's use __GFP_NOFAIL here to emulate the previous behavior. The jbd code has the very same issue, so let's do the same there as well.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
--
2.5.0
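The shape of the call-site change above can be sketched in plain, standalone C (hypothetical names, not the jbd API; the retry loop stands in for the kernel's never-fail semantics, where reclaim runs between attempts): the caller asks for never-fail behavior and its whole -ENOMEM error path disappears.

```c
#include <stdlib.h>
#include <string.h>

/* Emulated never-fail allocator: keep retrying until something frees up.
 * In the kernel, __GFP_NOFAIL lets the page allocator do this internally. */
void *buf_alloc_nofail(size_t size)
{
    void *p;
    while ((p = malloc(size)) == NULL)
        ;  /* reclaim/OOM killing would happen here in the kernel */
    return p;
}

/* With never-fail semantics, the "abort the journal on OOM" branch that
 * the patch deletes is simply gone from the caller: */
char *copy_frozen(const char *src, size_t size)
{
    char *frozen = buf_alloc_nofail(size);  /* cannot return NULL */
    memcpy(frozen, src, size);
    return frozen;
}
```

The trade-off is the same one the cover letter discusses: the caller gives up any fallback policy in exchange for never having to unwind.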
[RFC 6/8] ext3: Do not abort journal prematurely
From: Michal Hocko mho...@suse.com

journal_get_undo_access relies on a GFP_NOFS allocation, yet that allocation is essential for the journal transaction:

[ 83.256914] journal_get_undo_access: No memory for committed data
[ 83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access
[ 83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory
[ 83.267130] Aborting journal on device hdb1.
[ 83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal
[ 83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" these allocation requests are allowed to fail, so we need to use __GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..6c60376a29bc 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
--
2.5.0
[RFC 5/8] ext4: Do not fail journal due to block allocator
From: Michal Hocko mho...@suse.com

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the memory allocator doesn't endlessly loop to satisfy low-order allocations and instead fails them to allow callers to handle the failure gracefully. Some of the callers are not yet prepared for this behavior, though. The ext4 block allocator relies solely on GFP_NOFS allocation requests, and allocation failures lead to aborting the journal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G W 4.0.0-nofs3-6-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 880005a17708 81538a54 8107a40f
[ 345.028341] 0050 880005a17798 810fe854 00018000
[ 345.028342] 0046 81a52100 0246
[ 345.028343] Call Trace:
[ 345.028348] [81538a54] dump_stack+0x4f/0x7b
[ 345.028370] [810fe854] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [81101bd2] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [810f9933] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [a005bc64] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [a006014f] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [a0054e14] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [a005677e] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [a00342bf] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [a003573c] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [8116d04f] evict+0xa0/0x148
[ 345.028466] [8116dca8] iput+0x1a1/0x1f0
[ 345.028468] [811697b4] __dentry_kill+0x136/0x1a6
[ 345.028470] [81169a3e] dput+0x21a/0x243
[ 345.028472] [81157cda] __fput+0x184/0x19b
[ 345.028473] [81157d29] fput+0xe/0x10
[ 345.028475] [8105a05f] task_work_run+0x8a/0xa1
[ 345.028477] [810452f0] do_exit+0x3c6/0x8dc
[ 345.028482] [8104588a] do_group_exit+0x4d/0xb2
[ 345.028483] [8104eeeb] get_signal+0x5b1/0x5f5
[ 345.028488] [81002202] do_signal+0x28/0x5d0
[...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is very restricted - especially in fs metadata heavy loads. Before we go with a more sophisticated solution, let's simply imitate the previous behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the buddy block allocator. I wasn't able to trigger the issue with this patch anymore.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,

 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb,
[RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
From: Johannes Weiner han...@cmpxchg.org

GFP_NOFS allocations are not allowed to invoke the OOM killer since their reclaim abilities are severely diminished. However, without the OOM killer available there is no hope of progress once the reclaimable pages have been exhausted. Don't risk hanging these allocations. Leave it to the allocation site to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner han...@cmpxchg.org
Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
 	/* The OOM killer does not compensate for IO-less reclaim */
-	if (!(gfp_mask & __GFP_FS)) {
-		/*
-		 * XXX: Page reclaim didn't yield anything,
-		 * and the OOM killer can't be invoked, but
-		 * keep looping as per tradition.
-		 */
-		*did_some_progress = 1;
+	if (!(gfp_mask & __GFP_FS))
 		goto out;
-	}
 	if (pm_suspended_storage())
 		goto out;
 	/* The OOM killer may not free memory on a specific node */
--
2.5.0
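Once GFP_NOFS requests may fail, the fallback policy moves to the call site, as the commit message says. A minimal userspace sketch of one such policy (all names invented for illustration; real callers might instead retry, use a preallocated buffer, or propagate the error): shrink the request and try again, and only report failure when even a small request cannot be satisfied.

```c
#include <stddef.h>
#include <stdlib.h>

/* Try to allocate `want` bytes; on failure, halve the request and retry
 * down to a 64-byte floor. On success, *got reports the size obtained so
 * the caller can adapt (e.g. process data in smaller batches). */
void *alloc_with_fallback(size_t want, size_t *got)
{
    size_t size = want;

    while (size >= 64) {
        void *p = malloc(size);
        if (p) {
            *got = size;
            return p;
        }
        size /= 2;          /* fallback policy: degrade gracefully */
    }

    *got = 0;
    return NULL;            /* final policy: let the caller fail the op */
}
```

The point of the patch is precisely that such decisions belong to the caller, which knows what failure means for its operation, rather than to the allocator looping "as per tradition".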
[RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation
From: Michal Hocko mho...@suse.com

page_cache_read has historically used page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to prevent fs recursion issues. page_cache_read is, however, not called from the fs layer directly, so it doesn't normally need this protection. ceph and ocfs2, which call filemap_fault from their fault handlers, seem to be OK because they do not take any fs lock before invoking the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is safe from the reclaim-recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim.

The GFP_NOFS protection might even be harmful. There is a push to fail GFP_NOFS allocations rather than loop within the allocator indefinitely with a very limited reclaim ability. Once we start failing those requests, the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and ends up in pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on the NOFS semantics of the stored gfp_mask. The mask is also a property of the inode, so changing it there would even be a layering violation. What we can do instead is push the gfp_mask into struct vm_fault and allow the fs layer to overwrite it should the callback need to be called with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this should normally be safe from the page fault path. Why do we care about mapping_gfp_mask at all then? Because it doesn't hold only reclaim protection flags but might also contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to respect those.
Reported-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
Signed-off-by: Michal Hocko mho...@suse.com
---
 include/linux/mm.h |  4 ++++
 mm/filemap.c       |  9 ++++-----
 mm/memory.c        | 17 +++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..962e37c7cd6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -220,10 +220,14 @@ extern pgprot_t protection_map[16];
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
  * pgoff should be used in favour of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
+	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
diff --git a/mm/filemap.c b/mm/filemap.c
index b63fb81df336..8a16a07bbe02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
 	int ret;

 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;

-		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, vmf->gfp_mask);

 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..25ab29560dca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 	copy_user_highpage(dst, src, va, vma);
 }

+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+{
+	struct file
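The plumbing pattern of this patch - core code fills a default allocation context into the fault descriptor, and the filesystem's handler may override it before any allocation happens - can be sketched in standalone C. The flag values and struct below are simplified stand-ins invented for illustration, not the real kernel definitions.

```c
#include <stddef.h>

typedef unsigned int gfp_t;

/* Invented flag values, standing in for the kernel's GFP bits. */
#define GFP_IO   0x40u
#define GFP_FS   0x80u
#define GFP_BASE 0x10u   /* stand-in for bits kept in mapping_gfp_mask
                          * (zone/movability restrictions) */

/* Simplified stand-in for struct vm_fault. */
struct vm_fault_sketch {
    unsigned int flags;
    gfp_t gfp_mask;      /* filled by the core, overridable by the handler */
};

/* Core-side default: mapping mask plus IO/FS, because the generic fault
 * path normally holds no fs locks; the mapping's zone bits are preserved. */
gfp_t fault_gfp_mask(gfp_t mapping_gfp)
{
    return mapping_gfp | GFP_IO | GFP_FS;
}

/* A handler that must not recurse into the fs masks GFP_FS back out
 * before calling into the generic code: */
void nofs_fault_handler(struct vm_fault_sketch *vmf)
{
    vmf->gfp_mask &= ~GFP_FS;
}
```

The design point mirrored here is that the override lives in per-fault state rather than in the shared, per-inode mask, so concurrent faults cannot race on it.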
[RFC 0/8] Allow GFP_NOFS allocation to fail
Hi,
small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally not been failing even though their reclaim capabilities are restricted, because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests do not allow triggering the OOM killer, because that would lead to premature OOM killing during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where a GFP_NOFS request loops inside the allocator relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for another task to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the GFP_NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy, because the request never fails.

As the VM cannot do much about these requests, we should face the reality and allow those allocations to fail. Johannes has already posted a patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected, this led to some fallouts. My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs, mainly to trigger fs activity on exit. In parallel there is an fs metadata load (multiple tasks creating thousands of empty files and directories). All is running in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load triggers a sufficient load to invoke direct reclaim even without the memory hog.
The memory hog forks several tasks sharing the VM, and the OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied, none of the 4 filesystems gets aborted transactions and an RO remount (well, xfs didn't need any special treatment).

This is obviously not sufficient to claim that failing GFP_NOFS is OK now, but I think it is a good start for further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might not be the best strategy, but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM, which should help those requests make forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential premature OOM killer invocation from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail, and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for a few release cycles so that we can catch more fallouts.

The rest are the FS-specific patches to fortify allocation requests which are really needed to finish transactions without RO remounts. There might be more needed, but my test case survives with these in place. They would obviously need some rewording if they are going to be applied even without patch 3, and I will do that if the respective maintainers take them. Ext3 and JBD are going away soon so they might be dropped, but they were in the tree while I was testing so I've kept them.

Thoughts? Opinions?
[PATCH 3/3] btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
Remove the chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary; this also cleans up some code in the callers that prepare its value.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66f5a15..c3977ed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2755,9 +2755,7 @@ out:
 	return ret;
 }

-static int btrfs_relocate_chunk(struct btrfs_root *root,
-				u64 chunk_objectid,
-				u64 chunk_offset)
+static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
 {
 	struct btrfs_root *extent_root;
 	struct btrfs_trans_handle *trans;
@@ -2857,7 +2855,6 @@ again:
 		if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
 			ret = btrfs_relocate_chunk(chunk_root,
-						   found_key.objectid,
 						   found_key.offset);
 			if (ret == -ENOSPC)
 				failed++;
@@ -3377,7 +3374,6 @@ again:
 		}

 		ret = btrfs_relocate_chunk(chunk_root,
-					   found_key.objectid,
 					   found_key.offset);
 		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		if (ret && ret != -ENOSPC)
@@ -4079,7 +4075,6 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
 	u64 length;
-	u64 chunk_objectid;
 	u64 chunk_offset;
 	int ret;
 	int slot;
@@ -4156,11 +4151,10 @@ again:
 			break;
 		}

-		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
 		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
 		btrfs_release_path(path);

-		ret = btrfs_relocate_chunk(root, chunk_objectid, chunk_offset);
+		ret = btrfs_relocate_chunk(root, chunk_offset);
 		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
 		if (ret && ret != -ENOSPC)
 			goto done;
--
1.8.5.1
[PATCH 2/3] btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
objectid's initial value is never used in any case, so remove it.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/relocation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1659c94..4698928 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4144,7 +4144,7 @@ struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root;
 	struct btrfs_key key;
-	u64 objectid = BTRFS_FIRST_FREE_OBJECTID;
+	u64 objectid;
 	int err = 0;

 	root = read_fs_root(fs_info, BTRFS_DATA_RELOC_TREE_OBJECTID);
--
1.8.5.1
[PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
Add error handling for get_ref_objectid_v0() in relocate_block_group() to avoid unpredictable results, in particular reading an uninitialized value after this call when the function fails.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/relocation.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 52fe55a..1659c94 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3976,6 +3976,10 @@ restart:
 				   sizeof(struct btrfs_extent_item_v0));
 			ret = get_ref_objectid_v0(rc, path, &key, &ref_owner,
 						  &path_change);
+			if (ret < 0) {
+				err = ret;
+				break;
+			}
 			if (ref_owner < BTRFS_FIRST_FREE_OBJECTID)
 				flags = BTRFS_EXTENT_FLAG_TREE_BLOCK;
 			else
--
1.8.5.1
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2015-08-04 13:36, John Ettedgui wrote:
On Tue, Aug 4, 2015 at 4:28 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:
On 2015-08-04 00:58, John Ettedgui wrote:
On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

Although the best practice is staying away from such a converted fs, either using a pure, newly created btrfs, or converting back to ext* before any balance.

Unfortunately I don't have enough hard drive space to do a clean btrfs, so my only way to use btrfs for that partition was a conversion.

If you could get your hands on a decent sized flash drive (32G or more), you could do an incremental conversion offline. The steps would look something like this:
1. Boot the system into a LiveCD or something similar that doesn't need to run from your regular root partition (SystemRescueCD would be my personal recommendation, although if you go that way, make sure to boot the alternative kernel, as it's a lot newer than the standard ones).
2. Plug in the flash drive, format it as BTRFS.
3. Mount both your old partition and the flash drive somewhere.
4. Start copying files from the old partition to the flash drive.
5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink it down to the minimum size possible, and create a new partition in the free space produced by doing so.
6. Add the new partition to the BTRFS filesystem on the flash drive.
7. Repeat steps 4-6 until you have copied everything.
8. Wipe the old partition, and add it to the BTRFS filesystem.
9. Run a full balance on the new BTRFS filesystem.
10. Delete the partition from step 5 that is closest to the old partition (via btrfs device delete), then resize the old partition to fill the space that the deleted partition took up.
11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS filesystem are the old one and the flash drive.
12. Delete the flash drive from the BTRFS filesystem.
This takes some time and coordination, but it does work reliably as long as you are careful (I've done it before on multiple systems).

I suppose I could do that even without the flash as I have some free space anyway, but moving TBs of data with GBs of free space will take days, plus the repartitioning. It'd probably be easier to start with a 1TB drive or something. Is this currently my best bet, as conversion is not as good as I thought? I believe my other 2 partitions also come from conversion, though I may have rebuilt them later from scratch. Thank you! John

Yeah, you're probably better off getting a TB disk and starting with that. In theory it is possible to automate the process, but I would advise against that if at all possible; it's a lot easier to recover from an error if you're doing it manually.
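The 12-step shuffle above can be condensed into a command outline. This is a non-authoritative, dry-run sketch: every device name, mount point, and path is hypothetical, and the commands are only echoed so nothing touches a real disk.

```shell
#!/bin/sh
# Dry-run outline of the incremental conversion described above.
# /dev/sdx1 (flash drive), /dev/sda2 (old partition), /dev/sda3 (partition
# carved from the shrunk old fs) and the mount points are all placeholders.
run() { echo "+ $*"; }

run mkfs.btrfs /dev/sdx1                    # step 2: format the flash drive
run mount /dev/sdx1 /mnt/new                # step 3: mount the new fs
run mount -o ro /dev/sda2 /mnt/old          #         and the old one
run cp -a /mnt/old/. /mnt/new/              # step 4: copy until ENOSPC
run btrfs device add /dev/sda3 /mnt/new     # step 6: add the carved-out partition
run btrfs balance start /mnt/new            # step 9: full balance after each add
run btrfs device delete /dev/sdx1 /mnt/new  # step 12: finally drop the flash drive
```

The repeat loops (steps 7 and 11) are elided; a real run interleaves copy, shrink, add, and balance passes until only the original disk remains.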
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
On Wed 05-08-15 11:51:20, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Journal transaction might fail prematurely because the frozen_buffer is allocated by GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive to the memory allocator so let's use __GFP_NOFAIL here to emulate the previous behavior. jbd code has the very same issue so let's do the same there as well.

The patch looks good. Btw, patch 6 can be folded into this patch since it fixes the issue you fix for jbd2 here... But jbd parts will be dropped in the next merge window anyway so it doesn't really matter.
You can add: Reviewed-by: Jan Kara j...@suse.com

Honza

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0

-- 
Jan Kara j...@suse.com
SUSE Labs, CR
Re: [RFC 5/8] ext4: Do not fail journal due to block allocator
On Wed 05-08-15 11:51:21, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the memory allocator doesn't endlessly loop to satisfy low-order allocations and instead fails them to allow callers to handle them gracefully. Some of the callers are not yet prepared for this behavior though. The ext4 block allocator relies solely on GFP_NOFS allocation requests and allocation failures lead to aborting the journal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: GW 4.0.0-nofs3-6-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 880005a17708 81538a54 8107a40f
[ 345.028341] 0050 880005a17798 810fe854 00018000
[ 345.028342] 0046 81a52100 0246
[ 345.028343] Call Trace:
[ 345.028348] [81538a54] dump_stack+0x4f/0x7b
[ 345.028370] [810fe854] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [81101bd2] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [810f9933] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [a005bc64] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [a006014f] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [a0054e14] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [a005677e] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [a00342bf] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [a003573c] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [8116d04f] evict+0xa0/0x148
[ 345.028466] [8116dca8] iput+0x1a1/0x1f0
[ 345.028468] [811697b4] __dentry_kill+0x136/0x1a6
[ 345.028470] [81169a3e] dput+0x21a/0x243
[ 345.028472] [81157cda] __fput+0x184/0x19b
[ 345.028473] [81157d29] fput+0xe/0x10
[ 345.028475] [8105a05f] task_work_run+0x8a/0xa1
[ 345.028477] [810452f0] do_exit+0x3c6/0x8dc
[ 345.028482] [8104588a] do_group_exit+0x4d/0xb2
[ 345.028483] [8104eeeb] get_signal+0x5b1/0x5f5
[ 345.028488] [81002202]
do_signal+0x28/0x5d0 [...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is very restricted - especially in fs metadata heavy loads. Before we go with a more sophisticated solution, let's simply imitate the previous behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the buddy block allocator. I wasn't able to trigger the issue with this patch anymore.

The patch looks good.
You can add: Reviewed-by: Jan Kara j...@suse.com

Honza

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page =
Re: BTRFS disaster (of my own making). Is this recoverable?
On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote:

Seems that if there was some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up", maybe some of the recovery tools could actually work.

Just want to reiterate this thought. The basic error in most cases with the tools at hand is that Disc 2 is missing, so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array. If the btrfs tools can't fix it, maybe dd can.

Is there anything that can be gained from the beginning of disc 1 (can dd this to a file) in order to create the necessary bits needed at the beginning of disc 2? Or some other way to overwrite the beginning of disc 2 (using dd again) with some identification information so that the automated btrfs tools can take it from there?

Thanks,

Chris
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 8:31 AM, Sonic sonicsm...@gmail.com wrote: The basic error in most cases with the tools at hand is that Disc 2 is missing so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array. Somehow manually create the missing chunk root if this is the core problem??
Re: [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
On Wed, Aug 05, 2015 at 11:51:24AM +0200, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is allowed to fail which can lead to

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated. Signed-off-by: Michal Hocko mho...@suse.com

Reviewed-by: David Sterba dste...@suse.com
Re: [RFC 7/8] btrfs: Prevent from early transaction abort
On Wed, Aug 05, 2015 at 11:51:23AM +0200, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com ... Fix this by reintroducing the no-fail behavior of this allocation path with the explicit __GFP_NOFAIL. Signed-off-by: Michal Hocko mho...@suse.com

Reviewed-by: David Sterba dste...@suse.com
Re: [PATCH 0/3] Fix for infinite loop on non-empty inode but with no file extent
On Wed, Aug 05, 2015 at 04:03:11PM +0800, Qu Wenruo wrote:

A bug reported by Robert Munteanu: btrfsck infinite loops on an inode with a discount file extent. This patchset includes a fix for printing the file extent hole, a fix for the infinite loop, and a corresponding test case. BTW, thanks a lot to Robert Munteanu for his detailed debug report, which made it super fast to reproduce the error.

Qu Wenruo (3):
  btrfs-progs: fsck: Print correct file hole
  btrfs-progs: fsck: Fix a infinite loop on discount file extent repair
  btrfs-progs: fsck-tests: Add test case for inode lost all its file extent

Applied, thank you both.
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Journal transaction might fail prematurely because the frozen_buffer is allocated by GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive to the memory allocator so let's use __GFP_NOFAIL here to emulate the previous behavior. jbd code has the very same issue so let's do the same there as well.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);

Is this "if (!committed_data) {" check now dead code? I also see other similar suspected dead sites in the rest of the series.
Re: [PATCH] btrfs-progs: Modify confuse error message in scrub
On Wed, Aug 05, 2015 at 04:32:26PM +0800, Zhao Lei wrote:

Scrub outputs the following error message in my test: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success) It is caused by a broken kernel and fs,

In what way is it broken? Can we turn it into tests?

but we need to avoid outputting both error and success in a one-line message as above. This patch modifies the above message to: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

The net effect of the patch is to add ret=.. and errno=.. to the error message, but it also changes a series of ifs to a switch. This belongs in a separate patch.
Re: [PATCH 2/3] btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
On Wed, Aug 05, 2015 at 06:00:03PM +0800, Zhao Lei wrote: objectid's init-value is not used in any case, remove it. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 6:31 AM, Sonic sonicsm...@gmail.com wrote: On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote:

Seems that if there was some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up" maybe some of the recovery tools could actually work.

Just want to reiterate this thought. The basic error in most cases with the tools at hand is that Disc 2 is missing so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array.

Yes, but it was probably uniquely only on that disk, because there's no redundancy for metadata or system chunks. Therefore there's no copy on the other disk to use as a model. The btrfs check command has an option to use other superblocks, so you could try that switch and see if it makes a difference, but it sounds like it's finding backup superblocks automatically. That's the one thing that is pretty much always duplicated on the same disk; for sure the first superblock is munged and would need repair. But there are still other chunks missing... so I don't think it'll help.

If the btrfs tools can't fix it maybe dd can. Is there anything that can be gained from the beginning of disc 1 (can dd this to a file) in order to create the necessary bits needed at the beginning of disc 2?

Not if there's no metadata or system redundancy profile like raid1.

Or some other way to overwrite the beginning of disc 2 (using dd again) with some identification information so that the automated btrfs tools can take it from there?

I think to have a viable reference, you need two disks (virtual or real) and you need to exactly replicate how you got to this two-disk setup to find out what's in those 32MB that might get the file system to mount even if it complains of some corrupt files. That's work that's way beyond my skill level. The tools don't do this right now as far as I'm aware.
You'd be making byte-by-byte insertions to multiple sectors. Tedious. But I can't even guess how many steps it is. It might be 10. It might be 1.

-- 
Chris Murphy
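Whatever manual edits end up being attempted, a sensible first step is to snapshot the region being modified so every change can be rolled back. A hedged sketch: DEV is an assumption, and a scratch file stands in when no block device is supplied, so the commands can be exercised safely.

```shell
#!/bin/sh
# Save the first 32MiB of the damaged disk before experimenting with it.
# DEV is hypothetical; without a real block device, a zero-filled scratch
# file is used instead so nothing real is read.
DEV=${DEV:-}
if [ -z "$DEV" ] || [ ! -b "$DEV" ]; then
    DEV=$(mktemp)
    dd if=/dev/zero of="$DEV" bs=1M count=32 status=none
fi
dd if="$DEV" of=disk2-head.img bs=1M count=32 status=none
ls -l disk2-head.img
```

Restoring is the mirror image (`dd if=disk2-head.img of="$DEV"`), which is what makes byte-level experiments on the header survivable.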
Re: [PATCH 3/3] btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
On Wed, Aug 05, 2015 at 06:00:04PM +0800, Zhao Lei wrote: Remove the chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary; this also cleans up some code in the callers that prepared its value. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com
Re: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
On Wed, Aug 05, 2015 at 06:00:02PM +0800, Zhao Lei wrote: We need error checking code for get_ref_objectid_v0() in relocate_block_group(), to avoid unpredictable results, especially accessing an uninitialized value (when the function fails) after this line. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com

Are there even filesystems with v0 refs?
Re: Why subvolume and not just volume?
On Wed, Aug 05, 2015 at 09:06:40AM +0200, Martin wrote: Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

They could, but this would bring a severe performance drop.

* creating a subvolume implies a transaction commit
* the subvolumes act like a mountpoint boundary, so it needs to resolve the next subvolume root before directory traversal can descend into it

You can try to create a deep hierarchy of directories and then do the same with subvolumes. The difference is too big for practical purposes.
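The suggested comparison can be run directly. An illustrative sketch, assuming MNT points at a scratch btrfs mount you can litter with test entries; on any other filesystem it only prints a notice.

```shell
#!/bin/sh
# Compare creating 100 directories vs 100 subvolumes on a btrfs mount.
# MNT is an assumption; the timing only runs when it really is btrfs.
MNT=${MNT:-/mnt/scratch}
if command -v btrfs >/dev/null 2>&1 && btrfs filesystem df "$MNT" >/dev/null 2>&1; then
    time sh -c 'i=0; while [ $i -lt 100 ]; do mkdir "$0/dir-$i"; i=$((i+1)); done' "$MNT"
    time sh -c 'i=0; while [ $i -lt 100 ]; do btrfs subvolume create "$0/subvol-$i" >/dev/null; i=$((i+1)); done' "$MNT"
else
    echo "skipping: $MNT is not a mounted btrfs filesystem"
fi
```

The per-subvolume transaction commit mentioned above is what dominates the second timing.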
Re: [PATCH 0/6] sysfs-part2 Add seed device representation on the sysfs
On Wed, Jul 08, 2015 at 03:32:48PM +0800, Anand Jain wrote:

This patch adds the support to show seed devices on the btrfs sysfs. This is a revamped version of the previously single patch 6/6, and incorporates David's suggestion to add the seed fsid under the 'seed' kobject. Since this adds new patches, and bringing in the seed kobject needed quite a lot of revamp, I am resetting the patch set version to 1.

Anand Jain (6):
  Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
  Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
  Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
  Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
  Btrfs: rename super_kobj to fsid_kobj
  Btrfs: sysfs: support seed devices in the sysfs layout

Sorry for the late reply, the patches look good. I'm going to prepare a branch for pull into 4.3. Thanks.
Re: RAID1: system stability
On 2015-07-22 07:00, Russell Coker wrote: On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:

OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that things that depend on it can know it's missing or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on.

I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices.

I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and, if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us.

A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data and should not lose data or panic the kernel.

It's debatable however whether the array should go read-only when degraded. MD/DM RAID (at least, AFAIK) and most hardware RAID controllers I've seen will still accept writes to degraded arrays, although there are arguments for forcing it read-only as well. Personally, I think that should be controlled by a mount option, so the sysadmin can decide, as it really is a policy decision.

A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem.
Or some other critical filesystem (there are still people who put /usr and/or /var on separate filesystems). Ideally, I'd love to see some kind of warning from the kernel if a filesystem gets mounted that has the metadata/system profile set to raid0 (and possibly have some of the tools spit out such a warning also).
Re: RAID1: system stability
On Wednesday, 5 August 2015, 13:32:41, Austin S Hemmelgarn wrote:

On 2015-07-22 07:00, Russell Coker wrote: On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:

OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that things that depend on it can know it's missing or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on.

I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices.

I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and, if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us.

The best solution I have ever seen for removable media is with AmigaOS. You remove a disk (or nowadays a USB stick) while it is being written to, and AmigaDOS/AmigaOS pops up a dialog window saying "You MUST insert volume $VOLUMENAME again." And if you did, it just continued writing.

I bet this may be difficult to do for Linux for all devices, as unwritten changes pile up in memory until dirty limits are reached, unless one says "Okay, disk gone, we block all processes writing to it immediately or quite soon", but for removable media I never saw anything with that amount of sanity. There was some GSoC for NetBSD once to implement this, but I don't know whether it's implemented in there now.
For AmigaOS and floppy disks with the back-then filesystem there was just one culprit: if you didn't insert the disk again, it was often broken beyond repair. For journaling or COW filesystems it would just be like any other sudden stop to writes. On Linux with eSATA I saw I can also replug the disk if I didn't yet hit the timeouts in the block layer. After that the disk is gone.

Ciao,
-- 
Martin
Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
On Aug 5, 2015, at 3:51 AM, mho...@kernel.org wrote:

Hi, small GFP_NOFS, like GFP_KERNEL, allocations have traditionally not been failing even though their reclaim capabilities are restricted because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests do not allow to trigger the OOM killer because that would lead to premature OOM killing during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where a GFP_NOFS request is looping inside the allocator relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for another task to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy because the request doesn't fail.

As the VM cannot do much about these requests we should face the reality and allow those allocations to fail. Johannes has already posted the patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected this led to some fallouts. My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs mainly to trigger fs activity on exit. In parallel there is an fs metadata load (multiple tasks creating thousands of empty files and directories). All is running in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load is triggering a sufficient load to invoke the direct reclaim even without the memory hog.
The memory hog forks several tasks sharing the VM and the OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied none of the 4 filesystems gets aborted transactions and RO remount (well, xfs didn't need any special treatment).

This is obviously not sufficient to claim that failing GFP_NOFS is OK now but I think it is a good start for the further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might not be the best strategy but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM which should help those requests to make forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential premature OOM killer invocation from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for a few release cycles so that we can catch more fallouts. The rest are the FS specific patches to fortify allocation requests which are really needed to finish transactions without RO remounts. There might be more needed but my test case survives with these in place.

Wouldn't it make more sense to order the fs-specific patches _before_ the "GFP_NOFS can fail" patch (#3), so that once that patch is applied all known failures have already been fixed? Otherwise it could show test failures during bisection that would be confusing.

Cheers,
Andreas

They would obviously need some rewording if they are going to be applied even without Patch3 and I will do that if the respective maintainers will take them.
Ext3 and JBD are going away soon so they might be dropped but they have been in the tree while I was testing so I've kept them.

Thoughts? Opinions?

Cheers,
Andreas
Re: [PATCH 0/6] sysfs-part2 Add seed device representation on the sysfs
Hi David, Thanks. More below. On 08/06/2015 01:29 AM, David Sterba wrote: On Wed, Jul 08, 2015 at 03:32:48PM +0800, Anand Jain wrote: This patch adds support to show the seed device in btrfs sysfs. This is a revamped version of the previously single patch 6/6, and it incorporates David's suggestion to add the seed fsid under the 'seed' kobject. Since this adds new patches, and bringing in the seed kobject needed quite a lot of revamp, I am resetting the patch set version to 1.

Anand Jain (6):
Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
Btrfs: rename super_kobj to fsid_kobj

these can go in.

Btrfs: sysfs: support seed devices in the sysfs layout

Sorry for the late reply, the patches look good. I'm going to prepare a branch for pull into 4.3.

Thanks. I suggest this can wait: on second thought, I am preparing to conduct a survey to find the most preferred sysfs layout for btrfs, mainly between the less invasive overlay on the existing layout (the current method) and the one that separates FS and volume attributes (the old method). Sorry that I am going back a bit, but I think it's worth it, as these APIs are forever. thanks, Anand
bedup --defrag freezing
Hi, I've been running btrfs on Fedora for a while now, with bedup --defrag running in a night-time cron job. The last few runs seem to have gotten stuck, without the possibility of even killing the process (kill -9 doesn't work) -- all I could do was hard power cycle. Did something change recently? Is bedup simply too out of date? What should I use to de-duplicate across snapshots instead? Etc.? Thanks, Konstantin

# uname -a
Linux mireille.svist.net 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.1
# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
	Total devices 1 FS bytes used 243.43GiB
	devid 1 size 465.76GiB used 318.05GiB path /dev/sda2
btrfs-progs v4.1
# btrfs fi df /
Data, single: total=309.01GiB, used=238.24GiB
System, single: total=32.00MiB, used=64.00KiB
Metadata, single: total=9.01GiB, used=5.19GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

dmesg attached: [0.00] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03 [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Initializing cgroup subsys cpuacct [0.00] Linux version 4.0.8-200.fc21.x86_64 (mockbu...@bkernel02.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #1 SMP Fri Jul 10 21:09:54 UTC 2015 [0.00] Command line: BOOT_IMAGE=/main/boot/vmlinuz-4.0.8-200.fc21.x86_64 root=/dev/sda2 ro rootflags=subvol=main vconsole.font=latarcyrheb-sun16 quiet [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xba14] usable [0.00] BIOS-e820: [mem 0xba15-0xba156fff] ACPI NVS [0.00] BIOS-e820: [mem 0xba157000-0xba94] usable [0.00] BIOS-e820: [mem 0xba95-0xbabedfff] reserved [0.00] BIOS-e820: [mem 0xbabee000-0xcac0afff] usable [0.00] BIOS-e820: [mem 0xcac0b000-0xcb10afff] reserved [0.00] BIOS-e820:
[mem 0xcb10b000-0xcb63dfff] usable [0.00] BIOS-e820: [mem 0xcb63e000-0xcb7aafff] ACPI NVS [0.00] BIOS-e820: [mem 0xcb7ab000-0xcbffefff] reserved [0.00] BIOS-e820: [mem 0xcbfff000-0xcbff] usable [0.00] BIOS-e820: [mem 0xcd00-0xcf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00022fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.8 present. [0.00] DMI: Notebook P15SM-A/SM1-A/P15SM-A/SM1-A, BIOS 4.6.5 03/27/2014 [0.00] e820: update [mem 0x-0x0fff] usable == reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x22fe00 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-C write-protect [0.00] D-E7FFF uncachable [0.00] E8000-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask 7E write-back [0.00] 1 base 02 mask 7FE000 write-back [0.00] 2 base 022000 mask 7FF000 write-back [0.00] 3 base 00E000 mask 7FE000 uncachable [0.00] 4 base 00D000 mask 7FF000 uncachable [0.00] 5 base 00CE00 mask 7FFE00 uncachable [0.00] 6 base 00CD00 mask 7FFF00 uncachable [0.00] 7 base 022FE0 mask 7FFFE0 uncachable [0.00] 8 disabled [0.00] 9 disabled [0.00] PAT configuration [0-7]: WB WC UC- UC WB WC UC- UC [0.00] e820: update [mem 0xcd00-0x] usable == reserved [0.00] e820: last_pfn = 0xcc000 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000fd830-0x000fd83f] mapped at [880fd830] [0.00] Base memory trampoline at [88097000] 97000 size 24576 [0.00] Using
RE: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
Hi, David Sterba -Original Message- From: David Sterba [mailto:dste...@suse.com] Sent: Thursday, August 06, 2015 1:03 AM To: Zhao Lei Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group() On Wed, Aug 05, 2015 at 06:00:02PM +0800, Zhao Lei wrote: We need error-checking code for get_ref_objectid_v0() in relocate_block_group() to avoid an unpredictable result, especially since an uninitialized value is accessed (when the function fails) after this line. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.com Thanks for the review! Are there even filesystems with v0 refs? Rarely, I think. (Just an accidental find while debugging another problem.) But the current code needs to keep handling this correctly until we remove v0 refs support. Thanks, Zhaolei
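The bug class the patch guards against can be shown with a tiny standalone sketch (the function names here are illustrative, not the actual btrfs code):

```c
/* Illustrative sketch, not btrfs code: a callee that fills its
 * out-parameter only on success. Without the ret check below, the caller
 * would read an uninitialized value whenever the lookup fails. */
static int lookup_objectid(int key, unsigned long long *objectid)
{
	if (key < 0)
		return -1;	/* failure: *objectid is left untouched */
	*objectid = (unsigned long long)key + 256;
	return 0;
}

static int relocate(int key)
{
	unsigned long long objectid;
	int ret;

	ret = lookup_objectid(key, &objectid);
	if (ret < 0)		/* the kind of check the patch adds: bail out early */
		return ret;
	return (int)(objectid & 0xff);
}
```

The fix is purely about ordering: the out-parameter is only consumed after the return value has been checked.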
RE: [PATCH] btrfs-progs: add newline to some error messages
Hi, Itoh-san -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Zhao Lei Sent: Thursday, August 06, 2015 11:51 AM To: 'Tsutomu Itoh'; linux-btrfs@vger.kernel.org Subject: RE: [PATCH] btrfs-progs: add newline to some error messages

Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#
# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Sorry, there was a bug in the above script; this new version should give a more exact result than the old one:

# cat cmds-replace.c | tr -d '\n' | grep -o -w 'f\?printf([^;]*);' | sed 's/f\?printf[[:blank:]]*([[:blank:]]*\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
#

Thanks Zhaolei

Thanks Zhaolei

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]
Re: [PATCH] btrfs-progs: add newline to some error messages
On 2015/08/06 12:51, Zhao Lei wrote: Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#

It's OK:

printf("Incompat features: %s", features_buf);
printf("\n");

# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Already fixed by David in the devel branch. Thanks, Tsutomu

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]
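The property the shell pipelines above are probing for - error format strings that do not end in a newline - can also be stated as a one-line C check (illustrative helper, not part of btrfs-progs):

```c
#include <string.h>

/* Illustrative helper, not btrfs-progs code: the invariant the patch
 * enforces is that every error format string ends with '\n', so messages
 * written to stderr are not glued onto the next line of output. */
static int ends_with_newline(const char *fmt)
{
	size_t len = strlen(fmt);

	return len > 0 && fmt[len - 1] == '\n';
}
```

A check like this could, for example, back a unit test or a static scan over the message tables, instead of the ad-hoc grep/sed pipeline.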
RE: [PATCH] btrfs-progs: Modify confuse error message in scrub
Hi, David Sterba Thanks for reviewing this patch. -Original Message- From: David Sterba [mailto:dste...@suse.com] Sent: Thursday, August 06, 2015 12:51 AM To: Zhao Lei Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH] btrfs-progs: Modify confuse error message in scrub On Wed, Aug 05, 2015 at 04:32:26PM +0800, Zhao Lei wrote: Scrub outputs the following error message in my test: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success) It is caused by a broken kernel and fs. In what way is it broken? Can we turn it into tests? It is caused by a custom-made condition of mine, created to debug another problem in the kernel code; I saw the above output in xfstests. It is not a real problem for normal users, so it is not necessary to add a testcase to fstests or the user-land tests. But btrfs-progs should not output such a message in any case; that is what this patch fixes. but we need to avoid outputting both an error and 'Success' in a one-line message as above. This patch modifies the above message to: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success) The net effect of the patch is to add ret=.. and errno=.. to the error message, but it also changes a series of ifs to a switch. This belongs in a separate patch. OK, will send v2. Thanks, Zhaolei
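The contradiction in the original message can be reproduced in a few lines of standalone C (illustrative sketch, not the btrfs-progs code): strerror(0) typically yields "Success", which is how a failed scrub ended up reporting it.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch, not btrfs-progs code: format the v2-style message,
 * which prints the raw ret and errno values alongside strerror(), so that
 * ioctl_errno == 0 ("Success") no longer reads as a contradiction. */
static void format_scrub_error(char *buf, size_t buflen, const char *path,
			       long long devid, int ret, int ioctl_errno)
{
	snprintf(buf, buflen,
		 "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)",
		 path, devid, ret, ioctl_errno, strerror(ioctl_errno));
}
```

With ret and errno spelled out, a zero errno is visibly "the ioctl returned failure but set no errno", instead of an apparent success.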
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 6:45 PM, Paul Jones p...@pauljones.id.au wrote: Would it be possible to store this type of critical information twice on each disk, at the beginning and end? I thought BTRFS already did that, but I might be thinking of some other filesystem. I've had my share of these types of "oops!" moments as well.

That option is the metadata profile raid1. Doing an automatic -mconvert=raid1 when the user does 'btrfs device add' breaks any use case where you want to temporarily add a small device, maybe a USB stick, and now hundreds of MiB, possibly GiB, of metadata have to be copied over to this device without warning. It could be made smart, autoconverting to raid1 when the added device is at least 4x the size of the metadata allocation, but then that makes it inconsistent. OK, so it could be made interactive, but now that breaks scripts. So... where do you draw the line? Maybe this would work if only the system chunk were raid1? I don't know what the minimum necessary information is for such a case. Possibly it makes more sense if 'btrfs device add' always does -dconvert=raid1 unless a --quick option is passed? -- Chris Murphy
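The "at least 4x the size of the metadata allocation" idea can be sketched as a tiny policy helper (hypothetical code, not anything btrfs actually implements; the conversion itself would be something like `btrfs balance start -mconvert=raid1 <mnt>`):

```c
#include <stdbool.h>

/* Hypothetical sketch of the autoconvert heuristic discussed above (btrfs
 * does not implement this): only auto-convert metadata to raid1 when the
 * newly added device is at least 4x the current metadata allocation, so a
 * small temporary device (e.g. a USB stick) never silently triggers a
 * large metadata copy. */
static bool should_autoconvert(unsigned long long new_dev_bytes,
			       unsigned long long metadata_bytes)
{
	return new_dev_bytes >= 4 * metadata_bytes;
}
```

The thread's objection still stands even with such a threshold: any size-based cutoff makes the behaviour of 'btrfs device add' inconsistent from the user's point of view.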
[PATCH] btrfs-progs: add newline to some error messages
Added a missing newline to some error messages.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 btrfs-corrupt-block.c | 2 +-
 cmds-check.c          | 4 ++--
 cmds-send.c           | 4 ++--
 dir-item.c            | 6 +++---
 mkfs.c                | 2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 1a2aa23..ea871f4 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -1010,7 +1010,7 @@ int find_chunk_offset(struct btrfs_root *root,
 		goto out;
 	}
 	if (ret < 0) {
-		fprintf(stderr, "Error searching chunk");
+		fprintf(stderr, "Error searching chunk\n");
 		goto out;
 	}
 out:
diff --git a/cmds-check.c b/cmds-check.c
index dd2fce3..0ddf57c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2398,7 +2398,7 @@ static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
 			     BTRFS_FIRST_FREE_OBJECTID, lost_found_ino,
 			     mode);
 	if (ret < 0) {
-		fprintf(stderr, "Failed to create '%s' dir: %s",
+		fprintf(stderr, "Failed to create '%s' dir: %s\n",
 			dir_name, strerror(-ret));
 		goto out;
 	}
@@ -2426,7 +2426,7 @@ static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
 	}
 	if (ret < 0) {
 		fprintf(stderr,
-			"Failed to link the inode %llu to %s dir: %s",
+			"Failed to link the inode %llu to %s dir: %s\n",
 			rec->ino, dir_name, strerror(-ret));
 		goto out;
 	}
diff --git a/cmds-send.c b/cmds-send.c
index 20bba18..78ee54c 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -192,13 +192,13 @@ static int write_buf(int fd, const void *buf, int size)
 		ret = write(fd, (char*)buf + pos, size - pos);
 		if (ret < 0) {
 			ret = -errno;
-			fprintf(stderr, "ERROR: failed to dump stream. %s",
+			fprintf(stderr, "ERROR: failed to dump stream. %s\n",
 				strerror(-ret));
 			goto out;
 		}
 		if (!ret) {
 			ret = -EIO;
-			fprintf(stderr, "ERROR: failed to dump stream. %s",
+			fprintf(stderr, "ERROR: failed to dump stream. %s\n",
 				strerror(-ret));
 			goto out;
 		}
diff --git a/dir-item.c b/dir-item.c
index a5bf861..f3ad98f 100644
--- a/dir-item.c
+++ b/dir-item.c
@@ -285,7 +285,7 @@ int verify_dir_item(struct btrfs_root *root,
 	u8 type = btrfs_dir_type(leaf, dir_item);
 
 	if (type >= BTRFS_FT_MAX) {
-		fprintf(stderr, "invalid dir item type: %d",
+		fprintf(stderr, "invalid dir item type: %d\n",
 			(int)type);
 		return 1;
 	}
@@ -294,7 +294,7 @@ int verify_dir_item(struct btrfs_root *root,
 		namelen = XATTR_NAME_MAX;
 
 	if (btrfs_dir_name_len(leaf, dir_item) > namelen) {
-		fprintf(stderr, "invalid dir item name len: %u",
+		fprintf(stderr, "invalid dir item name len: %u\n",
 			(unsigned)btrfs_dir_data_len(leaf, dir_item));
 		return 1;
 	}
@@ -302,7 +302,7 @@ int verify_dir_item(struct btrfs_root *root,
 	/* BTRFS_MAX_XATTR_SIZE is the same for all dir items */
 	if ((btrfs_dir_data_len(leaf, dir_item)
 	     + btrfs_dir_name_len(leaf, dir_item)) > BTRFS_MAX_XATTR_SIZE(root)) {
-		fprintf(stderr, "invalid dir item name + data len: %u + %u",
+		fprintf(stderr, "invalid dir item name + data len: %u + %u\n",
 			(unsigned)btrfs_dir_name_len(leaf, dir_item),
 			(unsigned)btrfs_dir_data_len(leaf, dir_item));
 		return 1;
diff --git a/mkfs.c b/mkfs.c
index dafd500..909b591 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -599,7 +599,7 @@ static int add_symbolic_link(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 	if (ret >= sectorsize) {
-		fprintf(stderr, "symlink too long for %s", path_name);
+		fprintf(stderr, "symlink too long for %s\n", path_name);
 		ret = -1;
 		goto fail;
 	}
--
2.4.5

Tsutomu Itoh t-i...@jp.fujitsu.com
[PATCH v2 1/2] btrfs-progs: use switch instead of a series of ifs for output errormsg
A switch statement is more suitable for outputting the corresponding message for each errno.

Suggested-by: David Sterba dste...@suse.com
Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index 7c9318e..a40eecf 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1457,21 +1457,24 @@ static int scrub_start(int argc, char **argv, int resume)
 			++err;
 			continue;
 		}
-		if (sp[i].ret && sp[i].ioctl_errno == ENODEV) {
-			if (do_print)
-				fprintf(stderr, "WARNING: device %lld not "
-					"present\n", devid);
-			continue;
-		}
-		if (sp[i].ret && sp[i].ioctl_errno == ECANCELED) {
-			++err;
-		} else if (sp[i].ret) {
-			if (do_print)
-				fprintf(stderr, "ERROR: scrubbing %s failed "
-					"for device id %lld (%s)\n", path,
-					devid, strerror(sp[i].ioctl_errno));
-			++err;
-			continue;
+		if (sp[i].ret) {
+			switch (sp[i].ioctl_errno) {
+			case ENODEV:
+				if (do_print)
+					fprintf(stderr, "WARNING: device %lld not present\n",
+						devid);
+				continue;
+			case ECANCELED:
+				++err;
+				break;
+			default:
+				if (do_print)
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld (%s)\n",
+						path, devid,
+						strerror(sp[i].ioctl_errno));
+				++err;
+				continue;
+			}
 		}
 		if (sp[i].scrub_args.progress.uncorrectable_errors > 0)
 			e_uncorrectable++;
--
1.8.5.1
[PATCH v2 2/2] btrfs-progs: Modify confuse error message in scrub
Scrub outputs the following error message in my test:

ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success)

It is caused by a broken kernel and fs, but we need to avoid outputting both an error and 'Success' in a one-line message as above. This patch modifies the above message to:

ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index a40eecf..2529956 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1469,8 +1469,9 @@ static int scrub_start(int argc, char **argv, int resume)
 				break;
 			default:
 				if (do_print)
-					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld (%s)\n",
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)\n",
 						path, devid,
+						sp[i].ret, sp[i].ioctl_errno,
 						strerror(sp[i].ioctl_errno));
 				++err;
 				continue;
--
1.8.5.1
RE: BTRFS disaster (of my own making). Is this recoverable?
-Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Chris Murphy Sent: Thursday, 6 August 2015 2:54 AM To: Sonic sonicsm...@gmail.com Cc: Btrfs BTRFS linux-btrfs@vger.kernel.org; Hugo Mills h...@carfax.org.uk Subject: Re: BTRFS disaster (of my own making). Is this recoverable?

On Wed, Aug 5, 2015 at 6:31 AM, Sonic sonicsm...@gmail.com wrote: On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote: It seems that if there were some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up", maybe some of the recovery tools could actually work. Just want to reiterate this thought. The basic error in most cases with the tools at hand is that disc 2 is missing, so there's little the tools can do. Somewhere in those first 32MB there should be something to properly identify the disc as part of the array.

Yes, but it was probably uniquely only on that disk, because there's no redundancy for metadata or system chunks. Therefore there's no copy on the other disk to use as a model. The btrfs check command has an option to use other superblocks, so you could try that switch and see if it makes a difference, but it sounds like it's finding backup superblocks automatically. That's the one thing that is pretty much always duplicated on the same disk; for sure the first superblock is munged and would need repair. But there are still other chunks missing... so I don't think it'll help.

Would it be possible to store this type of critical information twice on each disk, at the beginning and end? I thought BTRFS already did that, but I might be thinking of some other filesystem. I've had my share of these types of "oops!" moments as well. Paul.
RE: [PATCH] btrfs-progs: add newline to some error messages
Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#
# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Thanks Zhaolei

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]