Why subvolume and not just volume?
Hi,

Does anyone know the reason subvolumes are not called just volumes? I mean, the top subvolume is not called a volume, so there is nothing to be "sub" of.

Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why subvolume and not just volume?
Martin wrote on 2015/08/05 09:06 +0200:
> Does anyone know the reason subvolumes are not called just volumes? I mean, the top subvolume is not called a volume, so there is nothing to be sub of.

Because normally a volume refers to a complete filesystem. In that respect a subvolume is not a full filesystem: it still needs a lot of extra data from other trees to build its contents. That's why it carries the "sub" prefix. Although it acts much like a volume, in that it can be mounted like a filesystem, it is still not a full filesystem.

> Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

Yes, a subvolume has its overhead, and with as many subvolumes as there are directories, the overhead would not be small.

The overhead that I can remember is shown below, using an empty tree as an example for its size, with default mkfs options. The '+' after a number means it will increase with snapshots:

1) Empty tree block: 16K
   Of course it takes more as its child files/dirs grow
2) ROOT_ITEM in the tree root: 439 bytes
3) ROOT_BACKREF in the tree root: 22+ bytes
4) Extent backref for the tree block:
   33+ bytes with skinny metadata
   53+ bytes without skinny metadata

Along with other trees, like the log tree: one for each subvolume if fsync is called. Not to mention other run-time overhead.

For example, to search for an inode inside one subvolume, search_slot would be enough to find the INODE_ITEM. But to search for an inode across a subvolume boundary, we need to first find the subvolume boundary and loop until we reach the subvolume containing the inode, then do the above search_slot to locate the INODE_ITEM.

So although the overhead is already small, it is not small enough to make every directory a subvolume.
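To put those numbers in perspective, here is a small back-of-the-envelope calculator (a Python sketch of my own, not part of btrfs; it treats the figures from the list above as fixed minimums, even though several of them grow with snapshots):

```python
def subvolume_overhead(n_subvols, skinny_metadata=True):
    """Lower-bound metadata cost of n_subvols empty subvolumes,
    using the per-subvolume figures quoted in the mail above."""
    tree_block = 16 * 1024          # empty subvolume tree block
    root_item = 439                 # ROOT_ITEM in the tree root
    root_backref = 22               # ROOT_BACKREF in the tree root
    extent_backref = 33 if skinny_metadata else 53
    per_subvol = tree_block + root_item + root_backref + extent_backref
    return n_subvols * per_subvol

# One empty subvolume costs at least ~16.5KB of metadata, so a filesystem
# with a million directories-as-subvolumes would start around ~16GB of
# metadata before storing any file data.
print(subvolume_overhead(1))            # 16878
print(subvolume_overhead(1_000_000))
```

This only models the static on-disk cost; the run-time cost of crossing subvolume boundaries during lookups comes on top.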
Thanks,
Qu
Re: [PATCH v2] fstests: btrfs: Add regression test for reserved space leak.
On Wed, Aug 5, 2015 at 2:08 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

The regression was introduced in v4.2-rc1, with the big btrfs qgroup change.

The problem is that qgroup reserved space is never freed, so even if we increase the limit, we can still hit EDQUOT much faster than we should.

Reported-by: Tsutomu Itoh t-i...@jp.fujitsu.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com

Reviewed-by: Filipe Manana fdman...@suse.com

Thanks!

---
 tests/btrfs/089     | 88 +
 tests/btrfs/089.out |  5 +++
 tests/btrfs/group   |  1 +
 3 files changed, 94 insertions(+)
 create mode 100755 tests/btrfs/089
 create mode 100644 tests/btrfs/089.out

diff --git a/tests/btrfs/089 b/tests/btrfs/089
new file mode 100755
index 000..82db96c
--- /dev/null
+++ b/tests/btrfs/089
@@ -0,0 +1,88 @@
+#! /bin/bash
+# FS QA Test 089
+#
+# Regression test for btrfs qgroup reserved space leak.
+#
+# Due to the qgroup reserved space leak, EDQUOT can be triggered even when
+# it's not over the limit after the previous write.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2015 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+# Use a big blocksize to ensure there is still enough space left
+# for the metadata reserve after hitting EDQUOT
+BLOCKSIZE=$(( 2 * 1024 * 1024 ))
+FILESIZE=$(( 128 * 1024 * 1024 ))	# 128Mbytes
+
+# The last block won't be able to finish its write, as metadata takes
+# $NODESIZE space, causing the last block to trigger EDQUOT
+LENGTH=$(( $FILESIZE - $BLOCKSIZE ))
+
+_scratch_mkfs >> $seqres.full 2>&1
+_scratch_mount
+_require_fs_space $SCRATCH_MNT $(($FILESIZE * 2 / 1024))
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog qgroup limit $FILESIZE 5 $SCRATCH_MNT
+
+$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $LENGTH" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# A sync is needed to trigger a commit_transaction.
+# As the reserved space freeing happens at commit_transaction time,
+# without a transaction commit no reserved space needs freeing and
+# the bug won't be triggered.
+sync
+
+# Double the limit to allow further writes
+_run_btrfs_util_prog qgroup limit $(($FILESIZE * 2)) 5 $SCRATCH_MNT
+
+# Test whether a further write can succeed
+$XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE $LENGTH $LENGTH" \
+	$SCRATCH_MNT/foo | _filter_xfs_io
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/089.out b/tests/btrfs/089.out
new file mode 100644
index 000..396888f
--- /dev/null
+++ b/tests/btrfs/089.out
@@ -0,0 +1,5 @@
+QA output created by 089
+wrote 132120576/132120576 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 132120576/132120576 bytes at offset 132120576
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index ffe18bf..225b532 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -91,6 +91,7 @@
 086 auto quick clone
 087 auto quick send
 088 auto quick metadata
+089 auto quick qgroup
 090 auto quick metadata
 091 auto quick qgroup
 092 auto quick send
--
1.8.3.1

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
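As a quick sanity check of the sizes the test above relies on, the shell `$(( ))` arithmetic can be reproduced directly (my own verification sketch, not part of the patch):

```python
# Reproduce the size arithmetic from tests/btrfs/089 above.
BLOCKSIZE = 2 * 1024 * 1024          # 2MiB write block
FILESIZE = 128 * 1024 * 1024         # 128MiB qgroup limit
LENGTH = FILESIZE - BLOCKSIZE        # first write stops one block short

# The first pwrite covers [0, LENGTH); this matches the byte count
# in the golden output 089.out.
print(LENGTH)    # 132120576

# The second pwrite covers [LENGTH, 2*LENGTH), which must fit under the
# doubled limit; without the reserved-space-leak fix it hits EDQUOT anyway.
print(2 * LENGTH <= 2 * FILESIZE)    # True
```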
[PATCH 1/3] btrfs-progs: fsck: Print correct file hole
If a file has lost all its file extents, fsck is unable to print out the hole. Add an extra check to print the hole range in that case.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 50bb6f3..31ed589 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -616,15 +616,20 @@ static void print_inode_error(struct btrfs_root *root, struct inode_record *rec)
 	if (errors & I_ERR_FILE_EXTENT_DISCOUNT) {
 		struct file_extent_hole *hole;
 		struct rb_node *node;
+		int found = 0;
 
 		node = rb_first(&rec->holes);
 		fprintf(stderr, "Found file extent holes:\n");
 		while (node) {
+			found = 1;
 			hole = rb_entry(node, struct file_extent_hole, node);
-			fprintf(stderr, "\tstart: %llu, len:%llu\n",
+			fprintf(stderr, "\tstart: %llu, len: %llu\n",
 				hole->start, hole->len);
 			node = rb_next(node);
 		}
+		if (!found)
+			fprintf(stderr, "\tstart: 0, len: %llu\n",
+				round_up(rec->isize, root->sectorsize));
 	}
 }
--
2.5.0
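The fallback line relies on the kernel-style round_up() for power-of-two alignments; a minimal model of it (my own sketch, not the btrfs-progs macro) shows what length gets reported for an extent-less inode:

```python
def round_up(x, align):
    """Round x up to a multiple of align; align must be a power of two,
    as a filesystem sectorsize always is."""
    return (x + align - 1) & ~(align - 1)

# An inode with isize 5000 on a 4096-byte sectorsize fs, having no file
# extents at all, gets a single hole reported covering the whole file:
print(round_up(5000, 4096))   # 8192  -> "start: 0, len: 8192"
print(round_up(8192, 4096))   # 8192  (already aligned, unchanged)
```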
[PATCH 0/3] Fix for infinite loop on non-empty inode but with no file extent
A bug reported by Robert Munteanu: btrfsck loops infinitely on an inode with a discount file extent.

This patchset includes a fix for printing the file extent hole, a fix for the infinite loop, and a corresponding test case.

BTW, thanks a lot to Robert Munteanu for his detailed debug report, which made it super fast to reproduce the error.

Qu Wenruo (3):
  btrfs-progs: fsck: Print correct file hole
  btrfs-progs: fsck: Fix an infinite loop on discount file extent repair
  btrfs-progs: fsck-tests: Add test case for an inode that lost all its file extents

 cmds-check.c                                        |  16 +++-
 .../017-missing-all-file-extent/default_case.img.xz | Bin 0 -> 1104 bytes
 2 files changed, 15 insertions(+), 1 deletion(-)
 create mode 100644 tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz
--
2.5.0
[PATCH 2/3] btrfs-progs: fsck: Fix an infinite loop on discount file extent repair
In a special case, the discount file extent repair function will loop forever. The case is: if the file has lost all its file extents, we won't have a hole to fill, so the repair function does nothing, and since I_ERR_FILE_EXTENT_DISCOUNT doesn't disappear, fsck loops infinitely.

For such a case, just punch a hole covering the whole range to fix it.

Reported-by: Robert Munteanu robert.munte...@gmail.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 31ed589..4fa8709 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2665,11 +2665,13 @@ static int repair_inode_discount_extent(struct btrfs_trans_handle *trans,
 {
 	struct rb_node *node;
 	struct file_extent_hole *hole;
+	int found = 0;
 	int ret = 0;
 
 	node = rb_first(&rec->holes);
 	while (node) {
+		found = 1;
 		hole = rb_entry(node, struct file_extent_hole, node);
 		ret = btrfs_punch_hole(trans, root, rec->ino,
 				       hole->start, hole->len);
@@ -2683,6 +2685,13 @@ static int repair_inode_discount_extent(struct btrfs_trans_handle *trans,
 		rec->errors &= ~I_ERR_FILE_EXTENT_DISCOUNT;
 		node = rb_first(&rec->holes);
 	}
+	/* special case for a file losing all its file extent */
+	if (!found) {
+		ret = btrfs_punch_hole(trans, root, rec->ino, 0,
+				       round_up(rec->isize, root->sectorsize));
+		if (ret < 0)
+			goto out;
+	}
 	printf("Fixed discount file extents for inode: %llu in root: %llu\n",
 	       rec->ino, root->objectid);
 out:
--
2.5.0
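The control-flow bug can be seen in miniature with this simulation (a toy model of mine, not the cmds-check.c code): the old repair pass produced no punches for a hole-less inode, so the error flag could never be cleared and fsck re-entered repair forever.

```python
def repair(holes, isize, sectorsize):
    """Model of repair_inode_discount_extent(): returns the list of
    (start, len) punches. A non-empty result means the repair made
    progress and the error flag can be cleared."""
    punched = []
    found = False
    for start, length in holes:
        found = True
        punched.append((start, length))
    if not found:
        # the fix: file lost all its file extents, punch the whole range
        aligned = (isize + sectorsize - 1) // sectorsize * sectorsize
        punched.append((0, aligned))
    return punched

# normal case: an inode with a recorded hole gets exactly that hole punched
print(repair([(0, 4096)], 8192, 4096))   # [(0, 4096)]
# the buggy case: no holes at all -- without the fallback this returned []
# and fsck spun forever; with it, one full-range punch is made
print(repair([], 5000, 4096))            # [(0, 8192)]
```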
[PATCH 3/3] btrfs-progs: fsck-tests: Add test case for an inode that lost all its file extents
Add test case with no file extents, but still non-zero inode size. To test whether fsck will infinite loop. Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- .../017-missing-all-file-extent/default_case.img.xz | Bin 0 - 1104 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz diff --git a/tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz b/tests/fsck-tests/017-missing-all-file-extent/default_case.img.xz new file mode 100644 index ..10cd4c786e1223d1d00eaeab81941b142135d97c GIT binary patch literal 1104 zcmV-W1h4!3H+ooF000E$*0e?f03iVu0001VFXf})lm7${TwRyj;C3^v%$$4d1rE0 zjjaER49m*J$Ny#wkkW^{#)(qVu6jW@U${Z!$pUxwR_2M^H7KU2hve~sxWXa?I z%NmWQokysI=scC%OAq$mZ!a?Xa@3%R}h@8vWy-xl~qOuWtRg-R5!%XR;99;~oq zdk{P`Ryx4e(-g_fYI=#ZDobJ@0`2Gi5Z==2t%tsj$0@PK$0}7L`88Br(*1nkpNK z_XxRVtghToWQcwSZMnTkBh=v`!_jtFZnBcHm1wf5wOm9i*QuB{`*iwt$@={jV)* z%}T2!cPnR{zgSI}F-k1(|ItD#YU*lQ5caPRu;$454g#EIMlP+@`=aT*A@LADD7u zk@V7c2kM}eBPO^Qu;ANX4!roj9d3LYnAteNtD3soCJnDoNp54HZ4ZeX@naau}+a^ zy|#?-`{(08Rl!wH-SY4?wetPi2Qb(f`c45RskYyF)v;1`SlFlxU5!p|~;ics;3^ zBX%=|JUR^Ky(ynC=M7)svk~E4@16W$NYHYzCJ3;@@${p?4~ysw=fq^vz(x(PtElF zwf3WM-8M{A?g(sW#s;uPp5)F}4lhgs%K1q1nFGU`)GD9qPGir}B%j3$a3;A?Gq|tS zaDXnqJ{7yrj8iPECu)i;L_ouU|17}`kpGA2x2A0_=e3S$bZ7H^DTsDQRGOZb@F;q z%#Y6-P@=P|nbod(i7Gd3Lvl9vVzd!)uy)eEr2h(pFvfGR9gMp7SL=Ww6;cvzs zSIkg93uy516A@onIsCrvlQqwJlV~$z#z)3I_GDFsRim|mp%_0{a_wov_H$C4TDq zwn7^-kcft_F)4WPt0OV+n3W`nZ4OC`z1agtXhR}#S0IqeJw$Lsxl4BHy`9hHs zBtPA?RQYVy53bCXwhyaxL;{eBnLK1F~EqXTP1;aSvcM5-y;6(p2G02#C0S{EE=g zW1iR3fbSs^zRyjGTuc(qgYeJO3yRzBq+JtviiuggemrKK5?gVyt|AD#jSXzU4VQ zfop5yezs3WSrm8QE|MhpJH!92~QSYE^Eq1p4jh$Idi{1`RB5jNvfrJ7h*)M7 zoMsbcB?RsO4P-LyY62_cY6mGWv~hlA;J%`X3cfL4vfPWRRidP0rp|UFe_14^R2I z{a1+AOJXzAm=7pkt%~NC1$Zx1R?`MOxJIC_dRjwdc8K9d63f6rzoibNqCnp{u$U z+A~#XVza7Y=CSF*ohfxwwN9we2j*qrOWm@Z+eMws1MlfN_ibWsO9Z^OTT^VoVw{S zPt09il30LPI_!MVdj_{j0W`{}1|)2jXtn*D`NB+JQbn9BfB*m@w2urGUN=kt0jmgr 
Wr~{`ruvAn#Ao{g01X)^bF)Hl literal 0 HcmV?d1
--
2.5.0
Lockup in BTRFS_IOC_CLONE/Kernel 4.2.0-rc5
I can reproduce a hard btrfs lockup (the process issuing the ioctl() is in D-state, same goes for the btrfs-transacti process) on kernel 4.2.0-rc5. I had the same issue on 4.1, so it's unlikely to be a regression introduced in 4.2.

## With the following steps, I can reproduce the problem:

1. Create a new clean btrfs volume for /var/lib/machines

   machinectl set-limit 6G

2. Paste this to /tmp/yum.conf

   [main]
   reposdir=/dev/null
   gpgcheck=0
   logfile=/var/log/yum.log
   installroot=/var/lib/machines/centos7.1-base
   assumeyes=1

   [base]
   name=CentOS 7.1.1503 - x86_64
   baseurl=http://mirror.centos.org/centos/7.1.1503/os/x86_64/
   enabled=1

3. Bootstrap a CentOS 7.1 base image

   /usr/bin/yum -c /tmp/yum.conf groupinstall Base

4. Start an ephemeral systemd-nspawn container based on 'centos7.1-base'

   strace -o /tmp/systemd-nspawn.out -s 500 -f systemd-nspawn -xbD /var/lib/machines/centos7.1-base/

`systemd-nspawn` will now just hang forever. I haven't yet come up with a shorter/more low-level way to reproduce this, as I lack quite a bit of btrfs experience.

## Results:

- Last 'strace' lines:

6095  fchown(16, 0, 0) = 0
6095  fchmod(16, 0755) = 0
6095  utimensat(16, NULL, {{1402362275, 0}, {1438761285, 819041906}}, 0) = 0
6095  flistxattr(15, "", 100) = 0
6095  getdents(15, /* 3 entries */, 32768) = 80
6095  newfstatat(15, "coreutils.mo", {st_mode=S_IFREG|0644, st_size=357263, ...}, AT_SYMLINK_NOFOLLOW) = 0
6095  openat(15, "coreutils.mo", O_RDONLY|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC) = 17
6095  openat(16, "coreutils.mo", O_WRONLY|O_CREAT|O_EXCL|O_NOCTTY|O_NOFOLLOW|O_CLOEXEC, 0644) = 18
6095  fstat(18, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
6095  ioctl(18, BTRFS_IOC_CLONE

- Call trace in the kernel journal:

Aug 05 10:10:03 moria kernel: INFO: task btrfs-transacti:4175 blocked for more than 120 seconds.
Aug 05 10:10:03 moria kernel:       Tainted: G O 4.2.0-rc5 #2
Aug 05 10:10:03 moria kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 10:10:03 moria kernel: btrfs-transacti D 8800b13279f8 0 4175 2 0x00080080
Aug 05 10:10:03 moria kernel:  8800b13279f8 88018fd3a380 8800ab4521c0 0246
Aug 05 10:10:03 moria kernel:  8800b1328000 88018d5c8518 88018debdba0 880232d64990
Aug 05 10:10:03 moria kernel:  0197 8800b1327a18 86999201
Aug 05 10:10:03 moria kernel: Call Trace:
Aug 05 10:10:03 moria kernel:  [86999201] schedule+0x74/0x83
Aug 05 10:10:03 moria kernel:  [863ef8f0] btrfs_tree_lock+0xa7/0x1b7
Aug 05 10:10:03 moria kernel:  [86137ed7] ? wait_woken+0x74/0x74
Aug 05 10:10:03 moria kernel:  [8639d30f] push_leaf_right+0x9a/0x19f
Aug 05 10:10:03 moria kernel:  [8639dd9b] split_leaf+0x100/0x63f
Aug 05 10:10:03 moria kernel:  [86398f09] ? leaf_space_used+0xbb/0xea
Aug 05 10:10:03 moria kernel:  [863efa61] ? btrfs_set_lock_blocking_rw+0x52/0x95
Aug 05 10:10:03 moria kernel:  [8639ea46] btrfs_search_slot+0x76c/0x8b3
Aug 05 10:10:03 moria kernel:  [863a0107] btrfs_insert_empty_items+0x58/0xa3
Aug 05 10:10:03 moria kernel:  [8640805a] btrfs_insert_delayed_items+0x7f/0x3bb
Aug 05 10:10:03 moria kernel:  [8640842e] __btrfs_run_delayed_items+0x98/0x1c0
Aug 05 10:10:03 moria kernel:  [86408739] btrfs_run_delayed_items+0xc/0xe
Aug 05 10:10:03 moria kernel:  [863bdc50] btrfs_commit_transaction+0x298/0xb66
Aug 05 10:10:03 moria kernel:  [863be8d0] ? start_transaction+0x3b2/0x535
Aug 05 10:10:03 moria kernel:  [863b9cd9] transaction_kthread+0x100/0x1d6
Aug 05 10:10:03 moria kernel:  [863b9bd9] ? btrfs_cleanup_transaction+0x49f/0x49f
Aug 05 10:10:03 moria kernel:  [8611eca9] kthread+0xcd/0xd5
Aug 05 10:10:03 moria kernel:  [8611ebdc] ? kthread_create_on_node+0x17d/0x17d
Aug 05 10:10:03 moria kernel:  [8699d29f] ret_from_fork+0x3f/0x70
Aug 05 10:10:03 moria kernel:  [8611ebdc] ? kthread_create_on_node+0x17d/0x17d
Aug 05 10:10:03 moria kernel: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
Aug 05 10:10:03 moria kernel:       Tainted: G O 4.2.0-rc5 #2
Aug 05 10:10:03 moria kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 10:10:03 moria kernel: systemd-nspawn D 88019f3e3668 0 6095 6090 0x00080083
Aug 05 10:10:03 moria kernel:  88019f3e3668 86e5d480 88018fd3a380 0246
Aug 05 10:10:03 moria kernel:  88019f3e4000 88018debdc08 88018fd3a380 88019f3e36b8
Aug 05 10:10:03 moria kernel:  88018fd3a380 88019f3e3688 86999201
Aug 05 10:10:03 moria kernel: Call Trace:
Aug 05 10:10:03 moria kernel:  [86999201] schedule+0x74/0x83
Aug 05 10:10:03 moria kernel:  [863ef64c] btrfs_tree_read_lock+0xc0/0xea
Aug 05 10:10:03 moria kernel:  [86137ed7] ?
[PATCH] btrfs-progs: Modify confusing error message in scrub
Scrub outputs the following error message in my test:
  ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success)

It is caused by a broken kernel and fs, but we need to avoid outputting both an error and "Success" in one message as above.

This patch modifies the above message to:
  ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index 7c9318e..2529956 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1457,21 +1457,25 @@ static int scrub_start(int argc, char **argv, int resume)
 			++err;
 			continue;
 		}
-		if (sp[i].ret && sp[i].ioctl_errno == ENODEV) {
-			if (do_print)
-				fprintf(stderr, "WARNING: device %lld not "
-					"present\n", devid);
-			continue;
-		}
-		if (sp[i].ret && sp[i].ioctl_errno == ECANCELED) {
-			++err;
-		} else if (sp[i].ret) {
-			if (do_print)
-				fprintf(stderr, "ERROR: scrubbing %s failed "
-					"for device id %lld (%s)\n", path,
-					devid, strerror(sp[i].ioctl_errno));
-			++err;
-			continue;
+		if (sp[i].ret) {
+			switch (sp[i].ioctl_errno) {
+			case ENODEV:
+				if (do_print)
+					fprintf(stderr, "WARNING: device %lld not present\n",
+						devid);
+				continue;
+			case ECANCELED:
+				++err;
+				break;
+			default:
+				if (do_print)
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)\n",
+						path, devid,
+						sp[i].ret, sp[i].ioctl_errno,
+						strerror(sp[i].ioctl_errno));
+				++err;
+				continue;
+			}
 		}
 		if (sp[i].scrub_args.progress.uncorrectable_errors > 0)
 			e_uncorrectable++;
--
1.8.5.1
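The self-contradictory "(Success)" comes from passing an errno of 0 to strerror() when the ioctl reported failure through its return value instead. This is easy to demonstrate outside the scrub code (a standalone illustration, not part of the patch; the exact string is the glibc one, so it assumes a glibc-based Linux system):

```python
import os

# strerror(0) is what produced "failed ... (Success)": the ioctl returned
# a nonzero ret while ioctl_errno stayed 0, and the old message printed
# only strerror(ioctl_errno).
print(os.strerror(0))    # 'Success' on glibc

# The new message prints ret and errno explicitly, so a zero errno is
# visible rather than masquerading as a success string:
ret, ioctl_errno = 1, 0
print("ERROR: ... failed for device id 5, ret=%d, errno=%d(%s)"
      % (ret, ioctl_errno, os.strerror(ioctl_errno)))
```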
[PATCH v4 2/4] btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
This reduces currently duplicated code that is similar to scrub_blocked_if_needed() but cannot call it because of small differences. The new helpers are also used by my next patch, which has a similar case.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/scrub.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 94db0fa..cbfb8c7 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -332,11 +332,14 @@ static void __scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 	}
 }
 
-static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
+static void scrub_pause_on(struct btrfs_fs_info *fs_info)
 {
 	atomic_inc(&fs_info->scrubs_paused);
 	wake_up(&fs_info->scrub_pause_wait);
+}
 
+static void scrub_pause_off(struct btrfs_fs_info *fs_info)
+{
 	mutex_lock(&fs_info->scrub_lock);
 	__scrub_blocked_if_needed(fs_info);
 	atomic_dec(&fs_info->scrubs_paused);
@@ -345,6 +348,12 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 	wake_up(&fs_info->scrub_pause_wait);
 }
 
+static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
+{
+	scrub_pause_on(fs_info);
+	scrub_pause_off(fs_info);
+}
+
 /*
  * used for workers that require transaction commits (i.e., for the
  * NOCOW case)
--
1.8.5.1
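The refactor is the usual "split a bracketing function into on/off halves" move: the old function's first half and second half become callable separately, so callers can run work in between. A toy model of mine (not kernel code) of why the split helps:

```python
class ScrubState:
    """Stand-in for the paused-scrub bookkeeping in btrfs_fs_info."""
    def __init__(self):
        self.scrubs_paused = 0
        self.log = []

def scrub_pause_on(s):
    s.scrubs_paused += 1            # atomic_inc(&fs_info->scrubs_paused)
    s.log.append("wake waiters")    # wake_up(&fs_info->scrub_pause_wait)

def scrub_pause_off(s):
    s.log.append("wait if blocked") # stands in for __scrub_blocked_if_needed()
    s.scrubs_paused -= 1

def scrub_blocked_if_needed(s):
    # old behaviour is now just the two halves back to back
    scrub_pause_on(s)
    scrub_pause_off(s)

s = ScrubState()
scrub_blocked_if_needed(s)
assert s.scrubs_paused == 0

# callers like scrub_enumerate_chunks() can now run work while paused,
# which is exactly what patch 4/4 needs for btrfs_inc_block_group_ro():
scrub_pause_on(s)
s.log.append("do work while paused")
scrub_pause_off(s)
assert s.scrubs_paused == 0
```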
[PATCH v4 3/4] btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
Using the newly introduced scrub_pause_on/off() makes this code block cleaner and more readable.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/scrub.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cbfb8c7..a882a34 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3492,8 +3492,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		wait_event(sctx->list_wait,
 			   atomic_read(&sctx->bios_in_flight) == 0);
-		atomic_inc(&fs_info->scrubs_paused);
-		wake_up(&fs_info->scrub_pause_wait);
+
+		scrub_pause_on(fs_info);
 
 		/*
 		 * must be called before we decrease @scrub_paused.
@@ -3504,11 +3504,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   atomic_read(&sctx->workers_pending) == 0);
 		atomic_set(&sctx->wr_ctx.flush_all_writes, 0);
 
-		mutex_lock(&fs_info->scrub_lock);
-		__scrub_blocked_if_needed(fs_info);
-		atomic_dec(&fs_info->scrubs_paused);
-		mutex_unlock(&fs_info->scrub_lock);
-		wake_up(&fs_info->scrub_pause_wait);
+		scrub_pause_off(fs_info);
 
 		btrfs_put_block_group(cache);
 		if (ret)
--
1.8.5.1
[PATCH v4 4/4] btrfs: Fix data checksum error caused by replace with io-load
xfstests btrfs/070 sometimes fails. On my test machine, its failure rate is about 30%. In a vm (vmware), its failure rate is about 50%.

Reason:
btrfs/070 runs replace and defrag with fsstress simultaneously; after the above operations, a checksum error is found by scrub. Actually, it has no relationship with the defrag operation: replace with fsstress alone can trigger this bug.

In debugging, we found that new data written to the target device can be rewritten with old data from the source device by the replace code. To avoid this problem, we can set the target block group to read-only during the replace period, so new data requested by other operations will not be written to the same place the replace code is working on.

Before patch (4.1-rc3): 30% failed in 100 xfstests runs.
After patch: 0% failed in 300 xfstests runs.

It also happened in btrfs/071, as it is another scrub-with-IO-load test.

Reported-by: Qu Wenruo quwen...@cn.fujitsu.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
Changelog v3->v4:
Patch v3 caused xfstests/061 to fail in some cases, because btrfs_inc_block_group_ro() includes a btrfs_end_transaction() operation, which will change data in reloc_ctl->data_inode and cause a deadlock in relocation:

  scrub                          relocate
                                 relocate_file_extent_cluster()
                                   prealloc_file_extent_cluster()
                                   ...
  btrfs_inc_block_group_ro()
    btrfs_wait_for_commit()
                                   insert_reserved_file_extent()
                                     btrfs_set_file_extent_disk_num_bytes()
                                     (modifies reloc_ctl->data_inode)
                                   ...
                                   do_relocation()
                                     get_new_location()
                                       returns -EINVAL (because data_inode's extent changed)
                                     __btrfs_cow_block()
                                       returns -EINVAL (without unlocking eb)
                                     btrfs_search_slot()
                                       deadlock (tries to lock eb again)

Changelog v2->v3:
1: Fix a typo (introduced in a rebase) which made xfstests fail in btrfs/073 and btrfs/066.

Changelog v1->v2:
Nothing for this patch.
---
 fs/btrfs/scrub.c   | 34 +++---
 fs/btrfs/volumes.c |  2 ++
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a882a34..e04436f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3396,7 +3396,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	u64 chunk_tree;
 	u64 chunk_objectid;
 	u64 chunk_offset;
-	int ret;
+	int ret = 0;
 	int slot;
 	struct extent_buffer *l;
 	struct btrfs_key key;
@@ -3424,8 +3424,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (path->slots[0] >=
 		    btrfs_header_nritems(path->nodes[0])) {
 			ret = btrfs_next_leaf(root, path);
-			if (ret)
+			if (ret < 0)
+				break;
+			if (ret > 0) {
+				ret = 0;
 				break;
+			}
+		} else {
+			ret = 0;
 		}
 	}
@@ -3467,6 +3473,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		/*
+		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
+		 * to avoid deadlock caused by:
+		 * btrfs_inc_block_group_ro()
+		 * -> btrfs_wait_for_commit()
+		 * -> btrfs_commit_transaction()
+		 * -> btrfs_scrub_pause()
+		 */
+		scrub_pause_on(fs_info);
+		ret = btrfs_inc_block_group_ro(root, cache);
+		scrub_pause_off(fs_info);
+		if (ret) {
+			btrfs_put_block_group(cache);
+			break;
+		}
+
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;
 		dev_replace->item_needs_writeback = 1;
@@ -3506,6 +3528,8 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		btrfs_dec_block_group_ro(root, cache);
+
 		btrfs_put_block_group(cache);
 		if (ret)
 			break;
@@ -3528,11 +3552,7 @@ skip:
 
 	btrfs_free_path(path);
 
-	/*
-	 * ret can still be 1 from search_slot or next_leaf,
-	 * that's not an error
-	 */
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9b95503..66f5a15 100644
[PATCH v4 0/4] btrfs: Fix data checksum error caused by replace with io-load
This patchset fixes a data checksum error caused by replace with io-load, which makes xfstests btrfs/070 (and 071) fail randomly. See the description in [PATCH 4/4] for details.

Changelog v3->v4:
1: Fix a regression in xfstests/061.
Patch v3 caused xfstests/061 to fail in some cases, because btrfs_inc_block_group_ro() includes a btrfs_end_transaction() operation, which will change data in reloc_ctl->data_inode and cause a deadlock in relocation:

  scrub                          relocate
                                 relocate_file_extent_cluster()
                                   prealloc_file_extent_cluster()
                                   ...
  btrfs_inc_block_group_ro()
    btrfs_wait_for_commit()
                                   insert_reserved_file_extent()
                                     btrfs_set_file_extent_disk_num_bytes()
                                     (modifies reloc_ctl->data_inode)
                                   ...
                                   do_relocation()
                                     get_new_location()
                                       returns -EINVAL (because data_inode's extent changed)
                                     __btrfs_cow_block()
                                       returns -EINVAL (without unlocking eb)
                                     btrfs_search_slot()
                                       deadlock (tries to lock eb again)

Changelog v2->v3:
1: Fix a typo (introduced in a rebase) which made xfstests fail in btrfs/073 and btrfs/066.
2: Rebase on top of integration-4.2.
3: Do full xfstests (generic and btrfs groups with 10 mount options).

Changelog v1->v2:
1: Update the subject to reflect the problem being fixed.
2: Update the description to say why setting read-only fixes the problem.
3: Use a helper function to avoid a duplicated code block for setting the chunk ro.
All of the above were suggested by: David Sterba dste...@suse.cz

Zhao Lei (4):
  btrfs: Use ref_cnt for set_block_group_ro()
  btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off()
  btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks()
  btrfs: Fix data checksum error caused by replace with io-load.
 fs/btrfs/ctree.h       |  6 +++---
 fs/btrfs/extent-tree.c | 42 +++---
 fs/btrfs/relocation.c  | 14 ++---
 fs/btrfs/scrub.c       | 55 --
 fs/btrfs/volumes.c     |  2 ++
 5 files changed, 72 insertions(+), 47 deletions(-)

--
1.8.5.1
[PATCH v4 1/4] btrfs: Use ref_cnt for set_block_group_ro()
More than one place calls set_block_group_ro() and restores rw on failure. The old code used a bool bit to save the block group's ro state, which cannot support the parallel case (confirmed to exist in my debug log).

This patch uses a ref count to store the ro state, and renames set_block_group_ro/set_block_group_rw to inc_block_group_ro/dec_block_group_ro.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/ctree.h       |  6 +++---
 fs/btrfs/extent-tree.c | 42 +-
 fs/btrfs/relocation.c  | 14 ++
 3 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e..f57e6ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1300,7 +1300,7 @@ struct btrfs_block_group_cache {
 	/* for raid56, this is a full stripe, without parity */
 	unsigned long full_stripe_len;
 
-	unsigned int ro:1;
+	unsigned int ro;
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
@@ -3495,9 +3495,9 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 void btrfs_block_rsv_release(struct btrfs_root *root,
 			     struct btrfs_block_rsv *block_rsv,
 			     u64 num_bytes);
-int btrfs_set_block_group_ro(struct btrfs_root *root,
+int btrfs_inc_block_group_ro(struct btrfs_root *root,
 			     struct btrfs_block_group_cache *cache);
-void btrfs_set_block_group_rw(struct btrfs_root *root,
+void btrfs_dec_block_group_ro(struct btrfs_root *root,
 			      struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
 u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1c2bd17..a436bd5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8692,14 +8692,13 @@ static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
 	return flags;
 }
 
-static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
+static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
 	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
-
 	/*
 	 * We need some metadata space and system metadata space for
 	 * allocating chunks in some corner cases until we force to set
@@ -8716,6 +8715,7 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 
 	spin_lock(&cache->lock);
 
 	if (cache->ro) {
+		cache->ro++;
 		ret = 0;
 		goto out;
 	}
@@ -8727,7 +8727,7 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
 	    min_allocable_bytes <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
-		cache->ro = 1;
+		cache->ro++;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
 		ret = 0;
 	}
@@ -8737,7 +8737,7 @@ out:
 	return ret;
 }
 
-int btrfs_set_block_group_ro(struct btrfs_root *root,
+int btrfs_inc_block_group_ro(struct btrfs_root *root,
 			     struct btrfs_block_group_cache *cache)
 {
@@ -8745,8 +8745,6 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 	u64 alloc_flags;
 	int ret;
 
-	BUG_ON(cache->ro);
-
 again:
 	trans = btrfs_join_transaction(root);
 	if (IS_ERR(trans))
@@ -8789,7 +8787,7 @@ again:
 		goto out;
 	}
 
-	ret = set_block_group_ro(cache, 0);
+	ret = inc_block_group_ro(cache, 0);
 	if (!ret)
 		goto out;
 	alloc_flags = get_alloc_profile(root, cache->space_info->flags);
@@ -8797,7 +8795,7 @@ again:
 				     CHUNK_ALLOC_FORCE);
 	if (ret < 0)
 		goto out;
-	ret = set_block_group_ro(cache, 0);
+	ret = inc_block_group_ro(cache, 0);
 out:
 	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 		alloc_flags = update_block_group_flags(root, cache->flags);
@@ -8860,7 +8858,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 	return free_bytes;
 }
 
-void btrfs_set_block_group_rw(struct btrfs_root *root,
+void btrfs_dec_block_group_ro(struct btrfs_root *root,
 			      struct btrfs_block_group_cache *cache)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
@@ -8870,11 +8868,13 @@ void btrfs_set_block_group_rw(struct btrfs_root *root,
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
-	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
-		    cache->bytes_super - btrfs_block_group_used(&cache->item);
-	sinfo->bytes_readonly -= num_bytes;
-
[RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves
From: Michal Hocko mho...@suse.com

__GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement, and as such it also deserves special treatment when the system is OOM. The primary problem is that the allocation request might have come with some locks held and the OOM victim might be blocked on those same locks. This is basically an OOM deadlock.

This patch reduces the risk of such deadlocks by giving __GFP_NOFAIL allocations access to memory reserves after the OOM killer has been invoked. This should help them make progress and release the resources they are holding. The OOM victim should compensate for the reserves consumption.

Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..ee69c338ca2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
+
+		if (gfp_mask & __GFP_NOFAIL) {
+			page = get_page_from_freelist(gfp_mask, order,
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+			WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+					" Consider increasing min_free_kbytes.\n");
+		}
+	}
out:
 	mutex_unlock(&oom_lock);
 	return page;
--
2.5.0
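As a rough userspace analogy of the reserve-dipping pattern the patch adds (this is not kernel code; the names and the static pool are invented for illustration): try the normal allocation path first, and only when it fails fall back to a small pre-set reserve, warning when even the reserve is exhausted.

```c
#include <stdio.h>
#include <stdlib.h>

/* A small static pool standing in for the kernel's memory reserves
 * (ALLOC_NO_WATERMARKS). Memory handed out from it must not be free()d. */
static unsigned char reserve_pool[4096];
static size_t reserve_used;

/* "Must succeed" allocation: normal path first, reserve as a last resort. */
void *alloc_nofail(size_t size)
{
    void *p = malloc(size);          /* normal, watermark-obeying path */
    if (p)
        return p;

    size = (size + 15) & ~(size_t)15; /* keep reserve chunks aligned */
    if (reserve_used + size <= sizeof(reserve_pool)) {
        p = reserve_pool + reserve_used;
        reserve_used += size;
        return p;
    }

    /* Even the reserve is exhausted: warn, as the patch's WARN_ONCE does. */
    fprintf(stderr, "alloc_nofail: reserve exhausted for %zu bytes\n", size);
    return NULL;
}
```

In the kernel the "reserve" is the pages below the watermarks, and the OOM victim is expected to pay the debt back; the sketch only mirrors the control flow, not that accounting.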
[RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
From: Michal Hocko mho...@suse.com

alloc_btrfs_bio relies on GFP_NOFS to allocate a bio, but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" this allocation is allowed to fail, which can lead to:

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable; the nofail behavior should be explicit if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 			 * and the stripes
 			 */
 			sizeof(u64) * (total_stripes),
-			GFP_NOFS);
+			GFP_NOFS|__GFP_NOFAIL);
 	if (!bbio)
 		return NULL;
--
2.5.0
[RFC 7/8] btrfs: Prevent from early transaction abort
From: Michal Hocko mho...@suse.com

Btrfs relies on GFP_NOFS allocations when committing a transaction, but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" those allocations are allowed to fail, which can lead to a premature transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [8154e6f0] dump_stack+0x4f/0x7b
[ 55.330518] [8108fa28] ? console_unlock+0x334/0x363
[ 55.332738] [8110873e] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [81100752] pagecache_get_page+0x10e/0x20c
[ 55.336844] [a007d916] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [a0059d8c] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [a004f728] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [a003fa34] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [a0040567] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [810682d7] ? get_parent_ip+0xe/0x3e
[ 55.349434] [a0041cb2] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [a004ecfb] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [81186808] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [81186833] vfs_fsync+0x1c/0x1e
[ 55.370654] [81186869] do_fsync+0x34/0x4e
[ 55.372246] [81186ab3] SyS_fsync+0x10/0x14
[ 55.373851] [81554f97] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [8154e6f0] dump_stack+0x4f/0x7b
[ 55.384357] [8107f717] ? down_trylock+0x2d/0x37
[ 55.384359] [81046977] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [a00a1d6b] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [81046a34] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [a00a1d6b] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [a004e5f7] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [a004e600] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [81186808] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [81186833] vfs_fsync+0x1c/0x1e
[ 55.384593] [81186869] do_fsync+0x34/0x4e
[ 55.384594] [81186ab3] SyS_fsync+0x10/0x14
[ 55.384595] [81554f97] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path with an explicit __GFP_NOFAIL.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/extent_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..88fad7051e38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;

-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	if (eb == NULL)
 		return NULL;
 	eb->start = start;
@@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;

 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
--
2.5.0
[RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
From: Michal Hocko mho...@suse.com

A journal transaction might fail prematurely because the frozen_buffer is allocated by a GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal, and GFP_NOFS is too restrictive for the memory allocator, so let's use __GFP_NOFAIL here to emulate the previous behavior. The jbd code has the very same issue, so let's do the same there as well.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
--
2.5.0
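The shape of the call-site change above can be sketched in plain, standalone C (hypothetical names, not the jbd API; the retry loop stands in for the kernel's never-fail semantics, where reclaim runs between attempts): the caller asks for never-fail behavior and its whole -ENOMEM error path disappears.

```c
#include <stdlib.h>
#include <string.h>

/* Emulated never-fail allocator: keep retrying until something frees up.
 * In the kernel, __GFP_NOFAIL lets the page allocator do this internally. */
void *buf_alloc_nofail(size_t size)
{
    void *p;
    while ((p = malloc(size)) == NULL)
        ;  /* reclaim/OOM killing would happen here in the kernel */
    return p;
}

/* With never-fail semantics, the "abort the journal on OOM" branch that
 * the patch deletes is simply gone from the caller: */
char *copy_frozen(const char *src, size_t size)
{
    char *frozen = buf_alloc_nofail(size);  /* cannot return NULL */
    memcpy(frozen, src, size);
    return frozen;
}
```

The trade-off is the same one the cover letter discusses: the caller gives up any fallback policy in exchange for never having to unwind.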
[RFC 6/8] ext3: Do not abort journal prematurely
From: Michal Hocko mho...@suse.com

journal_get_undo_access relies on a GFP_NOFS allocation, yet that allocation is essential for the journal transaction:

[ 83.256914] journal_get_undo_access: No memory for committed data
[ 83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access
[ 83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory
[ 83.267130] Aborting journal on device hdb1.
[ 83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal
[ 83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" these allocation requests are allowed to fail, so we need to use __GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..6c60376a29bc 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
--
2.5.0
[RFC 5/8] ext4: Do not fail journal due to block allocator
From: Michal Hocko mho...@suse.com

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the memory allocator doesn't endlessly loop to satisfy low-order allocations and instead fails them to allow callers to handle the failure gracefully. Some of the callers are not yet prepared for this behavior, though. The ext4 block allocator relies solely on GFP_NOFS allocation requests, and allocation failures lead to aborting the journal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G W 4.0.0-nofs3-6-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 880005a17708 81538a54 8107a40f
[ 345.028341] 0050 880005a17798 810fe854 00018000
[ 345.028342] 0046 81a52100 0246
[ 345.028343] Call Trace:
[ 345.028348] [81538a54] dump_stack+0x4f/0x7b
[ 345.028370] [810fe854] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [81101bd2] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [810f9933] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [a005bc64] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [a006014f] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [a0054e14] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [a005677e] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [a00342bf] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [a003573c] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [8116d04f] evict+0xa0/0x148
[ 345.028466] [8116dca8] iput+0x1a1/0x1f0
[ 345.028468] [811697b4] __dentry_kill+0x136/0x1a6
[ 345.028470] [81169a3e] dput+0x21a/0x243
[ 345.028472] [81157cda] __fput+0x184/0x19b
[ 345.028473] [81157d29] fput+0xe/0x10
[ 345.028475] [8105a05f] task_work_run+0x8a/0xa1
[ 345.028477] [810452f0] do_exit+0x3c6/0x8dc
[ 345.028482] [8104588a] do_group_exit+0x4d/0xb2
[ 345.028483] [8104eeeb] get_signal+0x5b1/0x5f5
[ 345.028488] [81002202] do_signal+0x28/0x5d0
[...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is very restricted - especially in fs metadata heavy loads. Before we go with a more sophisticated solution, let's simply imitate the previous behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the buddy block allocator. I wasn't able to trigger the issue with this patch anymore.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,

 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb,
[RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
From: Johannes Weiner han...@cmpxchg.org

GFP_NOFS allocations are not allowed to invoke the OOM killer since their reclaim abilities are severely diminished. However, without the OOM killer available there is no hope of progress once the reclaimable pages have been exhausted. Don't risk hanging these allocations. Leave it to the allocation site to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner han...@cmpxchg.org
Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
 	/* The OOM killer does not compensate for IO-less reclaim */
-	if (!(gfp_mask & __GFP_FS)) {
-		/*
-		 * XXX: Page reclaim didn't yield anything,
-		 * and the OOM killer can't be invoked, but
-		 * keep looping as per tradition.
-		 */
-		*did_some_progress = 1;
+	if (!(gfp_mask & __GFP_FS))
 		goto out;
-	}
 	if (pm_suspended_storage())
 		goto out;
 	/* The OOM killer may not free memory on a specific node */
--
2.5.0
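Once GFP_NOFS requests may fail, the fallback policy moves to the call site, as the commit message says. A minimal userspace sketch of one such policy (all names invented for illustration; real callers might instead retry, use a preallocated buffer, or propagate the error): shrink the request and try again, and only report failure when even a small request cannot be satisfied.

```c
#include <stddef.h>
#include <stdlib.h>

/* Try to allocate `want` bytes; on failure, halve the request and retry
 * down to a 64-byte floor. On success, *got reports the size obtained so
 * the caller can adapt (e.g. process data in smaller batches). */
void *alloc_with_fallback(size_t want, size_t *got)
{
    size_t size = want;

    while (size >= 64) {
        void *p = malloc(size);
        if (p) {
            *got = size;
            return p;
        }
        size /= 2;          /* fallback policy: degrade gracefully */
    }

    *got = 0;
    return NULL;            /* final policy: let the caller fail the op */
}
```

The point of the patch is precisely that such decisions belong to the caller, which knows what failure means for its operation, rather than to the allocator looping "as per tradition".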
[RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation
From: Michal Hocko mho...@suse.com

page_cache_read has historically used page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to prevent fs recursion issues. page_cache_read is, however, not called from the fs layer directly, so it doesn't normally need this protection. ceph and ocfs2, which call filemap_fault from their fault handlers, seem to be OK because they do not take any fs lock before invoking the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is safe from the reclaim-recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim.

The GFP_NOFS protection might even be harmful. There is a push to fail GFP_NOFS allocations rather than loop within the allocator indefinitely with a very limited reclaim ability. Once we start failing those requests, the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and ends up in pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on the NOFS semantics of the stored gfp_mask. The mask is also a property of the inode, so changing it there would even be a layering violation. What we can do instead is push the gfp_mask into struct vm_fault and allow the fs layer to overwrite it should the callback need to be called with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this should normally be safe from the page fault path. Why do we care about mapping_gfp_mask at all then? Because it doesn't hold only reclaim protection flags but might also contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to respect those.
Reported-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
Signed-off-by: Michal Hocko mho...@suse.com
---
 include/linux/mm.h |  4 ++++
 mm/filemap.c       |  9 ++++-----
 mm/memory.c        | 17 +++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..962e37c7cd6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -220,10 +220,14 @@ extern pgprot_t protection_map[16];
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
  * pgoff should be used in favour of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
+	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
diff --git a/mm/filemap.c b/mm/filemap.c
index b63fb81df336..8a16a07bbe02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
 	int ret;

 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;

-		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, vmf->gfp_mask);

 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..25ab29560dca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 	copy_user_highpage(dst, src, va, vma);
 }

+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+{
+	struct file
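The plumbing pattern of this patch - core code fills a default allocation context into the fault descriptor, and the filesystem's handler may override it before any allocation happens - can be sketched in standalone C. The flag values and struct below are simplified stand-ins invented for illustration, not the real kernel definitions.

```c
#include <stddef.h>

typedef unsigned int gfp_t;

/* Invented flag values, standing in for the kernel's GFP bits. */
#define GFP_IO   0x40u
#define GFP_FS   0x80u
#define GFP_BASE 0x10u   /* stand-in for bits kept in mapping_gfp_mask
                          * (zone/movability restrictions) */

/* Simplified stand-in for struct vm_fault. */
struct vm_fault_sketch {
    unsigned int flags;
    gfp_t gfp_mask;      /* filled by the core, overridable by the handler */
};

/* Core-side default: mapping mask plus IO/FS, because the generic fault
 * path normally holds no fs locks; the mapping's zone bits are preserved. */
gfp_t fault_gfp_mask(gfp_t mapping_gfp)
{
    return mapping_gfp | GFP_IO | GFP_FS;
}

/* A handler that must not recurse into the fs masks GFP_FS back out
 * before calling into the generic code: */
void nofs_fault_handler(struct vm_fault_sketch *vmf)
{
    vmf->gfp_mask &= ~GFP_FS;
}
```

The design point mirrored here is that the override lives in per-fault state rather than in the shared, per-inode mask, so concurrent faults cannot race on it.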
[RFC 0/8] Allow GFP_NOFS allocation to fail
Hi,
small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally not been failing even though their reclaim capabilities are restricted, because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests do not allow triggering the OOM killer, because that would lead to premature OOM killing during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where a GFP_NOFS request loops inside the allocator relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for another task to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the GFP_NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy, because the request never fails.

As the VM cannot do much about these requests, we should face the reality and allow those allocations to fail. Johannes has already posted a patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected, this led to some fallouts. My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs, mainly to trigger fs activity on exit. In parallel there is an fs metadata load (multiple tasks creating thousands of empty files and directories). All is running in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load triggers a sufficient load to invoke direct reclaim even without the memory hog.
The memory hog forks several tasks sharing the VM, and the OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied, none of the 4 filesystems gets aborted transactions and an RO remount (well, xfs didn't need any special treatment).

This is obviously not sufficient to claim that failing GFP_NOFS is OK now, but I think it is a good start for further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might not be the best strategy, but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM, which should help those requests make forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential premature OOM killer invocation from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail, and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for a few release cycles so that we can catch more fallouts.

The rest are the FS-specific patches to fortify allocation requests which are really needed to finish transactions without RO remounts. There might be more needed, but my test case survives with these in place. They would obviously need some rewording if they are going to be applied even without patch 3, and I will do that if the respective maintainers take them. Ext3 and JBD are going away soon so they might be dropped, but they were in the tree while I was testing so I've kept them.

Thoughts? Opinions?
[PATCH 3/3] btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
Remove the chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary; this also cleans up some code in the callers that prepare its value.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66f5a15..c3977ed 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2755,9 +2755,7 @@ out:
 	return ret;
 }

-static int btrfs_relocate_chunk(struct btrfs_root *root,
-				u64 chunk_objectid,
-				u64 chunk_offset)
+static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
 {
 	struct btrfs_root *extent_root;
 	struct btrfs_trans_handle *trans;
@@ -2857,7 +2855,6 @@ again:
 		if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
 			ret = btrfs_relocate_chunk(chunk_root,
-						   found_key.objectid,
 						   found_key.offset);
 			if (ret == -ENOSPC)
 				failed++;
@@ -3377,7 +3374,6 @@ again:
 		}

 		ret = btrfs_relocate_chunk(chunk_root,
-					   found_key.objectid,
 					   found_key.offset);
 		mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 		if (ret && ret != -ENOSPC)
@@ -4079,7 +4075,6 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
 	u64 length;
-	u64 chunk_objectid;
 	u64 chunk_offset;
 	int ret;
 	int slot;
@@ -4156,11 +4151,10 @@ again:
 			break;
 		}

-		chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
 		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
 		btrfs_release_path(path);

-		ret = btrfs_relocate_chunk(root, chunk_objectid, chunk_offset);
+		ret = btrfs_relocate_chunk(root, chunk_offset);
 		mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
 		if (ret && ret != -ENOSPC)
 			goto done;
--
1.8.5.1
[PATCH 2/3] btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
objectid's initial value is never used in any case, so remove it.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/relocation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1659c94..4698928 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4144,7 +4144,7 @@ struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root;
 	struct btrfs_key key;
-	u64 objectid = BTRFS_FIRST_FREE_OBJECTID;
+	u64 objectid;
 	int err = 0;

 	root = read_fs_root(fs_info, BTRFS_DATA_RELOC_TREE_OBJECTID);
--
1.8.5.1
[PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
Add error handling for get_ref_objectid_v0() in relocate_block_group() to avoid unpredictable results, in particular reading an uninitialized value after this call when the function fails.

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 fs/btrfs/relocation.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 52fe55a..1659c94 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3976,6 +3976,10 @@ restart:
 				   sizeof(struct btrfs_extent_item_v0));
 			ret = get_ref_objectid_v0(rc, path, &key, &ref_owner,
 						  &path_change);
+			if (ret < 0) {
+				err = ret;
+				break;
+			}
 			if (ref_owner < BTRFS_FIRST_FREE_OBJECTID)
 				flags = BTRFS_EXTENT_FLAG_TREE_BLOCK;
 			else
--
1.8.5.1
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 2015-08-04 13:36, John Ettedgui wrote:
On Tue, Aug 4, 2015 at 4:28 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:
On 2015-08-04 00:58, John Ettedgui wrote:
On Mon, Aug 3, 2015 at 8:01 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

Although the best practice is staying away from such a converted fs, either using a pure, newly created btrfs, or converting back to ext* before any balance.

Unfortunately I don't have enough hard drive space to do a clean btrfs, so my only way to use btrfs for that partition was a conversion.

If you could get your hands on a decent sized flash drive (32G or more), you could do an incremental conversion offline. The steps would look something like this:
1. Boot the system into a LiveCD or something similar that doesn't need to run from your regular root partition (SystemRescueCD would be my personal recommendation, although if you go that way, make sure to boot the alternative kernel, as it's a lot newer than the standard ones).
2. Plug in the flash drive, format it as BTRFS.
3. Mount both your old partition and the flash drive somewhere.
4. Start copying files from the old partition to the flash drive.
5. When you hit ENOSPC on the flash drive, unmount the old partition, shrink it down to the minimum size possible, and create a new partition in the free space produced by doing so.
6. Add the new partition to the BTRFS filesystem on the flash drive.
7. Repeat steps 4-6 until you have copied everything.
8. Wipe the old partition, and add it to the BTRFS filesystem.
9. Run a full balance on the new BTRFS filesystem.
10. Delete the partition from step 5 that is closest to the old partition (via btrfs device delete), then resize the old partition to fill the space that the deleted partition took up.
11. Repeat steps 9-10 until the only remaining partitions in the new BTRFS filesystem are the old one and the flash drive.
12. Delete the flash drive from the BTRFS filesystem.
This takes some time and coordination, but it does work reliably as long as you are careful (I've done it before on multiple systems).

I suppose I could do that even without the flash as I have some free space anyway, but moving TBs of data with GBs of free space will take days, plus the repartitioning. It'd probably be easier to start with a 1TB drive or something. Is this currently my best bet, as conversion is not as good as I thought? I believe my other 2 partitions also come from conversion, though I may have rebuilt them later from scratch. Thank you! John

Yeah, you're probably better off getting a TB disk and starting with that. In theory it is possible to automate the process, but I would advise against that if at all possible; it's a lot easier to recover from an error if you're doing it manually.
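The 12-step shuffle above can be condensed into a command outline. This is a non-authoritative, dry-run sketch: every device name, mount point, and path is hypothetical, and the commands are only echoed so nothing touches a real disk.

```shell
#!/bin/sh
# Dry-run outline of the incremental conversion described above.
# /dev/sdx1 (flash drive), /dev/sda2 (old partition), /dev/sda3 (partition
# carved from the shrunk old fs) and the mount points are all placeholders.
run() { echo "+ $*"; }

run mkfs.btrfs /dev/sdx1                    # step 2: format the flash drive
run mount /dev/sdx1 /mnt/new                # step 3: mount the new fs
run mount -o ro /dev/sda2 /mnt/old          #         and the old one
run cp -a /mnt/old/. /mnt/new/              # step 4: copy until ENOSPC
run btrfs device add /dev/sda3 /mnt/new     # step 6: add the carved-out partition
run btrfs balance start /mnt/new            # step 9: full balance after each add
run btrfs device delete /dev/sdx1 /mnt/new  # step 12: finally drop the flash drive
```

The repeat loops (steps 7 and 11) are elided; a real run interleaves copy, shrink, add, and balance passes until only the original disk remains.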
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
On Wed 05-08-15 11:51:20, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Journal transaction might fail prematurely because the frozen_buffer is allocated by GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive to the memory allocator so let's use __GFP_NOFAIL here to emulate the previous behavior. jbd code has the very same issue so let's do the same there as well.

The patch looks good. Btw, patch 6 can be folded into this patch since it fixes the issue you fix for jbd2 here... But jbd parts will be dropped in the next merge window anyway so it doesn't really matter.
You can add: Reviewed-by: Jan Kara j...@suse.com

Honza

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0

-- 
Jan Kara j...@suse.com
SUSE Labs, CR
Re: [RFC 5/8] ext4: Do not fail journal due to block allocator
On Wed 05-08-15 11:51:21, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the memory allocator doesn't endlessly loop to satisfy low-order allocations and instead fails them to allow callers to handle them gracefully. Some of the callers are not yet prepared for this behavior though. The ext4 block allocator relies solely on GFP_NOFS allocation requests and allocation failures lead to aborting the journal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: GW 4.0.0-nofs3-6-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 880005a17708 81538a54 8107a40f
[ 345.028341] 0050 880005a17798 810fe854 00018000
[ 345.028342] 0046 81a52100 0246
[ 345.028343] Call Trace:
[ 345.028348] [81538a54] dump_stack+0x4f/0x7b
[ 345.028370] [810fe854] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [81101bd2] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [810f9933] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [a005bc64] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [a006014f] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [a0054e14] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [a005677e] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [a00342bf] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [a003573c] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [8116d04f] evict+0xa0/0x148
[ 345.028466] [8116dca8] iput+0x1a1/0x1f0
[ 345.028468] [811697b4] __dentry_kill+0x136/0x1a6
[ 345.028470] [81169a3e] dput+0x21a/0x243
[ 345.028472] [81157cda] __fput+0x184/0x19b
[ 345.028473] [81157d29] fput+0xe/0x10
[ 345.028475] [8105a05f] task_work_run+0x8a/0xa1
[ 345.028477] [810452f0] do_exit+0x3c6/0x8dc
[ 345.028482] [8104588a] do_group_exit+0x4d/0xb2
[ 345.028483] [8104eeeb] get_signal+0x5b1/0x5f5
[ 345.028488] [81002202]
do_signal+0x28/0x5d0 [...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is very restricted - especially in fs metadata heavy loads. Before we go with a more sophisticated solution, let's simply imitate the previous behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the buddy block allocator. I wasn't able to trigger the issue with this patch anymore.

The patch looks good.
You can add: Reviewed-by: Jan Kara j...@suse.com

Honza

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page =
Re: BTRFS disaster (of my own making). Is this recoverable?
On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote:

Seems that if there was some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up", maybe some of the recovery tools could actually work.

Just want to reiterate this thought. The basic error in most cases with the tools at hand is that Disc 2 is missing, so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array. If the btrfs tools can't fix it, maybe dd can.

Is there anything that can be gained from the beginning of disc 1 (can dd this to a file) in order to create the necessary bits needed at the beginning of disc 2? Or some other way to overwrite the beginning of disc 2 (using dd again) with some identification information so that the automated btrfs tools can take it from there?

Thanks,

Chris
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 8:31 AM, Sonic sonicsm...@gmail.com wrote: The basic error in most cases with the tools at hand is that Disc 2 is missing so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array. Somehow manually create the missing chunk root if this is the core problem??
Re: [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
On Wed, Aug 05, 2015 at 11:51:24AM +0200, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is allowed to fail which can lead to

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated. Signed-off-by: Michal Hocko mho...@suse.com

Reviewed-by: David Sterba dste...@suse.com
Re: [RFC 7/8] btrfs: Prevent from early transaction abort
On Wed, Aug 05, 2015 at 11:51:23AM +0200, mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com ... Fix this by reintroducing the no-fail behavior of this allocation path with the explicit __GFP_NOFAIL. Signed-off-by: Michal Hocko mho...@suse.com

Reviewed-by: David Sterba dste...@suse.com
Re: [PATCH 0/3] Fix for infinite loop on non-empty inode but with no file extent
On Wed, Aug 05, 2015 at 04:03:11PM +0800, Qu Wenruo wrote:

A bug reported by Robert Munteanu: btrfsck infinite loops on an inode with a discount file extent. This patchset includes a fix for printing the file extent hole, a fix for the infinite loop, and a corresponding test case. BTW, thanks a lot to Robert Munteanu for his detailed debug report, which made it super fast to reproduce the error.

Qu Wenruo (3):
  btrfs-progs: fsck: Print correct file hole
  btrfs-progs: fsck: Fix a infinite loop on discount file extent repair
  btrfs-progs: fsck-tests: Add test case for inode lost all its file extent

Applied, thank you both.
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
mho...@kernel.org wrote: From: Michal Hocko mho...@suse.com

Journal transaction might fail prematurely because the frozen_buffer is allocated by GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive to the memory allocator so let's use __GFP_NOFAIL here to emulate the previous behavior. jbd code has the very same issue so let's do the same there as well.
Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd_alloc(jh2bh(jh)->b_size,
-						  GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						  GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS);
-			if (!frozen_buffer) {
-				printk(KERN_ERR
-				       "%s: OOM for frozen_buffer\n",
-				       __func__);
-				JBUFFER_TRACE(jh, "oom!");
-				error = -ENOMEM;
-				jbd_lock_bh_state(bh);
-				goto done;
-			}
+						   GFP_NOFS|__GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);

Is this "if (!committed_data) {" check now dead code? I also see other similar suspected dead sites in the rest of the series.
Re: [PATCH] btrfs-progs: Modify confuse error message in scrub
On Wed, Aug 05, 2015 at 04:32:26PM +0800, Zhao Lei wrote:

Scrub outputs the following error message in my test: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success) It is caused by a broken kernel and fs,

In what way is it broken? Can we turn it into tests?

but we need to avoid outputting both error and success in a one-line message as above. This patch modifies the above message to: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

The net effect of the patch is to add ret=.. and errno=.. to the error message, but it also changes a series of ifs to a switch. This belongs in a separate patch.
Re: [PATCH 2/3] btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode()
On Wed, Aug 05, 2015 at 06:00:03PM +0800, Zhao Lei wrote: objectid's init-value is not used in any case, remove it. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 6:31 AM, Sonic sonicsm...@gmail.com wrote: On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote:

Seems that if there was some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up" maybe some of the recovery tools could actually work.

Just want to reiterate this thought. The basic error in most cases with the tools at hand is that Disc 2 is missing so there's little the tools can do. Somewhere in those first 32MB should be something to properly identify the disc as part of the array.

Yes, but it was probably uniquely only on that disk, because there's no redundancy for metadata or system chunks. Therefore there's no copy on the other disk to use as a model. The btrfs check command has an option to use other superblocks, so you could try that switch and see if it makes a difference, but it sounds like it's finding backup superblocks automatically. That's the one thing that is pretty much always duplicated on the same disk; for sure the first superblock is munged and would need repair. But there are still other chunks missing... so I don't think it'll help.

If the btrfs tools can't fix it maybe dd can. Is there anything that can be gained from the beginning of disc 1 (can dd this to a file) in order to create the necessary bits needed at the beginning of disc 2?

Not if there's no metadata or system redundancy profile like raid1.

Or some other way to overwrite the beginning of disc 2 (using dd again) with some identification information so that the automated btrfs tools can take it from there?

I think to have a viable reference, you need two disks (virtual or real) and you need to exactly replicate how you got to this two-disk setup to find out what's in those 32MB that might get the file system to mount even if it complains of some corrupt files. That's work that's way beyond my skill level. The tools don't do this right now as far as I'm aware.
You'd be making byte-by-byte insertions to multiple sectors. Tedious. But I can't even guess how many steps it is. It might be 10. It might be 1.

-- 
Chris Murphy
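Whatever manual edits end up being attempted, a sensible first step is to snapshot the region being modified so every change can be rolled back. A hedged sketch: DEV is an assumption, and a scratch file stands in when no block device is supplied, so the commands can be exercised safely.

```shell
#!/bin/sh
# Save the first 32MiB of the damaged disk before experimenting with it.
# DEV is hypothetical; without a real block device, a zero-filled scratch
# file is used instead so nothing real is read.
DEV=${DEV:-}
if [ -z "$DEV" ] || [ ! -b "$DEV" ]; then
    DEV=$(mktemp)
    dd if=/dev/zero of="$DEV" bs=1M count=32 status=none
fi
dd if="$DEV" of=disk2-head.img bs=1M count=32 status=none
ls -l disk2-head.img
```

Restoring is the mirror image (`dd if=disk2-head.img of="$DEV"`), which is what makes byte-level experiments on the header survivable.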
Re: [PATCH 3/3] btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
On Wed, Aug 05, 2015 at 06:00:04PM +0800, Zhao Lei wrote: Remove the chunk_objectid argument from btrfs_relocate_chunk() because it is not necessary; this also cleans up some code in the callers that prepared its value. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com
Re: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
On Wed, Aug 05, 2015 at 06:00:02PM +0800, Zhao Lei wrote: We need error checking code for get_ref_objectid_v0() in relocate_block_group(), to avoid unpredictable results, especially accessing an uninitialized value (when the function fails) after this line. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com

Reviewed-by: David Sterba dste...@suse.com

Are there even filesystems with v0 refs?
Re: Why subvolume and not just volume?
On Wed, Aug 05, 2015 at 09:06:40AM +0200, Martin wrote: Also, what is the penalty of a subvolume compared to a directory? From a design perspective, couldn't all directories just be subvolumes?

They could, but this would bring a severe performance drop.

* creating a subvolume implies a transaction commit
* the subvolumes act like a mountpoint boundary, so it needs to resolve the next subvolume root before directory traversal can descend into it

You can try to create a deep hierarchy of directories and then do the same with subvolumes. The difference is too big for practical purposes.
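The suggested comparison can be run directly. An illustrative sketch, assuming MNT points at a scratch btrfs mount you can litter with test entries; on any other filesystem it only prints a notice.

```shell
#!/bin/sh
# Compare creating 100 directories vs 100 subvolumes on a btrfs mount.
# MNT is an assumption; the timing only runs when it really is btrfs.
MNT=${MNT:-/mnt/scratch}
if command -v btrfs >/dev/null 2>&1 && btrfs filesystem df "$MNT" >/dev/null 2>&1; then
    time sh -c 'i=0; while [ $i -lt 100 ]; do mkdir "$0/dir-$i"; i=$((i+1)); done' "$MNT"
    time sh -c 'i=0; while [ $i -lt 100 ]; do btrfs subvolume create "$0/subvol-$i" >/dev/null; i=$((i+1)); done' "$MNT"
else
    echo "skipping: $MNT is not a mounted btrfs filesystem"
fi
```

The per-subvolume transaction commit mentioned above is what dominates the second timing.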
Re: [PATCH 0/6] sysfs-part2 Add seed device representation on the sysfs
On Wed, Jul 08, 2015 at 03:32:48PM +0800, Anand Jain wrote:

This patch adds the support to show seed devices on the btrfs sysfs. This is a revamped version of the previously single patch 6/6, and incorporates David's suggestion to add the seed fsid under the 'seed' kobject. Since this adds new patches, and bringing in the seed kobject needed quite a lot of revamp, I am resetting the patch set version to 1.

Anand Jain (6):
  Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
  Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
  Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
  Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
  Btrfs: rename super_kobj to fsid_kobj
  Btrfs: sysfs: support seed devices in the sysfs layout

Sorry for the late reply, the patches look good. I'm going to prepare a branch for pull into 4.3. Thanks.
Re: RAID1: system stability
On 2015-07-22 07:00, Russell Coker wrote: On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:

OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that things that depend on it can know it's missing or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on.

I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices.

I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and, if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us.

A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data and should not lose data or panic the kernel.

It's debatable however whether the array should go read-only when degraded. MD/DM RAID (at least, AFAIK) and most hardware RAID controllers I've seen will still accept writes to degraded arrays, although there are arguments for forcing it read-only as well. Personally, I think that should be controlled by a mount option, so the sysadmin can decide, as it really is a policy decision.

A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem.
Or some other critical filesystem (there are still people who put /usr and/or /var on separate filesystems). Ideally, I'd love to see some kind of warning from the kernel if a filesystem gets mounted that has the metadata/system profile set to raid0 (and possibly have some of the tools spit out such a warning also).
Re: RAID1: system stability
On Wednesday, 5 August 2015, 13:32:41, Austin S Hemmelgarn wrote:

On 2015-07-22 07:00, Russell Coker wrote: On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:

OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that things that depend on it can know it's missing or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on.

I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices.

I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and, if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us.

The best solution I have ever seen for removable media is with AmigaOS. You remove a disk (or nowadays a USB stick) while it is being written to, and AmigaDOS/AmigaOS pops up a dialog window saying "You MUST insert volume $VOLUMENAME again." And if you did, it just continued writing.

I bet this may be difficult to do for Linux for all devices, as unwritten changes pile up in memory until dirty limits are reached, unless one says "Okay, disk gone, we block all processes writing to it immediately or quite soon", but for removable media I never saw anything with that amount of sanity. There was some GSoC for NetBSD once to implement this, but I don't know whether it's implemented in there now.
For AmigaOS and floppy disks with the back-then filesystem there was just one culprit: if you didn't insert the disk again, it was often broken beyond repair. For journaling or COW filesystems it would just be like any other sudden stop to writes. On Linux with eSATA I saw I can also replug the disk if I didn't yet hit the timeouts in the block layer. After that the disk is gone.

Ciao,
-- 
Martin
Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
On Aug 5, 2015, at 3:51 AM, mho...@kernel.org wrote:

Hi, small GFP_NOFS, like GFP_KERNEL, allocations have traditionally not been failing even though their reclaim capabilities are restricted because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests do not allow to trigger the OOM killer because that would lead to premature OOM killing during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where a GFP_NOFS request is looping inside the allocator relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for another task to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy because the request doesn't fail.

As the VM cannot do much about these requests we should face the reality and allow those allocations to fail. Johannes has already posted the patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected this led to some fallouts. My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs mainly to trigger fs activity on exit. In parallel there is an fs metadata load (multiple tasks creating thousands of empty files and directories). All is running in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load is triggering a sufficient load to invoke the direct reclaim even without the memory hog.
The memory hog forks several tasks sharing the VM and the OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied none of the 4 filesystems gets aborted transactions and RO remount (well, xfs didn't need any special treatment).

This is obviously not sufficient to claim that failing GFP_NOFS is OK now but I think it is a good start for the further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might not be the best strategy but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM which should help those requests to make forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential premature OOM killer invocation from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for a few release cycles so that we can catch more fallouts. The rest are the FS specific patches to fortify allocation requests which are really needed to finish transactions without RO remounts. There might be more needed but my test case survives with these in place.

Wouldn't it make more sense to order the fs-specific patches _before_ the "GFP_NOFS can fail" patch (#3), so that once that patch is applied all known failures have already been fixed? Otherwise it could show test failures during bisection that would be confusing.

Cheers,
Andreas

They would obviously need some rewording if they are going to be applied even without Patch3 and I will do that if the respective maintainers will take them.
Ext3 and JBD are going away soon so they might be dropped but they have been in the tree while I was testing so I've kept them.

Thoughts? Opinions?

Cheers,
Andreas
Re: [PATCH 0/6] sysfs-part2 Add seed device representation on the sysfs
Hi David, Thanks. More below. On 08/06/2015 01:29 AM, David Sterba wrote: On Wed, Jul 08, 2015 at 03:32:48PM +0800, Anand Jain wrote: This patch adds support to show the seed device in btrfs sysfs. This is a revamped version of the previously single patch 6/6, and it incorporates David's suggestion to add the seed fsid under the 'seed' kobject. Since this adds new patches, and bringing in the seed kobject needed quite a lot of revamp, I am resetting the patch set version to 1.

Anand Jain (6):
Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted
Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted
Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
Btrfs: rename super_kobj to fsid_kobj

these can go in.

Btrfs: sysfs: support seed devices in the sysfs layout

Sorry for the late reply, the patches look good. I'm going to prepare a branch for pull into 4.3.

Thanks. I suggest this can wait: on second thought, I am preparing to conduct a survey to find the most preferred sysfs layout for btrfs, mainly between the less invasive overlay on the existing layout (the current method) and the one that separates FS and volume attributes (the old method). Sorry that I am going back a bit, but I think it's worth it, as these APIs are forever. thanks, Anand
bedup --defrag freezing
Hi, I've been running btrfs on Fedora for a while now, with bedup --defrag running in a night-time cron job. The last few runs seem to have gotten stuck, without the possibility of even killing the process (kill -9 doesn't work) -- all I could do was hard power cycle. Did something change recently? Is bedup simply too out of date? What should I use to de-duplicate across snapshots instead? Etc.? Thanks, Konstantin

# uname -a
Linux mireille.svist.net 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.1
# btrfs fi show
Label: none  uuid: 5ac56e7d-3d04-4ffa-8160-5a47f46c2939
	Total devices 1 FS bytes used 243.43GiB
	devid 1 size 465.76GiB used 318.05GiB path /dev/sda2
btrfs-progs v4.1
# btrfs fi df /
Data, single: total=309.01GiB, used=238.24GiB
System, single: total=32.00MiB, used=64.00KiB
Metadata, single: total=9.01GiB, used=5.19GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

dmesg attached: [0.00] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03 [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Initializing cgroup subsys cpuacct [0.00] Linux version 4.0.8-200.fc21.x86_64 (mockbu...@bkernel02.phx2.fedoraproject.org) (gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #1 SMP Fri Jul 10 21:09:54 UTC 2015 [0.00] Command line: BOOT_IMAGE=/main/boot/vmlinuz-4.0.8-200.fc21.x86_64 root=/dev/sda2 ro rootflags=subvol=main vconsole.font=latarcyrheb-sun16 quiet [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xba14] usable [0.00] BIOS-e820: [mem 0xba15-0xba156fff] ACPI NVS [0.00] BIOS-e820: [mem 0xba157000-0xba94] usable [0.00] BIOS-e820: [mem 0xba95-0xbabedfff] reserved [0.00] BIOS-e820: [mem 0xbabee000-0xcac0afff] usable [0.00] BIOS-e820: [mem 0xcac0b000-0xcb10afff] reserved [0.00] BIOS-e820:
[mem 0xcb10b000-0xcb63dfff] usable [0.00] BIOS-e820: [mem 0xcb63e000-0xcb7aafff] ACPI NVS [0.00] BIOS-e820: [mem 0xcb7ab000-0xcbffefff] reserved [0.00] BIOS-e820: [mem 0xcbfff000-0xcbff] usable [0.00] BIOS-e820: [mem 0xcd00-0xcf1f] reserved [0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved [0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00022fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.8 present. [0.00] DMI: Notebook P15SM-A/SM1-A/P15SM-A/SM1-A, BIOS 4.6.5 03/27/2014 [0.00] e820: update [mem 0x-0x0fff] usable == reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] e820: last_pfn = 0x22fe00 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-C write-protect [0.00] D-E7FFF uncachable [0.00] E8000-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask 7E write-back [0.00] 1 base 02 mask 7FE000 write-back [0.00] 2 base 022000 mask 7FF000 write-back [0.00] 3 base 00E000 mask 7FE000 uncachable [0.00] 4 base 00D000 mask 7FF000 uncachable [0.00] 5 base 00CE00 mask 7FFE00 uncachable [0.00] 6 base 00CD00 mask 7FFF00 uncachable [0.00] 7 base 022FE0 mask 7FFFE0 uncachable [0.00] 8 disabled [0.00] 9 disabled [0.00] PAT configuration [0-7]: WB WC UC- UC WB WC UC- UC [0.00] e820: update [mem 0xcd00-0x] usable == reserved [0.00] e820: last_pfn = 0xcc000 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000fd830-0x000fd83f] mapped at [880fd830] [0.00] Base memory trampoline at [88097000] 97000 size 24576 [0.00] Using
RE: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group()
Hi, David Sterba -Original Message- From: David Sterba [mailto:dste...@suse.com] Sent: Thursday, August 06, 2015 1:03 AM To: Zhao Lei Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH 1/3] btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group() On Wed, Aug 05, 2015 at 06:00:02PM +0800, Zhao Lei wrote: We need error-checking code for get_ref_objectid_v0() in relocate_block_group() to avoid an unpredictable result, especially since an uninitialized value is accessed (when the function fails) after this line. Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.com Thanks for the review! Are there even filesystems with v0 refs? Rarely, I think. (Just an accidental find while debugging another problem.) But the current code needs to keep handling this correctly until we remove v0 refs support. Thanks, Zhaolei
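The bug class the patch guards against can be shown with a tiny standalone sketch (the function names here are illustrative, not the actual btrfs code):

```c
/* Illustrative sketch, not btrfs code: a callee that fills its
 * out-parameter only on success. Without the ret check below, the caller
 * would read an uninitialized value whenever the lookup fails. */
static int lookup_objectid(int key, unsigned long long *objectid)
{
	if (key < 0)
		return -1;	/* failure: *objectid is left untouched */
	*objectid = (unsigned long long)key + 256;
	return 0;
}

static int relocate(int key)
{
	unsigned long long objectid;
	int ret;

	ret = lookup_objectid(key, &objectid);
	if (ret < 0)		/* the kind of check the patch adds: bail out early */
		return ret;
	return (int)(objectid & 0xff);
}
```

The fix is purely about ordering: the out-parameter is only consumed after the return value has been checked.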
RE: [PATCH] btrfs-progs: add newline to some error messages
Hi, Itoh-san -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Zhao Lei Sent: Thursday, August 06, 2015 11:51 AM To: 'Tsutomu Itoh'; linux-btrfs@vger.kernel.org Subject: RE: [PATCH] btrfs-progs: add newline to some error messages

Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#
# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Sorry, there was a bug in the above script; this new version should give a more exact result than the old one:

# cat cmds-replace.c | tr -d '\n' | grep -o -w 'f\?printf([^;]*);' | sed 's/f\?printf[[:blank:]]*([[:blank:]]*\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
#

Thanks Zhaolei

Thanks Zhaolei

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]
Re: [PATCH] btrfs-progs: add newline to some error messages
On 2015/08/06 12:51, Zhao Lei wrote: Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#

It's OK:

printf("Incompat features: %s", features_buf);
printf("\n");

# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Already fixed by David in the devel branch. Thanks, Tsutomu

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]
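The property the shell pipelines above are probing for - error format strings that do not end in a newline - can also be stated as a one-line C check (illustrative helper, not part of btrfs-progs):

```c
#include <string.h>

/* Illustrative helper, not btrfs-progs code: the invariant the patch
 * enforces is that every error format string ends with '\n', so messages
 * written to stderr are not glued onto the next line of output. */
static int ends_with_newline(const char *fmt)
{
	size_t len = strlen(fmt);

	return len > 0 && fmt[len - 1] == '\n';
}
```

A check like this could, for example, back a unit test or a static scan over the message tables, instead of the ad-hoc grep/sed pipeline.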
RE: [PATCH] btrfs-progs: Modify confuse error message in scrub
Hi, David Sterba Thanks for reviewing this patch. -Original Message- From: David Sterba [mailto:dste...@suse.com] Sent: Thursday, August 06, 2015 12:51 AM To: Zhao Lei Cc: linux-btrfs@vger.kernel.org Subject: Re: [PATCH] btrfs-progs: Modify confuse error message in scrub On Wed, Aug 05, 2015 at 04:32:26PM +0800, Zhao Lei wrote: Scrub outputs the following error message in my test: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success) It is caused by a broken kernel and fs. In what way is it broken? Can we turn it into tests? It is caused by a custom-made condition of mine, created to debug another problem in the kernel code; I saw the above output in xfstests. It is not a real problem for normal users, so it is not necessary to add a testcase to fstests or the user-land tests. But btrfs-progs should not output such a message in any case; that is what this patch fixes. but we need to avoid outputting both an error and 'Success' in a one-line message as above. This patch modifies the above message to: ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success) The net effect of the patch is to add ret=.. and errno=.. to the error message, but it also changes a series of ifs to a switch. This belongs in a separate patch. OK, will send v2. Thanks, Zhaolei
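The contradiction in the original message can be reproduced in a few lines of standalone C (illustrative sketch, not the btrfs-progs code): strerror(0) typically yields "Success", which is how a failed scrub ended up reporting it.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch, not btrfs-progs code: format the v2-style message,
 * which prints the raw ret and errno values alongside strerror(), so that
 * ioctl_errno == 0 ("Success") no longer reads as a contradiction. */
static void format_scrub_error(char *buf, size_t buflen, const char *path,
			       long long devid, int ret, int ioctl_errno)
{
	snprintf(buf, buflen,
		 "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)",
		 path, devid, ret, ioctl_errno, strerror(ioctl_errno));
}
```

With ret and errno spelled out, a zero errno is visibly "the ioctl returned failure but set no errno", instead of an apparent success.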
Re: BTRFS disaster (of my own making). Is this recoverable?
On Wed, Aug 5, 2015 at 6:45 PM, Paul Jones p...@pauljones.id.au wrote: Would it be possible to store this type of critical information twice on each disk, at the beginning and end? I thought BTRFS already did that, but I might be thinking of some other filesystem. I've had my share of these types of "oops!" moments as well.

That option is the metadata profile raid1. Doing an automatic -mconvert=raid1 when the user does 'btrfs device add' breaks any use case where you want to temporarily add a small device, maybe a USB stick, and now hundreds of MiB, possibly GiB, of metadata have to be copied over to this device without warning. It could be made smart, autoconverting to raid1 when the added device is at least 4x the size of the metadata allocation, but then that makes it inconsistent. OK, so it could be made interactive, but now that breaks scripts. So... where do you draw the line? Maybe this would work if only the system chunk were raid1? I don't know what the minimum necessary information is for such a case. Possibly it makes more sense if 'btrfs device add' always does -dconvert=raid1 unless a --quick option is passed? -- Chris Murphy
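The "at least 4x the size of the metadata allocation" idea can be sketched as a tiny policy helper (hypothetical code, not anything btrfs actually implements; the conversion itself would be something like `btrfs balance start -mconvert=raid1 <mnt>`):

```c
#include <stdbool.h>

/* Hypothetical sketch of the autoconvert heuristic discussed above (btrfs
 * does not implement this): only auto-convert metadata to raid1 when the
 * newly added device is at least 4x the current metadata allocation, so a
 * small temporary device (e.g. a USB stick) never silently triggers a
 * large metadata copy. */
static bool should_autoconvert(unsigned long long new_dev_bytes,
			       unsigned long long metadata_bytes)
{
	return new_dev_bytes >= 4 * metadata_bytes;
}
```

The thread's objection still stands even with such a threshold: any size-based cutoff makes the behaviour of 'btrfs device add' inconsistent from the user's point of view.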
[PATCH] btrfs-progs: add newline to some error messages
Added a missing newline to some error messages.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 btrfs-corrupt-block.c | 2 +-
 cmds-check.c          | 4 ++--
 cmds-send.c           | 4 ++--
 dir-item.c            | 6 +++---
 mkfs.c                | 2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index 1a2aa23..ea871f4 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -1010,7 +1010,7 @@ int find_chunk_offset(struct btrfs_root *root,
 		goto out;
 	}
 	if (ret < 0) {
-		fprintf(stderr, "Error searching chunk");
+		fprintf(stderr, "Error searching chunk\n");
 		goto out;
 	}
 out:
diff --git a/cmds-check.c b/cmds-check.c
index dd2fce3..0ddf57c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2398,7 +2398,7 @@ static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
 			     BTRFS_FIRST_FREE_OBJECTID, lost_found_ino,
 			     mode);
 	if (ret < 0) {
-		fprintf(stderr, "Failed to create '%s' dir: %s",
+		fprintf(stderr, "Failed to create '%s' dir: %s\n",
 			dir_name, strerror(-ret));
 		goto out;
 	}
@@ -2426,7 +2426,7 @@ static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
 	}
 	if (ret < 0) {
 		fprintf(stderr,
-			"Failed to link the inode %llu to %s dir: %s",
+			"Failed to link the inode %llu to %s dir: %s\n",
 			rec->ino, dir_name, strerror(-ret));
 		goto out;
 	}
diff --git a/cmds-send.c b/cmds-send.c
index 20bba18..78ee54c 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -192,13 +192,13 @@ static int write_buf(int fd, const void *buf, int size)
 		ret = write(fd, (char*)buf + pos, size - pos);
 		if (ret < 0) {
 			ret = -errno;
-			fprintf(stderr, "ERROR: failed to dump stream. %s",
+			fprintf(stderr, "ERROR: failed to dump stream. %s\n",
 				strerror(-ret));
 			goto out;
 		}
 		if (!ret) {
 			ret = -EIO;
-			fprintf(stderr, "ERROR: failed to dump stream. %s",
+			fprintf(stderr, "ERROR: failed to dump stream. %s\n",
 				strerror(-ret));
 			goto out;
 		}
diff --git a/dir-item.c b/dir-item.c
index a5bf861..f3ad98f 100644
--- a/dir-item.c
+++ b/dir-item.c
@@ -285,7 +285,7 @@ int verify_dir_item(struct btrfs_root *root,
 	u8 type = btrfs_dir_type(leaf, dir_item);
 
 	if (type >= BTRFS_FT_MAX) {
-		fprintf(stderr, "invalid dir item type: %d",
+		fprintf(stderr, "invalid dir item type: %d\n",
 			(int)type);
 		return 1;
 	}
@@ -294,7 +294,7 @@ int verify_dir_item(struct btrfs_root *root,
 		namelen = XATTR_NAME_MAX;
 
 	if (btrfs_dir_name_len(leaf, dir_item) > namelen) {
-		fprintf(stderr, "invalid dir item name len: %u",
+		fprintf(stderr, "invalid dir item name len: %u\n",
 			(unsigned)btrfs_dir_data_len(leaf, dir_item));
 		return 1;
 	}
@@ -302,7 +302,7 @@ int verify_dir_item(struct btrfs_root *root,
 	/* BTRFS_MAX_XATTR_SIZE is the same for all dir items */
 	if ((btrfs_dir_data_len(leaf, dir_item)
 	     + btrfs_dir_name_len(leaf, dir_item)) > BTRFS_MAX_XATTR_SIZE(root)) {
-		fprintf(stderr, "invalid dir item name + data len: %u + %u",
+		fprintf(stderr, "invalid dir item name + data len: %u + %u\n",
 			(unsigned)btrfs_dir_name_len(leaf, dir_item),
 			(unsigned)btrfs_dir_data_len(leaf, dir_item));
 		return 1;
diff --git a/mkfs.c b/mkfs.c
index dafd500..909b591 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -599,7 +599,7 @@ static int add_symbolic_link(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 	if (ret >= sectorsize) {
-		fprintf(stderr, "symlink too long for %s", path_name);
+		fprintf(stderr, "symlink too long for %s\n", path_name);
 		ret = -1;
 		goto fail;
 	}
--
2.4.5

Tsutomu Itoh t-i...@jp.fujitsu.com
[PATCH v2 1/2] btrfs-progs: use switch instead of a series of ifs for output errormsg
A switch statement is more suitable for outputting the corresponding message for each errno.

Suggested-by: David Sterba dste...@suse.com
Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index 7c9318e..a40eecf 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1457,21 +1457,24 @@ static int scrub_start(int argc, char **argv, int resume)
 			++err;
 			continue;
 		}
-		if (sp[i].ret && sp[i].ioctl_errno == ENODEV) {
-			if (do_print)
-				fprintf(stderr, "WARNING: device %lld not "
-					"present\n", devid);
-			continue;
-		}
-		if (sp[i].ret && sp[i].ioctl_errno == ECANCELED) {
-			++err;
-		} else if (sp[i].ret) {
-			if (do_print)
-				fprintf(stderr, "ERROR: scrubbing %s failed "
-					"for device id %lld (%s)\n", path,
-					devid, strerror(sp[i].ioctl_errno));
-			++err;
-			continue;
+		if (sp[i].ret) {
+			switch (sp[i].ioctl_errno) {
+			case ENODEV:
+				if (do_print)
+					fprintf(stderr, "WARNING: device %lld not present\n",
+						devid);
+				continue;
+			case ECANCELED:
+				++err;
+				break;
+			default:
+				if (do_print)
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld (%s)\n",
+						path, devid,
+						strerror(sp[i].ioctl_errno));
+				++err;
+				continue;
+			}
 		}
 		if (sp[i].scrub_args.progress.uncorrectable_errors > 0)
 			e_uncorrectable++;
--
1.8.5.1
[PATCH v2 2/2] btrfs-progs: Modify confuse error message in scrub
Scrub outputs the following error message in my test:

ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5 (Success)

It is caused by a broken kernel and fs, but we need to avoid outputting both an error and 'Success' in a one-line message as above. This patch modifies the above message to:

ERROR: scrubbing /var/ltf/tester/scratch_mnt failed for device id 5, ret=1, errno=0(Success)

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
---
 cmds-scrub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmds-scrub.c b/cmds-scrub.c
index a40eecf..2529956 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1469,8 +1469,9 @@ static int scrub_start(int argc, char **argv, int resume)
 				break;
 			default:
 				if (do_print)
-					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld (%s)\n",
+					fprintf(stderr, "ERROR: scrubbing %s failed for device id %lld, ret=%d, errno=%d(%s)\n",
 						path, devid,
+						sp[i].ret, sp[i].ioctl_errno,
 						strerror(sp[i].ioctl_errno));
 				++err;
 				continue;
--
1.8.5.1
RE: BTRFS disaster (of my own making). Is this recoverable?
-Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Chris Murphy Sent: Thursday, 6 August 2015 2:54 AM To: Sonic sonicsm...@gmail.com Cc: Btrfs BTRFS linux-btrfs@vger.kernel.org; Hugo Mills h...@carfax.org.uk Subject: Re: BTRFS disaster (of my own making). Is this recoverable?

On Wed, Aug 5, 2015 at 6:31 AM, Sonic sonicsm...@gmail.com wrote: On Tue, Aug 4, 2015 at 4:23 PM, Sonic sonicsm...@gmail.com wrote: It seems that if there were some way to edit something in those first overwritten 32MB of disc 2 to say "hey, I'm really here, just a bit screwed up", maybe some of the recovery tools could actually work. Just want to reiterate this thought. The basic error in most cases with the tools at hand is that disc 2 is missing, so there's little the tools can do. Somewhere in those first 32MB there should be something to properly identify the disc as part of the array.

Yes, but it was probably uniquely only on that disk, because there's no redundancy for metadata or system chunks. Therefore there's no copy on the other disk to use as a model. The btrfs check command has an option to use other superblocks, so you could try that switch and see if it makes a difference, but it sounds like it's finding backup superblocks automatically. That's the one thing that is pretty much always duplicated on the same disk; for sure the first superblock is munged and would need repair. But there are still other chunks missing... so I don't think it'll help.

Would it be possible to store this type of critical information twice on each disk, at the beginning and end? I thought BTRFS already did that, but I might be thinking of some other filesystem. I've had my share of these types of "oops!" moments as well. Paul.
RE: [PATCH] btrfs-progs: add newline to some error messages
Hi, Itoh -Original Message- From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Tsutomu Itoh Sent: Thursday, August 06, 2015 11:06 AM To: linux-btrfs@vger.kernel.org Subject: [PATCH] btrfs-progs: add newline to some error messages

Added a missing newline to some error messages.

Good find! It seems more code needs to be fixed, e.g.:

# cat mkfs.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
symlink too long for %s
Incompat features: %s
#
# cat utils.c | tr -d '\n' | grep -o -w 'f\?printf([^(]*);' | sed 's/f\?printf[[:blank:]]*(\(stderr,\|\)[[:blank:]]*\(.*\)[,)].*/\2/g' | grep -v '\\n'
ERROR: DUP for data is allowed only in mixed mode
%s [y/N]: *1
#
*1: This one is not a problem and should be ignored.

Thanks Zhaolei

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
[quoted patch snipped]