Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Andrei Borzenkov
13.06.2016 01:49, Henk Slager wrote:
> On Sun, Jun 12, 2016 at 11:22 PM, Maximilian Böhm  wrote:
>> Hi there, I did something terribly wrong, all blame on me. I wanted to
>> write to an USB stick but /dev/sdc wasn't the stick in this case but
>> an attached HDD with GPT and an 8 TB btrfs partition…
> 
> GPT has a secondary copy at the end of the device, so maybe gdisk can
> reconstruct the primary one at the beginning of the disk; I don't
> know all gdisk commands. 

The kernel should automatically fall back to the secondary GPT if booted
with the gpt=1 parameter. Otherwise, in gdisk it is 'x' for expert mode and
'b' to rebuild the primary GPT from the secondary copy, or 'c' to load the
partition information from the secondary copy (without writing anything).
But note that the kernel will also check for a valid PMBR (unless GPT is
forced with the parameter mentioned above), so you will also need 'x' to
enter expert mode again and 'n' to create a new protective MBR.
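
A possible gdisk session along those lines (a sketch only; the menu letters
can differ between gdisk versions, so follow the on-screen menus and only
write once the printed table looks sane):

$ sudo gdisk /dev/sdc
r    # recovery and transformation menu
b    # use the backup GPT header to rebuild the main one
c    # load the backup partition table
m    # back to the main menu; 'p' prints the recovered table
x    # expert menu
n    # create a new protective MBR
w    # write the table to disk and quit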




Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread Duncan
Henk Slager posted on Sun, 12 Jun 2016 21:03:22 +0200 as excerpted:

> But now that you anyhow have all data on 3x 6TB drives, you could save
> balancing time by just doing btrfs-replace 6TB to 8TB 3x and then for
> the 4th 8TB just add it and let btrfs do the spreading/balancing over
> time by itself.

That's what I'd suggest.  You have all the data on three of the 6 TB 
drives now.  Just replace one at a time to 8 TB drives.  Then add the 4th 
8 TB drive, and then at your option do a final balance at that point, or 
simply let the normal activity take care of it.

Altho if you're doing mostly add, little delete, without a balance you 
may run out of space prematurely, since raid1 requires two drives with 
unallocated space on them to allocate a new chunk (one copy on each of 
the two), and you'll only have ~2 TB free on each of the three, which 
would be used up with ~2 TB still left free on the last added drive...

So at least a partial balance after adding that 4th 8 TB in is probably a 
good idea.  You can leave that last drive with a couple extra free TB 
compared to the others and cancel the balance at that point, and new 
allocations should take it from there, but unless you're going to be 
deleting several TB of stuff as you add, at least doing a few TB worth of 
balance to the new drive to start the process should result in a pretty 
even spread as it fills up the rest of the way.
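
In command form, that plan might look roughly like the following sketch
(device names, the mount point and the limit value are examples to adapt;
note that after each replace the device still has its old size registered,
so a resize is needed before btrfs will use the extra 2 TB):

$ sudo btrfs replace start 1 /dev/sdX /mnt     # repeat for devids 2 and 3
$ sudo btrfs replace status /mnt
$ sudo btrfs filesystem resize 1:max /mnt      # grow each replaced device
$ sudo btrfs device add /dev/sdY /mnt          # the 4th (new) 8 TB drive
$ sudo btrfs balance start -dlimit=2000 /mnt   # move roughly 2 TB of data chunks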

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] btrfs: avoid blocking open_ctree from cleaner_kthread

2016-06-12 Thread Zygo Blaxell
This fixes a problem introduced in commit 
2f3165ecf103599f82bf0ea254039db335fb5005
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more 
subvolumes".

open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.

Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.

When a filesystem is mounted read-only and later remounted read-write,
open_ctree does not set fs_info->open, and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread gets confused by the common case of the "/" filesystem
being mounted read-only and then remounted read-write.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/disk-io.c | 25 ++---
 fs/btrfs/super.c   |  2 ++
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1142127..190a5e0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1806,6 +1806,13 @@ static int cleaner_kthread(void *arg)
if (btrfs_need_cleaner_sleep(root))
goto sleep;
 
+   /*
+* Do not do anything if we might cause open_ctree() to block
+* before we have finished mounting the filesystem.
+*/
+   if (!root->fs_info->open)
+   goto sleep;
+
if (!mutex_trylock(&root->fs_info->cleaner_mutex))
goto sleep;
 
@@ -2520,7 +2527,6 @@ int open_ctree(struct super_block *sb,
int num_backups_tried = 0;
int backup_index = 0;
int max_active;
-   bool cleaner_mutex_locked = false;
 
tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info, 
GFP_KERNEL);
@@ -2999,13 +3005,6 @@ retry_root_backup:
goto fail_sysfs;
}
 
-   /*
-* Hold the cleaner_mutex thread here so that we don't block
-* for a long time on btrfs_recover_relocation.  cleaner_kthread
-* will wait for us to finish mounting the filesystem.
-*/
-   mutex_lock(&fs_info->cleaner_mutex);
-   cleaner_mutex_locked = true;
fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
if (IS_ERR(fs_info->cleaner_kthread))
@@ -3065,8 +3064,10 @@ retry_root_backup:
ret = btrfs_cleanup_fs_roots(fs_info);
if (ret)
goto fail_qgroup;
-   /* We locked cleaner_mutex before creating cleaner_kthread. */
+
+   mutex_lock(&fs_info->cleaner_mutex);
ret = btrfs_recover_relocation(tree_root);
+   mutex_unlock(&fs_info->cleaner_mutex);
if (ret < 0) {
btrfs_warn(fs_info, "failed to recover relocation: %d",
ret);
@@ -3074,8 +3075,6 @@ retry_root_backup:
goto fail_qgroup;
}
}
-   mutex_unlock(&fs_info->cleaner_mutex);
-   cleaner_mutex_locked = false;
 
location.objectid = BTRFS_FS_TREE_OBJECTID;
location.type = BTRFS_ROOT_ITEM_KEY;
@@ -3189,10 +3188,6 @@ fail_cleaner:
filemap_write_and_wait(fs_info->btree_inode->i_mapping);
 
 fail_sysfs:
-   if (cleaner_mutex_locked) {
-   mutex_unlock(&fs_info->cleaner_mutex);
-   cleaner_mutex_locked = false;
-   }
btrfs_sysfs_remove_mounted(fs_info);
 
 fail_fsdev_sysfs:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4339b66..9934519 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1807,6 +1807,8 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
}
}
sb->s_flags &= ~MS_RDONLY;
+
+   fs_info->open = 1;
}
 out:
wake_up_process(fs_info->transaction_kthread);
-- 
2.1.4



Re: recent complete stalls of btrfs (4.6.0-rc4+) -- any advice?

2016-06-12 Thread Yaroslav Halchenko

On Fri, 10 Jun 2016, Chris Murphy wrote:

> > Are those issues something which was fixed since 4.6.0-rc4+ or I should
> > be on look out for them to come back?  What other information should I
> > provide if I run into them again to help you troubleshoot/fix it?

> > P.S. Please CC me the replies


> 4.6.2 is current and it's a lot easier to just use that and see if it
> still happens than for someone to track down whether it's been fixed
> since a six week old RC.

Dear Chris,

Thank you for the reply!  Now running v4.7-rc2-300-g3d0f0b6

The thing is that this issue doesn't happen right away; it takes a
while to develop, and seems to appear only after an intensive load.
So the version I run will always be "X weeks old" if I just keep hopping
to the most recent release of master, and it would be an indefinite goose
chase if left un-analyzed.  That is why I would still appreciate
advice on what specifics to report/attempt if such a crash happens next
time, or maybe someone has an idea of what could have led to
this crash in the first place.

-- 
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Research Scientist,Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik


Re: [PATCH v3] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-12 Thread Eryu Guan
On Mon, Jun 13, 2016 at 10:10:50AM +0800, Lu Fengqi wrote:
> Test if qgroup can handle extent de-reference during reallocation.
> "extent de-reference" means that reducing an extent's reference count
> or freeing an extent.
> Although current qgroup can handle it, we still need to prevent any
> regression which may break current qgroup.
> 
> Signed-off-by: Lu Fengqi 
> ---
>  common/rc   |  4 +--
>  tests/btrfs/028 | 98 
> +
>  tests/btrfs/028.out |  2 ++
>  tests/btrfs/group   |  1 +
>  4 files changed, 103 insertions(+), 2 deletions(-)
>  create mode 100755 tests/btrfs/028
>  create mode 100644 tests/btrfs/028.out
> 
> diff --git a/common/rc b/common/rc
> index 51092a0..650d198 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
>  # stress btrfs by running balance operation in a loop
>  _btrfs_stress_balance()
>  {
> - local btrfs_mnt=$1
> + local options=$@
>   while true; do
> - $BTRFS_UTIL_PROG balance start $btrfs_mnt
> + $BTRFS_UTIL_PROG balance start $options
>   done
>  }
>  
> diff --git a/tests/btrfs/028 b/tests/btrfs/028
> new file mode 100755
> index 000..8cea49a
> --- /dev/null
> +++ b/tests/btrfs/028
> @@ -0,0 +1,98 @@
> +#! /bin/bash
> +# FS QA Test 028
> +#
> +# Test if qgroup can handle extent de-reference during reallocation.
> +# "extent de-reference" means that reducing an extent's reference count
> +# or freeing an extent.
> +# Although current qgroup can handle it, we still need to prevent any
> +# regression which may break current qgroup.
> +#
> +#---
> +# Copyright (c) 2016 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs
> +_scratch_mount
> +
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> +
> +# Increase the probability of generating de-refer extent, and decrease
> +# other.
> +args=`_scale_fsstress_args -z \
> + -f write=10 -f unlink=10 \
> + -f creat=10 -f fsync=10 \
> + -f fsync=10 -n 10 -p 2 \
> + -d $SCRATCH_MNT/stress_dir`
> +echo "Run fsstress $args" >>$seqres.full
> +$FSSTRESS_PROG $args >/dev/null 2>&1 &
> +fsstress_pid=$!
> +
> +echo "Start balance" >>$seqres.full
> +_btrfs_stress_balance -d $SCRATCH_MNT >/dev/null 2>&1 &
> +balance_pid=$!
> +
> +# 30s is enough to trigger bug
> +sleep $((30*$TIME_FACTOR))
> +kill $fsstress_pid $balance_pid
> +wait
> +
> +# kill _btrfs_stress_balance can't end balance, so call btrfs balance cancel
> +# to cancel running or paused balance.
> +$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
> +
> +_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> +
> +_scratch_unmount
> +
> +# generate a qgroup report and look for inconsistent groups
> +$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
> + grep -q -E "Counts for qgroup.*are different"
> +if [ $? -ne 0 ]; then
> + echo "Silence is golden"
> + # success, all done
> + status=0
> +fi

I'm testing with a 4.7-rc1 kernel and btrfs-progs v4.4, and this test fails,
which means btrfs check finds inconsistent qgroups. But according to your
commit log, the current kernel should pass the test. So is the failure
expected?

Also, just grep for the differing qgroup counts and print the message out if
grep finds it, so that the golden image breaks on error and we know
something has really gone wrong. Right now the test fails just because
"Silence is golden" is missing, which makes it unclear why it failed:

 @@ -1,2 +1 @@
  QA output created by 028
 -Silence is golden

Do the following instead:
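
A hedged sketch of the kind of change being asked for, based only on the
paragraph above rather than on the original reply, would be to drop the -q
and the if-block so that any mismatch lands in the test output and breaks
the golden image:

	# let any qgroup mismatch report itself break the golden output
	$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
		grep -E "Counts for qgroup.*are different"

	echo "Silence is golden"
	status=0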


[PATCH v3] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-12 Thread Lu Fengqi
Test if qgroup can handle extent de-reference during reallocation.
"extent de-reference" means that reducing an extent's reference count
or freeing an extent.
Although current qgroup can handle it, we still need to prevent any
regression which may break current qgroup.

Signed-off-by: Lu Fengqi 
---
 common/rc   |  4 +--
 tests/btrfs/028 | 98 +
 tests/btrfs/028.out |  2 ++
 tests/btrfs/group   |  1 +
 4 files changed, 103 insertions(+), 2 deletions(-)
 create mode 100755 tests/btrfs/028
 create mode 100644 tests/btrfs/028.out

diff --git a/common/rc b/common/rc
index 51092a0..650d198 100644
--- a/common/rc
+++ b/common/rc
@@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
 # stress btrfs by running balance operation in a loop
 _btrfs_stress_balance()
 {
-   local btrfs_mnt=$1
+   local options=$@
while true; do
-   $BTRFS_UTIL_PROG balance start $btrfs_mnt
+   $BTRFS_UTIL_PROG balance start $options
done
 }
 
diff --git a/tests/btrfs/028 b/tests/btrfs/028
new file mode 100755
index 000..8cea49a
--- /dev/null
+++ b/tests/btrfs/028
@@ -0,0 +1,98 @@
+#! /bin/bash
+# FS QA Test 028
+#
+# Test if qgroup can handle extent de-reference during reallocation.
+# "extent de-reference" means that reducing an extent's reference count
+# or freeing an extent.
+# Although current qgroup can handle it, we still need to prevent any
+# regression which may break current qgroup.
+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs
+_scratch_mount
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+# Increase the probability of generating de-refer extent, and decrease
+# other.
+args=`_scale_fsstress_args -z \
+   -f write=10 -f unlink=10 \
+   -f creat=10 -f fsync=10 \
+   -f fsync=10 -n 10 -p 2 \
+   -d $SCRATCH_MNT/stress_dir`
+echo "Run fsstress $args" >>$seqres.full
+$FSSTRESS_PROG $args >/dev/null 2>&1 &
+fsstress_pid=$!
+
+echo "Start balance" >>$seqres.full
+_btrfs_stress_balance -d $SCRATCH_MNT >/dev/null 2>&1 &
+balance_pid=$!
+
+# 30s is enough to trigger bug
+sleep $((30*$TIME_FACTOR))
+kill $fsstress_pid $balance_pid
+wait
+
+# kill _btrfs_stress_balance can't end balance, so call btrfs balance cancel
+# to cancel running or paused balance.
+$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+
+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
+
+_scratch_unmount
+
+# generate a qgroup report and look for inconsistent groups
+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
+   grep -q -E "Counts for qgroup.*are different"
+if [ $? -ne 0 ]; then
+   echo "Silence is golden"
+   # success, all done
+   status=0
+fi
+
+exit
diff --git a/tests/btrfs/028.out b/tests/btrfs/028.out
new file mode 100644
index 000..2615f73
--- /dev/null
+++ b/tests/btrfs/028.out
@@ -0,0 +1,2 @@
+QA output created by 028
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index da0e27f..35ecf59 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -30,6 +30,7 @@
 025 auto quick send clone
 026 auto quick compress prealloc
 027 auto replace
+028 auto qgroup balance
 029 auto quick clone
 030 auto quick send
 031 auto quick subvol clone
-- 
2.5.5





Re: [PATCH v2] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-12 Thread Lu Fengqi

At 06/12/2016 12:38 AM, Eryu Guan wrote:

On Wed, Jun 01, 2016 at 02:40:11PM +0800, Lu Fengqi wrote:

Test if qgroup can handle extent de-reference during reallocation.
"extent de-reference" means that reducing an extent's reference count
or freeing an extent.
Although current qgroup can handle it, we still need to prevent any
regression which may break current qgroup.

Signed-off-by: Lu Fengqi 
---
 common/rc   |   4 +--
 tests/btrfs/028 | 102 
 tests/btrfs/028.out |   2 ++
 tests/btrfs/group   |   1 +
 4 files changed, 107 insertions(+), 2 deletions(-)
 create mode 100755 tests/btrfs/028
 create mode 100644 tests/btrfs/028.out

diff --git a/common/rc b/common/rc
index 51092a0..650d198 100644
--- a/common/rc
+++ b/common/rc
@@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
 # stress btrfs by running balance operation in a loop
 _btrfs_stress_balance()
 {
-   local btrfs_mnt=$1
+   local options=$@
while true; do
-   $BTRFS_UTIL_PROG balance start $btrfs_mnt
+   $BTRFS_UTIL_PROG balance start $options
done
 }

diff --git a/tests/btrfs/028 b/tests/btrfs/028
new file mode 100755
index 000..04a3508
--- /dev/null
+++ b/tests/btrfs/028
@@ -0,0 +1,102 @@
+#! /bin/bash
+# FS QA Test 028
+#
+# Test if qgroup can handle extent de-reference during reallocation.
+# "extent de-reference" means that reducing an extent's reference count
+# or freeing an extent.
+# Although current qgroup can handle it, we still need to prevent any
+# regression which may break current qgroup.
+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+# Currently in btrfs the node/leaf size can not be smaller than the page
+# size (but it can be greater than the page size). So use the largest
+# supported node/leaf size (64Kb) so that the test can run on any platform
+# that Linux supports.
+_scratch_mkfs "--nodesize 64k"


I'm not sure if this is necessary, wouldn't mkfs.btrfs pick the proper
node/leaf size according to the platform at mkfs time?

Thanks,
Eryu




Yes, you're right. I checked the output of the `btrfs qgroup show` command in 
the previous version, so I needed a fixed nodesize. Now it is not 
necessary; I will update this patch and cc you.


--
Thanks,
Lu




Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-12 Thread Anand Jain



On 06/03/2016 09:50 AM, Andrew Armenia wrote:

This patch adds mount option 'chunk_width_limit=X', which when set forces
the chunk allocator to use only up to X devices when allocating a chunk.
This may help reduce the seek penalties seen in filesystems with large
numbers of devices.
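
As a usage illustration only (hypothetical; taken from the description
above, with 4 as an example value, and only valid in whatever form the
patch is eventually merged):

$ sudo mount -o chunk_width_limit=4 /dev/sdb /mnt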


Have you reviewed implementations like device allocation grouping?
Some info is in the btrfs project ideas:

https://btrfs.wiki.kernel.org/index.php/Project_ideas
 Chunk allocation groups
 Limits on number of stripes (stripe width)
 Linear chunk allocation mode

(Device allocation grouping is important for enterprise storage solutions).

Thanks, Anand


[PATCH v4] btrfs: fix check_shared for fiemap ioctl

2016-06-12 Thread Lu Fengqi
check_shared only identified an extent as shared in the case of a different
root_id or a different object_id. However, if an extent is referred to by
different offsets of the same file, it should also be identified as shared.
In addition, check_shared's loop scales at least as n^3, so an extent
with too many references can even cause a soft lockup.

First, add all delayed refs to the ref_tree and calculate the unique_refs;
if unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then add the on-disk references (inline/keyed) to the ref_tree one by one
and recalculate unique_refs of the ref_tree, checking whether it
is greater than one. Because we return SHARED as soon as there are two
references, the time complexity is close to constant.

Reported-by: Tsutomu Itoh 
Signed-off-by: Lu Fengqi 
---
 fs/btrfs/backref.c   | 362 +--
 fs/btrfs/extent_io.c |  18 ++-
 2 files changed, 370 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b90cd37..1bf2ff5 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -17,6 +17,7 @@
  */
 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "backref.h"
@@ -34,6 +35,266 @@ struct extent_inode_elem {
struct extent_inode_elem *next;
 };
 
+/*
+ * ref_root is used as the root of the ref tree that holds a collection
+ * of unique references.
+ */
+struct ref_root {
+   struct rb_root rb_root;
+
+   /*
+* The unique_refs represents the number of ref_nodes with a positive
+* count stored in the tree. Even if a ref_node (whose count is greater
+* than one) is added, unique_refs will only increase by one.
+*/
+   unsigned int unique_refs;
+};
+
+/* ref_node is used to store a unique reference to the ref tree. */
+struct ref_node {
+   struct rb_node rb_node;
+
+   /* For NORMAL_REF, otherwise all these fields should be set to 0 */
+   u64 root_id;
+   u64 object_id;
+   u64 offset;
+
+   /* For SHARED_REF, otherwise parent field should be set to 0 */
+   u64 parent;
+
+   /* Ref to the ref_mod of btrfs_delayed_ref_node(delayed-ref.h) */
+   int ref_mod;
+};
+
+/* Dynamically allocate and initialize a ref_root */
+static struct ref_root *ref_root_alloc(void)
+{
+   struct ref_root *ref_tree;
+
+   ref_tree = kmalloc(sizeof(*ref_tree), GFP_NOFS);
+   if (!ref_tree)
+   return NULL;
+
+   ref_tree->rb_root = RB_ROOT;
+   ref_tree->unique_refs = 0;
+
+   return ref_tree;
+}
+
+/* Free all nodes in the ref tree, and reinit ref_root */
+static void ref_root_fini(struct ref_root *ref_tree)
+{
+   struct ref_node *node;
+   struct rb_node *next;
+
+   while ((next = rb_first(&ref_tree->rb_root)) != NULL) {
+   node = rb_entry(next, struct ref_node, rb_node);
+   rb_erase(next, &ref_tree->rb_root);
+   kfree(node);
+   }
+
+   ref_tree->rb_root = RB_ROOT;
+   ref_tree->unique_refs = 0;
+}
+
+/* Free dynamically allocated ref_root */
+static void ref_root_free(struct ref_root *ref_tree)
+{
+   if (!ref_tree)
+   return;
+
+   ref_root_fini(ref_tree);
+   kfree(ref_tree);
+}
+
+/*
+ * Compare ref_node with (root_id, object_id, offset, parent)
+ *
+ * The function compares two ref_nodes a and b. It returns an integer less
+ * than, equal to, or greater than zero if a is found, respectively, to be
+ * less than, to equal, or to be greater than b.
+ */
+static int ref_node_cmp(struct ref_node *a, struct ref_node *b)
+{
+   if (a->root_id < b->root_id)
+   return -1;
+   else if (a->root_id > b->root_id)
+   return 1;
+
+   if (a->object_id < b->object_id)
+   return -1;
+   else if (a->object_id > b->object_id)
+   return 1;
+
+   if (a->offset < b->offset)
+   return -1;
+   else if (a->offset > b->offset)
+   return 1;
+
+   if (a->parent < b->parent)
+   return -1;
+   else if (a->parent > b->parent)
+   return 1;
+
+   return 0;
+}
+
+/*
+ * Search ref_node with (root_id, object_id, offset, parent) in the tree
+ *
+ * if found, the pointer of the ref_node will be returned;
+ * if not found, NULL will be returned and pos will point to the rb_node for
+ * insert, and pos_parent will point to pos's parent for insert.
+ */
+static struct ref_node *__ref_tree_search(struct ref_root *ref_tree,
+ struct rb_node ***pos,
+ struct rb_node **pos_parent,
+ u64 root_id, u64 object_id,
+ u64 offset, u64 parent)
+{
+   struct ref_node *cur = NULL;
+   struct ref_node entry;
+   int ret;
+
+   entry.root_id = root_id;
+   entry.object_id = object_id;
+   

Re: [PATCH v3] btrfs: fix check_shared for fiemap ioctl

2016-06-12 Thread Lu Fengqi

At 06/09/2016 05:15 PM, David Sterba wrote:

On Wed, Jun 08, 2016 at 08:53:00AM -0700, Mark Fasheh wrote:

On Wed, Jun 08, 2016 at 01:13:03PM +0800, Lu Fengqi wrote:

check_shared only identified an extent as shared in the case of a different
root_id or a different object_id. However, if an extent is referred to by
different offsets of the same file, it should also be identified as shared.
In addition, check_shared's loop scales at least as n^3, so an extent
with too many references can even cause a soft lockup.

First, add all delayed refs to the ref_tree and calculate the unique_refs;
if unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then add the on-disk references (inline/keyed) to the ref_tree one by one
and recalculate unique_refs of the ref_tree, checking whether it
is greater than one. Because we return SHARED as soon as there are two
references, the time complexity is close to constant.

Reported-by: Tsutomu Itoh 
Signed-off-by: Lu Fengqi 
---
The caller is fiemap, which is called from an ioctl. Because it isn't on a
writeout path, we temporarily use GFP_KERNEL in ref_root_alloc() and
ref_tree_add(). If we later use ref_tree to replace the existing backref
structure, we can reconsider whether to use GFP_NOFS.


NACK.

You don't need to be on a writeout path to deadlock, you simply need to be
holding locks that the writeout path takes when you allocate. If the
allocator does writeout to free memory then you deadlock. Fiemap is locking
down extents which may also get locked down when you allocate within those
locks. See my e-mail here for details,

http://www.spinics.net/lists/linux-btrfs/msg55789.html


There seems to be a code path that leads to a deadlock scenario, it
depends on the extent state of the file under fiemap. If there's a
delalloc range in the middle of the file, fiemap locks the whole range,
asks for memory that could trigger writeout, the delalloc range would
need to be written, and that requests locking the delalloc range again. There's
no direct locking but rather waiting for the extent LOCKED bit to be unset.

So, use GFP_NOFS now, though I still hope we can find a way to avoid
the rb-tree and allocations altogether.




OK, until we find a way to avoid the rb-tree and allocations altogether, I 
will use GFP_NOFS.

--
Thanks,
Lu




Re: Files seen by some apps and not others

2016-06-12 Thread Bearcat Şándor
I don't think it's memory corruption, as my memory modules test out fine, and
the problem began when I ran btrfs check
--repair. Someone responded that they thought the missing files
that are playable by the media player were still in memory, but they
still play after a reboot and they're not in a tmp structure anywhere.

So could it be that the metadata is not aligned with the nodes on the
disk? If so, what should I run next to address this? A balance? A
find-root with restore -t?

Would a 'btrfs rescue chunk-recover' help at this point or damage
things further?

Thanks all.


On Sun, Jun 12, 2016 at 9:47 AM, Henk Slager  wrote:
> Bearcat Şándor  gmail.com> writes:
>
>> Is there a fix for the bad tree block error, which seems to be the
>> root (pun intended) of all this?
>
> I think the root cause is some memory corruption. It might be known case,
> maybe someone else recognizes something.
>
> Anyhow, if you can't and won't reproduce it, best is to test
> memory/hardware, check any software that might have overwritten something in
> memory, use a recent (mainline/stable) kernel and see if it runs stable.



-- 
Bearcat M. Şándor
Feline Soul Systems LLC
Voice: 872.CAT.SOUL (872.228.7685)
Fax: 406.235.7070


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Chris Murphy
On Sun, Jun 12, 2016 at 3:22 PM, Maximilian Böhm  wrote:
> Hi there, I did something terribly wrong, all blame on me. I wanted to
> write to an USB stick but /dev/sdc wasn't the stick in this case but
> an attached HDD with GPT and an 8 TB btrfs partition…
>
> $ sudo dd bs=4M if=manjaro-kde-16.06.1-x86_64.iso of=/dev/sdc
> 483+1 Datensätze ein
> 483+1 Datensätze aus
> 2028060672 bytes (2,0 GB, 1,9 GiB) copied, 16,89 s, 120 MB/s
>
> So, shit.
>
> $ sudo btrfs check --repair /dev/sdc
> enabling repair mode
> No valid Btrfs found on /dev/sdc
> Couldn't open file system
>
> $ sudo btrfs-find-root /dev/sdc
> No valid Btrfs found on /dev/sdc
> ERROR: open ctree failed
>
> $ sudo btrfs-show-super /dev/sdc --all
> superblock: bytenr=65536, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 65536
>
> superblock: bytenr=67108864, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 67108864
>
> superblock: bytenr=274877906944, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 274877906944
>
>

OK, but none of these can work anyway because the tools work from
fixed offsets. By pointing the tools at /dev/sdc the wrong offset is
always being used, because originally the drive had a partition. By
default GPT disks offset the 1st partition by 1 MiB, i.e. sector 2048 with
512-byte sectors, assuming the drive uses 512-byte logical sectors (and
most still do).

What you should do is run gdisk /dev/sdc and it'll find the backup GPT
at the end of the drive, and offer to fix up the primary one. And now
you can point the tools to /dev/sdc1 (or whatever partition is for
Btrfs).

2 GiB of Btrfs metadata is a lot of missing metadata though. So even
though btrfs-show-super will find the 3rd superblock, chances are it's
going to point to metadata somewhere in those first 2GiB that were
overwritten, but maybe not. It might be possible to get a -o ro mount
at least and start copying off some data.

If it's ro mountable, it might be possible to put /dev/sdc1 on an
overlay, and then fix the overlay rather than the original with
btrfsck - maybe it can fix up enough such that you'll just get piles
of read errors for the data that's missing rather than hitting some
brick wall and stopping.
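
One way such an overlay could be set up (a sketch with example names; the
copy-on-write file must be large enough to absorb everything btrfsck writes,
and nothing here modifies /dev/sdc1 itself):

$ truncate -s 10G /tmp/sdc1-cow.img
$ sudo losetup /dev/loop1 /tmp/sdc1-cow.img
$ echo "0 $(sudo blockdev --getsz /dev/sdc1) snapshot /dev/sdc1 /dev/loop1 N 8" \
    | sudo dmsetup create sdc1-overlay
$ sudo btrfs check --repair /dev/mapper/sdc1-overlay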


-- 
Chris Murphy


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 11:22 PM, Maximilian Böhm  wrote:
> Hi there, I did something terribly wrong, all blame on me. I wanted to
> write to an USB stick but /dev/sdc wasn't the stick in this case but
> an attached HDD with GPT and an 8 TB btrfs partition…

GPT has a secondary copy at the end of the device, so maybe gdisk can
reconstruct the primary one at the beginning of the disk; I don't
know all gdisk commands. But if you once created just one partition of max
size for the whole disk with modern tools, your btrfs fs starts at
sector 2048 (with a logical sector size of 512).


> $ sudo dd bs=4M if=manjaro-kde-16.06.1-x86_64.iso of=/dev/sdc
> 483+1 Datensätze ein
> 483+1 Datensätze aus
> 2028060672 bytes (2,0 GB, 1,9 GiB) copied, 16,89 s, 120 MB/s

To confuse btrfs and the tools as little as possible, I would first overwrite
/dev/sdc again from the start with the same number of bytes, but this time
from /dev/zero.
Then recreate the original partition at offset 1M, up to
the end of the disk. Alternatively:
$ losetup -o 1M /dev/loopX /dev/sdc
Then your broken fs will be on /dev/sdc1 or /dev/loopX respectively.
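
(Spelled out, the zeroing step could look like
$ head -c 2028060672 /dev/zero | sudo dd of=/dev/sdc bs=4M
where the byte count is taken from the dd transcript quoted above; verify it
against your own output, and triple-check the target device, before running
anything this destructive.)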

Or overlay it with a dm snapshot: set the original read-only and then work on
the read-write flavor, so you keep the current broken HDD/fs as-is while doing
the above.

> $ sudo btrfs check --repair /dev/sdc
> enabling repair mode
> No valid Btrfs found on /dev/sdc
> Couldn't open file system

Forget --repair, I would say; hopefully btrfs restore can still find /
copy most of the data.

> $ sudo btrfs-find-root /dev/sdc
> No valid Btrfs found on /dev/sdc
> ERROR: open ctree failed
>
> $ sudo btrfs-show-super /dev/sdc --all
> superblock: bytenr=65536, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 65536
>
> superblock: bytenr=67108864, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 67108864
>
> superblock: bytenr=274877906944, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 274877906944

run
$ sudo btrfs-show-super /dev/sdc1 --all
or
$ sudo btrfs-show-super /dev/loopX --all



> System infos:
>
> $ uname -a
> Linux Mongo 4.6.2-1-MANJARO #1 SMP PREEMPT Wed Jun 8 11:00:08 UTC 2016
> x86_64 GNU/Linux
>
> $ btrfs --version
> btrfs-progs v4.5.3
>
> Don't think dmesg is necessary here.
>
>
> OK, the btrfs wiki says there is a second superblock at 64 MiB
> (overwritten too in my case) and a third at 256 GiB ("0x40").
> But how to restore it? And how to restore the general btrfs header
> metadata? How to restore GPT without doing something terrible again?

See above; maybe you have to look up the various sizes, GPT details etc. first,
but I think your 3rd superblock should be there.


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Martin Steigerwald
Hi Maximilian,

On Sonntag, 12. Juni 2016 23:22:11 CEST Maximilian Böhm wrote:
> Hi there, I did something terribly wrong, all blame on me. I wanted to
> write to an USB stick but /dev/sdc wasn't the stick in this case but
> an attached HDD with GPT and an 8 TB btrfs partition…
> 
> $ sudo dd bs=4M if=manjaro-kde-16.06.1-x86_64.iso of=/dev/sdc
> 483+1 Datensätze ein
> 483+1 Datensätze aus
> 2028060672 bytes (2,0 GB, 1,9 GiB) copied, 16,89 s, 120 MB/s
> 
> So, shit.
> 
> $ sudo btrfs check --repair /dev/sdc
> enabling repair mode
> No valid Btrfs found on /dev/sdc
> Couldn't open file system
> 
> $ sudo btrfs-find-root /dev/sdc
> No valid Btrfs found on /dev/sdc
> ERROR: open ctree failed
> 
> $ sudo btrfs-show-super /dev/sdc --all
> superblock: bytenr=65536, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 65536
> 
> superblock: bytenr=67108864, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 67108864
> 
> superblock: bytenr=274877906944, device=/dev/sdc
> -
> ERROR: bad magic on superblock on /dev/sdc at 274877906944
> 
> 
> System infos:
> 
> $ uname -a
> Linux Mongo 4.6.2-1-MANJARO #1 SMP PREEMPT Wed Jun 8 11:00:08 UTC 2016
> x86_64 GNU/Linux
> 
> $ btrfs --version
> btrfs-progs v4.5.3
> 
> Don't think dmesg is necessary here.
> 
> 
> OK, the btrfs wiki says there is a second superblock at 64 MiB
> (overwritten too in my case) and a third at 256 GiB ("0x40").
> But how to restore it? And how to restore the general btrfs header
> metadata? How to restore GPT without doing something terrible again?
> Would be glad for any help!

But it says bad magic on that one as well.

Well, no idea if there is any chance to fix BTRFS in this situation.

Does btrfs restore do anything useful to copy off what it can find from this
device? It does not work in place, so you need additional space to let it
restore to.
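
A minimal attempt could look like this (a sketch; it assumes the partition
table has been repaired so the filesystem is visible as /dev/sdc1, and that
/mnt/rescue is on a different disk with enough free space):

$ sudo btrfs restore -v /dev/sdc1 /mnt/rescue

If no usable tree root is found that way, btrfs-find-root may list older
roots whose bytenr can be handed to restore with -t.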

If BTRFS cannot be salvaged… you can still have a go with photorec, but it
will not recover filenames and directory structure, just the data of any
file in a known format that it finds in one piece.

I suspect you have no backup.

So *good* luck.


I do think though that dd should just bail out or warn for a BTRFS filesystem
that is still mounted, or wasn't it mounted at that time?

I also think it would be good to add an existing-filesystem check just like in
mkfs.btrfs, mkfs.xfs and so on. I'd like that, but that would be a suggestion
for the coreutils people.

Yes, Unix is for people who know what they are doing… unless they don't. And
in the end even the most experienced admin could make such a mistake.

Goodnight,
-- 
Martin


Re: Cannot balance FS (No space left on device)

2016-06-12 Thread ojab //
On Fri, Jun 10, 2016 at 9:00 PM, Henk Slager  wrote:
> I have seldom seen an fs so full, very regular numbers :)
>
> But can you provide the output of this script:
> https://github.com/knorrie/btrfs-heatmap/blob/master/show_usage.py
>
> It gives better info w.r.t. devices and it is then easier to say what
> has to be done.
>
> But you have btrfs raid0 data (2 stripes) and raid1 metadata, and they
> both want 2 devices currently and there is only one device with place
> for your 2G chunks. So in theory you need 2 empty devices added for a
> balance to succeed. If you can allow reduced redundancy for some time,
> you could shrink the fs used space on hdd1 to half, same for the
> partition itself, add a hdd2 partition and add that to the fs. Or
> just add another HDD.
> Then your 50Gb of deletions could get into effect if you start
> balancing. Also have a look at the balance stripe filters I would say.

So after adding another one [100Gb] disk I've successfully run `btrfs
balance` and deleted new disks without any issues.
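
In commands, that was roughly the following (device name and mount point
are examples):

$ sudo btrfs device add /dev/sdX /mnt
$ sudo btrfs balance start /mnt
$ sudo btrfs device remove /dev/sdX /mnt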
Thanks for your help.

//wbr ojab


dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-12 Thread Maximilian Böhm
Hi there, I did something terribly wrong, all blame on me. I wanted to
write to an USB stick but /dev/sdc wasn't the stick in this case but
an attached HDD with GPT and an 8 TB btrfs partition…

$ sudo dd bs=4M if=manjaro-kde-16.06.1-x86_64.iso of=/dev/sdc
483+1 Datensätze ein
483+1 Datensätze aus
2028060672 bytes (2,0 GB, 1,9 GiB) copied, 16,89 s, 120 MB/s

So, shit.

$ sudo btrfs check --repair /dev/sdc
enabling repair mode
No valid Btrfs found on /dev/sdc
Couldn't open file system

$ sudo btrfs-find-root /dev/sdc
No valid Btrfs found on /dev/sdc
ERROR: open ctree failed

$ sudo btrfs-show-super /dev/sdc --all
superblock: bytenr=65536, device=/dev/sdc
-
ERROR: bad magic on superblock on /dev/sdc at 65536

superblock: bytenr=67108864, device=/dev/sdc
-
ERROR: bad magic on superblock on /dev/sdc at 67108864

superblock: bytenr=274877906944, device=/dev/sdc
-
ERROR: bad magic on superblock on /dev/sdc at 274877906944


System infos:

$ uname -a
Linux Mongo 4.6.2-1-MANJARO #1 SMP PREEMPT Wed Jun 8 11:00:08 UTC 2016
x86_64 GNU/Linux

$ btrfs --version
btrfs-progs v4.5.3

Don't think dmesg is necessary here.


OK, the btrfs wiki says there is a second superblock at 64 MiB
(overwritten too in my case) and a third at 256 GiB ("0x40").
But how to restore it? And how to restore the general btrfs header
metadata? How to restore GPT without doing something terrible again?
Would be glad for any help!


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 7:03 PM, boli  wrote:
>>> It's done now, and took close to 99 hours to rebalance 8.1 TB of data from 
>>> a 4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 
>>> 3x6TB raid1 (9 TB capacity).
>>
>> Indeed, it not clear why it takes 4 days for such an action. You
>> indicated that you cannot add an online 5th drive, so then you and
>> intermediate compaction of the fs to less drives is a way to handle
>> this issue. There are 2 ways however:
>>
>> 1) Keeping the to-be-replaced drive online until a btrfs dev remove of
>> it from the fs of it is finished and only then replace a 6TB with an
>> 8TB in the drivebay. So in this case, one needs enough free capacity
>> on the fs (which you had) and full btrfs raid1 redundancy is there all
>> the time.
>>
>> 2) Take a 6TB out of the drivebay first and then do the btrfs dev
>> remove, in this case on a really missing disk. This way, the fs is in
>> degraded mode (or mounted as such) and the action of remove missing is
>> also a sort of 'reconstruction'. I don't know the details of the code,
>> but I can imagine that it has performance implications.
>
> Thanks for reminding me about option 1). So in summary, without temporarily 
> adding an additional drive, there are 3 ways to replace a drive:
>
> 1) Logically removing old drive (triggers 1st rebalance), physically removing 
> it, then adding new drive physically and logically (triggers 2nd rebalance)
>
> 2) Physically removing old drive, mounting degraded, logically removing it 
> (triggers 1st rebalance, while degraded), then adding new drive physically 
> and logically (2nd rebalance)
>
> 3) Physically replacing old with new drive, mounting degraded, then logically 
> replacing old with new drive (triggers rebalance while degraded)
>
>
> I did option 2, which seems to be the worst of the three, as there was no 
> redundancy for a couple days, and 2 rebalances are needed, which potentially 
> take a long time.
>
> Option 1 also has 2 rebalances, but redundancy is always maintained.
>
> Option 3 needs just 1 rebalance, but (like option 1) does not maintain 
> redundancy at all times.
>
> That's where an extra drive bay would come in handy, allowing to maintain 
> redundancy while still just needing one "rebalance"? Question mark because 
> you mentioned "highspeed data transfer" rather than "rebalance" when doing a 
> btrfs-replace, which sounds very efficient (in case of -r option these 
> transfers would be from multiple drives).

I haven't used -r with replace other than for testing purposes inside
virtual machines. I think the '..transfers would be from multiple
drives...' might not be a speed advantage with the current state of
the code. If the drives are still healthy and the purpose of the replace is
a capacity increase, my experience is that without the -r option (and
using an extra SATA port) the transfer mostly runs at the drive's maximum
magnetic-media transfer speed. The same applies to cases where you want to add
LUKS or bcache headers in front of the block device that hosts the
fs/devid 1 data.

But now that you anyhow have all data on 3x 6TB drives, you could save
balancing time by just doing btrfs-replace 6TB to 8TB 3x and then for
the 4th 8TB just add it and let btrfs do the spreading/balancing over
time by itself.

> The man page mentioned that the replacement drive needs to be at least as 
> large as the original, which makes me wonder if it's still a "highspeed data 
> transfer" if the new drive is larger, or if it does a rebalance in that case. 
> If not then that'd be pretty much what I'm looking for. More on that below.
>
>>> If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still 
>>> raid1), is there a way to remove one 6 TB drive at a time, recreate its 
>>> exact contents from the other 3 drives onto a new 8 TB drive, without doing 
>>> a full rebalance? That is: without writing any substantial amount of data 
>>> onto the remaining 3 drives.
>>
>> There isn't such a way. This goal has a violation in itself with
>> respect to redundancy (btrfs raid1).
>
> True, it would be "hack" to minimize the amount of data to rebalance (thus 
> saving time), with the (significant) downside of not maintaining redundancy 
> at all times.
> Personally I'd probably be willing to take the risk, since I have a few other 
> copies of this data.
>
>> man btrfs-replace and option -r I would say. But still, having a 5th
>> drive online available makes things much easier and faster and solid
>> and is the way to do a drive replace. You can then do a normal replace
>> and there is just highspeed data transfer for the old and the new disk
>> and only for parts/blocks of the disk that contain filedata. So it is
>> not a sector-by-sector copying also deleted blocks, but from end-user
>> perspective is an exact copy. There are patches ('hot spare') that
>> assume it to be this way, but they aren't in the mainline kernel yet.
>
> Hmm, so maybe I should think about 

Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-12 Thread Hans van Kranenburg

Hi!

On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:

Hi All,

On 2016-06-10 22:47, Hans van Kranenburg wrote:

+if (sk->min_objectid < sk->max_objectid) +
sk->min_objectid += 1;


...and now it's (289406977 168 19193856), which means you're
continuing your search *after* the block group item!

(289406976 168 19193856) is actually (289406976 << 72) + (168 <<
64) + 19193856, which is 1366685806470112827871857008640

The search is continued at 136668581119247931074150336, which
skips 4722366482869645213696 possible places where an object could
live in the tree.


I am not sure to follow you. The extent tree (the tree involved in
the search), contains only two kind of object:

- BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0, size
in bytes)
- EXTENT_ITEM, where the key means (logical address, 0xa8,
size in bytes)

So it seems that for each (possible) "logical address", only two
items might exist; the two item are completely identified by
(objectid, type, ). It should not possible (for the extent tree) to
have two item with the same objectid,key and different offset. So,
for the extent tree, it is safe to advance only the objectid field.

I am wrong ?


When calling the search ioctl, the caller has to provide a memory buffer 
that the kernel is going to fill with results. For BTRFS_IOC_TREE_SEARCH 
used here, this buffer has a fixed size of 4096 bytes. Without some 
headers etc, this leaves a bit less than 4000 bytes of space for the 
kernel to write search result objects to.


If I do a search that would return far more objects than can fit in
those <4096 bytes, the kernel will just put in as many as fit, stopping
when the next one no longer does.


It's the responsibility of the caller to change the start of the search 
to point just after the last received object and do the search again, in 
order to retrieve a few extra results.


So, the important line here was: "...when the extent_item just manages 
to squeeze in as last result into the current result buffer from the 
ioctl..."
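
For reference, a sketch (not code from the patch under review; the field
names follow struct btrfs_ioctl_search_key, and (objectid, type, offset) is
the key of the last item received) of how a caller can resume the search
without assuming anything about which key slots may be occupied:

static void resume_search_after(struct btrfs_ioctl_search_key *sk,
                                u64 objectid, u32 type, u64 offset)
{
        /* restart exactly at the last returned key ... */
        sk->min_objectid = objectid;
        sk->min_type = type;
        sk->min_offset = offset;

        /*
         * ... then step one slot forward, carrying into type and objectid
         * only on overflow, so no possible key is skipped.
         */
        if (sk->min_offset < (u64)-1) {
                sk->min_offset++;
        } else if (sk->min_type < 255) {
                sk->min_offset = 0;
                sk->min_type++;
        } else {
                sk->min_offset = 0;
                sk->min_type = 0;
                sk->min_objectid++;
        }
}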


--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com


Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-12 Thread Goffredo Baroncelli
Hi All,

On 2016-06-10 22:47, Hans van Kranenburg wrote:
>> +if (sk->min_objectid < sk->max_objectid) 
>> +sk->min_objectid += 1;
> 
> ...and now it's (289406977 168 19193856), which means you're
> continuing your search *after* the block group item!
> 
> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 64) 
> + 19193856, which is 1366685806470112827871857008640
> 
> The search is continued at 136668581119247931074150336, which
> skips 4722366482869645213696 possible places where an object could
> live in the tree.

I am not sure I follow you. The extent tree (the tree involved in the search)
contains only two kinds of object:

- BLOCK_GROUP_ITEM, where the key means (logical address, 0xc0, size in bytes)
- EXTENT_ITEM, where the key means (logical address, 0xa8, size in bytes)

So it seems that for each (possible) "logical address", only two items might
exist; the two items are completely identified by (objectid, type). It should
not be possible (for the extent tree) to have two items with the same objectid
and type but a different offset.
So, for the extent tree, it is safe to advance only the objectid field.

Am I wrong?

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread boli
>> It's done now, and took close to 99 hours to rebalance 8.1 TB of data from a 
>> 4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 3x6TB 
>> raid1 (9 TB capacity).
> 
> Indeed, it not clear why it takes 4 days for such an action. You
> indicated that you cannot add an online 5th drive, so then you and
> intermediate compaction of the fs to less drives is a way to handle
> this issue. There are 2 ways however:
> 
> 1) Keeping the to-be-replaced drive online until a btrfs dev remove of
> it from the fs of it is finished and only then replace a 6TB with an
> 8TB in the drivebay. So in this case, one needs enough free capacity
> on the fs (which you had) and full btrfs raid1 redundancy is there all
> the time.
> 
> 2) Take a 6TB out of the drivebay first and then do the btrfs dev
> remove, in this case on a really missing disk. This way, the fs is in
> degraded mode (or mounted as such) and the action of remove missing is
> also a sort of 'reconstruction'. I don't know the details of the code,
> but I can imagine that it has performance implications.

Thanks for reminding me about option 1). So in summary, without temporarily 
adding an additional drive, there are 3 ways to replace a drive:

1) Logically removing old drive (triggers 1st rebalance), physically removing 
it, then adding new drive physically and logically (triggers 2nd rebalance)

2) Physically removing old drive, mounting degraded, logically removing it 
(triggers 1st rebalance, while degraded), then adding new drive physically and 
logically (2nd rebalance)

3) Physically replacing old with new drive, mounting degraded, then logically 
replacing old with new drive (triggers rebalance while degraded)


I did option 2, which seems to be the worst of the three, as there was no 
redundancy for a couple days, and 2 rebalances are needed, which potentially 
take a long time.

Option 1 also has 2 rebalances, but redundancy is always maintained.

Option 3 needs just 1 rebalance, but (like option 1) does not maintain 
redundancy at all times.

That's where an extra drive bay would come in handy, allowing to maintain 
redundancy while still just needing one "rebalance"? Question mark because you 
mentioned "highspeed data transfer" rather than "rebalance" when doing a 
btrfs-replace, which sounds very efficient (in case of -r option these 
transfers would be from multiple drives).

The man page mentioned that the replacement drive needs to be at least as large 
as the original, which makes me wonder if it's still a "highspeed data 
transfer" if the new drive is larger, or if it does a rebalance in that case. 
If not then that'd be pretty much what I'm looking for. More on that below.
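
One point worth noting for that case (a sketch; devid 1 and the device names
below are examples): replace copies the old device's data and keeps its old
size registered, so the extra space on a larger target only becomes usable
after the device is resized:

$ sudo btrfs replace start 1 /dev/sdX /mnt
$ sudo btrfs replace status /mnt
$ sudo btrfs filesystem resize 1:max /mnt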

>> If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still 
>> raid1), is there a way to remove one 6 TB drive at a time, recreate its 
>> exact contents from the other 3 drives onto a new 8 TB drive, without doing 
>> a full rebalance? That is: without writing any substantial amount of data 
>> onto the remaining 3 drives.
> 
> There isn't such a way. This goal has a violation in itself with
> respect to redundancy (btrfs raid1).

True, it would be "hack" to minimize the amount of data to rebalance (thus 
saving time), with the (significant) downside of not maintaining redundancy at 
all times.
Personally I'd probably be willing to take the risk, since I have a few other 
copies of this data.

> man btrfs-replace and option -r I would say. But still, having a 5th
> drive online available makes things much easier and faster and solid
> and is the way to do a drive replace. You can then do a normal replace
> and there is just highspeed data transfer for the old and the new disk
> and only for parts/blocks of the disk that contain filedata. So it is
> not a sector-by-sector copying also deleted blocks, but from end-user
> perspective is an exact copy. There are patches ('hot spare') that
> assume it to be this way, but they aren't in the mainline kernel yet.

Hmm, so maybe I should think about using an USB enclosure to temporarily add a 
5th drive.
Being a bit wary about an external USB enclosure, I'd probably try to minimize 
transfers from/to the USB enclosure.

Say by putting the old (to-be-replaced) drive into the USB enclosure, the new 
drive into the internal drive bay where the old drive used to be, and then do a 
btrfs-replace with -r option to minimize reads from USB.

Or put one of the *other* disks into the USB enclosure (neither the old nor its 
new replacement drive), and doing a btrfs-replace without -r option.

> The btrfs-replace should work ok for btrfs raid1 fs (at least it
> worked ok for btrfs raid10 half a year ago I can confirm), if the fs
> is mostly idle during the replace (almost no new files added).

That's good to read. The fs will be idle during the replace.

> Still, you might want to have the replace related fixes added in kernel
> 4.7-rc2.

Hmm, since I'm on Fedora with kernel 4.5.5 (or 4.5.6 after most 

Re: Files seen by some apps and not others

2016-06-12 Thread Henk Slager
Bearcat Şándor  gmail.com> writes:

> Is there a fix for the bad tree block error, which seems to be the
> root (pun intended) of all this?

I think the root cause is some memory corruption. It might be a known case; 
maybe someone else recognizes something.

Anyhow, if you can't and won't reproduce it, the best thing is to test 
memory/hardware, check any software that might have overwritten something in 
memory, use a recent (mainline/stable) kernel and see if it runs stably.
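
A few quick checks that might help narrow it down (assuming the fs is mounted 
at /mnt and /dev/sdX is one of its member devices):

  # re-verify all checksums and show per-device error counters
  btrfs scrub start -Bd /mnt
  btrfs device stats /mnt
  # drive health / SMART attributes
  smartctl -a /dev/sdX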

Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread Henk Slager
On Sun, Jun 12, 2016 at 12:35 PM, boli  wrote:
>> It has now been doing "btrfs device delete missing /mnt" for about 90 hours.
>>
>> These 90 hours seem like a rather long time, given that a rebalance/convert 
>> from 4-disk-raid5 to 4-disk-raid1 took about 20 hours months ago, and a 
>> scrub takes about 7 hours (4-disk-raid1).
>>
>> OTOH the filesystem will be rather full with only 3 of 4 disks available, so 
>> I do expect it to take somewhat "longer than usual".
>>
>> Would anyone venture a guess as to how long it might take?
>
> It's done now, and took close to 99 hours to rebalance 8.1 TB of data from a 
> 4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 3x6TB 
> raid1 (9 TB capacity).

Indeed, it is not clear why it takes 4 days for such an action. You
indicated that you cannot add an online 5th drive, so an intermediate
compaction of the fs onto fewer drives is a way to handle this issue.
There are 2 ways however (rough command sketches below):

1) Keeping the to-be-replaced drive online until a btrfs dev remove of
it from the fs is finished, and only then swapping the 6TB for an 8TB
in the drivebay. In this case one needs enough free capacity on the fs
(which you had), and full btrfs raid1 redundancy is there all the time.

2) Taking a 6TB out of the drivebay first and then doing the btrfs dev
remove, in this case on a really missing disk. This way the fs is in
degraded mode (or mounted as such), and the 'remove missing' action is
also a sort of reconstruction. I don't know the details of the code,
but I can imagine that this has performance implications.
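
Very roughly, and with made-up device names (/dev/sdd the old 6TB, /dev/sde
the new 8TB, /dev/sdb any remaining member device, /mnt the mount point):

  # way 1: shrink while everything is still online, then grow again
  btrfs device delete /dev/sdd /mnt
  # ... physically swap the 6TB for the 8TB ...
  btrfs device add /dev/sde /mnt
  btrfs balance start /mnt

  # way 2: disk already pulled, reconstruct while degraded
  mount -o degraded /dev/sdb /mnt
  btrfs device delete missing /mnt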

> Now I made sure quotas were off, then started a screen to fill the new 8 TB 
> disk with zeros, detached it and checked iotop to get a rough estimate on 
> how long it will take (I'm aware it will become slower in time).
>
> After that I'll add this 8 TB disk to the btrfs raid1 (for yet another 
> rebalance).
>
> The next 3 disks will be replaced with "btrfs replace", so only one rebalance 
> each is needed.
>
> I assume each "btrfs replace" would do a full rebalance, and thus assign 
> chunks according to the normal strategy of choosing the two drives with the 
> most free space, which in this case would be a chunk to the new drive, and a 
> mirrored chunk to whichever of the 3 existing drives has the most free space.
>
> What I'm wondering is this:
> If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still 
> raid1), is there a way to remove one 6 TB drive at a time, recreate its exact 
> contents from the other 3 drives onto a new 8 TB drive, without doing a full 
> rebalance? That is: without writing any substantial amount of data onto the 
> remaining 3 drives.

There isn't such a way. This goal in itself violates redundancy
(btrfs raid1).

> It seems to me that would be a lot more efficient, but it would go against 
> the normal chunk assignment strategy.

man btrfs-replace and option -r I would say. But still, having a 5th
drive online available makes things much easier, faster and more solid,
and is the way to do a drive replace. You can then do a normal replace,
and there is just a highspeed data transfer between the old and the new
disk, and only for the parts/blocks of the disk that contain file data.
So it is not a sector-by-sector copy that also carries over deleted
blocks, but from the end-user perspective it is an exact copy. There are
patches ('hot spare') that assume it to be this way, but they aren't in
the mainline kernel yet.
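
With a 5th bay available, the whole swap is then basically just (device
names made up again):

  btrfs replace start /dev/sd<old> /dev/sd<new> /mnt
  btrfs replace status /mnt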

The btrfs-replace should work ok for btrfs raid1 fs (at least it
worked ok for btrfs raid10 half a year ago I can confirm), if the fs
is mostly idle during the replace (almost no new files added). Still,
you might want to have the replace related fixes added in kernel
4.7-rc2.

Another, less likely, reason for the performance issue is that the fs
was converted from raid5 and has a 4k nodesize. btrfs-show-super can
show you that. It should not matter, but my experience with a delete /
add sequence in such a case is that it is very slow.
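
Something like this shows it (any member device will do; newer btrfs-progs
expose the same information via 'btrfs inspect-internal dump-super'):

  btrfs-show-super /dev/sdd | grep -i nodesize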


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-12 Thread boli
> It has now been doing "btrfs device delete missing /mnt" for about 90 hours.
> 
> These 90 hours seem like a rather long time, given that a rebalance/convert 
> from 4-disk-raid5 to 4-disk-raid1 took about 20 hours months ago, and a scrub 
> takes about 7 hours (4-disk-raid1).
> 
> OTOH the filesystem will be rather full with only 3 of 4 disks available, so 
> I do expect it to take somewhat "longer than usual".
> 
> Would anyone venture a guess as to how long it might take?

It's done now, and took close to 99 hours to rebalance 8.1 TB of data from a 
4x6TB raid1 (12 TB capacity) with 1 drive missing onto the remaining 3x6TB 
raid1 (9 TB capacity).

Now I made sure quotas were off, then started a screen to fill the new 8 TB 
disk with zeros, detached it and checked iotop to get a rough estimate on 
how long it will take (I'm aware it will become slower in time).
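
For the record, the screen session is just something along these lines, with 
/dev/sdX standing in for the new, still-empty 8 TB disk (device name 
double-checked, of course):

  screen -dmS zerofill dd if=/dev/zero of=/dev/sdX bs=1M
  iotop -o    # only list processes currently doing I/O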

After that I'll add this 8 TB disk to the btrfs raid1 (for yet another 
rebalance).

The next 3 disks will be replaced with "btrfs replace", so only one rebalance 
each is needed.

I assume each "btrfs replace" would do a full rebalance, and thus assign chunks 
according to the normal strategy of choosing the two drives with the most free 
space, which in this case would be a chunk to the new drive, and a mirrored 
chunk to whichever of the 3 existing drives has the most free space.

What I'm wondering is this:
If the goal is to replace 4x 6TB drive (raid1) with 4x 8TB drive (still raid1), 
is there a way to remove one 6 TB drive at a time, recreate its exact contents 
from the other 3 drives onto a new 8 TB drive, without doing a full rebalance? 
That is: without writing any substantial amount of data onto the remaining 3 
drives.

It seems to me that would be a lot more efficient, but it would go against the 
normal chunk assignment strategy.

Cheers, boli



Re: Files seen by some apps and not others

2016-06-12 Thread Bearcat Şándor
Thank you for that info Duncan! I did the restore on the whole drive
and it errored out on me. I'll try the restore on some key files (audio
mostly) and see what I can get off of it.
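
If I'm reading the btrfs-restore man page right, pulling out just one subtree
while auto-answering the looping prompts would be something along these lines
(device, destination and regex made up for my layout; the regex apparently has
to match each parent directory level as well):

  yes | btrfs restore -m -S -i \
      --path-regex '^/(|music(|/.*))$' \
      /dev/sdb1 /mnt/rescue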

Is there a fix for the bad tree block error, which seems to be the
root (pun intended) of all this?

On Sat, Jun 11, 2016 at 8:18 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Bearcat Şándor posted on Sat, 11 Jun 2016 13:54:44 -0600 as excerpted:
>
>> I'm about to try a btrfs restore to see what it can do for me. Any
>> pointers or help here? I don't want to fsck things up further.
>
> FWIW, btrfs restore doesn't write anything at all to the filesystem it's
> restoring from -- it's read-only in that regard -- so you really don't
> have to worry about it screwing up a filesystem further.
>
> But by the same token, btrfs restore may not do what you think it does.
> It doesn't try to fix the filesystem.  Rather, it's a way to try to
> salvage anything you can off a filesystem that won't mount, or, as it
> would be used here, that will mount but where files aren't showing up
> properly so you can't just copy them elsewhere using normal means.  It
> writes the files it can salvage to some other filesystem, which of course
> means that whatever filesystem you're writing the files to must have
> enough room for the files to be written.
>
> Also note the various restore options.  In particular, the restore
> metadata option must be used if you want to restore the same ownership,
> permissions and timestamp information.  Otherwise, restore simply writes
> the files as the user you're running it as (root), using the current
> umask.  Similarly, if you want to restore symlinks and extended
> attributes, there's options for that, otherwise they aren't restored.
>
> And you won't necessarily be wanting to restore snapshots, as you should
> have backups if needed for the history, and are likely most worried about
> the current version of the files, so snapshots aren't restored unless you
> use the appropriate option.
>
> Given that the filesystem is still mounted and most files are apparently
> still readable normally, you may want to copy off what you can that way,
> and only restore specific files using btrfs restore.  Or you may not have
> room on the destination filesystem to restore everything, and will need
> to pick only the most important stuff to restore.  That's where the
> pattern-match option comes in.
>
> What I did here when I used restore (I had backups of course but they
> weren't current) was use the metadata and symlinks restore options, and
> simply restored everything.
>
> Note that if a particular directory has a lot of files, restore will
> begin to think that it's looping too much and that it's stuck.  It'll
> prompt you to continue, and may prompt a *LOT*.  Here I have multiple
> independent small filesystems, so it wasn't a big deal, but you may need
> to experiment with automating the "yes" if your filesystem is huge
> (piping the output of the yes command to stdin, for instance, or similar
> sysadmin prompt automation tricks).  A number of folks have mentioned
> that and requested a way to say "yes, really all, don't ask again", an
> option that btrfs restore unfortunately doesn't have yet.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>



-- 
Bearcat M. Şándor
Feline Soul Systems LLC
Voice: 872.CAT.SOUL (872.228.7685)