hang in btrfs_async_reclaim_metadata_space

2018-01-05 Thread Adam Borowski
Hi!
I got a reproducible infinite hang, reliably triggered by the testsuite of
"flatpak"; it fails on at least 4.15-rc6, 4.9.75, and on another machine with
Debian's 4.14.2-1.

[580632.355107] INFO: task kworker/u8:2:11105 blocked for more than 120 seconds.
[580632.355120]   Not tainted 4.14.0-1-amd64 #1 Debian 4.14.2-1
[580632.355124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[580632.355129] kworker/u8:2    D    0 11105      2 0x8000
[580632.355176] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[580632.355179] Call Trace:
[580632.355192]  __schedule+0x28e/0x880
[580632.355196]  schedule+0x2c/0x80
[580632.355200]  wb_wait_for_completion+0x64/0x90
[580632.355205]  ? finish_wait+0x80/0x80
[580632.355207]  __writeback_inodes_sb_nr+0xa1/0xd0
[580632.355210]  writeback_inodes_sb_nr+0x10/0x20
[580632.355235]  flush_space+0x3ed/0x520 [btrfs]
[580632.355238]  ? pick_next_task_fair+0x158/0x590
[580632.355242]  ? __switch_to+0x1f3/0x460
[580632.355267]  btrfs_async_reclaim_metadata_space+0xf6/0x4a0 [btrfs]
[580632.355278]  process_one_work+0x198/0x390
[580632.355281]  worker_thread+0x35/0x3c0
[580632.355284]  kthread+0x125/0x140
[580632.355287]  ? process_one_work+0x390/0x390
[580632.355289]  ? kthread_create_on_node+0x70/0x70
[580632.355292]  ? SyS_exit_group+0x14/0x20
[580632.355295]  ret_from_fork+0x25/0x30

The machines are distinct enough that this probably happens everywhere:

AMD Phenom2, SSD, noatime,compress=lzo,space_cache=v2
Intel Braswell, rust, noatime,autodefrag,space_cache=v2


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs quota exceeded notifications through netlink sockets

2018-01-05 Thread Karsai, Gabor
I created a subvolume on a btrfs filesystem, set a limit, and the quota is
enforced: dumping too much data into the subvolume results in a 'quota exceeded'
message (from dd, for example). But when I try to get netlink socket
notifications, nothing arrives on the socket (I am using pyroute2, which is
supposedly able to receive disk quota notifications).
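For what it's worth, here is a minimal listener sketch. It assumes pyroute2's
`DQuotSocket` (the wrapper for the VFS quota netlink family, per pyroute2's
documentation); the `QUOTA_NL_A_*` attribute names below are taken from the
generic quota netlink interface and may need adjusting to what your pyroute2
version actually exposes:

```python
# Hedged sketch: listen for VFS quota netlink warnings with pyroute2.
# DQuotSocket and the QUOTA_NL_A_* attribute names are assumptions taken
# from pyroute2's docs and the generic quota netlink interface.

def format_quota_event(event):
    # Illustrative helper: render an event dict into one log line.
    return "quota event: device=%s id=%s warning=%s" % (
        event.get("device", "?"),
        event.get("id", "?"),
        event.get("warning", "?"),
    )

def listen():
    # Imported here so the formatting helper works without pyroute2 installed.
    from pyroute2 import DQuotSocket  # assumed API, see pyroute2 docs
    with DQuotSocket() as ds:
        while True:
            for msg in ds.get():
                print(format_quota_event({
                    "device": msg.get_attr("QUOTA_NL_A_DEV_MAJOR"),
                    "id": msg.get_attr("QUOTA_NL_A_EXCESS_ID"),
                    "warning": msg.get_attr("QUOTA_NL_A_WARNING"),
                }))

# Formatting only, no netlink traffic needed:
print(format_quota_event({"device": "8", "id": 1000, "warning": 4}))
# -> quota event: device=8 id=1000 warning=4
```

One way to narrow this down: if no events arrive even for an ext4 filesystem
with classic quotas enabled, the listener side is at fault; if ext4 events do
arrive but btrfs ones don't, the notifications are simply not being generated
on the btrfs side.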

$ uname -a
Linux riaps-dev 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs: whatever Ubuntu 16.04 has 

Kconfig:
CONFIG_QUOTA_NETLINK_INTERFACE=y

Any advice would be appreciated.
Thanks,
-- Gabor Karsai




[PATCH v2 00/10] bugfixes and regression tests of btrfs_get_extent

2018-01-05 Thread Liu Bo
Although
commit e6c4efd87ab0 ("btrfs: Fix and enhance merge_extent_mapping() to insert 
best fitted extent map")
fixed up the negative em->len, it introduced several regressions, several of
which have been fixed by

commit 32be3a1ac6d0 ("btrfs: Fix the wrong condition judgment about subset 
extent map"),
commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map insertion 
in btrfs_get_extent") and
commit 8e2bd3b7fac9 ("Btrfs: deal with existing encompassing extent map in 
btrfs_get_extent()").

Unfortunately, there is one more regression which is caught recently by a
user's workloads.

While debugging the above issue, I found that all of these bugs are caused
by some racy situations, which can be very tricky to reproduce, so I
created several extent map specific test cases in btrfs's selftest
framework.

Patch 1-2 are fixing two bugs.
Patch 3-4 are preparatory work.
Patch 5-7 are regression tests about the logic of handling EEXIST when
adding an extent map.
Patch 8-10 are debugging aids: one adds a WARN_ONCE for unexpected merge
errors, one adds a direct tracepoint, and one enables kprobe on
merge_extent_mapping().

v2:
- Improve the commit log to provide more details about the bug.
- Move the bugfixes to the front so that they can be merged first.

Liu Bo (10):
  Btrfs: fix incorrect block_len in merge_extent_mapping
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: add helper for em merge logic
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add extent map selftests
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: add WARN_ONCE to detect unexpected error from
merge_extent_mapping
  Btrfs: add tracepoint for em's EEXIST case
  Btrfs: noinline merge_extent_mapping

 fs/btrfs/Makefile |   2 +-
 fs/btrfs/extent_map.c | 134 ++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/inode.c  | 108 +---
 fs/btrfs/tests/btrfs-tests.c  |   3 +
 fs/btrfs/tests/btrfs-tests.h  |   1 +
 fs/btrfs/tests/extent-map-tests.c | 363 ++
 include/trace/events/btrfs.h  |  35 
 8 files changed, 540 insertions(+), 108 deletions(-)
 create mode 100644 fs/btrfs/tests/extent-map-tests.c

-- 
2.9.4



[PATCH v2 04/10] Btrfs: move extent map specific code to extent_map.c

2018-01-05 Thread Liu Bo
These helpers are extent map specific, move them to extent_map.c.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h  |   2 -
 fs/btrfs/extent_map.c | 125 ++
 fs/btrfs/extent_map.h |   2 +
 fs/btrfs/inode.c  | 107 --
 4 files changed, 127 insertions(+), 109 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 328f40f..b2e09fe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3148,8 +3148,6 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
int delay_iput);
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
 
-int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
-struct extent_map **em_in, u64 start, u64 len);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start,
u64 len, int create);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2e348fb..6fe8b14 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -454,3 +454,128 @@ void replace_extent_mapping(struct extent_map_tree *tree,
 
setup_extent_mapping(tree, new, modified);
 }
+
+static struct extent_map *next_extent_map(struct extent_map *em)
+{
+   struct rb_node *next;
+
+   next = rb_next(&em->rb_node);
+   if (!next)
+   return NULL;
+   return container_of(next, struct extent_map, rb_node);
+}
+
+static struct extent_map *prev_extent_map(struct extent_map *em)
+{
+   struct rb_node *prev;
+
+   prev = rb_prev(&em->rb_node);
+   if (!prev)
+   return NULL;
+   return container_of(prev, struct extent_map, rb_node);
+}
+
+/* helper for btrfs_get_extent.  Given an existing extent in the tree,
+ * the existing extent is the nearest extent to map_start,
+ * and an extent that you want to insert, deal with overlap and insert
+ * the best fitted new extent into the tree.
+ */
+static int merge_extent_mapping(struct extent_map_tree *em_tree,
+   struct extent_map *existing,
+   struct extent_map *em,
+   u64 map_start)
+{
+   struct extent_map *prev;
+   struct extent_map *next;
+   u64 start;
+   u64 end;
+   u64 start_diff;
+
+   BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
+
+   if (existing->start > map_start) {
+   next = existing;
+   prev = prev_extent_map(next);
+   } else {
+   prev = existing;
+   next = next_extent_map(prev);
+   }
+
+   start = prev ? extent_map_end(prev) : em->start;
+   start = max_t(u64, start, em->start);
+   end = next ? next->start : extent_map_end(em);
+   end = min_t(u64, end, extent_map_end(em));
+   start_diff = start - em->start;
+   em->start = start;
+   em->len = end - start;
+   if (em->block_start < EXTENT_MAP_LAST_BYTE &&
+   !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+   em->block_start += start_diff;
+   em->block_len = em->len;
+   }
+   return add_extent_mapping(em_tree, em, 0);
+}
+
+/**
+ * btrfs_add_extent_mapping - add extent mapping into em_tree
+ * @em_tree - the extent tree into which we want to insert the extent mapping
+ * @em_in   - extent we are inserting
+ * @start   - start of the logical range btrfs_get_extent() is requesting
+ * @len - length of the logical range btrfs_get_extent() is requesting
+ *
+ * Note that @em_in's range may be different from [start, start+len),
+ * but they must be overlapped.
+ *
+ * Insert @em_in into @em_tree. In case there is an overlapping range, handle
+ * the -EEXIST by either:
+ * a) Returning the existing extent in @em_in if @start is within the
+ *existing em.
+ * b) Merge the existing extent with @em_in passed in.
+ *
+ * Return 0 on success, otherwise -EEXIST.
+ *
+ */
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len)
+{
+   int ret;
+   struct extent_map *em = *em_in;
+
+   ret = add_extent_mapping(em_tree, em, 0);
+   /* it is possible that someone inserted the extent into the tree
+* while we had the lock dropped.  It is also possible that
+* an overlapping map exists in the tree
+*/
+   if (ret == -EEXIST) {
+   struct extent_map *existing;
+
+   ret = 0;
+
+   existing = search_extent_mapping(em_tree, start, len);
+   /*
+* existing will always be non-NULL, since there must be
+* extent causing the -EEXIST.
+*/
+   if (start >= existing->start &&
+   start < extent_map_end(existing)) {
+

[PATCH v2 08/10] Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping

2018-01-05 Thread Liu Bo
This is a subtle case, so in order to understand the problem, it'd be good
to know the contents of existing and em when an error occurs.

Signed-off-by: Liu Bo 
---
v2: Remove unnecessary KERN_INFO.

 fs/btrfs/extent_map.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 6fe8b14..b5d0add 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -562,17 +562,23 @@ int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
*em_in = existing;
ret = 0;
} else {
+   u64 orig_start = em->start;
+   u64 orig_len = em->len;
+
/*
 * The existing extent map is the one nearest to
 * the [start, start + len) range which overlaps
 */
ret = merge_extent_mapping(em_tree, existing,
   em, start);
-   free_extent_map(existing);
if (ret) {
free_extent_map(em);
*em_in = NULL;
+   WARN_ONCE(ret, "Unexpected error %d: merge existing(start %llu len %llu) with em(start %llu len %llu)\n",
+ ret, existing->start, existing->len,
+ orig_start, orig_len);
}
+   free_extent_map(existing);
}
}
 
-- 
2.9.4



[PATCH v2 10/10] Btrfs: noinline merge_extent_mapping

2018-01-05 Thread Liu Bo
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
gets inlined by the compiler and cannot be probed.

Add the noinline attribute to merge_extent_mapping().

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index a5a1d17..9e5fdf9 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -480,10 +480,10 @@ static struct extent_map *prev_extent_map(struct extent_map *em)
  * and an extent that you want to insert, deal with overlap and insert
  * the best fitted new extent into the tree.
  */
-static int merge_extent_mapping(struct extent_map_tree *em_tree,
-   struct extent_map *existing,
-   struct extent_map *em,
-   u64 map_start)
+static noinline int merge_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map *existing,
+struct extent_map *em,
+u64 map_start)
 {
struct extent_map *prev;
struct extent_map *next;
-- 
2.9.4



[PATCH v2 07/10] Btrfs: extent map selftest: dio write vs dio read

2018-01-05 Thread Liu Bo
This test case simulates the racy situation of dio write vs dio read,
and checks whether btrfs_get_extent() returns -EEXIST.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tests/extent-map-tests.c | 88 +++
 1 file changed, 88 insertions(+)

diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 2adf55f..66d5523 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -253,6 +253,93 @@ static void test_case_3(struct extent_map_tree *em_tree)
__test_case_3(em_tree, (12 * 1024ULL));
 }
 
+static void __test_case_4(struct extent_map_tree *em_tree, u64 start)
+{
+   struct extent_map *em;
+   u64 len = SZ_4K;
+   int ret;
+
+   em = alloc_extent_map();
+   if (!em)
+   /* Skip this test on error. */
+   return;
+
+   /* Add [0K, 8K) */
+   em->start = 0;
+   em->len = SZ_8K;
+   em->block_start = 0;
+   em->block_len = SZ_8K;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+
+   /* Add [8K, 24K) */
+   em->start = SZ_8K;
+   em->len = 24 * 1024ULL;
+   em->block_start = SZ_16K; /* avoid merging */
+   em->block_len = 24 * 1024ULL;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+   /* Add [0K, 32K) */
+   em->start = 0;
+   em->len = SZ_32K;
+   em->block_start = 0;
+   em->block_len = SZ_32K;
+   ret = btrfs_add_extent_mapping(em_tree, &em, start, len);
+   if (ret)
+   test_msg("case4 [0x%llx 0x%llx): ret %d\n",
+start, len, ret);
+   if (em &&
+   (start < em->start || start + len > extent_map_end(em)))
+   test_msg("case4 [0x%llx 0x%llx): ret %d, added wrong em (start 0x%llx len 0x%llx block_start 0x%llx block_len 0x%llx)\n",
+start, len, ret, em->start, em->len, em->block_start,
+em->block_len);
+   free_extent_map(em);
+out:
+   /* free memory */
+   free_extent_map_tree(em_tree);
+}
+
+/*
+ * Test scenario:
+ *
+ * Suppose that no extent map has been loaded into memory yet.
+ * There is a file extent [0, 32K), two jobs are running concurrently
+ * against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
+ * read from [0, 4K) or [4K, 8K).
+ *
+ * t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
+ *
+ * t1t2
+ *  btrfs_get_blocks_direct() btrfs_get_blocks_direct()
+ *   -> btrfs_get_extent()  -> btrfs_get_extent()
+ *   -> lookup_extent_mapping()
+ *   -> add_extent_mapping()-> lookup_extent_mapping()
+ *  # load [0, 32K)
+ *   -> btrfs_new_extent_direct()
+ *   -> btrfs_drop_extent_cache()
+ *  # split [0, 32K)
+ *   -> add_extent_mapping()
+ *  # add [8K, 32K)
+ *  -> add_extent_mapping()
+ * # handle -EEXIST when adding
+ * # [0, 32K)
+ */
+static void test_case_4(struct extent_map_tree *em_tree)
+{
+   __test_case_4(em_tree, 0);
+   __test_case_4(em_tree, SZ_4K);
+}
+
 int btrfs_test_extent_map()
 {
struct extent_map_tree *em_tree;
@@ -269,6 +356,7 @@ int btrfs_test_extent_map()
test_case_1(em_tree);
test_case_2(em_tree);
test_case_3(em_tree);
+   test_case_4(em_tree);
 
kfree(em_tree);
return 0;
-- 
2.9.4



[PATCH v2 06/10] Btrfs: extent map selftest: buffered write vs dio read

2018-01-05 Thread Liu Bo
This test case simulates the racy situation of buffered write vs dio
read, and checks whether btrfs_get_extent() returns -EEXIST.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tests/extent-map-tests.c | 73 +++
 1 file changed, 73 insertions(+)

diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 0407396..2adf55f 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -181,6 +181,78 @@ static void test_case_2(struct extent_map_tree *em_tree)
free_extent_map_tree(em_tree);
 }
 
+static void __test_case_3(struct extent_map_tree *em_tree, u64 start)
+{
+   struct extent_map *em;
+   u64 len = SZ_4K;
+   int ret;
+
+   em = alloc_extent_map();
+   if (!em)
+   /* Skip this test on error. */
+   return;
+
+   /* Add [4K, 8K) */
+   em->start = SZ_4K;
+   em->len = SZ_4K;
+   em->block_start = SZ_4K;
+   em->block_len = SZ_4K;
+   ret = add_extent_mapping(em_tree, em, 0);
+   ASSERT(ret == 0);
+   free_extent_map(em);
+
+   em = alloc_extent_map();
+   if (!em)
+   goto out;
+
+   /* Add [0, 16K) */
+   em->start = 0;
+   em->len = SZ_16K;
+   em->block_start = 0;
+   em->block_len = SZ_16K;
+   ret = btrfs_add_extent_mapping(em_tree, &em, start, len);
+   if (ret)
+   test_msg("case3 [0x%llx 0x%llx): ret %d\n",
+start, start + len, ret);
+   /*
+* Since bytes within em are contiguous, em->block_start is identical to
+* em->start.
+*/
+   if (em &&
+   (start < em->start || start + len > extent_map_end(em) ||
+em->start != em->block_start || em->len != em->block_len))
+   test_msg("case3 [0x%llx 0x%llx): ret %d em (start 0x%llx len 0x%llx block_start 0x%llx block_len 0x%llx)\n",
+start, start + len, ret, em->start, em->len,
+em->block_start, em->block_len);
+   free_extent_map(em);
+out:
+   /* free memory */
+   free_extent_map_tree(em_tree);
+}
+
+/*
+ * Test scenario:
+ *
+ * Suppose that no extent map has been loaded into memory yet.
+ * There is a file extent [0, 16K), two jobs are running concurrently
+ * against it, t1 is buffered writing to [4K, 8K) and t2 is doing dio
+ * read from [0, 4K) or [8K, 12K) or [12K, 16K).
+ *
+ * t1 goes ahead of t2 and adds em [4K, 8K) into tree.
+ *
+ * t1   t2
+ *  cow_file_range()btrfs_get_extent()
+ *-> lookup_extent_mapping()
+ *   -> add_extent_mapping()
+ *-> add_extent_mapping()
+ */
+static void test_case_3(struct extent_map_tree *em_tree)
+{
+   __test_case_3(em_tree, 0);
+   __test_case_3(em_tree, SZ_8K);
+   __test_case_3(em_tree, (12 * 1024ULL));
+}
+
 int btrfs_test_extent_map()
 {
struct extent_map_tree *em_tree;
@@ -196,6 +268,7 @@ int btrfs_test_extent_map()
 
test_case_1(em_tree);
test_case_2(em_tree);
+   test_case_3(em_tree);
 
kfree(em_tree);
return 0;
-- 
2.9.4



[PATCH v2 01/10] Btrfs: fix incorrect block_len in merge_extent_mapping

2018-01-05 Thread Liu Bo
em->block_len can be checked when deciding whether two extent maps are
mergeable.

merge_extent_mapping() only adds the front pad when the front part of em
gets truncated, but it's possible that the end part gets truncated as well.

For both compressed and inline extents, em->block_len is not adjusted
accordingly, and for regular extents, em->block_len always equals em->len;
hence this sets em->block_len to em->len.
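The arithmetic is easy to see in a toy model (plain Python, not kernel code;
for illustration, block_start is assumed to start equal to em->start for a
regular, uncompressed extent):

```python
# Toy model of merge_extent_mapping()'s clamping for a regular,
# uncompressed extent: em is squeezed between prev's end and next's start.

SZ_1K = 1024

def merge_clamp(em_start, em_len, prev_end, next_start, old_behavior=False):
    start = max(prev_end, em_start)            # front truncation
    end = min(next_start, em_start + em_len)   # possible end truncation
    start_diff = start - em_start
    new_len = end - start
    block_start = em_start + start_diff        # em->block_start += start_diff
    # Old code: block_len -= start_diff, which ignores end truncation.
    # Fixed code: block_len = em->len, always right for regular extents.
    block_len = em_len - start_diff if old_behavior else new_len
    return (start, new_len, block_start, block_len)

# em [0, 32K) between a prev ending at 8K and a next starting at 24K:
print(merge_clamp(0, 32 * SZ_1K, 8 * SZ_1K, 24 * SZ_1K, old_behavior=True))
# -> (8192, 16384, 8192, 24576): block_len != len, the bug
print(merge_clamp(0, 32 * SZ_1K, 8 * SZ_1K, 24 * SZ_1K))
# -> (8192, 16384, 8192, 16384): block_len == len
```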

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3c..2784bb3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6860,7 +6860,7 @@ static int merge_extent_mapping(struct extent_map_tree *em_tree,
if (em->block_start < EXTENT_MAP_LAST_BYTE &&
!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
em->block_start += start_diff;
-   em->block_len -= start_diff;
+   em->block_len = em->len;
}
return add_extent_mapping(em_tree, em, 0);
 }
-- 
2.9.4



[PATCH v2 02/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent

2018-01-05 Thread Liu Bo
This fixes a corner case that is caused by a race of dio write vs dio
read/write.

Here is how the race could happen.

Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).

t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).

--
 t1t2
  btrfs_get_blocks_direct() btrfs_get_blocks_direct()
   -> btrfs_get_extent()  -> btrfs_get_extent()
   -> lookup_extent_mapping()
   -> add_extent_mapping()-> lookup_extent_mapping()
  # load [0, 32K)
   -> btrfs_new_extent_direct()
   -> btrfs_drop_extent_cache()
  # split [0, 32K) and
  # drop [8K, 32K)
   -> add_extent_mapping()
  # add [8K, 32K)
  -> add_extent_mapping()
 # handle -EEXIST when adding
 # [0, 32K)
--
Here is how t2 (dio read/write) runs into -EEXIST:

a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),

b) search_extent_mapping() then returns [0, 8k) as the existing em,
   even though start == existing->start, em is [0, 32k) so that
   extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,

c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
   (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,

d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
   which confuses applications.

To summarize, here are all the possible situations:
1) start < existing->start

+---+em+---+
+--prev---+ | +-+  |
| | | | |  |
+-+ + +---+existing++  ++
+
|
+
 start

2) start == existing->start

  +em+
  | +-+  |
  | | |  |
  + +existing-+  +
|
|
+
 start

3) start > existing->start && start < (existing->start + existing->len)

  +em+
  | +-+  |
  | | |  |
  + +existing-+  +
   |
   |
   +
 start

4) start >= (existing->start + existing->len)

+---+em+---+
| +-+  | +--next---+
| | |  | | |
+ +---+existing++  + +-+
  +
  |
  +
   start

As we can see, it turns out that if start is within the existing em (front
inclusive), then the existing em should be returned as is; otherwise, we
try our best to merge the candidate em with sibling ems to form a larger
em (in order to reduce the total number of ems).
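The old and new conditions can be contrasted with a toy model (plain Python,
not kernel code; extent maps are reduced to (start, len) pairs and the old
check's block_start comparison is omitted):

```python
# Toy model of btrfs_get_extent()'s -EEXIST handling, before and after
# this fix.  An extent map is a half-open range modeled as (start, len).

def extent_map_end(em):
    return em[0] + em[1]

def handle_eexist(existing, em, start, fixed):
    # `existing` is what search_extent_mapping() found; `em` is the map we
    # failed to insert; `start` is the offset the caller asked for.
    if fixed:
        # New check: start falls inside existing (front inclusive).
        if existing[0] <= start < extent_map_end(existing):
            return ("return existing", existing)
    else:
        # Old check: existing must fully encompass em from the same start.
        if existing[0] == em[0] and \
           extent_map_end(existing) >= extent_map_end(em):
            return ("return existing", existing)
    return ("merge", em)

# The race above: the tree holds [0, 8K) and [8K, 32K); t2 re-adds [0, 32K)
# for start == 0, and search_extent_mapping() returns [0, 8K).
existing, em = (0, 8 * 1024), (0, 32 * 1024)
print(handle_eexist(existing, em, 0, fixed=False))  # old: falls into merge,
                                                    # which then hits -EEXIST
print(handle_eexist(existing, em, 0, fixed=True))   # new: reuses [0, 8K)
```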

Reported-by: David Vallender 
Signed-off-by: Liu Bo 
---
v2: Improve commit log to provide more details about the bug.

 fs/btrfs/inode.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2784bb3..a270fe2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7162,19 +7162,12 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 * existing will always be non-NULL, since there must be
 * extent causing the -EEXIST.
 */
-   if (existing->start == em->start &&
-   extent_map_end(existing) >= extent_map_end(em) &&
-   em->block_start == existing->block_start) {
-   /*
-* The existing extent map already encompasses the
-* entire extent map we tried to add.
-*/
+   if (start >= existing->start &&
+   start < extent_map_end(existing)) {
free_extent_map(em);
em = existing;
err = 0;
-
-   } else if (start >= extent_map_end(existing) ||
-   start <= existing->start) {
+   } else {
/*
 * The existing extent map is the one nearest to
 * the [start, start + len) range which overlaps
@@ -7186,10 +7179,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
free_extent_map(em);
em = NULL;
}
-   } else {
-   

[PATCH v2 09/10] Btrfs: add tracepoint for em's EEXIST case

2018-01-05 Thread Liu Bo
This adds a tracepoint, 'btrfs_handle_em_exist', to help debug the subtle
bugs around merge_extent_mapping().

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_map.c|  3 +++
 include/trace/events/btrfs.h | 35 +++
 2 files changed, 38 insertions(+)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index b5d0add..a5a1d17 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -552,6 +552,9 @@ int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
ret = 0;
 
existing = search_extent_mapping(em_tree, start, len);
+
+   trace_btrfs_handle_em_exist(existing, em, start, len);
+
/*
 * existing will always be non-NULL, since there must be
 * extent causing the -EEXIST.
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4342a32..b7ffcf7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
  __entry->refs, __entry->compress_type)
 );
 
+TRACE_EVENT(btrfs_handle_em_exist,
+
+   TP_PROTO(const struct extent_map *existing, const struct extent_map *map, u64 start, u64 len),
+
+   TP_ARGS(existing, map, start, len),
+
+   TP_STRUCT__entry(
+   __field(u64,  e_start   )
+   __field(u64,  e_len )
+   __field(u64,  map_start )
+   __field(u64,  map_len   )
+   __field(u64,  start )
+   __field(u64,  len   )
+   ),
+
+   TP_fast_assign(
+   __entry->e_start= existing->start;
+   __entry->e_len  = existing->len;
+   __entry->map_start  = map->start;
+   __entry->map_len= map->len;
+   __entry->start  = start;
+   __entry->len= len;
+   ),
+
+   TP_printk("start=%llu len=%llu "
+ "existing(start=%llu len=%llu) "
+ "em(start=%llu len=%llu)",
+ (unsigned long long)__entry->start,
+ (unsigned long long)__entry->len,
+ (unsigned long long)__entry->e_start,
+ (unsigned long long)__entry->e_len,
+ (unsigned long long)__entry->map_start,
+ (unsigned long long)__entry->map_len)
+);
+
 /* file extent item */
 DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
 
-- 
2.9.4



[PATCH v2 03/10] Btrfs: add helper for em merge logic

2018-01-05 Thread Liu Bo
This is preparatory work for the following extent map selftests, which run
tests against the em merge logic.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 80 
 2 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b2e09fe..328f40f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3148,6 +3148,8 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
int delay_iput);
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work);
 
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len);
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start,
u64 len, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a270fe2..876118c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6911,6 +6911,51 @@ static noinline int uncompress_inline(struct btrfs_path *path,
return ret;
 }
 
+int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
+struct extent_map **em_in, u64 start, u64 len)
+{
+   int ret;
+   struct extent_map *em = *em_in;
+
+   ret = add_extent_mapping(em_tree, em, 0);
+   /* it is possible that someone inserted the extent into the tree
+* while we had the lock dropped.  It is also possible that
+* an overlapping map exists in the tree
+*/
+   if (ret == -EEXIST) {
+   struct extent_map *existing;
+
+   ret = 0;
+
+   existing = search_extent_mapping(em_tree, start, len);
+   /*
+* existing will always be non-NULL, since there must be
+* extent causing the -EEXIST.
+*/
+   if (start >= existing->start &&
+   start < extent_map_end(existing)) {
+   free_extent_map(em);
+   *em_in = existing;
+   ret = 0;
+   } else {
+   /*
+* The existing extent map is the one nearest to
+* the [start, start + len) range which overlaps
+*/
+   ret = merge_extent_mapping(em_tree, existing,
+  em, start);
+   free_extent_map(existing);
+   if (ret) {
+   free_extent_map(em);
+   *em_in = NULL;
+   }
+   }
+   }
+
+   ASSERT(ret == 0 || ret == -EEXIST);
+   return ret;
+}
+
 /*
  * a bit scary, this does extent mapping from logical file offset to the disk.
  * the ugly parts come from merging extents from the disk with the in-ram
@@ -7147,40 +7192,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 
err = 0;
write_lock(&em_tree->lock);
-   ret = add_extent_mapping(em_tree, em, 0);
-   /* it is possible that someone inserted the extent into the tree
-* while we had the lock dropped.  It is also possible that
-* an overlapping map exists in the tree
-*/
-   if (ret == -EEXIST) {
-   struct extent_map *existing;
-
-   ret = 0;
-
-   existing = search_extent_mapping(em_tree, start, len);
-   /*
-* existing will always be non-NULL, since there must be
-* extent causing the -EEXIST.
-*/
-   if (start >= existing->start &&
-   start < extent_map_end(existing)) {
-   free_extent_map(em);
-   em = existing;
-   err = 0;
-   } else {
-   /*
-* The existing extent map is the one nearest to
-* the [start, start + len) range which overlaps
-*/
-   err = merge_extent_mapping(em_tree, existing,
-  em, start);
-   free_extent_map(existing);
-   if (err) {
-   free_extent_map(em);
-   em = NULL;
-   }
-   }
-   }
+   err = btrfs_add_extent_mapping(em_tree, &em, start, len);
write_unlock(&em_tree->lock);
 out:
 
-- 
2.9.4



Btrfs progs release 4.14.1

2018-01-05 Thread David Sterba
Hi,

btrfs-progs version 4.14.1 has been released.

Changes:
  * dump-tree: print times of root items
  * check: fix several lowmem mode bugs
  * convert: fix rollback after balance
  * other
* new and updated tests, enabled lowmem mode in CI
* docs updates
* fix travis CI build
* build fixes
* cleanups

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (9):
  btrfs-progs: tests: fix path for travis helper script
  btrfs-progs: fix build of btrfs-show-super
  btrfs-progs: tests: enable check lowmem in travis CI
  btrfs-progs: tests: mkfs/008 mkfs with force
  btrfs-progs: tests: fix typos in test names
  btrfs-progs: build: specify minimal library version for reiserfs support
  btrfs-progs: docs: make option -A of mkfs less visible
  btrfs-progs: update CHANGES for v4.14.1
  Btrfs progs v4.14.1

Faalagorn (1):
  btrfs-progs: docs: fix typo in btrfs-man5

Hans van Kranenburg (1):
  btrfs-progs: dump_tree: remove superfluous _TREE

Howard (1):
  btrfs-progs: docs: update btrfs-subvolume manual page

Lu Fengqi (1):
  btrfs-progs: lowmem check: Reword an unclear error message about file extent gap

Misono, Tomohiro (2):
  btrfs-progs: dump-tree: print c/o/s/r time of ROOT_ITEM
  btrfs-progs: mkfs: check the status of file at mkfs

Nicholas D Steeves (1):
  btrfs-progs: docs: annual typo, clarity, & grammar review & fixups

Qu Wenruo (20):
  btrfs-progs: lowmem check: Fix regression which screws up extent allocator
  btrfs-progs: lowmem check: Fix NULL pointer access caused by large tree reloc tree
  btrfs-progs: lowmem check: Fix inlined data extent ref lookup
  btrfs-progs: lowmem check: Fix false backref lost warning for keyed extent data ref
  btrfs-progs: fsck-test: Introduce test case for false data extent backref lost
  btrfs-progs: backref: Allow backref walk to handle direct parent ref
  btrfs-progs: lowmem check: Fix function call stack overflow caused by wrong tree reloc tree detection
  btrfs-progs: lowmem check: Fix false alerts for image with shared block ref only backref
  btrfs-progs: fsck-test: Add new image with shared block ref only metadata backref
  btrfs-progs: lowmem check: Fix false alerts of referencer count mismatch for snapshot
  btrfs-progs: fsck-tests: Introduce test case with keyed data backref with shared tree blocks
  btrfs-progs: test/fsck: Introduce test images containing tree reloc tree
  btrfs-progs: test/fsck/020: Cleanup custom check function by overriding check_image function
  btrfs-progs: test/fsck/021: Cleanup custom check by overriding check_image
  btrfs-progs: test/common: Introduce run_mustfail_stdout
  btrfs-progs: test/common: Enhance prepare_test_dev to reset device size
  btrfs-progs: mkfs: Enhance minimal device size calculation to fix mkfs failure on small file
  btrfs-progs: mkfs: Only zero out the first 1M for rootdir
  btrfs-progs: convert: Fix a bug in rollback check which overwrite return value
  btrfs-progs: tests/convert: ensure btrfs-convert won't rollback the filesystem after balance

Su Yue (1):
  btrfs-progs: fi defrag: clean up duplicate code if find errors

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] Btrfs: avoid losing data raid profile when deleting a device

2018-01-05 Thread David Sterba
On Wed, Nov 15, 2017 at 04:28:11PM -0700, Liu Bo wrote:
> We've avoided data losing its raid profile when doing balance, but it
> turns out that deleting a device could also result in the same
> problem.
> 
> Say we have 3 disks, and they're created with '-d raid1' profile.
> 
> - We have chunk P (the only data chunk on the empty btrfs).
> 
> - Suppose that chunk P's two raid1 copies reside in disk A and disk B.
> 
> - Now, 'btrfs device remove disk B'
>   btrfs_rm_device()
>     -> btrfs_shrink_device()
>        -> btrfs_relocate_chunk()  # relocate any chunk on disk B
>                                   # to other places
> 
> - Chunk P will be removed and a new chunk will be created to hold
>   those data, but as chunk P is the only one holding raid1 profile,
>   after it goes away, the new chunk will be created as single profile
>   which is our default profile.
> 
> This fixes the problem by creating an empty data chunk before
> relocating the data chunk.
> 
> Metadata/System chunk are supposed to have non-zero bytes all the time
> so their raid profile is preserved.
> 
> Reported-by: James Alandt 
> Signed-off-by: Liu Bo 

Added to 4.16 queue, thanks.


[GIT PULL] Btrfs fixes for 4.15-rc7

2018-01-05 Thread David Sterba
Hi,

we have two more fixes for 4.15, aimed for stable. The leak fix is
obvious, the second patch fixes a bug revealed by the refcount API, when
it behaves differently than previous atomic_t and reports refs going
from 0 to 1 in one case.

No merge conflicts. Please pull, thanks.

The following changes since commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31:

  btrfs: Fix possible off-by-one in btrfs_search_path_in_tree (2017-12-07 
00:35:15 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.15-rc7-tag

for you to fetch changes up to ec35e48b286959991cdbb886f1bdeda4575c80b4:

  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes (2018-01-02 
18:00:14 +0100)


Chris Mason (1):
  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes

Nikolay Borisov (1):
  btrfs: Fix flush bio leak

 fs/btrfs/delayed-inode.c | 45 ++---
 fs/btrfs/volumes.c   |  1 -
 2 files changed, 34 insertions(+), 12 deletions(-)



Re: [PATCH v4] Btrfs: add support for fallocate's zero range operation

2018-01-05 Thread David Sterba
On Sat, Nov 04, 2017 at 04:07:47AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> This implements support for the zero range operation of fallocate. For now
> at least it's as simple as possible while reusing most of the existing
> fallocate and hole punching infrastructure.
> 
> Signed-off-by: Filipe Manana 

FYI, I've added this patch to the rest of the 4.16 queue.


Re: [PATCH] btrfs: handle failure of add_pending_csums

2018-01-05 Thread David Sterba
On Tue, Dec 05, 2017 at 01:51:43PM +0200, Nikolay Borisov wrote:
> add_pending_csums was added as part of the new data=ordered implementation in
> e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Even back then it
> called the btrfs_csum_file_blocks which can fail but it never bothered 
> handling
> the failure. In ENOMEM situation this could lead to the filesystem failing to
> write the checksums for a particular extent and not detect this. On read this
> could lead to the filesystem erroring out due to crc mismatch. Fix it by
> propagating failures from add_pending_csums and handling them.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/inode.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e87ec11c0986..432bffdbb02f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2039,11 +2039,14 @@ static noinline int add_pending_csums(struct 
> btrfs_trans_handle *trans,
>struct inode *inode, struct list_head *list)
>  {
>   struct btrfs_ordered_sum *sum;
> + int ret;
>  
>   list_for_each_entry(sum, list, list) {
>   trans->adding_csums = true;
> - btrfs_csum_file_blocks(trans,
> + ret = btrfs_csum_file_blocks(trans,
>  BTRFS_I(inode)->root->fs_info->csum_root, sum);
> + if (ret)
> + return ret;

The return should come after the line below, otherwise the transaction
will be left in the "adding csums" state.

>   trans->adding_csums = false;

...
>   }
>   return 0;
> @@ -3051,7 +3054,11 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>   goto out;
>   }
>  
> - add_pending_csums(trans, inode, &ordered_extent->list);
> + ret = add_pending_csums(trans, inode, &ordered_extent->list);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);

Ok, we can't do better here, this is too late and
add_pending_csums -> btrfs_csum_file_blocks modifies too much of the
state to be rolled back safely.


Re: [PATCH 0/3] Btrfs: loop retry on raid6 read failures

2018-01-05 Thread David Sterba
On Mon, Dec 04, 2017 at 03:40:34PM -0700, Liu Bo wrote:
> Patch 1 is a simple cleanup.
> Patch 2 fixes a bug in raid56 rbio merging code.
> Patch 3 fixes a bug in the raid6 reconstruction process which can end up
> in read failure even when it could rebuild good data.
> 
> Liu Bo (3):
>   Btrfs: remove redundant check in rbio_can_merge
>   Btrfs: do not merge rbios if their fail stripe index are not identical
>   Btrfs: make raid6 rebuild retry more

All (their most recent version) added to 4.16 queue.


Re: [PATCH] btrfs: add missing BTRFS_SUPER_FLAG define

2018-01-05 Thread David Sterba
On Fri, Jan 05, 2018 at 05:24:22PM +0800, Anand Jain wrote:
> btrfs-progs uses two additional super flag bits. So just define
> them so that we know they are being used.
> 
> Signed-off-by: Anand Jain 
> ---
> The btrfs-progs commits (very old) introduced them,
> 
>  commit 7cc792872a133cabc3467e6ccaf5a2c8ea9e5218
> btrfs-progs: Add CHANGING_FSID super flag
> 
>  commit 797a937e5dd8db0092add633a80f3cd698e182df
> Btrfs-progs: Introduce metadump_v2
> 
> It appears that we need a bit of support from the kernel side, like
> failing to mount if CHANGING_FSID is set. And a device mounted
> with the metadump_v2 flag is kind of broken on the kernel side
> as of now; this patch does not fix those.

Please add the code that uses the flags, at least for the pending
uuid change.


Re: [PATCH 3/3] btrfs: misc cleanup btrfs_scan_one_device()

2018-01-05 Thread David Sterba
On Fri, Dec 15, 2017 at 03:40:16PM +0800, Anand Jain wrote:
> Assign ret = -EINVAL only where it's actually required.
> Remove { } around single-line if/else code.
> 
> Signed-off-by: Anand Jain 

Added to next, thanks.


Re: [PATCH 2/3] btrfs: optimize move uuid_mutex closer to the critical section

2018-01-05 Thread David Sterba
On Fri, Dec 15, 2017 at 03:40:15PM +0800, Anand Jain wrote:
> Move uuid_mutex closer to the exclusion section.

Looks good, there's really something unrelated inside the critical
section so this could potentially speed up scanning devices.

Reviewed-by: David Sterba 


Re: [PATCH 1/3] btrfs: make code easy to read in btrfs_open_one_device()

2018-01-05 Thread David Sterba
On Fri, Dec 15, 2017 at 03:40:14PM +0800, Anand Jain wrote:
> No functional change. First set the usual case (writeable), then check
> for any special config.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5a4c30451c7f..a81574dba124 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -676,14 +676,12 @@ static int btrfs_open_one_device(struct 
> btrfs_fs_devices *fs_devices,
>  
>   device->generation = btrfs_super_generation(disk_super);
>  
> + set_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state);

I would not say there's no functional change. This line will
unconditionally set the writeable flag, but this was not the case
before.

Sure it's dropped a few lines below, but this would need some checking
that it's not a problem. btrfs_open_one_device is indirectly called from
mount so it should be safe (we can't use one device twice), but this
needs to be documented.

>   if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
>   clear_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state);
>   fs_devices->seeding = 1;
> - } else {
> - if (bdev_read_only(bdev))
> - clear_bit(BTRFS_DEV_STATE_WRITEABLE, 
> >dev_state);
> - else
> - set_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state);
> + } else if (bdev_read_only(bdev)) {
> + clear_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state);
>   }
>  
>   q = bdev_get_queue(bdev);
> -- 
> 2.7.0
> 


Filesystem corruption (v4.14) & btrfs-progs btrfs check --repair loop

2018-01-05 Thread Joerie de Gram
Hi,

My filesystem appears to have become corrupted and btrfs check appears
to get stuck in an infinite loop trying to repair it.

The issue initially manifested itself as a BUG (RIP:
btrfs_set_item_key_safe+0x132/0x190) on v4.14.8 - see attached
dmesg.txt. I do not know whether this is the cause of the corruption
or a result.

After the reboot I got a WARNING, updated the kernel (4.14.11) and ran
into similar BUGs trying to reproduce it (dmesg2.txt), finally ending
up with an unmountable filesystem. A repair attempt seems to get stuck
in a loop.

In case it's relevant, the filesystem was receiving some random writes
over NFS as the BUGs hit.

# btrfs check /dev/sda3
Checking filesystem on /dev/sda3
UUID: e72d50ec-9478-4d17-8809-cd8343dace4b
checking extents
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 5 inode 1151247 errors 180, file extent overlap, file extent discount
Found file extent holes:
start: 634880, len: 5967872
ERROR: errors found in fs roots
found 169829797888 bytes used, error(s) found
total csum bytes: 107831016
total tree bytes: 384811008
total fs tree bytes: 245907456
total extent tree bytes: 21479424
btree space waste bytes: 48762162
file data blocks allocated: 169910652928
 referenced 174051184640

# btrfs check --repair /dev/sda3
enabling repair mode
Checking filesystem on /dev/sda3
UUID: e72d50ec-9478-4d17-8809-cd8343dace4b
Fixed 0 roots.
checking extents
No device size related problem found
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 5 inode 1151247 errors 180, file extent overlap, file extent discount
Found file extent holes:
start: 634880, len: 5967872
root 5 inode 1151247 errors 180, file extent overlap, file extent discount
Found file extent holes:
start: 634880, len: 5967872
root 5 inode 1151247 errors 180, file extent overlap, file extent discount
Found file extent holes:
start: 634880, len: 5967872


# uname -a
Linux box 4.14.11-1-ARCH #1 SMP PREEMPT Wed Jan 3 07:02:42 UTC 2018
x86_64 GNU/Linux

# btrfs fi show
...

Label: 'ssd-store'  uuid: e72d50ec-9478-4d17-8809-cd8343dace4b
Total devices 1 FS bytes used 158.17GiB
devid1 size 260.00GiB used 216.99GiB path /dev/sda3

Thanks,
Joerie
Jan 04 23:29:45 box kernel: kernel BUG at fs/btrfs/ctree.c:3188!
Jan 04 23:29:45 box kernel: invalid opcode:  [#1] PREEMPT SMP
Jan 04 23:29:45 box kernel: Modules linked in: ipt_REJECT nf_reject_ipv4 
macvtap macvlan fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache 
wireguard(O) ip6_udp_tunnel udp_tunnel vhost_net vhost tap ebtable_filter 
ebtables ip6table_filter ip6_tables devlink tun ccm 8021q mrp bridge stp llc 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack libcrc32c 
crc32c_generic iptable_filter nls_iso8859_1 nls_cp437 vfat fat sch_fq_codel 
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 intel_rapl 
iwlmvm x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 kvm_intel i915 
kvm led_class iTCO_wdt iTCO_vendor_support mousedev iwlwifi crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel evdev mac_hid pcbc snd_hda_intel snd_hda_codec 
snd_hda_core cfg80211 e1000e igb aesni_intel aes_x86_64 crypto_simd
Jan 04 23:29:45 box kernel:  glue_helper cryptd snd_hwdep intel_cstate 
intel_rapl_perf snd_pcm pcspkr snd_timer drm_kms_helper snd ptp i2c_algo_bit 
pps_core i2c_i801 soundcore dca drm btusb intel_gtt mei_me mei shpchp hci_uart 
zfs(PO) btrtl agpgart zunicode(PO) btbcm btqca syscopyarea sysfillrect zavl(PO) 
btintel icp(PO) sysimgblt hid_generic bluetooth fb_sys_fops thermal wmi fan 
battery ov5693(C) acpi_pad ecdh_generic video v4l2_common rfkill videodev crc16 
pinctrl_sunrisepoint i2c_hid pinctrl_intel intel_lpss_acpi intel_lpss acpi_als 
kfifo_buf tpm_infineon media tpm_tis tpm_tis_core industrialio button tpm 
zcommon(PO) nfsd znvpair(PO) spl(O) auth_rpcgss oid_registry nfs_acl sg lockd 
crypto_user grace sunrpc acpi_call(O) ip_tables x_tables btrfs xor 
zstd_decompress zstd_compress xxhash raid6_pq usbhid hid sd_mod
Jan 04 23:29:45 box kernel:  crc32c_intel ahci libahci xhci_pci xhci_hcd libata 
usbcore scsi_mod usb_common serio vfio_pci irqbypass vfio_virqfd 
vfio_iommu_type1 vfio
Jan 04 23:29:45 box kernel: CPU: 3 PID: 24734 Comm: nfsd Tainted: P C O 
   4.14.8-1-ARCH #1
Jan 04 23:29:45 box kernel: Hardware name: Gigabyte Technology Co., Ltd. 
H170N-WIFI/H170N-WIFI-CF, BIOS F22a 07/04/2017
Jan 04 23:29:45 box kernel: task: 8f52cebd4c40 task.stack: 9e889286
Jan 04 23:29:45 box kernel: RIP: 0010:btrfs_set_item_key_safe+0x132/0x190 
[btrfs]
Jan 04 23:29:45 box kernel: RSP: 0018:9e8892863670 EFLAGS: 00010246
Jan 04 23:29:45 box kernel: RAX:  RBX: 8f5369fe1310 RCX: 
0009b000
Jan 04 23:29:45 box kernel: RDX: 0011910f RSI: 9e8892863776 RDI: 
9e8892863687
Jan 04 23:29:45 box kernel: RBP: 8f53588f6000 R08: 1000 

Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort

2018-01-05 Thread Filipe Manana
On Wed, Jan 3, 2018 at 3:39 PM, Timofey Titovets  wrote:
> 2018-01-03 14:40 GMT+03:00 Filipe Manana :
>> On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets  
>> wrote:
>>> Insertion sort generally performs better than bubble sort,
>>> with fewer iterations on average.
>>> This version also tries to place each element at its final position
>>> instead of doing a raw swap.
>>>
>>> I'm not sure how many stripes per raid56 bio
>>> btrfs tries to store (and tries to sort).
>>
>> If you don't know it, besides unlikely to be doing the best possible
>> thing here, you might actually make things worse or not offering any
>> benefit. IOW, you should know it for sure before submitting such
>> changes.
>>
>> You should know if the number of elements to sort is big enough such
>> that an insertion sort is faster than a bubble sort, and more
>> importantly, measure it and mention it in the changelog.
>> As it is, you are showing lack of understanding of the code and
>> component you are touching, and leaving many open questions such as
>> how faster this is, why insertion sort and not a
>> quick/merge/heap/whatever sort, etc.
>> --
>> Filipe David Manana,
>>
>> “Whether you think you can, or you think you can't — you're right.”
>
> Sorry, you are right,
> I must do some tests and investigations before sending a patch.
> (I just tried to believe in some magic math things.)
>
> Input size depends on the number of devs,
> so on small arrays, like 3-5, there is no meaningful difference.
>
> Example: raid6 (with 4 disks) produce many stripe line addresses like:
> 1. 4641783808 4641849344 4641914880 18446744073709551614
> 2. 4641652736 4641718272 18446744073709551614 4641587200
> 3. 18446744073709551614 4636475392 4636540928 4636606464
> 4. 4641521664 18446744073709551614 4641390592 4641456128
>
> For that count of elements any sorting algo will work fast enough.
>
> Let's, consider that addresses as random non-repeating numbers.
>
> We can use tool like Sound Of Sorting (SoS) to make some
> easy to interpret tests of algorithms.

Nack.
My point was about testing in the btrfs code and not somewhere else.
We can all get estimations from CS books, websites, etc for multiple
algorithms for different input sizes. And these are typically
comparing the average case, and while some algorithms perform better
than others in the average case, things can get reversed in the worst
case (heap sort vs quick sort iirc, better in worst case but usually
worse in the average case).
What matters is in the btrfs context - that where things have to be measured.


>
> (Sorry, no script to reproduce, as SoS not provide a cli,
> just hand made by run SoS with different params).
>
> Table (also in attachment with source data points):
>
> Sort_algo | Metric         | 3 disks | 4    | 6    | 8    | 10    | 12    | 14    | AVG
> Bubble    | Comparisons    | 3       | 6    | 15   | 28   | 45    | 66    | 91    | 36.2857142857143
> Bubble    | Array accesses | 7.8     | 18.2 | 45.8 | 81.8 | 133.4 | 192   | 268.6 | 106.8
> Insertion | Comparisons    | 2.8     | 5    | 11.6 | 17   | 28.6  | 39.4  | 55.2  | 22.8
> Insertion | Array accesses | 8.4     | 13.6 | 31   | 48.8 | 80.4  | 109.6 | 155.8 | 63.9428571428571
>
> i.e. at sizes like 3-4 there is not much difference;
> insertion sort will work faster on bigger arrays (up to 1.7x for a
> 14-disk array).
>
> Does that make sense?
> I think yes, i.e. in any case that saves several dozen machine instructions,
> which can be used elsewhere.
>
> P.S. As for heap sort, which is also available in the kernel via sort():
> that would add too much overhead for such a small number of devices,
> i.e. heap sort shows a profit over insertion sort only at 16+ cells in the array.
>
> /* Snob mode on */
> P.S.S.
> Heap sort & others like it need additional memory,

Yes... My point in listing the heap sort and other algorithms was not
meant to propose using any of them but rather for you to explain why
insertion sort and not something else.
And I think you are confusing heap sort with merge sort. Merge sort is
the one that requires extra memory.

> so they are useless to compare in our case,
> but they would work faster, of course.
> /* Snob mode off */
>
> Thanks.
> --
> Have a nice day,
> Timofey.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: optimize code converge mutex unlock

2018-01-05 Thread David Sterba
On Wed, Dec 20, 2017 at 10:19:14AM +0200, Nikolay Borisov wrote:
> 
> 
> On 20.12.2017 08:42, Anand Jain wrote:
> > No functional change; rearrange the mutex_unlock.
> > 
> > Signed-off-by: Anand Jain 
> 
> Reviewed-by: Nikolay Borisov 

Added to for-next.


Re: [PATCH] btrfs: rename btrfs_device::scrub_device to scrub_ctx

2018-01-05 Thread David Sterba
On Wed, Jan 03, 2018 at 04:08:30PM +0800, Anand Jain wrote:
> btrfs_device::scrub_device is not a device which is being scrubbed,
> but it holds the scrub context, so rename it to reflect that. No
> functional changes here.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: [PATCH 2/2] btrfS: collapse btrfs_handle_error() into __btrfs_handle_fs_error()

2018-01-05 Thread David Sterba
On Thu, Jan 04, 2018 at 02:11:21PM +0200, Nikolay Borisov wrote:
> 
> 
> On  4.01.2018 12:01, Anand Jain wrote:
> > There is no consumer of btrfs_handle_error() other than
> > __btrfs_handle_fs_error(); further, this function is quite small.
> > Merge it into its parent.
> > 
> > Signed-off-by: Anand Jain 
> 
> Reviewed-by: Nikolay Borisov 

Both patches added to for-next, thanks.


Re: [PATCH 1/2 RESEND] Btrfs: make raid6 rebuild retry more

2018-01-05 Thread David Sterba
On Tue, Jan 02, 2018 at 01:36:41PM -0700, Liu Bo wrote:
> There is a scenario that can end up with the rebuild process failing to
> return good content, i.e.
> suppose that all disks can be read without problems but the content
> that was read out doesn't match its checksum; currently for raid6
> btrfs retries at most twice:
> 
> - the 1st retry is to rebuild with all other stripes, it'll eventually
>   be a raid5 xor rebuild,
> - if the 1st fails, the 2nd retry will deliberately fail parity p so
>   that it will do raid6 style rebuild,
> 
> however, the chances are that another non-parity stripe content also
> has something corrupted, so that the above retries are not able to
> return correct content, and users will think of this as data loss.
> More seriously, if the loss happens on some important internal btree
> roots, it could refuse to mount.
> 
> This extends btrfs to do more retries and each retry fails only one
> stripe.  Since raid6 can tolerate 2 disk failures, if there is one
> more failure besides the failure on which we're recovering, this can
> always work.
> 
> The worst case is to retry as many times as the number of raid6 disks,
> but given the fact that such a scenario is really rare in practice,
> it's still acceptable.
> 
> Signed-off-by: Liu Bo 

1 and added to for-next.


Re: [PATCH] btrfs: Don't generate UUID for non-fs tree

2018-01-05 Thread David Sterba
On Thu, Jan 04, 2018 at 09:38:46AM +0800, Qu Wenruo wrote:
> >> -  uuid_le uuid;
> >> +  uuid_le uuid = { 0 };
> > 
> > I get a warning with gcc 4.8.5
> > 
> > fs/btrfs/disk-io.c:1236:2: warning: missing braces around initializer 
> > [-Wmissing-braces]
> > 
> > but no warning with gcc 7.2.1 (built as 'make ccflags-y=-Wmissing-braces
> > and checking that the option is really there). I think we should use
> > NULL_UUID_LE.
> 
> Do I need to resend the whole patch or a new delta patch?

Not needed, I've fixed in the patch,

uuid_le uuid = NULL_UUID_LE;


[v6 05/16] btrfs-progs: scrub: Introduce functions to scrub mirror based tree block

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce new functions, check/recover_tree_mirror(), to check and
recover mirror-based tree blocks (Single/DUP/RAID0/1/10).

check_tree_mirror() can also be used on in-memory tree blocks using @data
parameter.
This is very handy for RAID5/6 case, either checking the data stripe
tree block by @bytenr and 0 as @mirror, or using @data parameter for
recovered in-memory data.

While recover_tree_mirror() is only used for mirror-based profiles, as
RAID56 recovery is done by stripe unit, not mirror unit.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 disk-io.c |   4 +-
 disk-io.h |   2 +
 scrub.c   | 145 ++
 3 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index f5edc479..1abc6f71 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -51,8 +51,8 @@ static u32 max_nritems(u8 level, u32 nodesize)
sizeof(struct btrfs_key_ptr));
 }
 
-static int check_tree_block(struct btrfs_fs_info *fs_info,
-   struct extent_buffer *buf)
+int check_tree_block(struct btrfs_fs_info *fs_info,
+struct extent_buffer *buf)
 {
 
struct btrfs_fs_devices *fs_devices;
diff --git a/disk-io.h b/disk-io.h
index f6a422f2..0ed7624e 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -118,6 +118,8 @@ int read_whole_eb(struct btrfs_fs_info *info, struct 
extent_buffer *eb, int mirr
 struct extent_buffer* read_tree_block(struct btrfs_fs_info *fs_info, u64 
bytenr,
u64 parent_transid);
 
+int check_tree_block(struct btrfs_fs_info *fs_info,
+struct extent_buffer *buf);
 int read_extent_data(struct btrfs_fs_info *fs_info, char *data, u64 logical,
 u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/scrub.c b/scrub.c
index 41c40108..00786dd3 100644
--- a/scrub.c
+++ b/scrub.c
@@ -117,3 +117,148 @@ static struct scrub_full_stripe *alloc_full_stripe(int 
nr_stripes,
}
return ret;
 }
+
+static inline int is_data_stripe(struct scrub_stripe *stripe)
+{
+   u64 bytenr = stripe->logical;
+
+   if (bytenr == BTRFS_RAID5_P_STRIPE || bytenr == BTRFS_RAID6_Q_STRIPE)
+   return 0;
+   return 1;
+}
+
+/*
+ * Check one tree mirror given by @bytenr and @mirror, or @data.
+ * If @data is not given (NULL), the function will try to read out tree block
+ * using @bytenr and @mirror.
+ * If @data is given, use data directly, won't try to read from disk.
+ *
+ * The extra @data parameter is handy for RAID5/6 recovery code to verify
+ * the recovered data.
+ *
+ * Return 0 if everything is OK.
+ * Return <0 something goes wrong, and @scrub_ctx accounting will be updated
+ * if it's a data corruption.
+ */
+static int check_tree_mirror(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+char *data, u64 bytenr, int mirror)
+{
+   struct extent_buffer *eb;
+   u32 nodesize = fs_info->nodesize;
+   int ret;
+
+   if (!IS_ALIGNED(bytenr, fs_info->sectorsize)) {
+   /* Such error will be reported by check_tree_block() */
+   scrub_ctx->verify_errors++;
+   return -EIO;
+   }
+
+   eb = btrfs_find_create_tree_block(fs_info, bytenr);
+   if (!eb)
+   return -ENOMEM;
+   if (data) {
+   memcpy(eb->data, data, nodesize);
+   } else {
+   ret = read_whole_eb(fs_info, eb, mirror);
+   if (ret) {
+   scrub_ctx->read_errors++;
+   error("failed to read tree block %llu mirror %d",
+ bytenr, mirror);
+   goto out;
+   }
+   }
+
+   scrub_ctx->tree_bytes_scrubbed += nodesize;
+   if (csum_tree_block(fs_info, eb, 1)) {
+   error("tree block %llu mirror %d checksum mismatch", bytenr,
+   mirror);
+   scrub_ctx->csum_errors++;
+   ret = -EIO;
+   goto out;
+   }
+   ret = check_tree_block(fs_info, eb);
+   if (ret < 0) {
+   error("tree block %llu mirror %d is invalid", bytenr, mirror);
+   scrub_ctx->verify_errors++;
+   goto out;
+   }
+
+   scrub_ctx->tree_extents_scrubbed++;
+out:
+   free_extent_buffer(eb);
+   return ret;
+}
+
+/*
+ * read_extent_data() helper
+ *
+ * This function will handle short read and update @scrub_ctx when read
+ * error happens.
+ */
+static int read_extent_data_loop(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+char *buf, u64 start, u64 len, int mirror)
+{
+   int ret = 0;
+   u64 cur = 0;
+
+   while (cur < len) {
+   u64 

[v6 07/16] btrfs-progs: scrub: Introduce function to scrub one mirror-based extent

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function, scrub_one_extent(), as a wrapper to check one
mirror-based extent.

It will accept a btrfs_path parameter @path, which must point to a
META/EXTENT_ITEM.
And @start, @len, which must be a subset of META/EXTENT_ITEM.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 scrub.c | 148 +++-
 1 file changed, 147 insertions(+), 1 deletion(-)

diff --git a/scrub.c b/scrub.c
index cee6fe14..b0a98b98 100644
--- a/scrub.c
+++ b/scrub.c
@@ -434,7 +434,7 @@ static int recover_data_mirror(struct btrfs_fs_info 
*fs_info,
 
num_copies = btrfs_num_copies(fs_info, start, len);
for (i = 0; i < num_copies; i++) {
-   for_each_set_bit(bit, corrupt_bitmaps[i], BITS_PER_LONG) {
+   for_each_set_bit(bit, corrupt_bitmaps[i], len / sectorsize) {
u64 cur = start + bit * sectorsize;
int good;
 
@@ -474,3 +474,149 @@ out:
free(buf);
return ret;
 }
+
+/* Btrfs only supports up to 2 copies of data, yet */
+#define BTRFS_MAX_COPIES   2
+
+/*
+ * Check all copies of range @start, @len.
+ * Caller must ensure the range is covered by EXTENT_ITEM/METADATA_ITEM
+ * specified by leaf of @path.
+ * And @start, @len must be a subset of the EXTENT_ITEM/METADATA_ITEM.
+ *
+ * Return 0 if the range is all OK or recovered or recoverable.
+ * Return <0 if the range can't be recoverable.
+ */
+static int scrub_one_extent(struct btrfs_fs_info *fs_info,
+   struct btrfs_scrub_progress *scrub_ctx,
+   struct btrfs_path *path, u64 start, u64 len,
+   int write)
+{
+   struct btrfs_key key;
+   struct btrfs_extent_item *ei;
+   struct extent_buffer *leaf = path->nodes[0];
+   u32 sectorsize = fs_info->sectorsize;
+   unsigned long *corrupt_bitmaps[BTRFS_MAX_COPIES] = { NULL };
+   int slot = path->slots[0];
+   int num_copies;
+   int meta_corrupted = 0;
+   int meta_good_mirror = 0;
+   int data_bad_mirror = 0;
+   u64 extent_start;
+   u64 extent_len;
+   int metadata = 0;
+   int i;
+   int ret = 0;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
+   if (key.type != BTRFS_METADATA_ITEM_KEY &&
+   key.type != BTRFS_EXTENT_ITEM_KEY)
+   goto invalid_arg;
+
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY) {
+   extent_len = fs_info->nodesize;
+   metadata = 1;
+   } else {
+   extent_len = key.offset;
+   ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+   if (btrfs_extent_flags(leaf, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+   metadata = 1;
+   }
+   if (start >= extent_start + extent_len ||
+   start + len <= extent_start)
+   goto invalid_arg;
+
+   for (i = 0; i < BTRFS_MAX_COPIES; i++) {
+   corrupt_bitmaps[i] = malloc(
+   calculate_bitmap_len(len / sectorsize));
+   if (!corrupt_bitmaps[i]) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   }
+   num_copies = btrfs_num_copies(fs_info, start, len);
+   for (i = 1; i <= num_copies; i++) {
+   if (metadata) {
+   ret = check_tree_mirror(fs_info, scrub_ctx,
+   NULL, extent_start, i);
+   scrub_ctx->tree_extents_scrubbed++;
+   if (ret < 0)
+   meta_corrupted++;
+   else
+   meta_good_mirror = i;
+   } else {
+   ret = check_data_mirror(fs_info, scrub_ctx, NULL, start,
+   len, i, corrupt_bitmaps[i - 1]);
+   scrub_ctx->data_extents_scrubbed++;
+   }
+   }
+
+   /* Metadata recover and report */
+   if (metadata) {
+   if (!meta_corrupted) {
+   goto out;
+   } else if (meta_corrupted && meta_corrupted < num_copies) {
+   if (write) {
+   ret = recover_tree_mirror(fs_info, scrub_ctx,
+   start, meta_good_mirror);
+   if (ret < 0) {
+   error("failed to recover tree block at bytenr %llu",
+   start);
+   goto out;
+   }
+   printf("extent %llu len %llu REPAIRED: has corrupted mirror, repaired\n",
+   start, len);
+   goto out;
+   }
+   

[v6 04/16] btrfs-progs: scrub: Introduce structures to support offline scrub for RAID56

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce new local structures, scrub_full_stripe and scrub_stripe, for
incoming offline RAID56 scrub support.

For pure stripe/mirror based profiles, like raid0/1/10/dup/single, we
will follow the original bytenr and mirror number based iteration, so
no extra structures are needed for these profiles.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 Makefile |   3 +-
 scrub.c  | 119 +++
 2 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 scrub.c

diff --git a/Makefile b/Makefile
index ab45ab7f..fa3ebc86 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o 
csum.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o 
csum.o \
+ scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/scrub.c b/scrub.c
new file mode 100644
index ..41c40108
--- /dev/null
+++ b/scrub.c
@@ -0,0 +1,119 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+/*
+ * Main part to implement offline(unmounted) btrfs scrub
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+
+/*
+ * For parity based profile (RAID56)
+ * Mirror/stripe based profiles don't need this; they are iterated by
+ * bytenr and mirror number.
+ */
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   u64 physical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum (nodatasum) */
+   unsigned int csum_missing:1;
+
+   /* Device fd, to write correct data back to disk */
+   int fd;
+
+   char *data;
+};
+
+/*
+ * RAID56 full stripe (data stripes + P/Q)
+ */
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Missing devices */
+   u32 err_missing_devs;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* Corrupted stripe index */
+   int corrupted_index[2];
+
+   int nr_corrupted_stripes;
+
+   /* Already recovered once? */
+   unsigned int recovered:1;
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + sizeof(unsigned long *) +
+   nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
-- 
2.14.3



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v6 09/16] btrfs-progs: scrub: Introduce function to verify parities

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function, verify_parities(), to check whether the
parities of a full stripe match, given data stripes that already match
their csums.

Caller should fill the scrub_full_stripe structure properly before
calling this function.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 scrub.c | 69 +
 1 file changed, 69 insertions(+)

diff --git a/scrub.c b/scrub.c
index 5c1c3957..3db82656 100644
--- a/scrub.c
+++ b/scrub.c
@@ -19,6 +19,7 @@
 #include "disk-io.h"
 #include "utils.h"
 #include "kernel-lib/bitops.h"
+#include "kernel-lib/raid56.h"
 
 /*
  * For parity based profile (RAID56)
@@ -749,3 +750,71 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+/*
+ * Verify parities for RAID56
+ * Caller must fill @fstripe before calling this function
+ *
+ * Return 0 if parities match.
+ * Return >0 for P or Q mismatch
+ * Return <0 for fatal error
+ */
+static int verify_parities(struct btrfs_fs_info *fs_info,
+  struct btrfs_scrub_progress *scrub_ctx,
+  struct scrub_full_stripe *fstripe)
+{
+   void **ptrs;
+   void *ondisk_p = NULL;
+   void *ondisk_q = NULL;
+   void *buf_p;
+   void *buf_q;
+   int nr_stripes = fstripe->nr_stripes;
+   int stripe_len = BTRFS_STRIPE_LEN;
+   int i;
+   int ret = 0;
+
+   ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+   buf_p = malloc(fstripe->stripe_len);
+   buf_q = malloc(fstripe->stripe_len);
+   if (!ptrs || !buf_p || !buf_q) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   for (i = 0; i < fstripe->nr_stripes; i++) {
+   struct scrub_stripe *stripe = &fstripe->stripes[i];
+
+   if (stripe->logical == BTRFS_RAID5_P_STRIPE) {
+   ondisk_p = stripe->data;
+   ptrs[i] = buf_p;
+   continue;
+   } else if (stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+   ondisk_q = stripe->data;
+   ptrs[i] = buf_q;
+   continue;
+   } else {
+   ptrs[i] = stripe->data;
+   continue;
+   }
+   }
+   /* RAID6 */
+   if (ondisk_q) {
+   raid6_gen_syndrome(nr_stripes, stripe_len, ptrs);
+
+   if (memcmp(ondisk_q, ptrs[nr_stripes - 1], stripe_len) != 0 ||
+   memcmp(ondisk_p, ptrs[nr_stripes - 2], stripe_len))
+   ret = 1;
+   } else {
+   ret = raid5_gen_result(nr_stripes, stripe_len, nr_stripes - 1,
+   ptrs);
+   if (ret < 0)
+   goto out;
+   if (memcmp(ondisk_p, ptrs[nr_stripes - 1], stripe_len) != 0)
+   ret = 1;
+   }
+out:
+   free(buf_p);
+   free(buf_q);
+   free(ptrs);
+   return ret;
+}
-- 
2.14.3





[v6 14/16] btrfs-progs: scrub: Introduce function to check a whole block group

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce new function, scrub_one_block_group(), to scrub a block group.

For Single/DUP/RAID0/RAID1/RAID10, we use the old mirror-number based
map_block, and check extent by extent.

For parity based profiles (RAID5/6), we use the new map_block_v2() and
check full stripe by full stripe.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 scrub.c | 92 +
 1 file changed, 92 insertions(+)

diff --git a/scrub.c b/scrub.c
index e474b18a..1f2fd56d 100644
--- a/scrub.c
+++ b/scrub.c
@@ -1198,3 +1198,95 @@ out:
free(map_block);
return ret;
 }
+
+/*
+ * Scrub one block group.
+ *
+ * This function will handle all profiles btrfs currently supports.
+ * Return 0 after scrubbing the block group; any errors found are recorded
+ * in scrub_ctx.
+ * Return <0 for a fatal error preventing the block group from being scrubbed.
+ */
+static int scrub_one_block_group(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+struct btrfs_block_group_cache *bg_cache,
+int write)
+{
+   struct btrfs_root *extent_root = fs_info->extent_root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   u64 bg_start = bg_cache->key.objectid;
+   u64 bg_len = bg_cache->key.offset;
+   int ret;
+
+   if (bg_cache->flags &
+   (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+   u64 cur = bg_start;
+   u64 next;
+
+   while (cur < bg_start + bg_len) {
+   ret = scrub_one_full_stripe(fs_info, scrub_ctx, cur,
+   &next, write);
+   /* Ignore any non-fatal error */
+   if (ret < 0 && ret != -EIO) {
+   error("fatal error happens checking one full stripe at bytenr: %llu: %s",
+   cur, strerror(-ret));
+   return ret;
+   }
+   cur = next;
+   }
+   /* Ignore any -EIO error, such error will be reported at last */
+   return 0;
+   }
+   /* Non-parity based profile, check extent by extent */
+   key.objectid = bg_start;
+   key.type = 0;
+   key.offset = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+   ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   while (1) {
+   struct extent_buffer *eb = path->nodes[0];
+   int slot = path->slots[0];
+   u64 extent_start;
+   u64 extent_len;
+
+   btrfs_item_key_to_cpu(eb, &key, slot);
+   if (key.objectid >= bg_start + bg_len)
+   break;
+   if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+   key.type != BTRFS_METADATA_ITEM_KEY)
+   goto next;
+
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY)
+   extent_len = extent_root->fs_info->nodesize;
+   else
+   extent_len = key.offset;
+
+   ret = scrub_one_extent(fs_info, scrub_ctx, path, extent_start,
+   extent_len, write);
+   if (ret < 0 && ret != -EIO) {
+   error("fatal error checking extent bytenr %llu len %llu: %s",
+   extent_start, extent_len, strerror(-ret));
+   goto out;
+   }
+   ret = 0;
+next:
+   ret = btrfs_next_extent_item(extent_root, path, bg_start +
+bg_len);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_free_path(path);
+   return ret;
+}
-- 
2.14.3





[v6 10/16] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range.

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function, btrfs_check_extent_exists(), to check if there
is any extent in the range specified by user.

The range can be large; if any extent exists in it, the function will
return >0 (in practice it will return 1), and it will return 0 if no
extent is found.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 ctree.h   |  2 ++
 extent-tree.c | 60 +++
 2 files changed, 62 insertions(+)

diff --git a/ctree.h b/ctree.h
index a7d26455..7d58cb33 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2521,6 +2521,8 @@ int exclude_super_stripes(struct btrfs_root *root,
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
 u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset);
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+ u64 len);
 
 /* ctree.c */
 int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
diff --git a/extent-tree.c b/extent-tree.c
index 055582c3..3af0c1f1 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -4256,3 +4256,63 @@ u64 add_new_free_space(struct btrfs_block_group_cache 
*block_group,
 
return total_added;
 }
+
+/*
+ * Check if there is any extent (data or metadata) in the range
+ * [@start, @start + @len)
+ *
+ * Return 0 for no extent found.
+ * Return >0 for found extent.
+ * Return <0 for fatal error.
+ */
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+ u64 len)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   u64 extent_start;
+   u64 extent_len;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = start + len;
+   key.type = 0;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, fs_info->extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   /*
+* Now we're pointing at the slot whose key.objectid >= end, skip to
+* the previous extent.
+*/
+   ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY)
+   extent_len = fs_info->nodesize;
+   else
+   extent_len = key.offset;
+
+   /*
+* search_slot() and previous_extent_item() have ensured that our
+* extent_start < start + len, so we only need to check the extent end.
+*/
+   if (extent_start + extent_len <= start)
+   ret = 0;
+   else
+   ret = 1;
+
+out:
+   btrfs_free_path(path);
+   return ret;
+}
-- 
2.14.3





[v6 08/16] btrfs-progs: scrub: Introduce function to scrub one data stripe

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce new function, scrub_one_data_stripe(), to check all data and
tree blocks inside the data stripe.

This function will not try to recover any error; it only checks whether
any data/tree blocks have a csum mismatch.

Data missing its csum, which is completely valid for cases like
nodatasum, is just recorded, not reported as an error.

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 scrub.c | 129 
 1 file changed, 129 insertions(+)

diff --git a/scrub.c b/scrub.c
index b0a98b98..5c1c3957 100644
--- a/scrub.c
+++ b/scrub.c
@@ -620,3 +620,132 @@ invalid_arg:
error("invalid parameter for %s", __func__);
return -EINVAL;
 }
+
+/*
+ * Scrub one full data stripe of RAID5/6.
+ * This means it will check any data/metadata extent in the data stripe
+ * specified by @stripe and @stripe_len.
+ *
+ * This function will only *CHECK* if the data stripe has any corruption.
+ * No repair is attempted in this function.
+ *
+ * Return 0 if the full stripe is OK.
+ * Return <0 if any error is found.
+ * Note: Missing csum is not counted as error (NODATACSUM is valid)
+ */
+static int scrub_one_data_stripe(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+struct scrub_stripe *stripe, u32 stripe_len)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *extent_root = fs_info->extent_root;
+   struct btrfs_key key;
+   u64 extent_start;
+   u64 extent_len;
+   u64 orig_csum_discards;
+   int ret;
+
+   if (!is_data_stripe(stripe))
+   return -EINVAL;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = stripe->logical + stripe_len;
+   key.offset = 0;
+   key.type = 0;
+
+   ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   while (1) {
+   struct btrfs_extent_item *ei;
+   struct extent_buffer *eb;
+   char *data;
+   int slot;
+   int metadata = 0;
+   u64 check_start;
+   u64 check_len;
+
+   ret = btrfs_previous_extent_item(extent_root, path, 0);
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   if (ret < 0)
+   goto out;
+   eb = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(eb, &key, slot);
+   extent_start = key.objectid;
+   ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
+
+   /* tree block scrub */
+   if (key.type == BTRFS_METADATA_ITEM_KEY ||
+   btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+   extent_len = extent_root->fs_info->nodesize;
+   metadata = 1;
+   } else {
+   extent_len = key.offset;
+   metadata = 0;
+   }
+
+   /* Current extent is out of our range, loop comes to end */
+   if (extent_start + extent_len <= stripe->logical)
+   break;
+
+   if (metadata) {
+   /*
+* Check crossing stripe first, which can't be scrubbed
+*/
+   if (check_crossing_stripes(fs_info, extent_start,
+   extent_root->fs_info->nodesize)) {
+   error("tree block at %llu is crossing stripe boundary, unable to scrub",
+   extent_start);
+   ret = -EIO;
+   goto out;
+   }
+   data = stripe->data + extent_start - stripe->logical;
+   ret = check_tree_mirror(fs_info, scrub_ctx,
+   data, extent_start, 0);
+   /* Any csum/verify error means the stripe is screwed */
+   if (ret < 0) {
+   stripe->csum_mismatch = 1;
+   ret = -EIO;
+   goto out;
+   }
+   ret = 0;
+   continue;
+   }
+   /* Restrict the extent range to fit stripe range */
+   check_start = max(extent_start, stripe->logical);
+   check_len = min(extent_start + extent_len, stripe->logical +
+   stripe_len) - check_start;
+
+   /* Record original csum_discards to detect missing csum case */
+   orig_csum_discards = 

[v6 03/16] btrfs-progs: csum: Introduce function to read out data csums

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function: btrfs_read_data_csums(), to read out csums
for sectors in range.

This is quite useful for reading out data csums, so we don't need to
open-code it.

Signed-off-by: Qu Wenruo 
Signed-off-by: Su Yue 
Signed-off-by: Gu Jinxiang 
---
 Makefile |   2 +-
 csum.c   | 130 +++
 ctree.h  |   4 ++
 kerncompat.h |   3 ++
 utils.h  |   5 +++
 5 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 csum.c

diff --git a/Makefile b/Makefile
index 6369e8f4..ab45ab7f 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o 
csum.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/csum.c b/csum.c
new file mode 100644
index ..a2ce755e
--- /dev/null
+++ b/csum.c
@@ -0,0 +1,130 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include "kerncompat.h"
+#include "kernel-lib/bitops.h"
+#include "ctree.h"
+#include "utils.h"
+
+/*
+ * TODO:
+ * 1) Add write support for csum
+ *So we can write new data extents and add csum into csum tree
+ *
+ * Get csums of range[@start, @start + len).
+ *
+ * @start:      Start offset, shall be aligned to sectorsize.
+ * @len:        Length, shall be aligned to sectorsize.
+ * @csum_ret:   Csum result buffer; its size shall be
+ *              @len / sectorsize * csum_size.
+ * @bitmap_ret: Each bit records whether the corresponding sector has a csum.
+ *              Its size in bytes shall be
+ *              calculate_bitmap_len(@len / sectorsize).
+ *
+ * Returns 0 on success.
+ * Returns >0 on error.
+ * Returns <0 on fatal error.
+ */
+
+int btrfs_read_data_csums(struct btrfs_fs_info *fs_info, u64 start, u64 len,
+ void *csum_ret, unsigned long *bitmap_ret)
+{
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_root *csum_root = fs_info->csum_root;
+   u32 item_offset;
+   u32 item_size;
+   u32 final_offset;
+   u32 final_len;
+   u32 i;
+   u32 sectorsize = fs_info->sectorsize;
+   u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+   u64 cur_start;
+   u64 cur_end;
+   int found = 0;
+   int ret;
+
+   ASSERT(IS_ALIGNED(start, sectorsize));
+   ASSERT(IS_ALIGNED(len, sectorsize));
+   ASSERT(csum_ret);
+   ASSERT(bitmap_ret);
+
+   memset(bitmap_ret, 0, calculate_bitmap_len(len / sectorsize));
+   btrfs_init_path(&path);
+
+   key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+   key.type = BTRFS_EXTENT_CSUM_KEY;
+   key.offset = start;
+
+   ret = btrfs_search_slot(NULL, csum_root, &key, &path, 0, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = btrfs_previous_item(csum_root, &path,
+ BTRFS_EXTENT_CSUM_OBJECTID,
+ BTRFS_EXTENT_CSUM_KEY);
+   if (ret < 0)
+   goto out;
+   }
+   /* The csum tree may be empty. */
+   if (!btrfs_header_nritems(path.nodes[0]))
+   goto next;
+
+   while (1) {
+   btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+
+   if (!IS_ALIGNED(key.offset, sectorsize)) {
+   error("csum item bytenr %llu is not aligned to %u",
+ key.offset, sectorsize);
+   ret = -EIO;
+   break;
+   }
+   /* exceeds end */
+   if (key.offset >= start + len)
+   break;
+
+   item_offset = btrfs_item_ptr_offset(path.nodes[0],
+   path.slots[0]);
+   item_size = btrfs_item_size_nr(path.nodes[0], path.slots[0]);
+
+   if (key.offset + item_size / csum_size * 

[v6 11/16] btrfs-progs: scrub: Introduce function to recover data parity

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce function, recover_from_parities(), to recover data stripes.

It just wraps raid56_recov() with extra sanity checks against the
scrub_full_stripe structure.

Signed-off-by: Qu Wenruo 
---
 scrub.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/scrub.c b/scrub.c
index 3db82656..d88c52e1 100644
--- a/scrub.c
+++ b/scrub.c
@@ -818,3 +818,54 @@ out:
free(ptrs);
return ret;
 }
+
+/*
+ * Try to recover data stripe from P or Q stripe
+ *
+ * Return >0 if the stripe can no longer be repaired.
+ * Return 0 for successful repair or no need to repair at all
+ * Return <0 for fatal error
+ */
+static int recover_from_parities(struct btrfs_fs_info *fs_info,
+ struct btrfs_scrub_progress *scrub_ctx,
+ struct scrub_full_stripe *fstripe)
+{
+   void **ptrs;
+   int nr_stripes = fstripe->nr_stripes;
+   int stripe_len = BTRFS_STRIPE_LEN;
+   int max_tolerance;
+   int i;
+   int ret;
+
+   /* No need to recover */
+   if (!fstripe->nr_corrupted_stripes)
+   return 0;
+
+   /* Already recovered once, no more chance */
+   if (fstripe->recovered)
+   return 1;
+
+   if (fstripe->bg_type & BTRFS_BLOCK_GROUP_RAID5)
+   max_tolerance = 1;
+   else
+   max_tolerance = 2;
+
+   /* Out of repair */
+   if (fstripe->nr_corrupted_stripes > max_tolerance)
+   return 1;
+
+   ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+   if (!ptrs)
+   return -ENOMEM;
+
+   /* Construct ptrs */
+   for (i = 0; i < nr_stripes; i++)
+   ptrs[i] = fstripe->stripes[i].data;
+
+   ret = raid56_recov(nr_stripes, stripe_len, fstripe->bg_type,
+   fstripe->corrupted_index[0],
+   fstripe->corrupted_index[1], ptrs);
+   fstripe->recovered = 1;
+   free(ptrs);
+   return ret;
+}
-- 
2.14.3





[v6 12/16] btrfs-progs: scrub: Introduce helper to write a full stripe

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce an internal helper, write_full_stripe(), to calculate P/Q and
write the whole full stripe.

This is useful to recover RAID56 stripes.

Signed-off-by: Qu Wenruo 
---
 scrub.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/scrub.c b/scrub.c
index d88c52e1..83f02a95 100644
--- a/scrub.c
+++ b/scrub.c
@@ -869,3 +869,47 @@ static int recover_from_parities(struct btrfs_fs_info 
*fs_info,
free(ptrs);
return ret;
 }
+
+/*
+ * Helper to write a full stripe to disk
+ * P/Q will be re-calculated.
+ */
+static int write_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   void **ptrs;
+   int nr_stripes = fstripe->nr_stripes;
+   int stripe_len = BTRFS_STRIPE_LEN;
+   int i;
+   int ret = 0;
+
+   ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+   if (!ptrs)
+   return -ENOMEM;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   ptrs[i] = fstripe->stripes[i].data;
+
+   if (fstripe->bg_type & BTRFS_BLOCK_GROUP_RAID6) {
+   raid6_gen_syndrome(nr_stripes, stripe_len, ptrs);
+   } else {
+   ret = raid5_gen_result(nr_stripes, stripe_len, nr_stripes - 1,
+   ptrs);
+   if (ret < 0)
+   goto out;
+   }
+
+   for (i = 0; i < fstripe->nr_stripes; i++) {
+   struct scrub_stripe *stripe = >stripes[i];
+
+   ret = pwrite(stripe->fd, stripe->data, fstripe->stripe_len,
+stripe->physical);
+   if (ret != fstripe->stripe_len) {
+   ret = -EIO;
+   goto out;
+   }
+   }
+out:
+   free(ptrs);
+   return ret;
+
+}
-- 
2.14.3





[v6 02/16] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

For READ, callers normally hope to get what they requested, rather than
the full stripe map.

In this case, we should remove the unrelated stripes, as in the
following case:
            32K               96K
             |<-request range->|
         0        64K        128K
RAID0:   |  Data 1  |  Data 2   |
            disk1       disk2
Before this patch, we return the full stripe:
Stripe 0: Logical 0, Physical X, Len 64K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 64K, Dev disk2

After this patch, we limit the stripe result to the request range:
Stripe 0: Logical 32K, Physical X+32K, Len 32K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 32K, Dev disk2

And if it's a RAID5/6 stripe, we just handle it like RAID0, ignoring
parities.

This should make the result easier for callers to use.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 103 +-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 2d23712a..72399cde 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1760,6 +1760,107 @@ static int fill_full_map_block(struct map_lookup *map, 
u64 start, u64 length,
return 0;
 }
 
+static void del_one_stripe(struct btrfs_map_block *map_block, int i)
+{
+   int cur_nr = map_block->num_stripes;
+   int size_left = (cur_nr - 1 - i) * sizeof(struct btrfs_map_stripe);
+
+   memmove(&map_block->stripes[i], &map_block->stripes[i + 1], size_left);
+   map_block->num_stripes--;
+}
+
+static void remove_unrelated_stripes(struct map_lookup *map,
+int rw, u64 start, u64 length,
+struct btrfs_map_block *map_block)
+{
+   int i = 0;
+   /*
+* RAID5/6 write must use full stripe.
+* No need to do anything.
+*/
+   if (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6) &&
+   rw == WRITE)
+   return;
+
+   /*
+* For RAID0/1/10/DUP, whatever read/write, we can remove unrelated
+* stripes without causing anything wrong.
+* RAID5/6 READ is just like RAID0; we don't care about parity unless
+* we need to recover.
+* For recovery, rw should be set to WRITE.
+*/
+   while (i < map_block->num_stripes) {
+   struct btrfs_map_stripe *stripe;
+   u64 orig_logical; /* Original stripe logical start */
+   u64 orig_end; /* Original stripe logical end */
+
+   stripe = &map_block->stripes[i];
+
+   /*
+* For READ, we don't really care about parity
+*/
+   if (stripe->logical == BTRFS_RAID5_P_STRIPE ||
+   stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+   del_one_stripe(map_block, i);
+   continue;
+   }
+   /* Completely unrelated stripe */
+   if (stripe->logical >= start + length ||
+   stripe->logical + stripe->length <= start) {
+   del_one_stripe(map_block, i);
+   continue;
+   }
+   /* Covered stripe, modify its logical and physical */
+   orig_logical = stripe->logical;
+   orig_end = stripe->logical + stripe->length;
+   if (start + length <= orig_end) {
+   /*
+* |<--range-->|
+*   |  stripe   |
+* Or
+* ||
+*   |  stripe   |
+*/
+   stripe->logical = max(orig_logical, start);
+   stripe->length = start + length - stripe->logical;
+   stripe->physical += stripe->logical - orig_logical;
+   } else if (start >= orig_logical) {
+   /*
+* |<-range--->|
+* |  stripe |
+* Or
+* ||
+* |  stripe |
+*/
+   stripe->logical = start;
+   stripe->length = min(orig_end, start + length) -
+   stripe->logical;
+   stripe->physical += stripe->logical - orig_logical;
+   }
+   /*
+* Remaining case:
+* ||
+*   | stripe |
+* No need to do any modification
+*/
+   i++;
+   }
+
+   /* Recalculate map_block size */
+   map_block->start = 0;
+   map_block->length = 0;
+   for (i = 0; i < map_block->num_stripes; i++) {
+   struct btrfs_map_stripe *stripe;
+
+   stripe = &map_block->stripes[i];
+   if (stripe->logical > map_block->start)
+   map_block->start = 

[v6 16/16] btrfs-progs: add test for offline-scrub

2018-01-05 Thread Gu Jinxiang
Add a test for offline-scrub.
The process of this test case:
1) Create a filesystem with profile raid10.
2) Mount the filesystem, create a file in the mount point, and write
   some data to the file.
3) Get the logical address of the file's extent data.
4) Get the physical address corresponding to that logical address.
5) Overwrite the contents at the physical address.
6) Use offline scrub to check and repair it.

Signed-off-by: Gu Jinxiang 
---
 Makefile   |  6 ++-
 tests/scrub-tests.sh   | 43 +++
 tests/scrub-tests/001-offline-scrub-raid10/test.sh | 50 ++
 3 files changed, 98 insertions(+), 1 deletion(-)
 create mode 100755 tests/scrub-tests.sh
 create mode 100755 tests/scrub-tests/001-offline-scrub-raid10/test.sh

diff --git a/Makefile b/Makefile
index fa3ebc86..0a3060f5 100644
--- a/Makefile
+++ b/Makefile
@@ -322,6 +322,10 @@ test-cli: btrfs
@echo "[TEST]   cli-tests.sh"
$(Q)bash tests/cli-tests.sh
 
+test-scrub: btrfs mkfs.btrfs
+   @echo "[TEST]   scrub-tests.sh"
+   $(Q)bash tests/scrub-tests.sh
+
 test-clean:
@echo "Cleaning tests"
$(Q)bash tests/clean-tests.sh
@@ -332,7 +336,7 @@ test-inst: all
$(MAKE) $(MAKEOPTS) DESTDIR=$$tmpdest install && \
$(RM) -rf -- $$tmpdest
 
-test: test-fsck test-mkfs test-convert test-misc test-fuzz test-cli
+test: test-fsck test-mkfs test-convert test-misc test-fuzz test-cli test-scrub
 
 #
 # NOTE: For static compiles, you need to have all the required libs
diff --git a/tests/scrub-tests.sh b/tests/scrub-tests.sh
new file mode 100755
index ..697137f4
--- /dev/null
+++ b/tests/scrub-tests.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#
+# btrfs scrub tests
+
+LANG=C
+SCRIPT_DIR=$(dirname $(readlink -f "$0"))
+TOP=$(readlink -f "$SCRIPT_DIR/../")
+TEST_DEV=${TEST_DEV:-}
+RESULTS="$TOP/tests/scrub-tests-results.txt"
+IMAGE="$TOP/tests/test.img"
+
+source "$TOP/tests/common"
+
+export TOP
+export RESULTS
+export LANG
+export IMAGE
+export TEST_DEV
+
+rm -f "$RESULTS"
+
+check_prereq btrfs
+check_kernel_support
+
+# The tests are driven by their custom script called 'test.sh'
+
+for i in $(find "$TOP/tests/scrub-tests" -maxdepth 1 -mindepth 1 -type d \
+   ${TEST:+-name "$TEST"} | sort)
+do
+   echo "[TEST/scrub]   $(basename $i)"
+   cd "$i"
+   echo "=== Entering $i" >> "$RESULTS"
+   if [ -x test.sh ]; then
+   ./test.sh
+   if [ $? -ne 0 ]; then
+   if [[ $TEST_LOG =~ dump ]]; then
+   cat "$RESULTS"
+   fi
+   _fail "test failed for case $(basename $i)"
+   fi
+   fi
+   cd "$TOP"
+done
diff --git a/tests/scrub-tests/001-offline-scrub-raid10/test.sh b/tests/scrub-tests/001-offline-scrub-raid10/test.sh
new file mode 100755
index ..c609d870
--- /dev/null
+++ b/tests/scrub-tests/001-offline-scrub-raid10/test.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+
+source $TOP/tests/common
+
+check_prereq mkfs.btrfs
+check_prereq btrfs
+check_prereq btrfs-debug-tree
+check_prereq btrfs-map-logical
+
+setup_root_helper
+
+setup_loopdevs 4
+prepare_loopdevs
+
+dev1=${loopdevs[1]}
+file=$TEST_MNT/file
+
+mkfs_multi()
+{
+   run_check $SUDO_HELPER $TOP/mkfs.btrfs -f $@ ${loopdevs[@]}
+}
+
+#create filesystem
+mkfs_multi -d raid10 -m raid10
+run_check $SUDO_HELPER mount -t btrfs $dev1 "$TEST_MNT"
+
+#write some data
+run_check $SUDO_HELPER touch $file
+run_check $SUDO_HELPER dd if=/dev/zero of=$file bs=64K count=1
+run_check sync -f $file
+
+#get the extent data's logical address of $file
+logical=$($SUDO_HELPER $TOP/btrfs-debug-tree -t 5 $dev1 | grep -oP '(?<=byte\s)\d+')
+
+#get the first physical address and device of $file's data
+read physical dev < <($SUDO_HELPER $TOP/btrfs-map-logical -l $logical $dev1 | head -1 | cut -d ' ' -f6,8)
+
+#then modify the data
+run_check $SUDO_HELPER dd if=/dev/random of=$dev seek=$(($physical/65536)) bs=64K count=1
+run_check sync -f $file
+
+run_check $SUDO_HELPER umount "$TEST_MNT"
+log=$(run_check_stdout $SUDO_HELPER $TOP/btrfs scrub start --offline $dev1)
+cleanup_loopdevs
+
+#check result
+result=$(echo $log | grep 'len 65536 REPARIED: has corrupted mirror, repaired')
+if [[ -z "$result" ]] ;then
+   _fail "scrub repair failed"
+fi
-- 
2.14.3





[v6 01/16] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function, __btrfs_map_block_v2().

Unlike the old btrfs_map_block(), which needs different parameters to
handle different RAID profiles, this new function uses a unified
btrfs_map_block structure to handle all RAID profiles in a more
meaningful way:

Return physical address along with logical address for each stripe.

For RAID1/Single/DUP (non-striped):
result would be like:
Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2

Result will be as long as possible, since it's not striped at all.

For RAID0/10 (striped without parity):
Result will be aligned to full stripe size:
Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4

For RAID5/6 (striped with parity and dev-rotation):
Result will be aligned to full stripe size:
Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4

The new unified layout should be very flexible and can even handle things
like N-way RAID1 (which the old mirror_num-based interface can't handle well).
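The full-stripe arithmetic behind the RAID0/10 and RAID5/6 examples above (implemented later in this patch's fill_full_map_block()) can be sketched in a few lines. This is an illustrative stand-alone helper, not code from the patch; the struct and function names are invented:

```c
#include <stdint.h>

/* Locate the full stripe containing an offset inside a block group.
 * Mirrors the arithmetic in fill_full_map_block(); names are made up. */
struct fstripe_pos {
	uint64_t logical;  /* logical bytenr where the full stripe starts */
	uint64_t size;     /* full stripe logical size */
	uint64_t phy_off;  /* full stripe offset within each device */
};

static struct fstripe_pos locate_full_stripe(uint64_t bg_start,
					     uint64_t bg_offset,
					     uint32_t stripe_len,
					     int data_stripes,
					     int sub_stripes)
{
	struct fstripe_pos pos;

	pos.size = (uint64_t)stripe_len * data_stripes;
	if (sub_stripes)
		pos.size /= sub_stripes;	/* RAID10: mirrored sub-stripes */
	pos.logical = bg_offset / pos.size * pos.size + bg_start;
	pos.phy_off = bg_offset / pos.size * stripe_len;
	return pos;
}
```

For the RAID10 example above (4 stripes, 2 sub-stripes, 64K stripe length) the full stripe size comes out as 64K * 4 / 2 = 128K, matching the 128K map block length.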

Signed-off-by: Qu Wenruo 
Signed-off-by: Gu Jinxiang 
---
 volumes.c | 181 ++
 volumes.h |  78 +++
 2 files changed, 259 insertions(+)

diff --git a/volumes.c b/volumes.c
index ce3a5405..2d23712a 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1620,6 +1620,187 @@ out:
return 0;
 }
 
+static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
+{
+   struct btrfs_map_block *ret;
+   int size;
+
+   size = sizeof(struct btrfs_map_stripe) * num_stripes +
+   sizeof(struct btrfs_map_block);
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+   memset(ret, 0, size);
+   return ret;
+}
+
+static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
+  struct btrfs_map_block *map_block)
+{
+   u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+   u64 bg_start = map->ce.start;
+   u64 bg_end = bg_start + map->ce.size;
+   u64 bg_offset = start - bg_start; /* offset inside the block group */
+   u64 fstripe_logical = 0;/* Full stripe start logical bytenr */
+   u64 fstripe_size = 0;   /* Full stripe logical size */
+   u64 fstripe_phy_off = 0;/* Full stripe offset in each dev */
+   u32 stripe_len = map->stripe_len;
+   int sub_stripes = map->sub_stripes;
+   int data_stripes = nr_data_stripes(map);
+   int dev_rotation;
+   int i;
+
+   map_block->num_stripes = map->num_stripes;
+   map_block->type = profile;
+
+   /*
+* Common full stripe data for stripe based profiles
+*/
+   if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
+  BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+   fstripe_size = stripe_len * data_stripes;
+   if (sub_stripes)
+   fstripe_size /= sub_stripes;
+   fstripe_logical = bg_offset / fstripe_size * fstripe_size +
+   bg_start;
+   fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
+   }
+
+   switch (profile) {
+   case BTRFS_BLOCK_GROUP_DUP:
+   case BTRFS_BLOCK_GROUP_RAID1:
+   case 0: /* SINGLE */
+   /*
+* Non-striped mode (Single, DUP and RAID1)
+* Just use offset to fill map_block
+*/
+   map_block->stripe_len = 0;
+   map_block->start = start;
+   map_block->length = min(bg_end, start + length) - start;
+   for (i = 0; i < map->num_stripes; i++) {
+   struct btrfs_map_stripe *stripe;
+
+   stripe = &map_block->stripes[i];
+
+   stripe->dev = map->stripes[i].dev;
+   stripe->logical = start;
+   stripe->physical = map->stripes[i].physical + bg_offset;
+   stripe->length = map_block->length;
+   }
+   break;
+   case BTRFS_BLOCK_GROUP_RAID10:
+   case BTRFS_BLOCK_GROUP_RAID0:
+   /*
+* Stripe modes without parity (0 and 10)
+* Return the whole full stripe
+*/
+
+   map_block->start = 

[v6 15/16] btrfs-progs: scrub: Introduce offline scrub function

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Now, btrfs-progs has a kernel scrub equivalent.
A new option, --offline is added to "btrfs scrub start".

If --offline is given, btrfs scrub will just act like kernel scrub, to
check every copy of every extent and report corrupted data and whether
it's recoverable.

The advantages compared to kernel scrub are:
1) No race
   Unlike kernel scrub, which is done in parallel, offline scrub is done
   by a single thread.
   Although it may be slower than the kernel one, it's safer and gives
   no false alerts.

2) Correctness
   Kernel has a known bug (fix submitted) which will recover RAID5/6
   data but screw up P/Q, due to how hard this is to code in the kernel.
   In btrfs-progs, with no page and (almost) no memory size limits, we
   can focus on the scrub and make things easier.

The new offline scrub can detect and report P/Q corruption with a
recoverability report, while the kernel will only report data stripe
errors.

Signed-off-by: Qu Wenruo 
Signed-off-by: Su 
Signed-off-by: Gu Jinxiang 
---
 Documentation/btrfs-scrub.asciidoc |   9 +++
 cmds-scrub.c   | 116 +++--
 ctree.h|   6 ++
 scrub.c|  71 +++
 utils.h|   6 ++
 5 files changed, 204 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-scrub.asciidoc b/Documentation/btrfs-scrub.asciidoc
index eb90a1c4..49527c2a 100644
--- a/Documentation/btrfs-scrub.asciidoc
+++ b/Documentation/btrfs-scrub.asciidoc
@@ -78,6 +78,15 @@ set IO priority classdata (see `ionice`(1) manpage)
 force starting new scrub even if a scrub is already running,
 this can useful when scrub status file is damaged and reports a running
 scrub although it is not, but should not normally be necessary
+--offline
+Do offline scrub.
+NOTE: it's experimental and repair is not supported yet.
+--progress
+Show progress status while doing offline scrub. (Default)
+NOTE: it's only useful with option --offline.
+--no-progress
+Don't show progress status while doing offline scrub.
+NOTE: it's only useful with option --offline.
 
 *status* [-d] |::
 Show status of a running scrub for the filesystem identified by 'path' or
diff --git a/cmds-scrub.c b/cmds-scrub.c
index 5388fdcf..063b4dfd 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -36,12 +36,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ctree.h"
 #include "ioctl.h"
 #include "utils.h"
 #include "volumes.h"
 #include "disk-io.h"
+#include "task-utils.h"
 
 #include "commands.h"
 #include "help.h"
@@ -217,6 +219,32 @@ static void add_to_fs_stat(struct btrfs_scrub_progress *p,
_SCRUB_FS_STAT_MIN(ss, finished, fs_stat);
 }
 
+static void *print_offline_status(void *p)
+{
+   struct task_context *ctx = p;
+   const char work_indicator[] = {'.', 'o', 'O', 'o' };
+   uint32_t count = 0;
+
+   task_period_start(ctx->info, 1000 /* 1s */);
+
+   while (1) {
+   printf("Doing offline scrub [%c] [%llu/%llu]\r",
+  work_indicator[count % 4], ctx->cur, ctx->all);
+   count++;
+   fflush(stdout);
+   task_period_wait(ctx->info);
+   }
+   return NULL;
+}
+
+static int print_offline_return(void *p)
+{
+   printf("\n");
+   fflush(stdout);
+
+   return 0;
+}
+
 static void init_fs_stat(struct scrub_fs_stat *fs_stat)
 {
memset(fs_stat, 0, sizeof(*fs_stat));
@@ -1100,7 +1128,7 @@ static const char * const cmd_scrub_resume_usage[];
 
 static int scrub_start(int argc, char **argv, int resume)
 {
-   int fdmnt;
+   int fdmnt = -1;
int prg_fd = -1;
int fdres = -1;
int ret;
@@ -1124,10 +1152,14 @@ static int scrub_start(int argc, char **argv, int resume)
int n_start = 0;
int n_skip = 0;
int n_resume = 0;
+   int offline = 0;
+   int progress_set = -1;
struct btrfs_ioctl_fs_info_args fi_args;
struct btrfs_ioctl_dev_info_args *di_args = NULL;
struct scrub_progress *sp = NULL;
struct scrub_fs_stat fs_stat;
+   struct task_context task = {0};
+   struct btrfs_fs_info *fs_info = NULL;
struct timeval tv;
struct sockaddr_un addr = {
.sun_family = AF_UNIX,
@@ -1147,7 +1179,18 @@ static int scrub_start(int argc, char **argv, int resume)
int force = 0;
int nothing_to_resume = 0;
 
-   while ((c = getopt(argc, argv, "BdqrRc:n:f")) != -1) {
+   enum { GETOPT_VAL_OFFLINE = 257,
+  GETOPT_VAL_PROGRESS,
+  GETOPT_VAL_NO_PROGRESS};
+   static const struct option long_options[] = {
+   { "offline", no_argument, NULL, GETOPT_VAL_OFFLINE},
+   { "progress", no_argument, NULL, GETOPT_VAL_PROGRESS},
+   { "no-progress", no_argument, NULL, GETOPT_VAL_NO_PROGRESS},
+   

[v6 13/16] btrfs-progs: scrub: Introduce a function to scrub one full stripe

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce a new function, scrub_one_full_stripe(), to check a full
stripe.

It handles the full stripe scrub in the following steps:
0) Check if we need to check full stripe
   If full stripe contains no extent, why waste our CPU and IO?

1) Read out full stripe
   Then we know how many devices are missing or have read errors.
   If it's beyond repair, exit.

   If there is a missing device or a read error, try to recover here.

2) Check data stripes against csums
   We record a data stripe with csum errors as a corrupted stripe, just
   like a missing dev or a read error.
   Then recheck whether the csum mismatch count is still below tolerance.

Finally we check the full stripe using 2 factors only:
A) Whether the full stripe ever went through recovery
B) Whether the full stripe has a csum error

Combine factor A and B we get:
1) A && B: Recovered, csum mismatch
   Screwed up totally
2) A && !B: Recovered, csum match
   Recoverable, data corrupted but P/Q is good to recover
3) !A && B: Not recovered, csum mismatch
   Try to recover corrupted data stripes
   If recovered csum match, then recoverable
   Else, screwed up
4) !A && !B: Not recovered, no csum mismatch
   Best case, just check if P/Q matches.
   If P/Q matches, everything is good
   Else, just P/Q is screwed up, still recoverable.
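The four-case A/B matrix above can be condensed into a tiny decision table. This is a sketch only: the patch implements this logic inline in scrub_one_full_stripe(), and the enum and parameter names here are invented:

```c
/* Verdict for a scrubbed RAID5/6 full stripe; hypothetical names. */
enum fstripe_verdict {
	FS_GOOD,	/* everything matches */
	FS_RECOVERABLE,	/* damaged, but reconstructible */
	FS_CORRUPTED,	/* screwed up totally */
};

/*
 * @recovered:  factor A - the full stripe went through recovery
 * @csum_error: factor B - a data csum mismatch remains
 * @recheck_ok: for case 3 (!A && B): csums matched after a trial
 *              recovery of the corrupted data stripes
 * @pq_ok:      for case 4 (!A && !B): P/Q matches the data stripes
 */
static enum fstripe_verdict judge_full_stripe(int recovered, int csum_error,
					      int recheck_ok, int pq_ok)
{
	if (recovered)
		/* cases 1 and 2 */
		return csum_error ? FS_CORRUPTED : FS_RECOVERABLE;
	if (csum_error)
		/* case 3: recoverable only if the trial recovery checks out */
		return recheck_ok ? FS_RECOVERABLE : FS_CORRUPTED;
	/* case 4: data is fine; a bad P/Q alone is still recoverable */
	return pq_ok ? FS_GOOD : FS_RECOVERABLE;
}
```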

Signed-off-by: Qu Wenruo 
---
 scrub.c | 285 
 1 file changed, 285 insertions(+)

diff --git a/scrub.c b/scrub.c
index 83f02a95..e474b18a 100644
--- a/scrub.c
+++ b/scrub.c
@@ -911,5 +911,290 @@ static int write_full_stripe(struct scrub_full_stripe *fstripe)
 out:
free(ptrs);
return ret;
+}
+
+/*
+ * Return 0 if we still have chance to recover
+ * Return <0 if we have no more chance
+ */
+static int report_recoverablity(struct scrub_full_stripe *fstripe)
+{
+   int max_tolerance;
+   u64 start = fstripe->logical_start;
+
+   if (fstripe->bg_type & BTRFS_BLOCK_GROUP_RAID5)
+   max_tolerance = 1;
+   else
+   max_tolerance = 2;
+
+   if (fstripe->nr_corrupted_stripes > max_tolerance) {
+   error(
+   "full stripe %llu CORRUPTED: too many read error or corrupted devices",
+   start);
+   error(
+   "full stripe %llu: tolerance: %d, missing: %d, read error: %d, csum error: %d",
+   start, max_tolerance, fstripe->err_read_stripes,
+   fstripe->err_missing_devs, fstripe->err_csum_dstripes);
+   return -EIO;
+   }
+   return 0;
+}
+
+static void clear_corrupted_stripe_record(struct scrub_full_stripe *fstripe)
+{
+   fstripe->corrupted_index[0] = -1;
+   fstripe->corrupted_index[1] = -1;
+   fstripe->nr_corrupted_stripes = 0;
+}
+
+static void record_corrupted_stripe(struct scrub_full_stripe *fstripe,
+   int index)
+{
+   int i = 0;
+
+   for (i = 0; i < 2; i++) {
+   if (fstripe->corrupted_index[i] == -1) {
+   fstripe->corrupted_index[i] = index;
+   break;
+   }
+   }
+   fstripe->nr_corrupted_stripes++;
+}
+
+/*
+ * Scrub one full stripe.
+ *
+ * If everything matches, that's good.
+ * If a data stripe is corrupted beyond recovery, it will report it.
+ * If a data stripe is corrupted, try recovery first and recheck csums, to
+ * determine whether it's recoverable or screwed up.
+ */
+static int scrub_one_full_stripe(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+u64 start, u64 *next_ret, int write)
+{
+   struct scrub_full_stripe *fstripe;
+   struct btrfs_map_block *map_block = NULL;
+   u32 stripe_len = BTRFS_STRIPE_LEN;
+   u64 bg_type;
+   u64 len;
+   int i;
+   int ret;
+
+   if (!next_ret) {
+   error("invalid argument for %s", __func__);
+   return -EINVAL;
+   }
+
+   ret = __btrfs_map_block_v2(fs_info, WRITE, start, stripe_len,
+  &map_block);
+   if (ret < 0) {
+   /* Let the caller skip the whole block group */
+   *next_ret = (u64)-1;
+   return ret;
+   }
+   start = map_block->start;
+   len = map_block->length;
+   *next_ret = start + len;
+
+   /*
+* Step 0: Check if we need to scrub the full stripe
+*
+* If no extent lies in the full stripe, no need to check
+*/
+   ret = btrfs_check_extent_exists(fs_info, start, len);
+   if (ret < 0) {
+   free(map_block);
+   return ret;
+   }
+   /* No extents in range, no need to check */
+   if (ret == 0) {
+   free(map_block);
+   return 0;
+   }
+
+   bg_type = map_block->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+   if (bg_type != 

[v6 06/16] btrfs-progs: scrub: Introduce functions to scrub mirror based data blocks

2018-01-05 Thread Gu Jinxiang
From: Qu Wenruo 

Introduce new functions, check/recover_data_mirror(), to check and recover
mirror-based data blocks.

Unlike tree blocks, data blocks must be recovered sector by sector, so we
introduce a corrupt_bitmap for check and recovery.
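The sector-by-sector idea can be sketched as follows. This is an illustrative stand-in, not the patch's check_data_mirror(): toy_csum() replaces the real btrfs_csum_data()/btrfs_csum_final() pair, and the bitmap is kept as raw longs rather than the kernel-lib bitops:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum; the real code uses btrfs crc32c. */
static uint32_t toy_csum(const uint8_t *p, uint32_t n)
{
	uint32_t s = 0;

	while (n--)
		s += *p++;
	return s;
}

/*
 * Verify each sector of one mirror independently and mark failing
 * sectors in @corrupt_bitmap, so a later pass can fetch just those
 * sectors from another mirror. Returns the number of bad sectors.
 */
static unsigned check_sectors(const uint8_t *buf, uint32_t sectorsize,
			      unsigned nsectors, const uint32_t *expected,
			      unsigned long *corrupt_bitmap)
{
	unsigned i, errors = 0;

	for (i = 0; i < nsectors; i++) {
		if (toy_csum(buf + (size_t)i * sectorsize, sectorsize) !=
		    expected[i]) {
			corrupt_bitmap[i / (8 * sizeof(long))] |=
				1UL << (i % (8 * sizeof(long)));
			errors++;
		}
	}
	return errors;
}
```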

Signed-off-by: Qu Wenruo 
Signed-off-by: Su Yue 
Signed-off-by: Gu Jinxiang 
---
 scrub.c | 212 
 1 file changed, 212 insertions(+)

diff --git a/scrub.c b/scrub.c
index 00786dd3..cee6fe14 100644
--- a/scrub.c
+++ b/scrub.c
@@ -18,6 +18,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "utils.h"
+#include "kernel-lib/bitops.h"
 
 /*
  * For parity based profile (RAID56)
@@ -262,3 +263,214 @@ out:
free(buf);
return ret;
 }
+
+/*
+ * Check one data mirror given by @start @len and @mirror, or @data
+ * If @data is not given, try to read it from disk.
+ * This function will try to read out all the data then check sum.
+ *
+ * If @data is given, just use the data.
+ * This behavior is useful for RAID5/6 recovery code to verify recovered data.
+ *
+ * If @corrupt_bitmap is given, record corrupted sectors in that bitmap.
+ * This is useful for mirror based profiles to recover its data.
+ *
+ * Return 0 if everything is OK.
+ * Return <0 if something goes wrong, and @scrub_ctx accounting will be updated
+ * if it's a data corruption.
+ */
+static int check_data_mirror(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+char *data, u64 start, u64 len, int mirror,
+unsigned long *corrupt_bitmap)
+{
+   u32 sectorsize = fs_info->sectorsize;
+   u32 data_csum;
+   u32 *csums = NULL;
+   char *buf = NULL;
+   int ret = 0;
+   int err = 0;
+   int i;
+   unsigned long *csum_bitmap = NULL;
+
+   if (!data) {
+   buf = malloc(len);
+   if (!buf)
+   return -ENOMEM;
+   ret = read_extent_data_loop(fs_info, scrub_ctx, buf, start,
+len, mirror);
+   if (ret < 0)
+   goto out;
+   scrub_ctx->data_bytes_scrubbed += len;
+   } else {
+   buf = data;
+   }
+
+   /* Alloc and Check csums */
+   csums = malloc(len / sectorsize * sizeof(data_csum));
+   if (!csums) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   csum_bitmap = malloc(calculate_bitmap_len(len / sectorsize));
+   if (!csum_bitmap) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   if (corrupt_bitmap)
+   memset(corrupt_bitmap, 0,
+   calculate_bitmap_len(len / sectorsize));
+   ret = btrfs_read_data_csums(fs_info, start, len, csums, csum_bitmap);
+   if (ret < 0)
+   goto out;
+
+   for (i = 0; i < len / sectorsize; i++) {
+   if (!test_bit(i, csum_bitmap)) {
+   scrub_ctx->csum_discards++;
+   continue;
+   }
+
+   data_csum = ~(u32)0;
+   data_csum = btrfs_csum_data(buf + i * sectorsize, data_csum,
+   sectorsize);
+   btrfs_csum_final(data_csum, (u8 *)&data_csum);
+
+   if (memcmp(&data_csum, (char *)csums + i * sizeof(data_csum),
+  sizeof(data_csum))) {
+   error("data at bytenr %llu mirror %d csum mismatch, have 0x%08x expect 0x%08x",
+ start + i * sectorsize, mirror, data_csum,
+ *(u32 *)((char *)csums + i * sizeof(data_csum)));
+   err = 1;
+   scrub_ctx->csum_errors++;
+   if (corrupt_bitmap)
+   set_bit(i, corrupt_bitmap);
+   continue;
+   }
+   scrub_ctx->data_bytes_scrubbed += sectorsize;
+   }
+out:
+   if (!data)
+   free(buf);
+   free(csums);
+   free(csum_bitmap);
+
+   if (!ret && err)
+   return -EIO;
+   return ret;
+}
+
+/* Helper to check all mirrors for a good copy */
+static int has_good_mirror(unsigned long *corrupt_bitmaps[], int num_copies,
+  int bit, int *good_mirror)
+{
+   int found_good = 0;
+   int i;
+
+   for (i = 0; i < num_copies; i++) {
+   if (!test_bit(bit, corrupt_bitmaps[i])) {
+   found_good = 1;
+   if (good_mirror)
+   *good_mirror = i + 1;
+   break;
+   }
+   }
+   return found_good;
+}
+
+/*
+ * Helper function to check @corrupt_bitmaps, to verify if it's recoverable
+ * for mirror based data 

[v6 00/16] Btrfs-progs offline scrub

2018-01-05 Thread Gu Jinxiang
For any one who wants to try it, it can be get from my repo:
https://github.com/gujx2017/btrfs-progs/tree/offline-scrub/

This v6 is just a rebase to v4.14 plus a test for offline-scrub.

Several reports of kernel scrub screwing up good data stripes have been on
the ML for some time.

And since kernel scrub won't account for P/Q corruption, it's quite hard
to detect errors like the kernel screwing up P/Q when scrubbing.

To get a comparable tool for kernel scrub, we need a user-space tool to
act as a benchmark to compare their different behaviors.

So here is the patchset for user-space scrub.

Which can do:
1) All mirror/backup check for non-parity based stripe
   Which means for RAID1/DUP/RAID10, we can really check all mirrors
   other than the 1st good mirror.

   The current "--check-data-csum" option should eventually be replaced
   by offline scrub, as "--check-data-csum" doesn't really check all
   mirrors: once it hits a good copy, the remaining copies are just
   ignored.

   In the v4 update, data checking is further improved, inspired by
   kernel behavior: data extents are now checked sector by sector, so it
   can handle the following corruption case:

   Data extent A contains data from 0~28K.
   And |///| = corrupted  |   | = good
 0   4k  8k  12k 16k 20k 24k 28k
   Mirror 0  |///|   |///|   |///|   |   |
   Mirror 1  |   |///|   |///|   |///|   |

   Extent A should be RECOVERABLE, while in v3 we treated data extent A
   as a whole unit, so the above case was reported as CORRUPTED.
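With per-sector checking, extent A above is lost only if some sector is bad in every mirror at the same offset. A minimal sketch of that rule (a hypothetical helper in the spirit of has_good_mirror() from patch 06, not code from the patchset):

```c
#define MAX_SECTORS 8

/* corrupt[m][s] != 0 means sector s of mirror m failed its csum.
 * The extent is recoverable iff every sector has at least one good
 * mirror. Illustrative only. */
static int extent_recoverable(int num_copies, int nsectors,
			      const int corrupt[][MAX_SECTORS])
{
	int s, m;

	for (s = 0; s < nsectors; s++) {
		int good = 0;

		for (m = 0; m < num_copies; m++)
			if (!corrupt[m][s])
				good = 1;
		if (!good)
			return 0;	/* bad in every mirror: lost */
	}
	return 1;
}
```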

2) RAID5/6 full stripe check
   It will take full use of btrfs csum(both tree and data).
   It will only recover the full stripe if all recovered data matches
   with its csum.

   NOTE: Due to the lack of good bitmap facilities, RAID56 sector by
   sector repair will be quite complex, especially when NODATASUM is
   involved.

   So current RAID56 doesn't support vertical sector recovery yet.

   Data extent A contains data from 0~64K
   And |///| = corrupted while |   | = good
  0   8K  16K 24K 32K 40K 48K 56K 64K
   Data stripe 0  |///|   |///|   |///|   |///|   |
   Data stripe 1  |   |///|   |///|   |///|   |///|
   Parity |   |   |   |   |   |   |   |   |

   Kernel will recover it, while current scrub will report it as
   CORRUPTED.
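For the RAID5 side of the full-stripe check, "verify the parity" is a plain byte-wise XOR across the data stripes. A minimal sketch under that assumption (RAID6's Q stripe needs Galois-field math, which the patchset borrows from the kernel raid6 lib and which is omitted here):

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if @parity is the byte-wise XOR of all @data stripes. */
static int raid5_parity_ok(const uint8_t *const data[], int data_stripes,
			   const uint8_t *parity, size_t stripe_len)
{
	size_t i;
	int d;

	for (i = 0; i < stripe_len; i++) {
		uint8_t x = 0;

		for (d = 0; d < data_stripes; d++)
			x ^= data[d][i];
		if (x != parity[i])
			return 0;
	}
	return 1;
}
```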

3) Repair
   In the v4 update, repair is finally added.

This patchset also introduces a new btrfs_map_block() function, which is
more flexible than the current btrfs_map_block() and has a unified
interface for all profiles, not just an extra array for RAID56.

Check the 6th and 7th patch for details.

They are already used in RAID5/6 scrub, but can be used for other
profiles as well.

The to-do list has been shortened, since repair was added in the v4 update.
1) Test cases
   Need to make the infrastructure able to handle multi-device first.

2) Make btrfsck able to handle RAID5 with a missing device
   Now it doesn't even open a RAID5 btrfs with a missing device, even
   though scrub should be able to handle it.

3) RAID56 vertical sector repair
   Although I consider this case minor compared to RAID1 vertical
   sector repair.
   As for RAID1, an extent can be as large as 128M, while for RAID56 one
   stripe will always be 64K, much smaller than the RAID1 case, making
   the possibility lower.

   I prefer to add this function after the patchset gets merged, as no
   one really likes getting 20 mails every time I update the patchset.

For guys who want to review the patchset, there is a basic function 
relationships slide.
I hope this will reduce the time needed to get what the patchset is doing.
https://docs.google.com/presentation/d/1tAU3lUVaRUXooSjhFaDUeyW3wauHDSg9H-AiLBOSuIM/edit?usp=sharing

Changelog:
V0.8 RFC:
   Initial RFC patchset

v1:
   First formal patchset.
   RAID6 recovery support added, mainly copied from the kernel raid6 lib.
   Cleaner recovery logic.

v2:
   More comments in both code and commit messages, suggested by David.
   File re-arrangement: no check/ dir, raid56.c/h moved to kernel-lib,
   suggested by David.

v3:
  Put the "--offline" option in scrub, rather than in fsck.
  Use a bitmap to read multiple csums in one run, to improve performance.
  Add --progress/--no-progress options, to tell the user we're not just
  wasting CPU and IO.

v4:
  Improve the data check: data extents are checked sector by sector.
  And repair is now supported.
  
v5:
  Just some small fixups of comments in the remaining 15 patches,
  according to problems pointed out by David when merging the first
  5 patches of this patchset.
  And rebase it to 93a9004dde410d920f08f85c6365e138713992d8.

v6:
  rebase to v4.14.
  add a test for offline-scrub.

Gu Jinxiang (1):
  btrfs-progs: add test for offline-scrub

Qu Wenruo (15):
  btrfs-progs: Introduce new btrfs_map_block function which returns more
unified result.
  btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
  btrfs-progs: csum: Introduce function to read out data csums
  btrfs-progs: scrub: Introduce structures to 

Re: [PATCH 2/2] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation

2018-01-05 Thread Su Yue



On 01/05/2018 04:12 PM, Qu Wenruo wrote:

Before this patch, chunk allocation is split into 2 parts:

1) Chunk allocation
Handled by btrfs_alloc_chunk(), which will insert chunk and device
extent items.

2) Block group allocation
Handled by btrfs_make_block_group(), which will insert block group
item and update space info.

However for chunk allocation, we don't really need to split these
operations, as every btrfs_alloc_chunk() call is followed by
btrfs_make_block_group().

So it's reasonable to merge btrfs_make_block_group() call into
btrfs_alloc_chunk() to save several lines, and provides the basis for
later btrfs_alloc_chunk() rework.

Signed-off-by: Qu Wenruo 


Looks good to me.
Reviewed-by: Su Yue 

---
  convert/main.c |  4 
  extent-tree.c  | 10 ++
  mkfs/main.c| 28 
  volumes.c  | 10 ++
  4 files changed, 8 insertions(+), 44 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 8ee858fb2d05..96a04eda5b18 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -915,10 +915,6 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-   BTRFS_BLOCK_GROUP_DATA, cur, len);
-   if (ret < 0)
-   break;
cur += len;
}
}
diff --git a/extent-tree.c b/extent-tree.c
index 4231be11bd53..90e792a3fe62 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1910,15 +1910,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
-   return 0;
+   return ret;
}
-
-   BUG_ON(ret);
-
-   ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
-start, num_bytes);
-   BUG_ON(ret);
-   return 0;
+   return ret;
  }
  
  static int update_block_group(struct btrfs_trans_handle *trans,

diff --git a/mkfs/main.c b/mkfs/main.c
index f8e27a7ec8b8..9377aa30f39d 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -97,12 +97,6 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
error("no space to allocate data/metadata chunk");
goto err;
}
-   if (ret)
-   return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA |
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
if (ret)
return ret;
allocation->mixed += chunk_size;
@@ -116,12 +110,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA,
-chunk_start, chunk_size);
allocation->metadata += chunk_size;
-   if (ret)
-   return ret;
}
  
  	root->fs_info->system_allocs = 0;

@@ -150,12 +139,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
allocation->data += chunk_size;
-   if (ret)
-   return ret;
}
  
  err:

@@ -259,9 +243,6 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
if (ret)
return ret;
  
-	ret = btrfs_make_block_group(trans, fs_info, 0,

-type, chunk_start, chunk_size);
-
type &= BTRFS_BLOCK_GROUP_TYPE_MASK;
if (type == BTRFS_BLOCK_GROUP_DATA) {
allocation->data += chunk_size;
@@ -1006,12 +987,7 @@ static int create_chunks(struct btrfs_trans_handle *trans,
&chunk_start, &chunk_size, meta_type, false);
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-meta_type, chunk_start,
-chunk_size);
allocation->metadata += 

Re: [PATCH 1/2] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk

2018-01-05 Thread Qu Wenruo


On 2018年01月05日 17:36, Su Yue wrote:
> 
> 
> On 01/05/2018 04:12 PM, Qu Wenruo wrote:
>> We used to have two chunk allocators, btrfs_alloc_chunk() and
>> btrfs_alloc_data_chunk(); the former is the more generic one, while the
>> latter is only used in mkfs and convert, to allocate a SINGLE data chunk.
>>
>> Although btrfs_alloc_data_chunk() has some special hacks to cooperate
>> with convert, it's quite simple to integrate it into the generic chunk
>> allocator.
>>
>> So merge them into one btrfs_alloc_chunk(), with extra @convert
>> parameter and necessary comment, to make code less duplicated and less
>> thing to maintain.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>   convert/main.c |   6 +-
>>   extent-tree.c  |   2 +-
>>   mkfs/main.c    |  14 ++--
>>   volumes.c  | 220 ++---
>>   volumes.h  |   5 +-
>>   5 files changed, 98 insertions(+), 149 deletions(-)
>>
>> diff --git a/convert/main.c b/convert/main.c
>> index 4a510a786394..8ee858fb2d05 100644
>> --- a/convert/main.c
>> +++ b/convert/main.c
>> @@ -910,9 +910,9 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
>>     len = min(max_chunk_size,
>>     cache->start + cache->size - cur);
>> -    ret = btrfs_alloc_data_chunk(trans, fs_info,
>> -    &cur_backup, len,
>> -    BTRFS_BLOCK_GROUP_DATA, 1);
>> +    ret = btrfs_alloc_chunk(trans, fs_info,
>> +    &cur_backup, &len,
>> +    BTRFS_BLOCK_GROUP_DATA, true);
>>   if (ret < 0)
>>   break;
>>   ret = btrfs_make_block_group(trans, fs_info, 0,
>> diff --git a/extent-tree.c b/extent-tree.c
>> index db24da3a3a8c..4231be11bd53 100644
>> --- a/extent-tree.c
>> +++ b/extent-tree.c
>> @@ -1907,7 +1907,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>>   return 0;
>>   ret = btrfs_alloc_chunk(trans, fs_info, &start, &num_bytes,
>> -    space_info->flags);
>> +    space_info->flags, false);
>>   if (ret == -ENOSPC) {
>>   space_info->full = 1;
>>   return 0;
>> diff --git a/mkfs/main.c b/mkfs/main.c
>> index 938025bfd32e..f8e27a7ec8b8 100644
>> --- a/mkfs/main.c
>> +++ b/mkfs/main.c
>> @@ -92,7 +92,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
>>   ret = btrfs_alloc_chunk(trans, fs_info,
>>   &chunk_start, &chunk_size,
>>   BTRFS_BLOCK_GROUP_METADATA |
>> -    BTRFS_BLOCK_GROUP_DATA);
>> +    BTRFS_BLOCK_GROUP_DATA, false);
>>   if (ret == -ENOSPC) {
>>   error("no space to allocate data/metadata chunk");
>>   goto err;
>> @@ -109,7 +109,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
>>   } else {
>>   ret = btrfs_alloc_chunk(trans, fs_info,
>>   &chunk_start, &chunk_size,
>> -    BTRFS_BLOCK_GROUP_METADATA);
>> +    BTRFS_BLOCK_GROUP_METADATA, false);
>>   if (ret == -ENOSPC) {
>>   error("no space to allocate metadata chunk");
>>   goto err;
>> @@ -143,7 +143,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
>>   if (!mixed) {
>>   ret = btrfs_alloc_chunk(trans, fs_info,
>>   &chunk_start, &chunk_size,
>> -    BTRFS_BLOCK_GROUP_DATA);
>> +    BTRFS_BLOCK_GROUP_DATA, false);
>>   if (ret == -ENOSPC) {
>>   error("no space to allocate data chunk");
>>   goto err;
>> @@ -251,7 +251,7 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
>>   int ret;
>>     ret = btrfs_alloc_chunk(trans, fs_info,
>> -    &chunk_start, &chunk_size, type);
>> +    &chunk_start, &chunk_size, type, false);
>>   if (ret == -ENOSPC) {
>>   error("not enough free space to allocate chunk");
>>   exit(1);
>> @@ -1003,7 +1003,7 @@ static int create_chunks(struct
>> btrfs_trans_handle *trans,
>>     for (i = 0; i < num_of_meta_chunks; i++) {
>>   ret = btrfs_alloc_chunk(trans, fs_info,
>> -    _start, _size, meta_type);
>> +    _start, _size, meta_type, false);
>>   if (ret)
>>   return ret;
>>   ret = btrfs_make_block_group(trans, fs_info, 0,
>> @@ -1019,8 +1019,8 @@ static int create_chunks(struct
>> btrfs_trans_handle *trans,
>>   if (size_of_data < minimum_data_chunk_size)
>>   size_of_data = minimum_data_chunk_size;
>>   -    ret = btrfs_alloc_data_chunk(trans, fs_info,
>> - _start, size_of_data, data_type, 0);
>> +    ret = btrfs_alloc_chunk(trans, fs_info,
>> +    _start, _of_data, data_type, false);
>>   if (ret)
>>   return ret;
>>   ret = 

Re: [PATCH 1/2] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk

2018-01-05 Thread Su Yue



On 01/05/2018 05:36 PM, Su Yue wrote:



On 01/05/2018 04:12 PM, Qu Wenruo wrote:

We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(); the former is the more generic one, while the
latter is only used in mkfs and convert to allocate a SINGLE data chunk.

Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.

So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, to reduce code duplication and
leave fewer things to maintain.

Signed-off-by: Qu Wenruo 
---
diff --git a/volumes.c b/volumes.c
index fa3c6de023f9..89c2f952f5b3 100644
--- a/volumes.c
+++ b/volumes.c
@@ -844,9 +844,23 @@ error:
  - 2 * sizeof(struct btrfs_chunk))    \
  / sizeof(struct btrfs_stripe) + 1)
+/*
+ * Alloc a chunk, will insert dev extents, chunk item.
+ * NOTE: 

Re: [PATCH 1/2] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk

2018-01-05 Thread Su Yue



On 01/05/2018 04:12 PM, Qu Wenruo wrote:

We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(); the former is the more generic one, while the
latter is only used in mkfs and convert to allocate a SINGLE data chunk.

Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.

So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, to reduce code duplication and
leave fewer things to maintain.

Signed-off-by: Qu Wenruo 

[PATCH] btrfs: add missing BTRFS_SUPER_FLAG define

2018-01-05 Thread Anand Jain
btrfs-progs uses two additional super flag bits, so define them here so
that we know they are in use.

Signed-off-by: Anand Jain 
---
The following (very old) btrfs-progs commits introduced them:

 commit 7cc792872a133cabc3467e6ccaf5a2c8ea9e5218
btrfs-progs: Add CHANGING_FSID super flag

 commit 797a937e5dd8db0092add633a80f3cd698e182df
Btrfs-progs: Introduce metadump_v2

It appears we need a bit of support on the kernel side, such as failing
to mount if CHANGING_FSID is set. Also, a device mounted with the
metadump_v2 flag is somewhat broken on the kernel side as of now; this
patch does not fix those issues.

 include/uapi/linux/btrfs_tree.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 6d6e5da51527..aff1356c2bb8 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -456,6 +456,8 @@ struct btrfs_free_space_header {
 
 #define BTRFS_SUPER_FLAG_SEEDING   (1ULL << 32)
 #define BTRFS_SUPER_FLAG_METADUMP  (1ULL << 33)
+#define BTRFS_SUPER_FLAG_METADUMP_V2   (1ULL << 34)
+#define BTRFS_SUPER_FLAG_CHANGING_FSID (1ULL << 35)
 
 
 /*
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation

2018-01-05 Thread Qu Wenruo
Before this patch, chunk allocation is split into 2 parts:

1) Chunk allocation
   Handled by btrfs_alloc_chunk(), which will insert chunk and device
   extent items.

2) Block group allocation
   Handled by btrfs_make_block_group(), which will insert block group
   item and update space info.

However, we don't really need to split these operations, as every
btrfs_alloc_chunk() call is followed by btrfs_make_block_group().

So it's reasonable to merge the btrfs_make_block_group() call into
btrfs_alloc_chunk() to save several lines and provide the basis for the
later btrfs_alloc_chunk() rework.

Signed-off-by: Qu Wenruo 
---
 convert/main.c |  4 
 extent-tree.c  | 10 ++
 mkfs/main.c| 28 
 volumes.c  | 10 ++
 4 files changed, 8 insertions(+), 44 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 8ee858fb2d05..96a04eda5b18 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -915,10 +915,6 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-   BTRFS_BLOCK_GROUP_DATA, cur, len);
-   if (ret < 0)
-   break;
cur += len;
}
}
diff --git a/extent-tree.c b/extent-tree.c
index 4231be11bd53..90e792a3fe62 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1910,15 +1910,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
-   return 0;
+   return ret;
}
-
-   BUG_ON(ret);
-
-   ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
-start, num_bytes);
-   BUG_ON(ret);
-   return 0;
+   return ret;
 }
 
 static int update_block_group(struct btrfs_trans_handle *trans,
diff --git a/mkfs/main.c b/mkfs/main.c
index f8e27a7ec8b8..9377aa30f39d 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -97,12 +97,6 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
error("no space to allocate data/metadata chunk");
goto err;
}
-   if (ret)
-   return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA |
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
if (ret)
return ret;
allocation->mixed += chunk_size;
@@ -116,12 +110,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA,
-chunk_start, chunk_size);
allocation->metadata += chunk_size;
-   if (ret)
-   return ret;
}
 
root->fs_info->system_allocs = 0;
@@ -150,12 +139,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
allocation->data += chunk_size;
-   if (ret)
-   return ret;
}
 
 err:
@@ -259,9 +243,6 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
if (ret)
return ret;
 
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-type, chunk_start, chunk_size);
-
type &= BTRFS_BLOCK_GROUP_TYPE_MASK;
if (type == BTRFS_BLOCK_GROUP_DATA) {
allocation->data += chunk_size;
@@ -1006,12 +987,7 @@ static int create_chunks(struct btrfs_trans_handle *trans,
					&chunk_start, &chunk_size, meta_type, false);
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-meta_type, chunk_start,
-chunk_size);
allocation->metadata += chunk_size;
-   if (ret)
-   return ret;

[PATCH 1/2] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk

2018-01-05 Thread Qu Wenruo
We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(); the former is the more generic one, while the
latter is only used in mkfs and convert to allocate a SINGLE data chunk.

Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.

So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, to reduce code duplication and
leave fewer things to maintain.

Signed-off-by: Qu Wenruo 
---
 convert/main.c |   6 +-
 extent-tree.c  |   2 +-
 mkfs/main.c|  14 ++--
 volumes.c  | 220 ++---
 volumes.h  |   5 +-
 5 files changed, 98 insertions(+), 149 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index 4a510a786394..8ee858fb2d05 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -910,9 +910,9 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
 
 			len = min(max_chunk_size,
 				  cache->start + cache->size - cur);
-			ret = btrfs_alloc_data_chunk(trans, fs_info,
-						     &cur_backup, len,
-						     BTRFS_BLOCK_GROUP_DATA, 1);
+			ret = btrfs_alloc_chunk(trans, fs_info,
+						&cur_backup, &len,
+						BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
ret = btrfs_make_block_group(trans, fs_info, 0,
diff --git a/extent-tree.c b/extent-tree.c
index db24da3a3a8c..4231be11bd53 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1907,7 +1907,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 		return 0;
 
 	ret = btrfs_alloc_chunk(trans, fs_info, &start, &num_bytes,
-				space_info->flags);
+   space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
return 0;
diff --git a/mkfs/main.c b/mkfs/main.c
index 938025bfd32e..f8e27a7ec8b8 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -92,7 +92,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	ret = btrfs_alloc_chunk(trans, fs_info,
 				&chunk_start, &chunk_size,
 				BTRFS_BLOCK_GROUP_METADATA |
-				BTRFS_BLOCK_GROUP_DATA);
+				BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data/metadata chunk");
goto err;
@@ -109,7 +109,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	} else {
 		ret = btrfs_alloc_chunk(trans, fs_info,
 					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_METADATA);
+					BTRFS_BLOCK_GROUP_METADATA, false);
if (ret == -ENOSPC) {
error("no space to allocate metadata chunk");
goto err;
@@ -143,7 +143,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 	if (!mixed) {
 		ret = btrfs_alloc_chunk(trans, fs_info,
 					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_DATA);
+					BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data chunk");
goto err;
@@ -251,7 +251,7 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
 	int ret;
 
 	ret = btrfs_alloc_chunk(trans, fs_info,
-				&chunk_start, &chunk_size, type);
+				&chunk_start, &chunk_size, type, false);
if (ret == -ENOSPC) {
error("not enough free space to allocate chunk");
exit(1);
@@ -1003,7 +1003,7 @@ static int create_chunks(struct btrfs_trans_handle *trans,
 
for (i = 0; i < num_of_meta_chunks; i++) {
		ret = btrfs_alloc_chunk(trans, fs_info,
-					&chunk_start, &chunk_size, meta_type);
+					&chunk_start, &chunk_size, meta_type, false);
if (ret)
return ret;
ret = btrfs_make_block_group(trans, fs_info, 0,
@@ -1019,8 +1019,8 @@ static int create_chunks(struct btrfs_trans_handle *trans,
if (size_of_data < minimum_data_chunk_size)
size_of_data = minimum_data_chunk_size;
 
-	ret = btrfs_alloc_data_chunk(trans, fs_info,
-				     &chunk_start, size_of_data, 

[PATCH 0/2] Preparation for later btrfs_alloc_chunk() rework, Part 2

2018-01-05 Thread Qu Wenruo
This patchset, along with its prerequisite (patchset named:
"[PATCH 0/5] Cleanups for later btrfs_alloc_chunk() rework") can be
fetched from github:
https://github.com/adam900710/btrfs-progs/tree/chunk_alloc_enospc

This patchset can still be treated as cleanup, but it brings a much
larger structural modification of btrfs_alloc_chunk().

The first patch will merge the original btrfs_alloc_data_chunk() with
more generic btrfs_alloc_chunk().

And the 2nd patch integrates btrfs_make_block_group() into
btrfs_alloc_chunk(), providing the critical basis for the later rework.
(The later rework needs to update space info before any tree
modification, so btrfs_make_block_group() must be integrated.)

Considering the importance of btrfs_alloc_chunk() in btrfs-progs, these
2 patches are separated out for a longer review window before the
larger rework.

Qu Wenruo (2):
  btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk
  btrfs-progs: Make btrfs_alloc_chunk to handle block group creation

 convert/main.c |  10 +--
 extent-tree.c  |  12 +---
 mkfs/main.c|  42 ++-
 volumes.c  | 222 +++--
 volumes.h  |   5 +-
 5 files changed, 102 insertions(+), 189 deletions(-)

-- 
2.15.1
