[PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior

2018-02-08 Thread Qu Wenruo
Kernel uses a delayed chunk allocation behavior for metadata chunks

KERNEL:
btrfs_alloc_chunk()
|- __btrfs_alloc_chunk():   Only allocate chunk mapping
   |- btrfs_make_block_group(): Add corresponding bg to fs_info->new_bgs

Then at transaction commit time, it finishes the remaining work:
btrfs_start_dirty_block_groups():
|- btrfs_create_pending_block_groups()
   |- btrfs_insert_item():  To insert block group item
   |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents

Although btrfs-progs btrfs_alloc_chunk() does all the work in one
function, it can still benefit from a similar refactoring:
btrfs-progs:
btrfs_alloc_chunk():Wrapper for both normal and convert chunks
|- __btrfs_alloc_chunk():   Only alloc chunk mapping
|  |- btrfs_make_block_group(): <>
|- btrfs_finish_chunk_alloc():  Insert chunk items/dev extents

With this refactor, the following functions now share most of their code
with the kernel:
__btrfs_alloc_chunk()
btrfs_finish_chunk_alloc()
btrfs_alloc_dev_extent()

Signed-off-by: Qu Wenruo 
---
 volumes.c | 421 ++
 1 file changed, 260 insertions(+), 161 deletions(-)

diff --git a/volumes.c b/volumes.c
index cff54c612872..e89520326314 100644
--- a/volumes.c
+++ b/volumes.c
@@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device 
*device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
-   struct btrfs_fs_info *fs_info,
-   struct map_lookup *map, u64 stripe_size)
+static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device,
+ u64 chunk_offset, u64 physical,
+ u64 stripe_size)
 {
-   int ret = 0;
-   struct btrfs_path path;
+   int ret;
+   struct btrfs_path *path;
+   struct btrfs_fs_info *fs_info = device->fs_info;
struct btrfs_root *root = fs_info->dev_root;
struct btrfs_dev_extent *extent;
struct extent_buffer *leaf;
struct btrfs_key key;
-   int i;
 
-   btrfs_init_path(&path);
-
-   for (i = 0; i < map->num_stripes; i++) {
-   struct btrfs_device *device = map->stripes[i].dev;
-
-   key.objectid = device->devid;
-   key.offset = map->stripes[i].physical;
-   key.type = BTRFS_DEV_EXTENT_KEY;
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
 
-   ret = btrfs_insert_empty_item(trans, root, &path, &key,
- sizeof(*extent));
-   if (ret < 0)
-   goto out;
-   leaf = path.nodes[0];
-   extent = btrfs_item_ptr(leaf, path.slots[0],
-   struct btrfs_dev_extent);
-   btrfs_set_dev_extent_chunk_tree(leaf, extent,
+   key.objectid = device->devid;
+   key.offset = physical;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
+   if (ret)
+   goto out;
+   leaf = path->nodes[0];
+   extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
+   btrfs_set_dev_extent_chunk_tree(leaf, extent,
BTRFS_CHUNK_TREE_OBJECTID);
-   btrfs_set_dev_extent_chunk_objectid(leaf, extent,
-   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-   btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
-
-   write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
-   (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
-   BTRFS_UUID_SIZE);
-
-   btrfs_set_dev_extent_length(leaf, extent, stripe_size);
-   btrfs_mark_buffer_dirty(leaf);
-   btrfs_release_path(&path);
-
-   device->bytes_used += stripe_size;
-   ret = btrfs_update_device(trans, device);
-   if (ret < 0)
-   goto out;
-   }
+   btrfs_set_dev_extent_chunk_objectid(leaf, extent,
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+   btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
+   btrfs_set_dev_extent_length(leaf, extent, stripe_size);
+   btrfs_mark_buffer_dirty(leaf);
 out:
-   btrfs_release_path(&path);
+   btrfs_free_path(path);
return ret;
 }
 
@@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const 
void *b)
/ sizeof(struct btrfs_stripe) + 1)
 
 /*
- * Alloc a chunk, will insert dev extents, chunk item, and insert new
- * block group and update space info (so that extent 

[PATCH v2 02/10] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk

2018-02-08 Thread Qu Wenruo
We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(); the former is the more generic one, while the
latter is only used in mkfs and convert to allocate a SINGLE data chunk.

Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.

So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, giving us less duplicated code and
fewer things to maintain.

Signed-off-by: Qu Wenruo 
---
 convert/main.c |   6 +-
 extent-tree.c  |   2 +-
 mkfs/main.c|   8 +--
 volumes.c  | 219 ++---
 volumes.h  |   5 +-
 5 files changed, 94 insertions(+), 146 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index b3ea31d7af43..b2444bb2ff21 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -910,9 +910,9 @@ static int make_convert_data_block_groups(struct 
btrfs_trans_handle *trans,
 
len = min(max_chunk_size,
  cache->start + cache->size - cur);
-   ret = btrfs_alloc_data_chunk(trans, fs_info,
-   &cur_backup, len,
-   BTRFS_BLOCK_GROUP_DATA, 1);
+   ret = btrfs_alloc_chunk(trans, fs_info,
+   &cur_backup, &len,
+   BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
ret = btrfs_make_block_group(trans, fs_info, 0,
diff --git a/extent-tree.c b/extent-tree.c
index e2ae74a7fe66..b085ab0352b3 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1906,7 +1906,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
return 0;
 
ret = btrfs_alloc_chunk(trans, fs_info, &start, &num_bytes,
-   space_info->flags);
+   space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
return 0;
diff --git a/mkfs/main.c b/mkfs/main.c
index ea65c6d897b2..358395ca0250 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -82,7 +82,7 @@ static int create_metadata_block_groups(struct btrfs_root 
*root, int mixed,
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
BTRFS_BLOCK_GROUP_METADATA |
-   BTRFS_BLOCK_GROUP_DATA);
+   BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data/metadata chunk");
goto err;
@@ -99,7 +99,7 @@ static int create_metadata_block_groups(struct btrfs_root 
*root, int mixed,
} else {
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
-   BTRFS_BLOCK_GROUP_METADATA);
+   BTRFS_BLOCK_GROUP_METADATA, false);
if (ret == -ENOSPC) {
error("no space to allocate metadata chunk");
goto err;
@@ -133,7 +133,7 @@ static int create_data_block_groups(struct 
btrfs_trans_handle *trans,
if (!mixed) {
ret = btrfs_alloc_chunk(trans, fs_info,
&chunk_start, &chunk_size,
-   BTRFS_BLOCK_GROUP_DATA);
+   BTRFS_BLOCK_GROUP_DATA, false);
if (ret == -ENOSPC) {
error("no space to allocate data chunk");
goto err;
@@ -241,7 +241,7 @@ static int create_one_raid_group(struct btrfs_trans_handle 
*trans,
int ret;
 
ret = btrfs_alloc_chunk(trans, fs_info,
-   &chunk_start, &chunk_size, type);
+   &chunk_start, &chunk_size, type, false);
if (ret == -ENOSPC) {
error("not enough free space to allocate chunk");
exit(1);
diff --git a/volumes.c b/volumes.c
index 677d085de96c..9ee4650351c3 100644
--- a/volumes.c
+++ b/volumes.c
@@ -836,9 +836,23 @@ error:
- 2 * sizeof(struct btrfs_chunk))   \
/ sizeof(struct btrfs_stripe) + 1)
 
+/*
+ * Alloc a chunk, will insert dev extents, chunk item.
+ * NOTE: This function will not insert block group item nor mark newly
+ * allocated chunk available for later allocation.
+ * Block group item and free space update is handled by 
btrfs_make_block_group()
+ *
+ * @start: return value of allocated chunk start bytenr.
+ * @num_bytes: return value of allocated chunk size
+ * @type:  chunk type (including both profile and type)
+ * @convert:   if the chunk is 

[PATCH v2 08/10] btrfs-progs: Move chunk stripe size calculation function to volumes.h

2018-02-08 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 check/main.c | 22 --
 volumes.h| 22 ++
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/check/main.c b/check/main.c
index c051a862eb35..96607f6817af 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7638,28 +7638,6 @@ repair_abort:
return err;
 }
 
-u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
-{
-   u64 stripe_size;
-
-   if (type & BTRFS_BLOCK_GROUP_RAID0) {
-   stripe_size = length;
-   stripe_size /= num_stripes;
-   } else if (type & BTRFS_BLOCK_GROUP_RAID10) {
-   stripe_size = length * 2;
-   stripe_size /= num_stripes;
-   } else if (type & BTRFS_BLOCK_GROUP_RAID5) {
-   stripe_size = length;
-   stripe_size /= (num_stripes - 1);
-   } else if (type & BTRFS_BLOCK_GROUP_RAID6) {
-   stripe_size = length;
-   stripe_size /= (num_stripes - 2);
-   } else {
-   stripe_size = length;
-   }
-   return stripe_size;
-}
-
 /*
  * Check the chunk with its block group/dev list ref:
  * Return 0 if all refs seems valid.
diff --git a/volumes.h b/volumes.h
index 3741d45cae80..950de5a9f910 100644
--- a/volumes.h
+++ b/volumes.h
@@ -216,6 +216,28 @@ static inline int check_crossing_stripes(struct 
btrfs_fs_info *fs_info,
(bg_offset + len - 1) / BTRFS_STRIPE_LEN);
 }
 
+static inline u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
+{
+   u64 stripe_size;
+
+   if (type & BTRFS_BLOCK_GROUP_RAID0) {
+   stripe_size = length;
+   stripe_size /= num_stripes;
+   } else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+   stripe_size = length * 2;
+   stripe_size /= num_stripes;
+   } else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+   stripe_size = length;
+   stripe_size /= (num_stripes - 1);
+   } else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+   stripe_size = length;
+   stripe_size /= (num_stripes - 2);
+   } else {
+   stripe_size = length;
+   }
+   return stripe_size;
+}
+
 int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
  u64 logical, u64 *length, u64 *type,
  struct btrfs_multi_bio **multi_ret, int mirror_num,
-- 
2.16.1
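
As a quick sanity check of the helper moved above (the numbers are only an
example): for a 1GiB RAID10 chunk over 4 stripes the per-device stripe is
length * 2 / num_stripes = 2GiB / 4 = 512MiB, and for RAID5 over 4 stripes
it is length / (num_stripes - 1), roughly 341MiB.

        /* hypothetical usage of the now-shared inline helper */
        u64 per_dev = calc_stripe_length(BTRFS_BLOCK_GROUP_RAID10, SZ_1G, 4);
        /* per_dev == SZ_1G * 2 / 4 == 512MiB */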



[PATCH v2 03/10] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation

2018-02-08 Thread Qu Wenruo
Before this patch, chunk allocation is split into 2 parts:

1) Chunk allocation
   Handled by btrfs_alloc_chunk(), which will insert chunk and device
   extent items.

2) Block group allocation
   Handled by btrfs_make_block_group(), which will insert block group
   item and update space info.

However, for chunk allocation we don't really need to split these
operations, as every btrfs_alloc_chunk() call is followed by
btrfs_make_block_group().

So it's reasonable to merge the btrfs_make_block_group() call into
btrfs_alloc_chunk() to save several lines and provide the basis for the
later btrfs_alloc_chunk() rework.
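
Sketched, the caller pattern this removes looks like the following
(simplified from the hunks below):

        /* before: every call site did both steps */
        ret = btrfs_alloc_chunk(trans, fs_info, &chunk_start, &chunk_size,
                                type, false);
        if (ret)
                return ret;
        ret = btrfs_make_block_group(trans, fs_info, 0, type,
                                     chunk_start, chunk_size);

        /* after: btrfs_alloc_chunk() creates the block group internally */
        ret = btrfs_alloc_chunk(trans, fs_info, &chunk_start, &chunk_size,
                                type, false);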

Signed-off-by: Qu Wenruo 
---
 convert/main.c |  4 
 extent-tree.c  | 10 ++
 mkfs/main.c| 19 ---
 volumes.c  | 10 ++
 4 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index b2444bb2ff21..240d3aa46db9 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -915,10 +915,6 @@ static int make_convert_data_block_groups(struct 
btrfs_trans_handle *trans,
BTRFS_BLOCK_GROUP_DATA, true);
if (ret < 0)
break;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-   BTRFS_BLOCK_GROUP_DATA, cur, len);
-   if (ret < 0)
-   break;
cur += len;
}
}
diff --git a/extent-tree.c b/extent-tree.c
index b085ab0352b3..bccd83d1bae6 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1909,15 +1909,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
space_info->flags, false);
if (ret == -ENOSPC) {
space_info->full = 1;
-   return 0;
+   return ret;
}
-
-   BUG_ON(ret);
-
-   ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
-start, num_bytes);
-   BUG_ON(ret);
-   return 0;
+   return ret;
 }
 
 static int update_block_group(struct btrfs_root *root,
diff --git a/mkfs/main.c b/mkfs/main.c
index 358395ca0250..49159ea533b9 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -87,12 +87,6 @@ static int create_metadata_block_groups(struct btrfs_root 
*root, int mixed,
error("no space to allocate data/metadata chunk");
goto err;
}
-   if (ret)
-   return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA |
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
if (ret)
return ret;
allocation->mixed += chunk_size;
@@ -106,12 +100,7 @@ static int create_metadata_block_groups(struct btrfs_root 
*root, int mixed,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_METADATA,
-chunk_start, chunk_size);
allocation->metadata += chunk_size;
-   if (ret)
-   return ret;
}
 
root->fs_info->system_allocs = 0;
@@ -140,12 +129,7 @@ static int create_data_block_groups(struct 
btrfs_trans_handle *trans,
}
if (ret)
return ret;
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-BTRFS_BLOCK_GROUP_DATA,
-chunk_start, chunk_size);
allocation->data += chunk_size;
-   if (ret)
-   return ret;
}
 
 err:
@@ -249,9 +233,6 @@ static int create_one_raid_group(struct btrfs_trans_handle 
*trans,
if (ret)
return ret;
 
-   ret = btrfs_make_block_group(trans, fs_info, 0,
-type, chunk_start, chunk_size);
-
type &= BTRFS_BLOCK_GROUP_TYPE_MASK;
if (type == BTRFS_BLOCK_GROUP_DATA) {
allocation->data += chunk_size;
diff --git a/volumes.c b/volumes.c
index 9ee4650351c3..a9dc8c939dc5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -837,10 +837,9 @@ error:
/ sizeof(struct btrfs_stripe) + 1)
 
 /*
- * Alloc a chunk, will insert dev extents, chunk item.
- * NOTE: This function will not insert block group item nor mark newly
- * allocated chunk available for later allocation.
- * Block group item and free space update is handled by 
btrfs_make_block_group()
+ * Alloc a chunk, will insert dev extents, chunk item, and insert new
+ * block group and update space info (so 

[PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs

2018-02-08 Thread Qu Wenruo
Used by later btrfs_alloc_chunk() rework.
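
A minimal usage sketch of the ported helper (hypothetical caller, not part
of this patch):

        static int cmp_u64(const void *a, const void *b)
        {
                const u64 *x = a, *y = b;

                return (*x > *y) - (*x < *y);
        }

        static void sort_holes(u64 *holes, size_t nr)
        {
                /* a NULL swap_func lets sort() pick the builtin u64 swap */
                sort(holes, nr, sizeof(*holes), cmp_u64, NULL);
        }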

Signed-off-by: Qu Wenruo 
---
 Makefile  |   3 +-
 kernel-lib/sort.c | 104 ++
 kernel-lib/sort.h |  16 +
 3 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100644 kernel-lib/sort.c
 create mode 100644 kernel-lib/sort.h

diff --git a/Makefile b/Makefile
index 327cdfa08eba..3db7bbe04662 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o 
task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o \
+ kernel-lib/sort.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o check/main.o \
diff --git a/kernel-lib/sort.c b/kernel-lib/sort.c
new file mode 100644
index ..70ae3dbe2852
--- /dev/null
+++ b/kernel-lib/sort.c
@@ -0,0 +1,104 @@
+/*
+ * taken from linux kernel lib/sort.c, removed kernel config code and adapted
+ * for btrfsprogs
+ */
+
+#include "sort.h"
+
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ *
+ * Jan 23 2005  Matt Mackall 
+ */
+
+static int alignment_ok(const void *base, int align)
+{
+   return ((unsigned long)base & (align - 1)) == 0;
+}
+
+static void u32_swap(void *a, void *b, int size)
+{
+   u32 t = *(u32 *)a;
+   *(u32 *)a = *(u32 *)b;
+   *(u32 *)b = t;
+}
+
+static void u64_swap(void *a, void *b, int size)
+{
+   u64 t = *(u64 *)a;
+   *(u64 *)a = *(u64 *)b;
+   *(u64 *)b = t;
+}
+
+static void generic_swap(void *a, void *b, int size)
+{
+   char t;
+
+   do {
+   t = *(char *)a;
+   *(char *)a++ = *(char *)b;
+   *(char *)b++ = t;
+   } while (--size > 0);
+}
+
+/**
+ * sort - sort an array of elements
+ * @base: pointer to data to sort
+ * @num: number of elements
+ * @size: size of each element
+ * @cmp_func: pointer to comparison function
+ * @swap_func: pointer to swap function or NULL
+ *
+ * This function does a heapsort on the given array. You may provide a
+ * swap_func function optimized to your element type.
+ *
+ * Sorting time is O(n log n) both on average and worst-case. While
+ * qsort is about 20% faster on average, it suffers from exploitable
+ * O(n*n) worst-case behavior and extra memory requirements that make
+ * it less suitable for kernel use.
+ */
+
+void sort(void *base, size_t num, size_t size,
+ int (*cmp_func)(const void *, const void *),
+ void (*swap_func)(void *, void *, int size))
+{
+   /* pre-scale counters for performance */
+   int i = (num/2 - 1) * size, n = num * size, c, r;
+
+   if (!swap_func) {
+   if (size == 4 && alignment_ok(base, 4))
+   swap_func = u32_swap;
+   else if (size == 8 && alignment_ok(base, 8))
+   swap_func = u64_swap;
+   else
+   swap_func = generic_swap;
+   }
+
+   /* heapify */
+   for ( ; i >= 0; i -= size) {
+   for (r = i; r * 2 + size < n; r  = c) {
+   c = r * 2 + size;
+   if (c < n - size &&
+   cmp_func(base + c, base + c + size) < 0)
+   c += size;
+   if (cmp_func(base + r, base + c) >= 0)
+   break;
+   swap_func(base + r, base + c, size);
+   }
+   }
+
+   /* sort */
+   for (i = n - size; i > 0; i -= size) {
+   swap_func(base, base + i, size);
+   for (r = 0; r * 2 + size < i; r = c) {
+   c = r * 2 + size;
+   if (c < i - size &&
+   cmp_func(base + c, base + c + size) < 0)
+   c += size;
+   if (cmp_func(base + r, base + c) >= 0)
+   break;
+   swap_func(base + r, base + c, size);
+   }
+   }
+}
diff --git a/kernel-lib/sort.h b/kernel-lib/sort.h
new file mode 100644
index ..9355e01248f2
--- /dev/null
+++ b/kernel-lib/sort.h
@@ -0,0 +1,16 @@
+/*
+ * taken from linux kernel include/list_sort.h
+ * changed include to kerncompat.h
+ */
+
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SORT_H
+#define _LINUX_SORT_H
+
+#include "kerncompat.h"
+
+void sort(void *base, 

[PATCH v2 00/10] Chunk allocator unification

2018-02-08 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/libbtrfs_prepare

This patchset unifies a large part of the chunk allocator (free device
extent search) between the kernel and btrfs-progs, and reuses kernel
function structures like btrfs_finish_chunk_alloc() and
btrfs_alloc_dev_extent().

Before the unification:
   Kernel|  Btrfs-progs
 btrfs_alloc_chunk() | btrfs_alloc_chunk()
 |- __btrfs_alloc_chunk()| |- Do all the work
 |
 btrfs_create_pending_block_groups() |
 |- btrfs_insert_item()  |
 |- btrfs_finish_chunk_alloc()   |
|- btrfs_alloc_dev_extent()  |

After the unification:
   Kernel|  Btrfs-progs
 btrfs_alloc_chunk() | btrfs_alloc_chunk()
 |- __btrfs_alloc_chunk()| |- __btrfs_alloc_chunk()
 | |- btrfs_finish_chunk_alloc()
 btrfs_create_pending_block_groups() ||- btrfs_alloc_dev_extent()
 |- btrfs_insert_item()  |
 |- btrfs_finish_chunk_alloc()   |

And the similar functions now share the same code base, with minor
member/function changes.

This update only modifies patches 7 and after.

Changelog:
v2:
  Make error handler in patch 7 better.
  New patches to unify more functions used in btrfs_alloc_chunk()

Qu Wenruo (10):
  btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to
fs_info
  btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk
  btrfs-progs: Make btrfs_alloc_chunk to handle block group creation
  btrfs-progs: Introduce btrfs_raid_array and related infrastructures
  btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum
hole size
  btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
  btrfs-progs: volumes: Unify free dev extent search behavior between
kernel and btrfs-progs.
  btrfs-progs: Move chunk stripe size calculation function to volumes.h
  btrfs-progs: Use btrfs_device->fs_info to replace
btrfs_device->dev_root
  btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and
behavior

 Makefile  |   3 +-
 check/main.c  |  22 --
 convert/main.c|  10 +-
 ctree.h   |  12 +-
 extent-tree.c |  12 +-
 kerncompat.h  |   5 +
 kernel-lib/sort.c | 104 ++
 kernel-lib/sort.h |  16 +
 mkfs/main.c   |  27 +-
 utils.c   |   2 +-
 volumes.c | 927 --
 volumes.h |  66 +++-
 12 files changed, 695 insertions(+), 511 deletions(-)
 create mode 100644 kernel-lib/sort.c
 create mode 100644 kernel-lib/sort.h

-- 
2.16.1



[PATCH v2 07/10] btrfs-progs: volumes: Unify free dev extent search behavior between kernel and btrfs-progs.

2018-02-08 Thread Qu Wenruo
As preparation for creating libbtrfs, which shares code between the
kernel and btrfs-progs, this patch mainly unifies the search for free
device extents.

The main modifications are:

1) Search for free device extent
   Use the kernel method: sort the devices by their maximum hole size,
   and use the sorted result to determine stripe size and chunk size.

   The previous method, which used the available bytes of each device for
   the search, can't handle scattered small holes on each device.

2) Chunk/stripe minimal size
   Remove the minimal size requirement.
   Now the real minimal stripe size is 64K (BTRFS_STRIPE_LEN), the same
   as the kernel one.

   The upper limit is still kept as is, and the minimal device size is
   still kept for a while.
   But since we no longer have the strange minimal stripe size limit, the
   existing minimal device size calculation won't cause any problem.

3) How to insert device extents
   Although not identical to the kernel, here we follow the kernel
   behavior of delaying dev extent insertion.

   There is a plan to follow kernel __btrfs_alloc_chunk() and make it
   only handle chunk mapping allocation, doing nothing with tree
   operations.

4) Usage of btrfs_raid_array[]
   This makes a lot of the old if-else branches disappear.

There is still a lot of work to do (in both kernel and btrfs-progs)
before we can start extracting code into libbtrfs, but this should put
libbtrfs within our reach.
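
Roughly, the unified search follows this kernel-style shape (a sketch
only: the device list walk and the btrfs_device_info fields are
assumptions based on the kernel code, while btrfs_cmp_device_info() and
the two-output find_free_dev_extent() come from this series):

        ndevs = 0;
        list_for_each_entry(device, &fs_devices->devices, dev_list) {
                ret = find_free_dev_extent(device, BTRFS_STRIPE_LEN,
                                           &dev_offset, &max_avail);
                if (ret)
                        continue;
                devices_info[ndevs].dev = device;
                devices_info[ndevs].dev_offset = dev_offset;
                devices_info[ndevs].max_avail = max_avail;
                ndevs++;
        }
        /*
         * Biggest holes first: the chosen stripe size ends up bounded by
         * the smallest max_avail among the selected devices.
         */
        sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
             btrfs_cmp_device_info, NULL);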

Signed-off-by: Qu Wenruo 
---
 kerncompat.h |   5 +
 volumes.c| 621 ++-
 volumes.h|   7 +
 3 files changed, 285 insertions(+), 348 deletions(-)

diff --git a/kerncompat.h b/kerncompat.h
index fa96715fb70c..658d28ed0792 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -285,6 +285,7 @@ static inline int IS_ERR_OR_NULL(const void *ptr)
  */
 #define kmalloc(x, y) malloc(x)
 #define kzalloc(x, y) calloc(1, x)
+#define kcalloc(x, y) calloc(x, y)
 #define kstrdup(x, y) strdup(x)
 #define kfree(x) free(x)
 #define vmalloc(x) malloc(x)
@@ -394,4 +395,8 @@ struct __una_u64 { __le64 x; } __attribute__((__packed__));
 #define noinline
 #endif
 
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+   return dividend / ((u64) divisor);
+}
 #endif
diff --git a/volumes.c b/volumes.c
index f4009ffa7c9e..c2efb3c674dc 100644
--- a/volumes.c
+++ b/volumes.c
@@ -29,6 +29,7 @@
 #include "volumes.h"
 #include "utils.h"
 #include "kernel-lib/raid56.h"
+#include "kernel-lib/sort.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
[BTRFS_RAID_RAID10] = {
@@ -522,55 +523,55 @@ static int find_free_dev_extent(struct btrfs_device 
*device, u64 num_bytes,
return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
- struct btrfs_device *device,
- u64 chunk_offset, u64 num_bytes, u64 *start,
- int convert)
+static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
+   struct btrfs_fs_info *fs_info,
+   struct map_lookup *map, u64 stripe_size)
 {
-   int ret;
-   struct btrfs_path *path;
-   struct btrfs_root *root = device->dev_root;
+   int ret = 0;
+   struct btrfs_path path;
+   struct btrfs_root *root = fs_info->dev_root;
struct btrfs_dev_extent *extent;
struct extent_buffer *leaf;
struct btrfs_key key;
+   int i;
 
-   path = btrfs_alloc_path();
-   if (!path)
-   return -ENOMEM;
+   btrfs_init_path();
 
-   /*
-* For convert case, just skip search free dev_extent, as caller
-* is responsible to make sure it's free.
-*/
-   if (!convert) {
-   ret = find_free_dev_extent(device, num_bytes, start, NULL);
-   if (ret)
-   goto err;
-   }
+   for (i = 0; i < map->num_stripes; i++) {
+   struct btrfs_device *device = map->stripes[i].dev;
 
-   key.objectid = device->devid;
-   key.offset = *start;
-   key.type = BTRFS_DEV_EXTENT_KEY;
-   ret = btrfs_insert_empty_item(trans, root, path, &key,
- sizeof(*extent));
-   BUG_ON(ret);
+   key.objectid = device->devid;
+   key.offset = map->stripes[i].physical;
+   key.type = BTRFS_DEV_EXTENT_KEY;
 
-   leaf = path->nodes[0];
-   extent = btrfs_item_ptr(leaf, path->slots[0],
-   struct btrfs_dev_extent);
-   btrfs_set_dev_extent_chunk_tree(leaf, extent, 
BTRFS_CHUNK_TREE_OBJECTID);
-   btrfs_set_dev_extent_chunk_objectid(leaf, extent,
-   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-   btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
-
-   write_extent_buffer(leaf, 

[PATCH v2 09/10] btrfs-progs: Use btrfs_device->fs_info to replace btrfs_device->dev_root

2018-02-08 Thread Qu Wenruo
Same as kernel declaration.

Signed-off-by: Qu Wenruo 
---
 utils.c   | 2 +-
 volumes.c | 6 +++---
 volumes.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/utils.c b/utils.c
index e9cb3a82fda6..eff5fb64cfd5 100644
--- a/utils.c
+++ b/utils.c
@@ -216,7 +216,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
device->total_bytes = device_total_bytes;
device->bytes_used = 0;
device->total_ios = 0;
-   device->dev_root = fs_info->dev_root;
+   device->fs_info = fs_info;
device->name = strdup(path);
if (!device->name) {
ret = -ENOMEM;
diff --git a/volumes.c b/volumes.c
index c2efb3c674dc..cff54c612872 100644
--- a/volumes.c
+++ b/volumes.c
@@ -380,7 +380,7 @@ static int find_free_dev_extent_start(struct btrfs_device 
*device,
  u64 *start, u64 *len)
 {
struct btrfs_key key;
-   struct btrfs_root *root = device->dev_root;
+   struct btrfs_root *root = device->fs_info->dev_root;
struct btrfs_dev_extent *dev_extent;
struct btrfs_path *path;
u64 hole_size;
@@ -724,7 +724,7 @@ int btrfs_update_device(struct btrfs_trans_handle *trans,
struct extent_buffer *leaf;
struct btrfs_key key;
 
-   root = device->dev_root->fs_info->chunk_root;
+   root = device->fs_info->chunk_root;
 
path = btrfs_alloc_path();
if (!path)
@@ -1895,7 +1895,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
}
 
fill_device_from_item(leaf, dev_item, device);
-   device->dev_root = fs_info->dev_root;
+   device->fs_info = fs_info;
return ret;
 }
 
diff --git a/volumes.h b/volumes.h
index 950de5a9f910..84deafc98b0d 100644
--- a/volumes.h
+++ b/volumes.h
@@ -26,7 +26,7 @@
 
 struct btrfs_device {
struct list_head dev_list;
-   struct btrfs_root *dev_root;
+   struct btrfs_fs_info *fs_info;
struct btrfs_fs_devices *fs_devices;
 
u64 total_ios;
-- 
2.16.1



[PATCH v2 05/10] btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum hole size

2018-02-08 Thread Qu Wenruo
Just as with the kernel find_free_dev_extent(), allow it to return the
maximum hole size, so we can build the device list for the later chunk
allocator rework.
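
For example, a caller can now do (illustrative snippet only):

        u64 dev_offset;
        u64 max_hole;

        /* passing NULL for the last argument keeps the old behavior */
        ret = find_free_dev_extent(device, num_bytes, &dev_offset, &max_hole);
        /* on success, max_hole holds the device's largest free hole */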

Signed-off-by: Qu Wenruo 
---
 volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/volumes.c b/volumes.c
index b47ff1f392b5..f4009ffa7c9e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -516,10 +516,10 @@ out:
 }
 
 static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
-   u64 *start)
+   u64 *start, u64 *len)
 {
/* FIXME use last free of some kind */
-   return find_free_dev_extent_start(device, num_bytes, 0, start, NULL);
+   return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
 static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
@@ -543,7 +543,7 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle 
*trans,
 * is responsible to make sure it's free.
 */
if (!convert) {
-   ret = find_free_dev_extent(device, num_bytes, start);
+   ret = find_free_dev_extent(device, num_bytes, start, NULL);
if (ret)
goto err;
}
-- 
2.16.1



[PATCH v2 04/10] btrfs-progs: Introduce btrfs_raid_array and related infrastructures

2018-02-08 Thread Qu Wenruo
As part of the effort to unify code and behavior between btrfs-progs and
kernel, copy the btrfs_raid_array from kernel to btrfs-progs.

So later we can use btrfs_raid_array[] to get the needed raid info,
rather than manually doing if-else branches.
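
For example, instead of chained if-else on the block group flags, callers
can do something like (illustrative only):

        enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(flags);
        int sub_stripes = btrfs_raid_array[index].sub_stripes;
        int devs_min    = btrfs_raid_array[index].devs_min;
        int ncopies     = btrfs_raid_array[index].ncopies;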

Signed-off-by: Qu Wenruo 
---
 ctree.h   | 12 +++-
 volumes.c | 66 +++
 volumes.h | 30 +
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/ctree.h b/ctree.h
index 17cdac76c58c..c76849d8deb7 100644
--- a/ctree.h
+++ b/ctree.h
@@ -958,7 +958,17 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID5(1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6(1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
-#define BTRFS_NR_RAID_TYPES 7
+
+enum btrfs_raid_types {
+   BTRFS_RAID_RAID10,
+   BTRFS_RAID_RAID1,
+   BTRFS_RAID_DUP,
+   BTRFS_RAID_RAID0,
+   BTRFS_RAID_SINGLE,
+   BTRFS_RAID_RAID5,
+   BTRFS_RAID_RAID6,
+   BTRFS_NR_RAID_TYPES
+};
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK(BTRFS_BLOCK_GROUP_DATA |\
 BTRFS_BLOCK_GROUP_SYSTEM |  \
diff --git a/volumes.c b/volumes.c
index a9dc8c939dc5..b47ff1f392b5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -30,6 +30,72 @@
 #include "utils.h"
 #include "kernel-lib/raid56.h"
 
+const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
+   [BTRFS_RAID_RAID10] = {
+   .sub_stripes= 2,
+   .dev_stripes= 1,
+   .devs_max   = 0,/* 0 == as many as possible */
+   .devs_min   = 4,
+   .tolerated_failures = 1,
+   .devs_increment = 2,
+   .ncopies= 2,
+   },
+   [BTRFS_RAID_RAID1] = {
+   .sub_stripes= 1,
+   .dev_stripes= 1,
+   .devs_max   = 2,
+   .devs_min   = 2,
+   .tolerated_failures = 1,
+   .devs_increment = 2,
+   .ncopies= 2,
+   },
+   [BTRFS_RAID_DUP] = {
+   .sub_stripes= 1,
+   .dev_stripes= 2,
+   .devs_max   = 1,
+   .devs_min   = 1,
+   .tolerated_failures = 0,
+   .devs_increment = 1,
+   .ncopies= 2,
+   },
+   [BTRFS_RAID_RAID0] = {
+   .sub_stripes= 1,
+   .dev_stripes= 1,
+   .devs_max   = 0,
+   .devs_min   = 2,
+   .tolerated_failures = 0,
+   .devs_increment = 1,
+   .ncopies= 1,
+   },
+   [BTRFS_RAID_SINGLE] = {
+   .sub_stripes= 1,
+   .dev_stripes= 1,
+   .devs_max   = 1,
+   .devs_min   = 1,
+   .tolerated_failures = 0,
+   .devs_increment = 1,
+   .ncopies= 1,
+   },
+   [BTRFS_RAID_RAID5] = {
+   .sub_stripes= 1,
+   .dev_stripes= 1,
+   .devs_max   = 0,
+   .devs_min   = 2,
+   .tolerated_failures = 1,
+   .devs_increment = 1,
+   .ncopies= 2,
+   },
+   [BTRFS_RAID_RAID6] = {
+   .sub_stripes= 1,
+   .dev_stripes= 1,
+   .devs_max   = 0,
+   .devs_min   = 3,
+   .tolerated_failures = 2,
+   .devs_increment = 1,
+   .ncopies= 3,
+   },
+};
+
 struct stripe {
struct btrfs_device *dev;
u64 physical;
diff --git a/volumes.h b/volumes.h
index 7bbdf615d31a..612a0a7586f4 100644
--- a/volumes.h
+++ b/volumes.h
@@ -108,6 +108,36 @@ struct map_lookup {
struct btrfs_bio_stripe stripes[];
 };
 
+struct btrfs_raid_attr {
+   int sub_stripes;/* sub_stripes info for map */
+   int dev_stripes;/* stripes per dev */
+   int devs_max;   /* max devs to use */
+   int devs_min;   /* min devs needed */
+   int tolerated_failures; /* max tolerated fail devs */
+   int devs_increment; /* ndevs has to be a multiple of this */
+   int ncopies;/* how many copies to data has */
+};
+
+extern const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES];
+
+static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
+{
+   if (flags & BTRFS_BLOCK_GROUP_RAID10)
+   return BTRFS_RAID_RAID10;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID1)
+   return BTRFS_RAID_RAID1;
+   else if (flags & BTRFS_BLOCK_GROUP_DUP)
+   return BTRFS_RAID_DUP;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID0)
+   return BTRFS_RAID_RAID0;
+   else if (flags & BTRFS_BLOCK_GROUP_RAID5)
+   return 

[PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()

2018-02-08 Thread Qu Wenruo
Strangely, we have a level check in btrfs_print_tree() while we don't
have the same check in read_node_slot().

That is to say, for the following corruption, btrfs_search_slot() or
btrfs_next_leaf() can return an invalid leaf:

Parent eb:
  node XX level 1
  ^^^
  Child should be leaf (level 0)
  ...
  key (XXX XXX XXX) block YY

Child eb:
  leaf YY level 1
  ^^^
  Something went wrong now

And with such a corrupted leaf returned, later callers can easily be
screwed up.

Although the root cause (power loss, but still something breaking btrfs
metadata CoW) is still unknown, at least enhance btrfs-progs to avoid the
SEGV.

Reported-by: Ralph Gauges 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Check if the extent buffer is up-to-date before checking its level to
  avoid possible NULL pointer access.
---
 ctree.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/ctree.c b/ctree.c
index 4fc33b14000a..430805e3043f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -22,6 +22,7 @@
 #include "repair.h"
 #include "internal.h"
 #include "sizes.h"
+#include "messages.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct 
btrfs_key *key,
 struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
   struct extent_buffer *parent, int slot)
 {
+   struct extent_buffer *ret;
int level = btrfs_header_level(parent);
+
if (slot < 0)
return NULL;
if (slot >= btrfs_header_nritems(parent))
@@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct btrfs_fs_info 
*fs_info,
if (level == 0)
return NULL;
 
-   return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
+   ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
   btrfs_node_ptr_generation(parent, slot));
+   if (!extent_buffer_uptodate(ret))
+   return ERR_PTR(-EIO);
+
+   if (btrfs_header_level(ret) != level - 1) {
+   error("child eb corrupted: parent bytenr=%llu item=%d parent 
level=%d child level=%d",
+ btrfs_header_bytenr(parent), slot,
+ btrfs_header_level(parent), btrfs_header_level(ret));
+   free_extent_buffer(ret);
+   return ERR_PTR(-EIO);
+   }
+   return ret;
 }
 
 static int balance_level(struct btrfs_trans_handle *trans,
-- 
2.16.1



[PATCH v2 01/10] btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info

2018-02-08 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/volumes.c b/volumes.c
index edad367b593c..677d085de96c 100644
--- a/volumes.c
+++ b/volumes.c
@@ -826,7 +826,7 @@ error:
return ret;
 }
 
-#define BTRFS_MAX_DEVS(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info)   \
+#define BTRFS_MAX_DEVS(info) ((BTRFS_LEAF_DATA_SIZE(info)  \
- sizeof(struct btrfs_item) \
- sizeof(struct btrfs_chunk))   \
/ sizeof(struct btrfs_stripe) + 1)
@@ -882,12 +882,12 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
calc_size = SZ_1G;
max_chunk_size = 10 * calc_size;
min_stripe_size = SZ_64M;
-   max_stripes = BTRFS_MAX_DEVS(chunk_root);
+   max_stripes = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
calc_size = SZ_1G;
max_chunk_size = 4 * calc_size;
min_stripe_size = SZ_32M;
-   max_stripes = BTRFS_MAX_DEVS(chunk_root);
+   max_stripes = BTRFS_MAX_DEVS(info);
}
}
if (type & BTRFS_BLOCK_GROUP_RAID1) {
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: segmentation fault in btrfs tool v4.15

2018-02-08 Thread Qu Wenruo


On 2018年02月09日 15:23, Ralph Gauges wrote:
> Hi Qu,
> 
> I applied the patch to the sources of v4.15 and ran it in gdb. This is
> the result.
> 
> (gdb) set args check /dev/sdf1
> (gdb) run
> Starting program: /home/gauges/Applications/bin/btrfs check /dev/sdf1
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=247283269632 item=23 parent
> level=1 child level=2
> ERROR: cannot open file system
> [Inferior 1 (process 7149) exited with code 01]
> 
> 
> So obviously it does not crash any more. Thanks.
> Since you are an expert on the btrfs filesystem, any hints as to how I
> could fix my backup
> partition?

Guys on the mailing list may have better ideas, CCed to the list.

In fact the problem all happens in the extent tree; maybe we could
salvage something by mounting it RO?

If kernel can't mount it even RO, then "btrfs restore" may be your last
chance.

Thanks,
Qu

> This output from btrfs seems to suggest that "btrfs check"
> can't handle this error?!
> Or is this last error something else that didn't show up so far
> because of the segfault?
> 
> Thanks
> 
> Ralph
> 
> 





Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Darrick J. Wong
On Mon, Feb 05, 2018 at 04:15:02PM -0700, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
> 
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizontal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one; now if two disks fail,
> data may be lost forever.
> 
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).
> 
> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> a chunk of raid5 or raid6 becomes degraded, it will appear in
> 'bad_chunks'.
> 
> Signed-off-by: Liu Bo 
> ---
> - In this patch, 'bad chunks' is not persistent on disk, but it can be
>   added if it's thought to be a good idea.
> - This is lightly tested, comments are very welcome.

Hmmm... sorry to be late to the party and dump a bunch of semirelated
work suggestions, but what if you implemented GETFSMAP for btrfs?  Then
you could define a new 'defective' fsmap type/flag/whatever and export
it for whatever metadata/filedata/whatever is now screwed up?  Existing
interface, you don't have to kludge sysfs data, none of this string
interpretation stuff...

--D

> 
>  fs/btrfs/ctree.h   |  8 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 13 +++
>  fs/btrfs/raid56.c  | 59 
> --
>  fs/btrfs/sysfs.c   | 26 ++
>  fs/btrfs/volumes.c | 15 +++--
>  fs/btrfs/volumes.h |  2 ++
>  7 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 13c260b..08aad65 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1101,6 +1101,9 @@ struct btrfs_fs_info {
>   spinlock_t ref_verify_lock;
>   struct rb_root block_tree;
>  #endif
> +
> + struct list_head bad_chunks;
> + seqlock_t bc_lock;
>  };
>  
>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
> address_space *mapping)
>  
>  /* extent-tree.c */
>  
> +struct btrfs_bad_chunk {
> + u64 chunk_offset;
> + struct list_head list;
> +};
> +
>  enum btrfs_inline_ref_type {
>   BTRFS_REF_TYPE_INVALID = 0,
>   BTRFS_REF_TYPE_BLOCK =   1,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a8ecccf..061e7f94 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb,
>   init_waitqueue_head(&fs_info->async_submit_wait);
>  
>   INIT_LIST_HEAD(&fs_info->pinned_chunks);
> + INIT_LIST_HEAD(&fs_info->bad_chunks);
> + seqlock_init(&fs_info->bc_lock);
>  
>   /* Usable values until the real ones are cached from the superblock */
>   fs_info->nodesize = 4096;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f43285..3ca7cb4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   kobject_del(_info->kobj);
>   kobject_put(_info->kobj);
>   }
> +
> + /* Clean up bad chunks. */
> + write_seqlock_irq(&info->bc_lock);
> + while (!list_empty(&info->bad_chunks)) {
> + struct btrfs_bad_chunk *bc;
> +
> + bc = list_first_entry(&info->bad_chunks,
> +   struct btrfs_bad_chunk, list);
> + list_del_init(&bc->list);
> + kfree(bc);
> + }
> + write_sequnlock_irq(&info->bc_lock);
> +
>   return 0;
>  }
>  
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index a7f7925..e960247 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio 
> *rbio, blk_status_t err)
>  }
>  
>  /*
> - * end io function used by finish_rmw.  When we finally
> - * get here, we've written a full stripe
> + * end io function used by finish_rmw.  When we finally get here, we've 
> written
> + * a full stripe.
> + *
> + * Note that this is not under interrupt context as we queued endio to 
> workers.
>   */
>  static void raid_write_end_io(struct bio *bio)
>  {
>   struct btrfs_raid_bio *rbio = bio->bi_private;
>   blk_status_t err = bio->bi_status;
>   int max_errors;
> + u64 stripe_start = rbio->bbio->raid_map[0];
> + struct btrfs_fs_info *fs_info = rbio->fs_info;
> + int err_cnt;
>  
>   if (err)
>   fail_bio_stripe(rbio, bio);
> @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio)
>   

RE: [PATCH v5 0/3] Add support for export testsuits

2018-02-08 Thread Gu, Jinxiang
Hi,

> -Original Message-
> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Friday, February 09, 2018 2:02 AM
> To: Gu, Jinxiang/顾 金香 
> Cc: linux-btrfs@vger.kernel.org; dste...@suse.cz
> Subject: Re: [PATCH v5 0/3] Add support for export testsuits
> 
> On Thu, Feb 08, 2018 at 02:34:17PM +0800, Gu Jinxiang wrote:
> > Achieved:
> > 1. export testsuite by:
> >  $ make testsuite
> > the files listed in testsuites-list will be added into the tarball
> > btrfs-progs-tests.tar.gz.
> >
> > 2. after decompressing btrfs-progs-tests.tar.gz, run tests by:
> >  $ TEST=`MASK` ./tests/mkfs-tests.sh
> > and running without MASK is also ok.
> > replenish:
> >  $ tar -xzvf ./btrfs-progs-tests.tar.gz
> >  $ ls
> >btrfs-progs
> > the tests directory and other files are in btrfs-progs.
> >
> > Changelog:
> > v5->v4: modify patch2.
> > make TEST_TOP to represent tests directory.
> > and introduce INTERNAL_BIN for internal binaries.
> 
> Patches 1 and 2 applied. I reworked most of 1, my idea of the end result of 
> the testsuite is different. In patch 2, I've added quotes to all lines
> that changed the variables on 'source ..' line. In such cases please also 
> look at the resulting code and do not just mechanically apply the
> suggestion to rename a variable. Patch 3 does not bring much information so I 
> did not apply it and wrote the section myself.
Thank you for the changes.
My idea was to keep the same structure as git when exporting the
testsuite. But I saw the modifications, and they are indeed more
reasonable.

> 
> The project idea lacked details, as the cards on the github Projects page are 
> supposed to be short. Not all of the tasks there are simple, so
> if you'd want to work on something found there and see that's not clear what 
> to do, it would be better to open an issue so I can fill in.
OK. Got it.
Thanks.
> 





Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Qu Wenruo


On 2018年02月09日 03:07, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +   /* read lock please */
>>> +   do {
>>> +   seq = read_seqbegin(&fs_info->bc_lock);
>>> +   list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +   len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +   bc->chunk_offset);
>>> +   /* chunk offset is u64 */
>>> +   if (len >= PAGE_SIZE)
>>> +   break;
>>> +   }
>>> +   } while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list ? If I read the code 
>> correctly, only up to fill a kernel page.
>>
>> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) 
>> it is possible to list only few hundred of chunks (~200). Not more; and the 
>> last one could be even listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to 
>> list could be more than 200.
>>
>> My first suggestion is to limit the number of chunks to show to 200 (a page 
>> should be big enough to contains all these chunks offset). If the chunks 
>> number are greater, ends the list with a marker (something like '[...]\n').
>> This would solve the ambiguity about the fact that the list chunks are 
>> complete or not. Anyway you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.
> 
>> However, my second suggestions is to ... change completely the interface. 
>> What about adding a directory in sysfs, where each entry is a chunk ?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/type     # data/metadata/sys
>> /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/profile  # dup/linear
>> /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/size     # size
>> /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/devs     # chunk devs

What about a netlink interface?

Although it may need an extra daemon to listen to it, and some guys
won't be happy about the abuse of a daemon.

Thanks,
Qu

>>
>> And so on. 
>>
>> Checking  "[...]/devs", it would be easy understand if the 
>> chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible is iterate over a 
>> sysfs directory which is a map of a kernel objects list.
>>
>> I think that if these interface would be good enough, we could get rid of a 
>> lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?
> 
> thanks,
> -liubo
> 





Re: IO Error (.snapshots is not a btrfs subvolume)

2018-02-08 Thread Nick Gilmour
I've installed openSUSE Tumbleweed on a VM and checked how the disk is
partitioned with btrfs, what fstab looks like, how snapper works, and
also what the differences to my system are. I've decided to leave it as
it is for now and next time use the guide from the link provided by
@Andrei.

Thanks!

Regards,
Nick

On Thu, Feb 8, 2018 at 4:43 PM, Nick Gilmour  wrote:
> On Thu, Feb 8, 2018 at 5:32 AM, Andrei Borzenkov  wrote:
>> 08.02.2018 06:03, Chris Murphy пишет:
>>> On Wed, Feb 7, 2018 at 6:26 PM, Nick Gilmour  wrote:
 Hi all,

 I have successfully restored a snapshot of root but now when I try to
>>
>> How exactly was it done?
>>
 make a new snapshot I get this error:
 IO Error (.snapshots is not a btrfs subvolume).
 My snapshots were within @ which I renamed to @_old.
 What can I do now? How can I move the snapshots from @_old/ into @ and
 be able to make snapshots again?

 This is an excerpt of my subvolumes list:

 # btrfs subvolume list /
 ID 257 gen 175397 top level 5 path @_old
 ID 258 gen 175392 top level 5 path @pkg
 ID 260 gen 175447 top level 5 path @tmp
 ID 262 gen 19 top level 257 path @_old/var/lib/machines
 ID 268 gen 175441 top level 5 path @test
 ID 291 gen 175394 top level 257 path @_old/.snapshots
 ID 292 gen 1705 top level 291 path @_old/.snapshots/1/snapshot
 ...

 ID 3538 gen 175398 top level 291 path @_old/.snapshots/1594/snapshot
 ID 3540 gen 175447 top level 5 path @

>>>
>>>
>>> This is a snapper behavior. It creates .snapshots as a subvolume and
>>> then puts snapshots into that subvolume. If you snapshot a subvolume
>>> that contains another subvolume, the nested subvolume is not
>>> snapshotted; a plain directory placeholder is created instead. So your
>>> restored snapshot contains a .snapshot directory rather than a
>>> .snapshot subvolume. Possibly if you delete the directory and create a
>>> new subvolume .snapshot, the problem will be fixed.
>>>
>>
>> No, you should create subvolume @/.snapshots and mount it as /.snapshots
>> (and have it in /etc/fstab). Snapshots should always be available in
>> running system under fixed path and this only possible when it is
>> mounted, otherwise after rollback /.snapshots will be lost just like it
>> happened now.
>>
>> Exact subvolume name probably not matters that much, but better stick
>> with what installer does by default. It may matter for grub2 snapshots
>> handling.
>>
>> Also openSUSE expects that actual root is subvolume under /.snapshots
>> which is valid snapper snapshot (i.e. it has valid metadata). Again, not
>> having this may confuse snapper.
>>
>> It may be possible to move @_old/.snapshots into @/.snapshots, although
>> this breaks parent-child relationships, so those old snapshots cannot be
>> cleaned up without removing the old root completely.
>>
>>> I can't tell you how this will confuse snapper though, or how to
>>> unconfuse it. It pretty much expects to be in control of all
>>> snapshots, creation, deletion, and rollbacks. So if you do it manually
>>> for whatever reason, I think it can confuse snapper.
>>>
>>>
>>
>> There was blog post recently outlining how to restore openSUSE root. You
>> may want to search opensuse or opensuse-factory mailing list. Ah found:
>>
>> https://rootco.de/2018-01-19-opensuse-btrfs-subvolumes/
>
>
> Thanks both for the quick responses!
>
>> How exactly was it done?
> 1. # mount /dev/sde6 /mnt
> 2. # cd /mnt
> 3. # mv @ @_old
> 4. # btrfs subvol snapshot /mnt/@_old/.snapshots/1594/snapshot /mnt/@
>
>> No, you should create subvolume @/.snapshots and mount it as /.snapshots
>> (and have it in /etc/fstab). Snapshots should always be available in
>> running system under fixed path and this only possible when it is
>> mounted, otherwise after rollback /.snapshots will be lost just like it
>> happened now.
>
> When I run this command I get an error:
> # btrfs subvolume create @/.snapshots
> ERROR: cannot access '@': No such file or directory
>
> but when I'm doing this:
> # btrfs subvolume create /.snapshots
> Create subvolume '//.snapshots'
> it works
>
> and this is what I see in the subvolumes list:
> ID 3541 gen 175955 parent 3540 top level 3540 path .snapshots
>
> and then I can create a snapshot with snapper:
> # snapper -c root create --description "test"
>
> but snapper starts numbering from 1 again which I don't really like. I
> would like to keep the previous snapshots and continue the numbering
> after the last restored snapshot (1594).
>
> This is how my fstab looks like now:
>
> # /etc/fstab: static file system information
> #
> # <file system> <mount point>  <type>  <options>  <dump>  <pass>
>  # /dev/sdd6 LABEL=ROOT
> UUID=...567940c58ea6   /   btrfs
> noatime,nodiratime,compress=lzo,ssd,space_cache,subvolid=257,subvol=/@
>  0   1
>
> # 

Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Liu Bo
On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
> On 02/06/2018 12:15 AM, Liu Bo wrote:
> [...]
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> 
> [...]
> [...]
> 
> > +
> > +   /* read lock please */
> > +   do {
> > +   seq = read_seqbegin(&fs_info->bc_lock);
> > +   list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +   len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +   bc->chunk_offset);
> > +   /* chunk offset is u64 */
> > +   if (len >= PAGE_SIZE)
> > +   break;
> > +   }
> > +   } while (read_seqretry(&fs_info->bc_lock, seq));
> 
> Using this interface, how many chunks can you list ? If I read the code 
> correctly, only up to fill a kernel page.
>
> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) 
> it is possible to list only few hundred of chunks (~200). Not more; and the 
> last one could be even listed incomplete.
> 

That's true.

> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to 
> list could be more than 200.
>
> My first suggestion is to limit the number of chunks to show to 200 (a page 
> should be big enough to contains all these chunks offset). If the chunks 
> number are greater, ends the list with a marker (something like '[...]\n').
> This would solve the ambiguity about the fact that the list chunks are 
> complete or not. Anyway you cannot list all the chunks.
>

Good point, and I need to add one more field to each record to specify
whether it's metadata or data.

So what I have in my mind is that this kind of error is rare and
reflects bad sectors on disk, but if there are really that many
errors, I think we need to replace the disk without hesitation.

> However, my second suggestions is to ... change completely the interface. 
> What about adding a directory in sysfs, where each entry is a chunk ?
> 
> Something like:
> 
> > /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/type     # data/metadata/sys
> > /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/profile  # dup/linear
> > /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/size     # size
> > /sys/fs/btrfs/<fsid>/chunks/<chunk offset>/devs     # chunk devs
> 
> And so on. 
> 
> Checking  "[...]/devs", it would be easy understand if the 
> chunk is in "degraded" mode or not.

I'm afraid we cannot do that as it'll occupy too much memory for large
filesystems given a typical chunk size is 1GB.

> 
> However I have to admit that I don't know how feasible is iterate over a 
> sysfs directory which is a map of a kernel objects list.
> 
> I think that if these interface would be good enough, we could get rid of a 
> lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>

TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

thanks,
-liubo


Re: [PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs

2018-02-08 Thread Liu Bo
On Thu, Feb 08, 2018 at 06:25:17PM +0200, Nikolay Borisov wrote:
> list_first_entry is essentially a wrapper over container_of. The latter
> can never return null even if it's working on an inconsistent list since it
> will either crash or return some offset in the wrong struct.
> Additionally, for the dirty_bgs list the iteration is done under
> dirty_bgs_lock which ensures consistency of the list.
> 

Reviewed-by: Liu Bo 

-liubo

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/disk-io.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index aca906971abe..1b3989c54d7c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4323,11 +4323,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
> *cur_trans,
>   cache = list_first_entry(_trans->dirty_bgs,
>struct btrfs_block_group_cache,
>dirty_list);
> - if (!cache) {
> - btrfs_err(fs_info, "orphan block group dirty_bgs list");
> - spin_unlock(_trans->dirty_bgs_lock);
> - return;
> - }
>  
>   if (!list_empty(>io_list)) {
>   spin_unlock(_trans->dirty_bgs_lock);
> @@ -4351,10 +4346,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
> *cur_trans,
>   cache = list_first_entry(_trans->io_bgs,
>struct btrfs_block_group_cache,
>io_list);
> - if (!cache) {
> - btrfs_err(fs_info, "orphan block group on io_bgs list");
> - return;
> - }
>  
>   list_del_init(>io_list);
>   spin_lock(>lock);
> -- 
> 2.7.4
> 


Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Goffredo Baroncelli
On 02/06/2018 12:15 AM, Liu Bo wrote:
[...]
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

[...]
[...]

> +
> + /* read lock please */
> + do {
> + seq = read_seqbegin(_info->bc_lock);
> + list_for_each_entry(bc, _info->bad_chunks, list) {
> + len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> + bc->chunk_offset);
> + /* chunk offset is u64 */
> + if (len >= PAGE_SIZE)
> + break;
> + }
> + } while (read_seqretry(_info->bc_lock, seq));

Using this interface, how many chunks can you list? If I read the code
correctly, only enough to fill a kernel page.

If my math is correct (PAGE_SIZE=4k, a u64 can take up to 20 decimal digits
plus a newline, i.e. 4096 / ~21 ≈ 195 lines), it is possible to list only a
few hundred chunks (~200), no more; and the last one could even be listed
incomplete.

IIRC a chunk is at most 1GB; if you lose a 500GB disk, the chunks to list
could number more than 200.

My first suggestion is to limit the number of chunks shown to 200 (a page
should be big enough to contain that many chunk offsets). If the number of
chunks is greater, end the list with a marker (something like '[...]\n').
This would remove the ambiguity about whether the listed chunks are complete
or not. In any case you cannot list all the chunks.
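A rough sketch of how the show handler could implement this cap, keeping the
seqlock pattern of the RFC (the 200 cut-off and the '[...]' marker are only
this suggestion, nothing like it exists in the posted patch):

	#define BAD_CHUNKS_SHOW_MAX	200

	do {
		int count = 0;

		len = 0;
		seq = read_seqbegin(&fs_info->bc_lock);
		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
			if (count++ == BAD_CHUNKS_SHOW_MAX) {
				/* mark the listing as truncated */
				len += snprintf(buf + len, PAGE_SIZE - len,
						"[...]\n");
				break;
			}
			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
					bc->chunk_offset);
			/* never run past the page regardless of the cap */
			if (len >= PAGE_SIZE)
				break;
		}
	} while (read_seqretry(&fs_info->bc_lock, seq));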

However, my second suggestion is to ... change the interface completely. What
about adding a directory in sysfs, where each entry is a chunk?

Something like:

/sys/fs/btrfs/<UUID>/chunks/<chunk>/type      # data/metadata/sys
/sys/fs/btrfs/<UUID>/chunks/<chunk>/profile   # dup/linear
/sys/fs/btrfs/<UUID>/chunks/<chunk>/size      # size
/sys/fs/btrfs/<UUID>/chunks/<chunk>/devs      # chunk's devs

And so on.

Checking "[...]/devs", it would be easy to understand whether the chunk
is in "degraded" mode or not.
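To make the idea concrete, a minimal sketch of a per-chunk 'type' attribute
could look like the following (all names here are made up for illustration;
nothing like this exists in the RFC):

	/* One kobject per chunk under /sys/fs/btrfs/<UUID>/chunks/<chunk>/ */
	struct btrfs_chunk_kobj {
		struct kobject kobj;
		u64 chunk_offset;
		u64 flags;		/* BTRFS_BLOCK_GROUP_* bits */
	};

	static ssize_t type_show(struct kobject *kobj,
				 struct kobj_attribute *attr, char *buf)
	{
		struct btrfs_chunk_kobj *ck;

		ck = container_of(kobj, struct btrfs_chunk_kobj, kobj);
		if (ck->flags & BTRFS_BLOCK_GROUP_DATA)
			return snprintf(buf, PAGE_SIZE, "data\n");
		if (ck->flags & BTRFS_BLOCK_GROUP_METADATA)
			return snprintf(buf, PAGE_SIZE, "metadata\n");
		return snprintf(buf, PAGE_SIZE, "system\n");
	}

	static struct kobj_attribute chunk_type_attr = __ATTR_RO(type);

The main cost of this approach is that every chunk would need its own
kobject registered somewhere under the per-fs sysfs directory.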

However, I have to admit that I don't know how feasible it is to iterate over
a sysfs directory that mirrors a kernel list of objects.

I think that if this interface were good enough, we could get rid of a lot
of ioctl(TREE_SEARCH) calls in btrfs-progs.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH v5 0/3] Add support for export testsuits

2018-02-08 Thread David Sterba
On Thu, Feb 08, 2018 at 02:34:17PM +0800, Gu Jinxiang wrote:
> Achieved:
> 1. Export the testsuite by:
>  $ make testsuite
> The files listed in testsuites-list will be added to the tarball
> btrfs-progs-tests.tar.gz.
> 
> 2. After decompressing btrfs-progs-tests.tar.gz, run a test by:
>  $ TEST=`MASK` ./tests/mkfs-tests.sh
> and it also works without MASK.
> To elaborate:
>  $ tar -xzvf ./btrfs-progs-tests.tar.gz
>  $ ls
>btrfs-progs
> The tests directory and other files are inside btrfs-progs.
> 
> Changelog:
> v5->v4: modify patch 2.
>   Make TEST_TOP represent the tests directory,
>   and introduce INTERNAL_BIN for internal binaries.

Patches 1 and 2 applied. I reworked most of patch 1, as my idea of the end
result of the testsuite is different. In patch 2, I've added quotes to all
lines that changed the variables on the 'source ..' line. In such cases
please also look at the resulting code and do not just mechanically
apply the suggestion to rename a variable. Patch 3 does not bring much
information, so I did not apply it and wrote the section myself.

The project idea lacked details, as the cards on the GitHub Projects
page are supposed to be short. Not all of the tasks there are simple, so
if you want to work on something found there and it is not clear
what to do, it would be better to open an issue so I can fill in the details.


[PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs

2018-02-08 Thread Nikolay Borisov
list_first_entry is essentially a wrapper over container_of. The latter
can never return null even if it's working on an inconsistent list, since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock, which ensures consistency of the list.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index aca906971abe..1b3989c54d7c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4323,11 +4323,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
*cur_trans,
cache = list_first_entry(_trans->dirty_bgs,
 struct btrfs_block_group_cache,
 dirty_list);
-   if (!cache) {
-   btrfs_err(fs_info, "orphan block group dirty_bgs list");
-   spin_unlock(_trans->dirty_bgs_lock);
-   return;
-   }
 
if (!list_empty(>io_list)) {
spin_unlock(_trans->dirty_bgs_lock);
@@ -4351,10 +4346,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
*cur_trans,
cache = list_first_entry(_trans->io_bgs,
 struct btrfs_block_group_cache,
 io_list);
-   if (!cache) {
-   btrfs_err(fs_info, "orphan block group on io_bgs list");
-   return;
-   }
 
list_del_init(>io_list);
spin_lock(>lock);
-- 
2.7.4



[PATCH 2/2] btrfs: Document consistency of transaction->io_bgs list

2018-02-08 Thread Nikolay Borisov
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c |  4 
 fs/btrfs/extent-tree.c |  7 ++-
 fs/btrfs/transaction.h | 15 +++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1b3989c54d7c..b6fc734b0f5c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4342,6 +4342,10 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
*cur_trans,
}
spin_unlock(_trans->dirty_bgs_lock);
 
+
+   /* Refer to the definition of io_bgs member for details why it's safe
+* to use it without any locking
+*/
while (!list_empty(_trans->io_bgs)) {
cache = list_first_entry(_trans->io_bgs,
 struct btrfs_block_group_cache,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cc08e6af3542..c4a3dfac224a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3742,7 +3742,8 @@ int btrfs_start_dirty_block_groups(struct 
btrfs_trans_handle *trans,
 
/*
 * the cache_write_mutex is protecting
-* the io_list
+* the io_list, also refer to the definition of
+* btrfs_transaction::io_bgs for more details
 */
list_add_tail(>io_list, io);
} else {
@@ -3934,6 +3935,10 @@ int btrfs_write_dirty_block_groups(struct 
btrfs_trans_handle *trans,
}
spin_unlock(_trans->dirty_bgs_lock);
 
+
+   /* Refer to the definition of io_bgs member for details why it's safe
+* to use it without any locking
+*/
while (!list_empty(io)) {
cache = list_first_entry(io, struct btrfs_block_group_cache,
 io_list);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 817fd7c9836b..90b80a6bea66 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -69,6 +69,21 @@ struct btrfs_transaction {
struct list_head pending_chunks;
struct list_head switch_commits;
struct list_head dirty_bgs;
+
+   /* There is no explicit lock which protects io_bgs, rather its
+* consistency is implied by the fact that all the sites which modify
+* it do so under some form of transaction critical section, namely:
+*
+* * btrfs_start_dirty_block_groups - This function can only ever be
+* run by one of the transaction committers. Refer to
+* BTRFS_TRANS_DIRTY_BG_RUN usage in btrfs_commit_transaction
+*
+* * btrfs_write_dirty_blockgroups - this is called by
+* commit_cowonly_roots from transaction critical section
+* (TRANS_STATE_COMMIT_DOING)
+*
+* * btrfs_cleanup_dirty_bgs - called on transaction abort
+*/
struct list_head io_bgs;
struct list_head dropped_roots;
 
-- 
2.7.4



Re: IO Error (.snapshots is not a btrfs subvolume)

2018-02-08 Thread Nick Gilmour
On Thu, Feb 8, 2018 at 5:32 AM, Andrei Borzenkov  wrote:
> 08.02.2018 06:03, Chris Murphy wrote:
>> On Wed, Feb 7, 2018 at 6:26 PM, Nick Gilmour  wrote:
>>> Hi all,
>>>
>>> I have successfully restored a snapshot of root but now when I try to
>
> How exactly was it done?
>
>>> make a new snapshot I get this error:
>>> IO Error (.snapshots is not a btrfs subvolume).
>>> My snapshots were within @ which I renamed to @_old.
>>> What can I do now? How can I move the snapshots from @_old/ into @ and
>>> be able to make snapshots again?
>>>
>>> This is an excerpt of my subvolumes list:
>>>
>>> # btrfs subvolume list /
>>> ID 257 gen 175397 top level 5 path @_old
>>> ID 258 gen 175392 top level 5 path @pkg
>>> ID 260 gen 175447 top level 5 path @tmp
>>> ID 262 gen 19 top level 257 path @_old/var/lib/machines
>>> ID 268 gen 175441 top level 5 path @test
>>> ID 291 gen 175394 top level 257 path @_old/.snapshots
>>> ID 292 gen 1705 top level 291 path @_old/.snapshots/1/snapshot
>>> ...
>>>
>>> ID 3538 gen 175398 top level 291 path @_old/.snapshots/1594/snapshot
>>> ID 3540 gen 175447 top level 5 path @
>>>
>>
>>
>> This is snapper behavior. It creates .snapshots as a subvolume and
>> then puts snapshots into that subvolume. If you snapshot a subvolume
>> that contains another subvolume, the nested subvolume is not snapshotted;
>> a plain directory placeholder is created instead. So your
>> restored snapshot contains a .snapshots directory rather than a
>> .snapshots subvolume. Possibly, if you delete the directory and create a
>> new .snapshots subvolume, the problem will be fixed.
>>
>
> No, you should create the subvolume @/.snapshots and mount it as /.snapshots
> (and have it in /etc/fstab). Snapshots should always be available in the
> running system under a fixed path, and that is only possible when it is
> mounted; otherwise, after a rollback, /.snapshots will be lost just like
> what happened now.
>
> The exact subvolume name probably does not matter that much, but it is
> better to stick with what the installer does by default. It may matter for
> grub2 snapshot handling.
>
> Also, openSUSE expects the actual root to be a subvolume under /.snapshots
> which is a valid snapper snapshot (i.e. it has valid metadata). Again, not
> having this may confuse snapper.
>
> It may be possible to move @_old/.snapshots into @/.snapshots, although
> this breaks parent-child relationships, so those old snapshots cannot be
> cleaned up without removing the old root completely.
>
>> I can't tell you how this will confuse snapper though, or how to
>> unconfuse it. It pretty much expects to be in control of all
>> snapshots, creation, deletion, and rollbacks. So if you do it manually
>> for whatever reason, I think it can confuse snapper.
>>
>>
>
> There was a blog post recently outlining how to restore the openSUSE root.
> You may want to search the opensuse or opensuse-factory mailing list. Ah,
> found it:
>
> https://rootco.de/2018-01-19-opensuse-btrfs-subvolumes/


Thanks both for the quick responses!

> How exactly was it done?
1. # mount /dev/sde6 /mnt
2. # cd /mnt
3. # mv @ @_old
4. # btrfs subvol snapshot /mnt/@_old/.snapshots/1594/snapshot /mnt/@

> No, you should create the subvolume @/.snapshots and mount it as /.snapshots
> (and have it in /etc/fstab). Snapshots should always be available in the
> running system under a fixed path, and that is only possible when it is
> mounted; otherwise, after a rollback, /.snapshots will be lost just like
> what happened now.

When I run this command I get an error:
# btrfs subvolume create @/.snapshots
ERROR: cannot access '@': No such file or directory

but when I do this:
# btrfs subvolume create /.snapshots
Create subvolume '//.snapshots'
it works

and this is what I see in the subvolumes list:
ID 3541 gen 175955 parent 3540 top level 3540 path .snapshots

and then I can create a snapshot with snapper:
# snapper -c root create --description "test"

but snapper starts numbering from 1 again, which I don't really like. I
would like to keep the previous snapshots and continue the numbering
after the last restored snapshot (1594).

This is what my fstab looks like now:

# /etc/fstab: static file system information
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>


Re: [PATCH 1/3] btrfs-progs: add prerequisite mkfs.btrfs for test-cli

2018-02-08 Thread David Sterba
On Thu, Feb 08, 2018 at 01:08:56PM +0800, Gu Jinxiang wrote:
> Since tests/cli-tests/002-balance-full-no-filters/test.sh needs
> mkfs.btrfs as a prerequisite, add the dependency in the Makefile.
> 
> Signed-off-by: Gu Jinxiang 

1-3 applied, thanks.


[PATCH][btrfs-next] Btrfs: extent map selftest: fix non-ANSI btrfs_test_extent_map declaration

2018-02-08 Thread Colin King
From: Colin Ian King 

The function btrfs_test_extent_map requires a (void) parameter list to be
ANSI C compliant, so that it matches the prototype in
fs/btrfs/tests/btrfs-tests.h.

Cleans up sparse warning:
fs/btrfs/tests/extent-map-tests.c:346:27: warning: non-ANSI function
declaration of function 'btrfs_test_extent_map'

Signed-off-by: Colin Ian King 
---
 fs/btrfs/tests/extent-map-tests.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tests/extent-map-tests.c 
b/fs/btrfs/tests/extent-map-tests.c
index 70c993f01670..c23bd00bdd92 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -343,7 +343,7 @@ static void test_case_4(struct extent_map_tree *em_tree)
__test_case_4(em_tree, SZ_4K);
 }
 
-int btrfs_test_extent_map()
+int btrfs_test_extent_map(void)
 {
struct extent_map_tree *em_tree;
 
-- 
2.15.1



[PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-08 Thread Tetsuo Handa
From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Thu, 8 Feb 2018 10:35:35 +0900
Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

 CPU0
 
lock(fs_reclaim);
lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
lockdep_clear_current_reclaim_state() in __perform_reclaim() and
lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim paths simply
propagate __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
tries to grab the 'fake' lock again when __perform_reclaim() has already
grabbed the 'fake' lock.

The

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
  return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
("page allocator: calculate the alloc_flags for allocation only once")
added the PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
  goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check and
allowing __need_fs_reclaim() to return false.

Reported-and-tested-by: Dave Jones 
Signed-off-by: Tetsuo Handa 
Cc: Peter Zijlstra 
Cc: Nick Piggin 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ce..19fb76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
return false;
 
/* this guy won't enter reclaim */
-   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+   if (current->flags & PF_MEMALLOC)
return false;
 
/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1


Re: Crash when unraring large archives on btrfs-filesystem

2018-02-08 Thread Stefan Malte Schumacher
> How much RAM on the machine and how much swap available? This looks like a
> lot of dirty data has accumulated, and then also there's swapping happening.
> Both swap out and swap in.

The machine has 16GB RAM and 40GB swap on an SSD. It's not doing much
besides being my personal file archive, so there should be plenty of
free memory for btrfs. I have remounted the filesystem with the
clear_cache option and will now apply the tweaks mentioned by Duncan.
If this does not fix the problem, I will install a more current kernel
from stretch-backports. Testing currently has btrfs-progs 4.13.3-1. Is
this version safe to use, and should I upgrade it along with the
kernel?


Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Anand Jain



On 02/06/2018 07:15 AM, Liu Bo wrote:

Btrfs tries its best to tolerate write errors, but rather silently
(except for some messages in the kernel log).

For raid1 and raid10, this is usually not a problem because there is a
copy as backup, while for a parity-based raid setup, i.e. raid5 and
raid6, the problem is that, if a write error occurs due to some bad
sectors, one horizontal stripe becomes degraded and the number of write
errors it can tolerate gets reduced by one; now if two disks fail,
data may be lost forever.


This is equally true in raid1, raid10, and raid5.  Sorry, I didn't get
why a degraded stripe is critical only to the parity-based
profiles (raid5/6).
And does it really need a bad chunk list to fix the parity-based
case, or can a balance without the bad chunks list fix it as well?


One way to mitigate the data loss pain is to expose 'bad chunks',
i.e. degraded chunks, to users, so that they can use 'btrfs balance'
to relocate the whole chunk and get the full raid6 protection again
(if the relocation works).


Depending on the type of disk error, the recovery action would vary. For
example, it can be a complete disk failure or an interim RW failure due to
environmental/transport factors. On most modern disks, automatic sector
relocation will do the job of relocating the real bad blocks.
The challenging task will be to know where to draw the line between
complete disk failure (failed) vs. interim disk failure (offline), so I
had plans to make that tunable based on the number of disk errors.

If it's confirmed that a disk has failed, auto-replace with the hot
spare disk will be the recovery action. Balancing with a failed disk won't
help.

Patches for these are on the ML.

If the failure is momentary due to environmental factors, including the
transport layer, then, since we expect the disk with the data to come back,
we shouldn't kick in the hot spare; that is the 'offline' disk state, or
maybe a state where reading old data is fine but writing new data is not.
I think you are addressing this interim state. It's better to define the
disk states first so that the recovery action for each can be defined. I can
revise the patches accordingly, so that replace vs. re-balance using bad
chunks can be decided.



This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
a chunk of raid5 or raid6 becomes degraded, it will appear in
'bad_chunks'.


AFAIK a variable-length list of output is not allowed in sysfs.

IMHO a list of bad chunks won't help the user (it's ok if it's needed by
the kernel). It would help more to provide the list of affected files,
so that the user can script making an additional interim external
copy until the disk recovers from the interim error.

Thanks, Anand