[PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior
Kernel uses a delayed chunk allocation behavior for metadata chunks

KERNEL:
btrfs_alloc_chunk()
|- __btrfs_alloc_chunk(): Only allocate chunk mapping
   |- btrfs_make_block_group(): Add corresponding bg to fs_info->new_bgs

Then at transaction commit time, it finishes the remaining work:
btrfs_start_dirty_block_groups():
|- btrfs_create_pending_block_groups()
   |- btrfs_insert_item(): To insert block group item
   |- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents

Although btrfs-progs btrfs_alloc_chunk() does all the work in one
function, it can still benefit from function refactor like:
btrfs-progs:
btrfs_alloc_chunk(): Wrapper for both normal and convert chunks
|- __btrfs_alloc_chunk(): Only alloc chunk mapping
|  |- btrfs_make_block_group(): <>
|- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents

With such refactor, the following functions can share most of its code
with kernel now:
__btrfs_alloc_chunk()
btrfs_finish_chunk_alloc()
btrfs_alloc_dev_extent()

Signed-off-by: Qu Wenruo
---
 volumes.c | 421 ++
 1 file changed, 260 insertions(+), 161 deletions(-)

diff --git a/volumes.c b/volumes.c
index cff54c612872..e89520326314 100644
--- a/volumes.c
+++ b/volumes.c
@@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 	return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
-				    struct btrfs_fs_info *fs_info,
-				    struct map_lookup *map, u64 stripe_size)
+static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+				  struct btrfs_device *device,
+				  u64 chunk_offset, u64 physical,
+				  u64 stripe_size)
 {
-	int ret = 0;
-	struct btrfs_path path;
+	int ret;
+	struct btrfs_path *path;
+	struct btrfs_fs_info *fs_info = device->fs_info;
 	struct btrfs_root *root = fs_info->dev_root;
 	struct btrfs_dev_extent *extent;
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
-	int i;
 
-	btrfs_init_path(&path);
-
-	for (i = 0; i < map->num_stripes; i++) {
-		struct btrfs_device *device = map->stripes[i].dev;
-
-		key.objectid = device->devid;
-		key.offset = map->stripes[i].physical;
-		key.type = BTRFS_DEV_EXTENT_KEY;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
 
-		ret = btrfs_insert_empty_item(trans, root, &path, &key,
-					      sizeof(*extent));
-		if (ret < 0)
-			goto out;
-		leaf = path.nodes[0];
-		extent = btrfs_item_ptr(leaf, path.slots[0],
-					struct btrfs_dev_extent);
-		btrfs_set_dev_extent_chunk_tree(leaf, extent,
+	key.objectid = device->devid;
+	key.offset = physical;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
+	if (ret)
+		goto out;
+	leaf = path->nodes[0];
+	extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
+	btrfs_set_dev_extent_chunk_tree(leaf, extent,
 					BTRFS_CHUNK_TREE_OBJECTID);
-		btrfs_set_dev_extent_chunk_objectid(leaf, extent,
-					BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-		btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
-
-		write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
-			(unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
-			BTRFS_UUID_SIZE);
-
-		btrfs_set_dev_extent_length(leaf, extent, stripe_size);
-		btrfs_mark_buffer_dirty(leaf);
-		btrfs_release_path(&path);
-
-		device->bytes_used += stripe_size;
-		ret = btrfs_update_device(trans, device);
-		if (ret < 0)
-			goto out;
-	}
+	btrfs_set_dev_extent_chunk_objectid(leaf, extent,
+					    BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+	btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
+	btrfs_set_dev_extent_length(leaf, extent, stripe_size);
+	btrfs_mark_buffer_dirty(leaf);
 out:
-	btrfs_release_path(&path);
+	btrfs_free_path(path);
 	return ret;
 }
 
@@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
 				/ sizeof(struct btrfs_stripe) + 1)
 
 /*
- * Alloc a chunk, will insert dev extents, chunk item, and insert new
- * block group and update space info (so that extent
[PATCH v2 02/10] btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk
We used to have two chunk allocators, btrfs_alloc_chunk() and
btrfs_alloc_data_chunk(). The former is the more generic one, while the
latter is only used in mkfs and convert, to allocate SINGLE data chunks.

Although btrfs_alloc_data_chunk() has some special hacks to cooperate
with convert, it's quite simple to integrate it into the generic chunk
allocator.

So merge them into one btrfs_alloc_chunk(), with an extra @convert
parameter and the necessary comments, to reduce code duplication and
leave less to maintain.

Signed-off-by: Qu Wenruo
---
 convert/main.c |   6 +-
 extent-tree.c  |   2 +-
 mkfs/main.c    |   8 +--
 volumes.c      | 219 ++---
 volumes.h      |   5 +-
 5 files changed, 94 insertions(+), 146 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index b3ea31d7af43..b2444bb2ff21 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -910,9 +910,9 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
 
 			len = min(max_chunk_size,
 				  cache->start + cache->size - cur);
-			ret = btrfs_alloc_data_chunk(trans, fs_info,
-					&cur_backup, len,
-					BTRFS_BLOCK_GROUP_DATA, 1);
+			ret = btrfs_alloc_chunk(trans, fs_info,
+					&cur_backup, &len,
+					BTRFS_BLOCK_GROUP_DATA, true);
 			if (ret < 0)
 				break;
 			ret = btrfs_make_block_group(trans, fs_info, 0,
diff --git a/extent-tree.c b/extent-tree.c
index e2ae74a7fe66..b085ab0352b3 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1906,7 +1906,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 		return 0;
 
 	ret = btrfs_alloc_chunk(trans, fs_info, &start, &num_bytes,
-				space_info->flags);
+				space_info->flags, false);
 	if (ret == -ENOSPC) {
 		space_info->full = 1;
 		return 0;
diff --git a/mkfs/main.c b/mkfs/main.c
index ea65c6d897b2..358395ca0250 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -82,7 +82,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 		ret = btrfs_alloc_chunk(trans, fs_info,
 					&chunk_start, &chunk_size,
 					BTRFS_BLOCK_GROUP_METADATA |
-					BTRFS_BLOCK_GROUP_DATA);
+					BTRFS_BLOCK_GROUP_DATA, false);
 		if (ret == -ENOSPC) {
 			error("no space to allocate data/metadata chunk");
 			goto err;
@@ -99,7 +99,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	} else {
 		ret = btrfs_alloc_chunk(trans, fs_info,
 					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_METADATA);
+					BTRFS_BLOCK_GROUP_METADATA, false);
 		if (ret == -ENOSPC) {
 			error("no space to allocate metadata chunk");
 			goto err;
@@ -133,7 +133,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 	if (!mixed) {
 		ret = btrfs_alloc_chunk(trans, fs_info,
 					&chunk_start, &chunk_size,
-					BTRFS_BLOCK_GROUP_DATA);
+					BTRFS_BLOCK_GROUP_DATA, false);
 		if (ret == -ENOSPC) {
 			error("no space to allocate data chunk");
 			goto err;
@@ -241,7 +241,7 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
 	int ret;
 
 	ret = btrfs_alloc_chunk(trans, fs_info,
-				&chunk_start, &chunk_size, type);
+				&chunk_start, &chunk_size, type, false);
 	if (ret == -ENOSPC) {
 		error("not enough free space to allocate chunk");
 		exit(1);
diff --git a/volumes.c b/volumes.c
index 677d085de96c..9ee4650351c3 100644
--- a/volumes.c
+++ b/volumes.c
@@ -836,9 +836,23 @@ error:
 				- 2 * sizeof(struct btrfs_chunk))	\
 				/ sizeof(struct btrfs_stripe) + 1)
 
+/*
+ * Alloc a chunk, will insert dev extents, chunk item.
+ * NOTE: This function will not insert block group item nor mark newly
+ * allocated chunk available for later allocation.
+ * Block group item and free space update is handled by btrfs_make_block_group()
+ *
+ * @start:	return value of allocated chunk start bytenr.
+ * @num_bytes:	return value of allocated chunk size
+ * @type:	chunk type (including both profile and type)
+ * @convert:	if the chunk is
[PATCH v2 08/10] btrfs-progs: Move chunk stripe size calculation function to volumes.h
Signed-off-by: Qu Wenruo
---
 check/main.c | 22 --
 volumes.h    | 22 ++
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/check/main.c b/check/main.c
index c051a862eb35..96607f6817af 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7638,28 +7638,6 @@ repair_abort:
 	return err;
 }
 
-u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
-{
-	u64 stripe_size;
-
-	if (type & BTRFS_BLOCK_GROUP_RAID0) {
-		stripe_size = length;
-		stripe_size /= num_stripes;
-	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
-		stripe_size = length * 2;
-		stripe_size /= num_stripes;
-	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
-		stripe_size = length;
-		stripe_size /= (num_stripes - 1);
-	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
-		stripe_size = length;
-		stripe_size /= (num_stripes - 2);
-	} else {
-		stripe_size = length;
-	}
-	return stripe_size;
-}
-
 /*
  * Check the chunk with its block group/dev list ref:
  * Return 0 if all refs seems valid.
diff --git a/volumes.h b/volumes.h
index 3741d45cae80..950de5a9f910 100644
--- a/volumes.h
+++ b/volumes.h
@@ -216,6 +216,28 @@ static inline int check_crossing_stripes(struct btrfs_fs_info *fs_info,
 		(bg_offset + len - 1) / BTRFS_STRIPE_LEN);
 }
 
+static inline u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
+{
+	u64 stripe_size;
+
+	if (type & BTRFS_BLOCK_GROUP_RAID0) {
+		stripe_size = length;
+		stripe_size /= num_stripes;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
+		stripe_size = length * 2;
+		stripe_size /= num_stripes;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
+		stripe_size = length;
+		stripe_size /= (num_stripes - 1);
+	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
+		stripe_size = length;
+		stripe_size /= (num_stripes - 2);
+	} else {
+		stripe_size = length;
+	}
+	return stripe_size;
+}
+
 int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, u64 logical,
 		      u64 *length, u64 *type,
 		      struct btrfs_multi_bio **multi_ret, int mirror_num,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
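To make the arithmetic in calc_stripe_length() concrete, here is a minimal standalone sketch of the helper moved by this patch. The function body is the one from the diff; the u64 typedef and the profile bit definitions are restated only so the snippet compiles outside btrfs-progs. The RAID5/RAID6 bits match the ctree.h context quoted in patch 04, while the RAID0 and RAID10 values are assumed from the btrfs on-disk format.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Profile bits; RAID0/RAID10 values are assumptions restated for this sketch. */
#define BTRFS_BLOCK_GROUP_RAID0		(1ULL << 3)
#define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
#define BTRFS_BLOCK_GROUP_RAID5		(1ULL << 7)
#define BTRFS_BLOCK_GROUP_RAID6		(1ULL << 8)

/* Per-device stripe length for a chunk of logical size @length. */
static inline u64 calc_stripe_length(u64 type, u64 length, int num_stripes)
{
	u64 stripe_size;

	if (type & BTRFS_BLOCK_GROUP_RAID0) {
		/* Data is spread evenly over all stripes. */
		stripe_size = length;
		stripe_size /= num_stripes;
	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
		/* Two copies of the data, spread over all stripes. */
		stripe_size = length * 2;
		stripe_size /= num_stripes;
	} else if (type & BTRFS_BLOCK_GROUP_RAID5) {
		/* One stripe's worth of space holds parity. */
		stripe_size = length;
		stripe_size /= (num_stripes - 1);
	} else if (type & BTRFS_BLOCK_GROUP_RAID6) {
		/* Two stripes' worth of space hold parity. */
		stripe_size = length;
		stripe_size /= (num_stripes - 2);
	} else {
		/* SINGLE/DUP/RAID1: every stripe is as long as the chunk. */
		stripe_size = length;
	}
	return stripe_size;
}
```

For example, a 3GiB RAID5 chunk on 4 devices uses a 1GiB stripe on each device, and a 4GiB RAID10 chunk on 4 devices uses 2GiB per device.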
[PATCH v2 03/10] btrfs-progs: Make btrfs_alloc_chunk to handle block group creation
Before this patch, chunk allocation is split into 2 parts:

1) Chunk allocation
   Handled by btrfs_alloc_chunk(), which will insert chunk and device
   extent items.

2) Block group allocation
   Handled by btrfs_make_block_group(), which will insert block group
   item and update space info.

However for chunk allocation, we don't really need to split these
operations, as every btrfs_alloc_chunk() call is followed by
btrfs_make_block_group().

So it's reasonable to merge the btrfs_make_block_group() call into
btrfs_alloc_chunk() to save several lines, and provide the basis for
the later btrfs_alloc_chunk() rework.

Signed-off-by: Qu Wenruo
---
 convert/main.c |  4 
 extent-tree.c  | 10 ++
 mkfs/main.c    | 19 ---
 volumes.c      | 10 ++
 4 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/convert/main.c b/convert/main.c
index b2444bb2ff21..240d3aa46db9 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -915,10 +915,6 @@ static int make_convert_data_block_groups(struct btrfs_trans_handle *trans,
 					BTRFS_BLOCK_GROUP_DATA, true);
 			if (ret < 0)
 				break;
-			ret = btrfs_make_block_group(trans, fs_info, 0,
-					BTRFS_BLOCK_GROUP_DATA, cur, len);
-			if (ret < 0)
-				break;
 			cur += len;
 		}
 	}
diff --git a/extent-tree.c b/extent-tree.c
index b085ab0352b3..bccd83d1bae6 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1909,15 +1909,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 				space_info->flags, false);
 	if (ret == -ENOSPC) {
 		space_info->full = 1;
-		return 0;
+		return ret;
 	}
-
-	BUG_ON(ret);
-
-	ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
-				     start, num_bytes);
-	BUG_ON(ret);
-	return 0;
+	return ret;
 }
 
 static int update_block_group(struct btrfs_root *root,
diff --git a/mkfs/main.c b/mkfs/main.c
index 358395ca0250..49159ea533b9 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -87,12 +87,6 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 			error("no space to allocate data/metadata chunk");
 			goto err;
 		}
-		if (ret)
-			return ret;
-		ret = btrfs_make_block_group(trans, fs_info, 0,
-					     BTRFS_BLOCK_GROUP_METADATA |
-					     BTRFS_BLOCK_GROUP_DATA,
-					     chunk_start, chunk_size);
 		if (ret)
 			return ret;
 		allocation->mixed += chunk_size;
@@ -106,12 +100,7 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 		}
 		if (ret)
 			return ret;
-		ret = btrfs_make_block_group(trans, fs_info, 0,
-					     BTRFS_BLOCK_GROUP_METADATA,
-					     chunk_start, chunk_size);
 		allocation->metadata += chunk_size;
-		if (ret)
-			return ret;
 	}
 
 	root->fs_info->system_allocs = 0;
@@ -140,12 +129,7 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 		}
 		if (ret)
 			return ret;
-		ret = btrfs_make_block_group(trans, fs_info, 0,
-					     BTRFS_BLOCK_GROUP_DATA,
-					     chunk_start, chunk_size);
 		allocation->data += chunk_size;
-		if (ret)
-			return ret;
 	}
 
 err:
@@ -249,9 +233,6 @@ static int create_one_raid_group(struct btrfs_trans_handle *trans,
 	if (ret)
 		return ret;
 
-	ret = btrfs_make_block_group(trans, fs_info, 0,
-				     type, chunk_start, chunk_size);
-
 	type &= BTRFS_BLOCK_GROUP_TYPE_MASK;
 	if (type == BTRFS_BLOCK_GROUP_DATA) {
 		allocation->data += chunk_size;
diff --git a/volumes.c b/volumes.c
index 9ee4650351c3..a9dc8c939dc5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -837,10 +837,9 @@ error:
 				/ sizeof(struct btrfs_stripe) + 1)
 
 /*
- * Alloc a chunk, will insert dev extents, chunk item.
- * NOTE: This function will not insert block group item nor mark newly
- * allocated chunk available for later allocation.
- * Block group item and free space update is handled by btrfs_make_block_group()
+ * Alloc a chunk, will insert dev extents, chunk item, and insert new
+ * block group and update space info (so
[PATCH v2 06/10] btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
Used by later btrfs_alloc_chunk() rework.

Signed-off-by: Qu Wenruo
---
 Makefile          |   3 +-
 kernel-lib/sort.c | 104 ++
 kernel-lib/sort.h |  16 +
 3 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100644 kernel-lib/sort.c
 create mode 100644 kernel-lib/sort.h

diff --git a/Makefile b/Makefile
index 327cdfa08eba..3db7bbe04662 100644
--- a/Makefile
+++ b/Makefile
@@ -106,7 +106,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
-	  fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o
+	  fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o \
+	  kernel-lib/sort.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o check/main.o \
diff --git a/kernel-lib/sort.c b/kernel-lib/sort.c
new file mode 100644
index 000000000000..70ae3dbe2852
--- /dev/null
+++ b/kernel-lib/sort.c
@@ -0,0 +1,104 @@
+/*
+ * taken from linux kernel lib/sort.c, removed kernel config code and adapted
+ * for btrfsprogs
+ */
+
+#include "sort.h"
+
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ *
+ * Jan 23 2005  Matt Mackall
+ */
+
+static int alignment_ok(const void *base, int align)
+{
+	return ((unsigned long)base & (align - 1)) == 0;
+}
+
+static void u32_swap(void *a, void *b, int size)
+{
+	u32 t = *(u32 *)a;
+	*(u32 *)a = *(u32 *)b;
+	*(u32 *)b = t;
+}
+
+static void u64_swap(void *a, void *b, int size)
+{
+	u64 t = *(u64 *)a;
+	*(u64 *)a = *(u64 *)b;
+	*(u64 *)b = t;
+}
+
+static void generic_swap(void *a, void *b, int size)
+{
+	char t;
+
+	do {
+		t = *(char *)a;
+		*(char *)a++ = *(char *)b;
+		*(char *)b++ = t;
+	} while (--size > 0);
+}
+
+/**
+ * sort - sort an array of elements
+ * @base: pointer to data to sort
+ * @num: number of elements
+ * @size: size of each element
+ * @cmp_func: pointer to comparison function
+ * @swap_func: pointer to swap function or NULL
+ *
+ * This function does a heapsort on the given array. You may provide a
+ * swap_func function optimized to your element type.
+ *
+ * Sorting time is O(n log n) both on average and worst-case. While
+ * qsort is about 20% faster on average, it suffers from exploitable
+ * O(n*n) worst-case behavior and extra memory requirements that make
+ * it less suitable for kernel use.
+ */
+
+void sort(void *base, size_t num, size_t size,
+	  int (*cmp_func)(const void *, const void *),
+	  void (*swap_func)(void *, void *, int size))
+{
+	/* pre-scale counters for performance */
+	int i = (num/2 - 1) * size, n = num * size, c, r;
+
+	if (!swap_func) {
+		if (size == 4 && alignment_ok(base, 4))
+			swap_func = u32_swap;
+		else if (size == 8 && alignment_ok(base, 8))
+			swap_func = u64_swap;
+		else
+			swap_func = generic_swap;
+	}
+
+	/* heapify */
+	for ( ; i >= 0; i -= size) {
+		for (r = i; r * 2 + size < n; r = c) {
+			c = r * 2 + size;
+			if (c < n - size &&
+			    cmp_func(base + c, base + c + size) < 0)
+				c += size;
+			if (cmp_func(base + r, base + c) >= 0)
+				break;
+			swap_func(base + r, base + c, size);
+		}
+	}
+
+	/* sort */
+	for (i = n - size; i > 0; i -= size) {
+		swap_func(base, base + i, size);
+		for (r = 0; r * 2 + size < i; r = c) {
+			c = r * 2 + size;
+			if (c < i - size &&
+			    cmp_func(base + c, base + c + size) < 0)
+				c += size;
+			if (cmp_func(base + r, base + c) >= 0)
+				break;
+			swap_func(base + r, base + c, size);
+		}
+	}
+}
diff --git a/kernel-lib/sort.h b/kernel-lib/sort.h
new file mode 100644
index 000000000000..9355e01248f2
--- /dev/null
+++ b/kernel-lib/sort.h
@@ -0,0 +1,16 @@
+/*
+ * taken from linux kernel include/list_sort.h
+ * changed include to kerncompat.h
+ */
+
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SORT_H
+#define _LINUX_SORT_H
+
+#include "kerncompat.h"
+
+void sort(void *base,
[PATCH v2 00/10] Chunk allocator unification
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/libbtrfs_prepare

This patchset unifies a large part of the chunk allocator (free device
extent search) between kernel and btrfs-progs, and reuses kernel
function structures like btrfs_finish_chunk_alloc() and
btrfs_alloc_dev_extent().

Before the unification:
Kernel                                | Btrfs-progs
btrfs_alloc_chunk()                   | btrfs_alloc_chunk()
|- __btrfs_alloc_chunk()              | |- Do all the work
                                      |
btrfs_create_pending_block_groups()   |
|- btrfs_insert_item()                |
|- btrfs_finish_chunk_alloc()         |
   |- btrfs_alloc_dev_extent()        |

After the unification:
Kernel                                | Btrfs-progs
btrfs_alloc_chunk()                   | btrfs_alloc_chunk()
|- __btrfs_alloc_chunk()              | |- __btrfs_alloc_chunk()
                                      | |- btrfs_finish_chunk_alloc()
btrfs_create_pending_block_groups()   |    |- btrfs_alloc_dev_extent()
|- btrfs_insert_item()                |
|- btrfs_finish_chunk_alloc()         |

The similar functions now share the same code base, with minor
member/function changes.

This update only modifies patches 7 and after.

Changelog:
v2:
  Make error handler in patch 7 better.
  New patches to unify more functions used in btrfs_alloc_chunk()

Qu Wenruo (10):
  btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to
    fs_info
  btrfs-progs: Merge btrfs_alloc_data_chunk into btrfs_alloc_chunk
  btrfs-progs: Make btrfs_alloc_chunk to handle block group creation
  btrfs-progs: Introduce btrfs_raid_array and related infrastructures
  btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum
    hole size
  btrfs-progs: kernel-lib: Port kernel sort() to btrfs-progs
  btrfs-progs: volumes: Unify free dev extent search behavior between
    kernel and btrfs-progs.
  btrfs-progs: Move chunk stripe size calculation function to volumes.h
  btrfs-progs: Use btrfs_device->fs_info to replace
    btrfs_device->dev_root
  btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and
    behavior

 Makefile          |   3 +-
 check/main.c      |  22 --
 convert/main.c    |  10 +-
 ctree.h           |  12 +-
 extent-tree.c     |  12 +-
 kerncompat.h      |   5 +
 kernel-lib/sort.c | 104 ++
 kernel-lib/sort.h |  16 +
 mkfs/main.c       |  27 +-
 utils.c           |   2 +-
 volumes.c         | 927 --
 volumes.h         |  66 +++-
 12 files changed, 695 insertions(+), 511 deletions(-)
 create mode 100644 kernel-lib/sort.c
 create mode 100644 kernel-lib/sort.h

-- 
2.16.1
[PATCH v2 07/10] btrfs-progs: volumes: Unify free dev extent search behavior between kernel and btrfs-progs.
As preparation to create libbtrfs, which shares code between kernel and
btrfs-progs, this patch mainly unifies the search for free device
extents.

The main modifications are:

1) Search for free device extent
   Use the kernel method: sort the devices by their max hole capacity,
   and use the sorted result to determine stripe size and chunk size.

   The previous method, which uses the available bytes of each device
   to search, can't handle scattered small holes in each device.

2) Chunk/stripe minimal size
   Remove the minimal size requirement.
   Now the real minimal stripe size is 64K (BTRFS_STRIPE_LEN), the same
   as the kernel one.

   The upper limit is kept as is, and the minimal device size is still
   kept for a while.
   But since we no longer have the strange minimal stripe size limit,
   the existing minimal device size calculation won't cause any problem.

3) How to insert device extents
   Although not the same as kernel, here we follow kernel behavior to
   delay dev extent insertion.

   There is a plan to follow kernel __btrfs_alloc_chunk() and make it
   only handle chunk mapping allocation, doing no tree operations.

4) Usage of btrfs_raid_array[]
   Which makes a lot of old if-else branches disappear.

There is still a lot of work to do (both kernel and btrfs-progs) before
we can start extracting code into libbtrfs, but this should bring
libbtrfs within our reach.

Signed-off-by: Qu Wenruo
---
 kerncompat.h |   5 +
 volumes.c    | 621 ++-
 volumes.h    |   7 +
 3 files changed, 285 insertions(+), 348 deletions(-)

diff --git a/kerncompat.h b/kerncompat.h
index fa96715fb70c..658d28ed0792 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -285,6 +285,7 @@ static inline int IS_ERR_OR_NULL(const void *ptr)
  */
 #define kmalloc(x, y) malloc(x)
 #define kzalloc(x, y) calloc(1, x)
+#define kcalloc(x, y) calloc(x, y)
 #define kstrdup(x, y) strdup(x)
 #define kfree(x) free(x)
 #define vmalloc(x) malloc(x)
@@ -394,4 +395,8 @@ struct __una_u64 { __le64 x; } __attribute__((__packed__));
 #define noinline
 #endif
 
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+	return dividend / ((u64) divisor);
+}
 #endif
diff --git a/volumes.c b/volumes.c
index f4009ffa7c9e..c2efb3c674dc 100644
--- a/volumes.c
+++ b/volumes.c
@@ -29,6 +29,7 @@
 #include "volumes.h"
 #include "utils.h"
 #include "kernel-lib/raid56.h"
+#include "kernel-lib/sort.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -522,55 +523,55 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 	return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
-static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
-				  struct btrfs_device *device,
-				  u64 chunk_offset, u64 num_bytes, u64 *start,
-				  int convert)
+static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct map_lookup *map, u64 stripe_size)
 {
-	int ret;
-	struct btrfs_path *path;
-	struct btrfs_root *root = device->dev_root;
+	int ret = 0;
+	struct btrfs_path path;
+	struct btrfs_root *root = fs_info->dev_root;
 	struct btrfs_dev_extent *extent;
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
+	int i;
 
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
+	btrfs_init_path(&path);
 
-	/*
-	 * For convert case, just skip search free dev_extent, as caller
-	 * is responsible to make sure it's free.
-	 */
-	if (!convert) {
-		ret = find_free_dev_extent(device, num_bytes, start, NULL);
-		if (ret)
-			goto err;
-	}
+	for (i = 0; i < map->num_stripes; i++) {
+		struct btrfs_device *device = map->stripes[i].dev;
 
-	key.objectid = device->devid;
-	key.offset = *start;
-	key.type = BTRFS_DEV_EXTENT_KEY;
-	ret = btrfs_insert_empty_item(trans, root, path, &key,
-				      sizeof(*extent));
-	BUG_ON(ret);
+		key.objectid = device->devid;
+		key.offset = map->stripes[i].physical;
+		key.type = BTRFS_DEV_EXTENT_KEY;
 
-	leaf = path->nodes[0];
-	extent = btrfs_item_ptr(leaf, path->slots[0],
-				struct btrfs_dev_extent);
-	btrfs_set_dev_extent_chunk_tree(leaf, extent, BTRFS_CHUNK_TREE_OBJECTID);
-	btrfs_set_dev_extent_chunk_objectid(leaf, extent,
-					    BTRFS_FIRST_CHUNK_TREE_OBJECTID);
-	btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
-
-	write_extent_buffer(leaf,
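Point 1) of the commit message (sort devices by their largest hole, then derive the stripe size from the sorted result) can be sketched as below. This is only an illustration under stated assumptions: struct dev_hole_sketch and both functions are hypothetical names, not the btrfs-progs structures, and standard qsort() stands in for the sort() helper ported in patch 06, which takes the same comparator plus an optional swap function.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified per-device record used only in this sketch. */
struct dev_hole_sketch {
	uint64_t devid;
	uint64_t max_avail;	/* largest contiguous free hole on the device */
	uint64_t total_avail;	/* total free bytes on the device */
};

/* Order devices by max_avail, then total_avail, both descending. */
static int cmp_dev_hole_sketch(const void *a, const void *b)
{
	const struct dev_hole_sketch *da = a;
	const struct dev_hole_sketch *db = b;

	if (da->max_avail > db->max_avail)
		return -1;
	if (da->max_avail < db->max_avail)
		return 1;
	if (da->total_avail > db->total_avail)
		return -1;
	if (da->total_avail < db->total_avail)
		return 1;
	return 0;
}

/*
 * Sort the candidates and return the stripe size that fits on each of the
 * @ndevs best devices: the largest hole of the worst device still used.
 * Returns 0 when fewer than @ndevs candidates exist.
 */
static uint64_t pick_stripe_size_sketch(struct dev_hole_sketch *devs,
					size_t nr, size_t ndevs)
{
	if (ndevs == 0 || ndevs > nr)
		return 0;
	qsort(devs, nr, sizeof(*devs), cmp_dev_hole_sketch);
	return devs[ndevs - 1].max_avail;
}
```

A device with 100 free bytes scattered in holes of at most 10 bytes sorts behind a device with 60 free bytes in one 50-byte hole, which is the case the old "available bytes" search could not handle.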
[PATCH v2 09/10] btrfs-progs: Use btrfs_device->fs_info to replace btrfs_device->dev_root
Same as kernel declaration.

Signed-off-by: Qu Wenruo
---
 utils.c   | 2 +-
 volumes.c | 6 +++---
 volumes.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/utils.c b/utils.c
index e9cb3a82fda6..eff5fb64cfd5 100644
--- a/utils.c
+++ b/utils.c
@@ -216,7 +216,7 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 	device->total_bytes = device_total_bytes;
 	device->bytes_used = 0;
 	device->total_ios = 0;
-	device->dev_root = fs_info->dev_root;
+	device->fs_info = fs_info;
 	device->name = strdup(path);
 	if (!device->name) {
 		ret = -ENOMEM;
diff --git a/volumes.c b/volumes.c
index c2efb3c674dc..cff54c612872 100644
--- a/volumes.c
+++ b/volumes.c
@@ -380,7 +380,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 				      u64 *start, u64 *len)
 {
 	struct btrfs_key key;
-	struct btrfs_root *root = device->dev_root;
+	struct btrfs_root *root = device->fs_info->dev_root;
 	struct btrfs_dev_extent *dev_extent;
 	struct btrfs_path *path;
 	u64 hole_size;
@@ -724,7 +724,7 @@ int btrfs_update_device(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
 
-	root = device->dev_root->fs_info->chunk_root;
+	root = device->fs_info->chunk_root;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -1895,7 +1895,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 	}
 
 	fill_device_from_item(leaf, dev_item, device);
-	device->dev_root = fs_info->dev_root;
+	device->fs_info = fs_info;
 	return ret;
 }
 
diff --git a/volumes.h b/volumes.h
index 950de5a9f910..84deafc98b0d 100644
--- a/volumes.h
+++ b/volumes.h
@@ -26,7 +26,7 @@
 
 struct btrfs_device {
 	struct list_head dev_list;
-	struct btrfs_root *dev_root;
+	struct btrfs_fs_info *fs_info;
 	struct btrfs_fs_devices *fs_devices;
 
 	u64 total_ios;
-- 
2.16.1
[PATCH v2 05/10] btrfs-progs: volumes: Allow find_free_dev_extent() to return maximum hole size
Just like the kernel find_free_dev_extent(), allow it to return the
maximum hole size, so that we can build the device list for the later
chunk allocator rework.

Signed-off-by: Qu Wenruo
---
 volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/volumes.c b/volumes.c
index b47ff1f392b5..f4009ffa7c9e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -516,10 +516,10 @@ out:
 }
 
 static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
-				u64 *start)
+				u64 *start, u64 *len)
 {
 	/* FIXME use last free of some kind */
-	return find_free_dev_extent_start(device, num_bytes, 0, start, NULL);
+	return find_free_dev_extent_start(device, num_bytes, 0, start, len);
 }
 
 static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
@@ -543,7 +543,7 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
 	 * is responsible to make sure it's free.
 	 */
 	if (!convert) {
-		ret = find_free_dev_extent(device, num_bytes, start);
+		ret = find_free_dev_extent(device, num_bytes, start, NULL);
 		if (ret)
 			goto err;
 	}
-- 
2.16.1
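As a rough illustration of what "return maximum hole size" means for a caller, the sketch below scans a device's used extents for the first hole of at least num_bytes, and otherwise reports the largest hole it saw. It is not the btrfs-progs implementation: the real function walks dev extent items in the device tree, while struct used_extent_sketch, find_hole_sketch() and the array-based layout here are hypothetical, and the exact reporting convention of the real helper is assumed rather than quoted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one allocated range on a device, sorted by start offset. */
struct used_extent_sketch {
	uint64_t start;
	uint64_t len;
};

/*
 * First-fit search for a hole of at least @num_bytes.
 * On success returns 0 and reports that hole in @start/@len.
 * On failure returns -ENOSPC and reports the largest hole found, so the
 * caller can still learn how much contiguous space the device offers.
 * @len may be NULL when the caller does not need the size.
 */
static int find_hole_sketch(const struct used_extent_sketch *used, size_t nr,
			    uint64_t dev_size, uint64_t num_bytes,
			    uint64_t *start, uint64_t *len)
{
	uint64_t cursor = 0;	/* end of the last used range seen so far */
	uint64_t max_start = 0;
	uint64_t max_size = 0;
	size_t i;

	for (i = 0; i <= nr; i++) {
		/* The hole ends at the next used extent or at the device end. */
		uint64_t hole_end = (i < nr) ? used[i].start : dev_size;

		if (hole_end > cursor) {
			uint64_t hole_size = hole_end - cursor;

			if (hole_size > max_size) {
				max_start = cursor;
				max_size = hole_size;
			}
			if (hole_size >= num_bytes)
				break;	/* first fit */
		}
		if (i < nr && used[i].start + used[i].len > cursor)
			cursor = used[i].start + used[i].len;
	}

	*start = max_start;
	if (len)
		*len = max_size;
	return max_size >= num_bytes ? 0 : -ENOSPC;
}
```

Passing NULL for the length, as the existing btrfs_alloc_dev_extent() caller does in this patch, keeps the old behavior of only caring about the start offset.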
[PATCH v2 04/10] btrfs-progs: Introduce btrfs_raid_array and related infrastructures
As part of the effort to unify code and behavior between btrfs-progs and
kernel, copy the btrfs_raid_array from kernel to btrfs-progs.

So later we can use btrfs_raid_array[] to get the needed raid info
rather than manually doing if-else branches.

Signed-off-by: Qu Wenruo
---
 ctree.h   | 12 +++-
 volumes.c | 66 +++
 volumes.h | 30 +
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/ctree.h b/ctree.h
index 17cdac76c58c..c76849d8deb7 100644
--- a/ctree.h
+++ b/ctree.h
@@ -958,7 +958,17 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID5	(1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6	(1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RESERVED	BTRFS_AVAIL_ALLOC_BIT_SINGLE
-#define BTRFS_NR_RAID_TYPES		7
+
+enum btrfs_raid_types {
+	BTRFS_RAID_RAID10,
+	BTRFS_RAID_RAID1,
+	BTRFS_RAID_DUP,
+	BTRFS_RAID_RAID0,
+	BTRFS_RAID_SINGLE,
+	BTRFS_RAID_RAID5,
+	BTRFS_RAID_RAID6,
+	BTRFS_NR_RAID_TYPES
+};
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
 					 BTRFS_BLOCK_GROUP_SYSTEM |  \
diff --git a/volumes.c b/volumes.c
index a9dc8c939dc5..b47ff1f392b5 100644
--- a/volumes.c
+++ b/volumes.c
@@ -30,6 +30,72 @@
 #include "utils.h"
 #include "kernel-lib/raid56.h"
 
+const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
+	[BTRFS_RAID_RAID10] = {
+		.sub_stripes	= 2,
+		.dev_stripes	= 1,
+		.devs_max	= 0,	/* 0 == as many as possible */
+		.devs_min	= 4,
+		.tolerated_failures = 1,
+		.devs_increment	= 2,
+		.ncopies	= 2,
+	},
+	[BTRFS_RAID_RAID1] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 2,
+		.devs_min	= 2,
+		.tolerated_failures = 1,
+		.devs_increment	= 2,
+		.ncopies	= 2,
+	},
+	[BTRFS_RAID_DUP] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 2,
+		.devs_max	= 1,
+		.devs_min	= 1,
+		.tolerated_failures = 0,
+		.devs_increment	= 1,
+		.ncopies	= 2,
+	},
+	[BTRFS_RAID_RAID0] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 2,
+		.tolerated_failures = 0,
+		.devs_increment	= 1,
+		.ncopies	= 1,
+	},
+	[BTRFS_RAID_SINGLE] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 1,
+		.devs_min	= 1,
+		.tolerated_failures = 0,
+		.devs_increment	= 1,
+		.ncopies	= 1,
+	},
+	[BTRFS_RAID_RAID5] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 2,
+		.tolerated_failures = 1,
+		.devs_increment	= 1,
+		.ncopies	= 2,
+	},
+	[BTRFS_RAID_RAID6] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 1,
+		.ncopies	= 3,
+	},
+};
+
 struct stripe {
 	struct btrfs_device *dev;
 	u64 physical;
diff --git a/volumes.h b/volumes.h
index 7bbdf615d31a..612a0a7586f4 100644
--- a/volumes.h
+++ b/volumes.h
@@ -108,6 +108,36 @@ struct map_lookup {
 	struct btrfs_bio_stripe stripes[];
 };
 
+struct btrfs_raid_attr {
+	int sub_stripes;	/* sub_stripes info for map */
+	int dev_stripes;	/* stripes per dev */
+	int devs_max;		/* max devs to use */
+	int devs_min;		/* min devs needed */
+	int tolerated_failures; /* max tolerated fail devs */
+	int devs_increment;	/* ndevs has to be a multiple of this */
+	int ncopies;		/* how many copies to data has */
+};
+
+extern const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES];
+
+static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
+{
+	if (flags & BTRFS_BLOCK_GROUP_RAID10)
+		return BTRFS_RAID_RAID10;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
+		return BTRFS_RAID_RAID1;
+	else if (flags & BTRFS_BLOCK_GROUP_DUP)
+		return BTRFS_RAID_DUP;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
+		return BTRFS_RAID_RAID0;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID5)
+		return
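To show how a table lookup can replace per-profile if-else branches, here is a small standalone sketch. The devs_max, devs_min and devs_increment values are copied from the btrfs_raid_array[] entries in this patch; the type and function names carrying a sketch suffix are hypothetical, and the order of the three steps (round down, check minimum, cap at maximum) is an assumption about how an allocator would consume the table, not code from this series.

```c
/* Hypothetical mirror of enum btrfs_raid_types, for this sketch only. */
enum raid_index_sketch {
	SK_RAID10,
	SK_RAID1,
	SK_DUP,
	SK_RAID0,
	SK_SINGLE,
	SK_RAID5,
	SK_RAID6,
	SK_NR_RAID_TYPES
};

/* Subset of struct btrfs_raid_attr needed to pick a device count. */
struct raid_attr_sketch {
	int devs_max;		/* max devs to use, 0 == as many as possible */
	int devs_min;		/* min devs needed */
	int devs_increment;	/* ndevs has to be a multiple of this */
};

static const struct raid_attr_sketch raid_array_sketch[SK_NR_RAID_TYPES] = {
	[SK_RAID10] = { .devs_max = 0, .devs_min = 4, .devs_increment = 2 },
	[SK_RAID1]  = { .devs_max = 2, .devs_min = 2, .devs_increment = 2 },
	[SK_DUP]    = { .devs_max = 1, .devs_min = 1, .devs_increment = 1 },
	[SK_RAID0]  = { .devs_max = 0, .devs_min = 2, .devs_increment = 1 },
	[SK_SINGLE] = { .devs_max = 1, .devs_min = 1, .devs_increment = 1 },
	[SK_RAID5]  = { .devs_max = 0, .devs_min = 2, .devs_increment = 1 },
	[SK_RAID6]  = { .devs_max = 0, .devs_min = 3, .devs_increment = 1 },
};

/*
 * Given @ndevs candidate devices, return how many of them a chunk of the
 * given profile would use, or -1 if the profile cannot be satisfied.
 */
static int usable_devs_sketch(enum raid_index_sketch index, int ndevs)
{
	const struct raid_attr_sketch *attr = &raid_array_sketch[index];

	/* Round down to a multiple of devs_increment. */
	ndevs -= ndevs % attr->devs_increment;
	if (ndevs < attr->devs_min)
		return -1;
	if (attr->devs_max && ndevs > attr->devs_max)
		ndevs = attr->devs_max;
	return ndevs;
}
```

With 5 candidate devices, RAID10 would use 4 and RAID1 would use 2, and both answers come from the same three lines of logic instead of one branch per profile.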
[PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()
Strangely, btrfs_print_tree() checks the child level while read_node_slot()
does not.

That is to say, for the following corruption, btrfs_search_slot() or
btrfs_next_leaf() can return an invalid leaf:

  Parent eb:
    node XX level 1
              ^^^
              Child should be leaf (level 0)
    ...
    key (XXX XXX XXX) block YY

  Child eb:
    leaf YY level 1
              ^^^
              Something went wrong now

Later callers can easily be confused by the corrupted leaf that is returned.

Although the root cause (power loss, but still something wrong breaking the
metadata CoW of btrfs) is still unknown, at least enhance btrfs-progs to
avoid the SEGV.

Reported-by: Ralph Gauges
Signed-off-by: Qu Wenruo
---
changelog:
v2:
  Check if the extent buffer is up-to-date before checking its level to
  avoid possible NULL pointer access.
---
 ctree.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/ctree.c b/ctree.c
index 4fc33b14000a..430805e3043f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -22,6 +22,7 @@
 #include "repair.h"
 #include "internal.h"
 #include "sizes.h"
+#include "messages.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
 		      *root, struct btrfs_path *path, int level);
@@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
 struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
 				   struct extent_buffer *parent, int slot)
 {
+	struct extent_buffer *ret;
 	int level = btrfs_header_level(parent);
+
 	if (slot < 0)
 		return NULL;
 	if (slot >= btrfs_header_nritems(parent))
@@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
 	if (level == 0)
 		return NULL;
 
-	return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
+	ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
 		       btrfs_node_ptr_generation(parent, slot));
+	if (!extent_buffer_uptodate(ret))
+		return ERR_PTR(-EIO);
+
+	if (btrfs_header_level(ret) != level - 1) {
+		error("child eb corrupted: parent bytenr=%llu item=%d parent level=%d child level=%d",
+		      btrfs_header_bytenr(parent), slot,
+		      btrfs_header_level(parent), btrfs_header_level(ret));
+		free_extent_buffer(ret);
+		return ERR_PTR(-EIO);
+	}
+	return ret;
 }
 
 static int balance_level(struct btrfs_trans_handle *trans,
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
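The level check added by this patch can be illustrated in isolation. What follows is only a minimal sketch, not the btrfs-progs code: `struct toy_eb` and `check_child_level()` are hypothetical stand-ins for the extent buffer and for the comparison `btrfs_header_level(ret) != level - 1`, and the error text is copied from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for an extent buffer header; not the btrfs type. */
struct toy_eb {
	unsigned long long bytenr;
	int level;		/* 0 = leaf, > 0 = node */
};

/*
 * Return 0 if the child sits exactly one level below its parent and
 * -EIO otherwise. This mirrors only the idea of the check in the patch.
 */
static int check_child_level(const struct toy_eb *parent,
			     const struct toy_eb *child, int slot)
{
	assert(parent != NULL && child != NULL);

	if (parent->level == 0)
		return -EIO;	/* a leaf has no children to read */

	if (child->level != parent->level - 1) {
		fprintf(stderr,
			"child eb corrupted: parent bytenr=%llu item=%d parent level=%d child level=%d\n",
			parent->bytenr, slot, parent->level, child->level);
		return -EIO;
	}
	return 0;
}
```

With a level 1 parent, a level 0 child passes, while a child claiming level 1 or level 2 is rejected before any caller can treat it as a leaf.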
[PATCH v2 01/10] btrfs-progs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
Signed-off-by: Qu Wenruo
---
 volumes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/volumes.c b/volumes.c
index edad367b593c..677d085de96c 100644
--- a/volumes.c
+++ b/volumes.c
@@ -826,7 +826,7 @@ error:
 	return ret;
 }
 
-#define BTRFS_MAX_DEVS(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info)	\
+#define BTRFS_MAX_DEVS(info) ((BTRFS_LEAF_DATA_SIZE(info)	\
 			- sizeof(struct btrfs_item)		\
 			- sizeof(struct btrfs_chunk))		\
 			/ sizeof(struct btrfs_stripe) + 1)
@@ -882,12 +882,12 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			calc_size = SZ_1G;
 			max_chunk_size = 10 * calc_size;
 			min_stripe_size = SZ_64M;
-			max_stripes = BTRFS_MAX_DEVS(chunk_root);
+			max_stripes = BTRFS_MAX_DEVS(info);
 		} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
 			calc_size = SZ_1G;
 			max_chunk_size = 4 * calc_size;
 			min_stripe_size = SZ_32M;
-			max_stripes = BTRFS_MAX_DEVS(chunk_root);
+			max_stripes = BTRFS_MAX_DEVS(info);
 		}
 	}
 	if (type & BTRFS_BLOCK_GROUP_RAID1) {
-- 
2.16.1
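For readers who want to see what BTRFS_MAX_DEVS() evaluates to, the arithmetic can be sketched with plain numbers. This is an illustration under stated assumptions, not the btrfs-progs macro: the structure sizes below are hard-coded guesses at the packed on-disk sizes (the real macro uses sizeof()), the leaf data size is assumed to be the node size minus the header, and all `toy_`/`TOY_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed on-disk sizes in bytes, hard-coded for illustration only; the
 * real macro takes them from sizeof() on the packed btrfs structures.
 */
#define TOY_HEADER_SIZE	101	/* struct btrfs_header */
#define TOY_ITEM_SIZE	25	/* struct btrfs_item */
#define TOY_CHUNK_SIZE	80	/* struct btrfs_chunk, includes one stripe */
#define TOY_STRIPE_SIZE	32	/* struct btrfs_stripe */

/* Assumption for this sketch: leaf data size = node size - header size. */
static size_t toy_leaf_data_size(size_t nodesize)
{
	assert(nodesize > TOY_HEADER_SIZE + TOY_ITEM_SIZE + TOY_CHUNK_SIZE);
	return nodesize - TOY_HEADER_SIZE;
}

/*
 * Same shape as BTRFS_MAX_DEVS(): the space left in a leaf after one item
 * header and one chunk item, divided by the stripe size, plus the one
 * stripe already embedded in the chunk item.
 */
static size_t toy_max_devs(size_t nodesize)
{
	return (toy_leaf_data_size(nodesize) - TOY_ITEM_SIZE - TOY_CHUNK_SIZE)
		/ TOY_STRIPE_SIZE + 1;
}
```

Under these assumed sizes, a 16 KiB node size yields 506 stripes and a 4 KiB node size yields 122. The point of the patch is unchanged by the numbers: the macro only needs fs_info, not a root.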
Re: segmentation fault in btrfs tool v4.15
On 2018-02-09 15:23, Ralph Gauges wrote:
> Hi Qu,
>
> I applied the patch to the sources of v4.15 and ran it in gdb. This is
> the result.
>
> (gdb) set args check /dev/sdf1
> (gdb) run
> Starting program: /home/gauges/Applications/bin/btrfs check /dev/sdf1
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> parent transid verify failed on 266195058688 wanted 1857 found 1864
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=247283269632 item=23 parent
> level=1 child level=2
> ERROR: cannot open file system
> [Inferior 1 (process 7149) exited with code 01]
>
>
> So obviously it does not crash any more. Thanks.
> Since you are an expert on the btrfs filesystem, any hints as to how I
> could fix my backup
> partition?

People on the mailing list may have better ideas, so I have CCed the list.

In fact, all of the problems are in the extent tree, so maybe we could
salvage something by mounting the filesystem read-only?

If the kernel can't mount it even read-only, then "btrfs restore" may be
your last chance.

Thanks,
Qu

> This output from btrfs seems to suggest that "btrfs check"
> can't handle this error?!
> Or is this last error something else that didn't show up so far
> because of the segfault?
>
> Thanks
>
> Ralph
>
Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
On Mon, Feb 05, 2018 at 04:15:02PM -0700, Liu Bo wrote: > Btrfs tries its best to tolerate write errors, but kind of silently > (except some messages in kernel log). > > For raid1 and raid10, this is usually not a problem because there is a > copy as backup, while for parity based raid setup, i.e. raid5 and > raid6, the problem is that, if a write error occurs due to some bad > sectors, one horizonal stripe becomes degraded and the number of write > errors it can tolerate gets reduced by one, now if two disk fails, > data may be lost forever. > > One way to mitigate the data loss pain is to expose 'bad chunks', > i.e. degraded chunks, to users, so that they can use 'btrfs balance' > to relocate the whole chunk and get the full raid6 protection again > (if the relocation works). > > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory. Once > a chunk of raid5 or raid6 becomes degraded, it will appear in > 'bad_chunks'. > > Signed-off-by: Liu Bo> --- > - In this patch, 'bad chunks' is not persistent on disk, but it can be > added if it's thought to be a good idea. > - This is lightly tested, comments are very welcome. Hmmm... sorry to be late to the party and dump a bunch of semirelated work suggestions, but what if you implemented GETFSMAP for btrfs? Then you could define a new 'defective' fsmap type/flag/whatever and export it for whatever metadata/filedata/whatever is now screwed up? Existing interface, you don't have to kludge sysfs data, none of this string interpretation stuff... 
--D > > fs/btrfs/ctree.h | 8 +++ > fs/btrfs/disk-io.c | 2 ++ > fs/btrfs/extent-tree.c | 13 +++ > fs/btrfs/raid56.c | 59 > -- > fs/btrfs/sysfs.c | 26 ++ > fs/btrfs/volumes.c | 15 +++-- > fs/btrfs/volumes.h | 2 ++ > 7 files changed, 121 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 13c260b..08aad65 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1101,6 +1101,9 @@ struct btrfs_fs_info { > spinlock_t ref_verify_lock; > struct rb_root block_tree; > #endif > + > + struct list_head bad_chunks; > + seqlock_t bc_lock; > }; > > static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) > @@ -2568,6 +2571,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct > address_space *mapping) > > /* extent-tree.c */ > > +struct btrfs_bad_chunk { > + u64 chunk_offset; > + struct list_head list; > +}; > + > enum btrfs_inline_ref_type { > BTRFS_REF_TYPE_INVALID = 0, > BTRFS_REF_TYPE_BLOCK = 1, > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index a8ecccf..061e7f94 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -2568,6 +2568,8 @@ int open_ctree(struct super_block *sb, > init_waitqueue_head(_info->async_submit_wait); > > INIT_LIST_HEAD(_info->pinned_chunks); > + INIT_LIST_HEAD(_info->bad_chunks); > + seqlock_init(_info->bc_lock); > > /* Usable values until the real ones are cached from the superblock */ > fs_info->nodesize = 4096; > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 2f43285..3ca7cb4 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -9903,6 +9903,19 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) > kobject_del(_info->kobj); > kobject_put(_info->kobj); > } > + > + /* Clean up bad chunks. 
*/ > + write_seqlock_irq(>bc_lock); > + while (!list_empty(>bad_chunks)) { > + struct btrfs_bad_chunk *bc; > + > + bc = list_first_entry(>bad_chunks, > + struct btrfs_bad_chunk, list); > + list_del_init(>list); > + kfree(bc); > + } > + write_sequnlock_irq(>bc_lock); > + > return 0; > } > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > index a7f7925..e960247 100644 > --- a/fs/btrfs/raid56.c > +++ b/fs/btrfs/raid56.c > @@ -888,14 +888,19 @@ static void rbio_orig_end_io(struct btrfs_raid_bio > *rbio, blk_status_t err) > } > > /* > - * end io function used by finish_rmw. When we finally > - * get here, we've written a full stripe > + * end io function used by finish_rmw. When we finally get here, we've > written > + * a full stripe. > + * > + * Note that this is not under interrupt context as we queued endio to > workers. > */ > static void raid_write_end_io(struct bio *bio) > { > struct btrfs_raid_bio *rbio = bio->bi_private; > blk_status_t err = bio->bi_status; > int max_errors; > + u64 stripe_start = rbio->bbio->raid_map[0]; > + struct btrfs_fs_info *fs_info = rbio->fs_info; > + int err_cnt; > > if (err) > fail_bio_stripe(rbio, bio); > @@ -908,12 +913,58 @@ static void raid_write_end_io(struct bio *bio) >
RE: [PATCH v5 0/3] Add support for export testsuits
Hi,

> -----Original Message-----
> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Friday, February 09, 2018 2:02 AM
> To: Gu, Jinxiang/顾 金香
> Cc: linux-btrfs@vger.kernel.org; dste...@suse.cz
> Subject: Re: [PATCH v5 0/3] Add support for export testsuits
>
> On Thu, Feb 08, 2018 at 02:34:17PM +0800, Gu Jinxiang wrote:
> > Achieved:
> > 1. export testsuite by:
> >  $ make testsuite
> > files list in testsuites-list will be added into tarball
> > btrfs-progs-tests.tar.gz.
> >
> > 2. after decompress btrfs-progs-tests.tar.gz, run test by:
> >  $ TEST=`MASK` ./tests/mkfs-tests.sh
> > and, without MASK also be ok.
> > replenish:
> >  $ tar -xzvf ./btrfs-progs-tests.tar.gz
> >  $ ls
> >    btrfs-progs
> > tests directory and other files is in btrfs-progs.
> >
> > Changelog:
> > v5->v4: modify patch2.
> > make TEST_TOP to represent tests directory.
> > and introduce INTERNAL_BIN for internal binaries.
>
> Patches 1 and 2 applied. I reworked most of 1, my idea of the end result of
> the testsuite is different. In patch 2, I've added quotes to all lines
> that changed the variables on 'source ..' line. In such cases please also
> look at the resulting code and do not just mechanically apply the
> suggestion to rename a variable. Patch 3 does not bring much information
> so I did not apply it and wrote the section myself.

Thank you for the change.
My idea was to keep the same structure as the git tree when exporting the
testsuite. But I have looked at your modifications, and they are indeed
more reasonable.

>
> The project idea lacked details, as the cards on the github Projects page
> are supposed to be short. Not all of the tasks there are simple, so
> if you'd want to work on something found there and see that's not clear
> what to do, it would be better to open an issue so I can fill in.

OK. Got it.
Thanks.
Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
On 2018年02月09日 03:07, Liu Bo wrote: > On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote: >> On 02/06/2018 12:15 AM, Liu Bo wrote: >> [...] >>> One way to mitigate the data loss pain is to expose 'bad chunks', >>> i.e. degraded chunks, to users, so that they can use 'btrfs balance' >>> to relocate the whole chunk and get the full raid6 protection again >>> (if the relocation works). >> >> [...] >> [...] >> >>> + >>> + /* read lock please */ >>> + do { >>> + seq = read_seqbegin(_info->bc_lock); >>> + list_for_each_entry(bc, _info->bad_chunks, list) { >>> + len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n", >>> + bc->chunk_offset); >>> + /* chunk offset is u64 */ >>> + if (len >= PAGE_SIZE) >>> + break; >>> + } >>> + } while (read_seqretry(_info->bc_lock, seq)); >> >> Using this interface, how many chunks can you list ? If I read the code >> correctly, only up to fill a kernel page. >> >> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) >> it is possible to list only few hundred of chunks (~200). Not more; and the >> last one could be even listed incomplete. >> > > That's true. > >> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to >> list could be more than 200. >> >> My first suggestion is to limit the number of chunks to show to 200 (a page >> should be big enough to contains all these chunks offset). If the chunks >> number are greater, ends the list with a marker (something like '[...]\n'). >> This would solve the ambiguity about the fact that the list chunks are >> complete or not. Anyway you cannot list all the chunks. >> > > Good point, and I need to add one more field to each record to specify > it's metadata or data. > > So what I have in my mind is that this kind of error is rare and > reflects bad sectors on disk, but if there are really that many > errors, I think we need to replace the disk without hesitation. > >> However, my second suggestions is to ... 
change completely the interface. >> What about adding a directory in sysfs, where each entry is a chunk ? >> >> Something like: >> >> /sys/fs/btrfs//chunks//type # >> data/metadata/sys >> /sys/fs/btrfs//chunks//profile # >> dup/linear >> /sys/fs/btrfs//chunks//size # size >> /sys/fs/btrfs//chunks//devs # chunks devs What about netlink interface? Although it may needs an extra daemon to listen to it, and some guys won't be happy about the abuse of daemon. Thanks, Qu >> >> And so on. >> >> Checking "[...]/devs", it would be easy understand if the >> chunk is in "degraded" mode or not. > > I'm afraid we cannot do that as it'll occupy too much memory for large > filesystems given a typical chunk size is 1GB. > >> >> However I have to admit that I don't know how feasible is iterate over a >> sysfs directory which is a map of a kernel objects list. >> >> I think that if these interface would be good enough, we could get rid of a >> lot of ioctl(TREE_SEARCH) from btrfs-progs. >> > > TREE_SEARCH is faster than iterating sysfs (if we could), isn't it? > > thanks, > -liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: IO Error (.snapshots is not a btrfs subvolume)
I've installed openSUSE Tumbleweed on a VM and checked how the disk is with btrfs partioned, how fstab looks like, how snapper works and also what are the differences to my system. I've decided to leave it like it is for now and next time use the guide from the link provided by @Andrei. Thanks! Regards, Nick On Thu, Feb 8, 2018 at 4:43 PM, Nick Gilmourwrote: > On Thu, Feb 8, 2018 at 5:32 AM, Andrei Borzenkov wrote: >> 08.02.2018 06:03, Chris Murphy пишет: >>> On Wed, Feb 7, 2018 at 6:26 PM, Nick Gilmour wrote: Hi all, I have successfully restored a snapshot of root but now when I try to >> >> How exactly was it done? >> make a new snapshot I get this error: IO Error (.snapshots is not a btrfs subvolume). My snapshots were within @ which I renamed to @_old. What can I do now? How can I move the snapshots from @_old/ into @ and be able to make snapshots again? This is an excerpt of my subvolumes list: # btrfs subvolume list / ID 257 gen 175397 top level 5 path @_old ID 258 gen 175392 top level 5 path @pkg ID 260 gen 175447 top level 5 path @tmp ID 262 gen 19 top level 257 path @_old/var/lib/machines ID 268 gen 175441 top level 5 path @test ID 291 gen 175394 top level 257 path @_old/.snapshots ID 292 gen 1705 top level 291 path @_old/.snapshots/1/snapshot ... ID 3538 gen 175398 top level 291 path @_old/.snapshots/1594/snapshot ID 3540 gen 175447 top level 5 path @ >>> >>> >>> This is a snapper behavior. It creates .snapshots as a subvolume and >>> then puts snapshots into that subvolume. If you snapshot a subvolume >>> that contains another subvolume, the nested subvolume is not snapshot, >>> instead a plain directory placeholder is created instead. So your >>> restored snapshot contains a .snapshot directory rather than a >>> .snapshot subvolume. Possibly if you delete the directory and create a >>> new subvolume .snapshot, the problem will be fixed. >>> >> >> No, you should create subvolume @/.snapshots and mount it as /.snapshots >> (and have it in /etc/fstab). 
Snapshots should always be available in >> running system under fixed path and this only possible when it is >> mounted, otherwise after rollback /.snapshots will be lost just like it >> happened now. >> >> Exact subvolume name probably not matters that much, but better stick >> with what installer does by default. It may matter for grub2 snapshots >> handling. >> >> Also openSUSE expects that actual root is subvolume under /.snapshots >> which is valid snapper snapshot (i.e. it has valid metadata). Again, not >> having this may confuse snapper. >> >> It may be possible to move @_old/.snapshots into @/.snapshots, although >> this breaks parent-child relationships those old snapshots cannot be >> cleaned up without removing old root completely. >> >>> I can't tell you how this will confuse snapper though, or how to >>> unconfuse it. It pretty much expects to be in control of all >>> snapshots, creation, deletion, and rollbacks. So if you do it manually >>> for whatever reason, I think it can confuse snapper. >>> >>> >> >> There was blog post recently outlining how to restore openSUSE root. You >> may want to search opensuse or opensuse-factory mailing list. Ah found: >> >> https://rootco.de/2018-01-19-opensuse-btrfs-subvolumes/ > > > Thanks both for the quick responses! > >> How exactly was it done? > 1. # mount /dev/sde6 /mnt > 2. # cd /mnt > 3. # mv @ @_old > 4. # btrfs subvol snapshot /mnt/@_old/.snapshots/1594/snapshot /mnt/@ > >> No, you should create subvolume @/.snapshots and mount it as /.snapshots >> (and have it in /etc/fstab). Snapshots should always be available in >> running system under fixed path and this only possible when it is >> mounted, otherwise after rollback /.snapshots will be lost just like it >> happened now. 
> > When I run this command I get an error: > # btrfs subvolume create @/.snapshots > ERROR: cannot access '@': No such file or directory > > but when I'm doing this: > # btrfs subvolume create /.snapshots > Create subvolume '//.snapshots' > it works > > and this is what I see in the subvolumes list: > ID 3541 gen 175955 parent 3540 top level 3540 path .snapshots > > and then I can create a snapshot with snapper: > # snapper -c root create --description "test" > > but snapper starts numbering from 1 again which I don't really like. I > would like to keep the previous snapshots and continue the numbering > after the last restored snapshot (1594). > > This is how my fstab looks like now: > > # /etc/fstab: static file system information > # > # > > # /dev/sdd6 LABEL=ROOT > UUID=...567940c58ea6 / btrfs > noatime,nodiratime,compress=lzo,ssd,space_cache,subvolid=257,subvol=/@ > 0 1 > > #
Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
> On 02/06/2018 12:15 AM, Liu Bo wrote:
> [...]
> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
>
> [...]
> [...]
>
> > +
> > +	/* read lock please */
> > +	do {
> > +		seq = read_seqbegin(&fs_info->bc_lock);
> > +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> > +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> > +					bc->chunk_offset);
> > +			/* chunk offset is u64 */
> > +			if (len >= PAGE_SIZE)
> > +				break;
> > +		}
> > +	} while (read_seqretry(&fs_info->bc_lock, seq));
>
> Using this interface, how many chunks can you list? If I read the code
> correctly, only as many as fit in one kernel page.
>
> If my math is correct (PAGE_SIZE=4k, a u64 could require up to 19 bytes),
> it is possible to list only a few hundred chunks (~200). Not more; and the
> last one could even be listed incompletely.
>

That's true.

> IIRC the maximum chunk size is 1GB; if you lose a 500GB disk, there could
> be more than 200 chunks to list.
>
> My first suggestion is to limit the number of chunks shown to 200 (a page
> should be big enough to contain all these chunk offsets). If there are
> more chunks, end the list with a marker (something like '[...]\n').
> This would resolve the ambiguity about whether the list of chunks is
> complete or not. In any case, you cannot list all the chunks.
>

Good point, and I need to add one more field to each record to specify
whether it's metadata or data.

What I have in mind is that this kind of error is rare and reflects bad
sectors on disk, but if there really are that many errors, I think we need
to replace the disk without hesitation.

> However, my second suggestion is to ... change the interface completely.
> What about adding a directory in sysfs, where each entry is a chunk?
>
> Something like:
>
>    /sys/fs/btrfs//chunks//type      # data/metadata/sys
>    /sys/fs/btrfs//chunks//profile   # dup/linear
>    /sys/fs/btrfs//chunks//size      # size
>    /sys/fs/btrfs//chunks//devs      # chunks devs
>
> And so on.
>
> Checking "[...]/devs", it would be easy to understand whether the
> chunk is in "degraded" mode or not.

I'm afraid we cannot do that, as it would occupy too much memory for large
filesystems, given that a typical chunk size is 1GB.

>
> However, I have to admit that I don't know how feasible it is to iterate
> over a sysfs directory which is a map of a list of kernel objects.
>
> I think that if this interface were good enough, we could get rid of a
> lot of ioctl(TREE_SEARCH) calls from btrfs-progs.
>

TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

thanks,
-liubo
Re: [PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs
On Thu, Feb 08, 2018 at 06:25:17PM +0200, Nikolay Borisov wrote:
> list_first_entry is essentially a wrapper over container_of. The latter
> can never return null even if it's working on inconsistent list since it
> will either crash or return some offset in the wrong struct.
> Additionally, for the dirty_bgs list the iteration is done under
> dirty_bgs_lock which ensures consistency of the list.
>

Reviewed-by: Liu Bo

-liubo

> Signed-off-by: Nikolay Borisov
> ---
>  fs/btrfs/disk-io.c | 9 -
>  1 file changed, 9 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index aca906971abe..1b3989c54d7c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4323,11 +4323,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
>  		cache = list_first_entry(&cur_trans->dirty_bgs,
>  					 struct btrfs_block_group_cache,
>  					 dirty_list);
> -		if (!cache) {
> -			btrfs_err(fs_info, "orphan block group dirty_bgs list");
> -			spin_unlock(&cur_trans->dirty_bgs_lock);
> -			return;
> -		}
>
>  		if (!list_empty(&cache->io_list)) {
>  			spin_unlock(&cur_trans->dirty_bgs_lock);
> @@ -4351,10 +4346,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
>  		cache = list_first_entry(&cur_trans->io_bgs,
>  					 struct btrfs_block_group_cache,
>  					 io_list);
> -		if (!cache) {
> -			btrfs_err(fs_info, "orphan block group on io_bgs list");
> -			return;
> -		}
>
>  		list_del_init(&cache->io_list);
>  		spin_lock(&cache->lock);
> --
> 2.7.4
>
Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
On 02/06/2018 12:15 AM, Liu Bo wrote:
[...]
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

[...]
[...]

> +
> +	/* read lock please */
> +	do {
> +		seq = read_seqbegin(&fs_info->bc_lock);
> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> +					bc->chunk_offset);
> +			/* chunk offset is u64 */
> +			if (len >= PAGE_SIZE)
> +				break;
> +		}
> +	} while (read_seqretry(&fs_info->bc_lock, seq));

Using this interface, how many chunks can you list? If I read the code
correctly, only as many as fit in one kernel page.

If my math is correct (PAGE_SIZE=4k, a u64 could require up to 19 bytes),
it is possible to list only a few hundred chunks (~200). Not more; and the
last one could even be listed incompletely.

IIRC the maximum chunk size is 1GB; if you lose a 500GB disk, there could
be more than 200 chunks to list.

My first suggestion is to limit the number of chunks shown to 200 (a page
should be big enough to contain all these chunk offsets). If there are
more chunks, end the list with a marker (something like '[...]\n').
This would resolve the ambiguity about whether the list of chunks is
complete or not. In any case, you cannot list all the chunks.

However, my second suggestion is to ... change the interface completely.
What about adding a directory in sysfs, where each entry is a chunk?

Something like:

   /sys/fs/btrfs//chunks//type      # data/metadata/sys
   /sys/fs/btrfs//chunks//profile   # dup/linear
   /sys/fs/btrfs//chunks//size      # size
   /sys/fs/btrfs//chunks//devs      # chunks devs

And so on.

Checking "[...]/devs", it would be easy to understand whether the
chunk is in "degraded" mode or not.

However, I have to admit that I don't know how feasible it is to iterate
over a sysfs directory which is a map of a list of kernel objects.

I think that if this interface were good enough, we could get rid of a
lot of ioctl(TREE_SEARCH) calls from btrfs-progs.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
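The first suggestion in this mail (stop before the page overflows and end the list with a marker) can be sketched in userspace C. This is only an illustration of the idea, not a proposal for the actual sysfs show function: `toy_list_chunks()`, `TOY_PAGE_SIZE` and the other `TOY_` names are hypothetical. One assumption worth stating: the largest u64, 18446744073709551615, has 20 decimal digits, so the sketch budgets 21 bytes per record including the newline.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE	4096
#define TOY_MARKER	"[...]\n"
/* Worst case for one record: 20 decimal digits of a u64 plus '\n'. */
#define TOY_MAX_RECORD	21

/*
 * Format up to 'n' offsets, one per line, into 'buf' (TOY_PAGE_SIZE bytes).
 * Stops before a worst-case record might not fit and then appends
 * TOY_MARKER, so a reader can tell that the list is incomplete and no
 * record is ever cut in half. Returns the number of offsets written;
 * '*out_len' receives the number of bytes used (excluding the NUL).
 */
static size_t toy_list_chunks(char *buf, const uint64_t *offsets, size_t n,
			      size_t *out_len)
{
	const size_t reserve = strlen(TOY_MARKER) + 1;	/* marker + NUL */
	size_t len = 0, i;

	assert(buf != NULL && out_len != NULL);
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		if (len + TOY_MAX_RECORD + reserve > TOY_PAGE_SIZE) {
			/* Not enough room left: flag the truncation. */
			len += (size_t)snprintf(buf + len, TOY_PAGE_SIZE - len,
						"%s", TOY_MARKER);
			break;
		}
		len += (size_t)snprintf(buf + len, TOY_PAGE_SIZE - len,
					"%llu\n",
					(unsigned long long)offsets[i]);
	}
	*out_len = len;
	return i;
}
```

With these numbers, a 4 KiB buffer holds 194 worst-case records plus the marker, which is in line with the estimate of roughly 200 chunks given in the mail.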
Re: [PATCH v5 0/3] Add support for export testsuits
On Thu, Feb 08, 2018 at 02:34:17PM +0800, Gu Jinxiang wrote:
> Achieved:
> 1. export testsuite by:
>  $ make testsuite
> files list in testsuites-list will be added into tarball
> btrfs-progs-tests.tar.gz.
>
> 2. after decompress btrfs-progs-tests.tar.gz, run test by:
>  $ TEST=`MASK` ./tests/mkfs-tests.sh
> and, without MASK also be ok.
> replenish:
>  $ tar -xzvf ./btrfs-progs-tests.tar.gz
>  $ ls
>    btrfs-progs
> tests directory and other files is in btrfs-progs.
>
> Changelog:
> v5->v4: modify patch2.
> make TEST_TOP to represent tests directory.
> and introduce INTERNAL_BIN for internal binaries.

Patches 1 and 2 applied. I reworked most of 1, as my idea of the end result
of the testsuite is different. In patch 2, I've added quotes to all lines
that changed the variables on the 'source ..' line. In such cases, please
also look at the resulting code and do not just mechanically apply the
suggestion to rename a variable. Patch 3 does not bring much information,
so I did not apply it and wrote the section myself.

The project idea lacked details, as the cards on the GitHub Projects page
are supposed to be short. Not all of the tasks there are simple, so if you
want to work on something found there and it is not clear what to do, it
would be better to open an issue so I can fill in the details.
[PATCH 1/2] btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs
list_first_entry is essentially a wrapper over container_of. The latter
can never return NULL, even when it is working on an inconsistent list,
since it will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock, which ensures consistency of the list.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/disk-io.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index aca906971abe..1b3989c54d7c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4323,11 +4323,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
 		cache = list_first_entry(&cur_trans->dirty_bgs,
 					 struct btrfs_block_group_cache,
 					 dirty_list);
-		if (!cache) {
-			btrfs_err(fs_info, "orphan block group dirty_bgs list");
-			spin_unlock(&cur_trans->dirty_bgs_lock);
-			return;
-		}
 
 		if (!list_empty(&cache->io_list)) {
 			spin_unlock(&cur_trans->dirty_bgs_lock);
@@ -4351,10 +4346,6 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
 		cache = list_first_entry(&cur_trans->io_bgs,
 					 struct btrfs_block_group_cache,
 					 io_list);
-		if (!cache) {
-			btrfs_err(fs_info, "orphan block group on io_bgs list");
-			return;
-		}
 
 		list_del_init(&cache->io_list);
 		spin_lock(&cache->lock);
-- 
2.7.4
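The reasoning in this commit message can be made concrete with a small userspace re-creation of the list helpers. The macros below follow the usual shape of container_of() and list_first_entry() but are written here only for illustration; `struct toy_bg` and `toy_first_bg()` are hypothetical names standing in for the block group cache and its lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-creation of the kernel list helpers (illustrative). */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	assert(item != NULL && head != NULL);
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical element type standing in for a block group cache. */
struct toy_bg {
	int id;
	struct list_head dirty_list;
};

/*
 * Safe accessor: emptiness has to be tested on the list head, because
 * list_first_entry() is pure pointer arithmetic (head->next minus a
 * constant offset) and has no way to signal "no entry" by returning NULL.
 */
static struct toy_bg *toy_first_bg(struct list_head *head)
{
	if (list_empty(head))
		return NULL;
	return list_first_entry(head, struct toy_bg, dirty_list);
}
```

This is why the removed `if (!cache)` branches could never trigger: the loops in the patch already guard each list_first_entry() call with `!list_empty(...)`, which is the only meaningful emptiness test.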
[PATCH 2/2] btrfs: Document consistency of transaction->io_bgs list
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.

Signed-off-by: Nikolay Borisov
---
 fs/btrfs/disk-io.c     |  4 
 fs/btrfs/extent-tree.c |  7 ++-
 fs/btrfs/transaction.h | 15 +++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1b3989c54d7c..b6fc734b0f5c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4342,6 +4342,10 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
 	}
 	spin_unlock(&cur_trans->dirty_bgs_lock);
 
+
+	/* Refer to the definition of io_bgs member for details why it's safe
+	 * to use it without any locking
+	 */
 	while (!list_empty(&cur_trans->io_bgs)) {
 		cache = list_first_entry(&cur_trans->io_bgs,
 					 struct btrfs_block_group_cache,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cc08e6af3542..c4a3dfac224a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3742,7 +3742,8 @@ int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
 
 			/*
 			 * the cache_write_mutex is protecting
-			 * the io_list
+			 * the io_list, also refer to the definition of
+			 * btrfs_transaction::io_bgs for more details
 			 */
 			list_add_tail(&cache->io_list, io);
 		} else {
@@ -3934,6 +3935,10 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
 	}
 	spin_unlock(&cur_trans->dirty_bgs_lock);
 
+
+	/* Refer to the definition of io_bgs member for details why it's safe
+	 * to use it without any locking
+	 */
 	while (!list_empty(io)) {
 		cache = list_first_entry(io, struct btrfs_block_group_cache,
 					 io_list);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 817fd7c9836b..90b80a6bea66 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -69,6 +69,21 @@ struct btrfs_transaction {
 	struct list_head pending_chunks;
 	struct list_head switch_commits;
 	struct list_head dirty_bgs;
+
+	/* There is no explicit lock which protects io_bgs, rather its
+	 * consistency is implied by the fact that all the sites which modify
+	 * it do so under some form of transaction critical section, namely:
+	 *
+	 * * btrfs_start_dirty_block_groups - This function can only ever be
+	 *   run by one of the transaction committers. Refer to
+	 *   BTRFS_TRANS_DIRTY_BG_RUN usage in btrfs_commit_transaction
+	 *
+	 * * btrfs_write_dirty_block_groups - this is called by
+	 *   commit_cowonly_roots from transaction critical section
+	 *   (TRANS_STATE_COMMIT_DOING)
+	 *
+	 * * btrfs_cleanup_dirty_bgs - called on transaction abort
+	 */
 	struct list_head io_bgs;
 	struct list_head dropped_roots;
 
-- 
2.7.4
Re: IO Error (.snapshots is not a btrfs subvolume)
On Thu, Feb 8, 2018 at 5:32 AM, Andrei Borzenkov wrote:
> 08.02.2018 06:03, Chris Murphy wrote:
>> On Wed, Feb 7, 2018 at 6:26 PM, Nick Gilmour wrote:
>>> Hi all,
>>>
>>> I have successfully restored a snapshot of root but now when I try to
>
> How exactly was it done?
>
>>> make a new snapshot I get this error:
>>> IO Error (.snapshots is not a btrfs subvolume).
>>> My snapshots were within @ which I renamed to @_old.
>>> What can I do now? How can I move the snapshots from @_old/ into @ and
>>> be able to make snapshots again?
>>>
>>> This is an excerpt of my subvolumes list:
>>>
>>> # btrfs subvolume list /
>>> ID 257 gen 175397 top level 5 path @_old
>>> ID 258 gen 175392 top level 5 path @pkg
>>> ID 260 gen 175447 top level 5 path @tmp
>>> ID 262 gen 19 top level 257 path @_old/var/lib/machines
>>> ID 268 gen 175441 top level 5 path @test
>>> ID 291 gen 175394 top level 257 path @_old/.snapshots
>>> ID 292 gen 1705 top level 291 path @_old/.snapshots/1/snapshot
>>> ...
>>>
>>> ID 3538 gen 175398 top level 291 path @_old/.snapshots/1594/snapshot
>>> ID 3540 gen 175447 top level 5 path @
>>>
>>
>>
>> This is a snapper behavior. It creates .snapshots as a subvolume and
>> then puts snapshots into that subvolume. If you snapshot a subvolume
>> that contains another subvolume, the nested subvolume is not snapshot,
>> instead a plain directory placeholder is created instead. So your
>> restored snapshot contains a .snapshot directory rather than a
>> .snapshot subvolume. Possibly if you delete the directory and create a
>> new subvolume .snapshot, the problem will be fixed.
>>
>
> No, you should create subvolume @/.snapshots and mount it as /.snapshots
> (and have it in /etc/fstab). Snapshots should always be available in
> running system under fixed path and this only possible when it is
> mounted, otherwise after rollback /.snapshots will be lost just like it
> happened now.
>
> Exact subvolume name probably not matters that much, but better stick
> with what installer does by default. It may matter for grub2 snapshots
> handling.
>
> Also openSUSE expects that actual root is subvolume under /.snapshots
> which is valid snapper snapshot (i.e. it has valid metadata). Again, not
> having this may confuse snapper.
>
> It may be possible to move @_old/.snapshots into @/.snapshots, although
> this breaks parent-child relationships those old snapshots cannot be
> cleaned up without removing old root completely.
>
>> I can't tell you how this will confuse snapper though, or how to
>> unconfuse it. It pretty much expects to be in control of all
>> snapshots, creation, deletion, and rollbacks. So if you do it manually
>> for whatever reason, I think it can confuse snapper.
>>
>>
>
> There was blog post recently outlining how to restore openSUSE root. You
> may want to search opensuse or opensuse-factory mailing list. Ah found:
>
> https://rootco.de/2018-01-19-opensuse-btrfs-subvolumes/

Thanks both for the quick responses!

> How exactly was it done?

1. # mount /dev/sde6 /mnt
2. # cd /mnt
3. # mv @ @_old
4. # btrfs subvol snapshot /mnt/@_old/.snapshots/1594/snapshot /mnt/@

> No, you should create subvolume @/.snapshots and mount it as /.snapshots
> (and have it in /etc/fstab). Snapshots should always be available in
> running system under fixed path and this only possible when it is
> mounted, otherwise after rollback /.snapshots will be lost just like it
> happened now.
When I run this command I get an error:

# btrfs subvolume create @/.snapshots
ERROR: cannot access '@': No such file or directory

but when I do this:

# btrfs subvolume create /.snapshots
Create subvolume '//.snapshots'

it works, and this is what I see in the subvolumes list:

ID 3541 gen 175955 parent 3540 top level 3540 path .snapshots

and then I can create a snapshot with snapper:

# snapper -c root create --description "test"

but snapper starts numbering from 1 again, which I don't really like. I would like to keep the previous snapshots and continue the numbering after the last restored snapshot (1594).

This is what my fstab looks like now:

# /etc/fstab: static file system information
#
#
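A note on the error message in this thread: tools in the btrfs ecosystem commonly decide whether a path is a subvolume by checking that it is a directory on a btrfs filesystem whose inode number is 256 (BTRFS_FIRST_FREE_OBJECTID), the inode number of every subvolume root. The plain placeholder directory that a snapshot leaves in place of a nested subvolume does not pass that check. The following is only a minimal sketch of such a test; the thread does not show snapper's implementation, and the function and macro names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/vfs.h>

/* Values as defined in the kernel headers (linux/magic.h, btrfs tree format). */
#define MODEL_BTRFS_SUPER_MAGIC         0x9123683EUL
#define MODEL_BTRFS_FIRST_FREE_OBJECTID 256ULL

/* Hypothetical pure predicate: does this inode look like a subvolume root? */
bool looks_like_subvolume(uint64_t ino, bool is_dir, uint64_t fs_magic)
{
	return is_dir && fs_magic == MODEL_BTRFS_SUPER_MAGIC &&
	       ino == MODEL_BTRFS_FIRST_FREE_OBJECTID;
}

/* Returns 1 if path is a subvolume root, 0 if not, -1 if stat/statfs fails. */
int path_is_btrfs_subvolume(const char *path)
{
	struct stat st;
	struct statfs sfs;

	if (stat(path, &st) != 0 || statfs(path, &sfs) != 0)
		return -1;

	/* f_type is a signed word on Linux; truncate to 32 bits before comparing. */
	return looks_like_subvolume((uint64_t)st.st_ino, S_ISDIR(st.st_mode),
				    (uint64_t)(uint32_t)sfs.f_type);
}
```

Called on /.snapshots after the restore described above, a check of this kind would be expected to return 0 for the placeholder directory and 1 once a real subvolume has been created or mounted there.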
Re: [PATCH 1/3] btrfs-progs: add prerequisite mkfs.btrfs for test-cli
On Thu, Feb 08, 2018 at 01:08:56PM +0800, Gu Jinxiang wrote:
> Since tests/cli-tests/002-balance-full-no-filters/test.sh need
> the mkfs.btrfs for prerequisite.
> So add the dependency in Makefile.
>
> Signed-off-by: Gu Jinxiang

1-3 applied, thanks.
[PATCH][btrfs-next] Btrfs: extent map selftest: fix non-ANSI btrfs_test_extent_map declaration
From: Colin Ian King

The function btrfs_test_extent_map requires a void argument to be ANSI C
compliant and so it matches the prototype in fs/btrfs/tests/btrfs-tests.h

Cleans up sparse warning:
fs/btrfs/tests/extent-map-tests.c:346:27: warning: non-ANSI function
declaration of function 'btrfs_test_extent_map'

Signed-off-by: Colin Ian King
---
 fs/btrfs/tests/extent-map-tests.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 70c993f01670..c23bd00bdd92 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -343,7 +343,7 @@ static void test_case_4(struct extent_map_tree *em_tree)
 	__test_case_4(em_tree, SZ_4K);
 }
 
-int btrfs_test_extent_map()
+int btrfs_test_extent_map(void)
 {
 	struct extent_map_tree *em_tree;
 
-- 
2.15.1
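For readers wondering why the empty parentheses matter: before C23, an empty parameter list in a C declaration means "parameters unspecified" rather than "no parameters", so the compiler cannot check calls against it. A small sketch of the prototype form the patch switches to; run_selftest is a hypothetical name, not a kernel function.

```c
/*
 * Before C23, "int run_selftest();" declares a function with unspecified
 * parameters, so a mistaken call such as run_selftest(1, 2) is not
 * diagnosed. "int run_selftest(void);" is a real prototype that takes no
 * arguments, and the same mistaken call becomes a compile-time error.
 */
int run_selftest(void);	/* as it would appear in a shared header */

int run_selftest(void)	/* the definition matches the prototype exactly */
{
	return 0;
}
```

Keeping the definition textually identical to the header prototype is also what silences the sparse warning quoted in the commit message.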
[PATCH v2] lockdep: Fix fs_reclaim warning.
From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Thu, 8 Feb 2018 10:35:35 +0900
Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
    lock(fs_reclaim);
    lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40
   #1: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
lockdep_clear_current_reclaim_state() in __perform_reclaim() and
lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
is trying to grab the 'fake' lock again when __perform_reclaim() already
grabbed the 'fake' lock.

The

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
          return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
("page allocator: calculate the alloc_flags for allocation only once")
added the PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
          goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
__need_fs_reclaim() to return false.
Reported-and-tested-by: Dave Jones
Signed-off-by: Tetsuo Handa
Cc: Peter Zijlstra
Cc: Nick Piggin
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ce..19fb76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 		return false;
 
 	/* this guy won't enter reclaim */
-	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (current->flags & PF_MEMALLOC)
 		return false;
 
 	/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1
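The effect of the one-line change can be seen in a small userspace model of just this check. This is a sketch, not kernel code: the bit values and function names are placeholders chosen for illustration, and each function returns true when __need_fs_reclaim() would bail out early with "return false" at this test.

```c
#include <stdbool.h>

/* Placeholder bit values for this userspace model; not the kernel's. */
#define MODEL_PF_MEMALLOC    0x1u
#define MODEL_GFP_NOMEMALLOC 0x2u

/*
 * Check before the patch: a PF_MEMALLOC task whose gfp_mask carries
 * __GFP_NOMEMALLOC falls through, so the caller may go on to take the
 * 'fake' fs_reclaim lock a second time.
 */
bool skip_fs_reclaim_old(unsigned int task_flags, unsigned int gfp_mask)
{
	return (task_flags & MODEL_PF_MEMALLOC) &&
	       !(gfp_mask & MODEL_GFP_NOMEMALLOC);
}

/* Check after the patch: PF_MEMALLOC alone is enough to skip. */
bool skip_fs_reclaim_new(unsigned int task_flags, unsigned int gfp_mask)
{
	(void)gfp_mask;	/* no longer consulted by this check */
	return (task_flags & MODEL_PF_MEMALLOC) != 0;
}
```

The two versions differ only for a task that has PF_MEMALLOC set and allocates with __GFP_NOMEMALLOC, which is the __alloc_skb() case from the backtrace: the old check does not skip there, the new one does.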
Re: Crash when unraring large archives on btrfs-filesystem
> How much RAM on the machine and how much swap available? This looks like a
> lot of dirty data has accumulated, and then also there's swapping happening.
> Both swap out and swap in.

The machine has 16GB RAM and 40GB swap on an SSD. It's not doing much besides being my personal file archive, so there should be plenty of free memory for btrfs.

I have remounted the filesystem with the clear_cache option and will now apply the tweaks mentioned by Duncan. If this does not fix the problem, I will install a more current kernel from stretch-backports.

Testing currently has btrfs-progs 4.13.3-1. Is this version safe to use, and should I upgrade it along with the kernel?
Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
On 02/06/2018 07:15 AM, Liu Bo wrote:
> Btrfs tries its best to tolerate write errors, but kind of silently
> (except some messages in kernel log).
>
> For raid1 and raid10, this is usually not a problem because there is a
> copy as backup, while for parity based raid setup, i.e. raid5 and
> raid6, the problem is that, if a write error occurs due to some bad
> sectors, one horizontal stripe becomes degraded and the number of write
> errors it can tolerate gets reduced by one, now if two disk fails,
> data may be lost forever.

This is equally true in raid1, raid10, and raid5. Sorry, I didn't get the point: why is a degraded stripe critical only for the parity-based profiles (raid5/6)? And does it really need a bad chunk list to fix the parity-based case, or can a balance without a bad chunks list fix it as well?

> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

Depending on the type of disk error, the recovery action would vary. For example, it can be a complete disk failure or an interim RW failure due to environmental/transport factors. On most modern disks, the disk's own auto relocation will do the job of relocating the real bad blocks. The challenging task will be to know where to draw the line between complete disk failure (failed) and interim disk failure (offline), so I had plans to make it tunable based on the number of disk errors.

If it's confirmed that a disk has failed, auto-replace with the hot spare disk will be the recovery action. Balance with a failed disk won't help. Patches for these are on the ML.

If the failure is momentary, due to environmental factors including the transport layer, then we expect the disk with the data to come back, so we shouldn't kick in the hot spare. That is the offline disk state, or maybe it's a state where reading old data is fine but new data cannot be written.
I think you are addressing this interim state. It's better to define the disk states first so that the recovery action for each can be defined. I can revise the patches along those lines, so that replace versus re-balance using bad chunks can be decided.

> This introduces 'bad_chunks' in btrfs's per-fs sysfs directory. Once a
> chunk of raid5 or raid6 becomes degraded, it will appear in 'bad_chunks'.

AFAIK a variable list of output is not allowed in sysfs. IMHO a list of bad chunks won't help the user (it's OK if it's needed by the kernel). It would help if you provided the list of affected files, so that the user can use a script to make an additional interim external copy until the disk recovers from the interim error.

Thanks, Anand
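The point about degraded stripes raised earlier in this message can be put in numbers. The sketch below uses the usual per-profile device-failure tolerance (one device for raid1, raid10 and raid5, two for raid6); the enum and function names are hypothetical and are not taken from the btrfs sources.

```c
/* Hypothetical profile identifiers for this illustration only. */
enum model_profile { MODEL_RAID1, MODEL_RAID10, MODEL_RAID5, MODEL_RAID6 };

/* Number of failed devices a healthy stripe of this profile can absorb. */
int tolerated_failures(enum model_profile p)
{
	switch (p) {
	case MODEL_RAID6:
		return 2;
	case MODEL_RAID1:
	case MODEL_RAID10:
	case MODEL_RAID5:
		return 1;
	}
	return 0;
}

/*
 * Failures a stripe can still absorb once `bad` of its members are already
 * degraded. Zero means the next failure loses data.
 */
int remaining_tolerance(enum model_profile p, int bad)
{
	return tolerated_failures(p) - bad;
}
```

With one member already bad, raid6 still tolerates one more failure, while raid1, raid10 and raid5 all drop to zero, which is the sense in which the concern applies equally to the non-parity profiles.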