[PATCH V2] Btrfs-progs: subvol_uuid_search: Return error code on memory allocation failure
From: Prasanth K S R

This commit fixes coverity defect CID 1328695.

Signed-off-by: Prasanth K S R
---
 cmds-receive.c | 10 +-
 cmds-send.c    | 18 +-
 send-utils.c   | 22 ++
 3 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/cmds-receive.c b/cmds-receive.c
index d0525bf..40f64de 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -283,12 +283,12 @@ static int process_snapshot(const char *path, const u8 *uuid, u64 ctransid,
 	parent_subvol = subvol_uuid_search(&r->sus, 0, parent_uuid,
 			parent_ctransid, NULL, subvol_search_by_received_uuid);
-	if (!parent_subvol) {
+	if (IS_ERR(parent_subvol)) {
 		parent_subvol = subvol_uuid_search(&r->sus, 0, parent_uuid,
 				parent_ctransid, NULL, subvol_search_by_uuid);
 	}
-	if (!parent_subvol) {
-		ret = -ENOENT;
+	if (IS_ERR(parent_subvol)) {
+		ret = PTR_ERR(parent_subvol);
 		error("cannot find parent subvolume");
 		goto out;
 	}
@@ -744,13 +744,13 @@ static int process_clone(const char *path, u64 offset, u64 len,
 	si = subvol_uuid_search(&r->sus, 0, clone_uuid, clone_ctransid,
 			NULL, subvol_search_by_received_uuid);
-	if (!si) {
+	if (IS_ERR(si)) {
 		if (memcmp(clone_uuid, r->cur_subvol.received_uuid,
 				BTRFS_UUID_SIZE) == 0) {
 			/* TODO check generation of extent */
 			subvol_path = strdup(r->cur_subvol_path);
 		} else {
-			ret = -ENOENT;
+			ret = PTR_ERR(si);
 			error("clone: did not find source subvol");
 			goto out;
 		}
diff --git a/cmds-send.c b/cmds-send.c
index 74d0128..b773b40 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -68,8 +68,8 @@ static int get_root_id(struct btrfs_send *s, const char *path, u64 *root_id)
 	si = subvol_uuid_search(&s->sus, 0, NULL, 0, path,
 			subvol_search_by_path);
-	if (!si)
-		return -ENOENT;
+	if (IS_ERR(si))
+		return PTR_ERR(si);
 	*root_id = si->root_id;
 	free(si->path);
 	free(si);
@@ -83,8 +83,8 @@ static struct subvol_info *get_parent(struct btrfs_send *s, u64 root_id)
 	si_tmp = subvol_uuid_search(&s->sus, root_id, NULL, 0, NULL,
 			subvol_search_by_root_id);
-	if (!si_tmp)
-		return NULL;
+	if (IS_ERR(si_tmp))
+		return si_tmp;
 	si = subvol_uuid_search(&s->sus, 0, si_tmp->parent_uuid, 0, NULL,
 			subvol_search_by_uuid);
@@ -104,8 +104,8 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
 	int i;
 	parent = get_parent(s, root_id);
-	if (!parent) {
-		ret = -ENOENT;
+	if (IS_ERR(parent)) {
+		ret = PTR_ERR(parent);
 		goto out;
 	}
@@ -119,7 +119,7 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
 	for (i = 0; i < s->clone_sources_count; i++) {
 		parent2 = get_parent(s, s->clone_sources[i]);
-		if (!parent2)
+		if (IS_ERR(parent2))
 			continue;
 		if (parent2->root_id != parent->root_id) {
 			free(parent2->path);
@@ -133,8 +133,8 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
 		parent2 = subvol_uuid_search(&s->sus, s->clone_sources[i],
 				NULL, 0, NULL, subvol_search_by_root_id);
-		if (!parent2) {
-			ret = -ENOENT;
+		if (IS_ERR(parent2)) {
+			ret = PTR_ERR(parent2);
 			goto out;
 		}
 		tmp = parent2->ctransid - parent->ctransid;
diff --git a/send-utils.c b/send-utils.c
index a85fa08..87b8559 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -27,6 +27,7 @@
 #include "send-utils.h"
 #include "ioctl.h"
 #include "btrfs-list.h"
+#include "utils.h"
 static int btrfs_subvolid_resolve_sub(int fd, char *path, size_t *path_len,
 				      u64 subvol_id);
@@ -474,6 +475,11 @@ struct subvol_info *subvol_uuid_search(struct subvol_uuid_search *s,
 		goto out;
 	info = calloc(1, sizeof(*info));
+	if (!info) {
+		error("Not enough memory");
+		ret = -ENOMEM;
+		goto out;
+	}
 	info->root_id = root_id;
 	memcpy(info->uuid, root_item.uuid, BTRFS_UUID_SIZE);
 	memcpy(info->received_uuid, root_item.received_uuid, BTRFS_UUID_SIZE);
@@ -486,15 +492,23 @@ struct subvol_info *subvol_uuid_search(struct subvol_uuid_search *s,
Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots
15.10.2016 01:58, Alberto Bursi wrote:
>
> On 10/15/2016 12:17 AM, Chris Murphy wrote:
>> It should be that -e can accept a listing of all the subvolumes you
>> want to send at once. And possibly an -r flag, if it existed, could
>> automatically populate -e. But the last time I tested -e I just got
>> errors.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=111221
>>
>
> Not a problem (for me anyway), I can send all subvolumes already with
> my script (one after another, but still automatically).
>
> What I can't do with btrfs commands is send over the contents of a ro
> snapshot of / called for example "oldRootSnapshot" directly to
> "/tmp/newroot" (which is where I have mounted the other drive/volume).

Somehow this is expected - it sends one subvolume to another subvolume.
I am not sure whether zfs can do it either. But speaking about openSUSE -
it does not have any real data in '/' at all; it is just a skeleton of
the root filesystem with a couple of directories, and the actual root is
in one of the /.snapshots subvolumes.

> The only thing I can do is send over the subvolume as a subvolume.
> So I end up with /tmp/newroot/oldRootSnapshot, and inside
> oldRootSnapshot I get my root - not what I wanted.
>
> The only way I found so far is using rsync to move the contents of
> oldRootSnapshot into /tmp/newroot by setting an exclusion list for all
> subvolumes, then running a deduplication with duperemove.
>
> So, is there something I missed to do that?
>
> -Alberto

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: assign error values to the correct bio structs
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Junjie Mao
---
 fs/btrfs/compression.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ccc70d96958d..d4d8b7e36b2f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -698,7 +698,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	ret = btrfs_map_bio(root, comp_bio, mirror_num, 0);
 	if (ret) {
-		bio->bi_error = ret;
+		comp_bio->bi_error = ret;
 		bio_endio(comp_bio);
 	}
@@ -728,7 +728,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	ret = btrfs_map_bio(root, comp_bio, mirror_num, 0);
 	if (ret) {
-		bio->bi_error = ret;
+		comp_bio->bi_error = ret;
 		bio_endio(comp_bio);
 	}
--
1.9.3
Re: speed up cp --reflink=always
At 10/17/2016 02:54 AM, Stefan Priebe - Profihost AG wrote:
> On 16.10.2016 at 00:37, Hans van Kranenburg wrote:
>> Hi,
>>
>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:
>>> cp --reflink=always sometimes takes very long (i.e. 25-35 minutes).
>>>
>>> An example:
>>>
>>> source file:
>>> # ls -la vm-279-disk-1.img
>>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>>
>>> target file after around 10 minutes:
>>> # ls -la vm-279-disk-1.img.tmp
>>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
>>
>> Two quick thoughts:
>>
>> 1. How many extents does this img have?
>
> filefrag says: 1011508 extents found

Too many fragments. The average extent size is only about 200K.

That is quite common for VM images if the no-copy-on-write (C) attr is
not set. Normally it's not a good idea to put VM images on btrfs without
any tuning. Several default features of btrfs are not suitable for that
use case:

1) Copy-on-Write
   A lot of random writes happen to a VM image. This creates a lot of
   small extents, just as you see here.
   In traditional non-CoW filesystems, like ext4 and (current) XFS, an
   overwrite is just an overwrite; data won't be written to new places.
   So for these filesystems, no matter how many writes happen, the
   extent count won't change much (mostly unchanged).

2) Extent booking
   Another result of CoW: data extents won't be freed until all their
   referencers are removed, which leads to considerable wasted space.

3) Slow metadata operations
   Btrfs tree CoW and its lock mechanism make metadata operations quite
   slow compared to other filesystems. Normal read/write is not a
   metadata-heavy operation, while reflinking is.
   (IIRC, XFS with reflink support, not mainlined yet, is faster than
   btrfs at doing reflink.)

Normally, the no-CoW (C) attr is recommended for the VM image use case.
This flag makes btrfs act much like a traditional fs, until a snapshot
containing the file is created. It has the limitation that it prohibits
reflink, so you can't use cp --reflink=always then.

If the no-CoW flag is not what you want, and no other
snapshot/subvolume/reflinked file shares extents with the file, defrag
is highly recommended before reflinking. That will hugely reduce the
number of extents (fragments) and reduce the time spent in the reflink
call. However, I suspect the defrag may take even longer than the
reflink itself.

Thanks,
Qu

>> 2. Is this an XY problem? Why not just put the img in a subvolume and
>> snapshot that?
>
> Sorry, what's an XY problem? Implementing cp reflink was easier - as
> the original code was based on XFS. But shouldn't cp reflink / cloning
> a file be nearly identical to a snapshot? Just creating refs to the
> extents?
>
> Greets,
> Stefan
[RFC PATCH v0.8 08/14] btrfs-progs: check/scrub: Introduce function to scrub one data stripe
Introduce a new function, scrub_one_data_stripe(), to check all data and
tree blocks inside the data stripe.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 111 ++
 1 file changed, 111 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index cdba469..f29effa 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -297,3 +297,114 @@ invalid_arg:
 	error("invalid parameter for %s", __func__);
 	return -EINVAL;
 }
+
+static int scrub_one_data_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 struct scrub_stripe *stripe, u32 stripe_len)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_len;
+	u64 orig_csum_discards;
+	int ret;
+
+	if (!is_data_stripe(stripe))
+		return -EINVAL;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = stripe->logical + stripe_len;
+	key.offset = 0;
+	key.type = 0;
+
+	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	while (1) {
+		struct btrfs_extent_item *ei;
+		struct extent_buffer *eb;
+		char *data;
+		int slot;
+		int metadata = 0;
+		u64 check_start;
+		u64 check_len;
+
+		ret = btrfs_previous_extent_item(extent_root, path, 0);
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+		if (ret < 0)
+			goto out;
+		eb = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(eb, &key, slot);
+		extent_start = key.objectid;
+		ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
+
+		/* tree block scrub */
+		if (key.type == BTRFS_METADATA_ITEM_KEY ||
+		    btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+			extent_len = extent_root->nodesize;
+			metadata = 1;
+		} else {
+			extent_len = key.offset;
+			metadata = 0;
+		}
+
+		/* Current extent is out of our range, loop comes to end */
+		if (extent_start + extent_len <= stripe->logical)
+			break;
+
+		if (metadata) {
+			/*
+			 * Check crossing stripe first, which can't be scrubbed
+			 */
+			if (check_crossing_stripes(extent_start,
+						   extent_root->nodesize)) {
+				error("tree block at %llu is crossing stripe boundary, unable to scrub",
+				      extent_start);
+				ret = -EIO;
+				goto out;
+			}
+			data = stripe->data + extent_start - stripe->logical;
+			ret = scrub_tree_mirror(fs_info, scrub_ctx,
+						data, extent_start, 0);
+			/* Any csum/verify error means the stripe is screwed */
+			if (ret < 0) {
+				stripe->csum_mismatch = 1;
+				ret = -EIO;
+				goto out;
+			}
+			ret = 0;
+			continue;
+		}
+		/* Restrict the extent range to fit stripe range */
+		check_start = max(extent_start, stripe->logical);
+		check_len = min(extent_start + extent_len, stripe->logical +
+				stripe_len) - check_start;
+
+		/* Record original csum_discards to detect missing csum case */
+		orig_csum_discards = scrub_ctx->csum_discards;
+
+		data = stripe->data + check_start - stripe->logical;
+		ret = scrub_data_mirror(fs_info, scrub_ctx, data, check_start,
+					check_len, 0);
+		/* Csum mismatch, no need to continue anyway */
+		if (ret < 0) {
+			stripe->csum_mismatch = 1;
+			goto out;
+		}
+		/* Check if there is any missing csum for data */
+		if (scrub_ctx->csum_discards != orig_csum_discards)
+			stripe->csum_missing = 1;
+		ret = 0;
+	}
out:
	btrfs_free_path(path);
	return ret;
}
--
2.10.0
[RFC PATCH v0.8 01/14] btrfs-progs: Introduce new btrfs_map_block function which returns a more unified result.
Introduce a new function, __btrfs_map_block_v2().

Unlike the old btrfs_map_block(), which needs different parameters to
handle different RAID profiles, this new function uses a unified
btrfs_map_block structure to handle all RAID profiles in a more
meaningful method: return the physical address along with the logical
address for each stripe.

For RAID1/Single/DUP (non-striped), the result would be like:
  Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
    Stripe 0: Logical 128M, Physical X, Len: 10M  Dev dev1
    Stripe 1: Logical 128M, Physical Y, Len: 10M  Dev dev2

The result will be as long as possible, since it's not striped at all.

For RAID0/10 (striped without parity), the result will be aligned to
full stripe size:
  Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
    Stripe 0: Logical 64K,  Physical X, Len 64K  Dev dev1
    Stripe 1: Logical 64K,  Physical Y, Len 64K  Dev dev2
    Stripe 2: Logical 128K, Physical Z, Len 64K  Dev dev3
    Stripe 3: Logical 128K, Physical W, Len 64K  Dev dev4

For RAID5/6 (striped with parity and dev-rotation), the result will be
aligned to full stripe size:
  Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
    Stripe 0: Logical 64K,      Physical X, Len 64K  Dev dev1
    Stripe 1: Logical 128K,     Physical Y, Len 64K  Dev dev2
    Stripe 2: Logical RAID5_P,  Physical Z, Len 64K  Dev dev3
    Stripe 3: Logical RAID6_Q,  Physical W, Len 64K  Dev dev4

The new unified layout should be very flexible and can even handle
things like N-way RAID1 (which the old mirror_num-based interface can't
handle well).
Signed-off-by: Qu Wenruo
---
 volumes.c | 181 ++
 volumes.h |  49 +
 2 files changed, 230 insertions(+)

diff --git a/volumes.c b/volumes.c
index a7abd92..94f3e42 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1542,6 +1542,187 @@ out:
 	return 0;
 }
+
+static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
+{
+	struct btrfs_map_block *ret;
+	int size;
+
+	size = sizeof(struct btrfs_map_stripe) * num_stripes +
+		sizeof(struct btrfs_map_block);
+	ret = malloc(size);
+	if (!ret)
+		return NULL;
+	memset(ret, 0, size);
+	return ret;
+}
+
+static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
+			       struct btrfs_map_block *map_block)
+{
+	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	u64 bg_start = map->ce.start;
+	u64 bg_end = bg_start + map->ce.size;
+	u64 bg_offset = start - bg_start; /* offset inside the block group */
+	u64 fstripe_logical = 0;	/* Full stripe start logical bytenr */
+	u64 fstripe_size = 0;		/* Full stripe logical size */
+	u64 fstripe_phy_off = 0;	/* Full stripe offset in each dev */
+	u32 stripe_len = map->stripe_len;
+	int sub_stripes = map->sub_stripes;
+	int data_stripes = nr_data_stripes(map);
+	int dev_rotation;
+	int i;
+
+	map_block->num_stripes = map->num_stripes;
+	map_block->type = profile;
+
+	/*
+	 * Common full stripe data for stripe based profiles
+	 */
+	if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
+		       BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+		fstripe_size = stripe_len * data_stripes;
+		if (sub_stripes)
+			fstripe_size /= sub_stripes;
+		fstripe_logical = round_down(bg_offset, fstripe_size) +
+				  bg_start;
+		fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
+	}
+
+	switch (profile) {
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+	case 0: /* SINGLE */
+		/*
+		 * Non-stripe mode (Single, DUP and RAID1)
+		 * Just use offset to fill map_block
+		 */
+		map_block->stripe_len = 0;
+		map_block->start = start;
+		map_block->length = min(bg_end, start + length) - start;
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_map_stripe *stripe;
+
+			stripe = &map_block->stripes[i];
+
+			stripe->dev = map->stripes[i].dev;
+			stripe->logical = start;
+			stripe->physical = map->stripes[i].physical + bg_offset;
+			stripe->length = map_block->length;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID0:
+		/*
+		 * Stripe modes without parity (0 and 10)
+		 * Return the whole full stripe
+		 */
+
+		map_block->start = fstripe_logical;
+		map_block->length = fstripe_size;
+		map_block->stripe_len =
[RFC PATCH v0.8 05/14] btrfs-progs: check/scrub: Introduce function to scrub mirror based tree block
Introduce a new function, scrub_tree_mirror(), to scrub mirror-based
tree blocks (Single/DUP/RAID0/1/10).

This function can be used on in-memory tree blocks, using the @data
parameter, for RAID5/6 full stripes, or just by @bytenr for other
profiles.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 59 +++
 disk-io.c     |  4 ++--
 disk-io.h     |  2 ++
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/check/scrub.c b/check/scrub.c
index acfe213..ce8d5e5 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -98,3 +98,62 @@ static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
 	return ret;
 }
+
+static inline int is_data_stripe(struct scrub_stripe *stripe)
+{
+	u64 bytenr = stripe->logical;
+
+	if (bytenr == BTRFS_RAID5_P_STRIPE || bytenr == BTRFS_RAID6_Q_STRIPE)
+		return 0;
+	return 1;
+}
+
+static int scrub_tree_mirror(struct btrfs_fs_info *fs_info,
+			     struct btrfs_scrub_progress *scrub_ctx,
+			     char *data, u64 bytenr, int mirror)
+{
+	struct extent_buffer *eb;
+	u32 nodesize = fs_info->tree_root->nodesize;
+	int ret;
+
+	if (!IS_ALIGNED(bytenr, fs_info->tree_root->sectorsize)) {
+		/* Such error will be reported by check_tree_block() */
+		scrub_ctx->verify_errors++;
+		return -EIO;
+	}
+
+	eb = btrfs_find_create_tree_block(fs_info, bytenr, nodesize);
+	if (!eb)
+		return -ENOMEM;
+	if (data) {
+		memcpy(eb->data, data, nodesize);
+	} else {
+		ret = read_whole_eb(fs_info, eb, mirror);
+		if (ret) {
+			scrub_ctx->read_errors++;
+			error("failed to read tree block %llu mirror %d",
+			      bytenr, mirror);
+			goto out;
+		}
+	}
+
+	scrub_ctx->tree_bytes_scrubbed += nodesize;
+	if (csum_tree_block(fs_info->tree_root, eb, 1)) {
+		error("tree block %llu mirror %d checksum mismatch", bytenr,
+		      mirror);
+		scrub_ctx->csum_errors++;
+		ret = -EIO;
+		goto out;
+	}
+	ret = check_tree_block(fs_info, eb);
+	if (ret < 0) {
+		error("tree block %llu mirror %d is invalid", bytenr, mirror);
+		scrub_ctx->verify_errors++;
+		goto out;
+	}
+
+	scrub_ctx->tree_extents_scrubbed++;
+out:
+	free_extent_buffer(eb);
+	return ret;
+}
+
diff --git a/disk-io.c b/disk-io.c
index f24567b..2750e6e 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -51,8 +51,8 @@ static u32 max_nritems(u8 level, u32 nodesize)
 			sizeof(struct btrfs_key_ptr));
 }
-static int check_tree_block(struct btrfs_fs_info *fs_info,
-			    struct extent_buffer *buf)
+int check_tree_block(struct btrfs_fs_info *fs_info,
+		     struct extent_buffer *buf)
 {
 	struct btrfs_fs_devices *fs_devices;
diff --git a/disk-io.h b/disk-io.h
index 245626c..43ce9c9 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -113,6 +113,8 @@ static inline struct extent_buffer* read_tree_block(
 			parent_transid);
 }
+int check_tree_block(struct btrfs_fs_info *fs_info,
+		     struct extent_buffer *buf);
 int read_extent_data(struct btrfs_root *root, char *data, u64 logical,
 		     u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
--
2.10.0
[RFC PATCH v0.8 14/14] btrfs-progs: fsck: Introduce offline scrub function
Now btrfs check has a kernel scrub equivalent. And even more: it has a
stronger csum check, applied to both reconstructed data and existing
data stripes, which avoids the silent data corruption that is possible
in kernel scrub.

For now it only supports a read-only check, but it is already able to
report whether errors are recoverable.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-check.asciidoc |  8
 check/check.h                      |  2 ++
 check/scrub.c                      | 36
 cmds-check.c                       | 12 +++-
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index a32e1c7..98681ff 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
+--scrub::
+kernel scrub equivalent.
++
+Off-line scrub has a better reconstruction check than the kernel and
+won't cause the silent data corruption that is possible for RAID5.
++
+NOTE: Support for RAID6 recovery is not fully implemented yet.
+
 DANGEROUS OPTIONS
 -----------------
diff --git a/check/check.h b/check/check.h
index 61d1cac..7c14716 100644
--- a/check/check.h
+++ b/check/check.h
@@ -19,3 +19,5 @@
 /* check/csum.c */
 int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
 			     void *csum_ret);
+/* check/scrub.c */
+int scrub_btrfs(struct btrfs_fs_info *fs_info);
diff --git a/check/scrub.c b/check/scrub.c
index 94f8744..3327791 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -774,3 +774,39 @@ out:
 	btrfs_free_path(path);
 	return ret;
 }
+
+int scrub_btrfs(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_block_group_cache *bg_cache;
+	struct btrfs_scrub_progress scrub_ctx = {0};
+	int ret = 0;
+
+	bg_cache = btrfs_lookup_first_block_group(fs_info, 0);
+	if (!bg_cache) {
+		error("no block group is found");
+		return -ENOENT;
+	}
+
+	while (1) {
+		ret = scrub_one_block_group(fs_info, &scrub_ctx, bg_cache);
+		if (ret < 0 && ret != -EIO)
+			break;
+
+		bg_cache = btrfs_lookup_first_block_group(fs_info,
+			bg_cache->key.objectid + bg_cache->key.offset);
+		if (!bg_cache)
+			break;
+	}
+
+	printf("Scrub result:\n");
+	printf("Tree bytes scrubbed: %llu\n", scrub_ctx.tree_bytes_scrubbed);
+	printf("Data bytes scrubbed: %llu\n", scrub_ctx.data_bytes_scrubbed);
+	printf("Read error: %llu\n", scrub_ctx.read_errors);
+	printf("Verify error: %llu\n", scrub_ctx.verify_errors);
+	if (scrub_ctx.csum_errors || scrub_ctx.read_errors ||
+	    scrub_ctx.uncorrectable_errors || scrub_ctx.verify_errors)
+		ret = 1;
+	else
+		ret = 0;
+	return ret;
+}
diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..a081e82 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -41,6 +41,7 @@
 #include "rbtree-utils.h"
 #include "backref.h"
 #include "ulist.h"
+#include "check.h"
 enum task_position {
 	TASK_EXTENTS,
@@ -11252,6 +11253,7 @@ int cmd_check(int argc, char **argv)
 	int readonly = 0;
 	int qgroup_report = 0;
 	int qgroups_repaired = 0;
+	int scrub = 0;
 	unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
 	while(1) {
@@ -11259,7 +11261,7 @@ int cmd_check(int argc, char **argv)
 		enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
 			GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
 			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
-			GETOPT_VAL_MODE };
+			GETOPT_VAL_MODE, GETOPT_VAL_SCRUB };
 		static const struct option long_options[] = {
 			{ "super", required_argument, NULL, 's' },
 			{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -11279,6 +11281,7 @@ int cmd_check(int argc, char **argv)
 			{ "progress", no_argument, NULL, 'p' },
 			{ "mode", required_argument, NULL, GETOPT_VAL_MODE },
+			{ "scrub", no_argument, NULL, GETOPT_VAL_SCRUB },
 			{ NULL, 0, NULL, 0}
 		};
@@ -11350,6 +11353,9 @@ int cmd_check(int argc, char **argv)
 				exit(1);
 			}
 			break;
+		case GETOPT_VAL_SCRUB:
+			scrub = 1;
+			break;
 		}
 	}
@@ -11402,6 +11408,10 @@ int cmd_check(int argc, char **argv)
 	global_info =
[RFC PATCH v0.8 13/14] btrfs-progs: check/scrub: Introduce function to check a whole block group
Introduce a new function, scrub_one_block_group(), to scrub a block
group.

For Single/DUP/RAID0/RAID1/RAID10, we use the old mirror-number-based
map_block, and check extent by extent.

For parity-based profiles (RAID5/6), we use the new map_block_v2() and
check full stripe by full stripe.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 85 +++
 1 file changed, 85 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index 1c8e440..94f8744 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -689,3 +689,88 @@ out:
 	free(map_block);
 	return ret;
 }
+
+static int scrub_one_block_group(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 struct btrfs_block_group_cache *bg_cache)
+{
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 bg_start = bg_cache->key.objectid;
+	u64 bg_len = bg_cache->key.offset;
+	u64 cur;
+	u64 next;
+	int ret;
+
+	if (bg_cache->flags &
+	    (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+		/* RAID5/6 check full stripe by full stripe */
+		cur = bg_cache->key.objectid;
+
+		while (cur < bg_start + bg_len) {
+			ret = scrub_one_full_stripe(fs_info, scrub_ctx, cur,
+						    &next);
+			/* Ignore any non-fatal error */
+			if (ret < 0 && ret != -EIO) {
+				error("fatal error happens checking one full stripe at bytenr: %llu: %s",
+				      cur, strerror(-ret));
+				return ret;
+			}
+			cur = next;
+		}
+		/* Ignore any -EIO error, such error will be reported at last */
+		return 0;
+	}
+	/* Non-parity-based profile, check extent by extent */
+	key.objectid = bg_start;
+	key.type = 0;
+	key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	while (1) {
+		struct extent_buffer *eb = path->nodes[0];
+		int slot = path->slots[0];
+		u64 extent_start;
+		u64 extent_len;
+
+		btrfs_item_key_to_cpu(eb, &key, slot);
+		if (key.objectid >= bg_start + bg_len)
+			break;
+		if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+		    key.type != BTRFS_METADATA_ITEM_KEY)
+			goto next;
+
+		extent_start = key.objectid;
+		if (key.type == BTRFS_METADATA_ITEM_KEY)
+			extent_len = extent_root->nodesize;
+		else
+			extent_len = key.offset;
+
+		ret = scrub_one_extent(fs_info, scrub_ctx, path, extent_start,
+				       extent_len, 1);
+		if (ret < 0 && ret != -EIO) {
+			error("fatal error checking extent bytenr %llu len %llu: %s",
+			      extent_start, extent_len, strerror(-ret));
+			goto out;
+		}
+		ret = 0;
+next:
+		ret = btrfs_next_extent_item(extent_root, path, bg_start +
+					     bg_len);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+	}
out:
	btrfs_free_path(path);
	return ret;
}
--
2.10.0
[RFC PATCH v0.8 09/14] btrfs-progs: check/scrub: Introduce function to verify parities
Introduce a new function, verify_parities(), to check whether the
parities match for a full stripe whose data stripes all match their
csums.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 59 +++
 1 file changed, 59 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index f29effa..d8182d6 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -408,3 +408,62 @@ out:
 	btrfs_free_path(path);
 	return ret;
 }
+
+static int verify_parities(struct btrfs_fs_info *fs_info,
+			   struct btrfs_scrub_progress *scrub_ctx,
+			   struct scrub_full_stripe *fstripe)
+{
+	void **ptrs;
+	void *ondisk_p = NULL;
+	void *ondisk_q = NULL;
+	void *buf_p;
+	void *buf_q;
+	int nr_stripes = fstripe->nr_stripes;
+	int stripe_len = BTRFS_STRIPE_LEN;
+	int i;
+	int ret;
+
+	ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+	buf_p = malloc(fstripe->stripe_len);
+	buf_q = malloc(fstripe->stripe_len);
+	if (!ptrs || !buf_p || !buf_q) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < fstripe->nr_stripes; i++) {
+		struct scrub_stripe *stripe = &fstripe->stripes[i];
+
+		if (stripe->logical == BTRFS_RAID5_P_STRIPE) {
+			ondisk_p = stripe->data;
+			ptrs[i] = buf_p;
+			continue;
+		} else if (stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+			ondisk_q = stripe->data;
+			ptrs[i] = buf_q;
+			continue;
+		} else {
+			ptrs[i] = stripe->data;
+			continue;
+		}
+	}
+	/* RAID6 */
+	if (ondisk_q) {
+		raid6_gen_syndrome(nr_stripes, stripe_len, ptrs);
+		if (memcmp(ondisk_q, ptrs[nr_stripes - 1], stripe_len) ||
+		    memcmp(ondisk_p, ptrs[nr_stripes - 2], stripe_len))
+			ret = -EIO;
+	} else {
+		ret = raid5_gen_result(nr_stripes, stripe_len, nr_stripes - 1,
+				       ptrs);
+		if (ret < 0)
+			goto out;
+		if (memcmp(ondisk_p, ptrs[nr_stripes - 1], stripe_len))
+			ret = -EIO;
+	}
out:
	free(buf_p);
	free(buf_q);
	free(ptrs);
	return ret;
}
--
2.10.0
[RFC PATCH v0.8 02/14] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
For READ, the caller normally hopes to get what they requested, rather
than the full stripe map. In this case, we should remove unrelated
stripes from the map, as in the following case:

         32K               96K
          |<-request range->|
          0        64K       128K
 RAID0:   | Data 1 | Data 2 |
            disk1    disk2

Before this patch, we return the full stripe:
  Stripe 0: Logical 0,   Physical X, Len 64K, Dev disk1
  Stripe 1: Logical 64K, Physical Y, Len 64K, Dev disk2

After this patch, we limit the stripe result to the requested range:
  Stripe 0: Logical 32K, Physical X+32K, Len 32K, Dev disk1
  Stripe 1: Logical 64K, Physical Y,     Len 32K, Dev disk2

And if it's a RAID5/6 stripe, we just handle it like RAID0, ignoring
parities. This should make the result easier for callers to use.

Signed-off-by: Qu Wenruo
---
 volumes.c | 103 +-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 94f3e42..ba16d19 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1682,6 +1682,107 @@ static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
 	return 0;
 }
+
+static void del_one_stripe(struct btrfs_map_block *map_block, int i)
+{
+	int cur_nr = map_block->num_stripes;
+	int size_left = (cur_nr - 1 - i) * sizeof(struct btrfs_map_stripe);
+
+	memmove(&map_block->stripes[i], &map_block->stripes[i + 1], size_left);
+	map_block->num_stripes--;
+}
+
+static void remove_unrelated_stripes(struct map_lookup *map,
+				     int rw, u64 start, u64 length,
+				     struct btrfs_map_block *map_block)
+{
+	int i = 0;
+
+	/*
+	 * RAID5/6 write must use full stripe.
+	 * No need to do anything.
+	 */
+	if (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6) &&
+	    rw == WRITE)
+		return;
+
+	/*
+	 * For RAID0/1/10/DUP, whatever read/write, we can remove unrelated
+	 * stripes without causing anything wrong.
+	 * RAID5/6 READ is just like RAID0, we don't care about parity unless
+	 * we need to recover.
+	 * For recovery, rw should be set to WRITE.
+	 */
+	while (i < map_block->num_stripes) {
+		struct btrfs_map_stripe *stripe;
+		u64 orig_logical; /* Original stripe logical start */
+		u64 orig_end;	  /* Original stripe logical end */
+
+		stripe = &map_block->stripes[i];
+
+		/*
+		 * For READ, we don't really care about parity
+		 */
+		if (stripe->logical == BTRFS_RAID5_P_STRIPE ||
+		    stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+			del_one_stripe(map_block, i);
+			continue;
+		}
+		/* Completely unrelated stripe */
+		if (stripe->logical >= start + length ||
+		    stripe->logical + stripe->length <= start) {
+			del_one_stripe(map_block, i);
+			continue;
+		}
+		/* Covered stripe, modify its logical and physical */
+		orig_logical = stripe->logical;
+		orig_end = stripe->logical + stripe->length;
+		if (start + length <= orig_end) {
+			stripe->logical = max(orig_logical, start);
+			stripe->length = start + length;
+			stripe->physical += stripe->logical - orig_logical;
+		} else if (start >= orig_logical) {
+			stripe->logical = start;
+			stripe->length = min(orig_end, start + length);
+			stripe->physical += stripe->logical - orig_logical;
+		}
+		/*
+		 * Remaining case: the stripe is fully covered by the range.
+		 * No need to do any modification
+		 */
+		i++;
+	}
+
+	/* Recalculate map_block size */
+	map_block->start = 0;
+	map_block->length = 0;
+	for (i = 0; i < map_block->num_stripes; i++) {
+		struct btrfs_map_stripe *stripe;
+
+		stripe = &map_block->stripes[i];
+		if (stripe->logical > map_block->start)
+			map_block->start = stripe->logical;
+		if
[RFC PATCH v0.8 10/14] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range.
Will be used by the upcoming scrub code.

Signed-off-by: Qu Wenruo
---
 ctree.h       |  2 ++
 extent-tree.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/ctree.h b/ctree.h
index c76b1f1..d22e520 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2372,6 +2372,8 @@ int exclude_super_stripes(struct btrfs_root *root,
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 		       struct btrfs_fs_info *info, u64 start, u64 end);
 u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset);
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+			      u64 len);
 
 /* ctree.c */
 int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
diff --git a/extent-tree.c b/extent-tree.c
index f6d0a7c..88b91df 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -4244,3 +4244,55 @@ u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 	return total_added;
 }
+
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+			      u64 len)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	u64 extent_start;
+	u64 extent_len;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = start + len;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, fs_info->extent_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	/*
+	 * Now we're pointing at the slot whose key.objectid >= end, skip to
+	 * the previous extent.
+	 */
+	ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	extent_start = key.objectid;
+	if (key.type == BTRFS_METADATA_ITEM_KEY)
+		extent_len = fs_info->extent_root->nodesize;
+	else
+		extent_len = key.offset;
+
+	/*
+	 * search_slot() and previous_extent_item() have ensured that our
+	 * extent_start < start + len, so we only need to check the extent end.
+	 */
+	if (extent_start + extent_len <= start)
+		ret = 0;
+	else
+		ret = 1;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
-- 
2.10.0
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
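The return-value logic in btrfs_check_extent_exists() relies on the standard half-open-interval overlap test: once the tree search guarantees extent_start < start + len, the only remaining question is whether the extent end reaches past start. The general predicate, as a small standalone sketch:

```c
#include <stdint.h>

typedef uint64_t u64;

/*
 * Two half-open ranges [a, a + alen) and [b, b + blen) overlap iff each
 * one starts before the other ends.  btrfs_check_extent_exists() only
 * needs one half of this test, because btrfs_search_slot() plus
 * btrfs_previous_extent_item() already guarantee the other half.
 */
static int ranges_overlap(u64 a, u64 alen, u64 b, u64 blen)
{
	return a < b + blen && b < a + alen;
}
```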
[RFC PATCH v0.8 04/14] btrfs-progs: check/scrub: Introduce structures to support fsck scrub
Introduce new local structures, scrub_full_stripe and scrub_stripe, for the incoming offline scrub support.

Signed-off-by: Qu Wenruo
---
 Makefile.in   |   2 +-
 check/scrub.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 check/scrub.c

diff --git a/Makefile.in b/Makefile.in
index 6e2407f..b30880a 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -95,7 +95,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
 	  inode.o file.o find-root.o free-space-tree.o help.o \
-	  check/csum.o
+	  check/csum.o check/scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/check/scrub.c b/check/scrub.c
new file mode 100644
index 000..acfe213
--- /dev/null
+++ b/check/scrub.c
@@ -0,0 +1,100 @@
+/*
+ * Copyright (C) 2016 Fujitsu. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+#include "check.h"
+
+struct scrub_stripe {
+	/* For P/Q, logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+	u64 logical;
+
+	/* Device is missing */
+	unsigned int dev_missing:1;
+
+	/* Any tree/data csum mismatches */
+	unsigned int csum_mismatch:1;
+
+	/* Some data doesn't have csum (nodatasum) */
+	unsigned int csum_missing:1;
+
+	char *data;
+};
+
+struct scrub_full_stripe {
+	u64 logical_start;
+	u64 logical_len;
+	u64 bg_type;
+	u32 nr_stripes;
+	u32 stripe_len;
+
+	/* Read error stripes */
+	u32 err_read_stripes;
+
+	/* Csum error data stripes */
+	u32 err_csum_dstripes;
+
+	/* Missing csum data stripes */
+	u32 missing_csum_dstripes;
+
+	/* Missing stripe index */
+	int missing_stripes[2];
+
+	struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+	int i;
+
+	for (i = 0; i < fstripe->nr_stripes; i++)
+		free(fstripe->stripes[i].data);
+	free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+						   u32 stripe_len)
+{
+	struct scrub_full_stripe *ret;
+	int size = sizeof(*ret) + nr_stripes * sizeof(struct scrub_stripe);
+	int i;
+
+	ret = malloc(size);
+	if (!ret)
+		return NULL;
+
+	memset(ret, 0, size);
+	ret->nr_stripes = nr_stripes;
+	ret->stripe_len = stripe_len;
+
+	/* Alloc data memory for each stripe */
+	for (i = 0; i < nr_stripes; i++) {
+		struct scrub_stripe *stripe = &ret->stripes[i];
+
+		stripe->data = malloc(stripe_len);
+		if (!stripe->data) {
+			free_full_stripe(ret);
+			return NULL;
+		}
+	}
+	return ret;
+}
-- 
2.10.0
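alloc_full_stripe() uses the C99 flexible-array-member pattern: one allocation sized sizeof(header) + n * sizeof(element), zeroed, then per-element buffers allocated with a free-everything-allocated-so-far path on failure. A generic standalone sketch of the same pattern (the names here are hypothetical, not the btrfs-progs types):

```c
#include <stdlib.h>

/* Hypothetical container with a C99 flexible array member */
struct buf_set {
	int nr;
	size_t buf_len;
	char *bufs[];	/* flexible array member, sized at alloc time */
};

static void buf_set_free(struct buf_set *s)
{
	for (int i = 0; i < s->nr; i++)
		free(s->bufs[i]);
	free(s);
}

static struct buf_set *buf_set_alloc(int nr, size_t buf_len)
{
	/* One allocation covers the header plus the nr pointer slots */
	struct buf_set *s = calloc(1, sizeof(*s) + nr * sizeof(char *));

	if (!s)
		return NULL;
	s->nr = nr;
	s->buf_len = buf_len;

	for (int i = 0; i < nr; i++) {
		s->bufs[i] = malloc(buf_len);
		if (!s->bufs[i]) {
			/* Slots beyond i are still NULL from calloc(), and
			 * free(NULL) is a no-op, so full cleanup is safe */
			buf_set_free(s);
			return NULL;
		}
	}
	return s;
}
```

Zeroing the whole allocation up front (calloc here, malloc + memset in the patch) is what makes the partial-failure cleanup path safe.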
[RFC PATCH v0.8 07/14] btrfs-progs: check/scrub: Introduce function to scrub one extent
Introduce a new function, scrub_one_extent(), as a wrapper to check one extent.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index 5cd8bc4..cdba469 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -224,3 +224,76 @@ out:
 		return -EIO;
 	return ret;
 }
+
+/*
+ * Check all copies of the range @start, @len.
+ * The caller must ensure the range is covered by the EXTENT_ITEM/METADATA_ITEM
+ * specified by @path.
+ * If @report is set, it reports whether the range is recoverable or totally
+ * corrupted when it has a corrupted mirror.
+ *
+ * Return 0 if the range is all OK or recoverable.
+ * Return <0 if the range can't be recovered.
+ */
+static int scrub_one_extent(struct btrfs_fs_info *fs_info,
+			    struct btrfs_scrub_progress *scrub_ctx,
+			    struct btrfs_path *path, u64 start, u64 len,
+			    int report)
+{
+	struct btrfs_key key;
+	struct btrfs_extent_item *ei;
+	struct extent_buffer *leaf = path->nodes[0];
+	int slot = path->slots[0];
+	int num_copies;
+	int corrupted = 0;
+	u64 extent_start;
+	u64 extent_len;
+	int metadata = 0;
+	int i;
+	int ret;
+
+	btrfs_item_key_to_cpu(leaf, &key, slot);
+	if (key.type != BTRFS_METADATA_ITEM_KEY &&
+	    key.type != BTRFS_EXTENT_ITEM_KEY)
+		goto invalid_arg;
+
+	extent_start = key.objectid;
+	if (key.type == BTRFS_METADATA_ITEM_KEY) {
+		extent_len = fs_info->tree_root->nodesize;
+		metadata = 1;
+	} else {
+		extent_len = key.offset;
+		ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+		if (btrfs_extent_flags(leaf, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+			metadata = 1;
+	}
+	if (start >= extent_start + extent_len ||
+	    start + len <= extent_start)
+		goto invalid_arg;
+	num_copies = btrfs_num_copies(&fs_info->mapping_tree, start, len);
+	for (i = 1; i <= num_copies; i++) {
+		if (metadata)
+			ret = scrub_tree_mirror(fs_info, scrub_ctx,
+						NULL, extent_start, i);
+		else
+			ret = scrub_data_mirror(fs_info, scrub_ctx, NULL,
+						start, len, i);
+		if (ret < 0)
+			corrupted++;
+	}
+
+	if (report) {
+		if (corrupted && corrupted < num_copies)
+			printf("bytenr %llu len %llu has corrupted mirror, but is recoverable\n",
+			       start, len);
+		else if (corrupted >= num_copies)
+			error("bytenr %llu len %llu has corrupted mirror, can't be recovered",
+			      start, len);
+	}
+	if (corrupted < num_copies)
+		return 0;
+	return -EIO;
+invalid_arg:
+	error("invalid parameter for %s", __func__);
+	return -EINVAL;
+}
-- 
2.10.0
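The final decision in scrub_one_extent() reduces to comparing the corrupted-mirror count against the total copy count: any surviving good copy makes the range recoverable. A standalone sketch of that classification (the enum is hypothetical, not a btrfs-progs API):

```c
enum extent_state {
	EXTENT_OK,		/* no corrupted copy */
	EXTENT_RECOVERABLE,	/* some copies corrupted, at least one good */
	EXTENT_LOST,		/* every copy corrupted */
};

/* Classify a range given how many of its num_copies mirrors failed */
static enum extent_state classify_extent(int corrupted, int num_copies)
{
	if (corrupted == 0)
		return EXTENT_OK;
	if (corrupted < num_copies)
		return EXTENT_RECOVERABLE;
	return EXTENT_LOST;
}
```

Only EXTENT_LOST maps to the -EIO return above; the OK and recoverable cases both return 0.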
[RFC PATCH v0.8 03/14] btrfs-progs: check/csum: Introduce function to read out one data csum
Introduce a new function, btrfs_read_one_data_csum(), to read just one data csum for check usage.

Unlike the original implementation in cmds-check.c, which checks csums one CSUM_EXTENT at a time, this reads out a single csum (4 bytes). It is not fast, but it makes the code easier to read, and it will be used by the later fsck scrub code.

Signed-off-by: Qu Wenruo
---
 Makefile.in   |  6 ++--
 check/check.h | 21 +++++++++++++++++++++
 check/csum.c  | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 check/check.h
 create mode 100644 check/csum.c

diff --git a/Makefile.in b/Makefile.in
index b53cf2c..6e2407f 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -63,6 +63,7 @@ CFLAGS = @CFLAGS@ \
 	 -fPIC \
 	 -I$(TOPDIR) \
 	 -I$(TOPDIR)/kernel-lib \
+	 -I$(TOPDIR)/check \
 	 $(EXTRAWARN_CFLAGS) \
 	 $(DEBUG_CFLAGS_INTERNAL) \
 	 $(EXTRA_CFLAGS)
@@ -93,7 +94,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
 	  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
-	  inode.o file.o find-root.o free-space-tree.o help.o
+	  inode.o file.o find-root.o free-space-tree.o help.o \
+	  check/csum.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
@@ -463,7 +465,7 @@ clean-all: clean clean-doc clean-gen
 clean: $(CLEANDIRS)
 	@echo "Cleaning"
 	$(Q)$(RM) -f -- $(progs) cscope.out *.o *.o.d \
-	      kernel-lib/*.o kernel-lib/*.o.d \
+	      kernel-lib/*.o kernel-lib/*.o.d check/*.o check/*.o.d \
 	      dir-test ioctl-test quick-test send-test library-test library-test-static \
 	      btrfs.static mkfs.btrfs.static \
 	      $(check_defs) \
diff --git a/check/check.h b/check/check.h
new file mode 100644
index 000..61d1cac
--- /dev/null
+++ b/check/check.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2016 Fujitsu. All rights reserved.
+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +/* check/csum.c */ +int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr, +void *csum_ret); diff --git a/check/csum.c b/check/csum.c new file mode 100644 index 000..53195ea --- /dev/null +++ b/check/csum.c @@ -0,0 +1,96 @@ +/* + * Copyright (C) 2016 Fujitsu. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include "ctree.h" +#include "utils.h" +/* + * TODO: + * 1) Add write support for csum + *So we can write new data extents and add csum into csum tree + * 2) Add csum range search function + *So we don't need to search csum tree in a per-sectorsize loop. 
+ */ + +int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr, +void *csum_ret) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_root *csum_root = fs_info->csum_root; + u32 item_offset; + u32 item_size; + u32 final_offset; + u32 sectorsize = fs_info->tree_root->sectorsize; + u16 csum_size = btrfs_super_csum_size(fs_info->super_copy); + int ret; + + if (!csum_ret) { + error("wrong parameter for %s", __func__); + return -EINVAL; + } + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + key.objectid = BTRFS_EXTENT_CSUM_OBJECTID; + key.type =
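Although the hunk above is truncated, the lookup it sets up is simple arithmetic: a csum item stores one csum_size entry per sector of a contiguous logical range starting at the item's key.offset, so the csum for a given bytenr lives at a fixed offset inside the item. A hedged sketch of that indexing (function names are illustrative, not the btrfs-progs ones):

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;
typedef uint16_t u16;

/*
 * Byte offset, inside a csum item covering the contiguous logical range
 * starting at @item_start, of the csum belonging to sector @bytenr.
 * One csum_size entry is stored per sectorsize bytes of data.
 */
static u32 csum_offset_in_item(u64 bytenr, u64 item_start,
			       u32 sectorsize, u16 csum_size)
{
	return (bytenr - item_start) / sectorsize * csum_size;
}

/* Number of sectors a csum item of @item_size bytes covers */
static u32 csums_in_item(u32 item_size, u16 csum_size)
{
	return item_size / csum_size;
}
```

With the default 4K sectorsize and 4-byte CRC32C csums, the third sector of an item's range (bytenr = item_start + 2 * 4096) is found 8 bytes into the item's payload.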
[RFC PATCH v0.8 11/14] btrfs-progs: check/scrub: Introduce function to recover data parity
Introduce a function, recovery_from_parities(), to recover data stripes. This function only supports RAID5 for now, but that should be good enough for the scrub framework.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index d8182d6..c965328 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -58,6 +58,9 @@ struct scrub_full_stripe {
 	/* Missing stripe index */
 	int missing_stripes[2];
 
+	/* Has already been recovered using parities */
+	unsigned int recovered:1;
+
 	struct scrub_stripe stripes[];
 };
 
@@ -467,3 +470,49 @@ out:
 	free(ptrs);
 	return ret;
 }
+
+static int recovery_from_parities(struct btrfs_fs_info *fs_info,
+				  struct btrfs_scrub_progress *scrub_ctx,
+				  struct scrub_full_stripe *fstripe)
+{
+	void **ptrs;
+	int nr_stripes = fstripe->nr_stripes;
+	int corrupted = -1;
+	int stripe_len = BTRFS_STRIPE_LEN;
+	int i;
+	int ret;
+
+	/* No need to recover */
+	if (!fstripe->err_read_stripes && !fstripe->err_csum_dstripes)
+		return 0;
+
+	/* Already recovered once, no more chance */
+	if (fstripe->recovered)
+		return -EINVAL;
+
+	if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6) {
+		/* Need to recover 2 stripes, not supported yet */
+		error("recovering data stripes for RAID6 is not supported yet");
+		return -ENOTTY;
+	}
+
+	/* Out of repair */
+	if (fstripe->err_read_stripes + fstripe->err_csum_dstripes > 1)
+		return -EINVAL;
+
+	ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+	if (!ptrs)
+		return -ENOMEM;
+
+	/* Construct ptrs */
+	for (i = 0; i < nr_stripes; i++)
+		ptrs[i] = fstripe->stripes[i].data;
+	corrupted = fstripe->missing_stripes[0];
+
+	/* Recover the corrupted data csum */
+	ret = raid5_gen_result(nr_stripes, stripe_len, corrupted, ptrs);
+
+	fstripe->recovered = 1;
+	free(ptrs);
+	return ret;
+}
-- 
2.10.0
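raid5_gen_result() can rebuild the single corrupted stripe because RAID5's parity is the bytewise XOR of all the data stripes: XOR-ing every surviving stripe (data and parity alike) reproduces the missing one. A minimal standalone sketch of single-stripe XOR recovery (not the btrfs-progs implementation):

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Rebuild stripes[lost] in place, given that exactly one stripe (data
 * or parity) of a RAID5 full stripe is corrupted: the missing stripe
 * is the XOR of all the remaining ones.
 */
static void raid5_recover_one(uint8_t **stripes, int nr_stripes,
			      size_t stripe_len, int lost)
{
	memset(stripes[lost], 0, stripe_len);
	for (int i = 0; i < nr_stripes; i++) {
		if (i == lost)
			continue;
		for (size_t j = 0; j < stripe_len; j++)
			stripes[lost][j] ^= stripes[i][j];
	}
}
```

This is also why two failures are fatal for RAID5 and why RAID6 needs the extra Q stripe (a Reed-Solomon syndrome, not plain XOR) to tolerate a second loss.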
[RFC PATCH v0.8 12/14] btrfs-progs: check/scrub: Introduce a function to scrub one full stripe
Introduce a new function, scrub_one_full_stripe(), to check a full stripe. It can handle the following cases:
1) Device missing
   Will try to recover, then check against csums
2) Csum mismatch
   Will try to recover, then check against csums
3) All csums match
   Will check against parity, to ensure it's really OK
4) Csum missing
   Just check against parity.

Not implemented:
1) RAID6 recovery.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 183 insertions(+), 10 deletions(-)

diff --git a/check/scrub.c b/check/scrub.c
index c965328..1c8e440 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -55,8 +55,9 @@ struct scrub_full_stripe {
 	/* Missing csum data stripes */
 	u32 missing_csum_dstripes;
 
-	/* Missing stripe index */
-	int missing_stripes[2];
+	/* Corrupted stripe index */
+	int corrupted_index[2];
+	int nr_corrupted_stripes;
 
 	/* Has already been recovered using parities */
 	unsigned int recovered:1;
@@ -87,6 +88,8 @@ static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
 	memset(ret, 0, size);
 	ret->nr_stripes = nr_stripes;
 	ret->stripe_len = stripe_len;
+	ret->corrupted_index[0] = -1;
+	ret->corrupted_index[1] = -1;
 
 	/* Alloc data memory for each stripe */
 	for (i = 0; i < nr_stripes; i++) {
@@ -471,7 +474,7 @@ out:
 	return ret;
 }
 
-static int recovery_from_parities(struct btrfs_fs_info *fs_info,
+static int recover_from_parities(struct btrfs_fs_info *fs_info,
 				 struct btrfs_scrub_progress *scrub_ctx,
 				 struct scrub_full_stripe *fstripe)
 {
@@ -483,22 +486,28 @@ static int recovery_from_parities(struct btrfs_fs_info *fs_info,
 	int ret;
 
 	/* No need to recover */
-	if (!fstripe->err_read_stripes && !fstripe->err_csum_dstripes)
+	if (!fstripe->nr_corrupted_stripes)
 		return 0;
 
-	/* Already recovered once, no more chance */
-	if (fstripe->recovered)
+	if (fstripe->recovered) {
+		error("full stripe %llu has been recovered before, no more chance to recover",
+		      fstripe->logical_start);
 		return -EINVAL;
+	}
 
-	if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6) {
+	if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6 &&
+	    fstripe->nr_corrupted_stripes == 2) {
 		/* Need to recover 2 stripes, not supported yet */
-		error("recovering data stripes for RAID6 is not supported yet");
+		error("recovering 2 data stripes for RAID6 is not supported yet");
 		return -ENOTTY;
 	}
 
 	/* Out of repair */
-	if (fstripe->err_read_stripes + fstripe->err_csum_dstripes > 1)
+	if (fstripe->nr_corrupted_stripes > 1) {
+		error("full stripe %llu has too many missing stripes and csum mismatches, unable to recover",
+		      fstripe->logical_start);
 		return -EINVAL;
+	}
 
 	ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
 	if (!ptrs)
@@ -507,7 +516,7 @@ static int recovery_from_parities(struct btrfs_fs_info *fs_info,
 	/* Construct ptrs */
 	for (i = 0; i < nr_stripes; i++)
 		ptrs[i] = fstripe->stripes[i].data;
-	corrupted = fstripe->missing_stripes[0];
+	corrupted = fstripe->corrupted_index[0];
 
 	/* Recover the corrupted data csum */
 	ret = raid5_gen_result(nr_stripes, stripe_len, corrupted, ptrs);
@@ -516,3 +525,167 @@ static int recovery_from_parities(struct btrfs_fs_info *fs_info,
 	free(ptrs);
 	return ret;
 }
+
+static void record_corrupted_stripe(struct scrub_full_stripe *fstripe,
+				    int index)
+{
+	int i = 0;
+
+	for (i = 0; i < 2; i++) {
+		if (fstripe->corrupted_index[i] == -1) {
+			fstripe->corrupted_index[i] = index;
+			break;
+		}
+	}
+	fstripe->nr_corrupted_stripes++;
+}
+
+static int scrub_one_full_stripe(struct btrfs_fs_info *fs_info,
+				 struct btrfs_scrub_progress *scrub_ctx,
+				 u64 start, u64 *next_ret)
+{
+	struct scrub_full_stripe *fstripe;
+	struct btrfs_map_block *map_block = NULL;
+	u32 stripe_len = BTRFS_STRIPE_LEN;
+	u64 bg_type;
+	u64 len;
+	int max_tolerance;
+	int i;
+	int ret;
+
+	if (!next_ret) {
+		error("invalid argument for %s", __func__);
+		return -EINVAL;
+	}
+
+	ret = __btrfs_map_block_v2(fs_info, WRITE, start, stripe_len,
+				   &map_block);
+	if (ret < 0)
+		return ret;
+	start = map_block->start;
+	len = map_block->length;
+	*next_ret =
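The recovery decisions above boil down to a per-profile tolerance: how many corrupted stripes a full stripe can lose and still be rebuilt (one for RAID5, two for RAID6, though only the single-failure RAID5 path is implemented in this series). A standalone sketch of that decision (a hypothetical helper, not btrfs-progs API):

```c
enum bg_profile { BG_RAID5, BG_RAID6 };

/*
 * Maximum number of corrupted stripes a full stripe can tolerate and
 * still be rebuilt: one for RAID5 (single parity P), two for RAID6
 * (P plus Q).
 */
static int max_tolerance(enum bg_profile p)
{
	return p == BG_RAID6 ? 2 : 1;
}

/* Returns 1 if the full stripe is recoverable given the corruption count */
static int recoverable(enum bg_profile p, int nr_corrupted)
{
	return nr_corrupted <= max_tolerance(p);
}
```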
[RFC PATCH v0.8 06/14] btrfs-progs: check/scrub: Introduce function to scrub mirror based data blocks
Introduce a new function, scrub_data_mirror(), to check mirror-based data blocks.

Signed-off-by: Qu Wenruo
---
 check/scrub.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index ce8d5e5..5cd8bc4 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -157,3 +157,70 @@ out:
 	return ret;
 }
 
+static int scrub_data_mirror(struct btrfs_fs_info *fs_info,
+			     struct btrfs_scrub_progress *scrub_ctx,
+			     char *data, u64 start, u64 len, int mirror)
+{
+	u64 cur = 0;
+	u32 csum;
+	u32 sectorsize = fs_info->tree_root->sectorsize;
+	char *buf = NULL;
+	int ret = 0;
+	int err = 0;
+
+	if (!data) {
+		buf = malloc(len);
+		if (!buf)
+			return -ENOMEM;
+		/* Read out as much data as possible to speed up the read */
+		while (cur < len) {
+			u64 read_len = len - cur;
+
+			ret = read_extent_data(fs_info->tree_root, buf + cur,
+					       start + cur, &read_len, mirror);
+			if (ret < 0) {
+				error("failed to read out data at logical bytenr %llu mirror %d",
+				      start + cur, mirror);
+				scrub_ctx->read_errors++;
+				goto out;
+			}
+			scrub_ctx->data_bytes_scrubbed += read_len;
+			cur += read_len;
+		}
+	} else {
+		buf = data;
+	}
+
+	/* Check csum per sectorsize */
+	cur = 0;
+	while (cur < len) {
+		u32 data_csum = ~(u32)0;
+
+		ret = btrfs_read_one_data_csum(fs_info, start + cur, &csum);
+		if (ret > 0) {
+			scrub_ctx->csum_discards++;
+			/* In case some csums are missing */
+			goto next;
+		}
+		data_csum = btrfs_csum_data(NULL, buf + cur, data_csum,
+					    sectorsize);
+		btrfs_csum_final(data_csum, (u8 *)&data_csum);
+		if (data_csum != csum) {
+			error("data at bytenr %llu mirror %d csum mismatch, have %u expect %u",
+			      start + cur, mirror, data_csum, csum);
+			err = 1;
+			scrub_ctx->csum_errors++;
+			cur += sectorsize;
+			continue;
+		}
+		scrub_ctx->data_bytes_scrubbed += sectorsize;
+next:
+		cur += sectorsize;
+	}
+out:
+	if (!data)
+		free(buf);
+	if (!ret && err)
+		return -EIO;
+	return ret;
+}
-- 
2.10.0
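The per-sector loop above is the core of mirror scrubbing: walk the buffer one sectorsize at a time, look up the expected csum, and count mismatches and missing csums separately. A standalone sketch of that loop with a stand-in checksum (btrfs really uses CRC32C; `toy_csum` here is purely illustrative, as are the array-based csum inputs):

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in checksum; the real code uses btrfs's CRC32C of the sector */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t c = ~0u;

	for (size_t i = 0; i < len; i++)
		c = (c << 5) + c + buf[i];	/* djb2-style mix */
	return c;
}

struct scrub_counts {
	unsigned mismatches;	/* sectors whose csum differs */
	unsigned missing;	/* sectors with no stored csum (nodatasum) */
};

/*
 * Verify @len bytes of @buf sector by sector against @expected, where
 * expected[i] is the stored csum for sector i and has_csum[i] says
 * whether one exists.  Mirrors the counting in scrub_data_mirror().
 */
static void verify_sectors(const uint8_t *buf, size_t len, size_t sectorsize,
			   const uint32_t *expected, const int *has_csum,
			   struct scrub_counts *counts)
{
	for (size_t cur = 0, i = 0; cur < len; cur += sectorsize, i++) {
		if (!has_csum[i]) {
			counts->missing++;
			continue;
		}
		if (toy_csum(buf + cur, sectorsize) != expected[i])
			counts->mismatches++;
	}
}
```

Keeping mismatch and missing counts separate matters: nodatasum sectors are not errors, so only mismatches feed into the recoverability decision.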
[RFC PATCH v0.8 00/14] Offline scrub support, and hint to solve kernel scrub data silent corruption
***Just an RFC patch set for early evaluation, please don't merge it***

For anyone who wants to try it, it can be fetched from my repo:
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

Currently I have only tested it on SINGLE/DUP/RAID1/RAID5 filesystems, with mirror or parity or data corrupted. The tool is able to detect the corruption and give a recoverability report.

Several reports of kernel scrub screwing up good data stripes have been on the ML for some time. The reason seems to be a lack of csum checks before and after reconstruction, and unfinished parity writes also seem to be involved.

To get a comparable tool for kernel scrub, we need a user-space tool to act as a benchmark to compare their different behaviors. So here is the RFC patch set for user-space scrub, which can do:

1) All mirror/backup checks for non-parity based stripes
   Which means for RAID1/DUP/RAID10, we can really check all mirrors,
   not just the 1st good mirror.
   The current "--check-data-csum" option will finally be replaced by
   scrub, as it doesn't really check all mirrors; once it hits a good
   copy, the remaining copies are just ignored.

2) Comprehensive RAID5 full stripe check
   It will check csums before reconstruction using parity, and if too
   many data stripes have csum mismatches, there is no need to
   reconstruct anyway.
   After reconstruction, it will also check the recovered data against
   its csum, to ensure we didn't recover a wrong result.
   For the all-csums-match case, it will re-calculate the parity and
   compare it with the on-disk parity, to detect parity errors.

In fact, it can already expose one new btrfs kernel bug. For example, after screwing up a data stripe, the kernel did repair using parity, but the recovered full stripe has wrong parity. A second scrub is needed to fix it.

This patchset also introduces a new map_block() function, which is more flexible than the current btrfs_map_block() and has a unified interface for all profiles. Check the 1st and 2nd patches for details. They are already used in RAID5/6 scrub, but can be used for other profiles too.
Since it's just an evaluation patchset, it still has a long to-do list:

1) Repair support
   The current tool can already report recoverability; repair is not
   hard to implement.

2) RAID6 support
   The mathematics behind RAID6 recovery is more complex than RAID5.
   Some more code is needed to make it possible to recover data
   stripes, rather than just calculating Q and P.

3) Test cases
   Need to make the infrastructure able to handle multi-device first.

4) Cleaner code and refined logic
   Need better shared logic for all profiles to do scrub, and use the
   new map_block_v2() to replace the old code.

5) Make btrfsck able to handle RAID5 with a missing device
   Right now it doesn't even open a RAID5 btrfs with a missing device,
   even though scrub should be able to handle it.

Qu Wenruo (14):
  btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.
  btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
  btrfs-progs: check/csum: Introduce function to read out one data csum
  btrfs-progs: check/scrub: Introduce structures to support fsck scrub
  btrfs-progs: check/scrub: Introduce function to scrub mirror based tree block
  btrfs-progs: check/scrub: Introduce function to scrub mirror based data blocks
  btrfs-progs: check/scrub: Introduce function to scrub one extent
  btrfs-progs: check/scrub: Introduce function to scrub one data stripe
  btrfs-progs: check/scrub: Introduce function to verify parities
  btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range.
  btrfs-progs: check/scrub: Introduce function to recover data parity
  btrfs-progs: check/scrub: Introduce a function to scrub one full stripe
  btrfs-progs: check/scrub: Introduce function to check a whole block group
  btrfs-progs: fsck: Introduce offline scrub function

 Documentation/btrfs-check.asciidoc |   8 +
 Makefile.in                        |   6 +-
 check/check.h                      |  23 ++
 check/csum.c                       |  96 +
 check/scrub.c                      | 812 +
 cmds-check.c                       |  12 +-
 ctree.h                            |   2 +
 disk-io.c                          |   4 +-
 disk-io.h                          |   2 +
 extent-tree.c                      |  52 +++
 volumes.c                          | 282 +
 volumes.h                          |  49 +++
 12 files changed, 1343 insertions(+), 5 deletions(-)
 create mode 100644 check/check.h
 create mode 100644 check/csum.c
 create mode 100644 check/scrub.c
-- 
2.10.0
Re: speed up cp --reflink=always
On 10/16/2016 09:48 PM, Hans van Kranenburg wrote: > On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote: >> Am 16.10.2016 um 00:37 schrieb Hans van Kranenburg: >>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote: cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes) An example: source file: # ls -la vm-279-disk-1.img -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img target file after around 10 minutes: # ls -la vm-279-disk-1.img.tmp -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp >>> >>> Two quick thoughts: >>> 1. How many extents does this img have? >> >> filefrag says: >> 1011508 extents found > > To cp --reflink this, the filesystem needs to create a million new > EXTENT_DATA objects for the new file, which point all parts of the new > file to all the little same parts of the old file, and probably also > needs to update a million EXTENT_DATA objects in the btrees to add a > second backreference back to the new file. Ehm, the second one is EXTENT_ITEM, not EXTENT_DATA. >>> 2. Is this an XY problem? Why not just put the img in a subvolume and >>> snapshot that? >> >> Sorry what's XY problem? > > It means that I suspected that your actual goal is not spending time to > work on optimizing how cp --reflink works, but that you just want to use > the quickest way to have a clone of the file. > > An XY problem is when someone has problem X, then thinks about solution > Y to solve it, then runs into a problem/limitation/whatever when trying > Y and asks help with that actual problem when doing Y while there might > in the end be a better solution to get X done. > >> Implementing cp reflink was easier - as the original code was based on >> XFS. But shouldn't be cp reflink / clone a file be nearly identical to a >> snapshot? Just creating refs to the extents? 
> > Snapshotting a subvolume only has to write a cowed copy of the top-level > information of the subvolume filesystem tree, and leaves the extent tree > alone. It doesn't have to do 2 million different things. \o/ > -- Hans van Kranenburg -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: speed up cp --reflink=always
Am 16.10.2016 um 21:48 schrieb Hans van Kranenburg: > On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote: >> Am 16.10.2016 um 00:37 schrieb Hans van Kranenburg: >>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote: cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes) An example: source file: # ls -la vm-279-disk-1.img -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img target file after around 10 minutes: # ls -la vm-279-disk-1.img.tmp -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp >>> >>> Two quick thoughts: >>> 1. How many extents does this img have? >> >> filefrag says: >> 1011508 extents found > > To cp --reflink this, the filesystem needs to create a million new > EXTENT_DATA objects for the new file, which point all parts of the new > file to all the little same parts of the old file, and probably also > needs to update a million EXTENT_DATA objects in the btrees to add a > second backreference back to the new file. Thanks for this explanation. > >>> 2. Is this an XY problem? Why not just put the img in a subvolume and >>> snapshot that? >> >> Sorry what's XY problem? > > It means that I suspected that your actual goal is not spending time to > work on optimizing how cp --reflink works, but that you just want to use > the quickest way to have a clone of the file. > > An XY problem is when someone has problem X, then thinks about solution > Y to solve it, then runs into a problem/limitation/whatever when trying > Y and asks help with that actual problem when doing Y while there might > in the end be a better solution to get X done. ah ;-) makes sense. >> Implementing cp reflink was easier - as the original code was based on >> XFS. But shouldn't be cp reflink / clone a file be nearly identical to a >> snapshot? Just creating refs to the extents? 
> > Snapshotting a subvolume only has to write a cowed copy of the top-level > information of the subvolume filesystem tree, and leaves the extent tree > alone. It doesn't have to do 2 million different things. \o/ Thanks for this explanation. Will look into switching to subvolumes. Wasn't able todo this before as i was always running into ENOSPC issues which was solved last week. Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: speed up cp --reflink=always
On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote: > Am 16.10.2016 um 00:37 schrieb Hans van Kranenburg: >> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote: >>> >>> cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes) >>> >>> An example: >>> >>> source file: >>> # ls -la vm-279-disk-1.img >>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img >>> >>> target file after around 10 minutes: >>> # ls -la vm-279-disk-1.img.tmp >>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp >> >> Two quick thoughts: >> 1. How many extents does this img have? > > filefrag says: > 1011508 extents found To cp --reflink this, the filesystem needs to create a million new EXTENT_DATA objects for the new file, which point all parts of the new file to all the little same parts of the old file, and probably also needs to update a million EXTENT_DATA objects in the btrees to add a second backreference back to the new file. >> 2. Is this an XY problem? Why not just put the img in a subvolume and >> snapshot that? > > Sorry what's XY problem? It means that I suspected that your actual goal is not spending time to work on optimizing how cp --reflink works, but that you just want to use the quickest way to have a clone of the file. An XY problem is when someone has problem X, then thinks about solution Y to solve it, then runs into a problem/limitation/whatever when trying Y and asks help with that actual problem when doing Y while there might in the end be a better solution to get X done. > Implementing cp reflink was easier - as the original code was based on > XFS. But shouldn't be cp reflink / clone a file be nearly identical to a > snapshot? Just creating refs to the extents? Snapshotting a subvolume only has to write a cowed copy of the top-level information of the subvolume filesystem tree, and leaves the extent tree alone. It doesn't have to do 2 million different things. 
\o/

-- 
Hans van Kranenburg
Re: speed up cp --reflink=always
On 16.10.2016 at 00:37, Hans van Kranenburg wrote:
> Hi,
>
> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:
>>
>> cp --reflink=always sometimes takes very long (i.e. 25-35 minutes).
>>
>> An example:
>>
>> source file:
>> # ls -la vm-279-disk-1.img
>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>
>> target file after around 10 minutes:
>> # ls -la vm-279-disk-1.img.tmp
>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
>
> Two quick thoughts:
> 1. How many extents does this img have?

filefrag says:
1011508 extents found

> 2. Is this an XY problem? Why not just put the img in a subvolume and
> snapshot that?

Sorry, what's an XY problem?

Implementing cp reflink was easier - as the original code was based on
XFS. But shouldn't cp reflink / cloning a file be nearly identical to a
snapshot? Just creating refs to the extents?

Greets,
Stefan
Re: Incremental send robustness question
On October 14, 2016 12:43:03 AM EDT, Duncan <1i5t5.dun...@cox.net> wrote:
> I see the specific questions have been answered, and alternatives
> explored in one direction, but I've another alternative, in a
> different direction, to suggest.
>
> First a disclaimer. I'm a btrfs user/sysadmin and regular on the list,
> but I'm not a dev, and my own use-case doesn't involve send/receive,
> so what I know regarding send/receive is from the list and manpages,
> not personal experience. With that in mind...
>
> It's worth noting that send/receive are subvolume-specific -- a send
> won't continue down into a subvolume.
>
> Also note that in addition to -p/parent, there's -c/clone-src. The
> latter is more flexible than the super-strict parent option, at the
> expense of a fatter send-stream, as additional metadata is sent that
> specifies which clone the instructions are relative to.
>
> It should be possible to use the combination of these two facts to
> split and recombine your send stream in a firewall-timeout-friendly
> manner, as long as no individual files are so big that sending an
> individual file exceeds the timeout.
>
> 1) Start by taking a read-only snapshot of your intended source
> subvolume, so you have an unchanging reference.
>
> 2) Take multiple writable snapshots of it, and selectively delete
> subdirs (and files if necessary) from each writable snapshot, trimming
> each one to a size that should pass the firewall without interruption,
> so that the combination of all these smaller subvolumes contains the
> content of the single larger one.
>
> 3) Take read-only snapshots of each of these smaller snapshots,
> suitable for sending.
>
> 4) Do a non-incremental send of each of these smaller snapshots to the
> remote.
>
> If it's practical to keep the subvolume divisions, you can simply
> split the working tree into subvolumes and send those individually
> instead of doing the snapshot splitting above, in which case you can
> then use -p/parent on each as you were trying to do on the original,
> and you can stop here.
>
> If you need/prefer the single subvolume, continue...
>
> 5) Do an incremental send of the original full snapshot, using
> multiple -c options to list each of the smaller snapshots. Since all
> the data has already been transferred in the smaller snapshot sends,
> this send should be all metadata, no actual data. It'll simply be
> combining the individual reference subvolumes into a single larger
> subvolume once again.
>
> 6) Once you have the single larger subvolume on the receive side, you
> can delete the smaller snapshots, as you now have a copy of the larger
> subvolume on each side to do further incremental sends of the working
> copy against.
>
> 7) I believe the first incremental send of the full working copy
> against the original larger snapshot will still have to use -c, while
> incremental sends based on that first one will be able to use the
> stricter but slimmer send-stream -p, with each one then using the
> previous one as the parent. However, I'm not sure on that. It may be
> that you have to continue using the fatter send-stream -c each time.
>
> Again, I don't have send/receive experience of my own, so hopefully
> someone who does can reply, either confirming that this should work
> and whether or not -p can be used after the initial setup, or
> explaining why the idea won't work, but at this point, based on my own
> understanding, it seems like it should be perfectly workable to me.
> =:^)

I was considering doing something like this, but the simple solution of
"just bring the disk over" won out. If that hadn't been possible, I
might have done something like that, and I'm still mulling over possible
solutions to similar / related problems.
I think the biggest solution would be support for partial / resumable
receives. That'll probably go on my ever-growing list of things to
possibly look into when I happen upon some free time. It sounds quite
complicated, though...

--Sean
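Duncan's split-and-recombine recipe could be sketched as follows (all subvolume names, subdirectories, and hosts are invented for illustration; this is an untested outline requiring root and btrfs on both ends, not a verified procedure):

```shell
# 1) Read-only reference snapshot of the source subvolume.
btrfs subvolume snapshot -r /vol /vol-ref

# 2)-3) Writable snapshots, each trimmed to a firewall-friendly size,
# then snapshotted read-only so they can be sent.
btrfs subvolume snapshot /vol-ref /vol-part1
rm -rf /vol-part1/big-subdir-b            # keep only part of the tree
btrfs subvolume snapshot -r /vol-part1 /vol-part1-ro

btrfs subvolume snapshot /vol-ref /vol-part2
rm -rf /vol-part2/big-subdir-a            # keep the other part
btrfs subvolume snapshot -r /vol-part2 /vol-part2-ro

# 4) Full (non-incremental) send of each small piece.
btrfs send /vol-part1-ro | ssh remote btrfs receive /backup
btrfs send /vol-part2-ro | ssh remote btrfs receive /backup

# 5) Recombine: send the full reference snapshot, naming the pieces as
# clone sources, so this pass should be mostly metadata.
btrfs send -c /vol-part1-ro -c /vol-part2-ro /vol-ref \
    | ssh remote btrfs receive /backup
```

Whether step 5 really ships no file data depends on btrfs being able to match the clone sources, which is exactly the uncertainty Duncan flags in his point 7.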
Re: [PATCH 2/3] misc: fix fallocate commands that need the unshare switch
On Sat, Oct 15, 2016 at 10:03:03AM -0700, Christoph Hellwig wrote:
> The poster child would be btrfs, and I would have added some output
> here if btrfs support in xfstests wasn't completely broken at this
> point.
>
> Well, added Ccs and some output anyway in this case..

Turns out the btrfs failure was my stupidity, sorry.

I can reproduce the issue I was going to originally show (which was
actually pointed out by Eric for a different fallocate flag check I
wanted to add); here is the diff of the output files when running
generic/156 on btrfs with your patch:

--- tests/generic/156.out	2016-03-29 13:59:30.411720622 +
+++ /root/xfstests/results//generic/156.out.bad	2016-10-16 06:15:27.118776421 +
@@ -2,8 +2,13 @@
 Create the original file blocks
 Create the reflink copies
 funshare part of a file
+fallocate: Operation not supported
 funshare some of the copies
+fallocate: Operation not supported
+fallocate: Operation not supported
 funshare the rest of the files
+fallocate: Operation not supported
+fallocate: Operation not supported
 Rewrite the original file
 free blocks after reflinking is in range
 free blocks after nocow'ing some copies is in range

So what we really need is an enhanced falloc tester that checks that the
tested subcommand is actually implemented on the given file system.
(And we already need something like that for -k on NFS.)
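A minimal probe along the lines suggested might look like this (a sketch only, similar in spirit to xfstests' `_require_xfs_io_command` helper; it assumes `xfs_io` from xfsprogs is installed and the scratch path is hypothetical):

```shell
# Probe whether a given xfs_io falloc subcommand (funshare here) is
# actually implemented on the filesystem under test, and skip the test
# if it is not, instead of letting "Operation not supported" pollute
# the golden output.
testfile=/mnt/scratch/probe.$$    # hypothetical scratch mount
if xfs_io -f -c "funshare 0 64k" "$testfile" 2>&1 \
        | grep -q "not supported"; then
    echo "funshare not supported on this filesystem, skipping"
else
    echo "funshare supported, running test"
fi
rm -f "$testfile"
```

The same probe pattern would extend to the NFS `-k` case mentioned above: attempt the operation once on a throwaway file and branch on the error.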