[PATCH V2] Btrfs-progs: subvol_uuid_search: Return error code on memory allocation failure

2016-10-16 Thread Prasanth K S R
From: Prasanth K S R 

This commit fixes coverity defect CID 1328695.
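
The diff below replaces NULL checks with IS_ERR()/PTR_ERR(), so that callers can tell "not found" apart from "out of memory". A minimal sketch of that pointer-encoded error idiom follows. The three helpers are simplified stand-ins written for this example, and demo_info, demo_search() and demo_get_root_id() are hypothetical names, not code from the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel-style helpers (assumption: the
 * real definitions differ in detail but follow the same idea). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the last MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical record type standing in for struct subvol_info. */
struct demo_info {
	unsigned long long root_id;
};

/*
 * Hypothetical lookup: returns a heap-allocated record on success,
 * ERR_PTR(-ENOENT) when nothing matches and ERR_PTR(-ENOMEM) when the
 * allocation fails.
 */
static struct demo_info *demo_search(unsigned long long root_id, int found)
{
	struct demo_info *info;

	if (!found)
		return ERR_PTR(-ENOENT);

	info = calloc(1, sizeof(*info));
	if (!info)
		return ERR_PTR(-ENOMEM);
	info->root_id = root_id;
	return info;
}

/* Caller side: same shape as "if (IS_ERR(si)) return PTR_ERR(si);". */
static int demo_get_root_id(unsigned long long wanted, int found,
			    unsigned long long *root_id)
{
	struct demo_info *si;

	assert(root_id != NULL);
	si = demo_search(wanted, found);
	if (IS_ERR(si))
		return (int)PTR_ERR(si);
	*root_id = si->root_id;
	free(si);
	return 0;
}
```

Note that IS_ERR(NULL) is false, so once callers are converted the callee must never return NULL on failure.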

Signed-off-by: Prasanth K S R 
---
 cmds-receive.c | 10 +-
 cmds-send.c| 18 +-
 send-utils.c   | 22 ++
 3 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/cmds-receive.c b/cmds-receive.c
index d0525bf..40f64de 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -283,12 +283,12 @@ static int process_snapshot(const char *path, const u8 *uuid, u64 ctransid,
 
parent_subvol = subvol_uuid_search(&r->sus, 0, parent_uuid,
parent_ctransid, NULL, subvol_search_by_received_uuid);
-   if (!parent_subvol) {
+   if (IS_ERR(parent_subvol)) {
parent_subvol = subvol_uuid_search(&r->sus, 0, parent_uuid,
parent_ctransid, NULL, subvol_search_by_uuid);
}
-   if (!parent_subvol) {
-   ret = -ENOENT;
+   if (IS_ERR(parent_subvol)) {
+   ret = PTR_ERR(parent_subvol);
error("cannot find parent subvolume");
goto out;
}
@@ -744,13 +744,13 @@ static int process_clone(const char *path, u64 offset, u64 len,
 
si = subvol_uuid_search(&r->sus, 0, clone_uuid, clone_ctransid, NULL,
subvol_search_by_received_uuid);
-   if (!si) {
+   if (IS_ERR(si)) {
if (memcmp(clone_uuid, r->cur_subvol.received_uuid,
BTRFS_UUID_SIZE) == 0) {
/* TODO check generation of extent */
subvol_path = strdup(r->cur_subvol_path);
} else {
-   ret = -ENOENT;
+   ret = PTR_ERR(si);
error("clone: did not find source subvol");
goto out;
}
diff --git a/cmds-send.c b/cmds-send.c
index 74d0128..b773b40 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -68,8 +68,8 @@ static int get_root_id(struct btrfs_send *s, const char *path, u64 *root_id)
 
si = subvol_uuid_search(&s->sus, 0, NULL, 0, path,
subvol_search_by_path);
-   if (!si)
-   return -ENOENT;
+   if (IS_ERR(si))
+   return PTR_ERR(si);
*root_id = si->root_id;
free(si->path);
free(si);
@@ -83,8 +83,8 @@ static struct subvol_info *get_parent(struct btrfs_send *s, u64 root_id)
 
si_tmp = subvol_uuid_search(&s->sus, root_id, NULL, 0, NULL,
subvol_search_by_root_id);
-   if (!si_tmp)
-   return NULL;
+   if (IS_ERR(si_tmp))
+   return si_tmp;
 
si = subvol_uuid_search(&s->sus, 0, si_tmp->parent_uuid, 0, NULL,
subvol_search_by_uuid);
@@ -104,8 +104,8 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
int i;
 
parent = get_parent(s, root_id);
-   if (!parent) {
-   ret = -ENOENT;
+   if (IS_ERR(parent)) {
+   ret = PTR_ERR(parent);
goto out;
}
 
@@ -119,7 +119,7 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
 
for (i = 0; i < s->clone_sources_count; i++) {
parent2 = get_parent(s, s->clone_sources[i]);
-   if (!parent2)
+   if (IS_ERR(parent2))
continue;
if (parent2->root_id != parent->root_id) {
free(parent2->path);
@@ -133,8 +133,8 @@ static int find_good_parent(struct btrfs_send *s, u64 root_id, u64 *found)
parent2 = subvol_uuid_search(&s->sus, s->clone_sources[i], NULL,
0, NULL, subvol_search_by_root_id);
 
-   if (!parent2) {
-   ret = -ENOENT;
+   if (IS_ERR(parent2)) {
+   ret = PTR_ERR(parent2);
goto out;
}
tmp = parent2->ctransid - parent->ctransid;
diff --git a/send-utils.c b/send-utils.c
index a85fa08..87b8559 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -27,6 +27,7 @@
 #include "send-utils.h"
 #include "ioctl.h"
 #include "btrfs-list.h"
+#include "utils.h"
 
 static int btrfs_subvolid_resolve_sub(int fd, char *path, size_t *path_len,
  u64 subvol_id);
@@ -474,6 +475,11 @@ struct subvol_info *subvol_uuid_search(struct subvol_uuid_search *s,
goto out;
 
info = calloc(1, sizeof(*info));
+   if (!info) {
+   error("Not enough memory");
+   ret = -ENOMEM;
+   goto out;
+   }
info->root_id = root_id;
memcpy(info->uuid, root_item.uuid, BTRFS_UUID_SIZE);
memcpy(info->received_uuid, root_item.received_uuid, BTRFS_UUID_SIZE);
@@ -486,15 +492,23 @@ struct subvol_info *subvol_uuid_search(struct subvol_uuid_search *s,

Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots

2016-10-16 Thread Andrei Borzenkov
15.10.2016 01:58, Alberto Bursi wrote:
> 
> 
> On 10/15/2016 12:17 AM, Chris Murphy wrote:
>> It should be -e can accept a listing of all the subvolumes you want to
>> send at once. And possibly an -r flag, if it existed, could
>> automatically populate -e. But the last time I tested -e I just got
>> errors.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=111221
>>
>>
> 
> Not a problem (for me anyway), I can send all subvolumes already with my 
> script (one after another, but still automatically).
> 
> What I can't do with btrfs commands is to send over the contents of a ro 
> snapshot of / called for example "oldRootSnapshot", directly to 
> "/tmp/newroot" (which is where I have mounted the other drive/volume).
> 

This is more or less expected: send/receive transfers one subvolume into
another subvolume. I am not sure whether zfs can do it either.

As for openSUSE: it does not keep any real data in `/' at all. That is
just a skeleton of the root filesystem with a couple of directories; the
actual root lives in one of the /.snapshots subvolumes.

> The only thing I can do is send over the subvolume as a subvolume.
> So I end up with /tmp/newroot/oldRootSnapshot and inside oldRootSnapshot 
> I get my root, not what I wanted.
> 
> Only way I found so far is using rsync to move the contents of 
> oldRootSnapshot in the /tmp/newroot by setting an exclusion list for all 
> subvolumes, then run a deduplication with duperemove.
> 
> So, is there something I missed to do that?
> 
> -Alberto
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: assign error values to the correct bio structs

2016-10-16 Thread Junjie Mao
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")

Signed-off-by: Junjie Mao 
---
 fs/btrfs/compression.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ccc70d96958d..d4d8b7e36b2f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -698,7 +698,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
ret = btrfs_map_bio(root, comp_bio, mirror_num, 0);
if (ret) {
-   bio->bi_error = ret;
+   comp_bio->bi_error = ret;
bio_endio(comp_bio);
}
 
@@ -728,7 +728,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
ret = btrfs_map_bio(root, comp_bio, mirror_num, 0);
if (ret) {
-   bio->bi_error = ret;
+   comp_bio->bi_error = ret;
bio_endio(comp_bio);
}
 
-- 
1.9.3



Re: speed up cp --reflink=always

2016-10-16 Thread Qu Wenruo



At 10/17/2016 02:54 AM, Stefan Priebe - Profihost AG wrote:

Am 16.10.2016 um 00:37 schrieb Hans van Kranenburg:

Hi,

On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:


cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes)

An example:

source file:
# ls -la vm-279-disk-1.img
-rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img

target file after around 10 minutes:
# ls -la vm-279-disk-1.img.tmp
-rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp


Two quick thoughts:
1. How many extents does this img have?


filefrag says:
1011508 extents found


That is too many fragments.
The average extent size is only about 200K.
This is quite common for VM images when the no copy-on-write (C) attr
is not set.
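
The "about 200K" figure follows from the numbers quoted earlier in this thread: a 204010946560-byte image split into 1011508 extents. A small sketch of that arithmetic, with avg_extent_bytes() as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Average extent size in bytes (integer division, rounds down). */
static uint64_t avg_extent_bytes(uint64_t file_bytes, uint64_t nr_extents)
{
	assert(nr_extents > 0);
	return file_bytes / nr_extents;
}
```

For the file above this gives 201689 bytes, i.e. roughly 196 KiB per extent.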

Normally it's not a good idea to put VM images on btrfs without any
tuning.


Several default features of btrfs are not suitable for that use case:
1) Copy-on-Write
   A VM image sees a lot of random writes.
   This creates a lot of small extents, just as you see here.

   On traditional non-CoW filesystems, like Ext4 and (current) XFS,
   an overwrite is just an overwrite and is not written to a new place.
   So for these filesystems, no matter how many writes happen, the
   extent count won't change much (it mostly stays unchanged).

2) Extent booking
   Another result of CoW: a data extent won't be freed until all of its
   referencers are removed.
   This leads to quite some wasted space.

3) Slow metadata operations
   Btrfs tree CoW and its locking mechanism make metadata operations
   quite slow compared to other filesystems.

   Normal read/write is not a metadata-heavy operation, while reflinking
   is.
   (IIRC, xfs with reflink support, not mainlined yet, is faster than
   btrfs at doing reflink)

Normally, the no cow (C) attr is recommended for the VM image use case.
This flag makes btrfs act much like a traditional fs, until a snapshot
containing this file is created.


It has the limitation that it prohibits reflink, so you can't use
cp --reflink=always then.



If the no cow flag is not what you want, and there are no other
snapshots/subvolumes/reflinked files sharing the file, defrag is highly
recommended before reflink.


That will hugely reduce the number of extents (fragments) and reduce the
time spent in reflink.


However, I suspect the defrag itself may take even longer than the
reflink.


Thanks,
Qu



2. Is this an XY problem? Why not just put the img in a subvolume and
snapshot that?


Sorry, what's an XY problem?

Implementing cp reflink was easier - as the original code was based on
XFS. But shouldn't cp reflink / cloning a file be nearly identical to a
snapshot? Just creating refs to the extents?

Greets,
Stefan


[RFC PATCH v0.8 08/14] btrfs-progs: check/scrub: Introduce function to scrub one data stripe

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_one_data_stripe(), to check all data and
tree blocks inside the data stripe.

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 111 ++
 1 file changed, 111 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index cdba469..f29effa 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -297,3 +297,114 @@ invalid_arg:
error("invalid parameter for %s", __func__);
return -EINVAL;
 }
+
+static int scrub_one_data_stripe(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+struct scrub_stripe *stripe, u32 stripe_len)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *extent_root = fs_info->extent_root;
+   struct btrfs_key key;
+   u64 extent_start;
+   u64 extent_len;
+   u64 orig_csum_discards;
+   int ret;
+
+   if (!is_data_stripe(stripe))
+   return -EINVAL;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = stripe->logical + stripe_len;
+   key.offset = 0;
+   key.type = 0;
+
+   ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   while (1) {
+   struct btrfs_extent_item *ei;
+   struct extent_buffer *eb;
+   char *data;
+   int slot;
+   int metadata = 0;
+   u64 check_start;
+   u64 check_len;
+
+   ret = btrfs_previous_extent_item(extent_root, path, 0);
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   if (ret < 0)
+   goto out;
+   eb = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(eb, &key, slot);
+   extent_start = key.objectid;
+   ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
+
+   /* tree block scrub */
+   if (key.type == BTRFS_METADATA_ITEM_KEY ||
+   btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+   extent_len = extent_root->nodesize;
+   metadata = 1;
+   } else {
+   extent_len = key.offset;
+   metadata = 0;
+   }
+
+   /* Current extent is out of our range, loop comes to end */
+   if (extent_start + extent_len <= stripe->logical)
+   break;
+
+   if (metadata) {
+   /*
+* Check crossing stripe first, which can't be scrubbed
+*/
+   if (check_crossing_stripes(extent_start,
+   extent_root->nodesize)) {
+   error("tree block at %llu is crossing stripe boundary, unable to scrub",
+   extent_start);
+   ret = -EIO;
+   goto out;
+   }
+   data = stripe->data + extent_start - stripe->logical;
+   ret = scrub_tree_mirror(fs_info, scrub_ctx,
+   data, extent_start, 0);
+   /* Any csum/verify error means the stripe is screwed */
+   if (ret < 0) {
+   stripe->csum_mismatch = 1;
+   ret = -EIO;
+   goto out;
+   }
+   ret = 0;
+   continue;
+   }
+   /* Restrict the extent range to fit stripe range */
+   check_start = max(extent_start, stripe->logical);
+   check_len = min(extent_start + extent_len, stripe->logical +
+   stripe_len) - check_start;
+
+   /* Record original csum_discards to detect missing csum case */
+   orig_csum_discards = scrub_ctx->csum_discards;
+
+   data = stripe->data + check_start - stripe->logical;
+   ret = scrub_data_mirror(fs_info, scrub_ctx, data, check_start,
+   check_len, 0);
+   /* Csum mismatch, no need to continue anyway*/
+   if (ret < 0) {
+   stripe->csum_mismatch = 1;
+   goto out;
+   }
+   /* Check if there is any missing csum for data */
+   if (scrub_ctx->csum_discards != orig_csum_discards)
+   stripe->csum_missing = 1;
+   ret = 0;
+   }
+out:
+   btrfs_free_path(path);
+   return ret;
+}
-- 
2.10.0




[RFC PATCH v0.8 01/14] btrfs-progs: Introduce new btrfs_map_block function which returns more unified result.

2016-10-16 Thread Qu Wenruo
Introduce a new function, __btrfs_map_block_v2().

Unlike the old btrfs_map_block(), which needs different parameters to
handle different RAID profiles, this new function uses a unified
btrfs_map_block structure to handle all RAID profiles in a more
meaningful way:

It returns the physical address along with the logical address for each
stripe.

For RAID1/Single/DUP (non-striped), the result would be like:
Map block: Logical 128M, Len 10M, Type RAID1, Stripe len 0, Nr_stripes 2
Stripe 0: Logical 128M, Physical X, Len: 10M Dev dev1
Stripe 1: Logical 128M, Physical Y, Len: 10M Dev dev2

The result will be as long as possible, since it's not striped at all.

For RAID0/10 (striped without parity):
The result will be aligned to the full stripe size:
Map block: Logical 64K, Len 128K, Type RAID10, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 64K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical 128K, Physical Z, Len 64K Dev dev3
Stripe 3: Logical 128K, Physical W, Len 64K Dev dev4

For RAID5/6 (striped with parity and dev-rotation):
The result will be aligned to the full stripe size:
Map block: Logical 64K, Len 128K, Type RAID6, Stripe len 64K, Nr_stripes 4
Stripe 0: Logical 64K, Physical X, Len 64K Dev dev1
Stripe 1: Logical 128K, Physical Y, Len 64K Dev dev2
Stripe 2: Logical RAID5_P, Physical Z, Len 64K Dev dev3
Stripe 3: Logical RAID6_Q, Physical W, Len 64K Dev dev4

The new unified layout should be very flexible and can even handle things
like N-way RAID1 (which the old mirror_num based one can't handle well).
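
The "aligned to full stripe size" rule for the striped profiles can be sketched as below. It mirrors the fstripe_size / fstripe_logical / fstripe_phy_off arithmetic in fill_full_map_block() from this patch; the struct and function names are illustrative and not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative result type; not a structure from the patch. */
struct full_stripe_geom {
	uint64_t logical;	/* logical start of the full stripe */
	uint64_t size;		/* logical bytes covered by one full stripe */
	uint64_t phys_off;	/* offset of that full stripe on each device */
};

static struct full_stripe_geom full_stripe_of(uint64_t bg_start,
					       uint64_t logical,
					       uint32_t stripe_len,
					       int data_stripes,
					       int sub_stripes)
{
	struct full_stripe_geom g;
	uint64_t bg_offset;

	assert(logical >= bg_start && stripe_len > 0 && data_stripes > 0);
	bg_offset = logical - bg_start;	/* offset inside the block group */

	g.size = (uint64_t)stripe_len * data_stripes;
	if (sub_stripes)
		g.size /= sub_stripes;
	/* Round the offset down to a full-stripe boundary. */
	g.logical = bg_offset / g.size * g.size + bg_start;
	g.phys_off = bg_offset / g.size * stripe_len;
	return g;
}
```

With a 64K stripe length, four stripes and sub_stripes of 2, the full stripe size comes out as 128K, which matches the RAID10 example above (assuming all four stripes are counted as data stripes for that profile).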

Signed-off-by: Qu Wenruo 
---
 volumes.c | 181 ++
 volumes.h |  49 +
 2 files changed, 230 insertions(+)

diff --git a/volumes.c b/volumes.c
index a7abd92..94f3e42 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1542,6 +1542,187 @@ out:
return 0;
 }
 
+static inline struct btrfs_map_block *alloc_map_block(int num_stripes)
+{
+   struct btrfs_map_block *ret;
+   int size;
+
+   size = sizeof(struct btrfs_map_stripe) * num_stripes +
+   sizeof(struct btrfs_map_block);
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+   memset(ret, 0, size);
+   return ret;
+}
+
+static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
+  struct btrfs_map_block *map_block)
+{
+   u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+   u64 bg_start = map->ce.start;
+   u64 bg_end = bg_start + map->ce.size;
+   u64 bg_offset = start - bg_start; /* offset inside the block group */
+   u64 fstripe_logical = 0;/* Full stripe start logical bytenr */
+   u64 fstripe_size = 0;   /* Full stripe logical size */
+   u64 fstripe_phy_off = 0;/* Full stripe offset in each dev */
+   u32 stripe_len = map->stripe_len;
+   int sub_stripes = map->sub_stripes;
+   int data_stripes = nr_data_stripes(map);
+   int dev_rotation;
+   int i;
+
+   map_block->num_stripes = map->num_stripes;
+   map_block->type = profile;
+
+   /*
+* Common full stripe data for stripe based profiles
+*/
+   if (profile & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID10 |
+  BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+   fstripe_size = stripe_len * data_stripes;
+   if (sub_stripes)
+   fstripe_size /= sub_stripes;
+   fstripe_logical = round_down(bg_offset, fstripe_size) +
+   bg_start;
+   fstripe_phy_off = bg_offset / fstripe_size * stripe_len;
+   }
+
+   switch (profile) {
+   case BTRFS_BLOCK_GROUP_DUP:
+   case BTRFS_BLOCK_GROUP_RAID1:
+   case 0: /* SINGLE */
+   /*
+* None-stripe mode,(Single, DUP and RAID1)
+* Just use offset to fill map_block
+*/
+   map_block->stripe_len = 0;
+   map_block->start = start;
+   map_block->length = min(bg_end, start + length) - start;
+   for (i = 0; i < map->num_stripes; i++) {
+   struct btrfs_map_stripe *stripe;
+
+   stripe = &map_block->stripes[i];
+
+   stripe->dev = map->stripes[i].dev;
+   stripe->logical = start;
+   stripe->physical = map->stripes[i].physical + bg_offset;
+   stripe->length = map_block->length;
+   }
+   break;
+   case BTRFS_BLOCK_GROUP_RAID10:
+   case BTRFS_BLOCK_GROUP_RAID0:
+   /*
+* Stripe modes without parity(0 and 10)
+* Return the whole full stripe
+*/
+
+   map_block->start = fstripe_logical;
+   map_block->length = fstripe_size;
+   map_block->stripe_len = 

[RFC PATCH v0.8 05/14] btrfs-progs: check/scrub: Introduce function to scrub mirror based tree block

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_tree_mirror(), to scrub mirror based
tree blocks (Single/DUP/RAID0/1/10).

This function can be used on in-memory tree blocks via the @data
parameter, for a RAID5/6 full stripe, or just by @bytenr, for other
profiles.

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 59 +++
 disk-io.c |  4 ++--
 disk-io.h |  2 ++
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/check/scrub.c b/check/scrub.c
index acfe213..ce8d5e5 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -98,3 +98,62 @@ static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
return ret;
 }
 
+static inline int is_data_stripe(struct scrub_stripe *stripe)
+{
+   u64 bytenr = stripe->logical;
+
+   if (bytenr == BTRFS_RAID5_P_STRIPE || bytenr == BTRFS_RAID6_Q_STRIPE)
+   return 0;
+   return 1;
+}
+
+static int scrub_tree_mirror(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+char *data, u64 bytenr, int mirror)
+{
+   struct extent_buffer *eb;
+   u32 nodesize = fs_info->tree_root->nodesize;
+   int ret;
+
+   if (!IS_ALIGNED(bytenr, fs_info->tree_root->sectorsize)) {
+   /* Such error will be reported by check_tree_block() */
+   scrub_ctx->verify_errors++;
+   return -EIO;
+   }
+
+   eb = btrfs_find_create_tree_block(fs_info, bytenr, nodesize);
+   if (!eb)
+   return -ENOMEM;
+   if (data) {
+   memcpy(eb->data, data, nodesize);
+   } else {
+   ret = read_whole_eb(fs_info, eb, mirror);
+   if (ret) {
+   scrub_ctx->read_errors++;
+   error("failed to read tree block %llu mirror %d",
+ bytenr, mirror);
+   goto out;
+   }
+   }
+
+   scrub_ctx->tree_bytes_scrubbed += nodesize;
+   if (csum_tree_block(fs_info->tree_root, eb, 1)) {
+   error("tree block %llu mirror %d checksum mismatch", bytenr,
+   mirror);
+   scrub_ctx->csum_errors++;
+   ret = -EIO;
+   goto out;
+   }
+   ret = check_tree_block(fs_info, eb);
+   if (ret < 0) {
+   error("tree block %llu mirror %d is invalid", bytenr, mirror);
+   scrub_ctx->verify_errors++;
+   goto out;
+   }
+
+   scrub_ctx->tree_extents_scrubbed++;
+out:
+   free_extent_buffer(eb);
+   return ret;
+}
+
diff --git a/disk-io.c b/disk-io.c
index f24567b..2750e6e 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -51,8 +51,8 @@ static u32 max_nritems(u8 level, u32 nodesize)
sizeof(struct btrfs_key_ptr));
 }
 
-static int check_tree_block(struct btrfs_fs_info *fs_info,
-   struct extent_buffer *buf)
+int check_tree_block(struct btrfs_fs_info *fs_info,
+struct extent_buffer *buf)
 {
 
struct btrfs_fs_devices *fs_devices;
diff --git a/disk-io.h b/disk-io.h
index 245626c..43ce9c9 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -113,6 +113,8 @@ static inline struct extent_buffer* read_tree_block(
parent_transid);
 }
 
+int check_tree_block(struct btrfs_fs_info *fs_info,
+struct extent_buffer *buf);
 int read_extent_data(struct btrfs_root *root, char *data, u64 logical,
 u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
-- 
2.10.0





[RFC PATCH v0.8 14/14] btrfs-progs: fsck: Introduce offline scrub function

2016-10-16 Thread Qu Wenruo
Now, btrfs check has a kernel scrub equivalent.

Moreover, it has a stronger csum check against reconstructed data
and existing data stripes.
This avoids the possible silent data corruption in kernel scrub.

For now it only supports a read-only check, but it is already able to
provide info on recoverability.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-check.asciidoc |  8 
 check/check.h  |  2 ++
 check/scrub.c  | 36 
 cmds-check.c   | 12 +++-
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index a32e1c7..98681ff 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
 
+--scrub::
+kernel scrub equivalent.
++
+Off-line scrub has better reconstruction check than kernel. Won't cause
+possible silent data corruption for RAID5
++
+NOTE: Support for RAID6 recover is not fully implemented yet.
+
 DANGEROUS OPTIONS
 -
 
diff --git a/check/check.h b/check/check.h
index 61d1cac..7c14716 100644
--- a/check/check.h
+++ b/check/check.h
@@ -19,3 +19,5 @@
 /* check/csum.c */
 int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
 void *csum_ret);
+/* check/scrub.c */
+int scrub_btrfs(struct btrfs_fs_info *fs_info);
diff --git a/check/scrub.c b/check/scrub.c
index 94f8744..3327791 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -774,3 +774,39 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+int scrub_btrfs(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_block_group_cache *bg_cache;
+   struct btrfs_scrub_progress scrub_ctx = {0};
+   int ret = 0;
+
+   bg_cache = btrfs_lookup_first_block_group(fs_info, 0);
+   if (!bg_cache) {
+   error("no block group is found");
+   return -ENOENT;
+   }
+
+   while (1) {
+   ret = scrub_one_block_group(fs_info, &scrub_ctx, bg_cache);
+   if (ret < 0 && ret != -EIO)
+   break;
+
+   bg_cache = btrfs_lookup_first_block_group(fs_info,
+   bg_cache->key.objectid + bg_cache->key.offset);
+   if (!bg_cache)
+   break;
+   }
+
+   printf("Scrub result:\n");
+   printf("Tree bytes scrubbed: %llu\n", scrub_ctx.tree_bytes_scrubbed);
+   printf("Data bytes scrubbed: %llu\n", scrub_ctx.data_bytes_scrubbed);
+   printf("Read error: %llu\n", scrub_ctx.read_errors);
+   printf("Verify error: %llu\n", scrub_ctx.verify_errors);
+   if (scrub_ctx.csum_errors || scrub_ctx.read_errors ||
+   scrub_ctx.uncorrectable_errors || scrub_ctx.verify_errors)
+   ret = 1;
+   else
+   ret = 0;
+   return ret;
+}
diff --git a/cmds-check.c b/cmds-check.c
index 670ccd1..a081e82 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -41,6 +41,7 @@
 #include "rbtree-utils.h"
 #include "backref.h"
 #include "ulist.h"
+#include "check.h"
 
 enum task_position {
TASK_EXTENTS,
@@ -11252,6 +11253,7 @@ int cmd_check(int argc, char **argv)
int readonly = 0;
int qgroup_report = 0;
int qgroups_repaired = 0;
+   int scrub = 0;
unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
while(1) {
@@ -11259,7 +11261,7 @@ int cmd_check(int argc, char **argv)
enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
-   GETOPT_VAL_MODE };
+   GETOPT_VAL_MODE, GETOPT_VAL_SCRUB };
static const struct option long_options[] = {
{ "super", required_argument, NULL, 's' },
{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -11279,6 +11281,7 @@ int cmd_check(int argc, char **argv)
{ "progress", no_argument, NULL, 'p' },
{ "mode", required_argument, NULL,
GETOPT_VAL_MODE },
+   { "scrub", no_argument, NULL, GETOPT_VAL_SCRUB },
{ NULL, 0, NULL, 0}
};
 
@@ -11350,6 +11353,9 @@ int cmd_check(int argc, char **argv)
exit(1);
}
break;
+   case GETOPT_VAL_SCRUB:
+   scrub = 1;
+   break;
}
}
 
@@ -11402,6 +11408,10 @@ int cmd_check(int argc, char **argv)
global_info = 

[RFC PATCH v0.8 13/14] btrfs-progs: check/scrub: Introduce function to check a whole block group

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_one_block_group(), to scrub a block group.

For Single/DUP/RAID0/RAID1/RAID10, we use the old mirror number based
map_block, and check extent by extent.

For parity based profiles (RAID5/6), we use the new map_block_v2() and
check full stripe by full stripe.
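
Both paths share one error policy: -EIO from a single unit is recorded elsewhere and scrubbing continues, while any other negative value aborts the block group. A minimal sketch of that loop shape, assuming a hypothetical callback type; walk_block_group(), demo_ctx and demo_unit() are illustration-only names, not functions from this series:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical per-unit callback: scrubs the unit starting at @cur, stores
 * the start of the following unit in *next, and returns 0, -EIO (non-fatal)
 * or another negative errno (fatal).
 */
typedef int (*scrub_unit_fn)(uint64_t cur, uint64_t *next, void *data);

static int walk_block_group(uint64_t bg_start, uint64_t bg_len,
			    scrub_unit_fn fn, void *data)
{
	uint64_t cur = bg_start;

	while (cur < bg_start + bg_len) {
		uint64_t next = cur;
		int ret = fn(cur, &next, data);

		/* Only errors other than -EIO stop the walk. */
		if (ret < 0 && ret != -EIO)
			return ret;
		/* Guard against a callback that does not advance. */
		assert(next > cur);
		cur = next;
	}
	return 0;
}

/* Demo callback state, used only to exercise the loop. */
struct demo_ctx {
	uint64_t unit_len;	/* bytes covered by one call */
	uint64_t eio_at;	/* offset that reports -EIO */
	uint64_t fatal_at;	/* offset that reports -ENOMEM */
	int calls;
	int eio_seen;
};

static int demo_unit(uint64_t cur, uint64_t *next, void *data)
{
	struct demo_ctx *ctx = data;

	ctx->calls++;
	*next = cur + ctx->unit_len;
	if (cur == ctx->fatal_at)
		return -ENOMEM;
	if (cur == ctx->eio_at) {
		ctx->eio_seen++;
		return -EIO;
	}
	return 0;
}
```

In this sketch the callback always sets *next, including on -EIO, so the walk can move past a bad unit.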

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 85 +++
 1 file changed, 85 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index 1c8e440..94f8744 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -689,3 +689,88 @@ out:
free(map_block);
return ret;
 }
+
+static int scrub_one_block_group(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+struct btrfs_block_group_cache *bg_cache)
+{
+   struct btrfs_root *extent_root = fs_info->extent_root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   u64 bg_start = bg_cache->key.objectid;
+   u64 bg_len = bg_cache->key.offset;
+   u64 cur;
+   u64 next;
+   int ret;
+
+   if (bg_cache->flags &
+   (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) {
+   /* RAID5/6 check full stripe by full stripe */
+   cur = bg_cache->key.objectid;
+
+   while (cur < bg_start + bg_len) {
+   ret = scrub_one_full_stripe(fs_info, scrub_ctx, cur,
+   &next);
+   /* Ignore any non-fatal error */
+   if (ret < 0 && ret != -EIO) {
+   error("fatal error happens checking one full stripe at bytenr: %llu: %s",
+   cur, strerror(-ret));
+   return ret;
+   }
+   cur = next;
+   }
+   /* Ignore any -EIO error, such error will be reported at last */
+   return 0;
+   }
+   /* None parity based profile, check extent by extent */
+   key.objectid = bg_start;
+   key.type = 0;
+   key.offset = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+   ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   while (1) {
+   struct extent_buffer *eb = path->nodes[0];
+   int slot = path->slots[0];
+   u64 extent_start;
+   u64 extent_len;
+
+   btrfs_item_key_to_cpu(eb, &key, slot);
+   if (key.objectid >= bg_start + bg_len)
+   break;
+   if (key.type != BTRFS_EXTENT_ITEM_KEY &&
+   key.type != BTRFS_METADATA_ITEM_KEY)
+   goto next;
+
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY)
+   extent_len = extent_root->nodesize;
+   else
+   extent_len = key.offset;
+
+   ret = scrub_one_extent(fs_info, scrub_ctx, path, extent_start,
+   extent_len, 1);
+   if (ret < 0 && ret != -EIO) {
+   error("fatal error checking extent bytenr %llu len %llu: %s",
+   extent_start, extent_len, strerror(-ret));
+   goto out;
+   }
+   ret = 0;
+next:
+   ret = btrfs_next_extent_item(extent_root, path, bg_start +
+bg_len);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_free_path(path);
+   return ret;
+}
-- 
2.10.0





[RFC PATCH v0.8 09/14] btrfs-progs: check/scrub: Introduce function to verify parities

2016-10-16 Thread Qu Wenruo
Introduce a new function, verify_parities(), to check whether the parities
match for a full stripe whose data stripes all match their csums.
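
For the RAID5 case the check amounts to recomputing P as the byte-wise XOR of the data stripes and comparing it with the parity read from disk. A minimal self-contained sketch of that idea, assuming plain byte buffers; verify_raid5_parity() is a hypothetical name, and the patch itself goes through raid5_gen_result() and raid6_gen_syndrome() instead:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Recompute RAID5 parity as the byte-wise XOR of all data stripes and
 * compare it with the on-disk parity.
 * Returns 0 on match, -EIO on mismatch, -ENOMEM on allocation failure.
 */
static int verify_raid5_parity(const uint8_t *const *data, int nr_data,
			       const uint8_t *ondisk_p, size_t stripe_len)
{
	uint8_t *calc;
	size_t j;
	int ret = 0;
	int i;

	assert(data != NULL && ondisk_p != NULL);
	assert(nr_data > 0 && stripe_len > 0);

	calc = calloc(1, stripe_len);	/* XOR identity: all zeroes */
	if (!calc)
		return -ENOMEM;
	for (i = 0; i < nr_data; i++)
		for (j = 0; j < stripe_len; j++)
			calc[j] ^= data[i][j];
	if (memcmp(calc, ondisk_p, stripe_len))
		ret = -EIO;
	free(calc);
	return ret;
}
```

The RAID6 Q stripe needs Galois-field arithmetic and is not covered by this sketch.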

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 59 +++
 1 file changed, 59 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index f29effa..d8182d6 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -408,3 +408,62 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+static int verify_parities(struct btrfs_fs_info *fs_info,
+  struct btrfs_scrub_progress *scrub_ctx,
+  struct scrub_full_stripe *fstripe)
+{
+   void **ptrs;
+   void *ondisk_p = NULL;
+   void *ondisk_q = NULL;
+   void *buf_p;
+   void *buf_q;
+   int nr_stripes = fstripe->nr_stripes;
+   int stripe_len = BTRFS_STRIPE_LEN;
+   int i;
+   int ret = 0;
+
+   ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+   buf_p = malloc(fstripe->stripe_len);
+   buf_q = malloc(fstripe->stripe_len);
+   if (!ptrs || !buf_p || !buf_q) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   for (i = 0; i < fstripe->nr_stripes; i++) {
+   struct scrub_stripe *stripe = &fstripe->stripes[i];
+
+   if (stripe->logical == BTRFS_RAID5_P_STRIPE) {
+   ondisk_p = stripe->data;
+   ptrs[i] = buf_p;
+   continue;
+   } else if (stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+   ondisk_q = stripe->data;
+   ptrs[i] = buf_q;
+   continue;
+   } else {
+   ptrs[i] = stripe->data;
+   continue;
+   }
+   }
+   /* RAID6 */
+   if (ondisk_q) {
+   raid6_gen_syndrome(nr_stripes, stripe_len, ptrs);
+   if (memcmp(ondisk_q, ptrs[nr_stripes - 1], stripe_len) ||
+   memcmp(ondisk_p, ptrs[nr_stripes - 2], stripe_len))
+   ret = -EIO;
+   } else {
+   ret = raid5_gen_result(nr_stripes, stripe_len, nr_stripes - 1,
+   ptrs);
+   if (ret < 0)
+   goto out;
+   if (memcmp(ondisk_p, ptrs[nr_stripes - 1], stripe_len))
+   ret = -EIO;
+   }
+out:
+   free(buf_p);
+   free(buf_q);
+   free(ptrs);
+   return ret;
+}
-- 
2.10.0





[RFC PATCH v0.8 02/14] btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes

2016-10-16 Thread Qu Wenruo
For READ, the caller normally wants just what it requested, rather than
the full stripe map.

In this case, we should remove the unrelated stripe maps, as in the
following case:
                  32K                96K
                  |<-request range->|
         0                 64K               128K
RAID0:   |     Data 1      |     Data 2      |
               disk1             disk2
Before this patch, we return the full stripe:
Stripe 0: Logical 0, Physical X, Len 64K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 64K, Dev disk2

After this patch, we limit the stripe result to the request range:
Stripe 0: Logical 32K, Physical X+32K, Len 32K, Dev disk1
Stripe 1: Logical 64k, Physical Y, Len 32K, Dev disk2

And if it's a RAID5/6 stripe, we just handle it like RAID0, ignoring
parities.

This should make the function easier for callers to use.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 103 +-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index 94f3e42..ba16d19 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1682,6 +1682,107 @@ static int fill_full_map_block(struct map_lookup *map, u64 start, u64 length,
return 0;
 }
 
+static void del_one_stripe(struct btrfs_map_block *map_block, int i)
+{
+   int cur_nr = map_block->num_stripes;
+   int size_left = (cur_nr - 1 - i) * sizeof(struct btrfs_map_stripe);
+
+   memmove(&map_block->stripes[i], &map_block->stripes[i + 1], size_left);
+   map_block->num_stripes--;
+}
+
+static void remove_unrelated_stripes(struct map_lookup *map,
+int rw, u64 start, u64 length,
+struct btrfs_map_block *map_block)
+{
+   int i = 0;
+   /*
+* RAID5/6 write must use full stripe.
+* No need to do anything.
+*/
+   if (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6) &&
+   rw == WRITE)
+   return;
+
+   /*
+* For RAID0/1/10/DUP, whatever read/write, we can remove unrelated
+* stripes without causing anything wrong.
+* RAID5/6 READ is just like RAID0, we don't care parity unless we need
+* to recovery.
+* For recovery, rw should be set to WRITE.
+*/
+   while (i < map_block->num_stripes) {
+   struct btrfs_map_stripe *stripe;
+   u64 orig_logical; /* Original stripe logical start */
+   u64 orig_end; /* Original stripe logical end */
+
+   stripe = &map_block->stripes[i];
+
+   /*
+* For READ, we don't really care parity
+*/
+   if (stripe->logical == BTRFS_RAID5_P_STRIPE ||
+   stripe->logical == BTRFS_RAID6_Q_STRIPE) {
+   del_one_stripe(map_block, i);
+   continue;
+   }
+   /* Completely unrelated stripe */
+   if (stripe->logical >= start + length ||
+   stripe->logical + stripe->length <= start) {
+   del_one_stripe(map_block, i);
+   continue;
+   }
+   /* Covered stripe, modify its logical and physical */
+   orig_logical = stripe->logical;
+   orig_end = stripe->logical + stripe->length;
+   if (start + length <= orig_end) {
+   /*
+* |<--range-->|
+*   |  stripe   |
+* Or
+* ||
+*   |  stripe   |
+*/
+   stripe->logical = max(orig_logical, start);
+   stripe->length = start + length;
+   stripe->physical += stripe->logical - orig_logical;
+   } else if (start >= orig_logical) {
+   /*
+* |<-range--->|
+* |  stripe |
+* Or
+* ||
+* |  stripe |
+*/
+   stripe->logical = start;
+   stripe->length = min(orig_end, start + length);
+   stripe->physical += stripe->logical - orig_logical;
+   }
+   /*
+* Remaining case:
+* ||
+*   | stripe |
+* No need to do any modification
+*/
+   i++;
+   }
+
+   /* Recalculate map_block size */
+   map_block->start = 0;
+   map_block->length = 0;
+   for (i = 0; i < map_block->num_stripes; i++) {
+   struct btrfs_map_stripe *stripe;
+
+   stripe = &map_block->stripes[i];
+   if (stripe->logical > map_block->start)
+   map_block->start = stripe->logical;
+   if 

[RFC PATCH v0.8 10/14] btrfs-progs: extent-tree: Introduce function to check if there is any extent in given range.

2016-10-16 Thread Qu Wenruo
Will be used for later scrub usage.
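Once the candidate extent has been found, the check comes down to an overlap test between two half-open ranges. A minimal sketch of just that test, with ranges_overlap() as a hypothetical name and none of the extent tree lookup:

```c
#include <stdint.h>

/*
 * Return 1 if extent [extent_start, extent_start + extent_len) overlaps
 * the range [start, start + len), 0 otherwise. Both ranges are half-open.
 */
static int ranges_overlap(uint64_t extent_start, uint64_t extent_len,
			  uint64_t start, uint64_t len)
{
	if (extent_start >= start + len)
		return 0;
	if (extent_start + extent_len <= start)
		return 0;
	return 1;
}
```

In the patch itself the first condition is already guaranteed by the tree search, so only the extent end has to be compared there.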

Signed-off-by: Qu Wenruo 
---
 ctree.h   |  2 ++
 extent-tree.c | 52 
 2 files changed, 54 insertions(+)

diff --git a/ctree.h b/ctree.h
index c76b1f1..d22e520 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2372,6 +2372,8 @@ int exclude_super_stripes(struct btrfs_root *root,
 u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
   struct btrfs_fs_info *info, u64 start, u64 end);
 u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset);
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+ u64 len);
 
 /* ctree.c */
 int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
diff --git a/extent-tree.c b/extent-tree.c
index f6d0a7c..88b91df 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -4244,3 +4244,55 @@ u64 add_new_free_space(struct btrfs_block_group_cache 
*block_group,
 
return total_added;
 }
+
+int btrfs_check_extent_exists(struct btrfs_fs_info *fs_info, u64 start,
+ u64 len)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   u64 extent_start;
+   u64 extent_len;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = start + len;
+   key.type = 0;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, fs_info->extent_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   /*
+* Now we're pointing at slot whose key.object >= end, skip to previous
+* extent.
+*/
+   ret = btrfs_previous_extent_item(fs_info->extent_root, path, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY)
+   extent_len = fs_info->extent_root->nodesize;
+   else
+   extent_len = key.offset;
+
+   /*
+* search_slot() and previous_extent_item() has ensured that our
+* extent_start < start + len, we only need to care extent end.
+*/
+   if (extent_start + extent_len <= start)
+   ret = 0;
+   else
+   ret = 1;
+
+out:
+   btrfs_free_path(path);
+   return ret;
+}
-- 
2.10.0





[RFC PATCH v0.8 04/14] btrfs-progs: check/scrub: Introduce structures to support fsck scrub

2016-10-16 Thread Qu Wenruo
Introduce new local structures, scrub_full_stripe and scrub_stripe, for
the upcoming offline scrub support.
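For readers unfamiliar with the allocation pattern, here is a standalone sketch of a header with a flexible array member plus one buffer per element, cleaned up safely on partial failure. The struct and function names are hypothetical and the structures are trimmed down; this is not the code from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down versions of the structures in the patch. */
struct stripe_buf {
	uint64_t logical;
	char *data;
};

struct full_stripe_sketch {
	uint32_t nr_stripes;
	uint32_t stripe_len;
	struct stripe_buf stripes[];	/* flexible array member */
};

static void free_full_stripe_sketch(struct full_stripe_sketch *fs)
{
	uint32_t i;

	if (!fs)
		return;
	for (i = 0; i < fs->nr_stripes; i++)
		free(fs->stripes[i].data);	/* free(NULL) is a no-op */
	free(fs);
}

static struct full_stripe_sketch *alloc_full_stripe_sketch(uint32_t nr_stripes,
							    uint32_t stripe_len)
{
	struct full_stripe_sketch *fs;
	uint32_t i;

	/* calloc zeroes the header and every data pointer in one step. */
	fs = calloc(1, sizeof(*fs) + nr_stripes * sizeof(struct stripe_buf));
	if (!fs)
		return NULL;
	fs->nr_stripes = nr_stripes;
	fs->stripe_len = stripe_len;

	for (i = 0; i < nr_stripes; i++) {
		fs->stripes[i].data = malloc(stripe_len);
		if (!fs->stripes[i].data) {
			/* Unallocated slots are still NULL, so this is safe. */
			free_full_stripe_sketch(fs);
			return NULL;
		}
	}
	return fs;
}
```

Zeroing the whole allocation first is what makes the error path safe: the free helper can walk every slot without knowing how far the allocation loop got.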

Signed-off-by: Qu Wenruo 
---
 Makefile.in   |   2 +-
 check/scrub.c | 100 ++
 2 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 check/scrub.c

diff --git a/Makefile.in b/Makefile.in
index 6e2407f..b30880a 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -95,7 +95,7 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o \
- check/csum.o
+ check/csum.o check/scrub.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/check/scrub.c b/check/scrub.c
new file mode 100644
index 000..acfe213
--- /dev/null
+++ b/check/scrub.c
@@ -0,0 +1,100 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include "ctree.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "utils.h"
+#include "check.h"
+
+struct scrub_stripe {
+   /* For P/Q logical start will be BTRFS_RAID5/6_P/Q_STRIPE */
+   u64 logical;
+
+   /* Device is missing */
+   unsigned int dev_missing:1;
+
+   /* Any tree/data csum mismatches */
+   unsigned int csum_mismatch:1;
+
+   /* Some data doesn't have csum(nodatasum) */
+   unsigned int csum_missing:1;
+
+   char *data;
+};
+
+struct scrub_full_stripe {
+   u64 logical_start;
+   u64 logical_len;
+   u64 bg_type;
+   u32 nr_stripes;
+   u32 stripe_len;
+
+   /* Read error stripes */
+   u32 err_read_stripes;
+
+   /* Csum error data stripes */
+   u32 err_csum_dstripes;
+
+   /* Missing csum data stripes */
+   u32 missing_csum_dstripes;
+
+   /* Missing stripe index */
+   int missing_stripes[2];
+
+   struct scrub_stripe stripes[];
+};
+
+static void free_full_stripe(struct scrub_full_stripe *fstripe)
+{
+   int i;
+
+   for (i = 0; i < fstripe->nr_stripes; i++)
+   free(fstripe->stripes[i].data);
+   free(fstripe);
+}
+
+static struct scrub_full_stripe *alloc_full_stripe(int nr_stripes,
+   u32 stripe_len)
+{
+   struct scrub_full_stripe *ret;
+   int size = sizeof(*ret) + nr_stripes * sizeof(struct scrub_stripe);
+   int i;
+
+   ret = malloc(size);
+   if (!ret)
+   return NULL;
+
+   memset(ret, 0, size);
+   ret->nr_stripes = nr_stripes;
+   ret->stripe_len = stripe_len;
+
+   /* Alloc data memory for each stripe */
+   for (i = 0; i < nr_stripes; i++) {
+   struct scrub_stripe *stripe = &ret->stripes[i];
+
+   stripe->data = malloc(stripe_len);
+   if (!stripe->data) {
+   free_full_stripe(ret);
+   return NULL;
+   }
+   }
+   return ret;
+}
+
-- 
2.10.0





[RFC PATCH v0.8 07/14] btrfs-progs: check/scrub: Introduce function to scrub one extent

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_one_extent(), as a wrapper to check one
extent.
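The verdict for a mirrored extent depends only on how many copies failed. A small sketch of that rule, assuming the per-mirror checks have already been run; the enum and judge_mirrors() are hypothetical names, not part of the patch:

```c
enum mirror_verdict {
	MIRRORS_ALL_GOOD,	/* no corrupted copy */
	MIRRORS_RECOVERABLE,	/* at least one good copy remains */
	MIRRORS_LOST		/* every copy is corrupted */
};

/* Classify an extent from the number of copies and corrupted copies. */
static enum mirror_verdict judge_mirrors(int num_copies, int corrupted)
{
	if (corrupted <= 0)
		return MIRRORS_ALL_GOOD;
	if (corrupted < num_copies)
		return MIRRORS_RECOVERABLE;
	return MIRRORS_LOST;
}
```

For example, a RAID1 extent (2 copies) with one bad mirror would be reported as recoverable, while a SINGLE extent with its only copy bad would be lost.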

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 73 +++
 1 file changed, 73 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index 5cd8bc4..cdba469 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -224,3 +224,76 @@ out:
return -EIO;
return ret;
 }
+
+/*
+ * Check all copies of range @start, @len.
+ * Caller must ensure the range is covered by the EXTENT_ITEM/METADATA_ITEM
+ * specified by path.
+ * If @report is set, it will report whether a range with a corrupted mirror
+ * is recoverable or totally corrupted.
+ *
+ * Return 0 if the range is all OK or recoverable.
+ * Return <0 if the range can't be recovered.
+ */
+static int scrub_one_extent(struct btrfs_fs_info *fs_info,
+   struct btrfs_scrub_progress *scrub_ctx,
+   struct btrfs_path *path, u64 start, u64 len,
+   int report)
+{
+   struct btrfs_key key;
+   struct btrfs_extent_item *ei;
+   struct extent_buffer *leaf = path->nodes[0];
+   int slot = path->slots[0];
+   int num_copies;
+   int corrupted = 0;
+   u64 extent_start;
+   u64 extent_len;
+   int metadata = 0;
+   int i;
+   int ret;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
+   if (key.type != BTRFS_METADATA_ITEM_KEY &&
+   key.type != BTRFS_EXTENT_ITEM_KEY)
+   goto invalid_arg;
+
+   extent_start = key.objectid;
+   if (key.type == BTRFS_METADATA_ITEM_KEY) {
+   extent_len = fs_info->tree_root->nodesize;
+   metadata = 1;
+   } else {
+   extent_len = key.offset;
+   ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+   if (btrfs_extent_flags(leaf, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+   metadata = 1;
+   }
+   if (start >= extent_start + extent_len ||
+   start + len <= extent_start)
+   goto invalid_arg;
+   num_copies = btrfs_num_copies(&fs_info->mapping_tree, start, len);
+   for (i = 1; i <= num_copies; i++) {
+   if (metadata)
+   ret = scrub_tree_mirror(fs_info, scrub_ctx,
+   NULL, extent_start, i);
+   else
+   ret = scrub_data_mirror(fs_info, scrub_ctx, NULL,
+   start, len, i);
+   if (ret < 0)
+   corrupted++;
+   }
+
+   if (report) {
+   if (corrupted && corrupted < num_copies)
+   printf("bytenr %llu len %llu has corrupted mirror, but is recoverable\n",
+   start, len);
+   else if (corrupted >= num_copies)
+   error("bytenr %llu len %llu has corrupted mirror, can't be recovered",
+   start, len);
+   start, len);
+   }
+   if (corrupted < num_copies)
+   return 0;
+   return -EIO;
+invalid_arg:
+   error("invalid parameter for %s", __func__);
+   return -EINVAL;
+}
-- 
2.10.0





[RFC PATCH v0.8 03/14] btrfs-progs: check/csum: Introduce function to read out one data csum

2016-10-16 Thread Qu Wenruo
Introduce a new function, btrfs_read_one_data_csum(), to read just one
data csum for check usage.

Unlike the original implementation in cmds-check.c, which checks csums one
CSUM_EXTENT at a time, this just reads out one csum (4 bytes).
It is not fast but makes the code easier to read.

It will be used in the later fsck scrub code.
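Reading a single csum mostly means finding where it sits inside a csum item, which stores one fixed-size csum per sector starting at the item's first bytenr. A sketch of that offset arithmetic only; csum_offset_in_item() and its parameters are hypothetical names and the tree lookup is left out:

```c
#include <stdint.h>

/*
 * Byte offset of the checksum for @bytenr inside a csum item whose first
 * covered data byte is @item_start. Returns -1 if @bytenr is outside the
 * item or not sector-aligned relative to it.
 */
static int64_t csum_offset_in_item(uint64_t item_start, uint32_t item_size,
				   uint64_t bytenr, uint32_t sectorsize,
				   uint16_t csum_size)
{
	uint64_t index;

	if (!sectorsize || !csum_size || bytenr < item_start)
		return -1;
	if ((bytenr - item_start) % sectorsize)
		return -1;
	index = (bytenr - item_start) / sectorsize;
	if (index >= item_size / csum_size)
		return -1;
	return (int64_t)(index * csum_size);
}
```

With 4K sectors and 4-byte csums, the csum for the fourth sector covered by an item would sit at byte offset 12 of the item data.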

Signed-off-by: Qu Wenruo 
---
 Makefile.in   |  6 ++--
 check/check.h | 21 +
 check/csum.c  | 96 +++
 3 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 check/check.h
 create mode 100644 check/csum.c

diff --git a/Makefile.in b/Makefile.in
index b53cf2c..6e2407f 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -63,6 +63,7 @@ CFLAGS = @CFLAGS@ \
 -fPIC \
 -I$(TOPDIR) \
 -I$(TOPDIR)/kernel-lib \
+-I$(TOPDIR)/check \
 $(EXTRAWARN_CFLAGS) \
 $(DEBUG_CFLAGS_INTERNAL) \
 $(EXTRA_CFLAGS)
@@ -93,7 +94,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o 
extent-tree.o print-tree.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid56.o free-space-cache.o kernel-lib/list_sort.o props.o \
  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
- inode.o file.o find-root.o free-space-tree.o help.o
+ inode.o file.o find-root.o free-space-tree.o help.o \
+ check/csum.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
@@ -463,7 +465,7 @@ clean-all: clean clean-doc clean-gen
 clean: $(CLEANDIRS)
@echo "Cleaning"
$(Q)$(RM) -f -- $(progs) cscope.out *.o *.o.d \
-   kernel-lib/*.o kernel-lib/*.o.d \
+   kernel-lib/*.o kernel-lib/*.o.d check/*.o check/*.o.d \
  dir-test ioctl-test quick-test send-test library-test 
library-test-static \
  btrfs.static mkfs.btrfs.static \
  $(check_defs) \
diff --git a/check/check.h b/check/check.h
new file mode 100644
index 000..61d1cac
--- /dev/null
+++ b/check/check.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/* check/csum.c */
+int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
+void *csum_ret);
diff --git a/check/csum.c b/check/csum.c
new file mode 100644
index 000..53195ea
--- /dev/null
+++ b/check/csum.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "utils.h"
+/*
+ * TODO:
+ * 1) Add write support for csum
+ *So we can write new data extents and add csum into csum tree
+ * 2) Add csum range search function
+ *So we don't need to search csum tree in a per-sectorsize loop.
+ */
+
+int btrfs_read_one_data_csum(struct btrfs_fs_info *fs_info, u64 bytenr,
+void *csum_ret)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root *csum_root = fs_info->csum_root;
+   u32 item_offset;
+   u32 item_size;
+   u32 final_offset;
+   u32 sectorsize = fs_info->tree_root->sectorsize;
+   u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+   int ret;
+
+   if (!csum_ret) {
+   error("wrong parameter for %s", __func__);
+   return -EINVAL;
+   }
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+   key.type = 

[RFC PATCH v0.8 11/14] btrfs-progs: check/scrub: Introduce function to recover data parity

2016-10-16 Thread Qu Wenruo
Introduce a function, recover_from_parities(), to recover data stripes.

This function only supports RAID5 for now, but that should be good enough
for the scrub framework.
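RAID5 recovery of a single stripe is a plain XOR over all the other stripes, parity included. A minimal sketch, assuming every buffer is already in memory; xor_rebuild() is a hypothetical helper and not the raid5_gen_result() used by the patch:

```c
#include <stddef.h>
#include <string.h>

/*
 * Rebuild stripe @dest by XOR-ing all other stripes (data + parity).
 * @ptrs holds @nr_stripes buffers of @stripe_len bytes each.
 * Returns 0 on success, -1 on invalid arguments.
 */
static int xor_rebuild(int nr_stripes, size_t stripe_len, int dest,
		       unsigned char **ptrs)
{
	int i;
	size_t j;

	if (nr_stripes < 2 || dest < 0 || dest >= nr_stripes)
		return -1;

	memset(ptrs[dest], 0, stripe_len);
	for (i = 0; i < nr_stripes; i++) {
		if (i == dest)
			continue;
		for (j = 0; j < stripe_len; j++)
			ptrs[dest][j] ^= ptrs[i][j];
	}
	return 0;
}
```

The same routine generates parity (dest is the parity slot) and recovers a lost data stripe (dest is the data slot), which is why only one bad stripe can be tolerated.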

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 49 +
 1 file changed, 49 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index d8182d6..c965328 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -58,6 +58,9 @@ struct scrub_full_stripe {
/* Missing stripe index */
int missing_stripes[2];
 
+   /* Has already been recovered using parities */
+   unsigned int recovered:1;
+
struct scrub_stripe stripes[];
 };
 
@@ -467,3 +470,49 @@ out:
free(ptrs);
return ret;
 }
+
+static int recovery_from_parities(struct btrfs_fs_info *fs_info,
+ struct btrfs_scrub_progress *scrub_ctx,
+ struct scrub_full_stripe *fstripe)
+{
+   void **ptrs;
+   int nr_stripes = fstripe->nr_stripes;
+   int corrupted = -1;
+   int stripe_len = BTRFS_STRIPE_LEN;
+   int i;
+   int ret;
+
+   /* No need to recover */
+   if (!fstripe->err_read_stripes && !fstripe->err_csum_dstripes)
+   return 0;
+
+   /* Already recovered once, no more chance */
+   if (fstripe->recovered)
+   return -EINVAL;
+
+   if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6) {
+   /* Need to recover 2 stripes, not supported yet */
+   error("recover data stripes for RAID6 is not support yet");
+   return -ENOTTY;
+   }
+
+   /* Out of repair */
+   if (fstripe->err_read_stripes + fstripe->err_csum_dstripes > 1)
+   return -EINVAL;
+
+   ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
+   if (!ptrs)
+   return -ENOMEM;
+
+   /* Construct ptrs */
+   for (i = 0; i < nr_stripes; i++)
+   ptrs[i] = fstripe->stripes[i].data;
+   corrupted = fstripe->missing_stripes[0];
+
+   /* Recover the corrupted data csum */
+   ret = raid5_gen_result(nr_stripes, stripe_len, corrupted, ptrs);
+
+   fstripe->recovered = 1;
+   free(ptrs);
+   return ret;
+}
-- 
2.10.0





[RFC PATCH v0.8 12/14] btrfs-progs: check/scrub: Introduce a function to scrub one full stripe

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_one_full_stripe(), to check a full
stripe.

It can handle the following cases:
1) Device missing
   Will try to recover, then check against csum

2) Csum mismatch
   Will try to recover, then check against csum

3) All csum match
   Will check against parity, to ensure it's OK

4) Csum missing
   Just check against parity.

Not implemented:
1) RAID6 recovery.
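The cases above can be read as a three-way decision on the number of bad stripes. A rough sketch of that decision under my reading of the list; the enum and pick_stripe_action() are hypothetical names and do not appear in the patch:

```c
enum stripe_action {
	STRIPE_VERIFY_PARITY,	/* data looks fine: recompute and compare parity */
	STRIPE_RECOVER,		/* rebuild from parity, then re-check csum */
	STRIPE_UNRECOVERABLE	/* more bad stripes than the profile tolerates */
};

/*
 * @nr_corrupted: stripes that are missing or failed their csum check.
 * @max_tolerance: bad stripes the profile can survive (1 for RAID5;
 *                 RAID6 would be 2, but two-stripe recovery is not
 *                 implemented in this patch).
 */
static enum stripe_action pick_stripe_action(int nr_corrupted,
					     int max_tolerance)
{
	if (nr_corrupted == 0)
		return STRIPE_VERIFY_PARITY;
	if (nr_corrupted <= max_tolerance)
		return STRIPE_RECOVER;
	return STRIPE_UNRECOVERABLE;
}
```

Stripes with missing csums count as neither good nor bad here, which matches case 4: they can only be checked against parity.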

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 193 +++---
 1 file changed, 183 insertions(+), 10 deletions(-)

diff --git a/check/scrub.c b/check/scrub.c
index c965328..1c8e440 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -55,8 +55,9 @@ struct scrub_full_stripe {
/* Missing csum data stripes */
u32 missing_csum_dstripes;
 
-   /* Missing stripe index */
-   int missing_stripes[2];
+   /* corrupted stripe index */
+   int corrupted_index[2];
+   int nr_corrupted_stripes;
 
/* Has already been recovered using parities */
unsigned int recovered:1;
@@ -87,6 +88,8 @@ static struct scrub_full_stripe *alloc_full_stripe(int 
nr_stripes,
memset(ret, 0, size);
ret->nr_stripes = nr_stripes;
ret->stripe_len = stripe_len;
+   ret->corrupted_index[0] = -1;
+   ret->corrupted_index[1] = -1;
 
/* Alloc data memory for each stripe */
for (i = 0; i < nr_stripes; i++) {
@@ -471,7 +474,7 @@ out:
return ret;
 }
 
-static int recovery_from_parities(struct btrfs_fs_info *fs_info,
+static int recover_from_parities(struct btrfs_fs_info *fs_info,
  struct btrfs_scrub_progress *scrub_ctx,
  struct scrub_full_stripe *fstripe)
 {
@@ -483,22 +486,28 @@ static int recovery_from_parities(struct btrfs_fs_info 
*fs_info,
int ret;
 
/* No need to recover */
-   if (!fstripe->err_read_stripes && !fstripe->err_csum_dstripes)
+   if (!fstripe->nr_corrupted_stripes)
return 0;
 
-   /* Already recovered once, no more chance */
-   if (fstripe->recovered)
+   if (fstripe->recovered) {
+   error("full stripe %llu has been recovered before, no more chance to recover",
+ fstripe->logical_start);
return -EINVAL;
+   }
 
-   if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6) {
+   if (fstripe->bg_type == BTRFS_BLOCK_GROUP_RAID6 &&
+   fstripe->nr_corrupted_stripes == 2) {
/* Need to recover 2 stripes, not supported yet */
-   error("recover data stripes for RAID6 is not support yet");
+   error("recover 2 data stripes for RAID6 is not support yet");
return -ENOTTY;
}
 
/* Out of repair */
-   if (fstripe->err_read_stripes + fstripe->err_csum_dstripes > 1)
+   if (fstripe->nr_corrupted_stripes > 1) {
+   error("full stripe %llu has too many missing stripes and csum mismatch, unable to recover",
+ fstripe->logical_start);
return -EINVAL;
+   }
 
ptrs = malloc(sizeof(void *) * fstripe->nr_stripes);
if (!ptrs)
@@ -507,7 +516,7 @@ static int recovery_from_parities(struct btrfs_fs_info 
*fs_info,
/* Construct ptrs */
for (i = 0; i < nr_stripes; i++)
ptrs[i] = fstripe->stripes[i].data;
-   corrupted = fstripe->missing_stripes[0];
+   corrupted = fstripe->corrupted_index[0];
 
/* Recover the corrupted data csum */
ret = raid5_gen_result(nr_stripes, stripe_len, corrupted, ptrs);
@@ -516,3 +525,167 @@ static int recovery_from_parities(struct btrfs_fs_info 
*fs_info,
free(ptrs);
return ret;
 }
+
+static void record_corrupted_stripe(struct scrub_full_stripe *fstripe,
+   int index)
+{
+   int i = 0;
+
+   for (i = 0; i < 2; i++) {
+   if (fstripe->corrupted_index[i] == -1) {
+   fstripe->corrupted_index[i] = index;
+   break;
+   }
+   }
+   fstripe->nr_corrupted_stripes++;
+}
+
+static int scrub_one_full_stripe(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+u64 start, u64 *next_ret)
+{
+   struct scrub_full_stripe *fstripe;
+   struct btrfs_map_block *map_block = NULL;
+   u32 stripe_len = BTRFS_STRIPE_LEN;
+   u64 bg_type;
+   u64 len;
+   int max_tolerance;
+   int i;
+   int ret;
+
+   if (!next_ret) {
+   error("invalid argument for %s", __func__);
+   return -EINVAL;
+   }
+
+   ret = __btrfs_map_block_v2(fs_info, WRITE, start, stripe_len,
+  &map_block);
+   if (ret < 0)
+   return ret;
+   start = map_block->start;
+   len = map_block->length;
+   *next_ret = 

[RFC PATCH v0.8 06/14] btrfs-progs: check/scrub: Introduce function to scrub mirror based data blocks

2016-10-16 Thread Qu Wenruo
Introduce a new function, scrub_data_mirror(), to check mirror based
data blocks.
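The core of the check is a per-sector checksum loop. Below is a self-contained sketch using a bitwise CRC32C, the checksum algorithm btrfs uses for data. The helper names are hypothetical, and the sketch makes no claim to match btrfs_csum_data() and btrfs_csum_final() byte for byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), slow but dependency-free. */
static uint32_t crc32c_sketch(const unsigned char *buf, size_t len)
{
	uint32_t crc = ~(uint32_t)0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Compare every @sectorsize chunk of @buf against @expected[].
 * Returns the number of mismatching sectors.
 */
static size_t count_csum_mismatches(const unsigned char *buf, size_t len,
				    size_t sectorsize,
				    const uint32_t *expected)
{
	size_t cur, bad = 0;

	if (!sectorsize)
		return 0;
	for (cur = 0; cur + sectorsize <= len; cur += sectorsize) {
		if (crc32c_sketch(buf + cur, sectorsize) !=
		    expected[cur / sectorsize])
			bad++;
	}
	return bad;
}
```

A real scrub would also have to skip sectors that have no csum at all (nodatasum), which this sketch does not model.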

Signed-off-by: Qu Wenruo 
---
 check/scrub.c | 67 +++
 1 file changed, 67 insertions(+)

diff --git a/check/scrub.c b/check/scrub.c
index ce8d5e5..5cd8bc4 100644
--- a/check/scrub.c
+++ b/check/scrub.c
@@ -157,3 +157,70 @@ out:
return ret;
 }
 
+static int scrub_data_mirror(struct btrfs_fs_info *fs_info,
+struct btrfs_scrub_progress *scrub_ctx,
+char *data, u64 start, u64 len, int mirror)
+{
+   u64 cur = 0;
+   u32 csum;
+   u32 sectorsize = fs_info->tree_root->sectorsize;
+   char *buf = NULL;
+   int ret = 0;
+   int err = 0;
+
+   if (!data) {
+   buf = malloc(len);
+   if (!buf)
+   return -ENOMEM;
+   /* Read out as much data as possible to speed up read */
+   while (cur < len) {
+   u64 read_len = len - cur;
+
+   ret = read_extent_data(fs_info->tree_root, buf + cur,
+   start + cur, &read_len, mirror);
+   if (ret < 0) {
+   error("failed to read out data at logical bytenr %llu mirror %d",
+ start + cur, mirror);
+   scrub_ctx->read_errors++;
+   goto out;
+   }
+   scrub_ctx->data_bytes_scrubbed += read_len;
+   cur += read_len;
+   }
+   } else {
+   buf = data;
+   }
+
+   /* Check csum per-sectorsize */
+   cur = 0;
+   while (cur < len) {
+   u32 data_csum = ~(u32)0;
+
+   ret = btrfs_read_one_data_csum(fs_info, start + cur, &csum);
+   if (ret > 0) {
+   scrub_ctx->csum_discards++;
+   /* In case some csum are missing */
+   goto next;
+   }
+   data_csum = btrfs_csum_data(NULL, buf + cur, data_csum,
+   sectorsize);
+   btrfs_csum_final(data_csum, (u8 *)&data_csum);
+   if (data_csum != csum) {
+   error("data at bytenr %llu mirror %d csum mismatch, have %u expect %u",
+ start + cur, mirror, data_csum, csum);
+   err = 1;
+   scrub_ctx->csum_errors++;
+   cur += sectorsize;
+   continue;
+   }
+   scrub_ctx->data_bytes_scrubbed += sectorsize;
+next:
+   cur += sectorsize;
+   }
+out:
+   if (!data)
+   free(buf);
+   if (!ret && err)
+   return -EIO;
+   return ret;
+}
-- 
2.10.0





[RFC PATCH v0.8 00/14] Offline scrub support, and hint to solve kernel scrub data silent corruption

2016-10-16 Thread Qu Wenruo
***Just RFC patch for early evaluation, please don't merge it***

Anyone who wants to try it can get it from my repo:
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

Currently, I have only tested it on SINGLE/DUP/RAID1/RAID5 filesystems, with
a mirror, parity, or data corrupted.
The tool is able to detect all of them and give a recoverability report.

Several reports of kernel scrub screwing up good data stripes have been on
the ML for some time.

The reason seems to be a lack of csum checks before and after reconstruction,
and unfinished parity writes also seem to be involved.

To have something to compare kernel scrub against, we need a user-space tool
to act as a benchmark so that their behaviors can be compared.

So here is the RFC patch set for user-space scrub.

It can do:

1) All mirror/backup check for non-parity based stripes
   This means that for RAID1/DUP/RAID10, we can really check all mirrors
   rather than stopping at the first good mirror.

   The current "--check-data-csum" option will eventually be replaced by
   scrub, as it doesn't really check all mirrors: if it hits a good copy,
   the remaining copies are just ignored.

2) Comprehensive RAID5 full stripe check
   It will check csum before reconstruction using parity.
   And if too many data stripes have csum mismatches, there is no need to
   reconstruct anyway.

   After reconstruction, it will also check the recovered data against
   csum, to ensure we didn't recover a wrong result.

   For the all-csum-match case, it will re-calculate parity and compare it
   with the on-disk parity, to detect parity errors.
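As an illustration of the last step (all csums match, so only parity is in question), here is a minimal sketch that recomputes RAID5 parity and compares it with the on-disk copy. verify_raid5_parity() is a hypothetical helper, not a function from this patchset:

```c
#include <stddef.h>
#include <string.h>

/*
 * Recompute RAID5 parity over @nr_data data stripes and compare it with
 * @ondisk_p. @scratch must hold @stripe_len bytes.
 * Returns 0 if parity matches, -1 if it does not.
 */
static int verify_raid5_parity(int nr_data, size_t stripe_len,
			       unsigned char *const *data,
			       const unsigned char *ondisk_p,
			       unsigned char *scratch)
{
	int i;
	size_t j;

	memset(scratch, 0, stripe_len);
	for (i = 0; i < nr_data; i++)
		for (j = 0; j < stripe_len; j++)
			scratch[j] ^= data[i][j];

	return memcmp(scratch, ondisk_p, stripe_len) ? -1 : 0;
}
```

A mismatch here, with all data csums good, points at the parity stripe itself rather than at the data.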

In fact, it can already expose one new btrfs kernel bug.
For example, after screwing up a data stripe, the kernel did repair it using
parity, but the recovered full stripe has wrong parity.
It needs to be scrubbed again to fix it.

This patchset also introduces a new map_block() function, which is
more flexible than the current btrfs_map_block() and has a unified interface
for all profiles.
Check the 1st and 2nd patches for details.

They are already used in RAID5/6 scrub, but can be used for other
profiles too.

Since it's just an evaluation patchset, it still has a long to-do list:

1) Repair support
   In fact, the current tool can already report recoverability, so repair is
   not hard to implement.

2) RAID6 support
   The mathematics behind RAID6 recovery is more complex than RAID5.
   Need some more code to make it possible to recover data stripes,
   rather than just calculating Q and P.

3) Test cases
   Need to make the infrastructure able to handle multi-device first.

4) Cleaner code and refined logic
   Need better shared logic for all profiles to do scrub, and use the
   new map_block_v2() to replace the old code.

5) Make btrfsck able to handle RAID5 with a missing device
   Now it doesn't even open a RAID5 btrfs with a missing device, even though
   scrub should be able to handle it.

Qu Wenruo (14):
  btrfs-progs: Introduce new btrfs_map_block function which returns more
unified result.
  btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes
  btrfs-progs: check/csum: Introduce function to read out one data csum
  btrfs-progs: check/scrub: Introduce structures to support fsck scrub
  btrfs-progs: check/scrub: Introduce function to scrub mirror based
tree block
  btrfs-progs: check/scrub: Introduce function to scrub mirror based
data blocks
  btrfs-progs: check/scrub: Introduce function to scrub one extent
  btrfs-progs: check/scrub: Introduce function to scrub one data stripe
  btrfs-progs: check/scrub: Introduce function to verify parities
  btrfs-progs: extent-tree: Introduce function to check if there is any
extent in given range.
  btrfs-progs: check/scrub: Introduce function to recover data parity
  btrfs-progs: check/scrub: Introduce a function to scrub one full
stripe
  btrfs-progs: check/scrub: Introduce function to check a whole block
group
  btrfs-progs: fsck: Introduce offline scrub function

 Documentation/btrfs-check.asciidoc |   8 +
 Makefile.in|   6 +-
 check/check.h  |  23 ++
 check/csum.c   |  96 +
 check/scrub.c  | 812 +
 cmds-check.c   |  12 +-
 ctree.h|   2 +
 disk-io.c  |   4 +-
 disk-io.h  |   2 +
 extent-tree.c  |  52 +++
 volumes.c  | 282 +
 volumes.h  |  49 +++
 12 files changed, 1343 insertions(+), 5 deletions(-)
 create mode 100644 check/check.h
 create mode 100644 check/csum.c
 create mode 100644 check/scrub.c

-- 
2.10.0





Re: speed up cp --reflink=always

2016-10-16 Thread Hans van Kranenburg
On 10/16/2016 09:48 PM, Hans van Kranenburg wrote:
> On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote:
>> Am 16.10.2016 um 00:37 schrieb Hans van Kranenburg:
>>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:

>>>> cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes)
>>>>
>>>> An example:
>>>>
>>>> source file:
>>>> # ls -la vm-279-disk-1.img
>>>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>>>
>>>> target file after around 10 minutes:
>>>> # ls -la vm-279-disk-1.img.tmp
>>>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
>>>
>>> Two quick thoughts:
>>> 1. How many extents does this img have?
>>
>> filefrag says:
>> 1011508 extents found
> 
> To cp --reflink this, the filesystem needs to create a million new
> EXTENT_DATA objects for the new file, which point all parts of the new
> file to all the little same parts of the old file, and probably also
> needs to update a million EXTENT_DATA objects in the btrees to add a
> second backreference back to the new file.

Ehm, the second one is EXTENT_ITEM, not EXTENT_DATA.

>>> 2. Is this an XY problem? Why not just put the img in a subvolume and
>>> snapshot that?
>>
>> Sorry what's XY problem?
> 
> It means that I suspected that your actual goal is not spending time to
> work on optimizing how cp --reflink works, but that you just want to use
> the quickest way to have a clone of the file.
> 
> An XY problem is when someone has problem X, then thinks about solution
> Y to solve it, then runs into a problem/limitation/whatever when trying
> Y and asks help with that actual problem when doing Y while there might
> in the end be a better solution to get X done.
> 
>> Implementing cp reflink was easier - as the original code was based on
>> XFS. But shouldn't be cp reflink / clone a file be nearly identical to a
>> snapshot? Just creating refs to the extents?
> 
> Snapshotting a subvolume only has to write a cowed copy of the top-level
> information of the subvolume filesystem tree, and leaves the extent tree
> alone. It doesn't have to do 2 million different things. \o/
> 


-- 
Hans van Kranenburg


Re: speed up cp --reflink=always

2016-10-16 Thread Stefan Priebe - Profihost AG
On 16.10.2016 at 21:48, Hans van Kranenburg wrote:
> On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote:
>> On 16.10.2016 at 00:37, Hans van Kranenburg wrote:
>>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:

>>>> cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes)
>>>>
>>>> An example:
>>>>
>>>> source file:
>>>> # ls -la vm-279-disk-1.img
>>>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>>>
>>>> target file after around 10 minutes:
>>>> # ls -la vm-279-disk-1.img.tmp
>>>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
>>>
>>> Two quick thoughts:
>>> 1. How many extents does this img have?
>>
>> filefrag says:
>> 1011508 extents found
> 
> To cp --reflink this, the filesystem needs to create a million new
> EXTENT_DATA objects for the new file, which point all parts of the new
> file to all the little same parts of the old file, and probably also
> needs to update a million EXTENT_DATA objects in the btrees to add a
> second backreference back to the new file.

Thanks for this explanation.

> 
>>> 2. Is this an XY problem? Why not just put the img in a subvolume and
>>> snapshot that?
>>
>> Sorry what's XY problem?
> 
> It means that I suspected that your actual goal is not spending time to
> work on optimizing how cp --reflink works, but that you just want to use
> the quickest way to have a clone of the file.
> 
> An XY problem is when someone has problem X, then thinks about solution
> Y to solve it, then runs into a problem/limitation/whatever when trying
> Y and asks help with that actual problem when doing Y while there might
> in the end be a better solution to get X done.

ah ;-) makes sense.

>> Implementing cp reflink was easier - as the original code was based on
>> XFS. But shouldn't be cp reflink / clone a file be nearly identical to a
>> snapshot? Just creating refs to the extents?
> 
> Snapshotting a subvolume only has to write a cowed copy of the top-level
> information of the subvolume filesystem tree, and leaves the extent tree
> alone. It doesn't have to do 2 million different things. \o/

Thanks for this explanation. Will look into switching to subvolumes.
I wasn't able to do this before, as I was always running into ENOSPC
issues, which were solved last week.

Greets,
Stefan


Re: speed up cp --reflink=always

2016-10-16 Thread Hans van Kranenburg
On 10/16/2016 08:54 PM, Stefan Priebe - Profihost AG wrote:
> On 16.10.2016 at 00:37, Hans van Kranenburg wrote:
>> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:
>>>
>>> cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes)
>>>
>>> An example:
>>>
>>> source file:
>>> # ls -la vm-279-disk-1.img
>>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>>
>>> target file after around 10 minutes:
>>> # ls -la vm-279-disk-1.img.tmp
>>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
>>
>> Two quick thoughts:
>> 1. How many extents does this img have?
> 
> filefrag says:
> 1011508 extents found

To cp --reflink this, the filesystem needs to create a million new
EXTENT_DATA objects for the new file, which point all parts of the new
file to all the little same parts of the old file, and probably also
needs to update a million EXTENT_DATA objects in the btrees to add a
second backreference back to the new file.

>> 2. Is this an XY problem? Why not just put the img in a subvolume and
>> snapshot that?
> 
> Sorry what's XY problem?

It means that I suspected that your actual goal is not spending time to
work on optimizing how cp --reflink works, but that you just want to use
the quickest way to have a clone of the file.

An XY problem is when someone has problem X, then thinks about solution
Y to solve it, then runs into a problem/limitation/whatever when trying
Y and asks help with that actual problem when doing Y while there might
in the end be a better solution to get X done.

> Implementing cp reflink was easier - as the original code was based on
> XFS. But shouldn't be cp reflink / clone a file be nearly identical to a
> snapshot? Just creating refs to the extents?

Snapshotting a subvolume only has to write a cowed copy of the top-level
information of the subvolume filesystem tree, and leaves the extent tree
alone. It doesn't have to do 2 million different things. \o/
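
As a rough back-of-the-envelope check, the figures quoted in this thread can be plugged into a few lines of Python. This is only a sketch: it assumes the clone progresses at a constant rate and that every extent costs one new item for the target file plus one backreference update, as described above. The variable names are made up for illustration.

```python
# Figures taken from the ls and filefrag output quoted in this thread.
source_bytes = 204010946560      # size of vm-279-disk-1.img
cloned_bytes = 65022328832       # size of the .tmp target after ~10 minutes
elapsed_minutes = 10             # "around 10 minutes", so only approximate
extents = 1011508                # "1011508 extents found"

# Assumption: the clone proceeds at a constant rate.
fraction_done = cloned_bytes / source_bytes
estimated_total_minutes = elapsed_minutes / fraction_done

# Assumption: one new item for the target file plus one backreference
# update per extent, i.e. two metadata operations per extent.
metadata_ops = 2 * extents
avg_extent_kib = source_bytes / extents / 1024

print(f"progress after 10 min : {fraction_done:.1%}")
print(f"estimated total time  : {estimated_total_minutes:.0f} min")
print(f"metadata operations   : {metadata_ops}")
print(f"average extent size   : {avg_extent_kib:.0f} KiB")
```

Under these assumptions the estimate comes out at roughly half an hour, which is in line with the 25-35 minutes reported, and the operation count is where the "2 million" comes from.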

-- 
Hans van Kranenburg


Re: speed up cp --reflink=always

2016-10-16 Thread Stefan Priebe - Profihost AG
On 16.10.2016 at 00:37, Hans van Kranenburg wrote:
> Hi,
> 
> On 10/15/2016 10:49 PM, Stefan Priebe - Profihost AG wrote:
>>
>> cp --reflink=always takes sometimes very long. (i.e. 25-35 minutes)
>>
>> An example:
>>
>> source file:
>> # ls -la vm-279-disk-1.img
>> -rw-r--r-- 1 root root 204010946560 Oct 14 12:15 vm-279-disk-1.img
>>
>> target file after around 10 minutes:
>> # ls -la vm-279-disk-1.img.tmp
>> -rw-r--r-- 1 root root 65022328832 Oct 15 22:13 vm-279-disk-1.img.tmp
> 
> Two quick thoughts:
> 1. How many extents does this img have?

filefrag says:
1011508 extents found

> 2. Is this an XY problem? Why not just put the img in a subvolume and
> snapshot that?

Sorry, what's an XY problem?

Implementing cp reflink was easier, as the original code was based on
XFS. But shouldn't cp reflink / cloning a file be nearly identical to a
snapshot? Just creating refs to the extents?

Greets,
Stefan


Re: Incremental send robustness question

2016-10-16 Thread Sean Greenslade
On October 14, 2016 12:43:03 AM EDT, Duncan <1i5t5.dun...@cox.net> wrote:
>I see the specific questions have been answered, and alternatives
>explored in one direction, but I've another alternative, in a different
>direction, to suggest.
>
>First a disclaimer.  I'm a btrfs user/sysadmin and regular on the list,
>but I'm not a dev, and my own use-case doesn't involve send/receive, so
>what I know regarding send/receive is from the list and manpages, not
>personal experience.  With that in mind...
>
>It's worth noting that send/receive are subvolume-specific -- a send 
>won't continue down into a subvolume.
>
>Also note that in addition to -p/parent, there's -c/clone-src.  The
>latter is more flexible than the super-strict parent option, at the
>expense of a fatter send-stream as additional metadata is sent that
>specifies which clone the instructions are relative to.
>
>It should be possible to use the combination of these two facts to split
>and recombine your send stream in a firewall-timeout-friendly manner, as
>long as no individual files are so big that sending an individual file
>exceeds the timeout.
>
>1) Start by taking a read-only snapshot of your intended source 
>subvolume, so you have an unchanging reference.
>
>2) Take multiple writable snapshots of it, and selectively delete subdirs
>(and files if necessary) from each writable snapshot, trimming each one
>to a size that should pass the firewall without interruption, so that the
>combination of all these smaller subvolumes contains the content of the
>single larger one.
>
>3) Take read-only snapshots of each of these smaller snapshots, suitable
>for sending.
>
>4) Do a non-incremental send of each of these smaller snapshots to the 
>remote.
>
>If it's practical to keep the subvolume divisions, you can simply split
>the working tree into subvolumes and send those individually instead of
>doing the snapshot splitting above, in which case you can then use
>-p/parent on each as you were trying to do on the original, and you can
>stop here.
>
>If you need/prefer the single subvolume, continue...
>
>5) Do an incremental send of the original full snapshot, using multiple
>-c  options to list each of the smaller snapshots.  Since all the
>data has already been transferred in the smaller snapshot sends, this
>send should be all metadata, no actual data.  It'll simply be combining
>the individual reference subvolumes into a single larger subvolume once
>again.
>
>6) Once you have the single larger subvolume on the receive side, you can
>delete the smaller snapshots as you now have a copy of the larger
>subvolume on each side to do further incremental sends of the working
>copy against.
>
>7) I believe the first incremental send of the full working copy against
>the original larger snapshot will still have to use -c, while incremental
>sends based on that first one will be able to use the stricter but
>slimmer send-stream -p, with each one then using the previous one as the
>parent.  However, I'm not sure on that.  It may be that you have to
>continue using the fatter send-stream -c each time.
>
>Again, I don't have send/receive experience of my own, so hopefully
>someone who does can reply either confirming that this should work and
>whether or not -p can be used after the initial setup, or explaining why
>the idea won't work, but at this point based on my own understanding, it
>seems like it should be perfectly workable to me. =:^)
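
For reference, steps 1-5 of the procedure quoted above could be written down as a small dry-run planner. This is only a sketch that prints commands rather than running them: the function name, paths, part names and the ssh receive command are all made up, and whether the final -c send really carries no data is exactly the untested part of the suggestion.

```python
import shlex


def plan_split_send(source, snapdir, parts, receive_cmd):
    """Return shell command lines for the split-and-recombine send.

    parts maps a (hypothetical) part name to the top-level entries that
    this part keeps; every other entry is deleted from that part.
    """
    all_entries = sorted({e for keep in parts.values() for e in keep})
    full_ro = f"{snapdir}/full-ro"

    # Step 1: read-only reference snapshot of the source subvolume.
    cmds = [["btrfs", "subvolume", "snapshot", "-r", source, full_ro]]

    part_ro = []
    for name, keep in parts.items():
        rw = f"{snapdir}/{name}-rw"
        ro = f"{snapdir}/{name}-ro"
        # Step 2: writable snapshot, trimmed down to the entries it keeps.
        cmds.append(["btrfs", "subvolume", "snapshot", full_ro, rw])
        for entry in all_entries:
            if entry not in keep:
                cmds.append(["rm", "-rf", "--", f"{rw}/{entry}"])
        # Step 3: read-only snapshot of the trimmed part, suitable for send.
        cmds.append(["btrfs", "subvolume", "snapshot", "-r", rw, ro])
        part_ro.append(ro)

    # Step 4: non-incremental send of each small part.
    sends = [["btrfs", "send", ro] for ro in part_ro]

    # Step 5: send the full snapshot, naming every part as a clone source.
    final = ["btrfs", "send"]
    for ro in part_ro:
        final += ["-c", ro]
    final.append(full_ro)
    sends.append(final)

    lines = [shlex.join(c) for c in cmds]
    lines += [f"{shlex.join(s)} | {receive_cmd}" for s in sends]
    return lines


plan = plan_split_send(
    "/mnt/data",                              # hypothetical source subvolume
    "/mnt/snap",                              # hypothetical snapshot directory
    {"part1": ["home"], "part2": ["srv", "var"]},
    "ssh backup btrfs receive /mnt/backup",   # hypothetical receive side
)
for line in plan:
    print(line)
```

Printing the plan first makes it easy to check that every top-level entry ends up in exactly one part before anything is deleted from a writable snapshot.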

I was considering doing something like this, but the simple solution of "just 
bring the disk over" won out. If that hadn't been possible, I might have done 
something like that, and I'm still mulling over possible solutions to similar / 
related problems.

I think the biggest solution would be support for partial / resuming receives. 
That'll probably go on my ever-growing list of things to possibly look into 
when I happen upon some free time. It sounds quite complicated, though...

--Sean




Re: [PATCH 2/3] misc: fix fallocate commands that need the unshare switch

2016-10-16 Thread Christoph Hellwig
On Sat, Oct 15, 2016 at 10:03:03AM -0700, Christoph Hellwig wrote:
> The poster child would be btrfs, and I would have added some output
> here if btrfs support in xfstests wasn't completely broken at this
> point.
> 
> Well, added Ccs and some output anyway in this case..

Turns out the btrfs failure was my stupidity, sorry.

I can reproduce the issue I was going to originally show (which was
actually pointed out by Eric for a different fallocate flag check
I wanted to add), here is the diff of the output files when running
generic/156 on btrfs with your patch:

--- tests/generic/156.out   2016-03-29 13:59:30.411720622 +
+++ /root/xfstests/results//generic/156.out.bad 2016-10-16 06:15:27.118776421 +
@@ -2,8 +2,13 @@
 Create the original file blocks
 Create the reflink copies
 funshare part of a file
+fallocate: Operation not supported
 funshare some of the copies
+fallocate: Operation not supported
+fallocate: Operation not supported
 funshare the rest of the files
+fallocate: Operation not supported
+fallocate: Operation not supported
 Rewrite the original file
 free blocks after reflinking is in range
 free blocks after nocow'ing some copies is in range

So what we really need is an enhanced falloc tester that checks that
the tested subcommand is actually implemented on the given file system.
(And we already need something like that for -k on NFS)
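
To illustrate the kind of check meant here, a probe along the following lines could try the fallocate mode on a scratch file and treat "Operation not supported" as "not implemented". This is a sketch in Python rather than an actual xfstests helper; the function name is made up, the flag values are copied from <linux/falloc.h>, and it assumes Linux with a 64-bit off_t.

```python
import ctypes
import errno
import os
import tempfile

# Mode flags as defined in <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_UNSHARE_RANGE = 0x40


def falloc_mode_supported(directory, mode, length=65536):
    """Probe whether fallocate(2) accepts `mode` on the fs holding `directory`."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: Linux with a 64-bit off_t (e.g. x86_64 glibc).
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        # Give the scratch file some real blocks to operate on.
        scratch.write(b"\0" * length)
        scratch.flush()
        if libc.fallocate(scratch.fileno(), mode, 0, length) == 0:
            return True
        err = ctypes.get_errno()
    # The mode is not implemented on this file system (or kernel).
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False
    # Anything else (ENOSPC, EPERM, ...) is a real error, not a missing feature.
    raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    target = tempfile.gettempdir()
    for name, mode in [("keep-size", FALLOC_FL_KEEP_SIZE),
                       ("unshare", FALLOC_FL_UNSHARE_RANGE)]:
        print(name, falloc_mode_supported(target, mode))
```

A test could call such a probe up front and skip itself when the mode is missing, instead of producing the "Operation not supported" lines in the golden output diff.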