Re: [PATCH 3/5] btrfs: raid56: Use correct stolen pages to calculate P/Q

2017-03-19 Thread Qu Wenruo



At 03/18/2017 06:19 AM, Liu Bo wrote:

On Fri, Mar 17, 2017 at 02:31:08PM +0800, Qu Wenruo wrote:



At 03/16/2017 01:36 PM, Liu Bo wrote:

On Fri, Feb 03, 2017 at 04:20:21PM +0800, Qu Wenruo wrote:

In the following situation, scrub will calculate wrong parity to
overwrite correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev 2      |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev 2      |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The calltrace of such corruption is as follows:

scrub_bio_end_io_worker() gets called for each extent read out
|- scrub_block_complete()
   |- Data extent csum mismatch
   |- scrub_handle_errored_block
  |- scrub_recheck_block()
 |- scrub_submit_raid56_bio_wait()
|- raid56_parity_recover()

Now we have an rbio with data stripe 1 correctly recovered.
Let's call it "good_rbio".

scrub_parity_check_and_repair()
|- raid56_parity_submit_scrub_rbio()
   |- lock_stripe_add()
   |  |- steal_rbio()
   | |- Recovered data are stolen from "good_rbio" and stored into
   |rbio->stripe_pages[]
   |Now rbio->bio_pages[] are bad data read from disk.


At this point we should already know whether rbio->bio_pages are
corrupted, because rbio->bio_pages are indexed from the list
sparity->pages, and we only do scrub_parity_put after finishing the
endio of reading all pages linked at sparity->pages.

Since the previous checksumming failure triggered a recovery and we
got the correct data in that rbio, instead of adding this corrupted
page to the new rbio it'd be fine to skip it and use the
rbio->stripe_pages, which can be stolen from the previous good rbio.


Unfortunately, the rbio->bio_pages[] are pages stored in rbio->bio_list.

Since we don't have a good function to remove a page from a bio, nor
one to remove a bio from a bio_list (although the latter is quite easy
to implement), it would be quite a mess to do this in steal_rbio() or
elsewhere in btrfs.
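(Just to illustrate the bio_list part: a rough sketch using the
existing bio_list helpers could look like the code below. This is only
an illustration of the idea, not existing kernel or btrfs code, and
bio_list_remove() is a made-up name.)

/* Illustration only: drop one bio from a bio_list by rebuilding it.
 * Needs <linux/bio.h> for struct bio_list and the helpers. */
static void bio_list_remove(struct bio_list *bl, struct bio *victim)
{
	struct bio_list tmp;
	struct bio *bio;

	bio_list_init(&tmp);
	while ((bio = bio_list_pop(bl)) != NULL) {
		if (bio != victim)
			bio_list_add(&tmp, bio);
	}
	*bl = tmp;
}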



That's not what I was suggesting.

This bug is about the case where we have corrupted data on one of the
stripes, not corrupted parity.  So raid56 is a little different from
raid1: for raid1 we're good to go after writing the good copy back
over the bad one, while for raid56 we also need to check/repair the
parity after writing back the good copy.  However, the pages listed at
sparity->pages hold the original bad copy we read from disk, not the
good copy obtained by re-checking all mirrors, so the parity ends up
being corrupted because these bad pages are used in the xor.
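(To make the xor part concrete, here is a tiny userspace sketch, plain
C rather than btrfs code, of how recomputing parity from the stale bad
page clobbers a parity block that was actually correct:)

#include <stdio.h>
#include <stdint.h>

#define SZ 4096

int main(void)
{
	uint8_t d1[SZ], d2[SZ], parity[SZ];
	uint8_t d1_bad[SZ];	/* stale copy of d1 as read from disk */
	int i;

	/* Good data: both data stripes hold 0xcd, so parity is 0x00. */
	for (i = 0; i < SZ; i++) {
		d1[i] = 0xcd;
		d2[i] = 0xcd;
		parity[i] = d1[i] ^ d2[i];	/* 0x00, correct */
		d1_bad[i] = 0x00;		/* corrupted on-disk copy of d1 */
	}

	/*
	 * Scrub repairs d1 on disk using the good parity, but then
	 * recomputes parity from the stale bad page instead of the
	 * repaired one.
	 */
	for (i = 0; i < SZ; i++)
		parity[i] = d1_bad[i] ^ d2[i];	/* becomes 0xcd, wrong */

	printf("parity[0] after rewrite: 0x%02x (should be 0x00)\n",
	       parity[0]);
	return 0;
}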

If the raid56 read rebuild was successful after scrub_handle_errored_block(),
then we have good data across the whole stripe, cached in
rbio->stripe_pages for any later read/write to steal from.  So the bug
can be fixed by not adding the bad copy's pages to the new rbio.

There is another case where we have broken parity.  If all data copies
are good, we can go ahead and directly check and repair whatever is
wrong on the parity, without adding the pages listed at sparity->pages
to the new rbio.  If there is also some bad data, then we're unable to
fix up either the data or the parity, so we eventually get an
unrecoverable error; meanwhile both ebitmap and dbitmap get updated
accordingly, so that horizontal stripe won't be updated on disk.

In a word, by the time we go to check and repair parity, the raid56
stripe on disk should already have correct data, so there is no need
to put these pages into the rbio.

So my suggestion is to remove the code that adds the pages at
sparity->pages to the rbio used for the BTRFS_RBIO_PARITY_SCRUB
operation.


Now I understand.

Indeed, at finish_parity_scrub() time, rbio->stripe_pages[] holds the
recovered, correct pages.

It's the added pages that cause the bug.

I'll use this method in the next update.

Thanks,
Qu


Thanks,

-liubo


That's why I use the new bad_on_disk_a/b rather than directly stealing
pages from rbio->bio_pages[].

Thanks,
Qu


Thanks,

-liubo


   |- async_scrub_parity()
  |- scrub_parity_work() (delayed_call to scrub_parity_work)

scrub_parity_work()
|- raid56_parity_scrub_stripe()
   |- validate_rbio_for_parity_scrub()
  |- finish_parity_scrub()
 |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
So good parity is overwritten with the *BAD* one

Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-03-19 Thread Josef Bacik
This sounds reasonable to me; I'll look at it more closely when I'm on
the ground and can look at the code to be sure.  Thanks,

Josef

Sent from my iPhone

> On Mar 19, 2017, at 1:19 PM, Alex Lyakas  wrote:
> 
> We have a commit_root_sem, which is a read-write semaphore that protects the 
> commit roots.
> But it is also used to protect the list of caching block groups.
> 
> As a result, while doing "slow" caching, the following issue is seen:
> 
> Some of the caching threads are scanning the extent tree with commit_root_sem
> acquired in shared mode, with stack like:
> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
> [] read_tree_block+0x40/0x70 [btrfs]
> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
> [] caching_thread+0x1b9/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
> 
> IO requests that want to allocate space are waiting in cache_block_group()
> to acquire the commit_root_sem in exclusive mode. But they only want to add
> the caching control structure to the list of caching block-groups:
> [] schedule+0x29/0x70
> [] rwsem_down_write_failed+0x145/0x320
> [] call_rwsem_down_write_failed+0x13/0x20
> [] cache_block_group+0x25b/0x450 [btrfs]
> [] find_free_extent+0xd16/0xdb0 [btrfs]
> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
> 
> Other caching threads want to continue their scanning, and for that they
> are waiting to acquire commit_root_sem in shared mode. But since there are
> IO threads that want the exclusive lock, the caching threads are unable
> to continue the scanning, because (I presume) rw_semaphore guarantees some 
> fairness:
> [] schedule+0x29/0x70
> [] rwsem_down_read_failed+0xc5/0x120
> [] call_rwsem_down_read_failed+0x14/0x30
> [] caching_thread+0x1a1/0x820 [btrfs]
> [] normal_work_helper+0xc6/0x340 [btrfs]
> [] btrfs_cache_helper+0x12/0x20 [btrfs]
> [] process_one_work+0x146/0x410
> 
> This causes slowness of the IO, especially when there are many block groups
> that need to be scanned for free space. In some cases it takes minutes
> until a single IO thread is able to allocate free space.
> 
> I don't see a deadlock here, because the caching threads that were able to 
> acquire
> the commit_root_sem will call rwsem_is_contended() and should give up the 
> semaphore,
> so that IO threads are able to acquire it in exclusive mode.
> 
> However, introducing a separate mutex that protects only the list of caching
> block groups makes things move forward much faster.
> 
> This patch is based on kernel 3.18.
> Unfortunately, I am not able to submit a patch based on one of the latest 
> kernels, because
> here btrfs is part of the larger system, and upgrading the kernel is a 
> significant effort.
> Hence marking the patch as RFC.
> Hopefully, this patch still has some value to the community.
> 
> Signed-off-by: Alex Lyakas 
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 42d11e7..74feacb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>struct list_head trans_list;
>struct list_head dead_roots;
>struct list_head caching_block_groups;
> +/* protects the above list */
> +struct mutex caching_block_groups_mutex;
> 
>spinlock_t delayed_iput_lock;
>struct list_head delayed_iputs;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5177954..130ec58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
> INIT_LIST_HEAD(&fs_info->delayed_iputs);
> INIT_LIST_HEAD(&fs_info->delalloc_roots);
> INIT_LIST_HEAD(&fs_info->caching_block_groups);
> +mutex_init(&fs_info->caching_block_groups_mutex);
> spin_lock_init(&fs_info->delalloc_root_lock);
> spin_lock_init(&fs_info->trans_lock);
> spin_lock_init(&fs_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a067065..906fb08 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
> return 0;
> }
>
> -down_write(&fs_info->commit_root_sem);
> +mutex_lock(&fs_info->caching_block_groups_mutex);
> atomic_inc(&caching_ctl->count);
> list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> -up_write(&fs_info->commit_root_sem);
> +mutex_unlock(&fs_info->caching_block_groups_mutex);
> 
>btrfs_get_block_group(cache);
> 
> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>
> down_write(&fs_info->commit_root_sem);
>
> +mutex_lock(&fs_info->caching_block_groups_mutex);
> list_for_each_entry_safe(caching_ctl, next,
>  &fs_info->caching_block_groups, list) {
> cache = caching_ctl->block_group;
> @@ -5704,6 +5705,7 @@ void 

[PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-03-19 Thread Alex Lyakas
We have a commit_root_sem, which is a read-write semaphore that protects the 
commit roots.

But it is also used to protect the list of caching block groups.

As a result, while doing "slow" caching, the following issue is seen:

Some of the caching threads are scanning the extent tree with
commit_root_sem acquired in shared mode, with a stack like:
[] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
[] read_tree_block+0x40/0x70 [btrfs]
[] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
[] btrfs_search_slot+0x3c6/0xb10 [btrfs]
[] caching_thread+0x1b9/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]

IO requests that want to allocate space are waiting in cache_block_group()
to acquire the commit_root_sem in exclusive mode. But they only want to add
the caching control structure to the list of caching block-groups:
[] schedule+0x29/0x70
[] rwsem_down_write_failed+0x145/0x320
[] call_rwsem_down_write_failed+0x13/0x20
[] cache_block_group+0x25b/0x450 [btrfs]
[] find_free_extent+0xd16/0xdb0 [btrfs]
[] btrfs_reserve_extent+0xaf/0x160 [btrfs]

Other caching threads want to continue their scanning, and for that they
are waiting to acquire commit_root_sem in shared mode. But since there are
IO threads that want the exclusive lock, the caching threads are unable
to continue the scanning, because (I presume) rw_semaphore guarantees some 
fairness:

[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xc5/0x120
[] call_rwsem_down_read_failed+0x14/0x30
[] caching_thread+0x1a1/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]
[] process_one_work+0x146/0x410

This slows down the IO, especially when there are many block groups
that need to be scanned for free space. In some cases it takes minutes
until a single IO thread is able to allocate free space.

I don't see a deadlock here, because the caching threads that were able
to acquire the commit_root_sem will call rwsem_is_contended() and
should give up the semaphore,
so that IO threads are able to acquire it in exclusive mode.
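(For reference, the give-up path inside caching_thread() in the 3.18
code I am looking at is roughly the following; I am quoting it from
memory, so the details may differ between kernel versions:)

	/* inside caching_thread(), while scanning the extent tree */
	if (need_resched() ||
	    rwsem_is_contended(&fs_info->commit_root_sem)) {
		caching_ctl->progress = last;	/* remember where we stopped */
		btrfs_release_path(path);
		up_read(&fs_info->commit_root_sem);
		mutex_unlock(&caching_ctl->mutex);
		cond_resched();			/* give the writer a chance */
		goto again;			/* 'again' re-takes both locks */
	}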

However, introducing a separate mutex that protects only the list of caching
block groups makes things move forward much faster.

This patch is based on kernel 3.18.
Unfortunately, I am not able to submit a patch based on one of the
latest kernels, because here btrfs is part of a larger system, and
upgrading the kernel is a significant effort.

Hence marking the patch as RFC.
Hopefully, this patch still has some value to the community.

Signed-off-by: Alex Lyakas 

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42d11e7..74feacb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
struct list_head trans_list;
struct list_head dead_roots;
struct list_head caching_block_groups;
+/* protects the above list */
+struct mutex caching_block_groups_mutex;

spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5177954..130ec58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
INIT_LIST_HEAD(&fs_info->delayed_iputs);
INIT_LIST_HEAD(&fs_info->delalloc_roots);
INIT_LIST_HEAD(&fs_info->caching_block_groups);
+mutex_init(&fs_info->caching_block_groups_mutex);
spin_lock_init(&fs_info->delalloc_root_lock);
spin_lock_init(&fs_info->trans_lock);
spin_lock_init(&fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a067065..906fb08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,

return 0;
}

-down_write(&fs_info->commit_root_sem);
+mutex_lock(&fs_info->caching_block_groups_mutex);
atomic_inc(&caching_ctl->count);
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-up_write(&fs_info->commit_root_sem);
+mutex_unlock(&fs_info->caching_block_groups_mutex);

btrfs_get_block_group(cache);

@@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

down_write(&fs_info->commit_root_sem);

+mutex_lock(&fs_info->caching_block_groups_mutex);
list_for_each_entry_safe(caching_ctl, next,
 &fs_info->caching_block_groups, list) {
cache = caching_ctl->block_group;
@@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

cache->last_byte_to_unpin = caching_ctl->progress;
}
}
+mutex_unlock(&fs_info->caching_block_groups_mutex);

if (fs_info->pinned_extents == &fs_info->freed_extents[0])
fs_info->pinned_extents = &fs_info->freed_extents[1];
@@ -8849,14 +8851,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)

struct btrfs_caching_control *caching_ctl;
struct 

[no subject]

2017-03-19 Thread Ilan Schwarts
Hi,
sorry if this is a newbie question; I am a newbie.

In my kernel driver, I get the device id by converting a struct inode
to a struct btrfs_inode, using this code:
struct btrfs_inode *btrfsInode;
btrfsInode = BTRFS_I(inode);
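(For reference, BTRFS_I() itself is just a container_of() wrapper
defined in fs/btrfs/btrfs_inode.h, roughly:)

static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
{
	return container_of(inode, struct btrfs_inode, vfs_inode);
}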

I usually download the kernel-headers rpm package, but this is not
enough; it fails to find the btrfs header files.

I had to download them separately, not via an rpm package, and declare:
#include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/ctree.h"
#include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/btrfs_inode.h"

This is not good. Why are ctree.h and btrfs_inode.h not in the kernel
headers? Is there another package I need to download in order to get
them, in addition to kernel-headers?


I see they are not provided in the kernel-headers package, e.g.:
https://rpmfind.net/linux/RPM/fedora/23/x86_64/k/kernel-headers-4.2.3-300.fc23.x86_64.html

Thanks!