[PATCH 0/5] Cleanups for later btrfs_alloc_chunk() rework

2018-01-02 Thread Qu Wenruo
Just small cleanups for the incoming btrfs_alloc_chunk() rework, which is
designed to allow btrfs_alloc_chunk() to allocate a new meta chunk even
when the current meta chunks are already full.

The cleanups are quite small; most of them just remove unnecessary
parameters or make a function static.

Qu Wenruo (5):
  btrfs-progs: Use bool parameter to determine if we're allocating data
extent
  btrfs-progs: volumes: Make find_free_dev_extent_start static
  btrfs-progs: volumes: Remove unnecessary trans parameter
  btrfs-progs: volumes: Remove unnecessary parameters when allocating
device extent
  btrfs-progs: Remove unnecessary parameter for btrfs_add_block_group

 cmds-check.c   |  4 ++--
 convert/main.c |  4 +---
 ctree.h|  7 +++
 extent-tree.c  | 31 ---
 mkfs/main.c| 14 --
 volumes.c  | 35 ++-
 6 files changed, 40 insertions(+), 55 deletions(-)

-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs-progs: Use bool parameter to determine if we're allocating data extent

2018-01-02 Thread Qu Wenruo
btrfs_reserve_extent() uses int @data to determine if we're allocating a
data extent, and then reuses the same parameter to pass the profile
(data/meta/sys).

That's a little confusing, so this patch follows the kernel and uses bool
@is_data instead, with a dedicated u64 @profile inside
btrfs_reserve_extent().

Signed-off-by: Qu Wenruo 
---
 ctree.h   |  2 +-
 extent-tree.c | 17 +
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/ctree.h b/ctree.h
index b92df1c1a518..61d9f9fc9dff 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2435,7 +2435,7 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 num_bytes, u64 empty_size,
 u64 hint_byte, u64 search_end,
-struct btrfs_key *ins, int data);
+struct btrfs_key *ins, bool is_data);
 int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans,
 struct btrfs_root *root);
 void btrfs_pin_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 
num_bytes);
diff --git a/extent-tree.c b/extent-tree.c
index 055582c36da6..58b64a21d226 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2652,36 +2652,37 @@ int btrfs_reserve_extent(struct btrfs_trans_handle 
*trans,
 struct btrfs_root *root,
 u64 num_bytes, u64 empty_size,
 u64 hint_byte, u64 search_end,
-struct btrfs_key *ins, int data)
+struct btrfs_key *ins, bool is_data)
 {
int ret;
u64 search_start = 0;
u64 alloc_profile;
+   u64 profile;
struct btrfs_fs_info *info = root->fs_info;
 
-   if (data) {
+   if (is_data) {
alloc_profile = info->avail_data_alloc_bits &
info->data_alloc_profile;
-   data = BTRFS_BLOCK_GROUP_DATA | alloc_profile;
+   profile = BTRFS_BLOCK_GROUP_DATA | alloc_profile;
} else if (info->system_allocs == 1 || root == info->chunk_root) {
alloc_profile = info->avail_system_alloc_bits &
info->system_alloc_profile;
-   data = BTRFS_BLOCK_GROUP_SYSTEM | alloc_profile;
+   profile = BTRFS_BLOCK_GROUP_SYSTEM | alloc_profile;
} else {
alloc_profile = info->avail_metadata_alloc_bits &
info->metadata_alloc_profile;
-   data = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
+   profile = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
}
 
if (root->ref_cows) {
-   if (!(data & BTRFS_BLOCK_GROUP_METADATA)) {
+   if (!(profile & BTRFS_BLOCK_GROUP_METADATA)) {
ret = do_chunk_alloc(trans, info,
 num_bytes,
 BTRFS_BLOCK_GROUP_METADATA);
BUG_ON(ret);
}
ret = do_chunk_alloc(trans, info,
-num_bytes + SZ_2M, data);
+num_bytes + SZ_2M, profile);
BUG_ON(ret);
}
 
@@ -2689,7 +2690,7 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
ret = find_free_extent(trans, root, num_bytes, empty_size,
   search_start, search_end, hint_byte, ins,
   trans->alloc_exclude_start,
-  trans->alloc_exclude_nr, data);
+  trans->alloc_exclude_nr, profile);
if (ret < 0)
return ret;
clear_extent_dirty(&info->free_space_cache,
-- 
2.15.1



[PATCH 2/5] btrfs-progs: volumes: Make find_free_dev_extent_start static

2018-01-02 Thread Qu Wenruo
The function is not used outside of volumes.c, so make it static.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index ce3a540578fd..0e50e1d5833e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -457,7 +457,7 @@ out:
return ret;
 }
 
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
+static int find_free_dev_extent(struct btrfs_trans_handle *trans,
 struct btrfs_device *device, u64 num_bytes,
 u64 *start)
 {
-- 
2.15.1



[PATCH 5/5] btrfs-progs: Remove unnecessary parameter for btrfs_add_block_group

2018-01-02 Thread Qu Wenruo
The @chunk_objectid parameter of btrfs_make_block_group() is always
fixed to BTRFS_FIRST_CHUNK_TREE_OBJECTID, so there is no need to pass it
explicitly.

Signed-off-by: Qu Wenruo 
---
 cmds-check.c   |  4 ++--
 convert/main.c |  4 +---
 ctree.h|  5 ++---
 extent-tree.c  | 14 +++---
 mkfs/main.c| 14 --
 5 files changed, 16 insertions(+), 25 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index a93ac2c88a38..635c1c44ff6f 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -13036,7 +13036,7 @@ static int repair_chunk_item(struct btrfs_trans_handle 
*trans,
 
if (err & REFERENCER_MISSING) {
ret = btrfs_make_block_group(trans, chunk_root->fs_info, 0,
-type, chunk_key.objectid, chunk_key.offset, length);
+type, chunk_key.offset, length);
if (ret) {
error("fail to add block group item[%llu %llu]",
  chunk_key.offset, length);
@@ -13637,7 +13637,7 @@ static int reset_block_groups(struct btrfs_fs_info 
*fs_info)
chunk = btrfs_item_ptr(leaf, path.slots[0], struct btrfs_chunk);
btrfs_add_block_group(fs_info, 0,
  btrfs_chunk_type(leaf, chunk),
- key.objectid, key.offset,
+ key.offset,
  btrfs_chunk_length(leaf, chunk));
set_extent_dirty(&fs_info->free_space_cache, key.offset,
 key.offset + btrfs_chunk_length(leaf, chunk));
diff --git a/convert/main.c b/convert/main.c
index af2855316fee..4a510a786394 100644
--- a/convert/main.c
+++ b/convert/main.c
@@ -916,9 +916,7 @@ static int make_convert_data_block_groups(struct 
btrfs_trans_handle *trans,
if (ret < 0)
break;
ret = btrfs_make_block_group(trans, fs_info, 0,
-   BTRFS_BLOCK_GROUP_DATA,
-   BTRFS_FIRST_CHUNK_TREE_OBJECTID,
-   cur, len);
+   BTRFS_BLOCK_GROUP_DATA, cur, len);
if (ret < 0)
break;
cur += len;
diff --git a/ctree.h b/ctree.h
index 61d9f9fc9dff..5a04c0f1c912 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2497,11 +2497,10 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info);
 int btrfs_read_block_groups(struct btrfs_root *root);
 struct btrfs_block_group_cache *
 btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
- u64 chunk_objectid, u64 chunk_offset, u64 size);
+ u64 chunk_offset, u64 size);
 int btrfs_make_block_group(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info, u64 bytes_used,
-  u64 type, u64 chunk_objectid, u64 chunk_offset,
-  u64 size);
+  u64 type, u64 chunk_offset, u64 size);
 int btrfs_make_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info);
 int btrfs_update_block_group(struct btrfs_trans_handle *trans,
diff --git a/extent-tree.c b/extent-tree.c
index 58b64a21d226..db24da3a3a8c 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1916,7 +1916,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
BUG_ON(ret);
 
ret = btrfs_make_block_group(trans, fs_info, 0, space_info->flags,
-BTRFS_FIRST_CHUNK_TREE_OBJECTID, start, num_bytes);
+start, num_bytes);
BUG_ON(ret);
return 0;
 }
@@ -3312,7 +3312,7 @@ error:
 
 struct btrfs_block_group_cache *
 btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
- u64 chunk_objectid, u64 chunk_offset, u64 size)
+ u64 chunk_offset, u64 size)
 {
int ret;
int bit = 0;
@@ -3328,7 +3328,8 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 
bytes_used, u64 type,
 
cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
btrfs_set_block_group_used(&cache->item, bytes_used);
-   btrfs_set_block_group_chunk_objectid(&cache->item, chunk_objectid);
+   btrfs_set_block_group_chunk_objectid(&cache->item,
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
cache->flags = type;
btrfs_set_block_group_flags(&cache->item, type);
 
@@ -3353,15 +3354,14 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, 
u64 bytes_used, u64 type,
 
 int btrfs_make_block_group(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info, u64 bytes_used,
-  u64 type, u64 chunk_objectid, u64 chunk_offset,
-  u64 

[PATCH 3/5] btrfs-progs: volumes: Remove unnecessary trans parameter

2018-01-02 Thread Qu Wenruo
Remove the @trans parameter from find_free_dev_extent_start() and its
callers.

The function only does a read-only tree search and makes no use of the
transaction.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/volumes.c b/volumes.c
index 0e50e1d5833e..f6c6447fe925 100644
--- a/volumes.c
+++ b/volumes.c
@@ -316,9 +316,9 @@ int btrfs_scan_one_device(int fd, const char *path,
  * But if we don't find suitable free space, it is used to store the size of
  * the max free space.
  */
-static int find_free_dev_extent_start(struct btrfs_trans_handle *trans,
-  struct btrfs_device *device, u64 num_bytes,
-  u64 search_start, u64 *start, u64 *len)
+static int find_free_dev_extent_start(struct btrfs_device *device,
+ u64 num_bytes, u64 search_start,
+ u64 *start, u64 *len)
 {
struct btrfs_key key;
struct btrfs_root *root = device->dev_root;
@@ -457,13 +457,11 @@ out:
return ret;
 }
 
-static int find_free_dev_extent(struct btrfs_trans_handle *trans,
-struct btrfs_device *device, u64 num_bytes,
-u64 *start)
+static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
+   u64 *start)
 {
/* FIXME use last free of some kind */
-   return find_free_dev_extent_start(trans, device,
- num_bytes, 0, start, NULL);
+   return find_free_dev_extent_start(device, num_bytes, 0, start, NULL);
 }
 
 static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
@@ -488,8 +486,7 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle 
*trans,
 * is responsible to make sure it's free.
 */
if (!convert) {
-   ret = find_free_dev_extent(trans, device, num_bytes,
-  start);
+   ret = find_free_dev_extent(device, num_bytes, start);
if (ret)
goto err;
}
-- 
2.15.1



[PATCH 4/5] btrfs-progs: volumes: Remove unnecessary parameters when allocating device extent

2018-01-02 Thread Qu Wenruo
The @chunk_tree and @chunk_objectid of a device extent are fixed to
BTRFS_CHUNK_TREE_OBJECTID and BTRFS_FIRST_CHUNK_TREE_OBJECTID
respectively.

There is no need to pass them as parameters explicitly.

Signed-off-by: Qu Wenruo 
---
 volumes.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/volumes.c b/volumes.c
index f6c6447fe925..fa3c6de023f9 100644
--- a/volumes.c
+++ b/volumes.c
@@ -466,9 +466,8 @@ static int find_free_dev_extent(struct btrfs_device 
*device, u64 num_bytes,
 
 static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
  struct btrfs_device *device,
- u64 chunk_tree, u64 chunk_objectid,
- u64 chunk_offset,
- u64 num_bytes, u64 *start, int convert)
+ u64 chunk_offset, u64 num_bytes, u64 *start,
+ int convert)
 {
int ret;
struct btrfs_path *path;
@@ -501,8 +500,9 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle 
*trans,
leaf = path->nodes[0];
extent = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_dev_extent);
-   btrfs_set_dev_extent_chunk_tree(leaf, extent, chunk_tree);
-   btrfs_set_dev_extent_chunk_objectid(leaf, extent, chunk_objectid);
+   btrfs_set_dev_extent_chunk_tree(leaf, extent, 
BTRFS_CHUNK_TREE_OBJECTID);
+   btrfs_set_dev_extent_chunk_objectid(leaf, extent,
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
 
write_extent_buffer(leaf, root->fs_info->chunk_tree_uuid,
@@ -1039,9 +1039,7 @@ again:
(index == num_stripes - 1))
list_move_tail(&device->dev_list, dev_list);
 
-   ret = btrfs_alloc_dev_extent(trans, device,
-info->chunk_root->root_key.objectid,
-BTRFS_FIRST_CHUNK_TREE_OBJECTID, key.offset,
+   ret = btrfs_alloc_dev_extent(trans, device, key.offset,
 calc_size, &dev_offset, 0);
if (ret < 0)
goto out_chunk_map;
@@ -1176,9 +1174,7 @@ int btrfs_alloc_data_chunk(struct btrfs_trans_handle 
*trans,
while (index < num_stripes) {
struct btrfs_stripe *stripe;
 
-   ret = btrfs_alloc_dev_extent(trans, device,
-info->chunk_root->root_key.objectid,
-BTRFS_FIRST_CHUNK_TREE_OBJECTID, key.offset,
+   ret = btrfs_alloc_dev_extent(trans, device, key.offset,
 calc_size, &dev_offset, convert);
BUG_ON(ret);
 
-- 
2.15.1



Re: Kernel crash during btrfs scrub

2018-01-02 Thread Qu Wenruo


On 2018-01-03 09:12, Dmitry Katsubo wrote:
> Dear btrfs team,
> 
> I send a kernel crash report which I have observed recently during btrfs 
> scrub.
> It looks like scrub itself has completed without errors.
> 
> # btrfs scrub status /home
> scrub status for 83a3cb60-3334-4d11-9fdf-70b8e8703167
> scrub started at Mon Jan  1 06:52:01 2018 and finished after 00:30:47
> total bytes scrubbed: 87.55GiB with 0 errors
> 
> # btrfs scrub status /var/log
> scrub status for 5b45ac8e-fd8c-4759-854a-94e45069959d
> scrub started at Mon Jan  1 06:52:01 2018 and finished after 00:15:45
> total bytes scrubbed: 23.39GiB with 0 errors
> 
> Linux kernel v4.14.2-1
> btrfs-progs v4.7.3-1
> 

It's not a kernel crash (unless I missed something), just a kernel
warning.

The warning is caused by the fact that your fs (most likely created by an
old mkfs.btrfs) has a device with an unaligned size.

You could either resize the device down a little (e.g. by 4KiB); a newer
kernel (the one you're using should be new enough) can handle it well.

Or you could update your btrfs-progs (I assume you're using Arch, which
already ships btrfs-progs v4.14) and use "btrfs rescue fix-device-size"
to fix device size related problems offline.
(Not only the warning, but also potential superblock size mismatches.)

Thanks,
Qu





Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-02 Thread Dave Chinner
On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote:
> On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> > On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > > seem to me to be the right abstraction to use for this - in most
> > > > > fileystems it's a bunch of objects with a complex dependency tree
> > > > > and unknown write ordering, not an inode full of data that can be
> > > > > sequentially written.
> > > > > 
> > > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > > > > sync based operations. That way different filesystems can ignore the
> > > > > parts they don't need simply by not implementing those operations,
> > > > > and the writeback code doesn't need to try to cater for all
> > > > > operations through the one op. The writeback code should be cleaner,
> > > > > the filesystem code should be cleaner, and we can tailor the work
> > > > > guidelines for each operation separately so there's less mismatch
> > > > > between what writeback is asking and how filesystems track dirty
> > > > > metadata...
> > > > 
> > > > I agree that writeback for memory cleaning and writeback for data 
> > > > integrity
> > > > are two very different things especially for metadata. In fact for data
> > > > integrity writeback we already have ->sync_fs operation so there the
> > > > functionality gets duplicated. What we could do is that in
> > > > writeback_sb_inodes() we'd call ->write_metadata only when
> > > > work->for_kupdate or work->for_background is set. That way 
> > > > ->write_metadata
> > > > would be called only for memory cleaning purposes.
> > > 
> > > That makes sense, but I still think we need a better indication of
> > > how much writeback we need to do than just "writeback this chunk of
> > > pages". That "writeback a chunk" interface is necessary to share
> > > writeback bandwidth across numerous data inodes so that we don't
> > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > metadata writeback on a superblock - we don't need to share that
> > > bandwidth around hundreds or thousands of inodes. What we actually
> > > need to know is how much writeback we need to do as a total of all
> > > the dirty metadata on the superblock.
> > > 
> > Sure, that's not ideal for btrfs and maybe ext4, but we can write a
> > simple generic helper that converts "flush X percent of dirty
> > metadata" to a page/byte chunk as the current code does. Doing it
> > > this way allows filesystems to completely internalise the accounting
> > > that needs to be done, rather than trying to hack around a
> > > writeback accounting interface with large impedance mismatches to
> > > how the filesystem accounts for dirty metadata and/or tracks
> > > writeback progress.
> > 
> > Let me think loud on how we could tie this into how memory cleaning
> > writeback currently works - the one with for_background == 1 which is
> > generally used to get amount of dirty pages in the system under control.
> > We have a queue of inodes to write, we iterate over this queue and ask each
> > inode to write some amount (e.g. 64 M - exact amount depends on measured

It's a maximum of 1024 pages per inode.

> > writeback bandwidth etc.). Some amount from that inode gets written and we
> > continue with the next inode in the queue (put this one at the end of the
> > queue if it still has dirty pages). We do this until:
> > 
> > a) the number of dirty pages in the system is below background dirty limit
> >and the number dirty pages for this device is below background dirty
> >limit for this device.
> > b) run out of dirty inodes on this device
> > c) someone queues different type of writeback
> > 
> > And we need to somehow incorporate metadata writeback into this loop. I see
> > two questions here:
> > 
> > 1) When / how often should we ask for metadata writeback?
> > 2) How much to ask to write in one go?
> > 
> > The second question is especially tricky in the presence of completely
> > async metadata flushing in XFS - we can ask to write say half of dirty
> > metadata but then we have no idea whether the next observation of dirty
> > metadata counters is with that part of metadata already under writeback /
> > cleaned or whether xfsaild didn't even start working and pushing more has
> > no sense.

Well, like with ext4, we've also got to consider that a bunch of the
recently dirtied metadata (e.g. from delalloc, EOF updates on IO
completion, etc) is still pinned in memory because the
journal has not been flushed/checkpointed. Hence we should not be
attempting to write back metadata we've dirtied as a result of
writing data in the background writeback loop.

That greatly simplifies what we need to co

Kernel crash during btrfs scrub

2018-01-02 Thread Dmitry Katsubo
Dear btrfs team,

I send a kernel crash report which I have observed recently during btrfs scrub.
It looks like scrub itself has completed without errors.

# btrfs scrub status /home
scrub status for 83a3cb60-3334-4d11-9fdf-70b8e8703167
scrub started at Mon Jan  1 06:52:01 2018 and finished after 00:30:47
total bytes scrubbed: 87.55GiB with 0 errors

# btrfs scrub status /var/log
scrub status for 5b45ac8e-fd8c-4759-854a-94e45069959d
scrub started at Mon Jan  1 06:52:01 2018 and finished after 00:15:45
total bytes scrubbed: 23.39GiB with 0 errors

Linux kernel v4.14.2-1
btrfs-progs v4.7.3-1

-- 
With best regards,
Dmitry
[Mon Jan  1 07:04:44 2018] [ cut here ]
[Mon Jan  1 07:04:44 2018] WARNING: CPU: 0 PID: 13583 at 
/build/linux-SCFPgu/linux-4.14.2/fs/btrfs/ctree.h:1564 
btrfs_update_device+0x220/0x230 [btrfs]
[Mon Jan  1 07:04:44 2018] Modules linked in: md4 nls_utf8 cifs ccm 
dns_resolver fscache option usb_wwan usbserial isofs loop ses enclosure 
scsi_transport_sas hid_generic usbhid hid ipt_REJECT nf_reject_ipv4 
xt_multiport iptable_filter xt_REDIRECT nf_nat_redirect xt_physdev br_netfilter 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
libcrc32c xt_tcpudp iptable_mangle bridge stp llc arc4 iTCO_wdt 
iTCO_vendor_support ppdev evdev snd_hda_codec_realtek snd_hda_codec_generic 
ath5k ath mac80211 cfg80211 snd_hda_intel i915 rfkill coretemp snd_hda_codec 
snd_hda_core snd_hwdep serio_raw snd_pcm_oss pcspkr snd_mixer_oss lpc_ich 
snd_pcm mfd_core snd_timer snd video soundcore drm_kms_helper sg drm shpchp 
i2c_algo_bit rng_core parport_pc parport button acpi_cpufreq binfmt_misc 
w83627hf hwmon_vid
[Mon Jan  1 07:04:44 2018]  ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 
fscrypto ecb crypto_simd cryptd aes_i586 btrfs crc32c_generic xor 
zstd_decompress zstd_compress xxhash raid6_pq uas usb_storage sr_mod sd_mod 
cdrom ata_generic ata_piix i2c_i801 libata firewire_ohci scsi_mod firewire_core 
crc_itu_t ehci_pci uhci_hcd ehci_hcd e1000e ptp pps_core usbcore usb_common
[Mon Jan  1 07:04:44 2018] CPU: 0 PID: 13583 Comm: btrfs Tainted: GW
   4.14.0-1-686-pae #1 Debian 4.14.2-1
[Mon Jan  1 07:04:44 2018] Hardware name: AOpen i945GMx-IF/i945GMx-IF, BIOS 
i945GMx-IF R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
[Mon Jan  1 07:04:44 2018] task: eba6a000 task.stack: ca216000
[Mon Jan  1 07:04:44 2018] EIP: btrfs_update_device+0x220/0x230 [btrfs]
[Mon Jan  1 07:04:44 2018] EFLAGS: 00210206 CPU: 0
[Mon Jan  1 07:04:44 2018] EAX:  EBX: f6908400 ECX: 000c EDX: 
0200
[Mon Jan  1 07:04:44 2018] ESI: f69e2280 EDI:  EBP: ca217bd8 ESP: 
ca217b98
[Mon Jan  1 07:04:44 2018]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[Mon Jan  1 07:04:44 2018] CR0: 80050033 CR2: b795da00 CR3: 1a2e2460 CR4: 
06f0
[Mon Jan  1 07:04:44 2018] Call Trace:
[Mon Jan  1 07:04:44 2018]  btrfs_finish_chunk_alloc+0xf3/0x480 [btrfs]
[Mon Jan  1 07:04:44 2018]  ? btrfs_free_path.part.26+0x1c/0x20 [btrfs]
[Mon Jan  1 07:04:44 2018]  ? btrfs_insert_item+0x66/0xd0 [btrfs]
[Mon Jan  1 07:04:44 2018]  btrfs_create_pending_block_groups+0x139/0x250 
[btrfs]
[Mon Jan  1 07:04:44 2018]  __btrfs_end_transaction+0x78/0x2e0 [btrfs]
[Mon Jan  1 07:04:44 2018]  btrfs_end_transaction+0xf/0x20 [btrfs]
[Mon Jan  1 07:04:44 2018]  btrfs_inc_block_group_ro+0xea/0x190 [btrfs]
[Mon Jan  1 07:04:44 2018]  scrub_enumerate_chunks+0x215/0x660 [btrfs]
[Mon Jan  1 07:04:44 2018]  btrfs_scrub_dev+0x1e8/0x4e0 [btrfs]
[Mon Jan  1 07:04:44 2018]  btrfs_ioctl+0x1480/0x28b0 [btrfs]
[Mon Jan  1 07:04:44 2018]  ? kmem_cache_alloc+0x30c/0x540
[Mon Jan  1 07:04:44 2018]  ? btrfs_ioctl_get_supported_features+0x30/0x30 
[btrfs]
[Mon Jan  1 07:04:44 2018]  do_vfs_ioctl+0x90/0x650
[Mon Jan  1 07:04:44 2018]  ? do_vfs_ioctl+0x90/0x650
[Mon Jan  1 07:04:44 2018]  ? create_task_io_context+0x78/0xe0
[Mon Jan  1 07:04:44 2018]  ? get_task_io_context+0x3d/0x80
[Mon Jan  1 07:04:44 2018]  SyS_ioctl+0x58/0x70
[Mon Jan  1 07:04:44 2018]  do_fast_syscall_32+0x71/0x1a0
[Mon Jan  1 07:04:44 2018]  entry_SYSENTER_32+0x4e/0x7c
[Mon Jan  1 07:04:44 2018] EIP: 0xb7f81cf9
[Mon Jan  1 07:04:44 2018] EFLAGS: 0246 CPU: 0
[Mon Jan  1 07:04:44 2018] EAX: ffda EBX: 0003 ECX: c400941b EDX: 
092e21b8
[Mon Jan  1 07:04:44 2018] ESI: 092e21b8 EDI: 003d0f00 EBP: b7cff1e8 ESP: 
b7cff188
[Mon Jan  1 07:04:44 2018]  DS: 007b ES: 007b FS:  GS: 0033 SS: 007b
[Mon Jan  1 07:04:44 2018] Code: e9 81 fe ff ff 8d b6 00 00 00 00 bf f4 ff ff 
ff e9 78 fe ff ff 8d b6 00 00 00 00 f3 90 eb a8 8d 74 26 00 f3 90 e9 2b ff ff 
ff 90 <0f> ff e9 7a ff ff ff e8 14 ad 48 d0 8d 74 26 00 3e 8d 74 26 00
[Mon Jan  1 07:04:44 2018] ---[ end trace 6b4736d811ae42e1 ]---
[Mon Jan  1 07:05:00 2018] [ cut here ]
[Mon Jan  1 07:05:00 2018] WARNING: CPU: 1 PID: 443 at 
/build/linux-SCFPgu/linux-4.14.2/fs/btrfs/ctree.h:1564 
btrfs_update_device+0x220/0x230 [btrfs]
[Mon Jan  1 07:05:00 2018] Mo

Re: Btrfs reserve metadata problem

2018-01-02 Thread Qu Wenruo


On 2018-01-03 09:55, robbieko wrote:
> Hi Qu,
> 
> Do you have a patch to reduce meta rsv ?

Not exactly, only for qgroup.

[PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode
qgroup rsv

But that could provide enough clue to implement a smaller meta rsv.

My current safe guess would be "(BTRFS_MAX_TREE_LEVEL + 2) * nodesize"
for each outstanding extent, stepped with the number of outstanding
extents.
(Not increasing/decreasing the meta rsv every time the number of
outstanding extents changes, but only when it crosses a certain
threshold.)

Thanks,
Qu

> 
> 
> Hi Peter Grandi,
> 
> 1. All files have been initialized with dd, so no metadata needs to change.
> 2. My test volume has a total size of 190G, 128G used, 60G available, but
> only 60 MB of dirty pages.
>     According to the meta rsv rules, 1GB of free space allows only up to
> 1MB of dirty pages.
> 3. The same problem occurs with cow enabled.
> 
> It is a serious performance issue.
> 
> Thanks.
> robbieko
> 
> 
p...@btrfs.list.sabi.co.uk wrote on 2018-01-02 21:08:
>>> When testing Btrfs with fio 4k random write,
>>
>> That's an exceptionally narrowly defined workload. Also it is
>> narrower than that, because it must be without 'fsync' after
>> each write, or else there would be no accumulation of dirty
>> blocks in memory at all.
>>
>>> I found that volume with smaller free space available has
>>> lower performance.
>>
>> That's an inappropriate use of "performance"... The speed may be
>> lower, the performance is another matter.
>>
>>> It seems that the smaller the free space of volume is, the
>>> smaller amount of dirty page filesystem could have.
>>
>> Is this a problem? Consider: all filesystems do less well when
>> there is less free space (smaller chance of finding spatially
>> compact allocations), it is usually good to minimize the
>> amount of dirty pages anyhow (even if there are reasons to
>> delay writing them out).
>>
>>> [ ... ] btrfs will reserve metadata for every write.  The
>>> amount to reserve is calculated as follows: nodesize *
>>> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
>>> The maximum amount of metadata reservation depends on size of
>>> metadata currently in used and free space within volume(free
>>> chunk size /16) When metadata reaches the limit, btrfs will
>>> need to flush the data to release the reservation.
>>
>> I don't understand here: under POSIX semantics filesystems are
>> not really allowed to avoid flushing *metadata* to disk for most
>> operations, that is metadata operations have an implied 'fsync'.
>> Your case of the "4k random write" with "cow disabled" the only
>> metadata that should get updated is the last-modified timestamp,
>> unless the user/application has been so amazingly stupid to not
>> preallocate the file, and then they deserve whatever they get.
>>
>>> 1. Is there any logic behind the value (free chunk size /16)
>>
>>>   /*
>>>    * If we have dup, raid1 or raid10 then only half of the free
>>>    * space is actually useable. For raid56, the space info used
>>>    * doesn't include the parity drive, so we don't have to
>>>    * change the math
>>>    */
>>>   if (profile & (BTRFS_BLOCK_GROUP_DUP |
>>>   BTRFS_BLOCK_GROUP_RAID1 |
>>>   BTRFS_BLOCK_GROUP_RAID10))
>>>    avail >>= 1;
>>
>> As written there is a plausible logic, but it is quite crude.
>>
>>>   /*
>>>    * If we aren't flushing all things, let us overcommit up to
>>>    * 1/2th of the space. If we can flush, don't let us overcommit
>>>    * too much, let it overcommit up to 1/8 of the space.
>>>    */
>>>   if (flush == BTRFS_RESERVE_FLUSH_ALL)
>>>    avail >>= 3;
>>>   else
>>>    avail >>= 1;
>>
>> Presumably overcommitting brings some benefits on other workloads.
>>
>> In particular other parts of Btrfs don't behave awesomely well
>> when free space runs out.
>>
>>> 2. Is there any way to improve this problem?
>>
>> Again, is it a problem? More interestingly, if it is a problem
>> is a solution available that does not impact other workloads?
>> It is simply impossible to optimize a filesystem perfectly for
>> every workload.
>>
>> I'll try to summarize your report as I understand it:
>>
>> * If:
>>   - The workload is "4k random write" (without 'fsync').
>>   - On a "cow disabled" file.
>>   - The file is not preallocated.
>>   - There is not much free space available.
>> * Then allocation overcommitting results in a higher frequency
>>   of unrequested metadata flushes, and those metadata flushes
>>   slow down a specific benchmark.




Re: Btrfs reserve metadata problem

2018-01-02 Thread robbieko

Hi Qu,

Do you have a patch to reduce meta rsv ?


Hi Peter Grandi,

1. All files have been initialized with dd, so no metadata needs to
change.
2. My test volume has a total size of 190G, 128G used, 60G available, but
only 60 MB of dirty pages.
According to the meta rsv rules, 1GB of free space allows only up to 1MB
of dirty pages.

3. The same problem occurs with cow enabled.

It is a serious performance issue.

Thanks.
robbieko


p...@btrfs.list.sabi.co.uk wrote on 2018-01-02 21:08:

When testing Btrfs with fio 4k random write,


That's an exceptionally narrowly defined workload. Also it is
narrower than that, because it must be without 'fsync' after
each write, or else there would be no accumulation of dirty
blocks in memory at all.


I found that volume with smaller free space available has
lower performance.


That's an inappropriate use of "performance"... The speed may be
lower, the performance is another matter.


It seems that the smaller the free space of volume is, the
smaller amount of dirty page filesystem could have.


Is this a problem? Consider: all filesystems do less well when
there is less free space (smaller chance of finding spatially
compact allocations), it is usually good to minimize the
amount of dirty pages anyhow (even if there are reasons to
delay writing them out).


[ ... ] btrfs will reserve metadata for every write.  The
amount to reserve is calculated as follows: nodesize *
BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
The maximum amount of metadata reservation depends on size of
metadata currently in used and free space within volume(free
chunk size /16) When metadata reaches the limit, btrfs will
need to flush the data to release the reservation.


I don't understand here: under POSIX semantics filesystems are
not really allowed to avoid flushing *metadata* to disk for most
operations, that is metadata operations have an implied 'fsync'.
In your case of the "4k random write" with "cow disabled", the only
metadata that should get updated is the last-modified timestamp,
unless the user/application has been so amazingly stupid as to not
preallocate the file, and then they deserve whatever they get.


1. Is there any logic behind the value (free chunk size / 16)?



  /*
   * If we have dup, raid1 or raid10 then only half of the free
   * space is actually useable. For raid56, the space info used
   * doesn't include the parity drive, so we don't have to
   * change the math
   */
  if (profile & (BTRFS_BLOCK_GROUP_DUP |
  BTRFS_BLOCK_GROUP_RAID1 |
  BTRFS_BLOCK_GROUP_RAID10))
   avail >>= 1;


As written there is a plausible logic, but it is quite crude.


  /*
   * If we aren't flushing all things, let us overcommit up to
   * 1/2th of the space. If we can flush, don't let us overcommit
   * too much, let it overcommit up to 1/8 of the space.
   */
  if (flush == BTRFS_RESERVE_FLUSH_ALL)
   avail >>= 3;
  else
   avail >>= 1;


Presumably overcommitting brings some benefits on other workloads.

In particular other parts of Btrfs don't behave awesomely well
when free space runs out.


2. Is there any way to improve this problem?


Again, is it a problem? More interestingly, if it is a problem,
is a solution available that does not impact other workloads?
It is simply impossible to optimize a filesystem perfectly for
every workload.

I'll try to summarize your report as I understand it:

* If:
  - The workload is "4k random write" (without 'fsync').
  - On a "cow disabled" file.
  - The file is not preallocated.
  - There is not much free space available.
* Then allocation overcommitting results in a higher frequency
  of unrequested metadata flushes, and those metadata flushes
  slow down a specific benchmark.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH v5 01/78] xfs: Rename xa_ elements to ail_

2018-01-02 Thread Darrick J. Wong
On Fri, Dec 15, 2017 at 02:03:33PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> This is a simple rename, except that xa_ail becomes ail_head.
> 
> Signed-off-by: Matthew Wilcox 

That was an eyeful,
Reviewed-by: Darrick J. Wong 

> ---
>  fs/xfs/xfs_buf_item.c|  10 ++--
>  fs/xfs/xfs_dquot.c   |   4 +-
>  fs/xfs/xfs_dquot_item.c  |  11 ++--
>  fs/xfs/xfs_inode_item.c  |  22 +++
>  fs/xfs/xfs_log.c |   6 +-
>  fs/xfs/xfs_log_recover.c |  80 -
>  fs/xfs/xfs_trans.c   |  18 +++---
>  fs/xfs/xfs_trans_ail.c   | 152 
> +++
>  fs/xfs/xfs_trans_buf.c   |   4 +-
>  fs/xfs/xfs_trans_priv.h  |  42 ++---
>  10 files changed, 175 insertions(+), 174 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index e0a0af0946f2..6c5035544a93 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -459,7 +459,7 @@ xfs_buf_item_unpin(
>   bp->b_fspriv = NULL;
>   bp->b_iodone = NULL;
>   } else {
> - spin_lock(&ailp->xa_lock);
> + spin_lock(&ailp->ail_lock);
>   xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
>   xfs_buf_item_relse(bp);
>   ASSERT(bp->b_fspriv == NULL);
> @@ -1056,13 +1056,13 @@ xfs_buf_do_callbacks_fail(
>   struct xfs_log_item *lip = bp->b_fspriv;
>   struct xfs_ail  *ailp = lip->li_ailp;
>  
> - spin_lock(&ailp->xa_lock);
> + spin_lock(&ailp->ail_lock);
>   for (; lip; lip = next) {
>   next = lip->li_bio_list;
>   if (lip->li_ops->iop_error)
>   lip->li_ops->iop_error(lip, bp);
>   }
> - spin_unlock(&ailp->xa_lock);
> + spin_unlock(&ailp->ail_lock);
>  }
>  
>  static bool
> @@ -1215,7 +1215,7 @@ xfs_buf_iodone(
>*
>* Either way, AIL is useless if we're forcing a shutdown.
>*/
> - spin_lock(&ailp->xa_lock);
> + spin_lock(&ailp->ail_lock);
>   xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
>   xfs_buf_item_free(BUF_ITEM(lip));
>  }
> @@ -1236,7 +1236,7 @@ xfs_buf_resubmit_failed_buffers(
>   /*
>* Clear XFS_LI_FAILED flag from all items before resubmit
>*
> -  * XFS_LI_FAILED set/clear is protected by xa_lock, caller  this
> +  * XFS_LI_FAILED set/clear is protected by ail_lock, caller  this
>* function already have it acquired
>*/
>   for (; lip; lip = next) {
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index f248708c10ff..e2a466df5dd1 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -974,7 +974,7 @@ xfs_qm_dqflush_done(
>(lip->li_flags & XFS_LI_FAILED))) {
>  
>   /* xfs_trans_ail_delete() drops the AIL lock. */
> - spin_lock(&ailp->xa_lock);
> + spin_lock(&ailp->ail_lock);
>   if (lip->li_lsn == qip->qli_flush_lsn) {
>   xfs_trans_ail_delete(ailp, lip, 
> SHUTDOWN_CORRUPT_INCORE);
>   } else {
> @@ -984,7 +984,7 @@ xfs_qm_dqflush_done(
>*/
>   if (lip->li_flags & XFS_LI_FAILED)
>   xfs_clear_li_failed(lip);
> - spin_unlock(&ailp->xa_lock);
> + spin_unlock(&ailp->ail_lock);
>   }
>   }
>  
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 664dea105e76..62637a226601 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -160,8 +160,9 @@ xfs_dquot_item_error(
>  STATIC uint
>  xfs_qm_dquot_logitem_push(
>   struct xfs_log_item *lip,
> - struct list_head*buffer_list) __releases(&lip->li_ailp->xa_lock)
> -   __acquires(&lip->li_ailp->xa_lock)
> + struct list_head*buffer_list)
> + __releases(&lip->li_ailp->ail_lock)
> + __acquires(&lip->li_ailp->ail_lock)
>  {
>   struct xfs_dquot*dqp = DQUOT_ITEM(lip)->qli_dquot;
>   struct xfs_buf  *bp = lip->li_buf;
> @@ -208,7 +209,7 @@ xfs_qm_dquot_logitem_push(
>   goto out_unlock;
>   }
>  
> - spin_unlock(&lip->li_ailp->xa_lock);
> + spin_unlock(&lip->li_ailp->ail_lock);
>  
>   error = xfs_qm_dqflush(dqp, &bp);
>   if (error) {
> @@ -220,7 +221,7 @@ xfs_qm_dquot_logitem_push(
>   xfs_buf_relse(bp);
>   }
>  
> - spin_lock(&lip->li_ailp->xa_lock);
> + spin_lock(&lip->li_ailp->ail_lock);
>  out_unlock:
>   xfs_dqunlock(dqp);
>   return rval;
> @@ -403,7 +404,7 @@ xfs_qm_qoffend_logitem_committed(
>* Delete the qoff-start logitem from the AIL.
>* xfs_trans_ail_delete() drops the AIL lock.
>*/
> - spin_lock(&ailp->xa_lock);
> + spin_lock(&ailp->ail_lock);
>   xfs_trans_ail_dele

[josef-btrfs:kill-btree-inode 30/32] fs/btrfs/disk-io.c:1059:13: sparse: incorrect type in assignment (different base types)

2018-01-02 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git kill-btree-inode
head:   225f09a5848138ede157eff4aa27bfa70d354fcb
commit: fb6cad3454e3172599129cff50d5234a7abc551c [30/32] Btrfs: kill the btree_inode
reproduce:
# apt-get install sparse
git checkout fb6cad3454e3172599129cff50d5234a7abc551c
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)


vim +1059 fs/btrfs/disk-io.c

  1047  
  1048  static blk_status_t __btree_submit_bio_done(void *private_data, struct 
bio *bio,
  1049  int mirror_num, unsigned 
long bio_flags,
  1050  u64 bio_offset)
  1051  {
  1052  struct btrfs_eb_info *eb_info = private_data;
  1053  int ret;
  1054  
  1055  /*
  1056   * when we're called for a write, we're already in the async
  1057   * submission context.  Just jump into btrfs_map_bio
  1058   */
> 1059  ret = btrfs_map_bio(eb_info->fs_info, bio, mirror_num, 1);
  1060  if (ret) {
> 1061  bio->bi_status = ret;
  1062  bio_endio(bio);
  1063  }
> 1064  return ret;
  1065  }
  1066  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root

2018-01-02 Thread Matthew Wilcox
On Tue, Jan 02, 2018 at 10:01:55AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 26, 2017 at 07:58:15PM -0800, Matthew Wilcox wrote:
> > spin_lock_irqsave(&mapping->pages, flags);
> > __delete_from_page_cache(page, NULL);
> > spin_unlock_irqrestore(&mapping->pages, flags);
> >
> > More details here: https://9p.io/sys/doc/compiler.html
> 
> I read the link, and I understand (from section 3.3) that replacing
> foo.bar.baz.goo with foo.goo is less typing, but otoh the first time I
> read your example above I thought "we're passing (an array of pages |
> something that doesn't have the word 'lock' in the name) to
> spin_lock_irqsave? wtf?"

I can see that being a bit jarring initially.  If you think about what
object-oriented languages were offering in the nineties, this is basically
C++ multiple-inheritance / Java interfaces.  So when I read the above
example, I think "lock the mapping pages, delete from page cache, unlock
the mapping pages", and I don't have a wtf moment.  It's just simpler to
read than "lock the mapping pages lock", and less redundant.

> I suppose it does force me to go dig into whatever mapping->pages is to
> figure out that there's an unnamed spinlock_t and that the compiler can
> insert the appropriate pointer arithmetic, but now my brain trips over
> 'pages' being at the end of the selector for parameter 1 which slows
> down my review reading...
> 
> OTOH I guess it /did/ motivate me to click the link, so well played,
> sir. :)

Now if only I can trick you into giving your ACK on patch 1,
"xfs: Rename xa_ elements to ail_"


[PATCH 1/2 RESEND] Btrfs: make raid6 rebuild retry more

2018-01-02 Thread Liu Bo
There is a scenario that can end up with the rebuild process failing
to return good content: suppose that all disks can be read without
problems, but the content that was read out doesn't match its
checksum.  Currently raid6 btrfs retries at most twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, chances are that another non-parity stripe's content is also
corrupted, so the above retries are not able to return correct
content, and users will think of this as data loss.  More seriously,
if the loss happens on some important internal btree roots, the
filesystem could refuse to mount.

This extends btrfs to do more retries, with each retry failing only
one stripe.  Since raid6 can tolerate 2 disk failures, if there is at
most one more failure besides the one we're recovering from, this can
always work.

The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.

Signed-off-by: Liu Bo 
---
 fs/btrfs/raid56.c  | 18 ++
 fs/btrfs/volumes.c |  9 -
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b078429..c1ed7e8 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2187,11 +2187,21 @@ int raid56_parity_recover(struct btrfs_fs_info 
*fs_info, struct bio *bio,
}
 
/*
-* reconstruct from the q stripe if they are
-* asking for mirror 3
+* Loop retry:
+* for 'mirror == 2', reconstruct from all other stripes.
+* for 'mirror_num > 2', select a stripe to fail on every retry.
 */
-   if (mirror_num == 3)
-   rbio->failb = rbio->real_stripes - 2;
+   if (mirror_num > 2) {
+   /*
+* 'mirror == 3' is to fail the p stripe and
+* reconstruct from the q stripe.  'mirror > 3' is to
+* fail a data stripe and reconstruct from p+q stripe.
+*/
+   rbio->failb = rbio->real_stripes - (mirror_num - 1);
+   ASSERT(rbio->failb > 0);
+   if (rbio->failb <= rbio->faila)
+   rbio->failb--;
+   }
 
ret = lock_stripe_add(rbio);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49810b7..558c6c0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5104,7 +5104,14 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
logical, u64 len)
else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
ret = 2;
else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
-   ret = 3;
+   /*
+* There could be two corrupted data stripes, we need
+* to loop retry in order to rebuild the correct data.
+* 
+* Fail a stripe at a time on every retry except the
+* stripe under reconstruction.
+*/
+   ret = map->num_stripes;
else
ret = 1;
free_extent_map(em);
-- 
2.9.4



[PATCH 2/2] Btrfs: fix scrub to repair raid6 corruption

2018-01-02 Thread Liu Bo
The raid6 corruption scenario is this:
suppose that all disks can be read without problems, but the content
that was read out doesn't match its checksum.  Currently raid6 btrfs
retries at most twice,

- the 1st retry is to rebuild with all other stripes, it'll eventually
  be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
  that it will do raid6 style rebuild,

however, chances are that another non-parity stripe's content is also
corrupted, so the above retries are not able to return correct
content.

We've fixed normal reads to rebuild raid6 correctly with more retries
in the patch "Btrfs: make raid6 rebuild retry more"[1]; this fixes
scrub to do the exact same rebuild process.

[1]: https://patchwork.kernel.org/patch/10091755/

Signed-off-by: Liu Bo 
---
 fs/btrfs/scrub.c | 44 
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 951e881..1bc3ee4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -301,6 +301,11 @@ static void __scrub_blocked_if_needed(struct btrfs_fs_info 
*fs_info);
 static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info);
 static void scrub_put_ctx(struct scrub_ctx *sctx);
 
+static inline int scrub_is_page_on_raid56(struct scrub_page *page)
+{
+   return page->recover &&
+  (page->recover->bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK);
+}
 
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
 {
@@ -1323,15 +1328,34 @@ static int scrub_handle_errored_block(struct 
scrub_block *sblock_to_check)
 * could happen otherwise that a correct page would be
 * overwritten by a bad one).
 */
-   for (mirror_index = 0;
-mirror_index < BTRFS_MAX_MIRRORS &&
-sblocks_for_recheck[mirror_index].page_count > 0;
-mirror_index++) {
+   for (mirror_index = 0; ;mirror_index++) {
struct scrub_block *sblock_other;
 
if (mirror_index == failed_mirror_index)
continue;
-   sblock_other = sblocks_for_recheck + mirror_index;
+
+   /* raid56's mirror can be more than BTRFS_MAX_MIRRORS */
+   if (!scrub_is_page_on_raid56(sblock_bad->pagev[0])) {
+   if (mirror_index >= BTRFS_MAX_MIRRORS)
+   break;
+   if (!sblocks_for_recheck[mirror_index].page_count)
+   break;
+
+   sblock_other = sblocks_for_recheck + mirror_index;
+   } else {
+   struct scrub_recover *r = sblock_bad->pagev[0]->recover;
+   int max_allowed = r->bbio->num_stripes -
+   r->bbio->num_tgtdevs;
+
+   if (mirror_index >= max_allowed)
+   break;
+   if (!sblocks_for_recheck[1].page_count)
+   break;
+
+   ASSERT(failed_mirror_index == 0);
+   sblock_other = sblocks_for_recheck + 1;
+   sblock_other->pagev[0]->mirror_num = 1 + mirror_index;
+   }
 
/* build and submit the bios, check checksums */
scrub_recheck_block(fs_info, sblock_other, 0);
@@ -1671,26 +1695,22 @@ static void scrub_bio_wait_endio(struct bio *bio)
complete(bio->bi_private);
 }
 
-static inline int scrub_is_page_on_raid56(struct scrub_page *page)
-{
-   return page->recover &&
-  (page->recover->bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK);
-}
-
 static int scrub_submit_raid56_bio_wait(struct btrfs_fs_info *fs_info,
struct bio *bio,
struct scrub_page *page)
 {
DECLARE_COMPLETION_ONSTACK(done);
int ret;
+   int mirror_num;
 
bio->bi_iter.bi_sector = page->logical >> 9;
bio->bi_private = &done;
bio->bi_end_io = scrub_bio_wait_endio;
 
+   mirror_num = page->sblock->pagev[0]->mirror_num;
ret = raid56_parity_recover(fs_info, bio, page->recover->bbio,
page->recover->map_length,
-   page->mirror_num, 0);
+   mirror_num, 0);
if (ret)
return ret;
 
-- 
2.9.4



[PATCH] fstests: btrfs/158: reproduce a scrub bug on raid6 corruption

2018-01-02 Thread Liu Bo
This reproduces a scrub bug where scrub is unable to repair raid6
corruption as expected.

The kernel side fixes are
  Btrfs: make raid6 rebuild retry more
  Btrfs: fix scrub to repair raid6 corruption

Signed-off-by: Liu Bo 
---
 tests/btrfs/158 | 114 
 tests/btrfs/158.out |  10 +
 tests/btrfs/group   |   1 +
 3 files changed, 125 insertions(+)
 create mode 100755 tests/btrfs/158
 create mode 100644 tests/btrfs/158.out

diff --git a/tests/btrfs/158 b/tests/btrfs/158
new file mode 100755
index 000..43afc2d
--- /dev/null
+++ b/tests/btrfs/158
@@ -0,0 +1,114 @@
+#! /bin/bash
+# FS QA Test 158
+#
+# The test case checks if scrub is able to fix raid6 data corruption,
+# i.e. if there is data corruption on two disks in the same horizontal
+# stripe, e.g. due to bitrot.
+#
+# The kernel fixes are
+#  Btrfs: make raid6 rebuild retry more
+#  Btrfs: fix scrub to repair raid6 corruption
+#
+#---
+# Copyright (c) 2017 Oracle.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 4
+_require_btrfs_command inspect-internal dump-tree
+
+get_physical_stripe0()
+{
+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \
+   grep " DATA\|RAID6" -A 10 | \
+   $AWK_PROG '($1 ~ /stripe/ && $3 ~ /devid/ && $2 ~ /0/) { print $6 }'
+}
+
+get_physical_stripe1()
+{
+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \
+   grep " DATA\|RAID6" -A 10 | \
+   $AWK_PROG '($1 ~ /stripe/ && $3 ~ /devid/ && $2 ~ /1/) { print $6 }'
+}
+
+_scratch_dev_pool_get 4
+# step 1: create a raid6 btrfs and create a 4K file
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid6 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o nospace_cache makes sure data is written to the start position of the data
+# chunk
+_scratch_mount -o nospace_cache
+
+# [0,64K) is written to stripe 0 and [64K, 128K) is written to stripe 1
+$XFS_IO_PROG -f -d -c "pwrite -S 0xaa 0 128K" -c "fsync" \
+   "$SCRATCH_MNT/foobar" | _filter_xfs_io
+
+_scratch_unmount
+
+stripe_0=`get_physical_stripe0`
+stripe_1=`get_physical_stripe1`
+dev4=`echo $SCRATCH_DEV_POOL | awk '{print $4}'`
+dev3=`echo $SCRATCH_DEV_POOL | awk '{print $3}'`
+
+# step 2: corrupt the 1st and 2nd stripe (stripe 0 and 1)
+echo "step 2..simulate bitrot at offset $stripe_0 of device_4($dev4) and offset $stripe_1 of device_3($dev3)" >>$seqres.full
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xbb $stripe_0 64K" $dev4 | _filter_xfs_io
+$XFS_IO_PROG -f -d -c "pwrite -S 0xbb $stripe_1 64K" $dev3 | _filter_xfs_io
+
+# step 3: read foobar to repair the bitrot
+echo "step 3..repair the bitrot" >> $seqres.full
+_scratch_mount -o nospace_cache
+
+btrfs scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
+
+od -x $SCRATCH_MNT/foobar
+
+_scratch_dev_pool_put
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/158.out b/tests/btrfs/158.out
new file mode 100644
index 000..1f5ad3f
--- /dev/null
+++ b/tests/btrfs/158.out
@@ -0,0 +1,10 @@
+QA output created by 158
+wrote 131072/131072 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 9437184
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 9437184
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+000        
+*
+040
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f68abf4..0b3cf12 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -160,3 +160,4 @@
 155 auto quick send
 156 auto quick trim
 157 auto quick raid
+158 auto quick raid
-- 
2.5.0


Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic

2018-01-02 Thread Timofey Titovets
2018-01-02 21:31 GMT+03:00 Liu Bo :
> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
>> Currently the btrfs raid1/10 balancer balances requests to mirrors
>> based on pid % number of mirrors.
>>
>> Make the logic aware of:
>>  - whether one of the underlying devices is non-rotational
>>  - the queue length of the underlying devices
>>
>> By default, try the pid % num_mirrors guess, but:
>>  - if one of the mirrors is non-rotational, repick it as optimal
>>  - if an underlying mirror has a shorter queue length than the
>>    optimal one, repick that mirror
>>
>> To avoid round-robin request balancing,
>> let's round down the queue length:
>>  - by 8 for rotational devices
>>  - by 2 for all non-rotational devices
>>
>
> Sorry for making a late comment on v3.
>
> It's good to choose non-rotational if it could.
>
> But I'm not sure whether it's a good idea to guess queue depth here
> because filesystem is still at a high position of IO stack.  It'd
> probably get good results when running tests, but in practical mixed
> workloads, the underlying queue depth will be changing all the time.

The first version was intended only for the SSD and SSD + HDD cases.
In that version it was just an "attempt" to do load balancing on HDDs.
It can easily be dropped if we decide that's bad behaviour.

If I understood correctly which counters are used,
we check the count of I/O ops the device is currently processing
(i.e. after merging etc.),
not the queue of requests not yet sent (i.e. before merging etc.).

I.e. we guess based on low-level block I/O information.
As an example, that doesn't work on zram devices (AFAIK, as zram
doesn't have those counters).

So it doesn't matter at which level we check that.

> In fact, I think for rotational disks, more merging and less seeking
> make more sense, even in raid1/10 case.
>
> Thanks,
>
> -liubo

Changes in queue depth shouldn't create big problems there,
i.e. round_down should make all such changes "equal".

For HDDs, if we see a "big" (8..16?) queue depth,
with high probability that HDD is overloaded,
and if the other HDD has much less load
(maybe instead of round_down it would be better to use abs diff > 8),
we try to requeue requests to the other HDD.

That will not give a truly equal distribution, but in the case where
one disk has more load and the pid-based mechanism fails to balance,
we will just move part of the load to the other HDD,
until the load distribution changes.

Maybe for HDDs the threshold needs to be more aggressive, like 16
(i.e. AFAIK SATA drives have a hardware queue length of 31, so just
use half of that).

Thanks.

>> Changes:
>>   v1 -> v2:
>> - Use helper part_in_flight() from genhd.c
>>   to get queue lenght
>> - Move guess code to guess_optimal()
>> - Change balancer logic, try use pid % mirror by default
>>   Make balancing on spinning rust if one of underline devices
>>   are overloaded
>>   v2 -> v3:
>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>
>> Signed-off-by: Timofey Titovets 
>> ---
>>  block/genhd.c  |   1 +
>>  fs/btrfs/volumes.c | 115 
>> -
>>  2 files changed, 114 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 96a66f671720..a77426a7 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct 
>> hd_struct *part,
>>   atomic_read(&part->in_flight[1]);
>>   }
>>  }
>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>
>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>  {
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 49810b70afd3..a3b80ba31d4d 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -27,6 +27,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include "ctree.h"
>>  #include "extent_map.h"
>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
>> *fs_info, u64 logical, u64 len)
>>   return ret;
>>  }
>>
>> +/**
>> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
>> + *
>> + * @bdev: target bdev
>> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
>> + */
>> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
>> +{
>> + int sum;
>> + struct hd_struct *bd_part = bdev->bd_part;
>> + struct request_queue *rq = bdev_get_queue(bdev);
>> + uint32_t inflight[2] = {0, 0};
>> +
>> + part_in_flight(rq, bd_part, inflight);
>> +
>> + sum = max_t(uint32_t, inflight[0], inflight[1]);
>> +
>> + /*
>> +  * Try prevent switch for every sneeze
>> +  * By roundup output num by some value
>> +  */
>> + return ALIGN_DOWN(sum, round_down);
>> +}
>> +
>> +/**
>> + * guess_optimal - return guessed optimal mirror
>> + *
>> + * Optimal expected to be pid % num_stripes
>> + *
>> + * That's generaly ok for spread load
>> + * Add some balancer based on queue leght to device
>> + *
>> + * Basic ideas:
>> + *  - Sequential read generate low amount of request
>> + *

Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)

2018-01-02 Thread Liu Bo
On Wed, Dec 20, 2017 at 11:59:20PM +0300, Timofey Titovets wrote:
> How reproduce:
> touch test_file
> chattr +C test_file
> dd if=/dev/zero of=test_file bs=1M count=1
> btrfs fi def -vrczlib test_file
> filefrag -v test_file
> 
> test_file
> Filesystem type is: 9123683e
> File size of test_file is 1048576 (256 blocks of 4096 bytes)
> ext: logical_offset:physical_offset: length:   expected: flags:
>   0:0..  31:   72917050..  72917081: 32: encoded
>   1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
>   2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
>   3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
>   4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
>   5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
>   6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
>   7:  224.. 255:   72967084..  72967115: 32:   72944984:
> last,encoded,eof
> test_file: 8 extents found
> 
> I can't find right now where that error happens in the code,
> but it's reproducible on Linux 4.14.8
>

Please check the comments in the function need_force_cow(),

* Force cow if given extent needs to be defragged.

and using zlib makes it compress the data.

Thanks,

-liubo


Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic

2018-01-02 Thread Liu Bo
On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
> Currently the btrfs raid1/10 balancer balances requests to mirrors
> based on pid % number of mirrors.
> 
> Make the logic aware of:
>  - whether one of the underlying devices is non-rotational
>  - the queue length of the underlying devices
> 
> By default, try the pid % num_mirrors guess, but:
>  - if one of the mirrors is non-rotational, repick it as optimal
>  - if an underlying mirror has a shorter queue length than the
>    optimal one, repick that mirror
> 
> To avoid round-robin request balancing,
> let's round down the queue length:
>  - by 8 for rotational devices
>  - by 2 for all non-rotational devices
>

Sorry for making a late comment on v3.

It's good to choose non-rotational if it could.

But I'm not sure whether it's a good idea to guess queue depth here
because filesystem is still at a high position of IO stack.  It'd
probably get good results when running tests, but in practical mixed
workloads, the underlying queue depth will be changing all the time.

In fact, I think for rotational disks, more merging and less seeking
make more sense, even in raid1/10 case.

Thanks,

-liubo

> Changes:
>   v1 -> v2:
> - Use helper part_in_flight() from genhd.c
>   to get queue lenght
> - Move guess code to guess_optimal()
> - Change balancer logic, try use pid % mirror by default
>   Make balancing on spinning rust if one of underline devices
>   are overloaded
>   v2 -> v3:
> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
> 
> Signed-off-by: Timofey Titovets 
> ---
>  block/genhd.c  |   1 +
>  fs/btrfs/volumes.c | 115 
> -
>  2 files changed, 114 insertions(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 96a66f671720..a77426a7 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct 
> hd_struct *part,
>   atomic_read(&part->in_flight[1]);
>   }
>  }
> +EXPORT_SYMBOL_GPL(part_in_flight);
>  
>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>  {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 49810b70afd3..a3b80ba31d4d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "extent_map.h"
> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>   return ret;
>  }
>  
> +/**
> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
> + *
> + * @bdev: target bdev
> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
> + */
> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
> +{
> + int sum;
> + struct hd_struct *bd_part = bdev->bd_part;
> + struct request_queue *rq = bdev_get_queue(bdev);
> + uint32_t inflight[2] = {0, 0};
> +
> + part_in_flight(rq, bd_part, inflight);
> +
> + sum = max_t(uint32_t, inflight[0], inflight[1]);
> +
> + /*
> +  * Try prevent switch for every sneeze
> +  * By roundup output num by some value
> +  */
> + return ALIGN_DOWN(sum, round_down);
> +}
> +
> +/**
> + * guess_optimal - return guessed optimal mirror
> + *
> + * Optimal expected to be pid % num_stripes
> + *
> + * That's generaly ok for spread load
> + * Add some balancer based on queue leght to device
> + *
> + * Basic ideas:
> + *  - Sequential read generate low amount of request
> + *so if load of drives are equal, use pid % num_stripes balancing
> + *  - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal
> + *and repick if other dev have "significant" less queue lenght
> + *  - Repick optimal if queue leght of other mirror are less
> + */
> +static int guess_optimal(struct map_lookup *map, int num, int optimal)
> +{
> + int i;
> + int round_down = 8;
> + int qlen[num];
> + bool is_nonrot[num];
> + bool all_bdev_nonrot = true;
> + bool all_bdev_rotate = true;
> + struct block_device *bdev;
> +
> + if (num == 1)
> + return optimal;
> +
> + /* Check accessible bdevs */
> + for (i = 0; i < num; i++) {
> + /* Init for missing bdevs */
> + is_nonrot[i] = false;
> + qlen[i] = INT_MAX;
> + bdev = map->stripes[i].dev->bdev;
> + if (bdev) {
> + qlen[i] = 0;
> + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
> + if (is_nonrot[i])
> + all_bdev_rotate = false;
> + else
> + all_bdev_nonrot = false;
> + }
> + }
> +
> +	/*
> +	 * Don't bother with the computation
> +	 * if only one of the two bdevs is accessible
> +	 */
> + if (num == 2 && qlen[0] != qlen[1]) {
> + 
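As an aside, the round-down bucketing used by bdev_get_queue_len() above can be tried in isolation. Below is a userspace sketch: ALIGN_DOWN is reimplemented here for a standalone build, and bucket_queue_len is a hypothetical stand-in, not the kernel helper.

```c
/* Same arithmetic as the kernel's ALIGN_DOWN for power-of-two factors:
 * clear the low bits so nearby values fall into the same bucket. */
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

/* Bucket an in-flight request count the way bdev_get_queue_len() does,
 * so that small fluctuations don't flip the mirror choice. */
static int bucket_queue_len(int sum, int round_down)
{
	return ALIGN_DOWN(sum, round_down);
}
```

With round_down = 8 (the suggested HDD value), in-flight counts 0..7 all bucket to 0 and 8..15 to 8, so guess_optimal() only repicks a mirror when queue depths differ by a whole bucket.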

Re: [PATCH 09/10] Btrfs: add tracepoint for em's EEXIST case

2018-01-02 Thread Liu Bo
On Sat, Dec 23, 2017 at 09:39:00AM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.12.2017 21:07, Liu Bo wrote:
> > On Fri, Dec 22, 2017 at 10:56:31AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 22.12.2017 00:42, Liu Bo wrote:
> >>> This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
> >>> subtle bugs around merge_extent_mapping.
> >>
> >> In the next patch you are already making the function which takes all
> >> these values noinline, meaning you can attach a kprobe so you can
> >> interrogate the args via systemtap,perf probe or even bpf. So I'd rather
> >> not add this tracepoint since the general sentiment seems to be that
> >> tracepoints are ABI and so have to be maintained.
> >>
> > 
> > The following patch of noinline function is inside the else {} branch,
> > so kprobe would give us something back only when it runs to else{}.
> > 
> > Since subtle bugs may lie in this area, a simple tracepoint like this
> > can help us understand them more efficiently.
> 
> In that case if you make search_extent_mapping and put a kprobe/ret
> kprobe there you should be able to gain the same information, we should
> really careful when adding trace events...

Correct me if I'm wrong, but 'perf probe' can only return @retval itself;
it seems that we're not able to reference retval's members, like
retval->var or retval.val.

As @existing is returned by search_extent_mapping(), I'm fine without
this tracepoint if 'perf probe' can do that referencing for retval.

Thanks,

-liubo
> > 
> >>>
> >>> Signed-off-by: Liu Bo 
> >>> ---
> >>>  fs/btrfs/extent_map.c|  1 +
> >>>  include/trace/events/btrfs.h | 35 +++
> >>>  2 files changed, 36 insertions(+)
> >>>
> >>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> >>> index a8b7e24..40e4d30 100644
> >>> --- a/fs/btrfs/extent_map.c
> >>> +++ b/fs/btrfs/extent_map.c
> >>> @@ -539,6 +539,7 @@ int btrfs_add_extent_mapping(struct extent_map_tree 
> >>> *em_tree,
> >>>   ret = 0;
> >>>  
> >>>   existing = search_extent_mapping(em_tree, start, len);
> >>> + trace_btrfs_handle_em_exist(existing, em, start, len);
> >>>  
> >>>   /*
> >>>* existing will always be non-NULL, since there must be
> >>> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> >>> index 4342a32..b7ffcf7 100644
> >>> --- a/include/trace/events/btrfs.h
> >>> +++ b/include/trace/events/btrfs.h
> >>> @@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
> >>> __entry->refs, __entry->compress_type)
> >>>  );
> >>>  
> >>> +TRACE_EVENT(btrfs_handle_em_exist,
> >>> +
> >>> + TP_PROTO(const struct extent_map *existing, const struct extent_map 
> >>> *map, u64 start, u64 len),
> >>> +
> >>> + TP_ARGS(existing, map, start, len),
> >>> +
> >>> + TP_STRUCT__entry(
> >>> + __field(u64,  e_start   )
> >>> + __field(u64,  e_len )
> >>> + __field(u64,  map_start )
> >>> + __field(u64,  map_len   )
> >>> + __field(u64,  start )
> >>> + __field(u64,  len   )
> >>> + ),
> >>> +
> >>> + TP_fast_assign(
> >>> + __entry->e_start= existing->start;
> >>> + __entry->e_len  = existing->len;
> >>> + __entry->map_start  = map->start;
> >>> + __entry->map_len= map->len;
> >>> + __entry->start  = start;
> >>> + __entry->len= len;
> >>> + ),
> >>> +
> >>> + TP_printk("start=%llu len=%llu "
> >>> +   "existing(start=%llu len=%llu) "
> >>> +   "em(start=%llu len=%llu)",
> >>> +   (unsigned long long)__entry->start,
> >>> +   (unsigned long long)__entry->len,
> >>> +   (unsigned long long)__entry->e_start,
> >>> +   (unsigned long long)__entry->e_len,
> >>> +   (unsigned long long)__entry->map_start,
> >>> +   (unsigned long long)__entry->map_len)
> >>> +);
> >>> +
> >>>  /* file extent item */
> >>>  DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
> >>>  
> >>>
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

2018-01-02 Thread Jeff Layton
On Tue, 2018-01-02 at 17:50 +0100, Jan Kara wrote:
> On Fri 22-12-17 07:05:53, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > We only really need to update i_version if someone has queried for it
> > since we last incremented it. By doing that, we can avoid having to
> > update the inode if the times haven't changed.
> > 
> > If the times have changed, then we go ahead and forcibly increment the
> > counter, under the assumption that we'll be going to the storage
> > anyway, and the increment itself is relatively cheap.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c | 10 +++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 19e72f500f71..2fa920188759 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path 
> > *path, struct inode *inode,
> >  int generic_update_time(struct inode *inode, struct timespec *time, int 
> > flags)
> >  {
> > int iflags = I_DIRTY_TIME;
> > +   bool dirty = false;
> >  
> > if (flags & S_ATIME)
> > inode->i_atime = *time;
> > if (flags & S_VERSION)
> > -   inode_inc_iversion(inode);
> > +   dirty |= inode_maybe_inc_iversion(inode, dirty);
> > if (flags & S_CTIME)
> > inode->i_ctime = *time;
> > if (flags & S_MTIME)
> > inode->i_mtime = *time;
> > +   if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
> > +   !(inode->i_sb->s_flags & SB_LAZYTIME))
> > +   dirty = true;
> 
> When you pass 'dirty' to inode_maybe_inc_iversion(), it is always false.
> Maybe this condition should be at the beginning of the function? Once you
> fix that the patch looks good so you can add:
> 
> Reviewed-by: Jan Kara 
> 

Thanks for the review! I've fixed it in my tree. I'll not re-post the
set unless I have to make another significant change or someone requests
it.
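(Editorial aside: the root cause Jan points at is plain C evaluation order — the right-hand side of |= sees the variable's value from before the assignment. A tiny standalone sketch with hypothetical stand-ins, not the kernel functions:)

```c
#include <stdbool.h>

/* Hypothetical stand-in for inode_maybe_inc_iversion(): pretend the
 * counter is only bumped when the caller forces it. */
static bool maybe_inc(bool force)
{
	return force;
}

/* Mirrors the ordering in the patch hunk: maybe_inc() is evaluated with
 * dirty's value from *before* the |=, i.e. always false here, so the
 * later time checks can never force the increment. */
static bool update_time_sketch(bool times_changed)
{
	bool dirty = false;

	dirty |= maybe_inc(dirty);	/* always called with dirty == false */
	if (times_changed)
		dirty = true;		/* too late to influence maybe_inc() */
	return dirty;
}
```

Moving the times-changed check before the maybe_inc() call, as Jan suggests, lets the forced increment actually happen.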

I did make one other change, and that was to drop the "const" qualifiers
on the integer arguments in the new API. David Howells pointed out that
they don't really help anything, and the prototypes look cleaner without
them.

This set is now in linux-next as well, so I'm going to try to get this
merged into v4.16, assuming no problems between now and the merge
window.
-- 
Jeff Layton 


Re: [PATCH v4 05/19] afs: convert to new i_version API

2018-01-02 Thread Jeff Layton
On Tue, 2018-01-02 at 17:20 +, David Howells wrote:
> Jeff Layton  wrote:
> 
> > Note that AFS has quite a different definition for this counter. AFS
> > only increments it on changes to the data, not for the metadata.
> 
> This also applies to AFS directories: create, mkdir, unlink, rmdir, link,
> symlink, rename, and mountpoint creation/removal all bump the data version
> number on a directory by exactly one if they change it.
> 

Thanks! I updated that part of the commit log to read:

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data in regular files and the
contents of directories. Inode metadata changes do not result
in a version increment.

-- 
Jeff Layton 


Re: [PATCH v5 03/78] xarray: Add the xa_lock to the radix_tree_root

2018-01-02 Thread Darrick J. Wong
On Tue, Dec 26, 2017 at 07:58:15PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 26, 2017 at 07:43:40PM -0800, Matthew Wilcox wrote:
> > Also add the xa_lock() and xa_unlock() family of wrappers to make it
> > easier to use the lock.  If we could rely on -fplan9-extensions in
> > the compiler, we could avoid all of this syntactic sugar, but that
> > wasn't added until gcc 4.6.
> 
> Oh, in case anyone's wondering, here's how I'd do it with plan9 extensions:
> 
> struct xarray {
> spinlock_t;
> int xa_flags;
> void *xa_head;
> };
> 
> ...
> spin_lock_irqsave(&mapping->pages, flags);
> __delete_from_page_cache(page, NULL);
> spin_unlock_irqrestore(&mapping->pages, flags);
> ...
> 
> The plan9 extensions permit passing a pointer to a struct which has an
> unnamed element to a function which is expecting a pointer to the type
> of that element.  The compiler does any necessary arithmetic to produce 
> a pointer.  It's exactly as if I had written:
> 
> spin_lock_irqsave(&mapping->pages.xa_lock, flags);
> __delete_from_page_cache(page, NULL);
> spin_unlock_irqrestore(&mapping->pages.xa_lock, flags);
> 
> More details here: https://9p.io/sys/doc/compiler.html

I read the link, and I understand (from section 3.3) that replacing
foo.bar.baz.goo with foo.goo is less typing, but otoh the first time I
read your example above I thought "we're passing (an array of pages |
something that doesn't have the word 'lock' in the name) to
spin_lock_irqsave? wtf?"

I suppose it does force me to go dig into whatever mapping->pages is to
figure out that there's an unnamed spinlock_t and that the compiler can
insert the appropriate pointer arithmetic, but now my brain trips over
'pages' being at the end of the selector for parameter 1 which slows
down my review reading...

OTOH I guess it /did/ motivate me to click the link, so well played,
sir. :)

--D



Re: [PATCH] btrfs: factor btrfs_check_rw_degradable() to check given device

2018-01-02 Thread David Sterba
On Mon, Dec 18, 2017 at 05:46:59PM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年12月18日 17:08, Anand Jain wrote:
> > Update btrfs_check_rw_degradable() to check against the given
> > device if its lost.
> > 
> > We can use this function to know if the volume is going to be
> > in degraded mode OR failed state, when the given device fails.
> > Which is needed when we are handling the device failed state.
> > 
> > A preparatory patch does not affect the flow as such.
> > 
> > Signed-off-by: Anand Jain 
> 
> Looks good.
> 
> Reviewed-by: Qu Wenruo 

Added to 4.16 queue, thanks.


Re: Project idea: reduce boot time/RAM usage: option to disable/delay raid6_pq and xor kmod

2018-01-02 Thread David Sterba
On Thu, Dec 28, 2017 at 09:18:07PM +0100, David Disseldorp wrote:
> On Sun, 24 Dec 2017 13:31:40 +0100, Ceriel Jacobs wrote:
> 
> > Saving:
> > 1. ± 0.4 seconds of boot time (10% of boot until root)
> > 2. ± 150k of RAM
> > 3. ± 75k of disk space
> 
> Thanks for bringing this up - I'm also particularly frustrated by the
> boot delay caused by the raid6 algorithm benchmark (1).
> 
> > New kernel command-line parameters?
> > a.) disable, like:
> >  - btrfs=noraid6_pq
> >  - btrfs=noraid (=no xor at all)
> > b.) delay raid6_pq and xor module loading, for cases where root mount 
> > doesn't need raid6_pq and/or xor.
> 
> c) It might not help with (2) or (3), but I'd be happy with an option to
> preselect the raid6 algorithm, so that the benchmark didn't run on each
> boot.

The preselection looks like the least intrusive option.

Another option is loading the modules on demand (either explicitly,
eg. inside the first mount, or implicitly on the first use of the loaded
module), which might be tricky.

The crc32c functions are loaded through the crypto API, and not as a
direct module, but are still initialized in the btrfs module
initialization itself.


[PATCH] btrfs: drop unused parameters from mount_subvol

2018-01-02 Thread David Sterba
Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol(); the flags and data (ie. mount options)
have already been applied, so we no longer need them.

Signed-off-by: David Sterba 
---
 fs/btrfs/super.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a35ac567384a..1434f288a397 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1395,8 +1395,7 @@ static inline int is_subvolume_inode(struct inode *inode)
 }
 
 static struct dentry *mount_subvol(const char *subvol_name, u64 
subvol_objectid,
-  int flags, const char *device_name,
-  char *data, struct vfsmount *mnt)
+  const char *device_name, struct vfsmount 
*mnt)
 {
struct dentry *root;
int ret;
@@ -1691,8 +1690,7 @@ static struct dentry *btrfs_mount(struct file_system_type 
*fs_type, int flags,
}
 
/* mount_subvol() will free subvol_name and mnt_root */
-   root = mount_subvol(subvol_name, subvol_objectid, flags, device_name,
-   data, mnt_root);
+   root = mount_subvol(subvol_name, subvol_objectid, device_name, 
mnt_root);
 
 out:
return root;
-- 
2.15.1



Re: [PATCH v4 05/19] afs: convert to new i_version API

2018-01-02 Thread David Howells
Jeff Layton  wrote:

> Note that AFS has quite a different definition for this counter. AFS
> only increments it on changes to the data, not for the metadata.

This also applies to AFS directories: create, mkdir, unlink, rmdir, link,
symlink, rename, and mountpoint creation/removal all bump the data version
number on a directory by exactly one if they change it.

David


Re: [PATCH] btrfs: cleanup unnecessary string dup in btrfs_parse_options()

2018-01-02 Thread David Sterba
On Thu, Dec 14, 2017 at 05:28:00PM +0900, Misono, Tomohiro wrote:
> Long ago, commit edf24abe51493 ("btrfs: sanity mount option parsing and
> early mount code") split the btrfs_parse_options() into two parts
> (btrfs_parse_early_options() and btrfs_parse_options()). As a result,
> btrfs_parse_options() no longer gets called twice and is the last one to
> parse the mount option string. Therefore there is no need to dup it.
> 
> Signed-off-by: Tomohiro Misono 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

2018-01-02 Thread Jan Kara
On Fri 22-12-17 18:54:57, Jeff Layton wrote:
> On Sat, 2017-12-23 at 10:14 +1100, NeilBrown wrote:
> > > +#include 
> > > +
> > > +/*
> > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly 
> > > for
> > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version 
> > > must
> > > + * appear different to observers if there was a change to the inode's 
> > > data or
> > > + * metadata since it was last queried.
> > > + *
> > > + * It should be considered an opaque value by observers. If it remains 
> > > the same
> > 
> >  
> > 
> > You keep using that word ... I don't think it means what you think it
> > means.
> > Change that sentence to:
> > 
> > Observers see i_version as a 64-bit number which never decreases.
> > 
> > and the rest still makes perfect sense.
> > 
> 
> Thanks! Fixed in my tree. I'll not resend the set just for that though.

With this fixed the patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

2018-01-02 Thread Jan Kara
On Fri 22-12-17 07:05:56, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Since i_version is mostly treated as an opaque value, we can exploit that
> fact to avoid incrementing it when no one is watching. With that change,
> we can avoid incrementing the counter on writes, unless someone has
> queried for it since it was last incremented. If the a/c/mtime don't
> change, and the i_version hasn't changed, then there's no need to dirty
> the inode metadata on a write.
> 
> Convert the i_version counter to an atomic64_t, and use the lowest order
> bit to hold a flag that will tell whether anyone has queried the value
> since it was last incremented.
> 
> When we go to maybe increment it, we fetch the value and check the flag
> bit.  If it's clear then we don't need to do anything if the update
> isn't being forced.
> 
> If we do need to update, then we increment the counter by 2, and clear
> the flag bit, and then use a CAS op to swap it into place. If that
> works, we return true. If it doesn't then do it again with the value
> that we fetch from the CAS operation.
> 
> On the query side, if the flag is already set, then we just shift the
> value down by 1 bit and return it. Otherwise, we set the flag in our
> on-stack value and again use cmpxchg to swap it into place if it hasn't
> changed. If it has, then we use the value from the cmpxchg as the new
> "old" value and try again.
> 
> This method allows us to avoid incrementing the counter on writes (and
> dirtying the metadata) under typical workloads. We only need to increment
> if it has been queried since it was last changed.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h   |   2 +-
>  include/linux/iversion.h | 208 
> ++-
>  2 files changed, 154 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 76382c24e9d0..6804d075933e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -639,7 +639,7 @@ struct inode {
>   struct hlist_head   i_dentry;
>   struct rcu_head i_rcu;
>   };
> - u64 i_version;
> + atomic64_t  i_version;
>   atomic_ti_count;
>   atomic_ti_dio_count;
>   atomic_ti_writecount;
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e08c634779df..cef242e54489 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -5,6 +5,8 @@
>  #include 
>  
>  /*
> + * The inode->i_version field:
> + * ---
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
>   * appear different to observers if there was a change to the inode's data or
> @@ -27,86 +29,171 @@
>   * i_version on namespace changes in directories (mkdir, rmdir, unlink, 
> etc.).
>   * We consider these sorts of filesystems to have a kernel-managed i_version.
>   *
> + * This implementation uses the low bit in the i_version field as a flag to
> + * track when the value has been queried. If it has not been queried since it
> + * was last incremented, we can skip the increment in most cases.
> + *
> + * In the event that we're updating the ctime, we will usually go ahead and
> + * bump the i_version anyway. Since that has to go to stable storage in some
> + * fashion, we might as well increment it as well.
> + *
> + * With this implementation, the value should always appear to observers to
> + * increase over time if the file has changed. It's recommended to use
> + * inode_cmp_iversion() helper to compare values.
> + *
>   * Note that some filesystems (e.g. NFS and AFS) just use the field to store
> - * a server-provided value (for the most part). For that reason, those
> + * a server-provided value for the most part. For that reason, those
>   * filesystems do not set SB_I_VERSION. These filesystems are considered to
>   * have a self-managed i_version.
> + *
> + * Persistently storing the i_version
> + * --
> + * Queries of the i_version field are not gated on them hitting the backing
> + * store. It's always possible that the host could crash after allowing
> + * a query of the value but before it has made it to disk.
> + *
> + * To mitigate this problem, filesystems should always use
> + * inode_set_iversion_queried when loading an existing inode from disk. This
> + * ensures that the next attempted inode increment will result in the value
> + * changing.
> + *
> + * Storing the value to disk therefore does not count as a query, so those
> + * filesystems should use inode_peek_iversion to grab the value to be stored.
> + * There is no need to flag the value as having been queried in that case.
>   */
>  
> +/*
> +
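The increment/query scheme described in the commit message above can be sketched in userspace with C11 atomics. This is a simplified illustration of the idea (flag in the low bit, increment by 2, compare-and-swap loops), with hypothetical names rather than the kernel API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define IV_QUERIED	1ULL	/* low bit: value was queried since last bump */
#define IV_INCR		2ULL	/* counter lives in the remaining bits */

/* Bump the counter only if forced or if someone queried it. */
static bool maybe_inc_iversion(atomic_uint_fast64_t *iv, bool force)
{
	uint_fast64_t cur = atomic_load(iv);

	for (;;) {
		uint_fast64_t new;

		/* Nobody looked since the last bump: skip the increment. */
		if (!force && !(cur & IV_QUERIED))
			return false;

		/* Increment by 2 and clear the QUERIED flag. */
		new = (cur + IV_INCR) & ~IV_QUERIED;
		if (atomic_compare_exchange_weak(iv, &cur, new))
			return true;
		/* CAS failed: cur was reloaded, try again. */
	}
}

/* Query the value, setting the flag so the next write bumps it. */
static uint_fast64_t query_iversion(atomic_uint_fast64_t *iv)
{
	uint_fast64_t cur = atomic_load(iv);

	while (!(cur & IV_QUERIED)) {
		if (atomic_compare_exchange_weak(iv, &cur, cur | IV_QUERIED))
			break;
	}
	return cur >> 1;	/* shift the flag bit away */
}
```

With this shape, repeated writes with no intervening reads leave the stored value untouched, which is the behavior the series relies on to avoid dirtying the inode metadata on every write.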

Re: [PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary

2018-01-02 Thread Jan Kara
On Fri 22-12-17 07:05:53, Jeff Layton wrote:
> From: Jeff Layton 
> 
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
> 
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 19e72f500f71..2fa920188759 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path 
> *path, struct inode *inode,
>  int generic_update_time(struct inode *inode, struct timespec *time, int 
> flags)
>  {
>   int iflags = I_DIRTY_TIME;
> + bool dirty = false;
>  
>   if (flags & S_ATIME)
>   inode->i_atime = *time;
>   if (flags & S_VERSION)
> - inode_inc_iversion(inode);
> + dirty |= inode_maybe_inc_iversion(inode, dirty);
>   if (flags & S_CTIME)
>   inode->i_ctime = *time;
>   if (flags & S_MTIME)
>   inode->i_mtime = *time;
> + if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
> + !(inode->i_sb->s_flags & SB_LAZYTIME))
> + dirty = true;

When you pass 'dirty' to inode_maybe_inc_iversion(), it is always false.
Maybe this condition should be at the beginning of the function? Once you
fix that the patch looks good so you can add:

Reviewed-by: Jan Kara 


Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Btrfs: remove unused wait in btrfs_stripe_hash

2018-01-02 Thread David Sterba
On Fri, Dec 22, 2017 at 04:23:01PM -0700, Liu Bo wrote:
> In fact nobody is waiting on @wait's waitqueue, it can be safely
> removed.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] fstests: btrfs: Add new 'limit' test group for btrfs

2018-01-02 Thread David Sterba
On Tue, Dec 26, 2017 at 01:57:44PM +0800, Qu Wenruo wrote:
> Btrfs qgroup also supports limiting the usage of specified qgroups.
> 
> It's possible to enable qgroup without enabling limits.
> (Most users won't use qgroup limits due to various problems.)
> 
> So add a new test group 'limit' for btrfs, as a subset of existing
> 'qgroup' group.

The term 'limit' is IMHO too generic, qglimit or qg_limit would be more
descriptive.


Re: [PATCH] btrfs: Remove unused btrfs_start_transaction_lflush function

2018-01-02 Thread David Sterba
On Fri, Dec 15, 2017 at 12:06:18PM +0200, Nikolay Borisov wrote:
> Commit 0e8c36a9fd81 ("Btrfs: fix lots of orphan inodes when the space
> is not enough") changed the way transaction reservation is made in
> btrfs_evict_inode() and as a result this function became unused. This
> has been the status quo for 5 years, in which time no one noticed, so
> I'd say it's safe to assume it's unlikely it will ever be used again.
> 
> Signed-off-by: Nikolay Borisov 

https://marc.info/?l=linux-btrfs&m=142117508730660&w=2
https://marc.info/?l=linux-kernel&m=142119399504907&w=2

"Transaction API, removing the func does not make sense without
removing BTRFS_RESERVE_FLUSH_LIMIT at the same time."

> ---
>  fs/btrfs/transaction.c | 10 +-
>  fs/btrfs/transaction.h |  3 ---
>  2 files changed, 1 insertion(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5a8c2649af2f..b1793688c716 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -542,7 +542,7 @@ start_transaction(struct btrfs_root *root, unsigned int 
> num_items,
>* and then we deadlock with somebody doing a freeze.
>*
>* If we are ATTACH, it means we just want to catch the current
> -  * transaction and commit it, so we needn't do sb_start_intwrite(). 
> +  * transaction and commit it, so we needn't do sb_start_intwrite().

Unrelated change.

>*/
>   if (type & __TRANS_FREEZABLE)
>   sb_start_intwrite(fs_info->sb);
> @@ -658,14 +658,6 @@ struct btrfs_trans_handle 
> *btrfs_start_transaction_fallback_global_rsv(
>   return trans;
>  }
>  
> -struct btrfs_trans_handle *btrfs_start_transaction_lflush(
> - struct btrfs_root *root,
> - unsigned int num_items)
> -{
> - return start_transaction(root, num_items, TRANS_START,
> -  BTRFS_RESERVE_FLUSH_LIMIT, true);
> -}
> -
>  struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root)
>  {
>   return start_transaction(root, 0, TRANS_JOIN, BTRFS_RESERVE_NO_FLUSH,
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index c55e44560103..9b25b0ba92ee 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -187,9 +187,6 @@ struct btrfs_trans_handle 
> *btrfs_start_transaction_fallback_global_rsv(
>   struct btrfs_root *root,
>   unsigned int num_items,
>   int min_factor);
> -struct btrfs_trans_handle *btrfs_start_transaction_lflush(
> - struct btrfs_root *root,
> - unsigned int num_items);
>  struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root);
>  struct btrfs_trans_handle *btrfs_join_transaction_nolock(struct btrfs_root 
> *root);
>  struct btrfs_trans_handle *btrfs_attach_transaction(struct btrfs_root *root);
> -- 
> 2.7.4
> 


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2018-01-02 Thread Josef Bacik
On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote:
> On Wed 20-12-17 08:35:05, Dave Chinner wrote:
> > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote:
> > > On Wed 13-12-17 09:20:04, Dave Chinner wrote:
> > > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > > This is just one of those things that's going to be slightly shitty.  
> > > > > It's the
> > > > > same for memory reclaim, all of those places use pages so we just take
> > > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close 
> > > > > enough.
> > > > 
> > > > Ok, so that isn't exactly easy to deal with, because all our
> > > > metadata writeback is based on log sequence number targets (i.e. how
> > > > far to push the tail of the log towards the current head). We've
> > > > actually got no idea how pages/bytes actually map to a LSN target
> > > > because while we might account a full buffer as dirty for memory
> > > > reclaim purposes (up to 64k in size), we might have only logged 128
> > > > bytes of it.
> > > > 
> > > > i.e. if we are asked to push 2MB of metadata and we treat that as
> > > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> > > > logged several tens of megabytes of dirty metadata in that LSN
> > > > range and have to flush it all. OTOH, if the buffers are fully
> > > > logged, then that same target might only flush 1.5MB of metadata
> > > > once all the log overhead is taken into account.
> > > > 
> > > > So there's a fairly large disconnect between the "flush N bytes of
> > > > metadata" API and the "push to a target LSN" that XFS uses for
> > > > flushing metadata in aged order. I'm betting that extN and other
> > > > filesystems might have similar mismatches with their journal
> > > > flushing...
> > > 
> > > Well, for ext4 it isn't as bad since we do full block logging only. So if
> > > we are asked to flush N pages, we can easily translate that to number of 
> > > fs
> > > blocks and flush that many from the oldest transaction.
> > > 
> > > Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> > > when pushing items from AIL (which is what I suppose XFS would do in
> > > response to metadata writeback request) and just stop pushing when it has
> > > cleaned as much as it was asked to?
> > 
> > If only it were that simple :/
> > 
> > To start with, flushing the dirty objects (such as inodes) to their
> > backing buffers does not mean that the object is clean once the
> > writeback completes. XFS has decoupled in-memory objects with
> > logical object logging rather than logging physical buffers, and
> > so can be modified and dirtied while the inode buffer
> > is being written back. Hence if we just count things like "buffer
> > size written" it's not actually a correct account of the amount of
> > dirty metadata we've cleaned. If we don't get that right, it'll
> > result in accounting errors and incorrect behaviour.
> > 
> > The bigger problem, however, is that we have no channel to return
> > flush information from the AIL pushing to whatever caller asked for
> > the push. Pushing metadata is completely decoupled from every other
> > subsystem. i.e. the caller asked the xfsaild to push to a specific
> > LSN (e.g. to free up a certain amount of log space for new
> > transactions), and *nothing* has any idea of how much metadata we'll
> > need to write to push the tail of the log to that LSN.
> > 
> > It's also completely asynchronous - there's no mechanism for waiting
> > on a push to a specific LSN. Anything that needs a specific amount
> > of log space to be available waits in ordered ticket queues on the
> > log tail moving forwards. The only interfaces that have access to
> > the log tail ticket waiting is the transaction reservation
> > subsystem, which cannot be used during metadata writeback because
> > that's a guaranteed deadlock vector
> > 
> > Saying "just account for bytes written" assumes a directly connected,
> > synchronously dispatched metadata writeback infrastructure, which we
> > simply don't have in XFS. "Just clean this many bytes" doesn't
> > really fit at all because we have no way of relating that to the
> > distance we need to push the tail of the log. An interface that
> > tells us "clean this percentage of dirty metadata" is much more
> > useful because we can map that easily to a log sequence number
> > based push target.
> 
> OK, understood.
> 
> > > > IOWs, treating metadata like it's one great big data inode doesn't
> > > > seem to me to be the right abstraction to use for this - in most
> > > > filesystems it's a bunch of objects with a complex dependency tree
> > > > and unknown write ordering, not an inode full of data that can be
> > > > sequentially written.
> > > > 
> > > > Maybe we need multiple ops with well defined behaviours. e.g.
> > > > ->w

Re: [PATCH] btrfs: qgroup: remove unused label 'retry'

2018-01-02 Thread David Sterba
On Fri, Dec 22, 2017 at 12:55:08AM +, Colin King wrote:
> From: Colin Ian King 
> 
> Label 'retry' is not used, remove it. Cleans up a clang build
> warning:
> 
> warning: label ‘retry’ defined but not used [-Wunused-label]
> 
> Fixes: b283738ab0ad ("Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"")
> Signed-off-by: Colin Ian King 

Thanks, we already have another patch to fix that in for-next.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Cleanup of bio_get/set

2018-01-02 Thread David Sterba
On Wed, Dec 13, 2017 at 10:25:36AM +0200, Nikolay Borisov wrote:
> Hello, 
> 
> Here is a series which cleans the btrfs source code of all the redundant
> bio_get/set calls. After it's applied there is only a single bio_get
> invocation left in __alloc_device for the flush_bio (and that is wrong
> since it leaks the bio, but this will be fixed in a follow-up fix).
> 
> Nikolay Borisov (4):
>   btrfs: Remove redundant bio_get/set calls in compressed read/write
> paths
>   btrfs: Remove redundant bio_get/set from submit_dio_repair_bio
>   btrfs: Remove redundant bio_get/bio_set pair from submit_one_bio
>   btrfs: Remove redundant pair of bio_get/set in __btrfs_submit_dio_bio

All

Reviewed-by: David Sterba 

and added to 4.16 queue.


Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-02 Thread ein
On 01/02/2018 01:45 PM, Marat Khalili wrote:
>> I think the 1-3TB Seagate drives are garbage.
> 
> There are known problems with ST3000DM001, but first of all you should not 
> put PC-oriented disks in RAID, they are not designed for it on multiple 
> levels (vibration tolerance, error reporting...) There are similar horror 
> stories about people filling whole cases with WD Greens and observing their 
> (non-BTRFS) RAID 6 fail. 
> 
> 

Same with the ST2000DM001; Lenovo and Seagate did that, for instance. How is
the user supposed to know...

>
> (Sorry for OT.)

m2

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


Re: Btrfs reserve metadata problem

2018-01-02 Thread Peter Grandi
> When testing Btrfs with fio 4k random write,

That's an exceptionally narrowly defined workload. Also it is
narrower than that, because it must be without 'fsync' after
each write, or else there would be no accumulation of dirty
blocks in memory at all.

> I found that volume with smaller free space available has
> lower performance.

That's an inappropriate use of "performance"... The speed may be
lower; the performance is another matter.

> It seems that the smaller the free space of the volume is, the
> smaller the amount of dirty pages the filesystem can have.

Is this a problem? Consider: all filesystems do less well when
there is less free space (smaller chance of finding spatially
compact allocations), and it is usually good to minimize the
amount of dirty pages anyhow (even if there are reasons to delay
writing them out).

> [ ... ] btrfs will reserve metadata for every write.  The
> amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
> The maximum amount of metadata reservation depends on the size of
> metadata currently in use and free space within the volume (free
> chunk size / 16). When metadata reaches the limit, btrfs will
> need to flush the data to release the reservation.

I don't understand here: under POSIX semantics filesystems are
not really allowed to avoid flushing *metadata* to disk for most
operations, that is metadata operations have an implied 'fsync'.
In your case of the "4k random write" with "cow disabled", the only
metadata that should get updated is the last-modified timestamp,
unless the user/application has been so amazingly stupid as to not
preallocate the file, and then they deserve whatever they get.

> 1. Is there any logic behind the value (free chunk size /16)

>   /*
>* If we have dup, raid1 or raid10 then only half of the free
>* space is actually useable. For raid56, the space info used
>* doesn't include the parity drive, so we don't have to
>* change the math
>*/
>   if (profile & (BTRFS_BLOCK_GROUP_DUP |
>   BTRFS_BLOCK_GROUP_RAID1 |
>   BTRFS_BLOCK_GROUP_RAID10))
>avail >>= 1;

As written there is a plausible logic, but it is quite crude.

>   /*
>* If we aren't flushing all things, let us overcommit up to
>* 1/2th of the space. If we can flush, don't let us overcommit
>* too much, let it overcommit up to 1/8 of the space.
>*/
>   if (flush == BTRFS_RESERVE_FLUSH_ALL)
>avail >>= 3;
>   else
>avail >>= 1;

Presumably overcommitting brings some benefits on other workloads.

In particular other parts of Btrfs don't behave awesomely well
when free space runs out.

> 2. Is there any way to improve this problem?

Again, is it a problem? More interestingly, if it is a problem,
is a solution available that does not impact other workloads?
It is simply impossible to optimize a filesystem perfectly for
every workload.

I'll try to summarize your report as I understand it:

* If:
  - The workload is "4k random write" (without 'fsync').
  - On a "cow disabled" file.
  - The file is not preallocated.
  - There is not much free space available.
* Then allocation overcommitting results in a higher frequency
  of unrequested metadata flushes, and those metadata flushes
  slow down a specific benchmark.


RE: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-02 Thread Marat Khalili
> I think the 1-3TB Seagate drives are garbage.

There are known problems with ST3000DM001, but first of all you should not put 
PC-oriented disks in RAID, they are not designed for it on multiple levels 
(vibration tolerance, error reporting...) There are similar horror stories 
about people filling whole cases with WD Greens and observing their (non-BTRFS) 
RAID 6 fail. 

(Sorry for OT.)
-- 

With Best Regards,
Marat Khalili


RE: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-02 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of ein
> Sent: Tuesday, 2 January 2018 9:03 PM
> To: swest...@gmail.com; Kai Krakow 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: A Big Thank You, and some Notes on Current Recovery Tools.


> Forgive me if it's not relevant, but I own quite a few disks from that series,
> like:
> 
> root@iomega-ordo:~# hdparm -i /dev/sda
> /dev/sda:
>  Model=ST2000DM001-1CH164, FwRev=CC27, SerialNo=Z1E6EV85
>  Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
> 
> root@iomega-acm:~# smartctl -d sat -a /dev/sda
> === START OF INFORMATION SECTION ===
> Device Model: ST3000DM001-9YN166
> Serial Number:S1F0PGQJ
> LU WWN Device Id: 5 000c50 0516fce00
> Firmware Version: CC4B
> 
> root@iomega-europol:~# smartctl -d sat -a /dev/sda
> smartctl 5.41 2011-06-09 r3365 [armv5tel-linux-2.6.31.8] (local build)
> === START OF INFORMATION SECTION ===
> Device Model: ST3000DM001-9YN166
> Serial Number:Z1F1H5KA
> LU WWN Device Id: 5 000c50 04ec18fda
> 
> Different locations, different environments, different boards, some with
> more stable power than others.
> 
> I replaced at least three or four in the past 3 years. All of them died
> under a heavy random write workload (rsnapshot, massive cp -al of millions
> of files every day). In my case bad sectors occurred every time too, but I
> didn't analyze where exactly; it was just a backup destination drive. I'm
> pretty convinced it could have been ext2 supers too, though.

I think the 1-3TB Seagate drives are garbage. All 6 of my drives were 
replaced under warranty due to bad sectors, 2 of them twice! As the 
replacements failed out of warranty they were replaced with 3-4TB HGST drives 
and I've had no problems ever since. My workload was just a daily backup store, 
so they sat there idling about 22 hours a day.
I hear the 4+TB Seagate drives are much better quality, but I have no experience 
with them.

Paul.


Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-02 Thread ein
On 01/01/2018 08:44 PM, Stirling Westrup wrote:
> On Mon, Jan 1, 2018 at 7:15 AM, Kai Krakow  wrote:
>> Am Mon, 01 Jan 2018 18:13:10 +0800 schrieb Qu Wenruo:
>>
>>> On 2018年01月01日 08:48, Stirling Westrup wrote:

 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
 errors exactly coincided with the 3 super-blocks on the drive.
>>>
>>> WTF, why do all these corruptions happen at the btrfs super blocks?!
>>>
>>> What a coincident.
>>
>> Maybe it's a hybrid drive with flash? Or something went wrong in the
>> drive-internal cache memory at the very time the superblocks were updated?
>>
>> I bet that the sectors aren't really broken, just the on-disk checksum
>> didn't match the sector. I remember such things happening to me more than
>> once back in the days when drives were still connected by molex power
>> connectors. Those connectors started to get loose over time, due to
>> thermals or repeated disconnects and reconnects. That is, drives sometimes
>> no longer had a reliable power source, which led to all sorts
>> of very strange problems, mostly resulting in pseudo-defective sectors.
>>
>> That said, the OP may want to check the power supply after this
>> coincidence... Maybe it's aging and no longer able to supply all four
>> drives, CPU, GPU and the rest with stable power.
> 
> You may be right about the cause of the error being a power-supply issue.
> For those that are curious, the drive that failed was a Seagate Barracuda
> LP 2000G drive (ST2000DL003).
> 

Forgive me if it's not relevant, but I own quite a few disks from that
series, like:

root@iomega-ordo:~# hdparm -i /dev/sda
/dev/sda:
 Model=ST2000DM001-1CH164, FwRev=CC27, SerialNo=Z1E6EV85
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }

root@iomega-acm:~# smartctl -d sat -a /dev/sda
=== START OF INFORMATION SECTION ===
Device Model: ST3000DM001-9YN166
Serial Number:S1F0PGQJ
LU WWN Device Id: 5 000c50 0516fce00
Firmware Version: CC4B

root@iomega-europol:~# smartctl -d sat -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [armv5tel-linux-2.6.31.8] (local build)
=== START OF INFORMATION SECTION ===
Device Model: ST3000DM001-9YN166
Serial Number:Z1F1H5KA
LU WWN Device Id: 5 000c50 04ec18fda

Different locations, different environments, different boards, some with
more stable power than others.

I replaced at least three or four in the past 3 years. All of them died
under a heavy random write workload (rsnapshot, massive cp -al of
millions of files every day). In my case bad sectors occurred every time
too, but I didn't analyze where exactly; it was just a backup
destination drive. I'm pretty convinced it could have been ext2 supers
too, though.




Re: Btrfs reserve metadata problem

2018-01-02 Thread Qu Wenruo


On 2018年01月02日 15:51, robbieko wrote:
> Hi All,
> 
> When testing Btrfs with fio 4k random write, I found that a volume with
> smaller free space available has lower performance.
> 
> It seems that the smaller the free space of the volume is, the smaller
> the amount of dirty pages the filesystem can have.
> There are only 6 MB of dirty pages when the free space of the volume is
> only 10GB with 16KB nodesize and cow disabled.
> 
> btrfs will reserve metadata for every write.
> The amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
> The maximum amount of metadata reservation depends on the size of metadata
> currently in use and free space within the volume (free chunk size / 16).
> When metadata reaches the limit, btrfs will need to flush the data to
> release the reservation.
> 
> 1. Is there any logic behind the value (free chunk size /16)
> 
>  /*
>   * If we have dup, raid1 or raid10 then only half of the free
>   * space is actually useable. For raid56, the space info used
>   * doesn't include the parity drive, so we don't have to
>   * change the math
>   */
>  if (profile & (BTRFS_BLOCK_GROUP_DUP |
>  BTRFS_BLOCK_GROUP_RAID1 |
>  BTRFS_BLOCK_GROUP_RAID10))
>   avail >>= 1;
> 
>  /*
>   * If we aren't flushing all things, let us overcommit up to
>   * 1/2th of the space. If we can flush, don't let us overcommit
>   * too much, let it overcommit up to 1/8 of the space.
>   */
>  if (flush == BTRFS_RESERVE_FLUSH_ALL)
>   avail >>= 3;
>  else
>   avail >>= 1;
> 
> 2. Is there any way to improve this problem?

One solution is to reduce the metadata reserve.

I experienced a similar problem, although in qgroup metadata reservation
rather than extent-level reservation.

In the qgroup respect, it only needs to care about the "net" reservation,
so qgroup rsv can do a special calculation and get rid of the overkill
nodesize * BTRFS_MAX_LEVEL(8) * 2.

But the extent allocator must ensure that there is enough space for
delalloc, so the qgroup trick can't be applied directly.


Although we can still reduce the amount a lot.

Personally speaking, btrfs may only need at most (current_tree_level +
2) * nodesize for each meta rsv
(one for a possible tree level increase, and one for an extra leaf split).

And especially for delalloc, all delalloc results (file extents) are
adjacent to each other, which could further reduce the meta reservation.

For example, if there are 4 outstanding extents to be written to disk,
we only need (current_tree_level + 2) * nodesize.
Even if we need to increase the tree level *AND* split the tree, we will
definitely have enough free space to contain just 4 file extents.
(After a rough calculation, we can keep that meta rsv until the number
of outstanding extents exceeds 307.)

So we could definitely reduce the meta rsv amount needed and thereby
avoid unnecessary flushes.

Thanks,
Qu

> 
> Thanks.
> Robbie Ko


