Re: [PATCH] btrfs-progs: doc: Update docs about RAID profiles

2016-12-01 Thread Duncan
Austin S. Hemmelgarn posted on Thu, 01 Dec 2016 15:01:17 -0500 as
excerpted:

> This adds some more info about chunk profiles in the mkfs manpage,
> specifically providing better info about raid1 and raid10 profiles and
> the fact that they can't survive more than one device failing.
> 
> This should hopefully make it less likely that people hit unexpected
> behavior when using these profiles.
> 
> Signed-off-by: Austin S. Hemmelgarn 
> ---
> This should work to cover most of the issues brought up on the mailing
> list recently regarding this particular aspect of documentation.
> 
>  Documentation/mkfs.btrfs.asciidoc | 44
>  ---
>  1 file changed, 36 insertions(+), 8 deletions(-)

FWIW, LGTM as a btrfs user and list regular.  Thanks. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/18] btrfs: root->fs_info cleanup, io_ctl_init

2016-12-01 Thread jeffm
From: Jeff Mahoney 

The io_ctl->root member was only being used to access root->fs_info.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h            |  2 +-
 fs/btrfs/free-space-cache.c | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)
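
As a rough userspace illustration of the change described above, the
sketch below stores the fs_info in the control structure and derives it
from the inode's superblock, so the init function no longer needs a root
argument.  Every sketch_* name is a hypothetical stand-in, not a kernel
definition; sketch_sb_to_fs_info() only mimics the role btrfs_sb() plays
in the diff that follows.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct sketch_fs_info {
	unsigned long long generation;
};

struct sketch_super_block {
	void *s_fs_info;	/* filesystem-private data hangs off the superblock */
};

struct sketch_inode {
	struct sketch_super_block *i_sb;
};

struct sketch_io_ctl {
	struct sketch_fs_info *fs_info;	/* was: a root used only for root->fs_info */
	struct sketch_inode *inode;
	int write;
};

/* Plays the role btrfs_sb() has in the patch: superblock -> fs_info. */
static struct sketch_fs_info *sketch_sb_to_fs_info(struct sketch_super_block *sb)
{
	return sb->s_fs_info;
}

/* The root parameter is gone; the inode alone is enough to reach fs_info. */
int sketch_io_ctl_init(struct sketch_io_ctl *io_ctl, struct sketch_inode *inode,
		       int write)
{
	if (!io_ctl || !inode || !inode->i_sb)
		return -EINVAL;

	io_ctl->fs_info = sketch_sb_to_fs_info(inode->i_sb);
	io_ctl->inode = inode;
	io_ctl->write = write;
	return 0;
}
```

Callers that only had a root in order to pass it along can drop it, which
is what the two call-site hunks in the diff do.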

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9a3ca79..15ff880 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -519,7 +519,7 @@ struct btrfs_io_ctl {
void *cur, *orig;
struct page *page;
struct page **pages;
-   struct btrfs_root *root;
+   struct btrfs_fs_info *fs_info;
struct inode *inode;
unsigned long size;
int index;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index a538133..d2320ee 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -303,7 +303,7 @@ static int readahead_cache(struct inode *inode)
 }
 
 static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct inode *inode,
-  struct btrfs_root *root, int write)
+  int write)
 {
int num_pages;
int check_crcs = 0;
@@ -325,7 +325,7 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct 
inode *inode,
return -ENOMEM;
 
io_ctl->num_pages = num_pages;
-   io_ctl->root = root;
+   io_ctl->fs_info = btrfs_sb(inode->i_sb);
io_ctl->check_crcs = check_crcs;
io_ctl->inode = inode;
 
@@ -448,7 +448,7 @@ static int io_ctl_check_generation(struct btrfs_io_ctl 
*io_ctl, u64 generation)
 
gen = io_ctl->cur;
if (le64_to_cpu(*gen) != generation) {
-   btrfs_err_rl(io_ctl->root->fs_info,
+   btrfs_err_rl(io_ctl->fs_info,
"space cache generation (%llu) does not match inode (%llu)",
*gen, generation);
io_ctl_unmap_page(io_ctl);
@@ -504,7 +504,7 @@ static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, 
int index)
  PAGE_SIZE - offset);
btrfs_csum_final(crc, (u8 *)&crc);
if (val != crc) {
-   btrfs_err_rl(io_ctl->root->fs_info,
+   btrfs_err_rl(io_ctl->fs_info,
"csum mismatch on free space cache");
io_ctl_unmap_page(io_ctl);
return -EIO;
@@ -722,7 +722,7 @@ static int __load_free_space_cache(struct btrfs_root *root, 
struct inode *inode,
if (!num_entries)
return 0;
 
-   ret = io_ctl_init(&io_ctl, inode, root, 0);
+   ret = io_ctl_init(&io_ctl, inode, 0);
if (ret)
return ret;
 
@@ -1229,7 +1229,7 @@ static int __btrfs_write_out_cache(struct btrfs_root 
*root, struct inode *inode,
return -EIO;
 
WARN_ON(io_ctl->pages);
-   ret = io_ctl_init(io_ctl, inode, root, 1);
+   ret = io_ctl_init(io_ctl, inode, 1);
if (ret)
return ret;
 
-- 
2.7.1



[PATCH 11/18] btrfs: root->fs_info cleanup, lock/unlock_chunks

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/disk-io.c  |  4 +--
 fs/btrfs/extent-tree.c  |  8 +++---
 fs/btrfs/free-space-cache.c |  4 +--
 fs/btrfs/volumes.c  | 70 ++---
 fs/btrfs/volumes.h  |  8 +++---
 5 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3d4fb99..5abf3af 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3998,7 +3998,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
__btrfs_free_block_rsv(root->orphan_block_rsv);
root->orphan_block_rsv = NULL;
 
-   lock_chunks(root);
+   lock_chunks(root->fs_info);
while (!list_empty(&fs_info->pinned_chunks)) {
struct extent_map *em;
 
@@ -4007,7 +4007,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
list_del_init(&em->list);
free_extent_map(em);
}
-   unlock_chunks(root);
+   unlock_chunks(root->fs_info);
 }
 
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b8ad81c..cc9ae54 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9436,9 +9436,9 @@ int btrfs_inc_block_group_ro(struct btrfs_root *root,
 out:
if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
alloc_flags = update_block_group_flags(root, cache->flags);
-   lock_chunks(root->fs_info->chunk_root);
+   lock_chunks(root->fs_info);
check_system_chunk(trans, root, alloc_flags);
-   unlock_chunks(root->fs_info->chunk_root);
+   unlock_chunks(root->fs_info);
}
mutex_unlock(&root->fs_info->ro_block_group_mutex);
 
@@ -10482,7 +10482,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
 
memcpy(&key, &block_group->key, sizeof(key));
 
-   lock_chunks(root);
+   lock_chunks(root->fs_info);
if (!list_empty(&em->list)) {
/* We're in the transaction->pending_chunks list. */
free_extent_map(em);
@@ -10550,7 +10550,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
free_extent_map(em);
}
 
-   unlock_chunks(root);
+   unlock_chunks(root->fs_info);
 
ret = remove_block_group_free_space(trans, root->fs_info, block_group);
if (ret)
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index aee1255..8424617 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3328,7 +3328,7 @@ void btrfs_put_block_group_trimming(struct 
btrfs_block_group_cache *block_group)
spin_unlock(&block_group->lock);
 
if (cleanup) {
-   lock_chunks(block_group->fs_info->chunk_root);
+   lock_chunks(block_group->fs_info);
em_tree = &block_group->fs_info->mapping_tree.map_tree;
write_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree, block_group->key.objectid,
@@ -3340,7 +3340,7 @@ void btrfs_put_block_group_trimming(struct 
btrfs_block_group_cache *block_group)
 */
remove_extent_mapping(em_tree, em);
write_unlock(&em_tree->lock);
-   unlock_chunks(block_group->fs_info->chunk_root);
+   unlock_chunks(block_group->fs_info);
 
/* once for us and once for the tree */
free_extent_map(em);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6b3bcc4..a53086f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1889,10 +1889,10 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path, u64 devid)
}
 
if (device->writeable) {
-   lock_chunks(root);
+   lock_chunks(root->fs_info);
list_del_init(&device->dev_alloc_list);
device->fs_devices->rw_devices--;
-   unlock_chunks(root);
+   unlock_chunks(root->fs_info);
clear_super = true;
}
 
@@ -1981,11 +1981,11 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path, u64 devid)
 
 error_undo:
if (device->writeable) {
-   lock_chunks(root);
+   lock_chunks(root->fs_info);
list_add(&device->dev_alloc_list,
 &root->fs_info->fs_devices->alloc_list);
device->fs_devices->rw_devices++;
-   unlock_chunks(root);
+   unlock_chunks(root->fs_info);
}
goto out;
 }
@@ -2212,9 +2212,9 @@ static int btrfs_prepare_sprout(struct btrfs_root *root)
list_for_each_entry(device, &seed_devices->devices, dev_list)
device->fs_devices = seed_devices;
 
-   lock_chunks(root);
+   lock_chunks(root->fs_info);
list_splice_init(&fs_devices->alloc_list, &seed_devices->alloc_list);
-   unlock_chunks(root);
+   

[PATCH 14/18] btrfs: root->fs_info cleanup, access fs_info->delayed_root directly

2016-12-01 Thread jeffm
From: Jeff Mahoney 

This results in btrfs_assert_delayed_root_empty and
btrfs_destroy_delayed_inode taking an fs_info instead of a root.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/delayed-inode.c | 23 ++++++-----------------
 fs/btrfs/delayed-inode.h |  4 ++--
 fs/btrfs/disk-io.c       |  8 ++++----
 fs/btrfs/transaction.c   |  2 +-
 4 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index d7d5eb9..c8ffceb 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -72,12 +72,6 @@ static inline int btrfs_is_continuous_delayed_item(
return 0;
 }
 
-static inline struct btrfs_delayed_root *btrfs_get_delayed_root(
-   struct btrfs_root *root)
-{
-   return root->fs_info->delayed_root;
-}
-
 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
 {
struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
@@ -1163,7 +1157,7 @@ static int __btrfs_run_delayed_items(struct 
btrfs_trans_handle *trans,
block_rsv = trans->block_rsv;
trans->block_rsv = &fs_info->delayed_block_rsv;
 
-   delayed_root = btrfs_get_delayed_root(root);
+   delayed_root = fs_info->delayed_root;
 
curr_node = btrfs_first_delayed_node(delayed_root);
while (curr_node && (!count || (count && nr--))) {
@@ -1390,11 +1384,9 @@ static int btrfs_wq_run_delayed_node(struct 
btrfs_delayed_root *delayed_root,
return 0;
 }
 
-void btrfs_assert_delayed_root_empty(struct btrfs_root *root)
+void btrfs_assert_delayed_root_empty(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_delayed_root *delayed_root;
-   delayed_root = btrfs_get_delayed_root(root);
-   WARN_ON(btrfs_first_delayed_node(delayed_root));
+   WARN_ON(btrfs_first_delayed_node(fs_info->delayed_root));
 }
 
 static int could_end_wait(struct btrfs_delayed_root *delayed_root, int seq)
@@ -1415,7 +1407,7 @@ void btrfs_balance_delayed_items(struct btrfs_root *root)
struct btrfs_delayed_root *delayed_root;
struct btrfs_fs_info *fs_info = root->fs_info;
 
-   delayed_root = btrfs_get_delayed_root(root);
+   delayed_root = fs_info->delayed_root;
 
if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
return;
@@ -1980,14 +1972,11 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root 
*root)
}
 }
 
-void btrfs_destroy_delayed_inodes(struct btrfs_root *root)
+void btrfs_destroy_delayed_inodes(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_delayed_root *delayed_root;
struct btrfs_delayed_node *curr_node, *prev_node;
 
-   delayed_root = btrfs_get_delayed_root(root);
-
-   curr_node = btrfs_first_delayed_node(delayed_root);
+   curr_node = btrfs_first_delayed_node(fs_info->delayed_root);
while (curr_node) {
__btrfs_kill_delayed_node(curr_node);
 
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 2c1cbe2..7320d72 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -134,7 +134,7 @@ int btrfs_delayed_delete_inode_ref(struct inode *inode);
 void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 
 /* Used for clean the transaction */
-void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
+void btrfs_destroy_delayed_inodes(struct btrfs_fs_info *fs_info);
 
 /* Used for readdir() */
 bool btrfs_readdir_get_delayed_items(struct inode *inode,
@@ -153,6 +153,6 @@ int __init btrfs_delayed_inode_init(void);
 void btrfs_delayed_inode_exit(void);
 
 /* for debugging */
-void btrfs_assert_delayed_root_empty(struct btrfs_root *root);
+void btrfs_assert_delayed_root_empty(struct btrfs_fs_info *fs_info);
 
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 02ba794..5f7d283 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4587,8 +4587,8 @@ void btrfs_cleanup_one_transaction(struct 
btrfs_transaction *cur_trans,
cur_trans->state = TRANS_STATE_UNBLOCKED;
wake_up(&fs_info->transaction_wait);
 
-   btrfs_destroy_delayed_inodes(root);
-   btrfs_assert_delayed_root_empty(root);
+   btrfs_destroy_delayed_inodes(fs_info);
+   btrfs_assert_delayed_root_empty(fs_info);
 
btrfs_destroy_marked_extents(root, &cur_trans->dirty_pages,
 EXTENT_DIRTY);
@@ -4649,8 +4649,8 @@ static int btrfs_cleanup_transaction(struct btrfs_root 
*root)
}
spin_unlock(&fs_info->trans_lock);
btrfs_destroy_all_ordered_extents(fs_info);
-   btrfs_destroy_delayed_inodes(root);
-   btrfs_assert_delayed_root_empty(root);
+   btrfs_destroy_delayed_inodes(fs_info);
+   btrfs_assert_delayed_root_empty(fs_info);
btrfs_destroy_pinned_extent(root, fs_info->pinned_extents);
btrfs_destroy_all_delalloc_inodes(fs_info);
mutex_unlock(&fs_info->transaction_kthread_mutex);
diff --git 

[PATCH 03/18] btrfs: btrfs_init_new_device should use fs_info->dev_root

2016-12-01 Thread jeffm
From: Jeff Mahoney 

btrfs_init_new_device only uses the root passed in via the ioctl to
start the transaction.  Nothing else that happens is related to whatever
root the user used to initiate the ioctl.  We can drop the root requirement
and just use fs_info->dev_root instead.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ioctl.c   | 2 +-
 fs/btrfs/volumes.c | 3 ++-
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)
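
A minimal userspace sketch of the idea in the commit message: the function
takes the fs_info and picks the device root itself, so whichever root the
caller happened to be holding no longer matters.  The sketch_* types and
sketch_start_transaction() are hypothetical placeholders for illustration
and do not reflect the kernel's transaction API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures are far larger. */
struct sketch_root {
	unsigned long long objectid;
	int transactions_started;
};

struct sketch_fs_info {
	struct sketch_root *dev_root;	/* the one root this operation needs */
	struct sketch_root *fs_root;	/* e.g. the root an ioctl was issued on */
};

/* Placeholder for starting a transaction on a given root. */
static int sketch_start_transaction(struct sketch_root *root)
{
	root->transactions_started++;
	return 0;
}

/* Takes fs_info and selects dev_root internally instead of trusting a caller-supplied root. */
int sketch_init_new_device(struct sketch_fs_info *fs_info,
			   const char *device_path)
{
	struct sketch_root *root = fs_info->dev_root;

	if (!device_path || !*device_path)
		return -EINVAL;

	return sketch_start_transaction(root);
}
```

The design point is that the choice of root becomes an implementation
detail of the callee, so call sites cannot pass the wrong one.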

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 67c37fd..5b21a9b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2672,7 +2672,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_root *root, 
void __user *arg)
}
 
vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
-   ret = btrfs_init_new_device(root, vol_args->name);
+   ret = btrfs_init_new_device(root->fs_info, vol_args->name);
 
if (!ret)
btrfs_info(root->fs_info, "disk added %s",vol_args->name);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ed868a0..f336f07 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2310,8 +2310,9 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle 
*trans,
return ret;
 }
 
-int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
+int btrfs_init_new_device(struct btrfs_fs_info *fs_info, char *device_path)
 {
+   struct btrfs_root *root = fs_info->dev_root;
struct request_queue *q;
struct btrfs_trans_handle *trans;
struct btrfs_device *device;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 471a619..0c8e77b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -439,7 +439,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 
devid,
   u8 *uuid, u8 *fsid);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
-int btrfs_init_new_device(struct btrfs_root *root, char *path);
+int btrfs_init_new_device(struct btrfs_fs_info *fs_info, char *path);
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
  struct btrfs_device *srcdev,
  struct btrfs_device **device_out);
-- 
2.7.1



[PATCH 00/18] misc-4.10: root->fs_info patchset

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Hi all -

Here's the latest version of my root->fs_info patchset.  It's against
Dave's misc-4.10 branch.

-Jeff

---

Jeff Mahoney (18):
  btrfs: call functions that overwrite their root parameter with fs_info
  btrfs: call functions that always use the same root with fs_info
instead
  btrfs: btrfs_init_new_device should use fs_info->dev_root
  btrfs: alloc_reserved_file_extent trace point should use extent_root
  btrfs: struct btrfsic_state->root should be an fs_info
  btrfs: struct reada_control.root -> reada_control.fs_info
  btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
  btrfs: root->fs_info cleanup, io_ctl_init
  btrfs: pull node/sector/stripe sizes out of root and into fs_info
  btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
  btrfs: root->fs_info cleanup, lock/unlock_chunks
  btrfs: root->fs_info cleanup, update_block_group{,flags}
  btrfs: root->fs_info cleanup, add fs_info convenience variables
  btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
  btrfs: convert extent-tree tracepoints to use fs_info
  btrfs: simplify btrfs_wait_cache_io prototype
  btrfs: take an fs_info directly when the root is not used otherwise
  btrfs: split btrfs_wait_marked_extents into normal and tree log
functions

 fs/btrfs/backref.c |8 +-
 fs/btrfs/check-integrity.c |   73 +-
 fs/btrfs/check-integrity.h |5 +-
 fs/btrfs/compression.c |   54 +-
 fs/btrfs/ctree.c   |  468 ++--
 fs/btrfs/ctree.h   |  216 +++---
 fs/btrfs/delayed-inode.c   |  134 ++--
 fs/btrfs/delayed-inode.h   |   19 +-
 fs/btrfs/dev-replace.c |   58 +-
 fs/btrfs/dev-replace.h |4 +-
 fs/btrfs/dir-item.c|   45 +-
 fs/btrfs/disk-io.c |  538 +++---
 fs/btrfs/disk-io.h |   30 +-
 fs/btrfs/export.c  |   10 +-
 fs/btrfs/extent-tree.c | 1269 
 fs/btrfs/extent_io.c   |   61 +-
 fs/btrfs/extent_io.h   |8 +-
 fs/btrfs/file-item.c   |  152 ++--
 fs/btrfs/file.c|  194 ++---
 fs/btrfs/free-space-cache.c|  154 ++--
 fs/btrfs/free-space-cache.h|   12 +-
 fs/btrfs/free-space-tree.c |   36 +-
 fs/btrfs/inode-item.c  |   11 +-
 fs/btrfs/inode-map.c   |   22 +-
 fs/btrfs/inode.c   |  691 +
 fs/btrfs/ioctl.c   |  518 +++--
 fs/btrfs/ordered-data.c|   38 +-
 fs/btrfs/ordered-data.h|4 +-
 fs/btrfs/print-tree.c  |   19 +-
 fs/btrfs/print-tree.h  |4 +-
 fs/btrfs/props.c   |5 +-
 fs/btrfs/qgroup.c  |   56 +-
 fs/btrfs/qgroup.h  |2 +-
 fs/btrfs/raid56.c  |   62 +-
 fs/btrfs/raid56.h  |8 +-
 fs/btrfs/reada.c   |   34 +-
 fs/btrfs/relocation.c  |  229 +++---
 fs/btrfs/root-tree.c   |   26 +-
 fs/btrfs/scrub.c   |  157 ++--
 fs/btrfs/send.c|   29 +-
 fs/btrfs/super.c   |  130 ++--
 fs/btrfs/tests/btrfs-tests.c   |   13 +-
 fs/btrfs/tests/btrfs-tests.h   |4 +-
 fs/btrfs/tests/extent-buffer-tests.c   |7 +-
 fs/btrfs/tests/extent-io-tests.c   |5 +-
 fs/btrfs/tests/free-space-tests.c  |   18 +-
 fs/btrfs/tests/free-space-tree-tests.c |9 +-
 fs/btrfs/tests/inode-tests.c   |   16 +-
 fs/btrfs/tests/qgroup-tests.c  |   11 +-
 fs/btrfs/transaction.c |  562 +++---
 fs/btrfs/transaction.h |   11 +-
 fs/btrfs/tree-log.c|  191 ++---
 fs/btrfs/uuid-tree.c   |   21 +-
 fs/btrfs/volumes.c |  753 +--
 fs/btrfs/volumes.h |   43 +-
 fs/btrfs/xattr.c   |   19 +-
 include/trace/events/btrfs.h   |   65 +-
 57 files changed, 3745 insertions(+), 3596 deletions(-)

-- 
2.7.1



[PATCH 15/18] btrfs: convert extent-tree tracepoints to use fs_info

2016-12-01 Thread jeffm
From: Jeff Mahoney 

The extent-tree tracepoints all operate on the extent root, regardless of
which root is passed in.  Let's just use the extent root objectid instead.
If it turns out that nobody is depending on the format of this tracepoint,
we can drop the root printing entirely.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c   | 17 ---
 include/trace/events/btrfs.h | 49 
 2 files changed, 30 insertions(+), 36 deletions(-)
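
To make the commit message concrete, here is a small userspace sketch of
an event recorder that takes an fs_info and always records the extent
tree's objectid, whatever root the caller was working on.  The entry
layout and all sketch_* names are assumptions made for illustration; they
are not the TRACE_EVENT definitions from include/trace/events/btrfs.h.
The constant 2 mirrors BTRFS_EXTENT_TREE_OBJECTID.

```c
#include <stdint.h>
#include <string.h>

/* Mirrors BTRFS_EXTENT_TREE_OBJECTID; the name itself is hypothetical. */
#define SKETCH_EXTENT_TREE_OBJECTID 2ULL

struct sketch_fs_info {
	uint8_t fsid[16];
};

/* Hypothetical record of one "reserved extent" event. */
struct sketch_trace_entry {
	uint8_t fsid[16];
	uint64_t root_objectid;
	uint64_t start;
	uint64_t len;
};

/*
 * No root argument: the filesystem is identified through fs_info, and the
 * recorded root is always the extent tree.
 */
void sketch_trace_reserved_extent_free(const struct sketch_fs_info *fs_info,
				       uint64_t start, uint64_t len,
				       struct sketch_trace_entry *entry)
{
	memcpy(entry->fsid, fs_info->fsid, sizeof(entry->fsid));
	entry->root_objectid = SKETCH_EXTENT_TREE_OBJECTID;
	entry->start = start;
	entry->len = len;
}
```

If nobody depends on the root field, as the commit message suggests, the
root_objectid member could simply be dropped from such a record.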

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 03512c6..a358aaa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7244,7 +7244,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
btrfs_add_free_space(cache, buf->start, buf->len);
btrfs_free_reserved_bytes(cache, buf->len, 0);
btrfs_put_block_group(cache);
-   trace_btrfs_reserved_extent_free(root, buf->start, buf->len);
+   trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
pin = 0;
}
 out:
@@ -7493,7 +7493,7 @@ static noinline int find_free_extent(struct btrfs_root 
*orig_root,
ins->objectid = 0;
ins->offset = 0;
 
-   trace_find_free_extent(orig_root, num_bytes, empty_size, flags);
+   trace_find_free_extent(fs_info, num_bytes, empty_size, flags);
 
space_info = __find_space_info(fs_info, flags);
if (!space_info) {
@@ -7652,7 +7652,7 @@ static noinline int find_free_extent(struct btrfs_root 
*orig_root,
if (offset) {
/* we have a block, we're done */
spin_unlock(&last_ptr->refill_lock);
-   trace_btrfs_reserve_extent_cluster(root,
+   trace_btrfs_reserve_extent_cluster(fs_info,
used_block_group,
search_start, num_bytes);
if (used_block_group != block_group) {
@@ -7725,7 +7725,7 @@ static noinline int find_free_extent(struct btrfs_root 
*orig_root,
if (offset) {
/* we found one, proceed */
spin_unlock(&last_ptr->refill_lock);
-   trace_btrfs_reserve_extent_cluster(root,
+   trace_btrfs_reserve_extent_cluster(fs_info,
block_group, search_start,
num_bytes);
goto checks;
@@ -7823,7 +7823,7 @@ static noinline int find_free_extent(struct btrfs_root 
*orig_root,
ins->objectid = search_start;
ins->offset = num_bytes;
 
-   trace_btrfs_reserve_extent(orig_root, block_group,
+   trace_btrfs_reserve_extent(fs_info, block_group,
   search_start, num_bytes);
btrfs_release_block_group(block_group, delalloc);
break;
@@ -8048,7 +8048,7 @@ static int __btrfs_free_reserved_extent(struct btrfs_root 
*root,
ret = btrfs_discard_extent(root, start, len, NULL);
btrfs_add_free_space(cache, start, len);
btrfs_free_reserved_bytes(cache, len, delalloc);
-   trace_btrfs_reserved_extent_free(root, start, len);
+   trace_btrfs_reserved_extent_free(fs_info, start, len);
}
 
btrfs_put_block_group(cache);
@@ -8139,8 +8139,7 @@ static int alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
ins->objectid, ins->offset);
BUG();
}
-   trace_btrfs_reserved_extent_alloc(fs_info->extent_root,
- ins->objectid, ins->offset);
+   trace_btrfs_reserved_extent_alloc(fs_info, ins->objectid, ins->offset);
return ret;
 }
 
@@ -8226,7 +8225,7 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
BUG();
}
 
-   trace_btrfs_reserved_extent_alloc(root, ins->objectid,
+   trace_btrfs_reserved_extent_alloc(fs_info, ins->objectid,
  fs_info->nodesize);
return ret;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ff5cd17..c14bed4 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -891,65 +891,61 @@ TRACE_EVENT(btrfs_flush_space,
 
 DECLARE_EVENT_CLASS(btrfs__reserved_extent,
 
-   TP_PROTO(struct btrfs_root *root, u64 start, u64 len),
+   TP_PROTO(struct btrfs_fs_info *fs_info, u64 start, u64 len),
 
-   TP_ARGS(root, start, len),
+   TP_ARGS(fs_info, start, len),
 
TP_STRUCT__entry_btrfs(
- 

[PATCH 02/18] btrfs: call functions that always use the same root with fs_info instead

2016-12-01 Thread jeffm
From: Jeff Mahoney 

There are many functions that are always called with the same root
argument.  Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h | 16 +++
 fs/btrfs/disk-io.c   | 44 +
 fs/btrfs/disk-io.h   |  4 ++--
 fs/btrfs/extent-tree.c   | 20 ++-
 fs/btrfs/inode.c |  6 +++---
 fs/btrfs/ioctl.c | 18 -
 fs/btrfs/relocation.c|  4 ++--
 fs/btrfs/root-tree.c |  9 ++---
 fs/btrfs/super.c | 13 ++--
 fs/btrfs/transaction.c   |  6 +++---
 fs/btrfs/uuid-tree.c | 10 ++
 fs/btrfs/volumes.c   | 47 
 fs/btrfs/volumes.h   |  4 ++--
 include/trace/events/btrfs.h | 16 +++
 14 files changed, 115 insertions(+), 102 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 51cd757..74110c7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2640,7 +2640,7 @@ int btrfs_setup_space_cache(struct btrfs_trans_handle 
*trans,
 int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
 int btrfs_read_block_groups(struct btrfs_fs_info *info);
-int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr);
+int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr);
 int btrfs_make_block_group(struct btrfs_trans_handle *trans,
   struct btrfs_root *root, u64 bytes_used,
   u64 type, u64 chunk_objectid, u64 chunk_offset,
@@ -2649,7 +2649,7 @@ struct btrfs_trans_handle 
*btrfs_start_trans_remove_block_group(
struct btrfs_fs_info *fs_info,
const u64 chunk_offset);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 group_start,
+struct btrfs_fs_info *fs_info, u64 group_start,
 struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
@@ -2935,11 +2935,11 @@ int btrfs_old_root_level(struct btrfs_root *root, u64 
time_seq);
 
 /* root-item.c */
 int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *tree_root,
+  struct btrfs_fs_info *fs_info,
   u64 root_id, u64 ref_id, u64 dirid, u64 sequence,
   const char *name, int name_len);
 int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *tree_root,
+  struct btrfs_fs_info *fs_info,
   u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
   const char *name, int name_len);
 int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
@@ -2954,7 +2954,7 @@ int __must_check btrfs_update_root(struct 
btrfs_trans_handle *trans,
 int btrfs_find_root(struct btrfs_root *root, struct btrfs_key *search_key,
struct btrfs_path *path, struct btrfs_root_item *root_item,
struct btrfs_key *root_key);
-int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
+int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info);
 void btrfs_set_root_node(struct btrfs_root_item *item,
 struct extent_buffer *node);
 void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
@@ -2963,10 +2963,10 @@ void btrfs_update_root_times(struct btrfs_trans_handle 
*trans,
 
 /* uuid-tree.c */
 int btrfs_uuid_tree_add(struct btrfs_trans_handle *trans,
-   struct btrfs_root *uuid_root, u8 *uuid, u8 type,
+   struct btrfs_fs_info *fs_info, u8 *uuid, u8 type,
u64 subid);
 int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans,
-   struct btrfs_root *uuid_root, u8 *uuid, u8 type,
+   struct btrfs_fs_info *fs_info, u8 *uuid, u8 type,
u64 subid);
 int btrfs_uuid_tree_iterate(struct btrfs_fs_info *fs_info,
int (*check_func)(struct btrfs_fs_info *, u8 *, u8,
@@ -3613,7 +3613,7 @@ static inline int btrfs_init_acl(struct 
btrfs_trans_handle *trans,
 #endif
 
 /* relocation.c */
-int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start);
 int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
  struct btrfs_root *root);
 int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/disk-io.c 

[PATCH 05/18] btrfs: struct btrfsic_state->root should be an fs_info

2016-12-01 Thread jeffm
From: Jeff Mahoney 

The root member is never used except for obtaining an fs_info pointer.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/check-integrity.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 86f681f..270da67 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -254,7 +254,7 @@ struct btrfsic_state {
struct list_head all_blocks_list;
struct btrfsic_block_hashtable block_hashtable;
struct btrfsic_block_link_hashtable block_link_hashtable;
-   struct btrfs_root *root;
+   struct btrfs_fs_info *fs_info;
u64 max_superblock_generation;
struct btrfsic_block *latest_superblock;
u32 metablock_size;
@@ -717,7 +717,7 @@ static int btrfsic_process_superblock(struct btrfsic_state 
*state,
}
 
num_copies =
-   btrfs_num_copies(state->root->fs_info,
+   btrfs_num_copies(state->fs_info,
 next_bytenr, state->metablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
pr_info("num_copies(log_bytenr=%llu) = %d\n",
@@ -888,7 +888,7 @@ static int btrfsic_process_superblock_dev_mirror(
}
 
num_copies =
-   btrfs_num_copies(state->root->fs_info,
+   btrfs_num_copies(state->fs_info,
 next_bytenr, state->metablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
pr_info("num_copies(log_bytenr=%llu) = %d\n",
@@ -1263,7 +1263,7 @@ static int btrfsic_create_link_to_next_block(
*next_blockp = NULL;
if (0 == *num_copiesp) {
*num_copiesp =
-   btrfs_num_copies(state->root->fs_info,
+   btrfs_num_copies(state->fs_info,
 next_bytenr, state->metablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
pr_info("num_copies(log_bytenr=%llu) = %d\n",
@@ -1457,7 +1457,7 @@ static int btrfsic_handle_extent_data(
chunk_len = num_bytes;
 
num_copies =
-   btrfs_num_copies(state->root->fs_info,
+   btrfs_num_copies(state->fs_info,
 next_bytenr, state->datablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
pr_info("num_copies(log_bytenr=%llu) = %d\n",
@@ -1539,7 +1539,7 @@ static int btrfsic_map_block(struct btrfsic_state *state, 
u64 bytenr, u32 len,
struct btrfs_device *device;
 
length = len;
-   ret = btrfs_map_block(state->root->fs_info, BTRFS_MAP_READ,
+   ret = btrfs_map_block(state->fs_info, BTRFS_MAP_READ,
  bytenr, &length, &multi, mirror_num);
 
if (ret) {
@@ -1741,7 +1741,7 @@ static int btrfsic_test_for_metadata(struct btrfsic_state 
*state,
num_pages = state->metablock_size >> PAGE_SHIFT;
h = (struct btrfs_header *)datav[0];
 
-   if (memcmp(h->fsid, state->root->fs_info->fsid, BTRFS_UUID_SIZE))
+   if (memcmp(h->fsid, state->fs_info->fsid, BTRFS_UUID_SIZE))
return 1;
 
for (i = 0; i < num_pages; i++) {
@@ -2276,7 +2276,7 @@ static int btrfsic_process_written_superblock(
}
 
num_copies =
-   btrfs_num_copies(state->root->fs_info,
+   btrfs_num_copies(state->fs_info,
 next_bytenr, BTRFS_SUPER_INFO_SIZE);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
pr_info("num_copies(log_bytenr=%llu) = %d\n",
@@ -2705,7 +2705,7 @@ static void btrfsic_cmp_log_and_dev_bytenr(struct 
btrfsic_state *state,
struct btrfsic_block_data_ctx block_ctx;
int match = 0;
 
-   num_copies = btrfs_num_copies(state->root->fs_info,
+   num_copies = btrfs_num_copies(state->fs_info,
  bytenr, state->metablock_size);
 
for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
@@ -2936,7 +2936,7 @@ int btrfsic_mount(struct btrfs_root *root,
btrfsic_is_initialized = 1;
}
mutex_lock(&btrfsic_mutex);
-   state->root = root;
+   state->fs_info = root->fs_info;
state->print_mask = print_mask;
state->include_extent_data = including_extent_data;
state->csum_size = 0;
-- 
2.7.1



[PATCH 18/18] btrfs: split btrfs_wait_marked_extents into normal and tree log functions

2016-12-01 Thread jeffm
From: Jeff Mahoney 

btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
btrfs_wait_marked_extents, which provides a core loop and then handles
errors differently based on whether it's a log root or not.

This means that btrfs_write_and_wait_marked_extents needs to take a root
because btrfs_wait_marked_extents requires one, even though it's only
used to determine whether the root is a log root.  The log root code
won't ever call into the transaction commit code using a log root, so we
can factor out the core loop and provide the error handling appropriate
to each waiter in new routines.  This allows us to eventually remove
the root argument from btrfs_commit_transaction, and as a result,
btrfs_end_transaction.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/transaction.c | 79 +++---
 fs/btrfs/transaction.h |  5 ++--
 fs/btrfs/tree-log.c| 14 -
 3 files changed, 58 insertions(+), 40 deletions(-)
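
The refactoring described above can be sketched in userspace as one shared
core loop plus two thin wrappers, each applying its own error handling.
This is only an illustration under stated assumptions: the sketch_* names,
the flag values, and the rule that each wrapper checks and clears one
error flag are hypothetical, and the array of per-range results stands in
for actually waiting on writeback.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error flags; the values are arbitrary. */
enum {
	SKETCH_LOG1_ERR  = 1u << 0,
	SKETCH_LOG2_ERR  = 1u << 1,
	SKETCH_BTREE_ERR = 1u << 2,
};

struct sketch_fs_info {
	unsigned int flags;
};

static bool sketch_test_and_clear(struct sketch_fs_info *fs_info,
				  unsigned int bit)
{
	bool was_set = (fs_info->flags & bit) != 0;

	fs_info->flags &= ~bit;
	return was_set;
}

/* Core loop: visit every range and remember the last error seen. */
static int sketch_wait_core(const int *range_results, size_t nr)
{
	int werr = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (range_results[i])
			werr = range_results[i];
	return werr;
}

/* Transaction-commit waiter: no root needed, only fs_info. */
int sketch_wait_extents(struct sketch_fs_info *fs_info,
			const int *range_results, size_t nr)
{
	int err = sketch_wait_core(range_results, nr);
	bool errors = sketch_test_and_clear(fs_info, SKETCH_BTREE_ERR);

	if (errors && !err)
		err = -EIO;
	return err;
}

/* Tree-log waiter: the caller says which log error flag applies. */
int sketch_wait_tree_log_extents(struct sketch_fs_info *fs_info,
				 const int *range_results, size_t nr,
				 unsigned int log_err_bit)
{
	int err = sketch_wait_core(range_results, nr);
	bool errors = sketch_test_and_clear(fs_info, log_err_bit);

	if (errors && !err)
		err = -EIO;
	return err;
}
```

Because neither wrapper has to ask "is this a log root?", the commit path
can be called with an fs_info alone, which is the property the commit
message is after.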

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8667a99..a23fedd 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -957,11 +957,11 @@ int btrfs_write_marked_extents(struct btrfs_fs_info 
*fs_info,
 * time a temporary error. So when it happens, ignore the error
 * and wait for writeback of this range to finish - because we
 * failed to set the bit EXTENT_NEED_WAIT for the range, a call
-* to btrfs_wait_marked_extents() would not know that writeback
-* for this range started and therefore wouldn't wait for it to
-* finish - we don't want to commit a superblock that points to
-* btree nodes/leafs for which writeback hasn't finished yet
-* (and without errors).
+* to __btrfs_wait_marked_extents() would not know that
+* writeback for this range started and therefore wouldn't
+* wait for it to finish - we don't want to commit a
+* superblock that points to btree nodes/leafs for which
+* writeback hasn't finished yet (and without errors).
 * We cleanup any entries left in the io tree when committing
 * the transaction (through clear_btree_io_tree()).
 */
@@ -989,17 +989,15 @@ int btrfs_write_marked_extents(struct btrfs_fs_info 
*fs_info,
  * those extents are on disk for transaction or log commit.  We wait
  * on all the pages and clear them from the dirty pages state tree
  */
-int btrfs_wait_marked_extents(struct btrfs_root *root,
- struct extent_io_tree *dirty_pages, int mark)
+static int __btrfs_wait_marked_extents(struct btrfs_fs_info *fs_info,
+  struct extent_io_tree *dirty_pages)
 {
int err = 0;
int werr = 0;
-   struct btrfs_fs_info *fs_info = root->fs_info;
struct address_space *mapping = fs_info->btree_inode->i_mapping;
struct extent_state *cached_state = NULL;
u64 start = 0;
u64 end;
-   bool errors = false;
 
while (!find_first_extent_bit(dirty_pages, start, &start, &end,
  EXTENT_NEED_WAIT, &cached_state)) {
@@ -1027,24 +1025,45 @@ int btrfs_wait_marked_extents(struct btrfs_root *root,
}
if (err)
werr = err;
+   return werr;
+}
 
-   if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
-   if ((mark & EXTENT_DIRTY) &&
-   test_and_clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags))
-   errors = true;
+int btrfs_wait_extents(struct btrfs_fs_info *fs_info,
+  struct extent_io_tree *dirty_pages)
+{
+   bool errors = false;
+   int err;
 
-   if ((mark & EXTENT_NEW) &&
-   test_and_clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags))
-   errors = true;
-   } else {
-   if (test_and_clear_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags))
-   errors = true;
-   }
+   err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
+   if (test_and_clear_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags))
+   errors = true;
+
+   if (errors && !err)
+   err = -EIO;
+   return err;
+}
 
-   if (errors && !werr)
-   werr = -EIO;
+int btrfs_wait_tree_log_extents(struct btrfs_root *log_root, int mark)
+{
+   struct btrfs_fs_info *fs_info = log_root->fs_info;
+   struct extent_io_tree *dirty_pages = &log_root->dirty_log_pages;
+   bool errors = false;
+   int err;
 
-   return werr;
+   ASSERT(log_root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
+
+   err = __btrfs_wait_marked_extents(fs_info, dirty_pages);
+   if ((mark & EXTENT_DIRTY) &&
+   test_and_clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags))
+   errors = true;
+
+   

[PATCH 10/18] btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h|  8 
 fs/btrfs/delayed-inode.c|  4 ++--
 fs/btrfs/extent-tree.c  | 35 +++
 fs/btrfs/file.c |  4 ++--
 fs/btrfs/free-space-cache.c |  4 ++--
 fs/btrfs/inode-map.c|  3 ++-
 fs/btrfs/inode.c|  4 ++--
 fs/btrfs/props.c|  2 +-
 fs/btrfs/transaction.c  |  5 +++--
 9 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6a5c007..19b6bb2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2535,20 +2535,20 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
address_space *mapping)
 
 u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes);
 
-static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
+static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_fs_info *fs_info,
 unsigned num_items)
 {
-   return root->fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
+   return fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
 }
 
 /*
  * Doing a truncate won't result in new nodes or leaves, just what we need for
  * COW.
  */
-static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_root *root,
+static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
 unsigned num_items)
 {
-   return root->fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
+   return fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index d90d444..d4e0781 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -549,7 +549,7 @@ static int btrfs_delayed_item_reserve_metadata(struct 
btrfs_trans_handle *trans,
src_rsv = trans->block_rsv;
dst_rsv = &root->fs_info->delayed_block_rsv;
 
-   num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   num_bytes = btrfs_calc_trans_metadata_size(root->fs_info, 1);
ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
if (!ret) {
trace_btrfs_space_reservation(root->fs_info, "delayed_item",
@@ -592,7 +592,7 @@ static int btrfs_delayed_inode_reserve_metadata(
src_rsv = trans->block_rsv;
dst_rsv = &root->fs_info->delayed_block_rsv;
 
-   num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   num_bytes = btrfs_calc_trans_metadata_size(root->fs_info, 1);
 
/*
 * If our block_rsv is the delalloc block reserve then check and see if
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 127a54b..b8ad81c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2791,13 +2791,13 @@ int btrfs_check_space_for_delayed_refs(struct 
btrfs_trans_handle *trans,
u64 num_bytes, num_dirty_bgs_bytes;
int ret = 0;
 
-   num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   num_bytes = btrfs_calc_trans_metadata_size(root->fs_info, 1);
num_heads = heads_to_leaves(root, num_heads);
if (num_heads > 1)
num_bytes += (num_heads - 1) * root->fs_info->nodesize;
num_bytes <<= 1;
num_bytes += btrfs_csum_bytes_to_leaves(root, csum_bytes) * 
root->fs_info->nodesize;
-   num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(root,
+   num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(root->fs_info,
 num_dirty_bgs);
global_rsv = &root->fs_info->global_block_rsv;
 
@@ -4440,8 +4440,8 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
num_devs = get_profile_num_devs(root, type);
 
/* num_devs device items to update and 1 chunk item to add or remove */
-   thresh = btrfs_calc_trunc_metadata_size(root, num_devs) +
-   btrfs_calc_trans_metadata_size(root, 1);
+   thresh = btrfs_calc_trunc_metadata_size(root->fs_info, num_devs) +
+   btrfs_calc_trans_metadata_size(root->fs_info, 1);
 
if (left < thresh && btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) {
btrfs_info(root->fs_info, "left=%llu, need=%llu, flags=%llu",
@@ -4695,7 +4695,7 @@ static inline int calc_reclaim_items_nr(struct btrfs_root 
*root, u64 to_reclaim)
u64 bytes;
int nr;
 
-   bytes = btrfs_calc_trans_metadata_size(root, 1);
+   bytes = btrfs_calc_trans_metadata_size(root->fs_info, 1);
nr = (int)div64_u64(to_reclaim, bytes);
if (!nr)
nr = 1;
@@ -5770,7 +5770,7 @@ int btrfs_orphan_reserve_metadata(struct 
btrfs_trans_handle *trans,
 * added it, so this takes the reservation so we can release it later
 * when we are truly done with the orphan item.
 */
-   u64 num_bytes = 

[PATCH 01/18] btrfs: call functions that overwrite their root parameter with fs_info

2016-12-01 Thread jeffm
From: Jeff Mahoney 

There are 11 functions that accept a root parameter and immediately
overwrite it.  We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h|  4 ++--
 fs/btrfs/disk-io.c  |  4 ++--
 fs/btrfs/extent-tree.c  | 17 +++---
 fs/btrfs/file-item.c|  5 ++---
 fs/btrfs/free-space-cache.c |  5 ++---
 fs/btrfs/free-space-cache.h |  2 +-
 fs/btrfs/transaction.c  |  9 
 fs/btrfs/tree-log.c |  6 ++---
 fs/btrfs/volumes.c  | 55 +
 fs/btrfs/volumes.h  |  4 ++--
 10 files changed, 52 insertions(+), 59 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1b25a46..51cd757 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2639,7 +2639,7 @@ int btrfs_setup_space_cache(struct btrfs_trans_handle 
*trans,
struct btrfs_root *root);
 int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
-int btrfs_read_block_groups(struct btrfs_root *root);
+int btrfs_read_block_groups(struct btrfs_fs_info *info);
 int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr);
 int btrfs_make_block_group(struct btrfs_trans_handle *trans,
   struct btrfs_root *root, u64 bytes_used,
@@ -3055,7 +3055,7 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path 
*path,
 /* file-item.c */
 struct btrfs_dio_private;
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root, u64 bytenr, u64 len);
+   struct btrfs_fs_info *fs_info, u64 bytenr, u64 len);
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811662c..92c2aea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2937,7 +2937,7 @@ int open_ctree(struct super_block *sb,
read_extent_buffer(chunk_root->node, fs_info->chunk_tree_uuid,
   btrfs_header_chunk_tree_uuid(chunk_root->node), BTRFS_UUID_SIZE);
 
-   ret = btrfs_read_chunk_tree(chunk_root);
+   ret = btrfs_read_chunk_tree(fs_info);
if (ret) {
btrfs_err(fs_info, "failed to read chunk tree: %d", ret);
goto fail_tree_roots;
@@ -3038,7 +3038,7 @@ int open_ctree(struct super_block *sb,
goto fail_sysfs;
}
 
-   ret = btrfs_read_block_groups(fs_info->extent_root);
+   ret = btrfs_read_block_groups(fs_info);
if (ret) {
btrfs_err(fs_info, "failed to read block groups: %d", ret);
goto fail_sysfs;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0d80136..13ef5d5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2414,7 +2414,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
-   ret = btrfs_del_csums(trans, root,
+   ret = btrfs_del_csums(trans, root->fs_info,
  node->bytenr,
  node->num_bytes);
}
@@ -3622,7 +3622,8 @@ int btrfs_start_dirty_block_groups(struct 
btrfs_trans_handle *trans,
 
if (cache->disk_cache_state == BTRFS_DC_SETUP) {
cache->io_ctl.inode = NULL;
-   ret = btrfs_write_out_cache(root, trans, cache, path);
+   ret = btrfs_write_out_cache(root->fs_info, trans,
+   cache, path);
if (ret == 0 && cache->io_ctl.inode) {
num_started++;
should_put = 0;
@@ -3774,7 +3775,8 @@ int btrfs_write_dirty_block_groups(struct 
btrfs_trans_handle *trans,
 
if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP) {
cache->io_ctl.inode = NULL;
-   ret = btrfs_write_out_cache(root, trans, cache, path);
+   ret = btrfs_write_out_cache(root->fs_info, trans,
+   cache, path);
if (ret == 0 && cache->io_ctl.inode) {
num_started++;
should_put = 0;
@@ -7068,7 +7070,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
 
if (is_data) {
-   ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
+ 

[PATCH 07/18] btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/disk-io.c |  4 +--
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/scrub.c   | 86 +++---
 fs/btrfs/volumes.c | 41 +++---
 fs/btrfs/volumes.h |  3 +-
 6 files changed, 68 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 270da67..91f6bd9 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -832,7 +832,7 @@ static int btrfsic_process_superblock_dev_mirror(
superblock_tmp->never_written = 0;
superblock_tmp->mirror_num = 1 + superblock_mirror_num;
if (state->print_mask & BTRFSIC_PRINT_MASK_SUPERBLOCK_WRITE)
-   btrfs_info_in_rcu(device->dev_root->fs_info,
+   btrfs_info_in_rcu(device->fs_info,
"new initial S-block (bdev %p, %s) @%llu 
(%s/%llu/%d)",
 superblock_bdev,
 rcu_str_deref(device->name), dev_bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1db5f03..8384831 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3308,7 +3308,7 @@ static void btrfs_end_buffer_write_sync(struct 
buffer_head *bh, int uptodate)
struct btrfs_device *device = (struct btrfs_device *)
bh->b_private;
 
-   btrfs_warn_rl_in_rcu(device->dev_root->fs_info,
+   btrfs_warn_rl_in_rcu(device->fs_info,
"lost page write due to IO error on %s",
  rcu_str_deref(device->name));
/* note, we don't set_buffer_write_io_error because we have
@@ -3453,7 +3453,7 @@ static int write_dev_supers(struct btrfs_device *device,
bh = __getblk(device->bdev, bytenr / 4096,
  BTRFS_SUPER_INFO_SIZE);
if (!bh) {
-   btrfs_err(device->dev_root->fs_info,
+   btrfs_err(device->fs_info,
"couldn't get super buffer head for bytenr 
%llu",
bytenr);
errors++;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e8b8e30..9e83423 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10853,7 +10853,7 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
ret = 0;
 
while (1) {
-   struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
+   struct btrfs_fs_info *fs_info = device->fs_info;
struct btrfs_transaction *trans;
u64 bytes;
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 589d792..ed70246 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -171,7 +171,7 @@ struct scrub_wr_ctx {
 
 struct scrub_ctx {
struct scrub_bio *bios[SCRUB_BIOS_PER_SCTX];
-   struct btrfs_root *dev_root;
+   struct btrfs_fs_info *fs_info;
int first_free;
int curr;
atomic_t bios_in_flight;
@@ -356,7 +356,7 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info 
*fs_info)
  */
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx)
 {
-   struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+   struct btrfs_fs_info *fs_info = sctx->fs_info;
 
atomic_inc(&sctx->refs);
/*
@@ -388,7 +388,7 @@ static void scrub_pending_trans_workers_inc(struct 
scrub_ctx *sctx)
 /* used for workers that require transaction commits */
 static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx)
 {
-   struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+   struct btrfs_fs_info *fs_info = sctx->fs_info;
 
/*
 * see scrub_pending_trans_workers_inc() why we're pretending
@@ -458,7 +458,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, 
int is_dev_replace)
 {
struct scrub_ctx *sctx;
int i;
-   struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
+   struct btrfs_fs_info *fs_info = dev->fs_info;
int ret;
 
sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
@@ -468,7 +468,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, 
int is_dev_replace)
sctx->is_dev_replace = is_dev_replace;
sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
sctx->curr = -1;
-   sctx->dev_root = dev->dev_root;
+   sctx->fs_info = dev->fs_info;
for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
struct scrub_bio *sbio;
 
@@ -489,8 +489,8 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, 
int is_dev_replace)
   

[PATCH 16/18] btrfs: simplify btrfs_wait_cache_io prototype

2016-12-01 Thread jeffm
From: Jeff Mahoney 

With the exception of the one case where btrfs_wait_cache_io is called
without a block group, it's called with the same arguments.  The root
argument is only used in the special case, so let's factor out the core
and simplify the call in the normal case to require a trans, block group,
and path.
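
A simplified userspace sketch of the shape this takes, for illustration
only.  The struct layouts, the call_log recorder, the root pointer stored
in the block group, and the zero offset in the special case are all
assumptions of the sketch, and the trans and path arguments are left out
for brevity: the core keeps the full argument list, while the normal-case
wrapper derives everything from the block group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct root { int id; };
struct io_ctl { int pending; };
struct block_group {
	struct root *root;	/* assumption: reachable from the block group */
	struct io_ctl io_ctl;
	uint64_t objectid;
};

/* Records what the core was called with, so the wrappers can be checked. */
struct call_log {
	const struct root *root;
	const struct block_group *bg;
	const struct io_ctl *io_ctl;
	uint64_t offset;
};

/* Core routine: full argument list, and the block group may be NULL. */
static int wait_cache_io_core(struct call_log *log, struct root *root,
			      struct block_group *bg, struct io_ctl *io_ctl,
			      uint64_t offset)
{
	log->root = root;
	log->bg = bg;
	log->io_ctl = io_ctl;
	log->offset = offset;
	io_ctl->pending = 0;	/* pretend the I/O has completed */
	return 0;
}

/* Special case: no block group, the caller supplies root and io_ctl. */
int wait_cache_io_root(struct call_log *log, struct root *root,
		       struct io_ctl *io_ctl)
{
	return wait_cache_io_core(log, root, NULL, io_ctl, 0);
}

/* Normal case: every other argument is derived from the block group. */
int wait_cache_io(struct call_log *log, struct block_group *bg)
{
	return wait_cache_io_core(log, bg->root, bg, &bg->io_ctl,
				  bg->objectid);
}
```

Callers in the normal case then pass only what they actually vary, and
the special case stays explicit about not having a block group.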

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c  | 15 ---
 fs/btrfs/free-space-cache.c | 40 
 fs/btrfs/free-space-cache.h |  6 ++
 3 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a358aaa..d0c5d5d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3610,9 +3610,7 @@ int btrfs_start_dirty_block_groups(struct 
btrfs_trans_handle *trans,
 */
if (!list_empty(&cache->io_list)) {
list_del_init(&cache->io_list);
-   btrfs_wait_cache_io(root, trans, cache,
-   &cache->io_ctl, path,
-   cache->key.objectid);
+   btrfs_wait_cache_io(trans, cache, path);
btrfs_put_block_group(cache);
}
 
@@ -3767,9 +3765,7 @@ int btrfs_write_dirty_block_groups(struct 
btrfs_trans_handle *trans,
if (!list_empty(&cache->io_list)) {
spin_unlock(&cur_trans->dirty_bgs_lock);
list_del_init(&cache->io_list);
-   btrfs_wait_cache_io(root, trans, cache,
-   &cache->io_ctl, path,
-   cache->key.objectid);
+   btrfs_wait_cache_io(trans, cache, path);
btrfs_put_block_group(cache);
spin_lock(&cur_trans->dirty_bgs_lock);
}
@@ -3839,8 +3835,7 @@ int btrfs_write_dirty_block_groups(struct 
btrfs_trans_handle *trans,
cache = list_first_entry(io, struct btrfs_block_group_cache,
 io_list);
list_del_init(&cache->io_list);
-   btrfs_wait_cache_io(root, trans, cache,
-   &cache->io_ctl, path, cache->key.objectid);
+   btrfs_wait_cache_io(trans, cache, path);
btrfs_put_block_group(cache);
}
 
@@ -10383,9 +10378,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
*trans,
WARN_ON(!IS_ERR(inode) && inode != block_group->io_ctl.inode);
 
spin_unlock(&trans->transaction->dirty_bgs_lock);
-   btrfs_wait_cache_io(root, trans, block_group,
-   &block_group->io_ctl, path,
-   block_group->key.objectid);
+   btrfs_wait_cache_io(trans, block_group, path);
btrfs_put_block_group(block_group);
spin_lock(&trans->transaction->dirty_bgs_lock);
}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index e937636..ab7e2b9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -42,6 +42,10 @@ static int link_free_space(struct btrfs_free_space_ctl *ctl,
   struct btrfs_free_space *info);
 static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
  struct btrfs_free_space *info);
+static int btrfs_wait_cache_io_root(struct btrfs_root *root,
+struct btrfs_trans_handle *trans,
+struct btrfs_io_ctl *io_ctl,
+struct btrfs_path *path);
 
 static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
   struct btrfs_path *path,
@@ -244,9 +248,7 @@ int btrfs_truncate_free_space_cache(struct btrfs_root *root,
if (!list_empty(&block_group->io_list)) {
list_del_init(&block_group->io_list);
 
-   btrfs_wait_cache_io(root, trans, block_group,
-   &block_group->io_ctl, path,
-   block_group->key.objectid);
+   btrfs_wait_cache_io(trans, block_group, path);
btrfs_put_block_group(block_group);
}
 
@@ -1139,11 +1141,11 @@ cleanup_write_cache_enospc(struct inode *inode,
 GFP_NOFS);
 }
 
-int btrfs_wait_cache_io(struct btrfs_root *root,
-   struct btrfs_trans_handle *trans,
-   struct btrfs_block_group_cache *block_group,
-   struct btrfs_io_ctl *io_ctl,
-   struct btrfs_path *path, u64 offset)
+static int __btrfs_wait_cache_io(struct btrfs_root *root,
+struct btrfs_trans_handle *trans,
+struct btrfs_block_group_cache *block_group,

[PATCH 04/18] btrfs: alloc_reserved_file_extent trace point should use extent_root

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Even though a separate root is passed in, we're still operating on the
extent root.  Let's use that for the trace point.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e4b3fc0..e8b8e30 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8105,7 +8105,8 @@ static int alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
ins->objectid, ins->offset);
BUG();
}
-   trace_btrfs_reserved_extent_alloc(root, ins->objectid, ins->offset);
+   trace_btrfs_reserved_extent_alloc(fs_info->extent_root,
+ ins->objectid, ins->offset);
return ret;
 }
 
-- 
2.7.1



[PATCH 06/18] btrfs: struct reada_control.root -> reada_control.fs_info

2016-12-01 Thread jeffm
From: Jeff Mahoney 

The root is never used.  We substitute extent_root in for the
reada_find_extent call, since it's only ever used to obtain the node
size.  This call site will be changed to use fs_info in a later patch.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/reada.c | 13 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 74110c7..9a3ca79 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3652,7 +3652,7 @@ static inline void btrfs_bio_counter_dec(struct 
btrfs_fs_info *fs_info)
 
 /* reada.c */
 struct reada_control {
-   struct btrfs_root   *root;  /* tree to prefetch */
+   struct btrfs_fs_info *fs_info;   /* tree to prefetch */
struct btrfs_key key_start;
struct btrfs_key key_end; /* exclusive */
atomic_t elems;
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index f0beb63..540e729 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -544,17 +544,18 @@ static void reada_control_release(struct kref *kref)
 static int reada_add_block(struct reada_control *rc, u64 logical,
   struct btrfs_key *top, u64 generation)
 {
-   struct btrfs_root *root = rc->root;
+   struct btrfs_fs_info *fs_info = rc->fs_info;
struct reada_extent *re;
struct reada_extctl *rec;
 
-   re = reada_find_extent(root, logical, top); /* takes one ref */
+   /* takes one ref */
+   re = reada_find_extent(fs_info->tree_root, logical, top);
if (!re)
return -1;
 
rec = kzalloc(sizeof(*rec), GFP_KERNEL);
if (!rec) {
-   reada_extent_put(root->fs_info, re);
+   reada_extent_put(fs_info, re);
return -ENOMEM;
}
 
@@ -914,7 +915,7 @@ struct reada_control *btrfs_reada_add(struct btrfs_root 
*root,
if (!rc)
return ERR_PTR(-ENOMEM);
 
-   rc->root = root;
+   rc->fs_info = root->fs_info;
rc->key_start = *key_start;
rc->key_end = *key_end;
atomic_set(&rc->elems, 0);
@@ -942,7 +943,7 @@ struct reada_control *btrfs_reada_add(struct btrfs_root 
*root,
 int btrfs_reada_wait(void *handle)
 {
struct reada_control *rc = handle;
-   struct btrfs_fs_info *fs_info = rc->root->fs_info;
+   struct btrfs_fs_info *fs_info = rc->fs_info;
 
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
@@ -963,7 +964,7 @@ int btrfs_reada_wait(void *handle)
 int btrfs_reada_wait(void *handle)
 {
struct reada_control *rc = handle;
-   struct btrfs_fs_info *fs_info = rc->root->fs_info;
+   struct btrfs_fs_info *fs_info = rc->fs_info;
 
while (atomic_read(&rc->elems)) {
if (!atomic_read(&fs_info->reada_works_cnt))
-- 
2.7.1



[PATCH 12/18] btrfs: root->fs_info cleanup, update_block_group{,flags}

2016-12-01 Thread jeffm
From: Jeff Mahoney 

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cc9ae54..2e395d4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -61,7 +61,7 @@ enum {
 };
 
 static int update_block_group(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, u64 bytenr,
+ struct btrfs_fs_info *fs_info, u64 bytenr,
  u64 num_bytes, int alloc);
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
@@ -6182,11 +6182,10 @@ void btrfs_delalloc_release_space(struct inode *inode, 
u64 start, u64 len)
 }
 
 static int update_block_group(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, u64 bytenr,
+ struct btrfs_fs_info *info, u64 bytenr,
  u64 num_bytes, int alloc)
 {
struct btrfs_block_group_cache *cache = NULL;
-   struct btrfs_fs_info *info = root->fs_info;
u64 total = num_bytes;
u64 old_val;
u64 byte_in_group;
@@ -6227,7 +6226,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
spin_lock(&cache->space_info->lock);
spin_lock(&cache->lock);
 
-   if (btrfs_test_opt(root->fs_info, SPACE_CACHE) &&
+   if (btrfs_test_opt(info, SPACE_CACHE) &&
cache->disk_cache_state < BTRFS_DC_CLEAR)
cache->disk_cache_state = BTRFS_DC_CLEAR;
 
@@ -7088,7 +7087,8 @@ static int __btrfs_free_extent(struct btrfs_trans_handle 
*trans,
goto out;
}
 
-   ret = update_block_group(trans, root, bytenr, num_bytes, 0);
+   ret = update_block_group(trans, root->fs_info, bytenr,
+num_bytes, 0);
if (ret) {
btrfs_abort_transaction(trans, ret);
goto out;
@@ -8104,7 +8104,7 @@ static int alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
if (ret)
return ret;
 
-   ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
+   ret = update_block_group(trans, fs_info, ins->objectid, ins->offset, 1);
if (ret) { /* -ENOENT, logic error */
btrfs_err(fs_info, "update block group failed for %llu %llu",
ins->objectid, ins->offset);
@@ -8190,9 +8190,8 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
if (ret)
return ret;
 
-   ret = update_block_group(trans, root, ins->objectid,
-root->fs_info->nodesize,
-1);
+   ret = update_block_group(trans, fs_info, ins->objectid,
+fs_info->nodesize, 1);
if (ret) { /* -ENOENT, logic error */
btrfs_err(fs_info, "update block group failed for %llu %llu",
ins->objectid, ins->offset);
@@ -9280,7 +9279,7 @@ int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
return ret;
 }
 
-static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
+static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 {
u64 num_devices;
u64 stripped;
@@ -9289,11 +9288,11 @@ static u64 update_block_group_flags(struct btrfs_root 
*root, u64 flags)
 * if restripe for this chunk_type is on pick target profile and
 * return, otherwise do the usual balance
 */
-   stripped = get_restripe_target(root->fs_info, flags);
+   stripped = get_restripe_target(fs_info, flags);
if (stripped)
return extended_to_chunk(stripped);
 
-   num_devices = root->fs_info->fs_devices->rw_devices;
+   num_devices = fs_info->fs_devices->rw_devices;
 
stripped = BTRFS_BLOCK_GROUP_RAID0 |
BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
@@ -9409,7 +9408,7 @@ int btrfs_inc_block_group_ro(struct btrfs_root *root,
 * if we are changing raid levels, try to allocate a corresponding
 * block group with the new raid level.
 */
-   alloc_flags = update_block_group_flags(root, cache->flags);
+   alloc_flags = update_block_group_flags(root->fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = do_chunk_alloc(trans, root, alloc_flags,
 CHUNK_ALLOC_FORCE);
@@ -9435,7 +9434,8 @@ int btrfs_inc_block_group_ro(struct btrfs_root *root,
ret = inc_block_group_ro(cache, 0);
 out:
if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
-   alloc_flags = update_block_group_flags(root, 

btrfs lockdep warning: possible recursive locking of &ei->log_mutex

2016-12-01 Thread Eric Biggers
When using btrfs and a kernel with lockdep enabled (4.9-rc7, but this easily
could have been there for a while) I got the following lockdep warning:

[   37.796703] =
[   37.796773] [ INFO: possible recursive locking detected ]
[   37.796854] 4.9.0-rc7 #351 Tainted: G L 
[   37.796917] -
[   37.796986] systemd-journal/280 is trying to acquire lock:
[   37.797051]  (
[   37.797077] &ei->log_mutex
[   37.797119] ){+.+...}
[   37.797135] , at: 
[   37.797176] [] btrfs_log_inode+0x33d/0x20a0
[   37.797254] 
   but task is already holding lock:
[   37.797328]  (
[   37.797353] &ei->log_mutex
[   37.797396] ){+.+...}
[   37.797411] , at: 
[   37.797449] [] btrfs_log_inode+0x33d/0x20a0
[   37.797521] 
   other info that might help us debug this:
[   37.797603]  Possible unsafe locking scenario:

[   37.797682]CPU0
[   37.797717]
[   37.797751]   lock(
[   37.797782] &ei->log_mutex
[   37.797822] );
[   37.797848]   lock(
[   37.797878] &ei->log_mutex
[   37.797920] );
[   37.797946] 
*** DEADLOCK ***

[   37.798020]  May be due to missing lock nesting notation

[   37.798120] 3 locks held by systemd-journal/280:
[   37.798180]  #0: 
[   37.798208]  (
[   37.798238] &sb->s_type->i_mutex_key
[   37.798269] #9
[   37.798299] ){+.+.+.}
[   37.798315] , at: 
[   37.798354] [] btrfs_sync_file+0x1b1/0x990
[   37.798425]  #1: 
[   37.798453]  (
[   37.798483] sb_internal
[   37.798501] ){.+.+.+}
[   37.798538] , at: 
[   37.798553] [] start_transaction+0x7bc/0xe60
[   37.798632]  #2: 
[   37.798660]  (
[   37.798690] &ei->log_mutex
[   37.798711] ){+.+...}
[   37.798747] , at: 
[   37.798763] [] btrfs_log_inode+0x33d/0x20a0
[   37.798840] 
   stack backtrace:
[   37.798902] CPU: 2 PID: 280 Comm: systemd-journal Tainted: G L  
4.9.0-rc7 #351
[   37.799017] Hardware name: Dell Inc. Inspiron 15-7568/0M5YMV, BIOS 01.00.00 
08/07/2015
[   37.799111]  880213ca72d0 81a1fe82 84056000 
83a98ae0
[   37.799230]  880213ca7498 811f43c1 dc00 
88021470cf20
[   37.799348]  dc00 82fa15c0 88021470cf00 
110042794e72
[   37.799465] Call Trace:
[   37.799522]  [] dump_stack+0x68/0x96
[   37.799591]  [] __lock_acquire+0x1bd1/0x5290
[   37.799670]  [] ? debug_check_no_locks_freed+0x280/0x280
[   37.799760]  [] ? debug_check_no_locks_freed+0x280/0x280
[   37.799849]  [] ? mark_held_locks+0xc8/0x120
[   37.799926]  [] ? mark_held_locks+0xc8/0x120
[   37.81]  [] ? __mutex_unlock_slowpath+0x221/0x420
[   37.800088]  [] lock_acquire+0xdd/0x190
[   37.800160]  [] ? btrfs_log_inode+0x33d/0x20a0
[   37.800239]  [] ? 
__ww_mutex_lock_interruptible+0x1500/0x1500
[   37.800333]  [] mutex_lock_nested+0xa4/0x7e0
[   37.803435]  [] ? btrfs_log_inode+0x33d/0x20a0
[   37.806090]  [] ? mutex_trylock+0x3f0/0x3f0
[   37.808347]  [] ? __btrfs_btree_balance_dirty+0xcf/0x1a0
[   37.811207]  [] ? 
btrfs_commit_inode_delayed_inode+0x23b/0x360
[   37.814263]  [] btrfs_log_inode+0x33d/0x20a0
[   37.817343]  [] ? iget5_locked+0x8f/0x3a0
[   37.820420]  [] ? _raw_spin_unlock+0x22/0x30
[   37.823462]  [] ? btrfs_i_callback+0x20/0x20
[   37.826466]  [] ? btrfs_log_changed_extents+0x15b0/0x15b0
[   37.829300]  [] ? release_extent_buffer+0x102/0x150
[   37.832101]  [] ? release_extent_buffer+0x102/0x150
[   37.834628]  [] ? free_extent_buffer+0xe2/0x220
[   37.837466]  [] ? btrfs_release_path+0x85/0x1b0
[   37.840325]  [] btrfs_log_inode+0x1723/0x20a0
[   37.843265]  [] ? btrfs_log_changed_extents+0x15b0/0x15b0
[   37.846248]  [] ? mutex_lock_nested+0x511/0x7e0
[   37.849177]  [] ? mark_held_locks+0xc8/0x120
[   37.852120]  [] ? __mutex_unlock_slowpath+0x221/0x420
[   37.855053]  [] ? 
__ww_mutex_lock_interruptible+0x1500/0x1500
[   37.858013]  [] btrfs_log_inode_parent+0x689/0x2280
[   37.860982]  [] ? btrfs_end_log_trans+0x70/0x70
[   37.863918]  [] ? dget_parent+0x91/0x350
[   37.866855]  [] ? dget_parent+0xa9/0x350
[   37.869762]  [] btrfs_log_dentry_safe+0x74/0xa0
[   37.872650]  [] btrfs_sync_file+0x54e/0x990
[   37.875570]  [] ? start_ordered_ops+0x20/0x20
[   37.878502]  [] ? syscall_trace_enter+0x289/0x7c0
[   37.881140]  [] ? start_ordered_ops+0x20/0x20
[   37.883550]  [] vfs_fsync_range+0xe8/0x280
[   37.886026]  [] do_fsync+0x38/0x60
[   37.888499]  [] ? SyS_syncfs+0xc0/0xc0
[   37.891003]  [] SyS_fsync+0xb/0x10
[   37.893851]  [] do_syscall_64+0x17c/0x420
[   37.896879]  [] entry_SYSCALL64_slow_path+0x25/0x25


[PATCH 2/2] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges

2016-12-01 Thread Qu Wenruo
[BUG]
In the following case, btrfs can underflow the qgroup reserved space
in an error path:
(Page size 4K, function names without the "btrfs_" prefix)

         Task A                  |          Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserve_data()
                                 | |     Range aligned to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EDQUOT)
                                 | |- free_reserved_data_space()
                                 |    |- qgroup_free_data()
                                 |       Range aligned to page range
                                 |       [0, 4K)
                                 |       Freeing 4K
(Special thanks to Chandan for the detailed report and analysis)

[CAUSE]
In the case above, Task B frees the reserved data range [0, 4K), which
was actually reserved by Task A.

At writeback time, the page dirtied by Task A goes through the writeback
routine, which frees the 4K of reserved data space again at file extent
insert time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add a @reserved parameter so that it only
frees data ranges reserved by the previous btrfs_qgroup_reserve_data().
In the case above, Task B will then try to free 0 bytes, so no underflow.

Reported-by: Chandan Rajendra 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  8 ---
 fs/btrfs/extent-tree.c | 12 ++
 fs/btrfs/file.c| 32 +++--
 fs/btrfs/inode.c   | 35 +--
 fs/btrfs/ioctl.c   |  4 ++--
 fs/btrfs/qgroup.c  | 64 ++
 fs/btrfs/qgroup.h  |  3 ++-
 fs/btrfs/relocation.c  |  8 +++
 8 files changed, 117 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9f7e109..1d5eaf3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2697,7 +2697,11 @@ enum btrfs_flush_state {
 int btrfs_check_data_free_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
+void btrfs_delalloc_release_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len,
+   enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
@@ -2720,8 +2724,6 @@ void btrfs_delalloc_release_metadata(struct inode *inode, 
u64 num_bytes,
 int btrfs_delalloc_reserve_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len,
enum btrfs_metadata_reserve_type reserve_type);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
-   enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f116bcf..a1e9c7b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4365,7 +4365,8 @@ void btrfs_free_reserved_data_space_noquota(struct inode 
*inode, u64 start,
  * This one will handle the per-inode data rsv map for accurate reserved
  * space framework.
  */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+void btrfs_free_reserved_data_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
 
@@ -4375,7 +4376,7 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 start, u64 len)
start = round_down(start, root->sectorsize);
 
btrfs_free_reserved_data_space_noquota(inode, start, len);
-   

[PATCH 1/2] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2016-12-01 Thread Qu Wenruo
Introduce a new parameter, struct extent_changeset, for
btrfs_qgroup_reserve_data() and its callers.

Such an extent_changeset is already used inside btrfs_qgroup_reserve_data()
to record which ranges the current call reserved, so it can free them in
the error path.

The reason we need to export it to callers is that, in the buffered write
error path, without knowing exactly which ranges we reserved in the current
allocation, we can free space which was not reserved by us.

This leads to qgroup reserved space underflow.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  6 --
 fs/btrfs/extent-tree.c | 18 --
 fs/btrfs/extent_io.h   | 22 ++
 fs/btrfs/file.c| 13 ++---
 fs/btrfs/inode-map.c   |  4 +++-
 fs/btrfs/inode.c   | 19 ++-
 fs/btrfs/ioctl.c   |  5 -
 fs/btrfs/qgroup.c  | 25 +++--
 fs/btrfs/qgroup.h  |  3 ++-
 fs/btrfs/relocation.c  |  4 +++-
 10 files changed, 89 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0db037c..9f7e109 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2694,7 +2694,8 @@ enum btrfs_flush_state {
COMMIT_TRANS=   6,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_check_data_free_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
@@ -2716,7 +2717,8 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, 
u64 num_bytes,
enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
enum btrfs_metadata_reserve_type reserve_type);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+int btrfs_delalloc_reserve_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len,
enum btrfs_metadata_reserve_type reserve_type);
 void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
enum btrfs_metadata_reserve_type reserve_type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dae287d..f116bcf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3357,6 +3357,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
 {
struct btrfs_root *root = block_group->fs_info->tree_root;
struct inode *inode = NULL;
+   struct extent_changeset data_reserved = EMPTY_CHANGESET;
u64 alloc_hint = 0;
int dcs = BTRFS_DC_ERROR;
u64 num_pages = 0;
@@ -3474,7 +3475,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
num_pages *= 16;
num_pages *= PAGE_SIZE;
 
-   ret = btrfs_check_data_free_space(inode, 0, num_pages);
+   ret = btrfs_check_data_free_space(inode, &data_reserved, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3505,6 +3506,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
block_group->disk_cache_state = dcs;
spin_unlock(&block_group->lock);
 
+   extent_changeset_release(&data_reserved);
return ret;
 }
 
@@ -4302,7 +4304,8 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, 
u64 bytes)
  * Will replace old btrfs_check_data_free_space(), but for patch split,
  * add a new function first and then replace it.
  */
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+int btrfs_check_data_free_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
int ret;
@@ -4317,9 +4320,11 @@ int btrfs_check_data_free_space(struct inode *inode, u64 
start, u64 len)
return ret;
 
/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
-   ret = btrfs_qgroup_reserve_data(inode, start, len);
+   ret = btrfs_qgroup_reserve_data(inode, reserved, start, len);
if (ret < 0)
btrfs_free_reserved_data_space_noquota(inode, start, len);
+   else
+   ret = 0;
return ret;
 }
 
@@ -6254,12 +6259,13 @@ void btrfs_delalloc_release_metadata(struct inode 
*inode, u64 num_bytes,
  * Return 0 for success
  * Return <0 for error(-ENOSPC or -EQUOT)
  */
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
-enum btrfs_metadata_reserve_type reserve_type)
+int btrfs_delalloc_reserve_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len,
+ 

Re: resend: Re: Btrfs: adjust len of writes if following a preallocated extent

2016-12-01 Thread Liu Bo
On Thu, Nov 24, 2016 at 11:13:37AM +, Filipe Manana wrote:
> On Wed, Nov 23, 2016 at 9:22 PM, Liu Bo  wrote:
> > Hi,
> >
> > On Wed, Nov 23, 2016 at 06:21:35PM +0100, Stefan Priebe - Profihost AG wrote:
> >> Hi,
> >>
> >> sorry last mail was from the wrong box.
> >>
> >> Am 04.11.2016 um 20:20 schrieb Liu Bo:
> >> > If we have
> >> >
> >> > |0--hole--4095||4096--preallocate--12287|
> >> >
> >> > instead of using preallocated space, a 8K direct write will just
> >> > create a new 8K extent and it'll end up with
> >> >
> >> > |0--new extent--8191||8192--preallocate--12287|
> >> >
> >> > It's because we find a hole em and then go to create a new 8K
> >> > extent directly without adjusting @len.
> >>
> >> after applying that one on top of my 4.4 btrfs branch (includes patches
> >> up to 4.10 / next). i'm getting deadlocks in btrfs.
> >
> > This is really interesting, thanks for the quick testing.
> >
> > After going through the stacks listed below, I think the patch has
> > exposed a bug around BTRFS_I(inode)->dio_sem:
> >
> > 1. Since fsync has acquired inode_lock(), the dio write must be
> > an overwrite within EOF.
> >
> > 2. Let's say the inode size is 16k and it already has a preallocated extent
> > [4k, 8k], then we feed it with a dio write against [0k, 8k]. With this patch
> > applied, the write can be split into a new extent of [0, 4k] and a
> > fill-write against the preallocated one [4k, 8k],
> >
> > 3.
> > dio                                            fsync
> > ->btrfs_direct_IO                              btrfs_sync_file
> >  ->do_direct_IO
> >   ->get_more_blocks()                          ->inode_lock()
> >    ->btrfs_get_blocks_direct() # for [0, 8k]   ->btrfs_log_inode()
> >     ->btrfs_new_direct_extent()                 ->btrfs_log_changed_extents()
> >      ->btrfs_create_dio_extent()
> >       ->down_read(&BTRFS_I(inode)->dio_sem)
> >         # dio write is split and
> >         # em of [0, 4k] is inserted as well as
> >         # the ordered extent.
> >       ->up_read(&BTRFS_I(inode)->dio_sem)
> >    # do_direct_IO tries to collect more pages
> >    # before sending them down, so [0, 4k] is not
> >    # yet submitted.
> >                                                  ->down_write(&BTRFS_I(inode)->dio_sem)
> >                                                    # found ordered extent of [0, 4k]
> >                                                    # wait for [0, 4k] to finish
> >   ->get_more_blocks()
> >    ->btrfs_get_blocks_direct() # for [4k, 8k]
> >     ->btrfs_create_dio_extent()
> >      ->down_read(&BTRFS_I(inode)->dio_sem)
> >        # deadlock occurs
> >
> > 4. _Without_ this patch, we could hit the deadlock as well under space
> > pressure, i.e. if we request [0, 8k], but btrfs_reserve_extent() returns
> > only [0, 4k].
> >
> > (Filipe may correct me, cc'd Filipe.)
> 
> The analysis is correct Bo.
> Originally to fix races between fsync and direct IO writes there was a
> solution [1, 2] that didn't involve adding a semaphore and relied on
> creating first the ordered extents and then the extent maps only in
> the direct IO write path (we do things in the reverse order everywhere
> else). It worked and was documented in comments but wasn't
> particularly elegant and Josef was not happy because of that, so then
> we added the semaphore and made direct IO write path create the extent
> maps and ordered extents in the same order as everywhere else [3].
> 
> So here I can only see 2 simple solutions. Either revert [3] (which
> added the semaphore) or acquire the semaphore at a higher level in
> direct IO write path like this:
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1f980ef..b2c277d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7237,7 +7237,6 @@ static struct extent_map
> *btrfs_create_dio_extent(struct inode *inode,
> struct extent_map *em = NULL;
> int ret;
> 
> -   down_read(&BTRFS_I(inode)->dio_sem);
> if (type != BTRFS_ORDERED_NOCOW) {
> em = create_pinned_em(inode, start, len, orig_start,
>   block_start, block_len, orig_block_len,
> @@ -7256,8 +7255,6 @@ static struct extent_map
> *btrfs_create_dio_extent(struct inode *inode,
> em = ERR_PTR(ret);
> }
>   out:
> -   up_read(&BTRFS_I(inode)->dio_sem);
> -
> return em;
>  }
> 
> @@ -8715,11 +8712,14 @@ static ssize_t btrfs_direct_IO(struct kiocb
> *iocb, struct iov_iter *iter)
> wakeup = false;
> }
> 
> +   if (iov_iter_rw(iter) == WRITE)
> +   down_read(&BTRFS_I(inode)->dio_sem);
> 

Re: [PATCH] Btrfs: fix infinite loop when tree log recovery

2016-12-01 Thread robbieko

Hi Filipe,

Thank you for your help.

I will make up the incremental send change log as soon as possible.

Thanks.
robbieko

Filipe Manana wrote on 2016-12-01 19:14:

On Thu, Dec 1, 2016 at 1:42 AM, robbieko  wrote:

Hi Filipe,

Thank you for your review.
I have seen your modified change logs for the patches below:
Btrfs: fix tree search logic when replaying directory entry deletes
Btrfs: fix deadlock caused by fsync when logging directory entries
Btrfs: fix enospc in hole punching
So what's the next step?
Should I modify the patch change logs and then send them again?


You don't need to do anything else for those patches.
Thanks.



Thanks.
robbieko

Filipe Manana wrote on 2016-12-01 00:53:

On Fri, Oct 7, 2016 at 10:30 AM, robbieko  wrote:


From: Robbie Ko 

If the log tree looks like this:
leaf N:
...
item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
dir log end 1275809046
leaf N+1:
item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
dir log end 18446744073709551615
...

then when start_ret > 1275809046, path->slots[0] never becomes >= nritems,
so we never go to the next leaf.



This doesn't explain how the infinite loop happens. Nor exactly how
any problem happens.

It's important to have detailed information in the change logs. I
understand that english isn't your native tongue (it's not mine
either, and I'm far from mastering it), but that's not an excuse to
not express all the important information in detail (we can all live
with grammar errors and typos, and we all do such errors frequently).

I've added this patch to my branch at

https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10
but with a modified changelog and subject.

The results of the wrong logic that decides when to move to the next
leaf are unpredictable, and it won't always result in an infinite
loop. We are accessing a slot that doesn't point to an item, i.e. a
memory location containing garbage or something unexpected, and in the
worst case that location is beyond the last page of the extent buffer.
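
The off-by-one described above can be shown with a small userspace model.
This is only a sketch under assumptions, not the tree-log code: toy_leaf,
toy_path, toy_next_leaf(), toy_advance_old(), toy_advance_new() and
toy_slot_is_valid() are hypothetical names, and a leaf is reduced to an
item count plus a pointer to the next leaf.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LEAF_CAP 4

/* Hypothetical leaf: nritems valid keys, then a link to the next leaf. */
struct toy_leaf {
    int nritems;
    unsigned long keys[TOY_LEAF_CAP];
    struct toy_leaf *next;
};

/* Hypothetical path: current leaf and current slot in that leaf. */
struct toy_path {
    struct toy_leaf *leaf;
    int slot;
};

/* Move to slot 0 of the next leaf; return 1 if there is none. */
static int toy_next_leaf(struct toy_path *p)
{
    if (p->leaf->next == NULL)
        return 1;
    p->leaf = p->leaf->next;
    p->slot = 0;
    return 0;
}

/* Old logic: test the slot first, increment otherwise. */
static int toy_advance_old(struct toy_path *p)
{
    if (p->slot >= p->leaf->nritems)
        return toy_next_leaf(p);
    p->slot++;              /* may leave slot == nritems */
    return 0;
}

/* Patched logic: increment first, then test against nritems. */
static int toy_advance_new(struct toy_path *p)
{
    p->slot++;
    if (p->slot >= p->leaf->nritems)
        return toy_next_leaf(p);
    return 0;
}

/* A slot may only be read if it indexes an existing item. */
static int toy_slot_is_valid(const struct toy_path *p)
{
    return p->slot >= 0 && p->slot < p->leaf->nritems;
}
```

Starting on the last item of a leaf, toy_advance_old() returns success
with slot == nritems, so the caller would read a key from a slot that
holds no item. toy_advance_new() moves to slot 0 of the next leaf instead.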


Thanks.




Signed-off-by: Robbie Ko 
---
 fs/btrfs/tree-log.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index ef9c55b..e63dd99 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct
btrfs_root *root,
 next:
/* check the next slot in the tree to see if it is a valid item */
nritems = btrfs_header_nritems(path->nodes[0]);
+   path->slots[0]++;
if (path->slots[0] >= nritems) {
ret = btrfs_next_leaf(root, path);
if (ret)
goto out;
-   } else {
-   path->slots[0]++;
}

btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
--
1.9.1











Re: [PATCH] fstests: btrfs, add missing umount for raid5 tests 124 and 125

2016-12-01 Thread Anand Jain


Hi,

I didn't add a umount at the end of the test because
_check_btrfs_filesystem() already does it; that function gets called
since this test does not specify _require_scratch_nocheck.



 _check_btrfs_filesystem()
{
device=$1

# If type is set, we're mounted
type=`_fs_type $device`
ok=1

if [ "$type" = "$FSTYP" ]
then
# mounted ...
mountpoint=`_umount_or_remount_ro $device`  <
fi

btrfsck $device >$tmp.fsck 2>&1



I faced a similar problem in some other tests, and I found that
adding a delay is the right approach. For example:

--
diff --git a/tests/generic/298 b/tests/generic/298
index e85db1266fa9..4092efa6b961 100755
--- a/tests/generic/298
+++ b/tests/generic/298
@@ -92,7 +92,7 @@ echo "reflink of $n bytes took $delta seconds" >> 
$seqres.full
 test $delta -gt $timeout && _fail "reflink didn't stop in time, n=$n 
t=$delta"


 echo "Check scratch fs"
-sleep 2# give it a few seconds to actually die...
+sleep 40   # give it a few seconds to actually die...

 # success, all done
 status=0
--


HTH
-Anand



On 11/24/16 14:25, fdman...@kernel.org wrote:

From: Filipe Manana 

The tests mount the second device in the device pool but never unmount
it, causing the next test to fail.

Example:

$ cat local.config
export TEST_DEV=/dev/sdb
export TEST_DIR=/home/fdmanana/btrfs-tests/dev
export SCRATCH_MNT="/home/fdmanana/btrfs-tests/scratch_1"
export SCRATCH_DEV_POOL="/dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg"
export FSTYP=btrfs

$ ./check btrfs/125 btrfs/126
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.8.0-rc8-btrfs-next-35+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/125 23s ... 22s
btrfs/126 1s ... - output mismatch (see 
/home/fdmanana/git/hub/xfstests/results//btrfs/126.out.bad)
--- tests/btrfs/126.out 2016-11-24 06:11:42.048372385 +
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/126.out.bad  
2016-11-24 06:16:35.987988895 +
@@ -1,2 +1,5 @@
 QA output created by 126
-pwrite: Disk quota exceeded
+ERROR: /dev/sdc is mounted
+mount: /dev/sdc is already mounted or /home/fdmanana/btrfs-tests/scratch_1 
busy
+   /dev/sdc is already mounted on /home/fdmanana/btrfs-tests/scratch_1
+/home/fdmanana/btrfs-tests/scratch_1/test_file: Disk quota exceeded
...
(Run 'diff -u tests/btrfs/126.out 
/home/fdmanana/git/hub/xfstests/results//btrfs/126.out.bad'  to see the entire 
diff)
Ran: btrfs/125 btrfs/126
Failures: btrfs/126
Failed 1 of 2 tests

So just make sure those test unmount the device before they finish.

Signed-off-by: Filipe Manana 
---
 tests/btrfs/124 | 1 +
 tests/btrfs/125 | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tests/btrfs/124 b/tests/btrfs/124
index 2618a26..7206094 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -159,6 +159,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
echo "Inital sum does not match with data on dev2 written by balance"
 fi

+$UMOUNT_PROG $dev2
 _scratch_dev_pool_put
 _test_mount

diff --git a/tests/btrfs/125 b/tests/btrfs/125
index 1062b87..91aa8d8 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -175,6 +175,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
echo "Inital sum does not match with data on dev2 written by balance"
 fi

+$UMOUNT_PROG $dev2
 _scratch_dev_pool_put
 _test_mount







Re: 4.8.8, bcache deadlock and hard lockup

2016-12-01 Thread Eric Wheeler
On Wed, 30 Nov 2016, Marc MERLIN wrote:
> On Wed, Nov 30, 2016 at 03:57:28PM -0800, Eric Wheeler wrote:
> > > I'll start another separate thread with the btrfs folks on how much
> > > pressure is put on the system, but on your side it would be good to help
> > > ensure that bcache doesn't crash the system altogether if too many
> > > requests are allowed to pile up.
> > 
> > Try BFQ.  It is AWESOME and helps reduce the congestion problem with bulk 
> > writes at the request queue on its way to the spinning disk or SSD:
> > http://algo.ing.unimo.it/people/paolo/disk_sched/
> > 
> > use the latest BFQ git here, merge it into v4.8.y:
> > https://github.com/linusw/linux-bfq/commits/bfq-v8
> > 
> > This doesn't completely fix the dirty_ratio problem, but it is far better 
> > than CFQ or deadline in my opinion (and experience).
> 
> That's good to know thanks.
> But for my uninformed opinion, is there anything bcache can do to throttle
> incoming requests if they are piling up, or they're coming from producers
> upstream and bcache has no choice but try and process them as quickly as
> possible without a way to block the sender if too many are coming?

Not really.  The congestion isn't in bcache, it's at the disk queue beyond 
bcache, but userspace processes are blocked by the (huge) pagecache dirty 
writeback which happens before bcache gets it and must complete before 
userspace may proceed: 

fs -> pagecache -> bcache -> {ssd,disk}  

The real issue is that the dirty page cache gets really big, flushes, 
waits for downstream devices (bcache->ssd,disk) to finish, and then 
returns to userspace.  The only way to limit the dirty cache is to use 
those options that Linus mentioned.

BFQ can help for processes not tied to the flush because it may re-order 
other process requests ahead of the big flush---so even though a big flush 
is happening and that process is stalled, others might proceed without 
delay.

See this thread, too:

https://groups.google.com/forum/#!msg/bfq-iosched/M2M_UhbC05A/hf6Ni9JbAQAJ

--
Eric Wheeler



> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/  


Re: Convert from RAID 5 to 10

2016-12-01 Thread Tomasz Kusmierz
FYI.
There is an old saying in the embedded circles I move in that evolved
from Arthur C. Clarke's "Any sufficiently advanced technology is
indistinguishable from magic." The engineering version states: "Any
sufficiently advanced incompetence is indistinguishable from malice."
Also, I'll quote you on the throwing-under-the-bus thing :) (I actually
like that justification)

On 1 December 2016 at 17:28, Chris Murphy  wrote:
> On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz  wrote:
>
>> Please, I beg you, add another column to the man page and wiki stating
>> clearly how many lost devices every profile can withstand. I frequently
>> have to explain how btrfs profiles work and show quotes from this
>> mailing list because "Dunning-Kruger effect victims" keep popping up
>> with statements like "in btrfs raid10 with 8 drives you can lose 4
>> drives" ... I seriously beg you guys, my beating stick is half broken
>> by now.
>
> You need a new stick. It's called the ad hominem attack. When stupid
> people say stupid things, the dispute is not about the facts or
> opinions in the argument itself, but rather the person involved. There
> is the possibility this is more than stupidity, it really borders on
> maliciousness. Any ethical code of conduct for a list will accept ad
> hominem attacks over the willful dissemination of provably wrong
> information. When stupid assholes throw users under the bus with
> provably wrong (and bad) advice, it becomes something of an obligation
> to resort to name calling.
>
> Of course, I'd also like the wiki to clearly state the only profile
> that tolerates more than one device loss is raid6; and be very
> explicit with the manifestly wrong terminology being used by Btrfs's
> raid10 terminology. That is a fairly egregious violation of common
> terminology and the trust we're supposed to be developing, both in the
> usage of common terms, but also in Btrfs specifically.
>
>
>
> --
> Chris Murphy
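
For illustration, the extra column requested above could look like the
small lookup table below. This is only a sketch: toy_profile,
toy_profiles and toy_max_lost_devices() are hypothetical names, and the
numbers reflect the profiles as discussed in this thread (only raid6
tolerates more than one lost device; raid1 and raid10 tolerate one
regardless of how many devices are in the filesystem).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table row: profile name and guaranteed device losses. */
struct toy_profile {
    const char *name;
    int max_lost_devices;
};

/* Values as stated in this thread; not read from any btrfs API. */
static const struct toy_profile toy_profiles[] = {
    { "single", 0 },
    { "dup",    0 },
    { "raid0",  0 },
    { "raid1",  1 },
    { "raid10", 1 },
    { "raid5",  1 },
    { "raid6",  2 },
};

/* Return the guaranteed tolerated device losses, or -1 if unknown. */
static int toy_max_lost_devices(const char *name)
{
    size_t n = sizeof(toy_profiles) / sizeof(toy_profiles[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_profiles[i].name, name) == 0)
            return toy_profiles[i].max_lost_devices;
    }
    return -1;
}
```

With this table, toy_max_lost_devices("raid10") is 1 even for an 8-drive
filesystem, which is exactly the point that keeps getting misquoted.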


[PATCH] Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly

2016-12-01 Thread Liu Bo
btrfs_ordered_update_i_size can be called by truncate and endio, but only endio
takes an ordered_extent, which contains the completed IO.

While truncating down a file, if there are some in-flight IOs,
btrfs_ordered_update_i_size in endio will set disk_i_size to @orig_offset, which
is zero.  If truncating down fails somehow, we try to recover the in-memory
isize with this zeroed disk_i_size.

Fix it by only updating disk_i_size with @orig_offset when
btrfs_ordered_update_i_size is not called from endio while truncating down, and
by waiting for in-flight IOs to complete their work before recovering the
in-memory size.

Besides fixing the above issue, add an assertion for last_size to double check
we truncate down to the desired size.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c| 14 ++
 fs/btrfs/ordered-data.c |  9 +++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 09157dd..ef3594d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4682,6 +4682,13 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
btrfs_free_path(path);
 
+   if (err == 0) {
+   /* only inline file may have last_size != new_size */
+   if (new_size >= root->sectorsize ||
+   new_size > root->fs_info->max_inline)
+   ASSERT(last_size == new_size);
+   }
+
if (be_nice && bytes_deleted > SZ_32M) {
unsigned long updates = trans->delayed_ref_updates;
if (updates) {
@@ -5064,6 +5071,13 @@ static int btrfs_setsize(struct inode *inode, struct 
iattr *attr)
if (ret && inode->i_nlink) {
int err;
 
+   /* To get a stable disk_i_size */
+   err = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+   if (err) {
+   btrfs_orphan_del(NULL, inode);
+   return err;
+   }
+
/*
 * failed to truncate, disk_i_size is only adjusted down
 * as we remove extents, so it should represent the true
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b2d1e95..5eaa25a 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -982,8 +982,13 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 
offset,
}
disk_i_size = BTRFS_I(inode)->disk_i_size;
 
-   /* truncate file */
-   if (disk_i_size > i_size) {
+   /*
+* truncate file.
+* If ordered is not NULL, then this is called from endio and
+* disk_i_size will be updated by either truncate itself or any
+* in-flight IOs which are inside the disk_i_size.
+*/
+   if (!ordered && disk_i_size > i_size) {
BTRFS_I(inode)->disk_i_size = orig_offset;
ret = 0;
goto out;
-- 
2.5.5



[PATCH] Btrfs: fix lockdep warning about log_mutex

2016-12-01 Thread Liu Bo
While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
different inode's log_mutex while holding the current inode's log_mutex, and
lockdep has complained about this with a possible deadlock warning.

Fix this by using mutex_lock_nested() when processing the other inode's
log_mutex.

Signed-off-by: Liu Bo 
---
 fs/btrfs/tree-log.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 3d33c4e..e961451 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -37,6 +37,7 @@
  */
 #define LOG_INODE_ALL 0
 #define LOG_INODE_EXISTS 1
+#define LOG_OTHER_INODE 2
 
 /*
  * directory trouble cases
@@ -4624,7 +4625,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
if (S_ISDIR(inode->i_mode) ||
(!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
   &BTRFS_I(inode)->runtime_flags) &&
-inode_only == LOG_INODE_EXISTS))
+inode_only >= LOG_INODE_EXISTS))
max_key.type = BTRFS_XATTR_ITEM_KEY;
else
max_key.type = (u8)-1;
@@ -4648,7 +4649,12 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
return ret;
}
 
-   mutex_lock(&BTRFS_I(inode)->log_mutex);
+   if (inode_only == LOG_OTHER_INODE) {
+   inode_only = LOG_INODE_EXISTS;
+   mutex_lock_nested(&BTRFS_I(inode)->log_mutex, 1);
+   } else {
+   mutex_lock(&BTRFS_I(inode)->log_mutex);
+   }
 
/*
 * a brute force approach to making sure we get the most uptodate
@@ -4800,7 +4806,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle 
*trans,
 * unpin it.
 */
err = btrfs_log_inode(trans, root, other_inode,
- LOG_INODE_EXISTS,
+ LOG_OTHER_INODE,
  0, LLONG_MAX, ctx);
iput(other_inode);
if (err)
-- 
2.5.5



[PATCH] Btrfs: add truncated_len for ordered extent tracepoints

2016-12-01 Thread Liu Bo
This can help us monitor truncated ordered extents.

Signed-off-by: Liu Bo 
---
 include/trace/events/btrfs.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index e030d6f..1dc1197 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -259,6 +259,7 @@ DECLARE_EVENT_CLASS(btrfs__ordered_extent,
__field(int,  compress_type )
__field(int,  refs  )
__field(u64,  root_objectid )
+   __field(u64,  truncated_len )
),
 
TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
@@ -273,10 +274,12 @@ DECLARE_EVENT_CLASS(btrfs__ordered_extent,
__entry->refs   = atomic_read(&ordered->refs);
__entry->root_objectid  =
BTRFS_I(inode)->root->root_key.objectid;
+   __entry->truncated_len  = ordered->truncated_len;
),
 
TP_printk_btrfs("root = %llu(%s), ino = %llu, file_offset = %llu, "
  "start = %llu, len = %llu, disk_len = %llu, "
+ "truncated_len = %llu, "
  "bytes_left = %llu, flags = %s, compress_type = %d, "
  "refs = %d",
  show_root_type(__entry->root_objectid),
@@ -285,6 +288,7 @@ DECLARE_EVENT_CLASS(btrfs__ordered_extent,
  (unsigned long long)__entry->start,
  (unsigned long long)__entry->len,
  (unsigned long long)__entry->disk_len,
+ (unsigned long long)__entry->truncated_len,
  (unsigned long long)__entry->bytes_left,
  show_ordered_flags(__entry->flags),
  __entry->compress_type, __entry->refs)
-- 
2.5.5



[PATCH] Btrfs: use down_read_nested to make lockdep silent

2016-12-01 Thread Liu Bo
If @block_group is not @used_bg, it'll try to get @used_bg's lock without
dropping @block_group's lock, and lockdep has thrown a scary deadlock warning
about it.
Fix it by using down_read_nested.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 210c94a..cdb082a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7394,7 +7394,8 @@ btrfs_lock_cluster(struct btrfs_block_group_cache 
*block_group,
 
spin_unlock(&cluster->refill_lock);
 
-   down_read(&used_bg->data_rwsem);
+   /* We should only have one-level nested. */
+   down_read_nested(&used_bg->data_rwsem, 1);
 
spin_lock(&cluster->refill_lock);
if (used_bg == cluster->block_group)
-- 
2.5.5



[PATCH] Btrfs: add 'inode' for extent map tracepoint

2016-12-01 Thread Liu Bo
'inode' is an important field for btrfs_get_extent, let's trace it.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c |  2 +-
 include/trace/events/btrfs.h | 13 -
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 603dd492..79f35b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7095,7 +7095,7 @@ struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
write_unlock(&em_tree->lock);
 out:
 
-   trace_btrfs_get_extent(root, em);
+   trace_btrfs_get_extent(root, inode, em);
 
btrfs_free_path(path);
if (trans) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index e030d6f..0e04208 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -184,14 +184,16 @@ DEFINE_EVENT(btrfs__inode, btrfs_inode_evict,
 
 TRACE_EVENT_CONDITION(btrfs_get_extent,
 
-   TP_PROTO(struct btrfs_root *root, struct extent_map *map),
+   TP_PROTO(struct btrfs_root *root, struct inode *inode,
+struct extent_map *map),
 
-   TP_ARGS(root, map),
+   TP_ARGS(root, inode, map),
 
TP_CONDITION(map),
 
TP_STRUCT__entry_btrfs(
__field(u64,  root_objectid )
+   __field(u64,  ino   )
__field(u64,  start )
__field(u64,  len   )
__field(u64,  orig_start)
@@ -204,7 +206,8 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 
TP_fast_assign_btrfs(root->fs_info,
__entry->root_objectid  = root->root_key.objectid;
-   __entry->start  = map->start;
+   __entry->ino= btrfs_ino(inode);
+   __entry->start  = map->start;
__entry->len= map->len;
__entry->orig_start = map->orig_start;
__entry->block_start= map->block_start;
@@ -214,12 +217,12 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
__entry->compress_type  = map->compress_type;
),
 
-   TP_printk_btrfs("root = %llu(%s), start = %llu, len = %llu, "
+   TP_printk_btrfs("root = %llu(%s), ino = %llu, start = %llu, len = %llu, "
  "orig_start = %llu, block_start = %llu(%s), "
  "block_len = %llu, flags = %s, refs = %u, "
  "compress_type = %u",
  show_root_type(__entry->root_objectid),
- (unsigned long long)__entry->start,
+ __entry->ino, (unsigned long long)__entry->start,
  (unsigned long long)__entry->len,
  (unsigned long long)__entry->orig_start,
  show_map_type(__entry->block_start),
-- 
2.5.5



[PATCH v2] Btrfs: fix truncate down when no_holes feature is enabled

2016-12-01 Thread Liu Bo
For such a file mapping,

[0-4k][hole][8k-12k]

In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature
is enabled") fixed disk isize not being updated in NO_HOLES mode when data is
not flushed.

However, even if data has been flushed, we can still have trouble
updating disk isize, since we update it to the 'start' of the last
evicted extent.

Reviewed-by: Chris Mason 
Signed-off-by: Liu Bo 
---
v2: Remove the assertion.

 fs/btrfs/inode.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2b790bd..09157dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4483,8 +4483,19 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (found_type > min_type) {
del_item = 1;
} else {
-   if (item_end < new_size)
+   if (item_end < new_size) {
+   /*
+* With NO_HOLES mode, for the following mapping
+*
+* [0-4k][hole][8k-12k]
+*
+* if truncating isize down to 6k, isize ends up
+* being 8k.
+*/
+   if (btrfs_fs_incompat(root->fs_info, NO_HOLES))
+   last_size = new_size;
break;
+   }
if (found_key.offset >= new_size)
del_item = 1;
else
-- 
2.5.5
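
For readers following along, the case described in the patch above can be
sketched with a small model. This is not the kernel code: it is a simplified
Python stand-in for the backwards walk over file extent items, and the
function name and the (start, end) tuple layout are assumptions made for
illustration. Only the [0-4k][hole][8k-12k] mapping and the 6k/8k figures come
from the patch.

```python
K = 1024


def truncate_last_size(extents, new_size, no_holes_fix):
    """Model of how 'last_size' ends up after walking extents from the end.

    extents: list of (start, end_exclusive) byte ranges that have items.
    Returns the value that would be used to update the on-disk isize,
    or None if it was never set.
    """
    last_size = None
    for start, end in sorted(extents, reverse=True):
        item_end = end - 1
        if item_end < new_size:
            # This item lies entirely below the new size: stop walking.
            if no_holes_fix:
                # Behaviour added by the patch for NO_HOLES filesystems.
                last_size = new_size
            break
        # Item reaches past new_size: it is deleted (or trimmed).
        last_size = start if start >= new_size else new_size
    return last_size


mapping = [(0, 4 * K), (8 * K, 12 * K)]      # [0-4k][hole][8k-12k]
before = truncate_last_size(mapping, 6 * K, no_holes_fix=False)
after = truncate_last_size(mapping, 6 * K, no_holes_fix=True)
```

In this model, `before` is 8k (the start of the last evicted extent) and
`after` is the requested 6k, which matches the comment added by the patch.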



Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-12-01 Thread Janos Toth F.
Is there any fundamental reason not to support huge writeback caches?
(I mean, besides working around bugs and/or questionably poor design
choices which no one wishes to fix.)
The obvious drawback is the increased risk of data loss upon hardware
failure or kernel panic but why couldn't the user be allowed to draw
the line between probability of data loss and potential performance
gains?

The last time I changed hardware, I put double the amount of RAM into
my little home server for the sole reason of using a relatively huge
cache, especially a huge writeback cache. However, I soon realized
that writeback ratios like 20/45 make the system unstable (OOM
reaping), even if ~90% of the memory is theoretically free, i.e. used
as some form of cache (read or write, depending on this ratio
parameter), and I ended up below the defaults to get rid of The Reaper.

My plan was to try and decrease the fragmentation of files which are
created by dumping several parallel real-time video streams into
separate files (and also minimize the HDD head seeks due to that).
(The computer in question is on a UPS.)
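
For reference, the arithmetic behind the figures in the quoted discussion
below can be sketched like this. It is a rough model only: the function names
are made up, and it applies the ratio to total RAM, whereas the kernel applies
it to the memory it considers dirtyable, so real limits will differ somewhat.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(ram_bytes, dirty_ratio_percent):
    """Approximate writeback cache limit for a given dirty_ratio."""
    return ram_bytes * dirty_ratio_percent // 100


def capped_dirty_limit_bytes(ram_bytes, ratio_percent=20, cap_bytes=200 * MIB):
    """The suggestion from the thread: 200MB or 20% of RAM, whichever is smaller."""
    return min(cap_bytes, dirty_limit_bytes(ram_bytes, ratio_percent))


limit_16g = dirty_limit_bytes(16 * GIB, 20)         # roughly 3.2 GiB
capped_16g = capped_dirty_limit_bytes(16 * GIB)     # capped at 200 MiB
capped_512m = capped_dirty_limit_bytes(512 * MIB)   # 20% of RAM is smaller here
```

A fixed cap of this kind is what vm.dirty_bytes expresses directly, which is
why it comes up further down as the alternative to the ratio knobs.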

On Thu, Dec 1, 2016 at 4:49 PM, Michal Hocko  wrote:
> On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
>> +folks from linux-mm thread for your suggestion
>>
>> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
>> > > swraid5 < bcache < dmcrypt < btrfs
>> > >
>> > > Copying with btrfs send/receive causes massive hangs on the system.
>> > > Please see this explanation from Linus on why the workaround was
>> > > suggested:
>> > > https://lkml.org/lkml/2016/11/29/667
> >> > And Linus' assessment is absolutely correct (at least, the general
>> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
>> > than willing to bet he's correct that that's the culprit).
>>
>> > > All of this mostly went away with Linus' suggestion:
>> > > echo 2 > /proc/sys/vm/dirty_ratio
>> > > echo 1 > /proc/sys/vm/dirty_background_ratio
>> > >
> >> > > But that's hiding the symptom which I think is that btrfs is piling up too many I/O
> >> > > requests during btrfs send/receive and btrfs scrub (probably balance too) and not
> >> > > looking at resulting impact to system health.
>>
>> > I see pretty much identical behavior using any number of other storage
>> > configurations on a USB 2.0 flash drive connected to a system with 16GB of
>> > RAM with the default dirty ratios because it's trying to cache up to 3.2GB
> >> > of data for writeback.  While BTRFS is doing highly sub-optimal things here,
> >> > the ancient default writeback ratios are just as much a culprit.  I would
> >> > suggest that get changed to 200MB or 20% of RAM, whichever is smaller, which
> >> > would give overall almost identical behavior to x86-32, which in turn works
>> > reasonably well for most cases.  I sadly don't have the time, patience, or
>> > expertise to write up such a patch myself though.
>>
>> Dear linux-mm folks, is that something you could consider (changing the
>> dirty_ratio defaults) given that it affects at least bcache and btrfs
>> (with or without bcache)?
>
> As much as the dirty_*ratio defaults are a major PITA, this is not something
> that would be _easy_ to change without high risks of regressions. This
> topic has been discussed many times with many good ideas, nothing really
> materialized from them though :/
>
> To be honest I really do hate dirty_*ratio and have seen many issues on
> very large machines and always suggested to use dirty_bytes instead but
> a particular value has always been a challenge to get right. It has
> always been very workload specific.
>
> That being said this is something more for IO people than MM IMHO.
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] btrfs-progs: add dev stats returncode option

2016-12-01 Thread Mike Fleetwood
On 1 December 2016 at 18:43, Austin S. Hemmelgarn  wrote:
> Currently, `btrfs device stats` returns non-zero only when there was an
> error getting the counter values.  This is fine for when it gets run by a
> user directly, but is a serious pain when trying to use it in a script or
> for monitoring since you need to parse the (not at all machine friendly)
> output to check the counter values.
>
> This patch adds an option ('-s') which causes `btrfs device stats`
> to set bit 7 in the return code if any of the counters are non-zero.
> This greatly simplifies checking from a script or monitoring software
> if any errors have been recorded.  In the event that this switch is
> passed and an error occurs reading the stats, the return code will have
> bit 0 set (so if there are errors reading counters, and the counters
> which were read were non-zero, the return value will be 129).

I don't think using bit 7 is a good idea.  Bash (and I think all
shells) report exit status 128+SIGNUM when the process is killed by a
signal.  I.e. status 129 would be returned when a process is killed by
SIGHUP.

Perhaps bit 6 would be OK to use.

Thanks,
Mike

https://tiswww.case.edu/php/chet/bash/bashref.html#Exit-Status
"Exit statuses fall between 0 and 255, though, as explained below, the
shell may use values above 125 specially. ...

When a command terminates on a fatal signal whose number is N, Bash
uses the value 128+N as the exit status. ...

If a command is not found, the child process created to execute it
returns a status of 127. If a command is found but is not executable,
the return status is 126."
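
To make the clash concrete, here is a minimal sketch of the arithmetic. It is
not btrfs-progs code: the constant and function names are invented for
illustration, and SIGHUP = 1 is the usual POSIX signal number.

```python
# Illustrative only: models the exit-status arithmetic discussed above.
SIGHUP = 1                 # POSIX signal number for SIGHUP
READ_ERROR = 1 << 0        # bit 0: error reading the counters (per the patch)
NONZERO_BIT7 = 1 << 7      # bit 7: counters non-zero, as the patch proposes
NONZERO_BIT6 = 1 << 6      # bit 6: the alternative suggested in this reply


def killed_by_signal_status(signum):
    """Exit status Bash reports for a process killed by signal `signum`."""
    return 128 + signum


def is_ambiguous(status):
    """True if `status` could also mean 'killed by signal N' (128+N, N >= 1)."""
    return status > 128


patch_status = NONZERO_BIT7 | READ_ERROR   # 129, same as killed by SIGHUP
alt_status = NONZERO_BIT6 | READ_ERROR     # 65, below the range Bash reserves
```

With bit 6, both possible values (64 and 65) stay below 126, so a calling
script can tell them apart from "not executable", "not found" and signal
deaths.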


[PATCH] btrfs-progs: doc: Update docs about RAID profiles

2016-12-01 Thread Austin S. Hemmelgarn
This adds some more info about chunk profiles in the mkfs manpage,
specifically providing better info about raid1 and raid10 profiles and
the fact that they can't survive more than one device failing.

This should hopefully make it less likely that people hit unexpected
behavior when using these profiles.

Signed-off-by: Austin S. Hemmelgarn 
---
This should work to cover most of the issues brought up on the mailing
list recently regarding this particular aspect of documentation.

 Documentation/mkfs.btrfs.asciidoc | 44 ---
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc b/Documentation/mkfs.btrfs.asciidoc
index 9b1d45a..a5a8dc1 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -247,10 +247,10 @@ There are the following block group types available:
 | single  | 1||| 1/any
 | DUP | 2 / 1 device ||| 1/any ^(see note 1)^
 | RAID0   |  || 1 to N | 2/any
-| RAID1   | 2||| 2/any
-| RAID10  | 2|| 1 to N | 4/any
-| RAID5   | 1| 1  | 2 to N - 1 | 2/any ^(see note 2)^
-| RAID6   | 1| 2  | 3 to N - 2 | 3/any ^(see note 3)^
+| RAID1   | 2||| 2/any ^(see note 2)^
+| RAID10  | 2|| 1 to N | 4/any ^(see note 2)^
+| RAID5   | 1| 1  | 2 to N - 1 | 2/any ^(see note 3)^
+| RAID6   | 1| 2  | 3 to N - 2 | 3/any ^(see note 4)^
 |=
 
 WARNING: It's not recommended to build btrfs with RAID0/1/10/5/6 prfiles on
@@ -261,13 +261,17 @@ improved.
 another one is added. Since version 4.5.1, *mkfs.btrfs* will let you create DUP
 on multiple devices.
 
-'Note 2:' It's not recommended to use 2 devices with RAID5. In that case,
+'Note 2:' BTRFS implementations of RAID1 and RAID10 can only sustain
+a *single* device failure before the filesystem is irreparably damaged,
+no matter how many actual devices are in the array.  See 'KNOWN ISSUES'
+below for more on this.
+
+'Note 3:' It's not recommended to use 2 devices with RAID5. In that case,
 parity stripe will contain the same data as the data stripe, making RAID5
-degraded to RAID1 with more overhead.
+equivalent to RAID1 with more overhead.
 
-'Note 3:' It's also not recommended to use 3 devices with RAID6, unless you
+'Note 4:' It's also not recommended to use 3 devices with RAID6, unless you
 want to get effectively 3 copies in a RAID1-like manner (but not exactly that).
-N-copies RAID1 is not implemented.
 
 DUP PROFILES ON A SINGLE DEVICE
 ---
@@ -345,6 +349,30 @@ The ENOSPC occurs during the creation of the UUID tree. This is caused
 by large metadata blocks and space reservation strategy that allocates more
 than can fit into the filesystem.
 
+***MULTI-DEVICE BTRFS FILESYSTEMS AND PROFILE NAMING***
+
+BTRFS supports multiple devices being used in one filesystem.
+The terminology used for the different chunk profiles is somewhat
+misleading because it just copies the closest term from traditional
+storage management technologies.  In particular, RAID1 and RAID10 do
+not function the same as an LVM or MD RAID1 or RAID10 volume.
+
+In BTRFS, RAID1 currently means exactly 2 copies are stored on separate
+devices in the array.  Support for higher levels of replication is
+planned, but currently has no known ETA for inclusion.  This means
+that a BTRFS RAID1 filesystem actually functions more like an MD RAID10
+volume (2 copies of a block, rotating which disks are used), and is only
+guaranteed to survive a single device failure.
+
+The situation with RAID10 is a bit different.  It actually does function
+like most typical RAID10 implementations (2 copies striped across an
+arbitrary number of disks).  The big difference here is that in a
+traditional RAID10 configuration, the mapping of mirrors to devices is
+static (i.e., part 1 of copy 1 of a block is always on device 1, part 2
+of copy 1 on device 2, etc.), while on BTRFS this mapping is pseudo-random.
+The net result of this is that while it is theoretically possible for
+a BTRFS RAID10 filesystem to survive multiple disk failures in certain
+combinations, in practice it never happens.
 
 AVAILABILITY
 
-- 
2.11.0
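
The point the new documentation makes can be illustrated with a toy model. It
is a sketch under stated assumptions, not a description of the BTRFS
allocator: each chunk is reduced to the pair of devices holding its two
copies, the BTRFS-like layout assumes that over time every device pair ends up
holding some chunk, and the traditional layout assumes fixed mirror pairs. All
names are hypothetical.

```python
from itertools import combinations


def survives(chunks, failed):
    """True if every chunk still has at least one copy on a working device.

    chunks: iterable of (dev_a, dev_b) pairs holding the two copies.
    failed: iterable of failed device numbers.
    """
    failed = set(failed)
    return all(not (a in failed and b in failed) for a, b in chunks)


devices = range(4)
# Assumed BTRFS-like layout: every device pair holds at least one chunk.
btrfs_like = list(combinations(devices, 2))
# Assumed traditional layout: static mirror pairs (0,1) and (2,3).
static_pairs = [(0, 1), (2, 3)]

btrfs_two_failures_ok = [f for f in combinations(devices, 2)
                         if survives(btrfs_like, f)]
static_two_failures_ok = [f for f in combinations(devices, 2)
                          if survives(static_pairs, f)]
```

Under these assumptions no two-device failure is survivable in the BTRFS-like
layout, while the static layout survives any pair that does not take out both
halves of one mirror.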



[PATCH] btrfs-progs: add dev stats returncode option

2016-12-01 Thread Austin S. Hemmelgarn
Currently, `btrfs device stats` returns non-zero only when there was an
error getting the counter values.  This is fine for when it gets run by a
user directly, but is a serious pain when trying to use it in a script or
for monitoring since you need to parse the (not at all machine friendly)
output to check the counter values.

This patch adds an option ('-s') which causes `btrfs device stats`
to set bit 7 in the return code if any of the counters are non-zero.
This greatly simplifies checking from a script or monitoring software
if any errors have been recorded.  In the event that this switch is
passed and an error occurs reading the stats, the return code will have
bit 0 set (so if there are errors reading counters, and the counters
which were read were non-zero, the return value will be 129).

Signed-off-by: Austin S. Hemmelgarn 
---
Tested on multiple filesystems with various values of error counters
(all manually set with a hex-editor)

Both the flag letter and the bit being set were picked rather arbitrarily
(-s intended to be short for status, bit 7 just seemed reasonable).
I have no issue changing either, but would prefer to avoid bikeshedding
about stuff like this since this helps out with an area where BTRFS is
severely lacking right now (monitoring).

 Documentation/btrfs-device.asciidoc |  8 +++-
 cmds-device.c   | 39 ++---
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/Documentation/btrfs-device.asciidoc b/Documentation/btrfs-device.asciidoc
index 239c99b..97a2ed6 100644
--- a/Documentation/btrfs-device.asciidoc
+++ b/Documentation/btrfs-device.asciidoc
@@ -98,7 +98,7 @@ remain as such. Reloading the kernel module will drop this information. There's
 an alternative way of mounting multiple-device filesystem without the need for
 prior scanning. See the mount option 'device'.
 
-*stats* [-z] <path>|<device>::
+*stats* [-zs] <path>|<device>::
 Read and print the device IO error statistics for all devices of the given
 filesystem identified by <path> or for a single <device>. See section *DEVICE
 STATS* for more information.
@@ -108,6 +108,9 @@ STATS* for more information.
 -z
 Print the stats and reset the values to zero afterwards.
 
+-s
+Set the high bit of the return-code if any error statistics are non-zero.
+
 *usage* [options]  [...]::
 Show detailed information about internal allocations in devices.
 +
@@ -231,6 +234,9 @@ EXIT STATUS
 *btrfs device* returns a zero exit status if it succeeds. Non zero is
 returned in case of failure.
 
+If the '-s' option is used, *btrfs device stats* will add 128 to the
+exit status if any of the error counters is non-zero.
+
 AVAILABILITY
 
 *btrfs* is part of btrfs-progs.
diff --git a/cmds-device.c b/cmds-device.c
index fa0830f..3fa3018 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -376,6 +376,7 @@ static const char * const cmd_device_stats_usage[] = {
"Show current device IO stats.",
"",
"-z show current stats and reset values to zero",
+   "-s return non-zero if any stat counter is not zero",
NULL
 };
 
@@ -389,14 +390,18 @@ static int cmd_device_stats(int argc, char **argv)
int i;
int c;
int err = 0;
+   int status = 0;
__u64 flags = 0;
DIR *dirstream = NULL;
 
-   while ((c = getopt(argc, argv, "z")) != -1) {
+   while ((c = getopt(argc, argv, "zs")) != -1) {
switch (c) {
case 'z':
flags = BTRFS_DEV_STATS_RESET;
break;
+   case 's':
+   status = 1;
+   break;
case '?':
default:
usage(cmd_device_stats_usage);
@@ -440,7 +445,7 @@ static int cmd_device_stats(int argc, char **argv)
if (ioctl(fdmnt, BTRFS_IOC_GET_DEV_STATS, &args) < 0) {
error("DEV_STATS ioctl failed on %s: %s",
  path, strerror(errno));
-   err = 1;
+   err |= 1;
} else {
char *canonical_path;
 
@@ -457,31 +462,51 @@ static int cmd_device_stats(int argc, char **argv)
 "devid:%llu", args.devid);
}
 
-   if (args.nr_items >= BTRFS_DEV_STAT_WRITE_ERRS + 1)
+   if (args.nr_items >= BTRFS_DEV_STAT_WRITE_ERRS + 1) {
printf("[%s].write_io_errs   %llu\n",
   canonical_path,
   (unsigned long long) args.values[
BTRFS_DEV_STAT_WRITE_ERRS]);
-   if (args.nr_items >= BTRFS_DEV_STAT_READ_ERRS + 1)
+   if ((status == 1) && (args.values[BTRFS_DEV_STAT_WRITE_ERRS] > 0)) {
+

Re: Metadata balance fails ENOSPC

2016-12-01 Thread Stefan Priebe - Profihost AG

Am 01.12.2016 um 16:48 schrieb Chris Murphy:
> On Thu, Dec 1, 2016 at 7:10 AM, Stefan Priebe - Profihost AG
>  wrote:
>>
>> Am 01.12.2016 um 14:51 schrieb Hans van Kranenburg:
>>> On 12/01/2016 09:12 AM, Andrei Borzenkov wrote:
>>>> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>>>> wrote:
>>>> ...
>
> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
> which does the same.
>
>
>>> # btrfs filesystem show /ssddisk/
>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>> Total devices 1 FS bytes used 305.67GiB
>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>
>>> # btrfs filesystem usage /ssddisk/
>>> Overall:
>>> Device size: 500.00GiB
>>> Device allocated:500.00GiB
>>> Device unallocated:1.05MiB
>>
>> Drive is actually fully allocated so if Btrfs needs to create a new
>> chunk right now, it can't. However,
>
> Yes but there's lot of free space:
> Free (estimated):193.46GiB  (min: 193.46GiB)
>
> How does this match?
>
>
>> All three chunk types have quite a bit of unused space in them, so
>> it's unclear why there's a no space left error.
>>

>>>> I remember discussion that balance always tries to pre-allocate one
>>>> chunk in advance, and I believe there was patch to correct it but I am
>>>> not sure whether it was merged.
>>>
>>> http://www.spinics.net/lists/linux-btrfs/msg56772.html
>>
>> Thanks - still don't understand why that one is not upstream or why it
>> was reverted. Looks absolutely reasonable to me.
> 
> It is upstream and hasn't been reverted.
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/volumes.c?id=refs/tags/v4.8.11
> line 3650
> 
> I would try Duncan's idea of using just one filter and seeing what happens:
> 
> 'btrfs balance start -dusage=1 '

see below:

[zabbix-db ~]# btrfs balance start -dusage=1 /ssddisk/
Done, had to relocate 0 out of 505 chunks
[zabbix-db ~]# btrfs balance start -dusage=10 /ssddisk/
Done, had to relocate 0 out of 505 chunks
[zabbix-db ~]# btrfs balance start -musage=1 /ssddisk/
ERROR: error during balancing '/ssddisk/': No space left on device
There may be more info in syslog - try dmesg | tail
[zabbix-db ~]# dmesg
[78306.288834] BTRFS warning (device vdb1): no space to allocate a new
chunk for block group 839941881856
[78306.289197] BTRFS info (device vdb1): 1 enospc errors during balance
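
As a rough way of seeing why "lots of free space" and ENOSPC can coexist here,
the following sketch separates the two kinds of space. The sizes are the ones
from the `btrfs filesystem` output quoted in this thread; the chunk size is an
assumption for illustration only, and the function names are made up.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Figures taken from the output quoted in this thread.
DEVICE_ALLOCATED = 500 * GIB
FS_BYTES_USED = int(305.67 * GIB)
UNALLOCATED = int(1.05 * MIB)

# Assumption for illustration only: size of one new metadata chunk.
ASSUMED_CHUNK_SIZE = 256 * MIB


def free_inside_chunks(allocated, used):
    """Space that is free but already assigned to existing chunks."""
    return allocated - used


def can_allocate_new_chunk(unallocated, chunk_size):
    """A new chunk can only come out of unallocated device space."""
    return unallocated >= chunk_size


inside = free_inside_chunks(DEVICE_ALLOCATED, FS_BYTES_USED)
new_chunk_ok = can_allocate_new_chunk(UNALLOCATED, ASSUMED_CHUNK_SIZE)
```

The free space is all inside existing chunks, so anything that insists on
allocating a fresh chunk first (as balance appears to do here) fails even
though the filesystem is far from full.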

> 
> 
> With enospc debug it says:
> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
> chunk for block group 839941881856
> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance
> 
> It might be nice if this stated what kind of chunk it's trying to allocate.
> 
> 
> 


Re: Convert from RAID 5 to 10

2016-12-01 Thread Chris Murphy
On Wed, Nov 30, 2016 at 1:29 PM, Tomasz Kusmierz  wrote:

> Please, I beg you, add another column to man and wiki stating clearly
> how many devices every profile can withstand losing. I frequently
> have to explain how btrfs profiles work and show quotes from this
> mailing list because "Dunning-Kruger effect victims" keep popping up
> with statements like "in btrfs raid10 with 8 drives you can lose 4
> drives" ... I seriously beg you guys, my beating stick is half broken
> by now.

You need a new stick. It's called the ad hominem attack. When stupid
people say stupid things, the dispute is not about the facts or
opinions in the argument itself, but rather the person involved. There
is the possibility this is more than stupidity, it really borders on
maliciousness. Any ethical code of conduct for a list will accept ad
hominem attacks over the willful dissemination of provably wrong
information. When stupid assholes throw users under the bus with
provably wrong (and bad) advice, it becomes something of an obligation
to resort to name calling.

Of course, I'd also like the wiki to clearly state that the only profile
that tolerates more than one device loss is raid6, and to be very
explicit about the manifestly wrong terminology Btrfs uses for
raid10. That is a fairly egregious violation of common
terminology and of the trust we're supposed to be developing, both in
the usage of common terms and in Btrfs specifically.



-- 
Chris Murphy


[PATCH] btrfs-progs: Initialize ret to suppress compiler warning

2016-12-01 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 cmds-check.c| 2 +-
 qgroup-verify.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 85eaa63..a9501f5 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8041,7 +8041,7 @@ static int record_extent(struct btrfs_trans_handle *trans,
 struct extent_backref *back,
 int allocated, u64 flags)
 {
-   int ret;
+   int ret = 0;
struct btrfs_root *extent_root = info->extent_root;
struct extent_buffer *leaf;
struct btrfs_key ins_key;
diff --git a/qgroup-verify.c b/qgroup-verify.c
index 39762bf..ff46bc4 100644
--- a/qgroup-verify.c
+++ b/qgroup-verify.c
@@ -1575,7 +1575,7 @@ out:
 
 int repair_qgroups(struct btrfs_fs_info *info, int *repaired)
 {
-   int ret;
+   int ret = 0;
struct qgroup_count *count, *tmpcount;
 
*repaired = 0;
-- 
2.10.2



[PATCH] btrfs-progs: Fix extents after finding all errors

2016-12-01 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Simplify the extent fixing logic.

Calling fixup_extent_ref() after encountering every error causes
more error messages after the extent is fixed. In case of multiple errors,
this is confusing because the error message is displayed after the fix
message and it works on stale data. It is best to show all errors and
then fix the extents.

Set a variable and call fixup_extent_ref() if it is set. err is not used,
so remove it.

Changes since v1:
 + assign fix from ret for a correct repair_abort code path

Signed-off-by: Goldwyn Rodrigues 
---
 cmds-check.c | 72 +++-
 1 file changed, 23 insertions(+), 49 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 30eabb2..85eaa63 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8998,6 +8998,9 @@ out:
ret = err;
}
 
+   if (!ret)
+   fprintf(stderr, "Repaired extent references for %llu\n", (unsigned long long)rec->start);
+
btrfs_release_path(&path);
return ret;
 }
@@ -9055,7 +9058,11 @@ static int fixup_extent_flags(struct btrfs_fs_info *fs_info,
btrfs_set_extent_flags(path.nodes[0], ei, flags);
btrfs_mark_buffer_dirty(path.nodes[0]);
btrfs_release_path(&path);
-   return btrfs_commit_transaction(trans, root);
+   ret = btrfs_commit_transaction(trans, root);
+   if (!ret)
+   fprintf(stderr, "Repaired extent flags for %llu\n", (unsigned long long)rec->start);
+
+   return ret;
 }
 
 /* right now we only prune from the extent allocation tree */
@@ -9182,11 +9189,8 @@ static int check_extent_refs(struct btrfs_root *root,
 {
struct extent_record *rec;
struct cache_extent *cache;
-   int err = 0;
int ret = 0;
-   int fixed = 0;
int had_dups = 0;
-   int recorded = 0;
 
if (repair) {
/*
@@ -9255,9 +9259,8 @@ static int check_extent_refs(struct btrfs_root *root,
 
while(1) {
int cur_err = 0;
+   int fix = 0;
 
-   fixed = 0;
-   recorded = 0;
cache = search_cache_extent(extent_cache, 0);
if (!cache)
break;
@@ -9265,7 +9268,6 @@ static int check_extent_refs(struct btrfs_root *root,
if (rec->num_duplicates) {
fprintf(stderr, "extent item %llu has multiple extent "
"items\n", (unsigned long long)rec->start);
-   err = 1;
cur_err = 1;
}
 
@@ -9279,54 +9281,31 @@ static int check_extent_refs(struct btrfs_root *root,
ret = record_orphan_data_extents(root->fs_info, rec);
if (ret < 0)
goto repair_abort;
-   if (ret == 0) {
-   recorded = 1;
-   } else {
-   /*
-* we can't use the extent to repair file
-* extent, let the fallback method handle it.
-*/
-   if (!fixed && repair) {
-   ret = fixup_extent_refs(
-   root->fs_info,
-   extent_cache, rec);
-   if (ret)
-   goto repair_abort;
-   fixed = 1;
-   }
-   }
-   err = 1;
+   fix = ret;
cur_err = 1;
}
if (all_backpointers_checked(rec, 1)) {
fprintf(stderr, "backpointer mismatch on [%llu %llu]\n",
(unsigned long long)rec->start,
(unsigned long long)rec->nr);
-
-   if (!fixed && !recorded && repair) {
-   ret = fixup_extent_refs(root->fs_info,
-   extent_cache, rec);
-   if (ret)
-   goto repair_abort;
-   fixed = 1;
-   }
+   fix = 1;
cur_err = 1;
-   err = 1;
}
if (!rec->owner_ref_checked) {
fprintf(stderr, "owner ref check failed [%llu %llu]\n",
(unsigned long long)rec->start,
(unsigned long long)rec->nr);
-   if (!fixed && !recorded && repair) {
-   ret = 

[PATCH] btrfs-progs: find_free_dev_extent() closer to kernel code

2016-12-01 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This solves an ENOSPC issue with nearly full filesystems.

The only thing missing from the function is contains_pending_extent(),
which should not be required in this case.

Signed-off-by: Goldwyn Rodrigues 
---
 volumes.c | 191 +++---
 1 file changed, 119 insertions(+), 72 deletions(-)

diff --git a/volumes.c b/volumes.c
index 5d770e5..a0a85ed 100644
--- a/volumes.c
+++ b/volumes.c
@@ -276,53 +276,79 @@ int btrfs_scan_one_device(int fd, const char *path,
 }
 
 /*
+ * find_free_dev_extent_start - find free space in the specified device
+ * @device:  the device which we search the free space in
+ * @num_bytes:   the size of the free space that we need
+ * @search_start: the position from which to begin the search
+ * @start:   store the start of the free space.
+ * @len: the size of the free space. that we find, or the size
+ *   of the max free space if we don't find suitable free space
+ *
  * this uses a pretty simple search, the expectation is that it is
  * called very infrequently and that a given device has a small number
  * of extents
+ *
+ * @start is used to store the start of the free space if we find. But if we
+ * don't find suitable free space, it will be used to store the start position
+ * of the max free space.
+ *
+ * @len is used to store the size of the free space that we find.
+ * But if we don't find suitable free space, it is used to store the size of
+ * the max free space.
  */
-static int find_free_dev_extent(struct btrfs_trans_handle *trans,
-   struct btrfs_device *device,
-   struct btrfs_path *path,
-   u64 num_bytes, u64 *start)
+static int find_free_dev_extent_start(struct btrfs_trans_handle *trans,
+  struct btrfs_device *device, u64 num_bytes,
+  u64 search_start, u64 *start, u64 *len)
 {
struct btrfs_key key;
struct btrfs_root *root = device->dev_root;
-   struct btrfs_dev_extent *dev_extent = NULL;
-   u64 hole_size = 0;
-   u64 last_byte = 0;
-   u64 search_start = root->fs_info->alloc_start;
+   struct btrfs_dev_extent *dev_extent;
+   struct btrfs_path *path;
+   u64 hole_size;
+   u64 max_hole_start;
+   u64 max_hole_size;
+   u64 extent_end;
u64 search_end = device->total_bytes;
int ret;
-   int slot = 0;
-   int start_found;
+   int slot;
struct extent_buffer *l;
+   u64 min_search_start;
 
-   start_found = 0;
-   path->reada = 2;
+   /*
+* We don't want to overwrite the superblock on the drive nor any area
+* used by the boot loader (grub for example), so we make sure to start
+* at an offset of at least 1MB.
+*/
+   min_search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
+   search_start = max(search_start, min_search_start);
 
-   /* FIXME use last free of some kind */
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
 
-   /* we don't want to overwrite the superblock on the drive,
-* so we make sure to start at an offset of at least 1MB
-*/
-   search_start = max(BTRFS_BLOCK_RESERVED_1M_FOR_SUPER, search_start);
+   max_hole_start = search_start;
+   max_hole_size = 0;
 
if (search_start >= search_end) {
ret = -ENOSPC;
-   goto error;
+   goto out;
}
 
+   path->reada = 2;
+
key.objectid = device->devid;
key.offset = search_start;
key.type = BTRFS_DEV_EXTENT_KEY;
-   ret = btrfs_search_slot(trans, root, &key, path, 0, 0);
-   if (ret < 0)
-   goto error;
-   ret = btrfs_previous_item(root, path, 0, key.type);
+
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
if (ret < 0)
-   goto error;
-   l = path->nodes[0];
-   btrfs_item_key_to_cpu(l, &key, path->slots[0]);
+   goto out;
+   if (ret > 0) {
+   ret = btrfs_previous_item(root, path, key.objectid, key.type);
+   if (ret < 0)
+   goto out;
+   }
+
while (1) {
l = path->nodes[0];
slot = path->slots[0];
@@ -331,24 +357,9 @@ static int find_free_dev_extent(struct btrfs_trans_handle *trans,
if (ret == 0)
continue;
if (ret < 0)
-   goto error;
-no_more_items:
-   if (!start_found) {
-   if (search_start >= search_end) {
-   ret = -ENOSPC;
-   goto error;
-   }
-   *start = search_start;

Re: btrfs flooding the I/O subsystem and hanging the machine, with bcache cache turned off

2016-12-01 Thread Michal Hocko
On Wed 30-11-16 10:16:53, Marc MERLIN wrote:
> +folks from linux-mm thread for your suggestion
> 
> On Wed, Nov 30, 2016 at 01:00:45PM -0500, Austin S. Hemmelgarn wrote:
> > > swraid5 < bcache < dmcrypt < btrfs
> > > 
> > > Copying with btrfs send/receive causes massive hangs on the system.
> > > Please see this explanation from Linus on why the workaround was
> > > suggested:
> > > https://lkml.org/lkml/2016/11/29/667
> > > And Linus' assessment is absolutely correct (at least, the general
> > assessment is, I have no idea about btrfs_start_shared_extent, but I'm more
> > than willing to bet he's correct that that's the culprit).
> 
> > > All of this mostly went away with Linus' suggestion:
> > > echo 2 > /proc/sys/vm/dirty_ratio
> > > echo 1 > /proc/sys/vm/dirty_background_ratio
> > > 
> > > > But that's hiding the symptom which I think is that btrfs is piling up
> > > > too many I/O requests during btrfs send/receive and btrfs scrub (probably
> > > > balance too) and not looking at resulting impact to system health.
> 
> > I see pretty much identical behavior using any number of other storage
> > configurations on a USB 2.0 flash drive connected to a system with 16GB of
> > RAM with the default dirty ratios because it's trying to cache up to 3.2GB
> > of data for writeback.  While BTRFS is doing highly sub-optimal things here,
> > the ancient default writeback ratios are just as much a culprit.  I would
> > suggest that get changed to 200MB or 20% of RAM, whichever is smaller, which
> > would give overall almost identical behavior to x86-32, which in turn works
> > reasonably well for most cases.  I sadly don't have the time, patience, or
> > expertise to write up such a patch myself though.
> 
> Dear linux-mm folks, is that something you could consider (changing the
> dirty_ratio defaults) given that it affects at least bcache and btrfs
> (with or without bcache)?

As much as the dirty_*ratio defaults are a major PITA this is not something
that would be _easy_ to change without high risks of regressions. This
topic has been discussed many times with many good ideas, nothing really
materialized from them though :/

To be honest I really do hate dirty_*ratio and have seen many issues on
very large machines and always suggested to use dirty_bytes instead but
a particular value has always been a challenge to get right. It has
always been very workload specific.
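
As a rough illustration of the numbers in this thread (a sketch only, not kernel code): the helper names below are made up, and the model treats the ratio as a plain percentage of total RAM, whereas the kernel applies it to available memory, so real thresholds will be somewhat lower.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_from_ratio(ram_bytes, ratio_percent):
    """Approximate dirty-data threshold implied by a percentage setting."""
    return ram_bytes * ratio_percent // 100


def capped_dirty_limit(ram_bytes, ratio_percent=20, cap_bytes=200 * MIB):
    """The 'fixed cap or percentage, whichever is smaller' rule from the quoted text."""
    return min(dirty_limit_from_ratio(ram_bytes, ratio_percent), cap_bytes)


if __name__ == "__main__":
    for ram_gib in (1, 16, 24):
        ram = ram_gib * GIB
        by_ratio = dirty_limit_from_ratio(ram, 20) / GIB
        capped = capped_dirty_limit(ram) / MIB
        print(f"{ram_gib:>2} GiB RAM: ratio=20 -> {by_ratio:.2f} GiB, capped -> {capped:.0f} MiB")
```

Under these assumptions a 16GiB machine at the default ratio of 20 may accumulate about 3.2GiB of dirty data, while the "200MB or 20%, whichever is smaller" rule quoted above would hold it to 200MiB. A fixed dirty_bytes value behaves like the cap alone, which is why picking it is workload specific.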

That being said this is something more for IO people than MM IMHO.

-- 
Michal Hocko
SUSE Labs


Re: Metadata balance fails ENOSPC

2016-12-01 Thread Chris Murphy
On Thu, Dec 1, 2016 at 7:10 AM, Stefan Priebe - Profihost AG
 wrote:
>
> On 01.12.2016 at 14:51, Hans van Kranenburg wrote:
>> On 12/01/2016 09:12 AM, Andrei Borzenkov wrote:
>>> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>>>  wrote:
>>> ...

>>>> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
>>>> which does the same.
>>>>
>>>>
>>>>>> # btrfs filesystem show /ssddisk/
>>>>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>>>> Total devices 1 FS bytes used 305.67GiB
>>>>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>>>>
>>>>>> # btrfs filesystem usage /ssddisk/
>>>>>> Overall:
>>>>>> Device size: 500.00GiB
>>>>>> Device allocated:500.00GiB
>>>>>> Device unallocated:1.05MiB
>>>>>
>>>>> Drive is actually fully allocated so if Btrfs needs to create a new
>>>>> chunk right now, it can't. However,
>>>>
>>>> Yes but there's lot of free space:
>>>> Free (estimated):193.46GiB  (min: 193.46GiB)
>>>>
>>>> How does this match?
>>>>
>>>>
>>>>> All three chunk types have quite a bit of unused space in them, so
>>>>> it's unclear why there's a no space left error.
>>>>>
>>>
>>> I remember discussion that balance always tries to pre-allocate one
>>> chunk in advance, and I believe there was patch to correct it but I am
>>> not sure whether it was merged.
>>
>> http://www.spinics.net/lists/linux-btrfs/msg56772.html
>
> Thanks - still don't understand why that one is not upstream or why it
> was reverted. Looks absolutely reasonable to me.

It is upstream and hasn't been reverted.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/volumes.c?id=refs/tags/v4.8.11
line 3650

I would try Duncan's idea of using just one filter and seeing what happens:

'btrfs balance start -dusage=1 '


>>>> With enospc debug it says:
>>>> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
>>>> chunk for block group 839941881856
>>>> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance

It might be nice if this stated what kind of chunk it's trying to allocate.
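
To make the allocated-versus-used distinction in this thread concrete, here is a small sketch using the figures from the quoted `btrfs filesystem show` and `btrfs filesystem usage` output. The function names are hypothetical and the 1GiB chunk size is an assumption (a common size for data chunks, not something stated in the thread).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def can_allocate_chunk(unallocated_bytes, chunk_size=1 * GIB):
    """A new chunk needs that much space that is not yet assigned to any chunk."""
    return unallocated_bytes >= chunk_size


def unused_inside_chunks(allocated_bytes, used_bytes):
    """Space already carved into chunks but not holding data or metadata."""
    return allocated_bytes - used_bytes


if __name__ == "__main__":
    allocated = 500 * GIB                # "Device allocated"
    used = int(305.67 * GIB)             # "FS bytes used"
    unallocated = int(1.05 * MIB)        # "Device unallocated"

    print("new chunk possible:", can_allocate_chunk(unallocated))
    print("unused inside chunks: %.2f GiB" % (unused_inside_chunks(allocated, used) / GIB))
```

Both statements in the thread can hold at once: roughly 194GiB sits unused inside existing chunks, yet with only about 1MiB unallocated no further chunk can be created.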



-- 
Chris Murphy


Re: btrfs_destroy_inode warn (outstanding extents)

2016-12-01 Thread Dave Jones
On Wed, Nov 23, 2016 at 02:58:45PM -0500, Dave Jones wrote:
 > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote:
 > 
 >  > [  317.689216] BUG: Bad page state in process kworker/u8:8  pfn:4d8fd4
 >  > trace from just before this happened. Does this shed any light ?
 >  > 
 >  > https://codemonkey.org.uk/junk/trace.txt
 > 
 > crap, I just noticed the timestamps in the trace come from quite a bit
 > later.  I'll tweak the code to do the taint checking/ftrace stop after
 > every syscall, that should narrow the window some more.
 > 
 > Getting closer..

Ok, this is getting more like it.
http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents.txt

Also same bug, different run, but a different traceview 
http://codemonkey.org.uk/junk/btrfs-destroy-inode-outstanding-extents-function-graph.txt

(function-graph screws up the RIP for some reason, 'return_to_handler'
 should actually be btrfs_destroy_inode)


Anyways, I've got some code that works pretty well for dumping the
ftrace buffer now when things go awry.  I just need to run it enough
times that I hit that bad page state instead of this, or a lockdep bug first.

Dave



Re: [PATCH V2] btrfs-progs: Use helper function to access btrfs_super_block->sys_chunk_array_size

2016-12-01 Thread David Sterba
On Thu, Dec 01, 2016 at 07:21:14PM +0530, Chandan Rajendra wrote:
> btrfs_super_block->sys_chunk_array_size is stored as le32 data on
> disk. However insert_temp_chunk_item() writes sys_chunk_array_size in
> host cpu order. This commit fixes this by using super block access
> helper functions to read and write
> btrfs_super_block->sys_chunk_array_size field.
> 
> Signed-off-by: Chandan Rajendra 

Applied, thanks.


LSF/MM 2017: Call for Proposals

2016-12-01 Thread Jeff Layton
The annual Linux Storage, Filesystem and Memory Management (LSF/MM)
Summit for 2017 will be held on March 20th and 21st at the Hyatt
Cambridge, Cambridge, MA. LSF/MM is an invitation-only technical
workshop to map out improvements to the Linux storage, filesystem and
memory management subsystems that will make their way into the mainline
kernel within the coming years.


http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit

Like last year, LSF/MM will be colocated with the Linux Foundation Vault
conference which takes place on March 22nd and 23rd in the same Venue.
For those that do not know, Vault is designed to be an event where open
source storage and filesystem practitioners meet storage implementors
and, as such, it would be of benefit for LSF/MM attendees to attend.

Unlike past years, Vault admission is not free for LSF/MM attendees this
year unless they're giving a talk. There is a discount for LSF/MM
attendees, however we would also like to encourage folks to submit talk
proposals to speak at the Vault conference.

http://events.linuxfoundation.org/events/vault

On behalf of the committee I am issuing a call for agenda proposals that
are suitable for cross-track discussion as well as technical subjects
for the breakout sessions.

If advance notice is required for visa applications then please point
that out in your proposal or request to attend, and submit the topic
as soon as possible.

1) Proposals for agenda topics should be sent before January 15th, 2017
to:

lsf...@lists.linux-foundation.org

and cc the Linux list or lists that are relevant for the topic in
question:

ATA:   linux-...@vger.kernel.org
Block: linux-bl...@vger.kernel.org
FS:    linux-fsde...@vger.kernel.org
MM:    linux...@kvack.org
SCSI:  linux-s...@vger.kernel.org
NVMe:  linux-n...@lists.infradead.org

Please tag your proposal with [LSF/MM TOPIC] to make it easier to track.
In addition, please make sure to start a new thread for each topic
rather than following up to an existing one.  Agenda topics and
attendees will be selected by the program committee, but the final
agenda will be formed by consensus of the attendees on the day.

2) Requests to attend the summit for those that are not proposing a
topic should be sent to:

lsf...@lists.linux-foundation.org

Please summarise what expertise you will bring to the meeting, and what
you would like to discuss. Please also tag your email with [LSF/MM
ATTEND] and send it as a new thread so there is less chance of it
getting lost.

We will try to cap attendance at around 25-30 per track to facilitate
discussions although the final numbers will depend on the room sizes at
the venue.

Brief presentations are allowed to guide discussion, but are strongly
discouraged. There will be no recording or audio bridge. However, we
expect that written minutes will be published as we did in previous
years:

2016: https://lwn.net/Articles/lsfmm2016/

2015: https://lwn.net/Articles/lsfmm2015/

2014: http://lwn.net/Articles/LSFMM2014/

2013: http://lwn.net/Articles/548089/

3) If you have feedback on last year's meeting that we can use to
improve this year's, please also send that to:

lsf...@lists.linux-foundation.org

Thank you on behalf of the program committee:

Storage:
James Bottomley
Martin K. Petersen (track chair)
Sagi Grimberg

Filesystems:
Anna Schumaker
Chris Mason
Eric Sandeen
Jan Kara
Jeff Layton (summit chair)
Josef Bacik (track chair)
Trond Myklebust

MM:
Johannes Weiner
Rik van Riel (track chair)
-- 
Jeff Layton 


Re: Metadata balance fails ENOSPC

2016-12-01 Thread Stefan Priebe - Profihost AG

On 01.12.2016 at 14:51, Hans van Kranenburg wrote:
> On 12/01/2016 09:12 AM, Andrei Borzenkov wrote:
>> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>>  wrote:
>> ...
>>>
>>> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
>>> which does the same.
>>>
>>>
>>>>> # btrfs filesystem show /ssddisk/
>>>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>>> Total devices 1 FS bytes used 305.67GiB
>>>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>>>
>>>>> # btrfs filesystem usage /ssddisk/
>>>>> Overall:
>>>>> Device size: 500.00GiB
>>>>> Device allocated:500.00GiB
>>>>> Device unallocated:1.05MiB
>>>>
>>>> Drive is actually fully allocated so if Btrfs needs to create a new
>>>> chunk right now, it can't. However,
>>>
>>> Yes but there's lot of free space:
>>> Free (estimated):193.46GiB  (min: 193.46GiB)
>>>
>>> How does this match?
>>>
>>>
>>>> All three chunk types have quite a bit of unused space in them, so
>>>> it's unclear why there's a no space left error.

>>
>> I remember discussion that balance always tries to pre-allocate one
>> chunk in advance, and I believe there was patch to correct it but I am
>> not sure whether it was merged.
> 
> http://www.spinics.net/lists/linux-btrfs/msg56772.html

Thanks - still don't understand why that one is not upstream or why it
was reverted. Looks absolutely reasonable to me. Other option would be
to make it possible to make allocated unused space unallocated again - no
idea how to do that.

> 
>>>> Try remounting with enospc_debug, and then trigger the problem again,
>>>> and post the resulting kernel messages.
>>>
>>> With enospc debug it says:
>>> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
>>> chunk for block group 839941881856
>>> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance
> 


Re: Metadata balance fails ENOSPC

2016-12-01 Thread Hans van Kranenburg
On 12/01/2016 09:12 AM, Andrei Borzenkov wrote:
> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>  wrote:
> ...
>>
>> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
>> which does the same.
>>
>>
>>>> # btrfs filesystem show /ssddisk/
>>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>> Total devices 1 FS bytes used 305.67GiB
>>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>>
>>>> # btrfs filesystem usage /ssddisk/
>>>> Overall:
>>>> Device size: 500.00GiB
>>>> Device allocated:500.00GiB
>>>> Device unallocated:1.05MiB
>>>
>>> Drive is actually fully allocated so if Btrfs needs to create a new
>>> chunk right now, it can't. However,
>>
>> Yes but there's lot of free space:
>> Free (estimated):193.46GiB  (min: 193.46GiB)
>>
>> How does this match?
>>
>>
>>> All three chunk types have quite a bit of unused space in them, so
>>> it's unclear why there's a no space left error.
>>>
> 
> I remember discussion that balance always tries to pre-allocate one
> chunk in advance, and I believe there was patch to correct it but I am
> not sure whether it was merged.

http://www.spinics.net/lists/linux-btrfs/msg56772.html

>>> Try remounting with enospc_debug, and then trigger the problem again,
>>> and post the resulting kernel messages.
>>
>> With enospc debug it says:
>> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
>> chunk for block group 839941881856
>> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance

-- 
Hans van Kranenburg


[PATCH V2] btrfs-progs: Use helper function to access btrfs_super_block->sys_chunk_array_size

2016-12-01 Thread Chandan Rajendra
btrfs_super_block->sys_chunk_array_size is stored as le32 data on
disk. However insert_temp_chunk_item() writes sys_chunk_array_size in
host cpu order. This commit fixes this by using super block access
helper functions to read and write
btrfs_super_block->sys_chunk_array_size field.

Signed-off-by: Chandan Rajendra 
---
 utils.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/utils.c b/utils.c
index d0189ad..74dde1e 100644
--- a/utils.c
+++ b/utils.c
@@ -562,14 +562,18 @@ static int insert_temp_chunk_item(int fd, struct extent_buffer *buf,
 */
if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
char *cur;
+   u32 array_size;
 
-   cur = (char *)sb->sys_chunk_array + sb->sys_chunk_array_size;
+   cur = (char *)sb->sys_chunk_array
+   + btrfs_super_sys_array_size(sb);
memcpy(cur, &disk_key, sizeof(disk_key));
cur += sizeof(disk_key);
read_extent_buffer(buf, cur, (unsigned long int)chunk,
   btrfs_chunk_item_size(1));
-   sb->sys_chunk_array_size += btrfs_chunk_item_size(1) +
+   array_size = btrfs_super_sys_array_size(sb);
+   array_size += btrfs_chunk_item_size(1) +
sizeof(disk_key);
+   btrfs_set_super_sys_array_size(sb, array_size);
 
ret = write_temp_super(fd, sb, cfg->super_bytenr);
}
-- 
2.5.5
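
For readers less familiar with on-disk byte order, the following sketch models why a raw `+=` on an le32 field is only safe on little-endian hosts. It is an illustration in Python, not the btrfs-progs code: `get_le32`, `set_le32` and `grow_array_size` are hypothetical stand-ins for the accessor pattern the patch switches to, and the 4-byte buffer stands in for the field inside the super block.

```python
import struct


def get_le32(buf, offset):
    """Decode a little-endian u32 regardless of host byte order."""
    return struct.unpack_from("<I", buf, offset)[0]


def set_le32(buf, offset, value):
    """Encode a u32 as little-endian regardless of host byte order."""
    struct.pack_into("<I", buf, offset, value)


def grow_array_size(buf, offset, extra):
    """Read-modify-write through the accessors, as the patch does for the size field."""
    new_size = get_le32(buf, offset) + extra
    set_le32(buf, offset, new_size)
    return new_size


# Field starts at 97 bytes and grows by another 97 bytes.
sb = bytearray(4)
set_le32(sb, 0, 97)
grow_array_size(sb, 0, 97)

# What a big-endian host would store when writing in CPU order,
# and what any reader decoding the field as le32 would then see.
wrong = bytearray(struct.pack(">I", 97))
misread = get_le32(wrong, 0)

if __name__ == "__main__":
    print(get_le32(sb, 0), hex(misread))
```

On a little-endian host both approaches produce the same bytes, which is presumably why the problem goes unnoticed there.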



Re: Metadata balance fails ENOSPC

2016-12-01 Thread E V
I've frequently seen free space cache corruption lead to phantom
ENOSPC. You could try clearing the space cache, and/or mounting with
nospace_cache.

On Thu, Dec 1, 2016 at 6:55 AM, Stefan Priebe - Profihost AG
 wrote:
>
> On 01.12.2016 at 09:12, Andrei Borzenkov wrote:
>> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>>  wrote:
>> ...
>>>
>>> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
>>> which does the same.
>>>
>>>
>>>>> # btrfs filesystem show /ssddisk/
>>>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>>> Total devices 1 FS bytes used 305.67GiB
>>>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>>>
>>>>> # btrfs filesystem usage /ssddisk/
>>>>> Overall:
>>>>> Device size: 500.00GiB
>>>>> Device allocated:500.00GiB
>>>>> Device unallocated:1.05MiB
>>>>
>>>> Drive is actually fully allocated so if Btrfs needs to create a new
>>>> chunk right now, it can't. However,
>>>
>>> Yes but there's lot of free space:
>>> Free (estimated):193.46GiB  (min: 193.46GiB)
>>>
>>> How does this match?
>>>
>>>
>>>> All three chunk types have quite a bit of unused space in them, so
>>>> it's unclear why there's a no space left error.

>>
>> I remember discussion that balance always tries to pre-allocate one
>> chunk in advance, and I believe there was patch to correct it but I am
>> not sure whether it was merged.
>
> Is there otherwise a possibility to make the free space unallocated again?
>
> Stefan
>
>>
>>>> Try remounting with enospc_debug, and then trigger the problem again,
>>>> and post the resulting kernel messages.
>>>
>>> With enospc debug it says:
>>> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
>>> chunk for block group 839941881856
>>> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance
>>>


Re: 4.8.8, bcache deadlock and hard lockup

2016-12-01 Thread Austin S. Hemmelgarn

On 2016-11-30 19:48, Chris Murphy wrote:

On Wed, Nov 30, 2016 at 4:57 PM, Eric Wheeler  wrote:

On Wed, 30 Nov 2016, Marc MERLIN wrote:

+btrfs mailing list, see below why

On Tue, Nov 29, 2016 at 12:59:44PM -0800, Eric Wheeler wrote:

On Mon, 27 Nov 2016, Coly Li wrote:


Yes, too many work queues... I guess the locking might be caused by some
very obscure reference of closure code. I cannot have any clue if I
cannot find a stable procedure to reproduce this issue.

Hmm, if there is a tool to clone all the meta data of the back end cache
and whole cached device, there might be a method to replay the oops much
easier.

Eric, do you have any hint ?


Note that the backing device doesn't have any metadata, just a superblock.
You can easily dd that off onto some other volume without transferring the
data. By default, data starts at 8k, or whatever you used in `make-bcache
-w`.


Ok, Linus helped me find a workaround for this problem:
https://lkml.org/lkml/2016/11/29/667
namely:
   echo 2 > /proc/sys/vm/dirty_ratio
   echo 1 > /proc/sys/vm/dirty_background_ratio
(it's a 24GB system, so the defaults of 20 and 10 were creating too many
requests in the buffers)

Note that this is only a workaround, not a fix.

When I did this and retried my big copy again, I still got 100+ kernel
work queues, but apparently the underlying swraid5 was able to unblock
and satisfy the write requests before too many accumulated and crashed
the kernel.

I'm not a kernel coder, but seems to me that bcache needs a way to
throttle incoming requests if there are too many so that it does not end
up in a state where things blow up due to too many piled up requests.

You should be able to reproduce this by taking 5 spinning rust drives,
put raid5 on top, dmcrypt, bcache and hopefully any filesystem (although
I used btrfs) and send lots of requests.
Actually to be honest, the problems have mostly been happening when I do
btrfs scrub and btrfs send/receive which both generate I/O from within
the kernel instead of user space.
So here, btrfs may be a contributor to the problem too, but while btrfs
still trashes my system if I remove the caching device on bcache (and
with the default dirty ratio values), it doesn't crash the kernel.

I'll start another separate thread with the btrfs folks on how much
pressure is put on the system, but on your side it would be good to help
ensure that bcache doesn't crash the system altogether if too many
requests are allowed to pile up.



Try BFQ.  It is AWESOME and helps reduce the congestion problem with bulk
writes at the request queue on its way to the spinning disk or SSD:
http://algo.ing.unimo.it/people/paolo/disk_sched/

use the latest BFQ git here, merge it into v4.8.y:
https://github.com/linusw/linux-bfq/commits/bfq-v8

This doesn't completely fix the dirty_ratio problem, but it is far better
than CFQ or deadline in my opinion (and experience).


There are several threads over the past year with users having
problems no one else had previously reported, and they were using BFQ.
But there's no evidence whether BFQ was the cause, or exposing some
existing bug that another scheduler doesn't. Anyway, I'd say using an
out of tree scheduler means higher burden of testing and skepticism.

Normally I'd agree on this, but BFQ is a bit of a different situation 
from usual because:
1. 90% of the reason that BFQ isn't in mainline is that the block 
maintainers have declared the legacy (non blk-mq) code deprecated and 
refuse to take anything new there despite having absolutely zero 
scheduling in blk-mq.
2. It's been around for years with hundreds of thousands of users over 
the years who have had no issues with it.



Re: Metadata balance fails ENOSPC

2016-12-01 Thread Stefan Priebe - Profihost AG

On 01.12.2016 at 09:12, Andrei Borzenkov wrote:
> On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
>  wrote:
> ...
>>
>> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
>> which does the same.
>>
>>
>>>> # btrfs filesystem show /ssddisk/
>>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>> Total devices 1 FS bytes used 305.67GiB
>>>> devid1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>>
>>>> # btrfs filesystem usage /ssddisk/
>>>> Overall:
>>>> Device size: 500.00GiB
>>>> Device allocated:500.00GiB
>>>> Device unallocated:1.05MiB
>>>
>>> Drive is actually fully allocated so if Btrfs needs to create a new
>>> chunk right now, it can't. However,
>>
>> Yes but there's lot of free space:
>> Free (estimated):193.46GiB  (min: 193.46GiB)
>>
>> How does this match?
>>
>>
>>> All three chunk types have quite a bit of unused space in them, so
>>> it's unclear why there's a no space left error.
>>>
> 
> I remember discussion that balance always tries to pre-allocate one
> chunk in advance, and I believe there was patch to correct it but I am
> not sure whether it was merged.

Is there otherwise a possibility to make the free space unallocated again?

Stefan

> 
>>> Try remounting with enospc_debug, and then trigger the problem again,
>>> and post the resulting kernel messages.
>>
>> With enospc debug it says:
>> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
>> chunk for block group 839941881856
>> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance
>>


Re: Convert from RAID 5 to 10

2016-12-01 Thread Niccolò Belli

On Thursday, 1 December 2016 10:37:13 CET, Wilson Meier wrote:

> The only thing i have asked for is to document the *known*
> problems/flaws/limitations of all raid profiles and link to them from
> the stability matrix.


+1

Does anyone mind if I ask for an account and start copy-pasting any
relevant posts from this thread?


Niccolò Belli


Re: [PATCH] Btrfs: fix infinite loop when tree log recovery

2016-12-01 Thread Filipe Manana
On Thu, Dec 1, 2016 at 1:42 AM, robbieko  wrote:
> Hi Filipe,
>
> Thank you for your review.
> I have seen your modified change log with below
> Btrfs: fix tree search logic when replaying directory entry deletes
> Btrfs: fix deadlock caused by fsync when logging directory entries
> Btrfs: fix enospc in hole punching
> So what's the next step ?
> modify patch change log and then send again ?

You don't need to do anything else for those patches.
Thanks.

>
> Thanks.
> robbieko
>
> Filipe Manana wrote on 2016-12-01 00:53:
>
>> On Fri, Oct 7, 2016 at 10:30 AM, robbieko  wrote:
>>>
>>> From: Robbie Ko 
>>>
>>> if log tree like below:
>>> leaf N:
>>> ...
>>> item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
>>> dir log end 1275809046
>>> leaf N+1:
>>> item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
>>> dir log end 18446744073709551615
>>> ...
>>>
>>> when start_ret > 1275809046, but slot[0] never >= nritems,
>>> so never go to next leaf.
>>
>>
>> This doesn't explain how the infinite loop happens. Nor exactly how
>> any problem happens.
>>
>> It's important to have detailed information in the change logs. I
>> understand that english isn't your native tongue (it's not mine
>> either, and I'm far from mastering it), but that's not an excuse to
>> not express all the important information in detail (we can all live
>> with grammar errors and typos, and we all do such errors frequently).
>>
>> I've added this patch to my branch at
>>
>> https://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=for-chris-4.10
>> but with a modified changelog and subject.
>>
>> The results of the wrong logic that decides when to move to the next
>> leaf are unpredictable, and it won't always result in an infinite
>> loop. We are accessing a slot that doesn't point to an item, to a
>> memory location containing garbage or something unexpected, and in the
>> worst case that location is beyond the last page of the extent buffer.
>>
>> Thanks.
>>
>>
>>>
>>> Signed-off-by: Robbie Ko 
>>> ---
>>>  fs/btrfs/tree-log.c | 3 +--
>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>>> index ef9c55b..e63dd99 100644
>>> --- a/fs/btrfs/tree-log.c
>>> +++ b/fs/btrfs/tree-log.c
>>> @@ -1940,12 +1940,11 @@ static noinline int find_dir_range(struct btrfs_root *root,
>>>  next:
>>> /* check the next slot in the tree to see if it is a valid item */
>>> nritems = btrfs_header_nritems(path->nodes[0]);
>>> +   path->slots[0]++;
>>> if (path->slots[0] >= nritems) {
>>> ret = btrfs_next_leaf(root, path);
>>> if (ret)
>>> goto out;
>>> -   } else {
>>> -   path->slots[0]++;
>>> }
>>>
>>> btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> --
>>> 1.9.1
>>>
>
>
>
>
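
The ordering problem addressed by the quoted patch can be shown with a toy model. This is a sketch in Python rather than the kernel code: `advance_old` and `advance_fixed` are hypothetical names, a leaf is reduced to its item count, and moving to the next leaf is modeled as landing on slot 0.

```python
def advance_old(slot, nritems):
    """Pre-patch order: test the bound first, increment only in the else branch."""
    if slot >= nritems:
        return ("next_leaf", 0)
    return ("same_leaf", slot + 1)


def advance_fixed(slot, nritems):
    """Patched order: increment first, then test the bound."""
    slot += 1
    if slot >= nritems:
        return ("next_leaf", 0)
    return ("same_leaf", slot)


if __name__ == "__main__":
    nritems = 241       # leaf N in the example holds items 0..240
    last = nritems - 1  # we are currently on the last valid item

    print("old:  ", advance_old(last, nritems))    # stays in the leaf, slot 241
    print("fixed:", advance_fixed(last, nritems))  # moves on to the next leaf
```

In the old order, standing on the last item yields slot 241 in a leaf with 241 items, so the key that is read afterwards comes from a slot that holds no item. That matches the description above of unpredictable results rather than a guaranteed infinite loop.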



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: Convert from RAID 5 to 10

2016-12-01 Thread Wilson Meier
On 30/11/16 at 17:48, Austin S. Hemmelgarn wrote:
> On 2016-11-30 10:49, Wilson Meier wrote:
>> On 30/11/16 at 15:37, Austin S. Hemmelgarn wrote:
>>
>> Transferring this to car analogy, just to make it a bit more funny:
>> The airbag (raid level whatever) itself is ok but the micro controller
>> (general btrfs) which has the responsibility to inflate the airbag
>> suffers from some problems, sometimes doesn't inflate and the manufacturer
>> doesn't mention that fact.
>> From your point of view the airbag is ok. From my point of view -> Don't
>> buy that car!!!
>> Don't you mean that the fact that the life saver suffers problems should
>> be noted and every dependent component should point to that fact?
>> I think it should.
>> I'm not talking about performance issues, i'm talking about data loss.
>> Now the next one can throw in "Backups, always make backups!".
>> Sure, but backup is backup and raid is raid. Both have their own
>> concerns.
> A better analogy for a car would be something along the lines of the
> radio working fine but the general wiring having issues that cause all
> the electronics in the car to stop working under certain
> circumstances. In that case, the radio itself is absolutely OK, but it
> suffers from issues caused directly by poor design elsewhere in the
> vehicle.
Ahm, no. You cannot exchange a security mechanism (raid) with a comfort
one (compression) and treat them as the same in terms of importance.
It makes a serious difference to have a not properly working airbag or
not being able to listen to music while you are driving against a wall.
Anyway, we should stop this here.
>>>> I'm not angry or something like that :) .
>>>> I just would like to have the possibility to read such information about
>>>> the storage i put my personal data (> 3 TB) on its official wiki.
> There are more places than the wiki to look for info about BTRFS (and
> this is the case about almost any piece of software, not just BTRFS,
> very few things have one central source for everything).  I don't mean
> to sound unsympathetic, but given what you're saying, it's sounding
> more and more like you didn't look at anything beyond the wiki and
> should have checked other sources as well.
This is your assumption.


On 01/12/16 at 07:47, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 30 Nov 2016 11:48:57 -0500 as
> excerpted:
>> On 2016-11-30 10:49, Wilson Meier wrote:
>>> Do you also have all home users in mind, which go to vacation (sometime
>>> > 3 weeks) and don't have a 24/7 support team to replace monitored disks
>>> which do report SMART errors?
>> Better than 90% of people I know either shut down their systems when
>> they're going to be away for a long period of time, or like me have
>> ways to log in remotely and tell the FS to not use that disk anymore.
> https://btrfs.wiki.kernel.org/index.php/Getting_started ... ... has
> two warnings offset in red right in the first section: * If you have
> btrfs filesystems, run the latest kernel.
I do. Ok not the very latest but i'm always on the latest major version.
Right now i have 4.8.4 and the very latest is 4.8.11.
> * You should keep and test backups of your data, and be prepared to use 
> them.
I have daily backups.
> As to the three weeks vacation thing... And "daily use" != "three
> weeks without physical access to something you're going to actually be
> relying on for parts of those three weeks".
>
Maybe I have my own mailserver and owncloud to serve files to my
family? Maybe I'm out of the country, somewhere I have no internet access?
I will not comment on this any further as it leads us nowhere.


In general I think that this discussion is heading in a completely wrong
direction.
The only thing i have asked for is to document the *known*
problems/flaws/limitations of all raid profiles and link to them from
the stability matrix.

Regarding raid10:
Even if one knows about the fact that btrfs handles things on chunk
level one would assume that the code is written in a way to put the
copies on different stripes.
Otherwise raid10 ***can*** become pretty useless in terms of data
redundancy and 2 x raid1 with an lvm should be considered as a replacement.
This is a serious thing and should be documented. If this is documented
somewhere then please point me to it as i cannot find a word about that
anywhere.
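
To illustrate the concern, here is a toy model in Python.  It is NOT
btrfs's actual allocator; it only assumes four devices (numbered 0-3
here for illustration) and compares a raid10 whose mirror pairs are fixed
for the whole array with one where each chunk may pair the devices
differently.  It then counts which two-device failures leave every chunk
with at least one surviving copy.

```python
from itertools import combinations

# ASSUMPTION: a toy four-device array; device numbers are illustrative only.
DEVICES = (0, 1, 2, 3)

# The three possible ways to split four devices into two mirror pairs.
PAIRINGS = [
    ((0, 1), (2, 3)),
    ((0, 2), (1, 3)),
    ((0, 3), (1, 2)),
]

def survives(failed, chunk_pairings):
    """True if no chunk has lost both devices of any of its mirror pairs."""
    failed = set(failed)
    return all(
        not set(pair) <= failed
        for pairing in chunk_pairings
        for pair in pairing
    )

def survivable_double_failures(chunk_pairings):
    """List every two-device failure that the given set of chunks survives."""
    return [f for f in combinations(DEVICES, 2) if survives(f, chunk_pairings)]

# Every chunk uses the same pairing (classic raid10 layout).
fixed = survivable_double_failures([PAIRINGS[0]])
# Chunks use all three pairings (pairing decided per chunk).
mixed = survivable_double_failures(PAIRINGS)

print("fixed pairing:", fixed)
print("per-chunk pairing:", mixed)
```

In this model the fixed layout survives 4 of the 6 possible two-device
failures, while the per-chunk layout survives none of them once all three
pairings are in use.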

Cheers,
Wilson




Re: Metadata balance fails ENOSPC

2016-12-01 Thread Andrei Borzenkov
On Thu, Dec 1, 2016 at 10:49 AM, Stefan Priebe - Profihost AG
 wrote:
...
>
> Custom 4.4 kernel with patches up to 4.10. But i already tried 4.9-rc7
> which does the same.
>
>
>>> # btrfs filesystem show /ssddisk/
>>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>>     Total devices 1 FS bytes used 305.67GiB
>>>     devid    1 size 500.00GiB used 500.00GiB path /dev/vdb1
>>>
>>> # btrfs filesystem usage /ssddisk/
>>> Overall:
>>>     Device size:          500.00GiB
>>>     Device allocated:     500.00GiB
>>>     Device unallocated:     1.05MiB
>>
>> Drive is actually fully allocated so if Btrfs needs to create a new
>> chunk right now, it can't. However,
>
> Yes but there's lot of free space:
>     Free (estimated):     193.46GiB  (min: 193.46GiB)
>
> How does this match?
>
>
>> All three chunk types have quite a bit of unused space in them, so
>> it's unclear why there's a no space left error.
>>

I remember discussion that balance always tries to pre-allocate one
chunk in advance, and I believe there was patch to correct it but I am
not sure whether it was merged.

>> Try remounting with enospc_debug, and then trigger the problem again,
>> and post the resulting kernel messages.
>
> With enospc debug it says:
> [39193.425682] BTRFS warning (device vdb1): no space to allocate a new
> chunk for block group 839941881856
> [39193.426033] BTRFS info (device vdb1): 1 enospc errors during balance
>


Re: Metadata balance fails ENOSPC

2016-12-01 Thread Duncan
Chris Murphy posted on Wed, 30 Nov 2016 16:02:29 -0700 as excerpted:

> On Wed, Nov 30, 2016 at 2:03 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hello,
>>
>> # btrfs balance start -v -dusage=0 -musage=1 /ssddisk/
>> Dumping filters: flags 0x7, state 0x0, force is off
>>   DATA (flags 0x2): balancing, usage=0
>>   METADATA (flags 0x2): balancing, usage=1
>>   SYSTEM (flags 0x2): balancing, usage=1
>> ERROR: error during balancing '/ssddisk/': No space left on device
>> There may be more info in syslog - try dmesg | tail
> 
> You haven't provided kernel messages at the time of the error.
> 
> Also useful is the kernel version.

I won't disagree here as often it's kernel-version-specific behavior in 
question, but in this case I think the behavior is generic and the 
question can thus be answered on that basis, without the kernel version 
or dmesg output.

@ Chris: Note that the ENOSPC wasn't during ordinary use, but 
/specifically/ during balance, which behaves a bit differently regarding 
ENOSPC, and I believe it's that version-generic behavior difference 
that's in focus, here.

>> # btrfs filesystem show /ssddisk/
>> Label: none  uuid: a69d2e90-c2ca-4589-9876-234446868adc
>>     Total devices 1 FS bytes used 305.67GiB
>>     devid    1 size 500.00GiB used 500.00GiB path /dev/vdb1

Device line says 100% used (meaning allocated).  The below simply shows 
it a different way, confirming the 100% used.

>> # btrfs filesystem usage /ssddisk/
>> Overall:
>>     Device size:          500.00GiB
>>     Device allocated:     500.00GiB
>>     Device unallocated:     1.05MiB
> 
> Drive is actually fully allocated so if Btrfs needs to create a new
> chunk right now, it can't.

... And that right there is the problem.

When doing chunk consolidation, with one exception noted below, btrfs 
balance creates new chunks to write into, then rewrites the content from 
the old into the new.  But there's no space left (1 MiB isn't enough) 
unallocated to allocate new chunks from, so balance errors out with 
ENOSPC.
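
As a rough sketch of that check, the Python below compares the
unallocated figure from the quoted output against a new chunk's size.
The chunk sizes are an ASSUMPTION (nominal 1 GiB data and 256 MiB
metadata chunks; real sizes depend on filesystem size and profile), and
the function name is made up for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# "Device unallocated: 1.05MiB" from the quoted `btrfs filesystem usage` output.
unallocated = int(1.05 * MIB)

# ASSUMPTION: nominal chunk sizes, not read from the filesystem in question.
NOMINAL_CHUNK = {"data": 1 * GIB, "metadata": 256 * MIB}

def can_allocate_chunk(unallocated_bytes, chunk_type):
    """Return True if a new chunk of the given type fits in unallocated space."""
    return unallocated_bytes >= NOMINAL_CHUNK[chunk_type]

for kind in NOMINAL_CHUNK:
    print(kind, "chunk fits:", can_allocate_chunk(unallocated, kind))
```

Under those assumed sizes neither chunk type fits, which matches the
"no space to allocate a new chunk" warning reported elsewhere in the thread.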

>> Data,single: Size:483.97GiB, Used:298.18GiB
>>    /dev/vdb1     483.97GiB
>>
>> Metadata,single: Size:16.00GiB, Used:7.51GiB
>>    /dev/vdb1      16.00GiB
>>
>> System,single: Size:32.00MiB, Used:144.00KiB
>>    /dev/vdb1      32.00MiB
> 
> All three chunk types have quite a bit of unused space in them, so it's
> unclear why there's a no space left error.

Normal usage can still write into the existing chunks since they're not 
yet entirely full, but that's not where the error occurred.  There's no 
space left unallocated to allocate further chunks from, and that's what 
balance, with one single exception, must do first, allocate a new chunk 
in order to write into, so it errors out.
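
To put numbers on the difference, the small Python sketch below works
out the free space /inside/ the already-allocated data and metadata
chunks from the figures quoted above.  The variable and function names
are illustrative only, and the tiny system chunk is left out because it
is reported in different units.

```python
from decimal import Decimal

# (size, used) in GiB, copied from the quoted `btrfs filesystem usage` output.
chunks_gib = {
    "Data": (Decimal("483.97"), Decimal("298.18")),
    "Metadata": (Decimal("16.00"), Decimal("7.51")),
}

def slack(size, used):
    """Space allocated to a chunk type but not yet used by it."""
    return size - used

free_inside = {name: slack(size, used) for name, (size, used) in chunks_gib.items()}

for name, gib in free_inside.items():
    print(f"{name}: {gib} GiB free inside existing chunks")
```

That slack is what ordinary writes can still use.  None of it counts as
unallocated, so it does not help balance create a new chunk.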

The one single exception is when there's actually nothing to rewrite, the 
usage=0 case, in which case balance will simply erase any entirely empty 
chunks of the appropriate type (-d=data, -m=metadata).

This _used_ to be required somewhat regularly, as the kernel knew how to 
allocate new chunks but couldn't deallocate chunks, even entirely empty 
chunks, without a balance.  However, since 3.16 (IIRC), the kernel has 
been able to deallocate entirely empty chunks entirely on its own 
(automatically), and does so reasonably regularly in normal usage, so the 
issue of leftover empty chunks is far rarer than it used to be.

But apparently there's still a bug or two somewhere, as we still get 
reports of the usage=0 filter actually deallocating some empty chunks 
back to unallocated, even on kernels that should be doing that 
automatically.  It's not as common as it once was, but it does still 
happen.

So the usage=0 filter is still worth trying.  It's the only case where 
the kernel doesn't have to create a new chunk in order to clear space 
during a balance, because it's not actually rewriting anything, only 
deleting empty chunks.  In the 100% allocated case it's the simplest 
thing to try, and sometimes it _does_ work, even tho there's a good 
chance it won't, because the kernel is /supposed/ to be removing those 
chunks automatically now, and /usually/ does just that.
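
The distinction can be modelled in a few lines of Python.  This is a
simplified sketch, not kernel code: the Chunk class and both function
names are hypothetical, and it ASSUMES the filter matches chunks whose
used bytes are below the given percentage of their size, with usage=0
matching only completely empty chunks.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    size: int  # bytes allocated to the chunk
    used: int  # bytes actually in use inside it

def matches_usage_filter(chunk, usage_percent):
    """ASSUMED simplified model of the balance usage=N filter."""
    if usage_percent == 0:
        return chunk.used == 0
    return chunk.used * 100 < chunk.size * usage_percent

def needs_new_chunk(chunk):
    """Any chunk holding data must be rewritten into a newly allocated chunk."""
    return chunk.used > 0

GIB = 1024 ** 3
empty = Chunk(size=GIB, used=0)
nearly_empty = Chunk(size=GIB, used=GIB // 200)  # 0.5% full

for name, chunk in (("empty", empty), ("nearly empty", nearly_empty)):
    print(name,
          "usage=0:", matches_usage_filter(chunk, 0),
          "usage=1:", matches_usage_filter(chunk, 1),
          "needs new chunk:", needs_new_chunk(chunk))
```

Only the nearly empty chunk forces an allocation, which is exactly the
step that fails when nothing is left unallocated.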


OK, so what was wrong with the above command, and what should be tried 
instead?

The above command used TWO filters, -dusage=0 -musage=1 .  It choked on 
the -musage=1, apparently because it tried a less-than 1% full but not 
/entirely/ empty metadata chunk first, before trying data chunks with the 
-dusage=0, which should have succeeded, even if it found no empty data 
chunks to remove.

So the fix is to try either -dusage=0 -musage=0 together, first, or to 
try -dusage=0 by itself first (and possibly -musage=0 after that), before 
trying -musage=1.
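
For anyone scripting this, the order can be written down as a small
Python helper.  It only builds and prints the command lines and runs
nothing; the helper name is made up for illustration, and the mountpoint
is the one from the original report.

```python
import shlex

def balance_attempts(mountpoint):
    """Return balance commands in the suggested order (nothing is executed)."""
    return [
        # First pass: only drop entirely empty chunks, which needs no new chunk.
        ["btrfs", "balance", "start", "-dusage=0", "-musage=0", mountpoint],
        # Second pass: only worth trying once the first pass has freed some space.
        ["btrfs", "balance", "start", "-musage=1", mountpoint],
    ]

for cmd in balance_attempts("/ssddisk/"):
    print(shlex.join(cmd))
```

Check `btrfs filesystem usage` between the two passes; if the first one
freed nothing, the second is likely to hit the same ENOSPC.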

If it works and there are empty chunks of either type that can be 
removed, hopefully that will free up enough space to write at least one 
more metadata chunk, leaving room to create at least the one more (it'd 
be two with