Re: btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Hans van Kranenburg
On 02/09/2018 05:45 PM, Ellis H. Wilson III wrote:
> 
> I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> impacts foreground performance, specifically during snapshot deletion.
> My experience so far has been that it can be dramatically disruptive to
> foreground I/O.
> 
> Looking through the wiki at kernel.org I have not yet stumbled onto any
> analysis that would shed light on this specific problem.  I have found
> numerous complaints about btrfs-cleaner online, especially relating to
> quotas being enabled.  This has proven thus far less than helpful, as
> the response tends to be "use less snapshots," or "disable quotas," both
> of which strike me as intellectually unsatisfying answers

Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
didn't even realize I had those, and now the problem is gone."

>, especially
> the former in a filesystem where snapshots are supposed to be
> "first-class citizens."

Throwing complaints around is also not helpful.

> The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot
> performance analysis I would expect to see given the assertions in the
> latter that "BTRFS...supports efficient snapshots..."  The former is
> sufficiently pre-BTRFS that while it does performance analysis of btree
> clones, it's unclear (to me at least) if the results can be
> forward-propagated in some way to real-world performance expectations
> for BTRFS snapshot creation/deletion/modification.

I don't really think they can.

> Has this analysis been performed somewhere else and I'm just missing it?
>  Also, I'll be glad to comment on my specific setup, kernel version,
> etc, and discuss pragmatic work-arounds, but I'd like to better
> understand the high-level performance implications first.

The "performance implications" are highly dependent on your specific
setup, kernel version, etc, so it really makes sense to share:

* kernel version
* mount options (from cat /proc/mounts | grep btrfs)
* is it ssd? hdd? iscsi lun?
* how big is the FS
* how many subvolumes/snapshots? (how many snapshots per subvolume)

And what's essential to look at is what your computer is doing while you
are throwing a list of subvolumes into the cleaner.

* is it using 100% cpu?
* is it showing 100% disk read I/O utilization?
* is it showing 100% disk write I/O utilization? (is it writing lots and
lots of data to disk?)

Since you could be looking at any combination of answers on all those
things, there's not much specific to tell.
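
The three questions above map onto standard tools. A hedged sketch follows; the device name and the sysstat utilities (pidstat, iostat) are assumptions about the setup, not anything btrfs-specific:

```shell
# Observe the cleaner while snapshot deletion runs (illustrative only;
# replace /dev/sda with the actual backing device).
pgrep btrfs-cleaner                        # PID of the cleaner kthread
pidstat -p "$(pgrep btrfs-cleaner)" 1      # is it using 100% CPU?
iostat -x 1 /dev/sda                       # %util plus r/s vs w/s: read- or write-bound?
grep btrfs /proc/mounts                    # mount options worth reporting
uname -r                                   # kernel version worth reporting
```

Running these side by side while queueing subvolume deletions usually makes it obvious which of the three bottlenecks applies.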

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Peter Grandi
> I am trying to better understand how the cleaner kthread
> (btrfs-cleaner) impacts foreground performance, specifically
> during snapshot deletion.  My experience so far has been that
> it can be dramatically disruptive to foreground I/O.

That's such a warmly innocent and optimistic question! This post
answers it, and an even more general question as well:

  http://www.sabi.co.uk/blog/17-one.html?170610#170610

> the response tends to be "use less snapshots," or "disable
> quotas," both of which strike me as intellectually
> unsatisfying answers, especially the former in a filesystem
> where snapshots are supposed to be "first-class citizens."

They are "first class" but not "cost-free".
In particular every extent is linked in a forward map and a
reverse map, and deleting a snapshot involves materializing and
updating a join of the two, which seems to be done with a
classic nested-loop join strategy, resulting in N^2 running
time. I suspect that the quota code involves a similar "optimization".
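
The N^2 claim can be illustrated with a toy join. This models the complexity argument only; it is not the btrfs back-reference code:

```python
# Toy model: joining a forward map with a reverse map by nested loops costs
# O(N*M) comparisons; an indexed (hash) join costs O(N+M).
forward = list(range(2000))           # extents seen in the forward map
reverse = list(range(1000, 3000))     # extents seen in the reverse map

def nested_loop_join(a, b):
    steps, out = 0, []
    for x in a:
        for y in b:                   # rescans all of b for every element of a
            steps += 1
            if x == y:
                out.append(x)
                break
    return out, steps

def hash_join(a, b):
    bset = set(b)                     # one pass to build the index
    out = [x for x in a if x in bset] # one pass to probe it
    return out, len(a) + len(b)

m1, s1 = nested_loop_join(forward, reverse)
m2, s2 = hash_join(forward, reverse)
assert m1 == m2                       # same join result either way
print(s1, s2)                         # nested-loop work dwarfs hash-join work
```

At only a few thousand "extents" the nested-loop variant already does hundreds of times more comparisons, which is the shape of slowdown being described.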


Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-09 Thread Goffredo Baroncelli
On 02/08/2018 08:07 PM, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +   /* read lock please */
>>> +   do {
>>> +   seq = read_seqbegin(&fs_info->bc_lock);
>>> +   list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +   len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +   bc->chunk_offset);
>>> +   /* chunk offset is u64 */
>>> +   if (len >= PAGE_SIZE)
>>> +   break;
>>> +   }
>>> +   } while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list? If I read the code
>> correctly, only enough to fill a kernel page.
>>
>> If my math is correct (PAGE_SIZE=4k, a u64 can require up to 20 digits, so
>> 21 bytes per line) it is possible to list only a few hundred chunks (~200).
>> Not more; and the last one could even be listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk is at most 1GB; if you lose 500GB of disk, there could be more
>> than 200 chunks to list.
>>
>> My first suggestion is to limit the number of chunks shown to 200 (a page
>> should be big enough to contain all these chunk offsets). If there are more
>> chunks, end the list with a marker (something like '[...]\n'). This would
>> remove the ambiguity about whether the listed chunks are complete or not.
>> Either way, you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> whether it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.

What happens if you lose an entire disk? If so, you fill "bad_chunks".


> 
>> However, my second suggestion is to... change the interface completely.
>> What about adding a directory in sysfs, where each entry is a chunk?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/type      # data/metadata/sys
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/profile   # dup/linear
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/size      # size
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/devs      # chunk devs
>>
>> And so on.
>>
>> Checking "[...]/devs", it would be easy to understand whether the
>> chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible it is to iterate
>> over a sysfs directory which maps a list of kernel objects.
>>
>> I think that if this interface were good enough, we could get rid of a lot
>> of ioctl(TREE_SEARCH) calls in btrfs-progs.
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

Likely yes. However TREE_SEARCH is bad because it is hard-linked to the
filesystem structure. I.e. if for some reason BTRFS changes the on-disk
layout, what happens to an old application which expects the previous disk
format? And for the same reason, it is root-only (UID==0).

Let me give another idea: what about exporting this information via an
ioctl? It could be extended to query all information about (all) the chunks...
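
The PAGE_SIZE arithmetic discussed earlier in this thread can be sanity-checked in a few lines (PAGE_SIZE=4096 is an assumption about the architecture):

```python
PAGE_SIZE = 4096                    # typical x86 page size; an assumption here
digits = len(str(2**64 - 1))        # a u64 prints as at most this many decimal digits
record = digits + 1                 # "%llu\n" -> digits plus one newline byte
print(digits, PAGE_SIZE // record)  # 20 195: only ~200 offsets fit in one page
```

So the sysfs file in the patch can indeed hold only about 200 worst-case chunk offsets before truncation, which is what motivates the marker/redesign suggestions above.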

> 
> thanks,
> -liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Ellis H. Wilson III

Hi all,

I am trying to better understand how the cleaner kthread (btrfs-cleaner) 
impacts foreground performance, specifically during snapshot deletion. 
My experience so far has been that it can be dramatically disruptive to 
foreground I/O.


Looking through the wiki at kernel.org I have not yet stumbled onto any 
analysis that would shed light on this specific problem.  I have found 
numerous complaints about btrfs-cleaner online, especially relating to 
quotas being enabled.  This has proven thus far less than helpful, as 
the response tends to be "use less snapshots," or "disable quotas," both 
of which strike me as intellectually unsatisfying answers, especially 
the former in a filesystem where snapshots are supposed to be 
"first-class citizens."


The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot 
performance analysis I would expect to see given the assertions in the 
latter that "BTRFS...supports efficient snapshots..."  The former is 
sufficiently pre-BTRFS that while it does performance analysis of btree 
clones, it's unclear (to me at least) if the results can be 
forward-propagated in some way to real-world performance expectations 
for BTRFS snapshot creation/deletion/modification.


Has this analysis been performed somewhere else and I'm just missing it? 
 Also, I'll be glad to comment on my specific setup, kernel version, 
etc, and discuss pragmatic work-arounds, but I'd like to better 
understand the high-level performance implications first.


Thanks in advance to anyone who can comment on this.  I am very inclined 
to read anything thrown at me, so if there is documentation I failed to 
read, please just send the link.


Best,

ellis


Re: [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()

2018-02-09 Thread Qu Wenruo


On 2018年02月09日 16:09, Nikolay Borisov wrote:
> 
> 
> On  9.02.2018 09:44, Qu Wenruo wrote:
>> Strangely, we have a level check in btrfs_print_tree() while we don't have
>> the same check in read_node_slot().
>>
>> That's to say, for the following corruption, btrfs_search_slot() or
>> btrfs_next_leaf() can return an invalid leaf:
>>
>> Parent eb:
>>   node XX level 1
>>   ^^^
>>   Child should be leaf (level 0)
>>   ...
>>   key (XXX XXX XXX) block YY
>>
>> Child eb:
>>   leaf YY level 1
>>   ^^^
>>   Something went wrong now
>>
>> And later callers can easily be screwed up by the corrupted leaf that is
>> returned.
>>
>> Although the root cause (a power loss, but still something breaking the
>> metadata CoW of btrfs) is still unknown, at least enhance btrfs-progs to
>> avoid a SEGV.
>>
>> Reported-by: Ralph Gauges 
>> Signed-off-by: Qu Wenruo 
>> ---
>> changelog:
>> v2:
>>   Check if the extent buffer is up-to-date before checking its level to
>>   avoid possible NULL pointer access.
>> ---
>>  ctree.c | 16 +++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> That was sent separately so I'd assume it was in the wrong dir ;)

Yep, my bad habit of putting all patches in project dir.

Thanks,
Qu

> 
>>
>> diff --git a/ctree.c b/ctree.c
>> index 4fc33b14000a..430805e3043f 100644
>> --- a/ctree.c
>> +++ b/ctree.c
>> @@ -22,6 +22,7 @@
>>  #include "repair.h"
>>  #include "internal.h"
>>  #include "sizes.h"
>> +#include "messages.h"
>>  
>>  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
>>*root, struct btrfs_path *path, int level);
>> @@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct 
>> btrfs_key *key,
>>  struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
>> struct extent_buffer *parent, int slot)
>>  {
>> +struct extent_buffer *ret;
>>  int level = btrfs_header_level(parent);
>> +
>>  if (slot < 0)
>>  return NULL;
>>  if (slot >= btrfs_header_nritems(parent))
>> @@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct 
>> btrfs_fs_info *fs_info,
>>  if (level == 0)
>>  return NULL;
>>  
>> -return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
>> +ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
>> btrfs_node_ptr_generation(parent, slot));
>> +if (!extent_buffer_uptodate(ret))
>> +return ERR_PTR(-EIO);
>> +
>> +if (btrfs_header_level(ret) != level - 1) {
>> +error("child eb corrupted: parent bytenr=%llu item=%d parent 
>> level=%d child level=%d",
>> +  btrfs_header_bytenr(parent), slot,
>> +  btrfs_header_level(parent), btrfs_header_level(ret));
>> +free_extent_buffer(ret);
>> +return ERR_PTR(-EIO);
>> +}
>> +return ret;
>>  }
>>  
>>  static int balance_level(struct btrfs_trans_handle *trans,
>>





Re: [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior

2018-02-09 Thread Qu Wenruo


On 2018年02月09日 16:17, Nikolay Borisov wrote:
> 
> 
> On  9.02.2018 09:44, Qu Wenruo wrote:
>> The kernel uses a delayed chunk allocation behavior for metadata chunks
>>
>> KERNEL:
>> btrfs_alloc_chunk()
>> |- __btrfs_alloc_chunk():Only allocate chunk mapping
>>|- btrfs_make_block_group():  Add corresponding bg to fs_info->new_bgs
>>
>> Then at transaction commit time, it finishes the remaining work:
>> btrfs_start_dirty_block_groups():
>> |- btrfs_create_pending_block_groups()
>>|- btrfs_insert_item():   To insert block group item
>>|- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
>>
>> Although btrfs-progs' btrfs_alloc_chunk() does all the work in one
>> function, it can still benefit from a similar refactor:
>> btrfs-progs:
>> btrfs_alloc_chunk(): Wrapper for both normal and convert chunks
>> |- __btrfs_alloc_chunk():Only alloc chunk mapping
>> |  |- btrfs_make_block_group(): <>
>> |- btrfs_finish_chunk_alloc():  Insert chunk items/dev extents
>>
>> With such a refactor, the following functions can now share most of their
>> code with the kernel:
>> __btrfs_alloc_chunk()
>> btrfs_finish_chunk_alloc()
>> btrfs_alloc_dev_extent()
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  volumes.c | 421 
>> ++
>>  1 file changed, 260 insertions(+), 161 deletions(-)
>>
>> diff --git a/volumes.c b/volumes.c
>> index cff54c612872..e89520326314 100644
>> --- a/volumes.c
>> +++ b/volumes.c
>> @@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device 
>> *device, u64 num_bytes,
>>  return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>>  }
>>  
>> -static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
>> -struct btrfs_fs_info *fs_info,
>> -struct map_lookup *map, u64 stripe_size)
>> +static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>> +  struct btrfs_device *device,
>> +  u64 chunk_offset, u64 physical,
>> +  u64 stripe_size)
>>  {
>> -int ret = 0;
>> -struct btrfs_path path;
>> +int ret;
>> +struct btrfs_path *path;
>> +struct btrfs_fs_info *fs_info = device->fs_info;
>>  struct btrfs_root *root = fs_info->dev_root;
>>  struct btrfs_dev_extent *extent;
>>  struct extent_buffer *leaf;
>>  struct btrfs_key key;
>> -int i;
>>  
>> -btrfs_init_path(&path);
>> -
>> -for (i = 0; i < map->num_stripes; i++) {
>> -struct btrfs_device *device = map->stripes[i].dev;
>> -
>> -key.objectid = device->devid;
>> -key.offset = map->stripes[i].physical;
>> -key.type = BTRFS_DEV_EXTENT_KEY;
>> +path = btrfs_alloc_path();
>> +if (!path)
>> +return -ENOMEM;
>>  
>> -ret = btrfs_insert_empty_item(trans, root, &path, &key,
>> -  sizeof(*extent));
>> -if (ret < 0)
>> -goto out;
>> -leaf = path.nodes[0];
>> -extent = btrfs_item_ptr(leaf, path.slots[0],
>> -struct btrfs_dev_extent);
>> -btrfs_set_dev_extent_chunk_tree(leaf, extent,
>> +key.objectid = device->devid;
>> +key.offset = physical;
>> +key.type = BTRFS_DEV_EXTENT_KEY;
>> +ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
>> +if (ret)
>> +goto out;
>> +leaf = path->nodes[0];
>> +extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
>> +btrfs_set_dev_extent_chunk_tree(leaf, extent,
>>  BTRFS_CHUNK_TREE_OBJECTID);
>> -btrfs_set_dev_extent_chunk_objectid(leaf, extent,
>> -BTRFS_FIRST_CHUNK_TREE_OBJECTID);
>> -btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
>> -
>> -write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
>> -(unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
>> -BTRFS_UUID_SIZE);
>> -
>> -btrfs_set_dev_extent_length(leaf, extent, stripe_size);
>> -btrfs_mark_buffer_dirty(leaf);
>> -btrfs_release_path(&path);
>> -
>> -device->bytes_used += stripe_size;
>> -ret = btrfs_update_device(trans, device);
>> -if (ret < 0)
>> -goto out;
>> -}
>> +btrfs_set_dev_extent_chunk_objectid(leaf, extent,
>> +BTRFS_FIRST_CHUNK_TREE_OBJECTID);
>> +btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
>> +btrfs_set_dev_extent_length(leaf, extent, stripe_size);
>> +btrfs_mark_buffer_dirty(leaf);
>>  out:
>> -btrfs_release_path(&path);
>> +btrfs_free_path(path);
>>  return ret;
>>  }
>>  
>> @@ -813,28 +798,28 @@ stati

[PATCH] btrfs: Move error handling of btrfs_start_dirty_block_groups closer to call site

2018-02-09 Thread Nikolay Borisov
Even though btrfs_start_dirty_block_groups runs fairly early in
btrfs_commit_transaction, outside of the critical section defined by the
transaction states, it can only be run by a single committer. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section, which is a bit
counter-intuitive. So move the error handling right after the function
is executed and let the sole runner of dirty block groups handle the
return value. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/transaction.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a57065f022ff..2cdf7be02f41 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2012,12 +2012,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
run_it = 1;
mutex_unlock(&fs_info->ro_block_group_mutex);
 
-   if (run_it)
+   if (run_it) {
ret = btrfs_start_dirty_block_groups(trans);
-   }
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
+   if (ret) {
+   btrfs_end_transaction(trans);
+   return ret;
+   }
+   }
}
 
spin_lock(&fs_info->trans_lock);
-- 
2.7.4



Re: [PATCH 0/4] Btrfs: just bunch of patches to ioctl.c

2018-02-09 Thread Timofey Titovets
Gentle ping

2018-01-09 13:53 GMT+03:00 Timofey Titovets :
> Gentle ping
>
> 2017-12-19 13:02 GMT+03:00 Timofey Titovets :
>> 1st patch: remove the 16MiB restriction from the extent_same ioctl()
>> by iterating over the passed range.
>>
>> I did not see much difference in performance, so this just removes a
>> logical restriction.
>>
>> Patches 2-3 update the defrag ioctl():
>>  - Fix the bad behaviour of fully rewriting all compressed
>>    extents in the defrag range. (This also makes autodefrag on a
>>    compressed fs much less expensive.)
>>  - Allow userspace to specify NONE as the target compression type,
>>    which lets users uncompress files by defragmenting with btrfs-progs.
>>  - Make the defrag ioctl understand the requested compression type and the
>>    current compression type of extents, to make btrfs fi def -rc
>>    an idempotent operation.
>>    I.e. it is now possible to say "make all extents compressed with lzo",
>>    and btrfs will not recompress data that is already lzo-compressed.
>>    Same for zlib, zstd, none.
>>    (The patch to btrfs-progs is in a PR on kdave's GitHub.)
>>
>> 4th patch: reduce the size of struct btrfs_inode.
>>  - btrfs_inode stores fields like prop_compress, defrag_compress and,
>>    after the 3rd patch, change_compress.
>>    They use unsigned as their type, taking 12 bytes in total.
>>    But change_compress is a bit flag, and prop_compress/defrag_compress
>>    only store a compression type, which currently uses values 0-3 out of
>>    2^32-1.
>>
>>    So, turn those vars into bitfields and reduce the size of btrfs_inode:
>>    1136 -> 1128.
>>
>> Timofey Titovets (4):
>>   Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
>>   Btrfs: make should_defrag_range() understood compressed extents
>>   Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation
>>   Btrfs: reduce size of struct btrfs_inode
>>
>>  fs/btrfs/btrfs_inode.h |   5 +-
>>  fs/btrfs/inode.c   |   4 +-
>>  fs/btrfs/ioctl.c   | 203 
>> +++--
>>  3 files changed, 133 insertions(+), 79 deletions(-)
>>
>> --
>> 2.15.1
>
>
>
> --
> Have a nice day,
> Timofey.



-- 
Have a nice day,
Timofey.


Re: [PATCH v2 10/10] btrfs-progs: Refactor btrfs_alloc_chunk to mimic kernel structure and behavior

2018-02-09 Thread Nikolay Borisov


On  9.02.2018 09:44, Qu Wenruo wrote:
> The kernel uses a delayed chunk allocation behavior for metadata chunks
> 
> KERNEL:
> btrfs_alloc_chunk()
> |- __btrfs_alloc_chunk(): Only allocate chunk mapping
>|- btrfs_make_block_group():   Add corresponding bg to fs_info->new_bgs
> 
> Then at transaction commit time, it finishes the remaining work:
> btrfs_start_dirty_block_groups():
> |- btrfs_create_pending_block_groups()
>|- btrfs_insert_item():To insert block group item
>|- btrfs_finish_chunk_alloc(): Insert chunk items/dev extents
> 
> Although btrfs-progs' btrfs_alloc_chunk() does all the work in one
> function, it can still benefit from a similar refactor:
> btrfs-progs:
> btrfs_alloc_chunk():  Wrapper for both normal and convert chunks
> |- __btrfs_alloc_chunk(): Only alloc chunk mapping
> |  |- btrfs_make_block_group(): <>
> |- btrfs_finish_chunk_alloc():  Insert chunk items/dev extents
> 
> With such a refactor, the following functions can now share most of their
> code with the kernel:
> __btrfs_alloc_chunk()
> btrfs_finish_chunk_alloc()
> btrfs_alloc_dev_extent()
> 
> Signed-off-by: Qu Wenruo 
> ---
>  volumes.c | 421 
> ++
>  1 file changed, 260 insertions(+), 161 deletions(-)
> 
> diff --git a/volumes.c b/volumes.c
> index cff54c612872..e89520326314 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -523,55 +523,40 @@ static int find_free_dev_extent(struct btrfs_device 
> *device, u64 num_bytes,
>   return find_free_dev_extent_start(device, num_bytes, 0, start, len);
>  }
>  
> -static int btrfs_insert_dev_extents(struct btrfs_trans_handle *trans,
> - struct btrfs_fs_info *fs_info,
> - struct map_lookup *map, u64 stripe_size)
> +static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
> +   struct btrfs_device *device,
> +   u64 chunk_offset, u64 physical,
> +   u64 stripe_size)
>  {
> - int ret = 0;
> - struct btrfs_path path;
> + int ret;
> + struct btrfs_path *path;
> + struct btrfs_fs_info *fs_info = device->fs_info;
>   struct btrfs_root *root = fs_info->dev_root;
>   struct btrfs_dev_extent *extent;
>   struct extent_buffer *leaf;
>   struct btrfs_key key;
> - int i;
>  
> - btrfs_init_path(&path);
> -
> - for (i = 0; i < map->num_stripes; i++) {
> - struct btrfs_device *device = map->stripes[i].dev;
> -
> - key.objectid = device->devid;
> - key.offset = map->stripes[i].physical;
> - key.type = BTRFS_DEV_EXTENT_KEY;
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
>  
> - ret = btrfs_insert_empty_item(trans, root, &path, &key,
> -   sizeof(*extent));
> - if (ret < 0)
> - goto out;
> - leaf = path.nodes[0];
> - extent = btrfs_item_ptr(leaf, path.slots[0],
> - struct btrfs_dev_extent);
> - btrfs_set_dev_extent_chunk_tree(leaf, extent,
> + key.objectid = device->devid;
> + key.offset = physical;
> + key.type = BTRFS_DEV_EXTENT_KEY;
> + ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*extent));
> + if (ret)
> + goto out;
> + leaf = path->nodes[0];
> + extent = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_extent);
> + btrfs_set_dev_extent_chunk_tree(leaf, extent,
>   BTRFS_CHUNK_TREE_OBJECTID);
> - btrfs_set_dev_extent_chunk_objectid(leaf, extent,
> - BTRFS_FIRST_CHUNK_TREE_OBJECTID);
> - btrfs_set_dev_extent_chunk_offset(leaf, extent, map->ce.start);
> -
> - write_extent_buffer(leaf, fs_info->chunk_tree_uuid,
> - (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
> - BTRFS_UUID_SIZE);
> -
> - btrfs_set_dev_extent_length(leaf, extent, stripe_size);
> - btrfs_mark_buffer_dirty(leaf);
> - btrfs_release_path(&path);
> -
> - device->bytes_used += stripe_size;
> - ret = btrfs_update_device(trans, device);
> - if (ret < 0)
> - goto out;
> - }
> + btrfs_set_dev_extent_chunk_objectid(leaf, extent,
> + BTRFS_FIRST_CHUNK_TREE_OBJECTID);
> + btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
> + btrfs_set_dev_extent_length(leaf, extent, stripe_size);
> + btrfs_mark_buffer_dirty(leaf);
>  out:
> - btrfs_release_path(&path);
> + btrfs_free_path(path);
>   return ret;
>  }
>  
> @@ -813,28 +798,28 @@ static int btrfs_cmp_device_info(const void *a, const 
> void *b)
>   / si

Re: [PATCH v2] btrfs-progs: ctree: Add extra level check for read_node_slot()

2018-02-09 Thread Nikolay Borisov


On  9.02.2018 09:44, Qu Wenruo wrote:
> Strangely, we have a level check in btrfs_print_tree() while we don't have
> the same check in read_node_slot().
> 
> That's to say, for the following corruption, btrfs_search_slot() or
> btrfs_next_leaf() can return an invalid leaf:
> 
> Parent eb:
>   node XX level 1
>   ^^^
>   Child should be leaf (level 0)
>   ...
>   key (XXX XXX XXX) block YY
> 
> Child eb:
>   leaf YY level 1
>   ^^^
>   Something went wrong now
> 
> And later callers can easily be screwed up by the corrupted leaf that is
> returned.
> 
> Although the root cause (a power loss, but still something breaking the
> metadata CoW of btrfs) is still unknown, at least enhance btrfs-progs to
> avoid a SEGV.
> 
> Reported-by: Ralph Gauges 
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Check if the extent buffer is up-to-date before checking its level to
>   avoid possible NULL pointer access.
> ---
>  ctree.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)

That was sent separately so I'd assume it was in the wrong dir ;)

> 
> diff --git a/ctree.c b/ctree.c
> index 4fc33b14000a..430805e3043f 100644
> --- a/ctree.c
> +++ b/ctree.c
> @@ -22,6 +22,7 @@
>  #include "repair.h"
>  #include "internal.h"
>  #include "sizes.h"
> +#include "messages.h"
>  
>  static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
> *root, struct btrfs_path *path, int level);
> @@ -640,7 +641,9 @@ static int bin_search(struct extent_buffer *eb, struct 
> btrfs_key *key,
>  struct extent_buffer *read_node_slot(struct btrfs_fs_info *fs_info,
>  struct extent_buffer *parent, int slot)
>  {
> + struct extent_buffer *ret;
>   int level = btrfs_header_level(parent);
> +
>   if (slot < 0)
>   return NULL;
>   if (slot >= btrfs_header_nritems(parent))
> @@ -649,8 +652,19 @@ struct extent_buffer *read_node_slot(struct 
> btrfs_fs_info *fs_info,
>   if (level == 0)
>   return NULL;
>  
> - return read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
> + ret = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
>  btrfs_node_ptr_generation(parent, slot));
> + if (!extent_buffer_uptodate(ret))
> + return ERR_PTR(-EIO);
> +
> + if (btrfs_header_level(ret) != level - 1) {
> + error("child eb corrupted: parent bytenr=%llu item=%d parent 
> level=%d child level=%d",
> +   btrfs_header_bytenr(parent), slot,
> +   btrfs_header_level(parent), btrfs_header_level(ret));
> + free_extent_buffer(ret);
> + return ERR_PTR(-EIO);
> + }
> + return ret;
>  }
>  
>  static int balance_level(struct btrfs_trans_handle *trans,
> 